Cloud-Synced Crawl History
Overview
42crawl implements a hybrid storage strategy to balance performance, privacy, and persistence. While full crawl data is stored locally in the user's browser, a summarized version is automatically synced to the cloud for authenticated users.
Why This Exists
- Data Portability: Allows logged-in users to see their crawl history (domain, health score, page count) across different devices.
- Privacy for Guests: Unauthenticated users can still benefit from history tracking via `localStorage` without creating an account.
- Performance: Storing large crawl datasets (thousands of pages) in a centralized database would be expensive and slow. Summarization provides the best of both worlds.
How It Works
Hybrid Storage Architecture
The system is orchestrated by the `useCrawlHistory` hook in `src/hooks/useCrawlHistory.ts`.
- LocalStorage (Primary):
  - Stores the full `CrawlHistoryEntry` object, including the complete array of `CrawledPage[]` and the full `CrawlStats`.
  - Key: `seo-crawl-history`.
  - Limit: Capped at the 5 most recent entries to prevent quota issues.
- Supabase Database (Sync):
  - Stores a summary in the `crawl_history` table.
  - Fields: `domain`, `url`, `health_score`, `pages_count`, `critical_issues`, and a JSON version of `stats`.
  - Relationship: Each entry is linked to a `user_id`.
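The local cap can be sketched as a small pure function. The field names and the `MAX_LOCAL_ENTRIES` constant below are illustrative assumptions, not the actual definitions in `useCrawlHistory.ts`:

```typescript
// Illustrative sketch only: the entry shape and constant name are
// assumptions, not the real types from src/hooks/useCrawlHistory.ts.
interface CrawlHistoryEntry {
  domain: string;
  healthScore: number;
  pagesCount: number;
  crawledAt: string;
}

const MAX_LOCAL_ENTRIES = 5;

// Prepend the newest crawl and keep only the 5 most recent entries,
// mirroring the quota safeguard described above (newest first).
function capHistory(
  history: CrawlHistoryEntry[],
  entry: CrawlHistoryEntry
): CrawlHistoryEntry[] {
  return [entry, ...history].slice(0, MAX_LOCAL_ENTRIES);
}
```

Keeping the cap in a pure function makes it trivial to apply before every `localStorage` write.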
Synchronization Logic
- When a crawl completes, `addEntry` is called.
- It immediately updates the local `history` state and `localStorage`.
- If the user is authenticated (`user` is present in `useAuth`), the `saveToDatabase` function is triggered to push the summary to Supabase.
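The ordering above (local write always, cloud push only when authenticated) can be sketched with injected dependencies. The types and the `saveToDatabase` signature here are assumptions for illustration, not the hook's real API:

```typescript
// Hedged sketch of the synchronization order described above; the real
// logic lives in src/hooks/useCrawlHistory.ts.
interface CrawlSummary {
  domain: string;
  health_score: number;
  pages_count: number;
}

interface SyncDeps {
  user: { id: string } | null;                          // from useAuth
  saveLocal: (summary: CrawlSummary) => void;           // state + localStorage
  saveToDatabase: (summary: CrawlSummary) => Promise<void>; // Supabase push
}

// Local storage is always updated first, so a failed or skipped cloud
// sync never loses data; the push happens only for authenticated users.
function addEntry(summary: CrawlSummary, deps: SyncDeps): void {
  deps.saveLocal(summary);
  if (deps.user) {
    // Fire-and-forget in this sketch; the real hook may await or handle errors.
    void deps.saveToDatabase(summary);
  }
}
```

Injecting `user` and the two writers keeps the ordering rule testable without a browser or a database.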
Configuration
Database Schema
The `crawl_history` table in the `public` schema contains:
- `id`: UUID (Primary Key)
- `user_id`: UUID (Foreign Key to `auth.users`)
- `domain`: TEXT
- `url`: TEXT
- `health_score`: INTEGER
- `pages_count`: INTEGER
- `critical_issues`: INTEGER
- `stats`: JSONB
- `crawled_at`: TIMESTAMPTZ
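A TypeScript shape mirroring these columns, together with the kind of summarization step that turns a full local entry into a row, might look like the following. The `FullCrawlEntry` shape and the `summarize` helper are illustrative assumptions; the real mapping lives inside the hook:

```typescript
// Row shape mirroring the crawl_history columns listed above.
interface CrawlHistoryRow {
  id?: string;                    // UUID, generated by the database
  user_id: string;                // FK to auth.users
  domain: string;
  url: string;
  health_score: number;
  pages_count: number;
  critical_issues: number;
  stats: Record<string, unknown>; // JSONB column
  crawled_at: string;             // TIMESTAMPTZ as an ISO-8601 string
}

// Hypothetical full local entry; field names are assumptions.
interface FullCrawlEntry {
  domain: string;
  url: string;
  healthScore: number;
  pages: { url: string; criticalIssues: number }[];
  stats: Record<string, unknown>;
  crawledAt: string;
}

// Drop the heavy per-page payload and keep only aggregate fields,
// which is the summarization trade-off this feature is built on.
function summarize(entry: FullCrawlEntry, userId: string): CrawlHistoryRow {
  return {
    user_id: userId,
    domain: entry.domain,
    url: entry.url,
    health_score: entry.healthScore,
    pages_count: entry.pages.length,
    critical_issues: entry.pages.reduce((n, p) => n + p.criticalIssues, 0),
    stats: entry.stats,
    crawled_at: entry.crawledAt,
  };
}
```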
Hook Interface
The `useCrawlHistory` hook provides:
- `history`: The current array of local history entries.
- `addEntry()`: Adds a new crawl and triggers the sync.
- `removeEntry()`: Deletes an entry locally.
- `getEntriesForDomain()`: Filters history for the active site.
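The per-domain filter can be sketched as a plain function over the history array. The exact matching rule (exact, case-insensitive comparison) is an assumption about the real `getEntriesForDomain`:

```typescript
// Illustrative sketch; the real filter lives inside useCrawlHistory.
interface HistoryEntry {
  domain: string;
  healthScore: number;
}

// Case-insensitive exact match on the domain (an assumed matching rule).
function getEntriesForDomain(
  history: HistoryEntry[],
  domain: string
): HistoryEntry[] {
  const needle = domain.toLowerCase();
  return history.filter((e) => e.domain.toLowerCase() === needle);
}
```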
User Flow
- Guest Crawl: A guest runs a crawl. The result is saved to `localStorage`, and they see it in the "Recent Crawls" panel.
- Login: The user logs in. Their `localStorage` history remains available.
- Authenticated Crawl: The user runs a new crawl. The summary is pushed to the `crawl_history` table.
- Device Switch: The user logs in on a different machine. (Note: currently the system only pulls summary data; full page-level details are unique to the device that performed the crawl.)
Edge Cases & Limitations
- Data Mismatch: Because the full data lives in `localStorage`, clearing browser data removes the ability to view detailed reports for past crawls, even if the summary still exists in Supabase.
- Max Entries: The 5-entry limit in `localStorage` is a safeguard against browser storage limits (usually 5-10 MB per origin).
- Guest-to-User Migration: Crawls performed as a guest are not retroactively pushed to the database upon login; only new crawls are synced.
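To illustrate why the entry cap matters, here is a hedged sketch of a quota-aware write: if the browser rejects a `setItem` call, drop the oldest entry and retry. The `StorageLike` interface and retry loop are assumptions for illustration; the real hook avoids the problem by capping at 5 entries before writing:

```typescript
// Illustrative quota-handling sketch; not the actual implementation.
// StorageLike mimics the subset of localStorage used here.
interface StorageLike {
  setItem(key: string, value: string): void;
}

// Try to persist history; on a quota error, drop the oldest (last)
// entry and retry until the payload fits or nothing is left.
// Entries are assumed newest-first, matching the prepend-on-add behavior.
function saveHistory<T>(
  storage: StorageLike,
  key: string,
  entries: T[]
): T[] {
  let kept = [...entries];
  while (kept.length > 0) {
    try {
      storage.setItem(key, JSON.stringify(kept));
      return kept;
    } catch {
      kept = kept.slice(0, -1); // shed the oldest entry and retry
    }
  }
  return kept;
}
```

Capping up front is simpler and avoids partial writes, which is presumably why the hook enforces the limit before touching storage at all.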