Cloud-Synced Crawl History

Overview

42crawl implements a hybrid storage strategy to balance performance, privacy, and persistence. While full crawl data is stored locally in the user's browser, a summarized version is automatically synced to the cloud for authenticated users.

Why This Exists

  • Data Portability: Allows logged-in users to see their crawl history (domain, health score, page count) across different devices.
  • Privacy for Guests: Unauthenticated users can still benefit from history tracking via localStorage without creating an account.
  • Performance: Storing full crawl datasets (thousands of pages) in a centralized database would be expensive and slow. Syncing only a summary keeps database writes small and fast while still providing cross-device history.

How It Works

Hybrid Storage Architecture

The system is orchestrated by the useCrawlHistory hook in src/hooks/useCrawlHistory.ts.

  1. LocalStorage (Primary):
    • Stores the full CrawlHistoryEntry object, including the complete array of CrawledPage[] and full CrawlStats.
    • Key: seo-crawl-history.
    • Limit: Capped at the 5 most recent entries to prevent quota issues.
  2. Supabase Database (Sync):
    • Stores a summary in the crawl_history table.
    • Fields: domain, url, health_score, pages_count, critical_issues, and a JSON snapshot of stats (stored as JSONB).
    • Relationship: Each entry is linked to a user_id.
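The split between the full local entry and the cloud summary can be sketched as follows. The type shapes here are illustrative (the real definitions live in src/hooks/useCrawlHistory.ts and may differ in detail); only the snake_case field names match the documented crawl_history columns.

```typescript
// Hypothetical shapes for illustration; the real types live in
// src/hooks/useCrawlHistory.ts and may carry more fields.
interface CrawlStats {
  healthScore: number;
  criticalIssues: number;
}

interface CrawledPage {
  url: string;
}

interface CrawlHistoryEntry {
  domain: string;
  url: string;
  crawledAt: string; // ISO timestamp
  stats: CrawlStats;
  pages: CrawledPage[];
}

// Reduce a full local entry to the summary row stored in crawl_history.
// The full pages array stays in localStorage; only aggregates go up.
function toSummaryRow(entry: CrawlHistoryEntry, userId: string) {
  return {
    user_id: userId,
    domain: entry.domain,
    url: entry.url,
    health_score: entry.stats.healthScore,
    pages_count: entry.pages.length,
    critical_issues: entry.stats.criticalIssues,
    stats: entry.stats, // serialized to JSONB by the database client
    crawled_at: entry.crawledAt,
  };
}
```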

Synchronization Logic

  • When a crawl completes, addEntry is called.
  • It immediately updates the local history state and localStorage.
  • If a user is authenticated (user is present in useAuth), the saveToDatabase function is triggered to push the summary to Supabase.
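A minimal sketch of this flow, with the storage backend and database writer injected so the logic is visible on its own (the real hook closes over window.localStorage, React state, and a Supabase insert):

```typescript
const STORAGE_KEY = "seo-crawl-history";
const MAX_ENTRIES = 5;

interface HistoryEntry {
  domain: string;
  url: string;
}

// Prepend the new entry, cap at the 5 most recent, persist locally,
// and push the summary only when a user is signed in.
function addEntry(
  current: HistoryEntry[],
  entry: HistoryEntry,
  storage: { setItem(key: string, value: string): void },
  userId: string | null,
  saveToDatabase: (entry: HistoryEntry) => Promise<void>
): HistoryEntry[] {
  const next = [entry, ...current].slice(0, MAX_ENTRIES);
  storage.setItem(STORAGE_KEY, JSON.stringify(next));
  if (userId) {
    // Fire-and-forget: the local update never waits on the network.
    void saveToDatabase(entry);
  }
  return next;
}
```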

Configuration

Database Schema

The crawl_history table in the public schema contains:

  • id: UUID (Primary Key)
  • user_id: UUID (Foreign Key to auth.users)
  • domain: TEXT
  • url: TEXT
  • health_score: INTEGER
  • pages_count: INTEGER
  • critical_issues: INTEGER
  • stats: JSONB
  • crawled_at: TIMESTAMPTZ
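On the client, these columns can be mirrored with a row type. The interface below is a sketch written from the column list above, not generated from the actual Supabase migration:

```typescript
// Client-side mirror of the crawl_history columns (illustrative;
// the migration in Supabase is the source of truth).
interface CrawlHistoryRow {
  id: string;                      // UUID primary key
  user_id: string;                 // UUID, FK to auth.users
  domain: string;
  url: string;
  health_score: number;
  pages_count: number;
  critical_issues: number;
  stats: Record<string, unknown>;  // JSONB
  crawled_at: string;              // TIMESTAMPTZ, ISO string on the client
}
```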

Hook Interface

The useCrawlHistory hook provides:

  • history: The current array of local history entries.
  • addEntry(): Adds a new crawl and triggers the sync.
  • removeEntry(): Deletes an entry locally.
  • getEntriesForDomain(): Filters history for the active site.
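getEntriesForDomain is, in essence, a filter over the local history array. A standalone sketch of that behavior (the hook's version closes over its own history state rather than taking it as a parameter):

```typescript
interface HistoryEntry {
  domain: string;
  url: string;
}

// Return only the entries whose domain matches the active site.
function getEntriesForDomain(
  history: HistoryEntry[],
  domain: string
): HistoryEntry[] {
  return history.filter((e) => e.domain === domain);
}
```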

User Flow

  1. Guest Crawl: A guest runs a crawl. The result is saved to localStorage. They see it in the "Recent Crawls" panel.
  2. Login: The user logs in. Their localStorage history remains available.
  3. Authenticated Crawl: The user runs a new crawl. The summary is pushed to the crawl_history table.
  4. Device Switch: The user logs in on a different machine and sees the synced summaries (domain, health score, page count). (Note: only summary data is pulled from the cloud; full page-level details remain on the device that performed the crawl.)

Edge Cases & Limitations

  • Data Mismatch: Because the full data is in localStorage, clearing browser data will remove the ability to view detailed reports for past crawls, even if the summary exists in Supabase.
  • Max Entries: The 5-entry limit in localStorage is a safeguard against browser storage limits (usually 5-10MB).
  • Guest-to-User Migration: Crawls performed as a guest are not retroactively pushed to the database upon login; only new crawls are synced.
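Because localStorage can be cleared, partially evicted, or corrupted, reads of the stored history should be defensive. A sketch of a tolerant loader (a hypothetical helper, not taken from the hook):

```typescript
// Parse the stored history string, returning [] when storage is empty,
// the JSON is corrupt, or the stored value is not the expected array.
function loadHistory(raw: string | null): unknown[] {
  if (!raw) return [];
  try {
    const parsed: unknown = JSON.parse(raw);
    return Array.isArray(parsed) ? parsed : [];
  } catch {
    return [];
  }
}
```

In the hook this would wrap `localStorage.getItem("seo-crawl-history")`, so a wiped or damaged store degrades to an empty history instead of a crash.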
