Cloud-Synced Crawl History

Overview

42crawl implements a hybrid storage strategy to balance performance, privacy, and persistence. While full crawl data is stored locally in the user's browser, a summarized version is automatically synced to the cloud for authenticated users.

Why This Exists

  • Data Portability: Allows logged-in users to see their crawl history (domain, health score, page count) across different devices.
  • Privacy for Guests: Unauthenticated users can still benefit from history tracking via localStorage without creating an account.
  • Performance: Storing full crawl datasets (thousands of pages) in a centralized database would be expensive and slow. Syncing only a summary keeps database writes small and fast while still providing cross-device history.

How It Works

Hybrid Storage Architecture

The system is orchestrated by the useCrawlHistory hook in src/hooks/useCrawlHistory.ts.

  1. LocalStorage (Primary):
    • Stores the full CrawlHistoryEntry object, including the complete array of CrawledPage[] and full CrawlStats.
    • Key: seo-crawl-history.
    • Limit: Capped at the 5 most recent entries to prevent quota issues.
  2. Supabase Database (Sync):
    • Stores a summary in the crawl_history table.
    • Fields: domain, url, health_score, pages_count, critical_issues, and a JSON snapshot of stats (stored as JSONB).
    • Relationship: Each entry is linked to a user_id.
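The split between the full local entry and the cloud summary can be sketched as follows. The type shapes here are illustrative (the real definitions live in src/hooks/useCrawlHistory.ts and may differ in detail); only the snake_case field names match the documented crawl_history columns.

```typescript
// Hypothetical shapes for illustration; the real types live in
// src/hooks/useCrawlHistory.ts and may carry more fields.
interface CrawlStats {
  healthScore: number;
  criticalIssues: number;
}

interface CrawledPage {
  url: string;
}

interface CrawlHistoryEntry {
  domain: string;
  url: string;
  crawledAt: string; // ISO timestamp
  stats: CrawlStats;
  pages: CrawledPage[];
}

// Reduce a full local entry to the summary row stored in crawl_history.
// The full pages array stays in localStorage; only aggregates go up.
function toSummaryRow(entry: CrawlHistoryEntry, userId: string) {
  return {
    user_id: userId,
    domain: entry.domain,
    url: entry.url,
    health_score: entry.stats.healthScore,
    pages_count: entry.pages.length,
    critical_issues: entry.stats.criticalIssues,
    stats: entry.stats, // serialized to JSONB by the database client
    crawled_at: entry.crawledAt,
  };
}
```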

Synchronization Logic

  • When a crawl completes, addEntry is called.
  • It immediately updates the local history state and localStorage.
  • If a user is authenticated (user is present in useAuth), the saveToDatabase function is triggered to push the summary to Supabase.
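A minimal sketch of this flow, with the storage backend and database writer injected so the logic is visible on its own (the real hook closes over window.localStorage, React state, and a Supabase insert):

```typescript
const STORAGE_KEY = "seo-crawl-history";
const MAX_ENTRIES = 5;

interface HistoryEntry {
  domain: string;
  url: string;
}

// Prepend the new entry, cap at the 5 most recent, persist locally,
// and push the summary only when a user is signed in.
function addEntry(
  current: HistoryEntry[],
  entry: HistoryEntry,
  storage: { setItem(key: string, value: string): void },
  userId: string | null,
  saveToDatabase: (entry: HistoryEntry) => Promise<void>
): HistoryEntry[] {
  const next = [entry, ...current].slice(0, MAX_ENTRIES);
  storage.setItem(STORAGE_KEY, JSON.stringify(next));
  if (userId) {
    // Fire-and-forget: the local update never waits on the network.
    void saveToDatabase(entry);
  }
  return next;
}
```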

Configuration

Database Schema

The crawl_history table in the public schema contains:

  • id: UUID (Primary Key)
  • user_id: UUID (Foreign Key to auth.users)
  • domain: TEXT
  • url: TEXT
  • health_score: INTEGER
  • pages_count: INTEGER
  • critical_issues: INTEGER
  • stats: JSONB
  • crawled_at: TIMESTAMPTZ
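On the client, these columns can be mirrored with a row type. The interface below is a sketch written from the column list above, not generated from the actual Supabase migration:

```typescript
// Client-side mirror of the crawl_history columns (illustrative;
// the migration in Supabase is the source of truth).
interface CrawlHistoryRow {
  id: string;                      // UUID primary key
  user_id: string;                 // UUID, FK to auth.users
  domain: string;
  url: string;
  health_score: number;
  pages_count: number;
  critical_issues: number;
  stats: Record<string, unknown>;  // JSONB
  crawled_at: string;              // TIMESTAMPTZ, ISO string on the client
}
```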

Hook Interface

The useCrawlHistory hook provides:

  • history: The current array of local history entries.
  • addEntry(): Adds a new crawl and triggers the sync.
  • removeEntry(): Deletes an entry locally.
  • getEntriesForDomain(): Filters history for the active site.
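getEntriesForDomain is, in essence, a filter over the local history array. A standalone sketch of that behavior (the hook's version closes over its own history state rather than taking it as a parameter):

```typescript
interface HistoryEntry {
  domain: string;
  url: string;
}

// Return only the entries whose domain matches the active site.
function getEntriesForDomain(
  history: HistoryEntry[],
  domain: string
): HistoryEntry[] {
  return history.filter((e) => e.domain === domain);
}
```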

User Flow

  1. Guest Crawl: A guest runs a crawl. The result is saved to localStorage. They see it in the "Recent Crawls" panel.
  2. Login: The user logs in. Their localStorage history remains available.
  3. Authenticated Crawl: The user runs a new crawl. The summary is pushed to the crawl_history table.
  4. Device Switch: The user logs in on a different machine and sees the synced summaries (domain, health score, page count). (Note: only summary data is pulled from the cloud; full page-level details remain on the device that performed the crawl.)

Edge Cases & Limitations

  • Data Mismatch: Because the full data is in localStorage, clearing browser data will remove the ability to view detailed reports for past crawls, even if the summary exists in Supabase.
  • Max Entries: The 5-entry limit in localStorage is a safeguard against browser storage limits (usually 5-10MB).
  • Guest-to-User Migration: Crawls performed as a guest are not retroactively pushed to the database upon login; only new crawls are synced.
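Because localStorage can be cleared, partially evicted, or corrupted, reads of the stored history should be defensive. A sketch of a tolerant loader (a hypothetical helper, not taken from the hook):

```typescript
// Parse the stored history string, returning [] when storage is empty,
// the JSON is corrupt, or the stored value is not the expected array.
function loadHistory(raw: string | null): unknown[] {
  if (!raw) return [];
  try {
    const parsed: unknown = JSON.parse(raw);
    return Array.isArray(parsed) ? parsed : [];
  } catch {
    return [];
  }
}
```

In the hook this would wrap `localStorage.getItem("seo-crawl-history")`, so a wiped or damaged store degrades to an empty history instead of a crash.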
