Scrapingdome
Case study - UK grocery

Six UK grocery chains, four anti-bot stacks, one pipeline.

For a retail-tech client building a UK price comparison product, Scrapingdome runs continuous collection of catalog, pricing, availability, promotion, and review data across the country's largest grocery retailers. Different platforms, different anti-bot stacks, and different rate ceilings are normalized into one product model and one historical price log.

Problem

Six retailers, four platforms, no public API.

A consumer-facing UK price comparison product is only as good as its freshness, breadth, and accuracy. Competing with established players requires three things at once: full catalog coverage across the major UK grocers, near-real-time price and availability updates, and enriched product detail (ingredients, nutrition, allergens, dietary flags, reviews). None of that ships through a public API, and the six target retailers each take a different shape.

The platforms are not interchangeable. Two retailers run on the same Open Source Platform (OSP) e-commerce stack and expose a workable category-tree API. Two others front their catalog with Algolia, which is permissive but ships search hits, not full product records. Tesco runs a private GraphQL API gated behind an API key, session cookies, and UK IP enforcement. Sainsbury's uses a custom SSR stack protected by Akamai Bot Manager, which fingerprints clients down to the operating system. Building a separate codebase per retailer was a non-starter; it would have multiplied maintenance cost without giving the client any leverage on new retailers added later.

The anti-bot surface is uneven. Algolia tolerates high request rates if you stagger them across facets. The OSP API tolerates aggressive parallelism but emits false-negative product-count hints that have to be ignored. Tesco's GraphQL responses include polymorphic fields where the same path returns either a string or a list, breaking naive parsers. Sainsbury's is the hardest case: Akamai blocks Linux Chrome at the TLS fingerprint level regardless of header spoofing, so the gate cannot be cleared from any Linux container. Controlled testing showed the gate is primarily IP geolocation and operating system fingerprint; a UK-residential or UK-datacenter Windows host passes, a non-UK Windows host fails after roughly three requests, a Linux host never gets a token at all.

The data layer is uneven too. Catalog endpoints return identifiers, names, prices, and category paths. Enrichment endpoints where they exist return ingredients, nutrition, dietary flags, packaging, manufacturer details. Reviews live on yet another endpoint, sometimes paginated, sometimes capped. The stable identifier is one field on some retailers, a different one on others. Joining all of that into a single product table the application can query in milliseconds is its own problem.

Approach

One framework, platform-shaped subclasses, single product model.

01

Platform-shaped scraper hierarchy

A BaseScraper abstract class owns the HTTP client, retry policy, rate limiter, and checkpointing. Two platform classes inherit from it: BaseOSPScraper for the catalog-API retailers and BaseAlgoliaScraper for the search-API retailers, each implementing listing and pagination for its platform. Adding a new retailer on a supported platform is then config: an API base URL, category seeds, and the subclass inherits everything else. Tesco gets its own scraper because the GraphQL surface has no peer in our list. Sainsbury's gets its own scraper because of the Akamai constraint.
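The hierarchy described above can be sketched roughly as follows. This is a minimal illustration, not the production code: only the class names BaseScraper and BaseOSPScraper come from the text, while RetailerConfig, its fields, fetch_page, the in-memory checkpoint set, and the canned data are assumptions made so the sketch runs without a network.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class RetailerConfig:
    """Hypothetical per-retailer config; the field names are assumptions."""
    name: str
    api_base_url: str
    category_seeds: tuple[str, ...]


class BaseScraper(ABC):
    """Owns the cross-cutting concerns; only checkpointing is stubbed here."""

    def __init__(self, config: RetailerConfig) -> None:
        self.config = config
        self.done: set[str] = set()  # in-memory stand-in for a checkpoint store

    @abstractmethod
    def list_category(self, seed: str) -> Iterator[dict]:
        """Yield raw product records for one category seed."""

    def run(self) -> list[dict]:
        records: list[dict] = []
        for seed in self.config.category_seeds:
            if seed in self.done:  # resume: skip categories already scraped
                continue
            records.extend(self.list_category(seed))
            self.done.add(seed)
        return records


class BaseOSPScraper(BaseScraper):
    """Platform class: walks a category-tree API page by page."""

    page_size = 2

    def fetch_page(self, seed: str, page: int) -> list[dict]:
        # Placeholder for the real HTTP call; returns canned data so the sketch runs.
        fake = [{"id": f"{seed}-{i}", "retailer": self.config.name} for i in range(3)]
        start = page * self.page_size
        return fake[start:start + self.page_size]

    def list_category(self, seed: str) -> Iterator[dict]:
        page = 0
        while batch := self.fetch_page(seed, page):  # an empty page ends the walk
            yield from batch
            page += 1


# A new retailer on a supported platform is then only configuration.
retailer_a = BaseOSPScraper(
    RetailerConfig("retailer-a", "https://example.invalid/api", ("bakery", "dairy"))
)
```

Calling retailer_a.run() walks both seeds; a second call returns nothing because every seed is already checkpointed, which is the resume behaviour in miniature.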

02

Stealth where required, lightweight where possible

For the open APIs (Algolia, OSP), httpx with rotating user agents and exponential backoff is enough. Tesco's GraphQL uses the same client plus a managed cookie and session pool that is refreshed when the API key is challenged. Sainsbury's runs SeleniumBase in UC mode on Windows hosts only: the browser opens a session against the grocery domain, and all subsequent API calls are issued via fetch from inside the page context, inheriting the Akamai-blessed session token. The runner restarts the browser automatically on hang, forces Chrome cleanup, and resumes from the last checkpoint.
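For the lightweight case, the retry policy might look something like the sketch below. It is transport-agnostic on purpose: send stands in for the real httpx call, TransientError stands in for a timeout or a 429/5xx response, and the user-agent strings, the jitter strategy, and the default limits are all assumptions rather than the production values.

```python
import itertools
import random
from typing import Callable

USER_AGENTS = ["agent-a/1.0", "agent-b/1.0", "agent-c/1.0"]  # placeholder strings


class TransientError(Exception):
    """Stands in for a timeout, HTTP 429, or 5xx response."""


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential delay for a zero-based attempt number, capped at `cap` seconds."""
    return min(cap, base * (2 ** attempt))


def fetch_with_retry(
    send: Callable[[str, dict], str],
    url: str,
    max_attempts: int = 5,
    sleep: Callable[[float], None] = lambda seconds: None,  # pass time.sleep in real use
) -> str:
    """Call send(url, headers), rotating the user agent and backing off on failure."""
    agents = itertools.cycle(USER_AGENTS)
    for attempt in range(max_attempts):
        headers = {"User-Agent": next(agents)}  # rotate on every attempt
        try:
            return send(url, headers)
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: sleep a random fraction of the exponential delay.
            sleep(random.uniform(0, backoff_delay(attempt)))
    raise ValueError("max_attempts must be at least 1")
```

Injecting sleep keeps the function testable without real waiting; the jitter spreads retries so parallel workers do not hit the same endpoint in lockstep.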

03

Single normalized product model

Every scraper, regardless of platform, produces a Pydantic Product with the same fields: id, uuid, name, brand, description, ingredients, nutrition, dietary flags, allergens, pack size, price (current, was, unit), availability, offers, promotions, rating, review count, images, category path, country of origin, manufacturer, plus a dozen optional fields. Type coercion and validation happen at this layer, so downstream code never defends against malformed retailer payloads. Each scraper writes to its own per-retailer SQLite database. The per-retailer split is intentional: it isolates corruption and lets a single retailer be re-run without touching the others.
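To illustrate what coercion at the model layer means, here is a small sketch. It deliberately uses a standard-library dataclass as a stand-in for the Pydantic model so it runs with no dependencies, and it covers only a handful of the fields; the attribute names, the price-cleaning rule, and the comma-separated allergen case are assumptions for illustration.

```python
from dataclasses import dataclass, field
from decimal import Decimal, InvalidOperation
from typing import Optional


def _to_price(value: object) -> Decimal:
    """Coerce '£1.50', '1.50', or 1.5 into a finite, non-negative Decimal."""
    try:
        price = Decimal(str(value).replace("£", "").strip())
    except InvalidOperation as exc:
        raise ValueError(f"bad price: {value!r}") from exc
    if not price.is_finite() or price < 0:
        raise ValueError(f"bad price: {value!r}")
    return price


@dataclass
class Product:
    """Dataclass stand-in for the Pydantic Product; a small subset of fields."""
    id: str
    name: str
    price_current: Decimal
    price_was: Optional[Decimal] = None
    allergens: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Normalize identifiers and names, and reject empty ones outright.
        self.id = str(self.id).strip()
        self.name = str(self.name).strip()
        if not self.id or not self.name:
            raise ValueError("id and name are required")
        self.price_current = _to_price(self.price_current)
        if self.price_was is not None:
            self.price_was = _to_price(self.price_was)
        # Assumed quirk: some payloads send one comma-separated string, not a list.
        if isinstance(self.allergens, str):
            self.allergens = [a.strip() for a in self.allergens.split(",") if a.strip()]
```

A record such as Product(id=123, name=" Oat Milk ", price_current="£1.50", allergens="oats, gluten") comes out with a string id, a trimmed name, a Decimal price, and a list of allergens; a price of "n/a" raises ValueError and never reaches storage.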

04

Two-table Supabase pipeline with sync triggers

The exporter upserts into staging tables prefixed by retailer. The Products table is keyed by product id and holds the static catalog. The Updates table is append-only and captures every price, availability, and offer change with a timestamp; that is the historical price series. Downstream, PostgreSQL triggers sync the staging tables into a single app-facing products table with cross-retailer joins. The upsert uses COALESCE on every enrichment column, so a catalog-only refresh never wipes enrichment data fetched on a different schedule. Deadlocks are retried automatically with backoff and reconnection.
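The COALESCE upsert is easiest to see in a tiny example. The sketch below uses an in-memory SQLite database as a stand-in for the Supabase PostgreSQL tables (the ON CONFLICT ... DO UPDATE form shown is accepted by both); the table layout, column names, and sample values are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id TEXT PRIMARY KEY, name TEXT, price REAL, ingredients TEXT)"
)

UPSERT = """
INSERT INTO products (id, name, price, ingredients)
VALUES (:id, :name, :price, :ingredients)
ON CONFLICT (id) DO UPDATE SET
    name        = excluded.name,
    price       = excluded.price,
    -- enrichment column: keep the stored value when the incoming one is NULL
    ingredients = COALESCE(excluded.ingredients, products.ingredients)
"""


def upsert(row: dict) -> None:
    with conn:  # commit on success, roll back on error
        conn.execute(UPSERT, row)


# The enrichment run lands first, then a catalog-only refresh with no ingredients.
upsert({"id": "p1", "name": "Oat milk", "price": 1.50, "ingredients": "water, oats"})
upsert({"id": "p1", "name": "Oat milk", "price": 1.65, "ingredients": None})

row = conn.execute("SELECT price, ingredients FROM products WHERE id = 'p1'").fetchone()
```

After both upserts the row carries the new price and the original ingredients. Swapping the order of the two calls leaves the ingredients intact as well, which is the order independence the pipeline relies on.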

05

Reviews and images on separate pipelines

Reviews are extracted per retailer on their own weekly schedule, written to per-retailer Reviews tables, and tied to products by uuid. Product images are mirrored to a Supabase Storage bucket on a separate concurrent uploader, so the main pipeline never blocks on slow image downloads. Each pipeline has its own deadline and its own failure mode; one pipeline going down does not stop the others.

Scale and outcome (in production)

Six retailers live on one codebase, plus a comparison site as a cross-retailer index.

All scrapers run on a managed Kubernetes cluster in London with 13 cron jobs built from a single Docker image. Catalogs refresh every two to five hours depending on retailer; enrichment and reviews run weekly; the comparison-site crawler runs daily. Every job has a 23-hour deadline and recovers automatically from transient failures.

6 + 1
grocery retailers plus a cross-retailer comparison index
~230K
grocery products tracked across the six retailers
2 to 5h
catalog refresh cadence depending on retailer

Products tracked per retailer: Ocado 51K, Tesco 37K, Asda 61K, Sainsbury's 31K, Morrisons 31K, Iceland 17K. The Trolley.co.uk comparison crawler indexes 175K products mapped across the six retailers. The Reviews tables hold more than 3M rows. The Updates table (the historical price series) holds more than half a million rows and grows daily.

What this proves

The right axis of abstraction is the platform, not the brand.

Cross-vendor scraping at scale is not a function of writing six scrapers. It is a function of finding the right axis of abstraction (usually the underlying e-commerce platform, not the brand) and building once per axis. Two retailers sharing the OSP stack share a scraper. Two retailers sharing Algolia share a scraper. The brand-specific code is config: an API key, a category seed, a filter rule.

The harder problem is the storage layer. Catalog scrapes and enrichment scrapes run on different schedules and bring different field sets, and a naive upsert will overwrite enrichment with NULL the moment a catalog-only refresh wins the race. COALESCE on every enrichment column is a small change with a large effect; it makes the whole pipeline order-independent. The same logic (preserve what you have unless you have something better) applies to other multi-source data systems: CRM enrichment, identity resolution, product matching. It is a pattern, not a one-off fix.

Anti-bot is not a single problem either. The Akamai gate is operating system and IP shaped; the GraphQL gate is key and cookie shaped; the rest are rate shaped. Treating them with the same toolchain wastes effort. Treating them with different toolchains behind a unified interface is the win: the Product model and the Supabase exporter do not know or care how a given record was obtained.

Questions answered in this engagement

How this pipeline works in practice.

How are price changes tracked over time?

Every scrape inserts a row into the retailer's Updates table with a timestamp, price, unit price, availability, and any active promotion. The Products table holds the latest static catalog; the Updates table is the immutable price-and-availability log. A product scraped seven times has one row in Products and seven in Updates. Price-history queries and 'is this currently on offer' queries stay fast and cheap, and the schema survives changes without backfill.
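A compact way to picture the two tables is the sketch below, again with in-memory SQLite standing in for the Supabase tables. The column names, the index, the helper functions, and the sample rows are made up for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE products (id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE updates (
    product_id TEXT NOT NULL REFERENCES products(id),
    scraped_at TEXT NOT NULL,   -- ISO 8601 text sorts chronologically
    price      REAL,
    available  INTEGER,
    promotion  TEXT
);
CREATE INDEX updates_by_product ON updates (product_id, scraped_at);
""")


def record_scrape(pid, name, scraped_at, price, available=1, promotion=None):
    """One scrape: upsert the catalog row, append one row to the log."""
    with db:
        db.execute(
            "INSERT INTO products VALUES (?, ?) "
            "ON CONFLICT (id) DO UPDATE SET name = excluded.name",
            (pid, name),
        )
        db.execute(
            "INSERT INTO updates VALUES (?, ?, ?, ?, ?)",
            (pid, scraped_at, price, available, promotion),
        )


def price_history(pid):
    """All (timestamp, price) pairs for a product, oldest first."""
    return db.execute(
        "SELECT scraped_at, price FROM updates WHERE product_id = ? ORDER BY scraped_at",
        (pid,),
    ).fetchall()


def on_offer_now(pid):
    """True if the most recent log row carries a promotion."""
    row = db.execute(
        "SELECT promotion FROM updates WHERE product_id = ? "
        "ORDER BY scraped_at DESC LIMIT 1",
        (pid,),
    ).fetchone()
    return bool(row and row[0])


# Illustrative data: three scrapes of one product.
for day, price, promo in [("01", 2.00, None), ("02", 2.00, None), ("03", 1.50, "25% off")]:
    record_scrape("p1", "Cheddar 350g", f"2024-01-{day}T06:00:00Z", price, promotion=promo)
```

Three scrapes leave one row in products and three in updates. Both queries are served by the (product_id, scraped_at) index, which is why they stay cheap as the log grows.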

What happens when a product disappears from a retailer's catalog?

After every export, the pipeline computes the set difference between the IDs in the current scrape and the IDs in Supabase. Anything in Supabase but missing in the current scrape is marked out_of_stock with the last known price preserved, so the price series does not go to NULL when something is briefly delisted. If the product reappears in a future scrape, it is automatically reactivated.
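The set-difference step might be sketched like this. The in-memory dictionary stands in for the rows already in Supabase, and StoredProduct, reconcile, and the "active" status value are hypothetical names; only out_of_stock comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class StoredProduct:
    """Stand-in for a row already in the central database."""
    id: str
    last_price: float
    status: str = "active"


def reconcile(
    stored: dict[str, StoredProduct], scraped_ids: set[str]
) -> tuple[set[str], set[str]]:
    """Mark vanished products out_of_stock and reactivate returning ones.

    Returns (newly_delisted, reactivated) so the caller can log both sets.
    """
    missing = stored.keys() - scraped_ids  # in storage, absent from this scrape
    delisted = {pid for pid in missing if stored[pid].status != "out_of_stock"}
    reactivated = {
        pid for pid in stored.keys() & scraped_ids
        if stored[pid].status == "out_of_stock"
    }
    for pid in delisted:
        stored[pid].status = "out_of_stock"  # last_price is deliberately untouched
    for pid in reactivated:
        stored[pid].status = "active"
    return delisted, reactivated
```

Because only the status flag changes, the last known price stays on the row, and a product that returns in a later scrape flips back without manual intervention.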

What happens when a retailer changes their API?

Most changes are field-level. Tesco, for example, ships some text fields as either a string or a list of strings depending on the product type; a join helper catches both shapes and resolves them transparently. For larger changes (a new auth scheme, a moved endpoint), the per-retailer scraper is updated and the rest of the pipeline is unaffected. Per-retailer SQLite buffers and a strict Pydantic model mean a broken scraper cannot poison the central database; bad records are rejected at the model layer and logged.
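A join helper for the string-or-list case can be very small. The version below is a guess at the shape of such a helper, not the production one: the name join_text, the default separator, and the choice to return None for empty input are assumptions.

```python
from typing import Optional, Union


def join_text(value: Union[str, list, None], sep: str = " ") -> Optional[str]:
    """Normalize a field that may arrive as a string, a list of strings, or null."""
    if value is None:
        return None
    if isinstance(value, str):
        parts = [value]
    else:
        # Drop null entries and stringify anything unexpected in the list.
        parts = [str(item) for item in value if item is not None]
    text = sep.join(part.strip() for part in parts if part.strip())
    return text or None  # treat empty results the same as a missing field
```

With this in front of the model, join_text("Keep refrigerated") and join_text(["Keep refrigerated.", " Use within 3 days. "]) both yield a single clean string, so the parser never has to branch on the payload shape.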

How is the Akamai-protected retailer kept running?

It runs on a separate standalone runner on a UK Windows VPS, using the same product model and the same Supabase exporter. The runner uses SeleniumBase UC mode in either headless or headed configuration. Continuous price-only loops start on machine startup; a weekly full-catalog run extracts the enriched fields. Browser hangs are detected and auto-recovered up to five times before the run pauses for the next scheduled invocation. The only piece of infrastructure that has to be UK-based is the host's IP.
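The recover-up-to-five-times rule could be expressed along these lines. The sketch leaves out hang detection itself (for example a watchdog timeout) and the actual SeleniumBase calls; BrowserHang, step, and restart_browser are hypothetical names injected so the control flow can be shown on its own.

```python
from typing import Callable

MAX_RECOVERIES = 5  # figure taken from the description above


class BrowserHang(Exception):
    """Raised by the caller's hang detector, which is not shown here."""


def run_with_recovery(
    step: Callable[[], bool],
    restart_browser: Callable[[], None],
    max_recoveries: int = MAX_RECOVERIES,
) -> str:
    """Run step() until it reports completion, restarting the browser on hangs.

    step() does one unit of work and returns True when the run is finished.
    Returns "completed", or "paused" once the recovery budget is spent.
    """
    recoveries = 0
    while True:
        try:
            if step():
                return "completed"
        except BrowserHang:
            if recoveries == max_recoveries:
                return "paused"  # leave the rest to the next scheduled invocation
            recoveries += 1
            restart_browser()  # would kill leftover Chrome processes and relaunch
```

Combined with checkpointing, a "paused" result costs nothing but time: the next scheduled run resumes from the last completed unit of work.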

How long does adding a new retailer take?

If the retailer is on a platform already supported (OSP-style catalog API or Algolia), a working scraper is hours to days; the subclass is mostly configuration. Add an API key and an index name, declare the category facets, and the catalog runs. If the platform is new, plan one to two weeks: reverse-engineering the endpoint surface, building the scraper, integrating with the unified product model, adding staging tables, and writing the cron job. After that it is the same operational model as every other retailer.

What does the deployment look like?

A multi-stage Docker image (Python 3.12 plus SeleniumBase and Chrome where needed) is built once, pushed to a private container registry, and consumed by 13 cron jobs in a single namespace on a managed Kubernetes cluster in London. Manifests are code-generated from a Python schedule configuration, so adding or rescheduling a job is a one-line change followed by an apply. A persistent volume claim holds the per-retailer SQLite buffers so checkpoints survive pod restarts. Secrets are created and updated through kubectl and are never committed to the repository.
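Generating manifests from a schedule configuration might look roughly like the following. The job names, cron expressions, image reference, namespace, and the concurrencyPolicy setting are placeholders and assumptions; the 23-hour deadline is the figure quoted earlier, expressed through the Job-level activeDeadlineSeconds field. The persistent volume claim and secret references are omitted for brevity.

```python
import json

# Hypothetical schedule config: one entry per cron job.
SCHEDULE = {
    "retailer-a-catalog": {"cron": "0 */3 * * *", "args": ["catalog", "--retailer", "retailer-a"]},
    "retailer-a-reviews": {"cron": "0 2 * * 0", "args": ["reviews", "--retailer", "retailer-a"]},
}
IMAGE = "registry.example.invalid/scrapers:latest"  # placeholder image reference
DEADLINE_SECONDS = 23 * 60 * 60  # 23-hour job deadline


def cronjob_manifest(name: str, cron: str, args: list[str], namespace: str = "scrapers") -> dict:
    """Build one batch/v1 CronJob manifest as a plain dict."""
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "schedule": cron,
            "concurrencyPolicy": "Forbid",  # assumption: never overlap runs of one job
            "jobTemplate": {
                "spec": {
                    "activeDeadlineSeconds": DEADLINE_SECONDS,
                    "template": {
                        "spec": {
                            "restartPolicy": "OnFailure",
                            "containers": [{"name": name, "image": IMAGE, "args": args}],
                        }
                    },
                }
            },
        },
    }


manifests = [cronjob_manifest(name, job["cron"], job["args"]) for name, job in SCHEDULE.items()]
# kubectl accepts JSON as well as YAML, so the standard library is enough here.
rendered = json.dumps({"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2)
```

Rescheduling a job is then an edit to one SCHEDULE entry, a regeneration, and a kubectl apply of the rendered file.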
