Scrapingdome
Case study - Operabase conductor dataset

Ten thousand conductor profiles, nine parallel workers, one relational research dataset.

For an academic research client, Scrapingdome delivered a clean, joinable dataset spanning the full conductor population on the Operabase classical music registry. The work covered a signed JavaScript application, authenticated session cloning, deep career-history recovery, and a parent-child relational schema, all wired into a pipeline designed to be re-run and adapted to adjacent performer categories without rework.

Problem

The signal is on the page; nobody packages it relationally.

Operabase indexes performers, productions, venues and festivals across the world. For a researcher, the value lives in the relationships - which conductor led which production, with which orchestra, in which venue, with which agency representing them - not in any single profile page. Four characteristics of the source make that packaging non-trivial.

The application is a signed JavaScript app. Every data request carries a cryptographic signature regenerated by the application for each call. Captured requests stop working within seconds; the signer rotates with each new deployment, so any replay-based approach inherits a permanent maintenance burden.

Listings are gated and unbounded. The conductor directory is only visible to authenticated users, and the pagination has no advertised total - you stop when the listing stops surfacing new profiles. There is no count to checkpoint against.
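
The stopping rule for an unbounded listing can be sketched as follows. This is a minimal illustration, not the production harvester: `fetch_page` is a hypothetical callable that returns the profile identifiers shown on one listing page, and `max_pages` is an assumed safety cap.

```python
from typing import Callable, Iterable, List, Set


def harvest_listing(fetch_page: Callable[[int], Iterable[str]],
                    max_pages: int = 10_000) -> List[str]:
    """Collect profile IDs page by page until a page surfaces nothing new.

    fetch_page is a hypothetical stand-in for whatever reads one listing
    page; max_pages is an assumed cap so a misbehaving source cannot loop
    forever.
    """
    seen: Set[str] = set()
    ordered: List[str] = []
    for page in range(1, max_pages + 1):
        added = 0
        for profile_id in fetch_page(page):
            if profile_id not in seen:
                seen.add(profile_id)
                ordered.append(profile_id)
                added += 1
        if added == 0:
            # No advertised total exists, so "nothing new" is the stop signal.
            break
    return ordered
```

Because the only checkpoint is the set of identifiers already seen, the same set doubles as the deduplication guard when a listing repeats profiles across pages.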

Career history is paginated by year. A conductor with forty years of credits exposes only the most recent fifteen seasons on first load. The rest of the history is reachable, but only by walking back one year at a time, which inflates per-profile cost if not bounded.

Agency contact data lives behind a sub-route. The fields a researcher cares most about - phone, email, named representative - surface only when the profile's manager tab is opened, not from the main page. The official commercial API is not sized for single-project budgets, and no turnkey export covers the relational shape research actually needs.

Approach

Built so the dataset is reproducible and the architecture is reusable.

01

Source feasibility, the right path chosen on day one

A one-day audit compared three approaches: direct request replay, generic browser automation, and an instrumented-browser pattern that lets the application sign its own requests. The first two were ruled out - replay fails on signature rotation, and generic automation reads the page slower than it loads. The instrumented-browser path was chosen and held for the full project with zero maintenance churn from upstream changes.
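
The response-reading half of the instrumented-browser pattern can be sketched without tying it to a specific automation library. The sketch assumes the browser layer can call a function with the URL, status and body of each network response it observes; `ResponseCollector` and the `/api/` marker are hypothetical names, not Operabase's or Scrapingdome's.

```python
import json
from typing import Any, Dict, List


class ResponseCollector:
    """Keeps JSON payloads from data responses the application fetched itself."""

    def __init__(self, url_marker: str) -> None:
        # Hypothetical substring that identifies data calls among all traffic.
        self.url_marker = url_marker
        self.payloads: List[Dict[str, Any]] = []

    def handle(self, url: str, status: int, body: bytes) -> bool:
        """Callback for one observed response; returns True if it was kept."""
        if self.url_marker not in url or status != 200:
            return False
        try:
            data = json.loads(body)
        except ValueError:
            return False  # not JSON (HTML, script, image): ignore it
        self.payloads.append({"url": url, "data": data})
        return True
```

The collector never builds or signs a request. It only reads what the application already asked for, which is why a signer rotation upstream would not touch this code.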

02

A persistent authenticated session, scaled across machines

One interactive login per machine, captured into a profile that survives restarts. The profile is then cloned twice on the same host, allowing three concurrent workers per machine without any of them tripping a session check. Three machines, three one-time passwords from the client, nine effective workers - the topology is elastic, and adding a fourth machine takes minutes.
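
A minimal sketch of the cloning step, assuming the authenticated session lives in an on-disk browser profile directory and that the browser is closed while the copy is taken. The function name and the `-cloneN` naming are illustrative assumptions.

```python
import shutil
from pathlib import Path
from typing import List


def clone_profile(base: Path, workers: int) -> List[Path]:
    """Return one profile directory per worker: the original plus copies.

    Assumes `base` is a logged-in profile directory that no browser
    currently has open.
    """
    profiles = [base]
    for index in range(1, workers):
        target = base.with_name(f"{base.name}-clone{index}")
        if not target.exists():  # reuse an existing clone across restarts
            shutil.copytree(base, target)
        profiles.append(target)
    return profiles
```

With `workers=3` this yields the original profile and two clones, matching the three concurrent workers per machine described above.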

03

Deep history without runaway cost

The platform's season summary surfaces only the most recent fifteen years up front. The scraper iterates each visible season, then walks back year by year from the oldest known season, stopping at the first empty year. The descent is bounded by a configurable floor, so it recovers the great majority of historical productions while keeping the per-profile cost predictable.
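
The walk-back can be sketched as below. It is an illustration under stated assumptions: `fetch_year` is a hypothetical callable returning the productions for one year, and the default floor of 1950 is an arbitrary placeholder, not the project's setting.

```python
from typing import Callable, Dict, Iterable, List


def recover_history(visible_years: Iterable[int],
                    fetch_year: Callable[[int], List[str]],
                    floor_year: int = 1950) -> Dict[int, List[str]]:
    """Collect productions per year: visible seasons first, then walk back."""
    history: Dict[int, List[str]] = {}
    years = sorted(set(visible_years))
    for year in years:
        history[year] = fetch_year(year)
    if not years:
        return history
    year = years[0] - 1  # start just before the oldest visible season
    while year >= floor_year:
        rows = fetch_year(year)
        if not rows:
            break  # the first empty year ends the descent
        history[year] = rows
        year -= 1
    return history
```

The cost per profile is at most one request per year between the oldest visible season and the floor. The trade-off is that a gap year ends the walk early, which fits a recovery rate described as the great majority rather than all.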

04

Relational output, not a flat dump

The deliverable is two joinable tables. A conductors table carries one row per profile with biography, primary agency contact, and aggregate counts. A performances table carries one row per production or performance with show-level metadata - composer, venue, dates, cast, orchestra, language, duration - and a foreign key back to the conductor. Deduplication is baked into the writer so re-runs converge cleanly.
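
One way to picture the two-table shape is a small SQLite sketch. The column names are illustrative assumptions that cover only a few of the fields listed above, and SQLite itself is a stand-in: the source does not say which storage format the deliverable uses.

```python
import sqlite3
from typing import Any, Dict

SCHEMA = """
CREATE TABLE IF NOT EXISTS conductors (
    conductor_id      TEXT PRIMARY KEY,
    name              TEXT,
    biography         TEXT,
    agency_email      TEXT,
    performance_count INTEGER
);
CREATE TABLE IF NOT EXISTS performances (
    performance_id TEXT NOT NULL,
    conductor_id   TEXT NOT NULL REFERENCES conductors(conductor_id),
    composer       TEXT,
    venue          TEXT,
    start_date     TEXT,
    PRIMARY KEY (conductor_id, performance_id)
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the parent-child link
    conn.executescript(SCHEMA)
    return conn


def write_conductor(conn: sqlite3.Connection, row: Dict[str, Any]) -> None:
    # INSERT OR IGNORE: writing the same key again on a re-run is a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO conductors VALUES "
        "(:conductor_id, :name, :biography, :agency_email, :performance_count)",
        row,
    )


def write_performance(conn: sqlite3.Connection, row: Dict[str, Any]) -> None:
    conn.execute(
        "INSERT OR IGNORE INTO performances VALUES "
        "(:performance_id, :conductor_id, :composer, :venue, :start_date)",
        row,
    )
```

Putting the primary key on the writer's side is what makes deduplication automatic here: a repeated row is dropped by the database rather than by a separate cleanup pass.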

05

Resumable from the first row, restartable from any failure

Every worker reads its own output on startup and skips what it already produced. Kill the orchestrator, lose a node, reboot a machine, swap a network - the run picks up where it left off, never re-processing a completed profile. A multi-day production run becomes a sequence of restartable shifts, not a single fragile transaction.
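
The resume behaviour described above can be sketched as follows, assuming each worker appends one JSON object per line to its own output file. The `conductor_id` field name and the `scrape` callable are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, Set


def completed_ids(output: Path) -> Set[str]:
    """Read this worker's own JSON Lines output and return finished profile IDs."""
    done: Set[str] = set()
    if not output.exists():
        return done
    with output.open(encoding="utf-8") as handle:
        for line in handle:
            try:
                done.add(json.loads(line)["conductor_id"])
            except (ValueError, KeyError, TypeError):
                continue  # a line truncated by a crash is simply redone
    return done


def run_worker(worklist: Iterable[str], output: Path,
               scrape: Callable[[str], Dict[str, Any]]) -> int:
    """Scrape every profile not already in the output; return how many ran."""
    done = completed_ids(output)
    # If a crash left a partial last line, start the next record on a fresh line.
    torn = output.exists() and not output.read_bytes().endswith(b"\n") \
        and output.stat().st_size > 0
    processed = 0
    with output.open("a", encoding="utf-8") as handle:
        if torn:
            handle.write("\n")
        for profile_id in worklist:
            if profile_id in done:
                continue
            handle.write(json.dumps(scrape(profile_id)) + "\n")
            handle.flush()  # keep the on-disk record close to real progress
            done.add(profile_id)
            processed += 1
    return processed
```

The output file is the only state. There is no separate checkpoint to corrupt, which is what lets a restart after any failure behave like a normal start.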

Scale and outcome in production

End-to-end across the conductor category.

Roughly ten thousand unique conductor profiles harvested from the authenticated directory. Nine parallel workers from three one-time passwords across three nodes. About two days of wall-clock time for the full category, with zero unhandled errors in production operation.

~10K unique conductor profiles harvested end to end
9 parallel workers from 3 OTPs across 3 nodes
~2 days wall-clock for the full conductor category

Per-profile time varies with career depth: newer conductors finish in under two minutes, while veterans with thirty active years take six to eight minutes. On average, two thirds of the available fields are populated per row, with the densest profiles approaching ninety-five percent. Round-robin slice splitting keeps every worker's workload statistically balanced regardless of how dense or sparse the underlying profiles are. Periodic memory cleanup avoids the slow Chrome memory creep that kills twelve-hour browser sessions. The launcher pre-cleans stale processes, manages logs, and tears down every child process cleanly on interrupt, so a botched run never strands resources.
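
Round-robin slice splitting is short enough to show in full. This is a minimal sketch; the function name is an assumption.

```python
from typing import List, Sequence


def split_round_robin(worklist: Sequence[str], workers: int) -> List[List[str]]:
    """Deal profile IDs to workers like cards: worker i gets items i, i+n, i+2n..."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return [list(worklist[i::workers]) for i in range(workers)]
```

Dealing items one at a time, rather than cutting the list into contiguous blocks, spreads any run of dense veteran profiles across all workers instead of loading it onto one.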

What this proves

Signed apps are scrapable without breaking the signer.

Letting the application's own front-end mint each request, and intercepting the response on the way back, inherits every security update the deployment ships - for free, with no signature-related maintenance over the project's lifetime. The alternative - tracking and re-implementing obfuscated front-end code every release - is a recurring cost most projects cannot absorb.

Cloned authenticated profiles fan out linearly until the platform pushes back, and many platforms do not. A single login from the client buys three workers per machine. Resume-by-identifier makes the topology elastic - bring a node online halfway through and it picks up the remaining work without coordination overhead.

Relational output beats flat dumps for downstream research. A parent-child schema with stable identifiers lets the consumer join across performances, productions, venues and agencies without re-parsing the source. Refreshes are safe and incremental: only profiles whose activity counter has changed since the last run need to be re-scraped.

Questions answered in this engagement

How this pipeline works in practice.

Why a browser and not a direct API call?

Operabase signs every data request from inside its own JavaScript layer, with a secret that rotates with each deployment. Maintaining a parallel signer would mean tracking and re-implementing obfuscated front-end code every time the platform ships a release. Letting the platform sign its own requests, and reading the responses on the way back, carries no signature-related maintenance cost over the project's lifetime.

How is the multi-worker setup authenticated?

The client provides one one-time password per machine, used once at setup. That session is then locally cloned to support three concurrent workers per host, all sharing the same identity. The platform does not enforce single-session use of the account, which we verified before scaling.

How is the deep career history recovered?

The platform surfaces the most recent fifteen seasons on first load. The scraper walks back from the oldest visible season, year by year, stopping at the first empty year. A configurable floor prevents runaway descent. The approach recovers the great majority of historical productions while keeping per-profile cost predictable.

What about GDPR and the personal information in the rows?

The dataset contains performer information that is visible on the public profile pages: agency contact, biography, sometimes a public birth date. Deliveries are structured for academic research, single-recipient, with the source platform attributed in every row. We do not publish the underlying dataset openly. The pipeline can be re-run to refresh whenever the client needs current data.

How quickly can you adapt to a different performer category?

Conductors share the same profile structure as singers, directors, choreographers and the rest of the performer roles on Operabase. Adapting the pipeline to a different category is a half-day of work: swap the directory entry point, re-run the listing harvester, re-split the worklist. Everything downstream - the profile scraper, the deduplication, the relational output - is profession-agnostic.

What does a refresh cadence look like?

The default delivery is a one-time export, but the pipeline is restartable by design. A monthly refresh can be wired as a scheduled job that only re-scrapes profiles that have changed since the last run, by comparing a lightweight activity counter exposed on each profile.
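
The change check behind such a refresh can be sketched as a comparison of two snapshots. The sketch assumes each snapshot maps a profile identifier to its activity counter; the function name and the dictionary shape are illustrative.

```python
from typing import Dict, List


def profiles_to_refresh(previous: Dict[str, int],
                        current: Dict[str, int]) -> List[str]:
    """Return IDs that are new or whose activity counter differs from the last run."""
    return sorted(
        profile_id
        for profile_id, counter in current.items()
        if previous.get(profile_id) != counter
    )
```

For example, with a previous snapshot of `{"a": 10, "b": 5}` and a current one of `{"a": 10, "b": 6, "c": 1}`, only `b` (changed) and `c` (new) would be queued for re-scraping.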

Contact

Need a relational research dataset behind authentication? Tell us about it.