Scrapingdome
Case study - County minutes

County meeting minutes turned into structured deal signals.

For a US real estate developer, CivicMine extracts the earliest entitlement signals (rezonings, conditional-use permits, TIF approvals, applicant-attorney pairings) from Planning Commission and Board minutes across six counties in Utah, Colorado, Wyoming, and Montana. The same five platform adapters scale to 40+ counties at near-zero marginal cost.

Problem

The signal is public; the format is unreadable at scale.

Distress acquisitions, ground-up development, and off-market deals are won by the team that sees the entitlement signal first. Those signals are public; they sit inside Planning Commission and Board of Commissioners minutes. But they are unreadable at scale for three compounding reasons.

First, every county runs a different CMS: CivicPlus, Granicus, CivicClerk, Revize, Concrete, and a long tail of in-house portals. Second, roughly half of the meaningful documents are scanned PDFs from the 1990s and 2000s, with no selectable text and no metadata. Third, the signal-to-noise ratio is brutal: the minutes of a single Planning Commission meeting can run 80 pages, and the only relevant sentence may be "applicant requests rezoning of parcel 12-345 from A-1 to R-2 with density bonus."

Reading six counties of monthly minutes manually is a full-time analyst job. Reading 40 counties is four full-time analyst jobs. The client wanted the signal, not the headcount.

Approach

Four layers built so the cost curve flattens after six counties.

01

Platform-adapter layer

Five adapters covering the surface area the client cared about: CivicPlus AgendaCenter, CivicClerk OData API, Granicus HTML viewer, Revize CMS, and Concrete CMS. Adding a county that runs on a platform we already adapted is a config entry, not an engineering project. One adapter, dozens of counties.
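A minimal sketch of what "a config entry, not an engineering project" can look like, assuming a registry keyed by platform name. The `County` fields, the `register` decorator, the stub adapters, and the placeholder URLs are all hypothetical and only illustrate the shape; they are not CivicMine's actual code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class County:
    name: str
    platform: str  # key into ADAPTERS, e.g. "civicplus"
    base_url: str  # portal root (placeholder values below)


# An adapter takes a county config and returns the minutes URLs it can find.
Adapter = Callable[[County], List[str]]
ADAPTERS: Dict[str, Adapter] = {}


def register(platform: str) -> Callable[[Adapter], Adapter]:
    """Decorator that files an adapter under its platform key."""
    def wrap(fn: Adapter) -> Adapter:
        ADAPTERS[platform] = fn
        return fn
    return wrap


@register("civicplus")
def civicplus(county: County) -> List[str]:
    # Stub: a real adapter would crawl the portal's listing pages here.
    return [f"{county.base_url}/civicplus-listing-placeholder"]


@register("granicus")
def granicus(county: County) -> List[str]:
    # Stub with a placeholder path, not the platform's real URL scheme.
    return [f"{county.base_url}/granicus-listing-placeholder"]


def list_minutes(county: County) -> List[str]:
    """Dispatch to the adapter for the county's platform."""
    if county.platform not in ADAPTERS:
        raise ValueError(f"no adapter for platform {county.platform!r}")
    return ADAPTERS[county.platform](county)


# Adding a county on a known platform is one more entry in this list.
COUNTIES = [
    County("Example County A", "civicplus", "https://a.example"),
    County("Example County B", "granicus", "https://b.example"),
    County("Example County C", "civicplus", "https://c.example"),
]
```

In this sketch, Example County C reuses the CivicPlus adapter without any new code, while an unknown platform fails loudly instead of being skipped.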

02

Three-tier text extraction

pdfplumber for clean native PDFs, PyMuPDF as the structural fallback, Tesseract OCR for scanned documents (applied automatically when the first two tiers return empty text). Extracted text is cached to disk per document, so re-runs are instant and only new meetings hit the network.
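The fallback chain and the disk cache could be sketched roughly as follows. This is an illustration under assumptions, not the production code: the cache key (a SHA-256 of the file bytes), the function names, and the 300 dpi OCR setting are choices made for the example. The third-party libraries are imported lazily so each tier is only loaded when it is actually reached.

```python
import hashlib
from pathlib import Path
from typing import Callable, Iterable

Extractor = Callable[[Path], str]


def extract_pdfplumber(pdf: Path) -> str:
    import pdfplumber  # tier 1: clean native PDFs
    with pdfplumber.open(str(pdf)) as doc:
        return "\n".join(page.extract_text() or "" for page in doc.pages)


def extract_pymupdf(pdf: Path) -> str:
    import fitz  # PyMuPDF, tier 2: structural fallback
    with fitz.open(str(pdf)) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_ocr(pdf: Path) -> str:
    import fitz
    import pytesseract  # tier 3: Tesseract OCR on rendered pages
    from PIL import Image
    parts = []
    with fitz.open(str(pdf)) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=300)  # dpi is an assumed setting
            img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
            parts.append(pytesseract.image_to_string(img))
    return "\n".join(parts)


DEFAULT_TIERS = (extract_pdfplumber, extract_pymupdf, extract_ocr)


def extract_text(pdf: Path, cache_dir: Path,
                 tiers: Iterable[Extractor] = DEFAULT_TIERS) -> str:
    """Return cached text if present, else try each tier until one yields text."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(pdf.read_bytes()).hexdigest()
    cached = cache_dir / f"{key}.txt"
    if cached.exists():
        return cached.read_text(encoding="utf-8")
    text = ""
    for tier in tiers:
        text = tier(pdf).strip()
        if text:  # first tier that returns non-empty text wins
            break
    if text:  # only cache successes, so failed documents are retried
        cached.write_text(text, encoding="utf-8")
    return text
```

Keying the cache on file content rather than URL is one possible design choice: a document that is re-uploaded unchanged under a new URL would still hit the cache.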

03

Signal-matching engine

78 keywords across 7 categories (entitlements and zoning, development types, financing, project issues, asset types, scale indicators, stakeholders), plus 6 compound rules that fire only when two terms co-occur in context, like subdivision near approval or tax increment near financing. Categories and keywords live in a YAML config the client owns and edits; new verticals do not require code changes.
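To make the keyword and compound-rule idea concrete, here is a small sketch. The `CONFIG` dict stands in for what the YAML file might load into; its categories, keywords, rule names, and the 100-character window are illustrative assumptions and only a tiny subset, not the client's actual taxonomy.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical config, shaped like the parsed YAML might be.
CONFIG = {
    "keywords": {
        "entitlements_zoning": ["rezoning", "conditional use"],
        "financing": ["tax increment"],
    },
    "compound_rules": [
        {"name": "subdivision_approval",
         "terms": ["subdivision", "approval"], "window": 100},
        {"name": "tif_financing",
         "terms": ["tax increment", "financing"], "window": 100},
    ],
}


@dataclass(frozen=True)
class Hit:
    category: str
    keyword: str
    start: int  # character offset of the match in the document text


def _positions(text: str, term: str) -> List[int]:
    # Case-insensitive match anchored at a word start, so "approval"
    # also matches "approvals".
    pattern = r"\b" + re.escape(term)
    return [m.start() for m in re.finditer(pattern, text, re.IGNORECASE)]


def find_hits(text: str, config: dict = CONFIG) -> List[Hit]:
    hits: List[Hit] = []
    # Plain keywords fire on every occurrence.
    for category, keywords in config["keywords"].items():
        for kw in keywords:
            hits.extend(Hit(category, kw, pos) for pos in _positions(text, kw))
    # Compound rules fire only when both terms occur within the window.
    for rule in config["compound_rules"]:
        first, second = rule["terms"]
        second_positions = _positions(text, second)
        for pos in _positions(text, first):
            if any(abs(pos - other) <= rule["window"] for other in second_positions):
                hits.append(Hit("compound", rule["name"], pos))
    return sorted(hits, key=lambda h: h.start)
```

In this sketch "subdivision" on its own produces nothing, because it is not a standalone keyword; it only becomes a signal when "approval" appears nearby.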

04

Entity extraction with role inference

spaCy NER plus a regex pass for LLC, LP, and Inc patterns, with role inference for applicants, attorneys, developers, and spokespersons. A substring stoplist filters out government bodies, meeting boilerplate, addresses, and titles (the noise that breaks every naive NER pipeline against government documents). NER only runs on documents that already had a keyword hit, which keeps compute cost negligible.
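The regex pass, the substring stoplist, and a simple form of role inference can be sketched with the standard library alone. The spaCy step is left out here. The pattern, the stoplist entries, the cue words, and the 60-character look-back window are assumptions made for illustration, not the production rules.

```python
import re
from typing import List, Tuple

# One to five capitalized tokens followed by a corporate suffix.
ENTITY_RE = re.compile(r"\b(?:[A-Z][A-Za-z0-9&'-]*,?\s+){1,5}(?:LLC|LP|Inc)\b")

# Hypothetical stoplist: any candidate containing one of these is dropped.
STOPLIST = ("planning commission", "board of", "county", "city of", "department")

# Hypothetical cue words mapped to the role they suggest.
ROLE_CUES = {
    "applicant": "applicant",
    "attorney": "attorney",
    "counsel": "attorney",
    "developer": "developer",
    "spokesperson": "spokesperson",
}


def infer_role(text: str, start: int, window: int = 60) -> str:
    """Pick the role whose cue word appears closest before the entity."""
    before = text[max(0, start - window):start].lower()
    best_role, best_pos = "unknown", -1
    for cue, role in ROLE_CUES.items():
        pos = before.rfind(cue)
        if pos > best_pos:
            best_role, best_pos = role, pos
    return best_role


def extract_entities(text: str) -> List[Tuple[str, str]]:
    """Return (entity, role) pairs that survive the stoplist."""
    results = []
    for match in ENTITY_RE.finditer(text):
        name = match.group(0)
        if any(stop in name.lower() for stop in STOPLIST):
            continue  # government body or boilerplate, not a deal party
        results.append((name, infer_role(text, match.start())))
    return results
```

For example, on the sentence "The applicant, Red Rock Holdings LLC, was represented by attorney Jane Doe of Summit Legal Partners LP." (invented names), this sketch tags the first entity as an applicant and the second as an attorney.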

Scale and outcome (in production)

Six counties processed in five business days; 40+ addressable on the same adapters.

Output is per-county CSVs plus a combined CSV and a summary CSV. Every row carries meeting date, document URL, matched keyword, category, and a verbatim 500-character context snippet. Every signal is auditable back to the original PDF page in one click.
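A sketch of how one output row might be assembled. The column names, the centered snippet window, and the `#page=N` URL fragment used for the page link are assumptions for the example; the source only states which fields each row carries.

```python
import csv
import io
from typing import Dict, Iterable, TextIO

FIELDS = ["county", "meeting_date", "document_url", "category", "keyword", "context"]


def context_snippet(text: str, start: int, end: int, width: int = 500) -> str:
    """Return up to `width` verbatim characters roughly centered on the match."""
    pad = max(0, width - (end - start)) // 2
    lo = max(0, start - pad)
    hi = min(len(text), lo + width)
    lo = max(0, hi - width)  # slide the window back if we hit the end of the text
    return text[lo:hi]


def make_row(county: str, meeting_date: str, pdf_url: str, page: int,
             category: str, keyword: str, text: str, start: int) -> Dict[str, str]:
    return {
        "county": county,
        "meeting_date": meeting_date,
        # Assumed convention: a "#page=N" fragment opens the PDF at that page.
        "document_url": f"{pdf_url}#page={page}",
        "category": category,
        "keyword": keyword,
        "context": context_snippet(text, start, start + len(keyword)),
    }


def write_rows(rows: Iterable[Dict[str, str]], out: TextIO) -> None:
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    doc = "Applicant requests rezoning of parcel 12-345 from A-1 to R-2."
    row = make_row("Example County", "2024-05-07", "https://a.example/minutes.pdf",
                   3, "entitlements_zoning", "rezoning", doc, doc.index("rezoning"))
    buf = io.StringIO()
    write_rows([row], buf)
    print(buf.getvalue())
```

Keeping the snippet verbatim (no whitespace cleanup) means an analyst can paste it into the PDF's search box and land on the same sentence.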

6 counties live across UT, CO, WY, MT on 5 CMS platforms
1,453 meetings processed over a 39-month window
3,797 keyword hits flagged across 1,071 documents

POC build time: 5 business days. Incremental re-run latency per county: minutes. Counties addressable on the existing adapters: 40+. The client's analyst time shifts from reading PDFs to acting on signals; each row is a pre-qualified lead with a citation.

What this proves

Coverage that compounds, not coverage that costs.

Six counties was the POC. The same adapters cover the long tail of US local government: CivicPlus alone powers thousands of municipal portals. Going from 6 counties to 40 is a configuration project, not a build project. Compute is CPU-bound and cached, so re-runs at scale are measured in minutes.

Audit trail is the default. Every signal links back to the verbatim source sentence and the original PDF page. There is no black box, no AI hallucination, no "trust us". Investors, partners, and legal can verify any row in seconds. Keywords, categories, compound rules, and the county list are externalized to YAML; the client expands the signal taxonomy without waiting on a developer.

Phase 2 is monitoring at scale on the same architecture: 34+ additional counties onboarded by platform classification and config, scheduled monitoring with delta delivery, built-in deduplication via per-county state tracking, and a notification layer (email, Slack, webhook) that delivers signals where the deal team already works.
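Delta delivery with per-county state tracking could work along these lines. This is a sketch of one possible approach for Phase 2, assuming a JSON file per county that records document URLs already delivered; the file layout and function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Iterable, List, Set


def load_seen(state_file: Path) -> Set[str]:
    """Read the set of already-delivered document URLs, if any."""
    if state_file.exists():
        return set(json.loads(state_file.read_text(encoding="utf-8")))
    return set()


def new_documents(county: str, urls: Iterable[str], state_dir: Path) -> List[str]:
    """Return only URLs not delivered before, and record them as seen."""
    state_dir.mkdir(parents=True, exist_ok=True)
    state_file = state_dir / f"{county}.json"
    seen = load_seen(state_file)
    # dict.fromkeys removes duplicates within this run while keeping order.
    fresh = [u for u in dict.fromkeys(urls) if u not in seen]
    state_file.write_text(json.dumps(sorted(seen | set(fresh))), encoding="utf-8")
    return fresh
```

Each scheduled run would pass the full listing for a county and forward only the returned delta to the notification layer, so a meeting is announced once even if the portal lists it on every crawl.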

Questions answered in this engagement

How this pipeline works in practice.

How fast can you add a new county?

If the county runs on a platform we have already adapted (CivicPlus, CivicClerk, Granicus, Revize, Concrete), under an hour: a configuration entry plus a short verification run. A new platform is typically one engineering day. The text extraction, keyword engine, NER, and CSV layers are unchanged.

What about counties that only publish recent minutes online?

The pipeline detects and flags the historical horizon for each county explicitly in the summary CSV. The client always knows how far back the data goes. Counties that publish nothing online get escalated as a sourcing decision; nothing is silently dropped.
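As an illustration of how the horizon flag might be computed, the sketch below derives the earliest and latest meeting date per configured county and marks counties with no documents instead of omitting them. The column names and the `needs_sourcing` status label are assumptions, not the actual summary CSV schema.

```python
from datetime import date
from typing import Dict, Iterable, List, Tuple


def horizon_summary(counties: Iterable[str],
                    meetings: Iterable[Tuple[str, date]]) -> List[Dict[str, object]]:
    """One summary row per configured county, including empty ones."""
    by_county: Dict[str, List[date]] = {c: [] for c in counties}
    for county, meeting_date in meetings:
        by_county.setdefault(county, []).append(meeting_date)

    rows = []
    for county, dates in by_county.items():
        if dates:
            rows.append({
                "county": county,
                "earliest": min(dates).isoformat(),  # the historical horizon
                "latest": max(dates).isoformat(),
                "meetings": len(dates),
                "status": "ok",
            })
        else:
            # Nothing published online: surface it rather than drop it.
            rows.append({"county": county, "earliest": "", "latest": "",
                         "meetings": 0, "status": "needs_sourcing"})
    return rows
```

Starting from the configured county list, rather than from the documents found, is what guarantees a row for every county in this sketch.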

How accurate is the entity extraction?

High precision, tuned recall. The stoplist-filtered NER plus the LLC, LP, and Inc regex pass catches the entities the client actually cares about (applicants, developers, attorneys). The 500-character context snippet on every row means human verification takes seconds, not minutes. No row is delivered without a citation.

Can the keywords change as the strategy evolves?

Yes. The keyword config is a YAML file the client owns. New keywords, new categories, new compound rules all go live on the next run. No code change, no redeploy.

What does it cost to run at 40 counties?

Compute is negligible (extraction and NER are CPU-bound and cached). The operational cost is bandwidth and storage, both trivial at this volume. The pricing model is one-time onboarding per county plus a flat monthly maintenance retainer.

Contact

Need a public-records signal pipeline for your geography? Tell us about it.