Scrapingdome
Case study - France residential AVM

Four open data sources, three submodels, one national French AVM.

For a European residential proptech client, Scrapingdome built an end-to-end automated valuation model (AVM) covering the whole of France on open public data. Four sources (geo-DVF transactions, the Notaires-INSEE IPL price index, the Base Adresse Nationale geocoder and BDNB building characteristics) feed three reconciled submodels, with auditable confidence bands and no commercial-data licence cost.

Problem

The signal is public; nobody packages it end to end.

A B2C valuation web product needs three things working together: a clean transaction record, a price index that tracks the market, and a geocoder precise enough to anchor a user-typed address on the right building. In France all three exist as open data, but none of them ships as a ready-to-consume valuation engine.

geo-DVF publishes every residential transaction in metropolitan France since 2014, but each yearly file is 50 to 120 MB compressed, with multi-row mutations, mixed property types, anomalous coordinates and ad-hoc string fields. Five years of national history is 17.5 million raw rows that have to be normalised down to ~4.5 million usable mutations before any model can read them.

Notaires-INSEE IPL is the official residential price index, but it is published as 12 separate SDMX series (national flats and houses, IDF region, IDF departments with apartment-only sub-series, Paris flats only). Mapping a user-typed commune to the right index requires a cascade with documented fallbacks, not a single lookup.
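A minimal sketch of such a cascade, assuming the commune is identified by its INSEE code and that its first two characters give the department. The series labels (paris_flats, idf_dept_92_flats and so on) are hypothetical stand-ins for the real SDMX series identifiers, and the existence of a houses series at IDF-region level is an assumption.

```python
# Ile-de-France departments; the series labels below are hypothetical.
IDF_DEPTS = {"75", "77", "78", "91", "92", "93", "94", "95"}


def ipl_ladder(insee_commune: str, is_flat: bool) -> list:
    """Return candidate series labels, most specific first."""
    dept = insee_commune[:2]
    kind = "flats" if is_flat else "houses"
    ladder = []
    if is_flat and dept == "75":
        ladder.append("paris_flats")            # Paris series is flats only
    if is_flat and dept in IDF_DEPTS:
        ladder.append(f"idf_dept_{dept}_flats")  # department level, apartments only
    if dept in IDF_DEPTS:
        ladder.append(f"idf_region_{kind}")
    ladder.append(f"national_{kind}")            # always-available last resort
    return ladder


def resolve_series(insee_commune: str, is_flat: bool, available: set) -> str:
    """Pick the first rung of the ladder for which a series is actually loaded."""
    for label in ipl_ladder(insee_commune, is_flat):
        if label in available:
            return label
    raise LookupError(f"no IPL series available for commune {insee_commune}")
```

Returning the chosen label, rather than only the index value, is what lets the result card state which series and fallback level were used.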

BAN (Base Adresse Nationale) is the official open geocoder, but its response includes street-level matches that are not safe to use as a building anchor, so the pipeline has to filter on confidence and on match type before trusting the coordinates.
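The filter can be sketched as follows, assuming the BAN search response is a GeoJSON FeatureCollection whose features carry a score and a type in their properties. The 0.6 threshold and the function name building_anchor are illustrative assumptions, not values taken from the production pipeline.

```python
from typing import Optional

MIN_SCORE = 0.6  # assumed threshold, not a value from the source


def building_anchor(response: dict, min_score: float = MIN_SCORE) -> Optional[tuple]:
    """Return (lat, lon) of the best housenumber-level match, or None.

    Street-level and lower-precision matches are rejected because they do
    not pin the address to a single building.
    """
    candidates = [
        f for f in response.get("features", [])
        if f.get("properties", {}).get("type") == "housenumber"
        and f.get("properties", {}).get("score", 0.0) >= min_score
    ]
    if not candidates:
        return None
    best = max(candidates, key=lambda f: f["properties"]["score"])
    lon, lat = best["geometry"]["coordinates"]  # GeoJSON order is (lon, lat)
    return (lat, lon)
```

A None result signals the caller to try another strategy instead of silently valuing the wrong building.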

BDNB, the national building characteristics database from CSTB, adds DPE energy ratings, year of construction and dwelling counts, but ships as a 35 to 45 GB bulk extract that has to be spatially joined onto the DVF parcels before any of its attributes become usable.

There is no off-the-shelf European open-data AVM that bundles those four sources into an address-to-valuation flow.

Approach

Built so the model is auditable and the architecture is reusable.

01

Country feasibility audit

Six European countries audited against eight dimensions each (official transaction source, granularity, licence, latency, coverage, portal dependency, public-AVM benchmarks, build risk): France, Spain, Portugal, Ireland, Italy, Netherlands. Output: a comparison table, per-country profiles, a market-benchmark synthesis, and a primary plus alternative recommendation with a decision tree. The audit isolates the countries where a transparent open-data AVM is feasible without commercial data and the ones where a hybrid asking-price proxy is required.

02

Transaction and index data layer

DVF ingest pipeline that pulls Cerema's latest snapshot for 2021-2025, applies sanity bounds (price 10k to 10M EUR, surface 8 to 500 m2, rooms <=12, EUR/m2 between 200 and 50,000), drops mixed-type mutations, aggregates multi-row mutations to one row, derives an INSEE-aligned schema and writes a single 158 MB Parquet snapshot. Notaires-INSEE IPL cascade implementing the documented fallback chain Paris-arr -> IDF-dept (apartments only) -> IDF-region -> national, with twelve IPL series wired in and series history backfilled to 1992-Q1.
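The cleaning rules above can be sketched in plain Python over dictionaries keyed by geo-DVF column names. The bounds are the ones quoted above; the aggregation choices (summing surfaces and rooms across the rows of one mutation, reading the price once, ignoring outbuilding rows) are assumptions made for illustration, as is the function name clean_mutations.

```python
from collections import defaultdict

PRICE_EUR = (10_000, 10_000_000)
SURFACE_M2 = (8, 500)
EUR_PER_M2 = (200, 50_000)
MAX_ROOMS = 12
RESIDENTIAL = {"Appartement", "Maison"}


def clean_mutations(rows):
    """Collapse multi-row mutations to one row each and apply sanity bounds."""
    groups = defaultdict(list)
    for row in rows:
        # Assumption: untyped rows and outbuildings ("Dépendance") are ignored.
        if row.get("type_local") and row["type_local"] != "Dépendance":
            groups[row["id_mutation"]].append(row)

    cleaned = []
    for id_mutation, lots in groups.items():
        types = {lot["type_local"] for lot in lots}
        if len(types) != 1 or not types <= RESIDENTIAL:
            continue  # mixed-type or non-residential mutation
        # Assumption: valeur_fonciere repeats the mutation total on every row.
        price = float(lots[0]["valeur_fonciere"])
        surface = sum(float(lot["surface_reelle_bati"]) for lot in lots)
        rooms = sum(int(lot["nombre_pieces_principales"]) for lot in lots)
        if not PRICE_EUR[0] <= price <= PRICE_EUR[1]:
            continue
        if not SURFACE_M2[0] <= surface <= SURFACE_M2[1]:
            continue
        if rooms > MAX_ROOMS:
            continue
        eur_m2 = price / surface
        if not EUR_PER_M2[0] <= eur_m2 <= EUR_PER_M2[1]:
            continue
        cleaned.append({
            "id_mutation": id_mutation,
            "type_local": next(iter(types)),
            "price": price,
            "surface": surface,
            "rooms": rooms,
            "eur_m2": eur_m2,
        })
    return cleaned
```

At national scale the same rules would normally run as vectorised dataframe operations; the dictionary version only makes the order of the checks explicit.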

03

BAN geocoder with cache and direct-match resilience

A SQLite-cached BAN client that captures the housenumber and street fields separately so the pipeline can fall back to a direct field match on the DVF parquet (housenumber plus normalised street plus INSEE commune) when the geocoded coordinate is not precise enough to anchor a building. The client applies an accepted-type filter (housenumber or street) and a minimum-score filter, retries with backoff on transient errors, and reuses one HTTP client per process. The cache turns repeated address queries into sub-millisecond lookups.
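A minimal sketch of the cache-plus-retry pattern, using only the standard library. The class name CachedGeocoder, the table layout, and the choice of ConnectionError as the transient error are assumptions; the network call is injected as a fetch function so the sketch stays independent of any particular HTTP client.

```python
import json
import sqlite3
import time
from typing import Callable


class CachedGeocoder:
    """Wrap a geocoding call with a SQLite cache and retry-with-backoff."""

    def __init__(self, fetch: Callable[[str], dict], db_path: str = ":memory:",
                 retries: int = 3, backoff_s: float = 0.5):
        self._fetch = fetch
        self._retries = max(1, retries)
        self._backoff_s = backoff_s
        self._db = sqlite3.connect(db_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS geocode "
            "(query TEXT PRIMARY KEY, payload TEXT NOT NULL)"
        )

    def lookup(self, address: str) -> dict:
        key = " ".join(address.lower().split())  # normalise case and whitespace
        row = self._db.execute(
            "SELECT payload FROM geocode WHERE query = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # cache hit: no network call
        payload = self._fetch_with_retry(address)
        self._db.execute(
            "INSERT OR REPLACE INTO geocode (query, payload) VALUES (?, ?)",
            (key, json.dumps(payload)),
        )
        self._db.commit()
        return payload

    def _fetch_with_retry(self, address: str) -> dict:
        for attempt in range(self._retries):
            try:
                return self._fetch(address)
            except ConnectionError:
                if attempt == self._retries - 1:
                    raise
                time.sleep(self._backoff_s * 2 ** attempt)  # exponential backoff
```

Storing the whole payload, including the separate housenumber and street fields, is what keeps the direct-match fallback possible on a cache hit.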

04

Three-submodel valuation engine

The valuation core is three independent estimators reconciled by inverse-variance weighting. Indexation computes a commune-typical EUR/m2 over the last 24 months and projects it to the target date via the IPL cascade. Comparables uses a BallTree haversine k-NN over the cleaned DVF parquet with a two-pool union (a 24-month radius cascade for fresh market signal plus a tight no-date-floor local pool for at-address sales) and composite re-ranking on distance, surface mismatch and recency. Parcel history resolves the cadastral id_parcelle from the geocode, pulls all prior transactions on the same parcelle, and projects the most recent via IPL plus a surface adjustment. The confidence band exposed to the consumer is the union of the active submodel bands - intentionally wide when the signals disagree, transparent rather than misleadingly tight.
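The reconciliation step can be sketched as below, assuming each submodel reports a point estimate and a standard deviation. Weights are proportional to 1/std squared, and the consumer-facing band is the union of the individual bands. The z value of 1.96 and all names (Estimate, reconcile) are illustrative assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class Estimate:
    name: str
    value: float  # point estimate in EUR
    std: float    # standard deviation in EUR


def reconcile(estimates, z: float = 1.96) -> dict:
    """Inverse-variance weighted value with a union-of-bands interval."""
    active = [e for e in estimates if e is not None and e.std > 0]
    if not active:
        raise ValueError("no active submodel")
    raw = [1.0 / e.std ** 2 for e in active]
    total = sum(raw)
    weights = {e.name: w / total for e, w in zip(active, raw)}
    value = sum(weights[e.name] * e.value for e in active)
    return {
        "value": value,
        "std": math.sqrt(1.0 / total),  # std of the weighted mean
        # Union of bands: from the lowest lower bound to the highest upper bound.
        "ci_low": min(e.value - z * e.std for e in active),
        "ci_high": max(e.value + z * e.std for e in active),
        "weights": weights,
    }
```

Because the interval is taken from the extremes of the individual bands, it widens whenever the submodels disagree, even if each one is individually confident.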

05

Enrichment, agents and trends

BDNB national bulk integrated as per-departement Parquet extracts, spatially joined to the DVF parcels for DPE, year of construction, dwelling count and net floor area. Scraper modules for SeLoger, Bien'ici and LeBonCoin Immobilier with agency-attributed extraction, anti-bot rotation, and aggregation per INSEE commune (n_listings, n_flats, n_houses, avg EUR/m2 per type, relevance threshold of 10+ listings). Sold-data trends pipeline producing rolling-12-month heatmaps from DVF by commune and type. Active-listing trends pipeline writing monthly snapshots into a time-series store. FastAPI wrapper exposing valuate(address), find_top_agents(commune) and trends(commune, window) as HTTP endpoints.
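The per-commune listing aggregation can be sketched as follows. The output keys n_listings, n_flats and n_houses and the 10-listing relevance threshold come from the description above; the input record shape (insee, type, price, surface) and the average-price key names are assumptions.

```python
from collections import defaultdict
from statistics import mean

MIN_LISTINGS = 10  # relevance threshold quoted above


def _avg_eur_m2(items):
    """Mean asking price per square metre, or None for an empty group."""
    return mean(i["price"] / i["surface"] for i in items) if items else None


def aggregate_listings(listings):
    """Group scraped listings by INSEE commune and summarise each group."""
    by_commune = defaultdict(list)
    for item in listings:
        by_commune[item["insee"]].append(item)

    summary = {}
    for insee, items in by_commune.items():
        if len(items) < MIN_LISTINGS:
            continue  # too few listings to be a meaningful signal
        flats = [i for i in items if i["type"] == "flat"]
        houses = [i for i in items if i["type"] == "house"]
        summary[insee] = {
            "n_listings": len(items),
            "n_flats": len(flats),
            "n_houses": len(houses),
            "avg_eur_m2_flat": _avg_eur_m2(flats),
            "avg_eur_m2_house": _avg_eur_m2(houses),
        }
    return summary
```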

Scale and outcome (in production)

End-to-end address-to-valuation across the entire country.

4.5 million cleaned DVF mutations normalised from 17.5M raw rows. 30,000+ communes addressable via the IPL cascade and the INSEE COG. A stratified backtest by quintile, type and urban/rural slicing reports an urban-apartment MdAPE of 14.5 percent before BDNB enrichment and 12.0 to 12.5 percent after BDNB integration.

4.5M cleaned DVF mutations from 17.5M raw rows
30K+ communes addressable nationwide via the IPL cascade
12.0-12.5% urban-apartment MdAPE after BDNB enrichment
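For reference, MdAPE is the median absolute percentage error between predicted and realised sale prices. A minimal sketch of the metric and of a stratified report follows; the record shape (actual, predicted, plus a stratum key) and the function names are assumptions, not the backtest harness itself.

```python
from collections import defaultdict
from statistics import median


def mdape(actual, predicted) -> float:
    """Median absolute percentage error, in percent."""
    errors = [abs(p - a) / a for a, p in zip(actual, predicted) if a > 0]
    if not errors:
        raise ValueError("no valid pairs")
    return 100.0 * median(errors)


def stratified_mdape(rows, key: str) -> dict:
    """MdAPE per stratum, e.g. key='type' or key='urban_rural'."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return {
        stratum: mdape([r["actual"] for r in items],
                       [r["predicted"] for r in items])
        for stratum, items in groups.items()
    }
```

The median is less sensitive than the mean to a handful of badly mispriced outliers, which is why it is the usual headline metric for AVM backtests.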

The BallTree is pre-built per type_local. A cold first address-to-valuation call takes ~21 seconds (Parquet load plus tree warm-up); subsequent calls take 1 to 3 seconds. BAN geocodes are cached locally, so repeat lookups are sub-millisecond. A full national ingest runs end to end in ~3 minutes on a workstation. The valuate(address) endpoint returns a Valuation object with value, std (the plus-or-minus spread), ci_low, ci_high, a per-submodel breakdown, weights, nearest comparables, previous sales at this exact address, the parcel-history anchor, and inference flags when type or surface were inferred.
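One possible shape for that response, sketched as dataclasses. Only value, std, ci_low and ci_high are field names given above; every other field name and type here is an assumption chosen to mirror the list of contents.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class SubmodelResult:
    value: float
    ci_low: float
    ci_high: float
    weight: float  # share of this submodel in the reconciled value


@dataclass
class Valuation:
    value: float
    std: float
    ci_low: float
    ci_high: float
    submodels: dict = field(default_factory=dict)       # name -> SubmodelResult
    comparables: list = field(default_factory=list)     # nearest comparables
    previous_sales: list = field(default_factory=list)  # sales at this exact address
    parcel_anchor: Optional[dict] = None                 # sale driving parcel history
    inferred_type: bool = False     # True when property type was inferred
    inferred_surface: bool = False  # True when surface was inferred


def to_payload(valuation: Valuation) -> dict:
    """Convert to plain dicts and lists, ready for JSON serialisation."""
    return asdict(valuation)
```

Keeping the per-submodel results inside the response, rather than only the reconciled headline, is what makes the breakdown auditable from the API output alone.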

What this proves

Open data, transparent models, reusable architecture.

Every number on the result card links back to a source the user can audit: the IPL series and period, the cleaned DVF row, the cadastral parcelle. The reconciliation table shows each submodel's estimate, its CI and its weight, so a user disagreeing with the headline can see exactly which signal pulled which way. No black box, no opaque ensemble, no claim of accuracy beyond what the open data supports.

The pipeline is built so the next country slots into the same shape. The country audit already mapped which sources are open, which are paywalled (Kadaster NL, Registradores ES), and where a portal proxy (Idealista, Imovirtual) has to substitute for a missing transaction record. Spain, Portugal and the Netherlands inherit the same submodel architecture; only the data adapters change.

Every threshold, cohort size, surface alpha and weighting rule is a config value, not an engineering choice. The 24-month look-back, the 100 m local pool radius, the per-parcelle cap, the IPL fallback ladder, the inference radius - all live in the model code as named constants the client can adjust without a redeploy.
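A hedged sketch of what such a configuration layer could look like. The 24-month look-back, the 100 m radius, the sanity bounds and the fallback ladder are the values stated in this case study; the class name, field names and JSON override mechanism are assumptions, and the per-parcelle cap and inference radius are left out because their values are not given here.

```python
import json
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ModelConfig:
    lookback_months: int = 24
    local_pool_radius_m: float = 100.0
    ipl_fallback_ladder: tuple = ("paris_arr", "idf_dept", "idf_region", "national")
    price_bounds_eur: tuple = (10_000, 10_000_000)
    surface_bounds_m2: tuple = (8, 500)


def load_config(overrides_json: str = "{}") -> ModelConfig:
    """Apply client overrides on top of the defaults, rejecting unknown keys."""
    overrides = json.loads(overrides_json)
    known = {f.name for f in fields(ModelConfig)}
    unknown = set(overrides) - known
    if unknown:
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    # JSON has no tuple type, so lists are converted back to tuples.
    normalised = {k: tuple(v) if isinstance(v, list) else v
                  for k, v in overrides.items()}
    return replace(ModelConfig(), **normalised)
```

Rejecting unknown keys keeps a typo in an override file from silently leaving a default in place.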

Questions answered in this engagement

How this pipeline works in practice.

How fast can you add a new country?

The data adapters are the only country-specific layer. For Ireland (PPR open transactions plus CSO RPPI hedonic index) the same pipeline reaches a first valuation in roughly five engineering days. For Spain a hybrid approach (Catastro open plus Idealista asking-price proxy) sits in the 8 to 12 day range. The submodel reconciliation, the geocoder cache, the result card and the API wrapper are reused verbatim; only the transaction adapter, the price-index mapping and the geocoder choice change.
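The split between shared and country-specific code can be illustrated with a small interface sketch. Everything here (CountryAdapter, Transaction, projected_eur_m2, ToyAdapter) is a hypothetical naming for illustration, not the production interface.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional, Protocol


@dataclass(frozen=True)
class Transaction:
    area_code: str
    property_type: str
    sale_date: date
    price_eur: float
    surface_m2: float


class CountryAdapter(Protocol):
    """The three country-specific pieces: transactions, index, geocoder."""

    def transactions(self) -> Iterable[Transaction]: ...

    def index_ratio(self, area_code: str, property_type: str,
                    from_date: date, to_date: date) -> float: ...

    def geocode(self, address: str) -> Optional[tuple]: ...


def projected_eur_m2(adapter: CountryAdapter, tx: Transaction, target: date) -> float:
    """Shared logic: project a past sale to the target date via the price index."""
    ratio = adapter.index_ratio(tx.area_code, tx.property_type, tx.sale_date, target)
    return tx.price_eur / tx.surface_m2 * ratio


class ToyAdapter:
    """Stand-in adapter with a flat 10 percent index rise, for illustration only."""

    def transactions(self):
        return [Transaction("D01", "flat", date(2023, 1, 15), 200_000.0, 50.0)]

    def index_ratio(self, area_code, property_type, from_date, to_date):
        return 1.10

    def geocode(self, address):
        return None
```

The shared function never touches a French-specific field, so a new adapter is enough to exercise it against another country's data.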

What is the data licence position?

DVF, IPL, BAN, BDNB and the INSEE COG are all open under Licence Ouverte v2.0 or compatible terms, with attribution. The valuation engine carries no commercial-data licence cost. Scraped portal data from SeLoger, Bien'ici and LeBonCoin Immobilier is held to the same proxy and throttle discipline as any external collection and is stored offline with no re-distribution.

How auditable is each valuation?

Every submodel exposes its inputs and its confidence interval. The result card surfaces a per-submodel breakdown with weights, the anchor sale that drives the parcel-history estimator, the IPL period used for the time projection, and inference notes when type or surface were inferred from a neighbouring transaction. A reviewer can re-derive any headline number from the source files in the bundle without running the model.

How is B2C re-identification handled?

Individual DVF transactions never appear directly in consumer-facing output. The comparables panel surfaces aggregated EUR/m2 distributions with up to three anonymised neighbours by composite distance. The previous-sales-at-exact-address section surfaces the user's own building history, which is already public-record under the French open-data rules for DVF, without exposing third-party transactions on neighbouring buildings.

What does refresh cadence look like?

DVF refreshes roughly quarterly with a 6 to 12 month lag on the most recent year; the result card flags this explicitly so consumers do not expect today's market. IPL refreshes quarterly. BDNB updates roughly every six months with a new millesime. Active-listing scrapes from SeLoger, Bien'ici and LeBonCoin refresh monthly by default; the cadence is a config value.

What does scaling beyond one country cost?

The geocoder cache, the index-cascade logic, the BallTree comparables, the parcel-history estimator and the reconciliation layer all carry over unchanged. Per-country cost concentrates in three things: the transaction adapter, the price-index mapping, and the geocoder choice. Storage and compute remain trivial at country scale: 4.5 million DVF rows fit in a 158 MB Parquet file and a national BallTree resolves a query in under a second on a workstation.

Contact

Need an open-data property-intelligence pipeline for your geography? Tell us about it.