Tutorial: deduplicating a multi-source dataset
Combine data about German politicians from several publishers into one dataset, then find and merge the duplicate entities using the nk command line.
Different sources describe the same real-world people and organizations, each with its own identifiers and naming quirks. This tutorial walks the full nomenklatura workflow on real data: convert source entities into statements, cross-reference them to find likely duplicates, judge candidate pairs in an interactive terminal interface, and produce a merged dataset in which each person is one entity — with every source claim still traceable.
The pipeline has a loop at its heart:
The first pass through resolve does nothing, because no deduplication decisions exist yet. After xref and dedupe have produced decisions, running the pipeline again folds them into the output.
Prerequisites
- A Python environment with
nomenklaturainstalled (pip install nomenklatura). This provides thenkcommand, and — through the FollowTheMoney (FtM) dependency — theftmcommand. qsvfor fast CSV operations.curlfor downloads.
The source data
The tutorial uses three datasets. All are published as FtM entities, one JSON object per line, so there is nothing to scrape or map:
de_abgeordnetenwatch_full— members of German parliaments, their mandates, and parties, collected by abgeordnetenwatch.de and published on data.ftm.store.de_bundestag— the OpenSanctions list of politically exposed persons (PEPs) in the German Bundestag.az_laundromat— people and companies from the Azerbaijani Laundromat, a money-laundering and influence scheme that reached European politicians.
The same member of parliament appears in the first two datasets — under differently formatted names and with different identifiers. Whether anyone appears in the third one is the investigative question that deduplication answers.
Create a working directory and download the data (about 55 MB in total):
mkdir -p data/src data/stmt
curl -sf -o data/src/de_abgeordnetenwatch_full.json \
https://data.ftm.store/de_abgeordnetenwatch_full/entities.ftm.json
curl -sf -o data/src/de_bundestag.json \
https://data.opensanctions.org/datasets/latest/de_bundestag/entities.ftm.json
curl -sf -o data/src/az_laundromat.json \
https://data.opensanctions.org/datasets/latest/az_laundromat/entities.ftm.json
Step 1: break entities into statements
A statement is one atomic claim: "entity X has property name with value Y, according to dataset Z". Decomposing entities into statements is what makes the rest of the workflow possible — statements from many sources can be pooled, re-grouped under a shared ID when entities are merged, and each value keeps its provenance throughout.
Convert each file, tagging its statements with the dataset name:
for ds in de_abgeordnetenwatch_full de_bundestag az_laundromat; do
ftm statements -d $ds -f csv data/src/$ds.json -o data/stmt/$ds.csv
done
Step 2: combine the sources into one table
Concatenate the statement tables. Long free-text values (prop_type of text, such as descriptions) don't help with matching, so filter them out:
Step 3: apply the resolver
The resolver is the database of deduplication decisions. nk apply-statements streams statements through it and sets each statement's canonical_id — the identifier of the merged entity it belongs to:
This command creates a SQLite database named nomenklatura.db in the working directory. On this first run it holds no decisions, so every statement keeps its original entity ID as its canonical ID. The step will matter on the second pass.
Step 4: aggregate statements into entities
Statements that share a canonical ID collapse into one entity. The aggregation command expects its input sorted by canonical ID:
qsv sort -s canonical_id data/resolved.csv \
| ftm aggregate-statements -f csv -i - -o data/entities.json
The result, data/entities.json, is a stream of FtM entities combining all three sources — with duplicates still unmerged.
Step 5: find duplicate candidates with xref
nk xref builds a blocking index over the entities, so only pairs that share tokens — name parts, phonetic forms, identifiers — are considered. Each candidate pair is then scored by a matching algorithm; the default for xref is er-unstable, a regression model trained for deduplication.
Pairs scoring at or above the -a threshold are merged automatically. The rest are stored in the resolver as suggestions, ranked by score, awaiting a human decision. Useful options:
-l <NUMBER>— how many candidate pairs to keep (default 5000).-f <DATASET>— only suggest pairs where one entity comes from the given dataset.-f az_laundromatconcentrates review effort on the laundromat connections.--algorithm <NAME>— choose a different scoring algorithm.
The blocking index is written to nomenklatura.data/ and reused on later runs.
Step 6: judge the candidates
Open the interactive deduplication interface:
The screen shows one candidate pair at a time, property by property. Decide with a single keypress:
- X — the two entities are the same (match)
- N — they are different (no match)
- U — unsure, skip for now
- Q — quit
Each decision is written to the resolver immediately. You don't have to finish the queue — judge the high-scoring pairs and quit whenever you want to see results.
Step 7: run the merge again
Repeat steps 3 and 4 to fold the decisions into the output:
nk apply-statements -f csv -i data/combined.csv -o data/resolved.csv
qsv sort -s canonical_id data/resolved.csv \
| ftm aggregate-statements -f csv -i - -o data/entities.json
Merged politicians are now single entities carrying claims from AbgeordnetenWatch and OpenSanctions side by side — one node per person, with each property value still attributable to its source. The xref → dedupe → merge cycle can be repeated as often as needed; each round starts from the decisions already made.
Keeping, sharing, and resetting decisions
Judgements are the product of careful human work — treat them like source code. Export them to a plain-text file that can live in version control:
To restore them into a fresh database (or a colleague's), run nk load-resolver judgements.ijson.
Two cleanup commands are useful during iteration. nk prune deletes undecided suggestions from the resolver — run it before a fresh xref to clear out stale candidates. To start over completely, delete nomenklatura.db and the nomenklatura.data/ directory.
Note: the resolver defaults to SQLite, which suits a single-person project. If you set the NOMENKLATURA_DB_URL environment variable to a PostgreSQL database, several people can judge candidates against the same resolver.
Where to go next
- Add more datasets to the pipeline — any source of FtM entities plugs into step 1. data.ftm.store and OpenSanctions publish many, including the German federal lobby register (
de_lobbyregister) and the EU transparency register (eu_transparency_register). - Match your entities against external databases like Wikidata and OpenCorporates, and pull in what they know — see the enrichment guide.
- Load the merged dataset into Neo4j with followthemoney-graph to explore it as a network.