nomenklatura
nomenklatura is a data integration and enrichment framework for
followthemoney data. It deduplicates and links
records that describe the same real-world entity, and enriches them against
external sources.
To see the full workflow in action, follow the deduplication tutorial, which merges German political and lobbying data from several publishers on the command line.
Capabilities
- Entity resolution — record the human and automated judgements that decide whether two entities are the same, and apply them consistently across a dataset. See the resolver.
- Matching — score candidate pairs of entities using configurable matching algorithms.
- Cross-referencing (xref) — find likely duplicate candidates within and across datasets at scale, using a blocking index.
- Enrichment — look up entities against external data sources (e.g. Wikidata, OpenCorporates) and merge in the results.
- Stores — read and write followthemoney entities and statements to a range of storage backends.
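To make the cross-referencing and resolution ideas above concrete, here is a minimal standard-library sketch. It does not use nomenklatura's API: the record data, the token-based blocking, the `difflib` similarity score, and the `Judgements` class are all assumptions made for illustration, and the library's own index, matching algorithms and resolver are considerably more elaborate.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records (id -> name). Real followthemoney entities
# carry a schema and typed properties rather than a single string.
RECORDS = {
    "a1": "Deutsche Bank AG",
    "b7": "Deutsche Bank Aktiengesellschaft",
    "c3": "Bundesverband der Deutschen Industrie",
    "d9": "Deutsche Bahn AG",
}


def build_index(records):
    """Blocking index: map each name token to the ids of records containing it."""
    index = defaultdict(set)
    for rid, name in records.items():
        for token in name.lower().split():
            index[token].add(rid)
    return index


def candidate_pairs(index):
    """Only records that share at least one token are ever compared."""
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def score(left, right):
    """Stand-in matcher: plain string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, left.lower(), right.lower()).ratio()


def xref(records, threshold=0.6):
    """Score every blocked pair and return those at or above the threshold, best first."""
    pairs = candidate_pairs(build_index(records))
    scored = [(score(records[l], records[r]), l, r) for l, r in pairs]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)


class Judgements:
    """Toy store of pairwise decisions, so they can be applied consistently later."""

    def __init__(self):
        self.parent = {}
        self.negative = set()

    def canonical(self, rid):
        """Follow positive judgements to the representative id of a cluster."""
        self.parent.setdefault(rid, rid)
        while self.parent[rid] != rid:
            rid = self.parent[rid]
        return rid

    def decide(self, left, right, same):
        """Record that two ids are (or are not) the same real-world entity."""
        if same:
            self.parent[self.canonical(right)] = self.canonical(left)
        else:
            self.negative.add(frozenset((left, right)))


if __name__ == "__main__":
    for value, left, right in xref(RECORDS):
        print(f"{left} ~ {right}: {value:.2f}")
```

Blocking is what keeps this tractable at scale: the record with id `c3` shares no token with the others, so it is never scored against them. Recorded judgements then let a reviewer's decision about one pair carry over to every later lookup of either id.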
Related resources
This library is part of a broader ecosystem of tools:
- FollowTheMoney — the data model nomenklatura operates on
- rigour — text cleaning and validation used throughout
- yente — matching API server built on this library
- OpenSanctions: open source projects