Enrichment
Enrichment looks up your entities in external databases — Wikidata, corporate registries, a yente instance — and merges confirmed matches and their related records into your dataset.
A dataset rarely tells the whole story on its own. The people in it have Wikidata items listing their family members; the companies have registry records naming their officers. Enrichment connects an entity stream to such a source, with the resolver acting as the quality gate: nothing is merged until a match has been confirmed, either by a human or by a score threshold you control.
The workflow has three steps, each of which can be run repeatedly:
- Match: query the external source for each entity and record candidate matches as suggestions in the resolver.
- Judge: confirm or reject the suggestions, e.g. in the nk dedupe interface.
- Enrich: for confirmed matches, fetch the external record and its related entities.
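The three steps can be pictured with a toy model. The sketch below is not the library's implementation: `ToyResolver`, the `EXTERNAL` lookup table and the token-overlap score are all hypothetical stand-ins, chosen only to show why each step can be re-run safely and why nothing is emitted for unconfirmed pairs.

```python
from dataclasses import dataclass, field


@dataclass
class ToyResolver:
    """Hypothetical in-memory stand-in for the resolver."""
    suggestions: dict = field(default_factory=dict)  # (left, right) -> score
    judgements: dict = field(default_factory=dict)   # (left, right) -> bool

    def suggest(self, left, right, score):
        # Pairs that were already judged are not queued again on a re-run.
        if (left, right) not in self.judgements:
            self.suggestions[(left, right)] = score

    def decide(self, left, right, positive):
        self.suggestions.pop((left, right), None)
        self.judgements[(left, right)] = positive

    def confirmed(self):
        return [pair for pair, ok in self.judgements.items() if ok]


# Hypothetical external source: record ID -> (name, IDs of related records).
EXTERNAL = {
    "ext-1": ("Acme Holdings Ltd", ["ext-officer-9"]),
    "ext-2": ("Acme Holding GmbH", []),
}


def match(resolver, entity_id, name):
    """Step 1: record every candidate as a scored suggestion."""
    for ext_id, (ext_name, _) in EXTERNAL.items():
        # Crude token-overlap score, for illustration only.
        a, b = set(name.lower().split()), set(ext_name.lower().split())
        score = len(a & b) / len(a | b)
        if score > 0:
            resolver.suggest(entity_id, ext_id, score)


def enrich(resolver):
    """Step 3: emit external records and related IDs for confirmed pairs only."""
    out = []
    for _, ext_id in resolver.confirmed():
        _, related = EXTERNAL[ext_id]
        out.append(ext_id)
        out.extend(related)
    return out


resolver = ToyResolver()
match(resolver, "my-company", "Acme Holdings Ltd")
# Step 2: a reviewer confirms one candidate and rejects the other.
resolver.decide("my-company", "ext-1", True)
resolver.decide("my-company", "ext-2", False)
result = enrich(resolver)
```

In this toy run, only the confirmed record and its related officer end up in `result`; the rejected candidate contributes nothing.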
Configuring an enricher
An enricher is configured in a YAML file. The file doubles as dataset metadata: the entities an enricher produces are tagged with its name, so their origin stays visible after merging. The following configuration matches against the US OFAC sanctions list, served by the OpenSanctions API:
name: us_ofac_sdn
title: US OFAC Specially Designated Nationals
type: nomenklatura.enrich.yente:YenteEnricher
api: https://api.opensanctions.org/
dataset: us_ofac_sdn
api_key: ${YENTE_API_KEY}
cache_days: 30
The type key selects the enricher implementation by import path. The remaining keys depend on the enricher — see the enricher reference for each implementation's options. Three options work for every enricher:
- cache_days: how long fetched API responses stay valid in the local cache (default 90). Responses are cached in the same SQL database that holds the resolver, so re-runs don't hit the remote API again.
- schemata: a list of schema names; only entities of one of these schemata are looked up.
- topics: a list of topics; only entities carrying one of them are looked up. Use this to enrich, say, only entities tagged role.pep.
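To make the effect of these options concrete, here is a minimal Python sketch of the two checks they imply. It is an illustration under stated assumptions, not the library's code: entities are plain dictionaries, `should_look_up` and `cache_is_fresh` are hypothetical helper names, and schema names are compared exactly (whether descendant schemata also pass is not stated above).

```python
from datetime import datetime, timedelta, timezone


def should_look_up(entity, schemata=None, topics=None):
    """Return True if the entity passes the configured filters.

    An unset (None or empty) filter lets every entity through.
    """
    if schemata and entity["schema"] not in schemata:
        return False
    if topics and not set(topics) & set(entity.get("topics", [])):
        return False
    return True


def cache_is_fresh(fetched_at, cache_days=90, now=None):
    """Return True while a cached response is younger than cache_days."""
    now = now or datetime.now(timezone.utc)
    return now - fetched_at < timedelta(days=cache_days)


# Example: restrict lookups to politically exposed persons.
pep = {"schema": "Person", "topics": ["role.pep"]}
company = {"schema": "Company", "topics": []}
pep_passes = should_look_up(pep, schemata=["Person"], topics=["role.pep"])
company_passes = should_look_up(company, schemata=["Person"], topics=["role.pep"])
```

With both filters set, an entity has to satisfy each of them; here the person passes and the company does not.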
Values in the configuration can reference environment variables with ${VAR} syntax — keep API keys out of the file itself.
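The substitution can be previewed with Python's standard library, which understands the same ${VAR} form. This is only a sketch of the idea: whether the configuration loader uses `os.path.expandvars` internally is an assumption, and the key value is a placeholder set for the demonstration.

```python
import os

# Placeholder value for the demonstration; in practice, export the real key
# in your shell or secrets manager rather than setting it in code.
os.environ["YENTE_API_KEY"] = "example-key"

raw_line = "api_key: ${YENTE_API_KEY}"
expanded = os.path.expandvars(raw_line)
```

Note that `os.path.expandvars` leaves references to unset variables untouched, so a stray `${...}` in the expanded text is a quick hint that a variable was not exported.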
Step 1: find candidate matches
nk match streams an entity file through the enricher. The output contains each input entity followed by the candidates found for it, and every candidate pair is recorded in the resolver as a scored suggestion:
Step 2: judge the candidates
The suggestions land in the same review queue that nk xref feeds. Judge them in the interactive interface, using the output file from the match step so both sides of each pair are on screen:
Press X to confirm a match, N to reject it. Only confirmed pairs are enriched.
Step 3: fetch the enrichment data
nk enrich runs the same lookup, but now only acts on pairs the resolver holds a positive judgement for. For each confirmed match, it fetches the external record and the entities related to it — officers of a matched company, family members of a matched person:
The output is a stream of new entities from the external source, not a modified copy of your input. Combine it with your source data the same way any dataset gets merged — through the statements pipeline described in the deduplication tutorial. Because the matched external record shares a canonical ID with your entity, aggregation folds them into one.
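The folding step can be sketched in a few lines of Python. This is a simplified model, not the statements pipeline itself: the tuple layout, the `CANONICAL` mapping, the `NK-example` ID and the `aggregate` helper are all hypothetical, and the dataset name is kept next to each value to mirror how origin stays visible after merging.

```python
from collections import defaultdict

# Hypothetical statements as (entity_id, property, value, dataset) tuples.
STATEMENTS = [
    ("my-company", "name", "Acme Holdings Ltd", "my_data"),
    ("ext-1", "name", "ACME HOLDINGS LIMITED", "us_ofac_sdn"),
    ("ext-1", "country", "us", "us_ofac_sdn"),
    ("ext-officer-9", "name", "Jane Doe", "us_ofac_sdn"),
]

# Hypothetical resolver output: both sides of a confirmed match
# map to the same canonical ID.
CANONICAL = {"my-company": "NK-example", "ext-1": "NK-example"}


def aggregate(statements, canonical):
    """Group statements by canonical ID, keeping the source dataset per value."""
    merged = defaultdict(lambda: defaultdict(set))
    for entity_id, prop, value, dataset in statements:
        # Entities without a canonical ID keep their own ID.
        key = canonical.get(entity_id, entity_id)
        merged[key][prop].add((value, dataset))
    return merged


merged = aggregate(STATEMENTS, CANONICAL)
```

Here the source entity and the matched external record collapse into one entry, while the related officer, which has no counterpart in the source data, stays a separate entity.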
Available enrichers
| Name | Source | Matches |
|---|---|---|
| WikidataEnricher | Wikidata | People |
| YenteEnricher | A yente instance | All matchable schemata |
| AlephEnricher | An Aleph / OpenAleph instance | All matchable schemata |
| OpenCorporatesEnricher | OpenCorporates | Companies, officers |
| OpenFIGIEnricher | OpenFIGI | Organizations, securities |
| PermIDEnricher | PermID (LSEG) | Organizations |
| BrightQueryEnricher | BrightQuery (US companies) | Organizations |
Configuration options for each are documented in the enricher reference, which also describes the Enricher interface to implement for connecting a new source.