Matching
A scoring algorithm compares two entities and returns a score between 0.0 and 1.0, together with per-feature explanations of how that score came about.
Matching is used in two distinct situations, and different algorithms fit each. In screening, a query entity (e.g. a customer record) is compared against a list of known entities, and false negatives are costly. In deduplication, entities from overlapping datasets are compared to find records describing the same real-world entity, and the score is a ranking aid for a human reviewer or an auto-merge threshold.
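The two situations consume the score differently. As a rough sketch, screening keeps every candidate above a deliberately low cutoff so that fewer true matches are missed, while deduplication sorts pairs into merge, review and skip buckets. The cutoff values and function names below are illustrative assumptions, not library defaults.

```python
from typing import Dict, List, Tuple

# Hypothetical cutoffs chosen for illustration only; not library defaults.
SCREENING_THRESHOLD = 0.5
AUTO_MERGE_THRESHOLD = 0.95
REVIEW_THRESHOLD = 0.6


def screen(
    scores: Dict[str, float], threshold: float = SCREENING_THRESHOLD
) -> List[Tuple[str, float]]:
    """Return all candidates at or above the cutoff, best first."""
    hits = [(cid, score) for cid, score in scores.items() if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


def triage_pair(score: float) -> str:
    """Classify a deduplication pair by its score."""
    if score >= AUTO_MERGE_THRESHOLD:
        return "merge"
    if score >= REVIEW_THRESHOLD:
        return "review"
    return "skip"
```

Lowering the screening cutoff trades more manual review for fewer missed matches. Raising the auto-merge cutoff has the opposite effect on deduplication, since more pairs fall into the review bucket.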
Available algorithms
Each algorithm is identified by a stable name, which is used to select it in the yente API and on the nk command line (nk xref --algorithm <NAME>):
| Name | Class | Use for |
|---|---|---|
| logic-v2 | LogicV2 | Screening. Rule-based, explainable, multi-script name matching. |
| ofac | OFACMatcher | Screening, when parity with OFAC's Sanctions List Search is required. |
| er-unstable | EntityResolveRegression | Deduplication, e.g. in nk xref. Not for regulated screening. |
| regression-v1 | RegressionV1 | Legacy regression model, the default for nk match. |
| logic-v1 | LogicV1 | Superseded by logic-v2. |
| name-based | NameMatcher | Deprecated in favor of ofac. |
| name-qualified | NameQualifiedMatcher | Deprecated in favor of ofac. |
Prefer logic-v2 for screening and er-unstable for deduplication. The module also exposes two constants: DefaultAlgorithm, which still points to regression-v1 for API compatibility rather than to the recommended screening algorithm, and DedupeAlgorithm, which points to er-unstable.
To score a pair of entities in Python, look up an algorithm by name and call its compare class method:
from nomenklatura.matching import get_algorithm, ScoringConfig
algorithm = get_algorithm("logic-v2")
config = ScoringConfig.defaults()
result = algorithm.compare(query, candidate, config)
print(result.score, result.explanations)
Interface
nomenklatura.matching.get_algorithm(name)
Return the scoring algorithm class with the given name.
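A lookup by name is commonly backed by a registry that maps each stable name to its class. The following is a minimal, self-contained sketch of that pattern, not nomenklatura's actual implementation. The stand-in classes, the NAME attribute and the choice to return None for unknown names are all assumptions made for illustration.

```python
from typing import Dict, Optional, Type


class Algorithm:
    """Stand-in base class; NAME is a hypothetical attribute."""

    NAME = ""


class ToyLogic(Algorithm):
    NAME = "toy-logic"


class ToyRegression(Algorithm):
    NAME = "toy-regression"


# Registry keyed by the stable algorithm name.
REGISTRY: Dict[str, Type[Algorithm]] = {
    cls.NAME: cls for cls in (ToyLogic, ToyRegression)
}


def lookup(name: str) -> Optional[Type[Algorithm]]:
    """Return the class registered under name, or None if it is unknown."""
    return REGISTRY.get(name)
```

Because the lookup returns a class rather than an instance, callers can invoke class methods such as compare on the result directly.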
nomenklatura.matching.ScoringAlgorithm
Bases: object
An implementation of a scoring system that compares two entities.
Source code in nomenklatura/matching/types.py
compare(query, result, config)
classmethod
Compare the two entities and return a score and feature comparison.
default_config()
classmethod
get_docs()
classmethod
Return an explanation of the algorithm and its features.
get_feature_docs()
classmethod
nomenklatura.matching.ScoringConfig
Bases: BaseModel
Configuration for a scoring algorithm.
Source code in nomenklatura/matching/types.py
defaults()
classmethod
get_float(key)
Get a float value from the configuration.
get_optional_string(key)
nomenklatura.matching.types.MatchingResult
Bases: object
Score and feature comparison results for a matching comparison. This is instantiated for each candidate returned by the search, and the score is used to rank the results. Explanations are generated lazily for performance.
Source code in nomenklatura/matching/types.py
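Lazy generation means the explanation objects are only built when a caller first reads them, so ranking many candidates by score stays cheap. One way to get that behavior in Python is functools.cached_property. The sketch below uses a toy class with hypothetical names (LazyResult, build_count) and does not reproduce the MatchingResult implementation.

```python
from functools import cached_property
from typing import Dict


class LazyResult:
    """Toy stand-in for a result whose explanations are built on demand."""

    def __init__(self, score: float, features: Dict[str, float]) -> None:
        self.score = score
        self._features = features
        self.build_count = 0  # counts how often explanations were built

    @cached_property
    def explanations(self) -> Dict[str, str]:
        # Runs on first access only; the result is then stored on the instance.
        self.build_count += 1
        return {
            name: f"{name} contributed {value:.2f}"
            for name, value in self._features.items()
        }
```

Sorting a list of such results by score never triggers the explanation code. Only the results that are actually displayed pay that cost.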
explanations
property
Return the explanations for the feature results as pydantic models.
nomenklatura.matching.types.FeatureResult
Bases: BaseModel
An explained score for a particular feature result.
Source code in nomenklatura/matching/types.py
Algorithms
nomenklatura.matching.LogicV2
Bases: HeuristicAlgorithm
A rule-based matching system that generates a set of basic scores via name and identifier-based matching, and then qualifies that score using supporting or contradicting features of the two entities. Its name matcher is a versatile algorithm that draws on cultural reference data for precise, explainable cross-language and cross-script matching.
Source code in nomenklatura/matching/logic_v2/model.py
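The "base score, then qualify" idea can be sketched as follows. This is an illustration under stated assumptions, not the LogicV2 code: it assumes the strongest base feature wins, that qualifiers are signed adjustments added on top, and that the result is clamped to the 0.0 to 1.0 range. The feature names and weights are hypothetical.

```python
from typing import Dict


def qualified_score(base: Dict[str, float], qualifiers: Dict[str, float]) -> float:
    """Take the strongest base score, apply qualifier adjustments, clamp to [0, 1]."""
    # Assumption: the best name or identifier match sets the starting score.
    score = max(base.values(), default=0.0)
    # Assumption: supporting features add to the score, contradicting ones subtract.
    score += sum(qualifiers.values())
    return max(0.0, min(1.0, score))


# A strong name match weakened by a hypothetical birth date mismatch penalty.
example = qualified_score(
    {"name_match": 0.9, "identifier_match": 0.0},
    {"dob_mismatch": -0.25},
)
```

Keeping base scores and qualifiers separate is what makes the result explainable: each feature's contribution can be reported on its own.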
nomenklatura.matching.OFACMatcher
Bases: HeuristicAlgorithm
An algorithm that emulates the public OFAC Sanctions List Search tool at sanctionssearch.ofac.treas.gov, with mismatch qualifiers layered on top. Reverse-engineered from FAQ 249 and parity fixtures captured against the live tool. Name scoring closely tracks OFAC's reported score, but is an emulation rather than an exact reimplementation. Qualifier features (country, DOB, gender, orgid mismatches) reduce the name score. This departs from FAQ 251, which says only the name field influences the Score, but mirrors how OFAC users actually triage matches via FAQ 5.
Source code in nomenklatura/matching/name_based/model.py
nomenklatura.matching.EntityResolveRegression
Bases: ScoringAlgorithm
Entity resolution matcher. Do not use this in (regulated) screening scenarios.
Source code in nomenklatura/matching/erun/model.py
compare(query, result, config)
classmethod
Use a regression model to compare two entities.
encode_pair(left, right)
classmethod
Encode the comparison between two entities as a set of feature values.
get_feature_docs()
classmethod
Return an explanation of the features and their coefficients.
load()
cached
classmethod
Load a pre-trained classification pipeline for ad-hoc use.
save(pipe, coefficients)
classmethod
Store a classification pipeline after training.
nomenklatura.matching.RegressionV1
Bases: ScoringAlgorithm
A simple matching algorithm based on a regression model.
Source code in nomenklatura/matching/regression_v1/model.py
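A regression-based scorer of this kind encodes an entity pair as numeric feature values and combines them with learned coefficients. The sketch below assumes a logistic model, where the score is the sigmoid of the weighted sum. The function name, feature names and coefficient values are hypothetical and are not taken from the trained model.

```python
import math
from typing import Dict


def regression_score(
    features: Dict[str, float],
    coefficients: Dict[str, float],
    intercept: float = 0.0,
) -> float:
    """Logistic regression score: sigmoid of the weighted feature sum."""
    z = intercept + sum(
        coefficients.get(name, 0.0) * value for name, value in features.items()
    )
    # Numerically stable sigmoid that avoids overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Hypothetical coefficients: name similarity supports a match, a birth date
# mismatch counts against it.
coefs = {"name_similarity": 4.0, "dob_mismatch": -3.0}
strong = regression_score({"name_similarity": 1.0, "dob_mismatch": 0.0}, coefs, -2.0)
weak = regression_score({"name_similarity": 1.0, "dob_mismatch": 1.0}, coefs, -2.0)
```

The sigmoid keeps the output strictly between 0 and 1, and the sign and size of each coefficient show how a feature moves the score.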
compare(query, result, config)
classmethod
Use a regression model to compare two entities.
encode_pair(left, right)
classmethod
Encode the comparison between two entities as a set of feature values.
get_feature_docs()
classmethod
Return an explanation of the features and their coefficients.
load()
cached
classmethod
Load a pre-trained classification pipeline for ad-hoc use.
save(pipe, coefficients)
classmethod
Store a classification pipeline after training.
nomenklatura.matching.LogicV1
Bases: HeuristicAlgorithm
A rule-based matching system that generates a set of basic scores via name and identifier-based matching, and then qualifies that score using supporting or contradicting features of the two entities.
This algorithm has been superseded by logic-v2 and is no longer recommended for new integrations.
Source code in nomenklatura/matching/logic_v1/model.py
nomenklatura.matching.NameMatcher
Bases: HeuristicAlgorithm
Deprecated in favor of ofac, which actually emulates OFAC's public Sanctions List Search behavior. This algorithm matches on entity name using phonetic comparisons and Jaro-Winkler edit distance, vaguely based on FAQ #249, but does not reach OFAC parity.
Source code in nomenklatura/matching/name_based/model.py
nomenklatura.matching.NameQualifiedMatcher
Bases: HeuristicAlgorithm
Deprecated in favor of ofac, which carries the same qualifier weights on top of a name score that actually reaches OFAC parity. Same as the name-based algorithm, but scores are reduced if a mismatch of birth dates or nationalities is found for persons, or if different tax or registration identifiers are included for organizations and companies.