Matching

A scoring algorithm compares two entities and returns a score between 0.0 and 1.0, together with per-feature explanations of how that score came about.

Matching is used in two distinct situations, and different algorithms fit each. In screening, a query entity (e.g. a customer record) is compared against a list of known entities, and false negatives are costly. In deduplication, entities from overlapping datasets are compared to find records describing the same real-world entity, and the score is a ranking aid for a human reviewer or an auto-merge threshold.
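As a rough illustration of the deduplication case, the sketch below ranks scored candidates and splits them into an auto-merge queue and a human review queue. The two thresholds and the `triage` helper are hypothetical and not part of nomenklatura; suitable cut-offs depend on the algorithm and the data.

```python
from typing import List, Tuple

# Hypothetical cut-offs for illustration; they are not taken from nomenklatura.
REVIEW_THRESHOLD = 0.5
AUTO_MERGE_THRESHOLD = 0.95


def triage(scored: List[Tuple[str, float]]) -> Tuple[List[str], List[str]]:
    """Split (candidate_id, score) pairs into auto-merge and review queues."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    merge = [cid for cid, score in ranked if score >= AUTO_MERGE_THRESHOLD]
    review = [
        cid
        for cid, score in ranked
        if REVIEW_THRESHOLD <= score < AUTO_MERGE_THRESHOLD
    ]
    return merge, review


merge, review = triage([("a", 0.97), ("b", 0.42), ("c", 0.71), ("d", 0.88)])
print(merge)   # ['a']
print(review)  # ['d', 'c']
```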

Available algorithms

Each algorithm is identified by a stable name, which is used to select it in the yente API and on the nk command line (nk xref --algorithm <NAME>):

| Name | Class | Use for |
| --- | --- | --- |
| logic-v2 | LogicV2 | Screening. Rule-based, explainable, multi-script name matching. |
| ofac | OFACMatcher | Screening, when parity with OFAC's Sanctions List Search is required. |
| er-unstable | EntityResolveRegression | Deduplication, e.g. in nk xref. Not for regulated screening. |
| regression-v1 | RegressionV1 | Legacy regression model, the default for nk match. |
| logic-v1 | LogicV1 | Superseded by logic-v2. |
| name-based | NameMatcher | Deprecated in favor of ofac. |
| name-qualified | NameQualifiedMatcher | Deprecated in favor of ofac. |

Prefer logic-v2 for screening and er-unstable for deduplication. The module also exposes two constants: DefaultAlgorithm, which still points to regression-v1 for API compatibility rather than to the recommended logic-v2, and DedupeAlgorithm, which points to er-unstable.

To score a pair of entities in Python, look up an algorithm by name and call its compare class method:

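from nomenklatura.matching import get_algorithm, ScoringConfig

# `query` and `candidate` are the two entities to compare, loaded elsewhere.
algorithm = get_algorithm("logic-v2")
if algorithm is None:  # get_algorithm returns None for an unknown name
    raise ValueError("Unknown algorithm: logic-v2")
config = ScoringConfig.defaults()
result = algorithm.compare(query, candidate, config)
print(result.score, result.explanations)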

Interface

nomenklatura.matching.get_algorithm(name)

Return the scoring algorithm class with the given name.

Source code in nomenklatura/matching/__init__.py
def get_algorithm(name: str) -> Optional[Type[ScoringAlgorithm]]:
    """Return the scoring algorithm class with the given name."""
    for algorithm in ALGORITHMS:
        if algorithm.NAME == name:
            return algorithm
    return None
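Because the lookup returns None rather than raising, callers need to handle an unknown name themselves. The sketch below reproduces the lookup loop against a stand-in registry so that it runs without nomenklatura installed; `require_algorithm` is a hypothetical helper, not a library function, and the two stub classes only carry a NAME.

```python
from typing import List, Optional, Type


class ScoringAlgorithm:
    """Stand-in for the real base class, reduced to the NAME attribute."""

    NAME = "algorithm_name"


class LogicV2(ScoringAlgorithm):
    NAME = "logic-v2"


class OFACMatcher(ScoringAlgorithm):
    NAME = "ofac"


# Stand-in registry; the real ALGORITHMS list lives in nomenklatura.matching.
ALGORITHMS: List[Type[ScoringAlgorithm]] = [LogicV2, OFACMatcher]


def get_algorithm(name: str) -> Optional[Type[ScoringAlgorithm]]:
    """Same lookup loop as in the source listing above."""
    for algorithm in ALGORITHMS:
        if algorithm.NAME == name:
            return algorithm
    return None


def require_algorithm(name: str) -> Type[ScoringAlgorithm]:
    """Hypothetical helper: turn the None result into an explicit error."""
    algorithm = get_algorithm(name)
    if algorithm is None:
        known = ", ".join(a.NAME for a in ALGORITHMS)
        raise ValueError(f"Unknown algorithm {name!r}, expected one of: {known}")
    return algorithm


print(require_algorithm("ofac").__name__)  # OFACMatcher
print(get_algorithm("no-such-name"))       # None
```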

nomenklatura.matching.ScoringAlgorithm

Bases: object

An implementation of a scoring system that compares two entities.

Source code in nomenklatura/matching/types.py
class ScoringAlgorithm(object):
    """An implementation of a scoring system that compares two entities."""

    NAME = "algorithm_name"
    CONFIG: Dict[str, ConfigVar] = {}

    @classmethod
    def compare(cls, query: E, result: E, config: ScoringConfig) -> MatchingResult:
        """Compare the two entities and return a score and feature comparison."""
        raise NotImplementedError

    @classmethod
    def get_feature_docs(cls) -> FeatureDocs:
        """Return an explanation of the features and their coefficients."""
        raise NotImplementedError

    @classmethod
    def get_docs(cls) -> AlgorithmDocs:
        """Return an explanation of the algorithm and its features."""
        return AlgorithmDocs(
            name=cls.NAME,
            description=cls.__doc__,
            config=cls.CONFIG,
            features=cls.get_feature_docs(),
        )

    @classmethod
    def default_config(cls) -> ScoringConfig:
        """Return the default configuration for the algorithm."""
        return ScoringConfig.defaults()

compare(query, result, config) classmethod

Compare the two entities and return a score and feature comparison.

Source code in nomenklatura/matching/types.py
@classmethod
def compare(cls, query: E, result: E, config: ScoringConfig) -> MatchingResult:
    """Compare the two entities and return a score and feature comparison."""
    raise NotImplementedError

default_config() classmethod

Return the default configuration for the algorithm.

Source code in nomenklatura/matching/types.py
@classmethod
def default_config(cls) -> ScoringConfig:
    """Return the default configuration for the algorithm."""
    return ScoringConfig.defaults()

get_docs() classmethod

Return an explanation of the algorithm and its features.

Source code in nomenklatura/matching/types.py
@classmethod
def get_docs(cls) -> AlgorithmDocs:
    """Return an explanation of the algorithm and its features."""
    return AlgorithmDocs(
        name=cls.NAME,
        description=cls.__doc__,
        config=cls.CONFIG,
        features=cls.get_feature_docs(),
    )

get_feature_docs() classmethod

Return an explanation of the features and their coefficients.

Source code in nomenklatura/matching/types.py
@classmethod
def get_feature_docs(cls) -> FeatureDocs:
    """Return an explanation of the features and their coefficients."""
    raise NotImplementedError
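To make the contract concrete, here is a minimal sketch of a subclass that overrides compare. Everything in it is a stand-in so the example runs on its own: entities are plain dicts rather than real entity objects, the config is an untyped dict, and `ExactName` is a hypothetical algorithm that does not exist in the library.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FtResult:
    """Stand-in for the internal per-feature result."""

    score: float
    detail: Optional[str] = None


@dataclass
class MatchingResult:
    """Stand-in holding the overall score and per-feature results."""

    score: float
    explanations: Dict[str, FtResult] = field(default_factory=dict)


class ScoringAlgorithm:
    """Stand-in base class with the same compare() contract."""

    NAME = "algorithm_name"

    @classmethod
    def compare(cls, query: dict, result: dict, config: dict) -> MatchingResult:
        raise NotImplementedError


class ExactName(ScoringAlgorithm):
    """Hypothetical algorithm: 1.0 if the names are equal ignoring case."""

    NAME = "exact-name"

    @classmethod
    def compare(cls, query: dict, result: dict, config: dict) -> MatchingResult:
        same = query["name"].casefold() == result["name"].casefold()
        score = 1.0 if same else 0.0
        detail = "names equal ignoring case" if same else None
        return MatchingResult(score, {"exact_name": FtResult(score, detail)})


res = ExactName.compare({"name": "ACME Ltd"}, {"name": "acme ltd"}, {})
print(res.score)  # 1.0
```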

nomenklatura.matching.ScoringConfig

Bases: BaseModel

Configuration for a scoring algorithm.

Source code in nomenklatura/matching/types.py
class ScoringConfig(BaseModel):
    """Configuration for a scoring algorithm."""

    weights: Dict[str, float]
    config: Dict[str, Union[str, int, float, bool, None]]

    @classmethod
    def defaults(cls) -> "ScoringConfig":
        """Return the default configuration."""
        return cls.model_construct(weights={}, config={})

    def get_float(self, key: str) -> float:
        """Get a float value from the configuration."""
        value = self.config.get(key)
        if value is None:
            raise ValueError(f"{key} cannot be None")
        return float(value)

    def get_optional_string(self, key: str) -> Optional[str]:
        """Get a string value from the configuration."""
        value = self.config.get(key)
        if value is None:
            return value
        return str(value)

    def __hash__(self) -> int:
        return hash(self.model_dump_json())

defaults() classmethod

Return the default configuration.

Source code in nomenklatura/matching/types.py
@classmethod
def defaults(cls) -> "ScoringConfig":
    """Return the default configuration."""
    return cls.model_construct(weights={}, config={})

get_float(key)

Get a float value from the configuration.

Source code in nomenklatura/matching/types.py
def get_float(self, key: str) -> float:
    """Get a float value from the configuration."""
    value = self.config.get(key)
    if value is None:
        raise ValueError(f"{key} cannot be None")
    return float(value)

get_optional_string(key)

Get a string value from the configuration.

Source code in nomenklatura/matching/types.py
def get_optional_string(self, key: str) -> Optional[str]:
    """Get a string value from the configuration."""
    value = self.config.get(key)
    if value is None:
        return value
    return str(value)
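The two accessors treat a missing value differently: get_float raises, while get_optional_string passes None through. The sketch below mirrors both methods on a dataclass stand-in (`ConfigSketch` is an illustrative name; the real class is a pydantic model) and shows that a key absent from the config dict behaves like an explicit None.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Union

ConfigValue = Union[str, int, float, bool, None]


@dataclass
class ConfigSketch:
    """Dataclass stand-in for ScoringConfig (the real class is a pydantic model)."""

    weights: Dict[str, float] = field(default_factory=dict)
    config: Dict[str, ConfigValue] = field(default_factory=dict)

    def get_float(self, key: str) -> float:
        value = self.config.get(key)
        if value is None:
            raise ValueError(f"{key} cannot be None")
        return float(value)

    def get_optional_string(self, key: str) -> Optional[str]:
        value = self.config.get(key)
        if value is None:
            return value
        return str(value)


cfg = ConfigSketch(config={"nm_number_mismatch": "0.3", "nm_name_property": None})
print(cfg.get_float("nm_number_mismatch"))          # 0.3
print(cfg.get_optional_string("nm_name_property"))  # None
try:
    cfg.get_float("nm_fuzzy_cutoff_factor")  # key absent, so .get() yields None
except ValueError as exc:
    print(exc)  # nm_fuzzy_cutoff_factor cannot be None
```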

nomenklatura.matching.types.MatchingResult

Bases: object

Score and feature comparison results for matching comparison. This is instantiated for each candidate returned by the search, and the score is used to rank the results. Explanations are lazy-generated for performance.

Source code in nomenklatura/matching/types.py
class MatchingResult(object):
    """Score and feature comparison results for matching comparison. This is instantiated
    for each candidate returned by the search, and the score is used to rank the results.
    Explanations are lazy-generated for performance."""

    __slots__ = ["score", "_explanations"]

    def __init__(self, score: float, explanations: Dict[str, FtResult]) -> None:
        self.score = score
        self._explanations = explanations

    @property
    def explanations(self) -> Dict[str, FeatureResult]:
        """Return the explanations for the feature results as pydantic models."""
        _explanations: Dict[str, FeatureResult] = {}
        for name, res in self._explanations.items():
            if res.detail is not None or res.score > FNUL:
                _explanations[name] = FeatureResult(
                    score=res.score,
                    detail=res.detail,
                    query=res.query,
                    candidate=res.candidate,
                )
        return _explanations

    def __repr__(self) -> str:
        """Return a string representation of the matching result."""
        return f"<MR({self.score}, expl={self._explanations})>"

explanations property

Return the explanations for the feature results as pydantic models.

__repr__()

Return a string representation of the matching result.

Source code in nomenklatura/matching/types.py
def __repr__(self) -> str:
    """Return a string representation of the matching result."""
    return f"<MR({self.score}, expl={self._explanations})>"
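The explanations property drops features that contributed nothing: an entry is kept only if it has a detail message or a score above FNUL. The sketch below isolates that filter. It assumes FNUL is 0.0 (the constant's value is not shown on this page), and the feature scores and detail strings are invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional

FNUL = 0.0  # assumption: the library's "no score" constant is zero


@dataclass
class FtResult:
    """Stand-in for the internal per-feature result."""

    score: float
    detail: Optional[str] = None


def visible(explanations: Dict[str, FtResult]) -> Dict[str, FtResult]:
    """Keep features that carry a detail message or a positive score."""
    return {
        name: res
        for name, res in explanations.items()
        if res.detail is not None or res.score > FNUL
    }


raw = {
    "name_match": FtResult(0.92, "matched name tokens"),
    "country_mismatch": FtResult(0.0, None),
    "gender_mismatch": FtResult(0.0, "no gender on query"),
}
print(sorted(visible(raw)))  # ['gender_mismatch', 'name_match']
```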

nomenklatura.matching.types.FeatureResult

Bases: BaseModel

An explained score for a particular feature result.

Source code in nomenklatura/matching/types.py
class FeatureResult(BaseModel):
    """An explained score for a particular feature result."""

    # This is the API version of the explanation. It's a pydantic model that can
    # be easily serialized to JSON and returned in the API response. The FtResult
    # is the internal version that is quicker to generate in the millions during
    # matching operations.

    detail: Optional[str]
    score: float

    # Used e.g. for names and identifiers to explain which value from
    # the query and result entities was actually used to make the match.
    query: Optional[str] = None
    candidate: Optional[str] = None

Algorithms

nomenklatura.matching.LogicV2

Bases: HeuristicAlgorithm

A rule-based matching system that generates a set of basic scores via name and identifier-based matching, and then qualifies that score using supporting or contradicting features of the two entities. Its name matcher uses a versatile matching algorithm that uses cultural reference data for precise and explainable cross-language and cross-script matching.

Source code in nomenklatura/matching/logic_v2/model.py
class LogicV2(HeuristicAlgorithm):
    """A rule-based matching system that generates a set of basic scores via
    name and identifier-based matching, and then qualifies that score using
    supporting or contradicting features of the two entities. Its name matcher
    uses a versatile matching algorithm that uses cultural reference data for
    precise and explainable cross-language and cross-script matching.
    """

    NAME = "logic-v2"
    features = [
        Feature(func=name_match, weight=1.0),
        Feature(func=address_entity_match, weight=0.98),
        Feature(func=crypto_wallet_address, weight=0.98),
        Feature(func=isin_security_match, weight=0.98),
        Feature(func=lei_code_match, weight=0.95),
        Feature(func=ogrn_code_match, weight=0.95),
        Feature(func=vessel_imo_mmsi_match, weight=0.95),
        Feature(func=inn_code_match, weight=0.95),
        Feature(func=bic_code_match, weight=0.95),
        Feature(func=uei_code_match, weight=0.95),
        Feature(func=npi_code_match, weight=0.95),
        Feature(func=identifier_match, weight=0.85),
        Feature(func=weak_alias_match, weight=0.8),
        Feature(func=address_prop_match, weight=0.2, qualifier=True),
        Feature(func=country_mismatch, weight=-0.2, qualifier=True),
        Feature(func=dob_year_disjoint, weight=-0.15, qualifier=True),
        Feature(func=dob_day_disjoint, weight=-0.25, qualifier=True),
        Feature(func=gender_mismatch, weight=-0.2, qualifier=True),
    ]
    CONFIG = {
        "nm_name_property": ConfigVar(
            type=ConfigVarType.STRING,
            description="The property to use for name matching. If not set, all name properties are used.",
            default=None,
        ),
        "nm_number_mismatch": ConfigVar(
            type=ConfigVarType.FLOAT,
            description="Penalty for mismatching numbers in object or company names.",
            default=0.3,
        ),
        "nm_extra_query_name": ConfigVar(
            type=ConfigVarType.FLOAT,
            description="Weight for name parts in the query not matched to the result.",
            default=0.8,
        ),
        "nm_extra_result_name": ConfigVar(
            type=ConfigVarType.FLOAT,
            description="Weight for name parts in the result not matched to the query.",
            default=0.2,
        ),
        "nm_family_name_weight": ConfigVar(
            type=ConfigVarType.FLOAT,
            description="Extra weight multiplier for family name in person matches (John Smith vs. John Gruber is clearly distinct).",
            default=1.3,
        ),
        "nm_fuzzy_cutoff_factor": ConfigVar(
            type=ConfigVarType.FLOAT,
            description="Extra factor for when a fuzzy match is triggered in name matching. "
            "Below a certain threshold, a fuzzy match is considered as a non-match (score = 0.0). "
            "Adjusting this multiplier will raise this threshold, making a fuzzy match trigger more leniently.",
            default=1.0,
        ),
    }

    @classmethod
    def compute_score(
        cls, scores: Dict[str, float], weights: Dict[str, float]
    ) -> float:
        mains: List[float] = []
        for feat in cls.features:
            if feat.qualifier:
                continue
            score = scores.get(feat.name, FNUL)
            mains.append(score * weights.get(feat.name, FNUL))
        score = max(mains, default=FNUL)
        if score <= FNUL:
            return score
        for feat in cls.features:
            if not feat.qualifier:
                continue
            weight = scores.get(feat.name, FNUL) * weights.get(feat.name, FNUL)
            score += weight
        return score
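A worked example may help with compute_score: the score is the best weighted main feature, and qualifiers are added only if that best value is positive. The sketch below reimplements the same logic over a subset of the feature table, using a NamedTuple stand-in for Feature and assuming FNUL is 0.0. The input scores are invented.

```python
from typing import Dict, List, NamedTuple

FNUL = 0.0  # assumption: the library's "no score" constant is zero


class Feature(NamedTuple):
    name: str
    weight: float
    qualifier: bool = False


# A subset of the logic-v2 feature table, with the weights listed above.
FEATURES: List[Feature] = [
    Feature("name_match", 1.0),
    Feature("identifier_match", 0.85),
    Feature("address_prop_match", 0.2, qualifier=True),
    Feature("country_mismatch", -0.2, qualifier=True),
    Feature("dob_day_disjoint", -0.25, qualifier=True),
]


def compute_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Best weighted main feature, then add every weighted qualifier."""
    mains = [
        scores.get(f.name, FNUL) * weights.get(f.name, FNUL)
        for f in FEATURES
        if not f.qualifier
    ]
    score = max(mains, default=FNUL)
    if score <= FNUL:
        return score  # qualifiers never lift a pair with no main match
    for f in FEATURES:
        if f.qualifier:
            score += scores.get(f.name, FNUL) * weights.get(f.name, FNUL)
    return score


weights = {f.name: f.weight for f in FEATURES}
scores = {"name_match": 0.9, "identifier_match": 1.0, "country_mismatch": 1.0}
print(round(compute_score(scores, weights), 2))  # 0.7
```

Here the name match (0.9 × 1.0) beats the identifier match (1.0 × 0.85), and the country mismatch then subtracts 0.2. A pair with only a supporting qualifier, such as a matching address property, stays at 0.0.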

nomenklatura.matching.OFACMatcher

Bases: HeuristicAlgorithm

An algorithm that emulates the public OFAC Sanctions List Search tool at sanctionssearch.ofac.treas.gov, with mismatch qualifiers layered on top. Reverse-engineered from FAQ 249 and parity fixtures captured against the live tool. Name scoring closely tracks OFAC's reported score, but is an emulation rather than an exact reimplementation. Qualifier features (country, DOB, gender, orgid mismatches) reduce the name score - this departs from FAQ 251 (which says only the name field influences the Score) but mirrors how OFAC users actually triage matches via FAQ 5.

Source code in nomenklatura/matching/name_based/model.py
class OFACMatcher(HeuristicAlgorithm):
    """An algorithm that emulates the public OFAC Sanctions List Search tool at
    sanctionssearch.ofac.treas.gov, with mismatch qualifiers layered on top.
    Reverse-engineered from FAQ 249 and parity fixtures captured against the
    live tool. Name scoring closely tracks OFAC's reported score, but is an
    emulation rather than an exact reimplementation. Qualifier features
    (country, DOB, gender, orgid mismatches) reduce the name score - this
    departs from FAQ 251 (which says
    only the name field influences the Score) but mirrors how OFAC users
    actually triage matches via FAQ 5."""

    NAME = "ofac"
    features = [
        Feature(func=ofac_name_score, weight=1.0),
        Feature(func=country_mismatch, weight=-0.1, qualifier=True),
        Feature(func=dob_year_disjoint, weight=-0.1, qualifier=True),
        Feature(func=dob_day_disjoint, weight=-0.15, qualifier=True),
        Feature(func=gender_mismatch, weight=-0.1, qualifier=True),
        Feature(func=orgid_disjoint, weight=-0.1, qualifier=True),
    ]

    @classmethod
    def compute_score(
        cls, scores: Dict[str, float], weights: Dict[str, float]
    ) -> float:
        score = 0.0
        for feat in cls.features:
            score += scores.get(feat.name, 0.0) * weights.get(feat.name, 0.0)
        return score
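Unlike logic-v2, this compute_score is a plain weighted sum, so qualifiers apply whatever the name score is. The sketch below uses the weights from the feature list above with invented feature scores. It iterates over the weights dict instead of cls.features, which gives the same result when both cover the same names.

```python
from typing import Dict

# Weights copied from the ofac feature list above.
OFAC_WEIGHTS: Dict[str, float] = {
    "ofac_name_score": 1.0,
    "country_mismatch": -0.1,
    "dob_year_disjoint": -0.1,
    "dob_day_disjoint": -0.15,
    "gender_mismatch": -0.1,
    "orgid_disjoint": -0.1,
}


def compute_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Plain weighted sum over all features, main and qualifier alike."""
    return sum(scores.get(name, 0.0) * weight for name, weight in weights.items())


scores = {"ofac_name_score": 0.92, "dob_year_disjoint": 1.0, "dob_day_disjoint": 1.0}
print(round(compute_score(scores, OFAC_WEIGHTS), 2))  # 0.67
```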

nomenklatura.matching.EntityResolveRegression

Bases: ScoringAlgorithm

Entity resolution matcher. Do not use this in (regulated) screening scenarios.

Source code in nomenklatura/matching/erun/model.py
class EntityResolveRegression(ScoringAlgorithm):
    """Entity resolution matcher. Do not use this in (regulated) screening scenarios."""

    NAME = "er-unstable"
    MODEL_PATH = DATA_PATH.joinpath(f"{NAME}.pkl")
    FEATURES: List[CompareFunction] = [
        name_token_overlap,
        name_numbers,
        legal_name_levenshtein,
        person_name_levenshtein,
        org_name_levenshtein,
        strong_identifier_match,
        weak_identifier_match,
        dob_match,
        dob_year_match,
        contact_match,
        family_name_match,
        birth_place,
        gender_mismatch,
        per_country_mismatch,
        position_country_match,
        org_country_mismatch,
        security_isin_mismatch,
        obj_name_levenshtein,
        address_match,
        address_numbers,
    ]

    @classmethod
    def save(cls, pipe: Pipeline, coefficients: Dict[str, float]) -> None:
        """Store a classification pipeline after training."""
        mdl = pickle.dumps({"pipe": pipe, "coefficients": coefficients})
        with open(cls.MODEL_PATH, "wb") as fh:
            fh.write(mdl)
        cls.load.cache_clear()

    @classmethod
    @cache
    def load(cls) -> Tuple[Pipeline, Dict[str, float]]:
        """Load a pre-trained classification pipeline for ad-hoc use."""
        with open(cls.MODEL_PATH, "rb") as fh:
            matcher = pickle.loads(fh.read())
        pipe = cast(Pipeline, matcher["pipe"])
        coefficients = cast(Dict[str, float], matcher["coefficients"])
        current = [f.__name__ for f in cls.FEATURES]
        if list(coefficients.keys()) != current:
            raise RuntimeError("Model was not trained on identical features!")
        return pipe, coefficients

    @classmethod
    def get_feature_docs(cls) -> FeatureDocs:
        """Return an explanation of the features and their coefficients."""
        features: FeatureDocs = {}
        _, coefficients = cls.load()
        for func in cls.FEATURES:
            name = func.__name__
            features[name] = FeatureDoc(
                description=func.__doc__,
                coefficient=float(coefficients[name]),
                url=make_github_url(func),
            )
        return features

    @classmethod
    def compare(cls, query: E, result: E, config: ScoringConfig) -> MatchingResult:
        """Use a regression model to compare two entities."""
        pipe, _ = cls.load()
        encoded = cls.encode_pair(query, result)
        npfeat = np.array([encoded])
        pred = pipe.predict_proba(npfeat)
        score = float(pred[0][1])
        explanations: Dict[str, FtResult] = {}
        for feature, coeff in zip(cls.FEATURES, encoded):
            name = feature.__name__
            explanations[name] = FtResult(score=float(coeff), detail=None)
        return MatchingResult(score=score, explanations=explanations)

    @classmethod
    def encode_pair(cls, left: E, right: E) -> Encoded:
        """Encode the comparison between two entities as a set of feature values."""
        return [f(left, right) for f in cls.FEATURES]

compare(query, result, config) classmethod

Use a regression model to compare two entities.

Source code in nomenklatura/matching/erun/model.py
@classmethod
def compare(cls, query: E, result: E, config: ScoringConfig) -> MatchingResult:
    """Use a regression model to compare two entities."""
    pipe, _ = cls.load()
    encoded = cls.encode_pair(query, result)
    npfeat = np.array([encoded])
    pred = pipe.predict_proba(npfeat)
    score = float(pred[0][1])
    explanations: Dict[str, FtResult] = {}
    for feature, coeff in zip(cls.FEATURES, encoded):
        name = feature.__name__
        explanations[name] = FtResult(score=float(coeff), detail=None)
    return MatchingResult(score=score, explanations=explanations)

encode_pair(left, right) classmethod

Encode the comparison between two entities as a set of feature values.

Source code in nomenklatura/matching/erun/model.py
@classmethod
def encode_pair(cls, left: E, right: E) -> Encoded:
    """Encode the comparison between two entities as a set of feature values."""
    return [f(left, right) for f in cls.FEATURES]
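The compare and encode_pair methods can be pictured as follows: each feature function maps the pair to a number, the resulting vector goes to a classifier, and the probability of the positive class becomes the score. The sketch below is a stdlib-only approximation. It assumes a logistic model, whereas the real pickled scikit-learn Pipeline may contain other steps. The two feature functions, the coefficients, and the intercept are all made up.

```python
import math
from typing import Callable, Dict, List, Tuple

Entity = Dict[str, str]  # stand-in for an entity object


def name_equal(left: Entity, right: Entity) -> float:
    """Hypothetical feature: 1.0 when the names are identical."""
    return 1.0 if left.get("name") == right.get("name") else 0.0


def country_mismatch(left: Entity, right: Entity) -> float:
    """Hypothetical feature: 1.0 when both sides name different countries."""
    lc, rc = left.get("country"), right.get("country")
    return 1.0 if lc and rc and lc != rc else 0.0


FEATURES: List[Callable[[Entity, Entity], float]] = [name_equal, country_mismatch]
# Made-up coefficients and intercept; a trained model would supply these.
COEFFICIENTS = {"name_equal": 4.0, "country_mismatch": -2.0}
INTERCEPT = -1.5


def encode_pair(left: Entity, right: Entity) -> List[float]:
    """One feature value per function, in FEATURES order."""
    return [f(left, right) for f in FEATURES]


def compare(query: Entity, result: Entity) -> Tuple[float, Dict[str, float]]:
    encoded = encode_pair(query, result)
    logit = INTERCEPT + sum(
        COEFFICIENTS[f.__name__] * value for f, value in zip(FEATURES, encoded)
    )
    score = 1.0 / (1.0 + math.exp(-logit))  # logistic link
    # As in the listing, explanations carry feature values, not coefficients.
    explanations = {f.__name__: v for f, v in zip(FEATURES, encoded)}
    return score, explanations


score, expl = compare(
    {"name": "Acme", "country": "de"}, {"name": "Acme", "country": "fr"}
)
print(round(score, 3), expl)
```

Note that the loop variable called coeff in the source listing holds the encoded feature value for the pair, so the explanation scores describe the inputs to the model rather than its learned weights.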

get_feature_docs() classmethod

Return an explanation of the features and their coefficients.

Source code in nomenklatura/matching/erun/model.py
@classmethod
def get_feature_docs(cls) -> FeatureDocs:
    """Return an explanation of the features and their coefficients."""
    features: FeatureDocs = {}
    _, coefficients = cls.load()
    for func in cls.FEATURES:
        name = func.__name__
        features[name] = FeatureDoc(
            description=func.__doc__,
            coefficient=float(coefficients[name]),
            url=make_github_url(func),
        )
    return features

load() cached classmethod

Load a pre-trained classification pipeline for ad-hoc use.

Source code in nomenklatura/matching/erun/model.py
@classmethod
@cache
def load(cls) -> Tuple[Pipeline, Dict[str, float]]:
    """Load a pre-trained classification pipeline for ad-hoc use."""
    with open(cls.MODEL_PATH, "rb") as fh:
        matcher = pickle.loads(fh.read())
    pipe = cast(Pipeline, matcher["pipe"])
    coefficients = cast(Dict[str, float], matcher["coefficients"])
    current = [f.__name__ for f in cls.FEATURES]
    if list(coefficients.keys()) != current:
        raise RuntimeError("Model was not trained on identical features!")
    return pipe, coefficients

save(pipe, coefficients) classmethod

Store a classification pipeline after training.

Source code in nomenklatura/matching/erun/model.py
@classmethod
def save(cls, pipe: Pipeline, coefficients: Dict[str, float]) -> None:
    """Store a classification pipeline after training."""
    mdl = pickle.dumps({"pipe": pipe, "coefficients": coefficients})
    with open(cls.MODEL_PATH, "wb") as fh:
        fh.write(mdl)
    cls.load.cache_clear()
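Two details of save and load are easy to miss: save clears the cache so the next load re-reads the file, and load compares the stored coefficient keys with the feature names as ordered lists, so a reordering fails as well. The sketch below shows both with module-level functions, a temporary file, and a coefficients-only payload. These are simplifications; the real methods are classmethods that also store the pipeline.

```python
import pickle
import tempfile
from functools import cache
from pathlib import Path
from typing import Dict, List

FEATURE_NAMES: List[str] = ["name_token_overlap", "name_numbers", "dob_match"]
# Hypothetical location; the real MODEL_PATH sits in the package data directory.
MODEL_PATH = Path(tempfile.mkdtemp()) / "sketch-model.pkl"


@cache
def load() -> Dict[str, float]:
    """Read the stored coefficients and check they match FEATURE_NAMES in order."""
    with open(MODEL_PATH, "rb") as fh:
        stored = pickle.loads(fh.read())
    coefficients: Dict[str, float] = stored["coefficients"]
    if list(coefficients.keys()) != FEATURE_NAMES:
        raise RuntimeError("Model was not trained on identical features!")
    return coefficients


def save(coefficients: Dict[str, float]) -> None:
    """Write the model file, then drop the cached copy held by load()."""
    with open(MODEL_PATH, "wb") as fh:
        fh.write(pickle.dumps({"coefficients": coefficients}))
    load.cache_clear()


save({"name_token_overlap": 1.2, "name_numbers": 0.4, "dob_match": 2.0})
first = load()
save({"name_token_overlap": 1.5, "name_numbers": 0.4, "dob_match": 2.0})
print(first["name_token_overlap"], load()["name_token_overlap"])  # 1.2 1.5

# Same keys in a different order fail the check, because lists compare by position.
save({"name_numbers": 0.4, "name_token_overlap": 1.5, "dob_match": 2.0})
try:
    load()
except RuntimeError as exc:
    print(exc)  # Model was not trained on identical features!
```

As with any pickle file, the model should only be loaded from a trusted source, since unpickling can execute arbitrary code.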

nomenklatura.matching.RegressionV1

Bases: ScoringAlgorithm

A simple matching algorithm based on a regression model.

Source code in nomenklatura/matching/regression_v1/model.py
class RegressionV1(ScoringAlgorithm):
    """A simple matching algorithm based on a regression model."""

    NAME = "regression-v1"
    MODEL_PATH = DATA_PATH.joinpath(f"{NAME}.pkl")
    FEATURES: List[CompareFunction] = [
        name_match,
        name_token_overlap,
        name_numbers,
        name_levenshtein,
        phone_match,
        email_match,
        identifier_match,
        dob_matches,
        dob_year_matches,
        FtResult.unwrap(dob_year_disjoint),
        first_name_match,
        family_name_match,
        birth_place,
        gender_mismatch,
        country_mismatch,
        org_identifier_match,
        address_match,
        address_numbers,
    ]

    @classmethod
    def save(cls, pipe: Pipeline, coefficients: Dict[str, float]) -> None:
        """Store a classification pipeline after training."""
        mdl = pickle.dumps({"pipe": pipe, "coefficients": coefficients})
        with open(cls.MODEL_PATH, "wb") as fh:
            fh.write(mdl)
        cls.load.cache_clear()

    @classmethod
    @cache
    def load(cls) -> Tuple[Pipeline, Dict[str, float]]:
        """Load a pre-trained classification pipeline for ad-hoc use."""
        with open(cls.MODEL_PATH, "rb") as fh:
            matcher = pickle.loads(fh.read())
        pipe = cast(Pipeline, matcher["pipe"])
        coefficients = cast(Dict[str, float], matcher["coefficients"])
        current = [f.__name__ for f in cls.FEATURES]
        if list(coefficients.keys()) != current:
            raise RuntimeError("Model was not trained on identical features!")
        return pipe, coefficients

    @classmethod
    def get_feature_docs(cls) -> FeatureDocs:
        """Return an explanation of the features and their coefficients."""
        features: FeatureDocs = {}
        _, coefficients = cls.load()
        for func in cls.FEATURES:
            name = func.__name__
            features[name] = FeatureDoc(
                description=func.__doc__,
                coefficient=float(coefficients[name]),
                url=make_github_url(func),
            )
        return features

    @classmethod
    def compare(cls, query: E, result: E, config: ScoringConfig) -> MatchingResult:
        """Use a regression model to compare two entities."""
        pipe, _ = cls.load()
        encoded = cls.encode_pair(query, result)
        npfeat = np.array([encoded])
        pred = pipe.predict_proba(npfeat)
        score = float(pred[0][1])
        explanations: Dict[str, FtResult] = {}
        for feature, coeff in zip(cls.FEATURES, encoded):
            name = feature.__name__
            explanations[name] = FtResult(score=float(coeff), detail=None)
        return MatchingResult(score=score, explanations=explanations)

    @classmethod
    def encode_pair(cls, left: E, right: E) -> Encoded:
        """Encode the comparison between two entities as a set of feature values."""
        return [f(left, right) for f in cls.FEATURES]

compare(query, result, config) classmethod

Use a regression model to compare two entities.

Source code in nomenklatura/matching/regression_v1/model.py
@classmethod
def compare(cls, query: E, result: E, config: ScoringConfig) -> MatchingResult:
    """Use a regression model to compare two entities."""
    pipe, _ = cls.load()
    encoded = cls.encode_pair(query, result)
    npfeat = np.array([encoded])
    pred = pipe.predict_proba(npfeat)
    score = float(pred[0][1])
    explanations: Dict[str, FtResult] = {}
    for feature, coeff in zip(cls.FEATURES, encoded):
        name = feature.__name__
        explanations[name] = FtResult(score=float(coeff), detail=None)
    return MatchingResult(score=score, explanations=explanations)

encode_pair(left, right) classmethod

Encode the comparison between two entities as a set of feature values.

Source code in nomenklatura/matching/regression_v1/model.py
@classmethod
def encode_pair(cls, left: E, right: E) -> Encoded:
    """Encode the comparison between two entities as a set of feature values."""
    return [f(left, right) for f in cls.FEATURES]

get_feature_docs() classmethod

Return an explanation of the features and their coefficients.

Source code in nomenklatura/matching/regression_v1/model.py
@classmethod
def get_feature_docs(cls) -> FeatureDocs:
    """Return an explanation of the features and their coefficients."""
    features: FeatureDocs = {}
    _, coefficients = cls.load()
    for func in cls.FEATURES:
        name = func.__name__
        features[name] = FeatureDoc(
            description=func.__doc__,
            coefficient=float(coefficients[name]),
            url=make_github_url(func),
        )
    return features

load() cached classmethod

Load a pre-trained classification pipeline for ad-hoc use.

Source code in nomenklatura/matching/regression_v1/model.py
@classmethod
@cache
def load(cls) -> Tuple[Pipeline, Dict[str, float]]:
    """Load a pre-trained classification pipeline for ad-hoc use."""
    with open(cls.MODEL_PATH, "rb") as fh:
        matcher = pickle.loads(fh.read())
    pipe = cast(Pipeline, matcher["pipe"])
    coefficients = cast(Dict[str, float], matcher["coefficients"])
    current = [f.__name__ for f in cls.FEATURES]
    if list(coefficients.keys()) != current:
        raise RuntimeError("Model was not trained on identical features!")
    return pipe, coefficients

save(pipe, coefficients) classmethod

Store a classification pipeline after training.

Source code in nomenklatura/matching/regression_v1/model.py
@classmethod
def save(cls, pipe: Pipeline, coefficients: Dict[str, float]) -> None:
    """Store a classification pipeline after training."""
    mdl = pickle.dumps({"pipe": pipe, "coefficients": coefficients})
    with open(cls.MODEL_PATH, "wb") as fh:
        fh.write(mdl)
    cls.load.cache_clear()

nomenklatura.matching.LogicV1

Bases: HeuristicAlgorithm

A rule-based matching system that generates a set of basic scores via name and identifier-based matching, and then qualifies that score using supporting or contradicting features of the two entities.

This algorithm has been superseded by logic-v2 and is no longer recommended for new integrations.

Source code in nomenklatura/matching/logic_v1/model.py
class LogicV1(HeuristicAlgorithm):
    """A rule-based matching system that generates a set of basic scores via
    name and identifier-based matching, and then qualifies that score using
    supporting or contradicting features of the two entities.

    This algorithm has been superseded by logic-v2 and is no longer
    recommended for new integrations."""

    NAME = "logic-v1"
    features = [
        Feature(func=name_literal_match, weight=1.0),
        Feature(func=person_name_jaro_winkler, weight=0.8),
        Feature(func=person_name_phonetic_match, weight=0.9),
        Feature(func=name_fingerprint_levenshtein, weight=0.9),
        # These are there so they can be enabled using custom weights:
        Feature(func=name_metaphone_match, weight=FNUL),
        Feature(func=name_soundex_match, weight=FNUL),
        Feature(func=address_entity_match, weight=0.98),
        Feature(func=crypto_wallet_address, weight=0.98),
        Feature(func=isin_security_match, weight=0.98),
        Feature(func=lei_code_match, weight=0.95),
        Feature(func=ogrn_code_match, weight=0.95),
        Feature(func=vessel_imo_mmsi_match, weight=0.95),
        Feature(func=inn_code_match, weight=0.95),
        Feature(func=bic_code_match, weight=0.95),
        Feature(func=identifier_match, weight=0.85),
        Feature(func=weak_alias_match, weight=0.8),
        Feature(func=country_mismatch, weight=-0.2, qualifier=True),
        Feature(func=last_name_mismatch, weight=-0.2, qualifier=True),
        Feature(func=dob_year_disjoint, weight=-0.15, qualifier=True),
        Feature(func=dob_day_disjoint, weight=-0.2, qualifier=True),
        Feature(func=gender_mismatch, weight=-0.2, qualifier=True),
        Feature(func=orgid_disjoint, weight=-0.2, qualifier=True),
        Feature(func=numbers_mismatch, weight=-0.1, qualifier=True),
    ]

    @classmethod
    def compute_score(
        cls, scores: Dict[str, float], weights: Dict[str, float]
    ) -> float:
        mains: List[float] = []
        for feat in cls.features:
            if feat.qualifier:
                continue
            weight = scores.get(feat.name, FNUL) * weights.get(feat.name, FNUL)
            mains.append(weight)
        score = max(mains)
        if score == FNUL:
            return score
        for feat in cls.features:
            if not feat.qualifier:
                continue
            weight = scores.get(feat.name, FNUL) * weights.get(feat.name, FNUL)
            score += weight
        return score
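The features listed with a weight of FNUL contribute nothing by default, because their score is multiplied by zero. The sketch below shows how a custom weight switches one on. It reuses the compute_score logic over a subset of the feature table, assumes FNUL is 0.0, and builds the custom weights with a dict merge. How user-supplied weights actually reach compute_score is not shown on this page, so that step is an assumption.

```python
from typing import Dict, List, NamedTuple

FNUL = 0.0  # assumption: the library's "no score" constant is zero


class Feature(NamedTuple):
    name: str
    weight: float
    qualifier: bool = False


# A subset of the logic-v1 feature table, with the weights listed above.
FEATURES: List[Feature] = [
    Feature("name_literal_match", 1.0),
    Feature("person_name_jaro_winkler", 0.8),
    Feature("name_metaphone_match", FNUL),  # disabled unless given a weight
    Feature("last_name_mismatch", -0.2, qualifier=True),
]
DEFAULT_WEIGHTS: Dict[str, float] = {f.name: f.weight for f in FEATURES}


def compute_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Same shape as LogicV1.compute_score: best main feature plus qualifiers."""
    mains = [
        scores.get(f.name, FNUL) * weights.get(f.name, FNUL)
        for f in FEATURES
        if not f.qualifier
    ]
    score = max(mains)
    if score == FNUL:
        return score
    for f in FEATURES:
        if f.qualifier:
            score += scores.get(f.name, FNUL) * weights.get(f.name, FNUL)
    return score


# Only the metaphone feature fires for this hypothetical pair.
scores = {"name_metaphone_match": 1.0, "last_name_mismatch": 1.0}
custom = {**DEFAULT_WEIGHTS, "name_metaphone_match": 0.7}  # assumed merge step

print(compute_score(scores, DEFAULT_WEIGHTS))   # 0.0
print(round(compute_score(scores, custom), 2))  # 0.5
```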

nomenklatura.matching.NameMatcher

Bases: HeuristicAlgorithm

Deprecated in favour of ofac, which actually emulates OFAC's public Sanctions List Search behaviour. This algorithm matches on entity name using phonetic comparisons and Jaro-Winkler edit distance, vaguely based on FAQ #249, but does not reach OFAC parity.

Source code in nomenklatura/matching/name_based/model.py
class NameMatcher(HeuristicAlgorithm):
    """Deprecated in favour of `ofac`, which actually emulates OFAC's
    public Sanctions List Search behaviour. This algorithm matches on entity
    name using phonetic comparisons and Jaro-Winkler edit distance, vaguely
    based on FAQ #249, but does not reach OFAC parity."""

    NAME = "name-based"
    features = [
        Feature(func=jaro_name_parts, weight=0.5),
        Feature(func=soundex_name_parts, weight=0.5),
    ]

    @classmethod
    def compute_score(
        cls, scores: Dict[str, float], weights: Dict[str, float]
    ) -> float:
        score = 0.0
        for feat in cls.features:
            score += scores.get(feat.name, 0.0) * weights.get(feat.name, 0.0)
        return score

nomenklatura.matching.NameQualifiedMatcher

Bases: HeuristicAlgorithm

Deprecated in favour of ofac, which carries the same qualifier weights on top of a name score that actually reaches OFAC parity. Same as the name-based algorithm, but scores are reduced if a mis-match of birth dates and nationalities is found for persons, or different tax/registration identifiers are included for organizations and companies.

Source code in nomenklatura/matching/name_based/model.py
class NameQualifiedMatcher(HeuristicAlgorithm):
    """Deprecated in favour of `ofac`, which carries the same qualifier
    weights on top of a name score that actually reaches OFAC parity. Same as
    the name-based algorithm, but scores are reduced if a mis-match of birth
    dates and nationalities is found for persons, or different
    tax/registration identifiers are included for organizations and companies."""

    NAME = "name-qualified"
    features = [
        Feature(func=jaro_name_parts, weight=0.5),
        Feature(func=soundex_name_parts, weight=0.5),
        Feature(func=country_mismatch, weight=-0.1, qualifier=True),
        Feature(func=dob_year_disjoint, weight=-0.1, qualifier=True),
        Feature(func=dob_day_disjoint, weight=-0.15, qualifier=True),
        Feature(func=gender_mismatch, weight=-0.1, qualifier=True),
        Feature(func=orgid_disjoint, weight=-0.1, qualifier=True),
    ]

    @classmethod
    def compute_score(
        cls, scores: Dict[str, float], weights: Dict[str, float]
    ) -> float:
        score = 0.0
        for feat in cls.features:
            score += scores.get(feat.name, 0.0) * weights.get(feat.name, 0.0)
        return score
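For comparison, the sketch below runs the same invented feature scores through the name-based and name-qualified weight tables listed above. Both use a plain weighted sum, so the only difference is the qualifier penalties. `weighted_sum` is an illustrative helper that iterates over the weights dict rather than over cls.features.

```python
from typing import Dict

# Weights copied from the two feature lists above.
NAME_BASED: Dict[str, float] = {"jaro_name_parts": 0.5, "soundex_name_parts": 0.5}
NAME_QUALIFIED: Dict[str, float] = {
    **NAME_BASED,
    "country_mismatch": -0.1,
    "dob_year_disjoint": -0.1,
    "dob_day_disjoint": -0.15,
    "gender_mismatch": -0.1,
    "orgid_disjoint": -0.1,
}


def weighted_sum(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Sum of score times weight for every feature that has a weight."""
    return sum(scores.get(name, 0.0) * weight for name, weight in weights.items())


scores = {"jaro_name_parts": 0.9, "soundex_name_parts": 1.0, "dob_year_disjoint": 1.0}
print(round(weighted_sum(scores, NAME_BASED), 2))      # 0.95
print(round(weighted_sum(scores, NAME_QUALIFIED), 2))  # 0.85
```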