Match company names with Python or Alteryx

Matching company names is a common yet significant challenge in data analysis. Although a company typically has one registered name, such as “Google LLC,” it may appear under various names in different data sources, such as:

Google
Google Inc. (The original name before the company restructured under Alphabet Inc. in 2015)
Google Ireland Limited (The entity responsible for Google’s operations in Europe)
Google UK Limited (The entity responsible for Google’s operations in the United Kingdom)
…

Due to the different conventions and versions of company names, matching them can become a tedious task.

Exact text matching wasn’t feasible because naming conventions vary between sources, as in the example lists below:

List 1	List 2
Credit Agricole SARL	Crédit Agricole
SA Constructeur Idéal	Constructeur Idéal S.A.
DL Partners Limited	DL Partners LLC
…	…

I identified several common reasons for mismatches:

Special characters (e.g., different accents in French)
Different ways of indicating legal form (e.g., “Limited” vs. “LLC”)
Upper and lower case variations
Leading and trailing spaces
Punctuation differences (e.g., “SA” vs. “S.A.”)
Presence or absence of the legal form (e.g., “Google” vs. “Google LLC”)
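
To make these mismatch reasons concrete, here is a minimal sketch that handles each of them by hand using only the Python standard library. The function name `normalize_company_name` and the short `LEGAL_FORMS` set are my own illustrative assumptions, not part of either tool discussed below, and a real list of legal forms would need to be much longer.

```python
import re
import unicodedata

# Hypothetical, deliberately short list of legal-form tokens; extend it for real data.
LEGAL_FORMS = {"sa", "sarl", "llc", "ltd", "limited", "inc", "plc", "gmbh"}


def normalize_company_name(name: str) -> str:
    """Reduce a company name to a rough 'fundamental part' for comparison."""
    # Special characters: decompose accented letters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Upper and lower case variations.
    text = text.casefold()
    # Punctuation: remove dots so "S.A." becomes "sa", turn other punctuation into spaces.
    text = text.replace(".", "")
    text = re.sub(r"[^\w\s]", " ", text)
    # Leading/trailing spaces and legal forms: split on whitespace, drop legal-form tokens.
    tokens = [token for token in text.split() if token not in LEGAL_FORMS]
    return " ".join(tokens)


pairs = [
    ("Credit Agricole SARL", "Crédit Agricole"),
    ("SA Constructeur Idéal", "Constructeur Idéal S.A."),
    ("DL Partners Limited", "DL Partners LLC"),
]
for left, right in pairs:
    print(normalize_company_name(left), "|", normalize_company_name(right))
```

A hand-written cleaner like this only produces exact matches after normalization. It does not cope with typos or reordered words, which is where fuzzy matching tools come in.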

So, is there a way to quickly clean up and match these names based on the “fundamental part” of the name?

I found two simple solutions—by “simple,” I mean you can use a packaged solution without addressing each mismatch reason individually.

One solution is with a Python package called “name_matching” and the other is with Alteryx.

Let me guide you through the basics of each solution using a test sample:

“name_matching” package in Python

This package is developed by De Nederlandsche Bank (the Dutch central bank), and I discovered it through this GitHub repository.

You can find all the details about this repository in a Medium article.

To test the package, I used the sample names provided in the repository’s example, so I won’t repeat the setup here.

The output of the sample lists the matched names and their corresponding matching scores.

Alteryx

If you’re unfamiliar with Alteryx, it is software designed to make advanced analytics automation accessible to any data worker. Using its drag-and-drop interface, I configured the “Fuzzy Match” tool to match company names.

Like the Python package, Alteryx returns the matching results along with a match score.

The matching score differs between the two tools. For instance, the pair “Bank of China Limited” and “Bank of China” has a matching score of 100% in the “name_matching” package but 87% in the Alteryx output. This suggests that the two tools handle suffixes such as “Limited” differently.
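
To illustrate how suffix handling alone can move a score, here is a small sketch using `difflib.SequenceMatcher` from the Python standard library. This is a stand-in metric chosen for illustration: I am not claiming that either “name_matching” or Alteryx computes its score this way, and the `LEGAL_SUFFIXES` tuple and helper names are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical list of suffixes to ignore; not taken from either tool.
LEGAL_SUFFIXES = ("limited", "ltd", "llc", "inc")


def similarity(a: str, b: str) -> float:
    """Character-based similarity between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def strip_legal_suffix(name: str) -> str:
    """Drop trailing legal-form tokens such as 'Limited' or 'Inc.'."""
    tokens = name.split()
    while tokens and tokens[-1].casefold().rstrip(".") in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


name_a, name_b = "Bank of China Limited", "Bank of China"

# Score on the raw names: the extra word "Limited" counts against the match.
raw_score = similarity(name_a, name_b)

# Score after removing legal suffixes: the two names become identical.
stripped_score = similarity(strip_legal_suffix(name_a), strip_legal_suffix(name_b))

print(f"raw: {raw_score:.2f}, suffix-stripped: {stripped_score:.2f}")
```

With this stand-in metric, the raw pair scores below 1.0 while the suffix-stripped pair scores exactly 1.0, which is the same kind of gap as the one observed between the two tools.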

Which tool is better? Personally, I prefer the “name_matching” package because it allows you to adjust parameters to meet specific requirements (you can find details on adjustable parameters in the Medium article).

However, Alteryx is extremely user-friendly and is also a strong alternative.

It’s important to emphasize that while the matching methods described above can significantly improve productivity in finding matching pairs, they are not always perfect.

For example, two names, “Laboratoire ABC” and “Laboratoire BCD”, might receive a high matching score simply because they both contain the word “Laboratoire,” even though they may be unrelated companies.

Therefore, a manual review is always necessary to identify and address such exceptions.
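
As a rough sketch of what such a review step could look like, the snippet below scores candidate pairs with `difflib.SequenceMatcher` (again a stand-in metric, not what either tool uses) and queues every pair above a cut-off for a human to check. The `REVIEW_THRESHOLD` value and the `triage` helper are hypothetical and would need tuning on real data.

```python
from difflib import SequenceMatcher

# Hypothetical cut-off: pairs scoring at or above it go to a human reviewer.
REVIEW_THRESHOLD = 0.8


def score(a: str, b: str) -> float:
    """Character-based similarity between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def triage(pairs, threshold=REVIEW_THRESHOLD):
    """Split candidate pairs into a manual-review queue and a discard pile."""
    to_review, discarded = [], []
    for a, b in pairs:
        s = score(a, b)
        target = to_review if s >= threshold else discarded
        target.append((a, b, round(s, 2)))
    return to_review, discarded


candidates = [
    ("Laboratoire ABC", "Laboratoire BCD"),  # shares a long common word
    ("Laboratoire ABC", "Google LLC"),       # clearly unrelated
]
to_review, discarded = triage(candidates)

for a, b, s in to_review:
    print(f"REVIEW  {a!r} vs {b!r}: {s}")
```

The “Laboratoire” pair lands in the review queue because of its high score, which is exactly the kind of candidate a person needs to confirm or reject.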
