Matching company names is a common yet significant challenge in data analysis. Although a company typically has one registered name, such as “Google LLC,” it may appear under various names in different data sources, for example:
- Google Inc. (The name used until 2017, when the company converted to an LLC after restructuring under Alphabet Inc. in 2015)
- Google Ireland Limited (The entity responsible for Google’s operations in Europe)
- Google UK Limited (The entity responsible for Google’s operations in the United Kingdom)
- …
Because naming conventions and versions differ, matching company names can become a tedious task.
In my case, exact free-text matching wasn’t feasible because of naming variations like those in the example lists below:
| List 1 | List 2 |
| --- | --- |
| Credit Agricole SARL | Crédit Agricole |
| SA Constructeur Idéal | Constructeur Idéal S.A. |
| DL Partners Limited | DL Partners LLC |
| … | … |
I identified several common reasons for mismatches:
- Special characters (e.g., different accents in French)
- Different ways of indicating legal form (e.g., “Limited” vs. “LLC”)
- Upper and lower case variations
- Leading and trailing spaces
- Punctuation differences (e.g., “SA” vs. “S.A.”)
- Presence or absence of the legal form (e.g., “Google” vs. “Google LLC”)
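To make these reasons concrete, here is a minimal sketch of what handling each one by hand might look like, using only the Python standard library. The function name `normalize_company_name` and the `LEGAL_FORMS` set are hypothetical and deliberately short; this is not how either tool discussed in this article works internally.

```python
import re
import unicodedata

# Hypothetical, deliberately short list of legal-form tokens; extend for real data.
LEGAL_FORMS = {"sa", "sarl", "sas", "llc", "ltd", "limited", "inc", "gmbh", "plc"}


def normalize_company_name(name: str) -> str:
    """Reduce a company name to a comparable 'fundamental part'."""
    # Special characters: decompose accented letters, then drop the accent marks.
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Upper and lower case variations, leading and trailing spaces.
    text = text.casefold().strip()
    # Punctuation: remove dots so "s.a." becomes "sa", turn other marks into spaces.
    text = text.replace(".", "")
    text = re.sub(r"[^\w\s]", " ", text)
    # Legal form: drop any token that appears in the legal-form list.
    tokens = [token for token in text.split() if token not in LEGAL_FORMS]
    return " ".join(tokens)


pairs = [
    ("Credit Agricole SARL", "Crédit Agricole"),
    ("SA Constructeur Idéal", "Constructeur Idéal S.A."),
    ("DL Partners Limited", "DL Partners LLC"),
]
for left, right in pairs:
    same = normalize_company_name(left) == normalize_company_name(right)
    print(f"{left!r} vs {right!r}: {same}")
```

Note that dropping the legal form treats “DL Partners Limited” and “DL Partners LLC” as the same company, which is an assumption that may not hold if they are distinct entities. A hand-written list like this also has to be maintained for every new country and convention.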
So, is there a way to quickly clean up and match these names based on the “fundamental part” of the name?
I found two simple solutions. By “simple,” I mean that you can use a packaged solution without addressing each mismatch reason individually.
One uses a Python package called “name_matching”; the other uses Alteryx.
Let me guide you through the basics of each solution using a test sample:
“name_matching” package in Python
This package is developed by “De Nederlandsche Bank” (the Dutch central bank), and I discovered it through this GitHub repository.
You can find all the details about this repository in a Medium article.
To test the package, I used the sample names provided in the repository’s example. The full setup is documented there, so I won’t repeat it here.
Below is the output of the sample:
You can see the matched names and their corresponding matching scores in the results.
Alteryx
In case you’re unfamiliar with Alteryx, it’s software designed to make advanced analytics automation accessible to any data worker through a user-friendly drag-and-drop interface. I used its “Fuzzy Match” tool and configured it to match company names:
Below is the output of the sample:
Like the Python package output shown above, the Alteryx output provides the matched pairs along with a match score.
As you can see, the “matching score” differs between the two tools. For instance, the pair “Bank of China Limited” and “Bank of China” has a matching score of 100% in the “name_matching” package but 87% in the Alteryx output. This suggests that the tools handle legal-form suffixes such as “Limited” differently.
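The effect of suffix handling on a score can be illustrated with a small sketch. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure, so it will not reproduce the exact scores of either tool; the helper names and the `LEGAL_SUFFIXES` tuple are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical suffix list for illustration only.
LEGAL_SUFFIXES = ("limited", "ltd", "llc", "inc")


def strip_legal_suffix(name: str) -> str:
    """Lowercase the name and drop one trailing legal-form word, if present."""
    words = name.lower().split()
    if words and words[-1].rstrip(".") in LEGAL_SUFFIXES:
        words = words[:-1]
    return " ".join(words)


def similarity(a: str, b: str, ignore_suffix: bool) -> float:
    """Character-level similarity in [0, 1] based on difflib's ratio."""
    if ignore_suffix:
        a, b = strip_legal_suffix(a), strip_legal_suffix(b)
    else:
        a, b = a.lower(), b.lower()
    return SequenceMatcher(None, a, b).ratio()


raw = similarity("Bank of China Limited", "Bank of China", ignore_suffix=False)
stripped = similarity("Bank of China Limited", "Bank of China", ignore_suffix=True)
print(f"suffix kept: {raw:.2f}, suffix ignored: {stripped:.2f}")
```

With the suffix ignored, the two names become identical and score 1.0; with the suffix kept, the extra word lowers the score. A tool that removes legal forms before scoring would therefore be expected to report a higher score for this pair than one that does not.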
Which tool is better? Personally, I prefer the “name_matching” package because it allows you to adjust parameters to meet specific requirements (you can find details on adjustable parameters in the Medium article).
However, Alteryx is extremely user-friendly and is also a strong alternative.
It’s important to emphasize that while the matching methods described above can significantly improve productivity in finding matching pairs, they are not always perfect.
For example, two names, “Laboratoire ABC” and “Laboratoire BCD,” might receive a high matching score simply because they both contain the word “Laboratoire,” even though the companies may be unrelated.
Therefore, a manual review is always necessary to identify and address such exceptions.
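As a rough illustration of this pitfall, the sketch below scores the “Laboratoire” pair with `difflib.SequenceMatcher` (a standard-library stand-in, not the scoring used by either tool) and sorts pairs into buckets to help prioritize the manual review. The `triage` function and the `REVIEW_FLOOR` threshold are hypothetical and would need tuning on your own data.

```python
from difflib import SequenceMatcher
from typing import Tuple

# Hypothetical threshold; tune it on a labelled sample of your own data.
REVIEW_FLOOR = 0.80


def triage(name_a: str, name_b: str) -> Tuple[float, str]:
    """Score a pair of names and suggest how closely a human should look at it."""
    score = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    if score == 1.0:
        decision = "identical"
    elif score >= REVIEW_FLOOR:
        decision = "review"
    else:
        decision = "unlikely"
    return score, decision


examples = [
    ("Laboratoire ABC", "Laboratoire BCD"),
    ("Bank of China", "Bank of China"),
    ("Google", "Alteryx"),
]
for name_a, name_b in examples:
    score, decision = triage(name_a, name_b)
    print(f"{name_a!r} vs {name_b!r}: {score:.2f} -> {decision}")
```

The “Laboratoire” pair scores above 0.9 here even though the distinguishing part of each name differs, which is exactly the kind of pair a reviewer should check. The buckets only order the review work; they do not replace it.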