
Dedupe Duplicates using Fuzzy / Proximity search

Last year I wrote a post about finding similar accounts for Dynamics CRM, which generated a lot of interest in the community. Understandably so, as this is a very common requirement in nearly every CRM project: duplicate accounts. CRM's duplicate detection capabilities are only basic; they just do partial matches and can't do any fuzzy or proximity matching.

Even with the latest and most advanced weaponry in CRM's armoury, i.e. Relevance Search, it is not yet at the point where it could tell that the following accounts are in fact the same companies.

Account              | Potential Duplicate | Reason
Waste Management     | Waset Manaegment    | Typo
Public Storage Co.   | Storage Public Co.  | Wrong order
Scotts Miracle-Gro   | Scott Miracles Gro  | Plural
Melbourne University | Melbourne Univ.     | Short form

I decided to improve and generalise my code a bit, so that it can be used not just for CRM but for any general requirement where you need to find duplicates based on proximity. I am going to share the code and approach in this post.

Approach

This proximity search is based on string-matching algorithms built on Edit Distance. The program starts by finding exact matches first; if it can't find an exact match, it widens the search filter to find partial and proximity matches (i.e. words in the same neighbourhood, ordered in a different way, etc.).
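To illustrate the Edit Distance idea behind this (the full code later in the post uses the fuzzywuzzy package), here is a minimal sketch using only the standard library's difflib, whose `ratio` is closely related to edit distance. The `similarity` helper is purely illustrative, not part of the final script.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Illustrative 0..100 score: fraction of matching characters,
    # computed by difflib's SequenceMatcher (edit-distance style)
    return int(round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100))

# A typo pair from the table above scores much higher than an unrelated pair
print(similarity("Waste Management", "Waset Manaegment"))   # high
print(similarity("Waste Management", "Kimberly-Clark"))     # low
```

A strict scorer like this handles typos well; catching word reordering ("Public Storage Co." vs "Storage Public Co.") is what the token-based scorers in the full code are for.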

Results

I have also attached the original files that I used during my testing, i.e. the file containing duplicates and the results file (where duplicates were found). Below is a brief snapshot of the results from my test run.

Company                   | Duplicate Found
Kimberly-Clark            | Kimberly Clark
San disk                  | SanDisk
Macy's                    | Macy
Starwood Hotels & Resorts | Starwood Hotels And Resorts
Expeditors Washington     | Expeditors International of Washington

There were some false positives in the results as well; you can adjust the thresholds of the algorithm to suit your data.
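The trade-off is easy to see with a small sketch: a loose cutoff surfaces more candidate duplicates (and more false positives), while a strict cutoff keeps only near-exact matches. The `score` and `duplicates_of` helpers below are illustrative stand-ins for the fuzzywuzzy scorers and cutoffs used in the real script.

```python
from difflib import SequenceMatcher

def score(a, b):
    # Illustrative 0..100 similarity score (stand-in for a fuzzywuzzy scorer)
    return int(round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100))

companies = ["Expeditors Washington",
             "Expeditors International of Washington",
             "Public Storage Co.",
             "Puget Sound Energy"]

def duplicates_of(name, cutoff):
    # Everything (other than the name itself) scoring at or above the cutoff
    return [c for c in companies if c != name and score(name, c) >= cutoff]

# A loose cutoff catches the "Expeditors" pair; a strict one drops it
print(duplicates_of("Expeditors Washington", 50))
print(duplicates_of("Expeditors Washington", 90))
```

Tuning is just moving that cutoff until the false positives in your own data fall away without losing the real duplicates.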

How to use

You've got a list of companies and you want to know which of them are duplicates. This is what you need to do:

1. Export the list into a CSV file.

2. Point the code to your file.

3. Run the code and it generates a new file, results.csv, with a new column called Duplicate.

Complete source code

Python is a beautiful language and does big things in just a few lines of code. Just install Python on your desktop (plus the pandas and fuzzywuzzy packages the script imports) and run the following file. No frills, no servers, no deployment. Too easy.

# ProximitySearch.py
# AUTHOR - MANNY GREWAL 2017 (https://mannygrewal.wordpress.com)
# THIS CODE WILL DO A FUZZY SEARCH FOR A SEARCH TERM INSIDE A DATABASE.
# THE PRECEDENCE OF SEARCH STARTS WITH THE NARROWEST FILTER, WHICH SLOWLY
# WIDENS UP. THE IDEA IS TO GET TO PERFECT MATCHES BEFORE NEAR MATCHES.
# EACH FILTER HAS ITS OWN THRESHOLD CUTOFF.

#IMPORT THE PACKAGES NEEDED
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import os


#DEFINE AND CONFIGURE
FULL_MATCHING_THRESHOLD = 80
PARTIAL_MATCHING_THRESHOLD = 100
SORT_MATCHING_THRESHOLD = 100
TOKEN_MATCHING_THRESHOLD = 100
MAX_MATCHES=1

#READ THE CURRENT DATABASE
companies_db = "<local path of your CSV file>/CompaniesShort.csv"
current_db_dataframe = pd.read_csv(companies_db, skiprows=1, index_col=False, names=['Company'])

def find_matches(matchThis):
    # Compare against every other company in the list
    rows = current_db_dataframe['Company'].values.tolist()
    rows.remove(matchThis)
    # Start with the strictest scorer and fall back to progressively looser ones
    matches = process.extractBests(matchThis, rows, scorer=fuzz.ratio, score_cutoff=FULL_MATCHING_THRESHOLD, limit=MAX_MATCHES)
    if len(matches) == 0:
        matches = process.extractBests(matchThis, rows, scorer=fuzz.partial_ratio, score_cutoff=PARTIAL_MATCHING_THRESHOLD, limit=MAX_MATCHES)
    if len(matches) == 0:
        matches = process.extractBests(matchThis, rows, scorer=fuzz.token_set_ratio, score_cutoff=TOKEN_MATCHING_THRESHOLD, limit=MAX_MATCHES)
    if len(matches) == 0:
        matches = process.extractBests(matchThis, rows, scorer=fuzz.token_sort_ratio, score_cutoff=SORT_MATCHING_THRESHOLD, limit=MAX_MATCHES)

    return matches[0][0] if len(matches) > 0 else None


# Run the matcher over every company and write out the results
current_db_dataframe['Duplicate'] = current_db_dataframe['Company'].apply(find_matches)

current_db_dataframe.to_csv("results.csv", index=False)
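Once results.csv is written, rows without a match can be filtered out in a couple of lines of pandas. The sketch below uses a small inline frame as a stand-in for the real file (in practice you would `pd.read_csv("results.csv")` instead).

```python
import pandas as pd

# Stand-in for results.csv: the script leaves Duplicate empty when no match is found
results = pd.DataFrame({
    "Company":   ["Kimberly-Clark", "San disk", "Unrelated Pty Ltd"],
    "Duplicate": ["Kimberly Clark", "SanDisk", None],
})

# Keep only the rows where a duplicate was actually found
dupes = results[results["Duplicate"].notna()]
print(dupes)
```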
