Fuzzy Matching Salesforce Leads to Accounts with Python

Common problem in Salesforce world: you want to bulk convert some Leads but you don’t know which ones are already related to existing Accounts. You can’t just bulk convert and create tons of duplicate accounts, and you have way too many leads to manually search for and create each account if needed. I recently encountered this problem and used a bit of Python and a great fuzzy matching library, FuzzyWuzzy, to solve it.

There are many ways to solve this problem, ranging from 3rd party AppExchange products to custom Apex classes. However there is a major limitation of Apex and that is support for pattern matching. So for me, using Python and Dataloader is the simplest path and allows me to use great tools like FuzzyWuzzy.

The script will try a couple of different ways to match the Lead to the account:

It will extract the domain from the Account.Website field and compare it to the domain from the Lead.Email field. An improvement would be to also check the Lead.Website field against the Account.Website field. If these match, we skip the next step.
It will perform a fuzzy match on the Account.Name and Lead.Company field using FuzzyWuzzy’s token_set_ratio function. You can play with different match functions and ratio thresholds, but I found this one to work best. Below the 100% threshold I had too many false positives.

Here’s the basic steps to make it work.

First, add a custom lookup field to your Lead object called ‘Related Account’. When we match an existing Account to a Lead, we will ultimately store the match here.
Next, extract all of your Account data to a CSV file using the Salesforce DataLoader. You need the Id, Name and URL, at a minimum.
Extract the Leads you want to match into another CSV, and include the Id, Name and Email.
Run the script below and redirect the output to a CSV which will contain the Lead ID and the matched Account ID.
Use DataLoader to update the Lead objects, setting the new custom field you created with the Account Id that matched to the Lead.
You are now ready to bulk convert using some Apex scripting. A good overview of how to do that is to review the Database.convertLead() Apex method. This makes it east for your Apex code to check for a value in the Lead.Related Account field – and merge instead of creating a new Account.

Here’s the example Python script below.

#/usr/bin/python3

import csv 
from urllib.parse import urlparse
from fuzzywuzzy import fuzz 

#function to extract the domain name from an email address (foo@foobar.com > foobar.com)
def extractDomainFromEmail(email):
   return email.split('@')[1]            

#extracts the domain name from a web site URL (https://www.foobar.com > foobar.com)
def extractDomainFromWebsite(website):
   t = urlparse(website).netloc
   return '.'.join(t.split('.')[-2:])

#extracts from your SF instance. Make sure to include Ids, Name, Email and Website fields for Accounts and Leads
accounts_file = 'accounts-extract.csv'
leads_file = 'leads-extract.csv'

dict_accounts = []
dict_match = []

#just used to keep track of how many matches we get
match_count = 0

#read accounts to a dictionary
with open(accounts_file, mode='r') as af:
   accounts_dict = csv.DictReader(af)
   
   for row in accounts_dict:
      
      #as we read in the rows, strip the web site field to a domain if present
      website = str(row['Website'])
      
      if (website != ''):
         row['Website'] = extractDomainFromWebsite(website)
      
      dict_accounts.append(row)

#iterate each lead, trying to find a match on the account
with open(leads_file, mode='r') as lf:
   
   leads_dict = csv.DictReader(lf)
   
   for leadRow in leads_dict:

      leadMatch = False
      
      #extract the email domain of the lead to check against the account's domain
      emailDomain = extractDomainFromEmail(leadRow['Email'])
            
      for accountRow in dict_accounts:

         matched = False
         
         #try to match the email domain to the URL for the account 
         if (emailDomain ==  accountRow['Website']):
            matched = True
         else:    
            #perform a fuzzy match on the Company field vs. Account Name
            ratio = (fuzz.token_set_ratio(str.lower(leadRow['Company']),str.lower(accountRow['Name'])))
         
            if (ratio == 100 ):
               matched = True
               
            #if you want to see what some near matches look like, uncomment this 
            #elif (ratio > 95):
            #  print('No Match: Ratio:'+str(ratio)+' - '+leadRow['Name'])

         if(matched):
            match_count += 1
            leadMatch = True
            break
         
      if(leadMatch == True):
         #uncomment to view the actual matched row
         #print('Matched Lead:'+leadRow['Name']+' - '+leadRow['Company']+' - Account Name: '+accountRow['Name'])
         
         # add a row to the output dictionary with the Lead ID and Account ID to update.
         matched_dict = {'LeadId': leadRow['Id'],'AccountId': accountRow['Id']}  
         dict_match.append(matched_dict)

#uncomment to see how many matched
#print('Matched: '+str(match_count))

#print the csv file
print('LeadId,AccountID')

for outRow in dict_match:
   print(outRow['LeadId']+','+outRow['AccountId'])