Fuzzy Matching Salesforce Leads to Accounts with Python

Common problem in Salesforce world: you want to bulk convert some Leads but you don’t know which ones are already related to existing Accounts. You can’t just bulk convert and create tons of duplicate accounts, and you have way too many leads to manually search for and create each account if needed. I recently encountered this problem and used a bit of Python and a great fuzzy matching library, FuzzyWuzzy, to solve it.

There are many ways to solve this problem, ranging from 3rd party AppExchange products to custom Apex classes. However there is a major limitation of Apex and that is support for pattern matching. So for me, using Python and Dataloader is the simplest path and allows me to use great tools like FuzzyWuzzy.

The script will try a couple of different ways to match the Lead to the account:

  • It will extract the domain from the Account.Website field and compare it to the domain from the Lead.Email field. An improvement would be to also check the Lead.Website field against the Account.Website field. If these match, we skip the next step.

  • It will perform a fuzzy match on the Account.Name and Lead.Company field using FuzzyWuzzy’s token_set_ratio function. You can play with different match functions and ratio thresholds, but I found this one to work best. Below the 100% threshold I had too many false positives.

Here’s the basic steps to make it work.

  • First, add a custom lookup field to your Lead object called ‘Related Account’. When we match an existing Account to a Lead, we will ultimately store the match here.

  • Next, extract all of your Account data to a CSV file using the Salesforce DataLoader. You need the Id, Name and URL, at a minimum.

  • Extract the Leads you want to match into another CSV, and include the Id, Name and Email.

  • Run the script below and redirect the output to a CSV which will contain the Lead ID and the matched Account ID.

  • Use DataLoader to update the Lead objects, setting the new custom field you created with the Account Id that matched to the Lead.

  • You are now ready to bulk convert using some Apex scripting. A good overview of how to do that is to review the Database.convertLead() Apex method. This makes it east for your Apex code to check for a value in the Lead.Related Account field – and merge instead of creating a new Account.

Here’s the example Python script below.

#/usr/bin/python3

import csv 
from urllib.parse import urlparse
from fuzzywuzzy import fuzz 

#function to extract the domain name from an email address (foo@foobar.com > foobar.com)
def extractDomainFromEmail(email):
   return email.split('@')[1]            

#extracts the domain name from a web site URL (https://www.foobar.com > foobar.com)
def extractDomainFromWebsite(website):
   t = urlparse(website).netloc
   return '.'.join(t.split('.')[-2:])

#extracts from your SF instance. Make sure to include Ids, Name, Email and Website fields for Accounts and Leads
accounts_file = 'accounts-extract.csv'
leads_file = 'leads-extract.csv'

dict_accounts = []
dict_match = []

#just used to keep track of how many matches we get
match_count = 0

#read accounts to a dictionary
with open(accounts_file, mode='r') as af:
   accounts_dict = csv.DictReader(af)
   
   for row in accounts_dict:
      
      #as we read in the rows, strip the web site field to a domain if present
      website = str(row['Website'])
      
      if (website != ''):
         row['Website'] = extractDomainFromWebsite(website)
      
      dict_accounts.append(row)

#iterate each lead, trying to find a match on the account
with open(leads_file, mode='r') as lf:
   
   leads_dict = csv.DictReader(lf)
   
   for leadRow in leads_dict:

      leadMatch = False
      
      #extract the email domain of the lead to check against the account's domain
      emailDomain = extractDomainFromEmail(leadRow['Email'])
            
      for accountRow in dict_accounts:

         matched = False
         
         #try to match the email domain to the URL for the account 
         if (emailDomain ==  accountRow['Website']):
            matched = True
         else:    
            #perform a fuzzy match on the Company field vs. Account Name
            ratio = (fuzz.token_set_ratio(str.lower(leadRow['Company']),str.lower(accountRow['Name'])))
         
            if (ratio == 100 ):
               matched = True
               
            #if you want to see what some near matches look like, uncomment this 
            #elif (ratio > 95):
            #  print('No Match: Ratio:'+str(ratio)+' - '+leadRow['Name'])

         if(matched):
            match_count += 1
            leadMatch = True
            break
         
      if(leadMatch == True):
         #uncomment to view the actual matched row
         #print('Matched Lead:'+leadRow['Name']+' - '+leadRow['Company']+' - Account Name: '+accountRow['Name'])
         
         # add a row to the output dictionary with the Lead ID and Account ID to update.
         matched_dict = {'LeadId': leadRow['Id'],'AccountId': accountRow['Id']}  
         dict_match.append(matched_dict)

#uncomment to see how many matched
#print('Matched: '+str(match_count))

#print the csv file
print('LeadId,AccountID')

for outRow in dict_match:
   print(outRow['LeadId']+','+outRow['AccountId'])
         

B2B Revenue Attribution for SEM with Tableau, Python and Machine Learning.

A common challenge for B2B marketers is how to attribute revenue to a search program. Unlike typical direct marketing tactics, the path to B2B revenue often involves many channels, programs, stakeholders and decision makers.

There’s usually not a direct path from a specific digital touch to a sale, which can make long-term SEM programs difficult to justify with revenue contribution. This leaves search marketers with but clicks, search rankings and impressions, which rarely tell a compelling story about marketing’s contribution to revenue.

A client approached us with an interesting proposal: he wanted to demonstrate SEO’s contribution to pipeline and sales. With no direct connection between search impressions, clicks, and ultimate a sale, we had to come up with a solution that relied on statistical inference.

Our solution was to blend a few data sources:

  • Web traffic and conversion data from Adobe Analytics which contained a unique customer identifier on each page as well as each URL touched by that unique customer. This identifier was also passed to salesforce anytime a web conversion occurred, i.e. a form submission.
  • A CSV file of managed SEO terms and preferred landing page URLs from a vendor tool.
  • CRM data from Salesforce.com, which included a unique customer identifier along with all revenue activity. This allowed us to get a file which had a URL and the last-click revenue associated with that URL.
  • Google Search Console data, which allowed us to view search entrances per URL.

The basic application logic worked like this:

  • Map all the data sources together using the unique customer key and load into a temp working source.
  • Grab the top keywords, click and impression data for the URL via the Google Search Console REST API into another working source
  • Classify the keywords into brand, non-brand, product, and service categories using machine learning algorithms. We used TextBlob and FuzzyWuzzy python libraries for this.
  • Visualize and analyze the data using an analytics platorm (Tableau).

We now could drill down into each search query and see the revenue opportunities associated with it.  This made it easy to identify revenue opportunities in an interactive, visual way.