10 common ways your revops data enrichment might be failing
Data enrichment solves many data problems, except it creates some new issues to solve. Read: "Everything Starts Out Looking Like a Toy" #159
Hi, I’m Greg 👋! I write weekly product essays, including system “handshakes”, the expectations for workflow, and the jobs to be done for data. What is Data Operations? is a post that grew into Data & Ops, a team to help you with product, data, and operations.
This week’s toy: a directory of custom Slack emoji responses. If you’re looking for :rick-roll:, they’ve got you covered. This is a great example of finding small amounts of joy in an everyday place.
Edition 159 of this newsletter is here - it’s August 21, 2023.
If you have a comment or are interested in sponsoring, hit reply.
Data Operations (“Everything Starts Out Looking Like a Toy”) is a reader-supported publication. Please consider subscribing.
The Big Idea
A short long-form essay about data things
⚙️ 10 common ways your revops data enrichment might be failing
Picture this: you have a million contact records to fix and need to find a title match based on email and determine the seniority of the contact. Or perhaps you’re focused on accounts and have tens or hundreds of thousands of companies to enrich to get to a Minimum Viable Record ready to be used in your data environment. In either case, you need a system of data enrichment that takes the first-party data you already know and appends second-party or third-party data so each record meets your requirements.
Data enrichment is one of the key tools at your disposal, enabling you to use an existing dataset to add data columns to your records based on a known key. For contacts, this is likely to be an email address or a LinkedIn URL; for accounts, it might be some combination of the company name, website, and country. If you use an external source like ZoomInfo or Clearbit, you can use their existing ID to re-enrich with new information collected by those teams.
This all works great when there are no problems with data. But since that’s a silly proposition (there are always some problems with data), let’s talk about 10 of the most common data enrichment problems you’re likely to have and some approaches to resolve or remediate them.
Problem 1: Ambiguous Matches for Common Names
When you get an exact match for a person or a company, great! But what happens when you have two records that seem pretty similar, like a contact record with the same first name, last name, and company but a different email address? Or you might find a company record for the same company in a different geographic region, and need to determine if it’s the same account.
For people, companies, or other objects, it will help you to determine a unique compound key for that record. For example, email + first + last name + country might be enough for a person, while website + name + country might be enough for an account. If you do this you’ll need to handle common changes, like when someone changes their email and you need to decide whether to create a new contact or update an existing contact. Either way, you’ll need a standard way of handling identity resolution.
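Here’s a minimal sketch of what a compound key could look like in Python. The function names (`contact_key`, `account_key`) and the normalization rules are my own assumptions for illustration, not a standard; your own rules for trimming, casing, and domains may differ.

```python
from urllib.parse import urlparse


def _norm(value):
    """Lowercase, trim, and collapse internal whitespace; None becomes ''."""
    return " ".join((value or "").lower().split())


def contact_key(email, first_name, last_name, country):
    """Compound key for a person: email + first + last name + country."""
    return "|".join(_norm(v) for v in (email, first_name, last_name, country))


def account_key(website, name, country):
    """Compound key for an account: website host + name + country."""
    raw = _norm(website)
    # Reduce the website to a bare host so https://www.acme.com/about and
    # acme.com produce the same key (an assumed rule, adjust as needed).
    host = urlparse(raw if "://" in raw else "//" + raw).netloc
    if host.startswith("www."):
        host = host[4:]
    return "|".join([host, _norm(name), _norm(country)])
```

Two records with the same key are candidates for the same identity. A changed email produces a different contact key, which is exactly the case where you still need a policy for "new contact or update".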
Problem 2: Outdated External Sources
Many enrichment projects start like this: “let’s update all the records, and kick off the project”, with the knowledge that new records will be enriched on a schedule. Existing records might get less attention unless the enrichment date is captured and a re-enrichment date is automated. Great, you say - let’s make sure every record is enriched every N months.
But if you think about the real value of enrichment and getting the best data possible in the place where it will do the most good, simply placing a date field on the record and having an automated cycle to re-enrich might not be the best choice. You might want to prioritize recent enrichment for accounts that have recent product activity or contacts that are hand-raisers. One way to do this is to combine the qualification process from SDRs with light data improvement. (Even clarifying the contact’s title goes a long way to making them feel like you know what’s going on.)
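One way to picture prioritized re-enrichment is a simple score instead of a flat date rule. This is a sketch under assumptions: the weights, the 30-day activity window, and the function name `enrichment_priority` are all made up for illustration.

```python
from datetime import date


def enrichment_priority(last_enriched, last_product_activity, is_hand_raiser, today):
    """Return a score; higher means re-enrich sooner. Weights are illustrative."""
    # Staleness contributes 0..1, capped at one year. Never-enriched counts as max.
    staleness_days = (today - last_enriched).days if last_enriched else 365
    score = min(staleness_days, 365) / 365
    # Recent product activity (assumed window: 30 days) bumps the record up.
    if last_product_activity and (today - last_product_activity).days <= 30:
        score += 1.0
    # Hand-raisers get the same bump.
    if is_hand_raiser:
        score += 1.0
    return score
```

With these weights, a hand-raiser who was enriched last month still outranks a dormant record that is two years stale. Sort your queue by the score, descending, and spend effort from the top.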
Problem 3: Conflicting Information
Even when you use only one enrichment source (most organizations use more than one), you will receive conflicting information. When you get new information, what wins? If you store the source and the time each value was received, you’ll be able to use that history to decide.
Which source wins in a conflict of information? Usually, the closest to the original. When a person enters their new phone number into your system, it’s probably the best phone number. If you haven’t talked to someone before and you get a number attributed to them, you might not trust that number until you make a call and validate that they answered. There’s an important edge case here - when you are dealing with company names, you will often find many aliases (e.g. for JP Morgan Chase, you might hear JPMorgan, JPMC, and others) so you may need a separate canonical field for data like “Legal Company Name” in addition to a friendly company name.
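As a sketch of "closest to the original wins", you could store every received value with its source and timestamp and pick a winner by source rank, then recency. The source names and their ranking in `SOURCE_RANK` are assumptions for illustration; yours will depend on which sources you trust.

```python
from dataclasses import dataclass
from datetime import datetime

# Assumed trust order: higher number wins. Unknown sources rank lowest.
SOURCE_RANK = {"self_reported": 3, "sales_verified": 2, "vendor": 1}


@dataclass(frozen=True)
class Observation:
    value: str
    source: str
    received_at: datetime


def pick_winner(observations):
    """Prefer the most trusted source; break ties with the newest value."""
    return max(
        observations,
        key=lambda o: (SOURCE_RANK.get(o.source, 0), o.received_at),
    )
```

Because the full history is kept, changing the trust order later only means re-running `pick_winner`, not re-enriching.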
Problem 4: Loss of Data Granularity
Enriching data also includes matching or appending product activity to accounts and contacts. When you do this, you need to consider the time grain of the information you’re adding and confirm you’re not asking that data to work too hard by interpolation. Here’s what I mean: when you have weekly or monthly data, trying to turn it into a daily metric will not be accurate. You could settle for an average daily amount, but it’s more effective to know the time frame of the metric.
Help yourself out here by labeling your metrics clearly. If you’re counting activity in the last 7 days, label it accordingly. If you’re capturing a weekly, monthly, yearly, or cumulative rollup, use a convention and name it so that the next person knows what they’re looking at. Scaling metrics up (monthly to yearly) is fine; inferring a smaller time grain from a larger one is fraught with peril.
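A small sketch of that convention: put the grain in the metric name, and refuse to derive a finer grain from a coarser one. The grain names, the suffix convention, and the helper names are assumptions; the day counts are only used to order the grains.

```python
# Approximate day counts, used only to compare grains (assumed values).
GRAIN_DAYS = {"daily": 1, "weekly": 7, "monthly": 30, "yearly": 365}


def metric_name(base, grain):
    """Name a metric with its time grain, e.g. logins_weekly."""
    if grain not in GRAIN_DAYS:
        raise ValueError(f"unknown grain: {grain}")
    return f"{base}_{grain}"


def roll_up(values, source_grain, target_grain):
    """Sum source-grain values that together cover one target-grain period."""
    if GRAIN_DAYS[target_grain] < GRAIN_DAYS[source_grain]:
        # Going from monthly to daily would be interpolation, not measurement.
        raise ValueError(f"cannot derive {target_grain} from {source_grain}")
    return sum(values)
```

Twelve monthly counts roll up to a yearly total without complaint, while asking for a daily number from weekly data raises an error instead of quietly inventing one.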
Problem 5: Personally Identifiable Information
If you have contact data, you have PII. Do yourself a favor when you enrich data using this information: use as little of it as possible. If you need to take the data and analyze it outside of a system, use an identifying key rather than the raw data and join it back in using a query when finished.
This also applies to identity resolution - you’ll do everyone in your org a favor if you create a standard way to link/merge or separate people who are the same but have personal and business emails. There are good reasons to combine or separate these records, but you should know how to proceed consistently.
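To make the "identifying key instead of raw data" idea concrete, here is a sketch that swaps emails for a keyed hash before data leaves the system and keeps a lookup to join results back in. The field names, the `secret` handling, and both function names are assumptions; a keyed hash reduces exposure but is not a substitute for your privacy and legal review.

```python
import hashlib
import hmac


def pseudonymous_key(email, secret):
    """Derive a stable key from an email using HMAC-SHA256 and a secret (bytes)."""
    normalized = email.strip().lower().encode("utf-8")
    return hmac.new(secret, normalized, hashlib.sha256).hexdigest()


def split_for_analysis(contacts, secret):
    """Return (export rows without PII, lookup from key back to email)."""
    export, lookup = [], {}
    for contact in contacts:
        key = pseudonymous_key(contact["email"], secret)
        lookup[key] = contact["email"]  # stays inside your system
        export.append({"contact_key": key, "seniority": contact["seniority"]})
    return export, lookup
```

The export carries only `contact_key` and the non-PII columns you need for analysis. When the analysis is done, join on `contact_key` to bring the results back to the real records.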
Problem 6: An Incorrect Enrichment Match
Another common problem is an incorrect enrichment match on data that appears to be correct. In my case, ZoomInfo erroneously combined my “Greg Meyer” contact record with several other similar records to create a Frankenrecord of bad data. If you get a lot of new values all at once, you might not have the right person.
Using a combination of email and another key is one way to fight this problem, creating an index key of values that you can compare to the new record. Another solution is to set a flag on the record when a large percentage of its fields change in a single update, then use a queue to review and remediate the flagged records. If this is too noisy, raise the percentage threshold that triggers the flag so that fewer records land in the queue.
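The flag can be as simple as comparing the incoming enrichment to the current record and measuring how much changed. This is a sketch; the 0.5 default threshold and the function names are assumptions you would tune against your own review queue.

```python
def changed_fraction(current, incoming):
    """Fraction of shared fields whose value differs in the incoming record."""
    fields = [f for f in incoming if f in current]
    if not fields:
        return 0.0
    changed = sum(1 for f in fields if current[f] != incoming[f])
    return changed / len(fields)


def needs_review(current, incoming, threshold=0.5):
    """Flag the update for a human when too much changes at once."""
    return changed_fraction(current, incoming) >= threshold
```

A title change alone stays under the threshold and flows through. A new title, company, phone, and city arriving together is the Frankenrecord pattern and goes to the queue.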
Problem 7: The Cost of Enriching Frequently
Enrichment services aren’t cheap. They usually bill you by credits and the number of records that you enrich in a monthly time period. And you want your data to be as fresh as possible. How do you balance the need to update data and enrich new records with the cost of frequent enrichment?
After your initial enrichment task, setting a cadence for how often you want entities to be refreshed helps with this process, e.g. having a goal that every active contact might get updated every 180 or 365 days. Adding a secondary enrichment to contacts or accounts to know if events occur (title change, company change, email change) will also help you to determine the optimal cadence to get new enrichment data.
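Here is a rough sketch of a cadence that respects a credit budget: records with a known change event go first, then the records that have gone longest without enrichment, and the list is cut off at the credits you have. The record shape, the 180-day default, and `select_for_refresh` are assumptions for illustration.

```python
from datetime import date


def select_for_refresh(records, today, cadence_days=180, credits=100):
    """Pick records to re-enrich this cycle without exceeding the credit budget."""
    due = [
        r for r in records
        if r["change_event"] or (today - r["last_enriched"]).days >= cadence_days
    ]
    # Change events first (False sorts before True), then oldest enrichment first.
    due.sort(key=lambda r: (not r["change_event"], r["last_enriched"]))
    return due[:credits]
```

This assumes one credit per record. If your vendor prices contacts and accounts differently, the cutoff would need to add up costs rather than count rows.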
Problem 8: Lack of a Standard Data Model
If you don’t know what a “good record” looks like, enrichment is not going to prove particularly valuable. Use a standard like the Minimum Viable Record to outline the most important fields in an object that need to be populated, and to understand when cross-field data integrity (think city, state, zip as an example for US addresses) is breaking.
Creating a model for each common entity will also help team members identify obvious bad data, from picklist values that don’t match, to data formatting errors that point to a bad enrichment result or other data problems.
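A Minimum Viable Record check can start as a short function that returns a list of problems. The required fields, the picklist values, and the ZIP format rule below are assumptions for illustration; your own model decides what belongs here.

```python
import re

# Assumed model for an account record; replace with your own definition.
REQUIRED_FIELDS = ("name", "website", "country")
INDUSTRY_PICKLIST = {"Software", "Financial Services", "Healthcare"}
US_ZIP = re.compile(r"\d{5}(-\d{4})?")


def validate_account(record):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not (record.get(field) or "").strip():
            problems.append(f"missing {field}")
    industry = record.get("industry")
    if industry and industry not in INDUSTRY_PICKLIST:
        problems.append(f"industry {industry!r} is not in the picklist")
    # Cross-field check: a US record should carry a US-shaped ZIP code.
    zip_code = record.get("zip")
    if record.get("country") == "US" and zip_code and not US_ZIP.fullmatch(zip_code):
        problems.append(f"zip {zip_code!r} does not look like a US ZIP code")
    return problems
```

Run it on each enrichment result before writing it back, and route anything with a non-empty problem list to review.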
Problem 9: Matching Non-Equivalent Data
This is a sneaky question: when is a contact title of “Vice-President” not the same as another title “VP” or “Vice-President”? Answer: when you are considering two companies with different title structures, like Amazon.com vs. a bank. Understanding equivalencies like this requires additional context, such as an adjacent data value like company industry.
This is one area where AI might be able to help quite a bit by deriving the level and responsibility related to a title when combined with the data on the company where that person works. Imagine a search for “director +” that actually gives you the right segment of contacts you’re seeking in an outbound campaign or in a segmentation exercise of internal contacts.
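Before reaching for AI, a rule-based sketch shows the shape of the problem: normalize the title, then let industry adjust the level. Every number and mapping below is an assumption made up for illustration, including the idea that a bank “vice president” maps to a lower level than the same title elsewhere.

```python
# Illustrative mappings only; none of these levels come from a standard.
TITLE_ALIASES = {
    "vp": "vice president",
    "vice-president": "vice president",
    "svp": "senior vice president",
}
BASE_LEVEL = {"manager": 2, "director": 3, "vice president": 4, "senior vice president": 5}
# Industry-specific overrides for titles that mean something different there.
INDUSTRY_OVERRIDE = {("Banking", "vice president"): 2}


def seniority_level(title, industry):
    """Return an assumed seniority level, or None if the title is unknown."""
    normalized = " ".join(title.lower().replace(".", "").split())
    canonical = TITLE_ALIASES.get(normalized, normalized)
    override = INDUSTRY_OVERRIDE.get((industry, canonical))
    if override is not None:
        return override
    return BASE_LEVEL.get(canonical)


def director_plus(contacts):
    """Contacts at director level (3) or above, after industry adjustment."""
    return [
        c for c in contacts
        if (seniority_level(c["title"], c["industry"]) or 0) >= 3
    ]
```

With these made-up levels, a “VP” at a software company lands in the “director +” segment and a “Vice-President” at a bank does not. A model that derives the level from title plus company data would replace the lookup tables, not the interface.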
Problem 10: Not Using External Standards
By this point in the post, you’ve probably recognized that most of the common problems listed above could be improved with external standards. The meta-problem? There are no established external standards for “good” RevOps data. Savvy operators will use proxy data (e.g. LinkedIn’s industry ID) as a way to find relevant sets of clean data. But we’re missing something like an open-source standard that – by entity type – suggests standard ways to gather and clean data based on type and composition.
What would such a solution look like? It’s unlikely to be an across-the-board solution and more like a cookbook of recipes helping you to “clean contact data” or “build account model data for companies with headquarters in a single country” or similar. The point here is that as a data quality professional, it’s incumbent upon you to have an opinionated view of your data and build systems and tests to deliver more of that data.
What’s the takeaway? We made a list of 10 common data quality problems here - there’s probably a long tail list of 1000 or more. The right fix for your org involves understanding what “good” looks like, building enrichment to reinforce that, and setting up monitoring and tests to ensure you catch records that don’t look right.
Links for Reading and Sharing
These are links that caught my 👀
1/ What’s more important? - One of the most useful (and poorly defined) methods of analysis when comparing preferences is Conjoint Analysis. As author Daniel Kyne puts it, “conjoint only measures what’s most important to people when comparing products in a purchase scenario.” Before you design your next feature in a SaaS product where you compare similar bundled options, read this analysis first.
2/ Techniques to test data - I really liked this piece by Karen Zhang on building data tests with dbt. What stood out about the article was the focus on identifying true/false tests that would succeed or fail based on important general conditions we want to identify about the data. For example, in a table where you expect certain values to be unique, create a test for uniqueness using a simple query. Writing these unit tests gives you a documented framework for the logic of your testing, and reminds me that I need to write more unit tests.
3/ Sustainable construction - A great idea to take advantage of a sustainable resource (seaweed) to approach a big problem (housing): using seaweed to create construction bricks. We need innovative solutions to improve the ecological footprint of construction - this is a neat idea.
What to do next
Hit reply if you’ve got links to share, data stories, or want to say hello.
The next big thing always starts out being dismissed as a “toy.” - Chris Dixon