Data Operations: Building a Data Factory
Building a system to fix information on a one-time and a continuous basis
A “Data Factory” is only as good as the results it produces. When you start applying data operations concepts to your ongoing projects, the data factory is how you take your raw materials (the data and the instructions) and produce cleaned data that conforms to your governance processes and quality standards.
In Part 1 of this series, we discussed the basic idea of a data operations process: that there is important information at the intersection of systems that needs to be managed as an independent process. Because the marketing department and the sales department both work with shared concepts like a “prospect”, a “lead”, and an “account”, yet may not have shared definitions for those concepts, there is room for misinterpretation.
In Part 2, we took those concepts and explained what they would look like in practice. From operational definitions of this theory to practical metrics for measuring data, we talked about a way to identify, test, and implement data operations in your organization. Having a team, process, and tools is a start. Running this process in production requires a “factory” of sorts - one that takes in raw data and outputs information as we’d like it - and this post is about defining and prototyping the data factory.
Data operations as a function exists to provide governance, highlight differences in practice and procedure, and remediate those differences with solutions that work for the whole organization.
Part 3 of this series is about starting to build the actual Data Factory to implement the theory we laid out in Part 1 and the practice we discussed in Part 2. So, let’s get started!
What is a Data Factory?
A “Data Factory” is a system to transform inputs (the information entering your system, usually from a marketing automation or sales automation system) into usable data. “Usable data” typically means data that clears the standard for the “minimum viable record” and can be placed into your sales or marketing motions.
The term factory implies automation that does the following (sketched in code after this list):
Takes raw information, e.g. a lead entered into a landing page form or added by a Sales Development Representative into Salesforce, and evaluates it against a standard
Enriches leads that fall short of that standard, processing or improving them automatically through an external service (like ZoomInfo or Clearbit)
Remediates leads that still don’t meet the ingestion standard, placing them into a queue for fixing
Manages the overall lifecycle, improving data or retiring it when it’s no longer needed
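To make that flow concrete, here is a minimal sketch of the routing logic in Python. The three-field standard, the field names, and the enrich stub are illustrative assumptions, not any specific vendor’s API:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Lead:
    email: Optional[str] = None
    company: Optional[str] = None
    country: Optional[str] = None
    notes: list = field(default_factory=list)

def meets_standard(lead: Lead) -> bool:
    """The ingestion standard: every required field is present."""
    return all([lead.email, lead.company, lead.country])

def enrich(lead: Lead) -> Lead:
    # Placeholder for a ZoomInfo/Clearbit-style lookup keyed on email;
    # a real implementation would fill in company, country, and so on.
    return lead

def process_inbound(lead: Lead) -> str:
    """Route a raw lead: ingest it, enrich it, or queue it for remediation."""
    if meets_standard(lead):
        return "ingest"                # clears the standard as-is
    lead = enrich(lead)                # try to improve it automatically
    if meets_standard(lead):
        return "ingest"
    lead.notes.append("failed ingestion standard after enrichment")
    return "remediate"                 # place it in a queue for fixing
```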
Raw information could take a lot of forms, but it’s usually the product of the natural action that prospects take when they are engaging with your business. For example, a person signs up for a webinar and puts in their name and a personal email address because they might not be ready to be contacted by a sales team.
What does a data factory do to fix this?
First, the inbound lead needs to be matched against any existing leads or accounts for this customer. If a match is found, the new lead information needs to be merged with the existing lead or contact. If no match is found and enough information is present, you would expect this lead to be automatically converted onto an existing or new account.
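Here is a minimal matching sketch, assuming an in-memory list of existing records with email and account_domain keys; in a real factory this would be a query against your CRM:

```python
def find_match(new_lead: dict, existing_records: list):
    """Match on exact email first, then on email domain vs. account domain."""
    email = (new_lead.get("email") or "").lower()
    if not email:
        return None                    # nothing to match on yet
    for record in existing_records:
        if (record.get("email") or "").lower() == email:
            return record              # same person: merge, don't duplicate
    domain = email.split("@")[-1]
    for record in existing_records:
        if record.get("account_domain") == domain:
            return record              # same company: convert onto this account
    return None                        # no match: create a new lead and account
```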
What could go wrong? If the standard for a lead requires a billing country and none is present, or if you need to have a company name and none is presented by the lead or by enrichment, you might not be able to take action. Yet.
A Data Factory needs to have a built-in method for remediation. In Salesforce you can use the Case object or a custom object to track the changes that might need to be made to a lead to bring it up to standard so that it can move on in the sales process.
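As a sketch of what that tracking could look like, the snippet below opens a Case with the simple-salesforce library. The credentials, the Origin value, and the idea of listing failed rules in the description are placeholders to adapt to your own org:

```python
from simple_salesforce import Salesforce

# Connection details are placeholders; use your org's credentials.
sf = Salesforce(username="ops@example.com", password="...",
                security_token="...")

def open_remediation_case(lead_id: str, failed_rules: list) -> None:
    """Record the fixes a lead needs before it can re-enter the funnel."""
    sf.Case.create({
        "Subject": f"Lead {lead_id} failed the ingestion standard",
        "Description": "Missing or invalid: " + ", ".join(failed_rules),
        "Origin": "Data Factory",  # assumes this picklist value exists
    })
```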
This lifecycle makes it easy to create, improve, and retire data when it’s no longer needed. And now we know, for a given piece of data, whether (and when) it’s been ingested, classified, processed, and enriched. Ingesting is straightforward: data enters through a landing page, or through a .CSV data load when a source isn’t connected to the system. Classifying data means matching it to a standard that can be objectively defined as a series of True/False conditions. When leads pass the standard, they can enter the system; if they don’t, they need to be improved before they can proceed.
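One way to express that standard in code, assuming a plain-dict lead and example rule names, is a table of named True/False predicates, so a failing lead reports exactly which conditions it missed - the same list a remediation case (like the sketch above) would record:

```python
# Each rule is an objectively checkable True/False condition.
RULES = {
    "has_email":   lambda lead: bool(lead.get("email")),
    "has_company": lambda lead: bool(lead.get("company")),
    "has_country": lambda lead: bool(lead.get("country")),
}

def classify(lead: dict):
    """Return (passes_standard, names_of_failed_rules)."""
    failed = [name for name, check in RULES.items() if not check(lead)]
    return (not failed, failed)

# classify({"email": "jane@acme.com"}) -> (False, ["has_company", "has_country"])
```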
How do we know when a Data Factory has the potential to work?
You might wonder - what’s the big deal? You can just fix the data once it’s in the system. There’s no need to create a “Data Factory”, right? But that misses the point. We’re trying to build an integrated system that is part of the business, not just a separate entity.
This is not a one-size-fits-all solution, and it will depend upon the structure, team, and process used to create, consume, and process the data. A Data Factory is an abstract concept and can be realized in many different ways. But you will know when things are working.
You will know you have a strongly functioning data factory and a data-driven culture in your organization when:
Data in the organization is well defined and documented - it’s easy to find out who owns a field in an object and to understand what it’s supposed to store
The system of record for each kind of data is identified - if you’re looking for sales data, do you look in Salesforce, or in another system? Is a “customer id” captured in Marketo, a customer system, or elsewhere?
Interconnected data is defined and the “contract” between departments is established - does Marketing agree that Sales owns certain data, and is the converse true as well?
The process for handling data is well defined and documented - there are diagrams that explain to non-technical users how data moves through the system
The teams that think about this data know and understand the process and are willing to follow it - you’ve got buy-in in the organization for managing the flow of information based on data-driven metrics
There is a clear set of metrics and measures to understand the health of data and they are visible to the organization - there is a scorecard or a dashboard everyone uses to measure data quality
These metrics are integrated into current business processes and are easy to find - people actually use the scorecard or dashboard in their regular meetings
The organization agrees that improvement in these metrics is tied to KPI improvement for the GTM team - individual members of the team trust the data they are using and are motivated to improve it, and their bosses agree
Great, this is a start. Now that you have a standard to follow, it’s possible to take that model and put it into practice.
From Definition to Action
One way to get the organization to believe in the concept of a data factory and put it into action is to approach a problem everyone thinks is worth solving.
An example of this could be Data Ingest, or the process of adding information to the organization. Improving the quality of inbound leads helps almost every team do better in the go-to-market motion.
First, you need to define what it means to add leads to the system. Does it mean landing pages from Marketo? Does it mean leads added from a list or from an event? Does it mean leads added by Sales Development Representatives as they do their work? Yes, it means all of these things. Whenever leads enter the system, they need to be raised to the level of a Minimum Viable Lead.
Second, you identify the “leaky bucket” areas of the process. Does your organization allow people to download whitepapers without using a work email? When prospects sign up, do you attempt to validate their email?
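As a sketch of plugging one such leak, the check below rejects malformed addresses and a small sample of free-mail domains at the form; the domain list is illustrative and would be much longer in practice:

```python
import re

FREE_MAIL_DOMAINS = {"gmail.com", "yahoo.com", "hotmail.com", "outlook.com"}
EMAIL_RE = re.compile(r"^[^@\s]+@([^@\s]+\.[^@\s]+)$")

def is_work_email(address: str) -> bool:
    """True only for syntactically valid, non-free-mail addresses."""
    match = EMAIL_RE.match(address.strip().lower())
    if not match:
        return False                   # not a valid address at all
    return match.group(1) not in FREE_MAIL_DOMAINS

# is_work_email("jane@gmail.com") -> False
# is_work_email("jane@acme.com")  -> True
```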
Finally, how do you fix the problem? You need to fix this on a one-time basis, and also continuously.
One-time fixes happen all the time as you find issues. When you find that leads entering the system from a particular lead source are missing key data, you need to both update the existing records and fix the landing page that’s missing the field.
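A one-off backfill for that scenario might look like the sketch below, again using simple-salesforce; the LeadSource value and the “Unknown” default are placeholders:

```python
from simple_salesforce import Salesforce

sf = Salesforce(username="ops@example.com", password="...",
                security_token="...")

# Find every lead from the suspect source that's missing its country.
stale = sf.query_all(
    "SELECT Id FROM Lead "
    "WHERE LeadSource = 'Webinar Landing Page' AND Country = null"
)["records"]

# Patch the existing records; the landing-page form itself still needs
# the missing field added so that new leads arrive complete.
sf.bulk.Lead.update(
    [{"Id": rec["Id"], "Country": "Unknown"} for rec in stale]
)
```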
Continuous fixes happen when you build resiliency into the system. For example, when you enrich every lead that enters the system with Clearbit or ZoomInfo, you gain the ability both to score that lead as pass/fail against the Minimum Viable Lead definition and to fill in missing data at the point of insertion.
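Here is a sketch of that continuous fix, where call_enrichment_api stands in as a hypothetical wrapper around a Clearbit- or ZoomInfo-style lookup keyed on email:

```python
REQUIRED_FIELDS = ("email", "company", "country")

def call_enrichment_api(email: str) -> dict:
    # Hypothetical stub: a real pipeline would call the vendor's API here.
    return {}

def enrich_on_insert(lead: dict):
    """Enrich a lead at the point of insertion, then re-check the standard."""
    found = call_enrichment_api(lead.get("email", ""))
    for key in ("company", "country"):
        if not lead.get(key):
            lead[key] = found.get(key)   # fill in only what's missing
    passes = all(lead.get(k) for k in REQUIRED_FIELDS)
    return lead, passes                  # pass/fail on the Minimum Viable Lead
```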
From Action to Evaluation
In the next post in this series, we’ll look at taking the results from the Data Factory and using them to drive change in the organization. We started by defining the services of a Data Factory and describing how those services update and improve data. The next step is to drive change based on what you see in the data.
If you found this useful, consider sharing with a friend.