Request for product: an Ops Detective to document problems

When there's a problem, do you have a standardized log and discovery process? Maybe you need an Ops Detective. Read: "Everything Starts Out Looking Like a Toy" #206

Jul 08, 2024

Hi, I’m Greg 👋! I write weekly product essays, including system “handshakes”, the expectations for workflow, and the jobs to be done for data. What is Data Operations? was the first post in the series.

This week’s toy: the progression of memes into … emoji memes. If you don’t follow Jen Daniel’s excellent emoji blog, check it out. Jen makes a new emoji version of the US Flag every July 4th, and it doesn’t disappoint. Edition 206 of this newsletter is here - it’s July 8, 2024.

If you have a comment or are interested in sponsoring, hit reply.

The Big Idea

A short long-form essay about data things

⚙️ An Ops Detective🕵️ can tell you what’s going on

ELI5 is an acronym people use when they encounter a complicated concept and want it explained simply. Some things are pretty straightforward in the Ops world – you know you have a new subscription in SaaS when Stripe tells you that you do – but many other things are not immediately obvious.

Records in your data warehouse that change in unexpected ways? Records that don’t show up where you expect? Duplicates created in your system even though you have a solid deduping process? A customer subscription scheduled to end on a date that doesn’t end automatically? All of these require a data detective to troubleshoot and resolve.

Playing the data detective is not always easy, requiring you to:

understand the internal rules of engagement – how should this record change according to our expected processes and norms?
check the history of this record – what happened to this record over time?
view the order of operations across systems - as many ops processes touch multiple systems, you need to know what happened and when

To find out what happened, you need an Ops Detective.

Wait a minute, don’t you already have logs? You do! But they do a lousy job of telling you what happens across systems when you have a common workflow that causes changes to multiple types of records. Solving this problem can take a lot of time and effort and would improve if the lineage and reporting information were better.

How would you make the reporting process better?

Using a product lens, there are some specific ways to make reporting these anomalies clearer. Think about how this would work if every process self-documented its results in a standard format. At key points, each workflow could broadcast a message with a time stamp to capture the transaction.

If you think this sounds like a fancy way of describing a transaction log, you’re not wrong. As a detective, you need to know what happened to an object over time and across systems. This gives you the ability to aggregate rows into a time series and understand the timeline as it happens.
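Here is a minimal sketch of that idea in Python. The row shape, system names, and event names are all hypothetical, and it assumes every system can emit rows that share an object ID and a time stamp. It only shows how rows from different systems could be lined up into one timeline.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class LogRow(NamedTuple):
    system: str       # which system wrote the row (hypothetical names below)
    object_id: str    # the record the row is about
    event: str        # what happened
    at: datetime      # when it happened


def timeline(rows, object_id):
    """Return the rows for one object, oldest first, regardless of source system."""
    return sorted((r for r in rows if r.object_id == object_id), key=lambda r: r.at)


def at(second):
    # Helper for the sample data: a fixed minute on July 1, 2024 (UTC).
    return datetime(2024, 7, 1, 12, 0, second, tzinfo=timezone.utc)


rows = [
    LogRow("billing", "sub_1", "invoice_paid", at(5)),
    LogRow("crm", "sub_1", "status_set_active", at(9)),
    LogRow("billing", "sub_1", "subscription_created", at(1)),
    LogRow("crm", "sub_2", "status_set_active", at(3)),
]

for row in timeline(rows, "sub_1"):
    print(row.at.isoformat(), row.system, row.event)
```

Sorting by time stamp only works if the systems' clocks roughly agree, which is one more reason to have each workflow broadcast to a single log.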

Building a successful transaction log has a few challenges:

Granularity - there’s a lot of noise here, so which details do you care about? Storage is inexpensive, so storing a lot of data is not usually an issue, but you want to store things you actually use.
Explanation - when a change happens, how do you summarize it in the context of a larger workflow? Calling it an event name and defining events uniquely might be one way, but it’s not always easy to know what’s happening without an event sequence.
Transaction - some things that happen can be reversed and other items are immutable, so it would be helpful to know all of the events that belong together in a transaction.
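One way to picture these three challenges is as fields on a single log record. The sketch below is an assumption about how that record might look, not a finished schema: the class name, field names, and sample events are all hypothetical.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class OpsEvent:
    transaction_id: str   # Transaction: ties together events that belong together
    sequence: int         # Explanation: position of this event in the larger workflow
    event_name: str       # Explanation: a uniquely defined event name
    reversible: bool      # Transaction: can this step be undone, or is it immutable?
    details: dict = field(default_factory=dict)  # Granularity: only the keys you actually use
    occurred_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def new_transaction_id() -> str:
    """Generate an ID shared by every event in one transaction."""
    return uuid.uuid4().hex


txn = new_transaction_id()
events = [
    OpsEvent(txn, 1, "subscription_created", reversible=True, details={"customerID": "cus_1"}),
    OpsEvent(txn, 2, "invoice_paid", reversible=False, details={"customerID": "cus_1"}),
]
```

Keeping details as a small dictionary is a deliberate choice here: it lets each workflow decide its own level of granularity without changing the record shape.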

Now that we have the schema for a potential transaction log, how would the mechanics of this work?

An Idea to Unify Ops Reporting

Adding a log for every process (at important events) would add a small amount of overhead to existing workflows, and could look like an API that performs the following steps.

There needs to be a catalog of processes or workflows. When a workflow first runs, it registers itself with the logging app, letting it know what it does, how often it normally runs (if known), and the objects it typically affects.

For example, a “subscription start” workflow might register itself in this way:

Type: On Demand
Name: “Subscription Start”
WorkflowGroup: “Subscription Start”
Description: “Runs when accounts begin their subscription”
Event time: time stamp
Metadata: A JSON package containing the typical keys to be expected, e.g. customerID, subscriptionID, subscriptionStatus
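As a rough sketch, the registration above could be expressed in Python like this. The Catalog class is a hypothetical in-memory stand-in for the logging app, and the field names are my own translation of the list above rather than a real API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class WorkflowRegistration:
    type: str                # e.g. "On Demand"
    name: str
    workflow_group: str
    description: str
    metadata_keys: tuple     # keys the workflow's messages are expected to carry
    event_time: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class Catalog:
    """Hypothetical in-memory stand-in for the logging app's catalog."""

    def __init__(self):
        self._workflows = {}

    def register(self, registration):
        # Registration is idempotent: the first registration for a name wins,
        # so a workflow can safely call this every time it runs.
        return self._workflows.setdefault(registration.name, registration)

    def names(self):
        return sorted(self._workflows)


catalog = Catalog()
subscription_start = catalog.register(
    WorkflowRegistration(
        type="On Demand",
        name="Subscription Start",
        workflow_group="Subscription Start",
        description="Runs when accounts begin their subscription",
        metadata_keys=("customerID", "subscriptionID", "subscriptionStatus"),
    )
)
```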

When this workflow runs, it sends a message. You can aggregate these messages into a group of events that correspond to the Subscription Start for this account. Ordering the events by their time stamps lines up the transaction so that you can observe it sequentially and see whether things happened in the expected order and with the expected content.
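To make the grouping and ordering concrete, here is a small sketch. The message keys, event names, and the expected sequence are assumptions for illustration; a real workflow would define its own.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical sequence of events for a healthy Subscription Start.
EXPECTED_EVENTS = ["subscription_created", "invoice_paid", "status_set_active"]


def ts(second):
    # Helper for the sample data: a fixed minute on July 8, 2024 (UTC).
    return datetime(2024, 7, 8, 9, 0, second, tzinfo=timezone.utc)


messages = [
    {"customerID": "cus_A", "event": "invoice_paid", "event_time": ts(5)},
    {"customerID": "cus_A", "event": "subscription_created", "event_time": ts(1)},
    {"customerID": "cus_B", "event": "subscription_created", "event_time": ts(2)},
    {"customerID": "cus_A", "event": "status_set_active", "event_time": ts(9)},
    {"customerID": "cus_B", "event": "status_set_active", "event_time": ts(3)},
]


def events_by_account(messages):
    """Group event names per account, ordered by time stamp."""
    grouped = defaultdict(list)
    for message in sorted(messages, key=lambda m: m["event_time"]):
        grouped[message["customerID"]].append(message["event"])
    return dict(grouped)


def ran_as_expected(events, expected=EXPECTED_EVENTS):
    """True only when the events match the expected sequence exactly."""
    return events == expected


grouped = events_by_account(messages)
for account, events in grouped.items():
    print(account, "ok" if ran_as_expected(events) else "needs a detective", events)
```

In this sample, cus_B never gets an invoice_paid event, so its sequence does not match and it is flagged for a closer look.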

How would you use an Ops Detective?

Let’s say you had this improved transaction log. What would you do with it? I’d use it to answer some common questions and to output a standard result for questions like these for a given transaction:

What happened?
Was the outcome similar to or different from what was expected?
What’s the impact of the change?

Of these questions, number 3 is the hardest to generalize. You are probably creating dashboards or reporting to alert on questions 1 and 2, and using a lot of extra time to understand impact. The goal of an Ops Detective is to standardize more of the typical transaction reporting across systems so that answering a question about a new workflow publishing to this log will be an easier process.
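A standard result for the first two questions could be as simple as the sketch below. The function name and the shape of the report are hypothetical, and impact is deliberately left empty because that answer still needs a human (or a workflow-specific rule).

```python
def summarize(observed, expected):
    """Build a standard report for one transaction from its ordered event names."""
    return {
        "what_happened": list(observed),                             # question 1
        "as_expected": list(observed) == list(expected),             # question 2
        "missing": [e for e in expected if e not in observed],       # expected but never seen
        "unexpected": [e for e in observed if e not in expected],    # seen but not expected
        "impact": None,                                              # question 3: not generalized here
    }


report = summarize(
    observed=["subscription_created", "status_set_active", "refund_issued"],
    expected=["subscription_created", "invoice_paid", "status_set_active"],
)
print(report)
```

Because every workflow produces the same report shape, a new workflow publishing to the log gets the first two answers without a new dashboard.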

What’s the takeaway? Unexpected outcomes are the norm in operations, and building a structured way to troubleshoot them will make it easier to find patterns and to build an automated alert once you find a pattern. Build a self-reinforcing system by registering each workflow with the Ops Detective.

Links for Reading and Sharing

These are links that caught my 👀

1/ In comedy, context matters … - Have you tried watching a favorite movie from 10 or 20 years ago and found that it … just hasn’t aged that well? One potential reason is that not only does the context for comedy change over time, but so does our perception of humor. Before you get cranky at me for not being goofy, studies show that certain types of humor have universal appeal, and others … appeal mostly to 5th grade boys.

2/ Infrastructure spending as data - One of the most prolific spenders of capital in the 20th century was the Bell System (you might know some of their successor companies today). Bell spent tremendous amounts of capital to build a national (and international) communications system. It’s hard to know whether any company could do that level of spending today and have that amount of impact because the time frame multiplied by the spending is nuts.

3/ How to sharpen a pencil - If you know, you know. This is the proper way to sharpen a pencil.

What to do next

Hit reply if you’ve got links to share, data stories, or want to say hello.

Want to book a discovery call to talk about how we can work together?

The next big thing always starts out being dismissed as a “toy.” - Chris Dixon