Why your team can't see what's actually broken
Map the customer journey so teams can see exactly "what should have happened" in any operational workflow, then react faster. Read: "Everything Starts Out Looking Like a Toy" #263
Hi, I’m Greg 👋! I write weekly product essays about system “handshakes”, the expectations that shape workflows, and the jobs to be done for data. What is Data Operations? was the first post in the series.
This week’s toy: a clickable map of the history of technology. Ever wondered what the precursor to the Instant Camera was? Now you can find out, and fall down the rabbit hole of a technology timeline while you’re at it.
Edition 263 of this newsletter is here - it’s August 11, 2025.
Thanks for reading! Let me know if there’s a topic you’d like me to cover.
The Big Idea
A short long-form essay about data things
⚙️ Why your team can't see what's actually broken
In 2012, our Elasticsearch service broke at the SaaS company where I was working. The operational and technical failure left hundreds of customers with impaired use of our customer service software for 48 hours. Strictly speaking, it wasn't our fault: the vendor's index had simply grown too big and failed. But try telling that to customers.
The search service was how customers discovered new emails in their shared inboxes. When it went down, they had no way to know when new information was arriving. For several hours, the only workaround we could offer was to check the inbox in another email client, and we didn't yet have a clear process for what to do when search was inoperative.
The result? Slow response times across hundreds of customers and thousands of their customers. We spent two days rebuilding the search index while individually calling customers with remediation steps. It could have been better.
We had no process for failure
We had better visibility into our code than we did into our processes.
When a build broke, engineering got an alert. When a deployment failed, there was a clear error message. When a database query timed out, there was a stack trace. But when our search service failed? We had no process for what should happen next.
Engineering teams track every commit, every deployment, every bug fix. When something breaks, they might not know exactly what went wrong, but they get a signal that they need to investigate before they can move on to other tasks.
We had no equivalent signal for the handoffs between teams, the steps that describe the work of getting things done. What if we had designed our handoffs to be measurable and observable from day one, just as we do when we build continuous delivery pipelines for code?
We don't have enough process visibility
The Elasticsearch failure wasn't unique. It was a symptom of a much larger problem: we don't have enough visibility into how our operational processes actually work.
The operational landscape has shifted dramatically. Tool sprawl has created silos where Salesforce doesn't talk to HubSpot, where Jira tickets get lost between Slack threads, and where customer data lives in three different CRMs. Add in remote work, async workflows, and rising customer expectations, and you've got a recipe for operational breakdowns that cost real money.
But the deeper problem isn't just tool integration. It's that teams have fundamentally different ways of documenting and expressing how they do work. Some teams live in Jira, others in Zendesk, others in Figma, and others in Slack. When there's a service issue, there's no common place to see what should have happened, so teams spend more time deciding who did what than actually analyzing the problem.
The handoff failure pattern
We treat handoff failures as inevitable rather than preventable. When a process handoff fails, what do we often do?
We blame the other team. We assume it was a one-off problem. We move on without learning anything. We don't have the data to do anything else.
The core issue: Teams don't have a standard way to create a runbook for what's supposed to happen. When processes break, there's no single source of truth that everyone can reference. Instead, you get:
Tool-specific documentation that only certain teams can access
Verbal handoffs that get lost in translation
Process changes that never make it to all stakeholders
Service issues that devolve into "he said, she said" debates
What if we treated our operations like engineering teams treat their code?
Sequence diagrams can serve as a universal process format
Sequence diagrams solve the fundamental coordination problem: they provide a common format that can be consumed and modified by anyone, regardless of their preferred tools or technical background.
The beauty of sequence diagrams isn't that they're revolutionary. It's that they're practical.
They give you:
A single source of truth that everyone can reference
Clear handoff points where responsibility transfers between teams
Error handling paths that show what to do when things break
Version control so process changes are tracked and reversible
Here's the magic: sequence diagrams are text-based, but they're not just for engineers. Feed a diagram to any chatbot along with a prompt, and it can interview you about the process; the output becomes updated sequence diagram content that captures the proposed change.
This means:
Non-technical teams can understand the flow
Process changes can be proposed and discussed in plain language
LLMs can help translate business requirements into technical specifications
Everyone speaks the same language about what should happen
The journey: what should happen
Here's a simple customer journey that any stakeholder can understand at a glance:
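As a sketch, using Mermaid's text-based journey syntax, it might look like the block below. The stages, satisfaction scores, and actors are illustrative assumptions, not the real process:

```mermaid
journey
    %% Illustrative journey: each task line is "name: score (1-7): actors involved"
    title Customer onboarding journey (illustrative)
    section Sale
      Contract signed: 5: Sales, Customer
      Handoff to onboarding: 3: Sales, Onboarding
    section Onboarding
      Kickoff call: 4: Onboarding, Customer
      Resources provisioned: 3: Engineering
    section Go-live
      First value delivered: 5: Customer
```

Lower scores flag the steps where the experience tends to sag, which in this sketch are the handoffs.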
The sequence: how it actually works
But the journey view only tells part of the story. Here's the same flow expressed as a proper sequence diagram that shows timing, dependencies, and error handling:
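A minimal Mermaid sketch of that flow follows. The actor names, the four-hour provisioning target, and the failure branch are assumptions for illustration:

```mermaid
sequenceDiagram
    %% End-to-end onboarding flow with one explicit failure path
    participant C as Customer
    participant S as Sales
    participant O as Onboarding
    participant E as Engineering

    S->>O: Handoff ticket (contract, requirements, contacts)
    O->>C: Schedule kickoff call
    O->>E: Request resource provisioning
    alt Provisioning succeeds
        E-->>O: Environment ready (target: 4 business hours)
        O->>C: Kickoff and access credentials
    else Provisioning fails or stalls
        E-->>O: Failure reason and new ETA
        O->>C: Proactive status update
        O->>E: Escalation with missing details
    end
```

Because the file is plain text, it can live in version control next to the code, and a change to the process shows up as a diff anyone can review.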
The known process: drilling down
Now let's drill into one specific "known process" from the journey—the onboarding handoff failure. This sub-diagram shows exactly what happens when the handoff between sales and onboarding breaks:
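A sketch of that sub-process, again with assumed actors and messages:

```mermaid
sequenceDiagram
    %% Where the sales-to-onboarding handoff tends to break: missing context
    participant S as Sales
    participant O as Onboarding
    participant E as Engineering

    S->>O: Handoff ticket
    alt Ticket has full context
        O->>E: Provisioning request with requirements attached
    else Requirements or contacts missing
        O->>S: Request for the missing information
        Note over S,O: The wait here is where the days disappear
        S-->>O: Missing details, eventually
        O->>E: Provisioning request, delayed
    end
```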
What this approach gives you
Clarity: Every stakeholder can see exactly where they fit in the process and what they're responsible for.
Accountability: When something breaks, you can trace the failure to a specific step, actor, or system interaction.
Speed: Process changes become visual exercises—add a step, remove a handoff, or modify a decision point, then validate the flow.
Alignment: Sales, marketing, customer success, and data teams can see how their work connects to customer outcomes.
Consider this scenario: Your customer success team reports that onboarding is taking too long. With a sequence diagram, you can immediately see that the bottleneck is in the resource provisioning step. Maybe engineering is overwhelmed, or maybe the provisioning process is too complex.
Without the diagram, you'd spend hours in meetings trying to understand who does what when. With it, you can identify the constraint, test solutions, and measure improvements in real-time.
Why this can work
Here's the thing about sequence diagrams: they're not just documentation—they're a way to test your assumptions about how work actually gets done.
When I first started thinking about mapping processes this way, I discovered something obvious that we'd been missing: we had no idea how long anything actually took.
We'd say "onboarding takes 5 days" but couldn't point to where the time was actually spent. We'd blame "slow handoffs" but couldn't identify which handoff was the problem. We'd assume engineering was the bottleneck but had no data to prove it.
The real test
Before sequence diagrams, the conversation went like this:
Me: "We should add a technical review step before provisioning resources." Team: "How long will that add to the process?"
Me: "I don't know, maybe a day?"
Team: "What if it takes longer?"
Me: "We'll figure it out."
With sequence diagrams, the conversation could be:
Me: "Here's the current flow. We can add a technical review here, which should add 4-8 hours depending on complexity."
Team: "What happens if the review fails?"
Me: "Good question. Let me add that path to the diagram."
Team: "Can we test this before we roll it out?"
Me: "Absolutely. Here's how we'd simulate it."
When you can see the process, you can test it before you build it. You can identify the constraints, measure the impact, and validate assumptions before they become problems.
What I learned about process observability
The hard truth is that most teams are flying blind when it comes to process performance. We know something is broken, but we don't know where or why.
We didn't know as much as we thought. We assumed sales handoffs were the problem, but it was actually the gap between onboarding and engineering. We thought context was getting lost in Slack, but the real issue was that nobody documented decisions in a way that survived the handoff.
Time tracking revealed bottlenecks we never saw. That "5-day onboarding" might actually be 2 days of work spread across 5 days because of handoff delays. The provisioning step we thought took 4 hours actually took 30 minutes, and the other 3.5 hours were waiting for someone to pick up the ticket.
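One way to make that gap visible is to plot work time against elapsed time. This Mermaid Gantt sketch uses made-up dates and durations to show how roughly two days of work can stretch across a five-day calendar once the waiting is drawn in:

```mermaid
gantt
    %% Illustrative only: the waiting rows are where the calendar time goes
    title Onboarding elapsed time vs. work time (illustrative)
    dateFormat YYYY-MM-DD
    section Handoff
    Sales handoff call            :a1, 2025-08-04, 1d
    Waiting for onboarding pickup :crit, a2, after a1, 2d
    section Provisioning
    Provision and configure       :b1, after a2, 1d
    Waiting for kickoff slot      :crit, b2, after b1, 1d
```

The work rows total two days; the waiting rows are what stretch it to five.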
Context preservation was a real problem. By the time a customer request made it from sales to engineering, we'd lost 70% of the context. The engineering team was left hunting for the jobs to be done before they even knew what problems customers had.
The implementation reality
And here's how we're trying to fix these issues:
Start with one process - Don't try to map everything. Pick the process that's causing the most pain and map just that one.
Measure what you can - Track handoff times, context loss, and failure rates. Don't try to build perfect metrics—just measure what you can see.
Fix the obvious problems first - In an example process, many handoff failures were due to missing information. Adding a simple checklist is a way to cut the failure rate dramatically.
Iterate on the process, not the documentation - The diagram is just a tool. Focus on making the actual process better, then update the diagram to reflect reality.
The goal isn't perfect observability—it's enough visibility to know when something's broken and where to look for the problem.
Your next steps
This week, pick one customer workflow that's been causing headaches. Maybe it's the lead handoff from marketing to sales, or the data sync between your CRM and billing system.
Map it as a journey first, then break it down into a sequence diagram. Identify the actors, the timing, and the decision points. Look for places where handoffs could fail and add error handling.
The goal isn't to create perfect documentation. It's to create a shared understanding that makes your team faster, more accountable, and better at delivering customer value.
What we'd do differently
Looking back at the Elasticsearch failure, here's what I’d recommend we do differently if we were designing the process today:
Document the normal flow - Create a sequence diagram showing how messages flow from email arrival to customer discovery (a sketch of this step follows the list)
Identify failure modes - Map out what happens when each component fails
Design manual workarounds - Create processes that don't depend on the failed component
Test the recovery process - Practice bringing systems back online before they break
Communicate proactively - Have templates ready for customer communication during outages
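A starting point for that first step might be the sketch below. The component names and the manual-workaround branch are assumptions drawn from the story above, not the actual architecture:

```mermaid
sequenceDiagram
    %% Normal flow plus the failure mode we had no process for
    participant C as Customer
    participant App as Shared inbox app
    participant ES as Search index (Elasticsearch)
    participant Sup as Support / on-call

    C->>App: Open shared inbox, look for new email
    App->>ES: Query for new messages
    alt Index healthy
        ES-->>App: Matching messages
        App-->>C: New email surfaced
    else Index down or rebuilding
        ES-->>App: Timeout or error
        App->>Sup: Alert that search is inoperative
        Sup-->>C: Status update and manual workaround (check another email client)
    end
```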
The teams that figure this out won't be the ones with the most sophisticated tools or the biggest budgets. They'll be the ones who can see what's actually happening, measure what matters, and fix what's broken.
And when the next Elasticsearch failure happens - because it will - they'll have a process for dealing with it instead of scrambling to figure out what to do.
What’s the takeaway? The goal isn't to prevent all failures. It's to have a clear process for responding to them when they happen. Because when customer handoffs work smoothly, nobody notices. And that's exactly how it should be.
Links for Reading and Sharing
These are links that caught my 👀
1/ How to write better - read more examples of good writing, like this compact example of how to write a solid design document. Less is more in this case.
2/ How do we want to apply AI - it turns out that just applying AI to everything has diminishing returns. What works better? Knowing where AI can add leverage. Leah Tharin suggests you think about how you want to apply it (hint: use it for the easy things you don’t want to think about).
3/ Is OpenAI the new Nokia? - On paradigm shifts in a winner-take-all market.
What to do next
Hit reply if you’ve got links to share, data stories, or want to say hello.
The next big thing always starts out being dismissed as a “toy.” - Chris Dixon