What should we expect from AI agents?
"AI Agents" are the latest trend in making AI tools more effective. What recipes do they need to be useful? Read: "Everything Starts Out Looking Like a Toy" #200
Hi, I’m Greg 👋! I write weekly product essays covering topics like system “handshakes”, the expectations for workflows, and the jobs to be done for data. What is Data Operations? was the first post in the series.
This week’s toy: a quick way to remove Google AI results from your Google Search. This is a really good reminder that removing functionality can result in a much better experience for the user, even if the corporate overlords prefer highly clickable AI-driven linkbait.
It’s a milestone: ⭐️200⭐️ issues! Edition 200 of this newsletter is here - it’s May 27, 2024.
Have a comment? Interested in sponsoring this newsletter? Hit reply.
The Big Idea
A short long-form essay about data things
⚙️ What should we expect from AI agents?
AI-driven tools promise great results. The latest and greatest of these is the trend of Agentic AI, aptly described as a series of connected use cases. In that discussion, we hear Ezra Klein’s description of the future:
“The example I always use in my head is, when can I tell an AI, my son is turning five. He loves dragons. We live in Brooklyn. Give me some options for planning his birthday party. And then, when I choose between them, can you just do it all for me? Order the cake, reserve the room, send out the invitations, whatever it might be.”
This sounds pretty great as a goal, giving us the ability to outsource important (but mostly structured) decisions to a tool that can “reason” according to rules that we infer or set. Over time, we’ll develop a series of recipes customized with our instructions or adapted from templates shared by others.
The reality today? Pretty different. AI tools can do some amazing things, but they need guardrails like RAG (retrieval-augmented generation) – a technique that grounds answers in an authoritative set of documents – to avoid big missteps or complete failure.
One-shot AI queries can’t get enough context from outside sources to know whether those sources are reliable, satirical, or just plain wrong. When this type of error happens, it looks like the weird results that showed up in Google, advising you to include rocks in your diet, among other oddities. In this case, the AI model had ingested too many links from Reddit and The Onion.
You might wonder how you’re going to avoid a similar problem when using AI agents.
What is reasonable to expect from AI agents?
You and I don’t have the resources of Google to fix obvious AI problems in our toolset, but we can limit missteps by setting boundaries on the work we expect AI agents to do.
For starters, how should AI agents behave? We could begin by using a different set of words: identifying the rules that AI agents must follow to produce an expected outcome.
To define the outcomes we want, we need to confirm two things about AI agents:
they report the requirements (inputs) needed to accomplish a task
they give a confidence score for each answer or next decision
Reporting the inputs needed to accomplish a task means that each task needs a structure showing what is expected at the entry point, how data changes during the task, and what the output looks like afterward.
A proposed structure looks like:
data changes that happen - how data moves from one state to another during the process
rules for conflict resolution - how to combine information and resolve conflicts
step entry and step exit points - understanding what is expected at the beginning and end of the process step
an example structure to guide the AI - a template indicating the score needed to proceed or wait, and an example error status
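To make that concrete, here’s a minimal sketch of what such a structure could look like in code. The field names, the 0.8 threshold, and the conflict rule are illustrative assumptions, not a standard schema:

```python
from dataclasses import dataclass
from enum import Enum


class StepStatus(Enum):
    """Possible outcomes for a single agent step."""
    COMPLETED = "completed"
    WAITING = "waiting"   # confidence too low to proceed without review
    ERROR = "error"       # an impossible or missing-input state the agent should flag


@dataclass
class AgentStep:
    """One step in an agent workflow: what goes in, what comes out, and how sure we are."""
    name: str
    entry_requirements: list[str]         # inputs that must exist before the step runs
    exit_outputs: list[str]               # data the step is expected to produce
    conflict_rule: str = "prefer_newest"  # illustrative: how conflicting inputs are combined
    min_confidence: float = 0.8           # score needed to proceed instead of wait
    confidence: float | None = None       # filled in by the agent after the step runs
    status: StepStatus = StepStatus.WAITING
```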
It’s reasonable to expect a consistent output that tells us whether the step was completed, along with a resulting confidence score. It’s not reasonable to expect the process to continue without either a statement of expected outcome or some true/false flags that could be set. AI models are much better when you give them an obvious task to solve in a sea of data - help them out!
How can we train AI agents?
We talked about the inputs needed to create each part of the task, and the confidence score we expect the AI agent to produce as part of its outcome. A high-scoring step with the expected data should continue; a lower-scoring step might need human remediation; and a high-scoring step with no input should not be possible at all, so the agent needs to catch it as an error.
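A small sketch of that gating logic, with an assumed 0.8 threshold and made-up return values, might look like this:

```python
def route_step(confidence: float | None, has_required_input: bool,
               min_confidence: float = 0.8) -> str:
    """Decide what happens after one step, using the gating rules above."""
    if confidence is None:
        return "error"         # the agent must always report a score
    if not has_required_input:
        return "error"         # a score with no input is suspect, high or low
    if confidence >= min_confidence:
        return "continue"      # expected data plus a high score: keep going
    return "human_review"      # low score: park it for a person to remediate


# Example: a step that scored 0.55 with valid input gets routed to a human.
print(route_step(confidence=0.55, has_required_input=True))  # -> human_review
```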
With this model in mind, training the AI agent looks a lot like the way you would refine other workflow processes:
Create success criteria
Over time, build a better SOP (Standard Operating Procedure)
Document and create data flows to describe the outcome
Success criteria bind the workflow to a world where we can prove whether it’s working or not. Once we build the criteria for each step, creating an SOP becomes possible. Finally, data flows make it easier to identify which steps in the SOP are completed by humans and which are suitable to delegate to AI agents.
How do we validate the output?
Now that we have the structure of the output from the AI agent and a way to score that step, what do we do?
We could use a data factory to watch the outcomes of our workflows – including those done by AI agents – and divert the poor-performing outcomes into queues that are piloted by human operators who can fix or turn off AI agents as needed.
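As a rough sketch of that idea, assuming each outcome arrives as a small record with a workflow name, a confidence score, and a status (all illustrative field names):

```python
from collections import defaultdict

# Queues of outcomes waiting for a human operator, keyed by workflow name.
review_queues: dict[str, list[dict]] = defaultdict(list)


def watch_outcome(outcome: dict, min_confidence: float = 0.8) -> None:
    """Pass healthy outcomes through; divert poor performers to a human review queue."""
    if outcome.get("status") == "completed" and outcome.get("confidence", 0.0) >= min_confidence:
        return  # healthy outcome, nothing to do
    # Poor-performing outcome: queue it for the operators who own this workflow,
    # so they can fix the step or turn the agent off entirely.
    review_queues[outcome.get("workflow", "unknown")].append(outcome)


# Example: a low-confidence step gets diverted for human review.
watch_outcome({"workflow": "invoices", "step": "extract_total",
               "confidence": 0.42, "status": "completed"})
```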
Retaining flexibility to have more or fewer AI agents in your workflow is the core idea of “humans in the middle.” There are some flows where AI will never be appropriate and other flows that are a great fit for an AI agent. The OODA loop (Observe, Orient, Decide, Act) maps well to both human and AI operators.
Where’s the opportunity in this market? Watching a process, then creating a process map and the associated rules for human and AI agent operators. The team that manages this will create a genuinely useful force multiplier for productivity, even when AI agents are not used directly.
What’s the takeaway? AI agents need context to provide useful results, and a scoring rubric to confirm that the result they return has the characteristics of a useful one. Mapping the flows in our organizations to identify the steps that are appropriate for AI will dramatically improve the effectiveness of our workflows. This “data factory” is the data product organizations need.
Links for Reading and Sharing
These are links that caught my 👀
1/ Remote habits - James Temperton at PostHog shares thoughts on the habits of high-performing remote teams. The thing that sticks out most for me? The value and quality of their written documentation. That doesn’t mean writing a novel for every policy - it means writing a process that lets people take action confidently even when they don’t have other people in the room.
2/ On touch-tone keypads - Have you wondered why phones have the keypad layout that they do today? The design that’s survived looks a bit different than previous layouts and was chosen for speed and accuracy. Since we don’t dial by hand much today, is it time for a new layout?
3/ Recycling concrete - Concrete production causes a big portion of global emissions every year, and demand is only growing for construction material. Researchers have found a way to recycle that concrete in a low-emission way - pretty exciting.
What to do next
Hit reply if you’ve got links to share, data stories, or want to say hello.
Want to book a discovery call to talk about how we can work together?
The next big thing always starts out being dismissed as a “toy.” - Chris Dixon
Congrats on getting to #200!