Treat the AI Like a User

Lately there have been a lot of companies building AI Agents on top of LLM platforms: essentially, connecting an AI model to a set of tools and allowing the LLM to interpret and coordinate a series of actions at the user's direction. This is generally referred to as the ReAct paradigm (Reason + Act).

A couple of popular libraries implementing this are AutoGPT and LangChain. I've been building with both a fair bit lately, and noticed a really interesting trend: the best way to reliably interact with LLM Agents is to treat them like a user.

That means:

  • Give them simple, single-purpose functions
  • Error handling should be verbose
  • Format responses like strings
  • Inconsistent returns are ok

When you're building tools for an end user, you're typically not throwing them raw JSON and asking them to interpret it. Instead, you're giving them paragraphs, buttons, bulleted lists, etc. All the nice GUI elements we are used to.

Now, while LLMs seem pretty proficient at parsing JSON, in practice they are much less reliable than if you pre-parse that JSON into a clearer value.

The techniques I highlight below have helped me reduce a lot of the variability in LLM Agent performance.

Single purpose functions

For the purpose of these examples, let's pretend we're dealing with a Realtor AI Agent. It'll help users shop for, compare, and purchase a home.

Now, if I were building a typical API, I would think a generalized search endpoint would be valuable here. It would take in zip code, price range, bedrooms, bathrooms, acreage, etc., and return an array of houses that meet those criteria.

Unfortunately, in practice this performs quite poorly. The more arguments a function takes, the higher the risk of the LLM "forgetting" which arguments go where.

For example:

Prompt: You have access to the function `omni_search`
It optionally takes:
- zip_code
- bedrooms
- upper_price
- lower_price


User: Find a house near me
AI: please provide your zip code for a search

User: 94104
AI: executing omni_search(zip_code=94104, bedrooms=null, upper_price=null, lower_price=null)

User: filter for 2 bedrooms
AI: executing omni_search(zip_code=94104, bedrooms=2, upper_price=null, lower_price=null)

User: I just want to see houses under $500,000
AI: executing omni_search(zip_code=null, bedrooms=null, upper_price=500000, lower_price=null)

In this example, the omni_search function works fine for the first couple of searches, but then, through some miswording or simply an LLM hiccup, the earlier search parameters are ignored and the LLM decides to just search for any house under $500k.

This is an issue that will come up time and time again if you have complex functions and are relying on the LLM to use its conversation memory to handle requests.

So let's think: how would a user-facing application typically handle this sort of functionality?

How about a search filter!

Rather than relying on the chat memory to correctly parse the search arguments out of each additional user request, we could emulate a search filter. In a front-end application, you can imagine storing the current state of a filter component in some useState or redux store. Likewise, here we can create a search_filters object as a store of state, and have the Agent's tools update this state rather than execute the search directly.

An updated set of functions might look like:

Prompt: You have access to the following functions:
- update_user_zip_code()
- update_user_preferred_bedrooms()
- update_user_maximum_budget()
- update_user_minimum_budget()
- clear_search_filters()

Each state-update function then calls omni_search() and returns the result to the LLM to format and display. This way you've removed ambiguity, and you've created a more reliable store of information than the chat history.
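Here's a minimal sketch of what that could look like in Python. The search_filters dict and the omni_search() stub are hypothetical placeholders for whatever state store and search backend you actually use:

```python
# A minimal sketch of the "search filter" approach. The search_filters
# store, the tool functions, and omni_search() are hypothetical
# placeholders, not a real API.

search_filters = {
    "zip_code": None,
    "bedrooms": None,
    "upper_price": None,
    "lower_price": None,
}

def omni_search(filters: dict) -> str:
    # Placeholder for the real search backend. Returns a formatted
    # string describing the results (see the formatting section below).
    return f"Searching with filters: {filters}"

def update_user_zip_code(zip_code: str) -> str:
    """Tool exposed to the agent: set the zip code, then re-run the search."""
    search_filters["zip_code"] = zip_code
    return omni_search(search_filters)

def update_user_preferred_bedrooms(bedrooms: int) -> str:
    search_filters["bedrooms"] = bedrooms
    return omni_search(search_filters)

def update_user_maximum_budget(upper_price: int) -> str:
    search_filters["upper_price"] = upper_price
    return omni_search(search_filters)

def update_user_minimum_budget(lower_price: int) -> str:
    search_filters["lower_price"] = lower_price
    return omni_search(search_filters)

def clear_search_filters() -> str:
    for key in search_filters:
        search_filters[key] = None
    return "All search filters have been cleared."
```

Each tool takes exactly one argument, so there's nothing for the LLM to mix up, and the filter state survives however the conversation meanders.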

Error handling should be verbose

This one is pretty straightforward. If you have an input that the LLM can get wrong, make sure you give it verbose error responses.

With a tool like LangChain, failed requests can be retried, so the LLM can use your error responses to correct the input and resubmit.

User: Find a house near 06096

AI: searching_properties(zip_code=6096)
Error: 500
AI: Sorry, there are no homes available in that area.

The above example is pretty common: the LLM treats a zip code starting with `0` as a number and passes just the last 4 digits. When it gets back an opaque 500, the LLM is forced to come up with an (often wrong) justification. With good error handling, though, the LLM can automatically retry the request.

User: Find a house near 06096

AI: searching_properties(zip_code=6096)
Error: The zip_code must be 5 characters. Correct and retry.
AI: searching_properties(zip_code='06096')
[correct result]

Imagine dealing with a user who can only enter strings and has zero front-end validation. Your tools need a combination of flexibility and graceful error handling.
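As a rough sketch of that idea (searching_properties and fetch_listings here are hypothetical, not a real library call), a tool can validate its input and hand back an actionable message instead of letting a 500 bubble up:

```python
# A sketch of verbose, actionable error handling. fetch_listings() is a
# hypothetical stand-in for the real listings lookup.

def fetch_listings(zip_code: str) -> str:
    # Placeholder for the real backend call.
    return f"Found 3 listings in {zip_code}."

def searching_properties(zip_code) -> str:
    zip_str = str(zip_code)
    if not zip_str.isdigit() or len(zip_str) != 5:
        # Tell the LLM exactly what went wrong and how to fix it,
        # so a retry has something concrete to act on.
        return (
            f"Error: zip_code must be a 5 character string, got '{zip_str}'. "
            "Pad with leading zeros if needed, then retry."
        )
    try:
        return fetch_listings(zip_str)
    except Exception as exc:
        return f"Error: the search failed ({exc}). Adjust the input and retry."
```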

Format responses like strings

LLMs read strings. LLMs write strings. Nothing else.

Just as you need to give back well formatted error responses, you also need to format successful responses in a way that is easily parsed by the LLM.

[API Response]
{
  bedrooms: '2',
  bathrooms: '1.5',
  home_type: 'house',
  asking: '300000',
}

In the above JSON response, there are a couple items that might be confusing to the LLM.

bathrooms: 1.5

  • It might get this, it might not. It might round 1.5 to 2. It might assume that all bathrooms have a decimal value and start displaying 2.0, 3.0 for other bathrooms.

asking: 300000

  • Once again, this seems clear to us, but it's definitely another opportunity for misinterpretation. Better to be specific.

A better response might look like this:

[API Response]
Asking price is $300,000. The house has:
- 2 bedrooms
- 1 full bathroom
- 1 half-bathroom

Here we've made the response very explicit and taken away some risk of the LLM misinterpreting results. But we can improve it further.

[API Response]
Asking price is $300,000. The house has:
- 2 bedrooms
- 1 full bathroom
- 1 half-bathroom

AI: [formatted response]

User: How many acres does it have

AI: The property has 2.5 acres *hallucination*

A developer is testing a bar. He orders 1 beer, 0 beers, 999 beers, -1 beers. Everything works fine. A user walks into the bar, asks where the bathroom is, and the bar bursts into flames.

Some of the biggest issues I've had revolve around the LLM lacking data and hallucinating a response. If data is unknown to you, it's best to be explicit. This is especially true when you have inconsistent data between responses (i.e. some houses have acreage info, some do not).

[API Response]
Asking price is $300,000. The house has:
- 2 bedrooms
- 1 full bathroom
- 1 half-bathroom
- Acres: unknown
- Square footage: unknown
- Swimming pool: unknown
- Year built: unknown
AI: [formatted response]

User: How many acres does it have

AI: I'm sorry, I don't have access to acreage information on this property.

It might seem overly verbose, but passing the LLM an explicit "unknown" or "no data on XYZ" will significantly reduce hallucinations.
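Here's a sketch of what that pre-parsing might look like in code. The format_property_response helper is hypothetical, and the field names simply mirror the example response above:

```python
# A sketch of pre-parsing a raw listing into an explicit string, with
# every field the LLM might be asked about spelled out, even when the
# value is unknown. Field names mirror the hypothetical response above.

def format_property_response(listing: dict) -> str:
    lines = []

    asking = listing.get("asking")
    lines.append(
        f"Asking price is ${int(asking):,}." if asking else "Asking price: unknown."
    )

    bedrooms = listing.get("bedrooms")
    lines.append(f"- {bedrooms} bedrooms" if bedrooms else "- Bedrooms: unknown")

    bathrooms = listing.get("bathrooms")
    if bathrooms:
        # Split "1.5 bathrooms" into explicit full and half bathrooms.
        full, half = divmod(float(bathrooms), 1)
        lines.append(f"- {int(full)} full bathroom(s)")
        if half:
            lines.append("- 1 half-bathroom")
    else:
        lines.append("- Bathrooms: unknown")

    # Fields that are often missing: say "unknown" explicitly rather
    # than omitting them, so the LLM doesn't fill the gap itself.
    for field in ("acres", "square_footage", "swimming_pool", "year_built"):
        value = listing.get(field)
        label = field.replace("_", " ").capitalize()
        lines.append(f"- {label}: {value if value is not None else 'unknown'}")

    return "\n".join(lines)


print(format_property_response({"bedrooms": "2", "bathrooms": "1.5", "asking": "300000"}))
```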

Inconsistent returns are ok

If I'm designing an API endpoint to search for homes, I expect it to return an array of JSON objects. And if no homes are found, that array will be empty. The return type should always be an array.

Unfortunately, with an LLM this tends to create a lot of uncertainty. Often the model seems afraid of not returning anything, and will take the `[]` as an opportunity to hallucinate a few homes for us.

User: Find a house near 94114

AI: searching_properties(zip_code=94114, max_price=100000)
Result: [].
AI: I found a few houses for you: *hallucinating*
- 123 Hollywood Lane CA, 94114
- 101 California Street CA, 94114
- 2261 Baker Street CA, 94114
Please let me know if you would like to know more about any of these.

Here the AI has hallucinated a few homes for us because it didn't have any information to go on. What might seem obvious to us (empty array === no houses found) will not always translate to the LLM.

User: Find a house near 94114

AI: searching_properties(zip_code=94114, max_price=100000)
Result: No properties found for this search.
AI: I'm sorry, I'm unable to find any properties for that search.
Please adjust your search terms and try again.

An even better response would include which filter is the limiting factor in this search, since right now the LLM doesn't know whether there are no houses at all in 94114, or just none under $100,000.

User: Find a house near 94114

AI: searching_properties(zip_code=94114, max_price=100000)
Result: No properties found for less than 100000.
AI: I'm sorry, I'm unable to find any properties under $100,000.
You can try a different zip code, or increasing the maximum price.
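In code, that might look something like the sketch below, where run_search() and its sample data are hypothetical stand-ins for the real query, and the empty case is turned into a sentence that names the limiting filter:

```python
# A sketch of returning descriptive strings instead of an empty array,
# and naming the filter that limited the search. run_search() and the
# sample data are hypothetical.

def run_search(zip_code: str) -> list[dict]:
    # Placeholder: a real implementation would query your listings API.
    return [
        {"address": "123 Example St", "asking": 450_000},
        {"address": "456 Sample Ave", "asking": 380_000},
    ]

def searching_properties(zip_code: str, max_price: int | None = None) -> str:
    results = run_search(zip_code)

    if not results:
        return f"No properties found in {zip_code} at all. Suggest trying a different zip code."

    if max_price is not None:
        within_budget = [home for home in results if home["asking"] <= max_price]
        if not within_budget:
            # Name the filter that eliminated everything, so the LLM can
            # suggest a useful next step instead of guessing.
            return (
                f"Found {len(results)} properties in {zip_code}, but none are priced "
                f"under ${max_price:,}. Suggest raising the maximum price."
            )
        results = within_budget

    return "\n".join(
        f"- {home['address']} (asking ${home['asking']:,})" for home in results
    )


print(searching_properties("94114", max_price=100_000))
# -> Found 2 properties in 94114, but none are priced under $100,000. ...
```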

Isn't non-deterministic programming fun!

Building tools with LLMs is a constant rollercoaster ride. Some days you're amazed at all the insights an LLM can pull from data, or its ability to chain together tools to accomplish a task.

Other days you're yelling at the screen while GPT tells the user a house has 1.3333333333 bathrooms. The frustration is exactly the same as watching a user navigate your UI in a way you never anticipated.

But the LLM === Clueless User mental model has helped me quite a bit in reducing the variability of responses. The superpowers of AI models come at a cost: your tools will be much more flexible, but they lose the predictability that we have come to love.