How to test AI agents?
We are developing NewBabylonAI — a decentralized Appstore for AI agents. Watch our Colosseum hackathon submission here: https://arena.colosseum.org/projects/explore/394
This post is from NewBabylonAI CTO Kos.
In our previous Colosseum post, we mentioned that for the hackathon we decided to focus on two sample agents, Google Flights Booking and Amazon Shopping, to get first-hand experience and an understanding of which tooling NewBabylon platform developers might need.
It soon became clear that AI agent testing is one of the most critical tasks every agent developer will face. While working on the hackathon submission, we encountered several problems, as outlined below.
An AI agent is expected to navigate complex scenarios it has never encountered before. This leads to the first issue: it is impossible to cover all scenarios with tests, because the space of possible use cases is effectively infinite. Moreover, LLMs are often non-deterministic, which means the same question to the model can produce a different answer even when the temperature parameter is set to zero. A good blog post (https://152334h.github.io/blog/non-determinism-in-gpt-4/) dives into the non-determinism of OpenAI's GPT-4 and possible explanations for it.
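To make this concrete, here is a minimal sketch of how one could probe that non-determinism, assuming the openai>=1.x Python client and an API key in the environment; the model name and prompt are purely illustrative.

```python
# Minimal sketch: probe LLM non-determinism by repeating the same prompt
# at temperature 0 and counting how many distinct completions come back.
# Assumes the openai>=1.x Python client and OPENAI_API_KEY in the environment;
# the model name and prompt are illustrative.
from collections import Counter
from openai import OpenAI

client = OpenAI()

def sample_completions(prompt: str, n_runs: int = 20) -> Counter:
    answers = Counter()
    for _ in range(n_runs):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",   # illustrative model name
            messages=[{"role": "user", "content": prompt}],
            temperature=0,         # "deterministic" settings, in theory
        )
        answers[resp.choices[0].message.content.strip()] += 1
    return answers

if __name__ == "__main__":
    counts = sample_completions("Which airport serves Berlin? Answer with the IATA code only.")
    print(f"{len(counts)} distinct answers across {sum(counts.values())} runs")
```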
The second issue is that the line between correct and incorrect behavior is blurred. The agent can solve the same problem in different ways. Sometimes it takes what we might consider unnecessary steps but still achieves an optimal result. Sometimes it executes optimally, yet the result itself is sub-optimal: for example, the agent can book a flight without asking about the desired departure time. So it is not just right and wrong that need to be tested, but additional dimensions such as accuracy, quality, consistency, and more.
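As an illustration of what those extra dimensions could look like in practice, here is a small sketch of a per-run evaluation record; the field names are invented for this post, not a NewBabylon API.

```python
# Illustrative sketch only: one way to record more than a pass/fail verdict
# for a single agent run. Field names are invented for this post.
from dataclasses import dataclass

@dataclass
class RunEvaluation:
    task_completed: bool          # did the agent reach the goal at all?
    steps_taken: int              # efficiency: how many actions were needed
    clarifying_questions: int     # e.g. did it ask for a departure time?
    result_matches_intent: bool   # quality: is the booked flight what the user wanted?

    def verdict(self) -> str:
        if not self.task_completed:
            return "failed"
        if not self.result_matches_intent:
            return "completed, but sub-optimal result"
        return "completed"
```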
LLMs are effectively black boxes, and crafting effective prompts for optimal results is both an art and a science. Regression testing is extremely difficult here: it is impossible to predict how even minor changes to a prompt will affect overall agent quality.
While facing all these issues, we formed a strong opinion that quality assurance and testing tooling should be an essential part of the NewBabylon platform. As with the first agents, we have started developing this tooling based on our own experience; at a later stage, we will rely heavily on developers' feedback to prioritize the required features. For now, we are focusing on:
- Automated testing of agent quality when executing a whole chain of steps to complete a task given by the customer. For this purpose, we only assess how often an agent can complete a given task; efficiency, optimal execution, and optimality of the result are not taken into account (see the sketch after this list).
- Automated testing of the correctness, consistency, and optimal execution of individual steps. Since we are focused on web-based AI agents for now, we must ensure that an agent understands how to navigate all possible variations of web pages.
- Human review and debugging tool. This tool will allow developers to get a deep understanding of every single step performed by an agent: what information the agent was operating on, what the responses of the underlying LLMs were, and how they were transformed into actions for the next steps.
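Below is a minimal sketch of the first item, measuring how often an agent completes a given task over many runs; `Outcome` and `run_agent_task` are hypothetical stand-ins for the real harness.

```python
# Sketch: estimate the task completion rate by running the same end-to-end
# task many times and counting successes. Efficiency is deliberately ignored.
# `Outcome` and `run_agent_task` are hypothetical stand-ins for the real harness.
from dataclasses import dataclass
import random

@dataclass
class Outcome:
    task_completed: bool

def run_agent_task(task: str) -> Outcome:
    """Stand-in for the real agent harness; in reality this drives the browser agent."""
    return Outcome(task_completed=random.random() > 0.3)   # placeholder behaviour

def completion_rate(task: str, runs: int = 20) -> float:
    """Fraction of runs in which the agent completed the task at all."""
    successes = 0
    for _ in range(runs):
        try:
            successes += int(run_agent_task(task).task_completed)
        except Exception:
            pass    # a crash or timeout simply counts as a failed run
    return successes / runs

if __name__ == "__main__":
    print(completion_rate("Book the cheapest direct flight from BER to LIS"))
```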
Our first version of the agent quality assurance framework is up and running. While it is still just a first take on the problems outlined above, we are quite happy with how it works. Here is how the framework is organized:
We have set up CI so that every time we want to test new code, a set of Docker images is built and deployed to Google Cloud Platform.
A tricky part is running the browser on a headless machine with a UI system enabled (so the browser itself is headed) and our agent extension installed. This allows us to run 'online' tests, where the agent and extension have access to live web page content instead of HTML snapshots. This is important because a static DOM snapshot doesn't reflect all the complexities of modern web pages.
For every test case, we spin up a clean copy of the headed browser, install the agent extension, and imitate specific user actions to bring the web page to the pre-test state.
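As an illustration of that setup, here is a minimal sketch using Playwright for Python (not necessarily the driver we use internally); the extension path, profile directory, and selectors are placeholders, and on a headless CI machine the script would run under a virtual display (e.g. via xvfb-run).

```python
# Sketch: launch a headed Chromium with an unpacked extension loaded and drive
# the page into a pre-test state. Playwright for Python is used as an example
# driver; paths and selectors below are placeholders, not real markup.
import tempfile
from playwright.sync_api import sync_playwright

EXTENSION_PATH = "/path/to/agent-extension"   # unpacked extension build (placeholder)

def run_pretest_scenario(url: str) -> None:
    with sync_playwright() as p:
        # Chromium only loads extensions in a persistent, headed context.
        context = p.chromium.launch_persistent_context(
            tempfile.mkdtemp(),                # fresh profile for every test case
            headless=False,
            args=[
                f"--disable-extensions-except={EXTENSION_PATH}",
                f"--load-extension={EXTENSION_PATH}",
            ],
        )
        page = context.new_page()
        page.goto(url)
        # Imitate the user actions that bring the page to the pre-test state;
        # the selector is illustrative, not the real Google Flights markup.
        page.fill("input[aria-label='Where to?']", "Lisbon")
        page.keyboard.press("Enter")
        # ...from here, control is handed over to the agent extension...
        context.close()
```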
After that, we ask our agent to proceed with the next step, at which point the LLMs' inputs and outputs are matched against the expected results. Validation is done by checking the presence or absence of certain HTML elements on the web page, their values, state, and so on.
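A hypothetical step-level validation could look like the sketch below, again with Playwright-style assertions; the selectors and expected values are purely illustrative.

```python
# Sketch of step-level validation: after the agent performs its step, assert
# the expected DOM state. Selectors and expected values are illustrative.
from playwright.sync_api import Page, expect

def validate_departure_selected(page: Page) -> None:
    # An element that must be present after the step...
    expect(page.locator("[data-test='selected-departure-flight']")).to_be_visible()
    # ...a field that must hold a specific value...
    expect(page.locator("input[name='passengers']")).to_have_value("1")
    # ...and an element that must no longer be on the page.
    expect(page.locator("text=Choose a departing flight")).to_have_count(0)
```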
We run our tests nightly against the latest codebase. We usually do 20–30 runs of each test to get a statistically significant result, since execution flows are not deterministic. The next day, we can pull the results of dozens of runs by simply querying the test database, compare them with the previous day's results, and conclude how the most recent changes influenced agent quality.
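A simplified version of that nightly comparison could look like the following sketch; the schema (a `runs` table with `test_name`, `run_date`, and `passed` columns in SQLite) is assumed for illustration and is not our real database layout.

```python
# Sketch of the nightly comparison: pull per-test pass rates from the results
# database and diff them against the previous day. The schema is illustrative.
import sqlite3

QUERY = """
SELECT test_name, AVG(passed) AS pass_rate
FROM runs
WHERE run_date = ?
GROUP BY test_name
"""

def pass_rates(conn: sqlite3.Connection, day: str) -> dict[str, float]:
    return {name: rate for name, rate in conn.execute(QUERY, (day,))}

def compare(conn: sqlite3.Connection, today: str, yesterday: str) -> None:
    new, old = pass_rates(conn, today), pass_rates(conn, yesterday)
    for test, rate in sorted(new.items()):
        delta = rate - old.get(test, 0.0)
        print(f"{test}: {rate:.0%} ({delta:+.0%} vs previous day)")
```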
As you might expect, running automated tests against a live website inevitably means running into bot-protection systems. For Amazon, for example, we had to implement an in-house deCAPTCHA service, which is extremely helpful for running multiple scenarios simultaneously.
All this might sound complicated, and it really is. However, that is exactly where we see the value we can create for NewBabylonAI developers. We aim to abstract away as many complexities as possible and provide developers with simple tooling. Without automated tests, it is impossible to tell whether your changes improved or worsened agent quality. What we build will save developers time and add transparency and confidence to the agent development process.