Welcome to the second post in my series on Testing AI. This series of posts was inspired by the Ministry of Testing 30 Days of AI in Testing event. If you haven’t already, check out the first post in this series – Testing AI – How to run Llama 2 LLM locally. In this post I discuss my experience of creating an automated test framework for prompt testing a Large Language Model (LLM) using Playwright.

Automating Testing of an LLM with Playwright

After my last post I already had an LLM running locally. The next question was how to write automated tests for it. The answer was Prompt Testing with Playwright.

What is Prompt Testing?

Prompt Testing, sometimes called Prompt Engineering, is where you provide a prompt to an LLM and assess its response. In this post I won’t cover the finer details of Prompt Testing, instead focusing on how you can create reliable test automation for it.

Tools for Automated Prompt Testing

For the purpose of this post I will be using the following tools:

  • Llama 2 – Large Language Model created by Meta
  • Ollama – Tool to run Llama 2 locally
  • Playwright – Test framework that supports many different languages.

Ollama provides an API you can use for interacting with your LLM. The API provides a simple way to create automated Prompt Tests.

I chose to write the tests in TypeScript. I have made my example test framework available on GitHub.
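To give a flavour of what this looks like, here is a minimal sketch of a single-prompt test that calls Ollama’s /api/generate endpoint from a Playwright test. The model name, prompt and expected keyword are placeholders rather than the exact tests from my repository.

    import { test, expect } from '@playwright/test';

    // Minimal sketch of a single prompt test against a locally running Ollama instance.
    // Ollama listens on port 11434 by default; the model, prompt and keyword are examples.
    test('Llama 2 answers a simple prompt', async ({ request }) => {
      const response = await request.post('http://localhost:11434/api/generate', {
        data: {
          model: 'llama2',
          prompt: 'In one sentence, what is software testing?',
          stream: false, // return a single JSON object rather than a token stream
        },
        timeout: 120_000, // local LLM responses can be slow
      });

      expect(response.ok()).toBeTruthy();
      const body = await response.json();
      // The generated text comes back in the `response` field.
      expect(body.response.toLowerCase()).toContain('software');
    });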

Although we are making API calls, there is a significant difference between API testing and Prompt Testing.

Difference between Prompt Testing and API Testing

Using the API makes testing the LLM very similar to other API testing, but with one key difference. We will never be 100% sure of the response we will receive from the LLM. To work around this, you have to write assertions based on keywords or phrases that you expect the response to contain, rather than being able to assert on complete responses. I have some ideas on how we can do this better though – more on that later.

“Forbidden words”

I also played around with the concept of “forbidden words”. These are words that no reply should ever contain, so every response is asserted not to include them.

The use case for this is going to vary depending on the context in which your LLM is being deployed. I imagined that the LLM was going to be a public-facing service. In this context there are many things you would want your chatbot not to say, from use of offensive language to commenting on competitors. We can assert against the forbidden words in all tests, as well as creating tests that specifically target prompt injection.

I included my checkForForbiddenWords function as part of my overall checkResponseIncludes function, as I didn’t want to have to call it explicitly from every test. However, I also decided that I wanted a way to bypass it, so I made it possible to pass a variable to do that.
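As a rough illustration of how this can fit together (the forbidden word list and the exact signatures below are placeholders, not the real code from my repository):

    import { expect } from '@playwright/test';

    // Placeholder list - in a real project this would reflect your own context,
    // for example offensive terms or competitor names.
    const forbiddenWords = ['exampleCompetitor', 'exampleOffensiveWord'];

    export function checkForForbiddenWords(responseString: string) {
      for (const word of forbiddenWords) {
        // No response should ever contain a forbidden word.
        expect(responseString.toLowerCase()).not.toContain(word.toLowerCase());
      }
    }

    export function checkResponseIncludes(
      responseString: string,
      expectedWords: string | string[],
      skipForbiddenWordsCheck = false // the bypass variable mentioned above
    ) {
      if (!skipForbiddenWordsCheck) {
        checkForForbiddenWords(responseString);
      }
      // ...followed by the expected word/phrase assertions shown in the next section.
    }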

The Ollama API provides a way to automate both single prompts and end-to-end conversations. For end-to-end conversations we provide both the user prompts and the LLM responses up until the final response. You can see examples of both test types in my example tests on GitHub.
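Here is a sketch of what a conversation test against Ollama’s /api/chat endpoint can look like; again, the model, messages and assertion are illustrative.

    import { test, expect } from '@playwright/test';

    // Sketch of an end to end conversation test. The earlier turns are supplied
    // as message history and only the final assistant reply is generated.
    test('Llama 2 keeps context across a conversation', async ({ request }) => {
      const response = await request.post('http://localhost:11434/api/chat', {
        data: {
          model: 'llama2',
          stream: false,
          messages: [
            { role: 'user', content: 'My favourite colour is green.' },
            { role: 'assistant', content: 'Green is a lovely colour!' },
            { role: 'user', content: 'What is my favourite colour?' },
          ],
        },
        timeout: 120_000,
      });

      const body = await response.json();
      // The final reply is returned under message.content.
      expect(body.message.content.toLowerCase()).toContain('green');
    });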

Improving Assertions

I decided to tackle assertions in two ways. The first, and most familiar to anyone who has done API testing before, is asserting on the content of the response.

Asserting on the response content

For this part of my tests, I am simply asserting that my response contains a keyword or phrase. I created a re-usable helper function for this.

    // Every expected word or phrase must appear somewhere in the LLM's response.
    for (const word of expectedWords) {
      expect(responseString.includes(word)).toBeTruthy();
    }

The function is written so that you can provide either a single string or multiple strings as an array. This allows me to create assertions that check responses for single words or phrases, or a combination of them.
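In a test that looks something like this (the import path and the keywords are illustrative):

    // Hypothetical import path for the helper sketched above.
    import { checkResponseIncludes } from './helpers/assertions';

    // `responseString` would be the text returned by the LLM for a given prompt.
    const responseString = 'Playwright is an open source framework for browser automation.';

    // A single keyword or phrase...
    checkResponseIncludes(responseString, 'Playwright');

    // ...or a combination that must all be present.
    checkResponseIncludes(responseString, ['Playwright', 'browser automation']);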

Having to write assertions in a way that allows for the variation in answers an LLM will return opens us up to flaky test results. What we really need is the ability to evaluate whether the answer conveys the same information as we expected it to.

Testing an LLM with an LLM

We can use an LLM to test that the expected response and the received response are semantically similar to each other; I call this method an Evaluator LLM. Doing this means we don’t need to know the exact response we are going to receive, only an idea of what the LLM will say in response to the prompt.

To do this, we provide a prompt that looks something like this:

If the following 2 statements appear to give accurate and corroborating answers to the same question respond 'Yes, the statements are similar'. Statement 1 - {{originalResponse}} - end of statement 1. Statement 2 - {{expectedResponse}} - end of statement 2

Of course, this means we are testing one black box with another. In the case of my example, we are actually using the same LLM to do the comparison at the end. Not ideal, but fine for demoing the concept. This does open us up to incorrect assessments from the LLM(s), but it gives us another way to check the response. It is also worth pointing out that the prompt I have created is likely far from infallible. It would need to be iterated on over time, but it serves the purpose for this demonstration of the concept.
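Here is a sketch of how the Evaluator LLM check can be wired up; the helper shape and test data are illustrative rather than the exact code from my repository.

    import { test, expect, APIRequestContext } from '@playwright/test';

    // Ask Llama 2 whether two statements convey the same information.
    async function evaluateSimilarity(
      request: APIRequestContext,
      originalResponse: string,
      expectedResponse: string
    ): Promise<string> {
      const evaluatorPrompt =
        `If the following 2 statements appear to give accurate and corroborating answers ` +
        `to the same question respond 'Yes, the statements are similar'. ` +
        `Statement 1 - ${originalResponse} - end of statement 1. ` +
        `Statement 2 - ${expectedResponse} - end of statement 2`;

      const response = await request.post('http://localhost:11434/api/generate', {
        data: { model: 'llama2', prompt: evaluatorPrompt, stream: false },
        timeout: 120_000,
      });
      return (await response.json()).response;
    }

    test('answer is semantically similar to the expected answer', async ({ request }) => {
      // In a real test `answer` would come from a previous call to the LLM under test.
      const answer = 'Playwright is an open source framework for automating browsers.';
      const expectedAnswer = 'Playwright is a tool that automates web browsers for testing.';

      const verdict = await evaluateSimilarity(request, answer, expectedAnswer);
      expect(verdict).toContain('Yes, the statements are similar');
    });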

Why not both?

I think the current best way forward for this type of testing is to use a combination of traditional assertions and an Evaluator LLM. This way you can use your traditional assertions to check that the response contains (or does not contain) specific keywords, and use the Evaluator LLM to provide confidence in the semantic similarity to an expected answer.
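Pulled together, a test in this style might look like the sketch below, where checkResponseIncludes and evaluateSimilarity are the hypothetical helpers sketched earlier in this post.

    import { test, expect } from '@playwright/test';
    // Hypothetical import paths for the helpers sketched earlier.
    import { checkResponseIncludes } from './helpers/assertions';
    import { evaluateSimilarity } from './helpers/evaluator';

    test('Llama 2 describes Playwright accurately', async ({ request }) => {
      const response = await request.post('http://localhost:11434/api/generate', {
        data: { model: 'llama2', prompt: 'What is Playwright used for?', stream: false },
        timeout: 120_000,
      });
      const answer = (await response.json()).response;

      // Traditional assertions: specific keywords must appear (and forbidden words must not).
      checkResponseIncludes(answer, ['Playwright', 'automation']);

      // Evaluator LLM: the answer should convey the same information as a known good answer.
      const verdict = await evaluateSimilarity(
        request,
        answer,
        'Playwright is used to automate browsers for end to end testing.'
      );
      expect(verdict).toContain('Yes, the statements are similar');
    });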

I have provided an example of using an Evaluator LLM in my example tests on GitHub. In this case I am using Llama 2 again to assess the quality of the answer it provided against a known answer.

Dedicated LLM testing tools

Dedicated LLM testing tools do exist, and I looked at a couple before settling on Playwright.

promptize

One I would like to have investigated further was promptize, as it offers additional features that don’t exist natively when using a tool like Playwright. Unfortunately, it is built specifically to support OpenAI, and after a quick look I decided that refactoring it to support other tools and models was going to be more work than I was able to invest at this point. Maybe one for the future though.

promptfoo

I also found promptfoo, which looks promising. It is more open than promptize and again offers some nice features natively, including the creation of test matrices.

I haven’t invested as much time as I would like with promptfoo yet. It has definite appeal over my Playwright solution, so I will come back to it in the future. I expect that in order to get consistent results I will need to customise my configuration further than I have so far.

TruLens

TruLens is similar to promptfoo. It offers a way to assess the quality of your LLM, rather than being a strict test tool. At the time of writing this post I am still investigating TruLens, but it looks promising and I will be writing a full post about it soon.

Summary

Automated prompt testing feels like the equivalent of automated UI testing for a web application. It doesn’t require an understanding of what is happening under the hood and is close to the end user experience of the application. However, it also carries some of the same downsides, albeit for different reasons. Primarily, it is expensive to run: in the case of LLMs this is because, when they are cloud hosted, you will be charged for every prompt.

Automated prompt testing is a great starting point for automating our testing, but we definitely need to look deeper under the hood to test most efficiently. In future posts I hope to explore dedicated LLM test tools further, as well as looking into specialist observability tools, so watch this space!

Further reading

If you enjoyed this post then be sure to check out my other posts on AI and Large Language Models.

Subscribe to The Quality Duck

Did you know you can now subscribe to The Quality Duck? Never miss a post by getting them delivered direct to your mailbox whenever I create a new post. Don’t worry, you won’t get flooded with emails; I post at most once a week.