Putting AI to the Test: A QAE’s Look at Generative Models

4 min readAug 31, 2023

As the field of artificial intelligence continues to evolve, keeping up all the hype around it, testing AI generative models is becoming a critical skill of a QA Engineer’s life to ensure it’s accuracy, reliability, and overall quality.

For the past months, I have been defining an effective testing strategy for an AI generative model and applying it to models similar to OpenAI’s ChatGPT.

So let’s explore specific areas to focus on during testing, considering the unique challenges posed by generative models.

Testing Strategy

Requirements Analysis and Model Understanding
It may seem sometimes that it’s some kind of magic that model can understand what we want but , in fact, there is a translator that will try to transform the text into a logical query. So, before initiating the testing process, it’s crucial to understand the capabilities and limitations of the generative model. It is important to begin with collaboration with data scientists and developers to understand how the model handles, interprets and translates the questions and generates responses.
Test Case Design
Develop a diverse set of test cases that cover various scenarios and user inputs. You can categorize them into the following areas:

Intent Recognition: Test the model’s ability to accurately identify user’s intent and respond accordingly. Usually in the API response, it is possible to validate the intent identified by the model as well as the confidence level. It is important to rephrase the question in several ways to test thoroughly the intent recognition.
Validate the intent and confidence level in the API response:

{
  "id": "chatcmpl-6p9XYPYSTTRi0xEviKjjilqrWU2Ve",
  "object": "chat.completion",
  "created": 1677649420,
  "model": "gpt-3.5-turbo",
  "usage": {
    "prompt_tokens": 56,
    "completion_tokens": 31,
    "total_tokens": 87
  },
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Solar energy offers numerous benefits such as..."
      },
      "finish_reason": "stop",
      "index": 0
    }
  ],
  "intents": [
    {
      "name": "providing_information",
      "confidence": 0.9
    }
  ]
}

Context Management: Evaluate how well the model maintains context in longer conversations. It should be able to link your question to the previous ones so verify that the model understands and responds appropriately to previous user inputs within the same conversation.
Test the it’s responses to a wide range of questions, statements, and prompts to ensure coverage across different scenarios.

Here, the GenAI model needs to understand that “them” refers to our previous question about the boys band:

Q: Name me 5 famous rock bands
Bot:  Certainly, here are five famous rock bands:

    The Beatles
    Led Zeppelin
    Queen
    The Rolling Stones
    Nirvana
Q: Which of them still perform?

Ambiguity and Edge Cases: Test the responses to ambiguous queries and boundary inputs. Add incomplete questions, words that may have several meanings, random words, etc to check the response.
Offensive Content: Ensure the model handles inappropriate or offensive content gracefully. Evaluate responses for potential biases, stereotypes, or controversial content.
Security: Considering the amount of data behind it, it is crucial to test that sensitive information or personal data isn’t revealed. Depending on the context, we see several examples of prompts that can extract some user’s data, provide dangerous or harmful information that should not be answered by the chatbot.
For example, there are prompts trying to get instructions to build dirty bomb and these situations need to be addressed.
Fallback Mechanisms: Check for robust error handling when the model encounters inputs it cannot process effectively.

3. Testing

Manual Testing: Engage in manual testing to verify the correctness of responses based on expected behavior.
Automated Testing: Since we can’t predict the exact responses, automating may seem almost impossible. But instead of asserting the exact behavior, focus your automation on validating the intent detection, response time and keyword matching. Keep track of when the model is retrained or updated. Whenever a new model version is deployed, reevaluate and update the automated tests to ensure they’re still valid and relevant for the new model’s behavior.
Performance Testing: Evaluate response times under various workloads to ensure timely responses.
Security Testing: Verify that the model does not inadvertently expose sensitive information or respond to malicious inputs.
Scalability Testing: Assess the model’s ability to handle increased user load without degradation in performance.

4. Evaluating

Metrics: Apart from the regular test metrics, you can use more specific ones like BLEU (Bilingual Evaluation Understudy), ROUGE (Recall-Oriented Understudy for Gisting Evaluation) to quantitatively measure the quality of generated responses. Establish a baseline score to track improvements over time.
Also, you can calculate Intent Matching over time to see if there’s any improvement in identifying correctly the intent.
Entity Recognition can be used to evaluate how well the model identifies and correctly labels entities (e.g., names, dates, locations) mentioned in the user’s message. Calculate the amount of correctly identified entities over time.
User Feedback Integration: Incorporate user feedback and sentiment analysis to identify areas for improvement. It is important to use real user interactions to guide test case design and enhance the model’s performance.

In conclusion, testing AI generative models is a great challenge because we cannot predict with certainty the response and can’t define precisely the expected outcome since the response may vary every time you run the test. Therefore, instead of focusing on specific response, it is crucial to validate that the intent and entity detection works well and that similar questions identify same intents. The model improves over time, and so should the test plan. Also the testing strategy must adapt in order to incorporate the new developments in AI technologies.

Putting AI to the Test: A QAE’s Look at Generative Models

Testing Strategy

Written by Lina K~