Artificial intelligence (AI) is revolutionizing the software industry and fundamentally changing organizations. But with new possibilities come new challenges: faulty AI systems can not only cause financial losses but also jeopardize the trust of your customers and partners. How does testing AI software differ from testing conventional applications, and which methods are most effective? This article provides the answers.
The Two Biggest Differences Between AI Software and Conventional Software
Mode of Operation
The core difference lies in the mode of operation: while classic software is deterministic (it always delivers the same result for the same inputs), AI often works probabilistically. This means:
Unpredictable Results: AI can react differently to the same inputs.
Dynamic Systems: AI models can evolve through new data.
Data Dependency: The quality of the model depends on the quality of the training data (data that AI models use to learn and improve their predictions).
Quality Characteristics
For software without AI, the following quality characteristics are relevant:
Functionality
Reliability
Usability
Efficiency
Security
Maintainability
Compatibility
Portability
AI software is far more complex and comes with additional quality characteristics:
Flexibility in situations and use cases that were not originally planned
Adaptability of AI models to new hardware or operating environments
Autonomy as the ability of the AI system to work without human control
Evolution of the AI system to improve itself based on new external circumstances
Bias as an indicator of systematic distortion, and thus of the quality, of the results of an AI system.
Ethics in the interaction between AI and humans/the environment. You can find insights on this in our article on Responsible AI.
Side Effects and Reward Hacking as indicators of whether the AI system solves tasks in the spirit intended by the task setter
Transparency, Interpretability, and Explainability for security and trust in AI systems
These factors require special testing strategies to ensure high quality and reliability.
Test Levels of AI Software
Before we get into specific techniques, let's take a look at the individual test levels.
In addition to the familiar test levels for conventional software, such as component, integration, system, and acceptance tests, we introduce two additional, earlier test levels for AI-based systems:
Input Data Tests
Input data tests ensure the data quality for training, evaluation, testing, and deployment:
Reviews
Statistical Techniques such as testing data for biases
Exploratory Data Analysis (EDA) of the training data involves examining and visualizing data to find patterns, trends, or anomalies
Static and Dynamic Tests of the data pipeline (data pipeline = the entire process from collecting and processing data to using it in an AI model).
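To make this more concrete, the following is a minimal sketch of what such input data checks might look like in Python, assuming a tabular training dataset; the file name, the column names ("gender", "label"), and the 0.2 threshold are purely illustrative.

```python
import pandas as pd

# Hypothetical tabular training data with a sensitive attribute and a binary label.
df = pd.read_csv("training_data.csv")  # assumed file name

# Exploratory checks: share of missing values per column and duplicated rows.
print(df.isna().mean().sort_values(ascending=False))
print("duplicated rows:", df.duplicated().sum())

# Simple bias indicator: positive-label rate per group of the sensitive attribute.
group_rates = df.groupby("gender")["label"].mean()
print(group_rates)

# Flag the dataset if the positive rate differs strongly between groups
# (the 0.2 threshold is an arbitrary example value).
if group_rates.max() - group_rates.min() > 0.2:
    print("Warning: possible representation bias in the training data")
```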
Machine Learning (ML) Model Tests
A well-trained model is far from being bug-free. Model tests should ensure that it meets all the specified criteria:
Functional performance criteria
Non-functional acceptance criteria
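As an illustration, such model tests could be written as automated checks against explicit thresholds. In the sketch below, the model, the test data, and the threshold values are assumptions, not fixed requirements; it covers a functional criterion (accuracy and F1 score) and a non-functional criterion (inference latency).

```python
import time
from sklearn.metrics import accuracy_score, f1_score

ACCURACY_THRESHOLD = 0.90    # functional acceptance criterion (example value)
MAX_LATENCY_SECONDS = 0.05   # non-functional acceptance criterion (example value)

def test_functional_performance(model, X_test, y_test):
    # The trained model must meet the specified performance criteria on held-out data.
    predictions = model.predict(X_test)
    assert accuracy_score(y_test, predictions) >= ACCURACY_THRESHOLD
    assert f1_score(y_test, predictions, average="macro") >= 0.85

def test_inference_latency(model, X_test):
    # A single prediction must not exceed the agreed response time.
    start = time.perf_counter()
    model.predict(X_test[:1])
    assert time.perf_counter() - start <= MAX_LATENCY_SECONDS
```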
Challenges in Testing AI Software
Data Preparation
The foundation for AI models and thus also for the quality of their results is data. Correct data preparation is therefore essential—the following factors must be considered:
Knowledge of the application area, the data and its properties, as well as various data preparation techniques
The difficulty of obtaining high-quality data from various sources.
The correct automation of the data pipeline so that the production data pipeline is both scalable and performant
The verification of possible errors that may have been introduced into the data pipeline during data preparation
Inappropriate data biases, where certain data is over-weighted by the algorithm or the dataset is not fully representative
Training, Validation, and Test Datasets
To develop an ML model, three equivalent datasets are required, i.e., datasets randomly drawn from a single initial dataset:
Training dataset for training the model
Validation dataset for evaluating and tuning the model
Test dataset for tests with the fine-tuned model
Even if an unlimited amount of suitable data is available, the amount of data for training, evaluation, and testing in the ML process is determined by the following factors:
The algorithm used for training
The availability of technical resources
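A common way to obtain the three datasets is a random split of the initial dataset, for example into roughly 70 % training, 15 % validation, and 15 % test data. The sketch below uses scikit-learn and assumes that the features X and labels y of the initial dataset are already loaded; the split ratios are only an example.

```python
from sklearn.model_selection import train_test_split

# First split off the test set, then split the remainder into training and validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.15 / 0.85, random_state=42
)
# Result: roughly 70 % training, 15 % validation, and 15 % test data,
# all drawn randomly from the same initial dataset.
```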
Specification when Testing AI Applications
Specifications form the basis for every tester. However, for AI-based systems, these can be a particular challenge for the following reasons:
the exploratory nature of AI-based system development
the accuracy of an AI-based system is often not known
the probabilistic nature of many AI-based systems requires quality tolerances
specified behavioral requirements are based on mimicking human behavior
the high degree of flexibility required in some cases (e.g., in connection with natural language)
consideration of the new AI software quality characteristics
A test oracle serves as an information source to determine the expected result. However, for AI systems, even determining this expected result is difficult—this is called the test oracle problem.
Complex AI Systems: For systems that do not always produce the same result (such as in weather forecasts or complex analyses), it can be difficult to say in advance what the "right" result is. To test the system, however, you would need exactly this information.
Self-learning Systems: AI models that are constantly learning change their behavior. What was the right answer yesterday might change today. This makes it difficult to have a fixed benchmark for verification.
Subjective Assessment: For systems like AI voice assistants, "right" is often a matter of personal judgment. What is a good answer for one person may not be for another. There is no simple yes/no answer.
Automation Bias
Certain AI systems help us make better decisions. However, we sometimes rely too much on these systems and trust them blindly. This excessive trust is called "automation bias." It typically takes two forms:
Automation Acceptance: The person adopts the system's recommendations without considering alternative information or their own assessments. They rely exclusively on the automated suggestions.
Automation Error Rates: The person overlooks system errors because, out of excessive trust in the system, they do not monitor it adequately or scrutinize it critically.
Example: Automation bias could occur, for example, when the recommendations of an AI-powered credit scoring system are always followed automatically, without an individual human review. This can lead to unfair or discriminatory credit decisions, because potentially relevant individual factors that the system does not consider are ignored.
Concept Drift
The operating environment can change over time without the trained model adapting accordingly. This phenomenon is called concept drift and usually leads to the model's results becoming increasingly inaccurate and less useful.
Example: A typical concept drift is seen in e-commerce when a customer behavior model is not seasonally adjusted and increasingly gives irrelevant recommendations.
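One simple way to watch for such drift is to compare the distribution of an input feature at training time with its distribution in production, for example with a two-sample Kolmogorov-Smirnov test. Strictly speaking, this detects a shift in the input data, which is a common symptom of concept drift; the feature samples and the significance level in the sketch are assumptions.

```python
from scipy.stats import ks_2samp

def detect_drift(training_feature, production_feature, alpha=0.05):
    """Compare one feature's distribution at training time vs. in production."""
    statistic, p_value = ks_2samp(training_feature, production_feature)
    # A small p-value means the two distributions differ significantly,
    # which suggests the model should be re-evaluated or retrained.
    return p_value < alpha

# Usage (hypothetical one-dimensional samples, e.g. order values per customer):
# if detect_drift(order_values_at_training, order_values_this_week):
#     print("Drift detected - consider retraining on recent data")
```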
Self-Learning Systems
Testing self-learning systems poses particular challenges because these systems can change continuously and unpredictably.
Test cases that were originally designed for specific system states could quickly become irrelevant due to unexpected changes in system behavior. The test strategy must therefore be continuously adapted to effectively cover current and potentially unexpected behavioral changes.
Complex acceptance criteria must be defined that take into account not only current performance but also expected self-improvements of the system.
Because the system changes so quickly, tests must be automated. Manual tests would simply be too slow to keep up with the many changes.
Resource requirements increase because the system's internal improvements demand more and more memory and the test environment becomes more complex in order to cover all risks.
It is difficult to find all the necessary test cases and environments because you cannot predict how the software will be used in the real world and what problems might arise there.
Autonomous Systems
When we test autonomous systems, we must create situations in which the system has to make decisions on its own. This is the only way to find out when it can act alone and when it needs help from a human:
Testing Limits: Checking whether the system asks a human for help when it reaches its limits or the environment changes.
Timely Handover: Checking whether the system hands over control to a human in time when it is supposed to.
Unnecessary Help Requests: Testing whether the system avoids calling on a human unnecessarily when it could still solve the task on its own.
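Expressed as automated tests, these three checks could look like the following sketch. The assistant interface, the scenario objects, and the ten-second threshold are hypothetical and only illustrate how such handover tests might be structured, for example with pytest fixtures.

```python
def test_handover_at_system_limits(assistant, heavy_fog_scenario):
    # The system should ask a human for help when it reaches its limits.
    assistant.process(heavy_fog_scenario)
    assert assistant.handover_requested is True

def test_timely_handover(assistant, construction_site_scenario):
    # Control must be handed back early enough for the human to react.
    assistant.process(construction_site_scenario)
    assert assistant.seconds_until_handover >= 10

def test_no_unnecessary_handover(assistant, clear_highway_scenario):
    # The system should not call on a human while it can still handle the task itself.
    assistant.process(clear_highway_scenario)
    assert assistant.handover_requested is False
```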
Methods for Testing AI-Based Systems
Method 1: Pairwise Testing
The number of parameters for an AI-based system can be extremely high. Examples include self-driving cars or language translation systems. Testing all possible settings and combinations would take forever. That's why we only select the most important tests to use our time efficiently.
If you have many settings (parameters) with different values, it would take too long to test every combination. That's why you use combinatorial testing: only the most important combinations are tested, which greatly reduces the number of tests. Research has shown that most errors are caused by the interaction of only a few parameters. In practice, pairwise testing is the most widely used combinatorial method.
Let's look at a concrete example using language translation software. The following parameters play a role here:
Input Mode: Text, speech
Source Language: English, German, Spanish
Target Language: French, Russian, Japanese
Internet Connection: WLAN, Mobile Data, Offline Mode
If we had to test all combinations, that would be 2 × 3 × 3 × 3 = 54 test cases. By using pairwise testing, we drastically reduce the number of test cases by only ensuring that every combination of two parameter values is tested at least once. This could look like this, for example:
Text – English – French – WLAN
Speech – German – Japanese – Mobile Data
Text – Spanish – Russian – Offline Mode
Speech – English – Russian – WLAN
Text – German – Japanese – Mobile Data
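In practice, a dedicated pairwise tool would generate such a set of test cases. The sketch below illustrates the idea with a simple greedy selection in plain Python: it is not optimized, but it covers every pair of parameter values at least once with far fewer test cases than the 54 full combinations.

```python
from itertools import combinations, product

parameters = {
    "input_mode": ["Text", "Speech"],
    "source_language": ["English", "German", "Spanish"],
    "target_language": ["French", "Russian", "Japanese"],
    "connection": ["WLAN", "Mobile Data", "Offline Mode"],
}
names = list(parameters)

def pairs_of(test_case):
    """All parameter-value pairs covered by one test case."""
    return set(combinations(zip(names, test_case), 2))

# Collect every pair of parameter values that must be covered at least once.
uncovered = set()
for case in product(*parameters.values()):
    uncovered |= pairs_of(case)

# Greedy selection: repeatedly pick the combination that covers
# the most still-uncovered pairs.
selected = []
while uncovered:
    best = max(product(*parameters.values()),
               key=lambda case: len(pairs_of(case) & uncovered))
    selected.append(best)
    uncovered -= pairs_of(best)

for case in selected:
    print(case)   # typically around 9 to 11 test cases instead of 54
```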
Method 2: Back-to-Back Testing
Back-to-back testing is one possible solution to the test oracle problem when testing AI-based systems.
In this procedure, an alternative version of the system is used as a pseudo-oracle and its outputs are compared with the test results generated by the SUT (System under Test).
In the context of ML, it is possible to use different frameworks, algorithms, and model settings to create an ML pseudo-oracle.
Important: For pseudo-oracles to be effective in finding errors, the pseudo-oracle and the SUT should not share any common software components.
Advantages of the Pseudo-Oracle:
Easier to develop: The pseudo-oracle does not have to be perfect. For example, it doesn't have to be as fast or efficient as our actual system. It can also be developed on a simpler basis or with other technologies. This often makes its creation cheaper and faster.
No identical errors: It is important that the pseudo-oracle and the system we are testing are developed independently. If both systems use the same code or the same technology, they could make the same mistake and deceive us by providing the same wrong result.
This method is particularly useful in the field of machine learning (ML). Here, a different algorithm or a different model can simply be used for the pseudo-oracle. Back-to-back testing is an effective method for finding errors in complex AI systems, even if the "correct" answer is not known from the outset.
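A back-to-back comparison can then be as simple as running the same test inputs through both systems and flagging the inputs on which they disagree. In the sketch below, the two models, the NumPy test inputs, and the 5 % disagreement budget are assumptions.

```python
import numpy as np

def back_to_back_test(sut_model, pseudo_oracle, test_inputs, max_disagreement=0.05):
    """Compare the SUT's outputs with those of an independently built pseudo-oracle."""
    sut_outputs = sut_model.predict(test_inputs)
    oracle_outputs = pseudo_oracle.predict(test_inputs)

    mismatch = sut_outputs != oracle_outputs
    disagreement = np.mean(mismatch)
    # Inputs on which the two systems disagree are candidates for manual analysis.
    suspicious_inputs = test_inputs[mismatch]

    assert disagreement <= max_disagreement, (
        f"{disagreement:.1%} of the outputs differ from the pseudo-oracle"
    )
    return suspicious_inputs
```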
Method 3: A/B Testing
A/B testing is a method in which the response of two variants (A and B) to the same inputs is compared to determine which of the two variants is better. This is also an approach to solving the test oracle problem, where the existing system is used as a partial oracle. Learn more in our Guide on A/B Testing.
Imagine we have an old version (A) and a new, updated version (B) of our AI system. A/B testing compares these two versions directly with each other. We check whether the new version (B) is just as good as or even better than the old one (A). In doing so, we measure how well the systems perform their tasks, for example, using key figures such as the accuracy of a prediction.
An example:
An intelligent traffic system in the city is supposed to regulate traffic better. We create an updated version of it. To test whether the new version is actually better, we could use it for a week and measure the average commuting time. In the next week, we use the old version and also measure the commuting time. If the commuting times are shorter with the new version, we know that our update was successful.
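Evaluating such an experiment usually comes down to comparing the measured metric of both variants and checking whether the difference is statistically meaningful. The commuting times below are invented, purely illustrative numbers, and the t-test is just one possible way of making that comparison.

```python
from scipy.stats import ttest_ind

# Hypothetical measurements: average commuting time in minutes, one week per variant.
commute_times_a = [31.2, 29.8, 33.1, 30.5, 32.0, 28.9, 31.7]  # old version A
commute_times_b = [28.4, 27.9, 29.5, 28.1, 30.2, 27.3, 28.8]  # new version B

statistic, p_value = ttest_ind(commute_times_a, commute_times_b)
improvement = (sum(commute_times_a) / len(commute_times_a)
               - sum(commute_times_b) / len(commute_times_b))

if p_value < 0.05 and improvement > 0:
    print(f"Version B reduces commuting time by about {improvement:.1f} minutes")
else:
    print("No significant improvement - keep version A")
```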
The difference to back-to-back testing:
A/B testing usually compares two slightly different versions of the same system to see which one works better.
Back-to-back testing (as we discussed it before) uses two independent systems to find errors. Here, the goal is not to determine which version is better, but whether a system works without errors at all.
Method 4: Metamorphic Testing
Metamorphic testing is a testing method that is used specifically when traditional test oracles are difficult to define.
Step 1: First, an initial test case is carried out. If this is passed successfully, it serves as the basis for further tests.
Step 2: The further tests are defined by so-called metamorphic relations (MRs). These are rules that determine how a change in the input data should affect the expected test result.
Simply put, this means: If we slightly change a certain input and know how the result should change, we can check whether the AI system behaves as expected and logically.
A typical example would be an AI for image recognition: An initial test could check whether the system correctly recognizes an image. The image is then slightly changed (e.g., rotated). The MR states that despite the rotation, the result should still be recognized as "correct" (e.g., as the same person or the same object). This relation between test cases is then checked.
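A sketch of this metamorphic relation could look as follows. Here, classify is a hypothetical image classification function, and the image path and rotation angle are illustrative.

```python
from PIL import Image

def test_rotation_metamorphic_relation(classify):
    original = Image.open("person.jpg")       # assumed test image
    label_original = classify(original)

    # Metamorphic relation: a slight rotation must not change the predicted label.
    rotated = original.rotate(15, expand=True)
    label_rotated = classify(rotated)

    assert label_rotated == label_original, (
        "MR violated: rotating the image changed the classification"
    )
```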
Business Benefits of Good AI Testing Practices
The careful and professional testing of AI systems is not only a technical but above all a business necessity. Companies benefit from comprehensive AI tests through:
Reduced Risks: Early detection and avoidance of errors that could cause financial damage and loss of reputation.
Greater Legal Certainty: Ensuring that the AI system complies with legal requirements (e.g., data protection) and ethical standards.
Better Customer Satisfaction: Reliable, understandable, and fair AI systems strengthen user trust and increase customer loyalty.
Long-Term Business Success: Robust AI testing strategies help to sustainably support long-term business goals and implement innovations safely.
In short: Good AI testing practices create competitive advantages, secure investments, and sustainably promote the success of your company.
Conclusion: Quality Assurance as a Key to Success
Testing AI software requires new approaches and methods. Companies that face this challenge can not only minimize technical risks but also secure the long-term trust of their target group.
Are you ready to take your AI software to the next level? Use our expertise to test your AI systems safely and successfully. Contact us and benefit from tailored testing strategies that make your company fit for the future.