Testing in black and white - advanced techniques

Mark Harman continues his look at software testing by describing three more techniques that focus on semantics, rather than syntax, to weed out those ’hard to find’ bugs.

Coverage analysis, which we looked at last month, is the most common way of ensuring that testing really does test the software. But it is not the only approach. Notions of coverage are all based upon the syntactic structure of the program, rather than upon the semantic properties of the program. For example, we could try to ensure that all the statements of the program are executed, or we could try to see that every branch of the program is executed. However, errors are by their very nature semantic beasts, and so a purely syntactic approach to testing is not ideal; we are more likely to find bugs if we put some semantic requirement on our test cases.

The first technique we will look at, dataflow testing, seeks to address this problem by introducing some semantic aspects of the program into the notion of coverage. To start with, we need to find a path within the program that begins with the definition of a variable. Suppose that the variable’s name is x. The path we choose must end up with a use of x, and along the way must not go through any statement that defines a new value for x.

By ‘a definition of x’, we mean that x appears on the left-hand side of an assignment statement. A ‘use of x’ means that x is used in some computation, but not altered by it. For example, a use might consist of printing out the value of x, testing its value, or using its value on the right-hand side of an assignment statement.

Definition-use paths

The path we choose is called a definition-use path (or du path for short). The idea behind du paths is that the value defined at the start will pass unaffected along the path until it reaches the end where it is used. A du path therefore represents the channel along which data flows through the program. Dataflow testing is concerned with ensuring that our test cases exercise all of these du paths.

Any statement that defines a value for a variable can be the start of a du path. There may be several du paths from a single statement, because the value of the variable might be used at several points in the program. For example, consider the example program in Listing 1. In order to think about du paths, we need to draw a graph of the program showing all the paths down which the execution can go. This graph is depicted in Figure 1. It is called a Control Flow Graph, because it shows how the flow of control moves from one point in the program to the next. Each point is called a node. To find a du path we look for a node where some variable is defined. Having found this, we then trace forward through the graph to find a point where the variable is used, being careful to ensure that the variable is not redefined along the way.

Suppose we take the assignment to y at node 1. This is an easy one to deal with because there is only one point at which y is used (node 2) and there cannot be any other definition of y along the way from node 1 to node 2, because there are no other nodes along the way. Therefore, there is a du path from node 1 to node 2 and, because there are no other uses of the variable defined at node 1, there are no other du paths that start at node 1. Our test cases should ensure that this du path is followed. This will be easy because any execution of the program will go down this path.

Suppose we start with the assignment at node 2. This node defines a new value for the variable x and so it can be the start of a du path concerned with the flow of the value of x. Node 8 uses the value of x and so we might look for a path from node 2 to node 8. There are three possibilities, each depending upon the outcome of the evaluation of the predicates in the program. These paths are <2,3,4,8>, <2,3,5,6,8> and <2,3,5,7,8>. It turns out that all of these paths contain a definition of the variable x, and so none of them is a du path. From this, we know that the value of x defined at node 2 cannot get through to node 8.

Although there are no du paths from node 2 to node 8, there is a du path starting at node 2. That is, the path <2,3,5,7>. Down this path, the value defined for x at node 2 passes unaltered to the final node at which the value is used. It does not matter that the value of x is also defined at the end nodes of this du path, because it is defined in terms of the value given to it at node 2. In addition, you can see that there is no du path from node 2 to node 4, or from node 2 to node 6, because the value of x is used neither at node 4 nor at node 6. In fact, we can go further than that: no du path could possibly end at either node 4 or node 6, because no variable is used at either of these nodes.

Three du paths end up at node 8. These are <4,8>, <6,8> and <7,8>. We have now considered all the du paths in the program and we can turn to the problem of finding test cases that cover all of them. For this program, there is only one determining value: the initial value of the variable z. It decides which branch is taken at each of the predicate nodes in the program. In this case, we can see that when z is compared to x, the value of x will be 4. We require three values for z: one less than 4, one equal to 4, and one greater than 4. This will ensure that we have exercised all of the du paths in the program.

In this case, covering all the statements of the program amounts to the same thing as covering all of the du paths in the program. However, there are programs where we can cover all the statements without having covered all the du paths. For example, consider the program in Listing 2. We can cover all the statements in this program using only one test case, namely where the initial value of p is greater than zero. However, this would not cover all the du paths, because there is a du path from node 1 to node 4, which is exercised only when the predicate evaluates to false.

There are programs where we can cover all the du paths without covering all of the statements. Such programs involve statements that either define a value which is never used, use a value which is never defined, or neither use nor define any value. The first two cases are rather anomalous and should be reported as errors, but the third is quite common. For example, consider the program fragment in Listing 3. Assume that the value of p is defined somewhere earlier in the program. Dataflow testing will not require us to execute nodes 2, 3, and 4 at all, as these cannot be the start or end of a du chain. On the other hand, to cover all the statements of the program we must ensure that both statements are executed.

The finding of all du paths can be automated (following a similar dependence-tracking type of approach that is used in program slicing, see A piece of cake, EXE, October 1996). This automated computation will produce a conservative approximation to the real answer. That is, the algorithm will correctly identify all du paths, but it may add in a few non-du paths too. This has the effect of making us do slightly more testing than necessary, but it’s a ‘safe approximation’ because we do guarantee to cover all du paths. For instance, dataflow coverage, as well as the more commonly found statement and branch coverage, is handled automatically by some testing tools such as LDRA Testbed (Liverpool Data Research Associates).

Mutation testing

In dataflow testing what we are trying to do is monitor the flow of traffic (data) through the program, so that we ensure that some traffic has been down every little country lane as well as all the major motorways. We hope that the passage of traffic will show up the need for any repair work (fixing the bugs). There is a far more radical solution, which is tantamount to deliberately wrecking the road with pot holes and watching to see how many accidents occur – this bizarre approach to testing is called ‘mutation testing’.

Mutation testing turns the whole problem on its head. Instead of testing the program to find faults, we test the test data by introducing faults to see how good our test data is at finding the faults. The idea is to create many copies of the program, each of which has been altered in some minor way. Each copy resembles the original, but deviates from it. Having done this, we see how many of these different variations of the program behave differently from the original, for our set of test cases.

Each copy of the program is called a mutant, because it is created by ‘mutating’ the program in some way. When a mutant behaves differently from the original program for some test case t, we say that t kills the mutant. For example, suppose we mutate the program in Listing 1, so that the assignment x=y*2 at node 2 is replaced by the assignment x=y*4. This creates a mutant version of the original program. To kill this mutant we need a test case that makes the mutant produce a different value to the original program. Suppose we try the test case z=10. In this case, the mutant will follow the same path and will end up with the same value in x at node 8 as the original program would have done. We say that the test case z=10 fails to kill the mutant. However, had we chosen the test case z=5, we would have killed the mutant. In this way, mutants allow us to assess how good our test cases are.

In mutation testing, we are not directly concerned with testing the program. Rather we are trying to assess the quality of the test data we intend to use to test the program. We want to see how many of the mutants are killed by a set of test data. The more mutants we can kill the better.

Such testing is based on two assumptions about the nature of errors in programs. The first of these is called the ‘competent programmer hypothesis’. According to this hypothesis, programmers write programs that are almost perfect. The competent programmer hypothesis says that program faults are syntactically small and can be corrected with a few keystrokes. The upshot of this hypothesis is that the mutants simulate the likely effect of real faults. Therefore, if the test set is good at catching the artificial mutants, it will also be good at catching the real mutants – the faults in our program.

The second assumption, called the ‘coupling hypothesis’, states that, should there be any big and dramatic effects that arise from bugs in the software then these will be closely coupled to small and simple bugs. The net effect of these assumptions is that we take on faith the basic principle of mutation testing: making small mutations to the program code emulates the real bugs in the software.

Strong and weak mutation testing

In the previous example, we considered that a mutant was killed if a test case made it produce a different output value at the end of the program. This is called ‘strong mutation testing’ because it places a ‘strong’ requirement on the test case. To see why the requirement is strong, consider the opposite: weak mutation testing. In weak mutation testing, we require only that the value produced at the point we introduce the mutation is different from the corresponding value in the original program. This is less demanding on the test cases because it is always easier to force a program to have a different value at the mutation point than it is to force it to have a different value at the end of the program.

For example, in the program in Listing 1, suppose we mutate node 2 by replacing the assignment x=y*2 with x=y*4. If we test the program with the test case z=10, then the value of x just after node 2 has been executed is 8, whereas its value was 4 in the original program. Had we looked only at the value of x at node 2 then we could have tested the program with any test case except z=0, and we would have killed the mutant. However, if we had required the value of x to be different at the end of the program, then only the test cases where z has a value between 4 and 9 would have caught the error and therefore killed the mutant.

Strong mutation testing is tougher (or stronger) on the test set because we can be sure that when a test case strongly kills a mutant, it must definitely weakly kill it. Mutation testing is all about assessing the quality of test cases; it would be natural to prefer strong mutation testing to weak mutation testing. Unfortunately, strong mutation testing proves much more time-consuming when it comes to automating the testing process. As with other testing strategies, mutation testing is a non-starter without automation.

A weak mutation testing tool can store the state just before the mutation point, and then test each mutated version of the statement in isolation, using the previously saved state as the starting state. This is not possible with strong mutation testing, because we have to see if the mutation creates a mutant value that survives all the way through to the end of the program. This entails running the program to the end for each mutated version of each mutation point. There are usually a great many mutations to be considered and so the strong version is simply too expensive. Fortunately, analytic and experimental investigation has shown that strong and weak mutation often turn out to be very close in practice. Many testing experts therefore believe that weak mutation testing is ‘strong enough’.

The problem of equivalent mutants

Suppose we mutate the program in Listing 1, by replacing the assignment at node 2 with x=y+2. This mutation produces a program that behaves in exactly the same way as the original. We say that the mutant is ‘equivalent’ to the original. How can we expect a test case to kill an equivalent mutant?

Equivalent mutants are more of a problem for mutation testing approaches than one might think. There is no automatic way to decide whether a mutant is equivalent to the original program, so in automated mutant generation tools it is likely that some of the generated mutants will be equivalent (and therefore unkillable). Clearly, it is unfair to penalise a test case for failing to kill an equivalent mutant. Suppose a test case fails to kill a mutant. How can we tell whether this is because it is a poor test case or because the mutant is unkillable?

Mutation testing has been implemented in a few commercial CASE tools (three I’m aware of are: Mothra, SafetyNet and Insure++. The latter uses mutation, but not in the way described in this article. It deliberately creates equivalent mutants. If the programmer believes that any of these produce a wrong answer then there must either be a bug in the original program, or the programmer’s understanding of the specification – or the specification itself – must be at fault.) Studies have shown that this technique would have been good at detecting some of the famous bugs, such as the Pentium bug and the Therac-25 bugs. However, a solution to the problem of equivalent mutants needs to be found if the technique is to find wider acceptance in the software engineering community.

Boundary value analysis

The techniques we have looked at so far are known as ‘white box’ techniques, because they require knowledge of the internal structure of the program. By contrast, boundary value analysis is a black box technique. Boundary value analysis is based on the observation that faults tend to cluster around ‘boundaries’ in programs. The boundaries partition the input domain.

As with other forms of testing, we make some assumptions in boundary value analysis. We assume that the program partitions the input into a set of domains in which the program’s behaviour is similar. Because of this, we assume that if an element from some input domain produces an error then a similar error will be produced for all elements of that input domain. We also assume that if a test case fails to produce an error then all other elements in the domain will fail to produce an error.

According to these assumptions, we attempt to cover the structure of the program’s input, not the program’s structure. This requires no knowledge of the program’s syntax – only knowledge of its specification. This allows us to start thinking about test cases as soon as we have agreed upon a specification. By contrast, the white box techniques require the system to be implemented before we can start to define test cases.

Listing 4 contains a very simple example. Suppose the specification says that the program should give the square root of non-negative inputs. If the input is negative, then the program should print a message to say that only non-negative input is handled. The program in Listing 4 is not quite right; it contains a boundary value error.

There is only one input to the program, so the boundary is simply a point on a line (the point x=0). Had there been two inputs then the boundary would have been a line in a plane. With three inputs, it would have been a plane in three dimensional space, and so on. Whatever the boundary, the testing process seeks to test values on and around the boundary. If there is a boundary value error, then the boundary will shift, so we try to find test cases that will detect the shift.

In the case of the square root program, we would try one case where x=-1. This would test the negative input domain and one case where x=0, this would test the non-negative input domain. The second test case will reveal the fault: the <= operator in the conditional should have been a < operator.

As a testing technique, boundary value analysis is still in an embryonic state. While dataflow testing and mutation testing have been implemented in commercial CASE tools – and some form of coverage analysis is found in most tools – boundary value analysis is currently the subject of ongoing research and has yet to find its way into commercial CASE tools.

No guarantees

When we test software, we want to maximise the chance that our test cases will find bugs. No system is ever going to be able to guarantee that it will find every bug, but we can use some theory to improve our chances.

The basis of this theory revolves around the idea that we should exercise some property of the program. The property is either a structural (white box) aspect of the syntax of the program (such as its statements or its definition-use paths), or it is a behavioural (black box) property (such as the output it produces for each input sub-domain).

A more radical approach is provided by mutation testing, where we corrupt the program to see how good our test cases are at detecting the corruption.

Finding effective ways of testing software remains a big problem. The approaches described in this article help, but no technique is perfect. Testing is likely to continue to become increasingly significant in the software development process and we can expect many developments in this field.

Mark Harman is a lecturer in computer science at Goldsmiths’ College, where he works on program manipulation and testing using slicing and transformation. Goldsmiths’ college can provide bursaries for suitably qualified students wishing to study for a masters or doctoral degree. He can be contacted by email at mark.harman@brunel.ac.uk.

Mark Harman would like to thank Robert Hierons, who helped greatly in the production of this article.


The relationship of mutants and coverage analysis

Suppose we want to assess the level of statement coverage that our test set achieves. We can perform this assessment using mutation testing. To do this, we create one (non-equivalent) mutant for each statement. The percentage of mutants killed will be the same as the percentage of statements covered. We can use a similar trick to get mutation testing to emulate branch coverage, in which all the branches in the program under test must be exercised. This is not a particularly efficient way of assessing statement and branch coverage, but it does highlight the connection between mutation testing and coverage analysis.


Listing 1 - The original program to be tested.

1 y=2;

2 x=y*2;

3 if (x<z)

4 x=1;

5 else if (x==z)

6 x=0;

7 else x=x+1;

8 printf("%d",x);


Listing 2 - Covering all statements does not cover all du paths.

1 x=1;

2 if (p>0)

3 y=x*2;

4 y=x+1;


Listing 3 - Covering all du paths does not cover all statements.

1 if (p>0)

2 printf("Hello ");

3 else printf("Goodbye cruel ");

4 printf("world.");


Listing 4 - Boundary value analysis.

1 scanf("%d",&x) ;

2 if (x<=0)

3 printf("Input should be non-negative") ;

4 else printf("The square root is %d",sqrt(x)) ;


(P)1998, Centaur Communications Ltd. EXE Magazine is a publication of Centaur Communications Ltd. No part of this work may be published, in whole or in part, by any means including electronic, without the express permission of Centaur Communications and the copyright holder where this is a different party.

EXE Magazine, St Giles House, 50 Poland Street, London W1V 4AX, email editorial@dotexe.demon.co.uk