Testing in black and white

Software testing is still regarded by many as an art. Mark Harman explains coverage analysis, which allows us to put the whole process onto a more sound engineering footing.

Testing need not simply be a process of inserting output statements into our code to see if the values we expect appear. We can put the whole process onto a sounder, engineering-led footing.

Many software developers would admit that when they test their own software, they tend to choose test cases that do not really push the software particularly hard. For example, last time you tested a piece of software, how many statements of the code do you think went unexecuted by any of the test cases you tried? If your answer is greater than zero then you have, I'm afraid, failed what software testing specialists regard as the weakest testing adequacy criterion.

This article introduces coverage techniques for finding and assessing software test cases. I will start by briefly explaining why software testing is so much harder for software engineers than the testing of other products is for engineers in other disciplines. Then I'll go on to explain how coverage analysis allows us to assess just how stringent our testing is and how software tools can and cannot help with the testing process.

What is software testing?

Edsger Dijkstra noted that ‘software testing can never establish the correctness of software, only its incorrectness'. He was observing that test cases can reveal faults in software, but that we cannot infer from the absence of errors on all the test cases tried that the software itself is fault free. We could attempt to test every possible input to the program, to see that the output (and/or its behavioural properties) conforms to the specification. This is called 'exhaustive testing'. Sadly, in most real systems exhaustive testing would require an unacceptably large set of test cases.

Dijkstra's aphorism was used to argue for an alternative approach in which software is constructed directly from its specification using a sequence of formal refinement steps. He wanted this construction process to be accompanied by mathematical proofs to establish the software's correctness with respect to its specification. Nowadays this highly rigorous form of verification is reserved for the most safety-critical aspects of software development, as it is typically time-consuming and expensive. This leaves us with software testing as our main tool of verification for less safety-critical systems. (Of course, testing is also used as a back-up technology for safety-critical systems.)

Testing seeks to answer, among other things, the following questions: What is a good test case? When have we tested a system enough? How many bugs have we failed to find with our test set? Why is testing software so hard?

Testing software is not like testing components in other engineering applications. For example, consider the problem of testing a software component when compared with the problem of testing a structural component in a civil engineering application. The civil engineering product will undergo certain stresses and strains and may be subjected to a variety of weather conditions and temperatures. An important aspect of all these loads is their continuous nature, which means that testing need take place only at the extremes.

If we know that a component behaves well at temperatures T1 and T2 (where T1 < T2), then we can reasonably conclude that the component will behave well at all temperatures between T1 and T2. With software, we do not have this considerable luxury. If we know that a loop statement with one determining integer variable, v, terminates when v=T1 and also when v=T2 (where T1 < T2), we cannot reasonably conclude that the loop terminates for all inputs between T1 and T2. Software systems are essentially discrete systems.

This discreteness makes testing a more demanding task than one might think. Ideally, we have to test a software product at all points, not just at the extremes. Unfortunately, this is usually impossible, due to the number of points we would have to consider. Therefore, we have to satisfy ourselves with some measure of 'just how much’ of the system we have managed to test.

Surprisingly, for most systems only about 60% of a program's statements are ever executed. Therefore, if we could manage to ensure that every statement was executed by at least one test case, then we would probably be testing more thoroughly than average. This is the sort of statement of test thoroughness that coverage analysis makes precise. To decide whether we have tested a system enough, we must decide what level of coverage we think is suitable. Usually, coverage is expressed as a percentage: the percentage of some syntactic structure that has been executed by the test set in hand.

Coverage

To get a feeling for how coverage analysis works, consider the C program fragment in Listing 1.

In looking at the structure of this fragment we can see that there are four basic statements: S1, S2, S3, and S4. We might try to construct a set of test cases that ensures that each of these statements is executed at least once (this is statement coverage).

Alternatively (or in addition), we might decide that we should test both the predicates (p and q) with two cases each: one that tests their behaviour when they evaluate to true and one that tests their behaviour when they evaluate to false (this is branch coverage).

Of course, we might not have time to perform all these tests, so we might decide to cover only a certain percentage of statements. Clearly this is a very weak testing requirement; if we settle for less than 100% coverage then we are guaranteeing not to have tested some statements.
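To make this concrete, here is a minimal sketch in which the abstract statements S1 to S4 of Listing 1 are stood in for by printf calls (the function name and the probe messages are my own invention, not part of the original fragment). Three tests are enough to execute every statement, and the same three happen to exercise both outcomes of p and of q.

/* A hypothetical instantiation of Listing 1: S1..S4 replaced by printf calls. */
#include <stdio.h>

void fragment(int p, int q)
{
    if (p)
        printf("S1\n");      /* S1 */
    else if (q)
        printf("S2\n");      /* S2 */
    else
        printf("S3\n");      /* S3 */
    printf("S4\n");          /* S4 */
}

int main(void)
{
    fragment(1, 0);          /* p true            -> S1, S4 */
    fragment(0, 1);          /* p false, q true   -> S2, S4 */
    fragment(0, 0);          /* p false, q false  -> S3, S4 */
    return 0;
}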

The control flow graph

The three most common forms of coverage are based on the control flow graph (or CFG). Sometimes this is simply referred to as a ‘flow graph'. A CFG is a graphical representation of the way in which the sequence of control flows from one part of the program to the next. The nodes of the CFG represent simple primitive statements (such as input, output, and assignment) and predicates, and its edges represent the flow of control. A CFG is simply a formalisation of the very old and familiar flow chart.

To understand what a CFG is, we need to know a little bit about graphs.

A graph is simply a collection of nodes (you can think of these as points which can be 'visited') and arrows between these nodes (called edges), which link nodes together, indicating which nodes can be reached from which other nodes. The map of a one-way traffic system is a graph. Strictly speaking, we call graphs that contain arrows ‘directed graphs', because there is a direction indicated by the arrowhead. A one-way system is a directed graph. The London Underground tube map is a graph. The stations are the nodes and the lines between stations are the edges. However, the underground map is not directed because we can travel in either direction.

A path in a graph is a list containing two or more nodes. In a path, each node in the list can be reached in one step down an edge from the node that precedes it. We need at least two nodes in the path as these form the start and the end of the path. We say that a path is ‘a path from A to B’ if the first node in the list is A and the last node in the list is B. In the graph of the London Underground, a path is simply a way of getting from station A to station B.

A control flow graph is simply a graph that has a few constraints placed on it to ensure that it corresponds to a sensible model of a program. These constraints ensure that it is possible to get from any node in the graph to the exit node of the graph and that every node can be reached from the entry node. Listing 2 contains a simple program, for which the corresponding complete CFG is depicted in Figure 1. The exit node is labelled ‘STOP’ and the entry node is labelled ‘START'.
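For readers who like to see such things in code, the CFG of Listing 2 can be written down as a simple adjacency list. The sketch below is purely illustrative; numbering START as 0 and STOP as 8 is my own convention rather than anything the definition requires.

/* The CFG of Listing 2 as an adjacency list (node numbering as in Listing 2). */
#include <stdio.h>

#define MAX_SUCC 2

/* successors[n] lists the nodes reachable from node n in one step; -1 = none */
static const int successors[9][MAX_SUCC] = {
    /* 0: START  */ { 1, -1 },
    /* 1: p>q    */ { 2,  3 },   /* true -> 2, false -> 3 */
    /* 2: x=1    */ { 4, -1 },
    /* 3: y=2    */ { 4, -1 },
    /* 4: z=x+3  */ { 5, -1 },
    /* 5: z==p   */ { 6,  7 },   /* true -> 6, false -> 7 */
    /* 6: z=z+1  */ { 7, -1 },
    /* 7: y=x+1  */ { 8, -1 },
    /* 8: STOP   */ { -1, -1 },
};

int main(void)
{
    /* print every edge of the graph */
    for (int n = 0; n <= 8; n++)
        for (int s = 0; s < MAX_SUCC; s++)
            if (successors[n][s] != -1)
                printf("%d -> %d\n", n, successors[n][s]);
    return 0;
}

A path such as <1,2,4,5,7> is then just a walk through this table from START to STOP.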

Statement coverage

In statement coverage, we seek to design test cases that cause statements to be executed. The level of coverage achieved by a set of test cases is simply the fraction of statements that get executed by one or more of the test cases. This fraction is typically expressed as a percentage, so 50% statement coverage means that half the statements get executed by one or more test cases (and half do not get executed by any test case).

Throughout this article, I shall use the simple example program in Listing 2. You will need to refer to the control flow graph of this program fragment, depicted in Figure 1. In Listing 2, I have labelled each primitive statement and predicate with a number to allow you to correlate them to the nodes of the control flow graph.

In this context ‘statement’ means ‘node in the control flow graph’. In a (rather pathological) program that contained only nested body-less if statements, we would be required to cover all the predicates with at least one of our test cases. In 100% statement coverage, we are certain that every node of the CFG has been executed at least once. Of course, this does not mean that we are guaranteed to uncover any faults in the statements, but if we hadn't covered 100%, we could be certain that sometimes we would fail to uncover faults. It does not matter whether a statement is executed once or a thousand times; all that matters is whether the statement gets executed or not. This means that, in trying to achieve coverage, we suppose that all statements are equally likely to contain bugs. Therefore, the more statements we manage to ‘hit’ with our set of test cases, the better.

To achieve 100% statement coverage of the program in Listing 2, we would have to define a set of test cases that covered lines 1, 2, 3, 4, 5, 6, and 7. The tests can be defined in terms of initial values for the ‘unbound variables' (in this case p, q and, sometimes, x). Although z and y are mentioned in the code, their initial values are never required in the definition of a test case, because their initial value does not affect the course the execution takes. (The course that execution takes is simply the path it follows in the CFG of the program).

In order to execute statements 2 and 3, two test cases are required as the statements are alternatives. However, a careful choice of test cases ensures that at least one of these covers statement 6, and so an additional test case for statement 6 is not required. Obviously, since it does not matter how many times we execute a statement, we shall want to cover all the statements with as few test cases as possible, thereby reducing the effort required to reach 100% coverage. A suitable (and minimal) test set for the program in Listing 2 need only contain two tests, for example:

Test One: p has the value 4 and

q has the value 3

Test Two: p has the value 2,

q has the value 6 and

x has the value -1
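In practice, this is exactly the sort of thing a coverage tool measures by inserting probes into the code. The following is a minimal sketch of such instrumentation for Listing 2, with the unbound variables as parameters (the PROBE macro and the covered array are my own invented bookkeeping, not taken from any particular tool); running the two tests above marks every node, confirming 100% statement coverage.

/* Listing 2 instrumented with statement-coverage probes (illustrative only). */
#include <stdio.h>

#define NODES 7
static int covered[NODES + 1];              /* covered[i] is set when node i executes */
#define PROBE(i) (covered[i] = 1)

static void listing2(int p, int q, int x)
{
    int y, z;
    PROBE(1);
    if (p > q) { PROBE(2); x = 1; } else { PROBE(3); y = 2; }
    PROBE(4); z = x + 3;
    PROBE(5);
    if (z == p) { PROBE(6); z = z + 1; }
    PROBE(7); y = x + 1;
    (void)y; (void)z;                       /* keep the compiler quiet */
}

int main(void)
{
    listing2(4, 3, 0);                      /* Test One (x is irrelevant here) */
    listing2(2, 6, -1);                     /* Test Two */

    int executed = 0;
    for (int i = 1; i <= NODES; i++)
        executed += covered[i];
    printf("statement coverage: %d%%\n", 100 * executed / NODES);
    return 0;
}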

Notice that, in order to achieve 100% statement coverage of the program, it was necessary to cover both the case where the first predicate evaluated to true and the case where it evaluated to false. However, it was not necessary to cover the case where the second predicate was false. We have covered every possible statement, but not every possible branch. In terms of the CFG, we have hit every node at least once, but there remain edges down which execution has not passed. This is what branch coverage is all about.

Branch coverage

A branching node is one that has more than one outgoing edge in a graph. In a CFG as defined earlier, this means a predicate node.

In order to achieve 100% branch coverage it is necessary to execute each predicate at least twice. In one instance the predicate must evaluate to false and in another it must evaluate to true. If we have achieved 100% branch coverage, we know that every edge of the CFG has been covered. Notice that we can manage to cover every node of the graph and yet fail to cover every edge, because there may be several possible ways of getting to a node and we may only cover one of the possibilities. For example, the test set defined in the previous section does not achieve 100% branch coverage for the program in Listing 2, although it did achieve 100% statement coverage. We shall have to re-think our test set. However, this does not necessarily entail an increase in the number of test cases. It simply means (in this case) that more care is required in selecting test cases. A test set that achieves 100% branch (and statement) coverage is:

Test One: p has the value 4 and

q has the value 3

Test Two: p has the value 2,

q has the value 6 and

x has the value 17
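The probing idea extends naturally from statements to branches: instead of recording which nodes execute, we record which way each predicate goes. Here is an illustrative sketch (again, the names are my own invention); with the revised test set above, both predicates of Listing 2 are seen to evaluate both true and false.

/* Branch-coverage probes for Listing 2: each predicate must be seen true and false. */
#include <stdio.h>

static int seen_true[8], seen_false[8];     /* indexed by predicate node (1 and 5) */

static int branch(int node, int cond)
{
    if (cond) seen_true[node] = 1; else seen_false[node] = 1;
    return cond;
}

static void listing2(int p, int q, int x)
{
    int y, z;
    if (branch(1, p > q)) x = 1; else y = 2;
    z = x + 3;
    if (branch(5, z == p)) z = z + 1;
    y = x + 1;
    (void)y; (void)z;
}

int main(void)
{
    listing2(4, 3, 0);                      /* Test One */
    listing2(2, 6, 17);                     /* Test Two */

    int full = seen_true[1] && seen_false[1] && seen_true[5] && seen_false[5];
    printf("100%% branch coverage achieved: %s\n", full ? "yes" : "no");
    return 0;
}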

Path coverage

Remember that a path (from node A to node B) through a graph is a list of nodes: the list of nodes that gets executed when execution starts at A and ends up at B. Of course, there may be many ways to start at a node A and end up at a node B; each of these is a different path from A to B. To achieve 100% path coverage, every path from the entry node to the exit node of the CFG of the program must be traversed during at least one execution of the program.

In general, the presence of loops means that there are infinitely many paths; we can keep on creating ever-longer paths by simply going round a loop one extra time. Obviously, we cannot cover infinitely many paths with our test set. To overcome this, a limit may be placed upon the number of executions of the loop, or only loop-free paths may be tested. In reality, however, path coverage serves mainly to give us an idea of the limit of coverage-based testing.

The program in Listing 2 contains only four paths, so it is possible to achieve 100% path coverage. The four paths are <1,2,4,5,7>, <1,2,4,5,6,7>, <1,3,4,5,7> and <1,3,4,5,6,7>. In order to cover them all, four test cases will be required. For example, the following four test cases will do the trick.

Test One: p has the value 4 and

q has the value 3

Test Two: p has the value 2,

q has the value 6 and

x has the value -1

Test Three: p has the value 5,

q has the value 3 and

x has the value 20

Test Four: p has the value 2,

q has the value 6 and

x has the value 19
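Once more, a little invented instrumentation makes this concrete. The sketch below records the path each execution of Listing 2 follows as a string of node labels; running the four tests above produces each of the four paths exactly once.

/* Path recording for Listing 2 (the trace string is illustrative bookkeeping). */
#include <stdio.h>
#include <string.h>

static char trace[32];
#define VISIT(n) strcat(trace, #n)          /* e.g. "124567" stands for <1,2,4,5,6,7> */

static void listing2(int p, int q, int x)
{
    int y, z;
    trace[0] = '\0';
    VISIT(1); if (p > q) { VISIT(2); x = 1; } else { VISIT(3); y = 2; }
    VISIT(4); z = x + 3;
    VISIT(5); if (z == p) { VISIT(6); z = z + 1; }
    VISIT(7); y = x + 1;
    (void)y; (void)z;
}

int main(void)
{
    /* the four test cases given above; x is irrelevant in Test One */
    const int tests[4][3] = { {4, 3, 0}, {2, 6, -1}, {5, 3, 20}, {2, 6, 19} };
    for (int i = 0; i < 4; i++) {
        listing2(tests[i][0], tests[i][1], tests[i][2]);
        printf("Test %d follows path %s\n", i + 1, trace);
    }
    return 0;
}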

Notice that the values for p and q are reused in the test cases. This is clearly bad from a practical point of view, as less of the program's input domain is exercised (even though all paths through the program are). There is nothing in the definition of path coverage (or in the other two forms of coverage) that requires different values to be selected on each test case. All that is required is the achievement of coverage.

Some paths are impossible

Some paths are impossible; no execution will ever cause them to be traversed. This can happen because, for example, we might write an if statement that always executes its else branch. In the CFG of this program, any path that goes through the then part of such an if statement will be an infeasible path. Such paths (which exist in the CFG, but which no execution can go down) are called ‘infeasible paths’.

In the presence of infeasible paths, it may be impossible to achieve 100% statement or branch coverage and it will obviously be impossible to achieve 100% path coverage. For example, consider the program in Listing 3. If you try to generate a test case which guarantees that both the predicates in this program will evaluate to true, you will find that you cannot.

That is, should the initial value of p be equal to the initial value of q, then the assignments x=x+1; y=x+1; will be executed, guaranteeing that y and x have different values, and therefore guaranteeing that the predicate y==x will evaluate to false. For this program, we can define test sets that achieve 100% statement and branch coverage. However, even though there are no loops, we cannot achieve 100% path coverage, as the path <1,2,3,5,6,7> is infeasible.

Which is the best technique?

How can we decide whether coverage criterion A is better than coverage criterion B? We would ideally like to know whether A finds faults that B does not, but this is a hard concept to define and use in practice. We can provide some theoretical analysis, however, and this is achieved using the subsumes relationship. A subsumes B if all test sets that satisfy A also satisfy B.

For example, 100% path coverage subsumes 100% branch coverage, because if all paths are covered by a test set t then t must cover all branches. Similarly, if all branches are covered then all statements must be covered, so 100% branch coverage subsumes 100% statement coverage. The subsumes relationship is transitive, so we also know that 100% path coverage subsumes 100% statement coverage. We can therefore say with theoretical certainty that, of the three techniques we’ve looked at, 100% path coverage is best (though usually unachievable), 100% branch coverage is next best, and 100% statement coverage is worst. Because 100% path coverage is usually impossible, branch coverage is considered by most testers to be a good criterion to aim for. Notice that, in comparing coverage criteria, it is only possible to compare 100% coverage in each category. If we cover all but one path using path coverage, we cannot be sure that we have covered every statement of the program.

Tool support

Tool support is an essential ingredient in a successful software testing strategy. Because of the number of test cases that have to be considered, an un-automated testing strategy is doomed to failure.

Most testing tools provide facilities for capturing test cases as they are executed, so that these test cases may be subsequently ‘replayed’. This capture/replay mechanism provides a log of testing effort and helps an organisation to improve its testing process. The implementation of capture/replay is technically un-demanding, but it has proved remarkably effective and productivity enhancing.

Tools also exist which can generate test cases, based upon a desired level of coverage. Most existing tools use a rather simplistic and expensive approach: they generate test cases at random and measure the level of coverage achieved (by inserting probe code into the software under test). The creation of further random test cases ceases either when a sufficient level of coverage is attained or when the level of coverage appears not to be increasing (indicating, perhaps, the presence of infeasible paths, or paths down which the system is forced by relatively few test cases).
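A minimal, self-contained sketch of this random approach (applied to Listing 2, re-instrumented with the statement probes used earlier) might look like the following; the input ranges and the stopping limit are arbitrary choices of my own rather than anything a real tool prescribes.

/* Random test generation driven by statement coverage (illustrative only). */
#include <stdio.h>
#include <stdlib.h>

#define NODES 7
static int covered[NODES + 1];
#define PROBE(i) (covered[i] = 1)

static void listing2(int p, int q, int x)   /* Listing 2 with probes inserted */
{
    int y, z;
    PROBE(1);
    if (p > q) { PROBE(2); x = 1; } else { PROBE(3); y = 2; }
    PROBE(4); z = x + 3;
    PROBE(5);
    if (z == p) { PROBE(6); z = z + 1; }
    PROBE(7); y = x + 1;
    (void)y; (void)z;
}

static int percent(void)
{
    int hit = 0;
    for (int i = 1; i <= NODES; i++) hit += covered[i];
    return 100 * hit / NODES;
}

int main(void)
{
    int best = 0, stagnant = 0;
    while (best < 100 && stagnant < 1000) { /* stop at 100% or when progress dries up */
        listing2(rand() % 10, rand() % 10, rand() % 10);
        if (percent() > best) { best = percent(); stagnant = 0; }
        else stagnant++;
    }
    printf("random testing reached %d%% statement coverage\n", best);
    return 0;
}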

Unfortunately, completely automated test data generation is impossible. That is to say, the automated generation of a set of test data that is guaranteed to achieve 100% statement coverage is impossible. This is because the question as to whether a test set achieves 100% statement coverage is undecidable; it is theoretically impossible to write a program that can decide whether 100% statement coverage has been achieved.

It is easy to show that this question is undecidable. We can do this by showing that a solution to the problem would allow us to solve a different undecidable problem: that of whether or not two functions are equivalent.

Consider the simple program fragment:

scanf("%d", &x);

if (f(x) != g(x)) z=1;

In this program fragment, f and g are arbitrary side-effect free functions on integers. If it were to turn out that f and g were equivalent functions, then the assignment statement would not be executed and so 100% statement coverage could be achieved by a set of test cases that did not execute the assignment. However, if the two functions are not equivalent, then there will be some input value for x which will drive the program through the assignment statement and 100% coverage cannot be achieved without the inclusion of a test case which mentions just such an input value for x.
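By way of illustration, here is a purely invented instance of the fragment in which f and g happen to be equivalent (the same function written two ways), so the assignment to z can never be executed by any test case.

/* An illustrative instance of the fragment above with equivalent f and g. */
#include <stdio.h>

static int f(int x) { return x + 1; }
static int g(int x) { return 1 + x; }       /* the same function, written differently */

int main(void)
{
    int x, z = 0;
    if (scanf("%d", &x) == 1) {
        if (f(x) != g(x))
            z = 1;                          /* unreachable: f(x) always equals g(x) */
    }
    printf("z = %d\n", z);
    return 0;
}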

Now suppose that there exists an automated system for answering whether or not 100% statement coverage has been achieved by some set of test cases. It would be possible to use this system to decide whether the two functions f and g were equivalent. All that would be required would be to generate a single test case (any one will do: some input value for x). Next, execute the program on this input. If the assignment statement is executed, then conclude that the two functions are not equivalent. If the assignment is not executed, then ask the automated system whether the test set consisting of this single test case achieves 100% statement coverage. If the system answers ‘no’ then the functions are not equivalent; if the system answers ‘yes’ then the two functions are equivalent.

If we had a tool for deciding whether a test set achieves 100% statement coverage, we could use it to answer the question of whether or not two functions were equivalent. However, the question as to whether two arbitrary functions on the integers are equivalent is known to be undecidable, and therefore we have to conclude that the problem of deciding whether 100% statement coverage is achieved by a test set is also undecidable.

Tools can help us with software testing by measuring coverage and by recording and replaying our test cases. They can also be used to generate test cases in an attempt to reach a good (ie high) level of coverage, but they can never do the whole job for us.

The final criteria

Software testing is much harder than testing in other engineering domains because of the discrete nature of the product (software) to be tested. It is possible to make some sensible choices as to what tests to subject a system to, based upon the structure of the software. What we do is to attempt to cover the software, ensuring that all aspects of its structure are exercised by the test set. This does not guarantee that we will trap all faults; nothing can do this for us. It does, however, give us a measure of how much testing we have performed and provides us with a way of comparing one testing criterion with another.

Mark Harman is a lecturer in computer science at Goldsmiths’ College. He can be contacted by email at mark.harman@brunel.ac.uk, or by post to Mark Harman, Department of Information Systems and Computing, Goldsmiths’ College, University of London, New Cross, London SE14 6NW.

Thanks are due to Robert Hierons, who checked earlier versions of this article and provided valuable insights into the nature of the software testing process.

Software testing basics

Levels of testing

Testing is carried out at several levels of granularity. Usually three are considered worthy of special note. These three are: unit testing, module testing, and system (or integration) testing.

Unit testing is at the lowest level of granularity. Individual functions and procedures are tested (usually by the programmer who wrote them, though this is not necessarily the best practice). Often this testing is carried out manually, is ad hoc, and effectively amounts to random testing conducted by a highly biased random number generator.

This is a shame; it is perhaps for unit testing that automation has been most successfully applied in commercially available testing tools. It is also in individual units that some of the most disastrous bugs lurk (for example, Year 2000 bugs, the bug which caused the second crash of the London Ambulance Service system, and the Pentium FDIV bug).

It is also in unit testing that coverage analysis has a role to play.

Module testing consists of testing collections of functions and procedures that belong to a single module. At this level of granularity, coverage analysis provides a way of describing the relative amount of testing effort each module should receive (or has received). That is, we shall not always want to spend the same amount of time testing each module. Some modules matter more than others and should accordingly receive more thorough testing. It would obviously be nice to achieve 100% coverage for all our modules, but often we cannot. In such situations, we can use coverage to say things like, ‘Module A has been tested to 80%, but module B has only been tested to 55%’.

It is possible to test units and modules as they are developed and the costs of fixing a fault found at these levels are comparatively low. Integration testing, by definition, has to wait until there is something to integrate, and so the faults tend to be uncovered only after more time and effort have been invested in the system. Unfortunately, coverage analysis cannot help us much with integration testing.

Acceptance testing

We often find that some bugs are only found by the user. In recognition of this observation, we often allow the user to form part of the testing process, in an ‘acceptance testing’ phase of the development life cycle. Typically, acceptance testing is regarded as the familiar two-stage process, consisting of:

Alpha testing:

In which the user tests the system in a controlled environment under the supervision of the developers.

Beta testing:

In which the user tests the software on their own site.

The concept of beta testing has sadly become synonymous with the practices of unscrupulous software developers who perform no testing whatsoever – leaving it all to the customer, in what is inappropriately termed the ‘beta testing phase'.

Coverage analysis can be useful during acceptance testing, because we can use it to measure how much of the system the user is actually exercising when the system is in operation.

Black box versus white box

Testing techniques fall into two categories: white box and black box. In white box testing, we can ‘look into the box’ and use the structure of the software under test to guide the choice of test cases. White box testing is considered to be good at finding errant behaviour and additional unwanted behaviour, but not so good at finding missing (or unimplemented) behaviour. By contrast, in black box testing, we cannot ‘see’ into the box and so our test cases must be of a purely behavioural nature. In black box testing we ask the question, ‘If I supply this input, what output do I observe?'. Coverage analysis is the most widely used white box testing technique.

 

Listing 1 – White box testing: a simple example.

if(p)

   S1

   else if (q)

      S2

      else S3

S4

 

Listing 2 – A simple program.

1 if (p>q)

2    x=1;

3    else y=2;

4 z=x+3;

5 if (z==p)

6    z=z+1;

7 y=x+1;

 

Listing 3 – An example program fragment which contains infeasible paths.

1 if (p==q)

2 { x=x+1;

3   y=x+1; }

4 else y=2;

5 z=x*x;

6 if (y==x)

7    z=z+4;