How long is this going to take?

Project estimation has always been a hit and miss affair. Mark Harman explains some new and established techniques for providing more reliable estimates than ‘think of a number and double it’.

In his 1979 Pulitzer-prize winning book Gödel, Escher, Bach (2nd edition, Penguin, 1980), Douglas Hofstadter enunciated Hofstadter’s Law, which says: ‘It always takes longer than you think, even if you take into account Hofstadter’s Law.’

As many developers know, this could not be more true in the case of software development projects. The ability of software projects to overrun both time scales and budgetary requirements is now so notorious that it hardly bears repetition. Why is it that software projects overrun so often?

Many developers and researchers have addressed this problem, and some common factors have emerged from this soul searching: the importance of clearly specifying requirements is often underestimated, as is the technical difficulty involved; user expectations often exceed what can be achieved in the available time scale; and so on. Most of us are familiar with these explanations, but what unifies them? The key word that occurs in all of them is estimation. Estimation is a crucial (and poorly understood) phase of all but the most trivial software development projects.

Typically, before development begins in earnest, someone has to estimate the physical resources required, the staff involved, the project’s likely duration, the stages at which milestones can realistically be achieved, and so on. Anyone who has tried to provide reliable estimates of these system dynamics knows how incredibly difficult it is to be precise, and how flimsy is the theoretical support for any decisions and predictions we make. If we need to set an upper bound on the execution of a loop, we wouldn’t be so unscientific as to think of a sensible number and then double it to be ‘on the safe side’, but this is just what most developers and their managers find themselves doing when asked for a prediction concerning a software project.

There are several techniques that can be used to make software estimation more reliable. The best known of these, and the one used by most project managers (if they use any technique at all), is COCOMO (COnstructive COst MOdel). COCOMO is often misused because it is not carefully fitted to the peculiarities of the local software environment, and when it is misused in this way its predictions are often far worse than ‘think of a number and double it’. Fortunately, more recent cost estimation techniques can offer far more accurate predictions that automatically tailor themselves to the local environment, because they base their estimates on past experience in just the same way that a good human estimator would.

Note that project estimation techniques are useful both to developers and project managers. They provide developers with a sound theoretical basis for arguing against unrealistic project goals and they provide managers with more accurate predictions, upon which planning and risk assessment can be more soundly based.

COCOMO

COCOMO is the name given to a family of three cost estimation models, developed by Barry Boehm in the late seventies and published in his book Software Engineering Economics (Prentice-Hall). The three COCOMO models are known as the ‘basic’, ‘intermediate’ and ‘advanced’ models. All three calculate the unknown project attributes of effort E (measured in person-months) and duration D (measured in elapsed months). The more advanced models build on the simple equations used in the basic model, which are:

E = α S^β

and

D = γ E^δ

In these equations, S is the number of lines of code produced (in thousands). The four values α, β, γ, and δ are constants for which Boehm provides different values depending upon the kind of software project under consideration. For example, for an ‘organic’ project (one that involves a small close-knit team of experienced programmers) the values are α = 2.4, β = 1.05, γ = 2.5, and δ = 0.38. For an embedded project (where the project must be developed in an environment that imposes rigid constraints) the values are α = 3.6, β = 1.2, γ = 2.5, and δ = 0.32.

It’s not too hard to see the logic behind Boehm’s equations. For example, consider the equation for E. In this equation, S is simply a crude measure of the size of the project. The equation has two constants (determined by the type of project), α and β. The β constant is perhaps the more interesting of the two. In all three models its value is greater than one, so effort grows faster than linearly with project size: as the project gets larger, the amount of effort required accelerates rather than simply keeping pace. This seems logical, because larger projects produce dis-economies of scale due to integration costs: two small projects can be completed with less effort than a single project of their combined size.
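
To make the arithmetic concrete, here is a minimal sketch of the basic model in Python, using the constants quoted above. The 32,000-line project size is an invented example figure, not data from any real project.

# Basic COCOMO: a minimal sketch using the constants quoted above.
# The project size (32 KLOC) is an invented example figure.

CONSTANTS = {
    'organic':  (2.4, 1.05, 2.5, 0.38),   # alpha, beta, gamma, delta
    'embedded': (3.6, 1.20, 2.5, 0.32),
}

def basic_cocomo(kloc, mode):
    alpha, beta, gamma, delta = CONSTANTS[mode]
    effort = alpha * kloc ** beta          # person-months
    duration = gamma * effort ** delta     # elapsed months
    return effort, duration

effort, duration = basic_cocomo(32, 'organic')
print("Effort:   %.1f person-months" % effort)    # roughly 91 person-months
print("Duration: %.1f months" % duration)         # roughly 14 months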

It is not difficult to see the appeal of the COCOMO model to project managers. All that is required is to decide what kind of project is to be undertaken (for example, organic or embedded) and to plug in the value of S, and then magically the effort and duration of the project will simply ‘drop out’ of the COCOMO model. Unfortunately, reality is often not as simple as this.

Problems with COCOMO

Barry Boehm formulated the COCOMO model based on a set of projects for which he had access to project data. Many people have applied the technique and found that the basic shape of the graph denoted by the COCOMO equations is borne out. However, the precise values of the constants α, β, γ and δ can vary widely, depending upon the development environment. This evidence suggests that while the equation may be valid, the constants need to be ‘tuned’ to the development environment. After all, Boehm’s original values were inspired by the study of projects carried out almost a quarter of a century ago, and a lot has changed in the software development industry since then!

There is another problem with the equation itself, which cannot be overcome by tweaking the values of the constants. In order to work out the required effort for a project, we need to know the size of the project, and estimating that size is half the battle. If we knew how much code an implementation would require, we could probably provide a reasonable estimate of how long the implementation would take. Even if things were a little more complicated, and we had to predict how long a team of developers would take to produce a certain number of lines of code, we could probably produce a reasonably reliable estimate. When he introduced the COCOMO models, Boehm said that one could only reasonably expect an estimate to fall within 20% of the real value, and then only 70% of the time. In reality, even this ‘estimate of the power of estimation’ has proved optimistic. Empirical investigations of estimates produced by COCOMO and similar formulaic estimation systems show that, in the worst case, they can be out by several hundred percent. Not very reassuring when our livelihoods could depend upon estimation precision.

Finally, there is a problem with the way in which software size is measured. Using the number of lines of code as a measure of size is rather arbitrary; 1000 lines of Java will probably go a lot further (that is, will represent more functionality) than 1000 lines of, say, Z80 assembler code. The weakness of code length as a measure of system size has led to an increasing uptake of ‘function points’ as an alternative measure, but function points are far less rigorously defined than lines of code, which introduces problems of its own.

While the COCOMO method has provided developers and project managers with a much-needed handle on the problem of cost estimation, it has required a great deal of ‘tweaking’ to suit particular development environments and, even with this adjustment, it remains a tool for relating effort and duration to project size. Since its introduction in the early eighties, the COCOMO method has remained the principal technique used by project managers to estimate project costs (apart from intuition and luck, that is). While COCOMO is attractively simple, it would be wise to use it only to decide upon best- and worst-case scenarios, rather than to use it to calculate precise predictions of project attributes.

More recently, new models for cost estimation based on Artificial Neural Networks (ANNs) and Case-Based Reasoning (CBR) have been developed. These techniques automatically tailor themselves to the development environment concerned and allow us to predict any set of unknown project attributes in terms of a set of known project attributes. While these new techniques are less well known than COCOMO and other formulaic techniques, their flexibility and adaptability make them extremely attractive.

Neural networks

An Artificial Neural Network is a model (usually in software) of the workings of the human brain. It is currently far too demanding to expect an ANN to achieve the power, flexibility, and creativity of human intelligence, though this remains the dream of some enthusiasts. Currently achievable ANNs contain many orders of magnitude fewer components than the human brain. For example, it would take approximately 1% of all the RAM chips in existence to provide sufficient memory capacity to equal that of a single human brain. Although the replication of human intelligence is a far-off goal for the ANN research programme, more modest attempts to model simple brain-like pattern matching in particular domains have been far more successful. This work has demonstrated that ANNs are good at recognising patterns in data, and almost all successful applications of ANNs involve formulating a problem in terms of pattern matching. Fortunately, many problems, including that of predicting software project attributes, can be reformulated in this way.

The network itself consists of several layers of neurons, each of which takes inputs from other neurons in the network and provides outputs to other neurons. Essentially, each neuron ‘fires’ its output connections if the weighted sum of its inputs rises above some specific ‘threshold value’. The threshold values, together with the connectivity of the network, determine the way in which the network will recognise patterns.

Usually, a fixed network configuration is chosen, while the threshold values are determined by ‘training’ the network with a set of well-understood patterns. A typical configuration involves an input layer, an output layer, and one or more ‘hidden layers’, which ultimately feed input through to the output. The input to each neuron in the input layer comes from the outside world. In the case of project estimation the input will be the values of known project attributes. The output from the output layer is the result produced by the neural network. For project estimation, the output will be the predicted values of unknown project attributes. Information propagates through the network from the input neurons to the output neurons, with the threshold values determining the relative significance of different subsets of input combinations.
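
As a concrete (and deliberately tiny) illustration, the following Python sketch propagates three known, normalised project attributes through one hidden layer to a single output neuron. The weights and thresholds are invented purely for illustration; in a real network they would be arrived at by training, as described next.

# A minimal sketch of forward propagation through a tiny threshold network.
# All weights, thresholds and input values are invented for illustration.

def neuron(inputs, weights, threshold):
    """Fire (output 1.0) if the weighted sum of the inputs exceeds the threshold."""
    total = sum(i * w for i, w in zip(inputs, weights))
    return 1.0 if total > threshold else 0.0

def layer(inputs, weight_rows, thresholds):
    """Feed the same inputs to every neuron in a layer."""
    return [neuron(inputs, w, t) for w, t in zip(weight_rows, thresholds)]

# Known project attributes, normalised to the range 0..1:
# number of developers, number of languages, number of files.
known = [0.4, 0.2, 0.7]

hidden = layer(known,
               weight_rows=[[0.9, 0.1, 0.5], [0.2, 0.8, 0.3]],
               thresholds=[0.5, 0.6])
output = layer(hidden,
               weight_rows=[[0.7, 0.6]],
               thresholds=[0.5])

print(output)   # a crude 'prediction' of an unknown project attribute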

The network is taught to recognise patterns in data during a training phase. This involves providing the network with input values for which the corresponding output value is known. The desired result is then ‘back-propagated’ from the output nodes through the intermediate nodes to the input nodes. The threshold values are modified to encourage the network to produce the right output for the input supplied. In this way the threshold values come to represent patterns in the data.
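
The sketch below shows the training idea in miniature. It departs from the hard thresholds above in one standard way: each neuron uses a smooth sigmoid activation (and, for simplicity, no bias terms), so that the error can be propagated backwards and the weights nudged in the right direction. The three ‘past projects’ are invented data; a real estimation network would have far more inputs, neurons and training examples.

# A much-simplified back-propagation sketch: one hidden layer, sigmoid
# activations, no bias terms. The training data is invented for illustration:
# each row pairs known (normalised) attributes with the attribute to predict.

import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(data, n_hidden=2, rate=0.5, epochs=5000):
    n_in = len(data[0][0])
    random.seed(1)
    w_hidden = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    w_out = [random.uniform(-1, 1) for _ in range(n_hidden)]
    for _ in range(epochs):
        for inputs, target in data:
            # Forward pass.
            hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs))) for ws in w_hidden]
            output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
            # Backward pass: propagate the error and adjust the weights.
            d_out = (target - output) * output * (1.0 - output)
            d_hid = [d_out * w_out[j] * hidden[j] * (1.0 - hidden[j]) for j in range(n_hidden)]
            for j in range(n_hidden):
                w_out[j] += rate * d_out * hidden[j]
                for i in range(n_in):
                    w_hidden[j][i] += rate * d_hid[j] * inputs[i]
    return w_hidden, w_out

def predict(inputs, w_hidden, w_out):
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs))) for ws in w_hidden]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)))

past_projects = [([0.2, 0.1, 0.3], 0.2),   # known attributes -> attribute to predict
                 ([0.8, 0.6, 0.9], 0.8),
                 ([0.5, 0.4, 0.5], 0.5)]
w_hidden, w_out = train(past_projects)
print(predict([0.6, 0.5, 0.6], w_hidden, w_out))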

Having taught the network to recognise these patterns, it can be used to predict the likely output that will occur for a known input. In the case of project estimation, we can train the network to recognise patterns in past projects and to provide predictions based upon these for future projects.

Unfortunately, it is possible for the network to overemphasise the importance of certain attributes; essentially, the network is seeing false patterns in the data. ANNs are designed to mimic human brains and, just like human brains, the network can produce wildly inaccurate answers, due to arbitrary biases, based on the natural selectivity of past experience. There is also a problem in understanding the behaviour of the network; even if it produces good estimates, we don’t really understand how it does so. In some applications of ANNs this is not an issue, but in the case of project estimation it is often important to know why and how as well as what. For example, we shall want to know not just what the length of a project will be, but why it will last that long and how it could be changed to reduce this length.

Until these problems can be addressed developers may prefer to use human brains to produce answers rather than silicon replicants. However, ANNs remain an exciting and unconventional model of computation. It could simply be that our understanding of their operation is too meagre, or that the network size which we have hitherto been capable of constructing is of ‘sub-critical mass’. The technology cannot be written off for the future; we might yet see the day when project managers’ brains are replaced by silicon implants. Oh brave new world, that would have such people in it!

Case-Based Reasoning

When asked to provide an estimate of how long an implementation will take, or how much time will be required for testing, most of us use intuition backed up by previous experience. This exploitation of previous experience is an example of Case-Based Reasoning (CBR): we base future predictions on knowledge of past cases. Recent research by Professor Shepperd’s group at Bournemouth University has produced a new automated approach to case-based estimation of project attributes. The approach has been very successful in providing accurate estimates because those estimates are based on prior knowledge and are tailored to the organisation’s development environment.

Before the approach can be applied, a database of previous project attributes needs to be created. The estimates will only be as good as the quality of the data collected, and so disciplined and structured data collection is essential. Of course, this is also true of the ANN approach and, if we are to tweak it, of the COCOMO technique too.

Collecting data about projects is often regarded as a bit of a chore. It is disliked by developers because the data often goes unused, or worse, is used in a pejorative way, to evaluate programmer productivity. This is a shame, because without reliable data about previous projects it is hard to learn from them. Fortunately, project estimation using CBR uses prior project data in a different way to other techniques. Using CBR, it is more likely that the data will support the developers’ view of the project than oppose it, because CBR respects and takes account of previous development history.

The approach is essentially a codification of common sense. We try to find the set of projects conducted in the past that most closely resembles the one upon which we are about to embark. The approach can be applied equally well to whole projects or to individual project steps, such as the implementation of a GUI, or a port from one language or platform to another. In this way, the idea of case-based estimation is useful to both developers and to managers. Because the estimates are tailored to the development group concerned, they do not provide impossibly ambitious goals imposed in a top down fashion. Instead they provide realistic estimates of the likely attributes of a project.

Of course, the response to an unacceptable estimate may be ‘well, it will have to be completed in half that time’, or ‘we simply can’t devote that many developers to this phase of the project’. Case-based estimation can, at best, provide a realistic estimate; it cannot make the problems of short time scales and scarce resources evaporate. What it can do in this situation is provide a clear prediction of just how unrealistic a project’s goals are. Because this prediction is soundly based on previous project profiles, it gives the developer a stronger case for arguing for additional resources and/or for modifying the expectations of the project manager or user. From a managerial perspective, it is obviously important to know just how hard the team is being driven or (in the utopian world) just how much slack the manager has been able to create.

Collecting previous project data

Suppose we have collected information about some previous projects and stored this in a simple flat-file database. Suitable data will concern any attribute of the project that can be measured on a numerical scale. At first sight, this will include properties such as the size of source code used, the number of different files created, the volume of documentation, the number of developers, the duration to completion, the number of bugs reported, and so on. This data is readily available as most projects proceed; we simply need to devote (a little) extra time to its collection.

It is possible to record more qualitative project data by scaling ‘enumerable’ project attributes, for example the platform on which development took place, the customer for whom the project was developed, and so on. We can record even more ‘soft’ attributes, by simply trusting expert knowledge to determine parameters such as the maintainability of the system and the readability of its documentation.

All of these attributes are recorded on a normalised scale (usually as a real number between 0 and 1). When estimating how a new project will progress some of the attributes we record will be more important than others, but it is easy to get the case-based approach to take account of this, as we shall see.

Estimating the project attributes

To simplify the situation, let’s suppose that we have stored five project attributes: S (the source code size in thousands of lines of code), N (the number of developers), D (the duration in months), L (the number of languages used), and F (the number of files created).
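
The sketch below shows one way such records might be held and rescaled to the normalised 0 to 1 scale mentioned earlier. The attribute names follow the five above, but the values, and the min-max normalisation scheme itself, are illustrative choices rather than part of any particular published method.

# A minimal sketch of a flat-file 'database' of past projects, with each
# attribute rescaled to the range 0..1 using min-max normalisation (one of
# several possible schemes). All the values are invented example data.

projects = [
    # S (KLOC), N (developers), D (months), L (languages), F (files)
    {'S': 12.0, 'N': 3, 'D': 6.0,  'L': 1, 'F': 40},
    {'S': 45.0, 'N': 8, 'D': 14.0, 'L': 2, 'F': 130},
    {'S': 30.0, 'N': 5, 'D': 10.0, 'L': 3, 'F': 90},
]

def normalise(records):
    attrs = list(records[0].keys())
    lo = {a: min(r[a] for r in records) for a in attrs}
    hi = {a: max(r[a] for r in records) for a in attrs}
    return [{a: (r[a] - lo[a]) / (hi[a] - lo[a]) for a in attrs} for r in records]

for record in normalise(projects):
    print(record)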

Imagine that we are about to start a new project. We know how many languages will be involved, how many developers will be available, and the number of files that are to be created, and we want to predict the duration of the project and the number of lines of code that will be written. Our estimate will be based on finding the project’s nearest neighbours in a three-dimensional space whose dimensions are the attributes we already know: N, L, and F. Having found a set of nearest neighbours, we use these to provide our estimate of the new project’s unknown attributes.

To find the set of nearest neighbours we use ‘Euclidean distance’. This is a simple formula for determining how far apart two points are, based on the famous Pythagorean rule for calculating the length of the hypotenuse of a right-angled triangle from the lengths of the other two sides. The rule generalises to an arbitrary number of dimensions, so it does not matter how many project attributes we have.
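
In Python this might look like the minimal sketch below; the attribute names follow those above, and the normalised values are invented for illustration.

# Euclidean distance between two projects, measured over whichever
# (normalised) attributes we choose; works for any number of dimensions.

import math

def distance(p, q, attributes):
    return math.sqrt(sum((p[a] - q[a]) ** 2 for a in attributes))

new_project = {'N': 0.4, 'L': 0.0, 'F': 0.3}                          # known attributes only
old_project = {'N': 0.6, 'L': 0.5, 'F': 0.2, 'S': 0.4, 'D': 0.45}     # a fully recorded past project

print(distance(new_project, old_project, ['N', 'L', 'F']))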

In using the nearest neighbour set to determine the estimate for new project attributes we have several choices. The simplest approach would be to take some number of projects, say five, and to estimate the unknown parameters as the mean of the corresponding known parameters for the five neighbours. This technique suffers from two problems.

First, while a neighbour might be in the nearest five, it may still be a long way off relative to the other four. Second, for some particular unknown attribute, some of the remaining attributes may be poor predictors, merely adding confusing noise.

To remedy the first problem we simply need to identify a weighting to apply to previous projects based upon how far they are from the new one. A close project will contribute far more to the estimate than one far away. If we allow weights to be zero, then we can think of the neighbourhood as containing all projects, simplifying the model further.
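
One simple realisation of this idea is inverse-distance weighting, sketched below: every past project contributes, but its contribution shrinks rapidly with distance. The weighting scheme and all the data values are illustrative choices, not part of any particular published method.

# A minimal sketch of a distance-weighted estimate: the unknown attribute is
# a weighted mean over all past projects, with closer projects counting more.
# The weighting scheme and the data are invented for illustration.

import math

def distance(p, q, known):
    return math.sqrt(sum((p[a] - q[a]) ** 2 for a in known))

def estimate(new, past, known, unknown, eps=1e-6):
    weights = [1.0 / (distance(new, p, known) + eps) for p in past]
    return sum(w * p[unknown] for w, p in zip(weights, past)) / sum(weights)

past = [
    {'N': 0.2, 'L': 0.0, 'F': 0.1, 'D': 0.15},
    {'N': 0.6, 'L': 0.5, 'F': 0.7, 'D': 0.60},
    {'N': 0.9, 'L': 1.0, 'F': 0.9, 'D': 0.95},
]
new = {'N': 0.4, 'L': 0.5, 'F': 0.4}

print(estimate(new, past, known=['N', 'L', 'F'], unknown='D'))   # predicted duration (normalised)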

To remedy the second problem, we need to refine the model. We can ‘jack-knife’ the project set by taking out a single project, ‘throwing away’ one of its known attributes, and seeing how good the system is at predicting that now ‘unknown’ attribute. By trying all the different subsets of attributes and jack-knifing each project in turn, we can determine the best predictors for each attribute.
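
A minimal sketch of the idea follows. Every candidate subset of predictor attributes is scored by leaving each project out in turn and totalling the prediction error; the subset with the lowest total error wins. It repeats the distance-weighted estimate from the previous sketch so that it stands alone, and again all the data values are invented.

# A minimal jack-knife (leave-one-out) sketch for choosing the best predictor
# attributes. Repeats the distance-weighted estimate; all data is invented.

import math
from itertools import combinations

def distance(p, q, known):
    return math.sqrt(sum((p[a] - q[a]) ** 2 for a in known))

def estimate(new, past, known, unknown, eps=1e-6):
    weights = [1.0 / (distance(new, p, known) + eps) for p in past]
    return sum(w * p[unknown] for w, p in zip(weights, past)) / sum(weights)

past = [
    {'N': 0.2, 'L': 0.0, 'F': 0.1, 'D': 0.15},
    {'N': 0.6, 'L': 0.5, 'F': 0.7, 'D': 0.60},
    {'N': 0.9, 'L': 1.0, 'F': 0.9, 'D': 0.95},
    {'N': 0.5, 'L': 0.5, 'F': 0.3, 'D': 0.40},
]

def jack_knife_error(projects, predictors, unknown):
    """Leave each project out in turn and total the absolute prediction error."""
    error = 0.0
    for i, held_out in enumerate(projects):
        rest = projects[:i] + projects[i + 1:]
        error += abs(estimate(held_out, rest, list(predictors), unknown) - held_out[unknown])
    return error

candidates = ['N', 'L', 'F']
scores = {subset: jack_knife_error(past, subset, 'D')
          for size in range(1, len(candidates) + 1)
          for subset in combinations(candidates, size)}
best = min(scores, key=scores.get)
print("Best predictors of D:", best)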

CBR remains the subject of on-going research. More information, and a tool called ANGEL for case-based cost estimation, can be found at the Bournemouth University Empirical Software Engineering Research Group (ESERG) website (http://dec.bournemouth.ac.uk/ESERG).

A fear of accuracy

Estimating project attributes remains more of an art than a science. However, like all arts there is room for some training and for firm foundations. The evidence strongly suggests that no ‘off the peg’ estimation technique will ever be found; software projects are too diverse and complex to submit to such a crude approach.

Formulaic approaches to estimation, like the COCOMO technique, can be used to quickly determine ‘ball-park’ figures for project effort and duration, but more sophisticated (and accurate) answers can be obtained from techniques that base their estimates on local historical data. These techniques may provide better estimates, but require that we face up to the chore of collecting project attribute data.

Estimating the attributes of a software project more accurately will be of benefit to both developers and their managers. Only competitors have anything to fear from more accurate cost estimation. The techniques described in this article are not a threat to developers or their managers; at worst, we will be in a better position to say just how unrealistic our project goals are.

Mark Harman is a lecturer in computer science at Goldsmiths’ College, where he works on software development, testing, slicing, and evolutionary algorithms. He provides consultancy on development and testing issues and acts as an IT recruitment advisor. Dr. Harman can be contacted by email at mark.harman@brunel.ac.uk, or by post to Mark Harman, Department of Information Systems and Computing, Goldsmiths’ College, University of London, New Cross, London SE14 6NW.

©1998, Centaur Communications Ltd. EXE Magazine is a publication of Centaur Communications Ltd. No part of this work may be published, in whole or in part, by any means including electronic, without the express permission of Centaur Communications and the copyright holder where this is a different party.

EXE Magazine, St Giles House, 50 Poland Street, London W1V 4AX, email editorial@dotexe.demon.co.uk