4  Requirements Errors

This chapter is about requirements errors. When people think of software errors, or bugs, they usually think of programming mistakes. Many software errors, however, are due to requirements engineering mistakes. This chapter defines requirements errors and explains how they differ from implementation errors. It then describes three broad categories of requirements errors: misunderstanding stakeholder goals; relying on invalid domain assumptions; and formulating inadequate machine requirements due to incorrect reasoning. Finally, it discusses two examples of requirements errors that contributed to catastrophic failures of an airplane ground braking system and an ambulance dispatching system.

4.1 Failures, Defects and Errors

Engineers make important distinctions between failures, defects and errors:

  • a failure is a situation where a product does not work as intended (e.g. the light in your room does not turn on when you flip the switch);
  • a defect is a flaw in the product (e.g. the light bulb filament is broken);
  • an error is a human mistake that contributed to a defect (e.g. you dropped the light bulb when replacing it).

We will apply these concepts to software engineering, starting with failures.

Machine failures vs. system failures

In software engineering, we have two types of failures: machine failures (also known as ‘software failures’) and system failures.

Machine failures vs. system failures
  • A machine failure is a situation where the machine does not satisfy its requirements.
  • A system failure is a situation where the World (the machine and world actors) does not satisfy some essential stakeholder goal.

Observe that machine failures and system failures are not necessarily linked. A machine failure can happen without a system failure: the machine could violate its requirements without leading to a violation of important stakeholder goals. Conversely, a system failure can happen without machine failure: the machine could satisfy all its requirements and yet some important stakeholder goal could be violated.

We will later see two examples of system failures without machine failures:

  • an airplane ground braking system where the software satisfies its requirements, yet the ground braking system could not be activated during landing, resulting in the plane crashing at the end of the runway.
  • an ambulance dispatching system where the software satisfies its requirements, yet ambulances take longer to arrive at the incident scene than with the previous paper-based system.

In practice, determining whether a system has failed can be subject to interpretation. In some cases, like the two examples above, the system failures are clear. In other situations, different stakeholders may have different views about whether the system has actually failed or not, or they may argue about the extent of the failure. This is usually due to different views about what the essential stakeholder goals are.

Determining machine failures can also be subject to interpretation. Sometimes, the situation is clear-cut: the machine clearly fails to satisfy its expected behaviours. In other cases, people may disagree about whether the machine has failed or not. Usually, this is because the description of the machine requirements is missing, incomplete, or too vague. These disagreements can be fuelled by diverging interests in characterizing the situation as a failure or not, particularly when it involves determining blame and liability in a court case.

An important point is that software engineering is about preventing both types of failures: machine failures and system failures. Requirements engineering is principally concerned with preventing system failures.

Implementation defects vs. requirements defects

We also have two types of defects.

Implementation defects vs. requirements defects
  • An implementation defect is a characteristic of the software code that may cause a machine failure.
  • A requirements defect is a characteristic of a requirements description that may cause a machine or a system failure.

Implementation defects, also known as bugs, are properties of the software code. They are located in the code and they may cause machine failures. They are detected through testing, program verification, code reviews, or during the machine operation if they have not been caught before.

Requirements defects are characteristics of requirements descriptions. Examples of requirements defects are missing requirements, ambiguous requirements, untestable requirements, infeasible requirements, etc. We will study requirements defects in more detail in Chapter 13.

Requirements defects can cause system failures. If the requirements description is incorrect with respect to essential stakeholder goals, this could lead to a situation where the essential goals are not satisfied, even if the machine satisfies its requirements.

Requirements defects can also cause machine failures. If the requirements description is too vague, incomplete or hard to read, the development team could misunderstand the requirements and write code that violates the intended requirements. The development team may dispute the machine failure by claiming that the machine does not violate the requirements as they are written or as they understood them. For the requirements authors, however, the situation would still be a machine failure: the machine does not behave as they intended.

Not all requirements defects cause machine or system failures (luckily!). Many requirements defects, like minor ambiguities, are innocuous or can easily be resolved during the development process. Some requirements defects, however, can have dramatic consequences or can be very expensive to fix if they are not caught early. Chapter 7 will discuss empirical studies and opinions about the prevalence, criticality and cost of requirements defects.

Implementation errors vs. requirements errors

The next distinction is between implementation errors and requirements errors.

Implementation errors vs. requirements errors
  • An implementation error is a programmer’s mistake that may cause a machine failure.
  • A requirements error is a mistake in the definition, communication or understanding of machine requirements that may cause a system failure.

A symptom of requirements errors is a situation where the machine satisfies its requirements but the system as a whole fails to satisfy some essential stakeholder goals.

Unsurprisingly, requirements errors occur during the process of discovering, analysing and writing requirements before coding.

What is less obvious is that requirements errors can also be made by developers when coding! Because the machine requirements are rarely (if ever) defined completely and without any ambiguity, programmers have to fill in the gaps. Sometimes, they can do this by talking to the requirements engineers, the product owner, or stakeholder representatives but sometimes they will rely (consciously or not) on their own knowledge and beliefs about the application domain and stakeholder goals. In this process, they could make wrong inferences about what needs to be implemented. These are requirements errors because from the perspective of a programmer, if the machine behaves as they intended but not as required, the mistake is in understanding the requirements, not in writing the code.

4.2 Three Types of Requirements Errors

In Chapter 3, we saw that requirements engineering involves three broad activities:

  1. identifying, formulating and agreeing stakeholder goals (Goals),

  2. identifying and formulating valid domain assumptions (Dom), and

  3. formulating machine requirements (Req) such that Req, Dom \(\vdash\) Goals.

Requirements errors can be made in each of these activities:

Three types of requirements errors
  1. Misunderstanding stakeholder goals: the set Goals does not accurately describe the essential stakeholder goals;
  2. Invalid domain assumptions: some assumptions in Dom are not true of the World;
  3. Inadequate machine requirements due to incorrect reasoning: the machine requirements Req are insufficient to satisfy the stakeholder goals Goals under the domain assumptions Dom; in other words, the formula Req, Dom \(\vdash\) Goals does not hold.

These error types highlight once again the importance of discovering and analysing stakeholder goals and domain assumptions. Paying attention to domain assumptions is particularly important because many system failures are caused by invalid domain assumptions. We will see examples of such failures in the ground braking and ambulance dispatching systems in the next two sections.

4.3 Example 1: Airplane Ground Braking System

Let’s return to the ground braking system. The top part of Figure 4.1 recalls how the goal G1 is satisfied by the requirement R1 under the domain assumptions D1 and D2. The logical reasoning is correct: R1, D1, D2 imply G1.
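The entailment check Req, Dom \(\vdash\) Goals can be sketched mechanically. The following is a minimal sketch, assuming a simplified propositional rendering of G1, R1, D1 and D2 (the variable names and exact formulations are illustrative, not the statements in Figure 4.1): it verifies the entailment by enumerating all truth assignments.

```python
from itertools import product

# Hypothetical propositional rendering (illustrative only):
#   moving  : the plane is moving on the runway
#   turning : the wheels are turning
#   pulses  : the wheel sensors emit pulses
#   enabled : reverse thrust is enabled
G1 = lambda moving, turning, pulses, enabled: enabled == moving   # goal
R1 = lambda moving, turning, pulses, enabled: enabled == pulses   # requirement
D1 = lambda moving, turning, pulses, enabled: turning == moving   # assumption
D2 = lambda moving, turning, pulses, enabled: pulses == turning   # assumption

def entails(premises, goal):
    """Check premises |- goal by enumerating all truth assignments."""
    return all(goal(*v) for v in product([False, True], repeat=4)
               if all(p(*v) for p in premises))

print(entails([R1, D1, D2], G1))  # True: R1, D1, D2 entail G1
print(entails([R1, D2], G1))      # False: without D1 (e.g. skidding wheels), G1 is not guaranteed
```

The second check previews the obstacle discussed below: dropping D1 admits the assignment where the plane is moving but the wheels are not turning, and G1 fails even though R1 and D2 hold.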

Figure 4.1: Part of a goal model with obstacles for the ground braking system

However, if the domain assumptions are wrong, a machine satisfying R1 may no longer be sufficient for satisfying G1. The red boxes in Figure 4.1 are called obstacles. They show situations where the assumption D1 would be invalid: D1 would not hold if the plane is moving on the runway and its wheels are not turning, which could be caused by the plane skidding on a wet runway.

This violation of D1 did happen and was one of the contributing factors to the crash of an Airbus in Warsaw in 1993. The plane landed on a wet runway, the wheels were skidding and, as a result, the software did not allow the pilot to deploy the full braking system for a certain time. This resulted in the plane crashing at the end of the runway.

Obstacle Analysis: A Preview

In practice, one way to reduce the risk of system failures due to invalid domain assumptions is to perform a systematic obstacle analysis. Obstacle analysis is a technique that involves looking at each assumption one by one, identifying possible ways in which the assumption could be violated, then considering potential ways to change the machine requirements to resolve the violations. An obstacle is a property that violates some goal or some assumption. For example, the plane skidding on a wet runway is an obstacle to assumption D1. Figure 4.1 shows how obstacles are represented in goal models. The two red boxes are obstacles that block the satisfaction of the assumption: O1 is the negation of D1, and O1.1 is a subobstacle that implies O1. Obstacle analysis will be described in more detail in Chapter 19.

Airbus A320 Crash in Warsaw in 1993

Our ground braking system example is a simplified version of the real system. The conditions that led to the 1993 Warsaw crash are more complex than described above.

In the real system, different conditions are used for activating the reverse thrust, the ground spoiler and the wheel brakes. The software checked not only whether the wheels were turning but also whether the struts on the landing gear were compressed. For example, one of the machine requirements looked like FR1 from Chapter 1:

[FR1] The ground braking system must be enabled when, and only when, at least one of the following conditions is met:

  • the wheel speed is above 72 knots at both main landing gears,
  • the shock absorbers are compressed at both main landing gears.

The domain assumption on which the system relied was that, during landing, the two main landing gears of the plane would be on the ground with either their wheels turning or their shock absorbers compressed. That assumption turned out to be wrong on that particular landing. On that day, due to heavy rain and strong wind, the plane landed on a single landing gear and the wheels skidded on the wet runway, which meant that the software did not allow the pilot to brake for 9 seconds after touchdown. The accident investigation report attributes the causes of the accident to a combination of pilot errors and inadequate software design. Further information about the accident can be found in the end-of-chapter notes.
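To make FR1 concrete, here is a minimal sketch of its enabling condition (the function and variable names are hypothetical; the real avionics logic is considerably more complex):

```python
# Illustrative sketch of the FR1 enabling condition: ground braking is
# enabled when, and only when, the wheel speed is above 72 knots at both
# main landing gears, or the shock absorbers are compressed at both.
WHEEL_SPEED_THRESHOLD = 72  # knots

def ground_braking_enabled(left_wheel_speed, right_wheel_speed,
                           left_strut_compressed, right_strut_compressed):
    wheels_fast = (left_wheel_speed > WHEEL_SPEED_THRESHOLD and
                   right_wheel_speed > WHEEL_SPEED_THRESHOLD)
    struts_compressed = left_strut_compressed and right_strut_compressed
    return wheels_fast or struts_compressed

# The Warsaw landing: one gear on the ground, wheels skidding on the wet
# runway -- neither condition holds, so braking stays disabled.
print(ground_braking_enabled(0, 0, True, False))  # False
```

The sketch shows why the domain assumption matters: the condition is a correct implementation of FR1, yet it keeps braking disabled in exactly the situation that occurred in Warsaw.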

4.4 Example 2: 1992 Failure of the London Ambulance Service

Let’s now look at our second example: the failure of the London Ambulance Service in 1992. The description that follows is based on the extensive report of the inquiry that analysed the failure.

The Previous Paper-Based System

Up until 1992, the London Ambulance Service was using a paper-based system to allocate ambulances. The process operated as follows.

  • Call Taking. Urgent calls from the public are received by call assistants in the central ambulance control room. When a call is received, the call assistant writes down the call details on a pre-printed incident form. On completion of the call, the incident form is placed on a conveyor belt that transports the incident forms to a central collection point for ambulance dispatchers.
  • Ambulance Allocation. A staff member collects the forms, detects duplicates, and assigns each form to an ambulance allocator for the zone where the incident occurred (London is split into three zones: North-East, North-West, and South). The allocator decides which ambulance should be mobilized using ambulance status and location information provided by the radio operator. The ambulance to be mobilized is also recorded on the form, which is passed to a despatcher.
  • Ambulance Mobilization. If the ambulance to be mobilized is at its station, the despatcher calls the station directly. If the ambulance is away from the station, the despatcher passes the mobilisation instructions to a radio operator who contacts the ambulance.

This paper-based system was slow, laborious and inefficient. The ambulance service was thus unable to meet the Government target requiring an ambulance to arrive at the incident scene within 14 minutes after the first call in 95% of cases.

The New Computer-Based System

The ambulance service decided to replace the paper-based system with a new computer-based system. The new system would have the following new components.

  • Computer Aided Despatch. Instead of using paper-based incident forms, the call assistant would enter the call details into a new Computer Aided Despatch software. The software would help the call assistant decide which ambulance to send.
  • Automated Ambulance Localisation System. A GPS-based Ambulance Localisation System would be used to keep track of ambulances’ locations.
  • Mobile Data Terminals. Mobile Data Terminals would be installed inside the ambulances to communicate with the ambulance staff. The new Computer Aided Despatch software would send the mobilisation instructions to the Mobile Data Terminal, and the ambulance staff would use the terminal to communicate their status: for example, when they arrive at the incident scene, when they hand over the patient to a hospital, etc.
  • Station Printers. The Computer Aided Despatch software would also communicate to printers at the ambulance stations to send mobilisation instructions to ambulance staff waiting inside the station.

Following a procurement process, the contract for building this system was awarded to a consortium led by a small software development company with no previous experience with similar systems. Because of political pressure, the project had a very short deadline and had to be launched in a single phase. There was no time for small scale trials.

The system was launched on time but things did not work as planned. Instead of arriving faster, ambulances gradually took longer and longer to arrive. In one case, an ambulance arrived 11 hours after the call. Eventually, the system suffered a complete failure. Call handlers had to improvise to recover the situation. They could no longer use the paper-based system because its infrastructure had been disassembled.

Goals, Requirements, Assumptions

To understand the failure, let’s first look at the goals, requirements and assumptions for this system.

The main stakeholder goal is:

(G1) An ambulance should arrive at the incident scene within 14 minutes after the first call reporting the incident.

To contribute to that goal, the Computer Aided Despatch software relies on a series of domain assumptions, including:

(D1) Call takers encode the incident’s details and location.

(D2) The GPS gives the correct ambulance location.

(D3) When an ambulance is allocated to an incident, the ambulance crew will drive that ambulance to the incident location.

(D4) When the ambulance arrives at the incident location, the ambulance crew signal their arrival on the Mobile Data Terminal.

The software requirements were such that one can be reasonably convinced that if the software satisfies its requirements and the domain assumptions are valid, then the stakeholder goal will be satisfied. So far, so good.

Invalid Domain Assumptions

Unfortunately, many of the domain assumptions were incorrect.

  • The assumption D2 that the GPS would give the correct ambulance location was invalid. In some cases, the GPS gave an incorrect location or no location at all, for example when an ambulance was in a tunnel. The software was designed with the assumption that it would always have perfect information about the locations of all ambulances, and it was not able to deal with situations where that assumption was not true.
  • The assumption D3 that the allocated ambulance would go to the incident scene was also invalid. In the previous paper-based system, if several ambulances were waiting at the ambulance station and one of these ambulances was mobilized to an incident, the ambulance staff at the station would decide which ambulance to take – they could just take any ambulance from the station. The ambulance staff continued that practice when the automated system was introduced, but the designers of the system did not anticipate this behaviour. They believed the ambulance staff would always take the ambulance that had been allocated by the software. This meant that when the software allocated an ambulance and the ambulance staff decided to take another one, the software assumed there was a problem with the first ambulance and marked it as unavailable. To recover from this problem, the software would also automatically mobilize another ambulance to the incident. As a result, three ambulances were used for a single incident: two going to the incident, and the one that remained at the station marked as unavailable for future mobilisations. As the situation was repeated for multiple incidents, the software had gradually fewer ambulances to choose from, which led to increasingly longer delays in ambulance arrivals.
  • The assumption D4 that the ambulance crew would signal arrival at the incident on the Mobile Data Terminal was also invalid. The system was designed assuming that ambulance crews would always push the right buttons on the Mobile Data Terminal to update their status, for example when accepting a mobilization, when arriving at the incident, when arriving at the hospital, and when being ready for a new mobilisation. In some situations, the ambulance crew would forget to push the button, push the wrong button, or push buttons in the wrong sequence. Such errors should be expected, particularly in emergencies. Unfortunately, the software was designed without taking these mistakes into account. It was thus difficult for ambulance crews to fix their mistakes and, as a result, the software had incorrect information about the status of many ambulances. This meant that the software had gradually fewer available ambulances to mobilize, and it was sometimes trying to mobilise ambulances that were already mobilized.
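The D3 cascade described above can be sketched as a toy simulation (all names and numbers are hypothetical; this illustrates the failure mode, not the actual dispatch logic):

```python
# Toy simulation of the D3 failure cascade: the software allocates ambulance
# A, the crew takes ambulance B instead, so the software marks A unavailable
# and mobilises a backup C. Each incident thus consumes three ambulances
# instead of one, and the available pool shrinks three times too fast.
def handle_incident(available):
    allocated = available.pop(0)  # the software's choice
    taken = available.pop(0)      # the crew takes a different ambulance
    # The software interprets the mismatch as a fault in its choice...
    wrongly_unavailable = allocated
    backup = available.pop(0)     # ...and mobilises a backup as well
    return {"at_incident": [taken, backup],
            "wrongly_unavailable": wrongly_unavailable}

fleet = [f"amb-{i}" for i in range(9)]
for incident in range(3):
    print(handle_incident(fleet))
print(f"ambulances left for new incidents: {len(fleet)}")  # 0 -- pool exhausted
```

Three incidents exhaust a nine-ambulance pool, mirroring how the real system gradually ran out of ambulances it believed to be available.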

The global effect was a complete system failure: ambulances took gradually longer to arrive at incidents, up to the point where the control room decided to stop using the software and improvised a paper-based solution.

The Inquiry Report

Soon after the failure, an inquiry was set up to investigate the causes. The inquiry report is publicly available and is a rich source of information about the whole project, from procurement to operation, describing the various errors made along the way that led to the system failure. One of the main conclusions is that the system failure was not due to a software failure:

“On 26 and 27 October 1992 the computer system itself did not fail in a technical sense. Response times did on occasions become unacceptable, but overall the system did what it had been designed to do. However, much of the design had fatal flaws that would, and did, cumulatively lead to all of the symptoms of systems failure.”

The report identifies many other factors that contributed to the system failure. Studying these factors is important because they show common problems that we continue to see in software projects today:

  • The system was overambitious and had an impossible timetable.
  • The London Ambulance Service management ignored external advice about risks.
  • The procurement rules emphasised low cost over quality.
  • Project management was inadequate and ambiguous.
  • The early decision to deliver a full implementation in one phase was misguided.
  • There was incomplete “ownership” of the system by the majority of its users. In particular, there was insufficient engagement with ambulance crew in the system design.
  • The system was not fully tested and did not have a satisfactory level of quality and resilience.

Reading the inquiry report gives valuable insights into what can go wrong in the procurement and management of software projects. It will help you recognize the early signs of dysfunctional projects that you may encounter later in life.

4.5 Notes and Further Reading

The Airbus ground braking system was first used to illustrate the importance of domain assumptions by Michael Jackson in his ICSE’95 keynote talk (Jackson 1995b) and in his requirements engineering book (Jackson 1995a). Peter B. Ladkin made the accident report publicly available (Link) and published an analysis of the technical description of the Airbus A320 braking system (Ladkin 1995). Ian Sommerville discusses the accident from a software engineering perspective in an informative video (Link). You can read about later evolutions of the A320 ground spoiler logic in the Airbus Safety First magazine (Boissenin and Salavy 2010).

Details of the 1992 failure of the London Ambulance Service can be found in the inquiry report. Anthony Finkelstein and John Dowell introduced that system as a case study for the software engineering community (Finkelstein and Dowell 1996). A collection of resources collected at the time can be found at http://www.cs.ucl.ac.uk/staff/a.finkelstein/las.html. Countless papers have since used that system as a motivating or illustrative example, including the work in my PhD thesis (Letier 2001).