4 Requirements Errors
This chapter is about requirements errors. When people talk about software errors or bugs, they usually think of programming mistakes. Many software errors, however, are due to requirements engineering mistakes. This chapter defines requirements errors and explains how they differ from implementation errors. It then describes three broad categories of requirements errors: misunderstanding stakeholder goals, relying on invalid domain assumptions, and formulating machine requirements that, due to incorrect reasoning, do not actually ensure the stakeholder goals. Finally, it presents two examples of requirements errors that contributed to catastrophic failures of an airplane ground braking system and an ambulance dispatching system.
4.1 Failures, Defects and Errors
Before talking about requirements errors, we first need to recall what engineers mean by failures, defects and errors:
- a failure is a situation where a product does not work as intended (e.g. the light in your room does not turn on when you flip the switch);
- a defect is a flaw in the product (e.g. the light bulb filament is broken);
- an error is a human mistake that contributes to a defect (e.g. you dropped the light bulb when replacing it).
We will apply these concepts to software engineering, starting with failures.
Machine failures vs. system failures
In software engineering, we must consider two types of failures: machine failures (also known as ‘software failures’) and system failures.
- A machine failure is a situation where the machine does not satisfy its requirements.
- A system failure is a situation where the system as a whole does not satisfy some essential stakeholder goals.
Observe that machine failures and system failures are not necessarily linked. A machine failure can happen without a system failure: the machine could violate its requirements without leading to a violation of important stakeholder goals. Conversely, a system failure can happen without machine failure: the machine could satisfy all its requirements and yet some important stakeholder goal could be violated.
We will later see two examples of system failures without machine failures:
- an airplane ground braking system where the software satisfies its requirements, yet the ground braking system could not be activated during landing, resulting in the plane crashing at the end of the runway.
- an ambulance dispatching system where the software satisfies its requirements, yet ambulances take longer to arrive at the incident scene than with the previous paper-based system.
In practice, determining whether a system has failed can be subject to interpretation. In some cases, like the two examples above, the system failures are clear. In other situations, different stakeholders may have different views of whether the system has failed or not, or they may argue about the extent of the failure. This is usually due to different views about what stakeholder goals are essential.
Determining machine failures can also be subject to interpretation. Sometimes, the situation is clear-cut: the machine clearly fails to satisfy its expected behaviours. In other cases, people may disagree about whether the machine has failed or not. This is usually because the description of the machine requirements is missing, incomplete, or too vague. These disagreements can be amplified by diverging interests in recognising the presence or absence of a software failure, particularly when the issue involves determining blame and liability in a court case.
The important point is that software engineering is about preventing both machine and system failures. Software engineers are responsible for ensuring that their code meets the requirements and that the system as a whole meets the stakeholder goals. This means that understanding stakeholder goals and defining suitable machine requirements to satisfy these goals are essential parts of software engineering.
Implementation defects vs. requirements defects
We consider two types of defects.
Implementation defects, also known as bugs, are properties of the software code. They are located in the code, and they may cause machine failures. They are detected through testing, program verification, code reviews, or during machine operation if they have not been caught earlier.
Requirements defects are characteristics of requirements descriptions. Examples of requirements defects are missing requirements, ambiguous requirements, untestable requirements, infeasible requirements, etc. We will study requirements defects in more detail in Chapter 13.
Requirements defects can cause system failures. If the requirements description is incorrect with respect to essential stakeholder goals, this could lead to a situation where the essential goals are not satisfied, even if the machine satisfies its requirements.
Requirements defects can also cause machine failures. If the requirements description is too vague, incomplete or hard to read, the development team could misunderstand the requirements and write code that violates the intended requirements. The development team can dispute the machine failure by claiming that the machine does not violate the requirements as they are written or as they understood them. For the requirements authors, however, the situation would still be a machine failure: the machine does not behave as they intended.
Not all requirements defects cause machine or system failures (luckily!). Many requirements defects, like minor ambiguities, are innocuous or can easily be resolved during the development process. Some requirements defects, however, can have dramatic consequences or can be very expensive to fix if they are not caught early. Chapter 7 will present empirical studies and expert opinions about the prevalence, criticality, and cost of requirements defects.
Implementation errors vs. requirements errors
The next distinction is between implementation errors and requirements errors. An implementation error is a mistake made while writing code; a requirements error is a mistake made in understanding or defining the stakeholder goals, domain assumptions, or machine requirements.
A symptom of requirements errors is a situation where the machine satisfies its requirements, but the system as a whole fails to satisfy some essential stakeholder goals.
Requirements errors can be made in the process of discovering, analysing and writing requirements.
What is less obvious is that requirements errors can also be made by programmers while coding! Because the machine requirements are rarely (if ever) defined completely and without any ambiguity, programmers have to fill in the gaps. Sometimes, they can do this by talking to the requirements engineers, the product owner, or stakeholder representatives, but often, they will rely on their own knowledge and beliefs about the application domain and stakeholder goals. In this process, they could make wrong inferences about what needs to be implemented. These are requirements errors because, from the programmer’s perspective, if the machine behaves as they intended but not as required, the mistake lies in understanding the requirements, not in writing the code.
4.2 Three Types of Requirements Errors
In Chapter 3, we saw that requirements engineering involves three broad activities:
- identifying, formulating and agreeing stakeholder goals (Goals),
- identifying and formulating valid domain assumptions (Dom), and
- formulating machine requirements (Req) such that Req, Dom \(\vdash\) Goals.
Requirements errors can be made in each of these activities:
- errors in identifying and formulating stakeholder goals, such as missing or misunderstood goals;
- errors in domain assumptions, i.e. relying on assumptions that are invalid in the application domain;
- errors in formulating machine requirements, i.e. incorrect reasoning that Req and Dom together ensure the stakeholder goals.
These error types highlight once again the importance of discovering and analysing stakeholder goals and domain assumptions. Paying attention to domain assumptions is particularly important because many system failures are caused by invalid domain assumptions. The next two sections present examples of such failures for the ground braking and ambulance dispatching systems.
4.3 Example 1: Airplane Ground Braking System
Consider again the ground braking system’s goals, requirements, and domain assumptions of Section 3.3. The top part of Figure 4.1 recalls how goal G1 is satisfied by requirement R1 under domain assumptions D1 and D2. The logical reasoning is correct: R1, D1, D2 imply G1.
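As a rough propositional paraphrase of that argument (for illustration only; the precise formulations of G1, R1, D1 and D2 are those given in Section 3.3):

\[
\begin{aligned}
\text{G1} &: \textit{MovingOnRunway} \rightarrow \textit{BrakingEnabled}\\
\text{D1} &: \textit{MovingOnRunway} \rightarrow \textit{WheelsTurning}\\
\text{R1} &: \textit{WheelsTurning} \rightarrow \textit{BrakingEnabled}
\end{aligned}
\]

Chaining R1 and D1 gives G1; D2 is not needed for this direction of the argument and is omitted from the paraphrase.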
However, if the domain assumptions are wrong, a machine satisfying R1 may no longer be sufficient to satisfy G1. The red boxes in Figure 4.1 are called obstacles. They show situations where assumption D1 would be invalid: D1 would not hold if the plane is moving on the runway while its wheels are not turning, which could happen if the plane skids on a wet runway.
This violation of D1 did happen and was one of the contributing factors to the crash of an Airbus in Warsaw in 1993. The plane landed on a wet runway, the wheels were skidding and, as a result, the software did not allow the pilot to deploy the full braking system for a certain time. This resulted in the plane crashing at the end of the runway.
Obstacle Analysis: A Preview
In practice, one way to reduce errors due to invalid domain assumptions is to examine each domain assumption in turn, identifying possible ways in which the assumption could be violated and possible ways to revise or extend the requirements to prevent, reduce or mitigate such violations. This is the idea of a goal-oriented requirements analysis technique called Obstacle Analysis that we will study in Chapter 19. An obstacle is a property that violates some goal or some assumption. For example, the plane skidding on a wet runway is an obstacle to assumption D1. Figure 4.1 shows how obstacles are represented in goal models. The two red boxes are obstacles that block the satisfaction of the assumption: O1 is the negation of D1, and O1.1 is a subobstacle that implies O1. Chapter 19 will describe obstacle analysis in more detail.
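Using the propositional paraphrase introduced above (again, an illustration rather than the exact formulations of Figure 4.1), these relationships can be written as:

\[
\text{O1} \;\equiv\; \neg \text{D1} \;\equiv\; \textit{MovingOnRunway} \wedge \neg \textit{WheelsTurning},
\qquad
\text{O1.1} \;\Rightarrow\; \text{O1}
\]

When O1 holds, premise D1 of the correctness argument is false, so the fact that the machine satisfies R1 no longer guarantees G1.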
Airbus A320 Crash in Warsaw in 1993
Our ground braking system example is a simplified version of the real system. The conditions that led to the 1993 Warsaw crash are more complex than described above.
In the real system, different conditions were used to activate the reverse thrust, the ground spoiler, and the wheel brake. The software checked not only whether the wheels were turning but also whether the struts on the landing gears were compressed. For example, one of the machine requirements looked like FR1 from Chapter 1:
[FR1] The ground braking system must be enabled when, and only when, at least one of the following conditions is met:
- the wheel speed is above 72 knots at both main landing gears,
- the shock absorbers are compressed at both main landing gears.
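A minimal sketch of FR1’s enabling condition, purely to make the ‘when, and only when, at least one of’ structure concrete (the function and parameter names are hypothetical, and this is not the actual A320 braking logic):

```python
def ground_braking_enabled(left_wheel_speed_kt: float,
                           right_wheel_speed_kt: float,
                           left_shock_absorber_compressed: bool,
                           right_shock_absorber_compressed: bool) -> bool:
    """FR1: enable ground braking when, and only when, at least one condition holds."""
    # Condition 1: wheel speed above 72 knots at both main landing gears.
    wheels_condition = left_wheel_speed_kt > 72 and right_wheel_speed_kt > 72
    # Condition 2: shock absorbers compressed at both main landing gears.
    struts_condition = left_shock_absorber_compressed and right_shock_absorber_compressed
    return wheels_condition or struts_condition
```

On the Warsaw landing described below, neither condition held for several seconds after touchdown, so a machine correctly implementing FR1 still did not enable braking.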
The domain assumption on which the system relied was that during landing, the two main landing gears of the plane would be on the ground, with either their wheels turning or their shock absorbers compressed. That assumption turned out to be wrong on that particular landing. On that day, due to heavy rain and wind, the plane landed on a single landing gear, and the wheels skidded on the wet runway, which meant that the software did not allow the pilot to brake for 9 seconds after touchdown. The accident investigation report attributed the causes of the accident to a combination of pilot errors and inadequate software design. Further information about the accident can be found in the end-of-chapter notes.
4.4 Example 2: 1992 Failure of the London Ambulance Service
Let’s now look at our second example: the failure of the London Ambulance Service in 1992. The following description is based on the detailed report of an inquiry that analysed the failure.
The Previous Paper-Based System
Up until 1992, the London Ambulance Service used a paper-based system to allocate ambulances. The process operated as follows.
- Call Taking. Call assistants in the central ambulance control room receive urgent calls from the public. When a call is received, a call assistant writes down the call details on a pre-printed incident form. After the call is completed, the incident form is placed on a conveyor belt that transports it to a central collection point for ambulance dispatchers.
- Ambulance Allocation. A staff member collects the forms, detects duplicates, and assigns them to an ambulance allocator for the zone where the incident occurred (London is split into three zones: North-East, North-West, and South). The allocator decides which ambulance should be mobilised using ambulance status and location information provided by the radio operator. The ambulance to be mobilised is also recorded on the form, which is passed to a dispatcher.
- Ambulance Mobilisation. If the ambulance to be mobilised is at its station, the dispatcher calls the station directly. If the ambulance is away from the station, the dispatcher passes the mobilisation instructions to a radio operator who contacts the ambulance.
This paper-based system was slow, laborious and inefficient. The ambulance service was thus unable to meet the Government target requiring that, for at least 95% of incidents, an ambulance arrives at the incident scene within 14 minutes after the first call.
The New Computer-Based System
The ambulance service decided to replace the paper-based system with a new computer-based system that would have the following components:
- Computer-Aided Dispatch. Instead of using paper-based incident forms, the call assistant would enter the call details into a new Computer-Aided Dispatch software. The software would help the call assistant decide what ambulance to send.
- Automated Ambulance Localisation System. A GPS-based Ambulance Localisation System would be used to keep track of ambulances’ locations.
- Mobile Data Terminals. Mobile Data Terminals would be installed inside the ambulances to communicate with the ambulance staff. The new Computer-Aided Dispatch software would send the mobilisation instructions to the Mobile Data Terminal, and the ambulance staff would use the Mobile Data Terminal to communicate their status, such as when they arrive at the incident scene, when they hand over the patient to a hospital, etc.
- Station Printers. The Computer-Aided Dispatch software would also send mobilisation instructions to printers at the ambulance stations for ambulance staff waiting inside the station.
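A minimal sketch of the intended flow of information between these components, using hypothetical type and function names (an illustration of the architecture described above, not the actual CAD software):

```python
from dataclasses import dataclass

@dataclass
class Incident:
    details: str
    location: str            # entered by the call assistant instead of a paper form

@dataclass
class AmbulanceStatus:
    ambulance_id: str
    location: str            # reported by the ambulance localisation system
    available: bool          # derived from status updates sent via the Mobile Data Terminal
    at_station: bool

def choose_ambulance(incident: Incident, fleet: list[AmbulanceStatus]) -> AmbulanceStatus:
    """The dispatch software proposes an ambulance (selection rule simplified)."""
    candidates = [a for a in fleet if a.available]
    return candidates[0]     # stand-in for 'closest available ambulance'; assumes one exists

def mobilise(ambulance: AmbulanceStatus, incident: Incident) -> None:
    """Send mobilisation instructions via the station printer or the Mobile Data Terminal."""
    channel = "station printer" if ambulance.at_station else "mobile data terminal"
    print(f"Mobilise {ambulance.ambulance_id} to {incident.location} via {channel}")
```

The key point is that the software’s picture of the fleet (locations and availability) is built entirely from the localisation system and the crews’ status updates, which is where the domain assumptions discussed below come into play.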
Following a procurement process, the contract for building this system was awarded to a consortium led by a small software development company with no previous experience with similar systems. Because of political pressure, the project had a very short deadline and had to be launched in a single phase. There was no time for small-scale trials.
The system was launched on time, but things did not work as planned. Instead of arriving faster, ambulances started to take gradually longer to arrive. In one case, the ambulance arrived 11 hours after the call. Eventually, the system suffered a complete failure. Call handlers had to improvise to recover the situation. They could no longer use the paper-based system because the infrastructure had been disassembled.
Goals, Requirements, Assumptions
To understand the failure, let’s first look at the goals, requirements and assumptions for this system.
The main stakeholder goal is:
(G1) An ambulance should arrive at the incident scene within 14 minutes after the first call reporting the incident.
To contribute to that goal, the Computer-Aided Dispatch software relies on a series of domain assumptions, including:
(D1) Call takers encode the correct incident details and location.
(D2) The GPS gives the correct ambulance location.
(D3) When an ambulance is allocated to an incident, the ambulance crew drives the ambulance to the incident location.
(D4) When the ambulance arrives at the incident location, the ambulance crew signals arrival on their Mobile Data Terminal.
The software requirements were such that one can be reasonably convinced that if the software satisfies its requirements and the domain assumptions are valid, then the stakeholder goal will be satisfied. So far, so good.
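The shape of that argument can be written in the notation of Section 4.2, with Rcad standing for a deliberately simplified, hypothetical summary of the Computer-Aided Dispatch requirements (the actual requirements were far more detailed): Rcad, D1, D2, D3, D4 \(\vdash\) G1, where Rcad could be paraphrased as ‘shortly after the call details are entered, mobilise the available ambulance closest to the incident location’. The argument also relies on further assumptions not listed above, for instance that an available ambulance is always close enough to reach the incident within 14 minutes.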
Invalid Domain Assumptions
Unfortunately, many of the domain assumptions were incorrect.
- The assumption D2 that the GPS would always give the correct ambulance location was invalid. In some cases, the GPS gave an incorrect location or no location at all, for example, when an ambulance was in a tunnel. The software was designed with the assumption that it would always have perfect information about the location of all ambulances and could not deal with situations where that assumption was not true.
- The assumption D3 that the allocated ambulance would go to the incident scene was also invalid. In the previous paper-based system, if several ambulances were waiting at the ambulance station and one of them was mobilised to an incident, the ambulance staff at the station would decide which ambulance to take; they could take any ambulance from the station. The ambulance staff continued that practice when the automated system was introduced, but the system designers did not anticipate this behaviour. They believed the ambulance staff would always take the ambulance that had been allocated by the software. This meant that when the software allocated an ambulance and the ambulance staff decided to take another one, the software assumed there was a problem with the first ambulance and marked it as unavailable. To recover from this problem, the software would also automatically mobilise another ambulance to the incident. As a result, three ambulances were used for a single incident: two were going to the incident, and the one that remained at the station was marked as unavailable for future mobilisations (a simplified sketch of this cascade is shown after this list). As the situation was repeated for multiple incidents, the software had gradually fewer ambulances to choose from, which led to increasingly longer delays in ambulance arrivals.
- The assumption D4 that the ambulance crew would signal arrival at the incident on the Mobile Data Terminal was also invalid. The system was designed assuming that the ambulance crew would always push the right buttons on the Mobile Data Terminal to update their status, for example, when accepting a mobilisation, when arriving at the incident, when arriving at the hospital, and when being ready for a new mobilisation. In some situations, the ambulance crew would forget to push the button, push the wrong button, or push buttons in the wrong sequence. Such errors should be expected, particularly in emergency situations. Unfortunately, the software was designed without taking these mistakes into account. It was thus difficult for ambulance crews to fix their mistakes, and as a result, the software had incorrect information about the status of many ambulances. This meant that the software gradually had fewer available ambulances to mobilise, and it sometimes tried to mobilise ambulances that were already mobilised.
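The D3 cascade described in the second point above can be illustrated with a short simulation sketch (class, method and ambulance names are hypothetical; this illustrates the reported behaviour, not the actual CAD code):

```python
class DispatchPool:
    def __init__(self, ambulances):
        self.available = set(ambulances)

    def allocate(self):
        """The software picks an available ambulance (selection rule simplified)."""
        chosen = sorted(self.available)[0]   # stand-in for 'closest available ambulance'
        self.available.remove(chosen)        # now considered mobilised
        return chosen

    def crew_takes_other(self, allocated, taken):
        """The crew departs in a different ambulance than the one allocated (violates D3)."""
        self.available.discard(taken)        # the taken ambulance has physically left
        # The software interprets the mismatch as a fault: the allocated ambulance
        # stays marked unavailable, and a further ambulance is mobilised.
        return self.allocate()

pool = DispatchPool(["A1", "A2", "A3", "A4"])
first = pool.allocate()                      # software allocates A1
extra = pool.crew_takes_other(first, "A2")   # crew takes A2; A3 is also mobilised
print(pool.available)                        # {'A4'}: one incident consumed three ambulances
```

Repeated over many incidents, this shrinks the pool of ambulances the software considers available.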
The global effect was a complete system failure, as ambulances gradually took longer to arrive at incidents, up to the point where the control room decided to stop using the software and improvised a paper-based solution.
The Inquiry Report
Soon after the failure, an inquiry was set up to investigate the causes. The publicly available inquiry report provides detailed information about the whole project, from procurement to operation. It describes the many errors that led to the system failure. One of the main conclusions is that the system failure was not due to a software failure:
“On 26 and 27 October 1992, the computer system itself did not fail in a technical sense. Response times did on occasions become unacceptable, but overall the system did what it had been designed to do. However, much of the design had fatal flaws that would, and did, cumulatively lead to all of the symptoms of systems failure.”
The report identifies many other factors that contributed to the system’s failure. Many of these factors reflect common problems that we continue to see in software projects today:
- The system was overambitious and had an impossible timetable.
- The London Ambulance Service management ignored external advice about risks.
- The procurement rules emphasised low cost over quality.
- Project management was inadequate and ambiguous.
- The early decision to deliver a full implementation in one phase was misguided.
- There was incomplete “ownership” of the system by the majority of its users. In particular, there was insufficient engagement with ambulance crew in the system design.
- The system was not fully tested and did not have a satisfactory level of quality and resilience.
Reading the inquiry report provides valuable insights into what can go wrong in the procurement and management of software projects. These insights can help you recognise the early signs of a dysfunctional project if you ever encounter one in your career.
4.5 Notes and Further Reading
The Airbus ground braking system was first used to illustrate the role and importance of domain assumptions in software development in Michael Jackson’s ICSE’95 keynote talk (Jackson 1995b) and his requirements engineering book (Jackson 1995a). Peter B. Ladkin made the report about this accident publicly available (Link) and published an analysis of the technical description of the Airbus A320 Braking System (Ladkin 1995). Ian Sommerville discusses the accident from a software engineering perspective in an informative video (Link). You can read about later evolutions of the A320 ground spoiler logic in the Airbus Safety First magazine (Boissenin and Salavy 2010).
Details of the 1992 failure of the London Ambulance Service can be found in the inquiry report (South West Thames Regional Health Authority 1993). Anthony Finkelstein and John Dowell gathered a series of resources related to this failure (Link) and introduced that system as a case study for the software engineering community (Finkelstein and Dowell 1996). Countless research papers have used that system as a motivating or illustrative example, including the work from my PhD thesis (Letier 2001).