# Fences in Weak Memory Models: Appendix

Jade Alglave<sup>1</sup>, Luc Maranget<sup>1</sup>, Susmit Sarkar<sup>2</sup>, and Peter Sewell<sup>2</sup>

 $^{1}$  INRIA  $^{2}$  University of Cambridge

Abstract. We present an axiomatic framework, implemented in Coq, to define weak memory models w.r.t. several parameters: local reorderings of reads and writes, and visibility of inter and intra processor communications through memory, including full store atomicity relaxation. Thereby, we give a formal hierarchy of weak memory models, in which we provide a formal study of what should be the action and placement of fences to restore a given model such as SC from a weaker one. Finally, we provide a tool, diy, that tests a given machine to determine the architecture it exhibits. We detail the results of our experiments on Power and the model we extract from it. This identified an implementation error in Power 5 memory barriers (for which IBM is providing a software workaround); our results also suggest that Power 6 does not suffer from this problem.

We present here additional data and explanations, as well as the proof sketches of the results that appear in the paper. Moreover, our results are entirely implemented in the Coq proof assistant: the development is available at http://moscova.inria.fr/~alglave/wmm/.

## 1 Tables of notations

| diy relaxation | Relation                     | Source | Target | Processor | Location |
|----------------|------------------------------|--------|--------|-----------|----------|
| Rfe            | $\xrightarrow{\text{rfe}}$   | W      | R      | Different | Same     |
| Rfi            | $\xrightarrow{\mathrm{rfi}}$ | W      | R      | Same      | Same     |
| Wse            | $\xrightarrow{\text{wse}}$ . | W      | W      | Different | Same     |
| Wsi            | $\xrightarrow{\text{wsi}}$   | W      | W      | Same      | Same     |
| Fre            | $\xrightarrow{\text{fre}}$   | R      | W      | Different | Same     |
| Fri            | $\xrightarrow{\text{fri}}$   | R      | W      | Same      | Same     |

#### Communication relaxations

#### **Program order relaxations**

Program order proper

| $diy \ relaxation$     | Relation                                                                                                         | Source | Target | Processor | Location |
|------------------------|------------------------------------------------------------------------------------------------------------------|--------|--------|-----------|----------|
| PosRR                  | $\bigcup_{\ell} \stackrel{\mathrm{po}}{\to} \cap (\mathbb{R}_{\ell} \times \mathbb{R}_{\ell})$                   | R      | R      | Same      | Same     |
| PodRR                  | $\bigcup_{\ell_1 \neq \ell_2} \xrightarrow{\text{po}} \cap (\mathbb{R}_{\ell_1} \times \mathbb{R}_{\ell_2})$     | R      | R      | Same      | Diff     |
| PosRW                  | $\bigcup_{\ell} \xrightarrow{\mathrm{po}} \cap (\mathbb{R}_{\ell} \times \mathbb{W}_{\ell})$                     | R      | W      | Same      | Same     |
| PodRW                  | $\bigcup_{\ell_1 \neq \ell_2} \stackrel{\mathrm{po}}{\to} \cap (\mathbb{R}_{\ell_1} \times \mathbb{W}_{\ell_2})$ | R      | W      | Same      | Diff     |
| PosWW                  | $\bigcup_{\ell} \stackrel{\mathrm{po}}{\to} \cap (\mathbb{W}_{\ell} \times \mathbb{W}_{\ell})$                   | W      | W      | Same      | Same     |
| PodWW                  | $\bigcup_{\ell_1 \neq \ell_2} \xrightarrow{\mathrm{po}} \cap (\mathbb{W}_{\ell_1} \times \mathbb{W}_{\ell_2})$   | W      | W      | Same      | Diff     |
| $\operatorname{PosWR}$ | $\bigcup_{\ell} \xrightarrow{\mathrm{po}} \cap (\mathbb{W}_{\ell} \times \mathbb{R}_{\ell})$                     | W      | R      | Same      | Same     |
| PodWR                  | $\bigcup_{\ell_1 \neq \ell_2} \xrightarrow{\mathrm{po}} \cap (\mathbb{W}_{\ell_1} \times \mathbb{R}_{\ell_2})$   | W      | R      | Same      | Diff     |

# Dependencies

| diy relaxation  | Relation                                                                                                         | Source | Target | Processor | Location |
|-----------------|------------------------------------------------------------------------------------------------------------------|--------|--------|-----------|----------|
| DpsR            | $\bigcup_{\ell} \stackrel{\mathrm{dp}}{\to} \cap (\mathbb{R}_{\ell} \times \mathbb{R}_{\ell})$                   | R      | R      | Same      | Same     |
| DpdR            | $\bigcup_{\ell_1 \neq \ell_2} \stackrel{\mathrm{dp}}{\to} \cap (\mathbb{R}_{\ell_1} \times \mathbb{R}_{\ell_2})$ | R      | R      | Same      | Diff     |
| $\mathrm{DpsW}$ | $\bigcup_{\ell} \stackrel{\mathrm{dp}}{\to} \cap (\mathbb{R}_{\ell} \times \mathbb{W}_{\ell})$                   | R      | W      | Same      | Same     |
| DpdW            | $\bigcup_{\ell_1 \neq \ell_2} \stackrel{\mathrm{dp}}{\to} \cap (\mathbb{R}_{\ell_1} \times \mathbb{W}_{\ell_2})$ | R      | W      | Same      | Diff     |

## Barriers relaxations: base cases

# Sync

| diy relaxation | Relation                                                                                                     | Source | Target | Processor | Location |
|----------------|--------------------------------------------------------------------------------------------------------------|--------|--------|-----------|----------|
| SyncsRR        | $\bigcup_{\ell} \stackrel{\text{sync}}{\to} \cap (\mathbb{R}_{\ell} \times \mathbb{R}_{\ell})$               | R      | R      | Same      | Same     |
| SyncdRR        | $\bigcup_{\ell_1 \neq \ell_2} \stackrel{\text{sync}}{\to} \cap (\mathbb{R}_{\ell} \times \mathbb{R}_{\ell})$ | R      | R      | Same      | Diff     |
| SyncsRW        | $\bigcup_{\ell} \stackrel{\text{sync}}{\to} \cap (\mathbb{R}_{\ell} \times \mathbb{W}_{\ell})$               | R      | W      | Same      | Same     |
| SyncdRW        | $\bigcup_{\ell_1 \neq \ell_2} \stackrel{\text{sync}}{\to} \cap (\mathbb{R}_{\ell} \times \mathbb{W}_{\ell})$ | R      | W      | Same      | Diff     |
| SyncsWR        | $\bigcup_{\ell} \stackrel{\text{sync}}{\to} \cap (\mathbb{W}_{\ell} \times \mathbb{R}_{\ell})$               | W      | R      | Same      | Same     |
| SyncdWR        | $\bigcup_{\ell_1 \neq \ell_2} \stackrel{\text{sync}}{\to} \cap (\mathbb{W}_{\ell} \times \mathbb{R}_{\ell})$ | W      | R      | Same      | Diff     |
| SyncsWW        | $\bigcup_{\ell} \stackrel{\text{sync}}{\to} \cap (\mathbb{W}_{\ell} \times \mathbb{W}_{\ell})$               | W      | W      | Same      | Same     |
| SyncdWW        | $\bigcup_{\ell_1 \neq \ell_2} \stackrel{\text{sync}}{\to} \cap (\mathbb{W}_{\ell} \times \mathbb{W}_{\ell})$ | W      | W      | Same      | Diff     |

# LwSync

| $\operatorname{diy}$ relaxation | Relation                                                                                                       | Source | Target | Processor | Location |
|---------------------------------|----------------------------------------------------------------------------------------------------------------|--------|--------|-----------|----------|
| LwSyncsRR                       | $\bigcup_{\ell} \stackrel{\text{tweyne}}{\to} \cap (\mathbb{R}_{\ell} \times \mathbb{R}_{\ell})$               | R      | R      | Same      | Same     |
| LwSyncdRR                       | $\bigcup_{\ell_1 \neq \ell_2} \stackrel{\text{tweyne}}{\to} \cap (\mathbb{R}_{\ell} \times \mathbb{R}_{\ell})$ | R      | R      | Same      | Diff     |
| LwSyncsRW                       | $\bigcup_{\ell} \stackrel{\text{iwsync}}{\to} \cap (\mathbb{R}_{\ell} \times \mathbb{W}_{\ell})$               | R      | W      | Same      | Same     |
| LwSyncdRW                       | $\bigcup_{\ell_1 \neq \ell_2} \stackrel{\text{twsync}}{\to} \cap (\mathbb{R}_{\ell} \times \mathbb{W}_{\ell})$ | R      | W      | Same      | Diff     |
| LwSyncsWR                       | $\bigcup_{\ell} \stackrel{\text{iwsync}}{\to} \cap (\mathbb{W}_{\ell} \times \mathbb{R}_{\ell})$               | W      | R      | Same      | Same     |
| LwSyncdWR                       | $\bigcup_{\ell_1 \neq \ell_2} \stackrel{\text{twsync}}{\to} \cap (\mathbb{W}_{\ell} \times \mathbb{R}_{\ell})$ | W      | R      | Same      | Diff     |
| LwSyncsWW                       | $\bigcup_{\ell} \stackrel{\text{rwsync}}{\to} \cap (\mathbb{W}_{\ell} \times \mathbb{W}_{\ell})$               | W      | W      | Same      | Same     |
| LwSyncdWW                       | $\bigcup_{\ell_1 \neq \ell_2} \stackrel{\text{rwsync}}{\to} \cap (\mathbb{W}_{\ell} \times \mathbb{W}_{\ell})$ | W      | W      | Same      | Diff     |

## Barriers relaxations: A-cumulativity

Sync

| diy relaxation | Relation                                                                                                                                       | Source | Target | Processor | Location |
|----------------|------------------------------------------------------------------------------------------------------------------------------------------------|--------|--------|-----------|----------|
| ACSyncsRR      | $\bigcup_{\ell} \stackrel{\mathrm{rfe}}{\to}; (\stackrel{\mathrm{sync}}{\to} \cap (\mathbb{R}_{\ell} \times \mathbb{R}_{\ell}))$               | R      | R      | Same      | Same     |
| ACSyncdRR      | $\bigcup_{\ell_1 \neq \ell_2} \stackrel{\mathrm{rfe}}{\to}; (\stackrel{\mathrm{sync}}{\to} \cap (\mathbb{R}_{\ell} \times \mathbb{R}_{\ell}))$ | R      | R      | Same      | Diff     |
| ACSyncsRW      | $\bigcup_{\ell} \stackrel{\mathrm{rfe}}{\to}; (\stackrel{\mathrm{sync}}{\to} \cap (\mathbb{R}_{\ell} \times \mathbb{W}_{\ell}))$               | R      | W      | Same      | Same     |
| ACSyncdRW      | $\bigcup_{\ell_1 \neq \ell_2} \stackrel{\text{rfe}}{\to}; (\stackrel{\text{sync}}{\to} \cap (\mathbb{R}_{\ell} \times \mathbb{W}_{\ell}))$     | R      | W      | Same      | Diff     |

LwSync

| $diy\ \mathrm{relaxation}$ | Relation                                                                                                                                     | Source | Target | Processor | Location |
|----------------------------|----------------------------------------------------------------------------------------------------------------------------------------------|--------|--------|-----------|----------|
| ACLwSyncsRR                | $\bigcup_{\ell} \stackrel{\text{rfe}}{\to}; (\stackrel{\text{lwsync}}{\to} \cap (\mathbb{R}_{\ell} \times \mathbb{R}_{\ell}))$               | R      | R      | Same      | Same     |
| ACLwSyncdRR                | $\bigcup_{\ell_1 \neq \ell_2} \stackrel{\text{rfe}}{\to}; (\stackrel{\text{lwsync}}{\to} \cap (\mathbb{R}_{\ell} \times \mathbb{R}_{\ell}))$ | R      | R      | Same      | Diff     |
| ACLwSyncsRW                | $\bigcup_{\ell} \stackrel{\mathrm{rfe}}{\to}; (\stackrel{\mathrm{lwsync}}{\to} \cap (\mathbb{R}_{\ell} \times \mathbb{W}_{\ell}))$           | R      | W      | Same      | Same     |
| ACLwSyncdRW                | $\bigcup_{\ell_1 \neq \ell_2} \stackrel{\text{rfe}}{\to}; (\stackrel{\text{lwsync}}{\to} \cap (\mathbb{R}_{\ell} \times \mathbb{W}_{\ell}))$ | R      | W      | Same      | Diff     |

## Barriers relaxations: B-cumulativity

Sync

| $diy \ {\rm relaxation}$ | Relation                                                                                                                                       | Source | Target | Processor | Location |
|--------------------------|------------------------------------------------------------------------------------------------------------------------------------------------|--------|--------|-----------|----------|
| BCSyncsRW                | $\bigcup_{\ell} (\stackrel{\mathrm{sync}}{\to} \cap (\mathbb{R}_{\ell} \times \mathbb{W}_{\ell})); \stackrel{\mathrm{rfe}}{\to}$               | R      | W      | Same      | Same     |
| BCSyncdRW                | $\bigcup_{\ell_1 \neq \ell_2} (\stackrel{\mathrm{sync}}{\to} \cap (\mathbb{R}_{\ell} \times \mathbb{W}_{\ell})); \stackrel{\mathrm{rfe}}{\to}$ | R      | W      | Same      | Diff     |
| BCSyncsWW                | $\bigcup_{\ell} (\stackrel{\text{sync}}{\to} \cap (\mathbb{W}_{\ell} \times \mathbb{W}_{\ell})); rfe$                                          | W      | W      | Same      | Same     |
| BCSyncdWW                | $\bigcup_{\ell_1 \neq \ell_2} (\stackrel{\mathrm{sync}}{\to} \cap (\mathbb{W}_{\ell} \times \mathbb{W}_{\ell})); \stackrel{\mathrm{rfe}}{\to}$ | W      | W      | Same      | Diff     |

LwSync

| diy relaxation | Relation                                                                                                                                         | Source | Target | Processor | Location |
|----------------|--------------------------------------------------------------------------------------------------------------------------------------------------|--------|--------|-----------|----------|
| BCLwSyncsRW    | $\bigcup_{\ell} (\stackrel{\mathrm{lwsync}}{\to} \cap (\mathbb{R}_{\ell} \times \mathbb{W}_{\ell})); \stackrel{\mathrm{rfe}}{\to}$               | R      | W      | Same      | Same     |
| BCLwSyncdRW    | $\bigcup_{\ell_1 \neq \ell_2} (\stackrel{\mathrm{lwsync}}{\to} \cap (\mathbb{R}_{\ell} \times \mathbb{W}_{\ell})); \stackrel{\mathrm{rfe}}{\to}$ | R      | W      | Same      | Diff     |
| BCLwSyncsWW    | $\bigcup_{\ell} (\stackrel{\mathrm{lwsync}}{\to} \cap (\mathbb{W}_{\ell} \times \mathbb{W}_{\ell})); \stackrel{\mathrm{rfe}}{\to}$               | W      | W      | Same      | Same     |
| BCLwSyncdWW    | $\bigcup_{\ell_1 \neq \ell_2} (\stackrel{\mathrm{lwsync}}{\to} \cap (\mathbb{W}_{\ell} \times \mathbb{W}_{\ell})); \stackrel{\mathrm{rfe}}{\to}$ | W      | W      | Same      | Diff     |

# 2 Properties of validity

From our definition of validity arises a simple notion of comparison among architectures.  $A_1 \leq A_2$  means that  $A_1$  is *weaker* than  $A_2$ :

Definition 1 (Weaker).

$$A_1 \leq A_2 \ \triangleq \ (\stackrel{ppo_1}{\to} \subseteq \stackrel{ppo_2}{\to}) \ \land \ (\stackrel{grf_1}{\to} \subseteq \stackrel{grf_2}{\to})$$

#### 2.1 Validity is decreasing

We prove validity of an execution to be decreasing w.r.t the weaker predicate; thus, a weak architecture exhibits at least all the behaviours authorised by a stronger one:

#### Theorem 1 (Validity is decreasing).

$$\forall A_1 A_2, \ (A_1 \leq A_2) \Rightarrow (\forall X, A_2^{\epsilon} \text{ . valid}(X) \Rightarrow A_1^{\epsilon} \text{ . valid}(X))$$

*Proof (in Coq).* From  $A_1 \leq A_2$ , we immediately have  $A_1.ghb \subseteq A_2.ghb$ , thus if  $A_2.ghb$  is acyclic, so is  $A_1.ghb$ .

#### 2.2 Monotonicity of validity

Programs running on an architecture  $A_1$  exhibit executions that would be valid on a stronger architecture  $A_2$ ; we characterise all such executions as follows:

 $A_1.\operatorname{check}_{A_2}(X) \triangleq \operatorname{acyclic}(\overset{\operatorname{grf}_2}{\to} \cup \overset{\operatorname{ws}}{\to} \cup \overset{\operatorname{fr}}{\to} \cup \overset{\operatorname{ppo}_2}{\to})$ 

#### Theorem 2 (Characterisation).

 $\forall A_1 A_2, \ (A_1 \leq A_2) \Rightarrow (\forall X, (A_1^{\epsilon} \cdot \operatorname{valid}(X) \land A_1 \cdot \operatorname{check}_{A_2}(X)) \Leftrightarrow A_2^{\epsilon} \cdot \operatorname{valid}(X))$ 

Proof (in Coq).

- $\Rightarrow$  X being valid on  $A_1$ , we have all requirements well formedness and uniproc to guarantee it is valid on  $A_2$ , except the last predicate, which holds by the hypothesis check<sub>A2</sub>.
- ⇐ X being valid on  $A_2$  gives us all requirements well formedness and uniproc – to guarantee its validity on  $A_1$  except the last one. As  $A_1 \leq A_2$ , we know that  $A_1 \cdot ghb \subseteq A_2 \cdot ghb$  (lemma ghb\_incl), thus the acyclicity requirement for  $A_1 \cdot ghb$  holds if  $A_2 \cdot ghb$  is acyclic.

### 3 Examples of models in our framework

### 3.1 SC

Definitions

$$Sc \triangleq (MM, \stackrel{\mathrm{rr}}{\to}, \emptyset)$$

$$\stackrel{\mathrm{hb-seq}}{\to} \triangleq \stackrel{\mathrm{ws}}{\to} \cup \stackrel{\mathrm{fr}}{\to} \cup \stackrel{\mathrm{rf}}{\to}$$

Sc characterisation

Corollary 1 (Sc characterisation).

 $\forall A, (A \leq Sc) \Rightarrow (\forall X, \text{valid } AX \land A \text{. check}_{Sc} \ X \Leftrightarrow Sc \text{. valid } X)$ 

Proof (in Coq).

 $\Rightarrow$  As  $\xrightarrow{\text{po}} \cup \xrightarrow{\text{hb-seq}} = Sc.ghb$ , this is a direct consequence of thm. 2.

 $\Leftarrow$  as  $A \leq Sc$ , this is a direct consequence of thm. 1.

Sc is SC SC has been defined in [1] as follows:

 $[\ldots]$  the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

Thus an SC execution is a total order  $\xrightarrow{\text{ex}}$  consistent with the program order:

$$\operatorname{seq}(\stackrel{\operatorname{ex}}{\to}) \triangleq \operatorname{total-order}(\stackrel{\operatorname{ex}}{\to}, \mathbb{E}) \land \stackrel{\operatorname{po}}{\to} \subseteq \stackrel{\operatorname{ex}}{\to}$$

The implicit execution model of [1] states that a read r takes its value from the most recent write that precedes it in  $\stackrel{\text{ex}}{\rightarrow}$ . Writing pw(o, r) for the set of writes to the same location that precede r according to partial order o, we extract our  $\stackrel{\text{rf}}{\rightarrow}$  relation from  $\stackrel{\text{ex}}{\rightarrow}$  as follows:

$$pw(\stackrel{\text{ex}}{\to}, r) \triangleq \{w \mid \text{loc}(w) = \text{loc}(r) \land w \stackrel{\text{ex}}{\to} r\}$$
$$SC \cdot rf(\stackrel{\text{ex}}{\to}) \triangleq \{(w, r) \mid w = \max_{\text{ex}} \left( pw(\stackrel{\text{ex}}{\to}, r) \right)\}$$

We extract write serialisation as well, and one of our execution witnesses from  $\stackrel{\text{ex}}{\rightarrow}$ , which we use to show that the *SC* definition from [1] is equivalent to ours:

$$SC \cdot ws(\stackrel{ex}{\to}) \triangleq \bigcup_{\ell} (\mathbb{W}_{\ell} \times \mathbb{W}_{\ell}) \cap \stackrel{ex}{\to}$$
$$SC \cdot wit(\stackrel{ex}{\to}) \triangleq (\mathbb{E}, \stackrel{po}{\to}, SC \cdot rf(\stackrel{ex}{\to}), SC \cdot ws(\stackrel{ex}{\to}))$$

Theorem 3 (Sc is SC).

$$\forall X, \ Sc. \, \mathrm{valid}(X) \Leftrightarrow \exists \stackrel{ex}{\rightarrow}, \ \mathrm{seq}(\stackrel{ex}{\rightarrow}) \wedge SC. \, \mathrm{wit}(\stackrel{ex}{\rightarrow}) = X$$

Proof (in Coq).

⇒ from X being valid on Sc, we have  $\operatorname{acyclic}(\stackrel{\operatorname{ghb}}{\rightarrow})$ , that is  $\operatorname{acyclic}(\stackrel{\operatorname{hb-seq}}{\rightarrow} \cup \stackrel{\operatorname{po}}{\rightarrow})$  on Sc, which gives us an equivalent SC execution by Cor. 1.

 $\Leftarrow$  from  $\xrightarrow{\text{ex}}$ , we produce a *SC*, wit which is valid by Thm. 1.

|  | _ |  |  |
|--|---|--|--|
|  |   |  |  |
|  |   |  |  |
|  |   |  |  |
|  |   |  |  |

#### $3.2 \quad TSO$

Definitions

$$\begin{array}{ccc} \overset{\text{ppo-tso}}{\to} & \triangleq & RM \cup WW \\ Tso^{\epsilon} & \triangleq & (\overset{\text{ppo-tso}}{\to}, \overset{\text{rfe}}{\to}, \emptyset) \\ \end{array}$$

Tso characterisation By Thm. 2 we show the following criterion – where  $\stackrel{\text{hb-tso}}{\rightarrow}$  is  $\stackrel{\text{ws}}{\rightarrow} \cup \stackrel{\text{fr}}{\rightarrow} \cup \stackrel{\text{rfe}}{\rightarrow}$  – characterises valid executions that are  $Tso^*$  on any  $A \leq Tso$ :

A. check  $_{Tso}(X) \triangleq \operatorname{acyclic}(\overset{\operatorname{ppo-tso}}{\to} \cup \overset{\operatorname{hb-tso}}{\to})$ 

Corollary 2 (Tso characterisation).

$$\forall A, (A \leq Tso) \Rightarrow (\forall X, \text{valid } AX \land A. \text{ check}_{Tso} X \Leftrightarrow \text{valid } Tso(X))$$

Proof (in Coq).

 $\Rightarrow \text{ As } \stackrel{\text{ppo-tso}}{\to} \cup \stackrel{\text{hb-tso}}{\to} = Tso. \stackrel{\text{ghb}}{\to}, \text{ this is a direct consequence of thm. 2.} \\ \Leftarrow \text{ as } A \leq Tso, \text{ this is a direct consequence of thm. 1.}$ 

Tso is TSO

Equivalence with the native model Sparc [3, V. 8, Appendix K] formally defines a TSO execution as a partial order  $\stackrel{\text{ex}}{\rightarrow}$  on memory events constrained by some axioms that constrain  $\stackrel{\text{ex}}{\rightarrow}$ . We formulate<sup>1</sup> those as follows:

$$ptso(\stackrel{ex}{\rightarrow}) \triangleq partial-order(\stackrel{ex}{\rightarrow}, \mathbb{E}) \land RM \subseteq \stackrel{ex}{\rightarrow} \land WW \subseteq \stackrel{ex}{\rightarrow} \land \\ \exists \stackrel{tso}{\rightarrow}, \stackrel{tso}{\rightarrow} \subseteq \stackrel{ex}{\rightarrow} \land total-order(\stackrel{tso}{\rightarrow}, \mathbb{W})$$

Note that  $\xrightarrow{\text{tso}}$  above is the postulated total order on stores that justify the name *TSO*. An additional axiom, *Value*, where  $L_a$  (resp.  $S_a$ ) is the notation of [3] for a load (resp. a store) with location *a*:

$$Val(L_a) = Val(\max_{\substack{\text{ex} \\ ex}} \{ S_a \mid S_a \xrightarrow{\text{ex}} L_a \lor S_a \xrightarrow{\text{po}} L_a \})$$

 $[\ldots]$  states that the value of a data load is the value written by the most recent store to that location. Two terms combine to define the most recent store. The first corresponds to stores by other processors, while the second corresponds to stores by the processor that issued the load.

We relate Sparc *TSO* and our *Tso* by interpreting the *Value* axiom as specifying  $\stackrel{\text{rf}}{\rightarrow}$ , and extract  $\stackrel{\text{ws}}{\rightarrow}$  from  $\stackrel{\text{ex}}{\rightarrow}$ , as we did in the *Sc* case:

$$TSO \cdot \mathrm{rf}(\stackrel{\mathrm{ex}}{\to}) \triangleq \{(w, r) \mid w = \max_{\stackrel{\mathrm{ex}}{\to}} \left( \mathrm{pw}(\stackrel{\mathrm{ex}}{\to} \cup \stackrel{\mathrm{po}}{\to}, r) \right) \}$$
$$TSO \cdot \mathrm{ws}(\stackrel{\mathrm{ex}}{\to}) \triangleq \{(w_1, w_2) \mid \exists l, (w_1, w_2) \in (\mathbb{W}_l \times \mathbb{W}_l) \land w_1 \stackrel{\mathrm{ex}}{\to} w_2 \}$$
$$TSO \cdot \mathrm{wit}(\stackrel{\mathrm{ex}}{\to}) \triangleq (\mathbb{E}, \stackrel{\mathrm{po}}{\to}, TSO \cdot \mathrm{rf}(\stackrel{\mathrm{ex}}{\to}), SC \cdot \mathrm{ws}(\stackrel{\mathrm{ex}}{\to}))$$

Theorem 4 (*Tso* is *TSO*).

$$\forall X, Tso^* . valid(X) \Leftrightarrow \exists \stackrel{ex}{\rightarrow}, ptso(\stackrel{ex}{\rightarrow}) \land TSO . wit(\stackrel{ex}{\rightarrow}) = X$$

<sup>&</sup>lt;sup>1</sup> We omit the axioms Atomicity and Termination.

## Proof (in Cog).

- $\Rightarrow$  from X being valid on Tso, we have  $\operatorname{acyclic}(\stackrel{\operatorname{ghb}}{\rightarrow})$ , that is  $\operatorname{acyclic}(\stackrel{\operatorname{hb-tso}}{\rightarrow} \cup \stackrel{\operatorname{po-tso}}{\rightarrow})$ on Tso, which gives us an equivalent TSO execution by Cor. 2.
- $\Leftarrow$  from  $\stackrel{\text{ex}}{\rightarrow}$ , we produce a *TSO*. wit which is valid by Thm. 1.

#### **Barriers** 4

From now on, we note A.  $\stackrel{\text{hb-seq}}{\to}$  for  $\stackrel{?\text{rfe}}{\to} \cup (\stackrel{\text{ws}}{\to}; \stackrel{?\text{rfe}}{\to}) \cup (\stackrel{\text{fr}}{\to}; \stackrel{?\text{rfe}}{\to} \cup \stackrel{\text{ws}}{\to} \cup \stackrel{\text{fr}}{\to}$ . We trivially have A.  $\stackrel{\text{hb-seq}}{\to} \subseteq A$ .  $\stackrel{\text{hb-seq}}{\to}$ , and that  $\stackrel{\text{hb-seq}}{\to}$  is transitive. A key lemma is that forall relation  $\stackrel{\text{r}}{\to}$  over events, such that there is a cycle here one of the set of

in  $\stackrel{\text{hb-seq}}{\to} \cup \stackrel{\text{r}}{\to}$ , then there is a cycle in  $\stackrel{\text{hb-seq}'}{\to}$ ;  $(\stackrel{+}{\to})$ .

#### **Barriers** guarantee 4.1

Consider two architectures  $A_1 \leq A_2$ . We define the predicate fb (fully barriered) on  $A_1 \leq A_2$ , where  $\stackrel{r_2 \setminus 1}{\rightarrow} \triangleq \stackrel{r_2}{\rightarrow} \setminus \stackrel{r_1}{\rightarrow}$  is the set difference, and  $x \stackrel{r_1}{\rightarrow} \stackrel{r_2}{\rightarrow} y \triangleq \exists z, x \stackrel{r_1}{\rightarrow} z \land z \stackrel{r_2}{\rightarrow} y$ stands for the sequence:

$$A_1 \cdot \operatorname{fb}_{A_2}(X) \stackrel{\Delta}{=} \left( (\stackrel{\operatorname{ppo}_{2\backslash 1}}{\to}) \cup (\stackrel{\operatorname{grf}_{2\backslash 1}}{\to}; \stackrel{\operatorname{ppo}_2}{\to}) \right) \subseteq \stackrel{\operatorname{ab}_1}{\to}$$

We prove that the above condition on  $\stackrel{\text{ab}_1}{\rightarrow}$  suffices to restore  $A_2$  from  $A_1$ :

#### Theorem 5 (Barriers guarantee).

 $\forall A_1 A_2, (A_1 \leq A_2) \Rightarrow (\forall X, A_1 . \text{valid}(X) \land A_1 . \text{fb}_{A_2}(X) \Rightarrow A_2^{\epsilon} . \text{valid}(X))$ 

#### Proof (in Coq).

Let relation  $\stackrel{\text{hb'}_2}{\to}$  be  $\stackrel{?\text{rfe}_2}{\to} \cup (\stackrel{\text{ws}}{\to}; \stackrel{?\text{rfe}_2}{\to}) \cup (\stackrel{\text{fr}}{\to}; \stackrel{?\text{rfe}_2}{\to}) \cup \stackrel{\text{ws}}{\to} \cup \stackrel{\text{fr}}{\to}$ , and relation  $\stackrel{\text{hb''}_2}{\to}$ be  $(\stackrel{?\mathrm{rf}_2}{\longrightarrow} \cup \stackrel{\mathrm{ppo}_2}{\longrightarrow})^+$ . We prove that  $\stackrel{?\mathrm{rf}_2}{\longrightarrow} \cup \stackrel{\mathrm{ws}}{\longrightarrow} \cup \stackrel{\mathrm{fr}}{\longrightarrow} \cup \stackrel{\mathrm{ppo}_2}{\longrightarrow}$  is acyclic if and only if  $R_2 = \stackrel{\text{hb}_2}{\rightarrow}; \stackrel{\text{hb}_2}{\rightarrow}$  is. Then, to prove the theorem it suffices to prove that  $\stackrel{\text{ghb}_1}{\rightarrow}$  is cyclic whenever  $R_2$  is.

Let us consider a cycle in  $R_2$ . Because  $m_1 \xrightarrow{\text{rf}} m_2 \xrightarrow{\text{hb}'_2} m_3$  implies  $m_1 \xrightarrow{\text{hb}'_2} m_3$ , we can safely assume that no  $\stackrel{hb"_2}{\rightarrow}$  step ends by a  $\stackrel{?rfi_1}{\rightarrow}$  step. Then, as no two  $\stackrel{rfi}{\rightarrow}$ steps can follow one another, condition  $A_1$ . fb<sub> $A_2$ </sub>(X) suffices to guarantee the inclusion  $R_2 \subseteq (\stackrel{\text{ghb}_1}{\rightarrow})^+$  and conclude.

#### 4.2 Considering a weaker barriers guarantee

When two architectures  $A_2$  and  $A_1$  have the same policy w.r.t. the store atomicity and store buffer relaxations, which we model by  $\stackrel{\text{grf}_1}{\to} \stackrel{\text{grf}_2}{\to}$ , there is no need for a barrier as powerful as above to restore  $A_2$  from  $A_1$ : a barrier that only orders the events that surround it statically—that is, a non cumulative barrier, which action we model by non-cumul $(X, \stackrel{\text{fenced}}{\rightarrow}) \triangleq \stackrel{\text{fenced}}{\rightarrow}$ —is enough. Consider the wfb predicate, which states that the barriers provided by  $A_1$  maintain the pairs that are preserved in the program order on  $A_2$  but not on  $A_1$ .:

$$A_1 \cdot \operatorname{wfb}_{A_2}(X) \triangleq \stackrel{\operatorname{ppo}_{2\backslash 1}}{\to} \subseteq \stackrel{\operatorname{ab}_1}{\to}$$

The same guarantee applies if  $A_2$  hinders the store buffer relaxation by its preserved program order, *i.e.* when  $\stackrel{\text{rfi}}{\to} \subseteq \stackrel{\text{ppo}_2}{\longrightarrow}$ —which is particular to Sc:

### Theorem 6 (Non cumulative barriers guarantee).

 $\forall A_1 A_2, ((A_1 \leq A_2) \land (\stackrel{grf_1}{\rightarrow} = \stackrel{grf_2}{\rightarrow} \lor (\stackrel{rf_i}{\rightarrow} \subseteq \stackrel{ppo_2}{\rightarrow}))) \Rightarrow$  $(\forall X, A_1 . \operatorname{valid}(X) \land A_1 . \operatorname{wfb}_{A_2}(X) \Rightarrow A_2^{\epsilon} . \operatorname{valid}(X))$ 

*Proof (in Coq).* As for Thm. 5 we need prove  $R_2 = (\stackrel{\text{hb}'_2}{\rightarrow}; \stackrel{\text{hb}''_2}{\rightarrow}) \subseteq (\stackrel{\text{ghb}_1}{\rightarrow})^+$ . First, by hypothesis  $ext_1 = ext_2$ , we get  $\stackrel{\text{hb}'_2}{\rightarrow} = \stackrel{\text{hb}'_1}{\rightarrow} \subseteq \stackrel{\text{ghb}_1}{\rightarrow}$ . Second, we prove  $\stackrel{\text{hb}''_2}{\rightarrow} \subseteq (\stackrel{\text{ghb}_1}{\rightarrow})^+$ , by considering each step in  $\stackrel{\text{hb}''_2}{\rightarrow}$ . For a particular step  $m_1 \stackrel{\text{ppo}_2}{\rightarrow} m_2$ , we have either  $m_1 \stackrel{\text{ppo}_1}{\rightarrow} m_2 \Rightarrow m_1 \stackrel{\text{ghb}_1}{\rightarrow} m_2$ , or  $m_1 \stackrel{\text{ppo}_2}{\rightarrow} m_1 \Rightarrow m_2 \Rightarrow m_1 \stackrel{\text{ghb}_1}{\rightarrow} m_2$ . While for all  $\stackrel{\text{rfi}}{\rightarrow}$  steps, we conclude either by  $\stackrel{\text{rfi}}{\rightarrow} \subseteq \stackrel{\text{ghb}_1}{\rightarrow}$  — when  $int_2 = int_1 = true$ , or by  $\stackrel{\text{rfi}}{\rightarrow} \subseteq \stackrel{\text{ppo}_2}{\rightarrow}$ . □

From Tso to Sc As  $\xrightarrow{\text{rfe}}$  are global in both Tso and Sc, and Sc hinders the store buffering relaxation by its  $\xrightarrow{\text{ppo}}$  definition, it suffices by Thm. 6 to fence all pairs in  $\xrightarrow{\text{ppo-sc}} \setminus \xrightarrow{\text{ppo-tso}} = WR$  (where  $WR \triangleq (\mathbb{W} \times \mathbb{R}) \cap \xrightarrow{\text{po}}$ ) to restore Sc from Tso:

#### Corollary 3 (Barriers restoring Sc from Tso).

 $\forall X, (Tso. valid(X) \land non-cumul(X, WR) \subseteq \stackrel{ab_{Tso}}{\rightarrow}) \Rightarrow Sc. valid(X))$ 

From Pso to  $Tso^{\epsilon}$  We comment here on the two definitions of PSO given in Sparc documentations [3]. We adapt the definition of [3, V8] to our framework:

$$Pso^{\epsilon} \cdot Arch \triangleq (\lambda X.RM, \stackrel{\text{rie}}{\rightarrow}, \lambda X.\emptyset)$$

Tso and Pso agree on both the store atomicity and the store buffering relaxations, which allows us to apply Thm. 6:  $Tso^{\epsilon}$  is restored from Pso by inserting non cumulative barriers between all  $\xrightarrow{\text{ppo-pso}} \bigvee \xrightarrow{\text{ppo-pso}} = WW$  pairs. Indeed, TSO is obtained from PSO by adding StoreStore barriers after each write [3, V9].

## 5 diy

#### 5.1 Cycles generation

A type *edge* represents the arrows of an execution witness. It can be in  $\xrightarrow{\text{rf}}$ ,  $\xrightarrow{\text{ws}}$ ,  $\xrightarrow{\text{fr}}$ : their source and target share the same location and have the appropriate

directions (e.g.  $\xrightarrow{\text{rf}}$  has a write as source and a read as target). Similarly to  $\xrightarrow{\text{rf}}$  and  $\xrightarrow{\text{rfe}}$ , we define  $\xrightarrow{\text{wsi}}$  and  $\xrightarrow{\text{ms}}$ ,  $\xrightarrow{\text{fri}}$  and  $\xrightarrow{\text{fre}}$ . An edge can also be in  $\xrightarrow{\text{po}}$ ,  $\xrightarrow{\text{dp}}$  or  $\xrightarrow{\text{fenced}}$ , where we need to specify the source and target directions (except for  $\xrightarrow{\text{dp}}$  whose source is a read by definition), and if they share the same location.

We specify a concrete syntax for the edges: thus Rfe represents a  $\xrightarrow{\text{rfe}}$  arrow, Fre a  $\xrightarrow{\text{fre}}$  arrow, PosRW specifies a  $\xrightarrow{\text{po}}$  arrow with a read (R) as source, a write (W) as target, knowing these events share the same location (s), and DpdR a  $\xrightarrow{\text{dp}}$ arrow with a read (R) as target, with different (d) source and target locations.

arrow with a read (R) as target, with different (d) source and target locations. We define two relations  $\xrightarrow{\text{relax}}$  and  $\xrightarrow{\text{safe}}$ , whose default values are respectively  $\xrightarrow{\text{po}} \cup \xrightarrow{\text{rf}}$  and  $\xrightarrow{\text{ws}} \cup \xrightarrow{\text{fr}} \cup \xrightarrow{\text{dp}} \cup \xrightarrow{\text{fenced}}$ . They can be modified by an input of the user: *e.g.* on a x86 machine, which has a *Tso* model [2],  $\xrightarrow{\text{rfe}}$  can be specified as safe. We show a cycle in  $\xrightarrow{\text{hb-seq}} \cup \xrightarrow{\text{po}}$  is a cycle in  $\xrightarrow{(\text{relax})} + \cup \xrightarrow{(\text{relax})} + \cup \xrightarrow{(\text{safe})} + \cdots \xrightarrow{(\text{safe})} + \cdots$ .

We show a cycle in  $\stackrel{\text{ndseq}}{\to} \stackrel{\text{po}}{\to}$  is a cycle in  $\stackrel{\text{(relay)}}{\to} \stackrel{\text{(relay)}}{\to} \stackrel{\text{(rela)}}{\to} \stackrel{\text{(relay)}}{\to} \stackrel{\text{(relay)}}{\to} \stackrel{\text{(relay)$ 

#### 5.2 Code generation

diy interprets a sequence of edges as a cycle from which it computes a litmus test or fails. One of its execution witnesses includes a cycle compliant with the input edges sequence. The final condition is a conjunction of equalities over the final state of registers and memory locations characterising this cycle.

Test generation performs the following successive steps.

- 1. We map the edges sequence to a circular double-linked list, whose cells represent memory events, with direction, location, and value fields. An additional field records the edge starting from the event. This list represents the *input cycle* and appear in at least one of the execution witnesses of the produced test.
- 2. A linear scan sets the events directions, by comparing each target direction with the following source direction. When equal, the in-between cell direction is set to the common value; otherwise (*e.g.* Rfe; Rfe), the generation fails.
- 3. If no event e has an incoming edge specifying a location change, generation fails. Otherwise, a linear scan starting from e sets the locations. We reject the cycles where e and its predecessor have different locations (*e.g.* Rfe; PodRW).
- 4. We cut the input cycle into maximal sequences of events with the same location, each being scanned w.r.t. the cycle order: we give the value 1 to the first write in this sequence, 2 to the second one, *etc.* Thus the values also reflect the write serialisation order for the specified location.

- 5. Significant reads are the sources of  $\stackrel{\text{fr}}{\rightarrow}$  edges and the targets of  $\stackrel{\text{rf}}{\rightarrow}$  edges. They are associated with the write on the other side of the edge. In the  $\stackrel{\text{rf}}{\rightarrow}$  case, the value of the read is the one of its associated write. In the  $\stackrel{\text{fr}}{\rightarrow}$  case, the value of the read is the value of the predecessor of its associated write in  $\stackrel{\text{ws}}{\rightarrow}$ , *i.e.* by construction the value of its associated write minus 1. Non significant reads do not appear in the test condition.
- 6. We cut the cycle into maximal sequences of events from the same processor, each being scanned, generating load instructions to (resp. stores from) fresh registers for reads (resp. writes). We insert code implementing a dependency in front of events targeting  $\stackrel{\text{dp}}{\rightarrow}$  and the appropriate barrier instruction for events targeting  $\stackrel{\text{fenced}}{\rightarrow}$  edges. We built initial state at this step: stores and loads take their addresses from fresh registers whose content (a memory location) is defined in the initial state. Part of the final condition is also built: for any significant read with value v resulting in a load instruction to register r, we add the equality r = v.
- 7. We complete the final condition to characterise write serialisations. The write serialisation for a given location x is defined by the sequence of values 0 (initial value of x), ..., n, where n is the last value allocated for location x at step 4. If n is 0 or 1 then no addition to the final condition needs to be performed, because the write serialisation is either a singleton or a pair. If n is 2, we add the equality x = 2. Otherwise (n > 2), we add an *observer* to the program, *i.e.* we add a thread performing n loads from x to  $r_1, \ldots, r_n$  and add the equalities  $r_1 = 1 \land \ldots \land r_n = n$  to the final condition.

## References

- L. Lamport. How to Make a Correct Multiprocess Program Execute Correctly on a Multiprocessor. *IEEE Trans. Comput.*, 46(7):779–782, 1979.
- S. Owens, S. Sarkar, and P. Sewell. A Better x86 Memory Model: x86-TSO. In TPHOL 2009.
- 3. Sparc Architecture Manual Versions 8 and 9, 1992 and 1994.