13.9 Peterson–Kearns algorithm based on vector time








$p(s)$: The process where the receive event matched with send event $s$ occurs.
$f_j^i$: The $i$th failure on $P_j$.
$ck_j^i$: The $i$th state checkpoint on $P_j$. The checkpoint resides on the stable storage.
$rs_j^i$: The $i$th restart event on $P_j$.
$rb_j^i$: The $i$th rollback event on $P_j$.
$LastEvent(f_j^i) = e$ iff $e \to rs_j^i$.

In a rollback protocol, every process must be contacted at least once to
indicate that a failure has occurred and to send it the information necessary
for recovery. This process is characterized as a series of one or more polling
waves which are typified by the arrival of a polling message which transmits
information necessary for rollback and a response by the polled process. We
define two new event types:
• $C_{i,k}(m)$: The arrival of the final polling wave message for rollback from failure $f_i^m$ at process $P_k$.
• $w_{i,k}(m)$: The response to this final polling wave by $P_k$. If no response is required, $w_{i,k}(m) = C_{i,k}(m)$.

The final polling wave for recovery from failure $f_i^m$ is defined as:
$$PW_i(m) = \bigcup_{k=0}^{N-1} w_{i,k}(m) \;\cup\; \bigcup_{k=0}^{N-1} C_{i,k}(m)$$

13.9.2 Informal description of the algorithm
When a process $P_i$ restarts after failure $f_i^m$, it retrieves its latest checkpoint, including its vector clock value $V_i(Latest.ck(f_i^m))$, from the stable storage and rolls back to it. The message log is replayed until it is exhausted. Since the vector time of each message is logged with the message, the clock value of the recovering process is updated appropriately as the messages are replayed. After the logged messages have been replayed, the recovering process executes a restart event, $rs_i^m$, to begin the global rollback protocol, originates a token message containing the vector timestamp of $rs_i^m$, and sends the token to its successor process. The token associated with failure $f_i^m$ and restart event $rs_i^m$ is denoted by tk(i, m). The timestamp of this token is denoted as tk(i, m).ts. Process $P_i$ buffers all incoming application messages until the return of the token. When this occurs, $P_i$ resumes normal execution.
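
The restart step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `Checkpoint`, `Token`, and `restart` are hypothetical, replay is assumed to use the standard vector-clock receive rule (componentwise maximum, then a tick of the local entry), and the restart event is assumed to take the clock value reached after replay.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    clock: list          # vector clock saved with the checkpoint
    incarnation: int     # incarnation number at checkpoint time


@dataclass
class Token:
    origin: int          # index i of the recovering process
    ts: list             # tk(i, m).ts
    inc: int             # tk(i, m).inc


def restart(i, checkpoint, logged_msg_clocks):
    """Restore P_i from its checkpoint, replay its log, and build the token."""
    clock = list(checkpoint.clock)
    for msg_clock in logged_msg_clocks:
        # Assumed receive rule: merge with the logged timestamp, then tick own entry.
        clock = [max(a, b) for a, b in zip(clock, msg_clock)]
        clock[i] += 1
    # Assumption: the restart event carries the clock value reached after replay.
    return Token(origin=i, ts=clock, inc=checkpoint.incarnation + 1)


# Hypothetical run: a checkpoint with clock (3, 0, 0) and an empty message log.
token = restart(0, Checkpoint([3, 0, 0], 0), [])
print(token)
```

With an empty log the token simply carries the checkpoint's clock; each logged message that is replayed advances the clock before the token is created.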


The token is circulated through all the processes on the ring. When the token arrives at process $P_j$, the timestamp in the token is used to determine whether $P_j$ must roll back. If $tk(i,m).ts < V_j(p_j)$, where $V_j(p_j)$ is the current vector time of $P_j$, then an orphan event has occurred at $P_j$ and $P_j$ must roll back to an earlier state. This is accomplished by restoring $P_j$ to the state of $ck_j$, where $ck_j$ is the latest checkpoint at $P_j$ for which $V_j(ck_j) < tk(i,m).ts$, and then replaying logged messages as long as the timestamp of the message is less than tk(i, m).ts.
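
A minimal sketch of this decision follows, under stated assumptions: `vc_less` and `handle_token` are hypothetical names, the strict vector order is taken to be "componentwise less-or-equal and not equal", the log is simplified to a list of message timestamps in receive order, and the replay cutoff follows the wording above literally. The clock values in the usage lines are illustrative only.

```python
def vc_less(a, b):
    """Strict vector-clock order: a <= b in every component and a != b."""
    return all(x <= y for x, y in zip(a, b)) and tuple(a) != tuple(b)


def handle_token(current_clock, checkpoints, log, token_ts):
    """Return (checkpoint_clock, msgs_to_replay), or None if P_j need not roll back.

    checkpoints: vector clocks of P_j's checkpoints, oldest first.
    log: timestamps of logged messages, in receive order (simplified model).
    """
    if not vc_less(token_ts, current_clock):
        return None                          # no orphan event at P_j
    # Latest checkpoint whose clock is strictly less than the token timestamp.
    candidates = [c for c in checkpoints if vc_less(c, token_ts)]
    # Assumption: fall back to the oldest checkpoint if none qualifies.
    restore = candidates[-1] if candidates else checkpoints[0]
    replay = []
    for msg_ts in log:
        if not vc_less(msg_ts, token_ts):
            break                            # stop at the first message past the token
        replay.append(msg_ts)
    return restore, replay


# A process whose state is concurrent with the token keeps its state.
print(handle_token([1, 4, 0], [[0, 0, 0]], [], [3, 0, 0]))
# A process that has seen an event later than the token must roll back.
print(handle_token([5, 5, 0], [[0, 0, 0]], [[1, 0, 0], [5, 4, 0]], [3, 0, 0]))
```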
It is possible that an orphan event in Pj is the receipt of a message originating in a non-orphaned send event in process Pi . Since the send event
corresponding to such a receipt does not causally succeed any lost event in Pi ,
the recovery of Pi will not result in the replay of such messages. Therefore,
these messages are lost unless some special actions are taken. To make sure
that these messages are not lost, Pj must request their retransmission during
the rollback.
During the rollback, $P_j$ must also retransmit any message that it sent to $P_i$ that was lost due to the failure. Process $P_j$ can determine whether the messages it had sent have been received by the failed process $P_i$ by comparing the vector timestamps of the messages to the timestamp in the token. If $V_j(s)[j] > V_i(rs_i^m)[j]$, where $s$ is the message that was sent to $P_i$, then it is possible that the failed process has lost the message and it must be resent. It is also possible that the message is not lost, but is still in transit; thus $P_i$ must discard any duplicate messages. Because channels are FIFO, $P_i$ can identify any duplicate message from its timestamp.
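
The retransmission test can be illustrated with a short sketch. The function names and the shape of the send log are hypothetical. The duplicate check is one possible reading of "identify any duplicate message from its timestamp": it assumes that, over a FIFO channel, the sender's own clock entry grows strictly from one send to the next.

```python
def messages_to_resend(sent_log, token_ts, j, failed):
    """Timestamps of messages P_j sent to the failed process that may be lost.

    sent_log: list of (destination, send_timestamp) pairs kept by P_j.
    A message is resent when V_j(s)[j] > tk(i, m).ts[j].
    """
    return [ts for dest, ts in sent_log if dest == failed and ts[j] > token_ts[j]]


def is_duplicate(msg_ts, sender, last_seen):
    """Assumed FIFO-based check: the sender's own entry must strictly grow."""
    return msg_ts[sender] <= last_seen.get(sender, 0)


# Hypothetical send log of P_1 (j = 1); process 0 is the failed process.
sent_log = [(0, [1, 2, 0]), (2, [1, 3, 0]), (0, [1, 4, 0])]
print(messages_to_resend(sent_log, [4, 2, 0], j=1, failed=0))   # only the later send
print(messages_to_resend(sent_log, [3, 0, 0], j=1, failed=0))   # both sends to process 0

# At the failed process: entry 1 of the last message accepted from P_1 was 2.
print(is_duplicate([1, 2, 0], sender=1, last_seen={1: 2}))
print(is_duplicate([1, 4, 0], sender=1, last_seen={1: 2}))
```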
After the logged messages have been replayed and the required messages have been retransmitted, $P_j$ instigates a rollback event, $rb_j^k$, to indicate that its rollback is complete. Vector time is not incremented for this event, so $V_j(rb_j^k) = V_j(e_j)$, where $e_j$ is the last event replayed. Any logged event whose vector time exceeds tk(i, m).ts is discarded.
If $tk(i,m).ts \not< V_j(p_j)$ when the token arrives, the state of $P_j$ is not changed. For consistency, however, a rollback event is instigated to indicate that rollback is complete at $P_j$ and to allow the token to be propagated.
Note that, after the rollback is complete, $V_j(p_j) \not> V_i(rs_i^m)$, that is, every event in $P_j$ either happens before the restart event $rs_i^m$ or is concurrent with it. The property of vector time that $e_i \to e_j$ iff $V_i(e_i) < V_j(e_j)$ allows us to make this claim.
The token is propagated from process $P_i$ to process $P_{(i+1) \bmod N}$. As the token propagates, it rolls back orphan events at every process. When the token returns to the originating process, the rollback recovery is complete.

Handling in-transit orphan messages
It is possible for orphan messages to be in transit during the rollback process. If these messages are received and processed during or after the rollback procedure, an inconsistent global state will result. To identify these orphan messages and discard them on arrival, it is necessary to include an incarnation number with each message and with the token. $inc_i$ denotes the current incarnation number of process $P_i$, and $Inc(e_i)$ denotes the incarnation number of event $e_i$. The value returned for an event equals the current incarnation number of the process in which the event occurred. The incarnation number in the token is denoted by tk(i, m).inc.
When $P_i$ initiates the rollback process, it increments its current incarnation number by one and attaches it to the token. A process receiving the token saves both the vector timestamp of the token and the incarnation number in the stable storage. Because there is no bound on message transmission time, the vector timestamps and associated incarnation numbers that have arrived in tokens must be accumulated in a set denoted as $OrVect_i$. The set $OrVect_i$ is composed of ordered pairs of token timestamps and incarnation numbers received by process $P_i$.
When an application message is received by process $P_i$, the vector timestamp of the message is compared to the vector timestamps stored in $OrVect_i$. If the vector timestamp of the message is found to be greater than a timestamp in $OrVect_i$, then the incarnation number of the message is compared to the incarnation number corresponding to that timestamp in $OrVect_i$. If the message incarnation number is smaller, then the message is discarded: it is an orphan message that was in transit during the rollback process. In all other cases, the message is accepted and processed. Upon the receipt of a token, the receiving process sets its incarnation number to that in the token.
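
The filtering rule can be sketched as below. The class and method names are hypothetical, stable storage is modelled by an in-memory list, and the timestamps in the usage lines are made up for illustration.

```python
def vc_less(a, b):
    """Strict vector-clock order: a <= b in every component and a != b."""
    return all(x <= y for x, y in zip(a, b)) and tuple(a) != tuple(b)


class Process:
    def __init__(self):
        self.inc = 0          # current incarnation number
        self.or_vect = []     # (token timestamp, token incarnation) pairs

    def on_token(self, tok_ts, tok_inc):
        # Record the pair and adopt the token's incarnation number.
        self.or_vect.append((tuple(tok_ts), tok_inc))
        self.inc = tok_inc

    def accepts(self, msg_ts, msg_inc):
        # Discard a message that is later than some recorded token timestamp
        # but was sent in an older incarnation.
        for tok_ts, tok_inc in self.or_vect:
            if vc_less(tok_ts, msg_ts) and msg_inc < tok_inc:
                return False
        return True


p = Process()
p.on_token([2, 1, 0], 1)
print(p.accepts([3, 1, 0], 0))   # later than the token, old incarnation
print(p.accepts([1, 1, 0], 0))   # not later than the token
print(p.accepts([3, 1, 0], 1))   # later than the token, current incarnation
```

Only the first message is rejected; the other two fall under "all other cases" and are accepted.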

13.9.3 Formal description of the rollback protocol
The causal rollback protocol is described as a set of six rules, CRB1 to CRB6.
For each rule, we first present its formal description and then give a verbal
explanation of the rule.

The rollback protocol
CRB1
$w_{i,i}(m)$ occurs iff there exist $f_i^m$, $rs_i^m$ such that $f_i^m \to rs_i^m \to w_{i,i}(m)$.
A formerly failed process creates and propagates a token, event $w_{i,i}(m)$, only after restoring the state from the latest checkpoint and executing the message log from the stable storage.

CRB2
The occurrence of $w_{i,i}(m)$ implies that
$tk(i,m).ts = V_i(rs_i^m) \;\land\; tk(i,m).inc = Inc(Latest.ck(f_i^m)) + 1 \;\land\; Inc_i = Inc(Latest.ck(f_i^m)) + 1$.
The restart event increments the incarnation number at the recovering process, and the token carries the vector timestamp of the restart event and the newly incremented incarnation number.

CRB3
$w_{i,j}(m)$, $i \ne j$, occurs iff
$\exists\, rb_j^k$ such that $C_{i,j}(m) \to rb_j^k \to w_{i,j}(m) \;\land\; \forall e_j$ such that $V_j(e_j) > tk(i,m).ts$, $\neg Recorded(e_j)$.
A non-failed process will propagate the token only after it has rolled back.

CRB4
The occurrence of $w_{i,j}(m)$ implies that
$Inc_j = tk(i,m).inc \;\land\; (tk(i,m).ts,\ tk(i,m).inc) \in OrVect_j$.
A non-failed process will propagate the token only after it has set its incarnation number to that of the token and has stored the vector timestamp of the token and the incarnation number of the token in its OrVect set.

CRB5
Polling wave $PW_i(m)$ is complete when $C_{i,i}(m)$ occurs.
When the process that failed, recovered, and initiated the token receives its token back, the rollback is complete.

CRB6
Any message received by event $n(s)$ is discarded iff $\exists\, m \in OrVect_{p(s)}$ such that $Inc(s) < Inc(m) \land V(m) < V(s)$.
Messages that were in transit and which were orphaned by the failure and subsequent restart and recovery must be discarded.

Example Consider an example consisting of three processes shown in Figure 13.17. The processes have taken checkpoints $C_0^1$, $C_1^1$, and $C_2^1$. Each event on a process time line is tagged with the vector time (x, y, z) of its occurrence. Each message is tagged with [i](x, y, z), where i is the incarnation number associated with the message send event, and (x, y, z) is the vector time of the send event. Process $P_0$ fails just after sending message $m_5$, which increments its vector clock to (5, 4, 0).
Upon restart of $P_0$, the checkpoint $C_0^1$ is restored, and the restart event, $rs_0^1$, is performed by the protocol. We assume that message $m_4$ was not logged into the stable storage at $P_0$, hence it cannot be replayed during the recovery. A token, [1](3, 0, 0), is created and propagated to $P_1$. This is shown in the figure by a dotted vertical arrow. Upon the receipt of the token, $P_1$ rolls back to a point such that its vector time is not greater than (3, 0, 0), the time in the token. Hence $P_1$ rolls back to its state at time (1, 4, 0). $P_1$ then records the token in its OrVect set and sends the token to $P_2$. $P_2$ takes a similar action and rolls back to the message send event with time (1, 4, 4). The token is then returned to $P_0$ and recovery is complete.

Figure 13.17 An example of rollback recovery in the Peterson–Kearns algorithm.

Three messages are in transit while the polling wave is executing. The
message m2 from P0 to P2 with label [0](2, 0, 0) will be accepted when
it arrives. Likewise, message m6 from P2 will be accepted by P1 when
it arrives. However, application of rule CRB6 will result in message m5
with label [0](5, 4, 0) being discarded when it arrives at P1 . The net effect
of the recovery process is that the application is rolled back to a consistent global state indicated by the dotted line, and all processes have sufficient information to discard messages sent from orphan events on their
arrival.
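
As a worked check of rule CRB6 on the two labelled messages, assume each receiving process has recorded the pair $((3,0,0),\,1)$ in its OrVect set, taken from the token time (3, 0, 0) and token incarnation 1 used in the example. A message is discarded only if its timestamp is strictly greater than the recorded timestamp and its incarnation number is smaller than the recorded one:

```latex
\begin{aligned}
m_2 &: \quad V(m_2) = (2,0,0), \quad (3,0,0) \not< (2,0,0)
      \;\Longrightarrow\; \text{accepted} \\
m_5 &: \quad V(m_5) = (5,4,0), \quad (3,0,0) < (5,4,0)
      \ \text{and}\ Inc(m_5) = 0 < 1
      \;\Longrightarrow\; \text{discarded}
\end{aligned}
```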

13.9.4 Correctness proof
First we show that all orphaned events are detected and eliminated [28].
Theorem 13.1 The completion of a wave in the causal rollback protocol ensures that every event orphaned by failure $f_i^m$ is eliminated before the final polling wave is completed.
Proof We prove that when the initiator process receives the token back, all the orphan events have been detected and eliminated. That is, for an event $w_{i,j}(m)$, as specified in the causal rollback protocol,
$\neg Orphan(w_{i,j}(m), f_i^m)$.
First we prove that the token, as constructed during the restoration of a failed process, contains the necessary information to determine whether any event is orphaned by a failure. If there exists any orphan event $e_i$ due to failure $f_j^m$, then the vector timestamp in the token will be less than the vector time of the event, i.e., $tk(j,m).ts < V_i(e_i)$. By CRB2, the vector timestamp in the token, $tk(j,m).ts$, must equal $V_j(rs_j^m)$, and $V_j(rs_j^m) = V_j(LastEvent(f_j^m))$. In other words, the timestamp in the token must be equal to the vector time of the restart event $rs_j^m$ at process $P_j$, denoted as $V_j(rs_j^m)$, and the vector time of the restart event at $P_j$ will be one more than the vector time of the latest event before failure $f_j^m$. Since $rs_j^m$ occupies the same position in the causal partial order as $e_j$ and $LastEvent(f_j^m) \to e_j$, the following must hold: $V_j(rs_j^m) \le V_j(e_j)$.
If there exists an orphan $e_i$, then there exists $e_j$ such that $LastEvent(f_j^m) \to e_j \to e_i$.
Therefore, $V_j(e_j) < V_i(e_i)$ and $V_j(rs_j^m) < V_i(e_i)$, which proves that when

$$tk(j,m).ts < V_i(e_i) \tag{13.1}$$

there exists an orphan event $e_i$.
We use the above result to prove that there exists no orphan event at the end of the final polling wave:

$$\neg Orphan(w_{i,j}(m), f_i^m) \tag{13.2}$$

The proof is by contradiction. Let us assume that there exists a polling event $w_{i,j}(m)$ for which $Orphan(w_{i,j}(m), f_i^m)$ is true. Then there exists an event $e_i$ such that $LastEvent(f_i^m) \to e_i \to w_{i,j}(m)$. Then there must exist $e_j$ such that $e_i \to e_j \to w_{i,j}(m)$. This implies $Orphan(e_j, f_i^m)$. But according to Eq. (13.1), $tk(i,m).ts < V_j(e_j)$, which contradicts CRB3: $w_{i,j}(m)$ occurs iff there exists $rb_j^k$ such that $C_{i,j}(m) \to rb_j^k \to w_{i,j}(m)$ and, for every $e_j$ such that $V_j(e_j) > tk(i,m).ts$, $\neg Recorded(e_j)$.
Therefore, every event orphaned by a failure $f_i^m$ is eliminated before the final polling wave is completed.
Now we show that all orphaned messages, and only those, are discarded [28].
Theorem 13.2 All orphaned messages are discarded and all non-orphaned messages are eventually delivered.
Proof Let us consider a send event $s$ which is not orphaned by the failure $f_i^m$. In this case, $n(s) \to w_{i,p(s)}(m) \;\lor\; w_{i,p(s)}(m) \to n(s)$.
Given reliable channels, the message will eventually arrive. The receipt of a message can only disappear from the causal order if it is lost by a failed process, rolled back by the protocol, or discarded upon arrival.
The first possibility is that process $P_i$ lost the message due to its failure. In this case the receiving process $p(s)$ is $i$. During the rollback at $P_s$ (the process where the send event occurred), this message will be retransmitted, as the occurrence of the rb event associated with $w_{i,s}(m)$ guarantees this. Therefore $w_{i,i}(m) \to n(s)$.
The second possibility is that $n(s) \to w_{i,p(s)}(m)$ and $n(s)$ has rolled back because $n(s)$ was orphaned by the failure $f_i^m$. However, if event $s$ is not orphaned by $f_i^m$, $P_{p(s)}$ (the receiving process) will request retransmission before the occurrence of the rollback event rb, and $w_{i,p(s)}(m) \to n(s)$.
The final possibility is that $n(s)$ occurs after the wave but is discarded upon arrival. By CRB6, $n(s)$ will be discarded if and only if $V(s) > tk(i,m).ts$ and $Inc(s) < tk(i,m).inc$. If $s \to w_{i,s}(m)$ and $\neg Orphan(s, f_i^m)$, then $V(s) \not> tk(i,m).ts$. If $w_{i,s}(m) \to s$, then $Inc(s) \not< tk(i,m).inc$. Therefore, $n(s)$ will not be discarded and $w_{i,p(s)}(m) \to n(s)$.
We now prove the converse:
If $n(s) \to w_{i,p(s)}(m) \;\lor\; w_{i,p(s)}(m) \to n(s)$, then $\neg Orphan(s, f_i^m)$.
Assume $n(s) \to w_{i,p(s)}(m)$. From Eq. (13.2), we know $\neg Orphan(w_{i,p(s)}(m), f_i^m)$. Therefore $\neg Orphan(n(s), f_i^m)$ and $\neg Orphan(s, f_i^m)$.
Assume $w_{i,p(s)}(m) \to n(s)$ and $Orphan(s, f_i^m)$. By Eq. (13.1), this implies $tk(i,m).ts < V(s)$. Rule CRB2 of the protocol guarantees that if $Orphan(s, f_i^m)$ is true, then $Inc(s) < tk(i,m).inc$. Rule CRB4 requires that $tk(i,m).ts$ and $tk(i,m).inc$ are stored in $OrVect_j$ before $w_{i,j}(m)$ occurs. Therefore, there exists $z \in OrVect_j$ such that $V(z) < V(s)$ and $Inc(z) > Inc(s)$. CRB6 requires that such a message be discarded, contradicting our assumption that $w_{i,p(s)}(m) \to n(s)$.

13.10 Helary–Mostefaoui–Netzer–Raynal communication-induced protocol
The Helary–Mostefaoui–Netzer–Raynal [15, 16] communication-induced
checkpointing protocol prevents useless checkpoints and does it efficiently.
To prevent useless checkpoints, some coordination is required in taking local
checkpoints. Coordinated checkpointing protocols use additional control messages to synchronize their checkpointing activities, but these result in reduced
process autonomy and degraded performance of the underlying application.
Communication-induced checkpointing protocols achieve this coordination
by piggybacking control information on application messages. No control
messages are needed and no synchronization is added to the application.
More precisely, processes take local checkpoints independently, called basic
checkpoints, and the protocol directs them to take additional local checkpoints, called forced checkpoints. A process takes a forced checkpoint when
it receives a message and a predicate evaluated at that process becomes true. This predicate is
based on local control variables of the receiving process and on the control
values carried by the message. The values of the local control variables at the
process are based on causal dependencies appearing in its past.
The Helary–Mostefaoui–Netzer–Raynal communication-induced checkpointing protocol ensures that no local checkpoint is useless and it takes as few forced checkpoints as possible. It is based on the Z-path and Z-cycle theory introduced by Netzer and Xu [27], who showed that a useless checkpoint exactly corresponds to the existence of a Z-cycle in the distributed computation. At the model level, the protocol prevents Z-cycles. At the operational level, each message is piggybacked with an integer (Lamport's clock value), a vector of integers (checkpoint sequence numbers), and two boolean vectors (the size of each vector is n, the number of processes). An interesting feature of this protocol is that for any checkpoint C, it is very easy to determine a consistent global checkpoint to which C belongs.

Figure 13.18 To checkpoint or not to checkpoint [16]? Panels (a) and (b).

13.10.1 Design principles
With each checkpoint $C$, let us associate a timestamp denoted by $C.t$. The protocol depends on the following result:
For any pair of checkpoints $C_{j,y}$ and $C_{k,z}$ such that there is a Z-path from $C_{j,y}$ to $C_{k,z}$, $C_{j,y}.t < C_{k,z}.t$ implies that there is no Z-cycle.

Thus, if we can manage the timestamps and take checkpoints in such a way that the timestamps always increase along any Z-path, then no Z-cycles will form, and no checkpoints will be useless. Each process $P_i$ has a logical clock $lc_i$ managed in the following way:
1. Before a process $P_i$ takes a (basic or forced) checkpoint, it increases its clock by 1 and associates the new clock value with the checkpoint.
2. Every message $m$ is timestamped with the value of its sender's clock (let $m.t$ denote the timestamp associated with message $m$).
3. When a process $P_i$ receives a message $m$, it updates its local clock: $lc_i = \max(lc_i, m.t)$.
It follows from this mechanism that, if there is a causal Z-path from $C_{j,y}$ to $C_{k,z}$, then we have $C_{j,y}.t < C_{k,z}.t$.
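
The three clock rules can be sketched in a few lines. The class name `ClockedProcess` and its method names are hypothetical, and the run below is an illustrative causal path rather than one taken from the figures.

```python
class ClockedProcess:
    def __init__(self):
        self.lc = 0                      # logical clock lc_i

    def take_checkpoint(self):
        # Rule 1: increase the clock, then stamp the checkpoint with it.
        self.lc += 1
        return self.lc

    def send(self):
        # Rule 2: a message carries the sender's current clock value.
        return self.lc

    def receive(self, m_t):
        # Rule 3: lc_i = max(lc_i, m.t)
        self.lc = max(self.lc, m_t)


# Causal path: P_j checkpoints and sends m1 to P_i, P_i then sends m2 to P_k,
# and P_k checkpoints after delivering m2.
pj, pi, pk = ClockedProcess(), ClockedProcess(), ClockedProcess()
c_jy = pj.take_checkpoint()
m1_t = pj.send()
pi.receive(m1_t)
m2_t = pi.send()
pk.receive(m2_t)
c_kz = pk.take_checkpoint()
print(c_jy, m1_t, m2_t, c_kz)            # timestamps along the causal path
```

Because the receive rule never lowers a clock and a checkpoint always increments it, the checkpoint taken at the end of a causal path has a strictly larger timestamp than the one at its start.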

To checkpoint or not to checkpoint?
Let us consider the computation depicted in Figure 13.18, where $C_{j,y}$ is a local checkpoint taken by $P_j$ before sending $m_1$ and $C_{k,z}$ is the first checkpoint of $P_k$ taken after the delivery of $m_2$. As the sending of $m_2$ and the delivery of $m_1$ belong to the same interval of $P_i$, messages $m_1$ and $m_2$ constitute a Z-path from $C_{j,y}$ to $C_{k,z}$. When $P_i$ receives $m_1$, two cases can occur:
1. $m_1.t \le m_2.t$. In this case, $C_{j,y}.t < m_1.t \le m_2.t < C_{k,z}.t$. Hence, the Z-path due to messages $m_1$ and $m_2$ in Figure 13.18(a) is in accordance with the above result.
2. $m_1.t > m_2.t$. In this case, a safe strategy to prevent Z-cycle formation is to direct $P_i$ to take a forced checkpoint $C_{i,x}$ before delivering $m_1$ (as shown in Figure 13.18(b)). This "breaks" the $m_1\,m_2$ Z-path, so it is no longer a Z-pattern.
This strategy can be implemented in the following way. Each process $P_i$ maintains a boolean array $sent\_to_i[1..n]$ to determine whether the reception of a message creates a Z-pattern. The value of $sent\_to_i[k]$ is true iff $P_i$ has sent a message to $P_k$ since its last checkpoint. Each process $P_i$ also maintains an array of integers $min\_to_i[1..n]$, where $min\_to_i[k]$ keeps the timestamp of the first message $P_i$ sent to $P_k$ since its last checkpoint.
The condition $m_1.t > m_2.t$ is then expressed as:
$C \equiv \exists k : sent\_to_i[k] \land m_1.t > min\_to_i[k]$
Therefore, $P_i$ takes a forced checkpoint if $C$ is true. The predicate $C$ is true if $P_i$ has sent a message to some $P_k$ since its last checkpoint and the timestamp of $m_1$ is greater than the timestamp of the first message $P_i$ sent to $P_k$ since its last checkpoint.
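
A minimal sketch of this bookkeeping follows. The class and method names are hypothetical, processes are indexed from 0 rather than 1, and the logical clock is handled by the rules of Section 13.10.1 (a checkpoint increments the clock, a receive takes the maximum).

```python
class Process:
    def __init__(self, n):
        self.n = n
        self.lc = 0
        self.sent_to = [False] * n       # sent_to[k]: sent to P_k since last checkpoint
        self.min_to = [None] * n         # timestamp of the first such message

    def take_checkpoint(self):
        self.lc += 1
        # Both arrays describe the current interval only, so they are reset.
        self.sent_to = [False] * self.n
        self.min_to = [None] * self.n
        return self.lc

    def send(self, k):
        if not self.sent_to[k]:
            self.sent_to[k] = True
            self.min_to[k] = self.lc     # first message to P_k in this interval
        return self.lc                   # m.t

    def must_force(self, m_t):
        # Predicate C: exists k with sent_to[k] and m.t > min_to[k].
        return any(s and m_t > t for s, t in zip(self.sent_to, self.min_to))

    def receive(self, m_t):
        forced = self.must_force(m_t)
        if forced:
            self.take_checkpoint()       # forced checkpoint before delivery
        self.lc = max(self.lc, m_t)
        return forced


pi = Process(3)
pi.take_checkpoint()
pi.send(2)                               # m2 to P_2, timestamp 1
print(pi.receive(1))                     # m1.t = 1 is not greater than min_to[2]
print(pi.receive(4))                     # m1.t = 4 exceeds min_to[2]
```

The first delivery needs no checkpoint; the second triggers a forced checkpoint before the message is delivered.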

Reducing the number of forced checkpoints
Each process $P_i$ maintains the local clock values of other processes. For each $k$ ($1 \le k \le n$), let $cl_i[k]$ denote the value of $P_k$'s local clock as perceived by $P_i$. If $k = i$, obviously $cl_i[i] = lc_i$. However, if $k \ne i$, the perception of $P_k$'s local clock by $P_i$ is only an approximation such that $cl_i[k] \le lc_k$. Consider again the situation in Figure 13.18. If the following property holds:
$m_1.t \le m_2.t \;\lor\; P$, where $P \equiv C_{j,y}.t \le m_1.t \le cl_i[k] < C_{k,z}.t$,
then the Z-path due to messages $m_1$ and $m_2$ is in accordance with the above result. Let us consider the property $P$ in the case where $m_1.t > m_2.t$. Since $m_1.t$ carries the value $lc_j$ when $m_1$ is sent, the first relation $C_{j,y}.t \le m_1.t$ necessarily holds when $m_1$ is received. So, the property $P$ can be violated only if, when $m_1$ is received, $m_1.t > cl_i[k]$ or $cl_i[k] \ge C_{k,z}.t$.
Therefore, to prevent the formation of a Z-path due to messages $m_1$ and $m_2$ that would violate property $P$, the protocol requires process $P_i$ to take a forced checkpoint before delivering $m_1$ if $m_1.t > cl_i[k]$ or if $cl_i[k] \ge C_{k,z}.t$.
Now we have to determine which value of $lc_k$ the approximation $cl_i[k]$ refers to. Let us consider the following two possible cases:


1. The value of $cl_i[k]$ has been brought to $P_i$ by a causal Z-path that started from $P_k$ and ended before $C_{k,z}$. This situation is illustrated in Figure 13.19. The value of $cl_i[k]$ is brought to $P_i$ by $m'$ in Figure 13.19(a) and by $m''$ and $m_1$ in Figure 13.19(b). In this case, we have $cl_i[k] < C_{k,z}.t$ and, consequently, $P_i$ has to take a forced checkpoint only if $m_1.t > cl_i[k]$.
2. The value of $cl_i[k]$ has been brought to $P_i$ by a causal Z-path that started from $P_k$ and ended after $C_{k,z}$. This situation is illustrated in Figure 13.20. Here the relevant causal Z-path is $m'$ in Figure 13.20(a) and $m''$ and $m_1$ in Figure 13.20(b). Both these figures can be redrawn so that they correspond to the pattern in Figure 13.21. In one case, $m'$ brings the last value of $P_k$'s local clock to $P_i$, and in the other case it is $m''\,m_1$. In this case, we have $cl_i[k] \ge C_{k,z}.t$, and $P_i$ has to recognize this pattern and take a forced checkpoint if it occurs. Let $C_1$ be a predicate describing the occurrence of this pattern.
The previous condition $C$ can be redefined as $C'$ as follows:
$C' \equiv \exists k : (sent\_to_i[k] \land m_1.t > min\_to_i[k]) \land (m_1.t > cl_i[k] \lor C_1)$
The predicate $C'$ has two parts. The first part is the previous condition $C$. The second part is true if the timestamp of message $m_1$ is greater than $P_k$'s local clock value as perceived by $P_i$, or if predicate $C_1$ is true.
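
The refined test can be sketched as a single function. Its name and parameters are hypothetical, processes are indexed from 0, and the grouping of the conjunction and disjunction follows the verbal description above (the earlier condition, combined with "timestamp exceeds the perceived clock, or $C_1$"), which is an interpretation rather than something the formula states unambiguously.

```python
def must_force(m_t, sent_to, min_to, cl, c1):
    """Refined forced-checkpoint test for an incoming message with timestamp m_t.

    sent_to, min_to: per-destination arrays for the current checkpoint interval.
    cl: P_i's view of the other processes' logical clocks.
    c1: truth value of predicate C1 for this message, evaluated elsewhere.
    """
    return any(
        sent_to[k] and m_t > min_to[k] and (m_t > cl[k] or c1)
        for k in range(len(sent_to))
    )


sent_to = [False, True]
min_to = [None, 2]
cl = [0, 5]
print(must_force(4, sent_to, min_to, cl, c1=False))   # 4 > 2, but 4 <= cl[1]
print(must_force(4, sent_to, min_to, cl, c1=True))    # C1 holds
print(must_force(6, sent_to, min_to, cl, c1=False))   # 6 > cl[1]
```

In the first call the earlier condition alone would have forced a checkpoint; the refinement avoids it because the perceived clock of the destination is already at least as large as the message timestamp.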

Figure 13.19 The value of $cl_i[k]$ has been brought to $P_i$ by a causal Z-path [16]. Panels (a) and (b).

Figure 13.20 The value of $cl_i[k]$ has been brought to $P_i$ by a causal Z-path [16]. Panels (a) and (b).

Figure 13.21 An example of a Z-cycle [16].
To evaluate the predicate $C_1$, each process maintains two additional arrays:
1. Array $ckpt_i$ is a vector that counts the number of checkpoints taken by each process. So, $ckpt_i[k]$ denotes the number of checkpoints taken by $P_k$ to $P_i$'s knowledge. Let $m.ckpt$ be the value appended to $m$ by its sender $P_i$, which is the value of the array $ckpt_i$ at the time of sending of message $m$.
2. A boolean array $taken_i$ is used in conjunction with $ckpt_i$ to evaluate $C_1$. The value of $taken_i[k]$ is true iff there is a causal Z-path from the last checkpoint of $P_k$ known by $P_i$ to the next checkpoint of $P_i$ and this causal Z-path includes a checkpoint.
The array $taken_i$ is updated in the following way:
• When a process $P_i$ takes a checkpoint, it sets to true all entries of $taken_i$ except $taken_i[i]$, which always remains false: $\forall k \ne i$: $taken_i[k] = true$.
• When process $P_i$ sends a message, $P_i$ appends the current value of $taken_i$ to the message.
• When process $P_i$ receives $m$, $P_i$ updates $taken_i$ in the following way:
$\forall k \ne i$ do case
    $m.ckpt[k] < ckpt_i[k]$ → skip
    $m.ckpt[k] > ckpt_i[k]$ → $taken_i[k] = m.taken[k]$
    $m.ckpt[k] = ckpt_i[k]$ → $taken_i[k] = taken_i[k] \lor m.taken[k]$
end do case
With these data structures, the predicate $C_1$ can be expressed as follows:
$C_1 \equiv m_1.ckpt[i] = ckpt_i[i] \land m_1.taken[i]$
Consider the example shown in Figure 13.21. The first part of the condition $C_1$ states that there is a causal Z-path starting from $C_{i,x}$ and arriving at $P_i$ before $C_{i,x+1}$, while the second part indicates that some process has taken a checkpoint along this causal Z-path.
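
The update rules for the taken array and the test for $C_1$ can be sketched as pure functions. The function names are hypothetical, processes are indexed from 0, and the maintenance of the ckpt array itself is left out because the rules above do not describe it.

```python
def taken_after_checkpoint(i, n):
    """Every entry becomes true except entry i, which always stays false."""
    return [k != i for k in range(n)]


def merge_taken(i, ckpt, taken, m_ckpt, m_taken):
    """Apply the three-way case rule on receipt of a message at P_i."""
    new = list(taken)
    for k in range(len(taken)):
        if k == i:
            continue
        if m_ckpt[k] > ckpt[k]:
            new[k] = m_taken[k]                 # sender knows a newer checkpoint of P_k
        elif m_ckpt[k] == ckpt[k]:
            new[k] = taken[k] or m_taken[k]     # same checkpoint: combine knowledge
        # m_ckpt[k] < ckpt[k]: skip, the message carries older information
    return new


def c1(i, ckpt, m_ckpt, m_taken):
    """C1: the message refers to P_i's current checkpoint and its taken[i] is set."""
    return m_ckpt[i] == ckpt[i] and m_taken[i]


# Hypothetical state at P_0 and a hypothetical incoming message.
ckpt, taken = [2, 1, 1], [False, False, True]
m_ckpt, m_taken = [2, 2, 1], [True, True, False]
print(merge_taken(0, ckpt, taken, m_ckpt, m_taken))
print(c1(0, ckpt, m_ckpt, m_taken))
print(taken_after_checkpoint(0, 3))
```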

13.10.2 The checkpointing protocol
Next (see Algorithm 13.4) we describe the Helary–Mostefaoui–Netzer–Raynal
communication-induced checkpointing protocol, which takes as few forced
checkpoints as possible and also ensures that no local checkpoint is useless.