# 13.8 Manivannan–Singhal quasi-synchronous checkpointing algorithm

Checkpointing and rollback recovery

current value of its counter. This strategy helps to reduce the checkpointing
overhead, i.e., the number of checkpoints taken. An alternative approach to
reduce the number of checkpoints is to allow a process to delay processing a
received message until the sequence number of its latest checkpoint is greater
than or equal to the sequence number received in the message.

13.8.1 Checkpointing algorithm
Now, we present the quasi-synchronous checkpointing algorithm formally
(Algorithm 13.2). The variable nexti of process Pi represents its local counter.
It keeps track of the current number of checkpoint intervals at process Pi .
The value of the variable sni represents the sequence number of the latest
checkpoint of Pi at any time. So, whenever a new checkpoint is taken, the
checkpoint is assigned a sequence number and sni is updated accordingly.
C.sn denotes the sequence number assigned to checkpoint C and M.sn denotes
the sequence number piggybacked to message M.

Properties
When processes take checkpoints in this manner, checkpoints satisfy the
following interesting properties ($C_{i,k}$ denotes a checkpoint with sequence
number $k$ at process Pi):
1. Checkpoint $C_{i,m}$ of process Pi is concurrent with checkpoints $C_{*,m}$ of all
other processes. For example, in Figure 13.15, checkpoint $C_{2,3}$ is concurrent
with checkpoints $C_{1,3}$ and $C_{3,3}$.
2. Checkpoints $C_{*,m}$ of all processes form a consistent global checkpoint. For
example, in Figure 13.15, checkpoints {$C_{1,4}$, $C_{2,4}$, $C_{3,4}$} form a consistent
global checkpoint. An interesting application of this result is that if process
P3 crashes and restarts from checkpoint $C_{3,5}$ (in Figure 13.15), then P1
will need to take a checkpoint $C_{1,5}$ (without rolling back) and the set of
checkpoints {$C_{1,5}$, $C_{2,5}$, $C_{3,5}$} will form a consistent global checkpoint.
Since there may be gaps in the sequence numbers assigned to checkpoints
at a process, we have the following result:
3. The checkpoint $C_{i,m}$ of process Pi is concurrent with the earliest checkpoint
$C_{j,n}$ at process Pj such that $m \le n$. For example, in Figure 13.15,
checkpoints {$C_{1,3}$, $C_{2,2}$, $C_{3,2}$} form a consistent global checkpoint.
The following corollary gives a sufficient condition for a set of local
checkpoints to be a part of a global checkpoint.
Corollary 13.1 Let $S = \{C_{i_1,m_{i_1}}, C_{i_2,m_{i_2}}, \ldots, C_{i_k,m_{i_k}}\}$ be a set of local checkpoints
from distinct processes. Let $m = \min\{m_{i_1}, m_{i_2}, \ldots, m_{i_k}\}$. Then, $S$ can
be extended to a global checkpoint if $\forall l$ $(1 \le l \le k)$, $C_{i_l,m_{i_l}}$ is the earliest
checkpoint of $P_{i_l}$ such that $m_{i_l} \ge m$.
The following corollary gives a sufficient condition for a global checkpoint
to be consistent.

Data structures at process Pi:
    sni = 0;      {sequence number of the current checkpoint, initialized to 0;
                   updated every time a new checkpoint is taken}
    nexti = 1;    {sequence number to be assigned to the next basic checkpoint;
                   initialized to 1}

When it is time for process Pi to increment nexti:
    nexti = nexti + 1;    {nexti is incremented at periodic time intervals of X time units}

When process Pi sends a message M:
    M.sn = sni;    {sequence number of the current checkpoint is appended to M}
    send(M);

When process Pj receives a message M from process Pi:
    if snj < M.sn then    {if the sequence number of the current checkpoint is less than
                           the checkpoint number received in the message, then take a
                           new checkpoint before processing the message}
        Take checkpoint C;
        C.sn = M.sn;
        snj = M.sn;
    Process the message.

When it is time for process Pi to take a basic checkpoint:
    if nexti > sni then    {skips taking a basic checkpoint if nexti ≤ sni, i.e., if it
                            already took a forced checkpoint with sequence number ≥ nexti}
        Take checkpoint C;
        sni = nexti;
        C.sn = sni;

Algorithm 13.2 Manivannan–Singhal quasi-synchronous checkpointing algorithm [25].
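
The event handlers of Algorithm 13.2 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the class names `Message` and `Process`, the attribute `next_sn` (standing for nexti), and the `checkpoints` list that stands in for stable storage are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    payload: str
    sn: int  # M.sn: sender's latest checkpoint sequence number


@dataclass
class Process:
    name: str
    sn: int = 0        # sequence number of the latest checkpoint
    next_sn: int = 1   # sequence number for the next basic checkpoint
    # Stand-in for stable storage: sequence numbers of saved checkpoints.
    checkpoints: list = field(default_factory=lambda: [0])

    def tick(self) -> None:
        """Called every X time units: advance the local counter."""
        self.next_sn += 1

    def send(self, payload: str) -> Message:
        """Piggyback the current checkpoint sequence number on the message."""
        return Message(payload, self.sn)

    def receive(self, m: Message) -> bool:
        """Take a forced checkpoint first if M.sn is ahead; return True if taken."""
        forced = self.sn < m.sn
        if forced:
            self.sn = m.sn
            self.checkpoints.append(self.sn)
        # The application would process m.payload here.
        return forced

    def basic_checkpoint(self) -> bool:
        """Take a basic checkpoint unless a forced one already covers next_sn."""
        if self.next_sn > self.sn:
            self.sn = self.next_sn
            self.checkpoints.append(self.sn)
            return True
        return False
```

In this sketch, a process whose latest checkpoint has sequence number 1 and whose counter is 2 takes a forced checkpoint numbered 2 when it receives a message carrying `sn == 2`; its next call to `basic_checkpoint()` then returns `False`, which is the skipping behaviour the algorithm relies on to reduce the number of checkpoints.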

Corollary 13.2 Let $S = \{C_{1,m_1}, C_{2,m_2}, \ldots, C_{N,m_N}\}$ be a set of local checkpoints,
one for each process. Let $m = \min\{m_1, m_2, \ldots, m_N\}$. Then, $S$ is a
consistent global checkpoint if $\forall i$ $(1 \le i \le N)$, $C_{i,m_i}$ is the earliest checkpoint of $P_i$
such that $m_i \ge m$.
These properties have strong implications for failure recovery. The
task of finding a consistent global checkpoint after a failure is considerably
simplified. If the failed process rolls back to a checkpoint with sequence
number m, then all other processes simply need to roll back to the earliest
local checkpoint $C_{*,n}$ such that m ≤ n.
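
The selection rule above (for each process, take the earliest local checkpoint whose sequence number is at least m) can be sketched as a lookup over sorted sequence numbers. The function names and the checkpoint table below are assumptions for illustration; the table is chosen only to agree with the checkpoints that the text names for Figure 13.15.

```python
from bisect import bisect_left


def earliest_at_or_after(seq_numbers, m):
    """Return the smallest sequence number >= m in a sorted list, or None."""
    i = bisect_left(seq_numbers, m)
    return seq_numbers[i] if i < len(seq_numbers) else None


def earliest_checkpoints(checkpoints, m):
    """Map each process to its earliest checkpoint with sequence number >= m.

    A value of None means the process has no such checkpoint yet and would
    have to take a new one with sequence number m instead of rolling back.
    """
    return {p: earliest_at_or_after(sns, m) for p, sns in checkpoints.items()}


# Assumed checkpoint sequence numbers per process (sorted, gaps allowed).
figure_13_15 = {
    "P1": [0, 3, 4],
    "P2": [0, 2, 3, 4, 5],
    "P3": [0, 1, 2, 3, 4, 5],
}
```

With this table, `earliest_checkpoints(figure_13_15, 2)` selects sequence numbers 3, 2, and 2 for P1, P2, and P3, matching the example given with property 3, and `earliest_checkpoints(figure_13_15, 5)` reports `None` for P1, which corresponds to P1 taking a new checkpoint numbered 5.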
Example We illustrate the basic idea behind the checkpointing algorithm
using an example.

Consider a system consisting of three processes P1, P2, and P3, shown in
Figure 13.15. Forced checkpoints are marked “∗” in the figure, basic
checkpoints carry a different marker, and the sequence number assigned to
each checkpoint is shown next to it. Each process Pi increments its variable nexti
every x time units. Process P3 takes a basic checkpoint every x time units, P2
takes a basic checkpoint every 2x time units, and P1 takes a basic checkpoint
every 3x time units. Message M0 forces P3 to take a forced checkpoint with
sequence number 2 before processing M0. As a result, P3 skips taking
a basic checkpoint with sequence number 2. Message M1 forces process
P2 to take a forced checkpoint with sequence number 3 before processing
M1, because M1.sn > sn2 when the message is received. Similarly, message M2
forces P1 to take a checkpoint before processing the message, and M4 forces
P2 to take a checkpoint before processing the message. However, M3 does
not force process P3 to take a checkpoint before processing it. Note that
there may be gaps in the sequence numbers assigned to checkpoints at a
process.

13.8.2 Recovery algorithm
The recovery process is asynchronous; that is, when a process fails, it just
rolls back to its latest checkpoint and broadcasts a rollback request message to
every other process and continues its processing without waiting for any reply
message from them. The recovery is based on the assumption that if a process
Pi fails, then no other process fails until the system is restored to a consistent
state. In addition to the variables defined in the checkpointing algorithm,
each process also maintains two other variables: inci and rec_linei. The variable inci is
the incarnation number of process Pi; it is incremented every time the process
fails and restarts from its latest checkpoint. The variable rec_linei is the recovery
line number. These variables are kept in stable storage so that they
are available for recovery. Initially, ∀i, inci = 0 and rec_linei = 0.
With each message M, the current values of the three variables inci, sni, and
rec_linei are piggybacked. The values of these variables piggybacked on M are
denoted by M.inc, M.sn, and M.rec_line, respectively. C.sn denotes the sequence
number of checkpoint C. We present the basic recovery algorithm formally in
Algorithm 13.3.

Data structures at process Pi:
    integer sni = 0;
    integer nexti = 1;
    integer inci = 0;
    integer rec_linei = 0;

Checkpointing algorithm:
    When it is time for process Pi to increment nexti:
        nexti = nexti + 1;
    When it is time for process Pi to take a basic checkpoint:
        if (nexti > sni) {
            Take checkpoint C;
            C.sn = nexti;
            sni = C.sn;
        }
    When process Pi sends a message M:
        M.sn = sni;
        M.rec_line = rec_linei;
        M.inc = inci;
        send(M);
    When process Pj receives a message M:
        if (M.inc > incj) {
            rec_linej = M.rec_line;
            incj = M.inc;
            Roll_Back(Pj);
        }
        if (M.sn > snj) {
            Take checkpoint C;
            C.sn = M.sn;
            snj = C.sn;
        }
        Process the message;

Basic recovery algorithm (BRA):
    Recovery initiated by process Pi after failure:
        Restore the latest checkpoint;
        inci = inci + 1;
        rec_linei = sni;
        send rollback(inci, rec_linei) to all other processes;
        resume normal execution;
    Process Pj upon receiving rollback(inci, rec_linei) from Pi:
        if (inci > incj) {
            incj = inci;
            rec_linej = rec_linei;
            Roll_Back(Pj);
            continue as normal;
        }
        else
            Ignore the rollback message;
    Procedure Roll_Back(Pj):
        if (rec_linej > snj) {
            Take checkpoint C;
            C.sn = rec_linej;
            snj = C.sn;
        }
        else {
            Find the earliest checkpoint C with C.sn ≥ rec_linej;
            snj = C.sn;
            Restore checkpoint C;
            Delete all checkpoints after C;
        }

Algorithm 13.3 Manivannan–Singhal quasi-synchronous recovery algorithm [25].

An explanation
When process Pi fails, it rolls back to its latest checkpoint, broadcasts a
rollback(inci, rec_linei) message to all other processes, and continues its normal
execution. Upon receiving this rollback message, a process Pj rolls back
to its earliest checkpoint whose sequence number is ≥ rec_linei and continues
normal execution. If process Pj does not have such a checkpoint, it takes
a checkpoint with sequence number equal to rec_linei and continues
normally. Due to message delays, the broadcast message might be delayed, and
a process Pj may come to know about a rollback indirectly through some other
process that has already seen the rollback message. Since every message
carries M.inc, M.sn, and M.rec_line, an application message that Pj receives
can reveal a rollback incarnation started by some other process. If
process Pj receives such a message M with M.inc > incj, then Pj infers that
some failed process has initiated a rollback with incarnation number M.inc, and
Pj rolls back to its earliest checkpoint whose sequence number is ≥ M.rec_line;
if Pj later receives a rollback message corresponding to this incarnation, it
ignores it. Thus, after learning directly or indirectly about the failure of a
process Pi, all other processes roll back to their earliest checkpoint whose
sequence number is greater than or equal to rec_linei. If a process does not
have such a checkpoint, it takes a checkpoint, adds it to the recovery line, and
proceeds normally. Note that not every process needs to roll back: a process
with no checkpoint at or beyond the recovery line only takes a new checkpoint.
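
The rollback handling described above can be sketched in Python. This is a hedged illustration under stated assumptions: the class name `RecoveringProcess`, the returned status strings, and the `checkpoints` list (a sorted stand-in for stable storage whose last entry always equals `sn`) are invented for the example, and process state itself is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class RecoveringProcess:
    sn: int = 0        # sequence number of the latest checkpoint
    inc: int = 0       # incarnation number
    rec_line: int = 0  # recovery line number
    # Sorted sequence numbers of saved checkpoints; the last entry equals sn.
    checkpoints: list = field(default_factory=lambda: [0])

    def recover(self):
        """Failed process: restore latest checkpoint, start a new incarnation."""
        self.sn = self.checkpoints[-1]
        self.inc += 1
        self.rec_line = self.sn
        return (self.inc, self.rec_line)  # contents of the rollback message

    def roll_back(self) -> str:
        if self.rec_line > self.sn:
            # No checkpoint at or beyond the recovery line: take one, no rollback.
            self.sn = self.rec_line
            self.checkpoints.append(self.sn)
            return "took checkpoint"
        # Earliest checkpoint with sequence number >= rec_line; one exists
        # because the latest checkpoint has sequence number sn >= rec_line.
        target = next(c for c in self.checkpoints if c >= self.rec_line)
        self.checkpoints = [c for c in self.checkpoints if c <= target]
        self.sn = target
        return "restored"

    def on_rollback(self, inc: int, rec_line: int) -> str:
        """Handle rollback(inc, rec_line); stale incarnations are ignored."""
        if inc > self.inc:
            self.inc, self.rec_line = inc, rec_line
            return self.roll_back()
        return "ignored"
```

For instance, if a failed process with checkpoints 0 to 5 calls `recover()`, it produces the message contents `(1, 5)`. A peer holding checkpoints `[0, 2, 3, 4, 5]` answers `on_rollback(1, 5)` with `"restored"`, a peer holding only `[0, 3, 4]` answers `"took checkpoint"` and ends with `sn == 5`, and a second delivery of the same message is ignored because the incarnation number is no longer newer.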
Example We illustrate the basic recovery using the example in Figure 13.15.
Suppose process P3 fails at the instant shown. When P3 recovers, it rolls
back to its latest checkpoint $C_{3,5}$, increments inc3 to 1, sets rec_line3
to sn3 = 5, and sends a rollback(1, 5) message to all other processes. Upon
receiving this message, P2 rolls back to checkpoint $C_{2,5}$, since $C_{2,5}$ is the earliest
checkpoint at P2 with sequence number ≥ 5. However, since P1 does not have
a checkpoint with sequence number greater than or equal to 5, it takes a local
checkpoint and assigns 5 as its sequence number. Thus, {$C_{1,5}$, $C_{2,5}$, $C_{3,5}$} is
the recovery line corresponding to this failure.

Figure 13.15 An example illustrating the Manivannan–Singhal algorithm [25]. (Timelines of P1, P2, and P3 with checkpoint sequence numbers, messages M0 to M4, and the failure of P3.)
Thus, the recovery is simple. The failed process (on recovering from the
failure) rolls back to its latest checkpoint and requests the other processes to
roll back to a consistent checkpoint, which they can determine solely
from local information. There is no domino effect, and the recovery
is fast and efficient.
In this example, the sequence numbers of all checkpoints in the
recovery line are the same, but this need not always be the case.

13.8.3 Comprehensive message handling
Rollback to a consistent recovery line may result in lost, delayed,
orphan, or even duplicated messages. The existence of these types of messages
may lead the system to an inconsistent state. Next, we discuss how to
modify the BRA to handle these messages.

Handling the replay of messages
Not all messages stored in the stable storage need to be replayed. The BRA
has to be modified so that it can decide which messages need to be replayed.
In Figure 13.16, assume that process P1 fails at the point marked X and
initiates a recovery with a new incarnation. After the failure it rolls back to its
latest checkpoint, $C_{1,10}$, increments its incarnation number inc1 to 1, sets
rec_line1 to 10, and sends a rollback(1, 10) message to all other processes.
Upon receiving the rollback message from P1, process P2 rolls back to its
checkpoint $C_{2,12}$. Likewise, all other processes roll back to an appropriate
checkpoint following the BRA. After all the processes have rolled
back to a set of consistent checkpoints, these checkpoints form a recovery
line with number 10. Messages sent to the left of the recovery line carry
incarnation number 0, and messages sent to the right of the recovery line carry
incarnation number 1.

Figure 13.16 Handling of messages during the recovery [25]. (Timelines of P1 to P4 with checkpoint sequence numbers and messages M1 to M8; the legend distinguishes messages sent before the rollback from messages sent after it.)
To avoid lost messages, when a process rolls back it must replay all
messages from its log whose receive was undone and whose send was not
undone. In other words, a process must replay only those messages that
originated to the left of the recovery line and were delivered to the right of
the recovery line. In the example, after the rollback process P2 must replay
messages M1 and M2 from its log but must not replay M3, because the sends of
M1 and M2 were not undone but the send of M3 was. It is easy to determine
the origin of the send of a message M by looking at the piggybacked sequence
number (M.sn). Therefore, we can state a rule for replaying messages as
follows:
Message replay rule: After a process Pj rolls back to checkpoint C, it replays a
message M only if it was received after C and if M.sn < recovery line number.
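
The replay rule can be sketched as a filter over the message log. Two things here are assumptions made for the illustration and not taken from the source: each log entry records `recv_sn`, the receiver's checkpoint sequence number at the time of the receive (so "received after C" is tested as `recv_sn >= C.sn`), and the sequence numbers attached to M1, M2, and M3 are hypothetical values, since the figure's exact values are not recoverable from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoggedMessage:
    name: str
    sn: int       # M.sn piggybacked by the sender
    recv_sn: int  # receiver's sn when M was received (assumed log field)


def messages_to_replay(log, restored_sn, rec_line):
    """Apply the replay rule after rolling back to the checkpoint restored_sn.

    A message is replayed only if it was received after the restored
    checkpoint and its send lies to the left of the recovery line.
    """
    return [m for m in log if m.recv_sn >= restored_sn and m.sn < rec_line]


# Hypothetical log at P2, which rolled back to sequence number 12 while the
# recovery line number is 10.
log_p2 = [
    LoggedMessage("old", sn=6, recv_sn=9),   # received before the checkpoint
    LoggedMessage("M1", sn=8, recv_sn=12),   # send not undone: replay
    LoggedMessage("M2", sn=8, recv_sn=12),   # send not undone: replay
    LoggedMessage("M3", sn=10, recv_sn=12),  # send undone: do not replay
]
```

Calling `messages_to_replay(log_p2, restored_sn=12, rec_line=10)` keeps only the entries named M1 and M2, which mirrors the outcome described in the example.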

Handling of received messages
This section discusses how a process handles received messages. Suppose
process Pj receives a message M from process Pi. If Pj is replaying messages
from the message log when M arrives, then Pj buffers M and processes it only
after it has finished replaying the logged messages. When M is processed, one
of the following three cases applies.

Case 1: M is a delayed message
A delayed message with respect to a recovery line carries an incarnation
number less than the incarnation number of the receiving process. The process
Pi that sent such a message M was not aware of the recovery at the
time it sent M, so the piggybacked incarnation number of Pi
is less than the latest incarnation number of the receiving process Pj. In this
situation, if M.sn < rec_linej, then M is first logged in the message log and
then processed; otherwise, it is discarded, because Pi will eventually roll back
and resend the message. In Figure 13.16, M4 is logged and then processed by
P2, so that P2 can replay M4 if a later failure requires it,
whereas M5 is discarded by P2 because M.sn > rec_line2
(11 > 10) and M.inc = 0 is less than inc2 = 1. Therefore, we have the
following rule for handling delayed messages:
Rule for determining and handling delayed messages: A delayed message M received
by process Pj has M.inc less than incj. Such a delayed message is processed by process
Pj only if M.sn < rec_linej; otherwise, it is discarded.

Case 2: M was sent in the current incarnation
Suppose Pj receives a message M such that incj = M.inc . In this case, if M.sn
< snj , then Pj must log M before processing it. This is done because Pj
might need to replay M due to a future failure. For example, in Figure 13.16,
message M7 is sent by process P1 to process P2 after P1 ’s recovery and after
P2 ’s rollback during the same incarnation. In this case, M.inc = inc2 = 1 and
M.sn = 10 < sn2 = 12 , and M7 must be logged before being processed
because P2 might have to roll back to checkpoint C2 12 in case of a failure.
In that case, P2 will need to replay message M7 . Therefore, the rule for
message logging in this case is stated as follows:
Message logging rule: A message M received by process Pj is logged into the message
log if (M.inc < incj and M.sn < rec_linej) or (M.inc = incj and M.sn < snj).

Case 3: Message M was sent in a future incarnation
In this case, M.inc > incj and Pj handles it as follows: Pj sets rec_linej to
M.rec_line and incj to M.inc, and then rolls back to the earliest checkpoint with
sequence number ≥ rec_linej. After the rollback, message M is handled as
in Case 2, because M.inc = incj.
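
The three cases can be summarized as a single decision function. The sketch below is an illustration under assumptions: the names `Action` and `classify` are invented, and the `PROCESS` outcome for a current-incarnation message with M.sn ≥ snj is one reading of the text, which states only when such a message must be logged.

```python
from enum import Enum


class Action(Enum):
    DISCARD = "discard"
    LOG_AND_PROCESS = "log, then process"
    PROCESS = "process without logging"
    ROLL_BACK_FIRST = "roll back, then handle as Case 2"


def classify(m_inc, m_sn, inc_j, sn_j, rec_line_j):
    """Decide how receiver Pj handles a message with piggybacked (inc, sn)."""
    if m_inc < inc_j:
        # Case 1: delayed message from an older incarnation.
        return Action.LOG_AND_PROCESS if m_sn < rec_line_j else Action.DISCARD
    if m_inc == inc_j:
        # Case 2: message sent in the current incarnation.
        return Action.LOG_AND_PROCESS if m_sn < sn_j else Action.PROCESS
    # Case 3: the sender already knows of a newer incarnation.
    return Action.ROLL_BACK_FIRST
```

Using the values quoted for P2 in Figure 13.16 (inc2 = 1, sn2 = 12, rec_line2 = 10), `classify(0, 11, 1, 12, 10)` gives `Action.DISCARD`, as for M5, and `classify(1, 10, 1, 12, 10)` gives `Action.LOG_AND_PROCESS`, as for M7.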

Features
The Manivannan–Singhal quasi-synchronous checkpointing algorithm has
several interesting features:
• Communication-induced checkpointing intelligently guides the checkpointing activities to eliminate “useless” checkpoints. Thus, every checkpoint lies on a consistent global checkpoint.
• There is no extra message overhead involved in checkpointing. Only a
scalar is piggybacked on application messages.
• It ensures that a recovery line consistent with the latest checkpoint of any process exists at all times. This helps bound the depth of rollback
during rollback recovery.
• A failed process rolls back to its latest checkpoint and requests other
processes to roll back to a consistent checkpoint (no domino effect).
• It helps in garbage collection. After a process has established a recovery
line, all checkpoints preceding the line can be deleted.
• The algorithm achieves the best of both worlds:
  • it has the simplicity and low overhead of uncoordinated checkpointing;
  • it has the recovery-time advantages of coordinated checkpointing.

13.9 Peterson–Kearns algorithm based on vector time
The Peterson–Kearns [28] checkpointing and recovery protocol is based on
optimistic rollback. Vector time is used to capture causality and to identify
events and messages that become orphans when a failed process rolls back.

13.9.1 System model
We assume that there are N processors in the system, which are logically
configured as a ring. Each processor knows its successor on the ring and
this knowledge is stored in its stable storage since it is critical that it be
recoverable after a failure. We assume a single process is executing on each
processor. These N processes are denoted as P0, P1, P2, ..., PN−1. We assume
that P(i+1) mod N is the successor of Pi for 0 ≤ i < N.
Each process Pi has a vector clock Vi [j], 0 ≤ j ≤ N − 1. Vi (ei ) denotes
the clock value of an event ei which occurred at Pi . The ith component of
the vector is incremented before each event at process Pi and the current
timestamp vector is sent on each message to update the receiving process’s
clock. Vi (pi ) denotes the current vector clock time of process Pi and ei denotes
the most recent event in Pi . Thus Vi (pi ) = Vi (ei ). Each send and receive
event increments the vector time. The processes take periodic checkpoints
of process state and also maintain a message log on the stable storage. The
receipt of incoming messages is also logged periodically. The current vector
clock value is considered a part of the process state and is logged to the stable
storage when a checkpoint is taken.
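
The clock rules in this model can be sketched as a small vector clock class. The component-wise maximum used on receive is the standard vector clock merge and is an assumption here, since the text says only that the message timestamp updates the receiver's clock; the names `VectorClock` and `happened_before` are likewise illustrative.

```python
class VectorClock:
    """Vector clock of process i in a system of n processes."""

    def __init__(self, n: int, i: int):
        self.i = i
        self.v = [0] * n

    def local_event(self):
        """Increment this process's own component before the event."""
        self.v[self.i] += 1
        return list(self.v)

    def send(self):
        """A send is an event; the resulting timestamp travels with the message."""
        return self.local_event()

    def receive(self, ts):
        """A receive is an event; then merge in the message timestamp."""
        self.v[self.i] += 1
        self.v = [max(a, b) for a, b in zip(self.v, ts)]
        return list(self.v)


def happened_before(a, b) -> bool:
    """True if timestamp a causally precedes timestamp b."""
    return all(x <= y for x, y in zip(a, b)) and a != b
```

In a three-process system, a send at P0 is stamped `[1, 0, 0]`, and the matching receive at P1 is stamped `[1, 1, 0]`, so `happened_before` orders the send before the receive. Comparisons of this kind are what allow a protocol to identify events that causally depend on work a failed process has lost.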

Notation
The following notation is used to explain the algorithm:
• $e_j^i$: The $i$th event on Pj. We use $e'$ and $e''$ to refer to generic events of Pj.
• $s$: A send event of the underlying computation.
• s : The process where send event s occurs.
• s : The process where the receive event matched with send event s
occurs.
• $f_j^i$: The $i$th failure on Pj.
• $ck_j^i$: The $i$th state checkpoint on Pj. The checkpoint resides on the stable
storage.
• $rs_j^i$: The $i$th restart event on Pj.
• $rb_j^i$: The $i$th rollback event on Pj.
• $LastEvent(f_j^i) = e'$ iff $e' \rightarrow rs_j^i$.

In a rollback protocol, every process must be contacted at least once to
indicate that a failure has occurred and to send it the information necessary
for recovery. This process is characterized as a series of one or more polling
waves which are typified by the arrival of a polling message which transmits
information necessary for rollback and a response by the polled process. We
define two new event types:
• $C_{i,k}(m)$: The arrival of the final polling wave message for rollback from
failure $f_i^m$ at process $P_k$.
• $w_{i,k}(m)$: The response to this final polling wave by $P_k$. If no response is
required, $w_{i,k}(m) = C_{i,k}(m)$.

The final polling wave for recovery from failure $f_i^m$ is defined as:

$$PW_i(m) = \bigcup_{k=0}^{N-1} w_{i,k}(m) \;\cup\; \bigcup_{k=0}^{N-1} C_{i,k}(m)$$

13.9.2 Informal description of the algorithm
When a process Pi restarts after failure $f_i^m$, it retrieves its latest checkpoint,
including its vector clock value $V_i(\mathrm{Latest\ ck}(f_i^m))$, from the stable storage and rolls back to it. The message log is replayed until it is exhausted.
Since the vector time of each message is logged with the message, when the
messages are replayed, the clock value of the recovering process is appropriately updated. After the logged messages have been replayed, the recovering
process executes a restart event, $rs_i^m$, to begin the global rollback protocol, originates a token message containing the vector timestamp of $rs_i^m$, and
sends the token to its successor process. The token associated with failure
$f_i^m$ and restart event $rs_i^m$ is denoted by tk(i, m). The timestamp of this token
is denoted as tk(i, m).ts. Process Pi buffers all incoming application messages until the return of the token. When this occurs, Pi resumes normal
execution.