
manager can cause attempted communications with hard-to-reach (i.e., temporarily congested)
destination(s) to be rejected. It makes good sense to reject such communications close to
their point of origin, since rejection of traffic early in the communication path frees as many
network resources as possible, which can then be put to good use in serving communications
between unaffected points of the network.

Expansive control actions
There are many examples of expansive actions. Perhaps the two most worthy of note are:
• temporary alternative re-routing (TAR); and
• network restoration or link back-up.

Temporary alternative re-routing (TAR)
The use of idle capacity via third points is the basis of an expansive action called temporary
alternative re-routing (TAR). Re-routing is generally invoked only from computer controlled
switches where routing table changes can be made easily. It involves temporarily using a
different route to a particular destination. In Figure 14.21, the direct link (or maybe one of
a number of direct links) between routers A and B has failed, resulting in congestion. This
will change the reachability of destinations and the cost of the alternative paths to particular
destinations, as calculated by the routing protocol (as we discussed in Chapter 6).
Some routes will thus change to temporary alternative routes (in the example of
Figure 14.21, the temporary alternative route from router A to router B will be via router C).
The routing tables of all the routers in the network may be changed during the period of the
link failure to reflect the temporary routes which are to be used. The change will typically
occur within about 5 minutes. A reversion to the direct route occurs after recovery of the
failed link.
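The route recomputation which underlies TAR can be sketched in a few lines of Python. The topology and link costs below are illustrative assumptions (they are not taken from Figure 14.21): when the direct A-B link is removed, a standard shortest-path calculation of the kind performed by link-state routing protocols selects the temporary alternative route via router C.

```python
# Hypothetical sketch of TAR route recomputation after a link failure.
# Node names and link costs are assumed for illustration.
import heapq

def shortest_path(links, source, dest):
    """Dijkstra over a dict {(node, node): cost}; returns (cost, path)."""
    graph = {}
    for (a, b), cost in links.items():
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))
    queue = [(0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dest:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, link_cost in graph.get(node, []):
            if neighbour not in visited:
                heapq.heappush(queue, (cost + link_cost, neighbour, path + [neighbour]))
    return float("inf"), []

links = {("A", "B"): 1, ("A", "C"): 2, ("C", "B"): 2}   # normal topology
print(shortest_path(links, "A", "B"))                    # direct route: (1, ['A', 'B'])

del links[("A", "B")]                                    # link failure
print(shortest_path(links, "A", "B"))                    # TAR via C: (4, ['A', 'C', 'B'])
```

In a real network each router performs this recalculation independently from its routing database, and reversion to the direct route follows once the recovered link is advertised as available again.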

Network restoration
Network restoration is made possible by providing more plant in the network than the normal
traffic load requires. During times of failure this ‘spare’ or restoration plant is used to ‘stand
in’ for the faulty equipment, for example, a failed cable or transmission system. By restoring

Figure 14.21 Temporary alternative routing (TAR) to overcome a link failure.


service with spare equipment, the faulty line system, switch or other equipment can be removed
from service and repaired more easily.
Network restoration techniques have historically been applied to transmission links on a 1
for N basis, i.e., 1 unit of restoration capacity for every N traffic-carrying units. The following
example shows how 1 for N restoration works. Between two points of a network, A and B,
a number of transmission systems are required to carry the traffic (see Figure 14.22). These
are to be provided in accordance with a 1-in-4 restoration scheme. One example of how this
could be met (Figure 14.22a) is with 5 systems, operated as 4 fully-loaded transmission lines
plus a separate spare. Automatic changeover equipment is used to effect instant restoration
of any of the active cables, by switching their load to the ‘spare’ cable should they fail.10 An
alternative but equally valid 1-in-4 configuration is to load each of the five cables at four-fifths
capacity (Figure 14.22b). Should any of the cables fail, its traffic load must be restored in four
parts — each of the other cables taking a quarter of the load of the failed cable.
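The arithmetic of the two 1-in-4 configurations can be summarised in a short sketch. The function below is purely illustrative and assumes equal-capacity transmission systems, as in Figure 14.22.

```python
# Illustrative arithmetic only: N traffic-carrying units protected on a
# 1 for N basis, using either of the two schemes of Figure 14.22.
def one_for_n_plan(n_traffic_units, scheme="dedicated_spare"):
    if scheme == "dedicated_spare":
        # Figure 14.22a: N fully-loaded lines plus one separate spare
        return {"systems": n_traffic_units + 1, "load_per_system": 1.0}
    elif scheme == "shared_load":
        # Figure 14.22b: N+1 lines each loaded to N/(N+1) of capacity;
        # a failed line's traffic is restored in N parts of 1/N each
        systems = n_traffic_units + 1
        return {"systems": systems, "load_per_system": n_traffic_units / systems}

print(one_for_n_plan(4, "dedicated_spare"))  # {'systems': 5, 'load_per_system': 1.0}
print(one_for_n_plan(4, "shared_load"))      # {'systems': 5, 'load_per_system': 0.8}
```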
In practice, not all cables (or network links) are of the same capacity and it is not always
practicable or economic to restore cables or links exactly as shown in the examples of
Figure 14.22, but the same basic principles can be applied. Another common practice used for
restoration is that of triangulation (concatenating a number of restoration links via third points
to enable full restoration). Figure 14.23 illustrates the principle of triangulation. In the simple
example shown, a cable exists from node A to node B, but there is no direct restoration path.
Restoration is provided instead by plant which is made available in the triangle of links A-C
and C-B. These restoration links are also used individually to restore simpler cable failures,
i.e., on the one-link connections such as A-C or B-C.

Figure 14.22 1 for N transmission link restoration.
10 It is wise to keep the ‘spare’ link ‘warm’ — i.e., active — since inactive plant tends not to work when called into action.


Figure 14.23 Restoration by triangulation.

Figure 14.24 Alternative paths A-B in an SDH or SONET network made up of subnetwork rings.

Because of the scope for triangulation, restoration networks (also called protection networks) are often designed on a network-wide basis. This enables overall restoration network
costs to be minimised without seriously affecting their resilience to problems.
Automatic restoration capabilities are built into many modern transmission technologies (e.g., SDH — synchronous digital hierarchy and SONET — synchronous optical network).
Thus in both SDH and SONET it is intended that highly resilient transmission networks should
be built up from inter-meshed rings and subnetworks (Figure 14.24). The use of a ring
topology alone leads to the possibility of alternative routing around the surviving ring arc, should
one side of the ring become broken due to a link failure. In addition, multiple cross-connect
points between ring subnetworks further ensure a multitude of alternative paths through larger
networks, as is clear from Figure 14.24. The possibilities are limited only by the capabilities of
the network planner to dimension the network and topology appropriately and the ability of the


network management system to execute the necessary path changes at times when individual
links fail or come back into service.

1 : 1 link restoration by means of back-up links
In smaller networks, where 1 for N restoration might be impractical or too costly, it is common
to provide for 1 : 1 restoration only of critical links in the network. Such back-up links for
data networks are often provided by means of one of the following different types of networks
or network services:
• standby links (links dedicated for back-up purposes, should the main (i.e., normal) link fail);
• VPN (virtual private network);
• dial back-up (telephone or ISDN — integrated services digital network); or
• radio.

Figure 14.25 illustrates a possible network configuration for providing 1 : 1 restoration or link
back-up. The ‘normal connection’ between routers A and B is a direct connection, dimensioned
with a bit rate sufficient to carry the ‘normal’ traffic which flows between the two routers.
Such a direct connection will generally be reliable and secure (i.e., not easy to snoop on
by outsiders). Two alternative back-up links are shown. The first is a VPN (virtual private
network) connection via a public Internet service provider’s (ISP) network. The second is
a dial-back-up connection (using either modems and the analogue telephone network or the
ISDN — integrated services digital network). Either, both or neither of the back-up links may
be in use at any given time.
Some routers are capable of automatically setting up the back-up or standby links when
they detect that the primary or main link has failed. Otherwise external devices may be used
to provide this functionality. Sometimes, the back-up link can simply be configured as a
permanent part of the network topology: the VPN link of Figure 14.25, for example, could be
configured as a direct link between routers A and B, but given a very high link cost weighting,

Figure 14.25 1 : 1 network restoration or link back-up.


so that the routing protocol will only select this route in preference to the ‘normal’ route during
times of failure of the direct link.
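A minimal sketch of this ‘high link cost’ technique is shown below. The cost values are assumptions for illustration; the point is simply that a least-cost routing decision falls back to the expensive VPN path only while the direct link is down.

```python
# Sketch of selecting between a direct link and a deliberately 'expensive'
# back-up (VPN) link. Costs are assumed for illustration only.
def select_route(direct_up, direct_cost=10, backup_cost=1000):
    candidates = {"backup (VPN)": backup_cost}
    if direct_up:
        candidates["direct"] = direct_cost
    # the routing protocol always prefers the lowest-cost usable path
    return min(candidates, key=candidates.get)

print(select_route(direct_up=True))   # 'direct'
print(select_route(direct_up=False))  # 'backup (VPN)'
```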
Using a VPN (virtual private network) service (e.g., an MPLS (multiprotocol label switching) connection across a public IP-based ‘backbone’ or a connection across a public frame
relay or ATM (asynchronous transfer mode) network) is a popular way of providing for network link back-up. Subscription charges must be paid for the VPN connection; usage charges
for carriage of data will, however, only be incurred during the periods of temporary failure of
the direct link. Most of the time, data is carried over the more secure direct link!
The use of modems and dial back-up across the analogue public switched telephone network
(PSTN) is not common nowadays due to the very limited bit rates (typically up to 56 kbit/s)
achievable by this method. Instead, ISDN (integrated services digital network — ‘digital telephone network’) is more common. Using ISDN back-up, ‘dial-up’ telephone connections of
64 kbit/s bit rate are provided between the two endpoints during times when the ‘normal’
direct link is adjudged to have ‘failed’. Determining what constitutes a ‘failure’ of the direct
link may be configurable. A ‘failure’ might be defined to be a ‘complete loss of communication
across the link’ or alternatively an ‘unacceptably poor quality of the direct link’ may similarly
be defined to be treated as a ‘failure’. (Radio links, for example, are rarely completely ‘lost’,
but radio interference may degrade the quality of communication to an unacceptable degree.)
It is important when using dial back-up that the switchover mechanism (between ‘normal’
and ‘dial-back-up’ links) is correctly configured to avoid flip-flopping between the links. If the
direct link is working only intermittently, the back-up link should remain in operation all the
time, and not be repeatedly switched on and off. Switching the connection over to a different
physical connection (main to back-up, or back-up to main) is a disruptive process, requiring
the lengthy processes of link synchronisation and data communications session recovery. It
should be undertaken as infrequently as possible.
Dial-back-up connections with bit rates higher than 64 kbit/s can be created by means
of bundling a number of individual 64 kbit/s connections and using reverse multiplexing
(Figure 14.26). The reverse multiplexor splits up the bit stream comprising the 384 kbit/s
connection into six separate 64 kbit/s bitstreams, which are then carried by separate dial-up

Figure 14.26 ISDN dial-back-up and reverse multiplexing.


ISDN connections to the destination. At the destination, the six separate data streams are
re-assembled in the correct order to recover the original 384 kbit/s data stream.
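The principle of reverse multiplexing can be illustrated with a short sketch. A byte-by-byte round-robin split over six channels is an assumption made here for simplicity; real reverse multiplexors work at the bit or frame level and must also compensate for the differing delays of the individual dial-up channels.

```python
# Sketch of reverse multiplexing: a 384 kbit/s stream split over six
# 64 kbit/s ISDN channels and reassembled in order at the far end.
CHANNELS = 6

def split(data: bytes, channels: int = CHANNELS):
    """Distribute successive bytes over the dial-up channels (round robin)."""
    return [data[i::channels] for i in range(channels)]

def reassemble(streams):
    """Interleave the channel streams back into the original byte order."""
    out = bytearray()
    for i in range(max(len(s) for s in streams)):
        for s in streams:
            if i < len(s):
                out.append(s[i])
    return bytes(out)

payload = b"example 384 kbit/s data stream"
assert reassemble(split(payload)) == payload
```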
Radio technology allows for the rapid establishment of transmission links across even
the most inhospitable terrain. It can be a useful method of augmenting network capacity or
backing up links which might otherwise take a long time to repair. Alternatively, radio links
are sometimes built as a permanent and relatively cheap means of network back-up. A large
number of different applications and users can share the same radio spectrum for network
back-up purposes, since it is unlikely that all the users will require the spectrum at once!

Restrictive control actions
Unfortunately, there will always be some condition under which no further expansive action
is possible. In Figure 14.27, for example, routes A-C and C-B may already be busy with
their own direct traffic, or may not be large enough for the extra demand imposed by A-B
traffic during the period of the failure shown. In this state, congestion on the route A-B
cannot be alleviated by an alternative route via C without causing other problems. Meanwhile,
attempts to reach the problematic destination (B) become a nuisance to other network users,
since the extra network load they create starts to hold up traffic between otherwise unaffected
end-points. When such a situation occurs, the best action is to refuse (or at least restrain)
communication with the affected destination, rejecting connections or packets as near to their
points of origination as possible. In the case of data networking, such call restriction may be
undertaken by means of flow control, ingress control or pacing.
The principle of congestion flow control is that the traffic demand is ‘diluted’ or packets
are ‘held up’ at the network node nearest their point of origin. A restricted number of packets
or data frames to the affected destination (corresponding to a particular packet rate or bit rate)
are allowed to pass into the network. Within the wider network, this reduces the network
overload and relieves congestion of traffic to other destinations, giving a generally better chance
of packet delivery across the network. There are two principal sub-variants of traffic dilution.

Figure 14.27 Restricting communication by congestion flow control near the point of origin.


These are by means of 1-in-N or 1-in-T dilution. The 1-in-N method allows every Nth packet
to pass into the network. The remaining proportion of packets, (N-1)/N, may be discarded or
rejected at a point near their origin, before gaining access to the main part of the network.
Alternatively they may be marked for preferential discarding at a later (congested) point along
the network path. By means of such packet dilution, the packet load on the main network (or
the congested part of it) is reduced by a factor of N.
The 1-in-T method, by comparison, performs a similar traffic dilution by accepting only 1
packet every T seconds. This method provides for quite accurate allocation of a specific bit
rate to a specific traffic stream.
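Both dilution methods are easily sketched in code. The classes below are illustrative only; in practice the admission decision would be applied per destination (or per traffic stream) at the ingress node, and rejected packets might be marked for preferential discard rather than dropped outright.

```python
import time

# Sketches of the two traffic-dilution variants. Parameter values are
# illustrative assumptions.

class OneInN:
    """Admit every Nth packet; the rest are discarded (or marked)."""
    def __init__(self, n):
        self.n, self.count = n, 0
    def admit(self):
        self.count += 1
        return self.count % self.n == 0

class OneInT:
    """Admit at most one packet every T seconds."""
    def __init__(self, t):
        self.t, self.last = t, float("-inf")
    def admit(self):
        now = time.monotonic()
        if now - self.last >= self.t:
            self.last = now
            return True
        return False

diluter = OneInN(4)
print([diluter.admit() for _ in range(8)])  # only every 4th packet passes
```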
When a very large value of N or T is used, nearly all packets will be blocked or held
up at their point of origin. In effect, the destination has been ‘blocked’. The action of complete blocking is quite radical, but nonetheless is sometimes necessary. This measure may be
appropriate following a public disaster (earthquake, riot, major fire, etc.). Frequently in these
conditions the public are given a telephone number or a website address as a point of enquiry,
and inevitably there is an instant flood of enquiries, very few of which can be handled. In
this instance, traffic dilution can be a useful means of increasing the likelihood of successful
communication between unaffected network users.
Without moves to restrict the traffic demand on a network, the volume of successful traffic
often drops as the offered traffic increases. Thus in Figure 14.28, the effective throughput of
the network reduces if the traffic offered to the network exceeds value T0 . This is a common
phenomenon in data networking.
A number of different IP-protocol suite congestion control methods employ restrictive control actions to protect the network against traffic overload (such as that shown in Figure 14.28).
These include:
• TCP (transmission control protocol) flow and congestion control;
• IP precedence (Internet protocol);

Figure 14.28 Congestion reduces the effective throughput of many networks if the traffic demand is
too great.


• IP TOS (type of service);
• class of service (COS), DSCP (differentiated services codepoint) and PHB (per-hop
behaviour) (IP differentiated services (DiffServ));
• admission control (as employed by RSVP (resource reservation protocol) and MPLS (multiprotocol label switching) networks);
• quality of service (QOS) and user priority fields (UPF) of protocols like IEEE 802.1p and
IEEE 802.1q.
All the above are examples of automatic congestion control actions. An alternative, but perhaps
cruder means of congestion control, is for human network managers to undertake temporary
network, routing table or equipment configuration changes as they see fit.
The most widely used automatic congestion control method is the congestion window
employed by the transmission control protocol (TCP). This we discussed in detail in Chapter 7.
The various congestion and flow control protocols employed by TCP act to regulate the rate at
which TCP datagrams may be submitted to the network at the origin (or source end) of a TCP
connection. The flow rate is determined by the TCP protocol, which operates on an end-to-end
basis between the two hosts which are sending and receiving data across the TCP connection
across the network. The maximum allowed rate of datagram submission depends upon not
only the ability of the receiving host to receive them, but also upon the delays and congestion
currently being experienced by datagrams traversing the network. For completeness, we also
recap briefly here the other previously discussed means of automatic congestion control based
on quality-of-service (QOS) methods.
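As a reminder of the principle (described fully in Chapter 7), the sketch below caricatures the additive-increase/multiplicative-decrease behaviour of the TCP congestion window. The constants and the simplified event model are assumptions for illustration; real TCP implementations also include slow start, fast retransmit and other refinements.

```python
# Simplified sketch of TCP congestion-window adjustment (AIMD only).
# Constants are illustrative, not a faithful model of any TCP variant.
def simulate_cwnd(events, initial_cwnd=1, max_cwnd=64):
    """events: sequence of 'ack' (a round of acknowledgements) or 'loss'."""
    cwnd = initial_cwnd
    history = [cwnd]
    for event in events:
        if event == "ack":
            cwnd = min(cwnd + 1, max_cwnd)   # additive increase per round trip
        elif event == "loss":
            cwnd = max(cwnd // 2, 1)         # multiplicative decrease on congestion
        history.append(cwnd)
    return history

print(simulate_cwnd(["ack"] * 5 + ["loss"] + ["ack"] * 3))
# [1, 2, 3, 4, 5, 6, 3, 4, 5, 6]
```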
IP precedence, the IP type of service (TOS) field and the drop priority of IP differentiated
services (DiffServ) are all used simply to determine the order in which IP (Internet protocol)
packets should be discarded (i.e., dropped) by a router, should an accumulation of incoming
packets exceed the rate at which packets can be forwarded. The DE (discard eligibility) bit
of frame relay and the CLP (cell loss priority) bit of ATM (asynchronous transfer mode)
provide a similar prioritisation of which frame relay frames or ATM cells should be discarded
first at a time of congestion. Alternatively, packets may be delayed by holding them in a
buffer until the congestion subsides. But simply dropping (or delaying) packets, frames or
cells during a period of network congestion is no guarantee that the quality of service will
be adequate for any of the network’s users. For this purpose some kind of quality of service
scheme must additionally be used in conjunction with packet, frame or cell discarding. Such
schemes are defined for use with IP differentiated services (DiffServ), admission control (as
used by RSVP/MPLS and ATM) and virtual-bridged LAN (VLAN) networks (IEEE 802.1q /
IEEE 802.1p).
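The common idea behind these discard-priority mechanisms can be sketched as follows. The packet fields and queue capacity are assumptions for illustration; the essential behaviour is that, when the buffer overflows, the packet with the lowest precedence (i.e., the highest drop priority) is discarded first.

```python
# Sketch of drop-priority behaviour at a congested forwarding queue.
# Field names and capacity are assumed for illustration.
from dataclasses import dataclass

@dataclass
class Packet:
    payload: str
    precedence: int   # higher value = more important, discarded last

class DropPriorityQueue:
    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = []

    def enqueue(self, packet):
        self.queue.append(packet)
        if len(self.queue) > self.capacity:
            # discard the least important packet currently buffered
            victim = min(self.queue, key=lambda p: p.precedence)
            self.queue.remove(victim)
            return victim        # the dropped packet
        return None

q = DropPriorityQueue(capacity=2)
q.enqueue(Packet("routine", 0))
q.enqueue(Packet("priority", 5))
print(q.enqueue(Packet("routine 2", 0)))  # a low-precedence packet is dropped
```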
Quality of service (QOS) assurance schemes usually work by reserving link capacity, router
forwarding capacity or other network resources for particular users. The reservation may be on a
permanent basis (i.e., configured into the network). (An example of a permanent reservation
scheme is the committed information rate (CIR) offered in frame relay networks.) Alternatively,
a negotiated reservation is possible with admission control schemes used in conjunction with
connection-oriented communications networks.
When admission control is used, a request is made for the reservation of network resources
at the time of connection establishment — in line with the bit rate, delay and other quality
stipulations of the request. Should sufficient network resources not be currently available to
meet the request, then the connection request is rejected, and the user must wait until a
later (less congested) time before re-attempting the connection set-up. During the process of
admission control, a traffic contract is negotiated between the network and the user requesting
a connection. The contract commits the network to provide a connection meeting the quality
of service parameters defined in the contract for the whole duration of the active phase of


Table 14.7 QOS guarantee in data networks: admission control protocols and parameters used in traffic contracts

Data network type             Admission control protocol and policing   Traffic contract (QOS parameters)

ATM (asynchronous             CAC (connection admission control)        PCR (peak cell rate)
transfer mode)                NPC (network parameter control)           SCR (sustainable cell rate)
                              UPC (usage parameter control)             Burst cell rate
                                                                        CLR (cell loss ratio)
                                                                        CDVT (cell delay variation tolerance)

DiffServ (IP differentiated   RSVP (resource reservation                PHB (per-hop behaviour)
services)                     protocol) [if used]                       DSCP (differentiated services codepoint)

Frame relay                   Pre-configured                            CIR (committed information rate)
                                                                        EIR (excess information rate)

MPLS (multiprotocol label     RSVP (resource reservation                Bit rate
switching)                    protocol) [if used]                       Packet rate
                                                                        Packet size
                                                                        Delay
the connection. The QOS parameters typically include the bit rate, delay or latency, packet
size, etc. During the active phase of communication, the network will normally monitor the
connection, policing and enforcing the traffic contract as necessary. Thus if, during the duration
of the connection, the network becomes subject to congestion, the network nodes will try to
determine the cause of the congestion. Provided each user is only submitting packets of a
size and at a rate conforming with the traffic contract, then these packets are forwarded
appropriately. Packets exceeding the traffic contract, however, may be subjected to packet
shaping. Packet shaping actions can include:
• discarding or delaying packets;
• marking them for preferential discarding at a later point in the path;
• rejecting or fragmenting packets which exceed a certain packet size.
Examples of admission control processes and traffic contracts used by data protocols we have
encountered in this book are listed in Table 14.7.
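A token bucket is a common way of policing such a traffic contract, broadly in the spirit of frame relay CIR enforcement or ATM usage parameter control. The sketch below is illustrative only: the rates are invented, and a real policer might mark non-conforming packets for later discard (or shape them by delaying) rather than simply rejecting them.

```python
# Sketch of traffic-contract policing with a token bucket.
# Rates and actions are assumptions for illustration.
class TokenBucket:
    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last = 0.0

    def conforms(self, packet_bytes, now):
        # replenish tokens at the committed rate, up to the burst allowance
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True          # within contract: forward normally
        return False             # excess: drop, delay or mark for later discard

policer = TokenBucket(rate_bytes_per_s=8000, burst_bytes=4000)  # ~64 kbit/s committed rate
print(policer.conforms(1500, now=0.0))   # True: within the burst allowance
print(policer.conforms(4000, now=0.1))   # False: exceeds the remaining tokens
```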

Network management systems
It is common nowadays for network management computer systems to be provided as an integral part of the subnetworks which they control. Real-time communication between network
elements (e.g., routers, switches and transmission systems) and network management systems
allows real-time network status information to be presented to the human network managers.
Thus the RMON (remote monitoring) MIB and the SNMP (simple network management protocol) (as we discussed in detail in Chapter 9) allow for real-time monitoring of network
performance and alerting of network failures and other alarm conditions.
As adjudged necessary by the human network manager (or automatically by the network
management system software), control signals may be returned by means of SNMP to effect
network configuration changes, thereby relieving congestion or overcoming network failures.
For example, the network manager may choose to downgrade the handling of traffic of medium
priority and temporarily to reject all communications of low priority.
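Conceptually, the loop between monitoring and control can be as simple as the sketch below. The utilisation figures and thresholds are assumptions; in a real system the readings would be collected via SNMP/RMON and the resulting actions applied as configuration changes to the network elements.

```python
# Conceptual sketch only: threshold-based alarm and control decisions based
# on polled link utilisation. Input values and thresholds are assumed.
def check_links(utilisation_by_link, alarm_threshold=80, control_threshold=95):
    """utilisation_by_link: {link name: percent}, e.g. as polled via SNMP/RMON."""
    actions = {}
    for link, utilisation in utilisation_by_link.items():
        if utilisation >= control_threshold:
            actions[link] = "apply restrictive control (e.g., reject low-priority traffic)"
        elif utilisation >= alarm_threshold:
            actions[link] = "raise alarm to the human network manager"
    return actions

print(check_links({"A-B": 97, "A-C": 85, "B-C": 40}))
```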
Network management systems can usually be procured from network equipment manufacturers and are often sold with the equipment itself. Such ‘proprietary’ network management


systems are usually optimised for the management of the particular network element — making
for much easier network configuration and monitoring than the alternative ‘command line interface’ on the equipment console port. The drawback of such systems is that they are usually
only suited to management of one type of network element. Strictly speaking, they are termed network
element managers.
To coordinate all the different network elements making up a complex network, an
‘umbrella’ network management system is required. Such systems are mostly the realm
of specialist software-development companies (e.g., Hewlett Packard OpenView, Micromuse
Netcool, Syndesis, etc.11). Ideally, an ‘umbrella’ network management system should
coordinate the actions of the various network elements and subnetworks when a network failure
or network congestion arises. It should be able to determine the best overall remedial action,
and prevent different network element managers from undertaking contradictory actions. After
all, the problem might only get worse if two parties pull in opposite directions!

14.9 Performance optimisation in practice
In practice, many data networks evolve without close management. The number of user devices
and applications making use of the network, and the volume of traffic, grow over time, often
without close scrutiny of the implications for the network. Network dimensioning and capacity extensions are carried out on an ‘empirical’ basis — ‘try-it-and-see’. The human network
manager might monitor the link utilisation of all the links in the network on a monthly basis or
even only ‘as needed’. Should any of the links be found to be approaching 100% utilisation,
an increase in line bit rate can be arranged. In many cases such a ‘casual approach’ may
be entirely adequate and appropriate, but there are also occasions on which the increase of
link capacity does not resolve a user’s problem of poor quality. What do you do in such an
instance? The answer is more detailed analysis of the network, the user and the application
software. If you are not capable of this on your own, you can contract one of the specialist
network analysis firms to do it for you.
A detailed performance analysis of a network requires special network monitoring (probes)
and analysis tools. It usually starts with a basic analysis of network link performance and
application transaction times (Figure 14.29). The link utilisation chart (Figure 14.29a) is the
first step of analysis. If a particular link is operating near its full capacity, then the line
propagation delays (and consequently the application response time) increase rapidly (as we
saw in Figures 14.5 to 14.8). It may be important to consider the peak utilisation (i.e., average
link utilisation during the busiest one-hour or fifteen-minute period) rather than the average
daily utilisation of a particular link. If a particular application is only used at a particular time
of day at which time the network is likely to be heavily loaded, then the average daily link
utilisation will not provide a good indication of the likely level of performance.
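The difference between daily average and peak-period utilisation is easy to demonstrate. The sample values below are invented for illustration (one reading every five minutes, so three consecutive samples approximate a fifteen-minute busy period).

```python
# Sketch: busiest-period utilisation versus the daily average.
# Sample values are invented for illustration.
def busiest_period(samples, period_length):
    """samples: per-interval utilisation (%); returns the highest average
    over any window of `period_length` consecutive samples."""
    best = 0.0
    for i in range(len(samples) - period_length + 1):
        window = samples[i:i + period_length]
        best = max(best, sum(window) / period_length)
    return best

samples = [20, 25, 30, 85, 95, 90, 35, 30, 25, 20]   # one reading per 5 minutes
print(sum(samples) / len(samples))   # daily average: 45.5 %
print(busiest_period(samples, 3))    # busiest 15-minute period: 90.0 %
```

The same link would look comfortably loaded on the daily average, yet be close to saturation during the period when the critical application is actually in use.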
If particularly high network traffic demand, or rapid growth in demand is being experienced
from one month to the next, it may be valuable to perform a detailed analysis of the main
users of the network — i.e., the main sources and destinations of traffic. Table 14.8 illustrates
a ‘top talkers table’ generated by some network performance analysis tools for this purpose.
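A ‘top talkers’ table such as Table 14.8 is essentially an aggregation of flow or packet records by source host. The sketch below shows the principle; the record format and traffic volumes are assumptions for illustration.

```python
# Sketch: deriving a 'top talkers' ranking from per-flow byte counts.
# The flow records shown are invented for illustration.
from collections import Counter

def top_talkers(flow_records, n=10):
    """flow_records: iterable of (source_host, bytes_sent) tuples."""
    totals = Counter()
    for source, byte_count in flow_records:
        totals[source] += byte_count
    grand_total = sum(totals.values())
    return [(host, 100.0 * sent / grand_total) for host, sent in totals.most_common(n)]

records = [("CLARK", 290_000), ("clark-corp", 120_000), ("CLARK", 200_000),
           ("BOOKKEEPING", 60_000), ("www.sap.com", 40_000)]
for host, share in top_talkers(records):
    print(f"{host:<15} {share:5.1f} %")
```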
Following simple analysis of the network to identify overloaded links, the next step of a
detailed application performance review is likely to be a study of the average transaction delay.
The example of Figure 14.29b illustrates the transaction delays of a server (i.e., application)
and the network in supporting an imaginary application. There is generally a correlation between
the transaction delay of the application and that caused by the network (at most times of
day, the main delay is caused by the network). There are, however, two exceptional peaks of
11 See Chapter 9.


Figure 14.29 Network and application performance analysis.
Table 14.8 Top talkers (top sending hosts)

      DNS Name                   IP Address          Usage (%)
 1    CLARK                      10.3.16.4               49
 2    clark-corp                 192.168.34.1            12
 3    BOOKKEEPING                206.134.24.101           6
 4    www.sap.com                252.234.13.38            4
 5    www.company.com            178.121.101.103          4
 6    www.supplier.org           23.16.1.252              4
 7    DATA SERVER                10.3.16.1                4
 8    SECRETARY                  10.3.16.3                3
 9    BOSS                       10.3.16.2                3
10    www.footballscores.com     156.23.45.12             3

high transaction delays (at around 11 : 00 and 15 : 00), which cannot be accounted for by long
network propagation delays. These warrant further analysis. Complaints received from users
about poor performance at these times will not be resolved by merely adding further capacity
to the network. The cause of the long transaction delay lies somewhere else: perhaps a routine
is run by the server at these times of day, a database update is undertaken, or a particular
user or application synchronises information at this time.
Some transaction delays may be explained by the packet sizes being used. A chart which
plots the peak and mean packet sizes may be helpful in this case. Large packet sizes generally
make for efficient usage of the network, since larger packets require relatively less packet
header data (i.e., network overhead) for a given volume of payload data to be transferred. On