Fourth Progress Report on the HICID Project,

September 1, 1998 - November 30, 1998.

Panos Gevros, Fulvio Risso, Peter T. Kirstein and Jon Crowcroft

December 2, 1998

 

  1. Introduction

LEARNET is now available, and we have made some progress in installing a QoS testbed on it. The current configuration is shown and discussed in Section 2. We still need certain changes before the configuration is really suitable for the algorithmic testing we would like to perform. We have continued to work with our laboratory testbed, and have largely completed some simple measurements of algorithms. In Section 3 we describe our local QoS testbed and the measurements we have been doing. Our progress with IPv6 is considered in Section 4. Our future plans are discussed in Section 5.

  2. The Current LEARNET QoS Testbed

We have installed a CAIRN PC router at Essex U, and would be ready to do QoS activities if only the configuration were suitable. The current configuration is shown in Fig. 1:

Here the darker shapes indicate the current installation; the lighter circles indicate additional routers, which are planned for the future.

There are two problems with the current Fig. 1 installation:

Each of these is considered in turn.

The UCL-CS (UCL-CS-P) and the Essex U (Ess-P) CAIRN routers are connected directly to ATM switches. The ATM switches at UCL-EE (UCL-EE-S), UCL-CS (UCL-CS-S) and BT (BT-S) are also connected together directly. Finally, the three CISCO routers (UCL-CS-C, BT-C and Ess-C) are all connected to their local switch. Unfortunately, BT-S is connected to Ess-C rather than to Ess-S. This gives the requisite connectivity; however, it also requires that any QoS traffic path from sources attached to Ess-P destined for UCL-CS-P must pass through a CISCO (Ess-C). This means that any QoS algorithms supported by the CAIRN routers but not the CISCO ones cannot really be tested. It would be far better if a single-mode ATM port could be provided on Ess-S. If this were done, the first improvement would be:

  1. Re-terminate LEARNET on Ess-S.

For many purposes, we would also like to investigate multiple hops and multicast. Several extensions to Fig. 1 would enrich the topology:

  2. Connect a CAIRN PC router (UCL-EE-P) to the ATM switch UCL-EE-S;
  3. Add a CAIRN router (BT-P) at BT.

 

If all the above were done, it would be possible to establish ATM VCs between any sets of CISCOs or of CAIRN routers. Thus the QoS experiments could be done entirely in the CISCO or CAIRN PC domains, merely by establishing VPNs.

Until some of the above are done (particularly (1)), we can use only QoS algorithms common to both CISCO and CAIRN routers.

  3. The Laboratory Testbed

    3.1 Introduction

We have extended our tests on our small-scale laboratory testbed to experiment with QoS. In our experiments we use research prototypes (mainly, but not only, those under the altq framework) and measure how effective the traffic management mechanisms are. The problems with the altq ATM driver and its capabilities that were highlighted in the previous report have now been resolved; we are using the latest FreeBSD releases (2.2.6/7) and the altq-1.1.1 distribution.

Within CAIRN, our colleagues are also starting to use another research prototype from Carnegie Mellon University, the Hierarchical Fair Service Curve (HFSC) discipline, used for link sharing (like CBQ). There are some advantages in this system - not least that other colleagues are exploring its use, and CMU has offered to configure it on our routers (it is not as well finished a product as altq from the viewpoint of easy configuration). We intend to include HFSC in future tests. The main part of this report concerns our experience with CBQ; this is given much more fully in [1], but the salient points are reported below.

    3.2 Test environment

In the test environment, Ammon and Kirki are PCs running FreeBSD, Truciolo is a PC running Microsoft Windows 95, and Thud is a Sun SPARC 5 running Solaris.

       

The main test environment consists of two PCs (Ammon and Kirki) running BSD Unix connected by a 155 Mbps ATM link: one runs the CBQ daemon and the other acts as the network capture machine. The TTCP package is used as traffic generator: this program runs on Truciolo and, in certain cases, directly on Ammon, the machine running the CBQ daemon. This double location is necessary because the switched Ethernet between the Win95 machine and the CBQ router can affect the results in high-bandwidth tests. The TTT package is used as traffic monitor, in certain cases running directly on the second BSD machine, in others running the TTTProbe on the BSD machine and the TTTView on the Solaris machine. Since the TTT graphical interface uses a lot of CPU resources, the second option is used to avoid a CPU overload when running some high-bandwidth tests.

       

Kirki
  Machine type: AMD K6/200, 64 MB RAM
  Network: 1 Ethernet 10 Mbps, 1 ATM 155 Mbps (PVC set at various speeds, generally between 1 and 10 Mbps)
  OS: UNIX BSD 2.2.7
  Other packages installed: ALTQ 1.1.1, TTCP (recv), TTT, TTTProbe
  Task: network capture; in certain cases also network monitor

Ammon
  Machine type: Intel P166, 32 MB RAM
  Network: 1 Ethernet 10 Mbps, 1 ATM 155 Mbps (PVC set at various speeds, generally between 1 and 10 Mbps)
  OS: UNIX BSD 2.2.7
  Other packages installed: ALTQ 1.1.2, TTCP (send)
  Task: CBQ daemon; in certain cases also traffic generator

Thud
  Machine type: Sun SPARC 5
  Network: Ethernet 10 Mbps
  OS: Solaris 2.5.1
  Other packages installed: TTTview
  Task: network monitor

Truciolo
  Machine type: Intel PII-266
  Network: Ethernet 10 Mbps
  OS: Windows 95
  Other packages installed: TTCP (send)
  Task: traffic generator

    3.3 Static tests

Static tests help to discover the static behaviour of the CBQ package. The main goals are:

      3.3.1 Bandwidth class allocation tests

These tests show the granularity of the classifier. Results show that:

 

Finally, the CBQ root bandwidth set in the configuration file and reported by the CBQSTAT program is:

 

Further details are given in [1], but a summary of the test characteristics is:

 

      3.3.2 Classifier and bandwidth share tests

This test suite has a very simple configuration: a 2 Mbps root class has two leaf classes with an 80% and a 20% share, respectively. The classes are isolated: neither is allowed to borrow from the root class.

 

The bandwidth share test uses a very simple classifier with mixed TCP and UDP traffic. In some versions the filter is based only on the protocol type; in others the destination port field is also included in the classifier filter. These tests have been repeated with different classifier settings.
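To make this configuration concrete, the following is a minimal Python sketch of the set-up as we understand it; the class names, the example port numbers and the dictionary-based filter are our own illustration, not the altq configuration syntax.

    # Illustrative sketch only: models the test configuration described above,
    # not the actual ALTQ/CBQ implementation or its configuration file syntax.

    ROOT_BW = 2_000_000            # root class: 2 Mbps
    CLASSES = {
        "fast": 0.80 * ROOT_BW,    # 80% share, isolated (no borrowing)
        "slow": 0.20 * ROOT_BW,    # 20% share, isolated (no borrowing)
    }

    def classify(proto, dst_port, use_port_filter=False):
        """Map a packet to a CBQ class.

        First variant: filter on the protocol type only (TCP -> fast, UDP -> slow).
        Second variant: the destination port is also part of the filter.
        Both mappings, and the port number 5001, are assumptions made for illustration.
        """
        if use_port_filter:
            return "fast" if dst_port == 5001 else "slow"
        return "fast" if proto == "tcp" else "slow"

    print(classify("tcp", 5002))                          # fast (protocol-only filter)
    print(classify("tcp", 5002, use_port_filter=True))    # slow (port-based filter)
    print({name: bw / 1e6 for name, bw in CLASSES.items()})   # {'fast': 1.6, 'slow': 0.4}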

 

With these tests we have the following results:

      3.3.3 Class allocation with different packet size (p_size)

The CBQ mechanism uses the "mean packet length" concept to determine how many packets must be forwarded in each class. The goal of another set of tests is to show the CBQ class allocation when flows with different payload sizes (sometimes much smaller than the MTU) are sent to the router.

 

The results confirm that CBQ can be sensitive to the packet size, because it uses a "mean packet size" to compute one sensitive parameter. Unless told otherwise, CBQ sets its packetsize parameter to the link-layer MTU, assuming that applications tend to generate MTU-sized packets. However, this is not true for all applications: it fails in particular for multimedia audio packets, but also for some TCP flows (depending on the MSS). The effect can be limited by setting an appropriate packetsize parameter for the class in the CBQ configuration file; however, this means that we must know in advance the mean packet size of the data carried in that class.
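A back-of-the-envelope model of this effect is sketched below (our own simplification, assuming the estimator allots the class one configured-size packet per pacing interval; this is not the real CBQ estimator code). A flow whose packets are smaller than the configured packetsize then achieves only a proportional fraction of its allocation.

    # Simplified model of the packet-size effect described above (an assumption,
    # not the real CBQ estimator): the class is paced at one packet of the
    # configured "packetsize" per allotted interval, so smaller real packets
    # leave part of every interval unused and the class goes overlimit early.

    def effective_rate(allocated_bps, configured_packetsize, actual_packetsize):
        interval = configured_packetsize * 8 / allocated_bps   # seconds per allotted packet
        return actual_packetsize * 8 / interval                # bit/s actually achievable

    alloc = 1_600_000               # e.g. 80% of a 2 Mbps root class
    mtu = 1500                      # default packetsize = link-layer MTU
    for size in (1500, 512, 160):   # 160 B stands in for a small audio packet (assumption)
        rate = effective_rate(alloc, mtu, size)
        print(f"{size:4d} B packets -> {rate / 1e6:.2f} Mbps of {alloc / 1e6:.1f} Mbps allocated")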

 

A typical experimental result is shown below:

 

 

 

This graph reports the results of several different UDP/TCP flows, clearly showing the deterioration of throughput with decreasing packet size. In this test UDP is more convenient than TCP because it is simpler to generate fixed-size packets. With TCP flows there can be some window management problems; in certain cases the TCP throughput is considerably smaller than the UDP one, in others much bigger. Results equivalent to the UDP performance can be seen with a TCP flow if, after starting the CBQ daemon, the MTU size is chosen carefully. Other tests show that this bad behaviour can be avoided by setting an appropriate packet-size parameter. Conversely, increasing the max-burst parameter has no effect, because that parameter works on the weighted round-robin mechanism. In other words, with a larger max-burst the scheduler is able to send more packets only if the estimator permits it, i.e. only if the class is not overlimit. But since, with the wrong packet-size parameter, the class becomes overlimit even after transmitting only a few bytes, increasing max-burst clearly cannot be effective.

    3.4 Dynamic tests

Dynamic tests help to discover the instantaneous throughput of each connection in the CBQ router. Basically, the main goal is to test the borrow mechanism, which is very important in CBQ. Since CBQ is not a work-conserving discipline, it can happen that some connections have a backlog even when the output link is idle. This can be avoided by activating the "borrow" flag in some classes, which configures CBQ to use the excess bandwidth. In this way a class is allowed to borrow from its parent if the parent has unused bandwidth. Clearly the behaviour of this mechanism must respect the bandwidth shares imposed in the configuration files.
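As a rough illustration of the borrow decision (a deliberate simplification written for this report, not the ALTQ code; the class names and rates are invented), a class that has exhausted its own share may still send only if its borrow flag is set and its parent has unused bandwidth:

    # Rough illustration of the borrow decision described above; a simplification,
    # not the ALTQ implementation.

    class CbqClass:
        def __init__(self, name, allocated_bps, parent=None, borrow=False):
            self.name = name
            self.allocated_bps = allocated_bps
            self.used_bps = 0
            self.parent = parent
            self.borrow = borrow

        def may_send(self, extra_bps):
            """True if the class may send extra_bps more, possibly by borrowing."""
            if self.used_bps + extra_bps <= self.allocated_bps:
                return True                              # within its own share (underlimit)
            if self.borrow and self.parent is not None:
                return self.parent.may_send(extra_bps)   # try the parent's spare bandwidth
            return False                                 # overlimit and not allowed to borrow

    root = CbqClass("root", 2_000_000)
    fast = CbqClass("fast", 1_600_000, parent=root, borrow=True)
    slow = CbqClass("slow",   400_000, parent=root, borrow=False)

    slow.used_bps = 400_000          # the slow class has exhausted its own share
    print(slow.may_send(100_000))    # False: isolated, so it must leave the link idle
    slow.borrow = True
    print(slow.may_send(100_000))    # True: the parent still has unused bandwidth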

 

We tested a number of different CBQ configurations, as shown in Fig. 6:

 

 

 

 

Fig. 6 (a), (b), (c), (d)

In (a) the traces show that, when both flows are active, the bandwidth share is correctly reserved. However, the slow flow is not able to use all the parent bandwidth when the fast flow is off. Even though the fast class (when alone) is able to consume much more bandwidth than the slow class, it does not use the total parent bandwidth. When both classes are transmitting concurrently, the total traffic is a little bigger than the fast-class traffic alone. This result is independent of whether the traffic in each class is TCP or UDP.

In (b), TCP and UDP perform differently. TCP flows are not able to make good use of the borrow flag. Due to TCP's "fair" behaviour and to the CBQ characteristic of sending packets in bursts, if more than one TCP flow shares a link using the borrow flag, the flows adapt to each other. The result is that if three TCP flows are present, they share the bandwidth equally. Even if all three flows start at the same instant with different throughputs, they quickly adapt to share the bandwidth equally. Conversely, the UDP flows are able to share the bandwidth unequally; the shares may not be exactly those imposed (e.g. 10-40-50% in one experiment), but the flows are able to transmit at different rates.

In (c) the TCP session always receives better service than the UDP one. The better performance of the TCP flows is due to the fact that the borrow mechanism does not work well if classes have different packet sizes.

In (d) one set of tests was performed with only two TCP flows, one in the agency1 slow class and the other in the agency2 one. Results show that the two TCP flows share the total bandwidth (4 Mbps) equally, despite their different agency shares. When one TCP flow is replaced by a UDP one, the results are different. When no other flows are on, the TCP flow, belonging to the bigger agency, is able to use the total link bandwidth (differently from the previous test). Conversely, the UDP flow, belonging to the smaller agency, is not able to do this. When both flows are on, the UDP flow's bandwidth remains unchanged, and it uses more bandwidth than the TCP flow.

    3.5 CBQ with RSVP

CBQ can co-exist with RSVP: when the RSVP daemon accepts a new connection, CBQ dynamically creates a new class in its queuing hierarchy. The CBQ daemon starts automatically when the RSVPD is activated, loading the standard configuration file. Only two classes are defined, one for best-effort and the other for reserved traffic. The latter is characterised by the keyword "cntlload", and is the parent class of all the reserved sessions. The suggested configuration allows only the best-effort traffic to borrow from the root class (the same feature is disabled for the cntlload class). Reserved session classes can belong only to the Controlled Load service: Guaranteed Service is not supported, and all sessions requiring Guaranteed Service are refused. When the RSVP daemon accepts a new reservation, the CBQ mechanism creates a new leaf class, reserving for it the bandwidth indicated in the reservation message by the token rate (r) parameter. In the Controlled Load class the receiver does not specify the peak rate. These leaf classes are allowed to borrow from their parent (cntlload) class.
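The hierarchy this produces can be sketched as follows (an illustrative Python model based on our reading of the suggested configuration; the accept_reservation helper and the session naming are invented for the example):

    # Illustrative model of the CBQ hierarchy built by the RSVP/CBQ coupling,
    # as described above; our own sketch, not rsvpd or ALTQ code.

    hierarchy = {
        "root":        {"parent": None,   "bandwidth": "link rate",     "borrow": False},
        "best_effort": {"parent": "root", "bandwidth": "remainder",     "borrow": True},
        "cntlload":    {"parent": "root", "bandwidth": "reserved pool", "borrow": False},
    }

    def accept_reservation(session_id, token_rate_r, service="controlled-load"):
        """Mimic (in outline) what happens when the RSVP daemon accepts a reservation."""
        if service != "controlled-load":
            raise ValueError("Guaranteed Service is not supported; reservation refused")
        hierarchy[f"session_{session_id}"] = {
            "parent": "cntlload",        # leaf class under the controlled-load class
            "bandwidth": token_rate_r,   # reserved bandwidth = token rate r
            "borrow": True,              # leaf classes may borrow from cntlload
        }

    accept_reservation(1, 500_000)
    print(list(hierarchy))   # ['root', 'best_effort', 'cntlload', 'session_1']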

      3.5.1 Test configuration

Because of several practical problems, the reserved flow originates from Ammon and goes to Truciolo, via Kirki. However, the only reserved link is the ATM link from Ammon to Kirki (the second hop, from Kirki to Truciolo, is not reserved). The first series of tests involves only two reserved flows, TCP and UDP, while the second series adds to them a third UDP flow (belonging to the best-effort class) from Ammon to Kirki. Graphs are produced by capturing the data traffic on the Kirki ATM interface (with the tttprobe program) and displaying the graphical result (with tttview) on Thud.

         

      3.5.2 Test Results

From these experiments, we can conclude that the integration of CBQ and RSVP is very good. Some problems do arise, but these are due not to RSVP but to the CBQ mechanism, especially to the borrow implementation. A provider that would like to adopt RSVP with CBQ must take care that the current limitations of CBQ do not affect its performance.

    3.6 CBQ Performance

The CBQ mechanism consists of a classifier, an estimator, and a packet scheduler. Regarding its performance, we can say the following:

There is little point in trying to understand the CBQ latency, because usually the time a packet spends waiting in the output buffer is much larger than the time spent in the CBQ mechanism. Even if the CBQ latency increases with the number of classes, the larger part of the time that a packet spends in a router is spent in the output buffer. So, if the goal is to limit the end-to-end delay, limiting the buffer length is more effective than decreasing the scheduling overhead.

However, the maximum performance in terms of packets per second is an interesting metric. Moreover, it is easy to derive the latency of each packet in the router machine from the packets-per-second measurement.
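For example, a straightforward conversion using the roughly 15 Kpps figure measured in Section 3.6.1:

    # Per-packet forwarding time derived from the measured packet rate.
    pps = 15_000                     # ~15 Kpps on the AMD K6-200 (Section 3.6.1)
    per_packet_us = 1e6 / pps
    print(f"~{per_packet_us:.0f} microseconds per packet")   # about 67 us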

      3.6.1 CBQ Throughput (perf)

It is not easy to determine the CBQ throughput, for a number of reasons. First, sending TCP traffic would be better because it does not waste CPU cycles. However, it is not trivial to impose a TCP packet size: it can be done only by imposing a packet size on the socket buffer, and this may overload the CPU. Moreover, the path latency can affect the overall throughput.

For these reasons we chose to use UDP flows, but care must be taken. UDP flows do not adapt to the network load; hence, to avoid overloading the router CPU, the link between the sender and the router must be set to the appropriate speed. Generally two kinds of test are done:

The use of a UDP flow means that only the flow from the source to the destination is present in the system; there is no flow from the destination back to the source, in contrast to a TCP flow.

To measure the system performance, XPERFMON++ is used. However the results are not very accurate, because:

The CBQ configuration for this test is very simple: one class (the default class) with 100% of the bandwidth. The bandwidth of the root class is the same as that of the Kirki – Ammon PVC. No packet size is set, because other tests showed that the packetsize parameter has no effect here.

Clearly the throughput depends on the packet size. With small packet sizes it is approximately 15 Kpps on an AMD K6-200 machine. When the packet size increases, the overall performance in packets per second decreases; this is because of the higher load in transferring the packets from the network interface to memory and vice versa. This result is consistent with measurements made at Sony. It is important to remember that at small packet sizes the router machine is running at full load; no other processes receive service, and the machine appears to be a hung PC.

      3.6.2 CBQ classes overload

This test uses the same configuration as in Section 3.6.1; the goal is to understand the deterioration of the CBQ mechanism when many classes are present in the system. The test uses a 60 Mbps PVC, configured with 10, 20, 50 and 100 classes. Each class is allocated 0% of the bandwidth, except the default class, which uses 100%. Each class has a different destination port: this is a very simple filter, but quite heavy to compute because every packet must be analysed in depth.

The results show a clear worsening: when the number of classes increases to 100, the throughput is reduced by 25%. This is not a problem in the normal stand-alone use of CBQ (since the number of classes is limited, because it is not possible to allocate a fractional percentage of the root bandwidth). In an RSVP environment it may be more serious, because of RSVP's ability to create many classes dynamically.

      3.6.3 CBQ Memory load

It is important to understand the memory requirements of CBQ too. In these experiments we used the same configuration files and machine configurations as in the above tests. The results from the TOP and PS programs are consistent, and show that the memory occupancy is low. They indicate use of 260 KB of virtual memory with no classes, and around 400 B per class. Apparently around 600 KB of real memory is allocated, but we do not really know how the allocation is made.
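A quick arithmetic check of how this scales with the number of classes (using the figures above; the class counts are arbitrary examples):

    # Quick check on the memory figures reported above.
    base_kb = 260          # virtual memory with no classes (KB)
    per_class_b = 400      # approximate cost per class (bytes)
    for n in (10, 100, 1000):
        total_kb = base_kb + n * per_class_b / 1024
        print(f"{n:5d} classes -> ~{total_kb:.0f} KB of virtual memory")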

    3.7 Evaluation of CBQ

The main problems of the CBQ package seem to be:

  4. Progress with IPv6

It is not yet quite clear how much IPv6 can be used in the HICID/HIGHVIEW/JAVIC projects. It is clear that IPv6 will eventually have much more support for QoS, so we would like to use it for HICID. This requires, however, that the applications we are using themselves support IPv6. Mainly with effort from outside the project, we are building up an IPv6 capability for our testbed. Our current progress is summarised below.

    4.1 Stacks

UCL-CS has put up the IPv6 stack from Microsoft, and is starting with the DASSAULT one for Windows/NT. We have not yet done any inter-working experiments, but already see that they have somewhat different application APIs. We have put up the complete stack in FreeBSD; this is the one used in CAIRN. We will put up the LINUX one from Lancaster U; we will need this for our mobile work, but not for HICID. There was a set of patches for Solaris 2.5 and Solaris 2.6, but we were advised to wait for the Solaris 7 version. We have now received it, and have installed it on two of our workstations.

    4.2 Routers

The CAIRN and CISCO routers we are using both support IPv6; there is no problem at that level. However, the QoS support in the routers is much more limited. The CAIRN router supports only HFSC; it does not yet support ALTQ under IPv6. There is a version of ALTQ distributed by INRIA; this has not yet been evaluated by anyone in the CAIRN community - including UCL. The CISCO QoS support is also much more limited under IPv6 than under IPv4.

    4.3 Applications

A particular problem is that most of the applications do not yet work with IPv6. We have made a start by making RAT work above the Microsoft stack. We have now started making it work above the DASSAULT stack to see the differences in API. We know that there are already ports of VIC and SDR for IPv6; we have not yet investigated these.

  5. Future Work

Our main activities during the next quarter are the following:

 

Reference

1. Fulvio Risso, ALTQ Package – CBQ Testing. (GIVE WWW REFERENCE)