
4.3 Congestion Loss under the Many-To-One Data Flow

In this section, we explore the contention behavior of our lightweight reliable transmission protocol on different buffering architectures under the same traffic pattern - the many-to-one flow. The function of a reliable protocol in the communication system is to prevent and/or recover from data loss. Most data losses in modern networks are caused by packet drops inside the switches, which in turn are induced by the contention problem. We believe that with a better understanding of how buffering within switches affects the performance of the communication system, especially under heavy contention, more effective congestion avoidance schemes can be designed for the cluster domain.


4.3.1 Many-To-One Data Flow

Although this communication pattern induces the heaviest congestion on the network link(s), its congestion behavior is easier to comprehend than that of the many-to-many communication pattern. The obvious result of this many-to-one traffic is the congestion build-up at the outgoing port(s) that connect, directly or indirectly, to the gather root (common sink) of this traffic pattern. As the outgoing link is over-subscribed, excess packets must be buffered. If the congestion persists for a ``long'' duration, congestion loss will be the result. However, if the volume of the data bursts were within the storage capacity associated with this congested port, no packet would be lost. Therefore, the circumstance that induces packet loss under such a traffic pattern is the arrival of a large burst of data packets targeting the same outgoing port within a short interval, where the volume of this burst overshoots both the storage capacity and the drainage speed of the congested port. Moreover, under steady-state conditions, the incoming and outgoing flows should become balanced since data traffic is regulated by the flow-control protocol. This is commonly known as the "self-clocking" [51] effect of the flow-control protocol. Therefore, we deduce that bursts of data packets are only generated by the sources either at the start of the many-to-one traffic or after a packet loss incident.

In the beginning, all sources have a full window, so packets are injected into the network at full speed. Therefore, we observe that the volume of the first burst of data packets is proportional to the flow control window size ($ W_{FC}$) and the number of concurrent senders ($ P $). If the buffering capacity ($ B_{L}$) of the bottleneck region along the route of this flow is not large enough to accommodate this burst, a "cluster of packets" is dropped. In this study, we assume that all switches employ the drop-tail FIFO discipline (the most commonly used dropping discipline in commercial products [36]) as the buffer management strategy; thus, when the buffer is full, newcomers are discarded. Due to this dropping discipline, packet losses exhibit some form of temporal correlation. This is inferred from the following observations. First, if a packet arriving at a congested port is dropped by the switch, packets from different sources arriving at this bottleneck region close together in time are likely to be lost too. Second, if a particular data stream loses a packet, most of the subsequent packets originating from the same sender within the same window session are likely to be lost too. The implication of this packet loss behavior is that these sources will receive their packet loss signals (negative acknowledgments or timeouts) some time later, but again closely spaced. This triggers the error recovery layer to recover from the loss and induces another burst of data packets. Depending on the protocol used by the error recovery layer, this phenomenon may continue to evolve and result in poor network performance as the efficiency of the communication system deteriorates significantly.
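To make the clustered-loss behavior concrete, the following minimal sketch (not taken from our implementation; all parameter values are illustrative) feeds a drop-tail FIFO with a synchronized burst from P senders while draining roughly one packet per P arrivals, and shows that the drops come in consecutive runs once the buffer fills.

\begin{verbatim}
# Illustrative sketch: a drop-tail FIFO fed by a synchronized burst from P
# senders, draining about one packet per P arrivals.  Once the buffer fills,
# the drops arrive in consecutive runs - the temporal correlation noted above.
from collections import deque

def drop_tail_burst(P=4, W_FC=16, B_L=32):
    queue, dropped = deque(), []
    # P senders each inject a full window back-to-back (the initial burst).
    burst = [(src, seq) for seq in range(W_FC) for src in range(P)]
    for t, pkt in enumerate(burst):
        if t % P == 0 and queue:       # crude drain: ~1 departure per P arrivals
            queue.popleft()
        if len(queue) < B_L:
            queue.append(pkt)
        else:
            dropped.append(pkt)        # drop-tail: newcomers are discarded
    return dropped

if __name__ == "__main__":
    losses = drop_tail_burst()
    print(len(losses), "packets dropped; first few:", losses[:8])
\end{verbatim}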

When analyzing the contention behavior of the many-to-one traffic in a cluster interconnect, two features should be considered. First, congestion loss only occurs when the first burst of data packets overruns the switch buffer. If the buffer capacity can tolerate the flooding, we would not see any congestion loss problem. Second, the reliable transmission protocol has a significant impact on the available throughput, especially when congestion loss occurs. Therefore, the overall congestion behavior with respect to the many-to-one flow is a combined effect of the buffering mechanism and the reliable transmission protocol in use.

For a switched network, the ability of the buffering system to tolerate congestion is determined by three factors:

  1. The total amount of buffer memory in each switch.
  2. The way buffer memory is associated with individual ports.
  3. The way buffer memory is organized to store packets of different sizes.
With an enclosed network, if we assume that all traffic comes from the same application running on the cluster, we can safely narrow our investigation by assuming that the data packet size is fixed (normally, the traffic has a bimodal distribution, with one peak at the small control packet size and the other at the maximum packet size). Therefore, we can eliminate the third factor and focus primarily on the effects of different buffer architectures on the congestion behavior. Our investigation thus concentrates on understanding the relationship between the buffering architecture of the switched network, the reliable transmission protocol, and the congestion behavior. In particular, we would like to use this information to predict how much performance is lost to congestion.


4.3.2 Congestion Behavior on Input-Buffered Architecture

In this section, we devise a simple empirical formula that predicts the performance observed by the end-user when suffering contention loss on an input-buffered switch using the GBN scheme described in the previous section. Assume that there are P data sources and that the many-to-one flow starts at time $ t=0 $, where all sources start sending their data packets more or less at the same time. Initially, all sources have an open window of size $ W_{FC}$ (in units of fixed-size packets), and they continually send out their packets to the gather root. Congestion starts to build up as soon as we have more than one active sender, so packets are buffered in the dedicated memory associated with each ingress port - the input-buffered architecture. As the buffer size is finite, of $ B_{L}$ units, at any given time $ t>0 $ the queue size ($ Q_{t}$) must satisfy the constraint $ 0\leq Q_{t}\leq B_{L} $. Furthermore, after the initial burst, all data flows should be regulated by the gather root under normal circumstances. Therefore, congestion loss on the input-buffered architecture happens only if the initial burst is larger than the buffering capacity, i.e. when $ W_{FC}>B_{L} $. In other words, to avoid congestion loss on the many-to-one data flow under an input-buffered switch, one should choose a window size ($ W_{FC}$) that is smaller than or equal to the input port buffering size ($ B_{L}$).

Figure 4.2: Evolution of the queue size ( $ Q_{t}$) over time on an input FIFO queue with congestion loss problem
\resizebox*{1\columnwidth}{!}{\includegraphics{figures/gbn/in-slide1.eps}}

As we are interested in how congestion loss evolves under our GBN scheme on this architecture, we assume that $ W_{FC}>B_{L} $ and that each source has a steady stream of fixed-size data packets waiting to be transmitted to the gather root, i.e. all have an unlimited amount of data to send. We also assume that the return paths for the acknowledgement packets are noiseless and congestion-free, so data loss comes mainly from contention loss on the forward flows. To model the effective throughput delivered to the end-user by this many-to-one flow, we focus primarily on the evolution of a particular input queue, and generalize the result to all input queues involved in this flow. Figure 4.2 shows the queue size evolution of the input port buffer under such a congestion loss situation. Another assumption we have to make concerns the departure rate of this buffering system. As there are P senders in this flow, and assuming fair scheduling by the switching engine, the expected (average) departure rate $ E[D] $ for each input queue is one packet per P time slots, where each time slot is the time taken by the egress port to transfer one fixed-size data packet, which can be approximated by the $ g_{r}$ parameter.

After the initial burst, as the departure rate of this buffering system is slower than the input rate and the volume of the burst is larger than the buffering capacity, packets at the end of the burst are dropped (label A in Figure 4.2). However, the sender does not detect the loss until the first out-of-sequence packet reaches the gather root (start of period $ t_{a} $) and stimulates a packet loss signal (Nack). As presented in Figure 4.1, the immediate response of the sender to the reception of the Nack is to transit to the ReSend state and retransmit all outstanding packets (another burst of a full window of packets). It then waits in the Stall state until it receives the first positive acknowledgement.

Under the self-clocking principle, removal of one packet at the head of the queue induces the arrival of another packet to fill the space. However, with an out-of-sequence packet, more than one data packet is injected into the network, which causes another period of congestion loss (label B in Figure 4.2). For this burst, as there is only one buffer space left in the queue, only the first packet can get a buffer slot. Since this is the expected packet that the gather root is waiting for, it changes the sender from the Stall state back to the normal state. In Figure 4.2, we label the period starting from the detection of the first out-of-sequence packet to the time slot just before the reception of the correct retransmitted packet as period $ t_{a} $. Within this period, all packets received by the gather root are discarded as they are out-of-sequence packets (the consequence of the packet loss at label A). Moreover, since the retransmission burst (label B) only appears at the start of this period, and there is only limited space left in the queue, most of those packets are dropped. The queue size $ Q_{t}$ gradually drops off as packets are continually drained away without replacement.

Although the sender transits from the Stall state back to the normal state at the start of period $ t_{b} $, it immediately changes back to the ReSend state after the gather root receives the next packet, which is an out-of-sequence packet. Another burst of packets is then generated by the sender, which starts to fill all buffer memory again and eventually induces another overflow situation (label C). We define period $ t_{b} $ as the interval between two consecutive transitions of the sender from the Stall state back to the normal state. As in period $ t_{a} $, all packets except the first packet received by the gather root in period $ t_{b} $ are out-of-sequence packets (the consequence of the packet loss at label B). Switching back to the normal state marks the start of period $ t_{c} $. Within this period, the gather root receives a sequence of in-order packets, which is the outcome of the retransmission burst that occurred during period $ t_{b} $. Period $ t_{c} $ ends when the gather root detects an out-of-sequence packet and the sender switches back to the ReSend state, a consequence of the overflow situation at the end of the retransmission burst at label C.

When we look carefully at the evolution of the queue size as well as the changes in packet status, we find that the packet sequences in periods $ t_{a} $ to $ t_{c} $ form a pattern that recurs over time. Thus, we can simplify our analysis by focusing on the derivation of the throughput efficiency observed in this recurrent cycle - the packet loss cycle. We define the throughput efficiency ($ T_{Eff} $) to be

$\displaystyle T_{Eff}=\frac{\tilde{I}}{\tilde{I}+\tilde{O}}$ (4.1)

where $ \tilde{I}$ denotes the average number of in-order packets delivered to the gather root by this sender during a packet loss cycle, and $ \tilde{O}$ the average number of error packets received that originated from this sender in the same period. To derive $ \tilde{I}$ and $ \tilde{O}$, we need to consider the three subintervals in the packet loss cycle and count the corresponding numbers of good and error packets in each period. We have seen that period $ t_{a} $ consists solely of out-of-sequence packets. The start of this period marks the first error packet caused by the congestion loss at label A. Intuitively, this error packet must move forward $ B_{L}-1 $ units before it gets to the head of the queue. When the gather root receives this error packet, it sends back a Nack to the sender, which results in the last buffer space being filled by a good packet, and this good packet marks the end of period $ t_{a} $. Therefore, we deduce that within the $ t_{a} $ period, the gather root receives $ B_{L}$ error packets.

Similarly, the error packets found in period $ t_{b} $ are induced by the retransmission burst at label B. As the expected arrival rate of this burst is one packet per time slot, while the average departure rate of this buffering system is one packet per P time slots, we deduce that only $ \frac{W_{FC}-1}{P} $ packets can be buffered when they arrive at the input FIFO queue. However, all of these buffered packets are out-of-sequence packets. After this retransmission burst (label B), the sender transits to the Stall state. No more packets are sent to the queue until it changes back to the normal state at the start of period $ t_{b} $, at which point the sender injects another packet into the queue which again is an out-of-sequence packet. Therefore, the expected number of good packets received by the gather root during period $ t_{b} $ is one packet, and the expected number of error packets received in this period is $ \frac{W_{FC}-1}{P}+1 $ packets. Based on the same principle, we estimate the number of good packets buffered within period $ t_{c} $. Before the retransmission burst arrives at the FIFO queue at the start of period $ t_{b} $, there are $ \frac{W_{FC}-1}{P} $ error packets in the queue. So we would expect to have $ \frac{B_{L}-\frac{W_{FC}-1}{P}}{1-\frac{1}{P}} $ buffer spaces to accommodate the in-order packets before the buffer overflows again, and this becomes the number of good packets received in period $ t_{c} $. Thus, the throughput efficiency observed in a packet loss cycle becomes

$\displaystyle T_{Eff}$ $\displaystyle =$ $\displaystyle \frac{\frac{B_{L}-\frac{W_{FC}-1}{P}}{1-\frac{1}{P}}+1}{\frac{B_{L}-\frac{W_{FC}-1}{P}}{1-\frac{1}{P}}+2+\frac{W_{FC}-1}{P}+B_{L}}$  
  $\displaystyle \approx$ $\displaystyle \frac{B_{L}+1-\frac{W_{FC}}{P}}{2B_{L}+2-\frac{B_{L}+2}{P}}$ (4.2)

Although $ T_{Eff} $ is derived by observing the dynamic behavior within a single input FIFO queue, if we assume that the same situation happens to all input FIFO queues and that each sender gets a fair share (i.e. $ \frac{1}{P} $) of the available bandwidth, then the expected throughput efficiency observed on this many-to-one flow should be

$\displaystyle T^{input}_{Eff}=P*(T_{Eff}*\frac{1}{P})=T_{Eff}$ (4.3)

In conclusion, from formula (4.2), we observe that under the congestion loss problem, the sustained throughput of this many-to-one flow is less than 50% of the available bandwidth, since the denominator is at least twice as large as the numerator. As the congestion loss problem only happens when $ W_{FC}>B_{L} $, we only need to consider the effect of $ W_{FC}$. From the formula, we observe that the $ W_{FC}$ factor has a negative linear relation with the throughput efficiency, so that increasing $ W_{FC}$ decreases the sustained throughput. The factor $ P $, on the other hand, appears to have an insignificant effect on the final throughput.
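To make the prediction easy to reuse, the following short sketch evaluates Eq. (4.2) directly; the sample parameter values in the example call are illustrative (loosely based on the 43-packet input buffers discussed later), not new measurements.

\begin{verbatim}
# Sketch: evaluate the input-buffered prediction of Eq. (4.2)/(4.3).
# B_L = per-port buffer size (packets), W_FC = flow-control window, P = senders.

def t_eff_input(B_L, W_FC, P):
    """Approximate throughput efficiency under congestion loss (Eq. 4.2)."""
    if W_FC <= B_L:
        raise ValueError("no congestion loss expected when W_FC <= B_L")
    num = B_L + 1 - W_FC / P
    den = 2 * B_L + 2 - (B_L + 2) / P
    return num / den     # by Eq. (4.3) this is also the aggregate efficiency

if __name__ == "__main__":
    for W in (48, 64, 96):                       # illustrative window sizes
        print(W, round(t_eff_input(43, W, 15), 3))
\end{verbatim}

As expected from the discussion above, every value printed stays below 0.5, and it decreases as $ W_{FC}$ grows.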


4.3.2.1 Experimental Evaluations

We validate $ T^{input}_{Eff} $ using measurement data obtained from our experimental cluster platform. We use a cluster of 24 high-end PCs (PIII 733) connected by the IBM 8275-326 Fast Ethernet switch, which our benchmarks reveal to be an input-buffered architecture with $ B_{L}=43 $ units per input port (Appendix A). To drive the Fast Ethernet network, we use the Directed Point (DP) low-latency communication system and implement the described Go-Back-N scheme to support reliability on top of DP. We have shown in Chapter 3 that with such a high-end cluster, the performance bottleneck falls on the network component.

To simulate the many-to-one data flow under steady-state streaming, we gather all performance data by running the Gather collective operation with $ P+1 $ processes, where each process sends 30000 full-size data packets to the gather root; e.g. for $ P=15$, the total message length received by the gather root is approximately 640MB. (This exceeds the memory capacity of a single cluster node, which only has 128 MB of primary memory. To avoid paging overheads that would affect the final throughput, all incoming data at the gather root are discarded immediately after protocol checking, without copying from the staging buffer to the user-space buffer. This avoids the need for such a large receive buffer and hence the paging overhead.) To induce the congestion loss problem, we start the experiments with $ W_{FC}>B_{L} $ for various $ W_{FC}$, P and timeout (TO) combinations, and observe their effects on the final performance. To simplify our analysis, we assume that the timeout value is set to a sufficiently large value to avoid false retransmission. (The timeout setting depends on the available network information. As this protocol is designed for communication on an enclosed network with negligible transmission error, we can use the $ g_{s}$, $ g_{r}$ and $ B_{L}$ parameters to estimate the timeout setting, instead of using round-trip estimation.) Each test is conducted with P senders sending their messages ``continuously'' to the sole receiver, and we measure how long it takes for all processes to complete this collective operation. By dividing the theoretical performance of the gather operation by the measured result, we calculate the throughput efficiency of this gather operation under the congestion loss problem.
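One plausible way to compute this efficiency is sketched below; the assumption that the theoretical gather time equals $ P $ times the per-sender packet count times $ g_{r}$ (a root-limited lower bound), and all numbers in the example call, are ours and purely illustrative.

\begin{verbatim}
# Sketch of how the efficiency figure could be derived from a timed run,
# assuming the theoretical gather time is the time the root needs to absorb
# all P * n_packets packets at one packet per g_r (a root-limited lower bound).

def gather_efficiency(P, n_packets, g_r, measured_time):
    theoretical_time = P * n_packets * g_r
    return theoretical_time / measured_time

# Made-up example: 15 senders, 30000 packets each, g_r ~ 123 us per full-size
# Fast Ethernet frame, and a measured run of 100 s.
print(round(gather_efficiency(15, 30000, 123e-6, 100.0), 2))
\end{verbatim}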

Figure 4.3: The measured throughput efficiency of our GBN reliable transmission protocol as compared to the measured throughput efficiency of the simple GBN scheme on the IBM 8275-326 switch.
[our GBN - Timeout = 1000] \resizebox*{0.4\columnwidth}{1.4in}{\includegraphics{figures/gbn/326-new-P-TO1000.eps}} [simple GBN - Timeout = 1000] \resizebox*{0.4\columnwidth}{1.4in}{\includegraphics{figures/gbn/326-gbn-P-TO1000.eps}}

[our GBN - Timeout = 2000] \resizebox*{0.4\columnwidth}{1.4in}{\includegraphics{figures/gbn/326-new-P-TO2000.eps}} [simple GBN - Timeout = 2000] \resizebox*{0.4\columnwidth}{1.4in}{\includegraphics{figures/gbn/326-gbn-P-TO2000.eps}}

[our GBN - Timeout = 2000] \resizebox*{0.4\columnwidth}{1.4in}{\includegraphics{figures/gbn/326-new-W-TO2000.eps}} [simple GBN - Timeout = 2000] \resizebox*{0.4\columnwidth}{1.4in}{\includegraphics{figures/gbn/326-gbn-W-TO2000.eps}}

[our GBN - P = 15] \resizebox*{0.4\columnwidth}{1.4in}{\includegraphics{figures/gbn/326-new-to.eps}} [simple GBN - P = 15] \resizebox*{0.4\columnwidth}{1.4in}{\includegraphics{figures/gbn/326-gbn-to.eps}}

The first set of experimental results is presented in Figure 4.3. In this figure, we compare the throughput efficiency ( $ T_{Eff}^{input} $) of our GBN scheme and of a simple GBN scheme under the above experimental settings on an input-buffered switch. The simple GBN scheme is the classical version of GBN, which does not have the fast retransmission mechanism; whenever the receiver receives an out-of-sequence packet, it simply drops it and waits for the timeout retransmission. In general, the performance of our GBN scheme is better than that of the simple GBN scheme, except at a few data points. In particular, our GBN scheme is quite insensitive to the number of senders and the timeout parameter (Figure 4.3(a)(c)(g)), while the simple GBN scheme varies substantially with different P, $ W_{FC}$ and timeout parameters (Figure 4.3(b)(d)(f)(h)). With the simple GBN scheme, we observe that the timeout value is closely related to the number of senders (Figure 4.3(b)(d)): for each P value there is a specific timeout setting that fits it best; otherwise, the performance deteriorates considerably (Figure 4.3(h)). Another interesting observation in Figure 4.3 is that the maximum achieved throughput efficiency is no better than 50% of the available bandwidth for both GBN measurements.

Figure 4.4: Comparisons of the measured and predicted performance on the congestion loss problem under input-buffered architecture with our GBN reliable transmission protocol.
[ Measured - Timeout=2000] \resizebox*{0.45\textwidth}{!}{\includegraphics{figures/gbn/in-proc1.eps}} [ Predicted -Timeout=2000] \resizebox*{0.45\textwidth}{!}{\includegraphics{figures/gbn/in-proc2.eps}}

[ Measured - Timeout=2000] \resizebox*{0.45\textwidth}{!}{\includegraphics{figures/gbn/in-win1.eps}} [ Predicted - Timeout=2000] \resizebox*{0.45\textwidth}{!}{\includegraphics{figures/gbn/in-win2.eps}}

[ Measured - P=15] \resizebox*{0.45\textwidth}{!}{\includegraphics{figures/gbn/in-to1.eps}} [Predicted - P=15] \resizebox*{0.45\textwidth}{!}{\includegraphics{figures/gbn/in-to2.eps}}

Figure 4.4 presents a comparison of the measured throughput efficiency of our GBN scheme with the performance predicted using Eq. 4.2. Although our predictions do not accurately match the measured performance (the 95% confidence level of the prediction error is $ 13\%\pm 0.8\% $), our empirical formula does capture the salient features described in the previous subsection. First, when there is a congestion loss problem, the measured performance is less than 45% of the available bandwidth, and this behavior is independent of the timeout setting provided it is set to a sufficiently large value to avoid false alarms (shown in Figure 4.4(e)). This finding shows the importance of adopting congestion avoidance control in the first place, as more than 50% of the performance is wasted. Second, both the measured and predicted results show that the throughput efficiency is only slightly affected by the number of senders, though we find that the throughput efficiency improves as $ P $ increases (Figure 4.4(a)(c)). The measured results for different $ W_{FC}$ (Figure 4.4(c)) show that an inverse relation exists between the throughput efficiency and $ W_{FC}$; however, the degradation in performance (slope) is greater than what we estimated.


4.3.3 Congestion Behavior on Output-Buffered Architecture

We have shown that with an input-buffered architecture, we lose more than 50% of the theoretical performance under the congestion loss problem. In this section, we turn our analysis to the congestion dynamics of the output-buffered architecture under the same GBN ARQ scheme. Based on these analytical studies, we then compare the performance differences between the buffering architectures.

With the output-buffered architecture, packets start to accumulate in the buffers associated with the egress port that leads to the gather root. As we assume that the buffer size is finite, at any given time $ t>0 $ the queue size ($ Q_{t}$) must satisfy the constraint $ 0\leq Q_{t}\leq B_{L} $. From this observation, we conclude that congestion loss happens on this architecture if and only if the initial burst of data packets is larger than $ B_{L}$, that is, when $ W_{FC}*P>B_{L} $. Compared to an input-buffered architecture with the same per-port buffering capacity and the same number of sources, the output-buffered architecture is therefore more susceptible to congestion loss under the many-to-one flow.
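The two overflow conditions can be summarized in a few lines; the sketch below simply restates them for a quick configuration check (the sample values are illustrative).

\begin{verbatim}
# Sketch: initial-burst overflow conditions for the two buffering architectures.

def congestion_loss_expected(arch, W_FC, P, B_L):
    if arch == "input-buffered":
        return W_FC > B_L          # burst seen per ingress port
    if arch == "output-buffered":
        return W_FC * P > B_L      # all senders share one egress-port buffer
    raise ValueError("unknown architecture: " + arch)

# Example: an output-buffered port with B_L = 95 overflows even for a
# modest window once enough senders converge on it.
print(congestion_loss_expected("output-buffered", W_FC=12, P=15, B_L=95))  # True
\end{verbatim}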

To study the steady-state contention behavior, we assume that the communication event starts with $ W_{FC}*P>B_{L} $ and that all sources have an unlimited amount of data to send. We also assume that our switched network adopts the drop-tail discipline, so packets that arrive at a full buffer are dropped unconditionally. This dropping policy induces some form of temporal correlation amongst the senders, which results in waves of retransmission bursts. Under such a scenario, some senders fall through to the Stall state as their first packets cannot find an empty slot in the output queue, and thus go into hibernation (i.e. remain inactive until the retransmission timer expires). Some senders may alternate between cycles of Pack $ ^{i}\rightarrow $Nack events and eventually go into hibernation too. Only a limited number of senders get through this contention period and remain active until the slumbering senders receive their timeout signals and start another cycle of congestion. Therefore, if we collect the activity profile of a particular sender, we see that its activities cycle between Pack, Nack and Timeout events. Figure 4.5 shows a sample trace of a sender's activities, corresponding to such an arbitrary sequence of state transitions over time.

Figure 4.5: The sample trace of a sender's activities
\resizebox*{0.85\columnwidth}{!}{\includegraphics{figures/gbn/trace.eps}}

(Legend: $ P^{i} $- i consecutive Packs; $ N $ - Nack; $ TO^{j} $ - j Timeout events)


Based on the above observations, we model the congestion behavior of our reliable protocol in terms of activity cycles of the sender, with each cycle characterized by recurrent patterns which, we believe, are statistically independent but are probabilistic replicas of one another. A sample cycle is given in Figure 4.6, which is an abstract representation of the sample trace shown in Figure 4.5.

Figure 4.6: A typical activity cycle of a sender which is composed of a sequence of recurrent patterns
\resizebox*{0.9\columnwidth}{!}{\includegraphics{figures/gbn/pattern.eps}}

The cycle begins with the transition of the sender from the Stall state back to the normal state upon receiving a Pack response after a series of timeout retransmissions. (Although our reliable protocol implements a timeout timer for each outstanding packet, in this model a timeout retransmission event corresponds to a batch retransmission of all outstanding packets rather than individual retransmissions. This is because our reliable protocol is implemented in user space, and therefore the inter-message submission gap is governed by the $ O_{s}$ parameter. Since under normal circumstances $ O_{s}<g_{s} $, we can safely assume that the individual retransmission timers coalesce into one single batch retransmission timer.) After getting back to the normal state, the sender continues with a sequence of recurrent patterns, each composed of a series of Pack events ending with a Nack event. Each $ PNP $ transition corresponds to the detection of a packet loss that is successfully recovered by fast retransmission. However, under severe congestion, fast retransmission does not help; the sender then falls through to the Stall state and waits for the next timeout signal. This is reflected in the activity cycle as a series of TO events that follow the Nack event. Finally, the cycle is terminated by the last timeout retransmission, which marks the onset of the next activity cycle.

We now derive an analytical model that captures the throughput efficiency observed by a particular sender under this buffering architecture. Let $ T_{i}^{TO} $ denote the duration of a sequence of timeout events and $ T_{i}^{S} $ the time interval between two consecutive timeout sequences. Adding these two intervals gives $ C_{i}=T_{i}^{S}+T_{i}^{TO} $, the elapsed time of a particular activity cycle. We also define $ U_{i} $ to be the number of in-order packets received by the gather root from this sender during the $ C_{i} $ interval. Viewing $ \left\{ (U_{i},C_{i})\right\} _{i} $ as a stochastic sequence of random variables, we define the throughput efficiency to be

$\displaystyle T_{Eff}=\frac{E[U]}{\frac{E[C]}{g_{r}}}=\frac{E[U]*g_{r}}{E[C]}$

where E[U] denotes the expected number of packets accepted by the gather root in one activity cycle and E[C] the expected duration of one activity cycle. Without packet loss, we expect the gather root to handle one packet per $ g_{r}$ time units; thus, the maximum number of packets that the gather root can handle during E[C] time units is $ \frac{E[C]}{g_{r}} $ packets.

To derive E[U], we have to estimate the total number of Pack responses received within those $ P^{i_{n}}N$ sequences. Let $ y_{i} $ be the number of $ P^{i_{n}}N$ sequences in the interval $ T_{i}^{S} $. For the k-th $ P^{i_{n}}N$ sequence, we define $ x_{i_{k}} $ to be the number of Pack responses received in that period and $ S_{i_{k}} $ to be the duration of that period. Assuming that the number of Pack responses received in one $ P^{i_{n}}N$ sequence is statistically independent of the number of Pack responses in another $ P^{i_{n}}N$ sequence, we can view $ \left\{ x_{i_{k}}\right\} $ as a sequence of independent and identically distributed (i.i.d.) random variables. The same assumption applies to $ y_{i} $ and $ S_{i_{k}} $. Now we have

$\displaystyle U_{i}=\sum _{k=1}^{y_{i}}x_{i_{k}}\: \: \Rightarrow \: \: E[U]=E\left[ \sum ^{y_{i}}_{k=1}x_{i_{k}}\right] \: \: \Rightarrow \: \: E[U]=E[y]*E[x]$

$\displaystyle T_{i}^{S}=\sum ^{y_{i}}_{k=1}S_{i_{k}}\: \: \Rightarrow \: \: E[T^{S}]=E\left[ \sum ^{y_{i}}_{k=1}S_{i_{k}}\right] \: \: \Rightarrow \: \: E[T^{S}]=E[y]*E[S]$

To derive E[C], let $ m_{i} $ be the number of timeout periods in the interval $ T_{i}^{TO} $; since the duration of each timeout period is fixed, we have

$\displaystyle C_{i}=T_{i}^{S}+T_{i}^{TO}\: \: \Rightarrow \: \: E[C]=E[T^{S}]+E[T^{TO}]\: \: \Rightarrow \: \: E[C]=E[y]*E[S]+E[T^{TO}]$

Assuming that the number of timeout periods in each activity cycle is also an i.i.d. random variable, we have

$\displaystyle E[C]=E[y]*E[S]+E[m]*TO$

and thus,

$\displaystyle T_{Eff}=\frac{E[y]*E[x]*g_{r}}{E[y]*E[S]+E[m]*TO}$ (4.4)

Figure 4.7: 3-state Markov chain model
\resizebox*{3.5in}{2.5in}{\includegraphics{figures/gbn/state.eps}}

To derive E[y], E[x] and E[m], we make use of concepts from discrete-time Markov chain models [50] to model the activity profile of a sender. A 3-state Markov chain (see Figure 4.7) is constructed to represent the three different events that appear in a sender's profile: P - Pack, N - Nack and TO - timeout. With reference to the state transition diagram (Figure 4.1) of our GBN protocol, a Pack event means working in or transiting to the normal state, a Nack event means transiting to the ReSend state (fast retransmission), and a TO event means slumbering in the Stall state and being woken up by the timeout signal. A transition from one event to another corresponds to the detection of, or recovery from, some loss event. Therefore, the transition probabilities between certain events are directly related to the loss probabilities on the congested path.

Hence, we define the stochastic process on which this 3-state loss model is based as $ X_{i}(\omega ) $ = the type of event delivered to the sender upon receiving the $ i^{th} $ event, with finite state space $ \ddot{S}=\{P,N,TO\} $, and the corresponding transition probability matrix is

$\displaystyle \left[ \begin{array}{ccc}
1-c & c & 0\\
a & 0 & 1-a\\
f & 0 & 1-f
\end{array}\right] ,$

with the rows representing transitions out of the states P, N and TO, respectively (in that order, matching the columns).

The transition probabilities (1-a), c and (1-f) reflect the different loss probabilities experienced by a particular sender during different congestion stages. The probability c measures the likelihood that a sender encounters a lost packet event, so a high value indicates that the congestion loss problem occurs frequently. The probability (1-a) measures the severity of the congestion problem, so a high value means the competition is indeed fierce. In addition, a high value of (1-f) indicates a high degree of clustering of retransmission bursts, which means a high degree of temporal correlation between losses.
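To fix ideas, the short sketch below encodes this chain and samples a synthetic event trace; the particular values of a, c and f in the example are illustrative, not measured.

\begin{verbatim}
# Sketch: the 3-state chain of Figure 4.7 as a transition matrix plus a sampler
# that produces a synthetic event trace.  Probability values are illustrative.
import random

STATES = ["P", "N", "TO"]

def transition_matrix(a, c, f):
    # Rows/columns ordered P, N, TO, matching the matrix in the text.
    return [[1 - c, c,   0.0  ],    # from P
            [a,     0.0, 1 - a],    # from N
            [f,     0.0, 1 - f]]    # from TO

def sample_trace(a, c, f, n_events=60, seed=0):
    rng, M = random.Random(seed), transition_matrix(a, c, f)
    state, trace = 0, ["P"]                      # start in the normal (P) state
    for _ in range(n_events - 1):
        state = rng.choices(range(3), weights=M[state])[0]
        trace.append(STATES[state])
    return trace

print(" ".join(sample_trace(a=0.8, c=0.05, f=0.55)))
\end{verbatim}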

Now we first derive E[x]. Given our loss model and the assumption of a stationary discrete-time Markov chain (stationary meaning that the probability of going from one state to another is independent of the time at which the step is made [50]), the probability that $ x_{i_{k}}=n $ is equal to the probability of observing $ n $ P events before a PN transition occurs. Thus,

$\displaystyle Pr[x_{i_{k}}=n]=(1-c)^{n-1}c\qquad \textrm{for n}=1,2,\ldots $

and the expected value of $ x $ is

$\displaystyle E[x]=\sum ^{\infty }_{n=1}(1-c)^{n-1}cn=\frac{1}{c}$ (4.5)

Likewise, the probability that $ y_{i}=h $ is equal to the probability that $ h-1 $ consecutive $ P^{i_{n}}N$ sequences appear before a Timeout event occurs, that is

$\displaystyle Pr[y_{i}=h]=a^{h-1}(1-a)\qquad \textrm{for }h=1,2,\ldots $

and the expected value of y becomes

$\displaystyle E[y]=\sum ^{\infty }_{h=1}a^{h-1}(1-a)h=\frac{1}{1-a}$ (4.6)

And the probability mass function and the expected value of the random variable m are

$\displaystyle Pr[m_{i}=q]=(1-f)^{q-1}f$

$\displaystyle E[m]=\sum ^{\infty }_{q=1}(1-f)^{q-1}fq=\frac{1}{f}$ (4.7)

Figure 4.8: Events in a typical $ P^{i_{n}}N$ sequence
\resizebox*{0.9\columnwidth}{!}{\includegraphics{figures/gbn/series.eps}}

Finally, to derive an expression for E[S], we look into the time scale of a particular $ P^{i_{n}}N$ sequence (see Figure 4.8). Let $ b^{j}_{i_{k}} $ be the time gap between the appearances of the (j-1)$ ^{th} $ and the $ j^{th} $ Pack events; $ e_{i_{k}} $ be the time gap between the last Pack event and the Nack event; and $ r_{i_{k}} $ be the elapsed time between the Nack event of the last $ P^{i_{n}}N$ sequence and the first Pack event of the current sequence. Then we can express $ S_{i_{k}} $ as

$\displaystyle S_{i_{k}}=r_{i_{k}}+\sum ^{x_{i_{k}}-1}_{j=1}b^{j}_{i_{k}}+e_{i_{k}}.$

For simplicity, we also assume that $ e_{i_{k}} $, $ r_{i_{k}} $ and $ b^{j}_{i_{k}} $ are random variables independent of each other, and that the sequence of Pack events can be modeled as a Poisson process, so that $ b^{j}_{i_{k}} $ is exponentially distributed with mean $ \lambda $. It follows that the expected value of S becomes

$\displaystyle E[S]=E[r]+E\left[ \sum ^{x_{i_{k}}-1}_{j=1}b_{i_{k}}^{j}\right] +E[e]$

$\displaystyle E[S]=E[r]+(E[x]-1)E[b]+E[e]$

From Figure 4.8, we see that $ r_{i_{k}} $ is the time elapsed between a Nack event and a Pack event. When a sender receives a Nack event, it responds by re-sending all outstanding packets, and we expect the next Pack event to be triggered by the reception of the first packet of this ReSend burst. Therefore, the elapsed time between this Nack/Pack pair depends on the current queue size ($ Q_{t}$), as the first packet of the ReSend burst needs to move forward to the head of the queue before a Pack event is generated. Since $ Q_{t}$ is bounded by $ B_{L}$, and under steady-state conditions of the many-to-one flow the buffer queue operates in an almost-full condition, we have

$\displaystyle E[r]\approx B_{L}*g_{r}.$

To derive $ E[b] $, which is equal to $ \lambda $, we have to determine the average inter-arrival time $ \lambda $ of the Pack events in a $ P^{i_{n}}N$ sequence. If there were no congestion loss and all P senders got a fair share of the bandwidth, we would expect the average inter-arrival time observed by a particular sender to be $ P*g_{r} $. However, under congestion loss with our reliable protocol, not all senders are in active sending mode, as some are forced to stay in hibernation. Hence we postulate the expected number of active senders ($ E[A] $) to be $ \frac{B_{L}}{W_{FC}} $, on the ground that with a buffering capacity of $ B_{L}$ units, after vigorous competition, on average only $ E[A] $ senders can win a place for their full window of packets. Therefore, we have

$\displaystyle E[b]=E[A]*g_{r}=\frac{B_{L}*g_{r}}{W_{FC}}.$

We follow the same logic to deduce $ E[e] $. Recall that a Pack is sent to a sender only when the gather root receives an in-order packet from that sender, so the inter-arrival time ($ E[b] $) between Pack events is directly related to the elapsed time between consecutive packets from the same sender. To determine $ E[e] $, we have to estimate how many packets from this sender are lost before the gather root detects the loss. If we consider the loss to be uniformly distributed between 1 and $ W_{FC}-1 $, then on average we would expect $ \frac{W_{FC}-1}{2} $ lost packets before the gather root detects the loss situation. Since consecutive packets are separated by $ E[b] $ time units, $ E[e] $ becomes

$\displaystyle E[e]=\frac{W_{FC}-1}{2}*E[b]=\frac{(W_{FC}-1)*B_{L}*g_{r}}{2W_{FC}}.$

Now we have


$\displaystyle E[S]$ $\displaystyle =$ $\displaystyle B_{L}*g_{r}+(\frac{1}{c}-1)\left( \frac{B_{L}*g_{r}}{W_{FC}}\right) +\frac{(W_{FC}-1)*B_{L}*g_{r}}{2W_{FC}}$  
  $\displaystyle \approx$ $\displaystyle \left( \frac{3}{2}+\frac{1-c}{cW_{FC}}\right) *B_{L}*g_{r}$ (4.8)

Substituting (4.5), (4.6), (4.7) and (4.8) into (4.4), we have


$\displaystyle T_{Eff}$ $\displaystyle =$ $\displaystyle \frac{g_{r}}{\left( \frac{3c}{2}+\frac{1-c}{W_{FC}}\right) *B_{L}*g_{r}+\frac{(1-a)*c*TO}{f}}$  
  $\displaystyle =$ $\displaystyle \frac{1}{\left( \frac{3c}{2}+\frac{1-c}{W_{FC}}\right) *B_{L}+\frac{(1-a)*c*\overline{TO}}{f}}$ (4.9)

where $ \overline{TO}=\frac{TO}{g_{r}} $. As $ T_{Eff} $ represents the throughput efficiency observed by one particular sender, the aggregate performance of this many-to-one flow under the congestion loss problem becomes

$\displaystyle T^{output}_{Eff}=P*T_{Eff}.$ (4.10)
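For convenience, the prediction of Eqs. (4.9) and (4.10) can be packaged as a small function; the sketch below does only that, and the example call uses the first row of Table 4.1 (reported later), whose predicted value of about 0.764 it reproduces.

\begin{verbatim}
# Sketch: evaluate Eqs. (4.9)-(4.10) for the output-buffered case, given the
# three transition probabilities.  TO_bar = TO / g_r, as in the text.

def t_eff_output(a, c, f, W_FC, B_L, TO_bar, P):
    per_sender = 1.0 / ((1.5 * c + (1 - c) / W_FC) * B_L
                        + (1 - a) * c * TO_bar / f)      # Eq. (4.9)
    return P * per_sender                                # Eq. (4.10)

# First row of Table 4.1 (IBM 8275-416, B_L = 95): prints roughly 0.764.
print(round(t_eff_output(a=1 - 0.0547, c=0.0218, f=0.5329,
                         W_FC=18, B_L=95, TO_bar=400, P=7), 3))
\end{verbatim}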

The above empirical equation (4.9) for predicting the throughput efficiency on the output-buffered architecture is formulated in terms of three transition probabilities - a, c and f. From the formula, we observe that the transition probability c plays a significant role in the final performance: if we can find some way to minimize c, we obtain higher throughput. To utilize this equation, we need methods to determine these probabilities. However, we cannot identify a clean association between our target performance parameters ($ W_{FC}$, $ P $, $ B_{L}$ & TO) and the transition probabilities. Consequently, we have to rely on an inferential approach [50] to statistically estimate these transition probabilities, as is commonly done in other performance studies [119,74,112,63]. In particular, we look at the information gathered from a sample trace and use it to estimate the transition probabilities. For example, in our case, we have

$\displaystyle \tilde{c}=\frac{\textrm{total number of }PN\textrm{ transitions}}{\textrm{total number of }P}$ (4.11)

$\displaystyle \widetilde{1-a}=\frac{\textrm{total number of transitions from }N\textrm{ to }TO}{\textrm{total number of }P^{i_{n}}N\textrm{ sequences}}$ (4.12)

$\displaystyle \tilde{f}=\frac{\textrm{total number of transitions from }TO\textrm{ to }P}{\textrm{total number of }TO}$ (4.13)
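A direct transcription of these counting rules is sketched below; the token-list trace format is an assumption made for illustration, while the actual traces in our experiments are per-sender activity logs.

\begin{verbatim}
# Sketch: estimate c, (1-a) and f from an event trace via Eqs. (4.11)-(4.13).
# The trace is assumed to be a list of "P", "N", "TO" tokens.

def estimate_probabilities(trace):
    pairs = list(zip(trace, trace[1:]))
    n_P, n_TO, n_seq = trace.count("P"), trace.count("TO"), trace.count("N")
    n_PN  = sum(1 for s, t in pairs if s == "P"  and t == "N")   # P  -> N
    n_NTO = sum(1 for s, t in pairs if s == "N"  and t == "TO")  # N  -> TO
    n_TOP = sum(1 for s, t in pairs if s == "TO" and t == "P")   # TO -> P
    c_hat       = n_PN  / n_P     # Eq. (4.11)
    one_minus_a = n_NTO / n_seq   # Eq. (4.12): each P^i N sequence ends in one N
    f_hat       = n_TOP / n_TO    # Eq. (4.13)
    return c_hat, one_minus_a, f_hat

toy = ["P","P","P","P","P","N", "P","P","P","N", "TO","TO", "P","P","P","P","N"]
print(estimate_probabilities(toy))    # (0.25, 0.333..., 0.5) for this toy trace
\end{verbatim}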


4.3.3.1 Experimental Evaluations

A series of experiments is conducted to validate the above empirical formula. All of these tests use the same cluster as in Subsection 4.3.2.1; however, the number of cluster nodes involved depends on the configuration of the switch, up to at most 32 nodes. Furthermore, the same GBN reliable transmission protocol and the DP package are used. The first set of experiments is conducted with this cluster interconnected by the 16-port IBM 8275-416 Ethernet switch. Under our benchmark tests, this switch is revealed to have an output-buffered architecture with $ B_{L}=95 $ units per output port. As this switch can only support 16 full 100 Mb/s connections, we use a cluster size of 16 nodes for all tests. The second set of experiments is conducted by interconnecting the cluster with the Cisco Catalyst 2980G Ethernet switch, which has 80 Fast Ethernet ports and 2 Gigabit Ethernet ports. By applying our benchmark tests to this switch, we find that its internals adopt an output-buffered allocation scheme on a shared memory architecture (in the supporting documentation for this switch [99], Cisco claims that it has a low-latency, centralized, shared-memory switching fabric), with $ B_{L}=128 $ units per output port. We connect only 32 cluster nodes to this switch, as this is the maximum size we have for this homogeneous cluster.

We employ a testing methodology similar to that of Subsection 4.3.2.1 on each platform; however, to induce the congestion loss problem, the window sizing constraint is different, namely $ W_{FC}*P>B_{L} $. Since the two switching platforms have different buffering capacities and port counts, the ranges over which the $ W_{FC}$ and $ P $ parameters are varied differ too. In order to apply our empirical formula (Eq. 4.9), we collect activity traces from the processes during the tests. We obtain the measured performance for different parameter settings (different combinations of $ W_{FC}$, $ P $ and $ TO$) by running the same tests for 30 iterations and taking the average timing as the measurement. To minimize the runtime memory required to store the activity traces, only the activity traces of all senders in the last iteration are returned. The corresponding transition probabilities are then calculated for each sender, and finally the mean values over all senders are taken as the transition probabilities for this particular parameter set. Table 4.1 displays a sample set of data collected from the IBM 8275-416 platform by this method.


Table 4.1: Sample data collected on the 416 platform
$ P $ $ W_{FC}$ $ \overline{TO} $ $ \tilde{c} $ $ \widetilde{1-a} $ $ \tilde{f} $ Measured $ T_{Eff}^{output} $ Predicted $ T_{Eff}^{output} $
7 18 400 0.0218 0.0547 0.5329 0.8375 0.7639
7 24 400 0.0216 0.1383 0.5300 0.7846 0.7598
7 36 400 0.0227 0.2344 0.5420 0.6971 0.7189
11 18 400 0.0357 0.2097 0.6109 0.7345 0.7294
11 24 400 0.0338 0.2863 0.5763 0.6947 0.7162
11 36 400 0.0305 0.3956 0.5207 0.6380 0.6798
15 18 400 0.0495 0.2907 0.5346 0.6309 0.6573
15 24 400 0.0441 0.3674 0.4977 0.5975 0.6501
15 36 400 0.0341 0.4675 0.4053 0.5596 0.6477
7 18 1600 0.0112 0.0578 0.6810 0.8929 0.8397
7 24 1600 0.0118 0.0834 0.6091 0.8822 0.8550
7 36 1600 0.0126 0.1106 0.5628 0.8027 0.8363
11 18 1600 0.0202 0.0931 0.6390 0.8366 0.8621
11 24 1600 0.0163 0.1522 0.6444 0.8222 0.8884
11 36 1600 0.0164 0.1798 0.5789 0.7516 0.8390
15 18 1600 0.0204 0.1952 0.6922 0.7976 0.8670
15 24 1600 0.0174 0.2230 0.5918 0.7785 0.8905
15 36 1600 0.0175 0.2635 0.5641 0.6877 0.8238


Figure 4.9: Comparing the performance of our GBN scheme with the simple GBN scheme on the IBM 8275-416 switch
[our GBN - Timeout = 1000] \resizebox*{0.45\textwidth}{!}{\includegraphics{figures/gbn/416-new-P-TO1000.eps}} [simple GBN - Timeout = 1000] \resizebox*{0.45\textwidth}{!}{\includegraphics{figures/gbn/416-gbn-P-TO1000.eps}}

[our GBN - Timeout = 1000] \resizebox*{0.45\textwidth}{!}{\includegraphics{figures/gbn/416-new-W-TO1000.eps}} [simple GBN - Timeout = 1000] \resizebox*{0.45\textwidth}{!}{\includegraphics{figures/gbn/416-gbn-W-TO1000.eps}}

[our GBN - P = 15] \resizebox*{0.45\textwidth}{!}{\includegraphics{figures/gbn/416-new-to.eps}} [simple GBN - P = 15] \resizebox*{0.45\textwidth}{!}{\includegraphics{figures/gbn/416-gbn-to.eps}}

We start the analysis of our GBN scheme by comparing its performance with the simple GBN scheme (described in Subsection 4.3.2.1) on the IBM 8275-416 switch; the results are shown in Figure 4.9. In contrast with the input-buffered architecture, we find that our GBN scheme works effectively on the output-buffered architecture, while the simple GBN scheme performs extremely inefficiently on this architecture. The measured results for the simple GBN scheme show that its performance is insensitive to the P and $ W_{FC}$ parameters, but depends slightly on the timeout setting. The major difference between our GBN scheme and the simple GBN scheme is the existence of the fast retransmission mechanism. However, fast retransmission is simply a stochastic approach to improving efficiency, as each sender attempts to recover from the loss before falling through to the Stall state. Through this kind of random selection, some senders get the chance to continue their transmissions and effectively utilize the bandwidth, while the rest have to wait for their chance after a timeout interval.

Figures 4.10 and 4.11 compare the measured and predicted performance of our GBN scheme on the IBM 8275-416 platform for various parameter sets. In general, the empirical formula (Eq. 4.10) correctly reflects the macroscopic behavior of our GBN reliable transmission protocol on the output-buffered architecture. Specifically, both the measured and predicted results (Figure 4.10) show that a larger TO setting improves the throughput efficiency. This behavior is a result of the fast retransmission mechanism, which randomly selects a few senders to continue and denies the rest. Besides the TO parameter, we also observe that both the $ W_{FC}$ and $ P $ parameters are inversely related to the throughput efficiency.

Figure 4.10: Comparison of the measured and predicted performance of the IBM 8275-416 switch under heavy congestion loss problem with our GBN reliable transmission protocol subjected to different timeout settings
[Measured - P=7] \resizebox*{0.48\columnwidth}{1.8in}{\includegraphics{figures/gbn/416-p8M.eps}} [Predicted - P=7] \resizebox*{0.48\columnwidth}{1.8in}{\includegraphics{figures/gbn/416-p8P.eps}}

[Measured - P=11] \resizebox*{0.48\columnwidth}{1.8in}{\includegraphics{figures/gbn/416-p12M.eps}} [Predicted - P=11] \resizebox*{0.48\columnwidth}{1.8in}{\includegraphics{figures/gbn/416-p12P.eps}}

[Measured - P=15] \resizebox*{0.48\columnwidth}{1.8in}{\includegraphics{figures/gbn/416-p16M.eps}} [Predicted - P=15] \resizebox*{0.48\columnwidth}{1.8in}{\includegraphics{figures/gbn/416-p16P.eps}}

Figure 4.11: Comparisons of the measured and predicted performance of the IBM 8275-416 switch with our GBN reliable transmission protocol. The main focus is on revealing the effects of the P and $ W_{FC}$ parameters on the final performance.
[Measured - Timeout=200] \resizebox*{0.45\columnwidth}{!}{\includegraphics{figures/gbn/416P-TO200M.eps}} [Predicted - Timeout=200] \resizebox*{0.45\columnwidth}{!}{\includegraphics{figures/gbn/416P-TO200P.eps}}

[Measured - Timeout=200] \resizebox*{0.45\columnwidth}{!}{\includegraphics{figures/gbn/416W-TO200M.eps}} [Predicted - Timeout=200] \resizebox*{0.45\columnwidth}{!}{\includegraphics{figures/gbn/416W-TO200P.eps}}

[Measured - Timeout=1600] \resizebox*{0.45\columnwidth}{!}{\includegraphics{figures/gbn/416P-TO1600M.eps}} [Predicted - Timeout=1600] \resizebox*{0.45\columnwidth}{!}{\includegraphics{figures/gbn/416P-TO1600P.eps}}

[Measured - Timeout=1600] \resizebox*{0.45\columnwidth}{!}{\includegraphics{figures/gbn/416W-TO1600M.eps}} [Predicted - Timeout=1600] \resizebox*{0.45\columnwidth}{!}{\includegraphics{figures/gbn/416W-TO1600P.eps}}

On the other hand, from the results shown in Table 4.1 and Figure 4.11, we find that the accuracy of our predictions deteriorates as $ P $, $ W_{FC}$ and $ TO$ increase, albeit with a low error rate (the 95% confidence level of the prediction error falls on $ 7\%\pm 0.9\% $). When the timeout setting is small (subgraphs (a),(b),(c)&(d) of Figure 4.11), we clearly see that our empirical formula correctly reflects the relationships between $ P $, $ W_{FC}$ & $ T_{Eff}^{output} $, but when the timeout setting is large, the relationships become unclear. This is because, as the data in Table 4.1 show, our empirical formula tends to over-estimate the improvement made by increasing the timeout parameter.

Figure 4.12: Comparisons of the measured and predicted performance of the Cisco Catalyst 2980G switch under heavy congestion loss problem with our GBN reliable transmission protocol
[Measured - P=15] \resizebox*{0.45\columnwidth}{!}{\includegraphics{figures/gbn/2980-p16M.eps}} [Predicted - P=15] \resizebox*{0.45\columnwidth}{!}{\includegraphics{figures/gbn/2980-p16P.eps}}

[Measured - P=31] \resizebox*{0.45\columnwidth}{!}{\includegraphics{figures/gbn/2980-p32M.eps}} [Predicted - P=31] \resizebox*{0.45\columnwidth}{!}{\includegraphics{figures/gbn/2980-p32P.eps}}

[Measured - Timeout=400] \resizebox*{0.45\columnwidth}{!}{\includegraphics{figures/gbn/2980P-TO400M.eps}} [Predicted - Timeout=400] \resizebox*{0.45\columnwidth}{!}{\includegraphics{figures/gbn/2980P-TO400P.eps}}

[Measured - Timeout=400] \resizebox*{0.45\columnwidth}{!}{\includegraphics{figures/gbn/2980W-TO400M.eps}} [Predicted - Timeout=400] \resizebox*{0.45\columnwidth}{!}{\includegraphics{figures/gbn/2980W-TO400P.eps}}

Figure 4.12 shows the results of the second set of experiments, using the Cisco Catalyst 2980G switch with the 32-node cluster. The same findings can be observed in this data set as in the IBM 8275-416 case, but the accuracy of our predictions on this platform is better than for the IBM 8275-416 (the 95% confidence level of the prediction error is $ 6\%\pm 0.7\% $). To uncover the performance difference between the two switches, we extract from these two sets the results collected with the same parameter settings and present them in Table 4.2. In general, we find that an increase in buffering capacity improves the congestion performance. However, it is interesting to see that the Cisco switch, which has a larger buffering capacity, can deliver poorer congestion performance than the 416 switch in some scenarios.

Table 4.2: Comparisons of the congestion behavior observed on the IBM416 and the Cisco2980 switches under the same parameter sets
Switch $ B_{L}$ $ P $ $ W_{FC}$ $ \overline{TO} $ Measured $ T_{Eff}^{output} $ Predicted $ T_{Eff}^{output} $
IBM416 95 7 24 400 0.7846 0.7598
Cisco2980 128 7 24 400 0.8088 0.7413
IBM416 95 7 24 1600 0.8822 0.8550
Cisco2980 128 7 24 1600 0.8602 0.8046
IBM416 95 15 24 400 0.5975 0.6501
Cisco2980 128 15 24 400 0.6354 0.6772
IBM416 95 15 24 1600 0.7785 0.8905
Cisco2980 128 15 24 1600 0.7785 0.8571
IBM416 95 15 12 400 0.719 0.7256
Cisco2980 128 15 12 400 0.7797 0.7427
IBM416 95 15 12 1600 0.8745 0.8956
Cisco2980 128 15 12 1600 0.8891 0.8782


Figure 4.13: Effect of the timeout ( $ TO$) parameter on the throughput efficiency when the network is under heavy congestion loss problem. The data are collected on the IBM 8275-416 platform with $ P=15$, $ W_{FC}=12$.
\resizebox*{!}{2in}{\includegraphics{figures/gbn/416-p16TO.eps}}

Our assumption on the timeout ($ TO$) setting is that it should be sufficiently large to avoid false retransmission. Therefore, the timeout setting in all tests satisfies the constraint $ \overline{TO}\geq 2B_{L} $. This ensures that the timeout timer is set to a value larger than the round-trip delay on a saturated network, so if the timer expires, it strongly indicates that the network is under heavy congestion. To quantify the effect of the timeout parameter on the throughput efficiency, we intentionally include timeout settings that are outside our assumed range, and the result is shown in Figure 4.13. We observe that the throughput efficiency scales logarithmically with the timeout parameter. This finding supports our initial assumption and indicates that performance is severely degraded by the false retransmission phenomenon. On the other hand, the observed performance does not improve much even with a very large timeout setting. This justifies that picking a reasonably large timeout setting is sufficient to guard against the congestion loss problem.


4.3.4 Discussion of the Models and Their Implications

In the previous subsections, we examined the performance behavior of two buffering architectures under heavy congestion loss. Regarding buffering architectures, another commonly used buffering scheme should also be considered - the shared-buffered [110] or common output-buffered architecture [42]. This scheme makes use of dynamic buffer allocation from a common pool of buffers, so that each output port virtually has a larger buffer storage. Its main advantages compared to the output-buffered architecture are (1) higher buffer utilization and (2) a smaller total buffering capacity requirement. However, to avoid unfair utilization, most implementations set an upper limit on each output port. Due to the architectural similarity between the shared-buffered and output-buffered schemes, we consider the above analytical study of the output-buffered scheme to be directly applicable to the shared-buffered architecture.

We now have two models that describe the congestion behavior of our communication system under different buffering architectures; however, one can show that the 3-state Markov chain model also covers the input-buffered case, if we view it as a special case of the Markov chain model with $ a=1 $ & $ c=\frac{2P-2}{B_{L}*P+P-W_{FC}} $. Although using a stochastic process to model system dynamics is a powerful technique, it has some known limitations. In some cases, to capture real-world phenomena while keeping the model equations tractable, one has to rely on known and well-behaved statistical or probabilistic models. The question then becomes how closely these assumptions match reality. Besides, in some cases there exists no general probability model that suits our needs. One then has to collect information from the running system or, if a system does not exist, from a simulator. In all cases, therefore, analysts are required to have statistical expertise. Depending on the techniques used, these skills are usually not common among parallel programmers [116]. Furthermore, in our study of the output-buffered architecture, we use the second method to derive the transition probabilities. Although the prediction results are of acceptable accuracy, we believe that the information revealed by the empirical equation (4.10) for the output-buffered case is not as expressive as that revealed by the empirical equation (4.3) for the input-buffered case. For example, one can directly estimate the effect of varying the $ W_{FC}$ parameter from equation (4.3), while equation (4.10) only shows part of the picture, as we cannot pin down the relationship between $ W_{FC}$ and the transition probabilities.

The primary objective of this study is to explore the relationship between buffering architecture and congestion behavior so that, through the analysis, we can devise better strategies for handling the contention problem. Although our studies are based on the many-to-one pattern with a single switched network, we believe that our findings can be extended to capture the congestion behavior of different network configurations and communication scenarios. In summary, here is what we have observed from our analytical studies and the previous experiments:

  1. The number of contending sources ($ P $) has a negative impact on the throughput efficiency with our GBN reliable protocol, except in the input-buffered case. However, it carries the least weight among the performance parameters.
  2. The larger the flow control window size ($ W_{FC}$) we set, the more susceptible we are to the congestion loss problem in all cases. Therefore, the best tactic is to avoid the loss completely. Since different buffering architectures have different overflow conditions, i.e. $ W_{FC}>B_{L} $ for input buffering and $ W_{FC}*P>B_{L} $ for output buffering, we should observe these rules when selecting the optimal window size (see the sketch following this list). However, if we are too conservative in setting the window size, we lose the benefit of pipelining. Besides, even with a small window setting, we still experience performance problems on a large cluster if the network traffic becomes asymmetric.
  3. Under bulk data transfer, the throughput efficiency improves logarithmically with increasing timeout value with our GBN reliable protocol, except in the input-buffered case. Therefore, the setting of the timeout value carries a significant weight on the resulting throughput. From our experimental results, we observe that it is sufficient to set the timeout value in the range $ B_{L}*e<\overline{TO}\leq B_{L}*e^{2} $ in the output-buffered case, since the throughput efficiency only scales up logarithmically; in the input-buffered case, the range $ 2*B_{L}*P\leq \overline{TO}\leq 3*B_{L}*P $ works reasonably well for our GBN scheme.
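As a rule-of-thumb illustration of points 2 and 3, the following sketch encodes the overflow conditions and the timeout ranges quoted above. The helper names are ours, and the timeout values are expressed in the same (unspecified) units as $ \overline{TO}$ in the experiments; this is a tuning aid under the stated rules, not the exact tuning procedure used in this work.

    import math

    def max_loss_free_window(B_L, P, buffering):
        # Largest W_FC that avoids the overflow condition quoted in point 2.
        if buffering == "input":
            return B_L               # overflow when W_FC > B_L
        if buffering == "output":
            return B_L // P          # overflow when W_FC * P > B_L
        raise ValueError("buffering must be 'input' or 'output'")

    def suggested_timeout_range(B_L, P, buffering):
        # Timeout ranges found adequate in point 3 (same units as TO above).
        if buffering == "output":
            return (B_L * math.e, B_L * math.e ** 2)
        if buffering == "input":
            return (2 * B_L * P, 3 * B_L * P)
        raise ValueError("buffering must be 'input' or 'output'")

    if __name__ == "__main__":
        B_L, P = 45, 15              # uplink-port figures used later in Table 4.4
        print(max_loss_free_window(B_L, P, "output"))     # -> 3
        print(suggested_timeout_range(B_L, P, "input"))   # -> (1350, 2025)

With the uplink-port figures used later ($ B_{L}=45 $, $ P=15 $), the loss-free window under output buffering is only 3 packets, which illustrates how a purely loss-avoiding window can forfeit most of the pipelining benefit noted in point 2.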
To further our understanding of the congestion behavior, we extend this study to another type of network configuration - the hierarchical network (a formal definition of the hierarchical network appears in Chapter 6 [35]). The hierarchical network makes use of a faster technology as the backbone network to provide full connectivity between many smaller subnetworks, while each subnetwork can be a single switched network or another hierarchical network. Figure 4.14 gives an example of a two-level hierarchical network composed of Fast Ethernet (FE) and Gigabit Ethernet (GE) switches.

Figure 4.14: A hierarchical network composed of Gigabit Ethernet and Fast Ethernet switches
[Figure: figures/gbn/hierarchic.eps]

For the first set of tests on the hierarchical network, we show that the congestion behavior of the output-buffered case can be used to explain the congestion behavior of an input-buffered port, which happens to be the bridging port between the two Ethernet technologies. Using the same cluster as in previous experiments, we connect 10 cluster nodes to one IBM 8275-326 switch, and there are three such subnets in this setup. Each IBM 8275-326 switch is connected to a Gigabit Ethernet switch, the Alcatel PowerRail 2200 (PR2200), through a Gigabit uplink port. With this configuration, in theory, we have full connectivity for all 30 machines. After running our benchmark tests, we find that this Gigabit uplink port has an input-buffered architecture with $ B_{L}=45 $ units (which matches the buffer size of the other FE ports). However, we also uncover a serious problem with this uplink port. Although it is capable of sustaining ten full FE streams on the upstream flow (upstream refers to the movement of data from the lower level to the upper level of the hierarchy; downstream is the reverse), it can only sustain at most seven full FE streams on the downstream flow without packet loss. More discussion will be given later in this subsection. As for the PR2200 GE switch, our benchmark tests show that it has a shared-buffered architecture with $ B_{L}=820 $ units.


Table 4.3: The settings used to emulate the many-to-one flow over a congested uplink port
No. of senders ($ P $)   Subnet A (10 nodes)   Subnet B (10 nodes)   Subnet C (10 nodes)
7                        4 senders             3 senders             target receiver
11                       6 senders             5 senders             target receiver
15                       8 senders             7 senders             target receiver
19                       10 senders            9 senders             target receiver


Figure 4.15: The measured and predicted results of the many-to-one congestion loss problem on the uplink port under our GBN reliable transmission protocol
[Measured - P=7: figures/gbn/ge-p8M.eps] [Predicted - P=7: figures/gbn/ge-p8P.eps]

[Measured - P=19: figures/gbn/ge-p20M.eps] [Predicted - P=19: figures/gbn/ge-p20P.eps]

[Measured - Timeout=200: figures/gbn/geP-TO200M.eps] [Predicted - Timeout=200: figures/gbn/geP-TO200P.eps]

[Measured - Timeout=200: figures/gbn/geW-TO200M.eps] [Predicted - Timeout=200: figures/gbn/geW-TO200P.eps]

To emulate the many-to-one congestion loss problem on the uplink port, we use the experimental settings summarized in Table 4.3. With such settings, we remove the upstream performance limitation, and by directing all traffic to the same uplink port, we create a scenario similar to the output-buffered congestion problem on an input-buffered uplink port. Therefore, we can apply our previous technique to analyze the congestion behavior induced by this traffic condition, and the results are presented in Figure 4.15. Again, we observe a congestion dynamic similar to that of the previous experiments. The only exception we have observed is the extraordinary performance improvement with the timeout parameter when $ W_{FC}=24 $ for different $ TO$ settings. We cannot identify a clear cause, except that the activity traces of the transition probabilities (a, c and f) show that with $ W_{FC}=24 $ and an increasing timeout value, a sender is less likely to encounter the Nack event, but once it transits to the ReSend state, it is more prone to being kept in the Stall state; i.e., both c and a become smaller as the timeout setting increases. Besides, we observe that the performance behavior becomes governed by the window flow control ($ W_{FC}$) setting rather than by the number of attributed sources (P).

In the previous experiments, we could not show that having a large $ B_{L}$ value is beneficial under the congestion problem. To look for supporting evidence, we compare the measured performance of three setups under the same parameter settings: the IBM 8275-416, the Cisco 2980 and the IBM 8275-326 uplink. The results (presented in Table 4.4) show that having a larger buffer capacity does provide better performance when subjected to the same traffic loading.

Table 4.4: Comparison of the measured performance of IBM416, Cisco2980 and the IBM uplink port under the same parameter settings.
Switch      $ B_{L}$   $ P $   $ W_{FC}$   $ \overline{TO} $   Measured   Predicted
uplink      45         15      12          400                 0.6068     0.6691
IBM416      95         15      12          400                 0.7190     0.7256
Cisco2980   128        15      12          400                 0.7797     0.7427
uplink      45         15      12          1600                0.7309     0.8666
IBM416      95         15      12          1600                0.8745     0.8957
Cisco2980   128        15      12          1600                0.8891     0.8782
uplink      45         11      30          200                 0.4129     0.4798
IBM416      95         11      30          200                 0.6263     0.5931
uplink      45         11      30          1600                0.5564     0.6430
IBM416      95         11      30          1600                0.7967     0.8759


Although the above experimental setup looks rather artificial, it shows that the input-buffered uplink port behaves like an output-buffered port when it is subjected to heavy congestive flow across the hierarchy. To further examine the congestion behavior of the uplink port under more realistic traffic, we re-configure the network setup of the cluster to emulate multiple concurrent data flows across the hierarchical network.

Table 4.5: The network configuration used to emulate the multiple one-way data transfers over a congested uplink port
No. of senders ($ P $)   Subnet A (8 nodes)   Subnet B (4 nodes)   Subnet C (12 nodes)
$ P\leq 8 $              $ P $ senders        not used             $ P $ receivers
$ 12\geq P>8 $           8 senders            $ P-8 $ senders      $ P $ receivers



Table 4.6: The network configuration used to emulate the multiple bi-directional data flows over a congested uplink port
No. of senders ($ P $)   Subnet A (12 nodes)                             Subnet B (12 nodes)
$ 8\leq P\leq 24 $       $ \frac{P}{2} $ as both senders and receivers   $ \frac{P}{2} $ as both senders and receivers


Two sets of tests are carried out. With the network configuration shown in Table 4.5, we create multiple one-to-one one-way data flows across the target uplink port, while with the network configuration shown in Table 4.6, we create multiple one-to-one bi-directional data flows across the uplink ports. The main reason for using different network settings is to keep our focus on the congestion loss problem at the target uplink port(s) only and to avoid other sources of loss. To mimic the bulk transfer data flow, all tests are conducted with each sender sending 30,000 full-size packets to its partner.
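For concreteness, the following is a schematic sketch of a bulk-transfer sender of this kind: a go-back-N loop that keeps at most $ W_{FC}$ unacknowledged packets in flight and falls back to the start of the window on a negative acknowledgment or timeout. The transport interface (conn, send_data, poll) and the 1460-byte payload are illustrative placeholders, not the exact API of our implementation.

    import time

    def run_bulk_sender(conn, num_packets=30000, w_fc=12, timeout=0.4):
        # Schematic go-back-N sender for the bulk-transfer test: keep at most
        # w_fc unacknowledged packets in flight; on a Nack or a timeout,
        # retransmit from the oldest unacknowledged packet.
        base = 0                  # oldest unacknowledged sequence number
        next_seq = 0              # next sequence number to transmit
        timer = None              # when the oldest outstanding packet was sent
        while base < num_packets:
            # Fill the flow-control window.
            while next_seq < base + w_fc and next_seq < num_packets:
                conn.send_data(seq=next_seq, payload=b"\x00" * 1460)
                if timer is None:
                    timer = time.monotonic()
                next_seq += 1
            event = conn.poll(max_wait=0.01)       # ack, nack or None
            if event is not None and event.kind == "ack":
                base = max(base, event.seq + 1)    # cumulative acknowledgment
                timer = time.monotonic() if base < next_seq else None
            elif (event is not None and event.kind == "nack") or \
                 (timer is not None and time.monotonic() - timer > timeout):
                next_seq = base                    # go back N: resend the window
                timer = None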

Figure 4.16: The congestion dynamic of the uplink port under multiple one-way bulk transfers
[Figure: figures/gbn/ge-nstream.eps]

Figure 4.17: The congestion dynamic of the uplink port under multiple bi-directional data transfers
[Figure: figures/gbn/ge-xchg.eps]

Figures 4.16 and 4.17 show the per-receiver bandwidths measured in the one-way and bi-directional tests with different window flow control and timeout settings. In general, the performance behavior under such data flows agrees with the conclusions drawn from our studies of the many-to-one congestion loss problem. In particular, the test results match our analysis in that the flow control window size carries the heaviest weight on the final performance, the timeout setting comes next, and the number of attributed sources carries the least weight. Furthermore, Figure 4.16 uncovers an intrinsic limitation of the uplink port, as the measurements show that it can only sustain at most 7 full FE message streams regardless of the window size setting. Once we move beyond this throughput limitation, the flow control setting becomes significant, which indicates that the uplink port is now working under an overload condition. Worse still, Figure 4.17 further demonstrates that under heavy loading with bi-directional data flows, the circuitry of the uplink port cannot support more than 8 streams, i.e. it only supports up to 4 concurrent duplex streams. Under such conditions, both data and control packets are subject to loss as traffic in both directions is heavily congested; however, we still observe a congestion dynamic similar to that identified in the previous experiments. This fortifies our belief that the studies on the many-to-one congestion loss problem do capture the salient features of our GBN reliable transmission protocol.

