3.2 The Performance of DP

We make use of two implementations of Directed Point to demonstrate how our communication model could be used for performance evaluation and analysis. As the model parameters represent some forms of software overheads and hardware latency, changes in communication hardware and software are being revealed by the changes in these model parameters. This gives us better insights on the performance impacts of various design choices.

We have two clusters that are driven by two different Ethernet implementations of Directed Point, one is a Fast Ethernet cluster (FEDP) and the other is a Gigabit Ethernet cluster (GEDP). The FEDP cluster consists of 16 PCs running Linux 2.0.36. Each node is equipped with a 450MHz Pentium III processor with 512 KB L2 cache, a Intel 440BX PIIX4e chipset that supports a 66/100 MHz system bus, 128 MB of PC100 SDRAM, and uses a Digital 21140A Fast Ethernet adapter for high-speed communication. The whole cluster is connected to a 24-port IBM 8275-326 Fast Ethernet switch which has 5 Gbps backplane capacity. For the GEDP cluster, it consists of four Dell PowerEdge 6300 SMP servers with four Pentium III Xeon processors sharing 1 GB of EDO memory. The Xeon processor consists of 512KB L2 cache and operates at 500 MHz. This Dell system is using the Intel 450NX controller chipset with a 100 MHz front-side bus and has 64-bit 33 MHz PCI slots for the interconnects. All servers are running on Linux 2.2.12 kernel. In addition, each server is equipped with one Packet Engine G-NIC II Gigabit Ethernet adapter, and is connected to the Packet Engine PowerRail 2200 Gigabit Ethernet switch, which has a backplane capacity of 22 Gbps.

To review the performance issues related to high-speed communication, we have performed a series of microbenchmark tests on these clusters. To achieve beyond-microsecond precision, all timing measurements are calculated by using the hardware time-stamp counters in the Intel Pentium processors. If applicable, all data presented in this section are derived from a statistical calculation with multiple iterations of the same benchmark routine. Each test is conducted with at least 200 iterations with the first and last 10% of the measured timing excluded. Only the middle 80% of the timings is used to calculate the average.

3.2.1 Latency with Performance Breakdowns

By executing the associated benchmark routines (Appendix A), we construct a set of model parameters for the two clusters, as shown in Figure 3.1. In the figure, there are two sets of parameters for the Gigabit Ethernet implementation (GEDP), one is obtained when using an SMP kernel, i.e. with the SMP support on the Linux 2.2.12 kernel (GEDP-SMP), and the other is without SMP support (GEDP-UP), i.e. uni-processor mode on an SMP server. The purpose of this comparison is to reveal the differences in performance with respect to different OS modes and hardware platforms.

**Figure 3.1:** Performance breakdown of two DP implementations - Fast Ethernet (FEDP) and Gigabit Ethernet (GEDP) expressed in the form of our model parameters.
[Send overhead: $O_{s}$ ] $\resizebox{0.49\textwidth}{!}{\includegraphics{figures/Sig/Sig6a.eps}}$ [Asynchronous receive overhead: $O_{r}$ ] $\resizebox{0.49\textwidth}{!}{\includegraphics{figures/Sig/Sig6b.eps}}$ [User receive overhead: $U_{r}$ ] $\resizebox{0.49\textwidth}{!}{\includegraphics{figures/Sig/Sig6c.eps}}$ [Network latency: ] $\resizebox{0.49\textwidth}{!}{\includegraphics{figures/Sig/Sig6d.eps}}$ [Inter-packet transmit gap: $g_{s}$ ] $\resizebox{0.49\textwidth}{!}{\includegraphics{figures/Sig/Sig6e.eps}}$ [Inter-packet receive gap: $g_{r}$ ] $\resizebox{0.49\textwidth}{!}{\includegraphics{figures/Sig/Sig6f.eps}}$

3.2.1.0.1 The $O_{s}$ parameter

With DP messaging library, the $O_{s}$ parameter reflects the time used by the host CPU to initiate the transmission while performing the send (dp_write()) operation. Figure 3.1(a) shows the cost associated with the dp_write() operation. It involves a lightweight system call and a cross-domain data movement. We see that the processor speed does affect the software cost with 500MHz Xeon processor performs marginally better than the 450MHz Pentium III processor, as both hardware platforms are operated with a 100 MHz system bus. However, the prevailing software cost of the $O_{s}$ parameter is coming from the data movement overhead in the send operation. We believe that with the DP protocol, any improvement in processor speed would be offset by the data movement cost; hence, using a system with a faster system bus and memory subsystem would benefit most.

Nevertheless, we see that DP still manages to minimize the send overhead and achieve good performance in driving both networks, especially it looks promising on the Gigabit communication. For example, the cost to send a full-size Ethernet packet is about 7 $\mu s$ under the SMP OS, while the theoretical speed in transmitting such an Ethernet packet under Gigabit performance is around 12.3 $\mu s$ . Therefore, an active sending process could saturate the network by continuous transmission. Lastly, observed from the GEDP measurements, with the SMP mode, there is an extra 0.5 $\mu s$ overhead associated to it due to the locking mechanism for integrity control.

3.2.1.0.2 The $O_{r}$ parameter

When examining the $O_{r}$ parameter in Figure 3.1(b), we find that the cost associated to this parameter is proportional to the message size, and the memory copy overhead is higher than that in the case of the $O_{s}$ parameter. This reflects the different nature of the memory copy operation. For example, in the microbenchmark test of the $O_{s}$ parameter, the involved memory copy operation is a $M_{ctm}$ operation, while for the $O_{r}$ parameter, it involves a $M_{mtm}$ operation. Besides, we find that the SMP kernel has an extra 20 $\mu s$ overhead added on to the GEDP-SMP, while both GEDP-UP and FEDP-UP have similar performance.

**Figure 3.2:** Single-trip latency performance with back-to-back (BTB) connection
$\resizebox*{!}{2.5in}{\includegraphics{figures/Sig/Sig7.eps}}$

Besides the SMP overhead, we also observe that the current architecture of this Gigabit Ethernet adapter has a limitation on the achievable performance. Due to the lack of intelligence network processor, incoming messages are not delivered to the user process directly. Instead, they are moved by the DMA engine to a pre-allocated network buffers area. This requires an extra memory copy done in the interrupt handler to deliver the messages to the destined user process. The one-copy cost together with the interrupt overhead ( $\sim 8$ $\mu s$ ) would become a threat to the overall performance, e.g. the total interrupt cost for a full size packet is 19.7 $\mu s$ under GEDP-UP. This $O_{r}$ overhead is larger than the theoretical transmission delay of Gigabit network. Thus, would hinder on the achievable performance.

There are several methods to work out this problem. First, introduces a network processor to the network adapter such that it can be programmed to move the incoming messages directly to their destined buffers. This approach is taken by other lightweight messaging systems that built on top of Myrinet or Giganet [39], e.g. BIP and FM 2.x. Thus, the interrupt and data movement overheads can be eliminated completely. However, almost all commodity Gigabit/Fast Ethernet cards do not provide the luxury to solve this problem. This is because having a network processor together with the associated SRAM is so expensive that the cost of the memory is roughly half of the production cost. From the commercial point of view, this is not justifiable for improvement of just a few microseconds.

Another method is by the mitigation of interrupt overhead through multiple packet receptions - interrupt coalescing. Most Gigabit Ethernet adapters provide a mechanism to perform tuning on the inter-interrupt gap. For examples, wait until there are x incoming packets before raise the interrupt signal, or hold off any pending interrupts until y $\mu s$ has elapsed since handling last packet. The GAMMA messaging system takes a slightly different approach. Whenever the network adapter raises the interrupt signal, the GAMMA protocol blocks off further interrupts by clear the processor's interrupt flag. By this way, the interrupt handler manually checks on further arrival of incoming messages and handles them in one shot. But this method only works under UP kernel. Nevertheless, interrupt coalescing is useful in cases where packets are arriving in back-to-back, but comes at the expense of increased per-packet latency.

3.2.1.0.3 The $U_{r}$ parameter

Since the token buffer pool is accessible by both kernel and user processes, the receiving process can simply check on the TBP for picking up and consuming the messages. As these are done in the user space, no kernel events such as block and wake-up signals are needed. Figure 3.1(c) shows the $U_{r}$ cost of picking up a DP message directly from the TBP without any data movement or system call overheads. Constant overheads, 0.34 $\mu s$ , 0.06 $\mu s$ and 0.07 $\mu s$ were measured for GEDP-SMP, GEDP-UP, and FEDP, respectively.

Moreover, $U_{r}$ is not necessarily a constant value. With real communication events, we need to employ another memory copy operation to move the data from the TBP to the destination buffers. This is because the TBP is a pre-allocated memory region dedicated for incoming messages, hence, it does not directly conform to the desire message-passing semantic. Besides, consecutive messages are stored in TBP, which are not aligned in contiguous memory region. To re-assemble long message, one needs to re-construct the message segments back to one large trunk. Therefore, an add-on $M_{ctm}$ software cost is expected for each arrived segment. On the GEDP platform, this costs an extra overhead of $\sim 10\textrm{ }\mu s$ for a 1500-byte packet.

3.2.1.0.4 The $g_{s}$ and $g_{r}$ parameters

Figure 3.1 (e) and (f) show two other network-dependent parameters, they are the inter-packet transmit gap $g_{s}$ and inter-packet receive gap $g_{r}$ . To justify their relative performance, all parameters are compared with their theoretical limits. Looking at the FEDP data, we find that with modern PC or server hardware and lightweight communication system, we are able to drive the Fast Ethernet network to its full capacity. For example, the measured $g_{s}$ and $g_{r}$ for m = 1500 bytes is 122.75 $\mu s$ and 122.84 $\mu s$ , while the theoretical transmission speed is 123.04 $\mu s$ . This means that the critical path of the communication system falls on the Fast Ethernet network.

For the Gigabit Ethernet, due to the 10-fold increase in network speed, limitations within the host machine start to pop up. The graph with $g_{s}$ -GEDP data (Figure 3.1e) shows that the network adapter cannot transmit data in full gigabit performance. The measured $g_{s}$ value for m = 1500 bytes is 18.76 $\mu s$ but the theoretical speed is 12.3 $\mu s$ . Since the value of $g_{s}$ reflects how fast can the network adapter inject a packet into the network, we clearly see that there exists some bottleneck problem. To explore the problem further, we have performed some investigations on this area, and find that the problem seems related to the PCI performance, even though our Dell server is equipped with a 64bit 33MHz PCI bus. It is known that inefficient use of the PCI bus would result in poor system/network performance [48]. Factors such as the PCI burst size and the PCI latency time are of the most importance, since they can be directly manipulated by the system programmer. In our experiments, we varied the burst size and latency time, and conducted our standard microbenchmarks to measure the resulting $g_{s}$ and $g_{r}$ values.

**Figure:** The measured $g_{s}$ and $g_{r}$ values on the GEDP platform under various PCI settings. With LTXX stands for setting the PCI latency to XX bus cycles; and BYY stands for the PCI burst size (YY d-words).
[Inter-packet transmit gap] $\resizebox{!}{3.5in}{\includegraphics{figures/Sig/pci-gs.eps}}$ [Inter-packet receive gap] $\resizebox{!}{3.5in}{\includegraphics{figures/Sig/pci-gr.eps}}$

A similar pattern also appears in the $g_{r}$ -GEDP data, but the problem is not as clear as that of the $g_{s}$ parameter. We find that the measured $g_{r}$ value for m = 1024 bytes is 10.6 $\mu s$ but the theoretical gap is 8.5 $\mu s$ ; on the other hand, the measured $g_{r}$ value for m = 1500 is 12.6 $\mu s$ while the theoretical gap is 12.3 $\mu s$ . Although we still observe the variation of $g_{r}$ values under different PCI settings (in Figure 3.3(b)), it appears to be less drastic than its $g_{s}$ counterpart, and the measured results look quite independent of the burst size and latency time. Part of the reason may be due to the difference in read and write performance of the PCI bus, in particular, under the Intel 82450NX chipset [49]. For example, the observed throughput for PCI read (from memory) is approximately 47% less than the PCI write (to memory) throughput on a 64-bit PCI bus.

3.2.1.0.5 The

parameter

When look at the L parameter (Figure 3.1(d)), the derived network latency of the GEDP with back-to-back connection is 6.9 $\mu s$ for a 1-byte message, while the network latency of the FEDP with back-to-back connection is 9.9 $\mu s$ for the same size message. We observe that the add-on latency by the GE hardware is much higher than that of the FE, when compare to the theoretical wire delay for the smallest packet size of the GE and FE, which are 0.67 $\mu s$ and 6.7 $\mu s$ respectively. In addition, the time gaps between the network latency measurements with FEDP back-to-back and FEDP through switch, and between FEDP back-to-back and theoretical FE speed are almost constant, while the corresponding gaps on the GE platform seem to be increasing with the message size. This indicates that there exists some store-and-forward stage(s) along the GE network path.

Lastly, Figure 3.2 compares the single-trip latency of the two DP implementations. To avoid add-on latencies from the switches, we connect two nodes back-to-back and measure their single-trip latencies. The GEDP-UP achieves single-trip latency of 16.3 $\mu s$ for sending 1-byte message, while GEDP-SMP achieves 33.4 $\mu s$ and FEDP achieves 20.8 $\mu s$ respectively.

From the above analysis, we obtain two sets of performance metrics, which clearly delineate the performance characteristics of the two DP implementations. In summary, the host/network combination of the FEDP implementation has the performance limitation on its network component. This is being observed by comparing the $O_{s}$ , $O_{r}$ , and $U_{r}$ parameters with the $g_{s}$ , $g_{r}$ and L parameters. And since their performance characteristics satisfy the full-duplex condition, i.e. $(O_{s}+O_{r}+U_{r})<g<L$ , we can directly adopt the previous defined point-to-point communication costs (Eq. 2.6 & 2.7) whenever we want to evaluate on its long message performance. Moreover, the host/network combination of the GEDP implementation has the performance limitation not falling on the network component. For instance, the $O_{r}$ parameter is higher than the $g_{s}$ and $g_{r}$ parameters for both GEDP-SMP and GEDP-UP, which means the performance bottleneck may fall on this region. Therefore, when predicting their long message performance, alternate point-to-point communication cost formulae are required. For example, since the bottleneck stage falls on the receive phase, the new cost formula for predicting the one-way point-to-point communication cost of the GEDP-UP implementation becomes:

3.2.2 Uni-directional Bandwidth

In this section, we are going to explore the one-way bandwidth performance of the two DP implementations with respect to different hardware and OS mode. In the analysis, we try to apply the acquired knowledge from the above section to explain and evaluate the measured performance.

Two sets of uni-directional bandwidth measurements for each DP platform - FEDP, GEDP-UP and GEDP-SMP, are presented in Figure 3.4. To calculate the raw DP bandwidth, we measure the time spent in transmitting 10 MB data from one process to another remote process, plus the time for the receive process to send back a 4-byte acknowledgment. By subtracting the measured time with the single-trip latency of a 4-byte message, we calculate the achieved bandwidth as the number of bytes transferred in the test divided by the result timing. As DP supports unreliable communication only, we have implemented a simple Go-Back-N protocol on top of DP to provide flow control and support limited reliable communication. Since all the protocol works are done in the user space, it has add-on overheads to the $O_{s}$ and $U_{r}$ parameters. For example, the $O_{s}$ -GEDP-SMP value for sending a full load packet is increased from 7 $\mu s$ to 12 $\mu s$ . To calculate the flow-controlled bandwidth of DP, we performed a set of tests similar to what we have done to obtain the raw DP bandwidth.

**Figure 3.4:** Uni-directional bandwidth performance of DP
$\resizebox*{4in}{3in}{\includegraphics{figures/Sig/Sig8.eps}}$ (FC - with flow control on; raw - unreliable mode)

From the figure, we see that the maximum achieved bandwidth for GEDP is 79.5 MB/s, which is the raw DP performance measured under the SMP kernel. Under the UP kernel, the raw GEDP achieves at most 75.2 MB/s. Despite the fact that the SMP kernel has a higher $O_{r}$ overhead, it has a better throughput than the UP kernel. This shows the advantage of sharing the token buffers between the kernel process and user process. Under the UP mode, the user process can only pick up its arrived messages after the interrupt thread returns, so the whole interrupt overhead is included in the delay calculation. However, with the SMP mode, we have more CPU resources and the user process can check out its messages even before the interrupt thread returns. This is because when the receive process gets the CPU cycles and detects that there are arrived messages, it can immediately consume the shared data. Besides, due to the large interrupt overhead on SMP kernel, it is likely that an interrupt thread would pick up more than one arrived packet. Hence, in long run where packets are arriving in back-to-back, this effectively amortizes the interrupt overhead across multiple arrivals.

For the FEDP, the achieved maximum raw DP bandwidth is 12.2 MB/s, which is 97% of the Fast Ethernet theoretical performance, while the raw GEDP-SMP performance only achieves 63.6% of the theoretical gigabit throughput. This shows that there are limiting factors in the host machines which hinder the GE performance. We have seen in Figure 3.1(e) that the network adapter cannot transmit data in full gigabit performance, by dividing the payload size with the corresponding $g_{s}$ value, we have a useful meter to estimate maximum performance we can get. Take the value of $g_{s}$ = 18.7 $\mu s$ at m = 1500 bytes as an example, we find that the maximum transmission throughput is around 80MB/s, which is closely matched with the GEDP-SMP measurement. Similarly, if we assume that the bottleneck is on the $O_{r}$ part, let's take the value of $O_{r}$ = 19.7 $\mu s$ for m = 1500 bytes, we should have the transmission throughput bounded by 76 MB/s. Again, this is closed to the measured performance on GEDP-UP. From these analysis, we can conclude that the performance of the GEDP is limited by the $g_{s}$ parameter when operates under the SMP kernel, but the bottleneck is shifted to the $O_{r}$ parameter when operates under the UP kernel.

To reveal how much improvement we could achieve if we adopt a zero-copy semantic in the send path, we have done some tests that simulated a zero-copy send operation, (simply by removing the memcpy() operation and sending out garbage content). The resulting send gap ( $g_{s}$ ) is approximately 16.4 $\mu s$ for m = 1500 bytes, which would correspond to a bandwidth of 91.5 MB/s. By eliminating the memory copy operation, it should only affect the $O_{s}$ parameter. However, we find that the $g_{s}$ parameter has changed too. This simple experiment suggested that the $g_{s}$ parameter is sensitive to other bus activities, since memory copy operation involves bus transaction on the system bus, which in theory, interferes with the other data movements on the bus network.

With the add-on reliable layer, the FEDP performs almost as good as the raw performance for medium to large-sized messages, which achieves a throughput of 12.1 MB/s. But for the GEDP, the higher protocol overhead does affect the overall performance, especially under the UP kernel mode. Our result shows that under the SMP mode, the maximum achieved GEDP bandwidth with flow control is 77.8 MB/s, with an average drop of 3.4% performance for the packet size ranged between 1K and 1.5K when compared with the raw speed. While for the performance under UP mode, the maximum achieved bandwidth with flow control is 65.2 MB/s and the average performance drop is 13% of the raw speed for the same data range. This further supports our argument that the performance of GEDP-UP is more susceptible to software overheads.

3.2.3 Bi-directional Bandwidth

**Figure 3.5:** Bi-directional bandwidth performance
$\resizebox*{!}{3in}{\includegraphics{figures/Sig/Sig9.eps}}$ (FC - flow control; 2threads - multi-thread mode; single - single thread mode)

For the GEDP, the best bi-directional performance is observed to be about 58 MB/s per process, which is measured with raw DP using multi-thread mode on SMP kernel. Comparing with the uni-directional bandwidth, we have a performance loss of 22 MB/s, which is a 27.5% drop of the peak point-to-point performance. We attribute this performance loss to the contention on the PCI and system buses as there are concurrent DMA transfers to and from the host memory as well as memory copy operations on both send and receive phases.

When compared with the single thread mode on GEDP-SMP and GEDP-UP, which only achieve 47 MB/s per process, we believe that the software overhead induced in the concurrent send and receive operations is the main cause of this performance loss. Therefore, with the add-on reliable layer that adds more software overhead, it is sensible to see that all GEDP-FC performance suffers more. However, it is surprising to find that the bi-directional performance of GEDP-SMP-FC with multiple thread support is worse than the single thread mode. This performance difference is coming from the extra memory contention and synchronization needed in accessing shared data structures on the reliable layer as both threads are concurrently updating these shared information. Finally, similar to the conclusion as appeared in the uni-directional benchmark, the performance of the FEDP on bi-directional communication has achieved a near-optimal result, which attains 12.1 MB/s per process on the raw bandwidth, and 11.7 MB/s per process with the add-on flow control support.