Next: Bibliography Up: A. Benchmark Methodologies Previous: A.4 Microbenchmark for the   Contents

A.5 Microbenchmark for the $L$ parameter

This parameter encapsulates all the costs involved in moving data across the network, so measuring it directly is impractical. For example, the physical network may span an entire building, which makes direct signal analysis at the two remote ends infeasible. A feasible alternative is indirect estimation. In this subsection, we first describe the indirect calculation that approximates the L values between two arbitrary cluster nodes, and then describe the microbenchmark that characterizes the L parameter for the whole cluster network.

Recall that the total time for an m-byte packet to travel from the source node to its destination can be expressed as

$\displaystyle T_{ptp}(m)=O_{s}(m)+L(m)+O_{r}(m)+U_{r}(m)$ (A.3)

If we carry out a simple pingpong test, the measured round-trip time is $RTT(m)=2\,T_{ptp}(m)$. Since all the other parameters can be measured directly or indirectly, we can deduce the required L value by

$\displaystyle L(m)=\frac{RTT(m)}{2}-O_{s}(m)-O_{r}(m)-U_{r}(m)$ (A.4)

Thus, equation (A.4) provides an indirect measurement of the L parameter between two machines over a range of packet sizes.
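As a minimal sketch of how equation (A.4) might be applied to measured data, the following Python fragment subtracts the overhead terms from half of the round-trip time for each packet size. The function names and the input layout (a dictionary keyed by packet size) are hypothetical and chosen only for illustration; the values in the usage lines are placeholders, not measurements.

```python
def estimate_latency(rtt, o_s, o_r, u_r):
    """Return L(m) = RTT(m)/2 - O_s(m) - O_r(m) - U_r(m) for one packet size.

    All arguments must be expressed in the same time unit.
    """
    return rtt / 2.0 - o_s - o_r - u_r


def latency_profile(measurements):
    """Map each packet size m to its estimated L(m).

    measurements: dict of m -> (rtt, o_s, o_r, u_r). This layout is an
    assumption of the sketch, not something prescribed by the text.
    """
    return {m: estimate_latency(*vals)
            for m, vals in sorted(measurements.items())}


# Placeholder values for illustration only (not measured data).
example = {64: (200.0, 10.0, 12.0, 8.0)}
profile = latency_profile(example)
```

Because L is obtained as a difference of measured quantities, any error in the overhead measurements carries over directly into the estimate.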

However, the L value between two machines of the cluster captures only part of the cost, since the data are measured under contention-free conditions. To capture the performance of the whole network, our microbenchmark generates multiple distinct pairs of pingpong nodes, with the two nodes of each pair lying on opposite sides of the bisectional plane of the network. When all pairs start their pingpong tests at the same time, this emulates multiple concurrent packets flowing back and forth across the bisectional plane. Under this setting, by varying the number of pingpong pairs, we obtain different sets of L values. The required bilinear function can then be obtained by applying multiple regression to these datasets. To summarize, Algorithm 8 presents the pseudocode for this microbenchmark.
[Algorithm 8: pseudocode of the microbenchmark for the L parameter; image not reproduced]
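The pairing and regression steps described above can be sketched in Python as follows. This is not the author's Algorithm 8; it is an illustrative outline under stated assumptions. In particular, the text does not spell out the form of the bilinear function, so the sketch assumes L(m, P) = a + b·m + c·P + d·m·P, where m is the packet size and P the number of concurrent pingpong pairs. The helper names and the representation of the two sides of the bisectional plane as two node lists are hypothetical.

```python
def bisection_pairs(left, right, num_pairs):
    """Pair the i-th node on one side of the bisection with the i-th on the other.

    left, right: node identifiers on each side of the bisectional plane
    (how nodes are assigned to sides is assumed to be known in advance).
    """
    if num_pairs > min(len(left), len(right)):
        raise ValueError("not enough nodes on one side of the bisection")
    return list(zip(left[:num_pairs], right[:num_pairs]))


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: samples must vary both m and P")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_bilinear(samples):
    """Least-squares fit of the assumed model L(m, P) = a + b*m + c*P + d*m*P.

    samples: iterable of (m, P, L) tuples gathered over several packet sizes
    and several numbers of concurrent pairs. Returns [a, b, c, d].
    """
    samples = list(samples)
    rows = [[1.0, m, p, m * p] for m, p, _ in samples]
    ys = [lat for _, _, lat in samples]
    k = 4
    # Normal equations: (A^T A) coeffs = A^T y
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve_linear(ata, aty)
```

The normal equations keep the sketch free of external dependencies; for widely varying packet sizes a library least-squares routine would likely be better conditioned.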

Figure A.4: The L parametric functions of three different network configurations for a 32-node cluster - (a) a single switch (Cisco Catalyst 2980G), (b) a hierarchical network (8x4) and (c) another hierarchical network (16x2)
[Panel, Cisco 2980G: comparison of calculated and measured L values for 2P=2 and 2P=32]
[Panel, HN 8x4: comparison of calculated and measured L values for 2P=2 and 2P=32]
[Panel, HN 16x2: comparison of calculated and measured L values for 2P=2 and 2P=32]
[Panel: comparison of the three configurations on a 32-node cluster]

One of the merits of this microbenchmark is that it clearly differentiates between network configurations. For example, Figure A.4 shows the resulting parametric functions for three different network configurations that support a 32-node cluster. From these functions, we observe that, at the current cluster size, the Cisco 2980G switch provides the best performance in terms of latency and scalability. In particular, we find that the bisectional bandwidth of the Cisco switch scales linearly with the number of nodes, whereas the hierarchical networks have already reached their upper bound as the cluster size grows.

