next up previous contents
Next: 6. Complete Exchange on Up: 5. Complete Exchange on Previous: 5.4 Experimental Results   Contents


5.5 Summary

In this chapter, we focus on the practical issues of designing high-speed complete exchange algorithms on a commodity cluster interconnected by a non-blocking network. As the performance of interconnection networks is the limiting factor of most existing clusters, improving the communication performance by well scheduling of communication events could leverage the overall performance of the clusters. Four complete exchange algorithms, including, shift exchange, pairwise exchange, group shuffle exchange and synchronous shuffle exchange algorithms are studied and implemented on our experimental platform. Their common characteristic is that they all arrange the communications or message exchanges in a node and link contention-free manner.

Based on the architecture and communication model of cluster, we observe that to achieve optimal result, the network pipes must be fully utilized. Any waiting stage would stall the communication pipelines and decrease the overall communication performance. This optimum could only be fulfilled by the synchronous shuffle exchange algorithm, which adopts a contention-free schedule at the packet level. Due to its effectiveness, our experimental results show that the performance of synchronous shuffle exchange reaches 97% of the available bandwidth. In particular, we show that by careful scheduling of all communication events to balance the usage of available network resources, this hides away the synchronization overheads, and gain considerably improvement even when exchanging small size messages. While both shift and pairwise exchanges show their effectiveness on exchanging long messages, they become inefficient when exchanging small messages as they cannot fill up those network pipelines effectively.

Theoretically, under a contention-free schedule over a non-blocking network, all packets arrived to different input ports are destined to different output ports; therefore, can be routed instantaneously. However, in reality, no global clock is implemented to coordinate all cluster nodes and their events. Thus, their operations are not lock-step synchronized. Any variations in communication schedules will result in drifting from the theoretical harmony. The consequence is packets start to cumulate in the network due to experience of conflicts. Under this situation, the buffering architecture of the switch plays a critical role in the performance issue. In our experiments, we show that the performance of the synchronous shuffle exchange algorithm is affected by the Head-Of-Line blocking problem. To counteract the HOL blocking, we have devised the group shuffle exchange algorithm. Lastly, we show that the group shuffle exchange performs almost as good as the synchronous shuffle exchange algorithm and scales better when it works on an input-buffered switch.


next up previous contents
Next: 6. Complete Exchange on Up: 5. Complete Exchange on Previous: 5.4 Experimental Results   Contents