Speed is the main reason we use computers - both serial and parallel.
Building parallel computers to solve large-scale and complex problems
serves the same objective: we want to speed up the computations. As
parallel computing becomes more accessible through advances in processor
and network technologies, the use of parallelism is no longer the
privilege of a few scientists with access to powerful supercomputers.
A pile of PCs sitting in the server room, interconnected by a high-speed
LAN, becomes our in-house parallel computer. However, is that the answer
to our quest for more processing power? It is not. Building an efficient,
high-performance cluster is not as simple as plugging in and setting up
all the components. To construct a high-performance cluster, as well as
to orchestrate its processing power, we need to adopt a more scientific
approach.
Performance understanding is the process of explaining the performance
behavior of a parallel application on a parallel machine. The development
of efficient parallel applications depends upon a realistic prediction
of application behavior, which requires in-depth understanding of both
the application and the architectural characteristics [3,29,52]. For
example, if the communication overhead of a parallel application dominates
its execution time, the programmer may try to improve the communication
schedule by rearranging the communication events to reduce that overhead.
On the other hand, the system designer may try to improve the communication
system to achieve the same performance goal. To decide how to perform
performance tuning, both the programmer and the system designer need to
acquire detailed knowledge of the operation costs, as well as to identify
the potential bottleneck(s).
Therefore, the more we understand a parallel system and its performance
characteristics, the more information we have to improve the efficiency
of applications, as well as to provide insights that guide enhancements
to the system. This dissertation describes an attempt to improve and
accelerate the performance understanding of commodity clusters by focusing
on the communication system. The network is the most critical path of a
parallel system, as its performance directly influences the system's
capability for high-performance computing. In turn, the communication
efficiency of the network relies on the efficiency of the communication
software that drives the interconnect. Therefore, understanding the
performance capabilities of the cluster communication system provides
the conceptual framework we need to understand the performance of the
cluster system as a whole. The main issues related to performance
understanding are:
- Performance study - What sort of performance can we get on this
platform? Knowing the baseline capability of the cluster system is
important for performance understanding. Although this argument seems
intuitive, it is an important issue for commodity clusters, where we
have the desire and the flexibility to select the best configuration
for our cluster systems. Performance is only a relative metric; it has
no meaning on its own. In the quest for performance, the primary
objective is the efficient utilization of resources. However, without
understanding how well the system can perform:
- How can we determine whether the performance of a program is acceptable?
- How can we distinguish efficient, cost-effective algorithms from
inefficient ones?
- How can we identify the weaknesses of a program and perform performance
tuning?
- Performance benchmarking - How can we quantify performance? Low-level
architectural benchmarks [45] are the most appropriate because
their results measure the basic capabilities of the hardware as well
as the software associated with it; in principle, the performance
of higher-level programs should be predictable in terms of them. However,
designing and interpreting benchmarks correctly can be difficult,
because the conclusions drawn from a benchmark study depend
not only on the basic timing results, but also on the way these are
measured and interpreted. In particular, here are a few challenges
in the use of benchmarking:
- Does the benchmark routine reproduce the execution conditions under
which the code path is used by real programs? For example, the behavior
of the cache memory can affect the accuracy of the benchmark
measurement.
- Is the benchmark routine measuring something that has a measurable
impact on overall system performance?
- Is the benchmark routine sensitive to changes in hardware characteristics?
For example, a benchmark routine that is designed on a uniprocessor
machine may not work accurately on a multi-way SMP machine; thus,
we should have clear knowledge of the capability of the benchmark
routine.
Therefore, it is important to understand low-level details of the
architectural implementation when interpreting benchmark results. We
believe that, to assist the understanding process, the definition of each
performance parameter should be accompanied by the benchmarking
methodology that measures it.
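To illustrate the point, a timing routine only yields trustworthy numbers when its methodology controls for effects such as cache warm-up; the following sketch shows one common discipline (the routine under test, `memcopy_cost`, and all constants are hypothetical examples, not benchmarks used in this work):

```python
import time
import statistics

def benchmark(routine, *args, warmup=10, repeats=50):
    """Time `routine`, discarding warm-up runs so that cold-cache
    effects do not distort the measurement."""
    for _ in range(warmup):          # prime caches, branch predictors, etc.
        routine(*args)
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        routine(*args)
        samples.append(time.perf_counter() - t0)
    # Report both the minimum (best case) and the median (typical case);
    # a large gap between them hints at interference during measurement.
    return min(samples), statistics.median(samples)

def memcopy_cost(buf):               # hypothetical routine under test
    bytes(buf)                       # forces a copy of the buffer

best, typical = benchmark(memcopy_cost, bytearray(1 << 20))
print(f"best={best:.6f}s  median={typical:.6f}s")
```

Documenting the warm-up and repeat counts alongside the reported numbers is exactly the kind of methodological accompaniment argued for above.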
- Performance modeling - Which performance features are important to
our understanding process? A low-level benchmark is designed to measure
the quality of a target hardware or software component. The measurements
become the quantitative means for us to analyze and reason about the
performance of a particular feature of the cluster system. These
performance features, or in other words the performance parameters,
become the building blocks of our performance model, such that we try
to understand the performance behavior of the application as a function
of these performance parameters. We believe that a useful performance
parameter set - the abstract model - must possess these properties:
- It should accurately reflect the performance capability of existing
machines.
- It should be able to evaluate the performance of a particular application
on a particular platform and provide insights that suggest possible
improvements to the application.
- It should enable the programmer to predict the impact on the program's
performance of design changes in the algorithm or of changes to the
system configuration.
Therefore, we see that the choice of performance features influences
the understanding process. However, choosing the right features
to model, and incorporating them simply, elegantly and accurately,
requires creativity as well as scientific methodology. This is because
different applications exhibit different characteristics under different
system settings or environments, and application behavior is often
dynamic. Hence, it is unrealistic to expect any set of performance
parameters to cover the whole application domain. The hypothesis thus
becomes: given a minimal set of performance parameters that adequately
models or describes the system, the smaller the set, the more useful
it will be.
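To make the idea of a small parameter set concrete, consider a deliberately crude two-parameter model in which point-to-point communication time is T(m) = o + m/B, with o a fixed per-message overhead and B the asymptotic bandwidth. This linear form is offered purely as an illustration, not as the model developed in this dissertation; the calibration points below are invented numbers. Two measurements suffice to calibrate it, after which it predicts other message sizes:

```python
def fit_linear_model(m1, t1, m2, t2):
    """Calibrate T(m) = o + m/B from two (size, time) measurements."""
    inv_B = (t2 - t1) / (m2 - m1)   # slope = 1/B (seconds per byte)
    o = t1 - m1 * inv_B             # intercept = fixed overhead (seconds)
    return o, 1.0 / inv_B

def predict(o, B, m):
    """Predicted transfer time for an m-byte message."""
    return o + m / B

# Hypothetical calibration points: 25 us for 1 KB, 185 us for 64 KB.
o, B = fit_linear_model(1024, 25e-6, 65536, 185e-6)
print(f"overhead = {o*1e6:.1f} us, bandwidth = {B/1e6:.1f} MB/s")
print(f"predicted T(16 KB) = {predict(o, B, 16384)*1e6:.1f} us")
```

Even this tiny model exhibits the three desired properties in miniature: it is calibrated against the machine, it evaluates a given transfer, and it predicts the effect of changing the message size.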
The traditional approach to assessing the performance of communication
systems is to measure the round-trip latency and the point-to-point
bandwidth [20,55]. However, these simple figures
- do not give us enough information to evaluate the real performance
of the communication system;
- do not provide insight into why those results were obtained;
- cannot identify the potential bottlenecks in individual
subsystems;
since these metrics only report a single point-to-point
measurement under a lightly loaded network. In fact, network performance
differs among applications and depends on:
- Communication patterns - Besides the one-to-one pattern, example patterns
are the one-to-many, many-to-one and many-to-many communications,
which are commonly known as collective operations. Collective
operations play an important role in the development of parallel
applications. Different patterns have different performance characteristics
and requirements, and demand specific communication algorithms that can
efficiently utilize the underlying resources to achieve optimal results.
- Message sizes - Different factors dominate the communication performance
for small, medium and large messages. For example, with small messages,
software overhead and network latency dominate the communication cost,
while for large messages, network bandwidth is the more important
factor in determining communication performance. Medium-sized messages
cannot amortize the software overhead as completely as long messages
can; they therefore require better utilization of communication
pipelines to hide the software cost.
- Communication schedules - The communication schedule characterizes
how we express the communication pattern in terms of communication
primitives, such as send and receive operations. By designing communication
schedules, we are able to structure our communications to match
the resource constraints and yield highly efficient implementations.
- Degree of contention - Mild contention may result in a slight decrease
in performance due to higher queueing delays in the router. However,
under prolonged contention, the network becomes heavily congested,
and this results in data loss.
All these communication issues contribute to the overall communication
performance.
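The traditional round-trip measurement mentioned above can be sketched in a few lines. The version below runs the two endpoints as threads over a local socket pair, so the numbers it produces reflect only the host's protocol stack rather than a real interconnect; it illustrates the methodology, not the measurements reported in this dissertation:

```python
import socket
import threading
import time

def echo_server(sock, msg_size, rounds):
    """One ping-pong endpoint: echo every message back to the sender."""
    for _ in range(rounds):
        data = b""
        while len(data) < msg_size:          # recv may return partial data
            data += sock.recv(msg_size - len(data))
        sock.sendall(data)

def pingpong(msg_size=1024, rounds=100):
    a, b = socket.socketpair()
    t = threading.Thread(target=echo_server, args=(b, msg_size, rounds))
    t.start()
    payload = bytes(msg_size)
    t0 = time.perf_counter()
    for _ in range(rounds):
        a.sendall(payload)
        data = b""
        while len(data) < msg_size:
            data += a.recv(msg_size - len(data))
    elapsed = time.perf_counter() - t0
    t.join(); a.close(); b.close()
    rtt = elapsed / rounds                   # average round-trip time
    bw = 2 * msg_size * rounds / elapsed     # bytes moved per second
    return rtt, bw

rtt, bw = pingpong()
print(f"avg RTT = {rtt*1e6:.1f} us, throughput = {bw/1e6:.1f} MB/s")
```

Note that such a test exercises exactly one one-to-one flow under no contention, which is why, as argued above, its two summary figures cannot by themselves characterize collective patterns, schedules, or congested networks.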
The performance problems related to communication software have
been an active research issue for the past decade. Currently, lightweight
messaging systems [109,59,58,21,114,82,101]
offer the best communication performance, as they create a fast communication
path that bypasses the traditional in-kernel messaging protocol stack
(e.g. TCP/IP), which is a serious obstacle to exploiting the high
performance of modern networks [54,55,9].
Although most of these lightweight messaging systems succeed
in delivering the raw network performance to higher-level applications,
their implementations, functionality and interfaces differ,
even when some are built on the same network technologies. There
are issues that are not well addressed by these lightweight messaging
systems, such as system behavior under heavy load and general-purpose
flow control. In particular, they lack support for guiding the
development of high-level communication primitives atop their lightweight
messaging layers.
To better understand network performance, we adopt the systematic
approach to performance understanding described above to explore how the
underlying communication system supports these communication issues.
In this dissertation, we make use of a communication model, which
is derived from a resource-centric view of how data move across the
abstract machine, to aid our quantitative and qualitative studies
of communication performance. This model is proposed as a ``mid-level''
tool, whose task is to map all high-level communication primitives
properly onto the low-level architectural abstractions. Through a
set of performance metrics, programmers and system designers can
understand the power, the strengths and the weaknesses of their cluster
communication system. In addition, selecting appropriate metrics
to analyze the target application can bring out new insights
as well as efficient quantitative predictions of the communication
performance.
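As a hypothetical example of such a mapping, suppose the only low-level metrics available are a per-message overhead o and a bandwidth B. Even this crude abstraction lets us compare two schedules for the one-to-many (broadcast) pattern - a flat schedule where the root sends to every node in turn, and a binomial tree where receivers forward the message - before touching the hardware. The parameter values below are illustrative only, and this is not the model proposed in this dissertation:

```python
import math

def p2p_cost(m, o, B):
    """Cost of one point-to-point message of m bytes."""
    return o + m / B

def flat_broadcast(P, m, o, B):
    """Root sends the message to each of the P-1 other nodes in turn."""
    return (P - 1) * p2p_cost(m, o, B)

def tree_broadcast(P, m, o, B):
    """Binomial tree: every node holding the message forwards it, so the
    number of holders doubles each step -> ceil(log2 P) steps."""
    return math.ceil(math.log2(P)) * p2p_cost(m, o, B)

# Illustrative parameters: 20 us overhead, 100 MB/s bandwidth, 1 KB message.
o, B, m = 20e-6, 100e6, 1024
for P in (4, 16, 64):
    print(f"P={P:2d}  flat={flat_broadcast(P, m, o, B)*1e6:7.1f} us"
          f"  tree={tree_broadcast(P, m, o, B)*1e6:7.1f} us")
```

The point is not the particular formulas but the workflow: high-level primitives are expressed in terms of low-level metrics, so schedule choices can be reasoned about quantitatively.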
Having a low-latency communication system that drives the cluster
interconnect at high speed is a prerequisite for, but not a guarantee
of, achieving high performance. For example, most existing commercial
switches adopt a packet-drop policy on buffer overrun, which is a known
performance problem under heavy congestion. Studies of LAN performance
[55,79] have shown that the high overheads
of traditional communication protocols limit the ability to generate
network load, so the observed contention problem was not critical.
With low-latency communication mechanisms, applications can now generate
higher network load, which can result in congestion drops
even under a well-balanced communication schedule.
The behavior of networks in response to congestion thus becomes an
important issue in understanding communication performance. Depending
on the switch architecture, the type of network technology and the
host communication software, networks react to congestion in different
ways. For instance:
- Network routers that employ input-port buffering are known to suffer
from the Head-Of-Line (HOL) blocking problem [44],
which is a serious constraint on achieving high performance.
- Some Ethernet-based networks adopt the 802.3x link-based flow control
standard [1]. However, it is a non-selective
scheme [72]: the control actions taken
to resolve congestion apply to all traffic on that link rather than
targeting the true source of the congestion, so congestion
can spread to other parts of the network. Moreover, different vendors
have their own design strategies, so different switches may react
differently even under the same contention situation [79].
- Most commercial switches and routers employ the Drop Tail discipline
[36] as their buffer management strategy under
congestion. This is the simplest strategy: the router
just drops all excess packets when its buffers are full. However,
it is unfair to bursty traffic, which has more of its packets dropped
than other sources.
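A toy simulation makes the Drop Tail unfairness concrete. The model below is entirely hypothetical: one output queue drained at a fixed rate, fed by a smooth source and a bursty source offering the same average load, with a tally of the packets each source loses to buffer overflow:

```python
def drop_tail_sim(buffer_size=8, ticks=1000, drain=2):
    """One output queue with Drop Tail: arrivals beyond `buffer_size`
    are discarded. Both sources offer 1 packet/tick on average, but
    the bursty source sends 10 packets every 10th tick."""
    queue, drops = [], {"smooth": 0, "bursty": 0}
    for t in range(ticks):
        arrivals = [("smooth", 1)]
        if t % 10 == 0:
            arrivals.append(("bursty", 10))
        for src, n in arrivals:
            for _ in range(n):
                if len(queue) < buffer_size:
                    queue.append(src)
                else:
                    drops[src] += 1          # tail drop: excess packet lost
        del queue[:drain]                    # switch forwards `drain` packets
    return drops

drops = drop_tail_sim()
print(drops)   # the bursty source loses far more packets than the smooth one
```

Because each burst arrives faster than the queue can absorb it, the tail of every burst is clipped, while the evenly paced source slips through almost untouched - the unfairness described above.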
With commodity components as the cluster interconnect, the performance
of the network depends on the type of network technology
(e.g. Fast Ethernet, Gigabit Ethernet, ATM), as well as on the internal
architecture of the particular router or switch. This poses a challenge
to system designers and programmers: how to effectively utilize the
communication network for high-performance communication.