Speed is the main reason we use computers - both serial and parallel.
Building parallel computers to solve large-scale and complex problems
serves the same objective: we want to speed up the computations. As
parallel computing becomes more accessible through advances in processor
and network technologies, the use of parallelism is no longer the
privilege of a few scientists with access to powerful supercomputers.
A pile of PCs sitting in the server room, interconnected by a high-speed
LAN, becomes our in-house parallel computer. However, is that the answer
to our quest for more processing power? It is not. Building an efficient,
high-performance cluster is not as simple as plugging in and setting up
all the components. To construct a high-performance cluster, as well as
to orchestrate its processing power, we need to adopt a more scientific
approach.
Performance understanding is the process of explaining the performance
behavior of a parallel application on a parallel machine. The development
of efficient parallel applications depends upon a realistic prediction
of application behavior, which requires in-depth understanding of both
the application and the architectural characteristics [3,29,52]. For
example, if the communication overhead of a parallel application dominates
its execution time, the programmer may try to improve the communication
schedule by rearranging the communication events to reduce that overhead.
On the other hand, the system designer may try to improve the communication
system to achieve the same performance goal. To decide how to perform
performance tuning, both the programmer and the system designer need to
acquire detailed knowledge of the operation costs, as well as to identify
the potential bottleneck(s).
Therefore, the more we understand a parallel system and its performance
characteristics, the more information we have to improve the efficiency
of applications, as well as to provide insights that guide enhancements
to the system. This dissertation describes an attempt to improve and
accelerate the performance understanding of commodity clusters by focusing
on the communication system. The network is the most critical path of a
parallel system, as its performance directly influences the system's
capability for high-performance computing. In turn, the communication
efficiency of the network relies on the efficiency of the communication
software that drives the interconnect. Therefore, understanding the
performance capabilities of the cluster communication system provides
the conceptual framework we need to understand the performance of the
cluster system as a whole. The main issues related to performance
understanding are:
- Performance study - What sort of performance can we get on this
platform? Knowing the baseline capability of the cluster system is
important for performance understanding. Although this argument seems
intuitive, it is an important issue for commodity clusters, where we
have the desire and the flexibility to select the best configuration
for our cluster systems. Performance is only a relative metric; it has
no meaning on its own. In the quest for performance, the primary
objective is the efficient utilization of resources. However, without
understanding how well the system can perform:
- How can we determine whether the performance of a program is acceptable?
- How can we distinguish efficient, cost-effective algorithms from
inefficient ones?
- How can we identify the weaknesses of a program and perform performance
tuning?
- Performance benchmarking - How can we quantify performance? Low-level
architectural benchmarks [45] are the most appropriate because
their results measure the basic capabilities of the hardware as well
as the software associated with it; in principle, the performance
of higher-level programs should be predictable in terms of them. However,
designing and interpreting benchmarks correctly can be difficult,
because the conclusions drawn from a benchmark study depend
not only on the basic timing results, but also on the way these are
measured and interpreted. In particular, here are a few challenges
in the use of benchmarking:
- Does the benchmark routine reproduce the execution conditions under
which the code path is used by real programs? For example, the behavior
of the cache memory can affect the accuracy of the benchmark
measurement.
- Is the benchmark routine measuring something that has a measurable
impact on overall system performance?
- Is the benchmark routine sensitive to changes in hardware characteristics?
For example, a benchmark routine that is designed on a uniprocessor
machine may not work accurately on a multi-way SMP machine; thus,
we should have clear knowledge of the capability of the benchmark
routine.
Therefore, it is important to understand low-level details of the
architectural implementation when interpreting benchmark results. We
believe that, to assist the understanding process, the definition of each
performance parameter should be accompanied by the benchmarking
methodology that measures it.
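To illustrate the point, a timing routine only yields trustworthy numbers when its methodology controls for effects such as cache warm-up; the following sketch shows one common discipline (the routine under test, `memcopy_cost`, and all constants are hypothetical examples, not benchmarks used in this work):

```python
import time
import statistics

def benchmark(routine, *args, warmup=10, repeats=50):
    """Time `routine`, discarding warm-up runs so that cold-cache
    effects do not distort the measurement."""
    for _ in range(warmup):          # prime caches, branch predictors, etc.
        routine(*args)
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        routine(*args)
        samples.append(time.perf_counter() - t0)
    # Report both the minimum (best case) and the median (typical case);
    # a large gap between them hints at interference during measurement.
    return min(samples), statistics.median(samples)

def memcopy_cost(buf):               # hypothetical routine under test
    bytes(buf)                       # forces a copy of the buffer

best, typical = benchmark(memcopy_cost, bytearray(1 << 20))
print(f"best={best:.6f}s  median={typical:.6f}s")
```

Documenting the warm-up and repeat counts alongside the reported numbers is exactly the kind of methodological accompaniment argued for above.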
- Performance modeling - Which performance features are important to
our understanding process? A low-level benchmark is designed to measure
the quality of a target hardware or software component. The measurements
become the quantitative means for us to analyze and reason about the
performance of a particular feature of the cluster system. These
performance features, or in other words the performance parameters,
become the building blocks of our performance model, such that we try
to understand the performance behavior of the application as a function
of these performance parameters. We believe that a useful performance
parameter set - the abstract model - must possess these properties:
- It should accurately reflect the performance capability of existing
machines.
- It should be able to evaluate the performance of a particular application
on a particular platform and provide insights that suggest possible
improvements to the application.
- It should enable the programmer to predict the impact on the program's
performance of design changes in the algorithm or of changes to the
system configuration.
Therefore, we see that the choice of performance features influences
the understanding process. However, choosing the right features
to model, and incorporating them simply, elegantly and accurately,
requires creativity as well as scientific methodology. This is because
different applications exhibit different characteristics under different
system settings or environments, and application behavior is often
dynamic. Hence, it is unrealistic to expect any set of performance
parameters to cover the whole application domain. The hypothesis thus
becomes: given a minimal set of performance parameters that adequately
models or describes the system, the smaller the set, the more useful
it will be.
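To make the idea of a small parameter set concrete, consider a deliberately crude two-parameter model in which point-to-point communication time is T(m) = o + m/B, with o a fixed per-message overhead and B the asymptotic bandwidth. This linear form is offered purely as an illustration, not as the model developed in this dissertation; the calibration points below are invented numbers. Two measurements suffice to calibrate it, after which it predicts other message sizes:

```python
def fit_linear_model(m1, t1, m2, t2):
    """Calibrate T(m) = o + m/B from two (size, time) measurements."""
    inv_B = (t2 - t1) / (m2 - m1)   # slope = 1/B (seconds per byte)
    o = t1 - m1 * inv_B             # intercept = fixed overhead (seconds)
    return o, 1.0 / inv_B

def predict(o, B, m):
    """Predicted transfer time for an m-byte message."""
    return o + m / B

# Hypothetical calibration points: 25 us for 1 KB, 185 us for 64 KB.
o, B = fit_linear_model(1024, 25e-6, 65536, 185e-6)
print(f"overhead = {o*1e6:.1f} us, bandwidth = {B/1e6:.1f} MB/s")
print(f"predicted T(16 KB) = {predict(o, B, 16384)*1e6:.1f} us")
```

Even this tiny model exhibits the three desired properties in miniature: it is calibrated against the machine, it evaluates a given transfer, and it predicts the effect of changing the message size.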
The traditional approach to assessing the performance of communication
systems is to measure the round-trip latency and the point-to-point
bandwidth [20,55]. However, these simple figures
- do not give us enough information to evaluate the real performance
of the communication system;
- do not provide insight into why those results were obtained;
- cannot identify the potential bottlenecks in individual
subsystems;
since these metrics only report a single point-to-point
measurement under a lightly loaded network. In fact, network performance
differs among applications and depends on:
- Communication patterns - Besides the one-to-one pattern, example patterns
are the one-to-many, many-to-one and many-to-many communications,
which are commonly known as collective operations. Collective
operations play an important role in the development of parallel
applications. Different patterns have different performance characteristics
and requirements, and demand specific communication algorithms that can
efficiently utilize the underlying resources to achieve optimal results.
- Message sizes - Different factors dominate the communication performance
for small, medium and large messages. For example, with small messages,
software overhead and network latency dominate the communication cost,
while for large messages, network bandwidth is the more important
factor in determining communication performance. Medium-sized messages
cannot amortize the software overhead as completely as long messages
can; they therefore require better utilization of communication
pipelines to hide the software cost.
- Communication schedules - The communication schedule characterizes
how we express the communication pattern in terms of communication
primitives, such as send and receive operations. By designing communication
schedules, we are able to structure our communications to match
the resource constraints and yield highly efficient implementations.
- Degree of contention - Mild contention may result in a slight decrease
in performance due to higher queueing delays in the router. However,
under prolonged contention, the network becomes heavily congested,
and this results in data loss.
All these communication issues contribute to the overall communication
performance.
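The traditional round-trip measurement mentioned above can be sketched in a few lines. The version below runs the two endpoints as threads over a local socket pair, so the numbers it produces reflect only the host's protocol stack rather than a real interconnect; it illustrates the methodology, not the measurements reported in this dissertation:

```python
import socket
import threading
import time

def echo_server(sock, msg_size, rounds):
    """One ping-pong endpoint: echo every message back to the sender."""
    for _ in range(rounds):
        data = b""
        while len(data) < msg_size:          # recv may return partial data
            data += sock.recv(msg_size - len(data))
        sock.sendall(data)

def pingpong(msg_size=1024, rounds=100):
    a, b = socket.socketpair()
    t = threading.Thread(target=echo_server, args=(b, msg_size, rounds))
    t.start()
    payload = bytes(msg_size)
    t0 = time.perf_counter()
    for _ in range(rounds):
        a.sendall(payload)
        data = b""
        while len(data) < msg_size:
            data += a.recv(msg_size - len(data))
    elapsed = time.perf_counter() - t0
    t.join(); a.close(); b.close()
    rtt = elapsed / rounds                   # average round-trip time
    bw = 2 * msg_size * rounds / elapsed     # bytes moved per second
    return rtt, bw

rtt, bw = pingpong()
print(f"avg RTT = {rtt*1e6:.1f} us, throughput = {bw/1e6:.1f} MB/s")
```

Note that such a test exercises exactly one one-to-one flow under no contention, which is why, as argued above, its two summary figures cannot by themselves characterize collective patterns, schedules, or congested networks.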
The performance problems related to communication software have
been an active research issue for the past decade. Currently, lightweight
messaging systems [109,59,58,21,114,82,101]
offer the best communication performance, as they create a fast communication
path that bypasses the traditional in-kernel messaging protocol stack
(e.g. TCP/IP), which is a serious obstacle to exploiting the high
performance of modern networks [54,55,9].
Although most of these lightweight messaging systems succeed
in delivering the raw network performance to higher-level applications,
their implementations, functionality and interfaces differ,
even when some are built on the same network technologies. There
are issues that are not well addressed by these lightweight messaging
systems, such as system behavior under heavy load and general-purpose
flow control. In particular, they lack support for guiding the
development of high-level communication primitives atop their lightweight
messaging layers.
To better understand network performance, we adopt the systematic
approach to performance understanding described above to explore how the
underlying communication system supports these communication issues.
In this dissertation, we make use of a communication model, which
is derived from a resource-centric view of how data move across the
abstract machine, to aid our quantitative and qualitative studies
of communication performance. This model is proposed as a ``mid-level''
tool, whose task is to map all high-level communication primitives
properly onto the low-level architectural abstractions. Through a
set of performance metrics, programmers and system designers can
understand the power, the strengths and the weaknesses of their cluster
communication system. In addition, selecting appropriate metrics
to analyze the target application can bring out new insights
as well as efficient quantitative predictions of the communication
performance.
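As a hypothetical example of such a mapping, suppose the only low-level metrics available are a per-message overhead o and a bandwidth B. Even this crude abstraction lets us compare two schedules for the one-to-many (broadcast) pattern - a flat schedule where the root sends to every node in turn, and a binomial tree where receivers forward the message - before touching the hardware. The parameter values below are illustrative only, and this is not the model proposed in this dissertation:

```python
import math

def p2p_cost(m, o, B):
    """Cost of one point-to-point message of m bytes."""
    return o + m / B

def flat_broadcast(P, m, o, B):
    """Root sends the message to each of the P-1 other nodes in turn."""
    return (P - 1) * p2p_cost(m, o, B)

def tree_broadcast(P, m, o, B):
    """Binomial tree: every node holding the message forwards it, so the
    number of holders doubles each step -> ceil(log2 P) steps."""
    return math.ceil(math.log2(P)) * p2p_cost(m, o, B)

# Illustrative parameters: 20 us overhead, 100 MB/s bandwidth, 1 KB message.
o, B, m = 20e-6, 100e6, 1024
for P in (4, 16, 64):
    print(f"P={P:2d}  flat={flat_broadcast(P, m, o, B)*1e6:7.1f} us"
          f"  tree={tree_broadcast(P, m, o, B)*1e6:7.1f} us")
```

The point is not the particular formulas but the workflow: high-level primitives are expressed in terms of low-level metrics, so schedule choices can be reasoned about quantitatively.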
Having a low-latency communication system that drives the cluster
interconnect at high speed is a prerequisite for, but not a guarantee
of, achieving high performance. For example, most existing commercial
switches adopt a packet-drop policy on buffer overrun, which is a known
performance problem under heavy congestion. Studies of LAN performance
[55,79] have shown that the high overheads
of traditional communication protocols limit the ability to generate
network load, so the observed contention problem was not critical.
With low-latency communication mechanisms, applications can now generate
higher network load, which can result in congestion drops
even under a well-balanced communication schedule.
The behavior of networks in response to congestion thus becomes an
important issue in understanding communication performance. Depending
on the switch architecture, the type of network technology and the
host communication software, networks react to congestion in different
ways. For instance:
- Network routers that employ input-port buffering are known to suffer
from the Head-Of-Line (HOL) blocking problem [44],
which is a serious constraint on achieving high performance.
- Some Ethernet-based networks adopt the 802.3x link-based flow control
standard [1]. However, it is a non-selective
scheme [72]: the control actions taken
to resolve congestion apply to all traffic on that link rather than
targeting the true source of the congestion, so congestion
can spread to other parts of the network. Moreover, different vendors
have their own design strategies, so different switches may react
differently even under the same contention situation [79].
- Most commercial switches and routers employ the Drop Tail discipline
[36] as their buffer management strategy under
congestion. This is the simplest strategy: the router
just drops all excess packets when its buffers are full. However,
it is unfair to bursty traffic, which has more of its packets dropped
than other sources.
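A toy simulation makes the Drop Tail unfairness concrete. The model below is entirely hypothetical: one output queue drained at a fixed rate, fed by a smooth source and a bursty source offering the same average load, with a tally of the packets each source loses to buffer overflow:

```python
def drop_tail_sim(buffer_size=8, ticks=1000, drain=2):
    """One output queue with Drop Tail: arrivals beyond `buffer_size`
    are discarded. Both sources offer 1 packet/tick on average, but
    the bursty source sends 10 packets every 10th tick."""
    queue, drops = [], {"smooth": 0, "bursty": 0}
    for t in range(ticks):
        arrivals = [("smooth", 1)]
        if t % 10 == 0:
            arrivals.append(("bursty", 10))
        for src, n in arrivals:
            for _ in range(n):
                if len(queue) < buffer_size:
                    queue.append(src)
                else:
                    drops[src] += 1          # tail drop: excess packet lost
        del queue[:drain]                    # switch forwards `drain` packets
    return drops

drops = drop_tail_sim()
print(drops)   # the bursty source loses far more packets than the smooth one
```

Because each burst arrives faster than the queue can absorb it, the tail of every burst is clipped, while the evenly paced source slips through almost untouched - the unfairness described above.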
With commodity components as the cluster interconnect, the performance
of the network depends on the type of network technology
(e.g. Fast Ethernet, Gigabit Ethernet, ATM), as well as on the internal
architecture of the particular router or switch. This poses a challenge
to system designers and programmers: how to effectively utilize the
communication network for high-performance communication.