A Latency-Tolerant Partitioner for Distributed Computing on the Information Power Grid

Sajal K. Das and Daniel J. Harvey
Dept. of Computer Science & Engineering
The University of Texas at Arlington
Arlington, TX 76019-0015
E-mail: {das.harvey}@cse.uta.edu

Rupak Biswas
NASA Ames Research Center
Mail Stop T27A-1
Moffett Field, CA 94035-1000
E-mail: rbiswas@nas.nasa.gov

Abstract

NASA's Information Power Grid (IPG) is an infrastructure designed to harness the power of geographically distributed computers, databases, and human expertise, in order to solve large-scale realistic computational problems. This type of a metacomputing environment is necessary to present a unified virtual machine to application developers that hides the intricacies of a highly heterogeneous environment and yet maintains adequate security. In this paper, we present a novel partitioning scheme, called MinEX, that dynamically balances processor workloads while minimizing data movement and runtime communication, for applications that are executed in a parallel distributed fashion on the IPG. We also analyze the conditions that are required for the IPG to be an effective tool for such distributed computations. Our results show that MinEX is a viable load balancer provided the nodes of the IPG are connected by a high-speed asynchronous interconnection network.

1 Introduction

The Information Power Grid (IPG) has been developed by NASA and other collaborative partners to harness the power of geographically distributed resources. The I-WAY experiment [8] identified several classes of applications that would benefit from such an infrastructure:

- Desktop coupling to remote supercomputers to provide access to large databases and high-end graphics facilities.
- User access to sophisticated instruments through remote supercomputer connections utilizing virtual reality techniques [7].
- Remote interactions with supercomputer simulations [9, 10].

There have been numerous attempts by the research community to develop computational grid capabilities. A comprehensive survey of current technology can be found in [13]. The Condor system [17], developed to manage research studies at workstations around the world, is an example of an early success. However, Condor did not adequately deal with security issues that are important for a general computational grid implementation. Other grid-based systems that have been developed include Nimrod [1], NetSolve [4], NEOS [9], Legion [14], and CAVERN [16]. The Globus Metacomputing Infrastructure Toolkit [12] (http://www.globus.org) has been extremely successful in providing a portable virtual machine environment. Mechanisms exist within Globus
to share remote resources, provide adequate security, and allow MPI-based message passing. Due to its general, portable, and modular nature, Globus has been chosen by NASA as the middleware to implement the IPG.

Limited studies have been performed to determine the viability of parallel distributed computing on the IPG. In one such study [2], latency tolerance and load balancing modifications were implemented in connection with a computational fluid dynamics problem to compensate for slower communication speed. Results showed that the application actually ran faster under Globus on two IPG nodes of four processors each than on a single tightly-coupled machine of eight processors. However, this result is clouded in that asynchronous message passing was supported over the high-speed link but not within the single platform. In this paper, we simulate an unsteady adaptive mesh application on a wide area network. The number of IPG nodes, the number of processors per node, and the interconnect speeds are parameterized so that general conclusions can be drawn as to when the IPG would be suitable to solve applications of this nature.

In the past, we have investigated two different load balancing strategies with this application as the test case. The first strategy, called PLUM [19], is an architecture-independent framework geared towards adaptive numerical solutions. PLUM globally partitions the computational mesh after each adaptation and determines whether re-balancing the load would lead to reduced total execution time. If an improvement in the load balance can be achieved, it utilizes an effective re-mapping algorithm to minimize the required data movement. Application processing is temporarily suspended during the partitioning and data re-mapping operations. Although PLUM is designed to utilize any parallel graph partitioner, ParMeTiS [15] has proven to be the most effective.

The second approach uses a general-purpose topology-independent dynamic load balancer utilizing Symmetric Broadcast Networks (SBN) [6]. A salient feature of this SBN-based approach is that it balances processor workloads while the application is executing. Thus it is able to hide the high data migration overhead, albeit at the cost of increased interprocessor communication. Results reported in [3] indicate that both PLUM and the SBN approach have their relative merits, and that they achieve excellent load balance with minimal extra overhead.

In this paper, we propose a novel partitioning approach that optimizes the two important steps of PLUM (balancing and re-mapping) as part of the partitioning process. The goal of this partitioner, called MinEX, is different from that of most partitioners. Instead of attempting to balance the load, the objective is to minimize the total runtime of the application. This approach counters the possibility that perfectly balanced loads can still incur excessive communication and redistribution costs while the application is processed. MinEX is also used to experiment with latency tolerant techniques on the IPG. Results show that MinEX reduces the number of elements migrated by PLUM, and lowers the percentage of edges cut by SBN. For example, for 32 partitions with our test case. PLUM showed an edge cut of 10.9% and redistributed 63,270 mesh elements. The corresponding values for the SBN approach were 36.5% and 19,146. Instead, the MinEX partitioner values were 20.9% and 30,548.

This paper is organized as follows. Section 2 introduces the computational application to be tested and determines its scalability. Section 3 describes the new MinEX partitioner. Section 4 describes the experimental study, analyzes the obtained results and draws conclusions as to the use of the IPG for this and similar applications. Section 5 concludes the paper.

## 2 Computational Test Case

Many computational problems are modeled discretely as an unstructured mesh of vertices and edges. To capture evolving features, the mesh topology is frequently adapted. For an efficient
parallel implementation, this requires dynamic load balancing. In other words, mesh objects will have to be reassigned after each adaptation phase to re-balance the workload among the processors. It is critical to minimize the overhead associated with re-mapping data sets, and to reduce the communication between processors at the next solution step. These goals are especially important in an IPG context where communication bandwidths between nodes are likely to be much smaller than on a single multiprocessor machine.

The computational mesh used for the experiments in this paper simulates an unsteady environment where the adapted region is strongly time-dependent. As shown in Fig. 1, a shock wave is propagated through an initial grid to produce the desired effect. The computational mesh is processed through nine adaptations by moving a cylindrical volume across the domain with constant velocity. Grid elements within the cylindrical volume are refined while previously-refined elements are coarsened in its wake. During the processing, the size of the mesh increases from 50,000 elements to 1,833,730 elements.

![Figure 1: Initial and adapted meshes (after levels 1 and 5) for the simulated unsteady experiment.](image)

To realistically simulate the overhead associated with an adaptive mesh computation, two weights are associated with each vertex and one weight with each edge. These weights respectively reflect the number of time units required for computation, data remapping, and communication. The total time required to process the vertices assigned to a processor $p$ must take into account all three metrics which are defined as follows.

- **Processing Weight** ($W_{v}^{\text{it}}$) is the computational cost to process a vertex $v$.
- **Communication Cost** ($Comm_{v}^{p}$) is the cost to interact with all vertices adjacent to $v$ but whose data sets are not local to processor $p$.
- **Redistribution Cost** ($Remap_{v}^{p}$) is the overhead to copy the data set associated with $v$ to another processor from $p$. Note that the redistribution cost incurred at $p$ includes the operations of packing data and initiating transmission. The redistribution cost incurred by the processor receiving $v$ is the sum of the communication time and the operations of unpacking and merging the data into existing data structures.
Additional metrics that will be needed in this paper are defined below:

- **Weighted Queue Length** ($Q\text{Wgt}(p)$) is the total cost to process the vertices assigned to $p$.
  It is defined as:
  
  $$Q\text{Wgt}(p) = \sum_{v \text{ assigned to } p} (Wgt^v + Comm^v_p + Remap^v_p).$$

- **Total System Load** ($Q\text{Wgt}_{TOT}$) is the sum of $Q\text{Wgt}(p)$ over all processors.

- **Heaviest Load** ($\text{MaxQWgt}$) is the maximum value of $Q\text{Wgt}(p)$ over all processors, and indicates the total time required to process the application.

- **Lightest Load** ($\text{MinQWgt}$) is the minimum value of $Q\text{Wgt}(p)$ over all processors, and indicates the workload of the most lightly-loaded processor.

- **Average Load** ($\text{AvgQWgt}$) is $Q\text{Wgt}_{TOT}/P$, where $P$ is the total number of processors.

- **Load Imbalance Factor** ($\text{LoadImb}$) represents the quality of the partitioning and is defined as $\text{MaxQWgt}/\text{AvgQWgt}$.

Clearly, if the data set for $v$ is already assigned to $p$, no redistribution cost is incurred, i.e. $Remap^v_p = 0$. Similarly, if the data sets of all the vertices adjacent to $v$ are also assigned to $p$, the communication cost, $Comm^v_p$, is 0.

### Table 1: Scalability analysis of the test application

<table>
<thead>
<tr>
<th>Number of Processors</th>
<th>2</th>
<th>4</th>
<th>8</th>
<th>16</th>
<th>32</th>
<th>64</th>
<th>128</th>
<th>256</th>
<th>512</th>
<th>1024</th>
<th>2048</th>
</tr>
</thead>
<tbody>
<tr>
<td>Latency</td>
<td>Max. Tolerance</td>
<td>3777</td>
<td>1824</td>
<td>1148</td>
<td>614</td>
<td>324</td>
<td>168</td>
<td>89</td>
<td>72</td>
<td>58</td>
<td>51</td>
</tr>
<tr>
<td></td>
<td>No Tolerance</td>
<td>4547</td>
<td>3193</td>
<td>1699</td>
<td>1033</td>
<td>558</td>
<td>302</td>
<td>173</td>
<td>123</td>
<td>115</td>
<td>109</td>
</tr>
</tbody>
</table>

Table 1 indicates the scalability of this application where the number of processors, $P$, is varied from 2 to 2048. The data was obtained by simulating the application (details presented in Section 4). Each column reflects non-dimensionalized $\text{MaxQWgt}$ values in thousands. The first row of the table assumes that maximum latency tolerance is achieved; the second row assumes that no latency tolerance is achieved. *Maximum latency tolerance* is defined as the ability to utilize all available processors to overlap communication and redistribution costs. Further explanations are provided in Section 3. Table 1 shows that this application can scale to over 1000 processors, indicating good potential for an IPG implementation.

## 3 Proposed MinEX Partitioner

Previous studies with this mesh application under the PLUM framework utilized a variety of general partitioners such as ParMeTiS [15], UAmeTis [20], DAMeTis [20], Jostle-MS [21], and Jostle-MD [21]. Note that UAmeTis, DAMeTis, and Jostle-MD are diffusive schemes designed to modify existing partitions to produce a processor allocation; whereas PMeTis and Jostle-MS are global partitioners which makes no assumptions about the original mesh distribution. Although all these partitioners achieve good load balance while minimizing communication overhead, they fail to
consider the cost of moving data between processors. A unique feature of PLUM is to address this drawback through the use of an efficient heuristic procedure for redistributing data to assigned processors.

In this study, we optimize both communication and remapping costs by implementing a novel partitioner, called MinEX, that considers computational, communication, and data remapping costs. We also redefine the partitioning goal from producing balanced loads to minimizing MaxQWgt.

3.1 General Design

MinEX can be classified as a diffusive multilevel partitioner. Partitioning occurs in three steps: contraction, partitioning, and refinement. Each of these steps are discussed below:

- Similar to other multilevel partitioners, the first step in MinEX is to contract the mesh to a reasonable size. What is different, however, is the contraction procedure. Instead of repeatedly contracting the mesh in halves as is common with other multilevel partitioners, MinEX sequentially contracts one vertex at a time. The advantage to this approach is that a decision can be made each time a vertex is later refined as to whether it should be assigned to another processor. This would make the algorithm more flexible since the graph would not have to be doubled in size before this decision could be made. If \( |V| \) is the number of vertices in the mesh, contraction requires \( O(|V|) \) steps. Total complexity would not be greater than the complexity of contracting the mesh sequentially in halves, since that would involve \( O(|V/2|) \) steps. Performing all the steps would still require \( O(|V/2|) + O(|V/4|) + \cdots = O(|V|) \).

- Once the mesh is sufficiently contracted, the remaining vertices are reassigned according to the criteria to be followed by the partitioning algorithm (described in detail in Section 3.2).

- The mesh is expanded back to its original size through a refinement process. As each vertex is refined, a decision is made as to whether it should be reassigned. This decision employs the same criteria that is followed by the partitioning algorithm in the second step above. Each coarse vertex reassignment in effect reassigns all of the vertices the coarse vertex represents.

3.2 Partitioning Criteria

To describe the criteria for deciding whether a vertex should be reassigned from one processor to another, two metrics, Gain and MinVar, need to be defined:

- Gain represents the change in QWgtTOT that would result from a proposed vertex move. A negative Gain value would indicate that less total processing is required after such a vertex move. The partitioning algorithm favors vertex moves with negative or small Gain values that reduce or minimize overall system load.

- MinVar is computed using the workload (i.e. QWgt\( (p) \)) for each processor \( p \) and the smallest load of any processor (MinQWgt) in accordance with the following formula:

\[
\text{MinVar} = \sum_p (\text{QWgt}(p) - \text{MinQWgt})^2.
\]

In other words, MinVar computes the variance of processor workloads from that of the most lightly-loaded processor. The objective is to initiate vertex moves that lower this value. Since
processors with large $Q_{\text{Wgt}}(p)$ values will have large $\text{MinVar}$ components, this criteria will tend to move vertices away from processors that have high runtime requirements.

$\Delta \text{MinVar}$ is the change in the $\text{MinVar}$ value after moving a vertex from one processor to another. A negative value indicates that the $\text{MinVar}$ value has been reduced.

Let us now describe how the partitioning decisions are made. For each vertex, consider all edges to adjacent vertices that are assigned to other processors. Compute the $\text{Gain}$ and $\text{MinVar}$ values that would result from moving the given vertex to the adjacent processor. Designate the newly computed $\text{MinVar}$ value as $\text{MinVar}_{\text{New}}$ and the original $\text{MinVar}$ value as $\text{MinVar}_{\text{Old}}$. If $\text{MinVar}_{\text{New}} < \text{MinVar}_{\text{Old}}$ and $\text{Gain}/(\text{MinVar}_{\text{Old}} - \text{MinVar}_{\text{New}}) < \text{Throttle}$, the proposed reassignment is considered. Note that $\text{Throttle}$ is a user-supplied parameter. The move chosen will be the one with the smallest $\text{Gain}$ value. To increase efficiency, the program utilizes a minimum heap with vertex pointers to heap locations to quickly find the best move and directly remove heap entries without having to search.

Table 2: Expected runtimes experienced based on varying $\text{Throttle}$ values

<table>
<thead>
<tr>
<th>Metric</th>
<th>Clusters</th>
<th>0</th>
<th>1</th>
<th>3</th>
<th>4</th>
<th>16</th>
<th>32</th>
<th>64</th>
<th>128</th>
<th>200k</th>
</tr>
</thead>
<tbody>
<tr>
<td>MaxQWgt</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>1</td>
<td>1993</td>
<td>427</td>
<td>348</td>
<td>312</td>
<td>291</td>
<td>300</td>
<td>306</td>
<td>312</td>
<td>324</td>
<td></td>
</tr>
<tr>
<td>2</td>
<td>1847</td>
<td>1142</td>
<td>748</td>
<td>467</td>
<td>320</td>
<td>304</td>
<td>305</td>
<td>318</td>
<td>345</td>
<td></td>
</tr>
<tr>
<td>3</td>
<td>2035</td>
<td>1801</td>
<td>674</td>
<td>556</td>
<td>375</td>
<td>331</td>
<td>324</td>
<td>326</td>
<td>382</td>
<td></td>
</tr>
<tr>
<td>4</td>
<td>1868</td>
<td>1516</td>
<td>761</td>
<td>639</td>
<td>412</td>
<td>352</td>
<td>328</td>
<td>371</td>
<td>425</td>
<td></td>
</tr>
<tr>
<td>5</td>
<td>1834</td>
<td>1626</td>
<td>835</td>
<td>767</td>
<td>438</td>
<td>373</td>
<td>359</td>
<td>343</td>
<td>400</td>
<td></td>
</tr>
<tr>
<td>6</td>
<td>2081</td>
<td>1579</td>
<td>898</td>
<td>825</td>
<td>481</td>
<td>395</td>
<td>357</td>
<td>361</td>
<td>427</td>
<td></td>
</tr>
<tr>
<td>7</td>
<td>1884</td>
<td>1279</td>
<td>1032</td>
<td>758</td>
<td>505</td>
<td>383</td>
<td>371</td>
<td>369</td>
<td>414</td>
<td></td>
</tr>
<tr>
<td>8</td>
<td>1944</td>
<td>1451</td>
<td>1102</td>
<td>834</td>
<td>531</td>
<td>434</td>
<td>376</td>
<td>380</td>
<td>435</td>
<td></td>
</tr>
<tr>
<td>LoadImb</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>1</td>
<td>7.05</td>
<td>5.09</td>
<td>1.23</td>
<td>1.11</td>
<td>1.01</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
</tr>
<tr>
<td>2</td>
<td>8.54</td>
<td>4.16</td>
<td>2.74</td>
<td>1.81</td>
<td>1.26</td>
<td>1.14</td>
<td>1.04</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
</tr>
<tr>
<td>3</td>
<td>7.15</td>
<td>6.40</td>
<td>2.50</td>
<td>2.11</td>
<td>1.41</td>
<td>1.19</td>
<td>1.05</td>
<td>1.02</td>
<td>1.01</td>
<td>1.01</td>
</tr>
<tr>
<td>4</td>
<td>6.63</td>
<td>5.41</td>
<td>2.82</td>
<td>2.40</td>
<td>1.58</td>
<td>1.26</td>
<td>1.07</td>
<td>1.03</td>
<td>1.01</td>
<td>1.01</td>
</tr>
<tr>
<td>5</td>
<td>6.53</td>
<td>5.78</td>
<td>3.06</td>
<td>2.83</td>
<td>1.66</td>
<td>1.30</td>
<td>1.11</td>
<td>1.02</td>
<td>1.01</td>
<td>1.01</td>
</tr>
<tr>
<td>6</td>
<td>7.31</td>
<td>5.58</td>
<td>3.25</td>
<td>2.99</td>
<td>1.81</td>
<td>1.40</td>
<td>1.08</td>
<td>1.02</td>
<td>1.01</td>
<td>1.01</td>
</tr>
<tr>
<td>7</td>
<td>6.68</td>
<td>4.61</td>
<td>3.74</td>
<td>2.80</td>
<td>1.84</td>
<td>1.33</td>
<td>1.10</td>
<td>1.03</td>
<td>1.00</td>
<td>1.00</td>
</tr>
<tr>
<td>8</td>
<td>6.90</td>
<td>5.15</td>
<td>3.92</td>
<td>3.05</td>
<td>1.94</td>
<td>1.43</td>
<td>1.13</td>
<td>1.06</td>
<td>1.00</td>
<td>1.00</td>
</tr>
</tbody>
</table>

Conceptually, $\text{Throttle}$ acts as a gate that limits increases in $\text{Gain}$ based upon how much of an improvement in $\text{MinVar}$ can be achieved. Table 2 indicates how varying $\text{Throttle}$ affects the expected application runtimes ($\text{MaxQWgt}$) and load balance quality ($\text{LoadImb}$). The $\text{MaxQWgt}$ entries are non-dimensionalized values in thousands. These results were obtained by running the experiments described in Section 4. Table 2 assumes a network of 32 homogeneous processors distributed over one to eight IPG nodes (clusters). The inter-cluster interconnect speed is assumed to be a third of the intra-cluster speed. Results show that a $\text{Throttle}$ value of 64 produces the lowest overall $\text{MaxQWgt}$, and that larger $\text{Throttle}$ values improve $\text{LoadImb}$. Experiments with other network sizes using this same mesh have shown that $\text{Throttle}$ generally converges at values between $P$ and $2P$. Note also that for large values of $\text{Throttle}$, better $\text{LoadImb}$ does not necessarily imply lower $\text{MaxQWgt}$.
3.3 Latency Tolerance

The following processing steps illustrate how communication and data redistribution can be reduced or eliminated.

1. Initiate send of all data sets to be redistributed.
2. Initiate send of communication data needed by adjacent processors.
3. Process vertices that are not waiting for incoming transmissions.
4. Receive and unpack any re-mapped data sets destined for this processor.
5. Receive and unpack communication data destined for this processor.
6. Repeat steps 2 through 5 until all vertices are processed.

The above logic implements a strategy where processors distribute data sets and communication data as early as possible. Servicing of internal mesh vertices can then take place while waiting for expected incoming messages. As data sets and communication data are received, additional communications can be initiated and vertices processed. The most optimistic expectation of this strategy is that the processing activity can entirely hide the data set and communication latency. At the other extreme, the most pessimistic view is that no latency tolerance is achieved. Experiments simulating both views to analyze the effect of latency tolerance on our test application are described in Section 4.

3.4 Partitioning Data Structures

The following data structures are used by the MinEX partitioner to perform its multilevel algorithm:

**Mesh:** The adaptive mesh whose format is \( \{ |V|, |E|, vTot, *VMaP, *VList, *EList \} \) where

- \( |V| \) is the number of active vertices in the mesh
- \( |E| \) is the number of edges in the mesh
- \( vTot \) is the total number of vertices (includes merged vertices)
- \( *VMaP \) is a pointer to list of active vertices
- \( *VList \) is a pointer to list of vertices
- \( *EList \) is a pointer to list of edges.

**VmaP:** A list of active vertex numbers. None of these vertices have been compressed through multilevel partitioning.

**VList:** A complete list of vertices. Each vertex, \( v \), is defined by a VList record as

\( \{ Wgt, Remap_p, |e|, *e, merge, lookup, *vmap, *heap, border \} \) where

- \( Wgt \) is the computational cost to process \( v \).
- \( Remap_p \) is the redistribution cost to copy the data set associated with \( v \) to another processor from \( p \).
- \( |e| \) is the number of adjacent edges associated with \( v \).
\*e is a pointer to the first edge associated with \( v \). Subsequent edges are stored in contiguous memory locations.

\textit{merge} is the vertex that was merged with \( v \) during a contraction operation or \(-1\) if no merge took place.

\textit{lookup} is the active vertex that contains \( v \) after a series of contraction operations or \(-1\) if no merges took place.

\*vmap is a pointer to the position of \( v \) in the active vertex table.

\*heap is the pointer to an entry in the heap that relates to vertex, \( v \). This entry represents a potential reassignment of \( v \). This pointer is used to be able to remove heap entries without searching.

\textit{border} is a boolean flag indicating whether \( v \) is adjacent to vertices assigned to other processors.

\textbf{EList}: The list of edges in the mesh. Each edge record is defined as \{\( v, \text{Comm} \)\} where \( v \) is the adjacent vertex and \text{Comm} is the communication weight associated with this edge.

\textbf{Heap}: The heap of potential vertex reassignments. Each heap record is defined as \{\text{Gain, AMinVar, } v, p\} which specifies the \text{Gain} and \text{AMinVar} that would result from reassigning vertex \( v \) to processor \( p \). The min-heap is keyed by the \text{Gain} value.

\textbf{Stack}: The stack of compressed vertex pairs, \{\text{vertex1, vertex2}\}. These vertices are refined in reverse order from the order that they were compressed.

\subsection*{3.5 Graph Contraction}

The partitioner selects sets of randomly chosen pairs of vertices that are assigned to the same processor. From this set, the vertex pair, \((v, w)\), to be merged has the largest \( \text{Comm}_w/(\text{Remap}_v + \text{Remap}_w) \) value. This formula attempts to find edges with large edge communication costs while minimizing the potential cost of data set redistribution. The motivation behind this strategy is to arrive at a contracted mesh with a small edge cut and with small costs of data distribution.

To contract a vertex, a merged vertex record is created such that the merged vertex, \( M \), is adjacent to all vertices other than \( v \) and \( w \), that were originally adjacent to either of the two original vertices. The edge records corresponding to \( M \) are created accordingly. \text{VMap} is adjusted to contain the newly created vertex and to remove \( v \) and \( w \); \( |V| \) is decremented and \( vTot \) is incremented; \( |E| \) is increased by the number of edges created for \( M \); and the pair \((v, w)\) is pushed onto \text{Stack}.

\subsection*{3.6 Union/Find Algorithm}

A union/find algorithm is utilized so that edges of existing vertices can remain unchanged. For example, if an existing vertex is adjacent to \( v \), accesses to its \text{EList} record will check whether \( v \) has been merged. If it has, \text{lookup} will be accessed to quickly find the appropriate merged vertex. If \text{lookup} is not current, the union/find algorithm will search the chain of vertices beginning with \text{merge} to update the \text{lookup} value so subsequent lookups can be done efficiently. Pseudo code describing the union/find procedure is shown in Fig. 2.
3.7 Partitioning of the Contracted Graph

Once the graph contraction process is complete, the partitioning can be executed. Because the number of vertices is greatly reduced, the MinEX partitioning algorithm can execute efficiently. The algorithm considers every remaining vertex of the mesh to find potential reassignments that will reduce Gain and MinVar as described in Subsection 3.2. All potential vertex reassignments are added to the min-heap. Actual reassignments are executed in heap order. As a reassignment is executed, the heap is adjusted to reflect the new partition status.

3.8 Refinement

The graph is restored to its original size by expanding pairs of vertices in reverse order from how they were merged. The Stack data structure controls the order. As pairs of vertices, \((v, w)\), are refined, merged edges and vertices are deallocated. merge and lookup vertex numbers are also adjusted in the vertex table. The VMap table is adjusted to delete the merged vertex, \(M\), and to add \(v\) and \(w\). \(|V|\) is incremented and \(v\text{Tot}\) is decremented; \(|E|\) is decreased by the number of edges created for \(M\). After each refinement, an immediate decision is made as to whether a partition improvement can be made by reassigning \(v\) or \(w\). When reassignments are made, reassignments of the adjacent border vertices are also considered.

4 Experimental Study

The partitioner MinEX was executed with actual application data to simulate mesh processing for a variety of system configurations. Individual runs simulates networks with a particular number of processors \((P)\), number of clusters \((C)\), Throttle values, and interconnect speeds \((I)\). In our experiments, \(P\) was varied from 2 to 2048; \(C\) was varied from 1 to 8; Throttle was varied to find the optimal value for minimizing runtime; and \(I\) was varied to simulate high-speed cluster interconnections and low-speed wide area network connections.

Based on performance studies [11, 18], typical communication latency and bandwidth slowdowns from integrated clusters to configurations with clusters connected through a high-speed interconnect are in the range of 3 to 100. Wide area network connections are 1,000 to 10,000 times slower than the internal intra-connects of a single cluster. For these experiments, we have assumed that the intra-cluster communication speed to be normalized to a value of 1. Simulations of inter-cluster communication assumed slowdown factors of 3, 10, 100, and 1,000. To simplify the analysis, we have assumed that individual processors are homogeneous and divided as evenly as possible among the clusters.
4.1 Summary of Results

Table 3(a) and 3(b) show results of experimental runs analyzing the effect of varying numbers of clusters and interconnect speeds, assuming \( P = 32 \) homogeneous processors. The interconnect speeds indicate the slowdown factor relative to the intra-cluster communication speed. To be consistent with results presented in Tables 1 and 2, runtimes are shown in thousands. Table 3(a) charts the experimental results when no latency tolerance is achieved; Table 3(b) assumes maximum latency tolerance.

<table>
<thead>
<tr>
<th>Clusters</th>
<th>Interconnect Speeds</th>
<th>Clusters</th>
<th>Interconnect Speeds</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>373</td>
<td>1</td>
<td>306</td>
</tr>
<tr>
<td>2</td>
<td>763</td>
<td>2</td>
<td>305</td>
</tr>
<tr>
<td>3</td>
<td>952</td>
<td>3</td>
<td>324</td>
</tr>
<tr>
<td>4</td>
<td>989</td>
<td>4</td>
<td>328</td>
</tr>
<tr>
<td>5</td>
<td>1021</td>
<td>5</td>
<td>359</td>
</tr>
<tr>
<td>6</td>
<td>1091</td>
<td>6</td>
<td>357</td>
</tr>
<tr>
<td>7</td>
<td>989</td>
<td>7</td>
<td>371</td>
</tr>
<tr>
<td>8</td>
<td>968</td>
<td>8</td>
<td>376</td>
</tr>
</tbody>
</table>

(a) No latency tolerance  
(b) Maximum latency tolerance

The following conclusions can be drawn from the experiments.

• As the interconnect speed slows, the slowdown experienced by utilizing additional clusters increases dramatically. For example, the runtime metric in Table 3(a) is 4,781 when two clusters and an interconnect slowdown of 1000 is assumed. However, the runtime metric is 93,566 when eight clusters are assumed. The ratio, 93,566/4,781 \( \approx 19.57 \). If we consider the interconnect slowdown of 3, the ratio between two clusters and eight clusters is 968/763 \( \approx 1.26 \) which is a much smaller value. The same pattern holds true in Table 3(b).

• For mesh application considered, Globus over low-speed networks such as the Internet is not a viable approach assuming current technology. In fact, the interconnection speed has to improve by at least one or two orders of magnitude before this approach could be useful. Under current technology, applications would have to have minimum communication and data-set re-mapping for low-speed wide area networks to be practical interconnects.

• Latency tolerant algorithms show larger runtime gains when more clusters are utilized. This can be verified by comparing the same rows from Table 3(a) and Table 3(b). The rows that correspond to more clusters show greater latency tolerance runtime gain. The same cannot be said when analyzing columns of the tables where interconnect slowdowns are varied. Latency tolerance runtime gains remain relatively constant in this case. We can also conclude that regardless of whether one or eight clusters of processors are employed, latency tolerance algorithms will always be beneficial to reducing expected application runtimes.

• For our application, Globus could be a viable approach if a high-speed interconnect (slowdown factor between 3 and 10) between clusters is utilized. The first column of Tables 3(a) and 3(b) comparing 1 and 8 clusters with an interconnect slowdown factor of 3, respectively,
show a slowdown factor of 2.04 and 1.22. Similarly, the second column of the tables with an interconnect slowdown factor of 10 show slowdown factors of 4.60 and 3.35, respectively. These factors being smaller than the number of clusters indicate a speedup from when one cluster of \( \frac{1}{3} \) the number of processors are used.

5 Conclusions

This paper presented a latency-tolerant partitioner, called MinEX, that not only balances processor workloads but also minimizes data movement and runtime communication, for adaptive mesh applications that are executed in a parallel distributed fashion on the IPG. We also analyzed the conditions that are required for the IPG to be an effective tool for such distributed computations. Our results demonstrate that MinEX is a viable load balancer provided the nodes of the IPG are connected by a high-speed asynchronous interconnection network. An area of further research includes mathematical analysis of latency tolerance or slowdowns based on the interconnect speed, numbers of clusters employed, and the topology of the mesh.

Acknowledgements

This work was supported by NASA Ames Research Center under Cooperative Agreement Number NCC 2-5393.

References


