next up previous
Next: 6. Future Work Up: JTpack90 (SIAM PPSC97 paper) Previous: 4. Parallelization via PGSLIB


5. Results

In this section we present parallel results obtained with TELLURIDE. Although solution of the linear systems accounts for the majority of the time spent in the simulations, all results are for the whole code, not just JTPACK90. JTPACK90 was operated in matrix-free mode using reverse communication. Global reduction and gather/scatter functionality was provided by PGSLIB, and no preconditioning was used.
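To make "matrix-free mode using reverse communication" concrete, here is a minimal sketch in Python rather than Fortran 90. It is not JTPACK90's actual interface: the names `cg_reverse_comm`, `solve`, and `apply_laplacian_1d` are hypothetical, and the 1D operator is only a stand-in for the application's matrix-vector product. The solver never sees a matrix; whenever it needs a product it hands a vector back to the caller and waits for the result.

```python
def dot(a, b):
    # In a parallel run this is where a global reduction would be needed.
    return sum(x * y for x, y in zip(a, b))


def cg_reverse_comm(b, tol=1e-10, max_iter=1000):
    """Unpreconditioned CG as a generator.

    Yields a vector p and expects the caller to send back A*p.
    The solution is returned when the generator finishes.
    """
    x = [0.0] * len(b)
    r = list(b)              # residual for the initial guess x = 0
    p = list(r)
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 <= tol:
            break
        ap = yield p         # reverse communication: caller computes A*p
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def apply_laplacian_1d(v):
    """Matrix-free action of tridiag(-1, 2, -1); a stand-in operator."""
    n = len(v)
    return [2.0 * v[i]
            - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]


def solve(b, matvec):
    """Drive the solver, answering each request with matvec(request)."""
    solver = cg_reverse_comm(b)
    try:
        request = next(solver)
        while True:
            request = solver.send(matvec(request))
    except StopIteration as done:
        return done.value


exact = [1.0, 2.0, 3.0, 4.0, 5.0]
rhs = apply_laplacian_1d(exact)
solution = solve(rhs, apply_laplacian_1d)
print(solution)
```

The design point is that the driver loop belongs to the application, so the operator can be applied in whatever way the application's data layout and parallel decomposition require.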

5.1 Implicit Heat Conduction on a Regularly-Connected Mesh

Consider the following conduction test problem, which has an exact analytic solution [5]. At time zero, heat is introduced into an initially cold brick of material by applying a high temperature to its ymax xz face. Both yz faces and the ymin xz face are held cold at a low applied temperature, while the two xy faces are insulated. The temperature within the brick attains a steady-state distribution, which TELLURIDE computes by marching the unsteady heat conduction equation forward in time until the temperature distribution no longer changes. Steady state was attained in these simulations after five time steps; because the TELLURIDE conduction algorithm is fully implicit, however, a single large time step could achieve the same result. The linear systems of equations are solved by JTPACK90 using CG.
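The remark that one large implicit time step can reach steady state can be illustrated with a 1D analogue. The sketch below is an assumption-laden toy, not TELLURIDE's 3D discretization: it uses backward Euler on a uniform 1D grid with fixed end temperatures, solves the tridiagonal system directly with the Thomas algorithm instead of CG, and the function name and parameter values are hypothetical.

```python
def backward_euler_step(u, dt, alpha, dx, t_left, t_right):
    """One backward Euler step for u_t = alpha * u_xx on interior nodes.

    End temperatures t_left and t_right are held fixed. Each row reads
    -r*u[i-1] + (1 + 2r)*u[i] - r*u[i+1] = u_old[i], with r = alpha*dt/dx**2.
    """
    n = len(u)
    r = alpha * dt / dx ** 2
    diag = [1.0 + 2.0 * r] * n
    rhs = list(u)
    rhs[0] += r * t_left       # boundary values move to the right-hand side
    rhs[-1] += r * t_right
    # Thomas algorithm: forward elimination (both off-diagonals equal -r)
    for i in range(1, n):
        m = -r / diag[i - 1]
        diag[i] -= m * (-r)
        rhs[i] -= m * rhs[i - 1]
    # Back substitution
    new = [0.0] * n
    new[-1] = rhs[-1] / diag[-1]
    for i in range(n - 2, -1, -1):
        new[i] = (rhs[i] + r * new[i + 1]) / diag[i]
    return new


n = 9
dx = 1.0 / (n + 1)
cold = [0.0] * n
# Steady state of 1D conduction between fixed temperatures is a linear profile.
steady = [100.0 * (i + 1) * dx for i in range(n)]

one_large_step = backward_euler_step(cold, dt=1.0e9, alpha=1.0, dx=dx,
                                     t_left=0.0, t_right=100.0)
one_small_step = backward_euler_step(cold, dt=1.0e-4, alpha=1.0, dx=dx,
                                     t_left=0.0, t_right=100.0)

print(max(abs(a - b) for a, b in zip(one_large_step, steady)))
print(max(abs(a - b) for a, b in zip(one_small_step, steady)))
```

As the time step grows, the old-temperature term becomes negligible next to the conduction term, so the implicit system approaches the steady-state equations; an explicit scheme would be unstable at such a step size.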

First consider a parallel simulation of this problem on a multi-processor shared-memory Digital AlphaServer 8400. Here we partition the brick with a 16 x 16 x 192 mesh that is block decomposed evenly along the z axis, i.e., each processor receives a 16 x 16 x Nz mesh, where Nz is that processor's share of the 192 cells in the z direction. Table 1 displays the excellent parallel efficiencies realized for this problem on this architecture. The superlinear speedups achieved are likely due to cache effects (with fewer processors, each processor's larger share of the mesh makes poorer use of cache).
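The even block decomposition along z described above can be sketched as follows. The helper name `z_slabs` is hypothetical and the code only illustrates the partitioning arithmetic, not how TELLURIDE distributes its mesh.

```python
def z_slabs(nz, nprocs):
    """Split nz cell layers into nprocs contiguous slabs.

    Returns one (start, stop) pair per processor, with stop exclusive.
    Any remainder is spread over the lowest-numbered processors.
    """
    base, extra = divmod(nz, nprocs)
    slabs = []
    start = 0
    for rank in range(nprocs):
        size = base + (1 if rank < extra else 0)
        slabs.append((start, start + size))
        start += size
    return slabs


for nprocs in (1, 2, 3, 4, 6, 8):
    slabs = z_slabs(192, nprocs)
    layers = slabs[0][1] - slabs[0][0]
    print(nprocs, "processors:", layers, "layers,", 16 * 16 * layers, "cells each")
```

Every processor count used for this mesh (1, 2, 3, 4, 6, and 8) divides 192 exactly, so each processor holds the same number of cells and the load is balanced.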


Table 1: Implicit heat flow on 16 x 16 x 192 mesh (300 MHz Digital AlphaServer 8400).

Processors   CPU Time (μs/cell/cycle)   Efficiency
1            583                        1.00
2            258                        1.13
3            162                        1.20
4            129                        1.13
6            93                         1.04
8            69                         1.06
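The efficiency column is not defined in the surviving text. Assuming the usual definition, efficiency = T1 / (p * Tp) with T1 the one-processor time and Tp the time on p processors, the timings tabulated above reproduce the listed efficiencies. The short check below is a sketch under that assumption; the function name is hypothetical.

```python
def parallel_efficiency(t1, tp, nprocs):
    """Efficiency under the assumed definition T1 / (p * Tp)."""
    return t1 / (nprocs * tp)


# CPU times in microseconds per cell per cycle, copied from the table above.
timings = {1: 583, 2: 258, 3: 162, 4: 129, 6: 93, 8: 69}

efficiencies = {
    p: round(parallel_efficiency(timings[1], t, p), 2)
    for p, t in timings.items()
}
for p, eff in efficiencies.items():
    print(p, eff)
```

Values above 1.00 correspond to the superlinear speedups discussed in the text.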

Now consider the parallel simulation on a multi-processor distributed-memory IBM SP2. Here we partition the brick with a 16 x 16 x 320 mesh, and again block decompose the mesh evenly along the z axis. The parallel efficiencies, as shown in Table 2, are still quite high, exceeding 85% for all processor counts tested. This performance is surprising given that the mesh is treated as fully unstructured (despite its regular connectivity), which necessitates the use of many parallel gather/scatter functions from PGSLIB [2].


Table 2: Implicit heat flow on 16 x 16 x 320 mesh (67 MHz IBM SP2).

Processors   CPU Time (μs/cell/cycle)   Efficiency
1            1113                       1.00
2            635                        0.88
10           124                        0.90
20           65                         0.86
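To show why treating a mesh as fully unstructured leads to gather/scatter operations, the serial sketch below stores connectivity as an explicit face list and reaches cell data only through index arrays. This is an illustration of the access pattern, not PGSLIB's API; the names `gather` and `scatter_add` and the four-cell mesh are hypothetical. In a distributed-memory run, some of the indexed cells would live on other processors, which is where communication enters.

```python
def gather(values, index):
    """Collect values[i] for each i in index (indirect read)."""
    return [values[i] for i in index]


def scatter_add(n, index, contributions):
    """Accumulate contributions into an n-entry array at the given indices."""
    out = [0.0] * n
    for i, c in zip(index, contributions):
        out[i] += c
    return out


# A chain of four cells described the unstructured way: one (left, right)
# cell pair per face, with no assumption about ordering or regularity.
faces = [(0, 1), (1, 2), (2, 3)]
temperature = [10.0, 6.0, 3.0, 1.0]

left = [f[0] for f in faces]
right = [f[1] for f in faces]

# Gather cell temperatures to faces, then form a flux across each face.
flux = [a - b for a, b in zip(gather(temperature, left),
                              gather(temperature, right))]

# Scatter face fluxes back to cells: the left cell loses what the right gains.
net = scatter_add(len(temperature), left + right,
                  [-f for f in flux] + flux)
print(net)
```

On a regular mesh the neighbors could be found by index arithmetic; the unstructured treatment pays for its generality with these indirect accesses on every sweep.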

Though these results are encouraging, we emphasize that they come from a rather idealized problem in which the decomposition is optimal. A somewhat more realistic result is given in the next section.

5.2 Solidification on an Unstructured Hex Mesh

Figure 5.2 shows a 6480-element unstructured hex mesh for a part, referred to here as the chalice, cast for the LANL inertial confinement fusion program. The chalice consists of a hemispherical shell two inches in diameter. The shell is gated at its pole with a cylindrical "hot top" one inch in diameter and about 1.5 inches tall. The hot top continuously supplies liquid metal to the hemispherical shell during filling and solidification (to avoid shrinkage defects). After solidification, the hot top is cut away and the part is machined to give the final product (the hemispherical shell). Here the mesh has been decomposed by CHACO [3] for eight processors.

Figure: The chalice mesh, decomposed for 8 processors by CHACO [3].
[Image: chalice_8pe.ps]

Although we have also simulated the filling of this mold, we present only solidification results here. For this simulation, the mold cavity is assumed to be initially full of quiescent liquid copper at 1270 °C. Because only one 90° quadrant is simulated, elements along the two vertical symmetry planes are assumed insulated. The top horizontal plane of the hot top is assumed insulated because of its proximity (1 inch) to the (hot) crucible. For the inner hemispherical surface (adjacent to the graphite mold), a convective heat transfer boundary condition is applied with a heat transfer coefficient of 25 W/(m² K). For the outer surfaces, a coefficient of 15 W/(m² K) is used, which corresponds to experimental values in stationary air.

Table 3 shows results of an implicit heat flow calculation with solidification on a finer mesh than that shown in Figure 5.2, consisting of 46,386 unstructured hex elements, again on a Digital AlphaServer 8400. The parallel efficiencies are once more excellent, which is quite encouraging since this is a more realistic example of the types of parallel casting simulations TELLURIDE must perform.


Table 3: Implicit heat flow with solidification on 46,386-cell chalice mesh (300 MHz Digital AlphaServer 8400).

Processors   CPU Time (μs/cell/cycle)   Efficiency
1            5013                       1.00
2            2169                       1.15
4            1237                       1.03
8            721                        0.87


John A. Turner