The solution of the Boltzmann transport equation for the neutron distribution in a nuclear reactor remains one of the most computationally-intensive applications in engineering and science. A three-dimensional (3D) transport analysis of the entire reactor core is still beyond the capability of current machines (and algorithms), but with the use of workstations as nodes in a distributed computing environment, we are getting ready to attack this large-scale problem. Although in this paper we present results for two-dimensional (2D) single assembly problems, the methodology can be naturally extended to 3D large-scale applications.
The exact collision probability formalism is used in the GTRAN2 code (Vujic 1993) to solve the multigroup integral transport equation in general 2D geometries. Combinatorial geometry (CG) is used to describe complex and irregular configurations in one, two, or three dimensions. A modified CG processor is used to perform ray tracing needed for numerical calculation of the collision/transfer probability (CTP) matrices. Thus, GTRAN2 is a unique combination of the computational efficiency of the deterministic code and the geometric flexibility of Monte Carlo codes.
To obtain the discretized integral transport equations for two-dimensional geometry, the area A and boundary L are partitioned into Nr subareas (zones) and Nb subboundaries, respectively, such that
and
where g is the energy group index, and the neutron fluxes, sources and partial currents are spatially averaged over the corresponding volume and/or surface elements. In this case,
is the first flight volume-to-volume collision probability that neutrons emitted uniformly and isotropically in
will have their next collision in
, whereas
is the first flight surface-to-volume collision probability that neutrons entering through
uniformly and isotropically will have their next collision in
. Similarly,
is the first flight escape probability that neutrons emitted uniformly and isotropically in
will escape through
without having a collision, and
is the transmission probability that neutrons entering through
uniformly and isotropically will escape through
without having a collision.
The collision, escape, and transmission probabilities are two-fold integrals that are numerically calculated. For example, the volume-to-volume collision probability is given by:
where
is the optical length across
in the direction of neutron motion at a distance
above the
-axis, and
is the optical length between
and
in the direction of neutron motion.
In our approach, the calculation of the elements of CTP matrices is performed in two independent steps:
This approach has several advantages. By decoupling the geometric part from the calculation of the collision/transfer probability matrices, the geometric data, once precalculated, can be used for each energy group and time step.
PARALLEL MODIFICATIONS
The computer code described above was modified to use the p4 parallel programming system developed at Argonne National Laboratory (Butler and Lusk 1992). P4 is a library of macros and subroutines for programming a variety of parallel machines. The routines also enable workstations to communicate with each other and simulate a parallel computer. Because of the wide range of platforms supported by p4, computer programs using the libraries, such as GTRAN2, are highly portable.
For this study, the first two modules of GTRAN2, namely TPGEOM and CTP, were studied and converted to run in parallel. The third module, the matrix solver, has been parallelized in prior work (Wilson 1993). As described above, the ray tracing algorithm repeats itself along a number of different trajectories. This fact suggested the obvious embarrassingly parallel division of work among the processors by assigned a number of angles to each processor. A flow chart of the first two sections is shown in Figure 1.
Ray tracing in GTRAN2 is perfectly suited for parallel processing, i.e., each track is totally independent of all other tracks. In addition, calculation of the CTP matrices is independent for each energy group.
The code begins with some initialization routines then starts the ray tracing module TPGEOM. Since the ray tracing is done independently for every angle, each CPU was assigned an equal fraction of the angles to analyze. In TPGEOM, interception points with zones are calculated and stored, as are partial zone volumes. Since every processor needs the zone volumes before generating the CTP matrices, the partial volumes calculated by each machine were globally broadcasted and summed at the end of TPGEOM. It should be noted that only the volumes were broadcast, not the entire set of interception points (which may have taken longer to broadcast than to compute).
The calculation of the CTP matrix data using this approach was not as clearly parallel as the ray tracing, but the tasks could still be efficiently allocated. Since the ray tracing was split among angles, the matrix generation followed suit. Each CPU used the geometry data it owned and the globally known material data to calculate a portion of the CTP matrix for the first energy group. Once calculated, the matrix information was asynchronously sent to the master processor where contributions from each angle were collected and the entire matrix normalized. Because of the matrix normalization, a global barrier was required in each energy loop. The process was then repeated for all energy groups.
In principle, the entire CTP matrix generation could be done asynchronously without any barriers since the CTP matrix contributions from each angle are commutative and the energy groups are non-coupled. In practice, however, it is not a trivial task and would require major restructuring of the original GTRAN2 computer code. As this work is a first attempt at parallelization of these two modules for distributed computing environment, such an effort has not yet begun and may be addressed in future studies. However, the CTP module was parallelized for the shared memory computers (IBM 3090/600J) and nearly linear speedups were obtained to up to 6 processors (Vujic 1991). In this case, the calculation of the CTP matrices for one energy group was assigned to one parallel task (processor). The ray tracing data were available to all parallel tasks, and load balancing was perfect.
The modified computer code has been run on three separate platforms. The first was a network of Sun workstations containing one Sparc 10 and five Sparc 2 machines. The Sparc 10 had 96 MB of RAM while the Sparc 2s had 32 MB each. Parallelization across workstations was the original motivation for GTRAN2 modifications since the average users of the code may not have access to large supercomputers. More common would be several workstations linked together. With p4, the workstation network can be very heterogenous. P4 messages can be passed across various architectures with the only limitation being memory size limiting the amount of detail and accuracy of the solution.
The code was next ported to the 32 node Connection Machine-5 at the University of California at Berkeley with only minor modifications required in the CPU timing routines. Each processor on the CM-5 is a Sparc 1 chip with 32 MB of memory.
The final computer used was the IBM SP-1 at Argonne National Laboratory (Gropp 1993). The SP-1 computer consists of 128 IBM RS6000 model 370 processors connected via 4 communication paths. While the SP-1 has 128 processors with 128 MB on each node, the machine is often busy and access to the machine was obtained only recently. Consequently, results are presented just for one to six processors.
A test case of an 8x8 boiling water reactor (BWR) fuel assembly was used for the study. Because of the limited memory on the workstations and the CM-5, a relatively small case was used. The assembly was divided in 324 zones (Figure 2) and GTRAN2 solved for the eigenvalue and power profiles using seven energy groups and 30 angles. On the SP-1, with more available memory, the same assembly was analyzed using 612 zones, 12 groups, and 60 angles (Figure 3).
In each case, the modified GTRAN2 program showed close to linear speedup as the number of processors increased from one to four. When more CPUs were added, communication costs began to outweigh the computational improvement. The following results show that by using more than one processor the execution time of GTRAN2 can be greatly reduced. A limit, however, does exist on improvements and adding too many processors may actually be slower than using fewer machines.

Figure 3 - 612 Zone BWR Assembly Test Case
Figure 4 shows the results from the test case run on the Sparc 2 workstations. The data point for one processor was taken from the original, unmodified version of GTRAN2 while the remaining data used the p4 version. The plot shows a large initial improvement in performance followed by a smaller gain for each additional CPU.
An interesting feature is noticed on the workstations as well as with the two supercomputers. The performance improvement from one to two processors is more than a 50% drop in execution time; meaning that superlinear speedup was seen. A single machine finished in 345 seconds while two computers in tandem completed in 160 seconds. At this time, no explanation can be found. The data point at one processor was the same for both the unmodified code as well as the p4 version running on one processor. The figure shows that the improvement came from the TPGEOM routine where the time dropped by a factor of three for only a doubling of computational power.
As more processors are added, the gain in performance begins to dwindle. The highly parallel TPGEOM routine continues to show slight improvements while the CTP routine with its barriers actually takes two seconds longer to run with six processors than with five. The CTP matrix generation is obviously the limiting factor in parallel speedup and should be the next focus for future work.
The same input case from the workstations was run on the Connection Machine using between one and fifteen processors. The results, while faster overall, showed similar trends as evidenced in Figure 5.
Again, a superlinear speedup was seen between one and two processors (257 and 115 seconds respectively). Furthermore, like with the workstations, using more than 5 processors does not improve performance. Due to the CTP routine, the test case actually took longer to solve using 15 processors than using 10.
Of the four communication paths available on the SP-1, only the slowest two, the Ethernet and the "switch" will run GTRAN2 at this time. Results from at least one of the two faster networks are expected soon. Figure 6 shows results from the Ethernet.
Even though the SP-1 analyzed a larger, slightly different input case, the results from both the Ethernet and the switch communication paths were very similar to the workstation and CM-5 results. A superlinear speedup was seen from one to two processors and after four or five processors, very little gain was realized from adding an additional CPU. In fact, the slow ethernet communication highlighted the problem with CTP and the overall execution time showed a large increase from four to six processors
.
The p4 modifications to GTRAN2 gave both positive and interesting results. Most notable among these is the unexplained superlinear speedup observed in every test case on all three architectures. The data point for one processor has been verified with the unmodified and the p4 codes and the data for two processors is readily duplicated.
No positive explanation exists yet, but it is possible it has to do with the great amount of disk I/O in the TPGEOM module. The graphs show that TPGEOM is the routine with such high speedup while CTP give only a marginal improvement. In the ray tracing TPGEOM routine, the interception data is written to a file as it is calculated. With two processors, the data is being written to two files simultaneously. This behavior is beyond the authors' understanding of Unix hardware but can be duplicated on all three architectures.
In every case, the overall execution time was dramatically reduced. The wallclock times were decreased from between 50% to 80% depending upon the machine and test case, with the best times observed when using about five processors.
As the number of processors increased, the code even began to slow down, caused mainly by the CTP routine. One reason is because of the multiple barriers required before collecting data and normalizing the matrices. With some effort, the code could be restructured to allow more asychronous message passing and perhaps improve performance.
Perhaps the major cause for the increase in run time seen in some results is the shear size of the matrices being passed. The CTP matrix is as large as the number of zones squared and in double precision. Since each processor has a fraction of the total angles, not matrix elements, they must all pass the same amount of data to the master regardless of how many processors are present. (The contributions from each angle are summed) Thus, the communication time grows linearly with the number of processors present.
For future development, the method of load division should be re-thought for the CTP routine. Another scheme such as assigning each processor an energy rather than an angle may solve the communication bottleneck problem and allow the code to continue speeding up with increased number of processors.
Another extension of this work will be a multiple assembly calculation wherein each processor will control an entire assembly. In such a case, both the ray tracing and the CTP matrix generation are completely independent between assemblies and would make the code highly parallel. The only coupling comes about in the solver routine between eigenvalue iterations.
The GTRAN2 general geometry lattice physics code has been converted to run in parallel by using the p4 library routines. The first two modules, the ray tracing and the CTP matrix generation modules, have been modified to divide the workload among several processors, each machine being responsible for a fraction of the total number of angles used in the code.
Results of test cases running on three different platforms all showed nearly linear speedup between one and four processors. As more processors were added, the communication cost of the CTP matrix generation routine became too great and limited further speedup.
The total execution time of the first two modules of GTRAN2 has been reduced by a factor of four while using four processors on the CM-5 and by almost as much on the Sparc 2 workstations and the SP-1 supercomputer. Future effort modifying the CTP matrix generation algorithm to avoid heavy communication will probably improve performance even more.
Part of this work was supported by Argonne National Laboratory, Contract No. 31392401.
Results from the Connection Machine 5 were made possible through the National Science Foundation Infrastructure Grant number CDA-8722788.
Butler, R. and E. Lusk. 1992. "User's Guide to the p4 Programming System." Argonne National Laboratory, Argonne, IL., ANL-92/17.
Gropp, W., E. Lusk, and S. Pieper. 1993 "Users Guide for the ANL IBM SP-1 DRAFT." Argonne National Laboratory, Argonne, IL., ANL/MCS-TM-00.
Vujic, J.L. and W.R. Martin 1991. "Vectorization and Parallelization of a Production Reactor Assembly Code." Progress in Nuclear Energy, Vol. 26, No. 2: 147-162.
Vujic, J.L. 1993. "GTRAN2: An Advanced Transport Theory Code for Advanced Assembly Calculations." In Proceedings of Joint International Conference on Mathematical Methods and Supercomputing in Nuclear Applications (Karlsruhe, Germany, April 19-23). Kernforschungszentrum, Karlsruhe, Germany, Vol.1, 695-706.
Wilson, W.J., J.L. Vujic and A.G. Gu. 1993. "Parallel Multiple-Assembly Calculations in GTRAN2/M." In Transactions of the 1993 American Nuclear Society Winter Meeting (San Francisco, CA, Nov. 14-18). American Nuclear Society, La Grange Park, IL., 204-205.