Steve M. Slater Jasmina L. Vujic University of California University of California Department of Nuclear Engineering Department of Nuclear Engineering Berkeley, CA 94720 Berkeley, CA 94720 (510) 643-9281 (510) 643-8085 slater@nuc.berkeley.edu vujic@nuc.berkeley.edu
In nuclear reactor analysis, the integral transport equation is commonly used as a starting point for describing the exact conditions inside the core. One of the most commonly used techniques for solutions to the transport equation is that of collision probabilities. However, computer codes based on the exact collision probability formalism have historically been limited to a single assembly analysis. A new code has recently been developed at the University of California to overcome this barrier. The MAGGENTA code, based on the existing GTRAN2 code, utilizes the p4 Parallel Programming System and networks of workstations or other supercomputers to solve large multi-assembly problems. Each processor is assigned a unique geometrical region and are coupled through neutron currents at the interfaces. A graphical interface has been developed to simplify the assembly layout construction and processor assignments.
The methodology has been shown to be very accurate. The effective multiplication factors calculated for the same fuel assembly using one, two, and four processors are in agreement. In addition, many large multi-assembly problems have been solved on a variety of computer architectures. Less than 10 minutes of real wallclock time was needed to analyze a case of 16 fuel assemblies on the SP-1. In each assembly, the exact 2D geometrical configuration was preserved, using 1000 meshes to define flat-flux zones. This complex transport problem was solved in 12 energy groups.
One of the most common techniques for solving the integral transport equation is the method of collision/transfer probabilities (CTP). The CTP method is known to give very accurate solutions for parameters, such as flux and the criticality constant "k". No pin cell homogenizations are needed. It faces the serious drawback, however, of being limited by both computer memory and processing power. Typical CTP computer codes analyze only one, or at most a few, fuel assemblies in a given problem. To obtain solutions to problems consisting of many fuel assemblies, one must use more approximate methods, such as nodal diffusion theory, thus sacrificing accuracy for speed. It is desirable, therefore, to be able to use the CTP method to solve large problems containing tens of fuel assemblies, even up to 1/4 of the entire core.
Recently, a method has been developed at the University of California. The existing GTRAN2 computer code1 has been modified to obtain solutions for problems containing an arbitrary number of assemblies. Depending upon the computing resources available, it is possible to analyze problems using 30 fuel assemblies in less than one hour of wallclock time. Tests have been performed to verify the accuracy of the solution algorithm, as well as to study the code's performance on large multi-assembly problems.
GTRAN2, based on the rigorous collision/transfer probability method, is an advanced lattice physics code that solves the multigroup steady-state integral neutron transport equation in arbitrary two-dimensional geometries. Any current or future fuel assembly designs can be modelled by GTRAN2 without any changes to the computational methodology. Since combinatorial geometry describes exact assembly configurations, GTRAN2 allows a zone to be subdivided in any number of ways. For example, burnable poison pins can be modelled as cylinders with additional cylindrical cladding. They can also be subdivided arbitrarily into concentric rings and/or azimuthal zones. Thus, the code combines the geometric flexibility of the Monte Carlo methodology, with the computational efficiency of the deterministic codes. Given transport-corrected macroscopic cross sections, it solves an eigenvalue problem and gives the volumetric flux distribution, incoming/outgoing current distributions, reaction rates, and few group constants for nodal codes.
GTRAN2 consists of three functional modules: TPGEOM, CTP, and FLUJ. TPGEOM is a modified combinatorial geometry processor that is used to perform ray tracing in arbitrary multidimensional configurations. The use of combinatorial geometry permits a detailed description of complex and irregular assembly geometries in one, two, and three dimensions. Ray tracing accuracy is easily improved by increasing the number of pseudo neutron track angles and by decreasing the distance between parallel pseudo neutron tracks. Geometrical preprocessing is only done ONCE, and the calculated geometric data are used repeatedly for all energies and burnup steps.
The CTP module performs an exact numerical integration in order to obtain collision, escape, and transmission probabilities for an assembly (or any region of interest). The ray tracing data from TPGEOM are used in CTP to calculate the optical lengths between various surfaces that tracks intercept. The optical lengths are then used in numerical integration of collision, escape, and transmission probabilities.
FLUJ is the solver for coupled systems of eigenvalue equations. The eigenvalue equation derivations assume isotropic scattering (basic version of GTRAN2) or linearly anisotropic scattering. The double P0 expansion of the angular flux on the outer assembly boundary is also assumed and the final equations involve the volumetric flux and incoming/outgoing current distributions. This system of equations is solved by an iterative method.
A preprocessing code, GIN2, starts from relatively simple engineering-type input and generates combinatorial geometry input for standard PWRs, BWRs and MHTGRs. Vector and parallel processing reduce CPU and wallclock times.
In short, parallel programming is the process of running an application on more than one processor. In practice, however, parallel programming involves many different approaches. A wide variety of parallel computers exist, each with their own advantages and disadvantages. Some computers are designed as "supercomputers" with many processors connected with "wires" or some type of circuit board. Other computers, such as networks of Unix-based workstations, can be linked to combine their computing resources to solve a single problem.
To use multiple processors, the code must usually include statements to tell the computer when to send and/or receive messages. Some compilers attempt to do it automatically, but the more common approach is for the programmer to include explicit communication commands within the code. In addition, each supercomputer will have its own set of "commands" for communication, unique to that machine or family of machines. For example, the command to send a message from processor 1 to processor 2 on an IBM SP1 computer will be different than the equivalent command on an Intel "Delta" machine. Furthermore, the necessary steps for linking workstations involves machine-specific programming beyond the knowledge of most engineers.
As a result, many special computer languages and extensions to existing languages have emerged to create a common set of commands that, when incorporated into a Fortran or C program, will achieve the same result on all types of computers. The aforementioned message passing example could be achieved with the same command on any computer. Two of the most popular message passing systems include p43 and PVM34. Since previous work performed on GTRAN2 used p45, we again decided to continue with p4 over PVM3. The discussions about p4, however, generally apply to PVM3 as well.
P4 consists of a set of commands that can be called from within any Fortran or C program. It has functions which send or receive messages to or from one or all processors. P4 also has a series of global operations that acts on every processor simultaneously. These include: the barrier (which causes each processor to wait until all processors have called the barrier), maximum and minimum data comparisons, and commutative operations (including addition and multiplication).
These and other message passing systems have some clear advantages and disadvantages. Of primary interest to private and public sector engineers is that existing workstations can be used in parallel to solve a large problem. Some problems cannot only be solved in less time on many processors than one, but also more accurately. By using the joint memory resources, one may analyze a more detailed case. One disadvantage is that someone must either write a new code or modify an existing code to use the message passing routines. Care must be taken to minimize the number of messages being sent and prevent "race conditions". This is a situation where one or more processors executes the code faster than the others. If the code is not synchronized, some processors may be sending data other processors are not yet ready to receive. This can result in a loss of data and invalidate the results.
The primary objective of this study was to solve the integral transport equation for a case involving a large number of fuel assemblies. To accomplish this goal, GTRAN2 was modified to use the p4 message passing routines and three test cases were created to study the code. Since p4 is available on most computer architectures, the resulting code is highly portable and has been run without modification on Sun and IBM workstations, a Thinking Machines Connection Machine 5 (CM-5), and an IBM SP-1.
In order to analyze problems with multiple fuel assemblies, we assigned a distinct geometrical region to each processor. Other possibilities for parallelization include dividing the problem by either angle or energy group. In these cases, each processor would generate matrix data for a fraction of the total number of angles or energy groups. Previous work, however, has shown these methods to be limited in both their computational efficiency and their applicability to large problems5.
The original GTRAN2 code assumed isotropically reflective boundary conditions for incoming/outgoing currents; the incoming current was set to the outgoing current at the same location. In MAGGENTA, the boundary conditions are expanded to allow for coupling between assemblies (and processors). When running multiple assembly (processor) cases, the code sets the incoming current of each boundary mesh equal to the outgoing current from its neighboring assembly (processor). On those faces with no adjacent assembly (processors), the user can specify either reflective or vacuum boundary conditions.
With the exception of the boundary conditions described in the previous paragraph, the solution algorithm in MAGGENTA remains nearly identical to that of GTRAN2. Given an initial guess for incoming currents, the CTP matrix is solved on each processor and a new eigenvalue is computed. New fission sources and outgoing currents are produced from the new "k". After each iteration, MAGGENTA communicates the outgoing currents between the processors which become the incoming currents on the appropriate boundary segment of each neighboring fuel assembly. The process repeats until convergence is reached.
The interface currents are the primary coupling mechanism between the nodes and can be thought of as a vehicle for redistributing the neutron flux between assemblies. It was quickly discovered that on large cases convergence was quite slow. This can be visualized by picturing the multi-assembly problem beginning with flat flux initial guess. As each outer iteration repeats, the neutrons are redistributed through the interface currents. However, if the test case is many assemblies wide, it can take time for the proper balance to be reached. Consequently, a global rebalance method was added to MAGGENTA. A system of equations which equates the leakage from each assembly to the sources is described in more detail in other works6. The resulting matrix is solved for a "rebalance factor" for each assembly by which the fluxes and currents are scaled. This rebalance factor is needed not because of the parallel processing but rather because of the current coupling itself. Even if the same algorithm was used on a single processor, the need for convergence acceleration would remain just as great.
A new graphical interface allows the user to specify the assembly input files that belong on each processor and how the assemblies are arranged. For example, when analyzing 16 assemblies, they could be in a 4x4 grid, or a 2x8 grid, or any arbitrary arrangement. The new interface leads to a more accurate modelling of areas such as near the core outer edge. This flexibility, coupled with the arbitrary boundary conditions, allow for the possibility of analyzing a complete 1/4 core case, with reflective boundaries on the 2 internal faces and leakage along the outside of the core.
Three test cases study different aspects of MAGGENTA. The first involves a single fuel assembly based on a typical boiling water reactor (BWR). This simplified assembly consists of 192 flat flux zones with 64 fuel pins. The total assembly width is 13.04 cm, the fuel pin diameter is 1.06 cm with an 8 mm cladding, and the rod pitch is 1.63 cm. This test assembly was solved for k-infinity, zone-averaged fluxes, and reaction rates using GTRAN2 and a 20 group cross section library. To verify the accuracy of MAGGENTA, the assembly split in half and run on two processors, then split into quarters and solved with each quarter on one of four processors.
The second test case is also a BWR fuel assembly but this time subdivided into 324 zones. Using 20 energy group cross sections, each processor was assigned a full assembly. Execution times were obtained for cases involving between 1 and 16 processors. An effort was made to keep the layout of fuel assemblies as close to a square as possible. For example, when analyzing 16 assemblies they were assumed to be in a 4 row and 4 column array.
Finally, since the IBM SP-1 at Argonne National Laboratory is faster and more powerful than other computers, this same BWR assembly was subdivided into 996 zones and results were obtained using a 12 group cross section library. On the SP-1, the largest test case included 30 assemblies. A corresponding picture can be seen in Figure 1.
Figure 1 - Zone Definitions for 30 Assembly Test Case
In all cases, reflective boundary conditions were assumed on the outer faces. In addition, the results were obtained using 32 angles and a 0.1 cm mesh spacing in the ray tracing.
Throughout the discussion of the results, the term "speedup" is frequently referred to as a measure of performance. The definition of speedup is the time it takes to solve a problem on one processor, divided by the time it takes on N processors. Thus the best speedup one can expect is equal to the number of processors used, also referred to as "linear" speedup. In practice, communication will slow the program and prevent linear speedup from being reached.
When obtaining timing results for one processor only, the code should be optimized for a single processor. MAGGENTA has additional overhead associated with the p4 statements that are not needed. However, for the test case presented here, both CPU and wallclock times for a single processor are the same as those from GTRAN2, with no p4 calls.
While we present some speedup results, this paper is not intended to demonstrate the speedup obtained by using parallel computing. Only one example with just four processors is shown just to give a preliminary indication of code performance. With more processors and larger problems communications will take up more time and the speedup should drop.
The test cases were run on several types of computers including a network of 4 Sun Sparc 2 computers, a network of 6 Sun Sparc 20 computers, a Connection Machine 5 at the University of California, Berkeley, and the IBM SP-1 at Argonne National Laboratory. In general, MAGGENTA showed nearly linear speedup on up to four processors, and successfully analyzed problems with as many as 30 fuel assemblies. Since times must be reported in wallclock time, not CPU time, exclusive access to each machine or processing node was obtained. (CPU time ignores idle time associated with message passing.)
Some limitations were noticed related to the convergence of the iterative solver. For cases using 16 or fewer processors, the solver converged in less than 22 outer (fission) iterations. When the problem size increased to 30 assemblies, it took a full 69 iterations for the 996 zone SP-1 case. Before larger problems can be analyzed quickly, new methods of neutron balancing will be needed.
In the following analyses, some interesting results were noticed on each type of computer. First, the code was noticeably slower when using a non-square assembly layout. For two and eight assemblies, a square array is not possible. The two assemblies were stacked vertically, and the eight were in arranged in two rows of three assemblies, and one row of two assemblies.
In addition, many of the speedup results exhibited superlinear speedup. In general this is not possible. However, due to the nature of the CTP methodology in MAGGENTA it can be explained. The CTP matrices used in MAGGENTA are of order
where
is the number of zones in the assembly. As a single assembly is split into four quarters, the number of zones decreases by a factor of 4. So the matrix size is no longer
but rather
. Thus, when the number of zones decreases by a factor of 4, we see a factor of 16 decrease in problem size. Since the execution times in the CTP matrix generation, the TPGEOM ray tracing, and the FLUJ solver modules are related to problem size, we can expect to see superlinear speedup. In practice, these trends are offset by communication time in the solver routine.
The simplified 192 zone BWR assembly was examined on every computer architecture used in this study and showed very nearly linear speedup on each. Table 1 contains the wallclock times and a comparison of the k infinity values obtained for one, two, and four processors. In each case, the execution time improved by more than a factor of two when switching from one to two processors. As two additional processors were added, times increased only slightly or even decreased on the Sparc 20 workstations.
Table 1 - k-infinity and Wallclock Times (sec) for Case 1
---------------------------------------------------------------------------------------------
Number of k-inf Sparc 2 Sparc 20 CM-5 SP-1
Processors
---------------------------------------------------------------------------------------------
1 1.22186 307.1 111.1 935.5 82.4
2 1.22210 128.3 52.4 n/a 37.2
4 1.22206 72.3 58.2 138.7 22.9
NOTE: Results from the CM-5 were not complete due to hardware problems with the computer
---------------------------------------------------------------------------------------------
Another way of looking at the same data is presented in Figure 2. This graph plots speedup as a function on the number of processors on the SP-1. The dashed line represents linear speedup, the "best" possible results, and the heavy line is the speedup for the total wallclock times. Speedup for each of the three MAGGENTA modules are also plotted independently. It is interesting to note that the first two routines, ray tracing and matrix generation, have superlinear speedup on both two and four processors, confirming an earlier discussion of superlinear speedup. The final routine, the matrix solver, showed no improvement even though the problem size decreased on each node. Obviously, the gain in performance from a smaller problem was offset by the added communication overhead in the iteration procedure.
Figure 2 - Speedup for Case 1 on IBM SP-1. A single 192 Zone Fuel Assembly is Divided Between 1, 2, and 4 Processors
The difference in accuracy between one and four processors is only 0.02%. We conclude that the methodology will produce valid results on at least four coupled assemblies or geometrical regions. For cases containing more than four assemblies, perhaps as many as 30 or more, further testing is needed to validate MAGGENTA. Since we know of no other computer program which can analyze such a large region while preserving exact geometry, verification becomes a challenge. It may be solved by performing very long Monte Carlo calculations or by finding actual power measurements taken inside a real reactor core.
With the accuracy of the methodology demonstrated, the next step was to run large multi-assembly cases. Results were obtained from a number of assemblies ranging from one to sixteen. Figure 3 and Figure 4 are plots of wallclock time against the number of processors (equal to number of assemblies) for a network of Sparc 20 workstations and the CM-5, respectively. In both cases, the execution times for the ray tracing and matrix generation modules remained constant regardless of the number of assemblies analyzed. This result is expected since there is no communication within these first two routines, and the problem size is constant on each processor. Furthermore, the effects of an irregular array of assemblies can be seen in Figure 4 for 2 and 8 assemblies. We expect a continual increase in execution time as the number of assemblies increases but other geometrical effects can slow convergence of the solver.
Figure 3 - Case 2 Wallclock Times for Sparc 20 Workstations. Each Processor Solved for a Full Fuel Assembly. Hence the 6 Processor Case Also Refers to 6 Fuel Assemblies.
Figure 4 - Case 2 Wallclock Times for Connection Machine 5. One 324 Zone Fuel Assembly per Processor.
The overall execution times, however, are very encouraging. On the Sparc 20 workstations, the execution time increased by a factor of 2 when the problem size grew from one to six assemblies with the total six assembly execution time taking less than 8 minutes. On the CM-5, an older computer with Sparc 1 processors, the overall time increased from 42 minutes to analyze one assembly to less than 58 minutes for 16 assemblies.
Perhaps the most promising results came from the IBM SP-1. A very detailed assembly with 996 flat flux zones was used to run cases ranging from 1 to 30 assemblies. The wallclock times are shown in Figure 5. As is the case when analyzing problems with many zones, the matrix solver dominated the overall execution time.
Figure 5 - Wallclock Times for Case 3 on the SP-1. One 996 Zone Fuel Assembly per Processor
The execution time increased by less than a factor of 2 when the problem size increased from 1 to 16 fuel assemblies. A noticeable increase was seen in the 30 assembly case due to slow convergence. However, the total execution time for 30 assemblies was only 52 minutes, a very reasonable amount of time for a detailed calculation. Future work will attempt to speed convergence and run larger cases.
MAGGENTA has successfully performed reactor calculations with multiple fuel assemblies while preserving the fine geometrical detail on each assembly. A problem consisting of 30 fuel assemblies, 996 flat flux zones on each assembly (See Figure 1), and 12 group cross sections was solved for k infinity as well as flux and reaction rates for every zone.
The solution for k-infinity on a single assembly and one processor was compared to k obtained from analyzing the same assembly, but subdivided in half and quarter and run on 2 and 4 processors, respectively. The numbers corresponded and the largest discrepancy was a 0.02% difference in k. To the best of the our knowledge, no other computer code can perform such detailed calculations on a large number of fuel assemblies this quickly.
Future work on MAGGENTA will include both improvements on neutron rebalancing methods to obtain a faster convergence, while attempting to validate the code for a multi-assembly problem. One of the next major goals is to analyze a complete 1/4 core using the exact fuel assembly geometries, cross sections, and leakage boundary conditions.
Another area of interest is to modify the assembly current coupling equations. It is a well-known assumption that an isotropic current is not valid near boundaries, especially near highly absorbing media, such as control blades in a BWR. The CTP equations will be derived again without assuming isotropic currents and MAGGENTA will be modified accordingly, most likely discretizing the current into a finite number of angular zones.
This material is based upon work supported by a National Science Foundation Graduate Fellowship. Any opinions, findings, conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Results from the Connection Machine 5 were made possible through the National Science Foundation Infrastructure Grant number CDA-8722788.
The authors gratefully acknowledge use of the Argonne High-Performance Computing Research Facility. The HPCRF is funded principally by the U.S. Department of Energy Office of Scientific Computing.

B. Case 2 - 324 Zone Assembly on Each Processor


C. Case 3 - 996 Zone Assembly on Each Processor

VI. CONCLUSIONS AND FUTURE WORK
ACKNOWLEDGEMENTS
REFERENCES