AHA-GRAPE (Adaptive Hydrodynamic Architecture)

Andreas Kugel, University of Mannheim

Several generations of ASIC-based accelerators, the GRAPE series, have been used successfully to accelerate the gravitational force calculation in stellar dynamics applications. However, these GRAPEs lack the flexibility to be adapted to some variants and optimizations of the algorithms, notably the SPH-based (smoothed particle hydrodynamics) approach.

FPGAs (Field Programmable Gate Arrays) typically yield less performance than ASICs but provide a very high degree of flexibility. The family of FPGA devices was introduced in 1984 by Xilinx. FPGAs feature a large number of relatively simple elements with configurable interconnects, and they can be reconfigured an unlimited number of times with short configuration times (5 to 100 ms). All configuration information is stored in SRAM cells. The basic processing element (PE) of all current mainstream FPGAs is a 4-input, 1-output look-up table (LUT) with an optional output register. The functionality of the FPGA is thus determined by the contents of the look-up tables within the PEs and the "wiring" between these elements.
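
The role of the LUT contents can be illustrated with a small software model. The following Python sketch is only an illustration, not vendor tooling: the function names make_lut4 and eval_lut4 are hypothetical, and the 16-bit integer used as "configuration" merely stands in for the SRAM cells of one PE. It shows how the same element becomes a parity gate or an AND gate depending solely on its stored table.

```python
def make_lut4(func):
    """Return the 16-entry truth table of a 4-input boolean function as an int.

    Bit i of the result is the output for the input pattern whose
    binary encoding is i (input a is the least significant bit).
    """
    table = 0
    for idx in range(16):
        a, b, c, d = ((idx >> k) & 1 for k in range(4))
        if func(a, b, c, d):
            table |= 1 << idx
    return table


def eval_lut4(table, a, b, c, d):
    """Look up the output bit for inputs a..d (each 0 or 1)."""
    idx = a | (b << 1) | (c << 2) | (d << 3)
    return (table >> idx) & 1


# Two different "configurations" of the same hypothetical element.
xor_table = make_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)  # 4-input parity
and_table = make_lut4(lambda a, b, c, d: a & b & c & d)  # 4-input AND

print(f"parity LUT contents: 0x{xor_table:04X}")
print(f"AND LUT contents:    0x{and_table:04X}")
```

Larger functions would be built by feeding the outputs of several such tables into further tables, which is the part played by the configurable interconnect.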

An FPGA processor denotes a system comprising a number of FPGA chips plus a memory system and I/O interfaces. Our institute at the University of Mannheim has more than 5 years of experience with FPGA processors. Both small-scale (microEnable) and large-scale (Enable, Enable++, Atlantis) systems have been developed and used, mainly for high-speed pattern recognition and other image processing applications.

In conventional computers an algorithm is executed by a CPU that processes a sequence of standard instructions. In contrast, an FPGA processor allows an algorithm to be implemented directly in hardware. In one case the processor may look like a SIMD machine: if designed appropriately, input data can be shifted through the processor and processed on the fly. In another case it may look like a large pipeline of arithmetic units executing the inner loop of a number-crunching algorithm. The hardware of the FPGA processor, i.e. the chips on the board, is thus the same for all applications, yet the degree of flexibility is similar to that of a standard CPU, because the configuration data set is likewise compiled from a program description.

With AHA-GRAPE we intend to use an FPGA processor as a third layer, in addition to host and GRAPEs, to accelerate the SPH part of the computation. The basic equation of the algorithm consists of three terms of order N, N*Nn and N^2, respectively, where N is in the range of 10^4 to 10^7 and Nn ~ 50. This equation maps almost optimally onto a host workstation (N dependency), an FPGA processor (N*Nn dependency) and a GRAPE subsystem (N^2 dependency). With the addition of the third layer, the FPGA processor, we expect an increase in performance by a factor of 10.
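
To give a feeling for the relative sizes of the three terms, the following Python sketch tabulates N, N*Nn and N^2 at the two ends of the quoted range. It is a rough illustration only: the function name term_counts is hypothetical, and the numbers are bare term counts that ignore the (unstated) number of floating-point operations per term, so they say nothing about actual run times.

```python
NN = 50  # Nn from the text (approximate value)


def term_counts(n, nn=NN):
    """Return the size of each term for n particles, keyed by the layer it maps to."""
    return {
        "host (N)": n,
        "fpga (N*Nn)": n * nn,
        "grape (N^2)": n * n,
    }


for n in (10**4, 10**7):
    counts = term_counts(n)
    print(f"N = {n:.0e}")
    for layer, count in counts.items():
        print(f"  {layer:12s} {count:.1e}")
```

Under these assumptions the middle term is always Nn times larger than the host term, while the N^2 term exceeds the middle term by a factor of N/Nn, which is why each term is assigned to a different layer.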

The present FPGA implementation of the SPH code fragment uses a reduced-precision 28-bit floating-point format, providing a performance of 250 MFlops on our somewhat outdated ENABLE++ processor and an expected 1.5 GFlops on the new ATLANTIS processor. The performance of the ATLANTIS FPGA processor can easily be scaled up by a factor of at least 6 by increasing the number of boards. Further scaling is possible, but communication bandwidth has to be considered. The AHA-GRAPE architecture provides a cost-effective way to improve the performance of GRAPE-based systems running SPH code.
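
The effect of such a reduced-precision format on individual values can be estimated in software. The Python sketch below is an assumption-laden illustration, not the format used on ENABLE++ or ATLANTIS: the text gives only the total width of 28 bits, so the layout assumed here (1 sign bit, 8 exponent bits, 19 mantissa bits, obtained by clearing the 4 lowest mantissa bits of an IEEE single-precision value) and the function name truncate_to_28_bits are hypothetical.

```python
import struct


def truncate_to_28_bits(x, dropped_bits=4):
    """Emulate an ASSUMED 28-bit float: IEEE float32 with the lowest
    `dropped_bits` mantissa bits cleared (truncation toward zero)."""
    # Round x to float32 and reinterpret its bit pattern as an unsigned int.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    mask = 0xFFFFFFFF ^ ((1 << dropped_bits) - 1)
    # Clear the low mantissa bits and reinterpret as float32 again.
    (y,) = struct.unpack("<f", struct.pack("<I", bits & mask))
    return y


for value in (1.0, 1.0 / 3.0, 3.141592653589793):
    reduced = truncate_to_28_bits(value)
    rel_err = abs(reduced - value) / abs(value)
    print(f"{value!r:>22} -> {reduced!r:<22} relative error {rel_err:.2e}")
```

With 19 explicit mantissa bits the relative error per value stays below about 2^-18 under this assumed layout; whether that is adequate for a given SPH quantity has to be checked against the real format and the real code.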

At present (Jan. 99) a test implementation of the SPH-loop/step1 on ENABLE++ is being carried out to verify the estimated performance. By mid-99 the new ATLANTIS system will be available, on which the full SPH code is to be implemented. We expect the first prototype AHA-GRAPE system to be available in mid-2000.