The GRAPE-6 system will consist of a host computer and 12 clusters. Each cluster has two I/O ports, one for local communication and the other for multicast across clusters. The multicast network is necessary to implement the individual timestep algorithm efficiently, but in this paper we do not discuss its details and instead concentrate on a single cluster. Figure 1 shows the overall structure of the GRAPE-6 system.
Figure 1: Overall structure of GRAPE-6 system. Eight clusters are shown.
One cluster has one host-interface board (HIB), five network boards (NBs), and 16 processor boards (PBs). Each NB has one uplink and four downlinks. Thus, the 16 PBs are connected to the host through a two-level tree network of NBs (see figure 2). The HIB and the NBs handle the communication between the PBs and the host.
Figure 2: A GRAPE-6 cluster. First four PBs are shown. Other 12 PBs and interface to the multicast network are omitted.
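The board counts of one cluster follow directly from the fan-out of the NBs. The short sketch below reproduces that arithmetic; the constant and function names are illustrative and not part of any GRAPE-6 software.

```python
# Counts taken from the text; all names here are illustrative assumptions.
DOWNLINKS_PER_NB = 4   # each NB has one uplink and four downlinks
TREE_LEVELS = 2        # two levels of NBs between the HIB and the PBs


def boards_per_level(downlinks, levels):
    """Number of boards at each depth, from the root NB down to the PBs."""
    return [downlinks ** depth for depth in range(levels + 1)]


counts = boards_per_level(DOWNLINKS_PER_NB, TREE_LEVELS)
nb_count = sum(counts[:-1])   # NBs occupy every level except the leaves
pb_count = counts[-1]         # PBs are the leaves of the tree

print(f"NBs per cluster: {nb_count}, PBs per cluster: {pb_count}")
```

With a fan-out of four and two levels of NBs, this gives one root NB plus four second-level NBs, serving 16 PBs.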
The PBs perform the force calculation. Each PB houses 16 GRAPE-6 processor chips, which are custom LSI chips that calculate the gravitational force and its first time derivative. Figure 3 shows a photograph of a PB. Two processor chips and four memory chips are mounted on a daughter card, and eight daughter cards are mounted on a processor board. The three large chips on the board are the FPGA chips for the reduction network.
Figure 3: The GRAPE-6 processor board.
A single GRAPE-6 processor chip integrates six pipeline processors for the force calculation, one pipeline processor for the prediction, a network interface, and a memory interface (see figure 4). One force pipeline can evaluate one particle-particle interaction per cycle. With the present pipeline clock frequency of 88 MHz, the peak speed of a chip is 30.1 Gflops. Here, we follow the convention of counting 38 operations for the calculation of one pairwise gravitational force, which has been adopted in recent Gordon Bell prize applications. GRAPE-6 also calculates the time derivative, which adds another 19 operations. Thus, the total number of floating-point operations for one interaction is 57.
Figure 4: The GRAPE-6 processor chip.
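The per-chip peak speed quoted above is the product of the pipeline count, the clock frequency, and the operation count per interaction. A minimal check of that arithmetic, with variable names chosen here for illustration:

```python
# Figures quoted in the text; names are illustrative assumptions.
FORCE_PIPELINES_PER_CHIP = 6
CLOCK_HZ = 88_000_000          # 88 MHz pipeline clock
OPS_FORCE = 38                 # convention for one pairwise force
OPS_TIME_DERIVATIVE = 19       # extra operations for the time derivative

ops_per_interaction = OPS_FORCE + OPS_TIME_DERIVATIVE

# Each pipeline evaluates one interaction per cycle.
chip_peak_flops = FORCE_PIPELINES_PER_CHIP * CLOCK_HZ * ops_per_interaction

print(f"{ops_per_interaction} operations per interaction")
print(f"{chip_peak_flops / 1e9:.1f} Gflops per chip")
```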
Each pipeline processor of GRAPE-6 calculates the forces on eight particles simultaneously, using the virtual multiple pipeline (VMP) architecture introduced with GRAPE-4. In addition, the six pipelines in one chip calculate the forces on six different sets of eight particles. Thus, one chip calculates the forces on 48 particles. This architecture drastically reduces the required memory bandwidth per chip. The data stored for one particle is 64 bytes. Therefore, if the six pipelines required different data, one chip would need 384 bytes per cycle, or 33.8 GB/s of memory bandwidth. In our architecture, the bandwidth requirement is reduced by a factor of 48 to 0.7 GB/s, since the same data is reused 48 times to calculate the forces on 48 particles.
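The bandwidth figures in this paragraph can be reproduced in a few lines. The sketch below assumes 1 GB = 10^9 bytes, which is the convention that matches the quoted numbers; the names are illustrative.

```python
# Figures quoted in the text; names are illustrative assumptions.
BYTES_PER_PARTICLE = 64
FORCE_PIPELINES_PER_CHIP = 6
VMP_PARTICLES_PER_PIPELINE = 8   # virtual multiple pipeline factor
CLOCK_HZ = 88_000_000            # 88 MHz pipeline clock

# Case 1: every pipeline fetches its own particle on every cycle.
bytes_per_cycle = BYTES_PER_PARTICLE * FORCE_PIPELINES_PER_CHIP
naive_bandwidth = bytes_per_cycle * CLOCK_HZ

# Case 2: one fetched particle is reused for all 48 particles on the chip.
reuse_factor = FORCE_PIPELINES_PER_CHIP * VMP_PARTICLES_PER_PIPELINE
shared_bandwidth = naive_bandwidth // reuse_factor

print(f"independent data: {naive_bandwidth / 1e9:.1f} GB/s")
print(f"shared data:      {shared_bandwidth / 1e9:.1f} GB/s")
```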
All processor chips in a cluster calculate the forces on the same set of 48 particles, each from a different set of particles. Thus, we need a fast network that first broadcasts the 48 particles to all processor chips and then sums the results calculated on the 256 chips in a cluster. The broadcast is simple to implement in hardware. The summation network is implemented using FPGA chips as node points. Each FPGA chip implements a sequencer and a 4-input adder. These FPGAs and the other circuits on the board operate at 1/4 of the clock speed of the processor chip, that is, at 22 MHz. The data width of the on-board network is 32 bits, so the communication speed of any link in the network is 88 MB/s.
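The link speed follows from the board clock and the data width, and the summation can be pictured as a tree of 4-input adders. The sketch below illustrates both. It is a generic 4-ary reduction over integer partial results, not the actual wiring of the FPGAs on the boards, and the function names are hypothetical.

```python
def link_rate_bytes_per_s(chip_clock_hz, clock_divider, width_bits):
    """Bytes per second on a link that moves one word per board clock cycle."""
    return chip_clock_hz // clock_divider * width_bits // 8


def tree_sum(partial_results, fan_in=4):
    """Sum a non-empty sequence level by level with fan_in-input adders."""
    level = list(partial_results)
    while len(level) > 1:
        # Each node adds up to fan_in inputs from the level below.
        level = [sum(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
    return level[0]


rate = link_rate_bytes_per_s(88_000_000, 4, 32)
print(f"link rate: {rate // 1_000_000} MB/s")

# One integer partial result per chip, for the 256 chips of a cluster.
partials = list(range(256))
total = tree_sum(partials)
print(f"tree sum matches flat sum: {total == sum(partials)}")
```

With a fan-in of four, 256 inputs are reduced in four levels. The example uses integers so that the tree sum and the flat sum agree exactly regardless of summation order.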
For the links between boards, we have adopted a fast semi-serial link with LVDS signal levels, which can achieve the above data rate with four pairs of twisted-pair cables (the same as the standard cable for 100 Mbit Ethernet). We adopted DS90CF364AMTD and DS90C363AMTD from National Semiconductor as the LVDS devices.
The structure of an NB is essentially the same as that of a processor board, but it carries links to the next level of the tree (either NBs or PBs) instead of the processor chips.
At the time of writing, we have a 4-PB system, in which each processor board is directly connected to the host computer without using network boards. Figure 5 shows the structure of this system. This system has a theoretical peak speed of 1.926 Tflops. The host computer is a UP-2000 based Alpha (EV6, 667 MHz) box with two processors and 2 GB of memory, running Compaq Tru64 UNIX.
Figure 5: The present testbed GRAPE-6 configuration used for the simulation.
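The quoted peak speed of the 4-PB testbed is consistent with the per-chip figures given earlier in this section, as the following short check shows. The names are illustrative.

```python
# Figures quoted in the text; names are illustrative assumptions.
PROCESSOR_BOARDS = 4
CHIPS_PER_BOARD = 16
FORCE_PIPELINES_PER_CHIP = 6
CLOCK_HZ = 88_000_000        # 88 MHz pipeline clock
OPS_PER_INTERACTION = 57     # 38 for the force plus 19 for its derivative

chips = PROCESSOR_BOARDS * CHIPS_PER_BOARD
system_peak_flops = (
    chips * FORCE_PIPELINES_PER_CHIP * CLOCK_HZ * OPS_PER_INTERACTION
)

print(f"{chips} chips, {system_peak_flops / 1e12:.3f} Tflops peak")
```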