GRAPE newsletter vol.7 (Nov 25 2005)
Dear Colleagues:
This is the seventh issue of our GRAPE newsletter, a brief summary of
recent developments regarding the GRAPE special-purpose computer for
stellar dynamics. For further information: "http://www.astrogrape.org".
+----------------------- CONTENTS:-----------------------------+
| 1) Status of the GRAPE-DR Project
| 2) The GRAPE Users Group
| 3) Front-end Speedup
| 4) Progrape status report
| 5) Reports from the field: GRAPEs in Heidelberg
+--------------------------------------------------------------+
1---------1---------1---------1---------1---------1---------1---------1
STATUS OF THE GRAPE-DR PROJECT
Last year we started the development of the GRAPE-DR. You can
find a description of the basic concepts in a short paper listed
on http://arXiv.org/abs/astro-ph/0509278, under the title of
"Modified SIMD architecture suitable for single-chip implementation".
In this paper the term GRAPE-DR acquired an altogether new meaning:
Greatly Reduced Array of Processor Elements with Data Reduction.
The logical design of the processor chip was finished in August 2005,
and we are currently working on the physical design of the chip.
If everything goes well, we will receive the first chip sample in
April 2006. Currently, an FPGA prototype system is already working.
The GRAPE-DR chip will operate at a clock frequency of 500 MHz. It
consists of 512 processing elements, each with fully-pipelined
floating-point arithmetic units. The peak speed of a single chip
will be 500 Gflops. Unlike the previous GRAPE hardware, each
processing element is programmable, so the GRAPE-DR can be used for
applications other than the calculation of gravitational interactions.
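
As a rough cross-check of the quoted peak speed, the arithmetic can be
sketched as follows. The number of floating-point operations per
processing element per cycle is an assumption made here for illustration
(one add plus one multiply); the other two numbers come from the text
above.

```python
# Back-of-the-envelope peak speed of one GRAPE-DR chip.
# ASSUMPTION (not stated in this newsletter): each processing element
# completes one floating-point add and one multiply per clock cycle.
CLOCK_HZ = 500e6            # 500 MHz, from the text
N_PE = 512                  # processing elements per chip, from the text
FLOPS_PER_PE_PER_CYCLE = 2  # assumed: one add + one multiply


def peak_gflops(clock_hz, n_pe, flops_per_cycle):
    """Peak rate in Gflops if every unit is busy on every cycle."""
    return clock_hz * n_pe * flops_per_cycle / 1e9


print(peak_gflops(CLOCK_HZ, N_PE, FLOPS_PER_PE_PER_CYCLE))
```

Under that assumption the result is of the order of the 500 Gflops
quoted above; a different operation count per cycle would scale the
number accordingly.
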
At present, to program the GRAPE-DR, a user must write the assembly
language code for the interaction kernel. An overview of the
software system, with examples, will soon be available. In the
future, we plan to implement compilers for higher-level languages.
2---------2---------2---------2---------2---------2---------2---------2
THE GRAPE USERS GROUP
The GRAPE Users Group, now starting its second year, is a low-traffic
email list for discussing anything and everything related to the GRAPE
platform, and getting more out of your GRAPEs. Recent threads have
focussed on the diagnosis of GRAPE problems, both in hardware and
software. We also welcome questions on the scientific aspects of GRAPE
use, and of course matters relating to configuration, etc.
To join our community, please send a subscription request to the
current list maintainer, Michael Sipior (sipior@science.uva.nl).
The Modesta group in Amsterdam also maintains a GRAPE page at
http://modesta.science.uva.nl, and soon we hope to add an archive of
previous Users Group postings there, along with a useful FAQ list.
Stay tuned!
3---------3---------3---------3---------3---------3---------3---------3
FRONT-END SPEEDUP
For many calculations, the GRAPE is so fast that the front-end computer
becomes the bottleneck in speed. It would be great if we could speed up
calculations on standard chips, such as those manufactured by Intel and
AMD. The problem is that chips have become more and more complicated,
whereas compilers have not really kept up. Many of the new instructions
added to the latest chip architectures are simply not used by compilers.
In addition, the compilers typically don't use the registers of modern
chips in an efficient way.
Keigo Nitadori has delved deeply into these problems, and has discovered
a number of clever ways to overcome part of the inefficiencies of the
compilers currently in use. By fine-tuning the inner loops of N-body codes
in such a way as to achieve optimal assembly code generation, Keigo has
succeeded in speeding up gravitational force calculations by factors
between two and ten. He recently wrote a preprint on this topic, "Performance
Tuning of N-Body Codes on Modern Microprocessors: I. Direct Integration
with a Hermite Scheme on x86_64 Architecture", together with Jun Makino
and Piet Hut; see http://arXiv.org/abs/astro-ph/0511062. His codes are
available on http://grape.astron.s.u-tokyo.ac.jp/~nitadori/phantom/.
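
For readers who have not looked inside an N-body code, the kind of inner
loop in question can be sketched as below. This is only an illustration
in Python of a direct-summation force loop with G = 1; it is not taken
from Keigo's codes, and the function name and the softening parameter
eps are hypothetical. A Hermite scheme would also need the time
derivative of the acceleration, which is left out here.

```python
import math


def accelerations(pos, mass, eps=0.0):
    """Direct-summation gravitational accelerations, with G = 1.

    pos  : list of (x, y, z) tuples
    mass : list of masses, same length as pos
    eps  : softening length (hypothetical parameter for this sketch)
    """
    n = len(pos)
    acc = []
    for i in range(n):
        xi, yi, zi = pos[i]
        ax = ay = az = 0.0
        # This inner loop over j is where nearly all the time is spent.
        for j in range(n):
            if i == j:
                continue
            dx = pos[j][0] - xi
            dy = pos[j][1] - yi
            dz = pos[j][2] - zi
            r2 = dx * dx + dy * dy + dz * dz + eps * eps
            m_over_r3 = mass[j] / (r2 * math.sqrt(r2))
            ax += dx * m_over_r3
            ay += dy * m_over_r3
            az += dz * m_over_r3
        acc.append((ax, ay, az))
    return acc


# Two unit masses separated by a distance of 2 along the x axis.
print(accelerations([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)], [1.0, 1.0]))
```

Each pass through the inner loop costs one square root, one division
and a handful of multiplications and additions, which is why the way
the compiler maps these few lines onto registers and instructions
matters so much.
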
4---------4---------4---------4---------4---------4---------4---------4
PROGRAPE STATUS REPORT
Implementation of gravity and SPH on FPGA,
written by N. Nakasato (RIKEN, Tokyo)
PROGRAPE (PROgrammable GRAPE) is a hardware accelerator for numerical
simulations, notably particle simulations. It was originally developed
as a sub-project of the GRAPE project. Like other members of the GRAPE
family, a PROGRAPE system consists of a host computer and PROGRAPE
boards. The crucial difference is that in a normal GRAPE the
computational pipeline is implemented on specially developed LSIs
(a.k.a. GRAPE chips), whereas in PROGRAPE the pipeline is implemented on
FPGA chips. An FPGA is an LSI whose internal logic can be changed
arbitrarily, in other words programmed, for any purpose. Furthermore,
recent FPGAs are large enough to hold rather complex numerical
calculations. In PROGRAPE, the pipelines on the FPGA chips are therefore
reconfigurable, or programmable. This programmability is the most
significant difference between a normal GRAPE and PROGRAPE.
Over the past two decades there have been numerous attempts in several
research fields to use FPGAs for accelerating computational tasks. In
many-body simulations, the first attempt was made by Kim et al. (1995) at
Rutgers University. They tried to implement GRAPE-2 equivalent pipelines
on an FPGA board and obtained 75 MFLOPS. In 1998, the first PROGRAPE was
developed in Tokyo, mainly by Tsuyoshi Hamada and Toshi Fukushige, who
reported first results in 2000 (PASJ, Vol. 52, 954). In that attempt,
the FPGA chips were not large enough to implement complex pipelines;
nevertheless, they achieved a peak performance of 1 GFLOPS for GRAPE-3
equivalent pipelines on FPGAs. That was more than 10 times faster than
the typical speed of PCs at that time. At almost the same time, people in
Mannheim and Heidelberg started to implement SPH (Smoothed
Particle Hydrodynamics) pipelines on an FPGA board developed at Mannheim
University. They also proposed the AHA-GRAPE project (Adaptive
Hydrodynamic Architecture - GRAvity PipE) to construct a large cluster
of FPGA boards and use it for N-body and SPH simulations. Recently,
jointly with people in Munich, they have started the GRACE project
(http://www.ari.uni-heidelberg.de/grace/) as a modern version of AHA-GRAPE
(see item 5 below).
After the development of PROGRAPE-2 as a sub-project of the GRAPE-6
project (reported in GRAPE newsletter vol.5), in mid 2004 T.Hamada and
N.Nakasato started to construct a new system, now called PROGRAPE-3,
at RIKEN, Wako, Saitama, Japan. In this project, the main work was not the
development of new hardware, since a suitable FPGA board had already been
developed for another purpose in our group. Our development work has
instead been on software that generates ''hardware'' on FPGA chips. The
software, PGR (Processors Generator for Reconfigurable systems), is like a
new compiler for a new computer. Originally, T.Hamada and T.Fukushige
started to develop a programming system for PROGRAPE in late
2000. This system is called PGPG (Pipeline Generator for Programmable
Grape), and PGR was forked from PGPG as a new project at RIKEN. A
highlight of PGR is that we have developed a floating-point arithmetic
library with arbitrary precision. Thus, with PGR, one can implement an
arithmetic pipeline using floating-point calculations. This feature
enables us to implement rather complex SPH pipelines on PROGRAPE-3.
So far, we have implemented GRAPE-5 equivalent pipelines and full SPH
pipelines on PROGRAPE-3 and on other hardware, including MPRACE and the
Cray XD-1. The performance is remarkable: the theoretical peak performance
of the GRAPE-5 pipelines on one PROGRAPE-3 board is 324 GFLOPS, and that
of the SPH pipelines is 85 GFLOPS. Note that the GRAPE-5 pipelines are
implemented mainly with 5-bit precision LNS (Logarithmic Number System)
arithmetic units and the SPH pipelines mainly with 16-bit precision
floating-point arithmetic units. We are already using our GRAPE-5
pipelines on PROGRAPE-3 for production runs of galaxy evolution. In
particular, our Tree-GRAPE implementation using PROGRAPE-3 is the fastest
in the world and will be very useful for collisionless simulations.
Furthermore, we have already tested our SPH pipelines on PROGRAPE-3 in
3-D SPH simulations and obtained promising results. We will first use our
SPH pipelines on PROGRAPE-3 for simulations of stellar collisions and of
the formation of galaxies. Details of our SPH implementation will be
submitted very soon. Our results clearly show that, at least for
astrophysical many-body simulations, the era of reconfigurable
supercomputing is now arriving.
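
To illustrate why a logarithmic number system suits this kind of
pipeline, here is a minimal sketch in Python. It stores a positive
number as a fixed-point value of its base-2 logarithm, so that
multiplication and powers become integer additions and scalings. The
function names and the choice of FRAC_BITS are hypothetical and made
only for this illustration; the actual number format used on PROGRAPE-3
is not described in this report.

```python
import math

FRAC_BITS = 5  # hypothetical number of fractional bits of log2(x)


def lns_encode(x, frac_bits=FRAC_BITS):
    """Encode a positive number as an integer holding log2(x) in fixed point."""
    if x <= 0.0:
        raise ValueError("this sketch handles positive values only")
    return round(math.log2(x) * (1 << frac_bits))


def lns_decode(code, frac_bits=FRAC_BITS):
    """Convert a fixed-point logarithm back to an ordinary float."""
    return 2.0 ** (code / (1 << frac_bits))


def lns_mul(a, b):
    """Multiplication of two values is addition of their logarithms."""
    return a + b


def lns_pow(a, exponent):
    """Raising to a power is scaling the logarithm (rounded to the grid)."""
    return round(a * exponent)


r2 = lns_encode(4.0)                        # squared distance r**2 = 4
inv_r3 = lns_pow(r2, -1.5)                  # (r**2)**(-3/2) = 1 / r**3
print(lns_decode(inv_r3))                   # 1 / 2**3
print(lns_decode(lns_mul(lns_encode(3.0), lns_encode(5.0))))
```

Powers of two are represented exactly in this scheme, while other
values carry a relative error set by the number of fractional bits. The
operation that is expensive in such a system is addition, which is not
shown here.
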
Finally, we note that RIKEN is a site that hosts the largest GRAPE system
in FLOPS count (comparable to that of Tokyo University). We have the
75 TFLOPS MDM (Molecular Dynamics Machine), three MDGRAPE-2 clusters and a
newly developed MDGRAPE-3 cluster. Next spring, a new PROGRAPE cluster
will join this list of multi-GRAPE clusters.
Project Web Page:
http://progrape.jp/pgpg/ (in Japanese and partly in English)
http://progrape.jp/pgr/hiki/ (in English)
Project `members' : T.Hamada, N.Nakasato & T.Ebisuzaki
The Institute of Physical and Chemical Research (RIKEN),
Wako, Saitama, Japan
5---------5---------5---------5---------5---------5---------5---------5
REPORTS FROM THE FIELD: GRAPES IN HEIDELBERG
GRAPE and FPGA research at Astronomisches Rechen-Institut (ARI) of
the Centre for Astronomy of the University of Heidelberg (ZAH)
by Rainer Spurzem (ARI, Heidelberg, Germany)
ARI Heidelberg has been at the forefront of stellar dynamical research and
direct N-body simulation for decades, starting with the first papers of
Sebastian von Hoerner in the 1960s. Its name refers to the German word for
computing (''rechnen''), because the ARI also used to be a local computing
centre in the 60s.
Nowadays that tradition has been brought back to the institute (which
recently joined the University of Heidelberg under the roof of a newly
founded ''Center for Astronomy'') thanks to GRAPE, the Volkswagen
foundation, and the state of Baden-Wuerttemberg. The latter are funding
our project ``GRACE - Astrophysical Computer Simulations using Programmable
Hardware''. Within the project ARI hosts the GRACE supercomputer
(see http://www.ari.uni-heidelberg.de/grace/pics/gracecluster.jpg),
consisting of a total of 128 CPUs: 32 Dual Pentium Xeon nodes host 32
of the small micro-GRAPE6 PCI boards, with a total peak speed of 4 Tflops,
plus 32 pieces of a new kind of reconfigurable special purpose
hardware (compute boards using FPGA chips, called MPRACE) designed at
the University of Mannheim (Germany), only some 20km away from
Heidelberg. The name GRACE stands for GRAPE + MPRACE.
The new micro-GRAPE cards have opened up a new and more flexible way to
use the power of GRAPE hardware in a Beowulf PC cluster environment.
Until recently one GRAPE6 board with an equivalent speed of 1 Tflop (if
fully equipped; memory for 512k particles) was the smallest unit of GRAPE
that could be connected to a host computer; it came in the form of a
special box with interfaces to be connected to the PCI interface of a host
computer. Now a micro GRAPE6 offers 1/8 of the compute power and (at the
same equivalent compute speed) significantly more memory for less than 1/8
of the price. The micro GRAPE is compact enough to be used as just another
PCI card in a 4U PC box (but beware of the rather large cooling unit,
which may obstruct other slots). Using a larger number of smaller GRAPE
units in a parallel computer potentially reduces the communication
bottlenecks (more channels between host and GRAPE are available) but is a
challenge for the software (requiring parallel codes with the ability to
use GRAPE).
Our GRACE project and the GRACE cluster couple GRAPE to the
reconfigurable MPRACE boards developed at the University of Mannheim. It
is a collaboration between groups of the Universities of Munich
(A. Burkert, T. Naab, M. Wetzstein), Mannheim (R. Maenner, A. Kugel,
G. Lienhart, G. Marcus) and Heidelberg (R. Spurzem, P. Berczik, G. Kupi,
A. Ernst). The GRACE cluster at ARI has one of the fastest available
network interconnects, an Infiniband with Dual-Port PCI-Express
Host-Cards, which deliver 20 Gbit/s duplex bandwidth in and out of the
nodes. The cluster has now been operational for a few weeks, and first test
benchmarks are under way. We expect to be able to do direct N-body
modelling of up to 4 million particles and ten times more for SPH models
in the near future.
What are the astrophysical science goals to be tackled with our new
hardware? Keywords are the dynamics of galactic nuclei with massive
single and multiple black holes (addressing the question of whether and
how massive black holes themselves manage to merge after galaxy mergers,
and predicting the possible gravitational radiation from this process in
the universe), dense star clusters, and numerical modelling of planet
formation (Heidelberg). The Munich group is doing research on galaxy
formation and merging, as well as star formation, with coupled N-body SPH
methods to follow simultaneously the evolution of stars and gas.
On the software side the GRACE project requires some work on
designing and developing new variants of codes. At present NBODY6++ (the
massively parallel version of Sverre Aarseth's NBODY6, provided by
R. Spurzem) works fine on the cluster but cannot use the GRAPE yet, while
simpler NBODY1-like codes are already using the many GRAPEs efficiently in
parallel. Our colleagues in Mannheim have considerable know-how in
programming the FPGA component MPRACE (compare
http://www-li5.ti.uni-mannheim.de/ and
http://www-li5.ti.uni-mannheim.de/fpga). The aim for the future is to
provide users with easier-to-use software tools to program the FPGA
boards for standard tasks. Why use FPGAs at all? Apart from the shorter
development cycles as compared to special-purpose hardware like GRAPE, the
main idea is as follows: while GRAPE is still the fastest hardware
component to accelerate the long-range gravitational forces (numerical
effort scaling with $N^2$), the MPRACE component should accelerate in a
balanced way the next bottleneck in the algorithms, scaling with $N\cdot
N_n$ (where $N_n$ is a typical neighbour number of the order of 50-200,
rather independent of $N$). These are typically neighbour forces in NBODY
codes or non-gravitational forces and density computation in SPH codes
(SPH = smoothed particle hydrodynamics). So the host bottleneck, which
occurs when using better and more complex N-body codes such as NBODY6++ or
SPH, is further reduced in the GRACE cluster, making it possible to use
these codes efficiently with GRAPE.
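
To give a feeling for the two scalings, the following small Python
sketch counts pairwise interactions for one full force evaluation. The
function name is hypothetical, constant factors and individual time
steps are ignored, and the particle number is simply the 4 million
mentioned earlier for direct N-body models.

```python
def interaction_counts(n, n_neighbours):
    """Return (long-range, neighbour) interaction counts per force pass.

    The long-range part scales as N**2, the neighbour part as N * N_n.
    Constant factors are deliberately ignored in this sketch.
    """
    return n * n, n * n_neighbours


N = 4_000_000
for n_n in (50, 200):
    long_range, neighbour = interaction_counts(N, n_n)
    print(n_n, long_range, neighbour, long_range // neighbour)
```

The neighbour part is smaller than the long-range part by a factor
$N/N_n$, but once the $N^2$ part runs on GRAPE, the $N\cdot N_n$ part
left on the host can dominate the wall-clock time, which is the
bottleneck MPRACE is meant to address.
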
While our new GRACE supercomputer still has a hybrid architecture using
GRAPE and FPGA together, the FPGA pipelines are already reaching speeds
comparable to GRAPE. We are not very far from the point where purely
FPGA-based special purpose compute boards may supersede the use of GRAPE
boards altogether (at least in the present form of GRAPE6, as they exist
now, but as we know there are also new GRAPE developments going on in
Tokyo...). The development of software that is able to run on different
types of FPGA platforms is one of our future plans; in this area there is
also an ongoing collaboration with the astrophysical computing group at
RIKEN Institute in Tokyo (N. Nakasato, T. Hamada, see item 4 above) who
have developed a portable FPGA pipeline generator and use another type of
FPGA special hardware board.
It should not be overlooked that, while this development work goes on, the
GRAPE component of the ARI GRACE cluster in Heidelberg is already working and
producing results just using the GRAPEs in the cluster. A similar 32 node PC
cluster with micro GRAPEs (but not the FPGA component) is running at the RIT
(Rochester Institute of Technology, D. Merritt).
As a final side remark, another application to the Volkswagen foundation is
pending now, in their special program for the development of sciences in
Central Asia. The project should support the Fessenkov observatory in Almaty,
Kazakhstan with funds to build a small PC cluster facility with micro
GRAPEs. In collaboration between Heidelberg and Almaty (C. Omarov,
E. Vilkoviski) we are working on problems of galactic nuclei (interactions
between stars and central massive accretion disks). If successful, the
project will make Almaty, as far as I know, the first GRAPE site on the
territory of the former Soviet Union.
0---------0---------0---------0---------0---------0---------0---------0
Piet Hut and Jun Makino
(submissions to: piet@astrogrape.org or grape@astrogrape.org)
HOW TO (UN)SUBSCRIBE: send an email message to grape@astrogrape.org
with Subject: (un)subscribe
00--------00--------00--------00--------00--------00--------00--------00