One major change in going from DDSCAT.4b to 4c and 5a was modification of the code to permit use of the GPFA FFT algorithm developed by Dr. Clive Temperton. DDSCAT continues to offer both the Brenner code as well as the ``old'' Temperton code as alternative FFT implementations. The ``old'' Temperton code requires about 11% more memory than either the Brenner or GPFA codes (to use the ``old'' Temperton algorithm, DDSCAT.f must be compiled with MXMEM=1 rather than MXMEM=0).
To help persuade the user that the GPFA code is an important step forward,
we provide a program TSTFFT to compare the performance of
different 3-D FFT implementations.
To compile, link, and run this program on a Unix system,
position yourself in the DDA/src directory and type
make tstfft tstfft
Output results will be written into a file res.dat. Here is a copy of the res.dat file created when run on a Sun Ultrasparc 170:
CPU time (sec) for different 3-D FFT methods
Machine= Sun Ultrasparc-170
parameter LVR = 64
LVR="length of vector registers" in gpfa2f,gpfa3f,gpfa5f
==========================================================
Brenner Temperton Temperton
NX NY NZ (FOURX) (Old) (GPFA)
2 2 2 0.000068 0.000065 0.000262
3 3 3 0.000200 0.000077 0.000096
4 4 4 0.000064 0.000086 0.000111
5 5 5 0.000651 0.000292 0.000137
6 6 6 0.001171 0.000170 0.000223
8 8 8 0.000350 0.000250 0.000323
9 9 9 0.004654 0.000300 0.000505
10 10 10 0.005218 0.000457 0.000575
12 12 12 0.008664 0.000814 0.000751
15 15 15 0.023387 0.001583 0.001798
16 16 16 0.003137 0.001359 0.002386
18 18 18 0.042196 0.003908 0.003479
20 20 20 0.043754 0.003881 0.004010
24 24 24 0.075672 0.010722 0.007198
25 25 25 0.103492 0.008142 0.011232
27 27 27 0.160173 0.011126 0.012293
30 30 30 0.185177 0.017111 0.014631
32 32 32 0.042211 0.034032 0.023245
36 36 36 0.322581 0.071013 0.027502
40 40 40 0.379551 0.133391 0.044182
45 45 45 0.797107 0.199537 0.069740
48 48 48 0.668408 0.255849 0.094520
50 50 50 0.983722 0.287208 0.115833
54 54 54 1.530319 0.427839 0.142388
60 60 60 1.804154 0.607189 0.181221
64 64 64 1.005013 0.806311 0.342762
72 72 72 3.812469 1.219364 0.362328
75 75 75 5.364067 1.234528 0.437496
80 80 80 4.509048 1.576118 0.563839
81 81 81 8.017456 1.798195 0.560326
90 90 90 10.201881 2.455399 0.779869
It is clear that the GPFA code is generally MUCH faster than the Brenner FFT
(by a factor of 13 for the 90
90
90 case) and significantly faster than the
``old'' Temperton FFT code (by a factor of 3 for the 90
90
90
case), although for some cases the differences are not large (e.g.,
32
32
32).
Since the GPFA code is memory-efficient as well, it appears to be the method
of choice on scalar machines. It appears to also be best on vector machines.
The GPFA code contains a parameter LVR which is set in
data statements in the
routines gpfa2f, gpfa3f, and gpfa5f.
LVR is supposed to be optimized to correspond
to the ``length of a vector register'' on vector machines.
As delivered, this parameter is set to 64, which is supposed to be
appropriate for Crays other than the C90 (for the C90, 128 is supposed to
be preferable).
The value of LVR is not critical for scalar machines, as long as
it is fairly large. We found little difference between LVR=64 and
128 on a Sparc 10/51 and on an Ultrasparc 170. You may wish to
experiment with different LVR values on your computer architecture.
To change LVR, you need to edit gpfa.f and change the three
data statements where LVR is set.
The choice of FFT implementation is obtained by specifying one of: