|
shun_iwasawa |
a35b8f |
Speed:
|
|
shun_iwasawa |
a35b8f |
* If you want to use multiple cores, then compile with -openmp or -fopenmp (see your compiler docs).
|
|
shun_iwasawa |
a35b8f |
Realize that larger FFTs will reap more benefit than smaller FFTs. This generally uses more CPU time, but
|
|
shun_iwasawa |
a35b8f |
less wall time.
|
|
shun_iwasawa |
a35b8f |
|
|
shun_iwasawa |
a35b8f |
* experiment with compiler flags
|
|
shun_iwasawa |
a35b8f |
Special thanks to Oscar Lesta. He suggested some compiler flags
|
|
shun_iwasawa |
a35b8f |
for gcc that make a big difference. They shave 10-15% off
|
|
shun_iwasawa |
a35b8f |
execution time on some systems. Try some combination of:
|
|
shun_iwasawa |
a35b8f |
-march=pentiumpro
|
|
shun_iwasawa |
a35b8f |
-ffast-math
|
|
shun_iwasawa |
a35b8f |
-fomit-frame-pointer
|
|
shun_iwasawa |
a35b8f |
|
|
shun_iwasawa |
a35b8f |
* If the input data has no imaginary component, use the kiss_fftr code under tools/.
|
|
shun_iwasawa |
a35b8f |
Real ffts are roughly twice as fast as complex.
|
|
shun_iwasawa |
a35b8f |
|
|
shun_iwasawa |
a35b8f |
* If you can rearrange your code to do 4 FFTs in parallel and you are on a recent Intel or AMD machine,
|
|
shun_iwasawa |
a35b8f |
then you might want to experiment with the USE_SIMD code. See README.simd
|
|
shun_iwasawa |
a35b8f |
|
|
shun_iwasawa |
a35b8f |
|
|
shun_iwasawa |
a35b8f |
Reducing code size:
|
|
shun_iwasawa |
a35b8f |
* remove some of the butterflies. There are currently butterflies optimized for radices
|
|
shun_iwasawa |
a35b8f |
2,3,4,5. It is worth mentioning that you can still use FFT sizes that contain
|
|
shun_iwasawa |
a35b8f |
other factors, they just won't be quite as fast. You can decide for yourself
|
|
shun_iwasawa |
a35b8f |
whether to keep radix 2 or 4. If you do some work in this area, let me
|
|
shun_iwasawa |
a35b8f |
know what you find.
|
|
shun_iwasawa |
a35b8f |
|
|
shun_iwasawa |
a35b8f |
* For platforms where ROM/code space is more plentiful than RAM,
|
|
shun_iwasawa |
a35b8f |
consider creating a hardcoded kiss_fft_state. In other words, decide which
|
|
shun_iwasawa |
a35b8f |
FFT size(s) you want and make a structure with the correct factors and twiddles.
|
|
shun_iwasawa |
a35b8f |
|
|
shun_iwasawa |
a35b8f |
* Frank van der Hulst offered numerous suggestions for smaller code size and correct operation
|
|
shun_iwasawa |
a35b8f |
on embedded targets. "I'm happy to help anyone who is trying to implement KISSFFT on a micro"
|
|
shun_iwasawa |
a35b8f |
|
|
shun_iwasawa |
a35b8f |
Some of these were rolled into the mainline code base:
|
|
shun_iwasawa |
a35b8f |
- using long casts to promote intermediate results of short*short multiplication
|
|
shun_iwasawa |
a35b8f |
- delaying allocation of buffers that are sometimes unused.
|
|
shun_iwasawa |
a35b8f |
In some cases, it may be desirable to limit capability in order to better suit the target:
|
|
shun_iwasawa |
a35b8f |
- predefining the twiddle tables for the desired fft size.
|