shun_iwasawa a35b8f
Speed:
shun_iwasawa a35b8f
    * If you want to use multiple cores, then compile with -openmp or -fopenmp (see your compiler docs).
shun_iwasawa a35b8f
	Realize that larger FFTs will reap more benefit than smaller FFTs. This generally uses more CPU time, but
shun_iwasawa a35b8f
	less wall time.
shun_iwasawa a35b8f
shun_iwasawa a35b8f
    * experiment with compiler flags
shun_iwasawa a35b8f
        Special thanks to Oscar Lesta. He suggested some compiler flags 
shun_iwasawa a35b8f
        for gcc that make a big difference. They shave 10-15% off
shun_iwasawa a35b8f
        execution time on some systems.  Try some combination of:
shun_iwasawa a35b8f
                -march=pentiumpro
shun_iwasawa a35b8f
                -ffast-math
shun_iwasawa a35b8f
                -fomit-frame-pointer
shun_iwasawa a35b8f
shun_iwasawa a35b8f
    * If the input data has no imaginary component, use the kiss_fftr code under tools/.
shun_iwasawa a35b8f
      Real ffts are roughly twice as fast as complex.
shun_iwasawa a35b8f
shun_iwasawa a35b8f
    * If you can rearrange your code to do 4 FFTs in parallel and you are on a recent Intel or AMD machine,
shun_iwasawa a35b8f
    then you might want to experiment with the USE_SIMD code.  See README.simd
shun_iwasawa a35b8f
shun_iwasawa a35b8f
shun_iwasawa a35b8f
Reducing code size:
shun_iwasawa a35b8f
    * remove some of the butterflies. There are currently butterflies optimized for radices
shun_iwasawa a35b8f
        2,3,4,5.  It is worth mentioning that you can still use FFT sizes that contain 
shun_iwasawa a35b8f
        other factors, they just won't be quite as fast.  You can decide for yourself 
shun_iwasawa a35b8f
        whether to keep radix 2 or 4.  If you do some work in this area, let me 
shun_iwasawa a35b8f
        know what you find.
shun_iwasawa a35b8f
shun_iwasawa a35b8f
    * For platforms where ROM/code space is more plentiful than RAM,
shun_iwasawa a35b8f
     consider creating a hardcoded kiss_fft_state. In other words, decide which 
shun_iwasawa a35b8f
     FFT size(s) you want and make a structure with the correct factors and twiddles.
shun_iwasawa a35b8f
shun_iwasawa a35b8f
    * Frank van der Hulst offered numerous suggestions for smaller code size and correct operation 
shun_iwasawa a35b8f
    on embedded targets.  "I'm happy to help anyone who is trying to implement KISSFFT on a micro"
shun_iwasawa a35b8f
shun_iwasawa a35b8f
    Some of these were rolled into the mainline code base:
shun_iwasawa a35b8f
        - using long casts to promote intermediate results of short*short multiplication
shun_iwasawa a35b8f
        - delaying allocation of buffers that are sometimes unused.
shun_iwasawa a35b8f
    In some cases, it may be desirable to limit capability in order to better suit the target:
shun_iwasawa a35b8f
        - predefining the twiddle tables for the desired fft size.