shun_iwasawa a35b8f
If you are reading this, you are probably interested in using the SIMD extensions in kissfft
to do 4 *separate* FFTs at once.

Beware! Beyond here there be dragons!

This API is not easy to use, is not well documented, and breaks the KISS principle.

Still reading? Okay, your patience may be rewarded with a considerable speedup
(2-3x) on Intel x86 machines with SSE, if you are willing to jump through some hoops.

The basic idea is to use the packed 4-float __m128 data type as the scalar element.
This makes the data format pretty convoluted: each fft call performs 4 FFTs, on signals A, B, C and D.

For complex data, the data is interlaced as follows:
rA0,rB0,rC0,rD0,  iA0,iB0,iC0,iD0,  rA1,rB1,rC1,rD1,  iA1,iB1,iC1,iD1, ...
where "rA0" is the real part of the zeroth sample for signal A, and "iA0" is its imaginary part.

Real-only data is laid out:
rA0,rB0,rC0,rD0,  rA1,rB1,rC1,rD1,  ...

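As an illustration of the complex layout above, here is a minimal sketch that interleaves four separate complex signals into one buffer. The helper name interleave4 is hypothetical (it is not part of kissfft), and the sketch assumes the caller supplies an output buffer of 8*n floats; for actual SIMD use that buffer would also need to be 16-byte aligned.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>

// Hypothetical helper, not part of kissfft.
// Writes n samples of signals a,b,c,d into out as
//   rA,rB,rC,rD, iA,iB,iC,iD   (8 floats per sample index).
void interleave4(const std::complex<float>* a,
                 const std::complex<float>* b,
                 const std::complex<float>* c,
                 const std::complex<float>* d,
                 std::size_t n,
                 float* out)
{
    assert(out != nullptr);
    for (std::size_t k = 0; k < n; ++k) {
        float* p = out + 8 * k;
        // four real parts first ...
        p[0] = a[k].real(); p[1] = b[k].real();
        p[2] = c[k].real(); p[3] = d[k].real();
        // ... then the four imaginary parts of the same sample index
        p[4] = a[k].imag(); p[5] = b[k].imag();
        p[6] = c[k].imag(); p[7] = d[k].imag();
    }
}
```

Each group of 4 floats corresponds to one __m128 value, so one complex sample of the packed signal occupies two of them.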
Compile with gcc flags along the lines of:
-O3 -mpreferred-stack-boundary=4 -DUSE_SIMD=1 -msse

Be aware of SIMD alignment.  This is the most likely cause of segfaults.
The code within kissfft uses scratch variables on the stack.
With SIMD, these must have addresses on 16-byte boundaries.
The -mpreferred-stack-boundary=4 flag above asks gcc to keep the stack aligned to 2^4 = 16 bytes.
Search on "SIMD alignment" for more info.

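The same 16-byte requirement applies to any buffer that gets read or written as __m128. A minimal sketch of how one might declare and check such a buffer in portable C++11 follows; the names is_aligned16 and AlignedFloats are hypothetical, and heap allocations would need their own aligned allocator, which is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p sits on a 16-byte boundary.
inline bool is_aligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical wrapper: alignas(16) asks the compiler to place every
// object of this type (stack or static storage) on a 16-byte boundary.
// data is the first member, so it starts at that same address.
template <std::size_t N>
struct alignas(16) AlignedFloats {
    float data[N];
};

// Example guard one could call before handing a buffer to SIMD code.
inline void require_aligned16(const void* p)
{
    assert(is_aligned16(p) && "buffer must be 16-byte aligned for SSE");
}
```

Calling a check like require_aligned16 on every input and output buffer in debug builds turns a hard-to-trace segfault into an immediate assertion failure.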
Robin at Divide Concept was kind enough to share his code for converting data to and from the SIMD kissfft format.
I have not run it -- use it at your own risk.  It appears to do 4xN and Nx4 transpositions
(out of place).

// Needs #include <xmmintrin.h> for __m128 and _mm_set_ps.
// source holds 4 signals of size128 floats each, back to back;
// target receives size128 __m128 values and must be 16-byte aligned.
void SSETools::pack128(float* target, float* source, unsigned long size128)
{
   __m128* pDest = (__m128*)target;
   __m128* pDestEnd = pDest+size128;
   float* source0=source;
   float* source1=source0+size128;
   float* source2=source1+size128;
   float* source3=source2+size128;

   while(pDest<pDestEnd)
   {
       // _mm_set_ps takes its arguments from the highest lane down to the lowest
       *pDest=_mm_set_ps(*source3,*source2,*source1,*source0);
       source0++;
       source1++;
       source2++;
       source3++;
       pDest++;
   }
}

// Inverse of pack128: source holds size128 groups of 4 interleaved floats;
// target receives 4 signals of size128 floats each, back to back.
void SSETools::unpack128(float* target, float* source, unsigned long size128)
{
   float* pSrc = source;
   float* pSrcEnd = pSrc+size128*4;
   float* target0=target;
   float* target1=target0+size128;
   float* target2=target1+size128;
   float* target3=target2+size128;

   while(pSrc<pSrcEnd)
   {
       *target0=pSrc[0];
       *target1=pSrc[1];
       *target2=pSrc[2];
       *target3=pSrc[3];
       target0++;
       target1++;
       target2++;
       target3++;
       pSrc+=4;
   }
}
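
For checking the transposition without SSE, a scalar stand-in can be handy. The sketch below is not Robin's code: pack4_scalar and unpack4_scalar are hypothetical names, and they assume the same conventions as above (4 signals stored back to back on the planar side, out-of-place operation).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical scalar equivalent of pack128.
// source: A[0..n), B[0..n), C[0..n), D[0..n)   (4 x n, planar)
// target: A0,B0,C0,D0, A1,B1,C1,D1, ...        (n x 4, interleaved)
void pack4_scalar(float* target, const float* source, std::size_t n)
{
    assert(target != source);  // out of place only
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t s = 0; s < 4; ++s)
            target[4 * k + s] = source[s * n + k];
}

// Hypothetical scalar equivalent of unpack128 (the inverse transposition).
void unpack4_scalar(float* target, const float* source, std::size_t n)
{
    assert(target != source);  // out of place only
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t s = 0; s < 4; ++s)
            target[s * n + k] = source[4 * k + s];
}
```

Packing and then unpacking should reproduce the original planar buffer, which makes a quick sanity test for any SIMD version of the same routines.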