
Optimizations for fftw package
Closed, ResolvedPublic

Description

I have been looking at optimizing the fftw package, but I don't have the time and I don't know any good way to benchmark it. So I figured I would share my findings in the hope that someone will have a go at it if they have the time and feel like doing it. Here's what I found:

Event Timeline

Jacalz created this task. Aug 8 2019, 8:53 AM
Jacalz updated the task description. Aug 8 2019, 8:54 AM
Jacalz edited projects, added Software; removed Lacks Project. Aug 8 2019, 8:56 AM

If you wish to take a look at optimizing it, you can look at the Phoronix Test Suite profile for fftw, which runs fftw-source/tests/bench --time-repeat 100 -opatient libc64. There are many other sizes you can throw at it, ranging from ibc4096x4096 to libc32. You won't be able to use the Phoronix test itself, as it compiles its own fftw and doesn't test the system version.
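To cover several sizes in one go, the bench invocation quoted above can be wrapped in a small script. This is only a sketch: the bench path and the two problem strings are copied from this thread, the function name is made up for illustration, and the bench binary would need to be built against the fftw you actually want to measure.

```python
import shutil
import subprocess

# Problem specs quoted in this thread; swap in whatever sizes you want to cover.
PROBLEMS = ["ibc64", "ibc4096x4096"]


def bench_commands(bench_path, problems, repeat=100):
    """Build one argv list per problem, mirroring the command quoted above."""
    return [
        [bench_path, "--time-repeat", str(repeat), "-opatient", problem]
        for problem in problems
    ]


if __name__ == "__main__":
    bench = "fftw-source/tests/bench"  # path as written in the thread
    for argv in bench_commands(bench, PROBLEMS):
        print(" ".join(argv))
        # Only run when the bench binary actually exists at that path.
        if shutil.which(argv[0]):
            subprocess.run(argv, check=True)
```

Without the binary present, the script just prints the commands it would have run.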

You can also download the fftw-bench collection. These programs are more comprehensive and take a lot longer to run.

I imagine the biggest win would come from enabling AVX2, although I imagine O3 and LTO can still make a difference. You can specify avx2 : yes in package.yml, but you'll only want to enable AVX2 for the single and double variants, not the long-double one. For the AVX2 build it is probably also worth adding --enable-fma in addition to --enable-avx2.

Have a look at the glibc and openblas packages for examples of AVX2 enablement. Also take a look at the clearlinux package for fftw here

You need to be careful that, after enabling AVX2, CPUs which do not have AVX2 instructions can still compile fftw and run the non-AVX2 libraries as normal. You'll also need to make sure that the AVX2 libs are in /usr/lib64/haswell/ and that fftw and programs which use fftw are successfully loading the libs from /usr/lib64/haswell/. You can use strace to verify this.
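One way to do the strace check without reading the log by eye is to capture a trace with something like `strace -f -e trace=open,openat -o trace.txt <program>` and filter it for successful opens of the fftw libraries. The sketch below assumes the usual strace line format and that the library file names contain `libfftw3`; the function names are hypothetical and the sample lines are illustrative, not captured output.

```python
import re

# Matches open/openat calls on an fftw library, e.g.
#   openat(AT_FDCWD, "/usr/lib64/haswell/libfftw3f.so.3", O_RDONLY|O_CLOEXEC) = 3
OPEN_RE = re.compile(r'open(?:at)?\(.*?"([^"]*libfftw3[^"]*)".*\)\s*=\s*(-?\d+)')


def loaded_fftw_paths(strace_text):
    """Return the paths of fftw libraries that were opened successfully."""
    paths = []
    for line in strace_text.splitlines():
        match = OPEN_RE.search(line)
        # A non-negative return value is a file descriptor, i.e. success.
        if match and int(match.group(2)) >= 0:
            paths.append(match.group(1))
    return paths


def uses_haswell(strace_text):
    """True if at least one fftw lib was loaded and all came from the haswell dir."""
    paths = loaded_fftw_paths(strace_text)
    return bool(paths) and all(p.startswith("/usr/lib64/haswell/") for p in paths)


if __name__ == "__main__":
    # Illustrative trace: the haswell lookup fails and the generic lib is used.
    sample = (
        'openat(AT_FDCWD, "/usr/lib64/haswell/libfftw3f.so.3", O_RDONLY|O_CLOEXEC)'
        " = -1 ENOENT (No such file or directory)\n"
        'openat(AT_FDCWD, "/usr/lib64/libfftw3f.so.3", O_RDONLY|O_CLOEXEC) = 3\n'
    )
    print(loaded_fftw_paths(sample), uses_haswell(sample))
```

Running the same filter on a machine without AVX2 should show only the generic path, which covers the fallback case as well.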

Lastly, you'll need to demonstrate it actually makes a difference within fftw itself and, if you can, programs which use fftw if they have a benchmark you can run.

I would look at this myself, but sadly I do not have any systems that have AVX2 at the moment, and I'd imagine O3 and LTO would only provide slight speedups.
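For anyone unsure whether their machine qualifies, a quick check is to look for the avx2 flag in /proc/cpuinfo. A minimal sketch, assuming an x86 Linux system where that file has a "flags" line; the function name is made up for illustration.

```python
def has_avx2(cpuinfo_text):
    """Return True if the first 'flags' line in cpuinfo text lists avx2."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            _, _, flags = line.partition(":")
            return "avx2" in flags.split()
    return False


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("AVX2 supported:", has_avx2(f.read()))
    except OSError:
        # /proc/cpuinfo is Linux-specific; nothing to report elsewhere.
        print("Could not read /proc/cpuinfo")
```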

So that's a lot, but feel free to start testing if it's something you want to get into, and use it as a learning experience if nothing else ;)

Thanks a lot for the information @joeboneichie! It is greatly appreciated, but as it currently stands none of my Linux machines have support for AVX2 either :I

oh well! Hopefully it's still useful info for somebody.

DataDrake triaged this task as Normal priority. Sep 6 2019, 2:36 PM
DataDrake moved this task from Backlog to Improvement on the Software board.
DataDrake added a subscriber: DataDrake.

We should probably enable the Haswell libs at least. Unfortunately Haswell is the closest match for modern architectures because of weirdness in the ISAs between Haswell and Core2.

serebit added a subscriber: serebit. Edited Sat, Sep 12, 2:09 AM

Tested a bit with O3 and AVX2. AVX2 helps a *lot*, O3 doesn't seem to help much if at all. Here's the table.

AVX2 clearly helps quite a bit with larger FFT sizes, but has a near-zero impact on smaller sizes such as ibc64. O3 is within variance, based on my testing. It may help more with LTO, but I don't think LTO is actually being applied here. If I figure out how to run a fully linked FFTW directly, I'll post the results here.
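To make "within variance" concrete, repeated runs of two builds can be compared along these lines. This is a rough sketch, not the method used for the results above: the rule of calling a difference real when the gap between means exceeds the sum of the two standard deviations is an assumption, the function name is hypothetical, and the numbers in the example are made up purely to show the call.

```python
from statistics import mean, stdev


def compare(baseline, candidate):
    """Compare two lists of run times (lower is better).

    Returns (speedup, outside_noise), where speedup is the ratio of mean
    times and outside_noise says whether the gap between the means is
    larger than the combined run-to-run spread of both builds.
    """
    b_mean, c_mean = mean(baseline), mean(candidate)
    speedup = b_mean / c_mean
    noise = stdev(baseline) + stdev(candidate)
    return speedup, abs(b_mean - c_mean) > noise


if __name__ == "__main__":
    # Hypothetical timings, not measurements from this task.
    generic = [10.0, 10.2, 9.8]
    avx2 = [5.0, 5.1, 4.9]
    print(compare(generic, avx2))
```

Each list needs at least two runs, since the standard deviation is undefined for a single sample.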

This is with --enable-float, by the way.

Thanks for your hard work with testing @serebit. Looks really promising 🙂
Were you testing with -ffast-math and the other flags as well?

Here are the flags I was running with.
All three: -g2 -pipe -fPIC -Wformat -Wformat-security -D_FORTIFY_SOURCE=2 -fstack-protector-strong --param=ssp-buffer-size=32 -fasynchronous-unwind-tables -ftree-vectorize -feliminate-unused-debug-types -Wall -Wno-error -Wp,-D_REENTRANT
AVX2: -mtune=haswell -march=haswell -O3
Generic: -mtune=generic -march=x86-64 -O2
Optimized: -mtune=generic -march=x86-64 -O3 -flto

I'm doing some test passes with fastmath and the other additional Clear flags. It seems that the difference between Opt and Gen with smaller FFTs is nonexistent, but with larger FFTs like 4096x4096, there is a slight benefit that's outside of variance. I'll continue testing.

Yeah, I'm getting no benefit from Clear's flags over the ones we're currently using. AVX2 is still a pretty substantial improvement, though, so I'll push a patch adding Haswell AVX2 builds.

I created a patch for using AVX2. All the reverse-dependent benchmarks I've used *are linked against the haswell lib*, but they're getting no performance benefit from it. I've tested gmic, gromacs, hugin, and octave through Phoronix (all of which do use system FFTW).