
Source package for OpenBLAS
Closed, Resolved · Public

Description

As a complement, I made a simple benchmarking project so you can easily compare the performance of the pre-built binaries against ones built from source, if you want to check for yourselves:
https://github.com/JPenuchot/lib-cmp
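
For reference, here is a rough sketch (in C, not the actual lib-cmp code) of the kind of timing comparison such a project enables: a single cblas_dgemm call is timed, and the same program is linked against each OpenBLAS build in turn. The matrix size and timing approach are just illustrative choices.

    /* Rough sketch of a GEMM timing comparison (not the actual lib-cmp code).
     * Build and link against the OpenBLAS you want to measure, e.g.:
     *   gcc -O2 gemm_bench.c -o gemm_bench -lopenblas
     */
    #include <cblas.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        const int n = 2048;                           /* square matrix size */
        double *a = malloc(sizeof(double) * n * n);
        double *b = malloc(sizeof(double) * n * n);
        double *c = malloc(sizeof(double) * n * n);
        for (long i = 0; i < (long)n * n; i++) {
            a[i] = (double)rand() / RAND_MAX;
            b[i] = (double)rand() / RAND_MAX;
            c[i] = 0.0;
        }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        /* C = 1.0 * A * B + 0.0 * C, row-major, no transposition */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, a, n, b, n, 0.0, c, n);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("dgemm %dx%d: %.3f s (%.1f GFLOP/s)\n",
               n, n, secs, 2.0 * n * n * n / secs / 1e9);

        free(a); free(b); free(c);
        return 0;
    }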

Event Timeline

We are using that repo as our upstream: http://github.com/xianyi/OpenBLAS/archive/v0.2.19.tar.gz

Here is our package.yml:
https://dev.solus-project.com/source/openblas/browse/master/package.yml;e918d9c40d2e64048900d165572968bc224ffa9c$22

You'll note that we are explicitly building with TARGET=CORE2. If this is not set, the build process will use every single instruction set it can to optimize the performance of the operations. However, this also means that any machine that does not have these instruction sets will get hammered by invalid opcode errors. CORE2 is the current safe option for any machine manufactured in the last 10 years. You will also note that optimize: speed is set, which enables -O3 for GCC, and that avx2: yes is set, which enables AVX2 instructions for any system that has the right cpuid flag.
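
As a side note, here is a minimal, purely illustrative sketch of what checking that cpuid flag looks like at run time, using GCC's __builtin_cpu_supports. This is not how ypkg or the dynamic loader actually selects the AVX2 library variant; it just demonstrates the feature check.

    /* Illustration only: check the AVX2 cpuid flag at run time with GCC's
     * builtin. This is not how the packaging or the loader picks the AVX2
     * library variant; it just shows what "has the right cpuid flag" means.
     *   gcc -O2 cpucheck.c -o cpucheck
     */
    #include <stdio.h>

    int main(void)
    {
        __builtin_cpu_init();                 /* populate CPU feature data */
        if (__builtin_cpu_supports("avx2"))
            puts("AVX2 present: an avx2-enabled library build can be used");
        else
            puts("No AVX2: only the baseline (CORE2-targeted) build is safe");
        return 0;
    }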

So, is our OpenBLAS slower than the one you are building? Probably.
Can we improve that without breaking compatibility with older x86_64 architectures? Probably not.

You didn't get it: adding a source package would mean building it at installation time, so it would just compile for the machine it's being installed on. Using an unoptimized version of OpenBLAS is completely useless, or even counterproductive, in a lot of cases.

You didn't get it: adding a source package would mean building it at installation time, so it would just compile for the machine it's being installed on.

At no point did you indicate that you were suggesting we build this at install time on a user's machine. The only thing I would get out of your initial mention is "Hey guys, your version of OpenBLAS is slower than the one I just built from source", which is of course going to be true, because OpenBLAS enables every possible set of optimized instructions it can get its hands on when it is compiled. However, there are a few things that you need to understand about why things are the way they are:

  1. You bringing this up has sparked discussion about the AVX2 version of this library that a newer machine like yours will use. Namely, that simply enabling AVX2 isn't good enough, and that we need to raise that set of compile flags to -march=haswell in order to get more comparable performance (see the sketch after this list). We are actively working on that; it was discovered after my last post.
  2. We are not Gentoo. We are a binary distribution and as such we /never/ package sources for the purposes of either local builds or building on install. Expecting our users to wait over an hour for a typical 2-4 core machine to compile OpenBLAS every time it is updated is totally unreasonable. This is why we install two versions of the library: one for older machines, and one for newer machines that have the AVX2 instruction set, so people with newer hardware can leverage the AVX2 library. The problem in this case is that the AVX2 library needs more extensions enabled than we previously expected, which is why you are seeing lower performance on your AVX2-capable machine. It is also why we are working to fix that.
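
To illustrate the flag difference mentioned in the first point, here is a small, hypothetical C example: the same trivial loop compiled with -mavx2 alone versus -march=haswell (which additionally enables FMA, BMI1/2 and friends) ends up using different instructions. The file name and compile commands are only for illustration.

    /* Hypothetical example: the same loop, compiled two ways.
     *   gcc -O3 -mavx2         saxpy.c -o saxpy-avx2       (AVX2 only)
     *   gcc -O3 -march=haswell saxpy.c -o saxpy-haswell    (AVX2 + FMA + BMI ...)
     * Disassembling with objdump -d shows fused multiply-add (vfmadd*)
     * instructions only in the -march=haswell build.
     */
    #include <stddef.h>
    #include <stdio.h>

    /* y = a*x + y; -march=haswell lets GCC fuse the multiply-add into FMA */
    void saxpy(size_t n, float a, const float *x, float *y)
    {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        float x[1024], y[1024];
        for (size_t i = 0; i < 1024; i++) { x[i] = 1.0f; y[i] = 2.0f; }
        saxpy(1024, 0.5f, x, y);
        printf("%f\n", y[0]);
        return 0;
    }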

Using an unoptimized version of OpenBLAS is completely useless, or even counterproductive, in a lot of cases.

The most common case of using OpenBLAS is not high-performance computing. It's simple scripts written in Octave, run by students for classes, or some random program using it to implement a quick bit of matrix math. That being said, I worked for over a year providing software support for my university's Research Computing department. I am well aware of how OpenBLAS is used in HPC, and of the limitations when you are running it in a cluster environment with many different machine architectures from different vendors and time periods. And I can tell you that not a single one of our users cared enough to build a faster version of OpenBLAS from scratch. Why? Because our CORE2 version was good enough for 100% of their use cases.

Alright, you convinced me. Thanks for your time!

Long story short, @sunnyflunk has done further testing and can make the existing one go moar faster :)

Note that this now produces code that runs identically to a source build on my Skylake machine.

In future (post glibc 2.26), I'll be looking at providing more optimized libraries up the whole stack, to take advantage where it helps most (openblas being a top example). The avx2 experiment was very limited, as many of the gains came from the non-AVX2 CPU instructions enabled along with it.

@JPenuchot If you are interested in helping in any way, be it identifying packages that will benefit from advanced CPU instructions or finding benchmarks to validate that they do, let me know. Once a few of the current tasks are done and dusted, this will become one of my main focuses (along with generating more PGO packages).