Phoronix's conclusions distort its results: the example of GCC vs. LLVM/Clang on AMD's FX-8350 Vishera

Phoronix recently benchmarked GCC vs. LLVM on AMD hardware [1]. Sadly, their conclusion did not fit the data they showed. In fact, it misrepresented the data so badly that I decided to speak up here instead of letting my comments disappear in their forums [2]. I started this post on 2013-05-14 and updated it as things changed: first for the better, then for the worse.

Update 3 (the last straw, 2013-11-09): In Phoronix's most recent and most blatant attack on copyleft programs, this time openly targeted at GNU [3], Michael Larabel directly misrepresented a post by Josh Klint in order to badmouth GDB (Josh confirmed [4] this¹). In a Kickstarter update [5], Josh reported on his initial experience with GDB, listed some shortcomings he saw (the major gripe is easily resolved with better documentation²) and concluded: “the limitations of GDB are annoying, but I can deal with it. It's very nice to be able to run and debug our editor on Linux”. Michael Larabel quoted the conclusion only up to “annoying” and used it to support the claim that game developers (in general) call GDB “crap”, and to badmouth GDB further. With this he provided the last straw I needed to stop reading Phoronix: Michael Larabel is hostile to copyleft, and to GNU in particular, and he goes as far as rigging test results³ and misrepresenting the words of others to further his agenda. I even donated to Phoronix a few times in the past; I guess I won’t do that again either. I should have learned from the mistake of the German Pirates and avoided reading media controlled by people who want to destroy what I fight for (sustainable free software).
Update 2 (2013-07-06): But the next report [6] went down the drain again: “Of course, LLVM/Clang 3.3 still lacks OpenMP support, so those tests are obviously in favor of GCC.” I could hardly imagine a more efficient way to declare those tests completely useless while at the same time devaluing OpenMP support, as in “ignore this result along with all others where GCC wins”…
Update (2013-06-21): The recent report of GCC 4.8 vs. LLVM 3.3 [7] looks much better. Not perfect, but much better.

Excluding the OpenMP benchmarks (where GCC naturally won, because LLVM runs those tests single-threaded) and the build times (which are irrelevant to the speed of the produced binaries), their benchmark [1] gave the following results:

LLVM is slower than GCC by:

10.2% (HMMer)

12.7% (MAFFT)

6.8% (BLAKE2)

9.1% (HIMENO)

42.2% (C-Ray)
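To put these five numbers into one picture, here is a small Python sketch that summarizes them. It assumes that “slower by p%” means LLVM's runtime on that test was (1 + p/100) times GCC's runtime (the benchmark summary does not spell this out), and it uses the geometric mean, a common way to average such ratios.

```python
from math import exp, log
from statistics import median

# Slowdown of LLVM/Clang relative to GCC, in percent, from the list above.
# Assumption: "slower by p%" means t_llvm = (1 + p/100) * t_gcc.
slowdowns = {
    "HMMer": 10.2,
    "MAFFT": 12.7,
    "BLAKE2": 6.8,
    "HIMENO": 9.1,
    "C-Ray": 42.2,
}

def geometric_mean(values):
    """Geometric mean of positive numbers, computed via logarithms."""
    return exp(sum(log(v) for v in values) / len(values))

# Runtime ratios t_llvm / t_gcc for each test.
ratios = [1 + p / 100 for p in slowdowns.values()]
geo = geometric_mean(ratios)

print(f"tests where LLVM is slower: {sum(r > 1 for r in ratios)} of {len(ratios)}")
print(f"median slowdown: {median(slowdowns.values()):.1f}%")
print(f"geometric-mean slowdown: {(geo - 1) * 100:.1f}%")
```

For these values it reports LLVM slower in 5 of 5 tests, a median slowdown of 10.2% and a geometric-mean slowdown of about 15.5%. The median is shown alongside the mean because the single C-Ray result (42.2%) pulls the mean up; even the middle test leaves LLVM about 10% behind.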

With these results (which were clearly visible in their result summary on OpenBenchmarking [8]), Michael Larabel from Phoronix concluded:

» The performance of LLVM/Clang 3.3 for most tests is at least comparable to GCC «

Nobu [9] from their forums supplied a conclusion which represents the data much better:

» GCC is much faster in anything which uses OpenMP, and moderately faster or equal in anything (except compile times) which doesn't [use OpenMP] «

But Michael from Phoronix did not stop at ignoring the performance difference between GCC and LLVM. He went on to claim that

» In a few benchmarks LLVM/Clang is faster, particularly when it comes to build times. «

This is blatant reality distortion, which I am very tempted to ascribe to favoritism. LLVM is not “particularly” faster when it comes to build times.

LLVM on AMD FX-8350 Vishera is faster ONLY when it comes to build times!

This was not the first time I read data-distorting conclusions on Phoronix, and my complaints about that in their forum did not change their behavior. So I hope that this post can help make them aware that deliberately distorting test results is unacceptable.

For my work, compiler performance actually matters a lot, because I use programs which run for days or weeks, so a 10% reduction in runtime can save several days, not counting the cost of the cluster time used up.
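As a worked illustration, assume (purely hypothetically) a run that takes three weeks (21 days) with the GCC binary, and that the LLVM binary shows the same 10.2% slowdown as in the HMMer test:

```latex
t_{\text{LLVM}} = 1.102 \times t_{\text{GCC}} = 1.102 \times 21\,\text{d} \approx 23.1\,\text{d},
\qquad
t_{\text{LLVM}} - t_{\text{GCC}} \approx 2.1\,\text{d}
```

So a single such run would cost about two extra days, and the difference grows with every additional run on the cluster.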

To fix their blunders, they would have to:

Avoid benchmarks which only one compiler supports properly (OpenMP).
Mark the compile-time tests explicitly, so they clearly stand out from the rest, because they measure a completely different parameter than the other tests: compiler runtime vs. performance of the compiled binaries.
Write conclusions which actually fit their results.

Their current approach puts GCC at a distinct disadvantage (even in the OpenMP tests, because they convey the notion that if only LLVM had OpenMP, it would be better at everything, which, as this test shows, is simply false), so Phoronix's compiler tests work as covert propaganda against GCC, even in tests where GCC flat-out wins. I already dislike open propaganda, but when propaganda is disguised as objective testing, I get angry.

I hope my post here can help move them towards doing proper testing again.

PS: I write so strongly here because I actually like the tests from Phoronix a lot. I think we need more testing rather than less, and their test suite actually seems to do a good job when given the right parameters, so seeing Phoronix distort the tests to the point where they become almost useless (except as a political tool against GCC) is a huge disappointment to me.

¹ Josh Klint from Leadwerks confirmed that Phoronix misrepresented his post and wrote a follow-up post [10]: » @ArneBab That really wasn't meant to be controversial. I was hoping to provide constructive feedback from the view of an Xcode / VS user.« » Slightly surprised my complaints about GDB are a hot topic. I can make just as many criticisms of other compilers and IDEs.« » The first 24 hours are the best for usability feedback. I figure if they notice a pattern some of those things will be improved.« » GDB Follwup [10] « (@Leadwerks [11], 2:04 AM - 11 Nov 13 [4], 2:10 AM - 11 Nov 13 [12] and @JoshKlint [13], 2:07 AM - 11 Nov 13 [14], 8:48 PM - 11 Nov 13 [15]).
² The first-impression criticism [5] from Josh Klint was addressed by a Phoronix reader who pointed to the frame command [16]. I do not blame Josh for not knowing every trick: he wrote a fair account of his initial experience with GDB (he later said that he wrote the post after less than 24 hours of using GDB, because he considers that the best time to provide feedback), and his experience can serve as constructive criticism to improve the tutorials, documentation and UI of GDB. Sadly, his visibility and the potential impact of his work on free software made it possible for Phoronix to abuse a personal report as support for a general badmouthing of the tool. In contrast, Josh Klint's full message ended on a very positive note: » Although some annoyances and limitations have been discovered, overall I have found Linux to be a completely viable platform for application development. « (Josh Klint, Leadwerks)
³ I know that rigging tests is a strong claim. The actions of Michael Larabel deserve to be called rigging for three main reasons: (1) including compile-time data along with runtime performance without clearly distinguishing between them, even though the compile time of the full code is mostly irrelevant when you use a proper build system, and compile time and runtime are completely different classes of results; (2) including pointless tests between incomparable setups whose only use is to play down any weakness of his favorite system; and (3) blatantly lying in the summaries (as I show in this article).