About

What's a benchmark?

In short, a performance test. Typically a benchmark is a program that executes a specific workload, and measures how long that takes -- or, conversely, how much work gets done in a given time span.
The idea is noble: figure out which of several competing solutions is "better" by running a test that produces a score. Having a score makes it dead easy to compare: 300 points is obviously better than 200 points! Presumably this test should be created by independent experts who know what they're doing, so the resulting score provides an accurate summary of a much more complex reality.
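
To make the "measure how long a workload takes" idea concrete, here is a minimal sketch of what a timing benchmark boils down to, written in Python. The workload (sorting a list of random numbers) and the repetition count are arbitrary placeholders chosen just for illustration, not something this blog endorses as a meaningful test:

```python
import random
import time

def workload():
    # Placeholder workload: sort a list of random numbers.
    data = [random.random() for _ in range(100_000)]
    data.sort()

REPETITIONS = 20

start = time.perf_counter()
for _ in range(REPETITIONS):
    workload()
elapsed = time.perf_counter() - start

# Two ways to express the same measurement:
print(f"total time: {elapsed:.3f} s")                     # how long the work took
print(f"throughput: {REPETITIONS / elapsed:.1f} runs/s")  # how much work per second
```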

What's the problem with benchmarks?

The difficulty is coming up with a way to determine a meaningful score. What's really "better" is often hard to define, especially when the thing we're looking at is fairly complex and needs to handle many different tasks. Let's take a non-technical example: to figure out who's the better student, we have students write exams, which produce a score. And while some students really are smarter than others, and some exams have a tendency to reflect this, you'd probably agree that most exams you've ever written have felt both unfair and irrelevant to real life. Has any exam ever measured, once and for all, how good your overall performance is? Don't most of them just test whether you know what the teacher wants to hear? Isn't it really easy for a teacher to give you a crappy exam and get away with it?

Let's turn the tables: how would you design an exam that reliably tells you who's a smart person? Would you ask math questions, or trivia knowledge, or logic puzzles, or creative writing questions? Some of each? How much of each?

Computers are, at least in this regard, similar to people. So are their individual components, both hardware and software. Whether you want to know which laptop or phone is the best you can get for your money, whether you want to upgrade your graphics card or choose a fast web browser: all of these perform so many different tasks that it's hard to come up with a useful "exam" for them. It's easy to take any of these devices, give it something to do, and measure how long that takes. But that doesn't mean the result is relevant; relevance depends on the chosen workload, as well as a bunch of other factors.


What's a bad benchmark?

Generally speaking, a benchmark is bad when it produces a meaningless score. There can be different reasons for that:
  • maybe the workload it runs is too small to be relevant for what you're really interested in. For example, the benchmark might only run one tiny task over and over again, when you wanted an overall picture. 
  • as an extreme case, if a benchmark only exercises one specific feature, it's easy for a program to score high on the benchmark even though it's not that great on the whole. For example, imagine a calculator benchmark that only measures addition. A calculator that has super fast addition, but really slow subtraction, multiplication, division and so on, would undeservedly win a comparison based on this benchmark (see the sketch after this list).
  • maybe the workload is synthetic (i.e. written just for this benchmark) and doesn't reflect what real programs are doing. For example, the benchmark might solve a common problem in a very inefficient way, whereas real software uses more clever approaches. Or maybe it is just doing something abjectly stupid.
  • even when a real application is taken as the benchmark's workload, it might be measured in a way that doesn't reflect actual user experience. For example, as a user you probably care about both how fast a program starts up, and how well it runs after having started up; a benchmark can accidentally focus on one of those aspects and neglect the other.
  • maybe the benchmark is simply outdated. For example, it may be based on an old version of some software that nobody is using any more because it has been replaced by newer versions that have completely different performance characteristics. Or maybe it measures something that used to be challenging years ago but is a piece of cake for modern systems.
  • maybe the benchmark has a bug and doesn't measure what it's intended to measure. For example, maybe it checks the wrong condition, terminates too early, and leaves out an interesting part of the workload.
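
To illustrate the single-feature problem from the calculator example above, here is a hypothetical sketch: two made-up "calculators", one fast only at addition, and a benchmark that only times addition and therefore crowns the wrong winner. All class names and timings are invented purely for illustration:

```python
import time

class FastAddCalculator:
    """Hypothetical: blazing addition, terrible at everything else."""
    def add(self, a, b):
        return a + b
    def multiply(self, a, b):
        time.sleep(0.001)  # pretend multiplication is very slow
        return a * b

class BalancedCalculator:
    """Hypothetical: decent at everything."""
    def add(self, a, b):
        time.sleep(0.00001)
        return a + b
    def multiply(self, a, b):
        time.sleep(0.00001)
        return a * b

def addition_only_benchmark(calc, n=10_000):
    # Flawed by design: exercises only one feature of the calculator.
    start = time.perf_counter()
    for i in range(n):
        calc.add(i, i)
    return time.perf_counter() - start

for calc in (FastAddCalculator(), BalancedCalculator()):
    print(type(calc).__name__, f"{addition_only_benchmark(calc):.3f} s")

# The addition-only score favors FastAddCalculator, even though real use
# (which also multiplies, divides, ...) would favor BalancedCalculator.
```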

All that said, it's honestly not always easy to define whether a given benchmark is good or bad. A lot of that verdict comes down to whether you think its workload is relevant, which in some cases might be a clear-cut decision, but in other cases certainly can be argued about.

Why is it bad to have bad benchmarks?

Firstly, because benchmark results affect decisions that people make. If a misleading benchmark, well, misleads you into thinking that one choice is better than another, when in reality the opposite is true, then that's obviously bad. Since you're looking at data as a basis for your decision in the first place, it would be better if that data was trustworthy, right?

Secondly, because benchmark results are important for marketing, which in turn is important for a product's success in the market, so manufacturers/developers care very much about them. That means: they'll look at popular benchmarks, and specifically optimize their product for what those benchmarks are testing, instead of spending their time working on something else that might be more relevant to the actual user experience of their customers. And that's the good case -- the bad case is when they even build features into their products that will improve benchmark scores at the expense of working well for regular usage scenarios; for example such tricks might increase memory consumption unnecessarily.

Why do bad benchmarks exist?

Probably because when creating a benchmark, just like when building any other software, it's easier not to care too much about how good it is. Put differently, anybody can create a crappy benchmark, just like anybody can write crappy blog posts!
And also, because it's hard for others to verify whether a benchmark is good or bad: most of the time, as long as it produces a somewhat-believable score, why would you take the time to look at how that score was determined? After all, the whole point of having benchmarks is so you can easily compare how good things are even without understanding all the details of what exactly it means to be "good".

In this sense, the way many people (even reputable websites) run benchmarks seems a bit like sending all the smart students (see above) into a windowless room with two doors, and measuring how soon they'll walk out the other door. Since there are no windows, nobody knows what the poor students are facing in that room. An exam? A pitch black labyrinth, full of traps and dragons? Or just pizza and popcorn?

What's the purpose of this blog?

To fight bad benchmarks! By finding them, analyzing them, and exposing them. The aim is to provide objective, verifiable, independent information about what popular benchmarks are actually doing under the hood, so people can make an informed decision about which benchmarks to trust, and which to ignore.

Who are you?

Doesn't really matter, as this blog isn't about me, but about verifiable technical facts. I'm just a guy who likes his computer to be fast, and who likes benchmarks to be meaningful. I have a bit of a background in software development, including performance measurements, so hopefully I know what I'm talking about here!
