2015-07-29

Browsermark: String-Filter

Still following the chain of Browsermark's "next_test" links, we arrive at the String-Filter test at http://web.basemark.com/tests/2.1/benchmarks/javascript/string_filter/test.js. As usual, the file appreciates beautification in order to achieve its full human readability potential.

The test takes a list of 250 movie titles beginning with "The Shawshank Redemption (1994)" (I wonder where they got that from?) and tests them against a series of regular expressions of the form "(1900)", "(1901)", and so on. Apart from that list, the entire benchmark is just this:

[1]  results[c] = [];
[2]  var e = new RegExp("/.*(" + (1900 + (iteration_index % 100)) + ").*/i");
[3]  for (var d = 0; d < this.movies.length; d++) {
[4]    if (e.test(this.movies[d])) {
[5]      results[c].push(this.movies[d]);
[6]      counter++
[7]    }
[8]  }

So we have another micro-benchmark on our hands: it tests creation of regular expression objects ("new RegExp(...)" in line 2) and calling a RegExp's .test() function on a string ("e.test(...)" in line 4).

An interesting (to put it politely) choice is the placement of the "counter++" statement in line 6, inside the if-block. The test isn't counting the number of RegExp objects created, or the number of .test() calls performed, but the number of matches returned by those .test() calls. With enough iterations, this is harmless, because the same input strings are looked at over and over again, so even though some iterations will have zero hits and some have many, in the long run the number of encountered hits will be proportional to the number of tests performed. However, on a slower device, or if the string set were larger or non-repeating, this would be a source of considerable skew.
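For illustration, here's a minimal sketch (my code, not the benchmark's) of how the loop could count the work actually performed instead of the matches found:

    results[c] = [];
    var e = new RegExp("/.*(" + (1900 + (iteration_index % 100)) + ").*/i");
    for (var d = 0; d < this.movies.length; d++) {
      if (e.test(this.movies[d])) {
        results[c].push(this.movies[d]);
      }
      counter++;  // one increment per .test() call, hit or miss
    }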

There isn't much else to say about this test. It does exhibit a few of the bad ideas we've seen in other Browsermark tests, but due to sheer luck they're not that hurtful in this case: since testing 250 movie titles against a regexp isn't quite as short-running as the other tests, the "results" list doesn't grow quite as much, and taking the current time after each iteration also produces less overhead in relation to the workload.

Conclusion:
It's a boring micro-benchmark for "new RegExp" and "RegExp.test()". There are probably very few real situations where the performance of these two operations in isolation matters, so this is a far cry from Browsermark's claim to "Measure Real-Life Browsing Performance".

2015-07-28

Browsermark: String-Chat

Onwards with our journey through Browsermark's JavaScript tests! Next stop: http://web.basemark.com/tests/2.1/benchmarks/javascript/string_chat/test.js.

The high-level summary is that this test has a list of "messages" ("Lorem ipsum" strings devoid of any meaning, about 40 to 80 characters long), and a set of ten banned words, and then it deletes all occurrences of those banned words in the messages as often as it can until four seconds have elapsed, storing all the resulting modified messages in one huge list.

This tiny snippet of code is the entire benchmark (minus the static data); let's take a closer look:

[1]  modifiedMessages[iteration_index] = [];
[2]  for (var e = 0; e < this.messages.length; e++) {
[3]    if (this.messages.length % (e + 1) == 0) {
[4]      counter++;
[5]      for (var d = 0; d < this.bannedWords.length; d++) {
[6]        modifiedMessages[iteration_index].push(
[7]            this.messages[e].replace(this.bannedWords[d]))
[8]      }
[9]    }
[10] }

Line 7 is where the action is: "banned words" are replaced with nothing. So this entire benchmark is just a micro-benchmark for JavaScript's String.replace() function, and more specifically for the special case where there is no second argument (which usually would be the replacement string). I guess if they had called it "String-Replace" that would have sounded boring, and if they had called it "String-Censorship" people would have questioned their moral values, so they called it "String-Chat", which is sure to make you think "yeah, right, chatting is something I do on the internet, so this benchmark must be relevant!".
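To make that special case concrete, here's what String.replace does when the second argument is omitted (these examples are mine, not from the test):

    // The missing replacement defaults to undefined, which gets
    // converted to the string "undefined":
    "no spam here".replace("spam");    // -> "no undefined here"
    // And if the search term isn't found, the input comes back unchanged:
    "clean message".replace("spam");   // -> "clean message"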

When it comes to benchmarking, a good rule of thumb is that when a benchmark does all its interesting work in just one line, it can be disregarded on general principle. One-line benchmarks are so far away from any useful program that they're irrelevant -- the only possible exception is when you're working on the respective execution platform (in this case, the JavaScript engine), and today you're focused on that one feature tested in that one line. That said, ending this post here seems premature -- who knows how much fun we'd be missing out on? Let's keep looking.

One consequence of the workload being so small is that, similar to what we saw on Array-Weighted, just measuring how much time has elapsed after each iteration amounts to about 7% of the overall time spent on this test.

Line 6 appends all the results of all the replacements to the "modifiedMessages" list created in line 1. Presumably this is done to prevent JavaScript engines from cleverly deducing that the result of the replacement is never used, and hence optimizing away the entire replacement operation. The problem, however, is that the modifiedMessages list becomes huge! In my tests, it grew to over 900 MB, and according to the profiler, the poor JavaScript engine spent about 20% of the overall time in its garbage collector attempting to free up some space; of course that was mostly futile, as this huge list kept millions of strings reachable, so the GC couldn't collect them.
It's a safe assumption that the more memory is consumed, the more time an engine will spend trying to free it (firstly because that's what you typically want as a user, and secondly because garbage-collecting a larger heap has to take more time as it is more work). In this test, the faster a computer (hardware + JS engine) is, the more memory will be consumed for those results. So the end result is a benchmark that slows itself down the faster it runs. What genius came up with this?
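For what it's worth, keeping results observable doesn't require retaining all of them. A common trick -- sketched here under the assumption that a cheap summary is enough to defeat dead-code elimination -- is to accumulate a checksum instead:

    // One long-lived number instead of millions of long-lived strings:
    var checksum = 0;
    for (var d = 0; d < this.bannedWords.length; d++) {
      var replaced = this.messages[e].replace(this.bannedWords[d]);
      checksum += replaced.length;  // consumes the result without retaining it
    }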

Line 5 iterates over the "banned words". Nothing wrong with that on a technical level, until you look at the words, and at the messages, and remember that String.replace is case sensitive, and you realize that two of the ten banned words will never match because they only occur capitalized. That seems unintentional, so that's a rather amusing face-palm moment.

Line 2 iterates over the messages. Nothing wrong with that on a technical level, until you look at line 3, and start scratching your head. What does it do? It makes sure the body of the loop is only executed if the current message's one-based index (e + 1) divides the number of messages without remainder. Uhm... why does it do that? I haven't the faintest idea. I can't even imagine what original intention went wrong here to lead to this. If you had four messages, there'd be absolutely no reason for this test to ignore the third (seeing that only 1, 2, and 4 are divisors of 4).
So how many messages are there? 41. Wait. Forty-one! That's a prime number! That doesn't have any divisors, except for 1 and itself! Only the first and the last message are ever looked at! You could literally drop half of the source code of this test, and it would still do exactly the same amount of work. Also, with so many messages being ignored, only two of the ten banned words will ever match. This is beyond face-palm territory. This test is falling flat on its face here, suffering a fractured nose and a severe concussion. Ouch.
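If you don't believe it, you can verify this in any JavaScript console:

    // Which of the 41 message indices survive the divisor check?
    for (var e = 0; e < 41; e++) {
      if (41 % (e + 1) == 0) console.log(e);
    }
    // Prints 0 and 40: only the first and the last message.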

Conclusion:
Micro-benchmark for "String.replace(search_term, replacement)" in the special case where replacement is not specified (so defaults to 'undefined') and search_term is never found in the input string. The overall code is so braindead that I would reject a summer internship candidate who submits this in their application.

2015-07-27

Browsermark: Array-Weighted

Let's follow Array-Blur's "next_test" link, which leads us to http://web.basemark.com/tests/2.1/benchmarks/javascript/array_weighted/test.js. Like before, running this code through a beautifier renders it much more readable.

This is a great example of completely artificial code: it doesn't do anything useful or compute any result; it just plays around with strings and arrays like a kid in a sandbox, forming a shape and destroying it to form another, and so on. See for yourself:

First we have "countries", an array of strings, where each string is a comma-separated list of country names. The fact that they're country names is completely irrelevant; they might as well be random strings.
var countries = [
  "Afghanistan,Albania,...",
  "Bahamas,Bahrain,...",
  ...]
Now for the test's main and only function:
run: function(stop_now, iteration_index) {
Pick an index into the "countries" array...
  var e = iteration_index % countries.length;
...retrieve the string there...
  var o = countries[e];
...and split it at its commas, so "d" is now an array of country names.
  var d = o.split(",");
Initialize two empty arrays "k" and "j", and temporary variable h, which is initialized to the empty string, but that's irrelevant because it will get overwritten two lines down anyway.
  var k = [];
  var j = [];
  var h = "";
Now they use jQuery's each function to iterate over the array "d". As jQuery's documentation describes in detail, this function takes an array and a callback as arguments, and will invoke the callback for every element of the array, passing in the current index and the element at that index. So typically that would look like: $.each(array, function(index, element) { do_something_with(element); });. Of course, since JavaScript doesn't force a callback to declare the parameters it is called with, they can choose to ignore this advice:
  $.each(d, function() {
And instead of taking an "element" argument in the callback, pop the first element off the array. Note that this is really expensive, because the JavaScript engine must move all remaining elements forward to fill the gap.
    h = d.shift();
Next, they copy out the first two and the first three characters of the country name:
    c2 = h.slice(0, 2);
    c3 = h.slice(0, 3);
And append them to either of the two arrays created above, if it doesn't contain the new entry yet. Note that this, again, is fairly expensive, because ".indexOf" must look at every existing element of the array.
    if (k.indexOf(c2) == -1) {
      k.push(c2)
    }
    if (j.indexOf(c3) == -1) {
      j.push(c3)
    }
Finally, the element that was previously popped off the array's beginning is appended back to its end, so that after all iterations have completed, the array will be exactly like it was before.
    d.push(h)
  });
Now the lists of 2-character and 3-character prefixes are copied into a new array containing both their contents:
  var c = k.concat(j);
The resulting list is then sorted. It is unclear why this is being done. Probably because we can.
  c.sort();
Then an index into the sorted prefix list is chosen.
  var g = iteration_index % c.length;
And the list of country names is reversed. (Array.reverse works in place and returns the array itself, so "m" is just another reference to "d", not a copy.)
  var m = d.reverse();
Someone felt it would be appropriate to have another temporary variable here, initialized to the empty string, although again this is irrelevant, as the value is never used for anything.
  var a = "";
Now there's one more jQuery-powered iteration, again without the callback's typical arguments...
  $.each(m, function() {
...but just to spice things up, this time around, they pop elements off the array's end. Having reversed it before, this results in the same order of processed elements as before. Which doesn't even matter for the code at hand. Whatever. The string is also split into an array of individual characters. Because you can do that with strings.
    h = m.pop().split("");
"c[g]" is a randomly selected prefix, "p" is its length (either 2 or 3).
    var p = c[g].length;
Now they could just have used "h.slice(0, p)" like above, but why go the fast and elegant route when you can take a slow detour? Having split the country name into an array of single characters two lines above, they can now use the array's "splice()" function to remove the first "p" elements and put them into a new array, which is then joined with empty separators, i.e. turned into a string again. Confused yet?
    var i = h.splice(0, p).join("");
This so laboriously computed prefix of the country name is now compared to the randomly selected prefix, and if they are equal...
    if (i == c[g].toString()) {
...then the rest of the country name is converted back to a string and appended to the prefix. Which gives us back the full country name. It's identical to the result of "m.pop()" above. Cool, we've run in a circle. Except nobody cares. The result is just assigned to the variable "a", which nobody will ever look at.
      a = i + h.join("")
    }
  });
That's it! All done! And since the test has just done maybe about roughly 13-ish things or so, the counter should obviously be incremented by 13. Duh!
  counter += 13;

I still haven't found the "weights" that the test's name "Array-Weighted" led me to believe there would be.

As far as I can see, it tests Array.splice, Array.shift, Array.sort, Array.concat, and String.split on completely contrived examples. I can see how benchmarking such library functions could be considered reasonable; on the other hand all these functions are necessarily fairly expensive (just because of what they do, by definition), so code that cares about performance would typically avoid using them anyway, which in my view drastically reduces the practical relevance of this benchmark.
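To illustrate the "slow detour" from the walkthrough above, here is the test's prefix computation next to the direct equivalent (variable names are mine):

    var name = m.pop();
    // The test's detour: string -> array of characters -> splice -> join:
    var slow = name.split("").splice(0, p).join("");
    // The direct route, producing the exact same string:
    var fast = name.slice(0, p);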

Here's another interesting observation: this test runs its main function (the one we analyzed above) as often as it can until 4 seconds have elapsed. Generally, that's a relatively sound approach to benchmarking, at least if you intend to measure maximum throughput. However, even though what the function does is fairly complex and inefficient, it's still so fast that checking whether 4 seconds have elapsed yet after every iteration takes more than 4% of the entire benchmark's time. How do I know this? Easy: I've modified the benchmark runner to do 1000 iterations of the test before checking again how much time has passed. Doing so increased the score by more than 4% (averaged over 5 runs).
Sure, you can say that 4% isn't a huge deal. However, it is clearly more overhead than I would expect from a benchmark made by a company that claims to be experts at benchmarking. Ask a professional NASCAR or Formula One driver what they would say if whatever system measured their time per lap also slowed them down by 4%!
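In case you're wondering what that modification looked like: roughly like this (a hypothetical sketch; the real runner's structure and names differ):

    var start = Date.now();
    while (Date.now() - start < 4000) {  // check the clock...
      for (var i = 0; i < 1000; i++) {   // ...only once per 1000 iterations
        test.run(false, iteration_index++);
      }
    }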

Conclusion:
I've tried to understand what this test is good for, or what typical use-case it is representative of. I've tried really hard. I couldn't. It just seems pointless.

2015-07-26

Browsermark: Array-Blur

At a high level, the array-blur benchmark takes an image represented as an array of color values, and repeatedly applies a blurring filter to it. I can't remember ever consciously wanting my browser to blur an image for me, but Browsermark seems to believe it's representative of what browsers do. OK, let's give them the benefit of the doubt and study the code in more detail.

This test's source is located at web.basemark.com/tests/2.1/benchmarks/javascript/array_blur/test.js, running that through a formatter like jsbeautifier.org makes it considerably more readable.

The test's initial setup consists of a single call to initialize a "Bitmap" array with the predefined data, a width, and a height:

    this.b = new Bitmap(this.data, 100, 53);

The code of the Bitmap class shows that this class just wraps the "data" array, and interprets its contents as [p1_red, p1_green, p1_blue, p1_alpha, p2_red, p2_green, p2_blue, p2_alpha, ...], in short: there are four color channel values for each pixel. Right there is where we see the first major issue: "data" contains 1500 entries, but 100 * 53 * 4 = 21200. Only about 7% of the required array elements are present! In a sane programming language, this would cause crashes or errors. In JavaScript, however, missing entries in an array are simply "undefined". The JavaScript engine running this, instead of demonstrating how well it can access array elements, will have plenty of fun detecting out-of-bounds accesses and taking the (presumably slower) code path that returns "undefined" in that case -- a code path that correct, well-written code should almost never hit!
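For reference, this is the addressing that such a layout implies (a sketch; the test's Bitmap class spells it differently):

    // Channel values of the pixel at (x, y) in a width*height RGBA bitmap:
    var offset = (y * width + x) * 4;
    var r = data[offset],     g = data[offset + 1],
        b = data[offset + 2], a = data[offset + 3];
    // For 100x53, the bottom-right pixel's alpha value sits at index
    // (52 * 100 + 99) * 4 + 3 = 21199 -- but data.length is only 1500.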

Anyway, let's see what happens when the benchmark runs. The main function (with more meaningful names chosen for the parameters) is:

    run: function(stop_now, iteration_index) {
      this.b.blur(iteration_index % 10) + 2;

      counter += (iteration_index % 10) + 1;

It calls the bitmap's "blur" function, which takes a "count" parameter for how often to apply the blurring filter:

    function blur(count) {
      for (var h = 0; h < count; h++) {
        this.applyConvolutionMatrix([1, 1, 1, 1, 1, 1, 1, 1, 1])
      }
    }


Looks good? Look again. Obviously calling blur(0) does nothing (zero iterations of the for-loop), so they try to avoid that by adding 2 to the result of "iteration_index % 10" (in case you're not familiar with it, '%' means 'modulo', a.k.a. the remainder of the division). However, the "+ 2" is after the function call's closing parenthesis! So what happens is:
  • "blur" is called with the result of iteration_index % 10 (which is 0 for one in ten values of iteration_index),
  • blur runs and returns no value (which, in JavaScript, means it returns undefined), 
  • 2 is added to this undefined return value (which, in JavaScript, is perfectly legal), 
  • the result of this addition is then ignored.
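Presumably, this is what was intended (my guess, of course -- and note that fixing it would also change the scores):

    this.b.blur((iteration_index % 10) + 2);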

In "counter", they keep track of how many blurring iterations were performed. Due to the bug above, it over-counts the amount of work done. Why do they add 1 here, as opposed to adding 2 right above? Beats me. If the parenthesis bug was fixed, this would under-count the amount of work done.

So what does applyConvolutionMatrix() do? What it should be doing is get each pixel's 3x3 neighborhood (i.e. the pixel itself and its eight direct and diagonal neighbors), and for each color channel compute the average of those nine pixels, weighted by the convolution matrix. What it actually does is retrieve the nine pixels we would expect (as an array "p") and then:

    for (var channel = 0; channel < 3; channel++) {
      for (var neighbor = 0; neighbor < 6; neighbor++) {
        n[channel] = n[channel] + matrix[neighbor] * p[neighbor][channel]
      }
      n[channel] = n[channel] / 6;

It computes the weighted average of the first six out of those nine pixels! The three neighbors below the current pixel are just ignored. Since channel only covers the values 0, 1, 2, which correspond to R-G-B, the alpha-channel is also ignored.
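For comparison, a correct weighted average over all nine neighbors and all four channels might look like this (a sketch reusing the test's variables, not the benchmark's actual code):

    for (var channel = 0; channel < 4; channel++) {       // R, G, B, alpha
      var sum = 0, weightSum = 0;
      for (var neighbor = 0; neighbor < 9; neighbor++) {  // full 3x3 neighborhood
        sum += matrix[neighbor] * p[neighbor][channel];
        weightSum += matrix[neighbor];
      }
      n[channel] = sum / weightSum;  // normalize by the total weight
    }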

Can it get any better? Well, sure! Considering that the algorithm above needs exactly the first six entries of the "matrix" array, they make sure that it is exactly six elements long:

    if (matrix.length == 6) {
      console.log("Invalid matrix.");
      console.log(matrix);
      return
    }

Except that they're doing it wrong and rejecting arrays with the correct length, whereas any given invalid length is fine! Fortunately the array that they actually do pass in has nine elements (three of which will be ignored), so this bogus check luckily doesn't cause any real damage.

You could also ask why the convolution matrix is all ones. Wouldn't multiplication be more interesting if the right-hand factor wasn't always 1? I think with that question you'd be raising an excellent point.

Besides the egregious bugs described above, the rest of the code is just about the most inefficient way to access an RGBA bitmap that I've seen in a long time. The way it allocates and discards short-lived arrays all the time turns the benchmark mostly into a stress test for object allocation and garbage collection. Getting a pixel's neighbors necessarily requires detecting whether the pixel is at the border of the image; doing this check by accessing a non-existent array element and then comparing the result against undefined is, due to JavaScript's semantics, much less efficient than comparing the requested index against 0 and the array's length. If I rewrite the blurring function's main loop to be less abjectly inefficient, the score goes up by 5x. With modern browsers usually delivering similar performance, who cares about one browser or another being 10% faster, when every single one of them could do the same work five times as quickly if the program were implemented more efficiently?
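The cheap version of that border check is just a pair of integer comparisons (a hypothetical helper of mine, not the test's code):

    function inBounds(x, y, width, height) {
      return x >= 0 && x < width && y >= 0 && y < height;
    }
    // Only coordinates that pass this check are read at all, so no
    // "undefined" ever enters the computation.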

Conclusion:
The "Array-Blur" test's code is utter crap. It is horribly buggy and horribly inefficient. While you could argue that this shouldn't matter because a fast browser should handle any task quickly, I think how well browsers handle broken code shouldn't be used for performance comparisons, nor for browser vendors as targets to optimize for.

Browsermark: Introduction

By more or less random selection, let's start our investigative journey with the browser benchmark creatively named Browsermark. Browsermark is a relatively new kid on the benchmarking block, having been released less than three years ago in November 2012; the current version 2.1 is from May 2014. The official description is at http://www.basemark.com/product-catalog/browsermark/; supposedly it's a "hugely popular" benchmark designed to help you "choose the best browser for your PC or smartphone". Sounds good, let's see what it does.

Running the benchmark in the browser shows a series of fancy moving things, but extremely little technical detail about what's going on under the hood. After running the tests, a single performance score is displayed, with no information on how this score was computed (e.g. which individual tests scored how much). The official description (see link above) informs us that we need to pay for the "professional" version if we're interested in the breakdown. There is also no link to the source code of the benchmark -- so at this point, we may as well assume that a random number is chosen as the final score.

Fortunately for our inquiring minds, any benchmark program downloaded and executed by a browser can also be inspected by the user of that browser. In fact, it's even easier to retrieve the web site as plain text with a command-line tool. So let's take a look at the HTML source of the page:
$ curl web.basemark.com
This spits out a bunch of text, and at the very end, we find this little snippet:
<div id="next_test" data-next="http://web.basemark.com/tests/2.1"></div>
OK, so let’s look at that:
$ curl web.basemark.com/tests/2.1
lo and behold:
<div id="next_test" data-next="http://web.basemark.com/tests/2.1/benchmarks"></div>
$ curl web.basemark.com/tests/2.1/benchmarks
I'm beginning to see a pattern here:
<div id="next_test" data-next="http://web.basemark.com/tests/2.1/benchmarks/css"></div>
$ curl web.basemark.com/tests/2.1/benchmarks/css
<div id="next_test" data-next="http://web.basemark.com/tests/2.1/benchmarks/css/2d_transform"></div>
Finally, after following this chain of links, we get to the first actual test's source:
$ curl web.basemark.com/tests/2.1/benchmarks/css/2d_transform
which, of course, contains the link to the next test at the end. Before that link, we find the test itself; the meat of its logic is in http://web.basemark.com/tests/2.1/benchmarks/css/2d_transform/test.js.

The benchmarking framework shared by all individual tests appears to be in http://web.basemark.com/tests/benchmark.js.gz. Unzipping and formatting makes this file human readable.

Now we have what we need to start looking in more detail.

One interesting observation is that the benchmark's description lists three tests in the "JavaScript" section. However, the actual benchmark performs a fourth test, namely "Array Blur". Excited like a little kid about having found this easter egg, let's pick that as the first benchmark to look at, shall we?