Bad Benchmarks: Browsermark: String-Chat

Onwards with our journey through Browsermark's JavaScript tests! Next stop: http://web.basemark.com/tests/2.1/benchmarks/javascript/string_chat/test.js.

The high-level summary is that this test has a list of "messages" ("Lorem ipsum" strings devoid of any meaning, about 40 to 80 characters long), and a set of ten banned words, and then it deletes all occurrences of those banned words in the messages as often as it can until four seconds have elapsed, storing all the resulting modified messages in one huge list.

This tiny snippet of code is the entire benchmark (minus the static data), let's take a closer look:

[1] modifiedMessages[iteration_index] = [];
[2] for (var e = 0; e < this.messages.length; e++) {
[3] if (this.messages.length % (e + 1) == 0) {
[4] counter++;
[5] for (var d = 0; d < this.bannedWords.length; d++) {
[6] modifiedMessages[iteration_index].push(
[7] this.messages[e].replace(this.bannedWords[d]))
[8] }
[9] }
[10] }

Line 7 is where the action is: "banned words" are replaced with nothing. So this entire benchmark is just a micro-benchmark for JavaScript's String.replace() function; and more specifically for the special case where there is no second argument (which usually would be the replacement string). I guess if they had called it "String-Replace" that would have sounded boring, and if they had called it "String-Censorship" people would have questioned their moral values, so they called it "String-Chat", which is sure to let you think "yeah, right, chatting is something I do on the internet, so this benchmark must be relevant!".

When it comes to benchmarking, a good rule of thumb is that when a benchmark does all its interesting work in just one line, then it's short enough to disregard it on general principle. One-line benchmarks are so far away from any useful program that they're irrelevant -- the only possible exception is when you're working on the respective execution platform (in this case, the JavaScript engine), and today you're focused on that one feature tested in that one line. That said, ending this post here seems premature, who knows how much fun we'd be missing out on? Let's keep looking.

One consequence of the workload being so small is that, similar to what we saw on Array-Weighted, just measuring how much time has elapsed after each iteration amounts to about 7% of the overall time spent on this test.

Line 6 appends all the results of all the replacements to the "modifiedMessages" list created in line 1. Presumably this is done to prevent JavaScript engines from cleverly deducing that the result of the replacement is never used, and hence optimizing away the entire replacement operation. The problem is, however, that the modifiedMessages list becomes huge! In my tests, it grew to over 900 MB, and according to its profiler, the poor JavaScript engine spent about 20% of the overall time in its garbage collector attempting to free up some space; of course that was mostly futile, as this huge list kept millions of strings reachable, so the GC couldn't collect them.
It's a safe assumption that the more memory is consumed, the more time an engine will spend trying to free it (firstly because that's what you typically want as a user, and secondly because garbage-collecting a larger heap has to take more time as it is more work). In this test, the faster a computer (hardware + JS engine) is, the more memory will be consumed for those results. So the end result is a benchmark that slows itself down the faster it runs. What genius came up with this?

Line 5 iterates over the "banned words". Nothing wrong with that on a technical level, until you look at the words, and at the messages, and remember that String.replace is case sensitive, and you realize that two of the ten banned words will never match because they only occur capitalized. That seems unintentional, so that's a rather amusing face-palm moment.

Line 2 iterates over the messages. Nothing wrong with that on a technical level, until you look at line 3, and start scratching your head. What does it do? It makes sure the body of the loop is only executed if the index of the current message divides the number of messages without remainder. Uhm... why does it do that? I haven't the faintest idea. I can't even imagine what original intention went wrong here to lead to this. If you had four messages, there'd be absolutely no reason for this test to ignore the third (seeing that only 1, 2, and 4 are divisors of 4).
So how many messages are there? 41. Wait. Forty-one! That's a prime number! That doesn't have any divisors, except for 1 and itself! Only the first and the last message are ever looked at! You could literally drop half of the source code of this test, and it would still do exactly the same amount of work. Also, with so many messages being ignored, only two of the ten banned words will ever match. This is beyond face-palm territory. This test is falling flat on its face here, suffering a fractured nose and a severe concussion. Ouch.

Conclusion:
Micro-benchmark for "String.replace(search_term, replacement)" in the special case where replacement is not specified (so defaults to 'undefined') and search_term is never found in the input string. The overall code is so braindead that I would reject a summer internship candidate who submits this in their application.

Bad Benchmarks

2015-07-28

Browsermark: String-Chat

No comments:

Post a Comment