Misunderstood Mobile Benchmarks Are Hurting The Industry and Consumers

By Patrick Moorhead - June 12, 2015
I have been in and around the benchmarking and benchmarketing scene for 25 years in the PC, server, and now smartphone and tablet markets. Benchmarks have followed a cycle for years, and that cycle is fairly predictable: they move between manufacturer-led, consortium-led, benchmark company-led and industry standard-led formations. There are hybrids as well, like manufacturer-led consortiums.

Over the course of the past few years, there has been a proliferation of inappropriate or misunderstood benchmarks in the mobile world. Those benchmarks do nothing other than help users generate a single number, a benchmark score, that is supposed to quantify the performance and, by proxy, the experience of that device. This impacts chipmakers and chip designers like Apple, ARM Holdings, Huawei, Intel, MediaTek, NVIDIA, Qualcomm and Samsung Electronics. It also impacts handset makers like Apple, HTC, Lenovo-Motorola, LG, Sony and Samsung Electronics and the decisions they make. Most importantly, it impacts consumers, and I’ll give examples why.

The problem with some of these mobile benchmarks and the scores they generate is that they don’t accurately reflect a user’s experience once they’ve gotten a device and started using it. Simply put, the numbers do not directly correlate to the user’s experience with the device, and the device manufacturers, press and reviewers who use these benchmarks are unfortunately misleading consumers. I don’t think they intend to mislead; I think it’s a lack of understanding and maybe a lack of desire to do the extra work.

Some of these benchmarks and the people who use them in reviews have been responsible for proliferating the 8-core myth, too. These benchmarks are simply run to see the fastest theoretical performance that the system can reach, without regard to battery life, operating systems, applications or real world use cases.
Many of them simply load up all of the cores to their maximum frequency, a state phones never operate in outside of benchmarks. As a result, these benchmarks have been labeled across the industry as “inaccurate or inappropriate benchmarks” that don’t accurately represent a user’s experience.

Why should we even care about this?

You may be asking, “Why should I even care?” First of all, if you look at the history of microprocessor or SoC pricing, you will find a direct correlation between perceived performance and pricing. Don’t even think of invoking the “Apple rule”, as they have dominated the mobile SoC benchmarks for most of five years. Then there are consumers. Consider first an example of something I read this morning in the DailyMail:
“£99 Tesco tablet beats £300 Apple rival in speed test: Consumer study shows price and brand is not guarantee to finding best performing device”
In this example, the DailyMail used Geekbench to justify the article and headline. We all know in the industry that a 99 pound Tesco tablet doesn’t outperform a 300 pound Apple iPad mini 3 on real benchmarks or in the experience. Admittedly, the DailyMail example is the worst I have seen, but I see this kind of stuff every time I read reviews about a new smartphone or tablet. And I cringe. You should, too.

So if you are an SoC manufacturer like Huawei, MediaTek, Qualcomm, or Samsung Electronics, you take a “can’t beat them, join them” approach and add more processor cores to your SoC. Thus we have the 8-core myth: 8 cores so that you look better on inappropriate benchmarks. Some have done this to get the “64-bitness”, too. So how is that 64-bit Android thing working out? Apple and Intel have not taken this approach of wantonly adding meaningless CPU cores and I applaud them for taking the high road. Qualcomm, I believe, will move back to a different approach with their future Kryo core. NVIDIA took an approach in the middle.

Why do some use inaccurate or inappropriate mobile benchmarks?

People use inaccurate benchmarks because these benchmarks make it really easy to simply download, press a button and get a number telling you how fast or slow your smartphone is, in theory. It takes a lot longer to run a benchmark that reflects real-world usage. Part of the reason is that press and device manufacturers have been publishing their scores in these inaccurate or inappropriate benchmarks, which gives credibility to those scores. But the reality is that these benchmarks don’t even remotely test what a normal user would be doing on their smartphone. These benchmarks are known as synthetic benchmarks. They generally test the components of a computer, or in this case a smartphone, to see their highest performance in an absolute best case scenario, usually without much context about how those components are used.
AnTuTu and Geekbench, the most commonly used misunderstood or inappropriate mobile benchmarks

From my experience, and that of many other experts I have talked to within the industry on this topic, the general agreement is that AnTuTu and Geekbench are the two mobile benchmarks that are the most used and misunderstood. Both are commonly used by device manufacturers and reviewers to quantify the performance of a smartphone and show how it performs against others. The problem with these benchmarks is that they do not test the smartphone’s performance as a system, but rather components of the SoC or some other part of the whole system. In some scenarios, some of these benchmarks may be useful to point to certain capabilities of the CPU, but they are not a useful representation of the whole system’s performance or experience.

These benchmarks are also easily manipulated and tricked by device OEMs, as we saw last year when AnandTech exposed a multitude of smartphone manufacturers ‘boosting’ their performance in some of these benchmarks. They were flat out cheating in benchmarks in order to look better in reviews done by the press. Unsurprisingly, the most cheating was found in AnTuTu, where LG, HTC, ASUS and Samsung Electronics were all caught cheating in the benchmark. Some lessons could be learned from the PC world, from my experience, where I found problems in smartphone benchmarking in the past and suggested a list of remedies. Let’s dive into AnTuTu and Geekbench.

AnTuTu

AnTuTu originally started out as just a single system test where you pressed a single button and let the test generate a score based on a bunch of individual system tests. Now, the app is in version 5.7 and still has the single button test, but it also includes separate tests for HTML5, video, display, stability and battery.
The standard test is still a combination of single thread floating point, single thread integer, full CPU integer and full CPU floating point performance tests. These don’t do anything other than tell us what the maximum single thread and multi-core performance of the CPU could be if applications could fully utilize all of the cores and battery consumption wasn’t a factor. The benchmark also tests RAM performance, multitasking performance, storage I/O and database I/O. These tests, once again, operate mostly as silos and don’t do a good job of telling the user how the phone will perform in real world scenarios. The last two tests of this benchmark are a VERY simple 2D test with tons of bouncing shapes and a very low quality 3D test that looks nothing like any game I’ve seen on any decent phone in the last five years, meaning that it doesn’t really test the gaming capability of the phone. Then, the benchmark takes all of these individual scores and combines them to give you your AnTuTu score, an aggregated number that doesn’t really amount to anything.

The problem with using or reporting on AnTuTu is that it purports to be, or is used as, a full system benchmark, when it really doesn’t do anything other than test different components of the system, fairly poorly, and then generate a composite score based on those individual scores. AnTuTu’s main test doesn’t incorporate any of the other tests they offer, like HTML5, battery life, video, stability or screen tests, which could provide better insights into system performance. This is because if they did incorporate these tests, the benchmark would take too long to run and wouldn’t be as popular as it is today.

Geekbench

Geekbench is a cross-platform benchmark that started out on MacOS and iOS, popularized by its ability to run on both iOS and Android and provide some feedback about certain aspects of a CPU’s architecture and theoretical capabilities.
It is currently on version 3 and now supports testing on Windows, Mac, Linux, Android and iOS. Geekbench isn’t as bad an offender as AnTuTu when it comes to being a misleading or misunderstood benchmark, but it only tests two components of a smartphone, the CPU and memory, and doesn’t do so in any real world scenarios. This leaves out really important components like storage or the GPU, which is fast becoming a compute workhorse in a heterogeneous compute environment. The tests it runs are CPU integer and floating point calculations in both single core and multi-core modes, as well as memory tests in single core and multi-core modes. As a result, the benchmark may provide some insights into the architectural comparison of the CPU in the system, but in the case of two different smartphones with the same SoC this benchmark provides limited to no value. This benchmark should only really be used to compare CPUs across different operating systems and platforms and see how they stack up against each other, not as a benchmark for comparing phones, and especially not for a review.

Geekbench certainly has its place in benchmarking, but it doesn’t make sense to include it the way that smartphone reviewers have been doing in their reviews. As a result, it becomes an inappropriate mobile benchmark because of the way it gets utilized by the press. Geekbench should be commended for their transparency about how and what exactly they’re testing, but that doesn’t change the fact that it’s being misused in mobile testing.

Reviews using balanced benchmarks

Although there are plenty of reviewers out there using benchmarks in their reviews, there are a select few experienced reviewers who are using the right benchmarks to compare smartphones. Their editorial on these makes technical and experiential sense, too.

Reviews using no benchmarks

In addition to having reviewers that use good benchmarks, we also have reviewers that simply don’t use benchmarks at all.
For some, they simply chalk it up to being about the experience and if the experience is okay, there’s no need for a benchmark. However, this is a dangerous notion because reviewers may miss something critical in a phone’s performance without having any numbers. It also leaves the “experience” up to someone who runs very different apps and has a different tech usage history and even hand size to determine what a good experience is. Having benchmarks also lends credibility to a review when there might be some sort of issue and the reviewer needs to figure out the culprit. This list excludes the countless forums around the internet that also quote AnTuTu and Geekbench performance across the globe and non-English speaking publications.

Using the wrong benchmarks pulls the industry in the wrong direction

Because of the creation, use and promotion of these inaccurate, misunderstood, and/or gameable benchmarks, we are seeing smartphone manufacturers and SoC vendors dedicating time and engineering resources to ensuring that their performance in these benchmarks is up to expectations. After all, if so many people are using or mischaracterizing AnTuTu and Geekbench, it lends them credibility even when it shouldn’t. It seems like those same resources could be working on further improvements to issues we all have, things like battery life. Additionally, vendors are adding features that make the misrepresentative benchmarks look better, like adding more CPU cores beyond what any piece of software can use to improve the experience outside of battery life. This propagates the 8-core myth.

Additionally, because so many reputable tech blogs don’t run ANY benchmarks at all, they are essentially giving the ones that do more credibility when they show AnTuTu and other benchmarks. While it is understandable that some reviewers either don’t have the time to run benchmarks or aren’t satisfied with the quality of benchmarks, it still isn’t an excuse to not have ANY at all.
Additionally, benchmarks are supposed to lend credibility to your experience and explain any kind of performance difference between devices, be it good or bad.

What needs to be done?

I stand by the column I wrote in 2013 about what needs to be done. Let me recap what I said:
  • The best benchmarks reflect real world usage models
  • Never rely on one benchmark
  • Benchmark shipping devices
  • Application-based benchmarks are the most reliable
  • Look for transparency
  • Look for consistency
Reviewers need to be using a suite of benchmarks that best exemplify real world usage. That means benchmarks that:
  • utilize real game engines for their 3D benchmarks, like 3DMark
  • best mirror applications, like Basemark X
  • use real applications or reflect them for their testing, like PCMark
At a minimum, the benchmark results must reflect real-world experiences. I do think that the industry can do a better job in the future by creating a consortium-led approach to benchmarking. Vendors like Apple, ARM Holdings, HTC, Huawei, Intel, Lenovo, LG, MediaTek, Qualcomm, Sony, and Samsung Electronics need to do more to move this forward because it’s the right thing to do. And they know it.

Patrick founded the firm based on his real-world technology experiences with the understanding of what he wasn’t getting from analysts and consultants. Ten years later, Patrick is ranked #1 among technology industry analysts in terms of “power” (ARInsights) and in “press citations” (Apollo Research). Moorhead is a contributor at Forbes and frequently appears on CNBC. He is a broad-based analyst covering a wide variety of topics including the cloud, enterprise SaaS, collaboration, client computing, and semiconductors. He has 30 years of experience including 15 years of executive experience at high tech companies (NCR, AT&T, Compaq, now HP, and AMD) leading strategy, product management, product marketing, and corporate marketing, including three industry board appointments.