What the heck is that, you ask? So did I, but it looks promising!
I continue to be amazed at the innovations that are coming to market to accelerate deep learning workloads. Graphcore, Habana Labs, Cerebras, Blaize, Groq, Perceive and others are now being joined by a new entrant with radically new ideas: Tenstorrent out of Toronto, Canada. The Game of Launches requires each new competitor to find a way to best the previous launches from a growing field of contenders for what many expect to become a $25B chip business by 2025. Cerebras staked out the biggest chip imaginable. Groq said they can crank out a petaflop on a chip with a single core. Now Tenstorrent’s claim to future fame and potentially the crown is all about reducing the computation required to get to a good answer, instead of throwing massive amounts of brute-force compute at the problem. The technique is called fine-grained conditional computation, and it is just the beginning of an optimization roadmap Tenstorrent CEO Ljubisa Bajic has up his sleeve.
What is conditional computation and what can it do?
There are many forms of conditional computation, but in principle they all come down to the same premise: don’t calculate what you don’t need or already know. For example, multiplying a number by zero is wasted work; don’t spend the time and energy, because you already know that the answer is exactly zero. But how do you design silicon to avoid it? Typically, this case is handled by software that prunes the neural network before execution, so the silicon is never asked to do the multiply. But what about more complex cases that can only be detected and avoided at run-time? Researchers have been studying this and believe it holds potential; a team at Harvard has demonstrated a 1.9X performance boost in ResNet-50 with 98% of the original accuracy.
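To make the multiply-by-zero example concrete, here is a minimal Python sketch of a dot product that skips any term with a zero operand. It only illustrates the premise; it is not how Tenstorrent's hardware or any pruning tool works, and the function name `sparse_dot` and the sample values are invented for illustration.

```python
def sparse_dot(weights, activations):
    """Dot product that skips multiplications with a zero operand.

    Hypothetical illustration only: returns the result together with the
    number of multiplies actually performed.
    """
    total = 0.0
    multiplies = 0
    for w, a in zip(weights, activations):
        if w == 0 or a == 0:
            continue  # the product is known to be zero, so skip the work
        total += w * a
        multiplies += 1
    return total, multiplies


# Made-up sample data: three of the four terms contain a zero.
weights = [0.5, 0.0, -1.0, 0.0]
activations = [2.0, 3.0, 0.0, 4.0]
result, multiplies = sparse_dot(weights, activations)
print(result, multiplies)
```

With this sample data, only one of the four multiplications is carried out. Zero weights are known ahead of time and can be pruned in software, while zero activations only show up at run-time, which is the harder case the paragraph above describes.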
But a more general use of the concept requires silicon to be, well, smart about being smart. In the examples provided by Tenstorrent, the new chip, unfortunately called Grayskull, can detect that it is close enough to an accurate answer to stop processing the network, something Bajic calls “early (model) exit”. The company has demonstrated the concept working well in convolutional neural nets used for image processing, as well as recurrent nets used in language processing. And of course, there’s more to the chip than the conditional computation ability; the device has on-die CPUs and a fast GEMM (matrix multiply) core set to deliver impressive performance.
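The idea of an early exit can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Tenstorrent's method: each stage is assumed to return both its features and a set of intermediate class scores, and the names `run_with_early_exit`, `make_stage`, and the 0.9 confidence threshold are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def run_with_early_exit(stages, x, threshold=0.9):
    """Run stages in order and stop once a prediction is confident enough.

    Assumption: each stage is a callable returning (features, logits).
    Returns the predicted label and the number of stages executed.
    """
    if not stages:
        raise ValueError("at least one stage is required")
    used = 0
    probs = []
    for stage in stages:
        x, logits = stage(x)
        used += 1
        probs = softmax(logits)
        if max(probs) >= threshold:
            break  # close enough to an answer; skip the remaining stages
    return probs.index(max(probs)), used


def make_stage(scale):
    """Toy stage that sharpens its input and reuses it as class scores."""
    def stage(x):
        features = [v * scale for v in x]
        return features, features
    return stage


stages = [make_stage(2.0) for _ in range(4)]
label, used = run_with_early_exit(stages, [1.0, 0.0], threshold=0.9)
print(label, used)
```

In this toy run, the confidence crosses the threshold after the second of four stages, so the last two are never executed. The threshold is the knob that trades accuracy for saved computation.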
On a 75W bus-powered PCIe card, Grayskull achieves 368 TOPS and, powered by conditional execution, up to 23,345 sentences/second using BERT-Base for the SQuAD 1.1 data set, making it some 26X more performant than today’s leading solution, according to Tenstorrent. I’ve often said and strongly believe that it will take a 3-5X or even 10X advantage to help motivate an ecosystem needed to challenge the status quo. 26X would certainly do the job. A 300W version of the card is expected later this year.
While the company claims Grayskull is the fastest chip in the world, many are competing for that title, and I will withhold judgement until I see some real application benchmarks such as MLPerf. But Tenstorrent certainly has my attention with this announcement, and the company bears watching closely. To my eye, this announcement marks a shift from chips with lots of fast cores and on-die memory and fabric (which describes most of the entrants to date) to a new approach of smart computing where the software, training, and inference chips all coordinate knowledge of the network to reduce the amount of computation. Brute force is great, until something better comes along, and I think it just did.
Note: Moor Insights & Strategy writers and editors may have contributed to this article.