Left H.264, right VP9 (Source: YouTube)
ASICs beat general-purpose chips, which beat software; it’s the law. For fixed-function operations like transcoding, that’s especially true. The first transcoding work was done in software on the CPU. It was pretty good, and it was easy to see it would scale with CPU clock rate. Transcoding is a highly repetitive operation, so it came to make sense for the parallel processors in a GPU to take over. They were an order of magnitude faster than CPUs. As the demand for transcoding began to explode, Intel reacted and in 2011 added Quick Sync, its dedicated hard-wired transcoder, to the x86, and it was blazingly fast, and free.
But x86 processors, and all the stuff you have to have around them, add up. So as fast as Quick Sync was, it didn’t scale well. It was great for individuals, not so great for a farm. Streaming content suppliers need multiple streams at different bit rates, and serving that need is an ongoing and growing market. Recognizing this, Elemental Technologies was founded (in Oregon) to serve that market with custom racks of GPUs. Amazon bought the company in 2015 so it could meet its streaming needs.
Google’s YouTube is doing basically the same thing, except it has committed to custom silicon for video encoding. Google calls its Argos chip a VCU (video coding unit), a dedicated video transcoding chip. The company is revealing its work in a blog post by software engineer Jeff Calow and a paper presented at the ASPLOS conference.
The chip layout is straightforward with an Arm CPU, 10 transcoders, and two banks of LPDDR.
The Argos floorplan (Source: ASPLOS ’21, April 19–23, 2021)
Google has put 10 encoder cores in a chip and complements them with off-the-shelf IP blocks. It’s a dedicated hard-wired processor, which is always going to be the fastest approach. Google says each encoder core can encode 2160p in real time at up to 60 FPS using three reference frames.
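To put that real-time claim in the same units as the throughput figures Google reports, here is a minimal back-of-the-envelope sketch. It assumes 2160p means a 3840x2160 frame; the function name is ours, not Google's.

```python
# Back-of-the-envelope pixel rate for one encoder core.
# Assumption: "2160p" is a 3840x2160 (16:9 UHD) frame.
WIDTH, HEIGHT, FPS = 3840, 2160, 60


def pixel_rate_mpix(width: int, height: int, fps: int) -> float:
    """Return the pixel rate in megapixels per second."""
    return width * height * fps / 1e6


core_rate = pixel_rate_mpix(WIDTH, HEIGHT, FPS)
print(f"{core_rate:.1f} Mpix/s per core")  # 497.7 Mpix/s per core
```

This is a real-time, single-stream number, so it should not be compared directly with offline two-pass results.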
Simplified view of the transcoding workload (Source: YouTube)
Transcoding for YouTube means taking a single video and converting it into a lot of other videos. Nine different resolutions are created from a single upload: 144p, 240p, 360p, 480p, 720p, 1080p, 1440p, 2160p, and 4320p. These are all different video files, and every one needs to be created from the original 8K uploaded file.
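The fan-out from one upload to many outputs can be sketched in a few lines. This is an illustration only: the 16:9 aspect ratio, the no-upscaling rule, and the even-width rounding are our assumptions, not details from YouTube.

```python
# Output ladder named in the article (vertical resolution in lines).
LADDER = [144, 240, 360, 480, 720, 1080, 1440, 2160, 4320]


def ladder_for(source_height: int, aspect=(16, 9)):
    """Return (label, width, height) for every rung at or below the source.

    Assumptions: 16:9 frames and no upscaling; both are illustrative choices.
    """
    num, den = aspect
    rungs = []
    for h in LADDER:
        if h <= source_height:
            w = h * num // den
            w += w % 2  # round odd widths up to even, which 4:2:0 encoders typically need
            rungs.append((f"{h}p", w, h))
    return rungs


for label, w, h in ladder_for(4320):
    print(f"{label:>6}: {w}x{h}")
```

Under these assumptions an 8K upload yields all nine rungs, while a 1080p upload would yield only the six rungs from 144p to 1080p.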
In his post, Jeff Calow, the lead software engineer at YouTube, said that during the Covid-19 pandemic they saw surges in video consumption as people sheltered at home. In the first quarter of last year, there was a 25 percent increase in watch time around the world.
Google created an AIB with two VCUs, each with 10 transcoders in them (Source: YouTube)
The folks at YouTube started the project in 2015. They could see the demand for higher quality video (e.g., 1080p, 4K, and now 8K). They also saw that the broader Internet wouldn’t be able to accommodate this growth unless they shifted to more data-efficient video codecs. However, data-efficient video codecs like VP9 use more compute resources to encode than H.264. Calow said, “The combination of these dynamics led us to pursue a dramatically more efficient and scalable infrastructure.”
Design for all scales: global system, chip, and encoder core (Source: ASPLOS ’21, April 19–23, 2021)
The results speak for themselves, as shown in the following table.
System | Throughput, H.264 (Mpix/s) | Throughput, VP9 (Mpix/s) | Perf/TCO, H.264 | Perf/TCO, VP9 |
Intel Skylake | 714 | 154 | 1.0x | 1.0x |
4x Nvidia T4 | 2484 | – | 1.5x | – |
8x VCU | 5973 | 6122 | 4.4x | 20.8x |
20x VCU | 14932 | 15306 | 7.0x | 33.3x |
Offline two-pass single-output (SOT) throughput in VCU vs. CPU and GPU systems (Source: YouTube)
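The VP9 column is easier to appreciate as ratios. The sketch below copies the VP9 throughput figures from the table and assumes that "8x" and "20x" refer to the number of VCUs in each system; the helper names are ours.

```python
# VP9 throughput in Mpix/s, copied from the table above.
VP9_MPIX_S = {
    "Intel Skylake": 154,
    "8x VCU": 6122,
    "20x VCU": 15306,
}

# Assumption: "8x" and "20x" are the number of VCUs in each system.
VCU_COUNT = {"8x VCU": 8, "20x VCU": 20}


def speedup(system: str, baseline: str = "Intel Skylake") -> float:
    """Raw VP9 throughput relative to the baseline system."""
    return VP9_MPIX_S[system] / VP9_MPIX_S[baseline]


def per_vcu(system: str) -> float:
    """VP9 throughput divided evenly across the VCUs in the system."""
    return VP9_MPIX_S[system] / VCU_COUNT[system]


for name in VCU_COUNT:
    print(f"{name}: {speedup(name):.1f}x Skylake, "
          f"{per_vcu(name):.1f} Mpix/s per VCU")
```

On these numbers, raw VP9 throughput is roughly 40 and 99 times the Skylake baseline, and the per-VCU figure is almost identical in both configurations. That suggests near-linear scaling as VCUs are added. The Perf/TCO columns are lower than the raw ratios because they also account for cost.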
Calow added, “One of the things about this is that it wasn't a one-off program. It was always intended to have multiple generations of the chip with tuning of the systems in between. And one of the key things that we're doing in the next-generation chip is adding in AV1, a new advanced coding standard that compresses more efficiently than VP9, and has an even higher computation load to encode.
“As for me, I’ll be continuing my work on this project, developing future generations, which will keep me busy for a while.”
What do we think?
Google has the size and resources to afford designing and building custom chips just for its own use. Few companies think that way; most approach such projects with an ROI model that includes some kind of merchant supplier function.
Transcoding is an ideal problem to throw silicon at because it scales so well and is stabilized by half a dozen standards. YouTube can take this basic design and improve it simply by using the latest fab process—Moore’s law in action.