Nvidia overhauls Quadro product line with Pascal GPUs
Alex Herrera on February 24th 2017
At the Solidworks World conference in February, Nvidia released its second round of Pascal-generation Quadro add-in card GPUs for professional-caliber workstation applications, adding to the first round of late 2016. Often, a second round is followed by a third and fourth, before the entire portfolio has been upgraded to the latest generation. But in this case, in short order, Nvidia has overhauled its entire Quadro product line in a span of just two quarters.
JPR was able to get a first-hand look at the top half of the new line (the ultra-high end Quadro P6000 and P5000, the high end Quadro P4000 and the mid-range Quadro P2000) in an evaluation that returned particularly impressive results. Every product generation Nvidia or its rival AMD launches represents a meaningful bump in performance over the previous one, but in testing these cards, we saw one of the biggest jumps in generation-to-generation capabilities in recent memory.
The first round: ultra-high end Quadro P5000 and P6000
Pascal’s first appearance in the Quadro line came in Q4’16, when two early chip incarnations made their way into the top end of the product line. The GP104 was Nvidia’s first Pascal chip targeting graphics, implementing 2560 CUDA cores, followed shortly by the GP102, incorporating 3840 CUDA cores. Both are fabricated in TSMC’s 16 nm process, and both showed up at the ultra-high end of Nvidia’s Quadro line in Q4’16, powering the Quadro P5000 (~$1500 est. street price) and P6000, respectively.
As expected, the two new SKUs have their higher-performing GPU chips complemented with more memory than their Maxwell-based predecessors, the Quadro M6000 and M5000, and more high-performance video ports as well. The P6000 comes with 24 GB GDDR5X memory accessed at an impressive 432 GB/s (peak rate), while the P5000 is paired with 16 GB driven at 288 GB/s. Both are equipped with five video ports (four DisplayPort 1.4 and one DVI), and both take advantage of Pascal’s advance in display capability, able to drive up to four 4K screens at 120 Hz or four 5K screens at 60 Hz.
The second round: the high end Quadro P4000, mid-range P2000, and entry P1000, P600 and P400
As has become typical, after a new generation penetrates the top end, it then trickles down the line over time. In January, the entire rest of the line-up got Pascal upgrades, from high end to entry: the P4000 (~$900) in the high end, the P2000 (~$450) in the mid-range, and the P1000, P600 and P400 in the entry class. The P4000 is built around the same GP104 as the P5000, though with fewer (1792) cores enabled, while the P2000 is based on another Pascal chip derivative, the cost-reduced GP106, with 1024 active cores.
Rounding out the launch are three entry cards, the P1000 (4 GB), P600 (2 GB) and P400 (2 GB), replacing the previous K1200, K620, and K420, respectively (the entry Quadros had still been based on Kepler, never having received a Maxwell refresh). All three come built around the most diminutive of the current Pascal chips, the 768 core GP107, with each SKU having 640, 384 and 256 of those cores enabled, respectively. (Or rather, it’s likely chips are binned by functionality at test.)
The Quadro P6000 and P5000 have already begun shipping, while Nvidia promises the rest of the new Quadro P-series SKUs announced in January will be available in March.
Performance yes, but not without considering watts.
Driving performance up from generation to generation is of course key, but it can’t be done without proper attention to power consumption. Generally speaking, vendors would like to match, and ideally decrease, the power consumption of the previous model. Those targets are particularly important around a couple of critical thresholds. Up to 75 watts, a card can rely on PCIe interface power, but beyond that it will require auxiliary power connectors. Entry and mid-range SKUs will typically sit at or under that threshold. And pushing beyond 250 watts has its tradeoff as well, since such cards require power supplies, thermal dissipation and slot availability (typically dual-slot width) that not all OEM workstation models can support.
Power consumption for the P6000 (250 W) and P2000 (75 W) is identical to the levels of the Maxwell-based cards they succeed. The P5000’s power grew from 150 watts to 180 watts, but that growth has to be taken with a big grain of salt, as memory footprint, loading and bandwidth have all increased significantly. The P4000 made the most out of the move to Pascal and 16 nm, actually reducing power from 120 W to 105 W.
The graphics+compute all-in-one Quadro GP100
In this refresh cycle, Nvidia did a bit more than upgrade existing SKUs and price points: it added a brand new model in the form of the Quadro GP100. But more than just another SKU, the Quadro GP100 represents a different kind of GPU card, targeting both graphics and compute. As one might assume from the name, this übercard is based on the original, flagship “Big Pascal” GP100 first introduced under the Tesla compute-focused brand.
Why wasn’t the GP100 already designed into Quadro? Well, simply put, Big Pascal is too big for the vast majority of applications only concerned with graphics. Most notably, it contains a lot of extra transistors/cost dedicated to boosting double-precision floating point arithmetic, something that graphics applications generally don’t care about, but is a must-have for some compute-oriented applications. Accordingly, Nvidia spun both the GP102 and GP104 shortly after, cutting the extra DP-boosting hardware in favor of more CUDA cores — both chips then used to drive the Quadro top end.
But the GP100 isn’t a graphics-focused card. Rather, it’s a card targeted at high-end workstation applications that rely heavily on both graphics and compute (and things like high-quality, photorealistic rendering, which is better characterized as “compute” than “graphics”). The card can do both well and therefore appeals to the higher-demand and deeper-pocketed applications that rely on visualization in addition to tasks amenable to GPU processing, like simulation and analysis.
For some, the Quadro GP100 might recall a technology initiative of several years ago called Project Maximus. That initiative attempted to optimize the co-processing of two cards in a single workstation, a Quadro focused on graphics and a Tesla focused on compute. Well, one can think of the Quadro GP100 as the improved and streamlined evolution of Project Maximus, delivering a solution that can break the serial design-then-simulate workflow into a parallel design-and-simulate approach. Parallelizing shortens the iteration and improves time to completion and, ultimately, time to market. And for those with the absolute ultimate in demand and the pockets to match, the GP100 can be ganged into a dual-GPU solution tightly coupled via Nvidia’s new Pascal-generation NV-Link interconnect.
A few new things about the GP100 make it particularly well-suited to take on this all-in-one role: the aforementioned 64-bit (and 16-bit, as well) floating point performance, NV-Link, and Pascal architectural improvements, most notably Asynchronous Compute and instruction-level preemption. In order for one GPU to concurrently process both graphics and compute tasks, it needs to be able to switch between the two quickly, with minimal overhead and predictable latency. With Pascal, Nvidia added the ability for the CUDA cores (we assume at the Streaming Multiprocessor, or SM, level) to switch far more efficiently from compute tasks to graphics tasks. Asynchronous Compute is also useful in traditional graphics applications, for example in media and entertainment, where graphics engines and developers are increasingly using the GPU to perform physics and kinematics.
In addition, Pascal’s instruction-level preemption makes delays associated with switches more predictable and tolerable. Previous GPUs had to complete kernel execution before getting on to other tasks, like rendering the screen or updating the GUI. A long execution time could thus cause the machine to appear unresponsive until the kernel completed or the OS killed the process due to timeout. That made things harder on CUDA coders, who had to slice up work into smaller chunks on "unnatural" boundaries, just to avoid hangs and timeouts. Instruction-level preemption changes all that. Working in very much the way it does on a CPU, preemption can now occur after completion of any instruction; once the chip has copied its context to GPU memory, it can proceed to another application or OS task. No more waiting for the full kernel to complete, and no more coding games to avoid timeouts.
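To make the pre-Pascal workaround concrete, here is a minimal sketch of the kind of host-side bookkeeping that slicing work on "unnatural" boundaries implies. It is plain Python rather than CUDA and calls no GPU API; the function name, the throughput estimate and the time budget are all hypothetical values chosen for illustration, not anything from Nvidia's tools.

```python
from typing import Iterator, Tuple


def plan_chunks(total_items: int, items_per_ms: float,
                budget_ms: float) -> Iterator[Tuple[int, int]]:
    """Yield (start, stop) index ranges for successive launches.

    Each range is sized so that, at the estimated throughput, one launch
    should finish within budget_ms. All three inputs are hypothetical
    illustration values, not measured or vendor-supplied figures.
    """
    if total_items < 0 or items_per_ms <= 0 or budget_ms <= 0:
        raise ValueError("total_items must be >= 0; rates and budgets must be > 0")
    # Largest chunk expected to complete inside the time budget (at least 1).
    chunk = max(1, int(items_per_ms * budget_ms))
    for start in range(0, total_items, chunk):
        yield start, min(start + chunk, total_items)


if __name__ == "__main__":
    # Hypothetical job: 10 million items, ~2,000 items/ms, 1-second budget.
    for start, stop in plan_chunks(10_000_000, 2_000.0, 1_000.0):
        print(f"launch over items [{start}, {stop})")
```

With instruction-level preemption as described above, this kind of chunk planning is no longer needed just to keep the system responsive; the whole range could be submitted as one launch.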
The GP100’s other noteworthy differentiator from the rest of the Quadro line is its use of High Bandwidth Memory v2, or HBM2. This memory chip stacking approach dramatically raises bandwidth by trading PCB etch for a lower-capacitance interposer and through-stack vias. Though HBM (in some incarnation) looks to represent the future of GPU memory, HBM2 still looks to be behind the curve with respect to GB per dollar, leaving the more cost-effective GDDR5/5X to drive the Quadro line targeting graphics. By contrast, the premium Quadro GP100, like comparable Pascal-based Tesla brand cards, opts for HBM2.
The Quadro GP100 will be available for shipment by the end of Q1’17, with pricing yet to be determined.
Benchmarking the Quadro P6000, P5000, P4000 and P2000
As is our norm for evaluating graphics cards, we ran the latest version of SPEC's Viewperf. Viewperf 12 focuses workload on the graphics card, such that the rest of the system isn't (or at least shouldn't often be) the bottleneck. As a result, Viewperf will give a good idea of which card has the highest peak performance. However, it's worth noting that the size of a card's lead in Viewperf does not indicate the size of its lead in a real-world environment, where the rest of the system, OS and application may impose other bottlenecks.
With this benchmarking exercise, we relied again on our standard testbench, an Apexx 4 workstation graciously loaned by Boxx. We'd reviewed the Apexx 4 back in the spring of 2015, and it proved to have excellent performance, particularly for single-thread execution thanks to its liquid-cooled Core i7-6590K CPU. A high-performance platform is highly desirable for running Viewperf in particular, as it helps to ensure that bottlenecks that emerge are as much as possible due to the graphics subsystem, rather than some other weak link in the system, for example a slow disk. The rest of the Apexx 4 configuration complements the blistering 4.125 GHz CPU with 32 GB of memory and a SATA-based solid-state drive (SSD). Best of all, it provides a standard platform to compare multiple cards in a true apples-to-apples manner.
The Quadro deskside line targeting graphics now consists of seven SKUs concurrently. As such, it’s critical for a vendor to ensure sufficient separation of its models in the market, to maximize per-model profitability and limit any possible cannibalization within the line. Looking at SKU segmentation, the key performance-sensitive metrics (TFLOPS, memory size and memory bandwidth) show a sensible progression up and down the line.
Of course, that progression of capabilities only applies if those capabilities directly correlate to a comparable progression in graphics performance. Viewperf 12 results for all four new Quadro P-series cards confirm that correlation, as we see that same separation and progression up in performance, climbing from the P2000 up to the P6000.
By normalizing the scores for the P6000, P5000 and P4000 to those of the P2000 (i.e., set to 1.0), we see the progression more clearly, as well as a variance across viewsets. On average, the P4000 scored about 47% higher than the P2000, while the P5000 and P6000 delivered 86% and 126% better performance, respectively.
While performance always goes up with higher price points, the same can’t be said for performance per dollar, which typically declines as one climbs up the product line. The new Pascal family is consistent with that pattern: on average, the P4000, P5000 and P6000 deliver 83%, 56% and 29% of the price-performance of the P2000. Why would buyers pay 29 cents on the dollar for a P6000’s price-performance as compared to the P2000? Simply because those buyers don’t care about price-performance at all, but need the fastest card they can buy, (almost) regardless of cost.
Performance per watt also tends to decline as the SKU price climbs, a behavior repeated by the new Pascal Quadros, but with one key exception. The P4000 stands out, delivering 5% better performance per watt, on average, than its lower-priced sibling, the P2000.
To be fair to cards like the P5000 and P6000, however, it’s worth considering that Viewperf 12 does not particularly reward higher end cards with big memories, though many high-end applications do. And given that bigger memory chews up significantly more power, performance per watt numbers need to be taken with a grain of salt.
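The performance-per-watt comparison can be roughly reproduced from figures quoted in this article alone. The sketch below assumes the averaged, P2000-normalized Viewperf 12 scores (1.00, 1.47, 1.86, 2.26) can simply be divided by each card's rated board power; the dictionary and function names are illustrative, and per-viewset rounding in the published averages may shift the results slightly.

```python
# Figures quoted in the article: average Viewperf 12 score relative to the
# P2000 (P2000 = 1.00) and rated board power in watts.
RELATIVE_SCORE = {"P2000": 1.00, "P4000": 1.47, "P5000": 1.86, "P6000": 2.26}
BOARD_POWER_W = {"P2000": 75, "P4000": 105, "P5000": 180, "P6000": 250}


def perf_per_watt_vs_baseline(card: str, baseline: str = "P2000") -> float:
    """Return the card's performance per watt as a multiple of the baseline's."""
    card_ppw = RELATIVE_SCORE[card] / BOARD_POWER_W[card]
    base_ppw = RELATIVE_SCORE[baseline] / BOARD_POWER_W[baseline]
    return card_ppw / base_ppw


if __name__ == "__main__":
    for name in RELATIVE_SCORE:
        ratio = perf_per_watt_vs_baseline(name)
        print(f"{name}: {ratio:.2f}x the P2000's performance per watt")
```

Under those assumptions the P4000 lands at about 1.05x the P2000, in line with the 5% figure above, while the P5000 and P6000 fall below 1.0x.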
Comparing Pascal-generation Quadros to Maxwell-generation Quadros
Now, a potential buyer of a P2000 isn’t considering a P6000, and vice versa. Rather, buyers of an $X card a couple of years ago are probably most interested in what that same $X can buy now. Accordingly, we measured the percent gain in performance (with the same test and the same testbench) of each new Pascal-based Quadro over its predecessor, the comparable Maxwell-based card.
Here is where we start noticing that the Pascal generation is not only a solid successor, but seems to represent a more compelling upgrade than usual. On average across viewsets, the P2000 and P5000 deliver 67% and 63% better performance than the M2000 and M5000, respectively. The P6000 also provided a sizable 50% gain over the M6000.
And the P4000 again proved to be the standout, scoring 87% better than its predecessor, the M4000.
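Combining these generational gains with the Maxwell and Pascal power figures quoted earlier in this article gives a rough view of generational power efficiency. This is a back-of-the-envelope sketch, assuming the average Viewperf 12 gain can be scaled directly by the ratio of rated board power; the names below are illustrative and the output is an estimate, not a measured result.

```python
# Article figures: average Viewperf 12 gain over the Maxwell predecessor
# (0.87 means "87% better") and rated board power, old vs. new, in watts.
GENERATIONAL = {
    "P2000": {"gain": 0.67, "old_w": 75, "new_w": 75},
    "P4000": {"gain": 0.87, "old_w": 120, "new_w": 105},
    "P5000": {"gain": 0.63, "old_w": 150, "new_w": 180},
    "P6000": {"gain": 0.50, "old_w": 250, "new_w": 250},
}


def perf_per_watt_gain(gain: float, old_w: float, new_w: float) -> float:
    """Fractional improvement in performance per watt over the predecessor."""
    return (1.0 + gain) * old_w / new_w - 1.0


if __name__ == "__main__":
    for card, figures in GENERATIONAL.items():
        print(f"{card}: {perf_per_watt_gain(**figures):+.0%} performance per watt")
```

On these inputs, the P4000's lower power pushes its estimated performance-per-watt gain past 100%, while the P5000's higher power trims its efficiency gain to well below its raw 63%.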
New Quadro P4000 and P2000 stacked up against comparable AMD Radeon Pro products.
With just two viable suppliers of workstation graphics hardware, it's natural to wonder how the rivals' product lines compare. We did not bother to compare the ultra-high end P6000 and P5000 against any AMD Radeon Pro SKUs, as AMD’s products at those tiers are now relatively old and would not present much competition. But as it happens, we do have Viewperf 12 results for several of the most recent AMD Radeon Pro products, with tests run on the exact same Boxx workstation testbench. The high-end Quadro P4000 and mid-range P2000 are priced similarly (though not identically) to, and are most comparable to, the AMD Radeon Pro WX 7100 and WX 5100, respectively. Both Radeon Pro cards are based on AMD’s recent Polaris architecture, both began shipping in Q4’16, and we expect both will bear prices modestly lower than their Quadro peers.
How'd the scores compare? Quite favorably, if you’re Nvidia. Where the WX 7100 and WX 5100 managed to edge Nvidia’s previous Maxwell-generation M4000 and M2000 (which JPR reviewed two months prior, in December of 2016), they were outmatched by the new P4000 and P2000. Below are the P4000 and P2000 scores, scores per dollar, and scores per watt, each set normalized to that card’s primary AMD competition, the WX 7100 and WX 5100, respectively. That is, in each chart, the AMD rival’s scores are equal to one.
In terms of raw scores, the Quadro P4000 and P2000 outperformed the comparable Radeon Pro SKU by 64% and 77%, respectively. In terms of performance per dollar, AMD’s lower (expected and estimated) ASPs pushed those margins down to 28% for both. And with respect to performance per watt, the P2000’s edge matched that of the raw score (since both SKUs share the same 75 watt specification). But standing out again was the P4000, which, by virtue of its reduced power envelope, managed to best the WX 7100 by 103%, basically doubling the latter’s capability.
It’s worth noting that Polaris’s successor, Vega, has been announced, but we’re not likely to see Radeon Pro cards based on this next generation until Q2’17, at the earliest.
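The three sets of margins above are linked by simple ratios, so the relative price and power of each pairing can be backed out from them. The sketch below assumes each per-dollar or per-watt margin is just the raw-score margin scaled by a fixed price or power ratio for the pair; the function name is illustrative, and because the inputs are rounded averages the outputs are approximate.

```python
def implied_cost_ratio(raw_edge: float, normalized_edge: float) -> float:
    """Nvidia-to-AMD cost ratio (dollars or watts) implied by two margins.

    raw_edge is the raw-score lead (0.64 means "64% higher"), and
    normalized_edge is the matching per-dollar or per-watt lead.
    """
    return (1.0 + raw_edge) / (1.0 + normalized_edge)


if __name__ == "__main__":
    # Margins as quoted in the article, Quadro over the comparable Radeon Pro.
    p4000_price = implied_cost_ratio(0.64, 0.28)
    p2000_price = implied_cost_ratio(0.77, 0.28)
    p4000_power = implied_cost_ratio(0.64, 1.03)
    p2000_power = implied_cost_ratio(0.77, 0.77)
    print(f"P4000 vs WX 7100: {p4000_price:.2f}x the price, {p4000_power:.2f}x the power")
    print(f"P2000 vs WX 5100: {p2000_price:.2f}x the price, {p2000_power:.2f}x the power")
```

Read this way, the quoted margins imply price assumptions that put the P4000 at roughly 1.3x the WX 7100 and the P2000 at roughly 1.4x the WX 5100, with the P4000 drawing about 0.8x the WX 7100's power.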
What do we think?
We expected Nvidia and its Pascal generation to deliver a solid set of Quadro SKUs offering compelling upgrade options for workstation applications that continue to demand higher and higher performance. But the results exceeded those expectations. Normally, and as a ballpark number, we think of 50% as a solid generation-to-generation bump to see on a graphics-centric benchmark like Viewperf 12. But in this case, the new Quadros delivered anywhere from a low of 50% higher performance (P6000) to 87% (P4000), and when looking at power efficiency, even managed over 100% improvement (P4000).
This Pascal generation of Quadro also looks to give AMD all it can handle in 2017. As we’ve pointed out in recent reviews of products from both vendors, we’re often looking at product lines in a constant game of leapfrog. With this release of the Quadro P5000 and P4000, we’re witnessing the next phase of that game. Last quarter, we got a look at AMD’s latest refresh of its Radeon Pro workstation GPU line-up, as the company introduced a trio of cards spanning high end to mid-range price points: the Polaris-generation Radeon Pro WX 7100, WX 5100 and WX 4100. In that review, AMD’s cards all performed well, not only delivering substantial gains over the AMD SKUs they had replaced, but also establishing, to varying degrees, an edge (albeit a more modest one) over comparable Nvidia Quadro cards in the M4000 and M2000.
Now, at the time, based on those results, we could have trumpeted AMD’s superiority, but we didn’t, and for one simple reason. The M4000 and M2000 were somewhat old products, and with Pascal already launched at the ultra-high end, we knew a P4000 and a P2000 were likely just around the corner. We figured those cards would just as likely leapfrog back over AMD’s, and they did just that, by a substantial margin.
Earlier, in January of 2017, we previewed AMD’s next-generation Vega, at least as much as we could on paper and with AMD-supplied testing. With Vega, AMD has its next chance.