Quadric’s third-gen Chimera GPNPU a reality

Product family expands to 864 TOPS, adds automotive-grade safety-enhanced versions.

David Harold

Chimera’s QC-series GPNPU adds more configurability and scales single cores to over 100 TOPS. The new multi-core QC-M cluster family scales up to a claimed 864 TOPS. Compute density, measured in TOPS/mm2, is up to 2.7× higher than in the previous generation. The series adds floating-point and 4-bit weight support. Safety-enhanced versions target ASIL-B and ASIL-D automotive designs. The toolchain is being certified now.

Chimera
(Source: Quadric)
ADAS compute chiplets for less than $10

Quadric has introduced the Chimera QC-series family of its general-purpose neural processor IP (GPNPU IP). Quadric specializes in a unified hardware and software architecture optimized for on-device ML inference.

The Chimera QC series is designed to sit in the automotive market between existing solutions that repurpose high-end GPGPUs and those that repurpose mobile phone chipsets.

Quadric Co-founder and CEO Veerbhan Kheterpal said, “A component supplier in the automotive market building a 3nm chiplet could deliver over 400 TOPS of fully C++ programmable ML plus DSP compute for software-defined vehicle platforms for a die cost of well under $10.”

The new IP offering blends the machine learning (ML) performance characteristics of a neural processing accelerator with the C++ programmability of a digital signal processor (DSP). This is the third-generation implementation of the Quadric Chimera architecture and includes single-core, multi-core, and safety-enhanced offerings.

Quadric introduced the Chimera QB-series GPNPU in late 2022, and the QC series is an evolution that adds more configurability to match ML inference workloads for customer system-on-chip (SoC) designs.  

The QC series includes three configurable single-core processor options:

  • Chimera QC Nano processor delivering up to a claimed 7 TOPS of ML.
  • Chimera QC Perform processor with up to 28 TOPS of claimed performance.
  • Chimera QC Ultra processor with up to 108 TOPS.

For higher performance, the QC-M family of multi-core GPNPUs offers pre-integrated clusters of two, four, or eight QC Nano, QC Perform, or QC Ultra building-block cores. Quadric says this lets designs scale from small parallel workloads (Nano cores) up to high-compute applications (eight QC Ultra cores). As an example, Quadric cites a Level 4 central ADAS application using 864 TOPS to process multiple large-format camera streams in parallel.
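The cluster arithmetic is easy to check. The short Python sketch below multiplies the quoted single-core peaks by the cluster sizes; it assumes perfectly linear scaling across cores, which is how the 864 TOPS headline figure works out but is not a statement about delivered performance on real workloads. The names CORE_TOPS and cluster_tops are illustrative only.

```python
# Peak single-core figures as quoted in the article (claimed TOPS).
CORE_TOPS = {"Nano": 7, "Perform": 28, "Ultra": 108}

# QC-M cluster sizes mentioned in the article.
CLUSTER_SIZES = (2, 4, 8)


def cluster_tops(core: str, cores: int) -> int:
    """Ideal aggregate TOPS, assuming perfectly linear scaling (an assumption)."""
    if cores not in CLUSTER_SIZES:
        raise ValueError(f"unsupported cluster size: {cores}")
    return CORE_TOPS[core] * cores


for core in CORE_TOPS:
    totals = [cluster_tops(core, n) for n in CLUSTER_SIZES]
    print(f"QC-M with {core} cores (x2, x4, x8): {totals} TOPS")
```

Eight QC Ultra cores at 108 TOPS each give the 864 TOPS that Quadric cites for its Level 4 ADAS example.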

The Chimera architecture blends multiply-accumulate (MAC) units with C++ programmable 32-bit fixed-point ALUs in each processing element (PE). An array of PEs is scaled from 64 to 1,024 PEs to build the Nano, Perform, and Ultra cores.

Each configured GPNPU core can have a ratio of eight, 16, or 32 int8 MACs for each PE. Designers targeting systems with large, weight-bound workloads, such as large language models (LLMs), should choose the eight-MAC configuration with wide AXI interfaces. Designers building systems operating on more MAC-intensive workloads, such as high-resolution image processing, could choose the higher-ratio 32-MAC-per-PE option. A 16-bit floating-point multiply-accumulate unit at half the throughput rate of the int8 MACs is a configurable option for each processor.
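A back-of-envelope estimate shows how PE count and MAC ratio translate into peak TOPS. The Python sketch below is not Quadric's formula; it assumes the common convention that one MAC counts as two operations, uses the 1.7 GHz example clock mentioned later in this article, and guesses 256 PEs for the mid-range core, since the article gives only the 64 and 1,024 endpoints.

```python
def peak_int8_tops(pes: int, macs_per_pe: int, clock_ghz: float) -> float:
    """Peak int8 TOPS, assuming 1 MAC = 2 ops (multiply + add)."""
    ops_per_cycle = pes * macs_per_pe * 2
    return ops_per_cycle * clock_ghz * 1e9 / 1e12


def peak_fp16_tflops(pes: int, macs_per_pe: int, clock_ghz: float) -> float:
    """FP16 option runs at half the int8 MAC throughput, per the article."""
    return peak_int8_tops(pes, macs_per_pe, clock_ghz) / 2


# PE counts: 64 and 1,024 come from the article; 256 is an assumption.
CONFIGS = {"Nano": 64, "Perform": 256, "Ultra": 1024}
CLOCK_GHZ = 1.7  # example 3nm clock quoted by Quadric

for name, pes in CONFIGS.items():
    for ratio in (8, 16, 32):
        tops = peak_int8_tops(pes, ratio, CLOCK_GHZ)
        print(f"{name}: {pes} PEs x {ratio} MACs/PE -> {tops:.1f} int8 TOPS")
```

With 32 MACs per PE, the estimates land near, but not exactly on, the quoted 7, 28, and 108 TOPS, so the shipping configurations or the clock behind those figures may differ from the assumptions here.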

Compared to the previous-generation Chimera processor offering, the new configuration options for Chimera QC cores can deliver up to 2.7× higher TOPS/mm2 compute density, Quadric says. There is a cycle-accurate simulator available.

For LLMs where massive sets of coefficients (weights) must be streamed into the compute engine for each token generated, there is an option to use 4-bit weights, reducing data bandwidth requirements compared to standard 8-bit integer weights. There is an extra-wide AXI interconnect interface for up to 1,024 bits/cycle.
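To see why weight width matters, consider a rough upper bound on token rate when weight streaming is the bottleneck. The Python sketch below assumes that every weight is read once per generated token, that nothing is cached on chip, and that memory can keep the interface full. The 7-billion-parameter model and 1 GHz interface clock are hypothetical values chosen for illustration; only the 1,024 bits/cycle width comes from the article.

```python
def weight_bytes_per_token(num_params: float, bits_per_weight: int) -> float:
    """Bytes of weights streamed per token if each weight is read once."""
    return num_params * bits_per_weight / 8


def interface_bytes_per_s(bits_per_cycle: int, clock_hz: float) -> float:
    """Peak interface bandwidth in bytes per second."""
    return bits_per_cycle / 8 * clock_hz


def max_tokens_per_s(num_params: float, bits_per_weight: int,
                     bits_per_cycle: int, clock_hz: float) -> float:
    """Bandwidth-bound ceiling on tokens/s (ignores compute and caching)."""
    bandwidth = interface_bytes_per_s(bits_per_cycle, clock_hz)
    return bandwidth / weight_bytes_per_token(num_params, bits_per_weight)


NUM_PARAMS = 7e9           # hypothetical model size
BUS_CLOCK_HZ = 1.0e9       # hypothetical interface clock
AXI_BITS_PER_CYCLE = 1024  # widest option cited in the article

for bits in (8, 4):
    rate = max_tokens_per_s(NUM_PARAMS, bits, AXI_BITS_PER_CYCLE, BUS_CLOCK_HZ)
    print(f"{bits}-bit weights: at most {rate:.1f} tokens/s")
```

Under these assumptions, halving the weight width halves the bytes moved per token and so doubles the bandwidth-bound ceiling, whatever model size or clock is plugged in.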

The QC processor series and the multi-core QC-M processor family are offered in safety-enhanced versions that add hardware enhancements for greater error resiliency. Quadric's toolchain is currently undergoing ISO 26262 tool confidence level certification.

Quadric says the Chimera processor architecture has already been proven at speed in silicon and that the company is ready for immediate engagement with chip design teams looking to start an IP evaluation. The IP is process agnostic, but Quadric gives an example of up to 1.7 GHz operation in 3nm processes using conventional standard-cell flows and commonly available single-ported SRAM for the entire family of QC-series GPNPUs.

Although it’s unlikely, Nvidia might ask Quadric not to use the name because Nvidia introduced the Chimera 2 Tegra 4 in 2013 and used it up until 2015.