Scaling the “scaling wall” for future computing systems

To enable future applications such as semi-autonomous, context-aware, AI-enabled digital twins, modern computing systems need performance improvements an order of magnitude beyond traditional Moore’s Law scaling. In this article, Arindam Mallik, Boris Leekens and Eric Mejdrich of imec give their views on how to achieve this ambitious goal. A system-level approach, a software-to-transistor co-design optimization loop, concurrent exploration of new computational capabilities, and a diverse group of people and skills are essential to accomplishing the performance leap required per total cost of ownership.

The need for a performance leap

Analyzing a whole genome or a set of protein markers from a drop of blood in minutes, while reducing the cost to pennies; interacting fluidly in highly dynamic and detailed AR/VR environments; relying on semi-autonomous, context-aware AI personal assistants that monitor our human digital twin: the possibilities for improving our lives offered by the combination of high-performance computing (HPC) and artificial intelligence systems seem infinite.

However, they are limited in at least one respect: the processing capacity and the increasing cost of today’s computer systems. The challenge is anything but trivial, as these new applications will require orders of magnitude improvements in performance and energy efficiency while controlling costs.

We cannot build such powerful systems with the current generation of high performance/AI hardware via traditional scaling. We cannot achieve our goal by simply adding more processor cores and memory devices: the explosion in system footprint, power consumption and cost is no longer justifiable.

But why can’t we just scale today’s systems like we did before? What are the fundamental obstacles? And what approach can we take to achieve non-linear gains in compute capacity and energy efficiency while considering total cost of ownership?

The rise of system-wide walls

For more than 50 years, scaling according to Moore’s Law (the number of transistors on the same silicon area doubles, at the same cost, approximately every two years) and Dennard’s Law (power density remains constant as transistors get smaller) has supported steady improvements in system performance at a constant cost. But for more than a decade now, it has been clear that the dimensional scaling inspired by these laws can no longer deliver the system-scaling expectations of future applications.
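
For reference, the first-order argument behind Dennard’s law (a textbook derivation, not spelled out in this article) is that when linear dimensions, supply voltage, and capacitance all shrink by a factor $\kappa$ while the clock frequency rises by $\kappa$, the power per transistor drops as $1/\kappa^{2}$, exactly compensating the $\kappa^{2}$ increase in transistor density:

\[
P \propto C V^{2} f \;\longrightarrow\; \frac{1}{\kappa}\cdot\frac{1}{\kappa^{2}}\cdot\kappa = \frac{1}{\kappa^{2}},
\qquad
\text{density} \propto \kappa^{2}
\;\Longrightarrow\;
\frac{P_{\text{total}}}{\text{area}} \approx \text{constant}.
\]

Once supply and threshold voltages could no longer be lowered, this compensation broke down; that is the power/thermal wall discussed below.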

This stems from several factors that we call scaling walls – barriers to further scaling in size, memory/bandwidth, power/thermal, and cost at historic levels.

As the number of transistors for the same Si area continues to double approximately every two years, the industry faces unusually high cost, speed, and power hurdles in complex system architectures. In traditional von Neumann computing architectures, for example, the increase in on-chip cache memory capacity has not kept pace with the evolution of logic, and feeding data to the logic cores at sufficient speed has become increasingly difficult. Besides this memory/bandwidth wall, leakage issues broke Dennard scaling, leading to heat-dissipation problems and stagnant clock frequencies, while the manufacturing cost of the latest nodes skyrocketed.
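
To make the memory/bandwidth wall concrete, here is a minimal roofline-style sketch in Python (the peak numbers are illustrative placeholders, not imec data) that estimates how many operations a processor must perform per byte fetched before it stops waiting on memory:

    # Minimal roofline-style estimate of the memory/bandwidth wall.
    # All peak numbers below are illustrative placeholders, not measurements.
    peak_compute_flops = 100e12   # hypothetical peak compute: 100 TFLOP/s
    peak_bandwidth_bps = 2e12     # hypothetical memory bandwidth: 2 TB/s

    # Arithmetic intensity (FLOPs per byte moved) at which the machine
    # switches from memory-bound to compute-bound.
    ridge_point = peak_compute_flops / peak_bandwidth_bps
    print(f"need >= {ridge_point:.0f} FLOPs per byte to stay compute-bound")

    def attainable_flops(arithmetic_intensity):
        """Roofline: performance is capped by compute or by bandwidth."""
        return min(peak_compute_flops,
                   peak_bandwidth_bps * arithmetic_intensity)

    # A streaming kernel doing roughly 1 FLOP per byte reaches only a
    # small fraction of the machine's peak:
    print(f"{attainable_flops(1.0) / peak_compute_flops:.1%} of peak at 1 FLOP/byte")

Because logic throughput has grown much faster than off-chip bandwidth, this ridge point keeps rising, which is the gap that the caches, memory hierarchies, and accelerators described next try to bridge.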

At the architectural level, combining complex memory hierarchies, multi-core designs, and domain-specific compute accelerators (xPUs) on a single system-on-chip has become a way to overcome these hurdles at scale. But even in the multicore era, with transistors continuing to scale on advanced nodes, the performance, power, area, and cost (PPAC) scaling of today’s processing units has begun to saturate.

Figure 1 – Evolution of transistor count, core count, power consumption, processor frequency, and on-chip cache capacity in classic von Neumann-based processors

Solving the “innovator’s dilemma” in an increasingly expensive world

Add to these challenges the massive cost scaling wall, and we can see that the “happy scaling era” of “faster and cheaper” is over.

The semiconductor community used to take a narrow view of cost. With each new technology generation, analyses showed a reduction in relative cost per mm² of silicon, as predicted by Moore’s law. But due to the increasing complexity of semiconductor manufacturing technologies and system architectures, this no longer translates into lower total costs. On top of the rising direct cost of packaged silicon in new technology nodes, factors such as equipment maintenance, cooling, and power consumption over the lifetime of the systems drive up the total cost of ownership.

As a result, performance per total cost of ownership decreases and systems become more and more expensive for the same physical footprint.
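
As a rough illustration of why performance per total cost of ownership, rather than silicon cost alone, is the metric that matters, the sketch below folds acquisition cost, lifetime energy, cooling, and maintenance into a single number (all input values are hypothetical placeholders, not imec estimates):

    # Toy performance-per-TCO model; every input value is a hypothetical
    # placeholder used only to show how the metric is assembled.
    def total_cost_of_ownership(capex_usd, power_kw, lifetime_years,
                                usd_per_kwh=0.10, cooling_overhead=0.4,
                                yearly_maintenance_frac=0.05):
        """Acquisition cost + lifetime energy (incl. cooling) + maintenance."""
        hours = lifetime_years * 365 * 24
        energy_cost = power_kw * hours * usd_per_kwh * (1 + cooling_overhead)
        maintenance = capex_usd * yearly_maintenance_frac * lifetime_years
        return capex_usd + energy_cost + maintenance

    def perf_per_tco(sustained_perf, **cost_args):
        """Sustained performance divided by lifetime cost."""
        return sustained_perf / total_cost_of_ownership(**cost_args)

    # A denser next-generation system may raise packaged-silicon cost and power;
    # performance per TCO only improves if performance outpaces both.
    old = perf_per_tco(1.0, capex_usd=10_000, power_kw=0.7, lifetime_years=5)
    new = perf_per_tco(1.6, capex_usd=16_000, power_kw=1.0, lifetime_years=5)
    print(f"perf/TCO ratio, new vs. old: {new / old:.2f}")

In this toy example a 60% raw performance gain yields only a few percent more performance per TCO, because higher acquisition cost and power consumption absorb most of it, which is exactly the trend described above.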

All of these factors contribute to the so-called innovator’s dilemma: the challenge that all companies ultimately face of developing and investing in disruptive innovations alongside their legacy businesses (i.e. those based on sustaining technologies that do not change the value proposition in a market), and of bringing these innovations to market successfully. Today, the challenge for the industry is to keep generating the same enormous growth rates we have seen in AI and HPC, given the realities of performance per total cost of ownership (TCO). Improving this metric by several orders of magnitude will be the main driver going forward.

Co-optimization across the system stack

We believe that unprecedented gains in computing can only be achieved by leveraging innovations across the entire system stack, from algorithms to core device components. Moreover, these innovations must be co-designed from the start to ensure optimal TCO gains.

These principles guide imec’s Computer System Architecture (CSA) activities.

From bottom-up to top-down

System-level thinking starts from a fundamental belief in a top-down approach to systems architecture. Traditionally, many developments have been driven by advances in Moore’s Law scaling: new transistor architectures paved the way for new devices and then, higher in the system stack, for higher-performance circuits, memories, and processor cores. However, a purely bottom-up approach limits the overall ability to leverage co-design across the system.

Figure 2 – From traditional bottom-up to top-down approaches

A system-level approach embraces the fact that application requirements should drive the solutions. We let target applications drive innovation at the component and system level instead of solving problems with existing hardware. We develop a mindset, framework, and methodology that enables continuous co-design from application to device across the entire stack.

Impactful and relevant applications will drive system development, anticipating what industry and society will need in the years to come.

An optimization loop from architecture to technology

The methodology for developing future compute system architectures begins with understanding the target application requirements and the critical underlying workloads and algorithms.

If we take the “rapid” analysis of a whole genome as a target application, an appropriate workload could be the classification of genetic defects. Next, we consider the full system stack by envisioning what the software components, computing system, and key device technology would look like for the target application: we define the innovations required at the different layers of abstraction, including algorithms, architecture modeling, performance analysis and implementation.

Architecture modeling and analysis then give insight into the expected performance (per TCO) at the system level and into how to change our “trajectory” to reach the target performance. The system-level benefits will be the result of the various cross-optimizations at the different layers of abstraction. These optimizations reinforce each other, ideally leading to non-linear performance improvement. This is the fundamentally iterative co-design loop.

The key performance/TCO metric from the model guides us to the next step(s). These steps can range from restructuring algorithms to evaluating different system designs and, eventually, prototyping: developing proofs of concept for a scalable, reliable, and power-efficient architecture capable of delivering the high-performance computing required for next-generation applications.
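
A minimal sketch of how such an iterative loop can be organized in code is shown below; the design knobs, the stand-in model, and the numbers are all hypothetical illustrations rather than imec’s actual tools:

    # Sketch of a co-design loop: sweep cross-layer design choices, score each
    # with a performance-per-TCO model, keep the best trajectory.
    # All knobs, factors, and the target are illustrative assumptions.
    from itertools import product

    algorithms    = ["baseline_classifier", "restructured_classifier"]  # software layer
    architectures = ["cpu_only", "cpu_plus_xpu"]                        # system layer
    memories      = ["dram_stack", "emerging_3d_memory"]                # technology layer

    def model_perf_per_tco(algorithm, architecture, memory):
        """Stand-in for an architecture model returning performance per TCO."""
        perf = {"baseline_classifier": 1.0, "restructured_classifier": 1.8}[algorithm]
        perf *= {"cpu_only": 1.0, "cpu_plus_xpu": 3.0}[architecture]
        perf *= {"dram_stack": 1.0, "emerging_3d_memory": 1.4}[memory]
        tco = {"cpu_only": 1.0, "cpu_plus_xpu": 1.6}[architecture]
        tco *= {"dram_stack": 1.0, "emerging_3d_memory": 1.2}[memory]
        return perf / tco

    target_gain = 3.0  # required improvement over today's baseline
    baseline = model_perf_per_tco("baseline_classifier", "cpu_only", "dram_stack")

    best = max(product(algorithms, architectures, memories),
               key=lambda point: model_perf_per_tco(*point))
    gain = model_perf_per_tco(*best) / baseline
    print(best, f"gain = {gain:.2f}x",
          "meets target" if gain >= target_gain else "iterate: refine models and designs")

The multiplicative factors in the stand-in model reflect the point above: optimizations at different layers compound, and the loop is repeated, refining models and design points, until the target performance per TCO is reached.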

Toolbox essentials: models, post-Moore technologies, research skills

The challenges are not insignificant and require a team with diverse skills to meet them. Existing models are not powerful enough to model and extract performance information from new system definitions. We are therefore developing new system-scale modeling and simulation capabilities intended to surpass current models in both accuracy and speed.

The models themselves incorporate the characteristics of technological building blocks – new technological capabilities that will intercept system-scaling challenges, from packaging to compute elements to software innovation. To validate the results of our models, we want to build prototypes of critical technological building blocks at all levels of system development. These technologies include not only hardware elements, but also algorithms, middleware, programming models, and networking stacks, right down to the layer where developers write software and where users interact with a device.
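
One way to picture how a device-level building block enters the system model: a single technology parameter, such as the energy needed to move one bit (the values below are hypothetical), is propagated up to a workload-level estimate that the model can compare across alternatives:

    # Sketch: propagate a device-level parameter (energy per bit moved) up to a
    # workload-level energy estimate. All values are illustrative assumptions.
    def data_movement_energy_joules(bytes_moved, picojoules_per_bit):
        """Energy spent moving data, from a per-bit device/packaging figure."""
        return bytes_moved * 8 * picojoules_per_bit * 1e-12

    workload_bytes = 200e9  # hypothetical: 200 GB moved per analysis run
    for interconnect, pj_per_bit in [("electrical_io", 5.0), ("optical_io", 1.0)]:
        joules = data_movement_energy_joules(workload_bytes, pj_per_bit)
        print(f"{interconnect}: {joules:.1f} J of data movement per run")

In practice such parameters would come from measurements on the prototyped building blocks, closing the validation loop described above.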

On the fundamental technology side, we examine existing Si-based technologies (such as advanced and 3D optical I/O technologies) and explore emerging AI algorithms and post-Moore computational alternatives. These include quantum computing, optical computing paradigms, and superconducting digital computing, all of which promise unprecedented improvements in power efficiency, computational density, and data interconnect bandwidth.

