SNIA DEVELOPER CONFERENCE



September 16-18, 2024 Santa Clara, CA

# Big Architectural Changes Are Coming

Jim Handy, Objective Analysis; Tom Coughlin, Coughlin Associates

#### Outline

- Hardware Changes
- Software Changes
- Otherware Changes
- How CXL Could Go
- Persistence and Emerging Memory Types
- Processor Specialization
- Processing In Memory
- AI Everywhere
- Wrap-Up



## How Hardware is Changing

#### From CPUs to CPUs and GPUs/TPUs

- CPU memory channels increasing
  - But DIMMs per channel is decreasing
- GPUs aggressively moving from GDDR to HBM
- From local memory to memory fabrics
  - Communication bandwidth is critical bottleneck
- From centralized to edge processing
  - Reduce bandwidth requirements by reducing data size
  - Delegate tasks to edge or endpoints






## Software Changes

Virtualization to composability to persistence to disaggregated memory to AI to fabrics to...

- Disaggregated memory applications & systems are coming
  - Also for memory fabrics
- Support for persistence
  - SNIA NVM Programming Model is a strong foundation
  - Persistent caches are coming
- ...also AI-generated code
- ...also just the fact that different languages are used for AI
  - And different talents are needed to manage it
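To make the persistence point concrete, here is a minimal sketch of the load, store, then explicit flush pattern that memory-mapped persistence relies on. It is an illustration only: it uses an ordinary file and Python's standard `mmap` module as a stand-in for persistent memory, and the `bump_counter` helper and the 8-byte record layout are hypothetical names invented for this example, not part of any SNIA API.

```python
import mmap
import os
import struct
import tempfile

# Assumption for illustration: the "persistent" state is one 8-byte counter.
RECORD = struct.Struct("<Q")


def bump_counter(path: str) -> int:
    """Increment a counter held in a memory-mapped file and flush it."""
    # Create and size the backing file on first use.
    if not os.path.exists(path) or os.path.getsize(path) < RECORD.size:
        with open(path, "wb") as f:
            f.write(b"\x00" * RECORD.size)
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), RECORD.size) as region:
            (value,) = RECORD.unpack_from(region, 0)  # load from the mapping
            RECORD.pack_into(region, 0, value + 1)    # store into the mapping
            region.flush()                            # explicit flush to the file
            return value + 1


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "counter.bin")
    print(bump_counter(demo_path))
    print(bump_counter(demo_path))
```

The second call sees the value written by the first because the state lives in the mapped file rather than in process memory. Software that is unaware of persistence would simply start from zero each time.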



## AI Talent Pool








#### Changes Coming in the Foreseeable Future

- AI-oriented networks
- Optical networks
- Widespread use of chiplets
- Widespread use of fabrics
- Machine learning at the edge
- Inference at the endpoints
- Pervasive persistence
- Coprocessors everywhere






## What is CXL Really For?

- Maintaining coherency?
- Eliminating stranded memory?
- Expanding memory size?
- Increasing memory bandwidth?
- Supporting persistent memory?
- Hiding DDR4/DDR5/DDR6 differences?
- Passing messages between xPUs?



## CXL Supports New Memory Architectures

- Disaggregated memory
- Pooled memory
- Memory fabrics
- Shared memory
- Persistent memory



#### **User Wants & Needs**

- Microsoft Azure: Pooling can save 7% in memory costs
  - Eliminates stranded memory
- Google: Stranded memory is not important
  - VMs are efficiently packed in high-resource servers
- IBM/Georgia Tech: DDR is a poor answer
  - All DRAM should be attached by CXL or OMI
- AI Providers: We need enormous memories
  - Also fast loads of GPU HBM
    - Give us bandwidth!
- Hyperscalers: "Any-to-Any" xPU connections
- PC OEMs: CXL is not immediately useful
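The stranded-memory argument can be sketched with a little arithmetic. The example below is a rough illustration under stated assumptions: the demand figures are made up, the `pooling_savings` function is hypothetical, and the pool is idealized (every byte is poolable). It does not reproduce the 7% figure quoted above; a real system would likely pool only a fraction of its memory, so actual savings would be smaller.

```python
def pooling_savings(demand_gb):
    """Compare provisioned capacity with and without an ideal memory pool.

    demand_gb[t][s] is the memory (GB) used by server s at time step t.
    """
    servers = range(len(demand_gb[0]))
    # Without pooling, each server must be provisioned for its own peak.
    unpooled = sum(max(step[s] for step in demand_gb) for s in servers)
    # With an ideal pool, capacity only has to cover the worst combined demand.
    pooled = max(sum(step) for step in demand_gb)
    return unpooled, pooled, 1 - pooled / unpooled


if __name__ == "__main__":
    # Illustrative numbers only: three servers, three time steps.
    demo = [
        [100, 40, 60],
        [50, 90, 60],
        [60, 50, 110],
    ]
    unpooled, pooled, saved = pooling_savings(demo)
    print(f"unpooled={unpooled} GB, pooled={pooled} GB, saved={saved:.1%}")
```

The gap between the two totals is the stranded memory: capacity that sits idle on one server while another is at its peak. If servers are already packed so that their peaks coincide, as the Google bullet suggests, the gap shrinks toward zero.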





#### Optimistic, Pessimistic, & Realistic Forecasts





## Very Optimistic Forecast

#### 2 Years to 100% data center adoption

- All DDR replaced by CXL in 5 years
- Widespread use of pooling
- Instant doubling of memory sizes
- AI given as the reason
- CXL to re-use older DIMMs
- MS-SSD (CXL NAND) catches on
- Switches everywhere!



#### Very Pessimistic Forecast

#### Extremely slow acceptance

- No acceptance without strong software support
- Two Olympic Cycles to create this software
- Only popular for large-memory systems
- Large-memory servers rarely required
  - A problem that Optane faced
- Pooling not adopted
  - Switches don't find homes



#### **Optimistic and Pessimistic Numbers**

Chart: Optimistic/Pessimistic CXL Forecasts. Revenues (\$ millions) on a log scale from $10^{1}$ to $10^{5}$, 2025 to 2028, with Pessimistic and Optimistic series.



#### Realistic CXL Forecast

#### **CXL Memory Module Revenues**





## Long-Term Impact

#### Re-thinking system architecture

- Disaggregated memory
- Processor arrays with memory fabrics
- Memory agnostic

#### Better memory bandwidth & size vs. worse latency

Design-arounds will optimize for this
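One way to reason about the bandwidth-and-size versus latency trade-off is a weighted-average latency across a near tier (direct-attached DRAM) and a far tier (CXL-attached memory). The sketch below is a simplified model under stated assumptions: the function name is hypothetical and the latency values are placeholders chosen for illustration, not measurements from the presentation.

```python
def average_latency_ns(near_fraction, near_ns, far_ns):
    """Weighted-average access latency for a two-tier memory system.

    near_fraction is the share of accesses served by the near tier.
    """
    if not 0.0 <= near_fraction <= 1.0:
        raise ValueError("near_fraction must be between 0 and 1")
    return near_fraction * near_ns + (1.0 - near_fraction) * far_ns


if __name__ == "__main__":
    # Placeholder latencies, for illustration only.
    NEAR_NS, FAR_NS = 100.0, 250.0
    for fraction in (1.0, 0.9, 0.5):
        avg = average_latency_ns(fraction, NEAR_NS, FAR_NS)
        print(f"{fraction:.0%} near-tier accesses: {avg:.1f} ns average")
```

Under this model, the more hot data a design keeps in the near tier, the less the far tier's extra latency matters, which is the kind of optimization a design-around would target.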



#### New Report: CXL Looks for the Perfect Home

- Released July 2024
- Covers all perspectives
  - Where CXL is useful, and where it isn't
  - Demand drivers for CXL DRAM modules
  - Opportunities outside of DRAM
  - Forecast (Revenues, units, ASP)
- Available for immediate download:
  - Objective-Analysis.com/reports










## All Emerging Memories are Persistent



21 | ©2024 SNIA. All Rights Reserved.

# RULE 2: None Use a Charge-Based Cell

- MRAM: Magnetism
- ReRAM: Resistance
  - Either metal filament or oxygen vacancy
- PCM: Resistance, too
  - Crystalline or amorphous
- FRAM: Atom displacement



# **Emerging Memory Benefits**

- Nonvolatile
- Fast write compared to flash
- Byte writeable
- Scalable well past 28nm
- Radiation-tolerant
- Based on innovative materials






# The Economics Are Challenging

- A small die size isn't enough
- Manufacturing scale determines relative cost
- Economies of scale prevail
- Intel's Optane proved the difficulty
- Volume never justified the cost
- >\$7B in Intel losses
- Micron losses ~\$400M/quarter

Source: Objective Analysis, 2022

Chart Source: Emerging Memories Branch Out



#### Where Will Persistence Reappear?

- Caches
- Chiplets









#### SRAM Caches Barely Shrink





#### A Persistent Cache? Why Not?

SRAM is not shrinking with the semiconductor process

- Cache's share of CPU chip cost is ballooning
- Emerging (and persistent) memories scale with process
- Foundries have already developed MRAM & ReRAM processes
  - In volume production today
- There are downsides:
  - SRAM is faster than emerging memories, but far more costly
  - Software support isn't fully there, but SNIA's NVM Programming Model is helpful
  - Off-the-shelf software doesn't know what to do with persistence



#### What Becomes Persistent?





#### **Persistent Chiplets**

#### Chiplets are gaining momentum

- FPGAs have used them for over 5 years
- Packaging techniques are well established
- Logic process for logic, memory process for memory
  - More cost-effective and faster time-to-market


















## Report: Emerging Memories Branch Out






#### Whither the Processor?








# What is "Processing In Memory"?

### It depends on who you talk to

- DIMMs with DRAM and a processor chip
- Hints of HBMs with processor on logic chip
- DRAM chips with an internal processor
- Processing logic within the memory bit cells
- Analog neural net chips

### Goal is to reduce data movement
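The data-movement goal can be illustrated with a toy model that counts bytes crossing the memory bus for a simple sum. This is a sketch under stated assumptions, not a model of any product named in this deck: the "banks" are plain Python lists, the 8-byte element size is assumed, and both function names are hypothetical.

```python
ELEMENT_BYTES = 8  # assumed element size for this illustration


def host_side_sum(banks):
    """Host pulls every element across the bus, then reduces."""
    moved = sum(len(bank) for bank in banks) * ELEMENT_BYTES
    total = sum(sum(bank) for bank in banks)
    return total, moved


def in_memory_sum(banks):
    """Each bank reduces locally; only one partial sum per bank is moved."""
    partials = [sum(bank) for bank in banks]
    moved = len(partials) * ELEMENT_BYTES
    return sum(partials), moved


if __name__ == "__main__":
    banks = [list(range(1000)) for _ in range(4)]
    host_total, host_moved = host_side_sum(banks)
    pim_total, pim_moved = in_memory_sum(banks)
    print(f"same result: {host_total == pim_total}")
    print(f"bytes moved: host={host_moved}, in-memory={pim_moved}")
```

Both paths produce the same answer, but the in-memory version moves one value per bank instead of every element, which is the reduction in data movement that PIM aims for.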



### Goal: Wider Buses, Greater Processing Bandwidth





### **DIMM** with an Internal Processor

### Samsung AXDIMM

Figure: Samsung AXDIMM module (Samsung DDR5 DRAM chips and an AXDIMM buffer chip).

**CXL** can replicate this approach



# **DRAM Chips with Internal Processor**



### Upmem DPU



### Natural Intelligence Automaton








### Processing Within the Memory Bit Cell

### **GSI Gemini APU**

Figure labels: Gemini Associative Processor (APU); Large Memory.

### **Macronix FortiX**




# Neural Networks: Anything But New!

### Intel's 80170NX ETANN

- Electrically-Trainable Analog Neural Network
- Introduced in 1989
- Not a commercial success







### Neural Networks Fit Emerging Memories






### **PIM Challenges**

### Lack of software support

- Few tools
- Few application programs
- Lack of existing talent

# It's a game of catch-up



### Outline

- Hardware Changes
- Software Changes
- Otherware Changes
- How CXL Could Go
- Persistence and Emerging Memory Types
- Processor Specialization
- Processing In Memory
- Al Everywhere
- Wrap-Up



# **AI** Without Limits

- Today: GPUs in the data center
- Tomorrow: Neural nets at the edge
- Later: AI manages parts of the AI system, like networks?
  - AI already manages some SSD internals
- Some CMOS Image Sensors already include an AI chip
  - Used for image recognition
- AI-generated code in use today
- AI could configure datacenters
  - AI's great at evaluating numerous options



# Where does AI fit in Tomorrow's World?



### AI eases bandwidth requirements



### What AI Brings to the Party

- Faster response times
- Reduced bandwidth requirements
- Higher data integrity
- Improved security
- Better user experience

### Protocol standards will be required



## Wrap-Up





- Hardware changes: CPUs, GPUs, fabrics, & edge processing
- Software changes: Disaggregated Memory & Fabric Support, Persistence, AI
- Otherware changes: Edge processors & ML, AI, network advances
- CXL: On its way, but how quickly?
- Persistence: Coming to a cache near you!
- Diverse processor types: More skill sets required
- PIM at the edge: Condenses communications
- Ubiquitous AI: Small doses to reduce bandwidth & delays



# QUESTIONS?


