【Class】Stanford CS149 Parallel Computing
Studying one of the "top of the pyramid" courses; it's all in English, so I'm not sure I'll be able to follow 😭 (I fall asleep in my own classes, but study other people's frame by frame 😜). The goal of this course is to provide a deep understanding of the fundamental principles and engineering trade-offs involved in designing modern parallel computing systems, as well as to teach parallel programming techniques necessary to effectively utilize these machines. Because writing good parallel programs requires an understanding of key machine performance characteristics, this course covers both parallel hardware and software design.
Course
2023 lecture videos + 2025 slides
Lecture videos
Stanford CS149 I Parallel Computing I 2023 I Kayvon Fatahalian and Kunle Olukotun - YouTube
Course info
gfxcourses.stanford.edu/cs149/fall25/
Programming Assignments
Written Assignments
Lecture 1: Why Parallelism? Why Efficiency?
A parallel computer is a collection of processing elements that cooperate to solve problems quickly
collection of processing elements: we're going to use multiple processing elements working together to get the work done
quickly: We care about performance, and we care about efficiency
Speedup
One major motivation for using parallel processing: to achieve a speedup
For a given problem:
Communication limited the maximum speedup achieved
Minimizing the cost of communication improved speedup
Imbalance in work assignment limited speedup
Improving the distribution of work improved speedup
Communication costs can dominate a parallel computation, severely limiting speedup
Course theme
Course theme 1
Designing and writing parallel programs … that scale!
Parallel thinking
1. Decomposing work into pieces that can safely be performed in parallel
2. Assigning work to processors
3. Managing communication / synchronization between the processors so that it does not limit speedup
Abstractions / mechanisms for performing the above tasks
Writing code in popular parallel programming languages
Course theme 2
Parallel computer hardware implementation: how parallel computers work
Mechanisms used to implement abstractions efficiently
1. Performance characteristics of implementations
2. Design trade-offs: performance vs. convenience vs. cost
Why do I need to know about hardware?
1.Because the characteristics of the machine really matter
2.Because you care about efficiency and performance
Course theme 3
Thinking about efficiency
FAST != EFFICIENT
Just because your program runs faster on a parallel computer, it does not mean it is using the hardware efficiently
Programmer’s perspective: make use of provided machine capabilities
HW (hardware) designer's perspective: choosing the right capabilities to put in the system (performance / cost, where cost = silicon area? power? etc.)
Until ~ 20 years ago: two significant reasons for processor performance improvement
1. Exploiting instruction-level parallelism (superscalar execution)
2. Increasing CPU clock frequency
What is a program?(from a processor’s perspective)
A program is just a list of processor instructions
What is an instruction?
It describes an operation for a processor to perform
Executing an instruction typically modifies the computer’s state
computer’s state: the values of program data, which are stored in a processor’s registers or in memory
What does a processor do?
A processor executes instructions
executes one instruction per clock
Superscalar execution: the processor automatically finds independent instructions in an instruction sequence and executes them in parallel on multiple execution units. (Alternatively, the compiler finds independent instructions at compile time and explicitly encodes the dependencies in the compiled binary.)
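As a toy C++ illustration (mine, not from the lecture): a superscalar core can issue the first two statements below in the same clock because they are independent, while the third must wait for both:

```cpp
// Illustrative sketch of instruction-level parallelism found by the hardware.
int ilp_example(int x, int y) {
    int a = x * x;   // independent of b: can issue in parallel with it
    int b = y * y;   // independent of a: can issue in parallel with it
    int c = a + b;   // depends on both a and b: must wait for them
    return c;
}
```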
Superscalar processor
This processor can decode and execute up to two instructions per clock

Diminishing returns of superscalar execution
Most available ILP (instruction-level parallelism) is exploited by a processor capable of issuing four instructions per clock (there is little performance benefit from building a processor that can issue more)
ILP tapped out + end of frequency scaling

The “power wall”
Power consumed by a transistor:
Dynamic power ∝ capacitance × voltage² × frequency
Static power: transistors burn power even when inactive due to leakage
High power = high heat
Power is a critical design constraint in modern processors
Maximum allowed frequency determined by processor’s core voltage
Single-core performance scaling
The rate of single-instruction stream performance scaling has decreased(almost to zero)
1.Frequency scaling limited by power
2.ILP scaling tapped out
Architects are now building faster processors by adding more execution units that run in parallel(Or units that are specialized for a specific task: like graphics, or audio / video playback)
Software must be written to be parallel to see performance gains. No more free lunch for software developers!
| Property | Instruction-level parallelism (ILP) | Hand-written parallelism (TLP/DLP) |
|---|---|---|
| Who is responsible | Hardware (CPU) | Programmer (software) |
| Code form | Ordinary sequential code | Explicit multi-threaded / vectorized code |
| Programmer burden | None (automatic speedup) | Heavy (must design the structure and debug concurrency bugs) |
| Scale of parallelism | Small (typically a few instructions per clock) | Large (can scale to thousands of cores) |
| Limiting factors | Dependencies between instructions, circuit complexity | Decomposability of the algorithm, synchronization overhead (Amdahl's Law) |
| Granularity | Very fine (instruction level): the most basic operations such as adds, multiplies, and memory accesses | Coarser (thread or task level): the programmer splits a large task into subtasks (e.g., split an image into four blocks and hand one to each of four cores) |
| How it is achieved | Pipelining, out-of-order execution, multiple issue, branch prediction | The programmer must decide how to partition the task, synchronize threads (locks, semaphores), avoid deadlock, and exchange data |
Mobile parallel processing
Power constraints also heavily influence the design of mobile systems
What is a big concern in all computing?
Power
Two reasons to save power
Run at higher performance for a fixed amount of time
Power = heat. If a chip gets too hot, it must be clocked down to cool off (another reason: hotter systems cost more to cool)
Run at sufficient performance for a longer amount of time
Power = battery. Long battery life is a desirable feature in mobile devices
Specialized processing is ubiquitous in mobile systems
Motion (sensor)
Parallel + specialized HW
Modern systems not only use many processing units, but also utilize specialized processing units to achieve high levels of power efficiency
Achieving efficient processing almost always comes down to accessing data efficiently
What is memory?
A program’s memory address space
A computer’s memory is organized as an array of bytes
Each byte is identified by its “address” in memory(its position in this array)(We’ll assume memory is byte-addressable)
Load: an instruction for accessing the contents of memory
Memory access latency
The amount of time it takes the memory system to provide data to the processor
Stalls
A processor “stalls” (can’t make progress) when it cannot run the next instruction in an instruction stream because future instructions depend on a previous instruction that is not yet complete
Accessing memory is a major source of stalls
Memory access times are on the order of hundreds of cycles; access time is a measure of latency
What are caches?
- A cache is a hardware implementation detail that does not impact the output of a program, only its performance
- A cache is on-chip storage that maintains a copy of a subset of the values in memory
- If an address is stored "in the cache", the processor can load/store to this address more quickly than if the data resides only in DRAM
- Caches operate at the granularity of "cache lines"
There are two forms of "data locality" in this sequence:
Spatial locality: loading data in a cache line "preloads" the data needed for subsequent accesses to different addresses in the same line, leading to cache hits
Temporal locality: repeated accesses to the same address result in hits
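A small C++ illustration (mine, not course code) showing both forms at once:

```cpp
// Row-major traversal of an N x N array stored in row-major order.
float sum_grid(int N, const float* A) {
    float sum = 0.f;                 // reused on every iteration: temporal locality
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += A[i * N + j];     // consecutive addresses: spatial locality, so
                                     // one cache-line fill serves several accesses
    return sum;
}
```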
Caches reduce the length of stalls (i.e., reduce memory access latency)
Processors run efficiently when they access data that is resident in caches
Caches reduce memory access latency when processors access data they have recently accessed! (Caches also provide high-bandwidth data transfer)
Data movement has high energy cost
Rule of thumb in modern system design: always seek to reduce the amount of data movement in a computer
Summary
Single-thread-of-control performance is improving very slowly
To run programs significantly faster, programs must utilize multiple processing elements or specialized processing hardware
Which means you need to know how to reason about and write parallel, efficient code
Writing parallel programs can be challenging
Requires partitioning the problem, communication, and synchronization
Knowledge of machine characteristics is important
In particular, understanding data movement!
Lecture 2: A Modern Multi-Core Processor (Part I)
Key concepts about how modern parallel processors achieve high throughput
Two concern parallel execution (multi-core, SIMD parallel execution)
One addresses the challenge of memory latency (multi-threading)
Pre multi-core era processor
Majority of chip transistors used to perform operations that help make a single instruction stream run fast
More transistors = larger cache, smarter out-of-order logic, smarter branch predictor, etc
Out-of-order control logic: breaks the original program order of instructions to find opportunities for parallel execution
Multi-core era processor
Use increasing transistor count to add more cores to the processor
SIMD processing(Single instruction, multiple data)
Add execution units(ALUs) to increase compute capability
Amortize cost/complexity of managing an instruction stream across many ALUs
Same instruction broadcast to all ALUs
This operation is executed in parallel on all ALUs
Mask (discard) output of ALU

The program’s use of “forall” declares to the compiler that loop iterations are independent, and that same loop body will be executed on a large number of data elements
This abstraction can facilitate automatic generation of both multi-core parallel code and vector instructions that use the SIMD processing capabilities within a core
When the code reaches if (t > 0.0) and, among the 8 lanes, some data satisfies the condition (T) and some does not (F), the hardware cannot have some ALUs run the if body while the others run the else body
Instead, the hardware does the following:
Step 1: mask off all F lanes and let only the T lanes execute the if branch. The ALUs for the F lanes sit idle (the parts marked with a red X in the figure)
Step 2: mask off all T lanes and let the F lanes execute the else branch. Now the ALUs for the T lanes sit idle
Result: total time = time to execute the if branch + time to execute the else branch. The 8 lanes that normally run in parallel momentarily degrade to something like serial execution, and throughput drops sharply
If all elements take the same branch (e.g., all are positive), the processor only pays for executing that one branch, roughly doubling efficiency
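To make the mechanism concrete, here is a scalar C++ emulation of the two-pass masked execution (illustrative only; the branch bodies are made up):

```cpp
// Emulate 8-wide SIMD execution of: if (t > 0.0f) out = t * 2.0f; else out = 0.0f;
void masked_branch(const float t[8], float out[8]) {
    bool mask[8];
    for (int lane = 0; lane < 8; lane++)
        mask[lane] = (t[lane] > 0.0f);       // evaluate the condition on all lanes
    for (int lane = 0; lane < 8; lane++)     // pass 1: "if" body, F lanes idle
        if (mask[lane]) out[lane] = t[lane] * 2.0f;
    for (int lane = 0; lane < 8; lane++)     // pass 2: "else" body, T lanes idle
        if (!mask[lane]) out[lane] = 0.0f;
}
```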
Some common jargon
Instruction stream coherence ("coherent execution")
Property of a program where the same instruction sequence applies to many data elements
Coherent execution IS NECESSARY for SIMD processing resources to be used efficiently
Coherent execution IS NOT NECESSARY for efficient parallelization across different cores, since each core can fetch/decode a different instruction stream for its thread
"Divergent" execution
A lack of instruction stream coherence in a program
SIMD instructions are generated by the compiler
Parallelism explicitly requested by the programmer using intrinsics
Parallelism conveyed using parallel language semantics (e.g., the forall example)
Parallelism inferred by dependency analysis of loops by an "auto-vectorizing" compiler
Terminology: "explicit SIMD": SIMD parallelization is performed at compile time
Can inspect program binary and see SIMD instructions (vstoreps, vmulps, etc.)
“Implicit SIMD”
Compiler generates a binary with scalar instructions
But N instances of the program are always run together on the processor
Hardware(not compiler) is responsible for simultaneously executing the same instruction from multiple program instances on different data on SIMD ALUs
SIMD width of most modern GPUs ranges from 8 to 32
Divergent execution can be a big issue (poorly written code might execute at 1/32 of the peak capability of the machine!)
Summary: three different forms of parallel execution
Superscalar: exploit ILP within an instruction stream. Process different instructions from the same instruction stream in parallel(within a core)
SIMD: multiple ALUs controlled by same instruction(within a core)
Efficient for data-parallel workloads: amortize control costs over many ALUs
Vectorization done by the compiler (explicit SIMD) or at runtime by hardware (implicit SIMD)
Multi-core: use multiple processing cores
Provides thread-level parallelism: simultaneously execute a completely different instruction stream on each core
Software creates threads to expose parallelism to hardware(e.g., via threading API)
Hardware-supported multi-threading
Data prefetching reduces stalls (hides latency)
Many modern CPUs have logic for guessing what data will be accessed in the future and “pre-fetching” this data into caches
Dynamically analyze program’s memory access patterns to make predictions
Prefetching reduces stalls since data is resident in cache when accessed
Note: Prefetching can also reduce performance if the guess is wrong(consumes bandwidth, pollutes caches)
Multi-threading reduces stalls
Interleave processing of multiple threads on the same core to hide stalls
If you can't make progress on the current thread… work on another one
Hiding stalls with multi-threading

Throughput computing: a trade-off

No free lunch: storing execution contexts
Consider on-chip storage of execution contexts as a finite resource
Many small contexts (high latency hiding ability)
16 hardware threads: storage for small working set per thread
Four large contexts (low latency hiding ability)
4 hardware threads: storage for large working set per thread
Takeaway (point 1):
A processor with multiple hardware threads has the ability to avoid stalls by performing instructions from other threads when one thread must wait for a long latency operation to complete
Note: the latency of the memory operation is not changed by multi-threading; it just no longer causes reduced processor utilization
Takeaway (point 2):
A multi-threaded processor hides memory latency by performing arithmetic from other threads
Programs that feature more arithmetic per memory access need fewer threads to hide memory stalls
Hardware-supported multi-threading
- Core manages execution contexts for multiple threads
Core still has the same number of ALU resources: multi-threading only helps use them more efficiently in the face of high-latency operations like memory access
Processor makes a decision about which thread to run each clock
- Interleaved multi-threading (a.k.a. temporal multi-threading)
Each clock, the core chooses a thread, and runs an instruction from that thread on the core's ALUs
- Simultaneous multi-threading (SMT)
Each clock, the core chooses instructions from multiple threads to run on ALUs
Example: Intel Hyper-threading (2 threads per core)
Kayvon's fictitious multi-core chip

16 cores
8 SIMD ALUs per core (128 total)
4 threads per core
16 simultaneous instruction streams
64 total concurrent instruction streams
512 independent pieces of work (64 instruction streams × 8 SIMD lanes) are needed to run the chip with maximal latency-hiding ability
Example: Intel Skylake/Kaby Lake core

Two-way multi-threaded cores (2 threads)
Each core can run up to four independent scalar instructions
and up to three 8-wide vector instructions (up to 2 vector multiplies or 3 vector adds)
GPUs: extreme throughput-oriented processors

NVIDIA V100

There are 80 SM cores on the V100: That’s 163,840 pieces of data being processed concurrently to get maximal latency hiding!
The story so far…
To utilize modern parallel processors efficiently, an application must:
1.Have sufficient parallel work to utilize all available execution units (across many cores and many execution units per core)
2.Groups of parallel work items must require the same sequences of instructions (to utilize SIMD execution)
3.Expose more parallel work than processor ALUs to enable interleaving of work to hide memory stalls
Summary
Suggestion to students: know these terms
▪ Instruction stream
▪ Multi-core processor
▪ SIMD execution
▪ Coherent control flow
▪ Hardware multi-threading
Interleaved multi-threading
Simultaneous multi-threading
Running code on a simple processor


Superscalar core

SIMD execution capability

Heterogeneous(异构) superscalar (scalar + SIMD)

Multi-threaded core

Multi-threaded, superscalar core


Multi-core, with multi-threaded, superscalar cores

Example: Intel Skylake/Kaby Lake core

Two-way multi-threaded cores (2 threads)
Each core can run up to four independent scalar instructions
and up to three 8-wide vector instructions (up to 2 vector multiplies or 3 vector adds)
GPU “SIMT” (single instruction multiple thread)
Thought experiment
You write an application that spawns two threads
The application runs on the processor shown below
Two cores, two execution contexts per core, up to two instructions per clock, one of which can be an 8-wide SIMD instruction
Question: “who” is responsible for mapping the applications’s threads to the processor’s thread execution contexts?
Answer: the operating system
Question: If you were implementing the OS, how would you assign the two threads to the four execution contexts?
Answer: prefer placing them on different cores, to avoid contention for core resources
Exception: the two threads exchange data extremely frequently (and so benefit from sharing an L1 cache)
Another question: How would you assign threads to execution contexts if your C program spawned five threads?
Answer: time-sharing plus load balancing
Lecture 3: Modern Multi-Core Architecture (Part II) + ISPC Programming Abstractions
latency and bandwidth
Memory bandwidth
The rate at which the memory system can provide data to a processor
Example: 20 GB/s
Pipelining
Consider a program that runs threads that repeat the following sequence of three dependent instructions
1. x = load 64 bytes
2. y = add x + x
3. z = add x + y
Let's say we're running this sequence on many threads of a multi-threaded core (assume there are plenty of hardware threads to hide memory latency) that:
Executes one math operation per clock
Can issue load instructions in parallel with math
Receives 8 bytes/clock from memory

Processor that can do one add per clock (+ co-issues loads)

("outstanding" memory requests = requests issued but not yet completed)
Rate of completing math instructions is limited by memory bandwidth


The computation shown is bandwidth-bound
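Plugging in the numbers above: each iteration loads 64 bytes and memory delivers 8 bytes/clock, so each load occupies the bus for 64 / 8 = 8 clocks. An iteration contains only 2 math instructions, so even with perfect latency hiding the ALUs can complete at most 2 instructions every 8 clocks, about 25% of peak: the memory bus, not the ALUs, sets the pace.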
High bandwidth memories
Modern GPUs leverage(利用) high bandwidth memories located near processor
Example:
V100 uses HBM2
900 GB/s
Computation is bandwidth limited!
If processors request data at too high a rate, the memory system cannot keep up
Overcoming bandwidth limits is often the most important challenge facing software developers targeting modern throughput-optimized systems
In modern computing, bandwidth is the critical resource
Performant parallel programs will:
Organize computation to fetch data from memory less often
Reuse data previously loaded by the same thread (temporal locality optimizations)
Share data across threads (inter-thread cooperation)
Favor performing additional arithmetic over storing and reloading values (the math is "free")
Main point: programs must access memory infrequently to utilize modern processors efficiently
Abstraction vs. implementation
Semantics: what do the operations provided by a programming model mean?
Given a program, and given the semantics (meaning) of the operations used, what is the answer that the program will compute?
Implementation (aka scheduling): how will the answer be computed on a parallel machine?
In what (potentially parallel) order will a program's operations be executed?
Which operations will be computed by each thread?
Each execution unit? Each lane of a vector instruction?
Your goal as a student:
Given a program and knowledge of how a parallel programming model is implemented, can you "trace" in your head what each part of the parallel computer is doing during each step of the program?
Programming with ISPC
Intel SPMD Program Compiler (ISPC)
SPMD: single program multiple data
ispc/ispc: Intel® Implicit SPMD Program Compiler
Intel® Implicit SPMD Program Compiler
A great read: “The Story of ISPC” (by Matt Pharr) : The story of ispc: all the links
example program
Compute sin(x) using its Taylor expansion: sin(x) = x - x^3/3! + x^5/5! - x^7/7! + …
for each element of an array of N floating-point numbers
Invoking sinx()
// C++ code: main.cpp
// C++ code: sinx.cpp
sinx() in ISPC
// C++ code: main.cpp
// ISPC code: sinx.ispc
SPMD programming abstraction:
Call to ISPC function spawns a "gang" of ISPC "program instances"
All instances run ISPC code concurrently
Each instance has its own copy of local variables
Upon return, all instances have completed
ISPC language keywords:
programCount: number of simultaneously executing instances in the gang (a uniform value)
programIndex: id of the current instance in the gang (a non-uniform value: "varying")
uniform: a type modifier; all instances have the same value for this variable. Its use is purely an optimization and is not needed for correctness
“Interleaved” assignment of array elements to program instances
Invoking sinx() in ISPC

In this illustration programCount = 8
Interleaved assignment of program instances to loop iterations

In this illustration: gang contains eight instances: programCount = 8
ISPC implements the gang abstraction using SIMD instructions
ISPC compiler generates SIMD implementation:
Number of instances in a gang is the SIMD width of the hardware (or a small multiple of SIMD width)
ISPC compiler generates a C++ function binary (.o) whose body contains SIMD instructions
C++ code links against generated object file as usual
sinx() in ISPC: version 2
“Blocked” assignment of array elements to program instances
// C++ code: main.cpp
// ISPC code: sinx.ispc
Blocked assignment of program instances to loop iterations

In this illustration: gang contains eight instances: programCount = 8
Schedule: interleaved assignment
“Gang” of ISPC program instances
Gang contains eight instances: programCount = 8

A single "packed vector load" instruction (vmovaps; see the _mm256_load_ps() intrinsic) efficiently implements:
float value = x[idx];
for all program instances, since the eight values are contiguous in memory
Schedule: blocked assignment
“Gang” of ISPC program instances
Gang contains eight instances: programCount = 8

float value = x[idx];
For all program instances now touches eight non-contiguous values in memory
Need a "gather" instruction (vgatherdps; see the _mm256_i32gather_ps() intrinsic) to implement this (gather is a more complex and more costly SIMD instruction…)
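A sketch of the difference in AVX2 intrinsics (illustrative; the index math assumes a gang of eight and a block size of `chunk`):

```cpp
#include <immintrin.h>

// Interleaved assignment: the eight lanes read contiguous floats,
// so one packed vector load (vmovups/vmovaps) suffices.
__m256 load_interleaved(const float* x, int i) {
    return _mm256_loadu_ps(&x[i]);
}

// Blocked assignment: the eight lanes read non-contiguous floats,
// so a gather (vgatherdps) with per-lane indices is needed.
__m256 load_blocked(const float* x, int i, int chunk) {
    __m256i idx = _mm256_setr_epi32(i,             i + chunk,     i + 2 * chunk,
                                    i + 3 * chunk, i + 4 * chunk, i + 5 * chunk,
                                    i + 6 * chunk, i + 7 * chunk);
    return _mm256_i32gather_ps(x, idx, 4);   // scale = sizeof(float)
}
```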
Raising level of abstraction with foreach
// C++ code: main.cpp
// ISPC code: sinx.ispc
foreach: key ISPC language construct
foreach declares parallel loop iterations
Programmer says: these are the iterations the entire gang (not each instance) must perform
ISPC implementation takes responsibility for assigning iterations to program instances in the gang
How might foreach be implemented?
Code written using foreach abstraction:
foreach (i = 0 ... N)
Implementation 1: program instance 0 executes all iterations
if (programIndex == 0) {
Implementation 2: interleave iterations onto program instances
// assume N % programCount == 0
Implementation 3: block iterations onto program instances
// assume N % programCount == 0
Implementation 4: dynamic assignment of iterations to instances
uniform int nextIter;
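A plain-C++ sketch of how the interleaved and blocked strategies assign iterations, from the perspective of one program instance `p` in a gang of `G` (names and the loop bodies are illustrative assumptions, not ISPC source):

```cpp
// Implementation 2: interleaved -- instance p handles iterations p, p+G, p+2G, ...
void foreach_interleaved(int p, int G, int N, float* x) {
    for (int i = p; i < N; i += G)
        x[i] *= 2.f;
}

// Implementation 3: blocked -- instance p handles one contiguous chunk.
void foreach_blocked(int p, int G, int N, float* x) {
    int chunk = N / G;                        // assumes N % G == 0
    for (int i = p * chunk; i < (p + 1) * chunk; i++)
        x[i] *= 2.f;
}
```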
Thinking about iterations, not parallel execution
In many simple cases, using foreach allows the programmer to express their program almost as if it was a sequential program
export void ispc_function(
What does this program do?
// main C++ code:
// ISPC code:
This ISPC program computes the absolute value of each element of x, writing each result twice to the output array y
// main C++ code:
// ISPC code:
The output of this program is undefined!
It is possible for multiple iterations of the loop body to write to the same memory location
Computing the sum of all elements in an array (incorrectly)
What’s the error in this program?
export uniform float sum_incorrect_1(
sum is of type float (a different variable for each program instance)
Cannot return many copies of a variable to the calling C code, which expects one return value of type float
Result: compile-time type error
What’s the error in this program?
export uniform float sum_incorrect_2(
sum is of type uniform float (one copy of variable for all program instances)
x[i] has a different value for each program instance
So what gets copied into sum? (one copy of sum, but many values of x[i]: multiple parallel instances cannot all write the same uniform variable)
Result: compile-time type error
Computing the sum of all elements in an array (correctly)
export uniform float sum_array(
Each instance accumulates a private partial sum (no communication)
Partial sums are added together using the reduce_add() cross-instance communication primitive
The result is the same total sum for all program instances (reduce_add() returns a uniform float)
The ISPC code will execute in a manner similar to the C code with AVX intrinsics implemented below
float sum_summary_AVX(int N, float* x) {
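A hedged sketch of what such an AVX implementation could look like (assumes N is a multiple of 8; unaligned loads used for safety):

```cpp
#include <immintrin.h>

float sum_array_avx(int N, const float* x) {
    __m256 partial = _mm256_setzero_ps();      // 8 per-lane partial sums
    for (int i = 0; i < N; i += 8)
        partial = _mm256_add_ps(partial, _mm256_loadu_ps(&x[i]));
    float tmp[8];
    _mm256_storeu_ps(tmp, partial);
    float sum = 0.f;
    for (int i = 0; i < 8; i++)                // cross-lane reduction,
        sum += tmp[i];                         // like reduce_add()
    return sum;
}
```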
ISPC’s cross program instance operations
Compute sum of a variable’s value in all program instances in a gang:
uniform int64 reduce_add(int32 x);
Compute the min of all values in a gang:
uniform int32 reduce_min(int32 a);
Broadcast a value from one instance to all instances in a gang:
int32 broadcast(int32 value, uniform int index);
For all i, pass the value from instance i to instance (i + offset) % programCount:
int32 rotate(int32 value, uniform int offset);
ISPC: abstraction vs. implementation
Single program, multiple data (SPMD) programming model
Programmer "thinks": running a gang is spawning programCount logical instruction streams (each with a different value of programIndex)
This is the programming abstraction
Program is written in terms of this abstraction
Single instruction, multiple data (SIMD) implementation
ISPC compiler emits vector instructions (e.g., AVX2, ARM NEON) that carry out the logic performed by an ISPC gang
ISPC compiler handles mapping of conditional control flow to vector instructions (by masking vector lanes, etc., like you do manually in assignment 1)
Semantics of ISPC can be tricky
SPMD abstraction + uniform values (allows implementation details to peek through the abstraction a bit)
SPMD programming model summary
SPMD = “single program, multiple data”
Define one function, run multiple instances of that function in parallel on different input arguments

ISPC tasks
- The ISPC gang abstraction is implemented by SIMD instructions that execute within one thread running on one x86 core of a CPU
- So all the code I've shown you in the previous slides would have executed on only one of the four cores of the myth machines (the shared Linux compute cluster that Stanford CS provides for students)
- ISPC contains another abstraction: a "task" that is used to achieve multi-core execution. I'll let you read up about that as you do assignment 1
An ISPC task is just a task; a thread would be an implementation detail. But it's very true that if you create eight tasks, a smart thing for ISPC to do under the hood would be to spawn eight threads and run them on different cores. But if you create 100,000 tasks, it would probably be pretty dumb for ISPC to create 100,000 threads.
How does a computer that apparently needs to run 700 kernel threads right now actually run them?
It context switches
From time to time the operating system says: here are eight threads that need to run; I'm going to put them on the processor and let the processor run. Periodically, some timer expires, and the operating system says: well, we have 700 threads, we need to let another 8 run, so we rip the current threads off the processor and put those on. And you can imagine that could be pretty slow
operating system context switch vs. hardware execution context switch
| Property | Hardware multi-threading | OS context switch |
|---|---|---|
| Who does it | Hardware (circuits), automatically | Software (kernel code), by executing instructions |
| Registers | Replicated (each thread has its own physical register set) | One set only (contents must be saved to RAM on a switch) |
| Switch cost | Extremely fast (typically 0 to a few cycles) | Very slow (thousands to hundreds of thousands of cycles) |
| Trigger | Hardware detects a stall (e.g., a cache miss) | Time slice expires, I/O blocks, system call |
| Main goal | Latency hiding; keep the execution units busy | Task management, fairness, multitasking |
ISPC is a low-level programming language: by exposing programIndex and programCount, it allows the programmer to define what work each program instance does and what data each instance accesses
Can implement programs with undefined output
Can implement programs that are correct only for a specific programCount
Everything outside a foreach must be uniform values and uniform logic (this guarantees the control flow is well-defined and synchronized across the gang before entering a parallel region)
Switching between serial and parallel contexts: in ISPC, execution can be viewed as "serial entry, parallel inside"
Outside foreach: code is in serial control flow. The compiler treats it as making decisions on behalf of the entire gang of program instances (the whole SIMD unit). Since a decision applies to the whole gang, every variable value and branch condition must be the same for all instances (i.e., uniform)
Inside foreach: this is where parallelism begins. foreach distributes data across the program instances. Only here may variables be varying (each instance holding a different value)
Hardware efficiency and register usage
uniform variables: typically live in the CPU's scalar registers; they occupy one slot and are cheap to compute with
varying variables: live in SIMD vector registers
If the variables outside foreach are uniform, the compiler can optimize aggressively, avoiding unnecessary vector operations and reducing memory bandwidth pressure; the heavyweight vector resources are engaged only when actually processing many array elements (inside foreach)
Summary
Programming models provide a way to think about the organization of parallel programs
They provide abstractions that permit multiple valid implementations
I want you to always be thinking about abstraction vs. implementation for the remainder of this course
Assignment 1: Analyzing Parallel Program Performance on a Quad-Core CPU
Setting up a local Linux development environment
Install WSL2 (Windows Subsystem for Linux)
1. Open PowerShell as administrator and run: wsl --install
2. Download Ubuntu 22.04 LTS from the Microsoft Store
3. Benefit: you can write C++/CUDA code inside Ubuntu while still using Windows for watching videos and reading docs
Guide to installing WSL2 and running Ubuntu 22.04 on Windows (掘金 article; covers Windows 10 and 11)
Install WSL | Microsoft Learn
If the installer shows "Installing: Ubuntu" but hangs at 0% (the Microsoft Store acting up), run the following to download from GitHub instead:
wsl --install -d Ubuntu --web-download
Set up the CUDA Toolkit
1. Install the latest NVIDIA driver on the Windows host
2. Install the CUDA Toolkit inside WSL2 (Ubuntu); be sure to pick the Linux-WSL version
3. After installation, run nvcc -V; seeing the version number means success
WSL2 CUDA installation walkthrough (CSDN blog)
Choice of IDE
Use VS Code with the "Remote - WSL" extension
You work in the Windows UI, but all compilation and execution happen inside the WSL2 Linux environment
Get started using VS Code with WSL | Microsoft Learn
Using C++ and WSL in VS Code
Details and implementation
[WorldClass/Stanford CS149 Parallel Computing/asst1 at master · tiny-star3/WorldClass](https://github.com/tiny-star3/WorldClass/tree/master/Stanford CS149 Parallel Computing/asst1)
Lecture 4: Parallelizing Code: The Programming Thought Process
Creating a parallel program
Your thought process:
Identify work that can be performed in parallel
Partition work (and also data associated with the work)
Manage data access, communication, and synchronization
A common goal is maximizing speedup
For a fixed computation: Speedup(P processors) = Time(1 processor) / Time (P processors)
Other goals include achieving high efficiency (cost, area, power, etc.) or working on bigger problems than can fit on one machine
Orchestration
Problem decomposition
Break up problem into tasks that can be carried out in parallel
In general: create at least enough tasks to keep all execution units on a machine busy
Key challenge of decomposition: identifying dependencies (or… a lack of dependencies)
Amdahl’s Law: dependencies limit maximum speedup due to parallelism
Let S = the fraction of total execution that is inherently sequential (dependencies prevent parallel execution)
Then the maximum speedup due to parallel execution ≤ 1/S
Max speedup on P processors given by:
speedup ≤ 1 / (S + (1 - S)/P)
(for example, if S = 0.1, the speedup can approach but never exceed 10x, no matter how large P is)
Assignment
Assigning tasks to workers
Think of “tasks” as things to do
What are “workers”? (Might be threads, program instances, vector lanes, etc.)
Goals: achieve good workload balance, reduce communication costs
Can be performed statically (before application is run), or dynamically as program executes
Although programmer is often responsible for decomposition, many languages/runtimes take responsibility for assignment
Static assignment using C++11 threads
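A minimal sketch of the idea (the row decomposition and process_rows are assumptions for illustration):

```cpp
#include <algorithm>
#include <thread>
#include <vector>

void process_rows(int row_begin, int row_end) {
    // update rows [row_begin, row_end) of the grid
}

void static_assignment(int num_rows, int num_threads) {
    std::vector<std::thread> threads;
    int per_thread = (num_rows + num_threads - 1) / num_threads;
    for (int t = 0; t < num_threads; t++) {   // assignment fixed before execution
        int begin = t * per_thread;
        int end   = std::min(num_rows, begin + per_thread);
        threads.emplace_back(process_rows, begin, end);
    }
    for (auto& th : threads) th.join();
}
```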
Dynamic assignment using ISPC tasks
Orchestration
Involves:
Structuring communication
Adding synchronization to preserve dependencies if necessary
Organizing data structures in memory
Scheduling tasks
Goals: reduce costs of communication/sync, preserve locality of data reference, reduce overhead, etc.
Machine details impact many of these decisions
If synchronization is expensive, the programmer might use it more sparingly
Assignment to hardware
Assign “threads” (“workers”) to hardware execution units
Example 1: assignment to hardware by the operating system
e.g., map a thread to HW execution context on a CPU core
Example 2: assignment to hardware by the compiler
e.g., Map ISPC program instances to vector instruction lanes
Example 3: assignment to hardware by the hardware
e.g., Map CUDA thread blocks to GPU cores (discussed in a future lecture)
Many interesting decisions:
Place related threads (cooperating threads) on the same core (maximize locality, data sharing, minimize costs of comm/sync)
Place unrelated threads on the same core (one might be bandwidth limited and another might be compute limited) to use machine more efficiently
A parallel programming example
A 2D-grid based solver
Problem: solve a partial differential equation (PDE) on an (N+2) x (N+2) grid
Solution uses an iterative algorithm:
Perform Gauss-Seidel sweeps over the grid until convergence
A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] + A[i,j+1] + A[i+1,j]);

Grid solver algorithm: find the dependencies
Pseudocode for sequential algorithm is provided below
const int n;
Step 1: identify dependencies (problem decomposition phase)

Each row element depends on element to left
Each row depends on previous row
Note: the dependencies illustrated on this slide are grid element data dependencies in one iteration of the solver (in one iteration of the “while not done” loop)

There is independent work along the diagonals!
Good: parallelism exists!
Possible implementation strategy:
1.Partition grid cells on a diagonal into tasks
2.Update values in parallel
3.When complete, move to next diagonal
Bad: independent work is hard to exploit
Not much parallelism at beginning and end of computation
Frequent synchronization (after completing each diagonal)
Idea: improve performance by changing the algorithm to one that is more amenable to parallelism
Change the order in which grid cells are updated
The new algorithm iterates to the same solution (approximately), but converges to it differently
Note: the floating-point values computed are different, but the solution still converges to within the error threshold
Yes, we needed domain knowledge of the Gauss-Seidel method to realize this change is permissible
This is a common technique in parallel programming
New approach: reorder grid cell updates via red-black coloring
Reorder grid traversal: red-black coloring

Update all red cells in parallel
When done updating red cells, update all black cells in parallel (respect dependency on red cells)
Repeat until convergence
Possible assignments of work to processors

Question: Which is better? Does it matter?
Answer: It depends on the system this program is running on
Consider dependencies in the program
1.Perform red cell update in parallel
2.Wait until all processors done with update
3.Communicate updated red cells to other processors
4.Perform black cell update in parallel
5.Wait until all processors done with update
6.Communicate updated black cells to other processors
7.Repeat
Communication resulting from assignment

shaded area = data that must be sent to P2 each iteration
Blocked assignment requires less data to be communicated between processors
Two ways to think about writing this program
Data parallel thinking
SPMD / shared address space
Data-parallel expression of grid solver
Note: to simplify pseudocode: just showing red-cell update
const int n;
Shared address space (with SPMD threads) expression of solver
Programmer is responsible for synchronization
Common synchronization primitives:
Locks (provide mutual exclusion): only one thread in the critical region at a time
Barriers: wait for threads to reach this point
// Assume these are global variables (accessible to all threads)
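As a sketch of this pattern in C++20 (std::mutex standing in for the lock, std::barrier for barrier(); the thread count, tolerance, and names are assumptions):

```cpp
#include <barrier>
#include <cmath>
#include <mutex>

constexpr int   NUM_THREADS = 4;       // assumed worker count
constexpr float TOLERANCE   = 1e-3f;   // assumed convergence threshold

float diff;                            // shared across all threads
std::mutex diff_lock;
std::barrier<> sync_point(NUM_THREADS);

void solve_worker(int tid, float* A, int n) {
    bool done = false;
    while (!done) {
        float my_diff = 0.f;           // private partial sum
        if (tid == 0) diff = 0.f;
        sync_point.arrive_and_wait();  // diff reset before anyone accumulates
        for (int i = 1 + tid; i < n - 1; i += NUM_THREADS)   // interleaved rows
            for (int j = 1; j < n - 1; j++) {
                float prev = A[i * n + j];
                A[i * n + j] = 0.2f * (A[i * n + j] + A[i * n + j - 1] + A[i * n + j + 1]
                                       + A[(i - 1) * n + j] + A[(i + 1) * n + j]);
                my_diff += std::fabs(A[i * n + j] - prev);
            }
        {
            std::scoped_lock guard(diff_lock);   // mutual exclusion on the shared sum
            diff += my_diff;
        }
        sync_point.arrive_and_wait();  // all partial sums are in
        done = (diff / (n * n) < TOLERANCE);
        sync_point.arrive_and_wait();  // nobody resets diff while others still read it
    }
}
```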
Synchronization in a shared address space
Shared address space model (abstraction)
Threads communicate by:
Reading/writing to shared variables in a shared address space(shared variables)
Communication between threads is implicit in memory loads/stores
Manipulating synchronization primitives
e.g., ensuring mutual exclusion via use of locks
This is a natural extension of sequential programming
Barrier synchronization primitive: barrier(num_threads)
Barriers are a conservative way to express dependencies
Barriers divide computation into phases
All computation by all threads before the barrier complete before any computation in any thread after the barrier begins
In other words, all computations after the barrier are assumed to depend on all computations before the barrier
Shared address space solver: one barrier
Idea: remove dependencies by using different diff variables in successive loop iterations
Trade memory footprint for removing dependencies! (a common parallel programming technique)
int n;
Grid solver implementation in two programming models
Data-parallel programming model
Synchronization:
Single logical thread of control, but iterations of forall loop may be parallelized by the system (implicit barrier at end of forall loop body)
Communication
Implicit in loads and stores (like shared address space)
Special built-in primitives for more complex communication patterns: e.g., reduce
Shared address space
Synchronization:
Mutual exclusion required for shared variables (e.g., via locks)
Barriers used to express dependencies (between phases of computation)
Communication
Implicit in loads/stores to shared variables
Summary
Amdahl’s Law
Overall maximum speedup from parallelism is limited by amount of serial execution in a program
Aspects of creating a parallel program
Decomposition to create independent work, assignment of work to workers, orchestration (to coordinate processing of work by workers), mapping to hardware
We’ll talk a lot about making good decisions in each of these phases in the coming lectures
Focus today: identifying dependencies
Focus soon: identifying locality, reducing synchronization
Lecture 5: Performance Optimization Part 1: Work Distribution and Scheduling
Programming for high performance
Optimizing the performance of parallel programs is an iterative process of refining choices for decomposition, assignment, and orchestration…
Key goals (that are at odds with each other):
Balance the workload onto available execution resources
Reduce communication (to avoid stalls)
Reduce extra work (overhead) performed to increase parallelism, manage assignment, reduce communication, etc.
TIP #1: Always implement the simplest solution first, then measure performance to determine if you need to do better
Balancing the workload
Ideally: all processors are computing all the time during program execution
(they are computing simultaneously, and they finish their portion of the work at the same time)
Static assignment
Assignment of work to threads does not depend on dynamic behavior
Assignment not necessarily set at compile time (we call it static if the assignment is determined once the amount of work and the number of workers are known: the assignment may depend on runtime parameters such as input data size and number of threads)
Good aspects of static assignment: simple, essentially zero runtime overhead to perform assignment
When is static assignment applicable?
When the cost (execution time) of work and the amount of work is predictable, allowing the programmer to work out a good assignment in advance
Simplest example: it is known up front that all work has the same cost
When work is predictable, but not all jobs have same cost
Jobs have unequal, but known cost: assign equal number of tasks to processors to ensure good load balance (on average)
When statistics about execution time are predictable (e.g., same cost on average)
“Semi-static” assignment
Cost of work is predictable for near-term future
Idea: recent past is a good predictor of near future
Application periodically profiles its own execution and re-adjusts the assignment
Assignment is “static” for the interval between re-adjustments
Dynamic assignment
Program determines assignment dynamically at runtime to ensure a well-distributed load
(The execution time of tasks, or the total number of tasks, is unknown or unpredictable.)
Dynamic assignment using a work queue
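A minimal sketch of the idea (the structure is assumed): an atomic counter plays the role of the shared work queue, and each worker repeatedly claims the next unprocessed task:

```cpp
#include <atomic>
#include <thread>
#include <vector>

void process_task(int i) { /* task body goes here */ }

void run_dynamic(int num_tasks, int num_workers) {
    std::atomic<int> next{0};                  // shared "work queue" of task indices
    std::vector<std::thread> workers;
    for (int w = 0; w < num_workers; w++)
        workers.emplace_back([&next, num_tasks] {
            while (true) {
                int i = next.fetch_add(1);     // dynamically claim the next task
                if (i >= num_tasks) break;     // queue exhausted
                process_task(i);
            }
        });
    for (auto& t : workers) t.join();
}
```

Workers that draw expensive tasks simply end up claiming fewer of them, which is how the load balances itself at runtime.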
Choosing task size
Useful to have many more tasks than processors (many small tasks enable good workload balance via dynamic assignment)
Motivates small-granularity tasks
But want as few tasks as possible to minimize the overhead of managing the assignment
Motivates large-granularity tasks
Ideal granularity depends on many factors (a common theme in this course: you must know your workload and your machine)
Decreasing synchronization overhead using distributed queues (avoids the need for all workers to synchronize on a single work queue)
Work in task queues need not be independent
Summary
Challenge: achieving good workload balance
Want all processors working all the time (otherwise, resources are idle!)
But want low-cost solution for achieving this balance
Minimize computational overhead (e.g., scheduling/assignment logic)
Minimize synchronization costs
Static assignment vs. dynamic assignment
Really, it is not an either/or decision; there is a continuum of choices
Use up-front knowledge about the workload as much as possible to reduce load imbalance and task management/synchronization costs (in the limit, if the system knows everything, use fully static assignment)
Scheduling fork-join parallelism
Common parallel programming patterns
Data parallelism:
Perform same sequence of operations on many data elements
Explicit management of parallelism with threads:
Create one thread per execution unit (or per amount of desired concurrency)
Consider divide-and-conquer algorithms
Quick sort:
// sort elements from 'begin' up to (but not including) 'end'
Fork-join pattern
Natural way to express the independent work that is inherent in divide-and-conquer algorithms
This lecture’s code examples will be in Cilk Plus
C++ language extension
Originally developed at MIT, now adopted as an open standard (in GCC, Intel ICC)
cilk_spawn foo(args);
"fork" (create new logical thread of control)
Semantics: invoke foo, but unlike a standard function call, the caller may continue executing asynchronously with the execution of foo
cilk_sync;
"join"
Semantics: returns when all calls spawned by the current function have completed ("sync up" with the spawned calls)
Note: there is an implicit cilk_sync at the end of every function that contains a cilk_spawn
(implication: when a Cilk function returns, all work associated with that function is complete)
Call-return of a function in C (and many other languages)
void my_func() {
Semantics of a function call: control moves to the function that is called (the thread executes instructions for the function)
When the function returns, control returns back to the caller (the thread resumes executing instructions from the caller)
Basic Cilk Plus examples
// foo() and bar() may run in parallel
// foo() and bar() may run in parallel
Same amount of independent work as the first example, but potentially higher runtime overhead (due to two spawns vs. one)
// foo, bar, fizz, buzz may run in parallel
Abstraction vs. implementation
Notice that the cilk_spawn abstraction does not specify how or when spawned calls are scheduled to execute
Only that they may be run concurrently with caller (and with all other calls spawned by the caller)
An implementation of Cilk is correct if it implements cilk_spawn foo() the same way as it implements a normal function call to foo()
But cilk_sync does serve as a constraint on scheduling
All spawned calls must complete before cilk_sync returns
Parallel quicksort in Cilk Plus
void quick_sort(int* begin, int* end) {
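A plain-C++ analogue of the same fork-join structure (std::async as the "fork", future::get() as the "join"; the cutoff is an assumed tuning constant, not the course's code):

```cpp
#include <algorithm>
#include <cstddef>
#include <future>

constexpr std::ptrdiff_t PARALLEL_CUTOFF = 10000;  // assumed serial fallback threshold

void quick_sort(int* begin, int* end) {
    if (end - begin <= PARALLEL_CUTOFF) {
        std::sort(begin, end);                 // small input: sort serially
        return;
    }
    int pivot = begin[(end - begin) / 2];
    // Three-way partition guarantees progress even with repeated elements.
    int* mid1 = std::partition(begin, end, [pivot](int v) { return v < pivot; });
    int* mid2 = std::partition(mid1, end, [pivot](int v) { return v == pivot; });
    auto left = std::async(std::launch::async, quick_sort, begin, mid1);  // "fork"
    quick_sort(mid2, end);                     // caller keeps working on the right half
    left.get();                                // "join"
}
```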
Writing fork-join programs
Main idea: expose independent work (potential parallelism) to the system using cilk_spawn
Recall parallel programming rules of thumb
Want at least as much work as parallel execution capability (e.g., program should probably spawn at least as much work as needed to fill all the machine’s processing resources)
Want more independent work than execution capability to allow for good workload balance of all the work onto the cores
"Parallel slack" = ratio of independent work to the machine's parallel execution capability (in practice: ~8 is a good ratio)
But not so much independent work that the granularity becomes too small (too much slack incurs the overhead of managing fine-grained work)
Scheduling fork-join programs
Consider very simple scheduler:
Launch pthread for each cilk_spawn using pthread_create
Translate cilk_sync into appropriate pthread_join calls
Potential performance problems?
Heavyweight spawn operation
Many more concurrently running threads than cores
Context switching overhead
Larger working set than necessary, less cache locality
Note: now we are going to talk about the implementation of Cilk
Pool of worker threads
The Cilk Plus runtime maintains pool of worker threads
Think: all threads are created at application launch(It’s perfectly fine to think about it this way, but in reality, runtimes tend to be lazy and initialize worker threads on the first Cilk spawn. (This is a common implementation strategy, ISPC does the same with worker threads that run ISPC tasks.))
Exactly as many worker threads as execution contexts in the machine
Consider execution of the following code
// spawned child
Specifically, consider execution from the point foo() is spawned
First, consider a serial implementation
Run child first… via a regular function call
Thread runs foo(), then returns from foo(), then runs bar()
Continuation is implicit in the thread’s stack
Per-thread work queues store “work to do”
Upon reaching cilk_spawn foo(), thread places continuation in its work queue, and begins executing foo()

Idle threads “steal” work from busy threads
If thread 1 goes idle (i.e., there is no work in its own queue), it looks in thread 0's queue for work to do
Run continuation first: queue child for later execution
Child is made available for stealing by other threads (“child stealing”)
Run child first: enqueue continuation for later execution
Continuation is made available for stealing by other threads (“continuation stealing”)
Run continuation first (“child stealing”)
Caller thread spawns work for all iterations before executing any of it
Think: breadth-first traversal of the call graph. O(N) space for spawned work (maximum space)
If no stealing, execution order is very different than that of program with cilk_spawn removed

Run child first (“continuation stealing”)
Caller thread only creates one item to steal (continuation that represents all remaining iterations)
If no stealing occurs, thread continually pops continuation from work queue, enqueues new continuation (with updated value of i)
Order of execution is the same as for program with spawn removed
Think: depth-first traversal of the call graph

Enqueues continuation with i advanced by 1
If continuation is stolen, stealing thread spawns and executes next iteration
Can prove that work queue storage for system with T threads is no more than T times that of stack storage for single threaded execution

Scheduling quicksort: assume 200 elements
void quick_sort(int* begin, int* end) {
Implementing work stealing: dequeue per worker
Work queue implemented as a dequeue (double ended queue)
Local thread pushes/pops from the “tail” (bottom)
Remote threads steal from “head” (top)
Implementing work stealing: choice of victim
Idle threads randomly choose a thread to attempt to steal from
Steal work from top of dequeue:
Steals largest amount of work (reduce number of steals)
Maximum locality in work each thread performs (when combined with run child first scheme)
Stealing thread and local thread do not contend for the same elements of the dequeue (efficient lock-free implementations of dequeues exist)
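A minimal mutex-based sketch of that discipline (real Cilk deques are lock-free; Task is a placeholder type):

```cpp
#include <deque>
#include <functional>
#include <mutex>
#include <optional>

using Task = std::function<void()>;   // placeholder task type

struct WorkDeque {
    std::deque<Task> dq;
    std::mutex m;

    void push_bottom(Task t) {                 // local thread pushes at the tail
        std::scoped_lock g(m);
        dq.push_back(std::move(t));
    }
    std::optional<Task> pop_bottom() {         // local thread pops from the tail
        std::scoped_lock g(m);
        if (dq.empty()) return std::nullopt;
        Task t = std::move(dq.back());
        dq.pop_back();
        return t;
    }
    std::optional<Task> steal_top() {          // remote thief steals from the head
        std::scoped_lock g(m);
        if (dq.empty()) return std::nullopt;
        Task t = std::move(dq.front());
        dq.pop_front();
        return t;
    }
};
```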

Child-first work stealing scheduler anticipates divide-and-conquer parallelism
for (int i=0; i<N; i++) {
void recursive_for(int start, int end) {
The second version generates work in parallel (the first does not), so it fills up a parallel machine more quickly
Implementing sync

Implementing sync: no stealing case

If no work has been stolen by other threads, then there’s nothing to do at the sync point
cilk_sync is a no-op
Implementing sync: stealing case

Idle thread 1 steals from busy thread 0
Note: a descriptor for block A is created
The descriptor tracks the number of outstanding spawns for the block, and the number of those spawns that have completed
The 1 spawn tracked by the descriptor corresponds to foo(0) being run by thread 0 (since the continuation is now owned by thread 1 after the steal)




Thread 2 now resumes continuation and executes bar()
Note block A descriptor is now free
Cilk uses greedy join scheduling
Greedy join scheduling policy
All threads always attempt to steal if there is nothing to do
Threads only go idle if there is no work to steal in the system
Worker thread that initiated spawn may not be thread that executes logic after cilk_sync
Remember:
Overhead of bookkeeping steals and managing sync points only occurs when steals occur
If large pieces of work are stolen, this should occur infrequently
Most of the time, threads are pushing/popping local work from their local dequeue
Cilk summary
Fork-join parallelism: a natural way to express divide-and-conquer algorithms
Discussed Cilk Plus, but many other systems also have fork/join primitives (e.g., OpenMP)
Cilk Plus runtime implements the spawn/sync abstraction with a locality-aware work stealing scheduler
Always run spawned child (continuation stealing)
Greedy behavior at join (threads do not wait at join, immediately look for other work to steal)
Lecture 6: Performance Optimization Part II: Locality, Communication, and Contention
shared address space model
Non-uniform memory access (NUMA)
The latency of accessing a memory location may be different from different processing cores in the system
Bandwidth from any one location may also be different to different CPU cores(In practice, you’ll find NUMA behavior on a single-socket system as well (recall: different cache slices are a different distance from each core))
Communication abstraction
Threads read/write variables in shared address space
Threads manipulate synchronization primitives: locks, atomic ops, etc
Logical extension of uniprocessor programming (but NUMA implementations require reasoning about locality for performance optimization)
Requires hardware support to implement efficiently
Any processor can load and store from any address
Can be costly to scale to large numbers of processors (one of the reasons why high-core-count processors are expensive)
Message passing model
Message passing model (abstraction)
Threads operate within their own private address spaces
Threads communicate by sending/receiving messages
send: specifies the recipient, the buffer to be transmitted, and an optional message identifier ("tag")
receive: specifies the sender, the buffer in which to store the data, and an optional message identifier
Sending messages is the only way to exchange data between threads 1 and 2
Message passing (implementation)
Hardware need not implement a single shared address space for all processors (it only needs to provide mechanisms to communicate messages between nodes)
Can connect commodity systems together to form a large parallel machine (message passing is a programming model for clusters and supercomputers)
Message passing expression of solver

Recall the grid solver application:
Update all red cells in parallel
When done updating red cells, update all black cells in parallel (respect dependency on red cells)
Repeat until convergence
Let’s think about expressing a parallel grid solver with communication via messages
One possible message passing machine configuration: a cluster of two machines
Message passing model: each thread operates in its own address space
In this figure: four threads
The grid data is partitioned into four allocations, each residing in one of the four unique thread address spaces
(four per-thread private arrays)
Data replication is now required to correctly execute the program
Grid data stored in four separate address spaces (four private arrays)

Message passing solver
Similar structure to shared address space solver, but now communication is explicit in message sends and receives
int N;
Notes on the message passing example
Computation
Array indexing is relative to local address space
Communication:
Performed by sending and receiving messages
Bulk transfer: communicate entire rows at a time
Synchronization:
Performed by sending and receiving messages
Consider how to implement mutual exclusion, barriers, flags using messages
Synchronous (blocking) send and receive
send(): call returns when the sender receives acknowledgement that the message data resides in the address space of the receiver
recv(): call returns when data from received message is copied into address space of receiver and acknowledgement sent back to sender

The hardware ensures the message and the acknowledgement are delivered successfully
There is a big problem with our message passing solver if it uses synchronous send/recv
If every thread sends and then receives at the same time, all sends block waiting for matching receives: deadlock
Message passing solver (fixed to avoid deadlock)
Send and receive ghost rows to “neighbor threads”
Even-numbered threads send, then receive
Odd-numbered threads receive, then send

int N;
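A sketch of that even/odd ordering, using MPI calls to stand in for the pseudocode's send()/recv() (the function name, buffers, and neighbor numbering are assumptions):

```cpp
#include <mpi.h>

// Exchange ghost rows with the neighbors above and below; even ranks send first,
// odd ranks receive first, so matching pairs never block on each other.
void exchange_ghost_rows(int tid, int nthreads, int row_len,
                         float* first_row, float* last_row,
                         float* top_ghost, float* bottom_ghost) {
    bool has_up = tid > 0, has_down = tid < nthreads - 1;
    if (tid % 2 == 0) {
        if (has_up)   MPI_Send(first_row, row_len, MPI_FLOAT, tid - 1, 0, MPI_COMM_WORLD);
        if (has_down) MPI_Send(last_row,  row_len, MPI_FLOAT, tid + 1, 0, MPI_COMM_WORLD);
        if (has_up)   MPI_Recv(top_ghost,    row_len, MPI_FLOAT, tid - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (has_down) MPI_Recv(bottom_ghost, row_len, MPI_FLOAT, tid + 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else {
        if (has_up)   MPI_Recv(top_ghost,    row_len, MPI_FLOAT, tid - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (has_down) MPI_Recv(bottom_ghost, row_len, MPI_FLOAT, tid + 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (has_up)   MPI_Send(first_row, row_len, MPI_FLOAT, tid - 1, 0, MPI_COMM_WORLD);
        if (has_down) MPI_Send(last_row,  row_len, MPI_FLOAT, tid + 1, 0, MPI_COMM_WORLD);
    }
}
```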
Non-blocking asynchronous send/recv
send(): call returns immediately
Buffer provided to send() cannot be modified by calling thread since message processing occurs concurrently with thread execution
Calling thread can perform other work while waiting for message to be sent
recv(): posts intent to receive in the future, returns immediately
Use checksend(), checkrecv() to determine actual status of send/receipt
Calling thread can perform other work while waiting for message to be received

When I talk about communication, I’m not just referring to messages between machines
More examples:
Communication between cores on a chip
Communication between a core and its cache
Communication between a core and memory
Think of a parallel system as an extended memory hierarchy
I want you to think of “communication” generally:
Communication between a processor and its cache
Communication between processor and memory (e.g., memory on same machine)
Communication between processor and a remote memory(e.g., memory on another node in the cluster, accessed by sending a network message)

Discussion of bandwidth-limited execution (from lecture 3)
This was an example where the processor executed 2 instructions for each cache line load

How do you tell from the figure that the memory bus is fully utilized?
Look at the blue bars in the figure (the time spent transferring data from memory): once execution reaches steady state, each blue bar ends exactly where the next one begins. No gaps means the memory bus is occupied 100% of the time; it is always transferring data and never idle. Even while the core is stalled (the red regions), the memory bus is still running flat out, so the bus is fully utilized
How would you illustrate higher memory latency (keep in mind memory requests are pipelined and memory bus bandwidth is not changed)?
Memory latency: the waiting time between the CPU issuing a memory request and the data coming back
Memory bus bandwidth: the maximum amount of data the memory system can transfer per unit time
Memory latency shows up as the gap between issuing the load command (gray) and the start of the data transfer (start of the blue bar). If latency increases, that gap grows. Since bandwidth is unchanged, the width of each blue bar stays the same, and because requests are pipelined, with enough requests in flight the blue bars still run end to end. The whole row of blue bars simply shifts right relative to the instruction clusters: the processor waits longer for the first piece of data (initial latency), but once in steady state the rate of data delivery (throughput) is unchanged
How would the figure change if memory bus bandwidth was increased?
More bandwidth means more data transferred in the same time, i.e., transferring one cache line takes less time. Each blue bar becomes narrower. If the math time (yellow) is unchanged, narrower blue bars mean data comes back sooner, the blank regions between instruction clusters (processor stalls) shrink, and overall execution gets faster
Would there still be processor stalls if the ratio of math instructions to load instructions was significantly increased? Why?
Probably not (the stalls would disappear): with more math instructions per load, the total compute time (yellow) grows. Once the compute time is long enough to cover the data transfer time (blue), the next piece of data has already arrived by the time the core finishes with the current one. The program shifts from memory-bound to compute-bound: the processor stays busy computing and no longer waits on data. This is precisely what arithmetic intensity captures
Arithmetic intensity
amount of computation (e.g., instructions) / amount of communication (e.g., bytes)
If the numerator is instead the execution time of the computation, the ratio is the reciprocal of the code's average bandwidth requirement
1 / "arithmetic intensity" = communication-to-computation ratio
Some people like to refer to the communication-to-computation ratio
I find arithmetic intensity a more intuitive quantity, since higher is better
It also sounds cooler
High arithmetic intensity (a low communication-to-computation ratio) is required to efficiently utilize modern parallel processors, since the ratio of compute capability to available bandwidth is high (recall the element-wise vector multiply example from lecture 3)
Two reasons for communication: inherent vs. artifactual communication
Inherent communication
Communication that must occur in a parallel algorithm
The communication is fundamental to the algorithm
In our message passing example at the start of class, sending ghost rows was inherent communication

Reducing inherent communication
Good assignment decisions can reduce inherent communication(increase arithmetic intensity)
1D blocked assignment: N x N grid
1D interleaved assignment: N x N grid


2D blocked assignment: N x N grid

N² elements
P processors
elements computed (per processor): N²/P
elements communicated (per processor): proportional to N/√P (the perimeter of each processor's block)
arithmetic intensity: proportional to (N²/P) / (N/√P) = N/√P
Asymptotically better communication scaling than 1D blocked assignment (whose arithmetic intensity is proportional to N/P)
Communication costs increase sub-linearly with P
Assignment captures 2D locality of algorithm
Artifactual(人工) communication
Inherent communication: information that fundamentally must be moved between processors to carry out the algorithm given the specified assignment (assumes unlimited capacity caches, minimum granularity transfers, etc.)
Artifactual communication: all other communication (artifactual communication results from practical details of system implementation)
Example: Artifactual communication arises from the behavior of caches
In this case: the communication is between memory and the processor
Data access in grid solver: row-major traversal

Assume row-major grid layout
Assume cache line is 4 grid elements
Cache capacity is 24 grid elements (6 lines)
Blue elements show data that is in cache after completing update to red element

Blue elements show data in cache at end of processing first row

Although elements (x,y)=(0,1), (1,1), (2,1), (0,2), and (2,2) have been accessed previously, they are no longer present in cache at start of processing the first output element in row 2
As a result, this program loads three cache lines for every four elements of output
Artifactual communication examples
System has minimum granularity of data transfer (system must communicate more data than what is needed by application)
Program loads one 4-byte float value but entire 64-byte cache line must be transferred from memory (16x more communication than necessary)
System operation might result in unnecessary communication:
Program stores 16 consecutive 4-byte float values, and as a result the entire 64-byte cache line is loaded from memory, entirely overwritten, then subsequently stored to memory (2x overhead… load was unnecessary since entire cache line was overwritten)
Finite replication capacity: the same data communicated to processor multiple times because cache is too small to retain it between accesses (capacity misses)
Techniques for reducing Artifactual communication
Improving temporal locality by changing grid traversal order
“Blocking”: reorder computation to reduce capacity misses

“Blocked” iteration order
(diagram shows state of cache after finishing work from first row of first block)
Now load two cache lines for every six elements of output
Improving temporal(时间) locality by “fusing(融合)” loops
Top code: two loads, one store per math op (arithmetic intensity = 1/3)
Bottom code: four loads, one store per 3 math ops (arithmetic intensity = 3/5)
Code on top is more modular (e.g., array-based math library like NumPy in Python)
Code on bottom performs much better; see the sketch below
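The two listings above survived only as their first comment lines, so here is a sketch consistent with those comments (the array names and the specific three-op computation E = D + (A + B) * C are assumptions, not the original listing):

```cpp
// Unfused ("top") version: each routine streams whole arrays through memory.
// Per element: 2 loads + 1 store per math op (arithmetic intensity = 1/3).
void add(int n, const float* X, const float* Y, float* Z) {
    for (int i = 0; i < n; i++) Z[i] = X[i] + Y[i];
}
void mul(int n, const float* X, const float* Y, float* Z) {
    for (int i = 0; i < n; i++) Z[i] = X[i] * Y[i];
}
void unfused(int n, float* A, float* B, float* C, float* D, float* E, float* tmp) {
    add(n, A, B, tmp);    // tmp = A + B
    mul(n, tmp, C, tmp);  // tmp = (A + B) * C
    add(n, tmp, D, E);    // E   = D + (A + B) * C
}

// Fused ("bottom") version: 4 loads + 1 store per 3 math ops (arithmetic intensity = 3/5).
void fused(int n, const float* A, const float* B, const float* C, const float* D, float* E) {
    for (int i = 0; i < n; i++)
        E[i] = D[i] + (A[i] + B[i]) * C[i];
}
```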
Optimization: improve arithmetic intensity by sharing data
Exploit sharing: co-locate(共同定位) tasks that operate on the same data
Schedule threads working on the same data structure at the same time on the same processor
Reduces inherent communication
Contention
A resource can perform operations at a given throughput (number of transactions per unit time)
Memory, communication links, servers, CAs at office hours, etc.
Contention occurs when many requests to a resource are made within a small window of time(the resource is a “hot spot”)
Example: distributed work queues reduce contention(contention in access to single shared work queue)

Worker threads:
Pull data from OWN work queue
Push new work to OWN work queue
(no contention when all processors have work to do)
When local work queue is empty…
STEAL work from random work queue
(synchronization okay at this point since the thread would have sat idle anyway)
Summary: reducing communication costs
Reduce overhead of communication to sender/receiver
Send fewer messages, make messages larger (amortize(分摊) overhead)
Coalesce(合并) many small messages into large ones
Reduce latency of communication
Application writer: restructure code to exploit locality
Hardware implementor: improve communication architecture
Reduce contention
Replicate contended resources (e.g., local copies, fine-grained(细粒度) locks)
Stagger(错开) access to contended resources
Increase communication/computation overlap(重叠)
Application writer: use asynchronous communication (e.g., async messages)
Requires additional concurrency in application (more concurrency than number of execution units)
Here are some tricks for understanding the performance of parallel software
Remember: Always, always, always try the simplest parallel solution first, then measure performance to see where you stand
A useful performance analysis strategy
Determine whether your performance is limited by computation, by memory bandwidth (or memory latency), or by synchronization
Try and establish “high watermarks”
What’s the best you can do in practice?
How close is your implementation to a best-case scenario?
Roofline model
In plot below, different points on the X axis correspond to different programs with different arithmetic intensities
The Y axis is the maximum obtainable instruction throughput for a program with a given arithmetic intensity

The slope of the inclined part of the roof is the memory bandwidth
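In equation form, the roofline is: attainable throughput = min(peak compute throughput, arithmetic intensity × peak memory bandwidth). Programs that land on the sloped part of the roof are bandwidth bound; programs under the flat part are compute bound.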
Roofline model: optimization regions
Use various levels of optimization in benchmarks(基准)(e.g., best performance with and without using SIMD instructions)

Establishing high watermarks
Computation, memory access, and synchronization are almost never perfectly overlapped. As a result, overall performance will rarely be dictated(决定) entirely by compute or by bandwidth or by sync. Even so, the sensitivity of performance change to the above program modifications can be a good indication of dominant costs
Add “math” (non-memory instructions)
Does execution time increase linearly with operation count as math is added? (If so, this is evidence that code is instruction-rate limited)
Remove almost all math, but load same data
How much does execution time decrease? If not much, you might suspect memory bottleneck
Change all array accesses to A[0]
How much faster does your code get? (This establishes an upper bound on benefit of improving locality of data access)
Remove all atomic operations or locks
How much faster does your code get? (provided(假设) it still does approximately the same amount of work)(This establishes an upper bound on benefit of reducing sync overhead.)
Use profilers/performance monitoring tools
Image is “CPU usage(使用率)” from activity monitor in OS X while browsing the web in Chrome (from a laptop with a quad-core Core i7 CPU)

Graph plots(绘制) percentage of time OS has scheduled a process thread onto a processor execution context
Not very helpful for optimizing performance
All modern processors have low-level event “performance counters”
Registers that count important details such as: instructions completed, clock ticks, L2/L3 cache hits/misses, bytes read from memory controller, etc
Example: Intel’s Performance Counter Monitor Tool provides a C++ API for accessing these registers
1 | PCM *m = PCM::getInstance(); |
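The listing above survived only as its first line; a sketch of typical PCM usage (function names follow PCM's public cpucounters.h API, but verify against your PCM version):

```cpp
#include "cpucounters.h"  // Intel Performance Counter Monitor

void measure_region() {
    PCM* m = PCM::getInstance();
    m->program();  // program the hardware performance counters

    SystemCounterState before = getSystemCounterState();
    // ... run the code being measured ...
    SystemCounterState after = getSystemCounterState();

    double ipc        = getIPC(before, after);              // instructions per clock
    uint64 mem_bytes  = getBytesReadFromMC(before, after);  // bytes read from memory controller
    double l3_hitrate = getL3CacheHitRatio(before, after);  // L3 cache hit ratio
}
```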
Also see Intel VTune, PAPI, oprofile, etc
Understanding problem size issues can very helpful when assessing program performance
Understanding scaling
There can be complex interactions(交互) between the size of the problem to solve and the size of the parallel computer
Can impact load balance, overhead, arithmetic intensity, locality of data access
Effects can be dramatic and application dependent
Evaluating a machine with a fixed problem size can be problematic
Too small a problem:
Parallelism overheads dominate(掩盖) parallelism benefits (may even result in slow downs)
Problem size may be appropriate for small machines, but inappropriate for large ones (does not reflect realistic usage of large machine!)
Too large a problem: (problem size chosen to be appropriate for large machine)
Key working set may not “fit” in small machine (causing thrashing to disk, or key working set exceeds cache capacity, or can’t run at all)
When problem working set “fits” in a large machine but not small one, super-linear speedups can occur
Can be desirable to scale problem size as machine sizes grow
(buy a bigger machine to compute more, rather than just compute the same problem faster)
Summary of tips
Measure, measure, measure…
Establish high watermarks for your program
Are you compute, synchronization, or bandwidth bound?
Be aware of scaling issues
Is the problem “well sized” for the machine?
Assignment 2: Scheduling Task Graphs on a Multi-Core CPU
Details and implementation:
[WorldClass/Stanford CS149 Parallel Computing/asst2 at master · tiny-star3/WorldClass](https://github.com/tiny-star3/WorldClass/tree/master/Stanford CS149 Parallel Computing/asst2)
Lecture 7: GPU Architecture & CUDA Programming
GPU compute mode
Review: how to run code on a CPU
Let's say a user wants to run a program on a multi-core CPU…
OS loads program text into memory
OS selects CPU execution context
OS interrupts processor, prepares execution context (sets contents of registers, program counter, etc. to prepare execution context)
Go!
Processor begins executing instructions from within the environment maintained in the execution context
How to run code on a GPU (prior to 2007)
Let’s say a user wants to draw a picture using a GPU…
Application (via graphics driver) provides GPU shader program binaries
Application sets graphics pipeline(管道) parameters (e.g., output image size)
Application provides GPU a buffer of vertices
Application sends GPU a “draw” command: drawPrimitives(vertex_buffer)
This was the only interface to GPU hardware
GPU hardware could only execute graphics pipeline computations
NVIDIA Tesla architecture (2007)
First alternative, non-graphics-specific (“compute mode”) interface to GPU hardware
Let’s say a user wants to run a non-graphics program on the GPU’s programmable cores…
Application can allocate buffers in GPU memory and copy data to/from buffers
Application (via graphics driver) provides GPU a single kernel program binary
Application tells GPU to run the kernel in an SPMD fashion (“run N instances of this kernel”) launch(myKernel, N)
Interestingly, this is a far simpler operation than the graphics operation drawPrimitives()
CUDA programming language
Introduced in 2007 with NVIDIA Tesla architecture
“C-like” language to express programs that run on GPUs using the compute-mode hardware interface
Relatively low-level: CUDA’s abstractions closely match the capabilities/performance characteristics of modern GPUs (design goal: maintain low abstraction distance)
CUDA programming abstractions
Describe CUDA abstractions using CUDA terminology
Specifically, be careful with the use of the term “CUDA thread”. A CUDA thread presents a similar abstraction as a pthread in that both correspond to logical threads of control, but the implementation of a CUDA thread is very different
CUDA programs consist of a hierarchy of concurrent threads
Thread IDs can be up to 3-dimensional (2D example below)
Multi-dimensional thread ids are convenient for problems that are naturally N-D

Basic CUDA syntax
1 | // Regular application thread running on CPU (the “host”) |
1 | // SPMD execution of device kernel function: |
Clear separation of host and device code
Separation of execution into host and device code is performed statically by the programmer
Number of SPMD “CUDA threads” is explicit in the program
Number of kernel invocations is determined by the launch configuration (every thread in every block runs the kernel), not by the size of the data collection
(a kernel launch is not specified by map(kernel, collection) as was the case with graphics shader programming)

1 | // Regular application thread running on CPU (the “host”) |
1 | // CUDA kernel definition |
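The two listings above kept only their first comment lines; a minimal host-plus-kernel sketch in the same spirit (the vector-add kernel and its names are assumptions for illustration):

```cpp
// CUDA kernel definition: an SPMD device function; one instance runs per CUDA thread
__global__ void addVectors(int N, const float* x, const float* y, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's unique index
    if (i < N)
        out[i] = x[i] + y[i];
}

// Regular application thread running on CPU (the "host")
void host_launch(int N, const float* dx, const float* dy, float* dout) {
    const int threadsPerBlock = 128;
    int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;
    // Bulk launch: numBlocks x threadsPerBlock CUDA threads each run addVectors
    addVectors<<<numBlocks, threadsPerBlock>>>(N, dx, dy, dout);
}
```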
CUDA execution model


CUDA memory model
Distinct(独立的) host and device address spaces

memcpy primitive
Move data between address spaces
1 | float* A = new float[N]; // allocate buffer in host mem |
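The listing kept only its first line; a sketch of the full host-side pattern using the standard cudaMalloc/cudaMemcpy calls:

```cpp
float* A = new float[N];   // allocate buffer in host mem
// ... initialize A on the host ...

float* deviceA;
cudaMalloc(&deviceA, sizeof(float) * N);                            // allocate buffer in device mem
cudaMemcpy(deviceA, A, sizeof(float) * N, cudaMemcpyHostToDevice);  // copy host -> device

// Note: deviceA names device memory; dereferencing it on the host is invalid.
```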
CUDA device memory model
Three distinct types of address spaces visible to kernels

CUDA example: 1D convolution(卷积)
output[i] = (input[i] + input[i+1] + input[i+2]) / 3.f;
1D convolution in CUDA (version 1)
One thread per output element

1 | // CUDA Kernel |
1 | // Host code |
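Only the comment lines of the kernel/host listings survived; the following sketch matches the lecture's version-1 approach (one CUDA thread per output element; assumes input holds N+2 elements so the window is valid at the end of the array, and N divisible by 128):

```cpp
#define THREADS_PER_BLK 128

// CUDA kernel: each CUDA thread computes one output element
__global__ void convolve(int N, float* input, float* output) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;  // thread-local index
    float result = 0.0f;
    for (int i = 0; i < 3; i++)
        result += input[index + i];
    output[index] = result / 3.f;
}

// Host code: launch enough blocks to cover all N outputs
// convolve<<<N / THREADS_PER_BLK, THREADS_PER_BLK>>>(N, devInput, devOutput);
```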
1D convolution in CUDA (version 2)
One thread per output element: stage input data in per-block shared memory
1 | // CUDA Kernel |
1 | // Host code |
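A sketch of the version-2 kernel in the spirit of the lecture: the block cooperatively stages its 130 input values (128 + 2 ghost elements, the 520 bytes mentioned later) into per-block shared memory before computing:

```cpp
#define THREADS_PER_BLK 128

__global__ void convolve(int N, float* input, float* output) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    __shared__ float support[THREADS_PER_BLK + 2];  // per-block shared memory (130 floats)
    support[threadIdx.x] = input[index];            // all 128 threads cooperatively load
    if (threadIdx.x < 2)                            // two threads also load the ghost elements
        support[THREADS_PER_BLK + threadIdx.x] = input[index + THREADS_PER_BLK];

    __syncthreads();  // barrier: wait until the block's input is fully staged

    float result = 0.0f;
    for (int i = 0; i < 3; i++)
        result += support[threadIdx.x + i];
    output[index] = result / 3.f;
}
```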
CUDA synchronization constructs
__syncthreads()
Barrier: wait for all threads in the block to arrive at this point
Atomic operations
e.g., float atomicAdd(float* addr, float amount)
CUDA provides atomic operations on both global memory addresses and per-block shared memory addresses
Host/device synchronization
Implicit barrier across all threads at return of kernel
Summary: CUDA abstractions
Execution: thread hierarchy
Bulk launch of many threads (this is imprecise… I’ll clarify later)
Two-level hierarchy: threads are grouped into thread blocks
Distributed address space
Built-in memcpy primitives to copy between host and device address spaces
Three different types of device address spaces
Per thread, per block (“shared”), or per program (“global”)
Barrier synchronization primitive for threads in thread block
Atomic primitives for additional synchronization (shared and global variables)
SPMD vs. SIMD
SPMD is a programming model question independent of the implementation
ISPC is an SPMD programming model, and CUDA is an SPMD programming model too
You write one program, and the system runs it many times with different thread IDs
How all of those instances are executed efficiently is up to the implementation
ISPC happens to use SIMD instructions to execute its instances efficiently
CUDA implementation on modern GPUs
CUDA semantics
Assigning work


Desirable for CUDA program to run on all of these GPUs without modification
Note: there is no concept of num_cores in the CUDA programs I have shown you. (CUDA thread launch is similar in spirit to a forall loop in data parallel model examples)
CUDA compilation
A compiled CUDA device binary includes:
Program text (instructions)
Information about required resources:
128 threads per block
B bytes of local data per thread
128+2=130 floats (520 bytes) of shared space per thread block
CUDA thread-block assignment

More detail on GPU architecture
NVIDIA V100 SM “sub-core”

Threads in a warp are executed in a SIMD manner, **if they share the same instruction**
NVIDIA calls this SIMT (single instruction multiple CUDA thread)
If the 32 CUDA threads do not share the same instruction, performance can suffer due to divergent execution
This mapping is similar to how ISPC runs program instances in a gang (but GPU hardware dynamically checks whether the 32 independent CUDA threads share an instruction, and if so executes all 32 in a SIMD manner; the CUDA program is not compiled to SIMD instructions like an ISPC gang is. One is a hardware mechanism, the other a software/compile-time mechanism)
A warp is not part of CUDA, but is an important CUDA implementation detail on modern NVIDIA GPUs
Instruction execution
Instruction stream for CUDA threads in a warp…
(note in this example all instructions are independent)
00 fp32 mul r0 r1 r2
01 int32 add r3 r4 r5
02 fp32 mul r6 r7 r8
…

Remember, entire warp of CUDA threads is running this instruction stream
So each instruction is run by all 32 CUDA threads in the warp
Since there are 16 ALUs, running the instruction for the entire warp takes two clocks
NVIDIA V100 GPU SM
This is one NVIDIA V100 streaming multi-processor (SM) unit

Running a thread block on a V100 SM

A convolve thread block is executed by 4 warps(4 warps x 32 threads/warp = 128 CUDA threads per block)
SM core operation each clock:
Each sub-core selects one runnable warp (from the 16 warps in its partition)
Each sub-core runs next instruction for the CUDA threads in the warp (this instruction may apply to all or a subset of the CUDA threads in a warp depending on divergence)
NVIDIA V100 GPU (80 SMs)

Summary: geometry of the V100 GPU
1.245 GHz clock
80 SM cores per chip
80 x 4 x 16 = 5,120 fp32 mul-add ALUs = 12.7 TFLOPs (mul-add counted as 2 flops)
Up to 80 x 64 = 5120 interleaved warps per chip (163,840 CUDA threads/chip)
Reminder: GPU “SIMT” (single instruction, multiple thread)
Running a CUDA program on a GPU
Running the convolve kernel
convolve kernel’s execution requirements:
Each thread block must execute 128 CUDA threads
Each thread block requires 130 x sizeof(float) = 520 bytes of shared memory
Let’s assume array size N is very large, so the host-side kernel launch generates thousands of thread blocks
#define THREADS_PER_BLK 128
convolve<<<N/THREADS_PER_BLK, THREADS_PER_BLK>>>(N, input_array, output_array);
Let’s run this program on the fictitious two-core GPU below
Running the CUDA kernel
Step 1: host sends CUDA device (GPU) a command (“execute this kernel”)

Step 2: scheduler maps block 0 to core 0 (reserves(预留) execution contexts for 128 threads and 520 bytes of shared storage)
Step 3: scheduler continues to map blocks to available execution contexts (interleaved mapping shown). Only two thread blocks fit on a core (a third block won't fit due to insufficient shared storage: 3 x 520 bytes > 1.5 KB)

Step 4: thread block 0 completes on core 0
Step 5: block 4 is scheduled on core 0 (mapped to execution contexts 0-127)
Step 6: thread block 2 completes on core 0
Step 7: thread block 5 is scheduled on core 0 (mapped to execution contexts 128-255)
More advanced scheduling questions
(If you understand the following examples you really understand how CUDA programs run on a GPU, and also have a good handle on the work scheduling issues we’ve discussed in the course up to this point.)
Why must CUDA allocate execution contexts for all threads in a block?

Imagine a thread block with 256 CUDA threads
Assume a fictitious(虚拟的) SM core (shown above) with only 128 threads (four warps) worth of parallel execution in HW
Why not just run threads 0-127 to completion, then run threads 128-255 to completion in order to execute the entire thread block?
The threads in a thread block are not just independent SPMD threads: they are live at the same time on the same SM, because they may need to cooperate, for example through barriers or atomics
CUDA kernels may create dependencies between threads in a block
Simplest example is __syncthreads()
Threads in a block cannot be executed by the system in any order when dependencies exist
CUDA semantics: threads in a block ARE running concurrently. If a thread in a block is runnable it will eventually(最终) be run! (no deadlock)
Implementation of CUDA abstractions
Thread blocks can be scheduled in any order by the system
System assumes no dependencies between blocks
Logically concurrent
A lot like ISPC tasks, right?
CUDA threads in same block run concurrently (live at same time)
When block begins executing, all threads exist and have register state allocated(these semantics impose(强加) a scheduling constraint on the system)
A CUDA thread block is itself an SPMD program (like an ISPC gang of program instances)
Threads in thread block are concurrent, cooperating “workers”
CUDA implementation:
An NVIDIA GPU warp has performance characteristics akin to an ISPC gang of instances (but unlike an ISPC gang, the warp concept does not exist in the programming model; exceptions to this statement include intra-warp builtin operations like swizzle and vote)
All warps in a thread block are scheduled onto the same SM, allowing for high-BW(高带宽)/low latency communication through shared memory variables
When all threads in block complete, block resources (shared memory allocations, warp execution contexts) become available for next block
“persistent thread” CUDA programming style
Idea: write CUDA code that requires knowledge of the number of cores and blocks per core that are supported by underlying GPU implementation
Programmer launches exactly as many thread blocks as will fill the GPU
(Program makes assumptions about GPU implementation: that GPU will in fact run all blocks concurrently. Ugg!)
Now, work assignment to blocks is implemented entirely by the application(circumvents(规避) GPU’s thread block scheduler)
Now the programmer’s mental model is that all CUDA threads are concurrently running on the GPU at once
CUDA summary
Execution semantics
Partitioning of problem into thread blocks is in the spirit of the data-parallel model (intended to be machine independent: system schedules blocks onto any number of cores)
Threads in a thread block actually do run concurrently (they have to, since they cooperate)
Inside a single thread block: SPMD shared address space programming
There are subtle, but notable differences between these models of execution. Make sure you understand them. (And ask yourself what semantics are being used whenever you encounter a parallel programming system)
Memory semantics
Distributed address space: host/device memories
Thread local/block shared/global variables within device memory
Loads/stores move data between them (so it is correct to think about local/shared/global memory as being distinct(不同的) address spaces)
Key implementation details:
Threads in a thread block are scheduled onto same GPU “SM” to allow fast communication through shared memory
Threads in a thread block are grouped into warps for SIMT execution on GPU hardware
Lecture 8: Data-Parallel Thinking
Think about describing algorithms in terms of(按照) operations on sequences of data
map, sort, filter, groupBy, fold / reduce, join, scan / segmented scan, partition / flatten
Main idea: high-performance parallel implementations of these operations exist. So programs written in terms of these primitives can often run efficiently on a parallel machine(if you can avoid being bandwidth bound)
Motivation
Why must an application expose large amounts of parallelism?
Utilize large numbers of cores
High-core count machines
Many machines (e.g., cluster of machines in the cloud)
SIMD processing + multi-threaded cores require even more parallelism
GPU architectures require very large amounts of parallelism
Recall: geometry of the V100 GPU

1.245 GHz clock
80 SM cores per chip
80 x 4 x 16 = 5,120 fp32 mul-add ALUs = 12.7 TFLOPs(mul-add counted as 2 flops)
Up to 80 x 64 = 5120 interleaved warps per chip (163,840 CUDA threads/chip)
This chip can concurrently execute up to 163,840 CUDA threads! (programs that do not expose significant amounts of parallelism, and don't have high arithmetic intensity, will not run efficiently on GPUs!)
Understanding dependencies is key
Key part of parallel programming is understanding when dependencies exist between operation
Lack of dependencies implies potential for parallel execution
Data-parallel model
Organize computation as operations on sequences of elements
e.g., perform same function on all elements of a sequence
A well-known modern example: NumPy: C = A + B
(A, B, and C are vectors of same length)
Key data type: sequences
Ordered collection of elements
In a C++ like language: Sequence<T>
Scala lists: List[T]
Python Pandas Dataframes
PyTorch/JAX Tensors (N-D sequences)
In a functional language (like Haskell): seq T
Important: unlike arrays, programs only access elements of a sequence through specific operations, not direct element access
Map
Map
Higher order function(高阶函数) (function that takes a function as an argument)
Applies side-effect(副作用) free unary(一元) function f :: a -> b to all elements of input sequence, producing output sequence of the same length
In a functional language (e.g., Haskell)
map :: (a -> b) -> seq a -> seq b
In C++:
template<class InputIt, class OutputIt, class UnaryOperation>
OutputIt transform(InputIt first1, InputIt last1, OutputIt d_first, UnaryOperation unary_op);
In JAX: vmap

1 | int f(int x) { return x + 10; } |
1 | a = [3, 8, 4, 6, 3, 9, 2, 8] |
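The two example listings collapsed to their first lines; here they are reconstructed as a runnable C++ use of std::transform (the C++ map analogue named above):

```cpp
#include <algorithm>
#include <vector>

int f(int x) { return x + 10; }  // side-effect-free unary function

int main() {
    std::vector<int> a = {3, 8, 4, 6, 3, 9, 2, 8};
    std::vector<int> b(a.size());
    // map f over a: b becomes [13, 18, 14, 16, 13, 19, 12, 18]
    std::transform(a.begin(), a.end(), b.begin(), f);
}
```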
Parallelizing map
Since f :: a -> b is a function (side-effect free), then applying f to all elements of the sequence can be performed in any order without changing the output of the program
Therefore, the implementation of map has flexibility to reorder/parallelize processing of elements of sequence however it sees fit(根据需要)
map f s =
partition sequence s into P smaller sequences
for each subsequence s_i (in parallel)
out_i = map f s_i
out = concatenate out_i’s
Fold
Fold (fold left)
Apply binary operation f to each element and an accumulated value
Seeded by initial value of type b
f :: (b,a) -> b

1 | // E.g. |
Parallel fold
Apply f to each element and an accumulated value
In addition to binary function f, also need an additional binary "combiner" function (no need for comb if f :: (b,b) -> b is an associative binary operator; commutativity is not required)
Seeded by initial value of type b (must be identity for f and comb)
f :: (b,a) -> b
comb :: (b,b) -> b
fold_par :: b -> ((b,a) -> b) -> ((b,b)->b) ->seq a -> b

Scan
Scan
“scan inclusive”
f :: (a,a) -> a (associative binary op)
scan :: a -> ((a,a) -> a) -> seq a -> seq a

1 | // implement scan sequentially |
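A sketch of the sequential version (with ⊕ = +), for reference:

```cpp
// Sequential inclusive scan: out[i] = in[0] + in[1] + ... + in[i]
// O(N) work, but each step depends on the previous one (inherently serial).
void scan_inclusive(int N, const int* in, int* out) {
    out[0] = in[0];
    for (int i = 1; i < N; i++)
        out[i] = out[i - 1] + in[i];
}
```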
And so what scan computes is the repeated application of that operator up to and including the current element
Fold produces a single value, scan gives you all the partials
Alternative form(另一种形式): “scan exclusive”: out[i] is the scan result for all elements up to, but excluding, in[i]
Parallel Scan
Data-parallel scan
let A = [a_0, a_1, a_2, a_3, …, a_{n-1}]
let ⊕ be an associative binary operator with identity element I
scan_inclusive(⊕, A) = [a_0, a_0⊕a_1, a_0⊕a_1⊕a_2, …]
scan_exclusive(⊕, A) = [I, a_0, a_0⊕a_1, …]
If the operator is +, then scan_inclusive(+, A) is called a "prefix sum"
prefix_sum(A) = [a_0, a_0+a_1, a_0+a_1+a_2, …]
Data-parallel inclusive scan (subtract the original vector to get the exclusive scan result; not shown)

Work(Total operations performed): O(N lg N)
Span(Longest chain of sequential steps): O(lg N)
Inefficient in total work compared to the sequential algorithm!
Work-efficient parallel exclusive scan (O(N) work)

Work efficient exclusive scan algorithm(with ⊕ = “+”)
Up-sweep:
for d = 0 to (log2(n) - 1) do
  forall k = 0 to n-1 by 2^(d+1) do
    a[k + 2^(d+1) - 1] = a[k + 2^d - 1] + a[k + 2^(d+1) - 1]
Down-sweep:
a[n-1] = 0
for d = (log2(n) - 1) down to 0 do
  forall k = 0 to n-1 by 2^(d+1) do
    tmp = a[k + 2^d - 1]
    a[k + 2^d - 1] = a[k + 2^(d+1) - 1]
    a[k + 2^(d+1) - 1] = tmp + a[k + 2^(d+1) - 1]
Work: O(N) (but what is the constant? 5)
Span: O(lg N) (but what is the constant? 2)
Scan: two processor (shared memory) implementation

Work: O(N) (but constant is now only 1.5)
Data-access:
Very high spatial(空间) locality (contiguous memory access)
P1’s access to a8 through a8-11 may be more costly on large core count system with non-uniform(不均匀的) memory access costs, but on small-scale multi-core system the access cost is likely the same as from P2
Exclusive scan: SIMD implementation (in CUDA)
Example: perform exclusive scan on 32-element array: SPMD program, assume 32-wide SIMD execution
When scan_warp is run by a group of 32 CUDA threads, each thread returns the exclusive scan result for element idx (also: upon completion, ptr[] stores the inclusive scan result)
1 | // 32-wide GPU execution (SPMD program) |
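The SPMD listing collapsed to its first comment; a sketch of the classic warp-level scan it describes (pre-Volta warp-synchronous style; on current CUDA one would insert __syncwarp() between steps or use __shfl_up_sync instead):

```cpp
// Exclusive scan across one 32-thread warp. Each thread owns element ptr[idx].
// Not work-efficient (N lg N adds), but keeps all SIMD lanes busy.
__device__ int scan_warp_exclusive(volatile int* ptr, const unsigned int idx) {
    const unsigned int lane = idx & 31;  // thread's position within its warp
    if (lane >= 1)  ptr[idx] = ptr[idx - 1]  + ptr[idx];
    if (lane >= 2)  ptr[idx] = ptr[idx - 2]  + ptr[idx];
    if (lane >= 4)  ptr[idx] = ptr[idx - 4]  + ptr[idx];
    if (lane >= 8)  ptr[idx] = ptr[idx - 8]  + ptr[idx];
    if (lane >= 16) ptr[idx] = ptr[idx - 16] + ptr[idx];
    // ptr[] now holds the inclusive scan; shift by one for the exclusive result
    return (lane > 0) ? ptr[idx - 1] : 0;
}
```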

Work: O(N lg N)
Work-efficient formulation of scan is not beneficial in this context because it results in low SIMD utilization
Work efficient algorithm would require more than 2x the number of instructions as the implementation above!
Doing less total work here would mean under-utilizing the SIMD lanes: the work-efficient version spreads its work over many more divergent instructions, so it is a net loss
Building scan on larger array
Example: 128-element scan using four-warp thread block

Multi-threaded, SIMD CUDA implementation
Example: cooperating threads in a CUDA thread block perform scan
We provide similar code in assignment 3
Code assumes length of array given by ptr is same as number of threads per block
1 | // idx: CUDA thread index of caller |
Building a larger scan
Example: one million element scan (1024 elements per block)

Exceeding 1 million elements requires partitioning phase two into multiple blocks
Scan implementation
Parallelism
Scan algorithm features O(N) parallel work
But efficient implementations only leverage(利用) as much parallelism as required to make good utilization of the machine
Goal is to reduce work and reduce communication/synchronization
Locality
Multi-level implementation to match memory hierarchy
(CUDA example: per-block implementation carried out in local memory)
Heterogeneity(异构性) in algorithm: different strategy for performing scan at different levels of the machine
CUDA example: different algorithm for intra-warp scan than inter-thread scan
Low-core count CPU example: based largely on sequential scan
Parallel Segmented Scan
Segmented scan
Common problem: operating on a sequence of sequences
Examples:
For each vertex v in a graph:
For each edge e connected to v
For each particle p in a simulation
For each particle within distance D of p
For each document d in a collection
For each word in d
There are two levels of parallelism in the problem that a programmer might want to exploit
But it is irregular: the size of edge lists, particle neighbor lists, words per document, etc, may be very different from vertex to vertex (or particle to particle)
Generalization(推广) of scan
Simultaneously(同时) perform scans on contiguous partitions of input sequence
let A = [ [1, 2], [6], [1, 2, 3, 4] ]
let ⊕ = +
segmented_scan_exclusive(⊕, A) = [[0,1], [0], [0,1,3,6]]
Assume a simple “start-flag” representation of nested(嵌套) sequences:
Consider nested sequence A = [[1,2,3],[4,5,6,7,8]]
flag: 1 0 0 1 0 0 0 0
data: 1 2 3 4 5 6 7 8
Work-efficient segmented scan(with ⊕ = “+”)
Up-sweep:
for d = 0 to (log2(n) - 1) do:
  forall k = 0 to n-1 by 2^(d+1) do:
    if flag[k + 2^(d+1) - 1] == 0:
      data[k + 2^(d+1) - 1] = data[k + 2^d - 1] + data[k + 2^(d+1) - 1]
    flag[k + 2^(d+1) - 1] = flag[k + 2^d - 1] || flag[k + 2^(d+1) - 1]
Down-sweep:
data[n-1] = 0
for d = (log2(n) - 1) down to 0 do:
  forall k = 0 to n-1 by 2^(d+1) do:
    tmp = data[k + 2^d - 1]
    data[k + 2^d - 1] = data[k + 2^(d+1) - 1]
    if flag_original[k + 2^d] == 1:  # must maintain copy of original flags
      data[k + 2^(d+1) - 1] = 0      # start of segment
    else if flag[k + 2^d - 1] == 1:
      data[k + 2^(d+1) - 1] = tmp
    else:
      data[k + 2^(d+1) - 1] = tmp + data[k + 2^(d+1) - 1]
    flag[k + 2^d - 1] = 0
Segmented scan (exclusive)

Scan/segmented scan summary
Scan
Theory: parallelism in problem is linear in number of elements
Practice: exploit locality, use only as much parallelism as necessary to fill the machine’s execution resources
Great example of applying different strategies at different levels of the machine
Segmented scan
Express computation and operate on irregular data structures (e.g., list of lists) in a regular, data parallel way
Gather/scatter: key data-parallel operations
gather(index, input, output)
output[i] = input[index[i]]
scatter(index, input, output)
output[index[i]] = input[i]
“Gather data from data_seq according to indices in index_seq”
output_seq = gather(index_seq, data_seq)

“Scatter data from data_seq according to indices in index_seq”
output_seq = scatter(index_seq, data_seq)

Gather machine instruction
“Gather from buffer mem_base into R1 according to indices specified by R0.”
gather(R1, R0, mem_base);

Gather supported with AVX2 in 2013
But AVX2 does not support SIMD scatter (must implement as scalar loop)
Scatter instruction exists in AVX512
Hardware supported gather/scatter exists on GPUs(still an expensive operation compared to load/store of contiguous vector)
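A sketch of a hardware gather using the AVX2 intrinsic mentioned above (_mm256_i32gather_ps; the wrapper function is just for illustration):

```cpp
#include <immintrin.h>

// Gather 8 floats: result[i] = input[index[i]] for i in 0..7 (one AVX2 instruction).
__m256 gather8(const float* input, const int* index) {
    __m256i vindex = _mm256_loadu_si256((const __m256i*)index);  // load 8 32-bit indices
    return _mm256_i32gather_ps(input, vindex, /*scale=*/4);      // scale 4 = sizeof(float)
}
```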
Segmented scan + gather: sparse matrix multiplication example

Most values in matrix are zero
Note: easy parallelization by parallelizing the different per-row dot products
But different amounts of work per row (complicates wide SIMD execution)
Example sparse storage format: compressed sparse row
values = [ [3,1], [2], [4], ..., [2,6,8] ]
cols = [ [0,2], [1], [2], ...., ]
row_starts = [0, 2, 3, 4, ... ]
Sparse matrix multiplication with scan
x = [x0,x1,x2,x3]
values = [ [3,1], [2], [4], [2,6,8] ]
cols = [ [0,2], [1], [2], [1,2,3] ]
row_starts = [0, 2, 3, 4]
1.Gather from x based on cols: gathered[i] = x[cols[i]]
gathered = [x0, x2, x1, x2, x1, x2, x3]
2.Map (multiplication) over all non-zero values: products[i] = values[i] * gathered[i]
products = [3x0, x2, 2x1, 4x2, 2x1, 6x2, 8x3]
3.Create flags vector from row_starts: flags = [1,0,1,1,1,0,0]
4.Perform inclusive segmented-scan on (products, flags) using addition operator
[3x0, 3x0+x2, 2x1, 4x2, 2x1, 2x1+6x2, 2x1+6x2+8x3]
5.Take last element in each segment:
y = [3x0+x2, 2x1, 4x2 , 2x1+6x2+8x3]
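For contrast with the data-parallel formulation above, a sketch of the same computation as a direct sequential loop over the CSR structure (note row_starts needs num_rows + 1 entries, the last being the total non-zero count):

```cpp
// Sequential sparse matrix-vector multiply (y = A*x) over compressed sparse row storage.
void spmv_csr(int num_rows, const float* values, const int* cols,
              const int* row_starts, const float* x, float* y) {
    for (int r = 0; r < num_rows; r++) {
        float sum = 0.f;
        for (int i = row_starts[r]; i < row_starts[r + 1]; i++)
            sum += values[i] * x[cols[i]];  // the gather step: x[cols[i]]
        y[r] = sum;
    }
}
```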
Turning a scatter into sort/gather
Special case: assume elements of index are unique and all elements referenced in index (scatter is a permutation)
scatter(index, input, output) {
output = sort input sequence by values in index sequence
}

Implementing "scatterOp" (atomic scatter) with sort/map/segmented-scan
Now, assume elements in index are not unique, so synchronization is required for atomicity!
for all elements in sequence
output[index[i]] = atomicOp(output[index[i]], input[i])
Example: atomicAdd(&output[index[i]], input[i])
e.g,: index = [1, 1, 0, 2, 0, 0]
Step 1: Sort input sequence according to values in index sequence:
Sorted index: [0, 0, 0, 1, 1, 2]
Input sorted by index: [input[2], input[4], input[5], input[0], input[1], input[3]]
Step 2: Compute starts of each range of values with the same index number
starts: [1, 0, 0, 1, 0, 1]
Step 3: Segmented scan (using ‘op’) on each range
[op(op(input[2], input[4]), input[5]), op(input[0], input[1]), input[3]]
More sequence operations
Group by key
Seq (key, T) —> Seq (key, Seq T)
Creates a sequence of sequences containing elements with the same key

Filter
Remove elements from sequence that do not match predicate

Sort
Example
Example: create grid of particles data structure on large parallel machine (e.g., a GPU)
Problem: place 1M point particles in a 16-cell uniform grid based on 2D position
Parallel data structure manipulation problem: build a 2D array of lists
Recall: Up to 2048 CUDA threads per SM core on a V100 GPU (80 SM cores)


Common use of this structure: N-body problems
A common operation is to compute interactions(相互作用) with neighboring particles
Example: given a particle, find all particles within radius R
Organize particles by placing them in grid with cells of size R
Only need to inspect particles in surrounding grid cells

Solution 1: parallelize over particles
One answer: assign one particle to each CUDA thread. Each thread computes cell containing particle, then atomically updates per cell list
Problem: massive contention: thousands of threads contending for access to single shared data structure
1 | list cell_list[16]; // 2D array of lists |
Solution 2: use finer-granularity locks
Alleviate(缓解) contention for single global lock by using per-cell locks
Assuming uniform(均匀) distribution of particles in 2D space… ~16x less contention than previous solution
1 | list cell_list[16]; // 2D array of lists |
Solution 3: parallelize over cells
Decompose work by cells: for each cell, independently compute what particles are within it (eliminates contention because no synchronization is required)
Insufficient parallelism: only 16 parallel tasks, but thousands of independent tasks are needed to efficiently utilize the GPU
Work inefficient: performs 16 times more particle-in-cell computations than sequential algorithm
1 | list cell_lists[16]; // 2D array of lists |
Solution 4: compute partial results + merge
Yet another answer: generate N “partial” grids in parallel, then combine
Example: create N thread blocks (at least as many thread blocks as SM cores)
All threads in thread block update same grid
Enables faster synchronization: contention reduced by factor of N and cost of synchronization is lower because it is performed on block-local variables (in CUDA shared memory)
Requires extra work: merging the N grids at the end of the computation
Requires extra memory footprint(占用): stores N grids of lists, rather than 1

Solution 5: data-parallel approach
Step 1: map
compute cell containing each particle (parallel over input particles)

Step 2: sort results by cell (notice that the particle index array is also permuted(排列) based on sort)

Step 3: find start/end of each cell (parallel over particle_index elements)

1 | // This code is run for each element of the particle_index array |
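The listing survived only as its comment line; a sketch of that per-element step (assumes the particle-to-cell array has already been sorted by cell, as in step 2; the array names are assumptions):

```cpp
// One CUDA thread per element of the sorted cell array: a cell's range starts
// (or ends) exactly where the cell id differs from a neighbor's.
__global__ void find_cell_bounds(int N, const int* cell_of_particle,
                                 int* cell_starts, int* cell_ends) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    int cell = cell_of_particle[i];
    if (i == 0 || cell != cell_of_particle[i - 1])
        cell_starts[cell] = i;      // first particle belonging to this cell
    if (i == N - 1 || cell != cell_of_particle[i + 1])
        cell_ends[cell] = i + 1;    // one past this cell's last particle
}
```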
This solution maintains a large amount of parallelism and removes the need for fine-grained synchronization… at the cost of a sort and extra passes over the data (extra BW)!
Another example: parallel histogram
Consider computing a histogram for a sequence of values
1 | int f(float value); // maps array values to histogram bin id’s |
Challenge: create a massively parallel implementation of histogram given only map() and sort() on sequences
Data-parallel histogram construction
Assume variable thread_index is the “thread index” associated with the invocation of the kernel function
1 | void compute_bin(float* input, int* bin_ids) { |
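The listing collapsed to its first line; a sketch of the intended map step, with the rest of the recipe in comments (thread_index as described above; f is the bin-mapping function from the challenge statement):

```cpp
int f(float value);  // maps array values to histogram bin ids (given)

// Map step: one logical thread per input element computes that element's bin.
void compute_bin(const float* input, int* bin_ids, int thread_index) {
    bin_ids[thread_index] = f(input[thread_index]);
}

// Then: sort bin_ids, and find the start of each run of equal bin ids exactly as
// in the particle-grid example; the length of each run is that bin's count.
```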
Summary
Data parallel thinking:
Implementing algorithms in terms of simple (often widely parallelizable, efficiently implemented) operations on large data collections
Turn irregular parallelism into regular parallelism
Turn fine-grained synchronization into coarse synchronization
But most solutions require multiple passes over data — bandwidth hungry!
Data parallel primitives are basis for many parallel/distributed systems today
CUDA’s Thrust Library
Pandas Dataframe operations
JAX
Apache Spark / Hadoop
Lecture 9: Efficiently Evaluating DNNs
Review
When communication and computation are overlapped (hiding memory latency), the capabilities of the machine (ops throughput and communication bandwidth) AND the arithmetic intensity of the program determine if the program’s overall instruction throughput is limited by available bandwidth (“bandwidth bound”) or by the machine’s instruction processing capability (“compute bound”)
Overlapping communication and computation costs on-chip storage footprint, since buffers for both the data being processed AND the data being transferred need to be maintained on chip
Increasing arithmetic processing ability (“faster hardware”) makes a program more likely to be bandwidth bound
Increasing a program’s arithmetic intensity (“a program change”) makes a program more likely to be compute bound
Mini-intro(简要介绍): Convolutional Neural Networks
What is a deep neural network?
A basic unit:
Unit with n inputs described by n+1 parameters(weights + bias)

Example: rectified(修正) linear unit (ReLU) f(x)=max(0,x)
Basic computational interpretation:
It is just a circuit!
Machine learning interpretation:
Binary classifier: interpret output as the probability of one class

Deep neural network: topology(拓扑)


Fully connected layer as matrix-vector product


Assume f() is the element-wise max function (ReLU)
Image convolution (3x3 conv)

Convolutional layer: locally connected AND all units in layer share the same parameters (same weights + same bias)
1 | int WIDTH = 1024; |
Note: you can think of a filter as a “detector” of a pattern, and the magnitude(强度) of a pixel in the output image as the “response” of the filter to the region surrounding each pixel in the input image
Applying many filters to an image at once

Adding additional layers

Efficiently implementing convolution layers
Direct implementation of conv layer (batched)
1 | float input[IMAGE_BATCH_SIZE][INPUT_HEIGHT][INPUT_WIDTH][INPUT_DEPTH]; // input activations(激活值) |
Seven loops with significant input data reuse: reuse of filter weights (during convolution), and reuse of input values (across different filters)
3x3 convolution as matrix-vector product (“explicit gemm”)
Construct matrix from elements of input image

Note: 0-pad matrix
O(N) storage overhead for filter with N elements
Must construct input data matrix


Multiple convolutions on multiple input channels

For each filter, sum responses over input channels
Equivalent to (3 x 3 x num_channels) convolution on (W x H x num_channels) input data

Conv layer to explicit GEMM mapping


Symbol reference(符号参考):
Spatial support of filters: R x S
Input channels: C
Number of filters: K
Batch size: N
Matrix multiplication is at the heart of the “attention” blocks of a transformer architecture

Sequence of tokens in, sequence of tokens out
The importance of dense matrix-matrix multiplication (GEMM) to modern AI
The kernel for…
Fully-connected layers
Convolutional layers
The attention block of a transformer
High performance implementations of GEMM exist
cuBLAS | NVIDIA Developer
Accelerate Fast Math with Intel® oneAPI Math Kernel Library
To use “off the shelf(现成的)” libraries, must materialize input matrices
For a convolutional layer, this increases DRAM traffic by a factor of R x S (to read the input activation tensor and construct the "convolution matrix")
Also requires large amount of additional storage
Dense matrix multiplication

1 | float A[M][K]; |
Low arithmetic intensity (does not exploit temporal locality in access to A and B)
Increasing arithmetic intensity by “blocking”

1 | float A[M][K]; |
Idea: compute partial result for block of C while required blocks of A and B remain in cache (Assumes BLOCKSIZE chosen to allow block of A, B, and C to remain resident)
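A sketch of the blocked loop nest this idea describes (assumes row-major arrays, C zeroed beforehand, and M, N, K divisible by BLOCK, chosen so three BLOCK x BLOCK tiles fit in cache):

```cpp
const int BLOCK = 32;  // pick so 3 * BLOCK * BLOCK floats fit in cache

// Blocked matrix multiply: C += A * B, computed tile by tile so the working
// blocks of A, B, and C stay cache-resident while they are reused.
void gemm_blocked(int M, int N, int K, const float* A, const float* B, float* C) {
    for (int ib = 0; ib < M; ib += BLOCK)
        for (int jb = 0; jb < N; jb += BLOCK)
            for (int kb = 0; kb < K; kb += BLOCK)
                for (int i = ib; i < ib + BLOCK; i++)
                    for (int j = jb; j < jb + BLOCK; j++) {
                        float sum = 0.f;
                        for (int k = kb; k < kb + BLOCK; k++)
                            sum += A[i * K + k] * B[k * N + j];
                        C[i * N + j] += sum;  // partial result for this k-block
                    }
}
```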
Hierarchical blocked matrix mult
Exploit multiple levels of memory hierarchy (increase arithmetic intensity when considering multiple levels of memory hierarchy)
1 | float A[M][K]; |
Not shown: final level of “blocking” for register locality…
Vectorized, blocked dense matrix multiplication (1)
Consider SIMD parallelism within a block

1 | ... |
Vectorize i loop
Good: also improves spatial locality in access to B
Bad: working set increased by SIMD_WIDTH, still walking over B in large steps
Vectorized, blocked dense matrix multiplication (2)

1 | ... |
Assume i dimension is small
Previous vectorization scheme (1) would not work well
Pre-transpose(预转置) block of B (copy block of B to temp buffer in transposed form)
Vectorize innermost loop
Vectorized, blocked dense matrix multiplication (3)

1 | // assume blocks of A and C are pre-transposed as Atrans and Ctrans |
Different layers of a single DNN may benefit from unique scheduling strategies (different matrix dimensions)
Notice sizes of weights and activations in this network: (and consider SIMD widths of modern machines)
Ug for library implementers!

Matrix multiplication implementations
Optimization: do not materialize full matrix (“implicit gemm”)
This is a naive implementation that does not perform blocking, but indexes into input weight and activation tensors
Symbol reference:
Spatial support of filters: R x S
Input channels: C
Number of filters: K
Batch size: N

Better implementation: materialize only a sub-block of the convolution matrix at a time in GPU on-chip “shared memory”
Does not require additional off-chip storage and does not increase required DRAM traffic
Use well-tuned(良好调优) shared-memory based GEMM routines to perform sub-block GEMM (see CUTLASS)
Symbol reference:
Output size: PxQ
Spatial support of filters: R x S
Input channels: C
Number of filters (output channels): K
Batch size: N

NVIDIA CUTLASS
Basic primitives/building block for implementing your custom high performance DNN layers. (e.g, unusual sizes that haven’t been heavily tuned by cuDNN)
NVIDIA/cutlass: CUDA Templates and Python DSLs for High-Performance Linear Algebra
Fast (in-shared memory) GEMM
Fast WARP level GEMMs
Iterators for fast block loading/tensor indexing
Tensor reductions
Etc
Triton
Welcome to Triton’s documentation! — Triton documentation
Language support for operations that load/store tensors
Load “blocks” of data into GPU shared memory
Perform data-parallel operations on those blocks
Thunderkittens
CUDA library of useful tile-based(基于图块) programming primitives
Intended to make advanced developers (CS149-level folks) more productive writing blocked code
Async load/store of tiles
Support for advanced memory layouts (blocked tiles, interleaved elements, etc.)
HazyResearch/ThunderKittens: Tile primitives for speedy kernels
Recall: NVIDIA V100 GPU (80 SMs)

Many processing units and many tensor cores
Need “a lot of parallel work” to fill the machine
Higher performance with “more work”
N=1, P=Q=64 case:
64 x 64 x 128 x 1 = 524K outputs = 2 MB of output data (float32)
N=32, P=Q=256 case:
256 x 256 x 128 x 32 = 256M outputs = 1 GB of output data (float32)



Direct implementation
Or you can just implement this loop nest directly yourself
1 | float input[IMAGE_BATCH_SIZE][INPUT_HEIGHT][INPUT_WIDTH][INPUT_DEPTH]; // input activations |
Low-level chip libraries offer high-performance implementations of key DNN layers
CUDA Deep Neural Network (cuDNN) | NVIDIA Developer
Intel® oneAPI Deep Neural Network Library (oneDNN)
Neuron Kernel Interface (NKI) Documentation — AWS Neuron Documentation
Libraries offering high-performance implementations of key DNN layers
torch.nn — PyTorch 2.9 documentation
Pytorch Convolution Layers -> NVIDIA cuDNN / Triton -> Graphics Card
Recall the loop fusion transformation: fuse multiple loops into one to increase a program’s arithmetic intensity
1 | // Program 1 |
1 | // Program 2 |
Memory traffic between operations
Consider this sequence:

Imagine the bandwidth cost of dumping(存) 1 GB of conv outputs to memory, and then reading it back to just scale(缩放) all the values, and then rereading to perform the pool!
Better solution:
Per-element [scale+bias] operation can easily be performed per-element right after each element is computed by conv!
And max pool’s output can be computed once every 2x2 region of output is computed

Fusing scale/bias with conv layer
1 | float input[IMAGE_BATCH_SIZE][INPUT_HEIGHT][INPUT_WIDTH][INPUT_DEPTH]; |
“fuse” a max pool operation following this layer (max of 2x2 blocks of output matrix)
1 | float input[IMAGE_BATCH_SIZE][INPUT_HEIGHT][INPUT_WIDTH][INPUT_DEPTH]; |
A good idea: fusion trick for computing “attention” in a modern transformer
Attention module in a modern transformer

Let N be the length of the input sequence
Let d be the size of a feature embedding
Let Q be a N x d matrix
Let K be a N x d matrix
Let V be a N x d matrix
Let S = Q K^T (an N x N matrix)
Let P = softmax(S)
softmax(S) computes the softmax over the rows of S
Let O = P V (an N x d matrix, the output)
Notes:
N can be long for long sequences (e.g., thousands)
Naive implementation uses N² space! Trouble!!!
Computing attention

Let's break vector x into chunks: the softmax of the full vector can be assembled from per-chunk quantities
For each chunk, keep the running max m of the elements seen so far and the running sum l of exponentials exp(x_i - m); when the next chunk arrives, rescale the old sum by exp(m_old - m_new) before adding the new chunk's terms
So softmax can be computed in chunks!
Fused attention

for each j:
for each i:
Load blocks Qi, Kj^T, Vj, Oi
Compute Sij = Qi Kj^T
Compute Mij = m(Sij), Pij = f(Sij), and lij = l(Sij) (all functions operate row-wise on row-vectors)
Multiply PijVj and accumulate into Oi with appropriate scalings
Save memory footprint: Never materialize N2 matrix
Save memory bandwidth: (high arithmetic intensity)
Read 3 blocks (from Q, K, V)
Do two matrix multiplies + a few row summations
Accumulate into O block (which is resident in cache)
Note there is additional computation vs. the original version (must re-scale prior values of O each step of i-loop)
Fusion in modern DNN frameworks
Old style: library writers hardcoded a few “fused” ops
eg.
cudnnConvolutionBiasActivationForward
tensorflow::ops::FusedBatchNorm
tensorflow::ops::FusedResizeAndPadConv2D
More flexible fusion example: CUDNN “backend”
NVIDIA cuDNN Backend — NVIDIA cuDNN Backend
Compiler generates new implementations that “fuse” multiple operations into a single node that executes efficiently (without runtime overhead or communicating intermediate results through memory)
Many compiler-based efforts to automatically schedule key DNN operations
facebookresearch/TensorComprehensions: A domain specific language to express machine learning workloads.
JAX: High-Performance Array Computing — JAX documentation
apache/tvm: Open Machine Learning Compiler Framework
Introduction to torch.compile — PyTorch Tutorials 2.9.0+cu128 documentation
NervanaSystems/ngraph: nGraph has moved to OpenVINO
TensorRT SDK | NVIDIA Developer
triton-lang/triton: Development repository for the Triton language and compiler
Another trick: use of low precision values
Many efforts to use low precision values for DNN weights and intermediate activations
16 bit and 8-bit values are common
Now moving into 4 bit values
In the extreme case: 1-bit ;-)
[1603.05279] XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks
[2509.25149] Pretraining Large Language Models with NVFP4
Optimization techniques
Better algorithms: manually designing better ML models
Common parameters: depth of network, width of filters, number of filters per layer, convolutional stride, etc
Common to perform automatic search for efficient topologies(拓扑)
Software optimization: Good scheduling of performance-critical operations
Loop blocking/tiling, fusion
Typically optimized manually by humans (but significant research efforts to automate scheduling)
Forms of approximation: compressing models
Lower bit precision
**Why might a GPU be a good platform for DNN evaluation?**
consider: arithmetic intensity, SIMD, data parallelism, memory bandwidth requirements
Deep neural networks on GPUs
Many high-performance DNN implementations target GPUs
High arithmetic intensity matrix-matrix computations benefit from flop-rich(浮点运算能力丰富) GPU architectures
Highly-optimized library of kernels exist for GPUs (cuDNN)
Why might a GPU be a sub-optimal platform for DNN evaluation?
(Hint: is a general purpose processor really needed?)
Next time: maximizing efficiency via specialized hardware acceleration for DNN inference/training
Google TPU3, GraphCore IPU, Apple Neural Engine, Intel Deep Learning Inference Accelerator, Cerebras Wafer Scale Engine, SambaNova Cardinal SN10, Ampere GPU with Tensor Cores
【2023】Lecture 9: Spark (Distributed Computing on a Cluster)
Why Use A Cluster?
Want to process 100TB of log data (1 day @Facebook)
On 1 node: scanning @ 50MB/s = 23 days
On 1000 nodes: scanning @ 50MB/s = 33 min
Scanning in parallel across nodes multiplies the available I/O bandwidth
But, very hard to utilize 1000 or 100,000 nodes!
Hard to program 16,000 cores
Something breaks every hour
Need efficient, reliable and usable framework
Warehouse-Scale Computers (WSC)
Standard architecture:
Cluster of commodity(商用) Linux nodes (multicore x86)
Private memory ⇒ separate address spaces & separate OS
Ethernet network ⇒ >10–40Gb today
Cheap?
Built from commodity processors, networks & storage
1000s of nodes for < $10M
WSC network is customized and expensive
Uses supercomputer networking ideas to provide high bandwidth across the datacenter
How to organize computations on this architecture?
Mask issues such as load balancing and failures
Warehouse-Scale Cluster Node (Server)
Consider bandwidths, what conclusions can you make?
In the early days (around 0.1 GB/s to remote storage) that was certainly not the case, but at around 2 GB/s the picture changes: you can potentially get the same bandwidth to a remote node's disk as to your local disk
Message passing model (abstraction)
Distributed memory communication without shared memory
Threads operate within their own private address spaces
Threads communicate by sending/receiving messages
send : specifies recipient, buffer to be transmitted, and optional message identifier (“tag”)
receive : sender, specifies buffer to store data, and optional message identifier
Sending messages is the only way to exchange data between threads 1 and 2
Do we need synchronization?
No: the act of sending/receiving a message is itself the synchronization
Storage Systems
First order problem: if nodes can fail, how can we store data persistently?
Answer: Distributed File System
Provides global file namespace
Google GFS, Hadoop HDFS
Typical usage pattern
Huge files (100s of GB to TB)
Data is rarely updated in place
Reads and appends are most common (e.g. log files)
Distributed File System (GFS)
Chunk servers
a.k.a. DataNodes in HDFS
File is split into contiguous chunks (usually 64–256 MB)
Each chunk replicated (usually 2x or 3x)
Try to keep replicas in different racks
Master node
a.k.a. NameNode in HDFS
Stores metadata; usually replicated
Client library for file access
Talks to master to find chunk (data) servers
Connects directly to chunk servers to access data
Hadoop Distributed File System (HDFS)
MapReduce Programming Model
Map
Higher-order function that applies a side-effect-free unary function f :: a -> b to all elements of the input sequence (same definition and examples as in Lecture 8)
**Reduce**
Apply binary operation f to each element and an accumulated value
f :: (b,a) -> b
reduce :: ((b,a) -> b) -> seq a -> b
E.g., in Scala:
def reduce[A, B](f: (B, A) => B, l: List[A]): B

(The code above computes the count of page views by each type of mobile phone)
Assume cs149log.txt is a large file, stored in a distributed file system, like HDFS
Contents of cs149log.txt are distributed evenly in blocks across the cluster
1 | // called once per line in input file by runtime |
Exercise: design an implementation of runMapReduceJob
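The mapper listing above collapsed to its first comment; a sketch of the mapper/reducer pair for the mobile-page-view count (the function shapes follow the lecture's description; isMobileClient and the exact signatures are assumptions):

```cpp
#include <list>
#include <map>
#include <string>
using namespace std;

bool isMobileClient(const string& line, string& phone_type);  // hypothetical log parser

// called once per line in input file by runtime
void mapper(string line, multimap<string, string>& results) {
    string phone_type;
    if (isMobileClient(line, phone_type))
        results.insert({phone_type, "1"});  // emit (phone type, 1)
}

// called once per unique key; values holds every value emitted for that key
void reducer(string key, list<string>& values, int& result) {
    result = (int)values.size();  // page-view count for this phone type
}
```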
MapReduce Dataflow for Word Count

Should be called MapGroupByKeyReduce
Step 1: Running the mapper function
Step 1: run mapper function on all lines of file
Question: How to assign work to nodes?
Idea 1: use work queue for list of input blocks to process
Dynamic assignment: free node takes next available block
Idea 2: data-distribution-based assignment: each node processes lines in blocks of the input file that are stored locally
Steps 2 and 3: gathering data, running the reducer
Step 2: Prepare intermediate(中间的) data for reducer
Step 3: Run reducer function on all keys
Question: how to assign reducer tasks?
Assign reducer tasks to nodes that are sitting idle with no work left to do
Question: how to get all data for key onto the correct worker node?
Hash each key to a worker node; the mapper nodes use this hash to decide where to send each key's values to be reduced
Additional implementation challenges at scale
Nodes may fail during program execution
Some nodes may run slower than others(due to different amounts of work, heterogeneity(异质性) in the cluster, etc..)
Job scheduler responsibilities
Exploit data locality: “move computation to the data”
Run mapper jobs on nodes that contain input blocks
Run reducer jobs on nodes that already have most of data for a certain key
Handling node failures
Scheduler detects job failures and reruns job on new machines
This is possible since inputs reside in persistent storage (distributed file system)
Scheduler duplicates jobs on multiple machines (reduce overall processing latency incurred by node failures)
Handling slow machines
Scheduler duplicates jobs on multiple machines
MapReduce Benefits
By providing a data-parallel model, MapReduce greatly simplified cluster programming:
Automatic division of job into tasks
Locality-aware scheduling
Load balancing
Recovery from failures & stragglers
runMapReduceJob problems
Permits only a very simple program structure
Programs must be structured as: map, followed by reduce by key
See DryadLINQ for generalization to DAGs
Iterative algorithms must load from disk each iteration
Example graph processing:

1 | void pagerank_mapper(graphnode n, map<string,string> results) { |
MapReduce Limitations
MapReduce greatly simplified “big data” analysis
But users quickly needed more:
More complex, multi-stage applications (e.g. iterative machine learning & graph processing)
More interactive ad-hoc(临时) queries

Application Trends
Despite huge amounts of data, many working sets in big data clusters fit in memory
Apache Spark in-memory, fault-tolerant distributed computing
Apache Spark™ - Unified Engine for large-scale data analytics
**Goals**
Programming model for cluster-scale computations where there is significant reuse of intermediate datasets
Iterative machine learning and graph algorithms
Interactive data mining: load large dataset into aggregate(聚合) memory of cluster and then perform multiple ad-hoc queries
Don't want to incur the inefficiency of writing intermediates to the persistent distributed file system (want to keep them in memory)
Challenge: efficiently implementing fault tolerance for large-scale distributed in-memory computations
Fault tolerance for in-memory calculations
Replicate all computations
Expensive solution: decreases peak throughput
Checkpoint and rollback
Periodically save state of program to persistent storage
Restart from last checkpoint on node failure
Maintain log of updates (commands and data)
High overhead for maintaining logs
Recall map-reduce solutions:
Checkpoints after each map/reduce step by writing results to file system
Scheduler’s list of outstanding (but not yet complete) jobs is a log
Functional structure of programs allows for restart at granularity of a single mapper or reducer invocation (don’t have to restart entire program)
Resilient Distributed Dataset (RDD)
Spark’s key programming abstraction :
Read-only ordered collection of records (immutable)
RDDs can only be created by deterministic transformations on data in persistent storage or on existing RDDs
Actions on RDDs return data to application
// create RDD from file system data ...
Repeating the MapReduce Example
“Lineage”: Sequence of RDD operations needed to compute output
// 1. create RDD from file system data ...
Another Spark Program
// create RDD from file system data ...
RDD transformations and actions
Transformations: (data parallel operators taking an input RDD to a new RDD)

Actions: (provide data back to the “host” application)

RDD partitioning and dependencies
val lines = spark.textFile("hdfs://cs149log.txt");
val lower = lines.map(_.toLower());
val mobileViews = lower.filter(x => isMobileClient(x));
val howMany = mobileViews.count();

Black lines show dependencies between RDD partitions
**“loop fusion” & “tiling”**
globally restructuring the order of computation to improve producer-consumer locality
(improve arithmetic intensity of program)
Fusion with RDDs
Implementing sequence of RDD ops efficiently
The following code stores only a line of the log file in memory, and only reads input data from disk once (“streaming” solution)
int count = 0; ...
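A fuller sketch of that streaming loop (assuming a placeholder isMobileClient() predicate; the lecture's exact code may differ). The map (toLower), filter, and count are fused into one pass, so only a single line is ever resident in memory:

```cpp
#include <algorithm>
#include <cctype>
#include <fstream>
#include <iostream>
#include <string>

bool isMobileClient(const std::string& line) {     // placeholder predicate
    return line.find("mobile") != std::string::npos;
}

int main() {
    std::ifstream in("cs149log.txt");
    std::string line;
    int count = 0;
    // One line resident in memory at a time; input read from disk once.
    while (std::getline(in, line)) {
        std::transform(line.begin(), line.end(), line.begin(),
                       [](unsigned char c) { return std::tolower(c); });
        if (isMobileClient(line))
            count++;                 // map + filter + count fused per line
    }
    std::cout << count << "\n";
}
```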
Narrow dependencies
“Narrow dependencies” = each partition of parent RDD referenced by at most one child RDD partition
Allows for fusing of operations (here: can apply map and then filter all at once on input element)
In this example: no communication between nodes of the cluster (aside from one int at the end to perform the count() reduction)

Wide dependencies
groupByKey: RDD[(K,V)] → RDD[(K,Seq[V])]
“Make a new RDD where each element is a sequence containing all values from the parent RDD with the same key.”

Wide dependencies = each partition of parent RDD referenced by multiple child RDD partitions
Challenges:
Must compute all of RDD_A before computing RDD_B
Example: groupByKey() may induce all-to-all communication as shown above
May trigger significant recomputation of ancestor lineage upon node failure
Cost of operations depends on partitioning
join: RDD[(K,V)], RDD[(K,W)] → RDD[(K,(V,W))]
Assume data in RDD_A and RDD_B are partitioned by key: hash username to partition id
RDD_A and RDD_B have different hash partitions: join creates wide dependencies

RDD_A and RDD_B have same hash partition: join only creates narrow dependencies

partitionBy() transformation
Inform Spark on how to partition an RDD
e.g., HashPartitioner, RangePartitioner
// create RDD from file system data ...
.persist():
Inform Spark this RDD’s contents should be retained in memory
.persist(RELIABLE) = store contents in durable storage (like a checkpoint)
Implementing Resilience via Lineage
RDD transformations are bulk, deterministic, and functional
Implication: runtime can always reconstruct contents of RDD from its lineage(the sequence of transformations used to create it)
Lineage is a log of transformations
Efficient: since the log records bulk data-parallel operations, overhead of logging is low (compared to logging fine-grained operations, like in a database)
// create RDD from file system data ...
Upon Node Failure: Recompute Lost RDD Partitions from Lineage

Note: (not shown): file system data is replicated so assume blocks 2 and 3 remain accessible to all nodes
Spark performance

HadoopBM = Hadoop Binary In-Memory (convert text input to binary, store in in-memory version of HDFS)
The baseline (Hadoop) parses text input in each iteration of an iterative algorithm
HadoopBM’s first iteration is slow because it runs an extra Hadoop job to copy the binary form of the input data to in-memory HDFS
Accessing data from HDFS, even if in memory, has high overhead:
Multiple mem copies in file system + a checksum
Conversion from serialized form to Java object
Caution: “scale out” is not the entire story
People say "scale out" is when you connect nodes together over a network, and "scale up" is when you connect cores together in a shared-memory system. So if you share memory, it scales up; if you don't share memory, it scales out
Distributed systems designed for cloud execution address many difficult challenges, and have been instrumental in the explosion of “big-data” computing and large-scale analytics
Scale-out parallelism to many machines
Resiliency in the face of failures
Complexity of managing clusters of machines
But scale out is not the whole story:


Performance improvements to Spark
With increasing DRAM sizes and faster persistent storage (SSD), there is interest in improving the CPU utilization of Spark applications
Goal: reduce “COST” (the “Configuration that Outperforms a Single Thread”, from McSherry et al., HotOS ’15)
Efforts looking at adding efficient code generation to Spark ecosystem (e.g., generate SIMD kernels, target accelerators like GPUs, etc.) to close the gap on single node performance
RDD storage layouts must change to enable high-performance SIMD processing (e.g., struct of arrays instead of array of structs)
See Spark’s Project Tungsten, Weld [Palkar CIDR ’17], IBM’s SparkGPU
High-performance computing ideas are influencing design of future performance-oriented distributed systems
Conversely: the scientific computing community has a lot to learn from the distributed computing community about elasticity and utility computing
Spark summary
Introduces opaque sequence abstraction (RDD) to encapsulate intermediates of cluster computations (previously… frameworks like Hadoop/MapReduce stored intermediates in the file system)
Observation: “files are a poor abstraction for intermediate variables in large-scale data-parallel programs”
RDDs are read-only, and created by deterministic data-parallel operators
Lineage is tracked and used for locality-aware scheduling and fault tolerance (allows recomputation of RDD partitions on failure rather than restoring from a checkpoint; note that .persist(RELIABLE) lets the programmer request checkpointing in long-lineage situations)
Bulk operations allow overhead of lineage tracking (logging) to be low
Simple, versatile abstraction upon which many domain-specific distributed computing frameworks are being implemented
See Apache Spark project: spark.apache.org
Modern Spark ecosystem
Compelling feature: enables integration/composition of multiple domain-specific frameworks (since all collections are implemented under the hood with RDDs and scheduled using the Spark scheduler)
Spark SQL
https://spark.apache.org/sql/
Interleave computation and database query
Can apply transformations to RDDs produced by SQL queries
Spark MLlib
https://spark.apache.org/mllib/
Machine learning library built on top of Spark abstractions
Spark GraphX
https://spark.apache.org/graphx/
GraphLab-like library built on top of Spark abstractions
Lecture 10: Hardware Specialization
Heterogeneous processing
Observation: most “real world” applications have complex workload characteristics
They have components that can be widely parallelized
And components that are difficult to parallelize
They have components that are amenable to wide SIMD execution
And components that are not (divergent control flow)
They have components with predictable data access
And components with unpredictable access, but those accesses might cache well
Idea: the most efficient processor is a heterogeneous mixture of resources (“use the most efficient tool for the job”)
Examples of heterogeneity
Example: Intel “Skylake” (2015)
(6th Generation Core i7 architecture)

4 CPU cores + graphics cores + media accelerators
CPU cores and graphics cores share same memory system
Also share LLC (L3 cache)
Enables low-latency, high-bandwidth communication between CPU and integrated GPU
Graphics cores are cache coherent with CPU cores
More heterogeneity: add discrete GPU
Keep the (power-hungry) discrete GPU powered down unless needed for graphics-intensive applications
Use integrated, low power graphics for basic graphics/window manager/UI

Mobile heterogeneous processors

GPU-accelerated Supercomputing

Energy-constrained computing
Energy (Power x Time)-constrained computing
Supercomputers are energy constrained
Due to sheer scale of machine (100,000s of CPUs and GPUs)
Overall cost to operate (power for machine and for cooling)
Datacenters are energy constrained
Reduce cost of cooling
Reduce physical space requirements
Mobile devices are energy constrained
Limited battery life
Heat dissipation
AI is Constrained by Energy
AI demands are growing exponentially
Data centers are heavily energy constrained

Performance and Power

Better energy efficiency ⇒ Specialization (fixed function)
What is the magnitude of improvement from specialization?
Pursuing highly efficient processing… (specializing hardware beyond just parallel CPUs and GPUs)
Why is a “general-purpose processor” so inefficient?
Consider the complexity of executing an instruction on a modern processor…
Read instruction
Address translation, communicate with icache, access icache, etc.
Decode instruction
Translate op to uops, access uop cache, etc.
Check for dependencies/pipeline hazards
Identify available execution resource
Use decoded operands to control register file SRAM (retrieve data)
Move data from register file to selected execution resource
Perform arithmetic operation
Move data from execution resource to register file
Use decoded operands to control write to register file SRAM

Review question:
How does SIMD execution reduce overhead of certain types of computations? What properties must these computations have?
SIMD amortizes all of the parts that are not the ALUs (the non-green parts of the slide) over more "green stuff": one instruction's worth of fetch/decode/control drives execution across many data elements. The computation must apply the same operation to many data elements (data parallelism with coherent control flow)
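As a tiny concrete illustration (x86 AVX intrinsics; this example is ours, not the lecture's): a single vector instruction performs eight float additions, so fetch/decode/control overhead is paid once per eight operations.

```cpp
#include <immintrin.h>
#include <cstdio>

int main() {
    alignas(32) float a[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    alignas(32) float b[8] = {8, 9, 10, 11, 12, 13, 14, 15};
    alignas(32) float c[8];
    __m256 va = _mm256_load_ps(a);      // one instruction loads 8 floats
    __m256 vb = _mm256_load_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);  // 8 adds per instruction issued
    _mm256_store_ps(c, vc);
    for (float v : c) std::printf("%g ", v);
}
```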
Contrast that complexity to the circuit required to actually perform the operation

H.264 video encoding: fraction of energy consumed by functional units is small (even when using SIMD)

Fast Fourier transform (FFT): throughput and energy benefits of specialization

Mobile: benefits of increasing efficiency
Run faster for a fixed period of time
Run at higher clock, use more cores (reduce latency of critical task)
Do more at once
Run at a fixed level of performance for longer
e.g., video playback, health apps
Achieve “always-on” functionality that was previously impossible
iPhone:
Siri activated by button press or holding phone up to ear
Amazon Echo / Google Home
Always listening
Google Glass:
~40 min recording per charge (nowhere near “always on”)
Example: Intel “Skylake” (2015)
(6th Generation Core i7 architecture)

CPU cores and graphics cores share same memory system
Also share LLC (L3 cache)
Enables low-latency, high-bandwidth communication between CPU and integrated GPU
Graphics cores are cache coherent with CPU cores
GPU’s are themselves heterogeneous multi-core processors

Example graphics tasks performed in fixed-function HW
Rasterization: determining what pixels a triangle overlaps

Texture mapping: Warping/filtering images to apply detail to surfaces

Geometric tessellation: computing fine-scale geometry from coarse geometry

Digital signal processors (DSPs)
Programmable processors, but simpler instruction stream control paths
Complex instructions (e.g., SIMD/VLIW): perform many operations per instruction (amortize cost of control)

Anton supercomputer for molecular dynamics
Anton 1 (2008) simulates time evolution of proteins
ASIC for computing particle-particle interactions (512 of them in machine)
Throughput-oriented subsystem for efficient fast Fourier transforms
Custom, low-latency communication, network designed for communication patterns of N-body simulations


Anton 3 (2021) is approximately 20 times faster than a contemporary GPU
Specialized processors for evaluating deep networks
Countless papers followed at top computer architecture research conferences on the topic of ASICs or accelerators for deep learning or evaluating deep networks…



FPGAs (Field Programmable Gate Arrays)
Middle ground between an ASIC and a processor
FPGA chip provides array of logic blocks, connected by interconnect
Programmer-defined logic implemented directly by FPGA


Specifying combinational logic as a LUT (lookup table)
Example: 6-input, 1-output LUT in Xilinx Virtex-7 FPGAs
Think of a LUT6 as a 64-element table

Example: 6-input AND

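A hedged software model of a LUT6 (our illustration): the 64-bit configuration word is the truth table, and the six input bits form an index into it. For the 6-input AND above, only entry 63 (all inputs high) is 1.

```cpp
#include <cstdint>
#include <cstdio>

// The 6 input bits select one of 64 configuration bits (the truth table).
bool lut6(uint64_t config, unsigned inputs) {
    return (config >> (inputs & 63)) & 1u;
}

int main() {
    uint64_t and6 = 1ull << 63;  // truth table of a 6-input AND
    std::printf("%d %d\n", lut6(and6, 63), lut6(and6, 42));  // prints: 1 0
}
```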
40-input AND constructed by chaining outputs of eight LUT6’s (delay = 3)

Modern FPGAs
A lot of area devoted to hard gates
Memory blocks (SRAM)
DSP blocks (multiplier)
CPUs (ARM, RISC-V)
Program with a hardware description language (e.g. Verilog, EE108)

Amazon EC2 F1/F2
FPGAs are now available on Amazon cloud services
https://aws.amazon.com/cn/ec2/
Efficiency benefits of compute specialization
Rules of thumb: compared to high-quality C code on CPU…
Throughput-maximized processor architectures: e.g., GPU cores
Approximately 10x improvement in perf / watt
Assuming code maps well to wide data-parallel execution and is compute bound
Fixed-function ASIC (“application-specific integrated circuit”)
Can approach 100-1000x or greater improvement in perf/watt
Assuming code is compute bound and is not floating-point math
Efficiency vs. Programmability

AI Progress Relies on Hardware Improvement

AI Models on GPUs
Many high-performance AI model implementations target GPUs
High arithmetic intensity computations (computational characteristics similar to dense matrix-matrix multiplication)
Benefit from flop-rich GPU architectures
Highly-optimized library of kernels exist for GPUs (cuDNN)
Why might a GPU be a sub-optimal platform for AI Model Acceleration?
(Hint: is a general purpose processor needed?)
Characteristics of An Ideal AI Model Accelerator
High peak TFLOPs and energy efficiency
High memory bandwidth
Simple to program for high performance
Reaches performance bound on compute-bound models
Reaches performance bound on BW-bound models

Asynchronous (Nonblocking) Execution
Start later operations before earlier operations are complete

AI Models are Dataflow Graphs

Ideal AI Model Accelerator
Tiled AI accelerator programming model
CUTLASS
Triton
ThunderKittens
GEMM computation is cheap, but data movement is expensive
Silicon area
Watts
Nanoseconds
Ideal: Minimize cost of Data Movement
Ideal: Avoid Off-chip Data Access
| Feature | Why? |
|---|---|
| Tiled tensors (e.g. 16 x 16, 32 x 32) | Max TFLOPS on GEMM; low instr. overhead |
| Asynchronous compute | Overlap compute and memory access |
| Asynchronous memory access | Overlap compute and memory access |
| Asynchronous chip-to-chip communication | Overlap compute, memory and communication |
| Compute unit to compute unit comm. | Fusion and pipelining; streaming dataflow |
Special instruction support
Recall: compute specialization = energy efficiency
Recall: data movement has high energy cost
Amortize overhead of instruction stream control using more complex instructions
Estimated overhead of programmability (instruction stream, control, etc.):
Half-precision FMA (fused multiply-add): 2000%
Half-precision DP4 (vec4 dot product): 500%
Half-precision 4x4 MMA (matrix-matrix multiply + accumulate): 27%
Key principle: amortize cost of instruction stream processing across many operations of a single complex instruction
Numerical data formats

Energy and Area Cost of Compute

Ampere GPU SM (A100)
Each SM core has:
64 fp32 ALUs (mul-add)
32 int32 ALUs
4 “tensor cores”
Execute 8x4 x 4x8 matrix mul-add instr
A x B + D for matrices A,B,D
A, B stored as fp16, accumulation with fp32 D
There are 108 SM cores in the GA100 GPU:
6,912 fp32 mul-add ALUs
432 tensor cores
1.4 GHz max clock
= 19.5 TFLOPs fp32
+ 312 TFLOPs (fp16/32 mixed) in tensor core

Nvidia H100 GPU (2022)
Fourth-generation Tensor Core
Tensor Memory Accelerator (TMA) unit
CUDA cluster capability
HBM3 with up to 80 GB
TSMC 4nm
80 Billion transistors

Tensor cores

H100 CUDA, Compute and Memory Hierarchies
| CUDA Hierarchy | Compute Hierarchy | Memory Hierarchy |
|---|---|---|
| Grid | GPU | 80 GB HBM/ 50 MB L2 |
| Cluster | CPC | 256 KB shared memory per SM |
| Thread Block | SM | 256 KB shared memory |
| Threads | SIMD Lanes | 1 KB RF per thread, 64KB per SM partition |
Thread block cluster is a collective of up to 16 thread blocks
Each thread block is guaranteed to execute on a separate SM and to run at the same time
H100 GPU Streaming Multi-processor (SM)

Tensor Memory Accelerator
Special purpose instructions for efficient data movement
Asynchronously load/store a region of a tensor from global to shared memory
Copy descriptor describes region
A single thread issues the TMA operation
cuda::memcpy_async
Signal barrier when copy is complete
Hardware address generation and data movement
Copy Descriptor



The Whole H100

144 SMs
Tensor cores (systolic array MMA): 989 TFLOPS (fp16)
SIMD: 134 TFLOPS (fp16), 67 TFLOPS (fp32)
GPU TFLOPS Over Time

All the TFLOPS are in the Tensor Cores

Nvidia Chips Becoming More Specialized
What are implications for programmers?
V100

A100

H100

B100

Tensor Cores in B100
Register bandwidth limits for tensor cores in B100
Tensor data in SMEM and TMEM
A single thread executes the MMA ⇒ no more warps!
Programming Tensor Cores
Allocate TMEM and descriptors
tcgen05.alloc
Prefetch/stream tiles with TMA (async)
cp.async.bulk.tensor, coordinate with mbarrier
Launch async MMAs
tcgen05.mma, batched with tcgen05.commit
Order & retire
tcgen05.fence
Not your father’s CUDA

DSLs for GPU AI Kernels
HazyResearch/ThunderKittens: Tile primitives for speedy kernels
CuTe DSL — NVIDIA CUTLASS Documentation
Mojo: a programming language for AI developers | powerful CPU and GPU programming – Mojo
Pallas: Mosaic GPU — JAX documentation
How Ideal are GPUs
| Feature | Why? | Nvidia GPU |
|---|---|---|
| Tiled tensors (e.g. 16 x 16, 32 x 32) | Max TFLOPS on GEMM; low instr. overhead | ✅ |
| Asynchronous compute | Overlap compute and memory access | ✅ mma_async |
| Asynchronous memory access | Overlap compute and memory access | ✅ TMA+TMEM |
| Asynchronous chip-to-chip communication | Overlap compute, memory and communication | |
| Compute unit to compute unit comm. | Fusion and pipelining; streaming dataflow | ❓ TB Cluster |
Hardware acceleration of AI inference/training

Google’s TPU (v1)
TPU area proportionality
Arithmetic units ~ 30% of chip
Note low area footprint of control
Key instructions:
read host memory
write host memory
read weights
matrix_multiply / convolve
activate
Floor plan of TPU die (yellow = compute, blue = data, green = I/O, red = control)
Systolic array
(matrix vector multiplication example: y=Wx)






Notice: need multiple 4x32-bit accumulators to hold output columns
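A functional sketch (not cycle-accurate, and our illustration rather than the slide's) of the weight-stationary reading of y = Wx: PE (i, j) permanently holds W[i][j], inputs stream across the array, and partial sums accumulate along each row.

```cpp
#include <array>
#include <cstdio>

constexpr int N = 4;

// Each "cycle" j, input x[j] enters column j; every PE in that column does
// one multiply-accumulate into the partial sum moving along its row, so
// y[i] emerges at the edge of row i after all inputs have streamed through.
std::array<float, N> systolic_mv(const float W[N][N],
                                 const std::array<float, N>& x) {
    std::array<float, N> y{};
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            y[i] += W[i][j] * x[j];  // one MAC per PE per cycle
    return y;
}

int main() {
    float W[N][N] = {{1,0,0,0},{0,1,0,0},{0,0,1,0},{0,0,0,1}};
    auto y = systolic_mv(W, {1, 2, 3, 4});
    for (float v : y) std::printf("%g ", v);  // prints: 1 2 3 4
}
```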
Systolic Array Dataflow
| Dataflow Type | What stays in each PE | What streams through | Main goal |
|---|---|---|---|
| Weight-Stationary (WS) | Weight values | Inputs (activations) and partial sums | Minimize reloading of weights |
| Output-Stationary (OS) | Partial sums (outputs) | Inputs and weights | Minimize movement of accumulated results |
| Input-Stationary (IS) | Input activations | Weights and partial sums | Minimize reloading of inputs |
SIMD vs. Systolic Array
| Feature | SIMD | Systolic Array |
|---|---|---|
| Dataflow | Control-driven (instructions) | Data-driven (wavefront) |
| Locality (data reuse) | Limited | Temporal and spatial |
| Communication | Global (register/memory) | Local (neighbor PEs) |
| Control | Centralized | Distributed |
| Efficiency (perf/mm2, perf/Watt) | Medium | Very high |
Building larger matrix-matrix multiplies
Example: A = 8x8, B = 8x4096, C = 8x4096
Assume 4096 accumulators




TPU Performance/Watt

GM = geometric mean over all apps; WM = weighted mean over all apps
total = cost of host machine + TPU; incremental = only cost of TPU
Evolution of Google TPUs

Hardware Lottery
When a research idea wins because it is suited to the available software and hardware and not because the idea is universally superior to alternative research directions

AI Models ⇒ Dataflow Architecture

Reconfigurable Dataflow Architecture vs Ideal Accelerator

No instructions ⇒ No instruction fetch/decode overhead
Extreme asynchrony: no sequential instruction execution
Dataflow Kernel Fusion

Summary: specialized hardware for AI model processing
Specialized hardware for executing key DNN computations efficiently
Feature many arithmetic units
Customized/configurable datapaths to directly move intermediate data values between processing units (schedule computation by laying it out spatially on the chip) at multiple granularities
Large amounts of on-chip storage for fast access to intermediates
Lecture 11: Programming Specialized Hardware for AI
Specialization Improves Efficiency
Tensor Cores
Specialized MMA compute

Warpgroup: 128 consecutive threads
PTX: Parallel Thread Execution
NVIDIA’s virtual instruction set architecture
TMA
Specialized block data movement unit
Eliminates thousands of instructions and memory addressing overhead
Eliminates unnecessary data movement through L1 and registers
GPU Kernels are Important
2025 GPU market is enormous ⇒ NVIDIA 2025 quarterly revenue of > $47B
GPU AI kernels are often run on clusters of hundreds of millions of dollars of GPUs, for months on end. (e.g. large training runs, serving models at scale, etc.)
FlashAttention-2 utilization degraded from ~70% on A100s to ~35% on H100s; it took 2 years to come back up to ~65% with FlashAttention-3
Poor kernels underutilize billions of dollars worth of compute

Extracting Peak Performance from the H100
Kernels that keep the Tensor cores busy (>90% of TFLOPS)
Use 16 x 16 tiles of fp16 data ⇒ matches Tensor core compute
Make sure compute is never idle
Overlap memory access and compute ⇒ use asynchrony
A tile processing pipeline

ThunderKittens
A Simple Embedded DSL for AI kernels
Design principle #1: tile of 16x16 as primitive data type
TK manages layouts
TK provides basic operations
Design principle #2: Asynchrony, everywhere
Expose primitives for user to manage, if top performance needed
Design principle #3: High-level GPU coordination patterns
Producer-consumer processing
Embedded CUDA DSL template library
Templated Data Types
Register tiles: 2D tensors on the register file
height, width, and layout
Register vectors: 1D tensors on the register file
length and layout
Shared memory tiles: 2D tensors in shared memory
height, width, and layout
Shared memory vectors: 1D tensors in shared memory
Length
Operations
Initializers – zero out a shared vector, for example
Unary ops, like exp
Binary ops, like mul
Row / column ops, like a row_sum
Tile Processing Pipeline with ThunderKittens

TK Matmul
Step 1: Define layouts
Step 2: Define pipeline and producers
struct matmul_template { ... }
Step 3: Compute!
struct consumer { ... }
TK Matmul Performance

Metapipelining
Can we have asynchrony with a simpler programming model? (Hint: Take a data-centric view)
Recall: AI Models are Dataflow Graphs
AI Models ⇒ Dataflow Architecture
Streaming Dataflow ⇒ Kernel Fusion
Attention Algorithm

Attention Algorithm on RDA

Coarse-grained pipelining
Reconfigurable Dataflow

SambaNova SN40L RDU
1,040 PCUs and PMUs
638 TFLOPS (bf16)
520 MB on-chip SRAM
64 GB HBM
1.5 TB DDR

PCU: Pattern Compute Unit
systolic and SIMD compute (16 x 8 bf16)
PMU: Pattern Memory Unit
High address generation flexibility and bandwidth (0.5 MB)
S: Mesh switches
High on-chip interconnect flexibility and bandwidth
AGCU: Address Generator and Coalescing Unit
Portal to off-chip memory and IO
Dataflow Programming with Data Parallel Patterns

Composable Compute Primitives: MM, Map, Zip, Reduce, Gather, Scatter …
Flexible scheduling in space and time ⇒ spatial execution
Metapipelining
Hierarchical coarse-grained pipeline: A “pipeline of pipelines”
Exploits nested-loop parallelism
Convert parallel pattern (loop) into a streaming pipeline
Insert pipe stages in the body of the loop
Pipe stages execute in parallel
Overlap execution of multiple loop iterations
Intermediate data between stages stored in double buffers (see the sketch after this list)
Handles imbalanced stages with varying execution times
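A minimal sequential sketch of the double-buffering idea (loadTile/compute are assumed stage functions; in real hardware the two stages run concurrently): the consumer works on one buffer while the producer fills the other.

```cpp
#include <cstdio>
#include <utility>

constexpr int TILE = 4;

// Stub stages (assumptions): fetch tile t into buf; process a tile.
void loadTile(int t, float* buf) { for (int i = 0; i < TILE; i++) buf[i] = t; }
void compute(const float* buf)   { std::printf("tile %g\n", buf[0]); }

int main() {
    const int numTiles = 8;
    float bufA[TILE], bufB[TILE];
    float* fill = bufA;   // buffer being produced into
    float* work = bufB;   // buffer being consumed
    loadTile(0, work);                  // prologue: fetch first tile
    for (int t = 1; t < numTiles; t++) {
        loadTile(t, fill);              // stage 1 (concurrent in hardware)
        compute(work);                  // stage 2 works on the previous tile
        std::swap(fill, work);          // ping-pong the double buffer
    }
    compute(work);                      // epilogue: last tile
}
```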
Tiling and fusion
Works well with tiling
Buffers can be used to change access pattern (e.g. transpose data)
Metapipelining can work when fusion does not
Metapipelining Intuition
Gaussian Discriminant Analysis (GDA)
Matmul Metapipeline
auto format = DataFormat::kBF16; ...
Matmul Metapipe
METAPIPE(M, MM) { ... }


Matmul Metapipe Mapping

FlashAttention Metapipeline
FlashAttention

Dataflow execution with token control ⇒ no lock-based synchronization

MetaPipeline = Streaming Dataflow

Llama3.1 8B
Limited Kernel Fusion on GPUs
Llama3.1 8B with Tensor-RT LLM
RDU Fuses Entire Decoder into One Kernel!
Llama3.1 8B with aggressive kernel fusion
Kernel Loop
Asynchronous memory and compute

One kernel call for all decoders
3 calls per token on RDU
~800 calls per token on GPU
100x fewer kernel calls


Dataflow ⇒ High Performance
Overlap compute, memory access, chip-to-chip communication
Fully overlap allreduce with weight load and compute
Allreduce does not consume HBM capacity or bandwidth


Summary: Specialized Hardware and Programming for AI Models
Specialized hardware for executing key AI computations efficiently
Feature large/many matrix multiply units implemented with systolic arrays
Customized/configurable datapaths to directly move intermediate data values between processing units (schedule computation by laying it out spatially on the chip)
Large amounts of on-chip storage for fast access to intermediates
H100: Asynchronous compute and memory mechanisms ⇒ complex programming
Need ThunderKittens and other DSLs to manage complexity
SN40L: Dataflow model with metapipelining ⇒ simpler programming model
Sophisticated compiler to optimize and map to dataflow hardware
Minimizing synchronization overheads required for high performance
Lecture 12: Mapping AI Applications to the AI Datacenter
Short Primer on Memory
CPU vs GPU Memory



Increase bandwidth, reduce power by chip stacking
Enabling technology: 3D stacking of DRAM chips
DRAMs connected via through-silicon-vias (TSVs) that run through the chips
TSVs provide highly parallel connection between logic layer and DRAMs
Base layer of stack “logic layer” is memory controller, manages requests from processor
Silicon “interposer” serves as high-bandwidth interconnect between DRAM stack and processor

Technologies:
Micron/Intel Hybrid Memory Cube (HMC)
High-bandwidth memory (HBM) - 1024 bit interface to stack
HBM Advantages
More Bandwidth
High Power Efficiency
Small Form Factor

GPUs are adopting HBM technologies


NVIDIA P100 GPU (2016)
4096-bit interface: 4 HBM2 chips x 1024 bit interface per chip
720 GB/sec peak BW
4 x 4 GB = 16 GB capacity
NVIDIA H100 GPU (2022)
6144-bit interface: 6 HBM3 stacks x 1024 bit interface per stack
3.2 TB/sec peak BW
80 GB capacity
Nvidia HBM Roadmap

AI Cluster Size

Scale Up and Scale Out


DGX SUPERPOD
Modular Architecture
1K GPU SuperPOD Cluster
140 DGX A100 nodes (1,120 GPUs) in a GPU POD
1st tier fast storage - DDN AI400x with Lustre
Mellanox HDR 200Gb/s InfiniBand - Full Fat-tree
Network optimized for AI and HPC
DGX A100 Nodes
2x AMD 7742 EPYC CPUs + 8x A100 GPUs
NVLINK 3.0 Fully Connected Switch
8 Compute + 2 Storage HDR IB Ports
A Fast Interconnect
Modular IB Fat-tree
Separate network for Compute vs Storage
Adaptive routing and SharpV2 support for offload

Message Passing Communication Primitives: AllReduce, ReduceScatter, AllGather
rank = accelerator node
Collective Operations — NCCL 2.29.1 documentation
AllReduce

ReduceScatter

AllGather

Message Passing Communication Primitives: All-to-All

Transformer

Where is the Parallelism in AI Models?

Parallelism and Communication

Distributed Matrix-Multiply Example
inputA[MxK] * inputB[KxN] = out[MxN]
BS = 16, M = 24576, K = 131072, N = 8192
Mapping: Distribute K dimension across S RDUs
Matrix multiply size per socket: [MxK/S] * [K/SxN] = [MxN]
Produces S partial results of size [MxN], one per socket
S-way reduce-scatter to combine the partial results
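For concreteness, assume S = 32 sockets (the socket count is our assumption): K/S = 131072 / 32 = 4096, so each socket computes a [24576 x 4096] * [4096 x 8192] product, producing a partial [24576 x 8192] result. The reduce-scatter then sums the 32 partials elementwise and leaves each socket holding a 1/32 slice of the final [M x N] output (e.g., 768 rows x 8192 columns per socket).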

Compute - Communication Overlap

Importance of Overlap - Conceptual

Communication time increases on GPUs with more sockets
Communication becomes the bottleneck without overlap
GPUs need large interconnect bandwidth to get high utilization
Importance of Overlap - Quantified on RDUs

Sustained 70+% utilization across 32 sockets due to compute-communication overlap
Pipeline Parallelism and Training
Under-utilization of compute resources
Low overall throughput

Fine-grained Pipeline Parallelism
Mini-batch: the number of samples processed in each iteration
Divide a mini-batch into multiple micro-batches
Pipeline the forward and backward computations across micro-batches

Tensor, Data, Pipeline Parallelism



Accessing Memory (a basic tutorial on how DRAM works)
Reducing energy consumption idea 1: use specialized processing (use the right processor for the job)
Reducing energy consumption idea 2: move less data
Data Access has high energy cost
Rule of thumb in mobile system design: always seek to reduce amount of data transferred from memory
Earlier in class we discussed minimizing communication to reduce stalls (poor performance)
Now, we wish to reduce communication to reduce energy consumption
“Ballpark” numbers http://www.displaymate.com/iPad_ShootOut_1.htm
Integer op: ~ 1 pJ (Cost to just perform the logical operation, not counting overhead of instruction decode, load data from registers, etc.)
Floating point op: ~20 pJ (Cost to just perform the logical operation, not counting overhead of instruction decode, load data from registers, etc.)
Reading 64 bits from small local SRAM (1mm away on chip): ~ 26 pJ
Reading 64 bits from low power mobile DRAM (LPDDR): ~1200 pJ (Suggests that recomputing values, rather than storing and reloading them, is a better answer when optimizing code for energy efficiency!)
Implications
Reading 10 GB/sec from memory: ~1.6 watts (at ~1200 pJ per 64-bit read: 10 GB/s ≈ 1.25 billion 8-byte reads/s × 1200 pJ ≈ 1.5 W)
Entire power budget for mobile GPU: ~1 watt (remember phone is also running CPU, display, radios, etc.)
iPhone 16 battery: ~14 watt-hours (note: my MacBook Pro laptop: 99 watt-hour battery)
Exploiting locality matters!!!
Moving data is costly!
Data movement limits performance
Many processing elements…
= higher overall rate of memory requests
= need for more memory bandwidth
(result: bandwidth-limited execution)
Data movement has high energy cost
~ 0.9 pJ for a 32-bit floating-point math op
~ 5 pJ for a local SRAM (on chip) data access
~ 640 pJ to load 32 bits from LPDDR memory

The memory system

DRAM array
1 transistor + capacitor per “bit” (recall from physics: a capacitor stores charge)

DRAM operation (load one byte)
Estimated latencies are in units of memory clocks: DDR3-1600 (Kayvon’s laptop at the time of making this slide)

Load next byte from (already active) row
Lower latency operation: can skip precharge and row activation steps

DRAM access latency is not fixed
Best case latency: read from active row
Column access time (CAS)
Worst case latency: bit lines not ready, read from new row
Precharge (PRE) + row activate (RAS) + column access (CAS)
Precharge readies bit lines and writes row buffer’s contents back into DRAM array (reading a row is destructive)
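A small sketch summarizing the three access cases (the cycle counts are illustrative defaults in the spirit of the DDR3-1600 example above; real values vary by part):

```cpp
// Best case: row already open (CAS only). A closed bank needs activation
// (RAS + CAS). Worst case: a different row is open, so the old row must be
// precharged/written back first (PRE + RAS + CAS).
enum class RowState { Hit, Closed, Conflict };

int dramLatency(RowState s, int CAS = 11, int RAS = 11, int PRE = 11) {
    switch (s) {
        case RowState::Hit:      return CAS;
        case RowState::Closed:   return RAS + CAS;
        case RowState::Conflict: return PRE + RAS + CAS;
    }
    return 0;
}
```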
Question 1: when to execute precharge?
After each column access
Only when new row is accessed?
Question 2: how to handle latency of DRAM access?
Problem: low pin utilization due to latency of access


Data pins in use only a small fraction of time (red = data pins busy)
This is bad since they are the scarcest resource!
DRAM burst mode


Idea: amortize latency over larger transfers
Each DRAM command describes bulk transfer
Bits placed on output pins in consecutive clocks
DRAM chip consists of multiple banks
All banks share same pins (only one transfer at a time)
Banks allow for pipelining of memory requests
Precharge/activate rows/send column address to one bank while transferring data from another
Achieves high data pin utilization

Organize multiple chips into a DIMM
Example: Eight DRAM chips (64-bit memory bus)
Note: DIMM appears as a single, higher capacity, wider interface DRAM module to the memory controller. Higher aggregate bandwidth, but minimum transfer granularity is now 64 bits


Reading one 64-byte (512 bit) cache line (the wrong way)
Assume: consecutive physical addresses mapped to same row of same chip
Memory controller converts physical address to DRAM bank, row, column
All data for cache line serviced by the same chip
Bytes sent consecutively over same pins (8 bits per clock → cache line takes 64 cycles to transfer!)



Reading one 64-byte (512 bit) cache line (efficient way)
Memory controller converts physical address to DRAM bank, row, column
Here: physical addresses are interleaved across DRAM chips at byte granularity
DRAM chips transmit first 64 bits in parallel (cache line takes 8 clocks to transfer)

DRAM controller requests data from new column (Recall modern DRAM’s support burst mode transfer of multiple consecutive columns, which would be used here)
DRAM chips transmit next 64 bits in parallel
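A sketch of the byte-interleaved mapping described here (our illustration): consecutive physical addresses rotate across the DIMM's 8 chips, so a 64-byte line comes back 64 bits (8 chips x 8 bits) per transfer, in 8 transfers.

```cpp
#include <cstdint>
#include <cstdio>

struct DramLoc { int chip; std::uint64_t offset; };

DramLoc mapByteInterleaved(std::uint64_t addr) {
    return { static_cast<int>(addr % 8),  // which chip holds this byte
             addr / 8 };                  // byte offset within that chip
}

int main() {
    for (std::uint64_t a = 0; a < 16; a++) {  // first 16 bytes of a line
        DramLoc d = mapByteInterleaved(a);
        std::printf("byte %2llu -> chip %d, offset %llu\n",
                    (unsigned long long)a, d.chip,
                    (unsigned long long)d.offset);
    }
}
```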

Memory controller is a scheduler of memory requests
Receives load/store requests from LLC
Conflicting scheduling goals
Maximize throughput, minimize latency, minimize energy consumption
Common scheduling policy: FR-FCFS (first-ready, first-come-first-serve)
Service requests to currently open row first (maximize row locality)
Service requests to other rows in FIFO order
Controller may coalesce multiple small requests into large contiguous requests (to take advantage of DRAM “burst modes”)
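A hedged sketch of FR-FCFS request selection (real controllers add many more constraints, e.g. bank timing and starvation avoidance): prefer the oldest request that hits an open row; otherwise fall back to the oldest request.

```cpp
#include <vector>

struct Req { int bank; int row; };  // queue assumed ordered by arrival time

// Returns index of the request to service next, or -1 if the queue is empty.
int pickNext(const std::vector<Req>& queue, const std::vector<int>& openRow) {
    for (int i = 0; i < static_cast<int>(queue.size()); i++)
        if (queue[i].row == openRow[queue[i].bank])
            return i;                  // "first ready": row-buffer hit
    return queue.empty() ? -1 : 0;     // else first-come-first-serve
}
```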

Dual-channel memory system
Increase throughput by adding memory channels (effectively widen memory bus)
Below: each channel can issue independent commands
Different row/column is read in each channel
Simpler setup: use single controller to drive same command to multiple channels

Example: DDR4 memory
Processor: Intel® Core™ i7-7700K Processor (in Myth cluster)
Memory system details from Intel’s site:
Intel® Core™ i7-7700K Processor
DDR4 2400
64-bit memory bus x 1.2 GHz x 2 transfers per clock (DDR stands for “double data rate”) = 19.2 GB/s per channel
2 channels = 38.4 GB/sec
~13 nanosecond CAS
DRAM summary
DRAM access latency can depend on many low-level factors
Discussed today:
State of DRAM chip: row hit/miss? is precharge necessary?
Buffering/reordering of requests in memory controller
Significant amount of complexity in a modern multi-core processor has moved into the design of memory controller
Responsible for scheduling tens to hundreds of outstanding memory requests
Responsible for mapping physical addresses to the geometry of DRAMs
Area of active computer architecture research
Modern architecture challenge:
improving memory performance:
Decrease distance data must move by locating memory closer to processors (enables shorter, but wider interfaces)
Increase bandwidth, reduce power by chip stacking
HBM4 Custom Logic Die

Xeon Phi (Knights Landing) MCDRAM
16 GB in package stacked DRAM
Can be treated as a 16 GB last level cache
Or as a 16 GB separate address space (“flat mode”)
Intel’s claims:
~ same latency as DDR4
~5x bandwidth of DDR4
~5x less energy cost per bit transferred

// allocate buffer in MCDRAM (“high bandwidth” memory malloc) ...
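The rest of the block was lost; here is a hedged sketch using the memkind library's hbwmalloc API, which is one real way to do this on Knights Landing (whether the lecture used exactly this call is an assumption):

```cpp
#include <hbwmalloc.h>  // requires libmemkind; KNL in flat mode

int main() {
    // Allocate the buffer in MCDRAM instead of ordinary DDR4.
    float* buf = static_cast<float*>(hbw_malloc(1024 * sizeof(float)));
    if (!buf) return 1;   // MCDRAM (or the fallback policy) unavailable
    buf[0] = 1.0f;        // touch the high-bandwidth allocation
    hbw_free(buf);
    return 0;
}
```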
The memory bottleneck is being addressed in many ways
By the application programmer
Schedule computation to maximize locality, increase arithmetic intensity (minimize required data movement)
By new hardware architectures
Intelligent DRAM request scheduling
Bringing data closer to processor (deep cache hierarchies, 3D stacking)
Increase bandwidth (wider memory systems)
Ongoing research in locating limited forms of computation “in” or near memory
Ongoing research in hardware-accelerated compression (not discussed today)
General design principles
Locate data storage near processor
Move computation to data storage
Data compression (trade-off extra computation for less data transfer)
Three trends in energy-optimized computing
Compute less!
Computing costs energy: parallel algorithms that do more work than sequential counterparts may not be desirable even if they run faster
Specialize compute units:
Heterogeneous processors: CPU-like cores + throughput-optimized cores (GPU-like cores)
Fixed-function units: audio processing, “movement sensor processing”, video decode/encode, image processing/computer vision?
Specialized instructions: expanding set of AVX vector instructions, new instructions for accelerating AES encryption (AES-NI)
Programmable soft logic: FPGAs
Reduce bandwidth requirements
Exploit locality (restructure algorithms to reuse on-chip data as much as possible)
Aggressive use of compression: perform extra computation to compress application data before transferring to memory (likely to see fixed-function HW to reduce overhead of general data compression/decompression)
【2023】Lecture 17: Hardware Specialization and Algorithm Specific Programming
Challenges of heterogeneous designs:
(it’s not easy to realize the potential of specialized, heterogeneous processing)
Heterogeneous system: preferred processor for each task
Challenge to software developer: how to map application onto a heterogeneous collection of resources?
Challenge: “Pick the right tool for the job”: design algorithms that decompose into components that each map well to different processing components of the machine
The scheduling problem is more complex on a heterogeneous system
Challenge for hardware designer: what is the right mixture of resources?
Too few throughput oriented resources (lower peak throughput for parallel workloads)
Too few sequential processing resources (limited by sequential part of workload)
How much chip area should be dedicated to a specific function, like video?
Summary: heterogeneous processing for efficiency
Heterogeneous parallel processing: use a mixture of computing resources that fit mixture of needs of target applications
Latency-optimized sequential cores, throughput-optimized parallel cores, domain-specialized fixed-function processors
Examples exist throughout modern computing: mobile processors, servers, supercomputers
Traditional rule of thumb in “good system design” is to design simple, general-purpose components
This is not the case in emerging systems (optimized for perf/watt)
Today: want collection of components that meet perf requirement AND minimize energy use
Challenge of using these resources effectively is pushed up to the programmer
Current CS research challenge: how to write efficient, portable programs for emerging heterogeneous architectures?
Spatial
Mapping Algorithms to Execution Resources
General Purpose Processor

Special Purpose Processor (Accelerator)

So You Want to Design an Accelerator for Your Algorithm
Traditionally, you must spend years becoming an expert in VHDL or Verilog, Chisel…
High-Level Synthesis (HLS): Vivado HLS, Intel OpenCL, and Xilinx SDAccel
Restricted C with pragmas
These tools sacrifice performance and are difficult to use
Spatial is a high-level language for designing hardware accelerators that was designed to enable performance-oriented programmers to specify
Parallelism: specialized compute
Locality: specialized memories and data movement
Spatial: DSL for Accelerator Design
Simplify configurable accelerator design
Constructs to express:
Parallel patterns as parallel and pipelined datapaths
Independent parallelism
Dependent parallelism
Hierarchical control
Explicit memory hierarchies
Explicit parameters
All parameters exposed to the compiler
Simple APIs to manage CPU ⇔ Accelerator communication
Allows programmers to focus on “interesting stuff”
Designed for performance-oriented programmers (parallelism and locality)
More intuitive than CUDA: dataflow instead of threads
The Spatial Language: Memory Templates
// Explicit memory hierarchy ...
The Spatial Language: Control Templates
// Blocking/non-blocking interaction with CPU ...
The Spatial Language: Design Parameters
Spatial templates capture a variety of design parameters:
// Explicit parallelization factors ...
Inner Product
Let’s build an accelerator to see how Spatial works
Inner Product in C
Here is inner product written in C for a CPU
// Set up accumulator and memory pointers ...
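Only the first comment survived extraction; a plausible reconstruction of the sequential C baseline (our sketch, not the slide's exact code):

```c
#include <stdio.h>

/* Set up accumulator, then multiply-accumulate over all elements. */
float inner_product(const float* a, const float* b, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

int main(void) {
    float a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8};
    printf("%g\n", inner_product(a, b, 4));  /* prints: 70 */
    return 0;
}
```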
Inner Product in Spatial
Inner product in Spatial allows the programmer to build a hardware accelerator
Start of code looks like C example
Accel instantiates “for” loop in hardware
// Set up host and memory pointers ...

// Set up host and memory pointers ...

Spatial generates multi-step controllers (This Reduce controller’s final step will handle the accumulation)
// Set up host and memory pointers ...

Spatial manages communication with DRAM
// Set up host and memory pointers ...

// Set up host and memory pointers ...

We can pipeline these steps; that is one place to exploit parallelism in the inner-product implementation
We can also parallelize step 2; SIMD is a good way to do so, since we apply the same operation to all of the elements in the tile



Spatial Question
Spatial programmer’s responsibility
Specifying algorithm as a hierarchy of controllers
Specifying memory hierarchy of algorithm
Explicit data movement
Picking tiling factors, parallelism and scheduling
Spatial compiler’s responsibility
Banking and buffering of memories to maximize perf and minimize resources
Hardware generation for target platform (FPGA, CGRA, ASIC)
Performance debugging feedback
TensorFlow to FPGA

Kernel versus Stream Execution
Recap: Why was Flash Attention powerful?
Fused attention

for each j:
for each i:
Load blocks Q_i, K^T_j, V_j, O_i
Compute S_ij = Q_i K^T_j
Compute M_ij = m(S_ij), P_ij = f(S_ij), and l_ij = l(S_ij) (all functions operate row-wise on row vectors)
Multiply P_ij V_j and accumulate into O_i with appropriate scalings
Save memory footprint: never materialize the N x N matrix
Save memory bandwidth: (high arithmetic intensity)
Read 3 blocks (from Q, K, V)
Do two matrix multiplies + a few row summations
Accumulate into O block (which is resident in cache)
Note there is additional computation vs. the original version (must re-scale prior values of O each step of i-loop)
With streaming execution, we get these benefits for free! (Free Fusion!)
Streaming execution model: Free Fusion!
Kernel-based Execution Model:
FlashAttention prevents the materialization of the N x N matrix
However, it requires modifying the algorithm and extra computation
Streaming execution model:
Avoids materialization of the N x N matrix
Without algorithmic changes & extra computation
Preliminary: Softmax
Softmax is actually a 3-step operation



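The step images were lost in extraction; a sketch of the 3-step structure as described in these notes (production implementations also subtract the row max before exponentiating, for numerical stability):

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// (1) S = Exp(S) elementwise, (2) row-wise reduction (row sum),
// (3) divide each element by its row sum.
std::vector<float> softmax(const std::vector<float>& row) {
    std::vector<float> e(row.size());
    for (size_t i = 0; i < row.size(); i++)
        e[i] = std::exp(row[i]);   // step 1: exponentiate
    float sum = 0.f;
    for (float v : e) sum += v;    // step 2: row-sum reduction
    for (float& v : e) v /= sum;   // step 3: normalize
    return e;
}

int main() {
    for (float v : softmax({1.f, 2.f, 3.f})) std::printf("%.3f ", v);
}
```

Note how step 3 cannot begin until step 2 has seen the whole row; that dependency is exactly what forces the buffering discussed below.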
Attention

Kernel-based Execution Model

Materialize the N x N matrix ⇒ ↑ Memory Footprint
Read & write the N x N matrix ⇒ ↑ Memory bandwidth
Streaming Execution Model
With the streaming execution model, we get fusion for free which means:
Avoid materializing the N x N matrix ⇒ ↓ Memory Footprint
Avoid reading & writing the intermediate N x N matrices ⇒ ↓ Memory bandwidth
An example program in a streaming execution model
Computation: Exponential & Rowsum

Doing the computation piece-wise



Kernel-based Execution Model (Overview)

Streaming Execution Model (Overview)

Streaming Execution Model












Can we do better with FlashAttention?
Yes!
By paying a bit more computation cost as Flash Attention does, we can eliminate the FIFO in the middle
We needed this sequence-length (N) sized FIFO to buffer the output of S = Exp(S)
This is because in softmax, we have to wait until the row-wise reduction (row sum) is calculated before dividing the output of S = Exp(S) by the row sum
Flash Attention breaks this dependency by:
Reordering operations
Using a running sum & rescaling instead of the naïve reduction (row sum)
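A hedged sketch (ours) of the running-sum-and-rescaling trick: process the row chunk by chunk, keep a running max m and running sum l of exp(x - m), and rescale the old sum whenever the max grows. FlashAttention applies the same rescaling to its output accumulator, which is what removes the need for the N-sized buffer before normalizing.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// After the loop, softmax(x)[i] = exp(x[i] - m) / l, computable per chunk.
void onlineStats(const std::vector<float>& x, size_t chunk,
                 float& m, float& l) {
    m = -INFINITY; l = 0.f;
    for (size_t s = 0; s < x.size(); s += chunk) {
        size_t e = std::min(x.size(), s + chunk);
        float mNew = m;
        for (size_t i = s; i < e; i++) mNew = std::max(mNew, x[i]);
        float lc = 0.f;
        for (size_t i = s; i < e; i++) lc += std::exp(x[i] - mNew);
        l = l * std::exp(m - mNew) + lc;  // rescale old sum to the new max
        m = mNew;
    }
}

int main() {
    std::vector<float> x = {0.5f, 2.f, -1.f, 3.f, 1.f, 0.f};
    float m, l;
    onlineStats(x, 2, m, l);
    std::printf("m=%g l=%g\n", m, l);
}
```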

Kernel versus Stream Execution
More parallelism
FlashAttention with kernel-based execution model:
Cannot overlap the computation for different output tiles
Streaming execution model
Spatially maps each computation with pipeline communication
Can overlap (pipeline) the computation for different output tiles!
Don’t have to manually create fused kernels
FlashAttention with kernel-based execution model:
Have to manually write fused kernels in CUDA
Often challenging to fuse deeply due to the limit in (# of registers / SM)
Streaming execution model
Operations get fused automatically if we write the program using FIFOs
Compiler can automatically generate fused execution
Accelerator Design Summary
Significant energy efficiency improvements from specialized accelerators (100x–1000x)
Designing an accelerator is a tradeoff between performance and resource utilization
Parallelism
Locality
It requires the programmer to have insight into the application
Where is the bottleneck?
Is the implementation compute- or memory-bound?
Spatial helps you understand the trade-off between performance and resource utilization
Allows rapid exploration of your algorithm
Enables high-level accelerator design
Lecture 13: Domain-Specific Programming Systems and Automatic Performance Optimization
Domain specific languages
Performance optimization in languages like C++, ISPC, CUDA = low productivity (Proof by assignments 1, 2, 3, 4, etc…)
The ideal parallel programming language

Popular languages (not exhaustive ;-))

Way forward ⇒ domain-specific languages

Domain specific languages
Domain Specific Languages (DSLs)
Programming language with restricted expressiveness for a particular domain
High-level, usually declarative, and deterministic
Domain-specific programming systems
Main idea: raise level of abstraction for expressing programs
Goal: quickly write a high-performance program for a target machine
Goal: write one program, and run it efficiently on different machines
Introduce high-level programming primitives specific to an application domain
Productive: intuitive to use, portable across machines, primitives correspond to behaviors frequently used to solve problems in the targeted domain
Performant: system uses domain knowledge to provide efficient, optimized implementation(s)
Given a machine: system knows what algorithms to use, parallelization strategies to employ for this domain
Optimization goes beyond efficient mapping of software to hardware! The hardware platform itself can be optimized to the abstractions as well
Cost: loss of generality/completeness
A DSL example: Halide: a domain-specific language for image processing
Jonathan Ragan-Kelley, Andrew Adams et al. [SIGGRAPH 2012, PLDI 13]
Halide used in practice
Halide used to implement camera processing pipelines on Google phones
HDR+, aspects of portrait mode, etc…
Industry usage at Instagram, Adobe, etc.
3x3 image blur
Total work per image = 9 x WIDTH x HEIGHT
For an NxN filter: N^2 x WIDTH x HEIGHT
int WIDTH = 1024; ...
Two-pass 3x3 blur
A 2D separable filter (such as a box filter) can be evaluated via two 1D filtering operations
Total work per image = 6 x WIDTH x HEIGHT
For an NxN filter: 2N x WIDTH x HEIGHT
WIDTH x HEIGHT extra storage
2x lower arithmetic intensity than 2D blur (2x memory access)

int WIDTH = 1024; ...
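A hedged reconstruction of the two-pass code the remnant above came from (padding conventions and sizes are assumptions): a horizontal 1D pass into tmp_buf, then a vertical 1D pass into output, 3 x WIDTH x HEIGHT work each.

```cpp
#include <cstdio>

const int WIDTH = 1024, HEIGHT = 1024;
float input[(WIDTH + 2) * (HEIGHT + 2)];  // padded input image
float tmp_buf[WIDTH * (HEIGHT + 2)];      // the WIDTH x HEIGHT extra storage
float output[WIDTH * HEIGHT];
float weights[3] = {1.f / 3, 1.f / 3, 1.f / 3};

int main() {
    for (int j = 0; j < HEIGHT + 2; j++)       // pass 1: horizontal blur
        for (int i = 0; i < WIDTH; i++) {
            float tmp = 0.f;
            for (int ii = 0; ii < 3; ii++)
                tmp += input[j * (WIDTH + 2) + i + ii] * weights[ii];
            tmp_buf[j * WIDTH + i] = tmp;
        }
    for (int j = 0; j < HEIGHT; j++)           // pass 2: vertical blur
        for (int i = 0; i < WIDTH; i++) {
            float tmp = 0.f;
            for (int jj = 0; jj < 3; jj++)
                tmp += tmp_buf[(j + jj) * WIDTH + i] * weights[jj];
            output[j * WIDTH + i] = tmp;
        }
    std::printf("%g\n", output[0]);
}
```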
Two-pass image blur: locality
Intrinsic bandwidth requirements of blur algorithm:
Application must read each element of input image and must write each element of output image
int WIDTH = 1024; ...
Two-pass image blur, “chunked” (version 1)
Total work per row of output:
step 1: 3 x 3 x WIDTH work
step 2: 3 x WIDTH work
Total work per image = 12 x WIDTH x HEIGHT
Loads from tmp_buffer are cached (assuming tmp_buffer fits in cache)

int WIDTH = 1024; ...
Two-pass image blur, “chunked” (version 2)
Total work per chunk of output: (assume CHUNK_SIZE = 16)
Step 1: 18 x 3 x WIDTH work
Step 2: 16 x 3 x WIDTH work
Total work per image: (34/16) x 3 x WIDTH x HEIGHT = 6.4 x WIDTH x HEIGHT
Trends to ideal value of 6 x WIDTH x HEIGHT as CHUNK_SIZE is increased!

int WIDTH = 1024; ...
Still not done
We have not parallelized loops for multi-core execution
We have not used SIMD instructions to execute loop bodies
Other basic optimizations: loop unrolling, etc…
Optimized C++ code: 3x3 image blur
Good: ~10x faster on a quad-core CPU than my original two-pass code
Bad: specific to SSE (not AVX2), CPU-code only, hard to tell what is going on at all!

Halide language
Simple domain-specific language embedded in C++ for describing sequences of image processing operations
Halide function: an infinite (but discrete) set of values defined on an N-D domain
Halide expression: a side-effect free expression that describes how to compute a function’s value at a point in its domain in terms of the values of other functions
Var x, y; ...
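A hedged sketch of what the lost block likely contained (Halide really is embedded in C++ like this, but the exact lecture code may differ; the in Func below is a stand-in for a loaded image):

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    Var x, y;
    Func in, blurx, blury;
    in(x, y) = cast<float>(x + y);  // stand-in for a loaded image
    // Horizontal then vertical 1D blur, declared point-wise: each line says
    // WHAT the value at (x, y) is; no loop order or storage is implied.
    blurx(x, y) = (in(x - 1, y) + in(x, y) + in(x + 1, y)) / 3.0f;
    blury(x, y) = (blurx(x, y - 1) + blurx(x, y) + blurx(x, y + 1)) / 3.0f;
    Buffer<float> result = blury.realize({64, 64});  // compile and run
}
```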
Image processing application as a DAG

Key aspects of representation
Intuitive expression:
Adopts a local “point-wise” view of expressing algorithms
Halide language is declarative. It does not define order of iteration, or what values in domain are stored!
It only defines what is needed to compute these values
Iteration over domain points is implicit (no explicit loops)
Var x, y; ...
One (serial) implementation of Halide
Equivalent “C-style” loop nest: ...
Key aspect in the design of any system: Choosing the “right” representations for the job
Good representations are productive to use:
Embody(体现) the natural way of thinking about a problem
Good representations enable the system to provide the application useful services:
Validating/providing certain guarantees (correctness, resource bounds, conservation of quantities, type checking)
Performance (parallelization, vectorization, use of specialized hardware)
Now the job is not expressing an image processing computation, but generating an efficient implementation of a specific Halide program
A second set of representations for “scheduling”

Scheduling primitives allow the programmer to specify a high-level “sketch” of how to schedule the algorithm onto a parallel machine, but leave the details of emitting the low-level platform-specific code to the Halide compiler
Primitives for iterating over N-D domains
Specify both order and how to parallelize (multi-thread, vectorize via SIMD instr)


(In diagram, numbers indicate sequential order of processing within a thread)
Specifying loop iteration order and parallelism
blurx(x,y) = (in(x-1, y) + in(x,y) + in(x+1,y)) / 3.0f; ...
Primitives for how to interleave producer/consumer processing
// Halide algorithm: ...
Loop nest diagram of implementation:

// Halide algorithm: ...
Loop nest diagram of implementation:

// Halide algorithm: ...
Loop nest diagram of implementation:

Summary of scheduling the 3x3 box blur
// the “algorithm description” (declaration of what to do) ...
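Continuing the sketch above, a hedged example of the second, scheduling language (tile/vectorize/parallel/compute_at are real Halide primitives; the specific choices here are illustrative, not the lecture's exact schedule):

```cpp
Var xi, yi;
blury.tile(x, y, xi, yi, 256, 32)  // iterate over 256x32 tiles of the output
     .vectorize(xi, 8)             // 8-wide SIMD over the tile's inner x
     .parallel(y);                 // distribute tile rows across CPU threads
blurx.compute_at(blury, x)         // produce blurx per output tile,
     .vectorize(x, 8);             // right before its consumer uses it
```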
What is the philosophy of Halide?
Programmer is responsible for describing an image processing algorithm
Programmer has knowledge of how to schedule the application efficiently on machine (but it’s slow and tedious), so Halide gives programmer a second language to express high-level scheduling decisions
Loop structure of code
Unrolling / vectorization / multi-core parallelization
The system (Halide compiler) is not smart; it provides the service of mechanically carrying out the details of the schedule in terms of mechanisms available on the target machine (pthreads, AVX intrinsics, etc.)
Constraints on language (to enable compiler to provide desired services)
Application domain scope: computation on regular N-D domains
Only feed-forward pipelines (+ special support for reductions and fixed-depth recursion)
All dependencies inferable by the compiler
Initial academic Halide results
Application 1: camera RAW processing pipeline (Convert RAW sensor data to RGB image)
Original: 463 lines of hand-tuned ARM NEON assembly
Halide: 2.75x less code, 5% faster

Application 2: bilateral filter (common image filtering operation used in many applications)
Original 122 lines of C++
Halide: 34 lines algorithm + 6 lines schedule
CPU implementation: 5.9x faster
GPU implementation: 2x faster than hand-written CUDA

Stepping back: what is Halide?
Halide is a DSL for helping expert developers optimize image processing code more rapidly
Halide does not decide how to optimize a program for a novice programmer
Halide provides primitives for a programmer (that has strong knowledge of code optimization) to rapidly express what optimizations the system should apply
Halide compiler carries out the nitty-gritty of mapping that strategy to a machine
Automatically generating Halide schedules
Problem: it turned out that very few programmers have the ability to write good Halide schedules
80+ programmers at Google write Halide
Very small number trusted to write schedules
Solution: extend compiler to analyze Halide program to automatically generate efficient schedules for the programmer [Adams 2019]
As of [Adams 2019], you’d have to work pretty hard to manually author a schedule that is better than the schedule generated by the Halide autoscheduler for image processing applications
See “Learning to Optimize Halide with Tree Search and Random Programs”, Adams et al. SIGGRAPH 2019
Modeling scheduling as a sequence of choices
For each node N in the program DAG, starting from the end of the DAG…
Choose where to place current node N in the existing loop nest (determine N.compute_at())
Choose tile sizes for N (assume the outer dimension is parallel over threads, the inner dimension is vectorized)
Repeat until entire DAG is scheduled
Use search to find best performing schedule
Search over large space of schedules (e.g., greedy search, beam search)

Challenge: might need to search over hundreds of thousands of possible schedules… how do we get the cost of a schedule?
Cost estimation using AI
Given program + schedule… estimate cost (in practice, it doesn’t directly compute cost… it outputs 27 coefficients that are plugged into a hand-crafted cost model)
Simple MLP that runs in tens of microseconds per schedule (e.g., 1.4M schedules tested in 166 seconds)
Trained on a large database of randomly generated Halide programs
Training programs compiled and executed to get actual cost


Autoscheduler comparable to best known human schedules
Graphs plot relative throughput (output pixels/second)
TL;DR - [Adams 2019], you’d have to work pretty hard to manually author a schedule that is better than the schedule generated by the Halide autoscheduler for image processing applications on CPUs
Autoscheduler saves time for experts
Earlier results from [Mullapudi 2016], not [Adams 2019]

Takeaways
Halide scheduling primitives were designed to enhance productivity of expert human programmers that were trying to schedule image processing code
The high level of abstraction for scheduling also provided a clear way to enumerate the space of all possible schedules, enabling automated search
Consider searching over all possible permutations of a C++ program
Darkroom/Rigel/Aetherling
Goal: directly synthesize an ASIC or FPGA implementation of image processing pipelines from a high-level algorithm description (a constrained “Halide-like” language)

Goal: very high efficiency image processing
Many other recent domain-specific programming systems
Hadoop
Less domain-specific than examples given today, but still designed specifically for: data-parallel computations on big data for distributed systems (“Map-Reduce”)
GraphLab
DSL for graph-based machine learning computations
Also see Ligra (DSLs for describing operations on graphs)
RAILS
Model-view-controller paradigm for web-applications
TensorFlow
DSL for defining deep neural networks and training/inference computations on those networks
julia
Numerical computing
Ongoing efforts in many domains…
Languages for physical simulation: Simit [MIT], Ebb [Stanford]
Opt: a language for non-linear least squares optimization [Stanford]
Summary
Modern machines: parallel and heterogeneous
Only way to increase compute capability in energy-constrained world
Most software uses small fraction of peak capability of machine
Very challenging to tune programs to these machines
Tuning efforts are not portable across machines
Domain-specific programming environments trade-off generality to achieve productivity, performance, and portability
Case study today: Halide
Leverage explicit dependencies, domain restrictions, domain knowledge for system to synthesize efficient implementations
Another DSL example: Liszt, a language for solving PDEs on meshes
[DeVito et al. Supercomputing 11, SciDac ’11]
What a Liszt program does
A Liszt program is run on a mesh:
A Liszt program computes the value of fields defined on mesh faces, edges, or vertices

Liszt program: heat conduction on a mesh
Program computes the value of fields defined on meshes


Liszt programming
A Liszt program describes operations on fields of an abstract mesh representation
Application specifies the type of mesh (regular, irregular) and its topology
Mesh representation is chosen by Liszt (not by the programmer)
Based on mesh type, program behavior, and target machine
Well, that’s interesting. I write a program, and the compiler decides what data structure it should use based on what operations my code performs
Compiling to parallel computers
Recall challenges you have faced in your assignments
1.Identify parallelism
2.Identify data locality
3.Reason about what synchronization is required
Now consider how to automate this process in the Liszt compiler
Key: determining program dependencies
1.Identify parallelism
Absence of dependencies implies code can be executed in parallel
2.Identify data locality
Partition data based on dependencies
3.Reason about required synchronization
Synchronization is needed to respect dependencies (must wait until the values a computation depends on are known)
In general programs, compilers are unable to infer dependencies at global scale:
Consider: a[f(i)] += b[i];
(must execute f(i) to know if dependency exists across loop iterations i)
Liszt is constrained to allow dependency analysis
Liszt infers "stencils": a "stencil" = the mesh elements accessed in an iteration of a loop = the dependencies for that iteration
Statically analyze code to find the stencil of each top-level for loop
Extract nested mesh element reads
Extract operations on data at mesh elements



Portable parallelism: compiler uses knowledge of dependencies to implement different parallel execution strategies
Strategy 1: mesh partitioning

Strategy 2: mesh coloring

Imagine compiling a Liszt program to a cluster (multiple nodes, distributed address space). How might Liszt distribute the mesh across these nodes?
Must access mesh elements relative to some input vertex, edge, face, etc.
Notice how many operators return sets (e.g., “all edges of this face”)
Distributed memory implementation of Liszt

Imagine compiling a Liszt program to a GPU (single address space, many tiny threads)
GPU implementation: parallel reductions
In previous example, one region of mesh assigned per processor (or node in cluster)
On GPU, natural parallelization is one edge per CUDA thread
Edges (each edge assigned to 1 CUDA thread)

Flux field values (stored per vertex)
Different edges share a vertex: requires atomic update of per-vertex field data

GPU implementation: conflict graph
Edges (each edge assigned to 1 CUDA thread)

Flux field values (stored per vertex)

Identify mesh edges with colliding writes (lines in graph indicate the presence of a collision)
Can simply run program once to get this information (results remain valid for subsequent executions provided mesh does not change)

“Color” nodes in graph such that no connected nodes have the same color
Can execute on GPU in parallel, without atomic operations, by running all nodes with the same color in a single CUDA launch
Performance of a Liszt program on a cluster
256 nodes, 8 cores per node (message-passing)


Important: performance portability!
Same Liszt program also runs with high efficiency on GPU (results not shown)
But uses a different algorithm when compiled to GPU! (graph coloring)
Liszt summary
Productivity
Abstract representation of mesh: vertices, edges, faces, fields (concepts that a scientist thinks about already!)
Intuitive topological operators
Portability
Same code runs on a large cluster of CPUs and on GPUs (and combinations thereof!)
High performance
Language is constrained to allow compiler to track dependencies
Used for locality-aware partitioning (distributed memory implementation)
Used for graph coloring to avoid sync (GPU implementation)
Compiler chooses different parallelization strategies for different platforms
System can customize mesh representation based on application and platform (e.g., don't store edge pointers if the code doesn't need them)
Elements of good domain-specific programming system design
#1: good systems identify the most important cases, and provide most benefit in these situations
Structure of code mimics the natural structure of problems in the domain
Halide: pixel-wise view of filters: pixel(x,y) computed as an expression of input pixel values
Graph processing algorithms: per-vertex operations
Efficient expression: common operations are easy and intuitive to express
Efficient implementation: the most important optimizations in the domain are performed by the system for the programmer
My experience: a parallel programming system with "convenient" abstractions that precludes the best-known implementation strategies will almost always fail
#2: good systems are simple systems
They have a small number of key primitives and operations
Halide: a few scheduling primitives for describing loop nests
Hadoop: map + reduce
Allows compiler/runtime to focus on optimizing these primitives
Provide parallel implementations, utilize appropriate hardware
Common question that good architects ask: “do we really need that?” (can this concept be reduced to a primitive we already have?)
For every domain-specific primitive in the system: there better be a strong performance or expressivity justification for its existence
#3: good primitives compose
Composition of primitives allows for wide application scope, even if scope is limited to a domain
e.g., frameworks discussed today support a wide variety of graph algorithms
Halide's loop ordering + loop interleaving schedule primitives allow for expression of a wide range of schedules
Composition often allows optimizations to generalize
If system can optimize A and optimize B, then it can optimize programs that combine A and B
Common sign that a feature should not be added (or should be added in a different way):
The new feature does not compose with all existing features in the system
Sign of a good design:
System ultimately is used for applications the original designers never anticipated
LLM code generation
Trial and error via reflection

KernelBench
A benchmark of hundreds of PyTorch kernels
LLM agent's goal is to automatically produce fast and correct CUDA kernels
https://github.com/ScalingIntelligence/KernelBench
Domain-specific languages for writing DNN programs help automation as well
Good:
LLM is now assembling high-performance primitives, not writing low-level CUDA
Less likely to make correctness mistakes/hallucinations
Challenge:
LLMs can struggle to write correct code in less-used languages (less data to train on… this will resolve over time)
https://github.com/HazyResearch/ThunderKittens
https://triton-lang.org/main/index.html
https://github.com/NVIDIA/cutlass
https://github.com/tile-ai/tilelang
Open question: Can an LLM agent serve as a great CS149 student? At what token cost?
Idea 1: fine-tune LLMs based on experience
Use experience to fine-tune a custom LLM for a particular type of programming task
Needs large numbers of tasks, and the ability to fine-tune larger LLMs
Idea 2: LLM agent self-improves by building up a DB of “example solutions”
Agent builds up a database of example solutions (e.g., "practice problems")
Given a new problem to solve, it retrieves solutions to the most relevant practice problems

Benefit of database-driven self-improving agent


Idea 3: self-improvement via optimizing prompts from experience
Same idea as before, but now update the prompt given to the LLM based on the prior experience, don’t just provide relevant examples
Idea 4:
Combine exhaustive search-based techniques (like the Halide autoscheduler) with the LLM agentic ideas above
Extremely high optimization cost, but some of the best results
Summary
Performance optimization requires a high level of expertise
And even for experts it’s tedious and hard
And you have to repeat it for new machines and slightly different problems
And companies are spending tens to hundreds of millions of dollars a year or more on AI compute costs
Seems like a great case for automation
The best CS149 students of the future will likely work in tandem with automated agents to accelerate their thinking and their work
Interesting debates on whether the real value that leads to success is in the DSL design or in the LLM agent!
Lecture 14: Cache Coherence
Cache Design
The difference between a write-back and a write-through cache
Write-through: when we write to the cache, we also write to main memory. Write-back: we only write to the cache, and the data is written to main memory later (when the line is evicted)
A write-allocate vs. write-no-allocate cache
The question is: what happens when I write to the cache and the line I want is not there? Do I allocate, i.e., fetch the rest of the line and place it in the cache? Or do I just write directly to main memory?
Behavior of write-allocate, write-back cache on a write miss (uniprocessor case)
Example: processor executes int x = 1;
1.Processor performs write to address that “misses” in cache
2.Cache selects location to place line in cache, if there is a dirty line currently in this location, the dirty line is written out to memory
3.Cache loads line from memory (“allocates line in cache”)
4.Whole cache line is fetched and 32 bits are updated
5.Cache line is marked as dirty
Cache hierarchy of Intel Skylake CPU (2015)
Caches exploit locality
64 byte cache line size

3 Cs cache miss model
Cold
Capacity
Conflict
A shared memory multi-processor
Processors read and write to shared variables
More precisely: processors issue load and store instructions
A reasonable expectation of memory is:
Reading a value at address X should return the last value written to address X by any processor

(A simple view of four processors and their shared address space)
Cache coherence problem
Modern processors replicate contents of memory in local caches
Problem: processors can observe different values for the same memory location
This is a problem created by replicating the data in local caches; it cannot be fixed by adding locks to the program
Intuitive expectation of shared memory
Intuitive behavior for memory system: reading value at address X should return the last value written to address X by any processor
Memory coherence problem exists because there is both global storage (main memory) and per-processor local storage (processor caches) implementing the abstraction of a single shared address space
On a uniprocessor, providing this behavior is fairly simple, since writes typically come from one source: the processor
Exception: device I/O via direct memory access (DMA)
Problems with the intuition
Intuitive behavior: reading value at address X should return the last value written to address X by any processor
What does “last” mean?
What if two processors write at the same time?
What if a write by P1 is followed by a read from P2 so close in time that it is impossible to communicate the occurrence of the write to P2 in time?
In a sequential program, “last” is determined by program order (not time)
Holds true within one thread of a parallel program
But we need to come up with a meaningful way to describe order across threads in a parallel program
Coherence
Definition: Coherence
A memory system is coherent if:
The results of a parallel program's execution are such that for each memory location, there is a hypothetical serial order of all program operations (executed by all processors) to the location that is consistent with the results of execution, and:
1.Memory operations issued by any one processor occur in the order issued by the processor
2.The value returned by a read is the value written by the last write to the location… as given by the serial order

Implementation: Cache Coherence Invariants
For any memory address x, at any given time period (epoch):
Single-Writer, Multiple-Reader (SWMR) Invariant
Read-write epoch: there exists only a single processor that may write to x (and it can also read x)
Read-only epoch: some number of processors may read x (none may write it)
Data-Value Invariant (write serialization)
The value of the memory address at the start of an epoch is the same as the value of the memory location at the end of its last read-write epoch

Implementing coherence
Software-based solutions (coarse grain: VM page)
OS uses the page-fault mechanism to propagate writes
Can be used to implement memory coherence over clusters of workstations
We won’t discuss these solutions
Big performance problem: false sharing
Hardware-based solutions (fine grain: cache line)
"Snooping"-based coherence implementations
Directory-based coherence implementations
Shared caches: coherence made easy
One single cache shared by all processors
Eliminates problem of replicating state in multiple caches
Obvious scalability problems (since the point of a cache is to be local and fast)
Interference (conflict misses) / contention due to many clients (destructive)
But shared caches can have benefits:
Facilitates fine-grained sharing (overlapping working sets)
Loads/stores by one processor might pre-fetch lines for another processor (constructive)

“Snooping”-based coherence implementations
Snooping cache-coherence schemes
Main idea: all coherence-related activity is broadcast to all processors in the system (more specifically: to the processor’s cache controllers)
Cache controllers monitor (“they snoop”) memory operations, and follow cache coherence protocol to maintain memory coherence

Very simple coherence implementation
Let’s assume:
1.Write-through caches
2.Granularity of coherence is cache line
Coherence Protocol:
Upon write, cache controller broadcasts invalidation message
As a result, the next read from other processors will trigger cache miss
(processor retrieves updated value from memory due to write-through policy)
(in the table below, "$" denotes a cache)
| Action | Interconnect activity | P0 $ | P1 $ | mem location X |
|---|---|---|---|---|
| | | | | 0 |
| P0 load X | cache miss for X | 0 | | 0 |
| P1 load X | cache miss for X | 0 | 0 | 0 |
| P0 write 100 to X | invalidation for X | 100 | | 100 |
| P1 load X | cache miss for X | 100 | 100 | 100 |
Write-through policy is inefficient
Every write operation goes out to memory
Very high bandwidth requirements
Write-back caches absorb most write traffic as cache hits
Significantly reduces bandwidth requirements
But now how do we maintain cache coherence invariants?
This requires more sophisticated coherence protocols
Cache coherence with write-back caches

Key mechanisms: broadcast (of coherence messages) and serialization (of writes to the same location)
Dirty state of cache line now indicates exclusive ownership (Read-Write Epoch)
Modified: cache is only cache with a valid copy of line (it can safely be written to)
Owner: cache is responsible for propagating the line to other processors when they attempt to load it from memory (otherwise a load from another processor would get stale data from memory)
Cache Coherence Protocol
Algorithm that maintains the cache coherence invariants
The logic we are about to describe is performed by each processor’s cache controller in response to:
Loads and stores by the local processor
Messages from other caches on the bus
If all cache controllers operate according to this described protocol, then coherence will be maintained
The caches “cooperate” to ensure coherence is maintained
Invalidation-based write-back protocol
Key ideas:
A line in the “modified” state can be modified without notifying the other caches
Processor can only write to lines in the modified state
Need a way to tell other caches that processor wants exclusive access to the line
We accomplish this by sending message to all the other caches
When cache controller sees a request for modified access to a line it contains
It must invalidate the line in its cache
cache line state bits

MSI write-back invalidation protocol
Key tasks of protocol
Ensuring processor obtains exclusive access for a write
Locating most recent copy of cache line’s data on cache miss
Three cache line states
Invalid (I): same as meaning of invalid in uniprocessor cache
Shared (S): line valid in one or more caches, memory is up to date
Modified (M): line valid in exactly one cache (a.k.a. “dirty” or “exclusive” state)
Two processor operations (triggered by local CPU)
PrRd (processor read)
PrWr (processor write)
Three coherence-related bus transactions (from remote caches)
BusRd: obtain copy of line with no intent to modify
BusRdX: obtain copy of line with intent to modify
BusWB: write dirty line out to memory
Cache Coherence Protocol: MSI State Transition Diagram

MSI Invalidate Protocol
Read obtains block in “shared”
even if only cached copy
Obtain exclusive ownership before writing
BusRdX causes others to invalidate
If M in another cache, will cause writeback
BusRdX even if hit in S
promote to M (upgrade)
all caches are carrying out this logic independently to maintain coherence
A Cache Coherence Example
| Proc Action | P1 $-state | P2 $-state | P3 $-state | Bus Trans | Data from |
|---|---|---|---|---|---|
| P1 read x | S | -- | -- | BusRd | Memory |
| P3 read x | S | -- | S | BusRd | Memory |
| P3 write x | I | -- | M | BusRdX | Memory |
| P1 read x | S | -- | S | BusRd | P3 $ |
| P1 read x | S | -- | S | none | P1 $ |
| P2 write x | I | M | I | BusRdX | Memory |
Single writer, multiple reader protocol
Why do you need the Modified-to-Shared transition?
The invariant we are trying to maintain is single-writer, multiple-reader; when another processor reads a line that is in M, the owning cache must transition the line from Modified to Shared
Communication increases memory latency
How Does MSI Satisfy Cache Coherence?
1.Single-Writer, Multiple-Reader (SWMR) Invariant
2.Data-Value Invariant (write serialization)

Summary: MSI
A line in the M state can be modified without notifying other caches
No other caches have the line resident, so other processors cannot read these values (without generating a memory read transaction)
Processor can only write to lines in the M state
If processor performs a write to a line that is not exclusive in cache, cache controller must first broadcast a transaction to move the line into that state
Read-exclusive tells other caches about the impending write ("you can't read any more, because I'm going to write")
Read-exclusive transaction is required even if line is valid (but not exclusive… it’s in the S state) in processor’s local cache
Dirty state implies exclusive
When cache controller snoops a “read exclusive” for a line it contains
Must invalidate the line in its cache
Because if it didn't, then multiple caches would have the line (and so it wouldn't be exclusive in the other cache!)
MESI invalidation protocol
MSI requires two interconnect transactions for the common case of reading an address, then writing to it
Transaction 1: BusRd to move from the I to the S state
Transaction 2: BusRdX to move from the S to the M state
This inefficiency exists even if application has no sharing at all
Solution: add additional state E (“exclusive clean”)
Line has not been modified, but only this cache has a copy of the line
Decouples exclusivity from line ownership (line is not dirty, so the copy in memory is a valid copy of the data)
Upgrade from E to M does not require a bus transaction
MESI state transition diagram

Going from Invalid to Shared is the typical case, if the line is shared
But if the line is not shared, you go directly to the Exclusive state
And then the upgrade to Modified can be made without a bus transaction
Example:
Read an address and then write to it: the read brings the line in in the E state (if no other cache holds it), so the subsequent write upgrades E → M with no bus transaction
Directory-based coherence implementations
Scalable cache coherence using directories
Snooping schemes broadcast coherence messages to determine the state of a line in the other caches: not scalable
Alternative idea: avoid broadcast by storing information about the status of the line in one place: a “directory”
The directory entry for a cache line contains information about the state of the cache line in all caches
Caches look up information from the directory as necessary
Cache coherence is maintained by point-to-point messages between the caches on a “need to know” basis (not by broadcast mechanisms)
Directory instead of bus serves as coherence serialization mechanism
Still need to maintain invariants
SWMR
Write serialization
Directory coherence in Intel Core i7 CPU

L3 serves as a centralized directory for all lines in the L3 cache
Serialization point
(Since L3 is an inclusive cache, any line in L2 is guaranteed to also be resident in L3)
Directory maintains list of L2 caches containing line
Instead of broadcasting coherence traffic to all L2’s, only send coherence messages to L2’s that contain the line
(Core i7 interconnect is a ring, it is not a bus)
Directory dimensions:
P=4
M = number of L3 cache lines
Implications of cache coherence for the programmer
Communication Overhead
Communication time is a key parallel overhead
Appears as increased memory access time in multiprocessor
Extra main memory accesses in UMA systems (Uniform Memory Access: all processors observe the same memory access latency; also called a tightly coupled system)
Must determine increase in cache miss rate vs. uniprocessor
Some accesses have higher latency in NUMA systems (Non-Uniform Memory Access: access time depends on the location of the memory relative to the processor; a processor accesses its own local memory faster than remote memory)
Even a fraction of a percent of such accesses can be significant!
AMAT_multiprocessor > AMAT_uniprocessor (AMAT = average memory access time)
Use system tools to optimize cache performance
Intel VTune
Get Intel® VTune™ Profiler
Apple Xcode Instruments
Instruments Tutorials | Apple Developer Documentation
Unintended communication via false sharing
What is the potential performance problem with this code?
Multiple threads' data share the same cache line, so writes force the line to bounce between caches with repeated invalidations and write-backs (coherence conflicts)
1 | // allocate per-thread variable for local per-thread accumulation |
Why might this code be more performant?
Each thread's data sits in a different cache line, so writes do not force the line to bounce between caches (no coherence conflicts)
1 | // allocate per thread variable for local accumulation |
Demo: false sharing
threads update a per-thread counter many times
1 | void* worker(void* arg) { |
Execution time with num_threads=8 on 4-core system: 14.2 sec
1 | void test1(int num_threads) { |
Execution time with num_threads=8 on 4-core system: 4.7 sec
1 | struct padded_t { |
False sharing
Condition where two processors write to different addresses, but addresses map to the same cache line

Cache line “ping-pongs” between caches of writing processors, generating significant amounts of communication due to the coherence protocol
No inherent communication; this is entirely artifactual communication (it exists because cache lines are larger than 4 bytes)
False sharing can be a factor when programming for cache-coherent architectures
Impact of cache line size on miss rate
Results from simulation of a 1 MB cache (four example applications)

Note: I separated the results into two graphs because of different Y-axis scales
Summary: Cache coherence
The cache coherence problem exists because the abstraction of a single shared address space is not implemented by a single storage unit
Storage is distributed among main memory and local processor caches
Data is replicated in local caches for performance
Main idea of snooping-based cache coherence: whenever a cache operation occurs that could affect coherence, the cache controller broadcasts a notification to all other cache controllers in the system
Challenge for HW architects: minimizing overhead of coherence implementation
Challenge for SW developers: be wary of artifactual communication due to coherence protocol (e.g., false sharing)
Scalability of snooping implementations is limited by ability to broadcast coherence messages to all caches!
Scaling cache coherence via directory-based approaches
Lecture 15: Memory Coherency and Consistency
Shared Memory Behavior
Intuition says loads should return latest value written
What is latest?
Coherence: concerns reads/writes to a single memory address
Consistency: concerns the apparent ordering of operations across all addresses
Order in which memory operations performed by one thread become visible to other threads
Affects
Programmability: how programmers reason about program behavior
Allowed behavior of multithreaded programs executing with shared memory
Performance: limits HW/SW optimizations that can be used
Reordering memory operations to hide latency
Memory Consistency: Who Should Care
Anyone who:
Wants to implement a synchronization library
Will ever work in kernel (or driver) development
Seeks to implement lock-free data structures
Memory coherence vs. memory consistency
Memory coherence defines requirements for the observed behavior of reads and writes to the same memory location
All processors must agree on the order of reads/writes to X
In other words: it is possible to put all operations involving X on a timeline such that the observations of all processors are consistent with that timeline
Memory consistency defines the behavior of reads and writes to different locations (as observed by other processors)
Coherence only guarantees that writes to address X will eventually propagate to other processors
Consistency deals with when writes to X propagate to other processors, relative to reads and writes to other addresses
Coherence vs. Consistency (said again, perhaps more intuitively this time)
The goal of cache coherence is to ensure that the memory system in a parallel computer behaves as if the caches were not there
Just like how the memory system in a uni-processor system behaves as if the cache was not there
A system without caches would have no need for cache coherence
Memory consistency defines the allowed behavior of loads and stores to different addresses in a parallel system
The allowed behavior of memory should be specified whether or not caches are present (and that's what a memory consistency model does)
Memory Consistency
TL;DR:
Multiprocessors reorder memory operations in unintuitive and strange ways
This behavior is necessary for performance
Application programmers rarely see this behavior
Systems (OS and compiler) developers see it all the time
Memory operation ordering
A program defines a sequence of loads and stores (this is the “program order” of the loads and stores)
Four types of memory operation orderings
WX→RY: write to X must commit before subsequent read from Y
RX→RY: read from X must commit before subsequent read from Y
RX→WY: read from X must commit before subsequent write to Y
WX→WY: write to X must commit before subsequent write to Y
To clarify: “write must commit before subsequent read” means:
When a write comes before a read in program order, the write must commit (its results are visible) by the time the read occurs
Example: Multiprocessor Execution
Initially A = B = 0
Proc 0              Proc 1
(1) A = 1           (3) B = 1
(2) print B         (4) print A
What can be printed?
“01”?
“10”?
“11”?
“00”?
Orderings That Should Not Happen

The program should not print "00" or "10" (reading the two values in the order they are printed)
A "happens-before" graph shows the order in which events must execute to get a desired outcome
If there is a cycle in the graph, the outcome is impossible: an event would have to happen before itself!
Sequential consistency
What Should Programmers Expect
Sequential Consistency
Lamport 1979 (Turing Award 2013)
All operations executed in some sequential order
As if they were manipulating a single shared memory
Each thread’s operations happen in program order
A sequentially consistent memory system maintains all four memory operation orderings (WX →RY, RX→RY, RX→WY, WX→WY)
There is a chronology of all memory operations that is consistent with observed values

Sequential consistency (switch metaphor)
All processors issue loads and stores in program order
Memory chooses a processor at random, performs a memory operation to completion, then chooses another processor, …

Sequential Consistency Example





Relaxing memory operation ordering
A sequentially consistent memory system maintains all four memory operation orderings (WX →RY, RX→RY, RX→WY, WX→WY)
Relaxed memory consistency models allow certain orderings to be violated
Motivation for relaxed consistency: hiding latency
Why are we interested in relaxing ordering requirements?
To gain performance
Specifically, hiding memory latency: overlap memory access operations with other operations when they are independent
Remember, memory access in a cache-coherent system may entail much more work than simply reading data from memory (finding the data, sending invalidations, etc.)

Problem with SC

Optimization: Write Buffer

Initially A = B = 0
Proc 0 Proc 1
(1) A = 1 (3) B = 1
(2) r1 = B (4) r2 = A
Can r1 = r2 = 0?
SC: No
Write buffers: Yes
Write Buffer Performance


Base: Sequentially consistent execution. Processor issues one memory operation at a time, stalls until completion
W-R: relaxed W→R ordering constraint (write latency almost fully hidden)
Write Buffers: Who Cares?
Performance improvement
Every modern processor uses them
Intel x86, ARM, RISC-V
Need a weaker memory model
TSO: Total Store Order
Slightly harder to reason about than SC
x86 uses an incompletely specified form of TSO
Allowing reads to move ahead of writes
Four types of memory operation orderings
WX → RY: write must complete before subsequent read
RX → RY: read must complete before subsequent read
RX → WY: read must complete before subsequent write
WX → WY: write must complete before subsequent write
Allow processor to hide latency of writes
Total Store Ordering (TSO)
Processor Consistency (PC)

Total store ordering (TSO)
Processor P can read B before its write to A is seen by all processors (processor can move its own reads in front of its own writes)
Reads by other processors cannot return new value of A until the write to A is observed by all processors
Processor consistency (PC)
Any processor can read new value of A before the write is observed by all processors
In TSO and PC, only WX → RY order is relaxed. The WX → WY constraint still exists. Writes by the same thread are not reordered (they occur in program order)
Clarification (make sure you get this!)
The cache coherency problem exists because hardware implements the optimization of duplicating data in multiple processor caches. The copies of the data must be kept coherent
Relaxed memory consistency issues arise from the optimization of reordering memory operations. (Consistency is unrelated to whether or not caches exist in the system)
Allowing writes to be reordered
Four types of memory operation orderings
WX→RY: write complete before subsequent read
RX →RY: read must complete before subsequent read
RX →WY: read must complete before subsequent write
WX →WY: write must complete before subsequent write
Partial Store Ordering (PSO)
Execution may not match sequential consistency on program 1 (P2 may observe the change to flag before the change to A)
Thread 1 (on P1) Thread 2 (on P2)
A = 1; while (flag == 0);
flag = 1; print A;
**Why might it be useful to allow more aggressive memory operation reorderings?**
WX → WY: processor might reorder write operations in a write buffer (e.g., one is a cache miss while the other is a hit)
RX → WY, RX → RY: processor might reorder independent instructions in an instruction stream (out-of-order execution)
Keep in mind these are all valid optimizations if a program consists of a single instruction stream
Allowing all reorderings
Four types of memory operation orderings
WX→RY: write complete before subsequent read
RX →RY: read must complete before subsequent read
RX →WY: read must complete before subsequent write
WX →WY: write must complete before subsequent write
No guarantees about operations on data!
Everything can be reordered
Motivation is increased performance
Overlap multiple reads and writes in the memory system
Execute reads as early as possible and writes as late as possible to hide memory latency
Examples:
Weak ordering (WO)
Release Consistency (RC)
Synchronization
Synchronization to the Rescue
Memory reordering seems like a nightmare (it is!)
Every architecture provides synchronization primitives to make memory ordering stricter
Fence (memory barrier) instructions prevent reorderings, but are expensive
All memory operations before the fence must complete before any memory operation after it can begin
reorderable reads and writes here
...
MEMORY FENCE
...
reorderable reads and writes here
...
MEMORY FENCE
Other synchronization primitives (per address):
read-modify-write/compare-and-swap, transactional memory, …
Example: expressing synchronization in relaxed models
Intel x86/x64 ~ total store ordering
Provides sync instructions if software requires a specific instruction ordering not guaranteed by the consistency model
_mm_lfence ("load fence": wait for all loads to complete)
_mm_sfence ("store fence": wait for all stores to complete)
_mm_mfence ("memory fence": wait for all memory operations to complete)
ARM processors: very relaxed consistency model
A cool post on the role of memory fences in x86:
http://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
ARM has some great examples in their programmer’s reference:
http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf
A great list of academic papers:
http://www.cl.cam.ac.uk/~pes20/weakmemory/
Problem: Data Races
Every example so far has involved a data race
Two accesses to the same memory location
At least one is a write
Unordered by synchronization operations
Conflicting data accesses
Two memory accesses by different processors conflict if…
They access the same memory location
At least one is a write
Unsynchronized program
Conflicting accesses not ordered by synchronization (e.g., a fence, operation with release/acquire semantics, barrier, etc.)
Unsynchronized programs contain data races: the output of the program depends on relative speed of processors (non-deterministic program results)
Synchronized Programs
Synchronized programs yield SC results on non-SC systems
Synchronized programs are data-race-free
If there are no data races, reordering behavior doesn’t matter
Accesses are ordered by synchronization, and synchronization forces sequential consistency
In practice, most programs you encounter will be synchronized (via locks, barriers, etc. implemented in synchronization libraries)
Rather than via ad-hoc reads/writes to shared variables like in the example programs
Summary: Relaxed Consistency
Motivation: obtain higher performance by allowing reordering of memory operations (reordering is not allowed by sequential consistency)
One cost is software complexity: programmer or compiler must correctly insert synchronization to ensure certain specific operation orderings when needed
But in practice complexities encapsulated in libraries that provide intuitive primitives like lock/unlock, barrier (or lower-level primitives like fence)
Optimize for the common case: most memory accesses are not conflicting, so don’t design a system that pays the cost as if they are
Relaxed consistency models differ in which memory ordering constraints they ignore
Language Level Memory Models
Languages Need Memory Models Too



Provide a contract to programmers about how their memory operations will be reordered by the compiler, e.g., no reordering of shared memory operations
Modern (C11, C++11) and not-so-modern (Java 5) languages guarantee sequential consistency for data-race-free programs (“SC for DRF”)
Compilers will insert the necessary synchronization to cope with the hardware memory model
No guarantees if your program contains data races!
The intuition is that most programmers would consider a racy program to be buggy
Use a synchronization library!
Memory Consistency Models Summary
Define the allowed reorderings of memory operations by hardware and compilers
A contract between hardware or compiler and application software
Weak models required for good performance?
SC can perform well with many more resources
Details of memory model can be hidden in synchronization library
Requires data race free (DRF) programs
Lecture 16: Implementing Locks, Fine-Grained Synchronization, and (a short intro to) Lock-Free Programming
Deadlock Livelock Starvation
(Deadlock and livelock concern program correctness. Starvation is really an issue of fairness.)
Deadlock
Deadlock is a state where a system has outstanding operations to complete, but no operation can make progress
Deadlock can arise when each operation has acquired a shared resource that another operation needs
In a deadlock situation, there is no way for any thread to make progress unless some thread relinquishes a resource
Required conditions for deadlock
Mutual exclusion: only one processor can hold a given resource at once
Hold and wait: processor must hold the resource while waiting for other resources it needs to complete an operation
No preemption: processors don’t give up resources until operation they wish to perform is complete
Circular wait: waiting processors have mutual dependencies (a cycle exists in the resource dependency graph)
Livelock
Livelock is a state where a system is executing many operations, but no thread is making meaningful progress
Starvation
State where a system is making overall progress, but some processes make no progress
Starvation is usually not a permanent state
Lock implementations
Test-and-set based lock
Atomic test-and-set instruction:
ts R0, mem[addr] // load mem[addr] into R0
// if mem[addr] is 0, set mem[addr] to 1
lock: ts R0, mem[addr] // load word into R0
bnz R0, lock // if R0 is nonzero the lock was held: retry (if 0, lock obtained)
unlock: st mem[addr], #0 // store 0 to address
x86 cmpxchg
Compare and exchange (atomic when used with lock prefix)


Test-and-set lock: consider coherence traffic

What is the duration of time the thread running on P1 holds the lock?
The entire time shown
At what points in time does P1's cache contain a valid copy of the cache line containing the lock variable?
Only for an instant at the time of lock acquisition, and for an instant at the time of release
Test-and-set lock performance
Benchmark: execute a total of N lock/unlock sequences (in aggregate) by P processors
Critical section time removed so graph plots only time acquiring/releasing the lock

Desirable lock performance characteristics
Low latency
If lock is free and no other processors are trying to acquire it, a processor should be able to acquire the lock quickly
Low interconnect traffic
If all processors are trying to acquire the lock at once, they should acquire the lock in succession with as little traffic as possible
Scalability
Latency / traffic should scale reasonably with number of processors
Low storage cost
Fairness
Avoid starvation or substantial unfairness
One ideal: processors should acquire lock in the order they request access to it
Simple test-and-set lock: low latency (under low contention), high traffic, poor scaling, low storage cost (one int), no provisions for fairness
Test-and-test-and-set lock
1 | void Lock(int* lock) { |
Test-and-test-and-set lock: coherence traffic

Test-and-test-and-set characteristics
Slightly higher latency than test-and-set in no contention case
Must test… then test-and-set
Generates much less interconnect traffic
One invalidation, per waiting processor, per lock release (O(P) invalidations)
This is O(P²) interconnect traffic if all processors have the lock cached
Recall: test-and-set lock generated one invalidation per waiting processor per test
More scalable (due to less traffic)
Storage cost unchanged (one int)
Still no provisions for fairness
Another impl: ticket lock
Main problem with test-and-set style locks: upon release, all waiting processors attempt to acquire lock using test-and-set
1 | struct lock { |
No atomic operation needed to acquire the lock (only a read)
Result: only one invalidation per lock release (O(P) interconnect traffic)
Atomic operations (provided by CUDA)
1 | int atomicAdd(int* address, int val); |
(omitting additional 64-bit and unsigned int versions)
Implementing atomic fetch-and-op
Example: atomic_min()
1 | // atomicCAS: (“compare and swap”) |
1 | void atomic_min(int* addr, int x) { |
Exercise: how can you build an atomic fetch+op out of atomicCAS()?
1 | // for signed values of x |
Another exercise: build a lock
1 | typedef int lock; |
Load-linked, store conditional (LL/SC)
Pair of corresponding instructions (not a single atomic instruction like compare-and swap)
load_linked(x): load value from address
store_conditional(x, value): store value to x, if x hasn’t been written to by any processor since the corresponding load linked operation
Corresponding ARM instructions: LDREX and STREX
How might LL/SC be implemented on a cache coherent processor?
LL/SC is typically implemented with a per-core link register: load_linked records the address and sets a flag; leveraging the invalidation mechanism of the coherence protocol (e.g., MESI), the flag is automatically cleared when another core writes to the address (observed by snooping); store_conditional then only needs to check the flag to determine whether the data was touched in between
C++ 11 atomic<T>
Provides atomic read, write, read-modify-write of entire objects
Atomicity may be implemented by mutex or efficiently by processor-supported atomic instructions (if T is a basic type)
Provides memory ordering semantics for operations before and after atomic operations
By default: sequential consistency
See std::memory_order for more detail
1 | atomic<int> i; |
Using locks: Fine-grained locking examples
Example: a sorted linked list
What can go wrong if multiple threads operate on the linked list simultaneously?
1 | struct Node { |
Example: simultaneous insertion
Thread 1 attempts to insert 6
Thread 2 attempts to insert 7

Thread 1 and thread 2 both compute same prev and cur
Result: one of the insertions gets lost!
Result: (assuming thread 1 updates prev->next before thread 2)

Example: simultaneous insertion/deletion
Thread 1 attempts to insert 6
Thread 2 attempts to delete 10

Possible result: (thread 2 finishes delete first)

Solution 1: protect the list with a single lock
1 | struct Node { |
Single global lock per data structure
Good:
It is relatively simple to implement correct mutual exclusion for data structure operations (we just did it!)
Bad:
Operations on the data structure are serialized
May limit parallel application performance
Solution 2: "hand-over-hand" (fine-grained) locking
1 | struct Node { |
Challenge to students: there is a way to further improve the implementation of insert(). What is it?
Because holding prev's lock is sufficient to prevent any operation that could invalidate cur (such as deleting it), insert() only needs to hold prev's lock while traversing and performing the insertion; it does not need to lock cur
Fine-grained locking
Goal: enable parallelism in data structure operations
Reduces contention for global data structure lock
In the linked-list example: a single monolithic lock is overly conservative (operations on different parts of the linked list can proceed in parallel)
Challenge: tricky to ensure correctness
Determining when mutual exclusion is required
Deadlock?
Livelock?
Costs?
Overhead of taking a lock each traversal step (extra instructions + traversal now involves memory writes)
Extra storage cost (a lock per node)
What is a middle-ground solution that trades off some parallelism for reduced overhead? (hint: similar issue to selection of task granularity)
Use one lock per group of several nodes (coarser granularity than one lock per node)
Practice exercise (on your own time)
Implement a fine-grained locking implementation of a binary search tree supporting insert and delete
1 | struct Tree { |
Using locks: Lock-free data structure designs
Blocking algorithms/data structures
A blocking algorithm allows one thread to prevent other threads from completing operations on a shared data structure indefinitely
Example:
Thread 0 takes a lock on a node in our linked list
Thread 0 is swapped out by the OS, or crashes, or is just really slow (takes a page fault), etc
Now, no other threads can complete operations on the data structure (although thread 0 is not actively making progress modifying it)
An algorithm that uses locks is blocking regardless of whether the lock implementation uses spinning or pre-emption
Lock-free algorithms
Non-blocking algorithms are lock-free if some thread is guaranteed to make progress (“systemwide progress”)
In the lock-free case, it is not possible for a thread to be preempted at an inopportune time and prevent progress by the rest of the system
Note: this definition does not preclude starvation of any one thread
Single reader, single writer bounded queue
Assume a sequentially consistent memory system for now (or the presence of appropriate memory fences, or C++ 11 atomic<>)
Only two threads (one producer, one consumer) accessing queue at the same time
Threads never synchronize or wait on each other
When the queue is empty, pop fails; when it is full, push fails
1 | struct Queue { |
Single reader, single writer unbounded queue
Assume a sequentially consistent memory system for now (or the presence of appropriate memory fences, or C++ 11 atomic<>)
Tail points to last element added (if non-empty)
Head points to element BEFORE head of queue
Node allocation and deletion performed by the same thread (producer thread)
1 | struct Node { |
Lock-free stack (first try)
Assume a sequentially consistent memory system for now (or the presence of appropriate memory fences, or C++ 11 atomic<>)
Main idea: as long as no other thread has modified the stack, a thread’s modification can proceed
Note difference from fine-grained locking: In fine-grained locking, the implementation locked a part of a data structure. Here, threads do not hold lock on data structure at all
1 | struct Node { |
The ABA problem
Careful: A, B, C, and D are addresses of nodes, not the values stored by the nodes!

(Not to be confused with the ABBA problem: a deadlock pattern in which two threads try to acquire two locks in opposite order)
Lock-free stack using a counter to solve the ABA problem
Maintain counter of pop operations
Requires machine to support “double compare and swap” (DCAS) or doubleword CAS
Could also solve ABA problem with careful node allocation and/or element reuse policies
1 | struct Node { |
Compare and swap on x86
x86 supports a “double-wide” compare-and-swap instruction
Not quite the "double compare-and-swap" used on the previous slide
But could simply ensure the stack’s count and top fields are contiguous in memory to use the 64-bit wide single compare-and-swap instruction below
cmpxchg8b
“compare and exchange eight bytes”
Can be used for compare-and-swap of two 32-bit values
cmpxchg16b
“compare and exchange 16 bytes”
Can be used for compare-and-swap of two 64-bit values
Another problem: referencing freed memory
1 | struct Node { |
[Advanced topic] Hazard pointers: avoid freeing a node until it is known that no other thread holds a reference to it
1 | struct Node { |
Lock-free linked list insertion
For simplicity, this slide assumes the only operation on the list is insert. Delete is more complex
Compared to fine-grained locking implementation:
No overhead of taking locks
No per-node storage overhead
1 | struct Node { |
Lock-free linked list deletion
Supporting lock-free deletion significantly complicates the data structure
Consider case where B is deleted simultaneously with insertion of E after B
B now points to E, but B is not in the list!
For the curious:
Harris 2001. “A Pragmatic Implementation of Non-blocking Linked-Lists”
Fomitchev 2004. “Lock-free linked lists and skip lists”

Lock-free vs. locks performance comparison
Lock-free algorithm run time normalized to run time of using pthread mutex locks
In practice: why lock free data structures?
When optimizing parallel programs in this class you often assume that only your program is using the machine
Because you care about performance
Typical assumption in scientific computing, graphics, machine learning, data analytics, etc.
In these cases, well-written code with locks can be as fast (or faster) than lock-free code
And is often much simpler to implement
But there are situations where code with locks can suffer from tricky performance problems
Situations where a program features many threads (e.g., database, webserver) and page faults, pre-emption, etc. can occur while a thread is in a critical section
Locks create problems like priority inversion, convoying, crashing in a critical section, etc. that are often discussed in OS classes
Summary
Use fine-grained locking to reduce contention (maximize parallelism) in operations on shared data structures
But fine-granularity can increase code complexity (errors) and increase execution overhead
Lock-free data structures: a non-blocking solution that avoids some overheads/pitfalls of locks
But can be tricky to implement (and ensuring correctness in a lock-free setting has its own overheads)
Still requires appropriate memory fences on modern relaxed consistency hardware
Note: a lock-free design does not eliminate contention
Compare-and-swap can fail under heavy contention, requiring spins
Preview: transactional memory
Q. What was the role of the compare and swap in our lock-free implementations?
A. Determining if another thread had modified the data structure while the calling thread was in the middle of an operation
Next time… transactional memory
A more general mechanism to allow a system to speculate that an operation will be successfully completed before another thread attempts to modify the structure
With mechanisms to “abort” an operation in the event another thread does
More reading on lock-free structures
Michael and Scott 1996. Simple, Fast and Practical Non-Blocking and Blocking Concurrent Queue Algorithms
Multiple reader/writer lock-free queue
Harris 2001. A Pragmatic Implementation of Non-Blocking Linked-Lists
Michael Sullivan’s Relaxed Memory Calculus (RMC) compiler
msullivan/rmc-compiler: Implementation of RMC atomics for C/C++
Many good blog posts and articles on the web:
http://www.drdobbs.com/cpp/lock-free-code-a-false-sense-of-security/210600279
developers.memsql.com
Lecture 17: Transactional Memory
Transactional Memory (TM) Semantics
Raising level of abstraction for synchronization
Previous topic: machine-level atomic operations
Test-and-set, fetch-and-op, compare-and-swap, load linked-store conditional
Then we used these atomic operations to construct higher level synchronization primitives in software:
Locks, barrier
Lock-free data structures
We’ve seen how it can be challenging to produce correct programs using these primitives (easy to create bugs that violate atomicity, create deadlock, etc.)
Today: raising level of abstraction for synchronization even further
Idea: transactional memory
What you should know
What a transaction is
The difference (in semantics) between an atomic code block and lock/unlock primitives
The basic design space of transactional memory implementations
Data versioning policy
Conflict detection policy
Granularity of detection
The basics of a software implementation of transactional memory
The basics of a hardware implementation of transactional memory (consider how it relates to the cache coherence protocol implementations we’ve discussed previously in the course)
Between a Lock and a Hard Place
Locks force trade-off between
Degree of concurrency ⇒ performance
Chance of races, deadlock ⇒ correctness
Coarse grain locking
low concurrency, higher chance of correctness
E.g. single lock for the whole data structure or all shared memory
Fine grain locking
high concurrency, lower chance of correctness
E.g. hand-over-hand locking
Is there a better synchronization abstraction?
Review: ensuring atomicity via locks
1 | void deposit(Acct account, int amount) |
Deposit is a read-modify-write operation: want “deposit” to be atomic with respect to other bank operations on this account
Locks are one mechanism to synchronize threads to ensure atomicity of update (via ensuring mutual exclusion on the account)
Programming with transactions
1 | void deposit(Acct account, int amount) |
Atomic construct is declarative
Programmer states what to do (maintain atomicity of this code), not how to do it
No explicit use or management of locks
System implements synchronization as necessary to ensure atomicity
System could implement atomic { } using locks
Implementation discussed today uses optimistic concurrency: maintain serialization only in situations of true contention (R-W or W-W conflicts)
Declarative vs. Imperative Abstractions
Declarative: programmer defines what should be done
Execute all these independent 1000 tasks
Perform this set of operations atomically
Imperative: programmer states how it should be done
Spawn N worker threads. Assign work to threads by removing work from a shared task queue
Acquire a lock, perform operations, release the lock
Transactional Memory (TM) Semantics
Memory transaction
An atomic and isolated sequence of memory accesses
Inspired by database transactions
Atomicity (all or nothing)
Upon transaction commit, all memory writes in transaction take effect at once
On transaction abort, none of the writes appear to take effect (as if transaction never happened)
Isolation
No other processor can observe writes before transaction commits
Serializability
Transactions appear to commit in a single serial order
But the exact order of commits is not guaranteed by semantics of transaction
Transactional Memory (TM)
In other words… many of the properties we maintained for a single address in a coherent memory system, we’d like to maintain for sets of reads and writes in a transaction
Transaction:
Reads: X, Y, Z
Writes: A, X
The transaction's memory operations will either all be observed by other processors, or none of them will (they effectively all happen at the same time)
What is the consistency model for TM?
It gives you essentially sequential consistency; some people have called this transactional sequential consistency
Motivating transactional memory
Java HashMap
Map: Key → Value
Implemented as a hash table with linked list per bucket
1 | public Object get(Object key) { |
Bad: not thread safe (when synchronization needed)
Good: no lock overhead when synchronization not needed
Synchronized HashMap
Java 1.4 solution: synchronized layer
Convert any map to thread-safe variant
Uses explicit, coarse-grained mutual locking specified by programmer
1 | public Object get(Object key) { |
Coarse-grain synchronized HashMap
Good: thread-safe, easy to program
Bad: limits concurrency, poor scalability
One solution: use finer-grained synchronization (e.g., lock per bucket)
Now thread safe: but incurs lock overhead even if synchronization not needed
performance of fine-grained locking
Reducing contention via fine-grained locking leads to better performance


Transactional HashMap
Simply enclose all operation in atomic block
Semantics of atomic block: system ensures atomicity of logic within block
1 | public Object get(Object key) { |
Good: thread-safe, easy to program
What about performance and scalability?
Depends on the workload and implementation of atomic (to be discussed)
Another example: tree update by two threads
Goal: modify nodes 3 and 4 in a thread-safe way

Fine-grained locking example

Locking can prevent concurrency (here: locks on node 1 and 2 during update to node 3 could delay update to 4)
Transactions example

Transactions example #2
(Both transactions modify node 3)

Performance: locks vs. transactions
HashMap Balanced Tree


“TCC” is a TM system implemented in hardware
Atomic and Doubly-Linked List
Make PushLeft method on a doubly-linked list thread safe using atomic()

1 | void PushLeft(DQueue *q, int val) { |
Another motivation: failure atomicity
Failure atomicity: locks
1 | void transfer(A, B, amount) { |
Complexity of manually catching exceptions
Programmer provides "undo" code on a case-by-case basis
Complexity: must track what to undo and how…
Some side-effects may become visible to other threads
E.g., an uncaught case can deadlock the system…
Failure atomicity: transactions
1 | void transfer(A, B, amount) |
System now responsible for processing exceptions
All exceptions (except those explicitly managed by the programmer)
Transaction is aborted and memory updates are undone
Recall: a transaction either commits or it doesn’t: no partial updates are visible to other threads
E.g., no locks are left held by a failing thread…
Another motivation: composability
Composability: locks
1 | void transfer(A, B, amount) |
DEADLOCK!
Thread 0: transfer(A, B, 100)
Thread 1: transfer(B, A, 200)
Composing lock-based code can be tricky
Requires system-wide policies to get correct
System-wide policies can break software modularity
Programmer caught between a lock and a hard place!
Coarse-grain locks: low performance
Fine-grain locking: good for performance, but mistakes can lead to deadlock
Composability: transactions
```java
void transfer(A, B, amount) {
    atomic {
        withdraw(A, amount);
        deposit(B, amount);
    }
}
```
Transactions compose gracefully (in theory)
Programmer declares global intent (atomic execution of transfer)
No need to know about global implementation strategy
Transaction in transfer subsumes any defined in withdraw and deposit
Outermost transaction defines atomicity boundary
System manages concurrency as well as possible
Serialization for transfer(A, B, 100) and transfer(B, A, 200)
Concurrency for transfer(A, B, 100) and transfer(C, D, 200)
Advantages (promise) of transactional memory
Easy to use synchronization construct
It is difficult for programmers to get synchronization right
Programmer declares need for atomicity, system implements it well
Claim: transactions are as easy to use as coarse-grain locks
Often performs as well as fine-grained locks
Provides automatic read-read concurrency and fine-grained concurrency
Performance portability: locking scheme for four CPUs may not be the best scheme for 64 CPUs
Productivity argument for transactional memory: system support for transactions can achieve 90% of the benefit of expert programming with fine-grained locks, with 10% of the development time
Failure atomicity and recovery
No lost locks when a thread fails
Failure recovery = transaction abort + restart
Composability
Safe and scalable composition of software modules
Self-check: atomic { } ≠ lock() + unlock()
The difference (Make sure you understand this difference in semantics!)
Atomic: high-level declaration of atomicity
Does not specify implementation of atomicity
Lock: low-level blocking primitive
Does not provide atomicity or isolation on its own
Keep in mind
Locks can be used to implement an atomic block, but…
Locks can be used for purposes beyond atomicity
Cannot replace all uses of locks with atomic regions
Atomic eliminates many data races, but programming with atomic blocks can still suffer from atomicity violations: e.g., the programmer erroneously splits a sequence that should be atomic into two atomic blocks
What about replacing synchronized with atomic in this example?
```java
// Thread 1
synchronized (lock1) {
    ...
    flagA = true;
    while (flagB == 0);
    ...
}

// Thread 2
synchronized (lock2) {
    ...
    flagB = true;
    while (flagA == 0);
    ...
}
```
Thread 1's and Thread 2's read and write sets conflict, so the two transactions serialize; and because the while-loop condition can never become true inside an atomic region, the thread that runs first loops forever
Atomicity violation due to programmer error
```java
// Thread 1
atomic { ... ptr = A; ... }       // set up the pointer
atomic { B = ptr->field; }        // ptr may be NULL by now!

// Thread 2
atomic { ptr = NULL; }
```
Programmer mistake: logically atomic code sequence (in thread 1) is erroneously separated into two atomic blocks (allowing another thread to set pointer to NULL in between)
Implementing transactional memory
Recall transactional semantics
Atomicity (all or nothing)
At commit, all memory writes take effect at once
In event of abort, none of the writes appear to take effect
Isolation
No other code can observe writes before commit
Serializability
Transactions seem to commit in a single serial order
The exact order is not guaranteed though
TM implementation basics
TM systems must provide atomicity and isolation
While maintaining as much concurrency as possible
Two key implementation questions
Data versioning policy: How does the system manage uncommitted (new) and previously committed (old) versions of data for concurrent transactions?
Conflict detection policy: how/when does the system determine that two concurrent transactions conflict?
Data versioning policy
Manage uncommitted (new) and previously committed (old) versions of data for concurrent transactions
1.Eager versioning (undo-log based)
2.Lazy versioning (write-buffer based)
Eager versioning
Update memory immediately, maintain an “undo log” in case of abort

Lazy versioning
Log memory updates in transaction write buffer, flush buffer on commit

Data versioning
Goal: manage uncommitted (new) and committed (old) versions of data for concurrent transactions
Eager versioning (undo-log based)
Update memory location directly on write
Maintain undo information in a log (incurs per-store overhead)
Good: faster commit (data is already in memory)
Bad: slower aborts, fault tolerance issues (consider crash in middle of transaction)
Eager versioning philosophy: write to memory immediately, hoping transaction won’t abort (but deal with aborts when you have to)
Lazy versioning (write-buffer based)
Buffer data in a write buffer until commit
Update actual memory location on commit
Good: faster abort (just clear the write buffer), no fault tolerance issues
Bad: slower commits
Lazy versioning philosophy: only write to memory when you have to
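A hedged sketch of the two policies applied to a single transactional store; the data structures and names are illustrative, not any real TM implementation:

```cpp
#include <unordered_map>
#include <vector>

struct UndoEntry { int* addr; int oldValue; };

// Eager versioning: write memory in place, keep an undo log for aborts.
struct EagerTx {
    std::vector<UndoEntry> undoLog;
    void store(int* addr, int value) {
        undoLog.push_back({addr, *addr});   // per-store logging overhead
        *addr = value;                      // update memory immediately
    }
    void commit() { undoLog.clear(); }      // fast: data is already in place
    void abort() {                          // slow: roll back in reverse order
        for (auto it = undoLog.rbegin(); it != undoLog.rend(); ++it)
            *it->addr = it->oldValue;
        undoLog.clear();
    }
};

// Lazy versioning: buffer writes, flush them to memory only at commit.
struct LazyTx {
    std::unordered_map<int*, int> writeBuffer;
    void store(int* addr, int value) { writeBuffer[addr] = value; }
    int load(int* addr) {                   // reads must check the buffer first
        auto it = writeBuffer.find(addr);
        return it != writeBuffer.end() ? it->second : *addr;
    }
    void commit() {                         // slow: flush buffer to memory
        for (auto& kv : writeBuffer) *kv.first = kv.second;
        writeBuffer.clear();
    }
    void abort() { writeBuffer.clear(); }   // fast: just discard the buffer
};
```

The asymmetry shows up directly in the code: the eager version commits by doing nothing and aborts by walking the log, while the lazy version aborts by doing nothing and commits by flushing the buffer.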
Conflict detection
Must detect and handle conflicts between transactions
Read-write conflict: transaction A reads address X, which was written to by a pending (but not yet committed) transaction B
Write-write conflict: transactions A and B are both pending, and both write to address X
System must track a transaction’s read set and write set
Read-set: addresses read during the transaction
Write-set: addresses written during the transaction
Pessimistic detection
Check for conflicts (immediately) during loads or stores
Philosophy: “I suspect conflicts might happen, so let’s always check to see if one has occurred after each memory operation… if I’m going to have to roll back, might as well do it now to avoid wasted work.”
“Contention manager” decides to stall or abort transaction when a conflict is detected
Various policies to handle common case fast
Pessimistic detection examples
Note: diagrams assume an “aggressive” contention manager on writes (writer wins, so other transactions abort)

Optimistic detection
Detect conflicts when a transaction attempts to commit
Intuition: “Let’s hope for the best and sort out all the conflicts only when the transaction tries to commit”
On a conflict, give priority to committing transaction
Other transactions may abort later on
Optimistic detection examples

Conflict detection trade-offs
Pessimistic conflict detection (a.k.a. “eager”)
Good: detect conflicts early (undo less work, turn some aborts to stalls)
Bad: no forward progress guarantees, more aborts in some cases
Bad: fine-grained communication (check on each load/store)
Bad: detection on critical path
Optimistic conflict detection (a.k.a.“lazy” or “commit”)
Good: forward progress guarantees
Good: bulk communication and conflict detection
Bad: detects conflicts late, can still have fairness problems
Lecture 18: Transactional Memory Part II + Course Wrap Up
TM implementation space (examples)
Software TM systems
Lazy + optimistic (rd/wr): Sun TL2
Lazy + optimistic (rd)/pessimistic (wr): MS OSTM
Eager + optimistic (rd)/pessimistic (wr): Intel STM
Eager + pessimistic (rd/wr): Intel STM
Hardware TM systems
Lazy + optimistic: Stanford TCC
Lazy + pessimistic: MIT LTM, Intel VTM
Eager + pessimistic: Wisconsin LogTM (easiest with conventional cache coherence)
Optimal design remains an open question
May be different for HW, SW, and hybrid
Software TM systems
Software Transactional Memory
```c
atomic {
    ...   // each shared read/write is compiled into an STM barrier call
}
```
Software barriers (STM function calls) for TM bookkeeping
Versioning, read/write-set tracking, commit, …
Using locks, timestamps, data copying, …
Requires function cloning or dynamic translation
A function used both inside and outside of transactions needs both versions
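A sketch of what the compiler-inserted barriers might look like for a tiny atomic region, assuming an eager-versioning STM; the tm* functions and bookkeeping structures are invented for illustration, not a real STM API:

```cpp
#include <utility>
#include <vector>

// Invented per-thread bookkeeping.
thread_local std::vector<int*> readSet;
thread_local std::vector<std::pair<int*, int>> undoLog;

void tmTxnBegin() { readSet.clear(); undoLog.clear(); }
int  tmRead(int* addr) { readSet.push_back(addr); return *addr; }               // read barrier
void tmWrite(int* addr, int v) { undoLog.push_back({addr, *addr}); *addr = v; } // write barrier
void tmTxnCommit() { /* validate read-set versions, release write locks */ }

int a, b;

void foo() {
    // Source:   atomic { b = a + 5; }
    // Compiled: every shared access becomes a software barrier call.
    tmTxnBegin();
    tmWrite(&b, tmRead(&a) + 5);
    tmTxnCommit();
}
```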
STM Runtime Data Structures
Transaction descriptor (per-thread)
Used for conflict detection, commit, abort, …
Includes the read set, write set, undo log or write buffer
Transaction record (per data)
Pointer-sized record guarding shared data
Tracks transactional state of data
Shared: accessed by multiple readers
Using version number or shared reader lock
Exclusive: accessed by one writer
Using writer lock that points to owner
Aside: this is the same way that HW cache coherence works
Mapping Data to Transaction Records
Every data item has an associated transaction record


Conflict Detection Granularity
Object granularity
Low overhead mapping operation
Exposes optimization opportunities
False conflicts (e.g. Txn 1 and Txn 2)
Element/field granularity (word)
Reduces false conflicts
Improves concurrency (e.g. Txn 1 and Txn 2)
Increased overhead (time/space)
Cache line granularity (multiple words)
Matches hardware TM
Reduces storage overhead of transactional records
Hard for programmer & compiler to analyze
Mix & match on a per-type basis
E.g., element-level for arrays, object-level for non-arrays
```java
// Txn 1
atomic { obj.fieldA = ...; }   // object granularity: falsely conflicts with Txn 2

// Txn 2
atomic { obj.fieldB = ...; }   // field granularity: runs concurrently with Txn 1
```
An Example STM Algorithm
Based on Intel’s McRT STM [PPoPP’06, PLDI’06, CGO’07]
Eager versioning, optimistic reads, pessimistic writes
Based on timestamp for version tracking
Global timestamp
Incremented when a writing xaction commits
Local timestamp per xaction
Global timestamp value when xaction last validated
Transaction record (32-bit)
LS bit: 0 if writer-locked, 1 if not locked
MS bits
Timestamp (version number) of last commit if not locked
Pointer to owner xaction if locked
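A sketch of one plausible encoding of this record; the exact layout is an assumption consistent with the description above (a word-aligned descriptor pointer naturally has a zero LS bit, while a version value keeps it set):

```cpp
#include <cstdint>

// LS bit = 1: not locked, the word holds an odd version timestamp.
// LS bit = 0: writer-locked, the word holds an aligned owner pointer.
using TxRecord = std::uintptr_t;

bool isLocked(TxRecord r) { return (r & 1u) == 0; }

std::uintptr_t versionOf(TxRecord r) { return r; }   // meaningful only if unlocked

TxRecord lockRecord(void* ownerDescriptor) {         // aligned => LS bit is 0
    return reinterpret_cast<TxRecord>(ownerDescriptor);
}
```

Incrementing the global timestamp by 2 per commit (as in the commit step below), starting from 1, keeps version values odd, so a version can never be mistaken for an owner pointer.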
STM Operations
STM read (optimistic)
Direct read of memory location (eager)
Validate read data
Check if unlocked and data version ≤ local timestamp
If not, validate all data in read set for consistency
Insert in read set
Return value
STM write (pessimistic)
Validate data
Check if unlocked and data version ≤ local timestamp
Acquire lock
Insert in write set
Create undo log entry
Write data in place (eager)
STM Operations (cont)
Read-set validation
Get global timestamp
For each item in the read set
If locked by other or data version > local timestamp, abort
Set local timestamp to global timestamp from initial step
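A sketch of this validation loop, reusing the TxRecord encoding sketched earlier; the structures and field names are invented:

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

using TxRecord = std::uintptr_t;
bool isLocked(TxRecord r);               // from the earlier encoding sketch
std::uintptr_t versionOf(TxRecord r);

extern std::atomic<std::uintptr_t> globalTimestamp;

struct TxDesc {
    std::uintptr_t localTimestamp;       // global clock at last validation
    std::vector<TxRecord*> readSet;      // records guarding everything we read
};

// Returns false if the transaction must abort.
// (Simplified: a real STM would tolerate locks held by this transaction itself.)
bool validateReadSet(TxDesc& tx) {
    std::uintptr_t now = globalTimestamp.load();
    for (TxRecord* rec : tx.readSet) {
        TxRecord r = *rec;
        if (isLocked(r) || versionOf(r) > tx.localTimestamp)
            return false;                // written by someone since we read it
    }
    tx.localTimestamp = now;             // reads are consistent as of 'now'
    return true;
}
```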
STM commit
Atomically increment global timestamp by 2 (LSb used for write-lock)
If preincremented (old) global timestamp > local timestamp, validate read-set
Check for recently committed transactions
For each item in the write set
Release the lock and set version number to global timestamp
STM Example
```c
// X1
atomic {
    t = foo.x;
    bar.x = t;
    t = foo.y;
    bar.y = t;
}

// X2
atomic {
    t1 = bar.x;
    t2 = bar.y;
}
```
X1 copies object foo into object bar
X2 should read bar as [0,0] or [9,7]

TM Implementation Summary 1
TM implementation
Data versioning: eager or lazy
Conflict detection: optimistic or pessimistic
Granularity: object, word, cache-line, …
Software TM systems
Compiler adds code for versioning & conflict detection
Note: STM barrier = instrumentation code
Basic data-structures
Transactional descriptor per thread (status, rd/wr set, …)
Transactional record per data (locked/version)
Challenges for STM Systems
Overhead of software barriers
Function cloning
Robust contention management
Memory model (strong vs. weak atomicity)
Optimizing Software Transactions
Monolithic barriers hide redundant logging & locking from the compiler
```c
atomic { ... }   // each access becomes one opaque barrier call
```
Decomposed barriers expose redundancies
```c
atomic { ... }   // same region, barriers split into open/read/write steps
```
Allows compiler to optimize STM code
Produces fewer & cheaper STM operations
```c
atomic { ... }   // after optimization: redundant opens removed (see sketch below)
```
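A sketch of the contrast, with invented function names; the point is only where the bookkeeping boundary sits:

```cpp
int a, b, c;

// Monolithic barrier API (assumed): one opaque call per access.
int  stmRead(int* addr);
void stmWrite(int* addr, int value);

void monolithicVersion() {
    // atomic { b = a + 5; c = a + 10; }
    // The duplicate logging/locking for 'a' is hidden inside stmRead,
    // so the compiler cannot eliminate it.
    stmWrite(&b, stmRead(&a) + 5);
    stmWrite(&c, stmRead(&a) + 10);
}

// Decomposed barrier API (assumed): bookkeeping separated from raw access.
void txnOpenForRead(int* addr);
void txnOpenForWrite(int* addr);
int  txnReadRaw(int* addr);
void txnWriteRaw(int* addr, int value);

void decomposedVersion() {
    txnOpenForRead(&a);          // logged once; the redundant second open
    int t = txnReadRaw(&a);      // was removed by ordinary compiler CSE
    txnOpenForWrite(&b);
    txnWriteRaw(&b, t + 5);
    txnOpenForWrite(&c);
    txnWriteRaw(&c, t + 10);
}
```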
Effect of Compiler Optimizations
Single-thread overhead relative to a thread-unsafe baseline

With compiler optimizations
<40% over no concurrency control
<30% over lock-based synchronization
STM Question
Given an optimistic read, pessimistic write, eager versioning STM
What steps are required to implement the atomic region below?
```c
atomic { ... }
```
```c
tx = GetTxDescriptor();   // assume a way to get the transaction descriptor
...                       // (full sequence sketched below)
```
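A hedged sketch of the full sequence, assuming the region is `atomic { b = a + 5; }` (the exact region and helper names from the slide are not preserved in these notes; every function below is illustrative):

```cpp
// Illustrative STM helpers, not a real API.
struct Tx;
Tx*  GetTxDescriptor();
void ValidateRecord(Tx* tx, int* addr);  // unlocked && version <= local timestamp?
void AddToReadSet(Tx* tx, int* addr);
void AcquireWriteLock(Tx* tx, int* addr);
void AddToWriteSet(Tx* tx, int* addr);
void LogUndo(Tx* tx, int* addr, int oldValue);
void TxCommit(Tx* tx);

int a, b;

void atomicRegion() {                  // atomic { b = a + 5; }
    Tx* tx = GetTxDescriptor();

    // Optimistic read of 'a': read memory in place, log for later validation.
    ValidateRecord(tx, &a);
    AddToReadSet(tx, &a);
    int t = a;                         // direct (eager) read

    // Pessimistic write of 'b': lock its transaction record before storing.
    AcquireWriteLock(tx, &b);
    AddToWriteSet(tx, &b);
    LogUndo(tx, &b, b);                // eager versioning: save the old value
    b = t + 5;                         // write memory in place

    // Commit: increment the global timestamp by 2; if other writers committed
    // since our last validation, validate the read set; release write locks,
    // stamping each record with the new version.
    TxCommit(tx);
}
```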
Hardware TM systems
Motivation for Hardware Support

STM slowdown: 2-8x per thread overhead due to barriers
Short-term issue: demotivates parallel programming
Long-term issue: wastes energy
Lack of strong atomicity
Costly to provide purely in software
Why is STM Slow?
Measured single-thread STM performance

1.8x – 5.6x slowdown over sequential
Most time goes in read barriers & commit
Most apps read more data than they write
Types of Hardware Support
Hardware-accelerated STM systems (HASTM, SigTM, USTM, …)
Start with an STM system & identify key bottlenecks
Provide (simple) HW primitives for acceleration, but keep SW barriers
Hardware-based TM systems (TCC, LTM, VTM, LogTM, …)
Versioning & conflict detection directly in HW
No SW barriers
Hybrid TM systems (Sun Rock, …)
Combine an HTM with an STM by switching modes when needed
Based on transaction characteristics, available resources, …
|                    | HTM | STM | HW-STM |
|--------------------|-----|-----|--------|
| Write versioning   | HW  | SW  | SW     |
| Conflict detection | HW  | SW  | HW     |
Hardware transactional memory (HTM)
Data versioning is implemented in caches
Caches hold the write buffer or the undo log
Add new cache line metadata to track transaction read set and write set
Conflict detection through cache coherence protocol
Coherence lookups detect conflicts between transactions
Works with snooping and directory coherence
Note:
Register checkpoint must also be taken at transaction begin (to restore execution context state on abort)
HTM design
Cache lines are annotated to track the read set and write set
R bit: indicates data read by transaction (set on loads)
W bit: indicates data written by transaction (set on stores)
R/W bits can be at word or cache-line granularity
R/W bits are gang-cleared on transaction commit or abort

For eager versioning, need a 2nd cache write for undo log
Coherence requests check R/W bits to detect conflicts
Observing shared request to W-word is a read-write conflict
Observing exclusive (intent to write) request to R-word is a write-read conflict
Observing exclusive (intent to write) request to W-word is a write-write conflict
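The three rules above amount to a small predicate; a sketch with invented names:

```cpp
struct LineMeta {
    bool R;   // line was read by the local transaction
    bool W;   // line was written by the local transaction
};

enum class CoherenceReq { Shared, Exclusive };   // read vs. intent-to-write

// True if an incoming request conflicts with the local pending transaction.
bool conflictsWithLocalTx(CoherenceReq req, const LineMeta& line) {
    if (req == CoherenceReq::Shared)
        return line.W;               // remote read of data we wrote
    return line.R || line.W;         // remote write of data we read or wrote
}
```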
Example HTM implementation: lazy-optimistic

CPU changes
Ability to checkpoint register state (available in many CPUs)
TM state registers (status, pointers to abort handlers, …)
Cache changes
R bit indicates membership to read set
W bit indicates membership to write set
HTM transaction execution
```
Xbegin
Load A
Load B
Store C ⇐ 5
Xcommit
```
Transaction begin (Xbegin)
Initialize CPU and cache state
Take register checkpoint
Load operation (Load A, Load B)
Serve cache miss if needed
Mark data as part of read set
Store operation (Store C ⇐ 5)
Service cache miss if needed
Mark data as part of write set (note: this is not a load into exclusive state. Why?)
(Answer: requesting exclusive access at store time would announce the write to other processors; deferring that request until commit keeps each transaction isolated from the others)
Fast two-phase commit
Validate: request RdX access to write set lines (if needed)
Commit: gang-reset R and W bits, turns write set data to valid (dirty) data


HTM transaction execution: detect/abort
Assume remote processor commits transaction with writes to A and D

Fast conflict detection and abort
Check: look up exclusive requests against the read set and write set
Abort: invalidate write set, gang-reset R and W bits, restore to register checkpoint
HTM Performance Example

2x to 7x over STM performance
Within 10% of sequential for one thread
Scales efficiently with number of processors
HTM Example: Transactional Coherence and Consistency
Use TM as the coherence mechanism -> all transactions all the time
Successful transaction commits update memory and all caches in the system

Assumptions
Lazy and optimistic
One “commit” per execution step across all processors
When one transaction causes another to abort and re-execute, assume that the “commit” of the first can overlap with the “begin” of the re-executing transaction
Minimize the number of execution steps

Hardware transactional memory support in Intel Haswell architecture
New instructions for “restricted transactional memory” (RTM)
xbegin: takes a pointer to a “fallback address” in case of abort
e.g., fallback to code-path with a spin-lock
xend
xabort
Implementation: tracks read and write set in L1 cache
Processor makes sure all memory operations commit atomically
But the processor may automatically abort a transaction for many reasons (e.g., eviction of a line in the read or write set will cause a transaction abort)
Implementation does not guarantee progress (see fallback address)
Intel’s optimization guide (ch. 12) gives guidelines for increasing the probability that transactions will not abort
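A common RTM usage pattern is lock elision with a fallback spin lock. The `_xbegin`/`_xend`/`_xabort` intrinsics below are Intel's actual RTM intrinsics (`immintrin.h`, compile with `-mrtm`); the surrounding code is a sketch:

```cpp
#include <immintrin.h>   // _xbegin, _xend, _xabort, _XBEGIN_STARTED
#include <atomic>

std::atomic<int> fallbackLock{0};   // 0 = free, 1 = held
long counter = 0;                   // shared data guarded by the region

void increment() {
    unsigned status = _xbegin();    // _XBEGIN_STARTED on entry, abort code otherwise
    if (status == _XBEGIN_STARTED) {
        // Read the fallback lock inside the transaction: it joins our read
        // set, so a thread taking the lock later will abort us.
        if (fallbackLock.load(std::memory_order_relaxed) != 0)
            _xabort(0xff);          // lock already held: take the fallback path
        counter++;                  // speculative update
        _xend();                    // attempt to commit atomically
    } else {
        // Fallback path: abort due to conflict, capacity, lock held, etc.
        while (fallbackLock.exchange(1, std::memory_order_acquire) != 0) { }
        counter++;
        fallbackLock.store(0, std::memory_order_release);
    }
}
```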
Summary: transactional memory
Atomic construct: declaration that atomic behavior must be preserved by the system
Motivating idea: increase simplicity of synchronization without (significantly) sacrificing performance
Transactional memory implementation
Many variants have been proposed: SW, HW, SW+HW
Implementations differ in:
Data versioning policy (eager vs. lazy)
Conflict detection policy (pessimistic vs. optimistic)
Detection granularity (object, word, cache line)
Software TM systems (STM)
Compiler adds code for versioning & conflict detection
Note: STM barrier = instrumentation code (e.g. StmRead, StmWrite)
Basic data-structures
Transactional descriptor per thread (status, rd/wr set, …)
Transactional record per data (locked/version)
Hardware Transactional Memory (HTM)
Versioned data is kept in caches
Conflict detection mechanisms augment the coherence protocol
Course Wrap Up
For the foreseeable future, the primary way to obtain higher performance computing hardware is through a combination of increased parallelism and hardware specialization


Modern software is surprisingly inefficient compared to the peak capability of modern machines
A lot of performance is currently left on the table (increasingly so as machines get more complex, and parallel processing capability grows)
Extracting this performance stands to provide a notable impact on many compute-intensive fields (or, more importantly, enable new applications of computing!)
Given current software programming systems and tools, understanding principles of how a parallel machine works is important to achieving high performance
A major challenge going forward is making it simpler for programmers to extract performance on these complex machines
This is very important given how exciting (and efficiency-critical) the next generation of computing applications are likely to be

Key issues we have addressed in this course
Identifying parallelism
(or conversely, identifying dependencies)
Efficiently scheduling work
1.Achieving good workload balance
2.Overcoming communication constraints:
Bandwidth limits, dealing with latency, synchronization
Exploiting data/computation locality = efficiently managing state!
3.Scheduling under heterogeneity (using the right processor for the job)
We discussed these issues at many scales and in many contexts
Heterogeneous mobile SoC
Single chip, multi-core CPU
Multi-core GPU
CPU+GPU connected via bus
Clusters of machines
AI accelerator hardware
Large scale, multi-node supercomputers
How throughput-oriented hardware works
Multiple cores, hardware-threads, SIMD
Specialized AI accelerators
Specialization to key domains
Abstractions that help structure code to be efficient
Data parallel thinking
Functional parallelism
Transactions
Tasks
SPMD
Other relevant classes
CS 217: Hardware Accelerators for Machine Learning (Winter, Kunle’s course)
Focuses on design of specialized hardware architectures for ML (understanding the workload and building efficient hardware for that workload)
CS 348K: Visual Computing Systems (Spring, Kayvon)
Design of high-performance hardware/software systems for processing images and video (ray tracing, video analysis, smartphone camera processing, NeRF/AI-based graphics, fast data labeling, etc)
CS/EE 282: Computer Systems Architecture