Lecture notes
Terminology
- Parallel computing is the simultaneous use of multiple processing elements (as opposed to serial computing)
- Concurrent computing is the composition of independently executing tasks (as opposed to sequential computing)
Processor power consumption is defined as $P_{cpu} = P_{dyn} + P_{stc}$, where the dynamic power consumption is $P_{dyn} = C \cdot V^2 \cdot f$ with
- $C$ switched capacitance (each transistor works like a small capacitor)
- $V$ supply voltage
- $f$ clock frequency
and $P_{stc}$ is the static power consumption due to leakage:
- the smaller transistors get, the more they leak power
- sub-threshold leakage (worsened by lowering the voltage)
- gate tunneling leakage (worsened by making the transistor smaller)
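Since $P_{dyn}$ scales linearly with $f$ and quadratically with $V$, voltage/frequency scaling has a strong effect on dynamic power. An illustrative calculation (assuming both $V$ and $f$ are halved):
$$P_{dyn}' = C \cdot \left(\frac{V}{2}\right)^{2} \cdot \frac{f}{2} = \frac{1}{8}\, C \cdot V^{2} \cdot f = \frac{1}{8}\, P_{dyn}$$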
High-performance computing is used in different applications
- Scientific computing
- Physical simulations
- Cryptanalysis
Uniprocessor machine
- Throughput: rate at which operations are executed
- Latency: time required to complete one operation (harder to improve)
Implicit vs Explicit parallelism
- Implicit (transparent by hardware/compiler)
- Explicit (write actual parallel code)
Bit-Level Parallelism
Achieved by increasing the CPU word size
Instruction-Level Parallelism
Execute more instructions per processor cycle: create an optimal sequence of instructions based on instruction dependencies (e.g. out-of-order execution)
Processor architecture
Subscalar
- One instruction takes multiple clock cycles
- Only one instruction at a time can be executed
Scalar
- Instructions still take multiple cycles, but it's possible to overlap the execution of instructions (different logic units take care of different instructions at the same time). Pipelining generally does not reduce LATENCY, but it increases THROUGHPUT. It might be necessary to duplicate hardware (logic units) to achieve Instructions-Per-Clock IPC = 1 even though each instruction takes multiple cycles. Pipelining can introduce different types of hazards
- Solutions to data hazards (read-after-write dependency)
- bubbles (insert some NOOPs to delay the execution of the dependent instruction)
- bypasses (forward the value to the dependent instruction as soon as it is produced, instead of waiting for it to be written back)
- compilation: let the compiler avoid those hazards (by inserting unrelated instructions in place of the remaining bubbles)
- Solutions to control hazards (branching)
- bubbles
- branch delay slot (load instructions that are independent from the branch result while the branch is being evaluated)
- branch prediction, accurate up to 90% (both STATIC and DYNAMIC); on a wrong prediction it's necessary to FLUSH the instruction pipeline
Superscalar
- Instructions still take multiple cycles, but there are SEVERAL pipelines available (still an IN-ORDER strategy)
- Same hazards as before, BUT
- pipeline flushing is much more expensive
- parallel pipelines could end up executing dependent instructions (no actual parallelism in that case)
- branch instructions can totally DISRUPT the flow of the instructions
VLIW
- The processor offers n-ways parallel execution units (like superscalar).
- Dynamic dependency checking logic removed from hardware: operations are packed into VLIW (Very-Long-Instruction-Word) instructions.
- Compiler generates VLIW instructions with independent operations
Out-of-order
- Rearrange the instructions without violating dependencies
- Reorder buffer: collects all instructions and picks the ones that can be executed right away
Improving ILP
- Register renaming
- Issues: data hazards due to the limited number of registers can cause FALSE dependencies (WAR, WAW)
- Solutions: a register-association table to keep track of the register mapping, and more physical registers than architecturally visible registers
- Speculative execution
- Start executing the predicted branch speculatively. If the prediction was right, make the results “appear” (commit them), otherwise just flush the pipeline
Compiler techniques
- Pipeline scheduling
- Avoid stalls caused by RAW data dependencies
- Reorder instructions to reduce the number of pipeline stalls
- Loop unrolling
- Unroll a few loop iterations into a sequential stream to maximize ILP (see the sketch after this list)
- Others
- Eliminate branches by using conditional instructions
- Inline some function calls (avoids having to allocate a stack frame, set up resources, ...)
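A minimal sketch of manual loop unrolling (hypothetical array-sum example; in practice compilers can do this automatically, e.g. with GCC's `-funroll-loops`):

```c
/* Sum of n floats, unrolled by a factor of 4 to expose more ILP.
   Assumes n is a multiple of 4 for brevity; a real version would
   handle the remainder with a cleanup loop. */
float sum_unrolled(const float *a, int n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < n; i += 4) {
        /* Four independent accumulators: no RAW dependency between them,
           so the additions can overlap in the pipeline */
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```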
Task-Level Parallelism
Divide the code into independent tasks that can be scheduled separately.
- On a uniprocessor, pseudo-TLP is possible if tasks overlap computation and I/O accesses
- Simultaneous multithreading (SMT)
- The processor executes multiple threads simultaneously
- Maximizes the usage of resources (execution units)
- Appears as two separate CPUs from the OS point of view
Flynn’s taxonomy
- SISD: single instruction stream, single data stream (uniprocessor $\rightarrow$ BLP, ILP, TLP)
- SIMD: single instruction stream, multiple data stream (vector processors, GPGPU $\rightarrow$ BLP, DLP)
- MISD: multiple instruction stream, single data stream (systolic arrays; redundant execution of the same “program” by different “systems”)
- MIMD: multiple instruction stream, multiple data stream (multi core systems, distributed systems $\rightarrow$ DLP, TLP)
Processes vs Threads
Processes
- Running/suspended instance of a program that consists of: an address space (code, data, heap, stack) and an environment (PID, user ID, group ID, file descriptors, ...)
- Process concurrency (communication):
- file: `fopen`, `fread`, `fwrite`
- signal: `kill`
- socket: `socket()`, `connect()`, `bind()`, `listen()`, `accept()`
- message queue: `mq_open()`, `mq_send()`, `mq_receive()`
- pipe: `pipe()`, `dup2()` (see the sketch below)
- shared memory: `shmget()`
- message passing: MPI
- memory mapped file: `mmap()`
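A minimal sketch of inter-process communication through a pipe (assuming a POSIX system; error handling omitted):

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    int fd[2];
    pipe(fd);                       /* fd[0]: read end, fd[1]: write end */

    if (fork() == 0) {              /* child: writes a message */
        close(fd[0]);
        const char *msg = "hello from child";
        write(fd[1], msg, strlen(msg) + 1);
        close(fd[1]);
        return 0;
    }

    close(fd[1]);                   /* parent: reads the message */
    char buf[64];
    read(fd[0], buf, sizeof(buf));
    printf("parent received: %s\n", buf);
    close(fd[0]);
    wait(NULL);
    return 0;
}
```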
Threads
- Inside the same process (share most of the environment)
- POSIX threads
OpenMP
C/C++ API
- Compiler directives
- ignored if there is no support; use a macro to check for support
- Header file
- custom types, function prototypes
- Library
- use `-fopenmp` to link the OpenMP library/implementation
- Environment variables
- e.g. `OMP_NUM_THREADS=8 ./a.out`
To work with OpenMP use `#pragma omp directive [clause(s)]`, for example `#pragma omp parallel num_threads(2)`. The directive refers to the next line of code, or to a block statement if `{}` is used (see the sketch below).
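A minimal sketch of the directive syntax (assuming the file is compiled with `-fopenmp`):

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    /* The block after the directive is executed by every thread in the team */
    #pragma omp parallel num_threads(2)
    {
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}
```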
- `parallel`: execute the code with as many threads as the hardware cores
- `for`: split the for-loop workload between threads. It's possible to specify the scheduling policy with `for schedule(KIND[, CHUNK_SIZE])`
- `static`: divide the work evenly, by default `CHUNK_SIZE = loop_count/number_threads`
- `dynamic`: put chunks in an internal work queue and keep assigning new chunks to available threads, by default `CHUNK_SIZE = 1`
- `guided`: start with larger chunks and keep creating smaller chunks (never smaller than `CHUNK_SIZE`) until the work is completed
- `runtime`: use the value of `OMP_SCHEDULE` for scheduling
- `auto`: delegate the choice to the compiler
It's also possible to have nested parallelism and to collapse nested loops with `parallel for collapse(N)` (where `N` indicates the number of nested loops); see the sketch below.
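A minimal sketch combining `parallel for`, a scheduling clause and `collapse` (the array sizes and chunk size are illustrative):

```c
#define N 1024
#define M 1024

void scale(double a[N][M], double factor) {
    /* The two nested loops are collapsed into a single iteration space of
       N*M iterations, distributed in dynamically assigned chunks of 64 */
    #pragma omp parallel for collapse(2) schedule(dynamic, 64)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            a[i][j] *= factor;
}
```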
Any variable declared outside the parallel section is shared among the threads, while variables declared within the parallel section are private to each thread (a local copy for each thread). It's possible to change the shared/private visibility with clauses (see the sketch below):
- `private(var)`: var is now private to each thread, its initial value is undefined
- `firstprivate(var)`: var is private, with initial value equal to the outer value
- `lastprivate(var)`: var is private, and the thread performing the last iteration sets the new value of the outer variable
- `shared(var)`: share the variable between threads (the default behavior for variables declared outside the parallel section)
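A minimal sketch of the data-sharing clauses (hypothetical variables):

```c
#include <stdio.h>

int main(void) {
    int base = 10;   /* copied into each thread by firstprivate */
    int last = 0;    /* written back by the thread doing the last iteration */

    #pragma omp parallel for firstprivate(base) lastprivate(last)
    for (int i = 0; i < 100; i++) {
        last = base + i;
    }

    printf("last = %d\n", last);   /* value from the last iteration: 109 */
    return 0;
}
```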
To synchronize threads it's possible to split the parallel region: all threads are destroyed after the first region and recreated for the next one. Alternatively, the `barrier` directive can be used.
To guarantee that only one thread at a time executes some code, the `critical` directive should be used. The `atomic` directive takes advantage of specific hardware instructions to avoid locking/unlocking the variable.
When parallel regions produce a result that needs to be combined among all threads, the `reduction(OPERATION:VARIABLE)` clause can be used. The operation can be `+, -, *, max, min, &&, ||, &, |, ^` and the variable on which the reduction is performed must be explicitly declared. It's also possible to define custom reduction operations. A sketch follows.
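A minimal sketch of a sum reduction (hypothetical dot-product example):

```c
double dot(const double *x, const double *y, int n) {
    double result = 0.0;
    /* Each thread accumulates into a private copy of result;
       the private copies are combined with + at the end of the loop */
    #pragma omp parallel for reduction(+:result)
    for (int i = 0; i < n; i++)
        result += x[i] * y[i];
    return result;
}
```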
When the number of parallel regions is known in advance it's possible to use the `sections` directive followed by a `section` directive for each parallel region:
#pragma omp parallel sections
{
#pragma omp section
{
// first section
}
#pragma omp section
{
// second section
}
}
The directives `master` and `single` limit the execution of a block to the master thread or to a single thread (any thread), respectively. These directives can be used together with the `task` directive: in general a single thread is responsible for setting up the variables and then, inside the `task` sections, OpenMP creates a task queue and every entry is assigned to an available thread until all tasks are executed (see the sketch below).
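A minimal sketch of task creation from a `single` block (hypothetical linked-list traversal; `process` is assumed to be defined elsewhere):

```c
typedef struct node {
    int value;
    struct node *next;
} node_t;

void process(node_t *n);   /* hypothetical per-node work */

void process_list(node_t *head) {
    #pragma omp parallel
    {
        /* Only one thread walks the list and creates the tasks... */
        #pragma omp single
        for (node_t *n = head; n != NULL; n = n->next) {
            /* ...but each task is executed by any available thread */
            #pragma omp task firstprivate(n)
            process(n);
        }
    }
}
```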
Shared Memory
Independent processors with private caches communicate over the network and logically share memory
Memory organization
- Uniform Memory Access (UMA) - multiple memory modules with the same access latency for every processor; different modules can be accessed simultaneously
- Non-Uniform Memory Access (NUMA) - processors access their own local memory (faster) and can also access the remote memory of other processors. It's more scalable than UMA but forces the programmer to manage data placement (OS kernel support, user-space support via `libnuma`)
Interconnection network
There are different types of network organization (topologies), each with different characteristics
- Static - point-to-point connection between nodes
- Fully connected - each node is connected to every other node, usually unfeasible in practice but theoretically ideal
- Bus - nodes connected on a single shared line; it's easy to add new nodes but it allows only one communication at a time
- Ring - allows only a minimal amount of simultaneous communications
- Mesh - nodes are organized in a 2D/3D structure that allows many simultaneous communications; the diameter can become too large
- Hypercube - 4D mesh with more links and a smaller diameter
- Dynamic - connection is established using switches
- Crossbar - any input can be connected to any output, the actual connection is set up by a switch. Up to $N$ non-blocking connections, but hard to scale
- Omega - multistage network built from two-by-two crossbars. Easier to implement
Synchronization hardware
- On a single processor, a race condition may occur in the case of a multi-instruction operation if the sub-instructions are not executed atomically
- On a multiprocessor, race conditions occur if different instructions overlap during their execution. Atomicity is guaranteed by locking the bus
There are classes of atomic operations that avoid race conditions and are easy to implement in hardware (Read-Modify-Write instructions); see the sketch after this list
- Test-and-set
- Fetch-and-add
- Compare-and-swap
- On an interconnection network, race conditions are a very real problem that cannot be solved using bus locking. Specific hardware operations can be used instead; these operations use a reservation table on the memory side
- Load-Link - returns a new key (from the reservation table) if it's the first access, or the existing key if there are ongoing accesses
- Store-Conditional - if the key is correct, store the result and cancel the key. Otherwise a race condition occurred and the program should retry the entire LL/SC sequence
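A minimal sketch of a test-and-set style spinlock built on an atomic exchange (assuming GCC/Clang `__atomic` builtins; a real program would rather use C11 `<stdatomic.h>` or a pthread mutex):

```c
/* 0 = unlocked, 1 = locked */
static volatile int lock_flag = 0;

static void spin_lock(volatile int *l) {
    /* Atomically write 1 and get the previous value back:
       keep spinning while someone else already holds the lock */
    while (__atomic_exchange_n(l, 1, __ATOMIC_ACQUIRE) == 1)
        ;
}

static void spin_unlock(volatile int *l) {
    __atomic_store_n(l, 0, __ATOMIC_RELEASE);
}
```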
Cache coherence
Multiple CPUs share the same memory: how should caches be updated when memory is modified?
- Snooping - The processor broadcasts the change it is about to make to memory over the bus; other processors can snoop the bus and determine whether they need to update their cache
- Directory-based - The memory keeps track of which processors cached particular addresses and, when a change happens, notifies those specific processors
Memory consistency
Defines the acceptable order in which writes to different memory locations (distributed systems) will be seen. Memory consistency is important only when race conditions may occur
- Strict - strongest model, all changes are seen instantaneously. Impossible to implement
- Sequential - all changes are seen by all nodes in the order they actually happen. Very slow
Optimization techniques may compromise memory consistency (out-of-order execution); for this reason weaker consistency models are usually adopted
- Total store ordering - writes are buffered while reads can still access the old values
To force sequentiality there are different techniques: compiler and hardware barriers, atomic RMW instructions.
Message Passing
Distributed computers that communicate by message passing can be used to create powerful parallel systems. In general, every system using message passing requires a collection of processes, each with its exclusive address space. Data must be split and placed among the processes (locality is critical to achieve high performance), and communication is expensive and should be minimized by designing good algorithms and mappings.
OpenMPI
C/C++ API
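A minimal MPI sketch (assuming an MPI implementation such as Open MPI; compile with `mpicc` and run with e.g. `mpirun -np 2 ./a.out`):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* id of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    if (rank == 0) {
        int value = 42;
        /* Each process has its own address space: data moves only via messages */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank %d of %d received %d\n", rank, size, value);
    }

    MPI_Finalize();
    return 0;
}
```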
GPU
CPU
- Optimized for sequential operations
- Fewer execution units
- Higher clock speed
- Sophisticated control logic
- Several levels of large caches
- Low latency tolerance
GPU
- Optimized for parallel operations
- Many parallel ALUs
- High throughput
- High latency tolerance
CUDA
CUDA C/C++ API
A CUDA program consists of several phases, executed either on the host or on the device. The host is the main processor that executes code sequentially (if not explicitly parallelized), and the device is the GPU with rich data parallelism. The device executes functions called kernels.
Launching a kernel usually requires three phases: memory setup with data preparation, kernel invocation, and result retrieval with memory cleanup. Memory management on the device is more complicated than on the host (data must be transferred from/to per-grid global memory), since there can be R/Ws on thread registers, thread-local memory, per-block shared memory and per-grid global memory
- `cudaError_t cudaMalloc(void **buff, size_t size)`
- `cudaError_t cudaFree(void *buff)`
- `cudaError_t cudaMemcpy(void *dst, const void *src, size_t size, cudaMemcpyKind kind)`, where `kind` can be `cudaMemcpyHostToDevice`, `cudaMemcpyDeviceToHost`, `cudaMemcpyHostToHost`, `cudaMemcpyDeviceToDevice` or `cudaMemcpyDefault`
Kernel functions are executed by all CUDA threads. They should be prefixed with `__global__` if the kernel is invoked from the host and deployed on the device, `__device__` if it's executed on and callable only from the device, or `__host__` if it's executed on and callable only from the host.
CUDA threads are organized in grids (1D, 2D, 3D), and inside each grid there can be different blocks. Threads inside the same block can collaborate (shared memory, synchronization). It's possible to access the relative position of threads inside the grid or block with `gridDim.{x, y, z}`, `blockIdx.{x, y, z}` and `threadIdx.{x, y, z}`; see the sketch below.
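A minimal kernel-launch sketch (vector addition; `blockDim`, the number of threads per block, complements the built-ins above; error checking omitted):

```cuda
#include <cuda_runtime.h>

__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
    if (i < n)                        /* guard: the grid may overshoot n */
        c[i] = a[i] + b[i];
}

void add_on_device(const float *h_a, const float *h_b, float *h_c, int n) {
    size_t bytes = n * sizeof(float);
    float *d_a, *d_b, *d_c;

    /* 1. memory setup and data preparation */
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    /* 2. kernel invocation: enough 256-thread blocks to cover n elements */
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vec_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

    /* 3. result retrieval and memory cleanup */
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
}
```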
Blocks are also further divided into 32-thread units called warps, and threads of the same warp execute one common instruction at a time. Switching from one warp to another has NO overhead (handy to keep the processors always busy).
CUDA has different types of memory:
- Registers: local variables allocated by the compiler
- Local memory: thread-private variables that don't fit in the registers
- Shared memory: per-block, usually each thread works on a separate piece. Indicated with `__shared__` (see the sketch below)
- Global memory: per-grid, memory allocated on the host with `cudaMalloc` or on the device. Global variables are indicated with `__device__`
- Constant memory: only $64 KB$, cached on chip, loaded with `cudaMemcpyToSymbol`. Indicated with `__constant__`
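A minimal sketch of per-block shared memory: a block-wide sum by tree reduction (assumes the kernel is launched with `BLOCK_SIZE` threads per block and that the block size is a power of two; one partial sum per block is written to global memory):

```cuda
#define BLOCK_SIZE 256

__global__ void block_sum(const float *in, float *block_results, int n) {
    __shared__ float buf[BLOCK_SIZE];            /* one copy per block */

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                             /* wait for all loads */

    /* Tree reduction inside the block: halve the active threads each step */
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();
    }

    if (tid == 0)                                /* one partial sum per block */
        block_results[blockIdx.x] = buf[0];
}
```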