# Lectures notes

## Terminologies

• Parallel computing is the simultaneous use of multiple processing elements (in opposition to serial computing)
• Concurrent computing is the composition of independently executing task (in opposition to sequential computing)

Processor power consumption is defined as $P_{cpu} = P_{dyn} + P_{stc}$ where dynamic power consumption $P_{dyn} C\cdot V^2\cdot f$

• $C$ switches capacitance decrease (transistor works like a small capacitor)
• $V$ supply voltage
• $f$ clock speed

and $P_{stc}$ (static power consumption)

• the smaller the CPUs they get, the more they leak power
• sub-threshold leakage (lowering the voltage)
• gate tunneling leakage (making the transistor smaller)

High-performance computing used in different applications

• Scientific computing
• Physical simulations
• Cryptanalysis

## Uniprocessor machine

• Throughput: rate at which operations are executed
• Latency: time required to complete one operation (harder to improve)

### Implicit vs Explicit parallelism

• Implicit (transparent by hardware/compiler)
• Explicit (write actual parallel code)

## Bit-Level Parallelism

Is achieved by increasing a CPU word size

## Instruction-Level Parallelism

Execute more instruction per processor cycle: create optimal sequence of instruction based on instruction dependency, eg out-of-order execution)

### Processor architecture

#### Subscalar

• One instruction takes multiple clock cycle
• Only one instruction at a time can be executed

#### Scalar

• Instruction still take multiple cycles, but it’s possible overlap the execution of instructions (different logic unit take care of “some” instructions (it generally decrease LATENCY but the THROUGHPUT increase). It might be necessary to duplicate hardware (logic unit) to achieve a Instruction-Per-Clock IPC = 1 even though an instruction takes multiple cycles. It generally can introduce different type of hazard
• Solutions to data hazard (read-after-write dependency)
• bubbles (insert some NOOP to delay the execution of the critical instruction)
• bypasses (delay the part of instruction that require a value right after the value has been loaded)
• compilation: make the compiler to avoid those hazard (by inserting unrelated instruction in the place of the remaining bubbles)
• Solution to control hazard (branching)
• bubbles
• branch delay shot (load instructions that are independent form the branch result while the branch is being evaluated)
• branch prediction accurate up to 90% (both STATIC or DYNAMIC) but if the wrong prediction it’s necessary to FLUSH the instruction pipeline

#### Superscalar

• Instructions still take multiple cycle but there are SEVERAL pipeline available (still IN-ORDER strategy)
• Same hazard BUT
• pipeline flushing is much more expensive
• parallel pipeline could execute dependent instructions (no actual parallelism)
• branch instructions can totally DISRUPT the flow of the instructions

#### VLIW

• The processor offers n-ways parallel execution units (like superscalar).
• Dynamic dependency checking logic removed from hardware: operations are packed into VLIW (Very-Long-Instruction-Word) instructions.
• Compiler generates VLIW instruction with independent operations

#### Out-of-order

• Rearrange the instructions without violating dependencies
• Reorder buffer: collect all instructions and pick the one that can be executed right away

### Improving ILP

• Register renaming
• Issues: data hazard due to limited number of registers, can cause FALSE dependencies (WAR, WAW)
• Solutions: register-association-table to keep track of registers mapping, more physical register than visible registers
• Speculative execution
• Start executing speculatively the predicted branch. If the prediction was right, make them “appear”, otherwise just flush the pipeline

### Compiler techniques

• Pipeline scheduling
• Prevent RAW data dependency
• Reorder instructions to reduce the number of pipeline stalls
• Loop unrolling
• Unroll a few loop iterations into a sequential stream to maximize ILP
• Others
• Eliminate branches by using conditional instructions
• Inline some function call (avoid having to allocate stack, resources..)

Divide code into independent task that can be scheduled separately.

• In uniprocessor pseudo-TLP if tasks overlap between computation and I/O accesses
• The processor executes multiple threads simultaneously
• Maximize the usage of resources (executions units)
• Appears as two separate CPU from the OS point of view

### Flynn’s taxonomy

• SISD: single instruction stream, single data stream (uniprocessor $\rightarrow$ BLP, ILP, TLP)
• SIMD: single instruction stream, multiple data stream (vector processors, GPGPU $\rightarrow$ BLP, DLP)
• MISD: multiple instruction stream, single data stream (systolic arrays? redundant execution by different “system” of the same “programs”)
• MIMD: multiple instruction stream, multiple data stream (multi core systems, distributed systems $\rightarrow$ DLP, TLP)

Processes

• Running/suspended instance of a program that consist of: address space (code, data, heap, stack) and environment (PID, User ID, group ID, file description, ..)
• Process concurrency (communication):
• signal: kill
• socket: socket(), connect(), bind(), listen(), accept()
• message queue: mq_open(), mq_send(), mq_receive()
• pipe: pipe(), dup2()
• shared memory: shmget()
• message passing: MPI
• memory mapped file: mmap()

• Inside the same process (share most of the environment)

## OpenMP

### C/C++ API

• Compiler directives
• ignore if no support, use macro to check support
• custom type, function prototypes
• Library
• use -fopenmp to use OpenMP library, implementation
• Environment variables

To work with OpenMP use #pragma omp directive [clause(s)], for example #pragma omp parallel num_threads(2). The directives refer to the next line of code or a block statement if {} are used

• parallel execute the code with as many thread as the hardware core
• for split the for loop workload between threads. It’s possible to specify the scheduling policy with for scheduling(KIND[,CHUNK_SOZE])
• static divide the work evenly, by default CHUNK_SIZE = loop_count/number_threads
• dynamic put chunks in an internal work queue and keep assigning new chunk to available thread, by default CHUNK_SIZE = 1
• guided start with larger chunk keep creating smaller chunk (never smaller than CHUNK_SIZE) until the work is completed
• runtime use the value of OMP_SCHEDULE for scheduling
• auto delegate to the compiler

It’s also possible to have nested parallelism, collapse nested loops parallel for collapse(N) (where N indicate the number of nested loops)

Any variable outside the parallel section is global to each thread, while variables declare within the parallel section are private to each thread (a local copy for each thread).

It’s possible to change the global/private visibility inside a loop

• private(var) var is now private to each thread, initial value is undefined
• firstprivate(var) var is private with initial value equal to the outer value
• lastprivate(var) var is private and the thread who’s performing the last iteration set a new value for the global variable
• shared(var) to share the variable between thread (default behavior for global variables)

To synchronize threads it’s possible to split the parallel region. All thread would be destroyed after the first region and the recreate for the next region. But it’s also possible to use the directive barrier

To guarantee that only one thread at the time execute some code, the directive critical should be used. The directive atomic take advantage of specif hardware to avoid locking/unlocking the variable

When parallel regions produce a result that needs to be combined among all threads, the directive reduction(OPERATION:VARIABLE) can be used. The operation can be +, -, *, max, min, &&, ||, &, |, ^ and the variable on which performing the reduction should be explicitly declared. It’s also possible to define custom reduction operations.

When the number of parallel regions is known in advance it’s possible to use the sections directive followed by a section directive for each parallel region

on might not be robust

The directive master and single limits the execution of a block to the master thread or a single thread (any thread) respectively. These directives can be used in pair with the task directive. In general a single thread is responsible for setting up the variables and then, inside a task section, OpenMP creates a task queue and every entry is assigned to available thread until all task are executed.

## Shared Memory

Independent processor with private caches communicate over the network and logically share memory

### Memory organization

• Uniform Memory Access - Multiple memory modules, allow simultaneous access without race conditions
• Non-uniform Memory Access - Processors can share their local memory (faster) and remote memory. It’s more scalable than UMA but force programmer to manage organization (OS kernel support, user-space support libnuma)

### Interconnection network

There are different type of network organization (topology) and all of them have different characteristics

• Static - point-to-point connection between nodes
• Fully connected - each node is connected to every other node, usually unfeasible in practice but theoretically ideal
• Bus - nodes connected on a single shared line, it’s easy to add new node but it allows only one communication at the time
• Ring - minimum amount of simultaneous communication
• Mesh - nodes are organized in 2D/3D structure that allows many simultaneous communication, the diameter can become too large
• Hypercube 4D mesh with lots of link and smaller diameter
• Dynamic - connection is established using switches
• Crossbar - any input can be connected to any output, the actual connection is set up by a switch. Up to $N$ non blocking connection, but hard to scale
• Omega - multistage network using two-by-two crossbar. Easier to implement

### Synchronization hardware

• On single processor, race condition may occur in the case of a multi instruction operation if the sub-instruction are not executed sequentially
• On multi processor, race condition occur if different instructions overlap during their execution. Atomicity is guaranteed by locking the bus

There are class of atomic operations that avoid race condition and are easy to implement in hardware (Read-Modify-Write instructions)

• Test-and-set
• Compare-and-swap

• On interconnected network, race conditions are a very real problem that cannot be solve using bus locking. Specific hardware operation can used instead. These operations uses a reservation table on the memory side
• Load-Link - return a new key (from reservation table) if it’s the first access, or the existing ley if there are ongoing access
• Store-Conditional - if the key is correct, store the result and cancel key. Otherwise a race condition occurred and the program should try the entire LL/SC again

### Cache coherence

Multiple cpu share the same memory, how to update the cache when the memory is modified?

• Snooping - The processor broadcast the change is about to do on memory over the bus, other processor can snoop the bus and determine wether they need to update their cache
• Directory-based - The memory keep track of which processor cached particular addresses and when a change happens, the memory notify the specif processors

### Memory consistency

In which acceptable order writes to different memory location (distributed systems) will be seen. Memory consistency is important only when race condition may occur

• Strict - strongest model, all changes are seen instantaneously. Impossible to implement
• Sequential - all changes are seen in the order that they actually happen by all nodes. Very slow

Optimization technique may compromise memory consistency (out-of-order execution), for this reason weaker consistency model are usually adopted

• Total store ordering - write as are buffered and reads still access the old values

To force sequentiality there are different techniques: compiler and hardware barrier, atomic RWM instructions.

## Message Passing

Distributed computers that communicate by message passing can be used to create powerful parallel systems. In general, every system using message passing require a collection of process, each with its exclusive address space. Data must be split and place among processes (locality is critical to achieve high performances) and communication are expensive and should be minimized by design good algorithms and mappings.

## GPU

### CPU

• Optimized for sequential operations
• Fewer execution units
• Higher clock speed
• Sophisticated control logic
• Levers or large cache
• Low latency tolerance

### GPU

• Optimized for parallel operations
• Many parallel ALUs
• High throughput
• High latency tolerance

#### CUDA

##### CUDA C/C++ API

A CUDA program consist of several phases, either executed on host or on devices. The host is the main processor that execute code sequentially (if not explicitly parallelized) and the device is the GPU with rich data parallelism. The device execute functions called kernels.

Launching a kernel usually require three phases: memory setup with data preparation, invoke kernel and retrieve result with memory cleanup. Memory management on device is more complicate than host (data transfer from/to per-grid global) since there could be R/Ws on thread registers, thread local memory, block shared memory and grid global memory

• cudaError_t cudaMalloc(void **buff, size_t size)
• cudaError_t cudaFree(void *buff)
• cudaError_t cudaMemcpy(void *dst, const void *src, size_t size, cudaMemcpyKind kind), where kind could be
• cudaMemcpyHostToDevice
• cudaMemcpyDeviceToHost
• cudaMemcpyHostToHost
• cudaMemcpyDeviceToDevice
• cudaMemcpyDefault

Kernel functions are execute by all CUDA threads, they should prefixed with __global__ if the kernel is invoked from host and deployed on device, __device__ if it’s executed and callable from only devices, or __host__.

CUDA threads are organized in grid (1D, 2D, 3D), and inside each grid there can be different blocks. Threads inside the same block can collaborate (share memory, synchronization). It’s possible to access the relative position of threads inside the grid or block with gridDim.{x, y, z}, blockIdx.{x, y, z} and threadIdx.{x, y, z}

Blocks are also further divided in 32-threads units called wraps and threads of same warps execute one common instruction at the time. Switching from warp to another has NO overhead (handy to keep the processors always busy).

CUDA have different type of memory

• Registers: local variables allocated by compiler
• Local memory: thread private variables that dont fit the registers
• Shared memory: per-block, usually each tread works on a separate piece. Indicated with __shared__
• Global memory: per-grid, memory allocated on host with cudaMalloc or on device. Global variables indicated with __device__
• Constant memory: only $64 KB$ cached on chip, loaded with cudaMemcpytoSymbol. Indicated with __constant__