Multicore Computing

@VERO
Created Date · 2024년 04월 22일 06:04
Read Time · 37 min read
Last Update · 2024년 04월 22일 08:04

Introduction to Multicore Computing

Multicore Processor

  • A single computing component with two or more independent cores
  • Core: computing unit that reads / executes program instructions
  • Multiple cores run multiple instructions at the same time (concurrently)
  • Increase overall program speed (performance)
  • The performance gained depends on how effectively software uses the multiple cores
  • Used in desktop PCs, mobile devices, and servers

Manycore processor (GPU)

  • multi-core architectures with an especially high number of cores (thousands)
  • CUDA
    • Compute Unified Device Architecture
  • OpenCL

Thread / Process

  • Both: independent sequence of execution
  • Process: runs in a separate memory space
  • Thread
    • run in shared memory space in a process
    • One process may have multiple threads
  • Multithreaded Program
    • a program running with multiple threads that are executed simultaneously

What is Parallel Computing?

  • Parallel computing
    • using multiple processors in parallel to solve problems more quickly than with a single processor
  • Examples of parallel machines:
    • A cluster computer that contains multiple PCs combined together with a high speed network
    • A shared memory multiprocessor by connecting multiple processors to a single memory system
    • A chip Multi-Processor (CMP) contains multiple processors (called cores) on a single chip
  • Concurrent execution comes from the desire for performance, unlike the inherent concurrency in a multi-user distributed system

Parallelism vs Concurrency

  • Parallel Programming
    • Using additional computational resources to produce an answer faster
    • The problem: using the extra resources effectively
    • Ex. summing up all the numbers in an array with n processors
  • Concurrent Programming
    • Correctly and efficiently controlling access by multiple threads to shared resources
    • The problem: preventing bad interleavings of operations from different threads
  • Example
    • Implementation of dictionary with hashtable
      • operations insert, update, lookup, delete occur simultaneously (concurrently)
      • Multiple threads access the same hashtable
    • Web Visit Counter
  • Often used interchangeably
  • In practice, the distinction between parallelism and concurrency is not absolute
  • Many multithreaded programs have aspects of both

Parallel Programming Techniques

  • Shared Memory
    • OpenMP, pthreads
  • Distributed Memory (the work is split across several computers, but a processor cannot directly access another computer's memory)
    • MPI
  • Distributed / Shared Memory (each node has shared memory, but memory on other nodes cannot be accessed directly)
    • Hybrid (MPI + OpenMP)
  • GPU Parallel Programming
    • CUDA Programming
    • OpenCL

Parallel Processing Systems

  • Small-Scale Multicore Environment
    • Notebook, Workstation, Server
    • OS supports multicore
    • POSIX threads (pthread), win32 thread
    • GPGPU-based supercomputer
    • Development of CUDA/OpenCL/GPGPU
  • Large-Scale Multicore Environment
    • Supercomputer: more than 10000 cores
    • Clusters
    • Servers
    • Grid Computing

Parallel Computing vs Distributed Computing

  • Parallel Computing
    • all processors may have access to a shared memory to exchange information between processors.
    • more tightly coupled to multi-threading
  • Distributed Computing
    • multiple computers communicate through network
    • each processor has its own private memory (distributed memory)
    • executing sub-tasks on different machines and then merging the results. (Each computer has a sub-result)
  • Distributed computing can be viewed as a kind of parallel computing.
  • No Clear Distinction

Cluster Computing vs Grid Computing

  • Cluster Computing
    • a set of loosely connected computers that work together so that in many respects they can be viewed as a single system
    • good price / performance
    • memory not shared
  • Grid Computing
    • federation of computer resources from multiple locations to reach a common goal (a large scale distributed system)
    • grid tend to be more loosely coupled, heterogeneous, and geographically dispersed

Cloud Computing

  • shares networked computing resources rather than having local servers or personal devices to handle applications
  • “Cloud” is used as a metaphor for “Internet” meaning “a type of Internet-based computing”
  • different services - such as servers, storage and applications - are delivered to a user's computers and smartphones through the Internet.

Good Parallel Program

  • Writing good parallel programs
    • Correct
    • Good Performance
    • Scalability
    • Load Balance
    • Portability
    • Hardware Specific Utilization

Characteristics of Distributed System

  • Resource Sharing: It is the ability to use any Hardware, Software, or Data anywhere in the System.
  • Openness: It is concerned with Extensions and improvements in the system (i.e., How openly the software is developed and shared with others)
  • Concurrency: It is naturally present in Distributed Systems, that deal with the same activity or functionality that can be performed by separate users who are in remote locations. Every local system has its independent Operating Systems and Resources.
  • Scalability: The system can be scaled up, accommodating more processors and more users while keeping the system responsive.
  • Fault tolerance: It concerns the reliability of the system: if there is a failure in hardware or software, the system continues to operate properly without degraded performance.
  • Transparency: It hides the complexity of the Distributed System from users and application programs, so the system is perceived as a single whole.
  • Heterogeneity: Networks, computer hardware, operating systems, programming languages, and developer implementations can all vary and differ among dispersed system components.

Challenges of Distributed Systems

  • Network latency: The communication network in a distributed system can introduce latency, which can affect the performance of the system.
  • Distributed coordination: Distributed systems require coordination among the nodes, which can be challenging because of the distributed nature of the system.
  • Data consistency: Maintaining data consistency across multiple nodes in a distributed system can be challenging.

Advantages

  1. High Performance: Distributed systems can outperform centralised systems by utilising the capabilities of several computers to handle the demand.
  2. Reliable: In terms of failures, distributed systems are significantly more dependable than single systems. Even if a single node fails, it does not affect the remaining servers; other nodes can continue to operate normally.
  3. Scalable
  4. Expandability
  5. Availability
  6. Reduced latency (short distance): Low latency is achieved using distributed systems. If a node is closer to the user, the distributed system ensures that traffic from that node reaches the system. As a result, the user may realise that it takes significantly less time to serve them.
  7. Secure: Distributed systems include security safeguards that prevent data breaches and illegal access to any data, hardware or software of an organization.
  8. Cost-effectiveness: Distributed systems, despite their high expense of implementation, are reasonably cost-effective in the long term. In contrast to a mainframe computer, which is made up of multiple processors, a distributed system is made up of several computers working together. This infrastructure is significantly less expensive than a mainframe system.
  9. Geographic Distribution: Geographic distribution is a feature of a distributed system. It allows companies and organisations to provide users with services in different areas.

Disadvantages

  1. High cost.
  2. Finding the fault (which component failed) is difficult.
  3. More space is needed.
  4. Increased infrastructure is needed.
  5. In distributed systems, it is challenging to provide adequate security because both the nodes and the connections must be protected.

Moore’s Law: Review

  • Doubling of the number of transistors on integrated circuits roughly every two years.
  • Microprocessors have become smaller, denser, and more powerful.
  • processing speed, memory capacity, sensors and even the number and size of pixels in digital cameras. All of these are improving at (roughly) exponential rates

Computer Hardware Trend

  • Chip density is continuing to increase ~2x every 2 years
    • Clock speed is not increasing anymore (at high clock speeds, power consumption and heat generation are too high to be tolerated)
    • The number of cores may double instead
  • No more hidden parallelism to be found
  • Transistor number still rising
  • Clock speed is flattening sharply.

=> Need Multicore programming!

  • All computers are now parallel computers.
  • Multi-core processors represent an important new trend in computer architecture.
    • Decreased power consumption and heat generation
    • Minimized wire lengths and interconnect latencies.
  • They enable true thread-level parallelism with great energy efficiency and scalability.

Why writing (fast) parallel programs is hard

Principles of Parallel Computing

  • Finding enough parallelism (Amdahl’s Law)
  • granularity
  • Locality
  • Load balance
  • Coordination and synchronization

→ All of these things make parallel programming even harder than sequential programming

Finding Enough Parallelism

  • Suppose only part of an application seems parallel
  • Amdahl’s law
    • let s be the fraction of work done sequentially, so (1-s) is fraction parallelizable
    • P = number of processors
    • $Speedup(P) = \frac{Time(1)}{Time(P)} \le \frac{1}{s + \frac{1-s}{P}} \le \frac{1}{s}$
  • Even if the parallel part speeds up perfectly performance is limited by the sequential part

Overhead of Parallelism

  • Given enough parallel work, this is the biggest barrier to getting desired speedup
  • Parallelism overheads include:
    • cost of starting a thread or process
    • cost of communicating shared data
    • cost of synchronizing
    • extra (redundant) computation
  • Each of these can be in the range of milliseconds (=millions of flops) on some systems
  • Tradeoff: Algorithm needs sufficiently large units of work to run fast in parallel, but not so large that there is not enough parallel work

Locality and Parallelism

  • Large memories are slow, fast memories are small
  • Storage hierarchies are large and fast on average
  • Parallel processors, collectively, have large, fast cache
    • the slow accesses to “remote” data we call “communication”
  • Algorithm should do most work on local data

Load imbalance

  • Load imbalance is the time that some processors in the system are idle due to
    • insufficient parallelism (during that phase)
    • unequal size tasks
  • Algorithm needs to balance load

Hyper-threading

  • Hyper-threading is Intel’s proprietary simultaneous multithreading implementation used to improve parallelization of computations (doing multiple tasks at once) performed on x86 microprocessors.

Performance of Parallel Programs

Flynn’s Taxonomy on Parallel Computer

  • Classified along two independent dimensions
    • Instruction stream
    • Data stream

SISD (Single Instruction, Single Data)

  • A serial (non-parallel) computer
  • This is the oldest and even today, the most common type of computer

SIMD (Single Instruction Multiple Data)

  • All processing units execute the same instruction at any given clock cycle
  • Best suited for specialized problems characterized by a high degree of regularity, such as graphics/image processing

MISD (Multiple Instruction, Single Data)

  • Each processing unit operates on the data independently via separate instruction streams
  • Few actual examples of this class of parallel computer have ever existed.

MIMD (Multiple Instruction, Multiple Data)

  • Every processor may be executing a different instruction stream
  • Every processor may be working with a different data stream
  • the most common type of parallel computer
  • Most modern supercomputers fall into this category

Creating a Parallel Program

  1. Decomposition
  2. Assignment
  3. Orchestration/Mapping

Decomposition

  • Break up computation into tasks to be divided among processes
  • identify concurrency and decide level at which to exploit it

Domain Decomposition

  • Data associated with a problem is decomposed.
  • Each parallel task then works on a portion of data.

Functional Decomposition

  • the focus is on the computation that is to be performed rather than on the data
  • the problem is decomposed according to the work that must be done.
  • Each task then performs a portion of the overall work.

Assignment

  • Assign tasks to threads
    • Balance workload, reduce communication and management cost
    • Together with decomposition, also called partitioning
  • Can be performed statically, or dynamically
  • Goal
    • Balanced workload
    • Reduced communication costs

Orchestration

  • Structuring communication and synchronization
  • Organizing data structures in memory and scheduling tasks temporally
  • Goals
    • Reduce cost of communication and synchronization as seen by processors
    • Preserve locality of data reference (including data structure organization)

Mapping

  • Mapping threads to execution units (CPU cores)
  • Parallel application tries to use the entire machine
  • Usually a job for OS
  • Mapping decision
    • Place related threads (cooperating threads) on the same processor
    • maximize locality, data sharing, minimize costs of comm/sync

Performance of Parallel Programs

  • What factors affect the performance?
    • Decomposition
      • Coverage of parallelism in algorithm
    • Assignment
      • Granularity of partitioning among processors
    • Orchestration/Mapping
      • Locality of computation and communication

Coverage (Amdahl’s Law)

  • Potential program speedup is defined by the fraction of code that can be parallelized
  • Speedup depends on the fraction of the work that is sequential.

Amdahl’s Law

  • p = fraction of work that can be parallelized
  • n = the number of processors
  • s = 1-p

$speedup = \frac{1}{(1 - p) + \frac{p}{n}}$
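
A quick worked instance of the formula (numbers chosen only for illustration): with p = 0.9 and n = 8,

$speedup = \frac{1}{(1 - 0.9) + \frac{0.9}{8}} = \frac{1}{0.2125} \approx 4.7$

and as $n \to \infty$ the speedup approaches $\frac{1}{1 - 0.9} = 10$, no matter how many processors are added.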

Implications of Amdahl’s Law

  • Speedup tends to 1/(1-p) as the number of processors tends to infinity
  • Parallel programming is worthwhile when programs have a lot of work that is parallel in nature

Performance Scalability

  • Scalability: the capability of a system to increase total throughput under an increased load when resources (typically hardware) are added

Granularity (work units)

  • Granularity is a qualitative measure of the ratio of computation to communication
    • Coarse: relatively large amounts of computational work are done between communication events
    • Fine: relatively small amounts of computational work are done between communication events
  • Computation stages are typically separated from periods of communication by synchronization events
  • Granularity
    • the extent to which a system is broken down into small parts
  • Coarse-grained systems
    • consist of fewer, larger components than fine-grained systems
    • regards large subcomponents
  • Fine-grained systems
    • regards smaller components of which the larger ones are composed

Fine vs Coarse Granularity

  • Fine-grain Parallelism
    • Low computation to communication ratio
    • Small amounts of computational work between communication stages
    • High communication overhead
  • Coarse-grain Parallelism
    • High computation to communication ratio
    • Large amounts of computational work between communication events
  • The most efficient granularity is dependent on the algorithm and the hardware
  • In most cases the overhead associated with communications and synchronization is high relative to execution speed, so it is advantageous to have coarse granularity.
  • Fine-grain parallelism can help reduce overheads due to load imbalance.

Load Balancing

  • distributing approximately equal amounts of work among tasks so that all tasks are kept busy all of the time.
  • It can be considered a minimization of task idle time.
  • For example, if all tasks are subject to a barrier synchronization point, the slowest task will determine the overall performance.

General Load Balancing Problem

  • The whole work should be completed as fast as possible.
  • As workers are very expensive, they should be kept busy.
  • The work should be distributed fairly. About the same amount of work should be assigned to every worker.
  • There are precedence constraints between different tasks (we can start building the roof only after finishing the walls). Thus we also have to find a clever processing order for the different jobs.

Load Balancing Problem

  • Processors that finish early have to wait for the processor with the largest amount of work to complete
    • Leads to idle time, lowers utilization

Static load balancing

  • The programmer makes decisions and assigns a fixed amount of work to each processing core a priori
  • Low run time overhead
  • Works well for homogeneous multicores
    • All cores are the same
    • Each core has an equal amount of work
  • Not so well for heterogeneous multicores
    • Some cores may be faster than others
    • Work distribution is uneven

Dynamic Load Balancing

  • When one core finishes its allocated work, it takes work from a work queue or a core with the heaviest workload
  • Adapt partitioning at run time to balance load
  • High runtime overhead
  • Ideal for codes where work is uneven or unpredictable, and for heterogeneous multicores
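
A minimal Java sketch of the work-queue idea behind dynamic load balancing (the class and variable names are illustrative, not from the course material): each worker repeatedly pulls the next chunk from a shared queue, so a faster core simply ends up processing more chunks.

import java.util.concurrent.ConcurrentLinkedQueue;

class DynamicWorker extends Thread {
  private final ConcurrentLinkedQueue<int[]> workQueue; // each chunk is [lo, hi)
  private final int[] data;
  long partialSum = 0;

  DynamicWorker(ConcurrentLinkedQueue<int[]> q, int[] data) {
    this.workQueue = q;
    this.data = data;
  }

  public void run() {
    int[] chunk;
    while ((chunk = workQueue.poll()) != null) { // take work until the queue is empty
      for (int i = chunk[0]; i < chunk[1]; i++)
        partialSum += data[i];
    }
  }
}

The main thread would enqueue many small [lo, hi) chunks, start a few DynamicWorker threads, join() them, and add up the partialSum fields; no worker is given a fixed share up front.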

Granularity and Performance Tradeoffs

  1. Load balancing
    • How well is work distributed among cores?
  2. Synchronization/Communication
    • Communication Overhead?

Communication

  • With message passing, programmer has to understand the computation and orchestrate the communication accordingly
    • Point to Point
    • Broadcast (one to all) and Reduce (all to one)
    • All to All (each processor sends its data to all others)
    • Scatter (one to several) and Gather (several to one)

Factors to consider for communication

  • Cost of communications
    • Inter-task communication virtually always implies overhead.
    • Communications frequently require some type of synchronization between tasks, which can result in tasks spending time ‘waiting’ instead of doing work.
  • Latency vs Bandwidth
    • Latency
      • the time it takes to send a minimal (0 byte) message from point A to point B
    • Bandwidth
      • the amount of data that can be communicated per unit of time.
    • Sending many small messages can cause latency to dominate communication overheads
    • Often it is more efficient to package small messages into a larger message.
  • Synchronous vs asynchronous
    • Synchronous: require some type of handshaking between tasks that share data
    • Asynchronous: transfer data independently from one another.
  • Scope of communication
    • Point-to-point
    • collective

MPI: Message Passing Library

  • MPI: portable specification
    • Not a language or compiler specification
    • Not a specific implementation or product
    • SPMD model (same program, multiple data)
  • For parallel computers, clusters, and heterogeneous networks, multicores
  • Multiple communication modes allow precise buffer management
  • Extensive collective operations for scalable global communication

Point-to-Point

  • Basic method of communication between two processors
    • Originating processor “sends” messages to destination processor
    • Destination processor then “receives” the message
  • The message commonly includes
    • Data or other information
    • Length of the message
    • Destination address and possibly a tag

Synchronous vs Asynchronous Messages

  • Synchronous send
    • Sender notified when message is received
  • Asynchronous send
    • Sender only knows that message is sent

Blocking vs Non-Blocking Messages

  • Blocking messages
    • Sender waits until message is transmitted: buffer is empty
    • Receiver waits until message is received: buffer is full
    • Potential for deadlock
  • Non-blocking
    • Processing continues even if message hasn’t been transmitted
    • Avoid idle time and deadlocks

Broadcast

  • One processor sends the same information to many other processors
    • MPI_BCAST

Reduction

  • Example: every processor starts with a value and needs to know the sum of values stored on all processors
  • A reduction combines data from all processors and returns it to a single process
    • MPI_REDUCE
    • Can apply any associative operation on gathered data
      • ADD, OR, AND, MAX, MIN, etc…
    • No processor can finish reduction before each processor has contributed a value
  • BCAST/REDUCE can reduce programming complexity and may be more efficient in some programs.

Synchronization

  • Coordination of simultaneous events (threads / processes) in order to obtain the correct runtime order and avoid unexpected race conditions
  • Types of synchronization
    • Barrier
      • Any thread / process must stop at this point (barrier) and cannot proceed until all other threads / processes reach this barrier
    • Lock/Semaphore
      • The first task acquires the lock. This task can then safely (serially) access the protected data or code.
      • Other tasks can attempt to acquire the lock but must wait until the task that owns the lock releases it.

Locality

  • Algorithm should do most work on local data
  • Need to exploit spatial and temporal locality

Locality of memory access (shared memory)

  • Parallel computation is serialized due to memory contention and lack of bandwidth
  • Distribute data to relieve contention and increase effective bandwidth

Memory Access Latency in Shared Memory Architectures

  • Uniform Memory Access (UMA)
    • Centrally located memory
    • All processors are equidistant (access times)
  • Non-Uniform Memory Access (NUMA)
    • Physically partitioned but accessible by all
    • Processors have the same address space (Logically large one memory)
    • Placement of data affects performance (data located far away takes much longer to access)
    • CC-NUMA (Cache-Coherent NUMA)

Cache Coherence

  • the uniformity of shared resource data that ends up stored in multiple local caches
  • Problem: When a processor modifies a shared variable in local cache, different processors may have different value of the variable
    • Copies of a variable can be present in multiple caches
    • A write by one processor may not become visible to others
    • They’ll keep accessing a stale value in their caches
    • Need to take actions to ensure visibility or cache coherence
  • Snooping cache coherence
    • Send all request for data to all processors
    • Works well for small systems
  • Directory-based cache coherence
    • Keep track of what is being shared in a directory
    • Send point-to-point requests to processors

Shared Memory Architecture

  • All processors can access all memory as a global address space (UMA, NUMA)
  • Advantage
    • Global address space provides a user-friendly programming perspective to memory
    • Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs
  • Disadvantages
    • Primary disadvantage is the lack of scalability between memory and CPUs
    • Programmer responsibility for synchronization
    • Expense: it becomes increasingly difficult and expensive to design and produce shared memory machines with ever increasing number of processors.

Distributed Memory Architecture

  • Characteristics
    • Only private(local) memory
    • Independent
    • require a communication network to connect inter-processor memory
  • Advantages
    • Scalable (processors, memory)
    • Cost effective
  • Disadvantages
    • Programmer responsibility of data communication
    • No global memory access
    • Non-uniform memory access time

Hybrid Architecture

  • Advantages / Disadvantages
    • Combination of Shared/Distributed architecture
    • Scalable
    • Increased programmer complexity

Ray Tracing

  • Shoot a ray into scene through every pixel in image plane
  • Follow their paths
    • they bounce around as they strike objects
    • they generate new rays: ray tree per input ray
  • Result is color and opacity for that pixel
  • Parallelism across rays

Java Thread Programming

Process

  • Operating system abstraction to represent what is needed to run a single program
  • a sequential stream of execution in its own address space
  • program in execution
  • Each process has an address space of 0 to $2^{64}$ bytes (on a 64-bit operating system)
  • Memory is physically limited, but logically a process can behave as if it has unlimited memory

PCB

  • Process Control Block
  • It contains:
    • process state
    • process number
    • program counter
    • registers
    • memory limits
    • list of open files

UNIX process

  • Every process, except process 0, is created by the fork() system call
    • fork() allocates entry in process table and assigns a unique PID to the child process
    • child gets a copy of process image of parent
    • both child and parent are executing the same code following fork()

Threads

  • Definition
    • single sequential flow of control within a program
    • A thread runs within the context of a program’s process and takes advantage of the resources allocated for that process and its environment
  • Each thread is comprised of (from OS perspective)
    • Program counter
    • Register set
    • Stack
  • Threads belonging to the same process share
    • Code section
    • Data section
    • OS resources such as open files

Single and multithreaded program

  • Code, data, and files are shared among the threads.

Multi-process vs Multi-thread

  • Process
    • Child process gets a copy of parent’s variables
    • Relatively expensive to start
    • Don’t have to worry about concurrent access to variables
  • Thread
    • Child thread shares parent’s variables
    • Relatively cheap to start
    • Concurrent access to variables is an issue

Programming Java threads

JAVA Threading Models

  • Java has threads built-in
  • Applications consist of at least one thread
    • Often called ‘main’
  • The Java Virtual Machine creates the initial thread which executes the main method of the class passed to the JVM
  • The methods executed by the ‘main’ thread can then create other threads

Creating Threads : method 1

  • A Thread class manages a single sequential thread of control.
class MyThread extends Thread {
	public void run() {
	// work to do
	}
}
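
A thread defined this way is launched with start(); a minimal usage sketch using the MyThread class above:

MyThread t = new MyThread();
t.start();  // the JVM invokes run() in a new thread of control
// t.run(); // would execute run() in the *current* thread - no new thread is created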

Thread Names

  • All threads have a name to be printed out
  • The default name is of the format: Thread-No
  • User-defined names can be given through constructor
    • Thread myThread = new Thread(“HappyThread”)
  • Or using the setName(aString) method.
  • There is a method in Thread class, called getName() to obtain a thread’s name
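
A small sketch of the naming methods listed above:

Thread named = new Thread("HappyThread");              // name supplied via constructor
System.out.println(named.getName());                   // prints "HappyThread"
named.setName("WorkerThread");                         // rename at any time
System.out.println(Thread.currentThread().getName());  // the initial thread is normally named "main"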

Creating Threads : method 2

  • Since Java does not permit multiple inheritance, we often implement the run() method in a class that is not derived from Thread but implements the Runnable interface
public interface Runnable {
	public abstract void run();
}

class MyRun implements Runnable {
	public void run() {
	// work to do
	}
}

Thread t = new Thread(new MyRun());
t.start();

Creating & Executing Threads (Runnable)

  • Runnable interface has single method
    • public void run()
  • Implement a Runnable and define run()
  • Thread’s run() method invokes the Runnable’s run() method
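
Since Runnable has a single abstract method, it can also be supplied as a lambda (Java 8 and later); a minimal sketch:

Runnable task = () -> System.out.println("hi from " + Thread.currentThread().getName());
Thread t = new Thread(task);
t.start();  // the new thread's run() delegates to task.run()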

Thread Life-Cycle

  • Created
    • start() → Alive
    • stop() → Terminated
  • Alive
    • stop() or run() returning → Terminated
  • Terminated

Alive States

  • Once started, an alive thread has a number of substates
    • start() → Running
    • Running
      • sleep(), wait(), I/O blocking → Non-Runnable
      • yield() → Runnable
    • Runnable (Ready)
      • dispatch → Running
    • Non-Runnable
      • notify(), time expires, I/O completed → Runnable

Thread Priority

  • All Java threads have a priority value, currently between 1 and 10.
  • Priority can be changed at any time
  • Initial priority is that of the creating thread
  • Preemptive scheduling
    • JVM gives preference to higher priority threads (Not guaranteed)
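
A minimal sketch of the priority API (the constants are defined in java.lang.Thread):

Thread worker = new Thread(() -> { /* work */ });
worker.setPriority(Thread.MAX_PRIORITY);   // 10; MIN_PRIORITY is 1, NORM_PRIORITY is 5
worker.start();
System.out.println(worker.getPriority());  // only a hint to the scheduler, not a guarantee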

yield

  • Release the right of CPU
  • static void yield()
    • allows the scheduler to select another runnable thread (of the same priority)
    • no guarantees as to which thread

Thread identity

  • Thread.currentThread()
    • Returns reference to the running thread

Thread sleep, suspend, resume

  • static void sleep(long millis)
    • Blocks this thread for at least the time specified
  • void stop(), void suspend(), void resume()
    • Deprecated!

Thread Waiting & Status Check

  • void join(), void join(long), void join(long, int)
    • One thread (A) can wait for another thread (B) to end
  • boolean isAlive()
    • returns true if the thread has been started and not stopped

Thread synchronization

  • The advantage of threads is that they allow many things to happen at the same time
  • The problem with threads is that they allow many things to happen at the same time
  • Safety
    • Nothing bad ever happens
    • no race condition
  • Liveness
    • Something eventually happens: no deadlock
  • Concurrent access to shared data in an object
  • Need a way to limit thread’s access to shared data
    • Reduce concurrency
  • Mutual Exclusion of Critical Section
    • Add a lock to an object
    • Any thread must acquire the lock before executing the methods
    • If lock is currently unavailable, thread will block

Synchronized JAVA methods

  • We can control access to an object by using the synchronized keyword
  • Using the synchronized keyword will force the lock on the object to be used
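
For example, a shared visit counter (like the Web Visit Counter mentioned earlier) can be protected by declaring its methods synchronized; a minimal sketch:

class VisitCounter {
  private int count = 0;

  // Both methods lock "this", so one increment can never interleave
  // with another increment or with a read of count.
  public synchronized void increment() { count++; }
  public synchronized int get() { return count; }
}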

Synchronized Lock Object

  • Every Java object has an associated lock, acquired via synchronized statements (blocks)
  • Only one thread can hold a lock at a time
  • Lock granularity: a small critical section is better for concurrency

Condition Variables

  • lock (synchronized)
    • control thread access to data
  • condition variable (wait, notify / notifyAll)
    • synchronization primitives that enable threads to wait until a particular condition occurs
    • enable threads to atomically release a lock and enter the sleeping state.
    • Without condition variables
      • the programmer would need to have threads continually polling (possibly in a critical section), to check if the condition is met
      • A condition variable is a way to achieve the same goal without polling
    • A condition variable is always used in conjunction with a mutex lock

wait() and notify()

  • wait()
    • If no interrupt (normal case), current thread is blocked
    • The thread is placed into wait set associated with the object
    • Synchronization lock for the object is released
  • notify()
    • One thread, say T, is removed from wait set, if exists.
    • T must re-acquire the synchronization lock for the object
    • T is resumed from waiting status

Producer-Consumer Problem

  • The problem describes two processes, the producer and the consumer, who share a common, fixed-size buffer used as a queue.
    • producer: generates a piece of data, puts it into the buffer, and starts again
    • consumer: at the same time, consumes the data (i.e., removes it from the buffer) one piece at a time
    • The problem is to make sure that the producer won't try to add data into the buffer if it's full and that the consumer won't try to remove data from an empty buffer.
  • The solution for the producer is to go to sleep if the buffer is full. The next time the consumer removes an item from the buffer, it notifies the producer, who starts to fill the buffer again. 
  • In the same way, the consumer can go to sleep if it finds the buffer to be empty. The next time the producer puts data into the buffer, it wakes up the sleeping consumer. 
  • The problem can be generalized to have multiple producers and consumers.

To motivate condition variables, consider the canonical example of a bounded buffer for sharing work among threads

Bounded buffer: A queue with a fixed size

  • (Unbounded still needs a condition variable, but 1 instead of 2)

For sharing work – think an assembly line: 

  • Producer thread(s) do some work and enqueue result objects
  • Consumer thread(s) dequeue objects and do next stage
  • Must synchronize access to the queue

Waiting

  • enqueue to a full buffer should not raise an exception
    • Wait until there is room
  • dequeue from an empty buffer should not raise an exception
    • Wait until there is data

What we want

  • Better would be for a thread to wait  until it can proceed 
    • Be notified  when it should try again
    • In the meantime, let other threads run
  • Like locks, not something you can implement on your own
    • Language or library gives it to you, typically implemented with operating-system support
  • An ADT that supports this: condition variable
    • Informs waiter(s) when the condition that causes it/them to wait has varied
  • Terminology not completely standard; will mostly stick with Java

Key ideas

  • Java weirdness: every object “is” a condition variable (and a lock)
    • other languages/libraries often make them separate
  • wait: 
    • “register” running thread as interested in being woken up
    • then atomically: release the lock and block
    • when execution resumes, thread again holds the lock
  • notify:
    • pick one waiting thread and wake it up
    • no guarantee woken up thread runs next, just that it is no longer blocked on the condition – now waiting for the lock
    • if no thread is waiting, then do nothing

Bug 1

Between the time a thread is notified and it re-acquires the lock, the condition can become false again!

Guideline: Always  re-check the condition after re-gaining the lock

  • In fact, for obscure reasons, Java is technically allowed to notify a thread spuriously  (i.e., for no reason)

Bug 2

  • If multiple threads are waiting, we wake up only one
    • Sure only one can do work now, but can’t forget the others!

notifyAll wakes up all current waiters on the condition variable. Guideline: If in any doubt, use notifyAll.

  • Wasteful waking is better than never waking up
  • So why does notify exist?
  • Well, it is faster when correct…
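
Putting the guidelines together (wait inside a while loop that re-checks the condition, and prefer notifyAll), a minimal bounded-buffer sketch using Java's built-in lock and condition variable (the class itself is illustrative):

class BoundedBuffer<E> {
  private final Object[] items;
  private int front = 0, size = 0;

  BoundedBuffer(int capacity) { items = new Object[capacity]; }

  public synchronized void enqueue(E x) throws InterruptedException {
    while (size == items.length)     // re-check after every wake-up (Bug 1)
      wait();                        // releases the lock while blocked
    items[(front + size) % items.length] = x;
    size++;
    notifyAll();                     // wake all waiters, not just one (Bug 2)
  }

  @SuppressWarnings("unchecked")
  public synchronized E dequeue() throws InterruptedException {
    while (size == 0)
      wait();
    E x = (E) items[front];
    front = (front + 1) % items.length;
    size--;
    notifyAll();
    return x;
  }
}

A producer thread calls enqueue() and blocks while the buffer is full; a consumer calls dequeue() and blocks while it is empty, exactly as in the producer-consumer description above.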

Java Multithread Programming Exercise

class C extends Thread {
  int i;
  C(int i) { this.i = i; }
  public void run() {
    System.out.println("Thread " + i + " says hi");
    try { 
      sleep(500);
    } catch (InterruptedException e) {}
    System.out.println("Thread " + i + " says bye");
  }
}

public class ex1 {
   private static final int NUM_THREAD = 10;
  public static void main(String[] args) {
    System.out.println("main thread start!");
    C[] c = new C[NUM_THREAD];
    for(int i=0; i < NUM_THREAD; ++i) {
      c[i] = new C(i);
      c[i].start();
    }
    System.out.println("main thread calls join()!");
    for(int i=0; i < NUM_THREAD; ++i) {
      try { 
        c[i].join();
      } catch (InterruptedException e) {}
    }
    System.out.println("main thread ends!");
  }
}
public class ex2_serial {
  public static void main(String[] args) {
    int[] int_arr = new int [10000]; 
    int i,s;
    for (i=0;i<10000;i++) int_arr[i]=i+1;
    s=sum(int_arr);
    System.out.println("sum=" + s +"\n");
  }

  static int sum(int[] arr) {
    int i;
    int s=0;
    for (i=0;i<arr.length;i++) s+=arr[i];
    return s;
  }
}
class SumThread extends Thread {
  int lo; // fields for communicating inputs
  int hi;
  int[] arr;
  int ans = 0; // for communicating result
  SumThread(int[] a, int l, int h) {
    lo=l; hi=h; arr=a;
  }
  public void run() { // overriding, must have this type
    for(int i=lo; i<hi; i++)
      ans += arr[i];
  }
}

class ex2 {

  private static int NUM_END=10000;
  private static int NUM_THREAD=4;

  public static void main(String[] args) {
    if (args.length==2) {
      NUM_THREAD = Integer.parseInt(args[0]);
      NUM_END = Integer.parseInt(args[1]);
    } 
    System.out.println("number of threads:"+NUM_THREAD);
    System.out.println("sum from 1 to "+NUM_END+"=");

    int[] int_arr = new int [NUM_END]; 
    int i,s;
    
    for (i=0;i<NUM_END;i++) int_arr[i]=i+1;
    s=sum(int_arr);
    System.out.println(s);
  }

  static int sum(int[] arr) {
    int len = arr.length;
    int ans = 0;
    SumThread[] ts = new SumThread[NUM_THREAD];
    for(int i=0; i < NUM_THREAD; i++) {
      ts[i] = new SumThread(arr,(i*len)/NUM_THREAD,((i+1)*len)/NUM_THREAD);
      ts[i].start();
    }
    try {
      for(int i=0; i < NUM_THREAD; i++) {
        ts[i].join();
        ans += ts[i].ans;
      }
    } catch (InterruptedException IntExp) {
    }

    return ans;
  }
}

ex2 is a naive solution: in the end, the main thread still has to loop once per thread to add up the partial sums.

class IntThread extends Thread {
  int my_id; // fields for communicating inputs
  int num_steps;
  int num_threads;
  double sum;

  IntThread(int id, int numSteps, int numThreads) {
    my_id=id; num_steps=numSteps; num_threads=numThreads;
    sum=0.0;
  }
  public void run() { 
    double x;
    int i;
    int i_start = my_id * (num_steps/num_threads);
    int i_end = i_start + (num_steps/num_threads);
    double step = 1.0/(double)num_steps;
    for (i=i_start;i<i_end;i++) { 
      x=(i+0.5)*step;
      sum=sum+4.0/(1.0+x*x);
    }
    sum = sum*step;
    System.out.println("myid"+my_id+", sum=" + sum);
  }
  public double getSum() { return sum; }
}

class ex3 {
  private static int NUM_THREAD = 4;
  private static int NUM_STEP = 1000000000;

  public static void main(String[] args) {
    int i;
    double sum=0.0;

    if (args.length==2) {
      NUM_THREAD = Integer.parseInt(args[0]);
      NUM_STEP = Integer.parseInt(args[1]);
    } 

    long start = System.currentTimeMillis();
    IntThread[] ts = new IntThread[NUM_THREAD];
    for(i=0; i < NUM_THREAD; i++) {
      ts[i] = new IntThread(i,NUM_STEP,NUM_THREAD);
      ts[i].start();
    }
     
    try {
      for (i=0;i<NUM_THREAD;i++) {
        ts[i].join();
        sum += ts[i].getSum();
      }
    } catch (InterruptedException e) {}
    long finish = System.currentTimeMillis();

    System.out.println("integration result=" + sum +"\n");
    System.out.println("Execution time (" + NUM_THREAD + "): "+(finish-start)+"ms");
  }

}

Parallelism for summation

  • Example: Sum elements of a large array

  • Idea: Have 4 threads simultaneously sum 1/4 of the array

    • Warning: This is an inferior first approach
  • Create 4 thread objects, each given a portion of the work

  • Call start() on each thread object to actually run it in parallel

  • Wait for threads to finish using join()

  • Add together their 4 answers for the final result

  • Problems? : processor utilization, subtask size

A better Approach

Solution is to use lots of threads, far more than the number of processors

  1. reusable and efficient across platforms 
  2. Use processors “available to you now” : 
    • Hand out “work chunks” as you go
  3. Load balance
    • in general subproblems may take significantly different amounts of time

Naive algorithm is poor

Suppose we create 1 thread to process every 1000 elements

Then combining results will have arr.length / 1000  additions 

  • Linear in size of array (with constant factor 1/1000)
  • Previously we had only 4 pieces (constant in size of array)

In the extreme, if we create 1 thread for every 1 element, the loop to combine results has length-of-array iterations

  • Just like the original sequential algorithm

A better idea: divide-and-conquer

This is straightforward to implement using divide-and-conquer

  • Parallelism for the recursive calls
  • The key is divide-and-conquer parallelizes the result-combining
  • If you have enough processors, total time is height of the tree: O(log n) (optimal, exponentially faster than sequential O(n))
  • We will write all our parallel algorithms in this style
class SumThread extends java.lang.Thread {

  static final int SEQUENTIAL_CUTOFF = 1000; // below this size, just sum sequentially

  int lo; int hi; int[] arr; // arguments
  int ans = 0; // result
  SumThread(int[] a, int l, int h) { arr=a; lo=l; hi=h; }

  public void run(){ // override
    if(hi - lo < SEQUENTIAL_CUTOFF)
      for(int i=lo; i < hi; i++)
        ans += arr[i];
    else {
      SumThread left = new SumThread(arr,lo,(hi+lo)/2);
      SumThread right= new SumThread(arr,(hi+lo)/2,hi);
      left.start();
      right.start();
      try {
        left.join(); // don't move this up a line (before right.start()) - that would serialize the two halves
        right.join();
      } catch (InterruptedException e) {}
      ans = left.ans + right.ans;
    }
  }
}

// called from some enclosing class:
int sum(int[] arr){
   SumThread t = new SumThread(arr,0,arr.length);
   t.run(); // the calling thread does the top-level split itself
   return t.ans;
}

Being realistic

  • In theory, you can divide down to single elements, do all your result-combining in parallel and get optimal speedup
    • Total time O(n/numProcessors  + log n)
  • In practice, creating all those threads and communicating swamps the savings, so:
    • Use a sequential cutoff, typically around 500-1000
      • Eliminates almost all the recursive thread creation (bottom levels of tree)
      • Exactly like quicksort switching to insertion sort for small subproblems, but more important here
    • Do not create two recursive threads; create one and do the other “yourself”
      • Cuts the number of threads created by another 2x
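
A sketch of the last two points applied to SumThread's run(): below the cutoff the range is summed sequentially, and above it only one half is handed to a new thread while the current thread computes the other half itself (SEQUENTIAL_CUTOFF as defined in the class above):

public void run() { // revised run(): sequential cutoff + one spawned thread per split
  if (hi - lo < SEQUENTIAL_CUTOFF) {
    for (int i = lo; i < hi; i++)
      ans += arr[i];
  } else {
    SumThread left  = new SumThread(arr, lo, (hi + lo) / 2);
    SumThread right = new SumThread(arr, (hi + lo) / 2, hi);
    left.start();   // only the left half gets a new thread
    right.run();    // the current thread does the right half itself
    try {
      left.join();  // wait for the spawned half to finish
    } catch (InterruptedException e) {}
    ans = left.ans + right.ans;
  }
}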

Concurrent Programming

What is concurrency?

  • What is a sequential program?
    • A single thread of control that executes one instruction and when it is finished execute the next logical instruction
  • What is a concurrent program?
    • A collection of autonomous sequential threads, executing (logically) in parallel
  • The implementation of a collection of threads can be:
    • Multiprogramming
      • Threads multiplex their executions on a single processor
    • Multiprocessing
      • Threads multiplex their executions on a multiprocessor or a multicore system
    • Distributed Processing
      • Processes multiplex their executions on several different machines

Concurrency and Parallelism

  • Concurrency is not parallelism
  • Interleaved Concurrency
    • Logically simultaneous processing
    • Interleaved execution on a single processor
  • Parallelism
    • Physically simultaneous processing
    • Requires a multiprocessor or a multicore system

Why use Concurrent Programming?

  • Natural Application Structure
    • The world is not sequential! Easier to program multiple independent and concurrent activities.
  • Increased application throughput and responsiveness
    • Not blocking the entire application due to blocking I/O
  • Performance from multiprocessor/multicore hardware
    • Parallel execution
  • Distributed systems
    • Single application on multiple machines
    • Client/Server type or peer-to-peer systems

Synchronization

  • Not all interleavings of the threads are acceptable correct programs
  • Java provides synchronization mechanism to restrict the interleavings
  • Synchronization serves two purposes:
    • Ensure safety for shared updates
      • Avoid race conditions
    • Coordinate actions of threads
      • Parallel computation
      • Event notification

Safety

  • Multiple threads access shared resource simultaneously
  • Safe only if:
    • All accesses have no effect on the resource (e.g., reading a variable), or
    • All accesses are idempotent, or
    • Only one access at a time: mutual exclusion

Mutual Exclusion

  • Prevent more than one thread from accessing critical section at a given time
    • Once a thread is in the critical section, no other thread can enter that critical section until the first thread has left the critical section
    • No interleavings of threads within the critical section
    • Serializes access to section

Atomicity

  • Synchronized methods execute the body as an atomic unit
  • May need to execute a code region as the atomic unit
  • Block Synchronization is a mechanism where a region of code can be labeled as synchronized
  • The synchronized keyword takes as a parameter an object whose lock the system needs to obtain before it can continue
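
A minimal sketch of block synchronization: only the region that touches the shared list is made atomic, and the lock object is passed as the parameter to synchronized (the class and field names are illustrative):

class EventLog {
  private final Object lock = new Object();  // dedicated lock object
  private final java.util.List<String> events = new java.util.ArrayList<>();

  public void record(String event) {
    String entry = System.currentTimeMillis() + " " + event; // no lock needed here
    synchronized (lock) {                    // atomic region: only the shared update
      events.add(entry);
    }
  }
}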

Avoiding Deadlock

  • Cycle in locking graph = deadlock
  • Standard solution: canonical order for locks
    • Acquire in increasing order
    • Release in decreasing order
  • Ensures deadlock-freedom, but not always easy to do
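
A minimal sketch of the canonical-order rule for two locks (the Account class is illustrative): every transfer acquires the two locks in the same global order, so a cycle in the locking graph cannot form.

class Account {
  final int id;        // global order: always lock the smaller id first
  private long balance;
  Account(int id, long balance) { this.id = id; this.balance = balance; }

  static void transfer(Account from, Account to, long amount) {
    Account first  = from.id < to.id ? from : to;
    Account second = from.id < to.id ? to : from;
    synchronized (first) {        // acquire in increasing id order
      synchronized (second) {     // released in the reverse (decreasing) order
        from.balance -= amount;
        to.balance += amount;
      }
    }
  }
}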

Races

  • Race conditions - insidious bugs
    • Non-deterministic, timing dependent
    • Cause data corruption, crashes
    • Difficult to detect, reproduce, eliminate
  • Many programs contain races
    • Inadvertent programming errors
    • Failure to observe locking discipline

Data Races

  • A data race happens when two threads access a variable simultaneously, and one access is a write
  • Problem with data races: non-determinism
    • Depends on interleaving of threads
  • Usual way to avoid data races: mutual exclusion
    • Ensures serialized access of all the shared objects
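
A minimal sketch of a data race: two threads increment a shared counter without mutual exclusion, so the read-modify-write sequences interleave and updates are lost. Making the increment synchronized (or using an AtomicInteger) removes the race.

class RaceDemo {
  static int counter = 0;  // shared, unsynchronized

  public static void main(String[] args) throws InterruptedException {
    Runnable work = () -> {
      for (int i = 0; i < 100_000; i++)
        counter++;         // read-modify-write: not atomic
    };
    Thread t1 = new Thread(work), t2 = new Thread(work);
    t1.start(); t2.start();
    t1.join(); t2.join();
    System.out.println(counter); // typically less than 200000: increments were lost
  }
}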

Potential Concurrency Problems

  • Deadlock
    • Two or more threads stop and wait for each other
  • Livelock
    • Two or more threads continue to execute, but make no progress toward the ultimate goal.
  • Starvation
    • Some thread gets deferred forever.
  • Lack of fairness
    • Not every thread gets a turn to make progress.
  • Race Condition
    • Some possible interleaving of threads results in an undesired computation result

Important Concepts in Concurrent Programming (from wikipedia)

  • Concurrency / Parallelism: logically / physically simultaneous processing
  • Synchronization
    • coordination of simultaneous events (threads / processes) in order to obtain the correct run-time order and avoid unexpected race conditions.
  • Mutual Exclusion
    • ensuring that no two processes or threads are in their critical section at the same time
  • Critical Section
    • a piece of code that accesses a shared resource (data structure or device) that must not be concurrently accessed by more than one thread of execution
  • Race Condition
    • a type of flaw in an electronic or software system where the output is dependent on the sequence or timing of other uncontrollable events.
  • Semaphore
    • a synchronization object that maintains a count between zero and a specified maximum value. The count is decremented each time a thread completes a wait for the semaphore object and incremented each time a thread releases the semaphore. When the count reaches zero, no more threads can successfully wait for the semaphore object state to become signaled.
  • concurrent hash map
    • a hash table supporting full concurrency of retrievals and adjustable expected concurrency for updates.
  • copy on write arrays
    • CopyOnWriteArrayList behaves as a List implementation that allows multiple concurrent reads, and for reads to occur concurrently with a write. The way it does this is to make a brand new copy of the list every time it is altered.
  • Barrier
    • a type of synchronization method. A barrier for a group of threads or processes in the source code means any thread/process must stop at this point and cannot proceed until all other threads / processes reach this barrier.
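
A small sketch showing some of these concepts as they appear in java.util.concurrent (Semaphore, ConcurrentHashMap, and CyclicBarrier as the barrier):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.Semaphore;

public class ConcurrencyPrimitivesDemo {
  public static void main(String[] args) throws Exception {
    Semaphore permits = new Semaphore(2);                 // at most 2 threads inside at once
    ConcurrentHashMap<String, Integer> hits = new ConcurrentHashMap<>();
    CyclicBarrier barrier = new CyclicBarrier(3);         // all 3 threads must arrive

    Runnable worker = () -> {
      try {
        permits.acquire();                                // wait: decrements the count
        hits.merge("visits", 1, Integer::sum);            // atomic per-key update
        permits.release();                                // increments the count
        barrier.await();                                  // block until all 3 reach here
      } catch (Exception e) { Thread.currentThread().interrupt(); }
    };

    Thread[] ts = { new Thread(worker), new Thread(worker), new Thread(worker) };
    for (Thread t : ts) t.start();
    for (Thread t : ts) t.join();
    System.out.println(hits.get("visits"));               // prints 3
  }
}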