Barrier (computer science)

In parallel computing, a barrier is a type of synchronization method. A barrier for a group of threads or processes in the source code means that any thread or process must stop at that point and cannot proceed until all the other threads or processes have reached the barrier.[1]

Many collective routines and directive-based parallel languages impose implicit barriers. For example, a parallel do loop in Fortran with OpenMP will not be allowed to continue on any thread until the last iteration is completed. This is in case the program relies on the result of the loop immediately after its completion. In message passing, any global communication (such as reduction or scatter) may imply a barrier.
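For example, in the C analogue of that construct, the implicit barrier at the end of an OpenMP parallel for loop guarantees that the results of every iteration are visible before the code after the loop runs. The following minimal sketch (compiled with OpenMP support) illustrates this; the array and loop bounds are arbitrary:

#include <stdio.h>

#define N 1000

int main(void)
{
    int a[N];

    // The iterations are divided among the threads of the team; the construct
    // ends with an implicit barrier, so no thread runs the code after the loop
    // until the last iteration has completed.
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = i * i;

    long sum = 0;                 // safe: relies on the results of the completed loop
    for (int i = 0; i < N; i++)
        sum += a[i];
    printf("%ld\n", sum);
    return 0;
}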

In concurrent computing, a barrier may be in a raised or lowered state. The term latch is sometimes used to refer to a barrier that starts in the raised state and cannot be re-raised once it is in the lowered state. The term count-down latch is sometimes used to refer to a latch that is automatically lowered once a predetermined number of threads/processes have arrived.
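A count-down latch can be built from the same primitives as a thread barrier. The following is a hypothetical C sketch using POSIX threads; the names countdown_latch, latch_count_down and latch_wait are illustrative, not a standard API:

#include <stdio.h>
#include <pthread.h>

#define ARRIVALS_NEEDED 3

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t lowered;
    int remaining;                               // arrivals still needed before the latch lowers
} countdown_latch;

countdown_latch latch = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, ARRIVALS_NEEDED };

void latch_count_down(countdown_latch *l)        // record one arrival
{
    pthread_mutex_lock(&l->lock);
    if (l->remaining > 0 && --l->remaining == 0)
        pthread_cond_broadcast(&l->lowered);     // lowered for good: it cannot be re-raised
    pthread_mutex_unlock(&l->lock);
}

void latch_wait(countdown_latch *l)              // block while the latch is still raised
{
    pthread_mutex_lock(&l->lock);
    while (l->remaining > 0)
        pthread_cond_wait(&l->lowered, &l->lock);
    pthread_mutex_unlock(&l->lock);
}

void *worker(void *arg)
{
    latch_count_down(&latch);                    // this thread has arrived
    return NULL;
}

int main(void)
{
    pthread_t threads[ARRIVALS_NEEDED];
    for (int i = 0; i < ARRIVALS_NEEDED; i++)
        pthread_create(&threads[i], NULL, worker, NULL);

    latch_wait(&latch);                          // proceeds only after all arrivals
    printf("latch lowered after %d arrivals\n", ARRIVALS_NEEDED);

    for (int i = 0; i < ARRIVALS_NEEDED; i++)
        pthread_join(threads[i], NULL);
    return 0;
}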

Implementation

Consider the case of threads, where the construct is known as a thread barrier. A thread barrier needs a variable that keeps track of the total number of threads that have entered the barrier.[2] Whenever enough threads have entered the barrier, it is lifted. A synchronization primitive such as a mutex is also needed when implementing a thread barrier.

This method is also known as a Centralized Barrier, because the threads have to wait in front of a single "central barrier" until the expected number of threads has reached it, at which point the barrier is lifted.

The following C code, which implements a thread barrier using POSIX Threads, demonstrates this procedure:[3]

#include <stdio.h>
#include <pthread.h>

#define TOTAL_THREADS 2
#define THREAD_BARRIERS_NUMBER 3
#define PTHREAD_BARRIER_ATTR NULL // pthread barrier attribute

typedef struct _thread_barrier thread_barrier;

thread_barrier barrier;

void thread_barrier_init(thread_barrier *barrier, pthread_mutexattr_t *mutex_attr, int thread_barrier_number);   // set up the mutex and counters

void thread_barrier_wait(thread_barrier *barrier);      // block until enough threads have reached the barrier

void thread_barrier_destroy(thread_barrier *barrier);   // release the mutex owned by the barrier

void *thread_func(void *ptr);                           // thread handler: wait at the barrier, then continue

int main(void);                                         // create TOTAL_THREADS threads and wait for them

In this program, the thread barrier is defined as a struct, struct _thread_barrier, which includes a mutex, a counter of the threads that have reached the barrier, and the number of threads required for the barrier to be lifted.

Based on the definition of a barrier, we need to implement a function such as thread_barrier_wait in this program, which "monitors" the total number of threads that have reached the barrier in order to lift it.
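The bodies of these functions are not shown above. A minimal sketch of one possible implementation, using a mutex-protected counter and a simple busy-wait, might look as follows; the struct members, function bodies and printed messages are illustrative assumptions rather than the exact original listing:

#include <stdio.h>
#include <pthread.h>

#define TOTAL_THREADS 2            // only 2 threads are created, so the 3-thread barrier below stays blocked
#define THREAD_BARRIERS_NUMBER 3
#define PTHREAD_BARRIER_ATTR NULL  // pthread barrier attribute

typedef struct _thread_barrier {
    pthread_mutex_t mutex;         // protects total_thread
    int total_thread;              // number of threads that have reached the barrier
    int thread_barrier_number;     // number of threads required to lift the barrier
} thread_barrier;

thread_barrier barrier;

void thread_barrier_init(thread_barrier *barrier, pthread_mutexattr_t *mutex_attr, int thread_barrier_number)
{
    pthread_mutex_init(&barrier->mutex, mutex_attr);
    barrier->total_thread = 0;
    barrier->thread_barrier_number = thread_barrier_number;
}

void thread_barrier_wait(thread_barrier *barrier)
{
    pthread_mutex_lock(&barrier->mutex);
    barrier->total_thread += 1;    // one more thread has reached the barrier
    pthread_mutex_unlock(&barrier->mutex);

    printf("thread id %lu is waiting at the barrier, as not enough %d threads are running ...\n",
           (unsigned long)pthread_self(), barrier->thread_barrier_number);

    // Busy-wait until the expected number of threads has reached the barrier
    int lifted = 0;
    while (!lifted) {
        pthread_mutex_lock(&barrier->mutex);
        lifted = (barrier->total_thread >= barrier->thread_barrier_number);
        pthread_mutex_unlock(&barrier->mutex);
    }
}

void thread_barrier_destroy(thread_barrier *barrier)
{
    pthread_mutex_destroy(&barrier->mutex);
}

void *thread_func(void *ptr)
{
    thread_barrier_wait(&barrier);
    printf("The barrier is lifted, thread id %lu is running now\n", (unsigned long)pthread_self());
    return NULL;
}

int main(void)
{
    pthread_t threads[TOTAL_THREADS];

    thread_barrier_init(&barrier, PTHREAD_BARRIER_ATTR, THREAD_BARRIERS_NUMBER);
    for (int i = 0; i < TOTAL_THREADS; i++)
        pthread_create(&threads[i], NULL, thread_func, NULL);
    for (int i = 0; i < TOTAL_THREADS; i++)
        pthread_join(threads[i], NULL);  // blocks while the created threads wait at the barrier
    thread_barrier_destroy(&barrier);

    printf("Thread barrier is lifted\n");
    return 0;
}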

In this program, every thread that calls thread_barrier_wait is blocked until THREAD_BARRIERS_NUMBER threads have reached the thread barrier. The result of the program is:

thread id is waiting at the barrier, as not enough 3 threads are running ...
thread id is waiting at the barrier, as not enough 3 threads are running ...
// (main process is blocked as not having enough 3 threads)
// Line printf("Thread barrier is lifted\n") won't be reached

As we can see from the program, only 2 threads are created. Both of them use thread_func as the thread function handler, which calls thread_barrier_wait, while the thread barrier expects 3 threads to call it in order to be lifted.

Change TOTAL_THREADS to 3 and the thread barrier is lifted:

thread id is waiting at the barrier, as not enough 3 threads are running ...
thread id is waiting at the barrier, as not enough 3 threads are running ...
thread id is waiting at the barrier, as not enough 3 threads are running ...
The barrier is lifted, thread id is running now
The barrier is lifted, thread id is running now
The barrier is lifted, thread id is running now
Thread barrier is lifted

Sense-Reversal Centralized Barrier

In addition to updating a counter for every thread that reaches the thread barrier, a thread barrier can use opposite values to mark each thread's state as passing or stopping.[4] For example, thread 1 with a state value of 0 is stopping at the barrier, thread 2 with a state value of 1 has passed the barrier, thread 3 with a state value of 0 is stopping at the barrier, and so on.[5] This is known as Sense-Reversal.

The following C code demonstrates this:[6]

#include <stdio.h>
#include <stdbool.h>  // bool values used for the sense flags (assumed header)
#include <pthread.h>

#define TOTAL_THREADS 2
#define THREAD_BARRIERS_NUMBER 3
#define PTHREAD_BARRIER_ATTR NULL // pthread barrier attribute

typedef struct _thread_barrier thread_barrier;

thread_barrier barrier;

void thread_barrier_init(thread_barrier *barrier, pthread_mutexattr_t *mutex_attr, int thread_barrier_number);   // set up the mutex, the counter and the shared flag

void thread_barrier_wait(thread_barrier *barrier);      // toggle local_sense, then wait until the shared flag matches it

void thread_barrier_destroy(thread_barrier *barrier);   // release the mutex owned by the barrier

void *thread_func(void *ptr);                           // thread handler: wait at the barrier, then continue

int main(void);                                         // create TOTAL_THREADS threads and wait for them

This program has all the features of the previous Centralized Barrier source code; it is merely implemented in a different way, using two new variables: a flag member of struct _thread_barrier shared by all threads, and a local_sense variable that is private to each thread.

When a thread stops at the barrier, the value of its local_sense is toggled. As long as fewer than THREAD_BARRIERS_NUMBER threads are stopped at the thread barrier, those threads keep waiting on the condition that the flag member of struct _thread_barrier is not equal to their private local_sense variable.

When exactly THREAD_BARRIERS_NUMBER threads are stopped at the thread barrier, the total thread count is reset to 0 and flag is set to local_sense, which lifts the barrier.
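As before, only the interface is shown above. A minimal sketch of how thread_barrier_wait could implement sense reversal with the flag and local_sense variables described here might look like this; the exact bodies are assumptions, and three threads are created so that the barrier actually lifts:

#include <stdio.h>
#include <stdbool.h>
#include <pthread.h>

#define TOTAL_THREADS 3            // unlike the listing above, 3 threads are created so the barrier lifts
#define THREAD_BARRIERS_NUMBER 3
#define PTHREAD_BARRIER_ATTR NULL  // pthread barrier attribute

typedef struct _thread_barrier {
    pthread_mutex_t mutex;         // protects total_thread and flag
    int total_thread;              // threads currently stopped at the barrier
    int thread_barrier_number;     // threads required to lift the barrier
    bool flag;                     // shared sense value, toggled each time the barrier lifts
} thread_barrier;

thread_barrier barrier;

void thread_barrier_init(thread_barrier *barrier, pthread_mutexattr_t *mutex_attr, int thread_barrier_number)
{
    pthread_mutex_init(&barrier->mutex, mutex_attr);
    barrier->total_thread = 0;
    barrier->thread_barrier_number = thread_barrier_number;
    barrier->flag = false;
}

void thread_barrier_wait(thread_barrier *barrier)
{
    static _Thread_local bool local_sense = false; // private to each thread
    local_sense = !local_sense;                    // toggled when the thread stops at the barrier

    pthread_mutex_lock(&barrier->mutex);
    barrier->total_thread += 1;
    if (barrier->total_thread == barrier->thread_barrier_number) {
        barrier->total_thread = 0;                 // reset the counter for the next round
        barrier->flag = local_sense;               // lift the barrier
    }
    pthread_mutex_unlock(&barrier->mutex);

    // Keep waiting while the shared flag is not equal to the private local_sense
    bool lifted = false;
    while (!lifted) {
        pthread_mutex_lock(&barrier->mutex);
        lifted = (barrier->flag == local_sense);
        pthread_mutex_unlock(&barrier->mutex);
    }
}

void thread_barrier_destroy(thread_barrier *barrier)
{
    pthread_mutex_destroy(&barrier->mutex);
}

void *thread_func(void *ptr)
{
    thread_barrier_wait(&barrier);
    printf("The barrier is lifted, thread id %lu is running now\n", (unsigned long)pthread_self());
    return NULL;
}

int main(void)
{
    pthread_t threads[TOTAL_THREADS];

    thread_barrier_init(&barrier, PTHREAD_BARRIER_ATTR, THREAD_BARRIERS_NUMBER);
    for (int i = 0; i < TOTAL_THREADS; i++)
        pthread_create(&threads[i], NULL, thread_func, NULL);
    for (int i = 0; i < TOTAL_THREADS; i++)
        pthread_join(threads[i], NULL);
    thread_barrier_destroy(&barrier);
    return 0;
}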

Combining Tree Barrier

A potential problem with the Centralized Barrier is that, because all the threads repeatedly access the same global variable for pass/stop, the communication traffic is rather high, which decreases scalability.

This problem can be resolved by regrouping the threads and using a multi-level barrier, e.g. a Combining Tree Barrier. Hardware implementations may also have the advantage of higher scalability.

A Combining Tree Barrier is a hierarchical way of implementing a barrier that addresses scalability by avoiding the case in which all threads spin on the same location.

In a k-Tree Barrier, all threads are equally divided into subgroups of k threads, and a first-round synchronization is done within these subgroups. Once all subgroups have finished their synchronization, the first thread in each subgroup enters the second level for further synchronization. In the second level, as in the first, the threads form new subgroups of k threads and synchronize within the groups, sending one thread from each subgroup up to the next level, and so on. Eventually, in the final level there is only one subgroup left to be synchronized. After the final-level synchronization, the releasing signal is transmitted to the upper levels and all threads get past the barrier.[7]
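For illustration, the following is a hypothetical sketch of a combining tree barrier with k = 2 for four threads, built from small per-node barriers protected by a mutex and condition variable; the node structure and names are assumptions for this sketch rather than a standard implementation:

#include <stdio.h>
#include <pthread.h>

#define K 2                        // subgroup size
#define NUM_THREADS 4              // kept to K * K so the tree has exactly two levels

typedef struct tree_node {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    int count;                     // arrivals at this node in the current round
    int generation;                // incremented each time this node releases its waiters
    struct tree_node *parent;      // NULL at the root
} tree_node;

static void node_init(tree_node *node, tree_node *parent)
{
    pthread_mutex_init(&node->lock, NULL);
    pthread_cond_init(&node->cond, NULL);
    node->count = 0;
    node->generation = 0;
    node->parent = parent;
}

// Synchronize at one node; the K-th arrival climbs to the parent node and,
// once the upper levels are done, releases the threads waiting at this node.
static void node_wait(tree_node *node)
{
    pthread_mutex_lock(&node->lock);
    int my_generation = node->generation;
    if (++node->count < K) {
        // Not the last arrival in this subgroup: wait for the release signal
        while (my_generation == node->generation)
            pthread_cond_wait(&node->cond, &node->lock);
        pthread_mutex_unlock(&node->lock);
    } else {
        node->count = 0;                       // reset this node for the next round
        pthread_mutex_unlock(&node->lock);
        if (node->parent != NULL)
            node_wait(node->parent);           // representative synchronizes one level up
        pthread_mutex_lock(&node->lock);
        node->generation++;                    // release signal travels back down the tree
        pthread_cond_broadcast(&node->cond);
        pthread_mutex_unlock(&node->lock);
    }
}

static tree_node nodes[3];                     // nodes[0] is the root, nodes[1..2] are leaves

typedef struct { int id; tree_node *leaf; } worker_arg;

static void *worker(void *p)
{
    worker_arg *arg = p;
    printf("thread %d reached the barrier\n", arg->id);
    node_wait(arg->leaf);                      // subgroup synchronization first, then up the tree
    printf("thread %d passed the barrier\n", arg->id);
    return NULL;
}

int main(void)
{
    node_init(&nodes[0], NULL);                // root
    node_init(&nodes[1], &nodes[0]);           // leaf for threads 0 and 1
    node_init(&nodes[2], &nodes[0]);           // leaf for threads 2 and 3

    pthread_t threads[NUM_THREADS];
    worker_arg args[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++) {
        args[i].id = i;
        args[i].leaf = &nodes[1 + i / K];      // K consecutive threads share a leaf
        pthread_create(&threads[i], NULL, worker, &args[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}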

Hardware Barrier Implementation

The hardware barrier uses hardware to implement the above basic barrier model.

The simplest hardware implementation uses dedicated wires to transmit signals that implement the barrier. These dedicated wires perform an OR/AND operation to act as the pass/block flag and thread counter. For small systems such a model works, and communication speed is not a major concern. In large multiprocessor systems, this hardware design can give the barrier implementation high latency. A network connection among processors is one implementation that lowers the latency, analogous to the Combining Tree Barrier.[8]

POSIX Thread barrier functions

The POSIX Threads standard directly supports thread barrier functions, which can be used to block the specified threads or the whole process at the barrier until other threads reach that barrier. The three main APIs that POSIX provides to implement thread barriers are:

  • pthread_barrier_init: Initialize the thread barrier with the number of threads that need to wait at the barrier in order to lift it[9]
  • pthread_barrier_destroy: Destroy the thread barrier to release its resources
  • pthread_barrier_wait: Calling this function blocks the current thread until the number of threads specified by pthread_barrier_init have called pthread_barrier_wait, which lifts the barrier[10]

The following example (implemented in C with the pthread API) uses a thread barrier to block all the threads of the main process, and therefore the whole process:
#include <stdio.h>
#include <pthread.h>

#define TOTAL_THREADS 2
#define THREAD_BARRIERS_NUMBER 3
#define PTHREAD_BARRIER_ATTR NULL // pthread barrier attribute

pthread_barrier_t barrier;

void *thread_func(void *ptr);    // thread handler: wait at the pthread barrier, then continue

int main(void);                  // create TOTAL_THREADS threads, join them, then report that the barrier was lifted

The result of that source code is:

Waiting at the barrier as not enough 3 threads are running ...
Waiting at the barrier as not enough 3 threads are running ...
// (main process is blocked as not having enough 3 threads)
// Line printf("Thread barrier is lifted\n") won't be reached

As we can see from the source code, only two threads are created. Both of them use thread_func as the thread function handler, which calls pthread_barrier_wait, while the thread barrier expects 3 threads to call it in order to be lifted.
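Since the bodies of thread_func and main are not reproduced above, a complete version of this example might look like the following sketch; the exact structure and messages are assumptions chosen to be consistent with the output shown:

#include <stdio.h>
#include <pthread.h>

#define TOTAL_THREADS 2
#define THREAD_BARRIERS_NUMBER 3
#define PTHREAD_BARRIER_ATTR NULL // pthread barrier attribute

pthread_barrier_t barrier;

void *thread_func(void *ptr)
{
    printf("Waiting at the barrier as not enough %d threads are running ...\n", THREAD_BARRIERS_NUMBER);
    pthread_barrier_wait(&barrier);
    printf("The barrier is lifted, thread id %lu is running now\n", (unsigned long)pthread_self());
    return NULL;
}

int main(void)
{
    pthread_t threads[TOTAL_THREADS];

    pthread_barrier_init(&barrier, PTHREAD_BARRIER_ATTR, THREAD_BARRIERS_NUMBER);
    for (int i = 0; i < TOTAL_THREADS; i++)
        pthread_create(&threads[i], NULL, thread_func, NULL);
    for (int i = 0; i < TOTAL_THREADS; i++)
        pthread_join(threads[i], NULL);   // blocks while the threads wait at the barrier
    pthread_barrier_destroy(&barrier);

    printf("Thread barrier is lifted\n");
    return 0;
}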

Change TOTAL_THREADS to 3 and the thread barrier is lifted:

Waiting at the barrier as not enough 3 threads are running ...
Waiting at the barrier as not enough 3 threads are running ...
Waiting at the barrier as not enough 3 threads are running ...
The barrier is lifted, thread id 140643372406528 is running now
The barrier is lifted, thread id 140643380799232 is running now
The barrier is lifted, thread id 140643389191936 is running now
Thread barrier is lifted

As main is treated as a thread, i.e. the "main" thread of the process,[11] calling pthread_barrier_wait inside main will block the whole process until the other threads reach the barrier. The following example uses a thread barrier, with pthread_barrier_wait called inside main, to block the process/main thread for 5 seconds while waiting for the 2 "newly created" threads to reach the thread barrier:

#define TOTAL_THREADS 2
#define THREAD_BARRIERS_NUMBER 3
#define PTHREAD_BARRIER_ATTR NULL // pthread barrier attribute

pthread_barrier_t barrier;

void *thread_func(void *ptr);    // thread handler: after a 5-second wait, reach the pthread barrier

int main(void);                  // create the threads and call pthread_barrier_wait instead of pthread_join

This example does not use pthread_join to wait for the 2 "newly created" threads to complete. It calls pthread_barrier_wait inside main in order to block the main thread, so that the process is blocked until the 2 threads finish their 5-second wait and reach the thread barrier, as sketched below.
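A minimal sketch of this second example might look as follows, assuming each created thread simply sleeps for 5 seconds before reaching the barrier; the placement of the sleep and the messages are assumptions:

#include <stdio.h>
#include <unistd.h>
#include <pthread.h>

#define TOTAL_THREADS 2
#define THREAD_BARRIERS_NUMBER 3
#define PTHREAD_BARRIER_ATTR NULL // pthread barrier attribute

pthread_barrier_t barrier;

void *thread_func(void *ptr)
{
    sleep(5);                          // simulate 5 seconds of work before reaching the barrier
    pthread_barrier_wait(&barrier);
    return NULL;
}

int main(void)
{
    pthread_t threads[TOTAL_THREADS];

    pthread_barrier_init(&barrier, PTHREAD_BARRIER_ATTR, THREAD_BARRIERS_NUMBER);
    for (int i = 0; i < TOTAL_THREADS; i++)
        pthread_create(&threads[i], NULL, thread_func, NULL);

    // No pthread_join: main itself waits at the barrier as the third thread,
    // so the whole process is blocked for roughly 5 seconds.
    pthread_barrier_wait(&barrier);
    printf("Thread barrier is lifted\n");

    pthread_barrier_destroy(&barrier);
    return 0;
}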

See also

External links

Parallel Programming with Barrier Synchronization. sourceallies.com. March 2012.

Notes and References

1. "Implementation of pthread_barrier". GNU Operating System. gnu.org. Retrieved 2024-03-02.
2. Solihin, Yan (2015). Fundamentals of Parallel Multicore Architecture (1st ed.). Chapman & Hall/CRC. ISBN 978-1482211184.
3. "Implementing Barriers". Carnegie Mellon University.
4. Culler, David (1998). Parallel Computer Architecture: A Hardware/Software Approach. Gulf Professional. ISBN 978-1558603431.
5. Culler, David (1998). Parallel Computer Architecture: A Hardware/Software Approach. Gulf Professional. ISBN 978-1558603431.
6. Nanjegowda, Ramachandra; Hernandez, Oscar; Chapman, Barbara; Jin, Haoqiang H. (2009-06-03). In Müller, Matthias S.; de Supinski, Bronis R.; Chapman, Barbara M. (eds.). Evolving OpenMP in an Age of Extreme Parallelism. Lecture Notes in Computer Science. Springer Berlin Heidelberg. pp. 42–52. doi:10.1007/978-3-642-02303-3_4. ISBN 9783642022845.
7. Nikolopoulos, Dimitrios S.; Papatheodorou, Theodore S. (1999). "A quantitative architectural evaluation of synchronization algorithms and disciplines on ccNUMA systems". Proceedings of the 13th International Conference on Supercomputing (ICS '99). New York, NY, USA: ACM. pp. 319–328. doi:10.1145/305138.305209. ISBN 978-1581131642. Archived from the original (https://web.archive.org/web/20170725025227/http://pure.qub.ac.uk/portal/en/publications/a-quantitative-evaluation-of-synchronization-algorithms-and-disciplines-on-ccnuma-systems-the-case-of-the-sgi-origin2000(2b9278d2-85fc-4ce9-96ed-725524379f9a).html) on 2017-07-25.
8. Adiga, N.R., et al. (2002). "An Overview of the BlueGene/L Supercomputer". Proceedings of the Conference on High Performance Networking and Computing.
9. "pthread_barrier_init, pthread_barrier_destroy". Linux man page. Retrieved 2024-03-16.
10. "pthread_barrier_wait". Linux man page. Retrieved 2024-03-16.
11. "How to get number of processes and threads in a C program?". Stack Overflow. Retrieved 2024-03-16.