- Split-phase Barriers with Java Phasers
- Point-to-Point Synchronization with Phasers
- One-Dimensional Iterative Averaging with Phasers
- Pipeline Parallelism
- Data Flow Parallelism
Split-phase Barriers with Java Phasers
In this lecture, we examined a variant of the barrier example that we studied earlier:
1 |
forall (i : [0:n-1]) { |
We learned about Java’s Phaser class, and that the operation 𝚙𝚑.𝚊𝚛𝚛𝚒𝚟𝚎𝙰𝚗𝚍𝙰𝚠𝚊𝚒𝚝𝙰𝚍𝚟𝚊𝚗𝚌𝚎() can be used to implement a barrier through phaser object 𝚙𝚑. We also observed that there are two possible positions for inserting a barrier between the two print statements above — before or after the call to 𝚕𝚘𝚘𝚔𝚞𝚙(𝚒). However, upon closer examination, we can see that the call to 𝚕𝚘𝚘𝚔𝚞𝚙(𝚒) is local to iteration i and that there is no specific need to either complete it before the barrier or to complete it after the barrier. In fact, the call to 𝚕𝚘𝚘𝚔𝚞𝚙(𝚒) can be performed in parallel with the barrier. To facilitate this split-phase barrier (also known as a fuzzy barrier) we use two separate APIs from Java Phaser class — 𝚙𝚑.𝚊𝚛𝚛𝚒𝚟𝚎() and 𝚙𝚑.𝚊𝚠𝚊𝚒𝚝𝙰𝚍𝚟𝚊𝚗𝚌𝚎(). Together these two APIs form a barrier, but we now have the freedom to insert a computation such as 𝚕𝚘𝚘𝚔𝚞𝚙(𝚒) between the two calls as follows:
1 |
|
“party” 是 Phaser 中的一个术语,相当于是线程的意思,当一个 party 到达,就是线程到达意思就是线程到了同步的屏障(Barrier)。
Phaser Understanding
1 |
import java.util.ArrayList; |
Doing so enables the barrier processing to occur in parallel with the call to 𝚕𝚘𝚘𝚔𝚞𝚙(𝚒), which was our desired outcome.
Point-to-Point Synchronization with Phasers
In this lecture, we looked at a parallel program example in which the span (critical path length) would be 6 units of time if we used a barrier, but is reduced to 5 units of time if we use individual phasers as shown in the following table:
Task0 | Task1 | Task2 |
1a:X=A();//cost=1 | 1b:Y=B();//cost=2 | 1c:Z=C();//cost=3 |
2a:ph0.arrive(); | 2b:ph1.arrive(); | 2c:ph2.arrive(); |
3a:ph1.awaitAdvance(0); | 3b:ph0.awaitAdvance(0); | 3c:ph1.awaitAdvance(0); |
4a:D(X,Y);//cost=3 | 4b:ph2.awaitAdvance(0); | 4c:F(Y,Z);//cost=1 |
5b:E(X,Y,Z);//cost=2 |
Each column in the table represents execution of a separate task, and the calls to 𝚊𝚛𝚛𝚒𝚟𝚎() and 𝚊𝚠𝚊𝚒𝚝𝙰𝚍𝚟𝚊𝚗𝚌𝚎(𝟶) represent synchronization across different tasks via phaser objects, 𝚙𝚑𝟶, 𝚙𝚑𝟷, and 𝚙𝚑𝟸, each of which is initialized with a party count of 1 (only one signalling task). (The parameter 0 in 𝚊𝚠𝚊𝚒𝚝𝙰𝚍𝚟𝚊𝚗𝚌𝚎(𝟶) represents a transition from phase 0 to phase 1.)
One-Dimensional Iterative Averaging with Phasers
In this lecture, we revisited the barrier-based Iterative Averaging example that we studied earlier, and observed that a full barrier is not necessary since forall iteration i only needs to wait for iterations i − 1 and i + 1 to complete their current phase before iteration i can move to its next phase. This idea can be captured by phasers, if we allocate an array of phasers as follows:
1 |
// Allocate array of phasers |
As we learned earlier, grouping/chunking of parallel iterations in a forall can be an important consideration for performance (due to reduced overhead). The idea of grouping of parallel iterations can be extended to forall loops with phasers as follows:
1 |
// Allocate array of phasers proportional to number of chunked tasks |
Pipeline Parallelism
In this lecture, we studied how point-to-point synchronization can be used to build a one-dimensional pipeline with p tasks (stages), T0,…Tn. For example, three important stages in a medical imaging pipeline are denoising, registration, and segmentation.
We performed a simplified analysis of the WORK and SPAN for pipeline parallelism as follows.
Let n be the number of input items and p the number of stages in the pipeline, WORK = n × p is the total work that must be done for all data items, and CPL = n + p − 1 is the span or critical path length for the pipeline. Thus, the ideal parallelism is PAR = WORK /CPL = np / (n + p − 1). This formula can be validated by considering a few boundary cases. When p = 1, the ideal parallelism degenerates to PAR = 1, which confirms that the computation is sequential when only one stage is available. Likewise, when n = 1, the ideal parallelism again degenerates to PAR = 1, which confirms that the computation is sequential when only one data item is available. When n is much larger than p (n » p), then the ideal parallelism approaches PAR = p in the limit, which is the best possible case.
The synchronization required for pipeline parallelism can be implemented using phasers by allocating an array of phasers, such that phaser 𝚙𝚑[𝚒] is “signalled” in iteration i by a call to 𝚙𝚑[𝚒].𝚊𝚛𝚛𝚒𝚟𝚎() as follows:
1 |
// Code for pipeline stage i |
Data Flow Parallelism
Thus far, we have studied computation graphs as structures that are derived from parallel programs. In this lecture, we studied a dual approach advocated in the data flow parallelism model, which is to specify parallel programs as computation graphs. The simple data flow graph studied in the lecture consisted of five nodes and four edges: A → C, A → D, B → D, B → E. While futures can be used to generate such a computation graph, e.g., by including calls to A.get() and B.get() in task D, the computation graph edges are implicit in the get() calls when using futures. Instead, we introduced the asyncAwait notation to specify a task along with an explicit set of preconditions (events that the task must wait for before it can start execution). With this approach, the program can be generated directly from the computation graph as follows:
1 |
async( () -> {/* Task A */; A.put(); } ); // Complete task and trigger event A |
Interestingly, the order of the above statements is not significant. Just as a graph can be defined by enumerating its edges in any order, the above data flow program can be rewritten as follows, without changing its meaning:
1 |
asyncAwait(A, () -> {/* Task C */} ); // Only execute task after event A is triggered |
Finally, we observed that the power and elegance of data flow parallel programming is accompanied by the possibility of a lack of progress that can be viewed as a form of “deadlock” if the program omits a put() call for signalling an event.
近期评论