Automatically Exploiting Cross-Invocation Parallelism Using Runtime Information
Harnessing the performance potential of multicore processors requires scalable parallel
programs. Automatic parallelization techniques are a promising approach for producing
well-performing parallel programs.
Nevertheless, most existing techniques parallelize only independent loops and insert
global synchronizations at the end of each loop invocation. For programs with few loop
invocations, these global synchronizations do not limit parallel execution performance. However,
for programs with many loop invocations, those synchronizations can easily become
the performance bottleneck since they frequently force all threads to wait, losing potential
parallelization opportunities. To address this problem, some automatic parallelization techniques
apply static analyses to enable cross-invocation parallelization. Instead of waiting,
threads can execute iterations from follow-up invocations if they do not cause any conflict.
However, static analysis must be conservative and cannot handle irregular dependence patterns
manifested by particular program inputs at runtime.
In order to enable more parallelization across loop invocations, this thesis presents two
novel automatic parallelization techniques: DOMORE and SpecCross. Unlike existing
techniques relying on static analyses, these two techniques take advantage of runtime
information to achieve much more aggressive parallelization. DOMORE constructs a custom
runtime engine which non-speculatively observes dependences at runtime and synchronizes
iterations only when necessary, while SpecCross applies software speculative barriers to
permit some of the threads to execute past the invocation boundaries. The two techniques
are complementary in the sense that they can parallelize programs with potentially very
different characteristics. SpecCross, with less runtime overhead, works best when programs'
cross-invocation dependences seldom cause any runtime conflict. DOMORE, on
the other hand, has its advantage in handling dependences which cause frequent conflicts.
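To make the cost of per-invocation barriers concrete, the following is a minimal sketch of a toy timing model, not the DOMORE or SpecCross implementation. It assumes unlimited threads, a known duration for each iteration, and that any two iterations touching the same address conflict. The function names and the input encoding (a list of invocations, each a list of `(duration, addresses)` pairs) are hypothetical and chosen only for illustration. It compares the finish time when every invocation ends with a global barrier against the finish time when an iteration waits only for earlier-invocation iterations that touched the same addresses.

```python
def barrier_finish_time(invocations):
    """Finish time when a global barrier follows every loop invocation.

    Iterations inside one invocation run in parallel, so each invocation
    costs as much as its slowest iteration, and invocations run back to back.
    """
    return sum(max(duration for duration, _ in inv) for inv in invocations)


def dependence_finish_time(invocations):
    """Finish time when iterations synchronize only on observed conflicts.

    An iteration may start as soon as every earlier-invocation iteration
    that touched one of its addresses has finished (toy model, unlimited
    threads, all accesses treated as conflicting).
    """
    last_finish = {}  # address -> finish time of the latest iteration touching it
    total = 0
    for inv in invocations:
        updates = {}
        for duration, addrs in inv:
            # Wait only for conflicting iterations from earlier invocations.
            start = max((last_finish.get(a, 0) for a in addrs), default=0)
            finish = start + duration
            for a in addrs:
                updates[a] = max(updates.get(a, 0), finish)
            total = max(total, finish)
        last_finish.update(updates)
    return total


# Three invocations; one slow iteration in the first invocation touches an
# address ("a") that no later iteration uses.
example = [
    [(4, {"a"}), (1, {"b"})],
    [(1, {"b"}), (1, {"c"})],
    [(1, {"c"}), (1, {"d"})],
]
print(barrier_finish_time(example))     # 6
print(dependence_finish_time(example))  # 4
```

In this example the barriers force the second and third invocations to wait for the slow iteration on `a`, even though nothing depends on it. Tracking conflicts at runtime lets those iterations overlap with it, which is the kind of opportunity cross-invocation parallelization targets.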
Evaluating implementations of DOMORE and SpecCross demonstrates that both
techniques can achieve much better scalability compared to existing automatic parallelization
techniques. Among twenty programs from seven benchmark suites, DOMORE is automatically
applied to parallelize six of them and achieves a geomean speedup of 2.1× over
code without cross-invocation parallelization and 3.2× over the original sequential performance
on 24 cores. SpecCross is found to be applicable to eight of the programs and it
achieves a geomean speedup of 4.6× over the best sequential execution, which compares
favorably to a 1.3× speedup obtained by parallel execution without any cross-invocation