Papers are divided into the following categories:
- Parallel Applications
- Performance Portability and Scalability
- Implications for Tightly-coupled Multiprocessing
- Benchmarking, Methodology and Performance Evaluation
- Software Shared Memory for Clusters: Protocols, Systems and Evaluation
- Sequential Algorithms for Computational Biology and Medicine
- Miscellaneous
The Parallel Applications papers fall into the following subcategories:
- Hierarchical N-body and Tree Building
- Computer Graphics
- Probabilistic Inference
- Protein Structure Determination
- Multimedia
- Other
These papers develop new algorithms for partitioning irregular hierarchical N-body applications in a shared address space model. The algorithms are much simpler than those used in message passing (though they can and have since been used there by emulating a shared address space in application software using hashing techniques). The last paper develops completely new tree-building algorithms that are especially valuable for SVM.
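To give the flavor of the approach, below is a minimal, hypothetical sketch in Python (not the papers' code; the Cell class and function names are assumptions) in the spirit of a costzones-style pass: every processor walks the shared tree in the same fixed order and claims the cells whose cumulative work, as recorded during the previous timestep, falls in its cost range, so each processor's cells form a contiguous zone of the tree.

```python
# Illustrative sketch of costzones-style partitioning over a shared tree.
class Cell:
    def __init__(self, work=0, children=None):
        self.work = work                 # work counted for this cell last timestep
        self.children = children or []   # empty list => leaf cell

    def is_leaf(self):
        return not self.children

def costzones(cell, low, high, mine, cost_so_far=0):
    """Depth-first traversal in a fixed child order; claim leaves whose
    cumulative cost falls in [low, high)."""
    if cell.is_leaf():
        if low <= cost_so_far < high:
            mine.append(cell)
        return cost_so_far + cell.work
    for child in cell.children:
        cost_so_far = costzones(child, low, high, mine, cost_so_far)
    return cost_so_far

# Example: processor p of P takes the cost range [p*total/P, (p+1)*total/P).
leaves = [Cell(work=w) for w in (3, 1, 4, 1, 5, 9, 2, 6)]
root = Cell(children=[Cell(children=leaves[:4]), Cell(children=leaves[4:])])
total = sum(l.work for l in leaves)
P = 4
for p in range(P):
    mine = []
    costzones(root, p * total / P, (p + 1) * total / P, mine)
    print(p, [c.work for c in mine])
```

Because the traversal order is fixed and the tree is shared, this needs none of the explicit data redistribution that a message-passing partitioner would require.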
Following earlier work on parallelizing diffuse hierarchical radiosity (first paper in N-body subsection), develops new parallel algorithms for the specular+diffuse case, which has 3-way rather than 2-way interactions and hence very different characteristics and parallelization needs.
A completely different parallel algorithm than our earlier one (IEEE Computer paper above), which scales much better on hardware-coherent systems and performs dramatically better on SVM. Uses performance tools to arrive at insights for algorithmic restructuring, and provides directions for such tools.
These papers show that, contrary to previous publications, good parallel performance can indeed be obtained on inference (critical given its computational complexity), and that parallel execution is especially effective in precisely the cases that need high-performance computing. Examine the tradeoffs between parallelization techniques (static and dynamic, and topological versus in-node), introduce a new one that exploits the right type of parallelism, and study the systems implications.
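As a rough illustration of the distinction (names and structure are assumptions, not the papers' code): "topological" parallelism processes a clique of a tree-structured inference problem as soon as all of its children are done, so independent subtrees proceed in parallel, whereas "in-node" parallelism would instead split the work inside one large clique across processors.

```python
from concurrent.futures import ThreadPoolExecutor

def height(clique):
    return 1 + max((height(c) for c in clique["children"]), default=0)

def collect(root, workers=4):
    # Group cliques by height so that an entire level (cliques whose children
    # are all finished) is processed in parallel -- topological parallelism.
    levels = {}
    stack = [root]
    while stack:
        c = stack.pop()
        levels.setdefault(height(c), []).append(c)
        stack.extend(c["children"])
    messages = {}
    def message(c):
        # stand-in for combining child messages with the clique potential
        return c["potential"] + sum(messages[id(ch)] for ch in c["children"])
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for h in sorted(levels):                     # leaves (height 1) first
            for c, m in zip(levels[h], pool.map(message, levels[h])):
                messages[id(c)] = m
    return messages[id(root)]

# Toy tree: the "potential" numbers stand in for clique tables.
tree = {"potential": 1, "children": [
    {"potential": 2, "children": []},
    {"potential": 3, "children": [{"potential": 4, "children": []}]}]}
print(collect(tree))
```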
The latter paper achieves good performance, but points out some difficult, generalizable partitioning problems that merit further research.
Develops two partitioning schemes for MPEG-2 decoding, compares them along different axes, and demonstrates good performance on hardware-coherent systems.
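The sketch below (assumed names, not the paper's code) shows one natural shared-address-space partitioning of this kind: independently decodable units such as slices become tasks in a shared work queue, so processors stay load balanced even though unit costs vary; the paper's actual schemes may differ in the granularity chosen.

```python
import queue
import threading

def decode_slice(slice_data):
    pass  # stand-in for the real variable-length decode + IDCT + motion compensation

def worker(tasks):
    # Pull tasks until the shared queue is drained.
    while True:
        try:
            s = tasks.get_nowait()
        except queue.Empty:
            return
        decode_slice(s)

def decode_picture(slices, num_threads=4):
    tasks = queue.Queue()
    for s in slices:
        tasks.put(s)
    threads = [threading.Thread(target=worker, args=(tasks,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

decode_picture(["slice%d" % i for i in range(16)])
```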
A longer version of the conference paper, with more detail.
Studies the performance portability of applications across different communication architectures that support a coherent shared address space programming model (from fine-grained hardware-coherent to page-grained SVM). Restructures applications through different classes of optimizations to understand the programming issues. Finds that many applications as written for hardware coherence do not port well, but can be restructured to do so. Unfortunately the restructuring needed is large (usually algorithmic), but fortunately it ports well back to hardware-coherent systems. Develops programming guidelines for SVM. Also illustrates the importance of, and provides some directions for, performance diagnosis tools.
Shows, using a detailed simulator, that hardware-coherent performance should indeed scale very well for many applications and retains its programmability advantages over message passing. Introduces a performance metric for studying such issues. Examines the key issue of how the difficulty of programming increases with machine size by defining a structured set of major levels of programming optimizations with machine scale.
• Scaling Application Performance on Cache-coherent Multiprocessors. Dongming Jiang and Jaswinder Pal Singh. Proc. International Symposium on Computer Architecture, May 1999.
• Scaling of Hardware Cache-coherent Multiprocessors. Dongming Jiang and Jaswinder Pal Singh. Invited paper. SIAM Conference on Parallel Processing, March 1999.
Examines the scaling of an SVM system that uses general-purpose network interface support to 64 processors, using a cluster of sixteen 4-way SMPs connected by Myrinet. Compares speedups with those achieved on a high-end hardware-coherent machine, the SGI Origin2000, of similar scale. Shows that surprisingly good performance can be obtained for many real applications (though not for some particularly challenging ones that speed up well on efficient hardware-coherent systems), but that the programming and performance analysis effort needed for this is very high. Fortunately, most of the restructurings are also valuable for hardware-coherent machines at even larger scale. Examines in detail the key performance bottlenecks and where programs spend their time.
The papers in this category fall into the following subcategories:
- Programming Models (Ease and Performance)
- Communication Architecture Performance
- Resource Distribution and Scaling
- Organization of Extended Memory Hierarchy
- Theory-Systems Interactions (Performance Modeling)
- Clustering
- Implications for Parallelizing Compilers
Examines the programming and performance tradeoffs between the coherent shared address space and explicit message passing programming models. Demonstrates the advantages of the former for irregular, dynamically changing applications using hierarchical N-body as the example. Also shows that the high temporal locality inherent in these applications makes automatic caching and coherence work very well and enables good performance easily when the communication abstraction cooperates (working sets are small and scale slowly with data set size).
Shows that coherent shared address space multiprocessors make it much easier to obtain excellent parallel performance on a range of graphics applications (ray tracing, volume rendering and radiosity) than in message passing, despite their irregular, unpredictable nature. The parallel programs do not involve major overhauls of the algorithm and data structure organization, as they do in traditional message passing systems, but are extensions of the sequential programs. Shows that temporal locality is key in this case too.
Compares the performance of the three programming models on a set of relatively regular applications and sorting. Shows how to improve MPI implementation and what the tradeoffs are. Studies the impact of problem and machine size on the differences.
By examining the performance parameters of the communication architecture, finds that the occupancy of the communication controller is a critical parameter that can easily destroy performance in irrecoverable ways.
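A back-of-the-envelope illustration of why occupancy is so damaging (notation assumed here, not taken from the paper): treating the controller as a single server that is busy for time o per message, with messages arriving at rate lambda,

```latex
\[
  \rho = \lambda\, o, \qquad
  W_{\text{queue}} \;\approx\; \frac{\rho\, o}{1 - \rho},
\]
```

so under an M/M/1-style approximation the queueing delay grows without bound as the utilization rho approaches 1; unlike a modest fixed latency, this contention cannot be hidden or amortized.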
Shows that despite its programmability, the FLASH controller is on the good side of the occupancy threshold due to its many optimizations, though it does suffer performance losses in some cases.
Shows that the fundamental properties (communication to computation ratio, load balance and temporal locality) of many real applications are such that performance can scale well to large processor counts. Also shows that the key properties of these applications are such that at large scale fine-grained machines (many processors and relatively little cache and memory per processor) may be more cost effective than coarse-grained machines, which has implications for resource distribution when levels of integration increase.
Shows that even under much more detailed simulation, hardware-coherent performance does scale very well for many applications and retains its programmability advantages over message passing. Introduces a metric for studying such problems, and examines how the difficulty of programming increases with machine size by defining a structured set of levels of optimization with machine scale.
Uses real machines to study the tradeoffs between CC-NUMA and more complex COMA architectures that replicate coherently in hardware in main memory as well. Shows that the working set properties of scalable applications are such that COMA machines do not benefit much, and that their major benefit is from automatic migration of data rather than automatic coherent replication for which they were designed.
Using application examples, shows that the real difficulty in modeling application performance is not modeling the system, which has been the focus so far, but rather modeling the behavior of the application. In particular, highlights the relationship between communication and replication, which is especially crucial on coherent shared memory systems. Proposes that the systems community can help with a hierarchy of levels of simulators. The occupancy paper above also showed that contention is a particularly important and difficult-to-model aspect.
First such study using real applications, with suggestions to focus compiler research.
Shows the importance of scaling under application considerations rather than simply scaling data set size, and illustrates the impact on architectural conclusions.
The original SPLASH paper, presenting and describing the applications.
SPLASH-2 has much broader representation than SPLASH, and also includes different levels of optimization in some cases. This paper characterizes the key properties of the programs quantitatively, helping users choose parameters, and addresses the major methodological issues in workload-driven evaluation using parallel applications (e.g. pruning the design space based on application-system interactions without losing representativeness), illustrating pitfalls.
Both papers show that while hardware cache coherence in real modern implementations mostly meets its promise, some implementation tradeoffs rather than architectural weaknesses can hurt performance dramatically; in particular, the sharing of some key communication resources and some constraints placed upon them. In the process, they (especially the latter) describe and illustrate a methodology for doing such evaluations, and identify applications for which commercial hardware coherent systems don't work so well.
First comprehensive study of page-based shared virtual memory (SVM) performance using standard applications, and comparison with hardware-coherent systems to see how far off they are. Shows that a hardware-supported "home-based" protocol substantially outperforms the dominant all-software non-home protocol and why. Finds that performance is far from that of hardware coherence for irregular applications, and identifies the key bottlenecks, such as synchronization and serialization caused by the dilation of critical sections due to expensive page faults inside them.
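A highly simplified sketch of the home-based idea (class and method names are assumptions, not the protocols' code): every page has a home node that keeps the master copy; a writer snapshots a clean "twin" at its first write, sends only the diff to the home at a release, and a faulting node after an acquire fetches the whole up-to-date page from the home rather than collecting diffs from all writers.

```python
PAGE_SIZE = 4096

class Page:
    def __init__(self, home):
        self.home = home                    # node that holds the master copy
        self.master = bytearray(PAGE_SIZE)  # up-to-date copy at the home

class NodeCopy:
    def __init__(self, page):
        self.page = page
        self.data = bytearray(page.master)  # local (possibly stale) copy
        self.twin = None

    def write_fault(self):
        # first write to a write-protected page: snapshot a clean twin
        self.twin = bytearray(self.data)

    def release(self):
        # at a release, send only the modified bytes (the diff) to the home
        for i, (old, new) in enumerate(zip(self.twin, self.data)):
            if old != new:
                self.page.master[i] = new
        self.twin = None

    def acquire_fault(self):
        # page fault after an acquire: fetch the whole page from the home
        self.data = bytearray(self.page.master)

page = Page(home=0)
writer, reader = NodeCopy(page), NodeCopy(page)
writer.write_fault()
writer.data[0] = 42       # local write after the write fault
writer.release()          # diff for byte 0 reaches the home copy
reader.acquire_fault()    # reader fetches the up-to-date page from the home
print(reader.data[0])     # -> 42
```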
Separates the effects of hardware support from those of home versus non-home based protocols. Shows that the hardware support is not so crucial, so the latter is the key, and that this form of hardware support is not likely to be so attractive in the future, especially for off-the-shelf network interfaces. Helped us decide not to provide this hardware support in the next-generation system at Princeton.
Studies the limitations to improving the performance of both page-grained and fine-grained approaches to software shared memory as the three layers that stand between a user and performance (the application, protocol and communication layers) are improved in realistic and idealized ways, both individually and in concert. Shows the complex synergies in the page-grained SVM approach, and comments on how different sources of major improvement match with technological trends.
Introduces a consistency model to obtain the benefits of associating data with synchronization without the programming burden of Entry Consistency. Has since been implemented by other researchers at Rice University. Also proposes the idea of building home-based protocols entirely in software, which has been shown to outperform earlier all-software systems.
A key question for software shared memory is whether coherence should be maintained at coarse grain or fine grain. Coarse grain typically requires sophisticated consistency models to reduce false sharing, but can amortize protocol and communication overhead. This paper explores the space of consistency and granularity, using a real platform with hardware access control. Shows that relaxed consistency can recover the lost performance in the cases where such losses exist, and that, contrary to previous results which did not look at irregular applications, multiple-writer protocols are very important at coarse grain.
Clusters are increasingly built out of hardware-coherent multiprocessors rather than uniprocessor nodes. This has potential advantages for software shared memory, since much of the communication may be contained within a node. These papers propose, implement and evaluate protocols that exploit the hardware coherence within a node as much as possible, reducing software intervention while still being lazy.
An older, simulation-based paper about SMP-node protocols. This one used an automatic update based protocol, and mapped all pages write-through. Replaced by the papers above.
Based on this, developed protocols and a system to avoid both interrupts and polling by the processor, using only general, commodity NI support and communication-layer mechanisms mostly within the VIA API; not the kind of hardware support provided in the SHRIMP NI (or that used in Cashmere), and no protocol-specific mechanisms or protocol processing. Prototyped on an SMP-node Myrinet cluster. Reorganized the protocol and its laziness to take advantage of explicit remote fetch, remote deposit and mutex support in the NI. By exploiting all three layers (application, protocol, communication), finally got close to hardware-SAS performance on a moderate-scale system for many real applications. Identified key remaining bottlenecks that need to be addressed to scale up to larger systems. See also the paper in the Performance Portability and Scalability section that examines the scalability of a protocol based on this to larger processor counts.
Places threads of research from different groups so far in a common framework, assesses where we are, and argues that applications, application bottleneck-driven protocol improvements, understanding hardware support and tools are key research areas for the near future.
Shows how dramatic performance improvements can be achieved by taking advantage of problem structure. Applies to a host of diagnosis problems. The basic insight has also been used later for handling continuous variables in inference.
Proposes a new representational framework for inference, with computational and intuitive advantages.
Extends the existing least-squares probabilistic approach (to obtain 3-d structure from experimental and theoretical constraints) to handle constraints that are not Gaussian. Does this by converting constraints to a mixture of Gaussians using expectation-maximization techniques, and then extending the algorithm by branching and combination to handle them.
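In rough notation (assumed here, not taken from the paper), a non-Gaussian constraint density on an observable d is fit by expectation-maximization as a K-component Gaussian mixture,

```latex
\[
  p(d) \;\approx\; \sum_{k=1}^{K} \pi_k \,
      \mathcal{N}\!\left(d;\, \mu_k,\, \sigma_k^{2}\right),
  \qquad \sum_{k=1}^{K} \pi_k = 1,
\]
```

after which each component contributes an ordinary least-squares term and the solver branches over components and recombines the resulting candidate structures.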
Shows that hierarchical approaches that take advantage of structure (either knowledge-based or automatically constructed) achieve dramatic performance gains (measured at 50-100x on the 30S ribosomal subunit) over flat approaches. Enables much larger problems to be solved by practitioners in the domain (see SC paper above). Examines different automatic heuristics and manual techniques for generating the hierarchy, and shows that the former can work very well if done right.
Soliton Phase Shifts in a Dissipative Lattice. Nayeem Islam, Jaswinder Pal Singh and Kenneth Steiglitz. In Journal of Applied Physics, August 1987.