Annotated List of Publications

Textbook

  • David E. Culler and Jaswinder Pal Singh, with Anoop Gupta. Parallel Computer Architecture: A Hardware-Software Approach. Morgan Kaufmann Publishers, 1998.


    Papers

    Papers are divided into the following categories:

    1. Parallel Applications

  • Hierarchical N-body and Tree Building
  • Computer Graphics
  • Probabilistic Inference
  • Protein Structure Determination
  • Multimedia
  • Other

    1(a). Hierarchical N-body and Tree Building

  • Load Balancing and Data Locality in Adaptive Hierarchical N-body Methods: Barnes-Hut, Fast Multipole and Radiosity. Jaswinder Pal Singh, Chris Holt, Takashi Totsuka, Anoop Gupta and John L. Hennessy. Journal of Parallel and Distributed Computing, June 1995.

  • A Parallel Adaptive Fast Multipole Method. Jaswinder Pal Singh, Chris Holt, Anoop Gupta and John L. Hennessy. In Proceedings of Supercomputing93, November 1993.

  • A Parallel Adaptive Fast Multipole Method. Chris Holt and Jaswinder Pal Singh. Invited Paper. SIAM Conference on Parallel Processing for Scientific Computing, February 1995.

  • Parallel Tree Building for a Range of Shared Address Space Multiprocessors. Hongzhang Shan and Jaswinder Pal Singh. Princeton University Technical Report, Computer Science Department. In Proceedings of the International Parallel Processing Symposium (IPPS), 1998.
  • These papers develop new algorithms for partitioning irregular hierarchical N-body applications in a shared address space model. The algorithms are much simpler than those used in message passing (though they can be, and since have been, used there by emulating a shared address space in application software using hashing techniques). The last paper develops completely new tree-building algorithms that are especially valuable for SVM.
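The partitioning idea underlying these papers can be sketched roughly as follows: each body carries a profiled work count from the previous time step, and bodies (ordered so that nearby bodies are adjacent) are split into contiguous ranges of roughly equal total cost. This is a simplified, hypothetical rendition of the cost-based technique; the function name and cost model below are illustrative, not the papers' actual code.

```python
# Sketch: cost-based partitioning of N bodies among P processes,
# in the spirit of shared-address-space N-body partitioners.
# All names and the per-body 'cost' values are illustrative assumptions.

def partition_by_cost(costs, nprocs):
    """Split indices 0..len(costs)-1 into nprocs contiguous chunks
    of roughly equal total cost (cost = profiled work per body)."""
    total = sum(costs)
    target = total / nprocs
    chunks, current, acc = [], [], 0.0
    for i, c in enumerate(costs):
        current.append(i)
        acc += c
        # close this chunk once it reaches its share (keep the last chunk open)
        if acc >= target and len(chunks) < nprocs - 1:
            chunks.append(current)
            current, acc = [], 0.0
    chunks.append(current)
    return chunks

# Example: uneven per-body costs, as arise in adaptive N-body codes.
costs = [5, 1, 1, 1, 8, 1, 1, 2]
parts = partition_by_cost(costs, 2)
print(parts)
```

Because a shared address space lets any process read any body directly, such a partitioner needs no explicit data redistribution; in message passing the same idea requires the hashing-based emulation mentioned above.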

    1(b). Computer Graphics

  • Parallel Hierarchical Radiosity on Cache-coherent Multiprocessors. James Richard and Jaswinder Pal Singh. Parallel Rendering Symposium, October 1997.
  • Following earlier work on parallelizing diffuse hierarchical radiosity (first paper in N-body subsection), develops new parallel algorithms for the specular+diffuse case, which has 3-way rather than 2-way interactions and hence very different characteristics and parallelization needs.
  • Improving Parallel Shear-Warp Volume Rendering on Shared Address Space Multiprocessors. Dongming Jiang and Jaswinder Pal Singh. In Proceedings of the 6th ACM Symposium on Principles and Practice of Parallel Programming, June 1997.
  • A completely different parallel algorithm from our earlier one (the IEEE Computer visualization paper in Section 3(a)), which scales much better on hardware-coherent systems and performs dramatically better on SVM. Uses performance tools to arrive at insights for algorithmic restructuring, and provides directions for such tools.

    1(c). Probabilistic Inference

  • A Parallel Lauritzen-Spiegelhalter Algorithm for Probabilistic Inference. Alexander Kozlov and Jaswinder Pal Singh. In Proceedings of Supercomputing94, November 1994.

  • Parallel Probabilistic Inference on Cache-coherent Multiprocessors. Alexander Kozlov and Jaswinder Pal Singh. IEEE Computer, special issue on Applications for Shared Memory Multiprocessors, December 1996.
  • These papers show that, contrary to previous publications, good parallel performance can indeed be obtained on inference (critical given its computational complexity), and that parallelization is especially effective in precisely the cases that need high-performance computing. They examine the tradeoffs between parallelization techniques (static versus dynamic, and topological versus in-node), introduce a new one that exploits the right type of parallelism, and study the systems implications.

    1(d). Protein Structure Determination from Constraints

  • Parallel Protein Structure Determination in the Presence of Uncertainty. Cheng Chen, Jaswinder Pal Singh, Russ Altman and Bill Poland. In Proceedings of Supercomputing94, November 1994.

  • Parallel Hierarchical Protein Structure Determination. Cheng Chen, Jaswinder Pal Singh and Russ Altman. In Proceedings of Supercomputing96, November 1996.
  • The latter paper achieves successful performance, but points out some difficult, generalizable partitioning problems that bear further research.

    1(e). Multimedia

  • Real-Time Parallel MPEG-2 Decoding in Software. Angelos Bilas, Jason Fritts, and Jaswinder Pal Singh. In Proceedings of 11th International Parallel Processing Symposium, April 1997. Selected as Extended Paper.
  •  Develops two partitioning schemes for MPEG-2 decoding, compares them along different axes, and demonstrates good performance on hardware-coherent systems.
  • Real-Time Parallel MPEG-2 Decoding in Software. Angelos Bilas, Jason Fritts, and Jaswinder Pal Singh. Technical Report no. TR-516-97, Computer Science Department, Princeton University.
  • A longer version of the conference paper, with more detail.

    1(f). Other

  • Finding and Exploiting Parallelism in an Ocean Simulation Program: Experience, Results and Implications. Jaswinder Pal Singh and John L. Hennessy. Journal of Parallel and Distributed Computing, May 1992.

  • Data Locality and Memory System Performance in the Parallel Simulation of Ocean Eddy Currents. Jaswinder Pal Singh and John L. Hennessy. In Proceedings of the Second International Symposium on High Performance Computing, October 1991. Also in High Performance Computing II, North-Holland, 1991.


    2. Performance Portability and Scalability

    2(a). Performance Portability

  • Application Restructuring and Performance Portability on Shared Virtual Memory and Hardware-Coherent Multiprocessors. Dongming Jiang, Hongzhang Shan, and Jaswinder Pal Singh. In Proceedings of the 6th ACM Symposium on Principles and Practice of Parallel Programming, June, 1997.
  • Studies the performance portability of applications across different communication architectures that support a coherent shared address space programming model (from fine-grained hardware-coherent to page-grained SVM). Restructures applications with different classes of optimizations to understand the programming issues. Finds that many applications as written for hardware coherence do not port well, but can be restructured to do so. Unfortunately the restructuring needed is large (usually algorithmic), but fortunately it ports well back to hardware-coherent systems. Develops programming guidelines for SVM. Also illustrates the importance of, and provides some directions for, performance diagnosis tools.

    2(b). Scalability of Hardware Cache-coherent Multiprocessors

    Application and Architectural Bottlenecks in Distributed Shared Memory Multiprocessors. Chris Holt, Jaswinder Pal Singh and John Hennessy. In Proceedings of the 22nd Intl. Symposium on Computer Architecture (ISCA), May 1996.
    Shows, using a detailed simulator, that hardware-coherent performance should indeed scale very well for many applications while retaining its programmability advantages over message passing. Introduces a performance metric for studying such issues. Examines the key issue of how the difficulty of programming increases with machine size by defining a structured set of major levels of programming optimization with machine scale.
    Scaling Application Performance on Cache-coherent Multiprocessors. Dongming Jiang and Jaswinder Pal Singh.  Proc. International Symposium on Computer Architecture, May 1999.

    Scaling of Hardware Cache-coherent Multiprocessors. Dongming Jiang and Jaswinder Pal Singh. Invited paper. SIAM Conference on Parallel Processing, March 1999.

    2(c). Scalability of Software Shared Virtual Memory Clusters

     Application Scaling under Shared Virtual Memory on a Cluster of SMPs. Proc. International Symposium on Computer Architecture, May 1999.
    Examines the scaling of an SVM system that uses general-purpose network interface support to 64 processors, using a cluster of sixteen 4-way SMPs connected by Myrinet. Compares speedups with those achieved on a high-end hardware-coherent machine of similar scale, the SGI Origin2000. Shows that surprisingly good performance can be obtained for many real applications (though not for some particularly challenging ones that speed up well on efficient hardware-coherent systems), but that the programming and performance analysis effort needed for this is very high. Fortunately, most of the restructurings are also valuable for hardware-coherent machines at even larger scale. Examines in detail the key performance bottlenecks and where programs spend their time.
     
     

    3. Implications of Applications for Tightly-coupled Multiprocessing

  • Programming Models (Ease and Performance)
  • Communication Architecture Performance
  • Resource Distribution and Scaling
  • Organization of Extended Memory Hierarchy
  • Theory-Systems Interactions (Performance Modeling)
  • Clustering
  • Implications for Parallelizing Compilers

    3(a). Programming Models (Ease and Performance)

  • Implications of Hierarchical N-body Methods for Multiprocessor Architecture. Jaswinder Pal Singh, John L. Hennessy and Anoop Gupta. ACM Transactions on Computer Systems, May 1995.
  • Examines the programming and performance tradeoffs between the coherent shared address space and explicit message passing programming models. Demonstrates the advantages of the former for irregular, dynamically changing applications using hierarchical N-body as the example. Also shows that the high temporal locality inherent in these applications makes automatic caching and coherence work very well and enables good performance easily when the communication abstraction cooperates (working sets are small and scale slowly with data set size).
  • Parallel Visualization Algorithms and their Implications for Multiprocessor Architecture. Jaswinder Pal Singh, Anoop Gupta and Marc Levoy. IEEE Computer, special issue on Visualization, vol. 27, no. 6, June 1994.
  • Shows that coherent shared address space multiprocessors make it much easier to obtain excellent parallel performance on a range of graphics applications (ray tracing, volume rendering and radiosity) than in message passing, despite their irregular, unpredictable nature. The parallel programs do not involve major overhauls of the algorithm and data structure organization, as they do in traditional message passing systems, but are extensions of the sequential programs. Shows that temporal locality is key in this case too.
  • The Performance Advantages of Integrating Block Data Transfer in Cache-coherent Multiprocessors. Steven Woo, Jaswinder Pal Singh, and John L. Hennessy. In Proceedings of Architectural Support for Programming Languages and Operating Systems (ASPLOS), 1994. There is also an older technical report with different system assumptions (but more detailed analysis).
  • A Comparison of the MPI, SHMEM and Cache-coherent Shared Address Space Programming Models on the SGI Origin2000.  Hongzhang Shan and Jaswinder Pal Singh. Proc. Intl. Conference on Supercomputing, June 1999.
    Compares the performance of the three programming models on a set of relatively regular applications and sorting. Shows how to improve MPI implementation and what the tradeoffs are. Studies the impact of problem and machine size on the differences.

    3(b). Communication Architecture Performance

  • The Effects of Latency, Occupancy and Bandwidth on the Performance of Cache-coherent Multiprocessors. Chris Holt, Mark Heinrich, Jaswinder Pal Singh, Edward Rothberg and John L. Hennessy. Stanford University Technical Report. Submitted.
  • By examining the performance parameters of the communication architecture, finds that the occupancy of the communication controller is a critical parameter that can easily destroy performance in irrecoverable ways.
  • The Performance Impact of Flexibility in the Stanford FLASH Multiprocessor. Mark Heinrich et al. In Proceedings of Architectural Support for Programming Languages and Operating Systems (ASPLOS), 1994.
  • Shows that despite programmability, the FLASH controller is on the good side of the occupancy threshold due to its many optimizations, though it does suffer performance losses in some cases.

    3(c). Resource Distribution and Scaling

  • Working Sets, Cache Sizes and Node Granularity for Large-Scale Multiprocessors. Edward Rothberg, Jaswinder Pal Singh and Anoop Gupta. In Proceedings of the Twentieth International Symposium on Computer Architecture, pp. 14-25, May 1993.
  • Shows that the fundamental properties (communication to computation ratio, load balance and temporal locality) of many real applications are such that performance can scale well to large processor counts. Also shows that the key properties of these applications are such that at large scale fine-grained machines (many processors and relatively little cache and memory per processor) may be more cost effective than coarse-grained machines, which has implications for resource distribution when levels of integration increase.
  • Application and Architectural Bottlenecks in Distributed Shared Memory Multiprocessors. Chris Holt, Jaswinder Pal Singh and John Hennessy. In Proceedings of the Twenty-second International Symposium on Computer Architecture, May 1996.
  • Shows that even under much more detailed simulation, hardware-coherent performance does scale very well for many applications and retains its programmability advantages over message passing. Introduces a metric for studying such problems, and examines how the difficulty of programming increases with machine size by defining a structured set of levels of optimization with machine scale.

    3(d). Organization of Extended Memory Hierarchy

  • An Empirical Comparison of the Kendall Square Research KSR-1 and Stanford DASH Multiprocessors. Jaswinder Pal Singh, Truman Joe, Anoop Gupta and John L. Hennessy. In Proceedings of Supercomputing93, November 1993.
  • Uses real machines to study the tradeoffs between CC-NUMA and more complex COMA architectures that replicate coherently in hardware in main memory as well. Shows that the working set properties of scalable applications are such that COMA machines do not benefit much, and that their major benefit is from automatic migration of data rather than automatic coherent replication for which they were designed.

    3(e). Theory-Systems Interactions (Performance Modeling)

  • Modeling Communication in Parallel Algorithms: A Fruitful Interaction Between Theory and Systems? Jaswinder Pal Singh, Edward Rothberg and Anoop Gupta. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures, 1994.
  • Using application examples, shows that the real difficulty in modeling application performance is not modeling the system, which has been the focus so far, but rather modeling the behavior of the application. In particular, highlights the relationship between communication and replication, which is especially crucial on coherent shared memory systems. Proposes that the systems community can help with a hierarchy of levels of simulators. The occupancy paper above also showed that contention is a particularly important and difficult-to-model aspect.

    3(f). Clustering

  • The Benefits of Clustering in Shared Address Space Multiprocessors: An Applications-driven Investigation. Andrew Erlichson, Basem Nayfeh, Jaswinder Pal Singh and Kunle Olukotun. In Proceedings of Supercomputing95, November 1995.

  • Evaluation of Shared Cache Alternatives for Small-scale Multiprocessors. Basem Nayfeh, Kunle Olukotun and Jaswinder Pal Singh. In Proceedings of the Symposium on High-Performance Computer Architecture, February 1996.

    3(g). Implications for Parallelizing Compilers

  • An Empirical Investigation of the Effectiveness and Limitations of Automatic Parallelization. Jaswinder Pal Singh and John L. Hennessy. In Proceedings of the International Symposium on Shared Memory Multiprocessing, April 1991. Also in Shared Memory Multiprocessing, MIT Press, 1992.
  • First such study using real applications, with suggestions to focus compiler research.
     
     

    4. Benchmarking, Methodology and Performance Evaluation

    4(a). Application Suites and Methodology

  • Scaling Parallel Programs for Multiprocessors: Methodology and Examples. Jaswinder Pal Singh, John L. Hennessy and Anoop Gupta. IEEE Computer, vol. 26, no. 7, July 1993.
  • Shows the importance of scaling under application considerations rather than simply scaling data set size, and illustrates the impact on architectural conclusions.
  • SPLASH: Stanford Parallel Applications for Shared Memory. Jaswinder Pal Singh, Wolf-Dietrich Weber and Anoop Gupta. Computer Architecture News, 1992.
  • The original SPLASH paper, presenting and describing the applications.
  • The SPLASH2 Programs: Characterization and Methodological Considerations. Steven Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh and Anoop Gupta. In Proceedings of the Twenty-first International Symposium on Computer Architecture, June 1995.
  • SPLASH-2 has much broader representation than SPLASH, and also includes different levels of optimization in some cases. This paper characterizes the key properties of the programs quantitatively, helping users choose parameters, and addresses the major methodological issues in workload-driven evaluation using parallel applications (e.g. pruning the design space based on application-system interactions without losing representativeness), illustrating pitfalls.

    4(b). Evaluation of Real Commercial Systems

  • An Application-driven Evaluation of the Convex SPP-1200. Radhika Thekkath, Amit Pal Singh, Jaswinder Pal Singh, Susan John and John Hennessy. In Proceedings of 11th International Parallel Processing Symposium, April 1997.

  • A Methodology and an Evaluation of the SGI Origin2000. Dongming Jiang and Jaswinder Pal Singh. In Proceedings of SIGMETRICS Conference on Measurement and Modeling of Computer Systems, June 1998.
  • Both papers show that while hardware cache coherence in real modern implementations mostly meets its promise, some implementation tradeoffs rather than architectural weaknesses can hurt performance dramatically; in particular, the sharing of some key communication resources and some constraints placed upon them. In the process, they (especially the latter) describe and illustrate a methodology for doing such evaluations, and identify applications for which commercial hardware coherent systems don't work so well.
     
     

    5. Software Shared Memory: Protocols, Systems and Evaluation

    5(a). Performance Evaluation and Bottleneck Analysis

  • Understanding Application Performance on Shared Virtual Memory Systems. Liviu Iftode, Jaswinder Pal Singh and Kai Li. In Proceedings of the Twenty-second International Symposium on Computer Architecture, May 1996.
  • First comprehensive study of page-based shared virtual memory (SVM) performance using standard applications, including a comparison with hardware-coherent systems to see how far off SVM is. Shows that a hardware-supported "home-based" protocol substantially outperforms the dominant all-software non-home protocol, and why. Finds that performance is far from that of hardware coherence for irregular applications, and identifies the key bottlenecks, such as synchronization and the serialization caused by the dilation of critical sections due to expensive page faults inside them.
  • Evaluation of Hardware Support for Automatic Update in Shared Virtual Memory Clusters. Angelos Bilas, Liviu Iftode and Jaswinder Pal Singh. In Proceedings of the International Conference on Supercomputing, July 1998.
  • Separates the effects of hardware support from those of home versus non-home based protocols. Shows that the hardware support is not so crucial, so the latter is the key, and that this form of hardware support is not likely to be so attractive in the future, especially for off-the-shelf network interfaces. Helped us decide not to provide this hardware support in the next-generation system at Princeton.
  • Limits to the Performance of Software Shared Memory: A Layered Approach. Jaswinder Pal Singh, Angelos Bilas, Dongming Jiang and Yuanyuan Zhou. In Proc. Symposium on High Performance Computer Architecture (HPCA), Jan 1998.
  • Studies the limitations to improving the performance of both page-grained and fine-grained approaches to software shared memory as the three layers that stand between a user and performance (the application, protocol and communication layers) are improved in realistic and idealized ways, both individually and in concert. Shows the complex synergies in the page-grained SVM approach, and comments on how different sources of major improvement match with technological trends.

    5(b). Memory Consistency Models and Related Protocols

  • Scope Consistency: A Bridge between Release Consistency and Entry Consistency. Liviu Iftode, Jaswinder Pal Singh and Kai Li. In Proceedings of the Symposium on Parallel Algorithms and Architectures, June 1996. Selected for journal publication (see next).

  • Scope Consistency: A Bridge between Release Consistency and Entry Consistency. Liviu Iftode, Jaswinder Pal Singh and Kai Li. Theory of Computing Systems.
  • Introduces a consistency model to obtain the benefits of associating data with synchronization without the programming burden of Entry Consistency. Has been implemented since by other researchers at Rice University. Also proposes the idea of building home-based protocols entirely in software, which has been shown to outperform earlier all-software systems.

    5(c). Fine-grain versus Coarse-grain Software Shared Memory

  • Relaxed Consistency and Coherence Granularity in DSM Systems: A Performance Evaluation. Y. Zhou, L. Iftode, J.P. Singh, K. Li, B. R. Toonen, I. Schoinas, M.D. Hill and D.A. Wood. In Proceedings of the 6th ACM Symposium on Principles and Practice of Parallel Programming, June 1997.
  • A key question for software shared memory is whether coherence should be maintained at coarse grain or fine grain. Coarse grain typically requires sophisticated consistency models to reduce false sharing, but can amortize protocol and communication overhead. This paper explores the space of consistency and granularity, using a real platform with hardware access control. Shows that relaxed consistency can regain the performance losses in cases where these exist, and that contrary to previous results which did not look at irregular applications, multiple-writer protocols are very important at coarse grain.
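The multiple-writer support discussed here is commonly implemented with twinning and diffing: before a process first writes a page, the runtime saves a pristine copy (the twin), and at a synchronization point it compares the page against the twin to extract only the modified locations, so concurrent writers to different parts of a falsely shared page never clobber each other's updates. A minimal, hypothetical sketch of that mechanism (bytearrays standing in for pages; all names illustrative):

```python
# Sketch of twin/diff multiple-writer coherence at page granularity.
# bytearrays stand in for pages; names and sizes are illustrative.

def make_twin(page):
    """Save a pristine copy before the first write to the page."""
    return bytes(page)

def compute_diff(twin, page):
    """Return (offset, new_byte) pairs for every modified location."""
    return [(i, page[i]) for i in range(len(page)) if page[i] != twin[i]]

def apply_diff(page, diff):
    """Merge one writer's changes into another copy of the page."""
    for offset, value in diff:
        page[offset] = value

# Two writers modify disjoint parts of the same (falsely shared) page.
home = bytearray(8)
copy_a, copy_b = bytearray(home), bytearray(home)
twin_a, twin_b = make_twin(copy_a), make_twin(copy_b)
copy_a[0] = 1          # writer A touches the front of the page
copy_b[7] = 2          # writer B touches the back of the page
for d in (compute_diff(twin_a, copy_a), compute_diff(twin_b, copy_b)):
    apply_diff(home, d)
print(list(home))
```

Because only the diffs propagate, a single-writer invalidation per false-sharing conflict is avoided, which is why the paper finds multiple-writer protocols so important at coarse grain for irregular applications.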

    5(d). Using Multiprocessor Rather than Uniprocessor Nodes

  • Supporting a Coherent Shared Address Space Across SMP Nodes: An Application-driven Investigation. Angelos Bilas, Liviu Iftode and Jaswinder Pal Singh. In Proceedings of IMA Workshop on Algorithms for Parallel Processing, Nov. 1996.

  • Home-based Shared Virtual Memory Across SMP Nodes. Rudrojit Samanta, Angelos Bilas, Liviu Iftode and Jaswinder Pal Singh. In Proceedings of the Fourth International Symposium on High Performance Computer Architecture (HPCA), 1998.
  • Clusters are increasingly built out of hardware-coherent multiprocessors rather than uniprocessor nodes. This has potential advantages for software shared memory, since much of the communication may be contained within a node. These papers propose, implement and evaluate protocols that exploit the hardware coherence within a node as much as possible, reducing software intervention while still being lazy.
  • Shared Virtual Memory Across SMP Nodes Using Automatic Update: Protocols and Performance. Angelos Bilas, Liviu Iftode, and J.P. Singh. Princeton University Technical Report, TR-517-96. Sixth Workshop on Scalable Shared-Memory Multiprocessors, Oct 5, 1996.
  • An older, simulation-based paper about SMP-node protocols. This one used an automatic update based protocol, and mapped all pages write-through. Replaced by the papers above.

    5(e). Smart Network Interface Support and Related Protocols

  • Using Commodity Network Interfaces to Accelerate Shared Virtual Memory on Clusters. Angelos Bilas, Cheng Liao, and Jaswinder Pal Singh. Proc. International Symposium on Computer Architecture, 1999.
  • Develops protocols and a system that avoid both interrupts and polling by the processor, using only general, commodity network interface support and communication-layer mechanisms mostly within the VIA API. This is not the kind of hardware support provided in the SHRIMP NI (or that used in Cashmere): there are no protocol-specific mechanisms and no protocol processing in the NI. Prototyped on an SMP-node Myrinet cluster. Reorganizes the protocol and its laziness to take advantage of explicit remote fetch, remote deposit and mutex support in the NI. By exploiting all three layers (application, protocol, communication), finally gets close to hardware shared-address-space performance on a moderate-scale system for many real applications. Identifies the key remaining bottlenecks that need to be addressed to scale up to larger systems.
    See also paper in Performance Portability and Scalability section that examines scalability of a protocol based on this to larger processor counts.

    5(f). Survey

  • Shared Virtual Memory: Progress and Challenges. Liviu Iftode and Jaswinder Pal Singh. To appear in a Special Issue of Proceedings of the IEEE, 1998.
  • Places threads of research from different groups so far in a common framework, assesses where we are, and argues that applications, application bottleneck-driven protocol improvements, understanding hardware support and tools are key research areas for the near future.
     
     

    6. Sequential Algorithms For Computational Biology and Medicine

    6(a). Probabilistic Inference

  • Complexity Reduction in BN2O Networks Using Similarity of States. Alexander Kozlov and Jaswinder Pal Singh. In Proceedings of Uncertainty in Artificial Intelligence, 1996.
  • Shows how dramatic performance improvements can be achieved by taking advantage of problem structure. Applies to a host of diagnosis problems. The basic insight has also been used later for handling continuous variables in inference.
  • Sensitivities as an Alternative to Conditional Probabilities for Probabilistic Inference. Alexander Kozlov and Jaswinder Pal Singh. In Proceedings of Uncertainty in Artificial Intelligence, August 1995.
  • Proposes a new representational framework for inference, with computational and intuitive advantages.

    6(b). Protein Structure Determination from Constraints

  • Probabilistic Constraint Satisfaction with Non-Gaussian Constraint Noise. Russ Altman, Cheng Chen, Bill Poland and Jaswinder Pal Singh. In Proceedings of Uncertainty in Artificial Intelligence, 1994.
  • Extends the existing least-squares probabilistic approach (to obtain 3-d structure from experimental and theoretical constraints) to handle constraints that are not Gaussian. Does this by converting constraints to a mixture of Gaussians using expectation-maximization techniques, and then extending the algorithm by branching and combination to handle them.
  • Hierarchical Organizations of Molecular Structure Computations. Cheng Chen, Jaswinder Pal Singh and Russ Altman. In Proc. RECOMB'98. Extended version to appear in Journal of Computational Biology.
  • Shows that hierarchical approaches that take advantage of structure (either knowledge-based or automatically constructed) achieve dramatic performance gains (measured 50-100x on 30S Ribosomal subunit) over flat approaches. Enables much larger problems to be solved by practitioners in the domain (see SC paper above). Examines different automatic heuristics and manual techniques for generating hierarchy, and shows that former can work very well if done right.
     
     

    7. Miscellaneous

    A Model of Workloads and its use in Miss-rate Prediction for Fully Associative Caches. Jaswinder Pal Singh, Harold S. Stone and Dominique Thiebaut. IEEE Transactions on Computers, July 1992.

    Soliton Phase Shifts in a Dissipative Lattice. Nayeem Islam, Jaswinder Pal Singh and Kenneth Steiglitz. In Journal of Applied Physics, August 1987.