About Me
I'm a graduate student in Computer Science at Princeton University working with Prof. J.P. Singh on parallel programming models and performance modeling.
My fascination with parallel computers started during college at MIT, where I worked with the Cilk and RAW projects. After graduation I worked for a few years at Akamai Technologies, where I learned how to apply algorithms to real problems and build solid applications running on thousands of machines.
Before going to college I attended the Computer Science High School in Bucharest, Romania. During this period I participated in several CS olympiads, winning the 1st prize three times at the national level and two gold medals at the international level, at IOI'93 in Argentina and IOI'94 in Sweden.
Hobbies and extra-curriculars include climbing, mountaineering, Shotokan karate and taekwondo, chess, and photography.
Research Interests
Currently I'm working on parallel programming models and performance modeling of hierarchical interconnects (see the Coarse Grain Dataflow home page). Previous related work includes:
- B.S. at MIT: implementation and performance tuning of Cilk parallel applications
- M.Eng. at MIT: developed a coarse-grain parallel API for the 4x4 multi-processor RAW chip
- worked with the Geophysical Fluid Dynamics Laboratory (GFDL), affiliated with Princeton University, on understanding performance bottlenecks and parallel programming issues for real-world scientific applications developed under the FMS / MOM4 project
The Cilk programming language is an elegant C extension that exposes parallelism through spawned recursive calls. The RAW chip API I worked on emulates Cilk semantics using C macros and function calls; it embeds the call tree into the 2D RAW mesh to keep parent-child distances short.
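As a small illustration (a sketch in MIT Cilk-5 style syntax, not code from any of these projects), the classic fib example shows how spawn and sync expose recursive parallelism:

```c
/* Sketch in MIT Cilk-5 style syntax (illustrative only; exact headers
 * and build setup depend on the Cilk release). */
cilk int fib(int n)
{
    int x, y;
    if (n < 2)
        return n;
    x = spawn fib(n - 1);   /* the child may run in parallel with the parent */
    y = spawn fib(n - 2);
    sync;                   /* wait for both children before using x and y */
    return x + y;
}
```

The RAW API mentioned above expresses the same spawn semantics through C macros and function calls instead of language keywords, and additionally decides where in the 2D mesh each child runs.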
Based on this previous experience, and after working with GFDL and understanding some of their performance issues, we started developing a new high-level parallel programming model guided by a few objectives:
- ease of writing applications
- ease of writing efficient code
- optimizations within library, not at application level
- good performance portability
The overall philosophy is to expose application parallelism at a high level, and to use the most efficient communication primitives at the low level. This way, most of the achievable performance comes with minimal user involvement. A potential tradeoff is narrower problem coverage.
Alternatively, with lower-level primitives users do more of the work themselves and fine-tune their application for these libraries. Unfortunately, this can be time consuming and does not give the best performance portability. OpenMP and PGAS languages are a good alternative for reducing programming effort: using local and global arrays, and copying data when needed, is an elegant replacement for explicit messaging. However, good performance still does not come easily.
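As a rough illustration of the lower-level vs. higher-level tradeoff (a sketch for comparison only, not code from the model described here), here is the same global sum written with MPI, where the user partitions the data and invokes the collective explicitly, and with an OpenMP directive, where the runtime handles work distribution and the reduction:

```c
/* Illustrative sketch: the same global sum at two levels of abstraction. */
#include <mpi.h>

/* Lower level: the user partitions the data across processes and calls
 * the collective explicitly. */
double mpi_sum(const double *local, int n)
{
    double partial = 0.0, total = 0.0;
    for (int i = 0; i < n; i++)
        partial += local[i];
    MPI_Allreduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return total;
}

/* Higher level: shared-memory OpenMP; parallelism is a directive and the
 * runtime handles work distribution and the reduction. */
double omp_sum(const double *data, int n)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++)
        total += data[i];
    return total;
}
```

The OpenMP version is clearly less work to write, but its performance depends on how the runtime maps threads and data onto the machine, which is exactly where performance portability becomes hard.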
My 2c
| Technology: Link | Technology: Silicon | Architecture: Interconnect | Architecture: HW Support | Programming Model: Lower Level | Programming Model: Higher Level | Application Development |
|---|---|---|---|---|---|---|
| DDR 800, async 2500 | 45, 65, 90 nm | NumaLink, Quadrics, Infiniband, Ethernet, FSB, HT | CC-SAS, network ASIC, CPU | MPI, SHMEM, RDMA, pthreads | OpenMP, CoArray, UPC, PGAS | NPB, SPLASH, PARSEC, MOM4 |

| Performance Model | Algorithm Design |
|---|---|
| PRAM, D-BSP, LogP | FFT, PDE, LP |
These tables are by no means comprehensive; the examples are given to clarify the meaning of each category.
One core issue in parallel programming is obtaining good performance easily. Unfortunately, good performance requires fine-tuning and understanding how the programming model is implemented on the underlying architecture. Many times this is non-intuitive or impossible to predict, not to mention machine dependent.
For an algorithm designer it is hard to understand why making the program slower by adding more operations can result in a faster execution (when contention occurs), or why a linear algorithm can exhibit T(2n) = 10 T(n) (large messages generate contention). Moreover, performance models pay close attention to the in-flight message latency as a function of topology, when in practice it is insignificant compared to the overall communication time (typically 100-300 ns vs. 30+ µs).
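To make the last point concrete, using the illustrative figures quoted above, the topology-dependent wire latency is on the order of one percent of the end-to-end communication time:

```latex
% Back-of-the-envelope ratio using the figures quoted above (illustrative).
\[
\frac{L_{\text{wire}}}{T_{\text{comm}}} \approx \frac{0.3\ \mu\text{s}}{30\ \mu\text{s}} \approx 1\%
\]
```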
On the other hand, architectures are changed to make existing applications faster, and libraries are fine-tuned to take advantage of the improved hardware. Increasing bandwidths and decreasing latencies will help, but they will not address the fundamental message contention and cache-line sharing issues, which are not well understood from the layers above. Alternatively, working towards an architecture with predictable performance could make programmers' lives a lot easier.
In general, development happens on each layer independently, starting from the layer below and using existing benchmarks from the layer above; one reason is the availability of those benchmarks and familiarity with existing models. However, considering the end-to-end argument, it is possible that changing the architecture, programming model, and application at the same time will hit the jackpot. In particular, this includes hardware support that significantly improves application bottlenecks intelligently exposed by the programming model.