Summary for Final Submission
============================

Preamble
--------

You are required to demonstrate three things for the final submission: (1) good performance, (2) due diligence and (3) full understanding of the parallel behavior.

To demonstrate (1) good performance, you must be able to show that your program achieves the maximum theoretical speedup on up to 64 CPUs. Ideally the speedup is linear, but in practice you may encounter issues that limit the parallel performance. If the program scales worse than linearly, you must show that this is not a limitation of your parallel algorithm. You can do so by arguing that some parts of the code are inherently serial (use time measurements and Amdahl's Law for this; the parallel part is still expected to scale linearly) or by showing that you have reached limitations of the hardware and the program has become memory- or I/O-bound. Even in that case you must achieve good speedups, or you have simply chosen the wrong program for this project.

To demonstrate (2) due diligence, you are required to give sufficient proof that you did what you could to ensure the program is free of data races, deadlocks and other common issues of parallel programs. This includes, for example, the output of the Helgrind data race checker (see below).

To demonstrate (3) full understanding, you must be able to fully explain what the program does and why it behaves the way it does. This includes your ability to correctly interpret unusual behavior which your program might exhibit in practice. Do not confuse this with understanding the algorithm itself or the serial version of your program: if your explanations apply equally to the serial and the parallel version, they are not relevant for this course.

Program Submission
------------------

You are required to submit the source code of the program version as well as any inputs which you used for your measurements.
The source code must have the following properties:

- The program is parallelized with pthreads and is the one you used to obtain all performance numbers which you present.
- An automated build system (e.g. a Makefile) is included which controls the build process.
- The program builds on niagara.cs.princeton.edu, hbar.cs.princeton.edu and hecate.princeton.edu without requiring any further modifications.
- The number of threads to use can be passed to the program via command line arguments. No recompilation is necessary.
- The program exists in a serial and a parallel version. A version is serial if it does not require inclusion of the pthread.h header file and does not link against the pthreads library or any other threading system. You can provide a serial version as a second copy of the source code (or of the individual files which you modified), but for smaller changes it is more common to simply #define a macro which controls the inclusion of calls to thread primitives and any other code modifications required by the parallel program. This way you maintain only one version of the code, to which your parallelization can be added if desired.
- If you use any instrumentation results in your arguments, you are required to leave the instrumentation code which you used to collect the data in place. The instrumentation code must be disabled by default, and it must be possible to enable it by #defining the right macro.

Part 1 - Usage
--------------

(Note: If you have not changed anything since your first submission, you can simply copy your explanations from there.)

1a.) How do you compile your program? List all necessary steps.

1b.) How do you run your program? Give an executable command with all arguments.

1c.) Which input did you choose for your performance measurements? Justify why your chosen input stresses all parts of your program in a way that resembles "real" use for its intended purpose (i.e. not for performance measurements).

1d.)
List any macros that control how to build your program, or other relevant information.

Part 2 - Performance
--------------------

2.) Complete the following table to show the performance of the program with your chosen input on hecate and hbar. Parallelization overhead is the execution time of the parallel version on 1 CPU divided by the execution time of the serial version.

Machine: hecate.princeton.edu

    #CPUs    Execution Time / [s]    Speedup
        1
        2
        4
        8
       16
       32
       64

    Execution time of serial version:
    Parallelization overhead:

Machine: hbar.cs.princeton.edu

    #CPUs    Execution Time / [s]    Speedup
        1
        2
        4
        8

    Execution time of serial version:
    Parallelization overhead:

Part 3 - Due Diligence
----------------------

3a.) Show that your current parallelization has no automatically detectable data races. To do so, run the submitted version of your program with Helgrind and 8 threads. Do not forget to compile with debugging symbols (option -g of gcc). Submit the unmodified output of Helgrind in a separate file 'helgrind.txt'. For each flagged memory region in your program, explain why it cannot be a data race by describing the locking protocol and giving the name and source code location of the associated lock(s). You may skip memory regions which do not belong to your program. No more than 1-3 sentences per memory region.

3b.) Show that your current parallelization is unlikely to have deadlocks and data races. To do so, write a small script 'check.sh' which repeatedly runs your program and compares its output to a reference output obtained from the serial version. Both outputs should be identical if you created them on the same machine. You can use the template which we provide as a starting point. Submit only the script 'check.sh' and make sure it runs on both hecate and hbar. (Hint: Run your program many times with a large number of threads to be certain to detect any issues.
To increase the probability of finding data races or deadlocks, you can run it on different machines and inject random noise by sporadically running computationally intensive programs in the background. This will force your program into different execution paths.)

3c.) Show that your current parallelization has no automatically detectable hotlocks. To do so, run the submitted version of your program with pthreadw and 8 threads. Do not forget to compile with pthreadw support (see the readme file of pthreadw). Submit the unmodified output of pthreadw in a separate file 'pthreadw.txt'. Your program will be considered free of hotlocks if no lock uses more than 5% of the total CPU time. (Hint: Locks will be listed from most time-intensive to least time-intensive, so you only have to check the first lock listed for each type.)

Part 4 - Understanding
----------------------

4a.) Describe your parallelization. No more than 200 words.

4b.) List all shared data structures. Give the file names and line numbers where they are defined in the source code, as well as a brief explanation of their purpose. No more than 1-3 sentences per data structure.

4c.) Explain any speedup differences between hbar and hecate which you might have measured. No more than 100 words. (Hint: Hecate is a ccNUMA machine. Hbar uses CMPs which are arranged in an SMP configuration.)

Part 5 - Further Explanations
-----------------------------

5a.) Learning from other people is fine and was explicitly allowed, but proper credit must be given. Please list all publications and previous work which you have used for your parallelization project.

5b.) If there is anything else you have not mentioned yet which you think we should know, add it here.