HW5: Dataflow Analysis and Optimizations
Getting Started
To get started, accept the assignment on Github Classroom and clone your team’s repository.
Many of the files in this project are taken from earlier projects. Only the
new files and their uses are listed below; those marked with * are
the only ones you should need to modify while completing this assignment.
bin/datastructures.ml | set and map modules (enhanced with printing)
bin/cfg.ml | “view” of LL control-flow graphs as dataflow graphs
bin/analysis.ml | helper functions for propagating dataflow facts
bin/solver.ml* | generic worklist solver for dataflow equations (Task I)
bin/alias.ml* | skeleton of the alias analysis (Task II)
bin/dce.ml* | skeleton of the dead code elimination pass (Task II)
bin/constprop.ml* | skeleton of the constant propagation pass (Task III)
bin/liveness.ml | provided liveness analysis code
bin/analysistests.ml | test cases (for liveness, constprop, alias)
bin/opt.ml | optimizer that runs dce and constprop
bin/backend.ml* | backend, extended to make use of liveness information (Task IV)
bin/registers.ml | collects statistics about register usage
bin/printanalysis.ml | a standalone program to print the results of an analysis
PerformanceExperiments.xlsx* | spreadsheet for reporting your timing results (Task V)
The shared, public git submodule to which you should push your public test
case is structured as shown below (spXX is “spring 20XX”, e.g. sp26 is
“spring 2026”):
spXX_hw5_tests/ | root of the shared test submodule
spXX_hw5_tests/spXX_hw5_tests.ml |
spXX_hw5_tests/regalloctest3.oat | an example test case
spXX_hw5_tests/*.oat | your team’s test case(s)
Note
You’ll need to have menhir and clang installed on your system for this assignment. If you have not already done so, follow the provided instructions to install them.
Note
As usual, running oatc --test will run the test suite. oatc
now also supports several new flags related to optimization.
-O1 : runs two iterations of (constprop followed by dce)
--liveness {trivial|dataflow} : select which liveness analysis to use for register allocation
--regalloc {none|greedy|better} : select which register allocator to use
--print-regs : print a histogram of the registers used
Overview
The Oat compiler we have developed so far produces very inefficient code, since it performs no optimizations at any stage of the compilation pipeline. In this project, you will implement several simple dataflow analyses and some optimizations at the level of our LLVMlite intermediate representation in order to improve code size and speed. (A full-blown compiler would also do further optimization at the x86 level, but we will not pursue that here.)
Provided Code
The provided code makes extensive use of modules, module signatures, and functors. These aid in code reuse and abstraction. If you need a refresher on OCaml functors, we recommend reading through the Functors Chapter of Real World OCaml.
In datastructures.ml, we provide you with a number of useful modules,
module signatures, and functors for the assignment, including:
OrdPrintT: A module signature for a type that is both comparable and can be converted to a string for printing. This is used in conjunction with some of our other custom modules described below. Wrapper modules Lbl and Uid satisfying this signature are defined later in the file for the Ll.lbl and Ll.uid types.
SetS: A module signature that extends OCaml’s built-in sets to include string conversion and printing capabilities.
MakeSet: A functor that creates an extended set (SetS) from a type that satisfies the OrdPrintT module signature. This is applied to the Lbl and Uid wrapper modules to create a label set module LblS and a UID set module UidS.
MapS: A module signature that extends OCaml’s built-in maps to include string conversion and printing capabilities. Three additional helper functions are also included: update for updating the value associated with a particular key, find_or for performing a map look-up with a default value to be supplied when the key is not present, and update_or for updating the value associated with a key if it is present, or adding an entry with a default value if not.
MakeMap: A functor that creates an extended map (MapS) from a type that satisfies the OrdPrintT module signature. This is applied to the Lbl and Uid wrapper modules to create a label map module LblM and a UID map module UidM. These map modules have fixed key types, but are polymorphic in the types of their values.
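As a rough illustration of this pattern, here is a minimal sketch of the OrdPrintT / MakeSet / MakeMap idea using only the standard library. The signature and module names follow the text above, but the details (the exact to_string format, the argument order of find_or) are invented for illustration and may differ from datastructures.ml:

```ocaml
(* A sketch of the OrdPrintT / MakeSet / MakeMap pattern, not the real code. *)
module type OrdPrintT = sig
  type t
  val compare : t -> t -> int
  val to_string : t -> string
end

module MakeSet (O : OrdPrintT) = struct
  include Set.Make (O)
  (* extend the built-in set with printing *)
  let to_string (s : t) : string =
    "{" ^ String.concat ", " (List.map O.to_string (elements s)) ^ "}"
end

module MakeMap (O : OrdPrintT) = struct
  include Map.Make (O)
  (* look up key k, returning default d when absent, in the spirit of
     MapS's find_or (argument order here is a guess) *)
  let find_or (d : 'a) (m : 'a t) (k : key) : 'a =
    match find_opt k m with Some v -> v | None -> d
end

(* A wrapper module for uids, analogous to the Uid wrapper for Ll.uid *)
module Uid = struct
  type t = string
  let compare = String.compare
  let to_string u = "%" ^ u
end

module UidS = MakeSet (Uid)
module UidM = MakeMap (Uid)

let () =
  print_endline (UidS.to_string (UidS.of_list ["x"; "y"]))  (* {%x, %y} *)
```

Because Set.Make and Map.Make only require type t and compare, any module satisfying OrdPrintT can be passed to both functors.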
Task I: Dataflow Analysis
Your first task is to implement a version of the worklist algorithm for solving dataflow flow equations presented in lecture. Since we plan to implement several analyses, we’d like to reuse as much code as possible between each one. In lecture, we saw that each analysis differs only in the choice of the abstract domain, the flow function, the direction of the analysis, and how to compute the meet of facts flowing into a node. We can take advantage of this by writing a generic solver as an OCaml functor and instantiating it with these parameters.
The Algorithm
Assuming only that we have a directed graph where each node’s in and out
edges are labeled with a dataflow fact and each node has a flow function,
we can compute a fixpoint of the flow on the graph as follows:
let w = new set with all nodes
repeat until w is empty
let n = w.pop()
old_out = out[n]
let in = combine({ out[p] for each p in preds[n] })
out[n] := flow[n](in)
if (!equal old_out out[n]),
for all m in succs[n], w.add(m)
end
Here equal, combine and flow are abstract operations that will be
instantiated with abstract domain equality, the meet operation and the flow function
(e.g., as might be defined by the gen and kill sets of the analysis),
respectively. Similarly, preds and succs are the graph predecessors and
successors in the flow graph, and so are not necessarily in one-to-one
correspondence with the edges found in the LL IR control-flow graph of the
program. These general
operations can be instantiated appropriately to create a forwards or backwards
analysis.
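As a concrete (and hypothetical) rendering of the pseudocode, here is a self-contained sketch over a toy graph whose nodes are ints and whose facts are sets of ints. Every name here (graph, solve, the record fields) is invented for illustration; your solver must instead work through the DFA_GRAPH and FACT interfaces in solver.ml:

```ocaml
(* Toy worklist fixpoint; facts are sets of node ids. *)
module IntSet = Set.Make (Int)

type graph = {
  nodes : int list;
  preds : int -> int list;
  succs : int -> int list;
  flow : int -> IntSet.t -> IntSet.t;    (* per-node transfer function *)
  combine : IntSet.t list -> IntSet.t;   (* meet of the facts flowing in *)
}

let solve (g : graph) : (int, IntSet.t) Hashtbl.t =
  let out = Hashtbl.create 16 in
  List.iter (fun n -> Hashtbl.replace out n IntSet.empty) g.nodes;
  (* start with every node on the worklist *)
  let w = ref (IntSet.of_list g.nodes) in
  while not (IntSet.is_empty !w) do
    let n = IntSet.choose !w in
    w := IntSet.remove n !w;
    let old_out = Hashtbl.find out n in
    let in_fact = g.combine (List.map (Hashtbl.find out) (g.preds n)) in
    Hashtbl.replace out n (g.flow n in_fact);
    (* if the fact changed, successors must be revisited *)
    if not (IntSet.equal old_out (Hashtbl.find out n)) then
      w := List.fold_left (fun acc m -> IntSet.add m acc) !w (g.succs n)
  done;
  out

(* Forward example on the chain 1 -> 2 -> 3: each node adds its own id
   to the incoming fact, and combine is set union. *)
let g =
  { nodes = [1; 2; 3];
    preds = (function 2 -> [1] | 3 -> [2] | _ -> []);
    succs = (function 1 -> [2] | 2 -> [3] | _ -> []);
    flow = (fun n inset -> IntSet.add n inset);
    combine = List.fold_left IntSet.union IntSet.empty }

let () =
  let out = solve g in
  assert (IntSet.elements (Hashtbl.find out 3) = [1; 2; 3])
```

A backward analysis such as liveness is obtained by instantiating preds and succs with the reversed edges, which is exactly the kind of flexibility the DFA_GRAPH signature provides.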
Note
Don’t try to use OCaml’s polymorphic equality operator (=) to compare
old_out and out[n]: it compares the concrete representations of the two
values, and semantically equal facts (such as sets or maps) can have
different internal structures. Use the supplied Fact.compare instead.
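A small illustration of why (=) is unreliable here: two sets with the same elements, built by different sequences of operations, can end up with different internal balanced trees, so only the set module’s own equality is trustworthy:

```ocaml
module S = Set.Make (Int)

let a = S.of_list [1; 2; 3]
let b = S.remove 0 (S.of_list [0; 3; 2; 1])

(* Both sets contain {1, 2, 3}. S.equal compares contents and is always
   true here, while (=) compares the underlying trees, whose shapes can
   differ depending on the order of insertions and removals. *)
let () = assert (S.equal a b)
```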
Getting Started and Testing
Be sure to review the comments in the DFA_GRAPH (data flow analysis graph)
and FACT module signatures in solver.ml, which define the parameters of
the solver. Make sure you understand what each declaration in the signature does
– your solver will need to use each one (other than the printing functions)!
It will also be helpful for you to understand the way that cfg.ml connects
to the solver. Read the commentary there for more information.
Now implement the solver
Your first task is to fill in the solve function in the Solver.Make
functor in solver.ml. The input to the function is a flow graph labeled
with the initial facts. It should compute the fixpoint and return a graph with
the corresponding labeling. You will find the set datatype from
datastructures.ml useful for manipulating sets of nodes.
To test your solver, we have provided a full implementation of a liveness
analysis in liveness.ml. Once you’ve completed the solver, the liveness
tests in the test suite should all be passing. These tests compare the output
of your solver on a number of programs with pre-computed solutions in
analysistests.ml. Each entry in this file describes the set of uids that
are live-in at a label in a program from ./llprograms. To debug,
you can compare these with the output of the Graph.to_string function on
the flow graphs you will be manipulating.
printanalysis
The stand-alone program printanalysis can print out the results of a
dataflow analysis for a given .ll program. You can build it by doing
make printanalysis. It takes a flag to indicate which analysis to run
(run with --h for a list).
For example, once the solver is implemented correctly, you can use it to
display the results of liveness analysis for the add.ll program like this:
cis4521:/hw5> ./printanalysis -live llprograms/add.ll
define i64 @main(i64 %argc, i8** %arcv) {
{_entry={}
IN : {}
%1 = add i64 5, 9
OUT: {%1}
IN : {%1}
ret i64 %1
OUT: {}
}
}
Task II: Alias Analysis and Dead Code Elimination
The goal of this task is to implement a simple dead code elimination
optimization that can also remove store instructions when we can prove that
they have no effect on the result of the program. Though we already have a
liveness analysis, it doesn’t give us enough information to eliminate store
instructions: even if we know the UID of the destination pointer is dead after a
store and is not used in a load in the rest of the program, we can not remove a
store instruction because of aliasing. The problem is that there may be
different UIDs that name the same stack slot or heap location. There are a
number of ways this can happen after a pointer is returned by alloca:
- The pointer is used as an argument to a getelementptr or bitcast instruction
- The pointer is stored into memory and then later loaded
- The pointer is passed as an argument to a function, which can manipulate it in arbitrary ways
Some pointers are never aliased. For example, the code generated by the Oat
frontend for local variables never creates aliases because the Oat language
itself doesn’t have an “address of” operator. We can find such uses of
alloca by applying a simple alias analysis.
Alias Analysis
We have provided some code to get you started in alias.ml. You will have
to fill in the flow function and abstract domain operations. The type of abstract domain
elements, fact, is a map from UIDs to symbolic pointers of type
SymPtr.t. Your analysis should compute, at every program point, the set of
UIDs of pointer type that are in scope and, additionally, whether that pointer
is the unique name for a stack slot according to the rules above. See the
comments in alias.ml for details.
- Alias.insn_flow: the flow function over instructions
- Alias.Fact.combine: the combine function for alias facts
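As a sketch of the style of reasoning involved, here is a toy flow function over a simplified instruction type and a two-point abstract domain. The domain and all names here are illustrative only; the real analysis works over Ll instructions and the richer SymPtr.t domain described in alias.ml:

```ocaml
(* Toy alias flow: a pointer is UniqueStack until it escapes. *)
module M = Map.Make (String)

type sym = UniqueStack | MayAlias

type insn =
  | Alloca of string            (* %uid = alloca ...             *)
  | Gep of string * string      (* %uid = getelementptr ... %src *)
  | Bitcast of string * string  (* %uid = bitcast ... %src       *)
  | Store of string             (* store ... %ptr : %ptr escapes *)
  | Call of string list         (* pointer arguments escape      *)

let escape (p : string) (d : sym M.t) : sym M.t =
  if M.mem p d then M.add p MayAlias d else d

let flow (i : insn) (d : sym M.t) : sym M.t =
  match i with
  | Alloca u -> M.add u UniqueStack d       (* fresh, unaliased stack slot *)
  | Gep (u, src) | Bitcast (u, src) ->
      (* the result names the same memory as src: neither is unique now *)
      M.add u MayAlias (escape src d)
  | Store p -> escape p d
  | Call ps -> List.fold_left (fun d p -> escape p d) d ps

let () =
  let d = List.fold_left (fun d i -> flow i d) M.empty
            [Alloca "a"; Alloca "b"; Gep ("g", "a")] in
  assert (M.find "a" d = MayAlias);    (* aliased via %g *)
  assert (M.find "b" d = UniqueStack)  (* never escapes  *)
```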
Dead Code Elimination
Now we can use our liveness and alias analyses to implement a dead code elimination pass. We will simply compute the results of the analysis at each program point, then iterate over the blocks of the CFG removing any instructions that do not contribute to the output of the program.
- For all instructions except store and call, the instruction can be removed if the UID it defines is not live-out at the point of definition
- A store instruction can be removed if we know the UID of the destination pointer is not aliased and not live-out at the program point of the store
- A call instruction can never be removed
Complete the dead-code elimination optimization in dce.ml, where you will
only need to fill out the dce_block function that implements these rules.
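The three rules above can be sketched as a predicate over a toy instruction type. Here live_out and unique stand in for the liveness and alias analysis results, and all names are invented; dce_block operates on real Ll instructions and the per-program-point facts instead:

```ocaml
(* Toy rendering of the removal rules. *)
type insn =
  | Def of string     (* any non-store, non-call instruction defining %uid *)
  | Store of string   (* store through destination pointer %uid            *)
  | Call of string    (* call defining %uid                                *)

let can_remove ~(live_out : string -> bool) ~(unique : string -> bool)
    (i : insn) : bool =
  match i with
  | Call _ -> false                          (* calls may have side effects  *)
  | Store p -> unique p && not (live_out p)  (* unaliased and dead: no effect *)
  | Def u -> not (live_out u)                (* result is never used         *)

let () =
  let dead _ = false and unique _ = true in
  assert (can_remove ~live_out:dead ~unique (Def "x"));
  assert (can_remove ~live_out:dead ~unique (Store "p"));
  assert (not (can_remove ~live_out:dead ~unique (Call "f")))
```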
Task III: Constant Propagation
Programmers don’t often write dead code directly. However, dead code is often produced as a result of other optimizations that execute parts of the original program at compile time, for instance constant propagation. In this section you’ll implement a simple constant propagation analysis and constant folding optimization.
Start by reading through constprop.ml. Constant propagation is similar
to the alias analysis from the previous section. Dataflow facts will be maps
from UIDs to the type SymConst.t, which corresponds to the abstract domain from
the lecture slides. Your analysis will compute the set of UIDs in scope at
each program point, and the integer value of any UID that is computed as a
result of a series of binop and icmp instructions on constant
operands. More specifically:
- The flow out of any binop or icmp whose operands have been determined to be constants is the incoming flow with the defined UID set to Const with the expected constant value obtained by (statically) interpreting the operation.
- The flow out of any binop or icmp with a NonConst operand sets the defined UID to NonConst
- Similarly, the flow out of any binop or icmp with an UndefConst operand sets the defined UID to UndefConst
- A store or a call of type Void sets the defined UID to UndefConst
- All other instructions set the defined UID to NonConst
Constant propagation of this form acts a lot like an interpreter: it is a “symbolic” interpreter that can’t always produce an answer. (At this point we could also include some arithmetic identities, for instance optimizing multiplication by 0, but we’ll keep the specification simple.)
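For instance, the abstract evaluation of a binop on two abstract operands might be sketched like this. The tie-break between NonConst and UndefConst when both appear is a design choice in this sketch; follow the comments in constprop.ml for the behavior the tests expect:

```ocaml
(* Abstract domain from the lecture slides, in the style of SymConst.t. *)
type symconst = NonConst | UndefConst | Const of int64

let eval_binop (op : int64 -> int64 -> int64)
    (a : symconst) (b : symconst) : symconst =
  match a, b with
  | Const x, Const y -> Const (op x y)     (* statically interpret the op *)
  | NonConst, _ | _, NonConst -> NonConst
  | _ -> UndefConst

let () =
  assert (eval_binop Int64.add (Const 5L) (Const 9L) = Const 14L);
  assert (eval_binop Int64.mul NonConst (Const 2L) = NonConst);
  assert (eval_binop Int64.add UndefConst (Const 1L) = UndefConst)
```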
Next, you will have to implement the constant folding optimization itself, which just traverses the blocks of the CFG and replaces operands whose values we have computed with the appropriate constants. The structure of the code is very similar to that in the previous section. You will have to fill in:
- Constprop.insn_flow with the rules defined above
- Constprop.Fact.combine with the combine operation for the analysis
- Constprop.cp_block (inside the run function) with the code needed to perform the constant propagation transformation
Note
Once you have implemented constant folding and dead-code elimination, the
compiler’s -O1 flag will ask oatc to optimize your ll code by doing 2
iterations of (constant prop followed by dce). See opt.ml. The -O1
optimizations are not used for testing except that they are always
performed in the register-allocation quality tests – these optimizations
improve register allocation (see below).
This coupling means that a faulty optimization pass can degrade the quality of your register allocation, which in turn can make earning a high score harder.
Task IV: Register Allocation
Warning
This task is by far the hardest part of the project. Allow yourself enough time to complete it!
The backend implementation that we have given you provides two basic register allocation strategies:
none: spills all uids to the stack (as done in prior projects)
greedy: uses registers via a greedy linear-scan algorithm that locally exploits liveness info.
For this task, you will implement a better register allocation strategy that
makes use of the liveness information that you compute in Task I. Most of the
instructions for this part of the assignment are found in backend.ml, where
we have modified the code generation strategy to be able to make use of liveness
information. The task is to implement a single function better_layout that
beats our example “greedy” register allocation strategy. We recommend
familiarizing yourself with the ways that the simple strategies and new
backend-code generator work before attempting to write your own allocator. In
particular, pay attention to the new PMov (parallel move) instruction and
its use in implementing calling conventions.
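For intuition only, here is a hypothetical sketch of the classic linear-scan flavor of allocation that a better_layout could adapt: walk the uids in order of live-interval start, reusing a register once its interval has expired. The intervals, register names, and "spill" marker are all invented for illustration; the real allocator must work with the backend’s uids, liveness facts, and x86 operands:

```ocaml
(* Toy linear scan over (uid, start, stop) live intervals. *)
type interval = { uid : string; start : int; stop : int }

let linear_scan (regs : string list) (ivs : interval list)
    : (string * string) list =
  (* process intervals in order of increasing start point *)
  let ivs = List.sort (fun a b -> compare a.start b.start) ivs in
  let alloc, _, _ =
    List.fold_left
      (fun (alloc, active, free) iv ->
        (* expire intervals that end before this one begins, freeing regs *)
        let expired, active =
          List.partition (fun (i, _) -> i.stop < iv.start) active in
        let free = List.map snd expired @ free in
        match free with
        | r :: free -> ((iv.uid, r) :: alloc, (iv, r) :: active, free)
        | [] -> ((iv.uid, "spill") :: alloc, active, []))
      ([], [], regs) ivs
  in
  List.rev alloc

let () =
  let ivs = [ { uid = "a"; start = 0; stop = 3 };
              { uid = "b"; start = 1; stop = 2 };
              { uid = "c"; start = 4; stop = 5 } ] in
  let res = linear_scan ["rbx"; "rcx"] ivs in
  assert (List.assoc "a" res = "rbx");
  assert (List.assoc "b" res = "rcx");
  assert (List.assoc "c" res = "rcx");  (* a register is reused after %b dies *)
  (* with only one register, %b must be spilled *)
  assert (List.assoc "b" (linear_scan ["rbx"] ivs) = "spill")
```

The point of the sketch is that liveness information is what makes register reuse safe; the better your live intervals, the fewer spills you need.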
The oatc compiler now also supports several additional command-line switches that
can be used to select among different analysis and code generation options for
testing purposes:
--print-regs prints the register usage statistics for x86 code
--liveness {trivial|dataflow} use the specified liveness analysis
--regalloc {none|greedy|better} use the specified register allocator
Note
The flags above do not imply the -O1 flag (despite the fact that we
always turn on optimization for testing purposes when running with
--test). You should enable it explicitly when running oatc from the
command line.
For testing purposes, you can run the compiler with the -v verbose flag
and/or use the --print-regs flag to get more information about how your
algorithm is performing. It is also useful to sprinkle your own verbose
output into the backend.
The goal for this part of the homework is to create a strategy such that code
generated with the --regalloc better --liveness dataflow flags is
“better” than code generated using the simple settings, which are --regalloc
greedy --liveness dataflow. See the discussion about how we compare
register allocation strategies in backend.ml. The “quality” test cases
report the results of these comparisons.
Note
The quality metric for better versus greedy register allocation asks you to minimize the number of memory operations and, secondarily, the overall size of the program.
Of course, your register allocation strategy should produce correct code, so we still perform all of the correctness tests that we have used in previous versions of the compiler. Your allocation strategy should not break any of these tests.
Warning
You cannot earn points for the “quality” tests unless *all* of the correctness tests also pass – correctness is more important than performance.
Note
Because register allocation is necessarily a bit heuristic, it can happen that the greedy algorithm gets lucky and out-performs the better algorithm. You may not want to shoot for a “perfect” score. (Indeed our reference solution for the better allocator fails to beat the greedy algorithm on one test case.)
Task V: Experimentation / Validation
Of course we want to understand how much of an impact your register allocation strategy has on actual execution time. For the final task, you will create a new Oat program that highlights the difference. There are two parts to this task.
Create a test case
Push an Oat program to spXX_hw5_tests git submodule. This program should
exhibit significantly different performance when compiled using the “greedy”
register allocation strategy vs. using your “better” register allocation
strategy with dataflow information. See the file
spXX_hw5_tests/regalloctest3.oat for an uninspired example of such a program.
Yours should be more creative. (For a difficult challenge, try to create an Oat program
for which your register allocation strategy yields a significant speedup, but
for which clang is not able to do much better.)
Evaluate Performance
Use the unix time command to test the performance of your register
allocation algorithm. This should take the form of a simple table of timing
information for several test cases, including the one you create and those
mentioned below. The provided spreadsheet will calculate the speedup over the
baseline performance.
Note
Collect the user time (not the real time) as it is a truer indication
of time spent running your code. (The real “wall-clock” time reported
includes OS startup costs, etc., which vary from system to system.)
Also, if the user time is reported as 0.000 (which may happen for
clang-compiled code) then use the value 0.0001 to prevent
division-by-zero errors in the speedup calculation.
You should test the performance of the compiler in eight configurations for
.oat source programs:
- using the --liveness trivial --regalloc none flags (baseline)
- using the --liveness dataflow --regalloc greedy flags (greedy)
- using the --liveness dataflow --regalloc better flags (better)
- using the --clang flag (clang)
And… all of the above plus the -O1 flag. These experiments will test
our optimizations using the Oat optimizer.
To help you with this task, the Makefile in this project includes two new
targets oat_experiments and ll_experiments that generate executables for
each case and then run them using the time command. Each such make target
takes a parameter FILE=<filename> and can optionally include OPT=-O1 to
enable the Oat level 1 optimizations.
make oat_experiments FILE=<filename.oat> [OPT=-O1]
make ll_experiments FILE=<filename.ll> [OPT=-O1]
Test your compiler on these three programs:
- spXX_hw5_tests/regalloctest3.oat
- llprograms/matmul.ll
- your own test case
For best results, use a “lightly loaded” machine (close all other applications) and average the timing over at least five trial runs.
The example below shows one interaction used to test the sp26_hw5_tests/regalloctest3.oat file in
several configurations from the command line:
~/cis5521/hw/hw5/soln> make oat_experiments FILE=sp26_hw5_tests/regalloctest3.oat
dune build bin/main.exe
echo "Generating executables for sp26_hw5_tests/regalloctest3.oat with optimization "
Generating executables for sp26_hw5_tests/regalloctest3.oat with optimization
./oatc -o a_baseline.out --liveness trivial --regalloc none sp26_hw5_tests/regalloctest3.oat bin/runtime.c
./oatc -o a_greedy.out --liveness dataflow --regalloc greedy sp26_hw5_tests/regalloctest3.oat bin/runtime.c
./oatc -o a_better.out --liveness dataflow --regalloc better sp26_hw5_tests/regalloctest3.oat bin/runtime.c
./oatc -o a_clang.out --clang sp26_hw5_tests/regalloctest3.oat bin/runtime.c
time ./a_baseline.out
13800019670000000 1.52 real 1.28 user 0.00 sys
time ./a_greedy.out
13800019670000000 0.57 real 0.36 user 0.00 sys
time ./a_better.out
13800019670000000 0.57 real 0.31 user 0.00 sys
time ./a_clang.out
13800019670000000 0.22 real 0.00 user 0.00 sys
The example below shows one interaction used to test the llprograms/matmul.ll file in
several configurations that use the -O1 flag from the command line:
~/cis5521/hw/hw5/soln> make ll_experiments FILE=llprograms/matmul.ll OPT=-O1
dune build bin/main.exe
echo "Generating executables for llprograms/matmul.ll with optimization -O1"
Generating executables for llprograms/matmul.ll with optimization -O1
./oatc -o a_baseline-O1.out --liveness trivial --regalloc none -O1 llprograms/matmul.ll
./oatc -o a_greedy-O1.out --liveness dataflow --regalloc greedy -O1 llprograms/matmul.ll
./oatc -o a_better-O1.out --liveness dataflow --regalloc better -O1 llprograms/matmul.ll
./oatc -o a_clang-O1.out --clang llprograms/matmul.ll -O1
time ./a_baseline-O1.out
1.62 real 1.38 user 0.00 sys
time ./a_greedy-O1.out
1.09 real 0.87 user 0.00 sys
time ./a_better-O1.out
0.53 real 0.32 user 0.00 sys
time ./a_clang-O1.out
0.27 real 0.05 user 0.00 sys
Don’t get too discouraged when clang beats your compiler’s performance by a wide margin. It uses register promotion and many other optimizations to generate high-quality code!
After collecting this data, use it to fill out
PerformanceExperiments.xlsx. This spreadsheet will compute the
speedup for each of these programs under the various optimization
configurations. It asks you to report the processor and OS version that are
hosting the Docker instance that you used to test your code, as well as
the total time (in seconds) for the various configurations.
Post Your Results
For fun, please post your performance results to Ed on the designated thread so that everyone can see how your optimizations perform.
Grading
Projects that do not compile will receive no credit!
Your team’s grade for this project will be based on:
- 90 Points: the various automated tests that we provide. Note that the register-allocator quality points cannot be earned with an allocator that generates incorrect code.
- 5 Points for committing an interesting test case to the shared spXX_hw5_tests repo 24 hours in advance of the project deadline. (Graded manually.)
- 5 Points for the timing analysis in PerformanceExperiments.xlsx. (Graded manually.)