Assignment 4: Directory and File Tree Abstract Objects


Purpose

The purpose of this assignment is to increase your experience with the following important software engineering fundamentals:

It also will give you the opportunity to gain more experience with the Linux operating system and the GNU programming tools, especially bash, gdb, and make.

Part 1 consists of no code writing, but will have lots of code reading. Part 2 consists of some additional reading, (hopefully) some deep thinking about invariants, and some code writing. Part 3 consists of (hopefully) some practical thinking about design decisions and (potentially, but not necessarily) lots of code writing.

As fair warning, we expect it may be more challenging to get started on this assignment than your assignments thus far because of the learning curve to understand the inner workings of an initially unfamiliar codebase.

In the past three terms, students have reported taking on average 25 hours to complete the assignment. Note that this measure may not be as representative as it is for other assignments, because this assignment is our "youngest" assignment (first offered in fall 2020). Even still, plan for the assignment to be a significant undertaking.


Rules

You may work with one partner on this assignment. You are not required to work with a partner, but we strongly prefer and suggest that you do: in a recent semester, the ratio of generally working versus catastrophically broken A4 submissions for partnerships was approximately 11:1, whereas the ratio for solo efforts was only 1:1! Your preceptor will help you with your partner search if you so desire.

There is no "challenge" portion for this assignment.


Background

As you learned in COS 126, a tree is a linked data structure that represents a hierarchical organization of data. There is exactly one node with no incoming links, which we call the root of the tree. There is a unique path from the root to every other node in the tree. Every other node in the tree has exactly one incoming link (from its parent). A node with no outgoing links (and thus no children) is called a leaf of the tree.

As you learned earlier in this course, the Linux file system is structured as a tree. There is one root directory, technically with no name but colloquially called / , that serves as the ancestor of all other directories in the directory hierarchy. Both directories and files may be leaves of the tree, and nodes that are not leaves may have either or both of directories and files as children. Each node in the hierarchy has a canonical name (absolute path or full path) derived from an enumeration of the nodes visited traveling from the root directory to that node, delimited by slashes. As an example, /usr/bin/ls describes all four nodes visited in reaching the ls binary executable: the unnamed directory that is the tree's root, the usr directory that is a child of the root, the bin directory that is a child of usr, and the ls file leaf that is a child of bin.

In this assignment you will be working with three different tree data structures, each of which represents an increasingly complicated file system hierarchy, eventually achieving the full description from the first paragraph above:

As a simplification to avoid complications with empty strings within paths, in all three implementations, neither the root nor any other directory will be unnamed, and no superfluous delimiters will be permitted within path names. Thus no absolute path may have multiple slash delimiters in a row (e.g., a/b/c would be a valid full path and /a//b/c would not be, unlike in Linux).


Your Task

There are three parts to this assignment, corresponding to the three tree data structures listed above.

In Part 1, you are given object files for several faulty Binary Directory Tree implementations, as well as for a correct implementation. You are also given the source for a minimalist BDT client that exhibits behavioral or memory management errors when run after building with any of the faulty implementations. You will locate the bugs in the faulty implementations by debugging the client program using gdb.

In Part 2, you are given object files for several faulty Directory Tree implementations, as well as the source for a correct implementation and a minimalist DT client. This time the object files were not configured to facilitate stepping through the code in those files while debugging in gdb. However the API functions that mutate the data structure were augmented at the leading and trailing edges to include calls to an internal validation module. So instead of the approach from Part 1, here you will flesh out the invariant tests in that internal validation module to verify the validity of the data structure's representation such that it correctly identifies the invalid state of the data structure and terminates the program for each faulty implementation.

In Part 3, you are given an expanded API that is appropriate for hierarchies that contain both directories and files. You will implement the File Tree interface, using the Directory Tree implementation from Part 2 as a jumping off point. Along the way, you may need to reassess the program design and modularity decisions made for the DT implementation if they are no longer the best choices for the FT requirements.


The Given Files

Part 1

Part 2

Part 3


The Procedure

As you have been doing thus far in the course, you may write code in whatever development environment and on whatever computer you see fit.

In this assignment, we provide pre-built .o files that contain various correct and buggy solutions. These were built using gcc217 on armlab, so you will have to do the testing and debugging there.

We continue to encourage you to embrace the discipline of committing frequently in GitHub. This becomes potentially even more important now if you are sharing your codebase with a partner. When you are finished, pull your final version to armlab to issue the appropriate submit command(s).


Step 0.1: Preliminaries

As with the other assignments, you will create your own repository, separate from ours, that has the same contents as ours. As a reminder, you can follow these procedures to import your own version of our contents into your GitHub account, or follow the Setup Step 5 from the Git and GitHub Primer document from the first precept.

Note, though, that only one partner should do this. After having done so, that partner should continue with Setup Step 6. Only after that will the other partner obtain a working copy, by executing a (regular, non-bare) clone of the first partner's repository.

The COS 217 assignment repository for this assignment is:
https://github.com/COS217/DirectoryFileTrees


Step 1.1: Examine the BDT Interface and Client

Study the given source files, a4def.h, bdt.h and bdt_client.c, to learn about what operations a BDT supports and how they might be called by a client of the BDT module. The client program is not a true external client of the module, but instead a test client that validates (some of) the expected behavior of the module.

The BDT internal implementation is not revealed to the interface or the client — this is good data structure design! However, to get you started, we will tell you that the BDT is implemented as an Abstract Object with the following static state variables:

      /* 1. a flag for being in an initialized state (TRUE) or not (FALSE) */
      static boolean bIsInitialized;

      /* 2. a pointer to the root node in the hierarchy */
      static struct node *psRoot;

      /* 3. a counter of the number of nodes in the hierarchy */
      static size_t ulCount;
                
where struct node * is a pointer to one of these structures:
      /* A node in a BDT */
      struct node {
         /* the object corresponding to the node's absolute path */
         Path_T oPPath;

         /* the node's parent, or NULL if the node is the root */
         struct node *psParent;

         /* the node's child directories, each NULL if there is no such child.
         invariant: if child1 == NULL, then child2 == NULL as well. */
         struct node *psChild1, *psChild2;
      };
                


Step 1.2: Build the BDTs

Using the Makefile, you can issue all these commands to build the programs from the client's C source, the two provided ADTs' C source, and the good and bad implementations' object code:

      $ make
      gcc217 -g -c dynarray.c
      gcc217 -g -c path.c
      gcc217 -g -c bdt_client.c
      gcc217 -g dynarray.o path.o bdtGood.o bdt_client.o -o bdtGood
      gcc217 -g dynarray.o path.o bdtBad1.o bdt_client.o -o bdtBad1
      gcc217 -g dynarray.o path.o bdtBad2.o bdt_client.o -o bdtBad2
      gcc217 -g dynarray.o path.o bdtBad3.o bdt_client.o -o bdtBad3
      gcc217m -g -c dynarray.c -o dynarrayM.o
      gcc217m -g -c path.c -o pathM.o
      gcc217m -g -c bdt_client.c -o bdt_clientM.o
      gcc217m -g dynarrayM.o pathM.o bdtBad4.o bdt_clientM.o -o bdtBad4
      gcc217m -g dynarrayM.o pathM.o bdtBad5.o bdt_clientM.o -o bdtBad5
                
Note that the last two programs are built with gcc217m to facilitate use of the meminforeport tool in helping you debug.


Step 1.3: Run the BDT programs

Each of the buggy implementations has a simple-but-serious bug that will manifest itself in some observable way, e.g. a failed assert, a runtime crash, output that diverges from the good implementation, or a memory issue identifiable by meminforeport. Run the programs to observe their behavior:

      $ ./bdtGood
      ...
      $ ./bdtBad1
      ...
                


Step 1.4: Debug the Buggy BDTs

One at a time, conduct one or more gdb sessions to step through the operations that eventually result in an error and explore to home in on the point at which it all goes wrong.

Refer to the gdb guide for tips on how to debug efficiently.

Write the location of your identified bug locations (simply the function is fine, no need to be more granular) in the designated section of the readme. Note that this should be the location where the bug is introduced (and where it should be fixed), not necessarily where it manifests in an error or becomes observable. Note also that the location may be an auxiliary static function that does not appear in bdt.h.

For the implementations with memory bugs, you may want to try to isolate which calls from the client cause the bad memory behavior by using the "divide and conquer" approach discussed in lecture.

Once you have identified the bug in each of the buggy implementations, you are done with Part 1.


Step 2.1: Examine the DT Interface, Client, and Given Implementations

Study the new given source files for the newly modularized part 2: dt.h, nodeDT.h, and dt_client.c. The interface is extremely similar to the BDT interface from part 1 of the assignment, but the client exhibits some of the different properties of the new data structure (e.g., that a parent can now have more than two children).

Also study the given implementation dtGood.c and nodeDTGood.c. This implementation is sometimes cryptic in its heavy use of static helper functions, making it imperative that you really study it to tease out the design, algorithms, and invariants of Node_T and DT. But you will need to do so in order to construct your internal testing, because the buggy versions share the same internal representation and overall structure. You will also need to have a strong grasp of this implementation to complete the required critique of its design.


Step 2.2: Build the DTs

Using the Makefile, you can issue all these commands to build the programs from the client's C source, the two provided ADTs' C source, the good and bad implementations' object code, and your validation module's C source:

      $ make
      gcc217 -g -c dynarray.c
      gcc217 -g -c path.c
      gcc217 -g -c checkerDT.c
      gcc217 -g -c nodeDTGood.c
      gcc217 -g -c dtGood.c
      gcc217 -g -c dt_client.c
      gcc217 -g dynarray.o path.o checkerDT.o nodeDTGood.o dtGood.o dt_client.o -o dtGood
      gcc217 -g dynarray.o path.o checkerDT.o nodeDTBad1a.o dtBad1a.o dt_client.o -o dtBad1a
      gcc217 -g dynarray.o path.o checkerDT.o nodeDTBad1b.o dtBad1b.o dt_client.o -o dtBad1b
      gcc217 -g dynarray.o path.o checkerDT.o nodeDTBad2.o dtBad2.o dt_client.o -o dtBad2
      gcc217 -g dynarray.o path.o checkerDT.o nodeDTBad3.o dtBad3.o dt_client.o -o dtBad3
      gcc217 -g dynarray.o path.o checkerDT.o nodeDTBad4.o dtBad4.o dt_client.o -o dtBad4
                
Notice that the Makefile automatically recognizes which files do and do not need to be rebuilt after any changes. This was also the case with the BDTs, however in part 2 you will have to rebuild often as you develop your validation module, so the partial-build capabilities are more visible.


Step 2.3: Run the DT programs

As in Step 1.3, each of the buggy implementations has a bug that is observable: dtBad2 fails a test in the client, dtBad3 produces different output versus the reference solution, and dtBad4 fails an assert inside a known-good ADT module. But unlike with the BDTs, trying to debug these with gdb runs into a problem: the DTs were not compiled with -g, so there is no source code associated with the memory locations in the executable. This is a good indication that a developer will be well served taking a different approach to debugging.

(But do note that gdb can still debug programs built with these object files, it just cannot help you line-by-line through that code.)


Step 2.4: Develop Your Own DT Validation Module

Lacking the line-by-line analysis capabilities of gdb, you will instead approach the problem at a lower level by beefing up the internal testing used in the DT modules.

The provided checkerDT.h gives a specification for isValid functions for the DT and the Node_T data structures. These functions should test the data structures' invariants. If a broken invariant is detected, they print a meaningful description of the broken invariant to standard error, then return FALSE, so that the program aborts with a well-explained error rather than resulting in an unexpected failed assert or a segfault, or completing with incorrect behavior.

(What's a data structure invariant? These are constraints on the set of values of individual state variables within the data structure's implementation or well-defined relationships between several state variables' values. For example, in your Assignment 3 SymTable list implementation, if oSymTable->firstNode is NULL, then oSymTable->uCount must be 0, and vice versa. Another example is in your hash table implementation: if the binding for a particular key is found in bucket i, the hash value of that key must be i!)

Each function listed in DT's interface that changes the state of the data structure (as opposed to just observing / reporting it) has a correct call to CheckerDT_isValid at the beginning of its function body and just before each possible return point. Similarly, each function listed in the given node interface that produces a new Node_T for the client or mutates an existing one for the client has a call to CheckerDT_Node_isValid at the function's leading and trailing edges. The implementation of these checker functions should thoroughly exercise checks of every invariant of the data structures' internal representations and their interfaces' stated restrictions.

To get you started, you are given some key checker scaffolding code: as some of the tests for DT validity require a proper method of traversing the tree, we have provided a working implementation of this algorithm in checkerDT.c. You are also given a few sample tests in checkerDT.c to use as a pattern for your own tests. (This is why dtBad1a and dtBad1b already abort with a correct description of the bug.) You may write your additional tests in the two checker functions themselves or in additional static functions that you call from the checker functions.

You may assume that the following simple getter functions work in all implementations, in order to make it less cumbersome to write your checks: Node_getPath, Node_getChild, Node_getParent, and Node_getNumChildren. (You may also assume that the DynArray and Path modules are correct in their entirety, though their design is fair game for part 2.5 below.)

Once your checker module correctly identifies the bug in each of the incorrect implementations, you can consider your testing complete: there are myriad potential invariants to check, but we will evaluate your coverage based only on its ability to catch the bugs we've provided while not falsely claiming a bug in a correct implementation.

Our advice is that your checkerDT.c implementation should identify all or nearly all of the buggy DTs' issues before you proceed to the next step. (This isn't a hard prerequisite, it's just that writing more checker tests will further familiarize you with the implementation upon which we assume you will be basing Part 3.)


Step 2.5: Critique Our DT Implementations

In the code given for Part 2, there may exist functions that are not used or not used to the best of their abilities, module bloat, questionable naming conventions or parameter order, etc. There could even be a couple potential errors in handling allocation error cases (thankfully, ones not exercised by the sample client). Some things were improved between Part 1 and Part 2 (Node_T is implemented with a fully fledged module, not an overgrown struct within another object's module, among other things), but real code you sit down in front of does not typically have the polish of perfect academic examples, and we've represented that here to a limited degree. Give one last look back at the code and offer a critique of how a refactorer could/should clean up the interfaces and dtGood.c and nodeDTGood.c implementations to better meet the goals from the Modularity lecture and accord to what we know about good program design and style. You will enter this in the appropriate place in the readme — either paragraph or bulleted list form will suffice.


Step 3.1: Examine the FT Interface and Client

Study the two new given source files, ft.h and ft_client.c, to learn about what additional operations an FT supports beyond those from the DT interface, and how they might be called by a client of the FT module. Again, the client program is not a true external client of the module, but instead a test client that validates (some of) the expected behavior of the module.


Step 3.2: Design Modules, Interfaces to Support FT

In Part 3, you will implement the FT interface to further expand the capabilities of a DT to include representing files: the leaves in the tree that have contents associated with them. Your first inclination for doing so might be to rush out and start "hacking" by adding new fields to the Node_T implementation to fit the first idea that comes to mind, making slapdash changes to the DT code where its invariants no longer hold for the new expansion, etc.

While this could end up working — and indeed, your eventual product will likely end up making some of these changes — a complicating factor is that the code you have works for a DT, however the invariants and constraints of an FT will differ, potentially significantly. Thus, a better software engineering approach is to reconsider the overall design of your program. The crux of Part 3 that makes it intellectually interesting, rather than just "busy work" is to figure out how to adapt the Part 2 code, which was based on a set of invariants, to your new representation and its new invariants.

Modularity is the manifestation of abstraction: a strong program designer understands how to find abstractions in a large program and convey them via a cogent set of modules, each with a sensible interface. So before you begin coding, consider the answers to these design decision questions (there may be no right answers):

As you answer the questions, you should, for your own benefit, compose a diagram (e.g. a flowchart or a mind map) of the interactions between the modules that you intend to write and/or the internal representations you intend to change for existing modules. Aim for strong cohesion and weak coupling, and ensure that there are no violations of encapsulation built in to the design.

(Aside: this assignment's author has been repeatedly on the receiving end of the lamentations of the software engineer in charge of recruiting for a major tech company, who indicates that candidates — including Princetonians, alack! — often fail her modularity-focused design question in interviews because they charge into coding instead of drawing out a coherent diagram of the architecture of modules they will be building.)

Your design choices will inform how easy or difficult it will be to complete subsequent steps in Part 3. So do not become wedded to your choices to the point of becoming susceptible to the sunk-cost fallacy: you may have to iterate on your design decisions as you delve into the next steps of actually implementing the FT. The perfect, however, is the enemy of the good — encouraging deep thoughts about design doesn't mean we want you to never start coding!

Realtalk / Practical Advice

You are given a wide berth to proceed here, but there are two sensible options that most submissions will select from:

  1. cloning the existing node implementation into two variants (one for files and one for directories)
  2. expanding the existing node implementation to be able to differentiate between whether a given instance represents a file or a directory
Option 1 will result in two modules that include large swaths of near-identical code, which the second avoids. But Option 2 infringes on an objective for object oriented design that a given object should represent one thing only, but do it well. In our experience, Option 1 can be more elegant when done well, but Option 2 is significantly easier to get working. Given the provided code base and time constrains of the assignment, it is acceptable to pursue either of these paths for full potential credit without having to solve their glaring downsides. (Similarly, you don't have to resolve your critiques from Step 2.5 in your FT implementation.) We do hope, however, that you do put some real thought into how you could do better if Assignment 4.5 called for a complete refactor. (Don't worry, there is no Assignment 4.5!)

One key invariant constraint that will prove to be among the most challenging to manage will be output ordering requirements and how they are potentially reflected in the ordering within your data structure itself, which will end up trickling down through all the functions in the API. Here are some potential options for storing the children of a node (no matter whether you chose Option 1 or Option 2 or another unlisted option above), with discussion of their advantages and potential pitfalls:

  1. 2 dynamic arrays of children: 1 for files and 1 for directories, each stored in lexicographic order. This results in reasonably easy child wrangling throughout the interface functions, except for a slightly complicated FT_toString function because it must iterate over both arrays to produce the correct ordering.
  2. 1 dynamic array of children of both types, all stored in purely lexicographic order. This is perhaps the easiest approach throughout the various interface functions, though like with the two-array option there may be complications making FT_toString generate the correct ordering. Hint: consider making two passes over the DynArray in preOrderTraversal to visit a node's children in the required order.
  3. 1 dynamic array of children in the final order required for output: this is the most elegant, and allows the FT_toString implementation to remain nearly unchanged from its DT version ... but on the other hand makes for the most complications in the other functions, and ends up as the hardest to get right. If you do choose to take this approach, read the following advice carefully:


Step 3.3: Compose a Makefile

Create a first version of a Makefile that corresponds to the files for the modules you have chosen in the previous step. You may have to refine this Makefile as you refine your program design.

Your Makefile must:


Step 3.4: Compose and Test Your New and/or Improved Supporting Modules

Having settled on a set of modules and their interfaces in Step 3.2, you will now implement these modules. As you write the implementation(s), you may find that you are never using some functions from the API, yet you feel as though other useful ones are missing — use this insight to iterate back to Step 3.2 and refine your design choices. (If these are leftover elements from DT, consider also whether this observation applies to the given code from Part 2 as well, and if so whether you should augment your answer from Step 2.5.)

You will want to test each new module, perhaps with a simple test client for your own benefit (that is, you need not submit such a client), before relying on the module as a dependency of a larger build. If you completed an organizational diagram in Step 3.2, it can provide the appropriate order for composing and testing your new modules that minimizes the entanglement of multiple untested modules with each other.


Step 3.5: Implement the FT Interface

The final module that you will implement is the only one with its interface specified for you: the core FT abstract object. The FT interface is very similar to the DT interface, except almost every function is now split out into a directory version and a file version. In addition, there are functions to manipulate a file's contents, and a new FT_stat function that will allow a client to access metadata about a node in the File Tree.

Note: your implementation of the FT_toString function must maintain the same format from the corresponding BDT_toString and DT_toString functions. In dealing with the new features, it adds the ordering requirement specified in the comment in the client that a directory's children should be printed depth first in lexicographic order but with file children listed before their directory children siblings.

In order to make the required FT behavior unambiguous, you are provided the object file for a reference implementation. You can build the given sampleft.o with the provided test client ft_client.c using the provided Makefile.sampleft, which is named in such a way as not to conflict with your own Makefile. Here is the command to do so:

make -f Makefile.sampleft

The -f option to make specifies what file to use instead of the default names Makefile or makefile .

You may choose to expand ft_client.c with your own tests, write your own test client, or even to write a FT validation checker module if we have sufficiently convinced you that these are valuable. (If you do the latter, you'll need to name it differently than the module from Part 2 in order to submit it without a naming conflict. CheckerFT would be a natural choice.)

In addition to matching the sample implementation's functional behavior, your implementation must also manage memory correctly, avoiding dynamic memory management errors. As you did in Assignment 3, you should use MemInfo and Valgrind to observe and debug your program's memory management.


Step 3.6: Critique Your FT Implementation

Critique the code for your new modules using critTer. Each time critTer generates a warning on your code, you must either (1) edit your code to eliminate the warning, or (2) explain your disagreement with the warning in your readme file.

Critique the program consisting of the new modules and the client using splint. Each time splint generates a warning on your code, you must either (1) edit your code to eliminate the warning, or (2) explain your disagreement with the warning in your readme file.

Exceptions: You do not have to address warnings about too many functions, too long loops, functions, or files, or too deeply nested code. You also do not have to address warnings that result solely from your repurposing of the DT code we gave you, but you might consider if those warnings are additional fodder for your comments in Step 2.5.


Step 4.1: Complete the readme File

Edit your copy of the given readme file by answering each question that is expressed therein.

In step 1.5, you should have entered the bug locations from Part 1. If you didn't, do it now.

In step 2.5, you should have entered your critique and refactoring suggestions from Part 2. If you didn't, do it now.


Step 4.2: Provide Feedback

Provide the instructors with your feedback on the assignment. To do that, issue this command on armlab:

      FeedbackCOS217.py 4
                
and answer the questions that it asks. (If you worked with a partner, then when answering the numeric questions, please enter the average of the responses that you and your partner would provide individually.) That command stores its questions and your answers in a file named feedback in your working directory. You need not store it in your repository, though you can.

Note: this portion is particularly helpful for this assignment, given that this is still a "young" assignment relative to the others in this course that have developed and evolved over the course of decades.


Step 4.3: Submit

Submit your work electronically on armlab. If you worked with a partner, then one of the partners must submit all of your team's files, and the other partner must submit none of your team's files.

Your readme file must contain both your name and your partner's name in the appropriate locations.

For Part 1 you must complete the appropriate section of the readme and submit that file.

For Part 2 you must submit checkerDT.c .

For Part 3 you must submit the .c and .h files for all the modules used to implement the FT interface and build your ft executable, with the exception of the ft.h interface file itself and the ft_client.c client file itself. If you have not modified dynarray.c, dynarray.h, path.c, path.h, or a4def.h, you do not have to submit these either. (Be sure you get all the required files! Seriously. Go triple-check. We will deduct several points if we have to track you down to submit missing files after-the-fact.)

For Part 3 you also must submit the Makefile that uses all these files to build the ft executable out of ft_client.c.

You must submit the completed readme and the feedback transcript.

Finally, if you have worked with a partner, the submitting partner must create a partner file indicating the non-submitting partner's netid and submit it (the non-submitting partner should still submit no files). You can use the following commands, substituting in your partner's netid into each:

      touch netid.partner
      submit 4 netid.partner
                
(WARNINGS: please make sure the file has the .partner extension, since that's how we can recognize it; please don't submit a file called, literally, netid.partner -- change netid to be your actual partner's actual netid; yes, the actual netid used to log in to armlab, not their email alias; non-submitting partners should make sure the submitting partner did this, because otherwise we have no way to know to give you credit!)

Really finally, go do an nth check that you have submitted all the required files for us to build and run your code. (Can you tell that many students miss submitting all the required files for this assignment? Don't be one of them!)


Program Style

In part, good program style is defined by the splint and critTer tools, and by the rules given in The Practice of Programming (Kernighan and Pike) as summarized by the Rules of Programming Style document.

The more course-specific style rules listed in the previous assignment specifications also apply, as do these:


This assignment was created by Christopher Moretti and Vikash Modi '23, with input from Xiaoyan Li and Donna Gabai.