I've been told (source kept secret) that the Glitter soundtrack was released on September 11, and that it went platinum on October 16. So, I guess there are still enough Mariah fans to carry that album despite the movie. Still baffling to me is that the talk-show promotion for the movie was done by "Padma", the fashion model who is 11th-billed in the movie, and who happens to be the girlfriend of Salman Rushdie. Also interesting is the cast list for Padma's latest movie - http://us.imdb.com/Title?0330082

Lots of people had questions about interleaving.
Assume you have 4 disks. Byte interleaving means that byte N is on disk (N mod 4). Since disks deal with sectors, this also means that the new "sector size" is 4x the original sector size. All reads and writes involve all disks, which is great for large transfers, but gives you no benefit from multiple disk heads on small transfers. Block interleaving means that block N is on disk (N mod 4). The sector size can remain unchanged. If your workload consists of lots of reads of 1-block files, then all the disks can seek independently, giving you good performance. Likewise, large transfers get the benefit of all disks reading/writing data at the same time. (There's a short code sketch of striping and parity after the RAID questions below.)

What is parity?
Whether the number of 1 bits is even or odd - see the last line of http://www.cs.princeton.edu/courses/archive/fall02/cs318/lec6/slide30.html

Explain the RAID levels again.
See gory details at http://www.cs.cmu.edu/~garth/RAIDpaper/Patterson88.pdf
RAID 0 - No redundancy; data may or may not be interleaved (striped) across the disks. If interleaved, you get better transfer rates for large transfers.
RAID 1 - Each disk has a "mirror", so all data exists on two disks. Writes have to go to both disks, but reads can come from either disk. So, read rates can be twice as high as write rates.
RAID 2 - The ECC approach is used to detect and correct some failures. If you're interested in the details, read up on Hamming codes.
RAID 3 - One extra disk stores the XOR result, and all data is interleaved at the byte level. All reads and writes involve all disks, so only large transfers get performance benefits.
RAID 4 - Interleaving is now at the block level, so multiple small transfers can gain since the disks can seek independently. All writes must update the parity disk, which becomes the bottleneck.
RAID 5 - The contents of that parity disk are now spread across all disks to reduce the write bottleneck. Small writes still involve two disks, but this is better than all writes waiting on the same disk.

Explain the difference between RAID 3 & 4, and 2 & 3.
3 and 4 differ in the level of interleaving, while 2 and 3 differ in what approach is used to do the checking. If the explanation above doesn't clarify, write me.

In RAID, what happens if you lose the parity disk?
You basically rebuild it the same way that you'd rebuild any failed disk.

If RAID is using information theory as an underlying basis, does that mean we can throw in more space to handle more failures?
In general, yes. However, that tends to be ugly. Instead, when people care about that, they might just have mirrored RAID 5 or something like that. Or, if you're really worried, see the Byzantine fault-tolerance work at http://www.pmg.lcs.mit.edu/~castro/pubs.html

What kinds of companies are interested in RAID?
Anybody that doesn't want to suffer downtime due to disk failure. Most places seem to use it to improve reliability, with raw performance being less of a concern. Most of the disks the CS department (and presumably OIT) are using involve RAID storage.
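To make the interleaving and parity answers above concrete, here is a minimal C sketch. It is not from the lecture or the Patterson paper; the disk count, block size, and function names are my own. It maps a block number to a data disk the way block interleaving does (across the three data disks, since the fourth holds parity in this RAID-4-style example), computes the parity block by XORing the data blocks, and then rebuilds a "failed" disk from the survivors plus the parity.

    #include <stdio.h>
    #include <string.h>

    #define NDISKS     4                /* 3 data disks + 1 parity disk (RAID-4 style) */
    #define NDATA      (NDISKS - 1)
    #define BLOCK_SIZE 16               /* tiny blocks so the example stays small */

    /* Block interleaving: logical block N lives on data disk (N mod NDATA). */
    static int block_to_disk(int n) { return n % NDATA; }

    /* Parity block = XOR of the corresponding block on every data disk. */
    static void compute_parity(unsigned char data[NDATA][BLOCK_SIZE],
                               unsigned char parity[BLOCK_SIZE])
    {
        memset(parity, 0, BLOCK_SIZE);
        for (int d = 0; d < NDATA; d++)
            for (int i = 0; i < BLOCK_SIZE; i++)
                parity[i] ^= data[d][i];
    }

    /* Rebuild one failed data disk by XORing the surviving disks with the parity. */
    static void rebuild(unsigned char data[NDATA][BLOCK_SIZE],
                        unsigned char parity[BLOCK_SIZE], int failed)
    {
        memcpy(data[failed], parity, BLOCK_SIZE);
        for (int d = 0; d < NDATA; d++)
            if (d != failed)
                for (int i = 0; i < BLOCK_SIZE; i++)
                    data[failed][i] ^= data[d][i];
    }

    int main(void)
    {
        unsigned char data[NDATA][BLOCK_SIZE], parity[BLOCK_SIZE], saved[BLOCK_SIZE];

        /* Fill the data disks with arbitrary bytes and compute their parity. */
        for (int d = 0; d < NDATA; d++)
            for (int i = 0; i < BLOCK_SIZE; i++)
                data[d][i] = (unsigned char)(d * 10 + i);
        compute_parity(data, parity);

        printf("logical block 7 lives on data disk %d\n", block_to_disk(7));

        /* "Lose" disk 1, then recover its contents and check the result. */
        memcpy(saved, data[1], BLOCK_SIZE);
        memset(data[1], 0, BLOCK_SIZE);
        rebuild(data, parity, 1);
        printf("rebuild %s\n",
               memcmp(saved, data[1], BLOCK_SIZE) == 0 ? "succeeded" : "failed");
        return 0;
    }

RAID 5 recovery works the same way; the only difference is that the parity blocks are rotated across all the disks instead of living on one dedicated disk.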
Slide 33 (on the web) - what does "general error correcting codes too powerful" mean?
The ECC scheme is great for silent errors, and is often used for checking RAM, where individual bits can flip. However, disks tend to fail visibly, so more complex schemes like ECC can be avoided, and simpler schemes like XOR can be used instead.

How is log corruption checked?
Generally, you don't assume random bits flipping on disks, so I would guess that most log corruption of that form isn't checked. What logs may do, however, is put down some kind of before and after marker in the log for every write, so that if a write is only partially complete when the power fails, you'll be able to tell.

In logging, don't you lose the log when you lose power?
The log is on disk, so you assume that anything that's already written is stable across power failures.

In logging, do you clear the log file every time you update the disk?
Not exactly sure what's being asked. However, the log is basically intended to be a holding area, and the changes made to the log have to be reflected to the "real" metadata parts of the disk. When those updates have taken place, you can mark those entries in the log as being "clean".

Does logging slow down performance - even though they're sequential writes, they're still writes? Is it worth it if you rarely have to perform recovery?
If you were trying to sustain it over long periods of time, logging would probably be a net loss unless the log were on a separate disk. However, disk traffic tends to be "bursty" - periods of idle time punctuated by small regions with lots of activity. One goal of logging is to allow that activity to occur as quickly as possible, so that the user can move on to other things while the OS cleans out the log in the background. Another goal of logging is to avoid having to do the fsck cleanup after a power loss.

What does the new Linux ext3fs "journalized" filesystem do for reliability?
Logging, basically as I've described it. It's nothing special as far as I can tell.

What does the swap partition do?
That's coming up next in the virtual memory part of the course. Basically, the OS uses space on disk to augment the physical memory of the machine, giving you the illusion that you have more memory than you really have.

Are old midterms up?
Anything on-line in previous years is fair game, but please don't get old copies from other sources.

What were the advantages of the Unix filesystem?
You could grow a file to arbitrarily large sizes in a relatively elegant way, and you could use space on disk regardless of whether it was contiguous or not.

Do the first 10 entries in the Unix inode point to only one block of data each?
Yes.

Why does Unix have 13 entries in the inode?
If I had to guess, the number of direct block entries was chosen to fill up space to get the inode to 128 bytes. The single, double, and triple indirect entries are needed for expansion. So, if the rest of the inode had needed more space, there would have been fewer than 10 direct block entries.

Are triple-indirect entries implemented on modern systems?
Given that some systems support really large single files, I would guess that they are.

What happens in Unix when those 13 entries aren't enough?
You could have a quadruple-indirect entry by getting rid of one of the direct entries.

In Unix, is the smallest file 4KB?
See page 5 of the "Fast Filesystem" paper.
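To make the direct/indirect distinction concrete, here is a small C sketch. The numbers are assumptions for illustration (10 direct entries, 4KB blocks, 4-byte block pointers), not the exact parameters of any particular Unix. It just classifies which inode entry would be used to reach a given logical block of a file.

    #include <stdio.h>

    #define NDIRECT      10                     /* assumed number of direct entries in the inode */
    #define BLOCK_SIZE   4096                   /* assumed filesystem block size                 */
    #define PTRS_PER_BLK (BLOCK_SIZE / 4)       /* 4-byte block pointers per indirect block      */

    /* Decide which inode entry reaches logical block n of a file. */
    static const char *lookup_level(long long n)
    {
        if (n < NDIRECT)
            return "direct";
        n -= NDIRECT;
        if (n < PTRS_PER_BLK)
            return "single indirect";
        n -= PTRS_PER_BLK;
        if (n < (long long)PTRS_PER_BLK * PTRS_PER_BLK)
            return "double indirect";
        return "triple indirect";
    }

    int main(void)
    {
        long long blocks[] = { 0, 9, 10, 500, 2000, 2000000 };
        for (int i = 0; i < (int)(sizeof blocks / sizeof blocks[0]); i++)
            printf("logical block %lld is reached via a %s entry\n",
                   blocks[i], lookup_level(blocks[i]));
        return 0;
    }

Under those assumed numbers, everything up through the double-indirect entry covers roughly 4GB of file, and the triple-indirect entry is what pushes the maximum file size into the terabyte range.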
With today's large files, isn't it better to have a block size greater than 4KB?
Small files will waste space, but it's faster. Special-purpose filesystems, like video-on-demand, will often use much larger block sizes, like 128KB. I don't know if general-purpose filesystems have really increased their block sizes beyond 4/8 KB.

Explain the FAT slide - http://www.cs.princeton.edu/courses/archive/fall02/cs318/lec6/slide24.html
File "foo" has a first block of 217. The array marked FAT doesn't contain the data, but only the linked list of blocks. So, entry 217 in this array tells us what the next block of the file is, and block 217 on disk contains the actual data of the first block of the file. The location of the second block of the file is indicated by the value in entry 217, so the second block is block #619, the third is block #399, etc. (There's a short code sketch of this chain-walking at the end of these notes.)

In old DOS, drives were limited to 4GB - does that relate to the structure of FAT?
I believe so, and if I recall, "FAT32" was the solution to use larger disks.

Doesn't FAT have the same problem as Unix in that the data blocks may not be contiguous?
Yes - the extent-based filesystems had "extents" that potentially spanned many blocks and were contiguous. However, they tended to have a fixed number of extents. FAT and the standard Unix filesystem don't have a built-in mechanism to get these benefits, although more modern implementations of Unix do try pretty hard. Likewise, disk defragmentation programs (particularly on DOS) try to fix this problem as well.

How does NTFS differ from FAT?
Offhand, I don't know. I believe NTFS is more Unix-like, but I may be wrong.

Slide 23 - what does "up front declaration a real pain" mean?
When you create the file, you have to say what its maximum size will be, etc. You have to declare its usage at the time of creation.

If the last lecture is on the midterm, can you post the notes early?
I'll try. Realize, though, that the midterm is on Tuesday, so that gives you five days after the last lecture.

How do you do "special order" to get fast recovery times?
I assume this is asking how programs like ScanDisk and fsck optimize their time. If that's the question, the answer is that disks can be read via the filesystem, or in a "raw" manner, by treating the disk as an array of blocks/sectors. If you have the appropriate permissions, then you can read it as raw data, which means that you can figure out where things are on disk in a brute-force manner and do the optimization yourself.

Something about metafile - couldn't read the question.
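Here is the chain-walking from the FAT slide as a small C sketch. Only the 217 -> 619 -> 399 chain comes from the slide; the table size and the -1 end-of-chain marker are placeholders rather than the real FAT on-disk format.

    #include <stdio.h>

    #define FAT_ENTRIES 1024
    #define FAT_EOF     (-1)   /* stand-in for the real end-of-chain marker */

    /* fat[i] holds the block number that follows block i in its file. */
    static int fat[FAT_ENTRIES];

    /* Print every block of a file, given the first block number from its
       directory entry.  The FAT holds only the chain; the file's data
       lives in the correspondingly numbered blocks on disk. */
    static void list_blocks(const char *name, int first_block)
    {
        printf("%s:", name);
        for (int b = first_block; b != FAT_EOF; b = fat[b])
            printf(" %d", b);
        printf("\n");
    }

    int main(void)
    {
        for (int i = 0; i < FAT_ENTRIES; i++)
            fat[i] = FAT_EOF;

        /* The chain from the slide: "foo" starts at block 217,
           then 619, then 399, and 399 is the last block. */
        fat[217] = 619;
        fat[619] = 399;
        fat[399] = FAT_EOF;

        list_blocks("foo", 217);
        return 0;
    }

The point is that the FAT array holds only the "next block" pointers; reading the file means following that chain and fetching each numbered block from the data area of the disk.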