Q&A Session - 7:00pm Wednesday, room 105 (our current room)

What does it mean that each process has the kernel mapped? Is this taking up extra memory?

Logically, this means that the virtual memory structures needed to map the kernel are always in place. If you do this "right", it doesn't have to consume much extra memory. With hierarchical page tables, the relevant entries in each process's top-level directory all point to the same set of lower-level entries that handle the kernel mapping. So it's not too ugly.

Could you go over the differences between software-controlled and hardware-controlled TLBs again? Are TLBs always implemented in hardware, and it's just the management of the TLBs that may optionally be in software?

The TLB hit/miss check is always implemented in hardware, since that is absolutely performance-critical. The management of the TLB (which determines its behavior on TLB misses) is what may be implemented in hardware or software. Realize that a TLB is just a cache of the page information, so not all of the page table entries will fit into it. When the TLB misses, the most common reason is simply that there wasn't enough space in the TLB. At that point, something has to decide which TLB entry to evict, and how to load the appropriate entry from the main-memory page table into the TLB. This used to be done by software; after the approaches stabilized, it sometimes moved to hardware. Hardware management means that the OS and hardware have to agree on a page table structure. Note, however, that only "minor faults" (valid PTEs exist in the main page table) are handled by the hardware - if the page isn't in physical memory, or if the attempted access was invalid, the software path is still invoked.

How does having a TLB in hardware save time if there's a miss?

If the TLB is software-managed, some instructions must be executed on every TLB miss.
However, if those instructions aren't in the L1 cache, they must be fetched from main memory, so every TLB miss may also incur that extra penalty. In contrast, if the minor-fault handler is implemented in hardware, there are no instructions to fetch in order to run it - it's essentially built into the chip.

Can you explain when software TLB management isn't too bad and when it's really bad?

If you don't have many TLB misses, then increasing their cost by a small factor may not hurt at all. Likewise, if you have lots of TLB misses, the instructions to resolve them may stay in the L1 cache all the time. The bad scenario is when you have enough TLB misses to slow down performance, and the program accesses enough code/data in between TLB misses to kick the miss handler out of the cache.

What determines the size of the TLB?

There's a certain amount of space on the chip that can be devoted to caches - L1, L2, and TLB. My guess is that the TLB is made as large as possible without being too slow, and without consuming more chip space than it can effectively use for an interesting set of benchmark applications. That's sort of a weasel answer, but I didn't get any good results on the TLB details of current chips when I tried a few web searches.

Why is there a valid bit in the TLB? Why would certain mappings be invalid?

Remember that all entries of the TLB are checked in parallel. Assume the TLB doesn't have any process ID info, and a context switch has just occurred - the TLB has to be flushed. Since all of the comparisons are going to take place anyway, you need some way of saying "even if this entry matches, it's not a real entry", and that's one of the uses of the valid bit.

When you say combining a TLB with a cache, does that mean that if a page is in the TLB, it should also be in the cache? How are they being used together?

Note that combining here doesn't mean a literal combining of the two features.
What motivates this "combining of behavior" is the observation that if the cache is indexed by physical address, then every cache hit first requires a virtual-to-physical translation. If the cache is instead indexed by virtual address, then a cache hit doesn't require a TLB lookup as well - the translation has effectively been merged in. The cache still operates on cache lines (not pages), and it has its own replacement policy, separate from the TLB's.

What is meant by "consistency in memory"?

A cache is just replicating part of memory, but is faster and smaller. If something gets written to the cache, it should eventually get written back to main memory. Likewise, if something changes in main memory, that change should get reflected in the cache (or that portion of the cache should be marked as invalid, causing future accesses to fetch the real value from main memory). So, if the value being cached changes in main memory and this change doesn't get reflected in the cached copy, the cache is said to be in an inconsistent state.

Why is it good to have a sparsely-populated array if we're not using a hash table?

It's not that we have much of a choice in the matter - the logical array is sparsely populated simply because most processes use only a small fraction of their virtual address space. All of the unused virtual memory regions are what make the array sparse.

In the inverted page table, if the table has one entry per physical page, why do you need to hash at all? Shouldn't the pid/vpage number map directly to the physical page?

It would be very difficult to come up with a direct mapping that's general enough and that allows all resources to be used easily. For the sake of argument, assume you have a system with 100 pages of memory and 100 processes that use 1 page each. Can the same mapping handle this case as well as a system that has 10 processes using 10 pages each?
I can't think of one, so that's why hashing is used - if you design the hash function well, it's unlikely to have really bad worst cases.

Where is the hash chain in the inverted page table?

Not shown. Also realize that there may not be a chain - there are other ways of handling conflicts in hash tables, such as just walking down until you find a free entry, or re-hashing.

I don't understand direct-mapped caches - what does it mean to be N-way set-associative?

Direct-mapped caches have exactly one entry per location in the cache. That means that if items X, Y, and Z all map to the same location in the cache, only one can be cached at a time. You can think of an N-way set-associative cache as having N entries per location. So, if we have a 2-way set-associative cache, then out of X, Y, and Z, two can be cached at any time. In a fully associative cache, the only restriction on what can be cached is the size of the cache itself.

How many levels do page tables usually have?

I would guess that on 32-bit systems with a 4KB page size, you'd only need two levels. Each level would take care of 10 bits of the virtual address, and the remaining 12 bits are the page offset.

Is a TLB miss always a minor fault? What exactly is a major page fault?

No - a TLB miss can also occur if the page isn't in physical memory at all. At that point, the OS has to get involved and load the page from disk. This is a major fault.

What's the difference between segmentation with paging and a multiple-level page table?

The hardware requirements are quite different - realize that the segmentation schemes we've discussed have the segment table built into the hardware. In a multi-level page table, there's no direct counterpart. Both can use TLBs to speed up the system.

TLBs don't work well on matrices

Several people made this observation, so I should clarify. Certain matrix operations, such as matrix multiply, have straightforward implementations that interact poorly with TLBs.
So, the people who care about these things implement performance-aware algorithms. Do a search for "blocked matrix multiply" if you'd like to know more.

How exactly is the entry of the TLB chosen based on VPage#?

The virtual page # is presented to the TLB, which compares all of its entries to see if one of them has a matching virtual page #. The low-level details basically involve lots of comparators, one per TLB entry (I think).

Doesn't making the TLB fully associative make it harder to look things up than making it N-way set-associative?

Yes, it requires more hardware, but my guess is that there must be certain classes of applications (think high-performance Fortran code) that would suffer if TLBs weren't fully associative. So, if it's a small cost to make sure you don't get killed on a benchmark, do it.

How does the OS know what level of hardware support is provided by a particular processor?

Processors provide certain instructions that allow the OS to get information about what the processor provides. Some of these instructions may operate in very fine detail, while others may just say that this is processor level X, and there's some published info from the manufacturer that says how much of everything level X has.

Can you explain virtually-addressed caches again?

This is the start of the next lecture (which had slides handed out in this one).

At which point does the OS get involved in page faults on a system like Linux?

I think all modern x86 chips have a hardware TLB miss handler, so the OS only needs to get involved if the page isn't in physical memory at all, or if the attempted access was invalid.

How does the TLB know what entries to keep? If it can only load about 100 entries, which 100 does it have?

This was the slide about replacement policy. Some TLBs just randomly evict an entry when a new entry needs to be loaded.
Others will have some sort of "LRU-like" approach, where they try to evict the entry that hasn't been used for the longest time.

When are you getting rid of Windows?

Well, right now I've got a working Windows laptop, and I will always need a laptop for presentations and such, so there's no need to get rid of Windows. When this laptop needs to be replaced, then it might be possible to get rid of Windows. The reason for getting rid of Linux was basically that it was time to replace my desktop machine (a 300 MHz Pentium II).

Does the UltraSparc use an inverted page table?

According to http://www.memorymanagement.org/glossary/i.html the Alpha, UltraSparc, and PowerPC all include inverted page tables, but if this were really important, I'd check something besides the web.

When is the last day that we can ask questions?

The safest assumption is that whatever's written on the feedback on Thursday will get answered. However, that's feedback, so it's somewhat restricted to the scope of lecture, and it assumes that the book's been read, etc. The last safe time to ask any general question, then, is Wednesday night. I'm unlikely to have e-mail access over the weekend.

On slide 29, what does "need to write back" mean?

Assuming I'm thinking of the right slide, the TLB may keep track of which pages have been referenced, modified, etc., so if the PTE has changed, it has to be written back to the main page table before it's evicted from the TLB.

What is the value of "size" on slide 21?

I've got different slide #s - could you give me the URL of the appropriate jpg?

Are midterm questions generally like the quiz questions, or are they more like problems that we have to solve?

My format is probably going to be something like these:
http://www.cs.princeton.edu/~vivek/f2000_318_midterm.pdf
http://www.cs.princeton.edu/~vivek/f2001_318_midterm.pdf
http://www.cs.princeton.edu/~vivek/f2001_318_final.pdf