Congratulations! Your submission titled "Shark: Scaling File Servers
via Cooperative Caching" (paper #208) has been accepted to NSDI
2005. We had an excellent submission pool and the competition for the
final program was intense.  You should take the acceptance of your
paper as a strong endorsement of your work by the community.  At the
same time, we hope that you will use the detailed reviews as an
opportunity to improve your work for the final program.

Your reviews are at the end of this message. We look forward to your
revised camera ready copy. As you prepare it, please bear the
following points in mind:

1. Paper and Contact Information. For each paper, Usenix requires (i)
the title; (ii) the ordered list of authors with their full
affiliations; (iii) the contact author's name, email address, phone
number, and mailing address.  Please reply to this message *now* and
send us an up-to-date set of this information.

2. Shepherding. All papers are being shepherded in their
revisions. Your shepherd will contact you within a few days. You
should discuss your plan for revisions with them (it is often helpful
to chop up review text and say how you intend to respond to points) as
they can help you understand the key strengths and weaknesses of your
paper as perceived by the Program Committee. Papers are due to your
shepherd for review no later than *March 11, 2005*.

3. Registration and camera-ready instructions. The camera ready copy
is due to Usenix by *March 29, 2005*. Details of the camera-ready
requirements and submission process, as well as conference
registration for authors, are available at:
http://www.usenix.org/events/nsdi05/instrux/. Please read this entire
page; it is the only information you will receive on camera ready
preparation.

We look forward to a strong program and presentations in Boston.

Best of luck with your revisions and congratulations once again! 

Amin Vahdat and David Wetherall
NSDI 2005 PC Chairs.

------------


<Review #1>

Your qualifications  to review this paper :
I know the material, but am not an expert

Overall paper merit:
Top 10% but not top 5% of submitted papers

Provide a short summary of the paper:
This paper is part of the long line of SFS papers, which in this case
is tailored to overcome a specific problem: high latency and low
bandwidth to a single WAN file server for mostly read access
workloads.  It does so through cooperative caching, which has the
additional benefit of permitting the WAN file server (also referred to
as the origin server) to scale to hundreds or thousands of WAN
clients.  In a nutshell: AFS + secure & reliable channels + rabin
finger prints + cooperative caching w/ FS optimized DHT.  The paper
does a nice job of describing the design and implementation of tying
together a number of well understood, separate mechanisms.


What are the most important reasons to accept this paper?:
See below...

What are the most important reasons to reject this paper?:
See below...

Provide detailed comments to the author:
The authors should also speak to their dependency on whole file
caching, which strikes me as a design choice with potentially serious
impact.

Another potential weakness that concerns this reviewer is latency and
outage of the origin server.  For example, suppose the origin server
is at NYU, the client is in Korea, and latency is high combined with
intermittent network outages. In such a scenario, connectivity to the
origin server might be lost causing file accesses to "hang" eventhough
the underlying data is local or nearby.

The paper could do a better job of making a strong case for the need
for such a file system.  For example, what applications have a mostly
readonly "filesystem" workload for large amounts of data?  Maybe grid
or others? It is an unknown.  


</Review #1>


<Review #2>

Your qualifications  to review this paper :
I know a lot about this area

Overall paper merit:
Top 25% but not top 10% of submitted papers

Provide a short summary of the paper:
This paper describes a p2p/dht based cooperative cache for a shared
(and centralized) file server. It is targeted at grid applications or
planet-lab experiments where a group of nodes share a collection of
binary or data files that must be distributed from the server to the
nodes to run the experiment (note that it may have broader
applicability.) It is based on a group of known techniques and re-uses
considerable existing code: SFS, LBFS, Coral, etc but it tweaks these
pieces a bit (e.g., cut-through routing in coral to create
distribution trees when a file is first accessed by a bunch of clients
at about the same time.)


What are the most important reasons to accept this paper?:
The system seems like it would be useful for planetlab and grid
researchers and perhaps others.

Good engineering. (a) Rather than reinvent the wheel, use a lot of
existing code with minor changes. (b) Strive to co-exist with current
environment (e.g., legacy servers) rather than stick to p2p-everywhere
orthodoxy. 


What are the most important reasons to reject this paper?:
The design reveals relatively few new technical ideas. Most of it
assembles known techniques in more or less the way one would expect
given the problem statement "build a p2p/dht cache for an NFS/SFS/LBFS
server". The cut-through routing/atomic put/get is perhaps the biggest
new change, and this feels more like a "gotcha" that was successfully
diagnosed and addressed than a big stand-alone contribution.


Provide detailed comments to the author:
I tend to buy the argument that building a p2p cache for a centralized
server is a sensible piece of useful, low-hanging, more-easily-adopted
fruit than demanding that the server also be made p2p. But, the last
paragraph of section 2.1 needs to be considered carefully -- some
would regard the fact that "system operators administer this [shark]
server as with any other host...perform backups...configure and reboot
the machine when necessary, run other services, ..." as limitations
not features. P2p advocates might argue that they have a better answer
to many of these issues than the traditional answer -- you might
acknowlege the limitations of a central server while also pointing out
the practical advantages of going for this as low hanging fruit (as
well as point out any fundamental advantages you see.)

I would suggest pulling out the "prefetch whole file" as a
configurable option -- this approach is sensible for some workloads,
but might be a disaster for others (I'm thinking of, say, a grid
application that wants to use this file system to send a big data file
to a cluster but where different cluster nodes only want subsets of
the data. You can probably come up with other disaster scenarios.) You
might want to have a tunable prefetch parameter and explore how to
make it self-tuning or how to set it manually when a file is
opened. It might be enlightening to re-run some of the experiments
with prefetching completely turned off to isolate which factors
account for your performance more precisely.

I'm not sure the median of 5 runs is sufficient to make the results
believable despite PlanetLab variability. You probably should go to
the extra trouble of using confidence intervals to characterize the
variability (you might also verify that the distribution is stationary
across runs.)


</Review #2>


<Review #3>

Your qualifications  to review this paper :
I know a lot about this area

Overall paper merit:
Top 25% but not top 10% of submitted papers

Provide a short summary of the paper:
Yet another point in the p2p filesystem space; here we have a
client-server environment with a p2p proxy caching mechanism in between.


What are the most important reasons to accept this paper?:
This is one of the more sensible points in the
design space for p2p filesystem work.  The compatability with common
administrative practice and client mount protocols is nice.  The
Sloppy DHT design is the intriguing technical nugget amid the rehash
of prior work.

What are the most important reasons to reject this paper?:
Most of the ideas here appeared in earlier systems.
The target here isn't as conceptual ambitious as prior work, and
technically the system is a modest contribution (despite gratuitous
muscle-flexing about how many bugs they found in other people's code).
The author's worldview is unacceptably MIT/NYU-centric, and the
comparisons to SFS should be complemented with comparisons to, say, a
commercial AFS implementation, as well as other p2p platforms.  

Provide detailed comments to the author:
What are the realistic "killer" applications scenarios where your
benefit makes you so compelling we can't ignore you?  If they're all
"fire off my distributed experiment" type scenarios, perhaps an
app-level approach would be simpler/more effective -- e.g. use
BitTorrent or something to disseminate the binaries in advance.

Is cut-through routing really that important?

What about WebDAV over Coral?  Must be slower, but
how much so?  Would it be easier to build?


</Review #3>


<Review #4>

Your qualifications  to review this paper :
I know the material, but am not an expert

Overall paper merit:
Top 10% but not top 5% of submitted papers

Provide a short summary of the paper: 
This paper describes a distributed file system aimed towards an
environment where similar sets of files need to be distributed on a
large set of clients.  They extend the notion of a single, remote
server (e.g., NFS) by having clients act as cache proxies for others.
The clients access a modified distributed hash table to locate other
clients with copies of the file before asking the remote server for
the data.  To prevent potentially malicious clients from disrupting
data flow in this cache network, all operations are encrypted, and
methods are introduced for the client and the client proxy to
establish mutual trust.  Comparing the performance of this system to
SFS, the authors show that they can achieve an order of magnitude less
load at the server, and 4-5 times lower latency at the clients.  Their
canonical workload is distributing applications and data to PlanetLab
nodes for execution. Their evaluation clearly shows the benefits of
their system on this type of workload.

What are the most important reasons to accept this paper?: 
The system is interesting in large part because of the underlying
technologies it employs: DHTs, LBFS-style chunking, and cryptography.
For the most part, the authors clearly identify why these techniques
benefit their system.

The paper is well-written and clearly presents the overall ideas, the
details of the system, and the results of their evaluation.

The system presented fills a possibly important niche: the
distribution in this system provides performance benefits without
sacrificing the manageability of a single remote file server.

The mechanics of the system are well thought out and presented,
especially the security aspects.

Overall, this is an interesting paper that I would enjoy seeing
included in the conference.

What are the most important reasons to reject this paper?: 
While this is a solid presentation of a system, it is unclear how far
it advances the state of the art.  Many of the major components of the
system have already been explored in depth elsewhere.  This is only a
slight concern that might be alleviated by a better highlighting of
the contributions in the writing.

Shark is clearly useful for users of PlanetLab, but the authors do not
identify or discuss other possible workloads.

Shark employs cryptography to achieve a number of security goals.  The
paper could be improved by adding an explicit discussion of a threat
model and Shark's security solutions.  Currently, this information is
presented in pieces throughout the paper.

The evaluation results were heavily influenced by features of
PlanetLab's resource management. The authors acknowledge this fact,
but neglect to offer significant results that are not handicapped in
this manner.

Provide detailed comments to the author: 
I think your comparison to Farsite in related work (where you note a
similarity between encryption methods) is weak, especially since a
stated goal of that system is to present the logical view of a single
server file system.  I would have appreciated a more detailed
comparison.

You've effectively shown a reduction in workload on the origin server,
but what about individual servers in the cooperative cache?  How is
the workload shared among them?  In particular, does the first client
to fetch the file receive a disproportionate share of requests?

There are a few typos:
- Section 1, paragraph 2: "Program_s_ cannot run"
- Section 2.3, last paragraph before "a cooperative-caching read
  protocol": "command-li_n_e options"
- Section 5, paragraph 1: "such as AFS do not -do- perform"


</Review #4>


<Review #5>

Your qualifications  to review this paper :
I have passing familiarity

Overall paper merit:
Top 25% but not top 10% of submitted papers

Provide a short summary of the paper:
Shark is a distributed file system that (1) integrates smoothly into
existing local file systems, and (2) leverages P2P-based distributed
indexing so that often requests that cannot be satisfied locally can be
satisfied by accessing a nearby server in the same locality-based cluster.
The claim is that this design allows Shark to scale to very large
systems.

What are the most important reasons to accept this paper?:
The approach appears sound and it's exciting to think that one could
in fact construct such large distributed file systems.

What are the most important reasons to reject this paper?:
I didn't form a clear picture of how exactly Shark works in practice.
Part of this may be my not knowing this area well.  Two elements that
confused me were (1) will Shark indeed form effective clusters, and
(2) where will Shark's gains break down - what sort of uses or apps will
have access patterns for which Shark's caching will degenerate and we're
left with little more than wide-area NFS?

There are some unexplained performance peculiarities (see below).
It's not clear whether these might reflect serious problems (either
with Shark itself or with the evaluation).

I would like to better understand how well Shark will perform when
fetches are highly synchronized, rather than coming within a minute
of one another as in 4.2 (which clearly is not a stress-test).

Provide detailed comments to the author:
The use of "special characters" confused me.  Are these symbols not
in the alphabet, in order to impart some type information into the
hashes?  Why are they needed, and how are they escaped to make them
distinct from other bit patterns?

Regarding Coral stopping when reaching a node that's both full and loaded,
what happens if there is no such node?

What's meant by choosing cache names "carefully" to fit in the kernel name
cache?

Why the large difference between 10Mb file read and 40Mb file read in
figure 5?  (Also, these should be MB, not Mb.)  This is quite important to
understand, as it could reflect problems either in the implementation or
the evaluation.

"data read from /dev/random" - on many Unices, /dev/random only returns
(and consumes) bits available from the entropy pool, and there won't be
anything like 10MB available, so this won't in fact work.  So, what did
you actually wind up with?

It's a bit strange to refer to access limited to 10 Mbps as "local",
as usually that term connotes very high-speed access.

I didn't understand the comment in 4.2 about the tradeoffs between
continually retrying and increasing client latency.  What do you picture
here might lead to reduced bandwidth usage?

There are a lot of grammar errors :-(.


</Review #5>


<Review #6>

Your qualifications  to review this paper :
I know a lot about this area

Overall paper merit:
Top 25% but not top 10% of submitted papers

Provide a short summary of the paper:

The basic idea behind this paper is that (a) distributed file systems
are nice (much better than distributing files by hand, etc.) and that
(b) to use them in certain settings such as PlanetLab, they need to be
fixed. Most people would probably agree with (a), which leaves (b) as
the point of interest -- just how should distributed file systems be
fixed to enable their use in those types of settings?
                                                                                
The authors here focus on scale. What if hundreds or thousands of
machines suddenly start requesting data at the same time, hence
overwhelming the poor file server found in typical file systems? The
solution here is to apply some ideas that have been used elsewhere:
cooperating caching and LBFS-style file chunking.


What are the most important reasons to accept this paper?:

It is a paper about a real system.

It synthesizes some well-known good ideas
(e.g. cooperative caching, lbfs-style chunking).

It has a few cute implementation points
(e.g., atomic get/put)


What are the most important reasons to reject this paper?:

Not super novel.

Experiments are not very well done.

Some related work is missing.

Provide detailed comments to the author:

I lean towards reject here. Overall, I think this paper is OK -- not
terrifically innovative (though there are a few neat points, which I
list below), it is lacking in experimentation (also as detailed
below), and missing a few other ingredients (e.g., some related work,
some aspects of the design). However, I really think the paper will
benefit from the reject-improve-resubmit cycle.
                                                                                
All of that said, though, if you need more systems papers, this is certainly    
not an embarrassing selection -- it is just a bit underdone.                    

- The evaluation needs MORE WORK. A better set of comparisons -- e.g., to       
  local disk too, as that is the performance level you get if you manage to     
  distribute all of your stuff to the local disks. There is also a cost         
  involved with content-based approaches (someone has to look at all the
  data), which should be brought out. The benchmarks aren't very complete --    
  some file reads and some emacs stuff, barely showing anything interesting     
  about the system. Using the emacs distro to compute potential bandwidth       
  savings is more of a study of the emacs distro than something interesting     
  about Shark. And the hand-waving about PlanetLab performance problems is a    
  bit off-putting -- if one of your main points is that your system has to be   
  fast (in the absolute sense) to really be usable, it needs to be fast,        
  period! I could go on -- there is plenty to do here.                          
                                                                                
- Related work is missing. For example, it seemed odd to me that you didn't     
  discuss Pangaea (from the same OSDI as Ivy). Also, there were some papers at  
  the last NSDI that had at least some relevance: Total Recall from UCSD, and   
  the BAD FS work from Wisconsin.                                               
                                                                                
- When I read about your lease scheme (server has a committment to notify       
  clients of mods during that time, the 5 minute default duration), I           
  immediately wondered what kind of effect this had on scalability. It seems    
  like that might be a lot of work for the server, handing out those            
  notifications to potentially thousands of clients, esp. so frequently. I      
  hoped this would be addressed in the evaluation, but it wasn't!               
                                                                                
- Take care in using MB/s or Mb/s properly (MB -> MegaByte, Mb -> MegaBit).     
  You are a little sloppy with this here and there and it is a problem.         
                                                                                
- I really liked the cross-file system sharing support provided by Shark --     
  this is a neat outcome of the design (because tokens which are used to fetch  
  file chunks are derived from the contents).                                   
                                                                                
- The way in which the Shark clients start acting as proxies immediately is     
  nice (though I'm not sure that calling it "cut-through" routing is            
  particularly useful).                                                         
                                                                                
- The atomic put/get is neat and is a small but valuable contribution.          
                                                                                
- There are various sentence problems (e.g., page 10, first full para), some    
  of which make the paper feel a little rushed.                                 


</Review #6>