Congratulations! Your submission titled "Shark: Scaling File Servers via Cooperative Caching" (paper #208) has been accepted to NSDI 2005. We had an excellent submission pool and the competition for the final program was intense. You should take the acceptance of your paper as a strong endorsement of your work by the community. At the same time, we hope that you will use the detailed reviews as an opportunity to improve your work for the final program. Your reviews are at the end of this message.

We look forward to your revised camera-ready copy. As you prepare it, please bear the following points in mind:

1. Paper and Contact Information. For each paper, Usenix requires (i) the title; (ii) the ordered list of authors with their full affiliations; (iii) the contact author's name, email address, phone number, and mailing address. Please reply to this message *now* and send us an up-to-date set of this information.

2. Shepherding. All papers are being shepherded in their revisions. Your shepherd will contact you within a few days. You should discuss your plan for revisions with them (it is often helpful to chop up the review text and say how you intend to respond to each point), as they can help you understand the key strengths and weaknesses of your paper as perceived by the Program Committee. Papers are due to your shepherd for review no later than *March 11, 2005*.

3. Registration and camera-ready instructions. The camera-ready copy is due to Usenix by *March 29, 2005*. Details of the camera-ready requirements and submission process, as well as conference registration for authors, are available at: http://www.usenix.org/events/nsdi05/instrux/. Please read this entire page; it is the only information you will receive on camera-ready preparation.

We look forward to a strong program and presentations in Boston. Best of luck with your revisions and congratulations once again!

Amin Vahdat and David Wetherall
NSDI 2005 PC Chairs

------------

Your qualifications to review this paper: I know the material, but am not an expert

Overall paper merit: Top 10% but not top 5% of submitted papers

Provide a short summary of the paper:
This paper is part of the long line of SFS papers, in this case tailored to overcome a specific problem: high latency and low bandwidth to a single WAN file server for mostly read-access workloads. It does so through cooperative caching, which has the additional benefit of permitting the WAN file server (also referred to as the origin server) to scale to hundreds or thousands of WAN clients. In a nutshell: AFS + secure & reliable channels + Rabin fingerprints + cooperative caching w/ an FS-optimized DHT. The paper does a nice job of describing the design and implementation, tying together a number of well-understood, separate mechanisms.

What are the most important reasons to accept this paper?:
See below...

What are the most important reasons to reject this paper?:
See below...

Provide detailed comments to the author:
The authors should also speak to their dependency on whole-file caching, which strikes me as a design choice with potentially serious impact.

Another potential weakness that concerns this reviewer is latency and outage of the origin server. For example, suppose the origin server is at NYU, the client is in Korea, latency is high, and the network suffers intermittent outages. In such a scenario, connectivity to the origin server might be lost, causing file accesses to "hang" even though the underlying data is local or nearby.

The paper could do a better job of making a strong case for the need for such a file system. For example, what applications have a mostly read-only "filesystem" workload for large amounts of data? Maybe grid applications, or others? It is an unknown.
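To make the outage concern concrete, here is the kind of read path I would want to see discussed (purely an illustrative sketch in Python; every name below is hypothetical, and this is not Shark's actual code or API):

    # Hypothetical sketch of a client read path that degrades gracefully when the
    # WAN origin server is unreachable -- not Shark's code; all names are made up.
    ORIGIN_TIMEOUT = 5.0   # longest a read may block on the origin server

    def read_chunk(token, nearby_proxies, fetch_from_origin, local_cache):
        """Try nearby replicas first, then the origin (with a bound), then any
        locally cached copy, rather than hanging on a dead WAN link."""
        for fetch_from_proxy in nearby_proxies:        # replicas located via the DHT
            try:
                return fetch_from_proxy(token, timeout=1.0)
            except OSError:
                continue                               # proxy unreachable; try the next one
        try:
            return fetch_from_origin(token, timeout=ORIGIN_TIMEOUT)
        except OSError:
            if token in local_cache:                   # serving stale data beats hanging
                return local_cache[token]
            raise

Whether Shark already behaves roughly like this, and what consistency it gives up when it falls back to a cached copy, is exactly what I would like the paper to spell out.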
------------

Your qualifications to review this paper: I know a lot about this area

Overall paper merit: Top 25% but not top 10% of submitted papers

Provide a short summary of the paper:
This paper describes a p2p/DHT-based cooperative cache for a shared (and centralized) file server. It is targeted at grid applications or PlanetLab experiments where a group of nodes share a collection of binary or data files that must be distributed from the server to the nodes to run the experiment (note that it may have broader applicability). It is based on a group of known techniques and re-uses considerable existing code: SFS, LBFS, Coral, etc., but it tweaks these pieces a bit (e.g., cut-through routing in Coral to create distribution trees when a file is first accessed by a bunch of clients at about the same time).

What are the most important reasons to accept this paper?:
The system seems like it would be useful for PlanetLab and grid researchers and perhaps others. Good engineering: (a) rather than reinvent the wheel, use a lot of existing code with minor changes; (b) strive to co-exist with the current environment (e.g., legacy servers) rather than sticking to p2p-everywhere orthodoxy.

What are the most important reasons to reject this paper?:
The design reveals relatively few new technical ideas. Most of it assembles known techniques in more or less the way one would expect given the problem statement "build a p2p/DHT cache for an NFS/SFS/LBFS server". The cut-through routing/atomic put/get is perhaps the biggest new change, and this feels more like a "gotcha" that was successfully diagnosed and addressed than a big stand-alone contribution.

Provide detailed comments to the author:
I tend to buy the argument that building a p2p cache for a centralized server is a sensible piece of useful, low-hanging, more-easily-adopted fruit than demanding that the server also be made p2p. But the last paragraph of Section 2.1 needs to be considered carefully -- some would regard the fact that "system operators administer this [Shark] server as with any other host...perform backups...configure and reboot the machine when necessary, run other services, ..." as limitations, not features. P2p advocates might argue that they have a better answer to many of these issues than the traditional one -- you might acknowledge the limitations of a central server while also pointing out the practical advantages of going for this low-hanging fruit (as well as pointing out any fundamental advantages you see).

I would suggest pulling out "prefetch whole file" as a configurable option -- this approach is sensible for some workloads, but might be a disaster for others (I'm thinking of, say, a grid application that wants to use this file system to send a big data file to a cluster, but where different cluster nodes only want subsets of the data. You can probably come up with other disaster scenarios.) You might want to have a tunable prefetch parameter and explore how to make it self-tuning, or how to set it manually when a file is opened. It might be enlightening to re-run some of the experiments with prefetching completely turned off, to isolate which factors account for your performance more precisely.

I'm not sure the median of 5 runs is sufficient to make the results believable given PlanetLab variability. You probably should go to the extra trouble of using confidence intervals to characterize the variability (you might also verify that the distribution is stationary across runs).
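Something as simple as the following would already be more convincing than a bare median (an illustrative calculation only -- the run times below are invented for the example, not numbers from the paper):

    import math

    # 95% confidence interval for the mean of n = 5 runs, using the t distribution
    # (t_{0.975, df=4} = 2.776). The run times here are made up for illustration.
    runs = [41.2, 44.7, 39.8, 52.3, 43.1]   # e.g., seconds to read a file on PlanetLab
    n = len(runs)
    mean = sum(runs) / n
    stdev = math.sqrt(sum((x - mean) ** 2 for x in runs) / (n - 1))
    half_width = 2.776 * stdev / math.sqrt(n)
    print(f"mean {mean:.1f} +/- {half_width:.1f} s (95% CI)")

If the interval turns out to be wide relative to the differences you are claiming between systems, that is worth knowing; more runs, or reporting the full distribution, would then be in order.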
------------

Your qualifications to review this paper: I know a lot about this area

Overall paper merit: Top 25% but not top 10% of submitted papers

Provide a short summary of the paper:
Yet another point in the p2p filesystem space; here we have a client-server environment with a p2p proxy caching mechanism in between.

What are the most important reasons to accept this paper?:
This is one of the more sensible points in the design space for p2p filesystem work. The compatibility with common administrative practice and client mount protocols is nice. The Sloppy DHT design is the intriguing technical nugget amid the rehash of prior work.

What are the most important reasons to reject this paper?:
Most of the ideas here appeared in earlier systems. The target here isn't as conceptually ambitious as prior work, and technically the system is a modest contribution (despite gratuitous muscle-flexing about how many bugs they found in other people's code). The authors' worldview is unacceptably MIT/NYU-centric, and the comparisons to SFS should be complemented with comparisons to, say, a commercial AFS implementation, as well as other p2p platforms.

Provide detailed comments to the author:
What are the realistic "killer" application scenarios where your benefit makes you so compelling that we can't ignore you? If they're all "fire off my distributed experiment" type scenarios, perhaps an app-level approach would be simpler/more effective -- e.g., use BitTorrent or something to disseminate the binaries in advance.

Is cut-through routing really that important? What about WebDAV over Coral? Must be slower, but how much so? Would it be easier to build?
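A first-order answer to the "how much slower" question could come from simply timing the same object fetched directly and fetched through Coral's HTTP interface (a sketch only; the URL below is hypothetical, and a real WebDAV layer would add further overhead on top of this):

    import time
    import urllib.request

    # Time one object fetched directly versus through a Coralized URL (the
    # hostname is hypothetical). Remember that the first Coral fetch of an
    # object is a cache miss, so repeat the measurement.
    DIRECT    = "http://www.example.org/experiment/binary.tgz"
    CORALIZED = "http://www.example.org.nyud.net:8090/experiment/binary.tgz"

    for url in (DIRECT, CORALIZED):
        start = time.time()
        data = urllib.request.urlopen(url, timeout=60).read()
        print(f"{len(data)} bytes in {time.time() - start:.1f} s via {url}")

Even a rough number here would tell readers whether cut-through routing is buying something fundamental or just shaving a constant.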
------------

Your qualifications to review this paper: I know the material, but am not an expert

Overall paper merit: Top 10% but not top 5% of submitted papers

Provide a short summary of the paper:
This paper describes a distributed file system aimed at an environment where similar sets of files need to be distributed to a large set of clients. They extend the notion of a single, remote server (e.g., NFS) by having clients act as cache proxies for others. The clients access a modified distributed hash table to locate other clients with copies of the file before asking the remote server for the data. To prevent potentially malicious clients from disrupting data flow in this cache network, all operations are encrypted, and methods are introduced for the client and the client proxy to establish mutual trust. Comparing the performance of this system to SFS, the authors show that they can achieve an order of magnitude less load at the server and 4-5 times lower latency at the clients. Their canonical workload is distributing applications and data to PlanetLab nodes for execution. Their evaluation clearly shows the benefits of their system on this type of workload.

What are the most important reasons to accept this paper?:
The system is interesting in large part because of the underlying technologies it employs: DHTs, LBFS-style chunking, and cryptography. For the most part, the authors clearly identify why these techniques benefit their system. The paper is well written and clearly presents the overall ideas, the details of the system, and the results of their evaluation. The system presented fills a possibly important niche: the distribution in this system provides performance benefits without sacrificing the manageability of a single remote file server. The mechanics of the system are well thought out and presented, especially the security aspects. Overall, this is an interesting paper that I would enjoy seeing included in the conference.

What are the most important reasons to reject this paper?:
While this is a solid presentation of a system, it is unclear how far it advances the state of the art. Many of the major components of the system have already been explored in depth elsewhere. This is only a slight concern that might be alleviated by a better highlighting of the contributions in the writing. Shark is clearly useful for users of PlanetLab, but the authors do not identify or discuss other possible workloads. Shark employs cryptography to achieve a number of security goals. The paper could be improved by adding an explicit discussion of a threat model and Shark's security solutions; currently, this information is presented in pieces throughout the paper. The evaluation results were heavily influenced by features of PlanetLab's resource management. The authors acknowledge this fact, but neglect to offer significant results that are not handicapped in this manner.

Provide detailed comments to the author:
I think your comparison to Farsite in related work (where you note a similarity between encryption methods) is weak, especially since a stated goal of that system is to present the logical view of a single-server file system. I would have appreciated a more detailed comparison.

You've effectively shown a reduction in workload on the origin server, but what about individual servers in the cooperative cache? How is the workload shared among them? In particular, does the first client to fetch the file receive a disproportionate share of requests? (See the toy sketch after these comments for what I mean.)

There are a few typos:
- Section 1, paragraph 2: "Program_s_ cannot run"
- Section 2.3, last paragraph before "a cooperative-caching read protocol": "command-li_n_e options"
- Section 5, paragraph 1: "such as AFS do not -do- perform"
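To illustrate the load-sharing question above, here is a deliberately naive model (my own assumption: clients pick a proxy uniformly at random among earlier fetchers, which is not what Shark's proximity-based lookup does) just to show the kind of skew measurement I would like to see reported:

    import random

    # Toy model, not Shark: N clients fetch a file in sequence; each new client is
    # served by one earlier fetcher chosen uniformly at random. How skewed is the
    # serving load, and how much of it lands on the very first fetcher?
    random.seed(0)
    N = 1000
    served = [0] * N                  # served[i] = requests proxied by client i
    for k in range(1, N):             # client 0 gets the file from the origin server
        served[random.randrange(k)] += 1
    print("first fetcher served", served[0],
          "requests; busiest client served", max(served))

Even under this benign random model the earliest fetchers serve noticeably more than the average client; a plot of the measured per-proxy load in your PlanetLab runs would answer the question directly.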
------------

Your qualifications to review this paper: I have passing familiarity

Overall paper merit: Top 25% but not top 10% of submitted papers

Provide a short summary of the paper:
Shark is a distributed file system that (1) integrates smoothly into existing local file systems, and (2) leverages P2P-based distributed indexing so that requests that cannot be satisfied locally can often be satisfied by accessing a nearby server in the same locality-based cluster. The claim is that this design allows Shark to scale to very large systems.

What are the most important reasons to accept this paper?:
The approach appears sound and it's exciting to think that one could in fact construct such large distributed file systems.

What are the most important reasons to reject this paper?:
I didn't form a clear picture of how exactly Shark works in practice. Part of this may be my not knowing this area well. Two elements that confused me were (1) will Shark indeed form effective clusters, and (2) where will Shark's gains break down -- what sort of uses or apps will have access patterns for which Shark's caching degenerates, leaving us with little more than wide-area NFS? There are some unexplained performance peculiarities (see below). It's not clear whether these might reflect serious problems (either with Shark itself or with the evaluation). I would also like to better understand how well Shark will perform when fetches are highly synchronized, rather than coming within a minute of one another as in 4.2 (which clearly is not a stress-test).

Provide detailed comments to the author:
The use of "special characters" confused me. Are these symbols not in the alphabet, in order to impart some type information into the hashes? Why are they needed, and how are they escaped to make them distinct from other bit patterns?

Regarding Coral stopping when it reaches a node that's both full and loaded: what happens if there is no such node?

What's meant by choosing cache names "carefully" to fit in the kernel name cache?

Why the large difference between the 10Mb file read and the 40Mb file read in Figure 5? (Also, these should be MB, not Mb.) This is quite important to understand, as it could reflect problems either in the implementation or the evaluation.

"data read from /dev/random" -- on many Unices, /dev/random only returns (and consumes) bits available from the entropy pool, and there won't be anything like 10 MB available, so this won't in fact work. So, what did you actually wind up with? (A non-blocking alternative is sketched after these comments.)

It's a bit strange to refer to access limited to 10 Mbps as "local", as that term usually connotes very high-speed access.

I didn't understand the comment in 4.2 about the tradeoffs between continually retrying and increasing client latency. What do you picture here that might lead to reduced bandwidth usage?

There are a lot of grammar errors :-(.
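Regarding the /dev/random point above, what you probably want is something like the following (a sketch, not necessarily what the authors did), which builds a 10 MB pseudo-random test file without blocking on the entropy pool:

    import os

    # Build a 10 MB file of pseudo-random data. os.urandom() reads the kernel's
    # non-blocking /dev/urandom source; pulling 10 MB straight from /dev/random
    # would stall as soon as the entropy pool ran dry.
    with open("testfile-10MB", "wb") as f:
        for _ in range(10):
            f.write(os.urandom(1 << 20))   # 1 MB at a time

Either way, the paper should say precisely what data was used, since content-based chunking behaves very differently on random data than on real file trees.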
------------

Your qualifications to review this paper: I know a lot about this area

Overall paper merit: Top 25% but not top 10% of submitted papers

Provide a short summary of the paper:
The basic idea behind this paper is that (a) distributed file systems are nice (much better than distributing files by hand, etc.) and that (b) to use them in certain settings, such as PlanetLab, they need to be fixed. Most people would probably agree with (a), which leaves (b) as the point of interest -- just how should distributed file systems be fixed to enable their use in those types of settings? The authors here focus on scale. What if hundreds or thousands of machines suddenly start requesting data at the same time, hence overwhelming the poor file server found in typical file systems? The solution here is to apply some ideas that have been used elsewhere: cooperative caching and LBFS-style file chunking.

What are the most important reasons to accept this paper?:
It is a paper about a real system. It synthesizes some well-known good ideas (e.g., cooperative caching, LBFS-style chunking). It has a few cute implementation points (e.g., atomic get/put).

What are the most important reasons to reject this paper?:
Not super novel. Experiments are not very well done. Some related work is missing.

Provide detailed comments to the author:
I lean towards reject here. Overall, I think this paper is OK -- not terrifically innovative (though there are a few neat points, which I list below), it is lacking in experimentation (also as detailed below), and it is missing a few other ingredients (e.g., some related work, some aspects of the design). However, I really think the paper will benefit from the reject-improve-resubmit cycle. All of that said, though, if you need more systems papers, this is certainly not an embarrassing selection -- it is just a bit underdone.

- The evaluation needs MORE WORK. A better set of comparisons -- e.g., to local disk too, as that is the performance level you get if you manage to distribute all of your stuff to the local disks. There is also a cost involved with content-based approaches (someone has to look at all the data), which should be brought out (the chunking sketch after these comments makes that cost concrete). The benchmarks aren't very complete -- some file reads and some emacs stuff, barely showing anything interesting about the system. Using the emacs distro to compute potential bandwidth savings is more of a study of the emacs distro than something interesting about Shark. And the hand-waving about PlanetLab performance problems is a bit off-putting -- if one of your main points is that your system has to be fast (in the absolute sense) to really be usable, it needs to be fast, period! I could go on -- there is plenty to do here.

- Related work is missing. For example, it seemed odd to me that you didn't discuss Pangaea (from the same OSDI as Ivy). Also, there were some papers at the last NSDI that had at least some relevance: Total Recall from UCSD, and the BAD-FS work from Wisconsin.

- When I read about your lease scheme (the server has a commitment to notify clients of modifications during that time, with a 5-minute default duration), I immediately wondered what kind of effect this had on scalability. It seems like that might be a lot of work for the server, handing out those notifications to potentially thousands of clients, especially so frequently. I hoped this would be addressed in the evaluation, but it wasn't!

- Take care to use MB/s and Mb/s properly (MB -> megabyte, Mb -> megabit). You are a little sloppy with this here and there, and it is a problem.

- I really liked the cross-file-system sharing support provided by Shark -- this is a neat outcome of the design (because the tokens used to fetch file chunks are derived from the contents).

- The way in which Shark clients start acting as proxies immediately is nice (though I'm not sure that calling it "cut-through" routing is particularly useful).

- The atomic put/get is neat and is a small but valuable contribution.

- There are various sentence problems (e.g., page 10, first full para), some of which make the paper feel a little rushed.
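For concreteness, here is roughly what LBFS-style content-defined chunking looks like (an illustrative sketch, not the authors' code; the window size, mask, and constants are arbitrary choices of mine). Note that the rolling hash has to touch every byte of every file -- that is the cost I mean above:

    import hashlib, os

    # Content-defined chunking in the LBFS/Rabin style: a rolling hash over a
    # small sliding window picks chunk boundaries from the bytes themselves,
    # so identical regions of different files fall into identical chunks.
    WINDOW = 48                        # sliding-window size in bytes
    MASK = (1 << 13) - 1               # boundary when low 13 bits are all ones -> ~8 KB chunks
    PRIME = 1000003
    MOD = 1 << 61
    TOP = pow(PRIME, WINDOW - 1, MOD)  # weight of the byte leaving the window

    def chunks(data):
        """Yield content-defined chunks of `data` (a bytes object)."""
        start, h = 0, 0
        for i, byte in enumerate(data):
            if i >= WINDOW:
                h = (h - data[i - WINDOW] * TOP) % MOD   # drop the oldest byte
            h = (h * PRIME + byte) % MOD                 # fold in the newest byte
            if i + 1 >= WINDOW and (h & MASK) == MASK:   # boundary chosen by the data
                yield data[start:i + 1]
                start = i + 1
        if start < len(data):
            yield data[start:]

    # Each chunk is then named by a hash of its contents and can be fetched or
    # deduplicated independently of where it sits in any particular file.
    blob = os.urandom(256 * 1024)
    tokens = [hashlib.sha1(c).hexdigest() for c in chunks(blob)]
    print(len(tokens), "chunks,", len(set(tokens)), "distinct")

Reporting the CPU time spent on this scan (and on hashing the chunks) alongside the claimed bandwidth savings would make the trade-off clear.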