Applications that stand to benefit from the SPINE software architecture include those that perform streaming I/O (e.g., multimedia clients/servers and file servers), host-based IP routers, cluster-based storage management, and packet filtering.
Using SPINE, we have built a video client application. The application defines an application-specific video extension that transfers video data arriving from the network directly to the frame buffer. The video client runs as a regular application on Windows NT. It is responsible for creating the framing window in which the video is displayed and for informing the video extension of the window coordinates. The video extension on the network interface maintains the window coordinate and size information, and uses DMA to transfer video data arriving from the network to the region of frame buffer memory that represents the application's window. The video client application catches window movement events and informs the video extension of the new window coordinates.
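The division of labor between the video client and the video extension can be illustrated with a minimal sketch. It is not SPINE code: the class and method names (`VideoExtension`, `set_window`, `on_video_data`) are hypothetical, the frame buffer is modeled as a flat `bytearray` with one byte per pixel, each network payload is assumed to carry whole rows of the window, and the DMA transfer is stood in for by a slice copy.

```python
class VideoExtension:
    """Toy model of a video extension running on the network interface.

    All names are hypothetical; this is not the SPINE interface.
    """

    def __init__(self, frame_buffer, screen_width):
        self.fb = frame_buffer            # stands in for frame buffer memory
        self.screen_width = screen_width  # pixels per screen row, 1 byte each
        self.window = (0, 0, 0, 0)        # x, y, width, height

    def set_window(self, x, y, width, height):
        # Called by the host video client when the window is created or moved.
        self.window = (x, y, width, height)

    def on_video_data(self, first_row, pixels):
        # Called for video data arriving from the network. Assumption: the
        # payload holds whole window rows, starting at window row first_row.
        x, y, width, height = self.window
        if width == 0:
            return  # no window registered yet
        for i in range(len(pixels) // width):
            row = first_row + i
            if row >= height:
                break  # ignore rows that fall below the window
            start = (y + row) * self.screen_width + x
            # Slice copy stands in for the DMA transfer into the frame buffer.
            self.fb[start:start + width] = pixels[i * width:(i + 1) * width]
```

In this sketch the host is involved only in `set_window`, mirroring the text: the client reports new coordinates after a window movement event, and every subsequent payload lands in the new region without further host involvement. Clipping against the screen edges is omitted for brevity.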
The first thing to note about the SPINE IP router is that its structure is quite similar to that of the SPINE video client. The router application on the host loads the IP routing extensions onto the network interfaces (label 1) and initializes the forwarding table. IP packets arriving from the network are dispatched to the router extension, which determines how to forward each packet by consulting the IP forwarding table. If the packet should be forwarded to another network interface (label 2), the router extension uses the peer-to-peer communication support provided by SPINE. If, on the other hand, the IP packet is intended for the host, it is handed to the operating system's networking stack (label 3). The router, perhaps more so than the video client, demonstrates the distributed-systems nature of SPINE: extensions can communicate with the host, with peer devices, or via the network.
In our experimental setup each network interface can forward 11,800 packets per second using the SPINE router extension while placing zero load on the host CPU and memory subsystem, as neither control nor data needs to be transferred to the host system. In comparison, a host-based IP forwarding system using identical hardware (i.e., multiple Myrinet interfaces plugged into a 200 MHz Pentium PC) built at USC/ISI achieves 12,000 packets per second while utilizing 100% of the host CPU. The USC/ISI implementation optimizes the data path so that only the IP packet header is copied into the host system; the remaining IP packet data is DMA transferred directly between the source and destination Myrinet interfaces. Note that the SPINE-based implementation, which runs on the slow embedded processor of the Myrinet interface, is only about 2% slower (11,800 versus 12,000 packets per second) than the host-based forwarding implementation, which uses a 200 MHz Pentium processor. Moreover, our approach places zero load on the host system.
As a result of this system structure, the host has plenty of processing cycles available to handle routing updates or more complex protocol processing while the intelligent network interface cards independently forward IP packets.
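The three-way dispatch decision made by the router extension can be sketched as follows. This is an illustration under stated assumptions, not the SPINE implementation: the names `RouterExtension`, `add_route`, `dispatch`, and the `HOST` marker are hypothetical, the table lookup is assumed to be a longest-prefix match (the text does not specify the lookup method), and the returned tuples merely name the path a packet would take.

```python
import ipaddress

HOST = "host"  # hypothetical marker: the packet is destined for the local host


class RouterExtension:
    """Toy model of a router extension on one network interface."""

    def __init__(self, interface_name):
        self.interface_name = interface_name
        self.table = []  # (network, target) pairs, longest prefix first

    def add_route(self, prefix, target):
        # In the text, the host router application initializes this table.
        self.table.append((ipaddress.ip_network(prefix), target))
        self.table.sort(key=lambda entry: entry[0].prefixlen, reverse=True)

    def dispatch(self, destination):
        # Assumed longest-prefix match: the first hit in the sorted table wins.
        address = ipaddress.ip_address(destination)
        for network, target in self.table:
            if address in network:
                if target == HOST:
                    return ("host_stack", None)    # hand to the OS networking stack
                return ("peer_to_peer", target)    # forward to a peer interface
        return ("drop", None)                      # no matching route


router = RouterExtension("myri0")
router.add_route("10.0.0.0/8", "myri1")
router.add_route("10.1.0.0/16", "myri2")
router.add_route("192.168.1.1/32", HOST)
print(router.dispatch("10.1.2.3"))
```

Only the `host_stack` outcome involves the host processor; the `peer_to_peer` path stays on the interfaces, which is the property behind the zero host load reported in the measurements.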
It is possible to have the network interface issue disk read/write requests without involving the host processor, thereby allowing remote hosts to share disks directly, even during a failure of the target's operating system. Making both the network and disk interfaces independent of the host operating system thus enables applications that are both high performance and highly available (e.g., shared disks for databases, network-attached disks). The emerging I2O architecture may create intelligent disk subsystems that will make it feasible to access a disk directly from the network device.
Accessing memory on a remote node does not require the cooperation of the remote processor. An intelligent I/O processor enables application-specific remote memory operations to be handled directly by the network controller. For example, the coherence protocols used by software distributed memory systems generally suffer from communication latency and overhead. By pushing the application-level coherence protocol into the kernel we expect to improve performance, because the protocol code has direct access to the operating system services that control virtual memory. By further pushing a component of the application-level coherence protocol into the network interface we expect to reduce the overhead and latency of "getting" and "putting" pages from/to remote nodes in a cluster, without interrupting the remote node's host processor.
Marc E. Fiuczynski, mef@cs.washington.edu, Department of Computer Science and Engineering, University of Washington