White Paper: Duke University DAFS Demo

Participants:

Jeff Chase (Duke), Richard Kisley (Duke), Andrew Gallatin (Duke),
Darrell Anderson (Duke), Rajiv Wickremesinghe (Duke)

Margo Seltzer (Harvard), Kostas Magoutis (Harvard), Salimah Addetia (Harvard)

I.  Overview

The Duke team will demo a DAFS client and a DAFS-compliant memory-based file server, running over Giganet cLAN on FreeBSD. The FreeBSD support for DAFS/cLAN is jointly developed by research groups at Duke and Harvard. The DAFS client enables direct I/O for applications built on TPIE, an external-memory I/O toolkit developed at Duke. TPIE supports a range of applications, including analysis programs for Geographic Information Systems (GIS) that process massive terrain grids. Information about TPIE and its use for GIS applications can be found through http://www.cs.duke.edu/geo*.

This is an early demo of a DAFS reference implementation for open-source systems (FreeBSD and Linux). A source-released reference implementation is a significant step toward collaborative development of this powerful enabling technology in the open-source community. The demonstration shows the power and performance of the DAFS asynchronous data transfer model and of direct user-level I/O for applications. Our use of the TPIE toolkit as an adaptation layer shows how important applications can benefit fully from the DAFS model without being reprogrammed to use the DAFS interface directly.

II.  Motivation

DAFS has the potential to enable efficient and transparently scalable storage and content services for IP networks and server interconnects such as InfiniBand. The DAFS architecture leverages user-level networking to enable low-overhead network I/O to network-attached storage, while empowering the client to control caching and data transfer policies. The implementation also allows us to study the performance of DAFS file access functionality relative to NFS.

The demonstration allows a direct performance comparison between an application that uses read and write system calls over an NFS/Gigabit Ethernet connection and the same application re-linked to run over DAFS on cLAN, using the same block sizes. Because only the file access calls change, and not the file handling semantics, we can directly measure the effect of NFS versus DAFS on performance.

III.  TPIE Merge Application

TPIE is an external memory programming toolkit enabling fast implementation of algorithms designed for massive-data applications. External memory algorithms seek to minimize the I/O performed when solving a given problem. TPIE is designed to bridge the gap between the theory and practice of parallel I/O systems, abstracting I/O management into streams and sub-streams. TPIE includes a pluggable Block Transfer Engine (BTE) that interfaces to the underlying file system. We introduced a new TPIE BTE for the DAFS interface, allowing us to run TPIE applications over DAFS and other file systems without modifying them individually.
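
The structure of this adaptation layer is sketched below. This is an illustrative, simplified block-oriented interface, not the actual TPIE or DAFS code: the names block_transfer_engine, file_bte, and dafs_bte are placeholders, and the real TPIE BTE and DAFS client APIs differ in detail.

    // Simplified sketch of a pluggable block-transfer layer in the spirit
    // of the TPIE BTE.  All names are placeholders; the real TPIE BTE and
    // DAFS client interfaces differ in detail.
    #include <cstddef>
    #include <cstdio>

    class block_transfer_engine {          // abstract BTE: fixed-size block I/O
    public:
        virtual ~block_transfer_engine() {}
        virtual bool read_block(std::size_t index, void *buf, std::size_t len) = 0;
        virtual bool write_block(std::size_t index, const void *buf, std::size_t len) = 0;
    };

    // Engine backed by an ordinary file (local or NFS-mounted), using stdio.
    class file_bte : public block_transfer_engine {
        std::FILE *fp_;
    public:
        explicit file_bte(const char *path) : fp_(std::fopen(path, "r+b")) {}
        ~file_bte() { if (fp_) std::fclose(fp_); }
        bool read_block(std::size_t index, void *buf, std::size_t len) {
            return fp_ && std::fseek(fp_, (long)(index * len), SEEK_SET) == 0
                       && std::fread(buf, 1, len, fp_) == len;
        }
        bool write_block(std::size_t index, const void *buf, std::size_t len) {
            return fp_ && std::fseek(fp_, (long)(index * len), SEEK_SET) == 0
                       && std::fwrite(buf, 1, len, fp_) == len;
        }
    };

    // A DAFS-backed engine would implement the same interface using the
    // user-level DAFS client operations; applications written against the
    // abstract interface need no changes to use it.
    class dafs_bte;   // declaration only; DAFS calls omitted here

Because a TPIE application sees only streams built on this abstract layer, selecting a different engine at link time is enough to move the same unmodified application from NFS-backed files to DAFS.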

IV.  Performance of DAFS Relative to NFS

The setup for our demo consists of two identical consumer-grade PCs: 800 MHz Pentium III, Intel 810 PCI chipset, 100 MHz SDRAM, FreeBSD 4.3 operating system. This platform has an I/O bandwidth limit of 87 MB/s. One node hosts a DAFS server and an NFS server; the other runs the TPIE applications as a DAFS client or NFS client. The server has 512 MB of SDRAM; the client has 256 MB. Each node has a Giganet cLAN NIC (for DAFS) and an Alteon Tigon II Gigabit Ethernet NIC (for NFSv3/UDP/IP), either of which can saturate the I/O system.

We use memory-based file systems to factor out disk-related costs and focus on data movement overheads. The NFS file system is hosted on a RAM disk mounted on the server. NFS is configured for optimal performance, with an adequate number of NFS server daemons and client block I/O daemons.

Performance is limited by the client CPU in our experiments. To minimize data transfer overheads, the NFS configuration uses a large (32KB) block size, with 32KB+ UDP packets carried over 8KB+ “Jumbo” Ethernet frames. We have modified the client kernel NFS and IP code and the Tigon firmware to eliminate data copying on the client; the only client-side copy for NFS I/O is the copy through the system call interface. Checksum offloading on the Tigon is enabled, also freeing the client CPU from checksum overheads in the NFS configuration. This configuration represents the state of the art for NFS. Thus our analysis is conservative, and tends to understate the performance improvement from DAFS relative to NFS for similar application scenarios in practice.

The test TPIE application merges n sorted input files of x y-byte records, each with a fixed-size key, into one sorted output file. Performance is reported as merge throughput: x * y (bytes) / t (seconds). The I/O bandwidth is double the reported merge bandwidth, since each record is read and then written. On the graph, record size is expressed as record density (records/MB), which also determines the input file size. Varying the merge order n and the record size y allows us to control the amount of CPU work the application performs per unit of I/O: CPU cost per record (key comparisons) grows logarithmically with n, while CPU cost per byte decreases as record size increases, since only the fixed-size keys are compared.
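
As a concrete illustration of these metrics, the sketch below computes merge throughput, I/O bandwidth, and per-byte CPU cost from the quantities named above (n, x, y, t). The numeric values are arbitrary placeholders, not measurements from the demo.

    // Sketch of the reported metrics; variable names follow the text.
    #include <cmath>
    #include <cstdio>

    int main() {
        double n = 8.0;         // merge order: number of sorted input streams
        double x = 1048576.0;   // records merged (placeholder value)
        double y = 256.0;       // record size in bytes (placeholder value)
        double t = 10.0;        // elapsed time in seconds (placeholder value)

        double merge_bw = (x * y) / t;     // reported merge throughput, bytes/sec
        double io_bw    = 2.0 * merge_bw;  // each record is read once and written once

        // CPU work per unit of I/O: ~log2(n) key comparisons per record,
        // spread over y bytes, so per-byte CPU cost falls as records get larger.
        double cmp_per_record = std::log(n) / std::log(2.0);
        double cmp_per_byte   = cmp_per_record / y;

        std::printf("merge %.2f MB/s, I/O %.2f MB/s, %.4f comparisons/byte\n",
                    merge_bw / 1e6, io_bw / 1e6, cmp_per_byte);
        return 0;
    }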

The graph shows merge runs of merge order 2, 4, and 8. There are two sets of DAFS runs, one using double buffering and asynchronous I/O in the BTE, and one using synchronous I/O. FreeBSD does double buffering in the NFS client using block I/O daemons. Asynchronous I/O allows the DAFS client to double-buffer with lower CPU overhead; the DAFS client is single-threaded, minimizing context-switch overhead.
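
The sketch below illustrates this single-threaded double-buffering structure: the client processes one buffer while an asynchronous read fills the other. POSIX AIO is used here only as a stand-in for the DAFS asynchronous operations, and the file name and block size are placeholders.

    // Single-threaded double buffering with asynchronous reads (POSIX AIO
    // standing in for the DAFS asynchronous I/O operations).
    #include <aio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cerrno>
    #include <csignal>
    #include <cstring>

    enum { BLOCK = 32 * 1024 };                 // placeholder block size

    static void process(const char *buf, ssize_t len) { /* merge work here */ }

    int main() {
        int fd = open("input.dat", O_RDONLY);   // placeholder input file
        if (fd < 0) return 1;

        static char bufs[2][BLOCK];
        struct aiocb cb;
        std::memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_nbytes = BLOCK;
        cb.aio_sigevent.sigev_notify = SIGEV_NONE;  // poll for completion instead

        // Prime the pipeline: start the first read into buffer 0.
        int cur = 0;
        off_t off = 0;
        cb.aio_buf = bufs[cur];
        cb.aio_offset = off;
        aio_read(&cb);

        for (;;) {
            // Wait (in this same thread) for the outstanding read to finish.
            const struct aiocb *list[1] = { &cb };
            while (aio_error(&cb) == EINPROGRESS)
                aio_suspend(list, 1, NULL);
            ssize_t got = aio_return(&cb);
            if (got <= 0) break;                // EOF or error

            // Immediately start filling the other buffer...
            int nxt = 1 - cur;
            off += got;
            cb.aio_buf = bufs[nxt];
            cb.aio_offset = off;
            aio_read(&cb);

            // ...and process the buffer that just completed, overlapping
            // application CPU work with the next data transfer.
            process(bufs[cur], got);
            cur = nxt;
        }
        close(fd);
        return 0;
    }

Because the completed buffer is consumed in the same thread that issued the next read, the overlap requires no helper threads, which is consistent with the low context-switch overhead of the single-threaded DAFS client described above.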

For a given merge order, DAFS with asynchronous I/O significantly outperforms NFS. This is because DAFS allows the client to perform I/O with lower CPU overhead, leaving more of the CPU free for the application to process the data. For example, on the left side of the graph, where the application has the highest I/O demand (larger records mean less CPU work per byte, so the application is less compute-bound), the NFS client consumes up to 60% of its CPU in the kernel managing I/O, even with our optimizations. The difference between DAFS and NFS is smallest at the right side of the graph, where the merge application is heavily compute-bound and I/O overheads are a smaller share of client CPU time. At a given merge order, DAFS with double buffering based on asynchronous I/O outperforms NFS in this realistic application scenario by as much as a factor of two in the I/O-bound cases.

V.  Summary

The demo exhibits basic DAFS functionality, including file access and data transfer using low-overhead RDMA and asynchronous I/O, in a reference implementation for open-source systems. The application tests provide a fair basis for judging the performance benefits of DAFS for real applications in practice, relative to NFS. The DAFS model allows lower I/O overhead due to copy avoidance with RDMA, user-level networking enabled by the Emulex/Giganet cLAN VI network, reduced protocol overhead on the host, reduced context switching, and other factors. Low-overhead I/O improves application performance by as much as a factor of two, depending on the I/O demands of the application. In practice, the benefits may be even higher, since the NFS configuration in our experiments is highly optimized.