Abstract: Currently, I/O device virtualization models in virtual machine (VM) environments require involvement of a virtual machine monitor (VMM) and/or a privileged VM for each I/O operation, which may turn out to be a performance bottleneck for systems with high I/O demands, especially those equipped with modern high speed interconnects such as InfiniBand. In this paper, we propose a new device virtualization model called VMM-bypass I/O, which extends the OS-bypass idea that originated in user-level communication. Essentially, VMM-bypass allows time-critical I/O operations to be carried out directly in guest VMs without involvement of the VMM and/or a privileged VM. By exploiting the intelligence found in modern high speed network interfaces, VMM-bypass can significantly improve I/O and communication performance for VMs without sacrificing safety or isolation.

To demonstrate the idea of VMM-bypass, we have developed a prototype called Xen-IB, which offers InfiniBand virtualization support in the Xen 3.0 VM environment. Xen-IB runs with current InfiniBand hardware and does not require modifications to existing user-level applications or kernel-level drivers that use InfiniBand. Our performance measurements show that Xen-IB is able to achieve nearly the same raw performance as the original InfiniBand driver running in a non-virtualized environment.

1 Introduction

Virtual machine (VM) technologies were first introduced in the 1960s [14], but they have been experiencing a resurgence in recent years and are becoming more and more attractive to both industry and the research community [35]. A key component in a VM environment is the virtual machine monitor (VMM), also called the hypervisor, which is implemented directly on top of the hardware and provides virtualized hardware interfaces to VMs. With the help of VMMs, VM technologies allow running many different virtual machines on a single physical box, with each virtual machine possibly hosting a different operating system. VMs can also provide secure and portable environments to meet the demanding resource requirements of modern computing systems [9].

In VM environments, device I/O access in guest operating systems can be handled in different ways. For instance, in VMware Workstation, device I/O relies on switching back to the host operating system and user-level emulation [37]. In VMware ESX Server, guest VM I/O operations trap into the VMM, which accesses the I/O devices directly [42]. In Xen [11], device I/O follows a split-driver model. Only an isolated device domain (IDD) has access to the hardware using native device drivers. All other virtual machines (guest VMs, or domains) need to pass their I/O requests to the IDD to access the devices. This control transfer between domains requires involvement of the VMM.

In recent years, network interconnects that provide very low latency (less than 5 us) and very high bandwidth (multiple Gbps) have been emerging. Examples of these high speed interconnects include Virtual Interface Architecture (VIA) [12], InfiniBand [19], Quadrics [34], and Myrinet [25]. Due to their excellent performance, these interconnects have become strong players in areas such as high performance computing (HPC). To achieve high performance, these interconnects usually have intelligent network interface cards (NICs) which can be used to offload a large part of the host communication protocol processing.
The intelligence in the NICs also supports user-level communication, which enables safe direct I/O access from user-level processes (OS-bypass I/O) and contributes to reduced latency and CPU overhead.

VM technologies can greatly benefit computing systems built from the aforementioned high speed interconnects by not only simplifying cluster management for these systems, but also offering much cleaner solutions to tasks such as check-pointing and fail-over. Recently, as these high speed interconnects become more and more commoditized and their cost goes down, they are also used for providing remote I/O access in high-end enterprise systems, which increasingly run in virtualized environments. Therefore, it is very important to provide VM support for high-end systems equipped with these high speed interconnects. However, the performance and scalability requirements of these systems pose some challenges.

In all the VM I/O access approaches mentioned previously, VMMs have to be involved to make sure that I/O accesses are safe and do not compromise the integrity of the system. Therefore, current device I/O access in virtual machines requires context switches between the VMM and guest VMs. Thus, I/O access can suffer from longer latency and higher CPU overhead compared to native I/O access in non-virtualized environments. In some cases, the VMM may also become a performance bottleneck which limits I/O performance in guest VMs. In some of the aforementioned approaches (VMware Workstation and Xen), a host operating system or another virtual machine is also involved in the I/O access path. Although these approaches can greatly simplify VMM design by moving device drivers out of the VMM, they may lead to even higher I/O access overhead when they require context switches between the host operating system and the guest VM, or between two different VMs.

In this paper, we present a VMM-bypass approach for I/O access in VM environments. Our approach takes advantage of features found in modern high speed intelligent network interfaces to allow time-critical operations to be carried out directly in guest VMs while still maintaining system integrity and isolation. With this method, we can remove the bottleneck of going through the VMM or a separate VM for many I/O operations and significantly improve communication and I/O performance. The key idea of our VMM-bypass approach is based on the OS-bypass design of modern high speed network interfaces, which allows user processes to access I/O devices directly in a safe way without going through operating systems. OS-bypass was originally proposed by the research community [41,40,29,6,33] and later adopted by commercial interconnects such as InfiniBand. Our idea can be regarded as an extension of OS-bypass designs in the context of VM environments.

To demonstrate the idea of VMM-bypass, we have designed and implemented a prototype called Xen-IB to provide virtualization support for InfiniBand in Xen. Basically, our implementation presents to each guest VM a para-virtualized InfiniBand device. Our design requires no modification to existing hardware. Also, through a technique called high-level virtualization, we allow current user-level applications and kernel-level modules that utilize InfiniBand to run without changes. Our performance results, which include benchmarks at the basic InfiniBand level as well as evaluations of upper-layer InfiniBand protocols such as IP over InfiniBand (IPoIB) [1] and MPI [36], demonstrate that the performance of our VMM-bypass approach comes close to that of a native, non-virtualized environment.
Although our current implementation is for InfiniBand and Xen, the basic VMM-bypass idea and many of our implementation techniques can be readily applied to other high-speed interconnects and other VMMs. In summary, the main contributions of our work are:

We proposed the VMM-bypass approach for I/O accesses in VM environments for modern high speed interconnects. Using this approach, many I/O operations can be performed directly without involvement of a VMM or another VM. Thus, I/O performance can be greatly improved.
Based on the idea of VMM-bypass, we implemented a prototype, Xen-IB, to virtualize InfiniBand devices in Xen guest VMs. Our prototype supports running existing InfiniBand applications and kernel modules in guest VMs without any modification.
We carried out extensive performance evaluation of our prototype. Our results show that the performance of our virtualized InfiniBand device is very close to that of native InfiniBand devices running in a non-virtualized environment.
The rest of the paper is organized as follows: In Section 2, we present background information, including the Xen VM environment and the InfiniBand architecture. In Section 3, we present the basic idea of VMM-bypass I/O. In Section 4, we discuss the detailed design and implementation of our Xen-IB prototype. In Section 5, we discuss several related issues and limitations of our current implementation and how they can be addressed in the future. Performance evaluation results are given in Section 6. We discuss related work in Section 7 and conclude the paper in Section 8.

2 Background

In this section, we provide background information for our work. In Section 2.1, we describe how I/O device access is handled in several popular VM environments. In Section 2.3, we describe the OS-bypass feature in modern high speed network interfaces. Since our prototype is based on Xen and InfiniBand, we introduce them in Sections 2.2 and 2.4, respectively.

2.1 I/O Device Access in Virtual Machines

In a VM environment, the VMM plays the central role of virtualizing hardware resources such as CPUs, memory, and I/O devices. To maximize performance, the VMM can let guest VMs access these resources directly whenever possible. Taking CPU virtualization as an example, a guest VM can execute all non-privileged instructions natively in hardware without intervention of the VMM. However, privileged instructions executed in guest VMs will generate a trap into the VMM. The VMM will then take the necessary steps to make sure that the execution can continue without compromising system integrity. Since many CPU intensive workloads seldom use privileged instructions (this is especially true for applications in the HPC area), they can achieve excellent performance even when executed in a VM.

I/O device access in VMs, however, is a completely different story. Since I/O devices are usually shared among all VMs in a physical machine, the VMM has to make sure that accesses to them are legal and consistent. Currently, this requires VMM intervention on every I/O access from guest VMs. For example, in VMware ESX Server [42], all physical I/O accesses are carried out within the VMM, which includes device drivers for popular server hardware. System integrity is achieved by having every I/O access go through the VMM. Furthermore, the VMM can serve as an arbitrator/multiplexer/demultiplexer to implement useful features such as QoS control among VMs. However, VMM intervention also leads to longer I/O latency and higher CPU overhead due to the context switches between guest VMs and the VMM. Since the VMM serves as a central control point for all I/O accesses, it may also become a performance bottleneck for I/O intensive workloads.

Having device I/O access in the VMM also complicates the design of the VMM itself. It significantly limits the range of supported physical devices because new device drivers have to be developed to work within the VMM. To address this problem, VMware Workstation [37] and Xen [13] carry out I/O operations in a host operating system or a special privileged VM called an isolated device domain (IDD), which can run popular operating systems such as Windows and Linux that have a large number of existing device drivers. Although this approach can greatly simplify the VMM design and increase the range of supported hardware, it does not directly address the performance issues of the approach used in VMware ESX Server.
In fact, I/O accesses now may result in expensive operations called a world switch (a switch between the host OS and a guest VM) or a domain switch (a switch between two different VMs), which can lead to even worse I/O performance.

2.2 Overview of the Xen Virtual Machine Monitor

Xen is a popular high performance VMM. It uses para-virtualization [43], in which host operating systems need to be explicitly ported to the Xen architecture. This architecture is similar to native hardware such as the x86 architecture, with only slight modifications to support efficient virtualization. Since Xen does not require changes to the application binary interface (ABI), existing user applications can run without any modification.

Figure 1: The structure of the Xen hypervisor, hosting three XenoLinux operating systems (courtesy [32])

Figure 1 illustrates the structure of a physical machine running Xen. The Xen hypervisor is at the lowest level and has direct access to the hardware. The hypervisor, instead of the guest operating systems, runs at the most privileged processor level. Xen provides the basic control interfaces needed to perform complex policy decisions. Above the hypervisor are the Xen domains (VMs). There can be many domains running simultaneously. Guest VMs are prevented from directly executing privileged processor instructions. A special domain called domain0, which is created at boot time, is allowed to access the control interface provided by the hypervisor. The guest OS in domain0 hosts application-level management software and performs the tasks of creating, terminating, or migrating other domains through the control interface.

There is no guarantee that a domain will get a continuous stretch of physical memory to run a guest OS. Xen makes a distinction between machine memory and pseudo-physical memory. Machine memory refers to the physical memory installed in a machine, while pseudo-physical memory is a per-domain abstraction, allowing a guest OS to treat its memory as a contiguous range of physical pages. Xen maintains the mapping between machine and pseudo-physical memory. Only certain parts of the operating system need to understand the difference between these two abstractions. Guest OSes allocate and manage their own hardware page tables, with minimal involvement of the Xen hypervisor to ensure safety and isolation.

In Xen, domains can communicate with each other through shared pages and event channels. Event channels provide an asynchronous notification mechanism between domains. Each domain has a set of end-points (or ports) which may be bound to an event source. When a pair of end-points in two domains are bound together, a "send" operation on one side will cause an event to be received by the destination domain, which may in turn cause an interrupt. Event channels are only intended for sending notifications between domains. So if a domain wants to send data to another, the typical scheme is for the source domain to grant the destination domain access to local memory pages. Then, these shared pages are used to transfer data.

Virtual machines in Xen usually do not have direct access to hardware. Since most existing device drivers assume they have complete control of the device, there cannot be multiple instantiations of such drivers in different domains for a single device.
To ensure manageability and safe access, device virtualization in Xen follows a split device driver model [13]. Each device driver is expected to run in an isolated device domain (IDD), which hosts a backend driver to serve access requests from guest domains. Each guest OS uses a frontend driver to communicate with the backend. The split driver organization provides security: misbehaving code in a guest domain will not result in failure of other guest domains. The split device driver model requires the development of frontend and backend drivers for each device class. A number of popular device classes such as virtual disk and virtual network are currently supported in guest domains.

2.3 OS-bypass I/O

Traditionally, device I/O accesses are carried out inside the OS kernel on behalf of application processes. However, this approach imposes several problems, such as the overhead caused by context switches between user processes and OS kernels and extra data copies, which degrade I/O performance [5]. It can also result in QoS crosstalk [17] due to the lack of proper accounting for the costs of I/O accesses carried out by the kernel on behalf of applications.

To address these problems, a concept called user-level communication was introduced by the research community. One of the notable features of user-level communication is OS-bypass, with which I/O (communication) operations can be achieved directly by user processes without involvement of OS kernels. OS-bypass was later adopted by commercial products, many of which have become popular in areas such as high performance computing, where low latency is vital to applications. It should be noted that OS-bypass does not mean all I/O operations bypass the OS kernel. Usually, devices allow OS-bypass for frequent and time-critical operations, while other operations, such as setup and management operations, can go through OS kernels and are handled by a privileged module, as illustrated in Figure 2.

Figure 2: OS-Bypass Communication and I/O

The key challenge in implementing OS-bypass I/O is to enable safe access to a device shared by many different applications. To achieve this, OS-bypass capable devices usually require more intelligence in the hardware than traditional I/O devices. Typically, an OS-bypass capable device is able to present virtual access points to different user applications. Hardware data structures for virtual access points can be encapsulated into different I/O pages. With the help of an OS kernel, the I/O pages can be mapped into the virtual address spaces of different user processes. Thus, different processes can access their own virtual access points safely, thanks to the protection provided by the virtual memory mechanism. Although the idea of user-level communication and OS-bypass was developed for traditional, non-virtualized systems, the intelligence and self-virtualizing characteristics of OS-bypass devices lend themselves nicely to a virtualized environment, as we will see later.

2.4 InfiniBand Architecture

InfiniBand [19] is a high speed interconnect offering high performance as well as features such as OS-bypass. InfiniBand host channel adapters (HCAs) are the equivalent of network interface cards (NICs) in traditional networks. InfiniBand uses a queue-based model for communication. A Queue Pair (QP) consists of a send queue and a receive queue. The send queue holds instructions to transmit data and the receive queue holds instructions that describe where received data is to be placed.
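To make the queue-based model concrete, the sketch below shows how a process might post a send work request and poll for its completion using the OpenFabrics verbs API (the user-level interface of the Gen2 stack described later in this section). It is only an illustrative fragment: it assumes a completion queue cq, a queue pair qp that has already been created and connected, and a buffer buf registered as memory region mr, and it omits error handling.

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Post one send descriptor and busy-poll the CQ for its completion.
     * cq/qp/mr/buf are assumed to be set up already (illustrative only). */
    static int send_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                             struct ibv_mr *mr, void *buf, size_t len)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t) buf,   /* virtual address of registered buffer */
            .length = (uint32_t) len,
            .lkey   = mr->lkey,          /* local key from memory registration */
        };
        struct ibv_send_wr wr = {
            .wr_id      = 1,
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_SEND,
            .send_flags = IBV_SEND_SIGNALED, /* ask for a completion entry */
        };
        struct ibv_send_wr *bad_wr;
        struct ibv_wc wc;
        int n;

        /* Posting the descriptor rings a doorbell through the mapped UAR page;
         * no system call or kernel involvement is needed (OS-bypass). */
        if (ibv_post_send(qp, &wr, &bad_wr))
            return -1;

        /* Poll the completion queue directly from user space. */
        do {
            n = ibv_poll_cq(cq, 1, &wc);
        } while (n == 0);

        return (n == 1 && wc.status == IBV_WC_SUCCESS) ? 0 : -1;
    }

In a non-virtualized system, the kernel is involved only in setting up these resources; the VMM-bypass design described later preserves exactly this fast path inside a guest VM.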
These communication instructions are described in Work Queue Requests (WQRs), or descriptors, and are submitted to the queue pairs. The completion of communication is reported through Completion Queues (CQs) using Completion Queue Entries (CQEs). CQEs can be accessed by using polling or event handlers. Initiating data transfers (posting descriptors) and detecting their completion (polling for completion) are time-critical tasks that use OS-bypass. In the Mellanox [21] approach, which represents a typical implementation of the InfiniBand specification, posting descriptors is done by ringing a doorbell. Doorbells are rung by writing to the registers that form the User Access Region (UAR). Each UAR is a 4 KB I/O page mapped into a process's virtual address space. Posting a work request includes putting the descriptors into a QP buffer and writing the doorbell to the UAR, which is completed without the involvement of the operating system. CQ buffers, where the CQEs are located, can also be directly accessed from the process virtual address space. These OS-bypass features make it possible for InfiniBand to provide very low communication latency.

InfiniBand also provides a comprehensive management scheme. Management communication is achieved by sending management datagrams (MADs) to well-known QPs (QP0 and QP1).

InfiniBand requires all buffers involved in communication to be registered before they can be used in data transfers. In Mellanox HCAs, the purpose of registration is two-fold. First, the HCA needs to keep an entry in the Translation and Protection Table (TPT) so that it can perform virtual-to-physical translation and protection checks during data transfers. Second, the memory buffer needs to be pinned in memory so that the HCA can DMA directly into the target buffer. Upon successful registration, a local key and a remote key are returned, which can be used later for local and remote (RDMA) accesses. The QP and CQ buffers described above are just normal buffers that are directly allocated from the process virtual memory space and registered with the HCA.

There are two popular stacks for InfiniBand drivers. VAPI [23] is the Mellanox implementation, and OpenIB Gen2 [28] has recently come out as a new generation of the IB stack provided by the OpenIB community. In this paper, our prototype implementation is based on OpenIB Gen2, whose architecture is illustrated in Figure 3.

Figure 3: Architectural overview of OpenIB Gen2 stack

3 VMM-Bypass I/O

VMM-bypass I/O can be viewed as an extension of the idea of OS-bypass I/O in the context of VM environments. In this section, we describe the basic design of VMM-bypass I/O. Two key ideas in our design are para-virtualization and high-level virtualization.

In some VM environments, I/O devices are virtualized at the hardware level [37]. Each I/O instruction to access a device is virtualized by the VMM. With this approach, existing device drivers can be used in the guest VMs without any modification. However, it significantly increases the complexity of virtualizing devices. For example, one popular InfiniBand card (MT23108 from Mellanox [24]) presents itself as a PCI-X device to the system. After initialization, it can be accessed by the OS using memory mapped I/O. Virtualizing this device at the hardware level would require us to not only understand all the hardware commands issued through memory mapped I/O, but also implement a virtual PCI-X bus in the guest VM. Another problem with this approach is performance.
Since existing physical devices are typically not designed to run in a virtualized environment, the interfaces presented at the hardware level may exhibit significant performance degradation when they are virtualized.

Our VMM-bypass I/O virtualization design is based on the idea of para-virtualization, similar to [11] and [44]. We do not preserve the hardware interfaces of existing devices. To virtualize a device in a guest VM, we implement a device driver called the guest module in the OS of the guest VM. The guest module is responsible for handling all the privileged accesses to the device. In order to achieve VMM-bypass device access, the guest module also needs to set things up properly so that I/O operations can be carried out directly in the guest VM. This means that the guest module must be able to create virtual access points on behalf of the guest OS and map them into the address spaces of user processes. Since the guest module does not have direct access to the device hardware, we need to introduce another software component called the backend module, which provides device hardware access for the different guest modules. If devices are accessed inside the VMM, the backend module can be implemented as part of the VMM. It is possible to let the backend module talk to the device directly. However, we can greatly simplify its design by reusing the original privileged module of the OS-bypass device driver. In addition to serving as a proxy for device hardware access, the backend module also coordinates accesses among different VMs so that system integrity can be maintained. The VMM-bypass I/O design is illustrated in Figure 4.

Figure 4: VMM-Bypass I/O (I/O Handled by VMM Directly)

Figure 5: VMM-Bypass I/O (I/O Handled by Another VM)

If device accesses are provided by another VM (the device driver VM), the backend module can be implemented within the device driver VM. The communication between guest modules and the backend module can be achieved through the inter-VM communication mechanism provided by the VM environment. This approach is shown in Figure 5.

Para-virtualization can lead to compatibility problems because a para-virtualized device does not conform to any existing hardware interface. However, in our design, these problems can be addressed by maintaining existing interfaces which are at a higher level than the hardware interface (a technique we dubbed high-level virtualization). Modern interconnects such as InfiniBand have their own standardized access interfaces. For example, the InfiniBand specification defines a VERBS interface for a host to talk to an InfiniBand device. The VERBS interface is usually implemented in the form of an API set through a combination of software and hardware. Our high-level virtualization approach maintains the same VERBS interface within a guest VM. Therefore, existing kernel drivers and applications that use InfiniBand will be able to run without any modification. Although in theory a driver or an application can bypass the VERBS interface and talk to InfiniBand devices directly, this seldom happens because it leads to poor portability due to the fact that different InfiniBand devices may have different hardware interfaces.

4 Prototype Design and Implementation

In this section, we present the design and implementation of Xen-IB, our InfiniBand virtualization driver for Xen. We describe details of the design and how we enable direct access to the HCA from guest domains for time-critical tasks.

4.1 Overview

Like many other device drivers, InfiniBand drivers cannot have multiple instantiations for a single HCA.
Thus, a split driver model approach is required to share a single HCA among multiple Xen domains.

Figure 6: The Xen-IB driver structure with the split driver model

Figure 6 illustrates the basic design of our Xen-IB driver. The backend runs as a kernel daemon on top of the native InfiniBand driver in the isolated device domain (IDD), which is domain0 in our current implementation. It waits for incoming requests from the frontend drivers in the guest domains. The frontend driver, which corresponds to the guest module mentioned in Section 3, replaces the kernel HCA driver in the OpenIB Gen2 stack. Once the frontend is loaded, it establishes two event channels with the backend daemon. The first channel, together with shared memory pages, forms a device channel [13] which is used to process requests initiated from the guest domain. The second channel is used for sending InfiniBand CQ and QP events to the guest domain and will be discussed in detail later.

The Xen-IB frontend driver provides the same set of interfaces as a normal Gen2 stack to kernel modules. It is a relatively thin layer whose tasks include packing a request together with the necessary parameters and sending it to the backend through the device channel. The backend driver reconstructs the commands, performs the operation using the native kernel HCA driver on behalf of the guest domain, and returns the result to the frontend driver.

The split device driver model in Xen poses difficulties for user-level direct HCA access in Xen guest domains. To enable VMM-bypass, we need to let guest domains have direct access to certain HCA resources such as the UARs and the QP/CQ buffers.

4.2 InfiniBand Privileged Accesses

In the following, we discuss in general how we support all privileged InfiniBand operations, including initialization, InfiniBand resource management, memory registration, and event handling.

Initialization and resource management: Before applications can communicate using InfiniBand, they must complete several preparation steps, including opening the HCA, creating CQs and QPs, and modifying QP status. These operations are usually not in the time-critical path and can be implemented in a straightforward way. Basically, the guest domains forward these commands to the device driver domain (IDD) and wait for the acknowledgments after the operations are completed. All the resources are managed in the backend and the frontends refer to these resources by handles. Validation checks must be conducted in the IDD to ensure that all references are legal.

Memory registration: The InfiniBand specification requires all memory regions involved in data transfers to be registered with the HCA. With Xen's para-virtualization approach, real machine addresses are directly visible to user domains. (Note that access control is still achieved because Xen makes sure a user domain cannot arbitrarily map a machine page.) Thus, a domain can easily figure out the DMA addresses of buffers and there is no extra need for address translation (assuming that no IOMMU is used). The information needed for memory registration is a list of DMA addresses that describe the physical locations of the buffers, the access flags, and the virtual address that the application will use when accessing the buffers. Again, the registration happens in the device domain. The frontend driver sends the above information to the backend driver and gets back the local and remote keys.
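From the point of view of an unmodified application or kernel module in the guest, registration still looks like an ordinary verbs call; only the frontend/backend path behind it changes. A minimal user-level sketch, assuming an existing protection domain pd and omitting error handling, might look as follows:

    #include <infiniband/verbs.h>
    #include <stdlib.h>

    /* Register a buffer with the (virtualized) HCA. In Xen-IB, the privileged
     * part of this operation is forwarded by the frontend driver to the backend
     * in the IDD, which performs the actual registration and returns the keys.
     * The application code itself is unchanged. */
    static struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
    {
        void *buf = malloc(len);
        if (!buf)
            return NULL;

        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr) {
            free(buf);
            return NULL;
        }

        /* mr->lkey is used in local work requests; mr->rkey is handed to peers
         * so that they can issue RDMA operations targeting this buffer. */
        return mr;
    }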
Note that since the Translation and Protection Table (TPT) on the HCA is indexed by keys, multiple guest domains are allowed to register buffers at the same virtual address. For security reasons, the backend driver can verify that the frontend driver offers valid DMA addresses belonging to the specific domain in which it is running. This check makes sure that all later communication activities of guest domains stay within their valid address spaces.

Event handling: InfiniBand supports several kinds of CQ and QP events. The most commonly used is the completion event. Event handlers are associated with CQs or QPs when they are created. An application can subscribe for event notification by writing a command to the UAR page. When subscribed events happen, the HCA driver is first notified by the HCA and then dispatches the event to different CQs or QPs according to the event type. Then the application/driver that owns the CQ/QP gets a callback on the event handler.

For Xen-IB, events are generated for the device domain, where all QPs and CQs are actually created. But the device domain cannot directly give a callback on the event handlers in the guest domains. To address this issue, we create a dedicated event channel between a frontend and the backend driver. The backend driver associates a special event handler with each CQ/QP created due to requests from guest domains. Each time the HCA generates an event for these CQs/QPs, this special event handler gets executed and forwards information such as the event type and the CQ/QP identifier to the guest domain through the event channel. The frontend driver binds an event dispatcher as a callback handler to one end of the event channel after the channel is created. The event handlers given by the applications are associated with the CQs or QPs after they are successfully created. The frontend driver also maintains a translation table between the CQ/QP identifiers and the actual CQs/QPs. Once the event dispatcher gets an event notification from the backend driver, it checks the identifier and gives the corresponding CQ/QP a callback on the associated handler.

4.3 VMM-Bypass Accesses

In InfiniBand, QP accesses (posting descriptors) include writing WQEs to the QP buffers and ringing doorbells (writing to UAR pages) to notify the HCA. The HCA can then use DMA to transfer the WQEs to internal HCA memory and perform the send/receive or RDMA operations. Once a work request is completed, the HCA puts a completion entry (CQE) in the CQ buffer. In InfiniBand, QP access functions are used for initiating communication. To detect completion of communication, CQ polling can be used. QP access and CQ polling functions are typically used in the critical path of communication. Therefore, it is very important to optimize their performance by using VMM-bypass. The basic architecture of the VMM-bypass design is shown in Figure 7.

Figure 7: VMM-Bypass design of Xen-IB driver

Supporting VMM-bypass for QP access and CQ polling imposes two requirements on our design of Xen-IB: first, UAR pages must be accessible from a guest domain; second, both QP and CQ buffers should be directly visible in the guest domain. When a frontend driver is loaded, the backend driver allocates a UAR page and returns its page frame number (machine address) to the frontend. The frontend driver then remaps this page into its own address space so that it can directly access the UAR in the guest domain to serve requests from the kernel drivers.
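The exact mechanism for mapping machine frames of I/O pages into a guest depends on Xen (a small Xen patch is needed, as noted in the next paragraph). The sketch below therefore uses a hypothetical helper xenib_request_uar() for the frontend-to-backend request together with the standard Linux io_remap_pfn_range() call, and should be read as an approximation of the idea of exposing a UAR page to a process in a guest domain rather than as the actual Xen-IB code.

    #include <linux/mm.h>
    #include <linux/fs.h>
    #include <linux/errno.h>
    #include <asm/page.h>
    #include <asm/pgtable.h>

    /* Hypothetical helper: ask the backend (over the device channel) for a free
     * UAR page and return its machine frame number. Not a real Xen-IB symbol. */
    extern unsigned long xenib_request_uar(void);

    /* mmap handler of the frontend's character device: map one UAR page into
     * the calling process so doorbells can be rung without kernel involvement. */
    static int xenib_uar_mmap(struct file *filp, struct vm_area_struct *vma)
    {
        unsigned long mfn = xenib_request_uar();   /* machine frame from backend */

        if (vma->vm_end - vma->vm_start != PAGE_SIZE)
            return -EINVAL;

        /* Map the I/O page uncached into the application's address space. */
        vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
        return io_remap_pfn_range(vma, vma->vm_start, mfn,
                                  PAGE_SIZE, vma->vm_page_prot);
    }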
(A small patch to Xen is required to enable this kind of access to I/O pages from guest domains.) In the same way, when a user application starts, the frontend driver applies for a UAR page from the backend and remaps the page into the application's virtual address space, which can later be accessed directly from user space. Since all UARs are managed in a centralized manner in the IDD, there will be no conflicts between UARs in different guest domains.

To make QP and CQ buffers accessible to guest domains, creating CQs/QPs has to go through two stages. In the first stage, QP or CQ buffers are allocated in the guest domains and registered through the IDD. During the second stage, the frontend sends the CQ/QP creation commands to the IDD along with the keys returned from the registration stage to complete the creation process. Address translations are indexed by keys, so in later operations the HCA can directly read WQRs from, and write CQEs back to, the buffers located in the guest domains (using DMA). Since we also allocate UARs to user space applications in guest domains, the user-level InfiniBand library keeps its OS-bypass feature. The VMM-bypass Xen-IB workflow is illustrated in Figure 8.

Figure 8: Workflow of the VMM-bypass Xen-IB driver

It should be noted that since VMM-bypass accesses directly interact with the HCA, they are usually hardware dependent and the frontends need to know how to deal with different types of InfiniBand HCAs. However, existing InfiniBand drivers and user-level libraries already include code for direct access, and it can be reused without new development effort.

4.4 Virtualizing InfiniBand Management Operations

In an InfiniBand network, management and administrative tasks are achieved through the use of Management Datagrams (MADs). MADs are sent and received just like normal InfiniBand communication, except that they must use two well-known queue-pairs: QP0 and QP1. Since there is only one set of such queue-pairs in every HCA, their access must be virtualized for many different VMs, which means we must treat them differently from normal queue-pairs. However, since queue-pair accesses can be done directly in guest VMs in our VMM-bypass approach, it would be very difficult to track each queue-pair access and take different actions based on whether it is a management queue-pair or a normal one.

To address this difficulty, we use the idea of high-level virtualization. This is based on the fact that although MADs are the basic mechanism for InfiniBand management, applications and kernel drivers seldom use them directly. Instead, different management tasks are achieved through more user-friendly and standard API sets which are implemented on top of MADs. For example, the kernel IPoIB protocol makes use of the subnet administration (SA) services, which are offered through a high-level, standardized SA API. Therefore, instead of tracking each queue-pair access, we virtualize management functions at the API level by providing our own implementation for guest VMs. Most functions can be implemented in a manner similar to privileged InfiniBand operations, which typically involves sending a request to the backend driver, executing the request in the backend, and getting a reply. Since management functions are rarely in time-critical paths, this implementation does not bring any significant performance degradation.
However, it does require us to implement every function provided by all the different management interfaces. Fortunately, there are only a couple of such interfaces and the implementation effort is not significant.

5 Discussions

In this section, we discuss issues related to our prototype implementation, such as how safe device access is ensured, how performance isolation between different VMs can be achieved, and challenges in implementing VM check-pointing and migration with VMM-bypass. We also point out several limitations of our current prototype and how we can address them in the future.

5.1 Safe Device Access

To ensure that accesses to virtual InfiniBand devices by different VMs will not compromise system integrity, we need to make sure that both privileged accesses and VMM-bypass accesses are safe. Since all privileged accesses need to go through the backend module, access checks are implemented there to guarantee safety. VMM-bypass operations are achieved through accessing the memory-mapped UAR pages which contain virtual access points. Setting up these mappings is privileged and can be checked. InfiniBand allows using both virtual and physical addresses for sending and receiving messages or carrying out RDMA operations, as long as a valid memory key is presented. Since the key is obtained through InfiniBand memory registration, which is also a privileged operation, we implement the necessary safety checks in the backend module to ensure that a VM can only carry out valid memory registration operations. It should be noted that once a memory buffer is registered, its physical memory pages cannot be reclaimed by the VMM. Therefore, we should limit the total size of buffers that can be registered by a single VM. This limit check can also be implemented in the backend module.

Memory registration is an expensive operation in InfiniBand. In our virtual InfiniBand implementation, the registration cost is even higher due to inter-domain communication. This may lead to performance degradation in cases where buffers cannot be registered in advance. Techniques such as a pin-down cache can be applied when buffers are reused frequently, but they are not always effective. To address this issue, some existing InfiniBand kernel drivers create and use a DMA key through which all physical pages can be accessed. Currently, our prototype supports DMA keys. However, this leaves a security hole because all physical memory pages (including those belonging to other VMs) can be accessed. In the future, we plan to address this problem by letting the DMA keys only authorize access to physical pages in the current VM. However, this also means that we need to update the keys whenever the VMM changes the physical pages allocated to a VM.

5.2 Performance Isolation

Although our current prototype does not yet implement performance isolation or QoS among different VMs, this issue can be addressed by taking advantage of QoS mechanisms which are present in the current hardware. For example, Mellanox InfiniBand HCAs support a QoS scheme in which a weighted round-robin algorithm is used to schedule different queue-pairs. In this scheme, QoS policy parameters are assigned when queue-pairs are created and initialized. After that, the HCA hardware is responsible for taking the necessary steps to enforce the QoS policies. Since queue-pair creations are privileged, we can create the desired QoS policies in the backend when queue-pairs are created. These QoS policies will later be enforced by the device hardware. We plan to explore more in this direction in the future.
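Because every privileged request funnels through the backend, both kinds of policy discussed above reduce to per-domain checks at request time. The sketch below only illustrates this idea with hypothetical types and fields (xenib_domain, the registration limit, and the QoS weight are illustrative and not part of the actual prototype):

    #include <stddef.h>

    /* Hypothetical per-domain accounting kept by the backend in the IDD. */
    struct xenib_domain {
        size_t registered_bytes;   /* memory currently pinned for this domain   */
        size_t registration_limit; /* cap on pinned memory (Section 5.1)        */
        int    qos_weight;         /* weight applied at QP creation (Section 5.2) */
    };

    /* Check a registration request before forwarding it to the native driver. */
    static int backend_check_registration(struct xenib_domain *dom, size_t len)
    {
        if (dom->registered_bytes + len > dom->registration_limit)
            return -1;                      /* would exceed the per-VM limit */
        dom->registered_bytes += len;
        return 0;
    }

    /* Pick the scheduling weight for a newly created queue-pair so that the
     * HCA's weighted round-robin scheduler can enforce the policy in hardware. */
    static int backend_assign_qos(const struct xenib_domain *dom)
    {
        return dom->qos_weight;
    }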
5.3 VM Check-pointing and Migration

VMM-bypass I/O poses new challenges for implementing VM check-pointing and migration. This is due to two reasons. First, the VMM does not have complete knowledge of VMs with respect to device accesses. This is in contrast to traditional device virtualization approaches, in which the VMM is involved in every I/O operation and can easily suspend and buffer these operations when check-pointing or migration starts. The second problem is that VMM-bypass I/O exploits intelligent devices which can store a large part of the VM system state. For example, an InfiniBand HCA has onboard memory which stores information such as registered buffers, queue-pair data structures, and so on. Some of the state information on an HCA can only be changed as a side effect of VERBS function calls; it cannot be changed in an arbitrary way. This makes check-pointing and migration difficult, because when a VM is restored from a previous checkpoint or migrated to another node, the corresponding state information on the HCA needs to be restored as well.

There are two directions to address the above problems. The first one is to involve VMs in the process of check-pointing and migration. For example, the VMs can bring themselves to certain determined states which simplify check-pointing and migration. Another way is to introduce some hardware/firmware changes. We are currently working in both directions.

6 Performance Evaluation

In this section, we first evaluate the performance of our Xen-IB prototype using a set of InfiniBand layer micro-benchmarks. Then, we present performance results for the IPoIB protocol based on Xen-IB. We also provide performance numbers of MPI on Xen-IB at both the micro-benchmark and application levels.

6.1 Experimental Setup

Our experimental testbed is an InfiniBand cluster. Each system in the cluster is equipped with dual Intel Xeon 3.0 GHz CPUs, 2 GB memory, and a Mellanox MT23108 PCI-X InfiniBand HCA. The PCI-X buses on the systems are 64 bit and run at 133 MHz. The systems are connected with an InfiniScale InfiniBand switch. The operating systems are RedHat AS4 with the 2.6.12 kernel. Xen 3.0 is used for all our experiments, with each guest domain running with a single virtual CPU and 512 MB of memory.

6.2 InfiniBand Latency and Bandwidth

In this subsection, we compare user-level latency and bandwidth performance between Xen-IB and native InfiniBand. Xen-IB results were obtained from two guest domains on two different physical machines. Polling was used for detecting completion of communication.

Figure 9: InfiniBand RDMA Write Latency

Figure 10: InfiniBand Send/Receive Latency

The latency tests were carried out in a ping-pong fashion. They were repeated many times and the average half round-trip time was reported as one-way latency. Figures 9 and 10 show the latency for InfiniBand RDMA write and send/receive operations, respectively. There is very little performance difference between Xen-IB and native InfiniBand. This is because in the tests, InfiniBand communication was carried out by directly accessing the HCA from the guest domains with VMM-bypass. The lowest latency achieved by both was around 4.2 us for RDMA write and 6.6 us for send/receive.

In the bandwidth tests, a sender sent a number of messages to a receiver and then waited for an acknowledgment. The bandwidth was obtained by dividing the number of bytes transferred from the sender by the elapsed time of the test. From Figures 11 and 12, we again see virtually no difference between Xen-IB and native InfiniBand.
Both of them were able to achieve bandwidth up to 880 MByte/s, which was limited by the bandwidth of the PCI-X bus.

Figure 11: InfiniBand RDMA Write Bandwidth

Figure 12: InfiniBand Send/Receive Bandwidth

6.3 Event/Interrupt Handling Overhead

The latency numbers we showed in the previous subsection were based on polling schemes. In this subsection, we characterize the overhead of event/interrupt handling in Xen-IB by showing send/receive latency results with blocking InfiniBand user-level VERBS functions. Compared with native InfiniBand event/interrupt processing, Xen-IB introduces extra overhead because it requires forwarding an event from domain0 to a guest domain, which involves Xen inter-domain communication. In Figure 13, we show the performance of Xen inter-domain communication. We can see that the overhead increases with the amount of data transferred. However, even with very small messages, there is an overhead of about 10 us.

Figure 13: Inter-domain Communication One-Way Latency

Figure 14: Send/Receive Latency Using Blocking VERBS Functions

Figure 14 shows the send/receive one-way latency using blocking VERBS. The test is almost the same as the send/receive latency test using polling. The difference is that a process blocks and waits for a completion event instead of busy polling on the completion queue. From the figure, we see that Xen-IB has higher latency due to the overhead caused by inter-domain communication. For each message, Xen-IB needs to use inter-domain communication twice, once for send completion and once for receive completion. For large messages, we observe that the difference between Xen-IB and native InfiniBand is around 18-20 us, which is roughly twice the inter-domain communication latency. However, for small messages, the difference is much less. For example, native InfiniBand latency is only 3 us better for 1 byte messages. This difference gradually increases with message size until it reaches around 20 us. Our profiling reveals that this is due to "event batching". For small messages, the inter-domain latency is much higher than the InfiniBand latency. Thus, when a send completion event is delivered to a guest domain, a reply may have already come back from the other side. Therefore, the guest domain can process two completions with a single inter-domain communication operation, which results in reduced latency. For small messages, event batching happens very often. As the message size increases, it becomes less and less frequent and the difference between Xen-IB and native InfiniBand increases.

6.4 Memory Registration

Memory registration is generally a costly operation in InfiniBand. Figure 15 shows the registration time of Xen-IB and native InfiniBand. The benchmark registers and unregisters a chunk of user buffers multiple times and measures the average time for each registration.

Figure 15: Memory Registration Time

As we can see from the graph, Xen-IB consistently adds around 25%-35% overhead to the registration cost. The overhead increases with the number of pages involved in the registration. This is because Xen-IB needs to use inter-domain communication to send a message which contains the machine addresses of all the pages. The more pages we register, the bigger the message we need to send to the device domain through the inter-domain device channel. This observation indicates that if registration is a time-critical operation of an application, we need to use techniques such as an efficient implementation of a registration cache [38] to reduce the cost.
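A registration cache in this context simply memoizes (address, length) ranges that have already been registered, so that repeated transfers from the same buffers skip the expensive frontend/backend round trip. The sketch below is a deliberately simplified, fixed-size cache over the verbs API; the eviction policy, overlap handling, and thread safety that a production cache such as [38] needs are omitted.

    #include <infiniband/verbs.h>
    #include <stddef.h>

    #define REG_CACHE_SLOTS 64

    /* One cached registration: a buffer range and its memory region handle. */
    struct reg_cache_entry {
        void          *addr;
        size_t         len;
        struct ibv_mr *mr;
    };

    static struct reg_cache_entry reg_cache[REG_CACHE_SLOTS];

    /* Return a memory region covering (addr, len), reusing a cached registration
     * when the same range was registered before; otherwise register and cache it. */
    static struct ibv_mr *get_mr_cached(struct ibv_pd *pd, void *addr, size_t len)
    {
        size_t i, free_slot = REG_CACHE_SLOTS;

        for (i = 0; i < REG_CACHE_SLOTS; i++) {
            if (reg_cache[i].mr && reg_cache[i].addr == addr &&
                reg_cache[i].len >= len)
                return reg_cache[i].mr;      /* cache hit: no new registration */
            if (!reg_cache[i].mr && free_slot == REG_CACHE_SLOTS)
                free_slot = i;
        }

        if (free_slot == REG_CACHE_SLOTS)
            return NULL;                     /* cache full (no eviction here) */

        struct ibv_mr *mr = ibv_reg_mr(pd, addr, len, IBV_ACCESS_LOCAL_WRITE);
        if (mr) {
            reg_cache[free_slot].addr = addr;
            reg_cache[free_slot].len  = len;
            reg_cache[free_slot].mr   = mr;
        }
        return mr;
    }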
6.5 IPoIB Performance

IPoIB allows one to run the TCP/IP protocol suite over InfiniBand. In this subsection, we compare IPoIB performance between Xen-IB and native InfiniBand using Netperf [2]. For the Xen-IB numbers, the netperf server is hosted in a guest domain with Xen-IB, while the client process runs with native InfiniBand.

Figure 16: IPoIB Netperf Throughput

Figure 16 illustrates the bulk data transfer rates over a TCP stream using the following command:

netperf -H $host -l 60 -- -s $size -S $size

Due to the increased cost of interrupt/event processing, we cannot achieve the same throughput when the server is hosted with Xen-IB as with native InfiniBand. However, Xen-IB is still able to reach more than 90% of the native InfiniBand performance for large messages.

We notice that IPoIB achieves much lower bandwidth than raw InfiniBand. This is for two reasons. First, IPoIB uses the InfiniBand unreliable datagram service, which has significantly lower bandwidth than the more frequently used reliable connection service due to the current implementation of Mellanox HCAs. Second, in IPoIB, due to the MTU limit, large messages are divided into small packets, which can cause a large number of interrupts and degrade performance.

Figure 17: Netperf Transaction Test

Figure 17 shows the request/response performance measured by Netperf (transactions/second) using:

netperf -l 60 -H $host -t TCP_RR -- -r $size,$size

Again, Xen-IB performs worse than native InfiniBand, especially for small messages, where the interrupt/event cost plays a dominant role in performance. Xen-IB performs more comparably to native InfiniBand for large messages.

6.6 MPI Performance

MPI is a message passing standard widely used in high performance computing. For the tests in this subsection, we have used MVAPICH [27,20], which is a popular MPI implementation over InfiniBand. Figures 18 and 19 compare Xen-IB and native InfiniBand in terms of MPI one-way latency and bandwidth. The tests were run between two physical machines in the cluster. Since MVAPICH uses polling for all underlying InfiniBand communication, Xen-IB was able to achieve the same performance as native InfiniBand by using VMM-bypass. The smallest latency achieved by MPI with Xen-IB was 5.4 us. The peak bandwidth was 870 MBytes/s.

Figure 18: MPI Latency

Figure 19: MPI Bandwidth

Figure 20: MPI NAS Benchmarks

Figure 20 shows the performance of the IS, FT, SP, and BT applications from the NAS Parallel Benchmarks suite [26] (class A), which is frequently used by researchers in the area of high performance computing. We show execution time normalized to native InfiniBand. In these tests, two physical nodes were used with two guest domains per node for Xen-IB. For native InfiniBand, two MPI processes were launched on each node. We can see that Xen-IB performs comparably with native InfiniBand, even for communication intensive applications such as IS. Xen-IB performs about 4% worse for FT and around 2-3% better for SP and BT. We believe the difference is due to the fact that MVAPICH uses shared memory communication for processes on a single node. Although MVAPICH with Xen-IB currently does not have this feature, it can be added by taking advantage of the page sharing mechanism provided by Xen.

7 Related Work

In Section 2.1, we discussed current I/O device virtualization approaches such as those in VMware Workstation [37], VMware ESX Server [42], and Xen [13].
All of them require the involvement of the VMM or a privileged VM to handle every I/O operation. In our VMM-bypass approach, many time-critical I/O operations can be executed directly by guest VMs. Since this method makes use of the intelligence in modern high speed network interfaces, it is limited to a relatively small range of devices which are used mostly in high-end systems. The traditional approaches can be applied to a much wider range of devices.

OS-bypass is a feature found in user-level communication protocols such as active messages [41], U-Net [40], FM [29], VMMC [6], and Arsenic [33]. Later, it was adopted by industry [12,19] and found its way into commercial products [25,34]. Our work extends the idea of OS-bypass to VM environments. With VMM-bypass, I/O and communication operations can be initiated directly by user space applications, bypassing the guest OS, the VMM, and the device driver VM. VMM-bypass also allows an OS in a guest VM to carry out many I/O operations directly, although virtualizing interrupts still needs the involvement of the VMM.

The idea of direct device access from a VM has been proposed earlier. For example, [7] describes a method to implement direct I/O access from a VM for IBM mainframes. However, it requires an I/O device to be dedicated to a specific VM. The VMM-bypass approach not only enables direct device access, but also allows for safe device sharing among many different VMs. Recently, the industry has started working on standardizing I/O virtualization by extending the PCI Express standard [30] to allow a physical device to present itself as multiple virtual devices to the system [31]. This approach can potentially allow a VM to directly interact with a virtual device. However, it requires building new hardware support into PCI devices, while our VMM-bypass approach is based on existing hardware. At about the same time as we were working on our virtualization support for InfiniBand in Xen, others in the InfiniBand community proposed similar ideas [39,22]. However, details regarding their implementations are currently not available.

Our InfiniBand virtualization support for Xen uses a para-virtualization approach. As a technique to improve VM performance by introducing small changes in guest OSes, para-virtualization has been used in many VM environments [8,16,44,11]. Essentially, para-virtualization presents a different abstraction to the guest OSes than native hardware, which lends itself to easier and faster virtualization. The same idea can be applied to the virtualization of both CPUs and I/O devices. Para-virtualization usually trades compatibility for enhanced performance. However, our InfiniBand virtualization support achieves both high performance and good compatibility by maintaining the same interface as native InfiniBand drivers at a level higher than the hardware. As a result, our implementation is able to support existing kernel drivers and user applications.

Virtualization at levels higher than native hardware is used in a number of other systems. For example, novel operating systems such as Mach [15], K42 [4], and L4 [18] use OS-level API or ABI emulation to support traditional OSes such as Unix and Linux. Several popular VM projects also use this approach [10,3].

8 Conclusions and Future Work

In this paper, we presented the idea of VMM-bypass, which allows time-critical I/O commands to be processed directly in guest VMs without involvement of a VMM or a privileged VM.
VMM-bypass can significantly improve I/O performance in VMs by eliminating the context switching overhead between a VM and the VMM, or between two different VMs, incurred by current I/O virtualization approaches. To demonstrate the idea of VMM-bypass, we described the design and implementation of Xen-IB, a VMM-bypass capable InfiniBand driver for the Xen VM environment. Xen-IB works with current InfiniBand hardware and does not require modification to applications or kernel drivers which use InfiniBand. Our performance evaluation showed that Xen-IB can provide performance close to native hardware under most circumstances, with the expected degradation in event/interrupt handling and memory registration.

Currently, we are working on providing check-pointing and migration support for our Xen-IB prototype. We are also investigating how to provide performance isolation by implementing QoS support in Xen-IB. In the future, we plan to study the possibility of introducing VMs into the high performance computing area. We will explore how to take advantage of Xen to provide better support for check-pointing, QoS, and cluster management with minimal loss of computing power.

Acknowledgments

We would like to thank Charles Schulz, Orran Krieger, Muli Ben-Yehuda, Dan Poff, Mohammad Banikazemi, and Scott Guthridge of IBM Research for valuable discussions and their support for this project. We thank Ryan Harper and Nivedita Singhvi of IBM Linux Technology Center for extending help to improve the Xen-IB implementation. We also thank the anonymous reviewers for their insightful comments. This research is supported in part by the following grants and equipment donations to the Ohio State University: Department of Energy's Grant #DE-FC02-01ER25506; National Science Foundation grants #CNS-0403342 and #CCR-0509452; grants from Intel, Mellanox, Sun, Cisco, and Linux Networx; and equipment donations from Apple, AMD, IBM, Intel, Microway, Pathscale, Silverstorm and Sun.