Saturday, March 14, 2015
Sat,9:00 - 9:15 Session 1: Welcome and Awards

  abstract VEE'15 foreword, contents, organization, and sponsors
Ada Gavrilovska (Georgia Institute of Technology), Angela Demke Brown (University of Toronto), Bjarne Steensgaard (Microsoft)
Sat,9:15 - 10:15 Opening Keynote Address:
Chair: Angela Demke Brown (University of Toronto)

  abstract Securing the Endpoint with Micro-virtualization
Ian Pratt (Bromium)
Sat,10:15 - 10:45 Break
Sat,10:45 - 12:15 Session 2: Improving and Exploiting I/O Virtualization
Chair: Bjarne Steensgaard (Microsoft)

  abstract doi A Comprehensive Implementation and Evaluation of Direct Interrupt Delivery
Cheng-Chun Tu (Oracle Labs), Michael Ferdman (Stony Brook University), Tzi-cker Chiueh, Chao-tung Lee (Industrial Technology Research Institute)

  abstract doi A Hybrid I/O Virtualization Framework for RDMA-capable Network Interfaces
Jonas Pfefferle, Patrick Stuedi, Animesh Trivedi, Bernard Metzler, Ioannis Koltsidas (IBM Research), Thomas R. Gross (ETH Zurich)

  abstract doi Supporting High Performance Molecular Dynamics in Virtualized Clusters using IOMMU, SR-IOV, and GPUDirect
Andrew J. Younge (Indiana University), John Paul Walters, Stephen P. Crago (University of Southern California), Geoffrey C. Fox (Indiana University)
Sat,12:15 - 13:45 Lunch
Sat,13:45 - 15:15 Session 3: Address Space Management
Chair: Gaël Thomas (LIP6)

  abstract doi Proactively Breaking Large Pages to Improve Memory Overcommitment Performance in VMware ESXi
Fei Guo, Seongbeom Kim, Yury Baskakov, Ishan Banerjee (VMware)

  abstract doi HSPT: Practical Implementation and Efficient Management of Embedded Shadow Page Tables for Cross-ISA System Virtual Machines
Zhe Wang, Jianjun Li, Chenggang Wu (Institute of Computing Technology), Dongyan Yang (IBM), Zhenjiang Wang (Institute of Computing Technology), Wei-Chung Hsu (National Taiwan University), Bin Li (Netease, Inc.), Yong Guan (Capital Normal University)

  abstract doi GPUswap: Enabling Oversubscription of GPU Memory through Transparent Swapping
Jens Kehne, Jonathan Metter, Frank Bellosa (Karlsruhe Institute of Technology (KIT))
Sat,15:15 - 15:45 Break
Sat,15:45 - 17:15 Session 4: Managing Virtual Clusters
Chair: Ada Gavrilovska (Georgia Institute of Technology)

  abstract doi HeteroVisor: Exploiting Resource Heterogeneity to Enhance the Elasticity of Cloud Platforms
Vishal Gupta (VMware), Min Lee (Intel), Karsten Schwan (Georgia Institute of Technology)

  abstract doi A-DRM: Architecture-aware Distributed Resource Management of Virtualized Clusters
Hui Wang (Beihang University), Canturk Isci (IBM Thomas J. Watson Research Center), Lavanya Subramanian (Carnegie Mellon University), Jongmoo Choi (Dankook University), Depei Qian (Beihang University), Onur Mutlu (Carnegie Mellon University)

  abstract doi Towards VM Consolidation Using a Hierarchy of Idle States
Rayman Preet Singh, Tim Brecht, S. Keshav (University of Waterloo)
Sunday, March 15, 2015
Sun,8:50 - 10:20 Session 5: VM Testing and Introspection
Chair: Angela Demke Brown (University of Toronto)

  abstract doi Application of Domain-aware Binary Fuzzing to Aid Android Virtual Machine Testing
Stephen Kyle, Hugh Leather, Björn Franke (University of Edinburgh), Dave Butcher, Stuart Monteith (ARM Limited)

  abstract doi Exploring VM Introspection: Techniques and Trade-offs
Sahil Suneja (University of Toronto), Canturk Isci (IBM Research), Eyal de Lara (University of Toronto), Vasanth Bala (IBM Research)

  abstract doi PEMU: A Pin Highly Compatible Out-of-VM Dynamic Binary Instrumentation Framework
Junyuan Zeng, Yangchun Fu, Zhiqiang Lin (University of Texas at Dallas)
Sun,10:20 - 11:00 Break
Sun,11:00 - 12:00 Session 6: User-facing Applications
Chair: Dan Tsafrir (Technion - Israel Institute of Technology)

  abstract doi Improving Remote Desktopping Through Adaptive Record/Replay
Shehbaz Jaffer (NetApp Inc.), Piyus Kedia (Microsoft Research Bangalore), Sorav Bansal (IIT Delhi)

  abstract doi Migration of Web Applications with Seamless Execution
JinSeok Oh, Jin-woo Kwon, Hyukwoo Park, Soo-Mook Moon (Seoul National University)
Sun,12:00 - 13:45 Lunch
Sun,13:45 - 15:15 Session 7: Security and Reliability
Chair: Gaël Thomas (LIP6)

  abstract doi AppSec: A Safe Execution Environment for Security Sensitive Applications
Jianbao Ren, Yong Qi, Yuehua Dai, Xiaoguang Wang, Yi Shi (Xi'an Jiaotong University)

  abstract doi Hardware-Assisted Secure Resource Accounting under a Vulnerable Hypervisor
Seongwook Jin, Jinho Seol, Jaehyuk Huh, Seungryoul Maeng (KAIST)

  abstract doi PARS: A Page-Aware Replication System for Efficiently Storing Virtual Machine Snapshots
Lei Cui, Tianyu Wo, Bo Li, Jianxin Li, Bin Shi, Jinpeng Huai (Beihang University)
Sun,15:15 - 15:45 Break
Sun,15:45 - 16:45 Closing Keynote Address:
Chair: Bjarne Steensgaard (Microsoft)

  abstract Virtualizing the Browser
Emery Berger (University of Massachusetts Amherst)
Sun,18:00 - 20:00 Joint VEE/ASPLOS Reception (Roof Bar)
A malicious OS kernel can easily access a user's private data in main memory and pry into human-machine interaction data, even on systems that employ application-level or OS-level privacy enforcement. This paper introduces AppSec, a hypervisor-based safe execution environment that transparently protects both the memory data and the human-machine interaction data of security-sensitive applications from an untrusted OS.

AppSec provides several security mechanisms on an untrusted OS. AppSec introduces a safe loader to check the code integrity of applications and dynamic shared objects. At runtime, AppSec protects applications and dynamic shared objects from being modified and verifies kernel memory accesses against the application's intention. AppSec provides a device isolation mechanism to prevent human-machine interaction devices from being accessed by a compromised kernel. On top of that, AppSec provides a privilege-based window system to protect the application's X resources. The major advantages of AppSec are threefold. First, AppSec verifies and protects all dynamic shared objects at runtime. Second, AppSec mediates kernel memory accesses according to the application's intention rather than coarsely encrypting all of the application's data. Third, AppSec provides a trusted I/O path from the end user to the application. A prototype of AppSec has been implemented, and evaluation shows that AppSec is efficient and practical.
RDMA-capable interconnects, providing ultra-low latency and high bandwidth, are increasingly being used in the context of distributed storage and data processing systems. However, the deployment of such systems in virtualized data centers is currently inhibited by the lack of a flexible and high-performance virtualization solution for RDMA network interfaces.

In this work, we present a hybrid virtualization architecture which builds upon the concept of separation of paths for control and data operations available in RDMA. With hybrid virtualization, RDMA control operations are virtualized using hypervisor involvement, while data operations are set up to bypass the hypervisor completely. We describe HyV (Hybrid Virtualization), a virtualization framework for RDMA devices implementing such a hybrid architecture. In the paper, we provide a detailed evaluation of HyV for different RDMA technologies and operations. We further demonstrate the advantages of HyV in the context of a real distributed system by running RAMCloud on a set of HyV-enabled virtual machines deployed across a 6-node RDMA cluster. All of the performance results we obtained illustrate that hybrid virtualization enables bare-metal RDMA performance inside virtual machines while retaining the flexibility typically associated with paravirtualization.
Accessing the display of a computer remotely is popularly called remote desktopping. Remote desktopping software installs at both the user-facing client computer and the remote server computer; it simulates the user's input events at the server and streams the corresponding display changes to the client, thus providing the user with the illusion of controlling the remote machine using local input devices (e.g., keyboard/mouse). Many such remote desktopping tools are widely used. We show that if the remote server is a virtual machine (VM) and the client is reasonably powerful (e.g., current laptop- and desktop-grade hardware), VM deterministic replay capabilities can be used adaptively to significantly reduce the network bandwidth consumption and server-side CPU utilization of a remote desktopping tool. We implement these optimizations in a tool based on the Qemu/KVM virtualization platform and the VNC remote desktopping platform. Our tool reduces VNC's network bandwidth consumption by up to 9x and server-side CPU utilization by up to 56% for popular graphics-intensive applications. On the flip side, our techniques consume more CPU/memory/disk resources at the client. The effect of our optimizations on user-perceived latency is negligible.

Typical VM consolidation approaches re-pack VMs into fewer physical machines, resulting in energy and cost savings. Recent work has explored a just-in-time approach to VM consolidation by transitioning VMs to an inactive state when idle and activating them on the arrival of client requests. This leads to increased VM density at the cost of an increase in client request latency (called the miss penalty). The VM density so obtained, although greater, is still limited by the number of VMs that can be hosted in one inactive state. If idle VMs were hosted in multiple inactive states, VM density could be increased further while ensuring small miss penalties.
However, VMs in different inactive states have different capacities, activation times, and resource requirements.

Therefore, a key question is: How should VMs be transitioned between different states to minimize the expected miss penalty? This paper explores the hosting of idle VMs in a hierarchy of multiple such inactive states, and studies the effect of different idle VM management policies on VM density and miss penalties. We formulate a mathematical model for the problem, and provide a theoretical lower bound on the miss penalty. Using an off-the-shelf virtualization solution, we demonstrate how the required model parameters can be obtained. We evaluate a variety of policies and quantify their miss penalties for different VM densities. We observe that some policies consolidate up to 550 VMs per machine with average miss penalties smaller than 1 ms.
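The tension this model captures can be illustrated with a toy calculation: the expected miss penalty is the activation-time average weighted by how a policy spreads idle VMs across states. The state names, activation times, and fractions below are invented for illustration and are not taken from the paper.

```python
# Toy model of the expected miss penalty when idle VMs are hosted in
# a hierarchy of inactive states. All numbers are illustrative:
# deeper states are cheaper to host but slower to reactivate.

# (state, activation_time_ms, fraction_of_idle_VMs_in_that_state)
hierarchy = [
    ("paused",      1.0, 0.50),  # fast to reactivate, costly to host
    ("swapped",   300.0, 0.35),
    ("migrated", 2000.0, 0.15),  # cheap to host, slow to reactivate
]

def expected_miss_penalty(states):
    """Average activation delay seen by a request that arrives at an
    idle VM, assuming requests are spread like the VMs themselves."""
    return sum(t * frac for _, t, frac in states)

print(f"{expected_miss_penalty(hierarchy):.1f} ms")  # 405.5 ms
```

A policy that shifts more VMs into the shallow state lowers the penalty but reduces achievable density; navigating that trade-off is exactly what the paper's policies do.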
While there are a variety of existing virtual machine introspection (VMI) techniques, their latency, overhead, complexity, and consistency trade-offs are not clear. In this work, we address this gap by first organizing the various existing VMI techniques into a taxonomy based upon their operational principles, so that they can be put into context. Next we perform a thorough exploration of their trade-offs, both qualitatively and quantitatively. We present a comprehensive set of observations and best practices for efficient, accurate, and consistent VMI operation based on our experiences with these techniques. Our results show the stunning range of variations in performance, complexity, and overhead with different VMI techniques. We further present a deep dive into VMI consistency aspects to understand the sources of inconsistency in observed VM state, and show that, contrary to common expectation, pause-and-introspect based VMI techniques achieve very little to improve consistency despite their substantial performance impact.

With the proliferation of cloud computing to outsource computation to remote servers, the accountability of computational resources has emerged as an important new challenge for both cloud users and providers. Among cloud resources, the actual allocation of CPU and memory is difficult to verify, since current virtualization techniques attempt to hide the discrepancy between physical and virtual allocations of the two resources. This paper proposes an online verifiable resource accounting technique for CPU and memory allocation in cloud computing. Unlike prior approaches to cloud resource accounting, the proposed accounting mechanism, called Hardware-assisted Resource Accounting (HRA), uses hardware support for system management mode (SMM) and virtualization to provide secure resource accounting, even if the hypervisor is compromised.
Using the secure isolated execution support of SMM, this study investigates two aspects of verifiable resource accounting for cloud systems. First, the paper presents how hardware-assisted SMM and virtualization techniques can be used to implement the secure resource accounting mechanism even under a compromised hypervisor. Second, the paper investigates a sample-based resource accounting technique to minimize performance overheads. Using a statistical random sampling method, the technique estimates the overall CPU and memory allocation status with 99%~100% accuracy and performance degradation of 0.1%~0.5%.

VMware ESXi leverages hardware support for MMU virtualization available in modern Intel/AMD CPUs. To optimize address translation performance when running on such CPUs, ESXi preferably uses host large pages (2MB in x86-64 systems) to back the VM's guest memory. While using host large pages provides the best performance when the host has sufficient free memory, it increases host memory pressure and effectively defeats page sharing. Hence, the host is more likely to hit the point where ESXi has to reclaim VM memory through much more expensive techniques such as ballooning or host swapping. As a result, using host large pages may significantly hurt the consolidation ratio.

To deal with this problem, we propose a new host large page management policy that makes it possible to: a) identify 'cold' large pages and break them even when the host has plenty of free memory; b) break all large pages proactively when host free memory becomes scarce, but before the host starts ballooning or swapping; c) reclaim the small pages within the broken large pages through page sharing. With the new policy, the shareable small pages can be shared much earlier, and the amount of memory that needs to be ballooned or swapped can be largely reduced when host memory pressure is high. We also propose an algorithm that dynamically adjusts the page sharing rate when proactively breaking large pages, using a VM large page shareability estimator for higher efficiency.

Experimental results show that the proposed large page management policy can improve the performance of various workloads up to 2.1x by significantly reducing the amount of ballooned or swapped memory when host memory pressure is high. Applications still fully benefit from host large pages when memory pressure is low.
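The policy's decision logic can be sketched roughly as follows. The coldness metric, thresholds, and function names below are invented for illustration; ESXi's actual estimator and thresholds are more elaborate.

```python
# Rough sketch of proactive large-page breaking: a 2MB host large
# page is broken into 4KB small pages either because it is 'cold'
# (even with plenty of free memory) or because free memory has
# become scarce and breaking must happen before ballooning or
# swapping kicks in. Thresholds and the access-rate metric are
# invented, not ESXi's.

def pages_to_break(large_pages, free_mem_ratio,
                   cold_threshold=0.05, scarce_threshold=0.10):
    """large_pages: list of (page_id, access_rate); returns the ids
    of large pages to break so their small pages can be shared."""
    if free_mem_ratio < scarce_threshold:
        # Memory is scarce: break everything proactively.
        return [pid for pid, _ in large_pages]
    # Memory is plentiful: break only cold pages, so their shareable
    # small pages become page-sharing candidates early.
    return [pid for pid, rate in large_pages if rate < cold_threshold]

pages = [(0, 0.90), (1, 0.02), (2, 0.50), (3, 0.01)]
print(pages_to_break(pages, free_mem_ratio=0.40))  # [1, 3]
print(pages_to_break(pages, free_mem_ratio=0.05))  # [0, 1, 2, 3]
```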
Over the past 20 years, we have witnessed the widespread adoption of dynamic binary instrumentation (DBI) for numerous program analyses and security applications, including program debugging, profiling, reverse engineering, and malware analysis. To date, there are many DBI platforms, and the most popular one is Pin, which provides various instrumentation APIs for process instrumentation. However, Pin does not support the instrumentation of OS kernels. In addition, the execution of the instrumentation and analysis routines always happens inside the virtual machine (VM); consequently, Pin cannot support any out-of-VM introspection that requires strong isolation. Therefore, this paper presents PEMU, a new open source DBI framework that is compatible with Pin's APIs but supports out-of-VM introspection for both user-level processes and OS kernels. Unlike in-VM instrumentation, in which there is no semantic gap, out-of-VM introspection requires bridging the semantic gap and providing abstractions (i.e., APIs) for programmers. One important feature of PEMU is its API compatibility with Pin; as such, many Pin plugins are able to execute atop PEMU without any source code modification. We have implemented PEMU, and our experimental results with the SPEC 2006 benchmarks show that PEMU introduces reasonable overhead.

Web browsers have become a de facto universal operating system, and JavaScript its instruction set. Programming applications in this OS and with this instruction set is complex and error-prone. Virtualizing the browser to make it look like a conventional OS and CPU would have numerous advantages: it would make it easy to port existing applications to the browser and let programmers write web applications in the language of their choice. Unfortunately, browsers are an idiosyncratic and hostile environment that makes this tricky.
I will describe the numerous challenges of virtualizing the browser, and show how our DoppioVM system addresses these challenges, making it possible to run unaltered general-purpose applications in a wide range of browsers.

Speaker Bio

Emery Berger is a Professor in the School of Computer Science at the University of Massachusetts Amherst, the flagship campus of the UMass system. He graduated with a Ph.D. in Computer Science from the University of Texas at Austin in 2002. Professor Berger has been a Visiting Scientist at Microsoft Research and at the Universitat Politecnica de Catalunya (UPC) / Barcelona Supercomputing Center (BSC). Professor Berger’s research spans programming languages, runtime systems, and operating systems, with a particular focus on reliability, security, and performance. He is the creator of influential software systems including Hoard, a fast and scalable memory manager that accelerates multithreaded applications (on which the Mac OS X memory manager is based); DieHard, an error-avoiding memory manager that directly influenced the design of the Windows Fault-Tolerant Heap; and DieHarder, a secure memory manager that was an inspiration for hardening changes made to the Windows 8 heap. His honors include a Microsoft Research Fellowship, an NSF CAREER Award, a Lilly Teaching Fellowship, a Most Influential Paper Award at OOPSLA 2012, a Google Research Award, a Microsoft SEIF Award, CACM Research Highlights, a Best Artifact Award, and several Best Paper Awards. He is currently an Associate Editor of the ACM Transactions on Programming Languages and Systems, and will serve as Program Chair for PLDI 2016.
Cross-ISA (Instruction Set Architecture) system-level virtual machines have significant research and practical value. For example, several recently announced virtual smart phones for iOS, which run smart phone applications on x86-based PCs, are deployed on cross-ISA system-level virtual machines. Also, for mobile application development, emulating the Android/ARM environment on the more powerful x86-64 platform makes application development and debugging more convenient and productive. However, the virtualization layer often incurs high performance overhead. The key overhead comes from memory virtualization, where a guest virtual address (GVA) must go through multi-level address translation to become a host physical address (HPA). The Embedded Shadow Page Table (ESPT) approach has been proposed to effectively decrease this address translation cost. ESPT directly maps GVA to HPA, thus avoiding the lengthy guest virtual to guest physical, guest physical to host virtual, and host virtual to host physical address translations.

However, the original ESPT work has a few drawbacks. For example, its implementation relies on a loadable kernel module (LKM) to manage the shadow page table. Using LKMs is less desirable for system virtual machines due to portability, security, and maintainability concerns. Our work proposes a different, yet more practical, implementation to address these shortcomings. Instead of relying on LKMs, our approach adopts a shared memory mapping scheme to maintain the shadow page table (SPT) using only the "mmap" system call. Furthermore, this work studies the support of SPTs for multi-processing in greater detail. It devises three different SPT organizations and evaluates their strengths and weaknesses with standard and real Android applications on a system virtual machine that emulates the Android/ARM platform on x86-64 systems.
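The shared-memory idea can be illustrated in miniature: an anonymous shared mapping created with mmap is visible to related processes without any kernel module. The sketch below (plain Python, Unix-only because of fork, with toy 4-byte "entries") demonstrates only the mmap sharing mechanism, not the paper's actual SPT layout.

```python
import mmap
import os

# An anonymous shared mapping: both sides of a fork() see the same
# physical pages, with no loadable kernel module involved. The
# 4-byte "entries" here are a toy stand-in for SPT entries.
table = mmap.mmap(-1, 4096)  # anonymous; MAP_SHARED is the default

def write_entry(buf, index, value):
    """Store a 4-byte little-endian entry at the given slot."""
    buf[index * 4:(index + 1) * 4] = value.to_bytes(4, "little")

def read_entry(buf, index):
    return int.from_bytes(buf[index * 4:(index + 1) * 4], "little")

pid = os.fork()
if pid == 0:                       # child: update the shared table
    write_entry(table, 0, 0xCAFE)
    os._exit(0)
os.waitpid(pid, 0)
assert read_entry(table, 0) == 0xCAFE  # parent sees the child's write
```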
The development of a new application virtual machine (VM), like the creation of any complex piece of software, is a bug-prone process. In version 5.0, the widely used Android operating system changed from the Dalvik VM to the newly developed ART VM to execute Android applications. As new iterations of this VM are released, how can the developers aim to reduce the number of potentially security-threatening bugs that make it into the final product? In this paper we combine domain-aware binary fuzzing and differential testing to produce DexFuzz, a tool that exploits the presence of multiple modes of execution within a VM to test for defects. These modes of execution include the interpreter and a runtime that executes ahead-of-time compiled code. We find and present a number of bugs in the in-development version of ART in the Android Open Source Project. We also assess DexFuzz's ability to highlight defects in the experimental version of ART released in the previous version of Android, 4.4, finding 189 crashing programs and 15 divergent programs that indicate defects after only 5,000 attempts.

Virtual machine (VM) snapshots enhance system availability by saving the running state to stable storage during failure-free execution and rolling back to the snapshot point upon failure. Unfortunately, the snapshot state may be lost due to disk failures, so that the VM cannot be recovered. Popular distributed file systems employ replication to tolerate disk failures by placing redundant copies across disparate disks. However, unless user-specific personalization is provided, these systems treat all data in a file as equally important and create identical copies of the entire file, leading to non-trivial additional storage overhead.

This paper proposes a page-aware replication system (PARS) to store VM snapshots efficiently. PARS employs VM introspection to explore how each page is used by the guest, and classifies pages by their importance to system execution. If a page is critical, PARS replicates it in multiple copies to ensure high availability and long-term durability. Otherwise, the loss of the page does not prevent the system from working properly, so PARS saves only one copy of it. Consequently, PARS improves storage efficiency without compromising availability. We have implemented PARS to demonstrate its practicality. The experimental results show that PARS achieves a 53.9% space saving compared to the native replication approach in HDFS, which replicates the whole snapshot file fully and identically.
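A back-of-the-envelope calculation shows why replicating only critical pages saves space. The page counts, critical fraction, and replication factor below are invented for illustration and do not reproduce the paper's 53.9% figure.

```python
# Toy storage comparison: full snapshot replication vs. page-aware
# replication that fully replicates only critical pages. All inputs
# are illustrative.

PAGE = 4096
REPLICAS = 3  # HDFS-style default replication factor

def snapshot_bytes(total_pages, critical_fraction, page_aware):
    critical = int(total_pages * critical_fraction)
    other = total_pages - critical
    if page_aware:
        # Critical pages get full replication; the rest keep one copy.
        return (critical * REPLICAS + other) * PAGE
    return total_pages * REPLICAS * PAGE  # replicate the whole file

naive = snapshot_bytes(1_000_000, 0.3, page_aware=False)
pars_like = snapshot_bytes(1_000_000, 0.3, page_aware=True)
print(f"space saving: {1 - pars_like / naive:.1%}")  # 46.7%
```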
With an attack surface of many tens of millions of lines of code, commodity operating systems such as Windows and OS X pose an easy target for hackers. Users may be duped into exposing their systems to such attacks through a variety of means, such as malicious web links, poisoned email attachments, or rogue USB sticks, though increasingly attackers are using techniques such as malicious advertisements or "watering hole" attacks that compromise systems without the user even having to click on anything bad. Existing security products do a poor job of defending against such attacks and are easily evaded by zero-day or polymorphic malware.

This talk introduces a new approach called micro-virtualization, in which a separate virtual machine OS instance is created for each individual task that a user performs. Hence each web site, each document, each spreadsheet, etc. opens in its own isolated micro-VM. The hardware virtualization capabilities of modern CPUs can be used to achieve robust isolation between micro-VMs with excellent performance and an unchanged user experience. Micro-virtualization thus provides a practical implementation of the principle of least privilege that operates below the client OS, implemented using a small, hardened code base that should be orders of magnitude harder to attack.

Speaker Bio

Ian Pratt is Co–founder and EVP of Products at Bromium Inc. Prior to Bromium, Ian was founder and Chief Scientist at XenSource, which built enterprise-class virtualization products based on the Xen hypervisor. XenSource was acquired by Citrix in 2007, where he then served as Vice President for Advanced Products and CTO. Ian founded the Xen project while he was a member of tenured faculty at the University of Cambridge Computer Laboratory, where he led the Systems Research Group for nearly 10 years. He was also a co-founder of Nemesys Research Ltd, acquired by FORE Systems in 1996. Ian holds a PhD in Computer Science, is a Fellow of the Institute of Engineering and Technology, and a Fellow of the Royal Academy of Engineering which awarded him the Academy's Silver Medal in 2009.
This paper presents HeteroVisor, a heterogeneity-aware hypervisor that exploits resource heterogeneity to enhance the elasticity of cloud systems. Introducing the notion of 'elasticity' (E) states, HeteroVisor permits applications to manage changes in their resource requirements as state transitions that implicitly move their execution among heterogeneous platform components. By masking the details of platform heterogeneity from virtual machines, the E-state abstraction allows applications to adapt their resource usage in a fine-grained manner via VM-specific 'elasticity drivers' encoding VM-desired policies. The approach is explored for the heterogeneous processor and memory subsystems evolving for modern server platforms, leading to mechanisms that can manage these heterogeneous resources dynamically and as required by the different VMs being run. HeteroVisor is implemented for the Xen hypervisor, with mechanisms that go beyond core scaling to also deal with memory resources, via the online detection of hot memory pages and transparent page migration. Evaluation on an emulated heterogeneous platform uses workload traces from real-world data, demonstrating the ability to provide high on-demand performance while also reducing resource usage for these workloads.

As the performance overhead associated with CPU and memory virtualization becomes largely negligible, research efforts are directed toward reducing the I/O virtualization overhead, which mainly comes from two sources: DMA set-up and payload copy, and interrupt delivery. The advent of SR-IOV and MR-IOV effectively reduces the DMA-related virtualization overhead to a minimum. Therefore, the last battleground for minimizing virtualization overhead is how to directly deliver every interrupt to its target VM without involving the hypervisor.

This paper describes the design, implementation, and evaluation of a KVM-based direct interrupt delivery system called DID. DID delivers interrupts from SR-IOV devices, virtual devices, and timers to their target VMs directly, completely avoiding VM exits. Moreover, DID does not require any modifications to the VM's operating system and preserves the correct priority among interrupts in all cases. We demonstrate that DID reduces the number of VM exits by a factor of 100 for I/O-intensive workloads, decreases the interrupt invocation latency by 80%, and improves the throughput of a VM running Memcached by a factor of 3.
Web applications (apps) are programmed using HTML5, CSS, and JavaScript, and are distributed in source code form. Web apps can be executed on any device with a web browser installed, enabling a one-source, multi-platform environment. We can exploit this platform independence for a new user experience called app migration, which moves an app in the middle of execution seamlessly between smart devices. This paper proposes such a migration framework for web apps, in which we can save the current state of a running app and resume its execution on a different device by restoring the saved state. We save the web app's state in the form of a snapshot, which is actually another web app whose execution restores the saved state. In the snapshot, the state of the JavaScript variables and DOM trees is saved using the JSON format. We solved some of the saving/restoring problems related to event handlers and closures by accessing the browser and JavaScript engine internals. Our framework does not require instrumenting an app or changing its source code, but works on the original app. We implemented the framework on the Chrome browser with the V8 JavaScript engine and successfully migrated non-trivial sample apps with reasonable saving and restoring overhead. We also discuss other uses of the snapshot for optimizations and user experiences on the web platform.

Over the last few years, GPUs have been finding their way into cloud computing platforms, allowing users to benefit from the performance of GPUs at low cost. However, a large portion of the cloud's cost advantage traditionally stems from oversubscription: cloud providers rent out more resources to their customers than are actually available, expecting that the customers will not actually use all of the promised resources. For GPU memory, this oversubscription is difficult due to the lack of support for demand paging in current GPUs.
Therefore, recent approaches to enabling oversubscription of GPU memory resort to software scheduling of GPU kernels -- which has been shown to induce significant runtime overhead in applications even if sufficient GPU memory is available -- to ensure that data is present on the GPU when referenced.

In this paper, we present GPUswap, a novel approach to enabling oversubscription of GPU memory that does not rely on software scheduling of GPU kernels. GPUswap uses the GPU's ability to access system RAM directly to extend the GPU's own memory. To that end, GPUswap transparently relocates data from the GPU to system RAM in response to memory pressure. GPUswap ensures that all data is permanently accessible to the GPU and thus allows applications to submit commands to the GPU directly at any time, without the need for software scheduling. Experiments with our prototype implementation show that GPU applications can still execute even with only 20 MB of GPU memory available. In addition, while software scheduling suffers from permanent overhead even with sufficient GPU memory available, our approach executes GPU applications with native performance.
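The core eviction idea can be sketched as a toy allocator. The class, method names, and sizes below are invented for illustration; the real system works at the driver level on actual GPU buffers.

```python
# Toy sketch of GPUswap-style relocation: when a VRAM allocation
# does not fit, evict existing buffers to system RAM, where the GPU
# can still access them directly (more slowly), so no software
# scheduling of kernels is ever required. Sizes and the eviction
# order are invented.

class SwappingAllocator:
    def __init__(self, vram_bytes):
        self.vram_free = vram_bytes
        self.in_vram = {}    # buffer name -> size
        self.in_sysram = {}  # evicted but still GPU-accessible

    def alloc(self, name, size):
        # Evict buffers in insertion order until the request fits.
        while self.vram_free < size and self.in_vram:
            victim, vsize = next(iter(self.in_vram.items()))
            del self.in_vram[victim]
            self.in_sysram[victim] = vsize
            self.vram_free += vsize
        if self.vram_free < size:
            raise MemoryError("request exceeds VRAM, nothing to evict")
        self.in_vram[name] = size
        self.vram_free -= size

gpu = SwappingAllocator(vram_bytes=100)
gpu.alloc("weights", 60)
gpu.alloc("activations", 60)  # forces "weights" out to system RAM
print(sorted(gpu.in_sysram))  # ['weights']
```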
Virtualization technologies have been widely adopted by large-scale cloud computing platforms. These virtualized systems employ distributed resource management (DRM) to achieve high resource utilization and energy savings by dynamically migrating and consolidating virtual machines. DRM schemes usually use operating-system-level metrics, such as CPU utilization, memory capacity demand, and I/O utilization, to detect and balance resource contention. However, they are oblivious to microarchitecture-level resource interference (e.g., memory bandwidth contention between different VMs running on a host), which is currently not exposed to the operating system.

We observe that the lack of visibility into microarchitecture-level resource interference significantly impacts the performance of virtualized systems. Motivated by this observation, we propose a novel architecture-aware DRM scheme (A-DRM) that takes microarchitecture-level resource interference into account when making migration decisions in a virtualized cluster. A-DRM makes use of three core techniques: 1) a profiler to monitor the microarchitecture-level resource usage behavior online for each physical host, 2) a memory bandwidth interference model to assess the degree of interference among virtual machines on a host, and 3) a cost-benefit analysis to determine a candidate virtual machine and a host for migration.

Real system experiments on thirty randomly selected combinations of applications from the CPU2006, PARSEC, STREAM, and NAS Parallel Benchmark suites in a four-host virtualized cluster show that A-DRM can improve performance by up to 26.55%, with an average of 9.67%, compared to traditional DRM schemes that lack visibility into microarchitecture-level resource utilization and contention.
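The cost-benefit step can be sketched with a deliberately crude interference model. The model, numbers, and function names below are invented; the paper's memory-bandwidth interference model is derived from online hardware profiling, not from this formula.

```python
# Toy cost-benefit analysis for architecture-aware migration: pick
# the (VM, source, destination) whose move most reduces estimated
# memory-bandwidth interference, net of a fixed migration cost.

def interference(host_vms):
    """Invented model: contention grows once total demand exceeds
    the host's normalized memory bandwidth (1.0)."""
    return max(0.0, sum(host_vms.values()) - 1.0)

def best_migration(hosts, migration_cost=0.05):
    """hosts: {host: {vm: bandwidth_demand}}.
    Returns the (vm, src, dst) with the best net gain, or None."""
    base = sum(interference(vms) for vms in hosts.values())
    best, best_gain = None, 0.0
    for src, vms in hosts.items():
        for vm, bw in vms.items():
            for dst in hosts:
                if dst == src:
                    continue
                # Evaluate the move on a scratch copy of the cluster.
                trial = {h: dict(v) for h, v in hosts.items()}
                del trial[src][vm]
                trial[dst][vm] = bw
                gain = (base
                        - sum(interference(v) for v in trial.values())
                        - migration_cost)
                if gain > best_gain:
                    best, best_gain = (vm, src, dst), gain
    return best

hosts = {"hostA": {"vm1": 0.7, "vm2": 0.6}, "hostB": {"vm3": 0.2}}
print(best_migration(hosts))  # ('vm1', 'hostA', 'hostB')
```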
Cloud Infrastructure-as-a-Service paradigms have recently shown their utility for a vast array of computational problems, ranging from advanced web service architectures to high throughput computing. However, many scientific computing applications have been slow to adapt to virtualized cloud frameworks. This is due to performance impacts of virtualization technologies, coupled with the lack of advanced hardware support necessary for running many high performance scientific applications at scale.

By using KVM virtual machines that leverage both Nvidia GPUs and InfiniBand, we show that molecular dynamics simulations with LAMMPS and HOOMD run at near-native speeds. This experiment also illustrates how virtualized environments can support the latest parallel computing paradigms, including both MPI+CUDA and new GPUDirect RDMA functionality. Specific findings show initial promise in scaling of such applications to larger production deployments targeting large scale computational workloads.