The relentless progress of Moore's Law has periodically inspired major innovations, in both hardware and software, to keep performance growth on pace with transistor density. Industry has reached another such point as it confronts intellectual and engineering challenges in the form of power dissipation, the processor-memory performance gap, limits to instruction-level parallelism, slower frequency growth, and rising non-recurring engineering costs. As a consequence, when we consider how the large transistor budgets of future technology nodes will be used to sustain performance growth, some trends are inevitable: i) replication of cores, ii) the use of high-volume custom accelerators, which offer large performance gains at small footprint and dramatically lower power consumption, and iii) innovations in memory hierarchies. These trends collectively inspire the development of Hybrid Virtual Machines (HVMs) for heterogeneous many-core platforms: large-scale, heterogeneous systems composed of single- or shared-ISA general purpose cores intermingled with customized heterogeneous cores (accelerators), using diverse memory, cache, and interconnect hierarchies. To keep up with growing application demands, such platforms will appear both in individual user systems and at rack and multi-rack scale.

These Hybrid Virtual Machines will run applications across a wide spectrum: high performance applications such as scientific computing and biological simulations, enterprise applications such as financial and data processing, and client applications such as gaming and image processing. Achieving performance guarantees for these applications under the constraints imposed by workload variability, power, cost, and heterogeneity requires re-thinking the current software stack. To this end, we have undertaken a wide-reaching effort at Georgia Tech that evaluates the challenges emerging hardware and applications impose on the entire software stack. We are designing and implementing suitable programming models, runtimes, operating system changes, and hypervisor changes to prepare software for such future heterogeneous many-core platforms. This effort is a collaboration between research groups at Georgia Tech and research labs across the United States. The project is split into components, each addressing a subset of the challenges presented by this endeavor. The rest of this page summarizes these efforts and lists the people involved.

Programming Models and Runtime

Harmony

The Harmony runtime seeks to address the software challenges of heterogeneous multi-core architectures by developing a software infrastructure that reduces the complexity of using these architectures effectively and efficiently, improving productivity with minimal performance loss. In particular, we seek an infrastructure that i) implements execution models that bridge from existing languages and software development practices to commodity hardware systems, thereby leveraging individual accelerator vendors' compilation tool chains and programming models, ii) enables portable implementations across system configurations to amortize substantial software investments, and iii) delivers high performance by adapting applications to the specific accelerator capabilities available across a range of system sizes and configurations. (more...)

Ocelot

The emergence of GPGPU programming models such as CUDA and OpenCL provides avenues for general purpose application developers to leverage the computational capabilities of highly parallel SIMD GPU architectures. PTX defines a machine model and virtual ISA for these architectures that is used commercially by NVIDIA GPUs, but it is also representative of other SIMD machine models. Ocelot seeks to develop a set of tools that enable low-level analysis of GPGPU applications, as well as a JIT compiler targeting diverse architectures. Our end goal is twofold:

  1. We wish to be able to dynamically and transparently migrate applications from one processor to another in order to improve efficiency and compatibility, using a combination of binary translation and emulation. For example, we can improve compatibility by running GPU applications on systems without GPU hardware. Similarly, we can improve efficiency by moving parallel, compute-intensive workloads to accelerators and delegating less intensive programs with little parallelism to CPU cores.
  2. Additionally, we wish to be able to dynamically characterize and optimize programs, for example by compressing highly parallel programs into single-threaded loops on sequential architectures with high synchronization or context-switch overhead. Ocelot allows us to gather static compile-time statistics as well as to identify runtime memory access patterns, control-flow behavior, and inter-thread communication patterns. We plan to use these statistics to guide the dynamic mapping of program kernels to CPUs or accelerators and to drive low-level compiler optimizations that adapt to a program's dynamic behavior.
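The idea of compressing a parallel kernel into a loop can be illustrated with a small C sketch. All names here are illustrative, not Ocelot's actual code: the body of a CUDA-style vector-add kernel is wrapped in explicit loops over block and thread indices, so a single CPU thread executes the work the GPU would spread across many threads.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: a data-parallel "kernel" body (vector add) is
 * re-executed inside explicit loops over block and thread indices --
 * the essence of serializing SIMT code onto one sequential core. */
static void vec_add_serialized(const float *a, const float *b, float *c,
                               size_t n, size_t blocks, size_t threads)
{
    for (size_t blockIdx = 0; blockIdx < blocks; ++blockIdx) {
        for (size_t threadIdx = 0; threadIdx < threads; ++threadIdx) {
            size_t i = blockIdx * threads + threadIdx; /* global thread id */
            if (i < n)                                 /* bounds guard, as on GPU */
                c[i] = a[i] + b[i];
        }
    }
}
```

A real translator must also handle barriers (e.g., __syncthreads), which cannot be serialized this naively; splitting the loop at each barrier is one known approach, and the sketch above only covers the barrier-free case.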

We have completed control-flow and dataflow analysis modules for PTX, as well as an emulator for the PTX 1.3-1.4 virtual machine compatible with CUDA 2.1-2.3 applications. Our implementation is rigorously tested against the CUDA SDK, the UIUC Parboil benchmarks, and several internal applications. We are currently working on a back-end binary translator from PTX to various architectures. (more...)

GViM, SVIO and Cellule: Virtualization of accelerator based systems

Heterogeneous accelerator-based multi-cores (e.g., Intel's Larrabee, NVIDIA-based GPGPU systems, and IBM and LANL's AMD/Cell-based RoadRunner) are becoming increasingly common. While today's OSs and applications treat these resources as specialized devices, using device-specific drivers and programming methods, a desirable future state is one in which both general purpose and specialized processing cores are treated as first-class entities, managed and controlled by the hypervisor to execute applications that freely utilize both types of cores. Realizing this vision presents significant challenges to virtualization infrastructures, including dealing with proprietary programming models (e.g., NVIDIA's GPUs), heterogeneity (IBM's Cell and Intel's IXP), and fair (or weighted) sharing of accelerator resources. Attaining this vision is critical, however, as future cloud and data center infrastructures require the flexibility and consolidation opportunities provided by system virtualization. To make this vision achievable, we have virtualized a broad range of accelerators, each offering a different set of challenges and providing lessons in dealing with such systems.

GViM – GPU-accelerated Virtual Machines (in collaboration with HP Labs)

GViM is a system designed for virtualizing and managing the resources of a general purpose system accelerated by graphics processors. Using the NVIDIA GPU as an example, we discuss how such accelerators can be virtualized without additional hardware support and describe the basic extensions needed for resource management. Experimental evaluations with a Xen-based implementation of GViM demonstrate efficiency and flexibility in system usage, coupled with only small performance penalties for the virtualized vs. non-virtualized solutions. (more...)
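Because the GPU's programming interface is proprietary, virtualizing it without hardware support generally means interposing at the API level: a guest-side front end packs each CUDA-style call into a record that is forwarded to a back end in the management domain, which replays the call on the physical GPU. The C sketch below illustrates that marshalling step with hypothetical names and a trivial dispatcher; it is a sketch of the general technique, not GViM's source.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch of API remoting: the guest front end marshals a
 * CUDA-style call into a record that would travel (e.g., over a shared
 * ring) to the back end, which replays it against the real driver. */
enum call_id { CALL_MALLOC, CALL_MEMCPY, CALL_LAUNCH };

struct call_record {
    enum call_id id;
    unsigned long args[4];   /* sizes, device pointers, grid dims, ... */
};

/* Front end: marshal a cudaMalloc-like request of 'size' bytes. */
static void marshal_malloc(struct call_record *r, unsigned long size)
{
    memset(r, 0, sizeof *r);
    r->id = CALL_MALLOC;
    r->args[0] = size;
}

/* Back end: a stub dispatcher standing in for calls into the vendor
 * driver; here it simply echoes the requested size. */
static unsigned long dispatch(const struct call_record *r)
{
    switch (r->id) {
    case CALL_MALLOC: return r->args[0];
    default:          return 0;
    }
}
```

The resource-management hook falls out naturally: because every call passes through the back end, the management domain can account for, throttle, or reorder requests per VM.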

Self-virtualized IO

This project addresses a key issue in system virtualization - how to efficiently virtualize I/O subsystems and peripheral devices. We have developed a novel approach to I/O virtualization, termed self-virtualized devices, which improves I/O performance by offloading select virtualization functionality onto the device. This permits guest virtual machines to interact with the virtualized device more efficiently, i.e., with less overhead and reduced latency. The concrete instance of such a device developed and evaluated in this work is a self-virtualized network interface (SV-NIC), targeting the high end NICs used in the high performance domain. The SV-NIC (1) provides virtual interfaces (VIFs) to guest virtual machines for an underlying physical device, the network interface, (2) manages the way in which the device's physical resources are used by guest operating systems, and (3) provides high performance, low overhead network access to guest domains. (more...)
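A VIF is essentially a per-guest pair of descriptor rings that the device multiplexes onto the physical port, so the hypervisor stays off the data path. The C sketch below shows one plausible shape for such a structure and the guest-side transmit-post operation; field names and ring sizes are illustrative assumptions, not the SV-NIC's actual layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of a virtual interface (VIF): each guest gets
 * transmit/receive descriptor rings mapped into its address space; the
 * NIC multiplexes them onto the physical port. Names and sizes are
 * illustrative only. */
#define RING_SLOTS 8

struct desc { uint64_t buf; uint32_t len; };  /* buffer address + length */

struct vif {
    int guest_id;
    struct desc tx[RING_SLOTS];
    struct desc rx[RING_SLOTS];
    unsigned tx_head, tx_tail;   /* producer / consumer indices */
};

/* Guest side: post a packet buffer on the transmit ring; the device
 * consumes from tx_tail asynchronously. Returns -1 if the ring is full. */
static int vif_tx_post(struct vif *v, uint64_t buf, uint32_t len)
{
    unsigned next = (v->tx_head + 1) % RING_SLOTS;
    if (next == v->tx_tail)
        return -1;               /* ring full */
    v->tx[v->tx_head].buf = buf;
    v->tx[v->tx_head].len = len;
    v->tx_head = next;
    return 0;
}
```

Since each guest touches only its own rings, isolation and weighted scheduling across guests can be enforced by the device when it chooses which ring to drain next.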

Cellule – Lightweight Virtualization of Accelerators (in collaboration with IBM Research)

Initial steps in this research have focused on the efficient use of accelerators, using IBM's Cell BE processor as the key platform. Here, experiences with running the Linux operating system on the Power core of the Cell processor have shown that this core is less efficient than general purpose cores at hosting a full-fledged operating system. In part, this is because the Power core was principally designed to be a `service processor' responsible for coordinating the Cell's SPEs. The first challenge faced by our research, then, has been to make efficient use of this service processor in order to exploit Cell as a remote accelerator utilized by one or more general purpose machines, with hardware configurations like those in the Roadrunner project. More generally, this research investigates the opportunities presented by combining the concepts of virtualization and accelerators to simplify the Cell execution model and to enable its effective utilization by applications running on the general purpose machines. The first technical outcome of this research is the "Cellule" execution environment. (more...)

Montage: Scheduling and Resource Management in Heterogeneous Many-core Systems

Region Scheduling

For modern computer architectures, memory access times and caching performance are key determinants of program and system performance. This is evidenced not only by a rich body of research on caches in computer architecture but also by the variety of cache structures found in modern multi-core and many-core systems. The goal of this project is to develop a new methodology for managing the memory hierarchy effectively. To this end, a new abstraction called 'region' is defined and implemented in the Xen scheduler to improve memory hierarchy performance. (more...)
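One way to picture a region-aware scheduler, sketched below in C under loose assumptions (this is not the actual Xen implementation): cores that share a cache form a region, and the scheduler prefers to place a vCPU back on a core in the region where its working set is still warm, falling back to any free core only when necessary.

```c
#include <assert.h>

/* Hypothetical sketch of the 'region' idea: cores sharing a cache form
 * one region; the scheduler keeps a vCPU inside its last region when it
 * can, to preserve cache locality. Topology below is illustrative. */
#define NCORES 4

static int core_region[NCORES] = { 0, 0, 1, 1 };  /* two shared-cache pairs */

/* Pick a core for a vCPU that last ran in 'last_region': prefer a free
 * core in that region; otherwise take any free core (paying the
 * cache-warmup cost); return -1 if nothing is free. */
static int pick_core(int last_region, const int busy[NCORES])
{
    for (int c = 0; c < NCORES; ++c)
        if (!busy[c] && core_region[c] == last_region)
            return c;
    for (int c = 0; c < NCORES; ++c)
        if (!busy[c])
            return c;
    return -1;
}
```

The point of the abstraction is that the scheduler reasons about regions rather than individual cores, so the same policy applies across chips with different cache topologies.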

VMAcS (in collaboration with HP Labs)

VMAcS is a system designed for managing the resources of virtualized accelerator-based multi-core systems. Using the NVIDIA Graphics Processing Unit (GPU) as an example, we describe the extensions needed for resource management, the algorithms for providing performance guarantees to individual VMs, and the scheduling extensions needed for improved efficiency when sharing resources. Experimental evaluation with a Xen-based implementation of VMAcS demonstrates small performance penalties for the virtualized vs. non-virtualized solution, coupled with substantial improvements in fairness of use by multiple VMs. (more...)
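Weighted sharing of an accelerator can be sketched with a simple credit scheme, shown below in C. This is a generic illustration under assumed names, not VMAcS's actual algorithm: each VM is granted credits in proportion to its weight at the start of a period, and the back end services the VM with the most remaining credits.

```c
#include <assert.h>

/* Hypothetical sketch of weighted accelerator sharing: credits are
 * refilled per scheduling period in proportion to each VM's weight, and
 * each serviced request would be charged against the owner's credits. */
#define NVMS 3

struct vm_acct { int weight; int credits; };

/* Refill credits proportionally to weight at the start of a period. */
static void refill(struct vm_acct vms[NVMS], int period_credits)
{
    int total = 0;
    for (int i = 0; i < NVMS; ++i)
        total += vms[i].weight;
    for (int i = 0; i < NVMS; ++i)
        vms[i].credits = period_credits * vms[i].weight / total;
}

/* Pick the next VM to service: highest remaining credits wins. */
static int pick_vm(const struct vm_acct vms[NVMS])
{
    int best = 0;
    for (int i = 1; i < NVMS; ++i)
        if (vms[i].credits > vms[best].credits)
            best = i;
    return best;
}
```

The appeal of a credit scheme in this setting is that it degrades gracefully: a VM that underuses its share simply lets others run, while a heavy user is bounded by its weight over each period.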

Asymmetric multicore scheduling (in collaboration with Intel Labs)

Our target hardware platforms include ones whose cores on a single chip exhibit both performance and functional asymmetries. The goal of this work is to design hypervisor resource management that can handle demands from various OSes and make intelligent decisions based on differences in requirements and asymmetries in the hardware. Earlier work on asymmetry has focused on solving the problem for operating system scheduling; it offers useful insights into both the importance of modifying OS scheduling and the need for asymmetry in the architecture. We intend to eventually combine the smart hypervisor with the smart OS to achieve better utilization of such systems. (more...)