Track 2

Wednesday, July 10
 

11:20am PDT

Partisan: Scaling the Distributed Actor Runtime
We present the design of an alternative runtime system for improved scalability and reduced latency in actor applications called Partisan. Partisan provides higher scalability by allowing the application developer to specify the network overlay used at runtime without changing application semantics, thereby specializing the network communication patterns to the application. Partisan reduces message latency through a combination of three predominantly automatic optimizations: parallelism, named channels, and affinitized scheduling. We implement a prototype of Partisan in Erlang and demonstrate that it achieves up to an order of magnitude increase in the number of nodes the system can scale to through runtime overlay selection, up to a 34.96x increase in throughput, and up to a 13.4x reduction in latency over Distributed Erlang.

Speakers
CS

Christopher S. Meiklejohn

Carnegie Mellon University
HM

Heather Miller

Carnegie Mellon University
PA

Peter Alvaro

UC Santa Cruz


Wednesday July 10, 2019 11:20am - 11:40am PDT
USENIX ATC Track II: Grand Ballroom VII–IX

11:40am PDT

Unleashing the Power of Learning: An Enhanced Learning-Based Approach for Dynamic Binary Translation
Dynamic binary translation (DBT) is a key system technology that enables many important system applications such as system virtualization and emulation. To achieve good performance, it is important for a DBT system to be equipped with high-quality translation rules. However, most translation rules in existing DBT systems are created manually, with high engineering effort and poor quality. To solve this problem, a learning-based approach was recently proposed to automatically learn semantically equivalent translation rules and to prove their semantic equivalence with symbolic verification. However, this approach still suffers from some shortcomings.

In this paper, we first give an in-depth analysis of the constraints of prior learning-based methods and observe that their equivalence requirements are often unduly restrictive, excluding many potentially high-quality rule candidates from being learned and applied. Based on this observation, we propose an enhanced learning-based approach that relaxes such equivalence requirements but supplements them with constraining conditions that make the rules semantically equivalent when they are applied. Experimental results on SPEC CINT2006 show that the proposed approach improves the dynamic coverage of the translation from 55.7% to 69.1% and the static coverage from 52.2% to 61.8%, compared to the original approach. Moreover, speedups of up to 1.65x, with an average of 1.19x, are observed.

Speakers
CS

Changheng Song

Fudan University
WW

Wenwen Wang

University of Minnesota
PY

Pen-Chung Yew

University of Minnesota
AZ

Antonia Zhai

University of Minnesota
WZ

Weihua Zhang

Fudan University


Wednesday July 10, 2019 11:40am - 12:00pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

12:00pm PDT

Transactuations: Where Transactions Meet the Physical World
A large class of IoT applications read sensors, execute application logic, and actuate actuators. However, the lack of high-level programming abstractions compromises correctness, especially in the presence of failures and unwanted interleavings between applications. A key problem arises when operations on IoT devices or the application itself fail, which leads to inconsistencies between the physical state and the application state, breaking application semantics and causing undesired consequences. Transactions are a well-established abstraction for correctness, but they assume properties that are absent in an IoT context. In this paper, we study one such environment, the smart home, and identify inconsistencies that manifest out of failures. We propose an abstraction called transactuation that empowers developers to build reliable applications. Our runtime, Relacs, implements the abstraction atop a real smart-home platform. We evaluate the programmability, performance, and effectiveness of transactuations to demonstrate their potential as a powerful abstraction and execution model.
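
The abstract describes transactuations only at a high level. As a rough illustration (not the Relacs API; the class, method names, and the all-or-nothing commit policy below are our own assumptions), a transactuation can be pictured as an object that records sensor reads, defers actuations, and commits application-state updates only if the actuations succeed:

```python
# Hypothetical sketch of a transactuation-style abstraction (illustrative only;
# Relacs's real API and policies may differ). Sensor reads are recorded, actuations
# are deferred, and application state is updated only if the actuations succeed.

from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Transactuation:
    sensors: Dict[str, Callable[[], float]]
    reads: Dict[str, float] = field(default_factory=dict)
    pending: List[Callable[[], bool]] = field(default_factory=list)

    def read(self, name: str) -> float:
        value = self.sensors[name]()            # read the physical sensor
        self.reads[name] = value
        return value

    def actuate(self, action: Callable[[], bool]) -> None:
        self.pending.append(action)             # defer the actuation until commit

    def commit(self, app_state: dict, updates: dict, policy: str = "all") -> bool:
        results = [action() for action in self.pending]
        ok = all(results) if policy == "all" else any(results)
        if ok:
            app_state.update(updates)           # app state changes only on success
        return ok

# Usage: turn on a heater when it is cold, keeping app state consistent with the world.
app_state = {"heating": False}
txu = Transactuation(sensors={"temperature": lambda: 17.0})
if txu.read("temperature") < 19.0:
    txu.actuate(lambda: True)                   # stand-in for a real actuator call
print(txu.commit(app_state, {"heating": True}), app_state)
```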

Speakers
AS

Aritra Sengupta

Samsung Research
MS

Masoud Saeida Ardekani

Samsung Research
CA

Cesar A. Stuardo

University of Chicago


Wednesday July 10, 2019 12:00pm - 12:20pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

12:20pm PDT

Not So Fast: Analyzing the Performance of WebAssembly vs. Native Code
All major web browsers now support WebAssembly, a low-level bytecode intended to serve as a compilation target for code written in languages like C and C++. A key goal of WebAssembly is performance parity with native code; previous work reports near parity, with many applications compiled to WebAssembly running on average 10% slower than native code. However, this evaluation was limited to a suite of scientific kernels, each consisting of roughly 100 lines of code. Running more substantial applications was not possible because compiling code to WebAssembly is only part of the puzzle: standard Unix APIs are not available in the web browser environment. To address this challenge, we build Browsix-Wasm, a significant extension to Browsix that, for the first time, makes it possible to run unmodified WebAssembly-compiled Unix applications directly inside the browser. We then use Browsix-Wasm to conduct the first large-scale evaluation of the performance of WebAssembly vs. native. Across the SPEC CPU suite of benchmarks, we find a substantial performance gap: applications compiled to WebAssembly run slower by an average of 45% (Firefox) to 55% (Chrome), with peak slowdowns of 2.08× (Firefox) and 2.5× (Chrome). We identify the causes of this performance degradation, some of which are due to missing optimizations and code generation issues, while others are inherent to the WebAssembly platform.

Speakers
AJ

Abhinav Jangda

University of Massachusetts Amherst
BP

Bobby Powers

University of Massachusetts Amherst
ED

Emery D. Berger

University of Massachusetts Amherst
AG

Arjun Guha

University of Massachusetts Amherst


Wednesday July 10, 2019 12:20pm - 12:40pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

2:20pm PDT

Apache Nemo: A Framework for Building Distributed Dataflow Optimization Policies
Optimizing scheduling and communication of distributed data processing for resource and data characteristics is crucial for achieving high performance. Existing approaches to such optimizations largely fall into two categories. First, distributed runtimes provide low-level policy interfaces to apply the optimizations, but do not ensure the maintenance of correct application semantics and thus often require significant effort to use. Second, policy interfaces that extend a high-level application programming model ensure correctness, but do not provide sufficient fine control. We describe Apache Nemo, an optimization framework for distributed dataflow processing that provides fine control for high performance, and also ensures correctness for ease of use. We combine several techniques to achieve this, including an intermediate representation, optimization passes, and runtime extensions. Our evaluation results show that Nemo enables composable and reusable optimizations that bring performance improvements on par with existing specialized runtimes tailored for a specific deployment scenario.

Speakers
YY

Youngseok Yang

Seoul National University
JE

Jeongyoon Eo

Seoul National University
GK

Geon-Woo Kim

Viva Republica
JY

Joo Yeon Kim

Samsung Electronics
SL

Sanha Lee

Naver Corp.
JS

Jangho Seo

Seoul National University
WW

Won Wook Song

Seoul National University
BC

Byung-Gon Chun

Seoul National University


Wednesday July 10, 2019 2:20pm - 2:40pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

2:40pm PDT

Tangram: Bridging Immutable and Mutable Abstractions for Distributed Data Analytics
Data analytics frameworks that adopt immutable data abstraction usually provide better support for failure recovery and straggler mitigation, while those that adopt mutable data abstraction are more efficient for iterative workloads thanks to their support for in-place state updates and asynchronous execution. Most existing frameworks adopt either one of the two data abstractions and do not enjoy the benefits of the other. In this paper, we propose a novel programming model named MapUpdate, which can determine whether a distributed dataset is mutable or immutable in an application. We show that MapUpdate not only offers good expressiveness, but also allows us to enjoy the benefits of both mutable and immutable abstractions. MapUpdate naturally supports iterative and asynchronous execution, and can use different recovery strategies adaptively according to failure scenarios. We implemented MapUpdate in a system, called Tangram, with novel system designs such as lightweight local task management, partition-based progress control, and context-aware failure recovery. Extensive experiments verified the benefits of Tangram on a variety of workloads including bulk processing, graph analytics, and iterative machine learning.
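
As a rough illustration of the MapUpdate idea (the function names and driver loop below are ours, not Tangram's API), a map function emits (key, delta) pairs and an update function applies them in place to a mutable, partitioned state, which is what makes iterative and asynchronous execution natural:

```python
# Illustrative sketch of a MapUpdate-style computation (names are ours, not Tangram's
# API): map emits (key, delta) pairs and update applies them in place to mutable
# state, which is what makes iterative and asynchronous execution natural.

from collections import defaultdict

def map_fn(edge):
    _src, dst, weight = edge
    yield dst, weight                 # emit a delta towards the destination vertex

def update_fn(state, key, delta):
    state[key] += delta               # in-place update of the mutable dataset

def map_update(edges, state, iterations=3):
    for _ in range(iterations):       # iterative workloads reuse the same state
        for edge in edges:
            for key, delta in map_fn(edge):
                update_fn(state, key, delta)
    return state

edges = [("a", "b", 1.0), ("b", "c", 0.5), ("a", "c", 2.0)]
print(dict(map_update(edges, defaultdict(float))))
```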

Speakers
YH

Yuzhen Huang

The Chinese University of Hong Kong
XY

Xiao Yan

The Chinese University of Hong Kong
GJ

Guanxian Jiang

The Chinese University of Hong Kong
TJ

Tatiana Jin

The Chinese University of Hong Kong
JC

James Cheng

The Chinese University of Hong Kong
AX

An Xu

The Chinese University of Hong Kong
ZL

Zhanhao Liu

The Chinese University of Hong Kong
ST

Shuo Tu

The Chinese University of Hong Kong


Wednesday July 10, 2019 2:40pm - 3:00pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

3:00pm PDT

STRADS-AP: Simplifying Distributed Machine Learning Programming without Introducing a New Programming Model
It is a daunting task for a data scientist to convert sequential code for a Machine Learning (ML) model, published by an ML researcher, to a distributed framework that runs on a cluster and operates on massive datasets. The process of fitting the sequential code to an appropriate programming model and data abstractions determined by the framework of choice requires significant engineering and cognitive effort. Furthermore, inherent constraints of frameworks sometimes lead to inefficient implementations, delivering suboptimal performance.

We show that it is possible to achieve automatic and efficient distributed parallelization of familiar sequential ML code by making a few mechanical changes to it while hiding the details of concurrency control, data partitioning, task parallelization, and fault tolerance. To this end, we design and implement a new distributed ML framework, STRADS-Automatic Parallelization (AP), and demonstrate that it simplifies distributed ML programming significantly, while outperforming a popular data-parallel framework with an unfamiliar programming model, and achieving performance comparable to an ML-specialized framework.

Speakers
JK

Jin Kyu Kim

Carnegie Mellon University
AA

Abutalib Aghayev

Carnegie Mellon University
GA

Garth A. Gibson

Carnegie Mellon University, Vector Institute, University of Toronto
EP

Eric P. Xing

Petuum Inc, Carnegie Mellon University


Wednesday July 10, 2019 3:00pm - 3:20pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

3:20pm PDT

SOPHIA: Online Reconfiguration of Clustered NoSQL Databases for Time-Varying Workloads
Reconfiguring NoSQL databases under changing workload patterns is crucial for maximizing database throughput. This is challenging because of the large configuration parameter search space, with complex interdependencies among the parameters. While state-of-the-art systems can automatically identify close-to-optimal configurations for static workloads, they suffer under dynamic workloads because they overlook three fundamental challenges: (1) estimating performance degradation during the reconfiguration process (such as due to database restarts); (2) predicting how transient the new workload pattern will be; and (3) respecting the application’s availability requirements during reconfiguration. Our solution, SOPHIA, addresses all these shortcomings using an optimization technique that combines workload prediction with a cost-benefit analyzer. SOPHIA computes the relative cost and benefit of each reconfiguration step and determines an optimal reconfiguration plan for a future time window. This plan specifies when to change configurations and to what, to achieve the best performance without degrading data availability. We demonstrate its effectiveness for three different workloads: a multi-tenant, global-scale metagenomics repository (MG-RAST), a bus-tracking application (Tiramisu), and an HPC data-analytics system, all with varying levels of workload complexity and dynamic workload changes. We compare SOPHIA’s throughput and tail latency against various baselines for two popular NoSQL databases, Cassandra and Redis.
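
The cost-benefit reasoning can be made concrete with a back-of-the-envelope test; the formula and parameters below are illustrative assumptions, not SOPHIA's actual model:

```python
# Back-of-the-envelope cost-benefit test in the spirit of SOPHIA's planner
# (the formula and numbers here are illustrative, not SOPHIA's actual model).

def worth_reconfiguring(tput_current, tput_new, window_s,
                        downtime_s, tput_during_reconfig):
    # Benefit: extra operations served by the new configuration over the window.
    benefit = (tput_new - tput_current) * (window_s - downtime_s)
    # Cost: operations lost while the database restarts and warms up.
    cost = (tput_current - tput_during_reconfig) * downtime_s
    return benefit > cost

# A transient 2-minute spike rarely justifies a 60 s restart; a 1-hour shift does.
print(worth_reconfiguring(10_000, 14_000, window_s=120,  downtime_s=60, tput_during_reconfig=2_000))
print(worth_reconfiguring(10_000, 14_000, window_s=3600, downtime_s=60, tput_during_reconfig=2_000))
```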

Speakers
AM

Ashraf Mahgoub

Purdue University
PW

Paul Wood

Johns Hopkins University
AM

Alexander Medoff

Purdue University
SM

Subrata Mitra

Adobe Research
FM

Folker Meyer

Argonne National Lab
SC

Somali Chaterji

Purdue University
SB

Saurabh Bagchi

Purdue University


Wednesday July 10, 2019 3:20pm - 3:40pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

4:10pm PDT

libmpk: Software Abstraction for Intel Memory Protection Keys (Intel MPK)
Intel Memory Protection Keys (MPK) is a new hardware primitive to support thread-local permission control on groups of pages without requiring modification of page tables. Unfortunately, its current hardware implementation and software support suffer from security, scalability, and semantic problems: (1) it is vulnerable to protection-key-use-after-free; (2) it provides only a limited number of protection keys; and (3) it is incompatible with mprotect()’s process-based permission model.

In this paper, we propose libmpk, a software abstraction for MPK. It virtualizes the hardware protection keys to eliminate the protection-key-use-after-free problem while providing access to an unlimited number of virtualized keys. To support legacy applications, it also provides lazy inter-thread key synchronization. To enhance the security of MPK itself, libmpk restricts unauthorized writes to its metadata. We apply libmpk to three real-world applications: OpenSSL, a JavaScript JIT compiler, and Memcached for memory protection and isolation. Our evaluation shows that libmpk introduces negligible performance overhead.
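
The virtualization idea can be sketched compactly; the Python model below is illustrative only (libmpk is a C library layered over Intel MPK and page re-tagging; the name pkey_alloc is borrowed from the POSIX-style API purely for flavor):

```python
# Minimal sketch of protection-key virtualization in the spirit of libmpk
# (illustrative only; libmpk is a C library over Intel MPK, not this Python API).
# Many virtual keys are multiplexed onto a small pool of hardware keys; when the
# pool is exhausted, a victim mapping is evicted before its hardware key is reused.

class VirtualPKeyManager:
    HW_KEYS = 15                                  # MPK exposes roughly 15 usable keys

    def __init__(self):
        self.next_vkey = 0
        self.vkey_to_hw = {}                      # virtual key -> hardware key
        self.lru = []                             # hardware-key reuse order

    def pkey_alloc(self) -> int:
        vkey = self.next_vkey
        self.next_vkey += 1                       # virtual keys never run out
        return vkey

    def bind(self, vkey: int) -> int:
        """Return a hardware key for vkey, evicting another mapping if needed."""
        if vkey in self.vkey_to_hw:
            return self.vkey_to_hw[vkey]
        if len(self.lru) < self.HW_KEYS:
            hw = len(self.lru)
        else:
            victim, hw = self.lru.pop(0)          # evict the oldest mapping
            del self.vkey_to_hw[victim]           # (real libmpk re-tags the victim's pages)
        self.vkey_to_hw[vkey] = hw
        self.lru.append((vkey, hw))
        return hw

mgr = VirtualPKeyManager()
keys = [mgr.pkey_alloc() for _ in range(20)]      # more virtual keys than hardware keys
print([mgr.bind(k) for k in keys])
```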

Speakers
SP

Soyeon Park

Georgia Institute of Technology
SL

Sangho Lee

Microsoft Research
WX

Wen Xu

Georgia Institute of Technology
HM

Hyungon Moon

Ulsan National Institute of Science and Technology
TK

Taesoo Kim

Georgia Institute of Technology


Wednesday July 10, 2019 4:10pm - 4:30pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

4:30pm PDT

Effective Static Analysis of Concurrency Use-After-Free Bugs in Linux Device Drivers
In Linux device drivers, use-after-free (UAF) bugs can cause system crashes and serious security problems. According to our study of Linux kernel commits, 42% of the driver commits fixing use-after-free bugs involve driver concurrency. We refer to these use-after-free bugs as concurrency use-after-free bugs. Due to the non-determinism of concurrent execution, concurrency use-after-free bugs are often more difficult to reproduce and detect than sequential use-after-free bugs.

In this paper, we propose a practical static analysis approach named DCUAF, to effectively detect concurrency use-after-free bugs in Linux device drivers. DCUAF combines a local analysis analyzing the source code of each driver with a global analysis statistically analyzing the local results of all drivers, forming a local-global analysis, to extract the pairs of driver interface functions that may be concurrently executed. Then, with these pairs, DCUAF performs a summary-based lockset analysis to detect concurrency use-after-free bugs. We have evaluated DCUAF on the driver code of Linux 4.19, and found 640 real concurrency use-after-free bugs. We have randomly selected 130 of the real bugs and reported them to Linux kernel developers, and 95 have been confirmed.

Speakers
JB

Jia-Ju Bai

Tsinghua University
JL

Julia Lawall

Sorbonne Université/Inria/LIP6
QC

Qiu-Liang Chen

Tsinghua University
SH

Shi-Min Hu

Tsinghua University


Wednesday July 10, 2019 4:30pm - 4:50pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

4:50pm PDT

LXDs: Towards Isolation of Kernel Subsystems
Modern operating systems are monolithic. Today, however, the lack of isolation is one of the main factors undermining the security of the kernel. The inherent complexity of the kernel code and its rapid development pace, combined with the use of an unsafe, low-level programming language, result in a steady stream of errors. Even after decades of effort to make commodity kernels more secure, i.e., the development of numerous static and dynamic approaches aimed at preventing the exploitation of the most common errors, several hundred serious kernel vulnerabilities are reported every year. Unfortunately, in a monolithic kernel a single exploitable vulnerability potentially provides an attacker with access to the entire kernel.

Modern kernels need isolation as a practical means of confining the effects of exploits to individual kernel subsystems. Historically, introducing isolation in the kernel has been hard. First, commodity hardware interfaces provide no support for efficient, fine-grained isolation. Second, the complexity of a modern kernel prevents a naive decomposition effort. Our work on Lightweight Execution Domains (LXDs) takes a step towards enabling isolation in a full-featured operating system kernel. LXDs allow one to take an existing kernel subsystem and run it inside an isolated domain with minimal or no modifications and with minimal overhead. We evaluate our approach by developing isolated versions of several performance-critical device drivers in the Linux kernel.

Speakers
VN

Vikram Narayanan

University of California, Irvine
AB

Abhiram Balasubramanian

University of Utah
CJ

Charlie Jacobsen

University of Utah
SS

Sarah Spall

University of Utah
SB

Scott Bauer

University of Utah
MQ

Michael Quigley

University of Utah
AH

Aftab Hussain

University of California, Irvine
AY

Abdullah Younis

University of California, Irvine
JS

Junjie Shen

University of California, Irvine
MB

Moinak Bhattacharyya

University of California, Irvine
AB

Anton Burtsev

University of California, Irvine


Wednesday July 10, 2019 4:50pm - 5:10pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

5:10pm PDT

JumpSwitches: Restoring the Performance of Indirect Branches In the Era of Spectre
The Spectre family of security vulnerabilities shows that speculative execution attacks on modern processors are practical. Spectre variant 2 attacks exploit the speculation of indirect branches, which enable processors to execute code from arbitrary locations in memory. Retpolines serve as the state-of-the-art defense, effectively disabling speculative execution for indirect branches. While retpolines succeed in protecting against Spectre, they come with a significant penalty: 20% on some workloads.

In this paper, we describe and implement an alternative mechanism: the JumpSwitch, which enables speculative execution of indirect branches on safe targets by leveraging indirect call promotion, transforming indirect calls into direct calls. Unlike traditional inlining techniques, which apply call promotion at compile time, JumpSwitches aggressively learn targets at runtime and leverage an optimized patching infrastructure to perform just-in-time promotion without the overhead of binary translation.

We designed and optimized JumpSwitches for common patterns we have observed. If a JumpSwitch cannot learn safe targets, we fall back to the safety of retpolines. JumpSwitches seamlessly integrate into Linux, and are evaluated in Linux v4.18. In addition, we have submitted patches enabling JumpSwitches upstream to the Linux community. We show that JumpSwitches can improve performance over retpolines by up to 20% for a range of workloads. In some cases, JumpSwitches even show improvement over a system without retpolines by directing speculative execution into direct calls just-in-time and reducing mispredictions.

Speakers
NA

Nadav Amit

VMware Research
MW

Michael Wei

VMware Research


Wednesday July 10, 2019 5:10pm - 5:30pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX
 
Thursday, July 11
 

9:45am PDT

Multi Queue Fair Queuing
Modern high-speed devices (e.g., network adapters, storage, accelerators) use new host interfaces, which expose multiple software queues directly to the device. These multi-queue interfaces allow mutually distrusting applications to access the device without any cross-core interaction, enabling throughput on the order of millions of IOP/s on multicore systems. Unfortunately, while independent device access is scalable, it also introduces a new problem: unfairness. Mechanisms that were used to provide fairness for older devices are no longer tenable in the wake of multi-queue design, and straightforward attempts to re-introduce it would require cross-core synchronization that undermines the scalability for which multiple queues were designed.

To address these challenges, we present Multi-Queue Fair Queueing (MQFQ), the first fair, work-conserving scheduler suitable for multi-queue systems. Specifically, we (1) reformulate a classical fair queueing algorithm to accommodate multi-queue designs, and (2) describe a scalable implementation that bounds potential unfairness while minimizing synchronization overhead. Our implementation of MQFQ in Linux 4.15 demonstrates both fairness and high throughput. Evaluation with an NVMe over RDMA fabric (NVMf) device shows that MQFQ can reach up to 3.1 million IOP/s on a single machine — 20× higher than the state-of-the-art Linux Budget Fair Queueing. Compared to a system with no fairness, MQFQ reduces the slowdown caused by an antagonist from 3.78× to 1.33× for the FlashX workload and from 6.57× to 1.03× for the Aerospike workload (2× is considered "fair" slowdown).

Speakers
MH

Mohammad Hedayati

University of Rochester
KS

Kai Shen

Google
ML

Michael L. Scott

University of Rochester


Thursday July 11, 2019 9:45am - 10:05am PDT
USENIX ATC Track II: Grand Ballroom VII–IX

10:05am PDT

BRAVO -- Biased Locking for Reader-Writer Locks
Designers of modern reader-writer locks confront a difficult trade-off related to reader scalability. Locks that have a compact memory representation for active readers will typically suffer under high intensity read-dominated workloads when the "reader indicator" state is updated frequently by a diverse set of threads, causing cache invalidation and coherence traffic. Other designs use distributed reader indicators, one per NUMA node, per core or even per thread. This improves reader-reader scalability, but also increases the size of each lock instance and creates overhead for writers.

We propose a simple transformation, BRAVO, that augments any existing reader-writer lock, adding just two integer fields to the lock instance. Readers make their presence known to writers by hashing their thread's identity with the lock address, forming an index into a visible readers table and installing the lock address into the table. All locks and threads in an address space can share the same readers table. Crucially, readers of the same lock tend to write to different locations in the table, reducing coherence traffic. Therefore, BRAVO can augment a simple compact lock to provide scalable concurrent reading, but with only modest and constant increase in memory footprint.

We implemented BRAVO in user-space, as well as integrated it with the Linux kernel reader-writer semaphore (rwsem). Our evaluation with numerous benchmarks and real applications, both in user- and kernel-space, demonstrates that BRAVO improves the performance and scalability of underlying locks in read-heavy workloads while introducing virtually no overhead, including in workloads in which writes are frequent.
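
The reader-visibility scheme is concrete enough to sketch. The simplified model below (BRAVO itself is written in C with atomic operations and a revocation timeout; the names here are ours) shows readers publishing themselves in a global table indexed by a hash of thread identity and lock address, and a writer revoking the read bias and waiting for the table to drain:

```python
# Simplified model of BRAVO's reader-visibility scheme (illustrative only; the real
# implementation is C with atomics). Readers publish themselves by hashing
# (thread id, lock address) into a global visible-readers table, so readers of the
# same lock tend to land in different slots instead of updating a shared counter.

import threading

TABLE_SIZE = 4096
visible_readers = [None] * TABLE_SIZE          # shared by all BRAVO-augmented locks

class BravoLock:
    def __init__(self):
        self.rbias = True                      # one of the two fields BRAVO adds
        self.underlying = threading.Lock()     # stand-in for the original rwlock

def _slot(lock):
    return hash((threading.get_ident(), id(lock))) % TABLE_SIZE

def read_lock(lock):
    if lock.rbias:
        i = _slot(lock)
        if visible_readers[i] is None:
            visible_readers[i] = lock          # publish this reader
            if lock.rbias:                     # re-check bias after publishing
                return ("fast", i)
            visible_readers[i] = None          # bias was revoked; fall back
    lock.underlying.acquire()                  # slow path: the original lock
    return ("slow", None)

def read_unlock(lock, token):
    kind, i = token
    if kind == "fast":
        visible_readers[i] = None
    else:
        lock.underlying.release()

def write_lock(lock):
    lock.rbias = False                         # revoke read bias
    while any(entry is lock for entry in visible_readers):
        pass                                   # wait for published readers to drain
    lock.underlying.acquire()

lk = BravoLock()
token = read_lock(lk)
read_unlock(lk, token)
```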

Speakers
DD

Dave Dice

Oracle Labs
AK

Alex Kogan

Oracle Labs


Thursday July 11, 2019 10:05am - 10:25am PDT
USENIX ATC Track II: Grand Ballroom VII–IX

10:25am PDT

Mitigating Asymmetric Read and Write Costs in Cuckoo Hashing for Storage Systems
In storage systems, cuckoo hash tables have been widely used to support fast query services. For a read, cuckoo hashing delivers real-time access with O(1) lookup complexity via an open-addressing approach. For a write, most concurrent cuckoo hash tables fail to efficiently address the problem of endless loops during item insertion, which stems from the essential property of hash collisions. This asymmetry gives cuckoo hashing fast-read-slow-write performance, and single-threaded writes often become the bottleneck. In order to address this asymmetric performance and significantly improve write/insert efficiency, we propose an optimized Concurrent Cuckoo hashing scheme, called CoCuckoo. To predetermine the occurrence of endless loops, CoCuckoo leverages a directed pseudoforest containing several subgraphs to track the cuckoo paths that represent the relationships among items. CoCuckoo exploits the observation that insertion operations sharing a cuckoo path access the same subgraph; hence a lock is needed to ensure the correctness of concurrency control by allowing only one thread to access the shared path at a time, while insertion operations accessing different subgraphs can be executed simultaneously without collisions. CoCuckoo improves throughput via graph-grained locking that supports concurrent writes and reads. We have implemented all components of CoCuckoo, and extensive experiments using the YCSB benchmark have demonstrated the efficiency and efficacy of our proposed scheme.
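
The pseudoforest bookkeeping can be sketched with a union-find structure; the model below is our simplification of the idea (per-subgraph locking and the actual table are omitted), treating buckets as vertices and each inserted key as an edge, so an insertion whose subgraph can no longer absorb another edge is rejected up front:

```python
# Union-find sketch of CoCuckoo's pseudoforest idea (our simplification, not the
# paper's code): buckets are vertices, each inserted key is an edge between its two
# candidate buckets. A subgraph may hold at most as many keys as it has buckets, so
# an insertion that would exceed that bound is an endless loop and can be rejected
# before any kick-out sequence starts; insertions into different subgraphs can run
# concurrently under per-subgraph locks.

class PseudoforestTracker:
    def __init__(self, num_buckets):
        self.parent = list(range(num_buckets))
        self.vertices = [1] * num_buckets          # buckets per subgraph
        self.edges = [0] * num_buckets             # keys per subgraph

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x

    def can_insert(self, b1, b2) -> bool:
        r1, r2 = self.find(b1), self.find(b2)
        if r1 == r2:
            return self.edges[r1] < self.vertices[r1]      # room for one more edge?
        return (self.edges[r1] + self.edges[r2] + 1
                <= self.vertices[r1] + self.vertices[r2])

    def record_insert(self, b1, b2):
        r1, r2 = self.find(b1), self.find(b2)
        if r1 != r2:                                       # merge the two subgraphs
            self.parent[r2] = r1
            self.vertices[r1] += self.vertices[r2]
            self.edges[r1] += self.edges[r2]
        self.edges[r1] += 1

tracker = PseudoforestTracker(num_buckets=4)
print(tracker.can_insert(0, 1))   # True: fresh subgraph
tracker.record_insert(0, 1)
tracker.record_insert(0, 1)       # second key on the same two buckets forms a cycle
print(tracker.can_insert(0, 1))   # False: subgraph is maximal, insertion would loop
```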

Speakers
YS

Yuanyuan Sun

Huazhong University of Science and Technology
YH

Yu Hua

Huazhong University of Science and Technology
ZC

Zhangyu Chen

Huazhong University of Science and Technology
YG

Yuncheng Guo

Huazhong University of Science and Technology


Thursday July 11, 2019 10:25am - 10:45am PDT
USENIX ATC Track II: Grand Ballroom VII–IX

11:15am PDT

SIMD-X: Programming and Processing of Graph Algorithms on GPUs
With high computation power and memory bandwidth, graphics processing units (GPUs) lend themselves to accelerating data-intensive analytics, especially when such applications fit the single instruction multiple data (SIMD) model. However, graph algorithms such as breadth-first search and k-core often fail to take full advantage of GPUs due to irregularity in memory access and control flow. To address this challenge, we have developed SIMD-X, for programming and processing of single instruction, multiple complex data on GPUs. Specifically, the new Active-Compute-Combine (ACC) model not only provides ease of programming to programmers, but more importantly creates opportunities for system-level optimizations. To this end, SIMD-X utilizes just-in-time task management, which filters out inactive vertices at runtime and intelligently maps various tasks to different numbers of GPU cores in pursuit of workload balancing. In addition, SIMD-X leverages push-pull based kernel fusion that, with the help of a new deadlock-free global barrier, reduces a large number of computation kernels to very few. Using SIMD-X, a user can program a graph algorithm in tens of lines of code, while achieving 3x, 6x, 24x, and 3x speedups over Gunrock, Galois, CuSha, and Ligra, respectively.

Speakers
HL

Hang Liu

University of Massachusetts Lowell
HH

H. Howie Huang

George Washington University


Thursday July 11, 2019 11:15am - 11:35am PDT
USENIX ATC Track II: Grand Ballroom VII–IX

11:35am PDT

11:55am PDT

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs
Recent deep learning models have moved beyond low-dimensional regular grids such as images, video, and speech, to high-dimensional graph-structured data, such as social networks, e-commerce user-item graphs, and knowledge graphs. This evolution has led to large graph-based neural network models that go beyond what existing deep learning frameworks or graph computing systems are designed for. We present NeuGraph, a new framework that bridges the graph and dataflow models to support efficient and scalable parallel neural network computation on graphs. NeuGraph introduces graph computation optimizations into the management of data partitioning, scheduling, and parallelism in dataflow-based deep learning frameworks. Our evaluation shows that, on small graphs that can fit in a single GPU, NeuGraph outperforms state-of-the-art implementations by a significant margin, while scaling to large real-world graphs that none of the existing frameworks can handle directly with GPUs.

Speakers
LM

Lingxiao Ma

Peking University
ZY

Zhi Yang

Peking University
YM

Youshan Miao

Microsoft Research
JX

Jilong Xue

Microsoft Research
MW

Ming Wu

Microsoft Research
LZ

Lidong Zhou

Microsoft Research
YD

Yafei Dai

Peking University


Thursday July 11, 2019 11:55am - 12:15pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

12:15pm PDT

Pre-Select Static Caching and Neighborhood Ordering for BFS-like Algorithms on Disk-based Graph Engines
Many important graph algorithms are based on the breadth-first search (BFS) approach, which is built on recursive vertex traversal. We classify algorithms that share this characteristic as BFS-like algorithms. In this work, we first analyze and study the I/O request patterns of BFS-like algorithms executed on disk-based graph engines. Our analysis exposes two shortcomings in executing BFS-like algorithms. First, we find that the use of the cache is ineffective. To make use of the cache more effectively, we propose an in-memory static cache, which we call BFS-Aware Static Cache, or Basc for short. Basc is static as its contents, which are edge lists of vertices that are pre-selected before algorithm execution, do not change throughout the execution of the algorithm. Second, we find that the state-of-the-art ordering method for graphs on disk is ineffective with BFS-like algorithms. Thus, based on an I/O cost model that estimates performance based on the ordering of graphs, we develop an efficient graph ordering called Neighborhood Ordering, or Norder. We provide extensive evaluations of Basc and Norder on two well-known graph engines using five real-world graphs, including Twitter, which has 1.9 billion edges. Our experimental results show that Basc and Norder collectively have a substantial performance impact.
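
Basc's defining property is that the cached edge lists are chosen before the run and never change. The abstract does not give the selection metric, so the sketch below only illustrates the general idea with a simple greedy choice (highest-degree vertices first) under a fixed memory budget:

```python
# Illustrative sketch of a Basc-style pre-selected static cache (the exact selection
# criterion used by Basc may differ; here we greedily pin the edge lists expected to
# save the most I/O, i.e., the highest-degree vertices, within a memory budget).

def build_static_cache(degrees, bytes_per_edge, budget_bytes):
    """degrees: dict vertex -> out-degree. Returns the set of vertices to pin."""
    cache, used = set(), 0
    for v, deg in sorted(degrees.items(), key=lambda kv: kv[1], reverse=True):
        size = deg * bytes_per_edge
        if used + size <= budget_bytes:
            cache.add(v)               # cache contents never change during the run
            used += size
    return cache

degrees = {"a": 1_000_000, "b": 50, "c": 200_000, "d": 10}
print(build_static_cache(degrees, bytes_per_edge=8, budget_bytes=9_000_000))
# pins 'a', 'b', and 'd'; 'c' does not fit once 'a' is cached
```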

Speakers
JS

Jiwon Seo

Hanyang University


Thursday July 11, 2019 12:15pm - 12:35pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

2:00pm PDT

StreamBox-TZ: Secure Stream Analytics at the Edge with TrustZone
While it is compelling to process large streams of IoT data on the cloud edge, doing so exposes the data to a sophisticated, vulnerable software stack on the edge and hence security threats. To this end, we advocate isolating the data and its computations in a trusted execution environment (TEE) on the edge, shielding them from the remaining edge software stack which we deem untrusted.

This approach faces two major challenges: (1) executing high-throughput, low-delay stream analytics in a single TEE, which is constrained by a low trusted computing base (TCB) and limited physical memory; and (2) verifying the execution of stream analytics, as the execution involves untrusted software components on the edge. In response, we present StreamBox-TZ (SBT), a stream analytics engine for an edge platform that offers strong data security, verifiable results, and good performance. SBT contributes a data plane designed and optimized for a TEE based on ARM TrustZone. It supports continuous remote attestation for analytics correctness and result freshness while incurring low overhead. SBT adds only a 42.5 KB executable to the TCB (16% of the entire TCB). On an octa-core ARMv8 platform, it delivers state-of-the-art performance by processing input events at up to 140 MB/sec (12M events/sec) with sub-second delay. The overhead incurred by SBT’s security mechanism is less than 25%.

Speakers
HP

Heejin Park

Purdue ECE
SZ

Shuang Zhai

Purdue ECE
LL

Long Lu

Northeastern University


Thursday July 11, 2019 2:00pm - 2:20pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

2:20pm PDT

CoSMIX: A Compiler-based System for Secure Memory Instrumentation and Execution in Enclaves
Hardware secure enclaves are increasingly used to run complex applications. Unfortunately, existing and emerging enclave architectures do not allow secure and efficient implementation of custom page fault handlers. This limitation impedes in-enclave use of secure memory-mapped files and prevents extensions of the application memory layer commonly used in untrusted systems, such as transparent memory compression or access to remote memory.

CoSMIX is a Compiler-based system for Secure Memory Instrumentation and eXecution of applications in secure enclaves. A novel memory store abstraction allows implementation of application-level secure page fault handlers that are invoked by a lightweight enclave runtime. The CoSMIX compiler instruments the application memory accesses to use one or more memory stores, guided by a global instrumentation policy or code annotations without changing application code.

The CoSMIX prototype runs on Intel SGX and is compatible with popular SGX execution environments, including SCONE and Graphene. Our evaluation of several production applications shows how CoSMIX improves their security and performance by recompiling them with appropriate memory stores. For example, unmodified Redis and Memcached key-value stores achieve about 2× speedup by using a self-paging memory store while working on datasets up to 6× larger than the enclave’s secure memory. Similarly, annotating a single line of code in a biometric verification server changes it to store its sensitive data in Oblivious RAM and makes it resilient against SGX side-channel attacks.

Speakers
YM

Yan Michalevsky

Anjuna Security
MS

Mark Silberstein

Technion – Israel Institute of Technology


Thursday July 11, 2019 2:20pm - 2:40pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

2:40pm PDT

Secured Routines: Language-based Construction of Trusted Execution Environments
Trusted Execution Environments (TEEs), such as Intel SGX’s enclave, use hardware to ensure the confidentiality and integrity of operations on sensitive data. While the technology is widely available, the complexity of its programming model and its performance overhead have limited adoption. TEEs provide a new and valuable hardware functionality that has no obvious analogue in programming languages, which means that developers must manually partition their application into trusted and untrusted components.

This paper describes an approach that fully integrates trusted execution in a language-appropriate manner. We extend the Go language to allow a programmer to execute a goroutine within an enclave, to use low-overhead channels to communicate between the trusted and untrusted environments, and to rely on a compiler to automatically extract the secure code and data. Our prototype compiler and runtime, GOTEE, is a backward-compatible fork of the Go compiler.

The evaluation shows that our compiler-driven code and data partitioning efficiently executes both microbenchmarks and applications. On the former, GOTEE achieves a 5.2x throughput improvement and a 2.3x latency improvement over the Intel SGX SDK. Our case studies, the Go tls package and a secured keystore inspired by the go-ethereum project, show that minor source-code modifications suffice to provide confidentiality and integrity guarantees with only moderate performance overheads.


Thursday July 11, 2019 2:40pm - 3:00pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

3:00pm PDT

Supporting Security Sensitive Tenants in a Bare-Metal Cloud
SecCloud is a new architecture for bare-metal clouds that enables tenants to control tradeoffs between security, price, and performance. It enables security-sensitive tenants to minimize their trust in the public cloud provider and achieve levels of security and control similar to those they can obtain in their own private data centers, while neither imposing overhead on security-insensitive tenants nor compromising the flexibility or operational efficiency of the provider. Our prototype exploits a novel provisioning system and specialized firmware to enable elasticity similar to virtualized clouds. Experimentally, we quantify the cost of different levels of security for a variety of workloads and demonstrate the value of giving control to the tenant.

Speakers
AM

Amin Mosayyebzadeh

Boston University
AM

Apoorve Mohan

Northeastern University
ST

Sahil Tikale

Boston University
MA

Mania Abdi

Northeastern University
NS

Nabil Schear

MIT Lincoln Laboratory
CM

Charles Munson

MIT Lincoln Laboratory
LR

Larry Rudolph

Two Sigma
GC

Gene Cooperman

Northeastern University
PD

Peter Desnoyers

Northeastern University
OK

Orran Krieger

Boston University


Thursday July 11, 2019 3:00pm - 3:20pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

3:50pm PDT

SmartDedup: Optimizing Deduplication for Resource-constrained Devices
Storage on smart devices such as smartphones and Internet of Things devices has limited performance, capacity, and endurance. Deduplication has the potential to address these limitations by eliminating redundant I/Os and data, but it must be considered under the various resource constraints of the devices. This paper presents SmartDedup, a deduplication solution optimized for resource-constrained devices. It proposes a novel architecture that supports symbiotic in-line and out-of-line deduplication to take advantage of their complementary strengths and allow them to be adapted according to a device's current resource availability. It also cohesively combines in-memory and on-disk fingerprint stores to minimize the memory overhead while achieving a good level of deduplication. SmartDedup is prototyped on EXT4 and F2FS and evaluated using benchmarks, workloads generated from real-world device images, and traces collected from real-world devices. The results show that SmartDedup substantially improves I/O performance (e.g., increases write and read throughput by 31.1% and 32%, respectively, for an FIO experiment with a 25% deduplication ratio), reduces flash writes (e.g., by 70.9% in a trace replay experiment with a 72.4% deduplication ratio), and saves space (e.g., by 45% in a DEDISbench experiment with a 46.1% deduplication ratio) with low memory, storage, and battery overhead, compared to both native file systems and related deduplication solutions.
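
The fingerprint-store split can be pictured with a small sketch; the structure below is illustrative (SmartDedup's real design lives inside the kernel file systems and manages migration between the two stores, which is omitted here):

```python
# Simplified sketch of fingerprint-based deduplication with a small in-memory store
# backed by a larger on-disk store (structure is illustrative, not SmartDedup's code).

import hashlib

class DedupStore:
    def __init__(self, mem_capacity=1024):
        self.mem = {}                             # hot fingerprints kept in memory
        self.disk = {}                            # full fingerprint store on disk
        self.mem_capacity = mem_capacity

    def write_block(self, block: bytes, addr: int):
        fp = hashlib.sha1(block).hexdigest()      # block fingerprint
        existing = self.mem.get(fp, self.disk.get(fp))
        if existing is not None:
            return ("dedup", existing)            # redundant write eliminated
        self.disk[fp] = addr
        if len(self.mem) < self.mem_capacity:
            self.mem[fp] = addr                   # keep the fingerprint in memory
        return ("stored", addr)

store = DedupStore()
print(store.write_block(b"A" * 4096, addr=0))     # ('stored', 0)
print(store.write_block(b"A" * 4096, addr=1))     # ('dedup', 0) -- duplicate skipped
```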

Speakers
QY

Qirui Yang

Arizona State University
RJ

Runyu Jin

Arizona State University
MZ

Ming Zhao

Arizona State University


Thursday July 11, 2019 3:50pm - 4:10pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

4:10pm PDT

Data Domain Cloud Tier: Backup here, backup there, deduplicated everywhere!
Data Domain has added a cloud tier capability to its on-premises storage appliance, allowing clients to achieve the cost benefits of deduplication in the cloud. While many architectural changes were necessary to support a cloud tier in a mature storage product, in this paper we focus on innovations needed to support key functionality for customers. Consider typical customer interactions: First, a customer determines which files to migrate to the cloud by estimating how much space will be freed on the on-premises Data Domain appliance. Second, a customer transfers selected files to the cloud and later restores files back. Finally, a customer deletes a file in the cloud when its retention period has expired. Each of these operations requires significant architectural changes and new algorithms to address both the impact of deduplicated storage and the latency and expense of cloud object storage. We also present analysis from deployed cloud tier systems. As an example, some customers have moved more than 20PB of logical data to the cloud tier and achieved a total compression factor (deduplication * local compression) of 40× or more, resulting in millions of dollars of cost savings.

Speakers

Thursday July 11, 2019 4:10pm - 4:30pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

4:35pm PDT

Pragh: Locality-preserving Graph Traversal with Split Live Migration
Much real-world data, such as social, transportation, biology, and communication data, can be efficiently modeled as a graph. Hence, graph traversals such as multi-hop or graph-walking queries have become key operations atop graph stores. However, since different graph traversals may touch different sets of data, it is hard or even impossible to have a one-size-fits-all graph partitioning algorithm that preserves access locality for various graph traversal workloads. Meanwhile, prior shard-based migration faces a dilemma: coarse-grained migration may incur migration overhead that outweighs the locality benefits, while fine-grained migration usually requires excessive metadata and incurs non-trivial maintenance costs. This paper proposes Pragh, an efficient locality-preserving live graph migration scheme for graph stores in the form of key-value pairs. The key idea of Pragh is a split migration model that migrates only values physically while retaining keys in their initial location. This allows fine-grained migration while avoiding the need to maintain excessive metadata. Pragh integrates an RDMA-friendly location cache from DrTM-KV to provide fully-localized accesses to migrated data and further makes a novel reuse of the cache replacement policy for lightweight monitoring. Pragh further supports evolving graphs through a check-and-forward mechanism to resolve the conflict between updates and migration of graph data. Evaluations on an 8-node RDMA-capable cluster using a representative graph traversal benchmark show that Pragh can increase throughput by up to 19× and decrease the median latency by up to 94%, thanks to split live migration that eliminates 97% of remote accesses. A port of split live migration to Wukong with up to 2.53× throughput improvement further confirms the effectiveness and generality of Pragh.
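
The split migration model is easy to picture with a toy example; the classes below are a conceptual illustration, not Pragh's RDMA-based implementation: values move between nodes while each key and its location record stay on the key's home node, and a client-side dictionary stands in for the RDMA-friendly location cache:

```python
# Toy model of split live migration (conceptual only, not Pragh's code): values move
# between nodes, while each key and its location record stay on the key's home node,
# so migration needs no global metadata reshuffle. A client-side location cache
# plays the role of Pragh's RDMA-friendly cache, avoiding an extra hop after migration.

class GraphStore:
    def __init__(self, nodes):
        self.nodes = {n: {} for n in nodes}   # node -> {key: value}
        self.home = {}                        # key -> (home node, current node)

    def put(self, key, value, node):
        self.nodes[node][key] = value
        self.home[key] = (node, node)

    def migrate(self, key, dst):
        home, cur = self.home[key]
        self.nodes[dst][key] = self.nodes[cur].pop(key)   # move only the value
        self.home[key] = (home, dst)                      # key stays indexed at home

    def get(self, key, location_cache):
        node = location_cache.get(key)
        if node is None or key not in self.nodes[node]:   # miss or stale entry
            node = self.home[key][1]                      # one lookup at the home node
            location_cache[key] = node
        return self.nodes[node][key]

store = GraphStore(["n0", "n1"])
cache = {}
store.put("v42", {"follows": ["v7", "v9"]}, "n0")
print(store.get("v42", cache))        # first access fills the location cache
store.migrate("v42", "n1")
print(store.get("v42", cache))        # stale cache entry detected, refreshed to n1
```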

Speakers
XX

Xiating Xie

Shanghai Jiao Tong University
XW

Xingda Wei

Shanghai Jiao Tong University
RC

Rong Chen

Shanghai Jiao Tong University
HC

Haibo Chen

Shanghai Jiao Tong University / Huawei Technologies Co., Ltd.


Thursday July 11, 2019 4:35pm - 4:55pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

4:55pm PDT

ElasticBF: Elastic Bloom Filter with Hotness Awareness for Boosting Read Performance in Large Key-Value Stores
LSM-tree based key-value (KV) stores suffer from severe read amplification because searching for a key requires checking multiple SSTables. To reduce extra I/Os, Bloom filters are usually deployed in KV stores to improve read performance. However, Bloom filters suffer from false positives, and simply enlarging them introduces large memory overhead, so they still cause extra I/Os in memory-constrained systems. In this paper, we observe that access skewness is very common among SSTables, and even among small segments within each SSTable. To leverage this skewness, we develop ElasticBF, a fine-grained heterogeneous Bloom filter management scheme that adjusts filters dynamically according to data hotness. ElasticBF is orthogonal to work that optimizes the architecture of LSM-tree based KV stores, so it can be integrated with such systems to further speed up their read performance. We build ElasticBF atop LevelDB, RocksDB, and PebblesDB, and our experimental results show that ElasticBF increases the read throughput of the above KV stores to 2.34x, 2.35x, and 2.58x, respectively, while keeping almost the same write and range query performance.
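
The hotness-aware adjustment can be sketched as follows; the per-segment filter "units" and the access-count threshold below are illustrative simplifications, not ElasticBF's exact mechanism:

```python
# Simplified sketch of ElasticBF's idea: each segment owns several small Bloom filter
# "units"; hot segments enable more units in memory (lower false-positive rate) while
# cold segments keep fewer, so total filter memory stays bounded. Hashing is simplified.

class SegmentFilters:
    def __init__(self, keys, units=4, bits_per_unit=1024):
        self.units = []                           # each unit is an independent filter
        for u in range(units):
            bits = [False] * bits_per_unit
            for k in keys:
                bits[hash((u, k)) % bits_per_unit] = True
            self.units.append(bits)
        self.enabled = 1                          # cold segments start with one unit
        self.accesses = 0

    def maybe_contains(self, key) -> bool:
        self.accesses += 1
        return all(unit[hash((u, key)) % len(unit)]
                   for u, unit in enumerate(self.units[:self.enabled]))

    def adjust(self, hot_threshold=100):
        # Hotness-aware adjustment: give hot segments more enabled units.
        if self.accesses > hot_threshold:
            self.enabled = len(self.units)
        self.accesses = 0

seg = SegmentFilters(keys=[f"user{i}" for i in range(500)])
for _ in range(200):                              # simulate a burst of reads
    seg.maybe_contains("user42")
seg.adjust()                                      # segment becomes hot: all units enabled
print(seg.enabled, seg.maybe_contains("user42"))
```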

Speakers
YL

Yongkun Li

University of Science and Technology of China
CT

Chengjin Tian

University of Science and Technology of China
FG

Fan Guo

University of Science and Technology of China
CL

Cheng Li

University of Science and Technology of China
YX

Yinlong Xu

University of Science and Technology of China


Thursday July 11, 2019 4:55pm - 5:15pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

5:15pm PDT

SILK: Preventing Latency Spikes in Log-Structured Merge Key-Value Stores
LSM-based KV stores are designed to offer good write performance, by capturing client writes in memory, and only later flushing them to storage. Writes are later compacted into a tree-like data structure on disk to improve read performance and to reduce storage space use. It has been widely documented that compactions severely hamper throughput. Various optimizations have successfully dealt with this problem. These techniques include, among others, rate-limiting flushes and compactions, selecting among compactions for maximum effect, and limiting compactions to the highest level by so-called fragmented LSMs.

In this paper we focus on latencies rather than throughput. We first document the fact that LSM KVs exhibit high tail latencies. The techniques that have been proposed for optimizing throughput do not address this issue, and in fact in some cases exacerbate it. The root cause of these high tail latencies is interference between client writes, flushes and compactions. We then introduce the notion of an I/O scheduler for an LSM-based KV store to reduce this interference. We explore three techniques as part of this I/O scheduler: 1) opportunistically allocating more bandwidth to internal operations during periods of low load, 2) prioritizing flushes and compactions at the lower levels of the tree, and 3) preempting compactions.

SILK is a new open-source KV store that incorporates this notion of an I/O scheduler. SILK is derived from RocksDB, but the concepts can be applied to other LSM-based KV stores. We use both a production workload at Nutanix and synthetic benchmarks to demonstrate that SILK achieves up to two orders of magnitude lower 99th percentile latencies than RocksDB and TRIAD, without any significant negative effects on other performance metrics.

Speakers
OB

Oana Balmau

University of Sydney
FD

Florin Dinu

University of Sydney
WZ

Willy Zwaenepoel

University of Sydney
KG

Karan Gupta

Nutanix Inc.
DD

Diego Didona

IBM Research - Zurich


Thursday July 11, 2019 5:15pm - 5:35pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

5:35pm PDT

Unification of Temporary Storage in the NodeKernel Architecture
Efficiently exchanging temporary data between tasks is critical to the end-to-end performance of many data processing frameworks and applications. Unfortunately, the diverse nature of temporary data creates storage demands that often fall between the sweet spots of traditional storage platforms, such as file systems or key-value stores.

We present NodeKernel, a novel distributed storage architecture that offers a convenient new point in the design space by fusing file system and key-value semantics in a common storage kernel while leveraging modern networking and storage hardware to achieve high performance and cost-efficiency. NodeKernel provides hierarchical naming, high scalability, and close to bare-metal performance for a wide range of data sizes and access patterns that are characteristic of temporary data. We show that storing temporary data in Crail, our concrete implementation of the NodeKernel architecture which uses RDMA networking with tiered DRAM/NVMe-Flash storage, improves NoSQL workload performance by up to 4.8× and Spark application performance by up to 3.4×. Furthermore, by storing data across NVMe Flash and DRAM storage tiers, Crail reduces storage cost by up to 8× compared to DRAM-only storage systems.

Speakers
PS

Patrick Stuedi

IBM Research
AT

Animesh Trivedi

Vrije Universiteit
JP

Jonas Pfefferle

IBM Research
AK

Ana Klimovic

Stanford University
AS

Adrian Schuepbach

IBM Research
BM

Bernard Metzler

IBM Research


Thursday July 11, 2019 5:35pm - 5:55pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX
 
Friday, July 12
 

9:15am PDT

R2P2: Making RPCs first-class datacenter citizens
Remote Procedure Calls are widely used to connect datacenter applications with strict tail-latency service level objectives on the scale of microseconds (μs). Existing solutions utilize streaming or datagram-based transport protocols for RPCs that impose overheads and limit design flexibility. Our work exposes the RPC abstraction to the endpoints and the network, making RPCs first-class datacenter citizens and allowing for in-network RPC scheduling. We propose R2P2, a UDP-based transport protocol specifically designed for RPCs inside a datacenter. R2P2 exposes pairs of requests and responses and allows efficient and scalable RPC routing by separating RPC target selection from request and reply streaming. Leveraging R2P2, we implement a novel join-bounded-shortest-queue (JBSQ) RPC load balancing policy, which lowers tail latency by centralizing pending RPCs in the router and ensures that requests are only routed to servers with a bounded number of outstanding requests. The R2P2 router logic can be implemented either in a software middlebox or within a P4 switch ASIC pipeline. Our evaluation, using a range of microbenchmarks, shows that the protocol is suitable for μs-scale RPCs and that its tail latency outperforms both random selection and classic HTTP reverse proxies. The P4-based implementation of R2P2 on a Tofino ASIC adds less than 1μs of latency, whereas the software middlebox implementation adds 5μs of latency and requires only two CPU cores to route RPCs at 10 Gbps line rate. R2P2 improves the tail latency of web index searching on a cluster of 16 workers operating at 50% of capacity by 5.7× over NGINX. R2P2 improves the throughput of the Redis key-value store on a 4-node cluster with master/slave replication for a tail-latency service-level objective of 200μs by more than 4.8× vs. vanilla Redis.
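
The JBSQ policy itself is simple to state; the sketch below models it in Python (illustrative only; the real router logic runs in a software middlebox or a P4 pipeline): each server may have at most a bounded number of outstanding RPCs, and everything else queues centrally until a slot frees up:

```python
# Sketch of the JBSQ(n) policy that R2P2's router implements (illustrative Python,
# not the P4/middlebox code): each server may have at most n outstanding requests;
# everything else waits centrally and goes to the least-loaded eligible server.

from collections import deque

class JBSQRouter:
    def __init__(self, servers, bound=2):
        self.outstanding = {s: 0 for s in servers}
        self.bound = bound
        self.backlog = deque()                     # centrally queued RPCs

    def submit(self, rpc):
        self.backlog.append(rpc)
        return self._dispatch()

    def complete(self, server):
        self.outstanding[server] -= 1              # reply seen; slot freed
        return self._dispatch()

    def _dispatch(self):
        sent = []
        while self.backlog:
            server = min(self.outstanding, key=self.outstanding.get)
            if self.outstanding[server] >= self.bound:
                break                              # every server is at its bound
            self.outstanding[server] += 1
            sent.append((self.backlog.popleft(), server))
        return sent

router = JBSQRouter(["s0", "s1"], bound=1)
print(router.submit("rpc-a"))   # [('rpc-a', 's0')]
print(router.submit("rpc-b"))   # [('rpc-b', 's1')]
print(router.submit("rpc-c"))   # [] -- held centrally until a server frees a slot
print(router.complete("s1"))    # [('rpc-c', 's1')]
```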


Friday July 12, 2019 9:15am - 9:35am PDT
USENIX ATC Track II: Grand Ballroom VII–IX

9:35am PDT

Your Coflow has Many Flows: Sampling them for Fun and Speed
Coflow scheduling improves data-intensive application performance by improving their networking performance. State-of-the-art online coflow schedulers in essence approximate the classic Shortest-Job-First (SJF) scheduling by learning the coflow size online. In particular, they use multiple priority queues to simultaneously accomplish two goals: to sieve long coflows from short coflows, and to schedule short coflows with high priorities. Such a mechanism pays high overhead in learning the coflow size: moving a large coflow across the queues delays small and other large coflows, and moving similar-sized coflows across the queues results in inadvertent round-robin scheduling.

We propose Philae, a new online coflow scheduler that exploits the spatial dimension of coflows, i.e., that a coflow has many flows, to drastically reduce the overhead of coflow size learning. Philae pre-schedules sampled flows of each coflow and uses their sizes to estimate the average flow size of the coflow. It then resorts to Shortest Coflow First, where the notion of shortest is determined using the learned coflow sizes and coflow contention. We show that the sampling-based learning is robust to flow size skew and has the added benefit of much improved scalability from reduced coordinator-local agent interactions. Our evaluation, using an Azure testbed and a publicly available production cluster trace from Facebook, shows that compared to the prior art Aalo, Philae reduces the average (P90) coflow completion time (CCT) by 1.50x (8.00x) on a 150-node testbed and 2.72x (9.78x) on a 900-node testbed. Evaluation using additional traces further demonstrates Philae's robustness to flow size skew.
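
The sampling idea can be sketched in a few lines; the sampling fraction and the Shortest Coflow First driver below are illustrative choices, not Philae's exact parameters:

```python
# Sketch of Philae's sampling idea (simplified): pre-schedule a few sampled flows of
# each coflow, use their sizes to estimate the coflow's total size, then run
# Shortest Coflow First on the estimates instead of learning sizes via priority queues.

import random

def estimate_coflow_size(flow_sizes, sample_fraction=0.1, min_samples=2):
    k = max(min_samples, int(len(flow_sizes) * sample_fraction))
    sample = random.sample(flow_sizes, min(k, len(flow_sizes)))
    avg = sum(sample) / len(sample)               # average sampled flow size
    return avg * len(flow_sizes)                  # scale up to the whole coflow

def shortest_coflow_first(coflows):
    """coflows: dict name -> list of flow sizes. Returns the scheduling order."""
    estimates = {name: estimate_coflow_size(flows) for name, flows in coflows.items()}
    return sorted(coflows, key=estimates.get)

random.seed(0)
coflows = {
    "job-small": [5] * 20,                        # 20 flows of 5 MB
    "job-large": [50] * 200,                      # 200 flows of 50 MB
}
print(shortest_coflow_first(coflows))             # ['job-small', 'job-large']
```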

Speakers
AJ

Akshay Jajoo

Purdue University
YC

Y. Charlie Hu

Purdue University
XL

Xiaojun Lin

Purdue University


Friday July 12, 2019 9:35am - 9:55am PDT
USENIX ATC Track II: Grand Ballroom VII–IX

9:55am PDT

PostMan: Rapidly Mitigating Bursty Traffic by Offloading Packet Processing
Unexpected bursty traffic due to certain sudden events, such as news in the spotlight on a social network or discounted items on sale, can cause severe load imbalance in backend services. Migrating hot data, the standard approach to achieving load balance, meets a challenge when handling such unexpected load imbalance, because migrating data will slow down the server that is already under heavy pressure.

This paper proposes PostMan, an alternative approach to rapidly mitigate load imbalance for services processing small requests. Motivated by the observation that processing large packets incurs far less CPU overhead than processing small ones, PostMan deploys a number of middleboxes called helpers to assemble small packets into large ones for the heavily-loaded server. This approach essentially offloads the overhead of packet processing from the heavily-loaded server to others. To minimize the overhead, PostMan activates helpers on demand, only when bursty traffic is detected. To tolerate helper failures, PostMan can migrate connections across helpers and can ensure packet ordering despite such migration. Our evaluation shows that, with the help of PostMan, a Memcached server can mitigate bursty traffic within hundreds of milliseconds, while migrating data takes tens of seconds and increases the latency during migration.

Speakers
PJ

Panpan Jin

National Engineering Research Center for Big Data Technology and System, Key Laboratory of Services Computing Technology and System, Ministry of Education, School of Computer Science and Technology, Huazhong University of Science and Technology, China
JG

Jian Guo

National Engineering Research Center for Big Data Technology and System, Key Laboratory of Services Computing Technology and System, Ministry of Education, School of Computer Science and Technology, Huazhong University of Science and Technology, China
YX

Yikai Xiao

National Engineering Research Center for Big Data Technology and System, Key Laboratory of Services Computing Technology and System, Ministry of Education, School of Computer Science and Technology, Huazhong University of Science and Technology, China
RS

Rong Shi

The Ohio State University, USA
YN

Yipei Niu

National Engineering Research Center for Big Data Technology and System, Key Laboratory of Services Computing Technology and System, Ministry of Education, School of Computer Science and Technology, Huazhong University of Science and Technology, China
FL

Fangming Liu

National Engineering Research Center for Big Data Technology and System, Key Laboratory of Services Computing Technology and System, Ministry of Education, School of Computer Science and Technology, Huazhong University of Science and Technology, China
CQ

Chen Qian

University of California Santa Cruz, USA
YW

Yang Wang

The Ohio State University


Friday July 12, 2019 9:55am - 10:15am PDT
USENIX ATC Track II: Grand Ballroom VII–IX

10:15am PDT

Lancet: A self-correcting Latency Measuring Tool
We present LANCET, a self-correcting tool designed to measure the open-loop tail latency of μs-scale datacenter applications with high fan-in connection patterns. LANCET is self-correcting as it relies on online statistical tests to determine situations in which tail latency cannot be accurately measured from a statistical perspective, including situations where the workload configuration, the client infrastructure, or the application itself does not allow it. Because of its design, LANCET is also extremely easy to use. In fact, the user is only responsible for (i) configuring the workload parameters, i.e., the mix of requests and the size of the client connection pool, and (ii) setting the desired confidence interval for a particular tail latency percentile. All other parameters, including the length of the warmup phase, the measurement length, and the necessary sampling rate, are dynamically determined by the LANCET experiment coordinator. When available, LANCET leverages NIC-based hardware timestamping to measure RPC end-to-end latency. Otherwise, it uses an asymmetric setup with a latency-agent that leverages busy-polling system calls to reduce the client bias. Our evaluation shows that LANCET automatically identifies situations in which tail latency cannot be determined and accurately reports the latency distribution of workloads with single-digit μs service time. For the workloads studied, LANCET can successfully report, with 95% confidence, the 99th percentile tail latency within an interval of ≤ 10μs. In comparison with state-of-the-art tools such as Mutilate and Treadmill, LANCET reports a latency cumulative distribution that is ∼20μs lower when the NIC timestamping capability is available and ∼10μs lower when it is not.

Friday July 12, 2019 10:15am - 10:35am PDT
USENIX ATC Track II: Grand Ballroom VII–IX

11:05am PDT

EdgeWise: A Better Stream Processing Engine for the Edge
Many Internet of Things (IoT) applications would benefit if streams of data could be analyzed rapidly at the Edge, near the data source. However, existing Stream Processing Engines (SPEs) are unsuited for the Edge because their designs assume Cloud-class resources and relatively generous throughput and latency constraints.

This paper presents EdgeWise, a new Edge-friendly SPE, and shows analytically and empirically that EdgeWise improves both throughput and latency. The key idea of EdgeWise is to incorporate a congestion-aware scheduler and a fixed-size worker pool into an SPE. Though this idea has been explored in the past, we are the first to apply it to modern SPEs, and we provide a new queueing-theoretic analysis to support it. In our single-node and distributed experiments we compare EdgeWise to the state-of-the-art Storm system. We report up to a 3x improvement in throughput while keeping latency low.
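A minimal sketch of the congestion-aware, fixed-size worker pool idea follows; it is an assumption-heavy toy, not EdgeWise's implementation. Each worker in a small pool repeatedly services the operator with the longest pending input queue, rather than dedicating one thread per operator.

```python
import queue
import threading

class CongestionAwareEngine:
    """Toy stream engine (illustrative only): a fixed-size worker pool in
    which each worker runs the operator with the most pending input,
    instead of one thread per operator as in classic SPEs."""

    def __init__(self, operators, num_workers=4):
        # operators: dict of name -> callable(item) implementing that operator
        self.operators = operators
        self.queues = {name: queue.Queue() for name in operators}
        self.num_workers = num_workers
        self.stop = threading.Event()

    def push(self, op_name, item):
        self.queues[op_name].put(item)

    def _worker(self):
        while not self.stop.is_set():
            # Congestion-aware choice: service the longest queue first.
            name = max(self.queues, key=lambda n: self.queues[n].qsize())
            try:
                item = self.queues[name].get(timeout=0.1)
            except queue.Empty:
                continue
            self.operators[name](item)

    def run(self):
        threads = [threading.Thread(target=self._worker, daemon=True)
                   for _ in range(self.num_workers)]
        for t in threads:
            t.start()
        return threads

if __name__ == "__main__":
    import time
    out = []
    engine = CongestionAwareEngine({
        "parse": lambda x: engine.push("filter", x * 2),
        "filter": lambda x: out.append(x),
    }, num_workers=2)
    engine.run()
    for i in range(100):
        engine.push("parse", i)
    time.sleep(0.5)
    engine.stop.set()
    print(len(out), "items processed")
```

The congestion-aware choice keeps backlogged operators from starving, while the fixed pool bounds scheduling overhead on constrained Edge hardware.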

Speakers
XF

Xinwei Fu

Virginia Tech
TG

Talha Ghaffar

Virginia Tech
JC

James C. Davis

Virginia Tech
DL

Dongyoon Lee

Virginia Tech


Friday July 12, 2019 11:05am - 11:25am PDT
USENIX ATC Track II: Grand Ballroom VII–IX

11:25am PDT

Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads
With widespread advances in machine learning, a number of large enterprises are beginning to incorporate machine learning models across a number of products. These models are typically trained on shared, multi-tenant GPU clusters. Similar to existing cluster computing workloads, scheduling frameworks aim to provide features like high efficiency, resource isolation, and fair sharing across users. However, Deep Neural Network (DNN)-based workloads, predominantly trained on GPUs, differ in two significant ways from traditional big data analytics workloads. First, from a cluster utilization perspective, GPUs represent a monolithic resource that cannot be shared at a fine granularity across users. Second, from a workload perspective, deep learning frameworks require gang scheduling, reducing the flexibility of scheduling and making the jobs themselves inelastic to failures at runtime. In this paper we present a detailed workload characterization of a two-month-long trace from a multi-tenant GPU cluster at Microsoft. By correlating scheduler logs with logs from individual jobs, we study three distinct issues that affect cluster utilization for DNN training workloads on multi-tenant clusters: (1) the effect of gang scheduling and locality constraints on queuing, (2) the effect of locality on GPU utilization, and (3) failures during training. Based on our experience running a large-scale operation, we provide design guidelines pertaining to next-generation cluster schedulers for DNN training workloads.
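The queuing effect of gang scheduling under locality constraints, one of the issues studied above, can be pictured with a toy placement function (purely illustrative; it is not the production scheduler): a job demanding k GPUs starts only when k GPUs are free at once, on a single node under strict locality.

```python
def try_place(free_gpus_per_node, k, strict_locality=True):
    """Toy gang-scheduling check: return True and reserve GPUs if a gang
    of k GPUs can start now, else the job keeps queueing. Illustrative
    only; real schedulers consider many more constraints."""
    if strict_locality:
        for i, free in enumerate(free_gpus_per_node):
            if free >= k:                    # all k GPUs on a single node
                free_gpus_per_node[i] -= k
                return True
        return False
    if sum(free_gpus_per_node) < k:
        return False
    need = k
    for i, free in enumerate(free_gpus_per_node):
        take = min(free, need)               # spread the gang across nodes
        free_gpus_per_node[i] -= take
        need -= take
        if need == 0:
            return True
    return False

if __name__ == "__main__":
    for strict in (True, False):
        nodes = [4, 4, 4]                    # 3 nodes, 4 free GPUs each
        started = [k for k in (8, 4, 2, 2) if try_place(nodes, k, strict)]
        print("strict locality" if strict else "relaxed locality", "->", started)
```

In the example, relaxing locality lets the 8-GPU job start immediately but fragments it across nodes, which mirrors the queuing-versus-locality tension the trace analysis examines.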

Speakers
MJ

Myeongjae Jeon

UNIST and Microsoft Research
SV

Shivaram Venkataraman

University of Wisconsin and Microsoft Research
AP

Amar Phanishayee

Microsoft Research
JQ

Junjie Qian

Microsoft Research
WX

Wencong Xiao

Beihang University and Microsoft Research
FY

Fan Yang

Microsoft Research


Friday July 12, 2019 11:25am - 11:45am PDT
USENIX ATC Track II: Grand Ballroom VII–IX

11:50am PDT

Optimizing CNN Model Inference on CPUs
The popularity of Convolutional Neural Network (CNN) models and the ubiquity of CPUs imply that better performance of CNN model inference on CPUs can deliver significant gains to a large number of users. To improve the performance of CNN inference on CPUs, current approaches like MXNet and Intel OpenVINO usually treat the model as a graph and use high-performance libraries such as Intel MKL-DNN to implement the operations of the graph. While achieving reasonable performance on individual operations with the off-the-shelf libraries, this solution makes it inflexible to conduct optimizations at the graph level, as the local operation-level optimizations are predefined. Therefore, it is restrictive and misses the opportunity to optimize the end-to-end inference pipeline as a whole. This paper presents NeoCPU, a comprehensive approach to CNN model inference on CPUs that employs a full-stack and systematic scheme of optimizations. NeoCPU optimizes the operations as templates without relying on third-party libraries, which enables further performance improvement via operation- and graph-level joint optimization. Experiments show that NeoCPU achieves up to 3.45× lower latency for CNN model inference than the current state-of-the-art implementations on various kinds of popular CPUs.
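As one hedged illustration of graph-level optimization of the kind described above (this is not NeoCPU's code; the operator-graph representation is made up), the sketch below fuses a convolution with its following activation so the fused template can avoid an extra pass over the feature map.

```python
def fuse_conv_activation(graph):
    """Toy graph-level pass (illustrative only): fold an elementwise
    activation into the preceding convolution so a fused template can keep
    intermediate data in registers/cache instead of re-reading it."""
    # graph: list of ops, each {"name", "op", "input": producer name or None}
    producers = {node["name"]: node for node in graph}
    fused, consumed = [], set()
    for node in graph:
        if (node["op"] == "relu"
                and node["input"] in producers
                and producers[node["input"]]["op"] == "conv2d"):
            conv = producers[node["input"]]
            # Replace the relu node with a fused conv2d_relu reading conv's input.
            fused.append({"name": node["name"], "op": "conv2d_relu",
                          "input": conv["input"]})
            consumed.add(conv["name"])
        else:
            fused.append(node)
    return [n for n in fused if n["name"] not in consumed]

if __name__ == "__main__":
    g = [{"name": "c1", "op": "conv2d", "input": "data"},
         {"name": "r1", "op": "relu", "input": "c1"},
         {"name": "fc", "op": "dense", "input": "r1"}]
    print(fuse_conv_activation(g))
    # -> a single conv2d_relu("data") feeding dense, one fewer pass over the tensor
```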

Speakers
YW

Yao Wang

Amazon
ML

Mu Li

Amazon


Friday July 12, 2019 11:50am - 12:10pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

12:10pm PDT

Accelerating Rule-matching Systems with Learned Rankers
Infusing machine learning (ML) and deep learning (DL) into modern systems has driven a paradigm shift towards learning-augmented system design. This paper proposes the learned ranker as a system building block, and demonstrates its potential by using rule-matching systems as a concrete scenario. Specifically, checking rules can be time-consuming, especially for complex regular expression (regex) conditions. The learned ranker prioritizes rules based on their likelihood of matching a given input. If the matching rule is successfully prioritized as a top candidate, the system effectively achieves early termination. We integrated the learned rule ranker as a component of popular regex matching engines: PCRE, PCRE-JIT, and RE2. Empirical results show that the rule ranker achieves a top-5 classification accuracy of at least 96.16%, and reduces the rule-matching system latency by up to 78.81% on an 8-core CPU.
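A toy version of the learned-ranker pattern is sketched below; the rules, features, and similarity model are all invented for illustration and are far simpler than the paper's. The point is the control flow: learn a per-rule profile from inputs known to match it, rank the rules by that profile at query time, and stop at the first rule that actually matches.

```python
import math
import re
from collections import Counter

RULES = {  # hypothetical rules; real systems carry thousands of regexes
    "sql_injection": re.compile(r"union\s+select|or\s+1=1"),
    "path_traversal": re.compile(r"\.\./|%2e%2e"),
    "email": re.compile(r"[\w.]+@[\w.]+\.\w+"),
}

def bigrams(text):
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class LearnedRanker:
    """Tiny stand-in for a learned ranker: per-rule centroids over character
    bigrams of inputs known to match, used to order rule evaluation so
    matching can terminate early. Illustrative only."""
    def __init__(self):
        self.centroids = {}

    def fit(self, labelled):                  # [(input_text, rule_name), ...]
        for text, rule in labelled:
            self.centroids.setdefault(rule, Counter()).update(bigrams(text))

    def rank(self, text):
        v = bigrams(text)
        return sorted(RULES,
                      key=lambda r: cosine(v, self.centroids.get(r, Counter())),
                      reverse=True)

def match_with_ranker(ranker, text):
    for name in ranker.rank(text):            # try the most likely rule first
        if RULES[name].search(text):
            return name                        # early termination on first hit
    return None

if __name__ == "__main__":
    ranker = LearnedRanker()
    ranker.fit([("select 1 union select password", "sql_injection"),
                ("GET /../../etc/passwd", "path_traversal"),
                ("contact me at a@b.com", "email")])
    print(match_with_ranker(ranker, "reach me at dev@example.org"))  # -> email
```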

Speakers
ZL

Zhao Lucis Li

University of Science and Technology of China
CM

Chieh-Jan Mike Liang

Microsoft Research
WB

Wei Bai

Microsoft Research
QZ

Qiming Zheng

Shanghai Jiao Tong University
YX

Yongqiang Xiong

Microsoft Research
GS

Guangzhong Sun

University of Science and Technology of China


Friday July 12, 2019 12:10pm - 12:30pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

12:30pm PDT

MArk: Exploiting Cloud Services for Cost-Effective, SLO-Aware Machine Learning Inference Serving
The advances of Machine Learning (ML) have sparked a growing demand for ML-as-a-Service: developers train ML models and publish them in the cloud as online services to provide low-latency inference at scale. The key challenge of ML model serving is to meet the response-time Service-Level Objectives (SLOs) of inference workloads while minimizing the serving cost. In this paper, we tackle the dual challenge of SLO compliance and cost effectiveness with MArk (Model Ark), a general-purpose inference serving system built on Amazon Web Services (AWS). MArk employs three design choices tailor-made for inference workloads. First, MArk dynamically batches requests and opportunistically serves them using expensive hardware accelerators (e.g., GPUs) for an improved performance-cost ratio. Second, instead of relying on feedback-control scaling or over-provisioning to serve dynamic workloads, which can be too slow or too expensive for inference serving, MArk employs predictive autoscaling to hide the provisioning latency at low cost. Third, given the stateless nature of inference serving, MArk exploits flexible yet costly serverless instances to cover the occasional load spikes that are hard to predict. We evaluated the performance of MArk using several state-of-the-art ML models trained in popular frameworks including TensorFlow, MXNet, and Keras. Compared with the premier industrial ML serving platform SageMaker, MArk reduces the serving cost by up to 7.8× while achieving even better latency performance.
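The predictive-autoscaling-plus-serverless-spillover split can be sketched as follows (a toy under assumed capacities and a naive moving-average forecast, not MArk's algorithm): forecast the next interval's load, pre-provision VM instances for the forecast, and route whatever exceeds provisioned capacity to serverless.

```python
import math

def plan_capacity(history_rps, vm_capacity_rps=500.0, headroom=1.1):
    """Forecast the next interval's request rate with a simple moving
    average plus trend, and pre-provision enough VMs for it. The capacity
    and headroom numbers are made up for illustration."""
    window = history_rps[-5:]
    avg = sum(window) / len(window)
    trend = (window[-1] - window[0]) / max(len(window) - 1, 1)
    forecast = max(avg + trend, 0.0)
    vms_needed = math.ceil(forecast * headroom / vm_capacity_rps)
    return forecast, vms_needed

def route(current_rps, vms, vm_capacity_rps=500.0):
    """Requests beyond the provisioned VM capacity spill over to serverless,
    which absorbs unpredicted spikes at a higher per-request cost."""
    vm_rps = min(current_rps, vms * vm_capacity_rps)
    serverless_rps = current_rps - vm_rps
    return vm_rps, serverless_rps

if __name__ == "__main__":
    history = [800, 900, 1100, 1300, 1500]        # requests/second, toy trace
    forecast, vms = plan_capacity(history)
    print(f"forecast={forecast:.0f} rps -> provision {vms} VMs ahead of time")
    print("spike routing (VM rps, serverless rps):", route(2600, vms))
```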

Speakers
CZ

Chengliang Zhang

Hong Kong University of Science and Technology
MY

Minchen Yu

Hong Kong University of Science and Technology
WW

Wei Wang

Hong Kong University of Science and Technology
FY

Feng Yan

University of Nevada, Reno


Friday July 12, 2019 12:30pm - 12:50pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

12:50pm PDT

Cross-dataset Time Series Anomaly Detection for Cloud Systems
In recent years, software applications have increasingly been deployed as online services on cloud computing platforms. It is important to detect anomalies in cloud systems in order to maintain high service availability. However, given the velocity, volume, and diversified nature of cloud monitoring data, it is difficult to obtain sufficient labelled data to build an accurate anomaly detection model. In this paper, we propose cross-dataset anomaly detection: detecting anomalies in a new, unlabelled dataset (the target) by training an anomaly detection model on existing labelled datasets (the source). Our approach, called ATAD (Active Transfer Anomaly Detection), integrates both transfer learning and active learning techniques. Transfer learning is applied to transfer knowledge from the source dataset to the target dataset, and active learning is applied to obtain labels for a small number of informative samples from the unlabelled target dataset. Through experiments, we show that ATAD is effective in cross-dataset time series anomaly detection. Furthermore, labelling only about 1%-5% of the unlabelled data is sufficient to achieve a significant performance improvement.
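The active-learning half of the approach can be illustrated with uncertainty sampling (a sketch under assumed scores and thresholds, not ATAD's exact selection criterion): ask for labels only on the few target samples whose anomaly scores sit closest to the decision boundary, then use those labels to recalibrate the detector transferred from the source dataset.

```python
def select_for_labelling(scores, threshold=0.5, budget=0.05):
    """Uncertainty sampling (illustrative only): pick the ~1-5% of target
    samples whose anomaly scores are closest to the decision threshold
    and send them to a human for labelling."""
    k = max(1, int(len(scores) * budget))
    by_uncertainty = sorted(range(len(scores)),
                            key=lambda i: abs(scores[i] - threshold))
    return by_uncertainty[:k]

def recalibrate_threshold(scores, labels_for, threshold=0.5):
    """Nudge the decision threshold using the newly labelled samples so a
    detector trained on the source dataset better fits the target dataset."""
    anom = [scores[i] for i, y in labels_for.items() if y == 1]
    norm = [scores[i] for i, y in labels_for.items() if y == 0]
    if anom and norm:
        return (min(anom) + max(norm)) / 2.0
    return threshold

if __name__ == "__main__":
    import random
    random.seed(1)
    # Scores from a model trained on a labelled *source* dataset, applied to
    # an unlabelled *target* dataset (higher score = more anomalous).
    scores = [random.random() for _ in range(1000)]
    ask = select_for_labelling(scores)                       # ~5% of samples
    oracle = {i: int(scores[i] > 0.51) for i in ask}         # stand-in for a human labeller
    print(len(ask), "samples sent for labelling")
    print("recalibrated threshold:", round(recalibrate_threshold(scores, oracle), 3))
```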

Speakers
XZ

Xu Zhang

Microsoft Research, Nanjing University
QL

Qingwei Lin

Microsoft Research
YX

Yong Xu

Microsoft Research
SQ

Si Qin

Microsoft Research
HZ

Hongyu Zhang

The University of Newcastle
BQ

Bo Qiao

Microsoft Research
YD

Yingnong Dang

Microsoft
XY

Xinsheng Yang

Microsoft
QC

Qian Cheng

Microsoft
YW

Youjiang Wu

Microsoft
KH

Ken Hsieh

Microsoft
KS

Kaixin Sui

Microsoft Research
XM

Xin Meng

Microsoft Research
YX

Yaohai Xu

Microsoft Research
WZ

Wenchi Zhang

Microsoft Research
FS

Furao Shen

Nanjing University
DZ

Dongmei Zhang

Microsoft Research


Friday July 12, 2019 12:50pm - 1:10pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX