Monday, July 8

9:00am PDT

Opening Remarks
Speakers
Christina Delimitrou (Cornell University)
Dan R. K. Ports (Microsoft Research)


Monday July 8, 2019 9:00am - 9:15am PDT
HotCloud: Grand Ballroom VII–IX

9:00am PDT

Transaction Support using Compound Commands in Key-Value SSDs
The recently proposed key-value SSD (KVSSD) provides the popular and versatile key-value interface at the device level, promising high performance and simplified storage management with minimal involvement of the host software. However, its I/O command set over NVMe is defined on a per key-value pair basis, forcing the host to post key-value operations to the KVSSD independently. This not only incurs high interfacing overhead for small key-value operations but also makes it difficult to support transactions in KVSSDs without software support.

In this paper, we propose compound commands for KVSSDs. A compound command allows the host to specify multiple key-value pairs in a single NVMe operation, thereby effectively amortizing the I/O interfacing overhead. In addition, it provides an effective way to define a transaction comprising multiple key-value pairs. Our evaluation using a prototype KVSSD and an in-house KVSSD emulator shows the promising benefits of compound commands, improving performance by up to 55%.
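
The batching idea can be sketched in a few lines of Python. This is a toy model, not the paper's actual NVMe command format: `CompoundCommand`, `FakeKVSSD`, and the submission counter are illustrative names, and the "device" is an in-memory dict.

```python
from dataclasses import dataclass, field

@dataclass
class CompoundCommand:
    opcode: str                                 # e.g. "PUT", applied to every pair
    pairs: list = field(default_factory=list)   # [(key, value), ...]
    atomic: bool = False                        # True: treat the batch as a transaction

class FakeKVSSD:
    def __init__(self):
        self.store = {}
        self.submissions = 0    # commands crossing the NVMe interface

    def submit_single(self, key, value):
        self.submissions += 1   # one interface crossing per pair
        self.store[key] = value

    def submit_compound(self, cmd):
        self.submissions += 1   # one interface crossing for the whole batch
        if cmd.atomic:
            self.store.update(dict(cmd.pairs))  # apply all-or-nothing
        else:
            for k, v in cmd.pairs:
                self.store[k] = v

dev = FakeKVSSD()
for i in range(8):
    dev.submit_single(f"k{i}", i)               # 8 separate submissions
dev.submit_compound(
    CompoundCommand("PUT", [(f"c{i}", i) for i in range(8)], atomic=True))
print(dev.submissions)  # 9: eight singles plus one compound carrying eight pairs
```

The per-command overhead is paid once per `submit_compound` call regardless of how many pairs it carries, which is the amortization the abstract describes.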

Speakers
Sang-Hoon Kim (Ajou University)
Jinhong Kim (Sungkyunkwan University)
Kisik Jeong (Sungkyunkwan University)
Jin-Soo Kim (Seoul National University)


Monday July 8, 2019 9:00am - 9:30am PDT
HotStorage: Grand Ballroom I–III

9:15am PDT

The Seven Sins of Personal-Data Processing Systems under GDPR
In recent years, our society has been plagued by unprecedented levels of privacy and security breaches. To rein in this trend, the European Union in 2018 introduced comprehensive legislation called the General Data Protection Regulation (GDPR). In this paper, we review GDPR from a system design perspective and identify how its regulations conflict with the design, architecture, and operation of modern systems. We illustrate these conflicts via the seven GDPR sins: storing data forever; reusing data indiscriminately; walled gardens and black markets; risk-agnostic data processing; hiding data breaches; making unexplainable decisions; and treating security as a secondary goal. Our findings reveal a deep-rooted tussle between GDPR requirements and how modern systems have evolved. We believe that achieving compliance requires comprehensive, ground-up solutions; anything short of that would amount to fixing a leaky faucet in a sinking ship.

Speakers
Supreeth Shastri (University of Texas at Austin)
Melissa Wasserman (University of Texas at Austin)
Vijay Chidambaram (University of Texas at Austin)


Monday July 8, 2019 9:15am - 9:30am PDT
HotCloud: Grand Ballroom VII–IX

9:30am PDT

NetWarden: Mitigating Network Covert Channels without Performance Loss
Network covert channels are an advanced threat to the security and privacy of cloud systems. One common limitation of existing defenses is that they all come at the cost of performance, which presents significant barriers to their practical deployment in high-speed networks. We sketch the design of NetWarden, a novel defense whose key design goal is to preserve TCP performance while mitigating covert channels. The use of programmable data planes makes it possible for NetWarden to adapt defenses that were previously demonstrated only as proofs of concept, and apply them at line speed. Moreover, NetWarden uses a set of performance-boosting techniques to temporarily increase the performance of connections affected by channel mitigation, with the ultimate goal of neutralizing the performance impact of mitigation. Our simulation provides initial evidence that NetWarden can mitigate several covert channels with little performance disturbance. As ongoing work, we are developing a full system design and implementation of NetWarden.

Speakers
Jiarong Xing (Rice University)
Adam Morrison (Rice University)
Ang Chen (Rice University)


Monday July 8, 2019 9:30am - 9:45am PDT
HotCloud: Grand Ballroom VII–IX

9:30am PDT

ZoneAlloy: Elastic Data and Space Management for Hybrid SMR Drives
The emergence of Hybrid Shingled Magnetic Recording (H-SMR) allows dynamic conversion of the recording format between Conventional Magnetic Recording (CMR) and SMR on a single disk drive. H-SMR is promising for its ability to manage the performance/capacity trade-off on the disk platters and to adaptively support different application scenarios in large-scale storage systems. However, there is little research on how to efficiently manage data and space in such H-SMR drives.

In this paper, we present ZoneAlloy, an elastic data and space management scheme for H-SMR drives, to explore the benefit of using such drives. ZoneAlloy initially allocates CMR space for the application and then gradually converts the disk format from CMR to SMR to create more space for the application. ZoneAlloy controls the overhead of the format conversion on the application I/O with our quantized migration mechanism. When data is stored in an SMR area, ZoneAlloy reduces the SMR update overhead using H-Buffer and Zone-Swap. H-Buffer is a small host-controlled CMR space that absorbs the SMR updates and migrates those updates back to the SMR space in batches to bring down the SMR update cost. Zone-Swap dynamically swaps "hot" data from the SMR space to the CMR space to further alleviate the SMR update problem. Evaluation results based on MSR-Cambridge traces demonstrate that ZoneAlloy can reduce the average I/O latency and limit the performance degradation of the application I/O during format conversion.
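
The H-Buffer behavior described above can be sketched as a toy in Python. The staging area, the batch trigger, and `SMRZone` are illustrative stand-ins for the host-controlled CMR space and an SMR band; real destaging costs come from read-modify-write of shingled tracks, which is abstracted to a counter here.

```python
class SMRZone:
    def __init__(self):
        self.blocks = {}

    def update(self, batch):
        # Stand-in for a costly read-modify-write pass over the shingled band.
        self.blocks.update(batch)

class HBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.staged = {}        # lba -> data, absorbed in the small CMR area
        self.smr_writes = 0     # number of batched SMR passes triggered

    def update(self, lba, data, smr):
        self.staged[lba] = data
        if len(self.staged) >= self.capacity:   # destage one full batch
            smr.update(self.staged)
            self.smr_writes += 1
            self.staged.clear()

zone, buf = SMRZone(), HBuffer(capacity=4)
for lba in range(8):
    buf.update(lba, f"d{lba}", zone)
print(buf.smr_writes)  # 2 batched SMR passes instead of 8 individual updates
```

Each SMR pass amortizes over `capacity` updates, which is the cost reduction the abstract attributes to H-Buffer.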

Speakers
Fenggang Wu (University of Minnesota, Twin Cities)
Bingzhe Li (University of Minnesota, Twin Cities)
Zhichao Cao (University of Minnesota, Twin Cities)
Baoquan Zhang (University of Minnesota, Twin Cities)
Ming-Hong Yang (University of Minnesota, Twin Cities)
Hao Wen (University of Minnesota, Twin Cities)
David H.C. Du (University of Minnesota, Twin Cities)


Monday July 8, 2019 9:30am - 10:00am PDT
HotStorage: Grand Ballroom I–III

9:45am PDT

A Double-Edged Sword: Security Threats and Opportunities in One-Sided Network Communication
One-sided network communication technologies such as RDMA and NVMe-over-Fabrics are quickly gaining adoption in production software and in datacenters. Although appealing for their low CPU utilization and good performance, they raise new security concerns that could seriously undermine datacenter software systems building on top of them. At the same time, they offer unique opportunities to help enhance security. Indeed, one-sided network communication is a double-edged sword in security. This paper presents our insights into security implications and opportunities of one-sided communication.

Speakers
Shin-Yeh Tsai (Purdue University)
Yiying Zhang (Purdue University)


Monday July 8, 2019 9:45am - 10:00am PDT
HotCloud: Grand Ballroom VII–IX

10:00am PDT

DynaShield: Reducing the Cost of DDoS Defense using Cloud Services
Fueled by IoT botnets and DDoS-as-a-Service tools, distributed denial of service (DDoS) attacks have reached record high volumes. Although DDoS protection services exist, they can be costly for small organizations as well as individual users. In this paper, we present a low-cost DDoS solution, DynaShield, which a user can deploy at common cloud service providers. DynaShield employs three techniques to reduce cost. First, it uses an on-demand model: a server dynamically updates its DNS record to redirect clients’ traffic to DynaShield when it is under attack, avoiding paying for cloud services during peacetime. Second, DynaShield combines serverless functions and elastic servers provided by cloud providers to auto-scale to large attacks without overprovisioning. Third, DynaShield uses cryptocurrency puzzles as proof of work, so coin-mining profit can further offset a protected server’s cloud service charges. Our preliminary evaluation suggests that DynaShield can cost as little as a few dollars per month to protect an organization from common DDoS attacks.
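
The on-demand model can be sketched as a small control loop. This is a hedged toy: the DNS "record" is a plain object standing in for a real provider's record-update API, the IP addresses are documentation examples, and the threshold-based attack detector is purely illustrative.

```python
class DnsRecord:
    """Toy stand-in for a provider's DNS record-update API."""
    def __init__(self, origin_ip, shield_ip):
        self.origin_ip = origin_ip
        self.shield_ip = shield_ip
        self.current = origin_ip     # where clients are sent right now

    def redirect_to_shield(self):
        self.current = self.shield_ip

    def restore(self):
        self.current = self.origin_ip

def control_loop(record, rps_samples, attack_threshold=10_000):
    """Redirect to the shield only while request rate exceeds the threshold,
    so nothing is paid for during peacetime."""
    history = []
    for rps in rps_samples:
        if rps > attack_threshold:
            record.redirect_to_shield()
        else:
            record.restore()
        history.append(record.current)
    return history

rec = DnsRecord("203.0.113.10", "198.51.100.7")
print(control_loop(rec, [500, 50_000, 80_000, 700]))
# origin during calm, shield during the attack, origin again afterwards
```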

Speakers
Shengbao Zheng (Duke University)
Xiaowei Yang (Duke University)


Monday July 8, 2019 10:00am - 10:15am PDT
HotCloud: Grand Ballroom VII–IX

10:00am PDT

Towards an Unwritten Contract of Intel Optane SSD
New non-volatile memory technologies offer unprecedented performance levels for persistent storage. However, to exploit their full potential, a deeper performance characterization of such devices is required. In this paper, we analyze an NVM-based block device -- the Intel Optane SSD -- and formalize an "unwritten contract" of the Optane SSD. We show that violating this contract can result in 11x worse read latency and limited throughput (only 20% of peak bandwidth) regardless of parallelism. We show that this contract is rooted in features of 3D XPoint memory and the Intel Optane SSD's controller/interconnect design. Finally, we discuss the implications of the contract.

Speakers
Kan Wu (University of Wisconsin-Madison)
Andrea Arpaci-Dusseau (University of Wisconsin-Madison)
Remzi Arpaci-Dusseau (University of Wisconsin-Madison)
Remzi Arpaci-Dusseau is the Grace Wahba professor of Computer Sciences at UW-Madison. He co-leads a research group with Professor Andrea Arpaci-Dusseau. Together, they have graduated 24 Ph.D. students and won numerous best-paper awards; many of their innovations are used by commercial...


Monday July 8, 2019 10:00am - 10:30am PDT
HotStorage: Grand Ballroom I–III

10:15am PDT

Deep Enforcement: Policy-based Data Transformations for Data in the Cloud
Despite the growing collection and use of private data in the cloud, there remains a fundamental disconnect between unified data governance and storage-level enforcement techniques. On one side, high-level governance policies derived from regulations like the General Data Protection Regulation (GDPR) have emerged, with stricter rules dictating who may process data, and when and how it may be processed. On the other side, storage-level controls, whether role- or attribute-based, continue to focus on access/deny enforcement. In this paper, we propose a way to bridge this gap. We introduce Deep Enforcement, a system that provides unified governance and transformation policies coupled with data transformations embedded into the storage fabric to achieve policy compliance. Data transformations can vary in complexity, from simple redactions to complex differential privacy-based techniques that provide the required amount of anonymization. We show how this architecture can be implemented in two broad classes of cloud data storage systems: object stores and SQL databases. Depending on the complexity of the transformation, we also demonstrate how to implement it either in-line (on data access) or off-line (creating an alternate cached dataset).
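
The in-line transformation path can be sketched as a policy table mapping fields to transformations applied when a record is read. The policy format and transformation names below are illustrative, not Deep Enforcement's actual interface.

```python
import hashlib

# Per-field transformations of varying strength; a real system would also
# offer differential-privacy-based anonymization.
TRANSFORMS = {
    "pass":   lambda v: v,
    "redact": lambda v: "***",
    "hash":   lambda v: hashlib.sha256(str(v).encode()).hexdigest()[:12],
}

def enforce(record, policy):
    """Apply the transformation the policy names for each field;
    unlisted fields pass through unchanged."""
    return {f: TRANSFORMS[policy.get(f, "pass")](v) for f, v in record.items()}

policy = {"name": "redact", "email": "hash"}   # "age" falls through as "pass"
row = {"name": "Ada", "email": "ada@example.com", "age": 36}
print(enforce(row, policy))
```

The same `enforce` step could run in-line on each access or off-line once to materialize a cached, transformed copy of the dataset, mirroring the two deployment modes in the abstract.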

Speakers
Ety Khaitzin (IBM Research)
Maya Anderson (IBM Research)
Hani Jamjoom (IBM Research)
Ronen Kat (IBM Research)
Arjun Natarajan (IBM Research)
Roger Raphael (IBM Cloud)
Roee Shlomo (IBM Research)
Tomer Solomon (IBM Research)


Monday July 8, 2019 10:15am - 10:30am PDT
HotCloud: Grand Ballroom VII–IX

10:30am PDT

Break with Refreshments
Monday July 8, 2019 10:30am - 11:00am PDT
Grand Ballroom Prefunction

11:00am PDT

Jumpgate: In-Network Processing as a Service for Data Analytics
In-network processing, where data is processed by special-purpose devices as it passes over the network, is showing great promise at improving application performance, in particular for data analytics tasks. However, analytics and in-network processing are not yet integrated and widely deployed. This paper presents a vision for providing in-network processing as a service to data analytics frameworks, and outlines benefits, remaining challenges, and our current research directions towards realizing this vision.

Speakers
Craig Mustard (University of British Columbia)
Fabian Ruffy (University of British Columbia)
Anny Gakhokidze (University of British Columbia)
Ivan Beschastnikh (University of British Columbia)
Alexandra Fedorova (University of British Columbia)


Monday July 8, 2019 11:00am - 11:15am PDT
HotCloud: Grand Ballroom VII–IX

11:00am PDT

Analyzing the Impact of GDPR on Storage Systems
The recently introduced General Data Protection Regulation (GDPR) is forcing several companies to make significant changes to their systems to achieve compliance. Motivated by the finding that more than 30% of GDPR articles are related to storage, we investigate the impact of GDPR compliance on storage systems. We illustrate the challenges of retrofitting existing systems into compliance by modifying Redis to be GDPR-compliant. We show that despite needing only a small set of new features, strict real-time compliance (e.g., logging every user request synchronously) lowers Redis’ throughput by 20x. Our work reveals how GDPR allows compliance to be a spectrum, and what its implications are for system designers. We discuss the technical challenges that need to be solved before strict compliance can be achieved efficiently.
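
The cost of strict real-time compliance comes from where the durability barrier sits. Below is a minimal sketch (a toy store, not Redis) in which every write is audit-logged and fsync'd before it is acknowledged; relaxing `strict` moves the system along the compliance spectrum the abstract describes.

```python
import os
import tempfile

class AuditedStore:
    """Toy key-value store whose every mutation is audit-logged.
    strict=True forces the log to durable storage before each write returns."""
    def __init__(self, path, strict=True):
        self.data = {}
        self.log = open(path, "a")
        self.strict = strict

    def set(self, key, value, user):
        self.log.write(f"SET {key} by {user}\n")
        if self.strict:                  # durability barrier on the request path
            self.log.flush()
            os.fsync(self.log.fileno())
        self.data[key] = value

path = os.path.join(tempfile.mkdtemp(), "audit.log")
s = AuditedStore(path, strict=True)
s.set("k1", "v1", user="alice")
s.log.flush()
print(open(path).read().strip())  # SET k1 by alice
```

An `fsync` per request is exactly the kind of synchronous logging the abstract blames for the 20x throughput drop; batching or asynchronous logging trades away strictness for speed.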

Speakers
Aashaka Shah (University of Texas at Austin)
Vinay Banakar (Hewlett Packard Enterprise)
Supreeth Shastri (University of Texas at Austin)
Melissa Wasserman (University of Texas at Austin)
Vijay Chidambaram (University of Texas at Austin)


Monday July 8, 2019 11:00am - 11:30am PDT
HotStorage: Grand Ballroom I–III

11:15am PDT

Just In Time Delivery: Leveraging Operating Systems Knowledge for Better Datacenter Congestion Control
Network links and server CPUs are heavily contended resources in modern datacenters. To keep tail latencies low, datacenter operators drastically overprovision both types of resources today, and there has been significant research into effectively managing network traffic and CPU load. However, this work typically looks at the two resources in isolation.

In this paper, we make the observation that, in the datacenter, the allocation of network and CPU resources should be co-designed for the most efficiency and the best response times. For example, while congestion control protocols can prioritize traffic from certain flows, this provides no benefit if the traffic arrives at an overloaded server that will only queue the request.

This paper explores the potential benefits of such a co-designed resource allocator and considers the recent work in both CPU scheduling and congestion control that is best suited to such a system. We propose Chimera, a new datacenter OS that integrates a receiver-based congestion control protocol with OS insight into application queues, building on the recent Shenango operating system.

Speakers
Adam Belay (MIT CSAIL)
Irene Zhang (Microsoft Research)


Monday July 8, 2019 11:15am - 11:30am PDT
HotCloud: Grand Ballroom VII–IX

11:30am PDT

Designing an Efficient Replicated Log Store with Consensus Protocol
A highly available, high-performance message logging system is a critical building block for use cases that require global ordering, especially deterministic distributed transactions. To achieve availability, we maintain multiple replicas that hold the same payloads in exactly the same order. This introduces challenging issues, such as restoring consistency between replicas after a failure while minimizing performance degradation. Replicated state machine-based consensus protocols are the most suitable candidates to fulfill these requirements, but the double-write problem and differing logging granularities make it hard to keep the system efficient. This paper suggests a novel way to build a replicated log store on top of the Raft consensus protocol, aiming to provide the same level of consistency and fault-tolerance without sacrificing throughput.

Speakers
Jung-Sang Ahn (eBay Inc.)
Woon-Hak Kang (eBay Inc.)
Kun Ren (eBay Inc.)
Guogen Zhang (eBay Inc.)


Monday July 8, 2019 11:30am - 11:45am PDT
HotCloud: Grand Ballroom VII–IX

11:30am PDT

Graphs Are Not Enough: Using Interactive Visual Analytics in Storage Research
Storage researchers have always been interested in understanding the complex behavior of storage systems with the help of statistics, machine learning, and simple visualization techniques. However, when a system's behavior is affected by hundreds or even thousands of factors, existing approaches break down. Results are often difficult to interpret, and it can be challenging for humans to apply domain knowledge to a complex system. We propose to enhance storage system analysis by applying "interactive visual analytics," which can address the aforementioned limitations. We have devised a suitable Interactive Configuration Explorer (ICE), and conducted several case studies on a typical storage system, to demonstrate its benefits for storage system researchers and designers. We found that ICE makes it easy to explore a large parameter space, identify critical parameters, and quickly zero in on optimal parameter settings.

Speakers
Zhen Cao (Stony Brook University)
Geoff Kuenning (Harvey Mudd College)
Klaus Mueller (Stony Brook University)
Anjul Tyagi (Stony Brook University)
Erez Zadok (Stony Brook University)


Monday July 8, 2019 11:30am - 12:00pm PDT
HotStorage: Grand Ballroom I–III

11:45am PDT

Tackling Parallelization Challenges of Kernel Network Stack for Container Overlay Networks
Overlay networks are the de facto networking technique for providing flexible, customized connectivity among distributed containers in the cloud. However, overlay networks also incur non-trivial overhead due to their complexity, resulting in significant network performance degradation for containers. In this paper, we perform a comprehensive empirical performance study of container overlay networks that identifies previously unrevealed, important parallelization bottlenecks in the kernel network stack that prevent container overlay networks from scaling. Our observations and root-cause analysis shed light on optimizing the network stack of modern operating systems on multi-core systems to more efficiently support container overlay networks in light of high-speed network devices.

Speakers
Jiaxin Lei (SUNY at Binghamton)
Kun Suo (The University of Texas at Arlington)
Hui Lu (SUNY at Binghamton)
Jia Rao (The University of Texas at Arlington)


Monday July 8, 2019 11:45am - 12:00pm PDT
HotCloud: Grand Ballroom VII–IX

12:00pm PDT

A Black-Box Approach for Estimating Utilization of Polled IO Network Functions
Cloud management tasks such as performance diagnosis, workload placement, and power management depend critically on estimating the utilization of an application. But it is challenging to measure the actual utilization of polled IO network functions (NFs) without code instrumentation. We ask whether CPU events (e.g., data cache misses) measured using hardware performance counters are good estimators of utilization for polled-IO NFs. We find a strong correlation between several CPU events and NF utilization for three representative types of network functions. Inspired by this finding, we explore the possibility of computing a universal estimation function that maps selected CPU events to NF utilization estimates for a wide range of NFs, traffic profiles, and traffic loads. Our NF-specific and universal estimators achieve absolute estimation errors below 6% and 10%, respectively.
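
The shape of such an estimation function can be illustrated with an ordinary least-squares fit from a counter reading to utilization. The paper's estimators use several CPU events and measured NF utilization; this one-feature fit on synthetic data is only a sketch of the idea.

```python
def fit_linear(xs, ys):
    """Least-squares line y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Synthetic calibration run: (cache misses per second, measured utilization %).
misses = [1e5, 2e5, 4e5, 8e5]
util   = [10.0, 20.0, 40.0, 80.0]

a, b = fit_linear(misses, util)
# Predict utilization for a new counter reading without instrumenting the NF.
print(round(a * 6e5 + b, 1))  # ~60.0 on this perfectly linear toy data
```

A "universal" estimator in the paper's sense would fit one such function across many NFs and traffic profiles rather than per NF.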

Speakers
Harshit Gupta (Georgia Institute of Technology)
Abhigyan Sharma (AT&T Labs Research)
Alex Zelezniak (AT&T Labs Research)
Minsung Jang (AT&T Labs Research)
Umakishore Ramachandran (Georgia Institute of Technology)


Monday July 8, 2019 12:00pm - 12:15pm PDT
HotCloud: Grand Ballroom VII–IX

12:00pm PDT

Fair-EDF: A Latency Fairness Framework for Shared Storage Systems
We present Fair-EDF, a framework that provides fairness control while supporting latency guarantees in shared storage servers. Fair-EDF extends the pure earliest-deadline-first (EDF) scheduler by adding a controller to shape the workloads. Under overload, it drops a minimal number of requests and chooses which requests to drop in a fair manner. Evaluation results show that Fair-EDF provides steady fairness control among a set of clients with different runtime behaviors.
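
A minimal sketch of the shaping step, under simplifying assumptions (all requests arrive at once and each takes one tick of service): when the EDF schedule is infeasible, drop the loosest-deadline request of the client with the fewest drops so far. The feasibility test and fairness rule are stand-ins for Fair-EDF's actual controller.

```python
from collections import defaultdict

def feasible(queue, tick):
    # In EDF order, the i-th queued request completes at tick + i + 1.
    return all(tick + i + 1 <= d for i, (_, d) in enumerate(queue))

def fair_edf(requests):
    """requests: (client, deadline) pairs. Returns (admitted queue, drops per client)."""
    queue = sorted(requests, key=lambda r: r[1])    # EDF order
    drops = defaultdict(int)
    # Shed load until the schedule is feasible, spreading drops across clients.
    while not feasible(queue, 0):
        victim_client = min(sorted({c for c, _ in queue}), key=lambda c: drops[c])
        victim = max((r for r in queue if r[0] == victim_client),
                     key=lambda r: r[1])            # give up its loosest deadline
        queue.remove(victim)
        drops[victim_client] += 1
    return queue, dict(drops)

kept, drops = fair_edf([("a", 1), ("a", 2), ("b", 2), ("b", 2)])
print(kept, drops)  # one drop from each client; the remaining schedule is feasible
```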

Speakers
Yuhan Peng (Rice University)
Peter Varman (Rice University)


Monday July 8, 2019 12:00pm - 12:30pm PDT
HotStorage: Grand Ballroom I–III

12:15pm PDT

Luncheon for Workshop Attendees
Monday July 8, 2019 12:15pm - 2:00pm PDT
Olympic Pavilion

2:00pm PDT

DLion: Decentralized Distributed Deep Learning in Micro-Clouds
Deep learning is a popular technique for building inference models and classifiers from large quantities of input data for applications in many domains. With the proliferation of edge devices such as sensors and mobile devices, large volumes of data are generated at a rapid pace all over the world. Migrating large amounts of data into centralized data centers over WAN environments is often infeasible for cost, performance, or privacy reasons. Moreover, there is an increasing need for incremental or online deep learning over newly generated data in real time. These trends require rethinking the traditional training approach to deep learning. To handle computation on distributed input data, micro-clouds, small-scale clouds deployed near edge devices in many different locations, provide an attractive alternative for data locality reasons. However, existing distributed deep learning systems do not support training in micro-clouds, due to the unique characteristics and challenges of this environment. In this paper, we examine the key challenges of deep learning in micro-clouds: computation and network resource heterogeneity at the inter- and intra-micro-cloud levels, and their scale. We present DLion, a decentralized distributed deep learning system for such environments. It employs techniques specifically designed to address these challenges to reduce training time, enhance model accuracy, and provide system scalability. We have implemented a prototype of DLion in TensorFlow, and our preliminary experiments show promising results towards achieving accurate and efficient distributed deep learning in micro-clouds.

Speakers
Rankyung Hong (University of Minnesota)
Abhishek Chandra (University of Minnesota)


Monday July 8, 2019 2:00pm - 2:15pm PDT
HotCloud: Grand Ballroom VII–IX

2:00pm PDT

Mismatched Memory Management of Android Smartphones
Linux's memory management algorithms have been in use for many years. Android inherits the Linux kernel, and thus Linux's memory management algorithms were transplanted to Android smartphones. To evaluate the efficiency of Android's memory management, this paper uses the page re-fault ratio as the target metric. Through carefully designed experiments, we show that the current memory management algorithms do not work well on Android smartphones. For example, the page re-fault ratio is up to 37% when running a set of popular apps, which means a large proportion of the pages evicted by the existing memory management algorithms are accessed again in the near future. We further analyze the causes of this high page re-fault ratio. Based on the analysis, we uncover a tradeoff between reclaim size and overall performance, and by exploiting it we propose a preliminary idea to improve the performance of Android smartphones.
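
The page re-fault metric is easy to state precisely: of all page faults, how many are for pages the reclaim policy previously evicted. A sketch with a tiny LRU standing in for the kernel's reclaim algorithm and a synthetic access trace:

```python
from collections import OrderedDict

def refault_ratio(trace, frames):
    """Fraction of faults that are re-faults (evicted pages needed again)."""
    mem = OrderedDict()          # resident pages, LRU order (oldest first)
    evicted = set()
    faults = refaults = 0
    for page in trace:
        if page in mem:
            mem.move_to_end(page)            # hit: refresh recency
            continue
        faults += 1
        if page in evicted:
            refaults += 1                    # a reclaimed page is needed again
        if len(mem) >= frames:
            victim, _ = mem.popitem(last=False)   # evict the LRU page
            evicted.add(victim)
        mem[page] = True
    return refaults / faults

# A cyclic trace over 3 pages with only 2 frames: every eviction is premature.
print(refault_ratio([1, 2, 3, 1, 2, 3], frames=2))  # 0.5
```

A high ratio like this indicates the reclaim policy is evicting pages the workload still needs, which is the mismatch the paper measures on real apps.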

Speakers
Yu Liang (Department of Computer Science, City University of Hong Kong)
Qiao Li (Department of Computer Science, City University of Hong Kong)
Chun Jason Xue (Department of Computer Science, City University of Hong Kong)


Monday July 8, 2019 2:00pm - 2:30pm PDT
HotStorage: Grand Ballroom I–III

2:15pm PDT

The Case for Unifying Data Loading in Machine Learning Clusters
Training machine learning models involves iteratively fetching and pre-processing batches of data. Conventionally, popular ML frameworks implement data loading within a job and focus on improving the performance of a single job. However, such an approach is inefficient in shared clusters, where multiple training jobs are likely to access the same data and duplicate operations. To illustrate this, we present a case study which reveals that hyper-parameter tuning experiments can avoid up to 89% of I/O and 97% of pre-processing redundancy.

Based on this observation, we make the case for unifying data loading in machine learning clusters by bringing the isolated data loading systems together into a single system. Such a system architecture can remove the aforementioned redundancies that arise due to the isolation of data loading in each job. We introduce OneAccess, a unified data access layer and present a prototype implementation that shows a 47.3% improvement in I/O cost when sharing data across jobs. Finally we discuss open research challenges in designing and developing a unified data loading layer that can run across frameworks on shared multi-tenant clusters, including how to handle distributed data access, support diverse sampling schemes, and exploit new storage media.
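
The redundancy argument can be sketched with a shared cache of pre-processed batches. `SharedLoader` and its interface are illustrative, not OneAccess's actual API; the "jobs" here are just loops over the same dataset.

```python
class SharedLoader:
    """One data-loading layer shared by all jobs: each raw batch is
    pre-processed at most once, however many jobs consume it."""
    def __init__(self, raw_batches, preprocess):
        self.raw = raw_batches
        self.preprocess = preprocess
        self.cache = {}
        self.work = 0            # number of preprocess invocations

    def get(self, idx):
        if idx not in self.cache:
            self.work += 1
            self.cache[idx] = self.preprocess(self.raw[idx])
        return self.cache[idx]

loader = SharedLoader([[1, 2], [3, 4]], preprocess=lambda b: [x * 2 for x in b])
for job in range(2):                         # e.g. two hyper-parameter tuning jobs
    epoch = [loader.get(i) for i in range(2)]
print(loader.work)  # 2 preprocess calls instead of 4 with per-job loaders
```

With per-job loaders the work would scale with the number of jobs; sharing makes it scale with the dataset, which is the saving the case study quantifies.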

Speakers
Aarati Kakaraparthy (University of Wisconsin, Madison & Microsoft Gray Systems Lab, Madison)
Abhay Venkatesh (University of Wisconsin, Madison)
Amar Phanishayee (Microsoft Research)
Shivaram Venkataraman (University of Wisconsin and Microsoft Research)


Monday July 8, 2019 2:15pm - 2:30pm PDT
HotCloud: Grand Ballroom VII–IX

2:30pm PDT

Accelerating Deep Learning Inference via Freezing
Over the last few years, Deep Neural Networks (DNNs) have become ubiquitous owing to their high accuracy on real-world tasks. However, this increase in accuracy comes at the cost of computationally expensive models, leading to higher prediction latencies. Prior efforts to reduce this latency, such as quantization, model distillation, and any-time prediction models, typically trade off accuracy for performance. In this work, we observe that caching intermediate layer outputs can help us avoid running all the layers of a DNN for a sizeable fraction of inference requests. We find that this can potentially reduce the number of effective layers by half for 91.58% of CIFAR-10 requests run on ResNet-18. We present Freeze Inference, a system that introduces approximate caching at each intermediate layer, and we discuss techniques to reduce the cache size and improve the cache hit rate. Finally, we discuss some of the open research challenges in realizing such a design.
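
The caching idea can be sketched with a two-stage "model": key the cache on a quantized intermediate output, and skip the expensive tail on a hit. The rounding-based quantization and the toy layers below are stand-ins for real DNN activations, not the system's actual cache design.

```python
def layer1(x):
    # Cheap early layers; rounding doubles as a crude similarity quantizer.
    return [round(v, 1) for v in x]

def layer2(x):
    # Stand-in for the expensive remaining layers of the network.
    return sum(x) > 0

cache = {}
tail_runs = 0      # how often the expensive tail actually executed

def infer(x):
    global tail_runs
    key = tuple(layer1(x))        # quantized intermediate output as cache key
    if key in cache:
        return cache[key]         # hit: remaining layers are skipped entirely
    tail_runs += 1
    cache[key] = layer2(layer1(x))
    return cache[key]

# Two near-identical inputs quantize to the same key, so the tail runs once.
print(infer([0.51, -0.2]), infer([0.52, -0.21]), tail_runs)
```

The open problems the abstract mentions (cache size, hit rate) correspond here to choosing the quantization and which layer's output to key on.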

Speakers
Adarsh Kumar (University of Wisconsin, Madison)
Arjun Balasubramanian (University of Wisconsin, Madison)
Shivaram Venkataraman (University of Wisconsin and Microsoft Research)
Aditya Akella (University of Wisconsin - Madison)


Monday July 8, 2019 2:30pm - 2:45pm PDT
HotCloud: Grand Ballroom VII–IX

2:30pm PDT

Linearizable Quorum Reads in Paxos
Many distributed systems/databases rely on Paxos for providing linearizable reads. Linearizable reads in Paxos are achieved either by running a full read round with followers, or by reading from a stable leader which holds leases on followers. We introduce a third method that performs linearizable reads by eschewing the leader and reading only from a quorum of followers. A bare quorum read is insufficient to guarantee linearizability; it must be amended with a rinse phase to account for pending update operations. We present our Paxos Quorum Read (PQR) protocol, which implements this. Our evaluations show that PQR significantly improves throughput compared to the other methods, while achieving latency comparable to the read-from-stable-leader optimization.
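
A heavily simplified sketch of a quorum read with its rinse phase, assuming a single log, no failures, and followers modeled as dicts of slot numbers. The rinse loop simply simulates waiting for a pending slot to commit; a real protocol would re-contact followers.

```python
def quorum_read(followers, quorum):
    """Read from a majority of followers. If any follower in the quorum has
    accepted a slot that is not yet applied, rinse (re-check) until it commits,
    so the returned value is guaranteed to be linearizable."""
    replies = followers[:quorum]                   # any majority would do
    max_accepted = max(f["accepted_slot"] for f in replies)
    applied = max(f["applied_slot"] for f in replies)
    rinses = 0
    while applied < max_accepted:                  # a pending update was observed
        rinses += 1
        for f in followers:                        # simulate commit propagating
            f["applied_slot"] = f["accepted_slot"]
        applied = max(f["applied_slot"] for f in followers[:quorum])
    return max_accepted, rinses

fs = [{"accepted_slot": 5, "applied_slot": 5},
      {"accepted_slot": 6, "applied_slot": 5},     # slot 6 accepted, not committed
      {"accepted_slot": 5, "applied_slot": 5}]
print(quorum_read(fs, quorum=2))  # waits one rinse for slot 6 to commit
```

The key point is that the bare read alone could return a value that a concurrent, already-accepted update will overwrite; the rinse closes that window without involving the leader.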

Speakers
Aleksey Charapko (University at Buffalo, SUNY; Microsoft, Redmond, WA)
Ailidani Ailijiang (Microsoft, Redmond, WA)
Murat Demirbas (University at Buffalo, SUNY; Microsoft, Redmond, WA)


Monday July 8, 2019 2:30pm - 3:00pm PDT
HotStorage: Grand Ballroom I–III

2:45pm PDT

Characterization and Prediction of Performance Interference on Mediated Passthrough GPUs for Interference-aware Scheduler
Sharing GPUs in the cloud is cost-effective and can facilitate the adoption of hardware-accelerator-enabled clouds. But sharing causes interference between co-located VMs and leads to performance degradation. In this paper, we propose an interference-aware VM scheduler at the cluster level with the goal of minimizing interference. NVIDIA vGPU provides sharing capability and high performance, but it has unique performance characteristics, which have not been studied thoroughly before. Our study reveals several key observations. We leverage these observations to construct models, based on machine learning techniques, that predict interference between VMs co-located on the same GPU. We propose a system architecture that leverages our models to schedule VMs so as to minimize interference. Experiments show that our observations improve model accuracy (by 15%-40%) and that the scheduler reduces application runtime overhead by 24.2% in simulated scenarios.

Speakers
Xin Xu (VMware Inc)
Na Zhang (VMware Inc)
Michael Cui (VMware Inc)
Michael He (The University of Texas at Austin)
Ridhi Surana (VMware Inc)


Monday July 8, 2019 2:45pm - 3:00pm PDT
HotCloud: Grand Ballroom VII–IX

3:00pm PDT

Happiness index: Right-sizing the cloud’s tenant-provider interface
Cloud providers and their tenants have a mutual interest in identifying optimal configurations in which to run tenant jobs, i.e., ones that achieve tenants’ performance goals at minimum cost, or ones that maximize performance within a specified budget. However, different tenants may have different performance goals that are opaque to the provider. A consequence of this opacity is that providers today typically offer fixed bundles of cloud resources, which tenants must themselves explore and choose from. This is burdensome for tenants and can lead to choices that are sub-optimal for both parties.

We thus explore a simple, minimal interface, which lets tenants communicate their happiness with cloud infrastructure to the provider, and enables the provider to explore resource configurations that maximize this happiness. Our early results indicate that this interface could strike a good balance between enabling efficient discovery of application resource needs and the complexity of communicating a full description of tenant utility from different configurations to the provider.

Speakers
Vojislav Dukic (ETH Zurich)
Ankit Singla (ETH Zurich)


Monday July 8, 2019 3:00pm - 3:15pm PDT
HotCloud: Grand Ballroom VII–IX

3:00pm PDT

Jungle: Towards Dynamically Adjustable Key-Value Store by Combining LSM-Tree and Copy-On-Write B+-Tree
Designing key-value stores based on log-structured merge-trees (LSM-trees) involves a well-known trade-off between the I/O cost of updates and that of lookups, as well as space usage. It is generally believed that all three cannot be improved at the same time: reducing update cost increases lookup cost and space usage, and vice versa. Recent work has addressed this issue, but it focuses on probabilistic approaches or on reducing amortized cost only, which may not help the tail latency that is critical to server applications. This paper suggests a novel approach that transplants a copy-on-write B+-tree into an LSM-tree, aiming to reduce update cost without sacrificing lookup cost. In addition, our scheme provides a simple and practical way to adjust the index between an update-optimized form and a space-optimized form. Evaluation results show that it significantly reduces update cost while keeping lookup cost consistent.

Speakers

Monday July 8, 2019 3:00pm - 3:30pm PDT
HotStorage: Grand Ballroom I–III

3:15pm PDT

The True Cost of Containing: A gVisor Case Study
We analyze many facets of the performance of gVisor, a new security-oriented container engine that integrates with Docker and backs Google’s serverless platform. We explore the effect gVisor’s in-Sentry network stack has on network throughput as well as the overheads of performing all file opens via gVisor’s Gofer service. We further analyze gVisor startup performance, memory efficiency, and system-call overheads. Our findings have implications for the future design of similar hypervisor-based container engines.

Speakers
EG

Ethan G. Young

University of Wisconsin-Madison
PZ

Pengfei Zhu

University of Wisconsin-Madison
TC

Tyler Caraza-Harter

University of Wisconsin-Madison
AC

Andrea C. Arpaci-Dusseau

University of Wisconsin, Madison
RH

Remzi H. Arpaci-Dusseau

University of Wisconsin, Madison


Monday July 8, 2019 3:15pm - 3:30pm PDT
HotCloud: Grand Ballroom VII–IX

3:30pm PDT

Carving Perfect Layers out of Docker Images
This paper and abstract are under embargo and will be released to the public on July 8, 2019.

Speakers
DS

Dimitris Skourtis

IBM Research
LR

Lukas Rupprecht

IBM Research
VT

Vasily Tarasov

IBM Research-Almaden
NM

Nimrod Megiddo

IBM Research


Monday July 8, 2019 3:30pm - 3:45pm PDT
HotCloud: Grand Ballroom VII–IX

4:00pm PDT

Respecting the block interface – computational storage using virtual objects
Computational storage has remained an elusive goal. Though minimizing data movement by placing computation close to storage has quantifiable benefits, many of the previous attempts failed to take root in industry. They either require a departure from the widespread block protocol to one that is more computationally-friendly (e.g., file, object, or key-value), or they introduce significant complexity (state) on top of the block protocol.

We participated in many of these attempts and have since concluded that neither a departure from nor a significant addition to the block protocol is needed. Here we introduce a block-compatible design based on virtual objects. Like a real object (e.g., a file), a virtual object contains the metadata that is needed to process the data. We show how numerous offloads are possible using virtual objects and, as one example, demonstrate a 99% reduction in the data movement required to “scrub” object storage for bitrot. We also present our early work with erasure coded data which, unlike RAID, can be easily adapted to computational storage using virtual objects.
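A minimal sketch of the virtual-object idea described above: the host hands the device a metadata record (block extents plus an expected checksum), and the device can verify the data locally instead of streaming it back to the host. All structures here are illustrative assumptions, not the paper's actual on-device format:

```python
import hashlib
from dataclasses import dataclass

BLOCK = 4096

@dataclass
class VirtualObject:
    extents: list   # (start_block, n_blocks) pairs describing the data
    sha256: str     # expected digest of the concatenated extents

def scrub(device: bytes, vobj: VirtualObject) -> bool:
    """On-device bitrot check: hash the referenced blocks and compare."""
    h = hashlib.sha256()
    for start, n in vobj.extents:
        h.update(device[start * BLOCK:(start + n) * BLOCK])
    return h.hexdigest() == vobj.sha256

# Build a toy "device" and a virtual object covering blocks 1-2.
device = bytearray(8 * BLOCK)
device[1 * BLOCK:3 * BLOCK] = b"x" * (2 * BLOCK)
digest = hashlib.sha256(bytes(device[1 * BLOCK:3 * BLOCK])).hexdigest()
vobj = VirtualObject(extents=[(1, 2)], sha256=digest)
```

Only the small metadata record crosses the block interface; the bulk data never leaves the device, which is where the quoted 99% reduction in data movement for scrubbing comes from.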

Speakers
IF

Ian F. Adams

Intel Labs
JK

John Keys

Intel Labs


Monday July 8, 2019 4:00pm - 4:30pm PDT
HotStorage: Grand Ballroom I–III

4:15pm PDT

Bridging the Edge-Cloud Barrier for Real-time Advanced Vision Analytics
Advanced vision analytics plays a key role in a plethora of real-world applications. Unfortunately, many of these applications fail to leverage the abundant compute resources in cloud services, because they require high computing resources and high-quality video input, but the (wireless) network connections between visual sensors (cameras) and the cloud/edge servers do not always provide sufficient and stable bandwidth to stream high-fidelity video data in real time.

This paper presents CloudSeg, an edge-to-cloud framework for advanced vision analytics that co-designs the cloud-side inference with real-time video streaming to achieve both low latency and high inference accuracy. The core idea is to send the video stream in low resolution, but recover the high-resolution frames from the low-resolution stream via a super-resolution procedure tailored for the actual analytics tasks. In essence, CloudSeg trades additional cloud-side computation (super-resolution) for significantly reduced network bandwidth. Our initial evaluation shows that compared to previous work, CloudSeg can reduce bandwidth consumption by ~6.8× with a negligible drop in accuracy.
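The bandwidth-for-computation trade can be illustrated with a toy pipeline: the camera sends a low-resolution frame (fewer bytes on the wire), and the cloud reconstructs detail before running analytics. Real CloudSeg uses a learned super-resolution model; the nearest-neighbor upsample below is only a stand-in for illustration:

```python
def downsample(frame, factor):
    """Keep every `factor`-th pixel in each dimension before transmission."""
    return [row[::factor] for row in frame[::factor]]

def upsample(frame, factor):
    """Nearest-neighbor reconstruction (stand-in for learned super-resolution)."""
    out = []
    for row in frame:
        wide = [p for p in row for _ in range(factor)]
        out.extend([wide] * factor)
    return out

frame = [[x + y for x in range(8)] for y in range(8)]   # toy 8x8 frame
sent = downsample(frame, 2)       # 4x fewer pixels cross the network
recovered = upsample(sent, 2)     # cloud-side reconstruction before inference
```

With factor 2, only a quarter of the pixels are transmitted; the quality of the analytics then hinges entirely on how well the cloud-side reconstruction recovers task-relevant detail.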

Speakers
JJ

Junchen Jiang

University of Chicago


Monday July 8, 2019 4:15pm - 4:30pm PDT
HotCloud: Grand Ballroom VII–IX

4:30pm PDT

NLUBroker: A Flexible and Responsive Broker for Cloud-based Natural Language Understanding Services
Cloud-based Natural Language Understanding (NLU) services are becoming more and more popular with the development of artificial intelligence, and more applications are integrating cloud-based NLU services to enhance the way people communicate with machines. However, with NLU services provided by different companies and powered by undisclosed AI technology, choosing the best one is a problem for users. To our knowledge, there is currently no platform that can provide guidance to users and make recommendations based on their needs. To fill this gap, we propose NLUBroker, a platform that comprehensively measures the performance indicators of candidate NLU services and provides a broker to select the most suitable service according to the different needs of users. Our evaluation shows that different NLU services lead in different aspects, and that NLUBroker is able to improve the quality of experience by automatically choosing the best service. In addition, reinforcement learning is used to support NLUBroker with an intelligent agent in a dynamic environment, and the results are promising.

Speakers
LX

Lanyu Xu

Wayne State University
AI

Arun Iyengar

IBM T.J. Watson Research Center
WS

Weisong Shi

Wayne State University


Monday July 8, 2019 4:30pm - 4:45pm PDT
HotCloud: Grand Ballroom VII–IX

4:30pm PDT

A Tale of Two Abstractions: The Case for Object Space
The increasing availability of byte-addressable non-volatile memory on the system bus provides an opportunity to dramatically simplify application interaction with persistent data. However, software and hardware leverage different abstractions: software operating on persistent data structures requires “global” pointers that remain valid after a process terminates, while hardware requires that a diverse set of devices all have the same mappings they need for bulk transfers to and from memory, and that they be able to do so for a potentially heterogeneous memory system. Both abstractions must be implemented in a way that is efficient using existing hardware.

We propose to abstract physical memory into an object space, which maps objects to physical memory, while providing applications with a way to refer to data that may have a lifetime longer than the processes accessing it. This approach reduces the coordination required for access to multiple types of memory while improving hardware security and enabling more hardware autonomy. We describe how we can use existing hardware support to implement these abstractions, both for applications and for the OS and devices, and show that the performance penalty for this approach is minimal.
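The indirection at the heart of the proposal can be sketched in a few lines: persistent "global pointers" name an (object, offset) pair rather than a virtual address, and a mapping layer binds objects to their backing memory at access time. The structures below are invented purely to illustrate the abstraction, not the authors' implementation:

```python
object_space = {}   # object id -> bytearray standing in for physical backing

def create(obj_id, size):
    """Bind a new object to (toy) physical memory."""
    object_space[obj_id] = bytearray(size)

def deref(gptr):
    """Resolve a global pointer (obj_id, offset) through the object space."""
    obj_id, off = gptr
    return object_space[obj_id][off]

def store(gptr, value):
    obj_id, off = gptr
    object_space[obj_id][off] = value

create("log", 16)
store(("log", 3), 7)   # the pointer names the object, not an address
```

Because a global pointer never embeds a virtual address, it stays meaningful after the creating process exits, and the mapping layer is free to place the object in whichever memory device suits it.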

Speakers
DB

Daniel Bittman

UC Santa Cruz
PA

Peter Alvaro

UC Santa Cruz
DD

Darrell D. E. Long

UC Santa Cruz
EL

Ethan L. Miller

UC Santa Cruz


Monday July 8, 2019 4:30pm - 5:00pm PDT
HotStorage: Grand Ballroom I–III

4:45pm PDT

Static Call Graph Construction in AWS Lambda Serverless Applications
We present new means for performing static program analysis on serverless programs. We propose a new type of call graph that captures the stateless, event-driven nature of such programs and describe a method for constructing these new extended service call graphs. Next, we survey applications of program analysis that can leverage our extended service call graphs to answer questions about code that executes on a serverless platform. We present findings on the applicability of our techniques to real open source serverless programs. Finally, we close with several open questions about how to best incorporate static analysis in problem solving for developing serverless applications.
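An extended service call graph adds edges that flow through platform events, e.g., a storage write in one function triggering another function. The sketch below builds such a graph from declared calls and triggers; the event names and functions are invented for illustration, and the paper derives these edges by static analysis rather than from a manifest:

```python
def build_extended_call_graph(functions, triggers):
    """functions: fn name -> service calls it makes.
    triggers: service event -> function it invokes."""
    edges = set()
    for fn, calls in functions.items():
        for svc in calls:
            edges.add((fn, svc))                  # direct service-call edge
            if svc in triggers:
                edges.add((svc, triggers[svc]))   # event edge into the next fn
    return edges

functions = {"upload_handler": ["s3:put"], "thumbnailer": ["dynamodb:write"]}
triggers = {"s3:put": "thumbnailer"}
graph = build_extended_call_graph(functions, triggers)
```

Analyses such as taint tracking or cost estimation can then traverse the combined call-and-event edges, which an ordinary call graph of either function alone would miss.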

Speakers
MO

Matthew Obetz

Rensselaer Polytechnic Institute
SP

Stacy Patterson

Rensselaer Polytechnic Institute
AM

Ana Milanova

Rensselaer Polytechnic Institute


Monday July 8, 2019 4:45pm - 5:00pm PDT
HotCloud: Grand Ballroom VII–IX

5:00pm PDT

Agile Cold Starts for Scalable Serverless
The serverless or Function-as-a-Service (FaaS) model capitalizes on lightweight execution by packaging code and dependencies together for just-in-time dispatch. Often a container environment has to be set up afresh, a condition called a "cold start", and in such cases performance suffers and overheads mount, both deteriorating rapidly under high concurrency. Caching and reusing previously employed containers ties up memory and risks information leakage. Cold-start latency is frequently due to work and wait times in setting up various dependencies, such as initializing networking elements. This paper proposes a solution that pre-crafts such resources and then dynamically re-associates them with baseline containers. Applied to networking, this approach demonstrates an order-of-magnitude gain in cold-start time, negligible memory consumption, and flat startup time under rising concurrency.
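The pre-crafting idea can be sketched as a warm pool: the expensive setup (here, a fake "network namespace") happens ahead of time and is popped off the critical path at cold start. The pool shape and namespace objects are toy assumptions, not the paper's mechanism:

```python
from collections import deque

def create_netns(i):
    """Stand-in for slow network setup (veth pairs, routes, ...)."""
    return {"id": i, "configured": True}

class WarmPool:
    def __init__(self, size):
        # Pre-craft resources before any function is dispatched.
        self.pool = deque(create_netns(i) for i in range(size))

    def cold_start(self):
        """Re-associate a pre-made namespace; fall back to full-cost setup."""
        ns = self.pool.popleft() if self.pool else create_netns(-1)
        return {"container": "baseline", "netns": ns}

pool = WarmPool(size=2)
fast = pool.cold_start()   # served from the pool, off the critical path
```

A background thread refilling the pool would keep startup time flat as concurrency rises, which is the behavior the abstract reports for the networking case.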

Speakers
AM

Anup Mohan

Intel Corporation
HS

Harshad Sane

Intel Corporation
KD

Kshitij Doshi

Intel Corporation
SE

Saikrishna Edupuganti

Intel Corporation
NN

Naren Nayak

Intel Corporation


Monday July 8, 2019 5:00pm - 5:15pm PDT
HotCloud: Grand Ballroom VII–IX

5:00pm PDT

An Ounce of Prevention is Worth a Pound of Cure: Ahead-of-time Preparation for Safe High-level Container Interfaces
Containers continue to gain traction in the cloud as lightweight alternatives to virtual machines (VMs). This is partially due to their use of host filesystem abstractions, which play a role in startup times, memory utilization, crash consistency, file sharing, host introspection, and image management. However, the filesystem interface is high-level and wide, presenting a large attack surface to the host. Emerging secure container efforts focus on lowering the level of abstraction of the interface to the host through deprivileged functionality recreation (e.g., VMs, userspace kernels). However, the filesystem abstraction is so important that some have resorted to directly exposing it from the host instead of suffering the resulting semantic gap. In this paper, we suggest that through careful ahead-of-time metadata preparation, secure containers can maintain a small attack surface while simultaneously alleviating the semantic gap.

Speakers
RK

Ricardo Koller

IBM T. J. Watson Research Center
DW

Dan Williams

IBM T. J. Watson Research Center


Monday July 8, 2019 5:00pm - 5:30pm PDT
HotStorage: Grand Ballroom I–III

5:15pm PDT

Co-evolving Tracing and Fault Injection with Box of Pain
Distributed systems are hard to reason about largely because of uncertainty about what may go wrong in a particular execution, and about whether the system will mitigate those faults. Tools that perturb executions can help test whether a system is robust to faults, while tools that observe executions can help better understand their system-wide effects. We present Box of Pain, a tracer and fault injector for unmodified distributed systems that addresses both concerns by interposing at the system call level and dynamically reconstructing the partial order of communication events based on causal relationships. Box of Pain’s lightweight approach to tracing and focus on simulating the effects of partial failures on communication rather than the failures themselves sets it apart from other tracing and fault injection systems. We present evidence of the promise of Box of Pain and its approach to lightweight observation and perturbation of distributed systems.
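The causal reconstruction Box of Pain performs can be illustrated in miniature: from intercepted send/recv "system calls" we derive happens-before edges, program order within a process plus a send-to-recv edge across processes. The event representation below is invented for illustration:

```python
def happens_before(trace):
    """trace: list of (process, op, msg_id) tuples in observed order.
    Returns happens-before edges as pairs of trace indices."""
    edges = set()
    last = {}    # last event index seen per process (program order)
    sends = {}   # msg_id -> index of its send (communication order)
    for i, (proc, op, msg) in enumerate(trace):
        if proc in last:
            edges.add((last[proc], i))        # program-order edge
        last[proc] = i
        if op == "send":
            sends[msg] = i
        elif op == "recv" and msg in sends:
            edges.add((sends[msg], i))        # send happens-before recv
    return edges

trace = [("A", "send", 1), ("B", "recv", 1),
         ("B", "send", 2), ("A", "recv", 2)]
order = happens_before(trace)
```

With the partial order in hand, a fault injector can perturb a specific communication edge (e.g., drop the message behind a recv) and observe how the rest of the order shifts.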

Speakers
DB

Daniel Bittman

UC Santa Cruz
EL

Ethan L. Miller

UC Santa Cruz
PA

Peter Alvaro

UC Santa Cruz


Monday July 8, 2019 5:15pm - 5:30pm PDT
HotCloud: Grand Ballroom VII–IX

6:30pm PDT

USENIX Annual Meeting
Come meet the USENIX Board of Directors and staff. Find out what's going on and share your ideas on how we can better serve you.

Monday July 8, 2019 6:30pm - 7:30pm PDT
West Seattle Room
 
Tuesday, July 9
 

7:30am PDT

Continental Breakfast
Tuesday July 9, 2019 7:30am - 8:30am PDT
Grand Ballroom Prefunction

7:30am PDT

Badge Pickup
Tuesday July 9, 2019 7:30am - 5:00pm PDT
Grand Stair Prefunction

8:30am PDT

Shared Keynote Address with HotStorage '19
Speakers

Tuesday July 9, 2019 8:30am - 9:40am PDT
Grand Ballroom

8:30am PDT

Shared Keynote Address with HotEdge '19
Speakers

Tuesday July 9, 2019 8:30am - 9:40am PDT
Grand Ballroom

9:40am PDT

Break with Refreshments
Tuesday July 9, 2019 9:40am - 10:10am PDT
Grand Ballroom Prefunction

10:10am PDT

The Case for Dual-access File Systems over Object Storage
Object storage has emerged as a low-cost and hyper-scalable alternative to distributed file systems. However, interface incompatibilities and performance limitations often compel users to either transfer data between a file system and object storage or use inefficient file connectors over object stores. The result is growing storage sprawl, unacceptably low performance, and an increase in associated storage costs. One promising solution to this problem is providing dual access, the ability to transparently read and write the same data through both file system interfaces and object storage APIs. In this position paper we argue that there is a need for dual-access file systems over object storage, and examine some representative use cases which benefit from such systems. Based on our conversations with end users, we discuss features which we believe are essential or desirable in a dual-access object storage file system (OSFS). Further, we design and implement an early prototype of Agni, an efficient dual-access OSFS which overcomes the shortcomings of existing approaches. Our preliminary experiments demonstrate that for some representative workloads Agni can improve performance by 20%-60% compared to either S3FS, a popular OSFS, or the prevalent approach of manually copying data between different storage systems.

Speakers
KL

Kunal Lillaney

Johns Hopkins University
VT

Vasily Tarasov

IBM Research-Almaden
DP

David Pease

IBM Research-Almaden
RB

Randal Burns

Johns Hopkins University


Tuesday July 9, 2019 10:10am - 10:40am PDT
HotStorage: Grand Ballroom I–III

10:20am PDT

Collaborative Learning on the Edges: A Case Study on Connected Vehicles
The wide deployment of 4G/5G has made connected vehicles a perfect edge computing platform for a plethora of new services that were previously impossible, such as remote real-time diagnostics and advanced driver assistance. In this work, we propose CLONE, a collaborative learning setting on the edges based on a real-world dataset collected from a large electric vehicle (EV) company. Our approach is built on top of the federated learning algorithm and long short-term memory networks, and it demonstrates the effectiveness of driver personalization, privacy preservation, latency reduction (asynchronous execution), and security protection. We choose the failure of EV batteries and associated accessories as our case study to show how the CLONE solution can accurately predict failures to ensure sustainable and reliable driving in a collaborative fashion.
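The federated-learning core of such a setting can be sketched in a few lines: each vehicle trains on its own data and only model weights reach the server, never raw driving data. The one-parameter "model" and its update rule below are illustrative assumptions; CLONE combines federated learning with LSTM networks:

```python
def local_update(weight, data, lr=0.1):
    """One gradient step of least-squares fitting y = w*x on local data."""
    grad = sum(2 * x * (weight * x - y) for x, y in data) / len(data)
    return weight - lr * grad

def federated_round(global_w, vehicles):
    """Each edge node trains locally; the server averages the weights."""
    local_ws = [local_update(global_w, data) for data in vehicles]
    return sum(local_ws) / len(local_ws)

# Two vehicles whose (private) data both follow y = 2x.
vehicles = [[(1, 2), (2, 4)], [(1, 2), (3, 6)]]
w = 0.0
for _ in range(50):
    w = federated_round(w, vehicles)
```

The averaged weight converges to the shared underlying model even though no vehicle ever discloses its samples, which is the privacy property the abstract highlights.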

Speakers
SL

Sidi Lu

Wayne State University
YY

Yongtao Yao

Wayne State University
WS

Weisong Shi

Wayne State University


Tuesday July 9, 2019 10:20am - 10:35am PDT
HotEdge: Grand Ballroom VII–IX

10:35am PDT

Home, SafeHome: Ensuring a Safe and Reliable Home Using the Edge
As smart home environments get more complex and denser, they are becoming harder to manage. We present our ongoing work on the design and implementation of "SafeHome", a system for management and coordination inside a smart home. SafeHome offers users and programmers the flexibility to specify safety properties in a declarative way, and to specify routines of commands in an imperative way. SafeHome includes mechanisms which ensure that under concurrent routines and device failures, the smart home behavior is consistent (e.g., serializable) and safety properties are always guaranteed. SafeHome is intended to run on edge machines co-located with the smart home. Our design space opens the opportunity to borrow and adapt rich ideas and mechanisms from related areas such as databases and compilers.
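The combination of declarative safety properties and imperative routines can be sketched as follows: routines run under a lock (which makes them trivially serializable), and a declarative invariant is checked before any command takes effect. The device names and the invariant are invented for illustration; SafeHome's actual mechanisms are richer:

```python
import threading

state = {"oven": "off", "away_mode": False}
_lock = threading.Lock()

def safe(s):
    """Declarative safety property: the oven is never on while away."""
    return not (s["oven"] == "on" and s["away_mode"])

def run_routine(commands):
    """Apply an imperative routine atomically; reject it if safety would break."""
    with _lock:
        trial = dict(state)
        for device, value in commands:
            trial[device] = value
            if not safe(trial):
                return False          # routine rejected, state unchanged
        state.update(trial)
        return True

ok = run_routine([("oven", "on")])
bad = run_routine([("away_mode", True)])   # would violate the invariant
```

Running the whole routine against a trial copy is what keeps a partially-applied routine from ever exposing an unsafe intermediate state to concurrent routines.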

Speakers
SB

Shegufta Bakht Ahsan

University of Illinois at Urbana Champaign
RY

Rui Yang

University of Illinois at Urbana Champaign
SA

Shadi Abdollahian Noghabi

Microsoft Research
IG

Indranil Gupta

University of Illinois at Urbana Champaign


Tuesday July 9, 2019 10:35am - 10:50am PDT
HotEdge: Grand Ballroom VII–IX

10:40am PDT

File Systems as Processes
We introduce file systems as processes (FSP), a storage architecture designed for modern ultra-fast storage devices. By building a direct-access file system as a standalone user-level process, FSP accelerates file system development velocity without compromising essential file system properties. FSP promises to deliver raw device-level performance via highly tuned inter-process communication mechanisms; FSP also ensures protection and metadata integrity by design. To study the potential advantages and disadvantages of the FSP approach, we develop DashFS, a prototype user-level file system. We discuss its architecture and show preliminary performance benefits.

Speakers
JL

Jing Liu

University of Wisconsin, Madison
AC

Andrea C. Arpaci-Dusseau

University of Wisconsin, Madison
RH

Remzi H. Arpaci-Dusseau

University of Wisconsin, Madison
SK

Sudarsun Kannan

Rutgers University


Tuesday July 9, 2019 10:40am - 11:10am PDT
HotStorage: Grand Ballroom I–III

10:50am PDT

Exploring the Use of Synthetic Gradients for Distributed Deep Learning across Cloud and Edge Resources
With the explosive growth of data, largely contributed by the rapidly and widely deployed smart devices on the edge, we need to rethink the training paradigm for learning on such real-world data. The conventional cloud-only approach can hardly keep up with the computational demand of these deep learning tasks, and the traditional back-propagation-based training method also makes it difficult to scale out the training. Fortunately, the continuous advancement in System on Chip (SoC) hardware is transforming edge devices into capable computing platforms that can potentially be exploited to address these challenges. These observations have motivated this paper's study on the use of synthetic gradients for distributed training across cloud and edge devices. We employ synthetic gradients in various neural network models to comprehensively evaluate their feasibility in terms of accuracy and convergence speed. We distribute the training of the various layers of a model using synthetic gradients, and evaluate the approach's effectiveness on the edge by using resource-limited containers to emulate edge devices. The evaluation shows that the synthetic gradient approach can achieve accuracy comparable to conventional back propagation for an eight-layer model with both fully connected and convolutional layers. For a more complex model (VGG16), training suffers some accuracy degradation (up to 15%), but achieves an 11% improvement in training speed when the layers of a model are decoupled and trained in separate resource-limited containers, compared to training the whole model using the conventional method on a physical machine.

Speakers
YC

Yitao Chen

Arizona State University
KZ

Kaiqi Zhao

Arizona State University
BL

Baoxin Li

Arizona State University
MZ

Ming Zhao

Arizona State University


Tuesday July 9, 2019 10:50am - 11:05am PDT
HotEdge: Grand Ballroom VII–IX

11:05am PDT

Distributing Deep Neural Networks with Containerized Partitions at the Edge
Deploying machine learning on edge devices is becoming increasingly important, driven by new applications such as smart homes, smart cities, and autonomous vehicles. Unfortunately, it is challenging to deploy deep neural networks (DNNs) on resource-constrained devices. These workloads are computationally intensive and often require cloud-like resources. Prior solutions attempted to address these challenges by either sacrificing accuracy or by relying on cloud resources for assistance.

In this paper, we propose a containerized partition-based runtime adaptive convolutional neural network (CNN) acceleration framework for Internet of Things (IoT) environments. The framework leverages spatial partitioning techniques through convolution layer fusion to dynamically select the optimal partition according to the availability of computational resources and network conditions. By containerizing each partition, we simplify the model update and deployment with Docker and Kubernetes to efficiently handle runtime resource management and scheduling of containers.

Speakers
LZ

Li Zhou

The Ohio State University
HW

Hao Wen

University of Minnesota, Twin Cities
RT

Radu Teodorescu

The Ohio State University
DH

David H.C. Du

University of Minnesota, Twin Cities


Tuesday July 9, 2019 11:05am - 11:20am PDT
HotEdge: Grand Ballroom VII–IX

11:10am PDT

Filesystem Aging: It’s more Usage than Fullness
Filesystem fragmentation is a first-order performance problem that has been the target of many heuristic and algorithmic approaches. Real-world application benchmarks show that common filesystem operations cause many filesystems to fragment over time, a phenomenon known as filesystem aging.

This paper examines the common assumption that space pressure will exacerbate fragmentation. Our microbenchmarks show that space pressure can cause a substantial amount of inter-file and intra-file fragmentation. However, on a “real-world” application benchmark, space pressure causes fragmentation that slows subsequent reads by only 20% on ext4, relative to the amount of fragmentation that would occur on a file system with abundant space. The other file systems show negligible additional degradation under space pressure.

Our results suggest that the effect of free-space fragmentation on read performance is best described as accelerating the filesystem aging process. The effect on write performance is non-existent in some cases, and in most cases is an order of magnitude smaller than the read degradation from fragmentation caused by normal usage.

Speakers
AC

Alex Conway

Rutgers University
EK

Eric Knorr

Rutgers University
YJ

Yizheng Jiao

The University of North Carolina at Chapel Hill
MA

Michael A. Bender

Stony Brook University
WJ

William Jannen

Williams College
RJ

Rob Johnson

VMware Research
DP

Donald Porter

The University of North Carolina at Chapel Hill
MF

Martin Farach-Colton

Rutgers University


Tuesday July 9, 2019 11:10am - 11:40am PDT
HotStorage: Grand Ballroom I–III

11:20am PDT

LiveMicro: An Edge Computing System for Collaborative Telepathology
Telepathology is the practice of digitizing histological images for transmission along telecommunication pathways for diagnosis, consultation or continuing medical education. Existing telepathology solutions are limited to offline or delay-tolerant diagnosis.

In this paper we present LiveMicro, a telepathology system that, leveraging edge computing, enables multiple pathologists to collaborate on a diagnosis through remote live control of a microscope. In such an environment, computation at the edge is used in three ways: (1) to allow remote users to control the microscope simultaneously, (2) to process histological images and live video, by running algorithms that recognize, e.g., tumor grades, and (3) to preserve privacy by creating virtual shared data views. In particular, we built the first open-source, edge-computing-based telepathology system. In our prototype, the examples of edge processing that we currently support are extraction of diagnosis-oriented features and compression of payloads to minimize transmission delays. Our evaluation shows how LiveMicro can help a medical team with a remote, faster, and more accurate diagnosis.

Speakers
AS

Alessio Sacco

Politecnico di Torino
FE

Flavio Esposito

Saint Louis University
PO

Princewill Okorie

Saint Louis University
GM

Guido Marchetto

Politecnico di Torino


Tuesday July 9, 2019 11:20am - 11:35am PDT
HotEdge: Grand Ballroom VII–IX

11:35am PDT

Secure Incentivization for Decentralized Content Delivery
Prior research has proposed using peer-to-peer (P2P) content delivery to serve Internet video at lower costs. Yet, such methods have not witnessed widespread adoption. An important challenge is incentivization: what tangible benefits does P2P content delivery offer users who bring resources to the table? In this paper, we ask whether monetary incentives can help attract peers in P2P content delivery systems. We first propose Gringotts, a system to enable secure monetary incentives for P2P content delivery systems. Gringotts provides a novel Proof of Delivery mechanism that allows content providers to verify correct delivery of their files, and shows how to use cryptocurrency to pay peers while guarding against liars and Sybil attacks. We then present results from an 876-person professional survey we commissioned to understand users’ willingness to participate in Gringotts, and what challenges remain. Our survey revealed that 51% would participate for suitable financial incentives, and motivated the need for alternate payment forms, device security, and peer anonymity.
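The role of a delivery proof can be illustrated with a toy challenge-response: the provider issues a random nonce, and only a party that holds the delivered bytes can produce the matching digest. This is a deliberately simplified protocol, not Gringotts' actual Proof of Delivery mechanism:

```python
import hashlib
import secrets

def challenge():
    """Provider-issued random nonce, fresh per delivery."""
    return secrets.token_bytes(16)

def receipt(content: bytes, nonce: bytes) -> str:
    """Receipt binding the delivered bytes to the challenge."""
    return hashlib.sha256(content + nonce).hexdigest()

def verify(content: bytes, nonce: bytes, proof: str) -> bool:
    """Provider re-derives the digest from its own copy of the content."""
    return receipt(content, nonce) == proof

video = b"chunk-0042"
nonce = challenge()
proof = receipt(video, nonce)   # producible only after a real delivery
```

A fresh nonce per delivery is what prevents a lying peer from replaying an old proof, one of the attacks (alongside Sybils) the paper's mechanism must guard against before any cryptocurrency payment is released.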

Speakers


Tuesday July 9, 2019 11:35am - 11:50am PDT
HotEdge: Grand Ballroom VII–IX

11:40am PDT

EvFS: User-level, Event-Driven File System for Non-Volatile Memory
The extremely low latency of non-volatile memory (NVM) raises issues of latency in file systems. In particular, user-kernel context switches caused by system calls and hardware interrupts become a non-negligible performance penalty. A solution to this problem is using direct-access file systems, but existing work focuses on optimizing their non-POSIX user interfaces. In this work, we propose EvFS, our new user-level POSIX file system that directly manages NVM in user applications. EvFS minimizes the latency by building a user-level storage stack and introducing asynchronous processing of complex file I/O with page cache and direct I/O. We report that the event-driven architecture of EvFS leads to a 700-ns latency for 64-byte non-blocking file writes and reduces the latency for 4-Kbyte blocking file I/O by 20 µs compared to a kernel file system with journaling disabled.

Speakers
TY

Takeshi Yoshimura

IBM Research - Tokyo
TC

Tatsuhiro Chiba

IBM Research - Tokyo
HH

Hiroshi Horii

IBM Research - Tokyo


Tuesday July 9, 2019 11:40am - 12:10pm PDT
HotStorage: Grand Ballroom I–III

11:50am PDT

Navigating the Visual Fog: Analyzing and Managing Visual Data from Edge to Cloud
Visual data produced at the edge is rich with information, opening a world of analytics opportunities for applications to explore. However, the demanding requirements of visual data on computational resources and bandwidth have hindered effective processing, preventing the data from being used in an economically efficient manner. In order to scale out visual analytics systems, it is necessary to have a framework that works collaboratively between edge and cloud. In this paper, we propose an end-to-end (E2E) visual fog architecture, designed for processing and management of visual data. Using our architecture to extract shopper insights, we are able to meet application-specified real-time requirements for extracting and querying visual data, showing the feasibility of our design in a real-world setting. We also discuss the lessons we learned from deploying an edge-to-cloud architecture for video streaming applications.

Speakers
RA

Ragaad Altarawneh

Intel Labs, Intel Corporation
CS

Christina Strong

Intel Labs, Intel Corporation
LR

Luis Remis

Intel Labs, Intel Corporation
PM

Pablo Munoz

Intel Labs, Intel Corporation
AS

Addicam Sanjay

Internet of Things Group, Intel Corporation
SK

Srikanth Kambhatla

System Software Products, Intel Corporation


Tuesday July 9, 2019 11:50am - 12:05pm PDT
HotEdge: Grand Ballroom VII–IX

12:05pm PDT

Modeling The Edge: Peer-to-Peer Reincarnated
The rise of edge computing as a new storage and compute model has already motivated numerous studies within the systems community, focusing on the choices and mechanisms of task offloading from end devices to the edge infrastructure, pricing, consistency, indexing and caching. However, it is not yet entirely clear how the edge infrastructure itself will be deployed, and, more importantly, managed. A common point of view considers the edge as an extension of traditional content distribution networks (CDN), due to its hierarchical layout, centralized ownership, and cloud back-end.

In this paper, we consider a different view of the edge, as a "reincarnation" of the well-known peer-to-peer (P2P) model. We show how the edge is similar to P2P systems in many aspects, including the number, heterogeneity and limited availability and resources of its nodes, their central role in performing the system's storage and computation, and the vulnerabilities related to tight interoperability with user end devices. We describe the similarities of the edge to both CDNs and P2P systems, the challenges that arise from these similarities, and the previous approaches to address them in both contexts. We show that the challenges that remain in applying these approaches may be addressed by viewing the edge as a larger and smarter reincarnation of P2P systems.

Speakers
GY

Gala Yadgar

Computer Science Department, Technion
OK

Oleg Kolosov

Computer Science Department, Technion
MF

Mehmet Fatih Aktas

Department of Electrical and Computer Engineering, Rutgers University
ES

Emina Soljanin

Department of Electrical and Computer Engineering, Rutgers University


Tuesday July 9, 2019 12:05pm - 12:20pm PDT
HotEdge: Grand Ballroom VII–IX

12:10pm PDT

Luncheon for Workshop Attendees
Tuesday July 9, 2019 12:10pm - 2:00pm PDT
Olympic Pavilion

12:20pm PDT

Luncheon for Workshop Attendees
Tuesday July 9, 2019 12:20pm - 1:40pm PDT
Olympic Pavilion

1:40pm PDT

Towards a Serverless Platform for Edge AI
This paper proposes a serverless platform for building and operating edge AI applications. We analyze edge AI use cases to illustrate the challenges in building and operating AI applications in edge cloud scenarios. By elevating concepts from AI lifecycle management into the established serverless model, we enable easy development of edge AI workflow functions. We take a deviceless approach, i.e., we treat edge resources transparently as cluster resources, but give developers fine-grained control over scheduling constraints. Furthermore, we demonstrate the limitations of current serverless function schedulers, and present the current state of our prototype.

Speakers

Tuesday July 9, 2019 1:40pm - 1:55pm PDT
HotEdge: Grand Ballroom VII–IX

1:55pm PDT

An Edge-based Framework for Cooperation in Internet of Things Applications
Edge computing and the Internet of Things (IoT) are irrevocably intertwined, and much work has proposed enhancing the IoT through the use of edge computing. These solutions have typically focused on using the edge to increase the locality of cloud applications, achieving benefits mainly in terms of lower network latency. In this work, we argue that IoT systems can benefit much more from semantic properties that are best recognized and exploited in situ, at the edge of the network where the data streams and actuators exist. We outline the idea of a real-time semantic operating system, hosted on the edge, which can improve energy consumption, latency, and accuracy, not only for individual applications but for an entire IoT infrastructure. We have implemented a prototype system, show initial results demonstrating the efficacy of our proposed optimizations, and provide insights into how to handle some of the most critical issues faced in such a system.

Speakers
ZL

Zach Leidall

University of Minnesota, Twin Cities
AC

Abhishek Chandra

University of Minnesota
JW

Jon Weissman

University of Minnesota, Twin Cities


Tuesday July 9, 2019 1:55pm - 2:10pm PDT
HotEdge: Grand Ballroom VII–IX

2:00pm PDT

Specialize in Moderation—Building Application-aware Storage Services using FPGAs in the Datacenter
In order to keep up with big data workloads, distributed storage needs to offer low-latency, high-bandwidth, and energy-efficient access to data. To achieve these properties, most state-of-the-art solutions focus exclusively on either software- or hardware-based implementations. FPGAs are an example of the latter and a promising platform for building storage nodes, but they are more cumbersome to program and less flexible than software, which limits their adoption.

We make the case that, in order to be feasible in the cloud, solutions designed around programmable hardware, such as FPGAs, have to follow a service provider-centric methodology: the hardware should only provide functionality that is useful across all tenants and rarely changes. Conversely, application-specific functionality should be delivered through software that, in a cloud setting, is under the provider's control. Deploying FPGAs this way is less cumbersome, requires less hardware programming, and increases overall flexibility.

We demonstrate the benefits of this approach by building an application-aware storage for Parquet files, a columnar data format widely used in big data frameworks. Our prototype offers transparent 10Gbps deduplication in hardware without sacrificing low latency operation and specializes to Parquet files using a companion library. This work paves the way for in-storage filtering of columnar data without having to implement file-type and tenant-specific parsing in the FPGA.

Speakers
LK

Lucas Kuhring

IMDEA Software Institute, Madrid, Spain
EG

Eva Garcia

Universidad Autónoma de Madrid, Spain
ZI

Zsolt István

IMDEA Software Institute, Madrid, Spain


Tuesday July 9, 2019 2:00pm - 2:30pm PDT
HotStorage: Grand Ballroom I–III

2:10pm PDT

OneOS: IoT Platform based on POSIX and Actors
Recent interest in Edge/Fog Computing has pushed IoT Platforms to support a broader range of general-purpose workloads. We propose a design of an IoT Platform called OneOS, inspired by Distributed OS and micro-kernel principles, providing a single system image of the IoT network. OneOS aims to preserve the portability of applications by reusing a subset of the POSIX interface at a higher layer over a flat group of Actors. As a distributed middleware, OneOS achieves its goal through evaluation context replacement, which enables a process to run in a virtual context rather than its local context.

Speakers
KJ

Kumseok Jung

University of British Columbia
JG

Julien Gascon-Samson

ÉTS Montréal
KP

Karthik Pattabiraman

University of British Columbia


Tuesday July 9, 2019 2:10pm - 2:25pm PDT
HotEdge: Grand Ballroom VII–IX

2:25pm PDT

Open Infrastructure for Edge: A Distributed Ledger Outlook
High demand for low-latency services and local data processing has given rise to edge computing. As opposed to cloud computing, in this new paradigm computational facilities are located close to the end-users and data producers, on the edge of the network, hence the name. The critical issue for the proliferation of edge computing is the availability of local computational resources. Major cloud providers are already addressing the problem by establishing facilities in the proximity of end-users. However, there is an alternative trend, namely, developing open infrastructure as a set of standards, technologies, and practices to enable any motivated parties to offer their computational capacity for the needs of edge computing. Open infrastructure can give an additional boost to this new promising paradigm and, moreover, help to avoid problems for which cloud computing has long been criticized, such as vendor lock-in or privacy. In this paper, we discuss the challenges related to creating such an open infrastructure, in particular focusing on the applicability of distributed ledgers for contractual agreement and payment. Solving the challenge of contracting is central to realizing an open infrastructure for edge computing, and we highlight the potential and shortcomings of distributed ledger technologies in the context of our use case.

Speakers
AZ

Aleksandr Zavodovski

University of Helsinki
NM

Nitinder Mohan

University of Helsinki
WW

Walter Wong

University of Helsinki
JK

Jussi Kangasharju

University of Helsinki


Tuesday July 9, 2019 2:25pm - 2:40pm PDT
HotEdge: Grand Ballroom VII–IX

2:30pm PDT

Automating Context-Based Access Pattern Hint Injection for System Performance and Swap Storage Durability
Memory pressure is inevitable as the size of working sets is rapidly growing while the capacity of dynamic random access memory (DRAM) is not. Meanwhile, storage devices have evolved so that their speed is comparable to that of DRAM while their capacity is comparable to that of hard disk drives (HDDs). Thus, hierarchical memory systems that configure DRAM as main memory and high-end storage devices as swap devices will become common.

Due to the unique characteristics of these modern storage devices, the swap target decision should be optimal. It is essential to know the exact data access patterns of workloads for such an optimal decision, although underlying systems cannot accurately estimate such complex and dynamic patterns. For this reason, memory systems allow programs to voluntarily hint their data access pattern. Nevertheless, it is exhausting for a human to manually figure out the patterns and embed optimal hints if the workloads are huge and complex.

This paper introduces a compiler extension that automatically optimizes a program to voluntarily hint its dynamic data access patterns to the underlying swap system, using profiling results based on static/dynamic analysis. To the best of our knowledge, this is the first profile-guided optimization (PGO) for modern swap devices. Our empirical evaluation of the scheme using realistic workloads shows consistent improvements in performance and swap device lifetime of up to 2.65x and 2.98x, respectively.

Speakers
SP

SeongJae Park

Seoul National University
YL

Yunjae Lee

Seoul National University
MK

Moonsub Kim

Seoul National University
HY

Heon Y. Yeom

Seoul National University


Tuesday July 9, 2019 2:30pm - 3:00pm PDT
HotStorage: Grand Ballroom I–III

2:40pm PDT

HydraOne: An Indoor Experimental Research and Education Platform for CAVs
Connected and autonomous vehicles (CAVs) are currently a hot topic and a major focus in the field of edge computing, and they have created numerous pivotal and challenging research problems. In this paper, we present HydraOne, an indoor experimental research and education platform for edge computing in the CAV scenario. HydraOne is a hardware-software co-design platform built from scratch based on our experience with the requirements of edge computing research problems. We present the design and implementation details and discuss three key characteristics of HydraOne: design modularization, resource extensibility and openness, and function isolation. These will help researchers and students fully understand the platform and take advantage of it to conduct research experiments. We also provide three case studies deployed on HydraOne to demonstrate the capabilities of our research platform.

Speakers
YW

Yifan Wang

Institute of Computing Technology, Chinese Academy of Sciences
LL

Liangkai Liu

Wayne State University
XZ

Xingzhou Zhang

Institute of Computing Technology, Chinese Academy of Sciences
WS

Weisong Shi

Wayne State University


Tuesday July 9, 2019 2:40pm - 2:55pm PDT
HotEdge: Grand Ballroom VII–IX

2:55pm PDT

DeCaf: Iterative Collaborative Processing over the Edge
The increase in privacy concerns among users has led to edge-based analytics applications such as federated learning, which trains machine learning models in an iterative and collaborative fashion on the edge devices without sending the raw private data to the central cloud. In this paper, we propose a system for enabling iterative collaborative processing (ICP) in resource-constrained edge environments. We first identify the unique systems challenges posed by ICP, which are not addressed by existing distributed machine learning frameworks such as the parameter server. We then propose the system components necessary for ICP to work well in highly distributed edge environments. Based on this, we propose a system design for enabling such applications over the edge. We show the benefits of our proposed system components with a preliminary evaluation.

Speakers
DK

Dhruv Kumar

University of Minnesota, Twin Cities
AA

Aravind Alagiri Ramkumar

University of Minnesota, Twin Cities
RS

Rohit Sindhu

University of Minnesota, Twin Cities
AC

Abhishek Chandra

University of Minnesota


Tuesday July 9, 2019 2:55pm - 3:10pm PDT
HotEdge: Grand Ballroom VII–IX

3:00pm PDT

Caching in the Multiverse
To get good performance for data stored in object storage services like S3, data analysis clusters need to cache data locally. Recently these caches have started taking into account higher-level information from analysis frameworks, allowing prefetching based on predictions of future data accesses. There is, however, a broader opportunity; rather than using this information to predict one future, we can use it to select a future that is best for caching. This paper provides preliminary evidence that we can exploit the directed acyclic graph (DAG) of inter-task dependencies used by data-parallel frameworks such as Spark, Pig, and Hive to improve application performance, by optimizing caching for the critical path through the application's DAG. We present experimental results for Pig running TPC-H queries, showing completion time improvements of up to 23% vs. our implementation of MRD, a state-of-the-art DAG-based prefetching system, and improvements of up to 2.5x vs. LRU caching. We then discuss the broader opportunity for building a system based on this approach.
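
The core idea, prioritizing cache space for tasks on the DAG's critical path, can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the task names, runtimes, and the simple longest-path computation are assumptions.

```python
# Sketch: rank cache candidates by whether they lie on the critical path of a
# task DAG, assuming per-task runtimes are known or estimated.

def critical_path(tasks, deps, runtime):
    """tasks: list of ids; deps: {task: [prerequisites]}; runtime: {task: cost}.
    Returns the set of tasks on the longest (critical) path."""
    order, seen = [], set()

    def visit(t):  # topological order via DFS
        if t in seen:
            return
        seen.add(t)
        for p in deps.get(t, []):
            visit(p)
        order.append(t)

    for t in tasks:
        visit(t)

    dist, pred = {}, {}
    for t in order:
        best, bp = 0, None
        for p in deps.get(t, []):
            if dist[p] > best:
                best, bp = dist[p], p
        dist[t] = best + runtime[t]
        pred[t] = bp

    # walk back from the task with the largest finish time
    t = max(dist, key=dist.get)
    path = set()
    while t is not None:
        path.add(t)
        t = pred[t]
    return path

# hypothetical query plan: two scans feeding a filter and a join
dag = {"scan": [], "filter": ["scan"], "join": ["filter", "scan2"], "scan2": []}
cost = {"scan": 5, "scan2": 1, "filter": 3, "join": 2}
cp = critical_path(list(dag), dag, cost)
```

A cache manager could then pin or prefetch the outputs of tasks in `cp` first, since delaying any of them delays the whole query.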

Speakers
MA

Mania Abdi

Northeastern University
AM

Amin Mosayyebzadeh

Boston University
MH

Mohammad Hossein Hajkazemi

Northeastern University
AT

Ata Turk

State Street
OK

Orran Krieger

Boston University
PD

Peter Desnoyers

Northeastern University


Tuesday July 9, 2019 3:00pm - 3:30pm PDT
HotStorage: Grand Ballroom I–III

3:10pm PDT

Shimmy: Shared Memory Channels for High Performance Inter-Container Communication
With the increasing need for more reactive services and the need to process large amounts of IoT data, edge clouds are emerging to enable applications to be run close to the users and/or devices. Following the trend in hyperscale clouds, applications are trending toward a microservices architecture where the application is decomposed into smaller pieces that can each run in its own container and communicate with each other over a network through well-defined APIs. This reduces development effort and improves deployability, but it also introduces inefficiencies in communication. In this paper, we rethink the communication model and introduce the ability to create shared memory channels between containers, supporting both a pub/sub model and a streaming model. Our approach is not only applicable to edge clouds but also beneficial in core cloud environments. Local communication is made more efficient, and remote communication is efficiently supported through synchronizing shared memory regions via RDMA.
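
A minimal in-process analogue of such a shared-memory channel can be built with Python's `multiprocessing.shared_memory`. This is not Shimmy's implementation; a real inter-container channel would add ring buffers, signaling, and the RDMA-backed remote path, and the segment name here is invented.

```python
# Two endpoints map the same named segment; a length-prefixed slot carries one
# message between them without copying through the network stack.
from multiprocessing import shared_memory
import struct

seg = shared_memory.SharedMemory(create=True, size=4096, name="chan_demo")
try:
    # producer endpoint: write a length-prefixed payload
    payload = b"sensor-frame-17"
    seg.buf[0:4] = struct.pack("<I", len(payload))
    seg.buf[4:4 + len(payload)] = payload

    # consumer endpoint: attach to the same segment by name and read
    peer = shared_memory.SharedMemory(name="chan_demo")
    (n,) = struct.unpack("<I", bytes(peer.buf[0:4]))
    msg = bytes(peer.buf[4:4 + n])
    peer.close()
finally:
    seg.close()
    seg.unlink()
```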

Speakers
MA

Marcelo Abranches

University of Colorado, Boulder
SG

Sepideh Goodarzy

University of Colorado, Boulder
MN

Maziyar Nazari

University of Colorado, Boulder
SM

Shivakant Mishra

University of Colorado, Boulder
EK

Eric Keller

University of Colorado, Boulder


Tuesday July 9, 2019 3:10pm - 3:25pm PDT
HotEdge: Grand Ballroom VII–IX

3:25pm PDT

The Case for Determinism on the Edge
Emerging edge applications, such as augmented and virtual reality, real-time video analytics and thin-client gaming, are latency-sensitive, resource-intensive, and stateful. Transitioning these applications from cloud deployments to the edge is non-trivial since edge deployments will exhibit variable resource availability, significant user mobility, and high potential for faults and application preemption, requiring considerable developer effort per application to maintain stable quality of experience for the user.

In this paper, we propose deterministic containers, a new abstraction that simplifies the development of complex applications on the edge. Deterministic containers enforce the property that all activity within a container behaves deterministically. Determinism provides replication, which in turn provides key benefits for edge computing, including resilience to performance jitter, enhanced fault tolerance, seamless migration, and data provenance.

We are currently building a prototype, Shadow, that aims to provide deterministic containers with minimal performance overhead while requiring few application modifications. For all sources of non-determinism, Shadow either converts the behavior to be deterministic or restricts the allowable application behavior. Preliminary results indicate that using Shadow to reduce performance jitter at the edge for a vehicle caravan application involving video analytics reduces median application response time by up to 25%.

Speakers
MF

Matthew Furlong

University of Michigan
AQ

Andrew Quinn

University of Michigan
JF

Jason Flinn

University of Michigan


Tuesday July 9, 2019 3:25pm - 3:40pm PDT
HotEdge: Grand Ballroom VII–IX

4:00pm PDT

Reducing Garbage Collection Overhead in SSD Based on Workload Prediction
In solid-state drives (SSDs), garbage collection (GC) plays a key role in making free NAND blocks for newly arriving data. The data copied from one block to another by GC significantly affects both the performance and the lifetime of the SSD. Placing data with different “temperature” into different NAND blocks can reduce the data copy overhead of GC. This paper proposes a scheme to place data according to its predicted future temperature. A long short-term memory (LSTM) neural network is applied to increase the accuracy of temperature prediction in both the temporal and spatial dimensions. K-Means clustering is then used to automatically dispatch data with similar “future temperature” to the same NAND blocks. The results show that performance and the write amplification factor (WAF) are improved in various applications. In the best case, the WAF and the 99.99th-percentile write latency are reduced by up to 43.5% and 79.3%, respectively.
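
The placement step can be illustrated with a toy two-cluster 1-D k-means over predicted page temperatures. The scores and the plain k-means below are stand-ins; the paper's LSTM predictor is assumed to have already produced the temperatures.

```python
# Cluster pages by predicted "future temperature" and dispatch each cluster to
# its own block group, so hot and cold data do not share NAND blocks.

def kmeans_1d(values, iters=20):
    """Two-cluster 1-D k-means; returns (centers, groups)."""
    centers = [min(values), max(values)]  # simple two-cluster init
    groups = [[], []]
    for _ in range(iters):
        groups = [[], []]
        for v in values:
            i = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[i].append(v)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers, groups

# hypothetical predicted write temperature per logical page
temps = [0.9, 0.85, 0.1, 0.12, 0.95, 0.08]
centers, groups = kmeans_1d(temps)
hot, cold = ((groups[1], groups[0]) if centers[1] > centers[0]
             else (groups[0], groups[1]))
```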

Speakers
PY

Pan Yang

Samsung R&D Institute China Xi'an, Samsung Electronics
NX

Ni Xue

Samsung R&D Institute China Xi'an, Samsung Electronics
YZ

Yuqi Zhang

Samsung R&D Institute China Xi'an, Samsung Electronics
YZ

Yangxu Zhou

Samsung R&D Institute China Xi'an, Samsung Electronics
LS

Li Sun

Samsung R&D Institute China Xi'an, Samsung Electronics
WC

Wenwen Chen

Samsung R&D Institute China Xi'an, Samsung Electronics
ZC

Zhonggang Chen

Samsung R&D Institute China Xi'an, Samsung Electronics
WX

Wei Xia

Samsung R&D Institute China Xi'an, Samsung Electronics
JL

Junke Li

Samsung R&D Institute China Xi'an, Samsung Electronics
KK

Kihyoun Kwon

Samsung R&D Institute China Xi'an, Samsung Electronics


Tuesday July 9, 2019 4:00pm - 4:30pm PDT
HotStorage: Grand Ballroom I–III

4:10pm PDT

SMC: Smart Media Compression for Edge Storage Offloading
With the pervasiveness and growth of media technology, user-generated content has become intertwined with our day-to-day life. Such advancements, however, have enabled exponential growth in media file sizes, which leads to a shortage of storage on small-scale edge devices. Online clouds are a potential solution; however, they raise privacy concerns, are not fully automated, and do not adapt to different networking environments (rural/urban/metropolitan). Distributed storage systems rely on their distributed nature to combat concerns over privacy and are adaptable to different networking environments. Nevertheless, such systems lack optimization via compression due to energy concerns on edge devices. In this work, we propose Smart Media Compression (SMC) for distributed edge storage systems. SMC dynamically adjusts compression parameters in order to reduce the amount of needless compression, thus reducing energy consumption while providing smaller user file access delays. Our results show an improvement in average file access delay of up to 90%, at a cost of only an additional 14% in energy consumption.

Speakers
AE

Ali E. Elgazar

Carnegie Mellon University
MA

Mohammad Aazam

Carnegie Mellon University
KA

Khaled A. Harras

Carnegie Mellon University


Tuesday July 9, 2019 4:10pm - 4:25pm PDT
HotEdge: Grand Ballroom VII–IX

4:25pm PDT

Edge-based Transcoding for Adaptive Live Video Streaming
User-generated video content is imposing an increasing burden on live video service architectures such as Facebook Live. These services are responsible for ingesting large amounts of video, transcoding that video into different quality levels (i.e., bitrates), and adaptively streaming it to viewers. These tasks are expensive, both computationally and network-wise, often forcing service providers to sacrifice the “liveness” of delivered video. Given the steady increases in smartphone bandwidth and energy resources, coupled with the fact that commodity smartphones now include hardware-accelerated codecs, we propose that live video services augment their existing infrastructure with edge support for transcoding and transmission. We present measurements to motivate the feasibility of incorporating such edge-support into the live video ecosystem, present the design of a peer-to-peer adaptive live video streaming system, and discuss directions for future work to realize this vision in practice.

Tuesday July 9, 2019 4:25pm - 4:40pm PDT
HotEdge: Grand Ballroom VII–IX

4:30pm PDT

Sentinel Cells Enabled Fast Read for NAND Flash
With the latest developments, NAND flash is experiencing increasing error rates. The read reference voltages are the key factor in the raw bit error rate (RBER) seen by the error correction code (ECC). The limited error correction capability of the ECC determines a value range that the read voltages should fall into; otherwise a read failure occurs, followed by a read retry with a tuned read voltage. Therefore, finding a correct read voltage with the smallest number of read failures has been a hot research problem. Previous methods in the literature are designed to either progressively tune the voltage value or empirically predict a read voltage based on error models. However, straightforward tuning leads to an unpredictably large number of read retries, whereas complex modeling brings large overhead. This paper proposes a novel approach: reserving a small set of cells as sentinels, which directly tell us the optimal voltage, since drift-induced errors exhibit strong locality. Experiments demonstrate that the proposed technique is both efficient and effective.
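
The sentinel idea can be illustrated with a toy model (not the paper's mechanism): program a known pattern into reserved cells, read it back at candidate reference voltages, and use the voltage with the fewest sentinel bit errors for the real page. The drift model and thresholds below are invented for illustration.

```python
# Pick a read reference voltage by probing sentinel cells whose contents are
# known in advance; since drift affects neighboring cells similarly, the best
# voltage for the sentinels is a good voltage for the data page.

KNOWN = [1, 0, 1, 1, 0, 0, 1, 0]  # pattern programmed into sentinel cells

def read_sentinels(vref, drift):
    # hypothetical device model: a cell reads wrong if vref misses its
    # drifted threshold by 0.5 V or more
    return [bit if abs(vref - drift) < 0.5 else 1 - bit for bit in KNOWN]

def pick_vref(candidates, drift):
    return min(candidates, key=lambda v: sum(
        a != b for a, b in zip(read_sentinels(v, drift), KNOWN)))

best = pick_vref([0.0, 0.4, 0.8, 1.2], drift=0.9)
```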

Speakers
QL

Qiao Li

Department of Computer Science, City University of Hong Kong
MY

Min Ye

YEESTOR Microelectronics Co., Ltd
YC

Yufei Cui

Department of Computer Science, City University of Hong Kong
LS

Liang Shi

School of Computer Science and Software Engineering, East China Normal University
XL

Xiaoqiang Li

YEESTOR Microelectronics Co., Ltd
CJ

Chun Jason Xue

Department of Computer Science, City University of Hong Kong


Tuesday July 9, 2019 4:30pm - 5:00pm PDT
HotStorage: Grand Ballroom I–III

4:40pm PDT

Toward Optimal Performance with Network Assisted TCP at Mobile Edge
In contrast to the classic fashion of designing distributed end-to-end (e2e) TCP schemes for cellular networks (CN), we explore another design space by having the CN assist the task of transport control. We show that in emerging cellular architectures such as mobile/multi-access edge computing (MEC), where the servers are located close to the radio access network (RAN), significant improvements can be achieved by leveraging the logically centralized network measurements at the RAN and passing information such as the minimum e2e delay and access link capacity to each server. In particular, a Network Assistance module (located at the mobile edge) pairs up with the wireless scheduler to provide feedback information to each server and facilitate the task of congestion control. To that end, we present two network-assisted schemes called NATCP (a clean-slate design replacing TCP at end-hosts) and NACubic (a backward-compatible design requiring no change to TCP at end-hosts). Our preliminary evaluations using real cellular traces show that both schemes dramatically outperform existing schemes in both single-flow and multi-flow scenarios.
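
For the backward-compatible NACubic variant, the essence is to clamp the sender's normal CUBIC window using the capacity and minimum delay reported by the network. A hedged sketch, with illustrative constants and function names that are not from the paper:

```python
# Clamp a CUBIC-derived congestion window to the bandwidth-delay product
# computed from the RAN-reported link capacity and minimum e2e delay.

def clamped_cwnd(cubic_cwnd, link_capacity_mbps, min_rtt_s, mss_bytes=1500):
    # bandwidth-delay product in packets, from network-reported values
    bdp_pkts = (link_capacity_mbps * 1e6 / 8) * min_rtt_s / mss_bytes
    return min(cubic_cwnd, int(bdp_pkts))

# CUBIC wants 400 packets, but the RAN reports 50 Mbps and 20 ms min delay
w = clamped_cwnd(400, 50, 0.020)
```

The clamp prevents CUBIC from overshooting the access link, which is where cellular queues build up.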

Speakers
YX

Yang Xu

Fudan University
HS

Hang Shi

Futurewei Technologies
UC

Ulas C. Kozat

Futurewei Technologies
YY

Yinghua Ye

Futurewei Technologies


Tuesday July 9, 2019 4:40pm - 4:55pm PDT
HotEdge: Grand Ballroom VII–IX

4:55pm PDT

Throwing MUD into the FOG: Defending IoT and Fog by expanding MUD to Fog network
Manufacturer Usage Description (MUD) is a proposed IETF standard enabling local area networks (LANs) to automatically configure their access control when adding a new IoT device, based on the recommendations provided for that device by the manufacturer. MUD has been proposed as an isolation-based defensive mechanism with a focus on devices in the home, where there is no dedicated network administrator. In this paper, we describe the efficacy of MUD for a generic IoT device under different threat scenarios in the context of the Fog. We propose a method that uses rate limiting to prevent end devices from participating in distributed denial-of-service (DDoS) attacks, including against the Fog itself. We illustrate our assumptions with a possible real-world example and describe the benefits of MUD in the Fog for various stakeholders.
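
The proposed rate limiting could take the familiar token-bucket form at a MUD-aware gateway. The per-device limits below are hypothetical, not values from the MUD specification.

```python
# A per-device token bucket: a compromised IoT device that blasts traffic can
# emit at most its configured burst, then is held to its steady rate.

class TokenBucket:
    def __init__(self, rate_pps, burst):
        self.rate, self.burst = rate_pps, burst
        self.tokens = burst      # start with a full bucket
        self.last = 0.0

    def allow(self, now):
        # refill proportionally to elapsed time, capped at the burst size
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate_pps=10, burst=5)
# a device blasting 100 packets in one instant: only the burst gets through
sent = sum(bucket.allow(now=0.0) for _ in range(100))
```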

Speakers
VA

Vafa Andalibi

Indiana University Bloomington
DK

DongInn Kim

Indiana University Bloomington
LJ

L. Jean Camp

Indiana University Bloomington


Tuesday July 9, 2019 4:55pm - 5:10pm PDT
HotEdge: Grand Ballroom VII–IX

5:00pm PDT

Fair Resource Allocation in Consolidated Flash Systems
We argue that, along with bandwidth and capacity, the lifetime of flash devices is also a critical resource that needs to be explicitly and carefully managed, especially in emerging consolidated environments. We study the resulting multi-resource allocation problem in a setting where "fairness" across consolidated workloads is desired. Towards this, we propose to adapt the well-known notion of dominant resource fairness (DRF). We empirically show that using DRF with only bandwidth and capacity (and ignoring lifetime) may result in poor device lifetime. Incorporating lifetime, however, turns out to be non-trivial. We identify key challenges in this adaptation and present simple heuristics. We also discuss possible design choices, which will be fully explored in future work.
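
A sketch of DRF extended with lifetime as a third resource, as proposed above. The demands, capacities, and the greedy progressive-filling loop are illustrative, not the authors' heuristics.

```python
# Dominant resource fairness over bandwidth, capacity, and lifetime: repeatedly
# grant one task to the user with the smallest dominant share until a resource
# is exhausted.

def drf_allocate(demands, capacity, steps=100):
    """demands: {user: {resource: per-task demand}}; returns tasks per user."""
    alloc = {u: 0 for u in demands}
    used = {r: 0.0 for r in capacity}
    for _ in range(steps):
        # dominant share = max over resources of the user's used fraction
        def dom(u):
            return max(alloc[u] * d / capacity[r]
                       for r, d in demands[u].items())
        u = min(demands, key=dom)
        need = demands[u]
        if any(used[r] + need[r] > capacity[r] for r in need):
            break  # next grant would exceed some resource
        for r in need:
            used[r] += need[r]
        alloc[u] += 1
    return alloc

cap = {"bandwidth": 100, "capacity": 100, "lifetime": 100}
dem = {"A": {"bandwidth": 10, "capacity": 2, "lifetime": 1},   # bandwidth-heavy
       "B": {"bandwidth": 2, "capacity": 2, "lifetime": 10}}   # lifetime-heavy
shares = drf_allocate(dem, cap)
```

Because A's dominant resource is bandwidth and B's is lifetime, DRF equalizes their dominant shares rather than their raw bandwidth use.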

Speakers
WC

Wonil Choi

Pennsylvania State University
BU

Bhuvan Urgaonkar

Pennsylvania State University
MK

Mahmut Kandemir

Pennsylvania State University


Tuesday July 9, 2019 5:00pm - 5:30pm PDT
HotStorage: Grand Ballroom I–III

5:10pm PDT

Devices-as-Services: Rethinking Scalable Service Architectures for the Internet of Things
We investigate a new distributed services model and architecture for Internet of Things (IoT) applications. In particular, we observe that devices at the edge of the network, although resource constrained, are increasingly capable, performing actions (e.g., data analytics, decision support, actuation, and control) in addition to event telemetry. Thus, such devices are better modeled as servers, which applications in the cloud compose for their functionality. We investigate the implications of this "flipped" IoT client-server model for server discovery, authentication, and resource use. We find that by combining capability-based security with an edge-aware registry, this model can achieve fast response and energy efficiency.

Speakers
FB

Fatih Bakir

Univ. of California, Santa Barbara
RW

Rich Wolski

Univ. of California, Santa Barbara
CK

Chandra Krintz

Univ. of California, Santa Barbara
GS

Gowri Sankar Ramachandran

Univ. of Southern California


Tuesday July 9, 2019 5:10pm - 5:25pm PDT
HotEdge: Grand Ballroom VII–IX

5:25pm PDT

Publish-Pay-Subscribe Protocol for Payment-driven Edge Computing
IoT applications are starting to rely heavily on edge computing due to the advent of low-power, high-data-rate wireless communication technologies such as 5G and the processing capability of GPU-driven edge platforms. However, the computation and data communication models of edge computing applications are quite diverse, which limits their interoperability. An interoperable edge computing architecture with a versatile communication model would lead to the development of innovative and incentive-driven edge computing applications by combining various data sources from a wide array of devices. In this paper, we present an edge computing architecture that extends the publish-subscribe protocol with support for incentives. Our novel publish-pay-subscribe protocol enables data producers (publishers) to sell their data to data consumers and service providers (subscribers). The proposed architecture not only allows device owners to gain incentives but also enables service providers to sell processed data to one or more data consumers. Our proof-of-concept implementation using the AEDES publish-subscribe broker and the Ethereum cryptocurrency shows the feasibility of the publish-pay-subscribe broker and its support for data-driven and incentive-based edge computing applications.
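
A minimal model of the publish-pay-subscribe flow (not the authors' AEDES/Ethereum implementation; topic names and prices are invented): a subscriber receives messages on a topic only after paying the topic's publisher.

```python
# A toy broker: publishers advertise a price per topic, subscribers pay to
# subscribe, payments are credited to the publisher, and only paid subscribers
# receive published data.

class PayBroker:
    def __init__(self):
        self.price = {}     # topic -> (publisher, price)
        self.paid = set()   # (subscriber, topic) pairs that have paid
        self.earnings = {}  # publisher -> total credited
        self.inbox = {}     # subscriber -> delivered messages

    def publish_offer(self, publisher, topic, price):
        self.price[topic] = (publisher, price)

    def pay_subscribe(self, subscriber, topic, amount):
        publisher, price = self.price[topic]
        if amount < price:
            return False  # underpayment: subscription rejected
        self.earnings[publisher] = self.earnings.get(publisher, 0) + amount
        self.paid.add((subscriber, topic))
        return True

    def publish(self, topic, data):
        for sub, t in self.paid:
            if t == topic:
                self.inbox.setdefault(sub, []).append(data)

broker = PayBroker()
broker.publish_offer("camera-1", "video/frames", price=3)
ok = broker.pay_subscribe("analytics-svc", "video/frames", amount=3)
broker.publish("video/frames", "frame-001")
```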

Speakers
GS

Gowri Sankar Ramachandran

Univ. of Southern California
SL

Sharon L.G Contreras

University of Southern California
BK

Bhaskar Krishnamachari

University of Southern California


Tuesday July 9, 2019 5:25pm - 5:40pm PDT
HotEdge: Grand Ballroom VII–IX

7:00pm PDT

USENIX Women in Advanced Computing (WiAC) BoF
Let’s talk about women in advanced computing. All registered attendees—of all genders—are welcome to attend this BoF.

Tuesday July 9, 2019 7:00pm - 8:00pm PDT
West Seattle Room

8:00pm PDT

Students and Young Professionals Meetup
Come for the refreshments, stay for the opportunity to meet and network with other students and young professionals.

Tuesday July 9, 2019 8:00pm - 9:00pm PDT
West Seattle Room

9:00pm PDT

Board Game Night
Join us for some good old-fashioned board games. We’ll have some on hand, but bring your own games, too!

Tuesday July 9, 2019 9:00pm - 11:00pm PDT
West Seattle Room
 
Wednesday, July 10
 

7:30am PDT

Badge Pickup
Wednesday July 10, 2019 7:30am - 5:00pm PDT
Grand Stair Prefunction

7:45am PDT

Continental Breakfast
Wednesday July 10, 2019 7:45am - 8:45am PDT
Grand Ballroom Prefunction

8:45am PDT

Opening Remarks and Awards
Speakers
DM

Dahlia Malkhi

VMware Research
DT

Dan Tsafrir

Technion—Israel Institute of Technology & VMware Research


Wednesday July 10, 2019 8:45am - 9:00am PDT
Grand Ballroom

9:00am PDT

Measure, Then Build
This talk will discuss an approach to systems research based upon an iterative, empirically-driven approach, loosely known as "measure, then build." By first carefully analyzing the state of the art, one can learn what the real problems in today's systems are; by then designing and implementing new systems to solve said problems, one can ensure that one's work is both relevant and important. The talk will draw on examples in research done at the University of Wisconsin-Madison over nearly two decades, including research into Linux file systems, enterprise storage, key-value storage systems, and distributed systems.

Speakers
RA

Remzi Arpaci-Dusseau

University of Wisconsin-Madison
Remzi Arpaci-Dusseau is the Grace Wahba professor of Computer Sciences at UW-Madison. He co-leads a research group with Professor Andrea Arpaci-Dusseau. Together, they have graduated 24 Ph.D. students and won numerous best-paper awards; many of their innovations are used by commercial... Read More →


Wednesday July 10, 2019 9:00am - 10:00am PDT
Grand Ballroom

10:00am PDT

Lightning Talks
Wednesday July 10, 2019 10:00am - 10:50am PDT
Grand Ballroom

10:50am PDT

Break with Refreshments
Wednesday July 10, 2019 10:50am - 11:20am PDT
Grand Ballroom Prefunction

11:20am PDT

The Design and Operation of CloudLab
Given the highly empirical nature of research in cloud computing, networked systems, and related fields, testbeds play an important role in the research ecosystem. In this paper, we cover one such facility, CloudLab, which supports systems research by providing raw access to programmable hardware, enabling research at large scales, and creating a shared platform for repeatable research.

We present our experiences designing CloudLab and operating it for four years, serving nearly 4,000 users who have run over 79,000 experiments on 2,250 servers, switches, and other pieces of datacenter equipment. From this experience, we draw lessons organized around two themes. The first set comes from analysis of data regarding the use of CloudLab: how users interact with it, what they use it for, and the implications for facility design and operation. Our second set of lessons comes from looking at the ways that algorithms used "under the hood," such as resource allocation, have important, and sometimes unexpected, effects on user experience and behavior. These lessons can be of value to the designers and operators of IaaS facilities in general, systems testbeds in particular, and users who have a stake in understanding how these systems are built.

Speakers
DD

Dmitry Duplyakin

University of Utah
RR

Robert Ricci

University of Utah
AM

Aleksander Maricq

University of Utah
GW

Gary Wong

University of Utah
JD

Jonathon Duerig

University of Utah
EE

Eric Eide

University of Utah
LS

Leigh Stoller

University of Utah
MH

Mike Hibler

University of Utah
DJ

David Johnson

University of Utah
KW

Kirk Webb

University of Utah
AA

Aditya Akella

University of Wisconsin - Madison
KW

Kuangching Wang

Clemson University
GR

Glenn Ricart

US Ignite
LL

Larry Landweber

University of Wisconsin - Madison
MZ

Michael Zink

University of Massachusetts Amherst
EC

Emmanuel Cecchet

University of Massachusetts Amherst
SK

Snigdhaswin Kar

Clemson University
PM

Prabodh Mishra

Clemson University


Wednesday July 10, 2019 11:20am - 11:40am PDT
USENIX ATC Track I: Grand Ballroom I–VI

11:20am PDT

Partisan: Scaling the Distributed Actor Runtime
We present the design of an alternative runtime system for improved scalability and reduced latency in actor applications called Partisan. Partisan provides higher scalability by allowing the application developer to specify the network overlay used at runtime without changing application semantics, thereby specializing the network communication patterns to the application. Partisan reduces message latency through a combination of three predominantly automatic optimizations: parallelism, named channels, and affinitized scheduling. We implement a prototype of Partisan in Erlang and demonstrate that Partisan achieves up to an order of magnitude increase in the number of nodes the system can scale to through runtime overlay selection, up to a 34.96x increase in throughput, and up to a 13.4x reduction in latency over Distributed Erlang.

Speakers
CS

Christopher S. Meiklejohn

Carnegie Mellon University
HM

Heather Miller

Carnegie Mellon University
PA

Peter Alvaro

UC Santa Cruz


Wednesday July 10, 2019 11:20am - 11:40am PDT
USENIX ATC Track II: Grand Ballroom VII–IX

11:40am PDT

Everyone Loves File: File Storage Service (FSS) in Oracle Cloud Infrastructure
File Storage Service (FSS) is an elastic filesystem provided as a managed NFS service in Oracle Cloud Infrastructure. Using a pipelined Paxos implementation, we implemented a scalable block store that provides linearizable multipage limited-size transactions. On top of the block store, we built a scalable B-tree that provides linearizable multikey limited-size transactions. By using self-validating B-tree nodes and performing all B-tree housekeeping operations as separate transactions, each key in a B-tree transaction requires only one page in the underlying block transaction. The B-tree holds the filesystem metadata. The filesystem provides snapshots by using versioned key-value pairs. The entire system is programmed using a nonblocking lock-free programming style. The presentation servers maintain no persistent local state, with any state kept in the B-tree, making it easy to scale up and fail over the presentation servers. We use a non-scalable Paxos-replicated hash table to store configuration information required to bootstrap the system. The system throughput can be predicted by comparing an estimate of the network bandwidth needed for replication to the network bandwidth provided by the hardware. Latency on an unloaded system is about 4 times higher than a Linux NFS server backed by NVMe, reflecting the cost of replication.
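The snapshot mechanism described above — versioned key-value pairs, where a snapshot is simply a frozen version number — can be sketched as follows. This is a toy illustration of the idea, not Oracle's implementation; all class and method names are ours.

```python
class VersionedStore:
    """Toy versioned key-value store: reads at a snapshot return the
    newest value whose version is <= the snapshot's version number."""

    def __init__(self):
        self.current_version = 0
        self.data = {}  # key -> list of (version, value), append-only

    def put(self, key, value):
        self.current_version += 1
        self.data.setdefault(key, []).append((self.current_version, value))

    def snapshot(self):
        # Taking a snapshot writes nothing: it just captures a version.
        return self.current_version

    def get(self, key, version=None):
        if version is None:
            version = self.current_version
        # Newest entry visible at this version, scanning from the end.
        for v, value in reversed(self.data.get(key, [])):
            if v <= version:
                return value
        return None
```

Because versions are never overwritten, a snapshot read needs no locking against concurrent writers, which matches the nonblocking style the abstract describes.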

Speakers
BC

Bradley C. Kuszmaul

Oracle Corporation
MF

Matteo Frigo

Oracle Corporation
JM

Justin Mazzola Paluska

Oracle Corporation
AS

Alexander (Sasha) Sandler

Oracle Corporation


Wednesday July 10, 2019 11:40am - 12:00pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

11:40am PDT

Unleashing the Power of Learning: An Enhanced Learning-Based Approach for Dynamic Binary Translation
Dynamic binary translation (DBT) is a key system technology that enables many important system applications such as system virtualization and emulation. To achieve good performance, it is important for a DBT system to be equipped with high-quality translation rules. However, most translation rules in existing DBT systems are created manually, which requires high engineering effort and often yields poor quality. To solve this problem, a learning-based approach was recently proposed to automatically learn semantically equivalent translation rules, using symbolic verification to prove the semantic equivalence of such rules. However, this approach still suffers from some shortcomings.

In this paper, we first give an in-depth analysis of the constraints of prior learning-based methods and observe that their equivalence requirements are often unduly restrictive: they exclude many potentially high-quality rule candidates from being learned and applied. Based on this observation, we propose an enhanced learning-based approach that relaxes such equivalence requirements but supplements them with constraining conditions that make the rules semantically equivalent when they are applied. Experimental results on SPEC CINT2006 show that the proposed approach improves the dynamic coverage of the translation from 55.7% to 69.1% and the static coverage from 52.2% to 61.8%, compared to the original approach. Moreover, performance speedups of up to 1.65x, with an average of 1.19x, are observed.

Speakers
CS

Changheng Song

Fudan University
WW

Wenwen Wang

University of Minnesota
PY

Pen-Chung Yew

University of Minnesota
AZ

Antonia Zhai

University of Minnesota
WZ

Weihua Zhang

Fudan University


Wednesday July 10, 2019 11:40am - 12:00pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

12:00pm PDT

Zanzibar: Google’s Consistent, Global Authorization System
Determining whether online users are authorized to access digital objects is central to preserving privacy. This paper presents the design, implementation, and deployment of Zanzibar, a global system for storing and evaluating access control lists. Zanzibar provides a uniform data model and configuration language for expressing a wide range of access control policies from hundreds of client services at Google, including Calendar, Cloud, Drive, Maps, Photos, and YouTube. Its authorization decisions respect causal ordering of user actions and thus provide external consistency amid changes to access control lists and object contents. Zanzibar scales to trillions of access control lists and millions of authorization requests per second to support services used by billions of people. It has maintained 95th-percentile latency of less than 10 milliseconds and availability of greater than 99.999% over 3 years of production use.


Wednesday July 10, 2019 12:00pm - 12:20pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

12:00pm PDT

Transactuations: Where Transactions Meet the Physical World
A large class of IoT applications read sensors, execute application logic, and actuate actuators. However, the lack of high-level programming abstractions compromises correctness, especially in the presence of failures and unwanted interleavings between applications. A key problem arises when operations on IoT devices or the application itself fail, leading to inconsistencies between the physical state and the application state, breaking application semantics, and causing undesired consequences. Transactions are a well-established abstraction for correctness, but they assume properties that are absent in an IoT context. In this paper, we study one such environment, the smart home, and identify inconsistencies that manifest from failures. We propose an abstraction called transactuation that empowers developers to build reliable applications. Our runtime, Relacs, implements the abstraction atop a real smart-home platform. We evaluate the programmability, performance, and effectiveness of transactuations to demonstrate their potential as a powerful abstraction and execution model.

Speakers
AS

Aritra Sengupta

Samsung Research
MS

Masoud Saeida Ardekani

Samsung Research
CA

Cesar A. Stuardo

University of Chicago


Wednesday July 10, 2019 12:00pm - 12:20pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

12:20pm PDT

IASO: A Fail-Slow Detection and Mitigation Framework for Distributed Storage Services
We address the problem of the “fail-slow” fault, a fault where a hardware or software component still functions (does not fail-stop) but at much lower performance than expected. To address this, we built IASO, a peer-based, non-intrusive fail-slow detection framework that has been deployed for more than 1.5 years across 39,000 nodes in our customer sites and has helped our customers reduce major outages due to fail-slow incidents. IASO primarily works from timeout signals (a negligible monitoring overhead) and converts them into a stable and accurate fail-slow metric. IASO can quickly and accurately isolate a slow node within minutes. Within a 7-month period, IASO caught 232 fail-slow incidents in our large deployment field. In this paper, we also present this dataset of 232 fail-slow incidents along with our analysis. We found that the fail-slow annual failure rate in our field is 1.02%.
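The peer-based idea — convert timeout signals into a fail-slow metric by comparing each node against its peers — can be sketched roughly as below. The threshold, window, and function names are our illustrative assumptions, not IASO's actual metric.

```python
from statistics import median

def fail_slow_nodes(timeouts, factor=3.0, min_count=5):
    """Flag nodes whose timeout count over a window is far above the
    peer median. `timeouts` maps node name -> timeout count; `factor`
    and `min_count` are illustrative tuning knobs, not IASO's values."""
    med = median(timeouts.values())
    return sorted(
        node for node, count in timeouts.items()
        # Require both an absolute floor (to ignore noise on a quiet
        # cluster) and a large relative gap to the peers.
        if count >= min_count and count > factor * max(med, 1)
    )
```

Comparing against peers rather than an absolute threshold is what makes such a detector robust to workload-wide slowdowns, since a globally busy period raises every node's count together.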

Speakers
BP

Biswaranjan Panda

Nutanix Inc.
HK

Huan Ke

University of Chicago
KG

Karan Gupta

Nutanix Inc.
VK

Vinayak Khot

Nutanix Inc.
HS

Haryadi S. Gunawi

University of Chicago


Wednesday July 10, 2019 12:20pm - 12:40pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

12:20pm PDT

Not So Fast: Analyzing the Performance of WebAssembly vs. Native Code
All major web browsers now support WebAssembly, a low-level bytecode intended to serve as a compilation target for code written in languages like C and C++. A key goal of WebAssembly is performance parity with native code; previous work reports near parity, with many applications compiled to WebAssembly running on average 10% slower than native code. However, this evaluation was limited to a suite of scientific kernels, each consisting of roughly 100 lines of code. Running more substantial applications was not possible because compiling code to WebAssembly is only part of the puzzle: standard Unix APIs are not available in the web browser environment. To address this challenge, we build Browsix-Wasm, a significant extension to Browsix that, for the first time, makes it possible to run unmodified WebAssembly-compiled Unix applications directly inside the browser. We then use Browsix-Wasm to conduct the first large-scale evaluation of the performance of WebAssembly vs. native. Across the SPEC CPU suite of benchmarks, we find a substantial performance gap: applications compiled to WebAssembly run slower by an average of 45% (Firefox) to 55% (Chrome), with peak slowdowns of 2.08x (Firefox) and 2.5x (Chrome). We identify the causes of this performance degradation, some of which are due to missing optimizations and code generation issues, while others are inherent to the WebAssembly platform.

Speakers
AJ

Abhinav Jangda

University of Massachusetts Amherst
BP

Bobby Powers

University of Massachusetts Amherst
ED

Emery D. Berger

University of Massachusetts Amherst
AG

Arjun Guha

University of Massachusetts Amherst


Wednesday July 10, 2019 12:20pm - 12:40pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

12:40pm PDT

Lunch (on your own)
Wednesday July 10, 2019 12:40pm - 2:20pm PDT
N/A

2:20pm PDT

Extension Framework for File Systems in User space
User file systems offer numerous advantages over their in-kernel implementations, such as ease of development and better system reliability. However, they incur a heavy performance penalty. We observe that existing user file system frameworks are highly general; they consist of a minimal interposition layer in the kernel that simply forwards all low-level requests to user space. While this design offers flexibility, it also severely degrades performance due to frequent kernel-user context switching.

This work introduces ExtFUSE, a framework for developing extensible user file systems that also allows applications to register "thin" specialized request handlers in the kernel to meet their specific needs, while retaining the complex functionality in user space. Our evaluation with two FUSE file systems shows that ExtFUSE can improve the performance of user file systems with less than a few hundred lines of extension code on average. ExtFUSE is available on GitHub.

Speakers
AB

Ashish Bijlani

PhD Student, Georgia Institute of Technology
Ashish is a CS PhD student at Georgia Institute of Technology. His area of research is mobile storage.
UR

Umakishore Ramachandran

Georgia Institute of Technology


Wednesday July 10, 2019 2:20pm - 2:40pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

2:20pm PDT

Apache Nemo: A Framework for Building Distributed Dataflow Optimization Policies
Optimizing scheduling and communication of distributed data processing for resource and data characteristics is crucial for achieving high performance. Existing approaches to such optimizations largely fall into two categories. First, distributed runtimes provide low-level policy interfaces to apply the optimizations, but do not ensure the maintenance of correct application semantics and thus often require significant effort to use. Second, policy interfaces that extend a high-level application programming model ensure correctness, but do not provide sufficient fine control. We describe Apache Nemo, an optimization framework for distributed dataflow processing that provides fine control for high performance, and also ensures correctness for ease of use. We combine several techniques to achieve this, including an intermediate representation, optimization passes, and runtime extensions. Our evaluation results show that Nemo enables composable and reusable optimizations that bring performance improvements on par with existing specialized runtimes tailored for a specific deployment scenario.

Speakers
YY

Youngseok Yang

Seoul National University
JE

Jeongyoon Eo

Seoul National University
GK

Geon-Woo Kim

Viva Republica
JY

Joo Yeon Kim

Samsung Electronics
SL

Sanha Lee

Naver Corp.
JS

Jangho Seo

Seoul National University
WW

Won Wook Song

Seoul National University
BC

Byung-Gon Chun

Seoul National University


Wednesday July 10, 2019 2:20pm - 2:40pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

2:40pm PDT

FlexGroup Volumes: A Distributed WAFL File System
The rapid growth of customer applications and datasets has led to demand for storage that can scale with the needs of modern workloads. We have developed FlexGroup volumes to meet this need. FlexGroups combine local WAFL® file systems in a distributed storage cluster to provide a single namespace that seamlessly scales across the aggregate resources of the cluster (CPU, storage, etc.) while preserving the features and robustness of the WAFL file system.

In this paper we present the FlexGroup design, which includes a new remote access layer that supports distributed transactions and the novel heuristics used to balance load and capacity across a storage cluster. We evaluate FlexGroup performance and efficacy through lab tests and field data from over 1,000 customer FlexGroups.


Wednesday July 10, 2019 2:40pm - 3:00pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

2:40pm PDT

Tangram: Bridging Immutable and Mutable Abstractions for Distributed Data Analytics
Data analytics frameworks that adopt immutable data abstraction usually provide better support for failure recovery and straggler mitigation, while those that adopt mutable data abstraction are more efficient for iterative workloads thanks to their support for in-place state updates and asynchronous execution. Most existing frameworks adopt either one of the two data abstractions and do not enjoy the benefits of the other. In this paper, we propose a novel programming model named MapUpdate, which can determine whether a distributed dataset is mutable or immutable in an application. We show that MapUpdate not only offers good expressiveness, but also allows us to enjoy the benefits of both mutable and immutable abstractions. MapUpdate naturally supports iterative and asynchronous execution, and can use different recovery strategies adaptively according to failure scenarios. We implemented MapUpdate in a system, called Tangram, with novel system designs such as lightweight local task management, partition-based progress control, and context-aware failure recovery. Extensive experiments verified the benefits of Tangram on a variety of workloads including bulk processing, graph analytics, and iterative machine learning.

Speakers
YH

Yuzhen Huang

The Chinese University of Hong Kong
XY

Xiao Yan

The Chinese University of Hong Kong
GJ

Guanxian Jiang

The Chinese University of Hong Kong
TJ

Tatiana Jin

The Chinese University of Hong Kong
JC

James Cheng

The Chinese University of Hong Kong
AX

An Xu

The Chinese University of Hong Kong
ZL

Zhanhao Liu

The Chinese University of Hong Kong
ST

Shuo Tu

The Chinese University of Hong Kong


Wednesday July 10, 2019 2:40pm - 3:00pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

3:00pm PDT

EROFS: A Compression-friendly Readonly File System for Resource-scarce Devices
Smartphones usually have limited storage and runtime memory. Compressed read-only file systems can dramatically decrease the storage used by read-only system resources. However, existing compressed read-only file systems use fixed-sized input compression, which causes significant I/O amplification and unnecessary computation. They also consume excessive runtime memory during decompression and degrade performance when runtime memory is scarce. In this paper, we describe EROFS, a new compression-friendly read-only file system that leverages fixed-sized output compression and memory-efficient decompression to achieve high performance with little extra memory overhead. We also report our experience of deploying EROFS on tens of millions of smartphones. Evaluation results show that EROFS outperforms existing compressed read-only file systems with various micro-benchmarks and reduces the boot time of real-world applications by up to 22.9% while nearly halving the storage usage.
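The contrast between the two compression layouts is worth spelling out: fixed-sized *input* compression takes fixed chunks of file data and produces variable-sized compressed output, while fixed-sized *output* compression packs as much input as will fit into one fixed-size on-disk block, so a random read decompresses exactly one block. The sketch below illustrates the output-sized idea with zlib; it assumes compressed size grows roughly monotonically with input length, and is in no way EROFS's on-disk format.

```python
import zlib

BLOCK = 4096  # illustrative fixed output-block size

def pack_block(data, start):
    """Greedily grow the input span starting at `start` until its
    compressed form no longer fits in one BLOCK-sized output block.
    Returns (bytes_of_input_consumed, padded_output_block)."""
    lo, hi = 1, len(data) - start
    best = 0
    while lo <= hi:  # binary search on input length (assumes monotonicity)
        mid = (lo + hi) // 2
        if len(zlib.compress(data[start:start + mid])) <= BLOCK:
            best, lo = mid, mid + 1
        else:
            hi = mid - 1
    comp = zlib.compress(data[start:start + best])
    return best, comp.ljust(BLOCK, b"\0")  # pad to the fixed block size
```

Each on-disk block then decompresses independently, which is what keeps read amplification bounded to one block per access.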

Speakers
XG

Xiang Gao

Huawei Technologies Co., Ltd.
MD

Mingkai Dong

Shanghai Jiao Tong University
XM

Xie Miao

Huawei Technologies Co., Ltd.
WD

Wei Du

Huawei Technologies Co., Ltd.
CY

Chao Yu

Huawei Technologies Co., Ltd.
HC

Haibo Chen

Shanghai Jiao Tong University / Huawei Technologies Co., Ltd.


Wednesday July 10, 2019 3:00pm - 3:20pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

3:00pm PDT

STRADS-AP: Simplifying Distributed Machine Learning Programming without Introducing a New Programming Model
It is a daunting task for a data scientist to convert sequential code for a Machine Learning (ML) model, published by an ML researcher, to a distributed framework that runs on a cluster and operates on massive datasets. The process of fitting the sequential code to an appropriate programming model and data abstractions determined by the framework of choice requires significant engineering and cognitive effort. Furthermore, inherent constraints of frameworks sometimes lead to inefficient implementations, delivering suboptimal performance.

We show that it is possible to achieve automatic and efficient distributed parallelization of familiar sequential ML code by making a few mechanical changes to it while hiding the details of concurrency control, data partitioning, task parallelization, and fault-tolerance. To this end, we design and implement a new distributed ML framework, STRADS-Automatic Parallelization (AP), and demonstrate that it simplifies distributed ML programming significantly, while outperforming a popular data-parallel framework with an unfamiliar programming model, and achieving performance comparable to an ML-specialized framework.

Speakers
JK

Jin Kyu Kim

Carnegie Mellon University
AA

Abutalib Aghayev

Carnegie Mellon University
GA

Garth A. Gibson

Carnegie Mellon University, Vector Institute, University of Toronto
EP

Eric P. Xing

Petuum Inc, Carnegie Mellon University


Wednesday July 10, 2019 3:00pm - 3:20pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

3:20pm PDT

QZFS: QAT Accelerated Compression in File System for Application Agnostic and Cost Efficient Data Storage
Data compression can not only provide space efficiency with a lower Total Cost of Ownership (TCO) but also enhance I/O performance because of the reduced read/write operations. However, lossless compression algorithms with a high compression ratio (e.g., gzip) inevitably incur high CPU resource consumption. Prior studies mainly leveraged general-purpose hardware accelerators such as GPUs and FPGAs to offload costly (de)compression operations for application workloads. This paper investigates ASIC-accelerated compression in the file system to transparently benefit all applications running on it and provide high-performance and cost-efficient data storage. Based on the Intel® QAT ASIC, we propose QZFS, which integrates QAT into the ZFS file system to achieve efficient gzip (de)compression offloading at the file system layer. A compression service engine is introduced in QZFS to serve as an algorithm selector and implement compressibility-dependent offloading and selective offloading by source data size. More importantly, a QAT offloading module is designed to leverage the vectored I/O model to reconstruct data blocks, so that they can be consumed by the QAT hardware without incurring extra memory copies. A comprehensive evaluation validates that QZFS can achieve up to 5x write throughput improvement for the FIO micro-benchmark and more than 6x cost-efficiency enhancement for genomic data post-processing over the software-implemented alternative.

Speakers
XH

Xiaokang Hu

Shanghai Jiao Tong University, Intel Asia-Pacific R&D Ltd.
FW

Fuzong Wang

Shanghai Jiao Tong University, Intel Asia-Pacific R&D Ltd.
WL

Weigang Li

Intel Asia-Pacific R&D Ltd.
JL

Jian Li

Shanghai Jiao Tong University
HG

Haibing Guan

Shanghai Jiao Tong University


Wednesday July 10, 2019 3:20pm - 3:40pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

3:20pm PDT

SOPHIA: Online Reconfiguration of Clustered NoSQL Databases for Time-Varying Workloads
Reconfiguring NoSQL databases under changing workload patterns is crucial for maximizing database throughput. This is challenging because of the large configuration parameter search space with complex interdependencies among the parameters. While state-of-the-art systems can automatically identify close-to-optimal configurations for static workloads, they suffer for dynamic workloads as they overlook three fundamental challenges: (1) Estimating performance degradation during the reconfiguration process (such as due to database restart). (2) Predicting how transient the new workload pattern will be. (3) Respecting the application’s availability requirements during reconfiguration. Our solution, SOPHIA, addresses all these shortcomings using an optimization technique that combines workload prediction with a cost-benefit analyzer. SOPHIA computes the relative cost and benefit of each reconfiguration step, and determines an optimal reconfiguration for a future time window. This plan specifies when to change configurations and to what, to achieve the best performance without degrading data availability. We demonstrate its effectiveness for three different workloads: a multi-tenant, global-scale metagenomics repository (MG-RAST), a bus-tracking application (Tiramisu), and an HPC data-analytics system, all with varying levels of workload complexity and demonstrating dynamic workload changes. We compare SOPHIA’s performance in throughput and tail-latency over various baselines for two popular NoSQL databases, Cassandra and Redis.
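The core of the cost-benefit analysis can be reduced to a single comparison: reconfigure only when the predicted throughput gain over the remaining workload window outweighs the throughput lost while nodes restart. The sketch below is a deliberately minimal illustration; the parameter names and the linear cost model are our assumptions, not SOPHIA's full planner.

```python
def should_reconfigure(cur_ops, new_ops, window_s, downtime_s, frac_down=1.0):
    """cur_ops/new_ops: throughput (ops/s) under the current and the
    candidate configuration; window_s: how long the new workload pattern
    is predicted to last; downtime_s: reconfiguration time; frac_down:
    fraction of capacity lost during a rolling restart."""
    benefit = (new_ops - cur_ops) * (window_s - downtime_s)
    cost = cur_ops * downtime_s * frac_down  # ops lost while restarting
    return benefit > cost
```

This captures why transient workload spikes should not trigger reconfiguration: when `window_s` is short, even a large throughput gain cannot amortize the restart cost.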

Speakers
AM

Ashraf Mahgoub

Purdue University
PW

Paul Wood

Johns Hopkins University
AM

Alexander Medoff

Purdue University
SM

Subrata Mitra

Adobe Research
FM

Folker Meyer

Argonne National Lab
SC

Somali Chaterji

Purdue University
SB

Saurabh Bagchi

Purdue University


Wednesday July 10, 2019 3:20pm - 3:40pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

3:40pm PDT

Break with Refreshments
Wednesday July 10, 2019 3:40pm - 4:10pm PDT
Grand Ballroom Prefunction

4:10pm PDT

DistCache: Provable Load Balancing for Large-Scale Storage Systems with Distributed Caching
Best Paper at FAST '19

Authors:
Zaoxing Liu and Zhihao Bai, Johns Hopkins University; Zhenming Liu, College of William and Mary; Xiaozhou Li, Celer Network; Changhoon Kim, Barefoot Networks; Vladimir Braverman and Xin Jin, Johns Hopkins University; Ion Stoica, UC Berkeley

Load balancing is critical for distributed storage to meet strict service-level objectives (SLOs). It has been shown that a fast cache can guarantee load balancing for a clustered storage system. However, when the system scales out to multiple clusters, the fast cache itself becomes the bottleneck. Traditional mechanisms like cache partitioning and cache replication either result in load imbalance between cache nodes or incur high overhead for cache coherence.

We present DistCache, a new distributed caching mechanism that provides provable load balancing for large-scale storage systems. DistCache co-designs cache allocation with cache topology and query routing. The key idea is to partition the hot objects with independent hash functions between cache nodes in different layers, and to adaptively route queries with the power-of-two-choices. We prove that DistCache enables the cache throughput to increase linearly with the number of cache nodes, by unifying techniques from expander graphs, network flows, and queuing theory. DistCache is a general solution that can be applied to many storage systems. We demonstrate the benefits of DistCache by providing the design, implementation, and evaluation of the use case for emerging switch-based caching.
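The two key ideas — partitioning hot objects across cache layers with independent hash functions, and routing each query to the less-loaded of its two candidates (the power of two choices) — can be sketched as follows. This is a toy illustration, not the paper's switch-based implementation; all names are ours.

```python
import hashlib

def _h(seed, key, n):
    """Independent hash functions, one per cache layer, via seeding."""
    return int(hashlib.sha256(f"{seed}:{key}".encode()).hexdigest(), 16) % n

def route(key, load_a, load_b):
    """Route a query for `key` to the less-loaded of its two candidate
    cache nodes. load_a/load_b: per-node outstanding-query counts for
    layer A and layer B; the chosen node's count is incremented."""
    a = _h("layer-a", key, len(load_a))
    b = _h("layer-b", key, len(load_b))
    if load_a[a] <= load_b[b]:
        load_a[a] += 1
        return ("a", a)
    load_b[b] += 1
    return ("b", b)
```

Because the two layers hash independently, two objects that collide on a node in one layer almost surely map to different nodes in the other, which is what lets the two-choice rule spread hot objects evenly.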

Wednesday July 10, 2019 4:10pm - 4:30pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

4:10pm PDT

libmpk: Software Abstraction for Intel Memory Protection Keys (Intel MPK)
Intel Memory Protection Keys (MPK) is a new hardware primitive to support thread-local permission control on groups of pages without requiring modification of page tables. Unfortunately, its current hardware implementation and software support suffer from security, scalability, and semantic problems: (1) it is vulnerable to protection-key-use-after-free; (2) it provides only a limited number of protection keys; and (3) it is incompatible with mprotect()’s process-based permission model.

In this paper, we propose libmpk, a software abstraction for MPK. It virtualizes the hardware protection keys to eliminate the protection-key-use-after-free problem while providing access to an unlimited number of virtualized keys. To support legacy applications, it also provides lazy inter-thread key synchronization. To enhance the security of MPK itself, libmpk restricts unauthorized writes to its metadata. We apply libmpk to three real-world applications for memory protection and isolation: OpenSSL, a JavaScript JIT compiler, and Memcached. Our evaluation shows that libmpk introduces negligible performance overhead.
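The key-virtualization idea — multiplexing an unbounded set of virtual keys onto the small pool of hardware protection keys, evicting a mapping on demand — can be sketched like this. The class and method names are ours, not libmpk's API, and a real implementation would also re-tag the evicted key's pages in hardware.

```python
from collections import OrderedDict

class KeyVirtualizer:
    def __init__(self, hw_keys=15):
        # x86 MPK exposes 16 protection keys; key 0 is the default key,
        # so roughly 15 are available for allocation.
        self.free = list(range(1, hw_keys + 1))
        self.mapping = OrderedDict()  # virtual key -> hw key, LRU order

    def acquire(self, vkey):
        """Return a hardware key backing the given virtual key,
        evicting the least-recently-used mapping when none are free."""
        if vkey in self.mapping:
            self.mapping.move_to_end(vkey)  # refresh LRU position
            return self.mapping[vkey]
        if self.free:
            hw = self.free.pop()
        else:
            # Evict the LRU virtual key and reuse its hardware key.
            _, hw = self.mapping.popitem(last=False)
        self.mapping[vkey] = hw
        return hw
```

Indirection through the virtual key is also what defeats protection-key-use-after-free: a stale virtual key can be invalidated in the table instead of silently aliasing a recycled hardware key.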

Speakers
SP

Soyeon Park

Georgia Institute of Technology
SL

Sangho Lee

Microsoft Research
WX

Wen Xu

Georgia Institute of Technology
HM

Hyungon Moon

Ulsan National Institute of Science and Technology
TK

Taesoo Kim

Georgia Institute of Technology


Wednesday July 10, 2019 4:10pm - 4:30pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

4:30pm PDT

Protocol-Aware Recovery for Consensus-Based Storage
Best Paper at FAST '18

Authors:
Ramnatthan Alagappan and Aishwarya Ganesan, University of Wisconsin—Madison; Eric Lee, University of Texas at Austin; Aws Albarghouthi, University of Wisconsin—Madison; Vijay Chidambaram, University of Texas at Austin;Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau, University of Wisconsin—Madison

We introduce protocol-aware recovery (PAR), a new approach that exploits protocol-specific knowledge to correctly recover from storage faults in distributed systems. We demonstrate the efficacy of PAR through the design and implementation of corruption-tolerant replication (CTRL), a PAR mechanism specific to replicated state machine (RSM) systems. We experimentally show that the CTRL versions of two systems, LogCabin and ZooKeeper, safely recover from storage faults and provide high availability, while the unmodified versions can lose data or become unavailable. We also show that the CTRL versions have little performance overhead.

Wednesday July 10, 2019 4:30pm - 4:50pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

4:30pm PDT

Effective Static Analysis of Concurrency Use-After-Free Bugs in Linux Device Drivers
In Linux device drivers, use-after-free (UAF) bugs can cause system crashes and serious security problems. According to our study of Linux kernel commits, 42% of the driver commits fixing use-after-free bugs involve driver concurrency. We refer to these use-after-free bugs as concurrency use-after-free bugs. Due to the non-determinism of concurrent execution, concurrency use-after-free bugs are often more difficult to reproduce and detect than sequential use-after-free bugs.

In this paper, we propose a practical static analysis approach named DCUAF, to effectively detect concurrency use-after-free bugs in Linux device drivers. DCUAF combines a local analysis analyzing the source code of each driver with a global analysis statistically analyzing the local results of all drivers, forming a local-global analysis, to extract the pairs of driver interface functions that may be concurrently executed. Then, with these pairs, DCUAF performs a summary-based lockset analysis to detect concurrency use-after-free bugs. We have evaluated DCUAF on the driver code of Linux 4.19, and found 640 real concurrency use-after-free bugs. We have randomly selected 130 of the real bugs and reported them to Linux kernel developers, and 95 have been confirmed.
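The lockset reasoning that such an analysis rests on reduces to a simple check: two accesses that may run concurrently can race on a shared object unless their sets of held locks intersect. The toy predicate below illustrates only that final check, not DCUAF's local-global analysis; the representation is our assumption.

```python
def may_race(access_a, access_b):
    """Each access is (variable_name, set_of_locks_held). Report a
    potential race when both touch the same variable while holding no
    common lock."""
    var_a, locks_a = access_a
    var_b, locks_b = access_b
    return var_a == var_b and not (locks_a & locks_b)
```

In a use-after-free setting, one of the two "accesses" is the free of the object, so an empty lockset intersection means a use can interleave with the free.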

Speakers
JB

Jia-Ju Bai

Tsinghua University
JL

Julia Lawall

Sorbonne Université/Inria/LIP6
QC

Qiu-Liang Chen

Tsinghua University
SH

Shi-Min Hu

Tsinghua University


Wednesday July 10, 2019 4:30pm - 4:50pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

4:50pm PDT

Orca: Differential Bug Localization in Large-Scale Services
Best Paper at OSDI '18

Authors:
Ranjita Bhagwan, Rahul Kumar, Chandra Sekhar Maddila, and Adithya Abraham Philip, Microsoft Research India

Today, we depend on numerous large-scale services for basic operations such as email. These services are complex and extremely dynamic as developers continuously commit code and introduce new features, fixes, and, consequently, new bugs. Hundreds of commits may enter deployment simultaneously. Therefore, one of the most time-critical yet complex tasks in mitigating service disruption is localizing the bug to the right commit.

This paper presents the concept of differential bug localization, which uses a combination of differential code analysis and software provenance tracking to effectively pinpoint buggy commits. We have built Orca, a customized code search engine that implements differential bug localization. Orca is actively used by the On-Call Engineers (OCEs) of a large enterprise email and collaboration service to localize bugs to the appropriate buggy commits. Our evaluation shows that Orca correctly localizes 77% of the bugs for which it has been used. We also show that it yields a 4x reduction in the work done by the OCE.

Wednesday July 10, 2019 4:50pm - 5:10pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

4:50pm PDT

LXDs: Towards Isolation of Kernel Subsystems
Modern operating systems are monolithic. Today, however, lack of isolation is one of the main factors undermining kernel security. The inherent complexity of the kernel code and the rapid pace of development, combined with the use of an unsafe, low-level programming language, result in a steady stream of errors. Even after decades of efforts to make commodity kernels more secure, i.e., the development of numerous static and dynamic approaches aimed at preventing exploitation of the most common errors, several hundred serious kernel vulnerabilities are reported every year. Unfortunately, in a monolithic kernel, a single exploitable vulnerability potentially provides an attacker with access to the entire kernel.

Modern kernels need isolation as a practical means of confining the effects of exploits to individual kernel subsystems. Historically, introducing isolation in the kernel has been hard. First, commodity hardware interfaces provide no support for efficient, fine-grained isolation. Second, the complexity of a modern kernel prevents a naive decomposition effort. Our work on Lightweight Execution Domains (LXDs) takes a step towards enabling isolation in a full-featured operating system kernel. LXDs allow one to take an existing kernel subsystem and run it inside an isolated domain with minimal or no modifications and with minimal overhead. We evaluate our approach by developing isolated versions of several performance-critical device drivers in the Linux kernel.

Speakers
VN

Vikram Narayanan

University of California, Irvine
AB

Abhiram Balasubramanian

University of Utah
CJ

Charlie Jacobsen

University of Utah
SS

Sarah Spall

University of Utah
SB

Scott Bauer

University of Utah
MQ

Michael Quigley

University of Utah
AH

Aftab Hussain

University of California, Irvine
AY

Abdullah Younis

University of California, Irvine
JS

Junjie Shen

University of California, Irvine
MB

Moinak Bhattacharyya

University of California, Irvine
AB

Anton Burtsev

University of California, Irvine


Wednesday July 10, 2019 4:50pm - 5:10pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

5:10pm PDT

LegoOS: A Disseminated, Distributed OS for Hardware Resource Disaggregation
Best Paper at OSDI '18

Authors:
Yizhou Shan, Yutong Huang, Yilun Chen, and Yiying Zhang, Purdue University

The monolithic server model where a server is the unit of deployment, operation, and failure is meeting its limits in the face of several recent hardware and application trends. To improve heterogeneity, elasticity, resource utilization, and failure handling in datacenters, we believe that datacenters should break monolithic servers into disaggregated, network-attached hardware components. Despite the promising benefits of hardware resource disaggregation, no existing OSes or software systems can properly manage it. We propose a new OS model called the splitkernel to manage disaggregated systems. Splitkernel disseminates traditional OS functionalities into loosely-coupled monitors, each of which runs on and manages a hardware component. Using the splitkernel model, we built LegoOS, a new OS designed for hardware resource disaggregation. LegoOS appears to users as a set of distributed servers. Internally, LegoOS cleanly separates processor, memory, and storage devices both at the hardware level and the OS level. We implemented LegoOS from scratch and evaluated it by emulating hardware components using commodity servers. Our evaluation results show that LegoOS’s performance is comparable to monolithic Linux servers, while largely improving resource packing and failure rate over monolithic clusters.

Wednesday July 10, 2019 5:10pm - 5:30pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

5:10pm PDT

JumpSwitches: Restoring the Performance of Indirect Branches In the Era of Spectre
The Spectre family of security vulnerabilities shows that speculative execution attacks on modern processors are practical. Spectre variant 2 attacks exploit the speculation of indirect branches, which enables processors to execute code from arbitrary locations in memory. Retpolines serve as the state-of-the-art defense, effectively disabling speculative execution for indirect branches. While retpolines succeed in protecting against Spectre, they come with a significant penalty: 20% on some workloads.

In this paper, we describe and implement an alternative mechanism: the JumpSwitch, which enables speculative execution of indirect branches on safe targets by leveraging indirect call promotion, transforming indirect calls into direct calls. Unlike traditional inlining techniques, which apply call promotion at compile time, JumpSwitches aggressively learn targets at runtime and leverage an optimized patching infrastructure to perform just-in-time promotion without the overhead of binary translation.

We designed and optimized JumpSwitches for common patterns we have observed. If a JumpSwitch cannot learn safe targets, we fall back to the safety of retpolines. JumpSwitches seamlessly integrate into Linux, and are evaluated in Linux v4.18. In addition, we have submitted patches to the Linux community to enable JumpSwitches upstream. We show that JumpSwitches can improve performance over retpolines by up to 20% for a range of workloads. In some cases, JumpSwitches even show improvement over a system without retpolines by directing speculative execution into direct calls just-in-time and reducing mispredictions.
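
The promotion idea can be illustrated with a toy sketch in Python (the real JumpSwitch patches compiled kernel code at runtime; the class and names here are illustrative, not from the paper's implementation): a call site compares the target against a small set of learned targets, calling a match "directly", and falls back to the generic indirect path, promoting hot targets just in time.

```python
class JumpSwitch:
    """Toy analogue of a JumpSwitch call site (illustrative names)."""
    def __init__(self, max_learned=4):
        self.learned = []          # promoted targets, checked by identity
        self.max_learned = max_learned

    def call(self, target, *args):
        # Fast path: a chain of identity comparisons against learned
        # targets stands in for the patched-in direct calls.
        for t in self.learned:
            if t is target:
                return t(*args)
        # Slow path: stands in for the retpoline fallback; promote the
        # target just in time if there is room (the real system learns
        # and patches much more adaptively than this).
        if len(self.learned) < self.max_learned:
            self.learned.append(target)
        return target(*args)

def inc(x): return x + 1
def dbl(x): return x * 2

js = JumpSwitch()
assert js.call(inc, 1) == 2   # slow path: inc is promoted
assert js.call(inc, 2) == 3   # fast path: direct comparison hits
assert js.call(dbl, 3) == 6   # dbl promoted as well
```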

Speakers
NA

Nadav Amit

VMware Research
MW

Michael Wei

VMware Research


Wednesday July 10, 2019 5:10pm - 5:30pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

6:00pm PDT

Poster Session and Happy Hour
Take part in discussions with your colleagues over complimentary food and drinks. Posters of all papers presented in the Technical Sessions on Wednesday and on Thursday morning will be on display.

Wednesday July 10, 2019 6:00pm - 7:30pm PDT
Lake Washington Ballroom

7:30pm PDT

UNIX Exhibit Preview & Meetup at the Living Computer Museum + Labs
Open to USENIX ATC ’19 Conference Attendees

The Museum invites you to celebrate the 50th Anniversary of UNIX and associated Internet technologies in its latest exhibit update. Enjoy full museum access as well as light refreshments at the celebration of this computing milestone with conversation and live demos. Admission is free for USENIX conference attendees. While USENIX is not responsible for transportation to and from the event, we are happy to provide a limited number of Lyft coupons for transportation; if you choose to use a coupon, please share rides with your fellow attendees and use code UNIX50TH in the “Promos” section of the Lyft app.

Wednesday July 10, 2019 7:30pm - 9:00pm PDT
N/A
 
Thursday, July 11
 

7:30am PDT

Continental Breakfast
Thursday July 11, 2019 7:30am - 8:30am PDT
Grand Ballroom Prefunction

8:00am PDT

Badge Pickup
Thursday July 11, 2019 8:00am - 5:00pm PDT
Grand Stair Prefunction

8:30am PDT

Lightning Talks
Thursday July 11, 2019 8:30am - 9:40am PDT
Grand Ballroom

9:40am PDT

Short Break
Thursday July 11, 2019 9:40am - 9:45am PDT
Grand Ballroom Prefunction

9:45am PDT

Darwin: A Genomics Co-processor Provides up to 15,000X Acceleration on Long Read Assembly
Best Paper at ASPLOS '18

Authors: Yatish Turakhia and Gill Bejerano, Stanford University; William J. Dally, Stanford University and NVIDIA Research

Genomics is transforming medicine and our understanding of life in fundamental ways. Genomics data, however, is far outpacing Moore's Law. Third-generation sequencing technologies produce 100X longer reads than second-generation technologies and reveal a much broader mutation spectrum of disease and evolution. However, these technologies incur prohibitively high computational costs. Over 1,300 CPU hours are required for reference-guided assembly of the human genome, and over 15,600 CPU hours are required for de novo assembly. This paper describes "Darwin," a co-processor for genomic sequence alignment that, without sacrificing sensitivity, provides up to 15,000X speedup over the state-of-the-art software for reference-guided assembly of third-generation reads. Darwin achieves this speedup through hardware/algorithm co-design, trading more easily accelerated alignment for less memory-intensive filtering, and by optimizing the memory system for filtering. Darwin combines a hardware-accelerated version of D-SOFT, a novel filtering algorithm, with a hardware-accelerated version of GACT, a novel alignment algorithm. GACT generates near-optimal alignments of arbitrarily long genomic sequences using constant memory for the compute-intensive step. Darwin is adaptable, with tunable speed and sensitivity to match emerging sequencing technologies and to meet the requirements of genomic applications beyond read assembly.

Thursday July 11, 2019 9:45am - 10:05am PDT
USENIX ATC Track I: Grand Ballroom I–VI

9:45am PDT

Multi-Queue Fair Queueing
Modern high-speed devices (e.g., network adapters, storage, accelerators) use new host interfaces, which expose multiple software queues directly to the device. These multi-queue interfaces allow mutually distrusting applications to access the device without any cross-core interaction, enabling throughput on the order of millions of IOPS on multicore systems. Unfortunately, while independent device access is scalable, it also introduces a new problem: unfairness. Mechanisms that were used to provide fairness for older devices are no longer tenable in the wake of multi-queue design, and straightforward attempts to re-introduce fairness would require cross-core synchronization that undermines the scalability for which multiple queues were designed.

To address these challenges, we present Multi-Queue Fair Queueing (MQFQ), the first fair, work-conserving scheduler suitable for multi-queue systems. Specifically, we (1) reformulate a classical fair queueing algorithm to accommodate multi-queue designs, and (2) describe a scalable implementation that bounds potential unfairness while minimizing synchronization overhead. Our implementation of MQFQ in Linux 4.15 demonstrates both fairness and high throughput. Evaluation with an NVMe over RDMA fabric (NVMf) device shows that MQFQ can reach up to 3.1 million IOPS on a single machine, 20× higher than the state-of-the-art Linux Budget Fair Queueing. Compared to a system with no fairness, MQFQ reduces the slowdown caused by an antagonist from 3.78× to 1.33× for the FlashX workload and from 6.57× to 1.03× for the Aerospike workload (2× is considered "fair" slowdown).
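
The classical fair queueing core that MQFQ reformulates can be sketched as follows (a minimal single-threaded model, not MQFQ's scalable multi-queue implementation; the function name and tie-breaking rule are illustrative): each flow carries a virtual start tag that advances by the cost of each dispatched request, and the dispatcher serves the flow with the smallest tag, so flows receive equal virtual service regardless of per-request cost.

```python
def dispatch_order(flows, rounds):
    """Minimal start-tag fair queueing: `flows` maps a queue name to its
    per-request cost; each dispatch advances that queue's virtual start
    tag by the request cost, and the queue with the smallest tag is
    served next (ties broken by name)."""
    tags = {q: 0.0 for q in flows}
    order = []
    for _ in range(rounds):
        q = min(tags, key=lambda k: (tags[k], k))
        order.append(q)
        tags[q] += flows[q]
    return order

# Queue 'b' issues requests twice as expensive as 'a', so over six
# dispatches it is served half as often: equal virtual service.
order = dispatch_order({'a': 1.0, 'b': 2.0}, 6)
assert order.count('a') == 4 and order.count('b') == 2
```

MQFQ's contribution is making this idea scale: bounding how far any queue's tag may run ahead while avoiding the cross-core synchronization that a naive shared dispatcher like this one would require.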

Speakers
MH

Mohammad Hedayati

University of Rochester
KS

Kai Shen

Google
ML

Michael L. Scott

University of Rochester


Thursday July 11, 2019 9:45am - 10:05am PDT
USENIX ATC Track II: Grand Ballroom VII–IX

10:05am PDT

The Semantics of Transactions and Weak Memory in x86, Power, ARM, and C++
Best Paper at PLDI 2018

Authors:
Nathan Chong, Arm; Tyler Sorensen and John Wickerson, Imperial College London

Weak memory models provide a complex, system-centric semantics for concurrent programs, while transactional memory (TM) provides a simpler, programmer-centric semantics. Both have been studied in detail, but their combined semantics is not well understood. This is problematic because such widely-used architectures and languages as x86, Power, and C++ all support TM, and all have weak memory models.

Our work aims to clarify the interplay between weak memory and TM by extending existing axiomatic weak memory models (x86, Power, ARMv8, and C++) with new rules for TM. Our formal models are backed by automated tooling that enables (1) the synthesis of tests for validating our models against existing implementations and (2) the model-checking of TM-related transformations, such as lock elision and compiling C++ transactions to hardware. A key finding is that a proposed TM extension to ARMv8 currently being considered within ARM Research is incompatible with lock elision without sacrificing portability or performance.

Thursday July 11, 2019 10:05am - 10:25am PDT
USENIX ATC Track I: Grand Ballroom I–VI

10:05am PDT

BRAVO -- Biased Locking for Reader-Writer Locks
Designers of modern reader-writer locks confront a difficult trade-off related to reader scalability. Locks that have a compact memory representation for active readers will typically suffer under high intensity read-dominated workloads when the "reader indicator" state is updated frequently by a diverse set of threads, causing cache invalidation and coherence traffic. Other designs use distributed reader indicators, one per NUMA node, per core or even per thread. This improves reader-reader scalability, but also increases the size of each lock instance and creates overhead for writers.

We propose a simple transformation, BRAVO, that augments any existing reader-writer lock, adding just two integer fields to the lock instance. Readers make their presence known to writers by hashing their thread's identity with the lock address, forming an index into a visible readers table and installing the lock address into the table. All locks and threads in an address space can share the same readers table. Crucially, readers of the same lock tend to write to different locations in the table, reducing coherence traffic. Therefore, BRAVO can augment a simple compact lock to provide scalable concurrent reading, but with only modest and constant increase in memory footprint.

We implemented BRAVO in user-space, as well as integrated it with the Linux kernel reader-writer semaphore (rwsem). Our evaluation with numerous benchmarks and real applications, both in user and kernel-space, demonstrate that BRAVO improves performance and scalability of underlying locks in read-heavy workloads while introducing virtually no overhead, including in workloads in which writes are frequent.
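
The reader-publication scheme described above can be sketched in Python (a simplified single-threaded model; the real BRAVO uses atomic operations and a revocation path, and the hash mixing constant here is illustrative): readers hash their identity with the lock address into a shared table, and writers scan the table for entries naming their lock.

```python
TABLE_SIZE = 4096
readers = [None] * TABLE_SIZE   # one visible-readers table shared by all locks

def reader_slot(thread_id, lock_addr):
    # Mix the thread identity with the lock address, so readers of the
    # same lock tend to land in different slots and avoid cache-line
    # contention (the mixing constant is illustrative).
    return ((thread_id * 0x9E3779B1) ^ lock_addr) % TABLE_SIZE

def reader_acquire(thread_id, lock_addr):
    slot = reader_slot(thread_id, lock_addr)
    if readers[slot] is None:
        readers[slot] = lock_addr   # publish this reader to writers
        return slot
    return None   # collision: fall back to the underlying reader-writer lock

def reader_release(slot):
    readers[slot] = None

def writer_scan(lock_addr):
    # A writer finds the table entries naming its lock; it must wait for
    # those readers to depart (the waiting itself is elided here).
    return [i for i, a in enumerate(readers) if a == lock_addr]

LOCK = 0xDEADBEEF
s1 = reader_acquire(1, LOCK)
s2 = reader_acquire(2, LOCK)
assert s1 != s2                               # same lock, different slots
assert sorted(writer_scan(LOCK)) == sorted([s1, s2])
reader_release(s1); reader_release(s2)
assert writer_scan(LOCK) == []
```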

Speakers
DD

Dave Dice

Oracle Labs
AK

Alex Kogan

Oracle Labs


Thursday July 11, 2019 10:05am - 10:25am PDT
USENIX ATC Track II: Grand Ballroom VII–IX

10:25am PDT

Who Left Open the Cookie Jar? A Comprehensive Evaluation of Third-Party Cookie Policies
Distinguished Paper Award and 2018 Internet Defense Prize at USENIX Security '18

Authors:
Gertjan Franken, Tom Van Goethem, and Wouter Joosen, imec-DistriNet, KU Leuven

Nowadays, cookies are the most prominent mechanism to identify and authenticate users on the Internet. Although protected by the Same Origin Policy, popular browsers include cookies in all requests, even when these are cross-site. Unfortunately, these third-party cookies enable both cross-site attacks and third-party tracking. As a response to these nefarious consequences, various countermeasures have been developed in the form of browser extensions or even protection mechanisms that are built directly into the browser.

In this paper, we evaluate the effectiveness of these defense mechanisms by leveraging a framework that automatically evaluates the enforcement of the policies imposed to third-party requests. By applying our framework, which generates a comprehensive set of test cases covering various web mechanisms, we identify several flaws in the policy implementations of the 7 browsers and 46 browser extensions that were evaluated. We find that even built-in protection mechanisms can be circumvented by multiple novel techniques we discover. Based on these results, we argue that our proposed framework is a much-needed tool to detect bypasses and evaluate solutions to the exposed leaks. Finally, we analyze the origin of the identified bypass techniques, and find that these are due to a variety of implementation, configuration and design flaws.

Thursday July 11, 2019 10:25am - 10:45am PDT
USENIX ATC Track I: Grand Ballroom I–VI

10:25am PDT

Mitigating Asymmetric Read and Write Costs in Cuckoo Hashing for Storage Systems
In storage systems, cuckoo hash tables have been widely used to support fast query services. For a read, cuckoo hashing delivers real-time access with O(1) lookup complexity via an open-addressing approach. For a write, most concurrent cuckoo hash tables fail to efficiently address the problem of endless loops during item insertion, which arises from the essential property of hash collisions. This asymmetry gives cuckoo hashing fast-read-slow-write performance, and single-thread writes often become the bottleneck. In order to address this asymmetric performance and significantly improve write/insert efficiency, we propose an optimized concurrent cuckoo hashing scheme, called CoCuckoo. To predetermine the occurrence of endless loops, CoCuckoo uses a directed pseudoforest containing several subgraphs to represent the relationships among items along cuckoo paths. CoCuckoo exploits the observation that insertion operations sharing a cuckoo path access the same subgraph, and hence a lock is needed to ensure the correctness of concurrency control by allowing only one thread to access the shared path at a time; insertion operations accessing different subgraphs are executed simultaneously without collisions. CoCuckoo improves throughput by using graph-grained locking to support concurrent writes and reads. We have implemented all components of CoCuckoo, and extensive experiments using the YCSB benchmark demonstrate the efficiency and efficacy of our proposed scheme.
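
The pseudoforest bookkeeping can be modeled with a union-find structure (a simplified sketch under the assumption that each insertion's two candidate buckets define an edge; names are illustrative and concurrency control is elided): two buckets in the same subgraph must serialize under one graph-grained lock, and a subgraph that already contains a cycle would send a further insertion into an endless cuckoo path.

```python
class SubgraphIndex:
    """Toy model of CoCuckoo's pseudoforest bookkeeping: hash-table
    buckets are vertices and each insertion adds an edge between its two
    candidate buckets. A subgraph with as many edges as vertices already
    holds a cycle ("maximal"), so another insertion landing entirely
    inside it would walk an endless cuckoo path."""
    def __init__(self, n_buckets):
        self.parent = list(range(n_buckets))
        self.maximal = [False] * n_buckets   # subgraph already has a cycle?

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def same_lock(self, b1, b2):
        # Insertions whose paths lie in the same subgraph must serialize
        # under one graph-grained lock.
        return self.find(b1) == self.find(b2)

    def insert_edge(self, b1, b2):
        r1, r2 = self.find(b1), self.find(b2)
        if r1 == r2:
            if self.maximal[r1]:
                return "would-loop"   # predetermined: rehash/resize instead
            self.maximal[r1] = True   # this edge closes the subgraph's cycle
            return "ok"
        self.parent[r2] = r1          # merge two subgraphs
        self.maximal[r1] = self.maximal[r1] or self.maximal[r2]
        return "ok"

s = SubgraphIndex(6)
assert s.insert_edge(0, 1) == "ok"
assert s.insert_edge(2, 3) == "ok"
assert not s.same_lock(0, 2)           # independent subgraphs: no shared lock
assert s.insert_edge(1, 0) == "ok"     # closes a cycle in {0, 1}
assert s.insert_edge(0, 1) == "would-loop"
```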

Speakers
YS

Yuanyuan Sun

Huazhong University of Science and Technology
YH

Yu Hua

Huazhong University of Science and Technology
ZC

Zhangyu Chen

Huazhong University of Science and Technology
YG

Yuncheng Guo

Huazhong University of Science and Technology


Thursday July 11, 2019 10:25am - 10:45am PDT
USENIX ATC Track II: Grand Ballroom VII–IX

10:45am PDT

Break with Refreshments
Thursday July 11, 2019 10:45am - 11:15am PDT
Grand Ballroom Prefunction

11:15am PDT

NICA: An Infrastructure for Inline Acceleration of Network Applications
With rising network rates, cloud vendors increasingly deploy FPGA-based SmartNICs (F-NICs), leveraging their inline processing capabilities to offload hypervisor networking infrastructure. However, the use of F-NICs for accelerating general-purpose server applications in clouds has been limited.

NICA is a hardware-software co-designed framework for inline acceleration of the application data plane on F-NICs in multi-tenant systems. A new ikernel programming abstraction, tightly integrated with the network stack, enables application control of F-NIC computations that process application network traffic, with minimal code changes. In addition, NICA’s virtualization architecture supports fine-grain time-sharing of F-NIC logic and provides I/O path virtualization. Together these features enable cost-effective sharing of F-NICs across virtual machines with strict performance guarantees.

We prototype NICA on Mellanox F-NICs and integrate ikernels with the high-performance VMA network stack and the KVM hypervisor. We demonstrate significant acceleration of real-world applications in both bare-metal and virtualized environments, while requiring only minor code modifications to accelerate them on F-NICs. For example, a transparent key-value store cache ikernel added to the stock memcached server reaches 40 Gbps server throughput (99% line-rate) at 6 μs 99th-percentile latency for 16-byte key-value pairs, which is 21× the throughput of a 6-core CPU with a kernel-bypass network stack. The throughput scales linearly for up to 6 VMs running independent instances of memcached.

Speakers
HE

Haggai Eran

Technion – Israel Institute of Technology & Mellanox Technologies
LZ

Lior Zeno

Technion – Israel Institute of Technology
MT

Maroun Tork

Technion – Israel Institute of Technology
GM

Gabi Malka

Technion – Israel Institute of Technology
MS

Mark Silberstein

Technion – Israel Institute of Technology


Thursday July 11, 2019 11:15am - 11:35am PDT
USENIX ATC Track I: Grand Ballroom I–VI

11:15am PDT

SIMD-X: Programming and Processing of Graph Algorithms on GPUs
With high computation power and memory bandwidth, graphics processing units (GPUs) lend themselves to accelerate data-intensive analytics, especially when such applications fit the single instruction multiple data (SIMD) model. However, graph algorithms such as breadth-first search and k-core often fail to take full advantage of GPUs, due to irregularity in memory access and control flow. To address this challenge, we have developed SIMD-X, for programming and processing of single instruction multiple, complex, data on GPUs. Specifically, the new Active-Compute-Combine (ACC) model not only provides ease of programming to programmers, but more importantly creates opportunities for system-level optimizations. To this end, SIMD-X utilizes just-in-time task management, which filters out inactive vertices at runtime and intelligently maps various tasks to different numbers of GPU cores in pursuit of workload balancing. In addition, SIMD-X leverages push-pull based kernel fusion that, with the help of a new deadlock-free global barrier, reduces a large number of computation kernels to very few. Using SIMD-X, a user can program a graph algorithm in tens of lines of code, while achieving 3x, 6x, 24x, 3x speedup over Gunrock, Galois, CuSha, and Ligra, respectively.

Speakers
HL

Hang Liu

University of Massachusetts Lowell
HH

H. Howie Huang

George Washington University


Thursday July 11, 2019 11:15am - 11:35am PDT
USENIX ATC Track II: Grand Ballroom VII–IX

11:35am PDT

E3: Energy-Efficient Microservices on SmartNIC-Accelerated Servers
We investigate the use of SmartNIC-accelerated servers to execute microservice-based applications in the data center. By offloading suitable microservices to the SmartNIC’s low-power processor, we can improve server energy-efficiency without latency loss. However, as a heterogeneous computing substrate in the data path of the host, SmartNICs bring several challenges to a microservice platform: network traffic routing and load balancing, microservice placement on heterogeneous hardware, and contention on shared SmartNIC resources. We present E3, a microservice execution platform for SmartNIC-accelerated servers. E3 follows the design philosophies of the Azure Service Fabric microservice platform and extends key system components to a SmartNIC to address the above-mentioned challenges. E3 employs three key techniques: ECMP-based load balancing via SmartNICs to the host, network topology-aware microservice placement, and a data-plane orchestrator that can detect SmartNIC overload. Our E3 prototype using Cavium LiquidIO SmartNICs shows that SmartNIC offload can improve cluster energy-efficiency up to 3× and cost efficiency up to 1.9× at up to 4% latency cost for common microservices, including real-time analytics, an IoT hub, and virtual network functions.

Speakers
ML

Ming Liu

University of Washington
SP

Simon Peter

The University of Texas at Austin
AK

Arvind Krishnamurthy

University of Washington
PM

Phitchaya Mangpo Phothilimthana

University of California, Berkeley


Thursday July 11, 2019 11:35am - 11:55am PDT
USENIX ATC Track I: Grand Ballroom I–VI

11:55am PDT

INSIDER: Designing In-Storage Computing System for Emerging High-Performance Drive
We present INSIDER, a full-stack redesigned storage system to help users fully utilize the performance of emerging storage drives with moderate programming efforts. On the hardware side, INSIDER introduces an FPGA-based reconfigurable drive controller as the in-storage computing (ISC) unit; it is able to saturate the high drive performance while retaining enough programmability. On the software side, INSIDER integrates with the existing system stack and provides effective abstractions. For the host programmer, we introduce virtual file abstraction to abstract ISC as file operations; this hides the existence of the drive processing unit and minimizes the host code modification to leverage the drive computing capability. By separating out the drive processing unit to the data plane, we expose a clear drive-side interface so that drive programmers can focus on describing the computation logic; the details of data movement between different system components are hidden. With the software/hardware co-design, INSIDER runtime provides crucial system support. It not only transparently enforces the isolation and scheduling among offloaded programs, but it also protects the drive data from being accessed by unwarranted programs.

We build an INSIDER drive prototype and implement its corresponding software stack. The evaluation shows that INSIDER achieves an average 12X performance improvement and 31X accelerator cost efficiency when compared to the existing ARM-based ISC system. Additionally, it requires much less effort when implementing applications. INSIDER is open-sourced, and we have adapted it to the AWS F1 instance for public access.

Thursday July 11, 2019 11:55am - 12:15pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

11:55am PDT

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs
Recent deep learning models have moved beyond low dimensional regular grids such as image, video, and speech, to high-dimensional graph-structured data, such as social networks, e-commerce user-item graphs, and knowledge graphs. This evolution has led to large graph-based neural network models that go beyond what existing deep learning frameworks or graph computing systems are designed for. We present NeuGraph, a new framework that bridges the graph and dataflow models to support efficient and scalable parallel neural network computation on graphs. NeuGraph introduces graph computation optimizations into the management of data partitioning, scheduling, and parallelism in dataflow-based deep learning frameworks. Our evaluation shows that, on small graphs that can fit in a single GPU, NeuGraph outperforms state-of-the-art implementations by a significant margin, while scaling to large real-world graphs that none of the existing frameworks can handle directly with GPUs.

Speakers
LM

Lingxiao Ma

Peking University
ZY

Zhi Yang

Peking University
YM

Youshan Miao

Microsoft Research
JX

Jilong Xue

Microsoft Research
MW

Ming Wu

Microsoft Research
LZ

Lidong Zhou

Microsoft Research
YD

Yafei Dai

Peking University


Thursday July 11, 2019 11:55am - 12:15pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

12:15pm PDT

Cognitive SSD: A Deep Learning Engine for In-Storage Data Retrieval
Data analysis and retrieval is a widely-used component in existing artificial intelligence systems. However, each request has to go through each layer across the I/O stack, which moves a tremendous amount of irrelevant data between secondary storage, DRAM, and the on-chip cache. This leads to high response latency and rising energy consumption. To address this issue, we propose Cognitive SSD, an energy-efficient engine for deep learning based unstructured data retrieval. In Cognitive SSD, a flash-accessing accelerator named DLG-x is placed by the side of flash memory to achieve near-data deep learning and graph search. Such functions of in-SSD deep learning and graph search are exposed to the users as library APIs via NVMe command extension. Experimental results on the FPGA-based prototype reveal that the proposed Cognitive SSD reduces latency by 69.9% on average in comparison with CPU based solutions on conventional SSDs, and it reduces the overall system power consumption by up to 34.4% and 63.0% respectively when compared to CPU and GPU based solutions that deliver comparable performance.

Speakers
SL

Shengwen Liang

State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing; University of Chinese Academy of Sciences
YW

Ying Wang

State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing; University of Chinese Academy of Sciences
YL

Youyou Lu

Tsinghua University
ZY

Zhe Yang

Tsinghua University
HL

Huawei Li

State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing; University of Chinese Academy of Sciences
XL

Xiaowei Li

State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing; University of Chinese Academy of Sciences


Thursday July 11, 2019 12:15pm - 12:35pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

12:15pm PDT

Pre-Select Static Caching and Neighborhood Ordering for BFS-like Algorithms on Disk-based Graph Engines
Many important graph algorithms are based on the breadth first search (BFS) approach, which builds itself on recursive vertex traversal. We classify algorithms that share this characteristic into what we call BFS-like algorithms. In this work, we first analyze and study the I/O request patterns of BFS-like algorithms executed on disk-based graph engines. Our analysis exposes two shortcomings in executing BFS-like algorithms. First, we find that the use of the cache is ineffective. To make use of the cache more effectively, we propose an in-memory static cache, which we call the BFS-Aware Static Cache, or Basc for short. Basc is static as its contents, which are edge lists of vertices that are pre-selected before algorithm execution, do not change throughout the execution of the algorithm. Second, we find that the state-of-the-art ordering method for graphs on disks is ineffective with BFS-like algorithms. Thus, based on an I/O cost model that estimates performance based on the ordering of graphs, we develop an efficient graph ordering called Neighborhood Ordering, or Norder. We provide extensive evaluations of Basc and Norder on two well-known graph engines using five real-world graphs, including Twitter, which has 1.9 billion edges. Our experimental results show that Basc and Norder collectively have a substantial performance impact.
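
The pre-selection idea behind a static cache can be sketched as follows (degree-based selection is shown as one plausible criterion; the paper's actual selection metric, and the helper names here, are assumptions): vertices are chosen once before execution, so no eviction logic runs during traversal.

```python
def preselect_static_cache(edge_lists, budget_bytes, bytes_per_edge=8):
    """Pre-select vertices whose edge lists are pinned in memory before a
    BFS-like algorithm runs; because the cache is static, its contents
    never change during traversal."""
    chosen, used = set(), 0
    # Favor high-degree vertices, whose edge lists are most likely to be
    # requested repeatedly across BFS levels.
    for v in sorted(edge_lists, key=lambda v: len(edge_lists[v]), reverse=True):
        cost = len(edge_lists[v]) * bytes_per_edge
        if used + cost <= budget_bytes:
            chosen.add(v)
            used += cost
    return chosen

def read_edges(v, edge_lists, cache):
    # During execution, cached vertices never touch the disk path.
    where = "memory" if v in cache else "disk"
    return edge_lists[v], where

graph = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0, 1]}
cache = preselect_static_cache(graph, budget_bytes=40)
assert cache == {0, 3}                        # 24 + 16 bytes fills the budget
assert read_edges(0, graph, cache)[1] == "memory"
assert read_edges(1, graph, cache)[1] == "disk"
```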

Speakers
JS

Jiwon Seo

Hanyang University


Thursday July 11, 2019 12:15pm - 12:35pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

12:35pm PDT

Conference Luncheon
Thursday July 11, 2019 12:35pm - 2:00pm PDT
Olympic Pavilion

2:00pm PDT

From Laptop to Lambda: Outsourcing Everyday Jobs to Thousands of Transient Functional Containers
We present gg, a framework and a set of command-line tools that helps people execute everyday applications—e.g., software compilation, unit tests, video encoding, or object recognition—using thousands of parallel threads on a cloud-functions service to achieve near-interactive completion time. In the future, instead of running these tasks on a laptop, or keeping a warm cluster running in the cloud, users might push a button that spawns 10,000 parallel cloud functions to execute a large job in a few seconds from start. gg is designed to make this practical and easy.

With gg, applications express a job as a composition of lightweight OS containers that are individually transient (lifetimes of 1–60 seconds) and functional (each container is hermetically sealed and deterministic). gg takes care of instantiating these containers on cloud functions, loading dependencies, minimizing data movement, moving data between containers, and dealing with failure and stragglers.

We ported several latency-sensitive applications to run on gg and evaluated its performance. In the best case, a distributed compiler built on gg outperformed a conventional tool (icecc) by 2–5×, without requiring a warm cluster running continuously. In the worst case, gg was within 20% of the hand-tuned performance of an existing tool for video encoding (ExCamera).

Speakers
SF

Sadjad Fouladi

Stanford University
FR

Francisco Romero

Stanford University
DI

Dan Iter

Stanford University
QL

Qian Li

Stanford University
SC

Shuvo Chatterjee

Unaffiliated
CK

Christos Kozyrakis

Stanford University
MZ

Matei Zaharia

Stanford University
KW

Keith Winstein

Stanford University


Thursday July 11, 2019 2:00pm - 2:20pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

2:00pm PDT

StreamBox-TZ: Secure Stream Analytics at the Edge with TrustZone
While it is compelling to process large streams of IoT data on the cloud edge, doing so exposes the data to a sophisticated, vulnerable software stack on the edge, and hence to security threats. To this end, we advocate isolating the data and its computations in a trusted execution environment (TEE) on the edge, shielding them from the remaining edge software stack, which we deem untrusted.

This approach faces two major challenges: (1) executing high-throughput, low-delay stream analytics in a single TEE, which is constrained by a low trusted computing base (TCB) and limited physical memory; (2) verifying execution of stream analytics as the execution involves untrusted software components on the edge. In response, we present StreamBox-TZ (SBT), a stream analytics engine for an edge platform that offers strong data security, verifiable results, and good performance. SBT contributes a data plane designed and optimized for a TEE based on ARM TrustZone. It supports continuous remote attestation for analytics correctness and result freshness while incurring low overhead. SBT adds only a 42.5 KB executable to the TCB (16% of the entire TCB). On an octa-core ARMv8 platform, it delivers state-of-the-art performance by processing input events up to 140 MB/sec (12M events/sec) with sub-second delay. The overhead incurred by SBT’s security mechanism is less than 25%.

Speakers
HP

Heejin Park

Purdue ECE
SZ

Shuang Zhai

Purdue ECE
LL

Long Lu

Northeastern University


Thursday July 11, 2019 2:00pm - 2:20pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

2:20pm PDT

Hodor: Intra-Process Isolation for High-Throughput Data Plane Libraries
As network, I/O, accelerator, and NVM devices capable of a million operations per second make their way into data centers, the software stack managing such devices has been shifting from implementations within the operating system kernel to more specialized kernel-bypass approaches. While the in-kernel approach guarantees safety and provides resource multiplexing, it imposes too much overhead on microsecond-scale tasks. Kernel-bypass approaches improve throughput substantially but sacrifice safety and complicate resource management: if applications are mutually distrusting, then either each application must have exclusive access to its own device or else the device itself must implement resource management.

This paper shows how to attain both safety and performance via intra-process isolation for data plane libraries. We propose protected libraries as a new OS abstraction which provides separate user-level protection domains for different services (e.g., network and in-memory database), with performance approaching that of unprotected kernel bypass. We also show how this new feature can be utilized to enable sharing of data plane libraries across distrusting applications. Our proposed solution uses Intel's memory protection keys (PKU) in a safe way to change the permissions associated with subsets of a single address space. In addition, it uses hardware watchpoints to delay asynchronous event delivery and to guarantee independent failure of applications sharing a protected library.

We show that our approach can efficiently protect high-throughput in-memory databases and user-space network stacks. Our implementation allows up to 2.3 million library entrances per second per core, outperforming both kernel-level protection and two alternative implementations that use system calls and Intel's VMFUNC switching of user-level address spaces, respectively.

Speakers
Mohammad Hedayati, University of Rochester
Spyridoula Gravani, University of Rochester
Ethan Johnson, University of Rochester
John Criswell, University of Rochester
Michael L. Scott, University of Rochester
Kai Shen, Google


Thursday July 11, 2019 2:20pm - 2:40pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

2:20pm PDT

CoSMIX: A Compiler-based System for Secure Memory Instrumentation and Execution in Enclaves
Hardware secure enclaves are increasingly used to run complex applications. Unfortunately, existing and emerging enclave architectures do not allow secure and efficient implementation of custom page fault handlers. This limitation impedes in-enclave use of secure memory-mapped files and prevents extensions of the application memory layer commonly used in untrusted systems, such as transparent memory compression or access to remote memory.

CoSMIX is a Compiler-based system for Secure Memory Instrumentation and eXecution of applications in secure enclaves. A novel memory store abstraction allows implementation of application-level secure page fault handlers that are invoked by a lightweight enclave runtime. The CoSMIX compiler instruments the application memory accesses to use one or more memory stores, guided by a global instrumentation policy or code annotations without changing application code.

The CoSMIX prototype runs on Intel SGX and is compatible with popular SGX execution environments, including SCONE and Graphene. Our evaluation of several production applications shows how CoSMIX improves their security and performance by recompiling them with appropriate memory stores. For example, unmodified Redis and Memcached key-value stores achieve about 2× speedup by using a self-paging memory store while working on datasets up to 6× larger than the enclave’s secure memory. Similarly, annotating a single line of code in a biometric verification server changes it to store its sensitive data in Oblivious RAM and makes it resilient against SGX side-channel attacks.
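The memory-store abstraction can be illustrated with a minimal sketch: instrumented loads go through a store object whose miss handler plays the role of an application-level page fault handler. The API below is invented for illustration and is not CoSMIX's actual interface.

```python
# Sketch of a memory store with an application-level fault handler.

PAGE = 4096

class MemoryStore:
    def __init__(self, fault_handler, capacity_pages=2):
        self.fault_handler = fault_handler   # called on a miss, returns page bytes
        self.capacity = capacity_pages
        self.cache = {}                      # page number -> bytearray
        self.faults = 0

    def load(self, addr):
        page, off = divmod(addr, PAGE)
        if page not in self.cache:
            self.faults += 1
            if len(self.cache) >= self.capacity:     # simple eviction
                self.cache.pop(next(iter(self.cache)))
            self.cache[page] = self.fault_handler(page)
        return self.cache[page][off]

# A real backing store could decompress, decrypt, page to untrusted memory,
# or fetch remote memory; this one just synthesizes page contents.
store = MemoryStore(fault_handler=lambda page: bytearray([page % 256] * PAGE))

a = store.load(3 * PAGE + 10)   # miss: handler invoked
b = store.load(3 * PAGE + 11)   # hit: same cached page
```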

Speakers
Yan Michalevsky, Anjuna Security
Mark Silberstein, Technion – Israel Institute of Technology


Thursday July 11, 2019 2:20pm - 2:40pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

2:40pm PDT

A Retargetable System-Level DBT Hypervisor
System-level Dynamic Binary Translation (DBT) provides the capability to boot an Operating System (OS) and execute programs compiled for an Instruction Set Architecture (ISA) different from that of the host machine. Due to their performance-critical nature, system-level DBT frameworks are typically hand-coded and heavily optimized, both for their guest and host architectures. While this results in good performance of the DBT system, engineering costs for supporting a new, or extending an existing, architecture are high. In this paper we develop a novel, retargetable DBT hypervisor, which includes guest-specific modules generated from high-level guest machine specifications. Our system not only simplifies retargeting of the DBT, but also delivers performance levels in excess of existing manually created DBT solutions. We achieve this by combining offline and online optimizations, and exploiting the freedom of a Just-in-time (JIT) compiler operating in a bare-metal environment provided by a Virtual Machine. We evaluate our DBT using both targeted micro-benchmarks as well as standard application benchmarks, and we demonstrate its ability to outperform the de facto standard QEMU DBT system. Our system delivers an average speedup of 2.21x over QEMU across SPEC CPU2006 integer benchmarks running in a full-system Linux OS environment, compiled for the 64-bit ARMv8-A ISA, and hosted on an x86-64 platform. For floating-point applications the speedup is even higher, reaching 6.49x on average. We demonstrate that our system-level DBT system significantly reduces the effort required to support a new ISA, while delivering outstanding performance.
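The core DBT loop — translate each guest block once and cache the result so repeated execution pays only dispatch cost — can be sketched in miniature. A real DBT emits native host code; the toy guest ISA and closure-based "host code" below are purely illustrative.

```python
# Minimal flavor of dynamic binary translation with a translation cache.

def translate(block):
    """Compile a list of (op, arg) guest instructions into one host function."""
    steps = []
    for op, arg in block:
        if op == "addi":
            steps.append(lambda acc, a=arg: acc + a)
        elif op == "muli":
            steps.append(lambda acc, a=arg: acc * a)
        else:
            raise ValueError(f"unknown guest op {op}")
    def host_fn(acc):
        for step in steps:
            acc = step(acc)
        return acc
    return host_fn

translation_cache = {}

def execute(pc, block, acc):
    if pc not in translation_cache:          # translate on first execution only
        translation_cache[pc] = translate(block)
    return translation_cache[pc](acc)

prog = [("addi", 5), ("muli", 3)]
r1 = execute(0x1000, prog, 1)    # translates, then runs: (1 + 5) * 3 = 18
r2 = execute(0x1000, prog, 2)    # cache hit: (2 + 5) * 3 = 21
```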

Speakers
Tom Spink, University of Edinburgh
Harry Wagstaff, University of Edinburgh
Björn Franke, University of Edinburgh


Thursday July 11, 2019 2:40pm - 3:00pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

2:40pm PDT

Secured Routines: Language-based Construction of Trusted Execution Environments
Trusted Execution Environments (TEEs), such as Intel SGX’s enclave, use hardware to ensure the confidentiality and integrity of operations on sensitive data. While the technology is widely available, the complexity of its programming model and its performance overhead have limited adoption. TEEs provide a new and valuable hardware functionality that has no obvious analogue in programming languages, which means that developers must manually partition their application into trusted and untrusted components.

This paper describes an approach that fully integrates trusted execution in a language-appropriate manner. We extend the Go language to allow a programmer to execute a goroutine within an enclave, to use low-overhead channels to communicate between the trusted and untrusted environments, and to rely on a compiler to automatically extract the secure code and data. Our prototype compiler and runtime, GoTEE, is a backward-compatible fork of the Go compiler.

The evaluation shows that our compiler-driven code and data partitioning efficiently executes both microbenchmarks and applications. On the former, GoTEE achieves a 5.2x throughput improvement and a 2.3x latency improvement over the Intel SGX SDK. Our case studies, the Go tls package and a secured keystore inspired by the go-ethereum project, show that minor source-code modifications suffice to provide confidentiality and integrity guarantees with only moderate performance overheads.


Thursday July 11, 2019 2:40pm - 3:00pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

3:00pm PDT

MTS: Bringing Multi-Tenancy to Virtual Networking
Multi-tenant cloud computing provides great benefits in terms of resource sharing, elastic pricing, and scalability. However, it also changes the security landscape and introduces the need for strong isolation between the tenants, also inside the network. This paper is motivated by the observation that while multi-tenancy is widely used in cloud computing, the virtual switch designs currently used for network virtualization lack sufficient support for tenant isolation. Hence, we present, implement, and evaluate a virtual switch architecture, MTS, which brings secure design best practice to the context of multi-tenant virtual networking: compartmentalization of virtual switches, least-privilege execution, complete mediation of all network communication, and reducing the trusted computing base shared between tenants. We build MTS from commodity components, providing an incrementally deployable and inexpensive upgrade path to cloud operators. Our extensive experiments, extending to both micro-benchmarks and cloud applications, show that, depending on the way it is deployed, MTS may produce 1.5-2x the throughput compared to state-of-the-art, with similar or better latency and modest resource overhead (1 extra CPU). MTS is available as open source software.

Speakers
Kashyap Thimmaraju, Technische Universität Berlin
Saad Hermak, Technische Universität Berlin
Gabor Retvari, BME HSNLab
Stefan Schmid, Faculty of Computer Science, University of Vienna


Thursday July 11, 2019 3:00pm - 3:20pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

3:00pm PDT

Supporting Security Sensitive Tenants in a Bare-Metal Cloud
SecCloud is a new architecture for bare-metal clouds that enables tenants to control tradeoffs between security, price, and performance. It enables security-sensitive tenants to minimize their trust in the public cloud provider and achieve similar levels of security and control that they can obtain in their own private data centers, while not imposing overhead on tenants that are security-insensitive and not compromising the flexibility or operational efficiency of the provider. Our prototype exploits a novel provisioning system and specialized firmware to enable elasticity similar to virtualized clouds. Experimentally we quantify the cost of different levels of security for a variety of workloads and demonstrate the value of giving control to the tenant.

Speakers
Amin Mosayyebzadeh, Boston University
Apoorve Mohan, Northeastern University
Sahil Tikale, Boston University
Mania Abdi, Northeastern University
Nabil Schear, MIT Lincoln Laboratory
Charles Munson, MIT Lincoln Laboratory
Larry Rudolph, Two Sigma
Gene Cooperman, Northeastern University
Peter Desnoyers, Northeastern University
Orran Krieger, Boston University


Thursday July 11, 2019 3:00pm - 3:20pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

3:20pm PDT

3:50pm PDT

Asynchronous I/O Stack: A Low-latency Kernel I/O Stack for Ultra-Low Latency SSDs
Today's ultra-low latency SSDs can deliver an I/O latency of sub-ten microseconds. With this dramatically shrunken device time, operations inside the kernel I/O stack, which were traditionally considered lightweight, are no longer a negligible portion. This motivates us to reexamine the storage I/O stack design and propose an asynchronous I/O stack (AIOS), where synchronous operations in the I/O path are replaced by asynchronous ones to overlap I/O-related CPU operations with device I/O. The asynchronous I/O stack leverages a lightweight block layer specialized for NVMe SSDs using the page cache without block I/O scheduling and merging, thereby reducing the sojourn time in the block layer. We prototype the proposed asynchronous I/O stack on the Linux kernel and evaluate it with various workloads. Synthetic FIO benchmarks demonstrate that the application-perceived I/O latency falls into single-digit microseconds for 4 KB random reads on Optane SSD, and the overall I/O latency is reduced by 15-33% across varying block sizes. This I/O latency reduction leads to a significant performance improvement of real-world applications as well: 11-44% IOPS increase on RocksDB and 15-30% throughput improvement on Filebench and OLTP workloads.
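The overlap idea can be sketched with coroutines: issue the device command first, then perform CPU-side work (page allocation, DMA mapping) while the device is busy, instead of serializing those steps on the critical path. The step names and delays below are illustrative stand-ins, not the kernel's actual code paths.

```python
import asyncio

# Sketch of an asynchronous I/O path overlapping CPU work with device time.

log = []

async def device_io():
    log.append("submit")
    await asyncio.sleep(0.02)       # stands in for device latency
    log.append("device-done")

async def cpu_side_prep():
    log.append("alloc-page")        # would normally sit on the critical path
    await asyncio.sleep(0)          # yield, letting the device "run"
    log.append("map-dma")

async def read_request():
    # Launch device I/O, then do the CPU-side work while the device is busy.
    dev = asyncio.create_task(device_io())
    await cpu_side_prep()
    await dev
    log.append("complete")

asyncio.run(read_request())
```

In the resulting trace, the CPU-side steps finish while the device is still busy, which is exactly the latency the asynchronous stack hides.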

Speakers
Gyusun Lee, Sungkyunkwan University
Seokha Shin, Sungkyunkwan University
Wonsuk Song, Sungkyunkwan University
Tae Jun Ham, Seoul National University
Jae W. Lee, Seoul National University
Jinkyu Jeong, Sungkyunkwan University


Thursday July 11, 2019 3:50pm - 4:10pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

3:50pm PDT

SmartDedup: Optimizing Deduplication for Resource-constrained Devices
Storage on smart devices such as smartphones and the Internet of Things has limited performance, capacity, and endurance. Deduplication has the potential to address these limitations by eliminating redundant I/Os and data, but it must be considered under the various resource constraints of the devices. This paper presents SmartDedup, a deduplication solution optimized for resource-constrained devices. It proposes a novel architecture that supports symbiotic in-line and out-of-line deduplication to take advantage of their complementary strengths and allow them to be adapted according to a device's current resource availability. It also cohesively combines in-memory and on-disk fingerprint stores to minimize the memory overhead while achieving a good level of deduplication. SmartDedup is prototyped on EXT4 and F2FS and evaluated using benchmarks, workloads generated from real-world device images, and traces collected from real-world devices. The results show that SmartDedup substantially improves I/O performance (e.g., increases write and read throughput by 31.1% and 32%, respectively for an FIO experiment with 25% deduplication ratio), reduces flash writes (e.g., by 70.9% in a trace replay experiment with 72.4% deduplication ratio) and saves space usage (e.g., by 45% in a DEDISbench experiment with 46.1% deduplication ratio) with low memory, storage, and battery overhead, compared to both native file systems and related deduplication solutions.
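The combined in-memory/on-disk fingerprint store can be sketched as follows; the capacities, eviction, and promotion policy below are simplified stand-ins for SmartDedup's actual adaptive design.

```python
import hashlib

# Sketch of a two-tier fingerprint store for deduplication.

class DedupEngine:
    def __init__(self, mem_capacity=2):
        self.mem = {}            # hot fingerprints: fp -> block address
        self.disk = {}           # full fingerprint store
        self.mem_capacity = mem_capacity
        self.next_addr = 0
        self.writes_saved = 0

    def _fingerprint(self, block):
        return hashlib.sha1(block).hexdigest()

    def write(self, block):
        fp = self._fingerprint(block)
        addr = self.mem.get(fp)
        if addr is None:
            addr = self.disk.get(fp)         # slower on-disk lookup
        if addr is None:                     # genuinely new data
            addr = self.next_addr
            self.next_addr += 1
            self.disk[fp] = addr
        else:
            self.writes_saved += 1           # duplicate: no flash write
        # Promote to the in-memory tier, evicting arbitrarily when full.
        if fp not in self.mem and len(self.mem) >= self.mem_capacity:
            self.mem.pop(next(iter(self.mem)))
        self.mem[fp] = addr
        return addr

engine = DedupEngine()
a1 = engine.write(b"A" * 4096)
a2 = engine.write(b"B" * 4096)
a3 = engine.write(b"A" * 4096)   # duplicate of the first block
```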

Speakers
Qirui Yang, Arizona State University
Runyu Jin, Arizona State University
Ming Zhao, Arizona State University


Thursday July 11, 2019 3:50pm - 4:10pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

4:10pm PDT

M³x: Autonomous Accelerators via Context-Enabled Fast-Path Communication
Performance and efficiency requirements are driving a trend towards specialized accelerators in both datacenters and embedded devices. In order to cut down communication overheads, system components are pinned to cores and fast-path communication between them is established. These fast paths reduce latency by avoiding indirections through the operating system. However, we see three roadblocks that can impede further gains: First, accelerators today need to be assisted by a general-purpose core, because they cannot autonomously access operating system services like file systems or network stacks. Second, fast-path communication is at odds with preemptive context switching, which is still necessary today to improve efficiency when applications underutilize devices. Third, these concepts should be kept orthogonal, such that direct and unassisted communication is possible between any combination of accelerators and general-purpose cores. At the same time, all of them should support switching between multiple application contexts, which is most difficult with accelerators that lack the hardware features to run an operating system.

We present M³x, a system architecture that removes these roadblocks. M³x retains the low overhead of fast-path communication while enabling context switching for general-purpose cores and specialized accelerators. M³x runs accelerators autonomously and achieves a speedup of 4.7× for PCIe-attached image-processing accelerators compared to traditional assisted operation. At the same time, utilization of the host CPU is reduced by a factor of 30.

Speakers
Nils Asmussen, Technische Universität Dresden, Germany; Barkhausen Institut, Dresden, Germany
Michael Roitzsch, Technische Universität Dresden, Germany; Barkhausen Institut, Dresden, Germany
Hermann Härtig, Technische Universität Dresden, Germany; Barkhausen Institut, Dresden, Germany


Thursday July 11, 2019 4:10pm - 4:30pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

4:10pm PDT

Data Domain Cloud Tier: Backup here, backup there, deduplicated everywhere!
Data Domain has added a cloud tier capability to its on-premises storage appliance, allowing clients to achieve the cost benefits of deduplication in the cloud. While there were many architectural changes necessary to support a cloud tier in a mature storage product, in this paper, we focus on innovations needed to support key functionality for customers. Consider typical customer interactions: First, a customer determines which files to migrate to the cloud by estimating how much space will be freed on the on-premises Data Domain appliance. Second, a customer transfers selected files to the cloud and later restores files back. Finally, a customer deletes a file in the cloud when its retention period has expired. Each of these operations requires significant architectural changes and new algorithms to address both the impact of deduplicated storage and the latency and expense of cloud object storage. We also present analysis from deployed cloud tier systems. As an example, some customers have moved more than 20PB of logical data to the cloud tier and achieved a total compression factor (deduplication * local compression) of 40× or more, resulting in millions of dollars of cost savings.
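Why estimating freed space is subtle under deduplication can be shown with a small sketch: a chunk frees space only if no remaining file still references it. The chunk IDs and sizes below are illustrative, not the product's actual fingerprinting.

```python
# Sketch: space freed by migrating files out of a deduplicated store.

def estimate_freed_bytes(files, migrate, chunk_size=8192):
    """files: name -> list of chunk fingerprints; migrate: set of file names."""
    remaining_refs = set()
    for name, chunks in files.items():
        if name not in migrate:
            remaining_refs.update(chunks)
    migrated_chunks = set()
    for name in migrate:
        migrated_chunks.update(files[name])
    # Only chunks unreferenced by any remaining file are actually freed.
    freed = migrated_chunks - remaining_refs
    return len(freed) * chunk_size

files = {
    "backup-mon": ["c1", "c2", "c3"],
    "backup-tue": ["c2", "c3", "c4"],   # shares c2, c3 with Monday's backup
}
freed = estimate_freed_bytes(files, migrate={"backup-mon"})
# Only c1 is exclusive to backup-mon, so one chunk's worth of space is freed.
```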

Thursday July 11, 2019 4:10pm - 4:30pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

4:30pm PDT

Short Break
Thursday July 11, 2019 4:30pm - 4:35pm PDT
Grand Ballroom Prefunction

4:35pm PDT

GAIA: An OS Page Cache for Heterogeneous Systems
We propose a principled approach to integrating GPU memory with an OS page cache. We design GAIA, a weakly-consistent page cache that spans CPU and GPU memories. GAIA enables the standard mmap system call to map files into the GPU address space, thereby enabling data-dependent GPU accesses to large files and efficient write-sharing between the CPU and GPUs. Under the hood, GAIA (1) integrates lazy release consistency protocol into the OS page cache while maintaining backward compatibility with CPU processes and unmodified GPU kernels; (2) improves CPU I/O performance by using data cached in GPU memory, and (3) optimizes the readahead prefetcher to support accesses to files cached in GPUs. We prototype GAIA in Linux and evaluate it on NVIDIA Pascal GPUs. We show up to 3× speedup in CPU file I/O and up to 8× in unmodified realistic workloads such as Gunrock GPU-accelerated graph processing, image collage, and microscopy image stitching.

Speakers
Mark Silberstein, Technion – Israel Institute of Technology


Thursday July 11, 2019 4:35pm - 4:55pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

4:35pm PDT

Pragh: Locality-preserving Graph Traversal with Split Live Migration
Many kinds of real-world data, such as social, transportation, biology, and communication data, can be efficiently modeled as a graph. Hence, graph traversals such as multi-hop or graph-walking queries have been key operations atop graph stores. However, since different graph traversals may touch different sets of data, it is hard or even impossible to have a one-size-fits-all graph partitioning algorithm that preserves access locality for various graph traversal workloads. Meanwhile, prior shard-based migration faces a dilemma: coarse-grained migration may incur more migration overhead than its locality benefits repay, while fine-grained migration usually requires excessive metadata and incurs non-trivial maintenance cost. This paper proposes Pragh, an efficient locality-preserving live graph migration scheme for graph stores in the form of key-value pairs. The key idea of Pragh is a split migration model which only migrates values physically while retaining keys in the initial location. This allows fine-grained migration while avoiding the need to maintain excessive metadata. Pragh integrates an RDMA-friendly location cache from DrTM-KV to provide fully-localized accesses to migrated data and further makes a novel reuse of the cache replacement policy for lightweight monitoring. Pragh further supports evolving graphs through a check-and-forward mechanism to resolve the conflict between updates and migration of graph data. Evaluations on an 8-node RDMA-capable cluster using a representative graph traversal benchmark show that Pragh can increase the throughput by up to 19× and decrease the median latency by up to 94%, thanks to split live migration that eliminates 97% of remote accesses. A port of split live migration to Wukong with up to 2.53× throughput improvement further confirms the effectiveness and generality of Pragh.
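The split migration model can be sketched as follows: the value physically moves to the destination node, while the key stays at its original location and records a one-hop forwarding handle, so no global metadata is needed. The node and API names below are invented for illustration.

```python
# Sketch of split live migration: values move, keys stay behind.

class Node:
    def __init__(self, name):
        self.name = name
        self.keys = {}      # key -> ("local", value) or ("remote", node, handle)
        self.heap = {}      # handle -> value (migrated values live here)

    def put(self, key, value):
        self.keys[key] = ("local", value)

    def get(self, key, cluster):
        kind, *rest = self.keys[key]
        if kind == "local":
            return rest[0]
        node_name, handle = rest           # one-hop fetch of the migrated value
        return cluster[node_name].heap[handle]

def migrate_value(cluster, src, dst, key):
    kind, value = cluster[src].keys[key]
    assert kind == "local"
    handle = len(cluster[dst].heap)
    cluster[dst].heap[handle] = value                  # value physically moves
    cluster[src].keys[key] = ("remote", dst, handle)   # key stays behind

cluster = {"n0": Node("n0"), "n1": Node("n1")}
cluster["n0"].put("v:alice", {"follows": ["bob"]})
migrate_value(cluster, "n0", "n1", "v:alice")
val = cluster["n0"].get("v:alice", cluster)   # key lookup still starts at n0
```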

Speakers
Xiating Xie, Shanghai Jiao Tong University
Xingda Wei, Shanghai Jiao Tong University
Rong Chen, Shanghai Jiao Tong University
Haibo Chen, Shanghai Jiao Tong University / Huawei Technologies Co., Ltd.


Thursday July 11, 2019 4:35pm - 4:55pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

4:55pm PDT

Transkernel: Bridging Monolithic Kernels to Peripheral Cores
Smart devices see a large number of ephemeral tasks driven by background activities. In order to execute such a task, the OS kernel wakes up the platform beforehand and puts it back to sleep afterwards. In doing so, the kernel operates various IO devices and orchestrates their power state transitions. Such kernel executions are inefficient as they mismatch typical CPU hardware. They are better off running on a low-power, microcontroller-like core, i.e., a peripheral core, relieving the CPU of the inefficiency.

We therefore present a new OS structure, in which a lightweight virtual executor called transkernel offloads specific phases from a monolithic kernel. The transkernel translates stateful kernel execution through cross-ISA, dynamic binary translation (DBT); it emulates a small set of stateless kernel services behind a narrow, stable binary interface; it specializes for hot paths; it exploits ISA similarities for lowering DBT cost.

Through an ARM-based prototype, we demonstrate transkernel’s feasibility and benefit. We show that while cross-ISA DBT is typically used under the assumption of efficiency loss, it can enable efficiency gain, even on off-the-shelf hardware.

Speakers
Liwei Guo, Purdue ECE
Shuang Zhai, Purdue ECE
Yi Qiao, Purdue ECE


Thursday July 11, 2019 4:55pm - 5:15pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

4:55pm PDT

ElasticBF: Elastic Bloom Filter with Hotness Awareness for Boosting Read Performance in Large Key-Value Stores
LSM-tree based key-value (KV) stores suffer from severe read amplification because searching a key requires checking multiple SSTables. To reduce extra I/Os, Bloom filters are usually deployed in KV stores to improve read performance. However, Bloom filters suffer from false positives, and simply enlarging the Bloom filters introduces large memory overhead, so extra I/Os persist in memory-constrained systems. In this paper, we observe that access skewness is very common among SSTables or even small-sized segments within each SSTable. To leverage this skewness, we develop ElasticBF, a fine-grained heterogeneous Bloom filter management scheme with dynamic adjustment according to data hotness. ElasticBF is orthogonal to works optimizing the architecture of LSM-tree based KV stores, so it can be integrated with them to further speed up their read performance. We build ElasticBF atop LevelDB, RocksDB, and PebblesDB, and our experimental results show that ElasticBF increases the read throughput of the above KV stores by 2.34x, 2.35x, and 2.58x, respectively, while keeping almost the same write and range query performance.
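The hotness-driven adjustment can be sketched by giving each segment several small Bloom filter units and consulting more of them as the segment heats up, trading memory for a lower false-positive rate where it matters. All parameters below are illustrative, not ElasticBF's tuned values.

```python
import hashlib

# Sketch of elastic Bloom filters: per-segment filter units activated by hotness.

class BloomUnit:
    def __init__(self, bits=1024, seed=0):
        self.bits, self.seed, self.array = bits, seed, 0

    def _pos(self, key, i):
        h = hashlib.md5(f"{self.seed}:{i}:{key}".encode()).hexdigest()
        return int(h, 16) % self.bits

    def add(self, key):
        for i in range(2):
            self.array |= 1 << self._pos(key, i)

    def might_contain(self, key):
        return all(self.array >> self._pos(key, i) & 1 for i in range(2))

class Segment:
    def __init__(self, keys, units=4):
        self.units = [BloomUnit(seed=s) for s in range(units)]
        self.active = 1                     # cold segment: only one unit used
        self.keys = set(keys)
        for u in self.units:
            for k in keys:
                u.add(k)

    def heat_up(self):                      # more accesses -> activate a unit
        self.active = min(self.active + 1, len(self.units))

    def lookup(self, key):
        # All active units must agree before we pay for a disk read.
        if all(u.might_contain(key) for u in self.units[:self.active]):
            return key in self.keys         # stands in for the disk read
        return False

seg = Segment(keys=[f"k{i}" for i in range(100)])
seg.heat_up(); seg.heat_up()                # hot: 3 units now consulted
hit = seg.lookup("k7")
miss = seg.lookup("absent-key")
```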

Speakers
Yongkun Li, University of Science and Technology of China
Chengjin Tian, University of Science and Technology of China
Fan Guo, University of Science and Technology of China
Cheng Li, University of Science and Technology of China
Yinlong Xu, University of Science and Technology of China


Thursday July 11, 2019 4:55pm - 5:15pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

5:15pm PDT

Detecting Asymmetric Application-layer Denial-of-Service Attacks In-Flight with Finelame
Denial of service (DoS) attacks increasingly exploit algorithmic, semantic, or implementation characteristics dormant in victim applications, often with minimal attacker resources. Practical and efficient detection of these asymmetric DoS attacks requires us to (i) catch offending requests in-flight, before they consume a critical amount of resources, (ii) remain agnostic to the application internals, such as the programming language or runtime system, and (iii) introduce low overhead in terms of both performance and programmer effort.

This paper introduces Finelame, a language-independent framework for detecting asymmetric DoS attacks. Finelame leverages operating system visibility across the entire software stack to instrument key resource allocation and negotiation points. It leverages recent advances in the Linux extended Berkeley Packet Filter virtual machine to attach application-level interposition probes to key request processing functions, and lightweight resource monitors---user/kernel-level probes---to key resource allocation functions. The data collected is used to train a model of resource utilization that occurs throughout the lifetime of individual requests. The model parameters are then shared with the resource monitors, which use them to catch offending requests in-flight, inline with resource allocation. We demonstrate that Finelame can be integrated with legacy applications with minimal effort, and that it is able to detect resource abuse attacks much earlier than their intended completion time while posing low performance overheads.
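The detection step can be sketched as: train a simple model of per-request resource usage, then flag in-flight requests that deviate far from it at each resource-allocation point. The threshold rule and counters below are illustrative stand-ins; the real system gathers its data with eBPF probes in the kernel.

```python
import statistics

# Sketch of model-based, in-flight detection of resource-abusing requests.

class RequestModel:
    def __init__(self, k=3.0):
        self.k = k
        self.mean = self.stdev = None

    def train(self, samples):               # e.g., CPU time per request (ms)
        self.mean = statistics.mean(samples)
        self.stdev = statistics.pstdev(samples) or 1e-9

    def is_offending(self, usage):
        # Checked while the request is still running, not after completion.
        return usage > self.mean + self.k * self.stdev

model = RequestModel()
model.train([1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.1])   # benign request profile
normal = model.is_offending(1.3)
attack = model.is_offending(50.0)   # e.g., a ReDoS request burning CPU
```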

Speakers
Henri Maxime Demoulin, University of Pennsylvania
Isaac Pedisich, University of Pennsylvania
Nikos Vasilakis, University of Pennsylvania
Vincent Liu, University of Pennsylvania
Boon Thau Loo, University of Pennsylvania
Linh Thi Xuan Phan, University of Pennsylvania


Thursday July 11, 2019 5:15pm - 5:35pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

5:15pm PDT

SILK: Preventing Latency Spikes in Log-Structured Merge Key-Value Stores
LSM-based KV stores are designed to offer good write performance, by capturing client writes in memory, and only later flushing them to storage. Writes are later compacted into a tree-like data structure on disk to improve read performance and to reduce storage space use. It has been widely documented that compactions severely hamper throughput. Various optimizations have successfully dealt with this problem. These techniques include, among others, rate-limiting flushes and compactions, selecting among compactions for maximum effect, and limiting compactions to the highest level by so-called fragmented LSMs.

In this paper we focus on latencies rather than throughput. We first document the fact that LSM KVs exhibit high tail latencies. The techniques that have been proposed for optimizing throughput do not address this issue, and in fact in some cases exacerbate it. The root cause of these high tail latencies is interference between client writes, flushes and compactions. We then introduce the notion of an I/O scheduler for an LSM-based KV store to reduce this interference. We explore three techniques as part of this I/O scheduler: 1) opportunistically allocating more bandwidth to internal operations during periods of low load, 2) prioritizing flushes and compactions at the lower levels of the tree, and 3) preempting compactions.

SILK is a new open-source KV store that incorporates this notion of an I/O scheduler. SILK is derived from RocksDB, but the concepts can be applied to other LSM-based KV stores. We use both a production workload at Nutanix and synthetic benchmarks to demonstrate that SILK achieves up to two orders of magnitude lower 99th percentile latencies than RocksDB and TRIAD, without any significant negative effects on other performance metrics.
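The three scheduler techniques can be sketched as simple policy functions: flushes and low-level compactions are latency-critical and may preempt higher-level compactions, and internal operations get more bandwidth when client load is low. The numbers and criticality rule below are illustrative, not SILK's exact heuristics.

```python
# Sketch of an LSM I/O scheduler: bandwidth allocation and preemption policy.

TOTAL_BW = 200  # MB/s, illustrative device bandwidth

def internal_bandwidth(client_load_mbs):
    # Opportunistically give internal ops the bandwidth clients don't use,
    # keeping a small floor so compactions never fully starve.
    return max(TOTAL_BW - client_load_mbs, 20)

def schedule(running, incoming):
    """running/incoming: jobs like {"kind": "compaction", "level": 4}.
    Returns (action, job_to_run)."""
    def critical(job):
        # Flushes and L0/L1 compactions block client writes if delayed.
        return job["kind"] == "flush" or job.get("level", 99) <= 1
    if critical(incoming) and not critical(running):
        return ("preempt", incoming)        # pause the high-level compaction
    return ("queue", running)

act, job = schedule(running={"kind": "compaction", "level": 4},
                    incoming={"kind": "flush"})
bw_idle = internal_bandwidth(client_load_mbs=30)    # low load: 170 MB/s
bw_busy = internal_bandwidth(client_load_mbs=190)   # high load: floor of 20
```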

Speakers
Oana Balmau, University of Sydney
Florin Dinu, University of Sydney
Willy Zwaenepoel, University of Sydney
Karan Gupta, Nutanix Inc.
Diego Didona, IBM Research - Zurich


Thursday July 11, 2019 5:15pm - 5:35pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

5:35pm PDT

SemperOS: A Distributed Capability System
Capabilities provide an efficient and secure mechanism for fine-grained resource management and protection. However, as modern hardware architectures evolve toward large numbers of non-coherent and heterogeneous cores, we focus on the following research question: can capability systems scale to modern hardware architectures? In this work, we present a scalable capability system to drive future systems with many non-coherent heterogeneous cores. More specifically, we have designed a distributed capability system based on a HW/SW co-designed capability system. We analyzed the pitfalls of distributed capability operations running concurrently and built the protocols in accordance with the insights. We have incorporated these distributed capability management protocols in a new microkernel-based OS called SemperOS. Our OS operates the system by means of multiple microkernels, which employ distributed capabilities to provide an efficient and secure mechanism for fine-grained access to system resources. In the evaluation we investigated the scalability of our algorithms and ran applications (Nginx, LevelDB, SQLite, PostMark, etc.) that depend heavily on the OS services of SemperOS. The results indicate that there is no inherent scalability limitation for capability systems. Our evaluation shows that we achieve a parallel efficiency of 70% to 78% when examining a system with 576 cores executing 512 application instances while using 11% of the system’s cores for OS services.

Speakers
Matthias Hille, Technische Universität Dresden
Nils Asmussen, Technische Universität Dresden, Germany; Barkhausen Institut, Dresden, Germany
Pramod Bhatotia, University of Edinburgh
Hermann Härtig, Technische Universität Dresden, Germany; Barkhausen Institut, Dresden, Germany


Thursday July 11, 2019 5:35pm - 5:55pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

5:35pm PDT

Unification of Temporary Storage in the NodeKernel Architecture
Efficiently exchanging temporary data between tasks is critical to the end-to-end performance of many data processing frameworks and applications. Unfortunately, the diverse nature of temporary data creates storage demands that often fall between the sweet spots of traditional storage platforms, such as file systems or key-value stores.

We present NodeKernel, a novel distributed storage architecture that offers a convenient new point in the design space by fusing file system and key-value semantics in a common storage kernel while leveraging modern networking and storage hardware to achieve high performance and cost-efficiency. NodeKernel provides hierarchical naming, high scalability, and close to bare-metal performance for a wide range of data sizes and access patterns that are characteristic of temporary data. We show that storing temporary data in Crail, our concrete implementation of the NodeKernel architecture which uses RDMA networking with tiered DRAM/NVMe-Flash storage, improves NoSQL workload performance by up to 4.8× and Spark application performance by up to 3.4×. Furthermore, by storing data across NVMe Flash and DRAM storage tiers, Crail reduces storage cost by up to 8× compared to DRAM-only storage systems.
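Fusing file-system and key-value semantics in one namespace can be sketched as a single node store where files, directories, and KV entries are all nodes under hierarchical names. The API below is invented for illustration and is not Crail's actual interface.

```python
# Sketch of a unified node store: FS and KV objects share one namespace.

class NodeStore:
    def __init__(self):
        self.nodes = {"/": {"type": "dir"}}

    def create(self, path, node_type, data=None):
        parent = path.rsplit("/", 1)[0] or "/"
        if self.nodes.get(parent, {}).get("type") != "dir":
            raise FileNotFoundError(parent)
        self.nodes[path] = {"type": node_type, "data": data}

    def lookup(self, path):
        return self.nodes[path]["data"]

    def list_children(self, path):
        prefix = path.rstrip("/") + "/"
        return sorted(p for p in self.nodes
                      if p.startswith(prefix) and "/" not in p[len(prefix):])

store = NodeStore()
store.create("/shuffle", "dir")
store.create("/shuffle/map-0", "file", data=b"partition bytes")
store.create("/shuffle/kvtable", "dir")
store.create("/shuffle/kvtable/key1", "kv", data=b"value1")   # KV-style node

children = store.list_children("/shuffle")     # hierarchical naming
v = store.lookup("/shuffle/kvtable/key1")      # KV-style lookup
```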

Speakers
Patrick Stuedi, IBM Research
Animesh Trivedi, Vrije Universiteit
Jonas Pfefferle, IBM Research
Ana Klimovic, Stanford University
Adrian Schuepbach, IBM Research
Bernard Metzler, IBM Research


Thursday July 11, 2019 5:35pm - 5:55pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

6:30pm PDT

Poster Session and Reception
Mingle with fellow attendees at the Poster Session and Reception. Enjoy dinner, drinks, and the chance to connect with other attendees, speakers, and conference organizers. Posters of the papers presented in the Technical Sessions on Thursday afternoon and on Friday will be on display. View the list of accepted posters.

Thursday July 11, 2019 6:30pm - 8:30pm PDT
Lake Washington Ballroom
 
Friday, July 12
 

7:30am PDT

8:00am PDT

Badge Pickup
Friday July 12, 2019 8:00am - 12:00pm PDT
Grand Stair Prefunction

8:30am PDT

Lightning Talks
Friday July 12, 2019 8:30am - 9:10am PDT
Grand Ballroom

9:10am PDT

Short Break
Friday July 12, 2019 9:10am - 9:15am PDT
Grand Ballroom Prefunction

9:15am PDT

Evaluating File System Reliability on Solid State Drives
As solid state drives (SSDs) are increasingly replacing hard disk drives, the reliability of storage systems depends on the failure modes of SSDs and the ability of the file system layered on top to handle these failure modes. While the classical paper on IRON File Systems provides a thorough study of the failure policies of three file systems common at the time, we argue that 13 years later it is time to revisit file system reliability with SSDs and their reliability characteristics in mind, based on modern file systems that incorporate journaling, copy-on-write and log-structured approaches, and are optimized for flash. This paper presents a detailed study, spanning ext4, Btrfs and F2FS, and covering a number of different SSD error modes. We develop our own fault injection framework and explore over a thousand error cases. Our results indicate that 16% of these cases result in a file system that cannot be mounted or even repaired by its system checker. We also identify the key file system metadata structures that can cause such failures and finally, we recommend some design guidelines for file systems that are deployed on top of SSDs.
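A fault-injection harness in the same spirit can be sketched by wrapping block reads so chosen blocks return errors or silently corrupted data, then observing how the layer above reacts. The error model and toy "mount" below are illustrative, not the paper's framework.

```python
# Sketch of block-level fault injection under a file-system-like layer.

class FaultyBlockDevice:
    def __init__(self, blocks, fault_plan):
        self.blocks = blocks            # block number -> bytes
        self.fault_plan = fault_plan    # block number -> "io_error" | "corrupt"

    def read(self, bno):
        fault = self.fault_plan.get(bno)
        if fault == "io_error":
            raise IOError(f"read error at block {bno}")
        data = self.blocks[bno]
        if fault == "corrupt":
            data = bytes([data[0] ^ 0xFF]) + data[1:]   # silent bit flips
        return data

def mount_check(dev, superblock_bno=0, magic=b"FS01"):
    """A toy 'mount': fail cleanly on I/O error, detect bad magic."""
    try:
        sb = dev.read(superblock_bno)
    except IOError:
        return "unmountable"
    return "mounted" if sb.startswith(magic) else "needs-fsck"

blocks = {0: b"FS01" + b"\x00" * 12, 1: b"datablock......."}
ok = mount_check(FaultyBlockDevice(blocks, {}))
err = mount_check(FaultyBlockDevice(blocks, {0: "io_error"}))
bad = mount_check(FaultyBlockDevice(blocks, {0: "corrupt"}))
```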

Speakers
SJ

Shehbaz Jaffer

University of Toronto
SM

Stathis Maneas

University of Toronto
AH

Andy Hwang

University of Toronto
BS

Bianca Schroeder

University of Toronto


Friday July 12, 2019 9:15am - 9:35am PDT
USENIX ATC Track I: Grand Ballroom I–VI

9:15am PDT

R2P2: Making RPCs first-class datacenter citizens
Remote Procedure Calls are widely used to connect datacenter applications with strict tail-latency service level objectives at the scale of μs. Existing solutions utilize streaming or datagram-based transport protocols for RPCs that impose overheads and limit design flexibility. Our work exposes the RPC abstraction to the endpoints and the network, making RPCs first-class datacenter citizens and allowing for in-network RPC scheduling. We propose R2P2, a UDP-based transport protocol specifically designed for RPCs inside a datacenter. R2P2 exposes pairs of requests and responses and allows efficient and scalable RPC routing by separating the RPC target selection from request and reply streaming. Leveraging R2P2, we implement a novel join-bounded-shortest-queue (JBSQ) RPC load balancing policy, which lowers tail latency by centralizing pending RPCs in the router and ensures that requests are only routed to servers with a bounded number of outstanding requests. The R2P2 router logic can be implemented either in a software middlebox or within a P4 switch ASIC pipeline. Our evaluation, using a range of microbenchmarks, shows that the protocol is suitable for μs-scale RPCs and that its tail latency outperforms both random selection and classic HTTP reverse proxies. The P4-based implementation of R2P2 on a Tofino ASIC adds less than 1μs of latency, whereas the software middlebox implementation adds 5μs of latency and requires only two CPU cores to route RPCs at 10 Gbps line rate. R2P2 improves the tail latency of web index searching on a cluster of 16 workers operating at 50% of capacity by 5.7× over NGINX. R2P2 improves the throughput of the Redis key-value store on a 4-node cluster with master/slave replication for a tail-latency service-level objective of 200μs by more than 4.8× vs. vanilla Redis.
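The JBSQ(n) policy described in the abstract can be sketched as a toy router (an illustrative model, not the R2P2 implementation; class and method names are assumptions): pending RPCs stay centralized in the router until some server has fewer than a bound of outstanding requests.

```python
from collections import deque

class JBSQRouter:
    """Toy join-bounded-shortest-queue (JBSQ(n)) router.

    An RPC is dispatched only to a server with fewer than `bound`
    outstanding requests, always choosing the least-loaded such server;
    otherwise it waits in the router's central queue.
    """
    def __init__(self, num_servers, bound):
        self.bound = bound
        self.outstanding = [0] * num_servers  # per-server in-flight count
        self.pending = deque()                # centralized pending RPCs

    def submit(self, rpc):
        """Accept an RPC; return the list of (rpc, server) dispatches."""
        self.pending.append(rpc)
        return self._drain()

    def complete(self, server):
        """A server finished one request; try to dispatch pending RPCs."""
        self.outstanding[server] -= 1
        return self._drain()

    def _drain(self):
        dispatched = []
        while self.pending:
            # least-loaded server
            server = min(range(len(self.outstanding)),
                         key=self.outstanding.__getitem__)
            if self.outstanding[server] >= self.bound:
                break  # every server is at its bound; keep RPCs centralized
            self.outstanding[server] += 1
            dispatched.append((self.pending.popleft(), server))
        return dispatched
```

With bound = 1 this degenerates to join-shortest-queue with a fully centralized backlog; larger bounds trade scheduling quality for request pipelining at the servers.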


Friday July 12, 2019 9:15am - 9:35am PDT
USENIX ATC Track II: Grand Ballroom VII–IX

9:35am PDT

Alleviating Garbage Collection Interference Through Spatial Separation in All Flash Arrays
We present SWAN, a novel All Flash Array (AFA) management scheme. Recent flash SSDs provide high I/O bandwidth (e.g., 3-10GB/s), so the storage bandwidth can easily surpass the network bandwidth by aggregating a few SSDs. However, it is still challenging to unlock the full performance of SSDs. The main source of performance degradation is garbage collection (GC). We find that existing AFA designs are susceptible to GC at both the SSD level and the AFA software level. In designing SWAN, we aim to alleviate the performance interference caused by GC at both levels. Unlike the commonly used temporal separation approach that performs GC at idle time, we take a spatial separation approach that partitions SSDs into front-end SSDs dedicated to serving write requests and back-end SSDs where GC is performed. Compared to temporal separation of GC and application I/O, which is hard for AFA software to control, our approach guarantees that the storage bandwidth always matches the full network performance without interference from AFA-level GC. Our analytical model confirms this when the sizes of the front-end and back-end SSDs are properly configured. We provide extensive evaluations that show SWAN is effective for a variety of workloads.
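SWAN's spatial separation can be illustrated with a toy model (structure and names are assumptions for illustration): one SSD serves as the write front-end while the others act as back-ends where GC can proceed off the write path; when the front-end fills, roles rotate.

```python
class SpatialAFA:
    """Toy spatial-separation AFA: writes go only to the front-end SSD,
    so GC on the back-end SSDs never interferes with foreground writes."""

    def __init__(self, num_ssds, capacity):
        self.capacity = capacity
        self.used = [0] * num_ssds
        self.front = 0  # index of the current front-end SSD

    def write(self, size):
        """Log a write to the front-end; rotate roles if it is full.
        Returns the SSD index that absorbed the write."""
        if self.used[self.front] + size > self.capacity:
            self._rotate()
        self.used[self.front] += size
        return self.front

    def _rotate(self):
        # Promote the emptiest back-end SSD (its GC is assumed finished)
        candidates = [i for i in range(len(self.used)) if i != self.front]
        self.front = min(candidates, key=self.used.__getitem__)
```

The rotation step is where a real design would trigger GC on the retired front-end; the toy model only tracks occupancy.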

Speakers
JK

Jaeho Kim

Virginia Tech
KL

Kwanghyun Lim

Cornell University
CM

Changwoo Min

Virginia Tech


Friday July 12, 2019 9:35am - 9:55am PDT
USENIX ATC Track I: Grand Ballroom I–VI

9:35am PDT

Your Coflow has Many Flows: Sampling them for Fun and Speed
Coflow scheduling improves data-intensive application performance by improving their networking performance. State-of-the-art online coflow schedulers in essence approximate the classic Shortest-Job-First (SJF) scheduling by learning the coflow size online. In particular, they use multiple priority queues to simultaneously accomplish two goals: to sieve long coflows from short coflows, and to schedule short coflows with high priorities. Such a mechanism pays a high overhead in learning the coflow size: moving a large coflow across the queues delays small and other large coflows, and moving similar-sized coflows across the queues results in inadvertent round-robin scheduling.

We propose Philae, a new online coflow scheduler that exploits the spatial dimension of coflows, i.e., a coflow has many flows, to drastically reduce the overhead of coflow size learning. Philae pre-schedules sampled flows of each coflow and uses their sizes to estimate the average flow size of the coflow. It then resorts to Shortest Coflow First, where the notion of shortest is determined using the learned coflow sizes and coflow contention. We show that the sampling-based learning is robust to flow size skew and has the added benefit of much improved scalability from reduced coordinator-local agent interactions. Our evaluation using an Azure testbed and a publicly available production cluster trace from Facebook shows that, compared to the prior art Aalo, Philae reduces the average (P90) coflow completion time (CCT) by 1.50x (8.00x) on a 150-node testbed and 2.72x (9.78x) on a 900-node testbed. Evaluation using additional traces further demonstrates Philae's robustness to flow size skew.
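The sampling idea can be sketched as follows (a simplification with assumed names): finish a few pilot flows of each coflow, average their sizes, scale by the flow count, and order coflows shortest-first by the estimate.

```python
import random

def estimate_coflow_size(flow_sizes, sample_frac=0.1, rng=None):
    """Estimate total coflow size by pre-scheduling a sample of its flows:
    average the sampled flow sizes and scale by the number of flows."""
    rng = rng or random.Random(0)  # fixed seed for a reproducible sketch
    k = max(1, int(len(flow_sizes) * sample_frac))
    sample = rng.sample(flow_sizes, k)
    return sum(sample) / k * len(flow_sizes)

def shortest_coflow_first(coflows, sample_frac=0.1):
    """Order coflow ids by estimated total size, smallest first."""
    est = {cid: estimate_coflow_size(fs, sample_frac)
           for cid, fs in coflows.items()}
    return sorted(est, key=est.get)
```

The real scheduler also factors in coflow contention when defining "shortest"; this sketch uses size alone.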

Speakers
AJ

Akshay Jajoo

Purdue University
YC

Y. Charlie Hu

Purdue University
XL

Xiaojun Lin

Purdue University


Friday July 12, 2019 9:35am - 9:55am PDT
USENIX ATC Track II: Grand Ballroom VII–IX

9:55am PDT

Practical Erase Suspension for Modern Low-latency SSDs
As NAND flash technology continues to scale, flash-based SSDs have become key components in data-center servers. One of the main design goals for data-center SSDs is low read tail latency, which is crucial for interactive online services as a single query can generate thousands of disk accesses. Towards this goal, many prior works have focused on minimizing the effect of garbage collection on read tail latency. Such advances have made the other, less explored source of long read tails, the block erase operation, more important. Prior work on erase suspension addresses this problem by allowing a read operation to interrupt an ongoing erase operation, to minimize its effect on read latency. Unfortunately, that technique attempts to suspend/resume an erase pulse at an arbitrary point, which incurs additional hardware cost for NAND peripherals and reduces the lifetime of the device. Furthermore, we demonstrate that this technique suffers from a write starvation problem, using a real, production-grade SSD. To overcome these limitations, we propose alternative practical erase suspension mechanisms that leverage the iterative erase mechanism used in modern SSDs to suspend/resume erase operations at well-aligned safe points. The resulting design achieves a sub-200μs 99.999th percentile read tail latency for a 4KB random I/O workload at queue depth 16 (70% reads and 30% writes). Furthermore, it reduces the read tail latency by about 5× over the baseline on the two data-center workloads that we evaluated.
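The safe-point idea can be illustrated with a toy interleaving (assumed structure; real firmware operates on NAND erase pulses and commands): an erase consists of several pulses, and pending reads are served only at pulse boundaries rather than by aborting a pulse mid-flight.

```python
def erase_with_suspension(num_pulses, read_queue):
    """Interleave an iterative erase with pending reads, suspending only
    at the boundaries between erase pulses (the well-aligned safe points).
    Returns the resulting event log."""
    log = []
    for pulse in range(num_pulses):
        log.append(('erase_pulse', pulse))
        while read_queue:                  # safe point: drain pending reads
            log.append(('read', read_queue.pop(0)))
    log.append(('erase_done', num_pulses))
    return log
```

Because each pulse runs to completion, no mid-pulse abort circuitry is needed and the erase makes guaranteed forward progress between suspensions.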

Speakers
SK

Shine Kim

Seoul National University and Samsung Electronics
JB

Jonghyun Bae

Seoul National University
HJ

Hakbeom Jang

Sungkyunkwan University
WJ

Wenjing Jin

Seoul National University
JG

Jeonghun Gong

Seoul National University
SL

Seungyeon Lee

Samsung Electronics
TJ

Tae Jun Ham

Seoul National University
JW

Jae W. Lee

Seoul National University


Friday July 12, 2019 9:55am - 10:15am PDT
USENIX ATC Track I: Grand Ballroom I–VI

9:55am PDT

PostMan: Rapidly Mitigating Bursty Traffic by Offloading Packet Processing
Unexpected bursty traffic due to certain sudden events, such as news in the spotlight on a social network or discounted items on sale, can cause severe load imbalance in backend services. Migrating hot data, the standard approach to achieving load balance, is challenging when handling such unexpected load imbalance, because migrating data will slow down the server that is already under heavy pressure.

This paper proposes PostMan, an alternative approach to rapidly mitigate load imbalance for services processing small requests. Motivated by the observation that processing large packets incurs far less CPU overhead than processing small ones, PostMan deploys a number of middleboxes called helpers to assemble small packets into large ones for the heavily-loaded server. This approach essentially offloads the overhead of packet processing from the heavily-loaded server to others. To minimize the overhead, PostMan activates helpers on demand, only when bursty traffic is detected. To tolerate helper failures, PostMan can migrate connections across helpers and can ensure packet ordering despite such migration. Our evaluation shows that, with the help of PostMan, a Memcached server can mitigate bursty traffic within hundreds of milliseconds, while migrating data takes tens of seconds and increases the latency during migration.
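The helper's packet assembly can be sketched as follows (a simplification; the length-prefix framing and the `mtu` cutoff are assumptions for illustration): many small packets are coalesced into large buffers, so the loaded server pays per-buffer rather than per-packet processing cost.

```python
def assemble(packets, mtu=1500):
    """Coalesce small packets into large buffers of at most `mtu` bytes,
    framing each packet with a 2-byte big-endian length prefix."""
    batches, current = [], bytearray()
    for p in packets:
        frame = len(p).to_bytes(2, 'big') + p
        if current and len(current) + len(frame) > mtu:
            batches.append(bytes(current))
            current = bytearray()
        current += frame
    if current:
        batches.append(bytes(current))
    return batches

def disassemble(batch):
    """Recover the original small packets from one assembled buffer."""
    packets, i = [], 0
    while i < len(batch):
        n = int.from_bytes(batch[i:i + 2], 'big')
        packets.append(batch[i + 2:i + 2 + n])
        i += 2 + n
    return packets
```

The two functions are inverses, which is what lets helpers be inserted and removed (or fail over) without the endpoints noticing anything beyond framing.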

Speakers
PJ

Panpan Jin

National Engineering Research Center for Big Data Technology and System, Key Laboratory of Services Computing Technology and System, Ministry of Education, School of Computer Science and Technology, Huazhong University of Science and Technology, China
JG

Jian Guo

National Engineering Research Center for Big Data Technology and System, Key Laboratory of Services Computing Technology and System, Ministry of Education, School of Computer Science and Technology, Huazhong University of Science and Technology, China
YX

Yikai Xiao

National Engineering Research Center for Big Data Technology and System, Key Laboratory of Services Computing Technology and System, Ministry of Education, School of Computer Science and Technology, Huazhong University of Science and Technology, China
RS

Rong Shi

The Ohio State University, USA
YN

Yipei Niu

National Engineering Research Center for Big Data Technology and System, Key Laboratory of Services Computing Technology and System, Ministry of Education, School of Computer Science and Technology, Huazhong University of Science and Technology, China
FL

Fangming Liu

National Engineering Research Center for Big Data Technology and System, Key Laboratory of Services Computing Technology and System, Ministry of Education, School of Computer Science and Technology, Huazhong University of Science and Technology, China
CQ

Chen Qian

University of California Santa Cruz, USA
YW

Yang Wang

The Ohio State University


Friday July 12, 2019 9:55am - 10:15am PDT
USENIX ATC Track II: Grand Ballroom VII–IX

10:15am PDT

Track-based Translation Layers for Interlaced Magnetic Recording
Interlaced magnetic recording (IMR) is a state-of-the-art recording technology for hard drives that makes use of heat-assisted magnetic recording (HAMR) and track overlap to offer higher capacity than conventional and shingled magnetic recording (CMR and SMR). It carries a set of write constraints that differ from those in SMR: “bottom” (e.g. even-numbered) tracks cannot be written without data loss on the adjoining “top” (e.g. odd-numbered) ones. Previously described algorithms for writing arbitrary (i.e. bottom) sectors on IMR are in some cases poorly characterized, and are either slow or require more memory than is available within the constrained disk controller environment.

We provide the first accurate performance analysis of the simple read-modify-write (RMW) approach to IMR bottom track writes, noting several inaccuracies in earlier descriptions of its performance, and evaluate it for latency, throughput and I/O amplification on real-world traces. In addition we propose three novel memory-efficient, track-based translation layers for IMR—track flipping, selective track caching and dynamic track mapping, which reduce bottom track writes by moving hot data to top tracks and cold data to bottom ones in different ways. We again provide a detailed performance analysis using simulations based on real-world traces.

We find that RMW performance is poor on most traces and even worse on the rest. The proposed approaches perform much better, especially dynamic track mapping, with low write amplification and latency comparable to CMR for many traces.
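The RMW approach for bottom-track writes can be sketched as follows (a toy model with a dict standing in for the platter; on a real IMR drive the bottom write physically corrupts the overlapping top tracks, which is why they must be saved and rewritten):

```python
def write_bottom_track(tracks, bottom, top_neighbors, data):
    """Read-modify-write for an IMR bottom track: save the adjoining top
    tracks, write the bottom track, then rewrite the tops. Returns the
    number of track writes performed (the write amplification)."""
    saved = {t: tracks[t] for t in top_neighbors}  # read step
    tracks[bottom] = data        # write bottom (destroys tops on real media)
    for t in top_neighbors:
        tracks[t] = saved[t]     # rewrite step restores the top tracks
    return 1 + len(top_neighbors)
```

The returned count makes the cost visible: every bottom-track write turns into one write plus one rewrite per overlapped top track, which is what the track-based translation layers in the paper try to avoid.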

Speakers
MH

Mohammad Hossein Hajkazemi

Northeastern University
AN

Ajay Narayan Kulkarni

Seagate Technology
PD

Peter Desnoyers

Northeastern University
TR

Timothy R Feldman

Seagate Technology


Friday July 12, 2019 10:15am - 10:35am PDT
USENIX ATC Track I: Grand Ballroom I–VI

10:15am PDT

Lancet: A self-correcting Latency Measuring Tool
We present LANCET, a self-correcting tool designed to measure the open-loop tail latency of μs-scale datacenter applications with high fan-in connection patterns. LANCET is self-correcting as it relies on online statistical tests to determine situations in which tail latency cannot be accurately measured from a statistical perspective, including situations where the workload configuration, the client infrastructure, or the application itself does not allow it. Because of its design, LANCET is also extremely easy to use. In fact, the user is only responsible for (i) configuring the workload parameters, i.e., the mix of requests and the size of the client connection pool, and (ii) setting the desired confidence interval for a particular tail latency percentile. All other parameters, including the length of the warmup phase, the measurement length, and the necessary sampling rate, are dynamically determined by the LANCET experiment coordinator. When available, LANCET leverages NIC-based hardware timestamping to measure RPC end-to-end latency. Otherwise, it uses an asymmetric setup with a latency-agent that leverages busy-polling system calls to reduce the client bias. Our evaluation shows that LANCET automatically identifies situations in which tail latency cannot be determined and accurately reports the latency distribution of workloads with single-digit μs service time. For the workloads studied, LANCET can successfully report, with 95% confidence, the 99th percentile tail latency within an interval of ≤ 10μs. In comparison with state-of-the-art tools such as Mutilate and Treadmill, LANCET reports a latency cumulative distribution that is ∼20μs lower when the NIC timestamping capability is available and ∼10μs lower when it is not.

Friday July 12, 2019 10:15am - 10:35am PDT
USENIX ATC Track II: Grand Ballroom VII–IX

11:05am PDT

Pangolin: A Fault-Tolerant Persistent Memory Programming Library
Non-volatile main memory (NVMM) allows programmers to build complex, persistent, pointer-based data structures that can offer substantial performance gains over conventional approaches to managing persistent state. This programming model removes the file system from the critical path, which improves performance, but it also places these data structures out of reach of file system-based fault tolerance mechanisms (e.g., block-based checksums or erasure coding). Without fault tolerance, using NVMM to hold critical data will be much less attractive.

This paper presents Pangolin, a fault-tolerant persistent object library designed for NVMM. Pangolin uses a combination of checksums, parity, and micro-buffering to protect an application's objects from both media errors and corruption due to software bugs. It provides these protections for objects of any size and supports automatic, online detection of data corruption and recovery. The required storage overhead is small (1% for gigabyte-sized pools of NVMM). Pangolin provides stronger protection, requires orders of magnitude less storage overhead, and achieves comparable performance relative to the current state-of-the-art fault-tolerant persistent object library.
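Pangolin's parity maintenance can be sketched as follows (fixed-size rows and function names are assumptions): when an object is updated, the parity row is patched with old XOR new instead of being recomputed over the whole pool, and per-object checksums detect corruption.

```python
import zlib

def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def update_parity(parity, old_obj, new_obj):
    """Incrementally patch a parity row after one object changes:
    parity' = parity XOR old XOR new, avoiding a full recomputation."""
    return bytes(p ^ o ^ n for p, o, n in zip(parity, old_obj, new_obj))

def checksum(obj):
    """Per-object checksum for detecting media errors or stray writes."""
    return zlib.crc32(obj)
```

With the parity kept current, any single lost row is recoverable by XORing the parity with the surviving rows, which is the source of the small (~1%) storage overhead cited in the abstract.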

Speakers
LZ

Lu Zhang

UC San Diego
SS

Steven Swanson

UC San Diego


Friday July 12, 2019 11:05am - 11:25am PDT
USENIX ATC Track I: Grand Ballroom I–VI

11:05am PDT

EdgeWise: A Better Stream Processing Engine for the Edge
Many Internet of Things (IoT) applications would benefit if streams of data could be analyzed rapidly at the Edge, near the data source. However, existing Stream Processing Engines (SPEs) are unsuited for the Edge because their designs assume Cloud-class resources and relatively generous throughput and latency constraints.

This paper presents EdgeWise, a new Edge-friendly SPE, and shows analytically and empirically that EdgeWise improves both throughput and latency. The key idea of EdgeWise is to incorporate a congestion-aware scheduler and a fixed-size worker pool into an SPE. Though this idea has been explored in the past, we are the first to apply it to modern SPEs and we provide a new queue-theoretic analysis to support it. In our single-node and distributed experiments we compare EdgeWise to the state-of-the-art Storm system. We report up to a 3x improvement in throughput while keeping latency low.
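The congestion-aware scheduling idea can be sketched as a toy single-threaded model (assumed names; EdgeWise uses a fixed pool of worker threads): each worker step serves the operator with the longest pending queue, instead of dedicating a thread to every operator.

```python
from collections import deque

class CongestionAwareScheduler:
    """Toy congestion-aware scheduler: a fixed worker repeatedly picks
    the most-backlogged operator and processes one item from its queue."""

    def __init__(self, operators):
        self.ops = dict(operators)                     # name -> callable
        self.queues = {name: deque() for name in self.ops}

    def push(self, op, item):
        self.queues[op].append(item)

    def step(self):
        """One worker step; returns (operator, result) or None if idle."""
        op = max(self.queues, key=lambda o: len(self.queues[o]))
        if not self.queues[op]:
            return None
        return op, self.ops[op](self.queues[op].popleft())
```

Draining the longest queue first keeps per-operator backlogs balanced, which is what bounds queueing delay on resource-constrained edge nodes.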

Speakers
XF

Xinwei Fu

Virginia Tech
TG

Talha Ghaffar

Virginia Tech
JC

James C. Davis

Virginia Tech
DL

Dongyoon Lee

Virginia Tech


Friday July 12, 2019 11:05am - 11:25am PDT
USENIX ATC Track II: Grand Ballroom VII–IX

11:25am PDT

Pisces: A Scalable and Efficient Persistent Transactional Memory
The persistent transactional memory (PTM) programming model has recently been exploited to provide crash-consistent transactional interfaces that ease programming atop NVM. However, existing PTM designs either incur high reader-side overhead due to blocking or long writer-side delays (efficiency), or place excessive constraints on persistence ordering (scalability). This paper presents Pisces, a read-friendly PTM that exploits snapshot isolation (SI) on NVM. The key design of Pisces is based on two observations: the redo logs of transactions can be reused as newer versions of the data, and an intuitive MVCC-based design is deficient on the read side. Based on these observations, we propose a dual-version concurrency control (DVCC) protocol that maintains up to two versions in the NVM-backed storage hierarchy. Together with a three-stage commit protocol, Pisces ensures SI and allows more transactions to commit and persist simultaneously. Most importantly, it offers a desirable feature: hiding the NVM persistence overhead from reads and allowing nearly non-blocking reads. Experimental evaluation on an Intel 40-thread (20-core) machine equipped with real NVM shows that Pisces outperforms the state-of-the-art design (i.e., DUDETM) by up to 6.3× on micro-benchmarks and 4.6× on the TPC-C new-order transaction, and also scales much better. The persistency cost is from 19% to 50% for 40 threads.
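The dual-version idea behind DVCC can be sketched as follows (an illustrative in-memory model, not the NVM implementation): each object retains at most two versions, and a reader at snapshot timestamp `ts` returns the newest version committed no later than `ts`, without blocking on writers.

```python
class DVCCObject:
    """Toy dual-version object: keeps at most two (commit_ts, value)
    pairs, so snapshot readers never block while writers install
    a new version."""

    def __init__(self, value):
        self.versions = [(0, value)]  # oldest-first, at most two entries

    def write(self, value, commit_ts):
        """Install a new version; the older of the two is dropped."""
        self.versions = self.versions[-1:] + [(commit_ts, value)]

    def read(self, snapshot_ts):
        """Return the newest version committed at or before snapshot_ts."""
        for ts, value in reversed(self.versions):
            if ts <= snapshot_ts:
                return value
        raise RuntimeError('snapshot predates both retained versions')
```

Keeping only two versions is what distinguishes DVCC from general MVCC: the reader-side search is constant time and the redo log doubles as the second version.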

Speakers
JG

Jinyu Gu

Shanghai Jiao Tong University
QY

Qianqian Yu

Shanghai Jiao Tong University
XW

Xiayang Wang

Shanghai Jiao Tong University
ZW

Zhaoguo Wang

Shanghai Jiao Tong University
BZ

Binyu Zang

Shanghai Jiao Tong University
HG

Haibing Guan

Shanghai Jiao Tong University
HC

Haibo Chen

Shanghai Jiao Tong University / Huawei Technologies Co., Ltd.


Friday July 12, 2019 11:25am - 11:45am PDT
USENIX ATC Track I: Grand Ballroom I–VI

11:25am PDT

Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads
With widespread advances in machine learning, a number of large enterprises are beginning to incorporate machine learning models across many of their products. These models are typically trained on shared, multi-tenant GPU clusters. Similar to existing cluster computing workloads, scheduling frameworks aim to provide features like high efficiency, resource isolation, and fair sharing across users. However, Deep Neural Network (DNN) based workloads, predominantly trained on GPUs, differ in two significant ways from traditional big data analytics workloads. First, from a cluster utilization perspective, GPUs represent a monolithic resource that cannot be shared at a fine granularity across users. Second, from a workload perspective, deep learning frameworks require gang scheduling, reducing the flexibility of scheduling and making the jobs themselves inelastic to failures at runtime. In this paper we present a detailed workload characterization of a two-month-long trace from a multi-tenant GPU cluster at Microsoft. By correlating scheduler logs with logs from individual jobs, we study three distinct issues that affect cluster utilization for DNN training workloads on multi-tenant clusters: (1) the effect of gang scheduling and locality constraints on queuing, (2) the effect of locality on GPU utilization, and (3) failures during training. Based on our experience running a large-scale operation, we provide design guidelines pertaining to next-generation cluster schedulers for DNN training workloads.

Speakers
MJ

Myeongjae Jeon

UNIST and Microsoft Research
SV

Shivaram Venkataraman

University of Wisconsin and Microsoft Research
AP

Amar Phanishayee

Microsoft Research
JQ

Junjie Qian

Microsoft Research
WX

Wencong Xiao

Beihang University and Microsoft Research
FY

Fan Yang

Microsoft Research


Friday July 12, 2019 11:25am - 11:45am PDT
USENIX ATC Track II: Grand Ballroom VII–IX

11:45am PDT

Short Break
Friday July 12, 2019 11:45am - 11:50am PDT
Grand Ballroom Prefunction

11:50am PDT

Lessons and Actions: What We Learned from 10K SSD-Related Storage System Failures
Modern datacenters increasingly use flash-based solid state drives (SSDs) for high performance and low energy cost. However, SSDs introduce more complex failure modes compared to traditional hard disks. While great efforts have been made to understand the reliability of SSDs themselves, it remains unclear what types of system-level failures are related to SSDs, what the root causes are, and how the rest of the system interacts with SSDs and contributes to failures. Answering these questions can help practitioners build and maintain highly reliable SSD-based storage systems.

In this paper, we study the reliability of SSD-based storage systems deployed in Alibaba Cloud, covering nearly half a million SSDs and spanning over three years of usage under representative cloud services. We take a holistic view and analyze both device errors and system failures to better understand the potential causal relations. Particularly, we focus on failures that are Reported As "SSD-Related" (RASR) by system status monitoring daemons. Through log analysis, field studies, and validation experiments, we identify the characteristics of RASR failures in terms of their distribution, symptoms, and correlations. Moreover, we derive a number of major lessons and a set of effective methods to address the issues observed. We believe that our study and experience will be beneficial to the community and can facilitate building highly-reliable SSD-based storage systems.

Speakers
EX

Erci Xu

Ohio State University
MZ

Mai Zheng

Iowa State University
FQ

Feng Qin

Ohio State University
YX

Yikang Xu

Alibaba Group
JW

Jiesheng Wu

Alibaba Group


Friday July 12, 2019 11:50am - 12:10pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

11:50am PDT

Optimizing CNN Model Inference on CPUs
The popularity of Convolutional Neural Network (CNN) models and the ubiquity of CPUs imply that better performance of CNN model inference on CPUs can deliver significant gains to a large number of users. To improve the performance of CNN inference on CPUs, current approaches like MXNet and Intel OpenVINO usually treat the model as a graph and use high-performance libraries such as Intel MKL-DNN to implement the operations of the graph. While achieving reasonable performance on individual operations with the off-the-shelf libraries, this solution is inflexible for conducting optimizations at the graph level, as the local operation-level optimizations are predefined. Therefore, it is restrictive and misses the opportunity to optimize the end-to-end inference pipeline as a whole. This paper presents NeoCPU, a comprehensive approach to CNN model inference on CPUs that employs a full-stack and systematic scheme of optimizations. NeoCPU optimizes the operations as templates without relying on third-party libraries, which enables further improvement of the performance via operation- and graph-level joint optimization. Experiments show that NeoCPU achieves up to 3.45× lower latency for CNN model inference than the current state-of-the-art implementations on various kinds of popular CPUs.

Speakers
YW

Yao Wang

Amazon
ML

Mu Li

Amazon


Friday July 12, 2019 11:50am - 12:10pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

12:10pm PDT

Who's Afraid of Uncorrectable Bit Errors? Online Recovery of Flash Errors with Distributed Redundancy
Due to its high performance and decreasing cost per bit, flash storage is the main storage medium in datacenters for hot data. However, flash endurance is a perpetual problem, and due to technology trends, subsequent generations of flash devices exhibit progressively shorter lifetimes before they experience uncorrectable bit errors. In this paper, we propose addressing the flash lifetime problem by allowing devices to expose higher bit error rates. We present DIRECT, a set of techniques that harnesses distributed-level redundancy to enable the adoption of new generations of denser and less reliable flash storage technologies. DIRECT does so by using an end-to-end approach to increase the reliability of distributed storage systems.

We implemented DIRECT on two real-world storage systems: ZippyDB, a distributed key-value store in production at Facebook and backed by RocksDB, and HDFS, a distributed file system. When tested on production traces at Facebook, DIRECT reduces application-visible error rates in ZippyDB by more than 100x and recovery time by more than 10,000x. DIRECT also allows HDFS to tolerate a 10,000--100,000x higher bit error rate without experiencing application-visible errors. By significantly increasing the availability and durability of distributed storage systems in the face of bit errors, DIRECT helps extend flash lifetimes.

Speakers
AT

Amy Tai

Princeton University and VMware Research
KJ

Kyle Jamieson

Princeton University
MJ

Michael J. Freedman

Princeton University
AC

Asaf Cidon

Columbia University


Friday July 12, 2019 12:10pm - 12:30pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

12:10pm PDT

Accelerating Rule-matching Systems with Learned Rankers
Infusing machine learning (ML) and deep learning (DL) into modern systems has driven a paradigm shift towards learning-augmented system design. This paper proposes the learned ranker as a system building block, and demonstrates its potential by using rule-matching systems as a concrete scenario. Specifically, checking rules can be time-consuming, especially for complex regular expression (regex) conditions. The learned ranker prioritizes rules based on their likelihood of matching a given input. If the matching rule is successfully prioritized as a top candidate, the system effectively achieves early termination. We integrated the learned rule ranker as a component of popular regex matching engines: PCRE, PCRE-JIT, and RE2. Empirical results show that the rule ranker achieves a top-5 classification accuracy of at least 96.16%, and reduces the rule-matching system latency by up to 78.81% on an 8-core CPU.
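Early termination with a learned ranker can be sketched as follows (an illustrative model; the scores would come from the trained ranker, here they are given as inputs): rules are tried in descending predicted-match likelihood, so a correctly ranked match avoids evaluating the remaining, possibly expensive, regexes.

```python
import re

def match_with_ranker(rules, scores, text):
    """Try regex rules in descending ranker-score order. Returns the
    index of the first matching rule and how many rules were checked
    (None and len(rules) if nothing matches)."""
    order = sorted(range(len(rules)), key=lambda i: -scores[i])
    for checked, i in enumerate(order, 1):
        if re.search(rules[i], text):
            return i, checked
    return None, len(rules)
```

When the ranker places the matching rule first, only one regex is evaluated regardless of how many rules the system holds; a mis-ranked input merely falls back to checking more rules, never to a wrong answer.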

Speakers
ZL

Zhao Lucis Li

University of Science and Technology China
CM

Chieh-Jan Mike Liang

Microsoft Research
WB

Wei Bai

Microsoft Research
QZ

Qiming Zheng

Shanghai Jiao Tong University
YX

Yongqiang Xiong

Microsoft Research
GS

Guangzhong Sun

University of Science and Technology China


Friday July 12, 2019 12:10pm - 12:30pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

12:30pm PDT

Dayu: Fast and Low-interference Data Recovery in Very-large Storage Systems
This paper investigates I/O and failure traces from a real-world large-scale storage system. It finds that, because of the scale of the system and its imbalanced and dynamic foreground traffic, no existing recovery protocol can compute a high-quality re-replicating strategy in a short time. To address this problem, this paper proposes Dayu, a timeslot-based recovery architecture. For each timeslot, Dayu only schedules a subset of tasks which are expected to finish within that timeslot: this approach reduces the computation overhead and naturally copes with the dynamic foreground traffic. In each timeslot, Dayu incorporates a greedy algorithm with convex hull optimization to achieve both high speed and high quality. Our evaluations in a 1,000-node cluster and in a 3,500-node simulation both confirm that Dayu can outperform existing recovery protocols, achieving high speed and high quality.

Speakers
ZW

Zhufan Wang

Tsinghua University
GZ

Guangyan Zhang

Tsinghua University
YW

Yang Wang

The Ohio State University
QY

Qinglin Yang

Tsinghua University
JZ

Jiaji Zhu

Alibaba Cloud


Friday July 12, 2019 12:30pm - 12:50pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

12:30pm PDT

MArk: Exploiting Cloud Services for Cost-Effective, SLO-Aware Machine Learning Inference Serving
The advances of Machine Learning (ML) have sparked a growing demand for ML-as-a-Service: developers train ML models and publish them in the cloud as online services to provide low-latency inference at scale. The key challenge of ML model serving is to meet the response-time Service-Level Objectives (SLOs) of inference workloads while minimizing the serving cost. In this paper, we tackle the dual challenge of SLO compliance and cost effectiveness with MArk (Model Ark), a general-purpose inference serving system built on Amazon Web Services (AWS). MArk employs three design choices tailor-made for inference workloads. First, MArk dynamically batches requests and opportunistically serves them using expensive hardware accelerators (e.g., GPUs) for an improved performance-cost ratio. Second, instead of relying on feedback-control scaling or over-provisioning to serve dynamic workloads, which can be too slow or too expensive for inference serving, MArk employs predictive autoscaling to hide the provisioning latency at low cost. Third, given the stateless nature of inference serving, MArk exploits the flexible, yet costly, serverless instances to cover the occasional load spikes that are hard to predict. We evaluated the performance of MArk using several state-of-the-art ML models trained in popular frameworks including TensorFlow, MXNet, and Keras. Compared with the premier industrial ML serving platform SageMaker, MArk reduces the serving cost by up to 7.8× while achieving even better latency performance.
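The provisioning split can be sketched with two small helpers (names and the linear capacity model are assumptions): reserved instances are planned from the predicted load, and any load above their capacity spills over to serverless instances, which start fast but cost more per request.

```python
import math

def plan_reserved(predicted_rps, per_instance_rps):
    """Reserved instances needed to cover the predicted request rate."""
    return math.ceil(predicted_rps / per_instance_rps)

def route(actual_rps, reserved_instances, per_instance_rps):
    """Split actual load: fill reserved capacity first, spill the
    unpredicted spike over to serverless. Returns (reserved, serverless)."""
    capacity = reserved_instances * per_instance_rps
    to_reserved = min(actual_rps, capacity)
    return to_reserved, actual_rps - to_reserved
```

The serverless share is zero whenever the predictor is right or pessimistic, so the expensive instances only absorb the forecast error rather than the steady-state load.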

Speakers
CZ

Chengliang Zhang

Hong Kong University of Science and Technology
MY

Minchen Yu

Hong Kong University of Science and Technology
WW

Wei Wang

Hong Kong University of Science and Technology
FY

Feng Yan

University of Nevada, Reno


Friday July 12, 2019 12:30pm - 12:50pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX

12:50pm PDT

OPTR: Order-Preserving Translation and Recovery Design for SSDs with a Standard Block Device Interface
Consumer-grade solid-state drives (SSDs) guarantee very few things upon a crash. Lacking a strong disk-level crash guarantee forces programmers to equip applications and filesystems with safety nets using redundant writes and flushes, which in turn degrade the overall system performance. Although some prior works propose transactional SSDs with revolutionized disk interfaces to offer strong crash guarantees, adopting transactional SSDs inevitably incurs dramatic software stack changes. Therefore, most consumer-grade SSDs still keep using the standard block device interface.

This paper addresses the above issues by breaking the assumption that stronger crash guarantees for SSDs are only available at the cost of altering the standard block device interface. We propose Order-Preserving Translation and Recovery (OPTR), a collection of novel flash translation layer (FTL) and crash recovery techniques that are realized internally to block-interface SSDs to endow them with strong request-level crash guarantees, defined as follows: 1) a write request is not made durable unless all its prior write requests are durable; 2) each write request is atomic; 3) all write requests prior to a flush are guaranteed durable. We have realized OPTR in real SSD hardware and optimized applications and filesystems (SQLite and Ext4) to demonstrate OPTR's benefits. Experimental results show 1.27× (only Ext4 is optimized), 2.85× (both Ext4 and SQLite are optimized), and 6.03× (an OPTR-enabled no-barrier mode) performance improvements.
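Guarantee 1) can be illustrated with a small helper (an assumed model, not OPTR's FTL logic): after a crash, the state exposed as durable is the longest prefix of the submission order whose writes all persisted, so no write is visible unless every earlier write is too.

```python
def durable_prefix(write_order, persisted):
    """Return the longest prefix of `write_order` whose writes are all
    in `persisted`. Writes that persisted out of order, after a gap,
    are discarded to preserve request-level ordering."""
    persisted = set(persisted)
    prefix = []
    for w in write_order:
        if w not in persisted:
            break
        prefix.append(w)
    return prefix
```

In the example below, w4 reached the flash before w3 did; an order-preserving recovery must discard it, because exposing w4 without w3 would violate guarantee 1).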

Speakers
YC

Yun-Sheng Chang

National Tsing Hua University
RL

Ren-Shuo Liu

National Tsing Hua University


Friday July 12, 2019 12:50pm - 1:10pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

12:50pm PDT

Cross-dataset Time Series Anomaly Detection for Cloud Systems
In recent years, software applications have increasingly been deployed as online services on cloud computing platforms. Detecting anomalies in cloud systems is important for maintaining high service availability. However, given the velocity, volume, and diverse nature of cloud monitoring data, it is difficult to obtain sufficient labelled data to build an accurate anomaly detection model. In this paper, we propose cross-dataset anomaly detection: detecting anomalies in a new unlabelled dataset (the target) by training an anomaly detection model on existing labelled datasets (the source). Our approach, called ATAD (Active Transfer Anomaly Detection), integrates both transfer learning and active learning techniques. Transfer learning is applied to transfer knowledge from the source dataset to the target dataset, and active learning is applied to select a small number of informative samples from the unlabelled target dataset for labelling. Through experiments, we show that ATAD is effective in cross-dataset time series anomaly detection. Furthermore, labelling only about 1%–5% of the unlabelled data is sufficient to achieve significant performance improvement.
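The active-learning step is what keeps the labelling budget at 1%–5%: instead of labelling samples at random, the system queries labels only for the samples the transferred detector is least sure about. A minimal sketch of one common query strategy, uncertainty sampling; the helper name, scores, and budget below are illustrative only, not ATAD's actual algorithm:

```python
def uncertainty_sample(scores, budget):
    """Return the indices of the `budget` samples whose anomaly
    scores lie closest to the decision boundary (0.5), i.e. the
    samples the detector is most uncertain about."""
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i] - 0.5))
    return ranked[:budget]

# Anomaly scores produced on the unlabelled target dataset by a
# detector pre-trained on a labelled source dataset (values made up
# for illustration).
target_scores = [0.02, 0.49, 0.97, 0.55, 0.10, 0.51]

# Query human labels for only a small fraction of the target data,
# mirroring the paper's 1%-5% labelling budget.
query = uncertainty_sample(target_scores, budget=2)
# query -> [1, 5], the two samples with scores nearest 0.5
```

Confidently scored samples (0.02, 0.97) are skipped because labelling them adds little information; the labelled uncertain samples are then used to adapt the transferred model to the target dataset.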

Speakers
XZ

Xu Zhang

Microsoft Research, Nanjing University
QL

Qingwei Lin

Microsoft Research
YX

Yong Xu

Microsoft Research
SQ

Si Qin

Microsoft Research
HZ

Hongyu Zhang

The University of Newcastle
BQ

Bo Qiao

Microsoft Research
YD

Yingnong Dang

Microsoft
XY

Xinsheng Yang

Microsoft
QC

Qian Cheng

Microsoft
YW

Youjiang Wu

Microsoft
KH

Ken Hsieh

Microsoft
KS

Kaixin Sui

Microsoft Research
XM

Xin Meng

Microsoft Research
YX

Yaohai Xu

Microsoft Research
WZ

Wenchi Zhang

Microsoft Research
FS

Furao Shen

Nanjing University
DZ

Dongmei Zhang

Microsoft Research


Friday July 12, 2019 12:50pm - 1:10pm PDT
USENIX ATC Track II: Grand Ballroom VII–IX