

Track 1
Wednesday, July 10
 

11:20am PDT

The Design and Operation of CloudLab
Given the highly empirical nature of research in cloud computing, networked systems, and related fields, testbeds play an important role in the research ecosystem. In this paper, we cover one such facility, CloudLab, which supports systems research by providing raw access to programmable hardware, enabling research at large scales, and creating a shared platform for repeatable research.

We present our experiences designing CloudLab and operating it for four years, serving nearly 4,000 users who have run over 79,000 experiments on 2,250 servers, switches, and other pieces of datacenter equipment. From this experience, we draw lessons organized around two themes. The first set comes from analysis of data regarding the use of CloudLab: how users interact with it, what they use it for, and the implications for facility design and operation. Our second set of lessons comes from looking at the ways that algorithms used "under the hood," such as resource allocation, have important (and sometimes unexpected) effects on user experience and behavior. These lessons can be of value to the designers and operators of IaaS facilities in general, systems testbeds in particular, and users who have a stake in understanding how these systems are built.

Speakers
Dmitry Duplyakin, University of Utah
Robert Ricci, University of Utah
Aleksander Maricq, University of Utah
Gary Wong, University of Utah
Jonathon Duerig, University of Utah
Eric Eide, University of Utah
Leigh Stoller, University of Utah
Mike Hibler, University of Utah
David Johnson, University of Utah
Kirk Webb, University of Utah
Aditya Akella, University of Wisconsin - Madison
Kuangching Wang, Clemson University
Glenn Ricart, US Ignite
Larry Landweber, University of Wisconsin - Madison
Michael Zink, University of Massachusetts Amherst
Emmanuel Cecchet, University of Massachusetts Amherst
Snigdhaswin Kar, Clemson University
Prabodh Mishra, Clemson University


Wednesday July 10, 2019 11:20am - 11:40am PDT
USENIX ATC Track I: Grand Ballroom I–VI

11:40am PDT

Everyone Loves File: File Storage Service (FSS) in Oracle Cloud Infrastructure
File Storage Service (FSS) is an elastic filesystem provided as a managed NFS service in Oracle Cloud Infrastructure. Using a pipelined Paxos implementation, we implemented a scalable block store that provides linearizable multipage limited-size transactions. On top of the block store, we built a scalable B-tree that provides linearizable multikey limited-size transactions. By using self-validating B-tree nodes and performing all B-tree housekeeping operations as separate transactions, each key in a B-tree transaction requires only one page in the underlying block transaction. The B-tree holds the filesystem metadata. The filesystem provides snapshots by using versioned key-value pairs. The entire system is programmed using a nonblocking lock-free programming style. The presentation servers maintain no persistent local state, with any state kept in the B-tree, making it easy to scale and fail over the presentation servers. We use a non-scalable Paxos-replicated hash table to store configuration information required to bootstrap the system. The system throughput can be predicted by comparing an estimate of the network bandwidth needed for replication to the network bandwidth provided by the hardware. Latency on an unloaded system is about 4 times higher than a Linux NFS server backed by NVMe, reflecting the cost of replication.
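The snapshot mechanism the abstract mentions (versioned key-value pairs) is easy to illustrate. The sketch below is not FSS code; the class and method names are invented, and it keeps every version in memory purely for clarity:

```python
class VersionedStore:
    """Toy key-value store where every write keeps the old version,
    so a snapshot is just a remembered version number (epoch)."""

    def __init__(self):
        self.data = {}      # key -> list of (version, value), ascending
        self.version = 0    # global, monotonically increasing counter

    def put(self, key, value):
        self.version += 1
        self.data.setdefault(key, []).append((self.version, value))

    def snapshot(self):
        # Taking a snapshot is O(1): record the current version.
        return self.version

    def get(self, key, at=None):
        """Read the latest value, or the value visible at snapshot `at`."""
        versions = self.data.get(key, [])
        if at is None:
            return versions[-1][1] if versions else None
        # Latest version written at or before the snapshot epoch.
        visible = [v for ver, v in versions if ver <= at]
        return visible[-1] if visible else None
```

A snapshot is just the version counter at the moment it was taken; reads at that epoch see the last write at or before it, while later writes remain invisible to it.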

Speakers
Bradley C. Kuszmaul, Oracle Corporation
Matteo Frigo, Oracle Corporation
Justin Mazzola Paluska, Oracle Corporation
Alexander (Sasha) Sandler, Oracle Corporation


Wednesday July 10, 2019 11:40am - 12:00pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

12:00pm PDT

Zanzibar: Google’s Consistent, Global Authorization System
Determining whether online users are authorized to access digital objects is central to preserving privacy. This paper presents the design, implementation, and deployment of Zanzibar, a global system for storing and evaluating access control lists. Zanzibar provides a uniform data model and configuration language for expressing a wide range of access control policies from hundreds of client services at Google, including Calendar, Cloud, Drive, Maps, Photos, and YouTube. Its authorization decisions respect causal ordering of user actions and thus provide external consistency amid changes to access control lists and object contents. Zanzibar scales to trillions of access control lists and millions of authorization requests per second to support services used by billions of people. It has maintained 95th-percentile latency of less than 10 milliseconds and availability of greater than 99.999% over 3 years of production use.
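Zanzibar's data model of relation tuples can be sketched in a few lines. This is a toy: the tuple encoding and the `check` function below are invented for illustration, and they omit Zanzibar's namespace configuration, userset rewrite rules, and zookie-based consistency:

```python
# Toy Zanzibar-style check. ACLs are relation tuples of the form
# (object, relation, user), where `user` is either a user id or an
# indirect ("userset", object, relation) reference, e.g. group membership.

def check(tuples, obj, relation, user, depth=10):
    """Does `user` have `relation` on `obj`? Resolves nested usersets."""
    if depth == 0:
        return False  # cut off pathological nesting
    for t_obj, t_rel, t_user in tuples:
        if (t_obj, t_rel) != (obj, relation):
            continue
        if t_user == user:
            return True
        if isinstance(t_user, tuple) and t_user[0] == "userset":
            # Indirect tuple: members of another userset are granted too.
            if check(tuples, t_user[1], t_user[2], user, depth - 1):
                return True
    return False

acls = [
    ("doc:readme", "viewer", ("userset", "group:eng", "member")),
    ("group:eng", "member", "user:alice"),
]
```

Here `user:alice` can view `doc:readme` only via her membership in `group:eng`, which is the kind of indirection the tuple model makes cheap to express.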


Wednesday July 10, 2019 12:00pm - 12:20pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

12:20pm PDT

IASO: A Fail-Slow Detection and Mitigation Framework for Distributed Storage Services
We address the problem of the "fail-slow" fault: a fault in which a hardware or software component still functions (does not fail-stop) but at much lower performance than expected. To address this, we built IASO, a peer-based, non-intrusive fail-slow detection framework that has been deployed for more than 1.5 years across 39,000 nodes at our customer sites and has helped our customers reduce major outages due to fail-slow incidents. IASO works primarily from timeout signals (a negligible monitoring overhead) and converts them into a stable and accurate fail-slow metric, with which it can isolate a slow node within minutes. Within a 7-month period, IASO caught 232 fail-slow incidents in our large field deployment. In this paper, we also assemble a large dataset of these 232 fail-slow incidents along with our analysis. We find that the fail-slow annual failure rate in our field is 1.02%.
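The conversion from raw timeout signals to a fail-slow metric can be sketched as a peer comparison. This is an illustrative reconstruction, not the IASO algorithm; the scoring rule and threshold are invented:

```python
from statistics import median

def fail_slow_scores(timeouts_per_node):
    """Toy peer-based scoring: a node's score is its timeout count
    relative to the median of its peers; a large score suggests the
    node is fail-slow rather than the workload being generally slow."""
    med = median(timeouts_per_node.values())
    return {node: count / max(med, 1)
            for node, count in timeouts_per_node.items()}

def isolate_slow_nodes(timeouts_per_node, threshold=5.0):
    """Flag nodes whose timeout rate is far above the peer median."""
    scores = fail_slow_scores(timeouts_per_node)
    return sorted(n for n, s in scores.items() if s >= threshold)
```

Comparing against peers (rather than an absolute timeout budget) is what makes the signal stable: a cluster-wide slowdown raises the median too, so only genuinely anomalous nodes stand out.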

Speakers
Biswaranjan Panda, Nutanix Inc.
Huan Ke, University of Chicago
Karan Gupta, Nutanix Inc.
Vinayak Khot, Nutanix Inc.
Haryadi S. Gunawi, University of Chicago


Wednesday July 10, 2019 12:20pm - 12:40pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

2:20pm PDT

Extension Framework for File Systems in User space
User file systems offer numerous advantages over their in-kernel implementations, such as ease of development and better system reliability. However, they incur a heavy performance penalty. We observe that existing user file system frameworks are highly general: they consist of a minimal interposition layer in the kernel that simply forwards all low-level requests to user space. While this design offers flexibility, it also severely degrades performance due to frequent kernel-user context switching.

This work introduces ExtFUSE, a framework for developing extensible user file systems that also allows applications to register "thin" specialized request handlers in the kernel to meet their specific operative needs, while retaining the complex functionality in user space. Our evaluation with two FUSE file systems shows that ExtFUSE can improve the performance of user file systems with less than a few hundred lines of code on average. ExtFUSE is available on GitHub.

Speakers
Ashish Bijlani, PhD Student, Georgia Institute of Technology
Ashish is a CS PhD student at Georgia Institute of Technology. His area of research is mobile storage.
Umakishore Ramachandran, Georgia Institute of Technology


Wednesday July 10, 2019 2:20pm - 2:40pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

2:40pm PDT

FlexGroup Volumes: A Distributed WAFL File System
The rapid growth of customer applications and datasets has led to demand for storage that can scale with the needs of modern workloads. We have developed FlexGroup volumes to meet this need. FlexGroups combine local WAFL® file systems in a distributed storage cluster to provide a single namespace that seamlessly scales across the aggregate resources of the cluster (CPU, storage, etc.) while preserving the features and robustness of the WAFL file system.

In this paper we present the FlexGroup design, which includes a new remote access layer that supports distributed transactions and the novel heuristics used to balance load and capacity across a storage cluster. We evaluate FlexGroup performance and efficacy through lab tests and field data from over 1,000 customer FlexGroups.


Wednesday July 10, 2019 2:40pm - 3:00pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

3:00pm PDT

EROFS: A Compression-friendly Readonly File System for Resource-scarce Devices
Smartphones usually have limited storage and runtime memory. Compressed read-only file systems can dramatically decrease the storage used by read-only system resources. However, existing compressed read-only file systems use fixed-sized input compression, which causes significant I/O amplification and unnecessary computation. They also consume excessive runtime memory during decompression and degrade performance when runtime memory is scarce. In this paper, we describe EROFS, a new compression-friendly read-only file system that leverages fixed-sized output compression and memory-efficient decompression to achieve high performance with little extra memory overhead. We also report our experience of deploying EROFS on tens of millions of smartphones. Evaluation results show that EROFS outperforms existing compressed read-only file systems across various micro-benchmarks and reduces the boot time of real-world applications by up to 22.9% while nearly halving storage usage.

Speakers
Xiang Gao, Huawei Technologies Co., Ltd.
Mingkai Dong, Shanghai Jiao Tong University
Xie Miao, Huawei Technologies Co., Ltd.
Wei Du, Huawei Technologies Co., Ltd.
Chao Yu, Huawei Technologies Co., Ltd.
Haibo Chen, Shanghai Jiao Tong University / Huawei Technologies Co., Ltd.


Wednesday July 10, 2019 3:00pm - 3:20pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

3:20pm PDT

QZFS: QAT Accelerated Compression in File System for Application Agnostic and Cost Efficient Data Storage
Data compression can not only provide space efficiency with lower Total Cost of Ownership (TCO) but also enhance I/O performance because of the reduced read/write operations. However, lossless compression algorithms with a high compression ratio (e.g., gzip) inevitably incur high CPU resource consumption. Prior studies mainly leveraged general-purpose hardware accelerators such as GPUs and FPGAs to offload costly (de)compression operations for application workloads. This paper investigates ASIC-accelerated compression in the file system to transparently benefit all applications running on it and to provide high-performance, cost-efficient data storage. Based on the Intel® QAT ASIC, we propose QZFS, which integrates QAT into the ZFS file system to achieve efficient gzip (de)compression offloading at the file system layer. A compression service engine is introduced in QZFS to serve as an algorithm selector and to implement compressibility-dependent offloading and selective offloading by source data size. More importantly, a QAT offloading module is designed to leverage the vectored I/O model to reconstruct data blocks so that they can be consumed by QAT hardware without extra memory copies. A comprehensive evaluation validates that QZFS achieves up to 5x write throughput improvement for the FIO micro-benchmark and more than 6x cost-efficiency enhancement for genomic data post-processing over a software-implemented alternative.

Speakers
Xiaokang Hu, Shanghai Jiao Tong University, Intel Asia-Pacific R&D Ltd.
Fuzong Wang, Shanghai Jiao Tong University, Intel Asia-Pacific R&D Ltd.
Weigang Li, Intel Asia-Pacific R&D Ltd.
Jian Li, Shanghai Jiao Tong University
Haibing Guan, Shanghai Jiao Tong University


Wednesday July 10, 2019 3:20pm - 3:40pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

4:10pm PDT

DistCache: Provable Load Balancing for Large-Scale Storage Systems with Distributed Caching
Best Paper at FAST '19

Authors:
Zaoxing Liu and Zhihao Bai, Johns Hopkins University; Zhenming Liu, College of William and Mary; Xiaozhou Li, Celer Network; Changhoon Kim, Barefoot Networks; Vladimir Braverman and Xin Jin, Johns Hopkins University; Ion Stoica, UC Berkeley

Load balancing is critical for distributed storage to meet strict service-level objectives (SLOs). It has been shown that a fast cache can guarantee load balancing for a clustered storage system. However, when the system scales out to multiple clusters, the fast cache itself becomes the bottleneck. Traditional mechanisms like cache partitioning and cache replication either result in load imbalance between cache nodes or incur high overhead for cache coherence.

We present DistCache, a new distributed caching mechanism that provides provable load balancing for large-scale storage systems. DistCache co-designs cache allocation with cache topology and query routing. The key idea is to partition the hot objects with independent hash functions between cache nodes in different layers, and to adaptively route queries with the power-of-two-choices. We prove that DistCache enables the cache throughput to increase linearly with the number of cache nodes, by unifying techniques from expander graphs, network flows, and queuing theory. DistCache is a general solution that can be applied to many storage systems. We demonstrate the benefits of DistCache by providing the design, implementation, and evaluation of the use case for emerging switch-based caching.
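The key idea (independent hash functions per cache layer plus power-of-two-choices query routing) can be sketched directly. This is a toy model of the mechanism, not the DistCache implementation:

```python
import hashlib

def h(layer, key, buckets):
    """Independent hash per layer, derived from a per-layer salt."""
    digest = hashlib.sha256(f"{layer}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % buckets

def route(key, lower_loads, upper_loads):
    """Power-of-two-choices: each hot object has one candidate cache
    node per layer (chosen by independent hashes); send the query to
    the less loaded of the two and account for it."""
    lo = h("lower", key, len(lower_loads))
    up = h("upper", key, len(upper_loads))
    if lower_loads[lo] <= upper_loads[up]:
        lower_loads[lo] += 1
        return ("lower", lo)
    upper_loads[up] += 1
    return ("upper", up)
```

Because the two layers hash independently, two objects that collide on one layer almost surely land on different nodes in the other, and the two-choice rule then steers load away from whichever copy is busier.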

Wednesday July 10, 2019 4:10pm - 4:30pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

4:30pm PDT

Protocol-Aware Recovery for Consensus-Based Storage
Best Paper at FAST '18

Authors:
Ramnatthan Alagappan and Aishwarya Ganesan, University of Wisconsin—Madison; Eric Lee, University of Texas at Austin; Aws Albarghouthi, University of Wisconsin—Madison; Vijay Chidambaram, University of Texas at Austin; Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau, University of Wisconsin—Madison

We introduce protocol-aware recovery (PAR), a new approach that exploits protocol-specific knowledge to correctly recover from storage faults in distributed systems. We demonstrate the efficacy of PAR through the design and implementation of corruption-tolerant replication (CTRL), a PAR mechanism specific to replicated state machine (RSM) systems. We experimentally show that the CTRL versions of two systems, LogCabin and ZooKeeper, safely recover from storage faults and provide high availability, while the unmodified versions can lose data or become unavailable. We also show that the CTRL versions have little performance overhead.
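The core idea of protocol-aware recovery (distinguish a storage fault from ordinary state and repair it from a replica instead of truncating the log) can be sketched with checksummed log entries. This is an illustrative toy, not CTRL's actual protocol:

```python
import zlib

def make_entry(index, data):
    """A log entry carrying a CRC so corruption is detectable later."""
    return {"index": index, "data": data, "crc": zlib.crc32(data)}

def scan_log(log):
    """Return indices of entries whose checksum no longer matches
    (a storage fault), as opposed to entries that are simply absent."""
    return [e["index"] for e in log if zlib.crc32(e["data"]) != e["crc"]]

def repair(log, replicas):
    """Protocol-aware idea (simplified): a corrupt committed entry is
    refetched from any replica holding an intact copy, rather than
    truncating the log and silently losing committed data."""
    for idx in scan_log(log):
        for replica in replicas:
            candidate = replica[idx]
            if zlib.crc32(candidate["data"]) == candidate["crc"]:
                log[idx] = dict(candidate)
                break
    return log
```

The protocol-specific knowledge here is knowing that the entry was committed and therefore must exist intact on a quorum; a protocol-oblivious layer would have to guess between truncation and crash.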

Wednesday July 10, 2019 4:30pm - 4:50pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

4:50pm PDT

Orca: Differential Bug Localization in Large-Scale Services
Best Paper at OSDI '18

Authors:
Ranjita Bhagwan, Rahul Kumar, Chandra Sekhar Maddila, and Adithya Abraham Philip, Microsoft Research India

Today, we depend on numerous large-scale services for basic operations such as email. These services are complex and extremely dynamic as developers continuously commit code and introduce new features, fixes, and, consequently, new bugs. Hundreds of commits may enter deployment simultaneously. Therefore, one of the most time-critical, yet complex, tasks in mitigating service disruption is localizing the bug to the right commit.

This paper presents the concept of differential bug localization, which uses a combination of differential code analysis and software provenance tracking to effectively pinpoint buggy commits. We have built Orca, a customized code search engine that implements differential bug localization. Orca is actively used by the On-Call Engineers (OCEs) of a large enterprise email and collaboration service to localize bugs to the appropriate buggy commits. Our evaluation shows that Orca correctly localizes 77% of bugs for which it has been used. We also show that it causes a 4x reduction in the work done by the OCE.
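Differential bug localization can be approximated in miniature: score each recent commit by how much its diff overlaps the failure symptom. This toy ignores Orca's provenance tracking and ranking refinements, and all names are invented:

```python
import re

def tokens(text):
    """Crude tokenizer over identifiers and words (2+ characters)."""
    return set(re.findall(r"[A-Za-z_]\w+", text.lower()))

def rank_commits(symptom, commits):
    """Toy differential localization: score each (commit_id, diff) pair
    by the overlap between tokens in the failure symptom (e.g., an
    exception message) and tokens the commit's diff touched; return
    commit ids with nonzero overlap, best first."""
    sym = tokens(symptom)
    scored = [(len(sym & tokens(diff)), cid) for cid, diff in commits]
    return [cid for score, cid in sorted(scored, reverse=True) if score > 0]
```

The "differential" part is that only code changed by candidate commits is searched, which is what lets the symptom's vocabulary discriminate between hundreds of simultaneous deployments.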

Wednesday July 10, 2019 4:50pm - 5:10pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

5:10pm PDT

LegoOS: A Disseminated, Distributed OS for Hardware Resource Disaggregation
Best Paper at OSDI '18

Authors:
Yizhou Shan, Yutong Huang, Yilun Chen, and Yiying Zhang, Purdue University

The monolithic server model where a server is the unit of deployment, operation, and failure is meeting its limits in the face of several recent hardware and application trends. To improve heterogeneity, elasticity, resource utilization, and failure handling in datacenters, we believe that datacenters should break monolithic servers into disaggregated, network-attached hardware components. Despite the promising benefits of hardware resource disaggregation, no existing OSes or software systems can properly manage it. We propose a new OS model called the splitkernel to manage disaggregated systems. Splitkernel disseminates traditional OS functionalities into loosely-coupled monitors, each of which runs on and manages a hardware component. Using the splitkernel model, we built LegoOS, a new OS designed for hardware resource disaggregation. LegoOS appears to users as a set of distributed servers. Internally, LegoOS cleanly separates processor, memory, and storage devices both at the hardware level and the OS level. We implemented LegoOS from scratch and evaluated it by emulating hardware components using commodity servers. Our evaluation results show that LegoOS’s performance is comparable to monolithic Linux servers, while largely improving resource packing and failure rate over monolithic clusters.

Wednesday July 10, 2019 5:10pm - 5:30pm PDT
USENIX ATC Track I: Grand Ballroom I–VI
 
Thursday, July 11
 

9:45am PDT

Darwin: A Genomics Co-processor Provides up to 15,000X Acceleration on Long Read Assembly
Best Paper at ASPLOS '18

Authors: Yatish Turakhia and Gill Bejerano, Stanford University; William J. Dally, Stanford University and NVIDIA Research

Genomics is transforming medicine and our understanding of life in fundamental ways. Genomics data, however, is far outpacing Moore's Law. Third-generation sequencing technologies produce 100X longer reads than second-generation technologies and reveal a much broader mutation spectrum of disease and evolution. However, these technologies incur prohibitively high computational costs. Over 1,300 CPU hours are required for reference-guided assembly of the human genome, and over 15,600 CPU hours are required for de novo assembly. This paper describes "Darwin," a co-processor for genomic sequence alignment that, without sacrificing sensitivity, provides up to 15,000X speedup over the state-of-the-art software for reference-guided assembly of third-generation reads. Darwin achieves this speedup through hardware/algorithm co-design, trading more easily accelerated alignment for less memory-intensive filtering, and by optimizing the memory system for filtering. Darwin combines a hardware-accelerated version of D-SOFT, a novel filtering algorithm, with a hardware-accelerated version of GACT, a novel alignment algorithm. GACT generates near-optimal alignments of arbitrarily long genomic sequences using constant memory for the compute-intensive step. Darwin is adaptable, with tunable speed and sensitivity to match emerging sequencing technologies and to meet the requirements of genomic applications beyond read assembly.
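The filtering stage can be sketched as classic seed counting per diagonal band. This is a simplification in the spirit of D-SOFT (the real algorithm counts unique matching bases rather than seeds), and the k-mer size, band width, and threshold below are invented:

```python
from collections import defaultdict

def build_index(ref, k):
    """Hash index from each k-mer to its positions in the reference."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index

def dsoft_candidates(query, ref, k=4, band=8, threshold=2):
    """Toy seed filter: count seed hits per diagonal band
    (ref_pos - query_pos) and keep bands with enough matching seeds
    as candidate alignment locations for the expensive aligner."""
    index = build_index(ref, k)
    hits = defaultdict(int)
    for q in range(len(query) - k + 1):
        for r in index.get(query[q:q + k], ()):
            hits[(r - q) // band] += 1
    return sorted(b for b, n in hits.items() if n >= threshold)
```

Filtering like this is memory-light and easy to parallelize, which is the trade the abstract describes: spend more on (hardware-friendly) alignment so that filtering can stay cheap.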

Thursday July 11, 2019 9:45am - 10:05am PDT
USENIX ATC Track I: Grand Ballroom I–VI

10:05am PDT

The Semantics of Transactions and Weak Memory in x86, Power, ARM, and C++
Best Paper at PLDI 2018

Authors:
Nathan Chong, Arm; Tyler Sorensen and John Wickerson, Imperial College London

Weak memory models provide a complex, system-centric semantics for concurrent programs, while transactional memory (TM) provides a simpler, programmer-centric semantics. Both have been studied in detail, but their combined semantics is not well understood. This is problematic because such widely-used architectures and languages as x86, Power, and C++ all support TM, and all have weak memory models.

Our work aims to clarify the interplay between weak memory and TM by extending existing axiomatic weak memory models (x86, Power, ARMv8, and C++) with new rules for TM. Our formal models are backed by automated tooling that enables (1) the synthesis of tests for validating our models against existing implementations and (2) the model-checking of TM-related transformations, such as lock elision and compiling C++ transactions to hardware. A key finding is that a proposed TM extension to ARMv8 currently being considered within ARM Research is incompatible with lock elision without sacrificing portability or performance.

Thursday July 11, 2019 10:05am - 10:25am PDT
USENIX ATC Track I: Grand Ballroom I–VI

10:25am PDT

Who Left Open the Cookie Jar? A Comprehensive Evaluation of Third-Party Cookie Policies
Distinguished Paper Award and 2018 Internet Defense Prize at USENIX Security '18

Authors:
Gertjan Franken, Tom Van Goethem, and Wouter Joosen, imec-DistriNet, KU Leuven

Nowadays, cookies are the most prominent mechanism to identify and authenticate users on the Internet. Although protected by the Same Origin Policy, popular browsers include cookies in all requests, even when these are cross-site. Unfortunately, these third-party cookies enable both cross-site attacks and third-party tracking. As a response to these nefarious consequences, various countermeasures have been developed in the form of browser extensions or even protection mechanisms that are built directly into the browser.

In this paper, we evaluate the effectiveness of these defense mechanisms by leveraging a framework that automatically evaluates the enforcement of the policies imposed on third-party requests. By applying our framework, which generates a comprehensive set of test cases covering various web mechanisms, we identify several flaws in the policy implementations of the 7 browsers and 46 browser extensions that we evaluated. We find that even built-in protection mechanisms can be circumvented by multiple novel techniques we discover. Based on these results, we argue that our proposed framework is a much-needed tool to detect bypasses and evaluate solutions to the exposed leaks. Finally, we analyze the origin of the identified bypass techniques and find that these are due to a variety of implementation, configuration, and design flaws.

Thursday July 11, 2019 10:25am - 10:45am PDT
USENIX ATC Track I: Grand Ballroom I–VI

11:15am PDT

NICA: An Infrastructure for Inline Acceleration of Network Applications
With rising network rates, cloud vendors increasingly deploy FPGA-based SmartNICs (F-NICs), leveraging their inline processing capabilities to offload hypervisor networking infrastructure. However, the use of F-NICs for accelerating general-purpose server applications in clouds has been limited.

NICA is a hardware-software co-designed framework for inline acceleration of the application data plane on F-NICs in multi-tenant systems. A new ikernel programming abstraction, tightly integrated with the network stack, enables application control of F-NIC computations that process application network traffic, with minimal code changes. In addition, NICA’s virtualization architecture supports fine-grain time-sharing of F-NIC logic and provides I/O path virtualization. Together these features enable cost-effective sharing of F-NICs across virtual machines with strict performance guarantees.

We prototype NICA on Mellanox F-NICs and integrate ikernels with the high-performance VMA network stack and the KVM hypervisor. We demonstrate significant acceleration of real-world applications in both bare-metal and virtualized environments, while requiring only minor code modifications to accelerate them on F-NICs. For example, a transparent key-value store cache ikernel added to the stock memcached server reaches 40 Gbps server throughput (99% line-rate) at 6 μs 99th-percentile latency for 16-byte key-value pairs, which is 21× the throughput of a 6-core CPU with a kernel-bypass network stack. The throughput scales linearly for up to 6 VMs running independent instances of memcached.

Speakers
Haggai Eran, Technion – Israel Institute of Technology & Mellanox Technologies
Lior Zeno, Technion – Israel Institute of Technology
Maroun Tork, Technion – Israel Institute of Technology
Gabi Malka, Technion – Israel Institute of Technology
Mark Silberstein, Technion – Israel Institute of Technology


Thursday July 11, 2019 11:15am - 11:35am PDT
USENIX ATC Track I: Grand Ballroom I–VI

11:35am PDT

E3: Energy-Efficient Microservices on SmartNIC-Accelerated Servers
We investigate the use of SmartNIC-accelerated servers to execute microservice-based applications in the data center. By offloading suitable microservices to the SmartNIC’s low-power processor, we can improve server energy-efficiency without latency loss. However, as a heterogeneous computing substrate in the data path of the host, SmartNICs bring several challenges to a microservice platform: network traffic routing and load balancing, microservice placement on heterogeneous hardware, and contention on shared SmartNIC resources. We present E3, a microservice execution platform for SmartNIC-accelerated servers. E3 follows the design philosophies of the Azure Service Fabric microservice platform and extends key system components to a SmartNIC to address the above-mentioned challenges. E3 employs three key techniques: ECMP-based load balancing via SmartNICs to the host, network topology-aware microservice placement, and a data-plane orchestrator that can detect SmartNIC overload. Our E3 prototype using Cavium LiquidIO SmartNICs shows that SmartNIC offload can improve cluster energy-efficiency up to 3× and cost efficiency up to 1.9× at up to 4% latency cost for common microservices, including real-time analytics, an IoT hub, and virtual network functions.
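ECMP-based load balancing via the SmartNIC amounts to hashing a flow's 5-tuple so that all packets of a flow reach the same host-side microservice instance. A minimal sketch, with invented names (E3's actual dispatch logic runs on the NIC, not in Python):

```python
import hashlib

def ecmp_next_hop(flow, instances):
    """ECMP-style dispatch as a SmartNIC might do it: hash the flow's
    5-tuple (src ip, src port, dst ip, dst port, proto) so that every
    packet of a given flow is steered to the same instance, while
    distinct flows spread across all instances."""
    key = ":".join(map(str, flow)).encode()
    digest = hashlib.sha256(key).digest()
    return instances[int.from_bytes(digest[:8], "big") % len(digest_buckets(instances))]

def digest_buckets(instances):
    # Trivial helper: the bucket set is just the instance list itself.
    return instances
```

Per-flow stickiness matters because a microservice may hold connection state; hashing (rather than round-robin) gives stickiness without any per-flow table on the NIC.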

Speakers
Ming Liu, University of Washington
Simon Peter, The University of Texas at Austin
Arvind Krishnamurthy, University of Washington
Phitchaya Mangpo Phothilimthana, University of California, Berkeley


Thursday July 11, 2019 11:35am - 11:55am PDT
USENIX ATC Track I: Grand Ballroom I–VI

11:55am PDT

INSIDER: Designing In-Storage Computing System for Emerging High-Performance Drive
We present INSIDER, a full-stack redesigned storage system to help users fully utilize the performance of emerging storage drives with moderate programming effort. On the hardware side, INSIDER introduces an FPGA-based reconfigurable drive controller as the in-storage computing (ISC) unit; it is able to saturate the high drive performance while retaining enough programmability. On the software side, INSIDER integrates with the existing system stack and provides effective abstractions. For the host programmer, we introduce a virtual file abstraction that presents ISC as file operations; this hides the existence of the drive processing unit and minimizes the host code modification needed to leverage the drive's computing capability. By separating out the drive processing unit to the data plane, we expose a clear drive-side interface so that drive programmers can focus on describing the computation logic; the details of data movement between different system components are hidden. With the software/hardware co-design, the INSIDER runtime provides crucial system support. It not only transparently enforces isolation and scheduling among offloaded programs, but it also protects the drive data from being accessed by unwarranted programs.

We build an INSIDER drive prototype and implement its corresponding software stack. The evaluation shows that INSIDER achieves an average 12X performance improvement and 31X accelerator cost efficiency when compared to the existing ARM-based ISC system. Additionally, it requires much less effort when implementing applications. INSIDER is open-sourced, and we have adapted it to the AWS F1 instance for public access.


Thursday July 11, 2019 11:55am - 12:15pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

12:15pm PDT

Cognitive SSD: A Deep Learning Engine for In-Storage Data Retrieval
Data analysis and retrieval is a widely used component in existing artificial intelligence systems. However, each request has to go through each layer of the I/O stack, which moves a tremendous amount of irrelevant data between secondary storage, DRAM, and the on-chip cache. This leads to high response latency and rising energy consumption. To address this issue, we propose Cognitive SSD, an energy-efficient engine for deep-learning-based unstructured data retrieval. In Cognitive SSD, a flash-accessing accelerator named DLG-x is placed beside the flash memory to achieve near-data deep learning and graph search. These in-SSD deep learning and graph search functions are exposed to users as library APIs via an NVMe command extension. Experimental results on an FPGA-based prototype reveal that Cognitive SSD reduces latency by 69.9% on average in comparison with CPU-based solutions on conventional SSDs, and it reduces overall system power consumption by up to 34.4% and 63.0%, respectively, when compared to CPU- and GPU-based solutions that deliver comparable performance.

Speakers
Shengwen Liang, State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing; University of Chinese Academy of Sciences
Ying Wang, State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing; University of Chinese Academy of Sciences
Youyou Lu, Tsinghua University
Zhe Yang, Tsinghua University
Huawei Li, State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing; University of Chinese Academy of Sciences
Xiaowei Li, State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing; University of Chinese Academy of Sciences


Thursday July 11, 2019 12:15pm - 12:35pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

2:00pm PDT

From Laptop to Lambda: Outsourcing Everyday Jobs to Thousands of Transient Functional Containers
We present gg, a framework and a set of command-line tools that helps people execute everyday applications—e.g., software compilation, unit tests, video encoding, or object recognition—using thousands of parallel threads on a cloud-functions service to achieve near-interactive completion time. In the future, instead of running these tasks on a laptop, or keeping a warm cluster running in the cloud, users might push a button that spawns 10,000 parallel cloud functions to execute a large job in a few seconds from start. gg is designed to make this practical and easy.

With gg, applications express a job as a composition of lightweight OS containers that are individually transient (lifetimes of 1–60 seconds) and functional (each container is hermetically sealed and deterministic). gg takes care of instantiating these containers on cloud functions, loading dependencies, minimizing data movement, moving data between containers, and dealing with failure and stragglers.

We ported several latency-sensitive applications to run on gg and evaluated its performance. In the best case, a distributed compiler built on gg outperformed a conventional tool (icecc) by 2–5×, without requiring a warm cluster running continuously. In the worst case, gg was within 20% of the hand-tuned performance of an existing tool for video encoding (ExCamera).
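The thunk abstraction (hermetic, deterministic units of work whose outputs can be memoized by content hash) can be sketched in a few lines. This is a toy model of the idea, not gg's actual thunk format or tooling:

```python
import hashlib, json

def force(name, thunks, cache, ran):
    """Evaluate a DAG of 'thunks' bottom-up. Each thunk is hermetic:
    its output depends only on its op name and its inputs, so results
    can be memoized by a content hash and never recomputed, the way a
    coordinator might skip work before shipping it to cloud functions."""
    op, deps, fn = thunks[name]
    args = [force(d, thunks, cache, ran) for d in deps]
    key = hashlib.sha256(json.dumps([op, args]).encode()).hexdigest()
    if key not in cache:
        ran.append(name)          # this thunk actually executed
        cache[key] = fn(*args)
    return cache[key]

# A toy "build": two compile steps feeding a link step.
thunks = {
    "a.o":  ("compile a.c", [], lambda: "obj(a)"),
    "b.o":  ("compile b.c", [], lambda: "obj(b)"),
    "prog": ("link", ["a.o", "b.o"], lambda a, b: f"exe[{a}+{b}]"),
}
```

Because the memo key is derived from the operation and its input contents (not timestamps or paths), identical sub-jobs are deduplicated across the whole DAG, which is what makes spawning thousands of transient workers affordable.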

Speakers
Sadjad Fouladi, Stanford University
Francisco Romero, Stanford University
Dan Iter, Stanford University
Qian Li, Stanford University
Shuvo Chatterjee, Unaffiliated
Christos Kozyrakis, Stanford University
Matei Zaharia, Stanford University
Keith Winstein, Stanford University


Thursday July 11, 2019 2:00pm - 2:20pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

2:20pm PDT

Hodor: Intra-Process Isolation for High-Throughput Data Plane Libraries
As network, I/O, accelerator, and NVM devices capable of a million operations per second make their way into data centers, the software stack managing such devices has been shifting from implementations within the operating system kernel to more specialized kernel-bypass approaches. While the in-kernel approach guarantees safety and provides resource multiplexing, it imposes too much overhead on microsecond-scale tasks. Kernel-bypass approaches improve throughput substantially but sacrifice safety and complicate resource management: if applications are mutually distrusting, then either each application must have exclusive access to its own device or else the device itself must implement resource management.

This paper shows how to attain both safety and performance via intra-process isolation for data plane libraries. We propose protected libraries as a new OS abstraction which provides separate user-level protection domains for different services (e.g., network and in-memory database), with performance approaching that of unprotected kernel bypass. We also show how this new feature can be utilized to enable sharing of data plane libraries across distrusting applications. Our proposed solution uses Intel's memory protection keys (PKU) in a safe way to change the permissions associated with subsets of a single address space. In addition, it uses hardware watchpoints to delay asynchronous event delivery and to guarantee independent failure of applications sharing a protected library.

We show that our approach can efficiently protect high-throughput in-memory databases and user-space network stacks. Our implementation allows up to 2.3 million library entrances per second per core, outperforming both kernel-level protection and two alternative implementations that use system calls and Intel's VMFUNC switching of user-level address spaces, respectively.
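The protection-domain idea above can be illustrated with a toy model: memory is tagged with a protection key, a per-thread permission set plays the role of the x86 PKRU register, and the call gate into the protected library widens permissions only for the duration of the call. All names and the structure here are illustrative, not Hodor's implementation.

```python
# Toy model of intra-process protection domains in the spirit of
# protected libraries. APP_KEY tags application memory; LIB_KEY tags
# the library's private state, which the application must not touch.

APP_KEY, LIB_KEY = 0, 1

class Domain:
    def __init__(self, allowed_keys):
        self.allowed = set(allowed_keys)   # keys this domain may access

    def access(self, memory, addr):
        key, value = memory[addr]
        if key not in self.allowed:
            raise PermissionError("pkey fault at %#x" % addr)
        return value

app = Domain({APP_KEY})                 # application: narrow permissions
lib = Domain({APP_KEY, LIB_KEY})        # library: may also touch its state

memory = {0x1000: (APP_KEY, "app data"),
          0x2000: (LIB_KEY, "library state")}

def library_call(addr):
    # The call gate: run with the library domain's permissions, then
    # fall back to the application's narrower view on return.
    return lib.access(memory, addr)
```

The fast-path property is that the "domain switch" is just a register update on entry and exit, with no kernel transition on the common path.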

Speakers
Mohammad Hedayati (University of Rochester)
Spyridoula Gravani (University of Rochester)
Ethan Johnson (University of Rochester)
John Criswell (University of Rochester)
Michael L. Scott (University of Rochester)
Kai Shen (Google)


Thursday July 11, 2019 2:20pm - 2:40pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

2:40pm PDT

A Retargetable System-Level DBT Hypervisor
System-level Dynamic Binary Translation (DBT) provides the capability to boot an Operating System (OS) and execute programs compiled for an Instruction Set Architecture (ISA) different from that of the host machine. Due to their performance-critical nature, system-level DBT frameworks are typically hand-coded and heavily optimized, both for their guest and host architectures. While this results in good performance of the DBT system, engineering costs for supporting a new architecture, or extending an existing one, are high. In this paper we develop a novel, retargetable DBT hypervisor, which includes guest-specific modules generated from high-level guest machine specifications. Our system not only simplifies retargeting of the DBT, but also delivers performance levels in excess of existing manually created DBT solutions. We achieve this by combining offline and online optimizations, and by exploiting the freedom of a just-in-time (JIT) compiler operating in the bare-metal environment provided by a virtual machine. We evaluate our DBT using both targeted micro-benchmarks and standard application benchmarks, and we demonstrate its ability to outperform the de facto standard QEMU DBT system. Our system delivers an average speedup of 2.21× over QEMU across SPEC CPU2006 integer benchmarks running in a full-system Linux OS environment, compiled for the 64-bit ARMv8-A ISA, and hosted on an x86-64 platform. For floating-point applications the speedup is even higher, reaching 6.49× on average. We demonstrate that our system-level DBT significantly reduces the effort required to support a new ISA, while delivering outstanding performance.

Speakers
Tom Spink (University of Edinburgh)
Harry Wagstaff (University of Edinburgh)
Björn Franke (University of Edinburgh)


Thursday July 11, 2019 2:40pm - 3:00pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

3:00pm PDT

MTS: Bringing Multi-Tenancy to Virtual Networking
Multi-tenant cloud computing provides great benefits in terms of resource sharing, elastic pricing, and scalability; however, it also changes the security landscape and introduces the need for strong isolation between tenants, including inside the network. This paper is motivated by the observation that while multi-tenancy is widely used in cloud computing, the virtual switch designs currently used for network virtualization lack sufficient support for tenant isolation. Hence, we present, implement, and evaluate a virtual switch architecture, MTS, which brings secure-design best practices to the context of multi-tenant virtual networking: compartmentalization of virtual switches, least-privilege execution, complete mediation of all network communication, and a reduced trusted computing base shared between tenants. We build MTS from commodity components, providing an incrementally deployable and inexpensive upgrade path for cloud operators. Our extensive experiments, extending to both micro-benchmarks and cloud applications, show that, depending on the way it is deployed, MTS may produce 1.5-2x the throughput compared to the state of the art, with similar or better latency and modest resource overhead (one extra CPU). MTS is available as open-source software.

Speakers
Kashyap Thimmaraju (Technische Universität Berlin)
Saad Hermak (Technische Universität Berlin)
Gabor Retvari (BME HSNLab)
Stefan Schmid (Faculty of Computer Science, University of Vienna)


Thursday July 11, 2019 3:00pm - 3:20pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

3:50pm PDT

Asynchronous I/O Stack: A Low-latency Kernel I/O Stack for Ultra-Low Latency SSDs
Today's ultra-low latency SSDs can deliver an I/O latency of sub-ten microseconds. With this dramatically shrunken device time, operations inside the kernel I/O stack, which were traditionally considered lightweight, are no longer a negligible portion. This motivates us to reexamine the storage I/O stack design and propose an asynchronous I/O stack (AIOS), where synchronous operations in the I/O path are replaced by asynchronous ones to overlap I/O-related CPU operations with device I/O. The asynchronous I/O stack leverages a lightweight block layer specialized for NVMe SSDs using the page cache without block I/O scheduling and merging, thereby reducing the sojourn time in the block layer. We prototype the proposed asynchronous I/O stack on the Linux kernel and evaluate it with various workloads. Synthetic FIO benchmarks demonstrate that the application-perceived I/O latency falls into single-digit microseconds for 4 KB random reads on Optane SSD, and the overall I/O latency is reduced by 15-33% across varying block sizes. This I/O latency reduction leads to a significant performance improvement of real-world applications as well: 11-44% IOPS increase on RocksDB and 15-30% throughput improvement on Filebench and OLTP workloads.
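The latency argument above can be made concrete with a back-of-the-envelope model: in a synchronous path, every kernel stage and the device I/O form a serial chain, whereas an asynchronous path overlaps the remaining kernel CPU work with the device access. The stage costs below are invented illustrative numbers, not measurements from the paper.

```python
# Why overlapping kernel CPU work with device I/O matters once the
# device itself takes only a few microseconds.

def sync_read_latency(cpu_stages_us, device_us):
    # Traditional path: kernel stages and device I/O run serially.
    return sum(cpu_stages_us) + device_us

def async_read_latency(cpu_submit_us, cpu_overlap_us, device_us):
    # Asynchronous path: after submission, the remaining CPU work
    # (e.g., page allocation, DMA unmapping) runs while the device works,
    # so only the longer of the two contributes.
    return cpu_submit_us + max(cpu_overlap_us, device_us)

device = 8.0                      # hypothetical ultra-low-latency SSD read
stages = [1.0, 2.0, 1.5]          # submit stage + overlappable kernel work
sync_lat = sync_read_latency(stages, device)
async_lat = async_read_latency(stages[0], stages[1] + stages[2], device)
```

With a 10 ms disk the 4.5 us of CPU work is noise; at 8 us it is more than a third of the total, which is exactly the regime the paper targets.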

Speakers
Gyusun Lee (Sungkyunkwan University)
Seokha Shin (Sungkyunkwan University)
Wonsuk Song (Sungkyunkwan University)
Tae Jun Ham (Seoul National University)
Jae W. Lee (Seoul National University)
Jinkyu Jeong (Sungkyunkwan University)


Thursday July 11, 2019 3:50pm - 4:10pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

4:10pm PDT

M³x: Autonomous Accelerators via Context-Enabled Fast-Path Communication
Performance and efficiency requirements are driving a trend towards specialized accelerators in both datacenters and embedded devices. In order to cut down communication overheads, system components are pinned to cores and fast-path communication between them is established. These fast paths reduce latency by avoiding indirections through the operating system. However, we see three roadblocks that can impede further gains: First, accelerators today need to be assisted by a general-purpose core, because they cannot autonomously access operating system services like file systems or network stacks. Second, fast-path communication is at odds with preemptive context switching, which is still necessary today to improve efficiency when applications underutilize devices. Third, these concepts should be kept orthogonal, such that direct and unassisted communication is possible between any combination of accelerators and general-purpose cores. At the same time, all of them should support switching between multiple application contexts, which is most difficult with accelerators that lack the hardware features to run an operating system.

We present M³x, a system architecture that removes these roadblocks. M³x retains the low overhead of fast-path communication while enabling context switching for general-purpose cores and specialized accelerators. M³x runs accelerators autonomously and achieves a speedup of 4.7× for PCIe-attached image-processing accelerators compared to traditional assisted operation. At the same time, utilization of the host CPU is reduced by a factor of 30.

Speakers
Nils Asmussen (Technische Universität Dresden, Germany; Barkhausen Institut, Dresden, Germany)
Michael Roitzsch (Technische Universität Dresden, Germany; Barkhausen Institut, Dresden, Germany)
Hermann Härtig (Technische Universität Dresden, Germany; Barkhausen Institut, Dresden, Germany)


Thursday July 11, 2019 4:10pm - 4:30pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

4:35pm PDT

GAIA: An OS Page Cache for Heterogeneous Systems
We propose a principled approach to integrating GPU memory with an OS page cache. We design GAIA, a weakly-consistent page cache that spans CPU and GPU memories. GAIA enables the standard mmap system call to map files into the GPU address space, thereby enabling data-dependent GPU accesses to large files and efficient write-sharing between the CPU and GPUs. Under the hood, GAIA (1) integrates lazy release consistency protocol into the OS page cache while maintaining backward compatibility with CPU processes and unmodified GPU kernels; (2) improves CPU I/O performance by using data cached in GPU memory, and (3) optimizes the readahead prefetcher to support accesses to files cached in GPUs. We prototype GAIA in Linux and evaluate it on NVIDIA Pascal GPUs. We show up to 3× speedup in CPU file I/O and up to 8× in unmodified realistic workloads such as Gunrock GPU-accelerated graph processing, image collage, and microscopy image stitching.

Speakers
Mark Silberstein (Technion – Israel Institute of Technology)


Thursday July 11, 2019 4:35pm - 4:55pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

4:55pm PDT

Transkernel: Bridging Monolithic Kernels to Peripheral Cores
Smart devices see a large number of ephemeral tasks driven by background activities. In order to execute such a task, the OS kernel wakes up the platform beforehand and puts it back to sleep afterwards. In doing so, the kernel operates various IO devices and orchestrates their power state transitions. Such kernel executions are inefficient as they mismatch typical CPU hardware. They are better off running on a low-power, microcontroller-like core, i.e., a peripheral core, relieving the CPU of this inefficiency.

We therefore present a new OS structure, in which a lightweight virtual executor called transkernel offloads specific phases from a monolithic kernel. The transkernel translates stateful kernel execution through cross-ISA, dynamic binary translation (DBT); it emulates a small set of stateless kernel services behind a narrow, stable binary interface; it specializes for hot paths; it exploits ISA similarities for lowering DBT cost.

Through an ARM-based prototype, we demonstrate transkernel’s feasibility and benefit. We show that while cross-ISA DBT is typically used under the assumption of efficiency loss, it can enable efficiency gain, even on off-the-shelf hardware.

Speakers
Liwei Guo (Purdue ECE)
Shuang Zhai (Purdue ECE)
Yi Qiao (Purdue ECE)


Thursday July 11, 2019 4:55pm - 5:15pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

5:15pm PDT

Detecting Asymmetric Application-layer Denial-of-Service Attacks In-Flight with Finelame
Denial of service (DoS) attacks increasingly exploit algorithmic, semantic, or implementation characteristics dormant in victim applications, often with minimal attacker resources. Practical and efficient detection of these asymmetric DoS attacks requires us to (i) catch offending requests in-flight, before they consume a critical amount of resources, (ii) remain agnostic to the application internals, such as the programming language or runtime system, and (iii) introduce low overhead in terms of both performance and programmer effort.

This paper introduces Finelame, a language-independent framework for detecting asymmetric DoS attacks. Finelame leverages operating system visibility across the entire software stack to instrument key resource allocation and negotiation points. It leverages recent advances in the Linux extended Berkeley Packet Filter virtual machine to attach application-level interposition probes to key request processing functions, and lightweight resource monitors---user/kernel-level probes---to key resource allocation functions. The data collected is used to train a model of resource utilization that occurs throughout the lifetime of individual requests. The model parameters are then shared with the resource monitors, which use them to catch offending requests in-flight, inline with resource allocation. We demonstrate that Finelame can be integrated with legacy applications with minimal effort, and that it is able to detect resource abuse attacks much earlier than their intended completion time while posing low performance overheads.
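The detection flow described above—learn a model of per-request resource use at instrumented allocation points, then flag offenders inline—can be sketched with a deliberately simple stand-in model (a mean-plus-k-standard-deviations envelope). Finelame's actual model and probe placement differ; this only illustrates the train-then-monitor split.

```python
# Train on benign requests, then check a request's cumulative resource
# use in-flight at each allocation probe, before it runs to completion.

def train(benign_samples, k=3.0):
    """Learn a per-resource threshold from benign per-request usage."""
    n = len(benign_samples)
    mean = sum(benign_samples) / n
    var = sum((x - mean) ** 2 for x in benign_samples) / n
    return mean + k * var ** 0.5

def monitor(threshold, usage_so_far):
    # Called inline at a resource-allocation point; True means the
    # request has already exceeded the benign envelope.
    return usage_so_far > threshold

cpu_us = [100, 120, 90, 110, 95, 105]   # benign per-request CPU time (made up)
thr = train(cpu_us)
```

An asymmetric DoS request that has burned 10,000 us of CPU is flagged mid-flight, while a 100 us request sails through, which matches the goal of catching abuse before resources are exhausted.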

Speakers
Henri Maxime Demoulin (University of Pennsylvania)
Isaac Pedisich (University of Pennsylvania)
Nikos Vasilakis (University of Pennsylvania)
Vincent Liu (University of Pennsylvania)
Boon Thau Loo (University of Pennsylvania)
Linh Thi Xuan Phan (University of Pennsylvania)


Thursday July 11, 2019 5:15pm - 5:35pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

5:35pm PDT

SemperOS: A Distributed Capability System
Capabilities provide an efficient and secure mechanism for fine-grained resource management and protection. However, as modern hardware architectures continue to evolve with large numbers of non-coherent and heterogeneous cores, we focus on the following research question: can capability systems scale to modern hardware architectures? In this work, we present a scalable capability system to drive future systems with many non-coherent heterogeneous cores. More specifically, we have designed a distributed capability system based on a HW/SW co-designed capability system. We analyzed the pitfalls of distributed capability operations running concurrently and built the protocols in accordance with the insights. We have incorporated these distributed capability management protocols in a new microkernel-based OS called SemperOS. Our OS operates the system by means of multiple microkernels, which employ distributed capabilities to provide an efficient and secure mechanism for fine-grained access to system resources. In the evaluation we investigated the scalability of our algorithms and ran applications (Nginx, LevelDB, SQLite, PostMark, etc.) that depend heavily on the OS services of SemperOS. The results indicate that there is no inherent scalability limitation for capability systems. Our evaluation shows that we achieve a parallel efficiency of 70% to 78% when examining a system with 576 cores executing 512 application instances while using 11% of the system’s cores for OS services.

Speakers
Matthias Hille (Technische Universität Dresden)
Nils Asmussen (Technische Universität Dresden, Germany; Barkhausen Institut, Dresden, Germany)
Pramod Bhatotia (University of Edinburgh)
Hermann Härtig (Technische Universität Dresden, Germany; Barkhausen Institut, Dresden, Germany)


Thursday July 11, 2019 5:35pm - 5:55pm PDT
USENIX ATC Track I: Grand Ballroom I–VI
 
Friday, July 12
 

9:15am PDT

Evaluating File System Reliability on Solid State Drives
As solid state drives (SSDs) are increasingly replacing hard disk drives, the reliability of storage systems depends on the failure modes of SSDs and the ability of the file system layered on top to handle these failure modes. While the classical paper on IRON File Systems provides a thorough study of the failure policies of three file systems common at the time, we argue that 13 years later it is time to revisit file system reliability with SSDs and their reliability characteristics in mind, based on modern file systems that incorporate journaling, copy-on-write and log-structured approaches, and are optimized for flash. This paper presents a detailed study, spanning ext4, Btrfs and F2FS, and covering a number of different SSD error modes. We develop our own fault injection framework and explore over a thousand error cases. Our results indicate that 16% of these cases result in a file system that cannot be mounted or even repaired by its system checker. We also identify the key file system metadata structures that can cause such failures and finally, we recommend some design guidelines for file systems that are deployed on top of SSDs.
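The fault-injection methodology mentioned above can be sketched in miniature: interpose on the block-read path and corrupt selected blocks, then observe how the layer above reacts. The wrapper below is a hypothetical illustration, not the paper's framework.

```python
# Interpose on a block-read function so that selected blocks come back
# with a single-bit error, mimicking an SSD error mode under test.

def make_faulty_reader(read_block, corrupt_blocks, flip_bit=0):
    """Wrap a block-read function; blocks in corrupt_blocks are corrupted."""
    def faulty_read(block_no):
        data = bytearray(read_block(block_no))
        if block_no in corrupt_blocks:
            data[0] ^= (1 << flip_bit)     # flip one bit in the first byte
        return bytes(data)
    return faulty_read

# A tiny "disk": block 7 holds (say) file system metadata.
disk = {7: b"\x00metadata", 8: b"\x00data"}
read = make_faulty_reader(lambda b: disk[b], corrupt_blocks={7})
```

Sweeping `corrupt_blocks` over the blocks holding different metadata structures is what lets a study like this attribute unmountable file systems to specific structures.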

Speakers
Shehbaz Jaffer (University of Toronto)
Stathis Maneas (University of Toronto)
Andy Hwang (University of Toronto)
Bianca Schroeder (University of Toronto)


Friday July 12, 2019 9:15am - 9:35am PDT
USENIX ATC Track I: Grand Ballroom I–VI

9:35am PDT

Alleviating Garbage Collection Interference Through Spatial Separation in All Flash Arrays
We present SWAN, a novel All Flash Array (AFA) management scheme. Recent flash SSDs provide high I/O bandwidth (e.g., 3-10GB/s) so the storage bandwidth can easily surpass the network bandwidth by aggregating a few SSDs. However, it is still challenging to unlock the full performance of SSDs. The main source of performance degradation is garbage collection (GC). We find that existing AFA designs are susceptible to GC at both the SSD level and the AFA software level. In designing SWAN, we aim to alleviate the performance interference caused by GC at both levels. Unlike the commonly used temporal separation approach that performs GC at idle time, we take a spatial separation approach that partitions SSDs into front-end SSDs dedicated to serving write requests and back-end SSDs where GC is performed. Compared to temporal separation of GC and application I/O, which is hard for AFA software to control, our approach guarantees that the storage bandwidth always matches the full network performance without interference from AFA-level GC. Our analytical model confirms this when the sizes of the front-end and back-end SSD groups are properly configured. We provide extensive evaluations that show SWAN is effective for a variety of workloads.
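The spatial-separation policy above can be sketched as a tiny simulator: writes stripe only over the current front-end SSDs, back-end SSDs are left free to run GC, and roles rotate when the front-end fills. The class and parameters are illustrative, not SWAN's implementation.

```python
# Front-end SSDs absorb all writes; back-end SSDs garbage-collect.
# When every front-end SSD is full, the groups swap roles, so
# application writes never share a device with ongoing GC.

class Swan:
    def __init__(self, n_ssds, front=2, capacity=4):
        self.ssds = [[] for _ in range(n_ssds)]
        self.front = list(range(front))           # SSDs serving writes
        self.back = list(range(front, n_ssds))    # SSDs free to run GC
        self.cap = capacity
        self.next = 0

    def write(self, blk):
        tgt = self.front[self.next % len(self.front)]  # stripe over front-end
        self.next += 1
        self.ssds[tgt].append(blk)
        if all(len(self.ssds[i]) >= self.cap for i in self.front):
            k = len(self.front)                   # rotate roles
            self.front, self.back = self.back[:k], self.back[k:] + self.front
        return tgt

sw = Swan(n_ssds=4)
targets = [sw.write(b) for b in range(10)]
```

In the run above, the first eight writes land only on SSDs 0 and 1 while 2 and 3 could be collecting garbage; after rotation the roles reverse.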

Speakers
Jaeho Kim (Virginia Tech)
Kwanghyun Lim (Cornell University)
Changwoo Min (Virginia Tech)


Friday July 12, 2019 9:35am - 9:55am PDT
USENIX ATC Track I: Grand Ballroom I–VI

9:55am PDT

Practical Erase Suspension for Modern Low-latency SSDs
As NAND flash technology continues to scale, flash-based SSDs have become key components in data-center servers. One of the main design goals for data-center SSDs is low read tail latency, which is crucial for interactive online services as a single query can generate thousands of disk accesses. Towards this goal, many prior works have focused on minimizing the effect of garbage collection on read tail latency. Such advances have made the other, less explored source of long read tails, the block erase operation, more important. Prior work on erase suspension addresses this problem by allowing a read operation to interrupt an ongoing erase operation, to minimize its effect on read latency. Unfortunately, the erase suspension technique attempts to suspend/resume an erase pulse at an arbitrary point, which incurs additional hardware cost for NAND peripherals and reduces the lifetime of the device. Furthermore, we demonstrate this technique suffers a write starvation problem, using a real, production-grade SSD. To overcome these limitations, we propose alternative practical erase suspension mechanisms, leveraging the iterative erase mechanism used in modern SSDs, to suspend/resume erase operations at well-aligned safe points. The resulting design achieves a sub-200μs 99.999th percentile read tail latency for a 4KB random I/O workload at queue depth 16 (70% reads and 30% writes). Furthermore, it reduces the read tail latency by about 5× over the baseline for the two data-center workloads that we evaluated.
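The safe-point idea above can be made concrete with a toy latency model: an erase is a sequence of short pulses, and a read arriving mid-erase waits only for the pulse in flight to finish before the erase is suspended. All timing parameters below are invented for illustration.

```python
# Erase = erase_pulses pulses of pulse_us each. With pulse-boundary
# suspension, a read waits at most one pulse; without it, the read
# waits for the entire remaining erase.

def service(erase_pulses, pulse_us, read_us, read_arrival_us):
    """Read latency when the erase can suspend at pulse boundaries."""
    if read_arrival_us >= erase_pulses * pulse_us:
        return read_us                      # erase already finished
    wait = pulse_us - (read_arrival_us % pulse_us)  # finish current pulse
    return wait + read_us

def service_no_suspend(erase_pulses, pulse_us, read_us, read_arrival_us):
    """Read latency when the erase cannot be interrupted at all."""
    remaining = erase_pulses * pulse_us - read_arrival_us
    return max(remaining, 0) + read_us
```

With a 10-pulse, 100 us-per-pulse erase and an 80 us read arriving at t=250 us, pulse-boundary suspension bounds the added wait to 50 us, versus 750 us without suspension; bounding the wait by one pulse is what pulls in the tail.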

Speakers
Shine Kim (Seoul National University and Samsung Electronics)
Jonghyun Bae (Seoul National University)
Hakbeom Jang (Sungkyunkwan University)
Wenjing Jin (Seoul National University)
Jeonghun Gong (Seoul National University)
Seungyeon Lee (Samsung Electronics)
Tae Jun Ham (Seoul National University)
Jae W. Lee (Seoul National University)


Friday July 12, 2019 9:55am - 10:15am PDT
USENIX ATC Track I: Grand Ballroom I–VI

10:15am PDT

Track-based Translation Layers for Interlaced Magnetic Recording
Interlaced magnetic recording (IMR) is a state-of-the-art recording technology for hard drives that makes use of heat-assisted magnetic recording (HAMR) and track overlap to offer higher capacity than conventional and shingled magnetic recording (CMR and SMR). It carries a set of write constraints that differ from those in SMR: “bottom” (e.g. even-numbered) tracks cannot be written without data loss on the adjoining “top” (e.g. odd-numbered) ones. Previously described algorithms for writing arbitrary (i.e. bottom) sectors on IMR are in some cases poorly characterized, and are either slow or require more memory than is available within the constrained disk controller environment.

We provide the first accurate performance analysis of the simple read-modify-write (RMW) approach to IMR bottom track writes, noting several inaccuracies in earlier descriptions of its performance, and evaluate it for latency, throughput and I/O amplification on real-world traces. In addition we propose three novel memory-efficient, track-based translation layers for IMR—track flipping, selective track caching and dynamic track mapping, which reduce bottom track writes by moving hot data to top tracks and cold data to bottom ones in different ways. We again provide a detailed performance analysis using simulations based on real-world traces.

We find that RMW performance is poor on most traces and worse on others. The proposed approaches perform much better, especially dynamic track mapping, with low write amplification and latency comparable to CMR for many traces.
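The bottom-track write constraint behind the RMW analysis can be sketched as follows (simplified geometry: even-numbered tracks are "bottom", odd-numbered are "top"; real drives and the paper's translation layers are considerably more involved). Writing a bottom track forces the adjoining top tracks to be read out and rewritten, which is exactly the amplification the proposed translation layers try to avoid by keeping hot data on top tracks.

```python
# Read-modify-write for an IMR bottom-track write: adjacent top tracks
# are overwritten as a side effect, so save them first and restore after.

def write_track(tracks, t, data):
    """tracks: per-track contents; even indices model bottom tracks."""
    ops = []
    if t % 2 == 0:                                  # bottom track
        neighbors = [n for n in (t - 1, t + 1) if 0 <= n < len(tracks)]
        saved = {n: tracks[n] for n in neighbors}
        ops += ["read %d" % n for n in neighbors]
        tracks[t] = data
        ops.append("write %d" % t)
        for n in neighbors:                         # restore top tracks
            tracks[n] = saved[n]
            ops.append("write %d" % n)
    else:                                           # top track: plain write
        tracks[t] = data
        ops.append("write %d" % t)
    return ops

tracks = ["a", "b", "c", "d"]
ops = write_track(tracks, 2, "C")    # one logical write, five media ops
```

One logical bottom-track write costs two reads and three writes here, while a top-track write costs one; shifting hot data to top tracks (as track flipping and dynamic mapping do) attacks exactly this asymmetry.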

Speakers
Mohammad Hossein Hajkazemi (Northeastern University)
Ajay Narayan Kulkarni (Seagate Technology)
Peter Desnoyers (Northeastern University)
Timothy R. Feldman (Seagate Technology)


Friday July 12, 2019 10:15am - 10:35am PDT
USENIX ATC Track I: Grand Ballroom I–VI

11:05am PDT

Pangolin: A Fault-Tolerant Persistent Memory Programming Library
Non-volatile main memory (NVMM) allows programmers to build complex, persistent, pointer-based data structures that can offer substantial performance gains over conventional approaches to managing persistent state. This programming model removes the file system from the critical path which improves performance, but it also places these data structures out of reach of file system-based fault tolerance mechanisms (e.g., block-based checksums or erasure coding). Without fault-tolerance, using NVMM to hold critical data will be much less attractive.

This paper presents Pangolin, a fault-tolerant persistent object library designed for NVMM. Pangolin uses a combination of checksums, parity, and micro-buffering to protect an application's objects from both media errors and corruption due to software bugs. It provides these protections for objects of any size and supports automatic, online detection of data corruption and recovery. The required storage overhead is small (1% for gigabyte-sized pools of NVMM). Pangolin provides stronger protection, requires orders of magnitude less storage overhead, and achieves comparable performance relative to the current state-of-the-art fault-tolerant persistent object library.
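The checksum-plus-parity combination above can be illustrated in miniature: each object row carries a checksum for detection, and one parity row (the XOR of all rows) suffices to rebuild any single corrupted row. The layout, sizes, and use of CRC32 here are invented for illustration; Pangolin's actual formats differ.

```python
import zlib

def parity(rows):
    """XOR of equally sized rows: the single parity row."""
    p = bytearray(len(rows[0]))
    for r in rows:
        for i, b in enumerate(r):
            p[i] ^= b
    return bytes(p)

def repair(rows, checksums, par):
    """Detect the (single) corrupted row by checksum; rebuild it by
    XOR-ing the parity row with every other data row."""
    for i, r in enumerate(rows):
        if zlib.crc32(r) != checksums[i]:
            rebuilt = bytearray(par)
            for j, other in enumerate(rows):
                if j != i:
                    for k, b in enumerate(other):
                        rebuilt[k] ^= b
            rows[i] = bytes(rebuilt)
            return i
    return None                     # nothing corrupted

rows = [b"obj-aaaa", b"obj-bbbb", b"obj-cccc"]
sums = [zlib.crc32(r) for r in rows]
par = parity(rows)
rows[1] = b"obj-bXbb"               # simulate a media error / stray write
fixed = repair(rows, sums, par)
```

The storage overhead is one parity row regardless of how many data rows it covers, which is the intuition behind Pangolin's ~1% overhead claim for large pools.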

Speakers
Lu Zhang (UC San Diego)
Steven Swanson (UC San Diego)


Friday July 12, 2019 11:05am - 11:25am PDT
USENIX ATC Track I: Grand Ballroom I–VI

11:25am PDT

Pisces: A Scalable and Efficient Persistent Transactional Memory
The persistent transactional memory (PTM) programming model has recently been exploited to provide crash-consistent transactional interfaces that ease programming atop NVM. However, existing PTM designs either incur high reader-side overhead due to blocking or long delays on the writer side (efficiency), or place excessive constraints on persistence ordering (scalability). This paper presents Pisces, a read-friendly PTM that exploits snapshot isolation (SI) on NVM. The key design of Pisces is based on two observations: the redo logs of transactions can be reused as newer versions of the data, and an intuitive MVCC-based design penalizes reads. Based on these observations, we propose a dual-version concurrency control (DVCC) protocol that maintains up to two versions in the NVM-backed storage hierarchy. Together with a three-stage commit protocol, Pisces ensures SI and allows more transactions to commit and persist simultaneously. Most importantly, it provides a desirable feature: hiding the NVM persistence overhead from reads and allowing nearly non-blocking reads. Experimental evaluation on an Intel 40-thread (20-core) machine equipped with real NVM shows that Pisces outperforms the state-of-the-art design (i.e., DUDETM) by up to 6.3× for micro-benchmarks and 4.6× for the TPC-C new-order transaction, and also scales much better. The persistency cost is from 19% to 50% for 40 threads.
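The dual-version idea can be sketched with a toy snapshot-isolation store: each object keeps at most two versions, a writer installs a new version at a commit timestamp, and readers pick the newest version no later than their snapshot, so they never wait for writers to persist. This greatly simplifies Pisces (no redo logs, no three-stage commit); the class below is a stand-in to show why reads stay non-blocking.

```python
# Toy dual-version concurrency control (DVCC) under snapshot isolation.

class DVCC:
    def __init__(self):
        self.store = {}        # key -> list of (commit_ts, value), len <= 2
        self.clock = 0

    def begin(self):
        return self.clock      # reader's snapshot timestamp

    def read(self, snap_ts, key):
        versions = self.store.get(key, [])
        visible = [v for ts, v in versions if ts <= snap_ts]
        return visible[-1] if visible else None

    def commit(self, key, value):
        self.clock += 1
        versions = self.store.setdefault(key, [])
        versions.append((self.clock, value))
        if len(versions) > 2:   # older version reclaimed once readers drain
            versions.pop(0)
        return self.clock

db = DVCC()
db.commit("x", 1)
snap = db.begin()              # snapshot taken before the next write
db.commit("x", 2)              # concurrent writer installs a new version
old = db.read(snap, "x")       # reader still sees its snapshot, no blocking
new = db.read(db.begin(), "x")
```

Keeping exactly two versions bounds space while still letting a writer commit underneath active readers, which is the property Pisces exploits to hide persistence latency from the read path.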

Speakers
Jinyu Gu (Shanghai Jiao Tong University)
Qianqian Yu (Shanghai Jiao Tong University)
Xiayang Wang (Shanghai Jiao Tong University)
Zhaoguo Wang (Shanghai Jiao Tong University)
Binyu Zang (Shanghai Jiao Tong University)
Haibing Guan (Shanghai Jiao Tong University)
Haibo Chen (Shanghai Jiao Tong University / Huawei Technologies Co., Ltd.)


Friday July 12, 2019 11:25am - 11:45am PDT
USENIX ATC Track I: Grand Ballroom I–VI

11:50am PDT

Lessons and Actions: What We Learned from 10K SSD-Related Storage System Failures
Modern datacenters increasingly use flash-based solid state drives (SSDs) for high performance and low energy cost. However, SSDs introduce more complex failure modes compared to traditional hard disks. While great efforts have been made to understand the reliability of SSDs themselves, it remains unclear what types of system-level failures are related to SSDs, what their root causes are, and how the rest of the system interacts with SSDs and contributes to failures. Answering these questions can help practitioners build and maintain highly reliable SSD-based storage systems.

In this paper, we study the reliability of SSD-based storage systems deployed in Alibaba Cloud, covering nearly half a million SSDs and spanning over three years of usage under representative cloud services. We take a holistic view and analyze both device errors and system failures to better understand the potential causal relations. In particular, we focus on failures that are Reported As "SSD-Related" (RASR) by system status monitoring daemons. Through log analysis, field studies, and validation experiments, we identify the characteristics of RASR failures in terms of their distribution, symptoms, and correlations. Moreover, we derive a number of major lessons and a set of effective methods to address the issues observed. We believe that our study and experience will be beneficial to the community and can facilitate building highly reliable SSD-based storage systems.

Speakers
Erci Xu (Ohio State University)
Mai Zheng (Iowa State University)
Feng Qin (Ohio State University)
Yikang Xu (Alibaba Group)
Jiesheng Wu (Alibaba Group)


Friday July 12, 2019 11:50am - 12:10pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

12:10pm PDT

Who's Afraid of Uncorrectable Bit Errors? Online Recovery of Flash Errors with Distributed Redundancy
Due to its high performance and decreasing cost per bit, flash storage is the main storage medium in datacenters for hot data. However, flash endurance is a perpetual problem, and due to technology trends, subsequent generations of flash devices exhibit progressively shorter lifetimes before they experience uncorrectable bit errors. In this paper, we propose addressing the flash lifetime problem by allowing devices to expose higher bit error rates. We present DIRECT, a set of techniques that harnesses distributed-level redundancy to enable the adoption of new generations of denser and less reliable flash storage technologies. DIRECT does so by using an end-to-end approach to increase the reliability of distributed storage systems.

We implemented DIRECT on two real-world storage systems: ZippyDB, a distributed key-value store in production at Facebook and backed by RocksDB, and HDFS, a distributed file system. When tested on production traces at Facebook, DIRECT reduces application-visible error rates in ZippyDB by more than 100x and recovery time by more than 10,000x. DIRECT also allows HDFS to tolerate a 10,000--100,000x higher bit error rate without experiencing application-visible errors. By significantly increasing the availability and durability of distributed storage systems in the face of bit errors, DIRECT helps extend flash lifetimes.
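The end-to-end recovery idea behind the results above can be sketched simply: when a local read fails its checksum, fetch the data from another replica online (and repair the bad copy) instead of treating the device as failed. The function and layout below are a hypothetical illustration, not DIRECT's actual protocol.

```python
import zlib

def read_with_recovery(replicas, key):
    """replicas: ordered list of dicts mapping key -> (payload, crc).
    Try each replica in turn; return the first checksum-clean copy and
    repair any earlier corrupt copies with it."""
    for i, rep in enumerate(replicas):
        payload, crc = rep[key]
        if zlib.crc32(payload) == crc:
            for j in range(i):                 # online repair of bad copies
                replicas[j][key] = (payload, crc)
            return payload
    raise IOError("all replicas corrupt for %r" % key)

good = b"value"
crc = zlib.crc32(good)
replicas = [{"k": (b"v@lue", crc)},            # local copy hit a bit error
            {"k": (good, crc)}]
val = read_with_recovery(replicas, "k")
```

Because an uncorrectable bit error now costs one remote read rather than a device replacement and full re-replication, the tolerable device-level bit error rate rises by orders of magnitude.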

Speakers
Amy Tai (Princeton University and VMware Research)
Kyle Jamieson (Princeton University)
Michael J. Freedman (Princeton University)
Asaf Cidon (Columbia University)


Friday July 12, 2019 12:10pm - 12:30pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

12:30pm PDT

Dayu: Fast and Low-interference Data Recovery in Very-large Storage Systems
This paper investigates I/O and failure traces from a real-world large-scale storage system: it finds that because of the scale of the system and its imbalanced, dynamic foreground traffic, no existing recovery protocol can compute a high-quality re-replication strategy in a short time. To address this problem, this paper proposes Dayu, a timeslot-based recovery architecture. For each timeslot, Dayu only schedules a subset of tasks which are expected to finish within that timeslot: this approach reduces the computation overhead and naturally copes with dynamic foreground traffic. In each timeslot, Dayu incorporates a greedy algorithm with convex hull optimization to achieve both high speed and high quality. Our evaluation in a 1,000-node cluster and in a 3,500-node simulation both confirm that Dayu can outperform existing recovery protocols, achieving high speed and high quality.
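The per-timeslot scheduling idea can be sketched with a simple greedy stand-in (Dayu's actual algorithm adds convex hull optimization and richer constraints): in each slot, admit only the re-replication tasks that fit under each node's spare bandwidth for that slot, most urgent chunks first.

```python
# Schedule one timeslot of re-replication. Task and bandwidth numbers
# are invented; urgency = number of surviving replicas (fewer is worse).

def schedule_slot(tasks, spare_bw):
    """tasks: (chunk, src_node, size_mb, replicas_left);
    spare_bw: node -> spare MB this node can move in one slot."""
    chosen = []
    bw = dict(spare_bw)
    for chunk, src, size, left in sorted(tasks, key=lambda t: t[3]):
        if size <= bw.get(src, 0):      # fits in this slot on its source
            bw[src] -= size
            chosen.append(chunk)
    return chosen                       # remaining tasks wait for a later slot

tasks = [("c1", "n1", 60, 1),   # one replica left: most urgent
         ("c2", "n1", 60, 2),
         ("c3", "n2", 40, 2)]
slot1 = schedule_slot(tasks, {"n1": 100, "n2": 50})
```

Only scheduling what fits in the current slot keeps the per-slot computation small and lets the next slot react to whatever the foreground traffic has become, which is the architectural point of the timeslot design.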

Speakers
Zhufan Wang (Tsinghua University)
Guangyan Zhang (Tsinghua University)
Yang Wang (The Ohio State University)
Qinglin Yang (Tsinghua University)
Jiaji Zhu (Alibaba Cloud)


Friday July 12, 2019 12:30pm - 12:50pm PDT
USENIX ATC Track I: Grand Ballroom I–VI

12:50pm PDT

OPTR: Order-Preserving Translation and Recovery Design for SSDs with a Standard Block Device Interface
Consumer-grade solid-state drives (SSDs) guarantee very few things upon a crash. Lacking a strong disk-level crash guarantee forces programmers to equip applications and filesystems with safety nets using redundant writes and flushes, which in turn degrade the overall system performance. Although some prior works propose transactional SSDs with revolutionized disk interfaces to offer strong crash guarantees, adopting transactional SSDs inevitably incurs dramatic software stack changes. Therefore, most consumer-grade SSDs still keep using the standard block device interface.

This paper addresses the above issues by dispelling the impression that stronger crash guarantees from SSDs are available only at the cost of altering the standard block device interface. We propose Order-Preserving Translation and Recovery (OPTR), a collection of novel flash translation layer (FTL) and crash recovery techniques that are realized internal to block-interface SSDs to endow the SSDs with strong request-level crash guarantees, defined as follows: 1) a write request is not made durable unless all its prior write requests are durable; 2) each write request is atomic; 3) all write requests prior to a flush are guaranteed durable. We have realized OPTR in real SSD hardware and optimized applications and filesystems (SQLite and Ext4) to demonstrate OPTR's benefits. Experimental results show 1.27× (only Ext4 optimized), 2.85× (both Ext4 and SQLite optimized), and 6.03× (an OPTR-enabled no-barrier mode) performance improvements.
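The three request-level guarantees above can be captured as a small crash model (an illustrative abstraction, not OPTR's mechanism): after a crash, the durable state is always a prefix of the submitted write sequence, and that prefix can never be shorter than the last flush point.

```python
# Model of order-preserving durability: a crash may lose a suffix of
# recent writes, but never makes write N durable while an earlier
# write is lost, and a flush pins everything submitted before it.

class OptrModel:
    def __init__(self):
        self.writes = []        # submitted write requests, in order
        self.flushed = 0        # index before which durability is forced

    def write(self, data):
        self.writes.append(data)

    def flush(self):
        self.flushed = len(self.writes)

    def crash(self, survived):
        """Crash with `survived` writes durable. Legal outcomes are
        exactly the prefixes no shorter than the last flush point."""
        assert self.flushed <= survived <= len(self.writes)
        return self.writes[:survived]

d = OptrModel()
d.write("A"); d.write("B"); d.flush(); d.write("C")
state = d.crash(survived=2)     # "C" may be lost; "A" and "B" may not
```

Because applications can rely on the prefix property, they can drop many of the redundant journal writes and flushes they currently issue as safety nets, which is where the reported speedups come from.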

Speakers
Yun-Sheng Chang (National Tsing Hua University)
Ren-Shuo Liu (National Tsing Hua University)


Friday July 12, 2019 12:50pm - 1:10pm PDT
USENIX ATC Track I: Grand Ballroom I–VI