Recently proposed key-value SSD (KVSSD) provides the popular and versatile key-value interface at the device level, promising high performance and simplified storage management with the minimal involvement of the host software. However, its I/O command set over NVMe is defined on a per key-value pair basis, enforcing the host to post key-value operations to KVSSD independently. This not only incurs high interfacing overhead for small key-value operations but also makes it subtle to support transactions in KVSSDs without a software support.
In this paper, we propose compound commands for KVSSDs. The compound command allows the host to specify multiple key-value pairs in a single NVMe operation, thereby effectively amortizing I/O interfacing overhead. In addition, it provides an effective way for defining a transaction comprised of multiple key-value pairs. Our evaluation using a prototype KVSSD and an in-house KVSSD emulator shows promising benefits of the compound command, with improving the performance by up to 55%.
In recent years, our society is being plagued by unprecedented levels of privacy and security breaches. To rein in this trend, the European Union, in 2018, introduced a comprehensive legislation called the General Data Protection Regulation (GDPR). In this paper, we review GDPR from a system design perspective, and identify how its regulations conflict with the design, architecture, and operation of modern systems. We illustrate these conflicts via the seven GDPR sins: storing data forever; reusing data indiscriminately; walled gardens and black markets; risk-agnostic data processing; hiding data breaches; making unexplainable decisions; treating security as a secondary goal. Our findings reveal a deep-rooted tussle between GDPR requirements and how modern systems have evolved. We believe that achieving compliance requires comprehensive, grounds up solutions, and anything short would amount to fixing a leaky faucet in a sinking ship.
Network covert channels are an advanced threat to the security and privacy of cloud systems. One common limitation of existing defenses is that they all come at the cost of performance. This presents significant barriers to their practical deployment in high-speed networks. We sketch the design of NetWarden, a novel defense whose key design goal is to preserve TCP performance while mitigating covert channels. The use of programmable data planes makes it possible for NetWarden to adapt defenses that were only demonstrated before as proof of concept, and apply them at linespeed. Moreover, NetWarden uses a set of performance boosting techniques to temporarily increase the performance of connections that have been affected by channel mitigation, with the ultimate goal of neutralizing its impact on performance. Our simulation provides initial evidence that NetWarden can mitigate several covert channels with little performance disturbance. As ongoing work, we are working on a full system design and implementation of NetWarden.
The emergence of Hybrid Shingled Magnetic Recording (H-SMR) allows dynamic conversion of the recording format between Conventional Magnetic Recording (CMR) and SMR on a single disk drive. H-SMR is promising for its ability to manage the performance/capacity trade-off on the disk platters and to adaptively support different application scenarios in large-scale storage systems. However, there is little research on how to efficiently manage data and space in such H-SMR drives.
In this paper, we present ZoneAlloy, an elastic data and space management scheme for H-SMR drives, to explore the benefit of using such drives. ZoneAlloy initially allocates CMR space for the application and then gradually converts the disk format from CMR to SMR to create more space for the application. ZoneAlloy controls the overhead of the format conversion on the application I/O with our quantized migration mechanism. When data is stored in an SMR area, ZoneAlloy reduces the SMR update overhead using H-Buffer and Zone-Swap. H-Buffer is a small host-controlled CMR space that absorbs the SMR updates and migrates those updates back to the SMR space in batches to bring down the SMR update cost. Zone-Swap dynamically swaps ``hot'' data from the SMR space to the CMR space to further alleviate the SMR update problem. Evaluation results based on MSR-Cambridge traces demonstrate that ZoneAlloy can reduce the average I/O latency and limit the performance degradation of the application I/O during format conversion.
One-sided network communication technologies such as RDMA and NVMe-over-Fabrics are quickly gaining adoption in production software and in datacenters. Although appealing for their low CPU utilization and good performance, they raise new security concerns that could seriously undermine datacenter software systems building on top of them. At the same time, they offer unique opportunities to help enhance security. Indeed, one-sided network communication is a double-edged sword in security. This paper presents our insights into security implications and opportunities of one-sided communication.
Fueled by IoT botnets and DDoS-as-a-Service tools, distributed denial of service (DDoS) attacks have reached record high volumes. Although there exist DDoS protection services, they can be costly for small organizations as well as individual users. In this paper, we present a low-cost DDoS solution, DynaShield, which a user can deploy at common cloud service providers. DynaShield employs three techniques to reduce cost. First, it uses an on-demand model. A server dynamically updates its DNS record to redirect clients’ traffic to DynaShield when it is under attack, avoiding paying for cloud services during peacetime. Second, DynaShield combines serverless functions and elastic servers provided by cloud providers to auto-scale to large attacks without overprovisioning. Third, DynaShield uses cryptocurrency puzzles as proof of work. The coin mining profit can further offset a protected server’s cloud service charges. Our preliminary evaluation suggests that DynaShield can cost as little as a few dollars per month to prevent an organization from common DDoS attacks.
New non-volatile memory technologies offer unprecedented performance levels for persistent storage. However, to exploit their full potential, a deeper performance characterization of such devices is required. In this paper, we analyze a NVM-based block device -- the Intel Optane SSD -- and formalize an "unwritten contract'' of the Optane SSD. We show that violating this contract can result in 11x worse read latency and limited throughput (only 20% of peak bandwidth) regardless of parallelism. We present that this contract is relevant to features of 3D XPoint memory and Intel Optane SSD's controller/interconnect design. Finally, we discuss the implications of the contract.
Remzi Arpaci-Dusseau is the Grace Wahba professor of Computer Sciences at UW-Madison. He co-leads a research group with Professor Andrea Arpaci-Dusseau. Together, they have graduated 24 Ph.D. students and won numerous best-paper awards; many of their innovations are used by commercial... Read More →
Despite the growing collection and use of private data in the cloud, there remains a fundamental disconnect between unified data governance and the storage system enforcement techniques. On one side, high-level governance policies derived from regulations like General Data Protection Regulation (GDPR) have emerged with stricter rules dictating who, when and how data can be processed. On the other side, storage-level controls, both role- or attribute-based, continue to focus on access/deny enforcement. In this paper, we propose how to bridge this gap. We introduce Deep Enforcement, a system that provides unified governance and transformation policies coupled with data transformations embedded into the storage fabric to achieve policy compliance. Data transformations can vary in complexity, from simple redactions to complex differential privacy-based techniques to provide the required amount of anonymization. We show how this architecture can be implemented into two broad classes of data storage systems in the cloud: object storages and SQL databases. Depending on the complexity of the transformation, we also demonstrate how to implement them either in-line (on data access) or off-line (creating an alternate cached dataset).
In-network processing, where data is processed by special-purpose devices as it passes over the network, is showing great promise at improving application performance, in particular for data analytics tasks. However, analytics and in-network processing are not yet integrated and widely deployed. This paper presents a vision for providing in-network processing as a service to data analytics frameworks, and outlines benefits, remaining challenges, and our current research directions towards realizing this vision.
The recently introduced General Data Protection Regulation (GDPR) is forcing several companies to make significant changes to their systems to achieve compliance. Motivated by the finding that more than 30% of GDPR articles are related to storage, we investigate the impact of GDPR compliance on storage systems. We illustrate the challenges of retrofitting existing systems into compliance by modifying GDPR-compliant Redis. We show that despite needing to introduce a small set of new features, a strict real-time compliance (e.g., logging every user request synchronously) lowers Redis’ throughput by 20x. Our work reveals how GDPR allows compliance to be a spectrum, and what its implications are for system designers. We discuss the technical challenges that need to be solved before strict compliance can be efficiently achieved.
Network links and server CPUs are heavily contended resources in modern datacenters. To keep tail latencies low, datacenter operators drastically overprovision both types of resources today, and there has been significant research into effectively managing network traffic and CPU load. However, this work typically looks at the two resources in isolation.
In this paper, we make the observation that, in the datacenter, the allocation of network and CPU resources should be co-designed for the most efficiency and the best response times. For example, while congestion control protocols can prioritize traffic from certain flows, this provides no benefit if the traffic arrives at an overloaded server that will only queue the request.
This paper explores the potential benefits of such a co-designed resource allocator and considers the recent work in both CPU scheduling and congestion control that is best suited to such a system. We propose a Chimera, a new datacenter OS that integrates a receiver-based congestion control protocol with OS insight into application queues, using the recent Shenango operating system.
Highly available and high-performance message logging system is critical building block for various use cases that require global ordering, especially for deterministic distributed transactions. To achieve availability, we maintain multiple replicas that have the same payloads in exactly the same order. This introduces various challenging issues such as consistency between replicas after failure, while minimizing performance degradation. Replicated state machine-based consensus protocols are the most suitable candidates to fulfill those requirements, but double-write problem and different logging granularity make it hard to keep the system efficient. This paper suggests a novel way to build a replicated log store on top of Raft consensus protocol, aiming at providing the same level of consistency as well as fault-tolerance without sacrificing the throughput of the system.
Storage researchers have always been interested in understanding the complex behavior of storage systems with the help of statistics, machine learning, and simple visualization techniques. However, when a system's behavior is affected by hundreds or even thousands of factors, existing approaches break down. Results are often difficult to interpret, and it can be challenging for humans to apply domain knowledge to a complex system. We propose to enhance storage system analysis by applying "interactive visual analytics," which can address the aforementioned limitations. We have devised a suitable Interactive Configuration Explorer (ICE), and conducted several case studies on a typical storage system, to demonstrate its benefits for storage system researchers and designers. We found that ICE makes it easy to explore a large parameter space, identify critical parameters, and quickly zero in on optimal parameter settings.
Overlay networks are the de facto networking technique for providing flexible, customized connectivity among distributed containers in the cloud. However, overlay networks also incur non-trivial overhead due to its complexity, resulting in significant network performance degradation of containers. In this paper, we perform a comprehensive empirical performance study of container overlay networks which identifies unrevealed, important parallelization bottlenecks of the kernel network stack that prevent container overlay networks from scaling. Our observations and root cause analysis cast light on optimizing the network stack of modern operating systems on multi-core systems to more efficiently support container overlay networks in light of high-speed network devices.
Cloud management tasks such as performance diagnosis, workload placement, and power management depend critically on estimating the utilization of an application. But, it is challenging to measure actual utilization for polled IO network functions (NFs) without code instrumentation. We ask if CPU events (e.g., data cache misses) measured using hardware performance counters are good at estimating utilization for polled-IO NFs. We find a strong correlation between several CPU events and NF utilization for three representative types of network functions. Inspired by this finding, we explore the possibility of computing a universal estimation function that maps selected CPU events to NF utilization estimates for a wide-range of NFs, traffic profiles and traffic loads. Our NF-specific estimators and universal estimators achieve absolute estimation errors below 6% and 10% respectively.
We present Fair-EDF, a framework for latency guarantees in shared storage servers. It provides fairness control while supporting latency guarantees. Fair-EDF extends the pure earliest deadline first (EDF) scheduler by adding a controller to shape the workloads. Under overload it selects a minimal number of requests to drop and to choose the dropped requests in a fair manner. The evaluation results show Fair-EDF provides steady fairness control among a set of clients with different runtime behaviors.
Deep learning is a popular technique for building inference models and classifiers from large quantities of input data for applications in many domains. With the proliferation of edge devices such as sensor and mobile devices, large volumes of data are generated at rapid pace all over the world. Migrating large amounts of data into centralized data center(s) over WAN environments is often infeasible due to cost, performance or privacy reasons. Moreover, there is an increasing need for incremental or online deep learning over newly generated data in real-time. These trends require rethinking of the traditional training approach to deep learning. To handle the computation on distributed input data, micro-clouds, small-scale clouds deployed near edge devices in many different locations, provide an attractive alternative for data locality reasons. However, existing distributed deep learning systems do not support training in micro-clouds, due to the unique characteristics and challenges in this environment. In this paper, we examine the key challenges of deep learning in micro-clouds: computation and network resource heterogeneity at inter- and intra micro-cloud levels and their scale. We present DLion, a decentralized distributed deep learning system for such environments. It employs techniques specifically designed to address the above challenges to reduce training time, enhance model accuracy, and provide system scalability. We have implemented a prototype of DLion in TensorFlow and our preliminary experiments show promising results towards achieving accurate and efficient distributed deep learning in micro-clouds.
Current Linux memory management algorithms have been applied for many years. Android inherits Linux kernel, and thus the memory management algorithms of Linux are transplanted to Android smartphones. To evaluate the efficiency of the memory management algorithms of Android, page re-fault is applied as the target metric in this paper. Through carefully designed experiments, this paper shows that current memory management algorithms are not working well on Android smartphones. For example, page re-fault is up to 37% when running a set of popular apps, which means a large proportion of pages evicted by the existing memory management algorithms are accessed again in the near future. Furthermore, the causes of the high page re-fault ratio are analyzed. Based on the analysis, a tradeoff between the reclaim size and the overall performance is uncovered. By exploiting this tradeoff, a preliminary idea is proposed to improve the performance of Android smartphones.
Training machine learning models involves iteratively fetching and pre-processing batches of data. Conventionally, popular ML frameworks implement data loading within a job and focus on improving the performance of a single job. However, such an approach is inefficient in shared clusters where multiple training jobs are likely to be accessing the same data and duplicating operations. To illustrate this, we present a case study which reveals that for hyper-parameter tuning experiments we can reduce up to 89% I/O and 97% pre-processing redundancy.
Based on this observation, we make the case for unifying data loading in machine learning clusters by bringing the isolated data loading systems together into a single system. Such a system architecture can remove the aforementioned redundancies that arise due to the isolation of data loading in each job. We introduce OneAccess, a unified data access layer and present a prototype implementation that shows a 47.3% improvement in I/O cost when sharing data across jobs. Finally we discuss open research challenges in designing and developing a unified data loading layer that can run across frameworks on shared multi-tenant clusters, including how to handle distributed data access, support diverse sampling schemes, and exploit new storage media.
Over the last few years, Deep Neural Networks (DNNs) have become ubiquitous owing to their high accuracy on real-world tasks. However, this increase in accuracy comes at the cost of computationally expensive models leading to higher prediction latencies. Prior efforts to reduce this latency such as quantization, model distillation, and any-time prediction models typically trade-off accuracy for performance. In this work, we observe that caching intermediate layer outputs can help us avoid running all the layers of a DNN for a sizeable fraction of inference requests. We find that this can potentially reduce the number of effective layers by half for 91.58% of CIFAR-10 requests run on ResNet-18. We present Freeze Inference, a system that introduces approximate caching at each intermediate layer and we discuss techniques to reduce the cache size and improve the cache hit rate. Finally, we discuss some of the open research challenges in realizing such a design.
Many distributed systems/databases rely on Paxos for providing linearizable reads. Linearizable reads in Paxos are achieved either through running a full read round with followers, or via reading from a stable leader which holds leases on followers. We introduce a third method for performing linearizable reads by eschewing the leader and only reading from a quorum of followers. For guaranteeing linearizability, a bare quorum read is insufficient and it needs to be amended with a rinse phase to account for pending update operations. We present our Paxos Quorum Read (PQR) protocol that implements this. Our evaluations show that PQR significantly improves throughput compared to the other methods. The evaluations also show that PQR achieves comparable latency to the read from stable Paxos leader optimization.
Sharing GPUs in the cloud is cost effective and can facilitate the adoption of hardware accelerator enabled cloud. Butsharing causes interference between co-located VMs andleads to performance degradation. In this paper, we proposedan interference-aware VM scheduler at the cluster level withthe goal of minimizing interference. NVIDIA vGPU pro-vides sharing capability and high performance, but it has unique performance characteristics, which have not been studied thoroughly before. Our study reveals several key ob-servations. We leverage our observations to construct modelsbased on machine learning techniques to predict interferencebetween co-located VMs on the same GPU. We proposed a system architecture leveraging our models to schedule VMs to minimize the interference. The experiments show that our observations improves the model accuracy (by 15% ̃ 40%) and the scheduler reduces application run-time overhead by 24.2% in simulated scenarios.
Cloud providers and their tenants have a mutual interest in identifying optimal configurations in which to run tenant jobs, i.e., ones that achieve tenants' performance goals at minimum cost; or ones that maximize performance within a specified budget. However, different tenants may have different performance goals that are opaque to the provider. A consequence of this opacity is that providers today typically offer fixed bundles of cloud resources, which tenants must themselves explore and choose from. This is burdensome for tenants and can lead to choices that are sub-optimal for both parties.
We thus explore a simple, minimal interface, which lets tenants communicate their happiness with cloud infrastructure to the provider, and enables the provider to explore resource configurations that maximize this happiness. Our early results indicate that this interface could strike a good balance between enabling efficient discovery of application resource needs and the complexity of communicating a full description of tenant utility from different configurations to the provider.
Designing key-value stores based on log-structured merge-tree (LSM-tree) encounters a well-known trade-off between the I/O cost of update and that of lookup as well as of space usage. It is generally believed that they cannot be improved at the same time; reducing update cost will increase lookup cost and space usage, and vice versa. Recent works have been addressing this issue, but they focus on probabilistic approaches or reducing amortized cost only, which may not be helpful for tail latency that is critical to server applications. This paper suggests a novel approach that transplants copy-on-write B+-tree into LSM-tree, aiming at reducing update cost without sacrificing lookup cost. In addition to that, our scheme provides a simple and practical way to adjust the index between update-optimized form and space-optimized form. The evaluation results show that it significantly reduces update cost with consistent lookup cost.
We analyze many facets of the performance of gVisor, a new security-oriented container engine that integrates with Docker and backs Google’s serverless platform. We explore the effect gVisor’s in-Sentry network stack has on network throughput as well as the overheads of performing all file opens via gVisor’s Gofer service. We further analyze gVisor startup performance, memory efficiency, and system-call overheads. Our findings have implications for the future design of similar hypervisor- based container engines.
Computational storage has remained an elusive goal. Though minimizing data movement by placing computation close to storage has quantifiable benefits, many of the previous attempts failed to take root in industry. They either require a departure from the widespread block protocol to one that is more computationally-friendly (e.g., file, object, or key-value), or they introduce significant complexity (state) on top of the block protocol.
We participated in many of these attempts and have since concluded that neither a departure from nor a significant addition to the block protocol is needed. Here we introduce a block-compatible design based on virtual objects. Like a real object (e.g., a file), a virtual object contains the metadata that is needed to process the data. We show how numerous offloads are possible using virtual objects and, as one example, demonstrate a 99% reduction in the data movement required to “scrub” object storage for bitrot. We also present our early work with erasure coded data which, unlike RAID, can be easily adapted to computational storage using virtual objects.
Advanced vision analytics plays a key role in a plethora of real-world applications. Unfortunately, many of these applications fail to leverage the abundant compute resource in cloud services, because they require high computing resources {\em and} high-quality video input, but the (wireless) network connections between visual sensors (cameras) and the cloud/edge servers do not always provide sufficient and stable bandwidth to stream high-fidelity video data in real time.
This paper presents CloudSeg, an edge-to-cloud framework for advanced vision analytics that co-designs the cloud-side inference with real-time video streaming, to achieve both low latency and high inference accuracy. The core idea is to send the video stream in low resolution, but recover the high-resolution frames from the low-resolution stream via a {\em super-resolution} procedure tailored for the actual analytics tasks. In essence, CloudSeg trades additional cloud-side computation (super-resolution) for significantly reduced network bandwidth. Our initial evaluation shows that compared to previous work, CloudSeg can reduce bandwidth consumption by $\sim$6.8$\times$ with negligible drop in accuracy.
Cloud-based Natural Language Understanding (NLU) services are getting more and more popular with the development of artificial intelligence. More applications are integrated with cloud-based NLU services to enhance the way people communicate with machines. However, with NLU services provided by different companies powered by unrevealed AI technology, how to choose the best one is a problem for users. To our knowledge, there is currently no platform that can provide guidance to users and make recommendations based on their needs. To fill this gap, in this paper, we propose NLUBroker, a platform to comprehensively measure the performance indicators of candidate NLU services, and further provide a broker to select the most suitable service according to the different needs of users. Our evaluation shows that different NLU services leading in different aspects, and NLUBroker is able to improve the quality of experience by automatically choosing the best service. In addition, reinforcement learning is used to support NLUBroker by an intelligent agent in a dynamic environment, and the results are promising.
The increasing availability of byte-addressable non-volatile memory on the system bus provides an opportunity to dramatically simplify application interaction with persistent data. However, software and hardware leverage different abstractions: software operating on persistent data structures requires “global” pointers that remain valid after a process terminates, while hardware requires that a diverse set of devices all have the same mappings they need for bulk transfers to and from memory, and that they be able to do so for a potentially heterogeneous memory system. Both abstractions must be implemented in a way that is efficient using existing hardware.
We propose to abstract physical memory into an object space, which maps objects to physical memory, while providing applications with a way to refer to data that may have a lifetime longer than the processes accessing it. This approach reduces the coordination required for access to multiple types of memory while improving hardware security and enabling more hardware autonomy. We describe how we can use existing hardware support to implement these abstractions, both for applications and for the OS and devices, and show that the performance penalty for this approach is minimal.
We present new means for performing static program analysis on serverless programs. We propose a new type of call graph that captures the stateless, event-driven nature of such programs and describe a method for constructing these new extended service call graphs. Next, we survey applications of program analysis that can leverage our extended service call graphs to answer questions about code that executes on a serverless platform. We present findings on the applicability of our techniques to real open source serverless programs. Finally, we close with several open questions about how to best incorporate static analysis in problem solving for developing serverless applications.
The Serverless or Function-as-a-Service (FaaS) model capitalizes on lightweight execution by packaging code and dependencies together for just-in-time dispatch. Often a container environment has to be set up afresh– a condition called “cold start", and in such cases, performance suffers and overheads mount, both deteriorating rapidly under high concurrency. Caching and reusing previously employed containers ties up memory and risks information leakage. Latency for cold starts is frequently due to work and wait-times in setting up various dependencies – such as in initializing networking elements. This paper proposes a solution that pre-crafts such resources and then dynamically re-associates them with baseline containers. Applied to networking, this approach demonstrates an order of magnitude gain in cold starts, negligible memory consumption, and flat startup time under rising concurrency.
Containers continue to gain traction in the cloud as lightweight alternatives to virtual machines (VMs). This is partially due to their use of host filesystem abstractions, which play a role in startup times, memory utilization, crash consistency, file sharing, host introspection, and image management. However, the filesystem interface is high-level and wide, presenting a large attack surface to the host. Emerging secure container efforts focus on lowering the level of abstraction of the interface to the host through deprivileged functionality recreation (e.g., VMs, userspace kernels). However, the filesystem abstraction is so important that some have resorted to directly exposing it from the host instead of suffering the resulting semantic gap. In this paper, we suggest that through careful ahead-of-time metadata preparation, secure containers can maintain a small attack surface while simultaneously alleviating the semantic gap.
Distributed systems are hard to reason about largely because of uncertainty about what may go wrong in a particular execution, and about whether the system will mitigate those faults. Tools that perturb executions can help test whether a system is robust to faults, while tools that observe executions can help better understand their system-wide effects. We present Box of Pain, a tracer and fault injector for unmodified distributed systems that addresses both concerns by interposing at the system call level and dynamically reconstructing the partial order of communication events based on causal relationships. Box of Pain’s lightweight approach to tracing and focus on simulating the effects of partial failures on communication rather than the failures themselves sets it apart from other tracing and fault injection systems. We present evidence of the promise of Box of Pain and its approach to lightweight observation and perturbation of distributed systems.
Object storage has emerged as a low-cost and hyper-scalable alternative to distributed file systems. However, interface incompatibilities and performance limitations often compel users to either transfer data between a file system and object storage or use inefficient file connectors over object stores. The result is growing storage sprawl, unacceptably low performance, and an increase in associated storage costs. One promising solution to this problem is providing dual access, the ability to transparently read and write the same data through both file system interfaces and object storage APIs. In this position paper we argue that there is a need for dual-access file systems over object storage, and examine some representative use cases which benefit from such systems. Based on our conversations with end users, we discuss features which we believe are essential or desirable in a dual-access object storage file system (OSFS). Further, we design and implement an early prototype of Agni, an efficient dual-access OSFS which overcomes the shortcomings of existing approaches. Our preliminary experiments demonstrate that for some representative workloads Agni can improve performance by 20%--60% compared to either S3FS, a popular OSFS, or the prevalent approach of manually copying data between different storage systems.
The wide deployment of 4G/5G has enabled connected vehicles as a perfect edge computing platform for a plethora of new services which are impossible before, such as remote real-time diagnostics and advanced driver assistance. In this work, we propose CLONE, a collaborative learning setting on the edges based on the real-world dataset collected from a large electric vehicle (EV) company. Our approach is built on top of the federated learning algorithm and long short-term memory networks, and it demonstrates the effectiveness of driver personalization, privacy serving, latency reduction (asynchronous execution), and security protection. We choose the failure of EV battery and associated accessories as our case study to show how the CLONE solution can accurately predict failures to ensure sustainable and reliable driving in a collaborative fashion.
As smart home environments get more complex and denser, they are becoming harder to manage. We present our ongoing work on the design and implementation of ``SafeHome'', a system for management and coordination inside a smart home. SafeHome offers users and programmers the flexibility to specify safety properties in a declarative way, and to specify routines of commands in an imperative way. SafeHome includes mechanisms which ensure that under concurrent routines and device failures, the smart home behavior is consistent (e.g., serializable) and safety properties are always guaranteed. SafeHome is intended to run on edge machines co-located with the smart home. Our design space opens the opportunity to borrow and adapt rich ideas and mechanisms from related areas such as databases and compilers.
We introduce file systems as processes (FSP), a storage architecture designed for modern ultra-fast storage devices. By building a direct-access file system as a standalone user-level process, FSP accelerates file system development velocity without compromising essential file system properties. FSP promises to deliver raw device-level performance via highly tuned inter-process communication mechanisms; FSP also ensures protection and metadata integrity by design. To study the potential advantages and disadvantages of the FSP approach, we develop DashFS, a prototype user-level file system. We discuss its architecture and show preliminary performance benefits.
With the explosive growth of data, largely contributed by the rapidly and widely deployed smart devices on the edge, we need to rethink the training paradigm for learning on such realworld data. The conventional cloud-only approach can hardly keep up with the computational demand from these deep learning tasks; and the traditional back propagation based training method also makes it difficult to scale out the training. Fortunately, the continuous advancement in System on Chip (SoC) hardware is transforming edge devices into capable computing platforms, and can potentially be exploited to address these challenges. These observations have motivated this paper’s study on the use of synthetic gradients for distributed training cross cloud and edge devices. We employ synthetic gradients into various neural network models to comprehensively evaluate its feasibility in terms of accuracy and convergence speed. We distribute the training of the various layers of a model using synthetic gradients, and evaluate its effectiveness on the edge by using resource-limited containers to emulate edge devices. The evaluation result shows that the synthetic gradient approach can achieve comparable accuracy compared to the conventional back propagation, for an eight-layer model with both fully-connected and convolutional layers. For a more complex model (VGG16), the training suffers from some accuracy degradation (up to 15%). But it achieves 11% improvement in training speed when the layers of a model are decoupled and trained on separate resource-limited containers, compared to the training of the whole model using the conventional method on the physical machine.
Deploying machine learning on edge devices is becoming increasingly important, driven by new applications such as smart homes, smart cities, and autonomous vehicles. Unfortunately, it is challenging to deploy deep neural networks (DNNs) on resource-constrained devices. These workloads are computationally intensive and often require cloud-like resources. Prior solutions attempted to address these challenges by either sacrificing accuracy or by relying on cloud resources for assistance.
In this paper, we propose a containerized partition-based runtime adaptive convolutional neural network (CNN) acceleration framework for Internet of Things (IoT) environments. The framework leverages spatial partitioning techniques through convolution layer fusion to dynamically select the optimal partition according to the availability of computational resources and network conditions. By containerizing each partition, we simplify the model update and deployment with Docker and Kubernetes to efficiently handle runtime resource management and scheduling of containers.
Filesystem fragmentation is a first-order performance problem that has been the target of many heuristic and algorithmic approaches. Real-world application benchmarks show that common filesystem operations cause many filesystems to fragment over time, a phenomenon known as filesystem aging.
This paper examines the common assumption that space pressure will exacerbate fragmentation. Our microbenchmarks show that space pressure can cause a substantial amount of inter-file and intra-file fragmentation. However, on a “real-world” application benchmark, space pressure causes fragmentation that slows subsequent reads by only 20% on ext4, relative to the amount of fragmentation that would occur on a file system with abundant space. The other file systems show negligible additional degradation under space pressure.
Our results suggest that the effect of free-space fragmentation on read performance is best described as accelerating the filesystem aging process. The effect on write performance is non-existent in some cases, and, in most cases, an order of magnitude smaller than the read degradation from fragmentation cause by normal usage.
Telepathology is the practice of digitizing histological images for transmission along telecommunication pathways for diagnosis, consultation or continuing medical education. Existing telepathology solutions are limited to offline or delay-tolerant diagnosis.
In this paper we present LiveMicro, a telepathology system that, leveraging edge computing, enables multiple pathologists to collaborate on a diagnosis by allowing a remote live control of a microscope. In such environment, computation at the edge is used in three ways: (1) to allow remote users to control the microscope simultaneously, (2) to process histological image and live video, by running algorithms that recognize e.g., tumor grades, (3) to preserve privacy creating virtual shared data views. In particular, we built the first opensource edge computing based telepathology system. In our prototype, the examples of edge processing that we currently support are extraction of diagnosis-oriented features and compression of payloads to minimize transmission delays. Our evaluation shows how LiveMicro can help a medical team with a remote, faster and more accurate diagnosis.
Prior research has proposed using peer-to-peer (P2P) content delivery to serve Internet video at lower costs. Yet, such methods have not witnessed widespread adoption. An important challenge is incentivization: what tangible benefits does P2P content delivery offer users who bring resources to the table? In this paper, we ask whether monetary incentives can help attract peers in P2P content delivery systems. We first propose Gringotts, a system to enable secure monetary incentives for P2P content delivery systems. Gringotts provides a novel Proof of Delivery mechanism that allows content providers to verify correct delivery of their files, and shows how to use cryptocurrency to pay peers while guarding against liars and Sybil attacks. We then present results from an 876-person professional survey we commissioned to understand users’ willingness to participate in Gringotts, and what challenges remain. Our survey revealed that 51% would participate for suitable financial incentives, and motivated the need for alternate payment forms, device security, and peer anonymity.
The extremely low latency of non-volatile memory (NVM) raises issues of latency in file systems. In particular, user-kernel context switches caused by system calls and hardware interrupts become a non-negligible performance penalty. A solution to this problem is using direct-access file systems, but existing work focuses on optimizing their non-POSIX user interfaces. In this work, we propose EvFS, our new user-level POSIX file system that directly manages NVM in user applications. EvFS minimizes the latency by building a user-level storage stack and introducing asynchronous processing of complex file I/O with page cache and direct I/O. We report that the event-driven architecture of EvFS leads to a 700-ns latency for 64-byte non-blocking file writes and reduces the latency for 4-Kbyte blocking file I/O by 20 us compared to a kernel file system with journaling disabled.
Visual data produced at the edge is rich with information, opening a world of analytics opportunities for applications to explore. However, the demanding requirements of visual data on computational resources and bandwidth have hindered effective processing, preventing the data from being used in an economically efficient manner. In order to scale out visual analytics systems, it is necessary to have a framework that works collaboratively between edge and cloud. In this paper, we propose an end-to-end (E2E) visual fog architecture, designed for processing and management of visual data. Using our architecture to extract shopper insights, we are able to achieve application specified real time requirements for extracting and querying visual data, showing the feasibility of our design in a real-world setting. We also discuss the lessons we learned from deploying an edge-to-cloud architecture for video streaming applications.
The rise of edge computing as a new storage and compute model has already motivated numerous studies within the systems community, focusing on the choices and mechanisms of task offloading from end devices to the edge infrastructure, pricing, consistency, indexing and caching. However, it is not yet entirely clear how the edge infrastructure itself will be deployed, and, more importantly, managed. A common point of view considers the edge as an extension of traditional content distribution networks (CDN), due to its hierarchical layout, centralized ownership, and cloud back-end.
In this paper, we consider a different view of the edge, as a "reincarnation" of the well-known peer-to-peer (P2P) model. We show how the edge is similar to P2P systems in many aspects, including the number, heterogeneity and limited availability and resources of its nodes, their central role in performing the system's storage and computation, and the vulnerabilities related to tight interoperability with user end devices. We describe the similarities of the edge to both CDNs and P2P systems, the challenges that arise from these similarities, and the previous approaches to address them in both contexts. We show that the challenges that remain in applying these approaches may be addressed by viewing the edge as a larger and smarter reincarnation of P2P systems.
This paper proposes a serverless platform for building and operating edge AI applications. We analyze edge AI use cases to illustrate the challenges in building and operating AI applications in edge cloud scenarios. By elevating concepts from AI lifecycle management into the established serverless model, we enable easy development of edge AI workflow functions. We take a deviceless approach, i.e., we treat edge resources transparently as cluster resources, but give developers fine-grained control over scheduling constraints. Furthermore, we demonstrate the limitations of current serverless function schedulers, and present the current state of our prototype.
Edge computing and the Internet of Things (IoT) are irrevocably intertwined and much work has proposed enhancing the IoT through the use of edge computing. These solutions have typically focused on using the edge to increase the locality of cloud applications, achieving benefits mainly in terms of lower network latency. In this work, we argue that IoT systems can benefit much more from semantic properties which are best recognized and exploited in situ, at the edge of the network where the data streams and actuators exist. We outline the idea of a real-time semantic operating system, hosted on the edge, which can provide higher performance in energy consumption, latency, accuracy, and improve not only individual application performance but that of an entire IoT infrastructure. We have implemented a prototype system and show some initial results demonstrating the efficacy of our proposed optimizations, and provide insights into how to handle some of the most critical issues faced in such a system.
In order to keep up with big data workloads, distributed storage needs to offer low latency, high bandwidth and energy efficient access to data. To achieve these properties, most state of the art solutions focus either exclusively on software or on hardware-based implementation. FPGAs are an example of the latter and a promising platform for building storage nodes but they are more cumbersome to program and less flexible than software, which limits their adoption.
We make the case that, in order to be feasible in the cloud, solutions designed around programmable hardware, such as FPGAs, have to follow a service provider-centric methodology: the hardware should only provide functionality that is useful across all tenants and rarely changes. Conversely, application-specific functionality should be delivered through software that, in a cloud setting, is under the provider's control. Deploying FPGAs this way is less cumbersome, requires less hardware programming and flexibility increases overall.
We demonstrate the benefits of this approach by building an application-aware storage for Parquet files, a columnar data format widely used in big data frameworks. Our prototype offers transparent 10Gbps deduplication in hardware without sacrificing low latency operation and specializes to Parquet files using a companion library. This work paves the way for in-storage filtering of columnar data without having to implement file-type and tenant-specific parsing in the FPGA.
Recent interest in Edge/Fog Computing has pushed IoT Platforms to support a broader range of general-purpose workloads. We propose a design of an IoT Platform called OneOS, inspired by Distributed OS and micro-kernel principles, providing a single system image of the IoT network. OneOS aims to preserve the portability of applications by reusing a subset of the POSIX interface at a higher layer over a flat group of Actors. As a distributed middleware, OneOS achieves its goal through evaluation context replacement, which enables a process to run in a virtual context rather than its local context.
High demand for low latency services and local data processing has given rise for edge computing. As opposed to cloud computing, in this new paradigm computational facilities are located close to the end-users and data producers, on the edge of the network, hence the name. The critical issue for the proliferation of edge computing is the availability of local computational resources. Major cloud providers are already addressing the problem by establishing facilities in the proximity of end-users. However, there is an alternative trend, namely, developing open infrastructure as a set of standards, technologies, and practices to enable any motivated parties to offer their computational capacity for the needs of edge computing. Open infrastructure can give an additional boost to this new promising paradigm and, moreover, help to avoid problems for which cloud computing has been long criticized for, such as vendor lock-in or privacy. In this paper, we discuss the challenges related to creating such an open infrastructure, in particular focusing on the applicability of distributed ledgers for contractual agreement and payment. Solving the challenge of contracting is central to realizing an open infrastructure for edge computing, and in this paper, we highlight the potential and shortcomings of distributed ledger technologies in the context of our use case.
Memory pressure is inevitable as the size of working sets is rapidly growing while the capacity of dynamic random access memory (DRAM) is not. Meanwhile, storage devices have evolved so that their speed is comparable to the speed of DRAM while their capacity scales are comparable to that of hard disk drives (HDD). Thus, hierarchial memory systems configuring DRAM as the main memory and high-end storages as swap devices will be common.
Due to the unique characteristics of these modern storage devices, the swap target decision should be optimal. It is essential to know the exact data access patterns of workloads for such an optimal decision, although underlying systems cannot accurately estimate such complex and dynamic patterns. For this reason, memory systems allow programs to voluntarily hint their data access pattern. Nevertheless, it is exhausting for a human to manually figure out the patterns and embed optimal hints if the workloads are huge and complex.
This paper introduces a compiler extension that automatically optimizes a program to voluntarily hint its dynamic data access patterns to the underlying swap system using a static/dynamic analysis based profiling result. To our best knowledge, this is the first profile-guided optimization (PGO) for modern swap devices. Our empirical evaluation of the scheme using realistic workloads shows consistent improvement in performance and swap device lifetime up to 2.65 times and 2.98 times, respectively.
Connected and autonomous vehicles (CAVs) is currently a hot topic and a major focus in the field of edge computing, and it has created numerous pivotal and challenging research problems. In this paper, we present HydraOne, an indoor experimental research and education platform for edge computing in the CAVs scenario. HydraOne is a hardware-software co-design platform built from scratch based on our experience with the requirements of edge computing research problems. We present the design and implementation details and discuss three key characteristics of HydraOne: design modularization, resource extensibility and openness, as well as function isolation. These will help researchers and students fully understand the platform and take advantage of it to conduct research experiments. We also provide three case studies deployed on HydraOne to demonstrate the capabilities of our research platform.
The increase in privacy concerns among the users has led to edge based analytics applications such as federated learning which trains machine learning models in an iterative and collaborative fashion on the edge devices without sending the raw private data to the central cloud. In this paper, we propose a system for enabling iterative collaborative processing (ICP) in resource constrained edge environments. We first identify the unique systems challenges posed by ICP, which are not addressed by the existing distributed machine learning frameworks such as the parameter server. We then propose the system components necessary for ICP to work well in highly distributed edge environments. Based on this, we propose a system design for enabling such applications over the edge. We show the benefits of our proposed system components with a preliminary evaluation.
To get good performance for data stored in Object storage services like S3, data analysis clusters need to cache data locally. Recently these caches have started taking into account higher-level information from analysis framework, allowing prefetching based on predictions of future data accesses. There is, however, a broader opportunity; rather than using this information to predict one future, we can use it to select a future that is best for caching. This paper provides preliminary evidence that we can exploit the directed acyclic graph (DAG) of inter-task dependencies used by data-parallel frameworks such as Spark, Pig, and Hive to improve application performance, by optimizing caching for the critical path through the DAG for the application. We present experimental results for PIG running TPC-H queries, showing completion time improvements of up to 23% vs our implementation of MRD, a state-of-the-art DAG-based prefetching system, and improvements of up to 2.5x vs LRU caching. We then discuss the broader opportunity for building a system based on this opportunity.
With the increasing need for more reactive services, and the need to process large amounts of IoT data, edge clouds are emerging to enable applications to be run close to the users and/or devices. Following the trend in hyperscale clouds, applications are trending toward a microservices architecture where the application is decomposed into smaller pieces that can each run in its own container and communicate with each other over a network through well defined APIs. This improves the development effort and deployability, but also introduces inefficiencies in communication. In this paper, we rethink the communication model, and introduce the ability to create shared memory channels between containers supporting both a pub/sub model and streaming model. Our approach is not only applicable to the edge clouds but also beneficial in core cloud environments. Local communication is made more efficient, and remote communication is efficiently supported through synchronizing shared memory regions via RDMA.
Emerging edge applications, such as augmented and virtual reality, real-time video analytics and thin-client gaming, are latency-sensitive, resource-intensive, and stateful. Transitioning these applications from cloud deployments to the edge is non-trivial since edge deployments will exhibit variable resource availability, significant user mobility, and high potential for faults and application preemption, requiring considerable developer effort per application to maintain stable quality of experience for the user.
In this paper, we propose deterministic containers, a new abstraction that simplifies the development of complex applications on the edge. Deterministic containers enforce the property that all activity within a container behave deterministically. Determinism provides replication, which in turn provides key benefits for edge computing including resilience to performance jitter, enhanced fault-tolerance, seamless migration, and data provenance.
We are currently building a prototype, Shadow, that aims to provide deterministic containers with minimal performance overhead while requiring few application modifications. For all sources of non-determinism, Shadow either converts the behavior to be deterministic or restricts the allowable application behavior. Preliminary results indicate that using Shadow to reduce performance jitter at the edge for a vehicle caravan application involving video analytics reduces median application response time by up to 25%.
In solid-state drives (SSDs), garbage collection (GC) plays a key role in making free NAND blocks for newly coming data. The data copied from one block to another by GC affects both the performance and lifetime of SSD significantly. Placing the data with different “temperature” into different NAND blocks can reduce data copy overhead in GC. This paper proposes a scheme to place data according to its predicted future temperature. A neural network called LSTM is applied to increase the accuracy of temperature prediction in both temporal and spatial dimensions. And it also uses K-Means to do clustering and automatically dispatch similar “future temperature” data to the same NAND blocks. The results obtained show that performance and write amplification factor (WAF) are improved in various applications. In the best case, the WAF and 99.99% of the write latency are reduced by up to 43.5% and 79.3% respectively.
With the pervasiveness and growth in media technology, user-generated content has become intertwined with our day-to-day life. Such advancements however, have enabled the exponential growth in media file sizes, which leads to shortage of storage on small-scale edge devices. Online clouds are generally a potential solution, however, they raise privacy concerns, are not fully automated, and do not adapt to different networking environments (rural/urban/metropolitan). Distributed storage systems rely on their distributed nature to combat concerns over privacy and are adaptable to different networking environments. Nevertheless, such systems lack optimization via compression due to energy concerns on edge devices. In this work, we propose Smart Media Compression (SMC) for distributed edge storage systems. SMC dynamically adjusts compression parameters, in order to reduce the amount of needless compression, thus reducing energy consumption while providing smaller user file access delays. Our results show an improvement in average file access delay by up to 90%, while only costing an additional 14% in energy consumption.
User-generated video content is imposing an increasing burden on live video service architectures such as Facebook Live. These services are responsible for ingesting large amounts of video, transcoding that video into different quality levels (i.e., bitrates), and adaptively streaming it to viewers. These tasks are expensive, both computationally and network-wise, often forcing service providers to sacrifice the “liveness” of delivered video. Given the steady increases in smartphone bandwidth and energy resources, coupled with the fact that commodity smartphones now include hardware-accelerated codecs, we propose that live video services augment their existing infrastructure with edge support for transcoding and transmission. We present measurements to motivate the feasibility of incorporating such edge-support into the live video ecosystem, present the design of a peer-to-peer adaptive live video streaming system, and discuss directions for future work to realize this vision in practice.
With latest development, NAND flash is experiencing increased errors. The read reference voltages are the key factor for RBER seen by ECC. The limited error correction capability of ECC determines a value range that the read voltages should fall into, otherwise a read failure followed by a read retry with tuned read voltage, would happen. Therefore, finding a correct read voltage with the smallest number of read failures has been a hot research problem. Previous methods in the literature are designed to either progressively tune the voltage value or empirically predict a read voltage based on error models. However, straightforward tuning leads to unpredictable large number of read retries, whereas complex modeling brings large overhead. This paper proposes a novel approach, by reserving a small set of cells as sentinels, which directly tell us the optimal voltage, as drifting caused errors exhibits strong locality. Experiments demonstrate the proposed technique is both efficient and effective.
In contrast to the classic fashion for designing distributed end-to-end (e2e) TCP schemes for cellular networks (CN), we explore another design space by having the CN assist the task of the transport control. We show that in the emerging cellular architectures such as mobile/multi-access edge computing (MEC), where the servers are located close to the radio access network (RAN), significant improvements can be achieved by leveraging the nature of the logically centralized network measurements at the RAN and passing information such as its minimum e2e delay and access link capacity to each server. Particularly, a Network Assistance module (located at the mobile edge) will pair up with wireless scheduler to provide feedback information to each server and facilitate the task of congestion control. To that end, we present two Network Assisted schemes called NATCP (a clean-slate design replacing TCP at end-hosts) and NACubic (a backward compatible design requiring no change for TCP at end-hosts). Our preliminary evaluations using real cellular traces show that both schemes dramatically outperform existing schemes both in single-flow and multi-flow scenarios.
Manufacturer Usage Description (MUD) is a proposed IETF standard enabling local area networks (LAN) to automatically configure their access control when adding a new IoT device based on the recommendations provided for that device by the manufacturer. MUD has been proposed as an isolation-based defensive mechanism with a focus on devices in the home, where there is no dedicated network administrator. In this paper, we describe the efficacy of MUD for a generic IoT device under different threat scenarios in the context of the Fog. We propose a method to use rate limiting to prevent end devices from participating in denial of service attacks (DDoS), including against the Fog itself. We illustrate our assumptions by providing a possible real world example and describe the benefits for MUD in the Fog for various stakeholders.
We argue that, along with bandwidth and capacity, lifetime of flash devices is also a critical resource that needs to be explicitly and carefully managed, especially in emerging consolidated environments. We study the resulting multi-resource allocation problem in a setting where "fairness" across consolidated workloads is desired. Towards this, we propose to adapt the well-known notion of dominant resource fairness (DRF). We empirically show that using DRF with only bandwidth and capacity (and ignoring lifetime) may result in poor device lifetime. Incorporating lifetime, however, turns out to be non-trivial. We identify key challenges in this adaptation and present simple heuristics. We also discuss possible design choices which will be fully explored in future work.
We investigate a new distributed services model and architecture for Internet of Things (IoT) applications. In particular, we observe that devices at the edge of the network, although resource constrained, are increasingly capable—performing actions (e.g. data analytics, decision support, actuation, control, etc.) in addition to event telemetry. Thus, such devices are better modeled as servers, which applications in the cloud compose for their functionality. We investigate the implications of this "flipped" IoT client-server model, for server discovery, authentication, and resource use. We find that by combining capability-based security with an edge-aware registry, this model can achieve fast response and energy efficiency.
IoT applications are starting to rely heavily on edge computing due to the advent of low-power and high data-rate wireless communication technologies such as 5G and the processing capability of GPU-driven edge platforms. However, the computation and the data communication model for the edge computing applications are quite diverse, which limits their interoperability. An interoperable edge computing architecture with a versatile communication model would lead to the development of innovative and incentive-driven edge computing applications by combining various data sources from a wide array of devices. In this paper, we present an edge computing architecture by extending the publish-subscribe protocol with support for incentives. Our novel publish-pay-subscribe protocol enables the data producers (publishers) to sell their data with data consumers and service providers (subscribers). The proposed architecture not only allows the device owners to gain incentive but also enable the service providers to sell the processed data with one or more data consumers. Our proof-of-concept implementation using AEDES publish-subscribe broker and Ethereum cryptocurrency shows the feasibility of publish-pay-subscribe broker and its support for data-driven and incentive-based edge computing applications.
This talk will discuss an approach to systems research based upon an iterative, empirically-driven approach, loosely known as "measure, then build." By first carefully analyzing the state of the art, one can learn what the real problems in today's systems are; by then designing and implementing new systems to solve said problems, one can ensure that one's work is both relevant and important. The talk will draw on examples in research done at the University of Wisconsin-Madison over nearly two decades, including research into Linux file systems, enterprise storage, key-value storage systems, and distributed systems.
Remzi Arpaci-Dusseau is the Grace Wahba professor of Computer Sciences at UW-Madison. He co-leads a research group with Professor Andrea Arpaci-Dusseau. Together, they have graduated 24 Ph.D. students and won numerous best-paper awards; many of their innovations are used by commercial... Read More →
Wednesday July 10, 2019 9:00am - 10:00am PDT
Grand Ballroom
Given the highly empirical nature of research in cloud computing, networked systems, and related fields, testbeds play an important role in the research ecosystem. In this paper, we cover one such facility, CloudLab, which supports systems research by providing raw access to programmable hardware, enabling research at large scales, and creating a shared platform for repeatable research.
We present our experiences designing CloudLab and operating it for four years, serving nearly 4,000 users who have run over 79,000 experiments on 2,250 servers, switches, and other pieces of datacenter equipment. From this experience, we draw lessons organized around two themes. The first set comes from analysis of data regarding the use of CloudLab: how users interact with it, what they use it for, and the implications for facility design and operation. Our second set of lessons comes from looking at the ways that algorithms used "under the hood," such as resource allocation, have important---and sometimes unexpected---effects on user experience and behavior. These lessons can be of value to the designers and operators of IaaS facilities in general, systems testbeds in particular, and users who have a stake in understanding how these systems are built.
We present the design of an alternative runtime system for improved scalability and reduced latency in actor applications called Agner. Agner provides higher scalability by allowing the application developer to specify the network overlay used at runtime without changing application semantics, thereby specializing the network communication patterns to the application. Agner reduces message latency through a combination of three predominately automatic optimizations: parallelism, named channels, and affinitized scheduling. We implement a prototype of Agner in Erlang and demonstrate that Agner achieves up to an order of magnitude increase in the number of nodes the system can scale to through runtime overlay selection, up to a 34.96x increase in throughput, and up to a 13.4x reduction in latency over Distributed Erlang.
File Storage Service (FSS) is an elastic filesystem provided as a managed NFS service in Oracle Cloud Infrastructure. Using a pipelined Paxos implementation, we implemented a scalable block store that provides linearizable multipage limited-size transactions. On top of the block store, we built a scalable B-tree that provides linearizable multikey limited-size transactions. By using self-validating B-tree nodes and performing all Btree housekeeping operations as separate transactions, each key in a B-tree transaction requires only one page in the underlying block transaction. The B-tree holds the filesystem metadata. The filesystem provides snapshots by using versioned key-value pairs. The entire system is programmed using a nonblocking lock-free programming style. The presentation servers maintain no persistent local state, with any state kept in the B-tree, making it easy to scale up and failover the presentation servers. We use a non-scalable Paxos-replicated hash table to store configuration information required to bootstrap the system. The system throughput can be predicted by comparing an estimate of the network bandwidth needed for replication to the network bandwidth provided by the hardware. Latency on an unloaded system is about 4 times higher than a Linux NFS server backed by NVMe, reflecting the cost of replication.
Dynamic binary translation (DBT) is a key system technology that enables many important system applications such as system virtualization and emulation. To achieve good performance, it is important for a DBT system to be equipped with high-quality translation rules. However, most translation rules in existing DBT systems are created manually with high engineering efforts and poor quality. To solve this problem, a learning-based approach was recently proposed to automatically learn semantically-equivalent translation rules, and symbolic verification is used to prove the semantic equivalence of such rules. But, they still suffer from some shortcomings.
In this paper, we first give an in-depth analysis on the constraints of prior learning-based methods and observe that the equivalence requirements are often unduly restrictive. It excludes many potentially high-quality rule candidates from being included and applied. Based on this observation, we propose an enhanced learning-based approach that relaxes such equivalence requirements but supplements them with constraining conditions to make them semantically equivalent when such rules are applied. Experimental results on SPEC CINT2006 show that the proposed approach can improve the dynamic coverage of the translation from 55.7\% to 69.1\% and the static coverage from 52.2\% to 61.8\%, compared to the original approach. Moreover, up to 1.65X performance speedup with an average of 1.19X are observed.
Determining whether online users are authorized to access digital objects is central to preserving privacy. This paper presents the design, implementation, and deployment of Zanzibar, a global system for storing and evaluating access control lists. Zanzibar provides a uniform data model and configuration language for expressing a wide range of access control policies from hundreds of client services at Google, including Calendar, Cloud, Drive, Maps, Photos, and YouTube. Its authorization decisions respect causal ordering of user actions and thus provide external consistency amid changes to access control lists and object contents. Zanzibar scales to trillions of access control lists and millions of authorization requests per second to support services used by billions of people. It has maintained 95th-percentile latency of less than 10 milliseconds and availability of greater than 99.999% over 3 years of production use.
A large class of IoT applications read sensors, execute application logic, and actuate actuators. However, the lack of high-level programming abstractions compromises correctness especially in presence of failures and unwanted interleaving between applications. A key problem arises when operations on IoT devices or the application itself fails, which leads to inconsistencies between the physical state and application state, breaking application semantics and causing undesired consequences. Transactions are a well-established abstraction for correctness, but assume properties that are absent in an IoT context. In this paper, we study one such environment, smart home, and establish inconsistencies manifesting out of failures. We propose an abstraction called transactuation that empowers developers to build reliable applications. Our runtime, Relacs, implements the abstraction atop a real smart-home platform. We evaluate programmability, performance, and effectiveness of transactuations to demonstrate its potential as a powerful abstraction and execution model.
We address the problem of “fail-slow” fault, a fault where a hardware or software component can still function (does not fail-stop) but in much lower performance than expected. To address this, we built IASO, a peer-based, non-intrusive fail-slow detection framework that has been deployed for more than 1.5 years across 39,000 nodes in our customer sites and helped our customers reduce major outages due to fail-slow incidents. IASO primarily works based on timeout signals (a negligible overhead of monitoring) and converts them into a stable and accurate fail-slow metric. IASO can quickly and accurately isolate a slow node within minutes. Within a 7-month period, IASO managed to catch 232 fail-slow incidents in our large deployment field. In this paper, we have also assembled a large dataset of 232 fail-slow incidents along with our analysis. We found that the fail-slow annual failure rate in our field is 1.02%.
All major web browsers now support WebAssembly, a low-level bytecode intended to serve as a compilation target for code written in languages like C and C++. A key goal of WebAssembly is performance parity with native code; previous work reports near parity, with many applications compiled to WebAssembly running on average 10% slower than native code. However, this evaluation was limited to a suite of scientific kernels, each consisting of roughly 100 lines of code. Running more substantial applications was not possible because compiling code to WebAssembly is only part of the puzzle: standard Unix APIs are not available in the web browser environment. To address this challenge, we build Browsix-Wasm , a significant extension to Browsix that, for the first time, makes it possible to run unmodified WebAssembly-compiled Unix applications directly inside the browser. We then use Browsix-Wasm to conduct the first large-scale evaluation of the performance of WebAssembly vs. native. Across the SPEC CPU suite of benchmarks, we find a substantial performance gap: applications compiled to WebAssembly run slower by an average of 45% (Firefox) to 55% (Chrome), with peak slowdowns of 2.08$\times$ (Firefox) and 2.5$\times$ (Chrome). We identify the causes of this performance degradation, some of which are due to missing optimizations and code generation issues, while others are inherent to the WebAssembly platform.
User file systems offer numerous advantages over their in-kernel implementations, such as ease of development and better system reliability. However, they incur heavy performance penalty. We observe that existing user file system frameworks are highly general; they consist of a minimal interposition layer in the kernel that simply forwards all low-level requests to user space. While this design offers flexibility, it also severely degrades performance due to frequent kernel-user context switching.
This work introduces ExtFUSE, a framework for developing extensible user file systems that also allows applications to register "thin" specialized request handlers in the kernel to meet their specific operative needs, while retaining the complex functionality in user space. Our evaluation with two FUSE file systems shows that ExtFUSE can improve the performance of user file systems with less than a few hundred lines on average. ExtFUSE is available on GitHub.
Ashish is a published author and researcher with a Ph.D. in Computer Science from Georgia Institute of Technology and extensive experience in building secure systems software from the ground-up. He has worked in the industry for over a decade, coupled with nearly a decade of top-tier... Read More →
Optimizing scheduling and communication of distributed data processing for resource and data characteristics is crucial for achieving high performance. Existing approaches to such optimizations largely fall into two categories. First, distributed runtimes provide low-level policy interfaces to apply the optimizations, but do not ensure the maintenance of correct application semantics and thus often require significant effort to use. Second, policy interfaces that extend a high-level application programming model ensure correctness, but do not provide sufficient fine control. We describe Apache Nemo, an optimization framework for distributed dataflow processing that provides fine control for high performance, and also ensures correctness for ease of use. We combine several techniques to achieve this, including an intermediate representation, optimization passes, and runtime extensions. Our evaluation results show that Nemo enables composable and reusable optimizations that bring performance improvements on par with existing specialized runtimes tailored for a specific deployment scenario.
The rapid growth of customer applications and datasets has led to demand for storage that can scale with the needs of modern workloads. We have developed FlexGroup volumes to meet this need. FlexGroups combine local WAFL® file systems in a distributed storage cluster to provide a single namespace that seamlessly scales across the aggregate resources of the cluster (CPU, storage, etc.) while preserving the features and robustness of the WAFL file system.
In this paper we present the FlexGroup design, which includes a new remote access layer that supports distributed transactions and the novel heuristics used to balance load and capacity across a storage cluster. We evaluate FlexGroup performance and efficacy through lab tests and field data from over 1,000 customer FlexGroups.
Data analytics frameworks that adopt immutable data abstraction usually provide better support for failure recovery and straggler mitigation, while those that adopt mutable data abstraction are more efficient for iterative workloads thanks to their support for in-place state updates and asynchronous execution. Most existing frameworks adopt either one of the two data abstractions and do not enjoy the benefits of the other. In this paper, we propose a novel programming model named MapUpdate, which can determine whether a distributed dataset is mutable or immutable in an application. We show that MapUpdate not only offers good expressiveness, but also allows us to enjoy the benefits of both mutable and immutable abstractions. MapUpdate naturally supports iterative and asynchronous execution, and can use different recovery strategies adaptively according to failure scenarios. We implemented MapUpdate in a system, called Tangram, with novel system designs such as lightweight local task management, partition-based progress control, and context-aware failure recovery. Extensive experiments verified the benefits of Tangram on a variety of workloads including bulk processing, graph analytics, and iterative machine learning.
Smartphones usually have limited storage and runtime memory. Compressed read-only file systems can dramatically decrease the storage used by read-only system resources. However, existing compressed read-only file systems use fixed-sized input compression, which causes significant I/O amplification and unnecessary computation. They also consume excessive runtime memory during decompression and deteriorate the performance when the runtime memory is scarce. In this paper, we describe EROFS, a new compression-friendly read-only file system that leverages fixed-sized output compression and memory-efficient decompression to achieve high performance with little extra memory overhead. We also report our experience of deploying EROFS on tens of millions of smartphones. Evaluation results show that EROFS outperforms existing compressed read-only file systems with various micro-benchmarks and reduces the boot time of real-world applications by up to 22.9% while nearly halving the storage usage.
It is a daunting task for a data scientist to convert sequential code for a Machine Learning (ML) model, published by an ML researcher, to a distributed framework that runs on a cluster and operates on massive datasets. The process of fitting the sequential code to an appropriate programming model and data abstractions determined by the framework of choice requires significant engineering and cognitive effort. Furthermore, inherent constraints of frameworks sometimes lead to inefficient implementations, delivering suboptimal performance.
We show that it is possible to achieve automatic and efficient distributed parallelization of familiar sequential ML code by making a few mechanical changes to it while hiding the details of concurrency control, data partitioning, task parallelization, and fault-tolerance. To this end, we design and implement a new distributed ML framework, STRADS-Automatic Parallelization (AP), and demonstrate that it simplifies distributed ML programming significantly, while outperforming a popular data-parallel framework with a non-familiar programming model, and achieving performance comparable to an ML-specialized framework.
Data compression can not only provide space efficiency with lower Total Cost of Ownership (TCO) but also enhance I/O performance because of the reduced read/write operations. However, lossless compression algorithms with high compression ratio (e.g. gzip) inevitably incur high CPU resource consumption. Prior studies mainly leveraged general-purpose hardware accelerators such as GPU and FPGA to offload costly (de)compression operations for application workloads. This paper investigates ASIC-accelerated compression in file system to transparently benefit all applications running on it and provide high-performance and cost-efficient data storage. Based on Intel® QAT ASIC, we propose QZFS that integrates QAT into ZFS file system to achieve efficient gzip (de)compression offloading at the file system layer. A compression service engine is introduced in QZFS to serve as an algorithm selector and implement compressibility-dependent offloading and selective offloading by source data size. More importantly, a QAT offloading module is designed to leverage the vectored I/O model to reconstruct data blocks, making them able to be used by QAT hardware without incurring extra memory copy. The comprehensive evaluation validates that QZFS can achieve up to 5x write throughput improvement for FIO micro-benchmark and more than 6x cost-efficiency enhancement for genomic data post-processing over the software-implemented alternative.
Reconfiguring NoSQL databases under changing workload patterns is crucial for maximizing database throughput. This is challenging because of the large configuration parameter search space with complex interdependencies among the parameters. While state-of-the-art systems can automatically identify close-to-optimal configurations for static workloads, they suffer for dynamic workloads as they overlook three fundamental challenges: (1) Estimating performance degradation during the reconfiguration process (such as due to database restart). (2) Predicting how transient the new workload pattern will be. (3) Respecting the application’s availability requirements during reconfiguration. Our solution, SOPHIA, addresses all these shortcomings using an optimization technique that combines workload prediction with a cost-benefit analyzer. SOPHIA computes the relative cost and benefit of each reconfiguration step, and determines an optimal reconfiguration for a future time window. This plan specifies when to change configurations and to what, to achieve the best performance without degrading data availability. We demonstrate its effectiveness for three different workloads: a multi-tenant, global-scale metagenomics repository (MG-RAST), a bus-tracking application (Tiramisu), and an HPC data-analytics system, all with varying levels of workload complexity and demonstrating dynamic workload changes. We compare SOPHIA’s performance in throughput and tail-latency over various baselines for two popular NoSQL databases, Cassandra and Redis.
Authors: Zaoxing Liu and Zhihao Bai, Johns Hopkins University; Zhenming Liu, College of William and Mary; Xiaozhou Li, Celer Network; Changhoon Kim, Barefoot Networks; Vladimir Braverman and Xin Jin, Johns Hopkins University; Ion Stoica, UC Berkeley
Load balancing is critical for distributed storage to meet strict service-level objectives (SLOs). It has been shown that a fast cache can guarantee load balancing for a clustered storage system. However, when the system scales out to multiple clusters, the fast cache itself would become the bottleneck. Traditional mechanisms like cache partition and cache replication either result in load imbalance between cache nodes or have high overhead for cache coherence.
We present DistCache, a new distributed caching mechanism that provides provable load balancing for large-scale storage systems. DistCache co-designs cache allocation with cache topology and query routing. The key idea is to partition the hot objects with independent hash functions between cache nodes in different layers, and to adaptively route queries with the power-of-two-choices. We prove that DistCache enables the cache throughput to increase linearly with the number of cache nodes, by unifying techniques from expander graphs, network flows, and queuing theory. DistCache is a general solution that can be applied to many storage systems. We demonstrate the benefits of DistCache by providing the design, implementation, and evaluation of the use case for emerging switch-based caching.
Intel Memory Protection Keys (MPK) is a new hardware primitive to support thread-local permission control on groups of pages without requiring modification of page tables. Unfortunately, its current hardware implementation and software support suffer from security, scalability, and semantic problems: (1) vulnerable to protection-key-use-after-free; (2) providing the limited number of protection keys; and (3) incompatible with mprotect()’s process-based permission model.
In this paper, we propose libmpk, a software abstraction for MPK. It virtualizes the hardware protection keys to eliminate the protection-key-use-after-free problem while providing accesses to an unlimited number of virtualized keys. To support legacy applications, it also provides a lazy inter-thread key synchronization. To enhance the security of MPK itself, libmpk restricts unauthorized writes to its metadata. We apply libmpk to three real-world applications: OpenSSL, JavaScript JIT compiler, and Memcached for memory protection and isolation. Our evaluation shows that it introduces negligible performance overhead (
Authors: Ramnatthan Alagappan and Aishwarya Ganesan, University of Wisconsin—Madison; Eric Lee, University of Texas at Austin; Aws Albarghouthi, University of Wisconsin—Madison; Vijay Chidambaram, University of Texas at Austin;Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau, University of Wisconsin—Madison
We introduce protocol-aware recovery (PAR), a new approach that exploits protocol-specific knowledge to correctly recover from storage faults in distributed systems. We demonstrate the efficacy of PAR through the design and implementation of corruption-tolerant replication (CTRL), a PAR mechanism specific to replicated state machine (RSM) systems. We experimentally show that the CTRL versions of two systems, LogCabin and ZooKeeper, safely recover from storage faults and provide high availability, while the unmodified versions can lose data or become unavailable. We also show that the CTRL versions have little performance overhead.
In Linux device drivers, use-after-free (UAF) bugs can cause system crashes and serious security problems. According to our study of Linux kernel commits, 42% of the driver commits fixing use-after-free bugs involve driver concurrency. We refer to these use-after-free bugs as concurrency use-after-free bugs. Due to the non-determinism of concurrent execution, concurrency use-after-free bugs are often more difficult to reproduce and detect than sequential use-after-free bugs.
In this paper, we propose a practical static analysis approach named DCUAF, to effectively detect concurrency use-after-free bugs in Linux device drivers. DCUAF combines a local analysis analyzing the source code of each driver with a global analysis statistically analyzing the local results of all drivers, forming a local-global analysis, to extract the pairs of driver interface functions that may be concurrently executed. Then, with these pairs, DCUAF performs a summary-based lockset analysis to detect concurrency use-after-free bugs. We have evaluated DCUAF on the driver code of Linux 4.19, and found 640 real concurrency use-after-free bugs. We have randomly selected 130 of the real bugs and reported them to Linux kernel developers, and 95 have been confirmed.
Authors: Ranjita Bhagwan, Rahul Kumar, Chandra Sekhar Maddila, and Adithya Abraham Philip, Microsoft Research India
Today, we depend on numerous large-scale services for basic operations such as email. These services are complex and extremely dynamic as developers continously commit code and introduce new features, fixes and, consequently, new bugs. Hundreds of commits may enter deployment simultaneously. Therefore one of the most time-critical, yet complex tasks towards mitigating service disruption is to localize the bug to the right commit.
This paper presents the concept of differential bug localization that uses a combination of differential code analysis and software provenance tracking to effectively pin-point buggy commits. We have built Orca, a customized code search-engine that implements differential bug localization. Orca is actively being used by the On-Call Engineers (OCEs) of a large enterprise email and collaboration service to localize bugs to the appropriate buggy commits. Our evaluation shows that Orca correctly localizes 77% of bugs for which it has been used. We also show that it causes a 4x reduction in the work done by the OCE.
Modern operating systems are monolithic. Today, however, lack of isolation is one of the main factors undermining security of the kernel. Inherent complexity of the kernel code and rapid development pace combined with the use of unsafe, low-level programming language results in a steady stream of errors. Even after decades of efforts to make commodity kernels more secure, i.e., development of numerous static and dynamic approaches aimed to prevent exploitation of most common errors, several hundreds of serious kernel vulnerabilities are reported every year. Unfortunately, in a monolithic kernel a single exploitable vulnerability potentially provides an attacker with access to the entire kernel.
Modern kernels need isolation as a practical means of confining the effects of exploits to individual kernel subsystems. Historically, introducing isolation in the kernel is hard. First, commodity hardware interfaces provide no support for efficient, fine-grained isolation. Second, the complexity of a modern kernel prevents a naive decomposition effort. Our work on Lightweight Execution Domains (LXDs) takes a step towards enabling isolation in a full-featured operating system kernel. LXDs allow one to take an existing kernel subsystem and run it inside an isolated domain with minimal or no modifications and with a minimal overhead. We evaluate our approach by developing isolated versions of several performance-critical device drivers in the Linux kernel.
Authors: Yizhou Shan, Yutong Huang, Yilun Chen, and Yiying Zhang, Purdue University
The monolithic server model where a server is the unit of deployment, operation, and failure is meeting its limits in the face of several recent hardware and application trends. To improve heterogeneity, elasticity, resource utilization, and failure handling in datacenters, we believe that datacenters should break monolithic servers into disaggregated, network-attached hardware components. Despite the promising benefits of hardware resource disaggregation, no existing OSes or software systems can properly manage it. We propose a new OS model called the splitkernel to manage disaggregated systems. Splitkernel disseminates traditional OS functionalities into loosely-coupled monitors, each of which runs on and manages a hardware component. Using the splitkernel model, we built LegoOS, a new OS designed for hardware resource disaggregation. LegoOS appears to users as a set of distributed servers. Internally, LegoOS cleanly separates processor, memory, and storage devices both at the hardware level and the OS level. We implemented LegoOS from scratch and evaluated it by emulating hardware components using commodity servers. Our evaluation results show that LegoOS’s performance is comparable to monolithic Linux servers, while largely improving resource packing and failure rate over monolithic clusters.
The Spectre family of security vulnerabilities show that speculative execution attacks on modern processors are practical. Spectre variant 2 attacks exploit the speculation of indirect branches, which enable processors to execute code from arbitrary locations in memory. Retpolines serve as the state-of-the-art defense, effectively disabling speculative execution for indirect branches. While retpolines succeed in protecting against Spectre, they come with a significant penalty — 20% on some workloads.
In this paper, we describe and implement an alternative mechanism: the JumpSwitch, which enables speculative execution of indirect branches on safe targets by leveraging indirect call promotion, transforming indirect calls into direct calls. Unlike traditional inlining techniques which apply call promotion at compile time, JumpSwitches aggressively learn targets at runtime and leverages an optimized patching infrastructure to perform just-in-time promotion without the overhead of binary translation.
We designed and optimized JumpSwitches for common patterns we have observed. If a JumpSwitch cannot learn safe targets, we fall back to the safety of retpolines. JumpSwitches seamlessly integrate into Linux, and are evaluated in Linux v4.18. In addition, we have submitted patches enabling JumpSwitch upstream to the Linux community. We show that JumpSwitches can improve performance over retpolines by up to 20% for a range of workloads. In some cases,JumpSwitches even show improvement over a system without retpolines by directing speculative execution into direct calls just-in-time and reducing mispredictions.
Take part in discussions with your colleagues over complimentary food and drinks. Posters of all papers presented in the Technical Sessions on Wednesday and on Thursday morning will be on display.
The Museum invites you to celebrate the 50th Anniversary of UNIX and associated Internet technologies in its latest exhibit update. Enjoy full museum access as well as light refreshments at the celebration of this computing milestone with conversation and live demos. Admission is free for USENIX conference attendees. While USENIX is not responsible for transportation to and from the event, we are happy to provide a limited number of Lyft coupons for transportation; if you choose to use a coupon, please share rides with your fellow attendees and use code UNIX50TH in the “Promos” section of the Lyft app.
Authors: Yatish Turakhia and Gill Bejerano, Stanford University; William J. Dally, Stanford University and NVIDIA Research
Genomics is transforming medicine and our understanding of life in fundamental ways. Genomics data, however, is far outpacing Moore's Law. Third-generation sequencing technologies produce 100X longer reads than second generation technologies and reveal a much broader mutation spectrum of disease and evolution. However, these technologies incur prohibitively high computational costs. Over 1,300 CPU hours are required for reference-guided assembly of the human genome, and over 15,600 CPU hours are required for de novo assembly. This paper describes "Darwin," a co-processor for genomic sequence alignment that, without sacrificing sensitivity, provides up to $15,000X speedup over the state-of-the-art software for reference-guided assembly of third-generation reads. Darwin achieves this speedup through hardware/algorithm co-design, trading more easily accelerated alignment for less memory-intensive filtering, and by optimizing the memory system for filtering. Darwin combines a hardware-accelerated version of D-SOFT, a novel filtering algorithm, alignment at high speed, and with a hardware-accelerated version of GACT, a novel alignment algorithm. GACT generates near-optimal alignments of arbitrarily long genomic sequences using constant memory for the compute-intensive step. Darwin is adaptable, with tunable speed and sensitivity to match emerging sequencing technologies and to meet the requirements of genomic applications beyond read assembly.
Modern high-speed devices (e.g., network adapters, storage, accelerators) use new host interfaces, which expose multiple software queues directly to the device. These multi-queue interfaces allow mutually distrusting applications to access the device without any cross-core interaction, enabling throughput in the order of millions of IOP/s on multicore systems. Unfortunately, while independent device access is scalable, it also introduces a new problem: unfairness. Mechanisms that were used to provide fairness for older devices are no longer tenable in the wake of multi-queue design, and straightforward attempts to re-introduce it would require cross-core synchronization that undermines the scalability for which multiple queues were designed.
To address these challenges, we present Multi-Queue Fair Queueing (MQFQ), the first fair, work-conserving scheduler suitable for multi-queue systems. Specifically, we (1) reformulate a classical fair queueing algorithm to accommodate multi-queue designs, and (2) describe a scalable implementation that bounds potential unfairness while minimizing synchronization overhead. Our implementation of MQFQ in Linux 4.15 demonstrates both fairness and high throughput. Evaluation with an NVMe over RDMA fabric (NVMf) device shows that MQFQ can reach up to 3.1 Million IOP/s on a single machine — $20\times$ higher than the state-of-the-art Linux Budget Fair Queueing. Compared to a system with no fairness, MQFQ reduces the slowdown caused by an antagonist from $3.78\times$ to $1.33\times$ for the FlashX workload and from $6.57\times$ to $1.03\times$ for the Aerospike workload ($2\times$ is considered "fair" slowdown).
Authors: Nathan Chong, Arm; Tyler Sorensen and John Wickerson, Imperial College London
Weak memory models provide a complex, system-centric semantics for concurrent programs, while transactional memory (TM) provides a simpler, programmer-centric semantics. Both have been studied in detail, but their combined semantics is not well understood. This is problematic because such widely-used architectures and languages as x86, Power, and C++ all support TM, and all have weak memory models.
Our work aims to clarify the interplay between weak memory and TM by extending existing axiomatic weak memory models (x86, Power, ARMv8, and C++) with new rules for TM. Our formal models are backed by automated tooling that enables (1) the synthesis of tests for validating our models against existing implementations and (2) the model-checking of TM-related transformations, such as lock elision and compiling C++ transactions to hardware. A key finding is that a proposed TM extension to ARMv8 currently being considered within ARM Research is incompatible with lock elision without sacrificing portability or performance.
Designers of modern reader-writer locks confront a difficult trade-off related to reader scalability. Locks that have a compact memory representation for active readers will typically suffer under high intensity read-dominated workloads when the "reader indicator" state is updated frequently by a diverse set of threads, causing cache invalidation and coherence traffic. Other designs use distributed reader indicators, one per NUMA node, per core or even per thread. This improves reader-reader scalability, but also increases the size of each lock instance and creates overhead for writers.
We propose a simple transformation, BRAVO, that augments any existing reader-writer lock, adding just two integer fields to the lock instance. Readers make their presence known to writers by hashing their thread's identity with the lock address, forming an index into a visible readers table and installing the lock address into the table. All locks and threads in an address space can share the same readers table. Crucially, readers of the same lock tend to write to different locations in the table, reducing coherence traffic. Therefore, BRAVO can augment a simple compact lock to provide scalable concurrent reading, but with only modest and constant increase in memory footprint.
We implemented BRAVO in user-space, as well as integrated it with the Linux kernel reader-writer semaphore (rwsem). Our evaluation with numerous benchmarks and real applications, both in user and kernel-space, demonstrate that BRAVO improves performance and scalability of underlying locks in read-heavy workloads while introducing virtually no overhead, including in workloads in which writes are frequent.
Distinguished Paper Award and 2018 Internet Defense Prize at USENIX Security '18
Authors: Gertjan Franken, Tom Van Goethem, and Wouter Joosen, imec-DistriNet, KU Leuven
Nowadays, cookies are the most prominent mechanism to identify and authenticate users on the Internet. Although protected by the Same Origin Policy, popular browsers include cookies in all requests, even when these are cross-site. Unfortunately, these third-party cookies enable both cross-site attacks and third-party tracking. As a response to these nefarious consequences, various countermeasures have been developed in the form of browser extensions or even protection mechanisms that are built directly into the browser.
In this paper, we evaluate the effectiveness of these defense mechanisms by leveraging a framework that automatically evaluates the enforcement of the policies imposed to third-party requests. By applying our framework, which generates a comprehensive set of test cases covering various web mechanisms, we identify several flaws in the policy implementations of the 7 browsers and 46 browser extensions that were evaluated. We find that even built-in protection mechanisms can be circumvented by multiple novel techniques we discover. Based on these results, we argue that our proposed framework is a much-needed tool to detect bypasses and evaluate solutions to the exposed leaks. Finally, we analyze the origin of the identified bypass techniques, and find that these are due to a variety of implementation, configuration and design flaws.
In storage systems, cuckoo hash tables have been widely used to support fast query services. For a read, the cuckoo hashing delivers real-time access with $O(1)$ lookup complexity via open-addressing approach. For a write, most concurrent cuckoo hash tables fail to efficiently address the problem of endless loops during item insertion due to the essential property of hash collisions. The asymmetric feature of cuckoo hashing exhibits fast-read-slow-write performance, which often becomes the bottleneck from single-thread writes. In order to address the problem of asymmetric performance and significantly improve the write/insert efficiency, we propose an optimized Concurrent Cuckoo hashing scheme, called CoCuckoo. To predetermine the occurrence of endless loops, CoCuckoo leverages a directed pseudoforest containing several subgraphs to leverage the cuckoo paths that represent the relationship among items. CoCuckoo exploits the observation that the insertion operations sharing a cuckoo path access the same subgraph, and hence a lock is needed for ensuring the correctness of concurrency control via allowing only one thread to access the shared path at a time; Insertion operations accessing different subgraphs are simultaneously executed without collisions. CoCuckoo improves the throughput performance by a graph-grained locking to support concurrent writes and reads. We have implemented all components of CoCuckoo and extensive experiments using the YCSB benchmark have demonstrated the efficiency and efficacy of our proposed scheme.
With rising network rates, cloud vendors increasingly deploy FPGA-based SmartNICs (F-NICs), leveraging their inline processing capabilities to offload hypervisor networking infrastructure. However, the use of F-NICs for accelerating general-purpose server applications in clouds has been limited.
NICA is a hardware-software co-designed framework for inline acceleration of the application data plane on F-NICs in multi-tenant systems. A new ikernel programming abstraction, tightly integrated with the network stack, enables application control of F-NIC computations that process application network traffic, with minimal code changes. In addition, NICA’s virtualization architecture supports fine-grain time-sharing of F-NIC logic and provides I/O path virtualization. Together these features enable cost-effective sharing of F-NICs across virtual machines with strict performance guarantees.
We prototype NICA on Mellanox F-NICs and integrate ikernels with the high-performance VMA network stack and the KVM hypervisor. We demonstrate significant acceleration of real-world applications in both bare-metal and virtualized environments, while requiring only minor code modifications to accelerate them on F-NICs. For example, a transparent key-value store cache ikernel added to the stock memcached server reaches 40 Gbps server throughput (99% line-rate) at 6 μs 99th-percentile latency for 16-byte key-value pairs, which is 21× the throughput of a 6-core CPU with a kernel-bypass network stack. The throughput scales linearly for up to 6 VMs running independent instances of memcached.
With high computation power and memory bandwidth, graphics processing units (GPUs) lend themselves to accelerate data-intensive analytics, especially when such applications fit the single instruction multiple data (SIMD) model. However, graph algorithms such as breadth-first search and k-core, often fail to take full advantage of GPUs, due to irregularity in memory access and control flow. To address this challenge, we have developed SIMD-X, for programming and processing of single instruction multiple, complex, data on GPUs. Specifically, the new Active-Compute-Combine (ACC) model not only provides ease of programming to programmers, but more importantly creates opportunities for system-level optimizations. To this end, SIMD-X utilizes just-in-time task management which filters out inactive vertices at runtime and intelligently maps various tasks to different amount of GPU cores in pursuit of workload balancing. In addition, SIMD-X leverages push-pull based kernel fusion that, with the help of a new deadlock-free global barrier, reduces a large number of computation kernels to very few. Using SIMD-X, a user can program a graph algorithm in tens of lines of code, while achieving 3x, 6x, 24x, 3x speedup over Gunrock, Galois, CuSha, and Ligra, respectively.
We investigate the use of SmartNIC-accelerated servers to execute microservice-based applications in the data center. By offloading suitable microservices to the SmartNIC’s low-power processor, we can improve server energy-efficiency without latency loss. However, as a heterogeneous computing substrate in the data path of the host, SmartNICs bring several challenges to a microservice platform: network traffic routing and load balancing, microservice placement on heterogeneous hardware, and contention on shared SmartNIC resources. We present E3, a microservice execution platform for SmartNIC-accelerated servers. E3 follows the design philosophies of the Azure Service Fabric microservice platform and extends key system components to a SmartNIC to address the above-mentioned challenges. E3 employs three key techniques: ECMP-based load balancing via SmartNICs to the host, network topology-aware microservice placement, and a data-plane orchestrator that can detect SmartNIC overload. Our E3 prototype using Cavium LiquidIO SmartNICs shows that SmartNIC offload can improve cluster energy-efficiency up to 3× and cost efficiency up to 1.9× at up to 4% latency cost for common microservices, including real-time analytics, an IoT hub, and virtual network functions.
We present INSIDER, a full-stack redesigned storage system to help users fully utilize the performance of emerging storage drives with moderate programming efforts. On the hardware side, INSIDER introduces an FPGA-based reconfigurable drive controller as the in-storage computing (ISC) unit; it is able to saturate the high drive performance while retaining enough programmability. On the software side, INSIDER integrates with the existing system stack and provides effective abstractions. For the host programmer, we introduce virtual file abstraction to abstract ISC as file operations; this hides the existence of the drive processing unit and minimizes the host code modification to leverage the drive computing capability. By separating out the drive processing unit to the data plane, we expose a clear drive-side interface so that drive programmers can focus on describing the computation logic; the details of data movement between different system components are hidden. With the software/hardware co-design, INSIDER runtime provides crucial system support. It not only transparently enforces the isolation and scheduling among offloaded programs, but it also protects the drive data from being accessed by unwarranted programs.
We build an INSIDER drive prototype and implement its corresponding software stack. The evaluation shows that INSIDER achieves an average 12X performance improvement and 31X accelerator cost efficiency when compared to the existing ARM-based ISC system. Additionally, it requires much less effort when implementing applications. INSIDER is open-sourced, and we have adapted it to the AWS F1 instance for public access.
Recent deep learning models have moved beyond low dimensional regular grids such as image, video, and speech, to high-dimensional graph-structured data, such as social networks, e-commerce user-item graphs, and knowledge graphs. This evolution has led to large graph-based neural network models that go beyond what existing deep learning frameworks or graph computing systems are designed for. We present NeuGraph, a new framework that bridges the graph and dataflow models to support efficient and scalable parallel neural network computation on graphs. NeuGraph introduces graph computation optimizations into the management of data partitioning, scheduling, and parallelism in dataflow-based deep learning frameworks. Our evaluation shows that, on small graphs that can fit in a single GPU, NeuGraph outperforms state-of-the-art implementations by a significant margin, while scaling to large real-world graphs that none of the existing frameworks can handle directly with GPUs.
Data analysis and retrieval is a widely-used component in existing artificial intelligence systems. However, each request has to go through each layer across the I/O stack, which moves tremendous irrelevant data between secondary storage, DRAM, and the on-chip cache. This leads to high response latency and rising energy consumption. To address this issue, we propose Cognitive SSD, an energy-efficient engine for deep learning based unstructured data retrieval. In Cognitive SSD, a flash-accessing accelerator named DLG-x is placed by the side of flash memory to achieve near-data deep learning and graph search. Such functions of in-SSD deep learning and graph search are exposed to the users as library APIs via NVMe command extension. Experimental results on the FPGA-based prototype reveal that the proposed Cognitive SSD reduces latency by 69.9% on average in comparison with CPU based solutions on conventional SSDs, and it reduces the overall system power consumption by up to 34.4% and 63.0% respectively when compared to CPU and GPU based solutions that deliver comparable performance.
State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing; University of Chinese Academy of Sciences
State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing; University of Chinese Academy of Sciences
State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing; University of Chinese Academy of Sciences
State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing; University of Chinese Academy of Sciences
Many important graph algorithms are based on the breadth first search (BFS) approach, which builds itself on recursive vertex traversal. We classify algorithms that share this characteristic into what we call a BFS-like algorithm. In this work, we first analyze and study the I/O request patterns of BFS-like algorithms executed on disk-based graph engines. Our analysis exposes two shortcomings in executing BFS-like algorithms. First, we find that the use of the cache is ineffective. To make use of the cache more effectively, we propose an in-memory static cache, which we call BFS-Aware Static Cache or Basc, for short. Basc is static as its contents, which are edge lists of vertices that are pre-selected before algorithm execution, do not change throughout the execution of the algorithm. Second, we find that state-of-the-art ordering method for graphs on disks is ineffective with BFS-like algorithms. Thus, based on an I/O cost model that estimates the performance based on the ordering of graphs, we develop an efficient graph ordering called Neighborhood Ordering or Norder. We provide extensive evaluations of Basc and Norder on two well-known graph engines using five real-world graphs including Twitter that has 1.9 billion edges. Our experimental results show that Basc and Norder, collectively have substantial performance impact.
We present gg, a framework and a set of command-line tools that helps people execute everyday applications—e.g., software compilation, unit tests, video encoding, or object recognition—using thousands of parallel threads on a cloud-functions service to achieve near-interactive completion time. In the future, instead of running these tasks on a laptop, or keeping a warm cluster running in the cloud, users might push a button that spawns 10,000 parallel cloud functions to execute a large job in a few seconds from start. gg is designed to make this practical and easy.
With gg, applications express a job as a composition of lightweight OS containers that are individually transient (lifetimes of 1–60 seconds) and functional (each container is hermetically sealed and deterministic). gg takes care of instantiating these containers on cloud functions, loading dependencies, minimizing data movement, moving data between containers, and dealing with failure and stragglers.
We ported several latency-sensitive applications to run on gg and evaluated its performance. In the best case, a distributed compiler built on gg outperformed a conventional tool (icecc) by 2–5×, without requiring a warm cluster running continuously. In the worst case, gg was within 20% of the hand-tuned performance of an existing tool for video encoding (ExCamera).
While it is compelling to process large streams of IoT data on the cloud edge, doing so exposes the data to a sophisticated, vulnerable software stack on the edge and hence security threats. To this end, we advocate isolating the data and its computations in a trusted execution environment (TEE) on the edge, shielding them from the remaining edge software stack which we deem untrusted.
This approach faces two major challenges: (1) executing high-throughput, low-delay stream analytics in a single TEE, which is constrained by a low trusted computing base (TCB) and limited physical memory; (2) verifying execution of stream analytics as the execution involves untrusted software components on the edge. In response, we present StreamBox-TZ (SBT), a stream analytics engine for an edge platform that offers strong data security, verifiable results, and good performance. SBT contributes a data plane designed and optimized for a TEE based on ARM TrustZone. It supports continuous remote attestation for analytics correctness and result freshness while incurring low overhead. SBT only adds 42.5 KB executable to the TCB (16% of the entire TCB). On an octa core ARMv8 platform, it delivers the state-of-the-art performance by processing input events up to 140 MB/sec (12M events/sec) with sub-second delay. The overhead incurred by SBT’s security mechanism is less than 25%.
As network, I/O, accelerator, and NVM devices capable of a million operations per second make their way into data centers, the software stack managing such devices has been shifting from implementations within the operating system kernel to more specialized kernel-bypass approaches. While the in-kernel approach guarantees safety and provides resource multiplexing, it imposes too much overhead on microsecond-scale tasks. Kernel-bypass approaches improve throughput substantially but sacrifice safety and complicate resource management: if applications are mutually distrusting, then either each application must have exclusive access to its own device or else the device itself must implement resource management.
This paper shows how to attain both safety and performance via intra-process isolation for data plane libraries. We propose protected libraries as a new OS abstraction which provides separate user-level protection domains for different services (e.g., network and in-memory database), with performance approaching that of unprotected kernel bypass. We also show how this new feature can be utilized to enable sharing of data plane libraries across distrusting applications. Our proposed solution uses Intel's memory protection keys (PKU) in a safe way to change the permissions associated with subsets of a single address space. In addition, it uses hardware watchpoints to delay asynchronous event delivery and to guarantee independent failure of applications sharing a protected library.
We show that our approach can efficiently protect high-throughput in-memory databases and user-space network stacks. Our implementation allows up to 2.3 million library entrances per second per core, outperforming both kernel-level protection and two alternative implementations that use system calls and Intel's VMFUNC switching of user-level address spaces, respectively.
Hardware secure enclaves are increasingly used to run complex applications. Unfortunately, existing and emerging enclave architectures do not allow secure and efficient implementation of custom page fault handlers. This limitation impedes in-enclave use of secure memory-mapped files and prevents extensions of the application memory layer commonly used in untrusted systems, such as transparent memory compression or access to remote memory.
CoSMIX is a Compiler-based system for Secure Memory Instrumentation and eXecution of applications in secure enclaves. A novel memory store abstraction allows implementation of application-level secure page fault handlers that are invoked by a lightweight enclave runtime. The CoSMIX compiler instruments the application memory accesses to use one or more memory stores, guided by a global instrumentation policy or code annotations without changing application code.
The CoSMIX prototype runs on Intel SGX and is compatible with popular SGX execution environments, including SCONE and Graphene. Our evaluation of several production applications shows how CoSMIX improves their security and performance by recompiling them with appropriate memory stores. For example, unmodified Redis and Memcached key-value stores achieve about 2× speedup by using a self-paging memory store while working on datasets up to 6× larger than the enclave’s secure memory. Similarly, annotating a single line of code in a biometric verification server changes it to store its sensitive data in Oblivious RAM and makes it resilient against SGX side-channel attacks.
System-level Dynamic Binary Translation (DBT) provides the capability to boot an Operating System (OS) and execute programs compiled for an Instruction Set Architecture (ISA) different to that of the host machine. Due to their performance-critical nature, system-level DBT frameworks are typically hand-coded and heavily optimized, both for their guest and host architectures. While this results in good performance of the DBT system, engineering costs for supporting a new, or extending an existing architecture are high. In this paper we develop a novel, retargetable DBT hypervisor, which includes guest specific modules generated from high-level guest machine specifications. Our system simplifies retargeting of the DBT, but it also delivers performance levels in excess of existing manually created DBT solutions. We achieve this by combining offline and online optimizations, and exploiting the freedom of a Just-in-time (JIT) compiler operating in a bare-metal environment provided by a Virtual Machine. We evaluate our DBT using both targeted micro-benchmarks as well as standard application benchmarks, and we demonstrate its ability to outperform the de-facto standard Qemu DBT system. Our system delivers an average speedup of 2.21x over Qemu across SPEC CPU2006 integer benchmarks running in a full-system Linux OS environment, compiled for the 64-bit ARMv8-A ISA, and hosted on an x86-64 platform. For floating-point applications the speedup is even higher, reaching 6.49x on average. We demonstrate that our system-level DBT system significantly reduces the effort required to support a new ISA, while delivering outstanding performance.
Trusted Execution Environments (TEEs), such as Intel SGX’s enclave, use hardware to ensure the confidentiality and integrity of operations on sensitive data. While the technology is widely available, the complexity of its programming model and its performance overhead have limited adoption. TEEs provide a new and valuable hardware functionality that has no obvious analogue in programming languages, which means that developers must manually partition their application into trusted and untrusted components.
This paper describes an approach that fully integrates trusted execution in a language-appropriate manner. We extend the Go language to allow a programmer to execute a goroutine within an enclave, to use low-overhead channels to communicate between the trusted and untrusted environments, and to rely on a compiler to automatically extract the secure code and data. Our prototype compiler and runtime, GOTEE , is a backward-compatible fork of the Go compiler.
The evaluation shows that our compiler-driven code and data partitioning efficiently executes both microbenchmarks and applications. On the former, GOTEE achieves a 5.2x throughput, and a 2.3x latency improvement over the Intel SGX SDK. Our case studies, the Go tls package and a secured keystore inspired by the go-ethereum project, show that minor source-code modifications suffice to provide confidentiality and integrity guarantees with only moderate performance overheads.
Multi-tenant cloud computing provides great benefits in terms of resource sharing, elastic pricing, and scalability, however, it also changes the security landscape and introduces the need for strong isolation between the tenants, also inside the network. This paper is motivated by the observation that while multi-tenancy is widely used in cloud computing, the virtual switch designs currently used for network virtualization lack sufficient support for tenant isolation. Hence, we present, implement, and evaluate a virtual switch architecture, MTS, which brings secure design best-practice to the context of multi-tenant virtual networking: compartmentalization of virtual switches, least-privilege execution, complete mediation of all network communication, and reducing the trusted computing base shared between tenants. We build MTS from commodity components, providing an incrementally deployable and inexpensive upgrade path to cloud operators. Our extensive experiments, extending to both micro-benchmarks and cloud applications, show that, depending on the way it is deployed, MTS may produce 1.5-2x the throughput compared to state-of-the-art, with similar or better latency and modest resource overhead (1 extra CPU). MTS is available as open source software.
SecCloud is a new architecture for bare-metal clouds that enables tenants to control tradeoffs between security, price, and performance. It enables security-sensitive tenants to minimize their trust in the public cloud provider and achieve similar levels of security and control that they can obtain in their own private data centers, while not imposing overhead on tenants that are security insensitive and not compromising the flexibility or operational efficiency of the provider. Our prototype exploits a novel provisioning system and specialized firmware to enable elasticity similar to virtualized clouds. Experimentally we quantify the cost of different levels of security for a variety of workloads and demonstrate the value of giving control to the tenant.
Today's ultra-low latency SSDs can deliver an I/O latency of sub-ten microseconds. With this dramatically shrunken device time, operations inside the kernel I/O stack, which were traditionally considered lightweight, are no longer a negligible portion. This motivates us to reexamine the storage I/O stack design and propose an asynchronous I/O stack (AIOS), where synchronous operations in the I/O path are replaced by asynchronous ones to overlap I/O-related CPU operations with device I/O. The asynchronous I/O stack leverages a lightweight block layer specialized for NVMe SSDs using the page cache without block I/O scheduling and merging, thereby reducing the sojourn time in the block layer. We prototype the proposed asynchronous I/O stack on the Linux kernel and evaluate it with various workloads. Synthetic FIO benchmarks demonstrate that the application-perceived I/O latency falls into single-digit microseconds for 4 KB random reads on Optane SSD, and the overall I/O latency is reduced by 15-33% across varying block sizes. This I/O latency reduction leads to a significant performance improvement of real-world applications as well: 11-44% IOPS increase on RocksDB and 15-30% throughput improvement on Filebench and OLTP workloads.
Storage on smart devices such as smartphones and the Internet of Things has limited performance, capacity, and endurance. Deduplication has the potential to address these limitations by eliminating redundant I/Os and data, but it must be considered under the various resource constraints of the devices. This paper presents SmartDedup, a deduplication solution optimized for resource-constrained devices. It proposes a novel architecture that supports symbiotic in-line and out-of-line deduplication to take advantage of their complementary strengths and allow them to be adapted according to a device's current resource availability. It also cohesively combines in-memory and on-disk fingerprint stores to minimize the memory overhead while achieving a good level of deduplication. SmartDedup is prototyped on EXT4 and F2FS and evaluated using benchmarks, workloads generated from real-world device images, and traces collected from real-world devices. The results show that SmartDedup substantially improves I/O performance (e.g., increases write and read throughput by 31.1% and 32%, respectively for an FIO experiment with 25% deduplication ratio), reduces flash writes (e.g., by 70.9% in a trace replay experiment with 72.4% deduplication ratio) and saves space usage (e.g., by 45% in a DEDISbench experiment with 46.1% deduplication ratio) with low memory, storage, and battery overhead, compared to both native file systems and related deduplication solutions.
Performance and efficiency requirements are driving a trend towards specialized accelerators in both datacenters and embedded devices. In order to cut down communication overheads, system components are pinned to cores and fast-path communication between them is established. These fast paths reduce latency by avoiding indirections through the operating system. However, we see three roadblocks that can impede further gains: First, accelerators today need to be assisted by a general-purpose core, because they cannot autonomously access operating system services like file systems or network stacks. Second, fast-path communication is at odds with preemptive context switching, which is still necessary today to improve efficiency when applications underutilize devices. Third, these concepts should be kept orthogonal, such that direct and unassisted communication is possible between any combination of accelerators and general-purpose cores. At the same time, all of them should support switching between multiple application contexts, which is most difficult with accelerators that lack the hardware features to run an operating system.
We present M³x, a system architecture that removes these roadblocks. M³x retains the low overhead of fast-path communication while enabling context switching for general-purpose cores and specialized accelerators. M³x runs accelerators autonomously and achieves a speedup of 4.7 for PCIe-attached image-processing accelerators compared to traditional assisted operation. At the same time, utilization of the host CPU is reduced by a factor of 30.
Data Domain has added a cloud tier capability to its onpremises storage appliance, allowing clients to achieve the cost benefits of deduplication in the cloud. While there were many architectural changes necessary to support a cloud tier in a mature storage product, in this paper, we focus on innovations needed to support key functionality for customers. Consider typical customer interactions: First, a customer determines which files to migrate to the cloud by estimating how much space will be freed on the on-premises Data Domain appliance. Second, a customer transfers selected files to the cloud and later restores files back. Finally, a customer deletes a file in the cloud when its retention period has expired. Each of these operations requires significant architectural changes and new algorithms to address both the impact of deduplicated storage and the latency and expense of cloud object storage. We also present analysis from deployed cloud tier systems. As an example, some customers have moved more than 20PB of logical data to the cloud tier and achieved a total compression factor (deduplication * local compression) of 40× or more, resulting in millions of dollars of cost savings.
We propose a principled approach to integrating GPU memory with an OS page cache. We design GAIA, a weakly-consistent page cache that spans CPU and GPU memories. GAIA enables the standard mmap system call to map files into the GPU address space, thereby enabling data-dependent GPU accesses to large files and efficient write-sharing between the CPU and GPUs. Under the hood, GAIA (1) integrates lazy release consistency protocol into the OS page cache while maintaining backward compatibility with CPU processes and unmodified GPU kernels; (2) improves CPU I/O performance by using data cached in GPU memory, and (3) optimizes the readahead prefetcher to support accesses to files cached in GPUs. We prototype GAIA in Linux and evaluate it on NVIDIA Pascal GPUs. We show up to 3× speedup in CPU file I/O and up to 8× in unmodified realistic workloads such as Gunrock GPU-accelerated graph processing, image collage, and microscopy image stitching.
Many real-world data like social, transportation, biology, and communication data can be efficiently modeled as a graph. Hence, graph traversal such as multi-hop or graph-walking queries has been key operations atop graph stores. However, since different graph traversals may touch different sets of data, it is hard or even impossible to have a one-size-fits-all graph partitioning algorithm that preserves access locality for various graph traversal workloads. Meanwhile, prior shard-based migration faces a dilemma such that coarse-grained migration may incur more migration overhead over increased locality benefits, while fine-grained migration usually requires excessive metadata and incurs non-trivial maintenance cost. This paper proposes Pragh, an efficient locality-preserving live graph migration scheme for graph store in the form of key-value pairs. The key idea of Pragh is a split migration model which only migrates values physically while retains keys in the initial location. This allows fine-grained migration while avoiding the need to maintain excessive metadata. Pragh integrates an RDMA-friendly location cache from DrTM-KV to provide fully-localized accesses to migrated data and further makes a novel reuse of the cache replacement policy for lightweight monitoring. Pragh further supports evolving graphs through a check-and-forward mechanism to resolve the conflict between updates and migration of graph data. Evaluations on an 8-node RDMA-capable cluster using a representative graph traversal benchmark show that Pragh can increase the throughput by up to 19× and decrease the median latency by up to 94%, thanks to split live migration that eliminates 97% remote accesses. A port of split live migration to Wukong with up to 2.53× throughput improvement further confirms the effectiveness and generality of Pragh.
Smart devices see a large number of ephemeral tasks driven by background activities. In order to execute such a task, the OS kernel wakes up the platform beforehand and puts it back to sleep afterwards. In doing so, the kernel operates various IO devices and orchestrates their power state transitions. Such kernel executions are inefficient as they mismatch typical CPU hardware. They are better off running on a low-power, microcontroller-like core, i.e., peripheral core, relieving CPU from the inefficiency.
We therefore present a new OS structure, in which a lightweight virtual executor called transkernel offloads specific phases from a monolithic kernel. The transkernel translates stateful kernel execution through cross-ISA, dynamic binary translation (DBT); it emulates a small set of stateless kernel services behind a narrow, stable binary interface; it specializes for hot paths; it exploits ISA similarities for lowering DBT cost.
Through an ARM-based prototype, we demonstrate transkernel’s feasibility and benefit. We show that while cross-ISA DBT is typically used under the assumption of efficiency loss, it can enable efficiency gain, even on off-the-shelf hardware.
LSM-tree based key-value (KV) stores suffer from severe read amplification because searching a key requires to check multiple SSTables. To reduce extra I/Os, Bloom filters are usually deployed in KV stores to improve read performance. However, Bloom filters suffer from false positive, and simply enlarging the size of Bloom filters introduces large memory overhead, so it still causes extra I/Os in memory-constrained systems. In this paper, we observe that access skewness is very common among SSTables or even small-sized segments within each SSTable. To leverage this skewness feature, we develop ElasticBF, a fine-grained heterogeneous Bloom filter management scheme with dynamic adjustment according to data hotness. ElasticBF is orthogonal to the works optimizing the architecture of LSM-tree based KV stores, so it can be integrated to further speed up their read performance. We build ElasticBF atop of LevelDB, RocksDB, and PebblesDB, and our experimental results show that ElasticBF increases the read throughput of the above KV stores to 2.34x, 2.35x, and 2.58x, respectively, while keeps almost the same write and range query performance.
Denial of service (DoS) attacks increasingly exploit algorithmic, semantic, or implementation characteristics dormant in victim applications, often with minimal attacker resources. Practical and efficient detection of these asymmetric DoS attacks requires us to (i) catch offending requests in-flight, before they consume a critical amount of resources, (ii) remain agnostic to the application internals, such as the programming language or runtime system, and (iii) introduce low overhead in terms of both performance and programmer effort.
This paper introduces Finelame, a language-independent framework for detecting asymmetric DoS attacks. Finelame leverages operating system visibility across the entire software stack to instrument key resource allocation and negotiation points. It leverages recent advances in the Linux extended Berkeley Packet Filter virtual machine to attach application-level interposition probes to key request processing functions, and lightweight resource monitors---user/kernel-level probes---to key resource allocation functions. The data collected is used to train a model of resource utilization that occurs throughout the lifetime of individual requests. The model parameters are then shared with the resource monitors, which use them to catch offending requests in-flight, inline with resource allocation. We demonstrate that Finelame can be integrated with legacy applications with minimal effort, and that it is able to detect resource abuse attacks much earlier than their intended completion time while posing low performance overheads.
LSM-based KV stores are designed to offer good write performance, by capturing client writes in memory, and only later flushing them to storage. Writes are later compacted into a tree-like data structure on disk to improve read performance and to reduce storage space use. It has been widely documented that compactions severely hamper throughput. Various optimizations have successfully dealt with this problem. These techniques include, among others, rate-limiting flushes and compactions, selecting among compactions for maximum effect, and limiting compactions to the highest level by so-called fragmented LSMs.
In this paper we focus on latencies rather than throughput. We first document the fact that LSM KVs exhibit high tail latencies. The techniques that have been proposed for optimizing throughput do not address this issue, and in fact in some cases exacerbate it. The root cause of these high tail latencies is interference between client writes, flushes and compactions. We then introduce the notion of an I/O scheduler for an LSM-based KV store to reduce this interference. We explore three techniques as part of this I/O scheduler: 1) opportunistically allocating more bandwidth to internal operations during periods of low load, 2) prioritizing flushes and compactions at the lower levels of the tree, and 3) preempting compactions.
SILK is a new open-source KV store that incorporates this notion of an I/O scheduler. SILK is derived from RocksDB, but the concepts can be applied to other LSM-based KV stores. We use both a production workload at Nutanix and synthetic benchmarks to demonstrate that SILK achieves up to two orders of magnitude lower 99th percentile latencies than RocksDB and TRIAD, without any significant negative effects on other performance metrics.
Capabilities provide an efficient and secure mechanism for fine-grained resource management and protection. However, as the modern hardware architectures continue to evolve with large numbers of non-coherent and heterogeneous cores, we focus on the following research question: can capability systems scale to modern hardware architectures? In this work, we present a scalable capability system to drive future systems with many non-coherent heterogeneous cores. More specifically, we have designed a distributed capability system based on a HW/SW co-designed capability system. We analyzed the pitfalls of distributed capability operations running concurrently and built the protocols in accordance with the insights. We have incorporated these distributed capability management protocols in a new microkernel-based OS called SemperOS. Our OS operates the system by means of multiple microkernels, which employ distributed capabilities to provide an efficient and secure mechanism for fine-grained access to system resources. In the evaluation we investigated the scalability of our algorithms and run applications (Nginx, LevelDB, SQLite, PostMark, etc.), which are heavily dependent on the OS services of SemperOS. The results indicate that there is no inherent scalability limitation for capability systems. Our evaluation shows that we achieve a parallel efficiency of 70% to 78% when examining a system with 576 cores executing 512 application instances while using 11% of the system’s cores for OS services.
Efficiently exchanging temporary data between tasks is critical to the end-to-end performance of many data processing frameworks and applications. Unfortunately, the diverse nature of temporary data creates storage demands that often fall between the sweet spots of traditional storage platforms, such as file systems or key-value stores.
We present NodeKernel, a novel distributed storage architecture that offers a convenient new point in the design space by fusing file system and key-value semantics in a common storage kernel while leveraging modern networking and storage hardware to achieve high performance and cost-efficiency. NodeKernel provides hierarchical naming, high scalability, and close to bare-metal performance for a wide range of data sizes and access patterns that are characteristic of temporary data. We show that storing temporary data in Crail, our concrete implementation of the NodeKernel architecture which uses RDMA networking with tiered DRAM/NVMe-Flash storage, improves NoSQL workload performance by up to 4.8× and Spark application performance by up to 3.4×. Furthermore, by storing data across NVMe Flash and DRAM storage tiers, Crail reduces storage cost by up to 8× compared to DRAM-only storage systems.
Mingle with fellow attendees at the Poster Session and Reception. Enjoy dinner, drinks, and the chance to connect with other attendees, speakers, and conference organizers. Posters of the papers presented in the Technical Sessions on Thursday afternoon and on Friday will be on display. View the list of accepted posters.
As solid state drives (SSDs) are increasingly replacing hard disk drives, the reliability of storage systems depends on the failure modes of SSDs and the ability of the file system layered on top to handle these failure modes. While the classical paper on IRON File Systems provides a thorough study of the failure policies of three file systems common at the time, we argue that 13 years later it is time to revisit file system reliability with SSDs and their reliability characteristics in mind, based on modern file systems that incorporate journaling, copy-on-write and log-structured approaches, and are optimized for flash. This paper presents a detailed study, spanning ext4, Btrfs and F2FS, and covering a number of different SSD error modes. We develop our own fault injection framework and explore over a thousand error cases. Our results indicate that 16\% of these cases result in a file system that cannot be mounted or even repaired by its system checker. We also identify the key file system metadata structures that can cause such failures and finally, we recommend some design guidelines for file systems that are deployed on top of SSDs.
Remote Procedure Calls are widely used to connect datacenter applications with strict tail-latency service level objectives in the scale of μs. Existing solutions utilize streaming or datagram-based transport protocols for RPCs that impose overheads and limit the design flexibility. Our work exposes the RPC abstraction to the endpoints and the network, making RPCs first-class datacenter citizens and allowing for in-network RPC scheduling. We propose R2P2, a UDP-based transport protocol specifically designed for RPCs inside a datacenter. R2P2 exposes pairs of requests and responses and allows efficient and scalable RPC routing by separating the RPC target selection from request and reply streaming. Leveraging R2P2, we implement a novel join-bounded-shortest-queue (JBSQ) RPC load balancing policy, which lowers tail latency by centralizing pending RPCs in the router and ensures that requests are only routed to servers with a bounded number of outstanding requests. The R2P2 router logic can be implemented either in a software middlebox or within a P4 switch ASIC pipeline. Our evaluation, using a range of microbenchmarks, shows that the protocol is suitable for μs-scale RPCs and that its tail latency outperforms both random selection and classic HTTP reverse proxies. The P4-based implementation of R2P2 on a Tofino ASIC adds less than 1μs of latency whereas the software middlebox implementation adds 5μs latency and requires only two CPU cores to route RPCs at 10 Gbps line-rate. R2P2 improves the tail latency of web index searching on a cluster of 16 workers operating at 50% of capacity by 5.7× over NGINX. R2P2 improves the throughput of the Redis key-value store on a 4-node cluster with master/slave replication for a tail-latency service-level objective of 200μs by more than 4.8× vs. vanilla Redis.
We present SWAN, a novel All Flash Array (AFA) management scheme. Recent flash SSDs provide high I/O bandwidth (e.g., 3-10GB/s) so the storage bandwidth can easily surpass the network bandwidth by aggregating a few SSDs. However, it is still challenging to unlock the full performance of SSDs. The main source of performance degradation is garbage collection (GC). We find that existing AFA designs are susceptible to GC at SSD-level and AFA software-level. In designing SWAN, we aim to alleviate the performance interference caused by GC at both levels. Unlike the commonly used temporal separation approach that performs GC at idle time, we take a spatial separation approach that partitions SSDs into the front-end SSDs dedicated to serve write requests and the back-end SSDs where GC is performed. Compared to temporal separation of GC and application I/O, which is hard to be controlled by AFA software, our approach guarantees that the storage bandwidth always matches the full network performance without being interfered by AFA-level GC. Our analytical model confirms this if the size of front-end SSDs and the back-end SSDs are properly configured. We provide extensive evaluations that show SWAN is effective for a variety of workloads.
Coflow scheduling improves data-intensive application performance by improving their networking performance. State-of-the-art online coflow schedulers in essence approximate the classic Shortest-Job-First (SJF) scheduling by learning the coflow size online. In particular, they use multiple priority queues to simultaneously accomplish two goals: to sieve long coflows from short coflows, and to schedule short coflows with high priorities. Such a mechanism pays high overhead in learning the coflow size: moving a large coflow across the queues delays small and other large coflows, and moving similar-sized coflows across the queues results in inadvertent round-robin scheduling.
We propose Philae, a new online coflow scheduler that exploits the spatial dimension of coflows, i.e., a coflow has many flows, to drastically reduce the overhead of coflow size learning. Philae pre-schedules sampled flows of each coflow and uses their sizes to estimate the average flow size of the coflow. It then resorts to Shortest Coflow First, where the notion of shortest is determined using the learned coflow sizes and coflow contention. We show that the sampling-based learning is robust to flow size skew and has the added benefit of much improved scalability from reduced coordinator-local agent interactions. Our evaluation using an Azure testbed, a publicly available production cluster trace from Facebook shows that compared to the prior art Aalo, Philae reduces the coflow completion time (CCT) in average (P90) cases by 1.50x (8.00x) on a 150-node testbed and 2.72x (9.78x) on a 900-node testbed. Evaluation using additional traces further demonstrates Philae's robustness to flow size skew.
As NAND flash technology continues to scale, flash-based SSDs have become key components in data-center servers. One of the main design goals for data-center SSDs is low read tail latency, which is crucial for interactive online services as a single query can generate thousands of disk accesses. Towards this goal, many prior works have focused on minimizing the effect of garbage collection on read tail latency. Such advances have made the other, less explored source of long read tails, block erase operation, more important. Prior work on erase suspension addresses this problem by allowing a read operation to interrupt an ongoing erase operation, to minimize its effect on read latency. Unfortunately, the erase suspension technique attempts to suspend/resume an erase pulse at an arbitrary point, which incurs additional hardware cost for NAND peripherals and reduces the lifetime of the device. Furthermore, we demonstrate this technique suffers a write starvation problem, using a real, production-grade SSD. To overcome these limitations, we propose alternative practical erase suspension mechanisms, leveraging the iterative erase mechanism used in modern SSDs, to suspend/resume erase operation at well-aligned safe points. The resulting design achieves a sub-200μs 99.999th percentile read tail latency for 4KB random I/O workload at queue depth 16 (70% reads and 30% writes). Furthermore, it reduces the read tail latency by about 5 over the baseline for the two data-center workloads that we evaluated with.
Unexpected bursty traffic due to certain sudden events, such as news in the spotlight on a social network or discounted items on sale, can cause severe load imbalance in backend services. Migrating hot data---the standard approach to achieve load balance---meets a challenge when handling such unexpected load imbalance, because migrating data will slow down the server that is already under heavy pressure.
This paper proposes PostMan, an alternative approach to rapidly mitigate load imbalance for services processing small requests. Motivated by the observation that processing large packets incurs far less CPU overhead than processing small ones, PostMan deploys a number of middleboxes called helpers to assemble small packets into large ones for the heavily-loaded server. This approach essentially offloads the overhead of packet processing from the heavily-loaded server to others. To minimize the overhead, PostMan activates helpers on demand, only when bursty traffic is detected. To tolerate helper failures, PostMan can migrate connections across helpers and can ensure packet ordering despite such migration. Our evaluation shows that, with the help of PostMan, a Memcached server can mitigate bursty traffic within hundreds of milliseconds, while migrating data takes tens of seconds and increases the latency during migration.
National Engineering Research Center for Big Data Technology and System, Key Laboratory of Services Computing Technology and System, Ministry of Education, School of Computer Science and Technology, Huazhong University of Science and Technology, China
National Engineering Research Center for Big Data Technology and System, Key Laboratory of Services Computing Technology and System, Ministry of Education, School of Computer Science and Technology, Huazhong University of Science and Technology, China
National Engineering Research Center for Big Data Technology and System, Key Laboratory of Services Computing Technology and System, Ministry of Education, School of Computer Science and Technology, Huazhong University of Science and Technology, China
National Engineering Research Center for Big Data Technology and System, Key Laboratory of Services Computing Technology and System, Ministry of Education, School of Computer Science and Technology, Huazhong University of Science and Technology, China
National Engineering Research Center for Big Data Technology and System, Key Laboratory of Services Computing Technology and System, Ministry of Education, School of Computer Science and Technology, Huazhong University of Science and Technology, China
Interlaced magnetic recording (IMR) is a state-of-the-art recording technology for hard drives that makes use of heat-assisted magnetic recording (HAMR) and track overlap to offer higher capacity than conventional and shingled magnetic recording (CMR and SMR). It carries a set of write constraints that differ from those in SMR: “bottom” (e.g. even-numbered) tracks cannot be written without data loss on the adjoining “top” (e.g. odd-numbered) ones. Previously described algorithms for writing arbitrary (i.e. bottom) sectors on IMR are in some cases poorly characterized, and are either slow or require more memory than is available within the constrained disk controller environment.
We provide the first accurate performance analysis of the simple read-modify-write (RMW) approach to IMR bottom track writes, noting several inaccuracies in earlier descriptions of its performance, and evaluate it for latency, throughput and I/O amplification on real-world traces. In addition we propose three novel memory-efficient, track-based translation layers for IMR—track flipping, selective track caching and dynamic track mapping, which reduce bottom track writes by moving hot data to top tracks and cold data to bottom ones in different ways. We again provide a detailed performance analysis using simulations based on real-world traces.
We find that RMW performance is poor on most traces and worse on others. The proposed approaches perform much better, especially dynamic track mapping, with low write amplification and latency comparable to CMR for many traces.
We present LANCET, a self-correcting tool designed to measure the open-loop tail latency of μs-scale datacenter applications with high fan-in connection patterns. LANCET is self-correcting as it relies on online statistical tests to determine situations in which tail latency cannot be accurately measured from a statistical perspective, including situations where the workload configuration, the client infrastructure, or the application itself does not allow it. Because of its design, LANCET is also extremely easy to use. In fact, the user is only responsible for (i) configuring the workload parameters, i.e., the mix of requests and the size of the client connection pool, and (ii) setting the desired confidence interval for a particular tail latency percentile. All other parameters, including the length of the warmup phase, the measurement length, and the necessary sampling rate, are dynamically determined by the LANCET experiment coordinator. When available, LANCET leverages NIC-based hardware timestamping to measure RPC end-to-end latency. Otherwise, it uses an asymmetric setup with a latency-agent that leverages busy-polling system calls to reduce the client bias. Our evaluation shows that LANCET automatically identifies situations in which tail latency cannot be determined and accurately reports the latency distribution of workloads with single-digit μs service time. For the workloads studied, LANCET can successfully report, with 95% confidence, the 99th percentile tail latency within an interval of ≤ 10μs. In comparison with state-of-the-art tools such as Mutilate and Treadmill, LANCET reports a latency cumulative distribution that is ∼20μs lower when the NIC timestamping capability is available and ∼10μs lower when it is not.
Non-volatile main memory (NVMM) allows programmers to build complex, persistent, pointer-based data structures that can offer substantial performance gains over conventional approaches to managing persistent state. This programming model removes the file system from the critical path which improves performance, but it also places these data structures out of reach of file system-based fault tolerance mechanisms (e.g., block-based checksums or erasure coding). Without fault-tolerance, using NVMM to hold critical data will be much less attractive.
This paper presents Pangolin, a fault-tolerant persistent object library designed for NVMM. Pangolin uses a combination of checksums, parity, and micro-buffering to protect an application's objects from both media errors and corruption due to software bugs. It provides these protections for objects of any size and supports automatic, online detection of data corruption and recovery. The required storage overhead is small (1% for gigabyte-sized pools of NVMM). Pangolin provides stronger protection, requires orders of magnitude less storage overhead, and achieves comparable performance relative to the current state-of-the-art fault-tolerant persistent object library.
Many Internet of Things (IoT) applications would benefit if streams of data could be analyzed rapidly at the Edge, near the data source. However, existing Stream Processing Engines (SPEs) are unsuited for the Edge because their designs assume Cloud-class resources and relatively generous throughput and latency constraints.
This paper presents EdgeWise, a new Edge-friendly SPE, and shows analytically and empirically that EdgeWise improves both throughput and latency. The key idea of EdgeWise is to incorporate a congestion-aware scheduler and a fixed-size worker pool into an SPE. Though this idea has been explored in the past, we are the first to apply it to modern SPEs and we provide a new queue-theoretic analysis to support it. In our single-node and distributed experiments we compare EdgeWise to the state-of-the-art Storm system. We report up to a 3x improvement in throughput while keeping latency low.
Persistent transactional memory (PTM) programming model has recently been exploited to provide crash-consistent transactional interfaces to ease programming atop NVM. However, existing PTM designs either incur high reader-side overhead due to blocking or long delay in the writer side (efficiency), or place excessive constraints on persistent ordering (scalability). This paper presents Pisces, a read-friendly PTM that exploits snapshot isolation (SI) on NVM. The key design of Pisces is based on two observations: the redo logs of transactions can be reused as newer versions for the data, and an intuitive MVCC-based design has read deficiency. Based on the observations, we propose a dual-version concurrency control (DVCC) protocol that maintains up to two versions in NVM-backed storage hierarchy. Together with a three-stage commit protocol, Pisces ensures SI and allows more transactions to commit and persist simultaneously. Most importantly, it promises a desired feature: hiding NVM persistence overhead from reads and allowing nearly non-blocking reads. Experimental evaluation on an Intel 40-thread (20-core) machine with real NVM equipped shows that Pisces outperforms the state-of-the-art design (i.e., DUDETM) by up to 6.3× for micro-benchmarks and 4.6× for TPC-C new order transaction, and also scales much better. The persistency cost is from 19% to 50% for 40 threads.
With widespread advances in machine learning, a number of large enterprises are beginning to incorporate machine learning models across a number of products. These models are typically trained on shared, multi-tenant GPU clusters. Similar to existing cluster computing workloads, scheduling frameworks aim to provide features like high efficiency, resource isolation, fair sharing across users, etc. However Deep Neural Network (DNN) based workloads, predominantly trained on GPUs, differ in two significant ways from traditional big data analytics workloads. First, from a cluster utilization perspective, GPUs represent a monolithic resource that cannot be shared at a fine granularity across users. Second, from a workload perspective, deep learning frameworks require gang scheduling reducing the flexibility of scheduling and making the jobs themselves inelastic to failures at runtime. In this paper we present a detailed workload characterization of a two-month long trace from a multi-tenant GPU cluster in Microsoft. By correlating scheduler logs with logs from individual jobs, we study three distinct issues that affect cluster utilization for DNN training workloads on multi-tenant clusters: (1) the effect of gang scheduling and locality constraints on queuing, (2) the effect of locality on GPU utilization, and (3) failures during training. Based on our experience running a large-scale operation, we provide design guidelines pertaining to next-generation cluster schedulers for DNN training workloads.
Modern datacenters increasingly use flash-based solid state drives (SSDs) for high performance and low energy cost. However, SSD introduces more complex failure modes compared to traditional hard disk. While great efforts have been made to understand the reliability of SSD itself, it remains unclear what types of system level failures are related to SSD, what are the root causes, and how the rest of the system interacts with SSD and contributes to failures. Answering these questions can help practitioners build and maintain highly reliable SSD-based storage systems.
In this paper, we study the reliability of SSD-based storage systems deployed in Alibaba Cloud, which cover near half a million SSDs and span over three years of usage under representative cloud services. We take a holistic view to analyze both device errors and system failures to better understand the potential casual relations. Particularly, we focus on failures that are Reported As "SSD-Related" (RASR) by system status monitoring daemons. Through log analysis, field studies, and validation experiments, we identify the characteristics of RASR failures in terms of their distribution, symptoms, and correlations. Moreover, we derive a number of major lessons and a set of effective methods to address the issues observed. We believe that our study and experience would be beneficial to the community and could facilitate building highly-reliable SSD-based storage systems.
The popularity of Convolutional Neural Network (CNN) models and the ubiquity of CPUs imply that better performance of CNN model inference on CPUs can deliver significant gain to a large number of users. To improve the performance of CNN inference on CPUs, current approaches like MXNet and Intel OpenVINO usually treat the model as a graph and use the high-performance libraries such as Intel MKL-DNN to implement the operations of the graph. While achieving reasonable performance on individual operations from the off-the-shelf libraries, this solution makes it inflexible to conduct optimizations at the graph level, as the local operation-level optimizations are predefined. Therefore, it is restrictive and misses the opportunity to optimize the end-to-end inference pipeline as a whole. This paper presents \emph{NeoCPU}, a comprehensive approach of CNN model inference on CPUs that employs a full-stack and systematic scheme of optimizations. \emph{NeoCPU} optimizes the operations as templates without relying on third-parties libraries, which enables further improvement of the performance via operation- and graph-level joint optimization. Experiments show that \emph{NeoCPU} achieves up to 3.45$\times$ lower latency for CNN model inference than the current state-of-the-art implementations on various kinds of popular CPUs.
Due to its high performance and decreasing cost per bit, flash storage is the main storage medium in datacenters for hot data. However, flash endurance is a perpetual problem, and due to technology trends, subsequent generations of flash devices exhibit progressively shorter lifetimes before they experience uncorrectable bit errors. In this paper, we propose addressing the flash lifetime problem by allowing devices to expose higher bit error rates. We present DIRECT, a set of techniques that harnesses distributed-level redundancy to enable the adoption of new generations of denser and less reliable flash storage technologies. DIRECT does so by using an end-to-end approach to increase the reliability of distributed storage systems.
We implemented DIRECT on two real-world storage systems: ZippyDB, a distributed key-value store in production at Facebook and backed by RocksDB, and HDFS, a distributed file system. When tested on production traces at Facebook, DIRECT reduces application-visible error rates in ZippyDB by more than 100x and recovery time by more than 10,000x. DIRECT also allows HDFS to tolerate a 10,000--100,000x higher bit error rate without experiencing application-visible errors. By significantly increasing the availability and durability of distributed storage systems in the face of bit errors, DIRECT helps extend flash lifetimes.
Infusing machine learning (ML) and deep learning (DL) into modern systems has driven a paradigm shift towards learning-augmented system design. This paper proposes the learned ranker as a system building block, and demonstrates its potential by using rule-matching systems as a concrete scenario. Specifically, checking rules can be time-consuming, especially complex regular expression (regex) conditions. The learned ranker prioritizes rules based on their likelihood of matching a given input. If the matching rule is successfully prioritized as a top candidate, the system effectively achieves early termination. We integrated the learned rule ranker as a component of popular regex matching engines: PCRE, PCRE-JIT, and RE2. Empirical results show that the rule ranker achieves a top-5 classification accuracy at least 96.16%, and reduces the rule-matching system latency by up to 78.81% on a 8-core CPU.
This paper investigates I/O and failure traces from a realworld large-scale storage system: it finds that because of the scale of the system and because of the imbalanced and dynamic foreground traffic, no existing recovery protocol can compute a high-quality re-replicating strategy in a short time. To address this problem, this paper proposes Dayu, a timeslot based recovery architecture. For each timeslot, Dayu only schedules a subset of tasks which are expected to be finished in this timeslot: this approach reduces the computation overhead and naturally can cope with the dynamic foreground traffic. In each timeslot, Dayu incorporates a greedy algorithm with convex hull optimization to achieve both high speed and high quality. Our evaluation in a 1,000-node cluster and in a 3,500-node simulation both confirm that Dayu can outperform existing recovery protocols, achieving high speed and high quality.
The advances of Machine Learning (ML) have sparked a growing demand of ML-as-a-Service: developers train ML models and publish them in the cloud as online services to provide low-latency inference at scale. The key challenge of ML model serving is to meet the response-time Service-Level Objectives (SLOs) of inference workloads while minimizing the serving cost. In this paper, we tackle the dual challenge of SLO compliance and cost effectiveness with MArk (Model Ark), a general-purpose inference serving system built in Amazon Web Services (AWS). MArk employs three design choices tailor-made for inference workload. First, MArk dynamically batches requests and opportunistically serves them using expensive hardware accelerators (e.g., GPU) for improved performance-cost ratio. Second, instead of relying on feedback control scaling or over-provisioning to serve dynamic workload, which can be too slow or too expensive for inference serving, MArk employs predictive autoscaling to hide the provisioning latency at low cost. Third, given the stateless nature of inference serving, MArk exploits the flexible, yet costly serverless instances to cover the occasional load spikes that are hard to predict. We evaluated the performance of MArk using several state-of-the-art ML models trained in popular frameworks including TensorFlow, MXNet, and Keras. Compared with the premier industrial ML serving platform SageMaker, MArk reduces the serving cost up to $7.8\times$ while achieving even better latency performance.
Consumer-grade solid-state drives (SSDs) guarantee very few things upon a crash. Lacking a strong disk-level crash guarantee forces programmers to equip applications and filesystems with safety nets using redundant writes and flushes, which in turn degrade the overall system performance. Although some prior works propose transactional SSDs with revolutionized disk interfaces to offer strong crash guarantees, adopting transactional SSDs inevitably incurs dramatic software stack changes. Therefore, most consumer-grade SSDs still keep using the standard block device interface.
This paper addresses the above issues by breaking the impression that increasing SSDs' crash guarantees are typically available at the cost of altering the standard block device interface. We propose Order-Preserving Translation and Recovery (OPTR), a collection of novel flash translation layer (FTL) and crash recovery techniques that are realized internal to block-interface SSDs to endow the SSDs with \emph{strong request-level crash guarantees} defined as follows: 1) A write request is not made durable unless all its prior write requests are durable. 2) Each write request is atomic. 3) All write requests prior to a flush are guaranteed durable. We have realized OPTR in real SSD hardware and optimized applications and filesystems (SQLite and Ext4) to demonstrate OPTR's benefits. Experimental results show 1.27$\times$ (only Ext4 is optimized), 2.85$\times$ (both Ext4 and SQLite are optimized), and 6.03$\times$ (an OPTR-enabled no-barrier mode) performance improvement.
In recent years, software applications are increasingly deployed as online services on cloud computing platforms. It is important to detect anomalies in cloud systems in order to maintain high service availability. However, given the velocity, volume, and diversified nature of cloud monitoring data, it is difficult to obtain sufficient labelled data to build an accurate anomaly detection model. In this paper, we propose cross-dataset anomaly detection: detect anomalies in a new unlabelled dataset (the target) by training an anomaly detection model on existing labelled datasets (the source). Our approach, called ATAD (Active Transfer Anomaly Detection), integrates both transfer learning and active learning techniques. Transfer learning is applied to transfer knowledge from the source dataset to the target dataset, and active learning is applied to determine informative labels of a small part of samples from unlabelled datasets. Through experiments, we show that ATAD is effective in cross-dataset time series anomaly detection. Furthermore, we only need to label about 1%-5% of unlabelled data and can still achieve significant performance improvement.