Papers by Dmitrii Ustiugov

Proceedings of the 49th Annual International Symposium on Computer Architecture
Serverless computing has emerged as a widely-used paradigm for running services in the cloud. In serverless, developers organize their applications as a set of functions, which are invoked on-demand in response to events, such as an HTTP request. To avoid long start-up delays of launching a new function instance, cloud providers tend to keep recently-triggered instances idle (or warm) for some time after the most recent invocation in anticipation of future invocations. Thus, at any given moment on a server, there may be thousands of warm instances of various functions whose executions are interleaved in time based on incoming invocations. This paper observes that (1) there is a high degree of interleaving among warm instances on a given server; (2) the individual warm functions are invoked relatively infrequently, often at the granularity of seconds or minutes; and (3) many function invocations complete within a few milliseconds. Interleaved execution of rarely invoked functions on a server leads to thrashing of each function's microarchitectural state between invocations. Meanwhile, the short execution time of a function impedes amortization of the warmup latency of the cache hierarchy, causing a 31-114% increase in CPI compared to execution with warm microarchitectural state. We identify on-chip misses for instructions as a major contributor to the performance loss. In response, we propose Jukebox, a record-and-replay instruction prefetcher specifically designed for reducing the start-up latency of warm function instances. Jukebox requires just 32KB of metadata per function instance and boosts performance by an average of 18.7% for a wide range of functions, which translates into a corresponding throughput improvement.
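
The record-and-replay idea can be illustrated with a minimal trace-driven sketch. It is only a conceptual model, assuming a simple per-function log of missed instruction blocks capped by the 32KB metadata budget mentioned above; the trace format, entry size, and de-duplication policy are placeholders, not the paper's actual microarchitecture.

```python
# Minimal trace-driven sketch of a record-and-replay instruction prefetcher.
# All structure sizes and the recording format are illustrative assumptions.

BLOCK = 64                     # cache-block size in bytes
METADATA_BYTES = 32 * 1024     # per-function metadata budget (as in the abstract)
ENTRY_BYTES = 4                # assumed size of one recorded block address
MAX_ENTRIES = METADATA_BYTES // ENTRY_BYTES

class JukeboxLikePrefetcher:
    def __init__(self):
        self.recorded = []     # block addresses missed during the recorded invocation
        self.recording = False

    def start_invocation(self, record: bool, prefetch):
        """Called when a warm function instance is invoked.
        `prefetch` is a callback that issues a prefetch for one block address."""
        self.recording = record
        if not record:
            # Replay: bulk-prefetch the instruction working set captured earlier,
            # hiding the cache-warmup latency of a short-running invocation.
            for addr in self.recorded:
                prefetch(addr)

    def on_instruction_miss(self, pc: int):
        """Called on an on-chip instruction miss during the recorded invocation."""
        if self.recording and len(self.recorded) < MAX_ENTRIES:
            block = pc & ~(BLOCK - 1)
            if block not in self.recorded[-8:]:   # cheap de-duplication of hot repeats
                self.recorded.append(block)
```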

arXiv: Hardware Architecture, Jan 20, 2018
With emerging storage-class memory (SCM) nearing commercialization, there is evidence that it will deliver the much-anticipated high density and access latencies within only a few factors of DRAM. Nevertheless, the latency-sensitive nature of in-memory services makes seamless integration of SCM in servers questionable. In this paper, we ask the question of how best to introduce SCM for such servers to help improve overall performance per cost over existing DRAM-only architectures. We first show that even with the best latency projections for SCM, the higher memory access latency results in prohibitive performance degradation. However, we find that deploying a modestly sized high-bandwidth stacked DRAM cache makes SCM-based memory competitive. The high degree of spatial locality that in-memory services exhibit not only simplifies the DRAM cache's design as page-based, but also enables the amortization of increased SCM access latencies and mitigation of SCM's read/write latency disparity. We finally perform a case study with PCM, and show that a 2 bits/cell technology hits the performance/cost sweet spot, reducing the memory subsystem cost by 40% while keeping performance within 5% of the best performing DRAM-only system, whereas single-level and triple-level cell organizations are impractical for use as memory replacements.
Footnote 1: With a commodity Xeon E5-2660 v4 CPU and 256GB of DRAM, the memory represents up to 40% of the server's total acquisition cost [1, 31].
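
A back-of-the-envelope model helps show why a page-based DRAM cache makes an SCM-backed memory system competitive: with high spatial locality, most accesses hit in DRAM and the longer SCM latencies are amortized. The latencies and hit rates below are illustrative assumptions, not numbers from the paper.

```python
# Rough average-memory-access-time model for a DRAM cache in front of SCM.
# All latency numbers and hit rates are illustrative assumptions.

DRAM_HIT_NS  = 100    # assumed stacked-DRAM cache hit latency
SCM_READ_NS  = 300    # assumed SCM read latency (a few factors of DRAM)
SCM_WRITE_NS = 1000   # assumed SCM write latency (read/write disparity)

def avg_access_ns(hit_rate: float, write_fraction: float) -> float:
    """Average memory access time with a page-based DRAM cache in front of SCM.
    High spatial locality -> high hit rate -> SCM latency is largely amortized."""
    miss_ns = (1 - write_fraction) * SCM_READ_NS + write_fraction * SCM_WRITE_NS
    return hit_rate * DRAM_HIT_NS + (1 - hit_rate) * (DRAM_HIT_NS + miss_ns)

if __name__ == "__main__":
    for hit_rate in (0.80, 0.95, 0.99):
        print(f"hit rate {hit_rate:.0%}: {avg_access_ns(hit_rate, 0.2):.0f} ns on average")
```
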
amargaritov/PTEMagnet_AE v1.0
This repo includes an artifact evaluation pack for the paper "PTEMagnet: Fine-grained Physical Memory Reservation for Faster Page Walks in Public Clouds", which is to appear at ASPLOS'21. The artifact contains a Linux kernel patch for enabling PTEMagnet, shell scripts for Linux kernel compilation, a VM disk image with precompiled benchmarks, and Python/shell scripts, which are expected to reproduce the results of the experiments in Figure 6. This artifact can run on a two-socket x86 server that has 1) more than 20 physical CPU cores (i.e., 40 hyperthreads) in total and 2) more than 128GB of RAM, and runs Ubuntu 18.04 LTS.

2021 IEEE International Symposium on Workload Characterization (IISWC)
Serverless computing has seen rapid adoption because of its instant scalability, flexible billing model, and economies of scale. In serverless, developers structure their applications as a collection of functions invoked by various events like clicks, and cloud providers take responsibility for cloud infrastructure management. As with other cloud services, serverless deployments require responsiveness and performance predictability manifested through low average and tail latencies. While the average end-to-end latency has been extensively studied in prior works, existing papers lack a detailed characterization of tail latency in real-world serverless scenarios and of its root causes. In response, we introduce STeLLAR, an open-source serverless benchmarking framework, which enables an accurate performance characterization of serverless deployments. STeLLAR is provider-agnostic and highly configurable, allowing the analysis of both end-to-end and per-component performance with minimal instrumentation effort. Using STeLLAR, we study three leading serverless clouds and reveal that storage accesses and bursty function invocation traffic are key factors impacting tail latency in modern serverless systems. Finally, we identify important factors that do not contribute to latency variability, such as the choice of language runtime.
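
The kind of tail-latency characterization that STeLLAR automates ultimately reduces to repeatedly invoking a deployed function endpoint and reporting high percentiles. The sketch below measures a generic HTTP-triggered function with the standard library; the endpoint, request count, and reporting format are assumptions and do not reflect STeLLAR's actual API.

```python
# Minimal end-to-end latency measurement for an HTTP-triggered function.
# The endpoint URL and request count are placeholders; STeLLAR itself adds
# provider-agnostic deployment, bursty load patterns, and per-component timing.
import statistics
import time
import urllib.request

def measure(url: str, n: int = 200) -> list[float]:
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        urllib.request.urlopen(url).read()
        latencies.append((time.perf_counter() - start) * 1000)  # milliseconds
    return latencies

def report(latencies: list[float]) -> None:
    latencies.sort()
    p50 = statistics.median(latencies)
    p99 = latencies[int(0.99 * (len(latencies) - 1))]
    print(f"median {p50:.1f} ms, p99 {p99:.1f} ms, tail/median {p99 / p50:.1f}x")
```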

Enabling Storage Class Memory as a DRAM Replacement for Datacenter Services
arXiv: Hardware Architecture, 2018
With emerging storage-class memory (SCM) nearing commercialization, there is evidence that it will deliver the much-anticipated high density and access latencies within only a few factors of DRAM. Nevertheless, the latency-sensitive nature of memory-resident services makes seamless integration of SCM in servers questionable. In this paper, we ask the question of how best to introduce SCM for such servers to improve overall performance/cost over existing DRAM-only architectures. We first show that even with the most optimistic latency projections for SCM, the higher memory access latency results in prohibitive performance degradation. However, we find that deployment of a modestly sized high-bandwidth 3D stacked DRAM cache makes the performance of an SCM-mostly memory system competitive. The high degree of spatial locality that memory-resident services exhibit not only simplifies the DRAM cache's design as page-based, but also enables the amortization of increased SCM access latencies...

Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2021
The last few years have seen a rapid adoption of cloud computing for data-intensive tasks. In the cloud environment, it is common for applications to run under virtualization and to share a virtual machine with other applications (e.g., in a virtual private cloud setup). In this setting, our work identifies a new address translation bottleneck caused by memory fragmentation stemming from the interaction of virtualization, colocation, and the Linux memory allocator. The fragmentation results in the effective cache footprint of the host page table (PT) being larger than that of the guest PT. The bloated footprint of the host PT leads to frequent cache misses during nested page walks, increasing page walk latency. In response to these observations, we propose PTEMagnet, a new software-only approach for reducing address translation latency in a public cloud. PTEMagnet prevents memory fragmentation through a fine-grained reservation-based allocator in the guest OS. Our evaluation shows that PTEMagnet...
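
The core idea of a fine-grained reservation-based allocator can be sketched as follows: instead of handing out guest physical pages one at a time (which lets pages of different processes interleave and bloat the host page table's cache footprint), each process draws pages from its own small, contiguous reservation. The granularity and data structures below are illustrative assumptions, not the PTEMagnet implementation.

```python
# Illustrative reservation-based page allocator (not the actual PTEMagnet code).
# Each process draws pages from its own contiguous chunk, so consecutive virtual
# pages of one process map to consecutive physical pages and share PTE cache lines.

RESERVATION_PAGES = 8   # assumed reservation granularity (e.g., 8 x 4KB pages)

class ReservationAllocator:
    def __init__(self, total_pages: int):
        self.next_free = 0
        self.total_pages = total_pages
        self.reservations = {}      # pid -> (next page, pages left in its reservation)

    def alloc_page(self, pid: int) -> int:
        nxt, left = self.reservations.get(pid, (0, 0))
        if left == 0:               # carve out a fresh contiguous reservation
            if self.next_free + RESERVATION_PAGES > self.total_pages:
                raise MemoryError("out of physical pages")
            nxt, left = self.next_free, RESERVATION_PAGES
            self.next_free += RESERVATION_PAGES
        self.reservations[pid] = (nxt + 1, left - 1)
        return nxt                  # physical page frame number handed to this process
```
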
Modern large-scale applications are highly concurrent and require efficient concurrency control mechanisms to achieve high performance, while preserving consistency of the data that is shared between a large number of servers. Traditional software techniques, such as atomic operations, used in concurrency control mechanisms introduce considerable overheads, whereas mechanisms leveraging architectural support for transactions do not scale beyond a single coherence domain due to high design complexity. Using the scale-out NUMA [1] platform as an example of an extensible protocol and architecture, we present insights for building an efficient software/hardware co-designed concurrency control mechanism on top of an extended set of hardware-based operations that can be leveraged by distributed transactions in a rack-scale system.
Address translation is an established performance bottleneck [4] in workloads operating on large datasets due to frequent TLB misses and the subsequent page table walks, which often require multiple memory accesses to resolve. Inspired by recent research at Google on Learned Index Structures [14], we propose to accelerate address translation by introducing a new translation mechanism based on learned models using neural networks. We argue that existing software-based learned models are unable to outperform traditional address translation mechanisms due to their high inference time, pointing toward the need for hardware-accelerated learned models. With the challenging goal of microarchitecting a hardware-friendly learned page table index, we discuss a number of machine learning and systems trade-offs and suggest future directions.
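
A minimal version of the learned-translation idea is a model that predicts the position of a virtual page number in a sorted mapping table, replacing the multi-level radix walk with one prediction plus a short local search. The paper argues for hardware-accelerated neural models; the sketch below uses a simple linear fit purely to show the predict-then-correct structure, and the search window size is an arbitrary assumption.

```python
# Toy learned index over a sorted (virtual page -> physical frame) mapping.
# A linear model predicts the table position for a VPN; a bounded local search
# corrects the prediction. Hardware acceleration would replace the Python math.
import bisect

class LearnedTranslator:
    def __init__(self, mapping: dict[int, int]):
        self.vpns = sorted(mapping)
        self.pfns = [mapping[v] for v in self.vpns]
        # Fit position ~= a * vpn + b over the sorted keys (least squares).
        n = len(self.vpns)
        mean_v = sum(self.vpns) / n
        mean_p = (n - 1) / 2
        cov = sum((v - mean_v) * (i - mean_p) for i, v in enumerate(self.vpns))
        var = sum((v - mean_v) ** 2 for v in self.vpns) or 1
        self.a = cov / var
        self.b = mean_p - self.a * mean_v

    def translate(self, vpn: int) -> int:
        guess = int(self.a * vpn + self.b)
        guess = min(max(guess, 0), len(self.vpns) - 1)
        # Correct the prediction with a short search around the guessed position.
        lo, hi = max(0, guess - 64), min(len(self.vpns), guess + 64)
        i = bisect.bisect_left(self.vpns, vpn, lo, hi)
        if i >= hi or self.vpns[i] != vpn:
            i = bisect.bisect_left(self.vpns, vpn)   # fallback: model error too large
        if i < len(self.vpns) and self.vpns[i] == vpn:
            return self.pfns[i]
        raise KeyError(f"no mapping for VPN {vpn:#x}")
```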

Storage-Class Memory Hierarchies for Servers
With emerging storage-class memory (SCM) nearing commercialization, there is evidence that it will deliver the much-anticipated high density and access latencies within only a few factors of DRAM. Nevertheless, the latency-sensitive nature of memory-resident services makes seamless integration of SCM in servers questionable. In this paper, we ask the question of how best to introduce SCM for such servers to improve overall performance/cost over existing DRAM-only architectures. We first show that even with the most optimistic latency projections for SCM, the higher memory access latency results in prohibitive performance degradation. However, we find that deployment of a modestly sized high-bandwidth 3D stacked DRAM cache makes the performance of an SCM-mostly memory system competitive. The high degree of spatial locality that memory-resident services exhibit not only simplifies the DRAM cache’s design as page-based, but also enables the amortization of increased SCM access latencies...

Recent years have seen a surge in the number of data leaks despite aggressive information-containment measures deployed by cloud providers. When attackers acquire sensitive data in a secure cloud environment, covert communication channels are a key tool for exfiltrating the data to the outside world. While the bulk of prior work focused on covert channels within a single CPU, such channels require the spy (transmitter) and the receiver to share the CPU, which might be difficult to achieve in a cloud environment with hundreds or thousands of machines. This work presents Bankrupt, a high-rate, highly clandestine channel that enables covert communication between a spy and a receiver running on different nodes in an RDMA network. In Bankrupt, the spy communicates with the receiver by issuing RDMA network packets to a private memory region allocated to it on a different machine (an intermediary). The receiver similarly allocates a separate memory region on the same intermediary, also accessed via...

2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Oct 1, 2016
Modern in-memory services rely on large distributed object stores to achieve the high scalability essential for servicing thousands of requests concurrently. The independent and unpredictable nature of incoming requests results in random accesses to the object store, triggering frequent remote memory accesses. State-of-the-art distributed memory frameworks leverage the one-sided operations offered by RDMA technology to mitigate the traditionally high cost of remote memory access. Unfortunately, the limited semantics of RDMA one-sided operations bound remote memory access atomicity to a single cache block; therefore, atomic remote object access relies on software mechanisms. Emerging highly integrated rack-scale systems that reduce the latency of one-sided operations to a small multiple of DRAM latency expose the overhead of these software mechanisms as a major latency contributor. This technology-triggered paradigm shift calls for new one-sided operations with stronger semantics. We take a step in that direction by proposing SABRes, a new one-sided operation that provides atomic remote object reads in hardware. We then present LightSABRes, a lightweight hardware accelerator for SABRes that removes all atomicity-associated software overheads. Compared to a state-of-the-art software atomicity mechanism, LightSABRes improves the throughput of a microbenchmark atomically accessing 128B-8KB objects from remote memory by 15-97%, and the throughput of a modern in-memory distributed object store by 30-60%.
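
For context, the software atomicity mechanism that SABRes-like hardware aims to replace can be approximated by a version check wrapped around a one-sided remote read, with the reader retrying whenever a concurrent write is detected. The object layout, version encoding, and retry policy below are generic assumptions rather than the paper's exact baseline; the point of the hardware mechanism is to remove these extra reads and retries from the critical path.

```python
# Illustrative optimistic read of a remote object over one-sided accesses.
# read_remote(addr, size) stands in for an RDMA read. Assumed layout: an 8-byte
# version followed by the object payload. Writers bump the version to an odd
# value, update the payload, then bump it back to even.
import struct

VERSION_BYTES = 8

def atomic_object_read(read_remote, addr: int, obj_size: int) -> bytes:
    while True:
        buf = read_remote(addr, VERSION_BYTES + obj_size)
        (v_before,) = struct.unpack_from("<Q", buf, 0)
        if v_before % 2:                     # odd version: write in progress, retry
            continue
        # Re-read just the version to detect a concurrent update of the payload.
        (v_after,) = struct.unpack_from("<Q", read_remote(addr, VERSION_BYTES), 0)
        if v_after == v_before:
            return buf[VERSION_BYTES:]
        # Version changed mid-read: payload may be torn, retry.
```
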
Benchmarking, analysis, and optimization of serverless function snapshots
Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems
Algorithm/Architecture Co-Design for Near-Memory Processing
ACM SIGOPS Operating Systems Review
Mitigating Load Imbalance in Distributed Data Serving with Rack-Scale Memory Pooling
ACM Transactions on Computer Systems

Proceedings of the 44th Annual International Symposium on Computer Architecture
The increasing demand for extracting value out of ever-growing data poses an ongoing challenge to system designers, a task only made trickier by the end of Dennard scaling. As the performance density of traditional CPU-centric architectures stagnates, advancing compute capabilities necessitates novel architectural approaches. Near-memory processing (NMP) architectures are reemerging as promising candidates to improve computing efficiency through tight coupling of logic and memory. NMP architectures are especially fitting for data analytics, as they provide immense bandwidth to memory-resident data and dramatically reduce data movement, the main source of energy consumption. Modern data analytics operators are optimized for CPU execution and hence rely on large caches and employ random memory accesses. In the context of NMP, such random accesses result in wasteful DRAM row buffer activations that account for a significant fraction of the total memory access energy. In addition, utilizing NMP's ample bandwidth with fine-grained random accesses requires complex hardware that cannot be accommodated under NMP's tight area and power constraints. Our thesis is that efficient NMP calls for an algorithm-hardware co-design that favors algorithms with sequential accesses to enable simple hardware that accesses memory in streams. We introduce an instance of such a co-designed NMP architecture for data analytics, the Mondrian Data Engine. Compared to a CPU-centric and a baseline NMP system, the Mondrian Data Engine improves the performance of basic data analytics operators by up to 49× and 5×, and efficiency by up to 28× and 5×, respectively.
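
The co-design argument, trading random but cache-friendly accesses for sequential streams that simple near-memory hardware can fetch efficiently, can be illustrated by two ways of computing the same equi-join. The example is schematic and is not one of the Mondrian Data Engine's actual operators.

```python
# Two equivalent equi-joins over lists of (key, value) pairs.
# hash_join probes a hash table at effectively random addresses (the CPU-centric,
# cache-reliant pattern); sort_merge_join only scans sorted runs sequentially,
# the streaming access pattern that a near-memory engine favors.

def hash_join(r, s):
    table = {}
    for k, v in r:
        table.setdefault(k, []).append(v)                          # random writes
    return [(k, v, w) for k, w in s for v in table.get(k, [])]     # random probes

def sort_merge_join(r, s):
    r, s = sorted(r), sorted(s)          # sequential sort and merge passes
    out, i = [], 0
    for k, w in s:
        while i < len(r) and r[i][0] < k:
            i += 1
        j = i
        while j < len(r) and r[j][0] == k:
            out.append((k, r[j][1], w))
            j += 1
    return out
```
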
Design guidelines for high-performance SCM hierarchies
Proceedings of the International Symposium on Memory Systems
Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture
With explosive growth in dataset sizes and increasing machine memory capacities, per-application memory footprints are commonly reaching into hundreds of GBs. Such huge datasets pressure the TLB, resulting in frequent misses that must be resolved through a page walk: a long-latency pointer chase through multiple levels of the in-memory, radix-tree-based page table. Anticipating further growth in dataset sizes and their adverse...
(* This work was done while the author was at EPFL.)
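
For reference, the pointer chase the abstract refers to looks like the following four-level radix walk (x86-64-style 9-bit indices and 4KB pages; the page-table nodes are simulated here with nested Python dicts, purely as an illustration).

```python
# Schematic 4-level radix page walk: each level requires one dependent memory
# access, which is why a single TLB miss can cost several DRAM round trips.

LEVEL_BITS = 9          # 512 entries per level, as in x86-64 4KB paging
LEVELS = 4
PAGE_SHIFT = 12

def walk(page_table_root: dict, vaddr: int) -> int:
    node = page_table_root
    vpn = vaddr >> PAGE_SHIFT
    for level in reversed(range(LEVELS)):                  # PML4 -> PDPT -> PD -> PT
        index = (vpn >> (level * LEVEL_BITS)) & ((1 << LEVEL_BITS) - 1)
        node = node[index]                                  # one dependent memory access
    return (node << PAGE_SHIFT) | (vaddr & ((1 << PAGE_SHIFT) - 1))
```
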
The Mondrian Data Engine
ACM SIGARCH Computer Architecture News