Papers by Hillel Kolodner

Data is the natural resource of the 21st century. It is being produced at dizzying rates, e.g., for genomics, for media and entertainment, and for the Internet of Things. Object storage systems, such as Amazon S3, Azure Blob storage, and IBM Cloud Object Storage, are highly scalable distributed storage systems that offer high-capacity, cost-effective storage. But it is not enough just to store data; we also need to derive value from it. Apache Spark is the leading big data analytics processing engine, combining MapReduce, SQL, streaming, and complex analytics. We present Stocator, a high performance storage connector that enables Spark to work directly on data stored in object storage systems, while providing the same correctness guarantees as Hadoop's original storage system, HDFS. Current object storage connectors from the Hadoop community, e.g., for the S3 and Swift APIs, do not deal well with eventual consistency, which can lead to failure. These connectors assume file system semantics, which is natural given that their model of operation is based on interaction with HDFS. In particular, Spark and Hadoop achieve fault tolerance and enable speculative execution by creating temporary files, listing directories to identify these files, and then renaming them. This paradigm avoids interference between tasks doing the same work and thus writing output with the same name. However, with eventually consistent object storage, a container listing may not yet include a recently created object, and thus an object may not be renamed, leading to incomplete or incorrect results. Solutions such as EMRFS [1] from Amazon, S3mper [4] from Netflix, and S3Guard [2] attempt to overcome eventual consistency by requiring additional strongly consistent data storage. These solutions require multiple storage systems, are costly, and can introduce issues of consistency between the stores. Current object storage connectors from the Hadoop community are also notorious for their poor performance on write workloads. This, too, stems from their use of the rename operation, which is not a native object storage operation; not only is it not atomic, but it must be implemented using a costly copy operation, followed by delete.
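
To make the rename bottleneck concrete, here is a minimal sketch of the commit-by-rename pattern described above, against a toy in-memory "object store" whose listings may lag behind PUTs. All class and key names are hypothetical; this is not the connectors' actual code.

```java
// Toy model of an eventually consistent object store: key -> bytes.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RenameCommitSketch {
    static final Map<String, byte[]> store = new ConcurrentHashMap<>();

    // Each task attempt writes under a temporary name so that speculative
    // attempts of the same task do not interfere with each other.
    static void writeTaskAttempt(int task, int attempt, byte[] data) {
        store.put("_temporary/task" + task + "_attempt" + attempt, data);
    }

    // Job commit: "rename" each temporary object to its final name. On a
    // real object store this is not atomic -- it is a full copy of the
    // data followed by a delete, paid once per committed object.
    static void commitByRename(String tempKey, String finalKey) {
        byte[] data = store.get(tempKey);   // copy: re-reads all the bytes
        store.put(finalKey, data);
        store.remove(tempKey);              // delete the temporary object
        // If an eventually consistent listing missed tempKey, this rename
        // never runs and the job output is silently incomplete.
    }

    public static void main(String[] args) {
        writeTaskAttempt(0, 0, "part-00000 contents".getBytes());
        commitByRename("_temporary/task0_attempt0", "output/part-00000");
        System.out.println(store.keySet()); // [output/part-00000]
    }
}
```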

CogNETive
Operating a cloud-scale service is a huge challenge. There are millions of users worldwide and millions of requests per second. For example, in 2013 Amazon's Simple Storage Service (S3) contained two trillion objects and its logs grew by 1.1 million log lines per second, approximately 10 PB of log records per year (see [1]). Cloud scale implies thousands of servers and network elements, and hundreds of services from multiple cross-regional data centers. Cloud service operation data is scattered over various types of semi-structured and unstructured logs (e.g., application, error, debug), telemetry and network data, as well as customer service records. It is therefore extremely difficult for the multiple owners and administrators in such systems, coming from different units of the organization, to follow the possible paths and system alternatives in order to detect problems, solve issues, and understand the service operation.
3.5 Preservation Storage/Cloud Services Issues

Supporting media workflows on an advanced cloud object store platform
The media industry faces daily challenges in storing, managing, and facilitating collaboration over vast amounts of video, photos, text, audio, and social media. This, together with increasing quality factors like high frame rates and ultra-high resolution, drives the need for more dependable storage and scalable computation. As a solution to tackle these challenges, we describe the Active Media Store (AMS), which extends OpenStack Swift with storlets and content-centric access. Computation units called storlets run typical media workflow functions, such as transcoding, metadata extraction, and quality checks, close to the stored data, thereby reducing the data transfer bandwidth and latency typically associated with cloud storage. Powerful search capabilities over rich user-defined metadata enable content-centric access, the ability to find stored media objects based on their content and relationships. Media workflows are easily implemented and managed through the Media Bridge, middleware running over the AMS and adapting its low-level APIs. We describe an example workflow for file staging over this infrastructure in order to illustrate its power.
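
As a rough illustration of the storlet idea, here is a hypothetical sketch of a computation unit that produces metadata next to the stored object, so only the tiny result leaves the store. The `Storlet` interface below is invented for illustration and is not the actual OpenStack Swift storlets API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Map;

// Hypothetical interface capturing the shape of the concept: the store
// hands the storlet the object's data stream and a stream for results.
interface Storlet {
    void invoke(InputStream object, OutputStream result,
                Map<String, String> params) throws IOException;
}

// A storlet-style function computing trivial "metadata" (a byte count)
// close to the data, instead of downloading the whole object.
public class ByteCountStorlet implements Storlet {
    @Override
    public void invoke(InputStream object, OutputStream result,
                       Map<String, String> params) throws IOException {
        long bytes = 0;
        byte[] buf = new byte[8192];
        for (int n; (n = object.read(buf)) != -1; ) bytes += n;
        result.write(("bytes=" + bytes + "\n").getBytes());
    }

    public static void main(String[] args) throws IOException {
        new ByteCountStorlet().invoke(
            new ByteArrayInputStream(new byte[123_456]),
            System.out, Map.of());
    }
}
```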

We study the potential impact of different kinds of liveness information on the space consumption of a program in a garbage collected environment, specifically for Java. The idea is to measure the time difference between the actual time an object is collected by the garbage collector (GC) and the potential earliest time an object could be collected assuming liveness information were available. We focus on the following kinds of liveness information: (i) stack reference liveness (local reference variable liveness in Java), (ii) global reference liveness (static reference variable liveness in Java), (iii) heap reference liveness (instance reference variable liveness or array reference liveness in Java), and (iv) any combination of (i)-(iii). We also provide some insights on the kind of interface between a compiler and the GC that could achieve these potential savings. The Java Virtual Machine (JVM) was instrumented to measure (dynamic) liveness information. Experimental results are given for 10 benchmarks, including 5 from the SPECjvm98 benchmark suite. We show that in general stack reference liveness may yield small benefits, global reference liveness combined with stack reference liveness may yield medium benefits, and heap reference liveness yields the largest potential benefit. Specifically, for heap reference liveness we measure an average potential savings of 39% using an interface with complete liveness information, and an average savings of 15% using a more restricted interface.
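
A toy Java example of the heap reference liveness opportunity measured here: a field is dead (never read again) long before the object holding it becomes unreachable, so a purely reachability-based GC retains the memory. All names are illustrative.

```java
public class HeapLivenessExample {
    static class Node {
        byte[] payload = new byte[1 << 20];   // 1 MB reachable via this field
    }

    public static void main(String[] args) {
        Node node = new Node();
        long sum = 0;
        for (byte b : node.payload) sum += b; // provably the last read of payload
        // From here on node.payload is dead but still reachable, so a
        // reachability-based GC retains the 1 MB until node itself dies.
        // With heap reference liveness the collector could reclaim it now;
        // the manual equivalent of that knowledge is:
        node.payload = null;
        doMoreWorkWith(node);                 // node itself stays live much longer
        System.out.println(sum);
    }

    static void doMoreWorkWith(Node n) { /* uses the node, not its payload */ }
}
```
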
Atomic Garbage Collection
International Symposium on Memory Management, Sep 17, 1992

Until now object storage has not been a first-class citizen of the Apache Hadoop ecosystem, including Apache Spark. Hadoop connectors to object storage have been based on file semantics, an impedance mismatch that leads to low performance and the need for an additional consistent storage system to achieve fault tolerance. In particular, Hadoop depends on its underlying storage system and its associated connector for fault tolerance and for allowing speculative execution. However, these characteristics are obtained through file operations that are not native to object storage, and are both costly and not atomic. As a result, these connectors are not efficient and, more importantly, they cannot help with fault tolerance for object storage. We introduce Stocator, whose novel algorithm achieves both high performance and fault tolerance by taking advantage of object storage semantics. This greatly decreases the number of operations on object storage, as well as enabling a much simpler approach to dealing with the eventually consistent semantics typical of object storage. We have implemented Stocator and shared it in open source. Performance testing with Apache Spark shows that it can be 18 times faster for write-intensive workloads and can perform 30 times fewer operations on object storage than the legacy Hadoop connectors, reducing costs both for the client and for the object storage service provider.

We study the effectiveness of garbage collection (GC) algorithms by measuring the time difference between the actual collection time of an object and the potential earliest collection time for that object. Our ultimate goal is to use this study to develop static analysis techniques that can be used together with GC to allow earlier reclamation of objects. The results may also be used to pinpoint application source code that could be rewritten in a way that would allow more timely GC. Specifically, we compare the objects reachable from the root set to the ones that are actually used again. The idea is that GC could reclaim unused objects even if they are reachable from the root set. Thus, our experiments indicate a kind of upper bound on the storage savings that could be achieved. We also try to characterize these objects in order to understand the potential benefits of various static analysis algorithms. The Java Virtual Machine (JVM) was instrumented to measure objects that are reachable but not used again, and to characterize these objects. Experimental results are shown for the SPECjvm98 benchmark suite. The potential memory savings for these benchmarks range from 23% to 74%.
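
A minimal sketch of the measurement idea, under the assumption that a relative "tick" clock suffices: record the last time each tracked object is used and report the gap to its actual collection. `DragMeter` and its mechanics (a `Cleaner` standing in for the paper's JVM instrumentation) are illustrative, not the instrumented JVM.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicLong;

public class DragMeter {
    static final Cleaner CLEANER = Cleaner.create();
    static final AtomicLong CLOCK = new AtomicLong();

    static final class Tracked {
        // Last-use time lives in a one-element array so the Cleaner action
        // can read it without keeping the Tracked object reachable.
        final long[] lastUse = { CLOCK.get() };
        Tracked() {
            final long[] cell = lastUse;
            CLEANER.register(this, () -> System.out.println(
                "drag = " + (CLOCK.get() - cell[0]) + " ticks"));
        }
        void use() { lastUse[0] = CLOCK.incrementAndGet(); }
    }

    public static void main(String[] args) {
        Tracked t = new Tracked();
        t.use();                                  // last real use
        for (int i = 0; i < 1_000_000; i++)       // time passes while the
            CLOCK.incrementAndGet();              // object is still reachable
        t = null;                                 // only now unreachable
        System.gc();                              // best-effort collection
        try { Thread.sleep(100); } catch (InterruptedException ignored) {}
    }
}
```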

Network Aware Reliability Analysis for Distributed Storage Systems
It is hard to measure the reliability of a large distributed storage system, since it is influenced by low-probability failure events that occur over time. Nevertheless, it is critical to be able to predict reliability in order to plan, deploy and operate the system. Existing approaches suffer from unrealistic assumptions regarding network bandwidth. This paper introduces a new framework that combines simulation and an analytic model to estimate durability for large distributed cloud storage systems. Our approach is the first to take into account network bandwidth with a focus on the cumulative effect of simultaneous failures on repair time. Using our framework we evaluate the trade-offs between durability, network and storage costs for the OpenStack Swift object store, comparing various system configurations and resiliency schemes, including replication and erasure coding. In particular, we show that when accounting for the cumulative effect of simultaneous failures, estimates of the probability of data loss can vary by two to four orders of magnitude.
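
A back-of-the-envelope sketch of the effect modeled here: repair traffic from simultaneous failures shares a fixed network bandwidth, so each additional concurrent failure stretches every repair window, and with it the window of vulnerability to further failures. All numbers and names are illustrative assumptions.

```java
public class RepairWindowSketch {
    public static void main(String[] args) {
        double diskTB = 8.0;              // data to re-replicate per failed disk
        double repairGbps = 40.0;         // total bandwidth reserved for repair
        double tbToGb = 8.0 * 1024.0;     // terabytes of data -> gigabits

        for (int failures = 1; failures <= 8; failures *= 2) {
            double totalGb = failures * diskTB * tbToGb;
            double seconds = totalGb / repairGbps;  // shared-bandwidth repair time
            System.out.printf("%d simultaneous failures -> repair window %.1f h%n",
                              failures, seconds / 3600.0);
        }
        // A longer window leaves more time for further failures to hit the
        // same data, which is why ignoring this cumulative effect can
        // misstate data-loss probability by orders of magnitude.
    }
}
```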

A stable heap is storage that is managed automatically using garbage collection, manipulated using atomic transactions, and accessed using a uniform storage model. These features enhance reliability and simplify programming by preventing errors due to explicit deallocation, by masking failures and concurrency using transactions, and by eliminating the distinction between accessing temporary storage and permanent storage. Stable heap management is useful for programming languages for reliable distributed computing, programming languages with persistent storage, and object-oriented database systems. Many applications that could benefit from a stable heap (e.g., computer-aided design, computer-aided software engineering, and office information systems) require large amounts of storage, timely responses for transactions, and high availability. We present garbage collection and recovery algorithms for a stable heap implementation that meet these goals and are appropriate for stock hardware. The collector is incremental: it does not attempt to collect the whole heap at once. The collector is also atomic: it is coordinated with the recovery system to prevent problems when it moves and modifies objects. The time for recovery is independent of heap size, even if a failure occurs during garbage collection.

Sigplan Notices, May 17, 2002
Multithreaded applications with multi-gigabyte heaps running on modern servers provide new challenges for garbage collection (GC). The challenges for "server-oriented" GC include ensuring short pause times on a multi-gigabyte heap while minimizing the throughput penalty, scaling well on multiprocessor hardware, and keeping to a minimum the number of expensive multi-cycle fence instructions required by weak ordering. We designed and implemented a fully parallel, incremental, mostly concurrent collector, which employs several novel techniques to meet these challenges. First, it combines incremental GC, to ensure short pause times, with concurrent low-priority background GC threads, to take advantage of processor idle time. Second, it employs a low-overhead work packet mechanism to enable full parallelism among the incremental and concurrent collecting threads and to ensure load balancing. Third, it reduces memory fence instructions by using batching techniques: one fence for each block of small objects allocated, one fence for each group of objects marked, and no fence at all in the write barrier. When compared to the mature, well-optimized parallel stop-the-world mark-sweep collector already in the IBM JVM, our collector prototype reduces the maximum pause time from 284 ms to 101 ms, and the average pause time from 266 ms to 66 ms, while losing only 10% throughput when running the SPECjbb2000 benchmark on a 256 MB heap on a 4-way 550 MHz Pentium multiprocessor.
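
A minimal sketch in the spirit of the work packet mechanism described above: tracing work is exchanged in fixed-size packets through shared pools, so any number of incremental or concurrent threads can pick up work and stay load balanced, and publishing a whole packet is the natural point for per-group batching of synchronization. All names are illustrative; this is not the IBM JVM implementation, and the mark-bit test is omitted for brevity.

```java
import java.util.ArrayDeque;
import java.util.concurrent.ConcurrentLinkedQueue;

public class WorkPacketsSketch {
    static final int PACKET_SIZE = 512;
    // Pools of full and empty packets shared by all collecting threads.
    static final ConcurrentLinkedQueue<ArrayDeque<Object>> full = new ConcurrentLinkedQueue<>();
    static final ConcurrentLinkedQueue<ArrayDeque<Object>> empty = new ConcurrentLinkedQueue<>();

    // Any GC thread (incremental or background) runs this same loop.
    static void markLoop(java.util.function.Function<Object, Iterable<Object>> children) {
        ArrayDeque<Object> in;
        ArrayDeque<Object> out = takeEmpty();
        while ((in = full.poll()) != null) {     // grab a packet of gray objects
            for (Object obj : in) {
                for (Object child : children.apply(obj)) {
                    out.add(child);              // mark-bit check elided
                    if (out.size() == PACKET_SIZE) {
                        full.add(out);           // publish a whole packet at once:
                        out = takeEmpty();       // one synchronization per group
                    }
                }
            }
            in.clear();
            empty.add(in);                       // recycle the drained packet
        }
        if (!out.isEmpty()) full.add(out); else empty.add(out);
    }

    static ArrayDeque<Object> takeEmpty() {
        ArrayDeque<Object> p = empty.poll();
        return p != null ? p : new ArrayDeque<>(PACKET_SIZE);
    }

    public static void main(String[] args) {
        ArrayDeque<Object> roots = takeEmpty();
        roots.add("root");
        full.add(roots);
        markLoop(obj -> java.util.List.of());    // toy graph: no children
        System.out.println("packets recycled: " + empty.size());
    }
}
```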

The need for online storage and backup of data constantly increases. Enterprises from different domains, such as media and telco operators, would like to be able to store large amounts of data which can then be made available from different geographic locations. The Cloud paradigm allows for the illusion of unlimited online storage space, hiding the complexity of managing a large infrastructure from the end users. While Storage Cloud solutions, such as Amazon S3, have already met with success, more fine-grained guarantees on the provided QoS need to be offered for their larger uptake. In this paper, we address the problem of managing SLAs in Storage Clouds by taking advantage of the content terms that pertain to the stored objects, thus supporting more efficient capabilities, such as quicker search and retrieval operations. A system that better captures the content of the signed SLAs is very useful, as the benefits in terms of reducing the management overhead while offering better services are potentially significant.

We present Stocator, a high performance object store connector for Apache Spark that takes advantage of object store semantics. Previous connectors have assumed file system semantics, in particular achieving fault tolerance and allowing speculative execution by creating temporary files to avoid interference between worker threads executing the same task and then renaming these files. Rename is not a native object store operation; not only is it not atomic, but it must be implemented using a costly copy operation followed by a delete. Instead, our connector leverages the inherent atomicity of object creation; by avoiding the rename paradigm it greatly decreases the number of operations on the object store, as well as enabling a much simpler approach to dealing with the eventually consistent semantics typical of object stores. We have implemented Stocator and shared it in open source. Performance testing shows that it is as much as 18 times faster for write-intensive workloads and performs as much as 30 times fewer operations on the object store than the legacy Hadoop connectors, reducing costs both for the client and for the object storage service provider.
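
A sketch of the alternative paradigm under stated assumptions (a toy in-memory store where PUT is atomic, and a hypothetical key layout): each task attempt writes directly to the final location under a name embedding its attempt identity, so speculative attempts never clobber each other and no temporary files, listings, or renames are needed. This illustrates the idea, not Stocator's actual code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DirectWriteSketch {
    static final Map<String, byte[]> store = new ConcurrentHashMap<>();

    // PUT is atomic: the object either exists completely or not at all.
    static void put(String key, byte[] data) { store.put(key, data); }

    // A speculative re-execution writes under a different attempt id, so
    // two attempts at the same task produce distinct objects.
    static void writeOutput(int part, int attempt, byte[] data) {
        put(String.format("output/part-%05d-attempt%d", part, attempt), data);
    }

    public static void main(String[] args) {
        writeOutput(0, 0, "slow attempt".getBytes());
        writeOutput(0, 1, "speculative attempt".getBytes());
        // The driver records which attempt succeeded (say, attempt 1);
        // readers then select only that attempt's objects.
        byte[] result = store.get("output/part-00000-attempt1");
        System.out.println(new String(result));
    }
}
```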

Springer eBooks, 2005
A reference-counting garbage collector cannot reclaim unreachable cyclic structures of objects. Therefore, reference-counting collectors either use a backup tracing collector infrequently, or employ a cycle collector to reclaim cyclic structures. We propose a new concurrent cycle collector, i.e., one that runs concurrently with the program threads, imposing negligible pauses (of around 1 ms) on a multiprocessor. Our new collector combines the state-of-the-art cycle collector [5] with the sliding-views collectors [20, 2]. The use of sliding views for cycle collection yields two advantages. First, it drastically reduces the number of cycle candidates, which, in turn, drastically reduces the work required to record and trace these candidates. Therefore, a large improvement in cycle collection efficiency is obtained. Second, it eliminates the theoretical termination problem that appeared in the previous concurrent cycle collector, where a rare race may delay the reclamation of an unreachable cyclic structure forever. The sliding-views cycle collector guarantees reclamation of all unreachable cyclic structures. The proposed collector was implemented on the Jikes RVM, and we provide measurements, including a comparison between the use of backup tracing and the use of cycle collection with reference counting. To the best of our knowledge, such a comparison has not been reported before.
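
For background, here is a sketch of trial deletion, the classic synchronous cycle-detection idea behind candidate-based cycle collectors such as [5]; the concurrent, sliding-views machinery of this paper is far more involved. A candidate's subgraph has its internal references subtracted; members whose counts drop to zero are kept alive only by the cycle. Names are illustrative, and a real collector re-scans from externally referenced nodes before freeing anything.

```java
import java.util.*;

public class TrialDeletionSketch {
    static final class Obj {
        int rc;                             // reference count
        final List<Obj> refs = new ArrayList<>();
    }

    // Returns the members of candidate's subgraph whose reference counts
    // come entirely from within the subgraph, i.e., a candidate garbage cycle.
    static Set<Obj> collectCycle(Obj candidate) {
        Map<Obj, Integer> trial = new IdentityHashMap<>();
        markGray(candidate, trial);
        Set<Obj> dead = new HashSet<>();
        for (Map.Entry<Obj, Integer> e : trial.entrySet())
            if (e.getValue() == 0) dead.add(e.getKey());
        return dead;  // a real collector re-scans before freeing these
    }

    static void markGray(Obj o, Map<Obj, Integer> trial) {
        if (trial.containsKey(o)) return;
        trial.put(o, o.rc);
        for (Obj child : o.refs) {
            markGray(child, trial);
            trial.merge(child, -1, Integer::sum);  // subtract internal reference
        }
    }

    public static void main(String[] args) {
        Obj a = new Obj(); Obj b = new Obj();
        a.refs.add(b); b.refs.add(a);       // unreachable 2-cycle
        a.rc = 1; b.rc = 1;                 // counts held only by each other
        System.out.println("cycle members found: " + collectCycle(a).size());
    }
}
```
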
Lecture Notes in Computer Science, 2003
We present a framework for statically reasoning about temporal heap safety properties. We focus on local temporal heap safety properties, in which the verification process may be performed for a program object independently of other program objects. We apply our framework to produce new conservative static algorithms for compile-time memory management, which prove for certain program points that a memory object or a heap reference will not be needed further. These algorithms can be used to reduce the space consumption of Java programs. We have implemented a prototype of our framework and used it to verify compile-time memory management properties for several small but interesting example programs, including JavaCard programs.
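
A tiny Java example of the kind of fact such an analysis might prove, with illustrative names: after a provably last use, a heap reference "will not be needed further", so the transformed program can sever it well before its container dies.

```java
import java.util.HashMap;
import java.util.Map;

public class EarlyReleaseExample {
    public static void main(String[] args) {
        Map<String, byte[]> cache = new HashMap<>();
        cache.put("config", new byte[1 << 20]);

        byte[] cfg = cache.get("config");
        int checksum = 0;
        for (byte x : cfg) checksum += x;  // provably the last use of the entry

        // If the analysis verifies the local temporal property "the entry
        // under 'config' is never read after this point", the transformed
        // program may release the heap reference here, long before the
        // cache itself becomes unreachable:
        cache.remove("config");
        cfg = null;

        System.out.println(checksum);
        // ... cache remains live for other keys for a long time ...
    }
}
```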

Garbage collectors of the mark-sweep family may suffer from memory fragmentation and require the use of compaction. Known compaction methods are expensive and work while program activity is stopped, so that compaction is often a major contributor to garbage collection pause times. We present a parallel incremental compaction algorithm that reduces pause times by working in parallel and evacuating only a part of the heap when the program threads are stopped for garbage collection. Our algorithm works with collectors based on mark-sweep, including mostly concurrent collectors. We have implemented a prototype of our algorithm as part of the garbage collector in the IBM JVM. Measurements of our prototype show that even with the most simple-minded policies, e.g., for choosing the area to evacuate, parallel incremental compaction can successfully reduce maximum garbage collection pause times with a minimal performance penalty.
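
A schematic sketch of the incremental step, under the simplifying assumptions of explicit region objects and a precomputed fragmentation metric: each collection evacuates a single chosen region, copying its live objects and recording forwarding pointers, so the compaction work added to any one pause stays bounded. All structures and the most-fragmented-first policy are illustrative.

```java
import java.util.*;

public class IncrementalCompactionSketch {
    static final class Region {
        final double fragmentation;              // wasted-space metric
        final List<Object> live = new ArrayList<>();
        Region(double f) { fragmentation = f; }
    }

    static final List<Region> heap = new ArrayList<>();
    static final Map<Object, Object> forwarding = new IdentityHashMap<>();

    // Runs while program threads are stopped for GC, but evacuates only
    // one region per collection, bounding the added pause time.
    static void compactionIncrement() {
        Region victim = Collections.max(heap,
                Comparator.comparingDouble((Region r) -> r.fragmentation));
        Region dense = new Region(0.0);
        for (Object obj : victim.live) {
            Object copy = new Object();          // stands in for copying the bytes
            forwarding.put(obj, copy);           // used later to fix up references
            dense.live.add(copy);
        }
        heap.remove(victim);
        heap.add(dense);
    }

    public static void main(String[] args) {
        heap.add(new Region(0.7));
        heap.add(new Region(0.2));
        heap.get(0).live.add(new Object());
        compactionIncrement();
        System.out.println("regions: " + heap.size()
                + ", forwarded: " + forwarding.size());
    }
}
```
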
In designing fault-tolerant distributed database systems, a frequent goal is making the system highly available despite component failure. We examine software approaches to achieving high availability in the presence of partitions. In particular, we consider various replicated-data management protocols that maintain database consistency and attempt to increase database availability when networks partition. We conclude that no protocol does better than a bound we have determined. Our conclusions hold under the assumption that the pattern of data accesses by transactions is uniform; there may be some particular distribution for which specialized protocols can increase availability.
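
To illustrate the flavor of such an availability ceiling, here is a worked example with quorum replication, one well-known consistency-preserving protocol (the numbers are illustrative): read and write quorums must intersect, so after a partition at most one side can continue writing.

```java
public class QuorumSketch {
    // One-copy consistency requires every read quorum to intersect every
    // write quorum (r + w > n) and write quorums to intersect each other
    // (2w > n), so conflicting writes cannot both commit.
    static boolean consistent(int n, int r, int w) {
        return r + w > n && 2 * w > n;
    }

    public static void main(String[] args) {
        int n = 5, r = 3, w = 3;
        System.out.println("consistent: " + consistent(n, r, w));

        // A partition splits the 5 replicas into sides of 3 and 2:
        int sideA = 3, sideB = 2;
        System.out.println("side A can write: " + (sideA >= w)); // true
        System.out.println("side B can write: " + (sideB >= w)); // false
        // At most one side keeps a write quorum: no consistency-preserving
        // protocol lets both sides proceed, which is the kind of upper
        // bound on availability the paper formalizes.
    }
}
```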