Papers by Hubertus Franke

2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020
The need for higher energy efficiency has resulted in the proliferation of accelerators across platforms, with custom and reconfigurable accelerators adopted in both edge devices and cloud servers. However, existing solutions fall short in providing accelerators with low-latency, high-bandwidth access to the working set, and suffer from the high latency and energy cost of data transfers. Such costs can severely limit the smallest granularity of the tasks that can be accelerated and thus the applicability of the accelerators. In this work, we present FReaC Cache, a novel architecture that natively supports reconfigurable computing in the last-level cache (LLC), thereby giving energy-efficient accelerators low-latency, high-bandwidth access to the working set. By leveraging the cache's existing dense memory arrays, buses, and logic folding, we construct a reconfigurable fabric in the LLC with minimal changes to the system, processor, cache, and memory architecture. FReaC Cache is a low-latency, low-cost, and low-power alternative to off-die/off-chip accelerators, and a flexible, low-cost alternative to fixed-function accelerators. We demonstrate an average speedup of 3X and Perf/W improvements of 6.1X over an edge-class multi-core CPU, with a 3.5% to 15.3% area overhead per cache slice.

2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2018
Cost-effective and scalable analysis of the human genome is crucial for the democratization of precision medicine. The new version of the Genome Analysis Toolkit (GATK4), an industry-standard end-to-end tool for variant discovery analysis in next-generation sequencing (NGS) data, introduces Apache Spark support to improve scaling for both local multithreading and cluster-wide parallelization, as well as to facilitate deployment on cloud infrastructures. In this paper, we evaluate the performance and scalability of GATK4-Spark running on a next-generation cloud platform. After identifying bottlenecks and scaling challenges, we optimize the software stack with an optimized JVM, enhancements to Spark, and targeted configuration tuning, which in turn enables more effective use of the underlying computing resources. We demonstrate the effectiveness of our comprehensive optimization techniques on a reference Single Nucleotide Polymorphisms (SNPs) pipeline, achieving ≤1 hr computation time for whole human genome analysis.
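
As a concrete illustration of the kind of stack-level tuning described above, here is a minimal PySpark sketch; the parameter values are illustrative assumptions for a large multicore node, not the tuned configuration reported in the paper.

```python
# Illustrative only: values below are assumptions for a large NUMA node,
# not the paper's measured configuration.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gatk4-spark-tuning-sketch")
    # Match default parallelism to the cores actually available.
    .config("spark.default.parallelism", 112)
    .config("spark.sql.shuffle.partitions", 112)
    # Keep executors large enough to amortize JVM startup and GC overheads.
    .config("spark.executor.memory", "48g")
    .config("spark.executor.cores", 14)
    # JVM flags: a low-pause collector helps long-running genomics stages.
    .config("spark.executor.extraJavaOptions",
            "-XX:+UseG1GC -XX:MaxGCPauseMillis=200")
    .getOrCreate()
)
```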

IBM Journal of Research and Development, 2013
The ability to analyze massive amounts of network traffic data in real time is becoming increasingly important for communication service providers, as it enables them to optimize use of their service infrastructure and develop innovative revenue-generating opportunities. In particular, the real-time analysis of perishable user traffic (which is not stored because of privacy, regulatory, and other constraints) can provide insights into the use of applications and services by telecommunication subscribers. In this paper, we describe the design and implementation of a novel system for real-time analysis of network traffic based on IBM InfoSphere® Streams, a scalable stream-processing platform, which provides access and analysis with respect to the data objects and communication patterns of users at the application layer, in contrast to simple packet- and flow-based analysis that most current systems provide. We discuss our design considerations for such a system and further describe analytics applications developed to showcase its capabilities: online identification of most-frequent objects, online social network discovery, and real-time sentiment analysis. We also present performance results from a pilot deployment of this platform and its applications that analyzed Internet traffic generated by users at a large corporate research lab.
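
The "online identification of most-frequent objects" analytic is essentially a streaming heavy-hitters problem. The paper's implementation runs on InfoSphere Streams; the sketch below only illustrates the underlying algorithmic idea with the well-known Space-Saving counter, as one reasonable realization, not the system's actual code.

```python
class SpaceSaving:
    """Approximate top-k counter for an unbounded stream (Space-Saving).

    Keeps at most k counters; an item's true count is overestimated by
    at most the smallest stored count, so heavy hitters are never missed.
    """
    def __init__(self, k: int):
        self.k = k
        self.counts: dict[str, int] = {}

    def update(self, item: str) -> None:
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.k:
            self.counts[item] = 1
        else:
            # Evict the current minimum; the newcomer inherits its count + 1.
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + 1

    def top(self, n: int):
        return sorted(self.counts.items(), key=lambda kv: -kv[1])[:n]

# Usage: feed object identifiers (e.g. URLs) as they appear on the wire.
ss = SpaceSaving(k=100)
for url in ["/a", "/b", "/a", "/c", "/a"]:
    ss.update(url)
print(ss.top(3))
```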
In this paper we discuss an implementation of the Message Passing Interface standard (MPI) for the IBM Scalable POWERparallel 1 and 2 (SP1, SP2). Key to a reliable and efficient implementation of a message passing library on these machines is the careful design of a UNIX-socket-like layer in user space with controlled access to the communication adapters and with adequate recovery and flow control. The performance of this implementation is at the same level as the IBM-proprietary message passing library (MPL). We also show that on the IBM SP1 and SP2 we achieve integrated tracing, where both system events, such as context switches and page faults, and MPI-related activities are traced with minimal overhead to the application program, thus presenting application programmers with a trace of all the events that ultimately affect the efficiency of a parallel program.
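
For readers unfamiliar with the interface the paper implements, here is what MPI point-to-point messaging looks like in today's mpi4py bindings; this is only a usage illustration of the standard, not the SP1/SP2-specific user-space transport described above.

```python
# Minimal MPI point-to-point example (run with: mpiexec -n 2 python demo.py).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Rank 0 sends a Python object to rank 1 with a matching tag.
    comm.send({"payload": list(range(4))}, dest=1, tag=7)
elif rank == 1:
    msg = comm.recv(source=0, tag=7)
    print("rank 1 received:", msg)
```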

arXiv, Mar 24, 2022
Systems-on-Chips (SoCs) that power autonomous vehicles (AVs) must meet stringent performance and safety requirements prior to deployment. With increasing complexity in AV applications, the system needs to meet the stringent real-time demands of multiple safety-critical applications simultaneously. A typical AV-SoC is a heterogeneous multiprocessor consisting of accelerators supported by general-purpose cores. Such heterogeneity, while needed for power-performance efficiency, complicates the art of task (process) scheduling. In this paper, we demonstrate that hardware heterogeneity impacts the scheduler's effectiveness and that optimizing for only the real-time aspect of applications is not sufficient in AVs. Therefore, a more holistic approach is required, one that considers global Quality-of-Mission (QoM) metrics, as defined in the paper. We then propose HetSched, a multi-step scheduler that leverages dynamic runtime information about the underlying heterogeneous hardware platform, along with the applications' real-time constraints and the task traffic in the system, to optimize overall mission performance. HetSched proposes two scheduling policies, MSstat and MSdyn, and scheduling optimizations such as task pruning, hybrid heterogeneous ranking, and rank update. HetSched improves overall mission performance on average by 4.6×, 2.6×, and 2.6× when compared against CPATH, ADS, and 2lvl-EDF (state-of-the-art real-time schedulers built for heterogeneous systems), respectively, and achieves on average 53.3% higher hardware utilization, while meeting 100% of critical deadlines for real-world applications of autonomous driving and aerial vehicles. Furthermore, when used as part of an SoC design space exploration loop, HetSched reduces the number of processing elements required by an SoC to safely complete an AV's missions by 35% on average compared to the prior schedulers, while achieving a 2.7× lower energy-mission-time product. (Offline application profiling is a common approach across most of the schedulers considered in this work.)
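
The abstract names HetSched's policies and optimizations without detailing them, so the following is only a hedged sketch of generic rank-based assignment with deadline-aware task pruning on heterogeneous processing elements; all names, runtimes, and the ranking function are illustrative assumptions, not HetSched's actual policies.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    deadline: float     # absolute deadline (s)
    runtimes: dict      # PE type -> profiled runtime (s), from offline profiling

def schedule(tasks, free_pes, now):
    """Toy rank-and-assign pass over heterogeneous processing elements (PEs).

    free_pes maps a PE type (e.g. 'cpu', 'gpu') to how many are free.
    Returns (assignments, dropped); dropped tasks cannot meet their deadline.
    """
    def rank(t):
        # Negative slack on the task's fastest PE: less slack = higher rank.
        return -(t.deadline - now - min(t.runtimes.values()))

    assignments, dropped = [], []
    for t in sorted(tasks, key=rank, reverse=True):
        # Task pruning: drop work that can no longer meet its deadline on
        # any currently free PE.
        feasible = [p for p, rt in t.runtimes.items()
                    if free_pes.get(p, 0) > 0 and now + rt <= t.deadline]
        if not feasible:
            dropped.append(t.name)
            continue
        pe = min(feasible, key=lambda p: t.runtimes[p])
        free_pes[pe] -= 1
        assignments.append((t.name, pe))
    return assignments, dropped

# Example: three tasks with profiled runtimes per PE type.
tasks = [Task("detect", 10.0, {"cpu": 8.0, "gpu": 2.0}),
         Task("plan",   6.0,  {"cpu": 3.0}),
         Task("log",    4.0,  {"cpu": 5.0})]
print(schedule(tasks, {"cpu": 1, "gpu": 1}, now=0.0))
```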

IEEE Transactions on Parallel and Distributed Systems, 2018
Recent research trends exhibit a growing imbalance between the demands of tenants' software applications and the provisioning of hardware resources. Misalignment of demand and supply gradually hinders workloads from being efficiently mapped to fixed-sized server nodes in traditional data centers. The resulting resource holes not only lower infrastructure utilization but also cripple a data center's capability to host large workloads. This deficiency motivates the development of a new rack-wide architecture referred to as the composable system. The composable system transforms traditional server racks of static capacity into a dynamic compute platform. Specifically, this novel architecture aims to link up all the compute components that are traditionally confined to individual server boards, such as the central processing unit (CPU), random access memory (RAM), storage devices, and other application-specific processors. By doing so, a logically giant compute platform is created, one that is more resilient to varying workload demands because it breaks the resource boundaries among traditional server boards. In this paper, we introduce the concepts of this reconfigurable architecture and design a framework of the composable system for cloud data centers. We then develop mathematical models to describe the resource usage patterns on this platform and enumerate some types of workloads that commonly appear in data centers. From the simulations, we show that the composable system sustains up to 1.6 times the workload intensity of traditional systems and is insensitive to the distribution of workload demands. This demonstrates that the composable system is indeed an effective solution for supporting cloud data center services.
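
A toy model makes the "resource hole" argument concrete: when CPU and RAM are locked per server, skewed demands strand capacity that a rack-wide pool could still use. The numbers below are illustrative assumptions, not the paper's workload models.

```python
# Compare how many jobs fit with per-server resources vs a rack-wide pool.
import random

random.seed(1)
SERVERS = 16
CPU_PER_SERVER, RAM_PER_SERVER = 32, 256        # cores, GB

def demands(n):
    # Skewed demands: some jobs are CPU-heavy, some RAM-heavy.
    return [(random.choice([2, 4, 16]), random.choice([8, 64, 192]))
            for _ in range(n)]

def fit_fixed(jobs):
    free = [[CPU_PER_SERVER, RAM_PER_SERVER] for _ in range(SERVERS)]
    placed = 0
    for cpu, ram in jobs:
        for s in free:                          # first-fit onto one server
            if s[0] >= cpu and s[1] >= ram:
                s[0] -= cpu; s[1] -= ram; placed += 1
                break
    return placed

def fit_pooled(jobs):
    cpu_pool = SERVERS * CPU_PER_SERVER         # resources disaggregated
    ram_pool = SERVERS * RAM_PER_SERVER
    placed = 0
    for cpu, ram in jobs:
        if cpu_pool >= cpu and ram_pool >= ram:
            cpu_pool -= cpu; ram_pool -= ram; placed += 1
    return placed

jobs = demands(400)
print("fixed servers:", fit_fixed(jobs), "| pooled rack:", fit_pooled(jobs))
```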

IBM Journal of Research and Development, 2010
In this paper, we examine two network-processing appliances, the IBM Proventia® Network Intrusion Prevention System and the IBM WebSphere® DataPower® service-oriented architecture appliance, and the specific requirements they pose on emerging heterogeneous multicore-processor systems. We first describe the function and architecture of these applications. Next, we describe the computational requirements imposed on the applications as a result of the expectation that they operate at the maximum transmission rate on high-speed networks (i.e., on networks at speeds greater than 10 Gb/s) with minimal latency. Given that next-generation systems will provide on-chip and off-chip hardware acceleration functions, we identify and quantify the functions that can be offloaded onto hardware accelerators to provide latency reduction through more efficient execution, increased concurrency, or both. Referring to models of specific hardware accelerators, we estimate and quantify the impact on the performance of the applications. We conclude with a discussion of the modifications to these applications that are required to exploit the large number of hardware threads and accelerators available on emerging multicore-processor systems.
Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA '03), 2003

IEEE Transactions on Parallel and Distributed Systems, 2003
Effective scheduling strategies to improve response times, throughput, and utilization are an important consideration in large supercomputing environments. Parallel machines in these environments have traditionally used space-sharing strategies to accommodate multiple jobs at the same time by dedicating the nodes to a single job until it completes. This approach, however, can result in low system utilization and large job wait times. This paper discusses three techniques that can be used beyond simple space-sharing to improve the performance of large parallel systems. The first technique we analyze is backfilling, the second is gang-scheduling, and the third is migration. The main contribution of this paper is an analysis of the effects of combining the above techniques. Using extensive simulations based on detailed models of realistic workloads, the benefits of combining the various techniques are shown over a spectrum of performance criteria.
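
Of the three techniques, backfilling is the easiest to make concrete in code. The sketch below follows the common EASY-backfilling discipline (later jobs may start early only if they do not delay the queue head's reservation); it is a simplified illustration, not the simulator used in the paper.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    nodes: int
    est_runtime: float   # user-provided estimate, as in real batch schedulers

def backfill(queue, free_nodes, running):
    """running: list of (finish_time, nodes), times relative to 'now'.
    Returns the jobs to start immediately."""
    started = []
    # Start jobs from the head of the queue while they fit.
    while queue and queue[0].nodes <= free_nodes:
        j = queue.pop(0)
        free_nodes -= j.nodes
        started.append(j)
    if not queue:
        return started
    # Reserve the earliest start time for the (blocked) head job.
    head, avail, reserve_at = queue[0], free_nodes, 0.0
    for finish, nodes in sorted(running):
        avail += nodes
        if avail >= head.nodes:
            reserve_at = finish
            break
    # Backfill: a later job may start now only if it fits in the free nodes
    # and finishes before the head job's reserved start time.
    for j in list(queue[1:]):
        if j.nodes <= free_nodes and j.est_runtime <= reserve_at:
            queue.remove(j)
            free_nodes -= j.nodes
            started.append(j)
    return started

q = [Job("big", 8, 100.0), Job("small", 2, 10.0)]
print([j.name for j in backfill(q, free_nodes=4, running=[(50.0, 8)])])
```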
IBM Journal of Research and Development, 2014

IBM Journal of Research and Development, 2014
A fundamental component of any large-scale computer system is infrastructure. Cloud computing has completely changed the way infrastructure is viewed, offering more simplicity, flexibility, and monetary benefits compared to a traditional view of infrastructure. At the core of this transformation is the notion of virtualization of infrastructure as a whole, with providers offering infrastructure-as-a-service (IaaS) to consumers. However, offering IaaS alone is insufficient for software defined environments (SDEs). This paper examines infrastructure in the context of SDE and discusses what we believe are some of the fundamental characteristics required of such infrastructure, called software defined infrastructure (SDI), and how it fits into the larger landscape of cloud computing environments and SDEs. Various components of SDI are discussed, including core intelligence, monitoring pieces, and management, in addition to a brief discussion on silos such as compute, network, and storage. Consumer and provider points of view are also presented, along with infrastructure-level service-level agreements (SLAs). Also presented are the design principles and high-level architectural design of the infrastructure intelligence controller, which constantly transforms infrastructure to honor consumer requirements (SLAs) amidst provider constraints (costs). We believe that the insights presented in this paper can be used for better design of SDE architectures and of data-center systems software in general.

Proceedings of the 2nd European Workshop on Machine Learning and Systems, 2022
Serverless Function-as-a-Service (FaaS) is an emerging cloud computing paradigm that frees application developers from infrastructure management tasks such as resource provisioning and scaling. To reduce the tail latency of functions and improve resource utilization, recent research has focused on applying online learning algorithms such as reinforcement learning (RL) to manage resources. Compared to existing heuristics-based resource management approaches, RL-based approaches eliminate humans in the loop and avoid the painstaking generation of heuristics. In this paper, we show that the state-of-the-art single-agent RL algorithm (S-RL) suffers up to 4.6× higher function tail latency degradation on multi-tenant serverless FaaS platforms and is unable to converge during training. We then propose and implement a customized multi-agent RL algorithm based on Proximal Policy Optimization, i.e., multi-agent PPO (MA-PPO). We show that in multi-tenant environments, MA-PPO enables each agent to be trained until convergence and provides online performance comparable to S-RL in single-tenant cases, with less than 10% degradation. Moreover, MA-PPO improves on S-RL's function tail latency by 4.4× in multi-tenant cases.
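
A full PPO implementation is beyond the scope of an abstract, so the sketch below shows only the multi-agent decomposition the paper argues for: one agent per tenant function, each acting on its own observation. The policy and environment are placeholder stubs standing in for PPO and the FaaS platform; every name and number here is an assumption.

```python
import random

class AgentStub:
    """Placeholder for a per-tenant PPO agent.

    In MA-PPO each agent holds its own policy and is trained on its own
    (state, action, reward) trajectories; the random policy below is a
    stub so the control flow stays runnable.
    """
    def __init__(self, actions):
        self.actions = actions
    def act(self, obs):
        return random.choice(self.actions)   # PPO would sample policy(obs)
    def update(self, trajectory):
        pass                                 # PPO clipped-surrogate update

def step_platform(allocations):
    # Stand-in environment: reward = negative tail latency, modeled here as
    # a penalty for under-allocation. A real platform would measure p99.
    return {f: -max(0, 4 - cpus) for f, cpus in allocations.items()}

tenants = ["fn-a", "fn-b", "fn-c"]
agents = {f: AgentStub(actions=[1, 2, 4, 8]) for f in tenants}  # CPU shares

for episode in range(3):
    obs = {f: {"inflight": random.randint(0, 10)} for f in tenants}
    allocations = {f: agents[f].act(obs[f]) for f in tenants}
    rewards = step_platform(allocations)
    for f in tenants:   # each agent learns only from its own tenant
        agents[f].update((obs[f], allocations[f], rewards[f]))
```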

IBM Journal of Research and Development, 2016
OpenStack® (the leading open source platform for public and private infrastructure-as-a-service clouds) is composed of a set of loosely coupled and rapidly evolving projects that support a wide set of technologies and configuration options. Deciding how to combine and configure such projects is the determining factor on the overall quality of the cloud, in terms of performance, scalability, and availability. In this paper, we present a methodical framework and empirical analysis to help both cloud providers and users optimize their design and deployment decisions. Cloud providers can rely on this framework to select an appropriate configuration of their cloud for a given service-level agreement. Users developing and running applications on a cloud can better fit virtual resources to their workloads. We demonstrate the power of this framework using several scenarios collected by our CloudBench® tool using application benchmarks running on actual clouds.
China Communications, 2015
Compression and encryption are widely applied to network traffic to improve the efficiency and security of systems. We propose a scheme that concatenates both functions and runs them in a parallel, pipelined fashion, demonstrating both a hardware and a software implementation. With minor modifications to the hardware accelerators, latency can be halved. Furthermore, we propose a novel and more efficient scheme that integrates encryption into the compression algorithm. Our new integrated optimization scheme achieves a 1.6X speedup using the parallel software scheme. However, the security level of our new scheme is lower than that of the previous ones. Fortunately, we prove that this does not affect the applicability of our schemes.
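
A software illustration of the first, concatenated scheme: compression and encryption stages overlap across chunks. The XOR keystream below is a toy stand-in for a real cipher and offers no security; it only keeps the sketch self-contained.

```python
# Pipelined compress-then-encrypt: while chunk i is being encrypted on the
# main thread, later chunks are already compressing in the worker pool.
import zlib, hashlib
from concurrent.futures import ThreadPoolExecutor

def compress(chunk: bytes) -> bytes:
    return zlib.compress(chunk)

def toy_encrypt(chunk: bytes, key: bytes, nonce: int) -> bytes:
    # TOY keystream cipher (SHA-256 in counter mode) -- NOT secure; a real
    # implementation would use an authenticated cipher such as AES-GCM.
    stream, counter = b"", 0
    while len(stream) < len(chunk):
        stream += hashlib.sha256(key + nonce.to_bytes(8, "big")
                                 + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(chunk, stream))

def pipeline(chunks, key):
    out = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        compressed = pool.map(compress, chunks)   # stage 1, submitted eagerly
        for i, c in enumerate(compressed):        # stage 2 overlaps stage 1
            out.append(toy_encrypt(c, key, i))
    return out

data = [bytes(1024) for _ in range(8)]
print(len(pipeline(data, key=b"k" * 16)), "encrypted chunks")
```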

IEEE Transactions on Computers, 2001
A new memory subsystem, called Memory Xpansion Technology (MXT), has been built for compressing main memory contents. MXT effectively doubles the physically available memory transparently to the CPUs, input/output devices, device drivers, and application software. An average compression ratio of two or greater has been observed for many applications. Since the compressibility of memory contents varies dynamically, the size of the memory managed by the operating system is not fixed. In this paper, we describe operating system techniques that can deal with such dynamically changing memory sizes. We also demonstrate the performance impact of memory compression using the SPEC CPU2000 and SPECweb99 benchmarks. Results show that hardware compression of memory has a negligible performance penalty compared to standard memory for many applications. For memory-starved applications and benchmarks such as SPECweb99, memory compression improves performance significantly. Results also show that the memory contents of many applications can be compressed, usually by a factor of two.
IBM Journal of Research and Development, 2001
A novel memory subsystem called Memory Expansion Technology (MXT) has been built for fast hardware compression of main-memory content. This allows the system to present a "real" memory larger than the physically available memory. This paper provides an overview of the memory-compression architecture, its OS support under Linux and Windows®, and an analysis of the performance impact of memory compression. Results show that hardware compression of main memory carries a negligible penalty compared to an uncompressed main memory, and for memory-starved applications it increases performance significantly. We also show that the memory content of an application can usually be compressed by a factor of 2.
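
A toy calculation shows the OS-side problem both MXT papers address: physical memory is fixed, but the "real" memory it can back moves with the compression ratio. The threshold behavior below is an illustrative assumption, not MXT's actual sizing policy.

```python
# Toy accounting for compressed main memory. Numbers are illustrative.
PHYSICAL_MB = 1024

def os_visible_mb(compression_ratio: float) -> int:
    # The OS-visible ("real") memory scales with how well contents compress.
    return int(PHYSICAL_MB * compression_ratio)

used_real_mb = 1600             # pages the OS has already handed out
for ratio in (2.0, 1.5, 1.2):   # workload becomes less compressible
    capacity = os_visible_mb(ratio)
    deficit = max(0, used_real_mb - capacity)
    # A positive deficit means the OS must free or swap pages before the
    # physical memory actually overflows.
    print(f"ratio {ratio:.1f}: capacity {capacity} MB, must reclaim {deficit} MB")
```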

IBM Journal of Research and Development, 2010
This paper describes a recent system-level trend toward the use of massive on-chip parallelism combined with efficient hardware accelerators and integrated networking to enable new classes of applications and computing-systems functionality. This system transition is driven by semiconductor physics and emerging network-application requirements. In contrast to general-purpose approaches, workload- and network-optimized computing provides significant cost, performance, and power advantages relative to historical frequency-scaling approaches in a serial computational model. We highlight the advantages of on-chip network optimization that enables efficient computation and new services at the network edge of the data center. Software and application development challenges are presented, and a service-oriented architecture application example is shown that characterizes the power and performance advantages for these systems. We also discuss a roadmap for next-generation systems that proportionally scale with future networking bandwidth growth rates and employ 3-D chip integration methods for design flexibility and modularity.