Papers by William Allcock
DRAS: Deep Reinforcement Learning for Cluster Scheduling in High Performance Computing
IEEE Transactions on Parallel and Distributed Systems

Petascale DTN Project host network configurations and test data set
This page describes two things. One is a spreadsheet containing the host network configurations used by the Data Transfer Nodes in the Petascale DTN Project. The other is a pointer to the data set used in the transfer tests conducted during the Petascale DTN Project. The spreadsheet describes the network tuning parameters, Linux distribution and version, and related information for the DTNs at the four HPC facilities of the Petascale DTN Project. The test data set is available at https://doi.org/10.34941/S1159N and is 4442781786482 bytes (about 4.4 TB) in size. The data set carries with it the following funding credit: "This data set was generated with the Hardware/Hybrid Accelerated Cosmology Code (HACC) using resources of the Argonne Leadership Computing Facility at the Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-06CH11357."

Practical Resource Monitoring for Robust High Throughput Computing
Abstract—Robust high throughput computing requires effective monitoring and enforcement of a variety of resources including CPU cores, memory, disk, and network traffic. Without effective monitoring and enforcement, it is easy to overload machines, causing failures and slowdowns, or underutilize machines, which results in wasted opportunities. This paper explores how to describe, measure, and enforce resources used by computational tasks. We focus on tasks running in distributed execution systems, in which a task requests the resources it needs, and the execution system ensures the availability of such resources. This presents two non-trivial problems: how to measure the resources consumed by a task, and how to monitor and report resource exhaustion in a robust and timely manner. For both of these tasks, operating systems have a variety of mechanisms with different degrees of availability, accuracy, overhead, and intrusiveness. We describe various forms of monitoring and the avai...
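The per-task measurement problem the abstract describes can be illustrated with a minimal POSIX-only sketch: fork a child, run the task in it, and read the child's CPU time and peak memory from `getrusage`. The function name and return format are illustrative assumptions, not an API from the paper; real execution systems also poll `/proc` and enforce limits.

```python
import os
import resource

def run_and_measure(task):
    """Run `task` in a forked child and report its resource usage.

    A hypothetical helper for illustration only. Note that ru_maxrss is
    reported in kilobytes on Linux (bytes on macOS).
    """
    pid = os.fork()
    if pid == 0:              # child: run the task, then exit immediately
        task()
        os._exit(0)
    os.waitpid(pid, 0)        # parent: wait so the child's usage is recorded
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "cpu_seconds": usage.ru_utime + usage.ru_stime,  # user + system time
        "peak_rss_kb": usage.ru_maxrss,                  # peak resident set
    }

stats = run_and_measure(lambda: sum(range(10**6)))
```

Enforcement could be layered on the same mechanism, e.g. by calling `resource.setrlimit` in the child before running the task.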

Transferring Data from High-Performance Simulations to Extreme Scale Analysis Applications in Real-Time
2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2018
Extreme scale analytics often requires distributed memory algorithms in order to process the volume of data output by high performance simulations. Traditionally, these analysis routines post-process data saved to disk after a simulation has completed. However, concurrently executing both simulation and analysis can yield great benefits – reduce or eliminate disk I/O, increase output frequency to improve fidelity, and ultimately shorten time-to-discovery. One such method for concurrent simulation and analysis is in transit – transferring data from the resource running the simulation to a separate resource running the analysis. In transit analysis can be beneficial since computational resources may not have certain resources needed for analysis (e.g. GPUs) and to reduce the impact of performing analysis tasks to the run time of the simulation. The work described in this paper compares three techniques for transferring data between distributed memory applications: 1) writing data to and reading data from a parallel file system, 2) copying data into and out of a network-accessed shared memory pool, and 3) streaming data in parallel from the processes in the simulation application to the processes in the analysis application. Our results show that using a shared memory pool and streaming data over high-bandwidth networks can both drastically increase I/O speeds and lead to quicker analysis.
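The third technique (parallel streaming between simulation and analysis processes) can be reduced, for illustration, to a single producer/consumer pair over a local socket; `stream_transfer` and its structure are assumptions for the sketch, not code from the paper, which streams between many processes over high-bandwidth networks.

```python
import socket
import threading

def stream_transfer(chunks):
    """Stream byte chunks from a producer to a consumer thread over a socket.

    A toy stand-in for simulation-to-analysis streaming: the producer plays
    the simulation, the consumer thread plays the analysis application.
    """
    producer_sock, consumer_sock = socket.socketpair()
    received = []

    def consumer():
        while True:
            data = consumer_sock.recv(65536)
            if not data:          # empty read means the producer closed
                break
            received.append(data)
        consumer_sock.close()

    t = threading.Thread(target=consumer)
    t.start()
    for chunk in chunks:
        producer_sock.sendall(chunk)
    producer_sock.close()         # signals end-of-stream to the consumer
    t.join()
    return b"".join(received)
```

The key property this models is that no data touches the file system: the analysis side consumes bytes as the simulation side produces them.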
RAM as a Network Managed Resource
2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2018
This paper describes an architecture for and several application examples of using a dynamically allocable RAM pool over the network as part of a deep memory hierarchy. We present four different use cases, including in situ analysis, machine learning, quantum chemistry simulations, and virtualization. In each use case, we modified or implemented software to use a network RAM pool, and then evaluated the performance. The cases of in situ analysis, machine learning, and virtualization demonstrated good performance. While the quantum chemistry experiments exhibited an average slowdown of 53% for all runs, including those where resources were clearly oversubscribed, the scalability results showed promise.
Blue Gene/Q: Sequoia and Mira
GridFTP protocol specification

Trade-Off Between Prediction Accuracy and Underestimation Rate in Job Runtime Estimates
2017 IEEE International Conference on Cluster Computing (CLUSTER), 2017
Job runtime estimates provided by users are widely acknowledged to be overestimated, and runtime overestimation can greatly degrade job scheduling performance. Previous studies focus on improving the accuracy of job runtime estimates by reducing runtime overestimation, but fail to address the underestimation problem (i.e., the underestimation of job runtimes). Using an underestimated runtime is catastrophic to a job, as the job will be killed by the scheduler before completion. We argue that the improvement of runtime accuracy and the reduction of underestimation rate are equally important. To address this problem, we propose an online runtime adjustment framework called TRIP. TRIP explores the data censoring capability of the Tobit model to improve prediction accuracy while keeping a low underestimation rate of job runtimes. TRIP can be used as a plugin to a job scheduler to improve job runtime estimates and hence boost job scheduling performance. Preliminary results demonstrate that TRIP is capable of achieving a high accuracy of 80% and a low underestimation rate of 5%. This is significant as compared to other well-known machine learning methods such as SVM, Random Forest, and Last-2, which result in a high underestimation rate (20%-50%). Our experiments further quantify the amount of scheduling performance gain achieved by the use of TRIP.
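The censored-regression idea behind TRIP can be sketched with a right-censored Tobit fit on synthetic data: when a job hits its walltime limit, the observed runtime is only a lower bound, and the Tobit likelihood accounts for that instead of treating the bound as the true value. Everything below (`tobit_nll`, the toy features, the optimizer choice) is an illustrative assumption, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def tobit_nll(params, X, y, censored):
    """Negative log-likelihood of a right-censored Tobit model.

    censored[i] is True when y[i] is a censoring bound (the job was killed
    at its limit), so only P(runtime > bound) enters the likelihood there.
    """
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)           # optimize log(sigma) to keep sigma > 0
    z = (y - X @ beta) / sigma
    ll = np.where(censored,
                  norm.logsf(z),                    # censored: survival prob.
                  norm.logpdf(z) - np.log(sigma))   # uncensored: exact density
    return -ll.sum()

# Toy data: true runtime = 2*x + noise, observations capped (censored) at 5.
rng = np.random.default_rng(0)
x = rng.uniform(0, 4, 200)
true_rt = 2 * x + rng.normal(0, 0.5, 200)
censored = true_rt > 5.0
y = np.minimum(true_rt, 5.0)
X = np.column_stack([np.ones_like(x), x])

res = minimize(tobit_nll, x0=[0.0, 1.0, 0.0], args=(X, y, censored))
intercept_hat, slope_hat = res.x[0], res.x[1]
```

An ordinary least-squares fit on the capped `y` would bias the slope toward zero; the Tobit fit recovers a slope near the true value of 2 despite the censored observations.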
Programming with the Globus Toolkit GridFTP Client Library
GridFTP: Protocol Extensions to FTP for the Grid. Status of this Memo: This document is a Global Grid Forum Draft and is in full conformance with all provisions of GFD-C.1. Conventions used in this document: The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", ...
Impact of Grid Computing on Network Operators and HW Vendors
13th Symposium on High Performance Interconnects (HOTI'05)
Grid computing is an attempt to make computing work like the power grid. When you run a job, you shouldn't know or care where it runs, so long as it gets done within your constraints (including security). However, in attempting to accomplish this, Grid researchers are presenting network access patterns and loads different from what has been typical of Internet
Networking issues for grid infrastructure
GFD-I.037, Category: Informational. Grid High Performance Networking Research Group, Volker Sander (Editor), November 22, 2004. Networking Issues for Grid Infrastructure. Status of this Memo ...
Large distributed systems such as Computational/Data Grids require large amounts of data to be colocated with the computing facilities for processing. Ensuring that the data is there in time for the computation in today's Internet is a massive problem. From our work developing a scalable distributed network cache, we have gained experience with techniques necessary to achieve high data throughput over high bandwidth Wide Area Networks (WAN). In this paper, we discuss several hardware and software design techniques and issues, and then describe their application to an implementation of an enhanced FTP protocol called GridFTP. We also describe results from two applications using these techniques, which were obtained at the Supercomputing 2000 conference.
Fault location in grids using bayesian belief networks
MRSch: Multi-Resource Scheduling for HPC
2022 IEEE International Conference on Cluster Computing (CLUSTER)
What does Inter-Cluster Job Submission and Execution Behavior Reveal to Us?
2022 IEEE International Conference on Cluster Computing (CLUSTER)

2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
Traditionally, on-demand, rigid, and malleable applications have been scheduled and executed on separate systems. Ever-growing workload demands and rapidly developing HPC infrastructure have triggered interest in converging these applications on a single HPC system. Although allocating the hybrid workloads within one system could potentially improve system efficiency, it is difficult to balance the tradeoff between the responsiveness of on-demand requests, the incentive for malleable jobs, and the performance of rigid applications. In this study, we present several scheduling mechanisms to address the issues involved in co-scheduling on-demand, rigid, and malleable jobs on a single HPC system. We extensively evaluate and compare their performance under various configurations and workloads. Our experimental results show that our proposed mechanisms are capable of serving on-demand workloads with minimal delay, offering incentives for declaring malleability, and improving system performance.
Index Terms—cluster scheduling, high-performance computing, on-demand jobs, rigid jobs, malleable jobs
Early Investigations into Using a Remote RAM Pool with the vl3 Visualization Framework
2016 Second Workshop on In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization (ISAV), 2016
This paper discusses early efforts to integrate the RAM remote memory technology into the vl3 volume rendering framework. We successfully demonstrate this integration, achieving 73% of the theoretical hardware maximum with minimal variation.