Main Memory Map Reduce (M3R) is a new implementation of the Hadoop Map Reduce (HMR) API targeted at online analytics on high mean-time-to-failure clusters. It does not support resilience, and supports only those workloads which can fit into cluster memory. In return, it can run HMR jobs unchanged -- including jobs produced by compilers for higher-level languages such as Pig, Jaql, and SystemML and interactive front-ends like IBM BigSheets -- while providing significantly better performance than the Hadoop engine on several workloads (e.g. 45x on some input sizes for sparse matrix vector multiply). M3R also supports extensions to the HMR API which can enable Map Reduce jobs to run faster on the M3R engine, while not affecting their performance under the Hadoop engine.
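The core idea behind M3R -- keeping intermediate key/value data in memory rather than spilling it to disk between the map and shuffle phases -- can be sketched as follows. This is a toy Python illustration only (M3R itself implements the Hadoop API on the X10 runtime); the function names are hypothetical:

```python
from collections import defaultdict

def in_memory_map_reduce(records, map_fn, reduce_fn):
    """Toy in-memory map-reduce: the shuffle is a dict, not disk files."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):      # map phase
            groups[key].append(value)          # in-memory "shuffle"
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Word count, the canonical example:
def mapper(line):
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    return sum(counts)

result = in_memory_map_reduce(["a b a", "b c"], mapper, reducer)
# result == {"a": 2, "b": 2, "c": 1}
```

Because the grouped values never leave memory, repeated or iterative jobs over the same data avoid the per-job disk and (de)serialization costs that dominate the Hadoop engine's runtime.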
2008 37th International Conference on Parallel Processing, 2008
Solving large, irregular graph problems efficiently is challenging. Current software systems and commodity multiprocessors do not support fine-grained, irregular parallelism well. We present XWS, the X10 Work Stealing framework, an open-source runtime for the parallel programming language X10 and a library to be used directly by application writers. XWS extends the Cilk work-stealing framework with several features necessary to efficiently implement graph algorithms, viz., support for improperly nested procedures, global termination detection, and phased computation. We also present a strategy to adaptively control the granularity of parallel tasks in the work-stealing scheme, depending on the instantaneous size of the work queue. We compare the performance of the XWS implementations of spanning tree algorithms with that of the handwritten C and Cilk implementations using various graph inputs. We show that XWS programs (written in Java) scale and exhibit comparable or better performance.
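The adaptive-granularity strategy -- batch work coarsely when the local queue is deep, expose tasks one at a time when it is shallow -- can be sketched in a sequential Python model. The `threshold` and `batch` parameters are illustrative, not XWS's actual tuning constants, and the thieves mentioned in the comments are hypothetical here:

```python
from collections import deque

def adaptive_granularity_run(initial_tasks, expand, threshold=64, batch=8):
    """Sketch of deque-size-based granularity control."""
    work = deque(initial_tasks)
    processed = 0
    while work:
        if len(work) > threshold:
            # Queue is deep: drain a whole batch locally, amortizing
            # scheduling overhead over several tasks (coarse-grained).
            for _ in range(min(batch, len(work))):
                work.extend(expand(work.pop()))
                processed += 1
        else:
            # Queue is shallow: handle one task at a time so the rest of
            # the deque stays available to (hypothetical) thieves.
            work.extend(expand(work.pop()))
            processed += 1
    return processed

# Example: task n spawns n children carrying n-1, down to 0.
count = adaptive_granularity_run([3], lambda n: [n - 1] * n if n > 0 else [])
# count == 16 (the 16 nodes of the spawned task tree)
```

The real scheduler makes this decision concurrently on each worker's deque; the point of the sketch is only the size-dependent switch between coarse and fine granularity.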
This paper describes the design, implementation, and applications of the constraint logic language cc(FD). cc(FD) is a declarative nondeterministic constraint logic language over finite domains based on the cc framework [33], an extension of the CLP scheme [21]. Its constraint solver includes (non-linear) arithmetic constraints over natural numbers which are approximated using domain and interval consistency. The main novelty of cc(FD) is the inclusion of a number of general-purpose combinators, in particular cardinality, constructive disjunction, and blocking implication, in conjunction with new constraint operations such as constraint entailment and generalization. These combinators significantly improve the operational expressiveness, extensibility, and flexibility of CLP languages and allow issues such as the definition of non-primitive constraints and disjunctions to be tackled at the language level. The implementation of cc(FD) (about 40,000 lines of C) includes a WAM-based engine [44], optimal arc-consistency algorithms based on AC-5 [40], and incremental implementation of the combinators. Results on numerous problems, including scheduling, resource allocation, sequencing, packing, and Hamiltonian paths are reported and indicate that cc(FD) comes close to procedural languages on a number of combinatorial problems. In addition, a small cc(FD) program was able to find the optimal solution and prove optimality for a famous 10/10 disjunctive scheduling problem [29], which was left open for more than 20 years and finally solved in 1986.
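The kind of pruning a domain-consistency solver performs can be illustrated with a single propagation step for the constraint x + y = c: each value must have a support in the other variable's domain. This is a toy Python sketch of the general idea, not cc(FD)'s AC-5-based algorithm:

```python
def prune_sum(xd, yd, c):
    """One propagation step of domain consistency for x + y == c:
    keep only values that have a supporting partner in the other domain."""
    xd2 = {v for v in xd if any(v + w == c for w in yd)}
    yd2 = {w for w in yd if any(v + w == c for v in xd2)}
    return xd2, yd2

xd, yd = prune_sum({1, 2, 3, 4}, {1, 2}, 4)
# x + y = 4 with y in {1, 2} forces x into {2, 3}; y keeps {1, 2}
```

A real solver runs such propagators incrementally to a fixpoint over all constraints, using optimal algorithms like AC-5 rather than the quadratic scan above.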
Proceedings of the 2011 ACM SIGPLAN X10 Workshop, 2011
To reliably write high performance code in any programming language, an application programmer must have some understanding of the performance characteristics of the language's core constructs. We call this understanding a performance model for the language. Some aspects of a performance model are fundamental to the programming language and are expected to be true for any plausible implementation of the language. Other aspects are less fundamental and merely represent design choices made in a particular version of the language's implementation. In this paper we present a basic performance model for the X10 programming language. We first describe some performance characteristics that we believe will be generally true of any implementation of the X10 2.2 language specification. We then discuss selected aspects of our implementations of X10 2.2 that have significant implications for the performance model.
Proceedings of the third ACM SIGPLAN X10 Workshop, 2013
The ability to smoothly interoperate with other programming languages is an essential feature to reduce the barriers to adoption for new languages such as X10. Compiler-supported interoperability between Managed X10 and Java was initially previewed in X10 version 2.2.2 and is now fully supported in X10 version 2.3. In this paper we describe and motivate the Java interoperability features of Managed X10. For calling Java from X10, external linkage for Java code is explained. For calling X10 from Java, the current implementation of Java code generation is explained. An unusual aspect of X10 is that, unlike most other JVM-hosted languages, X10 is also implemented via compilation to C++ (Native X10). The requirement to support multiple execution platforms results in unique challenges to the design of cross-language interoperability. In particular, we discovered that a single top exception type that covers all exception types from source and all target languages is needed as a native type of the source language for portable exception handling. This realization motivated both minor changes in the X10 language specification and an extensive redesign of the X10 core class library for X10 2.3.
We propose a framework for SAT researchers to conveniently try out new ideas in the context of parallel SAT solving without the burden of dealing with all the underlying system issues that arise when implementing a massively parallel algorithm. The framework is based on the parallel execution language X10, and allows the parallel solver to easily run on both a single machine with multiple cores and across multiple machines, sharing information such as learned clauses.
We propose a new approach for parallelizing search for combinatorial optimization that is based on a recursive application of approximate Decision Diagrams. This generic scheme can, in principle, be applied to any combinatorial optimization problem for which a decision diagram representation is available. We consider the maximum independent set problem as a specific case study, and show how a recently proposed sequential branch-and-bound scheme based on approximate decision diagrams can be parallelized efficiently using the X10 parallel programming and execution framework. Experimental results using our parallel solver, DDX10, running on up to 256 compute cores spread across a cluster of machines indicate that parallel decision diagrams scale effectively and consistently. Moreover, on graphs of relatively high density, parallel decision diagrams often outperform state-of-the-art parallel integer programming when both use a single 32-core machine.
On shared-memory systems, Cilk-style work-stealing has been used to effectively parallelize irregular task-graph based applications such as Unbalanced Tree Search (UTS). There are two main difficulties in extending this approach to distributed memory. In the shared memory approach, thieves (nodes without work) constantly attempt to asynchronously steal work from randomly chosen victims until they find work. In distributed memory, thieves cannot autonomously steal work from a victim without disrupting its execution. When work is sparse, this results in performance degradation. In essence, a direct extension of traditional work-stealing to distributed memory violates the work-first principle underlying work-stealing. Further, thieves spend useless CPU cycles attacking victims that have no work, resulting in system inefficiencies in multi-programmed contexts. Second, it is non-trivial to detect active distributed termination (detect that programs at all nodes are looking for work, henc...
We implement all four HPC Class II benchmarks in X10: Global HPL, Global RandomAccess, EP Stream (Triad), and Global FFT. We also implement the Unbalanced Tree Search benchmark (UTS). We show performance results for these benchmarks running on an IBM Power 775 Supercomputer utilizing up to 47,040 Power7 cores. We believe that our UTS implementation demonstrates that X10 can deliver unprecedented productivity and performance at scale for unbalanced workloads. The X10 tool chain and the benchmark codes are publicly available at http://x10-lang.org.
We present GLB, a programming model and an associated implementation that can handle a wide range of irregular parallel programming problems running over large-scale distributed systems. GLB is applicable both to problems that are easily load-balanced via static scheduling and to problems that are hard to statically load balance. GLB hides the intricate synchronizations (e.g., inter-node communication, initialization and startup, load balancing, termination and result collection) from the users. GLB internally uses a version of the lifeline-graph-based work-stealing algorithm proposed by Saraswat et al. Users of GLB are simply required to write several pieces of sequential code that comply with the GLB interface. GLB then schedules and orchestrates the parallel execution of the code correctly and efficiently at scale. We have applied GLB to two representative benchmarks: Betweenness Centrality (BC) and Unbalanced Tree Search (UTS). Among them, BC can be statically load-balanced ...
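The division of labor described above -- the user writes sequential hooks, the framework drives scheduling, stealing, and result collection -- can be sketched with a toy Python class. The class and method names below are illustrative, not GLB's actual X10 API, and the framework is simulated by a plain loop:

```python
class TaskBag:
    """Hypothetical shape of a GLB-style user task container: the user
    supplies sequential process/split/merge hooks; the framework calls them."""
    def __init__(self, tasks):
        self.tasks = list(tasks)
        self.result = 0
    def process(self, n):          # run up to n units of sequential work
        for _ in range(min(n, len(self.tasks))):
            self.result += self.tasks.pop()
    def split(self):               # victim: hand half the work to a thief
        half = len(self.tasks) // 2
        stolen, self.tasks = self.tasks[:half], self.tasks[half:]
        return stolen
    def merge(self, stolen):       # thief: install stolen work locally
        self.tasks.extend(stolen)

# Single-worker simulation of the framework's scheduling loop:
bag = TaskBag(range(1, 11))
while bag.tasks:
    bag.process(3)
# bag.result == 55 (the sum 1 + 2 + ... + 10)
```

In the real system, `split`/`merge` are invoked across places by the lifeline work-stealing scheduler, and per-place results are combined by a user-supplied reduction at termination.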
This report provides a description of the programming language X10. X10 is a class-based object-oriented programming language designed for high-performance, high-productivity computing on high-end computers supporting ≈ 10⁵ hardware threads and ≈ 10¹⁵ operations per second. X10 is based on state-of-the-art object-oriented programming languages and deviates from them only as necessary to support its design goals. The language is intended to have a simple and clear semantics and be readily accessible to mainstream OO programmers. It is intended to support a wide variety of concurrent programming idioms. The X10 design team consists of Bard Bloom, David Cunningham, Robert Fuhrer,
There is a large class of applications, notably those in high-performance computation (HPC), for which parallelism is necessary for performance, not expressiveness. Such applications are typically determinate and have no natural notion of deadlock. Unfortunately, today's dominant HPC programming paradigms (MPI and OpenMP) are based on imperative concurrency and do not guarantee determinacy or deadlock-freedom. This substantially complicates writing and debugging such code. We present a new concurrent model for mutable variables, the clocked final model, CF, that guarantees determinacy and deadlock-freedom. CF views a mutable location as a monotonic stream together with a global stability rule which permits reads to stutter (return a previous value) if it can be established that no other activity can write in the current phase. Each activity maintains a local index into the stream and advances it independently as it performs reads and writes. Computation is aborted if two different activities write different values in the same phase. This design unifies and extends several well-known determinate programming paradigms: single-threaded imperative programs, the "safe asynchrony" of [31], reader-writer communication via immutable variables, Kahn networks, and barrier-based synchronization. Since it is predicated quite narrowly on a re-analysis of mutable variables, it is applicable to existing sequential and concurrent languages, such as Jade, Cilk, Java and X10. We present a formal operational model for a specific CF language, MJ/CF, based on the MJ calculus of [15]. We present an outline of a denotational semantics based on a connection with default concurrent constraint programming. We show that CF leads to a very natural programming style: often an "obvious" shared-variable formulation provides the correct solution under the CF interpretation.
We present several examples and discuss implementation issues.
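The mutable-location-as-monotonic-stream idea can be modeled sequentially in a few lines of Python. This is a toy model only, not the CF semantics (there is no real concurrency or stability analysis here); the class and method names are hypothetical:

```python
class ClockedVar:
    """Toy model of a clocked-final variable: a monotonic stream of
    per-phase values. Two different writes in the same phase are an
    error (the computation would be aborted); a read in a phase with
    no write "stutters", returning the most recent earlier value."""
    def __init__(self, initial):
        self.stream = {0: initial}
    def write(self, phase, value):
        if phase in self.stream and self.stream[phase] != value:
            raise RuntimeError("conflicting writes in phase %d" % phase)
        self.stream[phase] = value
    def read(self, phase):
        # Stutter: fall back to the latest write at or before `phase`.
        return self.stream[max(p for p in self.stream if p <= phase)]

v = ClockedVar(0)
v.write(1, 10)
v.write(1, 10)   # same value in the same phase: idempotent, allowed
# v.read(2) stutters back to phase 1's value, 10; v.read(0) is still 0
```

In the real model, the interesting work is in the global stability rule: a read may stutter only once the runtime can prove no other activity can still write in the current phase, which is what makes the scheme determinate without barriers.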
cc programming is a general framework for constructing a wide variety of domain-specific languages. In this paper we show how such languages can be easily constructed using cc, and why cc is particularly suitable for the construction of such languages.
Markus P. J. Fromherz, Vineet Gupta, Vijay Saraswat. Abstract: cc programming is a general framework for constructing a wide variety of domain-specific languages. In this paper we show how such languages can be easily constructed using cc, and why cc is particularly suitable for the construction of such languages. 1 Introduction Increasingly, the widely available cheap and powerful computers of today are being applied in extraordinarily diverse settings --- from powering photocopiers and telephony systems and other real-time testing, control and diagnosis systems, to automation of inventory and account management at the neighbourhood video rental store or the phone-order catalog company, to supporting net-based publication and electronic communities. This brings computational scientists (and their tools and techniques) in contact with practitioners of other disciplines whose work touches these diverse areas of human activity --- engineers, control-theorists, management sci...
Scale-out programs run on multiple processes in a cluster. In scale-out systems, processes can fail. Computations using traditional libraries such as MPI fail when any component process fails. The advent of Map Reduce, Resilient Data Sets and MillWheel has shown dramatic improvements in productivity are possible when a high-level programming framework handles scale-out and resilience automatically. We are concerned with the development of general-purpose languages that support resilient programming. In this paper we show how the X10 language and implementation can be extended to support resilience. In Resilient X10, places may fail asynchronously, causing loss of the data and tasks at the failed place. Failure is exposed through exceptions. We identify a Happens Before Invariance Principle and require the runtime to automatically repair the global control structure of the program to maintain this principle. We show this reduces much of the burden of resilient programming. The ...
Papers by Vijay Saraswat