PDC Review 1 Group 11

This study focuses on optimizing streaming parallelism on heterogeneous many-core architectures, specifically using the technique of heterogeneous streaming to improve performance by sharing workloads. The research presents an automated approach that uses machine learning for core allocation, tested on an Intel Xeon Phi and an NVIDIA GTX 1080Ti, achieving speedups of 1.6x and 1.1x respectively. It also reviews related work and highlights the challenges of effectively utilizing the processing power of modern many-core accelerators.


Problem Statement

Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures

Domain Introduction:
Heterogeneous many-core accelerators are processing units used to accelerate the performance of parallel applications. Commonly used many-core accelerators include GPGPUs and the Intel Xeon Phi architecture. As users demand ever more performance, many-core accelerators have become more powerful, with more cores and more processing power per core. On the software side, however, it is difficult for applications to keep up and utilize all of the parallel processing power these accelerators make available. This study focuses on how parallel programs can fully utilize the performance boost offered by such many-core architectures.

Techniques & Related Challenges:


The technique used in this study [1] to improve performance on heterogeneous many-core platforms is known as heterogeneous streaming. This technique divides the workload of a parallel program by exploiting the independence between parts of the program so that they can run simultaneously, thus improving performance. In effect, it allows concurrent kernel executions to be overlapped with data movements.
Commonly used implementations of heterogeneous streaming include CUDA streams, OpenCL command queues and Intel hStreams.
This study aims to measure the performance improvement of data-parallel applications obtained by exploiting spatial and temporal sharing of heterogeneous streams.
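
To make temporal sharing concrete, below is a minimal CUDA sketch (not code from the paper; the kernel, data sizes and chunk count are invented for illustration) that splits a vector-scaling job into chunks and issues each chunk's host-to-device copy, kernel launch and device-to-host copy into its own stream, so that data transfers for one chunk can overlap with computation on another:

#include <cuda_runtime.h>
#include <cstdio>

// Toy kernel: scale each element of a chunk.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int N = 1 << 22, CHUNKS = 4, CHUNK = N / CHUNKS;
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float));   // pinned memory so async copies can overlap
    cudaMalloc(&d, N * sizeof(float));
    for (int i = 0; i < N; ++i) h[i] = 1.0f;

    cudaStream_t streams[CHUNKS];
    for (int s = 0; s < CHUNKS; ++s) cudaStreamCreate(&streams[s]);

    // Each chunk's copy-in, kernel and copy-out go into its own stream;
    // the hardware can overlap transfers of one chunk with compute on another.
    for (int s = 0; s < CHUNKS; ++s) {
        size_t off = (size_t)s * CHUNK;
        cudaMemcpyAsync(d + off, h + off, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(CHUNK + 255) / 256, 256, 0, streams[s]>>>(d + off, CHUNK, 2.0f);
        cudaMemcpyAsync(h + off, d + off, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();
    printf("h[0] = %f\n", h[0]);             // expect 2.0

    for (int s = 0; s < CHUNKS; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}

Intel hStreams exposes the same idea on the Xeon Phi and additionally lets the programmer partition the device's cores into groups and bind streams to those partitions, which corresponds to the spatial sharing studied in [1].
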
The main difference between this study and past work on task scheduling is that this study relies on partitioning the processors into groups in order to improve host-device communication on coprocessors like the Xeon Phi. The study also departs from earlier work by developing an automatic approach that dynamically adjusts the processor partitioning and task granularity at runtime. It also incorporates several performance optimizations specific to the Intel Xeon Phi platform, as described in Cheng et al. [2] and Jha et al. [3].

Research Findings:
The study contributes an easy-to-use, automated approach to exploiting streaming parallelism on heterogeneous many-core architectures, using a machine-learning-based model that decides the core allocation for a given program. The approach was tested on an Intel Xeon Phi as well as an NVIDIA GTX 1080Ti GPU, with 39 different benchmarks. Experimental results showed speedups of 1.6x and 1.1x on the Xeon Phi and the GTX 1080Ti respectively.
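
As a purely illustrative sketch (this is not the authors' model; the features, thresholds and configuration values below are invented), the core-allocation decision can be viewed as a function from a few program features to a stream configuration. The real mapping in [1] is learned offline by a machine-learning model rather than hand-written as it is here:

#include <cstdio>

// Features that could be extracted from a streamed program (illustrative only).
struct ProgramFeatures {
    double transfer_to_compute_ratio;  // bytes moved per arithmetic operation
    long   parallel_iterations;        // size of the data-parallel loop
};

// A stream configuration: how many core groups (partitions) to create
// and how many tasks to split the work into.
struct StreamConfig {
    int partitions;
    int tasks;
};

// Hypothetical stand-in for a trained model: a hand-written decision rule.
StreamConfig predict(const ProgramFeatures &f) {
    if (f.transfer_to_compute_ratio > 1.0) return {4, 16};   // transfer-bound: favour overlap
    if (f.parallel_iterations > 1000000)   return {8, 32};   // large, compute-heavy workloads
    return {2, 4};                                            // small workloads: keep overhead low
}

int main() {
    StreamConfig c = predict({0.2, 5000000});
    printf("partitions=%d tasks=%d\n", c.partitions, c.tasks);
    return 0;
}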

Individual Contribution:
Prateek Sharma - 18BCI0215: Review and summary of the base paper [1] and related works [2][3], as well as compilation of the final document.
Lakshit Mangla - 18BCI0246: Review and summarization of related works [4][5]
Anshil Seth - 18BCI0173: Review and summarization of related works [6][7]

Review of related papers:


1. Paper Name :- Heterogeneous Parallel Computing with Java: Jabber or Justified?[4]
Author :- H. G. Dietz

The main focus of this study was to determine whether Java is a good language for heterogeneous parallel computing. The paper starts by describing what heterogeneous computing is: the concept of using a collection of machines, each of which may have somewhat different properties from the others, to achieve speedup on a computation.
The study then divides heterogeneous computing into four main aspects:
1) Architecture
2) Speedup
3) Portability
4) Transformability

The study then focuses on Java, listing some of its benefits as well as what it lacks. Among its benefits are:
1) Data Types
2) Object Oriented
3) Threads
4) Support for Networking
5) Graphic Support

The paper concludes that Java has many features that should be part of a programming model for heterogeneous parallel computing, but that the Java model as it stands is not 100% appropriate.
2. Paper Name :- Efficient Strategies of Compressing Three-Dimensional Sparse Arrays Based on Intel XEON and Intel XEON Phi Environments[5]
Author :- Chun-Yuan Lin, Che-Lun Hung

Array operations are used in many important scientific codes. Many methods have been proposed in the past to implement these array operations efficiently, but most of them focus on two-dimensional arrays. Parallel computing is a suitable way to speed up array operations in terms of both time and memory space; parallel versions of these operations can be designed and implemented on a shared-memory multiprocessor.
Three strategies are compared to increase efficiency (the CRS layout is sketched below):
1) The CRS (compressed row storage) scheme
2) An inter-task parallelization strategy
3) An intra-task parallelization strategy

Experiments were carried out using these three strategies and the results compared. The speedup ratio achieved by inter-task parallelization was better than that of intra-task parallelization in most cases.
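
As a reminder of what the CRS layout looks like, here is a small self-contained sketch (not code from the paper, and showing only the two-dimensional case for brevity) that compresses a dense matrix into the usual three CRS arrays of non-zero values, column indices and row pointers:

#include <cstdio>
#include <vector>

// Compressed row storage for a sparse matrix.
struct CRS {
    std::vector<float> values;    // non-zero values
    std::vector<int>   col_idx;   // column index of each non-zero
    std::vector<int>   row_ptr;   // row_ptr[r]..row_ptr[r+1] indexes row r's non-zeros
};

// Build the CRS arrays from a dense rows x cols matrix stored in row-major order.
CRS compress(const std::vector<float> &dense, int rows, int cols) {
    CRS m;
    m.row_ptr.push_back(0);
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            float v = dense[r * cols + c];
            if (v != 0.0f) {
                m.values.push_back(v);
                m.col_idx.push_back(c);
            }
        }
        m.row_ptr.push_back((int)m.values.size());
    }
    return m;
}

int main() {
    // 3x4 matrix with 4 non-zeros.
    std::vector<float> dense = {5, 0, 0, 1,
                                0, 0, 0, 0,
                                0, 2, 3, 0};
    CRS m = compress(dense, 3, 4);
    for (size_t i = 0; i < m.values.size(); ++i)
        printf("value %.0f at column %d\n", m.values[i], m.col_idx[i]);
    printf("row_ptr:");
    for (int r : m.row_ptr) printf(" %d", r);
    printf("\n");
    return 0;
}

The paper extends this kind of compression to three-dimensional sparse arrays, and the compressed operations are then parallelized either across independent tasks (inter-task) or within a single task (intra-task) on the Xeon and Xeon Phi platforms.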

3. Paper Name :- Efcient streaming applications on multi-core with FastFlow: the


biosequence alignment test-bed[6]
Author :- Marco ALDINUCCI , Marco DANELUTTO, Massimiliano MENEGHIN
,Massimo TORQUATI and Peter KILPATRICK

The tighter coupling of on-chip resources changes the communication-to-computation ratio that influences the design of parallel algorithms. Modern Single Chip Multiprocessor (SCM) architectures introduce the potential for low-overhead inter-core communication, synchronisation and data sharing thanks to fast-path access to the on-die caches. These caches, organised in a hierarchy, are also a potential limiting factor of these architectures, since access patterns that are not carefully optimised may lead to significant contention for shared caches and invalidation pressure for replicated caches.
The paper shows how the FastFlow library can be used to build a widely used parallel programming paradigm (a.k.a. skeleton), namely the streaming stateful farm, and compares its raw scalability against a hand-tuned Pthread-based counterpart on a dual quad-core Intel platform. The FastFlow farm skeleton can be rapidly and effectively used to boost the performance of many existing real-world applications, for example the Smith-Waterman local alignment algorithm. The authors show that a straightforward porting of the multi-threaded x86/SSE2-enabled SWPS3 implementation onto FastFlow is twice as fast as SWPS3 itself, which is a hand-tuned high-performance implementation.
They evaluated the performance of FastFlow with two families of applications: a synthetic micro-benchmark and the Smith-Waterman local sequence alignment algorithm.
All experiments were executed on a shared-memory Intel platform with two quad-core Xeon E5420 Harpertown processors running at 2.5 GHz with 6 MB of L2 cache, and 8 GB of main memory.
The evaluation concludes that the FastFlow version of the Smith-Waterman algorithm, obtained from a third-party high-performance implementation by simply substituting the communication primitives, is always faster than the original version and exhibits double its speedup on fine-grained datasets. The presented results serve as a “feasibility study” of the proposed approach.
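
For readers unfamiliar with the farm skeleton, the sketch below (plain C++ threads and a locked queue, deliberately not FastFlow's actual API) shows the pattern being discussed: an emitter pushes stream items into a channel and a pool of workers pulls and processes them concurrently. FastFlow packages the same structure as a ready-made skeleton built on lock-free queues, which is a key source of the low overhead reported in the paper:

#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// A minimal thread-safe queue used as the emitter-to-worker channel.
// (FastFlow instead uses lock-free single-producer/single-consumer queues.)
class TaskQueue {
    std::queue<int> q;
    std::mutex m;
    std::condition_variable cv;
    bool closed = false;
public:
    void push(int v) {
        { std::lock_guard<std::mutex> lk(m); q.push(v); }
        cv.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m); closed = true; }
        cv.notify_all();
    }
    bool pop(int &out) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return !q.empty() || closed; });
        if (q.empty()) return false;   // closed and drained
        out = q.front(); q.pop();
        return true;
    }
};

int main() {
    TaskQueue queue;
    const int WORKERS = 4;

    // Workers: repeatedly pop a stream item and process it.
    std::vector<std::thread> workers;
    for (int w = 0; w < WORKERS; ++w)
        workers.emplace_back([&queue, w] {
            int item;
            while (queue.pop(item))
                printf("worker %d: item %d -> %d\n", w, item, item * item);
        });

    // Emitter: generate the input stream, then signal end-of-stream.
    for (int i = 0; i < 16; ++i) queue.push(i);
    queue.close();

    for (auto &t : workers) t.join();
    return 0;
}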

4. Paper Name :- Revisiting the Design of Data Stream Processing Systems on Multi-Core Processors[7]
Author :- Shuhao Zhang (SAP Innovation Center Singapore), Bingsheng He (National University of Singapore), Daniel Dahlmeier (SAP Innovation Center Singapore), Amelie Chi Zhou

The paper addresses performance issues that prevent data stream processing (DSP) systems from fully exploiting modern scale-up architectures; fixing them also benefits scale-out environments. These performance issues are demonstrated on two platforms: Apache Storm and Apache Flink.
The three design aspects evaluated are:
1) Pipelined processing with message passing,
2) On-demand data parallelism, and
3) JVM-based implementation.
The authors revisit these three common design aspects of modern DSP systems on modern multi-socket multi-core architectures, backed by detailed profiling studies on Apache Storm and Flink.
Two major outcomes highlighted by these studies are:
1) The design choice of supporting both pipelined and data-parallel processing results in a very complex parallel execution model in DSP systems, causing high front-end stalls on a single CPU.
2) The design of continuous message-passing mechanisms between operators severely limits the scalability of DSP systems on multi-socket multi-core architectures.
Base Paper and Reference Papers:
[1] P. Zhang, J. Fang, C. Yang, C. Huang, T. Tang and Z. Wang, "Optimizing Streaming
Parallelism on Heterogeneous Many-Core Architectures," in IEEE Transactions on Parallel and
Distributed Systems, vol. 31, no. 8, pp. 1878-1896, 1 Aug. 2020, doi:
10.1109/TPDS.2020.2978045.
[2] X. Cheng et al., “Many-core needs fine-grained scheduling: A case study of query processing
on Intel Xeon Phi processors,” J. Parallel Distrib. Comput., vol. 120, pp. 395–404, 2018.
[3] S. Jha et al., “Improving main memory hash joins on Intel Xeon Phi processors: An experimental approach,” Proc. VLDB Endowment, vol. 8, pp. 642–653, 2015.
[4] H. G. Dietz, "Heterogeneous parallel computing with Java: jabber or justified?," Proceedings
Seventh Heterogeneous Computing Workshop (HCW'98), Orlando, FL, USA, 1998, pp.
159-162, doi: 10.1109/HCW.1998.666554.
[5] C. Lin, H. T. Yen and C. Hung, "Efficient Strategies of Compressing Three-Dimensional
Sparse Arrays Based on Intel XEON and Intel XEON Phi Environments," 2015 IEEE
International Conference on Computer and Information Technology; Ubiquitous Computing and
Communications; Dependable, Autonomic and Secure Computing; Pervasive Intelligence and
Computing, Liverpool, 2015, pp. 1383-1388, doi: 10.1109/CIT/IUCC/DASC/PICOM.2015.206.
[6] M. Aldinucci, M. Danelutto, M. Meneghin, P. Kilpatrick and M. Torquati, "Efficient streaming applications on multi-core with FastFlow: The biosequence alignment test-bed," Advances in Parallel Computing, vol. 19, 2009, doi: 10.3233/978-1-60750-530-3-273.
[7] S. Zhang, B. He, D. Dahlmeier, A. C. Zhou and T. Heinze, "Revisiting the Design of Data
Stream Processing Systems on Multi-Core Processors," 2017 IEEE 33rd International
Conference on Data Engineering (ICDE), San Diego, CA, 2017, pp. 659-670, doi:
10.1109/ICDE.2017.119.
