PDC Review 1 Group 11

This study focuses on optimizing streaming parallelism on heterogeneous many-core architectures, specifically using the technique of heterogeneous streaming to improve performance by sharing workloads. The research presents an automated approach that uses machine learning for core allocation, tested on an Intel Xeon Phi and an NVIDIA GTX 1080Ti, achieving speedups of 1.6x and 1.1x respectively. It also reviews related work and highlights the challenges of effectively utilizing the processing power of modern many-core accelerators.


Problem Statement

Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures

Domain Introduction:
Heterogeneous many-core accelerators are processing units used to accelerate the performance of parallel applications. Commonly used many-core accelerators include GPGPUs and the Intel Xeon Phi architecture. As users demand ever more performance, many-core accelerators have become more powerful, with more cores and more processing power per core. On the software side, however, it is difficult for applications to keep up and utilize all of the parallel processing power these accelerators make available. This study focuses on how parallel programs can fully utilize the performance boost offered by such many-core architectures.

Techniques & Related Challenges:


The technique used in this study [1] to improve performance on heterogeneous many-core platforms is known as heterogeneous streaming. This technique divides the workload of a parallel program by exploiting the independence between parts of the program so that they can run simultaneously, thus improving performance. In effect, it allows concurrent kernel executions to be overlapped with data movements.
Commonly used implementations of heterogeneous streaming include CUDA streams, OpenCL command queues and Intel hStreams.
This study aims to measure the performance improvement of data-parallel applications obtained by exploiting spatial and temporal sharing of heterogeneous streams.
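
To make temporal sharing concrete, below is a minimal CUDA sketch (not code from the paper; the kernel, data sizes and chunk count are invented for illustration) that splits a vector-scaling job into chunks and issues each chunk's host-to-device copy, kernel launch and device-to-host copy into its own stream, so that data transfers for one chunk can overlap with computation on another:

#include <cuda_runtime.h>
#include <cstdio>

// Toy kernel: scale each element of a chunk.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int N = 1 << 22, CHUNKS = 4, CHUNK = N / CHUNKS;
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float));   // pinned memory so async copies can overlap
    cudaMalloc(&d, N * sizeof(float));
    for (int i = 0; i < N; ++i) h[i] = 1.0f;

    cudaStream_t streams[CHUNKS];
    for (int s = 0; s < CHUNKS; ++s) cudaStreamCreate(&streams[s]);

    // Each chunk's copy-in, kernel and copy-out go into its own stream;
    // the hardware can overlap transfers of one chunk with compute on another.
    for (int s = 0; s < CHUNKS; ++s) {
        size_t off = (size_t)s * CHUNK;
        cudaMemcpyAsync(d + off, h + off, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(CHUNK + 255) / 256, 256, 0, streams[s]>>>(d + off, CHUNK, 2.0f);
        cudaMemcpyAsync(h + off, d + off, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();
    printf("h[0] = %f\n", h[0]);             // expect 2.0

    for (int s = 0; s < CHUNKS; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}

Intel hStreams exposes the same idea on the Xeon Phi and additionally lets the programmer partition the device's cores into groups and bind streams to those partitions, which corresponds to the spatial sharing studied in [1].
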
The main difference between this study and past work on task scheduling is that this study relies on partitioning the processors into groups in order to improve host-device communication on coprocessors like the Xeon Phi. The study also departs from earlier work by developing an automatic approach that dynamically adjusts the processor partitioning and task granularity at runtime. It also incorporates several performance optimizations specific to the Intel Xeon Phi platform, as described in Cheng et al. [2] and Jha et al. [3].

Research Findings:
The study contributes an easy-to-use, automated approach to exploiting streaming parallelism on heterogeneous many-core architectures, using a machine-learning-based model that decides the core allocation for a given program. The approach was tested on an Intel Xeon Phi as well as an NVIDIA GTX 1080Ti GPU, with 39 different benchmarks. Experimental results showed speedups of 1.6x and 1.1x on the Xeon Phi and the GTX 1080Ti respectively.
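
As a purely illustrative sketch (this is not the authors' model; the features, thresholds and configuration values below are invented), the core-allocation decision can be viewed as a function from a few program features to a stream configuration. The real mapping in [1] is learned offline by a machine-learning model rather than hand-written as it is here:

#include <cstdio>

// Features that could be extracted from a streamed program (illustrative only).
struct ProgramFeatures {
    double transfer_to_compute_ratio;  // bytes moved per arithmetic operation
    long   parallel_iterations;        // size of the data-parallel loop
};

// A stream configuration: how many core groups (partitions) to create
// and how many tasks to split the work into.
struct StreamConfig {
    int partitions;
    int tasks;
};

// Hypothetical stand-in for a trained model: a hand-written decision rule.
StreamConfig predict(const ProgramFeatures &f) {
    if (f.transfer_to_compute_ratio > 1.0) return {4, 16};   // transfer-bound: favour overlap
    if (f.parallel_iterations > 1000000)   return {8, 32};   // large, compute-heavy workloads
    return {2, 4};                                            // small workloads: keep overhead low
}

int main() {
    StreamConfig c = predict({0.2, 5000000});
    printf("partitions=%d tasks=%d\n", c.partitions, c.tasks);
    return 0;
}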

Individual Contribution:
Prateek Sharma - 18BCI0215: Review and summary of the base paper [1] and related works [2][3], as well as compilation of the final document.
Lakshit Mangla - 18BCI0246: Review and summarization of related works [4][5]
Anshil Seth - 18BCI0173: Review and summarization of related works [6][7]

Review of related papers:


1. Paper Name :- Heterogeneous Parallel Computing with Java: Jabber or Justified?[4]
Author :- H. G. Dietz

The main focus of this study was to determine whether Java is a good language for heterogeneous parallel computing. The paper starts by describing what heterogeneous computing is: the concept of using a collection of machines, each of which may have somewhat different properties from the others, to achieve speedup on a computation.
The study then divides heterogeneous computing into four main aspects:
1) Architecture
2) Speedup
3) Portability
4) Transformability

The study then focuses on Java, listing some of its benefits as well as what it lacks. Among its benefits are:
1) Data Types
2) Object Oriented
3) Threads
4) Support for Networking
5) Graphic Support

The paper concludes that Java has many features that should be part of a programming model for heterogeneous parallel computing, but that the Java model as it stands is not 100% appropriate.
2. Paper Name :- Efficient Strategies of Compressing Three-Dimensional Sparse Arrays Based on Intel XEON and Intel XEON Phi Environments[5]
Author :- Chun-Yuan Lin, Che-Lun Hung

Array operations are used in many important scientific codes. Many methods have been proposed in the past to implement these array operations efficiently, but most of them focus on two-dimensional arrays. Parallel computing is a suitable way to speed up array operations in terms of both time and memory space; parallel versions of these operations can be designed and implemented on a shared-memory multiprocessor.
Three strategies are compared to increase efficiency (the CRS layout is sketched below):
1) The CRS (compressed row storage) scheme
2) An inter-task parallelization strategy
3) An intra-task parallelization strategy

Experiments were carried out using these three strategies and the results compared. The speedup ratio achieved by inter-task parallelization was better than that of intra-task parallelization in most cases.
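
As a reminder of what the CRS layout looks like, here is a small self-contained sketch (not code from the paper, and showing only the two-dimensional case for brevity) that compresses a dense matrix into the usual three CRS arrays of non-zero values, column indices and row pointers:

#include <cstdio>
#include <vector>

// Compressed row storage for a sparse matrix.
struct CRS {
    std::vector<float> values;    // non-zero values
    std::vector<int>   col_idx;   // column index of each non-zero
    std::vector<int>   row_ptr;   // row_ptr[r]..row_ptr[r+1] indexes row r's non-zeros
};

// Build the CRS arrays from a dense rows x cols matrix stored in row-major order.
CRS compress(const std::vector<float> &dense, int rows, int cols) {
    CRS m;
    m.row_ptr.push_back(0);
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            float v = dense[r * cols + c];
            if (v != 0.0f) {
                m.values.push_back(v);
                m.col_idx.push_back(c);
            }
        }
        m.row_ptr.push_back((int)m.values.size());
    }
    return m;
}

int main() {
    // 3x4 matrix with 4 non-zeros.
    std::vector<float> dense = {5, 0, 0, 1,
                                0, 0, 0, 0,
                                0, 2, 3, 0};
    CRS m = compress(dense, 3, 4);
    for (size_t i = 0; i < m.values.size(); ++i)
        printf("value %.0f at column %d\n", m.values[i], m.col_idx[i]);
    printf("row_ptr:");
    for (int r : m.row_ptr) printf(" %d", r);
    printf("\n");
    return 0;
}

The paper extends this kind of compression to three-dimensional sparse arrays, and the compressed operations are then parallelized either across independent tasks (inter-task) or within a single task (intra-task) on the Xeon and Xeon Phi platforms.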

3. Paper Name :- Efcient streaming applications on multi-core with FastFlow: the


biosequence alignment test-bed[6]
Author :- Marco ALDINUCCI , Marco DANELUTTO, Massimiliano MENEGHIN
,Massimo TORQUATI and Peter KILPATRICK

The tighter coupling of on-chip resources changes the communication-to-computation ratio that influences the design of parallel algorithms. Modern Single Chip Multiprocessor (SCM) architectures introduce the potential for low-overhead inter-core communication, synchronisation and data sharing thanks to fast-path access to the on-die caches. These caches, organised in a hierarchy, are also a potential limiting factor of these architectures, since access patterns that are not carefully optimised may lead to significant contention for shared caches and invalidation pressure for replicated caches.
The paper shows how the FastFlow library can be used to build a widely used parallel programming paradigm (a.k.a. skeleton), namely the streaming stateful farm, and compares its raw scalability against a hand-tuned Pthread-based counterpart on a dual quad-core Intel platform. The FastFlow farm skeleton can be rapidly and effectively used to boost the performance of many existing real-world applications, for example the Smith-Waterman local alignment algorithm. The authors show that a straightforward porting of the multi-threaded x86/SSE2-enabled SWPS3 implementation onto FastFlow is twice as fast as SWPS3 itself, which is a hand-tuned high-performance implementation.
They evaluated the performance of FastFlow with two families of applications: a synthetic micro-benchmark and the Smith-Waterman local sequence alignment algorithm.
All experiments were executed on a shared-memory Intel platform with two quad-core Xeon E5420 Harpertown processors running at 2.5 GHz with 6 MB of L2 cache, and 8 GB of main memory.
The evaluation concludes that the FastFlow version of the Smith-Waterman algorithm, obtained from a third-party high-performance implementation by simply substituting the communication primitives, is always faster than the original version and exhibits double its speedup on fine-grained datasets. The presented results serve as a “feasibility study” of the proposed approach.
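
For readers unfamiliar with the farm skeleton, the sketch below (plain C++ threads and a locked queue, deliberately not FastFlow's actual API) shows the pattern being discussed: an emitter pushes stream items into a channel and a pool of workers pulls and processes them concurrently. FastFlow packages the same structure as a ready-made skeleton built on lock-free queues, which is a key source of the low overhead reported in the paper:

#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// A minimal thread-safe queue used as the emitter-to-worker channel.
// (FastFlow instead uses lock-free single-producer/single-consumer queues.)
class TaskQueue {
    std::queue<int> q;
    std::mutex m;
    std::condition_variable cv;
    bool closed = false;
public:
    void push(int v) {
        { std::lock_guard<std::mutex> lk(m); q.push(v); }
        cv.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m); closed = true; }
        cv.notify_all();
    }
    bool pop(int &out) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return !q.empty() || closed; });
        if (q.empty()) return false;   // closed and drained
        out = q.front(); q.pop();
        return true;
    }
};

int main() {
    TaskQueue queue;
    const int WORKERS = 4;

    // Workers: repeatedly pop a stream item and process it.
    std::vector<std::thread> workers;
    for (int w = 0; w < WORKERS; ++w)
        workers.emplace_back([&queue, w] {
            int item;
            while (queue.pop(item))
                printf("worker %d: item %d -> %d\n", w, item, item * item);
        });

    // Emitter: generate the input stream, then signal end-of-stream.
    for (int i = 0; i < 16; ++i) queue.push(i);
    queue.close();

    for (auto &t : workers) t.join();
    return 0;
}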

4. Paper Name :- Revisiting the Design of Data Stream Processing Systems on Multi-Core Processors[7]
Author :- Shuhao Zhang (SAP Innovation Center Singapore), Bingsheng He (National University of Singapore), Daniel Dahlmeier (SAP Innovation Center Singapore), Amelie Chi Zhou

The paper addresses performance issues that prevent data stream processing (DSP) systems from fully exploiting modern scale-up architectures; fixing them also benefits scale-out environments. These performance issues are demonstrated on two platforms: Apache Storm and Apache Flink.
The three design aspects evaluated are:
1) Pipelined processing with message passing,
2) On-demand data parallelism, and
3) JVM-based implementation.
The authors revisit these three common design aspects of modern DSP systems on modern multi-socket multi-core architectures, backed by detailed profiling studies on Apache Storm and Flink.
Two major outcomes highlighted by these studies are:
1) The design choice of supporting both pipelined and data-parallel processing results in a very complex parallel execution model in DSP systems, causing high front-end stalls on a single CPU.
2) The design of continuous message-passing mechanisms between operators severely limits the scalability of DSP systems on multi-socket multi-core architectures.
Base Paper and Reference Papers:
[1] P. Zhang, J. Fang, C. Yang, C. Huang, T. Tang and Z. Wang, "Optimizing Streaming
Parallelism on Heterogeneous Many-Core Architectures," in IEEE Transactions on Parallel and
Distributed Systems, vol. 31, no. 8, pp. 1878-1896, 1 Aug. 2020, doi:
10.1109/TPDS.2020.2978045.
[2] X. Cheng et al., “Many-core needs fine-grained scheduling: A case study of query processing
on Intel Xeon Phi processors,” J. Parallel Distrib. Comput., vol. 120, pp. 395–404, 2018.
[3] S. Jha et al., “Improving main memory hash joins on Intel Xeon Phi processors: An experimental approach,” Proc. VLDB Endowment, vol. 8, pp. 642–653, 2015.
[4] H. G. Dietz, "Heterogeneous parallel computing with Java: jabber or justified?," Proceedings
Seventh Heterogeneous Computing Workshop (HCW'98), Orlando, FL, USA, 1998, pp.
159-162, doi: 10.1109/HCW.1998.666554.
[5] C. Lin, H. T. Yen and C. Hung, "Efficient Strategies of Compressing Three-Dimensional
Sparse Arrays Based on Intel XEON and Intel XEON Phi Environments," 2015 IEEE
International Conference on Computer and Information Technology; Ubiquitous Computing and
Communications; Dependable, Autonomic and Secure Computing; Pervasive Intelligence and
Computing, Liverpool, 2015, pp. 1383-1388, doi: 10.1109/CIT/IUCC/DASC/PICOM.2015.206.
[6] M. Aldinucci, M. Danelutto, M. Meneghin, P. Kilpatrick and M. Torquati, "Efficient streaming applications on multi-core with FastFlow: The biosequence alignment test-bed," Advances in Parallel Computing, vol. 19, 2009, doi: 10.3233/978-1-60750-530-3-273.
[7] S. Zhang, B. He, D. Dahlmeier, A. C. Zhou and T. Heinze, "Revisiting the Design of Data
Stream Processing Systems on Multi-Core Processors," 2017 IEEE 33rd International
Conference on Data Engineering (ICDE), San Diego, CA, 2017, pp. 659-670, doi:
10.1109/ICDE.2017.119.
