
TERM PAPER

SIMD
ARCHITECTURE
Submitted by:
Nancy Mahajan
Roll No.: RB1801A22
Reg. No.: 10809333
B.Tech. CSE

Submitted to:
Mr. Vijay Garg
CSE Dept.

ACKNOWLEDGEMENT
First and foremost, I want to thank my Electrical Sciences teacher, "Mr. Vijay Garg", for giving me a term paper on SIMD Architecture. Such term papers enhance our capabilities and mental ability and keep us up to date on the related topic and subject. Secondly, I would like to thank the whole library faculty of "Lovely Professional University" for helping me prepare the project. Lastly, I would like to thank my parents for cooperating with me and helping me in every way throughout the project.

TABLE OF CONTENTS

1. Introduction
2. SIMD Operations
3. Types of SIMD Architecture
4. Advantages
5. Disadvantages
6. Bibliography

Introduction
SIMD (Single-Instruction Stream, Multiple-Data Stream) architectures are essential in the parallel world of computers. Their ability to manipulate large vectors and matrices in minimal time has created phenomenal demand in such areas as weather data and cancer radiation research. The power behind this type of architecture can be seen when the number of processor elements is equal to the size of the vector. In this situation, componentwise addition and multiplication of vector elements can be done simultaneously. Even when the size of the vector is larger than the number of processor elements available, the speedup compared to a sequential algorithm is immense.

SIMD ARCHITECTURE

The SIMD model of parallel computing consists of two parts:


1) A front-end computer of the usual von Neumann style.
2) A processor array.

The processor array is a set of identical synchronized processing elements capable of simultaneously performing the same operation on different data. Each processor in the array has a small amount of local memory where the distributed data resides while it is being processed in parallel. The processor array is connected to the memory bus of the front end so that the front end can randomly access the local processor memories as if they were another memory. Thus, the front end can issue special commands that cause parts of the memory to be operated on simultaneously or cause data to move around in the memory.

[Figure: SIMD Architecture Model]

A program can be developed and executed on the front end using a traditional serial programming language. The application program is executed by the front end in the usual serial way, but it issues commands to the processor array to carry out SIMD operations in parallel. The similarity between serial and data-parallel programming is one of the strong points of data parallelism. Synchronization is made irrelevant by the lockstep operation of the processors: at any given moment, the processors either do nothing or all perform exactly the same operation at the same time. In SIMD architecture, parallelism is exploited by applying simultaneous operations across large sets of data. This paradigm is most useful for solving problems that have lots of data that need to be updated on a wholesale basis. It is especially powerful in many regular numerical calculations.
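As a rough illustration of this programming model, the C sketch below (not from the paper) simulates a front end broadcasting one command to a small processor array; the array size, the pe_local arrays, and the broadcast_add helper are hypothetical names chosen only for this example, with an ordinary loop standing in for the lockstep hardware.

    #include <stdio.h>

    #define NUM_PES 8          /* hypothetical number of processing elements */

    /* Each PE's small local memory, holding one slice of the distributed data. */
    static float pe_local_a[NUM_PES];
    static float pe_local_b[NUM_PES];
    static float pe_local_out[NUM_PES];

    /* The "front end" issues one command; conceptually every PE applies it to
       its own local data in the same step. */
    static void broadcast_add(void)
    {
        for (int pe = 0; pe < NUM_PES; pe++)
            pe_local_out[pe] = pe_local_a[pe] + pe_local_b[pe];
    }

    int main(void)
    {
        /* Front end distributes data to the PEs' local memories... */
        for (int pe = 0; pe < NUM_PES; pe++) {
            pe_local_a[pe] = (float)pe;
            pe_local_b[pe] = 10.0f * pe;
        }

        /* ...then issues a single SIMD-style command. */
        broadcast_add();

        for (int pe = 0; pe < NUM_PES; pe++)
            printf("PE %d: %.1f\n", pe, pe_local_out[pe]);
        return 0;
    }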

There are two main configurations that have been used in SIMD machines. In the first scheme, each processor has its own local memory. Processors can communicate with each other through the interconnection network. If the interconnection network does not provide a direct connection between a given pair of processors, then this pair can exchange data via an intermediate processor. The ILLIAC IV used such an interconnection scheme. The interconnection network in the ILLIAC IV allowed each processor to communicate directly with four neighboring processors in an 8 × 8 matrix pattern, such that the ith processor can communicate directly with the (i − 1)th, (i + 1)th, (i − 8)th, and (i + 8)th processors. In the second SIMD scheme, processors and memory modules communicate with each other via the interconnection network. Two processors can transfer data between each other via intermediate memory module(s) or possibly via intermediate processor(s). The BSP (Burroughs Scientific Processor) used the second SIMD scheme.
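To make the neighbor pattern concrete, the small C sketch below (not from the paper) computes the four neighbors of processor i in an 8 × 8 array, assuming simple wraparound indexing modulo 64; the real ILLIAC IV end-around rules differed in detail.

    #include <stdio.h>

    #define N_PES 64   /* 8 x 8 processor array */

    /* Fill out[] with the four mesh neighbors of processor i: i-1, i+1, i-8, i+8.
       Plain modulo-64 wraparound is assumed here for simplicity. */
    static void neighbours(int i, int out[4])
    {
        out[0] = (i - 1 + N_PES) % N_PES;
        out[1] = (i + 1) % N_PES;
        out[2] = (i - 8 + N_PES) % N_PES;
        out[3] = (i + 8) % N_PES;
    }

    int main(void)
    {
        int nb[4];
        neighbours(10, nb);
        printf("PE 10 talks to PEs %d, %d, %d, %d\n", nb[0], nb[1], nb[2], nb[3]);
        return 0;
    }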

[Figure: Two SIMD Schemes]

SIMD operations
The basic unit of SIMD processing is the vector, which is why SIMD computing is also known as vector processing. A vector is nothing more than a row of individual numbers, or scalars.

A regular CPU operates on scalars, one at a time. (A superscalar CPU operates on several scalars at once, but each of them belongs to a different instruction.) A vector processor, on the other hand, lines up a whole row of these scalars, all of the same type, and operates on them as a unit. These vectors are represented in what is called packed data format: data are grouped into bytes (8 bits) or words (16 bits) and packed into a vector to be operated on. One of the biggest issues in designing a SIMD implementation is how many data elements it will be able to operate on in parallel. If you want to do single-precision (32-bit) floating-point calculations in parallel, then you can use a 4-element, 128-bit vector to do four-way single-precision floating point, or you can use a 2-element, 64-bit vector to do two-way single-precision floating point. So the length of the individual vectors dictates how many elements of what type of data you can work with.
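As one concrete illustration of four-way single-precision operation on a 128-bit vector, the sketch below uses the x86 SSE intrinsics (_mm_loadu_ps, _mm_add_ps, _mm_storeu_ps). This is just one possible SIMD instruction set, not one discussed in the paper, and it assumes an SSE-capable x86 compiler.

    #include <stdio.h>
    #include <xmmintrin.h>   /* x86 SSE intrinsics: 128-bit packed single precision */

    int main(void)
    {
        float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
        float c[4];

        __m128 va = _mm_loadu_ps(a);        /* pack four floats into one 128-bit vector */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);     /* one instruction adds all four pairs */
        _mm_storeu_ps(c, vc);

        printf("%.1f %.1f %.1f %.1f\n", c[0], c[1], c[2], c[3]);
        return 0;
    }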

TYPES OF SIMD ARCHITECTURE


There are two types of SIMD architecture we will be discussing: the first is true SIMD, followed by pipelined SIMD. Each has its own advantages and disadvantages, but their common attribute is a superior ability to manipulate vectors.

True SIMD
(Overview)

The two true SIMD architecture organizations differ only in how the memory modules, M, are connected to the arithmetic units, D. The D, or arithmetic units, are called the processing elements (PEs). In the distributed-memory organization, each memory module is uniquely associated with a particular arithmetic unit. The synchronized PEs are controlled by one control unit (CU). Each PE is basically an arithmetic logic unit with attached working registers and local memory for storage of distributed data. The CU decodes the instructions and determines where they should be executed: scalar and control-type instructions are executed in the CU, whereas vector instructions are broadcast to the PEs. In shared-memory SIMD machines, the local memories attached to the PEs are replaced by memory modules shared by all PEs through an alignment network. This configuration allows the individual PEs to share their memory without going through the CU.

True SIMD: Distributed Memory

The true SIMD architecture with distributed memory contains a single control unit (CU) with multiple processor elements (PEs) acting as arithmetic units (AUs). In this arrangement, the arithmetic units are slaves to the control unit. The AUs cannot fetch or interpret any instructions; they are merely units capable of addition, subtraction, multiplication, and division. Each AU has access only to its own memory. Consequently, if an AU needs information contained in a different AU, it must put in a request to the CU, and the CU must manage the transfer of that information. The advantage of this type of architecture lies in the ease of adding more memory and AUs to the computer. The disadvantage is the time wasted by the CU managing all memory exchanges.

True SIMD: Shared Memory


Another true SIMD architecture is designed with a configurable association between the PEs and the memory modules (M). In this architecture, the local memories that were attached to each AU above are replaced by memory modules. These modules are shared by all the PEs through an alignment network or switching unit. This allows the individual PEs to share their memory without going through the control unit. This type of architecture is certainly superior to the one above, but a disadvantage is inherited in the difficulty of adding memory.

Pipelined SIMD
Pipelined SIMD architecture is composed of a pipeline of arithmetic units with shared memory. The pipeline takes different streams of instructions and performs all the operations of an arithmetic unit. The pipeline is a first-in, first-out type of procedure. The sizes of the pipelines are relative. To take advantage of the pipeline, the data to be evaluated must be stored in different memory modules so that the pipeline can be fed with this information as fast as possible. The advantage of this architecture is the speed and efficiency of data processing, assuming the above stipulation is met. It is also possible for a single processor to perform the same instruction on a large set of data items. In this case, parallelism is achieved by pipelining: one set of operands starts through the pipeline, and before the computation is finished on this set of operands, another set of operands starts flowing through the pipeline.

Advantages of SIMD
The main advantage of SIMD is that processing multiple data elements at the same time, with a single instruction, can dramatically improve performance. For example, processing 12 data items could take 12 instructions with scalar processing, but would require only three instructions if four data elements are processed per instruction using SIMD. While the exact increase in code speed that you observe depends on many factors, you can achieve a dramatic performance boost if SIMD techniques can be utilized. Not everything is suitable for SIMD processing, and not all parts of an application need to be SIMD-accelerated to realize significant improvements. SIMD offers greater flexibility and opportunities for better performance in video, audio, and communications tasks, which are increasingly important for applications. SIMD provides a cornerstone for robust and powerful multimedia capabilities that significantly extend the scalar instruction set.
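To illustrate the 12-items-in-three-instructions claim, the C sketch below processes 12 floats four at a time using the same SSE intrinsics assumed in the earlier example, so the main loop issues only three vector additions instead of twelve scalar ones.

    #include <stdio.h>
    #include <xmmintrin.h>   /* x86 SSE intrinsics (assumed available) */

    int main(void)
    {
        float in1[12], in2[12], out[12];
        for (int i = 0; i < 12; i++) { in1[i] = (float)i; in2[i] = 100.0f; }

        /* A scalar loop would need 12 additions; this SIMD loop needs only 3. */
        for (int i = 0; i < 12; i += 4) {
            __m128 a = _mm_loadu_ps(&in1[i]);
            __m128 b = _mm_loadu_ps(&in2[i]);
            _mm_storeu_ps(&out[i], _mm_add_ps(a, b));
        }

        for (int i = 0; i < 12; i++)
            printf("%.1f ", out[i]);
        printf("\n");
        return 0;
    }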

SIMD can provide a substantial boost in performance and capability for an application that makes significant use of 3D graphics, image processing, audio compression, or other calculation-intensive functions. Other features of a program may be accelerated by recoding them to take advantage of the parallelism and additional operations of SIMD. Apple is adding SIMD capabilities to Core Graphics, QuickDraw, and QuickTime; an application that calls them today will see improvements from SIMD without any changes. SIMD also offers the potential to create new applications that take advantage of its features and power. To take advantage of SIMD, an application must be reprogrammed or at least recompiled; however, you do not need to rewrite the entire application. SIMD typically works best for the 10% of the application that consumes 80% of your CPU time -- these functions typically have heavy computational and data loads, two areas where SIMD excels.

Disadvantages
Because, in an SIMD machine, a single ACU provides the instruction stream for all of the array processors, the system will frequently be under-utilized whenever programs are run that require only a few PEs. To alleviate this problem, multiple-SIMD (MSIMD) machines were designed. They consist of multiple control units, each with its own program memory. The PEs are controlled by U control units that divide the machine into U independent virtual SIMD machines of various sizes. U is usually much smaller than N, the total number of PEs, and determines the maximum number of SIMD programs that can operate simultaneously. The distribution of the PEs among the ACUs can be either static or dynamic.
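As a rough illustration of static PE distribution, the C sketch below (hypothetical, not from the paper) divides N processing elements evenly among U control units, so each control unit drives its own contiguous block of PEs as an independent virtual SIMD machine; the constants are assumed values chosen only for the example.

    #include <stdio.h>

    #define N_PES 64   /* total processing elements (assumed) */
    #define U_CUS 4    /* number of control units (assumed)   */

    int main(void)
    {
        /* Static partition: control unit cu owns PEs [first, first + count). */
        for (int cu = 0; cu < U_CUS; cu++) {
            int count = N_PES / U_CUS;
            int first = cu * count;
            printf("Control unit %d drives PEs %d..%d\n", cu, first, first + count - 1);
        }
        return 0;
    }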

The MSIMD machine architecture has several advantages over normal SIMD machines, including:

Efficiency: If a program requires only a subset of the available PEs, the remaining PEs can be used for other programs.

Multiple users: Up to U different users can execute different SIMD programs on the machine simultaneously.

Fault detection: A program runs on two independent machine partitions, and errors are detected by result comparison.

Fault tolerance: A faulty PE only affects one of the multiple SIMD machines, and other machines can still operate correctly.

Bibliography
http://carbon.cudenver.edu/csprojects/CSC5809S01/Simd/archi.html

http://papers.ssrn.com/sol3/papers.cfm?abstract_id=944733

http://arstechnica.com/old/content/2000/03/simd.ars

Applications Tuning for Streaming SIMD Extensions. URL: http://developer.intel.com/technology/itj/Q21999/ARTICLES/art_5a.htm

Huff, Tom and Thakkar, Shreekant (1999). Internet Streaming SIMD Extensions. Computer, vol. 32, no. 12, 26-34.
