Lesson25DistributedMemorySorting
BitonicSort
Thepseudocodeforgeneratingabitonicsequence:
1. createtwobitonicsubsequences
2. Makeoneanincreasingsequence
3. Makeoneintoadecreasingsequence
2
Workforabitonicsort:W
(n)= (nlog
(n))thisalgorithmisnotworkoptimal
bs
2
Spanforabitonicsort:D
(n)= (log
(n))
bs
DistributedBitonicMergeviaBinaryExchange
Withregardstobitonicsorts,everythingboilsdowntodoingbitonicsortsefficiently.
Step1:abitonicsplit,performedinplace
Attheendofstage1therearetwobitonicsubsequences
Foradistributedalgorithm,dividetheprocessesbetweenthe
nodes.
Thisshowstheminimumbetweenthetwoinputs.(Thestartofthesplit.)
[Link]
indicatethedependencebetweenthetwoinputsandoutputs.
Theendresultisabitonicsplit.
Thisshowstheinputs,theoutputs,andthepatternsof
dependencies.
Tolookatthedistributedversionofthethis:
Using4processingnodes.
Inthiscase,communication
Communicationoccursanywhereadependenceedgecrosses
aprocessboundary.
Binaryexchange=twoprocessesswappingdatawitheach
other.
Notethatallcommunicationoccursinthefirstlog(p)stages.
Thereareonlylog(p)stagesthatrequirecommunication.
Intheotherlog(n/P)stagesthereisnoneedforcommunicationbetweenprocesses.
Inacyclicdistributiontherowsofthenetworkareassignedtodifferentprocessesina
[Link].
PickaNetwork
Whichnetworkwouldallowforfullyconcurrentexchangeswithoutcongestion?
Hypercubeandfullyconnectedtopologies.
[Link]
networkismorethannecessary.
CommunicationCostofaBitonicMerge
Whatisthecommunicationtimeofabitonicmerge,assumingablockdistributed,
binaryexchangeschemeonahypercube.
Communicationtime=(a+(b*(n/p))*log(p))
Recall,thebinaryexchangeschemeonlycommunicatesduringthefirstlog(P)stages.
Eachprocesshastosendn/Pwordsateachstage.
BitonicMergeviaTransposes
Twodistributedbitonicmergeschemeshavebeendiscussed:
Blockdistributionscheme:log(P)stagesofcommunication
andlog(n/P)stagesofcomputation.
Cyclicscheme:log(n/P)stagesofcomputationandlog(P)stages
ofcomputation
Therunningtimeforthetwoarethesame:T
(nP)= log(P)+ (n/P)log(P)
net
Istheresomethingthatwillreducethe term,evenatthecostofincreasingthe term?YES
Startwithacyclictopology:
[Link],whichmeansno
[Link](or
shuffled).
Thetransposecanbeseenasanalltoallexchangeorasa
matrixtranspose.
2
Ifwelookatoneprocessnote:itneedstosendP1messages,eachofsizen/P
,toalltheother
processes.
Todeterminetheprocesstime:
1. Assumethenetworkisfullyconnected.
T
(nP)= (P1)+ (n/P)((P1)/P)Fullyconnectednetwork.
trans
Thereisalatencybandwidthtradeoffbetweenthetwo
schemes.
Inpracticeisitveryhardtofortheblockorcyclicschemeto
beatthetransposescheme.
ButterflyTrivia
Nameanotherfamousalgorithmthatfollowsthesame
computationalpattern.
FastFourierTransform
BitonicSortCostComputation
= cost/compare
(n/P )K = totaltimetodothecomparisons,atmergingstageK
Therearelognmergingstages,sothetotalcostforcomputationis:
BitonicSortCostCommunication
Assume:n,ParepowersofnP/n
T
(m)= + m
msg
Whatisthecommunicationtime?
Thekeyquestion:wheredoes
communicationhappen?
ForeachstageK,thesizeofthe
K
bitonicmergeisn
=2
K
Stage4asinglebitonicmergeof
size16.
AssumeablockdistributionwithPprocesses.
CommunicationonlystartswhenK>log(n/P)
Klog(n/P)
P
=2
whenk==log(n)itissimplifiedtoP
K
BitonicMergeinPprocessesofnsizerequires
Timetocommunicate
Sothetimefortheentiresortis:
Thesimplifiedversionis:
O( log(P ) + (n/P )(log 2
P))
LinearTimeDistributedSort,Part1
Anycomparisonbasedalgorithmforsortingscalesto:O(nlogn)
ForthebucketsortO(n)
Todobucketsort:
1. Startbyassumingyouknowthepossiblevalues.R={0,1,2,3,...,m1}
2. Thevaluestobesortedareuniformlydistributedovertherange.
3. DivideRintokbuckets
4. Thebucketsortfirstfiguresoutwhichbucketeachvaluebelongs.
5. Sortwithinineachbucketandconcatenatetheresults.
Howisthisalineartimescheme?
Howmanyelementsareineachbucket?theexpectednumberofelementsineachbucketwill
ben/k.(Assumetheelementsareevenlydistributedacrosstherange)
Thetimetosorteachbucketis:O(n/klog(n/k))
DistributedBucketSort
Itistodistributeassigneachbuckettoacomputenode.
Makethefollowingassumptions:
k=P
elementspernode~n/Pelementspertable
Assumeallnodesknowbucketranges
Assumethenetworkisfullyconnected
Therunningtimeis:O(b*(n/P)+t*(n/P)+(a*P))
Thestepstogetthis:
1. Eachprocessneedstoscanitslistoflocalelementsanddecidewhichelementsgo
[Link].
2. Thenodesneedtoexchangeelements(analltoalloperation).Eachnodehas~n/P
2
[Link]/P
[Link]
connected.
3. Noweachbucketmustdoalocalsort,ofcostO(n/P)
LineartimeDistributedSort,Part2:SampleSort
Thebucketsorthasamajorflawtheassumptionofauniformdistributionofvaluesacross
[Link].,youwillnotgetanequalnumberofelementsin
eachbucket.
SouseSampling.
Dobucketsort,[Link],
usesampling.
TodoSampleSort:
1. Beginwithdataand,inthiscase,3processes.
2. Assumetheelementsareequallydist.amongthe3processes:
3. Sortthemlocally
4. [Link]
sameequallyspacedelements.
5. Gatherthesamplesintheroot.
6. Sortthesamplesontheroot.
7. SelectP1splitters(inthiscase2)
8. Thesplittersdefinetheglobalbucketboundaries.
9. Forthisexample:
Thefirstbucketwillgetthefirstsplit,[Link]
willgetthesecondsplit,[Link],4end.
10. Nowthesplitterswillneedtobebroadcast.
11. Eachnodecanpartitionitselementsusingthesplitters.
12. [Link].
13. Theneachnodewilldoalocalsort.
Intherunningtimeforthissamplesort,whatisthelargestasymptoticfunctionofP?(AssumeP
processes)
2
2
2
O(P
)orO(P
logP)...TheroothastosortP
samples.
Ifthesystemistrulymassive,thiscouldbeadelimitedtoscalability.