Sparse Matrix Techniques
(Tutorial)
[Link]
LawrenceBerkeleyNationalLab
Math290/CS298,UCB
Jan.31,2007
Outline
PartI
Computerrepresentationsofsparsematrices
Sparsematrixvectormultiplywithvariousstorages
Performanceoptimizations
PartII
Techniquesforsparsefactorizations
(e.g.,SuperLUsolver)
01/31/07
Math290/CS298
SparseStorageSchemes
Notation
Ndimension
NNZnumberofnonzeros
Assumearbitrarysparsitypattern
tripletsformat({i,j,val})isnotsufficient...
Storage:2*NNZintegers,NNZreals
Noteasytorandomlyaccessoneroworcolumn
Linkedlistformatprovidesflexibility,butnotfriendlyonmodern
architectures...
CannotcallBLASdirectly
01/31/07
Math290/CS298
CompressedRowStorage(CRS)
Storenonzerosrowbyrowcontiguously
Example:N=7,NNZ=19
3arrays:
Storage:NNZreals,NNZ+N+1integers
2
c d
e
4
h
f
5
g
i 6 j
l
7
135811131720
nzval1a2bcd3e4f5ghi6jkl7
colind1425123245574567357
rowptr135811131720
01/31/07
Math290/CS298
SpMV(y=Ax)withCRS
135811131720
nzval1a2bcd3e4f5ghi6jkl7
colind1425123245574567357
rowptr135811131720
doi=1,N...rowiofA
sum=0.0
doj=rowptr(i),rowptr(i+1)1
sum=sum+nzval(j)*x(colind(j))
enddo
y(i)=sum
enddo
dotproduct
Nolocalityforx
Vectorlengthusuallyshort
Memorybound:3reads,2flops
01/31/07
Math290/CS298
CompressedColumnStorage(CCS)
AlsoknownasHarwellBoeingformat
Storenonzeroscolumnwisecontiguously
3arrays:
Storage:NNZreals,NNZ+N+1integers
2
c d
e
4
h
f
5
g
i 6 j
l
7
nzval1c2de3ka4hbf5il6gj7
rowind1323437146245676567
colptr136811161720
01/31/07
Math290/CS298
SpMV(y=Ax)withCCS
nzval1c2de3ka4hbf5il6gj7
rowind1323437146245676567
colptr136811161720
y(i)=0.0,i=1N
doj=1,N...columnjofA
t=x(j)
doi=colptr(j),colptr(j+1)1
y(rowind(i))=y(rowind(i))+nzval(i)*t
enddo
enddo
SAXPY
Nolocalityfory
Vectorlengthusuallyshort
Memorybound:3reads,1write,2flops
01/31/07
Math290/CS298
JaggedDiagonalStorage(JDS)
AlsoknownasITPACK,orEllpackstorage[Saad,Kincaidetal.]
Forceallrowstohavethesamelengthasthelongestrow,
thencolumnsarestoredcontiguously
1
c
a
2
d
e
b
3
4
h
k
f
5
i
l
1
2
c
e
g
5
j
h
7
k
a
b
d
4
g
i
l
0
0
3
f
0
6
7
0
0
0
0
0
j
0
2arrays:nzval(N,L)andcolind(N,L),whereL=maxrowlength
N*Lreals,N*Lintegers
UsuallyL<<N
01/31/07
Math290/CS298
SpMVwithJDS
y(i)=0.0,i=1N
doj=1,L
doi=1,N
y(i)=y(i)+nzval(i,j)*x(colind(i,j))
enddo
enddo
1
2
c
e
5
h
k
a
b
d
4
g
i
l
0
0
3
f
0
6
7
0
0
0
0
0
j
0
NeitherdotnorSAXPY
Goodforvectorprocessor:longvectorlength(N)
Extramemory,flopsforpaddedzeros,especiallybadifrowlengths
varyalot
01/31/07
Math290/CS298
SegmentedSum[Blellochetal.]
DatastructureisanaugmentedformofCRS
ComputationalstructureissimilartoJDS
Eachrowistreatedasasegmentinalongvector
Underlinedelementsdenotethebeginningofeachsegment
(i.e.,arowinA)
Dimension:S*L~NNZ,whereLischosentoapproximatethe
hardwarevectorlength
01/31/07
1
2
c d
e
a
b
3
4
h
k
f
5
i
l
1 d
a 3
2 e
g
b 4
6 j
c f
7
Math290/CS298
5
g
h
i
6
j
k
l
7
10
SpMVwithSegmentedSum
1
2
c d
e
a
b
3
4
h
k
f
5
i
l
1 d
a 3
2 e
g
b 4
6 j
c f
7
5 j
g k
h l
i 7
6
doi=S,1
doj=1,L
...
enddo
enddo
2arrays:nzval(S,L)andcolind(S,L),whereS*L~NNZ
NNZreals,NNZintegers
Goodforvectorprocessors
SpMVisperformedbottomup,witheachrowsum(dot)ofAxstoredin
thebeginningofeachsegment
SimilartoJDS,butwithmorecontrollogicininnerloop
01/31/07
Math290/CS298
11
Performance(megafloprate)[Gaekeetal.]
Testmatrix:N=10000,NNZ=177782,randompattern
~18nonzerosperrowonaverage
JDSdoes4.6xmoreoperations
machine
Ultra2i
Pentium4
VIRAM
Clockrate
333MHz
1.5GHz
200MHz
Peakfloprate
667Mflops
1.5Gflops
1.6Gflops
CRS
29
209
110
JDS
(effective)
27
6
17
4
632
137
SegSum
29
165
01/31/07
Math290/CS298
12
OptimizationTechniques
Matrixreordering
ForCRSSpMV,canimprovexvectorlocalitybyreducingthe
bandwidthofmatrixA
Example:reverseCuthillMcKee(breadthfirstsearch)
Observed23ximprovement[Toledo,etal.]
01/31/07
Math290/CS298
13
OptimizationTechniques
Registerblocking
FinddenseblocksofsizerbycinA
(Ifneeded,allowsomezerostobefilledin)
A*xisproceededblockbyblock
keepcelementsofxandrelementsofyinregisters
xelementreusedrtimes,yelementreusedctimes
Amountofindexedloadandstoreisreduced
Obtainedupto2.5ximprovement[Vuducetal.]
01/31/07
Math290/CS298
14
SPARSITY[Im,Yelick]
01/31/07
Math290/CS298
15
PerformanceImprovement[Vuducetal.]
01/31/07
Math290/CS298
16
OtherRepresentations
Blockentryformats(e.g.,multipledegreesoffreedomare
associatedwithasinglephysicallocation)
Constantblocksize(BCRS)
Varyingblocksizes(VBCRS)
Skyline(orprofile)storage(SKS)
Lowertrianglestoredrowbyrow
Uppertrianglestoredcolumnbycolumn
Ineachrow(column),firstnonzero
definesaprofile
Allentrieswithintheprofile
(somemaybezeros)arestored
01/31/07
Math290/CS298
17
References
Templatesforthesolutionoflinearsystems
Barrett,etal.,SIAM,1994
BeBOP:[Link]
SparseBLASstandard:
[Link]
01/31/07
Math290/CS298
18