2IIG0 Cheat Sheet 1

The document covers various concepts in linear algebra and machine learning, including norms, classifiers, learning algorithms, and optimization techniques. It discusses methods such as k-nearest neighbors, decision trees, and neural networks, along with their mathematical foundations and performance metrics like bias, variance, and error rates. Additionally, it highlights the importance of regularization and feature selection in model training and evaluation.


Linear Algebra

Norm ⇐⇒ ∥v + w∥ ≤ ∥v∥ + ∥w∥ (Triangle inequality), ∥αv∥ = |α|∥v∥ (Homogeneity), ∥v∥ = 0 ⇐⇒ v = 0
v^T w = v · w = cos(v, w)∥v∥∥w∥ (Property of the Euclidean (L2) norm)
v · w/∥w∥ (Projection of v onto w)
∥v∥_p = (Σ_{i=1}^d |v_i|^p)^{1/p} (Lp norm for vectors)
∥A∥_p = (Σ_{i=1}^n Σ_{j=1}^m |A_{ji}|^p)^{1/p} (Lp norm for matrices)
Lp norms with p < 1 violate the triangle inequality (they are not norms).
∥A∥_op = max_{∥v∥=1} ∥Av∥ (Operator norm)
tr(ABCD) = tr(BCDA) = tr(CDAB) = ... (Cycling property)
L2 (= Frobenius for matrices) norm properties:
• ∥v∥² = v^T v, ∥A∥² = tr(A^T A)
• d/dX ∥AX + B∥² = 2A^T (AX + B)
SVD: For every matrix X ∈ R^{n×p} there exist orthogonal matrices U ∈ R^{n×n}, V ∈ R^{p×p} and a matrix Σ ∈ R^{n×p} such that
• X = UΣV^T
• U^T U = UU^T = I_n, V^T V = VV^T = I_p
• Σ is diagonal, containing the singular values
• If all singular values are > 0 then X^{-1} = VΣ^{-1}U^T
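A minimal NumPy sketch of the SVD identities above; the example matrix X and its size are arbitrary choices for illustration, not values from the course.

import numpy as np

# Sketch of the SVD identities; X is an arbitrary example matrix in R^{n x p}.
X = np.array([[3.0, 1.0], [2.0, 4.0], [0.0, 1.0]])          # n = 3, p = 2

U, s, Vt = np.linalg.svd(X, full_matrices=True)             # X = U Sigma V^T
Sigma = np.zeros(X.shape)
Sigma[:len(s), :len(s)] = np.diag(s)                        # Sigma is n x p, singular values on the diagonal

print(np.allclose(X, U @ Sigma @ Vt))                       # reconstruction holds
print(np.allclose(U.T @ U, np.eye(3)), np.allclose(Vt @ Vt.T, np.eye(2)))   # U and V are orthogonal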

Optimization

Analytical Optimisation
FONC & SONC (unconstrained minimum, twice differentiable):
∇f(x) = 0, and ∇²f(x) is positive semidefinite (a matrix A is positive semidefinite if x^T A x ≥ 0 ∀x ∈ R^d)
Lagrangian: Adding constraints c_i(x) = 0 and g_k(x) ≥ 0.
C = {x ∈ R^d | constraints hold on x} (Feasible set)
L = f(x) − Σ_{i=1}^m λ_i c_i(x) − Σ_{k=1}^l μ_k g_k(x) (Lagrangian)
μ_i ≥ 0 for a ≥ constraint, otherwise μ_i ≤ 0. Sometimes (e.g. for convex problems) the minimizer of L will minimize f. To minimize L, solve the system of d + m + l equations:
∇f(x) = Σ_{i=1}^m λ_i ∇c_i(x) + Σ_{k=1}^l μ_k ∇g_k(x)
c_i(x) = 0 ∀i: 1 ≤ i ≤ m
g_k(x) ≥ 0 ∀k: 1 ≤ k ≤ l
Numerical Optimisation
Coordinate Descent: Minimise over each input variable in turn:
x_i^{(t+1)} = argmin_{x_i} f(x_1^{(t)}, ..., x_i, ..., x_d^{(t)}), 1 ≤ i ≤ d
Gradient Descent: x_{t+1} = x_t − η∇f(x_t), where η is the step size.
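A short sketch of plain gradient descent on a convex quadratic; the matrix A, vector b, step size and iteration count are illustrative assumptions.

import numpy as np

# Gradient descent on f(x) = 0.5 x^T A x - b^T x, whose gradient is A x - b.
A = np.array([[3.0, 0.5], [0.5, 1.0]])      # positive definite, so f is convex
b = np.array([1.0, -2.0])

x = np.zeros(2)
eta = 0.1                                    # step size
for _ in range(200):
    x = x - eta * (A @ x - b)                # x_{t+1} = x_t - eta * grad f(x_t)

print(x, np.linalg.solve(A, b))              # agrees with the analytic minimizer A^{-1} b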
Convex Optimisation
Numerical optimisation gives a local minimizer. For a convex function, every local minimizer is a global minimizer.
Convex Set: X is convex ⇐⇒ ∀x, y ∈ X, α ∈ [0, 1]: αx + (1 − α)y ∈ X (the line segment lies entirely in X)
Convex Function: f: R^d → R is convex ⇐⇒ f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y) ∀x, y ∈ R^d, α ∈ [0, 1] (the function lies below the line)
Convex Properties:
• Nonnegative weighted sums of convex functions are convex.
• g: R^d → R^k, g(x) = Ax + b ∧ f: R^k → R convex =⇒ f ∘ g is convex.
• All norms are convex.
• Linear functions (and maps) are convex and concave.
Classification
Feature vector x ∈ R^D, class/label value y, N training samples (x_1, y_1), ..., (x_N, y_N).
Assumption: samples are independent and identically distributed from an unknown distribution p*(X, Y), (x_i, y_i) ∼ p*.
Classifier: A function h that maps x to a predicted label ŷ.
Learning Algorithm: Takes a dataset and returns a classifier h.
True Classification Error: P[Y ≠ h(X)]
A classifier h* that minimizes this error is Bayes optimal. This is only possible if the true distribution is known.

K-Nearest Neighbours
Pick the modal class among the k nearest neighbours (using the Euclidean or another distance metric). Taking the modal class is stable and reduces the effect of data noise. Use an odd k for binary classification (to ensure a majority exists).
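A minimal k-NN classifier sketch in NumPy; the toy training points and the choice k = 3 are made up for illustration.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Modal class among the k nearest training points (Euclidean distance).
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [0.2, 0.1]])
y_train = np.array([0, 0, 1, 1, 0])
print(knn_predict(X_train, y_train, np.array([0.95, 0.9]), k=3))   # -> 1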
   
Naive Bayes
h*(x) = argmax_y p*(y|x). Idea: learn p*(y|x).
Bayes Rule: p*(y|x) = p*(x|y)p*(y)/p*(x) ∝ p*(x|y)p*(y)
argmax_y p*(y|x) = argmax_y p*(x|y)p*(y)
Estimate p*(y) from the class distribution of the training set.
Assumption (1): conditional independence of the D features: p*(x|y) ≈ Π_{d=1}^D p*(x_d|y), giving us
h_NB(x) = argmax_y p*(y) Π_{d=1}^D p*(x_d|y)

Decision Trees
Possible decisions:
• Continuous: sort, use a threshold between consecutive data points.
• Discrete: partition the set into 2 (one vs all, or an arbitrary partition).
Cost:
• N is the number of data points at the old leaf L.
• N_0, N_1 are the numbers of data points at the new leaves L_0, L_1.
• Cost = (N_0/N) Impurity(L_0) + (N_1/N) Impurity(L_1) − Impurity(L)
Impurity (a split-cost sketch follows this section):
• Gini Impurity: 1 − Σ_y p(y)²
• Entropy: −Σ_y p(y) log p(y) (binary → base 2)
• Self Classification Error: 1 − max_y p(y)
Stopping Condition: Too few datapoints (use an arbitrary threshold) or too small an improvement of the cost (use an arbitrary threshold, or significance testing).
Random Trees: Bootstrap and aggregate (bagging). With N samples, generate K datasets (bootstraps) of size N by sampling with replacement. Train a new decision tree on each bootstrap. Predict by aggregating the results of the classifiers (taking the average or the modal result).
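A small sketch of the impurity measures and split cost above; the label arrays are invented toy data.

import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum_y p(y)^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy: -sum_y p(y) log2 p(y) (base 2 for binary labels)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def split_cost(parent, left, right, impurity=gini):
    # Cost = (N0/N) Imp(L0) + (N1/N) Imp(L1) - Imp(L); more negative = better split.
    n = len(parent)
    return (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right) - impurity(parent)

parent = np.array([0, 0, 0, 1, 1, 1])
print(split_cost(parent, parent[:3], parent[3:]))   # a perfect split gives -0.5 with Gini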
Error
Classification error: (1/N) Σ_{i=1}^N 1(y_i ≠ h(x_i))
A low training error can signify overfitting. Solution: partition the data into training/validation/test sets.
• Repeat: train on the training set, tune on the validation set.
• Test the model with the best validation error on the test set.
Accuracy: Opposite of error. A confusion matrix shows the counts of true label vs predicted label. Normalise it to show probabilities.

Support Vector Machines
Binary classification using linear decision boundaries (a separating hyperplane): h(x) = sgn(w^T x + b)
A good hyperplane maximises the minimal distance (margin) to any training example. The hyperplane is supported by the vectors which lie exactly on the margin, and it is entirely determined by these support vectors (removing other points does not affect the hyperplane). Let y_i = ±1 denote the class.
Margins: w^T x + b = ±1
The distance from the hyperplane to the margin is 1/∥w∥, which is not convex to maximise. max 1/∥w∥ = min ∥w∥ = min ½∥w∥², which is convex.
Linearly Separable Data:
min_{w,b} ½∥w∥² s.t. y_i(w^T x_i − b) ≥ 1 for i = 1...N
Non-linearly Separable Data: Introduce slack variables ξ_i:
min_{w,b,ξ} ½∥w∥² + C Σ_i ξ_i s.t. y_i(w^T x_i − b) ≥ 1 − ξ_i, ξ_i ≥ 0 for i = 1...N
Equivalent to min_{w,b} ½∥w∥² + C Σ_i max(1 − y_i(w^T x_i − b), 0)
C is a hyperparameter; larger C means a stricter margin.
Kernel Trick: Apply the Lagrangian to obtain a new (dual) formulation:
max_λ −½ Σ_{i,j} λ_i λ_j y_i y_j x_i^T x_j + Σ_i λ_i
s.t. Σ_i λ_i y_i = 0, 0 ≤ λ_i ≤ C for i = 1...N
The optimal λ provides the optimal solution:
w = Σ_i λ_i y_i x_i, b = w^T x_i − y_i (for any i where λ_i > 0)
Usually decision boundaries are not linear, so apply a non-linear feature transformation ϕ. Then x_i^T x_j becomes ϕ(x_i)^T ϕ(x_j).
Kernel: k: R^D × R^D → R. k is positive definite if for any finite collection of vectors x_1, ..., x_N and ∀a ∈ R^N it holds that a^T K a ≥ 0, with K ∈ R^{N×N}, K_{ij} = k(x_i, x_j).
Representer Theorem: k is positive definite ⇐⇒ k(x_i, x_j) = ϕ(x_i)^T ϕ(x_j) in some feature space ϕ(x).
Replace ϕ(x_i)^T ϕ(x_j) in the algorithm with k(x_i, x_j), which is cheaper to compute. It is not necessary to know ϕ (it can also not be determined from k). The kernelized SVM:
max_λ −½ Σ_{i,j} λ_i λ_j y_i y_j k(x_i, x_j) + Σ_i λ_i
s.t. Σ_i λ_i y_i = 0, 0 ≤ λ_i ≤ C for i = 1...N
Recovering w = Σ_i λ_i y_i ϕ(x_i) would need ϕ, but w is not needed:
b = Σ_j λ_j y_j k(x_j, x_i) − y_i (for any i where λ_i > 0)
h(x) = sgn(Σ_i λ_i y_i k(x_i, x) − b)
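A sketch of the kernelized decision function h(x) = sgn(Σ_i λ_i y_i k(x_i, x) − b) with a Gaussian (RBF) kernel; the "support vectors", multipliers λ and offset b below are made-up stand-ins for the output of a dual solver.

import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Gaussian kernel k(a, b) = exp(-gamma * ||a - b||^2)
    return np.exp(-gamma * np.sum((a - b) ** 2))

def svm_predict(x, X_sv, y_sv, lam, b, kernel=rbf_kernel):
    # h(x) = sgn( sum_i lambda_i y_i k(x_i, x) - b )
    s = sum(l * y * kernel(xi, x) for l, y, xi in zip(lam, y_sv, X_sv))
    return np.sign(s - b)

# Pretend support vectors and dual variables (normally produced by solving the dual problem).
X_sv = np.array([[0.0, 0.0], [1.0, 1.0]])
y_sv = np.array([-1.0, 1.0])
lam = np.array([0.8, 0.8])
b = 0.0
print(svm_predict(np.array([0.9, 0.8]), X_sv, y_sv, lam, b))   # close to (1, 1), so +1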
Regression
Given a dataset D = {(D_i, y_i) | D_i ∈ R^{1×d}, y_i ∈ R, 1 ≤ i ≤ n}, where D_i and y_i are the features and target for observation i:
Approximate the noisy function with f: R^d → R, f(x) = ϕ(x)^T β.
Types of feature functions:
• Affine: ϕ(x) = (1 x)^T ∈ R^{d+1}
• Polynomials of degree k: ϕ(x) ∈ R^{(k+1)^d}
• Sum of k Gaussians: ϕ(x) = (exp(−γ∥x − μ_1∥²), ..., exp(−γ∥x − μ_k∥²))^T ∈ R^k
Goal: min_β ∥y − Xβ∥² with ϕ: R^d → R^p, β ∈ R^p, where
X ∈ R^{n×p} is the design matrix whose ith row is ϕ(D_i^T)^T, and y = (y_1, ..., y_n)^T ∈ R^n.
If X^T X is invertible then (X^T X)^{-1} X^T y is the only minimizer. Otherwise there are infinitely many.
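A sketch of least-squares fitting with the affine feature map ϕ(x) = (1, x)^T; the 1-D data is synthetic, and lstsq stands in for the closed-form (X^T X)^{-1} X^T y.

import numpy as np

rng = np.random.default_rng(0)
d = rng.uniform(-1, 1, size=40)                  # 1-D inputs (the rows D_i)
y = 2.0 * d - 1.0 + 0.1 * rng.normal(size=40)    # noisy targets

# Design matrix: row i is phi(D_i)^T with the affine map phi(x) = (1, x)^T.
X = np.column_stack([np.ones_like(d), d])

# Minimize ||y - X beta||^2; lstsq also handles the case where X^T X is not invertible.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)                                      # approximately (-1, 2)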
Bias-Variance Tradeoff
MSE = (1/|I|) Σ_{i∈I} (y_i − ϕ(D_i^T)^T β)², with I the index set of the held-out (test) points.
EPE: Train k models on different datasets, calculate the MSE of the models on different test sets, and average the MSEs.
Bias: Difference between predicted values and actual values on the training set. High bias is underfitting.
Variance: Difference between trained models (training the model on different datapoints produces a different curve). High variance is overfitting.
k-fold Cross-Validation: Divide the dataset I into k disjoint chunks I_1, ..., I_k. Train the jth model on I \ I_j and compute the MSE of that model on I_j. The CV MSE is the average MSE across the models.
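A compact sketch of k-fold cross-validation for the least-squares model above; the synthetic data and the fold count k = 5 are arbitrary choices.

import numpy as np

rng = np.random.default_rng(1)
d = rng.uniform(-1, 1, size=50)
y = 2.0 * d - 1.0 + 0.1 * rng.normal(size=50)
X = np.column_stack([np.ones_like(d), d])            # affine design matrix

k = 5
folds = np.array_split(rng.permutation(len(y)), k)   # k disjoint chunks I_1, ..., I_k
mses = []
for j in range(k):
    test = folds[j]
    train = np.hstack([folds[m] for m in range(k) if m != j])   # I \ I_j
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    mses.append(np.mean((y[test] - X[test] @ beta) ** 2))       # MSE on I_j
print(np.mean(mses))                                            # CV MSE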
Regularization
X^T X is not invertible ⇐⇒ Σ^T Σ (using the SVD X = UΣV^T) is not invertible ⇐⇒ there are r < p nonzero singular values. Then use the pseudoinverse (Σ^T Σ)^+, defined as
(Σ^T Σ)^+ = diag(1/σ_1², ..., 1/σ_r², 0, ..., 0) ∈ R^{p×p}
and use the solution β^+ = V(Σ^T Σ)^+ Σ^T U^T y.

Feature Selection
If #features d ≫ #observations n → sparse/ridge regression (eliminate some features). Penalize less relevant features using a norm, via a Lagrangian with λ > 0:
min_β ∥y − Xβ∥² + λ∥β∥_p is the penalized objective.
The L0 "norm" counts the non-zero elements but is not a real norm.
Ridge Regression: Choose the squared L2 norm: β = (X^T X + λI)^{-1} X^T y.
Lasso Regression: Choose the L1 norm. Convex but not differentiable → can be solved with coordinate descent:
β_k = (1/∥X_{·k}∥²)(c_k − λ)   if c_k > λ
β_k = (1/∥X_{·k}∥²)(c_k + λ)   if c_k < −λ
β_k = 0                        if −λ ≤ c_k ≤ λ
with c_k = X_{·k}^T (y − Σ_{i≠k} X_{·i} β_i).
Ridge regression is faster; Lasso gives sparser regression vectors.
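A sketch of ridge (closed form) and Lasso (coordinate descent with the soft-thresholding update above); the data, λ and the iteration count are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 5))
beta_true = np.array([2.0, 0.0, -1.0, 0.0, 0.0])           # sparse ground truth
y = X @ beta_true + 0.1 * rng.normal(size=60)
lam = 5.0

# Ridge: closed form (X^T X + lambda I)^{-1} X^T y.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# Lasso: coordinate descent with the soft-thresholding update from the sheet.
beta = np.zeros(5)
for _ in range(100):
    for k in range(5):
        r = y - X @ beta + X[:, k] * beta[k]                # y - sum_{i != k} X_i beta_i
        c_k = X[:, k] @ r
        nk = X[:, k] @ X[:, k]
        if c_k > lam:
            beta[k] = (c_k - lam) / nk
        elif c_k < -lam:
            beta[k] = (c_k + lam) / nk
        else:
            beta[k] = 0.0
print(beta_ridge)
print(beta)                                                 # near-sparse estimate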
Neural Networks
Neural networks are good at capturing signal-like data (images, video, speech, audio, language). Prominent network structures per data type:
• Unstructured → MLPs (other classifiers are usually better)
• Images → CNNs
Neuron: ŷ = ϕ(Σ_{i=0}^D w_i x_i) (bias term: w_0 x_0 = b with x_0 = 1)
A single-layer neural network with non-linear activation units can approximate any continuous function. Use more layers to improve computation speed.
Non-linear activation functions:
• Sigmoid: y = 1/(1 + e^{−x}), y′ = y(1 − y) = y(x)y(−x)
• Hyperbolic Tangent: y = (e^x − e^{−x})/(e^x + e^{−x}), y′ = 1 − y²
• ReLU: y = max(0, x), y′ = 1(x > 0)
• Leaky ReLU: y = max(αx, x), y′ = 1 if x > 0 else α
• Exponential LU: y = x if x ≥ 0 else α(e^x − 1), y′ = 1 if x ≥ 0 else αe^x
α is a hyperparameter, 0 < α < 1.
The traditional functions (sigmoid, tanh) kill gradients, especially for many layers, because backprop multiplies small numbers (use Batch Normalization). The modern functions (the ReLU family) can blow up activations (also normalize). ReLU neurons may die (never activate again) if too many activations go below zero.
Multilayer Perceptrons
Notation:
• # units at layer i: H_i (H_0 is the number of features D)
• w^i_{k,j}: weight from layer i−1 → i, unit k → j
• w^i_j = (w^i_{0,j}, ..., w^i_{H_{i−1},j})^T (weights into unit j of layer i)
• h^0 = (1, x)^T (bias + input vector)
• W^i = (w^i_0 ... w^i_{H_i}) (all weights into layer i)
• Recursively define a^i = W^{iT} h^{i−1} and h^i = ϕ_i(a^i), where ϕ_i acts element-wise.
• a^i sums the weighted inputs for each neuron at layer i.
• h^i contains the output of each neuron at layer i.
• ϕ_i can be a different activation function for each layer i.
• The final layer does not accumulate.
The weights w = (W^1, W^2, ...) are the parameters to train.
ŷ = f(x; w) is the predicted output for x with weights w.
Loss = L(y, ŷ) describes how well ŷ approximates y.
Training loss = (1/N) Σ_{i=1}^N L(y_i, f(x_i; w)) is the average loss across the training set of N observations.
Single output:
• ŷ = final layer
• Squared Error (loss): L = (ŷ − y)², ∂L/∂ŷ = 2(ŷ − y). Used for regression.
Multiple outputs (classes):
• Let h = h^L be the final (accumulation) layer with C units.
• ŷ_i = softmax(h)_i = exp(h_i)/Σ_{j=1}^C exp(h_j) (base e). Transforms the output into a probability distribution.
• Cross Entropy: L = −log(ŷ_y), ∂L/∂h_j = ŷ_j − 1 if j = y, else ŷ_j.
  You only need to compute the softmax at position y (except for the derivative). Used for classification.
Then train the model via gradient descent: w ← w − η∇_w L_train. The training loss gradient is the average of the sample loss gradients. NB: ∇_w L_train is a vector of matrices of partial derivatives of the loss with respect to the weights.

Backpropagation
Let N = # layers. Helper variables:
• B^{(k)}_i = ∂L/∂h^{(k)}_i = Σ_j B^{(k+1)}_j w^{(k+1)}_{i,j} for 1 ≤ k < N
• B′^{(k)}_i = ∂L/∂a^{(k)}_i = ϕ′_k(a^{(k)}_i) B^{(k)}_i for 1 ≤ k < N
• B^{(N)}_i = ∂L/∂ŷ_i (no activation at layer N, so there is no B′^{(N)}_i)
Then the derivatives of the loss with respect to the weights are:
• ∂L/∂w^{(k)}_{i,j} = B′^{(k)}_j h^{(k−1)}_i for 1 ≤ k < N
• ∂L/∂w^{(N)}_{i,j} = B^{(N)}_j h^{(N−1)}_i
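A tiny numerical sketch of these recursions for one hidden layer, a sigmoid activation and squared-error loss; the weights and the single training example are random stand-ins, and the analytic gradient is checked against a finite difference.

import numpy as np

rng = np.random.default_rng(3)
x, y = rng.normal(size=2), 0.7                   # one training example (made up)
h0 = np.concatenate(([1.0], x))                  # h^0 = (1, x)^T
W1 = rng.normal(size=(3, 4))                     # weights into the hidden layer (4 units)
W2 = rng.normal(size=4)                          # weights into the single output unit

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(W1, W2):
    a1 = W1.T @ h0                               # a^1 = W^{1T} h^0
    h1 = sigmoid(a1)                             # h^1 = phi(a^1)
    yhat = W2 @ h1                               # final layer: no activation
    return a1, h1, yhat

a1, h1, yhat = forward(W1, W2)
B2 = 2.0 * (yhat - y)                            # B^{(2)} = dL/dyhat for L = (yhat - y)^2
grad_W2 = B2 * h1                                # dL/dW2_j = B^{(2)} h^1_j
B1 = B2 * W2                                     # B^{(1)}_j = sum over output units
Bp1 = sigmoid(a1) * (1 - sigmoid(a1)) * B1       # B'^{(1)} = phi'(a^1) * B^{(1)}
grad_W1 = np.outer(h0, Bp1)                      # dL/dW1_{i,j} = B'^{(1)}_j h^0_i

# Finite-difference check of one weight.
eps = 1e-6
W1p = W1.copy(); W1p[1, 2] += eps
num = ((forward(W1p, W2)[2] - y) ** 2 - (yhat - y) ** 2) / eps
print(np.isclose(grad_W1[1, 2], num, atol=1e-4))  # True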
Training Techniques
Data Normalization: For x_j, the sample values of feature j:
• Zero-mean-unit-variance: x_j ← (x_j − mean(x_j)) / std(x_j)
• Min-max ([0, 1]): x_j ← (x_j − min(x_j)) / (max(x_j) − min(x_j)) (subtract 0.5 for [−0.5, 0.5])
Stochastic Gradient Descent: Partition the data randomly, update with the loss on every mini-batch.
Momentum: Moving average of the gradient to stabilise updates.
Adaptive Stepsizes: Multiply the gradient by a positive semidefinite matrix to get independent stepsizes for each neuron.
Early Stopping: Stop when the validation error starts increasing (or does not achieve a new minimum for K_fail epochs).
Regularization: Add a penalty term λ∥w∥²_2 or λ∥w∥_1 to L_train.
Dropout: Randomly remove inputs or units during training.
Batch Normalization: Normalize the output of the neurons.
Convolutional Neural Networks
Include inductive bias in the network structure to specialise. Convolutions specialise in the local correlations present in signal data.
w: K × K convolutional kernel (not the kernel trick), K is odd
x: image (in matrix form)
y: convolutional layer (i.e. the processed image)
y[i, j] = Σ_{k=−⌊K/2⌋}^{⌊K/2⌋} Σ_{l=−⌊K/2⌋}^{⌊K/2⌋} x[i + k, j + l] w[k, l]
Basically: elementwise (Hadamard) multiplication of w with the K × K patch of x centered at (i, j), then sum all the elements.
Undefined if x[i ± ⌊K/2⌋, j] or x[i, j ± ⌊K/2⌋] do not exist; this removes ⌊K/2⌋ from each side of x. Solution: padding.
Zero Padding: Pad the input with ⌊K/2⌋ zeros on each side.
Convolutional Layers: If the image is a 3D feature map (e.g. the 3rd dimension = colour channels), then the kernel is a K × K × C tensor. Still the same: elementwise multiplication with the 'overlapping' elements and sum (i.e. a 2D convolution on each of the C layers, then add the results). This creates a layer with a single channel. Use multiple kernels and stack the resulting layers → a feature map with multiple channels. Optionally: add a bias term b after each convolution.
Can apply a non-linearity ϕ elementwise and then apply a new convolutional layer (like the layers in an MLP). Higher layers in a CNN represent more abstract features. Without padding, information is compressed. Use padding, then pooling to reduce the number of features.
Pooling: Sliding window like a convolution, but with an aggregation function (e.g. max, average) instead of multiplication/summation. Applied independently for each channel.
• Window size K_p (can be even!)
• Stride/hop size S_p (a convolution slides one position at a time)
Flattening: Convert a W × H × C feature map to a vector of length W · H · C so that MLP learning can be used.
Toeplitz Matrix: A convolution can be described by a matrix, e.g. for w ∈ R^{2×2} and x ∈ R^{3×3}, with x flattened to a vector of length 9:
y = ( w1 w2 0 w3 w4 0 0 0 0
      0 w1 w2 0 w3 w4 0 0 0
      0 0 0 w1 w2 0 w3 w4 0
      0 0 0 0 w1 w2 0 w3 w4 ) x, so y ∈ R^4 (reshaped, y ∈ R^{2×2})
This allows us to use backprop. Never update the 0s, and the gradients for shared weights are the sum of the gradients from the nodes where they are used. Still, it is easier to flatten instead.
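A direct sketch of the 2-D convolution formula with zero padding; the 3×3 image and the kernel values are arbitrary.

import numpy as np

def conv2d(x, w):
    # y[i,j] = sum_{k,l} x[i+k, j+l] * w[k,l], with zero padding of floor(K/2) on each side.
    K = w.shape[0]                     # K x K kernel, K odd
    r = K // 2
    xp = np.pad(x, r)                  # zero padding keeps the output the same size as x
    y = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            patch = xp[i:i + K, j:j + K]       # K x K patch centered at (i, j) of x
            y[i, j] = np.sum(patch * w)        # Hadamard product, then sum
    return y

x = np.arange(9.0).reshape(3, 3)
w = np.array([[0.0, 1.0, 0.0], [1.0, -4.0, 1.0], [0.0, 1.0, 0.0]])   # a Laplacian-like kernel
print(conv2d(x, w))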
Recommender and PCA
Both the recommender problem and PCA use the SVD.

Recommending (rank-r matrix factorization problem)
Data matrix D ∈ R^{n×d} and rank r < min(n, d), with the objective:
min_{X,Y} ∥D − YX^T∥² s.t. X ∈ R^{d×r}, Y ∈ R^{n×r} (is nonconvex!)
Solution: use the truncated SVD, D = UΣV^T, then YX^T = U_{·R} Σ_{RR} V_{·R}^T with R = {1, ..., r}.
To recommend: replace the missing values in D with the mean rating μ, compute the truncated SVD, and use YX^T as the recommendations. The approximation to the mean rating is poor for sparse entries. Solution: minimise only on the observed entries, with O the set of observed entries:
min_{X,Y} ∥1_O ∘ (D − YX^T)∥² = Σ_{(i,j)∈O} (D_{ij} − Y_i X_j^T)²
s.t. X ∈ R^{d×r}, Y ∈ R^{n×r} (nonconvex, use coordinate descent)

Dimensionality Reduction (PCA)
We can create features from linear combinations of other features. Such a feature with the highest variance is a good feature.
For existing features F_1, ..., F_d: collect the mean value of each feature, μ_F = (μ_{F_1}, ..., μ_{F_d})^T.
Centered data matrix: C = D − 1μ_F^T (1 is the vector of ones).
Then the direction of maximal variance is given by α:
max_{∥α∥=1} α^T C^T C α / n (n observations)
μ_{F_{d+1}} = μ_F^T α, σ²_{F_{d+1}} = (1/n)∥Cα∥²
For a low-dimensional representation, find r orthogonal directions of largest variance, given by the columns of Z:
max_Z tr(Z^T C^T C Z) s.t. Z ∈ R^{d×r}, Z^T Z = I
= min_Z ∥C^T C − Z (Σ^T Σ)_{RR} Z^T∥² s.t. Z ∈ R^{d×r}, Z^T Z = I
For C = UΣV^T and R = {1, ..., r}, the solution is Z = V_{·R}.
All these objectives are nonconvex (hence why we use the SVD).
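A brief PCA sketch following the centering-and-SVD recipe above; the random data and the choice of r = 2 components are illustrative.

import numpy as np

rng = np.random.default_rng(4)
D = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # n x d data with correlated features

mu = D.mean(axis=0)                    # mu_F: mean of each feature
C = D - mu                             # centered data matrix C = D - 1 mu_F^T

U, S, Vt = np.linalg.svd(C, full_matrices=False)   # C = U Sigma V^T
r = 2
Z = Vt[:r].T                           # d x r: the r directions of largest variance (Z = V_{.R})
scores = C @ Z                         # low-dimensional representation of the data
var = (S[:r] ** 2) / len(D)            # variance captured by each direction, ||C alpha||^2 / n
print(Z.shape, scores.shape, var)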
K-Means Clustering
n observations, r clusters, data D. Let P_n denote the set of all possible disjoint partitions of {1, ..., n}, and let C_s denote a cluster (a set of point indices). Minimize the distance between points within clusters:
min_{{C_1,...,C_r}∈P_n} Σ_{s=1}^r (1/|C_s|) Σ_{j,i∈C_s} ∥D_j − D_i∥²
This is a discrete problem, so the gradient is undefined. Reformulate:
min Σ_{s=1}^r Σ_{i∈C_s} ∥D_i^T − X_{·s}∥² s.t. X_{·s} = (1/|C_s|) Σ_{i∈C_s} D_i^T, {C_1, ..., C_r} ∈ P_n
i.e. minimizing the distance to the centroid (X_{·s} is the centroid of C_s).
Lloyd's Algorithm: Block coordinate descent (sketched in the code after this section).
• Initialize random centroid locations (in a reasonable range).
• Repeat: assign each point to its closest centroid; set each centroid ← the average location of its points.
• Until convergence (which is guaranteed).
Reformulate again: Let Y ∈ {0,1}^{n×r} where Y_{is} = 1 ⇐⇒ i ∈ C_s. Then |Y_i| = 1 ∀ i ∈ {1, ..., n}.
1^{n×r} is the set of all binary matrices partitioning n points into r sets: 1^{n×r} = {Y ∈ {0,1}^{n×r} : |Y_i| = 1 for i ∈ {1, ..., n}}.
Then the sth centroid is X_{·s} = (1/|Y_{·s}|) D^T Y_{·s}, and X gathers all centroids column-wise: X = D^T Y(Y^T Y)^{-1}. The following k-means objectives are equivalent and nonconvex:
min_Y ∥D − YX^T∥² s.t. Y ∈ 1^{n×r}, X = D^T Y(Y^T Y)^{-1}
min_{X,Y} ∥D − YX^T∥² s.t. Y ∈ 1^{n×r}, X ∈ R^{d×r}
min_{X,Y} Σ_{s=1}^r Σ_{i=1}^n Y_{is} ∥D_i^T − X_{·s}∥² s.t. Y ∈ 1^{n×r}, X ∈ R^{d×r}
So k-means is an instance of the recommender problem, but we cannot use the SVD.
Given the cluster assignment Y, the objective min_X ∥D − YX^T∥² s.t. X ∈ R^{d×r} is convex, with solution X = D^T Y(Y^T Y)^{-1}.
Given the centroid locations X, the objective min_Y ∥D − YX^T∥² s.t. Y ∈ 1^{n×r} is nonconvex because 1^{n×r} is not a convex set, but it has a solution: assign every point to its closest centroid.
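A compact sketch of Lloyd's algorithm; the random 2-D blobs, r = 2 clusters and fixed iteration count are all illustrative choices.

import numpy as np

rng = np.random.default_rng(5)
D = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])   # two blobs
r = 2

centroids = D[rng.choice(len(D), r, replace=False)]      # initialize at random data points
for _ in range(20):
    # Assignment step: each point goes to its closest centroid.
    dists = np.linalg.norm(D[:, None, :] - centroids[None, :, :], axis=2)   # n x r distances
    assign = dists.argmin(axis=1)
    # Update step: each centroid becomes the average location of its points.
    new = []
    for s in range(r):
        pts = D[assign == s]
        new.append(pts.mean(axis=0) if len(pts) else centroids[s])   # keep old centroid if empty
    centroids = np.array(new)

print(centroids)          # with a reasonable initialization, roughly the two blob centers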
Nonconvex Clustering
Feature-transform the data by using the kernel trick for clustering. Reformulate the k-means objective:
max_Y tr(Z^T DD^T Z) s.t. Z = Y(Y^T Y)^{-1/2}, Y ∈ 1^{n×r}
where DD^T is defined using inner products. Using a feature transform ϕ and the kernel matrix K = ϕ(D)ϕ(D)^T, equivalent objectives are:
min_Y ∥ϕ(D) − YX^T∥² s.t. X = ϕ(D)^T Y(Y^T Y)^{-1}, Y ∈ 1^{n×r}
max_Y tr(Z^T K Z) s.t. Z = Y(Y^T Y)^{-1/2}, Y ∈ 1^{n×r}
Eigendecomposition: K = K^T ∈ R^{n×n} =⇒ K = VΛV^T, with V orthogonal and Λ = diag(λ_1, ..., λ_n), |λ_1| ≥ ... ≥ |λ_n|.
Since kernel matrices are positive semidefinite, we have the symmetric decomposition K = AA^T with A = VΛ^{1/2}, which plays the role of ϕ(D). Substitute A into the objective above and run k-means; ϕ is not needed. Drawback: not robust, and a full eigendecomposition is needed.

Spectral Clustering
Consider a symmetric similarity matrix W describing the similarity of points according to some measure (distance, knn or other). Use W as a weighted adjacency matrix to make a graph (the graph must be connected!). Clusters designate maximal similarity, which is equivalent to a minimum cut:
max_Y Sim = tr(Y^T W Y (Y^T Y)^{-1}) s.t. Y ∈ 1^{n×r}
min_Y Cut = tr((1 − Y)^T W Y (Y^T Y)^{-1}) s.t. Y ∈ 1^{n×r}
Laplacian: L = I_W − W, with I_W the degree matrix of W.
The indicator vectors of connected components are eigenvectors of L.
Spectral Clustering: uses only the r eigenvectors after the first one, which is faster than a full eigendecomposition.
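A rough spectral-clustering sketch: build a similarity graph, form the Laplacian L = I_W − W, embed with the eigenvectors after the first one, and run a few k-means iterations on the embedding. The Gaussian similarity, its width and the toy data are all assumptions.

import numpy as np

rng = np.random.default_rng(6)
D = np.vstack([rng.normal(0, 0.2, (15, 2)), rng.normal(2, 0.2, (15, 2))])   # two groups
r = 2

# Similarity matrix W (Gaussian similarity), degree matrix I_W, Laplacian L = I_W - W.
sq = np.sum((D[:, None, :] - D[None, :, :]) ** 2, axis=2)
W = np.exp(-sq / 0.5)
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W

# Embedding: the r eigenvectors after the first one (smallest eigenvalues of L).
vals, vecs = np.linalg.eigh(L)
Z = vecs[:, 1:r + 1]

# Cluster the embedded points with a few iterations of Lloyd's algorithm.
cent = Z[rng.choice(len(Z), r, replace=False)]
for _ in range(20):
    assign = np.linalg.norm(Z[:, None] - cent[None], axis=2).argmin(axis=1)
    cent = np.array([Z[assign == s].mean(axis=0) if np.any(assign == s) else cent[s]
                     for s in range(r)])
print(assign)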
Data Fusion

True Parameters from Noisy Observations
Unknown vector of parameters θ. Noisy observations {y̌_ℓ}, ℓ ∈ {1, ..., R}, from R sensors. The parameters are random: Θ ∼ p(θ). The sensors are also random, but depend on the parameters they measure, so we consider Y_ℓ | θ ∼ p(y_ℓ | θ).
Assumption: p(y_1, ..., y_R | θ) = Π_{ℓ=1}^R p(y_ℓ | θ).
Idea: the true parameters are the most likely ones given the observations from the sensors. Using an approximation p̃ ≈ p,
θ̂ = E[θ | y̌_1, ..., y̌_R] = ∫ θ p̃(θ | y̌_1, ..., y̌_R) dθ
Monte Carlo: Approximate the integral by taking the experimental average over many samples from p(θ | y̌_1, ..., y̌_R).
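One way to sketch the Monte Carlo idea is self-normalized importance sampling with the prior as proposal: draw θ samples from p(θ), weight them by the product of the sensor likelihoods, and average. The Gaussian prior, Gaussian sensor noise and the observed readings below are all invented for illustration.

import numpy as np

rng = np.random.default_rng(7)

# Assumed model: scalar theta ~ N(0, 2^2); each sensor observes y_l = theta + N(0, 0.5^2).
y_obs = np.array([1.1, 0.9, 1.3])          # made-up readings from R = 3 sensors
sigma_prior, sigma_sensor = 2.0, 0.5

theta = rng.normal(0.0, sigma_prior, size=100_000)               # samples from the prior p(theta)

# Joint likelihood p(y_1,...,y_R | theta) = prod_l p(y_l | theta) (independence assumption).
log_w = np.sum(-0.5 * ((y_obs[None, :] - theta[:, None]) / sigma_sensor) ** 2, axis=1)
w = np.exp(log_w - log_w.max())                                  # unnormalized weights
theta_hat = np.sum(w * theta) / np.sum(w)                        # approximates E[theta | y]
print(theta_hat)                                                 # close to the analytic posterior mean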

State Tracking (states are hidden)
States {x_n}, time instants {t_n}, observations (one source) {y̌_n}.
Markov Assumption: p(x_n | x_{0:n−1}) = p(x_n | x_{n−1}) and p(x_n | x_{n−1}, y_{0:n−1}) = p(x_n | x_{n−1}). The state depends only on the directly previous state, and does not depend on previous observations.
White Observation Noise: p(y_n | x_n, y_{0:n−1}) = p(y_n | x_n). Observations do not depend on previous ones.
X_{n+1} = f_n(X_n) + W_n, W_n ∼ p(w_n)
Y_n = h_n(X_n) + V_n, V_n ∼ p(v_n)
Additive (process, observation) noise.
Chapman-Kolmogorov: p(x_n | y̌_{0:n−1}) = ∫ p(x_n | x_{n−1}) p(x_{n−1} | y̌_{0:n−1}) dx_{n−1}
Bayes Rule: p(x_n | y̌_{0:n}) ∝ p(y̌_n | x_n) p(x_n | y̌_{0:n−1})
MMSE Estimate: x̂_n = ∫ x_n p(x_n | y̌_{0:n}) dx_n
A: predicted posterior, B: state transition distribution, C: previous posterior, D: updated posterior, E: likelihood function.
p(x_0) → Repeat[C; B; A; (y̌_n + E); D] → x̂_n
Use p(x_0) as the previous posterior for the first loop.

good luck!
Eduardo Costa Martins