2IIG0 Cheat Sheet 1

The document covers various concepts in linear algebra and machine learning, including norms, classifiers, learning algorithms, and optimization techniques. It discusses methods such as k-nearest neighbors, decision trees, and neural networks, along with their mathematical foundations and performance metrics like bias, variance, and error rates. Additionally, it highlights the importance of regularization and feature selection in model training and evaluation.


Linear Algebra

Norm ⇐⇒ ∥v + w∥ ≤ ∥v∥ + ∥w∥ (Triangle inequality), ∥αv∥ = |α|∥v∥ (Homogeneity), ∥v∥ = 0 ⇐⇒ v = 0
v^T w = v · w = cos(v, w)∥v∥∥w∥ (Property of the Euclidean (L2) norm)
v · w/∥w∥ (Projection of v onto w)
∥v∥_p = (Σ_{i=1}^d |v_i|^p)^{1/p} (Lp norm for vectors)
∥A∥_p = (Σ_{i=1}^n Σ_{j=1}^m |A_{ji}|^p)^{1/p} (Lp norm for matrices)
Lp norms with p < 1 violate the triangle inequality (they are not norms).
∥A∥_op = max_{∥v∥=1} ∥Av∥ (Operator norm)
tr(ABCD) = tr(BCDA) = tr(CDAB) = ... (Cycling property)
L2 (= Frobenius for matrices) norm properties:
• ∥v∥² = v^T v, ∥A∥² = tr(A^T A)
• d/dX ∥AX + B∥² = 2A^T (AX + B)
SVD: For every matrix X ∈ R^{n×p} there exist orthogonal matrices U ∈ R^{n×n}, V ∈ R^{p×p} and a matrix Σ ∈ R^{n×p} such that
• X = UΣV^T
• U^T U = UU^T = I_n, V^T V = VV^T = I_p
• Σ is diagonal, containing the singular values
• If all singular values are > 0 then X^{-1} = VΣ^{-1}U^T
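A minimal NumPy sketch of the SVD identities above; the example matrix X and its size are arbitrary choices for illustration, not values from the course.

import numpy as np

# Sketch of the SVD identities; X is an arbitrary example matrix in R^{n x p}.
X = np.array([[3.0, 1.0], [2.0, 4.0], [0.0, 1.0]])          # n = 3, p = 2

U, s, Vt = np.linalg.svd(X, full_matrices=True)             # X = U Sigma V^T
Sigma = np.zeros(X.shape)
Sigma[:len(s), :len(s)] = np.diag(s)                        # Sigma is n x p, singular values on the diagonal

print(np.allclose(X, U @ Sigma @ Vt))                       # reconstruction holds
print(np.allclose(U.T @ U, np.eye(3)), np.allclose(Vt @ Vt.T, np.eye(2)))   # U and V are orthogonal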

Optimization

Analytical Optimisation
FONC & SONC (unconstrained minimum, twice differentiable):
∇f(x) = 0, and ∇²f(x) is positive semidefinite (a matrix A is positive semidefinite if x^T A x ≥ 0 ∀x ∈ R^d)
Lagrangian: Adding constraints c_i(x) = 0 and g_k(x) ≥ 0.
C = {x ∈ R^d | constraints hold on x} (Feasible set)
L = f(x) − Σ_{i=1}^m λ_i c_i(x) − Σ_{k=1}^l μ_k g_k(x) (Lagrangian)
μ_i ≥ 0 for a ≥ constraint, otherwise μ_i ≤ 0. Sometimes (e.g. for convex problems) the minimizer of L will minimize f. To minimize L, solve the system of d + m + l equations:
∇f(x) = Σ_{i=1}^m λ_i ∇c_i(x) + Σ_{k=1}^l μ_k ∇g_k(x)
c_i(x) = 0 ∀i: 1 ≤ i ≤ m
g_k(x) ≥ 0 ∀k: 1 ≤ k ≤ l
Numerical Optimisation
Coordinate Descent: Minimise over each input variable in turn:
x_i^{(t+1)} = argmin_{x_i} f(x_1^{(t)}, ..., x_i, ..., x_d^{(t)}), 1 ≤ i ≤ d
Gradient Descent: x_{t+1} = x_t − η∇f(x_t), where η is the step size.
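A short sketch of plain gradient descent on a convex quadratic; the matrix A, vector b, step size and iteration count are illustrative assumptions.

import numpy as np

# Gradient descent on f(x) = 0.5 x^T A x - b^T x, whose gradient is A x - b.
A = np.array([[3.0, 0.5], [0.5, 1.0]])      # positive definite, so f is convex
b = np.array([1.0, -2.0])

x = np.zeros(2)
eta = 0.1                                    # step size
for _ in range(200):
    x = x - eta * (A @ x - b)                # x_{t+1} = x_t - eta * grad f(x_t)

print(x, np.linalg.solve(A, b))              # agrees with the analytic minimizer A^{-1} b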
Convex Optimisation
Numerical optimisation gives a local minimizer. For a convex function, every local minimizer is a global minimizer.
Convex Set: X is convex ⇐⇒ ∀x, y ∈ X, α ∈ [0, 1]: αx + (1 − α)y ∈ X (the line segment lies entirely in X)
Convex Function: f: R^d → R is convex ⇐⇒ f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y) ∀x, y ∈ R^d, α ∈ [0, 1] (the function lies below the line)
Convex Properties:
• Nonnegative weighted sums of convex functions are convex.
• g: R^d → R^k, g(x) = Ax + b ∧ f: R^k → R convex =⇒ f ∘ g is convex.
• All norms are convex.
• Linear functions (and maps) are convex and concave.
Classification
Feature vector x ∈ R^D, class/label value y, N training samples (x_1, y_1), ..., (x_N, y_N).
Assumption: samples are independent and identically distributed from an unknown distribution p*(X, Y), (x_i, y_i) ∼ p*.
Classifier: A function h that maps x to a predicted label ŷ.
Learning Algorithm: Takes a dataset and returns a classifier h.
True Classification Error: P[Y ≠ h(X)]
A classifier h* that minimizes this error is Bayes optimal. This is only possible if the true distribution is known.

K-Nearest Neighbours
Pick the modal class among the k nearest neighbours (using the Euclidean or another distance metric). Taking the modal class is stable and reduces the effect of data noise. Use an odd k for binary classification (to ensure a majority exists).
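A minimal k-NN classifier sketch in NumPy; the toy training points and the choice k = 3 are made up for illustration.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Modal class among the k nearest training points (Euclidean distance).
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [0.2, 0.1]])
y_train = np.array([0, 0, 1, 1, 0])
print(knn_predict(X_train, y_train, np.array([0.95, 0.9]), k=3))   # -> 1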
   
Naive Bayes
h*(x) = argmax_y p*(y|x). Idea: learn p*(y|x).
Bayes Rule: p*(y|x) = p*(x|y)p*(y)/p*(x) ∝ p*(x|y)p*(y)
argmax_y p*(y|x) = argmax_y p*(x|y)p*(y)
Estimate p*(y) from the class distribution of the training set.
Assumption (1): conditional independence of the D features: p*(x|y) ≈ Π_{d=1}^D p*(x_d|y), giving us
h_NB(x) = argmax_y p*(y) Π_{d=1}^D p*(x_d|y)

Decision Trees
Possible decisions:
• Continuous: sort, use a threshold between consecutive data points.
• Discrete: partition the set into 2 (one vs all, or an arbitrary partition).
Cost:
• N is the number of data points at the old leaf L.
• N_0, N_1 are the numbers of data points at the new leaves L_0, L_1.
• Cost = (N_0/N) Impurity(L_0) + (N_1/N) Impurity(L_1) − Impurity(L)
Impurity (a split-cost sketch follows this section):
• Gini Impurity: 1 − Σ_y p(y)²
• Entropy: −Σ_y p(y) log p(y) (binary → base 2)
• Self Classification Error: 1 − max_y p(y)
Stopping Condition: Too few datapoints (use an arbitrary threshold) or too small an improvement of the cost (use an arbitrary threshold, or significance testing).
Random Trees: Bootstrap and aggregate (bagging). With N samples, generate K datasets (bootstraps) of size N by sampling with replacement. Train a new decision tree on each bootstrap. Predict by aggregating the results of the classifiers (taking the average or the modal result).
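A small sketch of the impurity measures and split cost above; the label arrays are invented toy data.

import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum_y p(y)^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy: -sum_y p(y) log2 p(y) (base 2 for binary labels)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def split_cost(parent, left, right, impurity=gini):
    # Cost = (N0/N) Imp(L0) + (N1/N) Imp(L1) - Imp(L); more negative = better split.
    n = len(parent)
    return (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right) - impurity(parent)

parent = np.array([0, 0, 0, 1, 1, 1])
print(split_cost(parent, parent[:3], parent[3:]))   # a perfect split gives -0.5 with Gini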
Error
Classification error: (1/N) Σ_{i=1}^N 1(y_i ≠ h(x_i))
A low training error can signify overfitting. Solution: partition the data into training/validation/test sets.
• Repeat: train on the training set, tune on the validation set.
• Test the model with the best validation error on the test set.
Accuracy: Opposite of error. A confusion matrix shows the counts of true label vs predicted label. Normalise it to show probabilities.

Support Vector Machines
Binary classification using linear decision boundaries (a separating hyperplane): h(x) = sgn(w^T x + b)
A good hyperplane maximises the minimal distance (margin) to any training example. The hyperplane is supported by the vectors which lie exactly on the margin, and it is entirely determined by these support vectors (removing other points does not affect the hyperplane). Let y_i = ±1 denote the class.
Margins: w^T x + b = ±1
The distance from the hyperplane to the margin is 1/∥w∥, which is not convex to maximise. max 1/∥w∥ = min ∥w∥ = min ½∥w∥², which is convex.
Linearly Separable Data:
min_{w,b} ½∥w∥² s.t. y_i(w^T x_i − b) ≥ 1 for i = 1...N
Non-linearly Separable Data: Introduce slack variables ξ_i:
min_{w,b,ξ} ½∥w∥² + C Σ_i ξ_i s.t. y_i(w^T x_i − b) ≥ 1 − ξ_i, ξ_i ≥ 0 for i = 1...N
Equivalent to min_{w,b} ½∥w∥² + C Σ_i max(1 − y_i(w^T x_i − b), 0)
C is a hyperparameter; larger C means a stricter margin.
Kernel Trick: Apply the Lagrangian to obtain a new (dual) formulation:
max_λ −½ Σ_{i,j} λ_i λ_j y_i y_j x_i^T x_j + Σ_i λ_i
s.t. Σ_i λ_i y_i = 0, 0 ≤ λ_i ≤ C for i = 1...N
The optimal λ provides the optimal solution:
w = Σ_i λ_i y_i x_i, b = w^T x_i − y_i (for any i where λ_i > 0)
Usually decision boundaries are not linear, so apply a non-linear feature transformation ϕ. Then x_i^T x_j becomes ϕ(x_i)^T ϕ(x_j).
Kernel: k: R^D × R^D → R. k is positive definite if for any finite collection of vectors x_1, ..., x_N and ∀a ∈ R^N it holds that a^T K a ≥ 0, with K ∈ R^{N×N}, K_{ij} = k(x_i, x_j).
Representer Theorem: k is positive definite ⇐⇒ k(x_i, x_j) = ϕ(x_i)^T ϕ(x_j) in some feature space ϕ(x).
Replace ϕ(x_i)^T ϕ(x_j) in the algorithm with k(x_i, x_j), which is cheaper to compute. It is not necessary to know ϕ (it can also not be determined from k). The kernelized SVM:
max_λ −½ Σ_{i,j} λ_i λ_j y_i y_j k(x_i, x_j) + Σ_i λ_i
s.t. Σ_i λ_i y_i = 0, 0 ≤ λ_i ≤ C for i = 1...N
Recovering w = Σ_i λ_i y_i ϕ(x_i) would need ϕ, but w is not needed:
b = Σ_j λ_j y_j k(x_j, x_i) − y_i (for any i where λ_i > 0)
h(x) = sgn(Σ_i λ_i y_i k(x_i, x) − b)
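A sketch of the kernelized decision function h(x) = sgn(Σ_i λ_i y_i k(x_i, x) − b) with a Gaussian (RBF) kernel; the "support vectors", multipliers λ and offset b below are made-up stand-ins for the output of a dual solver.

import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Gaussian kernel k(a, b) = exp(-gamma * ||a - b||^2)
    return np.exp(-gamma * np.sum((a - b) ** 2))

def svm_predict(x, X_sv, y_sv, lam, b, kernel=rbf_kernel):
    # h(x) = sgn( sum_i lambda_i y_i k(x_i, x) - b )
    s = sum(l * y * kernel(xi, x) for l, y, xi in zip(lam, y_sv, X_sv))
    return np.sign(s - b)

# Pretend support vectors and dual variables (normally produced by solving the dual problem).
X_sv = np.array([[0.0, 0.0], [1.0, 1.0]])
y_sv = np.array([-1.0, 1.0])
lam = np.array([0.8, 0.8])
b = 0.0
print(svm_predict(np.array([0.9, 0.8]), X_sv, y_sv, lam, b))   # close to (1, 1), so +1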
Regression
Given a dataset D = {(D_i, y_i) | D_i ∈ R^{1×d}, y_i ∈ R, 1 ≤ i ≤ n}, where D_i and y_i are the features and target for observation i:
Approximate the noisy function with f: R^d → R, f(x) = ϕ(x)^T β.
Types of feature functions:
• Affine: ϕ(x) = (1 x)^T ∈ R^{d+1}
• Polynomials of degree k: ϕ(x) ∈ R^{(k+1)^d}
• Sum of k Gaussians: ϕ(x) = (exp(−γ∥x − μ_1∥²), ..., exp(−γ∥x − μ_k∥²))^T ∈ R^k
Goal: min_β ∥y − Xβ∥² with ϕ: R^d → R^p, β ∈ R^p, where
X ∈ R^{n×p} is the design matrix whose ith row is ϕ(D_i^T)^T, and y = (y_1, ..., y_n)^T ∈ R^n.
If X^T X is invertible then (X^T X)^{-1} X^T y is the only minimizer. Otherwise there are infinitely many.
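A sketch of least-squares fitting with the affine feature map ϕ(x) = (1, x)^T; the 1-D data is synthetic, and lstsq stands in for the closed-form (X^T X)^{-1} X^T y.

import numpy as np

rng = np.random.default_rng(0)
d = rng.uniform(-1, 1, size=40)                  # 1-D inputs (the rows D_i)
y = 2.0 * d - 1.0 + 0.1 * rng.normal(size=40)    # noisy targets

# Design matrix: row i is phi(D_i)^T with the affine map phi(x) = (1, x)^T.
X = np.column_stack([np.ones_like(d), d])

# Minimize ||y - X beta||^2; lstsq also handles the case where X^T X is not invertible.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)                                      # approximately (-1, 2)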
Bias-Variance Tradeoff
MSE = (1/|I|) Σ_{i∈I} (y_i − ϕ(D_i^T)^T β)², with I the index set of the held-out (test) points.
EPE: Train k models on different datasets, calculate the MSE of the models on different test sets, and average the MSEs.
Bias: Difference between predicted values and actual values on the training set. High bias is underfitting.
Variance: Difference between trained models (training the model on different datapoints produces a different curve). High variance is overfitting.
k-fold Cross-Validation: Divide the dataset I into k disjoint chunks I_1, ..., I_k. Train the jth model on I \ I_j and compute the MSE of that model on I_j. The CV MSE is the average MSE across the models.
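A compact sketch of k-fold cross-validation for the least-squares model above; the synthetic data and the fold count k = 5 are arbitrary choices.

import numpy as np

rng = np.random.default_rng(1)
d = rng.uniform(-1, 1, size=50)
y = 2.0 * d - 1.0 + 0.1 * rng.normal(size=50)
X = np.column_stack([np.ones_like(d), d])            # affine design matrix

k = 5
folds = np.array_split(rng.permutation(len(y)), k)   # k disjoint chunks I_1, ..., I_k
mses = []
for j in range(k):
    test = folds[j]
    train = np.hstack([folds[m] for m in range(k) if m != j])   # I \ I_j
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    mses.append(np.mean((y[test] - X[test] @ beta) ** 2))       # MSE on I_j
print(np.mean(mses))                                            # CV MSE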
Regularization
X^T X is not invertible ⇐⇒ Σ^T Σ (using the SVD X = UΣV^T) is not invertible ⇐⇒ there are r < p nonzero singular values. Then use the pseudoinverse (Σ^T Σ)^+, defined as
(Σ^T Σ)^+ = diag(1/σ_1², ..., 1/σ_r², 0, ..., 0) ∈ R^{p×p}
and use the solution β^+ = V(Σ^T Σ)^+ Σ^T U^T y.

Feature Selection
If #features d ≫ #observations n → sparse/ridge regression (eliminate some features). Penalize less relevant features using a norm, via a Lagrangian with λ > 0:
min_β ∥y − Xβ∥² + λ∥β∥_p is the penalized objective.
The L0 "norm" counts the non-zero elements but is not a real norm.
Ridge Regression: Choose the squared L2 norm: β = (X^T X + λI)^{-1} X^T y.
Lasso Regression: Choose the L1 norm. Convex but not differentiable → can be solved with coordinate descent:
β_k = (1/∥X_{·k}∥²)(c_k − λ)   if c_k > λ
β_k = (1/∥X_{·k}∥²)(c_k + λ)   if c_k < −λ
β_k = 0                        if −λ ≤ c_k ≤ λ
with c_k = X_{·k}^T (y − Σ_{i≠k} X_{·i} β_i).
Ridge regression is faster; Lasso gives sparser regression vectors.
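A sketch of ridge (closed form) and Lasso (coordinate descent with the soft-thresholding update above); the data, λ and the iteration count are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 5))
beta_true = np.array([2.0, 0.0, -1.0, 0.0, 0.0])           # sparse ground truth
y = X @ beta_true + 0.1 * rng.normal(size=60)
lam = 5.0

# Ridge: closed form (X^T X + lambda I)^{-1} X^T y.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# Lasso: coordinate descent with the soft-thresholding update from the sheet.
beta = np.zeros(5)
for _ in range(100):
    for k in range(5):
        r = y - X @ beta + X[:, k] * beta[k]                # y - sum_{i != k} X_i beta_i
        c_k = X[:, k] @ r
        nk = X[:, k] @ X[:, k]
        if c_k > lam:
            beta[k] = (c_k - lam) / nk
        elif c_k < -lam:
            beta[k] = (c_k + lam) / nk
        else:
            beta[k] = 0.0
print(beta_ridge)
print(beta)                                                 # near-sparse estimate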
Neural Networks
Neural networks are good at capturing signal-like data (images, video, speech, audio, language). Prominent network structures per data type:
• Unstructured → MLPs (other classifiers are usually better)
• Images → CNNs
Neuron: ŷ = ϕ(Σ_{i=0}^D w_i x_i) (bias term: w_0 x_0 = b with x_0 = 1)
A single-layer neural network with non-linear activation units can approximate any continuous function. Use more layers to improve computation speed.
Non-linear activation functions:
• Sigmoid: y = 1/(1 + e^{−x}), y′ = y(1 − y) = y(x)y(−x)
• Hyperbolic Tangent: y = (e^x − e^{−x})/(e^x + e^{−x}), y′ = 1 − y²
• ReLU: y = max(0, x), y′ = 1(x > 0)
• Leaky ReLU: y = max(αx, x), y′ = 1 if x > 0 else α
• Exponential LU: y = x if x ≥ 0 else α(e^x − 1), y′ = 1 if x ≥ 0 else αe^x
α is a hyperparameter, 0 < α < 1.
The traditional functions (sigmoid, tanh) kill gradients, especially for many layers, because backprop multiplies small numbers (use Batch Normalization). The modern functions (the ReLU family) can blow up activations (also normalize). ReLU neurons may die (never activate again) if too many activations go below zero.
Multilayer Perceptrons
Notation:
• # units at layer i: H_i (H_0 is the number of features D)
• w^i_{k,j}: weight from layer i−1 → i, unit k → j
• w^i_j = (w^i_{0,j}, ..., w^i_{H_{i−1},j})^T (weights into unit j of layer i)
• h^0 = (1, x)^T (bias + input vector)
• W^i = (w^i_0 ... w^i_{H_i}) (all weights into layer i)
• Recursively define a^i = W^{iT} h^{i−1} and h^i = ϕ_i(a^i), where ϕ_i acts element-wise.
• a^i sums the weighted inputs for each neuron at layer i.
• h^i contains the output of each neuron at layer i.
• ϕ_i can be a different activation function for each layer i.
• The final layer does not accumulate.
The weights w = (W^1, W^2, ...) are the parameters to train.
ŷ = f(x; w) is the predicted output for x with weights w.
Loss = L(y, ŷ) describes how well ŷ approximates y.
Training loss = (1/N) Σ_{i=1}^N L(y_i, f(x_i; w)) is the average loss across the training set of N observations.
Single output:
• ŷ = final layer
• Squared Error (loss): L = (ŷ − y)², ∂L/∂ŷ = 2(ŷ − y). Used for regression.
Multiple outputs (classes):
• Let h = h^L be the final (accumulation) layer with C units.
• ŷ_i = softmax(h)_i = exp(h_i)/Σ_{j=1}^C exp(h_j) (base e). Transforms the output into a probability distribution.
• Cross Entropy: L = −log(ŷ_y), ∂L/∂h_j = ŷ_j − 1 if j = y, else ŷ_j.
  You only need to compute the softmax at position y (except for the derivative). Used for classification.
Then train the model via gradient descent: w ← w − η∇_w L_train. The training loss gradient is the average of the sample loss gradients. NB: ∇_w L_train is a vector of matrices of partial derivatives of the loss with respect to the weights.

Backpropagation
Let N = # layers. Helper variables:
• B^{(k)}_i = ∂L/∂h^{(k)}_i = Σ_j B^{(k+1)}_j w^{(k+1)}_{i,j} for 1 ≤ k < N
• B′^{(k)}_i = ∂L/∂a^{(k)}_i = ϕ′_k(a^{(k)}_i) B^{(k)}_i for 1 ≤ k < N
• B^{(N)}_i = ∂L/∂ŷ_i (no activation at layer N, so there is no B′^{(N)}_i)
Then the derivatives of the loss with respect to the weights are:
• ∂L/∂w^{(k)}_{i,j} = B′^{(k)}_j h^{(k−1)}_i for 1 ≤ k < N
• ∂L/∂w^{(N)}_{i,j} = B^{(N)}_j h^{(N−1)}_i
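A tiny numerical sketch of these recursions for one hidden layer, a sigmoid activation and squared-error loss; the weights and the single training example are random stand-ins, and the analytic gradient is checked against a finite difference.

import numpy as np

rng = np.random.default_rng(3)
x, y = rng.normal(size=2), 0.7                   # one training example (made up)
h0 = np.concatenate(([1.0], x))                  # h^0 = (1, x)^T
W1 = rng.normal(size=(3, 4))                     # weights into the hidden layer (4 units)
W2 = rng.normal(size=4)                          # weights into the single output unit

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(W1, W2):
    a1 = W1.T @ h0                               # a^1 = W^{1T} h^0
    h1 = sigmoid(a1)                             # h^1 = phi(a^1)
    yhat = W2 @ h1                               # final layer: no activation
    return a1, h1, yhat

a1, h1, yhat = forward(W1, W2)
B2 = 2.0 * (yhat - y)                            # B^{(2)} = dL/dyhat for L = (yhat - y)^2
grad_W2 = B2 * h1                                # dL/dW2_j = B^{(2)} h^1_j
B1 = B2 * W2                                     # B^{(1)}_j = sum over output units
Bp1 = sigmoid(a1) * (1 - sigmoid(a1)) * B1       # B'^{(1)} = phi'(a^1) * B^{(1)}
grad_W1 = np.outer(h0, Bp1)                      # dL/dW1_{i,j} = B'^{(1)}_j h^0_i

# Finite-difference check of one weight.
eps = 1e-6
W1p = W1.copy(); W1p[1, 2] += eps
num = ((forward(W1p, W2)[2] - y) ** 2 - (yhat - y) ** 2) / eps
print(np.isclose(grad_W1[1, 2], num, atol=1e-4))  # True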
Training Techniques
Data Normalization: For x_j, the sample values of feature j:
• Zero-mean-unit-variance: x_j ← (x_j − mean(x_j)) / std(x_j)
• Min-max ([0, 1]): x_j ← (x_j − min(x_j)) / (max(x_j) − min(x_j)) (subtract 0.5 for [−0.5, 0.5])
Stochastic Gradient Descent: Partition the data randomly, update with the loss on every mini-batch.
Momentum: Moving average of the gradient to stabilise updates.
Adaptive Stepsizes: Multiply the gradient by a positive semidefinite matrix to get independent stepsizes for each neuron.
Early Stopping: Stop when the validation error starts increasing (or does not achieve a new minimum for K_fail epochs).
Regularization: Add a penalty term λ∥w∥²_2 or λ∥w∥_1 to L_train.
Dropout: Randomly remove inputs or units during training.
Batch Normalization: Normalize the output of the neurons.
Convolutional Neural Networks
Include inductive bias in the network structure to specialise. Convolutions specialise in the local correlations present in signal data.
w: K × K convolutional kernel (not the kernel trick), K is odd
x: image (in matrix form)
y: convolutional layer (i.e. the processed image)
y[i, j] = Σ_{k=−⌊K/2⌋}^{⌊K/2⌋} Σ_{l=−⌊K/2⌋}^{⌊K/2⌋} x[i + k, j + l] w[k, l]
Basically: elementwise (Hadamard) multiplication of w with the K × K patch of x centered at (i, j), then sum all the elements.
Undefined if x[i ± ⌊K/2⌋, j] or x[i, j ± ⌊K/2⌋] do not exist; this removes ⌊K/2⌋ from each side of x. Solution: padding.
Zero Padding: Pad the input with ⌊K/2⌋ zeros on each side.
Convolutional Layers: If the image is a 3D feature map (e.g. the 3rd dimension = colour channels), then the kernel is a K × K × C tensor. Still the same: elementwise multiplication with the 'overlapping' elements and sum (i.e. a 2D convolution on each of the C layers, then add the results). This creates a layer with a single channel. Use multiple kernels and stack the resulting layers → a feature map with multiple channels. Optionally: add a bias term b after each convolution.
Can apply a non-linearity ϕ elementwise and then apply a new convolutional layer (like the layers in an MLP). Higher layers in a CNN represent more abstract features. Without padding, information is compressed. Use padding, then pooling to reduce the number of features.
Pooling: Sliding window like a convolution, but with an aggregation function (e.g. max, average) instead of multiplication/summation. Applied independently for each channel.
• Window size K_p (can be even!)
• Stride/hop size S_p (a convolution slides one position at a time)
Flattening: Convert a W × H × C feature map to a vector of length W · H · C so that MLP learning can be used.
Toeplitz Matrix: A convolution can be described by a matrix, e.g. for w ∈ R^{2×2} and x ∈ R^{3×3}, with x flattened to a vector of length 9:
y = ( w1 w2 0 w3 w4 0 0 0 0
      0 w1 w2 0 w3 w4 0 0 0
      0 0 0 w1 w2 0 w3 w4 0
      0 0 0 0 w1 w2 0 w3 w4 ) x, so y ∈ R^4 (reshaped, y ∈ R^{2×2})
This allows us to use backprop. Never update the 0s, and the gradients for shared weights are the sum of the gradients from the nodes where they are used. Still, it is easier to flatten instead.
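A direct sketch of the 2-D convolution formula with zero padding; the 3×3 image and the kernel values are arbitrary.

import numpy as np

def conv2d(x, w):
    # y[i,j] = sum_{k,l} x[i+k, j+l] * w[k,l], with zero padding of floor(K/2) on each side.
    K = w.shape[0]                     # K x K kernel, K odd
    r = K // 2
    xp = np.pad(x, r)                  # zero padding keeps the output the same size as x
    y = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            patch = xp[i:i + K, j:j + K]       # K x K patch centered at (i, j) of x
            y[i, j] = np.sum(patch * w)        # Hadamard product, then sum
    return y

x = np.arange(9.0).reshape(3, 3)
w = np.array([[0.0, 1.0, 0.0], [1.0, -4.0, 1.0], [0.0, 1.0, 0.0]])   # a Laplacian-like kernel
print(conv2d(x, w))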
Recommender and PCA
Both the recommender problem and PCA use the SVD.

Recommending (rank-r matrix factorization problem)
Data matrix D ∈ R^{n×d} and rank r < min(n, d), with the objective:
min_{X,Y} ∥D − YX^T∥² s.t. X ∈ R^{d×r}, Y ∈ R^{n×r} (is nonconvex!)
Solution: use the truncated SVD, D = UΣV^T, then YX^T = U_{·R} Σ_{RR} V_{·R}^T with R = {1, ..., r}.
To recommend: replace the missing values in D with the mean rating μ, compute the truncated SVD, and use YX^T as the recommendations. The approximation to the mean rating is poor for sparse entries. Solution: minimise only on the observed entries, with O the set of observed entries:
min_{X,Y} ∥1_O ∘ (D − YX^T)∥² = Σ_{(i,j)∈O} (D_{ij} − Y_i X_j^T)²
s.t. X ∈ R^{d×r}, Y ∈ R^{n×r} (nonconvex, use coordinate descent)

Dimensionality Reduction (PCA)
We can create features from linear combinations of other features. Such a feature with the highest variance is a good feature.
For existing features F_1, ..., F_d: collect the mean value of each feature, μ_F = (μ_{F_1}, ..., μ_{F_d})^T.
Centered data matrix: C = D − 1μ_F^T (1 is the vector of ones).
Then the direction of maximal variance is given by α:
max_{∥α∥=1} α^T C^T C α / n (n observations)
μ_{F_{d+1}} = μ_F^T α, σ²_{F_{d+1}} = (1/n)∥Cα∥²
For a low-dimensional representation, find r orthogonal directions of largest variance, given by the columns of Z:
max_Z tr(Z^T C^T C Z) s.t. Z ∈ R^{d×r}, Z^T Z = I
= min_Z ∥C^T C − Z (Σ^T Σ)_{RR} Z^T∥² s.t. Z ∈ R^{d×r}, Z^T Z = I
For C = UΣV^T and R = {1, ..., r}, the solution is Z = V_{·R}.
All these objectives are nonconvex (hence why we use the SVD).
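A brief PCA sketch following the centering-and-SVD recipe above; the random data and the choice of r = 2 components are illustrative.

import numpy as np

rng = np.random.default_rng(4)
D = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # n x d data with correlated features

mu = D.mean(axis=0)                    # mu_F: mean of each feature
C = D - mu                             # centered data matrix C = D - 1 mu_F^T

U, S, Vt = np.linalg.svd(C, full_matrices=False)   # C = U Sigma V^T
r = 2
Z = Vt[:r].T                           # d x r: the r directions of largest variance (Z = V_{.R})
scores = C @ Z                         # low-dimensional representation of the data
var = (S[:r] ** 2) / len(D)            # variance captured by each direction, ||C alpha||^2 / n
print(Z.shape, scores.shape, var)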
K-Means Clustering
n observations, r clusters, data D. Let P_n denote the set of all possible disjoint partitions of {1, ..., n}, and let C_s denote a cluster (a set of point indices). Minimize the distance between points within clusters:
min_{{C_1,...,C_r}∈P_n} Σ_{s=1}^r (1/|C_s|) Σ_{j,i∈C_s} ∥D_j − D_i∥²
This is a discrete problem, so the gradient is undefined. Reformulate:
min Σ_{s=1}^r Σ_{i∈C_s} ∥D_i^T − X_{·s}∥² s.t. X_{·s} = (1/|C_s|) Σ_{i∈C_s} D_i^T, {C_1, ..., C_r} ∈ P_n
i.e. minimizing the distance to the centroid (X_{·s} is the centroid of C_s).
Lloyd's Algorithm: Block coordinate descent (sketched in the code after this section).
• Initialize random centroid locations (in a reasonable range).
• Repeat: assign each point to its closest centroid; set each centroid ← the average location of its points.
• Until convergence (which is guaranteed).
Reformulate again: Let Y ∈ {0,1}^{n×r} where Y_{is} = 1 ⇐⇒ i ∈ C_s. Then |Y_i| = 1 ∀ i ∈ {1, ..., n}.
1^{n×r} is the set of all binary matrices partitioning n points into r sets: 1^{n×r} = {Y ∈ {0,1}^{n×r} : |Y_i| = 1 for i ∈ {1, ..., n}}.
Then the sth centroid is X_{·s} = (1/|Y_{·s}|) D^T Y_{·s}, and X gathers all centroids column-wise: X = D^T Y(Y^T Y)^{-1}. The following k-means objectives are equivalent and nonconvex:
min_Y ∥D − YX^T∥² s.t. Y ∈ 1^{n×r}, X = D^T Y(Y^T Y)^{-1}
min_{X,Y} ∥D − YX^T∥² s.t. Y ∈ 1^{n×r}, X ∈ R^{d×r}
min_{X,Y} Σ_{s=1}^r Σ_{i=1}^n Y_{is} ∥D_i^T − X_{·s}∥² s.t. Y ∈ 1^{n×r}, X ∈ R^{d×r}
So k-means is an instance of the recommender problem, but we cannot use the SVD.
Given the cluster assignment Y, the objective min_X ∥D − YX^T∥² s.t. X ∈ R^{d×r} is convex, with solution X = D^T Y(Y^T Y)^{-1}.
Given the centroid locations X, the objective min_Y ∥D − YX^T∥² s.t. Y ∈ 1^{n×r} is nonconvex because 1^{n×r} is not a convex set, but it has a solution: assign every point to its closest centroid.
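A compact sketch of Lloyd's algorithm; the random 2-D blobs, r = 2 clusters and fixed iteration count are all illustrative choices.

import numpy as np

rng = np.random.default_rng(5)
D = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])   # two blobs
r = 2

centroids = D[rng.choice(len(D), r, replace=False)]      # initialize at random data points
for _ in range(20):
    # Assignment step: each point goes to its closest centroid.
    dists = np.linalg.norm(D[:, None, :] - centroids[None, :, :], axis=2)   # n x r distances
    assign = dists.argmin(axis=1)
    # Update step: each centroid becomes the average location of its points.
    new = []
    for s in range(r):
        pts = D[assign == s]
        new.append(pts.mean(axis=0) if len(pts) else centroids[s])   # keep old centroid if empty
    centroids = np.array(new)

print(centroids)          # with a reasonable initialization, roughly the two blob centers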
Nonconvex Clustering
Feature-transform the data by using the kernel trick for clustering. Reformulate the k-means objective:
max_Y tr(Z^T DD^T Z) s.t. Z = Y(Y^T Y)^{-1/2}, Y ∈ 1^{n×r}
where DD^T is defined using inner products. Using a feature transform ϕ and the kernel matrix K = ϕ(D)ϕ(D)^T, equivalent objectives are:
min_Y ∥ϕ(D) − YX^T∥² s.t. X = ϕ(D)^T Y(Y^T Y)^{-1}, Y ∈ 1^{n×r}
max_Y tr(Z^T K Z) s.t. Z = Y(Y^T Y)^{-1/2}, Y ∈ 1^{n×r}
Eigendecomposition: K = K^T ∈ R^{n×n} =⇒ K = VΛV^T, with V orthogonal and Λ = diag(λ_1, ..., λ_n), |λ_1| ≥ ... ≥ |λ_n|.
Since kernel matrices are positive semidefinite, we have the symmetric decomposition K = AA^T with A = VΛ^{1/2}, which plays the role of ϕ(D). Substitute A into the objective above and run k-means; ϕ is not needed. Drawback: not robust, and a full eigendecomposition is needed.

Spectral Clustering
Consider a symmetric similarity matrix W describing the similarity of points according to some measure (distance, knn or other). Use W as a weighted adjacency matrix to make a graph (the graph must be connected!). Clusters designate maximal similarity, which is equivalent to a minimum cut:
max_Y Sim = tr(Y^T W Y (Y^T Y)^{-1}) s.t. Y ∈ 1^{n×r}
min_Y Cut = tr((1 − Y)^T W Y (Y^T Y)^{-1}) s.t. Y ∈ 1^{n×r}
Laplacian: L = I_W − W, with I_W the degree matrix of W.
The indicator vectors of connected components are eigenvectors of L.
Spectral Clustering: uses only the r eigenvectors after the first one, which is faster than a full eigendecomposition.
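A rough spectral-clustering sketch: build a similarity graph, form the Laplacian L = I_W − W, embed with the eigenvectors after the first one, and run a few k-means iterations on the embedding. The Gaussian similarity, its width and the toy data are all assumptions.

import numpy as np

rng = np.random.default_rng(6)
D = np.vstack([rng.normal(0, 0.2, (15, 2)), rng.normal(2, 0.2, (15, 2))])   # two groups
r = 2

# Similarity matrix W (Gaussian similarity), degree matrix I_W, Laplacian L = I_W - W.
sq = np.sum((D[:, None, :] - D[None, :, :]) ** 2, axis=2)
W = np.exp(-sq / 0.5)
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W

# Embedding: the r eigenvectors after the first one (smallest eigenvalues of L).
vals, vecs = np.linalg.eigh(L)
Z = vecs[:, 1:r + 1]

# Cluster the embedded points with a few iterations of Lloyd's algorithm.
cent = Z[rng.choice(len(Z), r, replace=False)]
for _ in range(20):
    assign = np.linalg.norm(Z[:, None] - cent[None], axis=2).argmin(axis=1)
    cent = np.array([Z[assign == s].mean(axis=0) if np.any(assign == s) else cent[s]
                     for s in range(r)])
print(assign)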
Data Fusion

True Parameters from Noisy Observations
Unknown vector of parameters θ. Noisy observations {y̌_ℓ}, ℓ ∈ {1, ..., R}, from R sensors. The parameters are random: Θ ∼ p(θ). The sensors are also random, but depend on the parameters they measure, so we consider Y_ℓ | θ ∼ p(y_ℓ | θ).
Assumption: p(y_1, ..., y_R | θ) = Π_{ℓ=1}^R p(y_ℓ | θ).
Idea: the true parameters are the most likely ones given the observations from the sensors. Using an approximation p̃ ≈ p,
θ̂ = E[θ | y̌_1, ..., y̌_R] = ∫ θ p̃(θ | y̌_1, ..., y̌_R) dθ
Monte Carlo: Approximate the integral by taking the experimental average over many samples from p(θ | y̌_1, ..., y̌_R).
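One way to sketch the Monte Carlo idea is self-normalized importance sampling with the prior as proposal: draw θ samples from p(θ), weight them by the product of the sensor likelihoods, and average. The Gaussian prior, Gaussian sensor noise and the observed readings below are all invented for illustration.

import numpy as np

rng = np.random.default_rng(7)

# Assumed model: scalar theta ~ N(0, 2^2); each sensor observes y_l = theta + N(0, 0.5^2).
y_obs = np.array([1.1, 0.9, 1.3])          # made-up readings from R = 3 sensors
sigma_prior, sigma_sensor = 2.0, 0.5

theta = rng.normal(0.0, sigma_prior, size=100_000)               # samples from the prior p(theta)

# Joint likelihood p(y_1,...,y_R | theta) = prod_l p(y_l | theta) (independence assumption).
log_w = np.sum(-0.5 * ((y_obs[None, :] - theta[:, None]) / sigma_sensor) ** 2, axis=1)
w = np.exp(log_w - log_w.max())                                  # unnormalized weights
theta_hat = np.sum(w * theta) / np.sum(w)                        # approximates E[theta | y]
print(theta_hat)                                                 # close to the analytic posterior mean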

State Tracking (states are hidden)
States {x_n}, time instants {t_n}, observations (one source) {y̌_n}.
Markov Assumption: p(x_n | x_{0:n−1}) = p(x_n | x_{n−1}) and p(x_n | x_{n−1}, y_{0:n−1}) = p(x_n | x_{n−1}). The state depends only on the directly previous state, and does not depend on previous observations.
White Observation Noise: p(y_n | x_n, y_{0:n−1}) = p(y_n | x_n). Observations do not depend on previous ones.
X_{n+1} = f_n(X_n) + W_n, W_n ∼ p(w_n)
Y_n = h_n(X_n) + V_n, V_n ∼ p(v_n)
Additive (process, observation) noise.
Chapman-Kolmogorov: p(x_n | y̌_{0:n−1}) = ∫ p(x_n | x_{n−1}) p(x_{n−1} | y̌_{0:n−1}) dx_{n−1}
Bayes Rule: p(x_n | y̌_{0:n}) ∝ p(y̌_n | x_n) p(x_n | y̌_{0:n−1})
MMSE Estimate: x̂_n = ∫ x_n p(x_n | y̌_{0:n}) dx_n
A: predicted posterior, B: state transition distribution, C: previous posterior, D: updated posterior, E: likelihood function.
p(x_0) → Repeat[C; B; A; (y̌_n + E); D] → x̂_n
Use p(x_0) as the previous posterior for the first loop.

good luck!
Eduardo Costa Martins