STUDY GUIDE FOR EXAM 3
=== Multiclass classification === 1x
1. If we are doing linear classification (logistic regression, Perceptron, SVM) with k classes, and the data is d-dimensional, how many parameters are there overall?
		k classes (d features + bias term)
		k(d+1)
2. Given (w1, b1), (w2, b2), ..., (wk, bk), how would you draw the decision regions for the k classes?
		for i_1:k
		wi * x + bi >= 0, see where class i beats another class j
3. In multiclass logistic regression, for a given x, what formula do we use to get the probabilities of the different labels?
		P(label) / sum P(all labels)
		e^-z / sum_i (e^-z_i)
4. Write down the multiclass Perceptron algorithm.
		initialize vector w1:k 
		initialize vector b1:k
		while x, y misclassified:
			# correct label
			wy = wy + x
			by = by + 1
			# predicted label
			wy = wy - x
			by = by - 1 
	# suppose k classes, then there are k choose 2 pieces / areas			
5. Suppose the k classes are not linearly separable. Which of the following methods would still be reasonable to use: (a) logistic regression (b) perceptron (c) soft-margin SVM?
		soft-margin
=== Kernel machines === 1x
1. Quadratic decision boundaries.
(a) What is the extended representation phi(x)?
		phi(x) = (x1, x2, x1x2, x1^2, x2^2)
(b) If x is d-dimensional, what is the dimension of phi(x)?
		2d + d choose 2
(c) What is the kernel function k(x,x') = phi(x) dot phi(x')?
		k(x, x') = (1 + x dot x')^2
2. Polynomial decision boundaries. Same questions as in 1.
(a) What is the extended representation phi(x)?
		phi(x) = (1, sqrt(2)x1...sqrt(2)xd, x1^2..xd^2, sqrt(2)x1x2...sqrt(2)xd-1xd)
(b) If x is d-dimensional, what is the dimension of phi(x)?
		dp + d choose p
(c) What is the kernel function k(x,x') = phi(x) dot phi(x')?
		k(x, x') = (1 + x dot x')^p
3. Write down the kernel Perceptron algorithm in the most commonly-used form: the dual form that uses k() and makes no mention of phi().
	  a = 0,  b = 0
		while i has yi(sum_j a_j * yj k(xi, xj) + b) <= 0:    
			a_i += 1
			b += yi
		sign(∑1:n a_i * yi k(xi, xj) + b)
(a) If the algorithm makes three updates on data point 5, and four updates on data point 7, what is the final alpha vector?
    (0, 0, 0, 0, 3, 0, 0, 4)
(b) Continuing from (a): Do we have enough information to determine the final value of b in this case? If not, what are the possible final values?
		No, we don't know what the class was
(c) Given the final alpha and b, what is the formula for making a prediction on a new point x?
		sign(∑1:n a_iyi k(xi, x) + b)
(d) Why do we prefer this form of the algorithm to the one that explicitly uses phi(x)?
		We don't have to explicitly compute phi(x) which saves a lot of computation for higher order problems
4. Kernel SVM. In order to solve the kernel SVM optimization problem for a data set of n points, how many similarity values k(xi, xj) need to be computed?
	 n x n (Gram Matrix)
5. The RBF kernel. What is the kernel function?
	 exp(-||x-y||^2 / s^2)
=== Decision trees === 1x
1. What is the algorithm for growing a decision tree? What are some common stopping rules?
	 Start with root node of all points
		 - look at all leaces and possible splits
		 - choose split s.t argmin uncertainty
	 Stopping: Pure Leaf, Max Depth reached, set a threshold
2. What are the formulas for two commonly-used uncertainty measures: the misclassification rate and the Gini index? You should be familiar with both the binary and multiclass versions.
	 Misclassification Rate
		 Binary: min(p, 1-p)
		 Multiclass: 1 - max p_i
	 Gini 
		 Binary: 1 - (p^2 + (1-p)^2)
						 2p(1 - p)
		 Multiclass: 1 - sum_i!=j:k (p_i^2)
3. What is the overfitting phenomenon in decision trees?
	 We will ultimately choose a very niche/specific split that doesn't generalize well (can be caused by outliers, noisy data); can fit a tree of 0 error if deep enough
=== Ensemble methods === 1x
1. Define: (a) weak classifier (b) weak learner.
	 Weak classifier: classifier that performs just slightly better than a baseline such as random guessing / majority label on a problem (to be ensembled)
   Weak learner: algorithm that can consistently generate weak classifiers
2. Suppose we run boosting for T rounds using decision stumps as the weak classifiers. What is the form of the final classifier? What is the role of the coefficients alpha?
	 Final form:
		 f(x) = (a_1 * h_1 + ... + a_T * h_T)
	 where alpha_i are weights adjusted from boosting reweighting
3. If during boosting we are able to consistently generate weak classifiers that are better than random, what goes to zero: training error or test error?
	 Training error
4. In boosting, why can't we generate all the weak classifiers at the same time, rather than in sequence?
	 We want weak classifiers to predict with some context, such as looking at errors of the previous ones. Specifically this comes into play when we update the weights, since we have to give higher weights to correct classifiers and lower weights to incorrect classifiers
5. What forms of randomization are used in a random forest?
	 Choose n' points randomly w/ replacement
	 when fitting the decision tree on these n' points, we choose k features at random from the d total features and pick the best split for each node
6. What is the functional form of a random forest classifier?
	 majority vote of weak decision tree classifiers h1:T
	 
=== Neural nets === 1x
1. Write down the ReLU and sigmoid activation functions.
		z if z > 0 else 0
		1 / 1+exp(-z)
2. How are probabilities of different classes determined at the output layer of a neural net classifier?
		Softmax using a marginal probability sum over all possible classes 
		e_i^-z / sumi:k(e_i^-z)
3. Given the number of nodes in each layer of a fully-connected feedforward net, can you determine the total number of parameters of the net?
		If fully connected yes, k(d-1) * number layers + output layer
4. What kinds of functions can be approximated by a neural net with one hidden layer? What is the potential benefit of adding more layers?
		Any continuous function; benefit is that it more layers may reduce computation in the case that the size of each layer is reduced
5. Construct a small neural net that takes as input four Boolean (0/1) values (x1, x2, x3, x4) and computes OR(AND(x1, x2), AND(NOT(x3), NOT(x4))). All hidden units should be ReLU.
model = [
	nn.Linear(), # input layer
	nn.ReLU(), # first Hidden
	nn.ReLU(), # second hidden
	nn.Sigmoid()
]
6. What is the cross-entropy loss function that is used when training neural net classifiers? Is the loss function convex?
	 Given (x1:n, y1:n)
	 L(w) = - sum_1:n(ln P_w(yi | xi)
	 Not necessarily; it is not convex (especially on large d)
=== Generalization === 1x
1. What is the main assumption in the statistical learning framework?
   Past present and future data all come from same distribution P
2. Give examples of two different types of distribution shift.
	 Covariance shift
		Distribution of x change, but probable labels remain the same
		Solutions: importance weighting, more data collection to better guestimate the true distribution
   Label shift
		frequency of labels change, but those distributions remain the same
		Solutions: Label Distribution Estimation; Estimate the label distribution in the test set and use this to adjust the predictions from the model. Train on multiple models and ensemble so that the final model is more robust to shifts



https://piazza.com/class_profile/get_resource/ln18bjs43q41tr/ln18nhgamp78




a) Multiclass perceptron algorithm
w1:k = 0
b1:k = 0 
while point xi, yi misclassified:
	# correct label
	wy += x
	by += 1
	# incorrect label
	wy -= x
	by -= 1
# classify new point: 
argmax_j(w_j * x + b_j)
b) Suppose ran on 100 points, k = 5 labels, makes 50 mistakes
Largest possible value of b4 = 50
Smallest possible value of b4 = 50
c) Suppose multiclass logisitic regression ran

x = [1, 0]
P(label = 2) = P(2) / sum(P(each label))
= e^(2x1) / e^(2x1) + e^(-x1 + x2 + 1) + e^(3x1 - x - 2) + e^(x1 + 2x2 + 3)
= e^(2) / e^(2) + e^(4) + e^(0) + e^(1)
d)
w1 = [2, 0], b = 4
w2 = [2, 2], b = 2
w3 = [1, 2], b = 4
Decision Boundaries
1 > 2: w1x + b1 ≥ w2x + b2 = 2x1 + 4 ≥ 2x1 + 2x2 + 2
= 2 ≥ 2x2
= 1 ≥ x2
1 > 3: 2x1 + 4 ≥ x1 + 2x2 + 4
= 2x1 ≥ x1 + 2x2
= x1 ≥ 2x2
2 > 3: 2x1 + 2x2 + 2 ≥ x1 + 2x2 + 4
= x1 ≥ 2
a, b = 0
while some xi, yi is misclassified: yi(sum_j (a_j * yj k(xj, xi) + b)) <= 0
	a_i += 1
	b += yi
classifiy new point: sign(sum_j (a_j * yj k(xj, xi) + b))
k(x, x’) = (1 + x dot x’)^2
Basis expansion phi(x) in R^d
phi(x) = (
x1:d,
x1:d^2
x1x2…x_d-1x_d
)
RBF Kernel:
$$ exp(\frac{-|x - z|^2}{s^2}) $$
Polynomial boundary kernel function
k(x, x’) = (1 + x dot x’)^p
a) When splitting a node, what is the maximum number of splits if n data points in R^d
d(n -1)
b) Given a node of n points of which p fraction are +1 and 1-p are -1
Gini Index = 1 - ||p||^2, 2p(1-p)
Misclassification rate = min(1, 1-p), 1 - max p_i
c) Training error will go to 0 when adding more and more nodes
Test error not necessarily go to 0
Overfitting sets in at some stage
d) Techniques help to reduce overfitting
Pruning using validation set
set max depth
a) Weak classifier: performs with an error better than random / baseline
Weak learner: algorithm that can consistently generate weak classifiers
b) Suppose running AdaBoost produces m weak classifiers
Final Form:
f(x) = sign(a_1 * h1(x) … a_n * hn(x))
m rounds of boosting
Can not do the rounds in parallel (you have to take in context of previous errors to reweight the distribution)
c) What is a decision stump?
One level decision tree where the ‘choice’ is made
Basically its the linear boundary of one feature, x1 > value
d) Suppose that a random forest contains m decision trees
Final Form:
majority vote of T1:m (equal weights)
f(x) = argmax(T_1, … , T_m)
e) Randomization:
when picking n’ data points, resample dataset w/ replacement
when picking k features from d total, pick the best split of that subset k at each node
a) How many parameters in following fully connected neural net (including bias term)?
b) What Function computed by following on input x = [2, 3]
h1 = ReLU(3x1 - 2x2 - 1)
h2 = ReLU(2x2 - 4)
y = 3x1 + 2x2
2, 3 → 0, 2 → 4
c) Sketch the different activation functions
d) Loss function not convex
a) Suppose classification task w/ input space X and label space y
b) Examples of distribution shift

