=== Study guide for midterm #2 ===
Friday November 17, 1-1.50pm, in Center 216
Lectures covered:
-- Regression
-- Logistic regression
-- Basics of optimization
-- The Perceptron
-- Support vector machines
1. Go through the lectures and make sure you understand them.
2. Go through the corresponding homeworks (HW4 to HW7) and make sure you can do all the problems. Many exam questions will closely resemble homework problems.
3. Memorize the following formulas:
a) Loss functions for least-squares regression, ridge regression, and Lasso (a small numpy sketch follows item (e) below)
L_ols = sum (y - (wx + b))^2
L_ridge = sum (y - (wx + b))^2 + lambda * ||w||^2
L_lasso = sum (y - (wx + b))^2 + lambda * ||w||_1
b) Form of the logistic regression prediction
1 / (1 + exp(-y(wx + b)))
c) The Perceptron algorithm
w, b = 0, 0
# repeat until convergence (a full pass with no mistakes)
for each point (x, y):
    # misclassified if y(wx + b) <= 0
    if y * (wx + b) <= 0:
        w = w + y * x
        b = b + y
return w, b
d) The hard-margin SVM and soft-margin SVM optimization problems
Hard margin:
min ||w||^2 s.t. y_i(w x_i + b) >= 1 for all i
Soft margin:
min ||w||^2 + C sum(E_i) s.t. y_i(w x_i + b) >= 1 - E_i, E_i >= 0 for all i
e) The dual form of the Perceptron algorithm
w = sum(a_i * y_i * x_i)
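A minimal numpy sketch of the losses in (a) and the prediction in (b); the function names and the variable lamb are my own choices, not notation from the lectures, and the logistic part assumes labels y in {-1, +1}.

import numpy as np

def squared_error(X, y, w, b):
    # sum of squared residuals: sum (y - (w.x + b))^2
    return np.sum((y - (X @ w + b)) ** 2)

def ridge_loss(X, y, w, b, lamb):
    # least squares plus an L2 penalty on w
    return squared_error(X, y, w, b) + lamb * np.sum(w ** 2)

def lasso_loss(X, y, w, b, lamb):
    # least squares plus an L1 penalty on w
    return squared_error(X, y, w, b) + lamb * np.sum(np.abs(w))

def logistic_prediction(x, y, w, b):
    # probability assigned to label y in {-1, +1}
    return 1.0 / (1.0 + np.exp(-y * (w @ x + b)))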
4. The following types of questions are *guaranteed* to be on the exam:
a) Write down a gradient descent (or stochastic gradient descent) algorithm for minimizing this loss function. (A runnable sketch follows item (c) below.)
# repeat until convergence:
#   choose a point at random
x, y = randomly chosen data point
#   compute the gradient of the loss at that point only
#   update rule
w = w - alpha * gradient_L(w; x, y)
#   stop once the change in loss falls below a threshold (or the gradient is 0)
if abs(delta_L) < thres:
    return w
b) Are the following (univariate/multivariate) functions convex, concave, both, or neither?
linear: both
x^3: neither
x^2: convex
c) In the following example, draw the decision boundary as well as the left and right boundaries, and compute the margin.
margin = 2/||w||
decision boundary: wx + b = 0; right boundary: wx + b = +1; left boundary: wx + b = -1
to draw each line, find its intercepts by setting x_1 = 0 and then x_2 = 0
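A runnable sketch for question (a), using the least-squares loss as a concrete example; the step size alpha, iteration cap max_iter, and tolerance tol are assumed values, not numbers from the lectures.

import numpy as np

def sgd_least_squares(X, y, alpha=0.01, max_iter=10000, tol=1e-6):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    prev_loss = np.inf
    for t in range(max_iter):
        # choose one point at random
        i = np.random.randint(n)
        # gradient of (y_i - (w.x_i + b))^2 at that point only
        residual = y[i] - (X[i] @ w + b)
        grad_w = -2.0 * residual * X[i]
        grad_b = -2.0 * residual
        # update rule
        w -= alpha * grad_w
        b -= alpha * grad_b
        # check convergence on the full loss every 100 steps
        if t % 100 == 0:
            loss = np.sum((y - (X @ w + b)) ** 2)
            if abs(prev_loss - loss) < tol:
                break
            prev_loss = loss
    return w, b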
Please come early!
STUDY QUESTIONS FOR MIDTERM 2
=== Linear regression ===
(a) What are they? Ordinary least squares, Ridge, Lasso
$$ L_{OLS}(w) = \sum (y - (wx + b))^2\\ L_{L1}(w) = \sum (y - (wx + b))^2 + \lambda\|w\|_1\\ L_{L2}(w) = \sum (y - (wx + b))^2 + \lambda\|w\|_2^2\\ $$
(b) What are the differences between them: why would one use one rather than the other?
Ordinary least squares: no regularization, so it can overfit, especially with multicollinearity.
Lasso/L1: L1 regularization can force coefficients exactly to 0 as the regularization parameter grows, so it promotes sparsity.
Ridge/L2: L2 regularization shrinks coefficients toward 0 but does not set them exactly to 0, so it does not promote sparsity.
(c) One of them is a strict generalization of another; what are these?
Ridge regression is a strict generalization of OLS: setting the regularization parameter to 0 recovers OLS exactly. Lasso introduces a different type of penalty (L1 rather than L2) and behaves differently as the penalty grows.
2. What is the trick for incorporating the parameter b into the vector w?
Append a constant 1 to every data point and append b to w, so w goes from dimension d to d+1 and the dot product absorbs the bias term.
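A small numpy illustration of this bias trick; the matrix X and the parameters w, b below are made-up example values.

import numpy as np

# append a constant 1 feature to every point, and fold b into w
X = np.array([[1.0, 2.0], [3.0, 4.0]])             # n = 2 points in d = 2 dimensions
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])   # shape (n, d + 1)

w, b = np.array([0.5, -1.0]), 2.0
w_aug = np.append(w, b)                            # shape (d + 1,)

# the augmented dot product reproduces w.x + b
assert np.allclose(X_aug @ w_aug, X @ w + b)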
=== Logistic regression ===
$$ P(Y|X) = \frac{1}{1 + exp(-y(w^Tx + b))} $$
(a) What is the dimension of the data? n x d
(b) What is the decision boundary? wx + b = 0
(c) What is the set of points for which Pr(Y=1) >= 3/4?
$$ P(Y = 1 | x) = \frac{1}{1 + \exp(-(w^Tx + b))} \geq 3/4 \\
\iff 1 + \exp(-(w^Tx + b)) \leq 4/3 \\
\iff \exp(-(w^Tx + b)) \leq 1/3 \\
\iff w^Tx + b \geq -\ln(1/3) = \ln 3
$$
(a) What is the loss function for training a logistic regression model given n data points?
$$ L(w) = \sum_i \ln(1 + e^{z^i}), \quad z^i = -y^i \cdot wx^i\\
\nabla_w L = \sum_i \frac{-y^ix^i}{1+\exp(y^i \cdot wx^i)} $$
(b) Is this an optimization problem to which we can hope to find the globally optimum solution? Why or why not? Yes: the loss function is convex, so any local minimum is a global minimum and gradient descent can reach the globally optimal solution.
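A numpy sketch of this loss and its gradient (bias b omitted, matching the formulas above); the function names are my own, not course code.

import numpy as np

def logistic_loss(w, X, y):
    # y in {-1, +1}; z_i = -y_i * (w . x_i)
    z = -y * (X @ w)
    return np.sum(np.log1p(np.exp(z)))

def logistic_gradient(w, X, y):
    # dL/dw = sum_i -y_i x_i / (1 + exp(y_i * (w . x_i)))
    coeff = -y / (1.0 + np.exp(y * (X @ w)))
    return X.T @ coeff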
Given a vocabulary V and a text document, describe how to convert the document into a vector of dimension the size of V.
V is a set of all words and the document is a sequence of words. To convert the document into a vector of dimension |V|, assign each word in V its own index (V is a set, so each word gets exactly one index). Then, for each (word, index) pair, the entry at that index is the number of times the word appears in the document.
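A short Python sketch of the bag-of-words conversion just described; the vocabulary V and the example document are made up.

# fix an index for every vocabulary word
V = ["cat", "dog", "fish"]
index = {word: i for i, word in enumerate(V)}

def to_count_vector(document, index):
    # one entry per vocabulary word, counting occurrences in the document
    counts = [0] * len(index)
    for word in document.split():
        if word in index:
            counts[index[word]] += 1
    return counts

print(to_count_vector("the cat chased the dog and the cat", index))  # [2, 1, 0]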
=== First and second derivatives ===
=== Gradient descent ===
(b) Write down a stochastic gradient descent algorithm for minimizing L.
for i in range(max_iter):  # or until convergence
    pick a random point (x, y)
    w = w - alpha * gradient_L(w; x, y)
=== Positive semidefinite matrices ===
for all vectors x
$$ x^TMx \geq 0 $$
Examples: covariance matrices; the Hessian matrix of a convex function.
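A quick numpy check of this definition: a symmetric matrix is PSD exactly when all of its eigenvalues are non-negative, and a covariance matrix passes the test. The tolerance and the random data below are arbitrary choices.

import numpy as np

def is_psd(M, tol=1e-10):
    # symmetric M is PSD iff all eigenvalues are non-negative
    return np.all(np.linalg.eigvalsh(M) >= -tol)

data = np.random.randn(100, 3)
cov = np.cov(data, rowvar=False)   # covariance matrices are always PSD
print(is_psd(cov))                 # True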
=== Convexity ===
$$ f(ta + (1 - t)b) \leq tf(a) + (1 - t)f(b) \quad \text{for all } t \in [0, 1]\\
$$
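A numeric illustration of this definition (and of question 4(b) above), using made-up test points a, b: sample t in [0, 1] and check the chord inequality for x^2 (never violated) and x^3 (violated for some t).

import numpy as np

def violates_convexity(f, a, b, num_t=101):
    # f is convex only if f(t*a + (1-t)*b) <= t*f(a) + (1-t)*f(b) for all t in [0, 1]
    t = np.linspace(0.0, 1.0, num_t)
    lhs = f(t * a + (1 - t) * b)
    rhs = t * f(a) + (1 - t) * f(b)
    return np.any(lhs > rhs + 1e-12)

print(violates_convexity(lambda x: x ** 2, -2.0, 3.0))   # False: x^2 is convex
print(violates_convexity(lambda x: x ** 3, -2.0, 3.0))   # True: x^3 is neither convex nor concave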
=== Perceptron algorithm ===
(a) Write down the (binary) Perceptron algorithm in both primal and dual form.
Primal:
w, b = 0, 0
repeat until a full pass makes no mistakes:
    for i in 1:n:
        # if misclassified
        if y^i * (w * x^i + b) <= 0:
            w += y^i * x^i
            b += y^i
return w, b
Dual:
a = (0, ..., 0), b = 0
repeat until a full pass makes no mistakes:
    for i in 1:n:
        # if misclassified
        if y^i * (sum_j a_j * y^j * (x^j * x^i) + b) <= 0:
            a_i += 1
            b += y^i
(b) What is the relationship between the solutions of the two? They produce the same decision boundary; the primal weights are recovered from the dual counts via
$$ w = \sum_i a_iy^ix^i $$
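A numpy sketch of both forms on a small, linearly separable, made-up dataset, assuming labels in {-1, +1}; it also checks that the dual counts recover the primal weights via w = sum_i a_i y^i x^i.

import numpy as np

def perceptron_primal(X, y, max_epochs=100):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * (X[i] @ w + b) <= 0:     # misclassified
                w += y[i] * X[i]
                b += y[i]
                mistakes += 1
        if mistakes == 0:                      # converged: full pass with no mistakes
            break
    return w, b

def perceptron_dual(X, y, max_epochs=100):
    n = X.shape[0]
    a, b = np.zeros(n), 0.0
    G = X @ X.T                                # Gram matrix of inner products
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * (np.sum(a * y * G[:, i]) + b) <= 0:   # misclassified
                a[i] += 1
                b += y[i]
                mistakes += 1
        if mistakes == 0:
            break
    return a, b

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b1 = perceptron_primal(X, y)
a, b2 = perceptron_dual(X, y)
print(np.allclose(w, (a * y) @ X))             # True: the dual counts recover w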