=== Study guide for midterm #2 ===

Friday November 17, 1:00-1:50pm, in Center 216

Lectures covered:
-- Regression
-- Logistic regression
-- Basics of optimization
-- The Perceptron
-- Support vector machines

1. Go through the lectures and make sure you understand them.

2. Go through the corresponding homeworks (HW4 to HW7) and make sure you can do all the problems. Many exam questions will closely resemble homework problems.

3. Memorize the following formulas:
a) Loss functions for least-squares regression, ridge regression, and Lasso
		L_ols   = sum (y - (wx + b))^2
		L_ridge = sum (y - (wx + b))^2 + lambda * ||w||^2
		L_lasso = sum (y - (wx + b))^2 + lambda * ||w||_1
	(a NumPy sketch of these three losses appears after this list)

b) Form of the logistic regression prediction
		P(y | x) = 1 / (1 + exp(-y(wx + b)))
c) The Perceptron algorithm
		w, b = 0, 0
		# repeat passes over the data until no point is misclassified
		for each point (x, y):
			if y * (wx + b) <= 0:   # misclassified (or on the boundary)
				w = w + y*x
				b = b + y
		return w, b

d) The hard-margin SVM and soft-margin SVM optimization problems
			Hard margin:
				min ||w||^2   s.t.  y_i(w.x_i + b) >= 1 for all i
			Soft margin:
				min ||w||^2 + C sum(E_i)   s.t.  y_i(w.x_i + b) >= 1 - E_i,  E_i >= 0
e) The dual form of the Perceptron algorithm
			w = sum_i (a_i * y_i * x_i), where a_i counts the updates made on point i
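
As a quick self-check on item (a), here is a minimal NumPy sketch of the three regression losses. The data, the parameter values, and the choice of lambda are made-up illustrations, not anything specified in the lectures.

import numpy as np

def ols_loss(w, b, X, y):
    # sum of squared residuals
    return np.sum((y - (X @ w + b)) ** 2)

def ridge_loss(w, b, X, y, lam):
    # OLS plus an L2 penalty on w (b is not penalized)
    return ols_loss(w, b, X, y) + lam * np.sum(w ** 2)

def lasso_loss(w, b, X, y, lam):
    # OLS plus an L1 penalty on w
    return ols_loss(w, b, X, y) + lam * np.sum(np.abs(w))

# tiny example: n = 4 points in d = 2 dimensions (illustrative values)
X = np.array([[1.0, 2.0], [2.0, 0.0], [0.0, 1.0], [3.0, 1.0]])
y = np.array([3.0, 2.0, 1.0, 4.0])
w, b, lam = np.array([1.0, 1.0]), 0.0, 0.1
print(ols_loss(w, b, X, y), ridge_loss(w, b, X, y, lam), lasso_loss(w, b, X, y, lam))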

4. The following types of questions are *guaranteed* to be on the exam:
a) Write down a gradient descent (or stochastic gradient descent) algorithm for minimizing this loss function. (A runnable SGD sketch for the least-squares loss appears after this list.)
		initialize w; repeat until convergence:
			# choose a point at random
			pick (x, y) uniformly from the data
			# update using the gradient at that point only
			w = w - alpha * gradient_L(w; x, y)
			# stop when the loss barely changes
			if |L(w_new) - L(w_old)| < threshold: return w

b) Are the following (univariate/multivariate) functions convex, concave, both, or neither?
	linear: both
	x^3: neither
	x^2: convex

c) In the following example, draw the decision boundary as well as the left and right boundaries, and compute the margin.
	margin = 2/||w||
	decision boundary: wx + b = 0; right boundary: wx + b = +1; left boundary: wx + b = -1
	(to sketch each line in 2D, find its intercepts by setting x_1 = 0 and then x_2 = 0)
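
For question a) above, here is a minimal runnable sketch of stochastic gradient descent on the least-squares loss. The learning rate, iteration cap, convergence check, and synthetic data are illustrative assumptions.

import numpy as np

def sgd_least_squares(X, y, alpha=0.01, max_iter=10000, tol=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    prev_loss = np.inf
    for t in range(max_iter):
        # choose one point at random
        i = rng.integers(n)
        xi, yi = X[i], y[i]
        # gradient of (y_i - (w.x_i + b))^2 at that single point
        resid = yi - (xi @ w + b)
        w += alpha * 2 * resid * xi
        b += alpha * 2 * resid
        # every 100 steps, check whether the full loss has stopped changing
        if t % 100 == 0:
            loss = np.sum((y - (X @ w + b)) ** 2)
            if abs(prev_loss - loss) < tol:
                break
            prev_loss = loss
    return w, b

# synthetic data generated from y = 2*x1 - x2 + 1 plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2 * X[:, 0] - X[:, 1] + 1 + 0.01 * rng.normal(size=200)
print(sgd_least_squares(X, y))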

Please come early!

STUDY QUESTIONS FOR MIDTERM 2

=== Linear regression ===

  1. We have three different loss functions for linear regression.

(a) What are they? Ordinary least squares (OLS), ridge regression, and Lasso

$$ L_{OLS}(w) = \sum_i (y_i - (w \cdot x_i + b))^2\\ L_{lasso}(w) = \sum_i (y_i - (w \cdot x_i + b))^2 + \lambda\|w\|_1\\ L_{ridge}(w) = \sum_i (y_i - (w \cdot x_i + b))^2 + \lambda\|w\|_2^2 $$

(b) What are the differences between them: why would one use one rather than the other?
OLS: no regularization, so it can overfit, especially with multicollinearity.
Lasso (L1): the penalty can force coefficients exactly to 0 as the regularization parameter grows, so it promotes sparsity.
Ridge (L2): shrinks coefficients toward 0 but does not set them exactly to 0, so it does not promote sparsity.

(c) One of them is a strict generalization of another; what are these? Ridge regression is a strict generalization of OLS: setting the regularization parameter lambda to 0 recovers the OLS loss exactly.

2. What is the trick for incorporating the parameter b into the vector w? Append a constant feature 1 to every x (so x goes from d to d+1 dimensions) and append b as an extra coordinate of w; then w.x + b becomes a single dot product. (A sketch of this trick follows.)
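
A minimal NumPy sketch of the bias trick combined with the ridge closed-form solution. The synthetic data and the lambda value are illustrative assumptions; note that for simplicity this version penalizes b along with w, which the standard ridge objective does not.

import numpy as np

# synthetic data: n = 100 points in d = 3 dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0 + 0.1 * rng.normal(size=100)

# bias trick: append a constant-1 feature, so b becomes the last coordinate of w
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])    # shape (n, d+1)

# ridge closed form on the augmented data: w = (X^T X + lambda*I)^{-1} X^T y
lam = 0.1
d1 = X_aug.shape[1]
w_aug = np.linalg.solve(X_aug.T @ X_aug + lam * np.eye(d1), X_aug.T @ y)

w, b = w_aug[:-1], w_aug[-1]
print("w =", w, "b =", b)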

=== Logistic regression ===

  1. What are two sources of uncertainty in prediction?
  2. Consider a logistic regression model given by parameters w in R^d and b.

$$ P(Y|X) = \frac{1}{1 + exp(-y(w^Tx + b))} $$

(a) What is the dimension of the data? Each x is in R^d, so the full dataset is an n x d matrix.
(b) What is the decision boundary? The set of points where w^Tx + b = 0.
(c) What is the set of points for which Pr(Y=1) >= 3/4?

$$ P(Y = 1 | x) = \frac{1}{1 + \exp(-(w^Tx + b))} \ge \frac{3}{4}\\

\iff 1 + \exp(-(w^Tx + b)) \le \frac{4}{3}\\

\iff \exp(-(w^Tx + b)) \le \frac{1}{3}\\

\iff w^Tx + b \ge -\ln\left(\tfrac{1}{3}\right) = \ln 3

$$
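
A quick numeric sanity check of that threshold: at w^Tx + b = ln 3 the sigmoid should give exactly 3/4.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.log(3)))   # 0.75, so points with w^Tx + b >= ln 3 have P(Y=1) >= 3/4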

  1. Learning a logistic regression model.

(a) What is the loss function for training a logistic regression model given n data points?

$$ L(w) = \sum_i \ln(1 + e^{z_i}), \quad z_i = -y^i (w \cdot x^i)\\

\nabla_w L = \sum_i \frac{-y^ix^i}{1+\exp(y^i (w \cdot x^i))} $$

(b) Is this an optimization problem to which we can hope to find the globally optimum solution? Why or why not? Yes: the logistic loss is convex, so every local minimum is a global minimum and gradient descent can reach the globally optimal solution.
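
A minimal NumPy sketch of the loss and gradient from (a), assuming labels y in {-1, +1} and with b folded into w via the bias trick; np.logaddexp is used so that ln(1 + e^z) stays numerically stable. The tiny dataset is made up for illustration.

import numpy as np

def logistic_loss(w, X, y):
    # sum_i ln(1 + exp(-y_i * (w . x_i)))
    z = -y * (X @ w)
    return np.sum(np.logaddexp(0.0, z))

def logistic_grad(w, X, y):
    # sum_i (-y_i x_i) / (1 + exp(y_i * (w . x_i)))
    s = 1.0 / (1.0 + np.exp(y * (X @ w)))
    return X.T @ (-y * s)

# tiny example; the last column of X is all ones (bias trick)
X = np.array([[1.0, 2.0, 1.0], [-1.0, 0.5, 1.0], [2.0, -1.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])
w = np.zeros(3)
print(logistic_loss(w, X, y), logistic_grad(w, X, y))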

  1. Bag-of-words representation of text documents.

Given a vocabulary V and a text document, describe how to convert the document into a vector of dimension the size of V.

V is a set of words (each word appears once) and the document is a sequence of words. To convert the document into a vector of dimension |V|, fix an ordering of V so that each index corresponds to one word; the entry at each index is the number of times that word appears in the document. Words in the document that are not in V are ignored.
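
A minimal sketch of this conversion, assuming the document has already been tokenized into a list of lowercase words; the vocabulary and document below are made up for illustration.

def bag_of_words(vocab, doc_words):
    # fix an ordering of the vocabulary: word -> index
    index = {word: i for i, word in enumerate(sorted(vocab))}
    vec = [0] * len(vocab)
    for word in doc_words:
        if word in index:          # words outside V are ignored
            vec[index[word]] += 1
    return vec

vocab = {"the", "cat", "sat", "mat", "dog"}
doc = ["the", "cat", "sat", "on", "the", "mat"]
print(bag_of_words(vocab, doc))    # counts, indexed in alphabetical order of V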

=== First and second derivatives ===

  1. How do we take the derivative of a loss function L: R^d -> R? Compute gradient vector:

$$ \nabla L(w) = \left( \frac{\partial L}{\partial w_1}, \ldots, \frac{\partial L}{\partial w_d} \right) $$

  1. How do we take the second derivative? Hessian matrix:

$$ H(w)_{jk} = \frac{\partial^2 L}{\partial w_j \partial w_k}, \quad H(w) \in \mathbb{R}^{d \times d} $$
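
A small numerical sketch of both objects for one concrete loss, using central finite differences; the quadratic loss L(w) = ||Xw - y||^2 and the data are illustrative assumptions (its true gradient is 2X^T(Xw - y) and its true Hessian is the constant matrix 2X^TX).

import numpy as np

X = np.array([[1.0, 2.0], [3.0, 1.0], [0.0, 1.0]])
y = np.array([1.0, 2.0, 0.5])

def L(w):
    r = X @ w - y
    return r @ r

def numerical_gradient(f, w, eps=1e-6):
    g = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w); e[j] = eps
        g[j] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

def numerical_hessian(f, w, eps=1e-4):
    d = len(w)
    H = np.zeros((d, d))
    for j in range(d):
        e = np.zeros(d); e[j] = eps
        # column j of the Hessian: how the gradient changes along w_j
        H[:, j] = (numerical_gradient(f, w + e) - numerical_gradient(f, w - e)) / (2 * eps)
    return H

w0 = np.array([0.5, -1.0])
print(numerical_gradient(L, w0))   # d-dimensional gradient vector
print(numerical_hessian(L, w0))    # d x d Hessian, approximately 2 X^T X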

=== Gradient descent ===

  1. Let L: R^d -> R denote a loss function that we wish to minimize.

(a) Write down a gradient descent algorithm for minimizing L.

	initialize w
	for t in range(max_iter):               # or until convergence
		w = w - alpha * gradient_L(w)       # gradient computed over the entire dataset

(b) Write down a stochastic gradient descent algorithm for minimizing L.

	initialize w
	for t in range(max_iter):               # or until convergence
		pick a random point (x, y) from the data
		w = w - alpha * gradient_L(w; x, y) # gradient at that single point only

  1. For what kinds of problems will these procedures find the global minimum? For convex, differentiable loss functions (with a suitable step size): there every local minimum is a global minimum, so gradient descent and SGD converge to the global minimum. On non-convex losses they can get stuck in local minima or plateaus.

=== Positive semidefinite matrices ===

  1. What is the definition of a positive semidefinite matrix?

A symmetric matrix M in R^{d x d} is positive semidefinite if, for every vector x in R^d,

$$ x^TMx \ge 0 $$

  1. What are some common types of matrices that are positive semidefinite?

Covariance matrices; the Hessian of a convex function. (A quick numerical check appears below.)
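
A minimal NumPy check of the definition on a covariance matrix; the random data are an illustrative assumption. All eigenvalues should be nonnegative, and x^TMx should be nonnegative for any x.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))        # 50 samples, 4 features
M = np.cov(X, rowvar=False)         # 4 x 4 covariance matrix

# eigenvalue test: a symmetric matrix is PSD iff all eigenvalues are >= 0
print(np.linalg.eigvalsh(M) >= -1e-10)

# definition test: x^T M x >= 0 for an arbitrary x
x = rng.normal(size=4)
print(x @ M @ x >= 0)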

=== Convexity ===

  1. What is the technical definition of convexity?

$$ f(ta + (1 - t)b) \le tf(a) + (1-t)f(b) \;\; \text{for all } a, b \text{ and all } t \in [0, 1] $$

  1. Checking convexity.
(a) How to determine if a function F: R -> R is convex? Use the definition of convexity, or the second-derivative test: F''(x) >= 0 everywhere.
(b) What about a function F: R^d -> R? Check that the Hessian is positive semidefinite everywhere. (A small numerical illustration of the second-derivative test follows.)
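
A small numerical illustration of the second-derivative test from (a), applied to the functions from the earlier "guaranteed question" (x^2, x^3, and a linear function). Checking the sign on a grid of points is only a sanity check, not a proof of convexity.

import numpy as np

def second_derivative(f, xs, eps=1e-3):
    # central finite-difference approximation of f''
    return (f(xs + eps) - 2 * f(xs) + f(xs - eps)) / eps**2

xs = np.linspace(-3, 3, 61)
for name, f in [("x^2", lambda x: x**2), ("x^3", lambda x: x**3), ("2x+1", lambda x: 2*x + 1)]:
    d2 = second_derivative(f, xs)
    print(name, "| convex on grid:", bool(np.all(d2 >= -1e-6)),
          "| concave on grid:", bool(np.all(d2 <= 1e-6)))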

=== Perceptron algorithm ===

  1. Primal and dual.

(a) Write down the (binary) Perceptron algorithm in both primal and dual form.

Primal:

w, b = 0, 0
while some point is still misclassified:
	for i in 1:n:
		# if misclassified (or exactly on the boundary)
		if y^i * (w * x^i + b) <= 0:
			w += y^i * x^i
			b += y^i
return w, b
	

Dual:

a = (0, ..., 0) (one counter per point), b = 0
while some point is still misclassified:
	for i in 1:n:
		# if misclassified under the current a, b
		if y^i * (sum_j a_j * y^j * (x^j * x^i) + b) <= 0:
			a_i += 1
			b += y^i
return a, b


(b) What is the relationship between the solutions of the two? They learn the same decision boundary: the primal weight vector can be recovered from the dual counts as

$$ w = \sum_i a_iy^ix^i $$

(A runnable sketch of both forms follows.)
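
A minimal runnable sketch of both forms on a small linearly separable dataset; the data and the pass limit are illustrative assumptions.

import numpy as np

def perceptron_primal(X, y, max_passes=100):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_passes):
        mistakes = 0
        for i in range(n):
            if y[i] * (X[i] @ w + b) <= 0:   # misclassified (or on the boundary)
                w += y[i] * X[i]
                b += y[i]
                mistakes += 1
        if mistakes == 0:                    # converged: a full pass with no updates
            break
    return w, b

def perceptron_dual(X, y, max_passes=100):
    n = X.shape[0]
    a, b = np.zeros(n), 0.0
    G = X @ X.T                              # Gram matrix of inner products x^j . x^i
    for _ in range(max_passes):
        mistakes = 0
        for i in range(n):
            if y[i] * (np.sum(a * y * G[:, i]) + b) <= 0:
                a[i] += 1
                b += y[i]
                mistakes += 1
        if mistakes == 0:
            break
    return a, b

# tiny linearly separable dataset with labels in {-1, +1}
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w, b = perceptron_primal(X, y)
a, b_dual = perceptron_dual(X, y)
print("primal:", w, b)
print("dual recovers the same w:", (a * y) @ X, b_dual)   # w = sum_i a_i y^i x^i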