Linear classifier

    Binary linear classifier

    A binary linear classifier is defined as follows

        score(x) = w · x

    where x is the feature vector we want to classify and w is the weight vector that encodes which features the classifier thinks are important. It returns the first class if the score is greater than zero, and the other class otherwise. If a data set is not linearly separable, a linear classifier often has a hard time learning an optimal model.
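    As a concrete illustration, a minimal sketch in Python with NumPy (the weight and feature values here are made up):

        import numpy as np

        # Hypothetical weight vector learned by some training procedure.
        w = np.array([0.4, -1.2, 0.7])

        # Feature vector we want to classify.
        x = np.array([1.0, 0.5, 2.0])

        score = np.dot(w, x)           # the linear score w . x
        label = 1 if score > 0 else 0  # first class if the score is greater than zero
        print(score, label)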

    Logistic regression

    The name is somewhat confusing because it is actually a classifier. It is a linear classifier that gives probabilistic scores. To get probabilities we need to use a logistic or sigmoid function

        σ(z) = 1 / (1 + e^(−z))

    In a linear model with probabilities we can train the model by selecting weights that assign a high probability to the data. Therefore, we need to adjust w so that each output label in the training data gets a high probability.
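    A small sketch of how the sigmoid turns a linear score into a probability (NumPy; the weights are made up):

        import numpy as np

        def sigmoid(z):
            # logistic function: maps any real score to (0, 1)
            return 1.0 / (1.0 + np.exp(-z))

        w = np.array([0.4, -1.2, 0.7])   # hypothetical weights
        x = np.array([1.0, 0.5, 2.0])    # feature vector

        p = sigmoid(np.dot(w, x))  # probability of the positive class
        print(p)                   # roughly 0.77 for these values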

    Formally this is defined by the likelihood function

        L(w) = ∏_i P(y_i | x_i; w)

    which translates to

        L(w) = ∏_i σ(w · x_i)^(y_i) · (1 − σ(w · x_i))^(1 − y_i)

    in our case, with labels y_i ∈ {0, 1}. Taking the logarithm and negating it converts this into the log loss function.
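    As a sketch, the log loss evaluated over a tiny made-up data set (NumPy; weights and labels are invented for illustration):

        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        # Made-up training data: rows of X are feature vectors, y are labels in {0, 1}.
        X = np.array([[1.0, 0.5], [0.2, 1.5], [2.0, 0.1]])
        y = np.array([1, 0, 1])
        w = np.array([0.8, -0.3])  # hypothetical weights

        p = sigmoid(X @ w)  # P(y = 1 | x) for each instance
        log_loss = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
        print(log_loss)     # smaller is better; training adjusts w to reduce this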

    Multiclass classification

    Two main ideas

    • break down the problem into simpler pieces and create a classifier for each piece
    • adjust the model to handle multiple classes directly

    There are two approaches we can use to decompose a multiclass problem into binary problems

    • one-versus-rest [1]

    • one-versus-one [2]

    Built-in classifiers like the perceptron or logistic regression will do this automatically (one-versus-rest).
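    Both decompositions can also be applied explicitly; a sketch using scikit-learn's wrappers (this assumes scikit-learn is installed, and the toy data is made up):

        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

        # Toy 3-class problem.
        X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                                   n_classes=3, random_state=0)

        ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)  # one binary classifier per class
        ovo = OneVsOneClassifier(LogisticRegression()).fit(X, y)   # one per pair of classes

        print(ovr.predict(X[:5]), ovo.predict(X[:5]))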

    Instead of using the sigmoid we use the softmax function in the multiclass scenario.

    When training, instead of the log loss we use the cross-entropy loss.
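    A minimal sketch of the softmax and the cross-entropy loss for a single instance (NumPy; the class scores and label are made up):

        import numpy as np

        def softmax(z):
            # subtract the max for numerical stability; the result sums to 1
            e = np.exp(z - np.max(z))
            return e / e.sum()

        scores = np.array([2.0, 0.5, -1.0])  # one linear score per class, w_c . x
        probs = softmax(scores)

        y = 0                              # the correct class index (made up)
        cross_entropy = -np.log(probs[y])  # cross-entropy loss for this instance
        print(probs, cross_entropy)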

    Linear regression

    Similar to the linear classifier, a linear regression model calculates its score like this

        y(x) = w · x

    where x is the encoded feature vector and w is the weight vector. The output is now a numerical value.

    In least-squares regression we use the error function

        E(w) = Σ_i (w · x_i − y_i)²

    Minimizing this error function finds the weight vector that minimizes the squared error over the training set. For each training instance we look at the predicted value and measure its distance from the labeled value.
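    A sketch of minimizing the squared error exactly with NumPy's least-squares solver (the toy data is made up):

        import numpy as np

        # Made-up training data: rows of X are feature vectors, y are numeric labels.
        X = np.array([[1.0, 0.5], [0.2, 1.5], [2.0, 0.1], [1.5, 1.0]])
        y = np.array([2.0, 1.1, 3.9, 3.2])

        # Weight vector minimizing sum_i (w . x_i - y_i)^2 over the training set.
        w, *_ = np.linalg.lstsq(X, y, rcond=None)

        predictions = X @ w
        squared_error = np.sum((predictions - y) ** 2)
        print(w, squared_error)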

    However, computing this over the whole training set can be quite expensive, which is why stochastic gradient descent is often used in practice. In stochastic gradient descent we consider just a single instance at a time, with loss

        (w · x_i − y_i)²

    Thus, the gradient of the least-squares loss with respect to w is

        ∇_w (w · x_i − y_i)² = 2 (w · x_i − y_i) x_i
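    A sketch of stochastic gradient descent using this gradient (NumPy; the learning rate and data are made up):

        import numpy as np

        X = np.array([[1.0, 0.5], [0.2, 1.5], [2.0, 0.1], [1.5, 1.0]])
        y = np.array([2.0, 1.1, 3.9, 3.2])

        w = np.zeros(X.shape[1])  # start from all-zero weights
        eta = 0.05                # learning rate (made up)

        for epoch in range(100):
            for x_i, y_i in zip(X, y):
                error = np.dot(w, x_i) - y_i  # prediction minus label
                w -= eta * 2 * error * x_i    # step against the gradient 2 * error * x_i
        print(w)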

    Keeping the model simple

    We can keep the model simple by adding a regularization term to the linear regression model. By adding this term we keep the weights small. For example, by penalizing the squared length we achieve

        R(w) = ‖w‖² = Σ_j w_j²

    which is called an L2 regularizer. Another common regularizer is

        R(w) = ‖w‖₁ = Σ_j |w_j|

    which is called an L1 regularizer.

    If we combine the loss function with the regularizer we get

        E(w) = Σ_i (w · x_i − y_i)² + λ · R(w)

    where λ controls the strength of the regularization.
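    A sketch of evaluating the combined objective for a given weight vector (NumPy; the weights, data, and regularization strength are made up):

        import numpy as np

        X = np.array([[1.0, 0.5], [0.2, 1.5], [2.0, 0.1], [1.5, 1.0]])
        y = np.array([2.0, 1.1, 3.9, 3.2])
        w = np.array([1.5, 0.8])   # hypothetical weights
        lam = 0.1                  # regularization strength (made up)

        squared_loss = np.sum((X @ w - y) ** 2)
        l2_penalty = lam * np.sum(w ** 2)     # penalize the squared length of w
        l1_penalty = lam * np.sum(np.abs(w))  # the alternative L1 regularizer
        print(squared_loss + l2_penalty, squared_loss + l1_penalty)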

    Bias

    A linear classifier can also be expressed with a bias term, in which case it looks like this

        score(x) = w · x + b

    where b is the bias (often also called offset or intercept).
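    A small sketch of the score with an explicit bias term (values made up); an equivalent trick is to append a constant 1 feature to x and fold b into w:

        import numpy as np

        w = np.array([0.4, -1.2, 0.7])
        b = -0.5                       # bias / offset / intercept (made up)
        x = np.array([1.0, 0.5, 2.0])

        score = np.dot(w, x) + b
        # Equivalent formulation: extend x with a constant 1 and put b at the end of w.
        score_via_trick = np.dot(np.append(w, b), np.append(x, 1.0))
        print(score, score_via_trick)  # both are 0.7 for these values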

    Different models

    Classifiers

    • perceptron
    • logistic regression

    Regressors

    • linear regression [3] (no regularization)

    • ridge [4] (the combination of least squares loss with L2 regularization)

    • lasso [5] (the combination of least squares loss with L1 regularization)

    • linear SVR [6]
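    As an illustration, all four regressors have scikit-learn implementations; a minimal sketch (this assumes scikit-learn is installed, with made-up toy data and illustrative hyperparameters):

        import numpy as np
        from sklearn.linear_model import LinearRegression, Ridge, Lasso
        from sklearn.svm import LinearSVR

        X = np.random.RandomState(0).rand(50, 3)  # made-up features
        y = X @ np.array([1.0, -2.0, 0.5]) + 0.3  # made-up linear target

        models = {
            "linear regression": LinearRegression(),
            "ridge":             Ridge(alpha=1.0),
            "lasso":             Lasso(alpha=0.01),
            "linear SVR":        LinearSVR(),
        }
        for name, model in models.items():
            model.fit(X, y)
            print(name, model.coef_)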

    References