Linear classifier
Binary linear classifier
A binary linear classifier is defined as

    score(x) = w · x

where x is the feature vector we want to classify and w is the weight vector that encodes which features the classifier considers important. The classifier returns the first class if the score is greater than zero, and the other class otherwise. If a data set is not linearly separable, a linear classifier often has a hard time learning an optimal decision boundary.
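The definition above can be sketched in a few lines of Python. The weights and feature values here are made-up illustrations, and the two classes are labeled 1 and -1 for convenience:

```python
def predict(w, x):
    # Score is the dot product between the weight vector and the feature vector.
    score = sum(wi * xi for wi, xi in zip(w, x))
    # Positive score -> first class, otherwise the other class.
    return 1 if score > 0 else -1

w = [0.4, -0.2, 0.1]   # hypothetical learned weights
x = [1.0, 2.0, 3.0]    # feature vector to classify
print(predict(w, x))   # score = 0.4 - 0.4 + 0.3 = 0.3 > 0, so class 1
```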
Logistic regression
The name is somewhat confusing because logistic regression is a classifier. It is a linear classifier that outputs probabilistic scores. To turn scores into probabilities we pass them through the logistic (sigmoid) function

    σ(z) = 1 / (1 + e^(−z)),   so that   P(y = 1 | x) = σ(w · x)

In a linear model with probabilities we can train the model by selecting weights that assign a high probability to the data. Therefore, we need to adjust w so that each output label gets a high probability.
Formally this is defined by the likelihood function

    L(w) = ∏_i P(y_i | x_i; w)

which translates to

    L(w) = ∏_i σ(w · x_i)^(y_i) (1 − σ(w · x_i))^(1 − y_i)

in our case, with labels y_i ∈ {0, 1}. Taking the negative logarithm turns the product into a sum and gives the log loss function.
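A minimal sketch of the log loss as the negative log-likelihood, using a toy data set and hypothetical weights:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(w, data):
    # data: list of (feature_vector, label) pairs with labels in {0, 1}.
    # Returns the negative log-likelihood of the data under the model.
    total = 0.0
    for x, y in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

w = [2.0, -2.0]                            # hypothetical weights
data = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]  # toy data set
print(log_loss(w, data))                   # lower is better
```

Note that maximizing the likelihood is the same as minimizing this loss; these weights fit the toy data better than the all-zero weight vector, so they get a lower loss.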
Multiclass classification
Two main ideas:
- break the problem down into simpler pieces and create a classifier for each piece
- adjust the model to handle multiple classes directly
There are two common approaches for converting a multiclass problem into binary problems: one-versus-rest and one-versus-one. Built-in classifiers like the perceptron or logistic regression will do this automatically (one-versus-rest).
In the multiclass scenario we use the softmax function instead of the sigmoid:

    softmax(z)_i = e^(z_i) / Σ_j e^(z_j)

When training, the log loss generalizes to the cross-entropy loss: the negative log-probability the model assigns to the correct class.
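The softmax and cross-entropy definitions above can be sketched as follows; the scores are made-up numbers:

```python
import math

def softmax(scores):
    # Subtract the max score before exponentiating, for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    # Negative log-probability assigned to the correct class.
    return -math.log(probs[true_class])

p = softmax([2.0, 1.0, 0.1])  # hypothetical scores for three classes
print(p)                      # probabilities summing to 1
print(cross_entropy(p, 0))    # loss if the first class is correct
```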
Linear regression
Similar to the linear classifier, a linear regression model calculates its score as

    y(x) = w · x

where x is the encoded feature vector and w is the weight vector. The output is now numerical rather than a class.
In least-squares regression we use the error function

    E(w) = Σ_i (w · x_i − y_i)^2

Minimizing this error function finds the weight vector that minimizes the squared error over the training set: for each training instance we look at the predicted value w · x_i and measure its squared distance from the labeled value y_i.
However, minimizing over the whole training set at once can be quite expensive, which is why stochastic gradient descent is often used in practice. In stochastic gradient descent we consider just a single instance (x, y) at a time. Thus, the gradient of the least-squares loss with respect to w is

    ∇_w (w · x − y)^2 = 2 (w · x − y) x

and each update takes a small step in the opposite direction of this gradient.
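A minimal sketch of stochastic gradient descent for least-squares regression using the gradient above. The synthetic data, learning rate, and number of epochs are illustrative choices:

```python
import random

def sgd_step(w, x, y, lr):
    # Gradient of (w . x - y)^2 with respect to w is 2 * (w . x - y) * x.
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [wi - lr * 2 * err * xi for wi, xi in zip(w, x)]

# Synthetic noiseless data generated from a known weight vector.
random.seed(0)
true_w = [3.0, -1.0]
xs = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(200)]
data = [(x, true_w[0] * x[0] + true_w[1] * x[1]) for x in xs]

w = [0.0, 0.0]
for _ in range(20):                 # epochs over the training set
    for x, y in data:
        w = sgd_step(w, x, y, lr=0.1)
print(w)                            # approaches the true weights [3.0, -1.0]
```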
Keeping the model simple
We can keep the model simple by adding a regularization term to the linear regression model. This term keeps the weights small. For example, penalizing the squared length of the weight vector gives

    R(w) = ||w||₂² = Σ_i w_i²

which is called an L2 regularizer. Another common regularizer is

    R(w) = ||w||₁ = Σ_i |w_i|

which is called an L1 regularizer.

If we combine the loss function with the regularizer we get the objective

    minimize_w  Σ_i Loss(w · x_i, y_i) + λ R(w)

where λ controls the strength of the regularization.
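The two combined objectives can be sketched directly; the data point, weights, and λ value below are toy numbers:

```python
def ridge_objective(w, data, lam):
    # Least-squares loss plus an L2 penalty on the weights.
    loss = sum((sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2 for x, y in data)
    return loss + lam * sum(wi ** 2 for wi in w)

def lasso_objective(w, data, lam):
    # Least-squares loss plus an L1 penalty on the weights.
    loss = sum((sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2 for x, y in data)
    return loss + lam * sum(abs(wi) for wi in w)

data = [([1.0], 2.0)]                      # one toy training instance
print(ridge_objective([2.0], data, 0.5))   # zero loss, L2 penalty 0.5 * 4
print(lasso_objective([2.0], data, 0.5))   # zero loss, L1 penalty 0.5 * 2
```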
Bias
A linear classifier can also be expressed with a bias term, in which case it looks like this:

    score(x) = w · x + b

where b is the bias (often also called the offset or intercept).
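A common trick is to fold the bias into the weight vector by appending a constant feature of 1 to every input, so the bias is learned like any other weight. A sketch with toy numbers:

```python
def score_with_bias(w, b, x):
    # Explicit bias term added to the dot product.
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def score_augmented(w_aug, x):
    # Append a constant feature 1; the last weight plays the role of the bias.
    return sum(wi * xi for wi, xi in zip(w_aug, x + [1.0]))

w, b, x = [0.5, -1.0], 0.25, [2.0, 1.0]
# Both formulations give the same score.
print(score_with_bias(w, b, x))
print(score_augmented(w + [b], x))
```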
Different models
Classifiers
- perceptron
- logistic regression
Regressors
- linear regression [3] (no regularization)
- ridge [4] (least-squares loss combined with L2 regularization)
- lasso [5] (least-squares loss combined with L1 regularization)
- linear SVR [6]