Linear classifier

    Binary linear classifier

    A binary linear classifier is defined as follows

        score(x) = w · x

    where x is the feature vector we want to classify and w is the weight vector that encodes which features the classifier thinks are important. It returns the first class if the score is greater than zero, and the other class otherwise. If a data set is not linearly separable, a linear classifier often has a hard time learning an optimal model.
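    As a concrete illustration, a minimal sketch in Python with NumPy (the weight and feature values here are made up):

        import numpy as np

        # Hypothetical weight vector learned by some training procedure.
        w = np.array([0.4, -1.2, 0.7])

        # Feature vector we want to classify.
        x = np.array([1.0, 0.5, 2.0])

        score = np.dot(w, x)           # the linear score w . x
        label = 1 if score > 0 else 0  # first class if the score is greater than zero
        print(score, label)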

    Logistic regression

    The name is somewhat confusing because it is actually a classifier. It is a linear classifier that gives probabilistic scores. To get probabilities we need to use a logistic or sigmoid function

        σ(z) = 1 / (1 + e^(−z))

    In a linear model with probabilities we can train the model by selecting weights that assign a high probability to the data. Therefore, we need to adjust w so that each output label in the training data gets a high probability.
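    A small sketch of how the sigmoid turns a linear score into a probability (NumPy; the weights are made up):

        import numpy as np

        def sigmoid(z):
            # logistic function: maps any real score to (0, 1)
            return 1.0 / (1.0 + np.exp(-z))

        w = np.array([0.4, -1.2, 0.7])   # hypothetical weights
        x = np.array([1.0, 0.5, 2.0])    # feature vector

        p = sigmoid(np.dot(w, x))  # probability of the positive class
        print(p)                   # roughly 0.77 for these values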

    Formally this is defined by the likelihood function

        L(w) = ∏_i P(y_i | x_i; w)

    which translates to

        L(w) = ∏_i σ(w · x_i)^(y_i) · (1 − σ(w · x_i))^(1 − y_i)

    in our case, with labels y_i ∈ {0, 1}. Taking the logarithm and negating it converts this into the log loss function.
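    As a sketch, the log loss evaluated over a tiny made-up data set (NumPy; weights and labels are invented for illustration):

        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        # Made-up training data: rows of X are feature vectors, y are labels in {0, 1}.
        X = np.array([[1.0, 0.5], [0.2, 1.5], [2.0, 0.1]])
        y = np.array([1, 0, 1])
        w = np.array([0.8, -0.3])  # hypothetical weights

        p = sigmoid(X @ w)  # P(y = 1 | x) for each instance
        log_loss = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
        print(log_loss)     # smaller is better; training adjusts w to reduce this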

    Multiclass classification

    Two main ideas

    • break down the problem into simpler pieces and create a classifier for each piece
    • adjust the model to handle multiple classes directly

    There are two approaches we can use to decompose a multiclass problem into binary problems

    • one-versus-rest [1]

    • one-versus-one [2]

    Built-in classifiers like the perceptron or logistic regression will do this automatically (one-versus-rest).
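    Both decompositions can also be applied explicitly; a sketch using scikit-learn's wrappers (this assumes scikit-learn is installed, and the toy data is made up):

        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

        # Toy 3-class problem.
        X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                                   n_classes=3, random_state=0)

        ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)  # one binary classifier per class
        ovo = OneVsOneClassifier(LogisticRegression()).fit(X, y)   # one per pair of classes

        print(ovr.predict(X[:5]), ovo.predict(X[:5]))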

    Instead of using the sigmoid we use the softmax function in the multiclass scenario.

    When training, instead of the log loss we use the cross-entropy loss.
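    A minimal sketch of the softmax and the cross-entropy loss for a single instance (NumPy; the class scores and label are made up):

        import numpy as np

        def softmax(z):
            # subtract the max for numerical stability; the result sums to 1
            e = np.exp(z - np.max(z))
            return e / e.sum()

        scores = np.array([2.0, 0.5, -1.0])  # one linear score per class, w_c . x
        probs = softmax(scores)

        y = 0                              # the correct class index (made up)
        cross_entropy = -np.log(probs[y])  # cross-entropy loss for this instance
        print(probs, cross_entropy)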

    Linear regression

    Similar to the linear classifier, a linear regression model calculates its score like this

        y(x) = w · x

    where x is the encoded feature vector and w is the weight vector. The output is now a numerical value.

    In least-squares regression we use the error function

        E(w) = Σ_i (w · x_i − y_i)²

    Minimizing this error function finds the weight vector that minimizes the squared error over the training set. For each training instance we look at the predicted value and measure its distance from the labeled value.
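    A sketch of minimizing the squared error exactly with NumPy's least-squares solver (the toy data is made up):

        import numpy as np

        # Made-up training data: rows of X are feature vectors, y are numeric labels.
        X = np.array([[1.0, 0.5], [0.2, 1.5], [2.0, 0.1], [1.5, 1.0]])
        y = np.array([2.0, 1.1, 3.9, 3.2])

        # Weight vector minimizing sum_i (w . x_i - y_i)^2 over the training set.
        w, *_ = np.linalg.lstsq(X, y, rcond=None)

        predictions = X @ w
        squared_error = np.sum((predictions - y) ** 2)
        print(w, squared_error)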

    However, computing this over the whole training set can be quite expensive, which is why stochastic gradient descent is often used in practice. In stochastic gradient descent we consider just a single instance at a time, with loss

        (w · x_i − y_i)²

    Thus, the gradient of the least-squares loss with respect to w is

        ∇_w (w · x_i − y_i)² = 2 (w · x_i − y_i) x_i
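    A sketch of stochastic gradient descent using this gradient (NumPy; the learning rate and data are made up):

        import numpy as np

        X = np.array([[1.0, 0.5], [0.2, 1.5], [2.0, 0.1], [1.5, 1.0]])
        y = np.array([2.0, 1.1, 3.9, 3.2])

        w = np.zeros(X.shape[1])  # start from all-zero weights
        eta = 0.05                # learning rate (made up)

        for epoch in range(100):
            for x_i, y_i in zip(X, y):
                error = np.dot(w, x_i) - y_i  # prediction minus label
                w -= eta * 2 * error * x_i    # step against the gradient 2 * error * x_i
        print(w)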

    Keeping the model simple

    We can keep the model simple by adding a regularization term to the linear regression model. By adding this term we keep the weights small. For example, by penalizing the squared length we achieve

        R(w) = ‖w‖² = Σ_j w_j²

    which is called an L2 regularizer. Another common regularizer is

        R(w) = ‖w‖₁ = Σ_j |w_j|

    which is called an L1 regularizer.

    If we combine the loss function with the regularizer we get

        E(w) = Σ_i (w · x_i − y_i)² + λ · R(w)

    where λ controls the strength of the regularization.
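    A sketch of evaluating the combined objective for a given weight vector (NumPy; the weights, data, and regularization strength are made up):

        import numpy as np

        X = np.array([[1.0, 0.5], [0.2, 1.5], [2.0, 0.1], [1.5, 1.0]])
        y = np.array([2.0, 1.1, 3.9, 3.2])
        w = np.array([1.5, 0.8])   # hypothetical weights
        lam = 0.1                  # regularization strength (made up)

        squared_loss = np.sum((X @ w - y) ** 2)
        l2_penalty = lam * np.sum(w ** 2)     # penalize the squared length of w
        l1_penalty = lam * np.sum(np.abs(w))  # the alternative L1 regularizer
        print(squared_loss + l2_penalty, squared_loss + l1_penalty)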

    Bias

    A linear classifier can also be expressed with a bias term, in which case it looks like this

        score(x) = w · x + b

    where b is the bias (often also called offset or intercept).
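    A small sketch of the score with an explicit bias term (values made up); an equivalent trick is to append a constant 1 feature to x and fold b into w:

        import numpy as np

        w = np.array([0.4, -1.2, 0.7])
        b = -0.5                       # bias / offset / intercept (made up)
        x = np.array([1.0, 0.5, 2.0])

        score = np.dot(w, x) + b
        # Equivalent formulation: extend x with a constant 1 and put b at the end of w.
        score_via_trick = np.dot(np.append(w, b), np.append(x, 1.0))
        print(score, score_via_trick)  # both are 0.7 for these values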

    Different models

    Classifiers

    • perceptron
    • logistic regression

    Regressors

    • linear regression [3] (no regularization)

    • ridge [4] (the combination of least squares loss with L2 regularization)

    • lasso [5] (the combination of least squares loss with L1 regularization)

    • linear SVR [6]
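    As an illustration, all four regressors have scikit-learn implementations; a minimal sketch (this assumes scikit-learn is installed, with made-up toy data and illustrative hyperparameters):

        import numpy as np
        from sklearn.linear_model import LinearRegression, Ridge, Lasso
        from sklearn.svm import LinearSVR

        X = np.random.RandomState(0).rand(50, 3)  # made-up features
        y = X @ np.array([1.0, -2.0, 0.5]) + 0.3  # made-up linear target

        models = {
            "linear regression": LinearRegression(),
            "ridge":             Ridge(alpha=1.0),
            "lasso":             Lasso(alpha=0.01),
            "linear SVR":        LinearSVR(),
        }
        for name, model in models.items():
            model.fit(X, y)
            print(name, model.coef_)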

    References