Optimization problems

    There are two main types of optimization problems

    • unconstrained optimization
    • constrained optimization

    In machine learning many of the problems are of the first type.

    Unconstrained optimization

    In unconstrained optimization we try to find the x that gives us the minimal or maximal value of some function, like min_x f(x).

    Constrained optimization

    In constrained optimization we try to find the x that gives us the minimal or maximal value of some function, where x also needs to satisfy some additional conditions, like min_x f(x) subject to g(x) = 0.

    Gradient descent

    The idea of gradient descent is that we should take a small step in the direction opposite to the gradient and repeat until the gradient is close to zero. We need to walk in the opposite direction because the gradient gives us the direction of the steepest ascent.

    The algorithm could be explained in the following pseudocode

    1. set x to some initial value and select a learning rate
    2. compute the gradient at x
    3. if the gradient is close to zero, terminate
    4. else subtract the learning rate times the gradient from x and go back to step 2.
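
    A minimal sketch of this loop in Python, assuming for illustration that we minimize the one-dimensional function f(x) = (x - 3)^2, whose gradient is 2(x - 3); the starting point, learning rate, and tolerance are arbitrary example choices.

        def f(x):
            # example function to minimize: (x - 3)^2
            return (x - 3) ** 2

        def grad_f(x):
            # gradient (derivative) of f
            return 2 * (x - 3)

        def gradient_descent(x, learning_rate=0.1, tolerance=1e-6, max_steps=10_000):
            for _ in range(max_steps):
                g = grad_f(x)
                # terminate when the gradient is close to zero
                if abs(g) < tolerance:
                    break
                # step in the direction opposite to the gradient
                x = x - learning_rate * g
            return x

        print(gradient_descent(x=0.0))  # approaches the minimum at x = 3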

    We will always reach a minimum or maximum if

    • such a minimum or maximum actually exists
    • the steps are short enough

    There are smarter versions of gradient descent as well that try to adapt the step size depending on how fast the descent is moving or how many steps have already been taken. One simple improvement over ordinary gradient descent is to start with large steps and gradually make them smaller.
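
    One possible way to make the steps gradually smaller is a simple learning-rate decay; the sketch below divides an initial rate by the step count, which is just one illustrative schedule (the gradient function is passed in as a parameter).

        def gradient_descent_decay(x, grad_f, initial_rate=0.5, tolerance=1e-6, max_steps=10_000):
            for step in range(1, max_steps + 1):
                g = grad_f(x)
                if abs(g) < tolerance:
                    break
                # the step size shrinks as the number of steps grows
                learning_rate = initial_rate / step
                x = x - learning_rate * g
            return x

        # usage, reusing the example gradient from above:
        # gradient_descent_decay(0.0, lambda x: 2 * (x - 3))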

    Gradient descent does not guarantee that we find the global minimum/maximum, but it does make sure we reach a minimum/maximum, global or local, if one exists.

    Stochastic gradient descent

    Because the gradient depends on every instance in the training set, it can be quite expensive to compute. Thus a simplified version that only computes the gradient on part of the training set is used in practice, called stochastic gradient descent.

    The algorithm could be explained in the following pseudocode

    1. set x to some initial value and select a learning rate
    2. select a single training sample
    3. compute the gradient using that sample
    4. if we are "done", terminate
    5. else subtract the learning rate times the gradient from x and go back to step 2.
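
    A minimal sketch of stochastic gradient descent, assuming for illustration a least-squares problem where the model is y ≈ w * x with squared-error loss; the toy data and the fixed number of passes over it are assumptions made just for this example.

        import random

        # toy training set: pairs (x, y) that roughly follow y = 2x
        data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)]

        def grad_single(w, sample):
            # gradient of the squared error (w * x - y)^2 with respect to w,
            # computed on a single training sample
            x, y = sample
            return 2 * (w * x - y) * x

        def sgd(w=0.0, learning_rate=0.01, epochs=100):
            for _ in range(epochs):  # "done" here means a fixed number of passes
                for sample in random.sample(data, len(data)):  # visit samples in random order
                    g = grad_single(w, sample)
                    w = w - learning_rate * g
            return w

        print(sgd())  # w ends up close to 2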

    Step four in the pseudocode above raises a question: what does "done" mean? In gradient descent we stop iterating when the gradient is close to zero. In stochastic gradient descent, "close to zero" does not mean the same thing, because we are looking at a small sample of instances whose gradient could very well be zero even when the gradient over the whole training set is not. There are a few solutions to this: run a fixed number of iterations, stop when we have not seen any improvement for some time, or evaluate the model on a separate set that determines when we should terminate (early stopping).
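
    A sketch of the early-stopping idea, assuming we can run one pass of (stochastic) gradient descent and measure a validation loss on a held-out set after each pass; the patience of 5 passes is an arbitrary illustrative choice.

        def train_with_early_stopping(update_model, validation_loss, patience=5, max_epochs=1000):
            # update_model performs one pass of (stochastic) gradient descent,
            # validation_loss evaluates the current model on a held-out set
            best_loss = float("inf")
            epochs_without_improvement = 0
            for _ in range(max_epochs):
                update_model()
                loss = validation_loss()
                if loss < best_loss:
                    best_loss = loss
                    epochs_without_improvement = 0
                else:
                    epochs_without_improvement += 1
                # stop when the validation loss has not improved for a while
                if epochs_without_improvement >= patience:
                    break
            return best_loss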