Generalization

    In machine learning, generalization refers to how well a trained model is able to classify unseen data.

    Generalization gap

    The generalization gap is defined as the difference between a model's performance on training data and its performance on unseen data from the same distribution.

    Prediction

    Predictions are a machine learning model's way of mapping an input to an output.

    Training data

    A training set is collected from a distribution that is meant to resemble the data the model will see when put into practice. The collected data is usually split into training data and test data, with the majority being training data.

    Validation set

    While the model's hyperparameters are being optimized, the model is evaluated on the validation set, which aims to give an unbiased estimate compared to the training set. The validation set will eventually become biased, because we tune the model to get as good performance as possible on that set.

    Test set

    The test set provides the final, unbiased evaluation of the model. After a model has been evaluated on the test set it should not be optimized further, because that would introduce bias.

    Induction

    Induction in machine learning is the process of inferring general rules from specific examples.

    Regression

    A machine learning model for regression tries to map inputs to outputs in a continuous space, instead of the discrete space used in classification.

    Classification

    A machine learning model for classification tries to identify which of a set of categories an observation belongs to, thus mapping an input to an output in a discrete space. Binary classification is a special case of multiclass classification where there are only two classes.

    Ranking

    Ranking is an application of typically supervised, semi-supervised or reinforcement learning wherein the training data has some partial order between items. The order is often induced by a numerical or ordinal score.

    Feature

    Within machine learning, features are individual measurable properties of a phenomenon. Features can be numeric, but also structural information such as strings or graphs. Together they form the patterns that the machine learning model learns.

    Label

    In supervised or semi-supervised learning, a label is the corresponding output in the training data.

    Loss function

    A loss function is formally a function that maps an event or a set of values to a real number representing a cost, and the optimization algorithm tries to minimize that cost. There are different loss functions depending on the type of model. In regression problems the most common loss functions are [1]:

    • Mean absolute error (MAE)
    • Mean absolute percentage error (MAPE)
    • Mean squared error (MSE)
    • Root mean squared error (RMSE)
    • Huber loss
    • Log-cosh loss

    For classification the most common loss functions are [2]:

    • Cross-entropy loss (or log loss)
    • Hinge loss
    • Squared hinge loss
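
    As an illustration, here is a minimal sketch of two of the losses above, mean squared error for regression and cross-entropy for binary classification, assuming NumPy arrays as inputs:

    import numpy as np

    def mean_squared_error(y_true, y_pred):
        # average of the squared differences between targets and predictions
        return np.mean((y_true - y_pred) ** 2)

    def binary_cross_entropy(y_true, p_pred, eps=1e-12):
        # y_true holds 0/1 labels, p_pred holds predicted probabilities
        p = np.clip(p_pred, eps, 1 - eps)  # avoid log(0)
        return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))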

    Parity

    A parity function is a Boolean function whose value is 1 if the input vector has an odd number of ones, and 0 otherwise.
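
    A minimal sketch of a parity function in Python (the function name is illustrative):

    def parity(bits):
        # 1 if the input vector contains an odd number of ones, else 0
        return sum(bits) % 2

    parity([1, 0, 1, 1])  # -> 1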

    Noise

    Noise in machine learning can be a desired property or not, depending on the problem. It may increase the complexity of the model and the training time, which may degrade performance. However, it may also help the model generalize, as in data augmentation.

    Supervised learning

    Supervised learning is one of the major learning paradigms in machine learning. It requires that the training data is labeled; the model thus learns by imitating examples.

    Unsupervised learning

    Unsupervised learning is one of the major learning paradigms in machine learning. Unlike supervised machine learning, unsupervised learning does not have any labeled data but must instead discover certain patterns about the data itself.

    Semi-supervised learning

    Semi-supervised learning can be looked at as a mixture of supervised and unsupervised learning: it combines a small amount of labeled data with a large amount of unlabeled data during the training phase.

    Reinforcement learning

    Reinforcement learning is one of the major learning paradigms in machine learning. It concerns how an agent should take actions in a defined environment in order to maximize the cumulative reward. The reward function is here the objective function.

    Cross-validation

    Cross-validation is a technique used to reduce the number of samples required for training a model, since it removes the need for a separate validation set. One basic approach, called k-fold cross-validation, splits the training set into k smaller sets called folds. Each fold acts as the validation set in turn, for a total of k rounds. After cross-validation the model is evaluated on the test set as usual. The technique can be computationally heavy.
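
    A minimal sketch of k-fold cross-validation with NumPy; the model object and its fit/score methods are assumed (scikit-learn-style), not part of any particular library:

    import numpy as np

    def k_fold_scores(model, X, y, k=5):
        indices = np.random.permutation(len(X))   # shuffle the data once
        folds = np.array_split(indices, k)        # k roughly equal folds
        scores = []
        for i in range(k):
            val_idx = folds[i]                                     # fold i acts as the validation set
            train_idx = np.concatenate(folds[:i] + folds[i + 1:])  # the remaining folds are used for training
            model.fit(X[train_idx], y[train_idx])
            scores.append(model.score(X[val_idx], y[val_idx]))
        return np.mean(scores)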

    Decision tree

    A decision tree is a model that organizes its decision rules in a tree format, where leaves represent class labels and branches represent conjunctions of features that lead to the different decisions. It is these branch conditions that are updated during training. Decision trees can be used for regression as well, wherein internal nodes represent a condition, the branches usually correspond to yes or no given the validity of the condition, and the leaves hold continuous values. Decision trees are among the more popular machine learning models because of their simplicity and interpretability.
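
    A minimal usage sketch, assuming scikit-learn is available (the toy data is made up for illustration):

    from sklearn.tree import DecisionTreeClassifier

    X = [[0, 0], [1, 1], [1, 0], [0, 1]]   # toy feature vectors
    y = [0, 1, 1, 0]                       # toy class labels
    clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
    clf.predict([[1, 1]])                  # -> array([1])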

    Entropy

    Entropy is a measure of disorder, or equivalently of impurity (lack of homogeneity). Thus, it can be seen as how random the data points in a distribution are. Greater disorder results in higher entropy and higher impurity.

    Information gain

    Entropy plays an important role in information gain. In information theory, information gain is the amount of information gained about one random variable from observing another random variable. In the context of decision trees it is a good measure for deciding whether a feature is relevant, although it is not perfect [3].
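
    A minimal sketch of entropy and of information gain for a categorical split, assuming NumPy arrays of labels and feature values (the function names are illustrative):

    import numpy as np

    def entropy(labels):
        # H = -sum_c p_c * log2(p_c) over the class proportions p_c
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def information_gain(labels, feature_values):
        # entropy of the parent minus the weighted entropy of the children
        gain = entropy(labels)
        for v in np.unique(feature_values):
            child = labels[feature_values == v]
            gain -= len(child) / len(labels) * entropy(child)
        return gain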

    Gini score

    Like entropy, the Gini score (Gini impurity) measures how pure a set of data points is. It ranges between 0 and 1, where 0 expresses purity, namely that all data points belong to the same class, whereas 1 indicates that the data points are randomly distributed among the classes.
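
    A corresponding sketch of the Gini impurity, under the same assumptions as the entropy example above:

    import numpy as np

    def gini(labels):
        # Gini = 1 - sum_c p_c^2 over the class proportions p_c
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)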

    Ensemble

    An ensemble method takes advantage of multiple learning algorithms to obtain better predictive performance than could be obtained from any of them alone.

    Boosting

    Boosting is an ensemble technique (meta-algorithm) that builds the ensemble model incrementally by training each new model instance so that it "corrects" the mistakes earlier instances made on the data. Boosting has been shown to yield better results than bagging but also tends to overfit to a higher degree. It is used to reduce bias and variance, and it converts weak learners into strong learners.

    AdaBoost

    AdaBoost is short for Adaptive Boosting. It focuses on instances misclassified by previous classifiers and tweaks each new weak learner in their favor. As long as the performance of each weak learner is even slightly better than random guessing, the final model can be proven to converge to a strong learner. Unlike gradient boosting, it is not formulated around an arbitrary differentiable loss function; it can be viewed as minimizing an exponential loss.

    Gradient boosting

    Like other boosting methods, gradient boosting builds the model by stage-wise improvements of weak learners, but it generalizes the approach to an arbitrary differentiable loss function: each new weak learner is fit to the negative gradient (the pseudo-residuals) of the loss. This flexibility in the choice of loss function is what distinguishes gradient boosting from AdaBoost.
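
    A minimal sketch of stage-wise gradient boosting for regression with squared loss, assuming X and y are NumPy arrays and using scikit-learn's decision tree regressor as the weak learner (the parameter values are illustrative):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gradient_boost(X, y, n_stages=100, learning_rate=0.1):
        prediction = np.full(len(y), y.mean())      # start from a constant model
        trees = []
        for _ in range(n_stages):
            residuals = y - prediction              # negative gradient of the squared loss
            tree = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
            prediction += learning_rate * tree.predict(X)   # stage-wise improvement
            trees.append(tree)
        return trees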

    Stacking

    Stacking is an ensemble technique (meta-algorithm) that combines several different learning algorithms into one. Each base learning algorithm is trained on the available data, and a combiner algorithm is then trained to make the final prediction using the predictions of all the base algorithms as input.

    Bagging

    Bagging, or bootstrap aggregation, is an ensemble technique (meta-algorithm) in which each submodel votes with equal weight. It is a technique for reducing the variance of an estimated prediction function, and it does so by averaging a number of noisy but approximately unbiased models. Bagging trains each model in the ensemble on a randomly drawn subset of the training set. The bootstrap samples differ from each other, but sampling is done with replacement, which means that one instance may occur in several samples or in none at all.
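
    A minimal sketch of bagging for regression, assuming NumPy arrays and a make_model() factory returning a scikit-learn-style estimator (a made-up placeholder, not a library function):

    import numpy as np

    def bagging_fit(make_model, X, y, n_models=10):
        models = []
        for _ in range(n_models):
            idx = np.random.choice(len(X), size=len(X), replace=True)  # bootstrap sample, with replacement
            models.append(make_model().fit(X[idx], y[idx]))
        return models

    def bagging_predict(models, X):
        # each submodel votes with equal weight; here we average the predictions
        return np.mean([m.predict(X) for m in models], axis=0)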

    Random forest

    Random forests are an ensemble learning algorithm based on bagging and decision trees. The predictions of multiple decision trees are combined, which usually gives better performance than a single decision tree alone.

    Spinning

    Spinning is based on the same idea as bagging and is also called feature bagging or random subspace learning: each submodel is trained on a random subset of the features rather than of the examples.

    Weak learner

    A weak learner is a classifier that is only slightly correlated with the true classification; it performs better than random guessing, but not by much.

    Strong learner

    A strong learner is a classifier that is very well correlated with the true classification.

    Linear models

    Linear models base their predictions on linear functions, as the name suggests.

    • support vector machine
    • linear classifiers
    • logistic regression
    • lasso
    • elastic net
    • ridge regression
    • committee
    • patch representation
    • bag of words (text representation)
    • shape representation
    • meta features
    • meta algorithm
    • combinatorial transformations
    • logarithmic transformation
    • precision/recall metric
    • accuracy metric
    • f-measure
    • sensitivity/specificity metric
    • ROC curve
    • AUC score
    • development data
    • jack-knifing
    • imbalanced data
    • induced distribution
    • feature selection https://en.wikipedia.org/wiki/Feature_selection
    • predictive model
    • one-hot encoding
    • TF-IDF
    • mutual information
    • hyperparameters
    • grid search
    • black box optimization
    • automated machine learning (AutoML)
    • shallow decision tree
    • hypothesis space
    • linearly separable
    • least-square regression
    • inter-annotator agreement
    • chance-corrected agreement measure
    • chance agreement probability
    • objective function
    • regularizer
    • unconstrained optimization
    • constrained optimization
    • gradient
    • batch
    • early stopping
    • logistic
    • sigmoid
    • tanh
    • ReLU
    • log odds
    • likelihood function
    • log loss
    • maximum a posteriori
    • Gaussian prior
    • Laplace prior
    • one-versus-rest
    • one-versus-one
    • softmax
    • cross-entropy loss
    • margin
    • structural risk minimization theorem
    • input units/nodes
    • hidden units/nodes
    • output units/nodes
    • activation
    • universal approximation theorem
    • minibatch
    • adaptive

      • adam
      • adagrad
      • RMSProp
    • dropout
    • data augmentation
    • pseudo-residual
    • residual
    • learning rate
    • ensemble size
    • measure
    • downstream task
    • word error rate
    • BLEU
    • overlap-based metric
    • humans-in-the-loop
    • true positives
    • false positives
    • true negatives
    • false negatives
    • coefficient of determination
    • confidence score
    • search engine
    • ranking systems
    • precision at k
    • scorer
    • ranker
    • feature extraction
    • SIFT
    • translational invariance
    • spotting patterns
    • convolutional filters
    • pooling
    • fully connected layers
    • dense layers
    • residual connections
    • normalizations
    • kernel
    • feature map
    • vanishing gradients
    • exploding gradients
    • mathematical instability
    • batch normalization
    • transfer learning
    • freeze and unfreeze model
    • fine-tune model
    • catastrophic forgetting
    • clustering (flat and hierarchical)
    • k-medoids
    • mean shift
    • Gaussian mixture
    • DBSCAN
    • agglomerative (clustering)
    • divisive (clustering)
    • evaluation (internal and external evaluation)
    • silhouette score
    • purity score
    • inverse purity score
    • residual sum of squares
    • NP-hard
    • elbow method
    • density-based clustering methods
    • core point
    • noise point or outlier
    • matrix factorization
    • low rank matrix factorization
    • autoregressive model (time series)
    • exogenous (ARX)
    • sequence-to-sequence
    • attention model
    • transformer (the BERT model)
    • induction
    • histogram
    • loss function

      • squared loss
      • absolute loss
      • zero/one loss
    • spinning
    • exploration-exploitation dilemma

    Bias

    Nonresponse bias

    Inductive bias

    Inductive bias is the preference for one generalization (distinction) over another. If the inductive bias is too far away from the concept that is being learned, the whole learning process might fail.

    Normalization

    Normalization is a good way of keeping the data consistent. There are two basic types of normalization: feature normalization and example normalization.

    Feature normalization

    Go through every feature and apply the same adjustment across all examples. There are two standard techniques: centering and scaling. Centering keeps the data set centered around the origin; scaling makes sure each feature has variance 1 across the training data.

    Example normalization

    Go through every example and adjust it individually. The standard technique is to make sure that each feature vector has length one. The advantage of example normalization is that comparisons between examples are more straightforward.
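
    A minimal sketch of both kinds of normalization with NumPy, assuming X is an examples-by-features matrix:

    import numpy as np

    def feature_normalize(X):
        # same adjustment across all examples: center each feature, then scale to unit variance
        centered = X - X.mean(axis=0)
        return centered / centered.std(axis=0)

    def example_normalize(X):
        # adjust each example individually: scale every feature vector to length one
        return X / np.linalg.norm(X, axis=1, keepdims=True)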

    Approximation error

    The approximation error measures how well the best model in the chosen model family can perform.

    Estimation error

    The estimation error measures how far off a learned classifier is from the optimal classifier of that type.

    Bias-variance trade-off

    The trade-off between approximation and estimation error is usually called the bias/variance trade-off, with bias corresponding to the approximation error and variance corresponding to the estimation error.

    Imbalanced data

    The imbalanced data problem refers to the situation where the distribution from which the data is drawn is imbalanced. This is problematic because machine learning algorithms try to minimize the error and will therefore predict in favor of the majority class; they can often achieve seemingly good results by doing nothing useful. Hence, plain accuracy is probably not the metric you care about.

    Feature selection

    Embedded methods

    Embedded methods learn which features contribute most to the model while the model is being created. Common examples are regularization methods.

    Regularization methods

    Regularization methods, or penalization methods, introduce additional constraints that bias the model toward lower complexity, for example fewer features or smaller weights.
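
    As an example of an embedded method, here is a minimal sketch of L1-regularized feature selection, assuming scikit-learn is available; the toy data and the alpha value are illustrative:

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))                           # toy data: 100 examples, 10 features
    y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(size=100)     # only features 0 and 3 matter

    lasso = Lasso(alpha=0.1).fit(X, y)     # the L1 penalty drives some coefficients to exactly zero
    selected = np.nonzero(lasso.coef_)[0]  # indices of the features the model kept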

    Feature imputation

    Feature imputation tries to fill in missing data. We can replace a missing value with a constant (e.g. the mean value), a random value, or a prediction made from the other values.
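
    A minimal sketch of mean imputation with NumPy, assuming missing values are encoded as NaN:

    import numpy as np

    def mean_impute(X):
        X = X.copy()
        col_means = np.nanmean(X, axis=0)   # per-feature means, ignoring NaNs
        rows, cols = np.where(np.isnan(X))  # positions of the missing values
        X[rows, cols] = col_means[cols]     # replace each NaN with its column mean
        return X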

    The Widrow-Hoff algorithm

    import numpy as np

    def widrow_hoff(X, y, learning_rate=0.01, n_epochs=10):
        w = np.zeros(X.shape[1])                     # w = [0, ..., 0]
        for _ in range(n_epochs):                    # N epochs
            for x_i, y_i in zip(X, y):               # for (x_i, y_i) in the training set
                g = np.dot(w, x_i)                   # current prediction
                error = g - y_i
                w = w - learning_rate * error * x_i  # gradient step on the squared error
        return w

    Crowdsourcing

    A common technique for annotating data. It uses a large pool of non-expert annotators to annotate the data.

    Deep learning

    Deep learning refers to neural networks with many hidden layers. The universal approximation theorem states that one hidden layer should be enough to approximate any continuous function, but it is often more practical to stack many hidden layers on top of each other.

    Backpropagation

    Backpropagation is the trick of reusing the gradients computed for layers occurring later in the network when applying the chain rule to compute the gradients of earlier layers.
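
    A minimal sketch of one forward and backward pass for a two-layer network with squared loss, assuming NumPy (the shapes, names and tanh activation are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    x, y = rng.normal(size=3), 1.0                        # one input example and its target
    W, v = rng.normal(size=(4, 3)), rng.normal(size=4)    # hidden and output weights

    # forward pass
    a = W @ x                      # pre-activations of the hidden units
    h = np.tanh(a)                 # hidden activations
    y_hat = v @ h                  # network output
    loss = 0.5 * (y_hat - y) ** 2

    # backward pass: gradients of later layers are reused for earlier layers (chain rule)
    d_y_hat = y_hat - y                  # dL/dy_hat
    d_v = d_y_hat * h                    # dL/dv
    d_h = d_y_hat * v                    # dL/dh, reused below
    d_a = d_h * (1 - np.tanh(a) ** 2)    # dL/da through the tanh derivative
    d_W = np.outer(d_a, x)               # dL/dW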

    Intrinsic evaluation

    Intrinsic evaluation is the performance measured in isolation using some metric computed automatically.

    Extrinsic evaluation

    Extrinsic evaluation measures how a change to the predictor affects performance on the downstream task it is used for.

    F-score

    The F-score may refer to an evaluation metric used either for clustering or for classification; in classification it is the harmonic mean of precision and recall.

    K-means

    K-means is probably the most popular clustering technique. The idea behind it is to find a set of K clusters such that each data point is close to the centroid (mean vector) of its cluster.

    Lloyd's algorithm

    import numpy as np

    def lloyd(X, k, max_iter=100):
        S = X[np.random.choice(len(X), k, replace=False)]             # initial centroids
        for _ in range(max_iter):                                     # until the clusters (roughly) stop changing
            labels = ((X[:, None] - S) ** 2).sum(-1).argmin(1)        # insert each x_i into its nearest cluster S_k
            S = np.array([X[labels == j].mean(0) for j in range(k)])  # recompute cluster centroids for each S_k
        return labels, S

    The elbow method

    When we use k-means for some clustering problem and want to choose the number of clusters the algorithm should find, the elbow method presumes that there is some natural number of clusters. The loss will drop quickly until we reach this optimum, but increasing the number of clusters further has diminishing returns. If we plot the number of clusters against the loss, we can apply the elbow method when the curve looks like an elbow.
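
    A minimal sketch of collecting the values needed for an elbow plot, assuming scikit-learn's KMeans is available (its inertia_ attribute holds the within-cluster sum of squared distances; the toy data is illustrative):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.default_rng(0).normal(size=(200, 2))   # toy data; replace with your own

    losses = []
    for k in range(1, 11):
        losses.append(KMeans(n_clusters=k, n_init=10).fit(X).inertia_)   # loss for this number of clusters
    # plot range(1, 11) against losses and look for the "elbow"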

    Principal component analysis

    Principal component analysis (PCA)

    Singular value decomposition

    Singular value decomposition (SVD)

    Low-rank matrix factorization

    A more space-efficient technique for implementing PCA.

    Cold start

    How do we handle new users and new items in collaborative filtering?

    Word embeddings

    We represent words for neural networks using low-dimensional vectors of real numbers.

    Word-word co-occurrence matrix

    We count how often words occur together, typically within some context window.
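
    A minimal sketch of building word-word co-occurrence counts over a token list, assuming a symmetric context window (the window size is an illustrative choice):

    from collections import Counter

    def cooccurrence_counts(tokens, window=2):
        counts = Counter()
        for i, w in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[(w, tokens[j])] += 1   # count word pairs inside the window
        return counts

    cooccurrence_counts("the cat sat on the mat".split())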

    Reduction

    Reduction in machine learning means that we convert a complicated problem into a set of simpler problems.

    Part-of-speech tagging

    Input a sequence of word tokens and output a sequence of grammatical tags corresponding to each token.

    Imitation learning

    A paradigm in machine learning where the model tries to imitate an "expert".

    Feedforward neural network

    Consists of connected layers of "classifiers", where the intermediate classifiers are called hidden units and the final classifier is called the output unit. Each hidden unit is computed by h_i = f(w_i · x), and the output is computed by y = Σ_i v_i h_i, where f is the activation function.
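
    A minimal sketch of these two computations with NumPy (the shapes and the choice of tanh as activation are illustrative):

    import numpy as np

    def feedforward(x, W, v, f=np.tanh):
        h = f(W @ x)   # each hidden unit: h_i = f(w_i · x)
        return v @ h   # output: y = Σ_i v_i * h_i

    rng = np.random.default_rng(0)
    feedforward(rng.normal(size=3), rng.normal(size=(4, 3)), rng.normal(size=4))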

    Multilayer perceptron

    See Feedforward neural network.

    Recurrent neural networks

    Recurrent neural networks use a state vector to represent previous events. After each step the state vector is recomputed from the current input and the previous state.
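
    A minimal sketch of a simple (Elman-style) recurrent step with NumPy; the weight names, sizes and tanh activation are illustrative assumptions:

    import numpy as np

    def rnn_step(x_t, h_prev, W_x, W_h, b):
        # the new state depends on the current input and the previous state
        return np.tanh(W_x @ x_t + W_h @ h_prev + b)

    rng = np.random.default_rng(0)
    W_x, W_h, b = rng.normal(size=(5, 3)), rng.normal(size=(5, 5)), np.zeros(5)
    h = np.zeros(5)
    for x_t in rng.normal(size=(10, 3)):   # a sequence of 10 input vectors
        h = rnn_step(x_t, h, W_x, W_h, b)  # the state vector is recalculated after each step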

    References