Supervised learning

    Our model learns from labeled examples and tries to imitate them: given inputs with known outputs, it learns to predict the output for new inputs.

    Unsupervised learning

    In unsupervised learning we do not have labels. Instead we try to discover patterns in the data: explore, visualize and understand it.

    Stanford has a good guide [1].

    Clustering

    Clustering [2] [3] describes data by forming it into groups or hierarchies. The goal of clustering is to discover natural groups in a data set: sets of data points that are very much alike but different from the points in other groups. Clustering is hard because there is typically no single right answer, and the clustering algorithm may come up with a pattern that is obvious in the dataset but far from what we want.

    To evaluate similarity there are several options; for vectorized data it is common to use Euclidean distance. Clustering evaluation can be divided into two kinds of methods: internal and external. Internal evaluation measures how cohesive and well-separated the data points within the clusters are, while external evaluation measures how well the result matches our objective: is this clustering what we want?
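
    As a minimal sketch, the Euclidean distance on vectorized data (the two vectors are made-up examples):

        import numpy as np

        # Two example data points as vectors (illustrative values only).
        x = np.array([1.0, 2.0, 3.0])
        y = np.array([2.0, 0.0, 4.0])

        # Euclidean distance: the L2 norm of the difference vector.
        dist = np.linalg.norm(x - y)
        print(dist)  # sqrt((1-2)^2 + (2-0)^2 + (3-4)^2) = sqrt(6) ~ 2.449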

    Flat

    • Each cluster is characterized by a center. We can use different techniques, e.g. k-means, k-medoids or mean shift, to get a center vector.

    • Each cluster is represented by a distribution.
    • Find dense regions with algorithms like DBSCAN.

    K-means

    The formal definition of k-means is to find a partition of the dataset that minimizes the loss function

        RSS = \sum_{k=1}^{K} \sum_{x \in C_k} \| x - \mu_k \|^2,

    which is called the residual sum of squares; here C_k is the k-th cluster and \mu_k is its center. Finding the optimal partition is unfortunately NP-hard, so we instead use Lloyd's algorithm, which approximates a solution. It reaches a steady state after enough iterations. scikit-learn provides an implementation [4].
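
    A minimal sketch of running k-means with scikit-learn (the data and the choice of two clusters are illustrative assumptions, not from the source):

        import numpy as np
        from sklearn.cluster import KMeans

        # Synthetic 2-D data: two blobs around (0, 0) and (5, 5).
        rng = np.random.default_rng(0)
        X = np.vstack([rng.normal(0, 1, size=(50, 2)),
                       rng.normal(5, 1, size=(50, 2))])

        # n_clusters must be chosen up front; Lloyd's algorithm then alternates between
        # assigning points to the nearest center and recomputing the centers.
        km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

        print(km.cluster_centers_)   # the learned center vectors
        print(km.inertia_)           # the residual sum of squares (RSS) for this partition
        print(km.labels_[:5])        # cluster assignments of the first five points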

    How do we choose the number of clusters?

    • use our specific knowledge about the domain we are working with
    • use approximation methods, e.g. the elbow method (see the sketch after this list)

    • apply an evaluation method
    • use some regularization for the loss function
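
    A minimal sketch of the elbow method with scikit-learn (the dataset and the candidate range of k are assumptions for illustration):

        import numpy as np
        from sklearn.cluster import KMeans

        rng = np.random.default_rng(0)
        # Made-up data with three blobs, so the "elbow" should show up around k = 3.
        X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in (0, 4, 8)])

        # Fit k-means for a range of k and record the RSS (inertia) for each.
        for k in range(1, 8):
            rss = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            print(k, rss)

        # Plotting RSS against k, we pick the k where the curve bends (the "elbow"):
        # adding more clusters beyond that point only gives small improvements.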

    DBSCAN

    In comparison to k-means, which belongs to the family of representative-based methods, DBSCAN operates directly on the raw data points and belongs to the family of density-based clustering methods. DBSCAN looks at the distances (any kind of distance measure) between data points and how close they are to each other. The algorithm is based on core points and outliers. We could for example say that a core point is a point with at least 4 other points within reachable range, and outliers are points that are not reachable from any other point.

    Some advantages over the k-means method are that we don't need to specify the number of clusters we want, DBSCAN makes no assumption about the shape of each cluster, it can use any kind of distance metric, and it can label points as noise instead of forcing them into a cluster. However, DBSCAN is very sensitive to hyperparameter tuning and works poorly if the clusters differ in density.
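
    A minimal sketch with scikit-learn's DBSCAN (the values of eps and min_samples are illustrative guesses, not from the source):

        import numpy as np
        from sklearn.cluster import DBSCAN

        rng = np.random.default_rng(0)
        # Two dense blobs plus a few scattered points that should end up as noise.
        X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
                       rng.normal(5, 0.3, size=(50, 2)),
                       rng.uniform(-3, 8, size=(5, 2))])

        # eps is the reachable range; min_samples is the number of points (including
        # the point itself) required in that range for a point to be a core point.
        db = DBSCAN(eps=0.5, min_samples=4).fit(X)

        # Labels are cluster indices; -1 means the point was classified as noise.
        print(set(db.labels_))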

    Evaluation

    How does one evaluate these types of approaches? The silhouette score is one method.

    Silhouette score

    The silhouette score of a data point is defined as

        s = \frac{b - a}{\max(a, b)},

    where a is the average distance to the other data points in the same cluster and b is the minimal average distance to the points of another cluster.
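
    A minimal sketch of computing it with scikit-learn (reusing an assumed k-means clustering; the data is illustrative):

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.metrics import silhouette_score

        rng = np.random.default_rng(0)
        X = np.vstack([rng.normal(0, 1, size=(50, 2)),
                       rng.normal(5, 1, size=(50, 2))])

        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

        # Mean silhouette over all points; close to 1 is good, close to -1 is bad.
        print(silhouette_score(X, labels))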

    Purity score

    The purity score defines how pure each cluster is with respect to some gold-standard classes. It is defined as

        \mathrm{purity} = \frac{1}{N} \sum_{k} \max_{j} |c_k \cap t_j|,

    where N is the number of data points, c_k are the clusters and t_j are the gold-standard classes. One drawback of this approach is that if we put every individual point in its own cluster, we get a purity of one.
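
    A small sketch of computing purity from a contingency matrix (the label arrays are made up; purity itself is not a built-in scikit-learn metric):

        import numpy as np
        from sklearn.metrics.cluster import contingency_matrix

        # Assumed gold-standard classes and cluster assignments for ten points.
        classes  = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
        clusters = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2, 2])

        # Rows are classes, columns are clusters; entry (j, k) counts points
        # of class j that ended up in cluster k.
        cm = contingency_matrix(classes, clusters)

        # For every cluster take the size of its largest class, then divide by N.
        purity = cm.max(axis=0).sum() / cm.sum()
        print(purity)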

    Inverse purity score

    The inverse purity score is the same idea with the roles of clusters and classes swapped: it measures how well each gold-standard class is collected into a single cluster. Its degenerate case is the opposite one: putting all points into one big cluster gives an inverse purity of one.

    F-score

    The F-score [5] tries to balance the purity score and the inverse purity score.
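
    One common formulation (a sketch; the exact formula is not given in these notes) takes, for each gold-standard class, the best harmonic mean of per-cluster precision (purity) and recall (inverse purity):

        F = \sum_{j} \frac{|t_j|}{N} \max_{k} \frac{2 \, P(c_k, t_j) \, R(c_k, t_j)}{P(c_k, t_j) + R(c_k, t_j)},
        \qquad P(c_k, t_j) = \frac{|c_k \cap t_j|}{|c_k|}, \quad R(c_k, t_j) = \frac{|c_k \cap t_j|}{|t_j|}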

    Hierarchical

    There are two approaches: agglomerative, which is bottom-up, and divisive, which is top-down.
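
    A minimal sketch of the bottom-up (agglomerative) variant with scikit-learn (the data and parameters are illustrative):

        import numpy as np
        from sklearn.cluster import AgglomerativeClustering

        rng = np.random.default_rng(0)
        X = np.vstack([rng.normal(0, 1, size=(30, 2)),
                       rng.normal(6, 1, size=(30, 2))])

        # Start with every point as its own cluster and repeatedly merge the two
        # closest clusters (here using Ward linkage) until two clusters remain.
        agg = AgglomerativeClustering(n_clusters=2, linkage="ward").fit(X)
        print(agg.labels_[:5])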

    Statistical distribution

    We fit a statistical distribution to the data and use the model to find data points that are highly unusual (outliers).
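
    A minimal sketch under the assumption that we model the data with a single Gaussian and flag low-density points (the data and the threshold are arbitrary illustrations):

        import numpy as np
        from scipy.stats import multivariate_normal

        rng = np.random.default_rng(0)
        # Mostly ordinary data plus one clearly unusual point.
        X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[8.0, 8.0]]])

        # Fit a Gaussian to the data: estimate mean and covariance.
        dist = multivariate_normal(mean=X.mean(axis=0), cov=np.cov(X, rowvar=False))

        # Flag points whose density falls below an (arbitrary) threshold.
        density = dist.pdf(X)
        print(np.where(density < 1e-4)[0])  # indices of the unusual points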

    Representation

    We want to learn some new representation of the data, e.g. reducing the data set to lower dimensions.

    It is useful for visualization and for reducing storage needs, which makes algorithms run faster and learning easier.
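
    A minimal sketch of dimensionality reduction with PCA in scikit-learn (one standard technique for this; the data is illustrative):

        import numpy as np
        from sklearn.decomposition import PCA

        rng = np.random.default_rng(0)
        # 100 points in 10 dimensions -- made-up high-dimensional data.
        X = rng.normal(size=(100, 10))

        # Project onto the 2 directions of highest variance, e.g. for visualization.
        pca = PCA(n_components=2)
        X2 = pca.fit_transform(X)

        print(X2.shape)                       # (100, 2)
        print(pca.explained_variance_ratio_)  # how much variance each component keeps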

    Semisupervised learning

    It follows the same pattern as supervised learning, but we only have a small amount of labeled data, so we also try to make use of the unlabeled data.

    Reinforcement learning

    References