161324 Data Mining | Lecture 10
Recently, we’ve learned about classification models, which aim to predict the value of an existing categorical variable \(y\) based on a set of variables \(\mathbf{x}\). This is called supervised classification.
Cluster analysis is used to create a new categorical variable based on a set of feature variables \(\mathbf{x}\). This is called unsupervised classification.
The goal is to classify our objects into groups that have high within-group similarity and low between-group similarity, with respect to \(\mathbf{X}\).
Lecture 10 | Cluster Analysis
Sometimes there clearly are some groups.
Sometimes there clearly are no groups.
Often it's something in between.
Regardless of the situation, cluster analysis will always produce groups!
\(k\)-means cluster analysis
Choose number of groups \(k\) (here 3).
Initialise the \(k\) centroids in \(\mathbf{X}\) space
(e.g., choose three points at random).
Loop until convergence (a toy version is sketched below):
- Assign each point to its nearest centroid.
- Recompute each centroid as the mean of the points assigned to it.
It is wise to repeat the whole algorithm many times with different random starts to ensure a good result.
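A toy sketch of this loop in R (an illustration only, not from the lecture, and with no handling of empty clusters). In practice, stats::kmeans() does this, with nstart giving the number of random starts.

# Toy k-means: assign points to the nearest centroid, then update the centroids
kmeans_sketch <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  centroids <- X[sample(nrow(X), k), , drop = FALSE]   # random initial centroids
  cluster <- rep(0, nrow(X))
  for (iter in seq_len(max_iter)) {
    # 1. Assign each point to its nearest centroid (squared Euclidean distance)
    d <- sapply(seq_len(k), function(j) colSums((t(X) - centroids[j, ])^2))
    new_cluster <- max.col(-d)
    if (all(new_cluster == cluster)) break              # no change: converged
    cluster <- new_cluster
    # 2. Recompute each centroid as the mean of its assigned points
    centroids <- t(sapply(seq_len(k), function(j)
      colMeans(X[cluster == j, , drop = FALSE])))
  }
  list(cluster = cluster, centers = centroids)
}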
\(k\)-means cluster analysis
This dataset contains the proportions (as percentages) of each of nine major sources of protein in the diets of 25 European countries, some time prior to the 1990s.
library(tidyverse)    # read_csv() and data wrangling
library(knitr)        # kable()
library(kableExtra)   # kable_styling()

food <- read_csv("https://massey.ac.nz/~anhsmith/data/food.csv")

kable(food) |> kable_styling(font_size = 18)
Country | RedMeat | WhiteMeat | Eggs | Milk | Fish | Cereals | Starch | Nuts | Fr.Veg |
---|---|---|---|---|---|---|---|---|---|
Albania | 10.1 | 1.4 | 0.5 | 8.9 | 0.2 | 42.3 | 0.6 | 5.5 | 1.7 |
Austria | 8.9 | 14.0 | 4.3 | 19.9 | 2.1 | 28.0 | 3.6 | 1.3 | 4.3 |
Belgium | 13.5 | 9.3 | 4.1 | 17.5 | 4.5 | 26.6 | 5.7 | 2.1 | 4.0 |
Bulgaria | 7.8 | 6.0 | 1.6 | 8.3 | 1.2 | 56.7 | 1.1 | 3.7 | 4.2 |
Czechoslovakia | 9.7 | 11.4 | 2.8 | 12.5 | 2.0 | 34.3 | 5.0 | 1.1 | 4.0 |
Denmark | 10.6 | 10.8 | 3.7 | 25.0 | 9.9 | 21.9 | 4.8 | 0.7 | 2.4 |
E Germany | 8.4 | 11.6 | 3.7 | 11.1 | 5.4 | 24.6 | 6.5 | 0.8 | 3.6 |
Finland | 9.5 | 4.9 | 2.7 | 33.7 | 5.8 | 26.3 | 5.1 | 1.0 | 1.4 |
France | 18.0 | 9.9 | 3.3 | 19.5 | 5.7 | 28.1 | 4.8 | 2.4 | 6.5 |
Greece | 10.2 | 3.0 | 2.8 | 17.6 | 5.9 | 41.7 | 2.2 | 7.8 | 6.5 |
Hungary | 5.3 | 12.4 | 2.9 | 9.7 | 0.3 | 40.1 | 4.0 | 5.4 | 4.2 |
Ireland | 13.9 | 10.0 | 4.7 | 25.8 | 2.2 | 24.0 | 6.2 | 1.6 | 2.9 |
Italy | 9.0 | 5.1 | 2.9 | 13.7 | 3.4 | 36.8 | 2.1 | 4.3 | 6.7 |
Netherlands | 9.5 | 13.6 | 3.6 | 23.4 | 2.5 | 22.4 | 4.2 | 1.8 | 3.7 |
Norway | 9.4 | 4.7 | 2.7 | 23.3 | 9.7 | 23.0 | 4.6 | 1.6 | 2.7 |
Poland | 6.9 | 10.2 | 2.7 | 19.3 | 3.0 | 36.1 | 5.9 | 2.0 | 6.6 |
Portugal | 6.2 | 3.7 | 1.1 | 4.9 | 14.2 | 27.0 | 5.9 | 4.7 | 7.9 |
Romania | 6.2 | 6.3 | 1.5 | 11.1 | 1.0 | 49.6 | 3.1 | 5.3 | 2.8 |
Spain | 7.1 | 3.4 | 3.1 | 8.6 | 7.0 | 29.2 | 5.7 | 5.9 | 7.2 |
Sweden | 9.9 | 7.8 | 3.5 | 24.7 | 7.5 | 19.5 | 3.7 | 1.4 | 2.0 |
Switzerland | 13.1 | 10.1 | 3.1 | 23.8 | 2.3 | 25.6 | 2.8 | 2.4 | 4.9 |
UK | 17.4 | 5.7 | 4.7 | 20.6 | 4.3 | 24.3 | 4.7 | 3.4 | 3.3 |
USSR | 9.3 | 4.6 | 2.1 | 16.6 | 3.0 | 43.6 | 6.4 | 3.4 | 2.9 |
W Germany | 11.4 | 12.5 | 4.1 | 18.8 | 3.4 | 18.6 | 5.2 | 1.5 | 3.8 |
Yugoslavia | 4.4 | 5.0 | 1.2 | 9.5 | 0.6 | 55.9 | 3.0 | 5.7 | 3.2 |
\(k\)-means cluster analysis
First, let’s try with 3 clusters and 1 random start.
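A hedged sketch of how this fit might be specified with tidyclust (the formula interface and passing nstart through set_engine() are assumptions about the package interface, not code from the lecture):

library(tidymodels)
library(tidyclust)

kmeans_fit <- k_means(num_clusters = 3) |>
  set_engine("stats", nstart = 1) |>       # a single random start
  fit(~ RedMeat + WhiteMeat, data = food)

kmeans_fit

The later slides change only nstart (e.g. 50 random starts) or num_clusters.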
tidyclust cluster object
K-means clustering with 3 clusters of sizes 5, 8, 12
Cluster means:
RedMeat WhiteMeat
1 6.34000 4.88000
2 10.60000 4.65000
3 10.76667 11.31667
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
2 3 3 1 3 3 3 2 3 2 3 3 2 3 2 3 1 1 1 2 3 2 2 3 1
Within cluster sum of squares by cluster:
[1] 13.3800 78.6200 158.0233
(between_SS / total_SS = 58.1 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
\(k\)-means cluster analysis
Now, let’s do 50 random starts.
tidyclust cluster object
K-means clustering with 3 clusters of sizes 5, 8, 12
Cluster means:
RedMeat WhiteMeat
1 15.180000 9.000000
2 8.837500 12.062500
3 8.258333 4.658333
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
3 2 1 3 2 2 2 3 1 3 2 1 3 2 3 2 3 3 3 3 1 1 3 2 3
Within cluster sum of squares by cluster:
[1] 35.66800 39.45750 69.85833
(between_SS / total_SS = 75.7 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
\(k\)-means cluster analysis
What about 2 clusters?
tidyclust cluster object
K-means clustering with 2 clusters of sizes 11, 14
Cluster means:
RedMeat WhiteMeat
1 8.109091 4.372727
2 11.178571 10.664286
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
1 2 2 1 2 2 2 1 2 1 2 2 1 2 1 2 1 1 1 2 2 2 1 2 1
Within cluster sum of squares by cluster:
[1] 56.15091 238.35571
(between_SS / total_SS = 50.6 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
\(k\)-means cluster analysis
Or 4?
tidyclust cluster object
K-means clustering with 4 clusters of sizes 5, 7, 8, 5
Cluster means:
RedMeat WhiteMeat
1 6.340000 4.8800
2 9.628571 4.5000
3 8.837500 12.0625
4 15.180000 9.0000
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
2 3 4 1 3 3 3 2 4 2 3 4 2 3 2 3 1 1 1 2 4 4 2 3 1
Within cluster sum of squares by cluster:
[1] 13.38000 24.51429 39.45750 35.66800
(between_SS / total_SS = 81.0 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
\(k\)-means cluster analysis
Increasing \(k\) will always decrease the within-group error.
\(k\)-means cluster analysis
The silhouette index measures how well each data point fits within its own cluster versus the nearest neighbouring cluster.
For each case \(i\), calculate:
\(a(i)\) = the average distance from case \(i\)
to all other members of its own cluster.
\(b(i)\) = the average distance from case \(i\)
to all members of the nearest neighbouring cluster.
\(s(i) = \frac{b(i)-a(i)}{\max(b(i), a(i))}\)
The scaling of \(s(i)\) by the maximum means that \(s(i)\) is always between -1 and 1.
If \(s(i)\) is near 1, the point clearly belongs in its cluster.
If \(s(i)\) is near zero, then the point is “on the fence”.
If \(s(i)\) is negative, then the point is more similar to members of another cluster.
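As a hedged sketch, the silhouette() function in the cluster package computes \(s(i)\) for every point from a vector of cluster memberships and a distance matrix (variable names here are illustrative):

library(cluster)

km  <- kmeans(food[, c("RedMeat", "WhiteMeat")], centers = 3, nstart = 50)
sil <- silhouette(km$cluster, dist(food[, c("RedMeat", "WhiteMeat")]))

summary(sil)   # average silhouette width per cluster and overall
plot(sil)      # silhouette plot, one bar per country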
Silhouette
For any cluster analysis, we can calculate the overall average silhouette score. We can then run the cluster analysis for a range of values of \(k\) and choose the value that gives the highest silhouette score.
The factoextra package provides some convenient functions for this.
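For example, a minimal sketch with fviz_nbclust(), restricted to the two meat variables used above:

library(factoextra)

food |>
  select(RedMeat, WhiteMeat) |>
  fviz_nbclust(kmeans, method = "silhouette", k.max = 6)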
By this criterion, we’d choose \(k\) = 3.
Silhouette
food_cv_metrics <- tune_cluster(
object = workflow(
recipe(~ RedMeat + WhiteMeat,
data = food),
k_means(num_clusters = tune())
),
resamples = vfold_cv(
food,
v = nrow(food)
),
grid = tibble(
num_clusters=2:6
),
control = control_grid(
save_pred = TRUE,
extract = identity),
metrics = cluster_metric_set(
sse_ratio,
silhouette_avg
)
)
food_cv_metrics |>
collect_metrics() |>
ggplot() +
aes(x = num_clusters,
y = mean,
col = .metric) +
geom_point() + geom_line() +
ylab("Metric score") +
xlab("Number of clusters")
Silhouette
It’s not all about meat! There are actually 9 variables in this dataset.
Note that, although they are all measured as percentages, some vary much more than others.
It is generally sensible to normalise variables (subtract the mean and divide by the standard deviation) before doing \(k\)-means, or any other analysis that uses Euclidean distances. Otherwise, the variables with larger variances will dominate!
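For example, a minimal sketch producing a normalised version of the data (the food_norm object is used again later):

# Centre and scale all nine food variables; keep country names as row names
food_norm <- food |>
  column_to_rownames("Country") |>
  scale() |>
  as.data.frame()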
\(k\)-means cluster analysis
The fviz_cluster() function will now show the clusters on a Principal Components Analysis plot of the nine variables.
\(k\)-means cluster analysis
Inherently based on Euclidean distances.
It is wise to normalise variables first.
For large numbers of variables, ordination methods like Principal Components Analysis (PCA) can be used to visualise clusters.
Looks for ‘spherical clusters’; not so good for irregular shapes.
Relatively fast, iterative algorithm.
The silhouette index can be used to choose \(k\).
One can use actual data points as the cluster centres ('medoids') instead of centroids, giving '\(k\)-medoid' cluster analysis. This can be implemented with pam() in R.
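For example (a sketch, assuming the normalised food_norm data from above):

library(cluster)

# k-medoid clustering: each cluster centre is an actual country
pam_fit <- pam(food_norm, k = 3)
pam_fit$medoids      # the medoid countries
pam_fit$clustering   # cluster memberships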
Hierarchical cluster analysis
Hierarchical cluster analysis uses a different approach to \(k\)-means.
Hierarchical cluster analysis
Country | RedMeat | WhiteMeat |
---|---|---|
Yugoslavia | 4.4 | 5.0 |
Romania | 6.2 | 6.3 |
Greece | 10.2 | 3.0 |
Albania | 10.1 | 1.4 |
Italy | 9.0 | 5.1 |
Bulgaria | 7.8 | 6.0 |
\[ \begin{align} \delta_{s,t} &= \sqrt{\sum_{j=1}^p (x_{sj} - x_{tj})^2} \\ \\ \delta_{\text{Yug,Rom}} &= \sqrt{(4.4-6.2)^2 + (5.0-6.3)^2} \\ &=2.22 \end{align} \]
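This can be checked directly in R (a quick sketch with the two rows entered by hand):

dist(rbind(Yugoslavia = c(4.4, 5.0),
           Romania    = c(6.2, 6.3)))   # 2.22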
|  | Yugoslavia | Romania | Greece | Albania | Italy |
|---|---|---|---|---|---|
Romania | 2.22 | . | . | . | . |
Greece | 6.14 | 5.19 | . | . | . |
Albania | 6.74 | 6.26 | 1.6 | . | . |
Italy | 4.6 | 3.05 | 2.42 | 3.86 | . |
Bulgaria | 3.54 | 1.63 | 3.84 | 5.14 | 1.5 |
Euclidean distance
The closest pair, Italy and Bulgaria (distance 1.5), is merged first. Distances from the new Ita+Bul group to each remaining country are then recalculated; here the group-to-group distance is the average of the pairwise distances (average linkage):

|  | Yugoslavia | Romania | Greece | Albania |
|---|---|---|---|---|
Romania | 2.22 | . | . | . |
Greece | 6.14 | 5.19 | . | . |
Albania | 6.74 | 6.26 | 1.6 | . |
Ita+Bul | 4.07 | 2.34 | 3.13 | 4.5 |
The smallest remaining distance is between Greece and Albania (1.6), so they are merged next:

|  | Yugoslavia | Romania | Gre+Alb |
|---|---|---|---|
Romania | 2.22 | . | . |
Gre+Alb | 6.44 | 5.72 | . |
Ita+Bul | 4.07 | 2.34 | 3.82 |
Then Yugoslavia and Romania (2.22) are merged:

|  | Yug+Rom | Gre+Alb |
|---|---|---|
Gre+Alb | 6.08 | . |
Ita+Bul | 3.2 | 3.82 |
Finally, the Yug+Rom and Ita+Bul groups merge (distance 3.2), leaving two groups that join at the top of the tree:

|  | Yug+Rom+Ita+Bul |
|---|---|
Gre+Alb | 4.95 |
Hierarchical cluster analysis
There are a number of 'linkage' criteria, i.e. ways to calculate the distance between groups of objects: for example, single linkage (minimum pairwise distance), complete linkage (maximum), and average linkage (mean, as used in the example above).
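For example, a minimal sketch with base R's hclust(), assuming the normalised food_norm data from earlier; method = "average" matches the averaging used in the walkthrough above, and cutree() cuts the tree into a chosen number of groups:

hc <- hclust(dist(food_norm), method = "average")

plot(hc)            # dendrogram
cutree(hc, k = 3)   # group memberships for k = 3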
Hierarchical cluster analysis
With \(k\)-means, the clusters are defined by the distance to a centroid. This results in ‘spherical’ clusters.
Hierarchical clustering joins the most similar points and groups of points from the ‘ground up’. Clusters needn’t be any particular shape. It’s more about ‘gaps’.
Hierarchical vs k-means
food_norm |>
  dist(method = dist_type) |>   # dist_type is set earlier in the slides (not shown)
  as.matrix() |>
  as.data.frame() |>
  replace_upper_triangle(NA) |>
  slice(-1) |>
  select(-last_col()) |>
  column_to_rownames(var = "rowname") |>
  kable(digits = 1) |>
  kable_styling(font_size = 12)
|  | Albania | Austria | Belgium | Bulgaria | Czechoslovakia | Denmark | E Germany | Finland | France | Greece | Hungary | Ireland | Italy | Netherlands | Norway | Poland | Portugal | Romania | Spain | Sweden | Switzerland | UK | USSR | W Germany |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Austria | 15.7 | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . |
Belgium | 16.9 | 5.0 | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . |
Bulgaria | 5.6 | 13.5 | 14.8 | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . |
Czechoslovakia | 14.8 | 4.6 | 5.2 | 12.0 | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . |
Denmark | 16.3 | 6.4 | 5.8 | 15.0 | 7.8 | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . |
E Germany | 15.6 | 4.7 | 4.7 | 13.5 | 4.4 | 5.7 | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . |
Finland | 13.9 | 8.1 | 8.0 | 13.2 | 8.8 | 5.0 | 7.6 | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . |
France | 14.5 | 6.8 | 5.5 | 13.1 | 7.4 | 6.9 | 6.7 | 7.8 | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . |
Greece | 8.3 | 12.4 | 12.8 | 6.9 | 11.4 | 13.1 | 12.2 | 11.6 | 10.4 | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . |
Hungary | 10.3 | 8.7 | 10.4 | 7.1 | 7.1 | 11.5 | 8.8 | 10.8 | 10.0 | 7.7 | . | . | . | . | . | . | . | . | . | . | . | . | . | . |
Ireland | 16.5 | 5.6 | 4.5 | 15.2 | 7.6 | 5.1 | 5.9 | 6.6 | 5.8 | 13.4 | 11.4 | . | . | . | . | . | . | . | . | . | . | . | . | . |
Italy | 10.6 | 9.5 | 10.0 | 7.5 | 7.9 | 11.4 | 9.7 | 10.7 | 8.9 | 5.3 | 5.7 | 11.6 | . | . | . | . | . | . | . | . | . | . | . | . |
Netherlands | 16.5 | 2.1 | 4.4 | 14.3 | 4.9 | 5.7 | 4.8 | 7.7 | 6.9 | 13.0 | 9.6 | 5.1 | 10.1 | . | . | . | . | . | . | . | . | . | . | . |
Norway | 14.4 | 7.7 | 6.6 | 13.1 | 7.5 | 4.4 | 6.4 | 4.3 | 7.2 | 10.9 | 10.2 | 7.2 | 9.2 | 7.1 | . | . | . | . | . | . | . | . | . | . |
Poland | 13.9 | 5.8 | 6.6 | 11.1 | 3.9 | 8.2 | 5.2 | 8.6 | 7.1 | 9.8 | 6.3 | 8.3 | 6.6 | 6.3 | 7.3 | . | . | . | . | . | . | . | . | . |
Portugal | 11.4 | 15.8 | 16.3 | 11.9 | 15.3 | 14.9 | 14.2 | 13.7 | 13.2 | 9.6 | 12.5 | 15.9 | 12.2 | 16.4 | 13.4 | 13.2 | . | . | . | . | . | . | . | . |
Romania | 7.2 | 12.6 | 13.7 | 3.8 | 10.7 | 14.3 | 12.5 | 12.5 | 12.8 | 7.3 | 5.7 | 14.6 | 6.9 | 13.3 | 12.1 | 10.0 | 12.8 | . | . | . | . | . | . | . |
Spain | 11.1 | 11.1 | 11.1 | 9.4 | 9.8 | 11.5 | 9.8 | 10.8 | 9.3 | 5.8 | 7.5 | 12.0 | 6.1 | 11.7 | 9.5 | 7.7 | 8.0 | 9.0 | . | . | . | . | . | . |
Sweden | 15.7 | 6.2 | 5.5 | 14.3 | 7.1 | 2.9 | 6.0 | 4.6 | 7.2 | 12.4 | 10.8 | 5.7 | 10.3 | 5.4 | 3.3 | 7.8 | 15.3 | 13.3 | 11.2 | . | . | . | . | . |
Switzerland | 14.9 | 4.3 | 4.5 | 12.7 | 5.1 | 6.7 | 6.2 | 7.6 | 5.3 | 11.0 | 8.9 | 6.0 | 7.9 | 4.1 | 6.6 | 5.9 | 15.6 | 11.9 | 10.3 | 5.7 | . | . | . | . |
UK | 14.6 | 7.2 | 5.3 | 13.6 | 8.1 | 6.3 | 7.2 | 6.8 | 4.2 | 11.0 | 10.6 | 4.5 | 9.8 | 7.1 | 6.7 | 8.4 | 14.2 | 13.1 | 10.2 | 6.3 | 5.9 | . | . | . |
USSR | 11.4 | 8.6 | 8.3 | 9.0 | 6.0 | 9.4 | 7.5 | 8.1 | 8.5 | 8.4 | 6.1 | 9.6 | 6.2 | 8.8 | 7.1 | 5.5 | 12.8 | 7.2 | 7.4 | 8.4 | 7.6 | 8.7 | . | . |
W Germany | 17.3 | 3.5 | 3.1 | 15.3 | 5.5 | 5.3 | 4.2 | 8.0 | 6.4 | 13.7 | 10.6 | 4.1 | 10.9 | 2.6 | 7.2 | 6.9 | 16.6 | 14.3 | 11.9 | 5.5 | 4.9 | 6.4 | 9.3 | . |
Yugoslavia | 5.6 | 14.8 | 16.1 | 3.5 | 13.2 | 16.1 | 14.6 | 14.0 | 14.5 | 7.8 | 7.8 | 16.4 | 9.1 | 15.6 | 14.1 | 12.2 | 11.8 | 3.6 | 9.9 | 15.4 | 14.3 | 14.8 | 9.7 | 16.6 |
Builds a tree-like hierarchical system of groups and subgroups
Agglomerative hierarchical cluster analysis
- Begins with each object in its own group
- Joins the nearest objects and groups step by step until all objects are in one group

Divisive hierarchical cluster analysis
- Begins with all objects in one group
- Splits each group into smaller ones until each object is in its own group
(Both variants are sketched in R below.)
Can use any dissimilarity measure
Puts less emphasis on ‘spherical’ groups than \(k\)-means
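A hedged sketch of both variants using the cluster package (agnes() for agglomerative, diana() for divisive), again assuming the normalised food_norm data:

library(cluster)

d  <- dist(food_norm)

ag <- agnes(d, method = "average")   # agglomerative
dv <- diana(d)                       # divisive

plot(as.hclust(ag))   # dendrograms
plot(as.hclust(dv))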
Hierarchical cluster analysis
Binary data
Let’s look at a dataset of binary characteristics of animals.
animals <- cluster::animals |>
transmute("warm-blooded"=war-1, "can fly"=fly-1, "vertebrate"=ver-1,
"endangered"=end-1, "live in groups"=gro-1,"have hair"=hai-1) |>
`rownames<-`(c("ant", "bee", "cat", "caterpillar", "chimpanzee", "cow", "duck",
"eagle", "elephant", "fly", "frog", "herring", "lion", "lizard",
"lobster", "human", "rabbit", "salmon", "spider", "whale"))
# Replace some NAs and fix some errors -- humans are not endangered!
animals[c('frog','lobster','salmon'),'live in groups'] <- 1
animals[c('lion','human','spider'),'endangered'] <- c(1,0,0)
animals
warm-blooded can fly vertebrate endangered live in groups have hair
ant 0 0 0 0 1 0
bee 0 1 0 0 1 1
cat 1 0 1 0 0 1
caterpillar 0 0 0 0 0 1
chimpanzee 1 0 1 1 1 1
cow 1 0 1 0 1 1
duck 1 1 1 0 1 0
eagle 1 1 1 1 0 0
elephant 1 0 1 1 1 0
fly 0 1 0 0 0 0
frog 0 0 1 1 1 0
herring 0 0 1 0 1 0
lion 1 0 1 1 1 1
lizard 0 0 1 0 0 0
lobster 0 0 0 0 1 0
human 1 0 1 0 1 1
rabbit 1 0 1 0 1 1
salmon 0 0 1 0 1 0
spider 0 0 0 0 0 1
whale 1 0 1 1 1 0
Binary data
Remember, distance or dissimilarity is calculated between each pair of objects.
The two methods considered here for binary data, ‘simple matching’ and ‘Jaccard’, differ in one important aspect: the treatment of ‘double zeros’.
For what proportion of variables do the two objects have different values?
bee cat Same?
warm-blooded 0 1 FALSE
can fly 1 0 FALSE
vertebrate 0 1 FALSE
endangered 0 0 TRUE
live in groups 1 0 FALSE
have hair 1 1 TRUE
Dissimilarity =
4 different / 6 variables = 0.67
For what proportion of variables do the two objects have different values, excluding 'double zeros' (variables where both objects are zero)?
# A tibble: 4 × 3
# Groups: bee [2]
bee cat n
<dbl> <dbl> <int>
1 0 0 1
2 0 1 2
3 1 0 2
4 1 1 1
Dissimilarity =
4 different / 5 non-double-zeros = 0.8
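A quick sketch reproducing both values for the bee vs cat pair in R (dist()'s "binary" method gives the Jaccard-style dissimilarity; this assumes the animals data frame built earlier):

x <- unlist(animals["bee", ])   # bee's six binary values
y <- unlist(animals["cat", ])

# Simple matching: proportion of all six variables that differ
mean(x != y)                           # 4/6 = 0.67

# Jaccard: proportion that differ, ignoring variables where both are 0
dist(rbind(x, y), method = "binary")   # 4/5 = 0.8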
Binary data
Manhattan distance is simply the sum of absolute differences. For binary data, the 'manhattan' method is equivalent to simple matching without dividing by the number of variables.
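Continuing the bee vs cat sketch above:

dist(rbind(x, y), method = "manhattan")   # 4 = the number of mismatching variables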
Binary data
Hierarchical cluster analysis
Cluster analysis is unsupervised classification. There is no target variable.
The goal of cluster analysis is to create a new system of groups that have low within-group dissimilarity and high between-group dissimilarity, with respect to our features.
There are many methods of cluster analysis. We have focused on two: \(k\)-means and (agglomerative) hierarchical clustering.
Other methods include \(k\)-medoids and divisive hierarchical clustering.
Euclidean distance, or ‘straight-line’ distance, is the most common measure of dissimilarity for numerical variables, but there are other options.
Different types of data require different measures. For example, ‘simple matching’ or ‘Jaccard’ can be used for binary variables. Different measures often yield different results.
Metrics such as ‘silhouette’ can help decide how many groups to make.