Cluster Analysis

161324 Data Mining | Lecture 10

Cluster Analysis

Recently, we’ve learned about classification models, which aim to predict the value of an existing categorical variable \(y\) based on a set of variables \(\mathbf{x}\). This is called supervised classification.

Cluster analysis is used to create a new categorical variable based on a set of feature variables \(\mathbf{x}\). This is called unsupervised classification.

The goal is to classify our objects into groups that have high within-group similarity and low between-group similarity, with respect to \(\mathbf{X}\).



Sometimes there clearly are some groups.


Sometimes there clearly are no groups.


Often it’s something in between.



Regardless of the situation, cluster analysis will always produce groups!

The \(k\)-means clustering algorithm

The \(k\)-means algorithm



  1. Choose number of groups \(k\) (here 3).

  2. Initialise the \(k\) centroids in \(\mathbf{X}\) space
    (e.g., choose three points at random).

  3. Loop until convergence:

    1. Assign cases to nearest centroid.
    2. Update centroid location.


It is wise to repeat the whole algorithm many times with different random starts to ensure a good result.
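
To make these steps concrete, here is a minimal, purely illustrative base-R sketch of the loop. The helper simple_kmeans() is made up for this example; in practice we use kmeans() or tidyclust, as on the following slides.

# Illustrative sketch only: a bare-bones k-means loop in base R.
# (Real implementations, e.g. kmeans(), also handle empty clusters,
#  multiple random starts, and convergence more carefully.)
simple_kmeans <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  centroids <- X[sample(nrow(X), k), , drop = FALSE]  # step 2: random initial centroids
  cluster <- rep(0, nrow(X))
  for (it in seq_len(max_iter)) {
    # step 3.1: assign each case to its nearest centroid (squared Euclidean distance)
    d <- sapply(seq_len(k), function(j) colSums((t(X) - centroids[j, ])^2))
    new_cluster <- max.col(-d)
    if (all(new_cluster == cluster)) break   # assignments unchanged: converged
    cluster <- new_cluster
    # step 3.2: move each centroid to the mean of its assigned cases
    centroids <- t(sapply(seq_len(k), function(j)
      colMeans(X[cluster == j, , drop = FALSE])))
  }
  list(cluster = cluster, centers = centroids)
}

# e.g. simple_kmeans(my_data[, c("x1", "x2")], k = 3)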

European protein composition

This dataset contains the proportions (as percentages) of each of nine major sources of protein in the diets of 25 European countries, some time prior to the 1990s.
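
The code on these slides assumes a few packages have been loaded in a setup chunk (not shown); roughly:

# Packages used on these slides (assumed loaded in a setup chunk)
library(tidyverse)    # read_csv(), dplyr, ggplot2, etc.
library(tidymodels)   # recipes, parsnip, rsample, tune, workflows
library(tidyclust)    # k_means(), tune_cluster(), cluster_metric_set()
library(ggrepel)      # geom_text_repel()
library(knitr)        # kable()
library(kableExtra)   # kable_styling()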

food <- read_csv("https://massey.ac.nz/~anhsmith/data/food.csv")
kable(food) |> kable_styling(font_size = 18)
Country RedMeat WhiteMeat Eggs Milk Fish Cereals Starch Nuts Fr.Veg
Albania 10.1 1.4 0.5 8.9 0.2 42.3 0.6 5.5 1.7
Austria 8.9 14.0 4.3 19.9 2.1 28.0 3.6 1.3 4.3
Belgium 13.5 9.3 4.1 17.5 4.5 26.6 5.7 2.1 4.0
Bulgaria 7.8 6.0 1.6 8.3 1.2 56.7 1.1 3.7 4.2
Czechoslovakia 9.7 11.4 2.8 12.5 2.0 34.3 5.0 1.1 4.0
Denmark 10.6 10.8 3.7 25.0 9.9 21.9 4.8 0.7 2.4
E Germany 8.4 11.6 3.7 11.1 5.4 24.6 6.5 0.8 3.6
Finland 9.5 4.9 2.7 33.7 5.8 26.3 5.1 1.0 1.4
France 18.0 9.9 3.3 19.5 5.7 28.1 4.8 2.4 6.5
Greece 10.2 3.0 2.8 17.6 5.9 41.7 2.2 7.8 6.5
Hungary 5.3 12.4 2.9 9.7 0.3 40.1 4.0 5.4 4.2
Ireland 13.9 10.0 4.7 25.8 2.2 24.0 6.2 1.6 2.9
Italy 9.0 5.1 2.9 13.7 3.4 36.8 2.1 4.3 6.7
Netherlands 9.5 13.6 3.6 23.4 2.5 22.4 4.2 1.8 3.7
Norway 9.4 4.7 2.7 23.3 9.7 23.0 4.6 1.6 2.7
Poland 6.9 10.2 2.7 19.3 3.0 36.1 5.9 2.0 6.6
Portugal 6.2 3.7 1.1 4.9 14.2 27.0 5.9 4.7 7.9
Romania 6.2 6.3 1.5 11.1 1.0 49.6 3.1 5.3 2.8
Spain 7.1 3.4 3.1 8.6 7.0 29.2 5.7 5.9 7.2
Sweden 9.9 7.8 3.5 24.7 7.5 19.5 3.7 1.4 2.0
Switzerland 13.1 10.1 3.1 23.8 2.3 25.6 2.8 2.4 4.9
UK 17.4 5.7 4.7 20.6 4.3 24.3 4.7 3.4 3.3
USSR 9.3 4.6 2.1 16.6 3.0 43.6 6.4 3.4 2.9
W Germany 11.4 12.5 4.1 18.8 3.4 18.6 5.2 1.5 3.8
Yugoslavia 4.4 5.0 1.2 9.5 0.6 55.9 3.0 5.7 3.2

European protein composition

food |> ggplot() +
  aes(x=RedMeat,y=WhiteMeat,label=Country) +
  geom_point() + 
  geom_text_repel(size=2.5) +
  xlab("Percent from red meat") +
  ylab("Percent from white meat") +
  theme(aspect.ratio=1, 
        legend.position = "top")

European protein: \(k\)-means

First, let’s try with 3 clusters and 1 random start.

set.seed(1)
km_spec_k3_s1 <- k_means(num_clusters = 3) |> 
  parsnip::set_engine("stats", 
                      nstart = 1)

km_fit_k3_s1 <- km_spec_k3_s1 |> 
  fit(~ RedMeat + WhiteMeat, data = food)

km_fit_k3_s1
tidyclust cluster object

K-means clustering with 3 clusters of sizes 5, 8, 12

Cluster means:
   RedMeat WhiteMeat
1  6.34000   4.88000
2 10.60000   4.65000
3 10.76667  11.31667

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 
 2  3  3  1  3  3  3  2  3  2  3  3  2  3  2  3  1  1  1  2  3  2  2  3  1 

Within cluster sum of squares by cluster:
[1]  13.3800  78.6200 158.0233
 (between_SS / total_SS =  58.1 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      

European protein: \(k\)-means

First, let’s try with 3 clusters and 1 random start.

km_fit_k3_s1 |> 
  augment(food) |> 
  ggplot() +
  aes(x=RedMeat,
      y=WhiteMeat,
      label=Country, 
      col=.pred_cluster) +
  geom_point() + 
  geom_text_repel(size=2.5) +
  xlab("Percent from red meat") +
  ylab("Percent from white meat") +
  theme(aspect.ratio=1, 
        legend.position = "top")

European protein: \(k\)-means

Now, let’s do 50 random starts.

set.seed(1)
km_spec_k3 <- k_means(num_clusters = 3) |> 
  parsnip::set_engine("stats", 
                      nstart = 50)

km_fit_k3 <- km_spec_k3 |> 
  fit(~ RedMeat + WhiteMeat, data = food)

km_fit_k3 
tidyclust cluster object

K-means clustering with 3 clusters of sizes 5, 8, 12

Cluster means:
    RedMeat WhiteMeat
1 15.180000  9.000000
2  8.837500 12.062500
3  8.258333  4.658333

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 
 3  2  1  3  2  2  2  3  1  3  2  1  3  2  3  2  3  3  3  3  1  1  3  2  3 

Within cluster sum of squares by cluster:
[1] 35.66800 39.45750 69.85833
 (between_SS / total_SS =  75.7 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      

European protein: \(k\)-means

Now, let’s do 50 random starts.

km_fit_k3 |> 
  augment(food) |> 
  ggplot() +
  aes(x=RedMeat,
      y=WhiteMeat,
      label=Country, 
      col=.pred_cluster) +
  geom_point() + 
  geom_text_repel(size=2.5) +
  xlab("Percent from red meat") +
  ylab("Percent from white meat") +
  theme(aspect.ratio=1, 
        legend.position = "top")

European protein: \(k\)-means

What about 2 clusters?

set.seed(1)
km_spec_k2 <- k_means(num_clusters = 2) |> 
  parsnip::set_engine("stats", 
                      nstart = 50)

km_fit_k2 <- km_spec_k2 |> 
  fit(~ RedMeat + WhiteMeat, data = food)

km_fit_k2
tidyclust cluster object

K-means clustering with 2 clusters of sizes 11, 14

Cluster means:
    RedMeat WhiteMeat
1  8.109091  4.372727
2 11.178571 10.664286

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 
 1  2  2  1  2  2  2  1  2  1  2  2  1  2  1  2  1  1  1  2  2  2  1  2  1 

Within cluster sum of squares by cluster:
[1]  56.15091 238.35571
 (between_SS / total_SS =  50.6 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      

European protein: \(k\)-means

What about 2 clusters?

km_fit_k2 |> 
  augment(food) |> 
  ggplot() +
  aes(x=RedMeat,
      y=WhiteMeat,
      label=Country, 
      col=.pred_cluster) +
  geom_point() + 
  geom_text_repel(size=2.5) +
  xlab("Percent from red meat") +
  ylab("Percent from white meat") +
  theme(aspect.ratio=1, 
        legend.position = "top")

European protein: \(k\)-means

Or 4?

set.seed(1)
km_spec_k4 <- k_means(num_clusters = 4) |> 
  parsnip::set_engine("stats", 
                      nstart = 50)



km_fit_k4 <- km_spec_k4 |> 
  fit(~ RedMeat + WhiteMeat, data = food)

km_fit_k4
tidyclust cluster object

K-means clustering with 4 clusters of sizes 5, 7, 8, 5

Cluster means:
    RedMeat WhiteMeat
1  6.340000    4.8800
2  9.628571    4.5000
3  8.837500   12.0625
4 15.180000    9.0000

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 
 2  3  4  1  3  3  3  2  4  2  3  4  2  3  2  3  1  1  1  2  4  4  2  3  1 

Within cluster sum of squares by cluster:
[1] 13.38000 24.51429 39.45750 35.66800
 (between_SS / total_SS =  81.0 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      

European protein: \(k\)-means

Or 4?

km_fit_k4 |> 
  augment(food) |> 
  ggplot() +
  aes(x=RedMeat,
      y=WhiteMeat,
      label=Country, 
      col=.pred_cluster) +
  geom_point() + 
  geom_text_repel(size=2.5) +
  xlab("Percent from red meat") +
  ylab("Percent from white meat") +
  theme(aspect.ratio=1, 
        legend.position = "top")

How to choose \(k\)?

Increasing \(k\) will always decrease the within-group error, so we can’t simply choose the \(k\) with the smallest error. Instead, we can plot the within-group sum of squares (wss) against \(k\) and look for an ‘elbow’ where the improvement levels off.
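
The code below uses an object called meat, which isn’t shown being created on these slides; I’ll assume it is simply the two meat variables with the countries as row names, for example:

# Assumed definition of 'meat': just the two meat variables, countries as row names
meat <- food |> 
  select(Country, RedMeat, WhiteMeat) |> 
  column_to_rownames("Country")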

library(factoextra)
fviz_nbclust(meat, 
             kmeans, 
             method='wss', 
             k.max = 5)

The ‘silhouette’ method

The silhouette index measures how well each data point fits within its own cluster compared with the nearest neighbouring cluster.

For each case \(i\), calculate:

\(a(i)\) = the average distance from case \(i\)
to all other members of its own cluster.

\(b(i)\) = the average distance from case \(i\)
to all members of the nearest neighbouring cluster.

\(s(i) = \frac{b(i)-a(i)}{\max(b(i),a(i))}\)

The scaling of \(s(i)\) by the maximum means that \(s(i)\) is always between -1 and 1.

If \(s(i)\) is near 1, the point clearly belongs in its cluster.
If \(s(i)\) is near zero, then the point is “on the fence”.
If \(s(i)\) is negative, then the point is more similar to members of another cluster.
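
As a rough illustration of these quantities (assuming the meat data frame above and the 3-cluster fit km_fit_k3 from earlier), here is how \(s(i)\) could be computed by hand for a single country; the choice of France is arbitrary.

# Hand-computed silhouette for one case (illustrative sketch only)
cl <- km_fit_k3$fit$cluster          # cluster memberships from the k-means fit
d  <- as.matrix(dist(meat))          # pairwise Euclidean distances
i  <- which(rownames(meat) == "France")

own <- cl == cl[i] & seq_along(cl) != i
a_i <- mean(d[i, own])                        # a(i): average distance to own cluster
b_i <- min(tapply(d[i, ], cl, mean)[-cl[i]])  # b(i): nearest other cluster's average distance
s_i <- (b_i - a_i) / max(a_i, b_i)
s_i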

The ‘silhouette’ method

# Create Euclidean distance matrix
dist_meat <- meat |> dist()

# Make silhouette plot
km_fit_k3$fit$cluster |> 
  cluster::silhouette(dist_meat) |> 
  `rownames<-`(rownames(meat)) |> 
  fviz_silhouette(label = T, 
                  print.summary = F) +
  coord_flip()

Using silhouette to choose \(k\)

For any cluster analysis, we can calculate the overall average silhouette score. We can then run the cluster analysis for a range of values of \(k\) and choose the value that gives the highest silhouette score.

The factoextra package provides some convenient functions for this.

library(factoextra)
fviz_nbclust(meat, 
             kmeans, 
             method='silhouette', 
             k.max = 6)

By this criterion, we’d choose \(k\) = 3.

Leave-one-out silhouette

food_cv_metrics <- tune_cluster(
  object = workflow(
    recipe(~ RedMeat + WhiteMeat, 
           data = food), 
    k_means(num_clusters = tune())
    ),
  resamples = vfold_cv(
    food, 
    v = nrow(food)
    ),
  grid = tibble(
    num_clusters=2:6
    ),
  control = control_grid(
    save_pred = TRUE, 
    extract = identity),
  metrics = cluster_metric_set(
    sse_ratio, 
    silhouette_avg
    )
)

food_cv_metrics |> 
  collect_metrics() |> 
  ggplot() +
  aes(x = num_clusters, 
      y = mean, 
      col = .metric) +
  geom_point() + geom_line() +
  ylab("Metric score") + 
  xlab("Number of clusters") 

\(k\)-means for more than two variables

It’s not all about meat! There are actually 9 variables in this dataset.

Note that, although they are all measured as percentages, some vary much more than others.

It is generally sensible to normalise variables (subtract the mean and divide by the standard deviation) before doing \(k\)-means, or any other analysis that uses Euclidean distances. Otherwise, the variables with larger variances will dominate!

Choosing \(k\) for 9 normalised variables

food_norm <- food |> 
  recipe(~ .) |>
  step_normalize(all_numeric()) |> 
  prep() |> 
  bake(food) |> 
  mutate(Country = food$Country) |> 
  column_to_rownames(var="Country")

food_norm |> 
  fviz_nbclust(kmeans, 
               method='silhouette', 
               k.max = 10)

Cluster analysis for nine variables

The fviz_cluster() function will now show the clusters on a Principal Components Analysis plot of the nine variables.

km_all_k2 <- kmeans(food_norm, centers=2, nstart=50)
fviz_cluster(km_all_k2, data=food_norm, repel=T, ggtheme=theme_bw())

Cluster analysis for nine variables

library(GGally) 
food |> 
  select(-Country) |> 
  add_column(Cluster = factor(km_all_k2$cluster)) |> 
  ggpairs(mapping=aes(colour = Cluster))

Summary: \(k\)-means

  • Inherently based on Euclidean distances.

  • It is wise to normalise variables first.

  • For large numbers of variables, ordination methods like Principal Components Analysis (PCA) can be used to visualise clusters.

  • Looks for ‘spherical clusters’; not so good for irregular shapes.

  • Relatively fast, iterative algorithm.

  • The silhouette index can be used to choose \(k\).

  • One can use actual data points as the cluster centres (‘medoids’) instead of centroids, giving ‘\(k\)-medoid’ cluster analysis. This can be implemented with pam() in R.
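
For example, a \(k\)-medoids version of the earlier 3-cluster analysis might look like this (a sketch, assuming the meat data frame used above):

library(cluster)
pam_fit <- pam(meat, k = 3)   # k-medoids on the two meat variables
pam_fit$medoids               # the actual countries used as cluster centres
pam_fit$clustering            # cluster memberships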

Hierarchical cluster analysis

Hierarchical cluster analysis

Hierarchical cluster analysis uses a different approach to \(k\)-means.

  • Is deterministic rather than stochastic
    (no random starts, same every time)
  • Hierarchical structure can be shown with a ‘dendrogram’
  • Does not require prior choice of \(k\); can ‘cut’ the dendrogram at any level
  • Let’s look at how it works with a subset of 6 countries and 2 variables
# Select 2 vars and 6 countries (in this order)
six_countries <- c("Yugoslavia","Romania","Greece",
                   "Albania","Italy","Bulgaria")

food6 <- food |> 
  select(Country, RedMeat, WhiteMeat) |> 
  slice(Country |> 
          factor(levels = six_countries) |> 
          order(na.last=NA)
        ) |> 
  column_to_rownames("Country")

Agglomerative hierarchical clustering

  1. Start with each object in its own cluster
  2. Repeat until all objects are in one cluster:
    1. Calculate the distance between each pair of clusters in the \(\mathbf{x}\) space
    2. Join the closest pair of clusters

Divisive hierarchical clustering

  1. Start with all objects in one cluster
  2. Split into the ‘best’ 2 groups based on some criterion
  3. Continue to split until each object is in its own cluster
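
Before stepping through the calculations by hand, here is a minimal sketch of how both approaches can be run in R on the six-country subset food6 defined above (hclust() for the agglomerative version; diana() from the cluster package is one option for the divisive version):

# Agglomerative clustering of the 6-country subset, 'average' linkage
hc6 <- food6 |> 
  dist() |>                    # Euclidean distance matrix
  hclust(method = "average")   # join closest pairs, bottom-up

plot(hc6)                      # dendrogram
cutree(hc6, k = 2)             # 'cut' the tree into 2 groups

# Divisive alternative (top-down)
dv6 <- cluster::diana(food6)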

Euclidean ‘straight line’ distances

Data matrix
RedMeat WhiteMeat
Yugoslavia 4.4 5.0
Romania 6.2 6.3
Greece 10.2 3.0
Albania 10.1 1.4
Italy 9.0 5.1
Bulgaria 7.8 6.0

Scatterplot


\[ \begin{align} \delta_{s,t} &= \sqrt{\sum_{j=1}^p (x_{sj} - x_{tj})^2} \\ \\ \delta_{\text{Yug,Rom}} &= \sqrt{(4.4-6.2)^2 + (5.0-6.3)^2} \\ &=2.22 \end{align} \]

Distance matrix
Yugoslavia Romania Greece Albania Italy
Romania 2.22 . . . .
Greece 6.14 5.19 . . .
Albania 6.74 6.26 1.6 . .
Italy 4.6 3.05 2.42 3.86 .
Bulgaria 3.54 1.63 3.84 5.14 1.5

Hierarchical clustering, ‘average’ linkage

Data matrix
RedMeat WhiteMeat
Yugoslavia 4.4 5.0
Romania 6.2 6.3
Greece 10.2 3.0
Albania 10.1 1.4
Italy 9.0 5.1
Bulgaria 7.8 6.0
Distance matrix
Yugoslavia Romania Greece Albania Italy
Romania 2.22 . . . .
Greece 6.14 5.19 . . .
Albania 6.74 6.26 1.6 . .
Italy 4.6 3.05 2.42 3.86 .
Bulgaria 3.54 1.63 3.84 5.14 1.5

Dendrogram

Scatterplot

Distance matrix
Yugoslavia Romania Greece Albania Italy
Romania 2.22 . . . .
Greece 6.14 5.19 . . .
Albania 6.74 6.26 1.6 . .
Italy 4.6 3.05 2.42 3.86 .
Bulgaria 3.54 1.63 3.84 5.14 1.5

Dendrogram

Scatterplot

Distance matrix
Yugoslavia Romania Greece Albania
Romania 2.22 . . .
Greece 6.14 5.19 . .
Albania 6.74 6.26 1.6 .
Ita+Bul 4.07 2.34 3.13 4.5

Dendrogram

Scatterplot

Distance matrix
Yugoslavia Romania Gre+Alb
Romania 2.22 . .
Gre+Alb 6.44 5.72 .
Ita+Bul 4.07 2.34 3.82

Dendrogram

Scatterplot

Distance matrix
Yug+Rom Gre+Alb
Gre+Alb 6.08 .
Ita+Bul 3.2 3.82

Dendrogram

Scatterplot

Distance matrix
Yug+Rom+Ita+Bul
Gre+Alb 4.95

Dendrogram

Scatterplot

Other group-joining ‘linkage’ criteria

There are a number of ‘linkage’ criteria, that is, ways of calculating the distance between groups of objects.

  • ‘Average’ linkage is simple and generally very sensible (agglomerative hierarchical cluster analysis with average linkage is sometimes called ‘UPGMA’).
  • ‘Ward’ linkage is another good, though more complicated, option.
  • ‘Complete’ and ‘single’ linkage are typically poor options.
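
In base R, the linkage criterion is chosen via the method argument of hclust(); a quick sketch, assuming the normalised data food_norm defined earlier:

d <- dist(food_norm)               # Euclidean distances on normalised data
hclust(d, method = "average")      # UPGMA
hclust(d, method = "ward.D2")      # Ward's method
hclust(d, method = "complete")     # complete linkage (furthest neighbour)
hclust(d, method = "single")       # single linkage (nearest neighbour)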

Hierarchical and \(k\)-means clustering

With \(k\)-means, the clusters are defined by the distance to a centroid. This results in ‘spherical’ clusters.

Hierarchical clustering joins the most similar points and groups of points from the ‘ground up’. Clusters needn’t be any particular shape. It’s more about ‘gaps’.

Hierarchical clustering of all countries

# Distance matrix for the normalised data, shown as a lower triangle
# (replace_upper_triangle() comes from the rstatix package)
dist(food_norm) |> as.matrix() |> as.data.frame() |> 
  replace_upper_triangle(NA) |> slice(-1) |> select(-last_col()) |>
  column_to_rownames(var="rowname") |> kable(digits=1) |> kable_styling(font_size = 12)
Albania Austria Belgium Bulgaria Czechoslovakia Denmark E Germany Finland France Greece Hungary Ireland Italy Netherlands Norway Poland Portugal Romania Spain Sweden Switzerland UK USSR W Germany
Austria 15.7 . . . . . . . . . . . . . . . . . . . . . . .
Belgium 16.9 5.0 . . . . . . . . . . . . . . . . . . . . . .
Bulgaria 5.6 13.5 14.8 . . . . . . . . . . . . . . . . . . . . .
Czechoslovakia 14.8 4.6 5.2 12.0 . . . . . . . . . . . . . . . . . . . .
Denmark 16.3 6.4 5.8 15.0 7.8 . . . . . . . . . . . . . . . . . . .
E Germany 15.6 4.7 4.7 13.5 4.4 5.7 . . . . . . . . . . . . . . . . . .
Finland 13.9 8.1 8.0 13.2 8.8 5.0 7.6 . . . . . . . . . . . . . . . . .
France 14.5 6.8 5.5 13.1 7.4 6.9 6.7 7.8 . . . . . . . . . . . . . . . .
Greece 8.3 12.4 12.8 6.9 11.4 13.1 12.2 11.6 10.4 . . . . . . . . . . . . . . .
Hungary 10.3 8.7 10.4 7.1 7.1 11.5 8.8 10.8 10.0 7.7 . . . . . . . . . . . . . .
Ireland 16.5 5.6 4.5 15.2 7.6 5.1 5.9 6.6 5.8 13.4 11.4 . . . . . . . . . . . . .
Italy 10.6 9.5 10.0 7.5 7.9 11.4 9.7 10.7 8.9 5.3 5.7 11.6 . . . . . . . . . . . .
Netherlands 16.5 2.1 4.4 14.3 4.9 5.7 4.8 7.7 6.9 13.0 9.6 5.1 10.1 . . . . . . . . . . .
Norway 14.4 7.7 6.6 13.1 7.5 4.4 6.4 4.3 7.2 10.9 10.2 7.2 9.2 7.1 . . . . . . . . . .
Poland 13.9 5.8 6.6 11.1 3.9 8.2 5.2 8.6 7.1 9.8 6.3 8.3 6.6 6.3 7.3 . . . . . . . . .
Portugal 11.4 15.8 16.3 11.9 15.3 14.9 14.2 13.7 13.2 9.6 12.5 15.9 12.2 16.4 13.4 13.2 . . . . . . . .
Romania 7.2 12.6 13.7 3.8 10.7 14.3 12.5 12.5 12.8 7.3 5.7 14.6 6.9 13.3 12.1 10.0 12.8 . . . . . . .
Spain 11.1 11.1 11.1 9.4 9.8 11.5 9.8 10.8 9.3 5.8 7.5 12.0 6.1 11.7 9.5 7.7 8.0 9.0 . . . . . .
Sweden 15.7 6.2 5.5 14.3 7.1 2.9 6.0 4.6 7.2 12.4 10.8 5.7 10.3 5.4 3.3 7.8 15.3 13.3 11.2 . . . . .
Switzerland 14.9 4.3 4.5 12.7 5.1 6.7 6.2 7.6 5.3 11.0 8.9 6.0 7.9 4.1 6.6 5.9 15.6 11.9 10.3 5.7 . . . .
UK 14.6 7.2 5.3 13.6 8.1 6.3 7.2 6.8 4.2 11.0 10.6 4.5 9.8 7.1 6.7 8.4 14.2 13.1 10.2 6.3 5.9 . . .
USSR 11.4 8.6 8.3 9.0 6.0 9.4 7.5 8.1 8.5 8.4 6.1 9.6 6.2 8.8 7.1 5.5 12.8 7.2 7.4 8.4 7.6 8.7 . .
W Germany 17.3 3.5 3.1 15.3 5.5 5.3 4.2 8.0 6.4 13.7 10.6 4.1 10.9 2.6 7.2 6.9 16.6 14.3 11.9 5.5 4.9 6.4 9.3 .
Yugoslavia 5.6 14.8 16.1 3.5 13.2 16.1 14.6 14.0 14.5 7.8 7.8 16.4 9.1 15.6 14.1 12.2 11.8 3.6 9.9 15.4 14.3 14.8 9.7 16.6

Hierarchical clustering of all countries

library(factoextra)
fviz_nbclust(food_norm, hcut, method = "silhouette")

Hierarchical clustering of all countries

hcut2_food <- hcut(food_norm, k = 2) 
fviz_dend(hcut2_food, rect = TRUE, k_colors = c("#004B8D","#E4A024"), main="Dendrogram")

Hierarchical clustering of all countries

fviz_cluster(hcut2_food, palette = c("#E4A024","#004B8D"), repel=T, main="Principal Components Plot")

Summary: Hierarchical cluster analysis

  • Builds a tree-like hierarchical system of groups and subgroups

  • Agglomerative hierarchical cluster analysis
    - Begins with each object in its own group
    - Joins the nearest objects step by step until all objects are in one group

  • Divisive hierarchical cluster analysis
    - Begins with all objects in one group
    - Splits each group into smaller ones until each object is in its own group

  • Can use any dissimilarity measure

  • Puts less emphasis on ‘spherical’ groups than \(k\)-means

Clustering binary data

Clustering binary data

Let’s look at a dataset of binary characteristics of animals.

animals <- cluster::animals |> 
  transmute("warm-blooded"=war-1, "can fly"=fly-1, "vertebrate"=ver-1,
            "endangered"=end-1, "live in groups"=gro-1,"have hair"=hai-1) |>   
  `rownames<-`(c("ant", "bee", "cat", "caterpillar", "chimpanzee", "cow", "duck", 
                 "eagle", "elephant", "fly", "frog", "herring", "lion", "lizard", 
                 "lobster", "human", "rabbit", "salmon", "spider", "whale"))

# Replace some NAs and fix some errors -- humans are not endangered!
animals[c('frog','lobster','salmon'),'live in groups'] <- 1
animals[c('lion','human','spider'),'endangered'] <- c(1,0,0)

animals
            warm-blooded can fly vertebrate endangered live in groups have hair
ant                    0       0          0          0              1         0
bee                    0       1          0          0              1         1
cat                    1       0          1          0              0         1
caterpillar            0       0          0          0              0         1
chimpanzee             1       0          1          1              1         1
cow                    1       0          1          0              1         1
duck                   1       1          1          0              1         0
eagle                  1       1          1          1              0         0
elephant               1       0          1          1              1         0
fly                    0       1          0          0              0         0
frog                   0       0          1          1              1         0
herring                0       0          1          0              1         0
lion                   1       0          1          1              1         1
lizard                 0       0          1          0              0         0
lobster                0       0          0          0              1         0
human                  1       0          1          0              1         1
rabbit                 1       0          1          0              1         1
salmon                 0       0          1          0              1         0
spider                 0       0          0          0              0         1
whale                  1       0          1          1              1         0

Calculating dissimilarity with binary data

Remember, distance or dissimilarity is calculated between each pair of objects.

The two methods considered here for binary data, ‘simple matching’ and ‘Jaccard’, differ in one important aspect: the treatment of ‘double zeros’.

Simple matching

For what proportion of the variables do the two objects have different values?

animals[c('bee','cat'),] |> 
  t() |> 
  data.frame() |> 
  mutate(`Same?` = bee==cat)
               bee cat Same?
warm-blooded     0   1 FALSE
can fly          1   0 FALSE
vertebrate       0   1 FALSE
endangered       0   0  TRUE
live in groups   1   0 FALSE
have hair        1   1  TRUE

Dissimilarity =
4 different / 6 variables = 0.67

Jaccard

For what proportion of the variables do the two objects have different values, excluding ‘double zeros’ (variables where both objects are zero)?

animals[c('bee','cat'),] |> 
  t() |> data.frame() |> 
  group_by(bee, cat) |> 
  summarise(n = n())
# A tibble: 4 × 3
# Groups:   bee [2]
    bee   cat     n
  <dbl> <dbl> <int>
1     0     0     1
2     0     1     2
3     1     0     2
4     1     1     1

Dissimilarity =
4 different / 5 non-double-zeros = 0.8

Simple matching

Manhattan distance is simply the sum of absolute differences. For binary data, the ‘manhattan’ method is equivalent to simple matching without dividing by the number of variables.

dist(animals, method='manhattan') |> hclust() |>
 fviz_dend(horiz = T, main = "") + ylim(6,-1.5)
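
As a quick check against the bee/cat example above, dividing their Manhattan distance by the number of variables reproduces the simple-matching dissimilarity of 0.67:

# 4 mismatches over 6 variables: 4/6 = 0.67
dist(animals[c("bee", "cat"), ], method = "manhattan") / ncol(animals)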

Jaccard

The ‘binary’ method in the dist() function is equivalent to Jaccard dissimilarity.


dist(animals, method='binary') |> hclust() |> 
 fviz_dend(horiz = T, main = "") + ylim(1,-.4)
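
Again checking against the bee/cat example, the ‘binary’ method gives 4 mismatches out of 5 non-double-zero variables:

# Jaccard dissimilarity between bee and cat: 4/5 = 0.8
dist(animals[c("bee", "cat"), ], method = "binary")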

Summary

Summary

  • Cluster analysis is unsupervised classification. There is no target variable.

  • The goal of cluster analysis is to create a new system of groups that have low within-group dissimilarity and high between-group dissimilarity, with respect to our features.

  • There are many methods of cluster analysis. We have focused on:

    • \(k\)-means
    • agglomerative hierarchical clustering

    Other methods include \(k\)-medoids and divisive hierarchical clustering.

  • Euclidean distance, or ‘straight-line’ distance, is the most common measure of dissimilarity for numerical variables, but there are other options.

  • Different types of data require different measures. For example, ‘simple matching’ or ‘Jaccard’ can be used for binary variables. Different measures often yield different results.

  • Metrics such as ‘silhouette’ can help decide how many groups to make.