161324 Data Mining | Lecture 10
Recently, we’ve learned about classification models, which aim to predict the value of an existing categorical variable \(y\) based on a set of variables \(\mathbf{x}\). This is called supervised classification.
Cluster analysis is used to create a new categorical variable based on a set of feature variables \(\mathbf{x}\). This is called unsupervised classification.
The goal is to classify our objects into groups that have high within-group similarity and low between-group similarity, with respect to \(\mathbf{X}\).
Lecture 10 | Cluster Analysis
Sometimes there clearly are some groups.
Sometimes there clearly are no groups.
Often it's something in between.
Regardless of the situation, cluster analysis will always produce groups!
\(k\)-means cluster analysis
Choose number of groups \(k\) (here 3).
Initialise the \(k\) centroids in \(\mathbf{X}\) space
(e.g., choose three points at random).
Loop until convergence (a toy version is sketched below):
- Assign each point to its nearest centroid.
- Recompute each centroid as the mean of the points assigned to it.
It is wise to repeat the whole algorithm many times with different random starts to ensure a good result.
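A toy sketch of this loop in R (an illustration only, not from the lecture, and with no handling of empty clusters). In practice, stats::kmeans() does this, with nstart giving the number of random starts.

# Toy k-means: assign points to the nearest centroid, then update the centroids
kmeans_sketch <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  centroids <- X[sample(nrow(X), k), , drop = FALSE]   # random initial centroids
  cluster <- rep(0, nrow(X))
  for (iter in seq_len(max_iter)) {
    # 1. Assign each point to its nearest centroid (squared Euclidean distance)
    d <- sapply(seq_len(k), function(j) colSums((t(X) - centroids[j, ])^2))
    new_cluster <- max.col(-d)
    if (all(new_cluster == cluster)) break              # no change: converged
    cluster <- new_cluster
    # 2. Recompute each centroid as the mean of its assigned points
    centroids <- t(sapply(seq_len(k), function(j)
      colMeans(X[cluster == j, , drop = FALSE])))
  }
  list(cluster = cluster, centers = centroids)
}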
\(k\)-means cluster analysis
This dataset contains the proportions (as percentages) of each of nine major sources of protein in the diets of 25 European countries, some time prior to the 1990s.
library(tidyverse)    # read_csv() and data wrangling
library(knitr)        # kable()
library(kableExtra)   # kable_styling()

food <- read_csv("https://massey.ac.nz/~anhsmith/data/food.csv")

kable(food) |> kable_styling(font_size = 18)
Country | RedMeat | WhiteMeat | Eggs | Milk | Fish | Cereals | Starch | Nuts | Fr.Veg |
---|---|---|---|---|---|---|---|---|---|
Albania | 10.1 | 1.4 | 0.5 | 8.9 | 0.2 | 42.3 | 0.6 | 5.5 | 1.7 |
Austria | 8.9 | 14.0 | 4.3 | 19.9 | 2.1 | 28.0 | 3.6 | 1.3 | 4.3 |
Belgium | 13.5 | 9.3 | 4.1 | 17.5 | 4.5 | 26.6 | 5.7 | 2.1 | 4.0 |
Bulgaria | 7.8 | 6.0 | 1.6 | 8.3 | 1.2 | 56.7 | 1.1 | 3.7 | 4.2 |
Czechoslovakia | 9.7 | 11.4 | 2.8 | 12.5 | 2.0 | 34.3 | 5.0 | 1.1 | 4.0 |
Denmark | 10.6 | 10.8 | 3.7 | 25.0 | 9.9 | 21.9 | 4.8 | 0.7 | 2.4 |
E Germany | 8.4 | 11.6 | 3.7 | 11.1 | 5.4 | 24.6 | 6.5 | 0.8 | 3.6 |
Finland | 9.5 | 4.9 | 2.7 | 33.7 | 5.8 | 26.3 | 5.1 | 1.0 | 1.4 |
France | 18.0 | 9.9 | 3.3 | 19.5 | 5.7 | 28.1 | 4.8 | 2.4 | 6.5 |
Greece | 10.2 | 3.0 | 2.8 | 17.6 | 5.9 | 41.7 | 2.2 | 7.8 | 6.5 |
Hungary | 5.3 | 12.4 | 2.9 | 9.7 | 0.3 | 40.1 | 4.0 | 5.4 | 4.2 |
Ireland | 13.9 | 10.0 | 4.7 | 25.8 | 2.2 | 24.0 | 6.2 | 1.6 | 2.9 |
Italy | 9.0 | 5.1 | 2.9 | 13.7 | 3.4 | 36.8 | 2.1 | 4.3 | 6.7 |
Netherlands | 9.5 | 13.6 | 3.6 | 23.4 | 2.5 | 22.4 | 4.2 | 1.8 | 3.7 |
Norway | 9.4 | 4.7 | 2.7 | 23.3 | 9.7 | 23.0 | 4.6 | 1.6 | 2.7 |
Poland | 6.9 | 10.2 | 2.7 | 19.3 | 3.0 | 36.1 | 5.9 | 2.0 | 6.6 |
Portugal | 6.2 | 3.7 | 1.1 | 4.9 | 14.2 | 27.0 | 5.9 | 4.7 | 7.9 |
Romania | 6.2 | 6.3 | 1.5 | 11.1 | 1.0 | 49.6 | 3.1 | 5.3 | 2.8 |
Spain | 7.1 | 3.4 | 3.1 | 8.6 | 7.0 | 29.2 | 5.7 | 5.9 | 7.2 |
Sweden | 9.9 | 7.8 | 3.5 | 24.7 | 7.5 | 19.5 | 3.7 | 1.4 | 2.0 |
Switzerland | 13.1 | 10.1 | 3.1 | 23.8 | 2.3 | 25.6 | 2.8 | 2.4 | 4.9 |
UK | 17.4 | 5.7 | 4.7 | 20.6 | 4.3 | 24.3 | 4.7 | 3.4 | 3.3 |
USSR | 9.3 | 4.6 | 2.1 | 16.6 | 3.0 | 43.6 | 6.4 | 3.4 | 2.9 |
W Germany | 11.4 | 12.5 | 4.1 | 18.8 | 3.4 | 18.6 | 5.2 | 1.5 | 3.8 |
Yugoslavia | 4.4 | 5.0 | 1.2 | 9.5 | 0.6 | 55.9 | 3.0 | 5.7 | 3.2 |
\(k\)-means cluster analysis
First, let’s try with 3 clusters and 1 random start.
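A hedged sketch of how this fit might be specified with tidyclust (the formula interface and passing nstart through set_engine() are assumptions about the package interface, not code from the lecture):

library(tidymodels)
library(tidyclust)

kmeans_fit <- k_means(num_clusters = 3) |>
  set_engine("stats", nstart = 1) |>       # a single random start
  fit(~ RedMeat + WhiteMeat, data = food)

kmeans_fit

The later slides change only nstart (e.g. 50 random starts) or num_clusters.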
tidyclust cluster object
K-means clustering with 3 clusters of sizes 5, 8, 12
Cluster means:
RedMeat WhiteMeat
1 6.34000 4.88000
2 10.60000 4.65000
3 10.76667 11.31667
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
2 3 3 1 3 3 3 2 3 2 3 3 2 3 2 3 1 1 1 2 3 2 2 3 1
Within cluster sum of squares by cluster:
[1] 13.3800 78.6200 158.0233
(between_SS / total_SS = 58.1 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
\(k\)-means cluster analysis
Now, let’s do 50 random starts.
tidyclust cluster object
K-means clustering with 3 clusters of sizes 5, 8, 12
Cluster means:
RedMeat WhiteMeat
1 15.180000 9.000000
2 8.837500 12.062500
3 8.258333 4.658333
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
3 2 1 3 2 2 2 3 1 3 2 1 3 2 3 2 3 3 3 3 1 1 3 2 3
Within cluster sum of squares by cluster:
[1] 35.66800 39.45750 69.85833
(between_SS / total_SS = 75.7 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
\(k\)-means cluster analysis
What about 2 clusters?
tidyclust cluster object
K-means clustering with 2 clusters of sizes 11, 14
Cluster means:
RedMeat WhiteMeat
1 8.109091 4.372727
2 11.178571 10.664286
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
1 2 2 1 2 2 2 1 2 1 2 2 1 2 1 2 1 1 1 2 2 2 1 2 1
Within cluster sum of squares by cluster:
[1] 56.15091 238.35571
(between_SS / total_SS = 50.6 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
\(k\)-means cluster analysis
Or 4?
tidyclust cluster object
K-means clustering with 4 clusters of sizes 5, 7, 8, 5
Cluster means:
RedMeat WhiteMeat
1 6.340000 4.8800
2 9.628571 4.5000
3 8.837500 12.0625
4 15.180000 9.0000
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
2 3 4 1 3 3 3 2 4 2 3 4 2 3 2 3 1 1 1 2 4 4 2 3 1
Within cluster sum of squares by cluster:
[1] 13.38000 24.51429 39.45750 35.66800
(between_SS / total_SS = 81.0 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
\(k\)-means cluster analysis
Increasing \(k\) will always decrease the within-group error.
\(k\)-means cluster analysis
The silhouette index measures how well each data point fits within its own cluster versus the nearest neighbouring cluster.
For each case \(i\), calculate:
\(a(i)\) = the average distance from case \(i\)
to all other members of its own cluster.
\(b(i)\) = the average distance from case \(i\)
to all members of the nearest neighbouring cluster.
\(s(i) = \frac{b(i)-a(i)}{\max(b(i), a(i))}\)
The scaling of \(s(i)\) by the maximum means that \(s(i)\) is always between -1 and 1.
If \(s(i)\) is near 1, the point clearly belongs in its cluster.
If \(s(i)\) is near zero, then the point is “on the fence”.
If \(s(i)\) is negative, then the point is more similar to members of another cluster.
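As a hedged sketch, the silhouette() function in the cluster package computes \(s(i)\) for every point from a vector of cluster memberships and a distance matrix (variable names here are illustrative):

library(cluster)

km  <- kmeans(food[, c("RedMeat", "WhiteMeat")], centers = 3, nstart = 50)
sil <- silhouette(km$cluster, dist(food[, c("RedMeat", "WhiteMeat")]))

summary(sil)   # average silhouette width per cluster and overall
plot(sil)      # silhouette plot, one bar per country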
Silhouette
For any cluster analysis, we can calculate the overall average silhouette score. We can then run the cluster analysis for a range of values of \(k\) and choose the value that gives the highest silhouette score.
The factoextra package provides some convenient functions for this.
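For example, a minimal sketch with fviz_nbclust(), restricted to the two meat variables used above:

library(factoextra)

food |>
  select(RedMeat, WhiteMeat) |>
  fviz_nbclust(kmeans, method = "silhouette", k.max = 6)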
By this criterion, we’d choose \(k\) = 3.
Silhouette
food_cv_metrics <- tune_cluster(
object = workflow(
recipe(~ RedMeat + WhiteMeat,
data = food),
k_means(num_clusters = tune())
),
resamples = vfold_cv(
food,
v = nrow(food)
),
grid = tibble(
num_clusters=2:6
),
control = control_grid(
save_pred = TRUE,
extract = identity),
metrics = cluster_metric_set(
sse_ratio,
silhouette_avg
)
)
food_cv_metrics |>
collect_metrics() |>
ggplot() +
aes(x = num_clusters,
y = mean,
col = .metric) +
geom_point() + geom_line() +
ylab("Metric score") +
xlab("Number of clusters")
Silhouette
It’s not all about meat! There are actually 9 variables in this dataset.
Note that, although they are all measured as percentages, some vary much more than others.
It is generally sensible to normalise variables (subtract the mean and divide by the standard deviation) before doing \(k\)-means, or any other analysis that uses Euclidean distances. Otherwise, the variables with larger variances will dominate!
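For example, a minimal sketch producing a normalised version of the data (the food_norm object is used again later):

# Centre and scale all nine food variables; keep country names as row names
food_norm <- food |>
  column_to_rownames("Country") |>
  scale() |>
  as.data.frame()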
\(k\)-means cluster analysis
The fviz_cluster() function will now show the clusters on a Principal Components Analysis plot of the nine variables.
\(k\)-means cluster analysis
Inherently based on Euclidean distances.
It is wise to normalise variables first.
For large numbers of variables, ordination methods like Principal Components Analysis (PCA) can be used to visualise clusters.
Looks for ‘spherical clusters’; not so good for irregular shapes.
Relatively fast, iterative algorithm.
The silhouette index can be used to choose \(k\).
One can use actual data points as the cluster centres ('medoids') instead of centroids, giving '\(k\)-medoid' cluster analysis. This can be implemented with pam() in R.
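For example (a sketch, assuming the normalised food_norm data from above):

library(cluster)

# k-medoid clustering: each cluster centre is an actual country
pam_fit <- pam(food_norm, k = 3)
pam_fit$medoids      # the medoid countries
pam_fit$clustering   # cluster memberships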
Hierarchical cluster analysis
Hierarchical cluster analysis uses a different approach to \(k\)-means.
Hierarchical cluster analysis
Country | RedMeat | WhiteMeat |
---|---|---|
Yugoslavia | 4.4 | 5.0 |
Romania | 6.2 | 6.3 |
Greece | 10.2 | 3.0 |
Albania | 10.1 | 1.4 |
Italy | 9.0 | 5.1 |
Bulgaria | 7.8 | 6.0 |
\[ \begin{align} \delta_{s,t} &= \sqrt{\sum_{j=1}^p (x_{sj} - x_{tj})^2} \\ \\ \delta_{\text{Yug,Rom}} &= \sqrt{(4.4-6.2)^2 + (5.0-6.3)^2} \\ &=2.22 \end{align} \]
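This can be checked directly in R (a quick sketch with the two rows entered by hand):

dist(rbind(Yugoslavia = c(4.4, 5.0),
           Romania    = c(6.2, 6.3)))   # 2.22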
|  | Yugoslavia | Romania | Greece | Albania | Italy |
|---|---|---|---|---|---|
Romania | 2.22 | . | . | . | . |
Greece | 6.14 | 5.19 | . | . | . |
Albania | 6.74 | 6.26 | 1.6 | . | . |
Italy | 4.6 | 3.05 | 2.42 | 3.86 | . |
Bulgaria | 3.54 | 1.63 | 3.84 | 5.14 | 1.5 |
Euclidean distance
The closest pair, Italy and Bulgaria (distance 1.5), is merged first. Distances from the new Ita+Bul group to each remaining country are then recalculated; here the group-to-group distance is the average of the pairwise distances (average linkage):

|  | Yugoslavia | Romania | Greece | Albania |
|---|---|---|---|---|
Romania | 2.22 | . | . | . |
Greece | 6.14 | 5.19 | . | . |
Albania | 6.74 | 6.26 | 1.6 | . |
Ita+Bul | 4.07 | 2.34 | 3.13 | 4.5 |
The smallest remaining distance is between Greece and Albania (1.6), so they are merged next:

|  | Yugoslavia | Romania | Gre+Alb |
|---|---|---|---|
Romania | 2.22 | . | . |
Gre+Alb | 6.44 | 5.72 | . |
Ita+Bul | 4.07 | 2.34 | 3.82 |
Then Yugoslavia and Romania (2.22) are merged:

|  | Yug+Rom | Gre+Alb |
|---|---|---|
Gre+Alb | 6.08 | . |
Ita+Bul | 3.2 | 3.82 |
Finally, the Yug+Rom and Ita+Bul groups merge (distance 3.2), leaving two groups that join at the top of the tree:

|  | Yug+Rom+Ita+Bul |
|---|---|
Gre+Alb | 4.95 |
Hierarchical cluster analysis
There are a number of 'linkage' criteria, i.e. ways to calculate the distance between groups of objects: for example, single linkage (minimum pairwise distance), complete linkage (maximum), and average linkage (mean, as used in the example above).
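For example, a minimal sketch with base R's hclust(), assuming the normalised food_norm data from earlier; method = "average" matches the averaging used in the walkthrough above, and cutree() cuts the tree into a chosen number of groups:

hc <- hclust(dist(food_norm), method = "average")

plot(hc)            # dendrogram
cutree(hc, k = 3)   # group memberships for k = 3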
Hierarchical cluster analysis
With \(k\)-means, the clusters are defined by the distance to a centroid. This results in ‘spherical’ clusters.
Hierarchical clustering joins the most similar points and groups of points from the ‘ground up’. Clusters needn’t be any particular shape. It’s more about ‘gaps’.
Hierarchical vs k-means
food_norm |>
  dist(method = dist_type) |>   # dist_type is set earlier in the slides (not shown)
  as.matrix() |>
  as.data.frame() |>
  replace_upper_triangle(NA) |>
  slice(-1) |>
  select(-last_col()) |>
  column_to_rownames(var = "rowname") |>
  kable(digits = 1) |>
  kable_styling(font_size = 12)
|  | Albania | Austria | Belgium | Bulgaria | Czechoslovakia | Denmark | E Germany | Finland | France | Greece | Hungary | Ireland | Italy | Netherlands | Norway | Poland | Portugal | Romania | Spain | Sweden | Switzerland | UK | USSR | W Germany |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Austria | 15.7 | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . |
Belgium | 16.9 | 5.0 | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . |
Bulgaria | 5.6 | 13.5 | 14.8 | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . |
Czechoslovakia | 14.8 | 4.6 | 5.2 | 12.0 | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . |
Denmark | 16.3 | 6.4 | 5.8 | 15.0 | 7.8 | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . |
E Germany | 15.6 | 4.7 | 4.7 | 13.5 | 4.4 | 5.7 | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . |
Finland | 13.9 | 8.1 | 8.0 | 13.2 | 8.8 | 5.0 | 7.6 | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . |
France | 14.5 | 6.8 | 5.5 | 13.1 | 7.4 | 6.9 | 6.7 | 7.8 | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . |
Greece | 8.3 | 12.4 | 12.8 | 6.9 | 11.4 | 13.1 | 12.2 | 11.6 | 10.4 | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . |
Hungary | 10.3 | 8.7 | 10.4 | 7.1 | 7.1 | 11.5 | 8.8 | 10.8 | 10.0 | 7.7 | . | . | . | . | . | . | . | . | . | . | . | . | . | . |
Ireland | 16.5 | 5.6 | 4.5 | 15.2 | 7.6 | 5.1 | 5.9 | 6.6 | 5.8 | 13.4 | 11.4 | . | . | . | . | . | . | . | . | . | . | . | . | . |
Italy | 10.6 | 9.5 | 10.0 | 7.5 | 7.9 | 11.4 | 9.7 | 10.7 | 8.9 | 5.3 | 5.7 | 11.6 | . | . | . | . | . | . | . | . | . | . | . | . |
Netherlands | 16.5 | 2.1 | 4.4 | 14.3 | 4.9 | 5.7 | 4.8 | 7.7 | 6.9 | 13.0 | 9.6 | 5.1 | 10.1 | . | . | . | . | . | . | . | . | . | . | . |
Norway | 14.4 | 7.7 | 6.6 | 13.1 | 7.5 | 4.4 | 6.4 | 4.3 | 7.2 | 10.9 | 10.2 | 7.2 | 9.2 | 7.1 | . | . | . | . | . | . | . | . | . | . |
Poland | 13.9 | 5.8 | 6.6 | 11.1 | 3.9 | 8.2 | 5.2 | 8.6 | 7.1 | 9.8 | 6.3 | 8.3 | 6.6 | 6.3 | 7.3 | . | . | . | . | . | . | . | . | . |
Portugal | 11.4 | 15.8 | 16.3 | 11.9 | 15.3 | 14.9 | 14.2 | 13.7 | 13.2 | 9.6 | 12.5 | 15.9 | 12.2 | 16.4 | 13.4 | 13.2 | . | . | . | . | . | . | . | . |
Romania | 7.2 | 12.6 | 13.7 | 3.8 | 10.7 | 14.3 | 12.5 | 12.5 | 12.8 | 7.3 | 5.7 | 14.6 | 6.9 | 13.3 | 12.1 | 10.0 | 12.8 | . | . | . | . | . | . | . |
Spain | 11.1 | 11.1 | 11.1 | 9.4 | 9.8 | 11.5 | 9.8 | 10.8 | 9.3 | 5.8 | 7.5 | 12.0 | 6.1 | 11.7 | 9.5 | 7.7 | 8.0 | 9.0 | . | . | . | . | . | . |
Sweden | 15.7 | 6.2 | 5.5 | 14.3 | 7.1 | 2.9 | 6.0 | 4.6 | 7.2 | 12.4 | 10.8 | 5.7 | 10.3 | 5.4 | 3.3 | 7.8 | 15.3 | 13.3 | 11.2 | . | . | . | . | . |
Switzerland | 14.9 | 4.3 | 4.5 | 12.7 | 5.1 | 6.7 | 6.2 | 7.6 | 5.3 | 11.0 | 8.9 | 6.0 | 7.9 | 4.1 | 6.6 | 5.9 | 15.6 | 11.9 | 10.3 | 5.7 | . | . | . | . |
UK | 14.6 | 7.2 | 5.3 | 13.6 | 8.1 | 6.3 | 7.2 | 6.8 | 4.2 | 11.0 | 10.6 | 4.5 | 9.8 | 7.1 | 6.7 | 8.4 | 14.2 | 13.1 | 10.2 | 6.3 | 5.9 | . | . | . |
USSR | 11.4 | 8.6 | 8.3 | 9.0 | 6.0 | 9.4 | 7.5 | 8.1 | 8.5 | 8.4 | 6.1 | 9.6 | 6.2 | 8.8 | 7.1 | 5.5 | 12.8 | 7.2 | 7.4 | 8.4 | 7.6 | 8.7 | . | . |
W Germany | 17.3 | 3.5 | 3.1 | 15.3 | 5.5 | 5.3 | 4.2 | 8.0 | 6.4 | 13.7 | 10.6 | 4.1 | 10.9 | 2.6 | 7.2 | 6.9 | 16.6 | 14.3 | 11.9 | 5.5 | 4.9 | 6.4 | 9.3 | . |
Yugoslavia | 5.6 | 14.8 | 16.1 | 3.5 | 13.2 | 16.1 | 14.6 | 14.0 | 14.5 | 7.8 | 7.8 | 16.4 | 9.1 | 15.6 | 14.1 | 12.2 | 11.8 | 3.6 | 9.9 | 15.4 | 14.3 | 14.8 | 9.7 | 16.6 |
Builds a tree-like hierarchical system of groups and subgroups
Agglomerative hierarchical cluster analysis
- Begins with each object in its own group
- Joins the nearest objects and groups step by step until all objects are in one group

Divisive hierarchical cluster analysis
- Begins with all objects in one group
- Splits each group into smaller ones until each object is in its own group
(Both variants are sketched in R below.)
Can use any dissimilarity measure
Puts less emphasis on ‘spherical’ groups than \(k\)-means
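A hedged sketch of both variants using the cluster package (agnes() for agglomerative, diana() for divisive), again assuming the normalised food_norm data:

library(cluster)

d  <- dist(food_norm)

ag <- agnes(d, method = "average")   # agglomerative
dv <- diana(d)                       # divisive

plot(as.hclust(ag))   # dendrograms
plot(as.hclust(dv))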
Hierarchical cluster analysis
Binary data
Let’s look at a dataset of binary characteristics of animals.
animals <- cluster::animals |>
transmute("warm-blooded"=war-1, "can fly"=fly-1, "vertebrate"=ver-1,
"endangered"=end-1, "live in groups"=gro-1,"have hair"=hai-1) |>
`rownames<-`(c("ant", "bee", "cat", "caterpillar", "chimpanzee", "cow", "duck",
"eagle", "elephant", "fly", "frog", "herring", "lion", "lizard",
"lobster", "human", "rabbit", "salmon", "spider", "whale"))
# Replace some NAs and fix some errors -- humans are not endangered!
animals[c('frog','lobster','salmon'),'live in groups'] <- 1
animals[c('lion','human','spider'),'endangered'] <- c(1,0,0)
animals
warm-blooded can fly vertebrate endangered live in groups have hair
ant 0 0 0 0 1 0
bee 0 1 0 0 1 1
cat 1 0 1 0 0 1
caterpillar 0 0 0 0 0 1
chimpanzee 1 0 1 1 1 1
cow 1 0 1 0 1 1
duck 1 1 1 0 1 0
eagle 1 1 1 1 0 0
elephant 1 0 1 1 1 0
fly 0 1 0 0 0 0
frog 0 0 1 1 1 0
herring 0 0 1 0 1 0
lion 1 0 1 1 1 1
lizard 0 0 1 0 0 0
lobster 0 0 0 0 1 0
human 1 0 1 0 1 1
rabbit 1 0 1 0 1 1
salmon 0 0 1 0 1 0
spider 0 0 0 0 0 1
whale 1 0 1 1 1 0
Binary data
Remember, distance or dissimilarity is calculated between each pair of objects.
The two methods considered here for binary data, ‘simple matching’ and ‘Jaccard’, differ in one important aspect: the treatment of ‘double zeros’.
For what proportion of variables do the two objects have different values?
bee cat Same?
warm-blooded 0 1 FALSE
can fly 1 0 FALSE
vertebrate 0 1 FALSE
endangered 0 0 TRUE
live in groups 1 0 FALSE
have hair 1 1 TRUE
Dissimilarity =
4 different / 6 variables = 0.67
For what proportion of variables do the two objects have different values, excluding 'double zeros' (variables where both objects are zero)?
# A tibble: 4 × 3
# Groups: bee [2]
bee cat n
<dbl> <dbl> <int>
1 0 0 1
2 0 1 2
3 1 0 2
4 1 1 1
Dissimilarity =
4 different / 5 non-double-zeros = 0.8
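A quick sketch reproducing both values for the bee vs cat pair in R (dist()'s "binary" method gives the Jaccard-style dissimilarity; this assumes the animals data frame built earlier):

x <- unlist(animals["bee", ])   # bee's six binary values
y <- unlist(animals["cat", ])

# Simple matching: proportion of all six variables that differ
mean(x != y)                           # 4/6 = 0.67

# Jaccard: proportion that differ, ignoring variables where both are 0
dist(rbind(x, y), method = "binary")   # 4/5 = 0.8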
Binary data
Manhattan distance is simply the sum of absolute differences. For binary data, the 'manhattan' method is equivalent to simple matching without dividing by the number of variables.
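Continuing the bee vs cat sketch above:

dist(rbind(x, y), method = "manhattan")   # 4 = the number of mismatching variables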
Binary data
Hierarchical cluster analysis
Cluster analysis is unsupervised classification. There is no target variable.
The goal of cluster analysis is to create a new system of groups that have low within-group dissimilarity and high between-group dissimilarity, with respect to our features.
There are many methods of cluster analysis. We have focused on two: \(k\)-means and (agglomerative) hierarchical clustering.
Other methods include \(k\)-medoids and divisive hierarchical clustering.
Euclidean distance, or ‘straight-line’ distance, is the most common measure of dissimilarity for numerical variables, but there are other options.
Different types of data require different measures. For example, ‘simple matching’ or ‘Jaccard’ can be used for binary variables. Different measures often yield different results.
Metrics such as ‘silhouette’ can help decide how many groups to make.