mclust is a contributed R package for model-based clustering, classification, and density estimation based on finite normal mixture modelling. It provides functions for parameter estimation via the EM algorithm for normal mixture models with a variety of covariance structures, and functions for simulation from these models. Also included are functions that combine model-based hierarchical clustering, EM for mixture estimation and the Bayesian Information Criterion (BIC) in comprehensive strategies for clustering, density estimation and discriminant analysis. Additional functionalities are available for displaying and visualizing fitted models along with clustering, classification, and density estimation results.
This document gives a quick tour of mclust (version 5.4.5) functionalities. It was written in R Markdown, using the knitr package for production. See help(package="mclust") for further details and references provided by citation("mclust").
data(diabetes)
class <- diabetes$class
table(class)
## class
## Chemical   Normal    Overt 
##       36       76       33
X <- diabetes[,-1]
head(X)
##   glucose insulin sspg
## 1      80     356  124
## 2      97     289  117
## 3     105     319  143
## 4      90     356  199
## 5      90     323  240
## 6      86     381  157
clPairs(X, class)summary(BIC)
## Best BIC values:
##              VVV,3       VVV,4       EVE,6
## BIC      -4751.316 -4784.32213 -4785.24591
## BIC diff     0.000   -33.00573   -33.92951
mod1 <- Mclust(X, x = BIC)
summary(mod1, parameters = TRUE)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model
## with 3 components: 
## 
##  log-likelihood   n df       BIC       ICL
##       -2303.496 145 29 -4751.316 -4770.169
## 
## Clustering table:
##  1  2  3 
## 81 36 28 
## 
## Mixing probabilities:
##         1         2         3 
## 0.5368974 0.2650129 0.1980897 
## 
## Means:
##              [,1]     [,2]       [,3]
## glucose  90.96239 104.5335  229.42136
## insulin 357.79083 494.8259 1098.25990
## sspg    163.74858 309.5583   81.60001
## 
## Variances:
## [,,1]
##          glucose    insulin       sspg
## glucose 57.18044   75.83206   14.73199
## insulin 75.83206 2101.76553  322.82294
## sspg    14.73199  322.82294 2416.99074
## [,,2]
##           glucose   insulin       sspg
## glucose  185.0290  1282.340  -509.7313
## insulin 1282.3398 14039.283 -2559.0251
## sspg    -509.7313 -2559.025 23835.7278
## [,,3]
##           glucose   insulin       sspg
## glucose  5529.250  20389.09  -2486.208
## insulin 20389.088  83132.48 -10393.004
## sspg    -2486.208 -10393.00   2217.533
plot(mod1, what = "classification")table(class, mod1$classification)
##           
## class       1  2  3
##   Chemical  9 26  1
##   Normal   72  4  0
##   Overt     0  6 27
plot(mod1, what = "uncertainty")
ICL <- mclustICL(X)
summary(ICL)
## Best ICL values:
##              VVV,3       EVE,6       EVE,7
## ICL      -4770.169 -4797.38232 -4797.50566
## ICL diff     0.000   -27.21342   -27.33677
plot(ICL)
LRT <- mclustBootstrapLRT(X, modelName = "VVV")
LRT
## ------------------------------------------------------------- 
## Bootstrap sequential LRT for the number of mixture components 
## ------------------------------------------------------------- 
## Model        = VVV 
## Replications = 999 
##               LRTS bootstrap p-value
## 1 vs 2   361.16739             0.001
## 2 vs 3   123.49685             0.001
## 3 vs 4    16.76161             0.498EM algorithm is used by mclust for maximum likelihood estimation. Initialisation of EM is performed using the partitions obtained from agglomerative hierarchical clustering. For details see help(mclustBIC) or help(Mclust), and help(hc).
(hc1 <- hc(X, modelName = "VVV", use = "SVD"))
## Call:
## hc(data = X, modelName = "VVV", use = "SVD") 
## 
## Model-Based Agglomerative Hierarchical Clustering 
## Model name        = VVV 
## Use               = SVD 
## Number of objects = 145
BIC1 <- mclustBIC(X, initialization = list(hcPairs = hc1)) # default 
summary(BIC1)
## Best BIC values:
##              VVV,3       VVV,4       EVE,6
## BIC      -4751.316 -4784.32213 -4785.24591
## BIC diff     0.000   -33.00573   -33.92951
(hc2 <- hc(X, modelName = "VVV", use = "VARS"))
## Call:
## hc(data = X, modelName = "VVV", use = "VARS") 
## 
## Model-Based Agglomerative Hierarchical Clustering 
## Model name        = VVV 
## Use               = VARS 
## Number of objects = 145
BIC2 <- mclustBIC(X, initialization = list(hcPairs = hc2))
summary(BIC2)
## Best BIC values:
##              VVV,3       VVE,3       EVE,4
## BIC      -4760.091 -4775.53693 -4793.26143
## BIC diff     0.000   -15.44628   -33.17079
(hc3 <- hc(X, modelName = "EEE", use = "SVD"))
## Call:
## hc(data = X, modelName = "EEE", use = "SVD") 
## 
## Model-Based Agglomerative Hierarchical Clustering 
## Model name        = EEE 
## Use               = SVD 
## Number of objects = 145
BIC3 <- mclustBIC(X, initialization = list(hcPairs = hc3))
summary(BIC3)
## Best BIC values:
##              VVV,3        VVE,4       VVE,3
## BIC      -4751.354 -4757.091572 -4775.69587
## BIC diff     0.000    -5.737822   -24.34212Update BIC by merging the best results:
BIC <- mclustBICupdate(BIC1, BIC2, BIC3)
summary(BIC)
## Best BIC values:
##              VVV,3        VVE,4       VVE,3
## BIC      -4751.316 -4757.091572 -4775.53693
## BIC diff     0.000    -5.775172   -24.22053
plot(BIC)Univariate fit using random starting points obtained by creating random agglomerations (see help(randomPairs)) and merging best results:
data(galaxies, package = "MASS") 
galaxies <- galaxies / 1000
BIC <- NULL
for(j in 1:20)
{
  rBIC <- mclustBIC(galaxies, verbose = FALSE,
                    initialization = list(hcPairs = randomPairs(galaxies)))
  BIC <- mclustBICupdate(BIC, rBIC)
}
summary(BIC)
## Best BIC values:
##                V,3         V,4        V,5
## BIC      -441.6122 -443.399746 -446.34966
## BIC diff    0.0000   -1.787536   -4.73745
plot(BIC)mod <- Mclust(galaxies, x = BIC)
summary(mod)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust V (univariate, unequal variance) model with 3 components: 
## 
##  log-likelihood  n df       BIC       ICL
##       -203.1792 82  8 -441.6122 -441.6126
## 
## Clustering table:
##  1  2  3 
##  3  7 72data(iris)
class <- iris$Species
table(class)
## class
##     setosa versicolor  virginica 
##         50         50         50
X <- iris[,1:4]
head(X)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2
## 4          4.6         3.1          1.5         0.2
## 5          5.0         3.6          1.4         0.2
## 6          5.4         3.9          1.7         0.4
mod2 <- MclustDA(X, class, modelType = "EDDA")
summary(mod2)
## ------------------------------------------------ 
## Gaussian finite mixture model for classification 
## ------------------------------------------------ 
## 
## EDDA model summary: 
## 
##  log-likelihood   n df       BIC
##       -187.7097 150 36 -555.8024
##             
## Classes       n     % Model G
##   setosa     50 33.33   VEV 1
##   versicolor 50 33.33   VEV 1
##   virginica  50 33.33   VEV 1
## 
## Training confusion matrix:
##             Predicted
## Class        setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         47         3
##   virginica       0          0        50
## Classification error = 0.02 
## Brier score          = 0.0127
plot(mod2, what = "scatterplot")data(banknote)
class <- banknote$Status
table(class)
## class
## counterfeit     genuine 
##         100         100
X <- banknote[,-1]
head(X)
##   Length  Left Right Bottom  Top Diagonal
## 1  214.8 131.0 131.1    9.0  9.7    141.0
## 2  214.6 129.7 129.7    8.1  9.5    141.7
## 3  214.8 129.7 129.7    8.7  9.6    142.2
## 4  214.8 129.7 129.6    7.5 10.4    142.0
## 5  215.0 129.6 129.7   10.4  7.7    141.8
## 6  215.7 130.8 130.5    9.0 10.1    141.4
mod3 <- MclustDA(X, class)
summary(mod3)
## ------------------------------------------------ 
## Gaussian finite mixture model for classification 
## ------------------------------------------------ 
## 
## MclustDA model summary: 
## 
##  log-likelihood   n df       BIC
##       -646.0798 200 66 -1641.849
##              
## Classes         n  % Model G
##   counterfeit 100 50   EVE 2
##   genuine     100 50   XXX 1
## 
## Training confusion matrix:
##              Predicted
## Class         counterfeit genuine
##   counterfeit         100       0
##   genuine               0     100
## Classification error = 0 
## Brier score          = 0
plot(mod3, what = "scatterplot")cv <- cvMclustDA(mod2, nfold = 10)
str(cv)
## List of 5
##  $ classification: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ z             : num [1:150, 1:3] 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:3] "setosa" "versicolor" "virginica"
##  $ error         : num 0.0267
##  $ brier         : logi NA
##  $ se            : num 0.0109
unlist(cv[3:4])
##      error      brier 
## 0.02666667         NA
cv <- cvMclustDA(mod3, nfold = 10)
str(cv)
## List of 5
##  $ classification: Factor w/ 2 levels "counterfeit",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ z             : num [1:200, 1:2] 6.72e-05 3.69e-19 5.13e-28 3.32e-20 2.26e-29 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:2] "counterfeit" "genuine"
##  $ error         : num 0
##  $ brier         : logi NA
##  $ se            : num 0
unlist(cv[3:4])
## error brier 
##     0    NAdata(acidity)
mod4 <- densityMclust(acidity)
summary(mod4)
## ------------------------------------------------------- 
## Density estimation via Gaussian finite mixture modeling 
## ------------------------------------------------------- 
## 
## Mclust E (univariate, equal variance) model with 2 components: 
## 
##  log-likelihood   n df       BIC       ICL
##       -185.9493 155  4 -392.0723 -398.5554
plot(mod4, what = "BIC")data(faithful)
mod5 <- densityMclust(faithful)
summary(mod5)
## ------------------------------------------------------- 
## Density estimation via Gaussian finite mixture modeling 
## ------------------------------------------------------- 
## 
## Mclust EEE (ellipsoidal, equal volume, shape and orientation) model with 3
## components: 
## 
##  log-likelihood   n df       BIC       ICL
##       -1126.326 272 11 -2314.316 -2357.824
plot(mod5, what = "BIC")plot(mod5, what = "density", type = "hdr")
plot(mod5, what = "density", type = "hdr",
     data = faithful, points.cex = 0.5)boot1 <- MclustBootstrap(mod1, nboot = 999, type = "bs")
summary(boot1, what = "se")
## ---------------------------------------------------------- 
## Resampling standard errors 
## ---------------------------------------------------------- 
## Model                      = VVV 
## Num. of mixture components = 3 
## Replications               = 999 
## Type                       = nonparametric bootstrap 
## 
## Mixing probabilities:
##          1          2          3 
## 0.05185780 0.05058160 0.03559685 
## 
## Means:
##                1         2         3
## glucose 1.042239  3.444948 16.340816
## insulin 7.554105 29.047203 63.483315
## sspg    7.669033 31.684647  9.926121
## 
## Variances:
## [,,1]
##          glucose   insulin      sspg
## glucose 10.78177  51.28084  51.61617
## insulin 51.28084 529.62298 416.38176
## sspg    51.61617 416.38176 623.81098
## [,,2]
##           glucose   insulin      sspg
## glucose  65.66172  616.6785  442.0993
## insulin 616.67852 7279.0671 3240.3558
## sspg    442.09927 3240.3558 7070.4152
## [,,3]
##           glucose   insulin      sspg
## glucose 1045.6542  4178.685  667.2709
## insulin 4178.6846 18873.253 2495.0278
## sspg     667.2709  2495.028  506.8173
summary(boot1, what = "ci")
## ---------------------------------------------------------- 
## Resampling confidence intervals 
## ---------------------------------------------------------- 
## Model                      = VVV 
## Num. of mixture components = 3 
## Replications               = 999 
## Type                       = nonparametric bootstrap 
## Confidence level           = 0.95 
## 
## Mixing probabilities:
##               1         2         3
## 2.5%  0.4490043 0.1510533 0.1324862
## 97.5% 0.6518326 0.3548749 0.2688038
## 
## Means:
## [,,1]
##        glucose  insulin     sspg
## 2.5%  89.13950 344.9890 150.8405
## 97.5% 93.16603 374.7221 181.8322
## [,,2]
##         glucose  insulin     sspg
## 2.5%   98.82567 447.4121 257.9011
## 97.5% 112.28459 561.3273 374.6194
## [,,3]
##        glucose   insulin      sspg
## 2.5%  198.5986  969.6231  63.22103
## 97.5% 263.2932 1226.2654 101.09078
## 
## Variances:
## [,,1]
##        glucose  insulin     sspg
## 2.5%  38.65508 1234.198 1514.416
## 97.5% 79.43401 3287.722 4146.024
## [,,2]
##         glucose   insulin     sspg
## 2.5%   88.35268  3514.662 12583.92
## 97.5% 358.15175 31416.557 39228.47
## [,,3]
##        glucose   insulin     sspg
## 2.5%  3377.773  47477.74 1317.041
## 97.5% 7379.344 120297.75 3229.747
par(mfrow=c(4,3))
plot(boot1, what = "pro")
plot(boot1, what = "mean")boot4 <- MclustBootstrap(mod4, nboot = 999, type = "bs")
summary(boot4, what = "se")
## ---------------------------------------------------------- 
## Resampling standard errors 
## ---------------------------------------------------------- 
## Model                      = E 
## Num. of mixture components = 2 
## Replications               = 999 
## Type                       = nonparametric bootstrap 
## 
## Mixing probabilities:
##          1          2 
## 0.04130937 0.04130937 
## 
## Means:
##          1          2 
## 0.04669993 0.06719883 
## 
## Variances:
##          1          2 
## 0.02376885 0.02376885
summary(boot4, what = "ci")
## ---------------------------------------------------------- 
## Resampling confidence intervals 
## ---------------------------------------------------------- 
## Model                      = E 
## Num. of mixture components = 2 
## Replications               = 999 
## Type                       = nonparametric bootstrap 
## Confidence level           = 0.95 
## 
## Mixing probabilities:
##               1         2
## 2.5%  0.5364895 0.3004131
## 97.5% 0.6995869 0.4635105
## 
## Means:
##              1        2
## 2.5%  4.279055 6.184439
## 97.5% 4.461108 6.449465
## 
## Variances:
##               1         2
## 2.5%  0.1395796 0.1395796
## 97.5% 0.2317769 0.2317769
par(mfrow=c(2,2))
plot(boot4, what = "pro")
plot(boot4, what = "mean")mod1dr <- MclustDR(mod1)
summary(mod1dr)
## ----------------------------------------------------------------- 
## Dimension reduction for model-based clustering and classification 
## ----------------------------------------------------------------- 
## 
## Mixture model type: Mclust (VVV, 3) 
##         
## Clusters  n
##        1 81
##        2 36
##        3 28
## 
## Estimated basis vectors: 
##              Dir1     Dir2      Dir3
## glucose -0.988671  0.76532 -0.966565
## insulin  0.142656 -0.13395  0.252109
## sspg    -0.046689  0.62955  0.046837
## 
##                Dir1     Dir2      Dir3
## Eigenvalues  1.3506  0.75608   0.53412
## Cum. %      51.1440 79.77436 100.00000
plot(mod1dr, what = "pairs")
mod1dr <- MclustDR(mod1, lambda = 1)
summary(mod1dr)
## ----------------------------------------------------------------- 
## Dimension reduction for model-based clustering and classification 
## ----------------------------------------------------------------- 
## 
## Mixture model type: Mclust (VVV, 3) 
##         
## Clusters  n
##        1 81
##        2 36
##        3 28
## 
## Estimated basis vectors: 
##              Dir1     Dir2
## glucose  0.764699  0.86359
## insulin -0.643961 -0.22219
## sspg     0.023438 -0.45260
## 
##                Dir1      Dir2
## Eigenvalues  1.2629   0.35218
## Cum. %      78.1939 100.00000
plot(mod1dr, what = "scatterplot")mod2dr <- MclustDR(mod2)
summary(mod2dr)
## ----------------------------------------------------------------- 
## Dimension reduction for model-based clustering and classification 
## ----------------------------------------------------------------- 
## 
## Mixture model type: EDDA 
##             
## Classes       n Model G
##   setosa     50   VEV 1
##   versicolor 50   VEV 1
##   virginica  50   VEV 1
## 
## Estimated basis vectors: 
##                  Dir1      Dir2     Dir3     Dir4
## Sepal.Length  0.17425 -0.193663  0.64081 -0.46231
## Sepal.Width   0.45292  0.066561  0.34852  0.57110
## Petal.Length -0.61629 -0.311030 -0.42366  0.46256
## Petal.Width  -0.62024  0.928076  0.53703 -0.49613
## 
##                 Dir1     Dir2      Dir3       Dir4
## Eigenvalues  0.94747  0.68835  0.076141   0.052607
## Cum. %      53.69408 92.70374 97.018700 100.000000
plot(mod2dr, what = "scatterplot")
mod3dr <- MclustDR(mod3)
summary(mod3dr)
## ----------------------------------------------------------------- 
## Dimension reduction for model-based clustering and classification 
## ----------------------------------------------------------------- 
## 
## Mixture model type: MclustDA 
##              
## Classes         n Model G
##   counterfeit 100   EVE 2
##   genuine     100   XXX 1
## 
## Estimated basis vectors: 
##              Dir1      Dir2     Dir3      Dir4      Dir5      Dir6
## Length   -0.10027 -0.327553  0.79718 -0.033721 -0.317043  0.084618
## Left     -0.21760 -0.305350 -0.30266 -0.893676  0.371043 -0.565611
## Right     0.29180 -0.018877 -0.49600  0.406605 -0.861020  0.481331
## Bottom    0.57603  0.445501  0.12002 -0.034570  0.004359 -0.078688
## Top       0.57555  0.385645  0.10093 -0.103629  0.136005  0.625416
## Diagonal -0.44088  0.672251 -0.04781 -0.151473 -0.044035  0.209542
## 
##                 Dir1     Dir2     Dir3     Dir4      Dir5       Dir6
## Eigenvalues  0.87241  0.55372  0.48603  0.13301  0.053113   0.027239
## Cum. %      41.04429 67.09530 89.96182 96.21965 98.718473 100.000000
plot(mod3dr, what = "scatterplot")Most of the graphs produced by mclust use colors that by default are defined in the following options:
mclust.options("bicPlotColors")
##       EII       VII       EEI       EVI       VEI       VVI       EEE 
##    "gray"   "black" "#218B21" "#41884F" "#508476" "#58819C" "#597DC3" 
##       EVE       VEE       VVE       EEV       VEV       EVV       VVV 
## "#5178EA" "#716EE7" "#9B60B8" "#B2508B" "#C03F60" "#C82A36" "#CC0000" 
##         E         V 
##    "gray"   "black"
mclust.options("classPlotColors")
##  [1] "dodgerblue2"    "red3"           "green3"         "slateblue"     
##  [5] "darkorange"     "skyblue1"       "violetred4"     "forestgreen"   
##  [9] "steelblue4"     "slategrey"      "brown"          "black"         
## [13] "darkseagreen"   "darkgoldenrod3" "olivedrab"      "royalblue"     
## [17] "tomato4"        "cyan2"          "springgreen2"The first option controls colors used for plotting BIC, ICL, etc. curves, whereas the second option is used to assign colors for indicating clusters or classes when plotting data.
Color-blind-friendly palettes can be defined and assigned to the above options as follows:
cbPalette <- c("#E69F00", "#56B4E9", "#009E73", "#999999", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
bicPlotColors <- mclust.options("bicPlotColors")
bicPlotColors[1:14] <- c(cbPalette, cbPalette[1:6])
mclust.options("bicPlotColors" = bicPlotColors)
mclust.options("classPlotColors" = cbPalette)
clPairs(iris[,-5], iris$Species)The above color definitions are adapted from http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/, but users can easily define their own palettes if needed.
Scrucca L., Fop M., Murphy T. B. and Raftery A. E. (2016) mclust 5: clustering, classification and density estimation using Gaussian finite mixture models, The R Journal, 8/1, pp. 205-233. https://journal.r-project.org/archive/2016/RJ-2016-021/RJ-2016-021.pdf
Fraley C. and Raftery A. E. (2002) Model-based clustering, discriminant analysis and density estimation, Journal of the American Statistical Association, 97/458, pp. 611-631.
Fraley C., Raftery A. E., Murphy T. B. and Scrucca L. (2012) mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation. Technical Report No. 597, Department of Statistics, University of Washington.
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] mclust_5.4.5 knitr_1.23  
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.6.0  magrittr_1.5    htmltools_0.3.6 tools_3.6.0    
##  [5] yaml_2.2.0      Rcpp_1.0.1      stringi_1.4.3   rmarkdown_1.13 
##  [9] stringr_1.4.0   digest_0.6.19   xfun_0.8        evaluate_0.14