Chapter 2:
Exploratory Data Analysis (EDA)

Categorical data

Bar graphs or charts

  • For categorical variables
  • height of each bar is proportional to the quantity displayed
  • clustered bar graphs
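
A minimal sketch of a simple and a clustered bar graph in base R. It assumes the built-in `mtcars` data as a stand-in (not the tea data from the slides); the variable names `cyl_counts` and `gear_by_cyl` are illustrative only.

```r
# Counts of a single categorical variable (number of cylinders)
cyl_counts <- table(mtcars$cyl)
barplot(cyl_counts, xlab = "Cylinders", ylab = "Count")

# Two-way table: rows = gears, columns = cylinders
gear_by_cyl <- table(gear = mtcars$gear, cyl = mtcars$cyl)

# beside = TRUE draws one bar per gear within each cylinder group (clustered)
barplot(gear_by_cyl, beside = TRUE, legend.text = TRUE,
        xlab = "Cylinders", ylab = "Count")
```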

tea data

Pie charts

  • Popular but not a good display
  • proportion should be meaningful
  • A stacked barchart is an alternative to multiple piecharts
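
One way to draw the stacked alternative is sketched below, again assuming the built-in `mtcars` data as a stand-in. Each column is converted to proportions first, so every bar has height 1 and plays the role of one pie.

```r
gear_by_cyl <- table(gear = mtcars$gear, cyl = mtcars$cyl)

# margin = 2 rescales each column (cylinder group) to sum to 1
col_props <- prop.table(gear_by_cyl, margin = 2)

# Default beside = FALSE stacks the gear proportions within each bar
barplot(col_props, legend.text = TRUE,
        xlab = "Cylinders", ylab = "Proportion of cars")
```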

https://shiny.massey.ac.nz/kgovinda/demos/explore.counts.of.factors/

One-dimensional graphs

  • Dotplots and stripcharts display one dimensional data (grouped/ungrouped) & are useful to discover gaps and outliers

    • often used to display experimental design data; not good for very small datasets (<20)
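
A hedged sketch of a grouped stripchart in base R, assuming the built-in `ToothGrowth` experiment (tooth length at three dose levels) as example data.

```r
# Number of observations at each dose level
n_per_dose <- table(ToothGrowth$dose)

# method = "stack" piles up tied values so gaps and outliers are easier to see
stripchart(len ~ dose, data = ToothGrowth, method = "stack",
           pch = 16, xlab = "Tooth length", ylab = "Dose")
```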

Stem-and-leaf display

  • Preserves the raw data (approximately) as well as shows the distribution
  • Data summaries are basically displayed
1 | 2: represents 1.2
 leaf unit: 0.1
            n: 50
    2     97. | 68
    3     98* | 3
   11     98. | 56778899
  (17)    99* | 00000111223444444
   22     99. | 5555557789
   12    100* | 01244
    7    100. | 559
    4    101* | 014
         101. | 
    1    102* | 4
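
Base R's `stem()` gives a simpler version of this display, without the depth column shown above. A minimal sketch, assuming simulated normal data like the data used later for the five number summary:

```r
set.seed(1234)
my.data <- rnorm(50, mean = 100)

# scale = 2 roughly doubles the number of stems (two lines per unit)
stem(my.data, scale = 2)
```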

Histograms

  • Counts in class intervals are displayed

  • Display of relative frequencies is preferred

    • Class intervals need not be equal
    • Can mislead for small sample sizes (say <50)
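
When class intervals are unequal, bar heights should be on the density scale so that bar areas (not heights) give the relative frequencies. A minimal sketch with simulated data; the break points are arbitrary choices for illustration.

```r
set.seed(1234)
x <- rnorm(200, mean = 100)

# Unequal class intervals: wide classes in the tails, narrow ones in the middle
breaks <- c(min(x), 98.5, 99.5, 100, 100.5, 101.5, max(x))

# freq = FALSE plots densities instead of counts
h <- hist(x, breaks = breaks, freq = FALSE,
          main = "Density-scaled histogram")

# Bar area = density * class width = relative frequency of the class
areas <- h$density * diff(h$breaks)
```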

Frequency polygon & smoothed density plots

  • smoothed density plots are approximations of the underlying distribution
  • histograms are crude approximations of the true density
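
A hedged sketch that overlays a kernel density estimate on a histogram, assuming simulated data and the default bandwidth of `density()`.

```r
set.seed(1234)
x <- rnorm(200, mean = 100)

hist(x, freq = FALSE, main = "Histogram with smoothed density")

# Kernel density estimate with the default (Gaussian) kernel
d <- density(x)
lines(d, lwd = 2)

# Trapezoid-rule check that the estimate integrates to about 1
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
```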

Summary statistics for EDA

Five number summary

Minimum, lower hinge, median, upper hinge and maximum

set.seed(1234)
my.data <- rnorm(50, 100)
fivenum(my.data)
[1]  97.65430  99.00566  99.46477  99.98486 102.41584
summary(my.data)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  97.65   99.01   99.46   99.55   99.96  102.42 
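
The hinges from `fivenum()` and the quartiles from `summary()` are computed by slightly different rules, which is why the upper values above differ (99.98 versus 99.96). A small example with the integers 1 to 10 makes the difference visible:

```r
x <- 1:10

# Hinges: medians of the lower and upper halves of the sorted data
hinges <- fivenum(x)

# Quartiles by R's default interpolation rule (type 7), as used by summary()
quartiles <- quantile(x, probs = c(0.25, 0.75))

hinges      # 1.0  3.0  5.5  8.0 10.0
quartiles   # 25%: 3.25, 75%: 7.75
```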

Boxplots

  • Graphical display of 5-number summary
  • Can show many batches of data on the same graph
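
A minimal sketch of side-by-side boxplots, assuming the built-in `ToothGrowth` data as the batches. `boxplot.stats()` returns the numbers behind a single box; its middle three values are the hinges and median from `fivenum()`.

```r
# One box per dose level on a common axis
boxplot(len ~ dose, data = ToothGrowth,
        xlab = "Dose", ylab = "Tooth length")

# Numbers behind a single box: whisker ends, hinges and median
b <- boxplot.stats(ToothGrowth$len)
b$stats
```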

Letter Value Display

  • EDA summary in the form of a Table
  • This display represents the tail of the distribution very well
  Depth    Lower     Upper       Mid   Spread
M  25.5 99.92736  99.92736  99.92736 0.000000
F  13.0 99.43952 100.70136 100.07044 1.261832
E   7.0 98.93218 101.20796 100.07007 2.275786
D   4.0 98.73494 101.55871 100.14682 2.823770
C   2.5 98.52396 101.75099 100.13747 3.227034
B   1.5 98.17334 101.97793 100.07564 3.804590
A   1.0 98.03338 102.16896 100.10117 4.135573
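
The depths in this table follow a halving rule: the median has depth (n + 1)/2, and each later depth is (floor(previous depth) + 1)/2. The sketch below rebuilds such a table in base R. `letter_values()` is a hypothetical helper written for illustration, not a function from a package, and the simulated data are an assumption, so the values will not match the table above.

```r
letter_values <- function(x) {
  x <- sort(x)
  n <- length(x)
  # Depths: start at the median and halve until depth 1 (the extremes)
  depth <- (n + 1) / 2
  depths <- depth
  while (depth > 1) {
    depth <- (floor(depth) + 1) / 2
    depths <- c(depths, depth)
  }
  # Value at a (possibly half-integer) depth counted from one end
  at_depth <- function(v, d) (v[floor(d)] + v[ceiling(d)]) / 2
  lower <- sapply(depths, function(d) at_depth(x, d))
  upper <- sapply(depths, function(d) at_depth(rev(x), d))
  data.frame(
    # Labels beyond "A" are left as NA in this sketch
    Letter = c("M", "F", "E", "D", "C", "B", "A")[seq_along(depths)],
    Depth = depths, Lower = lower, Upper = upper,
    Mid = (lower + upper) / 2, Spread = upper - lower
  )
}

set.seed(1234)
x <- rnorm(50, mean = 100)
lv <- letter_values(x)
lv
```

A Mid column that drifts steadily in one direction suggests skewness, while the Spread column shows how quickly the tails stretch out.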

Letter Value Plot

  • suitable for large datasets

Cumulative frequency graphs

  • Show the left tail area
  • Useful to obtain the quantiles (deciles, percentiles, quartiles etc)
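
A hedged sketch of an empirical cumulative distribution plot in base R, assuming simulated data. `ecdf()` returns a function giving the proportion of observations at or below a value (the left tail area), and `quantile()` reads the curve in the other direction.

```r
set.seed(1234)
x <- rnorm(50, mean = 100)

# Fn(v) = proportion of observations <= v
Fn <- ecdf(x)
plot(Fn, main = "Empirical cumulative distribution",
     xlab = "x", ylab = "Cumulative proportion")

# Left tail area below 100
p_below_100 <- Fn(100)

# Deciles of the data
deciles <- quantile(x, probs = seq(0.1, 0.9, by = 0.1))
```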

Shiny apps

https://shiny.massey.ac.nz/kgovinda/demos/explore.univariate.graphs/

https://shiny.massey.ac.nz/kgovinda/demos/get.univariate.plots/

  • Adapt R code from the study guide and past assignments

  • Lots of examples on the web (particularly ggplot2 based)

Quantile-Quantile (QQ) plot

  • Quantiles of one variable are plotted against the quantiles of another

Some Q-Q Plot patterns

  • Case a: Quantiles of Y (mean/median etc) are higher than those of X

  • Case b: Spread or SD of Y > spread or SD of X

  • Case c: X and Y follow different distributions

    • R function: qqplot().
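
The three cases can be imitated with simulated data, as in the sketch below. The distributions and parameters are assumptions chosen only to produce each pattern; the line y = x is added as a reference.

```r
set.seed(1234)
x        <- rnorm(100, mean = 100, sd = 1)
y_shift  <- rnorm(100, mean = 102, sd = 1)  # case a: higher location
y_spread <- rnorm(100, mean = 100, sd = 3)  # case b: larger spread
y_shape  <- 100 + rexp(100)                 # case c: different (skewed) shape

op <- par(mfrow = c(1, 3))
qqplot(x, y_shift, main = "Case a"); abline(0, 1)   # parallel to y = x, shifted up
qqplot(x, y_spread, main = "Case b"); abline(0, 1)  # straight but steeper than y = x
qqplot(x, y_shape, main = "Case c"); abline(0, 1)   # curved
par(op)
```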

Bivariate relationships

  • A scatter plot can show

    • nature of relationship (linear, nonlinear etc.)
    • gaps/subgroups
    • outliers
    • lowess smoothing helps (may not work well for small number of points)
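
A minimal sketch of a scatter plot with a lowess curve, assuming the built-in `cars` data (speed and stopping distance) as the example.

```r
plot(dist ~ speed, data = cars,
     xlab = "Speed", ylab = "Stopping distance")

# lowess() returns smoothed y values at the sorted x values
fit <- lowess(cars$speed, cars$dist)
lines(fit, lwd = 2)
```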

Marginal Plot

  • Shows both bivariate relationships and univariate (marginal) distributions

Shiny apps

https://shiny.massey.ac.nz/kgovinda/demos/explore.bivariate.plots/

https://shiny.massey.ac.nz/kgovinda/demos/get.bivariate.plots/

https://shiny.massey.ac.nz/kgovinda/demos/explore.facet.wrapped.plots/

https://shiny.massey.ac.nz/kgovinda/demos/get.facet.wrapped.plots/

https://shiny.massey.ac.nz/kgovinda/demos/explore.facet.grid.plots/

Scatterplot matrices (i.e. all pairwise scatters)

Grouping Variable

  • The plot below shows the area effect clearly.

Correlation coefficients

  • Pearson correlation coefficient is a measure of linear relationship

    • Case a: Positive linear relationship
    • Case b: Negative linear relationship

Correlation Matrix

  • To show all pairwise correlation coefficients
  • Useful to explore the inter-relationship between variables
Call:corr.test(x = pinetree[, -1])
Correlation matrix 
        Top Third Second First
Top    1.00  0.92   0.96  0.97
Third  0.92  1.00   0.95  0.91
Second 0.96  0.95   1.00  0.97
First  0.97  0.91   0.97  1.00
Sample Size 
[1] 60
Probability values (Entries above the diagonal are adjusted for multiple tests.) 
       Top Third Second First
Top      0     0      0     0
Third    0     0      0     0
Second   0     0      0     0
First    0     0      0     0

 To see confidence intervals of the correlations, print with the short=FALSE option
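
The output above comes from `corr.test()` in the psych package. A comparable base R sketch is given below, assuming the four numeric columns of the built-in `iris` data as a stand-in for the pinetree data.

```r
num_vars <- iris[, 1:4]

# All pairwise Pearson correlation coefficients
r <- cor(num_vars)
round(r, 2)

# Test and confidence interval for a single pair
ct <- cor.test(num_vars$Petal.Length, num_vars$Petal.Width)
ct$estimate
ct$conf.int
```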

Correlation Plots

3-D Plots

  • A bubble plot shows the third (fourth) variable as point size (colour).

3-D plots are far more useful if you can rotate them

Package plot3D

Package plotly

Contour plots

  • 3D plots are in general more difficult to interpret than 2D plots

  • Contour plots are another way of looking at three variables in two dimensions
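
A hedged sketch comparing the two views, assuming the built-in `volcano` height matrix as the third variable over a grid of row and column positions.

```r
# volcano is a matrix of heights on a regular grid
x <- seq_len(nrow(volcano))
y <- seq_len(ncol(volcano))

# Contour plot: lines of equal height drawn in two dimensions
contour(x, y, volcano, xlab = "Row", ylab = "Column")

# Static perspective plot of the same surface for comparison
persp(x, y, volcano, theta = 30, phi = 30,
      xlab = "Row", ylab = "Column", zlab = "Height")
```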

Conditioning plots

  • Conditioning Plots (Coplots) show two variables at different ranges of a third variable
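
A minimal coplot sketch in base R, assuming the built-in `quakes` data: earthquake latitude against longitude, conditioned on four overlapping ranges of depth.

```r
# Four overlapping depth ranges with roughly equal numbers of points
depth_ranges <- co.intervals(quakes$depth, number = 4, overlap = 0.1)

# One lat-long panel per depth range
coplot(lat ~ long | depth, data = quakes, given.values = depth_ranges)
```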

More R graphs

  • Build plots in a single layout. R packages patchwork or gridExtra can be used.

Time series data

  • A time series is an ordered sequence of observations of one or more variables, often made at equally spaced time points.

  • Time series Components of variation

    • Trend - representing long term positive (upward) or negative (downward) movement
    • Seasonal - periodic behaviour that occurs within a block (say Christmas time) of a given time period (say a calendar year) and repeats fairly regularly over time (say year after year)
    • Error (Residual)

Time Series Example

Autocorrelation function (ACF)

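  • The \(k^\text{th}\) order ACF, or the autocorrelation between \(x_t\) and \(x_{t-k}\), is given by the following; the second form assumes a stationary series, so that \(\text{SD}(x_t) = \text{SD}(x_{t-k})\)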

\[\frac{\text{Covariance}(x_t, x_{t-k})}{\text{SD}(x_t)\text{SD}(x_{t-k})} = \frac{\text{Covariance}(x_t, x_{t-k})}{\text{Variance}(x_t)}\]
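
The sample version of this formula can be checked against R's `acf()`. The sketch below assumes a simulated AR(1) series; `acf_manual()` is a hypothetical helper that uses the same divisor for every lag, which is the convention `acf()` follows.

```r
set.seed(1234)
x <- as.numeric(arima.sim(model = list(ar = 0.7), n = 200))
n <- length(x)
xbar <- mean(x)

# Lag-k sample autocorrelation: lag-k sum of cross-products over the lag-0 sum
acf_manual <- function(k) {
  sum((x[(k + 1):n] - xbar) * (x[1:(n - k)] - xbar)) / sum((x - xbar)^2)
}

r_manual  <- sapply(1:5, acf_manual)
# The first element returned by acf() is lag 0 (always 1), so drop it
r_builtin <- acf(x, lag.max = 5, plot = FALSE)$acf[2:6]
```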

Autocorrelation function (ACF) Plot

The significance of autocorrelations may be judged from the 95% confidence interval band

Autocorrelations decay to zero ($20 notes depend positively on the values of $20 notes held in the immediate past rather than in the distant past)

PACF (Partial Autocorrelation Function)

  • A type of correlation after removing the effect of earlier lags
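
A hedged sketch with a simulated AR(1) series. At lag 1 there are no earlier lags to remove, so the PACF equals the ACF; for an AR(1) process the later partial autocorrelations are expected to be near zero.

```r
set.seed(1234)
x <- arima.sim(model = list(ar = 0.7), n = 200)

r <- acf(x, plot = FALSE)$acf    # element 1 is lag 0
p <- pacf(x, plot = FALSE)$acf   # element 1 is lag 1

lag1_acf  <- r[2]
lag1_pacf <- p[1]

pacf(x, main = "PACF of a simulated AR(1) series")
```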

Time series trend types

Requires a (parametric) model to fit the trend (covered later)

Non-parametric fits can also be made

Seasonality

A simple scatter plot of the response variable against time may reveal seasonality directly

Sub-series plots

Seasonality is easily seen graphically when grouping variables are used
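
A minimal sketch of two grouped displays, assuming the built-in monthly temperature series `nottem` as the example. `monthplot()` draws one sub-series per month, and boxplots by month use the month as a grouping variable.

```r
# Month of each observation as a grouping factor
month <- factor(as.integer(cycle(nottem)), labels = month.abb)
temp  <- as.numeric(nottem)

# Sub-series plot: one short series per calendar month
monthplot(nottem, ylab = "Temperature")

# Boxplots grouped by month
boxplot(temp ~ month, xlab = "Month", ylab = "Temperature")
```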

ACF plot showing seasonality

White noise errors

Example using random normal data
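
A sketch of this example, assuming 200 independent standard normal values. The band drawn by `acf()` is approximately ±1.96/√n, so for white noise only about one lag in twenty is expected to fall outside it by chance.

```r
set.seed(1234)
e <- rnorm(200)

# acf() plots the correlogram and returns the values invisibly
a <- acf(e, lag.max = 20, main = "ACF of white noise")

# Half-width of the approximate 95% band around zero
band <- qnorm(0.975) / sqrt(length(e))

# Number of lags 1..20 outside the band
outside <- sum(abs(a$acf[-1]) > band)
```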

Time series decomposition

  • Additive model
    \(X_t\) = Trend + Seasonal + Error
    (where \(X_t\) is an observation at time \(t\))

  • Multiplicative model
    \(X_t\) = Trend \(\times\) Seasonal + Error
    (trend and seasonal components are not independent)

  • Detrending means removing the trend from the series, making it easier to see the seasonality.

  • Deseasoning means removing the seasonality from the series, making it easier to see the trend.
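
A hedged sketch of the additive model using base R's `decompose()`, assuming the built-in monthly `co2` series as the example. Only the additive case is shown. The trend here is a moving average, so it is missing (NA) for the first and last six months.

```r
dec <- decompose(co2, type = "additive")

# Additive model: the three components add back to the series
recon <- dec$trend + dec$seasonal + dec$random

# Detrended series: seasonality is easier to see
detrended <- co2 - dec$trend

# Deseasoned series: trend is easier to see
deseasoned <- co2 - dec$seasonal

plot(dec)
```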

Learning EDA

  • The best way to learn EDA is to try many approaches and find which are informative and which are not.

  • Chatfield (1995) on tackling statistical problems:

    • Do not attempt to analyse the data until you understand what is being measured and why. Find out whether there is prior information, such as any likely effects.
    • Find out how the data were collected.
    • Look at the structure of the data.
    • The data then need to be carefully examined in an exploratory way before attempting a more sophisticated analysis.
    • Use common sense, and be honest!

Summary

  • Size

    • For small datasets, we cannot be assertive.
    • Some displays are affected by sample size (e.g. stem plot); some are not (e.g. smoothed density)
  • Shape

    • We are concerned with overall shape of distribution.
    • Are there gaps and/or many peaks (modes)?
    • Is the distribution symmetrical? Is the distribution normal?
  • Outliers

    • More important than points in the middle
    • boxplots & scatter plots show them
  • Graphs should be simple and informative; certainly not misleading!