Chapter 2: Exploratory Data Analysis (EDA)

Categorical data

Bar graphs or charts

Used for categorical variables
The height of each bar is proportional to the quantity displayed
Clustered bar graphs show the categories of one variable within each level of another
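
A minimal base R sketch of a simple and a clustered bar graph. The `drinks` data frame is invented for illustration; it is not the tea data shown in these slides.

```r
# Hypothetical data, invented for illustration (not the tea data)
drinks <- data.frame(
  type  = c("black", "green", "black", "Earl Grey",
            "green", "black", "Earl Grey", "black"),
  sugar = c("yes", "no", "no", "yes", "no", "yes", "no", "no")
)

# Simple bar graph: bar height is the count in each category
counts <- table(drinks$type)
barplot(counts, ylab = "Count")

# Clustered bar graph: one bar per sugar level within each tea type
two_way <- table(drinks$sugar, drinks$type)
barplot(two_way, beside = TRUE, legend.text = TRUE, ylab = "Count")
```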

tea data

Pie charts

Popular but not a good display
proportion should be meaningful
A stacked bar chart is an alternative to multiple pie charts
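
As a rough sketch of that alternative, the columns of a two-way table can be converted to proportions and drawn as stacked bars, one bar per group in place of one pie per group. The counts below are made up.

```r
# Hypothetical counts: rows are responses, columns are groups
counts <- matrix(c(30, 50, 20,
                   10, 40, 50),
                 nrow = 3,
                 dimnames = list(response = c("low", "medium", "high"),
                                 group = c("A", "B")))

# Convert each column to proportions so that every bar has total height 1
props <- prop.table(counts, margin = 2)
barplot(props, legend.text = TRUE, ylab = "Proportion")
```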

https://shiny.massey.ac.nz/kgovinda/demos/explore.counts.of.factors/

One-dimensional graphs

Dotplots and stripcharts display one-dimensional data (grouped or ungrouped) and are useful for discovering gaps and outliers
- often used to display experimental design data; not good for very small datasets (<20)
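
A possible base R version uses `stripchart()`. The data are simulated and the treatment labels are hypothetical.

```r
set.seed(1)
# Simulated responses for three hypothetical treatment groups
yield <- c(rnorm(10, 20), rnorm(10, 22), rnorm(10, 25))
treatment <- factor(rep(c("A", "B", "C"), each = 10))

# Ungrouped: stacking repeated (rounded) values gives a dotplot
stripchart(round(yield), method = "stack", pch = 16, xlab = "Yield (rounded)")

# Grouped: one strip per treatment, jittered to reduce overplotting
stripchart(yield ~ treatment, method = "jitter", pch = 16, xlab = "Yield")
```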

Stem-and-leaf display

Preserves the raw data (approximately) as well as shows the distribution
Basic data summaries can be read off the display

1 | 2: represents 1.2
 leaf unit: 0.1
            n: 50
    2     97. | 68
    3     98* | 3
   11     98. | 56778899
  (17)    99* | 00000111223444444
   22     99. | 5555557789
   12    100* | 01244
    7    100. | 559
    4    101* | 014
         101. | 
    1    102* | 4
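
Base R can draw a similar display with `stem()`. This is only a sketch: base `stem()` does not print the depth column shown above, which looks like the output of an add-on function such as `aplpack::stem.leaf()` (an assumption, since the slides do not name the function).

```r
set.seed(1234)
x <- rnorm(50, 100)   # simulated sample

stem(x)               # default number of stems
stem(x, scale = 2)    # roughly twice as many stems
```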

Histograms

Counts in class intervals are displayed
Display of relative frequencies is preferred
- Class intervals need not be equal
- Can mislead for small sample sizes (say <50)
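
A short sketch of both points with `hist()` on simulated data. With `freq = FALSE` the vertical axis is a density, so the area of each bar is the relative frequency of its class; the break points for the unequal classes are chosen arbitrarily.

```r
set.seed(1234)
x <- rnorm(200, 100)   # simulated sample

# Equal class intervals, drawn on the density scale
h <- hist(x, freq = FALSE, main = "Equal class intervals")

# Unequal class intervals: wider classes in the sparse tails
brks <- c(min(x), 98.5, 99, 99.5, 100, 100.5, 101, 101.5, max(x))
h2 <- hist(x, breaks = brks, main = "Unequal class intervals")

# Bar areas (density times class width) sum to 1 in both cases
sum(h$density * diff(h$breaks))
sum(h2$density * diff(h2$breaks))
```

When the breaks are unequal, `hist()` switches to the density scale by default, because bar heights based on raw counts would exaggerate the wide classes.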

Frequency polygon & smoothed density plots

smoothed density plots are approximations of the underlying distribution
histograms are crude approximations of the true density
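
One way to compare the three displays is to overlay a frequency polygon and a kernel density estimate on a histogram. The sketch below uses simulated data and the default settings of `density()`.

```r
set.seed(1234)
x <- rnorm(200, 100)   # simulated sample

h <- hist(x, freq = FALSE, main = "Histogram, polygon and density")

# Frequency polygon: join the bar tops at the class midpoints
lines(h$mids, h$density, lty = 2)

# Smoothed density: Gaussian kernel with the default bandwidth
d <- density(x)
lines(d, lwd = 2)
```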

Summary statistics for EDA

Five number summary

Minimum, lower hinge, median, upper hinge and maximum

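set.seed(1234)              # make the simulated sample reproducible
my.data <- rnorm(50, 100)   # 50 draws from a normal with mean 100 and SD 1
fivenum(my.data)            # minimum, lower hinge, median, upper hinge, maximum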

[1]  97.65430  99.00566  99.46477  99.98486 102.41584

summary(my.data)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  97.65   99.01   99.46   99.55   99.96  102.42

Boxplots

Graphical display of 5-number summary
Can show many batches of data on the same graph
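
A minimal sketch with three simulated batches. It also checks that the box is drawn from the hinges and median returned by `fivenum()`; the whisker ends can differ from the minimum and maximum when outliers are flagged.

```r
set.seed(1)
# Three simulated batches (hypothetical data); batch C is right-skewed
batch <- factor(rep(c("A", "B", "C"), each = 40))
value <- c(rnorm(40, 100, 1), rnorm(40, 101, 2), rexp(40) + 99)

boxplot(value ~ batch, ylab = "Value")

# Elements 2 to 4 of boxplot.stats()$stats are the hinges and the median
x <- value[batch == "A"]
box <- boxplot.stats(x)$stats
rbind(boxplot = box, fivenum = fivenum(x))
```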

Letter Value Display

EDA summary in the form of a table
This display represents the tails of the distribution very well

  Depth    Lower     Upper       Mid   Spread
M  25.5 99.92736  99.92736  99.92736 0.000000
F  13.0 99.43952 100.70136 100.07044 1.261832
E   7.0 98.93218 101.20796 100.07007 2.275786
D   4.0 98.73494 101.55871 100.14682 2.823770
C   2.5 98.52396 101.75099 100.13747 3.227034
B   1.5 98.17334 101.97793 100.07564 3.804590
A   1.0 98.03338 102.16896 100.10117 4.135573
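
The columns of such a table can be reproduced by hand. The sketch below is a hand-rolled helper with a hypothetical name, `letter_values()`; it is not the function that produced the table above. It assumes the usual halving rule for depths: the median has depth (n + 1)/2, and each later depth is (floor(previous depth) + 1)/2.

```r
# Hypothetical helper: letter values computed from their depths
letter_values <- function(x) {
  x <- sort(x)
  n <- length(x)

  # Depths: start at the median and halve until depth 1 (the extremes)
  depth <- (n + 1) / 2
  depths <- depth
  while (depth > 1) {
    depth <- (floor(depth) + 1) / 2
    depths <- c(depths, depth)
  }

  # A half-integer depth averages the two neighbouring order statistics
  from_low  <- function(d) mean(x[c(floor(d), ceiling(d))])
  from_high <- function(d) mean(rev(x)[c(floor(d), ceiling(d))])
  lower <- sapply(depths, from_low)
  upper <- sapply(depths, from_high)

  data.frame(Depth = depths, Lower = lower, Upper = upper,
             Mid = (lower + upper) / 2, Spread = upper - lower)
}

set.seed(1)
letter_values(rnorm(50, 100))   # simulated sample, not the data above
```

For n = 50 the depths are 25.5, 13, 7, 4, 2.5, 1.5 and 1, which agrees with the Depth column of the table above. Mid values that drift away from the median as the depth decreases suggest skewness.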

Letter Value Plot

suitable for large datasets

Cumulative frequency graphs

Show the left tail area
Useful to obtain the quantiles (deciles, percentiles, quartiles etc)
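
A small sketch of the idea with the empirical cumulative distribution function, `ecdf()`, on simulated data. `quantile()` goes in the other direction, from a left tail area to a value.

```r
set.seed(1234)
x <- rnorm(50, 100)   # simulated sample

Fn <- ecdf(x)         # step function: proportion of observations <= a value
plot(Fn, main = "Empirical cumulative distribution")

Fn(99)                # left tail area: proportion of observations <= 99

# Selected deciles and the quartiles
quantile(x, probs = c(0.10, 0.25, 0.50, 0.75, 0.90))
```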

Shiny apps

https://shiny.massey.ac.nz/kgovinda/demos/explore.univariate.graphs/

https://shiny.massey.ac.nz/kgovinda/demos/get.univariate.plots/

Adapt R code from the study guide and past assignments
There are lots of examples on the web (particularly ggplot2-based)

Quantile-Quantile (QQ) plot

Quantiles of one variable are plotted against the quantiles of another

Some Q-Q Plot patterns

Case a: Quantiles of Y (mean/median etc.) are higher than those of X
Case b: Spread or SD of Y > spread or SD of X
Case c: X and Y follow different distributions
- R function: qqplot().
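
The three cases can be imitated with simulated data. This is a sketch only; the distributions below are chosen for illustration. The dashed line is y = x: points running parallel to it but shifted upwards suggest case a, points steeper than it suggest case b, and a curved pattern suggests case c.

```r
set.seed(1)
x        <- rnorm(200)
y_shift  <- rnorm(200, mean = 1)   # case a: higher location
y_spread <- rnorm(200, sd = 2)     # case b: larger spread
y_shape  <- rexp(200)              # case c: different distribution

op <- par(mfrow = c(1, 3))
for (y in list(y_shift, y_spread, y_shape)) {
  qq <- qqplot(x, y)               # sorted x against sorted y
  abline(0, 1, lty = 2)            # reference line y = x
}
par(op)
```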

Bivariate relationships

A scatter plot can show
- nature of relationship (linear, nonlinear etc.)
- gaps/subgroups
- outliers
- lowess smoothing helps (may not work well for small number of points)
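
A minimal sketch of a scatter plot with a lowess curve, using a simulated nonlinear relationship. The span `f = 1/3` is an arbitrary choice; larger values give a smoother curve.

```r
set.seed(1)
x <- runif(100, 0, 10)
y <- sin(x) + rnorm(100, sd = 0.3)   # simulated nonlinear relationship

plot(x, y)
fit <- lowess(x, y, f = 1/3)         # f is the proportion of points used locally
lines(fit, lwd = 2)
```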

Marginal Plot

Shows both bivariate relationships and univariate (marginal) distributions

Shiny apps

https://shiny.massey.ac.nz/kgovinda/demos/explore.bivariate.plots/

https://shiny.massey.ac.nz/kgovinda/demos/get.bivariate.plots/

https://shiny.massey.ac.nz/kgovinda/demos/explore.facet.wrapped.plots/

https://shiny.massey.ac.nz/kgovinda/demos/get.facet.wrapped.plots/

https://shiny.massey.ac.nz/kgovinda/demos/explore.facet.grid.plots/

Scatterplot matrices (i.e. all pairwise scatters)
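
In base R, `pairs()` draws a scatterplot matrix. The sketch below uses the built-in `iris` data purely as a stand-in for the data in these slides; the second call colours the points by a grouping variable.

```r
# All pairwise scatter plots of the four numeric columns
pairs(iris[, 1:4])

# The same display with points coloured by a grouping variable
pairs(iris[, 1:4], col = as.integer(iris$Species), pch = 16)
```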

Grouping Variable

The plot below shows the area effect clearly.

Correlation coefficients

Pearson correlation coefficient is a measure of linear relationship
- Case a: Positive linear relationship
- Case b: Negative linear relationship
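
A short worked example on simulated data: the coefficient is the covariance divided by the product of the standard deviations, and it only measures linear association, so a perfect nonlinear relationship can still give a value of zero.

```r
set.seed(1)
x     <- rnorm(60)
y_pos <-  2 * x + rnorm(60, sd = 0.5)   # case a: positive linear relationship
y_neg <- -2 * x + rnorm(60, sd = 0.5)   # case b: negative linear relationship

# Pearson correlation from its definition, compared with cor()
r_manual <- cov(x, y_pos) / (sd(x) * sd(y_pos))
c(manual = r_manual, positive = cor(x, y_pos), negative = cor(x, y_neg))

# A perfect quadratic relationship on symmetric x has zero linear correlation
u <- -3:3
r_quad <- cor(u, u^2)
r_quad
```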

Correlation Matrix

To show all pairwise correlation coefficients
Useful to explore the inter-relationship between variables

Call:corr.test(x = pinetree[, -1])
Correlation matrix 
        Top Third Second First
Top    1.00  0.92   0.96  0.97
Third  0.92  1.00   0.95  0.91
Second 0.96  0.95   1.00  0.97
First  0.97  0.91   0.97  1.00
Sample Size 
[1] 60
Probability values (Entries above the diagonal are adjusted for multiple tests.) 
       Top Third Second First
Top      0     0      0     0
Third    0     0      0     0
Second   0     0      0     0
First    0     0      0     0

 To see confidence intervals of the correlations, print with the short=FALSE option

Correlation Plots

3-D Plots

A bubble plot shows a third variable as point size (and a fourth as colour).

3-D plots are far more useful if you can rotate them

Package plot3D

Package plotly

Contour plots

3-D plots are generally more difficult to interpret than 2-D plots
Contour plots are another way of looking at three variables in two dimensions
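
The two views can be compared side by side in base R. The surface below is a made-up bell-shaped function, used only for illustration.

```r
x <- seq(-3, 3, length.out = 50)
y <- seq(-3, 3, length.out = 50)

# Hypothetical surface: height z evaluated at every (x, y) grid point
z <- outer(x, y, function(x, y) exp(-(x^2 + y^2) / 2))

op <- par(mfrow = c(1, 2))
persp(x, y, z, theta = 30, phi = 30, main = "3-D view")
contour(x, y, z, main = "Contour plot")
par(op)
```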

Conditioning plots

Conditioning plots (coplots) show the relationship between two variables over different ranges of a third variable
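
A minimal sketch with base R's `coplot()`, using the built-in `quakes` data as a stand-in example: latitude against longitude, conditioned on earthquake depth. The number of depth ranges and their overlap are arbitrary choices.

```r
# The depth ranges that coplot() will condition on (one row per panel)
iv <- co.intervals(quakes$depth, number = 4, overlap = 0.1)
iv

# One lat-versus-long panel for each depth range
coplot(lat ~ long | depth, data = quakes, number = 4, overlap = 0.1)
```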

More `R` graphs

Build plots in a single layout. R packages patchwork or gridExtra can be used.

Time series data

A time series is an ordered sequence of observations of one or more variables, often made at equally spaced time points.
Components of variation in a time series
- Trend - long-term positive (upward) or negative (downward) movement
- Seasonal - periodic behaviour that occurs within a block (say Christmas time) of a given time period (say a calendar year) and repeats fairly regularly over time (say year after year)
- Error (Residual)

Time Series Example

Autocorrelation function (ACF)

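The $k^{\text{th}}$ order autocorrelation, that is, the autocorrelation between $x_t$ and $x_{t-k}$, is

\[\frac{\text{Covariance}(x_t, x_{t-k})}{\text{SD}(x_t)\text{SD}(x_{t-k})} = \frac{\text{Covariance}(x_t, x_{t-k})}{\text{Variance}(x_t)}\]

where the second equality assumes that the variance of the series is constant over time, so that $\text{SD}(x_t) = \text{SD}(x_{t-k})$.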
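
As a rough check of the formula, the sample version can be computed by hand and compared with `acf()`. The series is simulated and the helper name `lag_k_acf()` is hypothetical. The sketch assumes the usual sample estimator, in which both the covariance and the variance are centred on the overall mean and share the same divisor, so the divisor cancels.

```r
set.seed(1)
# Simulated AR(1) series (hypothetical data)
x <- as.numeric(arima.sim(model = list(ar = 0.7), n = 200))

# Hypothetical helper: sample autocorrelation between x_t and x_{t-k}
lag_k_acf <- function(x, k) {
  n <- length(x)
  xbar <- mean(x)
  sum((x[(k + 1):n] - xbar) * (x[1:(n - k)] - xbar)) / sum((x - xbar)^2)
}

r_manual  <- sapply(1:3, function(k) lag_k_acf(x, k))
r_builtin <- acf(x, lag.max = 3, plot = FALSE)$acf[2:4]   # element 1 is lag 0
rbind(r_manual, r_builtin)
```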

Autocorrelation function (ACF) Plot

The significance of autocorrelations may be judged from the 95% confidence interval band

Autocorrelations decay to zero ($20 notes depend positively on the values held in the immediate past rather than in the distant past)

PACF (Partial Autocorrelation Function)

A type of correlation after removing the effect of earlier lags
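
A small sketch comparing the two functions on a simulated first-order autoregressive series. For such a series the ACF decays gradually while the PACF is expected to drop to near zero after lag 1, and at lag 1 the two agree because there are no earlier lags to remove.

```r
set.seed(1)
x <- arima.sim(model = list(ar = 0.7), n = 200)   # simulated AR(1) series

op <- par(mfrow = c(1, 2))
acf(x, main = "ACF")
pacf(x, main = "PACF")
par(op)

r1 <- acf(x, plot = FALSE)$acf[2]    # lag-1 autocorrelation (element 1 is lag 0)
p1 <- pacf(x, plot = FALSE)$acf[1]   # lag-1 partial autocorrelation
c(acf = r1, pacf = p1)
```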

Time series trend types

Requires a (parametric) model to fit the trend (covered later)

Non-parametric fits can also be made

Seasonality

Simple scatter plot of the response variable against time may reveal seasonality directly

Sub-series plots

Seasonality is easily seen graphically when grouping variables are used
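
Two base R displays of this kind, sketched with the built-in monthly `AirPassengers` series as a stand-in for the data in these slides. The month of each observation, obtained with `cycle()`, is the grouping variable.

```r
# Sub-series plot: one small time series per calendar month
monthplot(AirPassengers, ylab = "Passengers")

# Boxplots grouped by month show the same seasonal pattern
boxplot(AirPassengers ~ cycle(AirPassengers),
        xlab = "Month", ylab = "Passengers")
```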

ACF plot showing seasonality

White noise errors

Example using random normal data
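
A sketch of such an example. For white noise, roughly 95% of the sample autocorrelations at nonzero lags are expected to fall inside the band of plus or minus 1.96 divided by the square root of n, which is the band drawn by `acf()` by default.

```r
set.seed(1)
n <- 200
e <- rnorm(n)                        # simulated white noise

a <- acf(e, main = "ACF of white noise")

band <- qnorm(0.975) / sqrt(n)       # half-width of the 95% band
outside <- sum(abs(a$acf[-1]) > band)   # nonzero lags outside the band
c(band = band, outside = outside, lags = length(a$acf) - 1)
```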

Time series decomposition

Additive model
$X_t$ = Trend + Seasonal + Error
(where $X_t$ is an observation at time $t$)
Multiplicative model
$X_t$ = Trend $\times$ Seasonal + Error
(trend and seasonal components are not independent)
Detrending means removing the trend from the series, making it easier to see the seasonality.
Deseasoning means removing the seasonality from the series, making it easier to see the trend.
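
A minimal sketch of an additive decomposition with base R's `decompose()`, using the built-in `AirPassengers` series as a stand-in. Note that `decompose(type = "multiplicative")` fits Trend times Seasonal times Error, which is not quite the multiplicative form written above, so only the additive case is shown.

```r
dec <- decompose(AirPassengers, type = "additive")
plot(dec)                                   # observed, trend, seasonal, random

# Detrending and deseasoning under the additive model
detrended  <- AirPassengers - dec$trend     # seasonal + error
deseasoned <- AirPassengers - dec$seasonal  # trend + error

op <- par(mfrow = c(2, 1))
plot(detrended, main = "Detrended")
plot(deseasoned, main = "Deseasoned")
par(op)
```

The trend is estimated by a moving average, so it is missing (`NA`) for the first and last few months.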

Learning EDA

The best way to learn EDA is to try many approaches and find which are informative and which are not.
Chatfield (1995) on tackling statistical problems:
- Do not attempt to analyse the data until you understand what is being measured and why. Find out whether there is prior information, such as which effects are likely.
- Find out how the data were collected.
- Look at the structure of the data.
- The data then need to be carefully examined in an exploratory way before attempting a more sophisticated analysis.
- Use common sense, and be honest!

Summary

Size
- For small datasets, we cannot be too assertive.
- Some displays are affected by sample size (e.g. stem plot); some are not (e.g. smoothed density)
Shape
- We are concerned with overall shape of distribution.
- Are there gaps and/or many peaks (modes)?
- Is the distribution symmetrical? Is the distribution normal?
Outliers
- More important than points in the middle
- boxplots & scatter plots show them
Graphs should be simple and informative; certainly not misleading!
