---
title: 'Lab 3: Petrels'
author: "CHANGE ME"
output:
  html_document
---

## Introduction

In this lab, we look at a data set on bycatch White-chinned Petrel *Procellaria aequinoctialis* caught predominantly by bottom longliners, tuna longliners, squid trawlers and general fish trawlers between 2000 and 2003. The following variables were measured.

Variable            Description
--------            -----------
Autopsy             ID number of Autopsy
Year                Year caught
Month               Month caught
Longitude           Longitude of location caught
Latitude            Latitude of location caught 
Area                General area where caught (A to F) 
Culmen.Dth.at.Base  Depth of upper bill at base (mm)
Culmen.Wth.at.Base  Width of upper bill at base (mm)
Bill.Least.Dth      Smallest depth of the bill (mm)
Nostril.Lth         Nostril length (mm)
R.Tarsus.Lth        Right Tarsus length (mm)
L.Tarsus.Lth        Left Tarsus length (mm)
Tail.Lth            Tail length (mm)
R.Wing.Lth          Right wing length (mm)
L.Wing.Lth          Left wing length (mm)
Sex                 Sex (Female, Male)

We start by reading the data in, loading the `ggplot2` package so we can do some plotting, and produce a summary

```{r}
library(ggplot2)
all_petrels = read.csv("https://www.massey.ac.nz/~jcmarsha/227212/data/petrels.csv")
summary(all_petrels)
```

We notice there are a number of missing values (labelled `NA`, short for 'Not available'). We can get rid of all the petrels for which we have incomplete data using `na.omit`:

```{r}
petrels = na.omit(all_petrels)
summary(petrels)
```

## Exploratory data analysis

Here we'll do our exploratory data analysis of how the right wing length of the petrels varies across areas and by sex.

1. We'll start by producing a table of the sex of petrels for each area. The goal is to see if the proportion of male and female birds
are similar across the areas.

```{r}
table(petrels$Sex, petrels$Area)
```

What do you think - are the proportion of male and female birds similar across the areas?

2. You could do this with a bar chart instead:

```{r}
ggplot(data=petrels) + geom_bar(mapping=aes(x=Area, fill=Sex))
```

We can get different bar charts by adding position='dodge' or position='fill' inside the `geom_bar()` function. Try this and see what you get. Which plot do you think answers our question the best? Add some comments here on this.

3. Let's investigate the distribution of right wing lengths between the sexes. Remember you can use `geom_boxplot`, `geom_histogram` or `geom_density` for this. Look back to lab 2 for a reminder, and produce the plots in the code chunk below.

```{r}

```

Do you think there might be a difference in size between the sexes based on your plots?

4. Now let's look to see if there's a difference in the distribution of right wing length by area, to help us figure out whether all the birds are similar, or in some areas whether they're a different size. Again, you could do this with boxplots, histograms or densities. If using histograms or densities, you might want to add on `facet_wrap(~Area)` to produce a separate plot (facet) per area, e.g.

```{r}
ggplot(data=petrels) + geom_histogram(mapping=aes(x=R.Wing.Lth)) + facet_wrap(~Area)
```

Try all the types of plot and see which one you like the best. Then add a note describing what you see, and which areas (if any) have larger or smaller birds.

5. Considering what you've found so far, what might explain the larger birds in area E? Add some notes on this here.

6. Try producing plots that show how right wing length changes by sex and area. You could for example use the plot you had for sex, and then add on `facet_wrap(~Area)`. Or alternatively, you could use `facet_grid(Sex ~ Area)` if you want a separate plot for each sex and area combination (e.g. for histograms or densities).

Do you think that the differences in sex might explain the differences we see in area? Add some notes here.

## Statistical Inference

Below, we'll see if we can **quantify** the average right wing lengths in the population of petrels from our sample using confidence intervals.

1. We'll start by computing a confidence interval for the right wing length for males and females separately. One way is to first pull out the separate groups into separate datasets, and then use `t.test` on each of them. e.g.

```{r}
males = subset(petrels, Sex == "Male")
females = subset(petrels, Sex == "Female")
t.test(males$R.Wing.Lth)
```

Add to the above so you have a 95% confidence interval for female birds as well. What is your conclusion about the average right wing length of male and female petrels in the population? Is there evidence do you think for a difference between them? Add some notes here.

2. Use the code block below to do the same thing, but this time for the different areas. What is your conclusion here?

```{r}
AreaA = subset(petrels, Area == "A")
t.test(AreaA$R.Wing.Lth)
```

3. The above steps are a little cumbersome for all 6 areas - lots of copy and paste that we might mess up a little bit. This sort of repetition is what computers are good at, so let's use them:

```{r}
ttests = tapply(petrels$R.Wing.Lth, petrels$Area, t.test)
tidied = lapply(ttests, broom::tidy)
combined = dplyr::bind_rows(tidied, .id = "Area")
combined
```

Obviously you're not expected to be able to produce code like this, but let's explain what it's doing

- The `tapply` function splits up the `petrels$R.Wing.Lth` column by the categories in `petrels$Area` and then applies `t.test` to each of them. i.e. exactly like we did above. It stores the results in a list, which we save in `ttests`. Try typing ttests to see what it contains.
- The `lapply` function then takes the `ttests` list and applies the `broom::tidy` function to each of the entries in turn, returning a new list called `tidied`. The `broom::tidy` function 'sweeps' up the messy output from `t.test` that we saw above into a nice tidy data frame.
- The `dplyr::bind_rows` function takes the list of tidied data frames and binds them together into a single data frame which we've saved in 'combined', with rows being distinguished by a column named `Area`.
- The last line then just prints out what `combined` contains. You'll see you have columns for `estimate` (the mean right wing length), `conf.low` and `conf.high`.

You could potentially plot this now if you like. You can use `geom_point` for the points (sample means) and `geom_errorbar` for the confidence intervals.

4. An alternate way to do part 1 above (to assess differences between sex) is to compute a confidence interval for the difference in mean right wing length between the sexes, rather than separate confidence intervals for each sex. The way to do this is as follows. Try it out:

```{r}
t.test(R.Wing.Lth ~ Sex, conf.level=0.95, data=petrels)
```

This is known as the `formula` syntax. Note the piece with the tilde (~) can be read as "Right wing length in terms of sex". Using formulas like this is how we'll do statistical modelling in semester 2.

What is your conclusion from the above test?

5. Just how confident can we be that there is a difference? Try increasing the confidence level by specifying conf.level to something higher (e.g. 0.99). You should find you can increase it quite a lot, but at some point you'll have to admit that the mean right wing lengths in the population could possibly be the same.

6. Add some notes about the relationship between confidence level and the length of the confidence interval below. How does this relate to being able to use the confidence interval to make a decision?