--- title: 'Lab 3: Petrels' author: "CHANGE ME" output: html_document --- ## Introduction In this lab, we look at a data set on bycatch White-chinned Petrel *Procellaria aequinoctialis* caught predominantly by bottom longliners, tuna longliners, squid trawlers and general fish trawlers between 2000 and 2003. The following variables were measured. Variable Description -------- ----------- Autopsy ID number of Autopsy Year Year caught Month Month caught Longitude Longitude of location caught Latitude Latitude of location caught Area General area where caught (A to F) Culmen.Dth.at.Base Depth of upper bill at base (mm) Culmen.Wth.at.Base Width of upper bill at base (mm) Bill.Least.Dth Smallest depth of the bill (mm) Nostril.Lth Nostril length (mm) R.Tarsus.Lth Right Tarsus length (mm) L.Tarsus.Lth Left Tarsus length (mm) Tail.Lth Tail length (mm) R.Wing.Lth Right wing length (mm) L.Wing.Lth Left wing length (mm) Sex Sex (Female, Male) We start by reading the data in, loading the `ggplot2` package so we can do some plotting, and produce a summary ```{r} library(ggplot2) all_petrels = read.csv("https://www.massey.ac.nz/~jcmarsha/227212/data/petrels.csv") summary(all_petrels) ``` We notice there are a number of missing values (labelled `NA`, short for 'Not available'). We can get rid of all the petrels for which we have incomplete data using `na.omit`: ```{r} petrels = na.omit(all_petrels) summary(petrels) ``` ## Exploratory data analysis Here we'll do our exploratory data analysis of how the right wing length of the petrels varies across areas and by sex. 1. We'll start by producing a table of the sex of petrels for each area. The goal is to see if the proportion of male and female birds are similar across the areas. ```{r} table(petrels$Sex, petrels$Area) ``` What do you think - are the proportion of male and female birds similar across the areas? 2. You could do this with a bar chart instead: ```{r} ggplot(data=petrels) + geom_bar(mapping=aes(x=Area, fill=Sex)) ``` We can get different bar charts by adding position='dodge' or position='fill' inside the `geom_bar()` function. Try this and see what you get. Which plot do you think answers our question the best? Add some comments here on this. 3. Let's investigate the distribution of right wing lengths between the sexes. Remember you can use `geom_boxplot`, `geom_histogram` or `geom_density` for this. Look back to lab 2 for a reminder, and produce the plots in the code chunk below. ```{r} ``` Do you think there might be a difference in size between the sexes based on your plots? 4. Now let's look to see if there's a difference in the distribution of right wing length by area, to help us figure out whether all the birds are similar, or in some areas whether they're a different size. Again, you could do this with boxplots, histograms or densities. If using histograms or densities, you might want to add on `facet_wrap(~Area)` to produce a separate plot (facet) per area, e.g. ```{r} ggplot(data=petrels) + geom_histogram(mapping=aes(x=R.Wing.Lth)) + facet_wrap(~Area) ``` Try all the types of plot and see which one you like the best. Then add a note describing what you see, and which areas (if any) have larger or smaller birds. 5. Considering what you've found so far, what might explain the larger birds in area E? Add some notes on this here. 6. Try producing plots that show how right wing length changes by sex and area. You could for example use the plot you had for sex, and then add on `facet_wrap(~Area)`. Or alternatively, you could use `facet_grid(Sex ~ Area)` if you want a separate plot for each sex and area combination (e.g. for histograms or densities). Do you think that the differences in sex might explain the differences we see in area? Add some notes here. ## Statistical Inference Below, we'll see if we can **quantify** the average right wing lengths in the population of petrels from our sample using confidence intervals. 1. We'll start by computing a confidence interval for the right wing length for males and females separately. One way is to first pull out the separate groups into separate datasets, and then use `t.test` on each of them. e.g. ```{r} males = subset(petrels, Sex == "Male") females = subset(petrels, Sex == "Female") t.test(males$R.Wing.Lth) ``` Add to the above so you have a 95% confidence interval for female birds as well. What is your conclusion about the average right wing length of male and female petrels in the population? Is there evidence do you think for a difference between them? Add some notes here. 2. Use the code block below to do the same thing, but this time for the different areas. What is your conclusion here? ```{r} AreaA = subset(petrels, Area == "A") t.test(AreaA$R.Wing.Lth) ``` 3. The above steps are a little cumbersome for all 6 areas - lots of copy and paste that we might mess up a little bit. This sort of repetition is what computers are good at, so let's use them: ```{r} ttests = tapply(petrels$R.Wing.Lth, petrels$Area, t.test) tidied = lapply(ttests, broom::tidy) combined = dplyr::bind_rows(tidied, .id = "Area") combined ``` Obviously you're not expected to be able to produce code like this, but let's explain what it's doing - The `tapply` function splits up the `petrels$R.Wing.Lth` column by the categories in `petrels$Area` and then applies `t.test` to each of them. i.e. exactly like we did above. It stores the results in a list, which we save in `ttests`. Try typing ttests to see what it contains. - The `lapply` function then takes the `ttests` list and applies the `broom::tidy` function to each of the entries in turn, returning a new list called `tidied`. The `broom::tidy` function 'sweeps' up the messy output from `t.test` that we saw above into a nice tidy data frame. - The `dplyr::bind_rows` function takes the list of tidied data frames and binds them together into a single data frame which we've saved in 'combined', with rows being distinguished by a column named `Area`. - The last line then just prints out what `combined` contains. You'll see you have columns for `estimate` (the mean right wing length), `conf.low` and `conf.high`. You could potentially plot this now if you like. You can use `geom_point` for the points (sample means) and `geom_errorbar` for the confidence intervals. 4. An alternate way to do part 1 above (to assess differences between sex) is to compute a confidence interval for the difference in mean right wing length between the sexes, rather than separate confidence intervals for each sex. The way to do this is as follows. Try it out: ```{r} t.test(R.Wing.Lth ~ Sex, conf.level=0.95, data=petrels) ``` This is known as the `formula` syntax. Note the piece with the tilde (~) can be read as "Right wing length in terms of sex". Using formulas like this is how we'll do statistical modelling in semester 2. What is your conclusion from the above test? 5. Just how confident can we be that there is a difference? Try increasing the confidence level by specifying conf.level to something higher (e.g. 0.99). You should find you can increase it quite a lot, but at some point you'll have to admit that the mean right wing lengths in the population could possibly be the same. 6. Add some notes about the relationship between confidence level and the length of the confidence interval below. How does this relate to being able to use the confidence interval to make a decision?