library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
`visreg` provides an efficient routine for visualising our linear model. However, it only produces the fitted line with a confidence band. We need to do a bit more work to visualise our model with a prediction band.
Load the package `datarium` and the data `marketing`. Fit the linear model, and save the result in the object `lm.youtube` again. Then sweep your fitted result with the package `broom` and get the augmented version `lm.youtube.fit`.
library(datarium)
library(broom)
data(marketing)
lm.youtube <- lm(sales ~ youtube, data=marketing)
lm.youtube.fit <- augment(lm.youtube)
Get the prediction interval for each observed `youtube` by calling `predict()` without supplying `newdata`, and save the result in `lm.youtube.pred`. We can then bind `lm.youtube.pred` to `lm.youtube.fit` by column with the R function `cbind()`, aka column bind.
lm.youtube.pred <- predict(lm.youtube, interval='prediction')
## Warning in predict.lm(lm.youtube, interval = "prediction"): predictions on current data refer to _future_ responses
lm.youtube.fit.pred <- lm.youtube.fit |> cbind(lm.youtube.pred) |> select(-fit) |> as_tibble()
lm.youtube.fit.pred
## # A tibble: 200 × 10
## sales youtube .fitted .resid .hat .sigma .cooksd .std.resid lwr upr
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 26.5 276. 21.6 4.96 0.00970 3.90 0.00794 1.27 13.8 29.3
## 2 12.5 53.4 11.0 1.50 0.0122 3.92 0.000920 0.387 3.22 18.7
## 3 11.2 20.6 9.42 1.74 0.0165 3.92 0.00169 0.449 1.65 17.2
## 4 22.2 182. 17.1 5.12 0.00501 3.90 0.00434 1.31 9.35 24.8
## 5 15.5 217. 18.8 -3.27 0.00578 3.91 0.00205 -0.839 11.0 26.5
## 6 8.64 10.4 8.94 -0.295 0.0180 3.92 0.0000534 -0.0762 1.15 16.7
## 7 14.2 69 11.7 2.44 0.0105 3.92 0.00208 0.627 3.97 19.5
## 8 15.8 144. 15.3 0.544 0.00549 3.92 0.0000538 0.140 7.56 23.0
## 9 5.76 10.3 8.93 -3.17 0.0181 3.91 0.00616 -0.818 1.15 16.7
## 10 12.7 240. 19.8 -7.12 0.00690 3.89 0.0116 -1.83 12.1 27.6
## # … with 190 more rows
Here we also remove the duplicated column `fit`, since `.fitted` from `augment()` already contains the fitted values.
First, visualise your linear model with the fitted line and the scatter plot. You can then add the prediction band with the R function `geom_ribbon()` and `aes(ymin = lwr, ymax = upr)`, which defines a shaded region bounded by the lower and upper limits of the prediction intervals in your plot. You may need to adjust `fill` and `alpha` in `geom_ribbon()` to get a nicer plot.
lm.youtube.fit.pred |> ggplot(aes(x = youtube ,y=sales)) +
geom_point() +
geom_ribbon(aes(ymin = lwr, ymax = upr), fill = "grey70", alpha=0.5) +
geom_smooth(method='lm')
## `geom_smooth()` using formula 'y ~ x'
The default prediction interval in R is a 95% prediction interval, which means around 95% of observations should fall within the band. Check the plot in Step 3 and count the points falling outside the prediction band. Add one or two comments on your findings.
Answer: It is not hard to see that there are 9 points falling outside the bands. This leaves \(1-9/200=95.5\%\) of observations within the bands, which agrees well with the confidence level. We can also do this count with `filter()` and `summarise()`.
```r
lm.youtube.fit.pred |> filter(sales>upr|sales<lwr) |> summarise(n())
```
```
## # A tibble: 1 × 1
## `n()`
## <int>
## 1 9
```
Split the observations into two groups, `youtube < 200` and `youtube > 200`, and count the points falling in the prediction band for each group. Add one or two comments on your findings.

Answer: Graphical approaches are not attractive this time, but we can still use `filter()` and `summarise()`. The results can be found in the following R code chunks. The critical issue is that the prediction bands seem too wide for the lower budgets and too narrow for the higher budgets. The bands are not consistent across different budgets (`x`), and therefore the nominal pointwise coverage is not achieved.
```r
lm.youtube.fit.pred |> filter(youtube>200) |> ggplot(aes(x = youtube ,y=sales)) +
geom_point() +
geom_ribbon(aes(ymin = lwr, ymax = upr), fill = "grey70", alpha=0.5) +
geom_smooth(method='lm')
```
```
## `geom_smooth()` using formula 'y ~ x'
```
<img src="solC03_files/figure-html/unnamed-chunk-6-1.png" width="672" />
```r
lm.youtube.fit.pred |> filter(youtube<200) |> ggplot(aes(x = youtube ,y=sales)) +
geom_point() +
geom_ribbon(aes(ymin = lwr, ymax = upr), fill = "grey70", alpha=0.5) +
geom_smooth(method='lm')
```
```
## `geom_smooth()` using formula 'y ~ x'
```
<img src="solC03_files/figure-html/unnamed-chunk-6-2.png" width="672" />
```r
lm.youtube.fit.pred |> filter(youtube>200) |> summarise(n())
```
```
## # A tibble: 1 × 1
## `n()`
## <int>
## 1 93
```
```r
lm.youtube.fit.pred |> filter(youtube>200) |> filter(sales<upr) |> filter(sales>lwr) |> summarise(n())
```
```
## # A tibble: 1 × 1
## `n()`
## <int>
## 1 84
```
```r
lm.youtube.fit.pred |> filter(youtube<200) |> summarise(n())
```
```
## # A tibble: 1 × 1
## `n()`
## <int>
## 1 107
```
```r
lm.youtube.fit.pred |> filter(youtube<200) |> filter(sales<upr) |> filter(sales>lwr) |> summarise(n())
```
```
## # A tibble: 1 × 1
## `n()`
## <int>
## 1 107
```
As we can see from Exercise 1, the prediction interval tends to be too wide when `youtube` is small and too narrow when `youtube` is large. A similar issue also arises when we try to predict the sales given a zero budget in Exercise 3 of Lab C2: the prediction at zero budget may significantly overestimate the actual sales.
These issues allude to a critical point: does our linear model provide a good fit to our data?
Of course, we can always get some clues from the scatter plot and smoothed curve for `sales` against `youtube`. We can even conclude that the fitted line fails to capture a significant portion of the uncertainty in our data. So a linear model may not be an adequate model for the relationship between `sales` and `youtube`.
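For reference, here is the scatter plot with a default smoother (a quick sketch; this uses ggplot's default loess smoother rather than the fitted regression line):

```r
# Raw relationship between sales and youtube, with a loess smoother to reveal curvature
marketing |> ggplot(aes(x = youtube, y = sales)) +
  geom_point() +
  geom_smooth()
```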
The next question is how to identify and quantify the inadequacy of our linear model.
The visualisation of the model reveals some problematic patterns, but it fails to identify the root cause of the problem. \(R^2\) is a simple indicator, but it cannot capture the prediction issues discussed above.
What we need are some diagnostic tools to check how well our linear model fits the data set and to identify the weak points in our fitted model. However, before we examine our fitted linear model, we need to know when a fitted linear model will behave well.
To ensure the overall good performance of a fitted linear model, we need our data to satisfy four assumptions:
Linearity
Residuals don’t depend on \(x\). The trend is correctly modelled as a line. A line won’t fit an exponential trend.
Independence
Residuals don’t depend on each other. If residuals are dependent, we can expect that one residual may contribute some information to another residual and vice versa.
Normality
Residuals are normally distributed. Normality underpins the confidence intervals, prediction intervals, and hypothesis tests based on the fitted line.
Equal variance
Residuals have constant variance. The variation doesn’t change as we move along the trend.
with Linearity being the most important.
Firstly we’ll be looking at linearity and equal variance, both of which can be assessed using a plot of the residuals versus the fitted value.
If linearity holds, we’d expect a plot of residuals vs fitted values to show no trend: the points should be scattered fairly evenly above and below the line, and in particular we don’t want to see a curve.
If equal variance holds, we’d expect the scatter of points around the trend to be constant as the fitted value changes. You want it to be relatively even, and in particular not increasing from left to right (i.e. not spreading out).
We can demonstrate this idea using stochastic simulation. An example of a good plot (left) and a bad plot (right) is shown below for two artificial data sets, one linear and one exponential.
set.seed(2020)
n <- 50
ab <- c(0,2)  # intercept and slope of the true relationship
# y follows a linear model; y2 follows an exponential (nonlinear) model
demo <- tibble(x=(1:n)/n, e= rnorm(n, sd=0.1)) |> mutate (y=ab[1]+ab[2]*x+e) |>
  mutate (y2=exp(ab[1]+ab[2]*x+e))
demo |> select(x, y, y2) |> gather(key = "Group", value = "Y", -x) |>
  ggplot(aes(x=x,y=Y,col=Group)) + geom_point()
# residuals vs fitted values for the linear data (g1) and the exponential data (g2)
g1 <- lm(y~x,data=demo) |> augment() |>
  ggplot(aes(x=.fitted,y=.resid)) + geom_point() +geom_smooth()+geom_hline(yintercept=0)
g2 <- lm(y2~x,data=demo) |> augment() |>
  ggplot(aes(x=.fitted,y=.resid)) + geom_point() +geom_smooth()+geom_hline(yintercept=0)
library(patchwork)
g1+g2
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Compare the scatter plots and the plots of residuals versus fitted values. Think about the reason why we include horizontal lines at \(y=0\) and smoothed curves with confidence bands in the residual plots.
Answer: The right panel shows a clear curvature but the left panel looks quite random. The blue smoothed curves try to extract general patterns from residuals vs fits plots and the confidence bands can help to tell if the general trend differs a lot from the baseline (\(y=0\)).
You can see more examples using the interactive app found here:
https://shiny.massey.ac.nz/jcmarsha/linearity
You can of course modify some parameters in the above R code chunk, and even change the exponential function to some other function. A possible choice is to change `ab <- c(0,2)` to `ab <- c(0,0.5)`.
set.seed(2020)
n <- 50
ab <- c(0,0.5)
demo <- tibble(x=(1:n)/n, e= rnorm(n, sd=0.1)) |> mutate (y=ab[1]+ab[2]*x+e) |>
mutate (y2=exp(ab[1]+ab[2]*x+e))
demo |> select(x, y, y2) |> gather(key = "Group", value = "Y", -x) |>
ggplot(aes(x=x,y=Y,col=Group)) + geom_point()
g1 <- lm(y~x,data=demo) |> augment() |>
ggplot(aes(x=.fitted,y=.resid)) + geom_point() +geom_smooth()+geom_hline(yintercept=0)
g2 <- lm(y2~x,data=demo) |> augment() |>
ggplot(aes(x=.fitted,y=.resid)) + geom_point() +geom_smooth()+geom_hline(yintercept=0)
library(patchwork)
g1+g2
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Answer: This case is a little tricky, as the residuals vs fits plot does not reveal any issue in the exponential model. Why? This is because the exponential function \(\exp(x)\) can be well approximated by the linear function \(1+x\) when \(x\) is small. This fact actually provides additional justification for using linear models in scientific research: though many natural phenomena may follow nonlinear laws, those complicated nonlinear functions can often be approximated by our linear models under certain conditions!
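A quick numeric check of the approximation \(\exp(x) \approx 1 + x\) for small \(x\):

```r
# exp(x) and 1 + x are close for small x, but diverge as x grows
x <- c(0.01, 0.1, 0.5)
rbind(exp_x = exp(x), approx_1_plus_x = 1 + x)
```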
Let’s see how well our model for sales does by producing the diagnostic plot for the linear model you fit above using the following.
lm.youtube.fit |> ggplot(aes(x=.fitted,y=.resid)) +
geom_point()+geom_smooth()+geom_hline(yintercept=0)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Take a good look at the plot. Do you think linearity and equal variance hold? Add some notes about each assumption to your notebook.
Answer: Linearity looks OK, though we might still worry about the left end corresponding to small budgets. Equal variance is certainly violated!
In addition to the residuals vs fits plot, we have another tool, i.e. the scale-location plot, to check equal variance. The plot can be generated by using the following R code chunk:
lm.youtube.fit |> mutate(.root.abs.std.resid=sqrt(abs(.std.resid))) |>
ggplot(aes(x=.fitted,y=.root.abs.std.resid)) + geom_point()+geom_smooth()+geom_hline(yintercept=0)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Similar to the residuals vs fits plot, the scale-location plot shows whether the residuals are spread equally along the range of the fitted values (and hence the predictor). The assumption of equal variance (homoscedasticity) can be checked with this plot: if we see a roughly horizontal trend with randomly spread points, the assumption is plausible.
Take a good look at the plot. Do you think linearity and equal variance hold? Add some notes about each assumption to your notebook.
Answer: Certainly, equal variance fails to hold. But linearity cannot be checked with the scale-location plot.
In the scale-location plot we still use the fitted values as \(x\) coordinates. But another quantity, the square root of the absolute value of the standardised residuals, is used as the \(y\) coordinates. Standardised residuals are obtained by dividing the residuals by their standard error, roughly \(s_i=e_i/sd_{res}\) (the exact formula also adjusts for leverage). An advantage of standardised residuals is that their variance and standard deviation are approximately 1.
The standardised residuals can be readily obtained from the fitted model with the R function `rstandard()`. Otherwise, if you have carefully read the tibble from `augment()` and the R code chunk for the scale-location plot, you will have found the standardised residuals `.std.resid` right after `.resid`.
The operation of dividing a sequence by its standard deviation is called standardisation in statistics. It is used frequently because data on different scales can be handled with similar measures after standardisation.
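As a quick cross-check (a minimal sketch, assuming `lm.youtube` is still in your workspace), you can compare the simple scaling with `rstandard()` and `augment()`'s `.std.resid`; the latter two agree exactly, while the simple scaling differs slightly because it ignores the leverage adjustment.

```r
tibble(
  simple = residuals(lm.youtube) / sd(residuals(lm.youtube)),  # simple scaling
  rstd   = rstandard(lm.youtube),                              # leverage-adjusted
  augm   = augment(lm.youtube)$.std.resid                      # same as rstandard()
) |> head()
```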
Optional Challenge: Modify the R code chunk in Step 1 and perform a simulation study on the scale-location plot.
Answer: The following R code chunk can produce the simulation study. We can notice that the scale-location plot is not really a great tool for diagnostics.
set.seed(2020)
n <- 100
ab <- c(0,2)
demo <- tibble(x=(1:n)/n, e= rnorm(n, sd=0.25)) |> mutate (y=ab[1]+ab[2]*x+e) |>
mutate (y2=exp(ab[1]+ab[2]*x+e))
demo |> select(x, y, y2) |> gather(key = "Group", value = "Y", -x) |>
ggplot(aes(x=x,y=Y,col=Group)) + geom_point()
g1 <- lm(y~x,data=demo) |> augment() |> mutate(.root.abs.std.resid=sqrt(abs(.std.resid))) |>
ggplot(aes(x=.fitted,y=.root.abs.std.resid)) + geom_point() +geom_smooth()+geom_hline(yintercept=0)
g2 <- lm(y2~x,data=demo) |> augment() |> mutate(.root.abs.std.resid=sqrt(abs(.std.resid))) |>
ggplot(aes(x=.fitted,y=.root.abs.std.resid)) + geom_point() +geom_smooth()+geom_hline(yintercept=0)
library(patchwork)
g1+g2
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The residuals vs fits plot and the scale-location plot are usually sufficient for spotting issues with linearity and equal variance. Our next task is to verify normality. Recall the simulation study in Lab 2, where we made a scatter plot of the residuals versus the true random errors, which could be approximated by the line \(y=x\). Even though we can't observe the true random errors in a real problem, we can still make a similar plot, called a quantile-quantile plot (aka Q-Q plot), via `geom_qq()` and `geom_qq_line()` as follows.
lm.youtube.fit |> ggplot(aes(sample=.std.resid)) + geom_qq(alpha=0.3) + geom_qq_line(color='red')
We are using the standardised residuals again. `geom_qq()` adds the black points while `geom_qq_line()` depicts the red line. If most points follow the red line, we can reasonably assume normality of the residuals. However, one must confirm linearity and equal variance first before thinking about normality.
Of course, we need to know what a good Q-Q plot looks like before we make any solid judgement. The normality assumption says the residuals follow a normal distribution, and the standardised residuals have zero mean and unit variance. We can therefore simulate a standard normal sample with zero mean and unit variance and check its Q-Q plot as follows.
set.seed(2020)
tibble(e=rnorm(100)) |> ggplot(aes(sample=e)) + geom_qq(alpha=0.3) + geom_qq_line(color='red')
Take a good look at the above two plots. Do you think normality holds? Add some notes about the assumption to your notebook.
Answer: From the Q-Q plot, one can argue that normality holds. However, one needs to check linearity and equal variance first; if those two conditions fail to hold, it makes little sense to check normality.
The last thing to check is slightly different from the above assumptions. We will look at outliers in the linear model. Just as we observe some distant points in a boxplot for one variable, we can have outliers in \(x\) and \(y\). The tricky thing is that, even if both \(x\) and \(y\) look fine in their own boxplots, their combination may push a point away from the main body in a scatter plot. A possible pitfall in fitting a linear model is that the line found by the least squares method is not very robust against outliers: only a few outliers can distort the point estimates significantly. So it is essential to spot those troublesome points hidden in our data.
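To see how fragile least squares can be, here is a tiny artificial illustration (simulated data, not from `marketing`): adding a single distant point noticeably shifts the fitted slope.

```r
set.seed(1)
d <- tibble(x = 1:20, y = 2 + 0.5 * x + rnorm(20, sd = 0.5))
coef(lm(y ~ x, data = d))                          # slope close to the true 0.5
coef(lm(y ~ x, data = add_row(d, x = 25, y = 0)))  # slope pulled down by one outlier
```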
Of course, even a scatter plot can help us identify the outliers. If we want some more quantitative measurements, there is a specific measure called Cook’s distance, or just Cook’s \(D\), for each data point to measure its influence on the regression line. A large Cook’s distance suggests that this point may be an outlier which has a big influence on the whole regression line.
This piece of information is readily collected by `augment()` in the column `.cooksd`. The following R code chunk produces a plot of Cook's distance against the observation index.
lm.youtube.fit |> mutate(.index=1:n()) |> ggplot(aes(y=.cooksd,x=.index)) + geom_col()
The spikes in the above plot suggest potential outliers. However, one needs a cut-off point to distinguish the outliers from the well-behaved observations. Unfortunately, there is no golden rule for splitting these two groups of points for arbitrary data sets. Some people use 0.5 or 1 as a cut-off point, but our Cook's distances here are much smaller. We will see a refined diagnostic plot for outlier detection in the next step.
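As a rough screen (a common rule of thumb, not something prescribed by this lab), you can list the observations whose Cook's distance exceeds \(4/n\) and inspect them individually:

```r
lm.youtube.fit |>
  mutate(.index = 1:n()) |>
  filter(.cooksd > 4 / n()) |>               # rule-of-thumb threshold 4/n
  select(.index, youtube, sales, .cooksd)    # candidates to inspect, not confirmed outliers
```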
We will skip independence in Lab 3 but revisit it in Lab 4. Before concluding this exercise, here is a handy shortcut to produce all of the above plots. One can easily generate a set of residual diagnostic plots via `plot()`, i.e. the default plot function in R, as follows.
plot(lm.youtube)
The last of the above plots is called the residuals vs leverage plot. Leverage is another measure for identifying outliers, and you will notice that several points are already flagged. This plot will also contain a cut-off curve (a red dashed line) for Cook's distances if any points have Cook's \(D>0.5\).
Besides the convenience, another advantage of `plot()` is that the indices of potential outliers are flagged in the plot. You can use the information in these diagnostic plots to locate the untamed outliers in your data. However, note that we can't customise these plots by following the same procedures as in `ggplot()`.
In fact, `plot()` can produce six different diagnostic plots. You can make an individual plot by specifying `which` in `plot()`, like `plot(lm.youtube, which=1)` or `plot(lm.youtube, which=4)`. Have a try with `which=1,2,3,4,5,6` and see which number gives you the desired plot.
Answer: Just notice that the last one is Cook’s distance vs leverage plot, another tool for outlier detection.
plot(lm.youtube,which=1) # residuals vs fits
plot(lm.youtube,which=2) # Q-Q plot
plot(lm.youtube,which=3) # scale-location
plot(lm.youtube,which=4) # Cook's distance
plot(lm.youtube,which=5) # Residuals vs leverage
plot(lm.youtube,which=6) # Cook's distance vs leverage
We have to acknowledge that the residual plots of `lm.youtube` suggest that the linear model is not a great one for describing the relationship between `sales` and `youtube`. A transformation is one way to deal with the non-linearity and unequal variance of the data. We will try the log transformation in this exercise.
Instead of modelling `sales` in terms of `youtube`, we could instead take log transforms of `sales` and `youtube` to see if it is possible to get rid of the curvature in the relationship.
lm.youtube.log <- lm(log(sales) ~ log(youtube), data=marketing)
summary(lm.youtube)
##
## Call:
## lm(formula = sales ~ youtube, data = marketing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0632 -2.3454 -0.2295 2.4805 8.6548
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.439112 0.549412 15.36 <2e-16 ***
## youtube 0.047537 0.002691 17.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.91 on 198 degrees of freedom
## Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
## F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
summary(lm.youtube.log)
##
## Call:
## lm(formula = log(sales) ~ log(youtube), data = marketing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.43349 -0.15917 0.01696 0.16910 0.39399
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.02284 0.07372 13.87 <2e-16 ***
## log(youtube) 0.35504 0.01487 23.87 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2109 on 198 degrees of freedom
## Multiple R-squared: 0.7421, Adjusted R-squared: 0.7408
## F-statistic: 569.8 on 1 and 198 DF, p-value: < 2.2e-16
Compare the summary output of the two models you have. Which do you think is better? Why?
Answer: Multiple R-squared has improved a lot, so the log-transformed model is better.
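If you prefer to pull the headline numbers out programmatically (a small convenience sketch; the values simply match the two summaries above):

```r
c(raw = summary(lm.youtube)$r.squared,
  log = summary(lm.youtube.log)$r.squared)
```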
We add the following code to produce the residual vs fitted plot for this new model.
lm.youtube.log.fit <- augment(lm.youtube.log)
lm.youtube.log.fit |> ggplot(aes(x=.fitted,y=.std.resid)) +
geom_point()+geom_smooth(span=1.2)+geom_hline(yintercept=0)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
What is going on with this curve smoothed by `loess`, and why does it pass so oddly through the outlier in the bottom left?
Answer: `loess` stands for LOcally WEighted Scatter-plot Smoother. It is a generalisation of our linear model to nonlinear functions. One potential drawback of `loess` is that it may depend heavily on one data point (or a small portion of data points) because it smooths the scatter plot locally. This is the root cause of the weirdness.
The \(y\)-axis now shows the standardised residuals. If one fits a linear model with a transformation on \(y\), the raw residuals are not reported and the standardised residuals are used throughout. Why do we not use the raw residuals?
How is the assumption of linearity now? What about equal variance? Remember to add comments to your notebook.
Answer: Linearity is getting better if we ignore the bottom left. Equal variance still does not hold, but there is a slight improvement.

Produce other diagnostic plots and add some comments on them.
plot(lm.youtube.log)
Answer: Most issues are still there but are getting slightly better. A potential outlier (observation 131) can be identified from the residuals vs leverage plot.
Earlier we produced prediction intervals for sales with the `youtube` budget equal to zero dollars. Re-do these using your transformed model. Remember that you'll need to exponentiate the resulting intervals using the `exp()` function. (Why?)
How do these intervals compare with your previous ones? Add some comments to your notebook about this.
Answer: Sorry that I made a mistake here when preparing this lab. Clearly, we can't take the logarithm of zero, so it is not viable to produce the prediction using `predict()`. One thing to notice is that the log-transformed model is a power-law model on the raw scale, \(sales = c\times(youtube)^k\), so the sales would be exactly zero given a zero budget. Of course, we may not want a model like this, since the sales should not be zero even when we have no advertising budget. A quick remedy is to use a shifted log transformation, \(x^*=\log(x+1)\), as follows. This fixes the problem caused by the logarithm of zero.
lm.youtube.log1 <- lm(log(sales) ~ log(youtube+1), data=marketing)
newdata <- data.frame(youtube=0)
exp(predict(lm.youtube.log1,newdata,interval = 'prediction'))
## fit lwr upr
## 1 2.603275 1.664424 4.071702
Now we have a much narrower prediction interval compared with Step 5 of Ex 3 in Lab 2!
Now let’s try visualising the second linear model. You should notice that the \(x\)-axis is on the normal scale (even though we applied a log transformation in the model formula) but the \(y\)-axis is on the log scale.
library(visreg)
visreg(lm.youtube.log, gg=TRUE)
Nonetheless, the model fit should be a bit better, and will be curved. Notice that we’ve used a linear model to fit a curved relationship. The key is that the linearity of the model is in terms of the coefficients (each term can contain only one \(\beta\) as a multiplier, and terms must be added together) not in terms of the way \(y\) and \(x\) are related. You can apply any transformation you like to \(x\) and \(y\) as needed to fit the data.
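For instance (illustrative alternatives, not models recommended by this lab), both of the following are still linear in the coefficients even though they describe curved relationships between `sales` and `youtube`:

```r
# Each term carries exactly one coefficient, so both of these are linear models.
lm(sales ~ youtube + I(youtube^2), data = marketing)   # quadratic in youtube
lm(log(sales) ~ youtube, data = marketing)             # transformed response only
```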
You can also visualise the second linear model on the natural scale by applying a transformation in the `visreg` command. Try the following:
visreg(lm.youtube.log, trans=exp, partial=TRUE, gg =TRUE)
`trans=exp` uses exponentiation to transform the outcome variable back to the raw scale. `partial=TRUE` means that the residuals' deviations from the line (and thus the data values) are plotted as points as well. You may want to change the y-axis label by adding, for example, `ylab="Sales"` to the above command.
Add some comments to your notebook about the model fit and how well you think it does. Notice that the confidence bands at the right end are larger than those at the left end or in the middle. Why is this?
Answer: Though the unequal variance is still there, the visualisation suggests a much better fit. The fitted curve traces the trend of sales well, with points scattering evenly on both sides of the curve. The wider confidence bands at the right end are produced by the inverse transformation, i.e. the exponential function.
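A quick numeric illustration of why the back-transformed bands widen to the right: an interval of fixed width on the log scale maps to a wider interval on the raw scale as its centre increases.

```r
diff(exp(c(2.0, 2.2)))  # width on the raw scale when the log-scale centre is around 2.1
diff(exp(c(3.0, 3.2)))  # the same log-scale width becomes much wider here
```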
Optional Challenge: Try to find a better transform for this data set!
Answer: No standard solution. But I am happy to review your work if you send it to me via email.
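If you would like a starting point (purely a suggestion on my part, with no claim of optimality), one candidate worth comparing against the log-log model is a square-root transform of the budget:

```r
# Candidate transform to explore: sales on the square root of the budget.
lm.youtube.sqrt <- lm(sales ~ sqrt(youtube), data = marketing)
summary(lm.youtube.sqrt)$r.squared   # compare with the R-squared values above
plot(lm.youtube.sqrt, which = 1)     # residuals vs fitted, for a quick check
```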