16.3.49 one-way ANOVA test

To see how much difference time of day made on the speed at which he could download files, a college sophomore placed a file on a remote server, then proceeded to download it at three different time periods of the day. He downloaded the file 18 times in all, 6 times at each time of day, and recorded the time in seconds that the download took. At the 5% significance level, do the data provide sufficient evidence to conclude that a difference exists in mean download speed?

Notation in one-way ANOVA:

k = number of populations
n = total number of observations
\(\bar x\) = mean of all n observations
\(n_j\) = size of sample from Population j
\(\bar{x_j}\) = mean of sample from Population j
\(s_j^2\) = variance of sample from Population j
\(T_j\) = sum of sample data from Population j

Defining formulas for sums of squares in one-way ANOVA:

SST = \(\sum (x_i - \bar x)^2\)
SSTR = \(\sum n_j(\bar{x_j} - \bar{x})^2\)
SSE = \(\sum (n_j-1)s_j^2\)

One-way ANOVA identity: SST = SSTR + SSE

Computing formulas for sums of squares in one-way ANOVA:

SST = \(\sum x_i^2 - (\sum x_i)^2/n\)
SSTR = \(\sum (T_j^2/n_j) - (\sum x_i)^2/n\)
SSE = SST - SSTR

Note that \(\sum (T_j^2/n_j)\) is a sum over the k samples (one term per sample), whereas \(\sum x_i\) and \(\sum x_i^2\) are sums over all n individual observations.
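To make the difference between the two kinds of sums concrete, here is a minimal sketch on made-up toy data (the list `samples` and its values are hypothetical, not part of the exercise). It applies the computing formulas to two small samples.

```r
# Toy data (hypothetical, not from the exercise): two samples of sizes 3 and 2
samples <- list(a = c(1, 2, 3), b = c(4, 6))
x <- unlist(samples)           # all n observations pooled together
n <- length(x)

# Sums over the n individual observations
sum_x  <- sum(x)               # sum of x_i
sum_x2 <- sum(x^2)             # sum of x_i^2

# Sum over the k samples: each term uses one sample total T_j and size n_j
Tj <- sapply(samples, sum)
nj <- lengths(samples)
sum_T <- sum(Tj^2 / nj)

# Computing formulas
SST  <- sum_x2 - sum_x^2 / n
SSTR <- sum_T  - sum_x^2 / n
SSE  <- SST - SSTR
c(SST = SST, SSTR = SSTR, SSE = SSE)
```

For this toy data the sums of squares work out to SST = 14.8, SSTR = 10.8, and SSE = 4, which can be checked by hand with the defining formulas.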

Mean squares in one-way ANOVA:

MSTR = \(\frac{SSTR}{k-1}\)
MSE = \(\frac{SSE}{n-k}\)

Test statistic for one-way ANOVA (independent samples, normal populations, and equal population standard deviations):

F = \(\frac{MSTR}{MSE}\)

with df = (k - 1, n - k)

Confidence interval for \(\mu_i - \mu_j\) in the Tukey multiple-comparison method (independent samples, normal populations, and equal population standard deviations):

\((\bar{x_i} - \bar{x_j}) \pm \frac{q_{\alpha}}{\sqrt{2}} \cdot s\sqrt{\frac{1}{n_i} + \frac{1}{n_j}}\)

where s = \(\sqrt{MSE}\) and \(q_{\alpha}\) is obtained from the q-curve with parameters k and n - k.
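As an illustration of the Tukey interval, the sketch below computes one interval by hand and compares it with R's built-in `TukeyHSD()`. It assumes the download times listed later in this question, typed in directly; the names `times`, `group`, `manual`, and `builtin` are chosen here for illustration only.

```r
# Download times in seconds, copied from the data table in this question
times <- c(70, 140, 86, 209, 109, 76,        # Early
           206, 293, 251, 308, 237, 209,     # Evening
           217, 177, 176, 222, 212, 164)     # Late
group <- factor(rep(c("Early", "Evening", "Late"), each = 6))

fit <- aov(times ~ group)
k <- nlevels(group)
n <- length(times)

s <- sqrt(deviance(fit) / df.residual(fit))        # s = sqrt(MSE)
q_alpha <- qtukey(0.95, nmeans = k, df = n - k)    # q_alpha for alpha = 0.05

means <- tapply(times, group, mean)
nj    <- tapply(times, group, length)

# Interval for mu_Evening - mu_Early from the formula above
margin <- q_alpha / sqrt(2) * s * sqrt(1 / nj[["Evening"]] + 1 / nj[["Early"]])
manual <- (means[["Evening"]] - means[["Early"]]) + c(-1, 1) * margin

# Same interval from the built-in function
builtin <- TukeyHSD(fit, conf.level = 0.95)$group["Evening-Early", c("lwr", "upr")]

manual
builtin
```

The two intervals should agree. An interval that does not contain 0 indicates that the two population means differ at the chosen family confidence level.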

Test statistic for a Kruskal-Wallis test (independent samples, same-shape populations, all sample sizes 5 or greater):

\(K=\frac{SSTR}{SST/(n-1)}\) or
\(K=\frac{12}{n(n+1)}\sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3(n+1)\)

where SSTR and SST are computed for the ranks of the data, and \(R_j\) denotes the sum of the ranks for the sample data from Population j. K has approximately a chi-square distribution with df = k - 1.
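The following sketch evaluates both forms of K on the download times from this question and compares them with R's `kruskal.test()`. The variable names are illustrative. One caveat: the two forms agree exactly only when there are no tied observations. The value 209 appears twice here, so the shortcut form comes out slightly smaller, while the sums-of-squares form on midranks matches the tie-corrected statistic that `kruskal.test()` reports.

```r
# Download times in seconds, copied from the data table in this question
times <- c(70, 140, 86, 209, 109, 76,        # Early
           206, 293, 251, 308, 237, 209,     # Evening
           217, 177, 176, 222, 212, 164)     # Late
group <- factor(rep(c("Early", "Evening", "Late"), each = 6))

n <- length(times)
k <- nlevels(group)
r <- rank(times)                      # tied values receive the average rank

Rj <- tapply(r, group, sum)           # rank sum per sample
nj <- tapply(r, group, length)        # sample sizes

# Form 1: K = SSTR / (SST / (n - 1)), computed on the ranks
SST_r  <- sum((r - mean(r))^2)
SSTR_r <- sum(nj * (tapply(r, group, mean) - mean(r))^2)
K_ss   <- SSTR_r / (SST_r / (n - 1))

# Form 2: shortcut formula (no correction for ties)
K_short <- 12 / (n * (n + 1)) * sum(Rj^2 / nj) - 3 * (n + 1)

# Built-in test for comparison
K_builtin <- unname(kruskal.test(times, group)$statistic)

# Approximate P-value from the chi-square distribution with k - 1 df
p_value <- pchisq(K_ss, df = k - 1, lower.tail = FALSE)

c(K_ss = K_ss, K_short = K_short, K_builtin = K_builtin, p_value = p_value)
```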

First, let \(\mu_1, \mu_2,\) and \(\mu_3\) be the population mean download times for 7 a.m., 5 p.m., and 12 a.m., respectively. What are the correct hypotheses for a one-way ANOVA test?

Since the question asks, “At the 5% significance level, do the data provide sufficient evidence to conclude that a difference exists in mean download speed?”, the correct hypotheses are:

\(H_0: \mu_1 = \mu_2 = \mu_3\)

\(H_a:\) Not all the means are equal.

Now conduct a one-way ANOVA test on the data. What is the F-statistic?

First we need to get the data from the question. Here we read it from a CSV file hosted on GitHub (it could also be imported from Excel).

data <- read.csv("https://raw.githubusercontent.com/sileaderwt/MTH1320-UMSL/main/Image%2BData/16.3.49/16.3.49.csv")
data

##   Early Evening Late
## 1    70     206  217
## 2   140     293  177
## 3    86     251  176
## 4   209     308  222
## 5   109     237  212
## 6    76     209  164

read.csv() already returns a data frame, so the next step changes nothing here. We still call data.frame() to keep the workflow consistent with other questions, where the data may be imported as a tibble.

dframe = data.frame(data)
dframe

##   Early Evening Late
## 1    70     206  217
## 2   140     293  177
## 3    86     251  176
## 4   209     308  222
## 5   109     237  212
## 6    76     209  164

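First approach: using anova() in R, which Professor Covert introduces in the video lectures (recommended). We first stack the three columns into a single vector X and record how many observations each column contributes.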

X <- c()     # all observations, stacked column by column
len <- c()   # number of non-missing observations in each column
for(item in dframe){
  X <- c(X, item[!is.na(item)])
  len <- c(len, length(item[!is.na(item)]))
}
X

##  [1]  70 140  86 209 109  76 206 293 251 308 237 209 217 177 176 222 212 164

len

## [1] 6 6 6

We build the group labels by repeating each column name once for every observation in that column.

Y= rep(names(dframe), times = len)
Y

##  [1] "Early"   "Early"   "Early"   "Early"   "Early"   "Early"   "Evening"
##  [8] "Evening" "Evening" "Evening" "Evening" "Evening" "Late"    "Late"   
## [15] "Late"    "Late"    "Late"    "Late"

dframe2 = data.frame(X,Y)
dframe2

##      X       Y
## 1   70   Early
## 2  140   Early
## 3   86   Early
## 4  209   Early
## 5  109   Early
## 6   76   Early
## 7  206 Evening
## 8  293 Evening
## 9  251 Evening
## 10 308 Evening
## 11 237 Evening
## 12 209 Evening
## 13 217    Late
## 14 177    Late
## 15 176    Late
## 16 222    Late
## 17 212    Late
## 18 164    Late
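As an alternative to the loop, base R's `stack()` reshapes a wide data frame into the same long layout in one call. This is only a sketch: the data frame `wide` below re-enters the values shown above by hand, and the names `wide` and `long` are chosen here for illustration.

```r
# Same values as the data table above, entered by hand
wide <- data.frame(
  Early   = c(70, 140, 86, 209, 109, 76),
  Evening = c(206, 293, 251, 308, 237, 209),
  Late    = c(217, 177, 176, 222, 212, 164)
)

# stack() returns two columns: 'values' (the data) and 'ind' (the column name as a factor).
# na.omit() drops padding NAs, which matters when the samples have unequal sizes.
long <- na.omit(stack(wide))
names(long) <- c("X", "Y")

# The long frame can be passed to aov() exactly like dframe2
anova(aov(X ~ Y, data = long))
```

Because `Y` is already a factor here, `aov()` treats it as the grouping variable without any conversion.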

We fit the model with aov() and then pass the fit to anova() to get the ANOVA table.

fm1 = aov(X~Y, data=dframe2)
fm1

## Call:
##    aov(formula = X ~ Y, data = dframe2)
## 
## Terms:
##                        Y Residuals
## Sum of Squares  55776.44  26028.67
## Deg. of Freedom        2        15
## 
## Residual standard error: 41.65627
## Estimated effects may be unbalanced

anova(fm1)

## Analysis of Variance Table
## 
## Response: X
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Y          2  55776 27888.2  16.072 0.0001862 ***
## Residuals 15  26029  1735.2                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can run str() to see the structure of the ANOVA table. Any value in the table can then be extracted by column name.

str(anova(fm1))

## Classes 'anova' and 'data.frame':    2 obs. of  5 variables:
##  $ Df     : int  2 15
##  $ Sum Sq : num  55776 26029
##  $ Mean Sq: num  27888 1735
##  $ F value: num  16.1 NA
##  $ Pr(>F) : num  0.000186 NA
##  - attr(*, "heading")= chr [1:2] "Analysis of Variance Table\n" "Response: X"

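Print the F value to 4 significant digits, which is 2 decimal places here. The NA belongs to the Residuals row, which has no F value.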

print((anova(fm1)$`F value`), 4)

## [1] 16.07    NA
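To pull out a single number without the trailing NA, the table can be indexed by row name and column name. The sketch below assumes the same download times, typed in directly; `times`, `group`, and `tab` are illustrative names, so the row is called "group" here rather than "Y".

```r
# Download times in seconds, copied from the data table in this question
times <- c(70, 140, 86, 209, 109, 76,
           206, 293, 251, 308, 237, 209,
           217, 177, 176, 222, 212, 164)
group <- factor(rep(c("Early", "Evening", "Late"), each = 6))

tab <- anova(aov(times ~ group))

# Rows are named after the model term and "Residuals"
f_value <- tab["group", "F value"]
p_value <- tab["group", "Pr(>F)"]

round(f_value, 2)
p_value
```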

Now determine the critical value \(F_{\alpha}\)

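Here \(\alpha = 0.05\) and df = (2, 15), so the critical value is the 95th percentile of the F distribution with those degrees of freedom.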

qf(1-.05, 2, 15)

## [1] 3.68232

Round to 2 decimal places

round(qf(1-.05, 2, 15), 2)

## [1] 3.68

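Since this is a right-tailed test and our test statistic F = 16.07 exceeds the critical value \(F_{\alpha}=3.68\), we reject the null hypothesis. At the 5% significance level, the data provide sufficient evidence to conclude that the mean download times differ across the three times of day.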
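The same decision can be reached with the P-value approach. As a small sketch, the right-tail area beyond the observed F can be computed with `pf()`, using the F value and degrees of freedom reported in the ANOVA table (the variable names are illustrative).

```r
f_obs <- 16.07164   # observed F statistic from the ANOVA table
df1   <- 2          # treatment df, k - 1
df2   <- 15         # error df, n - k
alpha <- 0.05

# Right-tailed test: P-value is the area to the right of f_obs
p_value <- pf(f_obs, df1, df2, lower.tail = FALSE)
p_value

# Reject the null hypothesis when the P-value is at most alpha
reject_null <- p_value <= alpha
reject_null
```

The P-value should match the Pr(>F) column of the ANOVA table (about 0.0001862), which is far below 0.05.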

Second approach: using the formula from the book

This follows the same approach as question 16.3.43. As shown below, both approaches give the same answer.

(a) Compute SST, SSTR, and SSE using the following computing formulas, where \(x_i\) is the ith observation, n is the total number of observations, \(n_j\) is the sample size for population j, and \(T_j\) is the sum of the sample data from population j.

Names of variables:

\(\sum x\): Sx

\(\sum x^2\): Sxx

\(\sum (T_j^2/n_j)\): T

total number of observations: n

To make it easy to find \(\sum x\), we store all the data in a single vector x.

x <- c()
for(item in dframe){
  x <- c(x, item[!is.na(item)])
}
x

##  [1]  70 140  86 209 109  76 206 293 251 308 237 209 217 177 176 222 212 164

n = length(x)
n

## [1] 18

Find \(\sum x_i\)

sum(x)

## [1] 3362

Find \(\sum x_i^2\)

sum(x*x)

## [1] 709752

Find SST

To find SST, we use the formula SST = \(\sum x_i^2 - (\sum x_i)^2/n\)

SST = sum(x*x) - (sum(x))^2/n
SST

## [1] 81805.11

Find \(\sum (T_j^2/n_j)\)

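# Accumulate T_j^2 / n_j over the samples (one term per column).
# Note: the names T and t mask R's built-in T (shorthand for TRUE) and t() in this session.
T = 0
for(item in dframe){
  t =  item[!is.na(item)]      # observations in this sample, NAs dropped
  T = T + sum(t)^2/length(t)   # add T_j^2 / n_j
}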
T

## [1] 683723.3
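The same quantity can be computed without a loop. The sketch below uses `sapply()` to get every sample total and sample size at once. It assumes a data frame with the same values as above, re-entered by hand under the illustrative name `wide`, and it avoids the names `T` and `t`.

```r
# Same values as the data table above, entered by hand
wide <- data.frame(
  Early   = c(70, 140, 86, 209, 109, 76),
  Evening = c(206, 293, 251, 308, 237, 209),
  Late    = c(217, 177, 176, 222, 212, 164)
)

Tj <- sapply(wide, sum, na.rm = TRUE)                # sample totals T_j
nj <- sapply(wide, function(col) sum(!is.na(col)))   # sample sizes n_j

sum_T <- sum(Tj^2 / nj)   # sum of T_j^2 / n_j over the k samples
Tj
sum_T
```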

To find SSTR, we use the formula SSTR = \(\sum (T_j^2/n_j) - (\sum x_i)^2/n\)

SSTR = T - sum(x)^2/n
SSTR

## [1] 55776.44

To find SSE, we use the formula SSE = SST - SSTR

SSE = SST - SSTR
SSE

## [1] 26028.67

(b) Compare your results in part (a) for SSTR and SSE with the following results from the defining formulas.

We now find SST, SSTR, and SSE using the defining formulas.

Find SST using the formula SST = \(\sum (x_i - \bar x)^2\)

SST = sum((x - mean(x))^2)
SST

## [1] 81805.11

Find SSTR using the formula SSTR = \(\sum n_j(\bar{x_j} - \bar{x})^2\)

SSTR = 0
for(item in dframe){
  t =  item[!is.na(item)]
  SSTR = SSTR + length(t)*(mean(t) - mean(x))^2
}
SSTR

## [1] 55776.44

Find SSE using the formula SSE = \(\sum (n_j-1)s_j^2\)

SSE = 0
for(item in dframe){
  t =  item[!is.na(item)]
  SSE = SSE + (length(t)-1)*sd(t)^2
}
SSE

## [1] 26028.67

The defining formulas and the computing formulas give the same values for SST, SSTR, and SSE.

(c) Construct a one-way ANOVA table.

Find df treatment

k = length(dframe)
k-1

## [1] 2

Find SS treatment

SSTR

## [1] 55776.44

Find MS treatment

MSTR = SSTR/(k-1)
MSTR

## [1] 27888.22

Find Error df

n - k

## [1] 15

Find Error SS

SSE

## [1] 26028.67

Find Error MS

MSE = SSE / (n - k)
MSE

## [1] 1735.244

Find the F-statistic for treatment

MSTR / MSE

## [1] 16.07164

Find df total

n - 1

## [1] 17

Find SS total

SST

## [1] 81805.11
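The pieces above can be gathered into a single one-way ANOVA table. The sketch below is one way to lay it out as a data frame. It re-enters the rounded values printed above, so the last digits may differ slightly from the unrounded computation, and the name `anova_table` and its column names are chosen here for illustration.

```r
# Values taken from the results printed above (rounded)
k    <- 3          # number of populations
n    <- 18         # total number of observations
SSTR <- 55776.44   # treatment sum of squares
SSE  <- 26028.67   # error sum of squares
SST  <- SSTR + SSE # one-way ANOVA identity

MSTR <- SSTR / (k - 1)
MSE  <- SSE / (n - k)

anova_table <- data.frame(
  Source = c("Treatment", "Error", "Total"),
  df     = c(k - 1, n - k, n - 1),
  SS     = c(SSTR, SSE, SST),
  MS     = c(MSTR, MSE, NA),          # no mean square for the Total row
  F_stat = c(MSTR / MSE, NA, NA)      # F is defined for the Treatment row only
)
anova_table
```

The df and SS columns each add up: the Treatment and Error rows sum to the Total row.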

Now determine the critical value \(F_{\alpha}\) for \(\alpha = 0.05\) with df = (k - 1, n - k) = (2, 15).

alpha = 0.05
qf(1-alpha, k-1, n-k)

## [1] 3.68232

Round to 2 decimal places

round(qf(1-alpha, k-1, n-k), 2)

## [1] 3.68

Hope that helps!