16.3.43 ANOVA table

The data shown to the right are from independent simple random samples from three populations. Use these data to complete parts (a) through (d).

Notation in one-way ANOVA:

k = number of populations
n = total number of observations
\(\bar x\) = mean of all n observations
\(n_j\) = size of sample from Population j
\(\bar{x_j}\) = mean of sample from Population j
\(s_j^2\) = variance of sample from Population j
\(T_j\) = sum of sample data from Population j

Defining formulas for sums of squares in one-way ANOVA:

SST = \(\sum (x_i - \bar x)^2\)
SSTR = \(\sum n_j(\bar{x_j} - \bar{x})^2\)
SSE = \(\sum (n_j-1)s_j^2\)

One-way ANOVA identity: SST = SSTR + SSE

Computing formulas for sums of squares in one-way ANOVA:

SST = \(\sum x_i^2 - (\sum x_i)^2/n\)
SSTR = \(\sum (T_j^2/n_j) - (\sum x_i)^2/n\)
SSE = SST - SSTR

Note that \(\sum (T_j^2/n_j)\) is a sum over the k samples (j = 1, ..., k), whereas \(\sum x_i\) and \(\sum x_i^2\) are sums over all n observations.
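As a short check on why the computing formulas agree with the defining ones, expand the squares and use \(\sum x_i = n\bar x\), \(T_j = n_j\bar{x_j}\), and \(\sum T_j = \sum x_i\). This is a sketch of the standard algebra, not a quotation from the book:

```latex
\begin{aligned}
\sum (x_i - \bar x)^2
  &= \sum x_i^2 - 2\bar x \sum x_i + n\bar x^2
   = \sum x_i^2 - n\bar x^2
   = \sum x_i^2 - \frac{(\sum x_i)^2}{n} \\
\sum n_j(\bar{x_j} - \bar x)^2
  &= \sum n_j\bar{x_j}^2 - 2\bar x \sum n_j\bar{x_j} + n\bar x^2
   = \sum \frac{T_j^2}{n_j} - \frac{(\sum x_i)^2}{n}
\end{aligned}
```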

Mean squares in one-way ANOVA:

MSTR = \(\frac{SSTR}{k-1}\)
MSE = \(\frac{SSE}{n-k}\)

Test statistic for one-way ANOVA (independent samples, normal populations, and equal population standard deviations):

F = \(\frac{MSTR}{MSE}\)

with df = (k - 1, n - k)

Confidence interval for \(\mu_i - \mu_j\) in the Tukey multiple-comparison method (independent samples, normal populations, and equal population standard deviations):

\((\bar{x_i} - \bar{x_j}) \pm \frac{q_{\alpha}}{\sqrt{2}} \cdot s\sqrt{\frac{1}{n_i} + \frac{1}{n_j}}\)

where s = \(\sqrt{MSE}\) and \(q_{\alpha}\) is the value with area \(\alpha\) to its right under the q-curve with parameters k and n - k
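A minimal sketch of this interval in R, for \(\mu_1 - \mu_2\), using the three samples that are loaded later in this section (typed in here by hand so the block runs on its own). It assumes that base R's `qtukey()`, the quantile function of the studentized range distribution, gives the \(q_{\alpha}\) of the q-curve; the variable names are illustrative only.

```r
# Three samples from the worked example (typed in by hand)
samples <- list(
  Sample1 = c(6, 5, 4),
  Sample2 = c(2, 3, 1),
  Sample3 = c(1, 4, 2, 5)
)

k <- length(samples)        # number of populations
n <- sum(lengths(samples))  # total number of observations

# s = sqrt(MSE), with SSE taken from the defining formula
SSE <- sum(sapply(samples, function(v) (length(v) - 1) * var(v)))
s <- sqrt(SSE / (n - k))

# q_alpha: value with area alpha to its right, parameters k and n - k
alpha <- 0.05
q_alpha <- qtukey(1 - alpha, nmeans = k, df = n - k)

# Interval for mu_1 - mu_2
n_i <- length(samples$Sample1)
n_j <- length(samples$Sample2)
diff_ij <- mean(samples$Sample1) - mean(samples$Sample2)
margin <- q_alpha / sqrt(2) * s * sqrt(1 / n_i + 1 / n_j)
ci <- c(lower = diff_ij - margin, upper = diff_ij + margin)
ci
```

If the interval contains 0, the two population means are not declared different at the chosen family confidence level.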

Test statistic for a Kruskal-Wallis test (independent samples, same-shape populations, all sample sizes 5 or greater):

\(K=\frac{SSTR}{SST/(n-1)}\) or
\(K=\frac{12}{n(n+1)}\sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3(n+1)\)

where SSTR and SST are computed for the ranks of the data, and \(R_j\) denotes the sum of the ranks for the sample data from Population j. K has approximately a chi-square distribution with df = k -1
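The two forms of K can be compared with a short sketch. It reuses the three samples from the worked example later in this section purely to show the arithmetic: those samples are smaller than 5, so the chi-square approximation is not justified for them, and they contain tied values. With ties, `rank()` assigns midranks and the two forms differ slightly, because the rank-sum form carries no tie correction. The first form should agree with the tie-corrected statistic reported by `kruskal.test()`.

```r
# Data from the worked example, with a group label for each observation
x <- c(6, 5, 4, 2, 3, 1, 1, 4, 2, 5)
g <- factor(rep(c("Sample1", "Sample2", "Sample3"), times = c(3, 3, 4)))

n <- length(x)
k <- nlevels(g)
r <- rank(x)                 # midranks are used for tied values

R_j <- tapply(r, g, sum)     # rank sum for each sample
n_j <- tapply(r, g, length)  # size of each sample

# First form: SSTR and SST computed on the ranks
SST_r  <- sum((r - mean(r))^2)
SSTR_r <- sum(n_j * (R_j / n_j - mean(r))^2)
K_ss <- SSTR_r / (SST_r / (n - 1))

# Second form: rank-sum formula (no correction for ties)
K_rs <- 12 / (n * (n + 1)) * sum(R_j^2 / n_j) - 3 * (n + 1)

c(K_ss = K_ss, K_rs = K_rs)

# Approximate P-value from the chi-square distribution with k - 1 df
pchisq(K_ss, df = k - 1, lower.tail = FALSE)
```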

First approach: Using formulas from the book

(a) Compute SST, SSTR, and SSE using the following computing formulas, where \(x_i\) is the ith observation, n is the total number of observations, \(n_j\) is the sample size for population j, and \(T_j\) is the sum of the sample data from population j.

First we load the data from the question. Here it is read from a CSV file; it could also be imported from Excel.

data <- read.csv("https://raw.githubusercontent.com/sileaderwt/MTH1320-UMSL/main/Image%2BData/16.3.43/16.3.43.csv")
data

##   Sample1 Sample2 Sample3
## 1       6       2       1
## 2       5       3       4
## 3       4       1       2
## 4      NA      NA       5

read.csv() already returns a data frame, so the call to data.frame() below only makes a copy. We keep the copy under its own name, dframe, to avoid confusion when working with other questions.

dframe = data.frame(data)
dframe

##   Sample1 Sample2 Sample3
## 1       6       2       1
## 2       5       3       4
## 3       4       1       2
## 4      NA      NA       5

Names of variables

\(\sum x\): Sx

\(\sum x^2\): Sxx

\(\sum (T_j^2/n_j)\): T

n = total number of observations: n

To make \(\sum x\) easy to compute, we store all the observations in a single vector x, skipping the NA values that pad the shorter columns

x <- c()
for(item in dframe){
  x <- c(x, item[!is.na(item)])
}
x

##  [1] 6 5 4 2 3 1 1 4 2 5
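The loop above can also be written without growing a vector step by step. A minimal sketch, assuming the same table; the data frame is typed in here (rather than downloaded) so that the block runs on its own:

```r
# Reconstruction of the table above; NA pads the shorter samples
dframe <- data.frame(
  Sample1 = c(6, 5, 4, NA),
  Sample2 = c(2, 3, 1, NA),
  Sample3 = c(1, 4, 2, 5)
)

# unlist() concatenates the columns in order; then drop the NA padding
x <- unlist(dframe, use.names = FALSE)
x <- x[!is.na(x)]
x
```

The result is the same vector as before, in the same order (Sample1, then Sample2, then Sample3).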

n = length(x)
n

## [1] 10

Find \(\sum x_i\)

sum(x)

## [1] 33

Find \(\sum x_i^2\)

sum(x*x)

## [1] 137

Find SST

To find SST, we use formula SST = \(\sum x_i^2 - (\sum x_i)^2/n\)

SST = sum(x*x) - (sum(x))^2/n
SST

## [1] 28.1

Find \(\sum (T_j^2/n_j)\)

T = 0
for(item in dframe){
  t =  item[!is.na(item)]
  T = T + sum(t)^2/length(t)
}
T

## [1] 123
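An alternative sketch of the same sum uses `sapply()` to compute one term \(T_j^2/n_j\) per column, which also makes the individual terms visible. The name `sum_Tj2_nj` is only illustrative; one reason to choose it is that `T` is R's shorthand for `TRUE`, so assigning to `T` hides that shorthand for the rest of the session. The data frame is typed in so that the block is self-contained.

```r
# Reconstruction of the data table; NA pads the shorter samples
dframe <- data.frame(
  Sample1 = c(6, 5, 4, NA),
  Sample2 = c(2, 3, 1, NA),
  Sample3 = c(1, 4, 2, 5)
)

# One term T_j^2 / n_j per sample, ignoring the NA padding
terms <- sapply(dframe, function(col) {
  v <- col[!is.na(col)]
  sum(v)^2 / length(v)
})
terms

# Sum over the k samples
sum_Tj2_nj <- sum(terms)
sum_Tj2_nj
```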

To find SSTR, we use the formula SSTR = \(\sum (T_j^2/n_j) - (\sum x_i)^2/n\)

SSTR = T - sum(x)^2/n
SSTR

## [1] 14.1

To find SSE, we use the formula SSE = SST - SSTR

SSE = SST - SSTR
SSE

## [1] 14

(b) Compare your results in part (a) for SSTR and SSE with the following results from the defining formulas.

We now find SST, SSTR, and SSE using the defining formulas.

Find SST using the formula SST = \(\sum (x_i - \bar x)^2\)

SST = sum((x - mean(x))^2)
SST

## [1] 28.1

Find SSTR using the formula SSTR = \(\sum n_j(\bar{x_j} - \bar{x})^2\)

SSTR = 0
for(item in dframe){
  t =  item[!is.na(item)]
  SSTR = SSTR + length(t)*(mean(t) - mean(x))^2
}
SSTR

## [1] 14.1

Find SSE using the formula SSE = \(\sum (n_j-1)s_j^2\)

SSE = 0
for(item in dframe){
  t =  item[!is.na(item)]
  SSE = SSE + (length(t)-1)*var(t)
}
SSE

## [1] 14

The defining formulas give the same answers as the computing formulas in part (a).
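The one-way ANOVA identity SST = SSTR + SSE can be checked directly from the defining formulas. The sketch below is one way to do it, assuming the observations and their group labels are typed in by hand; it uses `tapply()` for the per-sample quantities and `all.equal()` rather than `==`, since the sums are floating-point numbers.

```r
# Observations and the sample each one belongs to
x <- c(6, 5, 4, 2, 3, 1, 1, 4, 2, 5)
g <- rep(c("Sample1", "Sample2", "Sample3"), times = c(3, 3, 4))

n_j    <- tapply(x, g, length)  # sample sizes
xbar_j <- tapply(x, g, mean)    # sample means
s2_j   <- tapply(x, g, var)     # sample variances

SST  <- sum((x - mean(x))^2)
SSTR <- sum(n_j * (xbar_j - mean(x))^2)
SSE  <- sum((n_j - 1) * s2_j)

# Compare with a small numerical tolerance
all.equal(SST, SSTR + SSE)
```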

(c) Construct a one-way ANOVA table.

Find df treatment

k = length(dframe)
k-1

## [1] 2

Find SS treatment

SSTR

## [1] 14.1

Find MS treatment

MSTR = SSTR/(k-1)
MSTR

## [1] 7.05

Find Error df

n - k

## [1] 7

Find Error SS

SSE

## [1] 14

Find Error MS

MSE = SSE / (n - k)
MSE

## [1] 2

Find F-statistic treatment

MSTR / MSE

## [1] 3.525

Find df total

n - 1

## [1] 9

Find SS total

SST

## [1] 28.1
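The pieces above can be collected into a single table. This is only a sketch of one possible layout; the column names (`Source`, `df`, `SS`, `MS`, `F_stat`) are choices made here, and the numbers are the values computed in parts (a) and (c).

```r
# Values computed above
k <- 3
n <- 10
SSTR <- 14.1
SSE  <- 14
SST  <- 28.1

MSTR <- SSTR / (k - 1)
MSE  <- SSE / (n - k)

# NA marks cells that are left blank in a one-way ANOVA table
anova_table <- data.frame(
  Source = c("Treatment", "Error", "Total"),
  df     = c(k - 1, n - k, n - 1),
  SS     = c(SSTR, SSE, SST),
  MS     = c(MSTR, MSE, NA),
  F_stat = c(MSTR / MSE, NA, NA)
)
anova_table
```

The df and SS columns each add up: the Treatment and Error rows sum to the Total row.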

(d) Decide, at the 5% significance level, whether the data provide sufficient evidence to conclude that the means of the populations from which the samples were drawn are not all the same.

First, let \(\mu_1, \mu_2,\) and \(\mu_3\) be the means of the populations from which samples 1, 2, and 3 were drawn, respectively. What are the correct hypotheses for a one-way ANOVA test?

\(H_0: \mu_1 = \mu_2 = \mu_3\)

\(H_a:\) Not all the means are equal.

Now determine the critical value \(F_{\alpha}\)

Since \(\alpha = .05\)

alpha = 0.05
qf(1-alpha, k-1, n-k)

## [1] 4.737414

Round to two decimal places

round(qf(1-alpha, k-1, n-k), 2)

## [1] 4.74

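Since the test statistic F = 3.53 is less than the critical value \(F_{\alpha}=4.74\) and the test is right-tailed, we do not reject \(H_0\). At the 5% significance level, the data do not provide sufficient evidence to conclude that the population means are not all the same.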
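The same decision can be reached with the P-value approach. A minimal sketch, assuming the F-statistic and degrees of freedom from part (c): `pf()` with `lower.tail = FALSE` gives the area to the right of the observed F under the F-curve.

```r
# Test statistic and degrees of freedom from the ANOVA table in part (c)
F_stat <- 3.525
df1 <- 2   # k - 1
df2 <- 7   # n - k
alpha <- 0.05

# Right-tailed P-value
p_value <- pf(F_stat, df1, df2, lower.tail = FALSE)
p_value

# Reject H0 only if the P-value is at most alpha
p_value <= alpha
```

The P-value is larger than 0.05, so \(H_0\) is not rejected, in agreement with the critical-value approach.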

Second approach: using aov() and anova() in R, which Professor Covert introduces in the video lectures (recommended)

X <- c()
len <- c()
for(item in dframe){
  X <- c(X, item[!is.na(item)])
  len <- c(len, length(item[!is.na(item)]))
}
X

##  [1] 6 5 4 2 3 1 1 4 2 5

len

## [1] 3 3 4

We build a group label for each observation by repeating each column name as many times as that sample has observations.

Y= rep(names(dframe), times = len)
Y

##  [1] "Sample1" "Sample1" "Sample1" "Sample2" "Sample2" "Sample2" "Sample3"
##  [8] "Sample3" "Sample3" "Sample3"

dframe2 = data.frame(X,Y)
dframe2

##    X       Y
## 1  6 Sample1
## 2  5 Sample1
## 3  4 Sample1
## 4  2 Sample2
## 5  3 Sample2
## 6  1 Sample2
## 7  1 Sample3
## 8  4 Sample3
## 9  2 Sample3
## 10 5 Sample3
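The same long-format data frame can be built in one step with base R's `stack()`, which puts all values in one column and the originating column name in another. This is a sketch of an alternative, not the method used in the lectures; the data frame is typed in so the block runs on its own, and the columns are renamed to X and Y to match the names used above.

```r
# Reconstruction of the data table; NA pads the shorter samples
dframe <- data.frame(
  Sample1 = c(6, 5, 4, NA),
  Sample2 = c(2, 3, 1, NA),
  Sample3 = c(1, 4, 2, 5)
)

# stack() returns columns "values" and "ind"; na.omit() drops the padding rows
long <- na.omit(stack(dframe))
names(long) <- c("X", "Y")
long
```

Here Y is a factor rather than a character vector, and the row names keep gaps where the NA rows were removed. Neither difference affects `aov(X ~ Y, data = long)`.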

We fit the model with aov() and then run anova() on the fitted model.

fm1 = aov(X~Y, data=dframe2)
fm1

## Call:
##    aov(formula = X ~ Y, data = dframe2)
## 
## Terms:
##                    Y Residuals
## Sum of Squares  14.1      14.0
## Deg. of Freedom    2         7
## 
## Residual standard error: 1.414214
## Estimated effects may be unbalanced

anova(fm1)

## Analysis of Variance Table
## 
## Response: X
##           Df Sum Sq Mean Sq F value  Pr(>F)  
## Y          2   14.1    7.05   3.525 0.08729 .
## Residuals  7   14.0    2.00                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can run str() to see the structure of the ANOVA table. It is a data frame, so any column can be extracted from it by name.

str(anova(fm1))

## Classes 'anova' and 'data.frame':    2 obs. of  5 variables:
##  $ Df     : int  2 7
##  $ Sum Sq : num  14.1 14
##  $ Mean Sq: num  7.05 2
##  $ F value: num  3.52 NA
##  $ Pr(>F) : num  0.0873 NA
##  - attr(*, "heading")= chr [1:2] "Analysis of Variance Table\n" "Response: X"

Print the F-statistic to 3 significant digits (the NA belongs to the Residuals row, which has no F value)

print((anova(fm1)$`F value`), 3)

## [1] 3.52   NA

Find df of treatment and error

anova(fm1)$`Df`

## [1] 2 7

Find SS of treatment and error to 4 significant digits

print((anova(fm1)$`Sum Sq`), 4)

## [1] 14.1 14.0

Find MS of treatment and error to 3 significant digits

print((anova(fm1)$`Mean Sq`), 3)

## [1] 7.05 2.00

Find the P-value to 3 significant digits

print((anova(fm1)$`Pr(>F)`), 3)

## [1] 0.0873     NA
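As an independent cross-check, base R's `oneway.test()` performs the same F-test when `var.equal = TRUE` is set. This is an addition to the lecture method, sketched here with the data typed in by hand; note that the default, `var.equal = FALSE`, runs Welch's test instead and gives different numbers.

```r
# Observations and group labels, as in dframe2
X <- c(6, 5, 4, 2, 3, 1, 1, 4, 2, 5)
Y <- factor(rep(c("Sample1", "Sample2", "Sample3"), times = c(3, 3, 4)))

# var.equal = TRUE requests the classical pooled-variance one-way ANOVA
res <- oneway.test(X ~ Y, var.equal = TRUE)

res$statistic  # F-statistic
res$parameter  # numerator and denominator df
res$p.value    # P-value
```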

The two approaches give the same answers. The second approach is recommended, since it is the one Professor Covert introduces in the video lectures.

Now determine the critical value \(F_{\alpha}\)

Since \(\alpha = .05\) and df = (2, 7)

qf(1-.05, 2, 7)

## [1] 4.737414

Round to two decimal places

round(qf(1-.05, 2, 7), 2)

## [1] 4.74

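Since the test statistic F = 3.52 is less than the critical value \(F_{\alpha}=4.74\) and the test is right-tailed, we do not reject \(H_0\). Equivalently, the P-value 0.0873 is greater than 0.05. At the 5% significance level, the data do not provide sufficient evidence to conclude that the population means are not all the same.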

Hope that helps!