An R function for bootstrap confidence intervals

Here I introduce a R function made by myself to calculate several confidence intervals of mean/median including exact CI, basic bootstrap CI, percentile bootstrap CI & studentized bootstrap CI.

R code

# Copyright 2013 Chen-Pan Liao
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
# See http://apansharing.blogspot.tw/2013/12/an-r-function-for-bootstrap-confidence.html
# to learn how to use this function
## main function
bootCI <- function(
x,
alpha=0.05,
alternative=c("t", "l", "g"), # see arg "alternative" in help(t.test)
B=1999, # number of bootstrap samples
quantileAlgorithm = 7 # passed to quantile(type)
){
#validation
if(mode(x) != "numeric"){
stop("mode(x) must be \"numeric\"")
}
if(class(x) != "numeric" && class(x) != "integer"){
stop("class(x) must be \"numeric\" or \"integer\"")
}
if(length(x) < 2){
stop("length(x) must be larger than 1")
}
# initiation
alternative <- alternative[1]
stat.length <- length(x)
stat.mean <- mean(x)
stat.sem <- sqrt(var(x) / stat.length)
# alternative
probs <- c(alpha/2, 1-alpha/2)
if(alternative == "l") probs <- c(0, 1-alpha)
if(alternative == "g") probs <- c(alpha, 1)
# bootstrap
boot.mat <- matrix(0.0, B+1, stat.length)
boot.mat[1,] <- x
for (i in 2:(B+1)) {
boot.mat[i,] <- sample(x, replace=T)
}
# exact (traditional) CI based on t distribution
CI.exact <- stat.mean + qt(probs, stat.length - 1) * stat.sem
names(CI.exact) <- paste(probs*100, "%", sep="")
# basic bootstrap CI
CI.basic <- 2 * stat.mean - quantile(
apply(boot.mat, 1, mean), probs=1-probs, type=quantileAlgorithm
)
names(CI.basic) <- rev(names(CI.basic))
# percentile bootstrap CI
CI.percentile <- quantile(
apply(boot.mat, 1, mean), probs=probs, type=quantileAlgorithm
)
# studentized bootstrap CI
tmp.mean <- apply(boot.mat, 1, mean)
tmp.sem <- apply(
boot.mat, 1, function(.){sqrt(var(.) / length(.))}
)
tmp.t <- (tmp.mean - stat.mean) / tmp.sem
expr <- expression(
CI.studentized <-
stat.mean - quantile(
tmp.t, probs=1-probs, type=quantileAlgorithm
) * stat.sem
)
a.try <- try(eval(expr), T)
if("try-error" %in% class(a.try)) {
warning("studentized bootstrap CI cannot
be done well due to boostrap sem = 0")
}
names(CI.studentized) <- rev(names(CI.studentized))
# exceptions of one-tail
if(alternative == "g") {
CI.exact[2] = Inf
CI.basic[2] = Inf
CI.percentile[2] = Inf
CI.studentized[2] = Inf
} else if(alternative == "l") {
CI.exact[1] = -Inf
CI.basic[1] = -Inf
CI.percentile[1] = -Inf
CI.studentized[1] = -Inf
}
output <- list(
x = x,
alpha = alpha,
alternative = alternative,
B = B,
CI.exact = CI.exact,
CI.basic = CI.basic,
CI.percentile = CI.percentile,
CI.studentized = CI.studentized
)
class(output) <- "bootCI"
return(output)
}
## print function
print.bootCI <- function(w){
cat("Summary of x\n")
print(summary(w$x))
cat("CIs of mu\n")
mat <- rbind(w$CI.exact, w$CI.basic, w$CI.percentile, w$CI.studentized)
rownames(mat) <- c("$CI.exact", "$CI.basic", "$CI.percentile", "$CI.studentized")
print(mat)
}
view raw bootCI.R hosted with ❤ by GitHub

Arguments

There are five arguments in this R function bootCI():

  • x — a numeric vector specifying $x_i$
  • alpha = 0.05 — a numeric specifying $\alpha$ value
  • alternative = c("t", "l", "g") — a single character string specifying two-sided, less or greater tail
  • B = 1999 — a integer specifying number of bootstrap samples
  • quantileAlgorithm = 7 — a integer specifying the argument type passed to quantile()

Examples

We can define a numeric vector $x = [5\, 2\, 3\, 6\, 8\, 19\, 1.5]$ including $N=7$ items as num:

> num <- c(5, 2, 3, 6, 8, 19, 1.5)

After loading the above function, we can call the function to calculate CIs:

> bootCI(num)
Summary of x
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.500   2.500   5.000   6.357   7.000  19.000 
CIs of mu
                     2.5%     97.5%
$CI.exact       0.7778728 11.936413
$CI.basic       1.7125000  9.714286
$CI.percentile  3.0000000 11.001786
$CI.studentized 2.6785623 17.669412

The above results show that

exact CI=(μ^+tα2se^μ^,μ^+t1α2se^μ^)basic CI=(2μ^μ^1α2,2μ^μ^α2)percentile CI=(μ^α2,μ^1α2)studentized CI=(μ^+tα2se^μ^,μ^+t1α2se^μ^)

where $\hat{\mu}$ is mean of $x$, $t_{\frac{\alpha}{2}}$ is lower $\alpha/2$ critical value for the $t$ distribution given $\text{df} = N-1$, $\hat{se}_{\hat{\mu}}$ is the standard error of the mean of $x$, $\hat{\mu}_{\frac{\alpha}{2}}^*$ is $\alpha/2$ percentile of the mean of bootstrapped $x$, and $t^*_{\frac{\alpha}{2}} = \frac{\hat{\mu^*}-\hat{\mu}}{\hat{se}^*_{\hat{\mu^*}}}$ is t-value based on bootstrapped $x$. See

for more details.

We can assign a different $\alpha$, direction or number of bootstrap samples:

> bootCI(num, alpha=0.01)
Summary of x
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.500   2.500   5.000   6.357   7.000  19.000 
CIs of mu
                      0.5%    99.5%
$CI.exact       -2.0962642 14.81055
$CI.basic       -0.1428571 10.28571
$CI.percentile   2.4285714 12.85714
$CI.studentized  1.1394952 25.22030

> bootCI(num, alternative="g")
Summary of x
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.500   2.500   5.000   6.357   7.000  19.000 
CIs of mu
                      5% 100%
$CI.exact       1.926445  Inf
$CI.basic       2.714286  Inf
$CI.percentile  3.285714  Inf
$CI.studentized 3.332740  Inf

> bootCI(num, B=99)
Summary of x
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.500   2.500   5.000   6.357   7.000  19.000 
CIs of mu
                     2.5%     97.5%
$CI.exact       0.7778728 11.936413
$CI.basic       1.6053571  9.330357
$CI.percentile  3.3839286 11.108929
$CI.studentized 2.5532298 15.532786

The returned list may be helpful for someone:

> myCI <- bootCI(num)
> myCI$CI.percentile
    2.5%    97.5% 
 3.00000 10.85714 
> str(myCI)
List of 8
 $ x             : num [1:7] 5 2 3 6 8 19 1.5
 $ alpha         : num 0.05
 $ alternative   : chr "t"
 $ B             : num 1999
 $ CI.exact      : Named num [1:2] 0.778 11.936
  ..- attr(*, "names")= chr [1:2] "2.5%" "97.5%"
 $ CI.basic      : Named num [1:2] 1.86 9.71
  ..- attr(*, "names")= chr [1:2] "2.5%" "97.5%"
 $ CI.percentile : Named num [1:2] 3 10.9
  ..- attr(*, "names")= chr [1:2] "2.5%" "97.5%"
 $ CI.studentized: Named num [1:2] 2.65 17.73
  ..- attr(*, "names")= chr [1:2] "2.5%" "97.5%"
 - attr(*, "class")= chr "bootCI"

Paired t-test vs general mixed model with R

Dataset

Consider the following dataset:

> x1 <- c(3,4,3,5,6,6,6,7,7,6,8,9,11)
> x2 <- c(3,4,5,5,8,6,8,7,9,8,8,9,14)
> cbind(x1,x2)
      x1 x2
 [1,]  3  3
 [2,]  4  4
 [3,]  3  5
 [4,]  5  5
 [5,]  6  8
 [6,]  6  6
 [7,]  6  8
 [8,]  7  7
 [9,]  7  9
[10,]  6  8
[11,]  8  8
[12,]  9  9
[13,] 11 14

where x1 and x2 are paired sample.

Paired t-test

The process of a typical paired t-test with R is following:

> t.test(x1, x2, paired=T)

 Paired t-test

data:  x1 and x2
t = -3.1225, df = 12, p-value = 0.008814
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.6977786 -0.3022214
sample estimates:
mean of the differences 
                     -1

The results show that 1) average of $x_1 - x_2$ equals to -1; 2) $t_{12} = -3.1225$, $p = 0.008814$.

Mixed model for paired sample

We can build a mixed model for this paired-sample example. Consider the following dataset:

> y <- c(x1,x2)
> trt <- gl(2,13,26)
> subject <- gl(13,1,26)
> cbind(y, trt, subject)
       y trt subject
 [1,]  3   1       1
 [2,]  4   1       2
 [3,]  3   1       3
 [4,]  5   1       4
 [5,]  6   1       5
 [6,]  6   1       6
 [7,]  6   1       7
 [8,]  7   1       8
 [9,]  7   1       9
[10,]  6   1      10
[11,]  8   1      11
[12,]  9   1      12
[13,] 11   1      13
[14,]  3   2       1
[15,]  4   2       2
[16,]  5   2       3
[17,]  5   2       4
[18,]  8   2       5
[19,]  6   2       6
[20,]  8   2       7
[21,]  7   2       8
[22,]  9   2       9
[23,]  8   2      10
[24,]  8   2      11
[25,]  9   2      12
[26,] 14   2      13

where y is dependent variable, trt is a two-level fixed factor, and subject is experimental units considered as blocks (random factor). A mixed model can fit this dataset:

yij=τi+bj+εij

where $i=\{1,2\}$ showing two-level in fixed factor $\tau$, $j=\{1,2,\ldots , 13\}$ showing 13 subjects as a random factor $b$.

We can find the solution of this mixed model by using lme4:lmer():

> require(lme4)
> mod <- lmer(y ~ trt + (1|subject) )
> summary(mod)
Linear mixed model fit by REML ['lmerMod']
Formula: y ~ trt + (1 | subject) 

REML criterion at convergence: 98.5708 

Random effects:
 Groups   Name        Variance Std.Dev.
 subject  (Intercept) 5.8590   2.4205  
 Residual             0.6667   0.8165  
Number of obs: 26, groups: subject, 13

Fixed effects:
            Estimate Std. Error t value
(Intercept)   6.2308     0.7085   8.794
trt2          1.0000     0.3203   3.123

Correlation of Fixed Effects:
     (Intr)
trt2 -0.226

As you can see, the estimate of trt2 equals 1 showing the average of $x_2 - x_1$; the statistic $t=3.123$ is equivalent to the t-value I get in paired t-test.

We can find the p-value by using car:Anova():

> require(car)
> Anova(mod, type=3, test.statistic="F")
Analysis of Deviance Table (Type III Wald F tests with Kenward-Roger df)

Response: y
                F Df Df.res    Pr(>F)    
(Intercept) 77.34  1 13.288 6.636e-07 ***
trt          9.75  1 12.000  0.008814 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

and then we can find that $F_{1,12} = 9.75 = 3.123^2$ and $p=0.008814$. These outcomes are all equivalent to what I got in paired t-test.