Apan’s Notes: 2013

An R function for bootstrap confidence intervals

Here I introduce a R function made by myself to calculate several confidence intervals of mean/median including exact CI, basic bootstrap CI, percentile bootstrap CI & studentized bootstrap CI.

R code

Arguments

There are five arguments in this R function bootCI():

x — a numeric vector specifying $x_i$
alpha = 0.05 — a numeric specifying $\alpha$ value
alternative = c("t", "l", "g") — a single character string specifying two-sided, less or greater tail
B = 1999 — a integer specifying number of bootstrap samples
quantileAlgorithm = 7 — a integer specifying the argument type passed to quantile()

Examples

We can define a numeric vector $x = [5\, 2\, 3\, 6\, 8\, 19\, 1.5]$ including $N=7$ items as num:

> num <- c(5, 2, 3, 6, 8, 19, 1.5)

After loading the above function, we can call the function to calculate CIs:

> bootCI(num)
Summary of x
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.500   2.500   5.000   6.357   7.000  19.000 
CIs of mu
                     2.5%     97.5%
$CI.exact       0.7778728 11.936413
$CI.basic       1.7125000  9.714286
$CI.percentile  3.0000000 11.001786
$CI.studentized 2.6785623 17.669412

The above results show that

\begin{aligned} \text{exact CI} & = ( \hat{\mu} + t_{\frac{\alpha}{2}} \cdot \hat{se}_{\hat{\mu}} \;,\; \hat{\mu} + t_{1-\frac{\alpha}{2}} \cdot \hat{se}_{\hat{\mu}} ) \\ \text{basic CI} & = ( 2\hat{\mu} - \hat{\mu}_{1-\frac{\alpha}{2}}^{*} \;,\; 2\hat{\mu} - \hat{\mu}_{\frac{\alpha}{2}}^{*} ) \\ \text{percentile CI} & = ( \hat{\mu}_{\frac{\alpha}{2}}^* \;,\; \hat{\mu}_{1-\frac{\alpha}{2}}^* ) \\ \text{studentized CI} & = ( \hat{\mu} + t^*_{\frac{\alpha}{2}} \cdot \hat{se}_{\hat{\mu}} \;,\; \hat{\mu} + t^*_{1-\frac{\alpha}{2}} \cdot \hat{se}_{\hat{\mu}} ) \\ \end{aligned}

where $\hat{\mu}$ is mean of $x$, $t_{\frac{\alpha}{2}}$ is lower $\alpha/2$ critical value for the $t$ distribution given $\text{df} = N-1$, $\hat{se}_{\hat{\mu}}$ is the standard error of the mean of $x$, $\hat{\mu}_{\frac{\alpha}{2}}^*$ is $\alpha/2$ percentile of the mean of bootstrapped $x$, and $t^*_{\frac{\alpha}{2}} = \frac{\hat{\mu^*}-\hat{\mu}}{\hat{se}^*_{\hat{\mu^*}}}$ is t-value based on bootstrapped $x$. See

J. Carpenter and J. Bithell (2000). Bootstrap confidence intervals: when, which, what? A practical guide for medical statisticians. Statist. Med. 19:1141-1164
Wikipedia contributors, 'Bootstrapping (statistics)', Wikipedia, The Free Encyclopedia, 17 December 2013, 21:28 UTC, <http://en.wikipedia.org/w/index.php?title=Bootstrapping_(statistics)&oldid=586549893> [accessed 31 December 2013]

for more details.

We can assign a different $\alpha$, direction or number of bootstrap samples:

> bootCI(num, alpha=0.01)
Summary of x
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.500   2.500   5.000   6.357   7.000  19.000 
CIs of mu
                      0.5%    99.5%
$CI.exact       -2.0962642 14.81055
$CI.basic       -0.1428571 10.28571
$CI.percentile   2.4285714 12.85714
$CI.studentized  1.1394952 25.22030

> bootCI(num, alternative="g")
Summary of x
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.500   2.500   5.000   6.357   7.000  19.000 
CIs of mu
                      5% 100%
$CI.exact       1.926445  Inf
$CI.basic       2.714286  Inf
$CI.percentile  3.285714  Inf
$CI.studentized 3.332740  Inf

> bootCI(num, B=99)
Summary of x
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.500   2.500   5.000   6.357   7.000  19.000 
CIs of mu
                     2.5%     97.5%
$CI.exact       0.7778728 11.936413
$CI.basic       1.6053571  9.330357
$CI.percentile  3.3839286 11.108929
$CI.studentized 2.5532298 15.532786

The returned list may be helpful for someone:

> myCI <- bootCI(num)
> myCI$CI.percentile
    2.5%    97.5% 
 3.00000 10.85714 
> str(myCI)
List of 8
 $ x             : num [1:7] 5 2 3 6 8 19 1.5
 $ alpha         : num 0.05
 $ alternative   : chr "t"
 $ B             : num 1999
 $ CI.exact      : Named num [1:2] 0.778 11.936
  ..- attr(*, "names")= chr [1:2] "2.5%" "97.5%"
 $ CI.basic      : Named num [1:2] 1.86 9.71
  ..- attr(*, "names")= chr [1:2] "2.5%" "97.5%"
 $ CI.percentile : Named num [1:2] 3 10.9
  ..- attr(*, "names")= chr [1:2] "2.5%" "97.5%"
 $ CI.studentized: Named num [1:2] 2.65 17.73
  ..- attr(*, "names")= chr [1:2] "2.5%" "97.5%"
 - attr(*, "class")= chr "bootCI"

Paired t-test vs general mixed model with R

Dataset

Consider the following dataset:

> x1 <- c(3,4,3,5,6,6,6,7,7,6,8,9,11)
> x2 <- c(3,4,5,5,8,6,8,7,9,8,8,9,14)
> cbind(x1,x2)
      x1 x2
 [1,]  3  3
 [2,]  4  4
 [3,]  3  5
 [4,]  5  5
 [5,]  6  8
 [6,]  6  6
 [7,]  6  8
 [8,]  7  7
 [9,]  7  9
[10,]  6  8
[11,]  8  8
[12,]  9  9
[13,] 11 14

where x1 and x2 are paired sample.

Paired t-test

The process of a typical paired t-test with R is following:

> t.test(x1, x2, paired=T)

 Paired t-test

data:  x1 and x2
t = -3.1225, df = 12, p-value = 0.008814
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.6977786 -0.3022214
sample estimates:
mean of the differences 
                     -1

The results show that 1) average of $x_1 - x_2$ equals to -1; 2) $t_{12} = -3.1225$, $p = 0.008814$.

Mixed model for paired sample

We can build a mixed model for this paired-sample example. Consider the following dataset:

> y <- c(x1,x2)
> trt <- gl(2,13,26)
> subject <- gl(13,1,26)
> cbind(y, trt, subject)
       y trt subject
 [1,]  3   1       1
 [2,]  4   1       2
 [3,]  3   1       3
 [4,]  5   1       4
 [5,]  6   1       5
 [6,]  6   1       6
 [7,]  6   1       7
 [8,]  7   1       8
 [9,]  7   1       9
[10,]  6   1      10
[11,]  8   1      11
[12,]  9   1      12
[13,] 11   1      13
[14,]  3   2       1
[15,]  4   2       2
[16,]  5   2       3
[17,]  5   2       4
[18,]  8   2       5
[19,]  6   2       6
[20,]  8   2       7
[21,]  7   2       8
[22,]  9   2       9
[23,]  8   2      10
[24,]  8   2      11
[25,]  9   2      12
[26,] 14   2      13

where y is dependent variable, trt is a two-level fixed factor, and subject is experimental units considered as blocks (random factor). A mixed model can fit this dataset:

\[ y_{ij} = \tau_i + b_{j} + \varepsilon_{ij} \]

where $i=\{1,2\}$ showing two-level in fixed factor $\tau$, $j=\{1,2,\ldots , 13\}$ showing 13 subjects as a random factor $b$.

We can find the solution of this mixed model by using lme4:lmer():

> require(lme4)
> mod <- lmer(y ~ trt + (1|subject) )
> summary(mod)
Linear mixed model fit by REML ['lmerMod']
Formula: y ~ trt + (1 | subject) 

REML criterion at convergence: 98.5708 

Random effects:
 Groups   Name        Variance Std.Dev.
 subject  (Intercept) 5.8590   2.4205  
 Residual             0.6667   0.8165  
Number of obs: 26, groups: subject, 13

Fixed effects:
            Estimate Std. Error t value
(Intercept)   6.2308     0.7085   8.794
trt2          1.0000     0.3203   3.123

Correlation of Fixed Effects:
     (Intr)
trt2 -0.226

As you can see, the estimate of trt2 equals 1 showing the average of $x_2 - x_1$; the statistic $t=3.123$ is equivalent to the t-value I get in paired t-test.

We can find the p-value by using car:Anova():

> require(car)
> Anova(mod, type=3, test.statistic="F")
Analysis of Deviance Table (Type III Wald F tests with Kenward-Roger df)

Response: y
                F Df Df.res    Pr(>F)    
(Intercept) 77.34  1 13.288 6.636e-07 ***
trt          9.75  1 12.000  0.008814 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

and then we can find that $F_{1,12} = 9.75 = 3.123^2$ and $p=0.008814$. These outcomes are all equivalent to what I got in paired t-test.

Multiple comparisons for generalized linear model in R

An example

Defining y and x

> y <- c (rpois(10,1), rpois(10,2), rpois(10,3), rpois(10,5))
> y
 [1] 0 1 0 2 1 2 0 1 1 0 3 2 2 4 2 3 3 1 1 1 3 0 4 4 2 3 1 4 3 6 3 8 6 3 2 6 5 5 2 8
> x <- gl(4,10)
> x
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4
Levels: 1 2 3 4

Fitting a Poisson regression

> f <- glm(y~x, family=poisson)
> summary(f)

Call:
glm(formula = y ~ x, family = poisson)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.44949  -0.90724   0.04533   0.52699   1.52242  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.2231     0.3536  -0.631 0.527945    
x2            1.0116     0.4129   2.450 0.014277 *  
x3            1.3218     0.3979   3.322 0.000895 ***
x4            1.7918     0.3819   4.692 2.71e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 67.967  on 39  degrees of freedom
Residual deviance: 34.884  on 36  degrees of freedom
AIC: 142.6

Number of Fisher Scoring iterations: 5

Multiple comparisons

> # install.packages("multcomp")
> require(multcomp)
> f.mc <- glht(f, linfct = mcp(x = "Tukey"))
> summary(f.mc)

  Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: glm(formula = y ~ x, family = poisson)

Linear Hypotheses:
           Estimate Std. Error z value Pr(>|z|)    
2 - 1 == 0   1.0116     0.4129   2.450  0.06433 .  
3 - 1 == 0   1.3218     0.3979   3.322  0.00481 ** 
4 - 1 == 0   1.7918     0.3819   4.692  < 0.001 ***
3 - 2 == 0   0.3102     0.2807   1.105  0.67729    
4 - 2 == 0   0.7802     0.2575   3.030  0.01198 *  
4 - 3 == 0   0.4700     0.2327   2.019  0.17303    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)

> confint(f.mc)

  Simultaneous Confidence Intervals

Multiple Comparisons of Means: Tukey Contrasts


Fit: glm(formula = y ~ x, family = poisson)

Quantile = 2.5471
95% family-wise confidence level
 

Linear Hypotheses:
           Estimate lwr      upr     
2 - 1 == 0  1.01160 -0.04002  2.06322
3 - 1 == 0  1.32176  0.30822  2.33529
4 - 1 == 0  1.79176  0.81905  2.76447
3 - 2 == 0  0.31015 -0.40481  1.02512
4 - 2 == 0  0.78016  0.12436  1.43596
4 - 3 == 0  0.47000 -0.12281  1.06282

> coef(f.mc)
    2 - 1     3 - 1     4 - 1     3 - 2     4 - 2     4 - 3 
1.0116009 1.3217558 1.7917595 0.3101549 0.7801586 0.4700036 
> vcov(f.mc)
              2 - 1         3 - 1         4 - 1         3 - 2         4 - 2         4 - 3
2 - 1  1.704542e-01  1.249996e-01  1.249996e-01 -4.545455e-02 -4.545455e-02  9.714451e-17
3 - 1  1.249996e-01  1.583330e-01  1.249996e-01  3.333333e-02  6.938894e-17 -3.333333e-02
4 - 1  1.249996e-01  1.249996e-01  1.458330e-01 -2.775558e-17  2.083333e-02  2.083333e-02
3 - 2 -4.545455e-02  3.333333e-02 -2.775558e-17  7.878788e-02  4.545455e-02 -3.333333e-02
4 - 2 -4.545455e-02  6.938894e-17  2.083333e-02  4.545455e-02  6.628788e-02  2.083333e-02
4 - 3  9.714451e-17 -3.333333e-02  2.083333e-02 -3.333333e-02  2.083333e-02  5.416667e-02

在 R 中配置 Generalized Logit Model

已知資料 data.csv 如下:

treat, r.a, r.b, r.c
   t1,  10,  20,   5
   t2,  11,  22,   5
   t3,  21,  40,  11
   t4,  21,   5,  10

其中 treat 為具有 4 個水準的固定因子; r.a, r.b 及 r.c 分別為 3 個類型的反應類別, 其下為頻率.

以下為 R code:

# 系統第一次運作請先安裝 package nnet
install.packages("nnet")

# 引入 package nnet
require(nnet)

# 讀取資料
dat <- read.csv("data.csv")

# 建立 generalized logit model
m <- multinom(
  cbind(r.a, r.b, r.c) ~ treat,          # 其中 r.a 寫在最前面即可作為比較基準
  data=dat,
  contrasts=list(treat=contr.treatment), # 其中採用 treatment contrast 方便解釋
  Hess=T                                 # 其中採用 Hessian Matrix 求標準誤
)

# 係數與標準誤
(s.m <- summary(m))

# odds ratio
exp(s.m$coefficients)

# chi-square
(chisq <- (s.m$coefficients / s.m$standard.error)^2)

# p-value
pchisq(chisq, df=1, lower.tail=F)

可得到以下結果:

> (s.m <- summary(m))
Call:
multinom(formula = cbind(r.a, r.b, r.c) ~ treat, data = dat, 
    contrasts = list(treat = contr.treatment), Hess = T)

Coefficients:
    (Intercept)        treat2      treat3      treat4
r.b   0.6931308  2.158365e-05 -0.04877868 <strong>-2.12818025</strong>
r.c  -0.6931750 -9.525513e-02  0.04656602 -0.04876199

Std. Errors:
    (Intercept)    treat2    treat3    treat4
r.b   0.3872968 0.5351291 0.4718241 0.6305649
r.c   0.5477240 0.7687048 0.6622145 0.6690451

Residual Deviance: 346.6049 
AIC: 362.6049 


> # odds ratio
> exp(s.m$coefficients)
    (Intercept)   treat2    treat3    treat4
r.b   1.9999671 1.000022 0.9523919 <strong>0.1190537</strong>
r.c   0.4999861 0.909141 1.0476672 0.9524078


> # chi-square
> (chisq <- (s.m$coefficients / s.m$standard.error)^2)
    (Intercept)       treat2      treat3       treat4
r.b    3.202894 1.626795e-09 0.010688081 <strong>11.390878935</strong>
r.c    1.601630 1.535527e-02 0.004944712  0.005311928


> # p-value
> pchisq(chisq, df=1, lower.tail=F)
    (Intercept)    treat2    treat3      treat4
r.b  0.07350811 0.9999678 0.9176589 <strong>0.000738056</strong>
r.c  0.20567234 0.9013815 0.9439400 0.941899243

可以發現, 唯一達顯著之係數為 r.b 對比 r.a 時在 treat4 對比 treat1 的情況, 其係數 $\beta = -2.12818025$, 勝算比 $= \exp(-2.12818025) = 0.1190537$, 檢定量 $\chi^2_1 = 11.390878935$, $p = 0.000738056$, 顯示 treat4 比 treat1 的 r.b / r.a 達顯著差異. 由實際資料 $\frac{5/21}{20/10} = 0.1190476$ 也和模型所預測的勝算比 0.1190537 相當接近.

G-test for 2-way contingency data in R

A R function applying G-test for 2-way contingency data in R is demonstrated. Yates’ correction and Williams’ correction are also included in the output.

R Source

Example

R code

my.2way.contingency.table <- matrix(
  c(
    5,0,3,12,6,
    4,2,7,23,11
  ),
  nrow = 2, ncol=5, byrow=T
)

g.test.result <- g.test.2way(my.2way.contingency.table)
g.test.result
str(g.test.result)

R output

> my.2way.contingency.table <- matrix(
+   c(
+     5,0,3,12,6,
+     4,2,7,23,11
+   ),
+   nrow = 2, ncol=5, byrow=T
+ )
> 
> g.test.result <- g.test.2way(my.2way.contingency.table)
> g.test.result
Observed value:
     [,1] [,2] [,3] [,4] [,5]
[1,]    5    0    3   12    6
[2,]    4    2    7   23   11
Expected value:
         [,1]      [,2]     [,3]     [,4]      [,5]
[1,] 3.205479 0.7123288 3.561644 12.46575  6.054795
[2,] 5.794521 1.2876712 6.438356 22.53425 10.945205
Observed value with Yates' correction:
     [,1] [,2] [,3] [,4] [,5]
[1,]  4.5  0.5  3.5 12.5  6.5
[2,]  4.5  1.5  6.5 22.5 10.5
Degree of freedom = 4 
G-test:
 G = 3.41127, p = 0.491497
G-test with Yates' correction:
 G = 1.28744, p = 0.863503
G-test with Williams' correction:
 G = 3.07349, q_min = 1.1099, p = 0.545603
> str(g.test.result)
List of 11
 $ data.observed      : num [1:2, 1:5] 5 4 0 2 3 7 12 23 6 11
 $ data.expected      : num [1:2, 1:5] 3.205 5.795 0.712 1.288 3.562 ...
 $ data.observed.Yates: num [1:2, 1:5] 4.5 4.5 0.5 1.5 3.5 6.5 12.5 22.5 6.5 10.5
 $ df                 : num 4
 $ q.min              : num 1.11
 $ g                  : num 3.41
 $ g.Yates            : num 1.29
 $ g.Williams         : num 3.07
 $ p                  : num 0.491
 $ p.Yates            : num 0.864
 $ p.Williams         : num 0.546
 - attr(*, "class")= chr "g.test.2way"

Convert taxa matrix to guild matrix with R

在 R 中如何把分類群資料小計成為功能群資料? 可利用以下介紹的 R function taxaToGuild() 將一分類群資料檔及一分類群對應功能群資料檔轉化出一個功能群資料框. 一般來說, 該 function 也提供一個資料框資料以系統化的方式進行欄小計.

taxaToGuild.R 原始檔

例子

假設有一蜘蛛以科為分類群的物種數資料，以 CSV 檔儲存後如下 (taxa.csv):

"Plot","Leptonetidae","Clubionidae","Others","Araneidae","Oonopidae","Ctenizidae"
1,2,0,0,1,1,2
2,1,2,3,1,0,0
3,1,0,1,1,1,1
4,2,3,2,1,0,1
5,0,1,1,2,1,4
6,2,0,0,0,1,3
7,1,0,1,2,2,1
8,0,0,1,2,3,1

其中第一列皆為變數名且第一欄為可辦視的樣點名。另有一科名對照功能群的資料檔以 CSV 檔儲存後如下 (guild.csv):

"taxa","Guild1","Guild2","Guild3"
"Oonopidae","Ground runner","runner","Ground"
"Araneidae","Orb weaver","weaver","Space"
"Clubionidae","Foliage runner","runner","Foliage"
"Ctenidae","Ground runner","runner","Ground"
"Ctenizidae","Burrow dweller","dweller","Burrow"
"Gnaphosidae","Ground runner","runner","Ground"
"Leptonetidae","Ground weaver","weaver","Ground"
"Linyphiidae","Space weaver","weaver","Space"

其中第一列皆為變數名且第一欄為可辦視的科名, 第 2 至 4 欄為三套不同的功能群系統. 在將 taxa.csv, guild.csv, taxaToGuild.r 三檔案置於相同某路徑後，於 R 環境中以 setwd("某路徑") 後進行以下動作.

source("taxaToGuild.r")
new.dataframe <- taxaToGuild(
  taxa.file.csv = "taxa.csv",
  taxa.file.csv.row.num = 1,
  guild.file.csv = "guild.csv",
  guild.file.csv.row.num = 1,
  guild.system.row.num = 1
)
new.dataframe

其中 taxa.file.csv = "taxa.csv" 表示引入 taxa.csv 為分類群資料, taxa.file.csv.row.num = 1 表示 taxa.csv 的第一欄為列名, guild.file.csv = "guild.csv" 表示引入 guild.csv 為功能群對照檔, guild.file.csv.row.num = 1 表示 guild.csv 的第一欄為列名, guild.system.row.num = 1 表示以 guild.csv 的第 1 組對照表 (實際上是此例的第 2 欄). 新產生的功能群資料框賦予變數 new.dataframe. 操作可得以下結果.

> source("taxaToGuild.r")
> new.dataframe <- taxaToGuild(
+   taxa.file.csv = "taxa.csv",
+   taxa.file.csv.row.num = 1,
+   guild.file.csv = "guild.csv",
+   guild.file.csv.row.num = 1,
+   guild.system.row.num = 1
+ )
The guild system is Guild1.
Warning: the following taxa is merged into guild NA due to no reference: Others.
> new.dataframe
  Ground weaver Foliage runner NA Orb weaver Ground runner Burrow dweller
1             2              0  0          1             1              2
2             1              2  3          1             0              0
3             1              0  1          1             1              1
4             2              3  2          1             0              1
5             0              1  1          2             1              4
6             2              0  0          0             1              3
7             1              0  1          2             2              1
8             0              0  1          2             3              1

值得注意的是, 因為 "Others" 無法藉由 "Guild1" 這項對應表成功對應至某個功能群, 所以被丟棄至 new.dataframe 中的 "NA" 欄. 在本例中若欲刪去 new.dataframe 中的 "NA" 欄, 可利用 new.dataframe[,-3].

Apan’s Notes

Pages

An R function for bootstrap confidence intervals

R code

Arguments

Examples

Paired t-test vs general mixed model with R

Dataset

Paired t-test

Mixed model for paired sample

Multiple comparisons for generalized linear model in R

An example

Defining y and x

Fitting a Poisson regression

Multiple comparisons

Read more

在 R 中配置 Generalized Logit Model

G-test for 2-way contingency data in R

R Source

Example

R code

R output

Convert taxa matrix to guild matrix with R

taxaToGuild.R 原始檔

例子