An R function for bootstrap confidence intervals

Here I introduce a R function made by myself to calculate several confidence intervals of mean/median including exact CI, basic bootstrap CI, percentile bootstrap CI & studentized bootstrap CI.

R code

# Copyright 2013 Chen-Pan Liao
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
# See http://apansharing.blogspot.tw/2013/12/an-r-function-for-bootstrap-confidence.html
# to learn how to use this function
## main function
bootCI <- function(
x,
alpha=0.05,
alternative=c("t", "l", "g"), # see arg "alternative" in help(t.test)
B=1999, # number of bootstrap samples
quantileAlgorithm = 7 # passed to quantile(type)
){
#validation
if(mode(x) != "numeric"){
stop("mode(x) must be \"numeric\"")
}
if(class(x) != "numeric" && class(x) != "integer"){
stop("class(x) must be \"numeric\" or \"integer\"")
}
if(length(x) < 2){
stop("length(x) must be larger than 1")
}
# initiation
alternative <- alternative[1]
stat.length <- length(x)
stat.mean <- mean(x)
stat.sem <- sqrt(var(x) / stat.length)
# alternative
probs <- c(alpha/2, 1-alpha/2)
if(alternative == "l") probs <- c(0, 1-alpha)
if(alternative == "g") probs <- c(alpha, 1)
# bootstrap
boot.mat <- matrix(0.0, B+1, stat.length)
boot.mat[1,] <- x
for (i in 2:(B+1)) {
boot.mat[i,] <- sample(x, replace=T)
}
# exact (traditional) CI based on t distribution
CI.exact <- stat.mean + qt(probs, stat.length - 1) * stat.sem
names(CI.exact) <- paste(probs*100, "%", sep="")
# basic bootstrap CI
CI.basic <- 2 * stat.mean - quantile(
apply(boot.mat, 1, mean), probs=1-probs, type=quantileAlgorithm
)
names(CI.basic) <- rev(names(CI.basic))
# percentile bootstrap CI
CI.percentile <- quantile(
apply(boot.mat, 1, mean), probs=probs, type=quantileAlgorithm
)
# studentized bootstrap CI
tmp.mean <- apply(boot.mat, 1, mean)
tmp.sem <- apply(
boot.mat, 1, function(.){sqrt(var(.) / length(.))}
)
tmp.t <- (tmp.mean - stat.mean) / tmp.sem
expr <- expression(
CI.studentized <-
stat.mean - quantile(
tmp.t, probs=1-probs, type=quantileAlgorithm
) * stat.sem
)
a.try <- try(eval(expr), T)
if("try-error" %in% class(a.try)) {
warning("studentized bootstrap CI cannot
be done well due to boostrap sem = 0")
}
names(CI.studentized) <- rev(names(CI.studentized))
# exceptions of one-tail
if(alternative == "g") {
CI.exact[2] = Inf
CI.basic[2] = Inf
CI.percentile[2] = Inf
CI.studentized[2] = Inf
} else if(alternative == "l") {
CI.exact[1] = -Inf
CI.basic[1] = -Inf
CI.percentile[1] = -Inf
CI.studentized[1] = -Inf
}
output <- list(
x = x,
alpha = alpha,
alternative = alternative,
B = B,
CI.exact = CI.exact,
CI.basic = CI.basic,
CI.percentile = CI.percentile,
CI.studentized = CI.studentized
)
class(output) <- "bootCI"
return(output)
}
## print function
print.bootCI <- function(w){
cat("Summary of x\n")
print(summary(w$x))
cat("CIs of mu\n")
mat <- rbind(w$CI.exact, w$CI.basic, w$CI.percentile, w$CI.studentized)
rownames(mat) <- c("$CI.exact", "$CI.basic", "$CI.percentile", "$CI.studentized")
print(mat)
}
view raw bootCI.R hosted with ❤ by GitHub

Arguments

There are five arguments in this R function bootCI():

  • x — a numeric vector specifying $x_i$
  • alpha = 0.05 — a numeric specifying $\alpha$ value
  • alternative = c("t", "l", "g") — a single character string specifying two-sided, less or greater tail
  • B = 1999 — a integer specifying number of bootstrap samples
  • quantileAlgorithm = 7 — a integer specifying the argument type passed to quantile()

Examples

We can define a numeric vector $x = [5\, 2\, 3\, 6\, 8\, 19\, 1.5]$ including $N=7$ items as num:

> num <- c(5, 2, 3, 6, 8, 19, 1.5)

After loading the above function, we can call the function to calculate CIs:

> bootCI(num)
Summary of x
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.500   2.500   5.000   6.357   7.000  19.000 
CIs of mu
                     2.5%     97.5%
$CI.exact       0.7778728 11.936413
$CI.basic       1.7125000  9.714286
$CI.percentile  3.0000000 11.001786
$CI.studentized 2.6785623 17.669412

The above results show that

exact CI=(μ^+tα2se^μ^,μ^+t1α2se^μ^)basic CI=(2μ^μ^1α2,2μ^μ^α2)percentile CI=(μ^α2,μ^1α2)studentized CI=(μ^+tα2se^μ^,μ^+t1α2se^μ^)

where $\hat{\mu}$ is mean of $x$, $t_{\frac{\alpha}{2}}$ is lower $\alpha/2$ critical value for the $t$ distribution given $\text{df} = N-1$, $\hat{se}_{\hat{\mu}}$ is the standard error of the mean of $x$, $\hat{\mu}_{\frac{\alpha}{2}}^*$ is $\alpha/2$ percentile of the mean of bootstrapped $x$, and $t^*_{\frac{\alpha}{2}} = \frac{\hat{\mu^*}-\hat{\mu}}{\hat{se}^*_{\hat{\mu^*}}}$ is t-value based on bootstrapped $x$. See

for more details.

We can assign a different $\alpha$, direction or number of bootstrap samples:

> bootCI(num, alpha=0.01)
Summary of x
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.500   2.500   5.000   6.357   7.000  19.000 
CIs of mu
                      0.5%    99.5%
$CI.exact       -2.0962642 14.81055
$CI.basic       -0.1428571 10.28571
$CI.percentile   2.4285714 12.85714
$CI.studentized  1.1394952 25.22030

> bootCI(num, alternative="g")
Summary of x
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.500   2.500   5.000   6.357   7.000  19.000 
CIs of mu
                      5% 100%
$CI.exact       1.926445  Inf
$CI.basic       2.714286  Inf
$CI.percentile  3.285714  Inf
$CI.studentized 3.332740  Inf

> bootCI(num, B=99)
Summary of x
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.500   2.500   5.000   6.357   7.000  19.000 
CIs of mu
                     2.5%     97.5%
$CI.exact       0.7778728 11.936413
$CI.basic       1.6053571  9.330357
$CI.percentile  3.3839286 11.108929
$CI.studentized 2.5532298 15.532786

The returned list may be helpful for someone:

> myCI <- bootCI(num)
> myCI$CI.percentile
    2.5%    97.5% 
 3.00000 10.85714 
> str(myCI)
List of 8
 $ x             : num [1:7] 5 2 3 6 8 19 1.5
 $ alpha         : num 0.05
 $ alternative   : chr "t"
 $ B             : num 1999
 $ CI.exact      : Named num [1:2] 0.778 11.936
  ..- attr(*, "names")= chr [1:2] "2.5%" "97.5%"
 $ CI.basic      : Named num [1:2] 1.86 9.71
  ..- attr(*, "names")= chr [1:2] "2.5%" "97.5%"
 $ CI.percentile : Named num [1:2] 3 10.9
  ..- attr(*, "names")= chr [1:2] "2.5%" "97.5%"
 $ CI.studentized: Named num [1:2] 2.65 17.73
  ..- attr(*, "names")= chr [1:2] "2.5%" "97.5%"
 - attr(*, "class")= chr "bootCI"

Paired t-test vs general mixed model with R

Dataset

Consider the following dataset:

> x1 <- c(3,4,3,5,6,6,6,7,7,6,8,9,11)
> x2 <- c(3,4,5,5,8,6,8,7,9,8,8,9,14)
> cbind(x1,x2)
      x1 x2
 [1,]  3  3
 [2,]  4  4
 [3,]  3  5
 [4,]  5  5
 [5,]  6  8
 [6,]  6  6
 [7,]  6  8
 [8,]  7  7
 [9,]  7  9
[10,]  6  8
[11,]  8  8
[12,]  9  9
[13,] 11 14

where x1 and x2 are paired sample.

Paired t-test

The process of a typical paired t-test with R is following:

> t.test(x1, x2, paired=T)

 Paired t-test

data:  x1 and x2
t = -3.1225, df = 12, p-value = 0.008814
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.6977786 -0.3022214
sample estimates:
mean of the differences 
                     -1

The results show that 1) average of $x_1 - x_2$ equals to -1; 2) $t_{12} = -3.1225$, $p = 0.008814$.

Mixed model for paired sample

We can build a mixed model for this paired-sample example. Consider the following dataset:

> y <- c(x1,x2)
> trt <- gl(2,13,26)
> subject <- gl(13,1,26)
> cbind(y, trt, subject)
       y trt subject
 [1,]  3   1       1
 [2,]  4   1       2
 [3,]  3   1       3
 [4,]  5   1       4
 [5,]  6   1       5
 [6,]  6   1       6
 [7,]  6   1       7
 [8,]  7   1       8
 [9,]  7   1       9
[10,]  6   1      10
[11,]  8   1      11
[12,]  9   1      12
[13,] 11   1      13
[14,]  3   2       1
[15,]  4   2       2
[16,]  5   2       3
[17,]  5   2       4
[18,]  8   2       5
[19,]  6   2       6
[20,]  8   2       7
[21,]  7   2       8
[22,]  9   2       9
[23,]  8   2      10
[24,]  8   2      11
[25,]  9   2      12
[26,] 14   2      13

where y is dependent variable, trt is a two-level fixed factor, and subject is experimental units considered as blocks (random factor). A mixed model can fit this dataset:

yij=τi+bj+εij

where $i=\{1,2\}$ showing two-level in fixed factor $\tau$, $j=\{1,2,\ldots , 13\}$ showing 13 subjects as a random factor $b$.

We can find the solution of this mixed model by using lme4:lmer():

> require(lme4)
> mod <- lmer(y ~ trt + (1|subject) )
> summary(mod)
Linear mixed model fit by REML ['lmerMod']
Formula: y ~ trt + (1 | subject) 

REML criterion at convergence: 98.5708 

Random effects:
 Groups   Name        Variance Std.Dev.
 subject  (Intercept) 5.8590   2.4205  
 Residual             0.6667   0.8165  
Number of obs: 26, groups: subject, 13

Fixed effects:
            Estimate Std. Error t value
(Intercept)   6.2308     0.7085   8.794
trt2          1.0000     0.3203   3.123

Correlation of Fixed Effects:
     (Intr)
trt2 -0.226

As you can see, the estimate of trt2 equals 1 showing the average of $x_2 - x_1$; the statistic $t=3.123$ is equivalent to the t-value I get in paired t-test.

We can find the p-value by using car:Anova():

> require(car)
> Anova(mod, type=3, test.statistic="F")
Analysis of Deviance Table (Type III Wald F tests with Kenward-Roger df)

Response: y
                F Df Df.res    Pr(>F)    
(Intercept) 77.34  1 13.288 6.636e-07 ***
trt          9.75  1 12.000  0.008814 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

and then we can find that $F_{1,12} = 9.75 = 3.123^2$ and $p=0.008814$. These outcomes are all equivalent to what I got in paired t-test.

Multiple comparisons for generalized linear model in R

An example

Defining y and x
> y <- c (rpois(10,1), rpois(10,2), rpois(10,3), rpois(10,5))
> y
 [1] 0 1 0 2 1 2 0 1 1 0 3 2 2 4 2 3 3 1 1 1 3 0 4 4 2 3 1 4 3 6 3 8 6 3 2 6 5 5 2 8
> x <- gl(4,10)
> x
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4
Levels: 1 2 3 4
Fitting a Poisson regression
> f <- glm(y~x, family=poisson)
> summary(f)

Call:
glm(formula = y ~ x, family = poisson)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.44949  -0.90724   0.04533   0.52699   1.52242  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.2231     0.3536  -0.631 0.527945    
x2            1.0116     0.4129   2.450 0.014277 *  
x3            1.3218     0.3979   3.322 0.000895 ***
x4            1.7918     0.3819   4.692 2.71e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 67.967  on 39  degrees of freedom
Residual deviance: 34.884  on 36  degrees of freedom
AIC: 142.6

Number of Fisher Scoring iterations: 5
Multiple comparisons
> # install.packages("multcomp")
> require(multcomp)
> f.mc <- glht(f, linfct = mcp(x = "Tukey"))
> summary(f.mc)

  Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: glm(formula = y ~ x, family = poisson)

Linear Hypotheses:
           Estimate Std. Error z value Pr(>|z|)    
2 - 1 == 0   1.0116     0.4129   2.450  0.06433 .  
3 - 1 == 0   1.3218     0.3979   3.322  0.00481 ** 
4 - 1 == 0   1.7918     0.3819   4.692  < 0.001 ***
3 - 2 == 0   0.3102     0.2807   1.105  0.67729    
4 - 2 == 0   0.7802     0.2575   3.030  0.01198 *  
4 - 3 == 0   0.4700     0.2327   2.019  0.17303    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)

> confint(f.mc)

  Simultaneous Confidence Intervals

Multiple Comparisons of Means: Tukey Contrasts


Fit: glm(formula = y ~ x, family = poisson)

Quantile = 2.5471
95% family-wise confidence level
 

Linear Hypotheses:
           Estimate lwr      upr     
2 - 1 == 0  1.01160 -0.04002  2.06322
3 - 1 == 0  1.32176  0.30822  2.33529
4 - 1 == 0  1.79176  0.81905  2.76447
3 - 2 == 0  0.31015 -0.40481  1.02512
4 - 2 == 0  0.78016  0.12436  1.43596
4 - 3 == 0  0.47000 -0.12281  1.06282

> coef(f.mc)
    2 - 1     3 - 1     4 - 1     3 - 2     4 - 2     4 - 3 
1.0116009 1.3217558 1.7917595 0.3101549 0.7801586 0.4700036 
> vcov(f.mc)
              2 - 1         3 - 1         4 - 1         3 - 2         4 - 2         4 - 3
2 - 1  1.704542e-01  1.249996e-01  1.249996e-01 -4.545455e-02 -4.545455e-02  9.714451e-17
3 - 1  1.249996e-01  1.583330e-01  1.249996e-01  3.333333e-02  6.938894e-17 -3.333333e-02
4 - 1  1.249996e-01  1.249996e-01  1.458330e-01 -2.775558e-17  2.083333e-02  2.083333e-02
3 - 2 -4.545455e-02  3.333333e-02 -2.775558e-17  7.878788e-02  4.545455e-02 -3.333333e-02
4 - 2 -4.545455e-02  6.938894e-17  2.083333e-02  4.545455e-02  6.628788e-02  2.083333e-02
4 - 3  9.714451e-17 -3.333333e-02  2.083333e-02 -3.333333e-02  2.083333e-02  5.416667e-02

Read more

在 R 中配置 Generalized Logit Model

已知資料 data.csv 如下:

treat, r.a, r.b, r.c
   t1,  10,  20,   5
   t2,  11,  22,   5
   t3,  21,  40,  11
   t4,  21,   5,  10

其中 treat 為具有 4 個水準的固定因子; r.a, r.br.c 分別為 3 個類型的反應類別, 其下為頻率.

以下為 R code:

# 系統第一次運作請先安裝 package nnet
install.packages("nnet")

# 引入 package nnet
require(nnet)

# 讀取資料
dat <- read.csv("data.csv")

# 建立 generalized logit model
m <- multinom(
  cbind(r.a, r.b, r.c) ~ treat,          # 其中 r.a 寫在最前面即可作為比較基準
  data=dat,
  contrasts=list(treat=contr.treatment), # 其中採用 treatment contrast 方便解釋
  Hess=T                                 # 其中採用 Hessian Matrix 求標準誤
)

# 係數與標準誤
(s.m <- summary(m))

# odds ratio
exp(s.m$coefficients)

# chi-square
(chisq <- (s.m$coefficients / s.m$standard.error)^2)

# p-value
pchisq(chisq, df=1, lower.tail=F)

可得到以下結果:

> (s.m <- summary(m))
Call:
multinom(formula = cbind(r.a, r.b, r.c) ~ treat, data = dat, 
    contrasts = list(treat = contr.treatment), Hess = T)

Coefficients:
    (Intercept)        treat2      treat3      treat4
r.b   0.6931308  2.158365e-05 -0.04877868 <strong>-2.12818025</strong>
r.c  -0.6931750 -9.525513e-02  0.04656602 -0.04876199

Std. Errors:
    (Intercept)    treat2    treat3    treat4
r.b   0.3872968 0.5351291 0.4718241 0.6305649
r.c   0.5477240 0.7687048 0.6622145 0.6690451

Residual Deviance: 346.6049 
AIC: 362.6049 


> # odds ratio
> exp(s.m$coefficients)
    (Intercept)   treat2    treat3    treat4
r.b   1.9999671 1.000022 0.9523919 <strong>0.1190537</strong>
r.c   0.4999861 0.909141 1.0476672 0.9524078


> # chi-square
> (chisq <- (s.m$coefficients / s.m$standard.error)^2)
    (Intercept)       treat2      treat3       treat4
r.b    3.202894 1.626795e-09 0.010688081 <strong>11.390878935</strong>
r.c    1.601630 1.535527e-02 0.004944712  0.005311928


> # p-value
> pchisq(chisq, df=1, lower.tail=F)
    (Intercept)    treat2    treat3      treat4
r.b  0.07350811 0.9999678 0.9176589 <strong>0.000738056</strong>
r.c  0.20567234 0.9013815 0.9439400 0.941899243

可以發現, 唯一達顯著之係數為 r.b 對比 r.a 時在 treat4 對比 treat1 的情況, 其係數 $\beta = -2.12818025$, 勝算比 $= \exp(-2.12818025) = 0.1190537$, 檢定量 $\chi^2_1 = 11.390878935$, $p = 0.000738056$, 顯示 treat4treat1r.b / r.a 達顯著差異. 由實際資料 $\frac{5/21}{20/10} = 0.1190476$ 也和模型所預測的勝算比 0.1190537 相當接近.

G-test for 2-way contingency data in R

A R function applying G-test for 2-way contingency data in R is demonstrated. Yates’ correction and Williams’ correction are also included in the output.

R Source

# Copyright 2013 Chen-Pan Liao
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 2 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
g.test.2way <- function(
x # must be a 2-way xtabs/matrix
)
{
## initiating
data.o <- x
sum.total <- sum(x)
sum.row <- rowSums(x)
sum.col <- colSums(x)
n.row <- dim(x)[1]
n.col <- dim(x)[2]
chisq.df <- (n.col - 1) * (n.row - 1)
## check
if(n.row == 1) {stop("Numbers of row must be larger than 1")}
if(n.col == 1) {stop("Numbers of column must be larger than 1")}
if(mode(data.o) != "numeric") {stop("Mode of x must be numeric")}
if(
sum(class(data.o) != c("matrix", "xtabs")) == 0
) {
stop("Class of x must be matrix or xtabs")
}
## matrix of expected value
data.e <- matrix(NA, ncol = n.col, nrow = n.row)
for (m in 1:n.row){
for (n in 1:n.col){
data.e[m,n] <- sum.col[n] * sum.row[m] / sum.total
}
}
## matrix of ovserved value with Yates' correction
data.o.y <- matrix(NA, ncol = n.col, nrow = n.row)
for (m in 1:n.row){
for (n in 1:n.col){
if (data.e[m,n] - data.o[m,n] > 0){
data.o.y[m,n] <- data.o[m,n] + 0.5
} else if (data.e[m,n] - data.o[m,n] < 0){
data.o.y[m,n] <- data.o[m,n] - 0.5
} else {
data.o.y[m,n] <- data.o[m,n]
}
}
}
## q_min for Williams' correction
q.min <- 1 +
(sum.total * sum(1 / sum.col) - 1) *
(sum.total * sum(1 / sum.row) - 1) /
(6 * (chisq.df) * sum.total)
## G-value
isnot.na.index <- !is.na(data.o * log(data.o / data.e))
g <- 2 * sum((data.o * log(data.o / data.e))[isnot.na.index])
isnot.na.index.y <- !is.na(data.o.y * log(data.o.y / data.e))
g.y <- 2 * sum((data.o.y * log(data.o.y / data.e))[isnot.na.index])
g.w <- g / q.min
## P-value
p.g <- 1 - pchisq(g, chisq.df)
p.g.y <- 1 - pchisq(g.y, chisq.df)
p.g.w <- 1 - pchisq(g.w, chisq.df)
## return
z <- list(
data.observed = data.o,
data.expected = data.e,
data.observed.Yates = data.o.y,
df = chisq.df,
q.min = q.min,
g = g,
g.Yates = g.y,
g.Williams = g.w,
p = p.g,
p.Yates = p.g.y,
p.Williams = p.g.w
)
class(z) <- "g.test.2way"
return(z)
}
## print function
print.g.test.2way <- function(x){
cat("Observed value:\n")
print(x$data.observed)
cat("Expected value:\n")
print(x$data.expected)
cat("Observed value with Yates' correction:\n")
print(x$data.observed.Yates)
cat("Degree of freedom =", x$df, "\n")
cat(sprintf("G-test:\n\tG = %g,\tp = %g\n", x$g, x$p))
cat(sprintf(
"G-test with Yates' correction:\n\tG = %g,\tp = %g\n",
x$g.Yates, x$p.Yates
))
cat(sprintf(
"G-test with Williams' correction:\n\tG = %g,\tq_min = %g,\tp = %g\n",
x$g.Williams, x$q.min, x$p.Williams
))
}
view raw g.test.2way.R hosted with ❤ by GitHub

Example

R code
my.2way.contingency.table <- matrix(
  c(
    5,0,3,12,6,
    4,2,7,23,11
  ),
  nrow = 2, ncol=5, byrow=T
)

g.test.result <- g.test.2way(my.2way.contingency.table)
g.test.result
str(g.test.result)
R output
> my.2way.contingency.table <- matrix(
+   c(
+     5,0,3,12,6,
+     4,2,7,23,11
+   ),
+   nrow = 2, ncol=5, byrow=T
+ )
> 
> g.test.result <- g.test.2way(my.2way.contingency.table)
> g.test.result
Observed value:
     [,1] [,2] [,3] [,4] [,5]
[1,]    5    0    3   12    6
[2,]    4    2    7   23   11
Expected value:
         [,1]      [,2]     [,3]     [,4]      [,5]
[1,] 3.205479 0.7123288 3.561644 12.46575  6.054795
[2,] 5.794521 1.2876712 6.438356 22.53425 10.945205
Observed value with Yates' correction:
     [,1] [,2] [,3] [,4] [,5]
[1,]  4.5  0.5  3.5 12.5  6.5
[2,]  4.5  1.5  6.5 22.5 10.5
Degree of freedom = 4 
G-test:
 G = 3.41127, p = 0.491497
G-test with Yates' correction:
 G = 1.28744, p = 0.863503
G-test with Williams' correction:
 G = 3.07349, q_min = 1.1099, p = 0.545603
> str(g.test.result)
List of 11
 $ data.observed      : num [1:2, 1:5] 5 4 0 2 3 7 12 23 6 11
 $ data.expected      : num [1:2, 1:5] 3.205 5.795 0.712 1.288 3.562 ...
 $ data.observed.Yates: num [1:2, 1:5] 4.5 4.5 0.5 1.5 3.5 6.5 12.5 22.5 6.5 10.5
 $ df                 : num 4
 $ q.min              : num 1.11
 $ g                  : num 3.41
 $ g.Yates            : num 1.29
 $ g.Williams         : num 3.07
 $ p                  : num 0.491
 $ p.Yates            : num 0.864
 $ p.Williams         : num 0.546
 - attr(*, "class")= chr "g.test.2way"

Convert taxa matrix to guild matrix with R

在 R 中如何把分類群資料小計成為功能群資料? 可利用以下介紹的 R function taxaToGuild() 將一分類群資料檔及一分類群對應功能群資料檔轉化出一個功能群資料框. 一般來說, 該 function 也提供一個資料框資料以系統化的方式進行欄小計.

taxaToGuild.R 原始檔

# Copyright 2013 Chen-Pan Liao
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
############################################
## taxaToGuild.r
## Author: Chen-Pan Liao (2013)
## License: Public domain
## Environment: R (ver. 2.15+)
## Usage:
## new.dataframe <- taxaToGuild(
## taxa.file.csv,
## taxa.file.csv.row.num,
## guild.file.csv,
## guild.file.csv.row.num,
## guild.system.row.num
## )
############################################
taxaToGuild <- function (
taxa.file.csv,
taxa.file.csv.row.num,
guild.file.csv,
guild.file.csv.row.num,
guild.system.row.num
# taxa.file.csv = "taxa.csv",
# taxa.file.csv.row.num = 1,
# guild.file.csv = "guild.csv",
# guild.file.csv.row.num = 1,
# guild.system.row.num = 1
){
guild.data <- read.csv(guild.file.csv, row.names = guild.file.csv.row.num)
taxa.data <- read.csv("taxa.csv", row.names = taxa.file.csv.row.num)
guild.list.taxa <- row.names(guild.data)
guild.list.guild <- guild.data[, guild.system.row.num]
taxa.list <- names(taxa.data)
taxa.list.ind <- numeric(length(taxa.list))
taxa.unknown <- 0
for(n in 1:length(taxa.list)){
if (sum(taxa.list[n] == guild.list.taxa) > 0) {
taxa.list.ind[n] <- which(taxa.list[n] == guild.list.taxa)
} else {
taxa.list.ind[n] <- NA
taxa.unknown[length(taxa.unknown) + 1] <- n
}
}
new.guild.list <- unique(guild.list.guild[taxa.list.ind])
taxa.unknown <- taxa.unknown[-1]
#new.taxa.data <- taxa.data
#names(new.taxa.data) <- guild.list.guild[taxa.list.ind]
## new taxa matrix initiation
newer.taxa.data <- data.frame (
matrix(
0,
nrow = nrow(taxa.data),
ncol = length(new.guild.list)
),
row.names = row.names(taxa.data)
)
names(newer.taxa.data) <- new.guild.list
## subsum of new taxa matrix
for(n in 1:length(new.guild.list)){
if (!is.na(new.guild.list[n])) {
newer.taxa.data[,n] <- rowSums(
data.frame(
taxa.data[
,
which(new.guild.list[n] == guild.list.guild[taxa.list.ind])
]
)
)
} else {
newer.taxa.data[,n] <- rowSums(
data.frame(
taxa.data[ , taxa.unknown]
)
)
}
}
cat(
sprintf(
"The guild system is %s.\n",
names(guild.data)[guild.system.row.num]
)
)
cat(
sprintf(
"Warning: the following taxa is merged into guild NA due to no reference: %s.\n",
taxa.list[taxa.unknown]
)
)
return(newer.taxa.data)
}
view raw taxaToGuild.R hosted with ❤ by GitHub

例子

假設有一蜘蛛以科為分類群的物種數資料,以 CSV 檔儲存後如下 (taxa.csv):

"Plot","Leptonetidae","Clubionidae","Others","Araneidae","Oonopidae","Ctenizidae"
1,2,0,0,1,1,2
2,1,2,3,1,0,0
3,1,0,1,1,1,1
4,2,3,2,1,0,1
5,0,1,1,2,1,4
6,2,0,0,0,1,3
7,1,0,1,2,2,1
8,0,0,1,2,3,1

其中第一列皆為變數名且第一欄為可辦視的樣點名。另有一科名對照功能群的資料檔以 CSV 檔儲存後如下 (guild.csv):

"taxa","Guild1","Guild2","Guild3"
"Oonopidae","Ground runner","runner","Ground"
"Araneidae","Orb weaver","weaver","Space"
"Clubionidae","Foliage runner","runner","Foliage"
"Ctenidae","Ground runner","runner","Ground"
"Ctenizidae","Burrow dweller","dweller","Burrow"
"Gnaphosidae","Ground runner","runner","Ground"
"Leptonetidae","Ground weaver","weaver","Ground"
"Linyphiidae","Space weaver","weaver","Space"

其中第一列皆為變數名且第一欄為可辦視的科名, 第 2 至 4 欄為三套不同的功能群系統. 在將 taxa.csv, guild.csv, taxaToGuild.r 三檔案置於相同某路徑後,於 R 環境中以 setwd("某路徑") 後進行以下動作.

source("taxaToGuild.r")
new.dataframe <- taxaToGuild(
  taxa.file.csv = "taxa.csv",
  taxa.file.csv.row.num = 1,
  guild.file.csv = "guild.csv",
  guild.file.csv.row.num = 1,
  guild.system.row.num = 1
)
new.dataframe

其中 taxa.file.csv = "taxa.csv" 表示引入 taxa.csv 為分類群資料, taxa.file.csv.row.num = 1 表示 taxa.csv 的第一欄為列名, guild.file.csv = "guild.csv" 表示引入 guild.csv 為功能群對照檔, guild.file.csv.row.num = 1 表示 guild.csv 的第一欄為列名, guild.system.row.num = 1 表示以 guild.csv 的第 1 組對照表 (實際上是此例的第 2 欄). 新產生的功能群資料框賦予變數 new.dataframe. 操作可得以下結果.

> source("taxaToGuild.r")
> new.dataframe <- taxaToGuild(
+   taxa.file.csv = "taxa.csv",
+   taxa.file.csv.row.num = 1,
+   guild.file.csv = "guild.csv",
+   guild.file.csv.row.num = 1,
+   guild.system.row.num = 1
+ )
The guild system is Guild1.
Warning: the following taxa is merged into guild NA due to no reference: Others.
> new.dataframe
  Ground weaver Foliage runner NA Orb weaver Ground runner Burrow dweller
1             2              0  0          1             1              2
2             1              2  3          1             0              0
3             1              0  1          1             1              1
4             2              3  2          1             0              1
5             0              1  1          2             1              4
6             2              0  0          0             1              3
7             1              0  1          2             2              1
8             0              0  1          2             3              1

值得注意的是, 因為 "Others" 無法藉由 "Guild1" 這項對應表成功對應至某個功能群, 所以被丟棄至 new.dataframe 中的 "NA" 欄. 在本例中若欲刪去 new.dataframe 中的 "NA" 欄, 可利用 new.dataframe[,-3].