matrix(data=NA, nrow=3, ncol=5)
can create/initiate a 3 × 5 NA matrix on the fly. Of course, matrix(data=0, nrow=3, ncol=5)
may be useful in case.
matrix(data=NA, nrow=3, ncol=5)
can create/initiate a 3 × 5 NA matrix on the fly. Of course, matrix(data=0, nrow=3, ncol=5)
may be useful in case.
最近把東海大學生命科學系大學部的生物統計學實驗講義中的 SAS code 全改寫成 R code。內容是如何使用 R 計算常見的統計檢定,包括例題及 R code。該文件以 Creative Commons Attribution-ShareAlike 3.0 Unported License 授權。
View Statistical computing by using R on Scribd
After an analysis of (generalized) linear regression, a Pearson/Deviance goodness-of-fit test is useful to test if the model fitted reasonably. I wrote a simple function in R to do it.
# Copyright 2011 Chen-Pan Liao# This program is free software: you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation, either version 3 of the License, or # any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program. If not, see <http://www.gnu.org/licenses/>. # Usage example: # model.name <- glm (...) # lm.fit.test(model.name) lm.fit.test <- function (my.model) { df <- my.model$df.residual; chisq.pearson <- sum(resid(my.model , type="pearson")^2); chisq.deviance <- sum(resid(my.model , type="deviance")^2); p.pearson <- pchisq( chisq.pearson , df , lower.tail=F ); p.deviance <- pchisq( chisq.deviance , df , lower.tail=F ); ratio.pearson <- chisq.pearson / df; ratio.deviance <- chisq.deviance / df; cat( "Pearson chisq = " , chisq.pearson , ", df = " , df , ", p = " , p.pearson , ", chisq / df = " , chisq.pearson/df , ".\n" , "Deviance chisq = " , chisq.deviance , ", df = " , df , ", p = " , p.deviance , ", chisq / df = " , chisq.deviance/df , ".\n" ); }
After defining a glm() model as a custom name, copy the source code into R interpreter and then input lm.fit.test(your.model.name)
where your.model.name
is the custom glm model name. See the fallowing example
y <- c(rpois(20,1) , rpois(20,2)); x <- gl(2 , 20); my.model <- glm(y~x , family=poisson); summary(my.model); lm.fit.test(my.model);
Mr. Shigenobu AOKI displayed a quick R function for both of Tukey and Games-Howell post-hoc comparisons after an ANOVA. The Games-Howell post-hoc is famous for non-equal variances between treatments. Read more on the webpage about Games-Howell post-hoc in R.
\[ \log ( \text{odd} ) = \text{logit} (p) = \log ( \frac{\Pr (y=1) }{1 - \Pr (y=1)} ) = \beta_0 + \sum_{i=1}^{n} \beta_i x_i \] \[ \text{odd} = \frac{\Pr (y=1) }{1 - \Pr (y=1)} = \exp ( \beta_0 + \sum_{i=1}^{n} \beta_i x_i ) \] \[ \Pr (y=1) = \frac{\text{odd}}{1+\text{odd}} = \frac{1}{1+\text{odd}^{-1}} = \frac{1}{1+\exp (-( \beta_0 + \sum_{i=1}^{n}\beta_i x_i)) } \] \[ \log (\text{odds ratio}) = \log \left( \frac{\text{odd}_1}{\text{odd}_2} \right) = \log(\text{odd}_1) - \log(\text{odd}_2) \]
如何解讀邏輯斯迴歸 (logistic regression) 的迴歸式? 因為我常搞錯邏輯斯迴歸結果與勝算或勝算比的關係, 特書此筆記. 本文最主要的內容是唸過 “FAQ: How do I interpret odds ratios in logistic regression?” 一文後的筆記。
請注意 “勝算” 與 “勝算比” 是兩回事.
在模型 \[ \text{logit}(p) = \log(\frac{1}{1-p}) = \beta_0,\] 其中 $p$ 為依變數 $y = 1$ 之機率, 而 $y$ 為布林資料 (0 或 1) 以表示肯定或否定. 此時, $\beta_0$ 即為 \[ \log(\frac{\Pr (y = 1)}{\Pr (y = 0)}).\] 說成白話文, $y = 1$ 的機率除以 $y = 0$ 的機率為 $\exp(\beta_0)$.
我以下列資料舉例. 抽樣的 10 名學生中, 有 6 人有手機而 4 人沒有. 計算 “有手機” 對 “無手機” 的勝算為 $\frac{6/10}{4/10} = 1.5$. 在迴歸式中可得 $\text{logit}(p) = \log( \frac{1}{1-p}) = 0.4055$, 可驗證 $\exp(0.4055) = 1.500052$ (因為 $y = 1$ 為有手機而 $y = 0$ 為無手機). 以下為驗證之 R code:
>isPhone<-c(rep(1,6),rep(0,4))
>m<-glm(isPhone~1,family=binomial);summary(m)
Call: glm(formula = isPhone ~ 1, family = binomial) Deviance Residuals: Min 1Q Median 3Q Max -1.354 -1.354 1.011 1.011 1.011 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 0.4055 0.6455 0.628 0.53 (Dispersion parameter for binomial family taken to be 1) Null deviance: 13.46 on 9 degrees of freedom Residual deviance: 13.46 on 9 degrees of freedom AIC: 15.46 Number of Fisher Scoring iterations: 4 > exp(0.4055) [1] 1.500052
在模型 \[ \text{logit}(p) = \beta_0 + \beta_1 x_1, \] 其中 $p$ 為依變數 $y = 1$ 之機率而 $y$ 為布林資料 (0 或 1) 以表示肯定或否定, $x_1$ 為二元類別資料並轉換為布林資料 (0 或 1). 此時, $\beta_0$ 即為 $x = 0$ 時 $\log ( \frac{\Pr (y = 1)}{\Pr (y = 0)})$, 而 $\beta_1$ 為 $x_1 = 1$ 對 $x_1 = 0$ 在 $y$ 是否為 1 的勝算比. 說成白話文, $x_1 = 0$ 時 $y = 1$ 的機率除以 $y = 0$ 的機率為 $\exp( \beta_0)$; $x_1 = 1$ 對 $x_1 = 0$ 的勝算比為 $\beta_1$.
我以下列資料舉例. 抽樣的 10 名學生中, 有 6 名男生 4 名女生, 且每人分別擁有手機或沒有, 資料如下:
有無手機 | 男生 (x1 = 1) | 女生 (x1 = 0) |
---|---|---|
有 (y = 1) | 5 | 1 |
無 (y = 0) | 1 | 3 |
小計 | 6 | 4 |
“男生有手機” 對 “男生無手機” 的勝算為 $\frac{5/6}{1/6} = 5$; “女生有手機” 對 “女生無手機” 的勝算為 $\frac{1/4}{3/4} = 0.3333$; 二者的勝算比為 $5 / 0.3333 = 15$.
在男生 = 1, 女生 = 0, 有手機 = 1, 無手機 = 0 的設定下, 迴歸式中可得 $\text{logit}(p) = -1.099 + 2.708(\text{性別})$, 可驗證 $\exp (-1.099) = 0.3332041$ 為 “女生有手機” 對 “女生無手機” 的勝算 (因為女生的編碼為 $y = 0$); $\exp (2.708) = 14.99925$ 為性別與有無手機的勝算比, 可解釋作 “男生有手機之勝算為女生的 15 倍”. 以下為驗證之 R code:
> isMale<-c(1,1,1,1,1,1,0,0,0,0) > isPhone<-as.factor(c(1,1,1,1,1,0,0,0,0,1)) > m<-glm(isPhone~isMale,family=binomial);summary(m) Call: glm(formula = isMale ~ isPhone, family = binomial) Deviance Residuals: Min 1Q Median 3Q Max -1.8930 -0.7585 0.6039 0.6039 1.6651 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.099 1.155 -0.951 0.3414 isMale 2.708 1.592 1.701 0.0889 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 13.4602 on 9 degrees of freedom Residual deviance: 9.9054 on 8 degrees of freedom AIC: 13.905 Number of Fisher Scoring iterations: 4 > exp(-1.099) [1] 0.3332041 > exp(2.708) [1] 14.99925
在模型 \[\text{logit}(p) = \beta_0 + \beta_0 x_1,\] 其中 $p$ 為依變數 $y = 1$ 之機率而 $y$ 為布林資料 (0 或 1) 以表示肯定或否定, $x_1$ 為連續型變數. 此時, $\beta_0$ 即為 $x = 0$ 時 $\log (y = 1 \text{對} y = 0 \text{的勝算})$, 而 $\beta_1$ 為 $x_1$ 對勝算的貢獻量. 說成白話文, $x_1=0$ 時 $y = 1$ 的機率除以 $y = 0$ 的機率為 $\exp(\beta_0)$; 若 $x_1 \neq 0$ 時 $y = 1$ 的機率除以 $y = 0$ 的機率為 $\exp (\beta_0 + \beta_1 x_1)$, 而 $\beta_1$ 為正數或負數則影響了勝算的正向或負向影響。
我以下列資料舉例. 抽樣的 10 名學生中, 每人有不同每日零用錢及是否擁有手機, 資料如下:
有無手機 (y, 1 為有, 0 為無) | 每日零用錢 (x1) |
---|---|
1 | 100 |
1 | 120 |
1 | 90 |
1 | 110 |
1 | 110 |
0 | 80 |
0 | 100 |
0 | 90 |
0 | 70 |
0 | 110 |
上述資料可得回歸式 $\text{logit}(p) = -9.71278 + 0.09840(\text{每日零用錢})$. 由迴歸式可觀察出, 每日零用錢每增加 1 單位則對有手機之勝算增加 $\text{exp}(0.09840) = 1.103404$, 亦表示有無手機在多拿 1 單位零用錢與沒有者的勝算比是 $\exp(\beta_1 \times 1) = 1.103404$,多拿 2 單位則為 $\exp(\beta_1 \times 2)$, 以此類推.
我們可以估計每日零用錢為 70 時有手機對無手機的勝算: $\exp (-9.71278 + 0.09840 \times 70) = 0.05932171$, 可解釋為每日零用錢為 70 的學生中有手機的機率為沒有手機的 0.05932171 倍. 要估算每日零用錢為 75 的勝算亦如法炮製: $\exp( -9.71278 + 0.09840 \times 75) = 0.09702564$. 因此, 二者的勝算比為 $0.09702564 / 0.05932171 = 1.635584$, 也就是 $\exp (0.09840 \times 5)$.
由於$\text{勝算} = p / (1 - p)$, 故可由勝算回推 $p$, 公式為 $p = \text{勝算} / (1 + \text{勝算})$. 前面有談到零用錢為 70 時有手機對無手機的勝算為 0.05932171, 可推得零用錢為 70 時有手機的機率為 $0.05932171 / (1 + 0.05932171) = 0.05599971$. 同理, 每日零用錢為 75 時有手機的機率為 $0.09702564 / (1 + 0.09702564) = 0.08844428$.
> isPhone<-c(1,1,1,1,1,0,0,0,0,0) > money<-c(100, 120, 90, 110, 110, 80, 100, 90, 70, 110) > m<-glm(isPhone~money,family=binomial);summary(m) Call: glm(formula = isPhone ~ money, family = binomial) Deviance Residuals: Min 1Q Median 3Q Max -1.67078 -0.76661 0.07113 0.75438 1.55604 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -9.71278 6.59477 -1.473 0.141 money 0.09840 0.06574 1.497 0.134 (Dispersion parameter for binomial family taken to be 1) Null deviance: 13.863 on 9 degrees of freedom Residual deviance: 10.481 on 8 degrees of freedom AIC: 14.481 Number of Fisher Scoring iterations: 4
How to apply LSD (as a kind of multiple comparisons after ANOVA) in R?
A dataset for typical one-way ANOVA is given:
y <- rnorm(40) x <- gl(4,10) m <- aov(y~x)
Applying LSD by using function pairwise.t.test
:
pairwise.t.test(y,x,p.adj="none")
Applying LSD by using function LSD.test
from package agricolae
:
library(agricolae) LSD.test(m,"x",group=T) # with grouping LSD.test(m,"x",group=F) # without grouping
How to specify a specific level as a reference level in GNU R while applying (generalized) linear regression?
A dataframe dt
is given:
y<-rpois(40,1) x1<-gl(4,10) x2<-gl(4,1,40) dt <- data.frame(cbind(y,x1,x2))
Applying poisson regression:
m1<-glm(y~factor(x1)+factor(x2),data=dt) summary(m1)
The output shows x1=1 and x2=1 are both reference level defined by R automatically. To switch the reference level from x1=1 to x1=2, the function relevel
could be used:
m2<-glm(y~relevel(factor(dt$x1),"2")+factor(x2),data=dt) summary(m2)