## Pages

### Create a NA matrix in R

matrix(data=NA, nrow=3, ncol=5) can create/initiate a 3 × 5 NA matrix on the fly. Of course, matrix(data=0, nrow=3, ncol=5) may be useful in case.

### A function of Pearson/Deviance goodness-of-fit for (generalized) linear regression

After an analysis of (generalized) linear regression, a Pearson/Deviance goodness-of-fit test is useful to test if the model fitted reasonably. I wrote a simple function in R to do it.

#### Source Code

#    Copyright 2011 Chen-Pan Liao
#    This program is free software: you can redistribute it and/or modify
#    the Free Software Foundation, either version 3 of the License, or
#    any later version.
#
#    This program is distributed in the hope that it will be useful,
#    but WITHOUT ANY WARRANTY; without even the implied warranty of
#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#    GNU General Public License for more details.
#
#    You should have received a copy of the GNU General Public License
#    along with this program.  If not, see <http://www.gnu.org/licenses/>.

# Usage example:
#   model.name <- glm (...)
#   lm.fit.test(model.name)

lm.fit.test <- function (my.model) {
df <- my.model$df.residual; chisq.pearson <- sum(resid(my.model , type="pearson")^2); chisq.deviance <- sum(resid(my.model , type="deviance")^2); p.pearson <- pchisq( chisq.pearson , df , lower.tail=F ); p.deviance <- pchisq( chisq.deviance , df , lower.tail=F ); ratio.pearson <- chisq.pearson / df; ratio.deviance <- chisq.deviance / df; cat( "Pearson chisq = " , chisq.pearson , ", df = " , df , ", p = " , p.pearson , ", chisq / df = " , chisq.pearson/df , ".\n" , "Deviance chisq = " , chisq.deviance , ", df = " , df , ", p = " , p.deviance , ", chisq / df = " , chisq.deviance/df , ".\n" ); } #### Usage and example After defining a glm() model as a custom name, copy the source code into R interpreter and then input lm.fit.test(your.model.name) where your.model.name is the custom glm model name. See the fallowing example y <- c(rpois(20,1) , rpois(20,2)); x <- gl(2 , 20); my.model <- glm(y~x , family=poisson); summary(my.model); lm.fit.test(my.model); ### Games-Howell post-hoc in R Mr. Shigenobu AOKI displayed a quick R function for both of Tukey and Games-Howell post-hoc comparisons after an ANOVA. The Games-Howell post-hoc is famous for non-equal variances between treatments. Read more on the webpage about Games-Howell post-hoc in R. ### 邏輯斯迴歸、勝算與勝算比的關係 (logistic regression, odds and odds ratio) #### 公式一覽 $\log ( \text{odd} ) = \text{logit} (p) = \log ( \frac{\Pr (y=1) }{1 - \Pr (y=1)} ) = \beta_0 + \sum_{i=1}^{n} \beta_i x_i$ $\text{odd} = \frac{\Pr (y=1) }{1 - \Pr (y=1)} = \exp ( \beta_0 + \sum_{i=1}^{n} \beta_i x_i )$ $\Pr (y=1) = \frac{\text{odd}}{1+\text{odd}} = \frac{1}{1+\text{odd}^{-1}} = \frac{1}{1+\exp (-( \beta_0 + \sum_{i=1}^{n}\beta_i x_i)) }$ $\log (\text{odds ratio}) = \log \left( \frac{\text{odd}_1}{\text{odd}_2} \right) = \log(\text{odd}_1) - \log(\text{odd}_2)$ #### 寫在前面 如何解讀邏輯斯迴歸 (logistic regression) 的迴歸式? 因為我常搞錯邏輯斯迴歸結果與勝算或勝算比的關係, 特書此筆記. 本文最主要的內容是唸過 “FAQ: How do I interpret odds ratios in logistic regression?” 一文後的筆記。 #### 名詞翻譯 • logistic regression: 邏級式迴歸; • odds: 勝算; • odds ratio: 勝算比. 請注意 “勝算” 與 “勝算比” 是兩回事. #### 僅包含常數項的迴歸式 在模型 $\text{logit}(p) = \log(\frac{1}{1-p}) = \beta_0,$ 其中$p$為依變數$y = 1$之機率, 而$y$為布林資料 (0 或 1) 以表示肯定或否定. 此時,$\beta_0$即為 $\log(\frac{\Pr (y = 1)}{\Pr (y = 0)}).$ 說成白話文,$y = 1$的機率除以$y = 0$的機率為$\exp(\beta_0)$. 我以下列資料舉例. 抽樣的 10 名學生中, 有 6 人有手機而 4 人沒有. 計算 “有手機” 對 “無手機” 的勝算為$\frac{6/10}{4/10} = 1.5$. 在迴歸式中可得$\text{logit}(p) = \log( \frac{1}{1-p}) = 0.4055$, 可驗證$\exp(0.4055) = 1.500052$(因為$y = 1$為有手機而$y = 0$為無手機). 以下為驗證之 R code: > isPhone<-c(rep(1,6),rep(0,4)) > m<-glm(isPhone~1,family=binomial);summary(m) Call: glm(formula = isPhone ~ 1, family = binomial) Deviance Residuals: Min 1Q Median 3Q Max -1.354 -1.354 1.011 1.011 1.011 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 0.4055 0.6455 0.628 0.53 (Dispersion parameter for binomial family taken to be 1) Null deviance: 13.46 on 9 degrees of freedom Residual deviance: 13.46 on 9 degrees of freedom AIC: 15.46 Number of Fisher Scoring iterations: 4 > exp(0.4055) [1] 1.500052 #### 包含一個二元類別自變數的迴歸式 在模型 $\text{logit}(p) = \beta_0 + \beta_1 x_1,$ 其中$p$為依變數$y = 1$之機率而$y$為布林資料 (0 或 1) 以表示肯定或否定,$x_1$為二元類別資料並轉換為布林資料 (0 或 1). 此時,$\beta_0$即為$x = 0$時$\log ( \frac{\Pr (y = 1)}{\Pr (y = 0)})$, 而$\beta_1$為$x_1 = 1$對$x_1 = 0$在$y$是否為 1 的勝算比. 說成白話文,$x_1 = 0$時$y = 1$的機率除以$y = 0$的機率為$\exp( \beta_0)$;$x_1 = 1$對$x_1 = 0$的勝算比為$\beta_1$. 我以下列資料舉例. 抽樣的 10 名學生中, 有 6 名男生 4 名女生, 且每人分別擁有手機或沒有, 資料如下: 有無手機男生 (x1 = 1)女生 (x1 = 0) 有 (y = 1)51 無 (y = 0)13 小計64 “男生有手機” 對 “男生無手機” 的勝算為$\frac{5/6}{1/6} = 5$; “女生有手機” 對 “女生無手機” 的勝算為$\frac{1/4}{3/4} = 0.3333$; 二者的勝算比為$5 / 0.3333 = 15$. 在男生 = 1, 女生 = 0, 有手機 = 1, 無手機 = 0 的設定下, 迴歸式中可得$\text{logit}(p) = -1.099 + 2.708(\text{性別})$, 可驗證$\exp (-1.099) = 0.3332041$為 “女生有手機” 對 “女生無手機” 的勝算 (因為女生的編碼為$y = 0$);$\exp (2.708) = 14.99925$為性別與有無手機的勝算比, 可解釋作 “男生有手機之勝算為女生的 15 倍”. 以下為驗證之 R code: > isMale<-c(1,1,1,1,1,1,0,0,0,0) > isPhone<-as.factor(c(1,1,1,1,1,0,0,0,0,1)) > m<-glm(isPhone~isMale,family=binomial);summary(m) Call: glm(formula = isMale ~ isPhone, family = binomial) Deviance Residuals: Min 1Q Median 3Q Max -1.8930 -0.7585 0.6039 0.6039 1.6651 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.099 1.155 -0.951 0.3414 isMale 2.708 1.592 1.701 0.0889 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 13.4602 on 9 degrees of freedom Residual deviance: 9.9054 on 8 degrees of freedom AIC: 13.905 Number of Fisher Scoring iterations: 4 > exp(-1.099) [1] 0.3332041 > exp(2.708) [1] 14.99925 #### 包含一連續變數作為自變數的迴歸式 在模型 $\text{logit}(p) = \beta_0 + \beta_0 x_1,$ 其中$p$為依變數$y = 1$之機率而$y$為布林資料 (0 或 1) 以表示肯定或否定,$x_1$為連續型變數. 此時,$\beta_0$即為$x = 0$時$\log (y = 1 \text{對} y = 0 \text{的勝算})$, 而$\beta_1$為$x_1$對勝算的貢獻量. 說成白話文,$x_1=0$時$y = 1$的機率除以$y = 0$的機率為$\exp(\beta_0)$; 若$x_1 \neq 0$時$y = 1$的機率除以$y = 0$的機率為$\exp (\beta_0 + \beta_1 x_1)$, 而$\beta_1$為正數或負數則影響了勝算的正向或負向影響。 我以下列資料舉例. 抽樣的 10 名學生中, 每人有不同每日零用錢及是否擁有手機, 資料如下: 有無手機 (y, 1 為有, 0 為無)每日零用錢 (x1) 1100 1120 190 1110 1110 080 0100 090 070 0110 上述資料可得回歸式$\text{logit}(p) = -9.71278 + 0.09840(\text{每日零用錢})$. 由迴歸式可觀察出, 每日零用錢每增加 1 單位則對有手機之勝算增加$\text{exp}(0.09840) = 1.103404$, 亦表示有無手機在多拿 1 單位零用錢與沒有者的勝算比是$\exp(\beta_1 \times 1) = 1.103404$，多拿 2 單位則為$\exp(\beta_1 \times 2)$, 以此類推. 我們可以估計每日零用錢為 70 時有手機對無手機的勝算:$\exp (-9.71278 + 0.09840 \times 70) = 0.05932171$, 可解釋為每日零用錢為 70 的學生中有手機的機率為沒有手機的 0.05932171 倍. 要估算每日零用錢為 75 的勝算亦如法炮製:$\exp( -9.71278 + 0.09840 \times 75) = 0.09702564$. 因此, 二者的勝算比為$0.09702564 / 0.05932171 = 1.635584$, 也就是$\exp (0.09840 \times 5)$. 由於$\text{勝算} = p / (1 - p)$, 故可由勝算回推$p$, 公式為$p = \text{勝算} / (1 + \text{勝算})$. 前面有談到零用錢為 70 時有手機對無手機的勝算為 0.05932171, 可推得零用錢為 70 時有手機的機率為$0.05932171 / (1 + 0.05932171) = 0.05599971$. 同理, 每日零用錢為 75 時有手機的機率為$0.09702564 / (1 + 0.09702564) = 0.08844428$. > isPhone<-c(1,1,1,1,1,0,0,0,0,0) > money<-c(100, 120, 90, 110, 110, 80, 100, 90, 70, 110) > m<-glm(isPhone~money,family=binomial);summary(m) Call: glm(formula = isPhone ~ money, family = binomial) Deviance Residuals: Min 1Q Median 3Q Max -1.67078 -0.76661 0.07113 0.75438 1.55604 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -9.71278 6.59477 -1.473 0.141 money 0.09840 0.06574 1.497 0.134 (Dispersion parameter for binomial family taken to be 1) Null deviance: 13.863 on 9 degrees of freedom Residual deviance: 10.481 on 8 degrees of freedom AIC: 14.481 Number of Fisher Scoring iterations: 4 #### 結論 • 邏輯斯迴歸的迴歸式$\log ( \text{odd} ) = \text{logit} (p) = \log \left( \frac{\Pr (y=1) }{1 - \Pr (y=1)} \right) = \beta_0 + \sum_{i=1}^{n} \beta_i x_i$所加總得到的數值是 log(勝算), 故 exp(logit(p)) = 勝算; • 將二個勝算相除可以得到勝算比; • 可由勝算回推機率, 公式為機率 = 勝算 / (1 + 勝算); • 因為指數與對數在運算上的特性, log(勝算比) 常成為二個 logit(p) 相加減之結果. • 若不解, 一切回歸到 “log(勝算) = log (p/(1 - p)) = 迴歸式” 及 “勝算比 = 勝算1 / 勝算2” 的基本概念, 再往下推導. ### Least-significant difference (LSD) in R How to apply LSD (as a kind of multiple comparisons after ANOVA) in R? A dataset for typical one-way ANOVA is given: y <- rnorm(40) x <- gl(4,10) m <- aov(y~x) Applying LSD by using function pairwise.t.test: pairwise.t.test(y,x,p.adj="none") Applying LSD by using function LSD.test from package agricolae: library(agricolae) LSD.test(m,"x",group=T) # with grouping LSD.test(m,"x",group=F) # without grouping ### Specifying a reference level in R: function relevel How to specify a specific level as a reference level in GNU R while applying (generalized) linear regression? A dataframe dt is given: y<-rpois(40,1) x1<-gl(4,10) x2<-gl(4,1,40) dt <- data.frame(cbind(y,x1,x2)) Applying poisson regression: m1<-glm(y~factor(x1)+factor(x2),data=dt) summary(m1) The output shows x1=1 and x2=1 are both reference level defined by R automatically. To switch the reference level from x1=1 to x1=2, the function relevel could be used: m2<-glm(y~relevel(factor(dt$x1),"2")+factor(x2),data=dt)
summary(m2)