Logistic regression is the most common way to model a binary outcome \(Y\).
Bernoulli model
- \(Y\) takes value
  - 1 with probability \(\pi\)
  - 0 with probability \(1 - \pi\)
The log of the odds turns out to be a very convenient quantity to model as a linear function of covariates:
\[Y \sim \text{Bernoulli}(\pi)\] \[ \log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 \cdot X_1\]
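For intuition, here is the link function in action in R. The helper name logit below is ours for illustration; base R provides the same pair as qlogis() (log odds) and plogis() (inverse):

logit <- function(p) log(p / (1 - p))  # probability -> log odds
logit(0.5)  # 0: even odds
logit(0.9)  # about 2.20
logit(0.1)  # about -2.20, symmetric around zero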
Exercise
Prove that the probability of having the disease for a person with genotype \(X_1=x\) is
\[\begin{align} \pi & = P(\text{disease}|\beta_0, \beta_1, X_1=x) \\ & = \frac{\exp(\beta_0 + \beta_1 \cdot x)}{1+\exp(\beta_0 + \beta_1 \cdot x)} \end{align}\]
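A quick numeric sanity check of this expression in R (not a substitute for the proof; the parameter values are illustrative):

expit <- function(x) exp(x) / (1 + exp(x))  # the expression above
beta_0 <- 0.2; beta_1 <- 0.3; x <- 2        # illustrative values
pi_x <- expit(beta_0 + beta_1 * x)          # probability of disease
log(pi_x / (1 - pi_x))                      # recovers beta_0 + beta_1 * x = 0.8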
Calculate the likelihood
Jane has the disease D, and her genotype at location 1 is AA. Kim does not have the disease, and her genotype at location 1 is aa.
Data: \(Y(\text{Jane}) = 1\) and \(Y(\text{Kim}) = 0\)
If we are counting A alleles, the genotype dosages are \(X_1(\text{Jane}) = 2\) and \(X_1(\text{Kim}) = 0\)
Recall that the likelihood is the probability of the data given (conditional on) the parameters; here we also condition on the genotype. The likelihood for Jane is the probability that Jane has the disease given that she has genotype dosage \(X_1 = 2\) and the parameters \(\beta_0\) and \(\beta_1\). By “conditional on” or “given”, we mean that we pretend that we know their values and give them names.
\[\begin{align} L(\text{Jane}) &= P(\text{Jane has the disease}|\beta_0,\beta_1, X_1=2)\\ & = \frac{\exp(\beta_0 + \beta_1 \cdot X_1)}{1+\exp(\beta_0 + \beta_1 \cdot X_1)} ~~~~~~~~~~~~ \text{recall that for Jane } X_1=2\\ & = \frac{\exp(\beta_0+ \beta_1 \cdot 2)}{1+\exp(\beta_0 + \beta_1 \cdot 2)} \end{align}\]
Now, if Jane and Kim are not related, we can assume that Kim’s data point is independent of Jane’s, and the full likelihood for both is the product of their individual likelihoods. Note that Kim does not have the disease, so her contribution is \(P(Y=0\,|\,X_1=0) = 1-\pi\):
\[\begin{align} L(\text{Jane and Kim}) &= L(\text{Jane})\cdot L(\text{Kim})\\ & = \frac{\exp(\beta_0 + \beta_1 \cdot 2)}{1+\exp(\beta_0+ \beta_1 \cdot 2 )} \cdot \frac{1}{1+\exp(\beta_0 )} \end{align}\]
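As a minimal sketch, this joint likelihood can be evaluated directly in R; the function name and the parameter values passed to it are illustrative:

expit <- function(x) exp(x) / (1 + exp(x))
lik_jane_kim <- function(beta_0, beta_1) {
  expit(beta_0 + beta_1 * 2) *   # Jane: P(Y = 1 | X_1 = 2)
    (1 - expit(beta_0))          # Kim:  P(Y = 0 | X_1 = 0)
}
lik_jane_kim(0.2, 0.3)           # likelihood at one point in parameter space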
Simulate data
beta_0 <- 0.2    # true intercept
beta_1 <- 0.3    # true log odds ratio per A allele
nsam   <- 10000  # sample size
maf    <- 0.25   # minor allele frequency

expit <- function(x) exp(x) / (1 + exp(x))  # inverse logit

xvec <- rbinom(nsam, 2, maf)  # genotype dosages: number of A alleles
yvec <- rbinom(nsam, 1, expit(beta_0 + beta_1 * xvec))  # disease status
summary(glm(yvec ~ xvec, family = "binomial"))
##
## Call:
## glm(formula = yvec ~ xvec, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5620 -1.2682 0.9581 1.0893 1.0893
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.21091 0.02608 8.088 6.08e-16 ***
## xvec 0.32954 0.03414 9.652 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13523 on 9999 degrees of freedom
## Residual deviance: 13428 on 9998 degrees of freedom
## AIC: 13432
##
## Number of Fisher Scoring iterations: 4
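To read the fit against the simulating values (beta_0 = 0.2, beta_1 = 0.3), we can pull the coefficients out directly; confint.default() gives Wald intervals:

fit <- glm(yvec ~ xvec, family = "binomial")
coef(fit)               # point estimates on the log odds scale
exp(coef(fit)["xvec"])  # estimated odds ratio per A allele
confint.default(fit)    # Wald 95% confidence intervals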
Why is the log of the odds such a convenient quantity?
- It can take any value on the real line, whereas \(\pi\) itself is confined to \([0, 1]\).
- The prevalence of the disease cancels out, which allows us to use a case-control design instead of population data and still obtain valid estimates of the effects of covariates on risk (the classical justification is Prentice and Pyke, 1979); see the sketch below.
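A minimal sketch of the second point, reusing xvec and yvec from the simulation above (the subsample size is arbitrary): sampling equal numbers of cases and controls shifts the intercept, but the xvec estimate stays near its true value of 0.3.

cases    <- which(yvec == 1)
controls <- which(yvec == 0)
n_cc     <- 2000  # arbitrary case-control subsample size
idx      <- c(sample(cases, n_cc), sample(controls, n_cc))
# Intercept absorbs the sampling fractions; the slope stays consistent
summary(glm(yvec[idx] ~ xvec[idx], family = "binomial"))$coefficients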