Logistic regression is the most common way to model a binary outcome Y.

Bernoulli model

  • Y takes the value
    • 1 with probability \(\pi\)
    • 0 with probability \(1 - \pi\)
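
As a quick illustration in R (a minimal sketch; the value \(\pi = 0.3\) and the number of draws are arbitrary choices), Bernoulli outcomes can be generated with rbinom:

set.seed(42)                       # for reproducibility
rbinom(10, size = 1, prob = 0.3)   # ten draws of Y ~ Bernoulli(0.3)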

The log of the odds happens to be a very convenient quantity to model as a linear function of covariates:

\[Y \sim \text{Bernoulli}(\pi)\] \[ \log(\frac{\pi}{1-\pi}) = \beta_0 + \beta_1 \cdot X_1\]

Exercise

Prove that the probability of having the disease for a person with genotype \(X_1=x\) is

\[\begin{align} \pi & = P(\text{disease}|\beta_0, \beta_1, X_1=x) \\ & = \frac{\exp(\beta_0 + \beta_1 \cdot x)}{1+\exp(\beta_0 + \beta_1 \cdot x)} \end{align}\]
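
A quick numerical sanity check of this formula (not a substitute for the algebraic proof): R's built-in plogis computes exactly this inverse-logit (expit) function, so the two expressions below should agree. The values beta_0 = 0.2, beta_1 = 0.3, and x = 2 are arbitrary choices for the check.

beta_0 = 0.2; beta_1 = 0.3; x = 2
exp(beta_0 + beta_1 * x) / (1 + exp(beta_0 + beta_1 * x))  # formula above
plogis(beta_0 + beta_1 * x)   # built-in expit; both return ~0.69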

Calculate the likelihood

Jane has the disease D, and her genotype at location 1 is AA. Kim does not have the disease D, and her genotype at location 1 is aa.

Data: \(Y(\text{Jane}) = 1\) and \(Y(\text{Kim}) = 0\)

If we are counting A alleles, the genotype dosages are \(X_1(\text{Jane}) = 2\) and \(X_1(\text{Kim}) = 0\)

Recall that the likelihood is the probability of the data given (conditional on) the parameters; we usually also condition on the genotype. The likelihood for Jane is the probability that Jane has the disease given that she has genotype dosage 2 and the parameters \(\beta_0\) and \(\beta_1\). By "conditional on" or "given", we mean that we pretend that we know their values and give them names.

\[\begin{align} L(\text{Jane}) &= P(\text{Jane has the disease}|\beta_0,\beta_1, X_1=2)\\ & = \frac{\exp(\beta_0 + \beta_1 \cdot X_1)}{1+\exp(\beta_0 + \beta_1 \cdot X_1)} ~~~~~~~~~~~~ \text{recall that for Jane } X_1=2\\ & = \frac{\exp(\beta_0+ \beta_1 \cdot 2)}{1+\exp(\beta_0 + \beta_1 \cdot 2)} \end{align}\]

Now, if Jane and Kim are not related, we can assume that Kim's data point is independent of Jane's, so the full likelihood for both can be written as the product of the two. Note that since Kim does not have the disease, her contribution is the probability of not having the disease, \(1-\pi\), evaluated at \(X_1=0\):

\[\begin{align} L(\text{Jane and Kim}) &= L(\text{Jane})\cdot L(\text{Kim})\\ & = \frac{\exp(\beta_0 + \beta_1 \cdot 2)}{1+\exp(\beta_0 + \beta_1 \cdot 2)} \cdot \left(1-\frac{\exp(\beta_0)}{1+\exp(\beta_0)}\right)\\ & = \frac{\exp(\beta_0 + \beta_1 \cdot 2)}{1+\exp(\beta_0 + \beta_1 \cdot 2)} \cdot \frac{1}{1+\exp(\beta_0)} \end{align}\]
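
To make this concrete, here is the joint likelihood evaluated in R at one arbitrary pair of parameter values (beta_0 = 0.2 and beta_1 = 0.3, the same values used in the simulation below); once the parameters are fixed, the likelihood is just a number.

expit = function(x) exp(x) / (1 + exp(x))  # inverse logit
beta_0 = 0.2; beta_1 = 0.3                 # arbitrary parameter values
L_jane = expit(beta_0 + beta_1 * 2)        # P(disease | X_1 = 2)
L_kim = 1 - expit(beta_0)                  # P(no disease | X_1 = 0)
L_jane * L_kim                             # joint likelihood, by independence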

Simulate data

To check that glm can recover the true parameters, we simulate genotype dosages and disease status under the model with \(\beta_0 = 0.2\) and \(\beta_1 = 0.3\), then fit a logistic regression.

beta_0 = 0.2                                 # true intercept
beta_1 = 0.3                                 # true log odds ratio per A allele
nsam = 10000                                 # sample size
maf = 0.25                                   # minor allele frequency
expit = function(x) exp(x) / (1 + exp(x))    # inverse of the logit
xvec = rbinom(nsam, 2, maf)                  # genotype dosages (0, 1, or 2 A alleles)
yvec = rbinom(nsam, 1, expit(beta_0 + beta_1 * xvec))  # disease status
summary(glm(yvec ~ xvec, family = "binomial"))         # fit the logistic regression
## 
## Call:
## glm(formula = yvec ~ xvec, family = "binomial")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5620  -1.2682   0.9581   1.0893   1.0893  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.21091    0.02608   8.088 6.08e-16 ***
## xvec         0.32954    0.03414   9.652  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13523  on 9999  degrees of freedom
## Residual deviance: 13428  on 9998  degrees of freedom
## AIC: 13432
## 
## Number of Fisher Scoring iterations: 4
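
As a sanity check (a minimal sketch, not part of glm itself), we can maximize the same Bernoulli log-likelihood directly with optim, reusing xvec and yvec from above; the estimates should land close to glm's 0.211 and 0.330.

# negative log-likelihood: -(sum over i of y_i * eta_i - log(1 + exp(eta_i)))
negloglik = function(beta) {
  eta = beta[1] + beta[2] * xvec   # linear predictor for each person
  -sum(yvec * eta - log(1 + exp(eta)))
}
fit = optim(c(0, 0), negloglik)    # Nelder-Mead from starting point (0, 0)
fit$par                            # approximately (0.211, 0.330)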

Why is the log of the odds such a convenient quantity?

  • it can take any value on the real line
  • the prevalence of the disease cancels out, which allows us to use a case-control design instead of population data and still get valid estimates of the effects of covariates on risk (Prentice and Pyke 1979); see the simulation sketch below
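
A minimal simulation sketch of the second point, reusing xvec and yvec from the simulation above: draw an equal number of cases and controls, refit the model on this case-control sample, and the slope estimate stays near the true \(\beta_1 = 0.3\) while the intercept is shifted by the sampling design.

cases = sample(which(yvec == 1), 1000)      # 1000 cases
controls = sample(which(yvec == 0), 1000)   # 1000 controls
cc = c(cases, controls)                     # indices of the case-control sample
# the slope stays close to 0.3; only the intercept is distorted
summary(glm(yvec[cc] ~ xvec[cc], family = "binomial"))$coefficients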

References

Prentice, R. L., and R. Pyke. 1979. "Logistic Disease Incidence Models and Case-Control Studies." Biometrika 66 (3): 403–11.



Citation

For attribution, please cite this work as

Haky Im (2021). Logistic Regression. BIOS 25328 Cancer Genomics Class Notes. /post/2021/04/01/logistic-regression/

BibTeX citation

@misc{im2021logistic,
  title = "Logistic Regression",
  author = "Haky Im",
  year = "2021",
  journal = "BIOS 25328 Cancer Genomics Class Notes",
  note = "/post/2021/04/01/logistic-regression/"
}