Data Science Day 12: Odds Ratio
Learning Objective:
- Probability vs Odds vs Odds Ratio
1. Probability = Number of outcomes in the Event / Number of outcomes in the Sample Space
2. Odds = Prob(Event) / Prob(Non-Event)
3. Odds Ratio = Odds(Group 1) / Odds(Group 2)
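As a quick illustration of the three definitions, here is a minimal sketch in plain Python. The counts (30 events out of 100 in Group 1, 10 out of 100 in Group 2) and the helper names `probability`, `odds`, and `odds_ratio` are made up for this example and do not come from any dataset.

```python
def probability(events, total):
    """Probability = number of events / size of the sample space."""
    return events / total


def odds(p):
    """Odds = Prob(Event) / Prob(Non-Event)."""
    return p / (1 - p)


def odds_ratio(p_group1, p_group2):
    """Odds Ratio = Odds(Group 1) / Odds(Group 2)."""
    return odds(p_group1) / odds(p_group2)


# Hypothetical counts: 30 of 100 events in Group 1, 10 of 100 in Group 2
p1 = probability(30, 100)
p2 = probability(10, 100)

print(p1, p2)               # probabilities
print(odds(p1), odds(p2))   # odds
print(odds_ratio(p1, p2))   # odds ratio
```

With these made-up numbers the probabilities differ by a factor of 3, while the odds ratio is about 3.86, so the two measures should not be read interchangeably.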
- Interpretation
The Odds Ratio (OR) is a measure of association between an exposure and an outcome.
OR = Odds(Group 1) / Odds(Group 2) > 1 indicates higher odds of the event in Group 1 than in Group 2.
OR = Odds(Group 1) / Odds(Group 2) < 1 indicates lower odds of the event in Group 1 than in Group 2.
OR = 1 indicates no difference in odds between the two groups.
The 95% confidence interval gives a range of plausible values for the true Odds Ratio, and the p-value indicates whether the association is statistically significant.
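To make the confidence interval idea concrete, here is a rough sketch that computes an odds ratio and an approximate 95% interval from a 2x2 table. It assumes the common Wald (Woolf) approximation on the log odds ratio; the function name `odds_ratio_ci` and the counts are hypothetical, chosen only for illustration.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% CI from a 2x2 table.

    a = events in Group 1,  b = non-events in Group 1
    c = events in Group 2,  d = non-events in Group 2
    """
    or_hat = (a / b) / (c / d)
    # Standard error of log(OR) under the Wald (Woolf) approximation
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_hat) - z * se_log_or)
    upper = math.exp(math.log(or_hat) + z * se_log_or)
    return or_hat, lower, upper


# Hypothetical counts, for illustration only
or_hat, lower, upper = odds_ratio_ci(30, 70, 10, 90)
print(or_hat, lower, upper)

# The association is significant at the 5% level when the interval excludes 1
print(lower > 1 or upper < 1)
```

An interval that contains 1 would mean the data are compatible with no difference in odds between the two groups.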

- Example: UCLA Graduate School Admission dataset
1. Calculate both the theoretical (model-based) and the true (observed) Odds Ratio, and interpret their meaning.
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import the UCLA dataset
df = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
df.columns = ["admit", "gre", "gpa", "prestige"]
print(df.head())

# Descriptive statistics
print(df.describe())

# Prestige 1 is the most prestigious school.
# Make a dummy_rank that groups prestige 1 and 2 as 1, and prestige 3 and 4 as 2
df["dummy_rank"] = np.where(df["prestige"] < 3, 1, 2)
df.hist()
pl.show()
# dummy_rank = pd.get_dummies(df["prestige"], prefix="prestige")
print(df.head())

# Frequency table: admit vs dummy_rank
print(pd.crosstab(df["admit"], df["dummy_rank"]))

# Apply logistic regression
# Note: sm.Logit does not add an intercept automatically, so this model is fit without one
X = df[["gre", "gpa", "dummy_rank"]]
logit = sm.Logit(df["admit"], X)
result = logit.fit()
print(result.summary())
print(result.conf_int())

# Theoretical odds ratios: exponentiate the fitted coefficients and their confidence limits
print(np.exp(result.params))
params = result.params
conf = result.conf_int()
conf["OR"] = params
conf.columns = ["2.5%", "97.5%", "OR"]
print(np.exp(conf))
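
# Calculate Probability vs Odds vs Odds Ratio from the frequency table
# dummy_rank 1 (prestige 1 or 2): 87 admitted, 125 not admitted
prob_rank1_accept = 87 / (125 + 87)
print(prob_rank1_accept)
# dummy_rank 2 (prestige 3 or 4): 40 admitted, 148 not admitted
prob_rank2_accept = 40 / (148 + 40)
print(prob_rank2_accept)
odds_rank1 = 87 / 125
odds_rank2 = 40 / 148
print(odds_rank1, odds_rank2)
# True odds ratio: dummy_rank 2 relative to dummy_rank 1
odds_ratio = odds_rank2 / odds_rank1
print(odds_ratio)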

# Visualization
# %matplotlib inline is a Jupyter notebook magic; omit it when running as a plain script
%matplotlib inline
pd.crosstab(df.admit, df.dummy_rank).plot(kind="bar")
plt.title("Admit vs Prestige")
plt.xlabel("Admit")
plt.ylabel("Student Frequency Count")
Summary
Our theoretical (model-based) Odds Ratio for dummy_rank is 0.319 with a 95% CI of (0.20, 0.41), which is close to the true (observed) Odds Ratio of 0.388 calculated from the frequency table. This indicates that for students from undergraduate schools in prestige 3 or 4, the odds of getting into graduate school are about 39% of the odds for students from prestige 1 or 2 schools. In other words, the odds of admission are about 2.6 times as high (1/0.388) for a student from an undergraduate school rated prestige 1 or 2 as for one from a school rated 3 or 4. Our graph supports the result!
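The arithmetic behind the observed Odds Ratio can be checked in a few lines. This sketch reuses the counts from the frequency table in the post; the variable names are my own, and the ratio of probabilities is included only to show that it is a different quantity from the odds ratio.

```python
# Counts from the frequency table in the post (admitted / not admitted)
admit_12, reject_12 = 87, 125   # prestige 1 or 2
admit_34, reject_34 = 40, 148   # prestige 3 or 4

odds_12 = admit_12 / reject_12
odds_34 = admit_34 / reject_34

or_34_vs_12 = odds_34 / odds_12   # about 0.388
or_12_vs_34 = odds_12 / odds_34   # the reciprocal, about 2.58

# Ratio of admission probabilities, which is not the same as the odds ratio
prob_12 = admit_12 / (admit_12 + reject_12)
prob_34 = admit_34 / (admit_34 + reject_34)
prob_ratio = prob_12 / prob_34    # about 1.93

print(round(or_34_vs_12, 3), round(or_12_vs_34, 3), round(prob_ratio, 3))
```

Swapping the reference group simply inverts the odds ratio, so 0.388 and roughly 2.58 describe the same association from opposite directions.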
Inspired by http://blog.yhat.com/posts/logistic-regression-and-python.html
Happy Studying! 😻