Forest Plots for Logistic Regression: A Clear View of Predictor Impact
Introduction:
A forest plot is a visual tool used to display the results of statistical analyses, especially the effect sizes and their confidence intervals from models like logistic regression. It helps quickly compare the strength and significance of multiple predictors in a clear, visual format. Forest plots are widely used in medical research, clinical trials, and meta-analyses to assess risk factors, treatment effects, or study outcomes.
Install and load the package
# Run once
# install.packages("forestplot")
# install.packages("mlbench")
library(forestplot)
library(mlbench) # for importing benchmark machine learning datasets
Let’s walk through forestplot with an example. We will follow these steps:
Run a logistic regression (e.g., predicting if someone has a disease or not),
Extract odds ratios (OR) and confidence intervals from the model,
Use a forest plot to show each variable’s effect.
Loading the example dataset
data("PimaIndiansDiabetes") # Load into environment
df <- PimaIndiansDiabetes # Create working copy
head(df) # First few rows
  pregnant glucose pressure triceps insulin mass pedigree age diabetes
1        6     148       72      35       0 33.6    0.627  50      pos
2        1      85       66      29       0 26.6    0.351  31      neg
3        8     183       64       0       0 23.3    0.672  32      pos
4        1      89       66      23      94 28.1    0.167  21      neg
5        0     137       40      35     168 43.1    2.288  33      pos
6        5     116       74       0       0 25.6    0.201  30      neg
Step 1: Run Binary Logistic Regression
# Fit a binary logistic regression with all eight predictors
model <- glm(diabetes ~ pregnant + glucose + pressure + triceps + insulin + mass + pedigree + age,
             data = df, family = binomial)
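Before interpreting the coefficients, it helps to know which outcome the model treats as the "event". With a factor response, glm() with family = binomial treats the first factor level as failure and the remaining levels as success, so it is worth checking levels(df$diabetes) before reading any odds ratio. The sketch below uses simulated data (the variables x and y are made up for illustration, not taken from the diabetes dataset) to show that reversing the level order flips the sign of the coefficient.

```r
set.seed(1)
x <- rnorm(200)
# Simulated outcome: larger x makes "pos" more likely
y <- factor(ifelse(runif(200) < plogis(1.5 * x), "pos", "neg"),
            levels = c("neg", "pos"))

# First level ("neg") is the reference, so this models P(y == "pos")
fit <- glm(y ~ x, family = binomial)

# Make "pos" the first level; the model now describes P(y == "neg")
y_flipped <- relevel(y, ref = "pos")
fit_flipped <- glm(y_flipped ~ x, family = binomial)

coef(fit)[["x"]]          # positive: higher x, higher odds of "pos"
coef(fit_flipped)[["x"]]  # same magnitude, opposite sign
```

If the level order is not the one you intend, relevel() or factor(..., levels = ...) on the response before fitting keeps the odds ratios pointing in the expected direction.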
Step 2: Extract Odds Ratios and CIs
Get coefficients and exponentiate them
OR <- exp(coef(model)) # Odds Ratios
CI <- exp(confint(model)) # Confidence intervals (95% by default)

# Collect estimates and interval bounds in one table
results <- data.frame(
  Variable = names(OR),
  OR = round(OR, 2),
  Lower = round(CI[, 1], 2),
  Upper = round(CI[, 2], 2)
)
print(results)
               Variable   OR Lower Upper
(Intercept) (Intercept) 0.00  0.00  0.00
pregnant       pregnant 1.13  1.06  1.21
glucose         glucose 1.04  1.03  1.04
pressure       pressure 0.99  0.98  1.00
triceps         triceps 1.00  0.99  1.01
insulin         insulin 1.00  1.00  1.00
mass               mass 1.09  1.06  1.13
pedigree       pedigree 2.57  1.44  4.66
age                 age 1.01  1.00  1.03
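Step 2 exponentiates the coefficients because logistic regression models the log-odds of the outcome, so exp(coefficient) is the factor by which the odds change for a one-unit increase in that predictor. The following is a minimal sketch on simulated data (x, y, and the helper odds_at() are hypothetical names used only here) that checks this numerically: the ratio of predicted odds at x = 1 and x = 0 matches exp(coef).

```r
set.seed(42)
x <- rnorm(300)
y <- rbinom(300, 1, plogis(-0.5 + 0.8 * x))  # simulated binary outcome
fit <- glm(y ~ x, family = binomial)

# Predicted odds p / (1 - p) at a given value of x
odds_at <- function(x0) {
  p <- predict(fit, newdata = data.frame(x = x0), type = "response")
  unname(p / (1 - p))
}

ratio <- odds_at(1) / odds_at(0)  # odds ratio for a one-unit increase in x
or_x <- exp(coef(fit)[["x"]])     # exponentiated coefficient

all.equal(ratio, or_x)
```

The same identity holds for any pair of values one unit apart, which is why a single odds ratio per predictor is enough to summarize its effect in this model.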
Step 3: Create Forest Plot
# Format CI as text
CI_text <- paste0(results$Lower, " - ", results$Upper)
# Create label text with CI (first row is the header)
tabletext <- cbind(
  c("Variable", as.character(results$Variable)),
  c("Odds Ratio", as.character(results$OR)),
  c("95% CI", CI_text)
)
# Forest plot
forestplot(labeltext = tabletext,
           mean = c(NA, results$OR),
           lower = c(NA, results$Lower),
           upper = c(NA, results$Upper),
           zero = 1,
           xlab = "Odds Ratio (OR)",
           boxsize = 0.2,
           line.margin = 0.2,
           col = fpColors(box = "royalblue", line = "darkblue", summary = "steelblue"))
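One possible variation, offered as a sketch rather than as the method used above: odds ratios are multiplicative, so a logarithmic x-axis places halving (0.5) and doubling (2) at equal distances from 1. The forestplot() argument xlog = TRUE requests such an axis. The intercept row is dropped here because it is a baseline odds rather than an odds ratio, and its rounded value of 0.00 cannot be placed on a log scale. The data frame plot_df simply re-enters the rounded values from the table printed in Step 2, and the call is guarded so the block still runs where forestplot is not installed.

```r
# Rounded values copied from the Step 2 table, intercept row omitted
plot_df <- data.frame(
  Variable = c("pregnant", "glucose", "pressure", "triceps",
               "insulin", "mass", "pedigree", "age"),
  OR    = c(1.13, 1.04, 0.99, 1.00, 1.00, 1.09, 2.57, 1.01),
  Lower = c(1.06, 1.03, 0.98, 0.99, 1.00, 1.06, 1.44, 1.00),
  Upper = c(1.21, 1.04, 1.00, 1.01, 1.00, 1.13, 4.66, 1.03),
  stringsAsFactors = FALSE
)

# Label matrix: header row followed by one row per predictor
tabletext_log <- cbind(
  c("Variable", plot_df$Variable),
  c("Odds Ratio", sprintf("%.2f", plot_df$OR)),
  c("95% CI", sprintf("%.2f - %.2f", plot_df$Lower, plot_df$Upper))
)

# Draw only if the forestplot package is available
if (requireNamespace("forestplot", quietly = TRUE)) {
  forestplot::forestplot(labeltext = tabletext_log,
                         mean = c(NA, plot_df$OR),
                         lower = c(NA, plot_df$Lower),
                         upper = c(NA, plot_df$Upper),
                         zero = 1,
                         xlog = TRUE,  # values stay on the odds-ratio scale
                         xlab = "Odds Ratio (log scale)",
                         boxsize = 0.2)
}
```

In practice you would build plot_df from the results object, for example with results[results$Variable != "(Intercept)", ], instead of retyping the numbers.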
This visualization adds interpretability to logistic regression outputs and aids in identifying the most influential health indicators.
Reading the plot:
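A variable whose 95% confidence interval includes 1 is not statistically significant at the 5% level in this model. In the rounded table, triceps (0.99 - 1.01) clearly spans 1.
For pressure (0.98 - 1.00), insulin (1.00 - 1.00), and age (1.00 - 1.03), a bound rounds to exactly 1.00, so two decimals are not enough to tell whether the interval includes 1. Check the unrounded intervals or the p-values from summary(model) before classifying these predictors.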
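The "does the interval include 1" rule can be applied mechanically to the table, with one caveat: a bound printed as 1.00 may lie on either side of 1 before rounding. The sketch below re-enters the rounded bounds from the Step 2 table (intercept omitted) into a hypothetical data frame tab and labels each predictor accordingly. It illustrates the reading rule and is not a substitute for the unrounded intervals.

```r
# Rounded bounds copied from the Step 2 table, intercept row omitted
tab <- data.frame(
  Variable = c("pregnant", "glucose", "pressure", "triceps",
               "insulin", "mass", "pedigree", "age"),
  Lower = c(1.06, 1.03, 0.98, 0.99, 1.00, 1.06, 1.44, 1.00),
  Upper = c(1.21, 1.04, 1.00, 1.01, 1.00, 1.13, 4.66, 1.03),
  stringsAsFactors = FALSE
)

# A bound shown as 1.00 could be slightly below or above 1 before rounding
touches_one <- tab$Lower == 1 | tab$Upper == 1
# Interval strictly straddles 1 even after rounding
spans_one <- tab$Lower < 1 & tab$Upper > 1

tab$Reading <- ifelse(touches_one, "unclear after rounding",
                      ifelse(spans_one, "CI includes 1", "CI excludes 1"))
print(tab)
```

On these rounded values, pressure, insulin, and age come out as "unclear after rounding", which is a prompt to look at exp(confint(model)) without round().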
Notes:
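Odds Ratio > 1: a one-unit increase in the predictor is associated with higher odds of diabetes.
Odds Ratio < 1: a one-unit increase in the predictor is associated with lower odds of diabetes.
CI not crossing 1: statistically significant at the 5% level.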
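Because each odds ratio refers to a one-unit change, values close to 1 can still matter for predictors measured on a fine scale. For a change of k units the odds ratio is the per-unit value raised to the power k. The short sketch below applies this to the rounded glucose odds ratio from the table; the choice of k = 10 is only an illustration, and the result is approximate because it starts from a rounded number.

```r
or_glucose <- 1.04  # rounded per-unit odds ratio for glucose from the table
k <- 10             # illustrative size of the change, in glucose units

# Odds ratio for a k-unit increase: per-unit odds ratio to the power k
or_k <- or_glucose^k
round(or_k, 2)  # 1.48

# With the fitted model available, the unrounded equivalent would be
# exp(k * coef(model)[["glucose"]])
```

So a 10-unit increase in glucose corresponds to roughly 1.5 times the odds under the rounded estimate, even though the per-unit value of 1.04 looks small.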
Why Forest Plot?
The forest plot is a compact and intuitive way to:
Visualize the effect size of each variable,
See which predictors are statistically significant,
Communicate model results to both technical and non-technical audiences.