forestplot

Forest plots for logistic regression: A clear view of predictor impact

Introduction:

A forest plot is a visual tool used to display the results of statistical analyses, especially the effect sizes and their confidence intervals from models like logistic regression. It helps quickly compare the strength and significance of multiple predictors in a clear, visual format. Forest plots are widely used in medical research, clinical trials, and meta-analyses to assess risk factors, treatment effects, or study outcomes.

Install and load the packages

# Run once
# install.packages("forestplot")
# install.packages("mlbench") 
library(forestplot)
library(mlbench)  # for importing benchmark machine learning datasets

Let’s walk through forestplot with an example. These are the steps we will follow:

Run a logistic regression (e.g., predicting if someone has a disease or not),
Extract odds ratios (OR) and confidence intervals from the model,
Use a forest plot to show each variable’s effect.

Loading the example dataset (Pima Indians Diabetes, from mlbench)

data("PimaIndiansDiabetes")  # Load into environment
df <- PimaIndiansDiabetes    # Create working copy

head(df)     # First few rows

  pregnant glucose pressure triceps insulin mass pedigree age diabetes
1        6     148       72      35       0 33.6    0.627  50      pos
2        1      85       66      29       0 26.6    0.351  31      neg
3        8     183       64       0       0 23.3    0.672  32      pos
4        1      89       66      23      94 28.1    0.167  21      neg
5        0     137       40      35     168 43.1    2.288  33      pos
6        5     116       74       0       0 25.6    0.201  30      neg

Step 1: Run Binary Logistic Regression

model <- glm(diabetes ~ pregnant + glucose + pressure + triceps + insulin + mass + pedigree + age,
             data = df, family = binomial)
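
Before reading any odds ratio, it helps to confirm which outcome the model treats as the event. With a factor response, glm() with family = binomial treats the first factor level as "failure" and every other level as "success", so the coefficients describe the odds of the second level. A minimal check, assuming the mlbench copy of the data used above (the variable name outcome_levels is only illustrative):

```r
library(mlbench)

data("PimaIndiansDiabetes")

# The first level is the reference ("failure"); the second is the modelled event
outcome_levels <- levels(PimaIndiansDiabetes$diabetes)
print(outcome_levels)

# If the event of interest were listed first, relevel() could set the reference, e.g.
# PimaIndiansDiabetes$diabetes <- relevel(PimaIndiansDiabetes$diabetes, ref = "neg")
```

Here "pos" is the second level, so an odds ratio above 1 means higher odds of a positive diabetes result.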

Step 2: Extract Odds Ratios and CIs

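Get the coefficients and their confidence intervals, then exponentiate them to convert from the log-odds scale to odds ratios

OR <- exp(coef(model))                       # Odds ratios
CI <- exp(confint(model))                    # 95% confidence intervals (confint() default level)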

results <- data.frame(
  Variable = names(OR),
  OR = round(OR, 2),
  Lower = round(CI[, 1], 2),
  Upper = round(CI[, 2], 2)
)

print(results)

               Variable   OR Lower Upper
(Intercept) (Intercept) 0.00  0.00  0.00
pregnant       pregnant 1.13  1.06  1.21
glucose         glucose 1.04  1.03  1.04
pressure       pressure 0.99  0.98  1.00
triceps         triceps 1.00  0.99  1.01
insulin         insulin 1.00  1.00  1.00
mass               mass 1.09  1.06  1.13
pedigree       pedigree 2.57  1.44  4.66
age                 age 1.01  1.00  1.03
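
The (Intercept) row is the baseline odds when every predictor is zero, not the effect of a predictor, so it is often left out of a forest plot. A small sketch of that filtering step, using a hand-typed excerpt of the table above (the name predictors_only is hypothetical):

```r
# Excerpt of the results table printed above (first three rows only)
results_excerpt <- data.frame(
  Variable = c("(Intercept)", "pregnant", "glucose"),
  OR       = c(0.00, 1.13, 1.04),
  Lower    = c(0.00, 1.06, 1.03),
  Upper    = c(0.00, 1.21, 1.04)
)

# Keep only the predictor rows and reset the row names
predictors_only <- results_excerpt[results_excerpt$Variable != "(Intercept)", ]
rownames(predictors_only) <- NULL
print(predictors_only)
```

The same filter applied to the full results data frame would leave the eight predictors.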

Step 3: Create Forest Plot

# Format CI as text
CI_text <- paste0(results$Lower, " - ", results$Upper)

# Create label text with CI
tabletext <- cbind(
  c("Variable", results$Variable),
  c("Odds Ratio", as.character(results$OR)),
  c("95% CI", CI_text)
)

# Forest plot
forestplot(labeltext = tabletext,
           mean = c(NA, results$OR),
           lower = c(NA, results$Lower),
           upper = c(NA, results$Upper),
           zero = 1,
           xlab = "Odds Ratio (OR)",
           boxsize = 0.2,
           line.margin = 0.2,
           col = fpColors(box = "royalblue", line = "darkblue", summary = "steelblue"))
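
Odds ratios are ratios, so they are often drawn on a logarithmic axis, where an OR of 0.5 and an OR of 2 sit the same distance from 1. The sketch below is one possible variant, not a replacement for the plot above: it uses forestplot's xlog argument, keeps the estimates unrounded, and drops the intercept because a log axis cannot show a value that rounds to zero. The object name est is an assumption made for this example.

```r
library(forestplot)
library(mlbench)

data("PimaIndiansDiabetes")
model <- glm(diabetes ~ pregnant + glucose + pressure + triceps + insulin +
               mass + pedigree + age,
             data = PimaIndiansDiabetes, family = binomial)

# Unrounded odds ratios and 95% CIs; [-1, ] drops the intercept row
est <- exp(cbind(OR = coef(model), confint(model)))[-1, ]

# xlog = TRUE draws the axis on a log scale; values are passed on the OR scale
forestplot(labeltext = cbind(rownames(est)),
           mean  = est[, "OR"],
           lower = est[, "2.5 %"],
           upper = est[, "97.5 %"],
           zero  = 1,
           xlog  = TRUE,
           xlab  = "Odds Ratio (log scale)")
```

With most intervals sitting close to 1 and pedigree much wider, the log axis keeps the narrow intervals from being squeezed against the reference line.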

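This visualization makes the logistic regression output easier to interpret and helps show which health indicators are associated with diabetes. Keep in mind that each odds ratio is per one-unit change in its predictor, so the sizes of the boxes are not directly comparable across variables measured in different units.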

Reading the plot:

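A variable whose confidence interval includes 1 is not statistically significant at the 5% level in this model. In the table above, triceps (0.99 - 1.01) clearly spans 1. For pressure (0.98 - 1.00), insulin (1.00 - 1.00), and age (1.00 - 1.03), one bound rounds to exactly 1.00, so the two-decimal table cannot settle the question; check the unrounded intervals or the p-values from summary(model) before classifying them.
The (Intercept) row is the baseline odds rather than a predictor effect, so it should not be read this way.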
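
Because the table is rounded to two decimals, a bound printed as 1.00 may sit on either side of 1. A minimal sketch of how the unrounded intervals and p-values could be checked, assuming the same model as in Step 1 (the names ci, excludes_one, and check are illustrative):

```r
library(mlbench)

data("PimaIndiansDiabetes")
model <- glm(diabetes ~ pregnant + glucose + pressure + triceps + insulin +
               mass + pedigree + age,
             data = PimaIndiansDiabetes, family = binomial)

# Unrounded 95% CIs on the odds-ratio scale, intercept dropped
ci <- exp(confint(model))[-1, ]

# TRUE when the whole interval lies on one side of 1
excludes_one <- ci[, 1] > 1 | ci[, 2] < 1

# Wald p-values from the coefficient table, for comparison
p_value <- summary(model)$coefficients[-1, "Pr(>|z|)"]

check <- data.frame(lower = ci[, 1], upper = ci[, 2], excludes_one, p_value)
print(check)
```

The intervals come from confint() and the p-values from a Wald test, so the two can disagree slightly for borderline predictors.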

Notes:

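Odds Ratio > 1: a one-unit increase in the predictor is associated with higher odds of diabetes, holding the other predictors constant.
Odds Ratio < 1: a one-unit increase is associated with lower odds.
CI not crossing 1: statistically significant at the 5% level (for a 95% CI).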
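
Since each odds ratio is per one-unit change, an odds ratio close to 1 can still matter for a predictor with a wide range. For an increase of k units the odds ratio is the per-unit value raised to the power k. A short worked example using the rounded glucose value from the table above (so the result is approximate):

```r
or_per_unit <- 1.04          # glucose odds ratio from the table (rounded)
k <- 10                      # size of the increase, in glucose units

# OR for a k-unit increase: exp(k * beta) = exp(beta)^k
or_per_k <- or_per_unit^k
print(round(or_per_k, 2))    # 1.48
```

So a 10-unit rise in glucose corresponds to roughly 48% higher odds under this model, compared with about 4% for a single unit.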

Why Forest Plot?

The forest plot is a compact and intuitive way to:

Visualize the effect size of each variable,
See which predictors are statistically significant,
Communicate model results to both technical and non-technical audiences.