Data Clarity Made Easy: Visualizing Missing Values with Missingno in Python

Missingno is a Python library for visualizing missing data in datasets. It provides a small set of plots that help identify and understand patterns of missingness. These visualizations let users quickly grasp the extent and distribution of missing values, which supports informed decisions in the data cleaning and preprocessing stages.

Importing Libraries and Data

The data below is motor insurance data. For each customer it records Gender, Age, Driving_License, Region_Code, Previously_Insured, Vehicle_Age, Vehicle_Damage, Annual_Premium, Policy_Sales_Channel, and Vintage.

import pandas as pd
import missingno as msno
df = pd.read_csv('Motor_Insurance.csv')
df.head()
   id  Gender  Age  Driving_License  Region_Code  Previously_Insured  Vehicle_Age  Vehicle_Damage  Annual_Premium  Policy_Sales_Channel  Vintage
0   1    Male   44                1           28                 0.0    > 2 Years             Yes         40454.0                  26.0    217.0
1   2    Male   76                1            3                 NaN     1-2 Year              No         33536.0                  26.0    183.0
2   3    Male   47                1           28                 0.0    > 2 Years             Yes         38294.0                  26.0     27.0
3   4    Male   21                1           11                 1.0     < 1 Year              No         28619.0                 152.0    203.0
4   5  Female   29                1           41                 1.0     < 1 Year              No         27496.0                 152.0     39.0
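
Before plotting, it can help to confirm the missing-value counts numerically. The sketch below uses a small made-up frame named sample (the name and the values are illustrative assumptions, not taken from Motor_Insurance.csv) and only standard pandas calls.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the insurance data; values are illustrative only.
sample = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female"],
    "Previously_Insured": [0.0, np.nan, 1.0, np.nan],
    "Annual_Premium": [40454.0, 33536.0, np.nan, 27496.0],
})

# Number of missing values in each column.
missing_counts = sample.isnull().sum()

# Share of missing values in each column, as a percentage.
missing_pct = sample.isnull().mean() * 100

print(missing_counts)
print(missing_pct.round(1))
```

The same two calls can be run on df to get a first numeric summary to compare against the plots.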

Methods for Visualizing Missing Values

Missing Values with Missingno Bar Graph

Missingno’s bar graph provides a compact overview of missing values across the columns of a dataset. Each bar represents a column, and its height shows how complete that column is: the shorter the bar, the more values are missing. The y-axis runs from 0 to 1 and gives the proportion of non-null values, while the x-axis lists the column names. In this dataset, columns such as Previously_Insured, Annual_Premium, Policy_Sales_Channel, and Vintage have missing values.

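# Missingno bar chart
import missingno as msno
import matplotlib.pyplot as plt

# One color per column, derived from each column's share of missing values
# (Blues_r maps lower missing fractions to darker blues).
gradient_color = plt.cm.Blues_r(df.isnull().mean())
msno.bar(df, color=gradient_color)
plt.show()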
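
The quantity on the 0-to-1 axis can also be computed directly. The following is a minimal sketch on a hypothetical frame (the sample name and its values are assumptions made for illustration); it computes the fraction of non-null entries per column with pandas rather than reading it off the chart.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame; values are illustrative only.
sample = pd.DataFrame({
    "Age": [44, 76, 47, 21],
    "Previously_Insured": [0.0, np.nan, 0.0, 1.0],
    "Vintage": [217.0, np.nan, np.nan, np.nan],
})

# Fraction of non-null entries per column (the 0-to-1 completeness scale).
completeness = sample.notnull().mean()

# Columns ordered from least to most complete.
print(completeness.sort_values())
```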

Understanding Data Completeness with Missingno Matrix

The Missingno matrix visualizes missing values row by row, which helps in spotting patterns in where data is missing. Shaded cells mark values that are present and blank cells mark values that are missing. A sparkline on the right summarizes how complete each row is and marks the rows with the most and the fewest non-null values. Notably, columns such as Previously_Insured, Vehicle_Age, Annual_Premium, Policy_Sales_Channel, and Vintage exhibit significant missing data, which guides the subsequent preprocessing steps.

# Missingno matrix
# RGB color used for the cells that contain data
color = (0.4, 0.5, 0.7)

msno.matrix(df, color=color)
plt.show()
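
The row-level completeness that the sparkline summarizes can be approximated with plain pandas. This is a sketch under stated assumptions: the sample frame and its values are hypothetical, and the code only counts non-null cells per row; it does not reproduce the plot itself.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame; values are illustrative only.
sample = pd.DataFrame({
    "Previously_Insured": [0.0, np.nan, 0.0, 1.0],
    "Annual_Premium": [40454.0, np.nan, 38294.0, 28619.0],
    "Vintage": [217.0, np.nan, 27.0, np.nan],
})

# Count of non-null cells in every row.
filled_per_row = sample.notnull().sum(axis=1)

# Index labels of the least and most complete rows (first match on ties).
least_complete_row = filled_per_row.idxmin()
most_complete_row = filled_per_row.idxmax()

print(filled_per_row)
print(least_complete_row, most_complete_row)
```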

Exploring Data Integrity with Missingno Heatmap

The msno.heatmap(df) function in Missingno helps explore how the nullity of one column correlates with the nullity of another. A correlation near 1 means that when one column is null, the other tends to be null as well. A correlation near -1 indicates the inverse relationship: when one column is null, the other tends to be filled. Values close to zero indicate that the missingness of the two columns is unrelated.

# Missingno heatmap
msno.heatmap(df)
plt.show()
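
The nullity correlation itself can be sketched with pandas alone. The example below assumes a hypothetical frame (sample, with illustrative values) and drops columns whose nullity never varies, on the assumption that such columns carry no correlation information; it is an approximation of the idea, not Missingno's own code.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame; values are illustrative only.
sample = pd.DataFrame({
    "Policy_Sales_Channel": [np.nan, 26.0, np.nan, 152.0],
    "Vintage": [np.nan, 183.0, np.nan, 203.0],
    "Previously_Insured": [0.0, np.nan, 0.0, np.nan],
    "Age": [44, 76, 47, 21],
})

# 1 where a value is missing, 0 where it is present.
nullity = sample.isnull().astype(int)

# Keep only columns whose nullity varies (neither always filled nor always empty).
varying = nullity.loc[:, nullity.nunique() > 1]

# Pairwise correlation between the nullity indicators.
nullity_corr = varying.corr()
print(nullity_corr)
```

Here Policy_Sales_Channel and Vintage are always missing together, while Previously_Insured is missing exactly when they are filled, so the sketch yields the two extreme cases described above.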

Uncovering Data Clustering with Missingno Dendrogram

The msno.dendrogram(df) function draws a tree-like graph built by hierarchical clustering. It groups columns by how similar their missingness patterns are, so columns whose nullity is strongly correlated end up close together. Columns joined at level zero have a direct relationship: whenever one is null or filled, so is the other. The further apart two columns are in the tree, the less likely it is that their null values are correlated. In the resulting plot, two distinct groups emerge: variables with a high degree of null values and those with complete data.

# Missingno dendrogram
msno.dendrogram(df)
plt.show()
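
To see why some columns join at level zero, it helps to look at the pairwise distances between their missingness patterns. The sketch below is an illustration only: it assumes Euclidean distance between each column's 0/1 nullity vector, uses a hypothetical frame named sample, and covers just the distance step, not the construction of the tree.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame; values are illustrative only.
sample = pd.DataFrame({
    "Policy_Sales_Channel": [np.nan, 26.0, np.nan, 152.0],
    "Vintage": [np.nan, 183.0, np.nan, 203.0],
    "Previously_Insured": [0.0, np.nan, 0.0, np.nan],
    "Age": [44, 76, 47, 21],
})

# One 0/1 nullity vector per column (columns become rows after transposing).
nullity = sample.isnull().astype(int).to_numpy().T

# Euclidean distance between every pair of nullity vectors (an assumed metric).
diff = nullity[:, None, :] - nullity[None, :, :]
distances = pd.DataFrame(
    np.sqrt((diff ** 2).sum(axis=2)),
    index=sample.columns,
    columns=sample.columns,
)
print(distances)
```

A distance of zero, as between Policy_Sales_Channel and Vintage here, means the two columns are missing in exactly the same rows.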

Conclusion

Missingno is a valuable tool in Python-based data analysis, particularly for inspecting missing values. Its bar chart, matrix, heatmap, and dendrogram make missing data patterns easier to understand and support better decisions during preprocessing. It is easy to use and versatile, which makes it a practical asset for data scientists who want to improve data quality before extracting insights.