Data Clarity Made Easy: Visualizing Missing Values with Missingno in Python
Missingno is a Python library designed to visualize and handle missing data within datasets. It provides a range of tools to assist in identifying and understanding patterns of missingness in data. Missingno offers intuitive visualizations to help users quickly grasp the extent and distribution of missing values in their datasets, allowing for informed decision-making in the data cleaning and preprocessing stages.
Importing Libraries and Data
The data below is motor insurance data. The following information was collected for each customer: Gender, Age, Driving license, Region code, Previously insured, Vehicle age, Vehicle damage, Annual premium, Policy sales channel, and Vintage.
import pandas as pd
import missingno as msno

df = pd.read_csv('Motor_Insurance.csv')
df.head()
   id  Gender  Age  Driving_License  Region_Code  Previously_Insured  Vehicle_Age  Vehicle_Damage  Annual_Premium  Policy_Sales_Channel  Vintage
0   1    Male   44                1           28                 0.0    > 2 Years             Yes         40454.0                  26.0    217.0
1   2    Male   76                1            3                 NaN     1-2 Year              No         33536.0                  26.0    183.0
2   3    Male   47                1           28                 0.0    > 2 Years             Yes         38294.0                  26.0     27.0
3   4    Male   21                1           11                 1.0     < 1 Year              No         28619.0                 152.0    203.0
4   5  Female   29                1           41                 1.0     < 1 Year              No         27496.0                 152.0     39.0
Methods for Visualizing Missing Values
Missing Values with Missingno Bar Graph
Missingno’s bar graph gives a compact overview of missing values across the variables in a dataset. Each bar represents a column, and its height shows how complete that column is: the shorter the bar, the more null values the column contains. The x-axis lists the column names, the left y-axis runs from 0 to 1 as the proportion of non-null values, and the right y-axis shows the corresponding row counts. Here, Previously_Insured, Annual_Premium, Policy_Sales_Channel, and Vintage all show missing values.
# Missingno bar chart
import missingno as msno
import matplotlib.pyplot as plt

# Color each bar by the column's fraction of missing values
gradient_color = plt.cm.Blues_r(msno.nullity_sort(df).isnull().mean())
msno.bar(df, color=gradient_color)
plt.show()
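The figures the bar chart encodes can also be checked numerically. The following is a minimal sketch using only pandas; the `toy` frame and its values are hypothetical stand-ins for the insurance data, borrowing three of its column names for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for Motor_Insurance.csv (values are made up)
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Male"],
    "Previously_Insured": [0.0, np.nan, 1.0, np.nan],
    "Vintage": [217.0, 183.0, np.nan, 203.0],
})

# Non-null count per column and the same quantity as a 0-to-1 proportion
summary = pd.DataFrame({
    "non_null": toy.notnull().sum(),
    "completeness": toy.notnull().mean(),
})
print(summary)
```

Sorting `summary` by `completeness` is a quick way to rank columns from most to least affected before deciding how to treat them.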
Understanding Data Completeness with Missingno Matrix
The Missingno matrix displays the dataset row by row, which helps reveal patterns in where data is missing. Filled cells are shaded and missing cells are left blank, while a sparkline on the right traces how complete each row is and marks the rows with the most and fewest non-null values. Notably, columns such as Previously_Insured, Vehicle_Age, Annual_Premium, Policy_Sales_Channel, and Vintage exhibit significant missing data, guiding subsequent data preprocessing steps.
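The matrix plot itself is drawn with msno.matrix. As a rough sketch of what its sparkline summarizes, the per-row count of non-null cells can be computed with pandas alone; the `toy` frame below is hypothetical and its values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are illustrative only
toy = pd.DataFrame({
    "Age": [44, 76, 47, 21],
    "Previously_Insured": [0.0, np.nan, 0.0, np.nan],
    "Annual_Premium": [40454.0, np.nan, 38294.0, 28619.0],
})

# Number of non-null cells in each row: the quantity the sparkline traces
row_counts = toy.notnull().sum(axis=1)
print(row_counts.tolist())

# Rows with the fewest and the most filled-in cells
print("least complete row:", row_counts.idxmin())
print("most complete row:", row_counts.idxmax())

# With missingno installed, the plot itself would be drawn with:
#   import missingno as msno
#   msno.matrix(toy)
```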
Exploring Nullity Correlation with Missingno Heatmap
The msno.heatmap(df) function in Missingno helps explore how the nullity of one column correlates with that of another. Values near one mean that when one column is null, the other tends to be null as well. Values near negative one indicate the inverse: when one column is null, the other tends to be present. Values close to zero mean the missingness of the two columns is unrelated.
# Missingno heatmap
msno.heatmap(df)
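The correlation values in the heatmap can be approximated by correlating the boolean null masks of the columns. The sketch below uses only pandas on a hypothetical `toy` frame with made-up column names; it drops columns that are entirely complete or entirely empty, on the assumption that this mirrors what the heatmap leaves out, since their correlation is undefined.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with made-up column names
toy = pd.DataFrame({
    "a": [1, np.nan, 3, np.nan, 5, 6],
    "b": [1, np.nan, 3, np.nan, 5, 6],            # missing in the same rows as a
    "c": [np.nan, 2, np.nan, 4, np.nan, np.nan],  # missing exactly where a is present
    "d": [1, 2, 3, 4, 5, 6],                      # fully complete
})

mask = toy.isnull()

# Keep only columns that are partly missing; constant masks have no defined correlation
partly_missing = mask.any() & ~mask.all()
nullity_corr = mask.loc[:, partly_missing].astype(float).corr()
print(nullity_corr.round(2))
```

In this toy case `a` and `b` correlate at one, `a` and `c` at negative one, and `d` does not appear because it has no missing values.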
Uncovering Data Clustering with Missingno Dendrogram
The msno.dendrogram(df) function draws a tree-like graph, built through hierarchical clustering, that groups columns by how similar their missingness patterns are. Columns linked at a distance of zero fully predict one another's presence: whenever one is null, so is the other. The greater the height at which two columns or clusters join, the less closely their null values are related. In the resulting plot, two distinct groups emerge: variables with a high degree of null values and those with complete data.
# Missingno dendrogram
msno.dendrogram(df)
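To make the idea of columns joining at distance zero concrete, the sketch below computes pairwise distances between column null masks with NumPy. Euclidean distance between the binary masks is an assumption made here for illustration, not a confirmed detail of how missingno clusters; the `toy` frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with made-up column names
toy = pd.DataFrame({
    "a": [1, np.nan, 3, np.nan, 5, 6],
    "b": [1, np.nan, 3, np.nan, 5, 6],            # same missingness pattern as a
    "c": [np.nan, 2, np.nan, 4, np.nan, np.nan],
    "d": [1, 2, 3, 4, 5, 6],                      # fully complete
})

# One row per column, with 1 where the value is missing
masks = toy.isnull().to_numpy().T.astype(int)

# Pairwise Euclidean distance between the column masks (assumed metric)
diff = masks[:, None, :] - masks[None, :, :]
dist = pd.DataFrame(
    np.sqrt((diff ** 2).sum(axis=2)),
    index=toy.columns,
    columns=toy.columns,
)
print(dist.round(2))

# Column pairs at distance zero share an identical missingness pattern
zero_pairs = [
    (p, q)
    for i, p in enumerate(toy.columns)
    for q in toy.columns[i + 1:]
    if dist.loc[p, q] == 0
]
print(zero_pairs)
```

Pairs reported in `zero_pairs` are the ones a dendrogram would join at the very bottom of the tree.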
Conclusion
Missingno proves invaluable in Python-based data analysis, particularly for making missing values visible. Its bar chart, matrix, heatmap, and dendrogram help in understanding missing data patterns and support better decisions during preprocessing. Easy to use and versatile, Missingno is a valuable asset for data scientists who want to improve data quality before extracting insights.