Sunburst Plots with Seaborn: Visualizing Hierarchical Data
Introduction
Sunburst plots visualize hierarchical, multi-level data by drawing each level of the hierarchy as its own layer, which makes them useful for large datasets with nested categories. In this article, we'll walk through building sunburst-style views of such data with Seaborn, a popular Python data visualization library. These views let you explore how categories and subcategories relate to each other, which supports data analysis and decision-making.
Properties and Parameters of Sunburst Plots
Seaborn has no dedicated sunburst function. This article uses the clustermap() function instead, which generates hierarchically clustered heatmaps. The result is a rectangular heatmap with dendrograms rather than a radial chart, but it's highly configurable and can serve as a sunburst-like view of hierarchical data. Here are some important parameters:
- data: The input data as a rectangular Pandas DataFrame.
- pivot_kws: A dictionary of keyword arguments passed to the pivot function, used when the input is a tidy (long-form) DataFrame that must be reshaped into rectangular form.
- method: The linkage method used to build the clusters. Options include 'single', 'complete', 'average', 'weighted', 'centroid', 'median', 'ward'.
- metric: The distance metric used for the pairwise comparisons. Can be 'correlation', 'euclidean', 'hamming', etc.
- z_score: Standardize the data by subtracting the mean and dividing by the standard deviation. Set it to 0 to standardize rows or 1 to standardize columns.
- standard_scale: Standardize the data by subtracting the minimum and dividing by the range (max - min). Set it to 0 to rescale rows or 1 to rescale columns.
- col_colors, row_colors: Color labels for the columns or rows, given as a list-like, a DataFrame, or a Series.
- linewidths, linecolor: Width and color of the lines separating the heatmap cells.
- cmap: The colormap to use for the heatmap cells.
- xticklabels, yticklabels: Configure the tick labels along the x and y axes.
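To make the difference between z_score and standard_scale concrete, the sketch below reproduces both transformations by hand with pandas, following the definitions given above and applying them column by column. The table values and the helper names column_z_score and column_min_max are made up for illustration and are not part of Seaborn. The use of pandas' default sample standard deviation is also an assumption.

```python
import pandas as pd

# Made-up units-sold table: rows are products, columns are years.
sales = pd.DataFrame(
    {"2019": [10.0, 20.0, 30.0], "2020": [100.0, 300.0, 200.0]},
    index=["Cables", "Laptops", "Tablets"],
)


def column_z_score(df):
    """Subtract each column's mean and divide by its standard deviation."""
    # Assumption: pandas' default sample standard deviation (ddof=1).
    return (df - df.mean()) / df.std()


def column_min_max(df):
    """Subtract each column's minimum and divide by its range (max - min)."""
    return (df - df.min()) / (df.max() - df.min())


z = column_z_score(sales)
scaled = column_min_max(sales)

print(z.round(2))
print(scaled)
```

After z-scoring, every column is centered on zero, so the colors show deviation from that column's average. After min-max scaling, every column runs from 0 to 1, so the colors show position within that column's range.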
A Simplified Real-life Example
Let’s say you want to visualize the sales data of an electronics store. You have a dataset comprising items, their categories, subcategories, and the number of units sold.
import pandas as pd
import seaborn as sns
# Sample data
data = {
    'Category': ['Phones', 'Computers', 'Phones', 'Accessories', 'Computers'],
    'Subcategory': ['Smartphones', 'Laptops', 'Tablets', 'Cables', 'Desktops'],
    'Item': ['iPhone', 'MacBook Pro', 'iPad', 'USB-C Cable', 'iMac'],
    'Units Sold': [250, 50, 180, 500, 35],
}
# Create a DataFrame, then pivot it: one row per (Category, Subcategory) pair,
# one column per item, with 0 where a pair has no sales of that item
df = pd.DataFrame(data)
df = df.pivot_table(index=['Category', 'Subcategory'], columns='Item', values='Units Sold', fill_value=0)
# Draw a sunburst-like clustered heatmap with Seaborn's clustermap function;
# standard_scale=1 rescales each column to the 0-1 range
sns.clustermap(df, cmap="coolwarm", linewidths=1, linecolor="grey", standard_scale=1)
In this example, we first create a Pandas DataFrame, then pivot it so that the rows carry a multi-level index built from the Category and Subcategory columns. Finally, we use Seaborn's clustermap() function to draw a sunburst-like visualization, with standard_scale=1 rescaling each column to the 0-1 range so that items with very different sales volumes can be compared.
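Before plotting, it can help to look at what the pivot step actually produces. The sketch below rebuilds the same sample table and inspects the result. The variable names sales and wide are illustrative choices, not names used elsewhere in this article.

```python
import pandas as pd

sales = pd.DataFrame({
    "Category": ["Phones", "Computers", "Phones", "Accessories", "Computers"],
    "Subcategory": ["Smartphones", "Laptops", "Tablets", "Cables", "Desktops"],
    "Item": ["iPhone", "MacBook Pro", "iPad", "USB-C Cable", "iMac"],
    "Units Sold": [250, 50, 180, 500, 35],
})

# Rows become (Category, Subcategory) pairs, columns become items.
wide = sales.pivot_table(
    index=["Category", "Subcategory"],
    columns="Item",
    values="Units Sold",
    fill_value=0,
)

print(wide.index.names)                         # the two hierarchy levels
print(wide.shape)                               # rows x columns of the heatmap
print(wide.loc[("Phones", "Tablets"), "iPad"])  # a single cell lookup
print((wide != 0).sum().sum())                  # number of non-zero cells
```

Because every item in this tiny dataset belongs to exactly one subcategory, most cells are filled with 0, so the resulting heatmap is sparse. A dataset where items repeat across groups gives the clustering more to work with.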
A Complex Real-life Example
Consider a more complex dataset comprising sales data of various items across different regions, categories, and subcategories over multiple years.
import numpy as np
import pandas as pd
import seaborn as sns
# Simulate a large and complex dataset
np.random.seed(42)
region_list = ['North', 'South', 'East', 'West']
category_list = ['Phones', 'Computers', 'Accessories']
subcategory_list = ['Smartphones', 'Laptops', 'Tablets', 'Cables', 'Desktops']
years_list = list(range(2010, 2021))
data = {
    'Region': np.random.choice(region_list, 1000),
    'Category': np.random.choice(category_list, 1000),
    'Subcategory': np.random.choice(subcategory_list, 1000),
    'Year': np.random.choice(years_list, 1000),
    'Units Sold': np.random.randint(1, 500, 1000),
}
df = pd.DataFrame(data)
# Total units sold per (Region, Category, Subcategory) row and Year column;
# the string "sum" avoids the deprecation warning newer pandas raises for np.sum
df = df.pivot_table(index=['Region', 'Category', 'Subcategory'],
                    columns='Year', values='Units Sold', aggfunc='sum', fill_value=0)
sns.clustermap(df, cmap="coolwarm", linewidths=1, linecolor="grey", standard_scale=1, figsize=(10, 10))
In this example, we generate synthetic data using NumPy and pivot it into a DataFrame with a three-level row index (Region, Category, Subcategory) and one column per year. We then use the clustermap() function to visualize the relationships and dependencies between regions, categories, subcategories, and years.
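The clustering reorders rows and columns, so the plotted order no longer matches the DataFrame. A minimal sketch of how to recover that order from the object returned by clustermap() is shown below. It uses a small synthetic table (the names table, grid, and clustered are illustrative) rather than the 1,000-row dataset above.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)

# Small synthetic table: 6 (Region, Category) rows by 4 year columns.
index = pd.MultiIndex.from_product(
    [["North", "South"], ["Phones", "Computers", "Accessories"]],
    names=["Region", "Category"],
)
table = pd.DataFrame(
    rng.integers(1, 500, size=(6, 4)),
    index=index,
    columns=[2017, 2018, 2019, 2020],
)

grid = sns.clustermap(table, standard_scale=1, figsize=(6, 6))

# Row positions in the order the row dendrogram arranged them.
row_order = grid.dendrogram_row.reordered_ind

# Reindex the original table to match the plotted row order.
clustered = table.iloc[row_order]
print(clustered.index.tolist())
```

Rows that end up next to each other in clustered are the ones the chosen linkage method and metric consider most similar, which is often more informative than reading the colors alone.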
Personal Tips for Sunburst Plots
- When working with large datasets, avoid overwhelming the visualization with too many categories, subcategories, or layers.
- Use meaningful colormaps that highlight differences in data values while maintaining readability.
- Experiment with different distance metrics and linkage methods to find the best representation suited to your dataset and analysis goals.
- Adjust the figure size to ensure all labels are legible and the structure is clear.
- Standardize your data using the z_score or standard_scale parameter for better interpretation of the relationships between categories.
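One way to experiment with linkage methods and distance metrics without redrawing the figure each time is to compute the clustering directly with SciPy and compare the resulting row orders. The sketch below does this on random data. The candidates shortlist is a hypothetical selection, and the assumption is that the same method and metric names are then passed to clustermap().

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage

rng = np.random.default_rng(1)

# Synthetic matrix: 8 observations (rows) with 5 features (columns).
values = rng.normal(size=(8, 5))

# Hypothetical shortlist of (method, metric) pairs to compare.
# Note that 'ward' is only defined for the euclidean metric.
candidates = [
    ("average", "euclidean"),
    ("complete", "correlation"),
    ("ward", "euclidean"),
]

orders = {}
for method, metric in candidates:
    tree = linkage(values, method=method, metric=metric)
    # leaves_list returns the row order along the dendrogram.
    orders[(method, metric)] = leaves_list(tree).tolist()

for key, order in orders.items():
    print(key, order)
```

If two settings give nearly the same order, the choice between them matters little for that dataset. If they differ a lot, it is worth plotting both with sns.clustermap(df, method=..., metric=...) and checking which grouping matches what you know about the data.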