Sunburst Plots with Seaborn: Visualizing Hierarchical Data

Introduction

Sunburst plots are a powerful tool for visualizing hierarchical and multi-level data, especially when working with large and complex datasets. In this article, we’ll walk you through creating sunburst-like visualizations with Seaborn, a popular Python data visualization library. These views let you explore how categories and subcategories relate to one another, which supports data analysis and decision-making.

Properties and Parameters of Sunburst Plots

Seaborn has no dedicated sunburst function. The closest built-in tool is the clustermap() function, which draws a hierarchically clustered heatmap: a rectangular grid of cells with dendrograms along the rows and columns, rather than the concentric rings of a true sunburst. It is highly configurable and can serve as a sunburst-like view when the goal is to show how groups in the data relate. Here are some important parameters:

data: The input data as a rectangular Pandas DataFrame whose rows and columns form the heatmap grid. It cannot contain missing values.
pivot_kws: A dictionary of keyword arguments passed to pivot. Use it when data is in long (tidy) form and must be reshaped into a rectangular DataFrame first.
method: The linkage method used to build the clusters. Options include ‘single’, ‘complete’, ‘average’ (the default), ‘weighted’, ‘centroid’, ‘median’, ‘ward’.
metric: The distance metric for pairwise comparisons between rows or between columns. Can be ‘correlation’, ‘euclidean’ (the default), ‘hamming’, etc. The ‘centroid’, ‘median’, and ‘ward’ methods are only well defined with the Euclidean metric.
z_score: Set to 0 (rows) or 1 (columns) to standardize along that axis by subtracting the mean and dividing by the standard deviation.
standard_scale: Set to 0 (rows) or 1 (columns) to rescale along that axis by subtracting the minimum and dividing by the range (max - min), so values fall between 0 and 1.
col_colors, row_colors: Colors drawn as a strip next to the heatmap to label the columns or rows. Accepts a list of colors, or a Pandas Series or DataFrame that is matched to the data by index.
linewidths, linecolor: Width and color of the lines separating the heatmap cells.
cmap: The colormap to use for the heatmap cells.
xticklabels, yticklabels: Control the tick labels along the x and y axes.
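To make the parameter list concrete, here is a minimal sketch that sets several of them in one call. The data, the group labels "A", "B", "C", and the variable names are made up for illustration and are not part of the article's dataset.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical toy data: 6 products (rows) measured over 4 quarters (columns).
rng = np.random.default_rng(0)
values = pd.DataFrame(
    rng.integers(10, 100, size=(6, 4)),
    index=[f"product_{i}" for i in range(6)],
    columns=["Q1", "Q2", "Q3", "Q4"],
)

# One color per row, keyed by the DataFrame index, marking a made-up grouping.
groups = pd.Series(["A", "A", "B", "B", "C", "C"], index=values.index)
palette = dict(zip(["A", "B", "C"], sns.color_palette("Set2", 3).as_hex()))
row_colors = groups.map(palette)

g = sns.clustermap(
    values,
    method="average",       # linkage method used to build the dendrograms
    metric="euclidean",     # distance between rows (and between columns)
    z_score=0,              # standardize each row: (x - mean) / std
    row_colors=row_colors,  # colored strip labeling each row's group
    cmap="vlag",
    linewidths=0.5,
    linecolor="grey",
    figsize=(6, 6),
)
```

In a script, call matplotlib.pyplot.show() afterwards to display the figure. The returned object g is a ClusterGrid, which also exposes the dendrograms and a savefig() method.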

A Simplified Real-life Example

Let’s say you want to visualize the sales data of an electronics store. You have a dataset comprising items, their categories, subcategories, and the number of units sold.

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Sample data
data = {
    'Category': ['Phones', 'Computers', 'Phones', 'Accessories', 'Computers'],
    'Subcategory': ['Smartphones', 'Laptops', 'Tablets', 'Cables', 'Desktops'],
    'Item': ['iPhone', 'MacBook Pro', 'iPad', 'USB-C Cable', 'iMac'],
    'Units Sold': [250, 50, 180, 500, 35],
}

# Create a DataFrame, then pivot it: one row per (Category, Subcategory), one column per Item
df = pd.DataFrame(data)
df = df.pivot_table(index=['Category', 'Subcategory'], columns='Item', values='Units Sold', fill_value=0)

# Draw a clustered heatmap; standard_scale=1 rescales each column to the range 0 to 1
sns.clustermap(df, cmap="coolwarm", linewidths=1, linecolor="grey", standard_scale=1)
plt.show()

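In this example, we first create a Pandas DataFrame, then pivot it so that each (Category, Subcategory) pair becomes a row and each item becomes a column. Finally, we use Seaborn’s clustermap() function to draw the clustered heatmap, with standard_scale=1 rescaling each column to the range 0 to 1 so that items with very different sales volumes share one color scale. Because every item here belongs to a single subcategory, each row has only one non-zero cell and the dendrograms carry little information; the example mainly shows the mechanics.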
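For readers who want the concentric-ring look itself, one possible approach is two nested ring charts drawn with Matplotlib's pie(), colored with a Seaborn palette. This is a sketch and not a Seaborn feature: the ring width, palette name, and variable names are arbitrary choices, and the Item level is left out to keep it short.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

sales = pd.DataFrame({
    "Category": ["Phones", "Computers", "Phones", "Accessories", "Computers"],
    "Subcategory": ["Smartphones", "Laptops", "Tablets", "Cables", "Desktops"],
    "Units Sold": [250, 50, 180, 500, 35],
})

# Sort so each category's subcategories are adjacent and both rings share one order.
sales = sales.sort_values(["Category", "Subcategory"])
inner = sales.groupby("Category")["Units Sold"].sum()                # inner ring
outer = sales.set_index(["Category", "Subcategory"])["Units Sold"]  # outer ring

# One Seaborn palette color per category; subcategories reuse their parent's color.
cat_colors = dict(zip(inner.index, sns.color_palette("deep", len(inner))))
outer_colors = [cat_colors[c] for c in outer.index.get_level_values("Category")]

fig, ax = plt.subplots(figsize=(6, 6))
ring = 0.3  # thickness of each ring (arbitrary)

# Inner ring: category totals.
ax.pie(inner.values, radius=1 - ring, labels=inner.index.tolist(),
       labeldistance=0.75, colors=[cat_colors[c] for c in inner.index],
       wedgeprops=dict(width=ring, edgecolor="white"))

# Outer ring: subcategories, lighter shade of the parent category's color.
ax.pie(outer.values, radius=1,
       labels=outer.index.get_level_values("Subcategory").tolist(),
       colors=outer_colors,
       wedgeprops=dict(width=ring, edgecolor="white", alpha=0.6))

ax.set(aspect="equal", title="Units sold by category and subcategory")
```

Both rings start at the same angle and sum to the same total, so each subcategory wedge sits directly outside its parent category.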

A Complex Real-life Example

Consider a more complex dataset comprising sales data of various items across different regions, categories, and subcategories over multiple years.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Simulate a large and complex dataset
np.random.seed(42)
region_list = ['North', 'South', 'East', 'West']
category_list = ['Phones', 'Computers', 'Accessories']
subcategory_list = ['Smartphones', 'Laptops', 'Tablets', 'Cables', 'Desktops']
years_list = list(range(2010, 2021))

data = {
    'Region': np.random.choice(region_list, 1000),
    'Category': np.random.choice(category_list, 1000),
    'Subcategory': np.random.choice(subcategory_list, 1000),
    'Year': np.random.choice(years_list, 1000),
    'Units Sold': np.random.randint(1, 500, 1000),
}

# Total units per (Region, Category, Subcategory) row and Year column
df = pd.DataFrame(data)
df = df.pivot_table(index=['Region', 'Category', 'Subcategory'],
                    columns='Year', values='Units Sold', aggfunc='sum', fill_value=0)

sns.clustermap(df, cmap="coolwarm", linewidths=1, linecolor="grey", standard_scale=1, figsize=(10, 10))
plt.show()

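In this example, we generate synthetic data using NumPy and pivot it into a DataFrame with a multi-level index: each row is a (Region, Category, Subcategory) combination, each column is a year, and each cell holds the total units sold. The clustermap() function then reorders rows and columns so that those with similar sales profiles sit next to each other, with the dendrograms showing how they group. Because the data is random, the clusters here have no real meaning; with real sales data they would point to segments that behave alike.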
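The clustering result can also be read back from the object that clustermap() returns. The sketch below uses a smaller synthetic table with hypothetical names (table, row_order, clustered) and reorders it to match the row and column order shown in the plot.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Smaller synthetic table, for illustration only.
rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    "Region": rng.choice(["North", "South", "East", "West"], n),
    "Category": rng.choice(["Phones", "Computers", "Accessories"], n),
    "Year": rng.choice(np.arange(2010, 2021), n),
    "Units Sold": rng.integers(1, 500, n),
})
table = df.pivot_table(index=["Region", "Category"], columns="Year",
                       values="Units Sold", aggfunc="sum", fill_value=0)

g = sns.clustermap(table, standard_scale=1, cmap="coolwarm")

# Positions of the original rows/columns in the order the plot displays them.
row_order = g.dendrogram_row.reordered_ind
col_order = g.dendrogram_col.reordered_ind

# The same table, rearranged to match the clustered heatmap.
clustered = table.iloc[row_order, col_order]
print(clustered.index.tolist()[:3])
```

Having the reordered table at hand makes it easier to name the rows that ended up in one cluster instead of reading small tick labels off the figure.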

Personal Tips for Sunburst Plots

When working with large datasets, avoid overwhelming the visualization with too many categories, subcategories, or layers.
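Use colormaps that match the data: a diverging colormap such as coolwarm suits values centered on zero (for example after z-scoring), while a sequential colormap suits values that run from low to high.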
Experiment with different distance metrics and linkage methods to find the best representation suited to your dataset and analysis goals.
Adjust the figure size with the figsize parameter to ensure all labels are legible and the structure is clear.
Standardize your data with either z_score or standard_scale so that rows or columns with very different magnitudes can be compared. Use one or the other, since clustermap() does not accept both in the same call.
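To see what the two standardization options do to the numbers, the following sketch reproduces them by hand in Pandas for the column-wise case (the value 1). The table is made up, and the z-score uses the sample standard deviation, which is Pandas' default and, as far as I know, what Seaborn uses as well.

```python
import pandas as pd

# Hypothetical table: two years whose sales are on very different scales.
table = pd.DataFrame(
    {"2019": [10, 20, 30], "2020": [100, 300, 500]},
    index=["Phones", "Computers", "Accessories"],
)

# Like standard_scale=1: rescale each column to the range 0 to 1.
scaled = (table - table.min()) / (table.max() - table.min())

# Like z_score=1: center each column and divide by its standard deviation.
z = (table - table.mean()) / table.std()

print(scaled)
print(z)
```

After either transformation the two columns become identical, even though the raw 2020 values are at least ten times larger. This is why standardizing keeps one large column from dominating the color scale and the distance calculation.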