Goal of this section
By the end of this section you will be able to:
- Perform hierarchical clustering on genes and samples
- Understand what a dendrogram represents
- Create heatmaps of expression data
- Use
seaborn.clustermapfor a high-level “all-in-one” clustering plot
We will use the scaled variable-gene matrix from the previous section:
- hspc_var (variable genes)
- hspc_zs (z-scored per gene)
Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.cluster.hierarchy as sch
(Re)load our functions:
import importlib
import functions
importlib.reload(functions)
from functions import *
``
Get the Data
url = "https://raw.githubusercontent.com/shambam/R_programming_1/main/Mouse_HSPC_reduced.txt"
hspc_data = pd.read_csv(
url,
sep="\t",
header=0,
index_col=0
)
hspc_data = from_pd(hspc_data )
hspc_var = get_top_variable_genes(hspc_data, top_n=500)
hspc_zs = hspc_var.copy()
# remember Excercise in Performace? Use the function you created there.
zscore_rows(hspc_zs)
Hierarchical clustering (genes)
Hierarchical clustering groups similar genes together.
We need a distance between rows (genes). A common choice is Euclidean distance. Then we build a hierarchy using the linkage method "Ward".
Ward is a linkage method that starts with each gene as its own cluster. At every step, it looks at all pairs of clusters, computes the distance between their centers using Euclidean distance, and asks: if we merge these two clusters, how much would the total variance inside the cluster increase? It then merges the pair that produces the smallest increase in within-cluster variance, which leads to compact, tight clusters.
gene_dist = vectorized_dist(hspc_zs)
gene_link = sch.linkage(gene_dist, method="ward")
Plot a dendrogram (for small numbers of genes this is fine):
plt.figure(figsize=(10, 6))
sch.dendrogram(gene_link, labels=hspc_zs["genes"].index.tolist(), leaf_rotation=90)
plt.title("Hierarchical clustering (genes)")
plt.xlabel("Gene")
plt.ylabel("Distance")
plt.tight_layout()
plt.show()
Note: With hundreds of genes, dendrogram labels can become unreadable. That’s normal.
Hierarchical clustering (samples)
To cluster samples, transpose the matrix so samples become rows:
sample_dist = vectorized_dist(transpose(hspc_zs) )
sample_link = sch.linkage(sample_dist, method="ward")
plt.figure(figsize=(10, 6))
sch.dendrogram(sample_link, labels=hspc_zs['samples'].index.tolist(), leaf_rotation=90)
plt.title("Hierarchical clustering (samples)")
plt.xlabel("Sample")
plt.ylabel("Distance")
plt.tight_layout()
plt.show()
A first heatmap (no clustering)
plt.figure(figsize=(8, 6))
plt.imshow(hspc_zs['expression'], aspect="auto")
plt.title("Heatmap (no clustering)")
plt.xlabel("Samples")
plt.ylabel("Genes")
plt.show()
This shows the data but doesn’t order genes/samples by similarity.
Ordering rows/columns using the dendrogram
We can extract the ordering from the dendrogram output and reorder the matrix.
genes_hc = sch.dendrogram(gene_link, no_plot=True, labels=hspc_zs['genes'].index.tolist())
samples_hc = sch.dendrogram(sample_link, no_plot=True, labels=hspc_zs['samples'].index.tolist())
ordered = expression_df(hspc_zs).loc[genes_hc["ivl"], samples_hc["ivl"]]
ordered.shape
Plot the ordered heatmap:
plt.figure(figsize=(8, 6))
plt.imshow(ordered.values, aspect="auto")
plt.title("Heatmap (genes and samples ordered by clustering)")
plt.xlabel("Samples")
plt.ylabel("Genes")
plt.show()
Heatmap with seaborn
Seaborn produces nicer heatmaps (and supports colorbars, labels, etc.).
plt.figure(figsize=(10, 8))
sns.heatmap(ordered, cmap="magma")
plt.title("Heatmap (ordered)")
plt.show()
Clustermap: clustering + heatmap in one step
seaborn.clustermap does clustering and plotting together.
We use z_score=0 to z-score rows (genes) inside clustermap.
(If you already z-scored, you can set z_score=None.)
cm = sns.clustermap(
hspc_var['expression'],
method="ward",
metric="euclidean",
cmap="magma",
z_score=0
)
plt.show()
Save it:
cm.fig.savefig("clustermap.png", dpi=200)
Assigning genes to clusters (example: 5 clusters)
Sometimes we want to label each gene with a cluster ID.
gene_clusters_k5 = sch.fcluster(cm.dendrogram_row.linkage, t=5, criterion="maxclust")
gene_clusters_k5[:10]
Count genes per cluster:
counts = pd.Series(gene_clusters_k5).value_counts().sort_index()
counts
Plot counts:
counts.plot(kind="bar")
plt.title("Genes per cluster (k=5)")
plt.xlabel("Cluster")
plt.ylabel("Number of genes")
plt.show()
Exercise
You have seen how you can create heatmaps for a pandas DataFrame.
Please implement a function for our own data dictionary that does plot a heatmap.
Hint
- *args collects extra positional arguments into a tuple
- **kwargs collects extra named arguments into a dictionary
How to use *args and **kwargs
You can document them in the function like this:
def plot_heatmap(data, *args, **kwargs):
"""
Plot a heatmap of the expression matrix.
Parameters
----------
data : dict
Data dictionary containing the key 'expression'.
*args :
Extra positional arguments passed to seaborn.clustermap.
**kwargs :
Extra keyword arguments passed to seaborn.clustermap.
"""
Inside the function, you simply forward them to seaborn:
sns.clustermap(mat, *args, **kwargs)
This passes all additional arguments directly to clustermap without you needing to specify them one by one.
You can and should also add options that your function processes itself, for example a save argument to write the image to a file, or options like title, xlabel, and ylabel to control the plot labels.
A good rule is: handle the most common options directly in your function, and pass everything else through *args and **kwargs to keep the function flexible without becoming overly complicated.
Why this matters for bioinformatics
Clustering + heatmaps are standard tools to:
- discover gene programs
- see sample relationships
- identify outliers
- validate expected biology (e.g. known cell types)
In the next section, we will introduce a simple statistical model (ANOVA) and compute p-values across genes.