Goal of this section
By the end of this section you will:
- Understand what k-means clustering does (conceptually)
- Load a small time-course expression dataset
- Run k-means using scikit-learn (reliable, fast)
- Visualize cluster patterns
- Start the final project: implement a clustering method using simulated annealing
Important note: This course is about learning to think programmatically. For methods like simulated annealing, performance improvements usually come from reducing the number of calculations and updating only what is necessary, rather than using the most mathematically elegant or exact version of a distance function.
Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
import math
And our own functions:
import importlib
import functions
importlib.reload(functions)
from functions import *
Load the yeast cell-cycle dataset
We will use a small dataset where expression has been measured across timepoints. This dataset has clear clusters, which makes it good for learning.
url = "https://raw.githubusercontent.com/shambam/R_programming_1/main/Spellman_Yeast_Cell_Cycle.tsv"
ycc = pd.read_csv(url, sep="\t", index_col=0)
print(ycc.head())
## and to use our functions:
ycc = from_pd(ycc)
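The actual `from_pd` lives in the course's `functions.py`. Based on how the dictionary is used later in this section (`data["expression"]`, `data["genes"]`), a minimal sketch of what it might look like is (the function name and exact keys beyond those two are guesses):

```python
import numpy as np
import pandas as pd

def from_pd_sketch(df):
    # hypothetical stand-in for the course's from_pd():
    # split a genes-by-timepoints DataFrame into the dictionary
    # layout used throughout this section
    return {
        "expression": df.to_numpy(dtype=float),  # (n_genes, n_samples) matrix
        "genes": pd.DataFrame(index=df.index),   # per-gene annotation table
    }
```

Keeping the numeric matrix and the gene annotations separate is what lets us attach cluster labels to `data["genes"]` later without touching the expression values.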
Z-score each gene (row)
We scale each gene to mean 0 and sd 1 so patterns are comparable.
ycc_z = ycc.copy()
zscore_rows(ycc_z)
ycc_z['expression'].mean(), ycc_z['expression'].std()
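`zscore_rows` comes from our `functions.py`; the underlying idea, sketched here with plain numpy (the function name is just for illustration, and this version returns a new matrix rather than modifying in place):

```python
import numpy as np

def zscore_rows_sketch(mat):
    # z-score each row: subtract the row mean, divide by the row sd,
    # so every gene has mean 0 and sd 1 across timepoints
    mn = mat.mean(axis=1, keepdims=True)
    sd = mat.std(axis=1, keepdims=True)
    return (mat - mn) / sd
```

After this, two genes with the same up/down pattern but very different absolute expression levels look alike, which is exactly what we want for clustering on shape.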
What does k-means do?
k-means tries to group genes into K clusters such that:
- genes in the same cluster have similar expression patterns
- each cluster is represented by a “centroid” (an average pattern)
Algorithm idea:
- Choose K initial centroids
- Assign each gene to the closest centroid
- Update centroids as the mean of assigned genes
- Repeat until stable
Run k-means using scikit-learn
This gives a correct implementation you can trust.
K = 8
km = KMeans(n_clusters=K, n_init=10, random_state=0)
clusters = km.fit_predict(ycc_z['expression'])
clusters[:10]
Add cluster labels to the dict:
ycc_z['genes']['kmeans_clusters'] = clusters
ycc_z['genes'].head()
Visualize cluster patterns
We want to see the typical “time-course shape” of each cluster.
Approach:
- for each cluster, plot all genes (faint lines)
- optionally plot the average centroid (thicker line)
def plot_cluster_patterns(data, cluster_col, K=None, ncols=4, show_centroid=True):
    """
    Plot gene expression patterns per cluster from our data dictionary.

    Parameters
    ----------
    data : dict
        Our course data dictionary (validated by `check_data_model(data)`).
    cluster_col : str
        Column name in `data["genes"]` that stores the cluster label per gene.
    K : int or None
        Number of clusters. If None, inferred as max(label)+1 (integer labels).
    ncols : int
        Number of subplot columns.
    show_centroid : bool
        If True, plot the mean curve per cluster (red line).
    """
    # Validate input data structure (provided elsewhere in the course)
    check_data_model(data)
    X = data["expression"]  # (n_genes, n_samples)
    clusters = np.asarray(data["genes"][cluster_col].values)
    # x-axis: always 0..n_samples-1
    timepoints = np.arange(X.shape[1])
    if K is None:
        K = int(np.max(clusters)) + 1
    nrows = math.ceil(K / ncols)
    fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 3.2 * nrows), sharey=True)
    axes = np.atleast_1d(axes).ravel()
    for k in range(K):
        ax = axes[k]
        members = X[clusters == k, :]
        if members.shape[0] == 0:
            ax.set_title(f"Cluster {k} (n=0)")
            ax.axis("off")
            continue
        # plot individual genes (faint)
        for row in members:
            ax.plot(timepoints, row, color="black", alpha=0.15, linewidth=1)
        # plot centroid (mean)
        if show_centroid:
            ax.plot(timepoints, members.mean(axis=0), color="red", linewidth=2)
        ax.set_title(f"Cluster {k} (n={members.shape[0]})")
        ax.set_xlabel("Time")
        if (k % ncols) == 0:
            ax.set_ylabel("Z-score")
    # hide unused panels
    for j in range(K, len(axes)):
        axes[j].axis("off")
    plt.tight_layout()
    plt.show()
    return fig
And run it like this:
fig = plot_cluster_patterns( ycc_z, 'kmeans_clusters', K=K )
Exercise (k-means)
Here we try to replicate what scikit-learn does internally using simple Python.
- Choose K initial centroids (randomly picked genes)
- Assign each gene to the closest centroid
- Update centroids as the mean of assigned genes
- Repeat until stable
The most important calculation will be the Euclidean distance.
We need to repeatedly compute the Euclidean distances between one centroid and all genes.
Implement a dist_to_centroid(centroid, mat) function.
Peek only after you have finished your own.
## I am sure yours performs way better!
def dist(vec_a, vec_b):
    """
    Calculate the Euclidean distance between two numpy arrays.
    """
    if len(vec_a) != len(vec_b):
        raise ValueError("Vectors must have the same length.")
    total = 0.0  # do not shadow the built-in sum()
    for a, b in zip(vec_a, vec_b):  # numpy arrays have no .zip() method
        total += (a - b) ** 2
    return total ** 0.5
def dist_to_centroid(centroid, mat):
    """
    Calculate the Euclidean distance between one centroid and every
    row of a matrix of genes.
    """
    dists = []
    for i in range(mat.shape[0]):
        dists.append(dist(centroid, mat[i]))
    return dists
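The loop above is easy to read but slow in pure Python. In the spirit of the performance note at the top of this section, a vectorized alternative (same result, broadcasting instead of loops) could look like this:

```python
import numpy as np

def dist_to_centroid_fast(centroid, mat):
    # Euclidean distance between one centroid and every row of mat,
    # computed in one vectorized step: broadcasting subtracts the
    # centroid from each row, then we sum squares along each row
    diff = np.asarray(mat) - np.asarray(centroid)
    return np.sqrt((diff ** 2).sum(axis=1))
```

Both versions return the same numbers; the vectorized one simply pushes the per-element work down into numpy.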
And finally we need to get randomness into our scripts. Normally you would use a random number generator from the standard library, but as we are already using numpy we should probably take the one from there:
rng = np.random.default_rng(seed)
With this generator we can then, for example, pick a random set of k ids from a range of n ids:
idx = rng.choice(np.arange(1,11), size=k, replace=False)
And the loop should contain these steps:
1. Pick k random genes and use them as the initial centroids
2. Compare each centroid to each gene and find, for each gene, the closest centroid
3. Recalculate the centroids as the mean of their assigned genes
4. If the new centroids look like the old ones, break the loop
5. Otherwise use the new centroids and restart the loop
Really - do not peek - use this opportunity to dig into this problem!
def kmeans(data, k, maxiter, seed):
    """
    Clusters rows using the k-means algorithm
    """
    data_np = np.asarray(data)
    ## define a reproducible 'random' state and get the initial centroids as random genes
    rng = np.random.default_rng(seed)
    idx = rng.choice(np.arange(len(data_np)), size=k, replace=False)
    centroids = data_np[idx]
    ## the main loop
    n = len(data_np)
    for it in range(maxiter):
        # 1) Distance table D: rows=genes, cols=centroids
        D = np.zeros((n, k), dtype=float)
        for c in range(k):
            dists = dist_to_centroid(centroids[c], data_np)
            for i in range(n):
                D[i, c] = dists[i]
        # 2) Assign each gene to nearest centroid
        labels = np.zeros(n, dtype=int)
        for i in range(n):
            labels[i] = int(np.argmin(D[i, :]))
        # 3) Update centroids (mean of assigned points)
        new_centroids = centroids.copy()
        for c in range(k):
            members = data_np[labels == c]
            if len(members) > 0:  # avoid empty cluster crash
                new_centroids[c] = np.mean(members, axis=0)
            else:
                # re-seed empty centroid to a random point (simple, explicit)
                new_centroids[c] = data_np[rng.integers(0, n)]
        # 4) Stop if centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids, D
The Final Project: Simulated annealing clustering
Your final project will combine everything you learned to implement a clustering algorithm using a simulated annealing / Metropolis-Hastings style procedure.
This is a great exercise because:
- it forces you to break a problem into smaller functions
- it uses loops and conditionals in a meaningful way
- it links biology (clusters of genes) with a real optimisation method
Optimisation problem
We want clusters to be compact: genes in the same cluster should be similar.
We already know the Euclidean distance between two genes (vectors) across timepoints:

\[
d(x, y) = \sqrt{\sum_{t} (x_t - y_t)^2}
\]

An energy function measures “how bad” a clustering is. One simple idea is the average within-cluster pairwise distance.

For clusters \(C_1, \ldots, C_K\):

\[
E = \frac{1}{K} \sum_{k=1}^{K} \frac{2}{|C_k|\,(|C_k| - 1)} \sum_{\substack{i < j \\ i, j \in C_k}} d(x_i, x_j)
\]

Lower energy means tighter clusters.
Rescaling step (important)
Before starting simulated annealing, rescale each gene row so values lie between 0 and 1:
def minmax_rows(mat):
    # scale each row to [0, 1]: subtract the row minimum, divide by the row range
    mn = mat.min(axis=1, keepdims=True)
    mx = mat.max(axis=1, keepdims=True)
    return (mat - mn) / (mx - mn)

ycc_mm = ycc.copy()
ycc_mm['expression'] = minmax_rows(ycc['expression'])
ycc_mm['expression'][:3]
Simulated annealing algorithm (high level)
You will need:
- Initial temperature `Temp`
- Cooling factor `cool` (e.g. 0.999 or 0.995)
- Number of clusters `K`
- Number of iterations `I`
Pseudo-code:
- Randomly assign each gene to one of K clusters
- For iteration 1..I:
  - Compute the current energy `E_old`
  - Randomly pick one gene and propose moving it to a different cluster
  - Compute the new energy `E_new`
  - Accept the move if `E_new < E_old`, or if `exp(-(E_new - E_old)/Temp) > U(0,1)`
  - Cool the system: `Temp = Temp * cool`
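The acceptance step is the heart of the procedure. A minimal sketch of the Metropolis criterion (the function and variable names here are illustrations, not part of the required interface):

```python
import numpy as np

rng = np.random.default_rng(0)

def accept_move(E_new, E_old, Temp):
    # Metropolis criterion: always accept improvements; accept a worse
    # clustering with probability exp(-(E_new - E_old) / Temp)
    if E_new < E_old:
        return True
    return np.exp(-(E_new - E_old) / Temp) > rng.random()
```

As `Temp` shrinks over the iterations, the chance of accepting a worse move drops towards zero, so the search gradually turns greedy.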
What I mean here: create a script that gets these values (`Temp`, `cool`, `K`, `I`) from the user. You should search for how this would work.
Things to do (project checklist)
- Implement an energy function `energy(assignments, dist_matrix, K)`
- Implement a function that proposes a move (gene -> different cluster)
- Implement the simulated annealing loop
- Print the starting energy `E_start` and the final energy `E_final`
- Plot the final clusters (like we did for k-means)
Inspiration (optional reading)
Before you start:
- Read about simulated annealing / Metropolis acceptance
- Remember: the goal is learning, not competing with libraries
You now have the tools to tackle a real algorithmic problem in Python. Use Google, but no AI tools, to learn e.g. how to accept user input in a Python script.
Performance
Once you have your script up and running and the results look acceptable (plots!) and you still have some juice left to program more:
- Store the distance matrix and do not recalculate distances on every iteration
- Store the per-cluster energies and only recalculate the ones affected by the move
- Only calculate a delta energy and do not "touch" the stable genes in the clusters
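To make the "delta energy" idea concrete, here is a toy illustration (hypothetical data, not the project solution): keep a running sum of within-cluster pairwise distances and, when one gene leaves a cluster, subtract only the terms that gene was involved in, instead of recomputing everything:

```python
import numpy as np

# toy symmetric distance matrix for 6 "genes"
rng = np.random.default_rng(0)
D = rng.random((6, 6))
D = (D + D.T) / 2
np.fill_diagonal(D, 0.0)

members = [0, 1, 2, 3]  # genes currently in one cluster
total = sum(D[i, j] for i in members for j in members if i < j)

# gene 3 leaves the cluster: subtract only its pairwise terms
gone = 3
total -= sum(D[gone, j] for j in members if j != gone)
members.remove(gone)

# a full recomputation gives the same value as the delta update
check = sum(D[i, j] for i in members for j in members if i < j)
```

The same trick works in reverse when a gene joins a cluster: add only its new pairwise terms. This is the kind of bookkeeping that turns an unusably slow annealing loop into a fast one.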
Outlook
When your simulated annealing functionality works, add it to your functions.py file. And if you still want to look further into Python (great!), you could ask e.g. ChatGPT to create a Python package from this functions.py file. It should come up with a fairly advanced package structure and walk you all the way from your functions file to a package that you can install using pip and import like any other Python package!