Goal of this section
By the end of this section you will be able to:
- Select the most variable genes from an expression matrix
- Explain what variance means in biological terms
- Z-score (standardize) gene expression per gene (row-wise)
- Check scaling visually
This is a common preprocessing step before clustering and heatmaps.
Improved start-up
So far we have written many small helper functions directly inside our notebooks. That works for experiments, but it quickly becomes messy:
- You need to copy-paste functions between notebooks
- It is hard to reuse code
- Mistakes fixed in one notebook do not automatically get fixed in others
A better approach is to store your functions in a separate Python file and import them when needed.
Step 1: Create a functions file
In the Jupyter interface:
- Look at the file browser panel on the left.
- Click the blue “+” button.
- Under “Other”, choose “Python File”.
- Name the file: functions.py
Step 2: Collect your functions
Open functions.py and move all the following into it:
- All import statements
- All helper functions you created in previous sections
For example:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def dist(vec1, vec2):
    """
    Euclidean distance between two vectors.
    """
    diff = vec1 - vec2
    return np.sum(diff**2) ** 0.5

def get_top_variable_genes(data, top_n=500):
    """
    Return the top N most variable genes.
    """
    var = data.var(axis=1)
    top = var.sort_values(ascending=False).head(top_n)
    return data.loc[top.index]
You can keep adding functions to this file as the course continues.
Step 3: Import the functions into your notebook
At the top of your notebook, simply write:
from functions import *
This will:
- Load all libraries imported in functions.py
- Load all functions you defined there
- Make them available in your notebook
Important note (very useful in Jupyter)
If you change functions.py, Jupyter does not automatically reload it.
To reload it, run:
import importlib
import functions
importlib.reload(functions)
from functions import *
Or simply restart the kernel.
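To see why the reload is needed, here is a small self-contained sketch: it writes a stand-in module to a temporary directory (the module name myfuncs and its contents are made up for this demo), imports it, edits the file on disk, and shows that the change only takes effect after importlib.reload:

```python
import importlib
import pathlib
import sys
import tempfile

# Write a tiny module to disk (a stand-in for functions.py)
tmpdir = tempfile.mkdtemp()
mod_path = pathlib.Path(tmpdir) / "myfuncs.py"
mod_path.write_text("def answer():\n    return 1\n")

sys.path.insert(0, tmpdir)
import myfuncs

before_edit = myfuncs.answer()   # 1

# Edit the file on disk -- the already-imported module does NOT change...
mod_path.write_text("def answer():\n    return 222\n")
after_edit = myfuncs.answer()    # still 1

# ...until we reload it
importlib.reload(myfuncs)
after_reload = myfuncs.answer()  # 222

print(before_edit, after_edit, after_reload)
```

In Jupyter, the built-in autoreload extension (`%load_ext autoreload` followed by `%autoreload 2`) automates this for you.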
Why this is good practice
This approach:
- Keeps notebooks clean
- Encourages code reuse
- Makes debugging easier
- Is closer to how real projects are structured
Later, this idea naturally grows into:
- Python modules
- Packages
- Reusable analysis libraries
Import libraries
These are the libraries used by the functions described here.
When you add a new function to functions.py, make sure that
all the necessary libraries are also imported in that file.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Reminder: what is our data shape?
For the course data:
- rows = genes
- columns = samples (or cells)
So when we compute variance per gene, we operate along axis=1 (across columns).
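A small numpy sketch makes the axis choice concrete: with genes as rows, axis=1 collapses the columns and yields one value per gene.

```python
import numpy as np

# 2 genes x 3 samples
expr = np.array([[1.0, 2.0, 3.0],
                 [10.0, 10.0, 10.0]])

# axis=1 collapses the columns: one variance per gene (row)
print(expr.var(axis=1, ddof=1))         # [1. 0.]

# axis=0 would instead give one value per sample (column)
print(expr.var(axis=0, ddof=1).shape)   # (3,)
```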
Why do we select variable genes?
Most genes do not change much across samples. Genes with higher variability are more likely to:
- reflect biological differences between populations
- separate sample groups in clustering
- show interesting patterns in heatmaps
So we often focus on the top N most variable genes.
Select the top variable genes
def get_top_variable_genes(data, top_n=500):
    mat = data['expression']
    var = mat.var(axis=1, ddof=1)   # variance per row (gene)
    sorted_ids = np.argsort(var)    # gene indices sorted from small to large variance
    # take the last top_n indices (largest variance)
    return subset_genes(data, sorted_ids[-top_n:])
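This relies on the subset_genes helper from an earlier section. If you don't have it at hand, a minimal sketch might look like the following; the exact keys of the data dict ('expression' as a numpy array, 'genes' and 'samples' as DataFrames) are assumed to match what we use in this course:

```python
import numpy as np
import pandas as pd

def subset_genes(data, ids):
    """Return a new data dict containing only the genes at positions `ids`."""
    out = data.copy()                               # shallow copy of the dict
    out['expression'] = data['expression'][ids, :]  # keep selected rows
    out['genes'] = data['genes'].iloc[ids]          # keep matching annotation
    return out

# Toy data dict: 3 genes x 2 samples
data = {
    'expression': np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]),
    'genes': pd.DataFrame(index=['g1', 'g2', 'g3']),
    'samples': pd.DataFrame(index=['s1', 's2']),
}

small = subset_genes(data, np.array([0, 2]))
print(small['expression'].shape)       # (2, 2)
print(small['genes'].index.tolist())   # ['g1', 'g3']
```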
Use it:
hspc_var = get_top_variable_genes(hspc_data, top_n=500)
hspc_var['expression'].shape
Check the value range
hspc_var['expression'].min(), hspc_var['expression'].max()
Z-score scaling (standardization)
For clustering and heatmaps, it is often useful to scale each gene so that:
- mean = 0
- standard deviation = 1
For a gene (row) with values \(x_1, x_2, ..., x_n\), the z-score of each value \(x_i\) is:

\[ z_i = \frac{x_i - \mu}{\sigma} \]

where:
- \(\mu\) is the mean of the gene across samples
- \(\sigma\) is the standard deviation of the gene across samples
This makes genes comparable even if they have very different expression ranges.
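A worked numeric example: for the gene values \([2, 4, 6]\), the mean is 4 and the sample standard deviation is 2, so the z-scores are \([-1, 0, 1]\).

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0])   # one gene across three samples
mu = x.mean()                    # 4.0
sigma = x.std(ddof=1)            # 2.0 (sample standard deviation)
z = (x - mu) / sigma
print(z)                         # [-1.  0.  1.]
```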
A fast z-score function (vectorized)
def zscore_rows(mat):
    m = np.mean(mat, axis=1, keepdims=True)
    s = np.std(mat, axis=1, ddof=1, keepdims=True)
    return (mat - m) / s
Run it:
hspc_zs = hspc_var.copy()
hspc_zs['expression'] = zscore_rows(hspc_var['expression'])
hspc_zs['expression'].shape
Quick sanity check
After z-scoring, each gene should have:
- mean ~ 0
- std ~ 1
Check the first few genes:
print(hspc_zs['expression'].mean(axis=1)[:5])
print(hspc_zs['expression'].std(axis=1, ddof=1)[:5])
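The same check can be made programmatic with np.allclose, shown here on a toy matrix so the snippet is self-contained:

```python
import numpy as np

def zscore_rows(mat):
    m = np.mean(mat, axis=1, keepdims=True)
    s = np.std(mat, axis=1, ddof=1, keepdims=True)
    return (mat - m) / s

mat = np.array([[1.0, 2.0, 3.0],
                [10.0, 20.0, 60.0]])
z = zscore_rows(mat)

# every row should now have mean ~0 and std ~1
assert np.allclose(z.mean(axis=1), 0.0)
assert np.allclose(z.std(axis=1, ddof=1), 1.0)
print("z-scoring OK")
```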
Visual check: boxplots before and after scaling
There is a boxplot function in pandas, but our dict stores the data as a numpy array. The easiest fix here is to define one more function that converts our dict to a pandas DataFrame - the reverse of our from_df function from earlier:
def expression_df(data):
    return pd.DataFrame(
        data["expression"],
        columns=data["samples"].index,
        index=data["genes"].index,
    )
Before z-scoring:
expression_df(hspc_var).T.boxplot(rot=90)
plt.title("Before scaling (raw values)")
plt.show()
After:
expression_df(hspc_zs).T.boxplot(rot=90)
plt.title("After scaling (z-scores)")
plt.show()
Exercise
We used expression_df(hspc_var).T there, but we likely also need a transpose for our own data structure.
- Implement a transpose() function that returns a transposed data structure.
- Change the zscore_rows() function to accept our dict. Also change the function to z-score in place. We could store the mean and std per gene in the data dict, too.
Why this matters for bioinformatics
Z-scoring makes clustering focus on patterns rather than absolute expression levels.
For example, two genes might have different expression magnitudes but similar behaviour across samples:
- gene A: low counts but increases in the same samples as gene B
- gene B: high counts but the same pattern
Z-scoring helps the clustering algorithm see the shared pattern.
In the next section, we will cluster genes and visualise them using dendrograms and heatmaps.