Goal of this section
By the end of this section you will be able to:
- Define your own functions
- Pass arguments into functions
- Return results from functions
- Handle incorrect input using errors
- Write small reusable functions that work with our data dictionary model
Functions let us reuse code and make programs easier to read.
Our data model (reminder)
In previous sections we used a dictionary that stores:
- data["expression"]: numpy ndarray (genes × samples)
- data["genes"]: DataFrame (gene metadata)
- data["samples"]: DataFrame (sample metadata, includes cluster)
We will now write functions that work with this structure.
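If you do not have the data from the previous sections loaded, a small stand-in dictionary is enough to try the functions below. This is only a sketch: the gene names, sample names, the `biotype` column and all numbers are made up for illustration and are not the course data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 genes x 4 samples (all names and values are made up).
gene_names = ["Gata1", "Spi1", "Cebpa"]
sample_names = ["s1", "s2", "s3", "s4"]

data = {
    # rows = genes, columns = samples
    "expression": np.array([
        [5.0, 6.0, 0.0, 1.0],
        [0.0, 1.0, 7.0, 8.0],
        [2.0, 2.0, 3.0, 3.0],
    ]),
    # gene metadata, indexed by gene name
    "genes": pd.DataFrame({"biotype": ["protein_coding"] * 3}, index=gene_names),
    # sample metadata, indexed by sample name, with a 'cluster' column
    "samples": pd.DataFrame({"cluster": [1, 1, 2, 2]}, index=sample_names),
}

print(data["expression"].shape)  # (3, 4)
```

The row order of the matrix follows the index of data["genes"], and the column order follows the index of data["samples"].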
Defining a function
A function is a named block of code.
def square(x):
    return x * x
Call the function:
square(4)
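The value after `return` is handed back to the caller, so it can be stored in a variable or passed straight into another call. A minimal sketch (the variable name `result` is just an example):

```python
def square(x):
    return x * x

result = square(4)      # the returned value is stored in 'result'
print(result)           # 16
print(square(result))   # 256
print(square(2.5))      # 6.25
```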
A first useful function: get one gene across all samples
Goal: return the expression values for one gene.
def get_gene(data, gene):
    """
    Return expression values for one gene across all samples.
    """
    g_idx = data["genes"].index.get_loc(gene)
    return data["expression"][g_idx]
Try it:
get_gene(data, "Gata1")
Handling missing genes (raise an error)
What happens if the gene does not exist?
get_gene(data, "NotAGene")
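With this first version, the lookup itself fails: pandas' `Index.get_loc` raises a `KeyError` for a label it does not know. The sketch below reproduces that on a tiny, made-up gene table (the `biotype` column is only a placeholder):

```python
import pandas as pd

# Hypothetical gene table with two genes
genes = pd.DataFrame({"biotype": ["protein_coding"] * 2}, index=["Gata1", "Spi1"])

try:
    genes.index.get_loc("NotAGene")
except KeyError as err:
    caught = err  # keep a reference; 'err' is cleared when the except block ends
    print("KeyError:", caught)
```

The message names the missing label but gives no hint about which gene names would have been valid.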
Better: check first and raise a helpful error.
def get_gene(data, gene):
    """
    Return expression values for one gene across all samples.
    """
    if gene not in data["genes"].index:
        raise ValueError(
            f"Unknown gene '{gene}'. Expected one of: {list(data['genes'].index)}"
        )
    g_idx = data["genes"].index.get_loc(gene)
    # Rows are genes, so select row g_idx (all samples)
    return data["expression"][g_idx]
Now:
get_gene(data, "NotAGene")
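A caller can also catch the error and react to it instead of letting the program stop. The following is a self-contained sketch under stated assumptions: the toy data is made up, and the checked version of get_gene is repeated so that the block runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (2 genes x 3 samples), values made up.
data = {
    "expression": np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]),
    "genes": pd.DataFrame(index=["Gata1", "Spi1"]),
    "samples": pd.DataFrame({"cluster": [1, 1, 2]}, index=["s1", "s2", "s3"]),
}

def get_gene(data, gene):
    """Return expression values for one gene across all samples."""
    if gene not in data["genes"].index:
        raise ValueError(
            f"Unknown gene '{gene}'. Expected one of: {list(data['genes'].index)}"
        )
    g_idx = data["genes"].index.get_loc(gene)
    return data["expression"][g_idx]

try:
    values = get_gene(data, "NotAGene")
except ValueError as err:
    message = str(err)
    print("Could not fetch gene:", message)
```

Listing every valid gene is convenient for a small table like this one; for a table with thousands of genes you would probably shorten the message.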
Functions with multiple arguments: subset samples by cluster
Goal: get the sample names that belong to a given cluster.
def samples_in_cluster(data, cluster_id):
    """
    Return a list of sample names belonging to cluster_id.
    """
    samples = data["samples"]
    if "cluster" not in samples.columns:
        raise ValueError("data['samples'] must contain a column named 'cluster'.")
    mask = samples["cluster"] == cluster_id
    return list(samples.index[mask])
Try it:
samples_in_cluster(data, 1)
Exercise:
Create a function
samples_in(data, col_name, value)
that returns the sample names where the column col_name
in the sample table equals value.
??? Do not peek before yours works ;-)
```python
def samples_in(data, col_name, value):
    """
    Return a list of sample names where column col_name equals value.
    """
    samples = data["samples"]
    if col_name not in samples.columns:
        raise ValueError(f"data['samples'] must contain a column named '{col_name}'.")
    mask = samples[col_name] == value
    return list(samples.index[mask])
```
Returning multiple results
Sometimes we want to return several values. A simple way is to return a dictionary.
Example: compute mean expression for each gene and each sample.
def mean_expression_summary(data):
    """
    Return mean expression per gene and per sample.
    """
    expr = data["expression"]
    gene_means = []
    for gene in range(expr.shape[0]):
        gene_means.append(expr[gene, :].mean())
    sample_means = []
    for sample in range(expr.shape[1]):
        sample_means.append(expr[:, sample].mean())
    return {
        "gene_means": gene_means,
        "sample_means": sample_means
    }
Use it:
summary = mean_expression_summary(data)
summary["gene_means"]
summary["sample_means"]
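The two loops can also be replaced by NumPy's `mean` with an `axis` argument. The sketch below is a hypothetical variant (the name `mean_expression_summary_np` is not part of the course code) that returns the same two keys, but as NumPy arrays instead of lists:

```python
import numpy as np

def mean_expression_summary_np(data):
    """Hypothetical vectorized variant: same keys, NumPy arrays instead of lists."""
    expr = data["expression"]
    return {
        "gene_means": expr.mean(axis=1),    # one mean per row (gene)
        "sample_means": expr.mean(axis=0),  # one mean per column (sample)
    }

# Toy matrix: 2 genes x 3 samples (made-up values)
toy = {"expression": np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])}
summary = mean_expression_summary_np(toy)
print(summary["gene_means"])    # [2. 5.]
print(summary["sample_means"])  # [2.5 3.5 4.5]
```

Returning a dictionary means the caller picks results by name, so further keys can be added later without breaking existing code.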
Conditional logic in functions (with useful errors)
Example: choose how to compute a summary.
def summarize_gene(data, gene, method="mean"):
    """
    Summarize one gene across samples using a chosen method.
    method: 'mean' or 'max'
    """
    v = get_gene(data, gene)
    if method == "mean":
        return v.mean()
    elif method == "max":
        return v.max()
    else:
        raise ValueError(
            f"Unknown method '{method}'. Expected 'mean' or 'max'."
        )
Try:
summarize_gene(data, "Gata1", method="mean")
summarize_gene(data, "Gata1", method="max")
summarize_gene(data, "Gata1", method="median")
Exercise 1: Check the data model
Goal
Write a function:
check_data_model(data)
It should raise a helpful error if:
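- One of the keys is missing: "expression", "genes", "samples"
- The objects have the wrong type: data["expression"] must be a numpy ndarray; data["genes"] and data["samples"] must be pandas DataFrames.
- The shapes do not match the data model:
  - the number of rows of data["expression"] must match the length of data["genes"].index
  - the number of columns of data["expression"] must match the length of data["samples"].index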
Solution: check_data_model
import numpy as np
import pandas as pd

def check_data_model(data):
    required = ["expression", "samples", "genes"]
    # Check keys
    for key in required:
        if key not in data:
            raise ValueError(f"Missing key '{key}' in data dictionary.")
    expr = data["expression"]
    genes = data["genes"]
    samples = data["samples"]
    # Check types
    if not isinstance(expr, np.ndarray):
        raise ValueError("'expression' must be a numpy ndarray.")
    if not isinstance(genes, pd.DataFrame):
        raise ValueError("'genes' must be a pandas DataFrame.")
    if not isinstance(samples, pd.DataFrame):
        raise ValueError("'samples' must be a pandas DataFrame.")
    # Check that the matrix shape matches the metadata tables
    if expr.shape[0] != len(genes.index):
        raise ValueError(
            f"Shape[0] of 'expression' must match index length of 'genes' ({expr.shape[0]} != {len(genes.index)})."
        )
    if expr.shape[1] != len(samples.index):
        raise ValueError(
            f"Shape[1] of 'expression' must match index length of 'samples' ({expr.shape[1]} != {len(samples.index)})."
        )
Why this matters for bioinformatics
Bioinformatics scripts often:
- take user input (gene names, thresholds, cluster IDs)
- process tables and metadata together
- produce plots and output files
Functions make it possible to:
- structure code
- test small pieces
- reuse analysis steps
- fail early with clear error messages
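As an illustration of "test small pieces", a function can be checked on a tiny hand-made table where the right answer is obvious. This is a minimal sketch using plain `assert` statements; the toy table and the helper name `test_samples_in_cluster` are made up, and samples_in_cluster is repeated so that the block runs on its own.

```python
import pandas as pd

def samples_in_cluster(data, cluster_id):
    """Return a list of sample names belonging to cluster_id."""
    samples = data["samples"]
    if "cluster" not in samples.columns:
        raise ValueError("data['samples'] must contain a column named 'cluster'.")
    mask = samples["cluster"] == cluster_id
    return list(samples.index[mask])

def test_samples_in_cluster():
    # Hypothetical toy sample table (names and cluster labels are made up)
    toy = {"samples": pd.DataFrame({"cluster": [1, 1, 2]}, index=["s1", "s2", "s3"])}
    assert samples_in_cluster(toy, 1) == ["s1", "s2"]
    assert samples_in_cluster(toy, 2) == ["s3"]
    assert samples_in_cluster(toy, 99) == []  # unknown cluster: empty list, no error

    # A table without a 'cluster' column should fail early
    broken = {"samples": pd.DataFrame(index=["s1"])}
    try:
        samples_in_cluster(broken, 1)
    except ValueError:
        pass
    else:
        raise AssertionError("expected a ValueError")

test_samples_in_cluster()
print("all checks passed")
```

Because the function takes the data dictionary as an argument, the test can hand it a three-row table instead of the full data set.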
In the next section, we will learn how to save and load our module safely.