# Saving and loading our dataset
## Goal of this section

By the end of this section you will be able to:

- Save our dictionary-based dataset (`data`) to disk in a safe way
- Load it back from disk into the same structure
- Write functions that fail early with helpful error messages

We will store the three pandas tables inside one folder:

- `expression.tsv`
- `genes.tsv`
- `samples.tsv`
## Import libraries

```python
import pandas as pd
import numpy as np
from pathlib import Path
```
## Our data model (reminder)

We use a dictionary with three tables:

- `data["expression"]`: DataFrame (genes × samples)
- `data["genes"]`: DataFrame (gene metadata)
- `data["samples"]`: DataFrame (sample metadata, includes `cluster`)
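If you are working through this section without the state from the earlier notebook, a toy `data` dictionary in the same shape might look like this (all gene, sample, and column names here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Expression matrix: rows = genes, columns = samples
expression = pd.DataFrame(
    np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]),
    index=["gene1", "gene2"],
    columns=["s1", "s2", "s3"],
)
# Gene metadata, indexed by gene
genes = pd.DataFrame({"symbol": ["ABC1", "XYZ2"]}, index=["gene1", "gene2"])
# Sample metadata, indexed by sample, includes a cluster label
samples = pd.DataFrame({"cluster": [0, 1, 0]}, index=["s1", "s2", "s3"])

data = {"expression": expression, "genes": genes, "samples": samples}
```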
## Saving `data` safely

### Rules

When saving to a folder:

- If the folder does not exist: create it
- If any of the files we want to write already exists: fail with an error
- Write the three tables as `.tsv` (tab-separated) files
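The saving function below calls `check_data_model`, which was written in an earlier section. If you do not have it in your session, a minimal sketch that enforces the dictionary shape could look like this (the exact checks in the course version may differ):

```python
import pandas as pd

def check_data_model(data):
    """Fail early if `data` is not a dict with the three expected DataFrames."""
    if not isinstance(data, dict):
        raise TypeError("data must be a dict")
    for key in ("expression", "genes", "samples"):
        if key not in data:
            raise KeyError(f"Missing key: {key!r}")
        if not isinstance(data[key], pd.DataFrame):
            raise TypeError(f"data[{key!r}] must be a pandas DataFrame")
```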
### Function

```python
def save_data(data, folder):
    """
    Save the dataset dictionary into a folder as TSV files.

    Creates the folder if it does not exist.
    Fails if any of the output files already exist.
    """
    # --- basic checks ---
    check_data_model(data)

    out_dir = Path(folder)

    # Create the folder if needed
    out_dir.mkdir(parents=True, exist_ok=True)

    # Target filenames (fixed)
    paths = {
        "expression": out_dir / "expression.tsv",
        "genes": out_dir / "genes.tsv",
        "samples": out_dir / "samples.tsv",
    }

    # Fail if any file already exists (to avoid overwriting)
    existing = [str(p) for p in paths.values() if p.exists()]
    if len(existing) > 0:
        raise FileExistsError(
            "Refusing to overwrite existing file(s): " + ", ".join(existing)
        )

    # Write TSV files; index and header preserve the row and column names
    data["expression"].to_csv(paths["expression"], sep="\t", index=True, header=True)
    data["genes"].to_csv(paths["genes"], sep="\t", index=True, header=True)
    data["samples"].to_csv(paths["samples"], sep="\t", index=True, header=True)

    return str(out_dir)
```
Try it:

```python
save_data(data, "my_dataset")
```
Check the folder in your file browser.
## Exercise: Load the dataset back

### Goal

Write a function:

```python
data = load_data(folder)
```

It should:

- Take only the folder name (string)
- Look for the three files inside that folder: `expression.tsv`, `genes.tsv`, `samples.tsv`
- If the folder does not exist: raise an error
- If any of the files is missing: raise an error that lists what is missing
- Read the files back into pandas DataFrames
- Return the dataset dictionary with keys `"expression"`, `"genes"`, `"samples"`
### Hints

- Use `Path(folder)`
- Use `.exists()` to check if something is on disk
- Use `pd.read_csv(..., sep="\t", index_col=0)` to read a TSV table with row names
- Your function should raise errors instead of returning partial results
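As a warm-up, the `index_col=0` hint can be tried in isolation: writing a tiny table and reading it back should reproduce the row names exactly. The table contents and the temporary folder here are made up for illustration:

```python
import tempfile
from pathlib import Path
import pandas as pd

# A tiny hypothetical table with row names
df = pd.DataFrame({"cluster": [0, 1]}, index=["s1", "s2"])

# Write it as TSV into a temporary folder
folder = Path(tempfile.mkdtemp())
path = folder / "samples.tsv"
df.to_csv(path, sep="\t", index=True, header=True)

# index_col=0 turns the first column back into the row names
df2 = pd.read_csv(path, sep="\t", index_col=0)
```

With `index_col=0`, `df2` should be equal to `df`; without it, the row names would come back as an ordinary data column.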
### Solution: `load_data(folder)`

```python
def load_data(folder):
    in_dir = Path(folder)

    # The folder must exist and be a directory
    if not in_dir.exists():
        raise FileNotFoundError(f"Folder not found: {in_dir}")
    if not in_dir.is_dir():
        raise NotADirectoryError(f"Not a folder: {in_dir}")

    # Required files (fixed)
    paths = {
        "expression": in_dir / "expression.tsv",
        "genes": in_dir / "genes.tsv",
        "samples": in_dir / "samples.tsv",
    }

    # Fail early, listing every missing file
    missing = [name for name, p in paths.items() if not p.exists()]
    if len(missing) > 0:
        raise FileNotFoundError(
            f"Missing file(s) in {in_dir}: " + ", ".join(missing)
        )

    # Read tables; index_col=0 restores the row names
    expr = pd.read_csv(paths["expression"], sep="\t", header=0, index_col=0)
    genes = pd.read_csv(paths["genes"], sep="\t", header=0, index_col=0)
    samples = pd.read_csv(paths["samples"], sep="\t", header=0, index_col=0)

    data = {
        "expression": expr,
        "genes": genes,
        "samples": samples,
    }
    check_data_model(data)
    return data
```
## Test your load function

After saving:

```python
data2 = load_data("my_dataset")
```
Check that it looks correct:

```python
def data_equal(d1, d2):
    # allclose: floating-point values may change very slightly in the text round trip
    expr_ok = np.allclose(d1["expression"], d2["expression"])
    genes_ok = d1["genes"].equals(d2["genes"])
    samples_ok = d1["samples"].equals(d2["samples"])
    return expr_ok and genes_ok and samples_ok

data_equal(data, data2)
```
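If you want a check that does not depend on `my_dataset` already existing on disk, the same round trip can be rehearsed in a temporary folder. The `demo_data` dictionary and all names in it are made up for illustration; the save/load calls mirror the ones in the functions above:

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical toy dataset in the course's three-table shape
demo_data = {
    "expression": pd.DataFrame([[1.5, 2.5], [3.5, 4.5]],
                               index=["gene1", "gene2"], columns=["s1", "s2"]),
    "genes": pd.DataFrame({"symbol": ["ABC1", "XYZ2"]}, index=["gene1", "gene2"]),
    "samples": pd.DataFrame({"cluster": [0, 1]}, index=["s1", "s2"]),
}

# Save each table into a temporary folder, then read everything back
out_dir = Path(tempfile.mkdtemp())
for name, table in demo_data.items():
    table.to_csv(out_dir / f"{name}.tsv", sep="\t", index=True, header=True)

demo_data2 = {name: pd.read_csv(out_dir / f"{name}.tsv", sep="\t", index_col=0)
              for name in demo_data}

# The round trip should preserve values and row/column names
assert np.allclose(demo_data["expression"], demo_data2["expression"])
assert demo_data["genes"].equals(demo_data2["genes"])
assert demo_data["samples"].equals(demo_data2["samples"])
```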
## Why this matters for bioinformatics
Real workflows must be reproducible.
Safe save/load functions help you:
- avoid accidentally overwriting results
- catch missing inputs early
- share datasets between scripts and notebooks
- move work from your laptop to an HPC system
In the next section, we will learn how to speed up some calculations by using built-in vectorized functions.