title: "Saving and loading our dataset"

Goal of this section

By the end of this section you will be able to:

  • Save our dictionary-based dataset (data) to disk in a safe way
  • Load it back from disk into the same structure
  • Write functions that fail early with helpful error messages

We will store the three pandas tables inside one folder:

  • expression.tsv
  • genes.tsv
  • samples.tsv

Import libraries

import pandas as pd
import numpy as np
from pathlib import Path

Our data model (reminder)

We use a dictionary with three tables:

  • data["expression"] : DataFrame (genes × samples)
  • data["genes"] : DataFrame (gene metadata)
  • data["samples"] : DataFrame (sample metadata, includes cluster)
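
To make this concrete, here is a tiny toy dataset in exactly that shape (the gene and sample names are made up, just for illustration):

```python
import pandas as pd
import numpy as np

# Two genes × three samples, with made-up names
expression = pd.DataFrame(
    np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]),
    index=["geneA", "geneB"],
    columns=["S1", "S2", "S3"],
)
genes = pd.DataFrame({"symbol": ["A", "B"]}, index=["geneA", "geneB"])
samples = pd.DataFrame({"cluster": [0, 1, 0]}, index=["S1", "S2", "S3"])

data = {"expression": expression, "genes": genes, "samples": samples}
print(data["expression"].shape)
# → (2, 3)
```

Note that the row labels of expression match the index of genes, and its column labels match the index of samples.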

Saving data safely

Rules

When saving to a folder:

  1. If the folder does not exist: create it
  2. If any of the files we want to write already exists: fail with an error
  3. Write the three tables as .tsv (tab-separated) files

Function

def save_data(data, folder):
    """
    Save the dataset dictionary into a folder as TSV files.

    Creates the folder if it does not exist.
    Fails if any of the output files already exist.
    """
    # --- basic checks ---
    check_data_model(data)

    out_dir = Path(folder)

    # Create folder if needed
    out_dir.mkdir(parents=True, exist_ok=True)

    # Target filenames (fixed)
    paths = {
        "expression": out_dir / "expression.tsv",
        "genes": out_dir / "genes.tsv",
        "samples": out_dir / "samples.tsv",
    }

    # Fail if any file already exists (to avoid overwriting)
    existing = [str(p) for p in paths.values() if p.exists()]
    if len(existing) > 0:
        raise FileExistsError(
            "Refusing to overwrite existing file(s): " + ", ".join(existing)
        )

    # Write TSV files
    data["expression"].to_csv(paths["expression"], sep="\t", index=True, header=True)
    data["genes"].to_csv(paths["genes"], sep="\t", index=True, header=True)
    data["samples"].to_csv(paths["samples"], sep="\t", index=True, header=True)

    return str(out_dir)

Try it:

save_data(data, "my_dataset")

Check the folder in your file browser.
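
You can also inspect the folder from Python with pathlib. A minimal, self-contained sketch (it writes three empty placeholder TSVs into a temporary folder so it runs anywhere; with your real data you would point Path at "my_dataset"):

```python
import tempfile
from pathlib import Path
import pandas as pd

# Stand-in for the folder save_data created
out_dir = Path(tempfile.mkdtemp()) / "my_dataset"
out_dir.mkdir()
for name in ("expression", "genes", "samples"):
    pd.DataFrame().to_csv(out_dir / f"{name}.tsv", sep="\t")

# The three expected files should be there
print(sorted(p.name for p in out_dir.iterdir()))
# → ['expression.tsv', 'genes.tsv', 'samples.tsv']
```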


Exercise: Load the dataset back

Goal

Write a function:

data = load_data(folder)

It should:

  1. Take only the folder name (a string)
  2. Look for the three files inside that folder:
     • expression.tsv
     • genes.tsv
     • samples.tsv
  3. If the folder does not exist: raise an error
  4. If any of the files is missing: raise an error that lists what is missing
  5. Read the files back into pandas DataFrames
  6. Return the dataset dictionary with keys "expression", "genes", "samples"

Hints

  • Use Path(folder)
  • Use .exists() to check if something is on disk
  • Use pd.read_csv(..., sep="\t", index_col=0) to read a TSV table with row names
  • Your function should raise errors instead of returning partial results
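
The index_col=0 hint matters: it is what restores the row names that to_csv wrote out as the first column. A quick round-trip through an in-memory buffer shows it (uses io.StringIO so no files are needed):

```python
import io
import pandas as pd

df = pd.DataFrame({"cluster": [0, 1]}, index=["S1", "S2"])
tsv_text = df.to_csv(sep="\t")  # index is written as the first column

# Without index_col=0 the row names would come back as an ordinary column
back = pd.read_csv(io.StringIO(tsv_text), sep="\t", index_col=0)
print(list(back.index))
# → ['S1', 'S2']
```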

Solution: load_data(folder)
def load_data(folder):
    """
    Load the dataset dictionary from a folder of TSV files.

    Fails if the folder or any required file is missing.
    """
    in_dir = Path(folder)

    # Folder must exist
    if not in_dir.exists():
        raise FileNotFoundError(f"Folder not found: {in_dir}")

    if not in_dir.is_dir():
        raise NotADirectoryError(f"Not a folder: {in_dir}")

    # Required files (fixed)
    paths = {
        "expression": in_dir / "expression.tsv",
        "genes": in_dir / "genes.tsv",
        "samples": in_dir / "samples.tsv",
    }

    # Check missing files
    missing = [name for name, p in paths.items() if not p.exists()]
    if len(missing) > 0:
        raise FileNotFoundError(
            f"Missing file(s) in {in_dir}: " + ", ".join(missing)
        )

    # Read tables (index_col=0 restores the row names)
    expr = pd.read_csv(paths["expression"], sep="\t", header=0, index_col=0)
    genes = pd.read_csv(paths["genes"], sep="\t", header=0, index_col=0)
    samples = pd.read_csv(paths["samples"], sep="\t", header=0, index_col=0)

    data = {
        "expression": expr,
        "genes": genes,
        "samples": samples,
    }

    check_data_model(data)

    return data

Test your load function

After saving:

data2 = load_data("my_dataset")

Check that it looks correct:

def data_equal(d1, d2):
    # np.allclose compares values only; .equals also checks labels and dtypes
    expr_ok = np.allclose(d1["expression"], d2["expression"])
    genes_ok = d1["genes"].equals(d2["genes"])
    samples_ok = d1["samples"].equals(d2["samples"])

    return expr_ok and genes_ok and samples_ok

data_equal(data, data2)
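
If you want a fully self-contained round-trip check, this sketch saves one toy table to a temporary folder with to_csv, reads it back, and compares it the same way data_equal does:

```python
import tempfile
from pathlib import Path
import numpy as np
import pandas as pd

folder = Path(tempfile.mkdtemp())
expr = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0]], index=["geneA", "geneB"], columns=["S1", "S2"]
)
expr.to_csv(folder / "expression.tsv", sep="\t")

# Read it back and confirm values and row names survived the round trip
expr2 = pd.read_csv(folder / "expression.tsv", sep="\t", index_col=0)
print(np.allclose(expr, expr2) and expr.index.equals(expr2.index))
# → True
```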

Why this matters for bioinformatics

Real workflows must be reproducible.

Safe save/load functions help you:

  • avoid accidentally overwriting results
  • catch missing inputs early
  • share datasets between scripts and notebooks
  • move work from your laptop to an HPC system

In the next section, we will learn how to speed up some calculations by using built-in vectorized functions.