title: "Saving and loading our dataset"

Goal of this section

By the end of this section you will be able to:

  • Save our dictionary-based dataset (data) to disk in a safe way
  • Load it back from disk into the same structure
  • Write functions that fail early with helpful error messages

We will store the three pandas tables inside one folder:

  • expression.tsv
  • genes.tsv
  • samples.tsv

Import libraries

import pandas as pd
import numpy as np
from pathlib import Path

Our data model (reminder)

We use a dictionary with three tables:

  • data["expression"] : DataFrame (genes × samples)
  • data["genes"] : DataFrame (gene metadata)
  • data["samples"] : DataFrame (sample metadata, includes cluster)
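
To make this concrete, here is a tiny toy dataset in exactly that shape (the gene and sample names are made up, just for illustration):

```python
import pandas as pd
import numpy as np

# Two genes × three samples, with made-up names
expression = pd.DataFrame(
    np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]),
    index=["geneA", "geneB"],
    columns=["S1", "S2", "S3"],
)
genes = pd.DataFrame({"symbol": ["A", "B"]}, index=["geneA", "geneB"])
samples = pd.DataFrame({"cluster": [0, 1, 0]}, index=["S1", "S2", "S3"])

data = {"expression": expression, "genes": genes, "samples": samples}
print(data["expression"].shape)
# → (2, 3)
```

Note that the row labels of expression match the index of genes, and its column labels match the index of samples.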

Saving data safely

Rules

When saving to a folder:

  1. If the folder does not exist: create it
  2. If any of the files we want to write already exists: fail with an error
  3. Write the three tables as .tsv (tab-separated) files

Function

def save_data(data, folder):
    """
    Save the dataset dictionary into a folder as TSV files.

    Creates the folder if it does not exist.
    Fails if any of the output files already exist.
    """
    # --- basic checks ---
    check_data_model(data)

    out_dir = Path(folder)

    # Create folder if needed
    out_dir.mkdir(parents=True, exist_ok=True)

    # Target filenames (fixed)
    paths = {
        "expression": out_dir / "expression.tsv",
        "genes": out_dir / "genes.tsv",
        "samples": out_dir / "samples.tsv",
    }

    # Fail if any file already exists (to avoid overwriting)
    existing = [str(p) for p in paths.values() if p.exists()]
    if len(existing) > 0:
        raise FileExistsError(
            "Refusing to overwrite existing file(s): " + ", ".join(existing)
        )

    # Write TSV files
    data["expression"].to_csv(paths["expression"], sep="\t", index=True, header=True)
    data["genes"].to_csv(paths["genes"], sep="\t", index=True, header=True)
    data["samples"].to_csv(paths["samples"], sep="\t", index=True, header=True)

    return str(out_dir)

Try it:

save_data(data, "my_dataset")

Check the folder in your file browser.
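
You can also inspect the folder from Python with pathlib. A minimal, self-contained sketch (it writes three empty placeholder TSVs into a temporary folder so it runs anywhere; with your real data you would point Path at "my_dataset"):

```python
import tempfile
from pathlib import Path
import pandas as pd

# Stand-in for the folder save_data created
out_dir = Path(tempfile.mkdtemp()) / "my_dataset"
out_dir.mkdir()
for name in ("expression", "genes", "samples"):
    pd.DataFrame().to_csv(out_dir / f"{name}.tsv", sep="\t")

# The three expected files should be there
print(sorted(p.name for p in out_dir.iterdir()))
# → ['expression.tsv', 'genes.tsv', 'samples.tsv']
```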


Exercise: Load the dataset back

Goal

Write a function:

data = load_data(folder)

It should:

  1. Take only the folder name (a string)
  2. Look for the three files inside that folder:
     • expression.tsv
     • genes.tsv
     • samples.tsv
  3. If the folder does not exist: raise an error
  4. If any of the files is missing: raise an error that lists what is missing
  5. Read the files back into pandas DataFrames
  6. Return the dataset dictionary with keys "expression", "genes", "samples"

Hints

  • Use Path(folder)
  • Use .exists() to check if something is on disk
  • Use pd.read_csv(..., sep="\t", index_col=0) to read a TSV table with row names
  • Your function should raise errors instead of returning partial results
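
The index_col=0 hint matters: it is what restores the row names that to_csv wrote out as the first column. A quick round-trip through an in-memory buffer shows it (uses io.StringIO so no files are needed):

```python
import io
import pandas as pd

df = pd.DataFrame({"cluster": [0, 1]}, index=["S1", "S2"])
tsv_text = df.to_csv(sep="\t")  # index is written as the first column

# Without index_col=0 the row names would come back as an ordinary column
back = pd.read_csv(io.StringIO(tsv_text), sep="\t", index_col=0)
print(list(back.index))
# → ['S1', 'S2']
```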

Solution: load_data(folder)
def load_data(folder):
    """
    Load the dataset dictionary from a folder of TSV files.

    Fails if the folder or any required file is missing.
    """
    in_dir = Path(folder)

    # Folder must exist
    if not in_dir.exists():
        raise FileNotFoundError(f"Folder not found: {in_dir}")

    if not in_dir.is_dir():
        raise NotADirectoryError(f"Not a folder: {in_dir}")

    # Required files (fixed)
    paths = {
        "expression": in_dir / "expression.tsv",
        "genes": in_dir / "genes.tsv",
        "samples": in_dir / "samples.tsv",
    }

    # Check missing files
    missing = [name for name, p in paths.items() if not p.exists()]
    if len(missing) > 0:
        raise FileNotFoundError(
            f"Missing file(s) in {in_dir}: " + ", ".join(missing)
        )

    # Read tables (index_col=0 restores the row names)
    expr = pd.read_csv(paths["expression"], sep="\t", header=0, index_col=0)
    genes = pd.read_csv(paths["genes"], sep="\t", header=0, index_col=0)
    samples = pd.read_csv(paths["samples"], sep="\t", header=0, index_col=0)

    data = {
        "expression": expr,
        "genes": genes,
        "samples": samples,
    }

    check_data_model(data)

    return data

Test your load function

After saving:

data2 = load_data("my_dataset")

Check that it looks correct:

def data_equal(d1, d2):
    # np.allclose compares values only; .equals also checks labels and dtypes
    expr_ok = np.allclose(d1["expression"], d2["expression"])
    genes_ok = d1["genes"].equals(d2["genes"])
    samples_ok = d1["samples"].equals(d2["samples"])

    return expr_ok and genes_ok and samples_ok

data_equal(data, data2)
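
If you want a fully self-contained round-trip check, this sketch saves one toy table to a temporary folder with to_csv, reads it back, and compares it the same way data_equal does:

```python
import tempfile
from pathlib import Path
import numpy as np
import pandas as pd

folder = Path(tempfile.mkdtemp())
expr = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0]], index=["geneA", "geneB"], columns=["S1", "S2"]
)
expr.to_csv(folder / "expression.tsv", sep="\t")

# Read it back and confirm values and row names survived the round trip
expr2 = pd.read_csv(folder / "expression.tsv", sep="\t", index_col=0)
print(np.allclose(expr, expr2) and expr.index.equals(expr2.index))
# → True
```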

Why this matters for bioinformatics

Real workflows must be reproducible.

Safe save/load functions help you:

  • avoid accidentally overwriting results
  • catch missing inputs early
  • share datasets between scripts and notebooks
  • move work from your laptop to an HPC system

In the next section, we will learn how to speed up some calculations by using built-in vectorized functions.