Goal of this section
By the end of this section you will be able to:
- Create dictionaries in Python
- Store multiple related objects together
- Access dictionary elements by name
- Understand why dictionaries are useful for biological data
- Use a dictionary-based dataset to make a gene–gene plot for one cluster
What is a dictionary?
A dictionary stores data as:
key -> value
Example:
gene_info = {
    "name": "Gata1",
    "chromosome": "chrX",
    "length": 2345
}
gene_info
You access values using the key:
gene_info["name"]
gene_info["chromosome"]
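A few more ways to work with keys are sketched below. This is a minimal illustration using only built-in dictionary features; the key "strand" is hypothetical and is used only to show what happens when a key is missing.

```python
# Toy record with the same content as the gene_info example.
gene_info = {"name": "Gata1", "chromosome": "chrX", "length": 2345}

# "in" tests whether a key exists (it checks keys, not values).
has_length = "length" in gene_info
has_strand = "strand" in gene_info  # "strand" is a hypothetical key

# gene_info["strand"] would raise a KeyError.
# .get() returns a default value instead.
strand = gene_info.get("strand", "unknown")

# Loop over all key/value pairs in insertion order.
for key, value in gene_info.items():
    print(f"{key}: {value}")

keys = list(gene_info.keys())
```

Using .get() is a reasonable choice when a missing key is expected; plain square brackets are better when a missing key should be treated as an error.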
Dictionaries vs lists
A list stores values by position:
genes = ["Gata1", "Runx1", "Spi1"]
genes[0]
A dictionary stores values by name:
genes = {
    "erythroid": "Gata1",
    "stem": "Runx1",
    "myeloid": "Spi1"
}
genes["erythroid"]
This is often clearer when working with biological data where information is paired with metadata.
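To make the difference concrete, the sketch below looks up the same marker gene twice: once with two parallel lists and once with a dictionary built from them. The variable names (roles, markers, marker_by_role) are made up for this illustration.

```python
# Two parallel lists: the pairing only exists through the shared position.
roles = ["erythroid", "stem", "myeloid"]
markers = ["Gata1", "Runx1", "Spi1"]

# With lists, we first have to find the position of "myeloid".
position = roles.index("myeloid")
marker_from_lists = markers[position]

# zip() pairs the two lists; dict() turns the pairs into key -> value.
marker_by_role = dict(zip(roles, markers))
marker_from_dict = marker_by_role["myeloid"]

print(marker_from_lists, marker_from_dict)
```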
A mini experiment dataset (stored in one dictionary)
In bioinformatics we often want to keep multiple related tables together, for example:
- an expression matrix (genes × samples)
- a gene table (gene metadata)
- a sample table (sample metadata, e.g. clusters)
We will store all of these in a single dictionary called data.
import numpy as np
import pandas as pd
# ---- expression (genes x samples) ----
expr = np.array(
    [[8, 7, 2, 3, 9, 1, 2, 8],   # Gata1
     [1, 2, 7, 8, 1, 6, 7, 2],   # Spi1
     [4, 4, 5, 5, 4, 6, 6, 4]],  # Runx1
    dtype=float
)
# ---- gene metadata table ----
genes = pd.DataFrame(
    {
        "symbol": ["Gata1", "Spi1", "Runx1"],
        "role": ["erythroid", "myeloid", "stem"],
    },
    index=["Gata1", "Spi1", "Runx1"]
)
# ---- sample metadata table (store clusters here) ----
samples = pd.DataFrame(
    {
        "cluster": [0, 0, 1, 1, 0, 1, 1, 0],
        "type": ["A", "A", "B", "B", "A", "B", "B", "A"],
    },
    index=["sample1", "sample2", "sample3", "sample4",
           "sample5", "sample6", "sample7", "sample8"]
)
# ---- store everything together ----
data = {
    "expression": expr,
    "genes": genes,
    "samples": samples
}
data
Notes:
- data["expression"] is the expression matrix (genes × samples)
- data["genes"] is gene metadata (one row per gene)
- data["samples"] is sample metadata (one row per sample), including cluster
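Because the three pieces are separate objects, nothing forces them to stay in sync. A quick consistency check can catch mistakes early. The sketch below rebuilds a smaller version of the dataset so it runs on its own; check_dataset is a hypothetical helper written for this section, not a function from NumPy or pandas.

```python
import numpy as np
import pandas as pd

# Smaller stand-in for the dataset above (2 genes x 4 samples).
expr = np.array(
    [[8, 7, 2, 3],   # Gata1
     [1, 2, 7, 8]],  # Spi1
    dtype=float,
)
genes = pd.DataFrame({"role": ["erythroid", "myeloid"]}, index=["Gata1", "Spi1"])
samples = pd.DataFrame(
    {"cluster": [0, 0, 1, 1]},
    index=["sample1", "sample2", "sample3", "sample4"],
)
data = {"expression": expr, "genes": genes, "samples": samples}


def check_dataset(data):
    """Hypothetical helper: True if the matrix shape matches both tables."""
    n_genes, n_samples = data["expression"].shape
    return n_genes == len(data["genes"]) and n_samples == len(data["samples"])


print(check_dataset(data))
```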
Accessing dictionary elements
Example: get the expression matrix
data["expression"]
Get the sample table
data["samples"]
Combine dictionary access and DataFrame indexing:
data["expression"].loc["Gata1", :]
But this does not work, does it?
Remember what data["expression"] is: a NumPy ndarray. It is very efficient for calculations, but it has no .loc accessor and no row or column names.
The row and column information is stored in genes and samples instead. The gene names are the index (the row names) of the genes table.
print(data["genes"].index)
data["genes"].index == "Gata1"
Do you remember how to subset an ndarray using a boolean array? Try it!
Solution:
data["expression"][data["genes"].index == "Gata1"]
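The boolean-mask lookup returns a 2D array with one row and no labels. If you want the sample names attached, one option is to wrap the row in a pandas Series. The sketch below uses a smaller copy of the dataset; gene_series is a hypothetical helper, and it assumes the gene name occurs exactly once in the gene table.

```python
import numpy as np
import pandas as pd

# Smaller stand-in for the dataset (2 genes x 4 samples).
expr = np.array([[8, 7, 2, 3], [1, 2, 7, 8]], dtype=float)
genes = pd.DataFrame({"role": ["erythroid", "myeloid"]}, index=["Gata1", "Spi1"])
samples = pd.DataFrame(
    {"cluster": [0, 0, 1, 1]},
    index=["sample1", "sample2", "sample3", "sample4"],
)
data = {"expression": expr, "genes": genes, "samples": samples}


def gene_series(data, gene):
    """Hypothetical helper: expression of one gene, labelled by sample name."""
    mask = data["genes"].index == gene   # boolean array, one entry per gene
    row = data["expression"][mask]       # 2D array with shape (1, n_samples)
    # row[0] takes the single matching row; it fails if the gene is absent.
    return pd.Series(row[0], index=data["samples"].index, name=gene)


gata1 = gene_series(data, "Gata1")
print(gata1)
```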
Modifying a dictionary
Add a new entry:
data["species"] = "mouse"
data
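Modify an entry in place, for example by overwriting the type column of the sample table (the values here are the same as before, so the table looks unchanged):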
data["samples"]["type"] = ["A", "A", "B", "B", "A", "B", "B", "A"]
data["samples"]
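The basic modification operations can be tried on a small toy dictionary. The sketch below uses only built-in Python; the values (such as the length 2400) are made up for illustration. It also shows a point that is easy to miss: a dictionary stores references to its values, not copies.

```python
record = {"name": "Gata1", "length": 2345}

# Assigning to a new key adds an entry.
record["species"] = "mouse"

# Assigning to an existing key overwrites the old value.
record["length"] = 2400  # made-up value, for illustration only

# pop() removes a key and returns its value.
removed = record.pop("species")

# A dictionary holds references, not copies:
clusters = [0, 0, 1]
dataset = {"clusters": clusters}
dataset["clusters"].append(1)  # this also changes the original list

print(record, removed, clusters)
```

The same applies to DataFrames stored in a dictionary: changing a table through the dictionary changes the original table object too. If you need an independent copy of a DataFrame, call its .copy() method before storing it.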
Why this structure is useful in bioinformatics
Many packages store information in named containers:
- AnnData / Scanpy: adata.X, adata.obs, adata.var, adata.uns
- Seurat stores data and metadata in named slots
- many file formats store data + metadata + settings
A dictionary is a simple way to model the same idea:
named pieces of data stored together.
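As a rough illustration, the dictionary from this section can be rearranged to mirror the AnnData slot names. This is only a sketch with a plain dictionary (adata_like is a made-up name), not a real AnnData object. One difference to keep in mind: AnnData stores observations (samples or cells) in rows and genes in columns, so the matrix is transposed.

```python
import numpy as np
import pandas as pd

# Smaller stand-in for the dataset (2 genes x 4 samples).
expr = np.array([[8, 7, 2, 3], [1, 2, 7, 8]], dtype=float)
genes = pd.DataFrame({"role": ["erythroid", "myeloid"]}, index=["Gata1", "Spi1"])
samples = pd.DataFrame(
    {"cluster": [0, 0, 1, 1]},
    index=["sample1", "sample2", "sample3", "sample4"],
)
data = {"expression": expr, "genes": genes, "samples": samples, "species": "mouse"}

# Plain dictionary that imitates the AnnData layout (not a real AnnData object).
adata_like = {
    "X": data["expression"].T,            # samples x genes (transposed)
    "obs": data["samples"],               # one row per sample
    "var": data["genes"],                 # one row per gene
    "uns": {"species": data["species"]},  # free-form extra information
}

print(adata_like["X"].shape)
```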
Exercise: Gene–gene plot for one cluster
Goal
Create a scatter plot for cluster 1:
- x-axis: expression of Gata1
- y-axis: expression of Spi1
- points: samples that belong to cluster 1
Use the dataset stored in data.
Hints
1) Access an entry from a dictionary (reminder)
dict1 = {"a": 10, "b": 20}
dict1["a"]
2) Get the cluster labels from the sample table
- The cluster labels are stored in the sample metadata table.
- You will need something like: “give me the cluster column”.
3) Create a mask for cluster 1
Example pattern:
v = np.array([0, 1, 1, 0])
mask = v == 1
mask
4) Subset the expression matrix by columns
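You have seen column subsetting in the DataFrame section. Here the expression matrix is a NumPy array, so you index it with square brackets directly, not with .loc.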
Remember the idea:
- rows = genes
- columns = samples
Select only the sample columns that match your mask.
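Example pattern on a toy matrix (not the exercise data; the names m, labels and subset are made up for this illustration):

```python
import numpy as np

# Toy matrix: 2 rows x 4 columns.
m = np.array([[10, 20, 30, 40],
              [1, 2, 3, 4]])

# One label per column.
labels = np.array([0, 1, 1, 0])
mask = labels == 1

# ":" keeps all rows; the boolean mask selects the columns.
subset = m[:, mask]
print(subset)
```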
5) Extract the two genes
You need one vector for x (Gata1) and one for y (Spi1).
6) Scatter plot reminder
import matplotlib.pyplot as plt
plt.scatter(x_values, y_values)
plt.xlabel("...")
plt.ylabel("...")
plt.title("...")
plt.show()
Why this matters for bioinformatics
In real single-cell workflows you often:
- store expression + sample metadata together
- plot gene vs gene for a subset of cells (e.g. one cluster)
- iterate quickly over different genes and clusters
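As one possible illustration of iterating over genes and clusters, the sketch below computes the mean expression of every gene in every cluster and stores the results in a dictionary with (cluster, gene) keys. It rebuilds the expression values from this section so that it runs on its own; mean_by_cluster is a made-up name, and the gene and sample tables are reduced to the columns the loop needs.

```python
import numpy as np
import pandas as pd

expr = np.array(
    [[8, 7, 2, 3, 9, 1, 2, 8],   # Gata1
     [1, 2, 7, 8, 1, 6, 7, 2],   # Spi1
     [4, 4, 5, 5, 4, 6, 6, 4]],  # Runx1
    dtype=float,
)
data = {
    "expression": expr,
    "genes": pd.DataFrame(index=["Gata1", "Spi1", "Runx1"]),
    "samples": pd.DataFrame({"cluster": [0, 0, 1, 1, 0, 1, 1, 0]}),
}

mean_by_cluster = {}
for cluster in sorted(data["samples"]["cluster"].unique()):
    # Boolean mask over samples (columns of the matrix).
    in_cluster = (data["samples"]["cluster"] == cluster).to_numpy()
    sub = data["expression"][:, in_cluster]
    # Row i of the matrix belongs to the i-th gene in the gene table.
    for i, gene in enumerate(data["genes"].index):
        mean_by_cluster[(int(cluster), gene)] = float(sub[i].mean())

for key, value in mean_by_cluster.items():
    print(key, value)
```

Tuples such as (0, "Gata1") work as dictionary keys because they are immutable, which makes a dictionary a convenient place to collect results from nested loops.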
In the next section, we will learn how to read and write data from files.