Goal of this section
By the end of this section you will be able to:
- Create matrices using NumPy
- Understand rows vs columns in biological data
- Create labeled tables using pandas DataFrames
- Transpose matrices and tables
These structures are central to bioinformatics:
rows = genes, columns = samples.
Import libraries
import numpy as np
import pandas as pd
Creating a matrix with NumPy
Create a matrix of zeros:
m = np.zeros((10, 5)) # 10 rows, 5 columns
m
This represents:
- 10 genes (rows)
- 5 samples (columns)
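To make the row/column convention concrete, here is a minimal sketch that pulls one row and one column out of the matrix. The variable names `first_gene` and `first_sample` are illustrative and not part of the original material.

```python
import numpy as np

m = np.zeros((10, 5))  # 10 genes (rows) x 5 samples (columns)

first_gene = m[0, :]    # one row: values of gene 0 in all 5 samples
first_sample = m[:, 0]  # one column: values of all 10 genes in sample 0

print(first_gene.shape)    # (5,)
print(first_sample.shape)  # (10,)
```

The first index always selects rows (genes) and the second selects columns (samples).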
Transposing a matrix
Transpose means swapping rows and columns:
m_T = m.T
m_T
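This is important because some functions expect:
- rows = observations
- columns = variables
With genes as rows, the samples sit in the columns, so a function that treats each sample as an observation needs the transposed matrix.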
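A quick way to confirm what a transpose did is to compare shapes before and after. The sketch below also checks that, for a NumPy array, `.T` returns a view onto the same data rather than a copy.

```python
import numpy as np

m = np.zeros((10, 5))  # 10 genes x 5 samples
m_T = m.T              # 5 samples x 10 genes

print(m.shape)    # (10, 5)
print(m_T.shape)  # (5, 10)

# .T does not copy the data: both arrays share the same memory
print(np.shares_memory(m, m_T))  # True
```

Because the transpose is a view, changing a value in `m_T` also changes it in `m`.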
Why we need labels
NumPy matrices only store numbers.
They do not store gene or sample names.
For biological data, we need labeled rows and columns.
This is why we use pandas DataFrames.
Creating a DataFrame
Convert the NumPy matrix into a DataFrame with labels:
genes = ["Gata1", "Spi1", "Runx1", "Cebpa", "Tal1",
"Actb", "Kit", "Cd34", "Lyz", "Il7r"]
samples = ["LTHSC_1", "LTHSC_2", "MEP_1", "MEP_2", "GMP_1"]
df = pd.DataFrame(
m,
index=genes,
columns=samples
)
print(df)
Now we have:
- row names (the index)
- column names (the columns)
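The labels can be inspected through `df.index` and `df.columns`, and they let us look up values by name instead of by position. The following is a small self-contained sketch; the names `gata1` and `value` are illustrative only.

```python
import numpy as np
import pandas as pd

genes = ["Gata1", "Spi1", "Runx1", "Cebpa", "Tal1",
         "Actb", "Kit", "Cd34", "Lyz", "Il7r"]
samples = ["LTHSC_1", "LTHSC_2", "MEP_1", "MEP_2", "GMP_1"]
df = pd.DataFrame(np.zeros((10, 5)), index=genes, columns=samples)

print(df.index.tolist())    # the 10 gene names
print(df.columns.tolist())  # the 5 sample names

gata1 = df.loc["Gata1"]           # all samples for one gene (a Series)
value = df.loc["Gata1", "MEP_1"]  # a single gene/sample value
print(value)
```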
Shape of a DataFrame
Check dimensions:
df.shape
This returns:
(number_of_rows, number_of_columns)
Access them separately:
print(f"0 - nrow: {df.shape[0]}")  # rows
print(f"1 - ncol: {df.shape[1]}")  # columns
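Since `shape` is an ordinary tuple, both dimensions can also be unpacked in one step. This is a minimal sketch using a stand-in DataFrame with the same shape as `df`; the names `n_genes` and `n_samples` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((10, 5)))  # stand-in with the same shape as df above

n_genes, n_samples = df.shape  # tuple unpacking: (rows, columns)
print(f"{n_genes} genes x {n_samples} samples")
```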
Accessing values
Convert the DataFrame back to a NumPy array if needed:
df.values
But in most cases, we work directly with the DataFrame.
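As a side note, pandas also provides `df.to_numpy()`, which its documentation recommends over `.values`. The sketch below shows that either way the result is a plain array: the numbers survive, but the gene and sample labels do not.

```python
import numpy as np
import pandas as pd

genes = ["Gata1", "Spi1", "Runx1", "Cebpa", "Tal1",
         "Actb", "Kit", "Cd34", "Lyz", "Il7r"]
samples = ["LTHSC_1", "LTHSC_2", "MEP_1", "MEP_2", "GMP_1"]
df = pd.DataFrame(np.zeros((10, 5)), index=genes, columns=samples)

arr = df.to_numpy()  # plain NumPy array, no row or column labels

print(type(arr))   # <class 'numpy.ndarray'>
print(arr.shape)   # (10, 5)
print(np.array_equal(arr, df.values))  # True: same numbers either way
```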
Creating a numeric DataFrame
Create a matrix filled with the integers 1 to 50:
m2 = np.arange(1, 51).reshape((10, 5))
df2 = pd.DataFrame(
m2,
index=genes,
columns=samples
)
print(df2)
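DataFrames can be transposed with `.T` just like NumPy arrays, and the labels move with the data. Here is a minimal sketch with the numeric table, where distinct values make the swap easy to see; the name `df2_T` is illustrative.

```python
import numpy as np
import pandas as pd

genes = ["Gata1", "Spi1", "Runx1", "Cebpa", "Tal1",
         "Actb", "Kit", "Cd34", "Lyz", "Il7r"]
samples = ["LTHSC_1", "LTHSC_2", "MEP_1", "MEP_2", "GMP_1"]
m2 = np.arange(1, 51).reshape((10, 5))
df2 = pd.DataFrame(m2, index=genes, columns=samples)

df2_T = df2.T  # rows become samples, columns become genes

print(df2.shape)    # (10, 5)
print(df2_T.shape)  # (5, 10)

# The same value is now addressed with the labels in swapped order
print(df2.loc["Gata1", "MEP_1"])    # 3
print(df2_T.loc["MEP_1", "Gata1"])  # 3
```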
Exercise
- Plot the values of the gene "Gata1" across the samples
- Color the plot by sample type (LTHSC, MEP, GMP)
Why this matters for bioinformatics
Most tools in bioinformatics expect data in this form:
- genes × samples
- with labels
Understanding how to create and manipulate these tables is essential for:
- filtering
- plotting
- clustering
- statistical testing
In the next section, we will learn how to subset and filter these tables.