Goal of this section
By the end of this section you will be able to:
- Create and use variables
- Understand the difference between Python lists and NumPy arrays
- Add comments to your code
- Use help() to look up documentation
- Perform vectorized numerical operations
- Use logical conditions to select values
These are the building blocks for everything that follows.
Variables
A variable is a name that refers to a value stored in memory.
x = 10
y = 3.5
name = "geneA"
You can print the contents of a variable:
print(x)
print(name)
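As a small sketch of how names and values relate, the following reuses the variables from above. The built-in type() function reports what kind of value a name currently refers to, and assigning again makes the name point to a new value.

```python
x = 10
y = 3.5
name = "geneA"

# type() shows the kind of value each name refers to
print(type(x), type(y), type(name))

# Assigning again rebinds the name to a new value
x = x + 1
print(x)
```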
In Jupyter notebooks, simply writing the variable name as the last statement of the cell will also show its value:
x
What would happen if you have another statement after the variable name?
Try:
x
a = 14
Comments
Comments are notes for humans. Python ignores them when running code.
x = 5 # this is a comment
y = x + 2 # add 2 to x
Use comments to explain why something is done, not just what is done.
Python lists
A list is a collection of values:
genes = ["Gata1", "Runx1", "Spi1"]
genes
You can access elements using square brackets []:
print(genes[0])
print(genes[1])
Python starts counting at 0, not 1.
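A short sketch of what zero-based counting implies in practice, using the same example list. The variable names first, last and n_genes are chosen here for illustration only.

```python
genes = ["Gata1", "Runx1", "Spi1"]

first = genes[0]        # index 0 is the first element
last = genes[-1]        # negative indices count from the end
n_genes = len(genes)    # number of elements in the list

print(first, last, n_genes)

# The highest valid index is len(genes) - 1, so genes[3] raises an IndexError
try:
    genes[3]
except IndexError as err:
    print("IndexError:", err)
```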
NumPy arrays
In bioinformatics we usually work with numerical vectors and matrices.
For this, we use NumPy arrays.
First import NumPy:
import numpy as np
This means:
- Import the numpy library
- Make it available as np
Later we can call functions like:
np.array(x)
Creating arrays
Create a numeric vector:
x = np.array([1, 2, 3, 4, 5])
x
NumPy arrays are designed for fast numerical computation.
Lists vs NumPy arrays
Compare this list:
lst = [1, 2, 3, 4, 5]
with this array:
arr = np.array([1, 2, 3, 4, 5])
Try adding 1:
lst + 1
This fails with a TypeError: for lists, + means concatenation, so a list can only be added to another list, not to a number.
But with NumPy:
arr + 1
This adds 1 to every element.
This is called vectorized computation.
It is much faster than Python loops because the operations are executed in optimized compiled code (mostly written in C) and work on entire blocks of memory at once.
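The comparison above can be sketched in one runnable cell. It catches the error from the list so the cell keeps running, and shows the list comprehension you would need in plain Python to get the same result as the array expression. The names list_error, concatenated, looped and vectorized are illustrative.

```python
import numpy as np

lst = [1, 2, 3, 4, 5]
arr = np.array([1, 2, 3, 4, 5])

# For lists, + means concatenation, so adding a number raises a TypeError
try:
    lst + 1
    list_error = None
except TypeError as err:
    list_error = err
    print("TypeError:", err)

# Adding another list appends its elements instead of doing arithmetic
concatenated = lst + [1]

# The pure-Python way to add 1 to every element is a loop or list comprehension
looped = [value + 1 for value in lst]

# The NumPy array does the same element-wise in one expression
vectorized = arr + 1

print(concatenated)
print(looped)
print(vectorized)
```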
Creating numeric sequences
NumPy provides functions to generate sequences easily.
np.arange(1, 11)
This creates the numbers from 1 to 10. The stop value (11) is not included.
You can control the step size:
np.arange(0, 21, 5)
You can also generate evenly spaced values using linspace:
np.linspace(0, 100, 11)
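To see how the two functions differ, the sketch below stores the three sequences from above in variables (a, b and c are illustrative names) and prints them. Note that arange excludes its stop value, while linspace includes both endpoints by default and takes the number of values rather than a step size.

```python
import numpy as np

a = np.arange(1, 11)          # stop value 11 is excluded: 1, 2, ..., 10
b = np.arange(0, 21, 5)       # 0, 5, 10, 15, 20
c = np.linspace(0, 100, 11)   # 11 values, both 0 and 100 included

print(a)
print(b)
print(c)

# arange with integer arguments gives integers; linspace gives floats
print(a.dtype, c.dtype)
```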
Getting help
Python has built-in documentation.
help(np.arange) # or in Jupyter: ?np.arange
Look at:
- what arguments the function takes
- what it returns
Vectorized mathematics
Create a vector:
v = np.arange(1, 11)
v
Multiply all values:
2 * v
Take the reciprocal:
1 / v
Compute the mean and standard deviation:
np.mean(v)
np.std(v)
Exercise
- Create a NumPy vector from 0 to 100 in steps of 10.
- Compute its mean and standard deviation.
- Try the same with a Python list. What happens?
Steps for computing the standard deviation by hand:
1. Subtract the mean from each value
2. Square the differences
3. Compute the mean of those squared differences
4. Take the square root
Solution
import numpy as np
# 1. Create vector
v = np.arange(0, 101, 10)
v
# 2. Mean and standard deviation
np.mean(v)
np.std(v)
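# 3. Try the same with a Python list
lst = list(range(0, 101, 10))
# These calls work: NumPy converts the list to an array internally
np.mean(lst)
np.std(lst)
# But the list itself has no such methods: lst.mean() raises an AttributeError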
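The four steps listed in the exercise can be sketched with vectorized operations as follows. The intermediate names (mean_v, differences, squared, variance, manual_std) are illustrative. By default np.std divides by the number of values (ddof=0), which is what step 3 does, so the two results should agree.

```python
import numpy as np

v = np.arange(0, 101, 10)

mean_v = np.mean(v)                 # mean of all values
differences = v - mean_v            # 1. subtract the mean from each value
squared = differences ** 2          # 2. square the differences
variance = np.mean(squared)         # 3. mean of the squared differences
manual_std = np.sqrt(variance)      # 4. take the square root

print(manual_std)
print(np.std(v))
```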
Logical comparisons
We can compare vectors to values:
x = np.arange(1, 11)
print(x == 5)
print(x < 5)
print(x >= 5)
x != 5
These comparisons return boolean arrays (True / False for each element).
Using logical results to select values
We can use boolean arrays to subset data:
x[x < 5]
x[x >= 5]
x[x != 5]
This is a very common pattern in data analysis. But would that also work with a list? Try it!
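It can help to look at the selection in two steps: first build the boolean array, then use it as an index. The sketch below also shows, as an addition to the text above, how two conditions can be combined with the element-wise & operator. The names mask, selected, n_selected and middle are illustrative.

```python
import numpy as np

x = np.arange(1, 11)

mask = x < 5              # boolean array, one True/False per element
selected = x[mask]        # keeps only the elements where mask is True
n_selected = mask.sum()   # True counts as 1, so the sum is the number of hits

# Conditions can be combined element-wise with & (and) and | (or);
# the parentheses around each comparison are required
middle = x[(x > 3) & (x < 8)]

print(mask)
print(selected, n_selected)
print(middle)
```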
Working with two string vectors
Create some gene vectors:
marker_genes = np.array(["Gata1", "Spi1", "Runx1"])
detected_genes = np.array(["Runx1", "Actb", "Gata1"])
Find common values:
common = np.intersect1d(marker_genes, detected_genes)
common
Find values in marker_genes that are NOT in detected_genes:
result = marker_genes[~np.isin(marker_genes, detected_genes)]
result
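The NumPy functions themselves also accept Python lists, but the boolean indexing step (marker_genes[...]) only works on a NumPy array, not on a list.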
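A sketch that breaks the "not in" selection into its parts, and compares it with two alternatives: np.setdiff1d, which returns sorted unique values, and Python's built-in sets. The names is_detected, missing_mask, missing_sorted and common_set are illustrative.

```python
import numpy as np

marker_genes = np.array(["Gata1", "Spi1", "Runx1"])
detected_genes = np.array(["Runx1", "Actb", "Gata1"])

# np.isin returns one boolean per element of the first argument
is_detected = np.isin(marker_genes, detected_genes)

# Boolean indexing keeps the original order of marker_genes
missing_mask = marker_genes[~is_detected]

# np.setdiff1d gives the same values, but sorted and without duplicates
missing_sorted = np.setdiff1d(marker_genes, detected_genes)

# With plain Python, sets offer the same operations (order is not preserved)
common_set = set(marker_genes.tolist()) & set(detected_genes.tolist())

print(is_detected)
print(missing_mask, missing_sorted)
print(common_set)
```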
Exercise
- Create a vector from 10 to 50 in steps of 2.
- Select only values larger than 25.
- Create a second vector from 30 to 70.
- Find the values common to both vectors.
- Find the values in the first vector that are not in the second.
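One possible way to approach the exercise is sketched below; try it yourself first. The exercise does not state a step size for the second vector, so this sketch assumes a step of 1 and that both 50 and 70 are meant to be included. The variable names are illustrative.

```python
import numpy as np

# 1. Vector from 10 to 50 in steps of 2 (stop is exclusive, hence 51)
a = np.arange(10, 51, 2)

# 2. Values larger than 25
large = a[a > 25]

# 3. Second vector from 30 to 70 (assumption: step 1, 70 included)
b = np.arange(30, 71)

# 4. Values common to both vectors
common = np.intersect1d(a, b)

# 5. Values in the first vector that are not in the second
only_in_a = a[~np.isin(a, b)]

print(large)
print(common)
print(only_in_a)
```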
Why this matters for bioinformatics
Gene expression data is usually stored as:
- rows = genes
- columns = samples
These are large numerical tables.
NumPy arrays allow us to:
- store them efficiently
- compute statistics quickly
- filter genes and samples
- find overlaps between gene sets
- prepare data for plotting and clustering
These operations are performed thousands of times in real workflows. Using NumPy's vectorized operations and logical indexing is typically both faster and clearer than writing explicit Python loops.
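As a look ahead, here is a minimal sketch of the genes-by-samples layout described above. The expression values are made up for illustration, and the threshold of 2 is an arbitrary choice. It uses a two-dimensional array, where axis=1 means "compute along each row".

```python
import numpy as np

# Made-up toy values, for illustration only: 3 genes (rows) x 4 samples (columns)
genes = np.array(["Gata1", "Runx1", "Spi1"])
expression = np.array([
    [5.0, 6.0, 7.0, 6.0],    # Gata1
    [0.0, 1.0, 0.0, 1.0],    # Runx1
    [9.0, 8.0, 10.0, 9.0],   # Spi1
])

# axis=1 computes one statistic per row, i.e. per gene
gene_means = expression.mean(axis=1)

# Keep genes whose mean expression exceeds a hypothetical threshold of 2
keep = gene_means > 2
filtered_genes = genes[keep]
filtered_expression = expression[keep, :]

print(gene_means)
print(filtered_genes)
print(filtered_expression.shape)
```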
In the next section, we will learn how to plot data in Python.