Goal of this section

By the end of this section you will be able to:

Create and use variables
Understand the difference between Python lists and NumPy arrays
Add comments to your code
Use help() to look up documentation
Perform vectorized numerical operations
Use logical conditions to select values

These are the building blocks for everything that follows.

Variables

A variable is a name that refers to a value stored in memory.

x = 10
y = 3.5
name = "geneA"

You can print the contents of a variable:

print(x)
print(name)

In Jupyter notebooks, writing a variable name as the last statement of a cell also displays its value:

x

What would happen if you have another statement after the variable name?

Try:

x
a = 14

Comments

Comments are notes for humans. Python ignores them when running code.

x = 5        # this is a comment
y = x + 2    # add 2 to x

Use comments to explain why something is done, not just what is done.

Python lists

A list is a collection of values:

genes = ["Gata1", "Runx1", "Spi1"]
genes

You can access elements using square brackets []:

print(genes[0])
print(genes[1])

Python starts counting at 0, not 1.
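As a small illustration of zero-based indexing, the sketch below reuses the genes list from above. The variable names first, last and also_last are only for illustration.

```python
genes = ["Gata1", "Runx1", "Spi1"]

first = genes[0]               # index 0 is the first element
last = genes[len(genes) - 1]   # the last valid index is length minus one
also_last = genes[-1]          # negative indices count from the end

print(first, last, also_last)

# A 3-element list has indices 0, 1 and 2, so index 3 raises an IndexError.
try:
    genes[3]
except IndexError as err:
    print("IndexError:", err)
```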

NumPy arrays

In bioinformatics we usually work with numerical vectors and matrices.
For this, we use NumPy arrays.

First import NumPy:

import numpy as np

This means:

- Import the numpy library
- Make it available as np

Later we can call functions like:

np.array(x)

Creating arrays

Create a numeric vector:

x = np.array([1, 2, 3, 4, 5])
x

NumPy arrays are designed for fast numerical computation.
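One reason arrays can be fast is that all elements share a single data type. The sketch below inspects a few array attributes. The name mixed is made up for this example, and the exact integer type name depends on your platform.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])

print(x.shape)   # (5,)  one dimension with five elements
print(x.ndim)    # 1
print(x.dtype)   # an integer type; the exact name depends on the platform

# Mixing integers and floats gives an array where every element is a float.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)   # float64
```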

Lists vs NumPy arrays

Compare this list:

lst = [1, 2, 3, 4, 5]

with this array:

arr = np.array([1, 2, 3, 4, 5])

Try adding 1:

lst + 1

This fails with a TypeError: for lists, + means concatenation, so Python cannot add a number to a list.

But with NumPy:

arr + 1

This adds 1 to every element.
This is called vectorized computation.

It is usually much faster than a Python loop because the operations run in optimized, compiled code (mostly written in C) and work on entire blocks of memory at once.
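To make the contrast concrete, here is a minimal sketch of what + does for a list, how the same element-wise result could be produced with a loop, and the vectorized version. The names joined, looped and vectorized are illustrative only.

```python
import numpy as np

lst = [1, 2, 3, 4, 5]
arr = np.array(lst)

# For lists, + joins two lists together (concatenation).
joined = lst + [1]
print(joined)        # [1, 2, 3, 4, 5, 1]

# Adding 1 to every element of a list needs an explicit loop or comprehension.
looped = [value + 1 for value in lst]
print(looped)        # [2, 3, 4, 5, 6]

# With an array, the same operation is a single vectorized expression.
vectorized = arr + 1
print(vectorized)    # [2 3 4 5 6]
```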

Creating numeric sequences

NumPy provides functions to generate sequences easily.

np.arange(1, 11)

This creates the integers from 1 to 10. The stop value (11) is not included.

You can control the step size:

np.arange(0, 21, 5)

You can also generate evenly spaced values using linspace:

np.linspace(0, 100, 11)
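The sketch below puts the three calls side by side so the outputs can be compared. Note that arange excludes the stop value, while linspace includes it by default and takes the number of points rather than a step size. The names a, b and c are only for illustration.

```python
import numpy as np

a = np.arange(1, 11)          # stop value 11 is excluded
b = np.arange(0, 21, 5)       # start, stop, step
c = np.linspace(0, 100, 11)   # start, stop (included), number of points

print(a)   # [ 1  2  3  4  5  6  7  8  9 10]
print(b)   # [ 0  5 10 15 20]
print(c)   # 11 evenly spaced floats from 0.0 to 100.0
```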

Getting help

Python has built-in documentation.

help(np.arange)   # or in Jupyter: ?np.arange

Look at:

- what arguments the function takes
- what it returns

Vectorized mathematics

Create a vector:

v = np.arange(1, 11)
v

Multiply all values:

2 * v

Take the reciprocal:

1 / v

Compute the mean and standard deviation:

np.mean(v)
np.std(v)
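One detail worth knowing: by default np.std computes the population standard deviation (it divides by n). Passing ddof=1 gives the sample standard deviation (dividing by n - 1), which is the default in some other tools such as R. A short sketch, with pop_std and sample_std as illustrative names:

```python
import numpy as np

v = np.arange(1, 11)

print(np.mean(v))                 # 5.5

pop_std = np.std(v)               # divides by n (ddof=0 is the default)
sample_std = np.std(v, ddof=1)    # divides by n - 1

print(pop_std, sample_std)        # the sample value is slightly larger
```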

Exercise

Create a NumPy vector from 0 to 100 in steps of 10.
Compute its mean and standard deviation.
Try the same with a Python list. What happens?

Steps for computing the standard deviation by hand:

1. Subtract the mean from each value
2. Square the differences
3. Compute the mean of those squared differences
4. Take the square root

Solution

import numpy as np

# 1. Create vector
v = np.arange(0, 101, 10)
v

# 2. Mean and standard deviation
np.mean(v)
np.std(v)

# 3. Try the same with a Python list
lst = list(range(0, 101, 10))

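# This works: NumPy converts the list to an array internally
np.mean(lst)
np.std(lst)

# Arithmetic on the list itself still fails with a TypeError:
lst / 2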
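The four steps listed in the exercise can be sketched as follows. This is one possible way to write it. The intermediate names (diffs, squared, variance, manual_std) are chosen here for readability, and the result should match np.std with its default settings.

```python
import numpy as np

v = np.arange(0, 101, 10)

diffs = v - np.mean(v)          # 1. subtract the mean from each value
squared = diffs ** 2            # 2. square the differences
variance = np.mean(squared)     # 3. mean of the squared differences
manual_std = np.sqrt(variance)  # 4. take the square root

print(manual_std)
print(np.std(v))                # should agree with the manual result
```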

Logical comparisons

We can compare vectors to values:

x = np.arange(1, 11)

print(x == 5)
print(x < 5)
print(x >= 5)
x != 5

These comparisons return boolean arrays (True / False for each element).

Using logical results to select values

We can use boolean arrays to subset data:

x[x < 5]
x[x >= 5]
x[x != 5]

This is a very common pattern in data analysis. But would that also work with a list? Try it!
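Boolean arrays can also be counted and combined. The sketch below assumes the same x as above. Note that NumPy uses & (and) and | (or) for element-wise logic, and each condition needs its own parentheses. The names mask, n_small, middle and extremes are illustrative.

```python
import numpy as np

x = np.arange(1, 11)

mask = x < 5
n_small = np.sum(mask)            # True counts as 1, so this counts matches
print(n_small)                    # 4

middle = x[(x > 3) & (x < 8)]     # both conditions must hold
print(middle)                     # [4 5 6 7]

extremes = x[(x < 3) | (x > 8)]   # at least one condition must hold
print(extremes)                   # [ 1  2  9 10]
```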

Working with two string vectors

Create some gene vectors:

marker_genes = np.array(["Gata1", "Spi1", "Runx1"])
detected_genes = np.array(["Runx1", "Actb", "Gata1"])

Find common values:

common = np.intersect1d(marker_genes, detected_genes)
common

Find values in marker_genes that are NOT in detected_genes:

result = marker_genes[~np.isin(marker_genes, detected_genes)]
result

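The boolean indexing in the last step would once again not work with a Python list, even though np.intersect1d and np.isin themselves accept lists as input.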
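To see what each step produces, here is a short sketch using the same two gene vectors. It also shows np.setdiff1d, a NumPy function that returns the set difference directly. Note that intersect1d and setdiff1d return sorted, unique values, so the original order is not preserved. The names in_detected and missing are illustrative.

```python
import numpy as np

marker_genes = np.array(["Gata1", "Spi1", "Runx1"])
detected_genes = np.array(["Runx1", "Actb", "Gata1"])

common = np.intersect1d(marker_genes, detected_genes)
print(common)         # ['Gata1' 'Runx1']  (sorted)

# One True/False per marker gene: is it among the detected genes?
in_detected = np.isin(marker_genes, detected_genes)
print(in_detected)    # [ True False  True]

# Set difference in a single call
missing = np.setdiff1d(marker_genes, detected_genes)
print(missing)        # ['Spi1']
```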

Exercise

Create a vector from 10 to 50 in steps of 2.
Select only values larger than 25.
Create a second vector from 30 to 70.
Find the values common to both vectors.
Find the values in the first vector that are not in the second.
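One possible solution sketch is shown below. It assumes that both end points are meant to be included and that the second vector uses a step of 1, since the exercise does not specify one. The variable names are illustrative.

```python
import numpy as np

# Vector from 10 to 50 in steps of 2 (stop is exclusive, hence 51)
first = np.arange(10, 51, 2)

# Values larger than 25
large = first[first > 25]
print(large)

# Second vector from 30 to 70 (assumed step of 1)
second = np.arange(30, 71)

# Values common to both vectors
common = np.intersect1d(first, second)
print(common)

# Values in the first vector that are not in the second
only_first = first[~np.isin(first, second)]
print(only_first)
```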

Why this matters for bioinformatics

Gene expression data is usually stored as:

rows = genes
columns = samples

These are large numerical tables.
NumPy arrays allow us to:

store them efficiently
compute statistics quickly
filter genes and samples
find overlaps between gene sets
prepare data for plotting and clustering

These operations are performed thousands of times in real workflows. NumPy vectorized operations and logical indexing are usually a fast and clear way to do this.
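As a rough sketch of how these pieces fit together, the example below builds a tiny expression matrix with genes as rows and samples as columns. The numbers and the threshold of 1.0 are made up purely for illustration and do not come from real data.

```python
import numpy as np

# Hypothetical toy data: 3 genes (rows) x 3 samples (columns)
genes = np.array(["Gata1", "Runx1", "Spi1"])
expr = np.array([
    [5.0, 7.0, 6.0],
    [0.0, 1.0, 0.5],
    [10.0, 12.0, 11.0],
])

# Mean expression per gene: average across columns (axis=1)
gene_means = expr.mean(axis=1)
print(gene_means)     # [ 6.   0.5 11. ]

# Keep genes whose mean exceeds an arbitrary, illustrative threshold
expressed = genes[gene_means > 1.0]
print(expressed)      # ['Gata1' 'Spi1']
```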

In the next section, we will learn how to plot data in Python.