Introduction to Python

Getting Started

To run the labs in this book, you will need two things:

An installation of Python3, which is the specific version of Python used in the labs.
Access to Jupyter, a very popular Python interface that runs code through a file called a notebook.

You can download and install Python3 by following the instructions available at anaconda.com.

There are a number of ways to get access to Jupyter. Here are just a few:

Using Google’s Colaboratory service: colab.research.google.com/.
Using JupyterHub, available at jupyter.org/hub.
Using your own jupyter installation. Installation instructions are available at jupyter.org/install.

Please see the Python resources page on the book website statlearning.com for up-to-date information about getting Python and Jupyter working on your computer.

You will need to install the ISLP package, which provides access to the datasets and custom-built functions that we provide. Inside a macOS or Linux terminal type pip install ISLP; this also installs most other packages needed in the labs. The Python resources page has a link to the ISLP documentation website.

To run this lab, download the file Ch2-statlearn-lab.ipynb from the Python resources page. Now run the following code at the command line: jupyter lab Ch2-statlearn-lab.ipynb.

If you’re using Windows, you can use the start menu to access anaconda, and follow the links. For example, to install ISLP and run this lab, you can run the same code above in an anaconda shell.

Basic Commands

In this lab, we will introduce some simple Python commands. For more resources about Python in general, readers may want to consult the tutorial at docs.python.org/3/tutorial/.

Like most programming languages, Python uses functions to perform operations. To run a function called fun, we type fun(input1,input2), where the inputs (or arguments) input1 and input2 tell Python how to run the function. A function can have any number of inputs. For example, the print() function outputs a text representation of all of its arguments to the console.

print('fit a model with', 11, 'variables')

fit a model with 11 variables

The following command will provide information about the print() function.

print?

Adding two integers in Python is pretty intuitive.

3 + 5

In Python, textual data is handled using strings. For instance, "hello" and 'hello' are strings. We can concatenate them using the addition + symbol.

"hello" + " " + "world"

'hello world'

A string is actually a type of sequence: this is a generic term for an ordered list. The three most important types of sequences are lists, tuples, and strings.
We introduce lists now.

The following command instructs Python to join together the numbers 3, 4, and 5, and to save them as a list named x. When we type x, it gives us back the list.

x = [3, 4, 5]
x

[3, 4, 5]

Note that we used the brackets [] to construct this list.

We will often want to add two sets of numbers together. It is reasonable to try the following code, though it will not produce the desired results.

y = [4, 9, 7]
x + y

[3, 4, 5, 4, 9, 7]

The result may appear slightly counterintuitive: why did Python not add the entries of the lists element-by-element? In Python, lists hold arbitrary objects, and are added using concatenation. In fact, concatenation is the behavior that we saw earlier when we entered "hello" + " " + "world".

This example reflects the fact that Python is a general-purpose programming language. Much of Python’s data-specific functionality comes from other packages, notably numpy and pandas. In the next section, we will introduce the numpy package. See docs.scipy.org/doc/numpy/user/quickstart.html for more information about numpy.

Introduction to Numerical Python

As mentioned earlier, this book makes use of functionality that is contained in the numpy library, or package. A package is a collection of modules that are not necessarily included in the base Python distribution. The name numpy is an abbreviation for numerical Python.

To access numpy, we must first import it.

import numpy as np

In the previous line, we named the numpy module np; an abbreviation for easier referencing.

In numpy, an array is a generic term for a multidimensional set of numbers. We use the np.array() function to define x and y, which are one-dimensional arrays, i.e. vectors.

x = np.array([3, 4, 5])
y = np.array([4, 9, 7])

Note that if you forgot to run the import numpy as np command earlier, then you will encounter an error in calling the np.array() function in the previous line. The syntax np.array() indicates that the function being called is part of the numpy package, which we have abbreviated as np.

Since x and y have been defined using np.array(), we get a sensible result when we add them together. Compare this to our results in the previous section, when we tried to add two lists without using numpy.

x + y

array([ 7, 13, 12])

In numpy, matrices are typically represented as two-dimensional arrays, and vectors as one-dimensional arrays. {While it is also possible to create matrices using np.matrix(), we will use np.array() throughout the labs in this book.} We can create a two-dimensional array as follows.

x = np.array([[1, 2], [3, 4]])
x

array([[1, 2],
       [3, 4]])

The object x has several attributes, or associated objects. To access an attribute of x, we type x.attribute, where we replace attribute with the name of the attribute. For instance, we can access the ndim attribute of x as follows.

x.ndim

The output indicates that x is a two-dimensional array.
Similarly, x.dtype is the data type attribute of the object x. This indicates that x is comprised of 64-bit integers:

x.dtype

dtype('int64')

Why is x comprised of integers? This is because we created x by passing in exclusively integers to the np.array() function. If we had passed in any decimals, then we would have obtained an array of floating point numbers (i.e. real-valued numbers).

np.array([[1, 2], [3.0, 4]]).dtype

dtype('float64')

Typing fun? will cause Python to display documentation associated with the function fun, if it exists. We can try this for np.array().

np.array?

This documentation indicates that we could create a floating point array by passing a dtype argument into np.array().

np.array([[1, 2], [3, 4]], float).dtype

dtype('float64')

The array x is two-dimensional. We can find out the number of rows and columns by looking at its shape attribute.

x.shape

(2, 2)

A method is a function that is associated with an object. For instance, given an array x, the expression x.sum() sums all of its elements, using the sum() method for arrays. The call x.sum() automatically provides x as the first argument to its sum() method.

x = np.array([1, 2, 3, 4])
x.sum()

We could also sum the elements of x by passing in x as an argument to the np.sum() function.

x = np.array([1, 2, 3, 4])
np.sum(x)

As another example, the reshape() method returns a new array with the same elements as x, but a different shape. We do this by passing in a tuple in our call to reshape(), in this case (2, 3). This tuple specifies that we would like to create a two-dimensional array with 2 rows and 3 columns. {Like lists, tuples represent a sequence of objects. Why do we need more than one way to create a sequence? There are a few differences between tuples and lists, but perhaps the most important is that elements of a tuple cannot be modified, whereas elements of a list can be.}

In what follows, the \n character creates a new line.

x = np.array([1, 2, 3, 4, 5, 6])
print('beginning x:\n', x)
x_reshape = x.reshape((2, 3))
print('reshaped x:\n', x_reshape)

beginning x:
 [1 2 3 4 5 6]
reshaped x:
 [[1 2 3]
 [4 5 6]]

The previous output reveals that numpy arrays are specified as a sequence of rows. This is called row-major ordering, as opposed to column-major ordering.

Python (and hence numpy) uses 0-based indexing. This means that to access the top left element of x_reshape, we type in x_reshape[0,0].

x_reshape[0, 0]

Similarly, x_reshape[1,2] yields the element in the second row and the third column of x_reshape.

x_reshape[1, 2]

Similarly, x[2] yields the third entry of x.

Now, let’s modify the top left element of x_reshape. To our surprise, we discover that the first element of x has been modified as well!

print('x before we modify x_reshape:\n', x)
print('x_reshape before we modify x_reshape:\n', x_reshape)
x_reshape[0, 0] = 5
print('x_reshape after we modify its top left element:\n', x_reshape)
print('x after we modify top left element of x_reshape:\n', x)

x before we modify x_reshape:
 [1 2 3 4 5 6]
x_reshape before we modify x_reshape:
 [[1 2 3]
 [4 5 6]]
x_reshape after we modify its top left element:
 [[5 2 3]
 [4 5 6]]
x after we modify top left element of x_reshape:
 [5 2 3 4 5 6]

Modifying x_reshape also modified x because the two objects occupy the same space in memory.

We just saw that we can modify an element of an array. Can we also modify a tuple? It turns out that we cannot — and trying to do so introduces an exception, or error.

my_tuple = (3, 4, 5)
my_tuple[0] = 2

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[23], line 2
      1 my_tuple = (3, 4, 5)
----> 2 my_tuple[0] = 2

TypeError: 'tuple' object does not support item assignment

We now briefly mention some attributes of arrays that will come in handy. An array’s shape attribute contains its dimension; this is always a tuple. The ndim attribute yields the number of dimensions, and T provides its transpose.

x_reshape.shape, x_reshape.ndim, x_reshape.T

((2, 3),
 2,
 array([[5, 4],
        [2, 5],
        [3, 6]]))

Notice that the three individual outputs (2,3), 2, and array([[5, 4],[2, 5], [3,6]]) are themselves output as a tuple.

We will often want to apply functions to arrays. For instance, we can compute the square root of the entries using the np.sqrt() function:

np.sqrt(x)

array([2.23606798, 1.41421356, 1.73205081, 2.        , 2.23606798,
       2.44948974])

We can also square the elements:

x**2

array([25,  4,  9, 16, 25, 36])

We can compute the square roots using the same notation, raising to the power of 1/2 instead of 2.

x**0.5

array([2.23606798, 1.41421356, 1.73205081, 2.        , 2.23606798,
       2.44948974])

Throughout this book, we will often want to generate random data. The np.random.normal() function generates a vector of random normal variables. We can learn more about this function by looking at the help page, via a call to np.random.normal?. The first line of the help page reads normal(loc=0.0, scale=1.0, size=None). This signature line tells us that the function’s arguments are loc, scale, and size. These are keyword arguments, which means that when they are passed into the function, they can be referred to by name (in any order). {Python also uses positional arguments. Positional arguments do not need to use a keyword. To see an example, type in np.sum?. We see that a is a positional argument, i.e. this function assumes that the first unnamed argument that it receives is the array to be summed. By contrast, axis and dtype are keyword arguments: the position in which these arguments are entered into np.sum() does not matter.} By default, this function will generate random normal variable(s) with mean (loc) 0 and standard deviation (scale) 1; furthermore, a single random variable will be generated unless the argument to size is changed.

We now generate 50 independent random variables from a N(0,1) distribution.

x = np.random.normal(size=50)
x

array([ 1.51375067, -2.22727296,  0.62822215,  0.23952346, -1.45118485,
       -0.36339142, -0.71260139, -0.5227538 , -0.04551861,  0.54377979,
        0.03609708,  0.15355396, -0.5977978 , -1.35394647,  2.73624778,
        1.19369197,  0.26285256,  2.07539237,  0.02296531, -0.41230577,
       -0.34230916,  1.42061621,  0.4444357 ,  0.49187843, -0.05846681,
        0.34765814, -0.93450303,  0.65072419,  0.71278734,  0.31502719,
        0.91218123,  0.39522089, -0.04983099,  0.52138797, -1.29511901,
       -0.95294366, -0.64058613,  1.12055365,  0.73605291, -0.9610408 ,
        1.46111001, -0.60197115, -0.77473046,  0.75302822, -1.06153867,
        0.96265338, -0.82498867, -1.58303799, -0.3021057 ,  1.76380525])

We create an array y by adding an independent N(50,1) random variable to each element of x.

y = x + np.random.normal(loc=50, scale=1, size=50)

The np.corrcoef() function computes the correlation matrix between x and y. The off-diagonal elements give the correlation between x and y.

np.corrcoef(x, y)

array([[1.       , 0.6862853],
       [0.6862853, 1.       ]])

If you’re following along in your own Jupyter notebook, then you probably noticed that you got a different set of results when you ran the past few commands. In particular, each time we call np.random.normal(), we will get a different answer, as shown in the following example.

print(np.random.normal(scale=5, size=2))
print(np.random.normal(scale=5, size=2))

[-1.40848937  5.95428352]
[-3.27273543 -4.45294183]

In order to ensure that our code provides exactly the same results each time it is run, we can set a random seed using the np.random.default_rng() function. This function takes an arbitrary, user-specified integer argument. If we set a random seed before generating random data, then re-running our code will yield the same results. The object rng has essentially all the random number generating methods found in np.random. Hence, to generate normal data we use rng.normal().

rng = np.random.default_rng(1303)
print(rng.normal(scale=5, size=2))
rng2 = np.random.default_rng(1303)
print(rng2.normal(scale=5, size=2))

[ 4.09482632 -1.07485605]
[ 4.09482632 -1.07485605]

Throughout the labs in this book, we use np.random.default_rng() whenever we perform calculations involving random quantities within numpy. In principle, this should enable the reader to exactly reproduce the stated results. However, as new versions of numpy become available, it is possible that some small discrepancies may occur between the output in the labs and the output from numpy.

The np.mean(), np.var(), and np.std() functions can be used to compute the mean, variance, and standard deviation of arrays. These functions are also available as methods on the arrays.

rng = np.random.default_rng(3)
y = rng.standard_normal(10)
np.mean(y), y.mean()

(-0.1126795190952861, -0.1126795190952861)

np.var(y), y.var(), np.mean((y - y.mean())**2)

(2.7243406406465125, 2.7243406406465125, 2.7243406406465125)

Notice that by default np.var() divides by the sample size n rather than n-1; see the ddof argument in np.var?.

np.sqrt(np.var(y)), np.std(y)

(1.6505576756498128, 1.6505576756498128)

The np.mean(), np.var(), and np.std() functions can also be applied to the rows and columns of a matrix. To see this, we construct a 10 \times 3 matrix of N(0,1) random variables, and consider computing its row sums.

X = rng.standard_normal((10, 3))
X

array([[ 0.22578661, -0.35263079, -0.28128742],
       [-0.66804635, -1.05515055, -0.39080098],
       [ 0.48194539, -0.23855361,  0.9577587 ],
       [-0.19980213,  0.02425957,  1.54582085],
       [ 0.54510552, -0.50522874, -0.18283897],
       [ 0.54052513,  1.93508803, -0.26962033],
       [-0.24355868,  1.0023136 , -0.88645994],
       [-0.29172023,  0.88253897,  0.58035002],
       [ 0.0915167 ,  0.67010435, -2.82816231],
       [ 1.02130682, -0.95964476, -1.66861984]])

Since arrays are row-major ordered, the first axis, i.e. axis=0, refers to its rows. We pass this argument into the mean() method for the object X.

X.mean(axis=0)

array([ 0.15030588,  0.14030961, -0.34238602])

The following yields the same result.

X.mean(0)

array([ 0.15030588,  0.14030961, -0.34238602])

Reuse

CC SA BY-NC-ND

Citation

BibTeX citation:

@online{bochman2024,
  author = {Bochman, Oren},
  title = {Chapter 2: {Introduction} to {Python} - {Lab}},
  date = {2024-06-11},
  url = {https://orenbochman.github.io/notes-islr/posts/ch02/Ch02-statlearn-lab.html},
  langid = {en}
}

For attribution, please cite this work as:

Bochman, Oren. 2024. “Chapter 2: Introduction to Python - Lab.” June 11, 2024. https://orenbochman.github.io/notes-islr/posts/ch02/Ch02-statlearn-lab.html.