I. Getting started with Python for science

This part of the Scipy lecture notes is a self-contained introduction to everything that is needed to use Python for science, from the language itself, to numerical computing or plotting.

I-A. Scientific computing with tools and workflow

I-A-1. Why Python ?

I-A-1-a. The scientist's needs

  • Get data (simulation, experiment control)
  • Manipulate and process data
  • Visualize results… to understand what we are doing !
  • Communicate results : produce figures for reports or publications, write presentations.

I-A-1-b. Specifications

Rich collection of already existing bricks corresponding to classical numerical methods or basic actions : we don't want to re-program the plotting of a curve, a Fourier transform or a fitting algorithm. Don't reinvent the wheel!

Easy to learn : computer science is neither our job nor our education. We want to be able to draw a curve, smooth a signal, do a Fourier transform in a few minutes.

Easy communication with collaborators, students, ncustomers, to make the code live within a lab or a company : the code should be as readable as a book. Thus, the language should contain as few syntax symbols or unneeded routines as possible that would divert the reader from the mathematical or scientific understanding of the code.

Efficient code that executes quickly… but needless to say that a very fast code becomes useless if we spend too much time writing it. So, we need both a quick development time and a quick execution time.

A single environment/language for everything, if possible, to avoid learning a new software for each new problem.

I-A-1-c. Existing solutions

Which solutions do scientists use to work?

Compiled languages : C, C++, Fortran, etc.

  • Advantages :
  • Very fast. Very optimized compilers. For heavy computations, it's difficult to outperform these languages.
  • Some very optimized scientific libraries have been written for these languages. Example : BLAS (vector/matrix operations).
  • Drawbacks :
  • Painful usage : no interactivity during development, mandatory compilation steps, verbose syntax (&, : :, }}, ; etc.), manual memory management (tricky in C). These are difficult languages for non computer scientists.

Scripting languages : Matlab

  • Advantages :
  • Very rich collection of libraries with numerous algorithms, for many different domains. Fast execution because these libraries are often written in a compiled language.
  • Pleasant development environment : comprehensive and well organized help, integrated editor, etc.
  • Commercial support is available.
  • Drawbacks :
  • Base language is quite poor and can become restrictive for advanced users.
  • Not free.

Other scripting languages : Scilab, Octave, Igor, R, IDL, etc.

  • Advantages :
  • Open-source, free, or at least cheaper than Matlab.
  • Some features can be very advanced (statistics in R, figures in Igor, etc.)
  • Drawbacks :
  • Fewer available algorithms than in Matlab, and the language is not more advanced.
  • Some software are dedicated to one domain. Ex : Gnuplot or xmgrace to draw curves. These programs are very powerful, but they are restricted to a single type of usage, such as plotting.

What about Python?

  • Advantages :
  • Very rich scientific computing libraries (a bit less than Matlab, though)
  • Well thought out language, allowing to write very readable and well structured code : we “code what we think”.
  • Many libraries for other tasks than scientific computing (web server management, serial port access, etc.)
  • Free and open-source software, widely spread, with a vibrant community.
  • Drawbacks :
  • less pleasant development environment than, for example, Matlab. (More geek-oriented).
  • Not all the algorithms that can be found in more specialized software or toolboxes.

I-A-2. Scientific Python building blocks

Unlike Matlab, Scilab or R, Python does not come with a pre-bundled set of modules for scientific computing. Below are the basic building blocks that can be combined to obtain a scientific computing environment :

  • Python, a generic and modern computing language
  • Python language : data types (string, int), flow control, data collections (lists, dictionaries), patterns, etc.
  • Modules of the standard library.
  • A large number of specialized modules or applications written in Python : web protocols, web framework, etc. … and scientific computing.
  • Image non disponible
    Development tools (automatic testing, documentation generation)
  • IPython, an advanced Python shell http ://ipython.scipy.org/moin/
  • Numpy : provides powerful numerical arrays objects, and routines to manipulate them. http ://www.numpy.org/
  • Image non disponible
    Scipy : high-level data processing routines. Optimization, regression, interpolation, etc http ://www.scipy.org/
  • Image non disponible
    Matplotlib : 2-D visualization, “publication-ready” plots http ://matplotlib.sourceforge.net/
  • Mayavi : 3-D visualization http ://code.enthought.com/projects/mayavi/

I-A-3. The interactive workflow : IPython and a text editor

Interactive work to test and understand algorithms : In this section, we describe an interactive workflow with IPython that is handy to explore and understand algorithms.

Python is a general-purpose language. As such, there is not one blessed environment to work in, and not only one way of using it. Although this makes it harder for beginners to find their way, it makes it possible for Python to be used to write programs, in web servers, or embedded devices.

I-A-3-a. Command line interaction

Start ipython :

In [1] : print('Hello world')
Hello world

Getting help by using the ? operator after an object :

In [2] : print?
Type :               builtin_function_or_method
Base Class :         <type 'builtin_function_or_method'>
String Form :        <built-in function print>
Namespace :          Python builtin
Docstring :
    print(value…, sep=' ', end='\n', file=sys.stdout)

    Prints the values to a stream, or to sys.stdout by default.
    Optional keyword arguments :
    file : a file-like object (stream); defaults to the current sys.stdout.
    sep :  string inserted between values, default a space.
    end :  string appended after the last value, default a newline.

I-A-3-b. Elaboration of the algorithm in an editor

Create a file my_file.py in a text editor. Under EPD (Enthought Python Distribution), you can use Scite, available from the start menu. Under Python(x,y), you can use Spyder. Under Ubuntu, if you don't already have your favorite editor, we would advise installing Stani's Python editor. In the file, add the following lines :

s = 'Hello world'

Now, you can run it in IPython and explore the resulting variables :

In [1] : %run my_file.py
Hello world

In [2] : s
Out[2] : 'Hello world'

In [3] : %whos
Variable   Type    Data/Info
s          str     Hello world

From a script to functions

While it is tempting to work only with scripts, that is a file full of instructions following each other, do plan to progressively evolve the script to a set of functions :

  • A script is not reusable, functions are.
  • Thinking in terms of functions helps breaking the problem in small blocks.

I-A-3-c. IPython Tips and Tricks

The IPython user manual contains a wealth of information about using IPython, but to get you started we want to give you a quick introduction to three useful features : history, magic functions, aliases and tab completion.

Like a UNIX shell, IPython supports command history. Type up and down to navigate previously typed commands :

In [1] : x = 10

In [2] : <UP>

In [2] : x = 10

IPython supports so called magic functions by prefixing a command with the % character. For example, the run and whos functions from the previous section are magic functions. Note that, the setting automagic, which is enabled by default, allows you to omit the preceding % sign. Thus, you can just type the magic function and it will work.

Other useful magic functions are :

  • %cd to change the current directory.
In [2] : cd /tmp
  • %timeit allows you to time the execution of short snippets using the timeit module from the standard library :
In [3] : timeit x = 10
10000000 loops, best of 3 : 39 ns per loop
  • %cpaste allows you to paste code, especially code from websites which has been prefixed with the standard python prompt (e.g. ) or with an ipython prompt, (e.g. in [3]) :
In [5] : cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
 :In [3] : timeit x = 10
10000000 loops, best of 3 : 85.9 ns per loop
In [6] : cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
 : timeit x = 10
10000000 loops, best of 3 : 86 ns per loop
  • %debug allows you to enter postmortem debugging. That is to say, if the code you try to execute, raises an exception, using %debug will enter the debugger at the point where the exception was thrown.
In [7] : x === 10
  File « <ipython-input-6-12fd421b5f28> », line 1
    x === 10
SyntaxError : invalid syntax

In [8] : debug
> /home/esc/anaconda/lib/python2.7/site-packages/IPython/core/compilerop.py(87)ast_parse()
     86         and are passed to the built-in compile function.« " »
---> 87         return compile(source, filename, symbol, self.flags | PyCF_ONLY_AST, 1)

{'source' : u'x === 10\n', 'symbol' : 'exec', 'self' :
<IPython.core.compilerop.CachingCompiler instance at 0x2ad8ef0>,
'filename' : '<ipython-input-6-12fd421b5f28>'}

IPython help

  • The built-in IPython cheat-sheet is accessible via the %quickref magic function.
  • A list of all available magic functions is shown when typing %magic.

Furthermore IPython ships with various aliases which emulate common UNIX command line tools such as ls to list files, cp to copy files and rm to remove files. A list of aliases is shown when typing alias :

In [1] : alias
Total number of aliases : 16
Out[1] :
[('cat', 'cat'),
('clear', 'clear'),
('cp', 'cp -i'),
('ldir', 'ls -F -o --color %l | grep /$'),
('less', 'less'),
('lf', 'ls -F -o --color %l | grep ^-'),
('lk', 'ls -F -o --color %l | grep ^l'),
('ll', 'ls -F -o --color'),
('ls', 'ls -F --color'),
('lx', 'ls -F -o --color %l | grep ^-..x'),
('man', 'man'),
('mkdir', 'mkdir'),
('more', 'more'),
('mv', 'mv -i'),
('rm', 'rm -i'),
('rmdir', 'rmdir')]

Lastly, we would like to mention the tab completion feature, whose description we cite directly from the IPython manual :

Tab completion, especially for attributes, is a convenient way to explore the structure of any object you're dealing with. Simply type object_name.<TAB> to view the object's attributes. Besides Python objects and keywords, tab completion also works on file and directory names.

In [1] : x = 10

In [2] : x.<TAB>
x.bit_length   x.conjugate    x.denominator  x.imag         x.numerator

In [3] : x.real.
x.real.bit_length   x.real.denominator  x.real.numerator
x.real.conjugate    x.real.imag         x.real.real

In [4] : x.real.

I-B. The Python language

Python for scientific computing :

We introduce here the Python language. Only the bare minimum necessary for getting started with Numpy and Scipy is addressed here. To learn more about the language, consider going through the excellent tutorial http ://docs.python.org/tutorial. Dedicated books are also available, such as http ://diveintopython.org/.

Image non disponible

Python is a programming language, as are C, Fortran, BASIC, PHP, etc. Some specific features of Python are as follows :

  • an interpreted (as opposed to compiled) language. Contrary to e.g. C or Fortran, one does not compile Python code before executing it. In addition, Python can be used interactively : many Python interpreters are available, from which commands and scripts can be executed.
  • a free software released under an open-source license : Python can be used and distributed free of charge, even for building commercial software.
  • Multiplatform : Python is available for all major operating systems, Windows, Linux/Unix, MacOS X, most likely your mobile phone OS, etc.
  • a very readable language with clear non-verbose syntax
  • a language for which a large variety of high-quality packages are available for various applications, from web frameworks to scientific computing.
  • a language very easy to interface with other languages, in particular C and C++.
  • Some other features of the language are illustrated just below. For example, Python is an object-oriented language, with dynamic typing (the same variable can contain objects of different types during the course of a program).

See http ://www.python.org/about/ for more information about distinguishing features of Python.

I-B-1. First steps

Start the Ipython shell (an enhanced interactive Python shell) :

  • by typing “ipython” from a Linux/Mac terminal, or from the Windows cmd shell,
  • or by starting the program from a menu, e.g. in the Python(x,y) or EPD menu if you have installed one of these scientific-Python suites.

If you don't have Ipython installed on your computer, other Python shells are available, such as the plain Python shell started by typing “python” in a terminal, or the Idle interpreter. However, we advise to use the Ipython shell because of its enhanced features, especially for interactive scientific computing.

Once you have started the interpreter, type :

 print "Hello, world!"
Hello, world!

The message “Hello, world!” is then displayed. You just executed your first Python instruction, congratulations!

To get yourself started, type the following stack of instructions :

 a = 3
 b = 2*a
<type 'int'>
 print b
 b = 'hello'
<type 'str'>
 b + b

Two variables a and b have been defined above. Note that one does not declare the type of an variable before assigning its value. In C, conversely, one should write :

int a = 3;

In addition, the type of a variable may change, in the sense that at one point in time it can be equal to a value of a certain type, and a second point in time, it can be equal to a value of a different type. b was first equal to an integer, but it became equal to a string when it was assigned the value 'hello'. Operations on integers (b=2*a) are coded natively in Python, and so are some operations on strings such as additions and multiplications, which amount respectively to concatenation and repetition.

I-B-2. Basic types

I-B-2-a. Numerical types

Python supports the following numerical, scalar types :

 1 + 1
 a = 4
<type 'int'>
 c = 2.1
<type 'float'>
 a = 1.5 + 0.5j
 type(1. + 0j )
<type 'complex'>
 3 > 4
 test = (3 > 4)
<type 'bool'>

A Python shell can therefore replace your pocket calculator, with the basic arithmetic operations +, -, *, /, % (modulo) natively implemented

 7 * 3.
 8 % 3

Type conversion (casting) :


Warning Integer division

 3 / 2

Trick : use floats :

 3 / 2.

 a = 3
 b = 2
 a / b
 a / float(b)

If you explicitly want integer division use // :

 3.0 // 2

The behaviour of the division operator has changed in Python 3. Please look at the python3porting website for details.

I-B-2-b. Containers

Python provides many efficient types of containers, in which collections of objects can be stored.

I-B-2-b-i. Lists

A list is an ordered collection of objects, that may have different types. For example :

 L = ['red', 'blue', 'green', 'black', 'white']
<type 'list'>

Indexing : accessing individual objects contained in the list :


Counting from the end with negative indices :


Indexing starts at 0 (as in C), not at 1 (as in Fortran or Matlab)!

Slicing : obtaining sublists of regularly-spaced elements :

['red', 'blue', 'green', 'black', 'white']
 L[2 :4]
['green', 'black']

Note that L[start :stop] contains the elements with indices i such as start<= i < stop (i ranging from start to stop-1). Therefore, L[start :stop] has (stop-start) elements.

Slicing syntax : L[start :stop :stride]

All slicing parameters are optional :

['red', 'blue', 'green', 'black', 'white']
 L[3 :]
['black', 'white']
 L[ :3]
['red', 'blue', 'green']
 L[ : :2]
['red', 'green', 'white']

Lists are mutable objects and can be modified :

 L[0] = 'yellow'
['yellow', 'blue', 'green', 'black', 'white']
 L[2 :4] = ['gray', 'purple']
['yellow', 'blue', 'gray', 'purple', 'white']

The elements of a list may have different types :

 L = [3, -200, 'hello']
[3, -200, 'hello']
 L[1], L[2]
(-200, 'hello')

For collections of numerical data that all have the same type, it is often more efficient to use the array type provided by the numpy module. A NumPy array is a chunk of memory containing fixed-sized items. With NumPy arrays, operations on elements can be faster because elements are regularly spaced in memory and more operations are performed through specialized C functions instead of Python loops.

Python offers a large panel of functions to modify lists, or query them. Here are a few examples; for more details, see http ://docs.python.org/tutorial/datastructures.html#more-on-lists

Add and remove elements :

 L = ['red', 'blue', 'green', 'black', 'white']
['red', 'blue', 'green', 'black', 'white', 'pink']
 L.pop() # removes and returns the last item
['red', 'blue', 'green', 'black', 'white']
 L.extend(['pink', 'purple']) # extend L, in-place
['red', 'blue', 'green', 'black', 'white', 'pink', 'purple']
 L = L[ :-2]
['red', 'blue', 'green', 'black', 'white']

Reverse :

 r = L[ : :-1]
['white', 'black', 'green', 'blue', 'red']
 r2 = list(L)
['red', 'blue', 'green', 'black', 'white']
 r2.reverse() # in-place
['white', 'black', 'green', 'blue', 'red']

Concatenate and repeat lists :

 r + L
['white', 'black', 'green', 'blue', 'red', 'red', 'blue', 'green', 'black', 'white']
 r * 2
['white', 'black', 'green', 'blue', 'red', 'white', 'black', 'green', 'blue', 'red']

Sort :

 sorted(r) # new object
['black', 'blue', 'green', 'red', 'white']
['white', 'black', 'green', 'blue', 'red']
 r.sort()  # in-place
['black', 'blue', 'green', 'red', 'white']

Methods and Object-Oriented Programming :

The notation r.method() (e.g. r.append(3) and L.pop()) is our first example of object-oriented programming (OOP). Being a list, the object r owns the method function that is called using the notation .. No further knowledge of OOP than understanding the notation . is necessary for going through this tutorial.

Discovering methods :

Reminder : in Ipython : tab-completion (press tab)

In [28] : r.<TAB>
r.__add__           r.__iadd__          r.__setattr__
r.__class__         r.__imul__          r.__setitem__
r.__contains__      r.__init__          r.__setslice__
r.__delattr__       r.__iter__          r.__sizeof__
r.__delitem__       r.__le__            r.__str__
r.__delslice__      r.__len__           r.__subclasshook__
r.__doc__           r.__lt__            r.append
r.__eq__            r.__mul__           r.count
r.__format__        r.__ne__            r.extend
r.__ge__            r.__new__           r.index
r.__getattribute__  r.__reduce__        r.insert
r.__getitem__       r.__reduce_ex__     r.pop
r.__getslice__      r.__repr__          r.remove
r.__gt__            r.__reversed__      r.reverse
r.__hash__          r.__rmul__          r.sort
I-B-2-b-ii. Strings

Different string syntaxes (simple, double or triple quotes) :

s = 'Hello, how are you?'
s = « Hi, what's up »
s = '''Hello,                 # tripling the quotes allows the
       how are you'''         # the string to span more than one line
s = « " »Hi,
what's up?« " »


n [1] : 'Hi, what's up?'
   File « <ipython console> », line 1
    'Hi, what's up?'
SyntaxError : invalid syntax

The newline character is \n, and the tab character is \t.

Strings are collections like lists. Hence they can be indexed and sliced, using the same syntax and rules.

Indexing :

 a = « hello »

(Remember that negative indices correspond to counting from the right end.)

Slicing :

 a = « hello, world! »
 a[3 :6] # 3rd to 6th (excluded) elements : elements 3, 4, 5
 a[2 :10 :2] # Syntax : a[start :stop :step]
'lo o'
 a[ : :3] # every three characters, from beginning to end
'hl r!'

Accents and special characters can also be handled in Unicode strings (see http ://docs.python.org/tutorial/introduction.html#unicode-strings).

A string is an immutable object and it is not possible to modify its contents. One may however create new strings from the original one.

In [53] : a = « hello, world! »
In [54] : a[2] = 'z'
Traceback (most recent call last) :
   File « <stdin> », line 1, in <module>
TypeError : 'str' object does not support item assignment

In [55] : a.replace('l', 'z', 1)
Out[55] : 'hezlo, world!'
In [56] : a.replace('l', 'z')
Out[56] : 'hezzo, worzd!'

Strings have many useful methods, such as a.replace as seen above. Remember the a. object-oriented notation and use tab completion or help(str) to search for new methods.

Python offers advanced possibilities for manipulating strings, looking for patterns or formatting. The interested reader is referred to http ://docs.python.org/library/stdtypes.html#string-methods and http ://docs.python.org/library/string.html#new-string-formatting

String formatting :

 'An integer : %i; a float : %f; another string : %s' % (1, 0.1, 'string')
'An integer : 1; a float : 0.100000; another string : string'

 i = 102
 filename = 'processing_of_dataset_%d.txt' % i
I-B-2-b-iii. Dictionaries

A dictionary is basically an efficient table that maps keys to values. It is an unordered container

 tel = {'emmanuelle' : 5752, 'sebastian' : 5578}
 tel['francis'] = 5915
{'sebastian' : 5578, 'francis' : 5915, 'emmanuelle' : 5752}
['sebastian', 'francis', 'emmanuelle']
[5578, 5915, 5752]
 'francis' in tel

It can be used to conveniently store and retrieve values associated with a name (a string for a date, a name, etc.). See http ://docs.python.org/tutorial/datastructures.html#dictionaries for more information.

A dictionary can have keys (resp. values) with different types :
 d = {'a' :1, 'b' :2, 3 :'hello'}
{'a' : 1, 3 : 'hello', 'b' : 2}
I-B-2-b-iv. More container types


Tuples are basically immutable lists. The elements of a tuple are written between parentheses, or just separated by commas :

 t = 12345, 54321, 'hello!'
(12345, 54321, 'hello!')
 u = (0, 2)

Sets : unordered, unique items :

 s = set(('a', 'b', 'c', 'a'))
set(['a', 'c', 'b'])
 s.difference(('a', 'b'))

I-B-2-c. Assignment operator

Python library reference says :

Assignment statements are used to (re)bind names to values and to modify attributes or items of mutable objects.

In short, it works as follows (simple assignment) :

  1. an expression on the right hand side is evaluated, the corresponding object is created/obtained
  2. a name on the left hand side is assigned, or bound, to the r.h.s. object

Things to note :

  • a single object can have several names bound to it :
In [1] : a = [1, 2, 3]
In [2] : b = a
In [3] : a
Out[3] : [1, 2, 3]
In [4] : b
Out[4] : [1, 2, 3]
In [5] : a is b
Out[5] : True
In [6] : b[1] = 'hi!'
In [7] : a
Out[7] : [1, 'hi!', 3]
  • to change a list in place, use indexing/slices :
In [1] : a = [1, 2, 3]
In [3] : a
Out[3] : [1, 2, 3]
In [4] : a = ['a', 'b', 'c'] # Creates another object.
In [5] : a
Out[5] : ['a', 'b', 'c']
In [6] : id(a)
Out[6] : 138641676
In [7] : a[ :] = [1, 2, 3] # Modifies object in place.
In [8] : a
Out[8] : [1, 2, 3]
In [9] : id(a)
Out[9] : 138641676 # Same as in Out[6], yours will differ…
  • the key concept here is mutable vs. Immutable :
  • mutable objects can be changed in place
  • immutable objects cannot be modified once created

A very good and detailed explanation of the above issues can be found in David M. Beazley's article Types and Objects in Python.

I-B-3. Control Flow

Controls the order in which the code is executed.

I-B-3-a. if/elif/else

 if 2**2 == 4 :
...     print 'Obvious!'

Blocks are delimited by indentation

Type the following lines in your Python interpreter, and be careful to respect the indentation depth. The Ipython shell automatically increases the indentation depth after a column : sign; to decrease the indentation depth, go four spaces to the left with the Backspace key. Press the Enter key twice to leave the logical block.

In [1] : a = 10

In [2] : if a == 1 :
   ... :     print(1)
   ... : elif a == 2 :
   ... :     print(2)
   ... : else :
   ... :     print('A lot')
   ... :
A lot

Indentation is compulsory in scripts as well. As an exercise, re-type the previous lines with the same indentation in a script condition.py, and execute the script with run condition.py in Ipython.

I-B-3-b. for/range

Iterating with an index :

 for i in range(4) :
...     print(i)

But most often, it is more readable to iterate over values :

 for word in ('cool', 'powerful', 'readable') :
...     print('Python is %s' % word)
Python is cool
Python is powerful
Python is readable

I-B-3-c. while/break/continue

Typical C-style while loop (Mandelbrot problem) :

 z = 1 + 1j
 while abs(z) < 100 :
...     z = z**2 + 1

More advanced features

break out of enclosing for/while loop :

 z = 1 + 1j

 while abs(z) < 100 :
...     if z.imag == 0 :
...         break
...     z = z**2 + 1

continue the next iteration of a loop. :

 a = [1, 0, 2, 4]
 for element in a :
...     if element == 0 :
...         continue
...     print 1. / element

I-B-3-d. Conditional Expressions

if <OBJECT> :

Evaluates to False :

  • any number equal to zero (0, 0.0, 0+0j)
  • an empty container (list, tuple, set, dictionary…)
  • False, None

Evaluates to True :

  • everything else

a == b :

Tests equality, with logics :


 1 == 1.

a is b :

Tests identity : both sides are the same object :

 1 is 1.

 a = 1
 b = 1
 a is b

a in b :

For any collection b : b contains a :

 b = [1, 2, 3]
 2 in b
 5 in b

If b is a dictionary, this tests that a is a key of b.

I-B-3-e. Advanced iteration

I-B-3-e-i. Iterate over any sequence

You can iterate over any sequence (string, list, keys in a dictionary, lines in a file…) :

 vowels = 'aeiouy'

 for i in 'powerful' :
...     if i in vowels :
...         print(i),
o e u

 message = "Hello how are you?"
 message.split() # returns a list
['Hello', 'how', 'are', 'you?']
 for word in message.split() :
...     print word

Few languages (in particular, languages for scientific computing) allow to loop over anything but integers/indices. With Python it is possible to loop exactly over the objects of interest without bothering with indices you often don't care about. This feature can often be used to make code more readable.

Not safe to modify the sequence you are iterating over.

I-B-3-e-ii. Keeping track of enumeration number

Common task is to iterate over a sequence while keeping track of the item number.

  • Could use while loop with a counter as above. Or a for loop :
 words = ('cool', 'powerful', 'readable')
 for i in range(0, len(words)) :
...     print i, words[i]
0 cool
1 powerful
2 readable
  • But, Python provides enumerate keyword for this :
 for index, item in enumerate(words) :
...     print index, item
0 cool
1 powerful
2 readable
I-B-3-e-iii. Looping over a dictionary

Use iteritems :

 d = {'a' : 1, 'b' :1.2, 'c' :1j}

 for key, val in d.iteritems() :
...     print('Key : %s has value : %s' % (key, val))
Key : a has value : 1
Key : c has value : 1j
Key : b has value : 1.2

I-B-3-f. List Comprehensions

 [i**2 for i in range(4)]
[0, 1, 4, 9]

Exercise :

Compute the decimals of Pi using the Wallis formula :

kitxmlcodelatexdvp\pi = 2 \prod_{i=1}^{\infty} \frac{4i^2}{4i^2 - 1}finkitxmlcodelatexdvp

I-B-4. Defining functions

I-B-4-a. Function definition

In [56] : def test() :
   .... :     print('in test function')
   .... :
   .... :

In [57] : test()
in test function

Function blocks must be indented as other control-flow blocks.

I-B-4-b. Return statement

Functions can optionally return values.

In [6] : def disk_area(radius) :
   ... :     return 3.14 * radius * radius
   ... :

In [8] : disk_area(1.5)
Out[8] : 7.0649999999999995

By default, functions return None.

Note the syntax to define a function :

  • the def keyword;
  • is followed by the function's name, then
  • the arguments of the function are given between parentheses followed by a colon.
  • the function body;
  • and return object for optionally returning values.

I-B-4-c. Parameters

Mandatory parameters (positional arguments) :

In [81] : def double_it(x) :
   .... :     return x * 2
   .... :

In [82] : double_it(3)
Out[82] : 6

In [83] : double_it()
Traceback (most recent call last) :
  File "<stdin>", line 1, in <module>
TypeError : double_it() takes exactly 1 argument (0 given)

Optional parameters (keyword or named arguments) :

In [84] : def double_it(x=2) :
   .... :     return x * 2
   .... :

In [85] : double_it()
Out[85] : 4

In [86] : double_it(3)
Out[86] : 6

Keyword arguments allow you to specify default values.

Default values are evaluated when the function is defined, not when it is called. This can be problematic when using mutable types (e.g. dictionary or list) and modifying them in the function body, since the modifications will be persistent across invocations of the function.

Using an immutable type in a keyword argument :

In [124] : bigx = 10

In [125] : def double_it(x=bigx) :
   ..... :     return x * 2
   ..... :

In [126] : bigx = 1e9  # Now really big

In [128] : double_it()
Out[128] : 20

Using an mutable type in a keyword argument (and modifying it inside the function body) :

In [2] : def add_to_dict(args={'a' : 1, 'b' : 2}) :
   ... :     for i in args.keys() :
   ... :         args[i] += 1
   ... :     print args
   ... :

In [3] : add_to_dict
Out[3] : <function __main__.add_to_dict>

In [4] : add_to_dict()
{'a' : 2, 'b' : 3}

In [5] : add_to_dict()
{'a' : 3, 'b' : 4}

In [6] : add_to_dict()
{'a' : 4, 'b' : 5}

More involved example implementing python's slicing :

In [98] : def slicer(seq, start=None, stop=None, step=None) :
   .... :     """Implement basic python slicing."""
   .... :     return seq[start :stop :step]
   .... :

In [101] : rhyme = 'one fish, two fish, red fish, blue fish'.split()

In [102] : rhyme
Out[102] : ['one', 'fish,', 'two', 'fish,', 'red', 'fish,', 'blue', 'fish']

In [103] : slicer(rhyme)
Out[103] : ['one', 'fish,', 'two', 'fish,', 'red', 'fish,', 'blue', 'fish']

In [104] : slicer(rhyme, step=2)
Out[104] : ['one', 'two', 'red', 'blue']

In [105] : slicer(rhyme, 1, step=2)
Out[105] : ['fish,', 'fish,', 'fish,', 'fish']

In [106] : slicer(rhyme, start=1, stop=4, step=2)
Out[106] : ['fish,', 'fish,']

The order of the keyword arguments does not matter :

In [107] : slicer(rhyme, step=2, start=1, stop=4)
Out[107] : ['fish,', 'fish,']

But it is good practice to use the same ordering as the function's definition.

Keyword arguments are a very convenient feature for defining functions with a variable number of arguments, especially when default values are to be used in most calls to the function.

I-B-4-d. Passing by value

Can you modify the value of a variable inside a function? Most languages (C, Java…) distinguish “passing by value” and “passing by reference”. In Python, such a distinction is somewhat artificial, and it is a bit subtle whether your variables are going to be modified or not. Fortunately, there exist clear rules.

Parameters to functions are references to objects, which are passed by value. When you pass a variable to a function, python passes the reference to the object to which the variable refers (the value). Not the variable itself.

If the value passed in a function is immutable, the function does not modify the caller's variable. If the value is mutable, the function may modify the caller's variable in-place :

 def try_to_modify(x, y, z) :
...     x = 23
...     y.append(42)
...     z = [99] # new reference
...     print(x)
...     print(y)
...     print(z)
 a = 77    # immutable variable
 b = [99]  # mutable variable
 c = [28]
 try_to_modify(a, b, c)
[99, 42]
[99, 42]

Functions have a local variable table called a local namespace.

The variable x only exists within the function try_to_modify.

I-B-4-e. Global variables

Variables declared outside the function can be referenced within the function :

In [114] : x = 5

In [115] : def addx(y) :
   ..... :     return x + y
   ..... :

In [116] : addx(10)
Out[116] : 15

But these “global” variables cannot be modified within the function, unless declared global in the function.

This doesn't work :

In [117] : def setx(y) :
   ..... :     x = y
   ..... :     print('x is %d' % x)
   ..... :
   ..... :

In [118] : setx(10)
x is 10

In [120] : x
Out[120] : 5

This works :

In [121] : def setx(y) :
   ..... :     global x
   ..... :     x = y
   ..... :     print('x is %d' % x)
   ..... :
   ..... :

In [122] : setx(10)
x is 10

In [123] : x
Out[123] : 10

I-B-4-f. Variable number of parameters

Special forms of parameters :

  • *args : any number of positional arguments packed into a tuple
  • **kwargs : any number of keyword arguments packed into a dictionary
In [35] : def variable_args(*args, **kwargs) :
   .... :     print 'args is', args
   .... :     print 'kwargs is', kwargs
   .... :

In [36] : variable_args('one', 'two', x=1, y=2, z=3)
args is ('one', 'two')
kwargs is {'y' : 2, 'x' : 1, 'z' : 3}

I-B-4-g. Docstrings

Documentation about what the function does and its parameters. General convention :

In [67] : def funcname(params) :
   .... :     """Concise one-line sentence describing the function.
   .... :
   .... :     Extended summary which can contain multiple paragraphs.
   .... :     """
   .... :     # function body
   .... :     pass
   .... :

In [68] : funcname?
Type :           function
Base Class :     type 'function'>
String Form :    <function funcname at 0xeaa0f0>
Namespace :      Interactive
File :           <ipython console>
Definition :     funcname(params)
Docstring :
    Concise one-line sentence describing the function.

    Extended summary which can contain multiple paragraphs.

Docstring guidelines

For the sake of standardization, the Docstring Conventions webpage documents the semantics and conventions associated with Python docstrings.

Also, the Numpy and Scipy modules have defined a precise standard for documenting scientific functions, that you may want to follow for your own functions, with a Parameters section, an Examples section, etc. See http ://projects.scipy.org/numpy/wiki/CodingStyleGuidelines#docstring-standard and http ://projects.scipy.org/numpy/browser/trunk/doc/example.py#L37

I-B-4-h. Functions are objects

Functions are first-class objects, which means they can be :

  • assigned to a variable
  • an item in a list (or any collection)
  • passed as an argument to another function.
In [38] : va = variable_args

In [39] : va('three', x=1, y=2)
args is ('three',)
kwargs is {'y' : 2, 'x' : 1}

I-B-4-i. Methods

Methods are functions attached to objects. You've seen these in our examples on lists, dictionaries, strings, etc…

I-B-4-j. Exercises

Exercise : Fibonacci sequence

Write a function that displays the n first terms of the Fibonacci sequence, defined by :

  • u_0 = 1; u_1 = 1
  • u_(n+2) = u_(n+1) + u_n

Exercise : Quicksort

Implement the quicksort algorithm, as defined by wikipedia :

function quicksort(array)

var list less, greater if length(array) < 2

return array

select and remove a pivot value pivot from array for each x in array

if x < pivot + 1 then append x to less else append x to greater

return concatenate(quicksort(less), pivot, quicksort(greater))

I-B-5. Reusing code : scripts and modules

For now, we have typed all instructions in the interpreter. For longer sets of instructions we need to change track and write the code in text files (using a text editor), that we will call either scripts or modules. Use your favorite text editor (provided it offers syntax highlighting for Python), or the editor that comes with the Scientific Python Suite you may be using (e.g., Scite with Python(x,y)).

I-B-5-a. Scripts

Let us first write a script, that is a file with a sequence of instructions that are executed each time the script is called. Instructions may be e.g. copied-and-pasted from the interpreter (but take care to respect indentation rules!).

The extension for Python files is .py. Write or copy-and-paste the following lines in a file called test.py

message = "Hello how are you?"
for word in message.split() :
    print word

Let us now execute the script interactively, that is inside the Ipython interpreter. This is maybe the most common use of scripts in scientific computing.

in Ipython, the syntax to execute a script is %run script.py.

For example :

In [1] : %run test.py

In [2] : message
Out[2] : 'Hello how are you?'

The script has been executed. Moreover the variables defined in the script (such as message) are now available inside the interpreter's namespace.

Other interpreters also offer the possibility to execute scripts (e.g., execfile in the plain Python interpreter, etc.).

It is also possible In order to execute this script as a standalone program, by executing the script inside a shell terminal (Linux/Mac console or cmd Windows console). For example, if we are in the same directory as the test.py file, we can execute this in a console :

$ python test.py

Standalone scripts may also take command-line arguments

In file.py :

import sys
print sys.argv
$ python file.py test arguments
['file.py', 'test', 'arguments']

Don't implement option parsing yourself. Use modules such as optparse, argparse or docopt.

I-B-5-b. Importing objects from modules

In [1] : import os

In [2] : os
Out[2] : <module 'os' from '/usr/lib/python2.6/os.pyc'>

In [3] : os.listdir('.')
Out[3] :

And also :

In [4] : from os import listdir

Importing shorthands :

In [5] : import numpy as np
from os import *

This is called the star import and please, Use it with caution

  • Makes the code harder to read and understand : where do symbols come from?
  • Makes it impossible to guess the functionality by the context and the name (hint : os.name is the name of the OS), and to profit usefully from tab completion.
  • Restricts the variable names you can use : os.name might override name, or vise-versa.
  • Creates possible name clashes between modules.
  • Makes the code impossible to statically check for undefined symbols.

Modules are thus a good way to organize code in a hierarchical way. Actually, all the scientific computing tools we are going to use are modules :

 import numpy as np # data arrays
 np.linspace(0, 10, 6)
array([  0.,   2.,   4.,   6.,   8.,  10.])
 import scipy # scientific computing

In Python(x,y), Ipython(x,y) executes the following imports at startup :

 import numpy
 import numpy as np
 from pylab import *
 import scipy

and it is not necessary to re-import these modules.

I-B-5-c. Creating modules

If we want to write larger and better organized programs (compared to simple scripts), where some objects are defined, (variables, functions, classes) and that we want to reuse several times, we have to create our own modules.

Let us create a module demo contained in the file demo.py :

"A demo module."

def print_b() :
    "Prints b."
    print 'b'

def print_a() :
    "Prints a."
    print 'a'

c = 2
d = 2

In this file, we defined two functions print_a and print_b. Suppose we want to call the print_a function from the interpreter. We could execute the file as a script, but since we just want to have access to the function print_a, we are rather going to import it as a module. The syntax is as follows.

In [1] : import demo

In [2] : demo.print_a()

In [3] : demo.print_b()

Importing the module gives access to its objects, using the module.object syntax. Don't forget to put the module's name before the object's name, otherwise Python won't recognize the instruction.

Introspection :

In [4] : demo?
Type :               module
Base Class : <type 'module'>
String Form :        <module 'demo' from 'demo.py'>
Namespace :  Interactive
File :               /home/varoquau/Projects/Python_talks/scipy_2009_tutorial/source/demo.py
Docstring :
    A demo module.

In [5] : who

In [6] : whos
Variable   Type      Data/Info
demo       module    <module 'demo' from 'demo.py'>

In [7] : dir(demo)
Out[7] :

In [8] : demo.
demo.__builtins__      demo.__init__          demo.__str__
demo.__class__         demo.__name__          demo.__subclasshook__
demo.__delattr__       demo.__new__           demo.c
demo.__dict__          demo.__package__       demo.d
demo.__doc__           demo.__reduce__        demo.print_a
demo.__file__          demo.__reduce_ex__     demo.print_b
demo.__format__        demo.__repr__          demo.py
demo.__getattribute__  demo.__setattr__       demo.pyc
demo.__hash__          demo.__sizeof__

Importing objects from modules into the main namespace :

In [9] : from demo import print_a, print_b

In [10] : whos
Variable   Type        Data/Info
demo       module      <module 'demo' from 'demo.py'>
print_a    function    <function print_a at 0xb7421534>
print_b    function    <function print_b at 0xb74214c4>

In [11] : print_a()

Module caching

Modules are cached : if you modify demo.py and re-import it in the old session, you will get the old one.

Solution :

In [10] : reload(demo)

I-B-5-d. ‘__main__' and module loading

File demo2.py :

import sys

def print_a() :
    « Prints a. »
    print 'a'

print sys.argv

if __name__ == '__main__' :

Importing it :

In [11] : import demo2

In [12] : import demo2

Running it :

In [13] : %run demo2

I-B-5-e. Scripts or modules? How to organize your code

Rule of thumb

  • Sets of instructions that are called several times should be written inside functions for better code reusability.
  • Functions (or other bits of code) that are called from several scripts should be written inside a module, so that only the module is imported in the different scripts (do not copy-and-paste your functions in the different scripts!).
I-B-5-e-i. How modules are found and imported

When the import mymodule statement is executed, the module mymodule is searched in a given list of directories. This list includes a list of installation-dependent default path (e.g., /usr/lib/python) as well as the list of directories specified by the environment variable PYTHONPATH.

The list of directories searched by Python is given by the sys.path variable

In [1] : import sys

In [2] : sys.path
Out[2] :

Modules must be located in the search path, therefore you can :

  • write your own modules within directories already defined in the search path (e.g. $HOME/.local/lib/python2.7/dist-packages). You may use symbolic links (on Linux) to keep the code somewhere else.
  • modify the environment variable PYTHONPATH to include the directories containing the user-defined modules.
    On Linux/Unix, add the following line to a file read by the shell at startup (e.g. /etc/profile, .profile)

    export PYTHONPATH=$PYTHONPATH :/home/emma/user_defined_modules

    On Windows, http ://support.microsoft.com/kb/310519 explains how to handle environment variables.

  • or modify the sys.path variable itself within a Python script.
import sys
new_path = '/home/emma/user_defined_modules'
if new_path not in sys.path :

This method is not very robust, however, because it makes the code less portable (user-dependent path) and because you have to add the directory to your sys.path each time you want to import from a module in this directory.

See http ://docs.python.org/tutorial/modules.html for more information about modules.

I-B-5-f. Packages

A directory that contains many modules is called a package. A package is a module with submodules (which can have submodules themselves, etc.). A special file called __init__.py (which may be empty) tells Python that the directory is a Python package, from which modules can be imported.

$ ls
cluster/        io/          README.txt@     stsci/
__config__.py@  LATEST.txt@  setup.py@       __svn_version__.py@
__config__.pyc  lib/         setup.pyc       __svn_version__.pyc
constants/      linalg/      setupscons.py@  THANKS.txt@
fftpack/        linsolve/    setupscons.pyc  TOCHANGE.txt@
__init__.py@    maxentropy/  signal/         version.py@
__init__.pyc    misc/        sparse/         version.pyc
INSTALL.txt@    ndimage/     spatial/        weave/
integrate/      odr/         special/
interpolate/    optimize/    stats/
$ cd ndimage
$ ls
doccer.py@   fourier.pyc   interpolation.py@  morphology.pyc   setup.pyc
doccer.pyc   info.py@      interpolation.pyc  _nd_image.so
filters.py@  info.pyc      measurements.py@   _ni_support.py@
filters.pyc  __init__.py@  measurements.pyc   _ni_support.pyc  tests/
fourier.py@  __init__.pyc  morphology.py@     setup.py@

From Ipython :

In [1] : import scipy

In [2] : scipy.__file__
Out[2] : '/usr/lib/python2.6/dist-packages/scipy/__init__.pyc'

In [3] : import scipy.version

In [4] : scipy.version.version
Out[4] : '0.7.0'

In [5] : import scipy.ndimage.morphology

In [6] : from scipy.ndimage import morphology

In [17] : morphology.binary_dilation?
Type :           function
Base Class :     <type 'function'>
String Form :    <function binary_dilation at 0x9bedd84>
Namespace :      Interactive
File :           /usr/lib/python2.6/dist-packages/scipy/ndimage/morphology.py
Definition :     morphology.binary_dilation(input, structure=None,
iterations=1, mask=None, output=None, border_value=0, origin=0,
Docstring :
    Multidimensional binary dilation with the given structure.

    An output array can optionally be provided. The origin parameter
    controls the placement of the filter. If no structuring element is
    provided an element is generated with a squared connectivity equal
    to one. The dilation operation is repeated iterations times.  If
    iterations is less than 1, the dilation is repeated until the
    result does not change anymore.  If a mask is given, only those
    elements with a true value at the corresponding mask element are
    modified at each iteration.

I-B-5-g. Good practices

  • Use meaningful object names
  • Indentation : no choice!
    Indenting is compulsory in Python! Every command block following a colon bears an additional indentation level with respect to the previous line with a colon. One must therefore indent after def f() : or while :. At the end of such logical blocks, one decreases the indentation depth (and re-increases it if a new block is entered, etc.)
    Strict respect of indentation is the price to pay for getting rid of { or ; characters that delineate logical blocks in other languages. Improper indentation leads to errors such as

    IndentationError : unexpected indent (test.py, line 2)

    All this indentation business can be a bit confusing in the beginning. However, with the clear indentation, and in the absence of extra characters, the resulting code is very nice to read compared to other languages.

  • Indentation depth : Inside your text editor, you may choose to indent with any positive number of spaces (1, 2, 3, 4…). However, it is considered good practice to indent with 4 spaces. You may configure your editor to map the Tab key to a 4-space indentation. In Python(x,y), the editor is already configured this way.

  • Style guidelines

Long lines : you should not write very long lines that span over more than (e.g.) 80 characters. Long lines can be broken with the \ character

 long_line = "Here is a very very long line \
 that we break in two parts."


Write well-spaced code : put whitespaces after commas, around arithmetic operators, etc. :

 a = 1 # yes
 a=1 # too cramped

A certain number of rules for writing “beautiful” code (and more importantly using the same conventions as anybody else!) are given in the Style Guide for Python Code.

If you want to do a first quick pass through the Scipy lectures to learn the ecosystem, you can directly skip to the next chapter : NumPy : creating and manipulating numerical data.

The remainder of this chapter is not necessary to follow the rest of the intro part. But be sure to come back and finish this chapter later.

I-B-6. Input and Output

To be exhaustive, here are some information about input and output in Python. Since we will use the Numpy methods to read and write files, you may skip this chapter at first reading.

We write or read strings to/from files (other types must be converted to strings). To write in a file :

 f = open('workfile', 'w') # opens the workfile file
<type 'file'>
 f.write('This is a test \nand another test')

To read from a file :

In [1] : f = open('workfile', 'r')

In [2] : s = f.read()

In [3] : print(s)
This is a test
and another test

In [4] : f.close()

For more details : http ://docs.python.org/tutorial/inputoutput.html

I-B-6-a. Iterating over a file

In [6] : f = open('workfile', 'r')

In [7] : for line in f :
   ... :     print line
   ... :
This is a test

and another test

In [8] : f.close()
I-B-6-a-i. File modes
  • Read-only : r
  • Write-only : w
  • Note : Create a new file or overwrite existing file.
  • Append a file : a
  • Read and Write : r+
  • Binary mode : b
  • Note : Use for binary files, especially on Windows.

I-B-7. Standard Library

Reference document for this section :

I-B-7-a. os module : operating system functionality

“A portable way of using operating system dependent functionality.”

I-B-7-a-i. Directory and file manipulation

Current directory :

In [17] : os.getcwd()
Out[17] : '/Users/cburns/src/scipy2009/scipy_2009_tutorial/source'

List a directory :

In [31] : os.listdir(os.curdir)
Out[31] :

Make a directory :

In [32] : os.mkdir('junkdir')

In [33] : 'junkdir' in os.listdir(os.curdir)
Out[33] : True

Rename the directory :

In [36] : os.rename('junkdir', 'foodir')

In [37] : 'junkdir' in os.listdir(os.curdir)
Out[37] : False

In [38] : 'foodir' in os.listdir(os.curdir)
Out[38] : True

In [41] : os.rmdir('foodir')

In [42] : 'foodir' in os.listdir(os.curdir)
Out[42] : False

Delete a file :

In [44] : fp = open('junk.txt', 'w')

In [45] : fp.close()

In [46] : 'junk.txt' in os.listdir(os.curdir)
Out[46] : True

In [47] : os.remove('junk.txt')

In [48] : 'junk.txt' in os.listdir(os.curdir)
Out[48] : False
I-B-7-a-ii. os.path : path manipulations

os.path provides common operations on pathnames.

In [70] : fp = open('junk.txt', 'w')

In [71] : fp.close()

In [72] : a = os.path.abspath('junk.txt')

In [73] : a
Out[73] : '/Users/cburns/src/scipy2009/scipy_2009_tutorial/source/junk.txt'

In [74] : os.path.split(a)
Out[74] : ('/Users/cburns/src/scipy2009/scipy_2009_tutorial/source',

In [78] : os.path.dirname(a)
Out[78] : '/Users/cburns/src/scipy2009/scipy_2009_tutorial/source'

In [79] : os.path.basename(a)
Out[79] : 'junk.txt'

In [80] : os.path.splitext(os.path.basename(a))
Out[80] : ('junk', '.txt')

In [84] : os.path.exists('junk.txt')
Out[84] : True

In [86] : os.path.isfile('junk.txt')
Out[86] : True

In [87] : os.path.isdir('junk.txt')
Out[87] : False

In [88] : os.path.expanduser('~/local')
Out[88] : '/Users/cburns/local'

In [92] : os.path.join(os.path.expanduser('~'), 'local', 'bin')
Out[92] : '/Users/cburns/local/bin'
I-B-7-a-iii. Running an external command
In [8] : os.system('ls')
basic_types.rst   demo.py          functions.rst  python_language.rst  standard_library.rst
control_flow.rst  exceptions.rst   io.rst         python-logo.png
demo2.py          first_steps.rst  oop.rst        reusing_code.rst

Alternative to os.system

A noteworthy alternative to os.system is the sh module. Which provides much more convenient ways to obtain the output, error stream and exit code of the external command.

In [20] : import sh
In [20] : com = sh.ls()

In [21] : print com
basic_types.rst   exceptions.rst   oop.rst              standard_library.rst
control_flow.rst  first_steps.rst  python_language.rst
demo2.py          functions.rst    python-logo.png
demo.py           io.rst           reusing_code.rst

In [22] : print com.exit_code
In [23] : type(com)
Out[23] : sh.RunningCommand
I-B-7-a-iv. Walking a directory

os.path.walk generates a list of filenames in a directory tree.

In [10] : for dirpath, dirnames, filenames in os.walk(os.curdir) :
 .... :     for fp in filenames :
 .... :         print os.path.abspath(fp)
 .... :
 .... :
I-B-7-a-v. Environment variables :
In [9] : import os

In [11] : os.environ.keys()
Out[11] :

In [12] : os.environ['PYTHONPATH']
Out[12] : '. :/Users/cburns/src/utils :/Users/cburns/src/nitools :
/Users/cburns/local/lib/python2.5/site-packages/ :
/usr/local/lib/python2.5/site-packages/ :

In [16] : os.getenv('PYTHONPATH')
Out[16] : '. :/Users/cburns/src/utils :/Users/cburns/src/nitools :
/Users/cburns/local/lib/python2.5/site-packages/ :
/usr/local/lib/python2.5/site-packages/ :

I-B-7-b. shutil : high-level file operations

The shutil provides useful file operations :

  • shutil.rmtree : Recursively delete a directory tree.
  • shutil.move : Recursively move a file or directory to another location.
  • shutil.copy : Copy files or directories.

I-B-7-c. glob : Pattern matching on files

The glob module provides convenient file pattern matching.

Find all files ending in .txt :

In [18] : import glob

In [19] : glob.glob('*.txt')
Out[19] : ['holy_grail.txt', 'junk.txt', 'newfile.txt']

I-B-7-d. sys module : system-specific information

System-specific information related to the Python interpreter.

  • Which version of python are you running and where is it installed :
In [117] : sys.platform
Out[117] : 'darwin'

In [118] : sys.version
Out[118] : '2.5.2 (r252 :60911, Feb 22 2008, 07 :57 :53) \n
          [GCC 4.0.1 (Apple Computer, Inc. build 5363)]'

In [119] : sys.prefix
Out[119] : '/Library/Frameworks/Python.framework/Versions/2.5'
  • List of command line arguments passed to a Python script :
In [100] : sys.argv
Out[100] : ['/Users/cburns/local/bin/ipython']

sys.path is a list of strings that specifies the search path for modules. Initialized from PYTHONPATH :

In [121] : sys.path
Out[121] :

I-B-7-e. pickle : easy persistence

Useful to store arbitrary objects to a file. Not safe or fast!

In [1] : import pickle

In [2] : l = [1, None, 'Stan']

In [3] : pickle.dump(l, file('test.pkl', 'w'))

In [4] : pickle.load(file('test.pkl'))
Out[4] : [1, None, 'Stan']

Exercise :

Write a program to search your PYTHONPATH for the module site.py.

Solution :

Write a program to search your PYTHONPATH for the module site.py.

"""Script to search the PYTHONPATH for the module site.py"""

import os
import sys
import glob

def find_module(module) :
    result = []
    # Loop over the list of paths in sys.path
    for subdir in sys.path :
        # Join the subdir path with the module we're searching for
        pth = os.path.join(subdir, module)
        # Use glob to test if the pth is exists
        res = glob.glob(pth)
        # glob returns a list, if it is not empty, the pth exists
        if len(res) > 0 :
    return result

if __name__ == '__main__' :
    result = find_module('site.py')
    print result

I-B-8. Exception handling in Python

It is highly unlikely that you haven't yet raised Exceptions if you have typed all the previous commands of the tutorial. For example, you may have raised an exception if you entered a command with a typo.

Exceptions are raised by different kinds of errors arising when executing Python code. In your own code, you may also catch errors, or define custom error types. You may want to look at the descriptions of the the built-in Exceptions when looking for the right exception type.

I-B-8-a. Exceptions

Exceptions are raised by errors in Python :

In [1] : 1/0
ZeroDivisionError : integer division or modulo by zero

In [2] : 1 + 'e'
TypeError : unsupported operand type(s) for + : 'int' and 'str'

In [3] : d = {1 :1, 2 :2}

In [4] : d[3]
KeyError : 3

In [5] : l = [1, 2, 3]

In [6] : l[4]
IndexError : list index out of range

In [7] : l.foobar
AttributeError : 'list' object has no attribute 'foobar'

As you can see, there are different types of exceptions for different errors.

I-B-8-b. Catching exceptions

I-B-8-b-i. try/except
In [10] : while True :
   .... :     try :
   .... :         x = int(raw_input('Please enter a number : '))
   .... :         break
   .... :     except ValueError :
   .... :         print('That was no valid number.  Try again...')
   .... :
Please enter a number : a
That was no valid number.  Try again...
Please enter a number : 1

In [9] : x
Out[9] : 1
I-B-8-b-ii. try/finally
In [10] : try :
   .... :     x = int(raw_input('Please enter a number : '))
   .... : finally :
   .... :     print('Thank you for your input')
   .... :
   .... :
Please enter a number : a
Thank you for your input
ValueError : invalid literal for int() with base 10 : 'a'
Important for resource management (e.g. closing a file)
I-B-8-b-iii. Easier to ask for forgiveness than for permission
In [11] : def print_sorted(collection) :
   .... :     try :
   .... :         collection.sort()
   .... :     except AttributeError :
   .... :         pass
   .... :     print(collection)
   .... :
   .... :

In [12] : print_sorted([1, 3, 2])
[1, 2, 3]

In [13] : print_sorted(set((1, 3, 2)))
set([1, 2, 3])

In [14] : print_sorted('132')

I-B-8-c. Raising exceptions

  • Capturing and reraising an exception :
In [15] : def filter_name(name) :
   .... :    try :
   .... :        name = name.encode('ascii')
   .... :    except UnicodeError, e :
   .... :        if name == 'Gaël' :
   .... :            print('OK, Gaël')
   .... :        else :
   .... :            raise e
   .... :    return name
   .... :

In [16] : filter_name('Gaël')
OK, Gaël
Out[16] : 'Ga\xc3\xabl'

In [17] : filter_name('Stéfan')
UnicodeDecodeError : 'ascii' codec can't decode byte 0xc3 in position 2 : ordinal not in range(128)
  • Exceptions to pass messages between parts of the code :
In [17] : def achilles_arrow(x) :
   .... :    if abs(x - 1) < 1e-3 :
   .... :        raise StopIteration
   .... :    x = 1 - (1-x)/2.
   .... :    return x
   .... :

In [18] : x = 0

In [19] : while True :
   .... :     try :
   .... :         x = achilles_arrow(x)
   .... :     except StopIteration :
   .... :         break
   .... :
   .... :

In [20] : x
Out[20] : 0.9990234375

Use exceptions to notify certain conditions are met (e.g. StopIteration) or not (e.g. custom error raising)

I-B-9. Object-oriented programming (OOP)

Python supports object-oriented programming (OOP). The goals of OOP are :

  • to organize the code, and
  • to re-use code in similar contexts.

Here is a small example : we create a Student class, which is an object gathering several custom functions (methods) and variables (attributes), we will be able to use :

 class Student(object) :
...     def __init__(self, name) :
...         self.name = name
...     def set_age(self, age) :
...         self.age = age
...     def set_major(self, major) :
...         self.major = major
 anna = Student('anna')

In the previous example, the Student class has __init__, set_age and set_major methods. Its attributes are name, age and major. We can call these methods and attributes with the following notation : classinstance.method or classinstance.attribute. The __init__ constructor is a special method we call with : MyClass(init parameters if any).

Now, suppose we want to create a new class MasterStudent with the same methods and attributes as the previous one, but with an additional internship attribute. We won't copy the previous class, but inherit from it :

 class MasterStudent(Student) :
...     internship = 'mandatory, from March to June'
 james = MasterStudent('james')
'mandatory, from March to June'

The MasterStudent class inherited from the Student attributes and methods.

Thanks to classes and object-oriented programming, we can organize code with different classes corresponding to different objects we encounter (an Experiment class, an Image class, a Flow class, etc.), with their own methods and attributes. Then we can use inheritance to consider variations around a base class and re-use code. Ex : from a Flow base class, we can create derived StokesFlow, TurbulentFlow, PotentialFlow, etc.

I-C. NumPy : creating and manipulating numerical data

This chapter gives an overview of Numpy, the core tool for performant numerical computing with Python.

I-C-1. The Numpy array object

I-C-1-a. What are Numpy and Numpy arrays ?

I-C-1-a-i. Numpy arrays

Python objects :

  • high-level number objects : integers, floating point
  • containers : lists (costless insertion and append), dictionaries (fast lookup)

Numpy provides :

  • extension package to Python for multidimensional arrays
  • closer to hardware (efficiency)
  • designed for scientific computation (convenience)
  • Also known as array oriented computing

 import numpy as np
 a = np.array([0, 1, 2, 3])
array([0, 1, 2, 3])

For example, An array containing :

  • values of an experiment/simulation at discrete time steps
  • signal recorded by a measurement device, e.g. sound wave
  • pixels of an image, grey-level or colour
  • 3-D data measured at different X-Y-Z positions, e.g. MRI scan

Why it is useful : Memory-efficient container that provides fast numerical operations.

In [1] : L = range(1000)

In [2] : %timeit [i**2 for i in L]
1000 loops, best of 3 : 403 us per loop

In [3] : a = np.arange(1000)

In [4] : %timeit a**2
100000 loops, best of 3 : 12.7 us per loop
I-C-1-a-ii. Numpy Reference documentation
In [5] : np.array?
String Form :<built-in function array>
Docstring :
array(object, dtype=None, copy=True, order=None, subok=False, ndmin=0
  • Looking for something :
 np.lookfor('create array') 
Search results for 'create array'
    Create an array.
    Create a memory-map to an array stored in a *binary* file on disk.
In [6] : np.con*?
I-C-1-a-iii. Import conventions

The recommended convention to import numpy is :

 import numpy as np

I-C-1-b. Creating arrays

I-C-1-b-i. Manual construction of arrays
  • 1-D :
 a = np.array([0, 1, 2, 3])
array([0, 1, 2, 3])
  • 2-D, 3-D… :
 b = np.array([[0, 1, 2], [3, 4, 5]])    # 2 x 3 array
array([[0, 1, 2],
       [3, 4, 5]])
(2, 3)
 len(b)     # returns the size of the first dimension

 c = np.array([[[1], [2]], [[3], [4]]])

(2, 2, 1)

Exercise : Simple arrays

  • Create a simple two dimensional array. First, redo the examples from above. And then create your own : how about odd numbers counting backwards on the first row, and even numbers on the second?
  • Use the functions len(), numpy.shape() on these arrays. How do they relate to each other? And to the ndim attribute of the arrays?
I-C-1-b-ii. Functions for creating arrays

In practice, we rarely enter items one by one…

  • Evenly spaced :
 a = np.arange(10) # 0 .. n-1  (!)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
 b = np.arange(1, 9, 2) # start, end (exclusive), step
array([1, 3, 5, 7])
  • or by number of points :
 c = np.linspace(0, 1, 6)   # start, end, num-points
array([ 0. ,  0.2,  0.4,  0.6,  0.8,  1. ])
 d = np.linspace(0, 1, 5, endpoint=False)
array([ 0. ,  0.2,  0.4,  0.6,  0.8])
  • Common arrays :
 a = np.ones((3, 3))  # reminder : (3, 3) is a tuple
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])
 b = np.zeros((2, 2))
array([[ 0.,  0.],
       [ 0.,  0.]])
 c = np.eye(3)
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])
 d = np.diag(np.array([1, 2, 3, 4]))
array([[1, 0, 0, 0],
       [0, 2, 0, 0],
       [0, 0, 3, 0],
       [0, 0, 0, 4]])
  • np.random : random numbers (Mersenne Twister PRNG) :
 a = np.random.rand(4)       # uniform in [0, 1]
array([ 0.95799151,  0.14222247,  0.08777354,  0.51887998])

 b = np.random.randn(4)      # Gaussian
array([ 0.37544699, -0.11425369, -0.47616538,  1.79664113])

 np.random.seed(1234)        # Setting the random seed

Exercise : Creating arrays using functions

  • Experiment with arange, linspace, ones, zeros, eye and diag.
  • Create different kinds of arrays with random numbers.
  • Try setting the seed before creating an array with random values.
  • Look at the function np.empty. What does it do? When might this be useful?

I-C-1-c. Basic data types

You may have noticed that, in some instances, array elements are displayed with a trailing dot (e.g. 2. vs 2). This is due to a difference in the data-type used :

 a = np.array([1, 2, 3])

 b = np.array([1., 2., 3.])

Different data-types allow us to store data more compactly in memory, but most of the time we simply work with floating point numbers. Note that, in the example above, NumPy autodetects the data-type from the input.

You can explicitly specify which data-type you want :

 c = np.array([1, 2, 3], dtype=float)

The default data type is floating point :

 a = np.ones((3, 3))

There are also other types :

Complex :

 d = np.array([1+2j, 3+4j, 5+6*1j])

Bool :

 e = np.array([True, False, False, True])

Strings :

 f = np.array(['Bonjour', 'Hello', 'Hallo',])
 f.dtype     # <--- strings containing max. 7 letters

Much more :

  • int32
  • int64
  • unit32
  • unit64

I-C-1-d. Basic visualization

Now that we have our first data arrays, we are going to visualize them.

Start by launching IPython in pylab mode.

$ ipython --pylab

Or the notebook :

$ ipython notebook --pylab=inline

Alternatively, if IPython has already been started :


Or, from the notebook :

 %pylab inline

The inline is important for the notebook, so that plots are displayed in the notebook and not in a new window.

Matplotlib is a 2D plotting package. We can import its functions as below :

 import matplotlib.pyplot as plt  # the tidy way

And then use (note that you have to use show explicitly) :

 plt.plot(x, y)       # line plot    
 plt.show()           # <-- shows the plot (not needed with pylab)

Or, if you are using pylab :

 plot(x, y)       # line plot

Using import matplotlib.pyplot as plt is recommended for use in scripts. Whereas pylab is recommended for interactive exploratory work.

  • 1D plotting :
 x = np.linspace(0, 3, 20)
 y = np.linspace(0, 9, 20)
 plt.plot(x, y)       # line plot    
[<matplotlib.lines.Line2D object at …>]
 plt.plot(x, y, 'o')  # dot plot    
[<matplotlib.lines.Line2D object at …>]

[source code, hires.png, pdf]

Image non disponible
  • 2D arrays (such as images) :
 image = np.random.rand(30, 30)
 plt.imshow(image, cmap=plt.cm.hot)    
<matplotlib.colorbar.Colorbar instance at …>

[source code, hires.png, pdf]

Image non disponible

More in the : matplotlib chapter

Exercise : Simple visualizations

  • Plot some simple arrays : a cosine as a function of time and a 2D matrix.
  • Try using the gray colormap on the 2D matrix.

I-C-1-e. Indexing and slicing

The items of an array can be accessed and assigned to the same way as other Python sequences (e.g. lists) :

 a = np.arange(10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
 a[0], a[2], a[-1]
(0, 2, 9)

Indices begin at 0, like other Python sequences (and C/C++). In contrast, in Fortran or Matlab, indices begin at 1.

The usual python idiom for reversing a sequence is supported :

 a[ : :-1]
array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

For multidimensional arrays, indexes are tuples of integers :

 a = np.diag(np.arange(3))
array([[0, 0, 0],
       [0, 1, 0],
       [0, 0, 2]])
 a[1, 1]
 a[2, 1] = 10 # third line, second column
array([[ 0,  0,  0],
       [ 0,  1,  0],
       [ 0, 10,  2]])
array([0, 1, 0])
  • In 2D, the first dimension corresponds to rows, the second to columns.
  • for multidimensional a, a[0] is interpreted by taking all elements in the unspecified dimensions.

Slicing : Arrays, like other Python sequences can also be sliced :

 a = np.arange(10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
 a[2 :9 :3] # [start :end :step]
array([2, 5, 8])

Note that the last index is not included! :

 a[ :4]
array([0, 1, 2, 3])

All three slice components are not required : by default, start is 0, end is the last and step is 1 :

 a[1 :3]
array([1, 2])
 a[ : :2]
array([0, 2, 4, 6, 8])
 a[3 :]
array([3, 4, 5, 6, 7, 8, 9])

A small illustrated summary of Numpy indexing and slicing…

Image non disponible

You can also combine assignment and slicing :

 a = np.arange(10)
 a[5 :] = 10
array([ 0,  1,  2,  3,  4, 10, 10, 10, 10, 10])
 b = np.arange(5)
 a[5 :] = b[ : :-1]
array([0, 1, 2, 3, 4, 4, 3, 2, 1, 0])

Exercise : Indexing and slicing

  • Try the different flavours of slicing, using start, end and step : starting from a linspace, try to obtain odd numbers counting backwards, and even numbers counting forwards.
  • Reproduce the slices in the diagram above. You may use the following expression to create the array :
 np.arange(6) + np.arange(0, 51, 10)[ :, np.newaxis]
array([[ 0,  1,  2,  3,  4,  5],
       [10, 11, 12, 13, 14, 15],
       [20, 21, 22, 23, 24, 25],
       [30, 31, 32, 33, 34, 35],
       [40, 41, 42, 43, 44, 45],
       [50, 51, 52, 53, 54, 55]])

Exercise : Array creation

Create the following arrays (with correct data types) :

[[1, 1, 1, 1],
 [1, 1, 1, 1],
 [1, 1, 1, 2],
 [1, 6, 1, 1]]

[[0., 0., 0., 0., 0.],
 [2., 0., 0., 0., 0.],
 [0., 3., 0., 0., 0.],
 [0., 0., 4., 0., 0.],
 [0., 0., 0., 5., 0.],
 [0., 0., 0., 0., 6.]]

Par on course : 3 statements for each

Hint : Individual array elements can be accessed similarly to a list, e.g. a[1] or a[1, 2].

Hint : Examine the docstring for diag.

Exercise : Tiling for array creation

Skim through the documentation for np.tile, and use this function to construct the array :

[[4, 3, 4, 3, 4, 3],
 [2, 1, 2, 1, 2, 1],
 [4, 3, 4, 3, 4, 3],
 [2, 1, 2, 1, 2, 1]]

I-C-1-f. Copies and views

A slicing operation creates a view on the original array, which is just a way of accessing array data. Thus the original array is not copied in memory. You can use np.may_share_memory() to check if two arrays share the same memory block. Note however, that this uses heuristics and may give you false positives.

When modifying the view, the original array is modified as well :

 a = np.arange(10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
 b = a[ : :2]
array([0, 2, 4, 6, 8])
 np.may_share_memory(a, b)
 b[0] = 12
array([12,  2,  4,  6,  8])
 a   # (!)
array([12,  1,  2,  3,  4,  5,  6,  7,  8,  9])

 a = np.arange(10)
 c = a[ : :2].copy()  # force a copy
 c[0] = 12
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

 np.may_share_memory(a, c)

This behavior can be surprising at first sight… but it allows to save both memory and time.

Worked example : Prime number sieve

Image non disponible

Compute prime numbers in 0-99, with a sieve

  • Construct a shape (100,) boolean array is_prime, filled with True in the beginning :
 is_prime = np.ones((100,), dtype=bool)
  • Cross out 0 and 1 which are not primes :
 is_prime[ :2] = 0
  • For each integer j starting from 2, cross out its higher multiples :
 N_max = int(np.sqrt(len(is_prime)))
 for j in range(2, N_max) :
…     is_prime[2*j : :j] = False
  • Skim through help(np.nonzero), and print the prime numbers
  • Follow-up :
  • Move the above code into a script file named prime_sieve.py
  • Run it to check it works
  • Use the optimization suggested in the sieve of Eratosthenes :
  • Skip j which are already known to not be primes
  • The first number to cross out is kitxmlcodeinlinelatexdvpj^2finkitxmlcodeinlinelatexdvp

I-C-1-g. Fancy indexing

Numpy arrays can be indexed with slices, but also with boolean or integer arrays (masks). This method is called fancy indexing. It creates copies not views.

I-C-1-g-i. Using boolean masks
 a = np.random.random_integers(0, 20, 15)
array([10,  3,  8,  0, 19, 10, 11,  9, 10,  6,  0, 20, 12,  7, 14])
 (a % 3 == 0)
array([False,  True, False,  True, False, False, False,  True, False,
        True,  True, False,  True, False, False], dtype=bool)
 mask = (a % 3 == 0)
 extract_from_a = a[mask] # or,  a[a%3==0]
 extract_from_a           # extract a sub-array with the mask
array([ 3,  0,  9,  6,  0, 12])

Indexing with a mask can be very useful to assign a new value to a sub-array :

 a[a % 3 == 0] = -1
array([10, -1,  8, -1, 19, 10, 11, -1, 10, -1, -1, 20, -1,  7, 14])
I-C-1-g-ii. Indexing with an array of integers
 a = np.arange(0, 100, 10)
array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])

Indexing can be done with an array of integers, where the same index is repeated several time :

 a[[2, 3, 2, 4, 2]]  # note : [2, 3, 2, 4, 2] is a Python list
array([20, 30, 20, 40, 20])

New values can be assigned with this kind of indexing :

 a[[9, 7]] = -100
array([   0,   10,   20,   30,   40,   50,   60, -100,   80, -100])

When a new array is created by indexing with an array of integers, the new array has the same shape than the array of integers :

 a = np.arange(10)
 idx = np.array([[3, 4], [9, 7]])
(2, 2)
array([[3, 4],
       [9, 7]])

The image below illustrates various fancy indexing applications :

Image non disponible

Exercise : Fancy indexing

  • Again, reproduce the fancy indexing shown in the diagram above.
  • Use fancy indexing on the left and array creation on the right to assign values into an array, for instance by setting parts of the array in the diagram above to zero.

I-C-2. Numerical operations on arrays

I-C-2-a. Elementwise operations

I-C-2-a-i. Basic operations

With scalars :

 a = np.array([1, 2, 3, 4])
 a + 1
array([2, 3, 4, 5])
array([ 2,  4,  8, 16])

All arithmetic operates elementwise :

 b = np.ones(4) + 1
 a - b
array([-1.,  0.,  1.,  2.])
 a * b
array([ 2.,  4.,  6.,  8.])

 j = np.arange(5)
 2**(j + 1) - j
array([ 2,  3,  6, 13, 28])

These operations are of course much faster than if you did them in pure python :

 a = np.arange(10000)
 %timeit a + 1  
10000 loops, best of 3 : 24.3 us per loop
 l = range(10000)
 %timeit [i+1 for i in l] 
1000 loops, best of 3 : 861 us per loop

Array multiplication is not matrix multiplication :

 c = np.ones((3, 3))
 c * c                   # NOT matrix multiplication!
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

Matrix multiplication :

array([[ 3.,  3.,  3.],
       [ 3.,  3.,  3.],
       [ 3.,  3.,  3.]])

Exercise : Elementwise operations

  • Try simple arithmetic elementwise operations : add even elements with odd elements
  • Time them against their pure python counterparts using %timeit.
  • Generate :
  • [2**0, 2**1, 2**2, 2**3, 2**4]
  • a_j = 2^(3*j) - j
I-C-2-a-ii. Other operations

Comparisons :

 a = np.array([1, 2, 3, 4])
 b = np.array([4, 2, 2, 4])
 a == b
array([False,  True, False,  True], dtype=bool)
 a > b
array([False, False,  True, False], dtype=bool)

Array-wise comparisons :

 a = np.array([1, 2, 3, 4])
 b = np.array([4, 2, 2, 4])
 c = np.array([1, 2, 3, 4])
 np.array_equal(a, b)
 np.array_equal(a, c)

Logical operations :

 a = np.array([1, 1, 0, 0], dtype=bool)
 b = np.array([1, 0, 1, 0], dtype=bool)
 np.logical_or(a, b)
array([ True,  True,  True, False], dtype=bool)
 np.logical_and(a, b)
array([ True, False, False, False], dtype=bool)

Transcendental functions :

 a = np.arange(5)
array([ 0.        ,  0.84147098,  0.90929743,  0.14112001, -0.7568025])
array([       -inf,  0.        ,  0.69314718,  1.09861229,  1.38629436])
array([  1.00000000e+00,   2.71828183e+00,   7.38905610e+00,
         2.00855369e+01,   5.45981500e+01])

Shape mismatches :

 a = np.arange(4)
 a + np.array([1, 2])  
Traceback (most recent call last) :
  File « <stdin> », line 1, in <module>
ValueError : operands could not be broadcast together with shapes (4) (2)

Broadcasting ? We'll return to that laterBroadcasting.

Transposition :

 a = np.triu(np.ones((3, 3)), 1)   # see help(np.triu)
array([[ 0.,  1.,  1.],
       [ 0.,  0.,  1.],
       [ 0.,  0.,  0.]])
array([[ 0.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  1.,  0.]])

The transposition is a view

As a results, the following code is wrong and will not make a matrix symmetric :

 a += a.T

Linear algebra

The sub-module numpy.linalg implements basic linear algebra, such as solving linear systems, singular value decomposition, etc. However, it is not guaranteed to be compiled using efficient routines, and thus we recommend the use of scipy.linalgLinear algebra operations : scipy.linalgLinear algebra operations: scipy.linalg, as detailed in section

Exercise other operations

  • Look at the help for np.allclose. When might this be useful ?
  • Look at the help for np.triu and np.tril.

I-C-2-b. Basic reductions

I-C-2-b-i. Computing sums
 x = np.array([1, 2, 3, 4])
Image non disponible

Sum by rows and by columns :

 x = np.array([[1, 1], [2, 2]])
array([[1, 1],
       [2, 2]])
 x.sum(axis=0)   # columns (first dimension)
array([3, 3])
 x[ :, 0].sum(), x[ :, 1].sum()
(3, 3)
 x.sum(axis=1)   # rows (second dimension)
array([2, 4])
 x[0, :].sum(), x[1, :].sum()
(2, 4)

Same idea in higher dimensions :

 x = np.random.rand(2, 2, 2)
 x.sum(axis=2)[0, 1]     
 x[0, 1, :].sum()     
I-C-2-b-ii. Other reductions

— works the same way (and take axis=)

Extrema :

 x = np.array([1, 3, 2])

 x.argmin()  # index of minimum
 x.argmax()  # index of maximum

Logical operations :

 np.all([True, True, False])
 np.any([True, True, False])

Can be used for array comparisons :

 a = np.zeros((100, 100))
 np.any(!= 0)
 np.all(a == a)

 a = np.array([1, 2, 3, 2])
 b = np.array([2, 2, 3, 2])
 c = np.array([6, 4, 4, 5])
 ((a <= b) & (b <= c)).all()

Statistics :

 x = np.array([1, 2, 3, 1])
 y = np.array([[1, 2, 3], [5, 6, 1]])
 np.median(y, axis=-1) # last axis
array([ 2.,  5.])

 x.std()          # full population standard dev.

… and many more (best to learn as you go).

Exercise : Reductions

  • Given there is a sum, what other function might you expect to see?
  • What is the difference between sum and cumsum ?

Worked Example : data statistics

Data in populations.txt describes the populations of hares and lynxes (and carrots) in northern Canada during 20 years.

You can view the data in an editor, or alternatively in IPython (both shell and notebook) :

In [1] : !cat data/populations.txt

First, load the data into a Numpy array :

 data = np.loadtxt('data/populations.txt')
 year, hares, lynxes, carrots = data.T  # trick : columns to variables

Then plot it :

 from matplotlib import pyplot as plt
 plt.axes([0.2, 0.1, 0.5, 0.8]) 
 plt.plot(year, hares, year, lynxes, year, carrots) 
 plt.legend(('Hare', 'Lynx', 'Carrot'), loc=(1.05, 0.5))

[source code, hires.png, pdf]

Image non disponible

The mean populations over time :

 populations = data[ :, 1 :]
array([ 34080.95238095,  20166.66666667,  42400.        ])

The sample standard deviations :

array([ 20897.90645809,  16254.59153691,   3322.50622558])

Which species has the highest population each year ? :

 np.argmax(populations, axis=1)
array([2, 2, 0, 0, 1, 1, 2, 2, 2, 2, 2, 2, 0, 0, 0, 1, 2, 2, 2, 2, 2])

Worked Example : diffusion using a random walk algorithm

Image non disponible

Let us consider a simple 1D random walk process : at each time step a walker jumps right or left with equal probability.

We are interested in finding the typical distance from the origin of a random walker after t left or right jumps? We are going to simulate many “walkers” to find this law, and we are going to do so using array computing tricks : we are going to create a 2D array with the “stories” (each walker has a story) in one direction, and the time in the other :

Image non disponible
 n_stories = 1000 # number of walkers
 t_max = 200      # time during which we follow the walker

We randomly choose all the steps 1 or -1 of the walk :

 t = np.arange(t_max)
 steps = 2 * np.random.random_integers(0, 1, (n_stories, t_max)) - 1
 np.unique(steps) # Verification : all steps are 1 or -1
array([-1,  1])

We build the walks by summing steps along the time :

 positions = np.cumsum(steps, axis=1) # axis = 1 : dimension of time
 sq_distance = positions**2

We get the mean in the axis of the stories :

 mean_sq_distance = np.mean(sq_distance, axis=0)

Plot the results :

 plt.figure(figsize=(4, 3)) 
<matplotlib.figure.Figure object at …>
 plt.plot(t, np.sqrt(mean_sq_distance), 'g.', t, np.sqrt(t), 'y-') 
[<matplotlib.lines.Line2D object at …>, <matplotlib.lines.Line2D object at …>]
 plt.xlabel(r« $t$ ») 
<matplotlib.text.Text object at …>
 plt.ylabel(r« $\sqrt{\langle (\delta x)^2 \rangle}$ ») 
<matplotlib.text.Text object at …>

[source code, hires.png, pdf]

Image non disponible

We find a well-known result in physics : the RMS distance grows as the square root of the time!

I-C-2-c. Broadcasting

  • Basic operations on numpy arrays (addition, etc.) are elementwise
  • This works on arrays of the same size.

Nevertheless, It's also possible to do operations on arrays of different

sizes if Numpy can transform these arrays so that they all have

the same size : this conversion is called broadcasting.

The image below gives an example of broadcasting :

Image non disponible

Let's verify :

 a = np.tile(np.arange(0, 40, 10), (3, 1)).T
array([[ 0,  0,  0],
       [10, 10, 10],
       [20, 20, 20],
       [30, 30, 30]])
 b = np.array([0, 1, 2])
 a + b
array([[ 0,  1,  2],
       [10, 11, 12],
       [20, 21, 22],
       [30, 31, 32]])

We have already used broadcasting without knowing it ! :

 a = np.ones((4, 5))
 a[0] = 2  # we assign an array of dimension 0 to an array of dimension 1
array([[ 2.,  2.,  2.,  2.,  2.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.]])

An useful trick :

 a = np.arange(0, 40, 10)
 a = a[ :, np.newaxis]  # adds a new axis -> 2D array
(4, 1)
array([[ 0],
 a + b
array([[ 0,  1,  2],
       [10, 11, 12],
       [20, 21, 22],
       [30, 31, 32]])

Broadcasting seems a bit magical, but it is actually quite natural to use it when we want to solve a problem whose output data is an array with more dimensions than input data.

Worked Example : Broadcasting

Let's construct an array of distances (in miles) between cities of Route 66 : Chicago, Springfield, Saint-Louis, Tulsa, Oklahoma City, Amarillo, Santa Fe, Albuquerque, Flagstaff and Los Angeles.

 mileposts = np.array([0, 198, 303, 736, 871, 1175, 1475, 1544,
…        1913, 2448])
 distance_array = np.abs(mileposts - mileposts[ :, np.newaxis])
array([[   0,  198,  303,  736,  871, 1175, 1475, 1544, 1913, 2448],
       [ 198,    0,  105,  538,  673,  977, 1277, 1346, 1715, 2250],
       [ 303,  105,    0,  433,  568,  872, 1172, 1241, 1610, 2145],
       [ 736,  538,  433,    0,  135,  439,  739,  808, 1177, 1712],
       [ 871,  673,  568,  135,    0,  304,  604,  673, 1042, 1577],
       [1175,  977,  872,  439,  304,    0,  300,  369,  738, 1273],
       [1475, 1277, 1172,  739,  604,  300,    0,   69,  438,  973],
       [1544, 1346, 1241,  808,  673,  369,   69,    0,  369,  904],
       [1913, 1715, 1610, 1177, 1042,  738,  438,  369,    0,  535],
       [2448, 2250, 2145, 1712, 1577, 1273,  973,  904,  535,    0]])
Image non disponible

A lot of grid-based or network-based problems can also use broadcasting. For instance, if we want to compute the distance from the origin of points on a 10x10 grid, we can do

 x, y = np.arange(5), np.arange(5)[ :, np.newaxis]
 distance = np.sqrt(x ** 2 + y ** 2)
array([[ 0.        ,  1.        ,  2.        ,  3.        ,  4.        ],
       [ 1.        ,  1.41421356,  2.23606798,  3.16227766,  4.12310563],
       [ 2.        ,  2.23606798,  2.82842712,  3.60555128,  4.47213595],
       [ 3.        ,  3.16227766,  3.60555128,  4.24264069,  5.        ],
       [ 4.        ,  4.12310563,  4.47213595,  5.        ,  5.65685425]])

Or in color :


[source code, hires.png, pdf]

Image non disponible

Remark : the numpy.ogrid function allows to directly create vectors x and y of the previous example, with two “significant dimensions” :

 x, y = np.ogrid[0 :5, 0 :5]
 x, y
       [4]]), array([[0, 1, 2, 3, 4]]))
 x.shape, y.shape
((5, 1), (1, 5))
 distance = np.sqrt(x ** 2 + y ** 2)

So, np.ogrid is very useful as soon as we have to handle computations on a grid. On the other hand, np.mgrid directly provides matrices full of indices for cases where we can't (or don't want to) benefit from broadcasting :

 x, y = np.mgrid[0 :4, 0 :4]
array([[0, 0, 0, 0],
       [1, 1, 1, 1],
       [2, 2, 2, 2],
       [3, 3, 3, 3]])
array([[0, 1, 2, 3],
       [0, 1, 2, 3],
       [0, 1, 2, 3],
       [0, 1, 2, 3]])

I-C-2-d. Array shape manipulation

I-C-2-d-i. Flattening
 a = np.array([[1, 2, 3], [4, 5, 6]])
array([1, 2, 3, 4, 5, 6])
array([[1, 4],
       [2, 5],
       [3, 6]])
array([1, 4, 2, 5, 3, 6])
Higher dimensions : last dimensions ravel out “first”.
I-C-2-d-ii. Reshaping

The inverse operation to flattening :

(2, 3)
 b = a.ravel()
 b = b.reshape((2, 3))
array([[1, 2, 3],
       [4, 5, 6]])


 a.reshape((2, -1))    # unspecified (-1) value is inferred
array([[1, 2, 3],
       [4, 5, 6]])

ndarray.reshape may return a view (cf help(np.reshape)), or copy

 b[0, 0] = 99
array([[99,  2,  3],
       [ 4,  5,  6]])

Beware : reshape may also return a copy! :

 a = np.zeros((3, 2))
 b = a.T.reshape(3*2)
 b[0] = 9
array([[ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.]])

To understand this you need to learn more about the memory layout of a numpy array.

I-C-2-d-iii. Adding a dimension

Indexing with the np.newaxis object allows us to add an axis to an array (you have seen this already above in the broadcasting section) :

 z = np.array([1, 2, 3])
array([1, 2, 3])

 z[ :, np.newaxis]

 z[np.newaxis, :]
array([[1, 2, 3]])
I-C-2-d-iv. Dimension shuffling
 a = np.arange(4*3*2).reshape(4, 3, 2)
(4, 3, 2)
 a[0, 2, 1]
 b = a.transpose(1, 2, 0)
(3, 2, 4)
 b[2, 1, 0]

Also creates a view :

 b[2, 1, 0] = -1
 a[0, 2, 1]
I-C-2-d-v. Resizing

Size of an array can be changed with ndarray.resize :

 a = np.arange(4)
array([0, 1, 2, 3, 0, 0, 0, 0])

However, it must not be referred to somewhere else :

 b = a
Traceback (most recent call last) :
  File « <stdin> », line 1, in <module>
ValueError : cannot resize an array that has been referenced or is
referencing another array in this way.  Use the resize function

Exercise : Shape manipulations

  • Look at the docstring for reshape, especially the notes section which has some more information about copies and views.
  • Use flatten as an alternative to ravel. What is the difference? (Hint : check which one returns a view and which a copy)
  • Experiment with transpose for dimension shuffling.

I-C-2-e. Sorting data

Sorting along an axis :

 a = np.array([[4, 3, 5], [1, 2, 1]])
 b = np.sort(a, axis=1)
array([[3, 4, 5],
       [1, 1, 2]])

Sorts each row separately!

In-place sort :

array([[3, 4, 5],
       [1, 1, 2]])

Sorting with fancy indexing :

 a = np.array([4, 3, 1, 2])
 j = np.argsort(a)
array([2, 3, 1, 0])
array([1, 2, 3, 4])

Finding minima and maxima :

 a = np.array([4, 3, 1, 2])
 j_max = np.argmax(a)
 j_min = np.argmin(a)
 j_max, j_min
(0, 2)

Exercise : Sorting

  • Try both in-place and out-of-place sorting.
  • Try creating arrays with different dtypes and sorting them.
  • Use all or array_equal to check the results.
  • Look at np.random.shuffle for a way to create sortable input quicker.
  • Combine ravel, sort and reshape.
  • Look at the axis keyword for sort and rewrite the previous exercise.

I-C-2-f. Summary

What do you need to know to get started ?

  • Know how to create arrays : array, arange, ones, zeros.
  • Know the shape of the array with array.shape, then use slicing to obtain different views of the array : array[ : :2], etc. Adjust the shape of the array using reshape or flatten it with ravel.
  • Obtain a subset of the elements of an array and/or modify their values with masks

     a[a < 0] = 0
  • Know miscellaneous operations on arrays, such as finding the mean or max (array.max(), array.mean()). No need to retain everything, but have the reflex to search in the documentation (online docs, help(), lookfor())!!

  • For advanced use : master the indexing with arrays of integers, as well as broadcasting. Know more Numpy functions to handle various array operations.

Quick read

If you want to do a first quick pass through the Scipy lectures to learn the ecosystem, you can directly skip to the next chapter : Matplotlib : plottingMatplotlib : plotting.

The remainder of this chapter is not necessary to follow the rest of the intro part. But be sure to come back and finish this chapter, as well as to do some more exercicesSome exercises.

I-C-3. More elaborate arrays

I-C-3-a. More data types

I-C-3-a-i. Casting

“Bigger” type wins in mixed-type operations :

 np.array([1, 2, 3]) + 1.5
array([ 2.5,  3.5,  4.5])

Assignment never changes the type !

 a = np.array([1, 2, 3])
 a[0] = 1.9     # <-- float is truncated to integer
array([1, 2, 3])

Forced casts :

 a = np.array([1.7, 1.2, 1.6])
 b = a.astype(int)  # <-- truncates to integer
array([1, 1, 1])

 a = np.array([1.2, 1.5, 1.6, 2.5, 3.5, 4.5])
 b = np.around(a)
 b                    # still floating-point
array([ 1.,  2.,  2.,  2.,  4.,  4.])
 c = np.around(a).astype(int)
array([1, 2, 2, 2, 4, 4])
I-C-3-a-ii. Different data type sizes

Integers (signed) :


8 bits


16 bits


32 bits (same as int on 32-bit platform)


64 bits (same as int on 64-bit platform)

 np.array([1], dtype=int).dtype
 np.iinfo(np.int32).max, 2**31 - 1
(2147483647, 2147483647)
 np.iinfo(np.int64).max, 2**63 - 1
(9223372036854775807, 9223372036854775807L)

Unsigned integers :


8 bits


16 bits


32 bits


64 bits

 np.iinfo(np.uint32).max, 2**32 - 1
(4294967295, 4294967295)
 np.iinfo(np.uint64).max, 2**64 - 1
(18446744073709551615L, 18446744073709551615L)

Floating-point numbers :


16 bits


32 bits


64 bits (same as float)


96 bits, platform-dependent (same as np.longdouble)


128 bits, platform-dependent (same as np.longdouble)


 np.float32(1e-8) + np.float32(1) == 1
 np.float64(1e-8) + np.float64(1) == 1

Complex floating-point numbers :


two 32-bit floats


two 64-bit floats


two 96-bit floats, platform-dependent


two 128-bit floats, platform-dependent

Smaller data types

If you don't know you need special data types, then you probably don't.

Comparison on using float32 instead of float64 :

  • Half the size in memory and on disk
  • Half the memory bandwidth required (may be a bit faster in some operations)


n [1]: a = np.zeros((1e6,), dtype=np.float64)

In [2]: b = np.zeros((1e6,), dtype=np.float32)

In [3]: %timeit a*a
1000 loops, best of 3: 1.78 ms per loop

In [4]: %timeit b*b
1000 loops, best of 3: 1.07 ms per loop
  • But : bigger rounding errors — sometimes in surprising places (i.e., don't use them unless you really need them)

I-C-3-b. Structured data types


(4-character string)





 samples = np.zeros((6,), dtype=[('sensor_code', 'S4'),
...                                 ('position', float), ('value', float)])
('sensor_code', 'position', 'value')

 samples[:] = [('ALFA',   1, 0.37), ('BETA', 1, 0.11), ('TAU', 1,   0.13),
...               ('ALFA', 1.5, 0.37), ('ALFA', 3, 0.11), ('TAU', 1.2, 0.13)]
array([('ALFA', 1.0, 0.37), ('BETA', 1.0, 0.11), ('TAU', 1.0, 0.13),
       ('ALFA', 1.5, 0.37), ('ALFA', 3.0, 0.11), ('TAU', 1.2, 0.13)],
      dtype=[('sensor_code', 'S4'), ('position', '<f8'), ('value', '<f8')])

Field access works by indexing with field names :

array(['ALFA', 'BETA', 'TAU', 'ALFA', 'ALFA', 'TAU'],
array([ 0.37,  0.11,  0.13,  0.37,  0.11,  0.13])
('ALFA', 1.0, 0.37)

 samples[0]['sensor_code'] = 'TAU'
('TAU', 1.0, 0.37)

Multiple fields at once :

 samples[['position', 'value']]
array([(1.0, 0.37), (1.0, 0.11), (1.0, 0.13), (1.5, 0.37), (3.0, 0.11),
       (1.2, 0.13)],
      dtype=[('position', '<f8'), ('value', '<f8')])

Fancy indexing works, as usual :

 samples[samples['sensor_code'] == 'ALFA']
array([('ALFA', 1.5, 0.37), ('ALFA', 3.0, 0.11)],
      dtype=[('sensor_code', 'S4'), ('position', '<f8'), ('value', '<f8')])

There are a bunch of other syntaxes for constructing structured arrays, see here and here.

I-C-3-c. Maskedarray : dealing with (propagation of) missing data

  • For floats one could use NaN's, but masks work for all types :
 x = np.ma.array([1, 2, 3, 4], mask=[0, 1, 0, 1])
masked_array(data = [1 -- 3 --],
             mask = [False  True False  True],
       fill_value = 999999)

 y = np.ma.array([1, 2, 3, 4], mask=[0, 1, 1, 1])
 x + y
masked_array(data = [2 -- -- --],
             mask = [False  True  True  True],
       fill_value = 999999)
  • Masking versions of common functions :
 np.ma.sqrt([1, -1, 2, -2])
masked_array(data = [1.0 -- 1.41421356237 --],
             mask = [False  True False  True],
       fill_value = 1e+20)

There are other useful Advanced topics

While it is off topic in a chapter on numpy, let's take a moment to recall good coding practice, which really do pay off in the long run:

Good practices

  • Explicit variable names (no need of a comment to explain what is in the variable)
  • Style : spaces after commas, around =, etc.
    A certain number of rules for writing “beautiful” code (and, more importantly, using the same conventions as everybody else!) are given in the Style Guide for Python Code and the Docstring Conventions page (to manage help strings).
  • Except some rare cases, variable names and comments in English

I-C-4. Advanced operations

I-C-4-a. Polynomials

Numpy also contains polynomials in different bases:

For example, kitxmlcodeinlinelatexdvp3x^2 + 2x - 1finkitxmlcodeinlinelatexdvp :

 p = np.poly1d([3, 2, -1])
array([-1.        ,  0.33333333])
 x = np.linspace(0, 1, 20)
 y = np.cos(x) + 0.3*np.random.rand(20)
 p = np.poly1d(np.polyfit(x, y, 3))

 t = np.linspace(0, 1, 200)
 plt.plot(x, y, 'o', t, p(t), '-')   
[<matplotlib.lines.Line2D object at ...>, <matplotlib.lines.Line2D object at ...>]

[source code, hires.png, pdf]

Image non disponible

See http://docs.scipy.org/doc/numpy/reference/routines.polynomials.poly1d.html for more.

I-C-4-a-i. More polynomials (with more bases)

Numpy also has a more sophisticated polynomial interface, which supports e.g. the Chebyshev basis.

kitxmlcodeinlinelatexdvp3x^2 + 2x - 1finkitxmlcodeinlinelatexdvp :

 p = np.polynomial.Polynomial([-1, 2, 3]) # coefs in different order!
array([-1.        ,  0.33333333])
 p.degree()  # In general polynomials do not always expose 'order'

Example using polynomials in Chebyshev basis, for polynomials in range [-1, 1] :

 x = np.linspace(-1, 1, 2000)
 y = np.cos(x) + 0.3*np.random.rand(2000)
 p = np.polynomial.Chebyshev.fit(x, y, 90)

 t = np.linspace(-1, 1, 200)
 plt.plot(x, y, 'r.')   
[<matplotlib.lines.Line2D object at ...>]
 plt.plot(t, p(t), 'k-', lw=3)   
[<matplotlib.lines.Line2D object at ...>]

[source code, hires.png, pdf]

Image non disponible

The Chebyshev polynomials have some advantages in interpolation.

I-C-4-b. Loading data files

I-C-4-b-i. Text files

Example : populations.txt :

# year  hare    lynx    carrot
1900    30e3    4e3     48300
1901    47.2e3  6.1e3   48200
1902    70.2e3  9.8e3   41500
1903    77.4e3  35.2e3  38200
 data = np.loadtxt('data/populations.txt')
array([[  1900.,  30000.,   4000.,  48300.],
       [  1901.,  47200.,   6100.,  48200.],
       [  1902.,  70200.,   9800.,  41500.],
 np.savetxt('pop2.txt', data)
 data2 = np.loadtxt('pop2.txt')

If you have a complicated text file, what you can try are :

  • np.genfromtxt
  • Using Python's I/O functions and e.g. regexps for parsing (Python is quite well suited for this)

Reminder : Navigating the filesystem with IPython

In [1]: pwd      # show current directory
In [2]: cd ex
In [3]: ls
populations.txt  species.txt
I-C-4-b-ii. Images

Using Matplotlib :

 img = plt.imread('data/elephant.png')
 img.shape, img.dtype
((200, 300, 3), dtype('float32'))
<matplotlib.image.AxesImage object at ...>

 plt.imsave('red_elephant', img[:,:,0], cmap=plt.cm.gray)

This saved only one channel (of RGB) :

<matplotlib.image.AxesImage object at ...>

Other libraries :

 from scipy.misc import imsave
 imsave('tiny_elephant.png', img[::6,::6])
 plt.imshow(plt.imread('tiny_elephant.png'), interpolation='nearest')  
<matplotlib.image.AxesImage object at ...>

[source code, hires.png, pdf]

Image non disponible
I-C-4-b-iii. Numpy's own format

Numpy has its own binary format, not portable but with efficient I/O :

 data = np.ones((3, 3))
 np.save('pop.npy', data)
 data3 = np.load('pop.npy')
I-C-4-b-iv. Well-known (& more obscure) file formats
  • HDF5 : h5py, PyTables
  • NetCDF : scipy.io.netcdf_file, netcdf4-python
  • Matlab : scipy.io.loadmat, scipy.io.savemat
  • MatrixMarket : scipy.io.mmread, scipy.io.mmread

… if somebody uses it, there's probably also a Python library for it.

Exercise : Text data files

Write a Python script that loads data from populations.txt:: and drop the last column and the first 5 rows. Save the smaller dataset to pop2.txt.

Numpy internals

If you are interested in the Numpy internals, there is a good discussion in Advanced Numpy.

I-C-5. Some exercises

I-C-5-a. Array manipulations

  • Form the 2-D array (without typing it in explicitly) :
[[1,  6, 11],
 [2,  7, 12],
 [3,  8, 13],
 [4,  9, 14],
 [5, 10, 15]]

and generate a new array containing its 2d and 4th rows.

  • Divide each column of the array:
 a = np.arange(25).reshape(5, 5)

elementwise with the array b = np.array([1., 5, 10, 15, 20]). (Hint : np.newaxis).

  • Harder one : Generate a 10 x 3 array of random numbers (in range [0,1]). For each row, pick the number closest to 0.5.
  • Use abs and argsort to find the column j closest for each row.
  • Use fancy indexing to extract the numbers. (Hint : a[i,j] - the array i must contain the row numbers corresponding to stuff in j.)

I-C-5-b. Picture manipulation: Framing Lena

Let's do some manipulations on numpy arrays by starting with the famous image of Lena (http://www.cs.cmu.edu/~chuck/lennapg/). scipy provides a 2D array of this image with the scipy.lena function:

 from scipy import misc
 lena = misc.lena()

Note : In older versions of scipy, you will find lena under scipy.lena()

Here are a few images we will be able to obtain with our manipulations: use different colormaps, crop the image, change some parts of the image.

Image non disponible
  • Let's use the imshow function of pylab to display the image.
In [3]: import pylab as plt
In [4]: lena = misc.lena()
In [5]: plt.imshow(lena)
  • Lena is then displayed in false colors. A colormap must be
specified for her to be displayed in grey.
In [6]: plt.imshow(lena, cmap=plt.cm.gray)
  • Create an array of the image with a narrower centering : for example,

remove 30 pixels from all the borders of the image. To check the result, display this new array with imshow.

In [9]: crop_lena = lena[30:-30,30:-30]
  • We will now frame Lena's face with a black locket. For this, we

need to create a mask corresponding to the pixels we want to be black. The mask is defined by this condition (y-256)**2 + (x-256)**2

In [15]: y, x = np.ogrid[0:512,0:512] # x and y indices of pixels
In [16]: y.shape, x.shape
Out[16]: ((512, 1), (1, 512))
In [17]: centerx, centery = (256, 256) # center of the image
In [18]: mask = ((y - centery)**2 + (x - centerx)**2) > 230**2 # circle

then we assign the value 0 to the pixels of the image corresponding to the mask. The syntax is extremely simple and intuitive:

In [19]: lena[mask] = 0
In [20]: plt.imshow(lena)
Out[20]: <matplotlib.image.AxesImage object at 0xa36534c>
  • Follow-up: copy all instructions of this exercise in a script called

lena_locket.py then execute this script in IPython with %run lena_locket.py.

Change the circle to an ellipsoid.

I-C-5-c. Data statistics

The data in populations.txt describes the populations of hares and lynxes (and carrots) in northern Canada during 20 years :

 data = np.loadtxt('data/populations.txt')
 year, hares, lynxes, carrots = data.T  # trick: columns to variables

 plt.axes([0.2, 0.1, 0.5, 0.8]) 
<matplotlib.axes.Axes object at ...>
 plt.plot(year, hares, year, lynxes, year, carrots) 
[<matplotlib.lines.Line2D object at ...>, ...]
 plt.legend(('Hare', 'Lynx', 'Carrot'), loc=(1.05, 0.5)) 
<matplotlib.legend.Legend object at ...>

[source code, hires.png, pdf]

Image non disponible

Computes and print, based on the data in populations.txt…

  1. The mean and std of the populations of each species for the years in the period.
  2. Which year each species had the largest population.
  3. Which species has the largest population for each year. (Hint : argsort & fancy indexing of np.array(['H', 'L', 'C']))
  4. Which years any of the populations is above 50000. (Hint: comparisons and np.any)
  5. The top 2 years for each species when they had the lowest populations. (Hint : argsort, fancy indexing)
  6. Compare (plot) the change in hare population (see help(np.gradient)) and the number of lynxes. Check correlation (see help(np.corrcoef)).

… all without for-loops.

Solution : Python source file

I-C-5-d. Crude integral approximations

Write a function f(a, b, c) that returnskitxmlcodeinlinelatexdvpa^b - cfinkitxmlcodeinlinelatexdvp. Form a 24x12x6 array containing its values in parameter ranges [0,1] x [0,1] x [0,1].

Approximate the 3-d integral :


over this volume with the mean. The exact result is : kitxmlcodeinlinelatexdvp\ln 2 - \frac{1}{2}\approx0.1931\ldotsfinkitxmlcodeinlinelatexdvp — what is your relative error ?

(Hints : use elementwise operations and broadcasting. You can make np.ogrid give a number of points in given range with np.ogrid[0:1:20j].)

Reminder Python functions :

def f(a, b, c):
    return some_result

Solution: Python source file

I-C-5-e. Mandelbrot set

[source code, hires.png, pdf]

Image non disponible

Write a script that computes the Mandelbrot fractal. The Mandelbrot iteration:

N_max = 50
some_threshold = 50

c = x + 1j*y

for j in xrange(N_max):
    z = z**2 + c

Point (x, y) belongs to the Mandelbrot set if kitxmlcodeinlinelatexdvp|c|finkitxmlcodeinlinelatexdvp < some_threshold.

Do this computation by :

  1. Construct a grid of c = x + 1j*y values in range [-2, 1] x [-1.5, 1.5]
  2. Do the iteration
  3. Form the 2-d boolean mask indicating which points are in the set
  4. Save the result to an image with:
 import matplotlib.pyplot as plt
 plt.imshow(mask.T, extent=[-2, 1, -1.5, 1.5]) 
<matplotlib.image.AxesImage object at …>

Solution: Python source file

I-C-5-f. Markov chain

Image non disponible

Markov chain transition matrix P, and probability distribution on the states p :

  • 0 <= P[i,j] <= 1 : probability to go from state i to state j
  • Transition rule: kitxmlcodeinlinelatexdvpp_{new} = P^T p_{old}finkitxmlcodeinlinelatexdvp
  • all(sum(P, axis=1) == 1), p.sum() == 1 : normalization

Write a script that works with 5 states, and:

  • Constructs a random matrix, and normalizes each row so that it is a transition matrix.
  • Starts from a random (normalized) probability distribution p and takes 50 steps => p_50
  • Computes the stationary distribution: the eigenvector of P.T with eigenvalue 1 (numerically: closest to 1) => p_stationary

Remember to normalize the eigenvector — I didn't…

  • Checks if p_50 and p_stationary are equal to tolerance 1e-5

Toolbox : np.random.rand, .dot(), np.linalg.eig, reductions, abs(), argmin, comparisons, all, np.linalg.norm, etc.

Solution : Python source file

I-D. Matplotlib : plotting

I-D-1. Introduction

Matplotlib is probably the single most used Python package for 2D-graphics. It provides both a very quick way to visualize data from Python and publication-quality figures in many formats. We are going to explore matplotlib in interactive mode covering most common cases.

I-D-1-a. IPython and the pylab mode

IPython is an enhanced interactive Python shell that has lots of interesting features including named inputs and outputs, access to shell commands, improved debugging and many more. It is central to the scientific-computing workflow in Python for its use in combination with Matplotlib:

We start IPython with the command line argument -pylab (--pylab since IPython version 0.12), for interactive matplotlib sessions with Matlab/Mathematica-like functionality.

I-D-1-b. pylab

pylab provides a procedural interface to the matplotlib object-oriented plotting library. It is modeled closely after Matlab™. Therefore, the majority of plotting commands in pylab have Matlab™ analogs with similar arguments. Important commands are explained with interactive examples.

I-D-2. Simple plot

In this section, we want to draw the cosine and sine functions on the same plot. Starting from the default settings, we'll enrich the figure step by step to make it nicer.

First step is to get the data for the sine and cosine functions:

import numpy as np

X = np.linspace(-np.pi, np.pi, 256, endpoint=True)
C, S = np.cos(X), np.sin(X)

X is now a numpy array with 256 values ranging from -π to +π (included). C is the cosine (256 values) and S is the sine (256 values).

To run the example, you can type them in an IPython interactive session:

$ ipython --pylab

This brings us to the IPython prompt:

IPython 0.13 -- An enhanced Interactive Python.
?       -> Introduction to IPython's features.
%magic  -> Information about IPython's 'magic' % functions.
help    -> Python's own help system.
object? -> Details about 'object'. ?object also works, ?? prints more.

Welcome to pylab, a matplotlib-based Python environment.
For more information, type 'help(pylab)'.

You can also download each of the examples and run it using regular python, but you will loose interactive data manipulation:

$ python exercice_1.py

You can get source for each step by clicking on the corresponding figure.

I-D-2-a. Plotting with default settings

Image non disponible

Matplotlib comes with a set of default settings that allow customizing all kinds of properties. You can control the defaults of almost every property in matplotlib: figure size and dpi, line width, color and style, axes, axis and grid properties, text and font properties and so on.

import pylab as pl
import numpy as np

X = np.linspace(-np.pi, np.pi, 256, endpoint=True)
C, S = np.cos(X), np.sin(X)

pl.plot(X, C)
pl.plot(X, S)


I-D-2-b. Instantiating defaults

Image non disponible

In the script below, we've instantiated (and commented) all the figure settings that influence the appearance of the plot.

The settings have been explicitly set to their default values, but now you can interactively play with the values to explore their affect (see Line propertiesLine properties and Line stylesLine styles below).

import pylab as pl
import numpy as np

# Create a figure of size 8x6 inches, 80 dots per inch
pl.figure(figsize=(8, 6), dpi=80)

# Create a new subplot from a grid of 1x1
pl.subplot(1, 1, 1)

X = np.linspace(-np.pi, np.pi, 256, endpoint=True)
C, S = np.cos(X), np.sin(X)

# Plot cosine with a blue continuous line of width 1 (pixels)
pl.plot(X, C, color=« blue », linewidth=1.0, linestyle=« - »)

# Plot sine with a green continuous line of width 1 (pixels)
pl.plot(X, S, color=« green », linewidth=1.0, linestyle=« - »)

# Set x limits
pl.xlim(-4.0, 4.0)

# Set x ticks
pl.xticks(np.linspace(-4, 4, 9, endpoint=True))

# Set y limits
pl.ylim(-1.0, 1.0)

# Set y ticks
pl.yticks(np.linspace(-1, 1, 5, endpoint=True))

# Save figure using 72 dots per inch
# savefig(« exercice_2.png », dpi=72)

# Show result on screen

I-D-2-c. Changing colors and line widths

Image non disponible

First step, we want to have the cosine in blue and the sine in red and a slighty thicker line for both of them. We'll also slightly alter the figure size to make it more horizontal.

pl.figure(figsize=(10, 6), dpi=80)
pl.plot(X, C, color=« blue », linewidth=2.5, linestyle=« - »)
pl.plot(X, S, color=« red »,  linewidth=2.5, linestyle=« - »)

I-D-2-d. Setting limits

Image non disponible

Current limits of the figure are a bit too tight and we want to make some space in order to clearly see all data points.

pl.xlim(X.min() * 1.1, X.max() * 1.1)
pl.ylim(C.min() * 1.1, C.max() * 1.1)

I-D-2-e. Setting ticks

Image non disponible

Current ticks are not ideal because they do not show the interesting values (+/-π,+/-π/2) for sine and cosine. We'll change them such that they show only these values.

pl.xticks([-np.pi, -np.pi/2, 0, np.pi/2, np.pi])
pl.yticks([-1, 0, +1])

I-D-2-f. Setting tick labels

Image non disponible

Ticks are now properly placed but their label is not very explicit. We could guess that 3.142 is π but it would be better to make it explicit. When we set tick values, we can also provide a corresponding label in the second argument list. Note that we'll use latex to allow for nice rendering of the label.

pl.xticks([-np.pi, -np.pi/2, 0, np.pi/2, np.pi],
          [r'$-\pi$', r'$-\pi/2$', r'$0$', r'$+\pi/2$', r'$+\pi$'])

pl.yticks([-1, 0, +1],
          [r'$-1$', r'$0$', r'$+1$'])

I-D-2-g. Moving spines

Image non disponible

Spines are the lines connecting the axis tick marks and noting the boundaries of the data area. They can be placed at arbitrary positions and until now, they were on the border of the axis. We'll change that since we want to have them in the middle. Since there are four of them (top/bottom/left/right), we'll discard the top and right by setting their color to none and we'll move the bottom and left ones to coordinate 0 in data space coordinates.

ax = pl.gca()  # gca stands for 'get current axis'

I-D-2-h. Adding a legend

Image non disponible

Let's add a legend in the upper left corner. This only requires adding the keyword argument label (that will be used in the legend box) to the plot commands.

pl.plot(X, C, color=« blue », linewidth=2.5, linestyle=« - », label=« cosine »)
pl.plot(X, S, color=« red »,  linewidth=2.5, linestyle=« - », label=« sine »)

pl.legend(loc='upper left')

I-D-2-i. Annotate some points

Image non disponible

Let's annotate some interesting points using the annotate command. We chose the 2π/3 value and we want to annotate both the sine and the cosine. We'll first draw a marker on the curve as well as a straight dotted line. Then, we'll use the annotate command to display some text with an arrow.


t = 2 * np.pi / 3
pl.plot([t, t], [0, np.cos(t)], color='blue', linewidth=2.5, linestyle=« -- »)
pl.scatter([t, ], [np.cos(t), ], 50, color='blue')

            xy=(t, np.sin(t)), xycoords='data',
            xytext=(+10, +30), textcoords='offset points', fontsize=16,
            arrowprops=dict(arrowstyle=« -> », connectionstyle=« arc3,rad=.2 »))

pl.plot([t, t],[0, np.sin(t)], color='red', linewidth=2.5, linestyle=« -- »)
pl.scatter([t, ],[np.sin(t), ], 50, color='red')

            xy=(t, np.cos(t)), xycoords='data',
            xytext=(-90, -50), textcoords='offset points', fontsize=16,
            arrowprops=dict(arrowstyle=« -> », connectionstyle=« arc3,rad=.2 »))

I-D-2-j. Devil is in the details

Image non disponible


The tick labels are now hardly visible because of the blue and red lines. We can make them bigger and we can also adjust their properties such that they'll be rendered on a semi-transparent white background. This will allow us to see both the data and the labels.

for label in ax.get_xticklabels() + ax.get_yticklabels():
    label.set_bbox(dict(facecolor='white', edgecolor='None', alpha=0.65))

I-D-3. Figures, Subplots, Axes and Ticks

A “figure” in matplotlib means the whole window in the user interface. Within this figure there can be “subplots”.

So far we have used implicit figure and axes creation. This is handy for fast plots. We can have more control over the display using figure, subplot, and axes explicitly. While subplot positions the plots in a regular grid, axes allows free placement within the figure. Both can be useful depending on your intention. We've already worked with figures and subplots without explicitly calling them. When we call plot, matplotlib calls gca() to get the current axes and gca in turn calls gcf() to get the current figure. If there is none it calls figure() to make one, strictly speaking, to make a subplot(111). Let's look at the details.

I-D-3-a. Figures

A figure is the windows in the GUI that has “Figure #” as title. Figures are numbered starting from 1 as opposed to the normal Python way starting from 0. This is clearly MATLAB-style. There are several parameters that determine what the figure looks like :






number of figure



figure size in in inches (width, height)



resolution in dots per inch



color of the drawing background



color of edge around the drawing background



draw figure frame or not

The defaults can be specified in the resource file and will be used most of the time. Only the number of the figure is frequently changed.

As with other objects, you can set figure properties also setp or with the set_something methods.

When you work with the GUI you can close a figure by clicking on the x in the upper right corner. But you can close a figure programmatically by calling close. Depending on the argument it closes (1) the current figure (no argument), (2) a specific figure (figure number or figure instance as argument), or (3) all figures (« all » as argument).

pl.close(1)     # Closes figure 1

I-D-3-b. Subplots

With subplot you can arrange plots in a regular grid. You need to specify the number of rows and columns and the number of the plot. Note that the gridspec command is a more powerful alternative.

Image non disponible
Image non disponible
Image non disponible
Image non disponible

I-D-3-c. Axes

Axes are very similar to subplots but allow placement of plots at any location in the figure. So if we want to put a smaller plot inside a bigger one we do so with axes.

Image non disponible
Image non disponible

I-D-3-d. Ticks

Well formatted ticks are an important part of publishing-ready figures. Matplotlib provides a totally configurable system for ticks. There are tick locators to specify where ticks should appear and tick formatters to give ticks the appearance you want. Major and minor ticks can be located and formatted independently from each other. Per default minor ticks are not shown, i.e. there is only an empty list for them because it is as NullLocator (see below).

I-D-3-e. Tick Locators

Tick locators control the positions of the ticks. They are set as follows :

ax = pl.gca()

There are several locators for different kind of requirements :

Image non disponible

All of these locators derive from the base class matplotlib.ticker.Locator. You can make your own locator deriving from it. Handling dates as ticks can be especially tricky. Therefore, matplotlib provides special locators in matplotlib.dates.

I-D-4. Other Types of Plots: examples and exercises

I-D-4-a. Regular Plots

Image non disponible

You need to use the fill_between command.

Starting from the code below, try to reproduce the graphic on the right taking care of filled areas:

n = 256
X = np.linspace(-np.pi, np.pi, n, endpoint=True)
Y = np.sin(2 * X)

pl.plot(X, Y + 1, color='blue', alpha=1.00)
pl.plot(X, Y - 1, color='blue', alpha=1.00)

Click on the figure for solution.

I-D-4-b. Scatter Plots

Image non disponible

Color is given by angle of (X,Y).

Starting from the code below, try to reproduce the graphic on the right taking care of marker size, color and transparency.

n = 1024
X = np.random.normal(0,1,n)
Y = np.random.normal(0,1,n)


Click on figure for solution.

I-D-4-c. Bar Plots

Image non disponible

You need to take care of text alignment.

Starting from the code below, try to reproduce the graphic on the right by adding labels for red bars.

n = 12
X = np.arange(n)
Y1 = (1 - X / float(n)) * np.random.uniform(0.5, 1.0, n)
Y2 = (1 - X / float(n)) * np.random.uniform(0.5, 1.0, n)

pl.bar(X, +Y1, facecolor='#9999ff', edgecolor='white')
pl.bar(X, -Y2, facecolor='#ff9999', edgecolor='white')

for x, y in zip(X, Y1):
    pl.text(x + 0.4, y + 0.05, '%.2f' % y, ha='center', va='bottom')

pl.ylim(-1.25, +1.25)

Click on figure for solution.

I-D-4-d. Contour Plots

Image non disponible

You need to use the clabel command.

Starting from the code below, try to reproduce the graphic on the right taking care of the colormap (see ColormapsColormaps below).

def f(x, y):
    return (1 - x / 2 + x ** 5 + y ** 3) * np.exp(-x ** 2 -y ** 2)

n = 256
x = np.linspace(-3, 3, n)
y = np.linspace(-3, 3, n)
X, Y = np.meshgrid(x, y)

pl.contourf(X, Y, f(X, Y), 8, alpha=.75, cmap='jet')
C = pl.contour(X, Y, f(X, Y), 8, colors='black', linewidth=.5)

Click on figure for solution.

I-D-4-e. Imshow

Image non disponible

You need to take care of the origin of the image in the imshow command and use a colorbar

Starting from the code below, try to reproduce the graphic on the right taking care of colormap, image interpolation and origin.

def f(x, y):
    return (1 - x / 2 + x ** 5 + y ** 3) * np.exp(-x ** 2 - y ** 2)

n = 10
x = np.linspace(-3, 3, 4 * n)
y = np.linspace(-3, 3, 3 * n)
X, Y = np.meshgrid(x, y)
pl.imshow(f(X, Y))

Click on the figure for the solution.

I-D-4-f. Pie Charts

Image non disponible

You need to modify Z.

Starting from the code below, try to reproduce the graphic on the right taking care of colors and slices size.

Z = np.random.uniform(0, 1, 20)

Click on the figure for the solution.

I-D-4-g. Quiver Plots

Image non disponible

You need to draw arrows twice.

Starting from the code above, try to reproduce the graphic on the right taking care of colors and orientations.

n = 8
X, Y = np.mgrid[0:n, 0:n]
pl.quiver(X, Y)

Click on figure for solution.

I-D-4-h. Grids

Image non disponible

Starting from the code below, try to reproduce the graphic on the right taking care of line styles.

axes = pl.gca()
axes.set_xlim(0, 4)
axes.set_ylim(0, 3)

Click on figure for solution.

I-D-4-i. Multi Plots

Image non disponible

You can use several subplots with different partition.

Starting from the code below, try to reproduce the graphic on the right.

pl.subplot(2, 2, 1)
pl.subplot(2, 2, 3)
pl.subplot(2, 2, 4)

Click on figure for solution.

I-D-4-j. Polar Axis

Image non disponible

You only need to modify the axes line

Starting from the code below, try to reproduce the graphic on the right.

pl.axes([0, 0, 1, 1])

N = 20
theta = np.arange(0., 2 * np.pi, 2 * np.pi / N)
radii = 10 * np.random.rand(N)
width = np.pi / 4 * np.random.rand(N)
bars = pl.bar(theta, radii, width=width, bottom=0.0)

for r, bar in zip(radii, bars):
    bar.set_facecolor(cm.jet(r / 10.))

Click on figure for solution.

I-D-4-k. 3D Plots

Image non disponible

You need to use contourf

Starting from the code below, try to reproduce the graphic on the right.

from mpl_toolkits.mplot3d import Axes3D

fig = pl.figure()
ax = Axes3D(fig)
X = np.arange(-4, 4, 0.25)
Y = np.arange(-4, 4, 0.25)
X, Y = np.meshgrid(X, Y)
R = np.sqrt(X**2 + Y**2)
Z = np.sin(R)

ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap='hot')

Click on figure for solution.

I-D-4-l. Text

Image non disponible

Have a look at the matplotlib logo.

Try to do the same from scratch !

Click on figure for solution.

Quick read

If you want to do a first quick pass through the Scipy lectures to learn the ecosystem, you can directly skip to the next chapter : Scipy : high-level scientific computingScipy : high-level scientific computing.

The remainder of this chapter is not necessary to follow the rest of the intro part. But be sure to come back and finish this chapter later.

I-D-5. Beyond this tutorial

Matplotlib benefits from extensive documentation as well as a large community of users and developers. Here are some links of interest:

I-D-5-a. Tutorials

  • Pyplot tutorial
  • Introduction
  • Controlling line properties
  • Working with multiple figures and axes
  • Working with text
  • Image tutorial
  • Startup commands
  • Importing image data into Numpy arrays
  • Plotting numpy arrays as images
  • Text tutorial
  • Text introduction
  • Basic text commands
  • Text properties and layout
  • Writing mathematical expressions
  • Text rendering With LaTeX
  • Annotating text
  • Artist tutorial
  • Introduction
  • Customizing your objects
  • Object containers
  • Figure container
  • Axes container
  • Axis containers
  • Tick containers
  • Path tutorial
  • Introduction
  • Bézier example
  • Compound paths
  • Transforms tutorial
  • Introduction
  • Data coordinates
  • Axes coordinates
  • Blended transformations
  • Using offset transforms to create a shadow effect
  • The transformation pipeline

I-D-5-b. Matplotlib documentation

I-D-5-c. Code documentation

The code is well documented and you can quickly access a specific command from within a python session :

 import pylab as pl
Help on function plot in module matplotlib.pyplot:

plot(*args, **kwargs)
   Plot lines and/or markers to the
   :class:`~matplotlib.axes.Axes`.  *args* is a variable length
   argument, allowing for multiple *x*, *y* pairs with an
   optional format string.  For example, each of the following is

       plot(x, y)         # plot x and y using default line style and color
       plot(x, y, 'bo')   # plot x and y using blue circle markers
       plot(y)            # plot y using x as index array 0..N-1
       plot(y, 'r+')      # ditto, but with red plusses

   If *x* and/or *y* is 2-dimensional, then the corresponding columns
   will be plotted.

I-D-5-d. Galleries

The matplotlib gallery is also incredibly useful when you search how to render a given graphic. Each example comes with its source.

A smaller gallery is also available here.

I-D-5-e. Mailing lists

Finally, there is a user mailing list where you can ask for help and a developers mailing list that is more technical.

I-D-6. Quick references

Here is a set of tables that show main properties and styles.

I-D-6-a. Line properties




alpha (or a)

alpha transparency on 0-1 scale

Image non disponible


True or False - use antialised rendering

Image non disponible Image non disponible

color (or c)

matplotlib color arg

Image non disponible

linestyle (or ls)

see Line propertiesLine properties


linewidth (or lw)

float, the line width in points

Image non disponible


Cap style for solid lines

Image non disponible


Join style for solid lines

Image non disponible


Cap style for dashes

Image non disponible


Join style for dashes

Image non disponible


see MarkersMarkers


markeredgewidth (mew)

line width around the marker symbol

Image non disponible

markeredgecolor (mec)

edge color if a marker is used

Image non disponible

markerfacecolor (mfc)

face color if a marker is used

Image non disponible

markersize (ms)

size of the marker in points

Image non disponible

I-D-6-b. Line styles

Image non disponible

I-D-6-c. Markers

Image non disponible

I-D-6-d. Colormaps

All colormaps can be reversed by appending _r. For instance, gray_r is the reverse of gray.

If you want to know more about colormaps, checks Documenting the matplotlib colormaps.

Image non disponible

I-E. Scipy : high-level scientific computing


The scipy package contains various toolboxes dedicated to common issues in scientific computing. Its different submodules correspond to different applications, such as interpolation, integration, optimization, image processing, statistics, special functions, etc.

scipy can be compared to other standard scientific-computing libraries, such as the GSL (GNU Scientific Library for C and C++), or Matlab's toolboxes. scipy is the core package for scientific routines in Python; it is meant to operate efficiently on numpy arrays, so that numpy and scipy work hand in hand.

Before implementing a routine, it is worth checking if the desired data processing is not already implemented in Scipy. As non-professional programmers, scientists often tend to re-invent the wheel, which leads to buggy, non-optimal, difficult-to-share and unmaintainable code. By contrast, Scipy‘s routines are optimized and tested, and should therefore be used when possible.

This tutorial is far from an introduction to numerical computing. As enumerating the different submodules and functions in scipy would be very boring, we concentrate instead on a few examples to give a general idea of how to use scipy for scientific computing.

scipy is composed of task-specific sub-modules:


Vector quantization / Kmeans


Physical and mathematical constants


Fourier transform


Integration routines




Data input and output


Linear algebra routines


n-dimensional image package


Orthogonal distance regression




Signal processing


Sparse matrices


Spatial data structures and algorithms


Any special mathematical functions



They all depend on numpy, but are mostly independent of each other. The standard way of importing Numpy and these Scipy modules is:

 import numpy as np
 from scipy import stats  # same for other sub-modules

The main scipy namespace mostly contains functions that are really numpy functions (try scipy.cos is np.cos). Those are exposed for historical reasons only; there's usually no reason to use import scipy in your code.

I-E-1. File input/output: scipy.io


  • Loading and saving matlab files :
 from scipy import io as spio
 a = np.ones((3, 3))
 spio.savemat('file.mat', {'a': a}) # savemat expects a dictionary
 data = spio.loadmat('file.mat', struct_as_record=True)
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])
  • Reading images :
 from scipy import misc
 # Matplotlib also has a similar function
 import matplotlib.pyplot as plt

See also :

  • Load text files : numpy.loadtxt()/numpy.savetxt()
  • Clever loading of text/csv files : numpy.genfromtxt()/numpy.recfromcsv()
  • Fast and efficient, but numpy-specific, binary format : numpy.save()/numpy.load()

I-E-2. Special functions: scipy.special


Special functions are transcendental functions. The docstring of the scipy.special module is well-written, so we won't list all functions here. Frequently used ones are :

  • Bessel function, such as scipy.special.jn() (nth integer order Bessel function)
  • Elliptic function (scipy.special.ellipj() for the Jacobian elliptic function…)
  • Gamma function: scipy.special.gamma(), also note scipy.special.gammaln() which will give the log of Gamma to a higher numerical precision.
  • Erf, the area under a Gaussian curve: scipy.special.erf()

I-E-3. Linear algebra operations: scipy.linalg


The scipy.linalg module provides standard linear algebra operations, relying on an underlying efficient implementation (BLAS, LAPACK).

 from scipy import linalg
 arr = np.array([[1, 2],
…                 [3, 4]])
 arr = np.array([[3, 2],
…                 [6, 4]])
 linalg.det(np.ones((3, 4)))
Traceback (most recent call last):
ValueError: expected square matrix
 arr = np.array([[1, 2],
…                 [3, 4]])
 iarr = linalg.inv(arr)
array([[-2. ,  1. ],
       [ 1.5, -0.5]])
 np.allclose(np.dot(arr, iarr), np.eye(2))

Finally computing the inverse of a singular matrix (its determinant is zero) will raise LinAlgError:

 arr = np.array([[3, 2],
…                 [6, 4]])
Traceback (most recent call last):
LinAlgError: singular matrix
  • More advanced operations are available, for example singular-value decomposition (SVD):
 arr = np.arange(9).reshape((3, 3)) + np.diag([1, 0, 1])
 uarr, spec, vharr = linalg.svd(arr)

The resulting array spectrum is:

array([ 14.88982544,   0.45294236,   0.29654967])

The original matrix can be re-composed by matrix multiplication of the outputs of svd with np.dot:

 sarr = np.diag(spec)
 svd_mat = uarr.dot(sarr).dot(vharr)
 np.allclose(svd_mat, arr)

SVD is commonly used in statistics and signal processing. Many other standard decompositions (QR, LU, Cholesky, Schur), as well as solvers for linear systems, are available in scipy.linalg.

I-E-4. Fast Fourier transforms: scipy.fftpack

The scipy.fftpack module allows to compute fast Fourier transforms. As an illustration, a (noisy) input signal may look like :

 time_step = 0.02
 period = 5.
 time_vec = np.arange(0, 20, time_step)
 sig = np.sin(2 * np.pi / period * time_vec) + \
…       0.5 * np.random.randn(time_vec.size)

The observer doesn't know the signal frequency, only the sampling time step of the signal sig. The signal is supposed to come from a real function so the Fourier transform will be symmetric. The scipy.fftpack.fftfreq() function will generate the sampling frequencies and scipy.fftpack.fft() will compute the fast Fourier transform:

 from scipy import fftpack
 sample_freq = fftpack.fftfreq(sig.size, d=time_step)
 sig_fft = fftpack.fft(sig)

Because the resulting power is symmetric, only the positive part of the spectrum needs to be used for finding the frequency:

 pidxs = np.where(sample_freq > 0)
 freqs = sample_freq[pidxs]
 power = np.abs(sig_fft)[pidxs]

[source code, hires.png, pdf]

Image non disponible

The signal frequency can be found by :

 freq = freqs[power.argmax()]
 np.allclose(freq, 1./period)  # check that correct freq is found

Now the high-frequency noise will be removed from the Fourier transformed signal :

 sig_fft[np.abs(sample_freq) > freq] = 0

The resulting filtered signal can be computed by the scipy.fftpack.ifft() function :

 main_sig = fftpack.ifft(sig_fft)

The result can be viewed with :

 import pylab as plt
 plt.plot(time_vec, sig)
 plt.plot(time_vec, main_sig, linewidth=3)
 plt.xlabel('Time [s]')

[source code, hires.png, pdf]

Image non disponible

Numpy also has an implementation of FFT (numpy.fft). However, in general the scipy one should be preferred, as it uses more efficient underlying implementations.

Worked example : Crude periodicity finding

[source code, hires.png, pdf]

Image non disponible

[source code, hires.png, pdf]

Image non disponible

Worked example : Gaussian image blur

Convolution :

kitxmlcodelatexdvpf_1(t) = \int dt'\, K(t-t') f_0(t') \tilde{f}_1(\omega) = \tilde{K}(\omega) \tilde{f}_0(\omega)finkitxmlcodelatexdvp

[source code, hires.png, pdf]

Image non disponible

Exercise : Denoise moon landing image

Image non disponible
  1. Examine the provided image moonlanding.png, which is heavily contaminated with periodic noise. In this exercise, we aim to clean up the noise using the Fast Fourier Transform.
  2. Load the image using pylab.imread().
  3. Find and use the 2-D FFT function in scipy.fftpack, and plot the spectrum (Fourier transform of) the image. Do you have any trouble visualising the spectrum? If so, why?
  4. The spectrum consists of high and low frequency components. The noise is contained in the high-frequency part of the spectrum, so set some of those components to zero (use array slicing).
  5. Apply the inverse Fourier transform to see the resulting image.

I-E-5. Optimization and fit: scipy.optimize

Optimization is the problem of finding a numerical solution to a minimization or equality.

The scipy.optimize module provides useful algorithms for function minimization (scalar or multidimensional), curve fitting and root finding.

 from scipy import optimize

Finding the minimum of a scalar function

Let's define the following function :

 def f(x):
…     return x**2 + 10*np.sin(x)

and plot it :

 x = np.arange(-10, 10, 0.1)
 plt.plot(x, f(x)) 

[source code, hires.png, pdf]

Image non disponible

This function has a global minimum around -1.3 and a local minimum around 3.8.

The general and efficient way to find a minimum for this function is to conduct a gradient descent starting from a given initial point. The BFGS algorithm is a good way of doing this:

 optimize.fmin_bfgs(f, 0)
Optimization terminated successfully.
         Current function value: -7.945823
         Iterations: 5
         Function evaluations: 24
         Gradient evaluations: 8

A possible issue with this approach is that, if the function has local minima the algorithm may find these local minima instead of the global minimum depending on the initial point:

 optimize.fmin_bfgs(f, 3, disp=0)
array([ 3.83746663])

If we don't know the neighborhood of the global minimum to choose the initial point, we need to resort to costlier global optimization. To find the global minimum, the simplest algorithm is the brute force algorithm, in which the function is evaluated on each point of a given grid:

 grid = (-10, 10, 0.1)
 xmin_global = optimize.brute(f, (grid,))

For larger grid sizes, scipy.optimize.brute() becomes quite slow. scipy.optimize.anneal() provides an alternative, using simulated annealing. More efficient algorithms for different classes of global optimization problems exist, but this is out of the scope of scipy. Some useful packages for global optimization are OpenOpt, IPOPT, PyGMO and PyEvolve.

To find the local minimum, let's constraint the variable to the interval (0, 10) using scipy.optimize.fminbound() :

 xmin_local = optimize.fminbound(f, 0, 10)    

Finding minima of function is discussed in more details in the advanced chapter : Mathematical optimization: finding minima of functions.

Finding the roots of a scalar function

To find a root, i.e. a point where f(x) = 0, of the function f above we can use for example scipy.optimize.fsolve() :

 root = optimize.fsolve(f, 1)  # our initial guess is 1
array([ 0.])

Note that only one root is found. Inspecting the plot of f reveals that there is a second root around -2.5. We find the exact value of it by adjusting our initial guess:

 root2 = optimize.fsolve(f, -2.5)

Curve fitting

Suppose we have data sampled from f with some noise:

 xdata = np.linspace(-10, 10, num=20)
 ydata = f(xdata) + np.random.randn(xdata.size)

Now if we know the functional form of the function from which the samples were drawn (x^2 + sin(x) in this case) but not the amplitudes of the terms, we can find those by least squares curve fitting. First we have to define the function to fit :

 def f2(x, a, b):
…     return a*x**2 + b*np.sin(x)

Then we can use scipy.optimize.curve_fit() to find a and b :

 guess = [2, 2]
 params, params_covariance = optimize.curve_fit(f2, xdata, ydata, guess)
array([ 0.99925147,  9.76065551])

Now we have found the minima and roots of f and used curve fitting on it, we put all those resuls together in a single plot:

[source code, hires.png, pdf]

Image non disponible

In Scipy >= 0.11 unified interfaces to all minimization and root finding algorithms are available : scipy.optimize.minimize(), scipy.optimize.minimize_scalar() and scipy.optimize.root(). They allow comparing various algorithms easily through the method keyword.

You can find algorithms with the same functionalities for multidimensional problems in scipy.optimize.

Exercise : Curve fitting of temperature data

The temperature extremes in Alaska for each month, starting in January, are given by (in degrees Celcius) :

max:  17,  19,  21,  28,  33,  38, 37,  37,  31,  23,  19,  18
min: -62, -59, -56, -46, -32, -18, -9, -13, -25, -46, -52, -58
  1. Plot these temperature extremes.
  2. Define a function that can describe min and max temperatures. Hint: this function has to have a period of 1 year. Hint: include a time offset.
  3. Fit this function to the data with scipy.optimize.curve_fit().
  4. Plot the result. Is the fit reasonable? If not, why?
  5. Is the time offset for min and max temperatures the same within the fit accuracy?

Exercise : 2-D minimization

[source code, hires.png, pdf]

Image non disponible

The six-hump camelback function :

kitxmlcodelatexdvpf(x, y) = (4 - 2.1x^2 + \frac{x^4}{3})x^2 + xy + (4y^2 - 4)y^2finkitxmlcodelatexdvp

has multiple global and local minima. Find the global minima of this function.

Hints :

How many global minima are there, and what is the function value at those points? What happens for an initial guess of (x, y) = (0, 0) ?

See the summary exercise on Non linear least squares curve fitting: application to point extraction in topographical lidar dataSummary exercises on scientific computing for another, more advanced example.

I-E-6. Statistics and random numbers: scipy.stats

The module scipy.stats contains statistical tools and probabilistic descriptions of random processes. Random number generators for various random process can be found in numpy.random.

I-E-6-a. Histogram and probability density function

Given observations of a random process, their histogram is an estimator of the random process's PDF (probability density function) :

 a = np.random.normal(size=1000)
 bins = np.arange(-4, 5)
array([-4, -3, -2, -1,  0,  1,  2,  3,  4])
 histogram = np.histogram(a, bins=bins, normed=True)[0]
 bins = 0.5*(bins[1:] + bins[:-1])
array([-3.5, -2.5, -1.5, -0.5,  0.5,  1.5,  2.5,  3.5])
 from scipy import stats
 b = stats.norm.pdf(bins)  # norm is a distribution
In [1]: pl.plot(bins, histogram)
In [2]: pl.plot(bins, b)

[source codehires.png, , pdf]

Image non disponible

If we know that the random process belongs to a given family of random processes, such as normal processes, we can do a maximum-likelihood fit of the observations to estimate the parameters of the underlying distribution. Here we fit a normal process to the observed data:

 loc, std = stats.norm.fit(a)

Exercise : Probability distributions

Generate 1000 random variates from a gamma distribution with a shape parameter of 1, then plot a histogram from those samples. Can you plot the pdf on top (it should match)?

Extra: the distributions have a number of useful methods. Explore them by reading the docstring or by using IPython tab completion. Can you find the shape parameter of 1 back by using the fit method on your random variates?

I-E-6-b. Percentiles

The median is the value with half of the observations below, and half above :


It is also called the percentile 50, because 50% of the observation are below it :

 stats.scoreatpercentile(a, 50)     

Similarly, we can calculate the percentile 90 :

 stats.scoreatpercentile(a, 90)     

The percentile is an estimator of the CDF: cumulative distribution function.

I-E-6-c. Statistical tests

A statistical test is a decision indicator. For instance, if we have two sets of observations, that we assume are generated from Gaussian processes, we can use a T-test to decide whether the two sets of observations are significantly different :

 a = np.random.normal(0, 1, size=100)
 b = np.random.normal(1, 1, size=10)
 stats.ttest_ind(a, b)   
(-3.75832707…, 0.00027786…)

The resulting output is composed of :

  • The T statistic value : it is a number the sign of which is proportional to the difference between the two random processes and the magnitude is related to the significance of this difference.
  • the p value : the probability of both processes being identical. If it is close to 1, the two process are almost certainly identical. The closer it is to zero, the more likely it is that the processes have different means.

I-E-7. Interpolation: scipy.interpolate

The scipy.interpolate is useful for fitting a function from experimental data and thus evaluating points where no measure exists. The module is based on the FITPACK Fortran subroutines from the netlib project.

By imagining experimental data close to a sine function :

 measured_time = np.linspace(0, 1, 10)
 noise = (np.random.random(10)*2 - 1) * 1e-1
 measures = np.sin(2 * np.pi * measured_time) + noise

The scipy.interpolate.interp1d class can build a linear interpolation function :

 from scipy.interpolate import interp1d
 linear_interp = interp1d(measured_time, measures)

Then the scipy.interpolate.linear_interp instance needs to be evaluated at the time of interest :

 computed_time = np.linspace(0, 1, 50)
 linear_results = linear_interp(computed_time)

A cubic interpolation can also be selected by providing the kind optional keyword argument :

 cubic_interp = interp1d(measured_time, measures, kind='cubic')
 cubic_results = cubic_interp(computed_time)

The results are now gathered on the following Matplotlib figure :

[source code, hires.png, pdf]

Image non disponible

scipy.interpolate.interp2d is similar to scipy.interpolate.interp1d, but for 2-D arrays. Note that for the interp family, the computed time must stay within the measured time range. See the summary exercise on Maximum wind speed prediction at the Sprogø stationSummary exercises on scientific computing for a more advance spline interpolation example.

I-E-8. Numerical integration: scipy.integrate


The most generic integration routine is scipy.integrate.quad() :

 from scipy.integrate import quad
 res, err = quad(np.sin, 0, np.pi/2)
 np.allclose(res, 1)
 np.allclose(err, 1 - res)

Others integration schemes are available with fixed_quad, quadrature, romberg.

scipy.integrate also features routines for integrating Ordinary Differential Equations (ODE). In particular, scipy.integrate.odeint() is a general-purpose integrator using LSODA (Livermore Solver for Ordinary Differential equations with Automatic method switching for stiff and non-stiff problems), see the ODEPACK Fortran library for more details.

odeint solves first-order ODE systems of the form:

dy/dt = rhs(y1, y2, .., t0,…)

As an introduction, let us solve the ODE dy/dt = -2y between t = 0..4, with the initial condition y(t=0) = 1. First the function computing the derivative of the position needs to be defined:

 def calc_derivative(ypos, time, counter_arr):
…     counter_arr += 1return -2 * ypos

An extra argument counter_arr has been added to illustrate that the function may be called several times for a single time step, until solver convergence. The counter array is defined as :

counter = np.zeros((1,), dtype=np.uint16)

The trajectory will now be computed :

 from scipy.integrate import odeint
 time_vec = np.linspace(0, 4, 40)
 yvec, info = odeint(calc_derivative, 1, time_vec,
…                     args=(counter,), full_output=True)

Thus the derivative function has been called more than 40 times (which was the number of time steps) :

array([129], dtype=uint16)

and the cumulative number of iterations for each of the 10 first time steps can be obtained by:

array([31, 35, 43, 49, 53, 57, 59, 63, 65, 69], dtype=int32)

Note that the solver requires more iterations for the first time step. The solution yvec for the trajectory can now be plotted:

[source code, hires.png, pdf]

Image non disponible

Another example with scipy.integrate.odeint() will be a damped spring-mass oscillator (2d order oscillator). The position of a mass attached to a spring obeys the 2d order ODE y'' + 2 eps wo y' + wo^2 y = 0 with wo^2 = k/m with k the spring constant, m the mass and eps=c/(2 m wo) with c the damping coefficient. For this example, we choose the parameters as :

 mass = 0.5  # kg
 kspring = 4  # N/m
 cviscous = 0.4  # N s/m

so the system will be underdamped, because :

 eps = cviscous / (2 * mass * np.sqrt(kspring/mass))
 eps < 1

For the scipy.integrate.odeint() solver the 2d order equation needs to be transformed in a system of two first-order equations for the vector Y=(y, y'). It will be convenient to define nu = 2 eps * wo = c / m and om = wo^2 = k/m :

 nu_coef = cviscous / mass
 om_coef = kspring / mass

Thus the function will calculate the velocity and acceleration by:

 def calc_deri(yvec, time, nuc, omc):
…     return (yvec[1], -nuc * yvec[1] - omc * yvec[0])
 time_vec = np.linspace(0, 10, 100)
 yarr = odeint(calc_deri, (1, 0), time_vec, args=(nu_coef, om_coef))

The final position and velocity are shown on the following Matplotlib figure:

[source code, hires.png, pdf]

Image non disponible

There is no Partial Differential Equations (PDE) solver in Scipy. Some Python packages for solving PDE's are available, such as fipy or SfePy.

I-E-9. Signal processing: scipy.signal

 from scipy import signal
t = np.linspace(0, 5, 100)
x = t + np.random.normal(size=100)

pl.plot(t, x, linewidth=3)
pl.plot(t, signal.detrend(x), linewidth=3)

[source code, hires.png, pdf]

Image non disponible
t = np.linspace(0, 5, 100)
x = np.sin(t)

pl.plot(t, x, linewidth=3)
pl.plot(t[::2], signal.resample(x, 50), 'ko')

[source code, hires.png, pdf]

Image non disponible

I-E-10. Image processing: scipy.ndimage

The submodule dedicated to image processing in scipy is scipy.ndimage.

 from scipy import ndimage

Image processing routines may be sorted according to the category of processing they perform.

I-E-10-a. Geometrical transformations on images

Changing orientation, resolution, ..

 from scipy import misc
 lena = misc.lena()
 shifted_lena = ndimage.shift(lena, (50, 50))
 shifted_lena2 = ndimage.shift(lena, (50, 50), mode='nearest')
 rotated_lena = ndimage.rotate(lena, 30)
 cropped_lena = lena[50:-50, 50:-50]
 zoomed_lena = ndimage.zoom(lena, 2)
(1024, 1024)
Image non disponible
In [35]: subplot(151)
Out[35]: <matplotlib.axes.AxesSubplot object at 0x925f46c>

In [36]: pl.imshow(shifted_lena, cmap=cm.gray)
Out[36]: <matplotlib.image.AxesImage object at 0x9593f6c>

In [37]: axis('off')
Out[37]: (-0.5, 511.5, 511.5, -0.5)

In [39]: # etc.

I-E-10-b. Image filtering

 from scipy import misc
 lena = misc.lena()
 import numpy as np
 noisy_lena = np.copy(lena).astype(np.float)
 noisy_lena += lena.std()*0.5*np.random.standard_normal(lena.shape)
 blurred_lena = ndimage.gaussian_filter(noisy_lena, sigma=3)
 median_lena = ndimage.median_filter(blurred_lena, size=5)
 from scipy import signal
 wiener_lena = signal.wiener(blurred_lena, (5,5))
Image non disponible

Many other filters in scipy.ndimage.filters and scipy.signal can be applied to images.


Compare histograms for the different filtered images.

I-E-10-c. Mathematical morphology

Mathematical morphology is a mathematical theory that stems from set theory. It characterizes and transforms geometrical structures. Binary (black and white) images, in particular, can be transformed using this theory: the sets to be transformed are the sets of neighboring non-zero-valued pixels. The theory was also extended to gray-valued images.

Image non disponible

Elementary mathematical-morphology operations use a structuring element in order to modify other geometrical structures.

Let us first generate a structuring element

 el = ndimage.generate_binary_structure(2, 1)
array([[False, True, False],
       [True, True, True],
       [False, True, False]], dtype=bool)
array([[0, 1, 0],
       [1, 1, 1],
       [0, 1, 0]])
  • Erosion
 a = np.zeros((7,7), dtype=np.int)
 a[1:6, 2:5] = 1
array([[0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 1, 1, 0, 0],
       [0, 0, 1, 1, 1, 0, 0],
       [0, 0, 1, 1, 1, 0, 0],
       [0, 0, 1, 1, 1, 0, 0],
       [0, 0, 1, 1, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0]])
array([[0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0]])
 #Erosion removes objects smaller than the structure
 ndimage.binary_erosion(a, structure=np.ones((5,5))).astype(a.dtype)
array([[0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0]])
  • Dilation
 a = np.zeros((5, 5))
 a[2, 2] = 1
array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])
array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  1.,  1.,  1.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])
  • Opening
 a = np.zeros((5,5), dtype=np.int)
 a[1:4, 1:4] = 1; a[4, 4] = 1
array([[0, 0, 0, 0, 0],
       [0, 1, 1, 1, 0],
       [0, 1, 1, 1, 0],
       [0, 1, 1, 1, 0],
       [0, 0, 0, 0, 1]])
 # Opening removes small objects
 ndimage.binary_opening(a, structure=np.ones((3,3))).astype(np.int)
array([[0, 0, 0, 0, 0],
       [0, 1, 1, 1, 0],
       [0, 1, 1, 1, 0],
       [0, 1, 1, 1, 0],
       [0, 0, 0, 0, 0]])
 # Opening can also smooth corners
array([[0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 1, 1, 1, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0]])
  • Closing : ndimage.binary_closing


Check that opening amounts to eroding, then dilating.

An opening operation removes small structures, while a closing operation fills small holes. Such operations can therefore be used to “clean” an image.

 a = np.zeros((50, 50))
 a[10:-10, 10:-10] = 1
 a += 0.25*np.random.standard_normal(a.shape)
 mask = a>=0.5
 opened_mask = ndimage.binary_opening(mask)
 closed_mask = ndimage.binary_closing(opened_mask)
Image non disponible


Check that the area of the reconstructed square is smaller than the area of the initial square. (The opposite would occur if the closing step was performed before the opening).

For gray-valued images, eroding (resp. dilating) amounts to replacing a pixel by the minimal (resp. maximal) value among pixels covered by the structuring element centered on the pixel of interest.

 a = np.zeros((7,7), dtype=np.int)
 a[1:6, 1:6] = 3
 a[4,4] = 2; a[2,3] = 1
array([[0, 0, 0, 0, 0, 0, 0],
       [0, 3, 3, 3, 3, 3, 0],
       [0, 3, 3, 1, 3, 3, 0],
       [0, 3, 3, 3, 3, 3, 0],
       [0, 3, 3, 3, 2, 3, 0],
       [0, 3, 3, 3, 3, 3, 0],
       [0, 0, 0, 0, 0, 0, 0]])
 ndimage.grey_erosion(a, size=(3,3))
array([[0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 1, 1, 0, 0],
       [0, 0, 1, 1, 1, 0, 0],
       [0, 0, 3, 2, 2, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0]])

I-E-10-d. Measurements on images

Let us first generate a nice synthetic binary image.

 x, y = np.indices((100, 100))
 sig = np.sin(2*np.pi*x/50.)*np.sin(2*np.pi*y/50.)*(1+x*y/50.**2)**2
 mask = sig > 1

Now we look for various information about the objects in the image :

 labels, nb = ndimage.label(mask)
 areas = ndimage.sum(mask, labels, xrange(1, labels.max()+1))
array([ 190.,   45.,  424.,  278.,  459.,  190.,  549.,  424.])
 maxima = ndimage.maximum(sig, labels, xrange(1, labels.max()+1))
array([  1.80238238,   1.13527605,   5.51954079,   2.49611818,
         6.71673619,   1.80238238,  16.76547217,   5.51954079])
[(slice(30L, 48L, None), slice(30L, 48L, None))]
 sl = ndimage.find_objects(labels==4)
 import pylab as pl
<matplotlib.image.AxesImage object at …>
Image non disponible

See the summary exercise on Image processing application: counting bubbles and unmolten grainsSummary exercises on scientific computing for a more advanced example.

I-E-11. Summary exercises on scientific computing

The summary exercises use mainly Numpy, Scipy and Matplotlib. They provide some real-life examples of scientific computing with Python. Now that the basics of working with Numpy and Scipy have been introduced, the interested user is invited to try these exercises.

Exercises :

I-E-11-a. Maximum wind speed prediction at the Sprogø station

The exercise goal is to predict the maximum wind speed occurring every 50 years even if no measure exists for such a period. The available data are only measured over 21 years at the Sprogø meteorological station located in Denmark. First, the statistical steps will be given and then illustrated with functions from the scipy.interpolate module. At the end the interested readers are invited to compute results from raw data and in a slightly different approach.

I-E-11-a-i. Statistical approach

The annual maxima are supposed to fit a normal probability density function. However such function is not going to be estimated because it gives a probability from a wind speed maxima. Finding the maximum wind speed occurring every 50 years requires the opposite approach, the result needs to be found from a defined probability. That is the quantile function role and the exercise goal will be to find it. In the current model, it is supposed that the maximum wind speed occurring every 50 years is defined as the upper 2% quantile.

By definition, the quantile function is the inverse of the cumulative distribution function. The latter describes the probability distribution of an annual maxima. In the exercise, the cumulative probability p_i for a given year i is defined as p_i = i/(N+1) with N = 21, the number of measured years. Thus it will be possible to calculate the cumulative probability of every measured wind speed maxima. From those experimental points, the scipy.interpolate module will be very useful for fitting the quantile function. Finally the 50 years maxima is going to be evaluated from the cumulative probability of the 2% quantile.

I-E-11-a-ii. Computing the cumulative probabilities

The annual wind speeds maxima have already been computed and saved in the numpy format in the file examples/max-speeds.npy, thus they will be loaded by using numpy :

import numpy as np
max_speeds = np.load('intro/summary-exercises/examples/max-speeds.npy')
years_nb = max_speeds.shape[0]

Following the cumulative probability definition p_i from the previous section, the corresponding values will be :

cprob = (np.arange(years_nb, dtype=np.float32) + 1)/(years_nb + 1)

and they are assumed to fit the given wind speeds :

sorted_max_speeds = np.sort(max_speeds)
I-E-11-a-iii. Prediction with UnivariateSpline

In this section the quantile function will be estimated by using the UnivariateSpline class which can represent a spline from points. The default behavior is to build a spline of degree 3 and points can have different weights according to their reliability. Variants are InterpolatedUnivariateSpline and LSQUnivariateSpline on which errors checking is going to change. In case a 2D spline is wanted, the BivariateSpline class family is provided. All those classes for 1D and 2D splines use the FITPACK Fortran subroutines, that's why a lower library access is available through the splrep and splev functions for respectively representing and evaluating a spline. Moreover interpolation functions without the use of FITPACK parameters are also provided for simpler use (see interp1d, interp2d, barycentric_interpolate and so on).

For the Sprogø maxima wind speeds, the UnivariateSpline will be used because a spline of degree 3 seems to correctly fit the data :

from scipy.interpolate import UnivariateSpline
quantile_func = UnivariateSpline(cprob, sorted_max_speeds)

The quantile function is now going to be evaluated from the full range of probabilities :

nprob = np.linspace(0, 1, 1e2)
fitted_max_speeds = quantile_func(nprob)

In the current model, the maximum wind speed occurring every 50 years is defined as the upper 2% quantile. As a result, the cumulative probability value will be :

fifty_prob = 1. - 0.02

So the storm wind speed occurring every 50 years can be guessed by :

fifty_wind = quantile_func(fifty_prob)

The results are now gathered on a Matplotlib figure :

Image non disponible

Solution :

I-E-11-a-iv. Exercise with the Gumbell distribution

The interested readers are now invited to make an exercise by using the wind speeds measured over 21 years. The measurement period is around 90 minutes (the original period was around 10 minutes but the file size has been reduced for making the exercise setup easier). The data are stored in numpy format inside the file examples/sprog-windspeeds.npy. Do not look at the source code for the plots until you have completed the exercise.

  • The first step will be to find the annual maxima by using numpy and plot them as a matplotlib bar figure.
Image non disponible

Solution :

  • The second step will be to use the Gumbell distribution on cumulative probabilities p_i defined as -log( -log(p_i) ) for fitting a linear quantile function (remember that you can define the degree of the UnivariateSpline). Plotting the annual maxima versus the Gumbell distribution should give you the following figure.
Image non disponible

Solution :

  • The last step will be to find 34.23 m/s for the maximum wind speed occurring every 50 years.

I-E-11-b. Non linear least squares curve fitting: application to point extraction in topographical lidar data

The goal of this exercise is to fit a model to some data. The data used in this tutorial are lidar data and are described in details in the following introductory paragraph. If you're impatient and want to practice now, please skip it and go directly to Loading and visualization.

I-E-11-b-i. Introduction

Lidars systems are optical rangefinders that analyze property of scattered light to measure distances. Most of them emit a short light impulsion towards a target and record the reflected signal. This signal is then processed to extract the distance between the lidar system and the target.

Topographical lidar systems are such systems embedded in airborne platforms. They measure distances between the platform and the Earth, so as to deliver information on the Earth's topography (see(1) for more details).

In this tutorial, the goal is to analyze the waveform recorded by the lidar system(2). Such a signal contains peaks whose center and amplitude permit to compute the position and some characteristics of the hit target. When the footprint of the laser beam is around 1m on the Earth surface, the beam can hit multiple targets during the two-way propagation (for example the ground and the top of a tree or building). The sum of the contributions of each target hit by the laser beam then produces a complex signal with multiple peaks, each one containing information about one target.

One state of the art method to extract information from these data is to decompose them in a sum of Gaussian functions where each function represents the contribution of a target hit by the laser beam.

Therefore, we use the scipy.optimize module to fit a waveform to one or a sum of Gaussian functions.

I-E-11-b-ii. Loading and visualization

Load the first waveform using :

import numpy as np
waveform_1 = np.load('data/waveform_1.npy')

and visualize it :

import matplotlib.pyplot as plt
t = np.arange(len(waveform_1))
plt.plot(t, waveform_1) 
Image non disponible

As you can notice, this waveform is a 80-bin-length signal with a single peak.

I-E-11-b-iii. Fitting a waveform with a simple Gaussian model

The signal is very simple and can be modeled as a single Gaussian function and an offset corresponding to the background noise. To fit the signal with the function, we must :

  • define the model
  • propose an initial solution
  • call scipy.optimize.leastsq
I-E-11-b-iii-I. Model

A Gaussian function defined by

kitxmlcodelatexdvpB + A \exp\left\{-\left(\frac{t-\mu}{\sigma}\right)^2\right\}finkitxmlcodelatexdvp

can be defined in python by:

def model(t, coeffs):
...    return coeffs[0] + coeffs[1] * np.exp( - ((t-coeffs[2])/coeffs[3])**2 )

Where :

  • coeffs[0] is kitxmlcodeinlinelatexdvpBfinkitxmlcodeinlinelatexdvp (noise)
  • coeffs[1] is kitxmlcodeinlinelatexdvpAfinkitxmlcodeinlinelatexdvp (amplitude)
  • coeffs[2] is kitxmlcodeinlinelatexdvp\mufinkitxmlcodeinlinelatexdvp (center)
  • coeffs[3] is kitxmlcodeinlinelatexdvp\sigmafinkitxmlcodeinlinelatexdvp (width)
I-E-11-b-iii-II. Initial solution

An approximative initial solution that we can find from looking at the graph is for instance :

x0 = np.array([3, 30, 15, 1], dtype=float)
I-E-11-b-iii-III. Fit

scipy.optimize.leastsq minimizes the sum of squares of the function given as an argument. Basically, the function to minimize is the residuals (the difference between the data and the model) :

def residuals(coeffs, y, t):
...     return y - model(t, coeffs)

So let's get our solution by calling scipy.optimize.leastsq with the following arguments :

  • the function to minimize
  • an initial solution
  • the additional arguments to pass to the function
from scipy.optimize import leastsq
x, flag = leastsq(residuals, x0, args=(waveform_1, t))
print x
[  2.70363341  27.82020742  15.47924562   3.05636228]

And visualize the solution :

plt.plot(t, waveform_1, t, model(t, x)) 
plt.legend(['waveform', 'model']) 

Remark : from scipy v0.8 and above, you should rather use scipy.optimize.curve_fit which takes the model and the data as arguments, so you don't need to define the residuals any more.

I-E-11-b-iii-IV. Going further
  • Try with a more complex waveform (for instance data/waveform_2.npy) that contains three significant peaks. You must adapt the model which is now a sum of Gaussian functions instead of only one Gaussian peak.

    Image non disponible
  • In some cases, writing an explicit function to compute the Jacobian is faster than letting leastsq estimate it numerically. Create a function to compute the Jacobian of the residuals and use it as an input for leastsq.

  • When we want to detect very small peaks in the signal, or when the initial guess is too far from a good solution, the result given by the algorithm is often not satisfying. Adding constraints to the parameters of the model enables to overcome such limitations. An example of a priori knowledge we can add is the sign of our variables (which are all positive).

With the following initial solution:

x0 = np.array([3, 50, 20, 1], dtype=float)

compare the result of scipy.optimize.leastsq and what you can get with scipy.optimize.fmin_slsqp when adding boundary constraints.

I-E-11-c. Image processing application : counting bubbles and unmolten grains

Image non disponible
I-E-11-c-i. Statement of the problem

1. Open the image file MV_HFV_012.jpg and display it. Browse through the keyword arguments in the docstring of imshow to display the image with the “right” orientation (origin in the bottom left corner, and not the upper left corner as for standard arrays).

This Scanning Element Microscopy image shows a glass sample (light gray matrix) with some bubbles (on black) and unmolten sand grains (dark gray). We wish to determine the fraction of the sample covered by these three phases, and to estimate the typical size of sand grains and bubbles, their sizes, etc.

2. Crop the image to remove the lower panel with measure information.

3. Sli