| Date: | 2020-05-26 |
|---|---|
| Author: | Gábor Nyers |
| tags: | python |
| category: | python_workshop |
| summary: | The Iterator protocol and its practical uses |
| licence: | CC BY-NC 4.0 https://creativecommons.org/licenses/by-nc/4.0/ |
Contents:
The next Python Tuesday session is all about how iterators help us to solve many data-processing related problems. The term 'iterator' may sound abstract but there is a practical benefit and they everywhere in Python3. For example the functions like sum(), min(), max() or the much more powerful sorted(), filter() and map() functions.
- Introduction (what is the "Iterator Protocol" and do we care?)
- A bit of theory: Concepts
- Tools and data types
- Use cases
- As a Python programmer you rely on the Iterator protocol every time that
you:
- process the elements of a container data type in a
forloop, e.g.:list,dict - sort the elements of a container data type
- use any of the following functions:
min(),max(),sum(),map(),sorted(),filter()
- process the elements of a container data type in a
- The working of the above facilities is possible because the involved components work in concert according the "Itereator Protocol"
Fundamental task: process every element of a container, e.g.:
>>> s = 'Beautiful is better than ugly.' # str, with 30 elements >>> l = [ 1, 2, 92, -32, 'python', 3.1415 ] # list, 6 elements of 3 types >>> t = ( 3, 9, 'cheese shop', False ) # tuple >>> d = { 'alice': 32, 'bob': 42 } # dict >>> s = { 3, 3, 3, 4, -1, True, 94 } # set >>> b = b'\xe2\x82\xac1 = \xc6\x922.20' # bytes ('€1 = ƒ2.20') >>> ab = Addressbook( Person('Alice'), \ # an custom `Addressbook` ... Person('Bob'), \ # class, containing ... Person('Carol') ) # a collection of `Persons`Challenges:
- The internal implementations of these data structures are different. HOW TO make sure that we can loop through all of them?
- New collection classes will have yet different implementation. HOW TO make sure that existing code can work with these new data types?
Pythonis a high-level programming language, HOW TO make sure that there is a uniform and simple way to loop through these collections, without a lot of "boilerplate code".
Solution: The "Iterator Protocol", i.e.: a few simple rules that govern the "conversation" between components to get all elements of a collection.
Example: the code
for e in [1,2,3,4]: print(e)involves the concerted "effort" of 3 different parties:- the
forloop, - an (implicit) "Iterator", and
- the "Iterable" (of
list)
- the
A general concept implemented in many modern programming languages. The most fundamental steps:
Each container data type (the Iterable, e.g.
list[1,2,3,4]) must implement its own "Iterator", which can retrieve all elements of the Iterable objectThe Iterator must be produced upon "request", i.e. in Python the return value of the
iter()function:>>> myiterator = iter([1,2,3,4])
The component (e.g.
forloop), which wants to access the elements of the Iterable ([1,2,3,4]) will use the Iterator (myiterator) to access the elements. InPythonthe component will invoke thenext()function to receive the next unseen element:>>> next_elem = next(myiterator)
The component will repeat the
next()call, until the Iterator (myiterator) signals that there is no element left. InPythonthis signalling occurs by the iterator raising theStopIterationexception:>>> next(myiterator) 2 # the next element >>> next(myiterator) 3 # the next element >>> next(myiterator) 4 # the next and last(!) element >>> next(myiterator) # nothing more left! Traceback (most recent call last): File "<stdin>", line 1, in <module> StopIteration
See also: Python 3 documentation on an Iterator
manually driving the loop, the programmer walks through the elements of the
strIterable using astr_iteratorIterator:>>> s = 'Beautiful is better than ugly.' # Iterable: str >>> str_i = iter(s) # Iterator: str_i >>> type(str_i) <class 'str_iterator'> >>> next(str_i) # "Next element, please!" 'B' >>> next(str_i) 'e' >>> next(str_i) # ... repeat until done. 'a'
listIterable provides thelist_iteratorIterator to loop through the content onenext()call at a time:>>> l = [ 1, 2, 92, -32, 'python', 3.1415 ] >>> myiterator_l = iter(l) >>> type(myiterator_l) <class 'list_iterator'> # an iterator specific # for the `list` data type >>> next(myiterator_l) # same behavior 1the
sum()function drives an implicit loop to sum up thedictobject's keys using adict_keyiteratorIterator:>>> sum({ 3: 'three', 1: 'one', 10: 'ten'}) 14 >>> iter({ 3: 'three', 1: 'one', 10: 'ten'}) <dict_keyiterator object at 0x7f69e460af48> >>> s = { 3, 3, 3, 4, -1, True, 94 } # setthe
min()function drives an implicit loop to find the minimal value of thesetobject using a :>>> s = set('Python') # convert a `str` to `set` >>> s {'t', 'n', 'o', 'h', 'P', 'y'} # can you guess why scrambled? >>> min(s) # give me the "smallest" element 'P' >>> type( iter( {'t', 'n', 'o', 'h', 'P', 'y'} ) ) <class 'set_iterator'>in
Pythontheiter()function calls the Iterable's.__iter__()method to get the Iterator object:>>> t = (1,2,3,4) >>> tuple_i = t.__iter__() >>> type(tuple_i) <class 'tuple_iterator'> >>> next(tuple_i) 1 >>> next(tuple_i) 2
the same Iterable can be looped through using multiple Iterators at the same time. Each Iterator stores its own position independently so they do not interfere with each other:
>>> s = 'Explicit is better than implicit.' # str Iterable >>> str_i1 = iter(s) # 1st Iterator >>> str_i2 = iter(s) # 2nd Iterator >>> str_i1, str_i2 (<str_iterator object at 0x7f69e460ce80>, <str_iterator object at 0x7f69e460cf28>) >>> next(str_i2) # 2nd Iterator first 'E' >>> next(str_i2) # str_i2 gives 2nd element 'x' >>> next(str_i1) # str_i1: 1st element 'E' >>> next(str_i2) # we switch arbitralily 'p' >>> next(str_i1) # the Iterators keep track 'x' # of their position >>> next(str_i1) 'p' >>> next(str_i1) 'l' >>> next(str_i2) 'l'
A Generator:
represents a collection of objects, which are (usually) not in the
Pythonprocess' memory, but are generated with an expression or function.Example: the numerical sequence of the Fibonacci numbers (the "hello world" of generator examples ;-)
This is an collection of
intobjects, which can be generated using an expression:f_next = f_last + f_2ndlastis an Iterator, i.e.: upon request of the
next()it will produce the next element.
Generators can be created by either - a generator function, or - a generator expression
Generator function: any function with the yield keyword in it:
>>> def fibonacci(n): # doctest: +ELLIPSIS ... a, b = 0, 1 ... i = 0 ... while i < n: ... yield b ... a, b = b, a+b ... i += 1 >>> g = fibonacci(12) # generator is create, but not started >>> type(g) <class 'generator'> >>> next(g) # produce the next value() 1 >>> g.__next__() # next() invokes the .__next__() 1 # magic method of the generator >>> next(g) 3 >>> list(fibonacci(12)) # force the generator to produce [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144]
A generator expression: similar syntax as any comprehension, but with
round braces: ( ):
>>> l = [3, 4, 5, 6, 7, 11] # some data >>> g = ( e**2 for e in l ) # a generator expression >>> type(g) <class 'generator'> >>> next(g) # like any other Iterator... 9 >>> next(g) 16 >>> next(g) 25
A nice summary of the above concepts (inspired by this article)
- keyword
for: can loop through the elements of any Iterable - operator
in: checks if an object is an element of a collection
str,list,tuple,bytes(sequence types) etc...dict,set(mapping types)file, i.e.: the return value ofopen()- custom classes, which implement the
.__iter__()method
sorted(): sort an Interable, optionally with a specific sorting functionreversed(): reverse the order of the elements of an Iterablemin(),max(): return the minimal or maximal value offilter()map()itemgetter():from operator import itemgetter firstelem = itemgetter(0) l = [ [a,b,c] for a in 'abc' for b in '12345' for c in 'ATGC' ] firstelems = map(firstelem; l)
range():enumerate():
- Challenge
Sort a list with the days of the week (list of strings) in the correct order.
>>> days = 'Mon Sun Tue Fri Sat Sun Mon Tue Mon Wed Sat Thu Fri'.split() >>> days ['Mon', 'Sun', 'Tue', 'Fri', 'Sat', 'Sun', 'Mon', 'Tue', 'Mon', 'Wed', 'Sat', 'Thu', 'Fri']
- Problem
By default
sorted()function will sort strings in alphabetical order (lexicographical order).>>> sorted(days) ['Fri', 'Fri', 'Mon', 'Mon', 'Mon', 'Sat', 'Sat', 'Sun', 'Sun', 'Thu', 'Tue', 'Tue', 'Wed']
- Solution
Use a sort function which will define the order of the elements:
>>> def day_sorter(day): ... # the desired order of the elements ... order = 'Mon Tue Wed Thu Fri Sat Sun'.split() ... # return the position of the current element in the `order` list ... return order.index(day) ... >>> sorted(days, key=day_sorter) ['Mon', 'Mon', 'Mon', 'Tue', 'Tue', 'Wed', 'Thu', 'Fri', 'Fri', 'Sat', 'Sat', 'Sun', 'Sun']
- Bonus
The same sorting function will also work with
min()andmax()>>> min(days) Fri # No! >>> min(days, key=day_sorter) Mon # Yes!
- Challenge
How many males and females are in the following
dict?persons = [ {'name': 'Lucy', 'age': 14, 'gender': 'f'}, {'name': 'Andrej', 'age': 34, 'gender': 'm'}, {'name': 'Mark', 'age': 17, 'gender': 'm'}, {'name': 'Thomas', 'age': 44, 'gender': 'm'}, {'name': 'Evi', 'age': 25, 'gender': 'f'}, {'name': 'Robert', 'age': 23, 'gender': 'm'}, {'name': 'Dragomir', 'age': 54, 'gender': 'm'}, {'name': 'Jenny', 'age': 34, 'gender': 'f'}, {'name': 'Eline', 'age': 29, 'gender': 'f'}, ]- Solution
Count the number of values of the
genderattribute using thecollections.Counterclass. But:Counterneeds the to be counted values in a sequence-like Iterable, e.g.:>>> from collections import Counter >>> Counter('abracadabra') Counter({'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1})However the data is in a
dict! So let's extract the required data with a smalllambdafunction:>>> gender_data_iterator = map(lambda v: v['gender'], persons) >>> Counter(gender_data_iterator) Counter({'m': 5, 'f': 4})- Bonus
What if the data is not clean?
persons2 = [ {'name': 'Lucy', 'age': 14, 'gender': 'f'}, {'name': 'Andrej', 'age': 34, 'gender': 'm'}, {'name': 'Mark', 'age': 17, 'gender': 'M'}, {'name': 'Thomas', 'age': 44, 'gender': 'M'}, {'name': 'Evi', 'age': 25, 'gender': 'f'}, {'name': 'Robert', 'age': 23, 'gender': 'M'}, {'name': 'Dragomir', 'age': 54, 'gender': 'M'}, {'name': 'Jenny', 'age': 34, 'gender': 'F'}, {'name': 'Eline', 'age': 29, 'gender': 'F'}, ]In this case the previous solution clearly would give the wrong answer:
>>> Counter( map(lambda v: v['gender'], persons2) ) Counter({'M': 4, 'F': 2, 'f': 1, 'm': 1, 'v': 1})So, let's lower-case the values before counting:
lambda v: v['gender'].lower():>>> gender_data_iterator2 = map(lambda v: v['gender'].lower(), persons2) >>> Counter(gender_data_iterator2) Counter({'m': 5, 'f': 4})To see how this works, let's examine just the
lambdafunction.- The raw data record: ::
>>> persons2[3] {'name': 'Thomas', 'age': 44, 'gender': 'M'}
When we apply the
lambdafunction to the raw data:>>> f = lambda v: v['gender'].lower() >>> f(persons2[3]) 'm'
- Problem
- Given the
persons(see)dictof the previous example, sort the persons into age groups of decades, that is: 0-9, 10-19, 20-29, 30-39 etc... - Analysis
Let's define the desired output of our program as a
dict, where:the keys: are the age buckets, expressed by a
rangeobject, e.g.:range(10). This corresponds with the ages of 0-9.values: are
list, which contain the persons, who fall in the age bucket, e.g.:{ range(10, 20): [{'name': 'Lucy', 'age': 14, 'gender': 'f'}, {'name': 'Mark', 'age': 17, 'gender': 'm'}], range(20, 30): [{'name': 'Evi', 'age': 25, 'gender': 'f'}, {'name': 'Robert', 'age': 23, 'gender': 'm'}, {'name': 'Eline', 'age': 29, 'gender': 'f'}], range(30, 40): [{'name': 'Andrej', 'age': 34, 'gender': 'm'}, {'name': 'Jenny', 'age': 34, 'gender': 'f'}] }
- Solution
We'll need a
list(ortuple), which contains the differentrangeobjects, against which the program will examine a persondict, e.g.:categories = (range(9), range(10,20), range(20, 30), range(30, 40))
Also needed is an empty
dict, which will store the result:res = {}Finally, a nested loop will walk through the
personsdictand match the age of the currentpersonagainst eachrangeobject:for r in categories: for p in persons: if p['age'] in r: res.setdefault(r, []).append(p)The line code:res.setdefault(r, []).append(p) is perhaps the most intriguing here. Let's break this down:
.setdefault()method will return one of the following values:- the value of the key
r(i.e.: alist), ifris an existing key inres, OR - add the key
rwith an emptylistas value toresAND return this emptylistobject, ifrwas not yet a key
- the value of the key
- in either of the above cases, the expression
res.setdefault(r, [])will return alist, to which we append the current persondictas a new element.
- Bonus
This algorithm will accept any arbitrary age buckets, even if they overlap. Observe the extended
categories, where we added the age groups representing: elementary school children, high-school children, adults and retirees:categories = (range(9), range(10,20), range(20, 30), range(30, 40), range(6, 15), range(15, 19), range(19, 67), range(67, 120)) def categorize(persons, categories): res = {} for r in categories: for p in persons: if p['age'] in r: res.setdefault(r, []).append(p) return res print(categorize(persons, categories))The result is:
{range(10, 20): [{'name': 'Lucy', 'age': 14, 'gender': 'f'}, {'name': 'Mark', 'age': 17, 'gender': 'm'}], range(20, 30): [{'name': 'Evi', 'age': 25, 'gender': 'f'}, {'name': 'Robert', 'age': 23, 'gender': 'm'}, {'name': 'Eline', 'age': 29, 'gender': 'f'}], range(30, 40): [{'name': 'Andrej', 'age': 34, 'gender': 'm'}, {'name': 'Jenny', 'age': 34, 'gender': 'f'}], range( 6, 15): [{'name': 'Lucy', 'age': 14, 'gender': 'f'}], range(15, 19): [{'name': 'Mark', 'age': 17, 'gender': 'm'}], range(19, 67): [{'name': 'Andrej', 'age': 34, 'gender': 'm'}, {'name': 'Thomas', 'age': 44, 'gender': 'm'}, {'name': 'Evi', 'age': 25, 'gender': 'f'}, {'name': 'Robert', 'age': 23, 'gender': 'm'}, {'name': 'Dragomir', 'age': 54, 'gender': 'm'}, {'name': 'Jenny', 'age': 34, 'gender': 'f'}, {'name': 'Eline', 'age': 29, 'gender': 'f'}] }
- Problem
- Create the
Addressbookclass, which is a collection ofPersoninstances. Make sure that theAddressbookinstances are Iterable. - Solution
Let's use the recently introduced
dataclassesmodule to create the classes.from dataclasses import dataclass, field from typing import List @dataclass class Person: fname: str = '' sname: str = '' gender: str = '' email: str = '' @dataclass class Addressbook: name: str = 'My Addressbook' _items: List[Person] = field(default_factory=list, init=False)At this point the
Addressbookinstances can hold items, but it is not yet an Iterable:>>> ab = Addressbook() >>> ab Addressbook(name='My Addressbook', _items=[]) >>> ab2._items += [ 'Jenny', 'Robert', 'Alice' ] >>> ab2 Addressbook(name='My Addressbook', _items=['Jenny', 'Robert', 'Alice'])
However it is not yet an Iterable:
>>> list(ab2) Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: 'Addressbook' object is not iterable
In
Pythona class is only Iterable if it implements the.__iter__()method, which provides an Iterator instance. So let's do it:@dataclass class Addressbook: name: str = 'My Addressbook' _items: List[Person] = field(default_factory=list, init=False) def __iter__(self): return iter(self._items)It is now working:
>>> ab2 = Addressbook() >>> ab2._items += [ 'Jenny', 'Robert', 'Alice' ] >>> ab2 Addressbook(name='My Addressbook', _items=['Jenny', 'Robert', 'Alice']) >>> list(ab2) # convert Addressbook -> list ['Jenny', 'Robert', 'Alice']
While at it why don't we add a couple of other nice features, such as:
- implement the
.add()method, which will add an item to the address book - implement the
.__len__()method, so that thelen()function is able to show the number of elements in the collection.
@dataclass class Addressbook: name: str = 'My Addressbook' _items: List[Person] = field(default_factory=list, init=False) def __iter__(self): return iter(self._items) def add(self, person): self._items.append(person) def __len__(self): return len(self._items)Try out the result:
>>> fred = Person(fname='Fred', sname='Flintstone', gender='m', email='fred@bedrock.place') >>> wilma = Person(fname='Wilma', sname='Flintstone', gender='f', email='wilma@bedrock.place') >>> ab = Addressbook(name='The Flintstones') >>> ab.add(fred) >>> ab.add(wilma) >>> print(f'Number of entries in addressbook: {len(ab)}') 2- implement the
- Bonus
By implementing the
.__getitem__()magic method on thePersonclass, we even can use the previous solution to count:@dataclass class Person: fname: str = '' sname: str = '' gender: str = '' email: str = '' def __getitem__(self, item): res = getattr(self, item) return res @dataclass class Addressbook: name: str = 'My Addressbook' _items: List[Person] = field(default_factory=list, init=False) def __iter__(self): return iter(self._items) def add(self, person): self._items.append(person) def __len__(self): return len(self._items) fred = Person(fname='Fred', sname='Flintstone', gender='m', email='fred@bedrock.place') wilma = Person(fname='Wilma', sname='Flintstone', gender='f', email='wilma@bedrock.place') ab = Addressbook(name='The Flintstones') ab.add(fred) ab.add(wilma)Finally let's try how our new classes fit in our data-processing toolkit so far :
gender_data_iterator = map(lambda v: v['gender'], ab) >>> Counter(gender_data_iterator) Counter({'m': 1, 'f': 1})
