# Item 17: Be Defensive When Iterating Over Arguments

Sat 03 January 2015

The following is a sample from the new book Effective Python.

When a function takes a list of objects as a parameter, it’s often important to iterate over that list multiple times. For example, say you want to analyze tourism numbers for the U.S. State of Texas. Imagine the data set is the number of visitors to each city (in millions per year). You’d like to figure out what percentage of overall tourism each city receives.

To do this you need a normalization function. It sums the inputs to figure out the total number of tourists per year. Then it divides each city’s individual visitor count by the total to find that city’s contribution to the whole.

```def normalize(numbers):
total = sum(numbers)
result = []
for value in numbers:
percent = 100 * value / total
result.append(percent)
return result
```

This function works when given a list of visits.

```visits = [15, 35, 80]
percentages = normalize(visits)
print(percentages)
```
```>>>
[11.538461538461538, 26.923076923076923, 61.53846153846154]
```

To scale this up, I need to read the data from a file that contains every city in all of Texas. I define a generator to do this because then I can reuse the same function later when I want to compute tourism numbers for the whole world, a much larger data set.

```def read_visits(data_path):
with open(data_path) as f:
for line in f:
yield int(line)
```

Surprisingly, calling `normalize` on the generator’s return value produces no results.

```it = read_visits('/tmp/my_numbers.txt')
percentages = normalize(it)
print(percentages)
```
```>>>
[]
```

The cause of this behavior is that an iterator only produces its results a single time. If you iterate over an iterator or generator that has already raised a `StopIteration` exception, you won’t get any results the second time around.

```it = read_visits('/tmp/my_numbers.txt')
print(list(it))
print(list(it))  # Already exhausted
```
```>>>
[15, 35, 80]
[]
```

What’s confusing is that you also won’t get any errors when you iterate over an already-exhausted iterator. `for` loops, the `list` constructor, and many other functions throughout the Python standard library expect the `StopIteration` exception to be raised during normal operation. These functions can’t tell the difference between an iterator that has no output and an iterator that had output and is now exhausted.

To solve this problem, you can explicitly exhaust an input iterator and keep a copy of its entire contents in a list. You can then iterate over the list version of the data as many times as you need to. Here’s the same function as before, but it defensively copies the input iterator.

```def normalize_copy(numbers):
numbers = list(numbers)  # Copy the iterator
total = sum(numbers)
result = []
for value in numbers:
percent = 100 * value / total
result.append(percent)
return result
```

Now the function works correctly on a generator’s return value.

```it = read_visits('/tmp/my_numbers.txt')
percentages = normalize_copy(it)
print(percentages)
```
```>>>
[11.538461538461538, 26.923076923076923, 61.53846153846154]
```

The problem with this approach is the copy of the input iterator’s contents could be large. Copying the iterator could cause your program to run out of memory and crash. One way around this is to accept a function that returns a new iterator each time it’s called.

```def normalize_func(get_iter):
total = sum(get_iter())   # New iterator
result = []
for value in get_iter():  # New iterator
percent = 100 * value / total
result.append(percent)
return result
```

To use `normalize_func`, you can pass in a `lambda` expression that calls the generator and produces a new iterator each time.

```percentages = normalize_func(lambda: read_visits(path))
```

Though it works, having to pass a lambda function like this is clumsy. The better way to achieve the same result is to provide a new container class that implements the iterator protocol.

The iterator protocol is how Python `for` loops and related expressions traverse the contents of a container type. When Python sees a statement like `for x in foo` it will actually call `iter(foo)`. The `iter` built-in function calls the `foo.__iter__` special method in turn. The `__iter__` method must return an iterator object (which itself implements the `__next__` special method). Then the `for` loop repeatedly calls the `next` built-in function on the iterator object until it’s exhausted (and raises a `StopIteration` exception).

It sounds complicated, but practically speaking you can achieve all of this behavior for your classes by implementing the `__iter__` method as a generator. Here I define an iterable container class that reads the files containing tourism data.

```class ReadVisits(object):
def __init__(self, data_path):
self.data_path = data_path

def __iter__(self):
with open(self.data_path) as f:
for line in f:
yield int(line)
```

This new container type works correctly when passed to the original function without any modifications.

```visits = ReadVisits(path)
percentages = normalize(visits)
print(percentages)
```
```>>>
[11.538461538461538, 26.923076923076923, 61.53846153846154]
```

This works because the `sum` method in `normalize` will call `ReadVisits.__iter__` to allocate a new iterator object. The `for` loop to normalize the numbers will also call `__iter__` to allocate a second iterator object. Each of those iterators will be advanced and exhausted independently, ensuring that each unique iteration sees all of the input data values. The only downside of this approach is it reads the input data multiple times.

Now that you know how containers like `ReadVisits` work, you can write your functions to ensure that parameters aren’t just iterators. The protocol states that when an iterator is passed to the `iter` built-in function, `iter` will return the iterator itself. In contrast, when a container type is passed to `iter`, a new iterator object will be returned each time. Thus, you can test an input value for this behavior and raise a `TypeError` to reject iterators.

```def normalize_defensive(numbers):
if iter(numbers) is iter(numbers):  # An iterator -- bad!
raise TypeError('Must supply a container')
total = sum(numbers)
result = []
for value in numbers:
percent = 100 * value / total
result.append(percent)
return result
```

This is ideal if you you don’t want to copy the full input iterator like `normalize_copy` above, but you also need to iterate over the input data multiple times. This function works as expected for `list` and `ReadVisits` inputs because they are containers. It will work for any type of container that follows the iterator protocol.

```visits = [15, 35, 80]
normalize_defensive(visits)  # No error
normalize_defensive(visits)  # No error
```

The function will raise an exception if the input is iterable but not a container.

```it = iter(visits)
normalize_defensive(it)
```
```>>>
TypeError: Must supply a container
```

### Things to Remember

• Beware of functions that iterate over input arguments multiple times. If these arguments are iterators you may see strange behavior and missing values.
• Python’s iterator protocol defines how containers and iterators interact with the `iter` and `next` built-in functions, `for` loops, and related expressions.
• You can easily define your own iterable container type by implementing the `__iter__` method as a generator.
• You can detect that a value is an iterator (instead of a container) if calling `iter` on it twice produces the same result, which can then be progress with the `next` built-in function.

## If you enjoyed this, get the book

Effective Python provides insight into the Pythonic way of writing programs: the best way to use Python. Novice programmers will learn the best practices of Python’s capabilities. Experienced programmers will learn how to embrace the strangeness of a new tool with confidence. Items include advice on what to do, what to avoid, how to strike the right balance, and why this is the best choice.