Saikat Kumar Dey

Jan 13, 2018

3 min read

Write better Python using generators


Let’s say that we have a text file student_records.txt. Each line contains a student’s name and age, separated by a colon. We want to parse the names and ages.


Our first approach would be to read the file and store the values in a list/dictionary.

def read_student_records(path):
    records = []
    with open(path) as file:
        for line in file:
            # strip the trailing newline before splitting on ":"
            name, age = line.strip().split(":")
            records.append((name, age))
    return records

This gets the job done. However, the function read_student_records() is messy. What if we have to read from multiple files?

Let’s clean up our code above and make it more modular.

def lines_from_file(path):
    lines = []
    with open(path) as file:
        for line in file:
            lines.append(line)
    return lines
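The listing above appears truncated; the remaining helpers that the bullets below describe might look like this (a sketch reconstructed from their descriptions, restating lines_from_file so it runs on its own; the #1–#3 markers are referenced later):

```python
def lines_from_file(path):
    lines = []
    with open(path) as file:
        for line in file:
            lines.append(line)
    return lines

def student_records(lines):
    records = []
    for line in lines:
        # strip the trailing newline before splitting on ":"
        name, age = line.strip().split(":")
        records.append((name, age))
    return records

def student_records_from_file(path):
    lines = lines_from_file(path)   # #1 blocks until all lines are read
    return student_records(lines)   # #2 blocks while parsing each record

def student_records_from_files(paths):
    records = []
    for path in paths:
        # #3 blocks until one file is read completely before moving on
        records.extend(student_records_from_file(path))
    return records
```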

The above functions are pretty self-explanatory.

  • lines_from_file function reads a file line by line, stores it in a list and returns it.
  • student_records reads in a list of lines, extracts name & age from each line and stores it in a list.
  • student_records_from_file combines lines_from_file and student_records.
  • student_records_from_files calls student_records_from_file , stores the records and returns it.

Does everything seem alright? Can you spot the issues in the code above? I’ll wait.

In student_records_from_file and student_records_from_files:

  • #1 blocks until all the lines are read.
  • #2 blocks while parsing each record.
  • #3 blocks until all records are read from one file before moving on to the next.

In each of these functions, we store all the data in memory, only to throw it away after returning it to the caller.

What if some of the files have 1 billion records or more?

We don’t need to store all of the lines in memory before parsing them. It is desirable to iterate over each line and parse them, one line in memory at a time.

Generators help you do just that. They are a very useful tool for creating iterators. Generators differ from regular Python functions in that they yield values instead of returning them.

An iterator produces values in a sequence, one at a time. Calling next() on an iterator returns the next value in the sequence.
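For example, iter() turns a list into an iterator, and each next() call pulls one value:

```python
numbers = iter([1, 2, 3])
print(next(numbers))  # 1
print(next(numbers))  # 2
print(next(numbers))  # 3
# another next(numbers) would raise StopIteration: the sequence is exhausted
```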

Now, back to generators. There’s a difference between a generator function and a generator object. When you define a function that yields instead of returning, it’s a generator function. When you call a generator function, a generator object is created. No execution happens until next() is called on the generator object.
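You can see this laziness with a toy generator (squares is just an illustrative name, not from the article):

```python
def squares(n):
    print("generator running")  # executes only on the first next() call
    for i in range(n):
        yield i * i

gen = squares(3)   # just creates a generator object; nothing printed yet
print(type(gen))   # <class 'generator'>
print(next(gen))   # now "generator running" prints, followed by 0
print(next(gen))   # 1
```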

Let’s re-write our code using generators now.

def lines_from_file(path):
    with open(path) as file:
        for line in file:
            yield line
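The rest of the pipeline can become generators in the same way (a sketch mirroring the structure of the list-based version; lines_from_file is repeated so the snippet runs on its own, and #4 marks the line discussed next):

```python
def lines_from_file(path):
    with open(path) as file:
        for line in file:
            yield line  # one line in memory at a time

def student_records(lines):
    for line in lines:
        name, age = line.strip().split(":")
        yield (name, age)  # one parsed record at a time

def student_records_from_file(path):
    yield from student_records(lines_from_file(path))

def student_records_from_files(paths):
    for path in paths:
        yield from student_records_from_file(path)  # #4
```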

yield from is a shortcut for yielding data from another generator object:

#4 can also be written as

for record in student_records_from_file(file):
    yield record

Now you can iterate over student records one by one and do something with it.

for record in student_records_from_files(filenames):
    print(record)  # do something with record

This leads to a simple, modular design. Each function is responsible for one task, which makes it reusable. It also has a low memory footprint, since only one student record is in memory at a time.
