Write better Python using generators

Saikat Kumar Dey
Jan 13, 2018

Let’s say that we have a text file, student_records.txt. Each line contains a student’s name and age, separated by a colon. We want to parse the names and ages.

# student_records.txt
greg:15
matthew:14
ram:16
raju:14

Our first approach would be to read the file and store the values in a list or dictionary.

def read_student_records(path):
    records = []
    with open(path) as file:
        for line in file:
            # strip the trailing newline before splitting "name:age"
            name, age = line.strip().split(":")
            records.append((name, age))
    return records

path = "student_records.txt"
for record in read_student_records(path):
    print(record)  # do something with record

This gets the job done. However, read_student_records() is messy: it both reads the file and parses each line, so neither step can be reused on its own. What if we have to read from multiple files?

Let’s clean up our code above and make it more modular.

def lines_from_file(path):
    lines = []
    with open(path) as file:
        for line in file:
            lines.append(line)
    return lines

def student_records(lines):
    records = []
    for line in lines:
        # strip the trailing newline before splitting "name:age"
        name, age = line.strip().split(":")
        records.append((name, age))
    return records

def student_records_from_file(path):
    lines = lines_from_file(path)  # 1
    records = student_records(lines)  # 2
    return records

# read from multiple files and do something with it
def student_records_from_files(filenames):
    records = []
    for filename in filenames:
        records_from_file = student_records_from_file(filename)  # 3
        records.extend(records_from_file)
    return records

filenames = ['r1.txt', 'r2.txt']
records = student_records_from_files(filenames)
for record in records:
    print(record)  # do something with record

The above functions are pretty self-explanatory.

  • lines_from_file reads a file line by line, stores the lines in a list and returns the list.
  • student_records takes a list of lines, extracts the name and age from each line and stores them in a list.
  • student_records_from_file combines lines_from_file and student_records.
  • student_records_from_files calls student_records_from_file for each file, collects the records and returns them.

Everything seems alright? Can you spot any issues in the above code? I’ll wait.

In student_records_from_file and student_records_from_files:

  • #1 blocks until all the lines are read.
  • #2 blocks until every record has been parsed.
  • #3 blocks until all the records from one file have been read before moving on to the next file.

Each of these functions builds an intermediate list, which is thrown away as soon as the calling function has consumed the returned values.

What if some of the files have 1 billion records or more?

We don’t need to store all of the lines in memory before parsing them. Instead, we can iterate over the lines and parse each one as it is read, keeping only one line in memory at a time.

Generators help you do just that. They are a very useful tool for creating iterators. Generators differ from regular Python functions in that they yield values instead of returning them.

An iterator produces the values of a sequence one at a time. Calling next() on an iterator returns the next value in the sequence.
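As a quick illustration of next(), here is a minimal sketch that steps through a plain list by hand. The names list is made up for this example; iter() and next() are Python built-ins.

```python
names = ["greg", "matthew", "ram"]  # sample data, not from a real file
it = iter(names)                    # ask the list for an iterator

first = next(it)    # "greg"
second = next(it)   # "matthew"
third = next(it)    # "ram"

# The iterator is now exhausted. next() would raise StopIteration here,
# unless we pass a default value to return instead.
done = next(it, None)
```

A for loop does the same thing behind the scenes: it calls next() repeatedly and stops when StopIteration is raised.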

Now, back to generators. There’s a difference between a generator function and a generator object. A function that uses yield instead of return is a generator function. Calling a generator function creates a generator object. None of the function body executes until next() is called on the generator object.
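To see that difference in action, here is a small sketch. count_up_to and the events list are hypothetical names used only for this illustration; the events list just records when the function body actually runs.

```python
import inspect

events = []  # records when the function body actually runs

def count_up_to(limit):
    events.append("started")
    for n in range(1, limit + 1):
        yield n
    events.append("finished")

gen = count_up_to(2)            # creates a generator object, runs nothing
ran_on_call = list(events)      # still empty at this point

first = next(gen)               # runs the body up to the first yield
ran_after_next = list(events)   # now contains "started"

rest = list(gen)                # consumes the remaining values, body finishes

is_gen_function = inspect.isgeneratorfunction(count_up_to)
is_gen_object = inspect.isgenerator(gen)
```

Between calls to next(), the generator is paused at its yield and remembers its local state, which is what lets it hand out one value at a time.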

Let’s re-write our code using generators now.

def lines_from_file(path):
    with open(path) as file:
        for line in file:
            yield line

def student_records(lines):
    for line in lines:
        # strip the trailing newline before splitting "name:age"
        name, age = line.strip().split(":")
        yield name, age

def student_records_from_file(path):
    lines = lines_from_file(path)
    yield from student_records(lines)

def student_records_from_files(filenames):
    for file in filenames:
        yield from student_records_from_file(file)  # 4

yield from is a shortcut for yielding every value from another generator object (or any other iterable):

# 4 can also be written as:
for record in student_records_from_file(file):
    yield record
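If you want to convince yourself that the two forms produce the same values, here is a minimal sketch. The function names and the batches data are made up for this example, and yield from needs Python 3.3 or later.

```python
def chain_with_yield_from(iterables):
    for iterable in iterables:
        yield from iterable

def chain_with_loop(iterables):
    for iterable in iterables:
        for item in iterable:
            yield item

# Sample records standing in for what two files might contain.
batches = [[("greg", "15")], [("ram", "16"), ("raju", "14")]]

from_yield_from = list(chain_with_yield_from(batches))
from_loop = list(chain_with_loop(batches))
```

For simply passing values along, as we do here, the two are interchangeable. yield from additionally forwards values sent into the generator with send(), which the plain loop does not, but we don't rely on that in this article.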

Now you can iterate over the student records one by one and do something with each of them.

for record in student_records_from_files(filenames):
    print(record)  # do something with record

This leads to a simple, modular design. Each function is responsible for one task, which makes it easy to reuse. It also has a low memory footprint, since only one student record is held in memory at a time.
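Here is a rough sketch of both points. It redefines student_records so the snippet runs on its own (stripping the newline before splitting is my assumption about the input format), and endless_lines is a hypothetical stand-in for a file far too large to load into a list.

```python
import itertools

def student_records(lines):
    for line in lines:
        name, age = line.strip().split(":")
        yield name, age

def endless_lines():
    # Hypothetical source that never ends; a list-based version
    # of student_records would never return when fed this.
    for i in itertools.count():
        yield f"student{i}:{10 + i % 10}\n"

# Reuse: the same parser works on any iterable of lines, not only files.
from_memory = list(student_records(["greg:15\n", "ram:16\n"]))

# Low memory: only the lines we actually ask for are ever produced.
first_three = list(itertools.islice(student_records(endless_lines()), 3))
```

Because each stage pulls one item at a time from the stage before it, taking three records produces and parses exactly three lines.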
