Have you ever had to process a large dataset and encountered the dreaded MemoryError in Python, or simply needed to limit the memory consumption of a large dataset? If the answer to that is yes, consider using a generator.
Generators provide a means of processing data without loading the entire data-source into memory first. This is especially useful when working with things like very large files in an efficient and pythonic way.
Generator's at their core are functions that return an iterator, meaning that the generator can be looped over. Unlike a regular iterator in Python however, the entire set of data is not loaded into memory up front.
Creating a Generator Function
Generator functions revolve around the
yield statement (more on that later). They are defined as a normal function, with the addition of the
yield statement in place of the familiar
In the below (contrived) example, the generator function
basic_generator is defined. This will return a generator that will yield integers up to 50,000,000 one at a time.
The generator returned by
basic_generator is assigned to the variable
gen which can then be iterated over like a normal iterator. The difference here is that the generator produces one value at a time rather than loading all 50,000,000 integers into memory up-front as a list would.
Reading Large Files
Likely the most common usage of generator functions is when reading large files, reading a file in Python it is common to read the entire contents of the file into memory at once. When dealing with large files, this can be problematic either by consuming excessive volumes of memory or in the worst-case scenario consuming all the available memory, raising a
MemoryError and ultimately crashing the application.
Note: When using a 32-bit version of Python, only 2GB of memory will be addressable by the Python process so no matter how much physical memory is available, a MemoryError will be raised as soon once 2GB has been consumed by Python.
Depending on the type and content of the file it is likely that you will want to read it in either line by line (common for a text file), or in chunks of bytes (common for a binary file).
open function in Python already lazily evaluates the file line by line so for reading a large file in one line at a time there is no requirement to implement a generator function. It can simply be achieved as in the following example.
Now in the case of a file that needs to be read in n sized chunks of bytes, the following generator function could be used. This will open the file and yield chunks of data equal to chunk_size bytes.
Another common and popular use of generator functions is to generate large or potentially infinite sequences of values.
Infinite Number Sequence
A generator function is a perfect candidate to solve the problem of generating an infinite sequence of numbers, as the state of the sequence generation can be neatly encapsulated within the one function definition.
A Fibonacci sequence in a memory-constrained environment is another great candidate to solve using a generator function as loading all of the values in the generated Fibonacci sequence into memory can be avoided.
In cases where a full generator function may be overly verbose for something like a simple sequence generator, there is the option to create a generator using a generator expression. The syntax is very similar to list comprehensions, the only difference being that
( ) round brackets are required for a generator expression, as opposed to the
[ ] brackets in a list comprehension.
If you are interested in learning about list comprehensions take a look at my post about them here.
Under The Hood
When the python interpreter encounters a
yield statement in a function, it knows that the function is a generator function. At this point, a special iterator object will be returned from the function and assigned to the target variable.
The generator object stores state using the internal
f_locals variables. Additionally, the object has a
next() method which invokes and runs the code inside the generator function up to the yield statement at which point the value yielded is returned, the current state is stored and the execution of the function suspended until another call to
next() is made.
The next method can either be called manually or implicitly within a loop construct.
When the generators values are exhausted a
StopIteration exception is raised to signal that the generator has been exhausted and no additional values can be yielded. This exception is handled automatically by a loop construct such as a for loop, however, if manually calling next on the generator this exception needs to be caught and handled manually.
Memory and Performance Metrics
Generators provide the most benefit when it comes to memory consumption, this is because of the lazy evaluation that they provide, preventing all values from being loaded into memory at one time.
In the following example, the size in bytes of a generator to generate a sequence of 50,000 integers is compared to a static list of 50, 000 integers. The generator consumes only 112 bytes of memory, compared to the statically generated list of integers which consumes 406,488 bytes of memory. In this particular example, the generator is ~3600x more memory efficient than the list.
The drawback of this memory, efficiency, however, is observed via runtime performance, using a generator as opposed to a statically generated list of values incurs a much higher number of function call's and consequently a longer runtime. Below a simple sum of integers up to 100,000 is profiled using cProfile. The generator expression results in 100,005 function calls in 123ms, compared to the list comprehension which takes only 5 function calls in 12ms.
See my post on profiling in Python here.
This is due to the generator returned by the generator expression calling
next() to get every value in the sequence.
The natural conclusion from these observations is that careful consideration needs to be given to the requirements of a system. If memory constraints are going to outweigh the performance impact then a generator is the most suitable solution. Conversely, if memory consumption is less of a priority than runtime performance it is more appropriate to utilise a static data source such as a list.