Though Python isn’t able to parallelize CPU-bound computation without extra effort (see Item 64: “Consider concurrent.futures
for True Parallelism”), it is able to support high-throughput, parallel I/O in a variety of ways (see Item 53: “Use Threads for Blocking I/O, Avoid for Parallelism” and Item 60: “Achieve Highly Concurrent I/O with Coroutines” for details). That said, it’s surprisingly easy to use these I/O tools the wrong way and reach the conclusion that the language is too slow for even I/O-bound workloads.
For example, say that you’re building a media server to stream television or movies over a network to users so they can watch without having to download the video data in advance. One of the key features of such a system is the ability for users to move forward or backward in the video playback so they can skip or repeat parts. In the client program I can implement this by requesting a chunk of data from the server corresponding to the new time index selected by the user:
def timecode_to_index(video_id, timecode):
    ...
    # Returns the byte offset in the video data

def request_chunk(video_id, byte_offset, size):
    ...
    # Returns size bytes of video_id's data from the offset

video_id = ...
timecode = '01:09:14:28'
byte_offset = timecode_to_index(video_id, timecode)
size = 20 * 1024 * 1024
video_data = request_chunk(video_id, byte_offset, size)
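The two helpers above are only stubs. As a rough illustration of what timecode_to_index might do, here is a minimal sketch that maps an HH:MM:SS:FF timecode to a byte offset, assuming a constant frame rate and bitrate (FRAME_RATE and BYTES_PER_SECOND are made-up parameters for illustration, not part of the original example):

```python
FRAME_RATE = 30                  # Frames per second (assumed constant)
BYTES_PER_SECOND = 1250 * 1024   # Assumed constant bitrate (~10 Mbit/s)

def timecode_to_index(video_id, timecode):
    # Parse an HH:MM:SS:FF timecode into a total frame count,
    # then scale by the assumed constant bitrate to get a byte offset
    hours, minutes, seconds, frames = map(int, timecode.split(':'))
    total_seconds = (hours * 60 + minutes) * 60 + seconds
    total_frames = total_seconds * FRAME_RATE + frames
    return total_frames * BYTES_PER_SECOND // FRAME_RATE

print(timecode_to_index('vid123', '01:09:14:28'))
```

A real server would instead consult the container format's index (e.g., MP4 sample tables), since video is rarely constant-bitrate.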
How would you implement the server-side handler that receives the request_chunk
request and returns the corresponding 20MB chunk of video data? For the sake of this example, I’m going to assume that the command and control parts of the server have already been hooked up (see Item 61: “Know How to Port Threaded I/O to asyncio
” for what that requires). I’m going to focus on the last steps where the requested chunk is extracted from gigabytes of video data that’s cached in memory, and is then sent over a socket back to the client. Here’s what the implementation would look like:
socket = ... # socket connection to client
video_data = ... # bytes containing data for video_id
byte_offset = ... # Requested starting position
size = 20 * 1024 * 1024 # Requested chunk size
chunk = video_data[byte_offset:byte_offset + size]
socket.send(chunk)
The latency and throughput of this code will come down to two factors: how much time it takes to slice the 20MB video chunk
from video_data
, and how much time the socket takes to transmit that data to the client. If I assume that the socket is infinitely fast, I can run a micro-benchmark using the timeit
built-in module to understand the performance characteristics of slicing bytes
instances this way to create chunks (see Item 11: “Know How to Slice Sequences” for background).
import timeit
def run_test():
    chunk = video_data[byte_offset:byte_offset + size]
    # Call socket.send(chunk), but ignoring for benchmark

result = timeit.timeit(
    stmt='run_test()',
    globals=globals(),
    number=100) / 100
print(f'{result:0.9f} seconds')
>>>
0.004925669 seconds
It took roughly 5 milliseconds to extract the 20MB slice of data to transmit to the client. That means the overall throughput of my server is limited to a theoretical maximum of 20MB / 5 milliseconds = 4GB / second, since that’s the fastest I can extract the video data from memory. My server will also be limited to 1 CPU-second / 5 milliseconds = 200 clients requesting new chunks in parallel, which is tiny compared to the tens of thousands of simultaneous connections that tools like the asyncio
built-in module can support. The problem is that slicing a bytes
instance causes the underlying data to be copied, which takes CPU time.
A better way to write this code is using Python’s built-in memoryview
type, which exposes CPython’s high-performance buffer protocol to programs. The buffer protocol is a low-level C API that allows the Python runtime and C extensions to access the underlying data buffers that are behind objects like bytes
instances. The best part about memoryview
instances is that slicing them results in another memoryview
instance without copying the underlying data. Here, I create a memoryview
wrapping a bytes
instance and inspect a slice of it:
data = b'shave and a haircut, two bits'
view = memoryview(data)
chunk = view[12:19]
print(chunk)
print('Size: ', chunk.nbytes)
print('Data in view: ', chunk.tobytes())
print('Underlying data:', chunk.obj)
>>>
<memory at 0x105d6ba00>
Size: 7
Data in view: b'haircut'
Underlying data: b'shave and a haircut, two bits'
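To see the copying behavior directly, here is a small sketch (using a bytearray as the source so the underlying buffer can be mutated): a bytes slice keeps its old contents after the buffer changes because it was copied out, while a memoryview slice reflects the change because it shares the same memory:

```python
data = bytearray(b'shave and a haircut, two bits')

copied = bytes(data)[12:19]        # bytes slicing copies the data out
shared = memoryview(data)[12:19]   # memoryview slicing shares the buffer

data[12:19] = b'HAIRCUT'           # Mutate the underlying buffer

print(copied)                      # Unchanged copy: b'haircut'
print(shared.tobytes())            # Sees the change: b'HAIRCUT'
```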
By enabling zero-copy operations, memoryview
can provide enormous speedups for code that needs to quickly process large amounts of memory, such as numerical C extensions like NumPy and I/O-bound programs like this one. Here, I replace the simple bytes
slicing above with memoryview
slicing instead, and repeat the same micro-benchmark:
video_view = memoryview(video_data)
def run_test():
    chunk = video_view[byte_offset:byte_offset + size]
    # Call socket.send(chunk), but ignoring for benchmark

result = timeit.timeit(
    stmt='run_test()',
    globals=globals(),
    number=100) / 100
print(f'{result:0.9f} seconds')
>>>
0.000000250 seconds
The result is 250 nanoseconds. Now the theoretical maximum throughput of my server is 20MB / 250 nanoseconds = 80TB / second. For parallel clients, I can theoretically support up to 1 CPU-second / 250 nanoseconds = 4 million. That’s more like it! This means that now my program is entirely bound by the underlying performance of the socket connection to the client, not by CPU constraints.
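Note that socket.send accepts a memoryview directly, because sockets work with any object that supports the buffer protocol. Here is a small self-contained sketch using socket.socketpair (a stand-in for the real client connection) to show that a zero-copy slice can be transmitted without first converting it to bytes:

```python
import socket

# socketpair stands in for the real client connection
left, right = socket.socketpair()

payload = b'\x00' * (1024 * 1024)     # Dummy video data
payload_view = memoryview(payload)

chunk = payload_view[512:1024]        # Zero-copy slice of the payload
left.send(chunk)                      # send accepts buffer-protocol objects
received = right.recv(1024)

print(len(received))
left.close()
right.close()
```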
Now, imagine that the data must flow in the other direction, where some clients are sending live video streams to the server in order to broadcast them to other users. In order to do this, I need to store the latest video data from the user in a cache that other clients can read from. Here’s what the implementation of reading 1MB of new data from the incoming client would look like:
socket = ... # socket connection to the client
video_cache = ... # Cache of incoming video stream
byte_offset = ... # Incoming buffer position
size = 1024 * 1024 # Incoming chunk size
chunk = socket.recv(size)
video_view = memoryview(video_cache)
before = video_view[:byte_offset]
after = video_view[byte_offset + size:]
new_cache = b''.join([before, chunk, after])
The socket.recv
method will return a bytes
instance. I can splice the new data with the existing cache at the current byte_offset
by using simple slicing operations and the bytes.join method. To understand the performance of this, I can run another micro-benchmark. I’m using a dummy socket so the performance test measures only the memory operations, not the I/O interaction.
def run_test():
    chunk = socket.recv(size)
    before = video_view[:byte_offset]
    after = video_view[byte_offset + size:]
    new_cache = b''.join([before, chunk, after])

result = timeit.timeit(
    stmt='run_test()',
    globals=globals(),
    number=100) / 100
print(f'{result:0.9f} seconds')
>>>
0.033520550 seconds
It takes 33 milliseconds to receive 1MB and update the video cache. That means my maximum receive throughput is 1MB / 33 milliseconds = 31MB / second, and I’m limited to 1 CPU-second / 33 milliseconds = 31 simultaneous clients streaming in video data this way. This doesn’t scale.
A better way to write this code is to use Python’s built-in bytearray
type in conjunction with memoryview
. One limitation with bytes
instances is that they are read-only, and don’t allow for individual indexes to be updated.
my_bytes = b'hello'
my_bytes[0] = b'\x79'
>>>
Traceback ...
TypeError: 'bytes' object does not support item assignment
The bytearray
type is like a mutable version of bytes
that allows for arbitrary positions to be overwritten. bytearray
uses integers for its values instead of bytes
.
my_array = bytearray(b'hello')
my_array[0] = 0x79
print(my_array)
>>>
bytearray(b'yello')
A memoryview
can also be used to wrap a bytearray
. When you slice such a memoryview
, the resulting object can be used to assign data to a particular portion of the underlying buffer. This avoids the copying costs from above that were required to splice the bytes
instances back together after data was received from the client.
my_array = bytearray(b'row, row, row your boat')
my_view = memoryview(my_array)
write_view = my_view[3:13]
write_view[:] = b'-10 bytes-'
print(my_array)
>>>
bytearray(b'row-10 bytes- your boat')
There are many libraries in Python that use the buffer protocol to receive or read data quickly, such as socket.recv_into
and RawIOBase.readinto
. The benefit of these methods is that they avoid allocating memory and creating another copy of the data—what’s received goes straight into an existing buffer. Here, I use socket.recv_into
along with a memoryview
slice to receive data into an underlying bytearray
without the need for any splicing:
video_array = bytearray(video_cache)
write_view = memoryview(video_array)
chunk = write_view[byte_offset:byte_offset + size]
socket.recv_into(chunk)
I can run another micro-benchmark to compare the performance of this approach to the earlier example that used socket.recv
.
def run_test():
    chunk = write_view[byte_offset:byte_offset + size]
    socket.recv_into(chunk)

result = timeit.timeit(
    stmt='run_test()',
    globals=globals(),
    number=100) / 100
print(f'{result:0.9f} seconds')
>>>
0.000033925 seconds
It took 33 microseconds to receive a 1MB video transmission. That means my server can support 1MB / 33 microseconds = 31GB / second of maximum throughput, and 1 CPU-second / 33 microseconds ≈ 30,000 parallel streaming clients. That’s the type of scalability that I’m looking for!
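RawIOBase.readinto follows the same pattern for file-like objects. As a quick illustration (using io.BytesIO as a stand-in for a real file or socket), data is read directly into a slice of an existing buffer rather than into a newly allocated bytes object:

```python
import io

stream = io.BytesIO(b'all work and no play makes jack a dull boy')

buffer = bytearray(16)             # Reusable receive buffer
view = memoryview(buffer)

count = stream.readinto(view[:8])  # Fill only the first 8 bytes
print(count)                       # 8
print(bytes(buffer[:count]))       # b'all work'
```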
Things to Remember
- The memoryview built-in type provides a zero-copy interface for reading and writing slices of objects that support Python’s high-performance buffer protocol.
- The bytearray built-in type provides a mutable bytes-like type that can be used for zero-copy data reads with functions like socket.recv_into.
- A memoryview can wrap a bytearray, allowing for received data to be spliced into an arbitrary buffer location without copying costs.