Item 74: Consider memoryview and bytearray for Zero-Copy Interactions with bytes

Tue 22 October 2019

This sample is from a previous edition of the book; a newer third edition is available.

Though Python isn’t able to parallelize CPU-bound computation without extra effort (see Item 64: “Consider concurrent.futures for True Parallelism”), it is able to support high-throughput, parallel I/O in a variety of ways (see Item 53: “Use Threads for Blocking I/O, Avoid for Parallelism” and Item 60: “Achieve Highly Concurrent I/O with Coroutines” for details). That said, it’s surprisingly easy to use these I/O tools the wrong way and reach the conclusion that the language is too slow for even I/O-bound workloads.

For example, say that I’m building a media server to stream television or movies over a network to users so they can watch without having to download the video data in advance. One of the key features of such a system is the ability for users to move forward or backward in the video playback so they can skip or repeat parts. In the client program, I can implement this by requesting a chunk of data from the server corresponding to the new time index selected by the user:

def timecode_to_index(video_id, timecode):
    ...
    # Returns the byte offset in the video data

def request_chunk(video_id, byte_offset, size):
    ...
    # Returns size bytes of video_id's data from the offset

video_id = ...
timecode = '01:09:14:28'
byte_offset = timecode_to_index(video_id, timecode)
size = 20 * 1024 * 1024
video_data = request_chunk(video_id, byte_offset, size)

How would you implement the server-side handler that receives the request_chunk request and returns the corresponding 20MB chunk of video data? For the sake of this example, I’m going to assume that the command and control parts of the server have already been hooked up (see Item 61: “Know How to Port Threaded I/O to asyncio” for what that requires). I’m going to focus on the last steps where the requested chunk is extracted from gigabytes of video data that’s cached in memory, and is then sent over a socket back to the client. Here’s what the implementation would look like:

socket = ...             # socket connection to client
video_data = ...         # bytes containing data for video_id
byte_offset = ...        # Requested starting position
size = 20 * 1024 * 1024  # Requested chunk size

chunk = video_data[byte_offset:byte_offset + size]
socket.send(chunk)

The latency and throughput of this code will come down to two factors: how much time it takes to slice the 20MB video chunk from video_data, and how much time the socket takes to transmit that data to the client. If I assume that the socket is infinitely fast, I can run a micro-benchmark using the timeit built-in module to understand the performance characteristics of slicing bytes instances this way to create chunks (see Item 11: “Know How to Slice Sequences” for background).

import timeit

def run_test():
    chunk = video_data[byte_offset:byte_offset + size]
    # Call socket.send(chunk), but ignoring for benchmark

result = timeit.timeit(
    stmt='run_test()',
    globals=globals(),
    number=100) / 100

print(f'{result:0.9f} seconds')
>>>
0.004925669 seconds

It took roughly 5 milliseconds to extract the 20MB slice of data to transmit to the client. That means the overall throughput of my server is limited to a theoretical maximum of 20MB / 5 milliseconds = 4GB / second, since that’s the fastest I can extract the video data from memory. My server will also be limited to 1 CPU-second / 5 milliseconds = 200 clients requesting new chunks in parallel, which is tiny compared to the tens of thousands of simultaneous connections that tools like the asyncio built-in module can support. The problem is that slicing a bytes instance causes the underlying data to be copied, which takes CPU time.
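The estimates above are simple ratios, so they can be checked directly. Here is a minimal sketch that recomputes them from the measured time; it assumes decimal gigabytes, and the variable names are only illustrative:

```python
# Figures from the benchmark above; the names are illustrative only
chunk_size = 20 * 1024 * 1024    # Bytes in one requested chunk
seconds_per_slice = 0.004925669  # Measured time to copy one chunk

# Upper bound on bytes per second if slicing were the only cost
throughput = chunk_size / seconds_per_slice

# Upper bound on chunks a single CPU can slice per second
slices_per_second = 1 / seconds_per_slice

print(f'{throughput / 1e9:.1f} GB/second')
print(f'{slices_per_second:.0f} chunks/second')
```

Both numbers are ceilings rather than predictions: they ignore the socket, the operating system, and every other piece of work the server has to do.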

A better way to write this code is using Python’s built-in memoryview type, which exposes CPython’s high-performance buffer protocol to programs. The buffer protocol is a low-level C API that allows the Python runtime and C extensions to access the underlying data buffers that are behind objects like bytes instances. The best part about memoryview instances is that slicing them results in another memoryview instance without copying the underlying data. Here, I create a memoryview wrapping a bytes instance and inspect a slice of it:

data = b'shave and a haircut, two bits'
view = memoryview(data)
chunk = view[12:19]
print(chunk)
print('Size:           ', chunk.nbytes)
print('Data in view:   ', chunk.tobytes())
print('Underlying data:', chunk.obj)
>>>
<memory at 0x105d6ba00>
Size:            7
Data in view:    b'haircut'
Underlying data: b'shave and a haircut, two bits'
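One way to see that the slice shares memory instead of copying it is to wrap a mutable buffer and change the buffer after slicing. This is a small sketch that assumes a bytearray as the source (the bytearray type is covered later in this item):

```python
source = bytearray(b'shave and a haircut, two bits')
view = memoryview(source)

chunk = view[12:19]     # Slice of the view: shares memory with source
copied = source[12:19]  # Slice of the bytearray: an independent copy

source[12] = ord('H')   # Change one byte in the underlying buffer

print('View sees: ', chunk.tobytes())
print('Copy sees: ', bytes(copied))
```

The view reflects the modified byte because it points at the same memory, while the ordinary slice still holds the data as it was when the copy was made.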

By enabling zero-copy operations, memoryview can provide enormous speedups for code that needs to quickly process large amounts of memory, such as numerical C extensions like NumPy and I/O-bound programs like this one. Here, I replace the simple bytes slicing above with memoryview slicing instead, and repeat the same micro-benchmark:

video_view = memoryview(video_data)

def run_test():
    chunk = video_view[byte_offset:byte_offset + size]
    # Call socket.send(chunk), but ignoring for benchmark

result = timeit.timeit(
    stmt='run_test()',
    globals=globals(),
    number=100) / 100

print(f'{result:0.9f} seconds')
>>>
0.000000250 seconds

The result is 250 nanoseconds. Now the theoretical maximum throughput of my server is 20MB / 250 nanoseconds = 80TB / second. For parallel clients, I can theoretically support up to 1 CPU-second / 250 nanoseconds = 4 million. That’s more like it! This means that now my program is entirely bound by the underlying performance of the socket connection to the client, not by CPU constraints.
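Socket send methods accept any bytes-like object, so a memoryview slice can be passed to them directly. The sketch below assumes a local socket.socketpair as a stand-in for the real client connection and a small 16KB buffer in place of the video cache; the names server_side and client_side are hypothetical:

```python
import socket

video_data = bytes(range(256)) * 64  # 16KB stand-in for cached video
video_view = memoryview(video_data)
byte_offset, size = 1024, 4096

# A connected pair of local sockets stands in for server and client
server_side, client_side = socket.socketpair()
with server_side, client_side:
    chunk = video_view[byte_offset:byte_offset + size]
    server_side.sendall(chunk)       # No intermediate bytes copy

    received = bytearray()
    while len(received) < size:      # recv may return partial data
        data = client_side.recv(size - len(received))
        if not data:
            break
        received += data

matches = bytes(received) == video_data[byte_offset:byte_offset + size]
print('Client got the requested chunk:', matches)
```

I use sendall here because send may transmit only part of a buffer and return the number of bytes it actually sent; sendall keeps going until everything is sent or an error occurs.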

Now, imagine that the data must flow in the other direction, where some clients are sending live video streams to the server in order to broadcast them to other users. In order to do this, I need to store the latest video data from the user in a cache that other clients can read from. Here’s what the implementation of reading 1MB of new data from the incoming client would look like:

socket = ...        # socket connection to the client
video_cache = ...   # Cache of incoming video stream
byte_offset = ...   # Incoming buffer position
size = 1024 * 1024  # Incoming chunk size

chunk = socket.recv(size)
video_view = memoryview(video_cache)
before = video_view[:byte_offset]
after = video_view[byte_offset + size:]
new_cache = b''.join([before, chunk, after])

The socket.recv method returns a bytes instance. I can splice the new data into the existing cache at the current byte_offset by using simple slicing operations and the bytes.join method. To understand the performance of this, I can run another micro-benchmark. I’m using a dummy socket so the performance test is only for the memory operations, not the I/O interaction.

def run_test():
    chunk = socket.recv(size)
    before = video_view[:byte_offset]
    after = video_view[byte_offset + size:]
    new_cache = b''.join([before, chunk, after])

result = timeit.timeit(
    stmt='run_test()',
    globals=globals(),
    number=100) / 100

print(f'{result:0.9f} seconds')
>>>
0.033520550 seconds

It takes 33 milliseconds to receive 1MB and update the video cache. That means my maximum receive throughput is 1MB / 33 milliseconds = 30MB / second, and I’m limited to 30MB / 1MB = 30 simultaneous clients streaming in video data this way. This doesn’t scale.
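The dummy socket used for this benchmark isn’t shown above. One possible stand-in might look like the following sketch; the FakeSocket class, the 8MB cache size, and the splice_once function are all assumptions made for illustration, not the code that produced the timing above:

```python
import timeit

class FakeSocket:
    """Hypothetical stand-in for a client connection."""
    def __init__(self, payload):
        self._payload = payload

    def recv(self, size):
        # Return at most size bytes of canned data, with no real I/O
        return self._payload[:size]

size = 1024 * 1024
video_cache = bytes(8 * size)  # 8MB of zeros standing in for the cache
video_view = memoryview(video_cache)
byte_offset = 2 * size
fake_socket = FakeSocket(b'\x01' * size)

def splice_once():
    chunk = fake_socket.recv(size)
    before = video_view[:byte_offset]
    after = video_view[byte_offset + size:]
    # join copies every byte of before, chunk, and after
    return b''.join([before, chunk, after])

new_cache = splice_once()
elapsed = timeit.timeit(splice_once, number=10) / 10
print(f'{elapsed:0.9f} seconds per splice')
```

The cost comes from bytes.join, which has to build a brand new cache-sized bytes object on every call, so the time grows with the size of the whole cache and not with the size of the incoming chunk.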

A better way to write this code is to use Python’s built-in bytearray type in conjunction with memoryview. One limitation with bytes instances is that they are read-only, and don’t allow for individual indexes to be updated.

my_bytes = b'hello'
my_bytes[0] = b'\x79'
>>>
Traceback ...
TypeError: 'bytes' object does not support item assignment

The bytearray type is like a mutable version of bytes that allows for arbitrary positions to be overwritten. bytearray uses integers for its values instead of bytes.

my_array = bytearray(b'hello')
my_array[0] = 0x79
print(my_array)
>>>
bytearray(b'yello')

A memoryview can also be used to wrap a bytearray. When you slice such a memoryview, the resulting object can be used to assign data to a particular portion of the underlying buffer. This avoids the copying costs from above that were required to splice the bytes instances back together after data was received from the client.

my_array = bytearray(b'row, row, row your boat')
my_view = memoryview(my_array)
write_view = my_view[3:13]
write_view[:] = b'-10 bytes-'
print(my_array)
>>>
bytearray(b'row-10 bytes- your boat')
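Two details are worth keeping in mind when writing through a view. A view over a bytes instance stays read-only, and a slice assignment must supply exactly as many bytes as the slice covers, since the underlying buffer can’t change size this way. A short sketch of both cases (variable names are illustrative):

```python
read_only_view = memoryview(b'row, row, row your boat')
writable_view = memoryview(bytearray(b'row, row, row your boat'))

print('bytes view read-only?    ', read_only_view.readonly)
print('bytearray view read-only?', writable_view.readonly)

write_rejected = False
try:
    read_only_view[0] = ord('t')        # Underlying bytes is immutable
except TypeError:
    write_rejected = True

size_rejected = False
try:
    writable_view[3:13] = b'too short'  # 9 bytes into a 10-byte slice
except ValueError:
    size_rejected = True

print('Write to bytes view rejected:', write_rejected)
print('Mismatched size rejected:    ', size_rejected)
```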

Many library methods in Python use the buffer protocol to receive or read data quickly, such as socket.recv_into and RawIOBase.readinto. The benefit of these methods is that they avoid allocating memory and creating another copy of the data: what’s received goes straight into an existing buffer. Here, I use socket.recv_into along with a memoryview slice to receive data into an underlying bytearray without the need for any splicing:

video_array = bytearray(video_cache)
write_view = memoryview(video_array)
chunk = write_view[byte_offset:byte_offset + size]
socket.recv_into(chunk)
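The same pattern applies to file-like objects that provide a readinto method. Here is a minimal sketch that uses io.BytesIO as an in-memory stand-in for a real file or stream; the buffer contents and offsets are made up for the example:

```python
import io

stream = io.BytesIO(b'0123456789')  # Stand-in for a file or stream
video_array = bytearray(b'-' * 16)
write_view = memoryview(video_array)

# Read up to 6 bytes directly into positions 4 through 9
count = stream.readinto(write_view[4:10])

print(count)
print(video_array)
```

Like recv_into, readinto returns the number of bytes it actually stored, which can be smaller than the size of the slice it was given.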

I can run another micro-benchmark to compare the performance of this approach to the earlier example that used socket.recv.

def run_test():
    chunk = write_view[byte_offset:byte_offset + size]
    socket.recv_into(chunk)

result = timeit.timeit(
    stmt='run_test()',
    globals=globals(),
    number=100) / 100

print(f'{result:0.9f} seconds')
>>>
0.000033925 seconds

It took 33 microseconds to receive a 1MB video transmission. That means my server can support 1MB / 33 microseconds = 30GB / second of max throughput, and 30GB / 1MB = 30,000 parallel streaming clients. That’s the type of scalability that I’m looking for!
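With a real connection, recv_into may return before the whole slice has been filled, so production code usually needs to loop. The following sketch shows one way to do that while still avoiding copies; the recv_exactly_into helper is hypothetical, and socket.socketpair is assumed as a stand-in for a real client:

```python
import socket

def recv_exactly_into(sock, view):
    """Hypothetical helper: fill view completely or raise."""
    total = 0
    while total < len(view):
        # Slicing the view again is still zero-copy
        count = sock.recv_into(view[total:])
        if count == 0:
            raise ConnectionError('Peer closed before buffer was full')
        total += count
    return total

video_array = bytearray(b'.' * 32)  # Small stand-in for the cache
write_view = memoryview(video_array)
byte_offset, size = 8, 16

sender, receiver = socket.socketpair()
with sender, receiver:
    sender.sendall(b'X' * size)
    chunk = write_view[byte_offset:byte_offset + size]
    received = recv_exactly_into(receiver, chunk)

print(received)
print(video_array)
```

Each pass through the loop narrows the view to the part that hasn’t been written yet, so new data always lands at the right offset in the underlying bytearray.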

Things to Remember

  • The memoryview built-in type provides a zero-copy interface for reading and writing slices of objects that support Python’s high-performance buffer protocol.
  • The bytearray built-in type provides a mutable bytes-like type that can be used for zero-copy data reads with functions like socket.recv_into.
  • A memoryview can wrap a bytearray, allowing for received data to be spliced into an arbitrary buffer location without copying costs.