Python: Parallel S3 multipart upload with retries

Multipart Upload is usually the fastest way to upload huge files to Amazon S3. Instead of sending one huge file through a single connection, you split it into smaller chunks and upload those through multiple connections in parallel.
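To make the idea concrete, here is a minimal sketch of the splitting and parallel dispatch, with a simple retry loop around each part. It does not talk to S3 at all: `upload_part` is a hypothetical callable you would supply yourself, and the names `plan_chunks`, `upload_with_retries` and `parallel_upload` are made up for illustration. Keep in mind that S3 numbers parts starting at 1, requires every part except the last to be at least 5 MiB, and allows at most 10,000 parts per upload.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_chunks(file_size, chunk_size):
    """Return (part_number, offset, length) tuples covering the whole file."""
    # Ceiling division without floats; an empty file still yields one part.
    count = max(1, (file_size + chunk_size - 1) // chunk_size)
    return [
        (i + 1, i * chunk_size, min(chunk_size, file_size - i * chunk_size))
        for i in range(count)
    ]


def upload_with_retries(upload_part, part, retries=3):
    """Call upload_part(part_number, offset, length), retrying on any exception."""
    for attempt in range(1, retries + 1):
        try:
            return upload_part(*part)
        except Exception:
            # Give up and re-raise once the last attempt has failed.
            if attempt == retries:
                raise


def parallel_upload(upload_part, file_size, chunk_size, workers=4):
    """Upload all parts through a thread pool and return the results in part order."""
    parts = plan_chunks(file_size, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with part numbers.
        return list(pool.map(lambda p: upload_with_retries(upload_part, p), parts))
```

In a real upload, the callable passed as `upload_part` would open the file at the given offset, send `length` bytes to S3 as the given part number, and raise an exception on failure so that the retry loop kicks in.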

In Python, you usually use Mitch Garnaat’s boto library to access the Amazon Web Services. Since I usually work with huge video files, I searched the web for existing implementations and thought about how low the memory footprint and disk usage could be. Both Mitch Garnaat and Brad Chapman used the Unix split command to create the chunks first, which doubles the disk usage. Others created StringIO objects on demand, which eats RAM. So I had a deeper look into Python’s io module and wrote FileChunkIO, which takes the path of the file, an offset (where to start reading) and the number of bytes to read, and pretends to be just that chunk (read-only, of course). Since you can read it buffered, its memory footprint is pretty low.
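The idea of a file object that exposes only one chunk can be sketched with the standard io module. This is not the FileChunkIO source, only a rough illustration of the approach: the class name `ChunkReader` and its constructor arguments are assumptions, and it leaves out seeking, which an upload library may need.

```python
import io


class ChunkReader(io.RawIOBase):
    """Read-only view of `length` bytes of a file, starting at `offset`."""

    def __init__(self, path, offset, length):
        super().__init__()
        self._file = open(path, "rb")
        self._length = length
        self._pos = 0  # position relative to the start of the chunk
        self._file.seek(offset)

    def readable(self):
        return True

    def readinto(self, buffer):
        # Never read past the end of the chunk; returning 0 signals EOF.
        remaining = self._length - self._pos
        if remaining <= 0:
            return 0
        data = self._file.read(min(len(buffer), remaining))
        buffer[: len(data)] = data
        self._pos += len(data)
        return len(data)

    def close(self):
        # Close the underlying file once, then mark this object as closed.
        if not self.closed:
            self._file.close()
        super().close()
```

Nothing is copied to disk, and only the bytes currently being read are held in memory. Wrapping the object as `io.BufferedReader(ChunkReader(path, offset, length))` gives buffered reads on top of it.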

An example of my parallel S3 multipart upload with retries using that FileChunkIO is available here: