Jul 22, 2024

Python Pickle Risks and Safer Serialization Alternatives

By Alyce Osbourne

When Python developers hear the word “pickle,” they may not immediately think of briny cucumbers but rather of the built-in pickle module used for serializing and deserializing Python objects. While pickle can be a powerful tool, it also carries significant risks that can leave your codebase vulnerable. In this blog post, I will explore these dangers and discuss safer alternatives.

What is Pickle?

The pickle module allows you to serialize Python objects into a byte stream, which can then be written to a file or transmitted over a network. Later, you can deserialize this byte stream back into a Python object. On the surface, this can be incredibly convenient for persisting data, caching, or sending complex objects between different parts of a system, but as I will explain, it’s not without its drawbacks.

How to use Pickle

Pickle is fairly straightforward to use. If you have used the json module, its API will look familiar. It can serialize almost any Python object, which makes it very convenient because there is no need to build complicated serialization tools.

import dataclasses
import pickle


@dataclasses.dataclass
class User:
    name: str
    age: int


data = User(name="Arjan", age=47)

pickled_data = pickle.dumps(data)
unpickled_data = pickle.loads(pickled_data)

assert data == unpickled_data

The pickled data for the user looks like this:

b'\x80\x04\x954\x00\x00\x00\x00\x00\x00\x00\x8c\x08__main__\x94\x8c\x04User\x94\x93\x94)\x81\x94}\x94(\x8c\x04name\x94\x8c\x05Arjan\x94\x8c\x03age\x94K/ub.'

As you can see, we can pick out a few details, but on the whole it’s pretty hard to read. Take note of the __main__ and User fragments: these tell pickle which module and class to look up when it deserializes the data.
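If you want to see what those bytes mean without loading them, the standard library's pickletools module can disassemble a pickle into its opcodes. The sketch below is only an illustration: it uses a fractions.Fraction as a stand-in for User so that it does not depend on which module defines the class. The fractions and Fraction strings in its output play the same role as __main__ and User above.

```python
import fractions
import io
import pickle
import pickletools

# A Fraction stands in for the User dataclass (an assumption made so the
# snippet runs regardless of where it is executed).
# Protocol 4 matches the b'\x80\x04...' header shown in the article.
pickled_data = pickle.dumps(fractions.Fraction(1, 3), protocol=4)

# pickletools.dis only decodes the opcodes; it does not execute them,
# so it does not import or call anything named in the stream.
buffer = io.StringIO()
pickletools.dis(pickled_data, out=buffer)
disassembly = buffer.getvalue()

print(disassembly)
```

The STACK_GLOBAL line in the output is the point where pickle looks up the module and class name that precede it.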

The dangers

Security Vulnerabilities: The most significant danger of using pickle is its inherent insecurity. A pickle is not plain data: it is a small program for the pickle virtual machine, and that program can import and call arbitrary Python callables while it is being loaded. This means that if an attacker can modify or supply the serialized data, they can execute arbitrary code on your system. This vulnerability can lead to severe consequences, such as data breaches and system compromises. Malicious payloads are trivial to create and can be as simple as a few lines of code.
Compatibility Issues: Pickle and the data it generates are specific to Python. Newer Python versions can read pickles written by older ones, but a pickle written with a newer protocol cannot be read by an interpreter that predates that protocol, and unpickling fails if a class it refers to has been moved, renamed, or changed incompatibly. This can be problematic when you refactor your code or share data between different systems and languages.
Lack of Transparency: Pickle’s byte stream is generally not human-readable, making it challenging to debug and inspect the data. If something goes wrong, it can be difficult to determine the cause without detailed knowledge of the pickle protocol and the serialized objects.
Size and Performance: Pickle can produce large byte streams, leading to increased storage requirements and slower performance, especially when working with large or complex objects. This can be a significant drawback in resource-constrained environments.
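
To make the compatibility point more concrete, here is a small sketch of how the protocol version shows up in the byte stream. The data dictionary is just sample data. The constants and the protocol argument are part of the standard pickle module.

```python
import pickle

data = {"name": "Arjan", "age": 47}

print("default protocol:", pickle.DEFAULT_PROTOCOL)
print("highest protocol:", pickle.HIGHEST_PROTOCOL)

# From protocol 2 onwards, every stream starts with the PROTO opcode (0x80)
# followed by one byte holding the protocol number.
streams = {
    protocol: pickle.dumps(data, protocol=protocol)
    for protocol in range(2, pickle.HIGHEST_PROTOCOL + 1)
}
for protocol, stream in streams.items():
    print(protocol, len(stream), stream[:2])

# Pinning an older protocol keeps the stream readable by older interpreters.
# It does nothing to make unpickling untrusted data safe.
portable = pickle.dumps(data, protocol=2)
```

Pinning the protocol only helps with interpreter versions. It does not help when the classes inside the pickle have changed since the data was written.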

Malicious payloads

To show you just how scary Pickle is and how easy it is for just about anyone to ruin your day, here is a simple example of how malicious payloads can be created.

import os
import pickle


class Payload:
    def __init__(
            self,
            init,
            *args
    ):
        self.init = init
        self.args = args

    def __reduce__(self):
        return self.init, self.args


payload = pickle.dumps(
    Payload(
        os.system,
        'echo Malicious code executed!'
    ))

with open('payload.bin', 'wb') as f:
    f.write(payload)

That’s it! In just a few lines of code, we have arbitrary code execution. The moment this data is unpickled, it runs its payload with the privileges of the process that loaded it. Here the command only echoes a message, but it could just as easily have deleted files on your production server, infected your machine, or exfiltrated confidential information.

with open('payload.bin', 'rb') as f:
    payload = f.read()
    pickle.loads(payload)

How does this work?!

Normally, __reduce__ tells pickle how to rebuild an instance of the object it has serialized. In this case, we return os.system as the callable and the command as its arguments, so unpickling calls os.system with our command instead of rebuilding a Payload.

For reference, a normal __reduce__ method generally returns its class and the arguments to pass when instantiating it.


import dataclasses


@dataclasses.dataclass
class User:
    name: str
    age: int

    def __reduce__(self):
        # Return the callable and the arguments needed to rebuild this instance
        return User, (self.name, self.age)

A really nefarious feature of this is that, because the Payload object reduced itself to os.system and its arguments, the machine targeted for infection does not need to have the Payload class available, as would normally be the case for pickle.
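
You can check this behaviour safely by swapping os.system for something harmless. In the sketch below, the built-in sorted is a stand-in chosen purely for illustration. The Payload class is the same one as above.

```python
import pickle


class Payload:
    def __init__(self, init, *args):
        self.init = init
        self.args = args

    def __reduce__(self):
        return self.init, self.args


# A harmless stand-in for os.system: the built-in sorted().
stream = pickle.dumps(Payload(sorted, [3, 1, 2]))

# The stream names the callable, not the Payload class.
print(b"Payload" in stream)
print(b"sorted" in stream)

# Unpickling calls sorted([3, 1, 2]) and hands back whatever it returns.
result = pickle.loads(stream)
print(result)
```

Nothing in the stream mentions Payload, so the loading side only needs the callable to be importable. For os.system, that is true on every Python installation.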

Remember the __main__ and User fragments in our pickled bytes? They name the class that pickle must look up to rebuild the object. Most of the data after them is the instance’s state, in this case the name and age fields.

If we look at the payload’s bytes, we see something similar:

b'\x80\x04\x955\x00\x00\x00\x00\x00\x00\x00\x8c\x02nt\x94\x8c\x06system\x94\x93\x94\x8c\x1decho Malicious code executed!\x94\x85\x94R\x94.'

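We can see the system function and its argument. (The module appears as nt because this payload was generated on Windows; on Linux or macOS it would be posix.) We have basically fooled pickle into treating os.system as if it were a class constructor, and we have serialized a call to it.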
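
If you cannot avoid unpickling, the pickle documentation describes one way to narrow the attack surface: subclass pickle.Unpickler and override find_class so that only an allow-list of globals can be resolved. The following is a sketch along those lines, not a complete defence. The allow-list and the names RestrictedUnpickler and restricted_loads are illustrative choices, and print is used as a harmless stand-in for os.system.

```python
import builtins
import io
import pickle

# Hypothetical allow-list: only these built-in types may be looked up.
ALLOWED_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the stream asks for a global such as os.system.
        if module == "builtins" and name in ALLOWED_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    """Like pickle.loads, but refuses globals outside the allow-list."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


class Payload:
    def __reduce__(self):
        # Harmless stand-in for os.system.
        return print, ("this should never be printed",)


print(restricted_loads(pickle.dumps(range(3))))

try:
    restricted_loads(pickle.dumps(Payload()))
except pickle.UnpicklingError as exc:
    print("blocked:", exc)
```

An allow-list like this is only as safe as the callables on it, so keep it as small as you can.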

One of the most troubling things about this is that we tend to think of our own applications as trusted, and as such, we grant them access rights that let a malicious payload cause real harm to our systems.
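
When pickles only ever travel between components you control, the pickle documentation suggests signing the data with hmac so that tampering is detected before anything is unpickled. Below is a minimal sketch of that idea. SECRET_KEY, signed_dumps and verified_loads are hypothetical names, and a real deployment would need proper key management.

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key, for illustration
DIGEST_SIZE = hashlib.sha256().digest_size


def signed_dumps(obj) -> bytes:
    """Pickle obj and prepend an HMAC-SHA256 signature of the pickle."""
    payload = pickle.dumps(obj)
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return signature + payload


def verified_loads(blob: bytes):
    """Check the signature first; unpickle only if it matches."""
    signature, payload = blob[:DIGEST_SIZE], blob[DIGEST_SIZE:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("signature mismatch: refusing to unpickle")
    return pickle.loads(payload)


blob = signed_dumps({"name": "Arjan", "age": 47})
print(verified_loads(blob))
```

Signing only proves that the data came from someone who holds the key. It does not make pickle safe for data from anyone else.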

Safer alternatives

Given the risks associated with pickle, it is essential to consider safer alternatives for serializing and deserializing data:

JSON: JSON (JavaScript Object Notation) is a lightweight, human-readable data interchange format. It is language-agnostic and supported by many programming languages, making it an excellent choice for interoperability. JSON is also inherently safer than pickle because it is a data-only format: it supports only strings, numbers, booleans, null, lists, and objects, so a parser never needs to execute instructions to build the values it loads. Some other alternatives are YAML and TOML, which provide many of the features of JSON while being arguably more human-readable. With YAML, stick to a safe loader such as PyYAML’s yaml.safe_load, because its unsafe loaders can also construct arbitrary Python objects.
Example:

```python
import json

data = {'name': 'Arjan', 'age': 47}

# Serialize to JSON
json_data = json.dumps(data)

# Deserialize from JSON
loaded_data = json.loads(json_data)
```

MessagePack: MessagePack is a binary serialization format similar to JSON, but typically more compact and faster to encode and decode. Unlike JSON, it doesn’t aim to be human-readable. It is also safer than pickle because, like JSON, it is a data-only format and does not allow arbitrary code execution. In Python it is provided by the third-party msgpack package.
Example:

```python
import msgpack

data = {'name': 'Arjan', 'age': 47}

# Serialize to MessagePack
msgpack_data = msgpack.packb(data)

# Deserialize from MessagePack
loaded_data = msgpack.unpackb(msgpack_data)
```
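
The examples above use plain dictionaries. To round-trip the User dataclass from earlier through a data-only format, you have to convert it explicitly in both directions. Here is one possible sketch using only the standard library. It assumes the class has flat, JSON-compatible fields.

```python
import dataclasses
import json


@dataclasses.dataclass
class User:
    name: str
    age: int


user = User(name="Arjan", age=47)

# Serialize: convert the dataclass to a plain dict first
json_data = json.dumps(dataclasses.asdict(user))

# Deserialize: the loading code decides which class to build, not the data
restored = User(**json.loads(json_data))

print(json_data)
print(restored == user)
```

This is more work than pickle.dumps, but the data can no longer choose which callable gets invoked on load.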

Final thoughts

While pickle can be a convenient tool for Python developers, its risks often outweigh its benefits. Security vulnerabilities, compatibility issues, a lack of transparency, and performance concerns make it a less-than-ideal choice for many applications. By opting for safer alternatives such as JSON, YAML, or MessagePack, you can ensure your data serialization needs are met without compromising security or performance.

Remember, when it comes to serialization in Python, it’s better to be safe than sorry. Avoid getting yourself into a pickle with pickle!
