When Python developers hear the word “pickle,” they may think not of briny cucumbers but of the built-in `pickle` module used for serializing and deserializing Python objects. While `pickle` can be a powerful tool, it also carries significant risks that can leave your codebase vulnerable. In this blog post, I will explore these dangers and discuss safer alternatives.
What is Pickle?
The `pickle` module allows you to serialize Python objects into a byte stream, which can then be written to a file or transmitted over a network. Later, you can deserialize this byte stream back into a Python object. On the surface, this can be incredibly convenient for persisting data, caching, or sending complex objects between different parts of a system, but as I will explain, it’s not without its drawbacks.
How to use Pickle
Pickle is fairly straightforward to use. If you know the `json` module, the API will look familiar.
It can serialize almost any Python object, which makes it very convenient: there is no need to build custom serialization code.
import dataclasses
import pickle


@dataclasses.dataclass
class User:
    name: str
    age: int


data = User(name="Arjan", age=47)
# Serialize the object to bytes, then rebuild it from those bytes
pickled_data = pickle.dumps(data)
unpickled_data = pickle.loads(pickled_data)
assert data == unpickled_data
The pickled data for the user looks like this:
b'\x80\x04\x954\x00\x00\x00\x00\x00\x00\x00\x8c\x08__main__\x94\x8c\x04User\x94\x93\x94)\x81\x94}\x94(\x8c\x04name\x94\x8c\x05Arjan\x94\x8c\x03age\x94K/ub.'
As you can see, we can pick out a few details, but in general it is hard to read. Take note of the `__main__` and `User` fragments: these tell Pickle which type to deserialize to.
The dangers
- Security Vulnerabilities: The most significant danger of using `pickle` is its inherent insecurity. When you unpickle data, you are essentially running the byte stream as a small program, and that program can tell Python to import and call any callable it names. This means that if an attacker can supply or modify the serialized data, they can execute arbitrary code on your system. This vulnerability can lead to severe consequences, such as data breaches, system compromises, and more. Malicious payloads are trivial to create and can be as simple as just a few lines of code.
- Compatibility Issues: Pickle and the data it generates are specific to Python. Data written with a newer pickle protocol cannot be read by Python versions that predate that protocol, and unpickling an object requires its class to be importable under the same module and name, so renaming or restructuring code can break old data. This can be problematic when you need to change your Python version or share data between different systems and languages.
- Lack of Transparency: Pickle’s byte stream is generally not human-readable, making it challenging to debug and inspect the data. If something goes wrong, it can be difficult to determine the cause without detailed knowledge of the `pickle` protocol and the serialized objects.
- Size and Performance: Pickle can produce large byte streams, leading to increased storage requirements and slower performance, especially when working with large or complex objects. This can be a significant drawback in resource-constrained environments.
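The compatibility problem is easy to reproduce. A pickle stores only a reference to an object's class (module name plus class name), not the class itself, so old data stops loading once that name disappears. The sketch below simulates this with a throwaway module; the module name `models` and the rename to `Account` are made up for illustration.

```python
import pickle
import sys
import types

# Hypothetical module standing in for your application code.
models = types.ModuleType("models")
sys.modules["models"] = models

# "Version 1" of the application defines models.User.
User = type("User", (), {"__module__": "models"})
models.User = User

user = User()
user.name = "Arjan"
blob = pickle.dumps(user)

# The byte stream only names the class; it does not contain it.
assert b"models" in blob and b"User" in blob

# "Version 2" renames the class, so the old name no longer resolves.
models.Account = User
del models.User

try:
    pickle.loads(blob)
    load_error = None
except (AttributeError, pickle.UnpicklingError) as exc:
    load_error = exc

print(f"Unpickling old data failed: {load_error!r}")
```

Data-only formats avoid this particular trap because the stored bytes carry no reference to your code layout.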
Malicious payloads
To show you just how scary Pickle is and how easy it is for just about anyone to ruin your day, here is a simple example of how malicious payloads can be created.
import os
import pickle


class Payload:
    def __init__(self, init, *args):
        self.init = init
        self.args = args

    def __reduce__(self):
        # Tell pickle to "rebuild" this object by calling init(*args)
        return self.init, self.args


# Building the payload is harmless; the command runs when it is unpickled
payload = pickle.dumps(Payload(os.system, 'echo Malicious code executed!'))
with open('payload.bin', 'wb') as f:
    f.write(payload)
That’s it! In just a few lines of code, we have arbitrary code execution. The moment this payload is unpickled, the command runs. Here it only prints a message, but it could just as easily have wiped your production server, infected your machine, or exfiltrated confidential information.
import pickle

with open('payload.bin', 'rb') as f:
    payload = f.read()
# Loading is enough to trigger the command; no Payload class is needed here
pickle.loads(payload)
How does this work?!
Normally, `__reduce__` tells the pickle machinery how to re-create an instance of the object it has serialized: it returns a callable and the arguments to call it with. In this case, we return `os.system` as the callable and the shell command as its argument, so unpickling calls `os.system` with our command instead of rebuilding an object.
For reference, a normal `__reduce__` method generally returns its class and the arguments to pass when instantiating it.
import dataclasses


@dataclasses.dataclass
class User:
    name: str
    age: int

    def __reduce__(self):
        # Reduce returns the callable and args required to rebuild this class
        return User, (self.name, self.age)
A really nefarious feature of this is that because the `Payload` object reduced itself to `os.system` and its arguments, the targeted machine does not need to have the `Payload` class available, as would normally be the case for Pickle.
Remember the `__main__` and `User` sections in our pickled bytes? They name the class to rebuild, playing the same role as the callable returned by `__reduce__`; most of the data that follows holds the field names and values.
If we look at the payload’s bytes, we see something similar:
b'\x80\x04\x955\x00\x00\x00\x00\x00\x00\x00\x8c\x02nt\x94\x8c\x06system\x94\x93\x94\x8c\x1decho Malicious code executed!\x94\x85\x94R\x94.'
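We can see our `system` command and its argument. The `nt` in front of it is the module that `os.system` lives in on Windows; on Linux or macOS it would be `posix`.
We have basically fooled Pickle into thinking `os.system` is a class, and we have serialized an instance of it.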
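If you want to look inside a suspicious pickle without running it, the standard library's `pickletools` module can list the opcodes in a byte stream. The sketch below rebuilds a payload like the one above (nothing is unpickled, so the command never runs) and pins `protocol=4` only so the opcode names match the bytes shown earlier.

```python
import os
import pickle
import pickletools


class Payload:
    """Same trick as above: reduce to a callable plus its arguments."""

    def __reduce__(self):
        return os.system, ("echo Malicious code executed!",)


# dumps() only builds the byte stream; nothing is executed here.
payload = pickle.dumps(Payload(), protocol=4)

# genops() walks the opcodes without running them, unlike pickle.loads().
opcode_names = [opcode.name for opcode, arg, pos in pickletools.genops(payload)]
print(opcode_names)

# dis() prints an annotated, human-readable listing of the same stream.
pickletools.dis(payload)
```

The pair to watch for is `STACK_GLOBAL` (look up a callable by module and name) followed by `REDUCE` (call it with the arguments on the stack). Legitimate pickles use these opcodes too, so their presence alone is not proof of an attack; what matters is which callable is being named.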
One of the most troubling things about this is that we tend to think of our own applications as trusted, and as such we grant them access rights that allow them to cause real harm to our systems.
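If you truly cannot avoid unpickling data you do not fully control, the `pickle` documentation describes one way to narrow the attack surface: subclass `pickle.Unpickler` and override `find_class` so that only an allowlist of globals can be loaded. The following is a minimal sketch of that idea, not a guarantee of safety; the names `SAFE_BUILTINS`, `RestrictedUnpickler`, and `restricted_loads`, and the particular allowlist, are illustrative choices.

```python
import builtins
import io
import os
import pickle

# Illustrative allowlist: only these builtins may be referenced by a pickle.
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the stream asks for a global such as os.system.
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    """Like pickle.loads(), but refuses globals outside the allowlist."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


class Payload:
    def __reduce__(self):
        return os.system, ("echo Malicious code executed!",)


# Plain containers and allowlisted builtins still load.
safe = restricted_loads(pickle.dumps([1, 2, range(3)]))

# The malicious stream is rejected before os.system is ever looked up.
try:
    restricted_loads(pickle.dumps(Payload()))
    blocked = False
except pickle.UnpicklingError:
    blocked = True

print(safe, blocked)
```

An allowlist like this is only as good as the callables on it, which is one more reason to prefer a data-only format when you can.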
Safer alternatives
Given the risks associated with `pickle`, it is essential to consider safer alternatives for serializing and deserializing data:
- JSON: JSON (JavaScript Object Notation) is a lightweight, human-readable data interchange format. It is language-agnostic and supported by many programming languages, making it an excellent choice for interoperability. JSON is also inherently safer than `pickle` because it does not allow arbitrary code execution. This is because JSON is a data-only serialization format (supporting only strings, numbers, booleans, null, arrays, and objects) and does not need to execute arbitrary instructions to instantiate the objects it supports. Some other alternatives are YAML and TOML, which provide many of the features of JSON while being arguably more human-readable. With PyYAML, stick to `yaml.safe_load`, since its unsafe loaders can construct arbitrary Python objects.

  **Example**:

  ```python
  import json

  data = {'name': 'Arjan', 'age': 47}

  # Serialize to JSON
  json_data = json.dumps(data)

  # Deserialize from JSON
  loaded_data = json.loads(json_data)
  ```
- MessagePack: MessagePack is a binary serialization format similar to JSON that is generally more efficient than JSON in terms of size and speed. Unlike JSON, it doesn’t aim to be human-readable and can produce much smaller serialized objects. It is also safer than `pickle` because, like JSON, it is a data-only format and does not allow arbitrary code execution.

  **Example**:

  ```python
  # Third-party package: pip install msgpack
  import msgpack

  data = {'name': 'Arjan', 'age': 47}

  # Serialize to MessagePack
  msgpack_data = msgpack.packb(data)

  # Deserialize from MessagePack
  loaded_data = msgpack.unpackb(msgpack_data)
  ```
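To tie this back to the first example, here is one way the `User` dataclass could go through JSON instead of `pickle`. This is a sketch under simple assumptions: the helper names `user_to_json` and `user_from_json` are made up for illustration, and the validation is deliberately minimal. The trade-off is that you write the conversion yourself, but loading can only ever produce the type you explicitly construct.

```python
import dataclasses
import json


@dataclasses.dataclass
class User:
    name: str
    age: int


def user_to_json(user: User) -> str:
    # asdict() turns the dataclass into a plain dict that json can encode.
    return json.dumps(dataclasses.asdict(user))


def user_from_json(text: str) -> User:
    raw = json.loads(text)
    # json.loads() only yields dicts, lists, strings, numbers, booleans, None,
    # so we check the shape ourselves before building a User.
    if not isinstance(raw, dict) or set(raw) != {"name", "age"}:
        raise ValueError("unexpected fields in User payload")
    age_ok = isinstance(raw["age"], int) and not isinstance(raw["age"], bool)
    if not isinstance(raw["name"], str) or not age_ok:
        raise ValueError("unexpected field types in User payload")
    return User(name=raw["name"], age=raw["age"])


user = User(name="Arjan", age=47)
restored = user_from_json(user_to_json(user))
assert restored == user
```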
Final thoughts
While `pickle` can be a convenient tool for Python developers, its risks often outweigh its benefits. Security vulnerabilities, compatibility issues, a lack of transparency, and performance concerns make it a less-than-ideal choice for many applications. By opting for safer alternatives such as JSON, YAML, or MessagePack, you can ensure your data serialization needs are met without compromising security or performance.
Remember, when it comes to serialization in Python, it’s better to be safe than sorry. Avoid getting yourself into a pickle with `pickle`!