Serialization is a key process in programming where data structures or object states are transformed into a format that can be conveniently stored, transmitted, and later reconstructed.
Let’s explore the serialization options provided by Python!
JSON
JavaScript Object Notation, otherwise known as JSON, is a human-readable serialization format that is also simple for machines to parse.
Python includes support for this format in the json module.
JSON is supported by nearly every programming language and is the predominant format for exchanging data over a network, particularly in web APIs.
Key features:
- Text-based and language-independent.
- Ideal for lightweight and cross-language data interchange.
- Supports basic data types like strings, numbers, lists, and dictionaries.
Caveats
- Can only serialize simple data: strings, numbers, booleans, null, lists, and dictionaries with string keys.
- Depending on your data structure, the stored data can balloon in size, because the text format has to include delimiters and repeated key names.
import json
# Serializing data
data = {'name': 'Arjan', 'profession': "Software Developer"}
json_data = json.dumps(data)
print(json_data)
# Deserializing data
data_loaded = json.loads(json_data)
print(data_loaded)
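The "simple data" caveat shows up as soon as a value such as a datetime is involved: json.dumps rejects it with a TypeError. Below is a minimal sketch of one common workaround, assuming an ISO 8601 string is an acceptable representation. The helper name encode_extra and the sample data are made up for this example.

```python
import json
from datetime import datetime


def encode_extra(obj):
    """Hypothetical fallback for types the json module cannot encode itself."""
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


event = {"name": "Arjan", "joined": datetime(2024, 1, 15, 9, 30)}

# Without help, json refuses the datetime value
try:
    json.dumps(event)
except TypeError as exc:
    print(f"Plain dumps failed: {exc}")

# The default hook is called for every object json cannot encode natively
encoded = json.dumps(event, default=encode_extra)
print(encoded)

# The round trip gives back a string, not a datetime
decoded = json.loads(encoded)
print(type(decoded["joined"]).__name__)
```

Note that the conversion is one-way: turning the string back into a datetime on load is up to your own code.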
Marshal
The marshal module is used for serializing and deserializing Python objects. It exists mainly to read and write the compiled code stored in .pyc files, and is not intended for general persistence, especially across Python versions.
It is sometimes used to pass simple values between processes that run the same Python version, although the multiprocessing module itself relies on pickle. Outside of such cases, it’s generally not recommended for use. It can also only serialize simple, built-in types and is unsuitable for more complex data.
Key features:
- Fast and specific to Python internal use.
- Intended for Python bytecode serialization.
Caveats
- Not intended for general use.
- Not secure against erroneous or maliciously constructed data.
- The format can change between Python versions, so data written by one version may not load in another.
- Non-universal format.
- Can only serialize simple data types.
import marshal
# Serializing data
data = {'x': 1, 'y': 2}
serialized_data = marshal.dumps(data)
# Deserializing data
deserialized_data = marshal.loads(serialized_data)
print(deserialized_data)
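To make the "simple data types only" limitation concrete, here is a small sketch. The Point class is a made-up example; the point is only that marshal accepts built-in containers of simple values and rejects instances of user-defined classes.

```python
import marshal


class Point:
    """Hypothetical user-defined class used only for this demonstration."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


# Built-in containers of simple values round-trip fine
payload = marshal.dumps({"x": 1, "y": [2.5, "three", None]})
restored = marshal.loads(payload)
print(restored)

# Instances of user-defined classes are rejected with a ValueError
try:
    marshal.dumps(Point(1, 2))
    rejected = False
except ValueError as exc:
    rejected = True
    print(f"marshal refused the object: {exc}")

# The format version in use depends on the running interpreter
print(marshal.version)
```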
Pickle
The pickle module serializes Python object structures into byte streams and is more general than marshal. Unlike json, pickle can handle a wide variety of Python objects, including custom classes.
Of all of the options on the list, this is the most powerful but also the most dangerous.
A pickle can instruct the interpreter to call arbitrary functions while it is being loaded, so bad actors can use it to deliver a malicious payload. Because that code runs inside an authorized application, it can slip past most antivirus software. This means you should never, ever unpickle data from untrusted sources.
Key features:
- Can serialize complex Python objects.
- Uses a compact binary format, available in several protocol versions.
Caveats
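- Potentially dangerous: loading a pickle can execute arbitrary code.
- Highly coupled to the structure of the code that generated it.
- Python-specific, and pickles written with a newer protocol cannot be read by older Python versions.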
import pickle
# Serializing data
data = {'a': [1, 2, 3], 'b': None}
with open('data.pkl', 'wb') as file:
    pickle.dump(data, file)
# Deserializing data
with open('data.pkl', 'rb') as file:
    loaded_data = pickle.load(file)
print(loaded_data)
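Why is unpickling untrusted data so risky? The sketch below gives a rough idea, under the assumption that a harmless built-in (sum) stands in for whatever function an attacker would really choose. The class name Sneaky is made up for this example. Its __reduce__ method tells pickle which callable to invoke at load time, and pickle does so without question.

```python
import pickle


class Sneaky:
    """Hypothetical class whose pickle calls a function of its choosing."""

    def __reduce__(self):
        # pickle stores "call sum([1, 2, 3])" instead of the object's state.
        # sum is a harmless stand-in; any importable callable would work here.
        return (sum, ([1, 2, 3],))


payload = pickle.dumps(Sneaky())

# Loading the payload calls sum(...) and returns its result,
# not a Sneaky instance
result = pickle.loads(payload)
print(type(result).__name__, result)
```

The receiving side never sees the Sneaky class at all, only the bytes, which is why inspecting the sender's source code is no protection.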
Shelve
The shelve module is a persistent key-value store that uses pickle to serialize objects. This means that shelve can store arbitrary Python objects. Since it uses pickle, it comes with the same inherent risks.
Key features:
- Dictionary-like interface.
- Stores pickled objects with a key.
- Good for simple data storage solutions.
Caveats
- All of the ones presented by pickle.
- Can only track changes to mutable objects if specifically configured to (writeback=True), which comes at a large performance and memory cost.
- Large file sizes, as well as multiple files per shelf depending on the operating system and the underlying dbm backend.
import shelve
# Serializing data
with shelve.open('shelf.db') as shelf:
    shelf['info'] = {'name': 'Alice', 'occupation': 'Engineer'}
# Deserializing data
with shelve.open('shelf.db') as shelf:
    print(shelf['info'])
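The caveat about tracking mutable objects is easy to trip over, so here is a small sketch of it. It assumes a throwaway temporary directory, and the key and file names are made up for this example. It compares an in-place mutation with and without writeback=True.

```python
import os
import shelve
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "shelf_demo")

    # Default mode: mutating a stored object in place is not saved,
    # because shelf["skills"] returns a fresh unpickled copy each time
    with shelve.open(path) as shelf:
        shelf["skills"] = ["python"]
        shelf["skills"].append("rust")

    with shelve.open(path) as shelf:
        without_writeback = shelf["skills"]

    # writeback=True caches accessed entries and writes them back on close
    with shelve.open(path, writeback=True) as shelf:
        shelf["skills"].append("rust")

    with shelve.open(path) as shelf:
        with_writeback = shelf["skills"]

print(without_writeback)
print(with_writeback)
```

If you would rather avoid the caching cost, read the value into a variable, modify it, and assign it back to the key.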
CTypes
Lastly, ctypes is a foreign function library for Python that provides C-compatible data types and allows calling functions in DLLs or shared libraries.
It also provides a number of ctypes data structures, such as Structure. While not directly intended for serialization, it can facilitate it by organizing data into structures and converting them to bytes.
Key features:
- Ideal for interfacing with C code.
- Provides C-compatible data types.
- It isn’t a direct serialization tool but can be used as a tool to write serializers.
- Maximum control over the representation of serialized data.
Caveats
- It isn’t intended as a serialization tool; as such, all serialization and deserialization functions will need to be handwritten.
- You will have to maintain your own serialization tools.
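As a rough illustration of what such a handwritten serializer might look like, here is a minimal sketch. The Header structure, its fields, and the helper names to_bytes and from_bytes are all invented for this example. The field sizes were picked so that no padding is needed, and LittleEndianStructure fixes the byte order regardless of the host machine.

```python
import ctypes


class Header(ctypes.LittleEndianStructure):
    """Hypothetical 8-byte record: 2 + 2 + 4 bytes, naturally aligned."""

    _fields_ = [
        ("version", ctypes.c_uint16),
        ("flags", ctypes.c_uint16),
        ("length", ctypes.c_uint32),
    ]


def to_bytes(header):
    # Structures expose their memory through the buffer protocol
    return bytes(header)


def from_bytes(raw):
    # Copies the bytes into a new, independent Header instance
    return Header.from_buffer_copy(raw)


original = Header(version=1, flags=0x0203, length=42)
raw = to_bytes(original)
print(raw.hex())

restored = from_bytes(raw)
print(restored.version, restored.flags, restored.length)
```

Every byte of the output is determined by the field definitions, which is the "maximum control" mentioned above. The price is that versioning, validation, and variable-length data are entirely your responsibility.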
Final thoughts
Python provides a number of tools for serializing, each with its own strengths and weaknesses. For most common cases, JSON will be the preferable choice, and for situations where you need to serialize more complex data, pickle and shelve can be used.
Be sure to weigh up the pros and cons of each option and choose the one that suits your problem the best.
Be sure to check out my post on creating custom collections.