
Serialization & Versioning

Optimize your tasks' payload size and make them backwards and forwards compatible.

Motivation

Passing data between services raises two big issues:

  • Performance: The size of the data you are sending through the broker can have a significant impact on the performance of your system.
  • Versioning: If you wish to change the schema of the input data of your task, you need to ensure it is backwards compatible with the previous schema.

Proper serialization, clear schemas, and versioning strategies solve both problems.

Serialization Strategies

Simple Formats

JSON

JSON is easy, fast to set up, and widely supported. If you're doing something small or medium-sized, it's a great choice.

Payloads can also be shrunk by using MessagePack, a binary format that is more compact than JSON and faster to parse.
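As a rough sketch of the trade-off, here is the same payload encoded both ways (this assumes the msgpack package, which TacoQ does not require):

```python
import json

import msgpack  # pip install msgpack

payload = {"user_id": 42, "items": ["taco", "burrito"], "priority": 3}

# JSON: human-readable and universally supported.
json_bytes = json.dumps(payload).encode("utf-8")

# MessagePack: same data model, smaller binary encoding.
msgpack_bytes = msgpack.packb(payload)

# Both round-trip to the same data.
assert msgpack.unpackb(msgpack_bytes) == json.loads(json_bytes)
print(len(json_bytes), len(msgpack_bytes))  # MessagePack is typically smaller
```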

CSV

You can also exchange data in CSV format, which is especially handy when you're working with dataframes in a library like pandas or polars.

Polars is a fast, modern alternative to pandas. If you haven't already, we recommend giving it a try!
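A minimal round trip with polars might look like this (pandas works the same way with to_csv/read_csv; the column names are illustrative):

```python
import io

import polars as pl  # pip install polars

df = pl.DataFrame({"user_id": [1, 2], "score": [0.7, 0.9]})

# Serialize the frame to CSV text for the task payload...
csv_payload = df.write_csv()  # returns a str when no file is given

# ...and parse it back on the consumer side.
df_roundtrip = pl.read_csv(io.StringIO(csv_payload))
assert df.equals(df_roundtrip)
```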

Simple Compatibility

If you're using JSON or CSV, you can still implement a rudimentary form of backwards compatibility (effectively a subset of Avro's schema evolution rules), as sketched below:

  • Assume a default value for fields missing from old payloads.
  • Ignore fields that are no longer present in the schema.
  • For renamed fields, keep accepting the old name alongside the new one.
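Here is one way to express those three rules with pydantic v2 (the field names are hypothetical):

```python
from pydantic import AliasChoices, BaseModel, ConfigDict, Field

class TaskInput(BaseModel):
    # Ignore fields this consumer does not know about.
    model_config = ConfigDict(extra="ignore")

    user_id: int
    # Default value for a field that old producers do not send yet.
    priority: int = 0
    # Accept the new name, but fall back to the old one for old payloads.
    email: str = Field(validation_alias=AliasChoices("email", "email_address"))

# An old payload: unknown extra field, missing `priority`, old field name.
old_payload = {"user_id": 1, "email_address": "a@b.com", "legacy_flag": True}
print(TaskInput.model_validate(old_payload))
# user_id=1 priority=0 email='a@b.com'
```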

Forwards compatibility (reading data created by a newer version of the schema) is harder to implement by hand. It can be done, but we recommend a more robust serialization format for this.

Testing for Compatibility

It is not trivial to ensure that JSON generated by pydantic in one service can be parsed by, say, a Rust consumer using serde.

It is a good idea to have end-to-end or contract tests that validate that the publisher and consumer can communicate. If you have a lot of moving parts all making use of the same data, we recommend looking at a tool like Pact to ensure that the data is being serialized and deserialized correctly.
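In its simplest, single-process form, a contract test is just "the consumer's model must accept the publisher's bytes" (names here are hypothetical; Pact generalizes this check across separately deployed services):

```python
import json

from pydantic import BaseModel

class TaskInputV1(BaseModel):
    user_id: int
    email: str

def publisher_encode(user_id: int, email: str) -> bytes:
    # Stand-in for whatever the publishing service actually sends.
    return json.dumps({"user_id": user_id, "email": email}).encode()

def test_consumer_can_parse_publisher_payload():
    payload = publisher_encode(1, "a@b.com")
    task = TaskInputV1.model_validate_json(payload)
    assert task.user_id == 1
```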

Schema validation tools like JSON Schema also exist, but if you are going to the trouble of maintaining a schema, we recommend skipping straight to something like Protocol Buffers or Avro, which provide better support for schema evolution and an optimized encoding.

Schema-Based Formats

Apache Thrift & Protocol Buffers

Apache Thrift and Protocol Buffers are binary encoding formats that are quite similar: both require you to describe the schema of your data in an interface definition language (IDL):

  • Thrift uses .thrift files
  • Protocol Buffers use .proto files

These IDLs support defining data types, services, schema evolution, and more. You can then generate code for the language you are using, which encodes and decodes your data in a compact binary format.
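As a sketch, here is a hypothetical task schema and the kind of Python usage protoc's generated code gives you (the module name follows protoc's *_pb2 convention; the schema itself is illustrative):

```python
# task.proto (compiled with: protoc --python_out=. task.proto)
#
#   syntax = "proto3";
#
#   message TaskInput {
#     int64  user_id  = 1;
#     string email    = 2;
#     int32  priority = 3;  // field numbers, not names, go on the wire
#   }

import task_pb2  # generated by protoc from the hypothetical task.proto above

msg = task_pb2.TaskInput(user_id=1, email="a@b.com", priority=3)
data = msg.SerializeToString()           # compact binary encoding
decoded = task_pb2.TaskInput.FromString(data)
assert decoded.user_id == 1
```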

These tools have obvious synergies with TacoQ: they guarantee that the publisher and consumer are using the same schema, with idiomatic generated code for the language you are using.

At Taco, we generally avoid code generation due to its complexity and maintenance cost, though we won't dive into the full rationale here. If you prefer a lower-overhead approach, Avro is a great alternative.

Apache Avro

Apache Avro is a binary format similar to Protocol Buffers and Thrift, but its IDL is written in JSON. You can then use a lightweight library like fastavro to encode and decode the data into your language's native objects.

It is lower overhead and does not require code generation (though generators exist!). It also supports resolving schemas dynamically: you can share the schema at runtime, either via an endpoint or by attaching it directly to the message, and the consumer can still read the data.
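A minimal round trip with fastavro looks like this (the schema and field names are illustrative):

```python
import io

from fastavro import parse_schema, schemaless_reader, schemaless_writer

# The schema is plain JSON, so it can be shared at runtime.
schema = parse_schema({
    "type": "record",
    "name": "TaskInput",
    "fields": [
        {"name": "user_id", "type": "long"},
        # A default makes this field safe to add without breaking old data.
        {"name": "priority", "type": "int", "default": 0},
    ],
})

buf = io.BytesIO()
schemaless_writer(buf, schema, {"user_id": 1, "priority": 3})

buf.seek(0)
print(schemaless_reader(buf, schema))  # {'user_id': 1, 'priority': 3}
```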

Check out our Python implementation. We believe it is simple and robust.

Breaking Changes

It is likely that you will eventually need to make breaking changes to a schema that cannot be handled by schema evolution. In this case, you should create an entirely new task, e.g. task_kind_v2.

Pattern: Task Redirection

For breaking schema changes, create a new task (e.g., task_kind_v2), just as you would with a REST endpoint. Refactor the old V1 task so it only transforms and forwards data to the new task. This way, you know for certain that V1 is consuming V1 data and V2 is consuming V2 data, without polluting the code with both versions in the same function. Remember that you now need to maintain two tasks; this can get very painful very quickly, so deprecate V1 as soon as you can.
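A framework-agnostic sketch of the redirection (function and field names are hypothetical, not TacoQ's actual API):

```python
def migrate_v1_to_v2(payload_v1: dict) -> dict:
    # Whatever transformation the breaking change requires.
    return {"user_id": payload_v1["user_id"], "emails": [payload_v1["email"]]}

def handle_task_kind_v2(payload_v2: dict) -> None:
    ...  # all real work lives here, against the V2 schema only

def handle_task_kind_v1(payload_v1: dict) -> None:
    # V1 no longer does any work of its own: it only adapts and forwards,
    # so each handler sees exactly one schema version.
    handle_task_kind_v2(migrate_v1_to_v2(payload_v1))
```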

Claim Check Pattern

When dealing with large payloads, remember that message brokers are optimized for small, fast messages, not large data blobs. Instead of pushing big payloads through the broker, store the data separately and reference it by ID. This is called the Claim Check Pattern.
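A sketch using S3 through boto3 (the bucket name and key layout are hypothetical; any blob store works):

```python
import json
import uuid

import boto3  # pip install boto3

s3 = boto3.client("s3")
BUCKET = "my-task-payloads"  # hypothetical bucket name

def publish_claim(large_payload: bytes) -> str:
    # Store the heavy payload out of band and pass only a small
    # claim ticket (the object key) through the broker.
    key = f"payloads/{uuid.uuid4()}"
    s3.put_object(Bucket=BUCKET, Key=key, Body=large_payload)
    return json.dumps({"payload_key": key})  # this is the broker message

def consume_claim(message: str) -> bytes:
    # Redeem the claim ticket to fetch the actual payload.
    key = json.loads(message)["payload_key"]
    return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
```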

Recommended Reading

This small guide can only do so much to explain the intricacies of serialization and versioning. If you want to do a deep dive into how to version your tasks, we recommend the following resources:

These are also great books in general, and we recommend them wholeheartedly!
