Streaming Technologies and Serialization Protocols
arXiv:2407.13494v1 [cs.SE] 18 Jul 2024

Abstract—Efficiently streaming high-volume data is essential for real-time data analytics, visualization, and AI and machine learning model training. Various streaming technologies and serialization protocols have been developed to meet different streaming needs. Together, they perform differently across various tasks and datasets. Therefore, when developing a streaming system, it can be challenging to make an informed decision on the suitable combination, as we encountered when implementing streaming for the UKAEA's MAST data or SKA's radio astronomy data. This study addresses this gap by proposing an empirical study of widely used data streaming technologies and serialization protocols. We introduce an extensible and open-source software framework to benchmark their efficiency across various performance metrics. Our findings reveal significant performance differences and trade-offs between these technologies. These insights can help in choosing suitable streaming and serialization solutions for contemporary data challenges. We aim to provide the scientific community and industry professionals with the knowledge to optimize data streaming for better data utilization and real-time analysis.

Index Terms—Data streaming, messaging systems, serialization protocols, web services, performance evaluation, empirical study, and applications.

I. INTRODUCTION

The pressing need to facilitate real-time data analysis [6] and leverage recent advancements in machine learning [7] further emphasizes the necessity for efficient data streaming technologies. These technologies must not only handle the sheer volume of data but also integrate seamlessly with analytical tools.

In this paper, we extend the work conducted in [8] for the SKA's radio astronomy data streaming and visualization. We explore an array of available streaming technologies. We consider the combination of two major choices of technology when implementing a streaming service: (a) the choice of a streaming system, which performs the necessary communication between two endpoints, and (b) the choice of encoding used to convert the data into transmittable formats. Our contributions are as follows:
• We provide a comprehensive review of 11 streaming technologies and 11 encoding methods, categorized by their underlying principles and operational frameworks.
• We introduce an extensible software framework designed to benchmark the efficiency of various combinations of streaming technology and serialization protocols, assessing them across 11 performance metrics.
• By testing 132 combinations, we offer a detailed comparison.
They also conducted a limited analysis on the serialization, deserialization, and transmission latency of two protocols, ProtoBuf and JSON. Our work builds on their research by covering a more extensive range of combinations.

Proos et al. [9] consider the performance of three different serialization formats (JSON, FlatBuffers, and ProtoBuf) and a mixture of three different messaging protocols (AMQP, MQTT, and CoAP). They evaluate the performance using real "CurrentStatus" messages from Scania vehicles as the JSON payload, monitoring communication between a desktop computer and a Raspberry Pi. They consider numerous evaluation metrics such as latency, message size, and serialization/deserialization speed.

The authors of [10] compare 10 different serialization formats for use with different types of micro-controllers and evaluate the size of the payload from each method. They test performance with two types of messages: 1) JSON payloads obtained from "public mqtt.eclipse.org messages" and 2) object classes from smartphone-centric studies [11], [12].

Fu and Zhang [13] presented a detailed review of different messaging systems. They evaluate each method in terms of throughput and latency when sending randomly generated text payloads. They evaluate each method only on the local device to avoid bias from any network specifics. Orthogonal to our work, they focus on evaluating the scaling of each system over a number of producers, consumers, and message queues.

Churchill et al. [6] explored using ADIOS2 [14] for transferring large amounts of tokamak diagnostic data from the K-STAR facility in Korea to the NERSC and PPPL facilities in the USA for near-real-time data analysis.

We differentiate our study from these related works by evaluating 1) a wide variety of different streaming technologies, both message broker-based and RPC-based; 2) a large number of data serialization formats, including text, binary, and protocol-based formats; 3) the combination of these technologies, developing an extensible framework for measuring and comparing serialization and streaming technologies; and 4) performance over 10 different metrics. We comprehensively evaluate 10 different streaming technologies with 12 different serialization methods over 8 different datasets.

III. BACKGROUND

In this paper, we study how the choice of streaming technologies and serialization protocols critically affects data transfer speed. Specifically, we analyze the application of popular messaging technologies and serialization protocols across diverse datasets used in machine learning. Before discussing our experimental setup and results, this section provides an overview of messaging systems and serialization protocols suitable for streaming data.

A. Serialization Protocols

In this section, we provide a brief overview of three different categories of serialization protocol: text formats, binary formats, and protocol formats.

1) Text Formats: Extensible Markup Language (XML) [15] is a markup language and data format developed by the World Wide Web Consortium. It is designed to store and transmit arbitrary data in a simple, human-readable format. XML adds context to data using tags with descriptive attributes for each data item. It has been extended to various derivative formats, such as XHTML and EXI.

JavaScript Object Notation (JSON) [16] is another human-readable data interchange format that represents data as a collection of nested key-value pairs. JSON is commonly used as the data exchange format in RESTful APIs. Due to its smaller payload size, it is often seen as a lower-overhead alternative to XML for data interchange.

YAML Ain't Markup Language (YAML) [17] is a simple text-based data format often used for configuration files. It is less verbose than XML and supports advanced features such as comments, extensible data types, and internal referencing.

2) Binary Formats: Binary JSON (BSON) [18] is a binary data format based on JSON, developed by MongoDB. Similar to JSON, BSON also represents data structures using key-value pairs. It was initially designed for use with the MongoDB NoSQL database but can be used independently of that system. BSON extends the JSON format with several data types that are not present in JSON, such as a datetime format.

Universal Binary JSON (UBJSON) [19] is another binary extension to the JSON format, created by Apache. UBJSON is designed according to the original philosophy of JSON and does not include additional data types, unlike BSON.

Concise Binary Object Representation (CBOR) [20] is also based on the JSON format. The major defining feature of CBOR is its extensibility, allowing the user to define custom tags that add context to complex data beyond the built-in primitives.

MessagePack [21] is a binary serialization format, again based on JSON. It was designed to achieve smaller payload sizes than BSON and supports over 50 programming languages.

Pickle [22] is a binary serialization format built into the Python programming language. It was primarily designed to offer a data interchange format for communicating between different Python instances.

3) Protocol Formats: Protocol Buffers (ProtoBuf) [23] were developed by Google as an efficient data interchange format, particularly optimized for inter-machine communication. Specifically, ProtoBuf is designed to facilitate remote procedure call (RPC) communication through gRPC [27]. Data structures used for communication are defined in .proto files, which are then compiled into generated code for various supported languages. During transmission, these data structures are serialized into a compact binary format that omits names, data types, and other identifiers, making it non-self-descriptive. Upon receipt, the messages are decoded using the shared protocol buffer definitions.

Thrift [24] is another binary data format, developed by the Apache Software Foundation, similar in many respects to ProtoBuf. In Thrift, data structures are also defined in a separate file, and these definitions are used to generate corresponding data structures in various supported languages.
TABLE I
A comparison of various serialization protocols. Type: describes how the method serializes data, whether in text or binary format, or relying on a common protocol. Human readable: indicates whether the serialization scheme is legible to a human reader. Defined schema: specifies whether producer and consumer share a common knowledge of the data format prior to transmission. Code generated schema: states whether the serialization requires code to be generated from a predefined protocol.

| Protocol              | Type     | Binary | Human Readable | Defined Schema | Code Generated Schema | Based On    |
| XML [15]              | Text     | ✕      | ✓              | ✕              | ✕                     |             |
| JSON [16]             | Text     | ✕      | ✓              | ✕              | ✕                     |             |
| YAML [17]             | Text     | ✕      | ✓              | ✕              | ✕                     |             |
| BSON [18]             | Binary   | ✓      | ✕              | ✕              | ✕                     | JSON        |
| UBJSON [19]           | Binary   | ✓      | ✕              | ✕              | ✕                     | JSON        |
| CBOR [20]             | Binary   | ✓      | ✓              | ✕              | ✕                     | MessagePack |
| MessagePack [21]      | Binary   | ✓      | ✕              | ✕              | ✕                     | JSON        |
| Pickle [22]           | Binary   | ✓      | ✕              | ✕              | ✕                     |             |
| ProtoBuf [23]         | Protocol | ✓      | ✕              | ✓              | ✓                     |             |
| Thrift [24]           | Protocol | ✓      | ✕              | ✓              | ✓                     |             |
| Capn'Proto [25]       | Protocol | ✓      | ✕              | ✓              | ✓                     |             |
| Avro [26], [12], [10] | Protocol | ✓      | ✓              | ✓              | ✕                     |             |
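The text-versus-binary distinction summarized in Table I can be made concrete with a toy measurement. This stdlib-only sketch (illustrative; not one of the benchmarked client libraries) encodes the same numeric samples once as JSON text and once as packed binary floats:

```python
import array
import json

# The same 1,000 numeric samples, encoded two ways.
values = [i * 0.001 for i in range(1000)]

text_payload = json.dumps(values).encode("utf-8")    # text format (JSON)
binary_payload = array.array("f", values).tobytes()  # packed 4-byte floats

print(len(text_payload), len(binary_payload))
```

The packed form is a fixed 4 bytes per value, while the JSON text spells each float out in decimal, so the text payload comes out several times larger; this is the payload overhead of markup-style formats described above.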
Before transmission, data is serialized into a binary format. Thrift is also designed for RPC communication and includes methods for defining services that use Thrift data structures. However, Thrift has a smaller number of supported data types compared to ProtoBuf.

Capn'Proto [25] is a protocol-based binary format that competes with ProtoBuf and Thrift. Capn'Proto differentiates itself with two main features. First, its internal data representation is identical to its encoded representation, which eliminates the need for a serialization step. Second, its RPC service implementation offers a unique feature called "time travel", enabling chained RPCs to be executed as a single request. Additionally, Capn'Proto offers a byte-packing method that reduces payload size, albeit at the expense of some increase in serialization time. In our experiments, we refer to the byte-packed version of Capn'Proto as "capnp-packed" to differentiate it from the unpacked version, "capnp".

Avro [26] is a schema-based binary serialization technology developed by Apache. Avro uses JSON to define schema data structures and namespaces. These schemas are shared between both producer and consumer. One of Avro's key advantages is its dynamic schema definition, which does not require code generation, unlike competitors such as ProtoBuf. Avro messages are also self-describing, meaning they can be decoded without needing access to the original schema.

We also considered the PSON format [28] and Zerializer [29]. PSON is a binary serialization format whose current implementation is limited to C++ and lacks Python bindings, which restricts its applicability for our study. Zerializer, on the other hand, necessitates a specific hardware setup, placing it outside the scope of our study due to practical constraints. Consequently, while these formats might offer potential advantages, their limitations in terms of language support and hardware requirements precluded their inclusion in our experimental evaluation.

A summary of serialization protocols can be found in Table I. The text-based formats represent data using text-based markup. While human-readable, text-based formats suffer from larger payload and serialization costs due to the overhead of the markup describing the data. In contrast, binary formats serialize the data to bytes before transmission. These formats are not human-readable but achieve a better payload size with lower serialization costs. Protocol-based formats also encode data in binary, but differ in that they rely on a predefined protocol definition shared between sender and receiver. Using a shared protocol allows more information to be omitted from the transmitted packet, yielding smaller payloads and faster serialization times.

B. Data Streaming Technologies

In this section, we discuss three different categories of data streaming technologies: message queue-based, RPC-based, and low-level.

1) Message Queues: ActiveMQ [30], developed in Java by Apache, is a flexible messaging system designed to support various communication protocols, including AMQP, STOMP, REST, XMPP, and OpenWire. The system's architecture is based on a controller-worker model, where the controller broker is synchronized with worker brokers. The system operates in two modes: topic mode and queuing mode. In topic mode, ActiveMQ employs a publish-subscribe (pub/sub) mechanism, where messages are transient and delivery is not guaranteed. Conversely, in queue mode, ActiveMQ utilizes a point-to-point messaging approach, storing messages on disk or in a database to ensure at-least-once delivery. For our experiments, we utilize the STOMP communication protocol.

Kafka [31] is a distributed event processing platform written in Scala and Java, initially developed by LinkedIn and now maintained by Apache. Kafka leverages the concept of topics and partitions to achieve parallelism and reliability. Consumers can subscribe to one or more topics, with each topic divided into multiple partitions. Each partition is read by a single consumer, ensuring message order within that partition. For enhanced reliability, topics and partitions are replicated across multiple brokers within a cluster. Kafka employs a peer-to-peer (P2P) architecture to synchronize brokers, with no single broker taking precedence over the others.
TABLE II
A comparison of different data streaming technologies.

| Name            | Type      | Queue Mode    | Consume Mode | Broker Architecture | Delivery Guarantee | Order Guarantee | Code Generated Protocol | Multiple Consumer |
| ActiveMQ [30]   | Messaging | Pub/Sub & P2P | Pull         | controller-worker   | at-least-once      | queue-order     | ✕                       | ✓                 |
| Kafka [31]      | Messaging | P2P           | Pull         | P2P                 | All                | partition-order | ✕                       | ✓                 |
| Pulsar [32]     | Messaging | P2P           | Push         | P2P                 | All                | global-order    | ✕                       | ✓                 |
| RabbitMQ [33]   | Messaging | Pub/Sub       | Push/Pull    | controller-worker   | at-least/most-once | None            | ✕                       | ✓                 |
| RocketMQ [34]   | Messaging | Pub/Sub       | Push/Pull    | controller-worker   | at-least-once      | queue-order     | ✕                       | ✓                 |
| Avro [35]       | RPC       | P2P           | Pull         | Brokerless          | None               | global-order    | ✕                       | ✕                 |
| Capn'Proto [36] | RPC       | P2P           | Pull         | Brokerless          | None               | global-order    | ✓                       | ✕                 |
| gRPC [37]       | RPC       | P2P           | Pull         | Brokerless          | None               | global-order    | ✓                       | ✕                 |
| Thrift [38]     | RPC       | P2P           | Pull         | Brokerless          | None               | global-order    | ✓                       | ✕                 |
| ZMQ [39]        | Low Level | P2P           | Pull         | Brokerless          | None               | global-order    | ✕                       | ✕                 |
| ADIOS2 [14]     | Low Level | P2P           | Pull         | Brokerless          | None               | global-order    | ✕                       | ✕                 |
Zookeeper [40] manages brokers within the cluster. Kafka uses TCP for communication between message queues and supports only pull-based message delivery to consumers, while persisting messages to disk for durability and fault tolerance.

RabbitMQ [33], developed by VMWare, is a widely used messaging system known for its robust support for various messaging protocols, including AMQP, STOMP, and MQTT. Implemented in the Erlang programming language, RabbitMQ leverages Erlang's inherent support for distributed computation, eliminating the need for a separate cluster manager. A RabbitMQ cluster consists of multiple brokers, each hosting an exchange and multiple queues. The exchange is bound to one queue per broker, with queues synchronized across brokers. One queue acts as the controller, while the others function as workers. RabbitMQ supports point-to-point communication and both push and pull consumer modes. Although message ordering is not guaranteed, RabbitMQ provides at-least-once and at-most-once delivery guarantees. RabbitMQ faces poor scalability due to the need to replicate each queue on every broker. Our experiments utilize the STOMP protocol for communication with the pika Python package.

RocketMQ [34], developed by Alibaba and written in Java, is a messaging system that employs a bespoke communication protocol. It defines a set of topics, each internally split into a set of queues. Each queue is hosted on a separate broker within the cluster, and queues are replicated using a controller-worker paradigm. Brokers can dynamically register with a name server, which manages cluster and query routing. RocketMQ guarantees message ordering and supports at-least-once delivery. Consumers may receive messages from RocketMQ using either push or pull modes. Message queuing is implemented using the pub/sub paradigm, and RocketMQ scales well with a large number of topics and consumers.

Pulsar [32], created by Yahoo and now maintained by Apache, is implemented in Java and designed to support a large number of consumers and topics while ensuring high reliability. Pulsar's innovative architecture separates message storage from the message broker. A cluster of brokers is managed by a load balancer (Zookeeper). Similar to Kafka, each topic is split into partitions. However, instead of storing messages within partitions on the broker, Pulsar stores partition references in bookies. These bookies are coordinated by a bookkeeper, which is also load-balanced using Zookeeper. Each partition is further split into several segments and distributed across different bookies. The separation of message storage from message brokers means that if an individual broker fails, it can be replaced with another broker without loss of information. Similarly, if a bookie fails, the replica information stored in other bookies can take over, ensuring data integrity. Pulsar's architecture allows it to offer a global ordering and delivery guarantee, although this high reliability and scalability come at the cost of extra communication overhead between brokers and bookies.

For a detailed overview of different message queue technologies, please refer to [13].

2) RPC Based: gRPC [37], developed by Google, is an RPC framework that utilizes ProtoBuf as its default serialization protocol. To define the available RPC calls for a client, gRPC requires a protocol definition written in ProtoBuf. While ProtoBuf is the standard, sending arbitrary bytes from other serialization protocols over gRPC is possible by defining a message type with a bytes field. The Python gRPC implementation supports synchronous and asynchronous (asyncio) communication. For all our experiments with gRPC, we use asynchronous communication.

Capn'Proto [36] and Thrift also have their own RPC frameworks. Similar to gRPC, these frameworks define remote procedure calls within their protocol definitions, using their own syntax specifications. Like gRPC, they allow the transmission of arbitrary bytes by defining a message with a bytes field.

Avro provides an RPC-based communication protocol as well. Unlike other RPC-based methods, Avro does not require the RPC protocol to be explicitly defined. This flexibility comes at the expense of stricter type validation, setting Avro apart from systems such as gRPC and Thrift.

3) Low Level: In addition to RPC and messaging systems, we consider two low-level communication systems: ZeroMQ and ADIOS2. Like RPC systems, they do not rely on an intermediate broker for message transmission.

ZeroMQ (ZMQ) [39] is a brokerless communications library developed by iMatix. It is a highly flexible messaging framework that uses TCP sockets and supports various messaging patterns, such as push/pull, pub/sub, request/reply, and many
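At their core, these low-level, brokerless systems move framed bytes over a socket between exactly two endpoints. As a rough stdlib illustration of that pattern (this is not ZeroMQ's or ADIOS2's actual API), the sketch below sends one length-prefixed, Pickle-encoded payload over an in-process socket pair:

```python
import pickle
import socket

# A connected pair of in-process sockets stands in for the TCP link
# that a brokerless library such as ZeroMQ would manage.
producer_sock, consumer_sock = socket.socketpair()

payload = pickle.dumps({"data": [1.0, 2.0, 3.0], "time": [0, 1, 2]})

# Producer side: length-prefix the payload so the consumer knows how
# many bytes to read (framing that ZeroMQ handles internally).
producer_sock.sendall(len(payload).to_bytes(4, "big") + payload)

# Consumer side: read the length, then the payload, then decode.
length = int.from_bytes(consumer_sock.recv(4), "big")
buf = b""
while len(buf) < length:
    buf += consumer_sock.recv(length - len(buf))
message = pickle.loads(buf)

producer_sock.close()
consumer_sock.close()
```

Real low-level libraries add reconnection, buffering, and the messaging patterns listed above on top of exactly this kind of framed byte transfer.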
Fig. 1. Illustrates the data flow from producer to consumer, indicating the places at which various performance metrics are recorded. These metrics include (A) L_o: object creation latency, (B) T_o: object creation throughput, (C) C: compression ratio, (D) L_s: serialization latency, (E) L_d: deserialization latency, (F) T_s: serialization throughput, (G) T_d: deserialization throughput, (H) L_trans: transmission latency, (I) T_trans: transmission throughput, (J) L_tot: total latency, and (K) T_tot: total throughput.
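The four timestamps recorded at these measurement points determine the latency metrics directly, and the throughputs follow by dividing the relevant size by the relevant latency. A minimal sketch of that bookkeeping (the function and field names here are illustrative, not the framework's actual API):

```python
def metrics_from_timestamps(t0, t1, t2, t3, payload_size, object_size):
    """t0: before serialization, t1: after serialization,
    t2: after transmission, t3: after deserialization (seconds)."""
    L_s = t1 - t0        # serialization latency
    L_trans = t2 - t1    # transmission latency, excluding encoding time
    L_d = t3 - t2        # deserialization latency
    L_tot = t3 - t0      # total end-to-end latency
    return {
        "L_s": L_s, "L_trans": L_trans, "L_d": L_d, "L_tot": L_tot,
        "T_trans": payload_size / L_trans,        # wire bytes per second
        "T_tot": object_size / L_tot,             # original bytes per second
        "C": 100.0 * payload_size / object_size,  # compression ratio (%)
    }

m = metrics_from_timestamps(0.0, 0.5, 1.0, 2.0, payload_size=500, object_size=1000)
print(m["T_trans"], m["C"])  # 1000.0 50.0
```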
Fig. 2. Diagram showing the architecture of our streaming framework. A Runner is used to create a Producer and Consumer pair for each type of
streaming technology. Both producer and consumer are instantiated with a Marshaler that encodes data to the desired format (e.g. JSON, ProtoBuf etc.).
Producers are created with a data stream object that generates data samples for transmission. Depending on the streaming method, the Consumer and
Producer may connect to an external message broker.
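The moving parts in Fig. 2 can be sketched in a few lines of Python. The classes below are illustrative stand-ins for the framework's components, with an in-process queue in place of a real broker or socket:

```python
import json
import queue

class Marshaler:
    """Encodes/decodes payloads; one such class per serialization protocol."""
    def encode(self, obj) -> bytes:
        return json.dumps(obj).encode("utf-8")
    def decode(self, data: bytes):
        return json.loads(data.decode("utf-8"))

class Producer:
    """Pulls samples from a data stream, encodes them, and sends them."""
    def __init__(self, data_stream, marshaler, transport):
        self.data_stream, self.marshaler, self.transport = data_stream, marshaler, transport
    def run(self):
        for sample in self.data_stream:
            self.transport.put(self.marshaler.encode(sample))

class Consumer:
    """Receives encoded messages and decodes them with a matching marshaler."""
    def __init__(self, marshaler, transport):
        self.marshaler, self.transport = marshaler, transport
        self.received = []
    def run(self, n):
        for _ in range(n):
            self.received.append(self.marshaler.decode(self.transport.get()))

# A runner wires a Producer/Consumer pair around a shared transport;
# the queue stands in for the external broker or socket.
transport = queue.Queue()
samples = [{"x": i} for i in range(3)]
producer = Producer(samples, Marshaler(), transport)
consumer = Consumer(Marshaler(), transport)
producer.run()
consumer.run(len(samples))
```

The essential design point mirrored here is that producer and consumer only need to agree on the marshaler; the transport underneath can be swapped without touching the encoding logic.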
deserialization protocol can handle, independent of the size of the data stream.

For streaming technologies, we consider two different performance metrics:

8) Transmission Latency (L_trans) – This is the time taken for a payload to be sent over the wire, excluding the time taken to encode the message.

9) Transmission Throughput (T_trans = Σᵢᴺ S_d(i) / L_trans) – This is similar to total throughput, but considers the payload size divided by the time taken to send the message over the wire, exclusive of the serialization time.

10) Total Latency (L_tot) – This is the total time for a payload to be transmitted from producer to consumer, inclusive of the serialization time.

11) Total Throughput (T_tot = Σᵢᴺ S_d(i) / L_tot) – This is the original data object size divided by the total time to send the message. Throughput measures the rate of bytes that can be communicated over the wire.

Finally, we also investigate the effect of batch size on the throughput. Grouping data into batches is a common requirement during machine learning training, and we show that increasing the batch size while lowering the number of communications has a positive effect on throughput.

We make a distinction between transmission time and total time (Fig. 1). The total time is the end-to-end transmission of a message, including the time to serialize the message and send it over the wire. Transmission time is the time taken to transmit the payload, excluding the serialization and deserialization times. Similarly, we can calculate total and transmission throughput.

B. Dataset

In our experiments, we consider eight different payloads, ranging from simple data to common machine learning workloads, and include fusion science data. Our goal is to cover a range of scenarios. This section briefly describes the datasets used to evaluate performance with various streaming technologies and serialization protocols.

• Numerical Primitives: As a baseline comparison, we use simple datasets consisting of randomly generated numerical primitives for int32, float32, and float64 types.

• BatchMatrix: A synthetic dataset where each message consists of a randomly generated 3D tensor of type float32 with shape {32, 100, 100} to simulate sending a batched set of image samples.

• Iris Data: This dataset uses the well-known Iris dataset [41], which contains an array of four float32 features and a one-dimensional string target variable.

• MNIST: We use the widely used MNIST machine learning image dataset [42] as a realistic example of streaming 2D tensor data.

• Scientific Papers: The scientific papers dataset is a well-known dataset in the field of NLP and text processing [43]. The dataset comprises 349,128 articles of text from PubMed and arXiv publications. Each sample is
represented as a collection of strings for properties such as article, abstract, and section names for transmission.

• Plasma Current Data: As a more realistic example of scientific data, we use plasma current data from the MAST tokamak [4]. Each set of plasma current data contains three 1D arrays of type float32: data, time, and errors. The "data" array represents the amount of current at each timestep, the "time" array represents the time the measurement was taken in seconds, and the "errors" array represents the standard deviation of the error in the measured current.

C. Implementation and Experimental Setup

We developed a framework to measure the performance of streaming and serialization technology. The architecture diagram of our framework is shown in Figure 2, which follows a service-oriented architecture [44], [45] and is implemented in Python. We used the appropriate Python client library for each streaming and serialization technology. The source code can be found in our GitHub repository [46].

The user interacts with the framework through a command-line interface. A test runner sets up both the server side and client side of the streaming test.

The server side requires the configuration of three components:

• DataStream: handles loading data for transmission. This can be any one of the payloads described in section IV-B.

• Producer: functions as the server side of the application. It packages data from the selected data stream and transmits it over the wire using the selected streaming technology, which may be any of the technologies described in section III-B.

• Marshaler: handles the serialization of the data from the stream using the specified serialization protocol. This can be any of those described in section III-A.

The configuration of the client side is similar, but only requires a marshaler to be configured to match the one used for the producer. It does not require knowledge of the data stream.

• Consumer: functions as the client side of the application. It receives data transmitted by the producer using the selected streaming technology, processes the incoming messages, and performs the necessary actions. Producers and consumers interact using a configured protocol.

• Broker: brokers required by the streaming protocol (e.g., for Kafka or RabbitMQ) are run externally from the test in the background. In our framework, we configure all brokers using docker-compose [47] to ensure that our broker configurations are reproducible for every test.

• Logger: is used by the marshaler to capture performance metrics for each test in a JSON file. For each message sent, the logger captures four timestamps: 1) before serialization, 2) after serialization, 3) after transmission, and 4) after deserialization. Using these four timestamps, we can calculate the serialization, deserialization, transmission, and total time. Additionally, the logger captures the payload size of each message immediately after serialization. With this additional information, we can calculate the average payload size and throughput of the streaming service.

ADIOS and ZeroMQ can directly send array data without copying the input array. However, to achieve this, the array data must be passed directly to the communication library without serialization. Therefore, we additionally consider ZeroMQ and ADIOS to have their own "native" encoding strategy for each stream, which is only used with their respective streaming protocol. This allows for a fair comparison with other technologies, because sending an encoded array with ADIOS or ZeroMQ incurs an additional copy that could be circumvented by properly using their zero-copy functionality.

Following the convention of previous work [13], [8], we run each streaming test locally, with the producer and consumer on the same machine, to avoid network-specific issues.

V. RESULTS

In this section, we present the results of our experiments with the combination of different streaming technologies, serialization protocols, and data streams.

1) Object Creation Latency – We use different datasets that originate from various data analysis types, like NumPy or Xarray datasets. Depending on the encoding protocol, we may need to copy the data from its native format to a specific format, like Capn'Proto or ProtoBuf objects. This copying process adds some overhead that should be taken into consideration. However, for encoding protocols like JSON, BSON, and Pickle that do not require format changes, we store the data in a Pydantic class. The results in Figure 3 show that for larger array datasets like BatchMatrix, Plasma, and MNIST, encoding methods such as ProtoBuf, Thrift, and Capn'Proto tend to have higher object creation latency, as they need to copy data into their own data types.

2) Object Creation Throughput – We consider the object creation throughput for each serialization method. The object creation time measures the time to convert data from the native data structure (such as a NumPy array) to the serialization format. Object creation time is important to consider if the format that the data will be used in differs from the format it is sent over the wire. Typically, object creation forces a copy of the data to be sent, which impacts the total throughput, especially when considering large array-like data.

Figure 4 shows the object creation throughput for each dataset and each encoding method. It is interesting to note here that protocol-based methods incur a greater penalty for object creation. This effect is more noticeable in larger datasets such as the BatchMatrix and Plasma datasets.

3) Compression Ratio – The payload size, a crucial performance metric of serialization protocols, is independent of the choice of streaming protocol. Therefore, we have calculated the average compression ratio over all runs for each serialization protocol.
Fig. 3. Object creation latency (L_o), measured in milliseconds, for various data streams (x-axis) and encoding methods (color).
Fig. 4. Object creation throughput (T_o), measured in megabytes per second, for various data streams (x-axis) and encoding methods (color).
Figure 5 presents the results for each protocol and data stream. Notably, Pickle, Avro, and XML consistently produce the largest payload sizes, often exceeding the original size. This is due to the inefficiency of their text-based encodings and the additional metadata tags they add as overhead. Pickle, a binary format for storing Python objects, is particularly known for its large sizes and is not optimal for encoding data for streaming.

The results show that the serialization protocol Capn'Proto outperforms the others in terms of payload size. The packed option of Capn'Proto, also known as capnp-packed, is responsible for additional size efficiency. The capnp-packed format is closely followed by several binary serialization formats that show similar performance. The reason behind this performance can be attributed to their ability to achieve near-identical compression, which is close to the limits of what is possible for that particular data stream.

Examining across data streams, it can be seen that the BatchMatrix dataset is fundamentally limited. This is because it is made up of randomly generated numbers, making it incompressible due to the lack of redundancy in the data. Conversely, for more realistic data such as MNIST and Plasma, a much higher compression ratio is achieved. Better compression is achieved for formats such as Capn'Proto Packed, which exploit the redundancy in the data to achieve greater compression. Text-based formats, such as YAML, JSON, XML, and Avro, achieve significantly worse compression in comparison. In fact, due to the extra markup required for these formats, they can produce a larger payload size than the original data.

4) Serialization Latency – The results for serialization time are shown in Figure 6. There is a clear trend across all data streams, from text-based protocols being the slowest (Avro, YAML, etc.) towards binary-encoded protocol-based methods (Capn'Proto, ProtoBuf, etc.) being the fastest. Binary-encoded but non-protocol methods fall in between these two extremes.

It is interesting to note that Capn'Proto has the fastest serialization time. This is likely due to the fact that Capn'Proto stores data in a format that is ready for serialization over the wire.

5) Deserialization Latency – The results for deserialization time are shown in Figure 7. Again, a clear trend may be seen across all data streams, from text-based protocols to binary-encoded protocol-based methods.

Like serialization, Capn'Proto is generally the fastest deserialization method across all tests. As mentioned above, this is likely due to Capn'Proto storing the data in a pre-serialized form.

6) Serialization Throughput – Figure 8 shows the average throughput for serialization of the data using different types
9
Fig. 5. The compression ratio (C) for various data streams (x-axis) and serialization protocols (color).
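The redundancy argument above can be reproduced in miniature. The sketch below is illustrative only: it substitutes two standard-library encoders (JSON as a text-based format, pickle as a binary one) for the full protocol matrix, and the two synthetic streams plus the 8-bytes-per-float baseline are our own assumptions, standing in for the incompressible BatchMatrix-style data and the redundant MNIST-style data.

```python
import json
import pickle
import random

def payload_ratio(values, serialize) -> float:
    """Serialized payload size as a percentage of the raw float64 footprint."""
    raw_bytes = 8 * len(values)  # 8 bytes per float64 value
    return 100 * len(serialize(values)) / raw_bytes

random.seed(0)
# Incompressible stream: random floats (analogous to BatchMatrix).
noise = [random.random() for _ in range(10_000)]
# Redundant stream: mostly zeros (analogous to sparse image pixels).
sparse = [0.0] * 9_000 + [random.random() for _ in range(1_000)]

encoders = {
    "json":   lambda v: json.dumps(v).encode(),
    "pickle": lambda v: pickle.dumps(v, protocol=pickle.HIGHEST_PROTOCOL),
}

for name, enc in encoders.items():
    print(f"{name:7s} noise = {payload_ratio(noise, enc):6.1f}%   "
          f"sparse = {payload_ratio(sparse, enc):6.1f}%")
```

Even this toy version shows the two effects discussed above: text encodings inflate random floats well past the original size, while redundant data shrinks, and the binary encoder stays close to the raw footprint regardless of content.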
Fig. 6. Serialization latency (Ls) for different data streams (x-axis) and encoding protocols (color). Protocol encoding methods such as Protobuf and Cap'n Proto consistently offer the best serialization performance. Text-based encodings (YAML, XML, etc.) add a large latency penalty to serialization by increasing the verbosity of the data.
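A minimal version of this latency measurement can be sketched with standard-library encoders. The toy batch, the repeat count, and the best-of-N aggregation below are our own assumptions rather than the framework's actual methodology; they simply demonstrate how Ls and Ld are timed separately around a single round trip through an encoder.

```python
import json
import pickle
import random
import time

def measure_ms(serialize, deserialize, obj, repeats: int = 20):
    """Return (L_s, L_d) in milliseconds, taking the best of `repeats` runs."""
    ser, de = [], []
    for _ in range(repeats):
        t0 = time.perf_counter()
        payload = serialize(obj)      # serialization latency L_s
        t1 = time.perf_counter()
        deserialize(payload)          # deserialization latency L_d
        t2 = time.perf_counter()
        ser.append(t1 - t0)
        de.append(t2 - t1)
    return 1e3 * min(ser), 1e3 * min(de)

random.seed(0)
# A toy stand-in for an image batch: 256 rows of 64 floats.
batch = [[random.random() for _ in range(64)] for _ in range(256)]

codecs = {
    "json":   (lambda o: json.dumps(o).encode(), json.loads),
    "pickle": (lambda o: pickle.dumps(o), pickle.loads),
}
results = {name: measure_ms(s, d, batch) for name, (s, d) in codecs.items()}
for name, (ls, ld) in results.items():
    print(f"{name:7s} L_s = {ls:7.3f} ms   L_d = {ld:7.3f} ms")
```

The text-versus-binary gap is visible even here: the text encoder spends most of its time formatting each float as a decimal string, which is exactly the verbosity penalty the figure captions describe.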
Fig. 7. Deserialization latency (Ld) for different data streams (x-axis) and encoding protocols (color). As with serialization, protocol encoding methods such as Protobuf and Cap'n Proto offer the best performance, while text-based encodings (YAML, XML, etc.) add a large latency penalty to deserialization by increasing the verbosity of the data.
of protocols. It is evident from the graph that serialization techniques based on protocols, such as ProtoBuf, Thrift, and Cap'n Proto, offer the highest serialization throughput. Binary methods that are protocol-independent offer moderate throughput performance, with the added advantage of greater flexibility compared with protocol-based methods. Text-based methods perform the worst due to their high serialization overhead.

Surprisingly, Avro also performs quite well by this metric. We believe this is because, despite being a human-readable text-based method, it is also a protocol-based method: both the producer and consumer are aware of the types and data structures being transmitted over the wire, facilitating faster throughput.

Fig. 8. Serialization (Ts) and deserialization (Td) throughput for each encoding, averaged over all data streams.

7) Deserialization Throughput – Figure 8 also shows the average throughput for deserialization of the data using different types of protocols. It is noticeable that deserialization throughput is lower than serialization throughput across all methods, indicating that deserialization is a main bottleneck to transmission.

8) Transmission Latency – Figure 9 shows the transmission latency for various combinations of serialization and streaming technologies. The heatmap for each combination is sorted by the average latency, from lowest to highest, for each streaming technology.

Across all technologies, it is observed that transmission latency depends largely on the choice of streaming technology rather than the choice of serialization protocol. Message streaming technologies require a broker as an intermediary, which increases the overall latency, whereas RPC technologies have no broker and hence lower latency. Among messaging technologies, RabbitMQ performs better with larger payloads, while ActiveMQ achieves lower latency with smaller payloads but performs worst on the largest payload (e.g., BatchMatrix). Among RPC-based methods, Thrift consistently has the lowest latency except for the BatchMatrix stream, where Cap'n Proto narrowly beats Thrift.

With larger payloads, such as the BatchMatrix and Plasma data streams, the impact of the serialization protocol becomes more noticeable. It is challenging to identify a trend between encoding protocols in terms of latency, except that it is crucial to note the inefficiency of using XML and YAML for larger payloads.

For the BatchMatrix data stream, an issue arises when sending a large YAML-encoded payload through the Python API, which causes ADIOS to produce a segmentation fault. Therefore, the corresponding latency and throughput values are NaN, shown as empty cells in Figure 9.

9) Transmission Throughput – By examining the throughput, we gain a better understanding of how different protocols affect transmission. Figure 10 shows that RPC methods achieve higher transmission throughput than message streaming technologies. When dealing with larger payloads, such as the BatchMatrix tests, ZeroMQ offers the lower latency because it avoids the overhead of copying the data into a new structure, as is the case with Thrift or Protobuf. Among the broker-based methods, RabbitMQ consistently performs well.

When it comes to encoding methods, protocol-based methods generally perform the best across all datasets and streaming methods. However, it is not clear which method offers the lowest latency in general. Protocol-based methods can achieve high throughput by mixing encoding protocols and RPC frameworks. For example, considering the MNIST dataset, Cap'n Proto achieves the lowest latency with the Thrift protocol.

10) Total Latency – There is a clear trend towards protocol encoding for complex data sets such as Iris, MNIST, and Plasma. Among streaming technologies, Thrift generally shows the best performance.

11) Total Throughput – Figure 12 shows the total throughput, which is consistent with the total latency results discussed in the previous section. Protocol-based methods achieve the highest throughput. Among all the serialization protocols, Thrift is generally the best-performing one. ZeroMQ performs well with the biggest dataset, BatchMatrix. Although the best encoding method is inconclusive, there is a trend toward protocol-based methods, which give the highest throughput.

12) Effect of Batch Size on Throughput – In machine learning applications, data is often processed in batches. Our findings underscore the potential of batching data before transmission to enhance throughput. However, it is crucial to
Fig. 9. Each subplot or heatmap shows transmission latency (Ltrans ) for different serialization protocols (vertical axis) and streaming technologies (horizontal
axis). Dark red indicates higher latency.
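The broker-versus-direct effect discussed above can be imitated locally. In the sketch below, a plain socket pair stands in for an RPC-style direct connection and a forwarding thread stands in for a broker hop; none of the benchmarked systems are involved, and the payload size and repeat count are arbitrary illustrative choices.

```python
import socket
import threading
import time

PAYLOAD = b"x" * 4_096  # small enough to fit default socket buffers everywhere

def one_way_latency_ms(hops: int, repeats: int = 50) -> float:
    """Median one-way latency (ms) through `hops` chained socket pairs.

    hops=1 models a direct (RPC-style) connection; hops=2 adds one
    forwarding thread, standing in for a broker between the endpoints.
    """
    pairs = [socket.socketpair() for _ in range(hops)]

    def relay(src, dst):
        try:
            while True:
                data = src.recv(1 << 16)
                if not data:
                    return
                dst.sendall(data)  # extra copy: the "broker" cost
        except OSError:
            return  # sockets closed at the end of the measurement

    for (_, rx), (tx, _) in zip(pairs, pairs[1:]):
        threading.Thread(target=relay, args=(rx, tx), daemon=True).start()

    sender, receiver = pairs[0][0], pairs[-1][1]
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        sender.sendall(PAYLOAD)
        got = 0
        while got < len(PAYLOAD):
            got += len(receiver.recv(1 << 16))
        samples.append(time.perf_counter() - t0)
    for a, b in pairs:
        a.close()
        b.close()
    return 1e3 * sorted(samples)[len(samples) // 2]

direct = one_way_latency_ms(hops=1)
brokered = one_way_latency_ms(hops=2)
print(f"direct   (no intermediary): {direct:.3f} ms")
print(f"brokered (one extra hop) : {brokered:.3f} ms")
```

The intermediary adds a full receive-copy-send cycle and a thread handoff per message, which is the same structural overhead that separates the broker-based columns from the RPC columns in the heatmaps.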
Fig. 10. Each subplot or heatmap shows transmission throughput (Ttrans ) for different serialization protocols (vertical axis) and streaming technologies
(horizontal axis).
Fig. 11. Each subplot or heatmap shows total latency (Ltot ) for different serialization protocols (vertical axis), and streaming technologies (horizontal axis).
Fig. 12. Each subplot or heatmap shows total throughput (Ttot ) for different serialization protocols (vertical axis), and streaming technologies (horizontal
axis).
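How the total figures combine the component costs can be sketched end to end. The snippet below is a toy stand-in, not the benchmark framework itself: a local socket pair replaces the streaming technology, standard-library encoders replace the protocol matrix, and the record is invented for illustration. It computes a throughput of the form Ttot = payload bytes / (Ls + Ltrans + Ld).

```python
import json
import pickle
import socket
import time

def recv_exact(sock, n: int) -> bytes:
    """Read exactly n bytes from the socket."""
    buf = b""
    while len(buf) < n:
        buf += sock.recv(n - len(buf))
    return buf

def total_throughput_mb_s(obj, serialize, deserialize, repeats: int = 20) -> float:
    """Best-of-N total throughput (MB/s) over a local socket pair."""
    tx, rx = socket.socketpair()
    best = 0.0
    for _ in range(repeats):
        t0 = time.perf_counter()
        payload = serialize(obj)                                   # L_s
        tx.sendall(len(payload).to_bytes(4, "big") + payload)      # L_trans
        size = int.from_bytes(recv_exact(rx, 4), "big")
        deserialize(recv_exact(rx, size))                          # L_d
        elapsed = time.perf_counter() - t0
        best = max(best, len(payload) / elapsed / 1e6)
    tx.close()
    rx.close()
    return best

record = {"name": "toy", "signal": [i * 0.25 for i in range(500)]}

results = {
    "json":   total_throughput_mb_s(record, lambda o: json.dumps(o).encode(), json.loads),
    "pickle": total_throughput_mb_s(record, pickle.dumps, pickle.loads),
}
for name, t in results.items():
    print(f"{name:7s} {t:8.2f} MB/s")
```

Because all three latencies sit in one denominator, a slow encoder drags the total down even over an essentially free local transport, which mirrors why encoding choice still shows up in the total-throughput heatmaps despite transport dominating.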
Fig. 13. Total throughput Ttot for various batch sizes (e.g., ranging from 1 to 256) for the MNIST dataset.
note that the batch size can significantly impact the throughput of the method used.

Figure 13 shows the throughput for the MNIST dataset with a variable batch size. As the batch size first increases, the throughput drops due to the increased overhead of copying and serializing data for transmission. However, when the batch size is increased beyond 32 images per batch, the overall throughput begins to improve because fewer packets need to be communicated over the network. For binary and protocol encoding methods, increasing the batch size is shown to further increase the throughput. This observation is consistent with previous results; generally, protocol-based methods offer the best throughput. At larger batch sizes (> 128), the throughput continues to increase: transmission is significantly slower than serialization and deserialization, so grouping many examples into a single transmission improves throughput.

VI. DISCUSSION

We can draw several conclusions based on the experiments presented in this work. We identify the following key points from our results:

A. Recommendations

RPC systems are faster than messaging broker systems because of the overhead of the intermediate broker. This makes RPC highly efficient for high-throughput, low-latency transmission of large data, although RPC systems do not offer the same delivery guarantees as message broker systems.

We found that the choice of messaging technology has a greater impact than the encoding protocol.

Protocol-based encoding methods such as Cap'n Proto and ProtoBuf perform best for complex data that can be compressed, while MessagePack is a competitive choice for smaller or random data. Protocol-based encoding methods offer the fastest serialization and best compression, with Thrift offering the best throughput and Cap'n Proto offering the best compression. Binary encoding methods offer more flexibility at the cost of slower encoding speed; among the binary encoding methods we tested, MessagePack generally performed the best. Considering text-based protocols, JSON offered the best performance due to its lightweight markup and smaller payload size compared to YAML or Avro.

Among the different messaging technologies, we generally found that Apache Thrift achieves very high throughput and low latency across various scenarios. Among message broker systems, RabbitMQ generally demonstrates the best performance.

Surprisingly, we did not observe much of a difference when combining different protocol-based encoding and messaging systems. We hypothesized that ProtoBuf would be most efficient when combined with gRPC, or that Cap'n Proto would perform best with its own RPC implementation. However, this appears not to be the case.

Larger batch sizes facilitate higher throughput for array datasets when using either a binary or protocol-based serialization method, as shown by our throughput and batch size experiment in Figure 13. For text-based encoding methods, the required markup and the lack of compression destroy any advantage of batching data for transmission.

B. Limitations and future directions

One notable limitation is that this study did not investigate the potential of scaling with multiple clients. Previous research has examined this aspect for message queuing systems [13]. A future study could focus on examining the reliability of various RPC technologies based on the number of consumers.

VII. CONCLUSION

In this work, we investigated 132 combinations of different encoding methods and messaging technologies. We evaluated their performance across 11 different metrics and benchmarked each combination against 6 different datasets, ranging from toy datasets to machine learning and scientific data from the fusion energy domain. We found that the messaging technology has the biggest impact on performance, regardless of the specific serialization method used. Protocol-based encoding methods offered the best performance, with the highest throughput and lowest latency, but at the expense of flexibility and robustness. Notably, we did not see much difference when combining different protocol-based encoding and messaging systems. Finally, we found that
the batch size affects the data throughput for all binary and protocol-based encoding methods.

CONTRIBUTION

SJ: Designed and implemented the experimental framework, shaping the research methodology, and contributed to the writing and conceptualization of the paper. NC: Provided the MAST data for the study and offered expertise in the fusion domain, enhancing the scientific rigor of this empirical study, and edited and refined the manuscript. SK: Provided technical supervision and introduced the core idea, building upon SK's prior work at the University of Oxford, and contributed to the writing and editing of the paper, figures, and plots.

ACKNOWLEDGMENT

We would like to thank our colleagues at UKAEA and STFC for supporting the FAIR-MAST project. Additionally, we would like to thank Stephen Dixon, Jonathan Hollocombe, Adam Parker, Lucy Kogan, and Jimmy Measures from UKAEA for assisting our understanding of the fusion data. We would also like to extend our thanks to the wider FAIR-MAST project, which includes Shaun De Witt, James Hodson, Stanislas Pamela, and Rob Akers from UKAEA and Jeyan Thiyagalingam from STFC. We also want to extend our gratitude to the MAST Team for their efforts in collecting and curating the raw diagnostic source data during the operation of the MAST experiment.

REFERENCES

[1] T. Hey, K. Butler, S. Jackson, and J. Thiyagalingam, "Machine learning and big scientific data," vol. 378, no. 2166, p. 20190054. [Online]. Available: https://royalsocietypublishing.org/doi/10.1098/rsta.2019.0054
[2] M. D. Wilkinson et al., "The FAIR Guiding Principles for scientific data management and stewardship," Scientific Data, vol. 3, no. 1, p. 160018, Mar. 2016.
[3] P. Rocca-Serra et al., "The FAIR Cookbook – the essential resource for and by FAIR doers," Scientific Data, vol. 10, no. 1, p. 292, May 2023.
[4] A. Sykes et al., "First physics results from the MAST Mega-Amp Spherical Tokamak," Physics of Plasmas, vol. 8, no. 5, pp. 2101–2106, May 2001.
[5] J. R. Harrison et al., "Overview of new MAST physics in anticipation of first results from MAST Upgrade," Nuclear Fusion, vol. 59, no. 11, p. 112011, Jun. 2019.
[6] R. M. Churchill et al., "A Framework for International Collaboration on ITER Using Large-Scale Data Transfer to Enable Near-Real-Time Analysis," Fusion Science and Technology, vol. 77, no. 2, pp. 98–108, Feb. 2021.
[7] A. Pavone, A. Merlo, S. Kwak, and J. Svensson, "Machine learning and Bayesian inference in nuclear fusion research: An overview," Plasma Physics and Controlled Fusion, vol. 65, no. 5, p. 053001, Apr. 2023.
[8] S. Khan, E. Rydow, S. Etemaditajbakhsh, K. Adamek, and W. Armour, "Web Performance Evaluation of High Volume Streaming Data Visualization," IEEE Access, vol. 11, pp. 15623–15636, 2023.
[9] D. P. Proos and N. Carlsson, "Performance Comparison of Messaging Protocols and Serialization Formats for Digital Twins in IoV," in 2020 IFIP Networking Conference (Networking), Jun. 2020, pp. 10–18.
[10] D. Friesel and O. Spinczyk, "Data Serialization Formats for the Internet of Things," Electronic Communications of the EASST, vol. 80, Sep. 2021. [Online]. Available: https://journal.ub.tu-berlin.de/eceasst/article/view/1134
[11] B. Petersen, H. Bindner, S. You, and B. Poulsen, "Smart grid serialization comparison: Comparison of serialization for distributed control in the context of the Internet of Things," in 2017 Computing Conference, Jul. 2017, pp. 1339–1346.
[12] A. Sumaray and S. K. Makki, "A comparison of data serialization formats for optimal efficiency on a mobile platform," in Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication, Kuala Lumpur, Malaysia: ACM, Feb. 2012, pp. 1–6. [Online]. Available: https://dl.acm.org/doi/10.1145/2184751.2184810
[13] G. Fu, Y. Zhang, and G. Yu, "A Fair Comparison of Message Queuing Systems," IEEE Access, vol. 9, pp. 421–432, 2021.
[14] W. F. Godoy et al., "ADIOS 2: The Adaptable Input Output System. A framework for high-performance data management," SoftwareX, vol. 12, p. 100561, Jul. 2020.
[15] "Extensible Markup Language (XML) 1.1 (Second Edition)." [Online]. Available: https://www.w3.org/TR/2006/REC-xml11-20060816/
[16] T. Bray, "The JavaScript Object Notation (JSON) data interchange format," Internet Engineering Task Force, RFC 8259, Dec. 2017. [Online]. Available: https://datatracker.ietf.org/doc/std90
[17] "YAML Ain't Markup Language (YAML) revision 1.2.2." [Online]. Available: https://yaml.org/spec/1.2.2/
[18] "BSON (Binary JSON): Specification." [Online]. Available: https://bsonspec.org/spec.html
[19] "Universal Binary JSON Specification." [Online]. Available: https://ubjson.org/
[20] "CBOR – Concise Binary Object Representation." [Online]. Available: https://cbor.io/
[21] "MessagePack." [Online]. Available: https://github.com/msgpack/msgpack
[22] "PEP 3154 – Pickle protocol version 4." [Online]. Available: https://peps.python.org/pep-3154/
[23] "Protocol Buffers Version 3 Language Specification." [Online]. Available: https://protobuf.dev/reference/protobuf/proto3-spec/
[24] M. Slee, A. Agarwal, and M. Kwiatkowski, "Thrift: Scalable Cross-Language Services Implementation."
[25] "Cap'n Proto: Cap'n Proto, FlatBuffers, and SBE." [Online]. Available: https://capnproto.org/news/2014-06-17-capnproto-flatbuffers-sbe.html
[26] "Apache Avro Specification." [Online]. Available: https://avro.apache.org/docs/1.11.1/specification/
[27] "gRPC." [Online]. Available: https://grpc.io/
[28] A. Luis, P. Casares, J. J. Cuadrado-Gallego, and M. A. Patricio, "PSON: A Serialization Format for IoT Sensor Networks," Sensors, vol. 21, no. 13, p. 4559, Jan. 2021. [Online]. Available: https://www.mdpi.com/1424-8220/21/13/4559
[29] A. Wolnikowski, S. Ibanez, J. Stone, C. Kim, R. Manohar, and R. Soulé, "Zerializer: Towards Zero-Copy Serialization," in Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS '21), Ann Arbor, MI, USA: ACM, 2021, pp. 206–212. [Online]. Available: https://doi.org/10.1145/3458336.3465283
[30] "ActiveMQ." [Online]. Available: https://activemq.apache.org/
[31] "Apache Kafka." [Online]. Available: https://kafka.apache.org/
[32] "Apache Pulsar." [Online]. Available: https://pulsar.apache.org/
[33] "RabbitMQ: Easy to use, flexible messaging and streaming." [Online]. Available: https://rabbitmq.com/
[34] "RocketMQ." [Online]. Available: https://rocketmq.apache.org/
[35] "Apache Avro." [Online]. Available: https://avro.apache.org/
[36] "Cap'n Proto." [Online]. Available: https://capnproto.org/
[37] "gRPC." [Online]. Available: https://grpc.io/
[38] "Apache Thrift." [Online]. Available: https://thrift.apache.org/
[39] "ZeroMQ." [Online]. Available: https://zeromq.org/
[40] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed, "ZooKeeper: Wait-free coordination for Internet-scale systems."
[41] R. A. Fisher, "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.
[42] L. Deng, "The MNIST database of handwritten digit images for machine learning research," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141–142, 2012.
[43] A. Cohan et al., "A discourse-aware attention model for abstractive summarization of long documents," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 2018. [Online]. Available: http://dx.doi.org/10.18653/v1/n18-2097
[44] S. Khan and D. Wallom, "A System for Organizing, Collecting, and Presenting Open-Source Intelligence," Journal of Data, Information and Management, vol. 4, no. 2, pp. 107–117, Jun. 2022.
[45] S. Khan, P. H. Nguyen, A. Abdul-Rahman, E. Freeman, C. Turkay, and M. Chen, "Rapid Development of a Data Visualization Service in an Emergency Response," IEEE Transactions on Services Computing, vol. 15, no. 3, pp. 1251–1264, 2022.
[46] "GitHub: Streaming performance analysis." [Online]. Available: https://github.com/stfc-sciml/streaming-performance-analysis
[47] D. Merkel, "Docker: lightweight Linux containers for consistent development and deployment," Linux Journal, vol. 2014, no. 239, p. 2, 2014.