Streaming Technologies and Serialization Protocols
arXiv:2407.13494v1 [cs.SE] 18 Jul 2024

Abstract—Efficiently streaming high-volume data is essential for real-time data analytics, visualization, and AI and machine learning model training. Various streaming technologies and serialization protocols have been developed to meet different streaming needs. Together, they perform differently across various tasks and datasets. Therefore, when developing a streaming system, it can be challenging to make an informed decision on the suitable combination, as we encountered when implementing streaming for the UKAEA's MAST data or SKA's radio astronomy data. This study addresses this gap by proposing an empirical study of widely used data streaming technologies and serialization protocols. We introduce an extensible and open-source software framework to benchmark their efficiency across various performance metrics. Our findings reveal significant performance differences and trade-offs between these technologies. These insights can help in choosing suitable streaming and serialization solutions for contemporary data challenges. We aim to provide the scientific community and industry professionals with the knowledge to optimize data streaming for better data utilization and real-time analysis.

Index Terms—Data streaming, messaging systems, serialization protocols, web services, performance evaluation, empirical study, and applications.

I. INTRODUCTION

The pressing need to facilitate real-time data analysis [6] and leverage recent advancements in machine learning [7] further emphasizes the necessity for efficient data streaming technologies. These technologies must not only handle the sheer volume of data but also integrate seamlessly with analytical tools.

In this paper, we extend the work conducted in [8] for the SKA's radio astronomy data streaming and visualization. We explore an array of available streaming technologies. We consider the combination of two major choices of technology when implementing a streaming service: (a) the choice of a streaming system, which performs the necessary communication between two endpoints, and (b) the choice of encoding used to convert the data into transmittable formats. Our contributions are as follows:
• We provide a comprehensive review of 11 streaming technologies and 11 encoding methods, categorized by their underlying principles and operational frameworks.
• We introduce an extensible software framework designed to benchmark the efficiency of various combinations of streaming technology and serialization protocols, assessing them across 11 performance metrics.
• By testing 132 combinations, we offer a detailed comparison.
They also conducted a limited analysis on the serialization, deserialization, and transmission latency of two protocols, ProtoBuf and JSON. Our work builds on their research by covering a more extensive range of combinations.

Proos et al. [9] consider the performance of three different serialization formats (JSON, FlatBuffers, and ProtoBuf) and a mixture of three different messaging protocols (AMQP, MQTT, and CoAP). They evaluate the performance using real "CurrentStatus" messages from Scania vehicles as the JSON payload, monitoring communication between a desktop computer and a Raspberry Pi. They consider numerous evaluation metrics such as latency, message size, and serialization/deserialization speed.

The authors of [10] compare 10 different serialization formats for use with different types of micro-controllers and evaluate the size of the payload from each method. They test performance with two types of messages: 1) JSON payloads obtained from "public mqtt.eclipse.org messages" and 2) object classes from smartphone-centric studies [11], [12].

Fu and Zhang [13] presented a detailed review of different messaging systems. They evaluate each method in terms of throughput and latency when sending randomly generated text payloads. They evaluate each method only on the local device to avoid bias from any network specifics. Orthogonal to our work, they focus on evaluating the scaling of each system over a number of producers, consumers, and message queues.

Churchill et al. [6] explored using ADIOS2 [14] for transferring large amounts of tokamak diagnostic data from the K-STAR facility in Korea to the NERSC and PPPL facilities in the USA for near-real-time data analysis.

We differentiate our study from these related works by evaluating 1) a wide variety of different streaming technologies, both message broker-based and RPC-based; 2) a large number of data serialization formats, including text, binary, and protocol-based formats; 3) the combination of these technologies, developing an extensible framework for measuring and comparing serialization and streaming technologies; and 4) performance over 10 different metrics. We comprehensively evaluate 10 different streaming technologies with 12 different serialization methods over 8 different datasets.

III. BACKGROUND

In this paper, we study how the choice of streaming technologies and serialization protocols critically affects data transfer speed. Specifically, we analyze the application of popular messaging technologies and serialization protocols across diverse datasets used in machine learning. Before discussing our experimental setup and results, this section provides an overview of messaging systems and serialization protocols suitable for streaming data.

A. Serialization Protocols

In this section, we provide a brief overview of three different categories of serialization protocol: text formats, binary formats, and protocol formats.

1) Text Formats: Extensible Markup Language (XML) [15] is a markup language and data format developed by the World Wide Web Consortium. It is designed to store and transmit arbitrary data in a simple, human-readable format. XML adds context to data using tags with descriptive attributes for each data item. It has been extended to various derivative formats, such as XHTML and EXI.

JavaScript Object Notation (JSON) [16] is another human-readable data interchange format that represents data as a collection of nested key-value pairs. JSON is commonly used as the data exchange format in RESTful APIs. Due to its smaller payload size, it is often seen as a lower-overhead alternative to XML for data interchange.

YAML Ain't Markup Language (YAML) [17] is a simple text-based data format often used for configuration files. It is less verbose than XML and supports advanced features such as comments, extensible data types, and internal referencing.

2) Binary Formats: Binary JSON (BSON) [18] is a binary data format based on JSON, developed by MongoDB. Similar to JSON, BSON also represents data structures using key-value pairs. It was initially designed for use with the MongoDB NoSQL database but can be used independently of that system. BSON extends the JSON format with several data types that are not present in JSON, such as a datetime format.

Universal Binary JSON (UBJSON) [19] is another binary extension to the JSON format, created by Apache. UBJSON is designed according to the original philosophy of JSON and does not include additional data types, unlike BSON.

Concise Binary Object Representation (CBOR) [20] is also based on the JSON format. The major defining feature of CBOR is its extensibility, allowing the user to define custom tags that add context to complex data beyond the built-in primitives.

MessagePack [21] is a binary serialization format, again based on JSON. It was designed to achieve smaller payload sizes than BSON and supports over 50 programming languages.

Pickle [22] is a binary serialization format built into the Python programming language. It was primarily designed to offer a data interchange format for communicating between different Python instances.

3) Protocol Formats: Protocol Buffers (ProtoBuf) [23] were developed by Google as an efficient data interchange format, particularly optimized for inter-machine communication. Specifically, ProtoBuf is designed to facilitate remote procedure call (RPC) communication through gRPC [27]. Data structures used for communication are defined in .proto files, which are then compiled into generated code for various supported languages. During transmission, these data structures are serialized into a compact binary format that omits names, data types, and other identifiers, making it non-self-descriptive. Upon receipt, the messages are decoded using the shared protocol buffer definitions.

Thrift [24] is another binary data format, developed by the Apache Software Foundation, similar in many respects to ProtoBuf. In Thrift, data structures are also defined in a separate file, and these definitions are used to generate corresponding data structures in various supported languages.
TABLE I
A comparison of various serialization protocols. Type: describes how the method serializes data, whether in text or binary format, or relying on a common protocol. Human readable: indicates whether the serialization scheme is legible to a human reader. Defined schema: specifies whether producer and consumer share a common knowledge of the data format prior to transmission. Code generated schema: states whether the serialization requires code to be generated from a predefined protocol.

| Protocol              | Type     | Binary | Human Readable | Defined Schema | Code Generated Schema | Based On    |
| XML [15]              | Text     | ✕      | ✓              | ✕              | ✕                     |             |
| JSON [16]             | Text     | ✕      | ✓              | ✕              | ✕                     |             |
| YAML [17]             | Text     | ✕      | ✓              | ✕              | ✕                     |             |
| BSON [18]             | Binary   | ✓      | ✕              | ✕              | ✕                     | JSON        |
| UBJSON [19]           | Binary   | ✓      | ✕              | ✕              | ✕                     | JSON        |
| CBOR [20]             | Binary   | ✓      | ✓              | ✕              | ✕                     | MessagePack |
| MessagePack [21]      | Binary   | ✓      | ✕              | ✕              | ✕                     | JSON        |
| Pickle [22]           | Binary   | ✓      | ✕              | ✕              | ✕                     |             |
| ProtoBuf [23]         | Protocol | ✓      | ✕              | ✓              | ✓                     |             |
| Thrift [24]           | Protocol | ✓      | ✕              | ✓              | ✓                     |             |
| Capn'Proto [25]       | Protocol | ✓      | ✕              | ✓              | ✓                     |             |
| Avro [26], [12], [10] | Protocol | ✓      | ✓              | ✓              | ✕                     |             |
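The text-versus-binary distinction summarized in Table I can be made concrete with a toy measurement. This stdlib-only sketch (illustrative; not one of the benchmarked client libraries) encodes the same numeric samples once as JSON text and once as packed binary floats:

```python
import array
import json

# The same 1,000 numeric samples, encoded two ways.
values = [i * 0.001 for i in range(1000)]

text_payload = json.dumps(values).encode("utf-8")    # text format (JSON)
binary_payload = array.array("f", values).tobytes()  # packed 4-byte floats

print(len(text_payload), len(binary_payload))
```

The packed form is a fixed 4 bytes per value, while the JSON text spells each float out in decimal, so the text payload comes out several times larger; this is the payload overhead of markup-style formats described above.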
Before transmission, data is serialized into a binary format. Thrift is also designed for RPC communication and includes methods for defining services that use Thrift data structures. However, Thrift has a smaller number of supported data types compared to ProtoBuf.

Capn'Proto [25] is a protocol-based binary format that competes with ProtoBuf and Thrift. Capn'Proto differentiates itself with two main features. First, its internal data representation is identical to its encoded representation, which eliminates the need for a serialization step. Second, its RPC service implementation offers a unique feature called "time travel", enabling chained RPCs to be executed as a single request. Additionally, Capn'Proto offers a byte-packing method that reduces payload size, albeit at the expense of some increase in serialization time. In our experiments, we refer to the byte-packed version of Capn'Proto as "capnp-packed" to differentiate it from the unpacked version, "capnp".

Avro [26] is a schema-based binary serialization technology developed by Apache. Avro uses JSON to define schema data structures and namespaces. These schemas are shared between both producer and consumer. One of Avro's key advantages is its dynamic schema definition, which does not require code generation, unlike competitors such as ProtoBuf. Avro messages are also self-describing, meaning they can be decoded without needing access to the original schema.

We also considered the PSON format [28] and Zerializer [29]. PSON is a binary serialization format whose current implementation is limited to C++ and lacks Python bindings, which restricts its applicability for our study. Zerializer, on the other hand, necessitates a specific hardware setup, placing it outside the scope of our study due to practical constraints. Consequently, while these formats might offer potential advantages, their limitations in terms of language support and hardware requirements precluded their inclusion in our experimental evaluation.

A summary of serialization protocols can be found in Table I. The text-based formats represent data using text-based markup. While human-readable, text-based formats suffer from larger payload and serialization costs due to the overhead of the markup describing the data. In contrast, binary formats serialize the data to bytes before transmission. These formats are not human-readable but achieve a better payload size with lower serialization costs. Protocol-based formats also encode data in binary, but differ in that they rely on a predefined protocol definition shared between sender and receiver. Using a shared protocol allows more information to be omitted from the transmitted packet, yielding smaller payloads and faster serialization times.

B. Data Streaming Technologies

In this section, we discuss three different categories of data streaming technologies: message queue-based, RPC-based, and low-level.

1) Message Queues: ActiveMQ [30], developed in Java by Apache, is a flexible messaging system designed to support various communication protocols, including AMQP, STOMP, REST, XMPP, and OpenWire. The system's architecture is based on a controller-worker model, where the controller broker is synchronized with worker brokers. The system operates in two modes: topic mode and queuing mode. In topic mode, ActiveMQ employs a publish-subscribe (pub/sub) mechanism, where messages are transient and delivery is not guaranteed. Conversely, in queue mode, ActiveMQ utilizes a point-to-point messaging approach, storing messages on disk or in a database to ensure at-least-once delivery. For our experiments, we utilize the STOMP communication protocol.

Kafka [31] is a distributed event processing platform written in Scala and Java, initially developed by LinkedIn and now maintained by Apache. Kafka leverages the concept of topics and partitions to achieve parallelism and reliability. Consumers can subscribe to one or more topics, with each topic divided into multiple partitions. Each partition is read by a single consumer, ensuring message order within that partition. For enhanced reliability, topics and partitions are replicated across multiple brokers within a cluster. Kafka employs a peer-to-peer (P2P) architecture to synchronize brokers, with no single broker taking precedence over the others.
TABLE II
A comparison of different data streaming technologies.

| Name            | Type      | Queue Mode    | Consume Mode | Broker Architecture | Delivery Guarantee | Order Guarantee | Code Generated Protocol | Multiple Consumer |
| ActiveMQ [30]   | Messaging | Pub/Sub & P2P | Pull         | controller-worker   | at-least-once      | queue-order     | ✕                       | ✓                 |
| Kafka [31]      | Messaging | P2P           | Pull         | P2P                 | All                | partition-order | ✕                       | ✓                 |
| Pulsar [32]     | Messaging | P2P           | Push         | P2P                 | All                | global-order    | ✕                       | ✓                 |
| RabbitMQ [33]   | Messaging | Pub/Sub       | Push/Pull    | controller-worker   | at-least/most-once | None            | ✕                       | ✓                 |
| RocketMQ [34]   | Messaging | Pub/Sub       | Push/Pull    | controller-worker   | at-least-once      | queue-order     | ✕                       | ✓                 |
| Avro [35]       | RPC       | P2P           | Pull         | Brokerless          | None               | global-order    | ✕                       | ✕                 |
| Capn'Proto [36] | RPC       | P2P           | Pull         | Brokerless          | None               | global-order    | ✓                       | ✕                 |
| gRPC [37]       | RPC       | P2P           | Pull         | Brokerless          | None               | global-order    | ✓                       | ✕                 |
| Thrift [38]     | RPC       | P2P           | Pull         | Brokerless          | None               | global-order    | ✓                       | ✕                 |
| ZMQ [39]        | Low Level | P2P           | Pull         | Brokerless          | None               | global-order    | ✕                       | ✕                 |
| ADIOS2 [14]     | Low Level | P2P           | Pull         | Brokerless          | None               | global-order    | ✕                       | ✕                 |
Zookeeper [40] manages brokers within the cluster. Kafka uses TCP for communication between message queues and supports only pull-based message delivery to consumers, while persisting messages to disk for durability and fault tolerance.

RabbitMQ [33], developed by VMWare, is a widely used messaging system known for its robust support for various messaging protocols, including AMQP, STOMP, and MQTT. Implemented in the Erlang programming language, RabbitMQ leverages Erlang's inherent support for distributed computation, eliminating the need for a separate cluster manager. A RabbitMQ cluster consists of multiple brokers, each hosting an exchange and multiple queues. The exchange is bound to one queue per broker, with queues synchronized across brokers. One queue acts as the controller, while the others function as workers. RabbitMQ supports point-to-point communication and both push and pull consumer modes. Although message ordering is not guaranteed, RabbitMQ provides at-least-once and at-most-once delivery guarantees. RabbitMQ faces poor scalability due to the need to replicate each queue on every broker. Our experiments utilize the STOMP protocol for communication with the pika Python package.

RocketMQ [34], developed by Alibaba and written in Java, is a messaging system that employs a bespoke communication protocol. It defines a set of topics, each internally split into a set of queues. Each queue is hosted on a separate broker within the cluster, and queues are replicated using a controller-worker paradigm. Brokers can dynamically register with a name server, which manages cluster and query routing. RocketMQ guarantees message ordering and supports at-least-once delivery. Consumers may receive messages from RocketMQ using either push or pull modes. Message queuing is implemented using the pub/sub paradigm, and RocketMQ scales well with a large number of topics and consumers.

Pulsar [32], created by Yahoo and now maintained by Apache, is implemented in Java and designed to support a large number of consumers and topics while ensuring high reliability. Pulsar's innovative architecture separates message storage from the message broker. A cluster of brokers is managed by a load balancer (Zookeeper). Similar to Kafka, each topic is split into partitions. However, instead of storing messages within partitions on the broker, Pulsar stores partition references in bookies. These bookies are coordinated by a bookkeeper, which is also load-balanced using Zookeeper. Each partition is further split into several segments and distributed across different bookies. The separation of message storage from message brokers means that if an individual broker fails, it can be replaced with another broker without loss of information. Similarly, if a bookie fails, the replica information stored in other bookies can take over, ensuring data integrity. Pulsar's architecture allows it to offer a global ordering and delivery guarantee, although this high reliability and scalability come at the cost of extra communication overhead between brokers and bookies.

For a detailed overview of different message queue technologies, please refer to [13].

2) RPC Based: gRPC [37], developed by Google, is an RPC framework that utilizes ProtoBuf as its default serialization protocol. To define the available RPC calls for a client, gRPC requires a protocol definition written in ProtoBuf. While ProtoBuf is the standard, sending arbitrary bytes from other serialization protocols over gRPC is possible by defining a message type with a bytes field. The Python gRPC implementation supports synchronous and asynchronous (asyncio) communication. For all our experiments with gRPC, we use asynchronous communication.

Capn'Proto [36] and Thrift also have their own RPC frameworks. Similar to gRPC, these frameworks define remote procedure calls within their protocol definitions, using their own syntax specifications. Like gRPC, they allow the transmission of arbitrary bytes by defining a message with a bytes field.

Avro provides an RPC-based communication protocol as well. Unlike other RPC-based methods, Avro does not require the RPC protocol to be explicitly defined. This flexibility comes at the expense of stricter type validation, setting Avro apart from systems such as gRPC and Thrift.

3) Low Level: In addition to RPC and messaging systems, we consider two low-level communication systems: ZeroMQ and ADIOS2. Like RPC systems, they do not rely on an intermediate broker for message transmission.

ZeroMQ (ZMQ) [39] is a brokerless communications library developed by iMatix. It is a highly flexible messaging framework that uses TCP sockets and supports various messaging patterns, such as push/pull, pub/sub, request/reply, and many
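At their core, these low-level, brokerless systems move framed bytes over a socket between exactly two endpoints. As a rough stdlib illustration of that pattern (this is not ZeroMQ's or ADIOS2's actual API), the sketch below sends one length-prefixed, Pickle-encoded payload over an in-process socket pair:

```python
import pickle
import socket

# A connected pair of in-process sockets stands in for the TCP link
# that a brokerless library such as ZeroMQ would manage.
producer_sock, consumer_sock = socket.socketpair()

payload = pickle.dumps({"data": [1.0, 2.0, 3.0], "time": [0, 1, 2]})

# Producer side: length-prefix the payload so the consumer knows how
# many bytes to read (framing that ZeroMQ handles internally).
producer_sock.sendall(len(payload).to_bytes(4, "big") + payload)

# Consumer side: read the length, then the payload, then decode.
length = int.from_bytes(consumer_sock.recv(4), "big")
buf = b""
while len(buf) < length:
    buf += consumer_sock.recv(length - len(buf))
message = pickle.loads(buf)

producer_sock.close()
consumer_sock.close()
```

Real low-level libraries add reconnection, buffering, and the messaging patterns listed above on top of exactly this kind of framed byte transfer.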
Fig. 1. Illustrates the data flow from producer to consumer, indicating the places at which various performance metrics are recorded. These metrics include (A) L_o: object creation latency, (B) T_o: object creation throughput, (C) C: compression ratio, (D) L_s: serialization latency, (E) L_d: deserialization latency, (F) T_s: serialization throughput, (G) T_d: deserialization throughput, (H) L_trans: transmission latency, (I) T_trans: transmission throughput, (J) L_tot: total latency, and (K) T_tot: total throughput.
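The four timestamps recorded at these measurement points determine the latency metrics directly, and the throughputs follow by dividing the relevant size by the relevant latency. A minimal sketch of that bookkeeping (the function and field names here are illustrative, not the framework's actual API):

```python
def metrics_from_timestamps(t0, t1, t2, t3, payload_size, object_size):
    """t0: before serialization, t1: after serialization,
    t2: after transmission, t3: after deserialization (seconds)."""
    L_s = t1 - t0        # serialization latency
    L_trans = t2 - t1    # transmission latency, excluding encoding time
    L_d = t3 - t2        # deserialization latency
    L_tot = t3 - t0      # total end-to-end latency
    return {
        "L_s": L_s, "L_trans": L_trans, "L_d": L_d, "L_tot": L_tot,
        "T_trans": payload_size / L_trans,        # wire bytes per second
        "T_tot": object_size / L_tot,             # original bytes per second
        "C": 100.0 * payload_size / object_size,  # compression ratio (%)
    }

m = metrics_from_timestamps(0.0, 0.5, 1.0, 2.0, payload_size=500, object_size=1000)
print(m["T_trans"], m["C"])  # 1000.0 50.0
```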
Fig. 2. Diagram showing the architecture of our streaming framework. A Runner is used to create a Producer and Consumer pair for each type of
streaming technology. Both producer and consumer are instantiated with a Marshaler that encodes data to the desired format (e.g. JSON, ProtoBuf etc.).
Producers are created with a data stream object that generates data samples for transmission. Depending on the streaming method, the Consumer and
Producer may connect to an external message broker.
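The moving parts in Fig. 2 can be sketched in a few lines of Python. The classes below are illustrative stand-ins for the framework's components, with an in-process queue in place of a real broker or socket:

```python
import json
import queue

class Marshaler:
    """Encodes/decodes payloads; one such class per serialization protocol."""
    def encode(self, obj) -> bytes:
        return json.dumps(obj).encode("utf-8")
    def decode(self, data: bytes):
        return json.loads(data.decode("utf-8"))

class Producer:
    """Pulls samples from a data stream, encodes them, and sends them."""
    def __init__(self, data_stream, marshaler, transport):
        self.data_stream, self.marshaler, self.transport = data_stream, marshaler, transport
    def run(self):
        for sample in self.data_stream:
            self.transport.put(self.marshaler.encode(sample))

class Consumer:
    """Receives encoded messages and decodes them with a matching marshaler."""
    def __init__(self, marshaler, transport):
        self.marshaler, self.transport = marshaler, transport
        self.received = []
    def run(self, n):
        for _ in range(n):
            self.received.append(self.marshaler.decode(self.transport.get()))

# A runner wires a Producer/Consumer pair around a shared transport;
# the queue stands in for the external broker or socket.
transport = queue.Queue()
samples = [{"x": i} for i in range(3)]
producer = Producer(samples, Marshaler(), transport)
consumer = Consumer(Marshaler(), transport)
producer.run()
consumer.run(len(samples))
```

The essential design point mirrored here is that producer and consumer only need to agree on the marshaler; the transport underneath can be swapped without touching the encoding logic.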
deserialization protocol can handle, independent of the size of the data stream.

For streaming technologies, we consider two different performance metrics:

8) Transmission Latency (L_trans) – This is the time taken for a payload to be sent over the wire, excluding the time taken to encode the message.

9) Transmission Throughput (T_trans = Σᵢᴺ S_d(i) / L_trans) – This is similar to total throughput, but considers the payload size divided by the time taken to send the message over the wire, exclusive of the serialization time.

10) Total Latency (L_tot) – This is the total time for a payload to be transmitted from producer to consumer, inclusive of the serialization time.

11) Total Throughput (T_tot = Σᵢᴺ S_d(i) / L_tot) – This is the original data object size divided by the total time to send the message. Throughput measures the rate of bytes that can be communicated over the wire.

Finally, we also investigate the effect of batch size on the throughput. Grouping data into batches is a common requirement during machine learning training, and we show that increasing the batch size while lowering the number of communications has a positive effect on throughput.

We make a distinction between transmission time and total time (Fig. 1). The total time is the end-to-end transmission of a message, including the time to serialize the message and send it over the wire. Transmission time is the time taken to transmit the payload, excluding the serialization and deserialization times. Similarly, we can calculate total and transmission throughput.

B. Dataset

In our experiments, we consider eight different payloads, ranging from simple data to common machine learning workloads, and include fusion science data. Our goal is to cover a range of scenarios. This section briefly describes the datasets used to evaluate performance with various streaming technologies and serialization protocols.

• Numerical Primitives: As a baseline comparison, we use simple datasets consisting of randomly generated numerical primitives for int32, float32, and float64 types.

• BatchMatrix: A synthetic dataset where each message consists of a randomly generated 3D tensor of type float32 with shape {32, 100, 100} to simulate sending a batched set of image samples.

• Iris Data: This dataset uses the well-known Iris dataset [41], which contains an array of four float32 features and a one-dimensional string target variable.

• MNIST: We use the widely used MNIST machine learning image dataset [42] as a realistic example of streaming 2D tensor data.

• Scientific Papers: The scientific papers dataset is a well-known dataset in the field of NLP and text processing [43]. The dataset comprises 349,128 articles of text from PubMed and arXiv publications. Each sample is
represented as a collection of strings for properties such as article, abstract, and section names for transmission.

• Plasma Current Data: As a more realistic example of scientific data, we use plasma current data from the MAST tokamak [4]. Each set of plasma current data contains three 1D arrays of type float32: data, time, and errors. The "data" array represents the amount of current at each timestep, the "time" array represents the time the measurement was taken in seconds, and the "errors" array represents the standard deviation of the error in the measured current.

C. Implementation and Experimental Setup

We developed a framework to measure the performance of streaming and serialization technology. The architecture diagram of our framework is shown in Figure 2, which follows a service-oriented architecture [44], [45] and is implemented in Python. We used the appropriate Python client library for each streaming and serialization technology. The source code can be found in our GitHub repository [46].

The user interacts with the framework through a command-line interface. A test runner sets up both the server side and client side of the streaming test.

The server side requires the configuration of three components:

• DataStream: handles loading data for transmission. This can be any one of the payloads described in section IV-B.

• Producer: functions as the server side of the application. It packages data from the selected data stream and transmits it over the wire using the selected streaming technology, which may be any of the technologies described in section III-B.

• Marshaler: handles the serialization of the data from the stream using the specified serialization protocol. This can be any of those described in section III-A.

The configuration of the client side is similar, but only requires a marshaler to be configured to match the one used for the producer. It does not require knowledge of the data stream.

• Consumer: functions as the client side of the application. It receives data transmitted by the producer using the selected streaming technology, processes the incoming messages, and performs the necessary actions. Producers and consumers interact using a configured protocol.

• Broker: brokers required by the streaming protocol (e.g., for Kafka or RabbitMQ) are run externally from the test in the background. In our framework, we configure all brokers using docker-compose [47] to ensure that our broker configurations are reproducible for every test.

• Logger: is used by the marshaler to capture performance metrics for each test in a JSON file. For each message sent, the logger captures four timestamps: 1) before serialization, 2) after serialization, 3) after transmission, and 4) after deserialization. Using these four timestamps, we can calculate the serialization, deserialization, transmission, and total time. Additionally, the logger captures the payload size of each message immediately after serialization. With this additional information, we can calculate the average payload size and throughput of the streaming service.

ADIOS and ZeroMQ can directly send array data without copying the input array. However, to achieve this, the array data must be passed directly to the communication library without serialization. Therefore, we additionally consider ZeroMQ and ADIOS to have their own "native" encoding strategy for each stream, which is only used with their respective streaming protocol. This allows for a fair comparison with other technologies, because sending an encoded array with ADIOS or ZeroMQ incurs an additional copy that could be circumvented by properly using their zero-copy functionality.

Following the convention of previous work [13], [8], we run each streaming test locally, with the producer and consumer on the same machine, to avoid network-specific issues.

V. RESULTS

In this section, we present the results of our experiments with the combination of different streaming technologies, serialization protocols, and data streams.

1) Object Creation Latency – We use different datasets that originate from various data analysis types, like NumPy or Xarray datasets. Depending on the encoding protocol, we may need to copy the data from its native format to a specific format, like Capn'Proto or ProtoBuf objects. This copying process adds some overhead that should be taken into consideration. However, for encoding protocols like JSON, BSON, and Pickle that do not require format changes, we store the data in a Pydantic class. The results in Figure 3 show that for larger array datasets like BatchMatrix, Plasma, and MNIST, encoding methods such as ProtoBuf, Thrift, and Capn'Proto tend to have higher object creation latency, as they need to copy data into their own data types.

2) Object Creation Throughput – We consider the object creation throughput for each serialization method. The object creation time measures the time to convert data from the native data structure (such as a NumPy array) to the serialization format. Object creation time is important to consider if the format that the data will be used in differs from the format it is sent over the wire. Typically, object creation forces a copy of the data to be sent, which impacts the total throughput, especially when considering large array-like data.

Figure 4 shows the object creation throughput for each dataset and each encoding method. It is interesting to note here that protocol-based methods incur a greater penalty for object creation. This effect is more noticeable in larger datasets such as the BatchMatrix and Plasma datasets.

3) Compression Ratio – The payload size, a crucial performance metric of serialization protocols, is independent of the choice of streaming protocol. Therefore, we have calculated the average compression ratio over all runs for each serialization protocol.
Fig. 3. Object creation latency (L_o), measured in milliseconds, for various data streams (x-axis) and encoding methods (color).
Fig. 4. Object creation throughput (T_o), measured in megabytes per second, for various data streams (x-axis) and encoding methods (color).
Figure 5 presents the results for each protocol and data stream. Notably, Pickle, Avro, and XML consistently produce the largest payload sizes, often exceeding the original size. This is due to the inefficiency of their text-based encodings and the additional metadata tags they add as overhead. Pickle, a binary format for storing Python objects, is particularly known for its large sizes and is not optimal for encoding data for streaming.

The results show that the serialization protocol Capn'Proto outperforms the others in terms of payload size. The packed option of Capn'Proto, also known as capnp-packed, is responsible for additional size efficiency. The capnp-packed format is closely followed by several binary serialization formats that show similar performance. The reason behind this performance can be attributed to their ability to achieve near-identical compression, which is close to the limits of what is possible for that particular data stream.

Examining across data streams, it can be seen that the BatchMatrix dataset is fundamentally limited. This is because it is made up of randomly generated numbers, making it incompressible due to the lack of redundancy in the data. Conversely, for more realistic data such as MNIST and Plasma, a much higher compression ratio is achieved. Better compression is achieved for formats such as Capn'Proto Packed, which exploit the redundancy in the data to achieve greater compression. Text-based formats, such as YAML, JSON, XML, and Avro, achieve significantly worse compression in comparison. In fact, due to the extra markup required for these formats, they can produce a larger payload size than the original data.

4) Serialization Latency – The results for serialization time are shown in Figure 6. There is a clear trend across all data streams, from text-based protocols being the slowest (Avro, YAML, etc.) towards binary-encoded protocol-based methods (Capn'Proto, ProtoBuf, etc.) being the fastest. Binary-encoded but non-protocol methods fall in between these two extremes.

It is interesting to note that Capn'Proto has the fastest serialization time. This is likely due to the fact that Capn'Proto stores data in a format that is ready for serialization over the wire.

5) Deserialization Latency – The results for deserialization time are shown in Figure 7. Again, a clear trend may be seen across all data streams, from text-based protocols to binary-encoded protocol-based methods.

Like serialization, Capn'Proto is generally the fastest deserialization method across all tests. As mentioned above, this is likely due to Capn'Proto storing the data in a pre-serialized form.

6) Serialization Throughput – Figure 8 shows the average throughput for serialization of the data using different types
9
Fig. 5. The compression ratio (C) for various data streams (x-axis) and serialization protocols (color).
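The redundancy argument above can be reproduced in miniature. The sketch below is illustrative only: it substitutes two standard-library encoders (JSON as a text-based format, pickle as a binary one) for the full protocol matrix, and the two synthetic streams plus the 8-bytes-per-float baseline are our own assumptions, standing in for the incompressible BatchMatrix-style data and the redundant MNIST-style data.

```python
import json
import pickle
import random

def payload_ratio(values, serialize) -> float:
    """Serialized payload size as a percentage of the raw float64 footprint."""
    raw_bytes = 8 * len(values)  # 8 bytes per float64 value
    return 100 * len(serialize(values)) / raw_bytes

random.seed(0)
# Incompressible stream: random floats (analogous to BatchMatrix).
noise = [random.random() for _ in range(10_000)]
# Redundant stream: mostly zeros (analogous to sparse image pixels).
sparse = [0.0] * 9_000 + [random.random() for _ in range(1_000)]

encoders = {
    "json":   lambda v: json.dumps(v).encode(),
    "pickle": lambda v: pickle.dumps(v, protocol=pickle.HIGHEST_PROTOCOL),
}

for name, enc in encoders.items():
    print(f"{name:7s} noise = {payload_ratio(noise, enc):6.1f}%   "
          f"sparse = {payload_ratio(sparse, enc):6.1f}%")
```

Even this toy version shows the two effects discussed above: text encodings inflate random floats well past the original size, while redundant data shrinks, and the binary encoder stays close to the raw footprint regardless of content.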
Fig. 6. Serialization latency (Ls) for different data streams (x-axis) and encoding protocols (color). Protocol encoding methods such as Protobuf and Cap'n Proto consistently offer the best serialization performance. Text-based encodings (YAML, XML, etc.) add a large latency penalty to serialization by increasing the verbosity of the data.
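A minimal version of this latency measurement can be sketched with standard-library encoders. The toy batch, the repeat count, and the best-of-N aggregation below are our own assumptions rather than the framework's actual methodology; they simply demonstrate how Ls and Ld are timed separately around a single round trip through an encoder.

```python
import json
import pickle
import random
import time

def measure_ms(serialize, deserialize, obj, repeats: int = 20):
    """Return (L_s, L_d) in milliseconds, taking the best of `repeats` runs."""
    ser, de = [], []
    for _ in range(repeats):
        t0 = time.perf_counter()
        payload = serialize(obj)      # serialization latency L_s
        t1 = time.perf_counter()
        deserialize(payload)          # deserialization latency L_d
        t2 = time.perf_counter()
        ser.append(t1 - t0)
        de.append(t2 - t1)
    return 1e3 * min(ser), 1e3 * min(de)

random.seed(0)
# A toy stand-in for an image batch: 256 rows of 64 floats.
batch = [[random.random() for _ in range(64)] for _ in range(256)]

codecs = {
    "json":   (lambda o: json.dumps(o).encode(), json.loads),
    "pickle": (lambda o: pickle.dumps(o), pickle.loads),
}
results = {name: measure_ms(s, d, batch) for name, (s, d) in codecs.items()}
for name, (ls, ld) in results.items():
    print(f"{name:7s} L_s = {ls:7.3f} ms   L_d = {ld:7.3f} ms")
```

The text-versus-binary gap is visible even here: the text encoder spends most of its time formatting each float as a decimal string, which is exactly the verbosity penalty the figure captions describe.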
Fig. 7. Deserialization latency (Ld) for different data streams (x-axis) and encoding protocols (color). As with serialization, protocol encoding methods such as Protobuf and Cap'n Proto offer the best performance, while text-based encodings (YAML, XML, etc.) add a large latency penalty to deserialization by increasing the verbosity of the data.
of protocols. It is evident from the graph that serialization techniques based on protocols, such as ProtoBuf, Thrift, and Cap'n Proto, offer the highest serialization throughput. Binary methods that are protocol-independent offer moderate throughput performance, with the added advantage of greater flexibility compared with protocol-based methods. Text-based methods perform the worst due to their high serialization overhead.

Surprisingly, Avro also performs quite well by this metric. We believe this is because, despite being a human-readable text-based method, it is also a protocol-based method: both the producer and consumer are aware of the types and data structures being transmitted over the wire, facilitating faster throughput.

Fig. 8. Serialization (Ts) and deserialization (Td) throughput for each encoding, averaged over all data streams.

7) Deserialization Throughput – Figure 8 also shows the average throughput for deserialization of the data using different types of protocols. It is noticeable that deserialization throughput is lower than serialization throughput across all methods, indicating that deserialization is a main bottleneck to transmission.

8) Transmission Latency – Figure 9 shows the transmission latency for various combinations of serialization and streaming technologies. The heatmap for each combination is sorted by the average latency, from lowest to highest, for each streaming technology.

Across all technologies, it is observed that transmission latency depends largely on the choice of streaming technology rather than the choice of serialization protocol. Message streaming technologies require a broker as an intermediary, which increases the overall latency, whereas RPC technologies have no broker and hence lower latency. Among messaging technologies, RabbitMQ performs better with larger payloads, while ActiveMQ achieves lower latency with smaller payloads but performs worst on the largest payload (e.g., BatchMatrix). Among RPC-based methods, Thrift consistently has the lowest latency except for the BatchMatrix stream, where Cap'n Proto narrowly beats Thrift.

With larger payloads, such as the BatchMatrix and Plasma data streams, the impact of the serialization protocol becomes more noticeable. It is challenging to identify a trend between encoding protocols in terms of latency, except that it is crucial to note the inefficiency of using XML and YAML for larger payloads.

For the BatchMatrix data stream, an issue arises when sending a large YAML-encoded payload through the Python API, which causes ADIOS to produce a segmentation fault. Therefore, the corresponding latency and throughput values are NaN, shown as empty cells in Figure 9.

9) Transmission Throughput – By examining the throughput, we gain a better understanding of how different protocols affect transmission. Figure 10 shows that RPC methods achieve higher transmission throughput than message streaming technologies. When dealing with larger payloads, such as the BatchMatrix tests, ZeroMQ offers the lower latency because it avoids the overhead of copying the data into a new structure, as is the case with Thrift or Protobuf. Among the broker-based methods, RabbitMQ consistently performs well.

When it comes to encoding methods, protocol-based methods generally perform the best across all datasets and streaming methods. However, it is not clear which method offers the lowest latency in general. Protocol-based methods can achieve high throughput by mixing encoding protocols and RPC frameworks. For example, considering the MNIST dataset, Cap'n Proto achieves the lowest latency with the Thrift protocol.

10) Total Latency – There is a clear trend towards protocol encoding for complex data sets such as Iris, MNIST, and Plasma. Among streaming technologies, Thrift generally shows the best performance.

11) Total Throughput – Figure 12 shows the total throughput, which is consistent with the total latency results discussed in the previous section. Protocol-based methods achieve the highest throughput. Among all the serialization protocols, Thrift is generally the best-performing one. ZeroMQ performs well with the biggest dataset, BatchMatrix. Although the best encoding method is inconclusive, there is a trend toward protocol-based methods, which give the highest throughput.

12) Effect of Batch Size on Throughput – In machine learning applications, data is often processed in batches. Our findings underscore the potential of batching data before transmission to enhance throughput. However, it is crucial to
Fig. 9. Each subplot or heatmap shows transmission latency (Ltrans ) for different serialization protocols (vertical axis) and streaming technologies (horizontal
axis). Dark red indicates higher latency.
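The broker-versus-direct effect discussed above can be imitated locally. In the sketch below, a plain socket pair stands in for an RPC-style direct connection and a forwarding thread stands in for a broker hop; none of the benchmarked systems are involved, and the payload size and repeat count are arbitrary illustrative choices.

```python
import socket
import threading
import time

PAYLOAD = b"x" * 4_096  # small enough to fit default socket buffers everywhere

def one_way_latency_ms(hops: int, repeats: int = 50) -> float:
    """Median one-way latency (ms) through `hops` chained socket pairs.

    hops=1 models a direct (RPC-style) connection; hops=2 adds one
    forwarding thread, standing in for a broker between the endpoints.
    """
    pairs = [socket.socketpair() for _ in range(hops)]

    def relay(src, dst):
        try:
            while True:
                data = src.recv(1 << 16)
                if not data:
                    return
                dst.sendall(data)  # extra copy: the "broker" cost
        except OSError:
            return  # sockets closed at the end of the measurement

    for (_, rx), (tx, _) in zip(pairs, pairs[1:]):
        threading.Thread(target=relay, args=(rx, tx), daemon=True).start()

    sender, receiver = pairs[0][0], pairs[-1][1]
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        sender.sendall(PAYLOAD)
        got = 0
        while got < len(PAYLOAD):
            got += len(receiver.recv(1 << 16))
        samples.append(time.perf_counter() - t0)
    for a, b in pairs:
        a.close()
        b.close()
    return 1e3 * sorted(samples)[len(samples) // 2]

direct = one_way_latency_ms(hops=1)
brokered = one_way_latency_ms(hops=2)
print(f"direct   (no intermediary): {direct:.3f} ms")
print(f"brokered (one extra hop) : {brokered:.3f} ms")
```

The intermediary adds a full receive-copy-send cycle and a thread handoff per message, which is the same structural overhead that separates the broker-based columns from the RPC columns in the heatmaps.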
Fig. 10. Each subplot or heatmap shows transmission throughput (Ttrans ) for different serialization protocols (vertical axis) and streaming technologies
(horizontal axis).
Fig. 11. Each subplot or heatmap shows total latency (Ltot ) for different serialization protocols (vertical axis), and streaming technologies (horizontal axis).
Fig. 12. Each subplot or heatmap shows total throughput (Ttot ) for different serialization protocols (vertical axis), and streaming technologies (horizontal
axis).
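How the total figures combine the component costs can be sketched end to end. The snippet below is a toy stand-in, not the benchmark framework itself: a local socket pair replaces the streaming technology, standard-library encoders replace the protocol matrix, and the record is invented for illustration. It computes a throughput of the form Ttot = payload bytes / (Ls + Ltrans + Ld).

```python
import json
import pickle
import socket
import time

def recv_exact(sock, n: int) -> bytes:
    """Read exactly n bytes from the socket."""
    buf = b""
    while len(buf) < n:
        buf += sock.recv(n - len(buf))
    return buf

def total_throughput_mb_s(obj, serialize, deserialize, repeats: int = 20) -> float:
    """Best-of-N total throughput (MB/s) over a local socket pair."""
    tx, rx = socket.socketpair()
    best = 0.0
    for _ in range(repeats):
        t0 = time.perf_counter()
        payload = serialize(obj)                                   # L_s
        tx.sendall(len(payload).to_bytes(4, "big") + payload)      # L_trans
        size = int.from_bytes(recv_exact(rx, 4), "big")
        deserialize(recv_exact(rx, size))                          # L_d
        elapsed = time.perf_counter() - t0
        best = max(best, len(payload) / elapsed / 1e6)
    tx.close()
    rx.close()
    return best

record = {"name": "toy", "signal": [i * 0.25 for i in range(500)]}

results = {
    "json":   total_throughput_mb_s(record, lambda o: json.dumps(o).encode(), json.loads),
    "pickle": total_throughput_mb_s(record, pickle.dumps, pickle.loads),
}
for name, t in results.items():
    print(f"{name:7s} {t:8.2f} MB/s")
```

Because all three latencies sit in one denominator, a slow encoder drags the total down even over an essentially free local transport, which mirrors why encoding choice still shows up in the total-throughput heatmaps despite transport dominating.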
Fig. 13. Total throughput Ttot for various batch sizes (e.g., ranging from 1 to 256) for the MNIST dataset.
note that the batch size can significantly impact the throughput of the method used.

Figure 13 shows the throughput for the MNIST dataset with a variable batch size. As the batch size first increases, the throughput drops due to the increased overhead of copying and serializing data for transmission. However, when the batch size is increased beyond 32 images per batch, the overall throughput begins to improve because fewer packets need to be communicated over the network. For binary and protocol encoding methods, increasing the batch size is shown to further increase the throughput. This observation is consistent with previous results; generally, protocol-based methods offer the best throughput. At larger batch sizes (> 128), the throughput continues to increase: transmission is significantly slower than serialization and deserialization, so grouping many examples into a single transmission improves throughput.

VI. DISCUSSION

We can draw several conclusions based on the experiments presented in this work. We identify the following key points from our results:

A. Recommendations

RPC systems are faster than messaging broker systems because of the overhead of the intermediate broker. This makes RPC highly efficient for high-throughput, low-latency transmission of large data, although RPC systems do not offer the same delivery guarantees as message broker systems.

We found that the choice of messaging technology has a greater impact than the encoding protocol.

Protocol-based encoding methods such as Cap'n Proto and ProtoBuf perform best for complex data that can be compressed, while MessagePack is a competitive choice for smaller or random data. Protocol-based encoding methods offer the fastest serialization and best compression, with Thrift offering the best throughput and Cap'n Proto offering the best compression. Binary encoding methods offer more flexibility at the cost of slower encoding speed; among the binary encoding methods we tested, MessagePack generally performed the best. Considering text-based protocols, JSON offered the best performance due to its lightweight markup and smaller payload size compared to YAML or Avro.

Among the different messaging technologies, we generally found that Apache Thrift achieves very high throughput and low latency across various scenarios. Among message broker systems, RabbitMQ generally demonstrates the best performance.

Surprisingly, we did not observe much of a difference when combining different protocol-based encoding and messaging systems. We hypothesized that ProtoBuf would be most efficient when combined with gRPC, or that Cap'n Proto would perform best with its own RPC implementation. However, this appears not to be the case.

Larger batch sizes facilitate higher throughput for array datasets when using either a binary or protocol-based serialization method, as shown by our throughput and batch size experiment in Figure 13. For text-based encoding methods, the required markup and the lack of compression destroy any advantage of batching data for transmission.

B. Limitations and future directions

One notable limitation is that this study did not investigate the potential of scaling with multiple clients. Previous research has examined this aspect for message queuing systems [13]. A future study could focus on examining the reliability of various RPC technologies based on the number of consumers.

VII. CONCLUSION

In this work, we investigated 132 combinations of different encoding methods and messaging technologies. We evaluated their performance across 11 different metrics and benchmarked each combination against 6 different datasets, ranging from toy datasets to machine learning and scientific data from the fusion energy domain. We found that the messaging technology has the biggest impact on performance, regardless of the specific serialization method used. Protocol-based encoding methods offered the best performance, with the highest throughput and lowest latency, but at the expense of flexibility and robustness. Notably, we did not see much difference when combining different protocol-based encoding and messaging systems. Finally, we found that
the batch size affects the data throughput for all binary and protocol-based encoding methods.

CONTRIBUTION

SJ: Designed and implemented the experimental framework, shaping the research methodology, and contributed to the writing and conceptualization of the paper. NC: Provided the MAST data for the study and offered expertise in the fusion domain, enhancing the scientific rigor of this empirical study, and edited and refined the manuscript. SK: Provided technical supervision and introduced the core idea, building upon SK's prior work at the University of Oxford, and contributed to the writing and editing of the paper, figures, and plots.

ACKNOWLEDGMENT

We would like to thank our colleagues at UKAEA and STFC for supporting the FAIR-MAST project. Additionally, we would like to thank Stephen Dixon, Jonathan Hollocombe, Adam Parker, Lucy Kogan, and Jimmy Measures from UKAEA for assisting our understanding of the fusion data. We would also like to extend our thanks to the wider FAIR-MAST project, which includes Shaun De Witt, James Hodson, Stanislas Pamela, and Rob Akers from UKAEA and Jeyan Thiyagalingam from STFC. We also want to extend our gratitude to the MAST Team for their efforts in collecting and curating the raw diagnostic source data during the operation of the MAST experiment.

REFERENCES

[1] T. Hey, K. Butler, S. Jackson, and J. Thiyagalingam, "Machine learning and big scientific data," vol. 378, no. 2166, p. 20190054. [Online]. Available: https://royalsocietypublishing.org/doi/10.1098/rsta.2019.0054
[2] M. D. Wilkinson et al., "The FAIR Guiding Principles for scientific data management and stewardship," Scientific Data, vol. 3, no. 1, p. 160018, Mar. 2016.
[3] P. Rocca-Serra et al., "The FAIR Cookbook – the essential resource for and by FAIR doers," Scientific Data, vol. 10, no. 1, p. 292, May 2023.
[4] A. Sykes et al., "First physics results from the MAST Mega-Amp Spherical Tokamak," Physics of Plasmas, vol. 8, no. 5, pp. 2101–2106, May 2001.
[5] J. R. Harrison et al., "Overview of new MAST physics in anticipation of first results from MAST Upgrade," Nuclear Fusion, vol. 59, no. 11, p. 112011, Jun. 2019.
[6] R. M. Churchill et al., "A Framework for International Collaboration on ITER Using Large-Scale Data Transfer to Enable Near-Real-Time Analysis," Fusion Science and Technology, vol. 77, no. 2, pp. 98–108, Feb. 2021.
[7] A. Pavone, A. Merlo, S. Kwak, and J. Svensson, "Machine learning and Bayesian inference in nuclear fusion research: An overview," Plasma Physics and Controlled Fusion, vol. 65, no. 5, p. 053001, Apr. 2023.
[8] S. Khan, E. Rydow, S. Etemaditajbakhsh, K. Adamek, and W. Armour, "Web Performance Evaluation of High Volume Streaming Data Visualization," IEEE Access, vol. 11, pp. 15623–15636, 2023.
[9] D. P. Proos and N. Carlsson, "Performance Comparison of Messaging Protocols and Serialization Formats for Digital Twins in IoV," in 2020 IFIP Networking Conference (Networking), Jun. 2020, pp. 10–18.
[10] D. Friesel and O. Spinczyk, "Data Serialization Formats for the Internet of Things," Electronic Communications of the EASST, vol. 80, Sep. 2021. [Online]. Available: https://journal.ub.tu-berlin.de/eceasst/article/view/1134
[11] B. Petersen, H. Bindner, S. You, and B. Poulsen, "Smart grid serialization comparison: Comparison of serialization for distributed control in the context of the Internet of Things," in 2017 Computing Conference, Jul. 2017, pp. 1339–1346.
[12] A. Sumaray and S. K. Makki, "A comparison of data serialization formats for optimal efficiency on a mobile platform," in Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication, Kuala Lumpur, Malaysia: ACM, Feb. 2012, pp. 1–6. [Online]. Available: https://dl.acm.org/doi/10.1145/2184751.2184810
[13] G. Fu, Y. Zhang, and G. Yu, "A Fair Comparison of Message Queuing Systems," IEEE Access, vol. 9, pp. 421–432, 2021.
[14] W. F. Godoy et al., "ADIOS 2: The Adaptable Input Output System. A framework for high-performance data management," SoftwareX, vol. 12, p. 100561, Jul. 2020.
[15] "Extensible Markup Language (XML) 1.1 (Second Edition)." [Online]. Available: https://www.w3.org/TR/2006/REC-xml11-20060816/
[16] T. Bray, "The JavaScript Object Notation (JSON) data interchange format," Internet Engineering Task Force, RFC 8259, Dec. 2017. [Online]. Available: https://datatracker.ietf.org/doc/std90
[17] "YAML Ain't Markup Language (YAML) revision 1.2.2." [Online]. Available: https://yaml.org/spec/1.2.2/
[18] "BSON (Binary JSON): Specification." [Online]. Available: https://bsonspec.org/spec.html
[19] "Universal Binary JSON Specification." [Online]. Available: https://ubjson.org/
[20] "CBOR – Concise Binary Object Representation." [Online]. Available: https://cbor.io/
[21] "MessagePack." [Online]. Available: https://github.com/msgpack/msgpack
[22] "PEP 3154 – Pickle protocol version 4." [Online]. Available: https://peps.python.org/pep-3154/
[23] "Protocol Buffers Version 3 Language Specification." [Online]. Available: https://protobuf.dev/reference/protobuf/proto3-spec/
[24] M. Slee, A. Agarwal, and M. Kwiatkowski, "Thrift: Scalable Cross-Language Services Implementation."
[25] "Cap'n Proto: Cap'n Proto, FlatBuffers, and SBE." [Online]. Available: https://capnproto.org/news/2014-06-17-capnproto-flatbuffers-sbe.html
[26] "Apache Avro Specification." [Online]. Available: https://avro.apache.org/docs/1.11.1/specification/
[27] "gRPC." [Online]. Available: https://grpc.io/
[28] A. Luis, P. Casares, J. J. Cuadrado-Gallego, and M. A. Patricio, "PSON: A Serialization Format for IoT Sensor Networks," Sensors, vol. 21, no. 13, p. 4559, Jan. 2021. [Online]. Available: https://www.mdpi.com/1424-8220/21/13/4559
[29] A. Wolnikowski, S. Ibanez, J. Stone, C. Kim, R. Manohar, and R. Soulé, "Zerializer: Towards Zero-Copy Serialization," in Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS '21), Ann Arbor, MI, USA: ACM, 2021, pp. 206–212. [Online]. Available: https://doi.org/10.1145/3458336.3465283
[30] "ActiveMQ." [Online]. Available: https://activemq.apache.org/
[31] "Apache Kafka." [Online]. Available: https://kafka.apache.org/
[32] "Apache Pulsar." [Online]. Available: https://pulsar.apache.org/
[33] "RabbitMQ: Easy to use, flexible messaging and streaming." [Online]. Available: https://rabbitmq.com/
[34] "RocketMQ." [Online]. Available: https://rocketmq.apache.org/
[35] "Apache Avro." [Online]. Available: https://avro.apache.org/
[36] "Cap'n Proto." [Online]. Available: https://capnproto.org/
[37] "gRPC." [Online]. Available: https://grpc.io/
[38] "Apache Thrift." [Online]. Available: https://thrift.apache.org/
[39] "ZeroMQ." [Online]. Available: https://zeromq.org/
[40] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed, "ZooKeeper: Wait-free coordination for Internet-scale systems."
[41] R. A. Fisher, "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.
[42] L. Deng, "The MNIST database of handwritten digit images for machine learning research," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141–142, 2012.
[43] A. Cohan et al., "A discourse-aware attention model for abstractive summarization of long documents," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 2018. [Online]. Available: http://dx.doi.org/10.18653/v1/n18-2097
[44] S. Khan and D. Wallom, "A System for Organizing, Collecting, and Presenting Open-Source Intelligence," Journal of Data, Information and Management, vol. 4, no. 2, pp. 107–117, Jun. 2022.
[45] S. Khan, P. H. Nguyen, A. Abdul-Rahman, E. Freeman, C. Turkay, and M. Chen, "Rapid Development of a Data Visualization Service in an Emergency Response," IEEE Transactions on Services Computing, vol. 15, no. 3, pp. 1251–1264, 2022.
[46] "GitHub: Streaming performance analysis." [Online]. Available: https://github.com/stfc-sciml/streaming-performance-analysis
[47] D. Merkel, "Docker: lightweight Linux containers for consistent development and deployment," Linux Journal, vol. 2014, no. 239, p. 2, 2014.