Lecture6 DataFlowLayer

Data Flow Manager

Pravin Y Pawar
Distributed Data Flows

Need
• Distributed state management is required to process data in a scalable way

• Distributed data flows consist of
 Data collection
 Data processing

• Systems for data flow management have matured over the years
 In-house developments
 Standard queuing systems like ActiveMQ
 Services like Kafka and Flume
Distributed Data Flow Systems

Requirement
• Systems should support
 “At least once” delivery semantics
 A solution to the “n+1” delivery problem
Data Delivery Semantics

• Three options for data delivery and processing


 At most once delivery
 At least once delivery
 Exactly once delivery
At most once delivery semantics

• Typical of systems used for monitoring purposes
• The important thing is to inform the admins about problems
• Not every data transmission is required
• Data can be down-sampled to improve performance
• The amount of data loss is approximately known
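The at-most-once behaviour described above can be illustrated with a minimal sketch. It assumes a hypothetical `transmit` callable that may raise on failure; the function name `send_at_most_once` and the `sample_rate` parameter are illustrative and not taken from any particular system. Each message is attempted once at most, messages are down-sampled, and the sender keeps counts so that the loss is approximately known.

```python
import random


def send_at_most_once(messages, transmit, sample_rate=0.1, rng=None):
    """Fire-and-forget sender: every message is attempted at most once.

    `transmit` is a hypothetical callable that may raise on failure.
    Returns (sent, dropped) so the amount of loss is approximately known.
    """
    rng = rng or random.Random()
    sent = 0
    dropped = 0
    for msg in messages:
        # Down-sample: only a fraction of the messages is transmitted.
        if rng.random() >= sample_rate:
            dropped += 1
            continue
        try:
            transmit(msg)
            sent += 1
        except Exception:
            # No retry: a failed transmission is simply counted as lost.
            dropped += 1
    return sent, dropped
```

Because there is no retry, a message can never arrive twice, which is acceptable for monitoring data where an occasional missing sample does not matter.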
Exactly once delivery semantics

• Typical of financial systems or advertising systems
• Every message has to be delivered exactly once
• Data loss is not affordable, as it may mean lost revenue
• Achieved through queuing systems like ActiveMQ and RabbitMQ
• Queue semantics are usually implemented on the server side
At least once delivery semantics

• Balances the two extremes: provides reliable message delivery by pushing the message-handling semantics to the consumer
• Consumers are free to implement message handling without worrying about other consumers
• Depends on application logic and is handled at the application level only
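One common way for a consumer to take on the message-handling semantics is to make its processing idempotent. The sketch below is only an illustration under stated assumptions: the class `IdempotentConsumer`, the function `deliver_at_least_once`, and the `ack_lost_first_n` parameter (which simulates acknowledgements lost in transit) are hypothetical names, not part of Kafka, Flume, or any other system named in these slides.

```python
class IdempotentConsumer:
    """Consumer that tolerates redelivery by remembering processed ids."""

    def __init__(self):
        self.seen_ids = set()
        self.total = 0

    def handle(self, msg_id, amount):
        if msg_id in self.seen_ids:
            # Duplicate delivery: acknowledge again, apply no side effect.
            return True
        self.total += amount
        self.seen_ids.add(msg_id)
        return True


def deliver_at_least_once(msg_id, amount, consumer,
                          ack_lost_first_n=0, max_attempts=5):
    """Redeliver the same message until an acknowledgement is observed.

    The first `ack_lost_first_n` acknowledgements are treated as lost,
    which forces redelivery. Returns the number of attempts made.
    """
    for attempt in range(1, max_attempts + 1):
        acked = consumer.handle(msg_id, amount)
        if acked and attempt > ack_lost_first_n:
            return attempt
    raise RuntimeError("no acknowledgement after %d attempts" % max_attempts)
```

The sender may deliver a message several times, but the consumer's state changes only once per message id, so duplicates are harmless at the application level.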
The “n+1” problem

• In a data processing pipeline, every time a new service or processing mechanism is added, it must integrate with each of the other systems already in place
 Common antipattern
 Handling interaction between systems becomes a pain point

• Data flow systems standardize the communication between the bus layer and each application, and also manage the physical flow of messages between systems
 Allows any number of consumers and producers to communicate using a common protocol
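The effect of a shared bus on integration effort can be sketched as follows. This is a toy in-memory illustration, not the API of any real system: `MessageBus`, `point_to_point_links`, and `bus_links` are hypothetical names, and the link counts assume that every pair of systems needs one integration in the point-to-point case.

```python
from collections import defaultdict


def point_to_point_links(n):
    """Integrations needed when each of n systems talks to every other."""
    return n * (n - 1) // 2


def bus_links(n):
    """Integrations needed when each of n systems talks only to the bus."""
    return n


class MessageBus:
    """Toy bus: producers and consumers share one publish/subscribe protocol."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver to every subscriber of the topic; return how many got it.
        callbacks = self._subscribers[topic]
        for callback in callbacks:
            callback(message)
        return len(callbacks)
```

Under these assumptions, adding a sixth system to five existing ones costs five new point-to-point integrations (`point_to_point_links(6) - point_to_point_links(5)`), but only one new integration with the bus.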
Example Systems

• High-performance systems with sufficient scalability to support real-time streaming
 Apache Kafka
 Flume by Cloudera

• Kafka
 Directed towards users who are building applications from scratch, giving them the freedom to directly integrate a data motion system

• Flume
 Design makes it well suited to environments that have existing applications that need to be federated into a single processing environment
Thank You!
In our next session: Streaming Data Processor
