Data Flow Manager
Pravin Y Pawar
Distributed Data Flows
Need
• Distributed state management is required in order to process the data in a scalable way
• Distributed data flows consist of
Data collection
Data processing
• Systems for data flow management have matured over the years
In-house developments
Standard queuing systems like ActiveMQ
Services like Kafka and Flume
Distributed Data Flows systems
Requirement
• Systems should support
“At least once” delivery semantic
Solving the “n+1” delivery problem
Data Delivery Semantic
• Three options for data delivery and processing
At most once delivery
At least once delivery
Exactly once delivery
At most once delivery semantic
• Suited to systems used for monitoring purposes
• The important thing is to inform the admins about problems
• Not every data transmission is required
• Data can be down-sampled to improve performance
• The amount of data loss is approximately known
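The at-most-once behaviour described above can be sketched as a fire-and-forget sender. This is a minimal illustration, not a real monitoring client: `send_at_most_once`, `flaky_transport`, and the `keep_every` parameter are hypothetical names, and the network failure is simulated. It shows down-sampling, the absence of retries, and counters that make the data loss approximately known.

```python
def send_at_most_once(messages, transport, keep_every=1):
    """Fire-and-forget delivery: each message is attempted at most once."""
    stats = {"sent": 0, "skipped": 0, "lost": 0}
    for index, message in enumerate(messages):
        # Down-sample: only every keep_every-th message is transmitted.
        if index % keep_every != 0:
            stats["skipped"] += 1
            continue
        try:
            transport(message)
            stats["sent"] += 1
        except ConnectionError:
            # No retry: a failed message is simply counted as lost.
            stats["lost"] += 1
    return stats


received = []


def flaky_transport(message):
    """Hypothetical transport that fails for one specific message."""
    if message == 4:
        raise ConnectionError("simulated network failure")
    received.append(message)


stats = send_at_most_once(range(10), flaky_transport, keep_every=2)
print(stats)     # counters for sent, skipped, and lost messages
print(received)  # messages that actually arrived
```

Because nothing is retried, the sender stays cheap, and the `skipped` and `lost` counters give the approximate loss figure.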
Exactly once delivery semantic
• Required by financial systems or advertising systems
• Every message has to be delivered exactly once
• Data loss is not affordable because it may mean lost revenue
• Achieved through queuing systems such as ActiveMQ or RabbitMQ
• Queue semantics are usually implemented on the server side
At least once delivery semantic
• Balances the two extremes: message delivery stays reliable, while the message handling
semantics are pushed to the consumer
• Consumers are free to implement message handling without worrying about other consumers
• Depends on application logic and is handled at the application level only
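One common way for a consumer to handle at-least-once delivery at the application level is to make its processing idempotent. The sketch below is an assumption-laden illustration: `IdempotentConsumer`, `deliver_at_least_once`, and the lost-acknowledgement simulation are hypothetical and do not model any particular broker. It shows a message being redelivered after a lost acknowledgement while the consumer applies it only once.

```python
class IdempotentConsumer:
    """Consumer that ignores message ids it has already processed."""

    def __init__(self):
        self.seen_ids = set()
        self.total = 0

    def handle(self, message_id, amount):
        if message_id in self.seen_ids:
            return False  # duplicate delivery, already applied
        self.seen_ids.add(message_id)
        self.total += amount
        return True


def deliver_at_least_once(messages, consumer, lose_ack_for):
    """Redeliver each message until its acknowledgement gets through."""
    deliveries = 0
    for message_id, amount in messages:
        ack_will_be_lost = message_id in lose_ack_for
        acked = False
        while not acked:
            consumer.handle(message_id, amount)
            deliveries += 1
            if ack_will_be_lost:
                # Simulate one lost acknowledgement, forcing a redelivery.
                ack_will_be_lost = False
            else:
                acked = True
    return deliveries


consumer = IdempotentConsumer()
messages = [("m1", 10), ("m2", 20), ("m3", 30)]
deliveries = deliver_at_least_once(messages, consumer, lose_ack_for={"m2"})
print(deliveries)      # 4 deliveries for 3 messages
print(consumer.total)  # 60, the duplicate of m2 was not applied twice
```

The delivery layer only promises that nothing is dropped. Whether duplicates matter, and how to detect them, is decided by each consumer's own logic.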
The “n+1” problem
• In a data processing pipeline, every time a new service or processing mechanism is added, it must
integrate with each of the other systems in place
Common antipattern
Handling interaction between systems becomes a pain point
• Data flow systems standardize the communication between the bus layer and each application,
and also manage the physical flow of messages between systems
Allows any number of consumers and producers to communicate using a common protocol
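The cost of the “n+1” problem can be made concrete with a small count. The sketch below assumes the worst case, where every system integrates directly with every other one; `point_to_point_links`, `bus_links`, and `MessageBus` are hypothetical names for illustration, and the in-memory bus stands in for a real data flow system.

```python
from collections import defaultdict


def point_to_point_links(n):
    """Integrations needed when each of n systems talks to every other one."""
    return n * (n - 1) // 2


def bus_links(n):
    """Integrations needed when each of n systems talks only to the bus."""
    return n


class MessageBus:
    """Tiny in-memory bus: producers and consumers share one protocol."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)


# Adding a sixth system to five existing ones:
extra_direct = point_to_point_links(6) - point_to_point_links(5)
extra_bus = bus_links(6) - bus_links(5)
print(extra_direct, extra_bus)  # 5 new integrations versus 1

bus = MessageBus()
log_store, alerts = [], []
bus.subscribe("events", log_store.append)
bus.subscribe("events", alerts.append)
delivered_to = bus.publish("events", {"type": "click"})
```

With the bus, a new consumer is one more `subscribe` call, and the existing producers do not change.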
Example Systems
• High-performance systems with sufficient scalability to support real-time streaming
Apache Kafka
Apache Flume (originally developed at Cloudera)
• Kafka
Directed towards users who are building applications from scratch, giving them the freedom to directly
integrate a data motion system
• Flume
Design makes it well suited to environments that have existing applications that need to be federated
into a single processing environment
Thank You!
In our next session: Streaming Data Processor