Apache Kafka


This week I want to discuss an up-and-coming topic, specifically Apache Kafka. For a long time, streaming data into Hadoop was not considered relevant because Hadoop meant batch MapReduce. In quite a few respects it is still batch, though there are several efforts to make it faster for interactive use, viz. Impala, Hive on Tez, Drill, etc. But the introduction of the Lambda Architecture, along with the appearance of Storm (with Trident), Spark Streaming, etc., has made real streaming more relevant.

Treating data as streams has become more relevant. If you had a chance to attend my talk last month, I made an important point that real-time analytics of data includes two perspectives, viz. real-time processing (how fast the data is available to be consumed once it is in) and real-time reporting (how fast the data can be generated given the specific requirements). Now I believe there is a third layer, viz. real-time ingestion of the data, and that is where tools like Kafka come into play. How quickly you can ingest the data matters to the systems where you process it, because it affects your SLAs in terms of delivery. Kafka, Flume, etc. fall into this category. As time passes, we will see more tools that do the same, but it also means that understanding how you incorporate these tools into your arsenal will impact how your goals are met.
