Enterprises today must deal with ever-growing data sources and an increasing variety of data, which puts greater emphasis on comprehensive and robust data flow management. The primary goal of a data flow management tool is to move and deliver data reliably from a source to a destination. Several data movement frameworks exist today that apply transformations and other operations to data in streaming or batch fashion. In this blog post, we review two popular data flow frameworks, StreamSets and NiFi, that help enterprise developers build and manage data flows without writing any code.
Why Data flow management?
Data flow management plays a crucial role in daily ETL operations in enterprises. ETL operations also include a variety of business logic that involves data manipulation, which is rarely done in a single step. A failure in any step of the process can leave corrupt data that cannot be relied upon for downstream processing. This is where data flow management tools help, by giving users a clear view of data provenance and data lineage. Vendors put emphasis on visualizing the data flow so that enterprise developers have more control over it and can monitor and perform ETL operations in a reusable and scalable way.
Both StreamSets and NiFi have their own sets of processors to connect to different sources, pull data, apply business processing, and store the results in one or more downstream data stores or systems. Both tools provide rich visual interfaces for dragging and dropping processors and configuring their properties, along with a one-click option to start or stop data pipelines.
StreamSets provides a very lightweight web app for creating data pipelines. Below is a screenshot of the same. Each pipeline can be a separate data flow, and pipelines can be filtered by status: Running, Non-Running, Invalid, or Error.
Clicking Create New Pipeline creates a new pipeline with an empty canvas, as shown below. Here we can drag and drop the processors listed on the right and connect them to build a data flow. Users can also create their own custom processors to match a specific requirement.
NiFi provides a web interface through which users create, delete, edit, monitor, and administer dataflows. It offers a large blank canvas with several options for operating dataflows, plus the option to create process groups, in which users build and separate dataflows. Process groups can be nested inside one another to multiple levels.
When we drag & drop a processor, NiFi provides an option to select the processor we wish to create (as shown below). Users can also create their own custom processors to match a specific requirement.
Are they Scalable? Do they handle Big data?
Yes! Both StreamSets and NiFi are scalable and can handle huge volumes of data, each in its own way. Let’s see how each one does it.
StreamSets can run as a standalone application or on a cluster managed by YARN or Mesos, and it uses Spark internally to run its jobs. In cluster mode, StreamSets deploys a Spark application to the YARN or Mesos cluster and uses the capacity of the existing Spark cluster. Depending on cluster resources, we can also set back pressure and throughput at the origin. It can be scaled out horizontally by adding more Hadoop nodes.
NiFi can also run in standalone mode or on a cluster. To run NiFi in cluster mode, we need to create and manage our own NiFi cluster. Depending on the size of the NiFi cluster, one can control back pressure and throughput in the executing processors. It can be scaled out horizontally by adding more NiFi nodes.
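To make the idea of back pressure concrete, here is a minimal Python sketch under stated assumptions. It does not use the API of either tool; `simulate`, `threshold`, `producer_rate` and `consumer_rate` are hypothetical names chosen for illustration. A bounded queue sits between a fast upstream stage and a slower downstream stage, and the upstream stage stops emitting once the queue reaches its threshold.

```python
from collections import deque


def simulate(items, threshold, producer_rate, consumer_rate, ticks):
    """Toy model of back pressure on a single connection (illustrative only)."""
    pending = deque(items)   # data the upstream stage still has to emit
    queue = deque()          # bounded connection between the two stages
    delivered = []           # what the downstream stage has consumed
    max_depth = 0            # deepest the queue ever got

    for _tick in range(ticks):
        # Upstream: emit up to producer_rate items, but never past the threshold.
        for _slot in range(producer_rate):
            if not pending or len(queue) >= threshold:
                break  # back pressure: upstream pauses until space frees up
            queue.append(pending.popleft())
        max_depth = max(max_depth, len(queue))

        # Downstream: drain at its own, slower rate.
        for _slot in range(consumer_rate):
            if not queue:
                break
            delivered.append(queue.popleft())

    return delivered, max_depth


delivered, max_depth = simulate(range(10), threshold=3, producer_rate=3,
                                consumer_rate=1, ticks=20)
print(delivered)   # every item arrives, in order
print(max_depth)   # the queue never grows past the threshold
```

The producer could emit three items per tick, but the threshold caps the queue, so overall throughput settles at the pace of the slower consumer rather than memory filling up.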
Data flow visualization & Monitoring
StreamSets takes a record-based approach to processing data. Its visual interface shows runtime statistics for each processor: input, output, and error record counts, throughput, and so on. Users can monitor in real time the records that end up in error and check the reason given for each individual error record.
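The record-based model can be sketched in a few lines of Python. This is only an illustration of the idea, not StreamSets code: `run_stage` and `parse_amount` are hypothetical names, and the sample records are made up. Each record in a batch is processed on its own, failures are routed to an error list together with a reason, and per-stage counts are kept.

```python
def run_stage(batch, transform):
    """Apply transform to each record; collect failures instead of aborting."""
    output, errors = [], []
    for record in batch:
        try:
            output.append(transform(record))
        except Exception as exc:
            # Keep the failing record and the reason, as an error stream would.
            errors.append({"record": record, "reason": str(exc)})
    stats = {"input": len(batch), "output": len(output), "error": len(errors)}
    return output, errors, stats


def parse_amount(record):
    """Hypothetical transformation: convert the 'amount' field to a float."""
    return {**record, "amount": float(record["amount"])}


batch = [
    {"id": 1, "amount": "10.5"},
    {"id": 2, "amount": "n/a"},   # cannot be parsed, goes to the error list
    {"id": 3, "amount": "7"},
]
output, errors, stats = run_stage(batch, parse_amount)
print(stats)
for err in errors:
    print(err["record"]["id"], err["reason"])
```

One bad record does not stop the batch: the other two flow through, and the error entry carries enough context to inspect the failure later.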
NiFi takes a file-based approach to processing data. Processing is based on the FlowFile, a lightweight file-like object on which all processor operations are performed. Users can see what has happened to a particular FlowFile through the data provenance view in the visual interface. If the user wants record-wise operations, NiFi can also split a file into records and process them, in which case each record becomes a separate FlowFile.
To summarize, the following are the differences between StreamSets (left column) and NiFi (right column):
|StreamSets||NiFi|
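As a rough illustration of the file-based model, the following Python sketch models a FlowFile as content plus attributes and splits it into one FlowFile per line while recording provenance-style events. It is a simplification under stated assumptions, not NiFi's API: the `FlowFile` class, `split_lines`, the attribute names `record.index` and `parent.id`, and the event labels are all hypothetical.

```python
from dataclasses import dataclass, field
from itertools import count

_next_id = count(1)   # simple id generator for this sketch
provenance = []       # (event, flowfile id, details) tuples


@dataclass
class FlowFile:
    """Toy stand-in for a FlowFile: raw content plus key/value attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)
    ff_id: int = field(default_factory=lambda: next(_next_id))


def split_lines(parent):
    """Split one FlowFile into a child FlowFile per line of content."""
    children = []
    for index, line in enumerate(parent.content.splitlines()):
        child = FlowFile(
            content=line,
            # Children inherit the parent's attributes and add their own.
            attributes={**parent.attributes,
                        "record.index": index,
                        "parent.id": parent.ff_id},
        )
        provenance.append(("FORK", parent.ff_id, {"child": child.ff_id}))
        children.append(child)
    return children


source = FlowFile(b"a,1\nb,2\nc,3\n", {"filename": "input.csv"})
provenance.append(("CREATE", source.ff_id, {}))
records = split_lines(source)

for event in provenance:
    print(event)
```

Because every child keeps a pointer to its parent and every step appends an event, the history of any single record can be traced back to the file it came from, which is the kind of question a provenance view answers.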
|Developed in Java||Developed in Java|
|Introduced in 2015||Introduced in 2006|
|Open source, but not an Apache project||Open source; went through the Apache Incubator and is now a full Apache project|
|Fewer contributions from the open source community||Many contributions from the open source community; one of the most active Apache projects|
|Supports Multi-tenant authorization||Supports Multi-tenant authorization|
|Follows record based approach||Follows file based approach|
|Processes a batch of records at a time; users can adjust the batch size at the origin||Processes one record at a time and is geared more towards continuous processing|
|Can use an existing Spark cluster for scalability||Needs its own NiFi cluster; scales by adding more NiFi nodes|
|APIs are available for custom Origins, Processors & destinations||APIs are available for custom Origins, Processors & destinations|
|Not well suited to orchestration||Well suited to orchestration too|
|Can export/import a pipeline as JSON||Can export/import a pipeline as an XML template|
|Versioning of pipelines is supported in the enterprise version through DPM||Versioning of flows is supported from NiFi 1.5.0|
|Data schema drift can be identified||Data schema drift cannot be identified|
|Works with CDH, MapR and HDP||Available in the Hortonworks DataFlow (HDF) distribution; works with CDH, MapR and HDP|
|Cluster coordination is handled by YARN/Mesos||Cluster coordination is handled by ZooKeeper|
|Throughput varies from processor to processor, since back pressure cannot be controlled for every processor||Both throughput and back pressure can be controlled, since the batch size can be adjusted at every stage of the pipeline|
|All processors must wait for the whole batch to complete before receiving new input||Processors do not wait for a whole batch to complete; a free processor handles new input immediately, otherwise the input waits in the queue ahead of that processor|
|Checkpointing happens only at the end of pipeline||Checkpointing happens at every processor|
|The whole batch must be reprocessed on failure, since checkpointing happens only after the whole batch completes at the end of the pipeline||On failure, the flow can resume from the processor where it failed|
|Delivery can be at-least-once||Delivery can be at-least-once|
|Easy to containerize and run pipelines in cluster mode||Difficult to containerize and run dataflows on a cluster|
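The checkpointing rows above can be illustrated with a small Python sketch. It models only the behaviour described in the table, not the internals of either tool; `run_with_retry`, `flaky` and `make_stages` are hypothetical helpers. The same three-stage pipeline is run twice with a stage that fails once: first with a checkpoint only at the end of the pipeline, then with a checkpoint after every stage.

```python
def run_with_retry(stages, batch, checkpoint_each_stage):
    """Run stages over a batch, retrying from the last checkpoint on failure.

    Returns the final result and the number of stage invocations.
    """
    calls = 0
    start, saved = 0, batch   # last checkpoint: next stage index and its input
    while True:
        current = saved
        try:
            for i in range(start, len(stages)):
                calls += 1
                current = stages[i](current)
                if checkpoint_each_stage:
                    start, saved = i + 1, current
            return current, calls
        except RuntimeError:
            continue  # retry from whatever checkpoint we have


def flaky(fn):
    """Wrap a stage so that its first invocation fails."""
    state = {"failed": False}

    def wrapper(records):
        if not state["failed"]:
            state["failed"] = True
            raise RuntimeError("transient failure")
        return fn(records)

    return wrapper


def make_stages():
    return [
        lambda rs: [r * 2 for r in rs],
        lambda rs: [r + 1 for r in rs],
        flaky(lambda rs: [r for r in rs if r > 3]),
    ]


batch_result, batch_calls = run_with_retry(make_stages(), [1, 2, 3], False)
stage_result, stage_calls = run_with_retry(make_stages(), [1, 2, 3], True)
print(batch_result, batch_calls)   # end-of-pipeline checkpoint
print(stage_result, stage_calls)   # per-stage checkpoint
```

Both runs produce the same output, but with a single end-of-pipeline checkpoint the first two stages are executed again on records they have already seen. That repeated work is why delivery in this style is described as at-least-once rather than exactly-once.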
This has primarily been an overview of both platforms; they differ further in architecture and usability, which remains to be explored. While both platforms are built for the same task of data flow management, the choice between them depends on the end user's requirements and the differences listed above.