Enterprises today must handle ever-growing data volumes and an increasing variety of data sources, which puts greater emphasis on comprehensive and robust data flow management. The primary goal of a data flow management tool is to move data reliably from a source to a destination. Several data movement frameworks exist today that transform and operate on data in either streaming or batch fashion. In this blog post, we review two popular data flow frameworks, StreamSets and NiFi, that help enterprise developers build and manage data flows without writing any code.
Data flow management plays a crucial role in daily ETL operations in enterprises. ETL operations include a variety of business logic that manipulates data, usually across several steps rather than in a single one. A failure in any step along the way can leave behind corrupt data that cannot be relied upon for downstream processing. This is where data flow management tools help, by giving users a clear view of data provenance and data lineage. Vendors therefore emphasize visualizing the data flow, so that enterprise developers have more control over it and can monitor and perform ETL operations in a reusable and scalable way.
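To make the idea of step-by-step lineage concrete, here is a minimal Python sketch. It is not how StreamSets or NiFi implement provenance; `run_pipeline`, the step names, and the shape of the lineage entries are all hypothetical and chosen only for illustration. Each step is recorded as it runs, and the pipeline stops at the first failure instead of passing partial data downstream.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A step is a (name, function) pair; both the type and the names are hypothetical.
Step = Tuple[str, Callable[[list], list]]


def run_pipeline(rows: list, steps: List[Step]) -> Tuple[Optional[list], List[Dict]]:
    """Run steps in order, recording one lineage entry per step.

    Returns (result, lineage). The result is None if any step failed,
    so downstream code never sees half-processed data.
    """
    lineage: List[Dict] = []
    data = rows
    for name, func in steps:
        try:
            result = func(data)
        except Exception as exc:  # broad on purpose: this is only a sketch
            lineage.append({"step": name, "status": "failed", "reason": str(exc)})
            return None, lineage
        lineage.append(
            {"step": name, "status": "ok", "rows_in": len(data), "rows_out": len(result)}
        )
        data = result
    return data, lineage


steps: List[Step] = [
    ("drop_empty", lambda rows: [r for r in rows if r]),
    ("to_int", lambda rows: [int(r) for r in rows]),
]

clean, clean_lineage = run_pipeline(["1", "", "3"], steps)
broken, broken_lineage = run_pipeline(["1", "x"], steps)
```

In the second run the `to_int` step fails, so `broken` is `None` and the last lineage entry names the failing step and the reason, which is the kind of information a visual provenance view surfaces.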
Both StreamSets and NiFi ship with their own sets of processors for connecting to different sources, pulling data, applying business processing, and storing the results in one or more downstream data stores or systems. Both tools give users a rich visual interface for dragging and dropping processors and configuring their properties, plus a one-click option to start or stop data pipelines.
StreamSets provides a very lightweight web app for creating data pipelines. Below is a screenshot of it. Each pipeline can be a separate data flow, and pipelines can be distinguished and viewed by status: Running, Non-Running, Invalid, or Error.
Clicking Create New Pipeline creates a new pipeline with an empty canvas, as shown below. Here we can drag and drop the processors listed on the right and connect them to build a data flow. Users can also create their own custom processors to match a specific requirement.
NiFi provides a web interface through which users create, delete, edit, monitor, and administer dataflows. It presents a large blank canvas with several options for operating dataflows, along with the option to create process groups, in which users build and separate their dataflows. Process groups can be nested inside one another to multiple levels.
When we drag and drop a processor onto the canvas, NiFi prompts us to select the processor we wish to create (as shown below). Users can also create their own custom processors to match a specific requirement.
Yes! Both StreamSets and NiFi are scalable and can handle huge volumes of data, each in its own way. Let's see how each of them does it.
StreamSets can run as a standalone application or in cluster mode on a cluster manager such as YARN or Mesos, where it uses Spark to run its jobs. In cluster mode, StreamSets deploys a Spark application to the YARN or Mesos cluster and uses the capabilities of the existing Spark cluster. Based on the available cluster resources, we can also set back pressure and throughput at the origin. It can be scaled out further by adding more Hadoop nodes.
NiFi can also run in standalone mode or on a cluster. To run NiFi in cluster mode, we need to create and manage our own NiFi cluster. Depending on the size of the NiFi cluster, one can tune back pressure and throughput for the processors in the flow; in NiFi, back pressure thresholds are set on the connections between processors. It can be scaled horizontally by adding more NiFi nodes.
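Back pressure, which both tools expose in some form, can be pictured as a bounded queue between a fast producer and a slower consumer. The Python sketch below is a toy model, not either product's implementation: the `Connection` class, the `max_objects` threshold, and the tick-based scheduler are assumptions made for illustration. When the queue reaches its threshold, the upstream side is simply not run, so data waits at the source instead of being dropped.

```python
from collections import deque
from typing import Iterator, List, Tuple


class Connection:
    """A bounded queue between two stages (hypothetical model)."""

    def __init__(self, max_objects: int) -> None:
        self.max_objects = max_objects  # assumed object-count threshold
        self.queue: deque = deque()

    def is_full(self) -> bool:
        return len(self.queue) >= self.max_objects


def run_ticks(
    source: Iterator[int],
    connection: Connection,
    ticks: int,
    produce_per_tick: int,
    consume_per_tick: int,
) -> Tuple[List[int], int]:
    """Simulate scheduling rounds; return (delivered items, skipped producer runs)."""
    delivered: List[int] = []
    skipped = 0
    for _ in range(ticks):
        for _ in range(produce_per_tick):
            if connection.is_full():
                skipped += 1  # back pressure: upstream is not run this time
                continue
            item = next(source, None)
            if item is not None:
                connection.queue.append(item)
        for _ in range(consume_per_tick):
            if connection.queue:
                delivered.append(connection.queue.popleft())
    return delivered, skipped


conn = Connection(max_objects=3)
delivered, skipped = run_ticks(iter(range(10)), conn, ticks=4,
                               produce_per_tick=3, consume_per_tick=1)
```

With a producer three times faster than the consumer, the queue never grows past three items, and everything that was read from the source is either delivered in order or still waiting in the queue.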
StreamSets takes a record-based approach to processing data. Its visual interface shows, at runtime, statistics for each processor: input records, output records, error records, throughput, and so on. Users can monitor in real time the records that end up in error and check the reason reported for each individual error record.
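The record-based model described above can be sketched in a few lines of Python. This is an illustration only and does not use any StreamSets API: `StageStats`, `run_stage`, and `parse_amount` are hypothetical names. The point is that each record succeeds or fails on its own, and a failed record is kept together with its reason rather than failing the whole batch.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


@dataclass
class StageStats:
    """Counters a record-based stage might expose at runtime (hypothetical)."""
    input_records: int = 0
    output_records: int = 0
    error_records: int = 0
    errors: List[Tuple[Record, str]] = field(default_factory=list)


def run_stage(
    records: List[Record],
    transform: Callable[[Record], Record],
    stats: StageStats,
) -> List[Record]:
    """Apply transform to each record; route failures to stats.errors."""
    out: List[Record] = []
    for record in records:
        stats.input_records += 1
        try:
            out.append(transform(record))
            stats.output_records += 1
        except (KeyError, ValueError, TypeError) as exc:
            # Keep the bad record and the reason so it can be inspected later.
            stats.error_records += 1
            stats.errors.append((record, f"{type(exc).__name__}: {exc}"))
    return out


def parse_amount(record: Record) -> Record:
    """Example transform: convert the 'amount' field to a float."""
    return {**record, "amount": float(record["amount"])}


stats = StageStats()
rows = [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "n/a"}, {"id": 3}]
good = run_stage(rows, parse_amount, stats)
```

Here one record passes, one fails on a bad value, and one fails on a missing field; `stats.errors` holds both failed records with the reason for each.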
NiFi takes a file-based approach to processing data. Processing is built around the FlowFile, NiFi's lightweight unit of data, which carries content together with a set of attributes and on which all processor operations are performed. Users can see what has happened to a particular FlowFile through the visual interface's data provenance view. NiFi also makes it possible to split a file into records when the user wants to perform record-wise operations, in which case each record becomes a separate FlowFile.
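As a rough picture of what splitting a file into per-record FlowFiles means, consider the Python sketch below. It does not use NiFi's API; `FlowFileSketch`, `split_lines`, and the `split.index` / `split.count` attribute names are hypothetical. Each child keeps the parent's attributes and adds its own position, so every record can still be traced back to the file it came from.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class FlowFileSketch:
    """Toy stand-in for a FlowFile: content bytes plus string attributes."""
    content: bytes
    attributes: Dict[str, str] = field(default_factory=dict)


def split_lines(parent: FlowFileSketch) -> List[FlowFileSketch]:
    """Split the parent's content into one child per non-empty line."""
    lines = [line for line in parent.content.splitlines() if line.strip()]
    children: List[FlowFileSketch] = []
    for index, line in enumerate(lines):
        attrs = dict(parent.attributes)  # children inherit parent attributes
        attrs["split.index"] = str(index)        # hypothetical attribute name
        attrs["split.count"] = str(len(lines))   # hypothetical attribute name
        children.append(FlowFileSketch(content=line, attributes=attrs))
    return children


parent = FlowFileSketch(b"a,1\nb,2\n\nc,3\n", {"filename": "in.csv"})
children = split_lines(parent)
```

The three non-empty lines become three separate objects, each still carrying `filename`, after which record-wise operations can be applied to them one at a time.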
To summarize, the following are the differences between StreamSets & NiFi
This has primarily been an overview of both platforms; they also differ in architecture and usability, which merit further exploration. Both platforms were built for the same task of data flow management, so the choice between them comes down to the end user's requirements, weighed against the differences described above.