Challenge

  • Recurring challenges with the current traditional data warehouse
    • Cannot scale anymore
    • Maintainability
    • Cost-intensive
    • Single point of failure
  • As digitalization increased everywhere, data grew and diversified, exposing the classic 3 V challenges: 
    • Volume
    • Velocity
    • Variety
  • Falling well behind customers’ demands
  • Few options to extend the platform or add functionality

Processing then

  • Store data in huge SQL databases
  • Complex SQL queries and stored procedures to process data
  • High-performance enterprise servers to process smaller amounts of data
  • Manual testing
  • Aging monolithic solution

Processing now

  • Store data in fault-tolerant distributed storage
  • High-level languages to process data
  • Economically distributed clusters to process huge amounts of data
  • Automated functional and unit testing
  • Modern, modular solution

Solution

  • Leverage the Big Data stack to run the regular ETL process without disrupting the existing platform 
    • Use Sqoop to import data from the existing relational data warehouse into HDFS
    • Use Hive to normalize/denormalize the imported data as the application requires
    • Use Apache Spark to process the data in memory and write the results back to HDFS
    • Export the results back to the existing relational data warehouse
  • The entire ETL process is orchestrated with Apache NiFi data pipelines, providing a seamless, automated operational experience
  • Troubleshooting is straightforward, even in production
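The four ETL steps above can be sketched as a shell pipeline. This is a minimal illustration only: the JDBC connection string, usernames, table names, HDFS paths, and the `aggregate_sales.py` Spark job are all hypothetical placeholders, not the actual production configuration, and in the deployed solution these steps are triggered by NiFi processors rather than a script.

```shell
#!/usr/bin/env bash
set -euo pipefail

# 1. Import a source table from the relational warehouse into HDFS
#    (connection string, credentials, and table name are illustrative)
sqoop import \
  --connect "jdbc:oracle:thin:@//dw-host:1521/DWPROD" \
  --username etl_user --password-file /user/etl/.dw_password \
  --table SALES_FACT \
  --target-dir /data/raw/sales_fact \
  --num-mappers 4

# 2. Normalize/denormalize in Hive as the application requires
#    (schema and join are hypothetical examples)
hive -e "
  INSERT OVERWRITE TABLE staging.sales_flat
  SELECT s.*, c.customer_name, c.region
  FROM raw.sales_fact s
  JOIN raw.customer_dim c ON s.customer_id = c.customer_id;
"

# 3. Process the data in memory with Spark on YARN; the job writes
#    its results back to HDFS (job script is a placeholder)
spark-submit --master yarn --deploy-mode cluster aggregate_sales.py

# 4. Export the results back to the existing relational warehouse
sqoop export \
  --connect "jdbc:oracle:thin:@//dw-host:1521/DWPROD" \
  --username etl_user --password-file /user/etl/.dw_password \
  --table SALES_AGG \
  --export-dir /data/curated/sales_agg
```

In the actual deployment, NiFi schedules and chains these stages, which is what makes the operational experience hands-off and auditable.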

Deployment Infrastructure

  • Cloudera’s CDH
  • Apache NiFi
  • Sqoop
  • Hive
  • Spark

Business impact

  • Significantly reduced operational infrastructure cost
  • Processing that previously took a full day now completes in 2-3 hours
  • Enabled new products/applications that attracted prospective customers’ interest
  • Opened a gateway to perform analytics & machine learning at scale to derive key business insights