Data sources

– Healthcare data is generated by, and must be consumed from, a wide variety of sources

– These can be traditional databases, data warehouses, data lakes, flat files, IoT devices, EHRs, and more

Data formats

– To move healthcare data from point A to point B, the data must be in a standard format. These formats vary based on what the data represents and where it originates

– Common standard formats include HL7, FHIR, CDM, CDA, DICOM, and custom-delimited files
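To make the contrast between formats concrete, here is a minimal sketch (in Python, with made-up patient details) showing the same demographic information as an HL7 v2 message and as a FHIR Patient resource in JSON:

```python
import json

# The same patient demographics in two standard formats.
# All names, IDs, and system URIs below are illustrative only.

# HL7 v2: segments separated by carriage returns, fields by pipes.
hl7_message = (
    "MSH|^~\\&|SEND_APP|SEND_FAC|RECV_APP|RECV_FAC|20240101120000||ADT^A01|MSG00001|P|2.5\r"
    "PID|1||12345^^^HOSP^MR||Doe^Jane||19800101|F"
)

# FHIR: the same information as a structured JSON resource.
fhir_patient = {
    "resourceType": "Patient",
    "identifier": [{"system": "urn:example:hospital-mrn", "value": "12345"}],
    "name": [{"family": "Doe", "given": ["Jane"]}],
    "birthDate": "1980-01-01",
    "gender": "female",
}

print(json.dumps(fhir_patient, indent=2))
```

The pipe-delimited HL7 v2 form is compact but positional, while the FHIR form is self-describing, which is part of why FHIR is a convenient common target (as discussed below).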

Data Types

– Data comes in several types, such as structured, semi-structured, and unstructured (for example, images), and collecting and maintaining all of them is a challenge

Socio-economic Data

– This type of human data helps predict and analyse various conditions and scenarios, making it very useful for ML

– However, the variable veracity and inconsistency of this data make it difficult to use

Data volume and scalability

– Healthcare data arrives in huge volumes and includes many types of information

– The platform must scale vertically or horizontally with the volume of incoming data

Data Security & Governance

– HIPAA Compliance


The Data Ingestion process has two phases

1. Data collection at the Data Exchange Platform

2. Data movement and conversion using Data Ingestion Pipelines

Data Exchange Platform

Data from different sources is first collected here in secure storage, where data operations such as de-identification, anonymization, and stitching can take place if necessary.

To protect the privacy of patients, payers, and providers, we separate Personally Identifiable Information (PII) from Protected Health Information (PHI), which also keeps us compliant with security standards.
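A minimal sketch of this separation step, assuming a flat record and an illustrative (not exhaustive) list of direct-identifier fields:

```python
# Hypothetical sketch: split direct identifiers (PII) from the clinical
# payload so each can be stored and governed independently.
# Field names are illustrative, not from any specific standard.
PII_FIELDS = {"name", "ssn", "address", "phone", "email"}

def split_record(record: dict) -> tuple[dict, dict]:
    """Return (pii, remainder) partitions of a flat patient record."""
    pii = {k: v for k, v in record.items() if k in PII_FIELDS}
    rest = {k: v for k, v in record.items() if k not in PII_FIELDS}
    return pii, rest

pii, clinical = split_record({
    "patient_id": "p-001",
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "diagnosis_code": "E11.9",
})
```

In practice the PII partition would be stored separately under stricter access controls, with only a pseudonymous key joining the two.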

Stitching and linking of related data also takes place here, making the data easier to identify and manage and simplifying the steps that follow.
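One common way to link records across sources is a stable pseudonymous key derived from patient identifiers. The sketch below is a hypothetical illustration of that idea (the salt, fields, and normalization are assumptions, not the platform's actual scheme):

```python
import hashlib

# Hypothetical sketch: derive a stable pseudonymous link key from patient
# identifiers so records from different sources can be stitched together
# without exposing the raw identifiers downstream.
def link_key(mrn: str, dob: str, salt: str = "site-secret") -> str:
    # Normalize inputs so trivial formatting differences still match.
    normalized = f"{salt}|{mrn.strip()}|{dob.strip()}"
    return hashlib.sha256(normalized.encode()).hexdigest()[:16]

# Two records for the same patient from different feeds yield the same key.
key_a = link_key("12345", "1980-01-01")
key_b = link_key(" 12345 ", "1980-01-01")
```

Deterministic hashing like this handles exact-identifier matches; fuzzier cases (name variants, typos) need probabilistic record linkage on top.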

Data Ingestion Pipelines

All this raw data from multiple sources and disparate formats is stored in a data warehouse in the FHIR standard format. We chose FHIR because it brings various types of data into a single place as JSON or XML, making the data easier to manage while providing the advantages that come with the standard, such as APIs, applications, and interoperability.
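As a concrete illustration of such a conversion, here is a minimal sketch mapping one row of a hypothetical pipe-delimited lab feed into a FHIR Observation resource (the field positions and feed layout are assumptions for illustration):

```python
import json

# Hypothetical sketch: map one row of a pipe-delimited lab feed into a
# minimal FHIR Observation resource as JSON. The source layout
# (patient_id|loinc_code|value|unit) is an assumption for illustration.
def row_to_observation(row: str) -> dict:
    patient_id, loinc_code, value, unit = row.split("|")
    return {
        "resourceType": "Observation",
        "status": "final",
        "code": {"coding": [{"system": "http://loinc.org", "code": loinc_code}]},
        "subject": {"reference": f"Patient/{patient_id}"},
        "valueQuantity": {"value": float(value), "unit": unit},
    }

obs = row_to_observation("p-001|718-7|13.5|g/dL")
print(json.dumps(obs, indent=2))
```

A real pipeline adds validation against FHIR profiles and error handling for malformed rows, but the core of each pipeline is a source-specific mapping like this one.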

To perform this conversion, we create customized data pipelines for each data source and schema format, producing cleansed and transformed data at the end.

These pipelines are built using Apache NiFi for orchestration and data transformation. NiFi also provides processors that help handle healthcare standard formats such as HL7: we can validate messages and extract HL7 attributes from the message blocks, then perform further actions such as creating JSON objects to store or transform the data.
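Conceptually, the extraction step works like the sketch below, a minimal stdlib splitter standing in for NiFi's HL7 processors (a production pipeline would use those processors or a dedicated parser, and the attribute choices here are illustrative):

```python
# Hypothetical sketch of what an HL7 extraction step does: split an HL7 v2
# message into segments and fields, then pull attributes of interest into a
# JSON-ready dict. In NiFi this is handled by its HL7 processors.
def parse_hl7(message: str) -> dict:
    # Segments are carriage-return separated; fields are pipe separated.
    segments = {line.split("|")[0]: line.split("|")
                for line in message.split("\r") if line}
    msh, pid = segments["MSH"], segments["PID"]
    return {
        # MSH-1 is the field separator itself, so MSH-9 sits at index 8.
        "message_type": msh[8],                  # e.g. "ADT^A01"
        "patient_id": pid[3].split("^")[0],      # PID-3, first component
        "patient_name": pid[5],                  # PID-5, family^given
    }

msg = ("MSH|^~\\&|APP|FAC|APP2|FAC2|20240101||ADT^A01|1|P|2.5\r"
       "PID|1||12345^^^HOSP^MR||Doe^Jane")
attrs = parse_hl7(msg)
```

The extracted attributes can then drive routing decisions or be serialized as JSON for downstream storage.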

As an alternative, we also support StreamSets, where the same pipelines are built. With the custom processors and stage libraries we built at Predera to handle healthcare data (for instance, HL7 messages), we can read an HL7 message, validate it, and then, based on the message type, make routing decisions and apply transformations before pushing the message to any destination. The same actions and operations can be performed on FHIR resources.
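The routing decision such a stage makes can be sketched as follows (the destination names are made up for illustration, not the platform's actual topics):

```python
# Hypothetical sketch of type-based routing: inspect the HL7 message type
# (MSH-9, e.g. "ADT^A01" or "ORU^R01") and choose a destination.
# Destination names are illustrative only.
def route(message_type: str) -> str:
    event = message_type.split("^")[0]
    if event == "ADT":            # admit/discharge/transfer events
        return "patient-admissions"
    if event == "ORU":            # observation results, e.g. lab reports
        return "lab-results"
    return "dead-letter"          # unrecognized types go to review
```

A dead-letter destination for unrecognized types keeps bad or unexpected messages from silently disappearing from an audited pipeline.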

To process big data or large batches, Apache Spark can be leveraged, performing in-memory parallel processing of the data. NiFi/StreamSets and Spark are integrated together to make these pipelines even more flexible.
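The parallel-batch pattern Spark applies can be illustrated with a small stand-in using Python's standard library (the transformation is a made-up example; in the real pipelines this role is played by Spark executors working on in-memory partitions):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in illustration of Spark-style parallel batch processing using
# Python's concurrent.futures: apply the same transformation to many
# records concurrently. Spark does this across distributed executors.
def transform(record: dict) -> dict:
    # Example transformation: normalize a code field to upper case.
    return {**record, "code": record["code"].upper()}

def process_batch(records: list[dict], workers: int = 4) -> list[dict]:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transform, records))

result = process_batch([{"code": "e11.9"}, {"code": "i10"}])
```

The key property, shared with Spark, is that the per-record transformation is stateless, so partitions can be processed independently and in any order.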

We worked closely with several vendors, and this system was used to pull data from EHRs such as Epic, SMART contracts from the Akiri Data Platform, and de-identified datasets from Datavant.

All these systems and processes are HIPAA compliant: every operation and movement of data is logged and audited.

Predera Benefit

– A robust data ingestion platform that can handle varied sources, formats, types, and volumes of healthcare data

– A HIPAA-compliant environment with integrated security standards to store and process the data

– A data platform with data ready for ML use, along with integration of Predera’s DataModel and AI platform as an extension

– Support for everything from big-data batches to near-real-time streaming data

– Works with major EHRs out of the box