Components of Data Strategy
(Some definitions are taken from IBM.com)
Data Pipeline:
A data pipeline is a method in which raw data is ingested from various data sources, transformed and then ported to a data store, such as a data lake or data warehouse, for analysis.
Before data flows into a data repository, it usually undergoes some data processing. This includes data transformations, such as filtering, masking, and aggregations, which ensure appropriate data integration and standardization. This is particularly important when the destination for the dataset is a relational database. This type of data repository has a defined schema which requires alignment (that is, matching data columns and types) to update existing data with new data.
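A minimal sketch of these transformations in Python, assuming a made-up target schema and made-up field names (region, card_number, amount); it is an illustration of filtering, masking, aggregation, and schema alignment, not a reference implementation:

```python
from collections import defaultdict

# Hypothetical target schema: column name -> expected Python type.
TARGET_SCHEMA = {"region": str, "card_number": str, "amount": float}


def mask_card(number: str) -> str:
    """Masking: keep only the last four characters visible."""
    return "*" * (len(number) - 4) + number[-4:]


def transform(raw_rows):
    """Filter, align to TARGET_SCHEMA, and mask raw rows."""
    cleaned = []
    for row in raw_rows:
        # Filtering: drop rows that have no amount.
        if row.get("amount") in (None, ""):
            continue
        # Schema alignment: keep only known columns and coerce their types.
        aligned = {col: typ(row[col]) for col, typ in TARGET_SCHEMA.items()}
        aligned["card_number"] = mask_card(aligned["card_number"])
        cleaned.append(aligned)
    return cleaned


def aggregate_by_region(rows):
    """Aggregation: total amount per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


raw = [
    {"region": "east", "card_number": "4111111111111111", "amount": "10.5", "note": "x"},
    {"region": "east", "card_number": "4222222222222222", "amount": "4.5"},
    {"region": "west", "card_number": "4333333333333333", "amount": None},
]
rows = transform(raw)
totals = aggregate_by_region(rows)
```

The extra "note" column is dropped and the string amounts become floats, which is the kind of column and type matching a relational destination would require.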
Data can be sourced from:
- APIs
- SQL/Non-SQL databases
- Flat files
- Other formats
Types of data pipelines
a. Batch Processing
Batch processing loads “batches” of data into a repository during set time intervals, which are typically scheduled during off-peak business hours. This way, other workloads aren’t impacted as batch processing jobs tend to work with large volumes of data, which can tax the overall system. Batch processing is usually the optimal data pipeline when there isn’t an immediate need to analyze a specific dataset (for example, monthly accounting), and it is more associated with the ETL data integration process, which stands for “extract, transform, and load.”
Batch processing jobs form a workflow of sequenced commands, where the output of one command becomes the input of the next command. For example, one command might kick off data ingestion, the next command may trigger filtering of specific columns, and the subsequent command may handle aggregation. This series of commands continues until the data is completely transformed and written into a data repository.
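The chaining of commands can be sketched as a list of functions where each one receives the previous one's output. The step names and sample rows here are hypothetical, and a real batch job would read from and write to actual systems:

```python
from functools import reduce


def ingest(_):
    """Step 1: pretend to read a batch from a source system."""
    return [
        {"id": 1, "dept": "sales", "amount": 100.0, "internal_flag": "a"},
        {"id": 2, "dept": "sales", "amount": 50.0, "internal_flag": "b"},
        {"id": 3, "dept": "hr", "amount": 50.0, "internal_flag": "c"},
    ]


def select_columns(rows):
    """Step 2: keep only the columns needed downstream."""
    keep = ("dept", "amount")
    return [{k: r[k] for k in keep} for r in rows]


def aggregate(rows):
    """Step 3: total amount per department."""
    totals = {}
    for r in rows:
        totals[r["dept"]] = totals.get(r["dept"], 0.0) + r["amount"]
    return totals


def run_batch(steps, initial=None):
    """Run steps in order; each step's output is the next step's input."""
    return reduce(lambda data, step: step(data), steps, initial)


result = run_batch([ingest, select_columns, aggregate])
```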
We did this with Control-M scheduling: each job passes an out condition to the downstream application(s), which pick it up as an in condition to continue with ingestion, transformation, and so on.
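The in/out condition idea can be sketched generically in Python. This is not the Control-M API; the job names and condition names are invented, and the loop only illustrates that a job runs once all of its in conditions have been raised by upstream jobs:

```python
def run_jobs(jobs):
    """Run each job once every one of its in conditions is present.

    jobs: list of dicts with "name", "in" (set of conditions), "out" (condition).
    Returns (names executed in order, names still blocked).
    """
    conditions = set()
    executed = []
    pending = list(jobs)
    progressed = True
    while pending and progressed:
        progressed = False
        for job in list(pending):
            if job["in"] <= conditions:
                executed.append(job["name"])
                conditions.add(job["out"])  # out condition for downstream jobs
                pending.remove(job)
                progressed = True
    return executed, [job["name"] for job in pending]


# Jobs are deliberately listed out of order; conditions decide the sequence.
jobs = [
    {"name": "transform", "in": {"INGEST-OK"}, "out": "TRANSFORM-OK"},
    {"name": "ingest", "in": set(), "out": "INGEST-OK"},
    {"name": "outbound", "in": {"TRANSFORM-OK"}, "out": "OUTBOUND-OK"},
    {"name": "orphan", "in": {"NEVER-RAISED"}, "out": "ORPHAN-OK"},
]
executed, blocked = run_jobs(jobs)
```

A job whose in condition is never raised simply stays blocked, which mirrors a downstream job waiting on an upstream failure.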
b. Streaming data pipelines / event-driven architectures
Unlike batch processing, streaming data pipelines (also known as event-driven architectures) continuously process events generated by various sources, such as sensors or user interactions within an application. Events are processed and analyzed, and then either stored in databases or sent downstream for further analysis.
Streaming is used when data must be continuously updated. For example, apps or point-of-sale systems need real-time data to update inventory and sales history of their products; that way, sellers can inform consumers if a product is in stock or not. A single action, such as a product sale, is considered an “event,” and related events, such as adding an item to checkout, are typically grouped together as a “topic” or “stream.” These events are then transported via messaging systems or message brokers, such as the open-source Apache Kafka.
Since data events are processed shortly after they occur, stream processing systems have lower latency than batch systems. They are generally considered less reliable, however, because messages can be unintentionally dropped or spend a long time in a queue. Message brokers help to address this concern through acknowledgements, where a consumer confirms to the broker that it has processed a message so that the broker can remove it from the queue.
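To make topics and acknowledgements concrete, here is a toy in-memory broker. It is not Kafka or any real broker's API; the class, method names, and the inventory example are assumptions chosen only to show that an unacknowledged message is kept and can be delivered again:

```python
from collections import defaultdict, deque


class TinyBroker:
    """Toy broker: one queue per topic, messages kept until acknowledged."""

    def __init__(self):
        self.topics = defaultdict(deque)
        self.in_flight = {}  # msg_id -> (topic, event), delivered but not acked
        self._next_id = 0

    def publish(self, topic, event):
        self._next_id += 1
        self.topics[topic].append((self._next_id, event))
        return self._next_id

    def poll(self, topic):
        """Deliver the next message, or None if the topic is empty."""
        if not self.topics[topic]:
            return None
        msg_id, event = self.topics[topic].popleft()
        self.in_flight[msg_id] = (topic, event)
        return msg_id, event

    def ack(self, msg_id):
        """Consumer confirms processing; the broker forgets the message."""
        self.in_flight.pop(msg_id, None)

    def redeliver_unacked(self):
        """Put unacknowledged messages back at the front of their queues."""
        for msg_id, (topic, event) in sorted(self.in_flight.items(), reverse=True):
            self.topics[topic].appendleft((msg_id, event))
        self.in_flight.clear()


broker = TinyBroker()
broker.publish("sales", {"sku": "A1", "qty": 1})
broker.publish("sales", {"sku": "B2", "qty": 2})
inventory = {"A1": 5, "B2": 3}

# First sale: processed and acknowledged.
msg_id, event = broker.poll("sales")
inventory[event["sku"]] -= event["qty"]
broker.ack(msg_id)

# Second sale: the consumer "crashes" before acknowledging.
lost_id, _ = broker.poll("sales")
broker.redeliver_unacked()

# The same message is delivered again and processed this time.
retry_id, event = broker.poll("sales")
inventory[event["sku"]] -= event["qty"]
broker.ack(retry_id)
```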
Data Pipeline Architecture
Remember the DDEP zones:
Raw zone or Landing zone: No user access. True source data is stored in source format; history is retained.
Democratized zone: Source data with field-level encryption. Natural keys are extracted to enable integration; history is retained. Provides seamless access to enterprise data without the overhead of stringent and bureaucratic access controls. Sensitive data is encrypted.
Publish zone: Operational data sets. Downstream applications connect to source data; SLA-driven. Data structures are designed to meet consumption patterns. Publish data is consumed by downstream apps, APIs, the EDW, and BI/analytics tools.
Discovery zone: Another zone beneath the democratized zone, to which data science tools and BI/analytics tools connect to serve their individual needs.
FDP/BDP/XDP (Data Products): Additional layers to the right of Publish from which applications source data for their individual needs.
> Ingestion: data is ingested into raw and moves from there to publish/BDPs.
> Transformation: transformation happens before data is moved to publish and the BDPs.
> Storing: data is stored in publish and the BDPs.
When data is sent downstream, an outbound job sends it to Sterling or another system and passes an out condition.
ETL vs. Data Pipelines
An ETL Pipeline ends with loading the data into a database or data warehouse. A Data Pipeline doesn't always end with the loading. In a Data Pipeline, the loading can instead activate new processes and flows by triggering webhooks in other systems.
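The difference can be sketched with plain Python callbacks standing in for webhooks. The function names, the in-memory list used as the "warehouse", and the notification text are all hypothetical:

```python
def run_etl(rows, store):
    """ETL: extract, transform (drop non-positive amounts), load. Ends here."""
    store.extend(r for r in rows if r["amount"] > 0)
    return store


def run_pipeline(rows, store, on_loaded=()):
    """Data pipeline: same load, but loading then triggers follow-up flows."""
    run_etl(rows, store)
    for hook in on_loaded:
        hook(len(store))  # stand-in for calling a webhook in another system
    return store


notifications = []


def notify_reporting(row_count):
    """Hypothetical downstream hook that reacts to a completed load."""
    notifications.append(f"loaded {row_count} rows")


warehouse = []
run_pipeline(
    [{"amount": 5}, {"amount": -1}, {"amount": 2}],
    warehouse,
    on_loaded=[notify_reporting],
)
```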
Data Lineage
Data lineage is the process of tracking the flow of data over time, providing a clear understanding of where the data originated, how it has changed, and its ultimate destination within the data pipeline.
Data lineage tools provide a record of data throughout its lifecycle, including source information and any data transformations that have been applied during any ETL or ELT processes.
This type of documentation enables users to observe and trace different touchpoints along the data journey, allowing organizations to validate for accuracy and consistency. This is a critical capability to ensure data quality within an organization. It is commonly used to gain context about historical processes as well as trace errors back to the root cause.
Reliable data is essential to drive better decision-making and process improvement across all facets of business, from sales to human resources. However, this information is valuable only if stakeholders remain confident in its accuracy, as insights are only as good as the quality of the data. Data lineage gives visibility into changes that may occur as a result of data migrations, system updates, errors and more, helping to ensure data integrity throughout its lifecycle.
Data lineage documents the relationship between enterprise data in various business and IT applications.
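As a rough illustration of what a lineage record captures, the sketch below attaches a log of steps to a dataset as it is transformed. The class, the step labels, and the "orders.csv" source name are invented for the example; real lineage tools track far more metadata:

```python
from dataclasses import dataclass, field


@dataclass
class TrackedDataset:
    """A dataset that carries a record of every step applied to it."""
    name: str
    rows: list
    lineage: list = field(default_factory=list)

    def apply(self, step_name, func):
        """Apply a transformation and append a lineage entry for it."""
        new_rows = func(self.rows)
        entry = {
            "step": step_name,
            "rows_in": len(self.rows),
            "rows_out": len(new_rows),
        }
        return TrackedDataset(self.name, new_rows, self.lineage + [entry])


def drop_negative(rows):
    return [r for r in rows if r["amount"] >= 0]


def to_cents(rows):
    return [{"amount_cents": int(round(r["amount"] * 100))} for r in rows]


source = TrackedDataset(
    "orders",
    [{"amount": 10}, {"amount": -3}, {"amount": 7}],
    lineage=[{"step": "ingest:orders.csv", "rows_in": 0, "rows_out": 3}],
)
final = source.apply("filter:drop_negative", drop_negative).apply(
    "transform:to_cents", to_cents
)
steps = [entry["step"] for entry in final.lineage]
```

Reading the lineage list from the end backwards is the "trace errors back to the root cause" use: a sudden drop in rows_out points at the step that removed the data.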
Data Warehouse vs. Data Lake vs. Data Mart
A data warehouse is a system that aggregates data from multiple sources into a single, central, consistent data store to support data mining, artificial intelligence (AI), and machine learning—which, ultimately, can enhance sophisticated analytics and business intelligence. Through this strategic collection process, data warehouse solutions consolidate data from the different sources to make it available in one unified form.
A data mart is a focused version of a data warehouse that contains a smaller subset of data important to and needed by a single team or a select group of users within an organization. A data mart is built from an existing data warehouse (or other data sources) through a complex procedure that involves multiple technologies and tools to design and construct a physical database, populate it with data, and set up intricate access and management protocols.
While building a data mart is a challenging process, it enables a business line to discover more focused insights more quickly than working with a broader data warehouse data set. For example, a marketing team may benefit from creating a data mart from an existing warehouse, as its activities are usually performed independently from the rest of the business. Therefore, the team doesn’t need access to all enterprise data.
A data lake, too, is a repository for data. A data lake provides massive storage of unstructured or raw data fed via multiple sources, but the information has not yet been processed or prepared for analysis. As a result of being able to store data in a raw format, data lakes are more accessible and cost-effective than data warehouses. There is no need to clean and process data before ingesting.
For example, governments can use technology to track data on traffic behavior, power usage, and waterways, and store it in a data lake while they figure out how to use the data to create “smarter cities” with more efficient services.
Data Identification:
Critical Data Elements
Every organization handles a vast volume of data, but not all data are equally crucial to its objectives. Organizations prioritize data governance based on business goals, regulatory requirements, and risk tolerance. By focusing efforts on critical data, particularly Critical Data Elements (CDEs), they can manage data risks effectively as part of their risk management strategy.
A data element is a unit of data. Critical Data Elements are those that, if missing or of low quality, will impact a business's ability to operate.
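One way to act on this is to measure completeness only for the elements flagged as critical. The sketch below assumes a hypothetical CDE list (customer_id, tax_id) and an arbitrary 95% threshold; both would come from the organization's own governance rules:

```python
# Hypothetical list of Critical Data Elements for a customer record.
CRITICAL_ELEMENTS = {"customer_id", "tax_id"}


def cde_completeness(records, critical=CRITICAL_ELEMENTS):
    """Return, per CDE, the fraction of records with a non-empty value."""
    total = len(records)
    report = {}
    for element in sorted(critical):
        filled = sum(1 for r in records if r.get(element) not in (None, ""))
        report[element] = filled / total if total else 0.0
    return report


records = [
    {"customer_id": "C1", "tax_id": "T1", "nickname": ""},
    {"customer_id": "C2", "tax_id": "", "nickname": "Bo"},
    {"customer_id": "C3", "tax_id": None},
    {"customer_id": "C4", "tax_id": "T4"},
]
report = cde_completeness(records)

# Assumed threshold: flag any CDE that is less than 95% complete.
failing = [element for element, score in report.items() if score < 0.95]
```

The empty nickname is ignored because it is not a CDE, while the missing tax_id values are flagged.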
Data Storage
Data Provisioning
Data Integration
Data Governance