Data Warehouse
Introduction
- A data warehouse (DW) is a digital storage system that connects and harmonises large amounts of data from many different sources.
- Its purpose is to feed business intelligence (BI), reporting, and analytics, and support regulatory requirements – so companies can turn their data into insight and make smart, data-driven decisions.
- Data warehouses store current and historical data in one place and act as the single source of truth for an organisation.
- A data warehouse is kept separate from the operational DBMS. It stores huge amounts of data, typically collected from multiple heterogeneous sources such as files, operational databases, and external feeds, with the goal of producing analytical and statistical results that support decision-making.
- An ordinary operational database typically stores megabytes to gigabytes of data for a specific application. When data grows to terabyte scale and must be analysed across applications, storage shifts to the data warehouse.
Key Components of a Data Warehouse
A typical data warehouse has four main components:
- Central database
- ETL (extract, transform, load) tools
- Metadata
- Access tools
All of these components are engineered for speed so that you can get results quickly and analyse data on the fly.

Central database:
A database serves as the foundation of your data warehouse. Traditionally, these have been standard relational databases running on premises or in the cloud. But because of Big Data, the need for true real-time performance, and a drastic reduction in the cost of RAM, in-memory databases are rapidly gaining in popularity.
Data integration:
Data is pulled from source systems and modified to align the information for rapid analytical consumption. This is done using a variety of data integration approaches, such as ETL (extract, transform, load) and ELT, as well as real-time data replication, bulk-load processing, data transformation, and data quality and enrichment services.
Metadata:
Metadata is data about your data. It specifies the source, usage, values, and other features of the data sets in your data warehouse. There is business metadata, which adds context to your data, and technical metadata, which describes how to access data – including where it resides and how it is structured.
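For illustration, the metadata for a single warehouse table can be as simple as a small structured record. The sketch below is a minimal Python example; every field name and value in it (source_system, refresh_schedule, and so on) is hypothetical rather than part of any metadata standard.

```python
# Minimal sketch of metadata for one warehouse table.
# All field names and values here are hypothetical examples.
sales_orders_metadata = {
    # Technical metadata: where the data lives and how it is structured
    "table": "dw.sales_orders",
    "source_system": "erp_orders_api",        # where the data is extracted from
    "columns": {"order_id": "INTEGER", "order_date": "DATE", "amount": "DECIMAL(10,2)"},
    "refresh_schedule": "daily 02:00 UTC",
    # Business metadata: context that helps analysts interpret the data
    "owner": "Sales Operations",
    "description": "One row per confirmed customer order, net of cancellations.",
}

print(sales_orders_metadata["description"])
```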
Data warehouse access tools:
Access tools allow users to interact with the data in your data warehouse. Examples of access tools include: query and reporting tools, application development tools, data mining tools, and OLAP tools.
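To make this concrete, here is a minimal sketch of the kind of aggregate query a reporting or OLAP tool typically issues against a warehouse. It uses Python's built-in sqlite3 module with a tiny in-memory table purely as a stand-in for a real warehouse; the table and column names are illustrative.

```python
import sqlite3

# Stand-in for a warehouse table; a real access tool would connect to the DW instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("EMEA", 2023, 120.0), ("EMEA", 2024, 150.0), ("APAC", 2024, 90.0)],
)

# A typical report-style query: revenue by region and year.
for row in conn.execute(
    "SELECT region, year, SUM(amount) FROM sales GROUP BY region, year ORDER BY region, year"
):
    print(row)
```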
Data Warehouse Architecture
The two most common architecture approaches are those of Kimball and Inmon.
Kimball
- Kimball’s approach to designing a data warehouse was introduced by Ralph Kimball.
- This approach starts with identifying the business processes and the questions the data warehouse has to answer; this information is analysed and documented thoroughly.
- Extract, Transform, Load (ETL) software brings data from the source systems into a common staging area, from which it is loaded into business-process-oriented data marts.
- The data marts are modelled as dimensional star schemas and may also be exposed as OLAP cubes.
- Application setup and build-out are quick, and generating reports across multiple star schemas works well.
- Database operations are efficient, the model occupies less space in the database, and management is easy.
- Implementation proceeds in iterative steps and is cost-effective, but maintenance is difficult.
- Data integration focuses on individual business areas, following a bottom-up approach to implementation.
- It prefers data to be in a denormalised (dimensional) model; a minimal star-schema sketch follows this list.
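Below is a minimal sketch of a Kimball-style star schema, again using Python's sqlite3 as a stand-in warehouse: one fact table of measures surrounded by denormalised dimension tables, queried with simple joins for a report. All table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive, denormalised attributes
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT, year INTEGER, month INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);

-- Fact table: one row per sale, foreign keys to the dimensions plus measures
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity    INTEGER,
    revenue     REAL
);
""")

conn.execute("INSERT INTO dim_date VALUES (20240101, '2024-01-01', 2024, 1)")
conn.execute("INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware')")
conn.execute("INSERT INTO fact_sales VALUES (20240101, 1, 3, 29.97)")

# Typical star-schema report: revenue by category and year
query = """
SELECT p.category, d.year, SUM(f.revenue)
FROM fact_sales f
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_date d    ON f.date_key = d.date_key
GROUP BY p.category, d.year
"""
print(conn.execute(query).fetchall())
```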

Inmon
- Inmon’s approach to designing a data warehouse was introduced by Bill Inmon. This approach starts with a corporate data model.
- The corporate model identifies the key subject areas, including customers, products, and vendors.
- From it, a detailed logical model is created for the major operations, and these details and models are then used to develop the physical model.
- The warehouse model is normalised, which reduces data redundancy.
- Because this normalised model is too complex for direct business use, data marts are created from the warehouse so that each department can use the data for its own purposes.
- The data warehouse is very flexible to change, and business processes can be understood easily.
- Reporting can be handled across the enterprise, and the ETL process is less prone to errors.
- It has a top-down approach to implementation and focuses on enterprise-wide subject areas.
- The initial cost is high, but the subsequent development cost is low.
- It prefers data to be in a normalised model, and maintenance is easy; a minimal normalised-model sketch follows this list.
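For contrast with the star schema above, here is a minimal sketch of the normalised (3NF-style) modelling the Inmon approach prefers: each entity lives in its own table and is related by keys, so descriptive data is stored only once. sqlite3 is again just a stand-in, and the names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalised model: each entity in its own table, no repeated descriptive data
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT,
    country     TEXT
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),
    order_date  TEXT,
    amount      REAL
);
""")
conn.execute("INSERT INTO customer VALUES (1, 'Acme Ltd', 'DE')")
conn.execute("INSERT INTO orders VALUES (100, 1, '2024-03-01', 250.0)")

# Customer attributes live in one place; department-level data marts would be
# derived from tables like these rather than queried by business users directly.
print(conn.execute(
    "SELECT c.country, SUM(o.amount) FROM orders o "
    "JOIN customer c ON o.customer_id = c.customer_id GROUP BY c.country"
).fetchall())
```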

Benefits of a cloud data warehouse
Cloud-based data warehouses are rising in popularity – for good reason. These modern warehouses offer several advantages over traditional, on-premises versions. Here are the top benefits of a cloud data warehouse.
Quick to deploy:
With cloud data warehousing, you can purchase nearly unlimited computing power and data storage in just a few clicks – and you can build your own data warehouse, data marts, and sandboxes from anywhere, in minutes.
Low total cost of ownership (TCO):
Data warehouse-as-a-service (DWaaS) pricing models are set up so you only pay for the resources you need, when you need them. You don’t have to forecast your long-term needs or pay for more compute throughout the year than necessary. You can also avoid upfront costs like expensive hardware, server rooms, and maintenance staff. Separating the storage pricing from the computing pricing also gives you a way to drive down the costs.
Elasticity:
With a cloud data warehouse, you can dynamically scale up or down as needed. The cloud provides a virtualised, highly distributed environment that can manage huge volumes of data and scale compute and storage as demand changes.
Security and disaster recovery:
In many cases, cloud data warehouses actually provide stronger data security and encryption than on-premises DWs. Data is also automatically duplicated and backed up, so you can minimise the risk of lost data.
Real-time technologies:
Cloud data warehouses built on in-memory database technology can provide extremely fast data processing speeds to deliver real-time data for instantaneous situational awareness.
Empower business users:
Cloud data warehouses empower employees equally and globally with a single view of data from numerous sources and a rich set of tools and features to easily perform data analysis tasks. They can connect new apps and data sources without IT.
Data modeling techniques for data warehousing
- Data warehouse modeling is the process of designing and organizing your data models within your data warehouse platform.
- The design and organization process consists of setting up the appropriate databases and schemas so that the data can be transformed and then stored in a way that makes sense to the end user.
- There are three types of models to include in your data warehouse:
- Base or staging models
- Intermediate models
- Core models
- When modeling a data warehouse, you need to build your architecture with base, intermediate, and core models in mind (see the sketch after this list).
- Base models are necessary to protect your raw data and create consistent naming standards across different data sources.
- Intermediate models act as the middleman between base and core models and allow you to build modular data models.
- Core models are the final transformation product utilized by the data analyst.
- It’s important to plan and create the databases and schemas within your data warehouse before you begin the modeling process.
- Finding a system that works for you will allow you to build powerful models that are effective in delivering insight to your organization’s business teams.
- These models have the power to change the data culture within your organization by the standards you put in place in your warehouse.
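As a rough illustration of this layering, the sketch below builds base, intermediate, and core models as SQL views on top of a raw table, using Python's sqlite3 in memory. The model names (base_orders, int_completed_orders, core_daily_revenue) are hypothetical; teams often implement the same layering with dedicated transformation tools, but plain views show the idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (ID INTEGER, OrderDate TEXT, Amount REAL, Status TEXT)")
conn.execute("INSERT INTO raw_orders VALUES (1, '2024-01-01', 100.0, 'complete')")
conn.execute("INSERT INTO raw_orders VALUES (2, '2024-01-01', 50.0, 'cancelled')")

conn.executescript("""
-- Base/staging model: rename and standardise raw columns, no business logic
CREATE VIEW base_orders AS
SELECT ID AS order_id, OrderDate AS order_date, Amount AS amount, Status AS status
FROM raw_orders;

-- Intermediate model: reusable business logic (here: keep only completed orders)
CREATE VIEW int_completed_orders AS
SELECT order_id, order_date, amount FROM base_orders WHERE status = 'complete';

-- Core model: the final product analysts query
CREATE VIEW core_daily_revenue AS
SELECT order_date, SUM(amount) AS revenue
FROM int_completed_orders
GROUP BY order_date;
""")

print(conn.execute("SELECT * FROM core_daily_revenue").fetchall())
```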
ETL Processes in Data Warehousing
- Extract, transform, and load (ETL) is the process of combining data from multiple sources into a large, central repository called a data warehouse.
- ETL uses a set of business rules to clean and organize raw data and prepare it for storage, data analytics, and machine learning (ML).
- You can address specific business intelligence needs through data analytics (such as predicting the outcome of business decisions, generating reports and dashboards, reducing operational inefficiency, and more).
Extraction:
The first step of the ETL process is extraction. In this step, data is extracted from the various source systems, which may come in different formats such as relational databases, NoSQL stores, XML, and flat files, and is placed into the staging area. It is important to extract the data into the staging area first rather than directly into the data warehouse, because the extracted data arrives in various formats and can also be corrupted; loading it directly could damage the warehouse, and rolling back would be much more difficult. This makes extraction one of the most important steps of the ETL process.
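A minimal extraction sketch, assuming two hypothetical sources (a CSV flat file and an operational database table): the rows are pulled as-is into a staging structure without any transformation. File, table, and key names are illustrative.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file source (normally read from disk or an SFTP drop)
flat_file = io.StringIO("customer_id,country\n1,U.S.A\n2,Germany\n")

# Hypothetical operational database source
source_db = sqlite3.connect(":memory:")
source_db.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
source_db.execute("INSERT INTO orders VALUES (100, 1, 250.0)")

# Staging area: raw records collected as-is, one collection per source
staging = {
    "customers_raw": list(csv.DictReader(flat_file)),
    "orders_raw": source_db.execute("SELECT order_id, customer_id, amount FROM orders").fetchall(),
}
print(staging)
```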
Transformation:
The second step of the ETL process is transformation. In this step, a set of rules or functions is applied to the extracted data to convert it into a single standard format. It may involve the following processes/tasks, a few of which are illustrated in the sketch after this list:
- Filtering – loading only certain attributes into the data warehouse.
- Cleaning – filling in NULL values with default values, mapping U.S.A, United States, and America to USA, etc.
- Joining – joining multiple attributes into one.
- Splitting – splitting a single attribute into multiple attributes.
- Sorting – sorting tuples on the basis of some attribute (generally the key attribute).
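The sketch below illustrates a few of these tasks in plain Python: cleaning country spellings to a single standard value, filtering out attributes that are not loaded, splitting a full name into two attributes, and sorting on the key attribute. The record layout is hypothetical.

```python
# Hypothetical extracted records sitting in the staging area
staged = [
    {"customer_id": 2, "full_name": "Bob Jones", "country": "United States", "notes": "ignore"},
    {"customer_id": 1, "full_name": "Ann Smith", "country": "U.S.A", "notes": "ignore"},
]

COUNTRY_MAP = {"U.S.A": "USA", "United States": "USA", "America": "USA"}  # cleaning rule

transformed = []
for rec in staged:
    first, last = rec["full_name"].split(" ", 1)          # splitting one attribute into two
    transformed.append({
        "customer_id": rec["customer_id"],                # filtering: 'notes' is not loaded
        "first_name": first,
        "last_name": last,
        "country": COUNTRY_MAP.get(rec["country"], rec["country"]),  # cleaning
    })

transformed.sort(key=lambda r: r["customer_id"])          # sorting on the key attribute
print(transformed)
```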
Loading:
The third and final step of the ETL process is loading. In this step, the transformed data is loaded into the data warehouse. Sometimes data is loaded very frequently (near real time), and sometimes it is loaded at longer but regular intervals (batch loads). The rate and period of loading depend solely on the requirements and vary from system to system.
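A minimal loading sketch, with sqlite3 standing in for the warehouse: the transformed rows are written into a warehouse table using an insert-or-replace so the same job can be re-run on its schedule without creating duplicates. In a real warehouse this would typically be a bulk or incremental load managed by the ETL tool.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute("""
CREATE TABLE dim_customer (
    customer_id INTEGER PRIMARY KEY,
    first_name  TEXT,
    last_name   TEXT,
    country     TEXT
)
""")

transformed = [
    (1, "Ann", "Smith", "USA"),
    (2, "Bob", "Jones", "USA"),
]

# Idempotent load: re-running the job replaces existing rows instead of duplicating them.
warehouse.executemany(
    "INSERT OR REPLACE INTO dim_customer VALUES (?, ?, ?, ?)", transformed
)
warehouse.commit()
print(warehouse.execute("SELECT * FROM dim_customer").fetchall())
```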
Advantages of ETL process in data warehousing:
- Improved data quality: The ETL process ensures that the data in the data warehouse is accurate, complete, and up to date.
- Better data integration: The ETL process helps integrate data from multiple sources and systems, making it more accessible and usable.
- Increased data security: The ETL process can help improve data security by controlling access to the data warehouse and ensuring that only authorized users can access the data.
- Improved scalability: The ETL process can help improve scalability by providing a way to manage and analyze large amounts of data.
- Increased automation: ETL tools and technologies can automate and simplify the ETL process, reducing the time and effort required to load and update data in the warehouse.
Disadvantages of ETL process in data warehousing:
- High cost: The ETL process can be expensive to implement and maintain, especially for organizations with limited resources.
- Complexity: The ETL process can be complex and difficult to implement, especially for organizations that lack the necessary expertise or resources.
- Limited flexibility: The ETL process can be limited in terms of flexibility, as it may not be able to handle unstructured data or real-time data streams.
- Limited scalability: The ETL process can be limited in terms of scalability, as it may not be able to handle very large amounts of data.
- Data privacy concerns: The ETL process can raise concerns about data privacy, as large amounts of data are collected, stored, and analysed.
Data warehouse challenges you may face
Data integration
Most organizations have data stored in multiple systems like CRM, ERP, flat files, or other databases. Integrating these diverse data sources into a unified warehouse can be complex. The use of ETL (Extract, Transform, Load) processes and tools can facilitate streamlined integration. Mapping out source-to-target data transformation rules is also essential to ensure data accuracy.
Data quality
Inaccurate, outdated, or inconsistent data can lead to misleading analytics. Implementing data cleansing, validation, and deduplication routines helps address this, and establishing data governance practices can also uphold the quality and integrity of the data.
Scalability
As data volume grows, the warehouse should accommodate it without degrading performance. Choose scalable infrastructure such as cloud-based platforms, and design the architecture to accommodate data growth, for example by using partitioning strategies.
Performance
Query performance is crucial for end-users who need timely insights. Proper indexing, denormalization where necessary, and usage of OLAP cubes or in-memory databases can optimize query performance.
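As a small illustration of the indexing point, the sketch below creates an index on a commonly filtered fact-table column and asks SQLite for the query plan before and after. In a production warehouse, the equivalent decision would involve indexes, partitions, or pre-aggregated tables chosen from real query patterns; the table here is a toy stand-in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (date_key INTEGER, product_key INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [(20240000 + d, d % 10, float(d)) for d in range(1, 1001)],
)

query = "SELECT SUM(revenue) FROM fact_sales WHERE date_key = 20240500"

print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())  # full table scan

# Index the column used in the WHERE clause to speed up the lookup.
conn.execute("CREATE INDEX idx_fact_sales_date ON fact_sales (date_key)")
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())  # index search
```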
Data security
Protecting sensitive data from unauthorized access is paramount. Implement robust access controls, data masking, and encryption. Regular audits can also ensure security measures are effective.
Data modeling
Designing the data warehouse structure is fundamental. A poor design can lead to inefficiencies, redundancies, and complications.
Use proven modeling techniques like star schema or snowflake schema. Engage with business users to understand requirements clearly.
Historical data handling
Data warehouses often need to store historical data for trend analysis, which can pose storage and organizational challenges. Decide on data retention policies, implement slowly changing dimensions (SCDs), and consider cost-effective storage solutions for older data.
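A minimal sketch of a Type 2 slowly changing dimension in plain Python, assuming a hypothetical customer dimension: when an attribute changes, the current row is closed off and a new versioned row is appended, so history is preserved for trend analysis. The column names (valid_from, valid_to, is_current) are conventional but illustrative.

```python
from datetime import date

# Type 2 SCD: keep one row per version of the customer record.
dim_customer = [
    {"customer_id": 1, "country": "DE", "valid_from": date(2023, 1, 1),
     "valid_to": None, "is_current": True},
]

def apply_scd2_change(dim, customer_id, new_country, change_date):
    """Close the current row and append a new version if the attribute changed."""
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["country"] == new_country:
                return                      # nothing changed, nothing to do
            row["valid_to"] = change_date   # expire the old version
            row["is_current"] = False
    dim.append({"customer_id": customer_id, "country": new_country,
                "valid_from": change_date, "valid_to": None, "is_current": True})

apply_scd2_change(dim_customer, 1, "FR", date(2024, 6, 1))
for row in dim_customer:
    print(row)
```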
Conclusion
In conclusion, a data warehouse is an essential component of any BI strategy. It provides a central repository for data that can be used to make informed decisions, improve performance, and perform more advanced analytics. While building a data warehouse requires significant investment, the benefits in terms of improved decision-making and business outcomes make it a worthwhile investment for many organizations.
