1. Introduction

In a research published in 2021, Allied Market Research estimated that the worldwide Data Warehousing market will grow to $51.18 billion by 2028.

In this tutorial, we’ll talk about the Data Warehouse (DW), a centralized type of management technology used for analytics. Mainly, we’ll walk through the architecture of a DW and introduce its key factors and main components. We’ll compare a typical DW with a Database and Data Lake and discuss the most common applications and usages of a DW.

2. Design of a Data Warehouse

Data warehousing was first introduced by Bill Inmon around 1970, and its initial goal was to facilitate the data pipe from raw unstructured data into an effective and profitable decision.

A data warehouse is mainly defined as a centralized organizational archive of data collections that contributes to decision-making tasks. The aim of a DW is to collect, store and analyze useful and meaningful information, regularly, from different types of origins, transactional systems, or relational databases in order to increase the performance of an organization or a corporation.

2.1. Architecture of a Data Warehouse

The architecture of a typical DW system looks as follows, and it is mainly based on four different components. The load, warehouse, and query manager, along with some end-user access tools:

Architecture of a typical Data Warehouse system

As we can see, raw data is collected from different sources. At first, the extracted data is cleaned and transformed into the desired form in order to be put into the data warehouse. Standard interfaces and gateways such as Exploratory Data Analysis (EDA) and SQL are typically used to implement data extraction. Note that even though the data is regularly updated so that the warehouse keeps up to date, historical data is not permanently deleted.

In addition, there are several subject-oriented data marts, that basically form a smaller version of the data warehouse and focus on a specific process, and can be easily customized for a special purpose.

Moreover, there exists metadata management, the road map of the DW that acts as a directory and its task is to locate certain data. Administrative metadata consists of all the data required to set up and operate a warehouse.

Finally, the data repository can be reached and used for analysis, query and reporting, and mining on the other end, including visualization and presentation tools.

3. Characteristics of a Data Warehouse

According to Bill Inmon’s definition, a DW is “a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision-making.” These are the key features of a DW that distinguish it from other systems.

3.1. Subject-Oriented

DWs are subject-oriented since they can focus on a sole task and generate analyses and reports from a particular area of data rather than the overall information of an organization. These subjects may be a product, a customer, a sales inventory, or a supply chain of a company.

3.2. Integrated

Most businesses desire to make use of an integrated enterprise warehouse that gathers data on all subjects (such as products, sales, and employees). DWs come to cover this need since the data, is kept in a warehouse in a consistent and standardized way between various data types from various sources.

3.3. Time-Varying

The information collected in a DW includes a time stamp and a typical DW considers updates and changes throughout time based on a regular schedule. Also, in a DW, historical data is stored. On the other hand, a transactions system often just maintains the latest information.

3.4. Non-Volatile

The data of DW is non-volatile, meaning that it is unchangeable. When new data is provided, the historic data is not removed. This supports analysis of what occurred and when and facilitates the generation of meaningful patterns or visualizations and the result of the conclusions.

4. Types of a Data Warehouse

Enterprise Data Warehouses (EDW), Operational Data Stores (ODS), and data marts are the three primary forms of a DW.

4.1. EDW

An EDW is also a centralized warehouse and is typically composed of a number of databases that provide a consistent method for categorizing, arranging, and classifying data by a specific subject, method, or variable.

4.2. ODS

An ODS is utilized for practical reporting, documentation, management, and decision-making and is a supportive feature of an EDW. An ODS is frequently used for daily tasks as it is continuously updated.

4.3. Data Mart

A data mart is a single component of a DW designed to manage a certain division, area, or business line. Regularly, the ODS stores data from the data mart. The data is subsequently transmitted from the ODS to the EDW, where it is utilized and archived.

5. OLAP

Online Analytical Processing (OLAP) is a method for loading and transferring data from outside origins to data warehouses, generating valuable information from it, and supporting queries on it. Most OLAP programs rely heavily on queries and frequently use data from DWs. Using metrics for modeling, planning, or forecasting, online analytical processing in data warehouses enables quick computation of statistical business data.

For data mining, analytics, and other purposes, OLAP performs complicated queries to massive volumes of historical data that have been gathered from OLTP databases and other sources.

5.1. OLAP and OLTP

Online Transactional Processing (OLTP) allows several users to make numerous database transactions and arrangements in real time, usually over the Internet. The key difference between OLAP and OLTP is that each system has been designed to handle different kinds of operations.

Extensive analysis of data for better decision-making is achieved and designed for OLAP. On the contrary, OLTP is designed to process a significant amount of transactions for applications that use client self-service.

6. Data Warehouse, Database, and Data Lake

Basically, a database, a data warehouse, and a data lake vary primarily and are designed for different purposes.

The data needed to run a sole application is maintained in a database. Therefore, a monolithic architecture, in which the entire data is produced by a single application, is ideal for databases. On the other hand, a DW is designed to support massive amounts of information that may be used from several organizational divisions.

In addition, a data lake maintains raw (structured or unstructured) data from different data types, both recent and old, from one or more systems. Data lakes are widely used for machine learning applications predictive modeling and statistical analysis. On the contrary, due to its well-structured form, a DW is suitable for strategic decisions and the planned use of data. Finally, a Data Lake’s approach to data utilization is Extracting, Loading, and Transforming whereas a DW’s is Extracting, Transforming, and Loading.

7. Benefits and Limitations of a Data Warehouse

The main advantage of a DW is the ability to store, analyze, and derive value from massive quantities of diverse data while retaining the history of the archive in order to increase the performance of a single task. Also, a DW is quick in data retrieving, includes error identification and correction, and offers easy integration, as it transforms the data into a more readable and simple format.

On the contrary, time-consuming preparation and set-up are required for a data warehouse because much of the labor involves manually inserting raw data into the DW. Additionally, compatibility issues and maintenance expenditures may exist because of the constant need to update a DW. Finally, there is a restricted utility that DW provides in the event that managing confidential information is necessary.

8. Applications of Data Warehouse

Data scientists and machine learning engineers can simply use the information in data warehouses in their work because of their highly organized structure. A common real-world application of a DW is banking and finance, in which a DW may be used to obtain information regarding client deposits, loans, and money. With the right approach, the banking sector may examine consumer costs and develop more effective plans to increase revenues on both sides.

Another frequent application of DW is in e-commerce platforms, where it is useful to extract important marketing information (such as likes, views, and website visits) from advertising methods and marketing strategies and utilize them to better engage with their target groups and potential customers.

Also, DWs are widely used in, healthcare data storage, agrotechnology, retail management, and in the manufacture and distribution of goods and everyday products.

The most widely known DWs are Amazon Redshift, Google BigQuery, IBM Db2 Warehouse, and Microsoft Azure Synapse.

9. Conclusion

Almost every modern organization has the need for a DW in order to increase its performance and profits. This underlines both its continuing importance and the need to invest in the proper and most suitable DW technology.

In this article, we walked through Data Warehouses. In particular, we mainly introduced the DW architecture and structure and discussed the main characteristics of a DW and the key differences between a DW, a Database, and a Data Lake as well. We talked about the basic benefits and limitations of a DW. Finally, we mentioned real-world tasks and applications that a DW can be used.


« 上一篇: 梳排序解释