Data Warehouse Overview
The term "data warehouse" is generally credited to Bill Inmon, who popularized it in the early 1990s. He described it as an integrated collection of information that helps companies and organizations make better decisions.
To be effective, a data warehouse must be subject oriented, integrated, time variant, and non-volatile. In this article, I will go over each of these properties in detail. If you are building a data warehouse, you need to understand why they matter.
Being subject oriented means that the data is organized around a specific subject, such as customers or sales, rather than around the day-to-day functions of the company. This lets you analyze everything that is connected to that subject in one place. Being integrated means that the data in the warehouse can come from different sources but is combined into a single unit that is consistent and logical. Being time variant means that every piece of information in the warehouse is associated with a particular period of time.
The information contained within a data warehouse should also be stable. While data can be added, it should never be deleted. This property is referred to as being non-volatile. A stable warehouse preserves history, which gives a company a better understanding of its operations over time. Although these terms were coined in the 1990s, they are still highly accurate today. However, it should be noted that some data warehouses are volatile, because many modern data warehouses deal with terabytes of data.
Because they must store terabytes of data, many companies are forced to delete some of their information after a certain period of time. For instance, some companies systematically delete data once it reaches three years of age.
Before a data warehouse can be built, the correct data must be located. Generally, the information added to the warehouse comes either from daily operational data or from historical data. The historical data may be stored in a legacy system, which can make it challenging to extract.
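The age-based deletion mentioned above can be illustrated with a minimal sketch. The function name `purge_expired`, the `loaded_on` field, and the in-memory list of rows are assumptions made for illustration only; a real warehouse would apply a rule like this inside the database, often by dropping whole partitions.

```python
from datetime import date


def purge_expired(rows, today, max_age_years=3):
    """Return only the rows that are younger than the retention window."""
    try:
        cutoff = today.replace(year=today.year - max_age_years)
    except ValueError:
        # today is Feb 29 and the target year is not a leap year
        cutoff = today.replace(year=today.year - max_age_years, day=28)
    # A row that has reached the cutoff age is dropped; newer rows are kept.
    return [row for row in rows if row["loaded_on"] > cutoff]


rows = [
    {"id": 1, "loaded_on": date(2020, 1, 15)},  # older than three years
    {"id": 2, "loaded_on": date(2023, 6, 1)},   # still inside the window
]
kept = purge_expired(rows, today=date(2024, 1, 1))
print([row["id"] for row in kept])
```

Note that a rule like this trades the non-volatile property for manageable storage, so the retention period should be chosen with the oldest history the business still needs in mind.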
The design of the data warehouse matters as well. Designers must make sure the design is consistent with the queries that will be run against the warehouse, and to do this successfully they need to understand the database schema. Getting the design right from the start is crucial, because some forms of data are difficult to recreate later. Another important aspect of data warehouses is data acquisition, which can be defined as transferring data from a source to the warehouse. Data acquisition is one of the most expensive parts of building a data warehouse. It is often carried out with an ETL (extract, transform, load) tool.
At the time of writing, there are just over 50 ETL tools being sold. It may cost a company millions of dollars to transfer data from its sources to the warehouse. Once the initial data has been transferred, the process must be repeated consistently. Data acquisition is a continuous process, and the goal is to keep the warehouse updated on a regular basis. When the warehouse is updated, it is often hard to determine which information in the source has changed since the previous update. The process of dealing with this issue is called changed data capture. It has become a field of its own, and a number of products are currently sold to deal with it.
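One simple way to approach changed data capture is to compare the previous extract of a source table with the current one. The sketch below assumes each snapshot is a dictionary keyed by primary key; the names `row_fingerprint` and `detect_changes` are hypothetical. Commercial products typically rely on other mechanisms, such as timestamps or database logs, so this is only an illustration of the idea.

```python
import hashlib
import json


def row_fingerprint(row):
    """Hash a row's contents so two versions can be compared cheaply."""
    payload = json.dumps(row, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def detect_changes(previous, current):
    """Compare two snapshots keyed by primary key.

    Returns (inserted, updated, deleted) as sorted lists of keys.
    """
    inserted = sorted(key for key in current if key not in previous)
    deleted = sorted(key for key in previous if key not in current)
    updated = sorted(
        key
        for key in current
        if key in previous
        and row_fingerprint(current[key]) != row_fingerprint(previous[key])
    )
    return inserted, updated, deleted


previous = {
    1: {"name": "Acme", "city": "Oslo"},
    2: {"name": "Globex", "city": "Rome"},
}
current = {
    1: {"name": "Acme", "city": "Bergen"},     # city changed
    3: {"name": "Initech", "city": "Austin"},  # new row
}
inserted, updated, deleted = detect_changes(previous, current)
print(inserted, updated, deleted)
```

Comparing full snapshots is easy to reason about but requires reading the whole source on every run, which is one reason dedicated tools exist for large tables.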
Data must be cleaned before it is placed in the warehouse. The data cleansing process is usually carried out during the data acquisition phase. Any data that is placed in the warehouse without being cleaned poses a danger to the system and cannot be relied upon.
The reason is that uncleaned data may be incorrect, and a company may make poor decisions based on it. One common cleansing task is standardization: all the information within a data warehouse that means the same thing must be stored in the same form. If some records read "MS" and others read "Microsoft," even though they mean the same thing, only one of them should be used to identify that element within the data warehouse.
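The "MS" versus "Microsoft" case can be sketched as a lookup that maps every known variant to a single canonical form. The mapping table `CANONICAL_NAMES` and the function `clean_company` are hypothetical names chosen for this example, and a real cleansing step would draw its mappings from a maintained reference list rather than a hard-coded dictionary.

```python
# Hypothetical mapping from known variants (lowercased) to one canonical form.
CANONICAL_NAMES = {
    "ms": "Microsoft",
    "microsoft": "Microsoft",
}


def clean_company(raw):
    """Return the canonical company name, or the trimmed input if unknown."""
    key = raw.strip().lower()
    return CANONICAL_NAMES.get(key, raw.strip())


records = ["MS", "Microsoft", " microsoft ", "Oracle"]
cleaned = [clean_company(record) for record in records]
print(cleaned)
```

Unknown values are passed through unchanged here so that nothing is silently altered; in practice they might instead be flagged for review.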