Multidimensional cubes are built from the data stored in the data warehouse. When cubes are constructed, numerical measures are summarized into pre-aggregated values, and the cubes are stored in multidimensional structures designed for rapid querying. The pre-aggregated information is combined with raw fact data to respond to a wide variety of queries.
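The interplay between pre-aggregated values and raw fact data can be illustrated with a minimal Python sketch; the fact rows, measure, and dimension names here are purely hypothetical:

    from collections import defaultdict

    # Hypothetical fact rows: (product, region, month, sales_amount).
    facts = [
        ("bike",  "east", "2024-01", 120.0),
        ("bike",  "west", "2024-01",  80.0),
        ("skate", "east", "2024-02",  45.0),
        ("bike",  "east", "2024-02",  60.0),
    ]

    # Pre-aggregate the measure at the (product, region) level so that
    # queries at that level never have to scan the raw fact rows.
    pre_aggregated = defaultdict(float)
    for product, region, month, amount in facts:
        pre_aggregated[(product, region)] += amount

    # A query at the pre-aggregated grain is a single lookup...
    print(pre_aggregated[("bike", "east")])          # 180.0

    # ...while a finer-grained query falls back to the raw fact data.
    print(sum(a for p, r, m, a in facts
              if p == "bike" and m == "2024-01"))    # 200.0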
Types of changes that affect cubes
Since cubes contain summarized data from the data warehouse, any change to the data in the warehouse also affects the integrity and accuracy of the cubes created from that data. Synchronization of the data in the data warehouse and in the cubes therefore becomes very important: OLAP data must be updated after the data in the data warehouse has been modified. Cubes, dimensions and partitions must be processed to incorporate the new or changed data. The method of processing an OLAP object depends on the kind of change that has been made to the warehouse.
The most common type of change to the data in the data warehouse is the addition of current data; modifications to existing data or changes to the design of the warehouse are seldom attempted. Such additions affect the cube definitions available to client applications. Their impact is usually managed by carefully defining partition filters and by designing a strategy to synchronize OLAP and data warehouse data.
Changes to the data warehouse can also be made to correct errors in the data. Such corrections are usually minimized and taken care of during data transformation, scrubbing and validation operations. Changes may also occur because of changes in the structure of the organization or its products; these are incorporated into the OLTP database and carried into the data warehouse when the data is migrated from the former to the latter. Cubes can accommodate and absorb changes that correct value errors, but changes that move a fact from one dimension member to another will adversely affect the integrity of the results derived from processing the cube. The data loaded into the cube will have to be refreshed by reprocessing the cube and recalculating the aggregations. The cube can be reprocessed with either the Full process or the Refresh data processing option; if the aggregations remain the same, the Refresh data option will be faster.
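The following Python sketch, built on hypothetical fact rows and member names, shows why such a move invalidates stored aggregations: once a fact is reassigned to a different dimension member, the stored totals disagree with the facts until the aggregations are recalculated.

    facts = {"row1": ("east", 100.0), "row2": ("west", 50.0)}

    def aggregate(facts):
        totals = {}
        for region, amount in facts.values():
            totals[region] = totals.get(region, 0.0) + amount
        return totals

    stored = aggregate(facts)          # {'east': 100.0, 'west': 50.0}

    # A correction moves row1 from the 'east' member to 'west'.
    facts["row1"] = ("west", 100.0)

    # The stored aggregations are now wrong until recalculated.
    assert stored != aggregate(facts)  # stale: 'east' total no longer valid
    stored = aggregate(facts)          # refreshed: {'west': 150.0}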
Dimension hierarchies can also be affected by changes in the data warehouse dimension tables, even when the table schema remains the same. A dimension hierarchy depends on the relationships between the members of the dimension table; when those relationships change, the dimension structure must be rebuilt.
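One way to decide whether a rebuild is needed is to compare the member-to-parent relationships before and after a warehouse load. The sketch below is illustrative only and does not reflect any particular OLAP server's internals:

    def hierarchy_changed(old_parents, new_parents):
        # True if any member's parent changed, which would require
        # the dimension structure to be rebuilt.
        return old_parents != new_parents

    old = {"Seattle": "Washington", "Portland": "Oregon"}
    new = {"Seattle": "Washington", "Portland": "Washington"}  # reorganized

    print(hierarchy_changed(old, new))   # True -> rebuild the dimension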
Synchronizing OLAP and data warehouse data
Valid cubes are made available to client applications. Since these cubes interact with the data in the data warehouse, a synchronization strategy must be put in place while designing the data warehouse. The strategy should allow data to be added to the warehouse without distorting the data or causing cubes to return incorrect query results.
Real-time cubes automatically update the data in the cubes when the data in the warehouse is updated. They are generally used when live data needs to be analyzed, and they extend OLAP capabilities rather than replace traditional cube designs and applications.
One strategy for managing additions to the data warehouse and OLAP data is a batch update system. In this strategy, a batch number is assigned to every record in the fact table. When the cube is designed, a filter expression is added to each of the cube's partitions to specify the largest batch number to be read. Subsequent additions to the fact table carry newer, higher batch numbers, so the cubes are unaffected by the new records: they are restricted to reading data only from the earlier batches. A batch number in a dimension table is also useful for ensuring referential integrity.
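A minimal sketch of such a partition filter, assuming a hypothetical fact_sales table with a batch_number column:

    # Hypothetical helper that builds a partition's filter expression.
    # Restricting the partition to batch numbers at or below a known
    # ceiling makes it immune to rows added later with higher numbers.
    def partition_filter(max_batch: int) -> str:
        return f"fact_sales.batch_number <= {max_batch}"

    current_max_batch = 42
    print(partition_filter(current_max_batch))
    # fact_sales.batch_number <= 42

    # Rows loaded later are tagged with batch 43, 44, ... and stay
    # invisible to the cube until the filter is widened and the
    # partition is reprocessed.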
After a batch of data has been added to the fact table and the dimension tables, the dimensions and the cubes or partitions must be processed to incorporate the new data. Shared dimensions must be processed before the cubes that use them. The Incremental update option can be used to pick up additions to a dimension if the structure of the dimension is not affected; however, while the new members will be displayed, the cells associated with those members will remain empty until the cube is updated with the new fact table data that relates to them. Using the Rebuild the dimension structure option makes all cubes that incorporate the dimension unavailable to client applications.
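The required ordering can be expressed as a simple orchestration sketch; the function names below are placeholders, not a real processing API:

    def process_dimension(name, incremental=True):
        # Placeholder for the server's dimension-processing call.
        print(f"processing dimension {name} "
              f"({'incremental' if incremental else 'rebuild'})")

    def process_cube(name):
        # Placeholder for the server's cube-processing call.
        print(f"processing cube {name}")

    # Shared dimensions are processed before any cube that uses them;
    # an incremental update keeps dependent cubes online.
    for dim in ("Product", "Customer", "Time"):
        process_dimension(dim, incremental=True)
    process_cube("Sales")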
The new data can be incorporated into the cube by updating the filter expression in each of the cube's partitions to include the new batch number, after which the cube is fully processed or incrementally updated. If the cube's data is divided across multiple partitions, one partition can be used to accumulate the data batches so that it alone needs to be processed; the other partitions can be given filters that prevent fresh data from being added to them.
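Continuing the earlier sketch, only the accumulating partition's filter is widened to admit the new batch; the partition names and filter syntax are again hypothetical:

    partitions = {
        "historical": "fact_sales.batch_number <= 40",   # frozen
        "current":    "fact_sales.batch_number BETWEEN 41 AND 42",
    }

    def widen_current_partition(new_batch: int) -> None:
        # Only the accumulating partition's filter is widened; the
        # historical partition keeps a filter that excludes new batches.
        partitions["current"] = (
            f"fact_sales.batch_number BETWEEN 41 AND {new_batch}")

    widen_current_partition(43)
    print(partitions["current"])
    # fact_sales.batch_number BETWEEN 41 AND 43
    # Only the 'current' partition now needs to be reprocessed.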
When a client application is using a cube while that cube is being processed, the cube remains online until the processing is complete; the refreshed cube is then presented to the client application. This is the behavior when the cube is processed with the Incremental update or Refresh data options. During a Full process, the client application is disconnected from the cube and must reconnect to the new cube after the process is complete.
The challenge is to manage changes to data warehouse data effectively and to ensure that such changes are reflected in the cubes in real time. The challenges are many and varied, and it is important to make the right trade-offs along the way.