Among the various issues which need to be addressed while developing a data warehouse are illustrated below.
When and how to gather data
There are two ways of gathering data
- Continually
- Periodically
In source driven architecture for gathering the data, the data sources transmit new information either continually (as transaction processing take place) or periodically (nightly for example). In destination driven architecture the data warehouse periodically sends requests for new data to sources.
What schema to use
Data sources that have been constructed independently are likely to have different schemas. In fact they may even use different data models. Part of task of warehouse is to perform schema integration, and to convert data to integrated schema before they are stored. As a result, the data stored in the warehouse are not just copy of the data at the source. Instead, they can be thought of as a materialized view of data at sources.
Data transformation and cleansing
The task of correcting and pre-processing data is called data cleansing. Data sources often deliver data with numerous minor inconsistencies, which can be corrected.
For example, names and addresses are often misspelled. They can be corrected by consulting the database of addresses in each city. The approximate matching of the data required for the task is called as fuzzy lookup.
Addresses collected for multiple sources may have duplicates. They can be eliminated by merge purge operation. Records for multiple individuals in a house may be grouped together this is called as householding.
Data may be transformed by various other ways like changing the units of measurement. Data warehouse typically have graphical tools to support the data transformation. Such tools allow the transformation to be specified as boxes and edges to indicate the flow of data.
How to propagate updates
Updates on relations at data sources must be propagated to warehouse. If the relations at the warehouse are exactly same as those at the data source, the propagation is straight forward. If they are not, the problem of propagating updates is basically the view-maintenance problem.
What data to summarize
The raw data generated by transaction processing system may be too large to store online. However, we can answer many queries by maintaining just summary data obtained by aggregation on relation, rather than maintain entire relation.