Thursday, February 10, 2011

Data warehouse architecture


Architecture is the conceptualization of how the data warehouse is built. There is no right or wrong architecture. The worthiness of architecture can be judged in how the conceptualization aids in building, maintenance, and usage of the data warehouse. One possible simple conceptualization of data warehouse architecture consists of following layers.
  •          Data Extraction and Loading
  •          The Warehouse
  •          Analyze and Query -- OLAP Tools
  •          Metadata
  •          Data Mining tools


    •   Data extraction and loading 

    This is the first step in data warehousing here the data is transformed by various techniques like

    • Extracting
    • Conditioning
    • Loading

    • Extracting the data – Here the data is Captured from operational source in “as is” status. The sources are VSAM, IMS, IDMS, DB2; more data today in relational databases on Unix. 
    • Conditioning – In this step the conversion of data types from the source to the target data store (warehouse) is done.
    • Loading – Here SQL is used to append or insert new data
      •          Record at a time interface
      •          Will lead to random disk I/O’s. 
                               Also batch load utility is used for loading.  
    •   The warehouse 

    Heart of the data warehouse is the data itself. Data is organized in a way that represents business. For this subject orientation is used. It is collection of customer, product, policy, account etc. A subject may be implemented as a set of related tables. E.g. customer may be described by the separate tables describing all the properties.  

    The subject orientation gives summarized data but still it is not adequate to answer some questions and detailed data is too voluminous. So usually the summary data is stored on disk and details on the tapes. Derived data is useful in this context. 

    Data in the warehouse is stored in 

    •        Dimension tables
    •        Fact Table

    •          Dimension tables
      •          Define business in terms already familiar to users
      •          Small tables (about a million rows)
      •          Joined to fact table by a foreign key
      •          heavily indexed
      •          typical dimensions
      •          time periods, geographic region (markets, cities), products, customers, salesperson, etc.
    •            Fact Table
      •          Central table
      •          mostly raw numeric items
      •          narrow rows, a few columns at most
      •          large number of rows (millions to a billion)
      •          Access via dimensions
    The tables are stored with the specific schema. A schema is the organized structure. There are various types of schemas.
    •    Star Schema
    •    Fact Constellation Schema
    •    Snowflake schema
     
    • Star Schema
      •          A single fact table and for each dimension one dimension table        
      •          Does not capture hierarchies directly

       
      • Snowflake schema
        •          Represent dimensional hierarchy directly by normalizing tables.
        •          Easy to maintain and saves storage
       

      • Fact Constellation schema –
        •         Multiple fact tables that share many dimension tables
        •         Booking and Checkout may share many dimension tables in the hotel industry




      •  Analyze, Query & OLAP Tools
        
      Query processing is done by
      •          Indexing
      •          Pre computed views/aggregates
      •          SQL extensions
       
      • Indexing Techniques
      Exploiting indexes to reduce scanning of data is of crucial importance. There can be various types of indexes
        •          Bitmap Indexes
        •          Join Indexes
      • Pre-computed Aggregates
      Aggregated data is kept for efficiency (pre-computed queries).
      Aggregated table can be maintained by the
        •         warehouse server
        •         middle tier
        •         client applications
      It may include
          •         rank (top 10 customers)
          •         percentile (top 30% of customers)
          •         median, mode
      • SQL extensions
             SQL has provided many features for querying the database some extensions include.
        •          Aggregates
        •          Using SubQueries in calculations
      • On-Line Analytical Processing (OLAP)
             Generally synonymous with earlier terms such as Decisions Support, Business Intelligence, Executive Information System
      OLAP = Multidimensional Database
          •          It is a powerful visualization paradigm
          •          It provides fast, interactive response times
          •          It is good for analyzing time series
          •          It can be useful to find some clusters and outliers
          •          Many vendors offer OLAP tools
        • Nature of OLAP Analysis
          •          Aggregation -- (total sales, percent-to-total)
          •          Comparison -- Budget vs. Expenses
          •          Ranking -- Top 10, quartile analysis
          •          Access to detailed and aggregate data
          •          Complex criteria specification
          •          Visualization
       
      • Metadata Repository
      Metadata repository is data about data it has various types
      •          Administrative metadata
      •          Business data
      •          operational metadata

      •          Administrative metadata contains
        •         source databases and their contents
        •         gateway descriptions
        •         warehouse schema, view & derived data definitions
        •         dimensions, hierarchies
        •         pre-defined queries and reports
        •         data mart locations and contents
        •         data partitions
        •         data extraction, cleansing, transformation rules, defaults
        •         data refresh and purging rules
        •         user profiles, user groups
        •         security: user authorization, access control
      •          Business data contains
        •         business terms and definitions
        •         ownership of data
        •         charging policies
      •          Operational metadata contains
        •         data lineage:  history of migrated data and sequence of transformations applied
        •         currency of data:  active, archived, purged
        •         Monitoring information:  warehouse usage statistics, error reports, audit trails.
       
      • Data mining and reporting tools -
      Data mining tools help to extract the information from data warehouse and report generation. Some of the tools are.
      •          Oracle -- Discoverer2000       
      •          Information Builders Inc. -- Focus for Windows
      •          Software AG -- Esperant         
      •          Sterling Software -- VISION:Data       
      •          Prism Solutions -- Prism Warehouse Manager
      •          Red Brick Systems -- DecisionScape Formation.

      No comments:

      Post a Comment