Friday, February 11, 2011

Development Strategy Issues of Data Warehouse


Several issues need to be addressed while developing a data warehouse. The major ones are discussed below.

When and how to gather data

There are two ways of gathering data
  •          Continually
  •          Periodically
In a source-driven architecture for gathering data, the data sources transmit new information either continually (as transaction processing takes place) or periodically (nightly, for example). In a destination-driven architecture, the data warehouse periodically sends requests for new data to the sources.
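The destination-driven variant can be sketched as a simple polling loop. Everything here (`periodic_pull`, the toy sources, the in-memory "warehouse") is a hypothetical illustration, not a standard API:

```python
import time

def periodic_pull(sources, load_into_warehouse, cycles=1, interval_seconds=0):
    """Destination-driven gathering: the warehouse polls each source on a
    schedule (e.g. nightly) instead of sources pushing changes.
    `sources` is a list of callables returning rows changed since `since`."""
    last_pull = 0.0
    for _ in range(cycles):
        for fetch_new_rows in sources:
            load_into_warehouse(fetch_new_rows(last_pull))
        last_pull = time.time()
        time.sleep(interval_seconds)

# Usage: two toy sources feeding one in-memory "warehouse".
warehouse = []
sources = [lambda since: [("orders", 1)], lambda since: [("customers", 2)]]
periodic_pull(sources, warehouse.extend, cycles=1)
```

A source-driven architecture would invert this: each source would call `load_into_warehouse` itself, either per transaction (continually) or on its own timer (periodically).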

What schema to use

Data sources that have been constructed independently are likely to have different schemas. In fact, they may even use different data models. Part of the task of a warehouse is to perform schema integration and to convert data to the integrated schema before it is stored. As a result, the data stored in the warehouse is not just a copy of the data at the sources. Instead, it can be thought of as a materialized view of the data at the sources.

Data transformation and cleansing

The task of correcting and pre-processing data is called data cleansing. Data sources often deliver data with numerous minor inconsistencies, which can be corrected. 

For example, names and addresses are often misspelled. They can be corrected by consulting a database of addresses for each city. The approximate matching required for this task is called fuzzy lookup.

Addresses collected from multiple sources may contain duplicates, which can be eliminated by a merge-purge operation. Records for multiple individuals in the same house may be grouped together; this is called householding.
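A minimal sketch of these two cleansing steps in Python, using the standard library's `difflib` for the approximate matching; the city list and record layout are made-up stand-ins for a real address database:

```python
import difflib

# Reference list of known-good city names (a stand-in for the address database).
known_cities = ["Mumbai", "Delhi", "Chennai", "Kolkata"]

def fuzzy_lookup(value, reference, cutoff=0.6):
    """Correct a misspelled value by approximate matching against a
    reference list; return the value unchanged if nothing is close."""
    matches = difflib.get_close_matches(value, reference, n=1, cutoff=cutoff)
    return matches[0] if matches else value

def merge_purge(records):
    """Eliminate duplicate records, keeping the first occurrence."""
    seen, result = set(), []
    for rec in records:
        key = (rec["name"].lower(), rec["city"].lower())
        if key not in seen:
            seen.add(key)
            result.append(rec)
    return result

records = [
    {"name": "Asha", "city": "Mumbay"},   # misspelled city
    {"name": "Asha", "city": "Mumbai"},   # duplicate after correction
]
cleaned = [dict(r, city=fuzzy_lookup(r["city"], known_cities)) for r in records]
deduped = merge_purge(cleaned)  # one record survives
```

Householding would work the same way as `merge_purge`, but grouping on the address key instead of dropping the later records.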

Data may be transformed in various other ways, such as changing the units of measurement. Data warehouses typically have graphical tools to support data transformation. Such tools allow a transformation to be specified as boxes connected by edges that indicate the flow of data.

How to propagate updates 

Updates on relations at the data sources must be propagated to the warehouse. If the relations at the warehouse are exactly the same as those at the data source, propagation is straightforward. If they are not, the problem of propagating updates is essentially the view-maintenance problem.

What data to summarize

The raw data generated by a transaction-processing system may be too large to store online. However, we can answer many queries by maintaining just summary data obtained by aggregation on a relation, rather than maintaining the entire relation.
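As an illustration, assuming a toy `sales` relation in SQLite, a summary table built by aggregation can stand in for the raw relation when answering aggregate queries:

```python
import sqlite3

# In-memory stand-in for a large transaction relation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("east", 50.0), ("west", 75.0)])

# Summary table: keep only the aggregate, not every raw transaction.
conn.execute("""CREATE TABLE sales_summary AS
                SELECT region, SUM(amount) AS total, COUNT(*) AS n
                FROM sales GROUP BY region""")
summary = conn.execute(
    "SELECT region, total FROM sales_summary ORDER BY region").fetchall()
# summary is [('east', 150.0), ('west', 75.0)]
```

Queries about totals per region can now be answered from `sales_summary` alone; only queries needing individual transactions require the full relation.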



 

Thursday, February 10, 2011

Data warehouse architecture


Architecture is the conceptualization of how the data warehouse is built. There is no single right or wrong architecture; the worth of an architecture is judged by how well the conceptualization aids in building, maintaining, and using the data warehouse. One possible simple conceptualization of data warehouse architecture consists of the following layers.
  •          Data Extraction and Loading
  •          The Warehouse
  •          Analyze and Query -- OLAP Tools
  •          Metadata
  •          Data Mining tools


    •   Data extraction and loading 

    This is the first step in data warehousing. Here the data is prepared using techniques such as:

    • Extracting
    • Conditioning
    • Loading

    • Extracting – The data is captured from the operational sources in “as is” status. Sources include VSAM, IMS, IDMS, and DB2; today more data resides in relational databases on Unix.
    • Conditioning – In this step, data types are converted from the source to the target data store (the warehouse).
    • Loading – Here SQL is used to append or insert the new data.
      •          A record-at-a-time interface will lead to random disk I/Os.
      •          A batch load utility can also be used for loading.
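The difference between the two loading styles can be sketched with Python's built-in `sqlite3` module (a stand-in for a real warehouse DBMS); `executemany` inside a single transaction plays the role of the batch load utility:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER, name TEXT)")
rows = [(i, f"name{i}") for i in range(1000)]

# Record-at-a-time interface: one INSERT per row, which on a real disk
# means one random I/O (and potentially one commit) per record:
# for row in rows:
#     conn.execute("INSERT INTO target VALUES (?, ?)", row)

# Batch load: hand the whole set to the driver in one call,
# wrapped in a single transaction.
with conn:
    conn.executemany("INSERT INTO target VALUES (?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]  # 1000
```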
    •   The warehouse 

    The heart of the data warehouse is the data itself. Data is organized in a way that represents the business; for this, subject orientation is used. The warehouse is a collection of subjects such as customer, product, policy, and account. A subject may be implemented as a set of related tables; e.g., a customer may be described by several related tables, each covering some of its properties.

    Subject orientation yields summarized data, but summaries alone are not adequate to answer some questions, and the detailed data is too voluminous to keep online. So usually the summary data is stored on disk and the details on tape. Derived data is useful in this context.

    Data in the warehouse is stored in 

    •        Dimension tables
    •        Fact Table

    •          Dimension tables
      •          Define the business in terms already familiar to users
      •          Small tables (about a million rows)
      •          Joined to the fact table by a foreign key
      •          Heavily indexed
      •          Typical dimensions: time periods, geographic regions (markets, cities), products, customers, salespersons, etc.
    •          Fact table
      •          Central table
      •          Mostly raw numeric items
      •          Narrow rows, a few columns at most
      •          Large number of rows (millions to a billion)
      •          Accessed via the dimensions
    The tables are stored according to a specific schema. A schema is the organized structure of the tables. There are various types of schemas.
    •    Star Schema
    •    Fact Constellation Schema
    •    Snowflake schema
     
    • Star Schema
      •          A single fact table, with one dimension table for each dimension
      •          Does not capture hierarchies directly

       
      • Snowflake schema
        •          Represents dimensional hierarchies directly by normalizing the dimension tables
        •          Easy to maintain and saves storage
       

      • Fact Constellation schema –
        •         Multiple fact tables that share many dimension tables
        •         Booking and Checkout may share many dimension tables in the hotel industry
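The star layout above can be illustrated with a toy SQLite schema: one fact table joined to its dimension tables by foreign keys, queried by grouping on dimension attributes (all table and column names are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales  (time_id INTEGER REFERENCES dim_time,
                          product_id INTEGER REFERENCES dim_product,
                          amount REAL);
INSERT INTO dim_time    VALUES (1, 2010), (2, 2011);
INSERT INTO dim_product VALUES (10, 'widget');
INSERT INTO fact_sales  VALUES (1, 10, 100.0), (2, 10, 250.0);
""")

# A typical star query: join the fact table to its dimensions,
# then aggregate along a dimension attribute.
result = conn.execute("""
    SELECT t.year, p.name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_time t    ON f.time_id = t.time_id
    JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY t.year, p.name ORDER BY t.year
""").fetchall()
# result is [(2010, 'widget', 100.0), (2011, 'widget', 250.0)]
```

A snowflake schema would further normalize `dim_time` (e.g. into day, month, and year tables); a fact constellation would add a second fact table reusing the same dimension tables.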




      •  Analyze, Query & OLAP Tools
        
      Query processing is supported by
      •          Indexing
      •          Pre computed views/aggregates
      •          SQL extensions
       
      • Indexing Techniques
      Exploiting indexes to reduce the scanning of data is of crucial importance. Common types of indexes include
        •          Bitmap Indexes
        •          Join Indexes
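A bitmap index can be sketched in a few lines of Python: one bit vector per distinct column value, so that row selection becomes bitwise arithmetic instead of a table scan (this is a toy model, not how a real DBMS stores its bitmaps):

```python
# Toy bitmap index on a low-cardinality column: one bit vector per
# distinct value; AND/OR of the vectors answers multi-condition
# queries without scanning the rows.
rows = ["east", "west", "east", "east", "west"]

def build_bitmap_index(values):
    """Map each distinct value to an integer used as a bit vector."""
    index = {}
    for i, v in enumerate(values):
        index[v] = index.get(v, 0) | (1 << i)
    return index

def matching_rows(bitmap, n):
    """Decode a bit vector back into a list of row positions."""
    return [i for i in range(n) if bitmap >> i & 1]

index = build_bitmap_index(rows)
east_rows = matching_rows(index["east"], len(rows))  # rows 0, 2 and 3
```

A predicate such as `region = 'east' OR region = 'west'` would be answered by `index["east"] | index["west"]`, one bitwise operation per word of the vector.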
      • Pre-computed Aggregates
      Aggregated data is kept for efficiency (pre-computed queries).
      Aggregate tables can be maintained by the
        •         warehouse server
        •         middle tier
        •         client applications
      It may include
          •         rank (top 10 customers)
          •         percentile (top 30% of customers)
          •         median, mode
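A minimal sketch of such pre-computed rank and percentile aggregates, recomputed from a toy table of per-customer purchase totals (the names and numbers are invented):

```python
# Maintaining small pre-computed aggregates: top-N customers and a
# top-percentile cut, recomputed whenever the base data is refreshed.
purchases = {"alice": 500.0, "bob": 120.0, "carol": 340.0, "dave": 90.0}

def top_n(totals, n):
    """Rank customers by total and keep only the top n."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

def top_percentile(totals, fraction):
    """Customers in the top `fraction` (e.g. 0.3 for the top 30%)."""
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]

top2 = top_n(purchases, 2)              # alice and carol
top30 = top_percentile(purchases, 0.3)  # top 30% of 4 customers -> 1 row
```

Whether this recomputation runs in the warehouse server, a middle tier, or the client application is exactly the maintenance choice listed above.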
      • SQL extensions
             SQL provides many features for querying the database; some extensions include
        •          Aggregates
        •          Using SubQueries in calculations
      • On-Line Analytical Processing (OLAP)
             Generally synonymous with earlier terms such as decision support, business intelligence, and executive information systems.
      OLAP = Multidimensional Database
          •          It is a powerful visualization paradigm
          •          It provides fast, interactive response times
          •          It is good for analyzing time series
          •          It can be useful to find some clusters and outliers
          •          Many vendors offer OLAP tools
        • Nature of OLAP Analysis
          •          Aggregation -- (total sales, percent-to-total)
          •          Comparison -- Budget vs. Expenses
          •          Ranking -- Top 10, quartile analysis
          •          Access to detailed and aggregate data
          •          Complex criteria specification
          •          Visualization
       
      • Metadata Repository
      A metadata repository stores data about data. It contains several types of metadata:
      •          Administrative metadata
      •          Business metadata
      •          Operational metadata

      •          Administrative metadata contains
        •         source databases and their contents
        •         gateway descriptions
        •         warehouse schema, view & derived data definitions
        •         dimensions, hierarchies
        •         pre-defined queries and reports
        •         data mart locations and contents
        •         data partitions
        •         data extraction, cleansing, transformation rules, defaults
        •         data refresh and purging rules
        •         user profiles, user groups
        •         security: user authorization, access control
      •          Business metadata contains
        •         business terms and definitions
        •         ownership of data
        •         charging policies
      •          Operational metadata contains
        •         data lineage:  history of migrated data and sequence of transformations applied
        •         currency of data:  active, archived, purged
        •         Monitoring information:  warehouse usage statistics, error reports, audit trails.
       
      • Data mining and reporting tools -
      Data mining tools help extract information from the data warehouse and generate reports. Some of these tools are:
      •          Oracle -- Discoverer2000       
      •          Information Builders Inc. -- Focus for Windows
      •          Software AG -- Esperant         
      •          Sterling Software -- VISION:Data       
      •          Prism Solutions -- Prism Warehouse Manager
      •          Red Brick Systems -- DecisionScape Formation.

      Sunday, February 6, 2011

      Data Warehouse concept need and history

      Introduction and Need

      Today, in the age of information technology, data is an important entity. Many companies face data-related problems such as:

      o   Data is scattered over the network
      o   Many versions, subtle differences
      o   Need an expert to get the data
      o   Available data poorly documented
      o   Data found can’t be used
      §  Results are unexpected
      §  Data needs to be transformed from one form to another


      So the data warehouse concept was introduced to tackle such problems.


      Definitions

      A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision-making process.

      A data warehouse is defined as a single, complete and consistent store of data, obtained from a variety of different sources and made available to end users in a form they can understand and use in a business context.

      A data warehouse combines data from multiple sources into one comprehensive database. Building one is a process of transforming data into information and making it available to users in a timely enough manner to make a difference.

      A data warehouse is a technique for assembling and managing data from various sources for the purpose of answering business questions, thus enabling decisions that were not previously possible.

      It is a decision support database maintained separately from the organization’s operational database.


      Characteristics

      Data warehouse has following characteristics

      • Subject oriented – Data that gives information about a particular subject instead of about the company’s ongoing operations.

      • Integrated – Data that is gathered into data warehouse from a variety of sources and merged into a coherent whole.

      • Time variant – All data in data warehouse is identified with a particular time period.

      • Non volatile – Data is stable in data warehouse. More data is added but data is never removed.


      History

      The Key developments in early years of data warehousing are

      • 1960s – General Mills and Dartmouth College collaborate on a joint research project.
      • 1970s – ACNielsen and IRI provide dimensional data marts for retail sales.
      • 1983 – Teradata introduces a database management system specifically designed for decision support.
      • 1988 – Barry Devlin and Paul Murphy publish the article “An architecture for a business and information system” in the IBM Systems Journal, where they introduce the term “business data warehouse”.
      • 1990 – Red Brick Systems introduces Red Brick Warehouse, a database management system for data warehousing.
      • 1991 – Prism Solutions introduces Prism Warehouse Manager, software for developing a data warehouse.
      • 1995 – The Data Warehousing Institute, which promotes data warehousing, is founded.
      • 1996 – Ralph Kimball publishes the book The Data Warehouse Toolkit.
      • 1997 – Oracle 8, with support for star queries, is released.
      Uses

      • Analyse trends – A data warehouse is commonly used by companies to analyse trends over time.
      • View operations – Companies can use the data warehouse to view day-to-day operations.
      • Planning – Its main purpose, however, is to facilitate the strategic planning that results from long-term data overviews.
      • Other uses – The overview can be used to
        • Build the business model,
        • Make forecasts,
        • Write reports, etc.
      • Current data – The data warehouse can also be used to maintain an up-to-date database.

      Thursday, February 3, 2011

      Autonomic Systems


      Autonomic computing systems are systems capable of self-management. They overcome the rapidly growing complexity of managing computing systems and reduce the barrier that this complexity poses to further growth.

                  In other words, autonomic computing refers to the self-managing characteristics of distributed computing resources, adapting to unpredictable changes whilst hiding intrinsic complexity to operators and users. 

                  It is inspired by the autonomic nervous system of the human body. This nervous system controls important bodily functions (e.g. respiration, heart rate, and blood pressure) without any conscious intervention.


      Features

      An autonomic system can have following features 

      •   Make decisions on its own,
      •   Will constantly check and optimize its status
      •   Automatically adapt itself to changing conditions. 

      The Autonomic Computing Initiative (ACI) aims at providing the foundation for autonomic systems. An autonomic computing framework can be built from autonomic components interacting with each other.
      An autonomic component can be modeled with

      •   Sensors (for self-monitoring)
      •   Effectors (for self-adjustment)
      •   Knowledge and adapter (for exploiting policies based on self- and environment awareness).
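The sensor/knowledge/effector model can be sketched as a small control loop. The class and the scaling policy below are illustrative inventions, not part of IBM's actual framework:

```python
# Minimal sketch of an autonomic component: a sensor reads a metric,
# a policy (the knowledge) decides, and an effector adjusts the system.
class AutonomicComponent:
    def __init__(self, sensor, policy, effector):
        self.sensor = sensor      # self-monitoring
        self.policy = policy      # knowledge: maps observation -> action
        self.effector = effector  # self-adjustment

    def control_loop(self, cycles):
        actions = []
        for _ in range(cycles):
            observation = self.sensor()
            action = self.policy(observation)
            if action is not None:
                self.effector(action)
                actions.append(action)
        return actions

# Usage: keep a (simulated) load below 80% by scaling out (self-optimization).
state = {"load": 0.95, "replicas": 1}
component = AutonomicComponent(
    sensor=lambda: state["load"],
    policy=lambda load: "scale_out" if load > 0.8 else None,
    effector=lambda action: state.update(
        replicas=state["replicas"] + 1, load=state["load"] / 2),
)
taken = component.control_loop(cycles=3)
# after one scale-out the load drops below the threshold, so no further action fires
```

The human operator's role, as described below, is to supply the `policy`, not to drive the `effector` directly.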
       
      Benefits of autonomic systems

      •   They enable modern, networked computing systems to manage themselves without direct human intervention. 
      •   Here the human operator takes on a new role: he does not control the system directly. Instead, he defines general policies and rules that serve as input to the self-management process.
      •   For this process, IBM has defined the following four functional areas for autonomic systems:
       

      1.      Self-Configuration: Automatic configuration of components;
      2.      Self-Healing: Automatic discovery, and correction of faults;
      3.      Self-Optimization: Automatic monitoring and control of resources to ensure the optimal functioning with respect to the defined requirements;
      4.      Self-Protection: Proactive identification and protection from arbitrary attacks.

      Personal broadcasting


      Personal broadcasting is a term for participatory journalism that focuses on television webcasting over the Internet and the mobile Internet.
      The Internet has made it possible for almost anyone to create their own web-based, online radio station. For a small investment, it is now possible to stream a music or talk show to a global audience. For educational organizations and other qualifying groups, many broadcasting options are available, such as:
      •   Low Power FM Broadcasts
      •   Create Your Own Radio Show
      •   Create Internet Radio
      Benefits of personal broadcasting 

      •  Privacy and Security: Users invite subscribers, such as family and friends, who are the only people who can access a channel’s password-protected media. 
      • Convenience for Users: After a simple download and installation, users simply save files to a folder associated with the channel, and links are automatically fed to the subscriber. Any time a file is saved or edited, the channel’s links are automatically updated. Any type of file can be securely shared in a channel, including online media such as YouTube videos and online music.
      • Convenience for Subscribers: Subscribers do not have to seek out their friends’ or family members’ updates on multiple Web sites. Instead, the updates automatically come to them in convenient, well-visited places: a personalized homepage such as My Yahoo or My AOL, a social networking site, Microsoft Vista’s Sidebar, or a Web browser bookmark. Anyone with Internet access can receive Private Netcasting channels on their computer or mobile device. The subscriber does not need to install anything.

      • Organization: Users can easily arrange the sequence of their shared media to tell a story or to have the most recent appear noticeably on top. They can create separate channels by audience, such as a Joke Channel for friends, a Hobby Channel for other enthusiasts and a Family Channel for the extended family.

      • Unlimited Sharing: Consumers can share an unlimited amount of personal media with their subscribers, overcoming the size limitations of alternative systems. Subscribers always will get the most popular and newest content “On Demand,” even if the user’s PC is turned off. 

      • Feedback Mechanisms Consistent with Web 2.0: Each media item has its own discussion thread enabling focused communication between the user and all channel subscribers. Users can review a channel’s “Ratings” which tracks how many times a media item has been viewed or commented upon, to help them learn which are most popular for future personal broadcasts.


      Example of Personal broadcasting media - 

      EnjoyMyMedia is the world’s first personal broadcasting service
      Its features are

      •  It is free.
      •  It enables consumers to conveniently and securely broadcast personal multimedia (photos, videos, audio, documents or any other file) over the Internet in private, secure channels.
      •  Private Netcasting turns the act of saving a file into a way to stay connected, and is the safest and simplest way to keep friends and family updated.
      •  It transforms PC folders into broadcast channels.
      •  Save any file to a folder, and the files are automatically shared with invited family and friends, who do not need to install any software.
      •  Subscribers access the password-protected channels through My Yahoo, iGoogle, My AOL, Microsoft Vista’s Sidebar, Facebook, current Internet browsers or any other RSS-compliant system.

      By contrast, the digital media market has been focused on sharing with the world in a one-to-many model, including distributing to a consumer’s 1,000 “closest friends” on a social networking site, or public video sharing.

        Context aware environments


        Introduction & Definition

        Context awareness is a term from ubiquitous computing (also called pervasive computing), which sought to deal with linking changes in the environment to computer systems that are otherwise static.

        The term is also applied in business theory in relation to business process management.

        Basically, context is the set of rules of inter-relationship of features in processing any entities. Context is commonly taken to be the combination of four entities:

        • Location
        • Identity
        • Activity
        • Time

        Context-aware devices are devices which can both sense and react based on their environment. The devices may have information about the circumstances under which they are able to operate and, based on rules or an intelligent stimulus, react accordingly. These devices try to make assumptions about the user's current situation.

        Context-aware systems can
        •          Perceive a situation, i.e., acquire context.
        •          Understand the context, i.e., match sensory stimuli to a context.
        •          Trigger actions, i.e., behave based on the recognized context.
        The user’s activity and location are crucial factors in context awareness.
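The four-entity context model and the perceive/understand/trigger cycle can be sketched as follows; the rules and contexts are invented examples:

```python
from dataclasses import dataclass

# The four-entity context model from the text: location, identity,
# activity, time. Rules map a recognized context to an action
# (perceive -> understand -> trigger).
@dataclass(frozen=True)
class Context:
    location: str
    identity: str
    activity: str
    time: str

def react(ctx):
    """Trigger behaviour based on the recognized context."""
    if ctx.location == "meeting_room" and ctx.activity == "presenting":
        return "silence_phone"
    if ctx.time == "night" and ctx.location == "home":
        return "dim_screen"
    return "no_action"

action = react(Context("meeting_room", "alice", "presenting", "10:00"))
# action is "silence_phone"
```

A real system would fill the `Context` fields from sensors (GPS, calendar, clock) rather than constructing them by hand, and the rule set would typically be learned or configurable rather than hard-coded.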

        Applications of context aware devices 

        •   Adapt interfaces
        •   Tailor application-relevant data
        •   Discover services
        •   Make the user interaction implicit
        •   Build smart environments.

        Categories of context aware environments
        1.      Human-factors-related context
          •       User information – the user’s habits, emotional state, etc.
          •       User’s social environment – co-location of others, social interaction
          •       User’s tasks – spontaneous activity, engaged tasks
        2.      Physical environment context
          •       Location (absolute position, relative position, co-location, ...)
          •       Infrastructure (surrounding resources for computation, communication, task performance, ...)
          •       Physical conditions (noise, light, pressure, ...)