Friday, February 11, 2011

Development Strategy Issues of Data Warehouse


Several issues need to be addressed while developing a data warehouse. The major ones are discussed below.

When and how to gather data

There are two ways of gathering data
  •          Continually
  •          Periodically
In a source-driven architecture for gathering data, the data sources transmit new information either continually (as transaction processing takes place) or periodically (nightly, for example). In a destination-driven architecture, the data warehouse periodically sends requests for new data to the sources.
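The destination-driven variant can be sketched as a simple polling loop. Everything here (`periodic_pull`, the toy sources, the in-memory "warehouse") is a hypothetical illustration, not a standard API:

```python
import time

def periodic_pull(sources, load_into_warehouse, cycles=1, interval_seconds=0):
    """Destination-driven gathering: the warehouse polls each source on a
    schedule (e.g. nightly) instead of sources pushing changes.
    `sources` is a list of callables returning rows changed since `since`."""
    last_pull = 0.0
    for _ in range(cycles):
        for fetch_new_rows in sources:
            load_into_warehouse(fetch_new_rows(last_pull))
        last_pull = time.time()
        time.sleep(interval_seconds)

# Usage: two toy sources feeding one in-memory "warehouse".
warehouse = []
sources = [lambda since: [("orders", 1)], lambda since: [("customers", 2)]]
periodic_pull(sources, warehouse.extend, cycles=1)
```

A source-driven architecture would invert this: each source would call `load_into_warehouse` itself, either per transaction (continually) or on its own timer (periodically).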

What schema to use

Data sources that have been constructed independently are likely to have different schemas. In fact, they may even use different data models. Part of the task of a warehouse is to perform schema integration and to convert data to the integrated schema before it is stored. As a result, the data stored in the warehouse is not just a copy of the data at the sources. Instead, it can be thought of as a materialized view of the data at the sources.

Data transformation and cleansing

The task of correcting and pre-processing data is called data cleansing. Data sources often deliver data with numerous minor inconsistencies, which can be corrected. 

For example, names and addresses are often misspelled. They can be corrected by consulting a database of addresses for each city. The approximate matching required for this task is called fuzzy lookup.

Addresses collected from multiple sources may contain duplicates, which can be eliminated by a merge-purge operation. Records for multiple individuals in the same house may be grouped together; this is called householding.
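A minimal sketch of these two cleansing steps in Python, using the standard library's `difflib` for the approximate matching; the city list and record layout are made-up stand-ins for a real address database:

```python
import difflib

# Reference list of known-good city names (a stand-in for the address database).
known_cities = ["Mumbai", "Delhi", "Chennai", "Kolkata"]

def fuzzy_lookup(value, reference, cutoff=0.6):
    """Correct a misspelled value by approximate matching against a
    reference list; return the value unchanged if nothing is close."""
    matches = difflib.get_close_matches(value, reference, n=1, cutoff=cutoff)
    return matches[0] if matches else value

def merge_purge(records):
    """Eliminate duplicate records, keeping the first occurrence."""
    seen, result = set(), []
    for rec in records:
        key = (rec["name"].lower(), rec["city"].lower())
        if key not in seen:
            seen.add(key)
            result.append(rec)
    return result

records = [
    {"name": "Asha", "city": "Mumbay"},   # misspelled city
    {"name": "Asha", "city": "Mumbai"},   # duplicate after correction
]
cleaned = [dict(r, city=fuzzy_lookup(r["city"], known_cities)) for r in records]
deduped = merge_purge(cleaned)  # one record survives
```

Householding would work the same way as `merge_purge`, but grouping on the address key instead of dropping the later records.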

Data may be transformed in various other ways, such as changing the units of measurement. Data warehouses typically have graphical tools to support data transformation. Such tools allow a transformation to be specified as boxes connected by edges that indicate the flow of data.

How to propagate updates 

Updates on relations at the data sources must be propagated to the warehouse. If the relations at the warehouse are exactly the same as those at the data source, propagation is straightforward. If they are not, the problem of propagating updates is essentially the view-maintenance problem.

What data to summarize

The raw data generated by a transaction-processing system may be too large to store online. However, we can answer many queries by maintaining just summary data obtained by aggregation on a relation, rather than maintaining the entire relation.
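As an illustration, assuming a toy `sales` relation in SQLite, a summary table built by aggregation can stand in for the raw relation when answering aggregate queries:

```python
import sqlite3

# In-memory stand-in for a large transaction relation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("east", 50.0), ("west", 75.0)])

# Summary table: keep only the aggregate, not every raw transaction.
conn.execute("""CREATE TABLE sales_summary AS
                SELECT region, SUM(amount) AS total, COUNT(*) AS n
                FROM sales GROUP BY region""")
summary = conn.execute(
    "SELECT region, total FROM sales_summary ORDER BY region").fetchall()
# summary is [('east', 150.0), ('west', 75.0)]
```

Queries about totals per region can now be answered from `sales_summary` alone; only queries needing individual transactions require the full relation.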



 

Thursday, February 10, 2011

Data warehouse architecture


Architecture is the conceptualization of how the data warehouse is built. There is no single right or wrong architecture; the worth of an architecture is judged by how well the conceptualization aids in building, maintaining, and using the data warehouse. One possible simple conceptualization of data warehouse architecture consists of the following layers.
  •          Data Extraction and Loading
  •          The Warehouse
  •          Analyze and Query -- OLAP Tools
  •          Metadata
  •          Data Mining tools


    •   Data extraction and loading 

    This is the first step in data warehousing. Here the data is prepared using techniques such as:

    • Extracting
    • Conditioning
    • Loading

    • Extracting – The data is captured from the operational sources in “as is” status. Sources include VSAM, IMS, IDMS, and DB2; today more data resides in relational databases on Unix.
    • Conditioning – In this step, data types are converted from the source to the target data store (the warehouse).
    • Loading – Here SQL is used to append or insert the new data.
      •          A record-at-a-time interface will lead to random disk I/Os.
      •          A batch load utility can also be used for loading.
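The difference between the two loading styles can be sketched with Python's built-in `sqlite3` module (a stand-in for a real warehouse DBMS); `executemany` inside a single transaction plays the role of the batch load utility:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER, name TEXT)")
rows = [(i, f"name{i}") for i in range(1000)]

# Record-at-a-time interface: one INSERT per row, which on a real disk
# means one random I/O (and potentially one commit) per record:
# for row in rows:
#     conn.execute("INSERT INTO target VALUES (?, ?)", row)

# Batch load: hand the whole set to the driver in one call,
# wrapped in a single transaction.
with conn:
    conn.executemany("INSERT INTO target VALUES (?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]  # 1000
```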
    •   The warehouse 

    The heart of the data warehouse is the data itself. Data is organized in a way that represents the business; for this, subject orientation is used. The warehouse is a collection of subjects such as customer, product, policy, and account. A subject may be implemented as a set of related tables; e.g., a customer may be described by several related tables, each covering some of its properties.

    Subject orientation yields summarized data, but summaries alone are not adequate to answer some questions, and the detailed data is too voluminous to keep online. So usually the summary data is stored on disk and the details on tape. Derived data is useful in this context.

    Data in the warehouse is stored in 

    •        Dimension tables
    •        Fact Table

    •          Dimension tables
      •          Define the business in terms already familiar to users
      •          Small tables (about a million rows)
      •          Joined to the fact table by a foreign key
      •          Heavily indexed
      •          Typical dimensions: time periods, geographic regions (markets, cities), products, customers, salespersons, etc.
    •          Fact table
      •          Central table
      •          Mostly raw numeric items
      •          Narrow rows, a few columns at most
      •          Large number of rows (millions to a billion)
      •          Accessed via the dimensions
    The tables are stored according to a specific schema. A schema is the organized structure of the tables. There are various types of schemas.
    •    Star Schema
    •    Fact Constellation Schema
    •    Snowflake schema
     
    • Star Schema
      •          A single fact table, with one dimension table for each dimension
      •          Does not capture hierarchies directly

       
      • Snowflake schema
        •          Represents dimensional hierarchies directly by normalizing the dimension tables
        •          Easy to maintain and saves storage
       

      • Fact Constellation schema –
        •         Multiple fact tables that share many dimension tables
        •         Booking and Checkout may share many dimension tables in the hotel industry
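The star layout above can be illustrated with a toy SQLite schema: one fact table joined to its dimension tables by foreign keys, queried by grouping on dimension attributes (all table and column names are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales  (time_id INTEGER REFERENCES dim_time,
                          product_id INTEGER REFERENCES dim_product,
                          amount REAL);
INSERT INTO dim_time    VALUES (1, 2010), (2, 2011);
INSERT INTO dim_product VALUES (10, 'widget');
INSERT INTO fact_sales  VALUES (1, 10, 100.0), (2, 10, 250.0);
""")

# A typical star query: join the fact table to its dimensions,
# then aggregate along a dimension attribute.
result = conn.execute("""
    SELECT t.year, p.name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_time t    ON f.time_id = t.time_id
    JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY t.year, p.name ORDER BY t.year
""").fetchall()
# result is [(2010, 'widget', 100.0), (2011, 'widget', 250.0)]
```

A snowflake schema would further normalize `dim_time` (e.g. into day, month, and year tables); a fact constellation would add a second fact table reusing the same dimension tables.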




      •  Analyze, Query & OLAP Tools
        
      Query processing is supported by
      •          Indexing
      •          Pre computed views/aggregates
      •          SQL extensions
       
      • Indexing Techniques
      Exploiting indexes to reduce the scanning of data is of crucial importance. Common types of indexes include
        •          Bitmap Indexes
        •          Join Indexes
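A bitmap index can be sketched in a few lines of Python: one bit vector per distinct column value, so that row selection becomes bitwise arithmetic instead of a table scan (this is a toy model, not how a real DBMS stores its bitmaps):

```python
# Toy bitmap index on a low-cardinality column: one bit vector per
# distinct value; AND/OR of the vectors answers multi-condition
# queries without scanning the rows.
rows = ["east", "west", "east", "east", "west"]

def build_bitmap_index(values):
    """Map each distinct value to an integer used as a bit vector."""
    index = {}
    for i, v in enumerate(values):
        index[v] = index.get(v, 0) | (1 << i)
    return index

def matching_rows(bitmap, n):
    """Decode a bit vector back into a list of row positions."""
    return [i for i in range(n) if bitmap >> i & 1]

index = build_bitmap_index(rows)
east_rows = matching_rows(index["east"], len(rows))  # rows 0, 2 and 3
```

A predicate such as `region = 'east' OR region = 'west'` would be answered by `index["east"] | index["west"]`, one bitwise operation per word of the vector.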
      • Pre-computed Aggregates
      Aggregated data is kept for efficiency (pre-computed queries).
      Aggregate tables can be maintained by the
        •         warehouse server
        •         middle tier
        •         client applications
      It may include
          •         rank (top 10 customers)
          •         percentile (top 30% of customers)
          •         median, mode
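A minimal sketch of such pre-computed rank and percentile aggregates, recomputed from a toy table of per-customer purchase totals (the names and numbers are invented):

```python
# Maintaining small pre-computed aggregates: top-N customers and a
# top-percentile cut, recomputed whenever the base data is refreshed.
purchases = {"alice": 500.0, "bob": 120.0, "carol": 340.0, "dave": 90.0}

def top_n(totals, n):
    """Rank customers by total and keep only the top n."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

def top_percentile(totals, fraction):
    """Customers in the top `fraction` (e.g. 0.3 for the top 30%)."""
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]

top2 = top_n(purchases, 2)              # alice and carol
top30 = top_percentile(purchases, 0.3)  # top 30% of 4 customers -> 1 row
```

Whether this recomputation runs in the warehouse server, a middle tier, or the client application is exactly the maintenance choice listed above.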
      • SQL extensions
             SQL provides many features for querying the database; some extensions include
        •          Aggregates
        •          Using SubQueries in calculations
      • On-Line Analytical Processing (OLAP)
             Generally synonymous with earlier terms such as decision support, business intelligence, and executive information systems.
      OLAP = Multidimensional Database
          •          It is a powerful visualization paradigm
          •          It provides fast, interactive response times
          •          It is good for analyzing time series
          •          It can be useful to find some clusters and outliers
          •          Many vendors offer OLAP tools
        • Nature of OLAP Analysis
          •          Aggregation -- (total sales, percent-to-total)
          •          Comparison -- Budget vs. Expenses
          •          Ranking -- Top 10, quartile analysis
          •          Access to detailed and aggregate data
          •          Complex criteria specification
          •          Visualization
       
      • Metadata Repository
      A metadata repository stores data about data. It contains several types of metadata:
      •          Administrative metadata
      •          Business metadata
      •          Operational metadata

      •          Administrative metadata contains
        •         source databases and their contents
        •         gateway descriptions
        •         warehouse schema, view & derived data definitions
        •         dimensions, hierarchies
        •         pre-defined queries and reports
        •         data mart locations and contents
        •         data partitions
        •         data extraction, cleansing, transformation rules, defaults
        •         data refresh and purging rules
        •         user profiles, user groups
        •         security: user authorization, access control
      •          Business metadata contains
        •         business terms and definitions
        •         ownership of data
        •         charging policies
      •          Operational metadata contains
        •         data lineage:  history of migrated data and sequence of transformations applied
        •         currency of data:  active, archived, purged
        •         Monitoring information:  warehouse usage statistics, error reports, audit trails.
       
      • Data mining and reporting tools -
      Data mining tools help extract information from the data warehouse and generate reports. Some of these tools are:
      •          Oracle -- Discoverer2000       
      •          Information Builders Inc. -- Focus for Windows
      •          Software AG -- Esperant         
      •          Sterling Software -- VISION:Data       
      •          Prism Solutions -- Prism Warehouse Manager
      •          Red Brick Systems -- DecisionScape Formation.

      Sunday, February 6, 2011

      Data Warehouse concept need and history

      Introduction and Need

      Today, in the age of information technology, data is an important entity. Many companies face data-related problems such as:

      o   Data is scattered over the network
      o   Many versions, subtle differences
      o   Need an expert to get the data
      o   Available data poorly documented
      o   Data found can’t be used
      §  Results are unexpected
      §  Data needs to be transformed from one form to another


      So the data warehouse concept was introduced to tackle such problems.


      Definitions

      A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision-making process.

      A data warehouse is defined as a single, complete and consistent store of data, obtained from a variety of different sources and made available to end users in a form they can understand and use in a business context.

      A data warehouse combines data from multiple sources into one comprehensive database. Building one is a process of transforming data into information and making it available to users in a timely enough manner to make a difference.

      A data warehouse is a technique for assembling and managing data from various sources for the purpose of answering business questions, thus enabling decisions that were not previously possible.

      It is a decision support database maintained separately from the organization’s operational database.


      Characteristics

      Data warehouse has following characteristics

      • Subject oriented – Data that gives information about a particular subject instead of about the company’s ongoing operations.

      • Integrated – Data that is gathered into data warehouse from a variety of sources and merged into a coherent whole.

      • Time variant – All data in data warehouse is identified with a particular time period.

      • Non volatile – Data is stable in data warehouse. More data is added but data is never removed.


      History

      The Key developments in early years of data warehousing are

      • 1960s – General Mills and Dartmouth College collaborate on a joint research project.
      • 1970s – ACNielsen and IRI provide dimensional data marts for retail sales.
      • 1983 – Teradata introduces a database management system specifically designed for decision support.
      • 1988 – Barry Devlin and Paul Murphy publish the article “An architecture for a business and information system” in the IBM Systems Journal, where they introduce the term “business data warehouse”.
      • 1990 – Red Brick Systems introduces Red Brick Warehouse, a database management system for data warehousing.
      • 1991 – Prism Solutions introduces Prism Warehouse Manager, software for developing a data warehouse.
      • 1995 – The Data Warehousing Institute, which promotes data warehousing, is founded.
      • 1996 – Ralph Kimball publishes the book The Data Warehouse Toolkit.
      • 1997 – Oracle 8, with support for star queries, is released.
      Uses

      • Analyse trends – A data warehouse is commonly used by companies to analyse trends over time.
      • View operations – Companies can use the data warehouse to view day-to-day operations.
      • Planning – Its main purpose, however, is to facilitate the strategic planning that results from long-term data overviews.
      • Other uses – The overview can be used to
        • Build the business model,
        • Make forecasts,
        • Write reports, etc.
      • Current data – The data warehouse can also be used to maintain an up-to-date database.

      Thursday, February 3, 2011

      Autonomic Systems


      Autonomic computing systems are systems capable of self-management. They overcome the rapidly growing complexity of managing computing systems and reduce the barrier that this complexity poses to further growth.

                  In other words, autonomic computing refers to the self-managing characteristics of distributed computing resources, adapting to unpredictable changes whilst hiding intrinsic complexity to operators and users. 

                  It is inspired by the autonomic nervous system of the human body. This nervous system controls important bodily functions (e.g. respiration, heart rate, and blood pressure) without any conscious intervention.


      Features

      An autonomic system can have following features 

      •   Make decisions on its own,
      •   Will constantly check and optimize its status
      •   Automatically adapt itself to changing conditions. 

      The Autonomic Computing Initiative (ACI) aims at providing the foundation for autonomic systems. An autonomic computing framework can be built from autonomic components interacting with each other.
      An autonomic component can be modeled with

      •   Sensors (for self-monitoring)
      •   Effectors (for self-adjustment)
      •   Knowledge and adapter (for exploiting policies based on self- and environment awareness).
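The sensor/knowledge/effector model can be sketched as a small control loop. The class and the scaling policy below are illustrative inventions, not part of IBM's actual framework:

```python
# Minimal sketch of an autonomic component: a sensor reads a metric,
# a policy (the knowledge) decides, and an effector adjusts the system.
class AutonomicComponent:
    def __init__(self, sensor, policy, effector):
        self.sensor = sensor      # self-monitoring
        self.policy = policy      # knowledge: maps observation -> action
        self.effector = effector  # self-adjustment

    def control_loop(self, cycles):
        actions = []
        for _ in range(cycles):
            observation = self.sensor()
            action = self.policy(observation)
            if action is not None:
                self.effector(action)
                actions.append(action)
        return actions

# Usage: keep a (simulated) load below 80% by scaling out (self-optimization).
state = {"load": 0.95, "replicas": 1}
component = AutonomicComponent(
    sensor=lambda: state["load"],
    policy=lambda load: "scale_out" if load > 0.8 else None,
    effector=lambda action: state.update(
        replicas=state["replicas"] + 1, load=state["load"] / 2),
)
taken = component.control_loop(cycles=3)
# after one scale-out the load drops below the threshold, so no further action fires
```

The human operator's role, as described below, is to supply the `policy`, not to drive the `effector` directly.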
       
      Benefits of autonomic systems

      •   They enable modern, networked computing systems to manage themselves without direct human intervention. 
      •   Here the human operator takes on a new role: he does not control the system directly. Instead, he defines general policies and rules that serve as input to the self-management process.
      •   For this process, IBM has defined the following four functional areas for autonomic systems:
       

      1.      Self-Configuration: Automatic configuration of components;
      2.      Self-Healing: Automatic discovery, and correction of faults;
      3.      Self-Optimization: Automatic monitoring and control of resources to ensure the optimal functioning with respect to the defined requirements;
      4.      Self-Protection: Proactive identification and protection from arbitrary attacks.

      Personal broadcasting


      Personal broadcasting is a term for participatory journalism that focuses on television webcasting over the Internet and the mobile Internet.
      The Internet has made it possible for almost anyone to create their own web-based, online radio station. For a small investment, it is now possible to stream a music or talk show to a global audience. For educational organizations and other qualifying groups, many broadcasting options are available, such as:
      •   Low Power FM Broadcasts
      •   Create Your Own Radio Show
      •   Create Internet Radio
      Benefits of personal broadcasting 

      •  Privacy and Security: Users invite subscribers, such as family and friends, who are the only people who can access a channel’s password-protected media. 
      • Convenience for Users: After a simple download and installation, users simply save files to a folder associated with the channel, and links are automatically fed to the subscriber. Any time a file is saved or edited, the channel’s links are automatically updated. Any type of file can be securely shared in a channel, including online media such as YouTube videos and online music.
      • Convenience for Subscribers: Subscribers do not have to seek out their friends’ or family members’ updates on multiple Web sites. Instead, the updates automatically come to them in convenient, well-visited places: a personalized homepage such as My Yahoo or My AOL, a social networking site, Microsoft Vista’s Sidebar, or a Web browser bookmark. Anyone with Internet access can receive Private Netcasting channels on their computer or mobile device. The subscriber does not need to install anything.

      • Organization: Users can easily arrange the sequence of their shared media to tell a story or to have the most recent appear noticeably on top. They can create separate channels by audience, such as a Joke Channel for friends, a Hobby Channel for other enthusiasts and a Family Channel for the extended family.

      • Unlimited Sharing: Consumers can share an unlimited amount of personal media with their subscribers, overcoming the size limitations of alternative systems. Subscribers always will get the most popular and newest content “On Demand,” even if the user’s PC is turned off. 

      • Feedback Mechanisms Consistent with Web 2.0: Each media item has its own discussion thread enabling focused communication between the user and all channel subscribers. Users can review a channel’s “Ratings” which tracks how many times a media item has been viewed or commented upon, to help them learn which are most popular for future personal broadcasts.


      Example of Personal broadcasting media - 

      EnjoyMyMedia is the world’s first personal broadcasting service
      Its features are

      •  It is free.
      •  It enables consumers to conveniently and securely broadcast personal multimedia (photos, videos, audio, documents or any other file) over the Internet in private, secure channels.
      •  Private Netcasting turns the act of saving a file into a way to stay connected, and is the safest and simplest way to keep friends and family updated.
      •  It transforms PC folders into broadcast channels.
      •  Save any file to a folder, and the files are automatically shared with invited family and friends, who do not need to install any software.
      •  Subscribers access the password-protected channels through My Yahoo, iGoogle, My AOL, Microsoft Vista’s Sidebar, Facebook, current Internet browsers or any other RSS-compliant system.

      By contrast, the digital media market has been focused on sharing with the world in a one-to-many model, including distributing to a consumer’s 1,000 “closest friends” on a social networking site, or public video sharing.

        Context aware environments


        Introduction & Definition

        Context awareness is a term from ubiquitous computing (also called pervasive computing), which sought to deal with linking changes in the environment to computer systems that are otherwise static.

        The term is also applied in business theory in relation to business process management.

        Basically, context is the set of rules of inter-relationship of features in processing any entities. Context is commonly taken to be the combination of four entities:

        • Location
        • Identity
        • Activity
        • Time

        Context-aware devices are devices which can both sense and react based on their environment. The devices may have information about the circumstances under which they are able to operate and, based on rules or an intelligent stimulus, react accordingly. These devices try to make assumptions about the user's current situation.

        Context-aware systems can
        •          Perceive a situation, i.e., acquire context.
        •          Understand the context, i.e., match sensory stimuli to a context.
        •          Trigger actions, i.e., behave based on the recognized context.
        The user’s activity and location are crucial factors in context awareness.
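The four-entity context model and the perceive/understand/trigger cycle can be sketched as follows; the rules and contexts are invented examples:

```python
from dataclasses import dataclass

# The four-entity context model from the text: location, identity,
# activity, time. Rules map a recognized context to an action
# (perceive -> understand -> trigger).
@dataclass(frozen=True)
class Context:
    location: str
    identity: str
    activity: str
    time: str

def react(ctx):
    """Trigger behaviour based on the recognized context."""
    if ctx.location == "meeting_room" and ctx.activity == "presenting":
        return "silence_phone"
    if ctx.time == "night" and ctx.location == "home":
        return "dim_screen"
    return "no_action"

action = react(Context("meeting_room", "alice", "presenting", "10:00"))
# action is "silence_phone"
```

A real system would fill the `Context` fields from sensors (GPS, calendar, clock) rather than constructing them by hand, and the rule set would typically be learned or configurable rather than hard-coded.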

        Applications of context aware devices 

        •   Adapt interfaces
        •   Tailor application-relevant data
        •   Discover services
        •   Make the user interaction implicit
        •   Build smart environments.

        Categories of context aware environments
        1.      Human-factors-related context
          •       User information – the user’s habits, emotional state, etc.
          •       User’s social environment – co-location of others, social interaction
          •       User’s tasks – spontaneous activity, engaged tasks
        2.      Physical environment context
          •       Location (absolute position, relative position, co-location, ...)
          •       Infrastructure (surrounding resources for computation, communication, task performance, ...)
          •       Physical conditions (noise, light, pressure, ...)