What is data warehouse?

A data warehouse is an electronic store of an organization's historical data for the purpose of
analysis and reporting. According to Kimball, a data warehouse should be subject-oriented,
non-volatile, integrated and time-variant.

What are the benefits of data warehouse?

Historical data stored in a data warehouse helps to analyze different aspects of the business,
including performance analysis, trend analysis, trend prediction etc., which ultimately increases
the efficiency of business processes.

Why Data Warehouse is used?

A data warehouse facilitates reporting on different key business processes, known as KPIs. A data
warehouse can further be used for data mining, which helps in trend prediction, forecasting,
pattern recognition etc.

What is the difference between OLTP and OLAP?

OLTP is the transaction system that collects business data, whereas OLAP is the reporting and
analysis system on that data.

OLTP systems are optimized for INSERT and UPDATE operations and are therefore highly normalized.
On the other hand, OLAP systems are deliberately denormalized for fast data retrieval through
SELECT operations.

Explanatory Note:

In a department store, when we pay at the check-out counter, the sales person at the counter keys
in all the data into a "Point-Of-Sale" machine. That data is transaction data and the related
system is an OLTP system. On the other hand, the manager of the store might want to view a report
on out-of-stock materials, so that he can place purchase orders for them. Such a report will come
out of an OLAP system.
What is data mart?

Data marts are generally designed for a single subject area. An organization may have data
pertaining to different departments like Finance, HR, Marketing etc. stored in the data warehouse,
and each department may have a separate data mart. These data marts can be built on top of the
data warehouse.

What is ER model?

ER model is the entity-relationship model, which is designed with the goal of normalizing the data.

What is dimensional modelling?

A dimensional model consists of dimension and fact tables. Fact tables store different
transactional measurements and the foreign keys from the dimension tables that qualify the data.
The goal of the dimensional model is not to achieve a high degree of normalization but to
facilitate easy and fast data retrieval.

What is dimension?

A dimension is something that qualifies a quantity (measure).

If I just say "20kg", it does not mean anything. But "20kg of Rice (product) sold to Ramesh
(customer) on 5th April (date)" gives a meaningful sense. The product, customer and date are
dimensions that qualify the measure. Dimensions are mutually independent.

Technically speaking, a dimension is a data element that categorizes each item in a data set into
non-overlapping regions.

What is fact?

A fact is something that is quantifiable (Or measurable). Facts are typically (but
not always) numerical values that
can be aggregated.

What are additive, semi-additive and non-additive measures?


Non-additive measures are those which cannot be used inside any numeric aggregation function
(e.g. SUM(), AVG() etc.). One example of a non-additive fact is any kind of ratio or percentage,
e.g. 5% profit margin, revenue-to-asset ratio etc. Non-numerical data can also be a non-additive
measure when it is stored in a fact table.

Semi-additive measures are those where only a subset of aggregation functions can be applied.
Let's say account balance: a SUM() on balance does not give a useful result, but MAX() or MIN()
balance might be useful. Consider a price rate or currency rate: SUM is meaningless on a rate;
however, an average might be useful.

Additive measures can be used with any aggregation function like SUM(), AVG() etc. An example is
Sales Quantity.
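
The distinction becomes obvious in SQL. Below is a minimal sketch against a hypothetical
account_balance_fact table holding daily balance snapshots per account; the table and column names
are assumptions for illustration only.

SELECT account_id,
       SUM(balance) AS misleading_total,   -- adding balances across days is meaningless
       MAX(balance) AS peak_balance,       -- semi-additive: MAX/MIN remain valid
       AVG(balance) AS average_balance     -- semi-additive: AVG remains valid
FROM   account_balance_fact
GROUP  BY account_id;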

What is Star-schema?

This schema is used in data warehouse models where one centralized fact table references a number
of dimension tables, so that the primary keys from all the dimension tables flow into the fact
table (as foreign keys), where the measures are stored. The entity-relationship diagram looks like
a star, hence the name.

Consider a fact table that stores sales quantity for each product and customer at a certain time.
Sales quantity will be the measure here, and keys from the customer, product and time dimension
tables will flow into the fact table.

A star-schema is a special case of snow-flake schema.
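
A minimal sketch of the star schema described above (the table and column names are assumptions):
the fact table carries the measure plus one foreign key per dimension.

CREATE TABLE fact_sales (
    date_key     INT,   -- foreign key to the date dimension
    customer_key INT,   -- foreign key to the customer dimension
    product_key  INT,   -- foreign key to the product dimension
    sales_qty    INT    -- the measure
);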

What is snow-flake schema?


This is another logical arrangement of tables in dimensional modeling where a centralized fact
table references a number of dimension tables; however, those dimension tables are further
normalized into multiple related tables.

Consider a fact table that stores sales quantity for each product and customer at a certain time.
Sales quantity will be the measure here, and keys from the customer, product and time dimension
tables will flow into the fact table. Additionally, all the products can be further grouped under
different product families stored in a separate table, so that the primary key of the product
family table also goes into the product table as a foreign key. Such a construct is called a
snow-flake schema, as the product table is further snow-flaked into product family.

Note
Snow-flaking increases the degree of normalization in the design.
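
A minimal sketch of the snow-flaked product dimension described above (table and column names are
assumptions): dim_product references a separate, normalized dim_product_family table instead of
carrying the family attributes itself.

CREATE TABLE dim_product_family (
    family_key  INT PRIMARY KEY,
    family_name VARCHAR(100)
);

CREATE TABLE dim_product (
    product_key  INT PRIMARY KEY,
    product_name VARCHAR(100),
    family_key   INT REFERENCES dim_product_family (family_key)  -- snow-flaked attribute
);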

What are the different types of dimension?

In a data warehouse model, dimensions can be of the following types:

1. Conformed Dimension
2. Junk Dimension
3. Degenerated Dimension
4. Role Playing Dimension

Based on how frequently the data inside a dimension changes, we can further classify dimensions as:

1. Unchanging or static dimension (UCD)
2. Slowly changing dimension (SCD)
3. Rapidly changing dimension (RCD)

What is a 'Conformed Dimension'?

A conformed dimension is a dimension that is shared across multiple subject areas. Consider a
'Customer' dimension: both the marketing and sales departments may use the same customer dimension
table in their reports. Similarly, a 'Time' or 'Date' dimension will be shared by different
subject areas. These dimensions are conformed dimensions.

Theoretically, two dimensions which are either identical or strict mathematical subsets of one
another are said to be conformed.

What is degenerated dimension?

A degenerated dimension is a dimension that is derived from the fact table and does not have its
own dimension table.

A dimension key such as transaction number, receipt number, invoice number etc. does not have any
further associated attributes and hence cannot be designed as a dimension table.

What is junk dimension?

A junk dimension is a grouping of typically low-cardinality attributes (flags, indicators etc.) so
that those can be removed from other tables and junked into an abstract dimension table.

These junk dimension attributes might not be related. The only purpose of this table is to store
all the combinations of the dimensional attributes which you could not fit into the different
dimension tables otherwise. One may want to read an interesting document, De-clutter with Junk
(Dimension).

What is a role-playing dimension?

Dimensions are often reused for multiple applications within the same database with different
contextual meanings. For instance, a "Date" dimension can be used for "Date of Sale" as well as
"Date of Delivery" or "Date of Hire". This is often referred to as a 'role-playing dimension'.

What is SCD?

SCD stands for slowly changing dimension, i.e. the dimensions where data is slowly
changing. These can be of many
types, e.g. Type 0, Type 1, Type 2, Type 3 and Type 6, although Type 1, 2 and 3 are
most common.

What is rapidly changing dimension?

This is a dimension where data changes rapidly.

Describe different types of slowly changing Dimension (SCD)

Type 0:

A Type 0 dimension is one where dimensional changes are not considered. This does not mean that
the attributes of the dimension do not change in the actual business situation. It just means
that, even if the value of the attributes changes, the table is not updated and continues to hold
the previously loaded data, so the change is not reflected.

Type 1:

A Type 1 dimension is one where history is not maintained and the table always shows the most
recent data. This effectively means that such a dimension table is always updated with the recent
data whenever there is a change, and because of this update we lose the previous values.

Type 2:

A Type 2 dimension table tracks historical changes by creating separate rows in the table with
different surrogate keys. Consider that a customer C1 is under group G1 first and later the
customer is moved to group G2. Then there will be two separate records in the dimension table as
below:

Key  Customer  Group  Start Date    End Date
1    C1        G1     1st Jan 2000  31st Dec 2005
2    C1        G2     1st Jan 2006  NULL
Note that separate surrogate keys are generated for the two records. The NULL end date in the
second row denotes that this record is the current record. Also note that, instead of start and
end dates, one could keep a version number column (1, 2, ... etc.) to denote the different
versions of the record.
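
A minimal sketch of how such a Type 2 change can be applied in SQL (dim_customer and its columns
are assumed names matching the example above): the current row is end-dated and a new row with a
fresh surrogate key is inserted.

UPDATE dim_customer
SET    end_date = DATE '2005-12-31'               -- close the old version
WHERE  customer = 'C1' AND end_date IS NULL;

INSERT INTO dim_customer (cust_key, customer, cust_group, start_date, end_date)
VALUES (2, 'C1', 'G2', DATE '2006-01-01', NULL);  -- new current version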

Type 3:

A Type 3 dimension stores the history in a separate column instead of separate rows. So unlike a
Type 2 dimension, which grows vertically, a Type 3 dimension grows horizontally. See the example
below:

Key  Customer  Previous Group  Current Group
1    C1        G1              G2

This is only good when you do not need to store many consecutive histories and when the date of
change is not required to be stored.

Type 6:

A Type 6 dimension is a hybrid of types 1, 2 and 3 (1+2+3). It acts very similarly to Type 2, but
with one extra column added to denote which record is the current record.

Key  Customer  Group  Start Date    End Date       Current Flag
1    C1        G1     1st Jan 2000  31st Dec 2005  N
2    C1        G2     1st Jan 2006  NULL           Y

What is a mini dimension?

Mini dimensions can be used to handle the rapidly changing dimension scenario. If a dimension has
a huge number of rapidly changing attributes, it is better to separate those attributes into a
different table called a mini dimension. This is done because if the main dimension table is
designed as SCD Type 2, the table will soon grow too large and create performance issues. It is
better to segregate the rapidly changing members into a different table, thereby keeping the main
dimension table small and well performing.

What is a fact-less-fact?

A fact table that does not contain any measure is called a fact-less fact. This
table will only contain keys from different
dimension tables. This is often used to resolve a many-to-many cardinality issue.

Explanatory Note:

Consider a school, where a single student may be taught by many teachers and a single teacher may
have many students. To model this situation in a dimensional model, one might introduce a
fact-less-fact table joining the teacher and student keys. Such a fact table will then be able to
answer queries like:

1. Who are the students taught by a specific teacher?
2. Which teacher teaches the maximum number of students?
3. Which student has the highest number of teachers?
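
A minimal sketch of the second query against such a table (fact_student_teacher and its key
columns are assumed names): with no measure present, the fact-less fact is simply counted and
grouped.

SELECT teacher_key,
       COUNT(DISTINCT student_key) AS students_taught  -- no measure, just key counting
FROM   fact_student_teacher
GROUP  BY teacher_key
ORDER  BY students_taught DESC;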

What is a coverage fact?

A fact-less-fact table can only answer 'optimistic' (positive) queries but cannot answer a
negative query. Again consider the illustration in the above example. A fact-less fact containing
the keys of teachers and students cannot answer queries like the ones below:

1. Which teacher did not teach any student?
2. Which student was not taught by any teacher?

Why not? Because the fact-less fact table only stores the positive scenarios (like a student being
taught by a teacher), but if there is a student who is not being taught by any teacher, then that
student's key does not appear in this table, thereby reducing the coverage of the table.

A coverage fact table attempts to answer this, often by adding an extra flag column. Flag = 0
indicates a negative condition and flag = 1 indicates a positive condition. To understand this
better, let's consider a class where there are 100 students and 5 teachers. The coverage fact
table will ideally store 100 x 5 = 500 records (all combinations), and if a certain teacher is not
teaching a certain student, the corresponding flag for that record will be 0.
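
A minimal sketch of the negative query against such a table (coverage_fact and its columns are
assumed names): because every teacher/student combination is present, a teacher whose flag never
reaches 1 taught nobody.

SELECT teacher_key
FROM   coverage_fact
GROUP  BY teacher_key
HAVING MAX(taught_flag) = 0;   -- the flag stayed 0 for every student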

What are incident and snapshot facts?

A fact table stores some kind of measurements. Usually these measurements are stored (or captured)
against a specific time, and they vary with respect to time. Now it might so happen that the
business is not able to capture all of its measures for every point in time. Then those
unavailable measurements can be kept empty (NULL) or can be filled up with the last available
measurement. The first case is an example of an incident fact and the second one is an example of
a snapshot fact.

What is aggregation and what is the benefit of aggregation?

A data warehouse usually captures data with the same degree of detail as available in the source.
This "degree of detail" is termed granularity. But not all reporting requirements from that data
warehouse need the same degree of detail.

To understand this, let's consider an example from the retail business. A certain retail chain has
500 shops across Europe. All the shops record detail-level transactions regarding the products
they sell, and those data are captured in a data warehouse.

Each shop manager can access the data warehouse and can see which products are sold by whom and in
what quantity on any given date. Thus the data warehouse helps the shop managers with the
detail-level data that can be used for inventory management, trend prediction etc.

Now think about the CEO of that retail chain. He does not really care which sales girl in London
sold the highest number of chopsticks or which shop is the best seller of 'brown bread'. All he is
interested in is, perhaps, the percentage increase of his revenue margin across Europe, or maybe
the year-to-year sales growth in Eastern Europe. Such data is aggregated in nature, because sales
of goods in Eastern Europe are derived by summing up the individual sales data from each shop in
Eastern Europe.

Therefore, to support different levels of data warehouse users, data aggregation is needed.
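
A minimal sketch of such an aggregate query (fact_shop_sales, dim_shop and dim_date with the
listed columns are assumed names): shop-level detail rolled up to region and year for the
CEO-style view.

SELECT s.region,
       d.year,
       SUM(f.sales_amount) AS total_sales   -- aggregated measure
FROM   fact_shop_sales f
JOIN   dim_shop s ON s.shop_key = f.shop_key
JOIN   dim_date d ON d.date_key = f.date_key
GROUP  BY s.region, d.year;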

What is slicing-dicing?

Slicing means showing a slice of the data, given a certain dimension (e.g. Product), a value
(e.g. Brown Bread) and measures (e.g. sales).

Dicing means viewing the slice with respect to different dimensions and at different levels of
aggregation.

Slicing and dicing operations are part of pivoting.

What is drill-through?

Drill-through is the process of going to the detail-level data from summary data.

Consider the above example on retail shops. If the CEO finds out that sales in Eastern Europe have
declined this year compared to last year, he might want to know the root cause of the decrease.
For this, he may start drilling through his report to a more detailed level and eventually find
out that, even though individual shop sales have actually increased, the overall sales figure has
decreased because a certain shop in Turkey has stopped operating. The detail-level data, which the
CEO was not much interested in earlier, has this time helped him to pinpoint the root cause of the
declined sales. The method he followed to obtain the details from the aggregated data is called
drill-through.

Informatica Questions

Welcome to the finest collection of Informatica Interview Questions with standard answers that you
can count on. Read and understand all the questions and their answers below and in the following
pages to get a good grasp of Informatica.
What are the differences between Connected and Unconnected Lookup?

Connected Lookup vs. Unconnected Lookup:

. Connected lookup participates in the data flow and receives input directly from the pipeline,
whereas unconnected lookup receives input values from the result of a :LKP expression in another
transformation.
. Connected lookup can use both dynamic and static cache, whereas an unconnected lookup cache
cannot be dynamic.
. Connected lookup can return more than one column value (output ports), whereas unconnected
lookup can return only one column value, i.e. the return port.
. Connected lookup caches all lookup columns, whereas unconnected lookup caches only the lookup
condition ports and the return port.
. Connected lookup supports user-defined default values (i.e. the value to return when the lookup
condition is not satisfied), whereas unconnected lookup does not support user-defined default
values.

What is the difference between Router and Filter?

Router vs. Filter:

. Router transformation divides the incoming records into multiple groups based on conditions;
such groups can be mutually inclusive (different groups may contain the same record). Filter
transformation restricts or blocks the incoming record set based on one given condition.
. Router transformation itself does not block any record; if a certain record does not match any
of the routing conditions, the record is routed to the default group. Filter transformation does
not have a default group; if a record does not match the filter condition, the record is blocked.
. Router acts like a CASE..WHEN statement in SQL (or a switch..case statement in C), whereas
Filter acts like the WHERE condition in SQL.

What can we do to improve the performance of Informatica Aggregator Transformation?

Aggregator performance improves dramatically if records are sorted before being passed to the
aggregator and the "Sorted Input" option under the aggregator properties is checked. The record
set should be sorted on the columns that are used in the Group By operation.

It is often a good idea to sort the record set at the database level (why?), e.g. inside a Source
Qualifier transformation, unless there is a chance that the already sorted records from the source
qualifier can again become unsorted before reaching the aggregator.
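
A minimal sketch of sorting at the database level via a Source Qualifier SQL override
(sales_detail and its columns are assumed names): the ORDER BY must list the aggregator's Group By
ports, in the same order.

SELECT customer_id,
       product_id,
       sales_qty
FROM   sales_detail
ORDER  BY customer_id, product_id;   -- same columns, same order as the Group By ports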

What are the different lookup caches?

Lookups can be cached or uncached (no cache). A cached lookup can be either static or dynamic. A
static cache is one which does not modify the cache once it is built, and it remains the same
during the session run. On the other hand, a dynamic cache is refreshed during the session run by
inserting or updating the records in the cache based on the incoming source data.

A lookup cache can also be divided into persistent or non-persistent, based on whether Informatica
retains the cache even after the session run is complete or not, respectively.

How can we update a record in target table without using Update strategy?

A target table can be updated without using an 'Update Strategy'. For this, we need to define the
key of the target table at the Informatica level and then connect the key and the field we want to
update in the mapping target. At the session level, we should set the target property to "Update
as Update" and check the "Update" check-box.

Let's assume we have a target table "Customer" with fields "Customer ID", "Customer Name" and
"Customer Address". Suppose we want to update "Customer Address" without an Update Strategy. Then
we have to define "Customer ID" as the primary key at the Informatica level and we will have to
connect the Customer ID and Customer Address fields in the mapping. If the session properties are
set correctly as described above, the mapping will only update the Customer Address field for all
matching Customer IDs.

Under what condition selecting Sorted Input in aggregator may fail the session?

. If the input data is not sorted correctly, the session will fail.
. Even if the input data is properly sorted, the session may fail if the sort order ports and the
group by ports of the aggregator are not in the same order.

Why is Sorter an Active Transformation?


Ans. When the Sorter transformation is configured to treat output rows as distinct,
it assigns all ports as part of the
sort key. The Integration Service discards duplicate rows compared during the sort
operation. The number of Input
Rows will vary as compared with the Output rows and hence it is an Active
transformation.

Is lookup an active or passive transformation?

From Informatica 9x, Lookup transformation can be configured as as "Active"


transformation. Find out How to
configure lookup as active transformation

What is the difference between Static and Dynamic Lookup Cache?

Ans. We can configure a Lookup transformation to cache the corresponding lookup table. In case of
a static or read-only lookup cache, the Integration Service caches the lookup table at the
beginning of the session and does not update the lookup cache while it processes the Lookup
transformation.

In case of a dynamic lookup cache, the Integration Service dynamically inserts or updates data in
the lookup cache and passes the data to the target. The dynamic cache is synchronized with the
target.

What is the difference between STOP and ABORT options in Workflow Monitor?

Ans. When we issue the STOP command on an executing session task, the Integration Service stops
reading data from the source. It continues processing, writing and committing the data to the
targets. If the Integration Service cannot finish processing and committing data, we can issue the
ABORT command.

In contrast, the ABORT command has a timeout period of 60 seconds. If the Integration Service
cannot finish processing and committing data within the timeout period, it kills the DTM process
and terminates the session.

How to delete duplicate rows using Informatica

Scenario 1: Duplicate rows are present in a relational database

Suppose we have duplicate records in the source system and we want to load only the unique records
into the target system, eliminating the duplicate rows. What will be the approach?

Ans.
Assuming that the source system is a relational database, to eliminate duplicate records we can
check the Distinct option of the Source Qualifier of the source table and load the target
accordingly.
Source Qualifier Transformation DISTINCT clause
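
Conceptually, the Distinct option simply pushes the de-duplication into the generated SQL, roughly
as sketched below (customer_stage and its columns are assumed names).

SELECT DISTINCT customer_id,
                customer_name,
                customer_address
FROM   customer_stage;   -- duplicates are eliminated in the database itself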

Scenario 2: Deleting duplicate records from a flat file

Deleting duplicate rows for FLAT FILE sources

Now suppose the source system is a flat file. Here, in the Source Qualifier, you will not be able
to select the Distinct clause, as it is disabled for a flat file source. Hence the next approach
may be to use a Sorter Transformation and check its Distinct option. When we select the Distinct
option, all the columns are selected as keys, in ascending order by default.
Sorter Transformation DISTINCT clause

Deleting Duplicate Records Using Informatica Aggregator

Another way to handle duplicate records in a source batch run is to use an Aggregator
Transformation and check the Group By option on the ports having the duplicated data. Here you
have the flexibility to select the last or the first of the duplicate records. Apart from that,
using a Dynamic Lookup Cache on the target table, associating the input ports with the lookup
ports and checking the Insert Else Update option will also help to eliminate the duplicate records
in the source and hence load unique records into the target.

For more details on Dynamic Lookup Cache

Loading Multiple Target Tables Based on Conditions

Q2. Suppose we have some serial numbers in a flat file source. We want to load the
serial numbers in two target files
one containing the EVEN serial numbers and the other file having the ODD ones.

Ans. After the Source Qualifier place a Router Transformation. Create two Groups
namely EVEN and ODD, with
filter conditions as MOD(SERIAL_NO,2)=0 and MOD(SERIAL_NO,2)=1 respectively. Then
output the two groups
into two flat file targets.
Router Transformation Groups Tab

Normalizer Related Questions

Q3. Suppose in our Source Table we have data as given below:

Student Name  Maths  Life Science  Physical Science
Sam           100    70            80
John          75     100           85
Tom           80     100           85

We want to load our Target Table as:

Student Name  Subject Name      Marks
Sam           Maths             100
Sam           Life Science      70
Sam           Physical Science  80
John          Maths             75
John          Life Science      100
John          Physical Science  85
Tom           Maths             80
Tom           Life Science      100
Tom           Physical Science  85

Describe your approach.

Ans. Here, to convert the columns to rows, we have to use the Normalizer Transformation followed
by an Expression Transformation to decode the column taken into consideration. For more details on
how the mapping is performed please visit Working with Normalizer.
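
A hedged SQL analogue of what the Normalizer achieves here (source_marks and its columns are
assumed names): each wide source row is split into one row per subject.

SELECT student_name, 'Maths' AS subject_name, maths AS marks FROM source_marks
UNION ALL
SELECT student_name, 'Life Science', life_science FROM source_marks
UNION ALL
SELECT student_name, 'Physical Science', physical_science FROM source_marks;
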
Q4. Name the transformations which convert one row to many rows, i.e. increase the i/p:o/p row
count. Also, what is the name of its reverse transformation?

Ans. Normalizer as well as Router Transformations are Active transformations which can increase
the number of output rows relative to input rows.

Aggregator Transformation is the Active transformation that performs the reverse action.

Q5. Suppose we have a source table and we want to load three target tables based on source rows
such that the first row moves to the first target table, the second row to the second target
table, the third row to the third target table, the fourth row again to the first target table,
and so on and so forth. Describe your approach.

Ans. We can clearly understand that we need a Router transformation to route or filter the source
data to the three target tables. Now the question is what the filter conditions will be. First of
all we need an Expression Transformation where we have all the source table columns and, along
with that, another i/o port, say SEQ_NUM, which gets a sequence number for each source row from
the NEXTVAL port of a Sequence Generator with start value 0 and increment by 1. Now the filter
conditions for the three router groups will be:

. MOD(SEQ_NUM,3)=1 connected to 1st target table
. MOD(SEQ_NUM,3)=2 connected to 2nd target table
. MOD(SEQ_NUM,3)=0 connected to 3rd target table
Router Transformation Groups Tab

Loading Multiple Flat Files using one mapping

Q6. Suppose we have ten source flat files of the same structure. How can we load all the files
into the target database in a single batch run using a single mapping?

Ans. After we create a mapping to load data into the target database from flat files, we move on
to the session property of the Source Qualifier. To load a set of source files we need to create a
file, say final.txt, containing the source flat file names (ten files in our case) and set the
Source filetype option to Indirect. Next, point to this flat file final.txt, fully qualified,
through the Source file directory and Source filename properties.
Session Property Flat File

Q7. How can we implement an aggregation operation without using an Aggregator Transformation in Informatica?

Ans. We will use the very basic property of the Expression Transformation: at any point we can
access the previous row's data as well as the currently processed row's data. What we need are
simple Sorter, Expression and Filter transformations to achieve aggregation at the Informatica
level.

For a detailed understanding visit Aggregation without Aggregator

Q8. Suppose in our Source Table we have data as given below:

Student Name  Subject Name      Marks
Sam           Maths             100
Tom           Maths             80
Sam           Physical Science  80
John          Maths             75
Sam           Life Science      70
John          Life Science      100
John          Physical Science  85
Tom           Life Science      100
Tom           Physical Science  85

We want to load our Target Table as:

Student Name  Maths  Life Science  Physical Science
Sam           100    70            80
John          75     100           85
Tom           80     100           85

Describe your approach.

Ans. Here our scenario is to convert many rows to one row, and the transformation which will help
us achieve this is the Aggregator.

Our Mapping will look like this:
Mapping using Sorter and Aggregator

Sorter Transformation
We will sort the source data based on STUDENT_NAME ascending followed by SUBJECT ascending.

Now, based on STUDENT_NAME in the GROUP BY clause, the following output subject columns are
populated as:

. MATHS: MAX(MARKS, SUBJECT=Maths)
. LIFE_SC: MAX(MARKS, SUBJECT=Life Science)
. PHY_SC: MAX(MARKS, SUBJECT=Physical Science)
Aggregator Transformation
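
A hedged SQL analogue of this Aggregator logic (source_marks and its columns are assumed names):
one output row per student, with MAX picking the mark for each subject.

SELECT student_name,
       MAX(CASE WHEN subject_name = 'Maths'            THEN marks END) AS maths,
       MAX(CASE WHEN subject_name = 'Life Science'     THEN marks END) AS life_sc,
       MAX(CASE WHEN subject_name = 'Physical Science' THEN marks END) AS phy_sc
FROM   source_marks
GROUP  BY student_name;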

Revisiting Source Qualifier Transformation

Q9. What is a Source Qualifier? What are the tasks we can perform using a SQ and
why it is an ACTIVE
transformation?

Ans. A Source Qualifier is an Active and Connected Informatica transformation that reads the rows
from a relational database or flat file source.

. We can configure the SQ to join [both INNER as well as OUTER JOIN] data originating from the
same source database.
. We can use a source filter to reduce the number of rows the Integration Service queries.
. We can specify a number for sorted ports and the Integration Service adds an ORDER BY clause to
the default SQL query.
. We can choose the Select Distinct option for relational databases and the Integration Service
adds a SELECT DISTINCT clause to the default SQL query.
. Also we can write a Custom/User Defined SQL query which will override the default query in the
SQ by changing the default settings of the transformation properties.
. Also we have the option to write Pre as well as Post SQL statements to be executed before and
after the SQ query in the source database.

Since the transformation provides us with the Select Distinct property, the Integration Service
adds a SELECT DISTINCT clause to the default SQL query, which in turn affects the number of rows
returned by the database to the Integration Service; hence it is an Active transformation.

Q10. What happens to a mapping if we alter the datatypes between Source and its
corresponding Source Qualifier?

Ans. The Source Qualifier transformation displays the transformation datatypes. The transformation
datatypes determine how the source database binds data when the Integration Service reads it.

Now, if we alter the datatypes in the Source Qualifier transformation, or the datatypes in the
source definition and the Source Qualifier transformation do not match, the Designer marks the
mapping as invalid when we save it.

Q11. Suppose we have used the Select Distinct and the Number Of Sorted Ports
property in the SQ and then we add
Custom SQL Query. Explain what will happen.

Ans. Whenever we add a Custom SQL or SQL override query, it overrides the User-Defined Join,
Source Filter, Number of Sorted Ports, and Select Distinct settings in the Source Qualifier
transformation. Hence only the user-defined SQL query will be fired in the database and all the
other options will be ignored.

Q12. Describe the situations where we will use the Source Filter, Select Distinct
and Number Of Sorted Ports
properties of Source Qualifier transformation.

Ans. Source Filter option is used basically to reduce the number of rows the
Integration Service queries so as to
improve performance.

Select Distinct option is used when we want the Integration Service to select
unique values from a source, filtering
out unnecessary data earlier in the data flow, which might improve performance.

Number Of Sorted Ports option is used when we want the source data to be in a sorted fashion, so
as to use the same in some following transformations like Aggregator or Joiner which, when
configured for sorted input, will improve performance.

Q13. What will happen if the SELECT list COLUMNS in the Custom override SQL Query
and the OUTPUT PORTS
order in SQ transformation do not match?

Ans. A mismatch or change in the order of the list of selected columns relative to the connected
transformation output ports may result in session failure.

Q14. What happens if in the Source Filter property of SQ transformation we include
keyword WHERE say, WHERE
CUSTOMERS.CUSTOMER_ID > 1000.

Ans. We use source filter to reduce the number of source records. If we include the
string WHERE in the source filter,
the Integration Service fails the session.

Q15. Describe the scenarios where we go for Joiner transformation instead of Source
Qualifier transformation.

Ans. To join source data from heterogeneous sources, as well as to join flat files, we will use
the Joiner transformation. Use the Joiner transformation when we need to join the following types
of sources:

. Join data from different relational databases.
. Join data from different flat files.
. Join relational sources and flat files.

Q16. What is the maximum number we can use in Number Of Sorted Ports for a Sybase source system?

Ans. Sybase supports a maximum of 16 columns in an ORDER BY clause. So if the source is Sybase, do
not sort more than 16 columns.

Q17. Suppose we have two Source Qualifier transformations SQ1 and SQ2 connected to
Target tables TGT1 and
TGT2 respectively. How do you ensure TGT2 is loaded after TGT1?

Ans. If we have multiple Source Qualifier transformations connected to multiple targets, we can
designate the order in which the Integration Service loads data into the targets.

In the Mapping Designer, we need to configure the Target Load Plan based on the Source Qualifier
transformations in a mapping to specify the required loading order.
Target Load Plan
Target Load Plan Ordering

Q18. Suppose we have a Source Qualifier transformation that populates two target
tables. How do you ensure TGT2
is loaded after TGT1?

Ans. In the Workflow Manager, we can Configure Constraint based load ordering for a
session. The Integration
Service orders the target load on a row-by-row basis. For every row generated by an
active source, the Integration
Service loads the corresponding transformed row first to the primary key table,
then to the foreign key table.
Constraint based loading
Hence if we have one Source Qualifier transformation that provides data for
multiple target tables having primary
and foreign key relationships, we will go for Constraint based load ordering.

Revisiting Filter Transformation

Q19. What is a Filter Transformation and why it is an Active one?

Ans. A Filter transformation is an Active and Connected transformation that can filter rows in a
mapping.

Only the rows that meet the Filter Condition pass through the Filter transformation to the next
transformation in the pipeline. TRUE and FALSE are the implicit return values from any filter
condition we set. If the filter condition evaluates to NULL, the row is assumed to be FALSE.

The numeric equivalent of FALSE is zero (0) and any non-zero value is the equivalent of TRUE.

As an ACTIVE transformation, the Filter transformation may change the number of rows passed
through it. A filter condition returns TRUE or FALSE for each row that passes through the
transformation, depending on whether a row meets the specified condition. Only rows that return
TRUE pass through this transformation. Discarded rows do not appear in the session log or reject
files.

Q20. What is the difference between the Source Qualifier transformation's Source Filter and the
Filter transformation?

Ans.

. The Source Qualifier transformation filters rows when they are read from the source, whereas the
Filter transformation filters rows from within a mapping.
. The Source Qualifier transformation can only filter rows from relational sources, whereas the
Filter transformation filters rows coming from any type of source system at the mapping level.
. The Source Qualifier limits the row set extracted from the source, whereas the Filter
transformation limits the row set sent to a target.
. The Source Qualifier reduces the number of rows used throughout the mapping and hence provides
better performance. To maximize session performance with the Filter transformation, include it as
close to the sources in the mapping as possible, so as to filter out unwanted data early in the
flow of data from sources to targets.
. The filter condition in the Source Qualifier transformation only uses standard SQL, as it runs
in the database, whereas the Filter transformation can define a condition using any statement or
transformation function that returns either a TRUE or FALSE value.

Revisiting Joiner Transformation
Q21. What is a Joiner Transformation and why it is an Active one?

Ans. A Joiner is an Active and Connected transformation used to join source data
from the same source system or
from two related heterogeneous sources residing in different locations or file
systems.

The Joiner transformation joins sources with at least one matching column. The
Joiner transformation uses a
condition that matches one or more pairs of columns between the two sources.

The two input pipelines include a master pipeline and a detail pipeline or a master
and a detail branch. The master
pipeline ends at the Joiner transformation, while the detail pipeline continues to
the target.

In the Joiner transformation, we must configure the transformation properties, namely the Join
Condition, Join Type and Sorted Input option, to improve Integration Service performance.

The join condition contains ports from both input sources that must match for the
Integration Service to join two
rows. Depending on the type of join selected, the Integration Service either adds
the row to the result set or discards
the row.

The Joiner transformation produces result sets based on the join type, condition,
and input data sources. Hence it is
an Active transformation.

Q22. State the limitations where we cannot use Joiner in the mapping pipeline.

Ans. The Joiner transformation accepts input from most transformations. However,
following are the limitations:

. Joiner transformation cannot be used when either of the input pipelines contains an Update
Strategy transformation.
. Joiner transformation cannot be used if we connect a Sequence Generator transformation directly
before the Joiner transformation.

Q23. Out of the two input pipelines of a joiner, which one will you set as the
master pipeline?

Ans. During a session run, the Integration Service compares each row of the master
source against the detail source.
The master and detail sources need to be configured for optimal performance.
To improve performance for an Unsorted Joiner transformation, use the source with
fewer rows as the master
source. The fewer unique rows in the master, the fewer iterations of the join
comparison occur, which speeds the join
process.

When the Integration Service processes an unsorted Joiner transformation, it reads all master rows
before it reads the detail rows. The Integration Service blocks the detail source while it caches
rows from the master source. Once the Integration Service reads and caches all master rows, it
unblocks the detail source and reads the detail rows.

To improve performance for a Sorted Joiner transformation, use the source with
fewer duplicate key values as the
master source.

When the Integration Service processes a sorted Joiner transformation, it blocks data based on the
mapping configuration and it stores fewer rows in the cache, increasing performance.

Blocking logic is possible if master and detail input to the Joiner transformation
originate from different sources.
Otherwise, it does not use blocking logic. Instead, it stores more rows in the
cache.

Q24. What are the different types of Joins available in Joiner Transformation?

Ans. In SQL, a join is a relational operator that combines data from multiple
tables into a single result set. The Joiner
transformation is similar to an SQL join except that data can originate from
different types of sources.

The Joiner transformation supports the following types of joins :

. Normal
. Master Outer
. Detail Outer
. Full Outer
Join Type property of Joiner Transformation class="caption"

Note: A normal or master outer join performs faster than a full outer or detail
outer join.

Q25. Define the various Join Types of Joiner Transformation.

Ans.

. In a normal join , the Integration Service discards all rows of data from the
master and detail source that do
not match, based on the join condition.
. A master outer join keeps all rows of data from the detail source and the
matching rows from the master
source. It discards the unmatched rows from the master source.
. A detail outer join keeps all rows of data from the master source and the
matching rows from the detail
source. It discards the unmatched rows from the detail source.
. A full outer join keeps all rows of data from both the master and detail sources.

Q26. Describe the impact of number of join conditions and join order in a Joiner
Transformation.

Ans. We can define one or more conditions based on equality between the specified
master and detail sources. Both
ports in a condition must have the same datatype.

If we need to use two ports in the join condition with non-matching datatypes we
must convert the datatypes so that
they match. The Designer validates datatypes in a join condition.
Additional ports in the join condition increase the time necessary to join the two sources.

The order of the ports in the join condition can impact the performance of the
Joiner transformation. If we use
multiple ports in the join condition, the Integration Service compares the ports in
the order we specified.

NOTE: Only equality operator is available in joiner join condition.

Q27. How does the Joiner transformation treat NULL value matching?

Ans. The Joiner transformation does not match null values.

For example, if both EMP_ID1 and EMP_ID2 contain a row with a null value, the
Integration Service does not
consider them a match and does not join the two rows.

To join rows with null values, replace null input with default values in the Ports
tab of the joiner, and then join on
the default values.

Note: If a result set includes fields that do not contain data in either of the
sources, the Joiner transformation
populates the empty fields with null values. If we know that a field will return a
NULL and we do not want to insert
NULLs in the target, set a default value on the Ports tab for the corresponding
port.

Q28. Suppose we configure Sorter transformations in the master and detail pipelines
with the following sorted ports
in order: ITEM_NO, ITEM_NAME, PRICE.

When we configure the join condition, what are the guidelines we need to follow to
maintain the sort order?

Ans. If we have sorted both the master and detail pipelines in order of the ports
say ITEM_NO, ITEM_NAME and
PRICE we must ensure that:

. Use ITEM_NO in the first join condition.
. If we add a second join condition, we must use ITEM_NAME.
. If we want to use PRICE as a join condition apart from ITEM_NO, we must also use ITEM_NAME in
the second join condition.
. If we skip ITEM_NAME and join on ITEM_NO and PRICE, we will lose the input sort order and the
Integration Service fails the session.
Mapping using Joiner
Q29. What are the transformations that cannot be placed between the sort origin and the Joiner
transformation so that we do not lose the input sort order?

Ans. The best option is to place the Joiner transformation directly after the sort
origin to maintain sorted data.
However do not place any of the following transformations between the sort origin
and the Joiner transformation:

. Custom
. Unsorted Aggregator
. Normalizer
. Rank
. Union transformation
. XML Parser transformation
. XML Generator transformation
. Mapplet [if it contains any one of the above mentioned transformations]

Q30. Suppose we have the EMP table as our source. In the target we want to view
those employees whose salary is
greater than or equal to the average salary for their departments. Describe your
mapping approach.

Ans. Our Mapping will look like this:

ahref="http://png.dwbiconcepts.com/images/tutorial/info_interview/
info_interview10.png"

To start with the mapping we need the following transformations:

After the Source Qualifier of the EMP table, place a Sorter Transformation. Sort based on the
DEPTNO port.
Sorter Ports Tab

Next we place a Sorted Aggregator Transformation. Here we will find out the AVERAGE SALARY for
each (GROUP BY) DEPTNO.

When we perform this aggregation, we lose the data for individual employees.

To maintain employee data, we must pass a branch of the pipeline to the Aggregator
Transformation and pass a
branch with the same sorted source data to the Joiner transformation to maintain
the original data.

When we join both branches of the pipeline, we join the aggregated data with the
original data.
Aggregator Ports Tab
Aggregator Properties Tab

So next we need Sorted Joiner Transformation to join the sorted aggregated data
with the original data, based on
DEPTNO. Here we will be taking the aggregated pipeline as the Master and original
dataflow as Detail Pipeline.
Joiner Condition Tab
Joiner Properties Tab

After that we need a Filter Transformation to filter out the employees having
salary less than average salary for their
department.

Filter Condition: SAL>=AVG_SAL


Filter Properties Tab

Lastly we have the Target table instance.
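
A hedged SQL analogue of the whole mapping (assuming the standard EMP table with SAL and DEPTNO
columns): employees earning at least the average salary of their department.

SELECT e.*
FROM   emp e
JOIN   (SELECT deptno,
               AVG(sal) AS avg_sal        -- aggregated branch
        FROM   emp
        GROUP  BY deptno) d
  ON   e.deptno = d.deptno                -- join aggregated data back to the detail rows
WHERE  e.sal >= d.avg_sal;                -- the filter condition SAL >= AVG_SAL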

Revisiting Sequence Generator Transformation

Q31. What is a Sequence Generator Transformation?

Ans. A Sequence Generator transformation is a Passive and Connected transformation that generates
numeric values. It is used to create unique primary key values, replace missing primary keys, or
cycle through a sequential range of numbers. This transformation by default contains ONLY two
OUTPUT ports, namely CURRVAL and NEXTVAL. We cannot edit or delete these ports, nor can we add
ports to this unique transformation. We can create approximately two billion unique numeric values
with the widest range from 1 to 2147483647.

Q32. Define the Properties available in Sequence Generator transformation in brief.

Ans.

Sequence Generator Property: Description

Start Value: Start value of the generated sequence that we want the Integration Service to use if
we use the Cycle option. If we select Cycle, the Integration Service cycles back to this value
when it reaches the end value. Default is 0.

Increment By: Difference between two consecutive values from the NEXTVAL port. Default is 1.

End Value: Maximum value generated by the Sequence Generator. After reaching this value the
session will fail if the Sequence Generator is not configured to cycle. Default is 2147483647.

Current Value: Current value of the sequence. Enter the value we want the Integration Service to
use as the first value in the sequence. Default is 1.

Cycle: If selected, when the Integration Service reaches the configured end value for the
sequence, it wraps around and starts the cycle again, beginning with the configured Start Value.

Number of Cached Values: Number of sequential values the Integration Service caches at a time.
Default value for a standard Sequence Generator is 0. Default value for a reusable Sequence
Generator is 1,000.

Reset: Restarts the sequence at the current value each time a session runs. This option is
disabled for reusable Sequence Generator transformations.
Sequence Generator
Q33. Suppose we have a source table populating two target tables. We connect the NEXTVAL port of
the Sequence Generator to the surrogate keys of both the target tables.

Will the surrogate keys in both the target tables be the same? If not, how can we flow the same
sequence values into both of them?

Ans. When we connect the NEXTVAL output port of the Sequence Generator directly to
the surrogate key columns
of the target tables, the Sequence number will not be the same.

A block of sequence numbers is sent to one target table's surrogate key column. The second target
receives a block of sequence numbers from the Sequence Generator transformation only after the
first target table receives its block of sequence numbers.

Suppose we have 5 rows coming from the source; then the targets will have the sequence values as
TGT1 (1,2,3,4,5) and TGT2 (6,7,8,9,10) [taking into consideration Start Value 0, Current Value 1
and Increment By 1].

Now suppose the requirement is like that we need to have the same surrogate keys in
both the targets.

Then the easiest way to handle the situation is to put an Expression Transformation
in between the Sequence
Generator and the Target tables. The SeqGen will pass unique values to the
expression transformation, and then the
rows are routed from the expression transformation to the targets.
Q34. Suppose we have 100 records coming from the source. Now for a target column
population we used a Sequence
generator.

Suppose the Current Value is 0 and End Value of Sequence generator is set to 80.
What will happen?

Ans. End Value is the maximum value the Sequence Generator will generate. After it
reaches the End value the
session fails with the following error message:

TT_11009 Sequence Generator Transformation: Overflow error.

Failure of the session can be avoided if the Sequence Generator is configured to Cycle through the
sequence, i.e. whenever the Integration Service reaches the configured end value for the sequence,
it wraps around and starts the cycle again, beginning with the configured Start Value.

Q35. What are the changes we observe when we promote a non-reusable Sequence Generator to a
reusable one? And what happens if we set the Number of Cached Values to 0 for a reusable
transformation?

Ans. When we convert a non-reusable Sequence Generator to a reusable one, we observe that the
Number of Cached Values is set to 1000 by default and the Reset property is disabled.

When we try to set the Number of Cached Values property of a Reusable Sequence
Generator to 0 in the
Transformation Developer we encounter the following error message:

The number of cached values must be greater than zero for reusable sequence
transformation.

Revisiting Aggregator Transformation

Q36. What is an Aggregator Transformation?

Ans. An Aggregator is an Active, Connected transformation which performs aggregate calculations
like AVG, COUNT, FIRST, LAST, MAX, MEDIAN, MIN, PERCENTILE, STDDEV, SUM and VARIANCE.

Q37. How an Expression Transformation differs from Aggregator Transformation?

Ans. An Expression Transformation performs calculations on a row-by-row basis. An Aggregator
Transformation performs calculations on groups.
Q38. Does an Informatica Aggregator Transformation support only aggregate expressions?

Ans. Apart from aggregate expressions Informatica Aggregator also supports non-
aggregate expressions and
conditional clauses.

Q39. How does Aggregator Transformation handle NULL values?

Ans. By default, the Aggregator transformation treats null values as NULL in aggregate functions.
But we can specify whether to treat null values in aggregate functions as NULL or as zero.

Q40. What is Incremental Aggregation?

Ans. We can enable the session option, Incremental Aggregation for a session that
includes an Aggregator
Transformation. When the Integration Service performs incremental aggregation, it
actually passes changed source
data through the mapping and uses the historical cache data to perform aggregate
calculations incrementally.

For reference check Implementing Informatica Incremental Aggregation

Q41. What are the performance considerations when working with Aggregator
Transformation?

Ans.

. Filter the unnecessary data before aggregating it. Place a Filter transformation
in the mapping before the
Aggregator transformation to reduce unnecessary aggregation.
. Improve performance by connecting only the necessary input/output ports to
subsequent transformations,
thereby reducing the size of the data cache.
. Use Sorted input which reduces the amount of data cached and improves session
performance.

Q42. What differs when we choose Sorted Input for Aggregator Transformation?

Ans. The Integration Service creates index and data cache files in memory to process the
Aggregator transformation. If the Integration Service requires more space than allocated for the
index and data cache sizes in the transformation properties, it stores overflow values in cache
files, i.e. paging to disk. One way to increase session performance is to increase the index and
data cache sizes in the transformation properties. But when we check Sorted Input, the Integration
Service uses memory to process the Aggregator transformation; it does not use cache files.

Q43. Under what conditions selecting Sorted Input in aggregator will still not
boost session performance?
Ans.

. The Incremental Aggregation session option is enabled.
. The aggregate expression contains nested aggregate functions.
. Source data is data driven.

Q44. Under what condition selecting Sorted Input in aggregator may fail the
session?

Ans.

. If the input data is not sorted correctly, the session will fail.
. Even if the input data is properly sorted, the session may fail if the sort order ports and the
group by ports of the aggregator are not in the same order.

Q45. Suppose we do not group by on any ports of the aggregator; what will be the output?

Ans. If we do not group values, the Integration Service will return only the last
row for the input rows.

Q46. What is the expected value if the column in an aggregator transform is neither
a group by nor an aggregate
expression?

Ans. Integration Service produces one row for each group based on the group by
ports. The columns which are
neither part of the key nor aggregate expression will return the corresponding
value of last record of the group
received. However, if we specify particularly the FIRST function, the Integration
Service then returns the value of the
specified first row of the group. So default is the LAST function.

Q47. Give one example for each of Conditional Aggregation, Non-Aggregate expression
and Nested Aggregation.

Ans.

Use conditional clauses in the aggregate expression to reduce the number of rows
used in the aggregation. The
conditional clause can be any clause that evaluates to TRUE or FALSE.

SUM( SALARY, JOB = 'CLERK' )

Use non-aggregate expressions in group by ports to modify or replace groups.

IIF( PRODUCT = 'Brown Bread', 'Bread', PRODUCT )

The expression can also include one aggregate function within another aggregate function, such as:

MAX( COUNT( PRODUCT ))
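
A hedged SQL analogue of the conditional-aggregation example (assuming an EMP table with JOB and
SALARY columns): the condition becomes a CASE expression inside the aggregate.

SELECT deptno,
       SUM(CASE WHEN job = 'CLERK' THEN salary END) AS clerk_salary_total  -- only clerk rows are summed
FROM   emp
GROUP  BY deptno;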

Revisiting Rank Transformation

Q48. What is a Rank Transform?

Ans. Rank is an Active, Connected Informatica transformation used to select a set of top or bottom
values of data.

Q49. How does a Rank Transform differ from Aggregator Transform functions MAX and
MIN?

Ans. Like the Aggregator transformation, the Rank transformation lets us group
information. The Rank Transform
allows us to select a group of top or bottom values, not just one value as in case
of Aggregator MAX, MIN functions.

Q50. What is a RANK port and RANKINDEX?

Ans. The Rank port is an input/output port used to specify the column on which we want to rank the
source values. By default Informatica creates an output port RANKINDEX for each Rank
transformation. It stores the ranking position for each row in a group.

Q51. How can you get ranks based on different groups?

Ans. Rank transformation lets us group information. We can configure one of its
input/output ports as a group by
port. For each unique value in the group port, the transformation creates a group
of rows falling within the rank
definition (top or bottom, and a particular number in each rank).

Q52. What happens if two rank values match?

Ans. If two rank values match, they receive the same value in the rank index and
the transformation skips the next
value.

Q53. What are the restrictions of Rank Transformation?

Ans.

. We can connect ports from only one transformation to the Rank transformation.
. We can select the top or bottom rank.
. We need to select the Number of records in each rank.
. We can designate only one Rank port in a Rank transformation.

Q54. How does a Rank Cache work?

Ans. During a session, the Integration Service compares an input row with rows in the data cache.
If the input row out-ranks a cached row, the Integration Service replaces the cached row with the
input row. If we configure the Rank transformation to rank based on different groups, the
Integration Service ranks incrementally for each group it finds. The Integration Service creates
an index cache to store the group information and a data cache for the row data.

Q55. How does Rank transformation handle string values?

Ans. Rank transformation can return the strings at the top or the bottom of a
session sort order. When the Integration
Service runs in Unicode mode, it sorts character data in the session using the
selected sort order associated with the
Code Page of IS which may be French, German, etc. When the Integration Service runs
in ASCII mode, it ignores this
setting and uses a binary sort order to sort character data.

Revisiting Sorter Transformation

Q56. What is a Sorter Transformation?

Ans. Sorter Transformation is an Active, Connected Informatica transformation used to sort data in ascending or
descending order according to specified sort keys. The Sorter transformation contains only input/output ports.

Q57. Why is Sorter an Active Transformation?

Ans. When the Sorter transformation is configured to treat output rows as distinct,
it assigns all ports as part of the
sort key. The Integration Service discards duplicate rows compared during the sort
operation. The number of Input
Rows will vary as compared with the Output rows and hence it is an Active
transformation.

Q58. How does Sorter handle Case Sensitive sorting?

Ans. The Case Sensitive property determines whether the Integration Service
considers case when sorting data.
When we enable the Case Sensitive property, the Integration Service sorts uppercase
characters higher than
lowercase characters.

Q59. How does Sorter handle NULL values?


Ans. We can configure the way the Sorter transformation treats null values. Enable
the property Null Treated Low if
we want to treat null values as lower than any other value when it performs the
sort operation. Disable this option if
we want the Integration Service to treat null values as higher than any other
value.

Q60. How does the Sorter Cache work?

Ans. The Integration Service passes all incoming data into the Sorter Cache before
Sorter transformation performs the
sort operation.

The Integration Service uses the Sorter Cache Size property to determine the
maximum amount of memory it can
allocate to perform the sort operation. If it cannot allocate enough memory, the
Integration Service fails the session.
For best performance, configure Sorter cache size with a value less than or equal
to the amount of available physical
RAM on the Integration Service machine.

If the amount of incoming data is greater than the amount of Sorter cache size, the
Integration Service temporarily
stores data in the Sorter transformation work directory. The Integration Service
requires disk space of at least twice
the amount of incoming data when storing data in the work directory.

Revisiting Union Transformation

Q61. What is a Union Transformation?

Ans. The Union transformation is an Active, Connected, non-blocking multiple input group transformation used to
merge data from multiple pipelines or sources into one pipeline branch. Similar to the UNION ALL SQL statement,
the Union transformation does not remove duplicate rows.

Q62. What are the restrictions of Union Transformation?

Ans.

. All input groups and the output group must have matching ports. The precision,
datatype, and scale must be
identical across all groups.
. We can create multiple input groups, but only one default output group.
. The Union transformation does not remove duplicate rows.
. We cannot use a Sequence Generator or Update Strategy transformation upstream
from a Union
transformation.
. The Union transformation does not generate transactions.
General questions

Q63. What is the difference between Static and Dynamic Lookup Cache?

Ans. We can configure a Lookup transformation to cache the corresponding lookup table. In case of a static or
read-only lookup cache, the Integration Service caches the lookup table at the beginning of the session and does not
update the lookup cache while it processes the Lookup transformation.

In case of a dynamic lookup cache, the Integration Service dynamically inserts or updates data in the lookup cache
and passes the data to the target. The dynamic cache is synchronized with the target.

Q64. What is Persistent Lookup Cache?

Ans. Lookups are cached by default in Informatica. Lookup cache can be either non-
persistent or persistent. The
Integration Service saves or deletes lookup cache files after a successful session
run based on whether the Lookup
cache is checked as persistent or not.

Q65. What is the difference between Reusable transformation and Mapplet?

Ans. Any Informatica transformation created in the Transformation Developer, or a non-reusable transformation
promoted to reusable from the Mapping Designer, which can be used in multiple mappings is known as a Reusable
Transformation. When we add a reusable transformation to a mapping, we actually add an instance of the
transformation. Since the instance of a reusable transformation is a pointer to that transformation, when we change
the transformation in the Transformation Developer, its instances reflect these changes.

A Mapplet is a reusable object created in the Mapplet Designer which contains a set
of transformations and lets us
reuse the transformation logic in multiple mappings. A Mapplet can contain as many
transformations as we need.
Like a reusable transformation when we use a mapplet in a mapping, we use an
instance of the mapplet and any
change made to the mapplet is inherited by all instances of the mapplet.

Q66. What are the transformations that are not supported in Mapplet?

Ans. Normalizer, Cobol sources, XML sources, XML Source Qualifier transformations,
Target definitions, Pre- and
post- session Stored Procedures, Other Mapplets.

Q67. What are the ERROR tables present in Informatica?


Ans.

. PMERR_DATA- Stores data and metadata about a transformation row error and its
corresponding source
row.
. PMERR_MSG- Stores metadata about an error and the error message.
. PMERR_SESS- Stores metadata about the session.
. PMERR_TRANS- Stores metadata about the source and transformation ports, such as
name and datatype,
when a transformation error occurs.

Q68. What is the difference between STOP and ABORT?

Ans. When we issue the STOP command on the executing session task, the Integration
Service stops reading data
from source. It continues processing, writing and committing the data to targets.
If the Integration Service cannot
finish processing and committing data, we can issue the abort command.

In contrast, the ABORT command has a timeout period of 60 seconds. If the Integration Service cannot finish
processing and committing data within the timeout period, it kills the DTM process and terminates the session.

Q69. Can we copy a session to new folder or new repository?

Ans. Yes, we can copy a session to a new folder or repository provided the corresponding mapping is already there.

Q70. What type of join does Lookup support?

Ans. Lookup is similar to a SQL LEFT OUTER JOIN.
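As a rough SQL analogy (the table and column names below are illustrative only), a source row that finds no match in the lookup still flows through, with NULLs returned for the lookup columns - exactly like a left outer join:

SELECT s.order_id,
       s.customer_id,
       c.customer_name      -- NULL when no matching lookup row exists
FROM   src_orders s
LEFT OUTER JOIN lkp_customer c
       ON s.customer_id = c.customer_id;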


What is a fact-less-fact?

A fact table that does not contain any measure is called a fact-less fact. This
table will only contain keys from different
dimension tables. This is often used to resolve a many-to-many cardinality issue.

Explanatory Note:
Consider a school, where a single student may be taught by many teachers and a
single teacher may have many
students. To model this situation in dimensional model, one might introduce a fact-
less-fact table joining teacher and
student keys. Such a fact table will then be able to answer queries like,

1. Who are the students taught by a specific teacher?
2. Which teacher teaches the maximum number of students?
3. Which student has the highest number of teachers? etc.
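
A minimal SQL sketch of such a fact-less fact (the table and column names below are purely illustrative, not a prescribed design) and a query answering the first question might look like this:

-- fact-less fact: only dimension keys, no measures
CREATE TABLE fact_teacher_student (
    teacher_key INT NOT NULL,
    student_key INT NOT NULL
);

-- 1. students taught by a specific teacher
SELECT s.student_name
FROM   fact_teacher_student f
JOIN   dim_student s ON f.student_key = s.student_key
JOIN   dim_teacher t ON f.teacher_key = t.teacher_key
WHERE  t.teacher_name = 'Mr. Sharma';   -- illustrative value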

What is a coverage fact?

A fact-less-fact table can only answer 'optimistic' (positive) queries but cannot answer a negative query. Again
consider the illustration in the above example. A fact-less fact containing the keys of teachers and students cannot
answer queries like the ones below,

1. Which teacher did not teach any student?
2. Which student was not taught by any teacher?

Why not? Because the fact-less fact table only stores the positive scenarios (like a student being taught by a
teacher); if there is a student who is not being taught by any teacher, that student's key does not appear in this
table, thereby reducing the coverage of the table.

Coverage fact table attempts to answer this - often by adding an extra flag column.
Flag = 0 indicates a negative
condition and flag = 1 indicates a positive condition. To understand this better,
let's consider a class where there are
100 students and 5 teachers. So coverage fact table will ideally store 100 X 5 =
500 records (all combinations) and if a
certain teacher is not teaching a certain student, the corresponding flag for that
record will be 0.
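
Continuing the illustrative schema above, if the coverage fact stores all teacher-student combinations with a TAUGHT_FLAG column, the negative query becomes straightforward (again, all names are hypothetical):

-- teachers who did not teach any student
SELECT t.teacher_name
FROM   dim_teacher t
WHERE  NOT EXISTS (
         SELECT 1
         FROM   fact_teacher_student_coverage c
         WHERE  c.teacher_key = t.teacher_key
         AND    c.taught_flag = 1
       );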

What are incident and snapshot facts?

A fact table stores some kind of measurements. Usually these measurements are
stored (or captured) against a
specific time and these measurements vary with respect to time. Now it might so happen that the business is not
able to capture all of its measures for every point in time. Those unavailable measurements can then be kept
empty (Null) or can be filled up with the last available measurement. The first case is an example of an incident
fact and the second one is an example of a snapshot fact.

What is aggregation and what is the benefit of aggregation?


A data warehouse usually captures data with same degree of details as available in
source. The "degree of detail" is
termed as granularity. But all reporting requirements from that data warehouse do
not need the same degree of
details.

To understand this, let's consider an example from retail business. A certain retail chain has 500 shops across
Europe. All the shops record detail level transactions regarding the products they sell, and those data are captured
in a data warehouse.

Each shop manager can access the data warehouse and they can see which products are
sold by whom and in what
quantity on any given date. Thus the data warehouse helps the shop managers with
the detail level data that can be
used for inventory management, trend prediction etc.

Now think about the CEO of that retail chain. He does not really care which sales girl in London sold the highest
number of chopsticks or which shop is the best seller of 'brown breads'. All he is interested in, perhaps, is to check
the percentage increase of his revenue margin across Europe, or maybe the year-to-year sales growth in Eastern
Europe. Such data is aggregated in nature, because sales of goods in Eastern Europe is derived by summing up the
individual sales data from each shop in Eastern Europe.

Therefore, to support different levels of data warehouse users, data aggregation is needed.

What is slicing-dicing?

Slicing means showing a slice of the data, given a certain set of dimensions (e.g. Product), values (e.g. Brown Bread)
and measures (e.g. sales).

Dicing means viewing the slice with respect to different dimensions and in
different level of aggregations.

Slicing and dicing operations are part of pivoting.

What is drill-through?

Drill through is the process of going to the detail level data from summary data.

Consider the above example on retail shops. If the CEO finds out that sales in East
Europe has declined this year
compared to last year, he then might want to know the root cause of the decrease.
For this, he may start drilling
through his report to more detail level and eventually find out that even though individual shop sales have actually
increased, the overall sales figure has decreased because a certain shop in Turkey has stopped operating.
The detail level data, which the CEO was not much interested in earlier, has this time helped him to pin-point the
root cause of the declined sales. And the method he has followed to obtain the details from the aggregated data is
called drill-through.

This article attempts to refresh your Unix skills in the form of a question/answer based Unix tutorial on Unix
command lines. The commands discussed here are particularly useful for developers working in middle-tier
(e.g. ETL) systems, where they may need to interact with several *nix source systems for data retrieval.

How to print/display the first line of a file?


There are many ways to do this. However the easiest way to display the first line
of a file is using the [head]
command.

$> head -1 file.txt

No prize in guessing that if you specify [head -2] then it would print first 2
records of the file.

Another way can be by using [sed] command. [Sed] is a very powerful text editor
which can be used for various text
manipulation purposes like this.

$> sed '2,$ d' file.txt

How does the above command work? The 'd' parameter basically tells [sed] to delete
all the records from display
from line 2 to last line of the file (last line is represented by $ symbol). Of
course it does not actually delete those lines
from the file, it just does not display those lines in standard output screen. So
you only see the remaining line which
is the 1st line.

How to print/display the last line of a file?

The easiest way is to use the [tail] command.

$> tail -1 file.txt

If you want to do it using [sed] command, here is what you should write:

$> sed -n '$ p' file.txt

From our previous answer, we already know that '$' stands for the last line of the
file. So '$ p' basically prints (p for
print) the last line in standard output screen. '-n' switch takes [sed] to silent
mode so that [sed] does not print
anything else in the output.

How to display n-th line of a file?

The easiest way to do it will be by using [sed] I guess. Based on what we already
know about [sed] from our previous
examples, we can quickly deduce this command:

$> sed -n '<n> p' file.txt

You need to replace <n> with the actual line number. So if you want to print the 4th line, the command will be

$> sed -n '4 p' file.txt

Of course you can do it by using [head] and [tail] command as well like below:

$> head -<n> file.txt | tail -1

You need to replace <n> with the actual line number. So if you want to print the
4th line, the command will be

$> head -4 file.txt | tail -1

How to remove the first line / header from a file?

We already know how [sed] can be used to delete a certain line from the output - by using the 'd' switch. So if we
want to delete the first line the command should be:

$> sed '1 d' file.txt

But the issue with the above command is, it just prints out all the lines except
the first line of the file on the standard
output. It does not really change the file in-place. So if you want to delete the
first line from the file itself, you have
two options.

Either you can redirect the output of the file to some other file and then rename
it back to original file like below:

$> sed '1 d' file.txt > new_file.txt

$> mv new_file.txt file.txt

Or, you can use an inbuilt [sed] switch '-i' which changes the file in-place. See below:

$> sed -i '1 d' file.txt

How to remove the last line/ trailer from a file in Unix script?

Always remember that the [sed] address '$' refers to the last line. So using this knowledge we can deduce the below
command:

$> sed -i '$ d' file.txt

How to remove certain lines from a file in Unix?

If you want to remove line <m> to line <n> from a given file, you can accomplish
the task in the similar method
shown above. Here is an example:
$> sed -i '5,7 d' file.txt

The above command will delete line 5 to line 7 from the file file.txt

How to remove the last n-th line from a file?

This is bit tricky. Suppose your file contains 100 lines and you want to remove the
last 5 lines. Now if you know how
many lines are there in the file, then you can simply use the above shown method
and can remove all the lines from
96 to 100 like below:

$> sed -i '96,100 d' file.txt # alternative to the command [head -95 file.txt]

But you will not always know the number of lines present in the file (the file may be generated dynamically, etc.). In
that case there are many different ways to solve the problem. There are some ways
which are quite complex and
fancy. But let's first do it in a way that we can understand easily and remember
easily. Here is how it goes:

$> tt=`wc -l file.txt | cut -f1 -d' '`; sed -i "`expr $tt - 4`,$tt d" file.txt

As you can see there are two commands. The first one (before the semi-colon) calculates the total number of lines
present in the file and stores it in a variable called 'tt'. The second command (after the semi-colon) uses the variable
and works in exactly the same way as shown in the previous example.

How to check the length of any line in a file?

We already know how to print one line from a file which is this:

$> sed -n '<n> p' file.txt

Where <n> is to be replaced by the actual line number that you want to print. Now
once you know it, it is easy to
print out the length of this line by using the [wc] command with the '-c' switch.

$> sed -n '35 p' file.txt | wc -c

The above command will print the length of the 35th line in file.txt (note that [wc -c] also counts the trailing
newline character).

How to get the nth word of a line in Unix?

Assuming the words in the line are separated by spaces, we can use the [cut] command. [cut] is a very powerful and
useful command and it's really easy to use. All you have to do to get the n-th word from the line is issue the
following command:

cut -f<n> -d' '


The '-d' switch tells [cut] what the delimiter (or separator) is, which is space ' ' in this case. If the separator
was a comma, we could have written -d',' instead. So, suppose I want to find the 4th word from the string "A quick
brown fox jumped over the lazy cat"; we will do something like this:

$> echo "A quick brown fox jumped over the lazy cat" | cut -f4 -d' '

And it will print "fox"

How to reverse a string in unix?

Pretty easy. Use the [rev] command.

$> echo "unix" | rev

xinu

How to get the last word from a line in Unix file?

We will make use of two commands that we learnt above to solve this. The commands
are [rev] and [cut]. Here we
go.

Let's imagine the line is: "C for Cat" and we need "Cat". First we reverse the line; we get "taC rof C". Then we cut
the first word; we get "taC". And then we reverse it again.

$>echo "C for Cat" | rev | cut -f1 -d' ' | rev

Cat

How to get the n-th field from a Unix command output?

We know we can do it with [cut]. For example, the below command extracts the first field from the output of the
[wc -c] command:

$>wc -c file.txt | cut -d' ' -f1

109

But I want to introduce one more command to do this here, that is, the [awk] command. [awk] is a very powerful
command for text pattern scanning and processing. Here we will see how we may use [awk] to extract the first field
(or first column) from the output of another command. Like above, suppose I want to print the first column of the
[wc -c] output. Here is how it goes:

$>wc -c file.txt | awk ' ''{print $1}'

109
The basic syntax of [awk] is like this:

awk 'pattern space''{action space}'

The pattern space can be left blank or omitted, like below:

$>wc -c file.txt | awk '{print $1}'

109

In the action space, we have asked [awk] to take the action of printing the first
column ($1). More on [awk] later.

How to replace the n-th line in a file with a new line in Unix?

This can be done in two steps. The first step is to remove the n-th line. And the
second step is to insert a new line in
n-th line position. Here we go.

Step 1: remove the n-th line

$>sed -i'' '10 d' file.txt # d stands for delete

Step 2: insert a new line at n-th line position

$>sed -i'' '10 i This is the new line' file.txt # i stands for insert

How to show the non-printable characters in a file?

Open the file in VI editor. Go to VI command mode by pressing [Escape] and then
[:]. Then type [set list]. This will
show you all the non-printable characters, e.g. Ctrl-M characters (^M) etc., in the
file.

How to zip a file in Linux?

Use the inbuilt [zip] command in Linux.
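
For example, to compress file.txt into an archive named file.zip (the file names are just placeholders):

$> zip file.zip file.txt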

How to unzip a file in Linux?

Use inbuilt [unzip] command in Linux.

$> unzip -j file.zip # the -j switch junks (ignores) any directory paths stored inside the archive

How to test if a zip file is corrupted in Linux?

Use the '-t' switch with the inbuilt [unzip] command:

$> unzip -t file.zip

How to check if a file is zipped in Unix?

In order to know the file type of a particular file use the [file] command like
below:

$> file file.txt

file.txt: ASCII text

If you want to know the technical MIME type of the file, use the '-i' switch.

$>file -i file.txt

file.txt: text/plain; charset=us-ascii

If the file is zipped, following will be the result

$> file -i file.zip

file.zip: application/x-zip

How to connect to Oracle database from within shell script?

You will be using the same [sqlplus] command to connect to the database that you use normally even outside the
shell script. To understand this, let's take an example. In this example, we will connect to the database, fire a query
and get the output printed from the unix shell. OK? Here we go:

$>res=`sqlplus -s username/password@database_name <<EOF

SET HEAD OFF;

select count(*) from dual;

EXIT;

EOF`

$> echo $res

If you connect to the database in this manner, the advantage is that you will be able to pass Unix-side shell
variable values to the database. See the below example:

$>res=`sqlplus -s username/password@database_name <<EOF

SET HEAD OFF;

select count(*) from student_table t where t.last_name='$1';


EXIT;

EOF`

$> echo $res

12

How to execute a database stored procedure from Shell script?

$> SqlReturnMsg=`sqlplus -s username/password@database<<EOF

BEGIN

Proc_Your_Procedure('your-input-parameters');

END;

EXIT;

EOF`

$> echo $SqlReturnMsg

How to check the command line arguments in a UNIX command in Shell Script?

In a bash shell, you can access the command line arguments using the $0, $1, $2, ... variables, where $0 prints the
command name, $1 prints the first input parameter of the command, $2 the second input parameter of the command
and so on.
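
As a quick illustration, a throwaway script (the name show_args.sh is arbitrary) could be:

$> cat show_args.sh
#!/bin/bash
echo "Command name    : $0"
echo "First argument  : $1"
echo "Second argument : $2"
echo "Argument count  : $#"

$> sh show_args.sh apple banana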

How to fail a shell script programmatically?

Just put an [exit] command in the shell script with a return value other than 0. This is because the exit code of a
successful Unix program is zero. So, suppose you write

exit -1

inside your program, then your program will throw an error status and exit immediately.

How to list down file/folder lists alphabetically?

Normally the [ls -lt] command lists files/folders sorted by modified time. If you want to list them alphabetically,
then you should simply specify: [ls -l]

How to check if the last command was successful in Unix?


To check the status of last executed command in UNIX, you can check the value of an
inbuilt bash variable [$?]. See
the below example:

$> echo $?

How to check if a file is present in a particular directory in Unix?

We can do it in many ways. Based on what we have learnt so far, we can make use of [ls] and [$?] to do this. See
below:

$> ls -l file.txt; echo $?

If the file exists, the [ls] command will be successful, hence [echo $?] will print 0. If the file does not exist, the [ls]
command will fail and hence [echo $?] will print a non-zero value.
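
Alternatively, you can use the shell's [test] built-in (the [ -f ... ] syntax) instead of inspecting the exit status of [ls]; for example:

$> if [ -f file.txt ]; then echo "present"; else echo "absent"; fi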

How to check all the running processes in Unix?

The standard command to see this is [ps]. But [ps] only shows you the snapshot of
the processes at that instance. If
you need to monitor the processes for a certain period of time and need to refresh
the results in each interval,
consider using the [top] command.

$> ps -ef

If you wish to see the % of memory usage and CPU usage, then consider the below
switches

$> ps aux

If you wish to use this command inside some shell script, or if you want to customize the output of the [ps]
command, you may use the '-o' switch like below. By using the '-o' switch, you can specify the columns that you
want [ps] to print out.

$>ps -e -o stime,user,pid,args,%mem,%cpu

How to tell if my process is running in Unix?

You can list down all the running processes using the [ps] command. Then you can 'grep' your user name or process
name to see if the process is running. See below:

$>ps -e -o stime,user,pid,args,%mem,%cpu | grep "opera"

14:53 opera 29904 sleep 60 0.0 0.0

14:54 opera 31536 ps -e -o stime,user,pid,arg 0.0 0.0

14:54 opera 31538 grep opera 0.0 0.0


How to get the CPU and Memory details in Linux server?

In Linux based systems, you can easily access the CPU and memory details from
the /proc/cpuinfo and
/proc/meminfo, like this:

$>cat /proc/meminfo

$>cat /proc/cpuinfo

Just try the above commands in your system to see how it works

What is a database? A question for both pro and newbie

Published on Wednesday, 28 April 2010 16:46

Written by Akash Mitra

Remember Codd's Rules? Or the ACID properties of a database? Maybe you still hold these basic properties close to
your heart, or maybe you no longer remember them. Let's revisit these ideas once again.

A database is a collection of data for one or more uses. Databases are usually integrated and offer both data
storage and retrieval.

Codd's Rule

Codd's 12 rules are a set of thirteen rules (numbered zero to twelve) proposed by
Edgar F. Codd, a pioneer of the
relational model for databases.

Rule 0: The system must qualify as relational, as a database, and as a management system.
For a system to qualify as a relational database management system (RDBMS), that system must use its relational
facilities (exclusively) to manage the database.

Rule 1: The information rule:

All information in the database is to be represented in one and only one way,
namely by values in column positions
within rows of tables.

Rule 2: The guaranteed access rule:

All data must be accessible. This rule is essentially a restatement of the fundamental requirement for primary keys.
It says that every individual scalar value in the database must be logically addressable by specifying the name of
the containing table, the name of the containing column and the primary key value of the containing row.

Rule 3: Systematic treatment of null values:

The DBMS must allow each field to remain null (or empty). Specifically, it must
support a representation of "missing
information and inapplicable information" that is systematic, distinct from all
regular values (for example, "distinct
from zero or any other number", in the case of numeric values), and independent of
data type. It is also implied that
such representations must be manipulated by the DBMS in a systematic way.

Rule 4: Active online catalog based on the relational model:

The system must support an online, inline, relational catalog that is accessible to
authorized users by means of their
regular query language. That is, users must be able to access the database's
structure (catalog) using the same query
language that they use to access the database's data.

Rule 5: The comprehensive data sublanguage rule:

The system must support at least one relational language that

. Has a linear syntax
. Can be used both interactively and within application programs,
. Supports data definition operations (including view definitions), data manipulation operations (update as
well as retrieval), security and integrity constraints, and transaction management operations (begin, commit,
and rollback).

Rule 6: The view updating rule:

All views that are theoretically updatable must be updatable by the system.

Rule 7: High-level insert, update, and delete:

The system must support set-at-a-time insert, update, and delete operators. This
means that data can be retrieved
from a relational database in sets constructed of data from multiple rows and/or
multiple tables. This rule states that
insert, update, and delete operations should be supported for any retrievable set
rather than just for a single row in a
single table.

Rule 8: Physical data independence:

Changes to the physical level (how the data is stored, whether in arrays or linked
lists etc.) must not require a change
to an application based on the structure.

Rule 9: Logical data independence:

Changes to the logical level (tables, columns, rows, and so on) must not require a
change to an application based on
the structure. Logical data independence is more difficult to achieve than physical
data independence.

Rule 10: Integrity independence:

Integrity constraints must be specified separately from application programs and stored in the catalog. It must be
possible to change such constraints as and when appropriate without unnecessarily affecting existing applications.

Rule 11: Distribution independence:

The distribution of portions of the database to various locations should be invisible to users of the database.
Existing applications should continue to operate successfully:

. when a distributed version of the DBMS is first introduced; and
. when existing distributed data are redistributed around the system.

Rule 12: The nonsubversion rule:


If the system provides a low-level (record-at-a-time) interface, then that
interface cannot be used to subvert the
system, for example, bypassing a relational security or integrity constraint.

Database ACID Property

ACID (atomicity, consistency, isolation, durability) is a set of properties that guarantee that database transactions
are processed reliably.

Atomicity: Atomicity requires that database modifications must follow an all-or-nothing rule. A transaction is said
to be atomic if, when one part of the transaction fails, the entire transaction fails and the database state is left
unchanged.

Consistency: The consistency property ensures that the database remains in a consistent state; more precisely, it says
that any transaction will take the database from one consistent state to another
consistent state. The consistency rule
applies only to integrity rules that are within its scope. Thus, if a DBMS allows
fields of a record to act as references
to another record, then consistency implies the DBMS must enforce referential
integrity: by the time any transaction
ends, each and every reference in the database must be valid.

Isolation: Isolation refers to the requirement that other operations cannot access
or see data that has been modified
during a transaction that has not yet completed. Each transaction must remain
unaware of other concurrently
executing transactions, except that one transaction may be forced to wait for the
completion of another transaction
that has modified data that the waiting transaction requires.

Durability: Durability is the DBMS's guarantee that once the user has been notified of a transaction's success, the
transaction will not be lost. The transaction's data changes will survive system failure, and all integrity constraints
will have been satisfied, so the DBMS won't need to reverse the transaction. Many DBMSs implement durability by
writing transactions into a transaction log that can be reprocessed to recreate the system state right before any
later failure.

Why people Hate Project Managers - A must read for would-be managers
"Project Managers" are inevitable. Love them or hate them, but if you are in a
project, you have to accept them. They
are Omnipresent in any project. They intervene too much on technical things without
much knowledge. They create
unrealistic targets and nonsensical methods of achieving them. And they invariably
fail to acknowledge the
individual hard work. Are they of any use?

In a recent online survey by amplicate.com, 51% of the participants expressed hate for project managers and project
management. Look around yourself in your office; the scenario is probably the same. So what are the reasons that
make people hate their project managers? DWBIConcepts delved deeper into this question and found out the top 5
reasons why project managers are hated.

Remember, all project managers are not hated! So, the following reasons of course don't apply to them.

1. Project managers are lazy

Generally project managers are not answerable to their subordinates. They are self-paced and semi-autocratic.
These allowances provide them the opportunity to spend time lazily. Many project managers spend more time
surfing the internet than evaluating the performance of their subordinates.

The cure for their laziness is pro-activeness which can help them spend quality
time in office.

2. Project Managers snatch other people's credit

I know of a project manager 'Harry' (name changed), who used to receive work from the client and assign it to his
subordinate 'John', and once John finished the work and sent Harry an email, Harry used to copy the contents of
John's mail and reply back to the client. Since Harry never 'forwarded' John's mail directly to the client, the client
was always oblivious to the actual person (John) doing their work. The client always used to send appreciation
mails to Harry only, and John was never credited for the work he did.

The advice for the would-be project managers here is to remain conscious about the
individual contributions and
give them their due credit whenever possible.

3. Project managers are reluctant to listen to new ideas

There is no one-size-fits-all solution when it comes to project management. Just because a specific idea worked in
your earlier project doesn't mean that it will work in your next project also. Everybody is good at something or
other. Everybody has some idea. Not all of them are good. But some of them are. So be flexible and open to new
ideas. Listen carefully to what others have to say, and if you have to discard their ideas, give proper reasons.

4. Project Managers fail to do realistic planning

Proper planning makes things easy. What do you think is the main difference between a NASA space project and a
service industry IT project? The project members in that NASA project are the same kind of engineers that you have
in your project. Maybe many of them graduated from the same school. The same set of people who made one
project a marvellous success fail miserably in some other project. There is nothing wrong with those people. But
there is something wrong with the leader leading that set of people. A NASA project succeeds because of
meticulous and realistic planning, whereas the other project slogs.

Create a detailed plan and follow it closely.

5. Project Managers don't know the technology well

Don't let new tools and technologies outsmart you. The technology space is ever changing. Try to keep pace with it.

Install the software and tools that are being used in your project on your laptop. Play with them. Know what their
features are and what their limitations are. Read blogs on them. Start your own blog and write something
interesting in it on a regular basis. Be savvy. Otherwise you will be fooled by your own people.

A road-map on Testing in Data Warehouse

Testing in data warehouse projects is, till date, a less explored area. However, if not done properly, this can be a
major reason for data warehousing project failures - especially in the user acceptance phase. Given here is a
mind-map that will help a project manager to think through all the aspects of testing in data warehousing.

(Mind-map: DWBI Testing)

Points to consider for DWBI Testing

1. Why is it important?
. To bug-free the code
. To ensure data quality
. To increase credibility of BI Reports
. More BI projects fail after commissioning due to quality issues
2. What constitutes DWBI Testing?
. Performance Testing
. Functional Testing
. Canned Report Testing
. Ad-hoc testing
. Load Reconciliation

3. What can be done to ease it?


. Plan for testing
. Start building DWBI Test competency
. Design code that generates debug information
. Build reconciliation mechanism

4. Why is it difficult?
. Limited Testing Tool
. Automated Testing not always possible
. Data traceability not always available
. Requires extensive functional knowledge
. Metadata management tool often fails
. Deals with bulk data - has performance impact
. Number of data conditions are huge

Use the above mind-map to plan and prepare the testing activity for your data
warehousing project.

Enterprise Data Warehouse Data Reconciliation Methodology

An enterprise data warehouse often fetches records from several disparate systems and stores them centrally in an
enterprise-wide warehouse. But what is the guarantee that the quality of data will not degrade in the process of
centralization?

Data Reconciliation

Many data warehouses are built on an n-tier architecture with multiple data extraction and data insertion jobs
between two consecutive tiers. As it happens, the nature of the data changes as it passes from one tier to the next.
Data reconciliation is the method of reconciling, or tying up, the data between any two consecutive tiers (layers).

Why Reconciliation is required?

In the process of extracting data from one source and then transforming the data
and loading it to the next layer, the
whole nature of the data can change considerably. It might also happen that some
information is lost while
transforming the data. A reconciliation process helps to identify such loss of
information.

One of the major reasons for information loss is loading failures or errors during loading. Such errors can occur due
to several reasons, e.g.

. Inconsistent or non-coherent data from source
. Non-integrating data among different sources
. Unclean / non-profiled data
. Un-handled exceptions
. Constraint violations
. Logical issues / inherent flaws in the program
. Technical failures like loss of connectivity, loss over network, space issues etc.

Failure due to any such issue can result in potential information loss, leading to unreliable data quality for business
process decision making.

Furthermore, if such issues are not rectified at the earliest, it becomes even more costly to 'patch' them later.
Therefore it is highly suggested that a proper data reconciliation process be in place in any data Extraction-
Transformation-Load (ETL) process.

Scope of Data Reconciliation

Data reconciliation is often confused with the process of data quality testing. Even worse, sometimes the data
reconciliation process is used to investigate and pin-point the data issues.

While data reconciliation may be a part of data quality assurance, these two things are not necessarily the same.

The scope of data reconciliation should be limited to identifying whether or not there is any issue in the data. The
scope should not be extended to automating the process of data investigation and pin-pointing the issues.

A successful reconciliation process should only indicate whether or not the data is correct. It will not indicate why
the data is not correct. The reconciliation process answers the 'what' part of the question, not the 'why' part.

Methods of Data Reconciliation

Master Data Reconciliation

Master data reconciliation is the method of reconciling only the master data
between source and target. Master data
are generally unchanging or slowly changing in nature and no aggregation operation
is done on the dataset. That is, the granularity of the data remains the same in both source and target. That is why
master data reconciliation is often relatively easy and quick to implement.

In a business process, 'customer', 'products', 'employee' etc. are some good examples of master data. Ensuring that
the total number of customers in the source systems matches exactly with the total number of customers in the
target system is an example of customer master data reconciliation.

Some of the common examples of master data reconciliation can be the following measures (a minimal SQL sketch
follows the list),

1. Total count of rows, example
. Total Customers in source and target
. Total number of Products in source and target etc.

2. Total count of rows based on a condition, example
. Total number of active customers
. Total number of inactive customers etc.
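
As a minimal sketch (assuming hypothetical table names src_customer and dw_dim_customer), a master data count reconciliation can be as simple as comparing the two counts side by side:

-- row counts should match between source and target
SELECT 'SOURCE' AS layer, COUNT(*) AS customer_count FROM src_customer
UNION ALL
SELECT 'TARGET' AS layer, COUNT(*) AS customer_count FROM dw_dim_customer;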

Transactional Data Reconciliation

Sales quantity, revenue, tax amount, service usage etc. are examples of transactional data. Transactional data form
the very base of BI reports, so any mismatch in transactional data can cause a direct impact on the reliability of the
report and the whole BI system in general. That is why a reconciliation mechanism must be in place in order to
detect such a discrepancy beforehand (meaning, before the data reaches the final business users).

Transactional data reconciliation is always done in terms of total sum. This prevents any mismatch otherwise
caused due to varying granularity of qualifying dimensions. Also, this total sum can be done on either the full data
or only on the incremental data set.

Some example measures used for transactional data reconciliation can be the following (a minimal SQL sketch
follows the list),

1. Sum of total revenue calculated from source and target
2. Sum of total products sold calculated from source and target etc.
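
Similarly, a transactional reconciliation sketch (again with purely illustrative table and column names) compares total sums rather than row counts:

-- total revenue should tie up between source and target
SELECT 'SOURCE' AS layer, SUM(revenue_amt) AS total_revenue FROM src_sales
UNION ALL
SELECT 'TARGET' AS layer, SUM(revenue_amt) AS total_revenue FROM dw_fact_sales;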
Automated Data Reconciliation

For large warehouse systems, it is often convenient to automate the data reconciliation process by making it an
integral part of data loading. This can be done by maintaining separate loading metadata tables and populating
those tables with reconciliation queries. The existing reporting architecture of the warehouse can then be used to
generate and publish reconciliation reports at the end of the loading. Such automated reconciliation will keep all
the stakeholders informed about the trustworthiness of the reports.
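
One possible shape for such a loading metadata table (the table and column names here are purely illustrative), populated at the end of each load and exposed through the normal reporting layer, might be:

CREATE TABLE etl_recon_result (
    load_id       INT,
    recon_name    VARCHAR(100),  -- e.g. 'CUSTOMER_COUNT', 'REVENUE_SUM'
    source_value  NUMERIC,
    target_value  NUMERIC,
    recon_status  CHAR(4),       -- 'PASS' or 'FAIL'
    load_date     DATE
);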

Top 10 things you must know before designing a data warehouse

This paper outlines some of the most important (and equally neglected) things that one must consider before and
during the design phase of a data warehouse. In our experience, we have seen data warehouse designers often miss
out on these items merely because they thought them too trivial to merit their attention. Guess what, at the end of
the day such neglect costs them heavily as it cuts short the overall ROI of the data warehouse.

Here we outline some data warehouse gotchas that you should be aware of.

1. ETL solution takes more time to design than analytical solutions

In a top-down design approach people often start to visualize the end data and realize the complexity
associated with data analytics first. As they tend to see more details of it, they tend to devote more time to
designing the analytical or reporting solutions and less time to designing the background ETL work that deals
with data extraction / cleaning / transformation etc. They often live under the assumption that it would be
comparatively easy to map the source data from the existing systems since users already have a better
understanding of the source systems. Moreover, the need for and complexity of cleansing / profiling the
source data would be less since the data is already coming from standard source systems.

Needless to say, these assumptions often turn void when it comes to actually coding the ETL layer to feed the
data warehouse. Almost always, mapping, cleaning and preparing data turns out to be significantly more time
consuming compared to the design of the Reporting / Analytics layer.

From budgeting and costing standpoints also, an architect prefers to pitch the case of data reporting and
analytics over background ETL, as the former can be more easily presented to senior management than the
latter in order to get them to sanction the budget. This leads to a disproportionate budget between background
ETL and frontend Reporting tasks.

2. Data Warehouse scope will increase along the development

Users often do not know what they want from the data until they start to see the data. As and when
development progresses and more and more data visualization becomes possible, users start wishing even
more out of their data. This phenomenon is unavoidable and designers must allocate extra time to
accommodate such ad-hoc requirements.

Many requirements that were implicit in the beginning become explicit and indispensable in the later phases
of the project. Since you cannot avoid it, make sure that you already have adequate time allocated in your
project plan beforehand.

3. Issues will be discovered in the source system that went undetected till date

The power of an integrated data warehouse becomes apparent when you start discovering discrepancies and
issues in the existing, so-called stable, source systems. The real problem, however, is that designers often make
the wrong assumption that the source systems or upstream systems are fault free. And that is why they do
not allocate any time or resource in their project plan to deal with those issues.

Data warehouse developers do discover issues in the source systems. And those issues take a lot of time to get
fixed. More often than not those issues are not even fixed in the source (to minimize the impact on business)
and some workaround is suggested to deal with those issues at the data warehouse level directly (although
that is generally not a good idea). Source system issues confuse everybody and require more administrative
time (than technical time) to resolve, as DW developers need to identify and make their case to prove to the
source systems that the issue(s) does exist. These are huge time wasters and often not incorporated in the
project plan.

4. You will need to validate data not being validated in source systems

Source systems do not always give you the correct data. A lot of validations and checks are not done at the
source system level (e.g. OLTP systems), and each time a validation check is skipped, it creates the danger of
sending unexpected data to the data warehouse. Therefore, before you can actually process data in the data
warehouse, you will need to perform some validation checks at your end to ensure the expected data
availability.

This is again unavoidable. If you do not make those checks, that would cause issues at your side, including
things like data loading errors, reconciliation failures, even data integrity threats. Hence ensure that proper
time and resource allocation are there to work on these items.

5. User training will not be sufficient and users will not put their training to
use

You will face natural resistance from the existing business users, who will show huge inertia against accepting
the new system. In order to ease things, adequate user training sessions are generally arranged for the users of
the data warehouse. But you will notice that "adequate" training is not "sufficient" for them (mainly because
they need to unlearn a lot of things to learn the use of the new data warehouse).

Even if you arrange adequate training for the users, you will find that the users are not really putting their
training to use when it comes to doing things in the new data warehouse. That's often because facts and
figures from the new data warehouse often challenge their existing convictions and they are reluctant to
accept them wholeheartedly.

User training and acceptance is probably the single most important non-technical challenge that makes or
breaks a data warehouse. No matter what amount of effort you put in as a designer to design the data
warehouse - if the users are not using it - the data warehouse is as good as a failure. As the old saying goes in
Sanskrit, 'a tree is known by the name of its fruit'; the success of a data warehouse is measured by the
information it produces. If the information is not relevant to the users and if they are reluctant to use it, you
lose the purpose. Hence make all possible efforts to connect to the users and train them to use the data
warehouse. Mere 'adequate' training is not 'sufficient' here.

6. Users will create conflicting business rules

That is because the users often belong to different departments of the company, and even though each one of
them knows the business of her own department pretty well, she would not know the business of the other
departments that well. And when you take the data from all these departments and try to combine them
into an integrated data warehouse, you will often discover that a business rule suggested by one user is
completely opposite to the business rule suggested by another.

Such cases are generally involved and need collaboration between multiple parties to come to a conclusion.
It's better to consider such cases well before, during the planning phase, to avoid late surprises.

7. Volumetric mis-judgement is more common than you thought

A very minutely done volumetric estimate in the starting phase of the project can go awry later. This happens
due to several reasons, e.g. a slight change in the standard business metrics may create a huge impact on the
volumetric estimates.

For example, suppose a company has 1 million customers who are expected to grow at a rate of 7% per
annum. While calculating the volume and size of your data warehouse you have used this measure in several
places. Now if the customer base actually increases by 10% instead of 7%, that would mean 30,000 more
customers. In a fact table of granularity customer, product, day, this would mean 30,000 X 10 X 365 more
records (assuming on average one customer uses 10 products). If one record takes 1 KB, then the fact table
would now require (30,000 X 365 X 10 X 1 KB) / (1024 X 1024) = 100+ GB more disk space from only one table.

8. It's IT's responsibility to prove the correctness of your data

When a user looks at one value in your report and says, "I think it's not right" - the onus is on you to prove the
correctness or validity of that data. Nobody is going to help you prove how right your data warehouse is. For
this reason, it is absolutely necessary to build a solid data reconciliation framework for your data warehouse -
a reconciliation framework that can trigger an early alarm whenever something does not match between
source and target, so that you get enough time to investigate (and if required, fix) the issue.

Such a reconciliation framework, however indispensable, is not easy to create. Not only does it require a huge
amount of effort and expertise, it also tends to run on the same production server at almost the same time as
the production load and eats up a lot of performance. Moreover, such a reconciliation framework is often not a
client-side requirement, making it even more difficult for you to allocate time and budget. But if you do not do
it, that would be a much bigger mistake to make.

9. Data Warehousing projects incur high maintenance costs

Apart from development and deployment, maintenance also incurs huge costs in data warehousing. Server
maintenance, software licensing, regular data purging, database maintenance - all these incur costs.

It's important to set the expectation in the very beginning of the project about
the huge maintenance cost
implications.

10. Amount of time needed to refresh your data warehouse is going to be your top
concern

You need to load data into the data warehouse, generally at least daily (although sometimes more frequently
than this) and also monthly / quarterly / yearly etc. Loading the latest data into the data warehouse ensures
that your reports are all up-to-date. However, the time required to load data (refresh time) is going to be more
than what you have calculated, and that too is going to increase day by day.

One of the major hindrances in the acceptance of a data warehouse by its users is its performance. I have
seen too many cases where reports generated from the data warehouse miss SLA and severely damage the
dependability and credibility of the data warehouse. In fact, I have seen cases where the daily load runs for
more than a day and never completes in time to generate the daily report. There have been other famous
cases of SLA breach as well.

I cannot stress this enough, but performance considerations are hugely important for the success of a data
warehouse, and more important than you thought. Do everything necessary to make your data warehouse
perform well - reduce overhead, maintain servers, cut off complexities, do regular system performance tests
(SPT) and weigh the performance against industry benchmarks, make SPT a part of user acceptance testing
(UAT) etc.

Common Mistakes in Data Modelling

A model is an abstraction of some aspect of a problem. A data model is a model that describes how data is
represented and accessed, usually for a database. The construction of a data model is one of the most difficult
tasks of software engineering and is often pivotal to the success or failure of a project.
There are too many factors that determine the success of a data model in terms of its usability and effectiveness.
Not all of them can be discussed here. Plus, people tend to make different types of mistakes for different types of
modelling patterns. Some modelling patterns are prone to specific types of issues which might not be prevalent in
other types of patterns. Nevertheless, I have tried to compile a list of some widespread mistakes that are commonly
found in data modelling patterns.

Avoid Large Data Models

Well, you may be questioning how large is large. The answer: it depends. You must ask yourself if the large size of
the model is really justified. The more complex your model is, the more prone it is to contain design errors. For
example, you may want to try to limit your models to not more than 200 tables. To be able to do that, in the early
phase of data modelling ask yourself these questions:

. Is the large size really justified?
. Is there any extraneous content that I can remove?
. Can I shift the representation and make the model more concise?
. How much work is it to develop the application for this model, and is that worthwhile?
. Is there any speculative content that I can remove? (Speculative contents are those which are not immediately
required but are still kept in the model as 'might be required' in the future.)

If you consciously try to keep things simple, most likely you will also be able to
avoid the menace of over modelling.
Over modelling leads to over engineering which leads to over work without any
defined purpose. A person who
does modelling just for the sake of modelling often ends up doing over modelling.

Watch carefully for the following signs in your data model:

. Lots of entities with no or very few non-key attributes
. Lots of modelling objects with names which no business user would recognise
. You yourself have a lot of trouble coming up with the names of the attributes

All the above are sure signs of over modelling that only increases your burden (of
coding, of loading, of maintaining,
of securing, of using).

Lack of Clarity or Purpose

The purpose of the model determines the level of detail that you want to keep in the model. If you are unsure about
the purpose, you will definitely end up designing a model that is too detailed or too brief for the purpose.
Clarity is also very important. For example, do you clearly know the data types that you should be using for all the
business attributes? Or do you end up using some speculative data types (and lengths)?

Modern data modelling tools come with different concepts for declaring data (e.g. the domain and enumeration
concepts in ERWin) that help to bring clarity to the model. So, before you start building, pause for a moment and
ask yourself if you really understand the purpose of the model.

Reckless violation of Normal Form

(Applicable for operational data models)

When the tables in the model satisfy higher levels of normal forms, they are less
likely to store redundant or
contradictory data. But there is no hard and fast rule about maintaining those
normal forms. A modeller is allowed to
violate these rules for good purpose (such as to increase performance) and such a
relaxation is called
denormalization.

But the problem occurs when a modeller violates the normal form deliberately without a clearly defined purpose.
Such reckless violation breaks apart the whole design principle behind the data model and often renders the model
unusable. So if you are unsure of something, just stick to the rules. Don't get driven by vague purposes.

Consider, for example, a general hierarchical relationship between a customer and its related categories. Let's say a
customer can fall under the following categories: Consumer, Business, Corporate and Wholesaler. Given this
condition, a 'ConsumerFlag' column on the Customer table is redundant.

Traps in Dimensional Modelling


When it comes to dimensional modelling, there are some inexcusable mistakes that people tend to make. Here are a
few of them:

Snow-flaking between two Type-II slowly changing dimension (SCD) tables

Consider, as an example of such a modelling, a 'Product' dimension snow-flaked to a 'ProductType' dimension,
with both tables maintained as SCD Type-II.

Theoretically speaking there is no issue with such a model, at least until one tries to create the ETL programming
(extraction-transformation-loading) code behind these tables.

Consider this: in the above example, suppose something changed in the 'ProductType' table which created a new
row in 'ProductType' (since ProductType is SCD2, any historical change will be maintained by adding a new row).
This new row will have a new surrogate key. But in the Product table, any existing row is still pointing to the old
product type record, hence leading to a data anomaly.
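
A minimal sketch of the two tables (column names are illustrative only) makes the anomaly easier to see: when ProductType gets a new surrogate key for a changed row, existing Product rows keep referencing the old ProductType key.

-- both dimensions maintained as SCD Type-II
CREATE TABLE dim_product_type (
    product_type_key  INT PRIMARY KEY,  -- surrogate key; a new one is issued per change
    product_type_code VARCHAR(20),
    eff_start_date    DATE,
    eff_end_date      DATE
);

CREATE TABLE dim_product (
    product_key       INT PRIMARY KEY,
    product_code      VARCHAR(20),
    product_type_key  INT,              -- still points to the OLD ProductType row
    eff_start_date    DATE,
    eff_end_date      DATE
);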

Indiscriminate use of Surrogate keys

Surrogate keys are used as a unique identifier to represent an entity in the modelled world. Surrogate keys are
required when we cannot use a natural key to uniquely identify a record, or when using a surrogate key is deemed
more suitable because the natural key is not a good fit for a primary key (natural key too long, data type not
suitable for indexing etc.).
But surrogate keys also come with some disadvantages. The values of surrogate keys have no relationship with the real-world meaning of the data held in a row. Therefore, over-use of surrogate keys (often in the name of 'standardization') leads to the problem of disassociation and creates unnecessary ETL burden and performance degradation.

Even query optimization becomes difficult when one disassociates the surrogate key from the natural key. The reason: since the surrogate key takes the place of the primary key, the unique index is applied on that column, and any query based on the natural key identifier then leads to a full table scan because it cannot take advantage of the unique index on the surrogate key.

Before assigning a surrogate key to a table, ask yourself these questions:

. Am I using a surrogate key only for the sake of maintaining a standard?
. Is there any unique, not-null natural key that I can use as the primary key instead of a new surrogate key?
. Can I use my natural key as the primary key without degrading performance?

If the answer to the above questions is 'YES', don't use the surrogate key.

Data Mining - a simple guide for beginners

This paper introduces the subject of data mining in simple lucid language and moves
on to build more complex
concepts. Start here if you are a beginner.

Data Mining. I have an allergy to this term.

Not because I hate the subject of data mining itself, but because this term is so over-used, misused, exploited and commercialized, and so often conveyed in an inaccurate manner, in inappropriate places and with intentional vagueness.

So when I decided to write about what data mining is, I was convinced that I first needed to write about what is NOT data mining, in order to build a formal definition of data mining.
What is Data Mining? (And what it is not)

Here is the Wikipedia definition of data mining:

"Data mining ... is the process of discovering new patterns from large data sets."

Now the question is: what does the above definition really mean and how does it
differ from finding
information from databases? We often store information in databases (as in data
warehouses) and retrieve the
information from the database when we need it. Is that data mining? The answer is 'no', and we will soon see why.

Let's start with the big picture first. This all starts with something called "Knowledge Discovery in Databases". Data mining is basically one of the steps in the process of knowledge discovery in databases (KDD). The knowledge discovery process is basically divided into 5 steps:

1. Selection
2. Pre-processing
3. Transformation
4. Data Mining
5. Evaluation

'Selection' is the step where we identify the data, 'pre-processing' is where we cleanse and profile the data, the 'transformation' step is required for data preparation, and then comes data mining. Lastly, we use 'evaluation' to test the result of the data mining.

Notice here the term 'knowledge', as in Knowledge Discovery in Databases (KDD). Why do we say 'knowledge'? Why not 'information' or 'data'?

This is because there are differences among the terms 'data', 'information' and 'knowledge'. Let's understand this difference through an example.

You run a local departmental store and you log all the details of your customers in the store database. You know the names of your customers and what items they buy each day.

For example, Alex, Jessica and Paul visit your store every Sunday and buy candles. You store this information in your store database. This is data. Any time you want to know who the visitors are that buy candles, you can query your database and get the answer. This is information. If you want to know how many candles are sold on each day of the week from your store, you can again query your database and you'd get the answer; that's also information.

But suppose there are 1000 other customers who also buy candles from you every Sunday (mostly, with some percentage of variation) and all of them are Christian by religion. So you can conclude that Alex, Jessica and Paul must also be Christian.

Now, the religion of Alex, Jessica and Paul was not given to you as data. It could not be retrieved from the database as information. But you learnt this piece of information indirectly. This is the 'knowledge' that you discovered. And this discovery was done through a process called 'data mining'.

Now there are chances that you are wrong about Alex, Jessica and Paul. But there is a fair chance that you are actually right. That is why it is very important to 'evaluate' the result of the KDD process.

I gave you this example because I wanted to make a clear distinction between knowledge and information in the context of data mining. This is important for understanding our first question: why retrieving information from deep down in your database is not the same as data mining. No matter how complex the information retrieval process is, no matter how deep the information is located, it's still not data mining.

As long as you are not dealing with predictive analysis or discovering 'new' patterns from the existing data, you are not doing data mining.

What are the applications of Data Mining?

When it comes to applying data mining, your imagination is the only barrier (not really; there are technological hindrances as well, as we will see later). But it's true that data mining is applied in almost every field, from genetics to human rights violations. One of the most important applications is in 'machine learning'. Machine learning is a branch of artificial intelligence concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data. Machine learning makes it possible for computers to take autonomous decisions based on the data available from past experiences. Many of the standard problems of today's world are being solved by the application of machine learning, as solving them otherwise (e.g. through a deterministic algorithmic approach) would be impossible given the breadth and depth of the problem.
Let me start with one example of the application of data mining that enables a machine-learning algorithm to drive an autonomous vehicle. This vehicle does not have any driver and it moves around the road all by itself. The way it maneuvers and overcomes obstacles is by analyzing the images that it sees (through a VGA camera) and then using data mining to determine the course of action based on the data of its past experiences.

Fig. Autonomous vehicle designed at Stanford University, using data mining methods to maneuver (video)

There are notable applications of data mining in subjects such as:

. Voice recognition

Think of Siri on the iPhone. How does it understand your commands? Clearly it's not deterministically programmable, as everybody has a different tone, accent and voice. And not only does it understand, it also adapts to your voice better as you keep using it more and more.

. Classification of DNA sequences

A DNA sequence contains biological information. One of the many approaches to DNA sequencing is through sequence mining, where data mining techniques are applied to find statistically relevant patterns, which are then compared with previously studied sequences to understand the given sequence.

. Natural Language processing

Consider the following conversation between a customer (Mike) and a shop-keeper (Linda).

Mike: You have playing cards?
Linda: We have one blue stack from Jackson's and also one other from Deborah.
Mike: What is the price?
Linda: Jackson's $4 and Deborah's $7.
Mike: Okay, give me the blue one please.

Now consider this. What if 'Linda' were an automated machine? You could probably still have the same kind of conversation, but it would probably be much more unnatural.

Mike: You have playing cards?


Robot: Yes.
Mike: What type of playing cards do you have?
Robot: We have Jackson's and Deborah's playing cards.
Mike: What are the colors of the playing cards?
Robot: Which company's playing cards do you want to know the color of?
Mike: What is the color of Jackson's playing cards?
Robot: Blue.
Mike: What are the prices of Jackson's and Deborah's playing cards?
Robot: Jackson's playing cards cost $4 and Deborah's playing cards cost $7.

Mike: Ok, then can I buy the blue ones?

Robot: We do not have any product called 'blue ones'.
Mike: Can I have the blue color playing cards please?
Robot: Sure!

I know the above example is a bit of an overshoot, but you get the idea. Machines do not understand natural language, and it's a challenge to make them understand it. Until we do that, we won't be able to build a really useful human-computer interface.

Recently, real advancement in natural language processing has come with the application of data mining. Prior implementations of language-processing tasks typically involved the direct hand-coding of large sets of rules. The machine-learning paradigm instead uses general learning algorithms, often (although not always) grounded in statistical inference, to automatically learn such rules through the analysis of large corpora of typical real-world examples.

Methods of data mining

Now, if the above examples interest you, then let's continue learning more about data mining. One of the first tasks we have to do next is to understand the different approaches that are used in the field of data mining. The list below shows most of the important methods:

Anomaly Detection

This is the method of detecting patterns in a given data set that do not conform to an established normal behavior. It is applied in a number of different fields, such as network intrusion detection, share market fraud detection, etc.

Association Rule Learning


This is a method of discovering interesting relations between variables in large databases. Ever seen "Buyers who bought this product also bought these:" type messages on e-commerce websites (e.g. on Amazon.com)? That's an example of association rule learning.
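A minimal illustration in SQL, assuming a hypothetical OrderItem table (OrderId, ProductId): counting how often other products appear in the same orders as a given product is the raw co-occurrence input behind 'also bought' style rules.

-- Hedged sketch; the OrderItem table and product id 101 are assumptions.
SELECT oi2.ProductId, COUNT(*) AS TimesBoughtTogether
FROM OrderItem oi1
JOIN OrderItem oi2
  ON oi1.OrderId = oi2.OrderId
 AND oi1.ProductId <> oi2.ProductId
WHERE oi1.ProductId = 101            -- the product the customer is looking at
GROUP BY oi2.ProductId
ORDER BY TimesBoughtTogether DESC    -- most frequently co-purchased products first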

Clustering

Clustering is the method of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters. Cluster analysis is widely used in market research when working with multivariate data. Market researchers often use this to create customer segmentation, product segmentation, etc.

Classification

This method is used for the task of generalizing known structure to apply to new
data. For example, an email
program might attempt to classify an email as legitimate or spam.

Regression

This method attempts to find a function that models the data with the least error. The above example of autonomous driving uses this method.

Next, we would learn about each of these methods in greater detail with examples of their use.

SQL Questions

What is normalization? Explain different levels of normalization?

Check out the article Q100139 from the Microsoft knowledge base and, of course, there's much more information available on the net. It will be a good idea to get hold of any RDBMS fundamentals text book, especially the one by C. J. Date. Most of the time, it will be okay if you can explain up to third normal form.
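A minimal sketch of moving a design toward third normal form, with assumed table and column names:

-- Hedged sketch; names are assumptions for illustration.
-- Unnormalized: customer details repeat on every order row.
CREATE TABLE OrdersFlat (
    OrderId      int         NOT NULL PRIMARY KEY,
    CustomerName varchar(50) NOT NULL,
    CustomerCity varchar(50) NOT NULL,
    ProductName  varchar(50) NOT NULL
)

-- Closer to 3NF: non-key facts about the customer move to their own table
-- and are referenced by key, removing the repeated and transitively
-- dependent columns.
CREATE TABLE Customer (
    CustomerId   int         NOT NULL PRIMARY KEY,
    CustomerName varchar(50) NOT NULL,
    CustomerCity varchar(50) NOT NULL
)

CREATE TABLE Orders (
    OrderId     int         NOT NULL PRIMARY KEY,
    CustomerId  int         NOT NULL REFERENCES Customer (CustomerId),
    ProductName varchar(50) NOT NULL
)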

What is de-normalization and when would you go for it?

As the name indicates, de-normalization is the reverse process of normalization. It is the controlled introduction of redundancy into the database design. It helps improve query performance as the number of joins can be reduced.

How do you implement one-to-one, one-to-many and many-to-many relationships while designing tables?

One-to-one relationships can be implemented as a single table and rarely as two tables with primary and foreign key relationships. One-to-many relationships are implemented by splitting the data into two tables with primary key and foreign key relationships. Many-to-many relationships are implemented using a junction table with the keys from both tables forming the composite primary key of the junction table.

It will be a good idea to read up a database designing fundamentals text book.
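A minimal sketch of a many-to-many relationship via a junction table, using hypothetical Student and Course tables:

-- Hedged sketch; table names are assumptions for illustration.
CREATE TABLE Student (
    StudentId   int         NOT NULL PRIMARY KEY,
    StudentName varchar(50) NOT NULL
)

CREATE TABLE Course (
    CourseId   int         NOT NULL PRIMARY KEY,
    CourseName varchar(50) NOT NULL
)

-- Junction table: the composite primary key is formed from both foreign keys.
CREATE TABLE StudentCourse (
    StudentId int NOT NULL REFERENCES Student (StudentId),
    CourseId  int NOT NULL REFERENCES Course (CourseId),
    PRIMARY KEY (StudentId, CourseId)
)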


What's the difference between a primary key and a unique key?

Both primary key and unique key enforce uniqueness of the column on which they are defined. But by default the primary key creates a clustered index on the column, whereas unique creates a non-clustered index by default. Another major difference is that a primary key does not allow NULLs, but a unique key allows one NULL only.
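A minimal sketch, with assumed names, showing both on one table:

-- Hedged sketch; by default the PRIMARY KEY gets a clustered index
-- and the UNIQUE constraint gets a non-clustered index.
CREATE TABLE Employee (
    EmpId   int          NOT NULL PRIMARY KEY,  -- no NULLs allowed
    Email   varchar(100) NULL UNIQUE,           -- at most one NULL allowed
    EmpName varchar(50)  NOT NULL
)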

What are user defined data types and when should you go for them?
User defined data types let you extend the base SQL Server data types by providing a descriptive name and format to the database. Take, for example, a column called Flight_Num which appears in many tables in your database. In all these tables it should be varchar(8). In this case you could create a user defined data type called Flight_num_type of varchar(8) and use it across all your tables.

See sp_addtype, sp_droptype in books online.
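A minimal sketch of the Flight_num_type example above, using the legacy sp_addtype procedure (newer SQL Server versions also offer CREATE TYPE):

-- Hedged sketch; names follow the example above.
EXEC sp_addtype Flight_num_type, 'varchar(8)', 'NOT NULL'

-- The type can then be used wherever the flight number appears.
CREATE TABLE FlightSchedule (
    Flight_Num    Flight_num_type,
    DepartureTime datetime NOT NULL
)

-- To remove it later (only when no table uses it any more):
-- EXEC sp_droptype Flight_num_type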

What is bit data type and what's the information that can be stored inside a bit
column?

Bit data type is used to store Boolean information like 1 or 0 (true or false).
Until SQL Server 6.5 bit data
type could hold either a 1 or 0 and there was no support for NULL. But from SQL
Server 7.0 onwards, bit
data type can represent a third state, which is NULL.

Define candidate key, alternate key, composite key.

A candidate key is one that can identify each row of a table uniquely. Generally a
candidate key becomes
the primary key of the table. If the table has more than one candidate key, one of
them will become the
primary key, and the rest are called alternate keys.

A key formed by combining two or more columns is called a composite key.

What are defaults? Is there a column to which a default cannot be bound?


A default is a value that will be used by a column, if no value is supplied to that
column while inserting
data. IDENTITY columns and timestamp columns can't have defaults bound to them. See
CREATE
DEFAULT in books online.
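A minimal sketch of both styles, with assumed object names (the legacy CREATE DEFAULT / sp_bindefault pair, and the more common DEFAULT constraint):

-- Hedged sketch; it assumes a Customers table with a Country column exists.
CREATE DEFAULT dflt_country AS 'India'
GO
EXEC sp_bindefault 'dflt_country', 'Customers.Country'

-- Constraint style: the default is declared with the column.
CREATE TABLE SalesOrder (
    OrderId   int      NOT NULL PRIMARY KEY,
    OrderDate datetime NOT NULL DEFAULT GETDATE()
)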

What is a transaction and what are ACID properties?

A transaction is a logical unit of work in which, all the steps must be performed
or none. ACID stands for
Atomicity, Consistency, Isolation, Durability. These are the properties of a
transaction. For more
information and explanation of these properties, see SQL Server books online or
any RDBMS fundamentals text book.

Explain different isolation levels

An isolation level determines the degree of isolation of data between concurrent transactions. The default SQL Server isolation level is Read Committed. Here are the other isolation levels (in ascending order of isolation): Read Uncommitted, Read Committed, Repeatable Read, Serializable. See SQL Server books online for an explanation of the isolation levels. Be sure to read about SET TRANSACTION ISOLATION LEVEL, which lets you customize the isolation level at the connection level.
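For example, a sketch that raises the isolation level for the current connection (MyTable is an assumed table name):

-- Hedged sketch.
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ

BEGIN TRAN
    SELECT * FROM MyTable WHERE Id = 1   -- rows read here stay stable until commit
    -- ... other work in the same transaction ...
COMMIT TRAN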

CREATE INDEX myIndex ON myTable (myColumn)

What type of Index will get created after executing the above statement?

Non-clustered index. Important thing to note: by default a clustered index gets created on the primary key, unless specified otherwise.

What is the maximum size of a row?

8060 bytes. Do not be surprised with questions like 'What is the maximum number of
columns per table'.
Check out SQL Server books online for the page titled: "Maximum Capacity
Specifications".

Explain Active/Active and Active/Passive cluster configurations


Hopefully you have experience setting up cluster servers. But if you do not, at
least be familiar with the
way clustering works and the two clustering configurations Active/Active and
Active/Passive. SQL
Server books online has enough information on this topic and there is a good white
paper available on
Microsoft site.

Explain the architecture of SQL Server


This is a very important question and you had better be able to answer it if you consider yourself a DBA. SQL Server books online is the best place to read about SQL Server architecture. Read up the chapter dedicated to SQL Server architecture.

What is Lock Escalation?

Lock escalation is the process of converting a lot of low level locks (like row locks, page locks) into higher level locks (like table locks). Every lock is a memory structure, so too many locks would mean more memory being occupied by locks. To prevent this from happening, SQL Server escalates the many fine-grain locks to fewer coarse-grain locks. The lock escalation threshold was definable in SQL Server 6.5, but from SQL Server 7.0 onwards it's dynamically managed by SQL Server.

What's the difference between DELETE TABLE and TRUNCATE TABLE commands?

DELETE TABLE is a logged operation, so the deletion of each row gets logged in the transaction log, which makes it slow. TRUNCATE TABLE also deletes all the rows in a table, but it will not log the deletion of each row; instead it logs the de-allocation of the data pages of the table, which makes it faster. Of course, TRUNCATE TABLE can be rolled back when it is issued inside a transaction.

Explain the storage models of OLAP

Check out MOLAP, ROLAP and HOLAP in SQL Server books online for more information.

What are the new features introduced in SQL Server 2000 (or the latest release of
SQL Server at the time
of your interview)? What changed between the previous version of SQL Server and the
current version?
This question is generally asked to see how current your knowledge is. Generally
there is a section in the
beginning of the books online titled "What's New", which has all such information.
Of course, reading just
that is not enough, you should have tried those things to better answer the
questions. Also check out the
section titled "Backward Compatibility" in books online which talks about the
changes that have taken
place in the new version.

What are constraints? Explain different types of constraints.


Constraints enable the RDBMS to enforce the integrity of the database automatically, without needing you to create triggers, rules or defaults.

Types of constraints: NOT NULL, CHECK, UNIQUE, PRIMARY KEY, FOREIGN KEY

For an explanation of these constraints see books online for the pages titled:
"Constraints" and "CREATE
TABLE", "ALTER TABLE"
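A minimal sketch showing each constraint type on a pair of tables (names are assumptions):

-- Hedged sketch for illustration only.
CREATE TABLE Dept (
    DeptId   int         NOT NULL PRIMARY KEY,
    DeptName varchar(50) NOT NULL
)

CREATE TABLE Staff (
    StaffId int          NOT NULL PRIMARY KEY,              -- PRIMARY KEY
    Email   varchar(100) NULL UNIQUE,                       -- UNIQUE
    Salary  money        NOT NULL CHECK (Salary > 0),       -- NOT NULL and CHECK
    DeptId  int          NOT NULL REFERENCES Dept (DeptId)  -- FOREIGN KEY
)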

What is an index? What are the types of indexes? How many clustered indexes can be created on a table? If I create a separate index on each column of a table, what are the advantages and disadvantages of this approach?

Indexes in SQL Server are similar to the indexes in books. They help SQL Server
retrieve the data quicker.

Indexes are of two types. Clustered indexes and non-clustered indexes. When you
create a clustered
index on a table, all the rows in the table are stored in the order of the
clustered index key. So, there can
be only one clustered index per table. Non-clustered indexes have their own storage
separate from the
table data storage. Non-clustered indexes are stored as B-tree structures (as are clustered indexes), with the leaf level nodes holding the index key and its row locator. The row locator can be the RID or the clustered index key, depending upon the absence or presence of a clustered index on the table.

If you create an index on each column of a table, it improves the query performance, as the query optimizer can choose from all the existing indexes to come up with an efficient execution plan. At the same time, data modification operations (such as INSERT, UPDATE, DELETE) will become slow, as every time data changes in the table, all the indexes need to be updated. Another disadvantage is that indexes need disk space: the more indexes you have, the more disk space is used.
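A minimal sketch of creating both index types, reusing the myTable/myColumn names from the earlier question:

-- Hedged sketch; assumes myTable has an id column.
-- Only one clustered index is allowed, because it defines the physical row order.
CREATE CLUSTERED INDEX cix_myTable_id ON myTable (id)

-- Any number of non-clustered indexes can be added on top of it.
CREATE NONCLUSTERED INDEX ix_myTable_myColumn ON myTable (myColumn)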

What is RAID and what are different types of RAID configurations?

RAID stands for Redundant Array of Inexpensive Disks, used to provide fault tolerance to database servers. There are six RAID levels, 0 through 5, offering different levels of performance and fault tolerance.
MSDN has some information about RAID levels and for detailed information, check out
the RAID
advisory board's homepage

What are the steps you will take to improve performance of a poor performing query?
This is a very open ended question and there could be a lot of reasons behind the
poor performance of a
query. But some general issues that you could talk about would be: No indexes,
table scans, missing or
out of date statistics, blocking, excess recompilations of stored procedures,
procedures and triggers
without SET NOCOUNT ON, poorly written query with unnecessarily complicated joins,
too much
normalization, excess usage of cursors and temporary tables.

Some of the tools/ways that help you troubleshooting performance problems are:

. SET SHOWPLAN_ALL ON,


. SET SHOWPLAN_TEXT ON,
. SET STATISTICS IO ON,
. SQL Server Profiler,
. Windows NT /2000 Performance monitor,
. Graphical execution plan in Query Analyzer.

Download the white paper on performance tuning SQL Server from Microsoft web site.

What are the steps you will take, if you are tasked with securing an SQL Server?

Again this is another open ended question. Here are some things you could talk
about: Preferring NT
authentication, using server, database and application roles to control access to
the data, securing the
physical database files using NTFS permissions, using an unguessable SA password,
restricting physical
access to the SQL Server, renaming the Administrator account on the SQL Server
computer, disabling the
Guest account, enabling auditing, using multi-protocol encryption, setting up SSL,
setting up firewalls,
isolating SQL Server from the web server etc.

Read the white paper on SQL Server security from Microsoft website. Also check out
My SQL Server
security best practices
What is a deadlock and what is a live lock? How will you go about resolving
deadlocks?

Deadlock is a situation when two processes, each having a lock on one piece of
data, attempt to acquire a
lock on the other's piece. Each process would wait indefinitely for the other to
release the lock, unless one
of the user processes is terminated. SQL Server detects deadlocks and terminates
one user's process.

A livelock is one, where a request for an exclusive lock is repeatedly denied


because a series of
overlapping shared locks keeps interfering. SQL Server detects the situation after
four denials and refuses
further shared locks. A livelock also occurs when read transactions monopolize a
table or page, forcing a
write transaction to wait indefinitely.

Check out SET DEADLOCK_PRIORITY and "Minimizing Deadlocks" in SQL Server books
online. Also
check out the article Q169960 from Microsoft knowledge base.

What is blocking and how would you troubleshoot it?

Blocking happens when one connection from an application holds a lock and a second
connection
requires a conflicting lock type. This forces the second connection to wait,
blocked on the first.

Read up the following topics in SQL Server books online: Understanding and avoiding
blocking, Coding
efficient transactions.

Explain CREATE DATABASE syntax

Many of us are used to creating databases from the Enterprise Manager or by just
issuing the command:

CREATE DATABASE MyDB.

But what if you have to create a database with two file groups, one on drive C and
the other on drive D
with log on drive E with an initial size of 600 MB and with a growth factor of 15%?
That's why being a
DBA you should be familiar with the CREATE DATABASE syntax. Check out SQL Server
books online
for more information.
How to restart SQL Server in single user mode? How to start SQL Server in minimal
configuration
mode?

SQL Server can be started from command line, using the SQLSERVR.EXE. This EXE has
some very
important parameters with which a DBA should be familiar. -m is used for
starting SQL Server in
single user mode and -f is used to start the SQL Server in minimal configuration
mode. Check out SQL
Server books online for more parameters and their explanations.
As a part of your job, what are the DBCC commands that you commonly use for
database
maintenance?

DBCC CHECKDB,
DBCC CHECKTABLE,
DBCC CHECKCATALOG,
DBCC CHECKALLOC,
DBCC SHOWCONTIG,
DBCC SHRINKDATABASE,
DBCC SHRINKFILE etc.

But there are a whole load of DBCC commands which are very useful for DBAs. Check
out SQL Server
books online for more information.

What are statistics, under what circumstances they go out of date, how do you
update them?

Statistics determine the selectivity of the indexes. If an indexed column has unique values then the selectivity of that index is higher, as opposed to an index with non-unique values. The query optimizer uses these statistics in determining whether or not to choose an index while executing a query.

Some situations under which you should update statistics:

1. If there is significant change in the key values in the index

2. If a large amount of data in an indexed column has been added, changed, or removed (that is, if the distribution of key values has changed), or the table has been truncated using the TRUNCATE TABLE statement and then repopulated
3. Database is upgraded from a previous version

Look up SQL Server books online for the following commands:

UPDATE STATISTICS,
STATS_DATE,
DBCC SHOW_STATISTICS,
CREATE STATISTICS,
DROP STATISTICS,
sp_autostats,
sp_createstats,
sp_updatestats

What are the different ways of moving data/databases between servers and databases
in SQL Server?

There are lots of options available, you have to choose your option depending upon
your requirements.
Some of the options you have are:

BACKUP/RESTORE,
Detaching and attaching databases,
Replication,
DTS,
BCP,
logshipping,
INSERT...SELECT,
SELECT...INTO,
creating INSERT scripts to generate data.

Explain different types of BACKUPs available in SQL Server? Given a particular scenario, how would you go about choosing a backup plan?

Types of backups you can create in SQL Server 7.0+ are full database backup, differential database
backup, transaction log backup, filegroup backup. Check out the BACKUP and RESTORE
commands in
SQL Server books online. Be prepared to write the commands in your interview. Books
online also has
information on detailed backup/restore architecture and when one should go for a
particular kind of
backup.

What is database replication? What are the different types of replication you can
set up in SQL Server?

Replication is the process of copying/moving data between databases on the same or different servers. SQL Server supports the following types of replication scenarios:

* Snapshot replication
* Transactional replication (with immediate updating subscribers, with queued
updating subscribers)
* Merge replication

See SQL Server books online for in-depth coverage on replication. Be prepared to
explain how different
replication agents function, what are the main system tables used in replication
etc.
How to determine the service pack currently installed on SQL Server?

The global variable @@Version stores the build number of the sqlservr.exe, which is
used to determine the
service pack installed. To know more about this process visit SQL Server service
packs and versions.

What are cursors? Explain different types of cursors. What are the disadvantages of
cursors? How can
you avoid cursors?

Cursors allow row-by-row processing of the resultsets.

Types of cursors:

Static,
Dynamic,
Forward-only,
Keyset-driven.

See books online for more information.

Disadvantages of cursors: Each time you fetch a row from the cursor, it results in
a network roundtrip,
where as a normal SELECT query makes only one round trip, however large the
resultset is. Cursors are
also costly because they require more resources and temporary storage (results in
more IO operations).
Further, there are restrictions on the SELECT statements that can be used with some
types of cursors.

Most of the times, set based operations can be used instead of cursors. Here is an
example:
If you have to give a flat hike to your employees using the following criteria:

Salary between 30000 and 40000 -- 5000 hike
Salary between 40000 and 55000 -- 7000 hike
Salary between 55000 and 65000 -- 9000 hike
In this situation many developers tend to use a cursor, determine each employee's
salary and update his
salary according to the above formula. But the same can be achieved by multiple
update statements or
can be combined in a single UPDATE statement as shown below:

UPDATE tbl_emp SET salary =
CASE WHEN salary BETWEEN 30000 AND 40000 THEN salary + 5000
     WHEN salary BETWEEN 40000 AND 55000 THEN salary + 7000
     WHEN salary BETWEEN 55000 AND 65000 THEN salary + 9000
     ELSE salary   -- leave salaries outside these ranges unchanged
END

Another situation in which developers tend to use cursors: You need to call a
stored procedure when a
column in a particular row meets certain condition. You don't have to use cursors
for this. This can be
achieved using WHILE loop, as long as there is a unique key to identify each row.
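A minimal sketch of that WHILE-loop pattern, assuming a hypothetical emp table (with empid and salary columns) and a hypothetical procedure usp_process_emp:

-- Hedged sketch; table, column and procedure names are assumptions.
DECLARE @empid int

SELECT @empid = MIN(empid) FROM emp WHERE salary > 60000   -- some condition

WHILE @empid IS NOT NULL
BEGIN
    EXEC usp_process_emp @empid   -- call the stored procedure for this row

    -- move to the next qualifying row using the unique key
    SELECT @empid = MIN(empid) FROM emp WHERE salary > 60000 AND empid > @empid
END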

Write down the general syntax for a SELECT statements covering all the options.

Here's the basic syntax: (Also checkout SELECT in books online for advanced
syntax).

SELECT select_list
[INTO new_table]
FROM table_source
[WHERE search_condition]
[GROUP BY group_by_expression]
[HAVING search_condition]
[ORDER BY order_expression [ASC | DESC] ]

What is a join and explain different types of joins?

Joins are used in queries to explain how different tables are related. Joins also
let you select data from a
table depending upon data from another table.
Types of joins:

INNER JOINs,
OUTER JOINs,
CROSS JOINs

OUTER JOINs are further classified as


LEFT OUTER JOINS,
RIGHT OUTER JOINS and
FULL OUTER JOINS.

For more information see pages from books online titled: "Join Fundamentals" and
"Using Joins".

Can you have a nested transaction?

Yes, very much. Check out BEGIN TRAN, COMMIT, ROLLBACK, SAVE TRAN and @@TRANCOUNT
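A minimal sketch showing how @@TRANCOUNT tracks nesting (note that an inner COMMIT only decrements the counter, while a ROLLBACK rolls back to the outermost BEGIN TRAN):

-- Hedged sketch.
BEGIN TRAN                          -- @@TRANCOUNT = 1
    SELECT @@TRANCOUNT AS TranCount
    BEGIN TRAN                      -- @@TRANCOUNT = 2
        SELECT @@TRANCOUNT AS TranCount
    COMMIT TRAN                     -- back to 1; nothing is really committed yet
COMMIT TRAN                         -- @@TRANCOUNT = 0, work is committed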

What is an extended stored procedure? Can you instantiate a COM object by using T-
SQL?

An extended stored procedure is a function within a DLL (written in a programming language like C or C++ using the Open Data Services (ODS) API) that can be called from T-SQL, just the way we call normal stored procedures using the EXEC statement. See books online to learn how to create extended stored procedures and how to add them to SQL Server.

Yes, you can instantiate a COM (written in languages like VB, VC++) object from T-
SQL by
using sp_OACreate stored procedure.

Also see books online for sp_OAMethod, sp_OAGetProperty, sp_OASetProperty, sp_OADestroy.

What is the system function to get the current user's user id?
USER_ID(). Also check out other system functions like

USER_NAME(),
SYSTEM_USER,
SESSION_USER,
CURRENT_USER,
USER,
SUSER_SID(),
HOST_NAME().
What are triggers? How many triggers you can have on a table? How to invoke a
trigger on demand?

Triggers are a special kind of stored procedure that gets executed automatically when
an INSERT,
UPDATE or DELETE operation takes place on a table.

In SQL Server 6.5 you could define only 3 triggers per table, one for INSERT, one
for UPDATE and one
for DELETE. From SQL Server 7.0 onwards, this restriction is gone, and you could
create multiple
triggers per each action. But in 7.0 there's no way to control the order in which
the triggers fire. In SQL
Server 2000 you could specify which trigger fires first or fires last using
sp_settriggerorder

Triggers cannot be invoked on demand. They get triggered only when an associated
action (INSERT,
UPDATE, DELETE) happens on the table on which they are defined.

Triggers are generally used to implement business rules, auditing. Triggers can
also be used to extend the
referential integrity checks, but wherever possible, use constraints for this
purpose, instead of triggers, as
constraints are much faster.

Till SQL Server 7.0, triggers fire only after the data modification operation
happens. So in a way, they are
called post triggers. But in SQL Server 2000 you could create pre triggers also.
Search SQL Server 2000
books online for INSTEAD OF triggers.

Also check out books online for 'inserted table', 'deleted table' and
COLUMNS_UPDATED()
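A minimal sketch of an AFTER INSERT trigger that audits new rows via the 'inserted' pseudo-table, assuming hypothetical Orders and OrderAudit tables:

-- Hedged sketch; it assumes an Orders table with an OrderId column already exists.
CREATE TABLE OrderAudit (
    OrderId   int      NOT NULL,
    AuditedAt datetime NOT NULL DEFAULT GETDATE()
)
GO

CREATE TRIGGER trg_Orders_Insert
ON Orders
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON
    -- 'inserted' holds the newly inserted rows for this statement
    INSERT INTO OrderAudit (OrderId)
    SELECT OrderId FROM inserted
END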

There is a trigger defined for INSERT operations on a table, in an OLTP system. The
trigger is written to
instantiate a COM object and pass the newly inserted rows to it for some custom
processing.

What do you think of this implementation? Can this be implemented better?

Instantiating COM objects is a time consuming process and since you are doing it
from within a trigger, it
slows down the data insertion process. Same is the case with sending emails from
triggers. This scenario
can be better implemented by logging all the necessary data into a separate table,
and have a job which
periodically checks this table and does the needful.

What is a self join? Explain it with an example.


A self join is just like any other join, except that two instances of the same table will be joined in the query. Here is an example: an Employees table contains rows for normal employees as well as managers. So, to find out the managers of all the employees, you need a self join.

CREATE TABLE emp
(
empid int,
mgrid int,
empname char(10)
)

INSERT emp SELECT 1,2,'Vyas'
INSERT emp SELECT 2,3,'Mohan'
INSERT emp SELECT 3,NULL,'Shobha'
INSERT emp SELECT 4,2,'Shridhar'
INSERT emp SELECT 5,2,'Sourabh'

SELECT t1.empname [Employee], t2.empname [Manager]
FROM emp t1, emp t2
WHERE t1.mgrid = t2.empid

Here is an advanced query using a LEFT OUTER JOIN that even returns the employees
without
managers (super bosses)

SELECT t1.empname [Employee], COALESCE(t2.empname, 'No manager') [Manager]
FROM emp t1
LEFT OUTER JOIN emp t2
ON t1.mgrid = t2.empid

SQL interview questions and answers
By admin | July 14, 2008

1. What are two methods of retrieving SQL?


2. What cursor type do you use to retrieve multiple recordsets?
3. What is the difference between a "where" clause and a "having" clause? - "Where" is a kind of restriction statement. You use the where clause to restrict the data fetched from the DB; the where clause is applied before the result is retrieved. The having clause is applied after the data is retrieved; it is a kind of filtering command.
4. What is the basic form of a SQL statement to read data out of a table? The basic form to read data out of a table is 'SELECT * FROM table_name;'. An answer such as 'SELECT * FROM table_name WHERE xyz = "whatever";' cannot be called the basic form because of the WHERE clause.
5. What structure can you implement for the database to speed up table reads? - Following the rules of DB tuning, we have to: 1] properly use indexes (different types of indexes); 2] properly locate different DB objects across different tablespaces, files and so on; 3] create a special space (tablespace) to locate some of the data with special data types (for example CLOB, LOB, etc.).
6. What are the tradeoffs with having indexes? - 1. Faster selects, slower updates.
2. Extra storage space to
store indexes. Updates are slower because in addition to updating the table you
have to update the index.
7. What is a "join"? - A "join" is used to connect two or more tables logically, with or without a common field.
8. What is "normalization"? "Denormalization"? Why do you sometimes want to
denormalize? -
Normalizing data means eliminating redundant information from a table and
organizing the data so that
future changes to the table are easier. Denormalization means allowing redundancy
in a table. The main
benefit of denormalization is improved performance with simplified data retrieval
and manipulation. This is
done by reduction in the number of joins needed for data processing.
9. What is a "constraint"? - A constraint allows you to apply simple referential
integrity checks to a table.
There are four primary types of constraints that are currently supported by SQL
Server:
PRIMARY/UNIQUE - enforces uniqueness of a particular table column. DEFAULT -
specifies a default
value for a column in case an insert operation does not provide one. FOREIGN KEY -
validates that every
value in a column exists in a column of another table. CHECK - checks that every
value stored in a column is
in some specified list. Each type of constraint performs a specific type of action. DEFAULT is not a constraint in the strict sense. NOT NULL is one more constraint, which does not allow values in the specific column to be null. It is also the only constraint which is not a table level constraint.
10. What types of index data structures can you have? - An index helps to search values in tables faster. The three most commonly used index types are: - B-Tree: builds a tree of possible values with a list of row IDs that have the leaf value. Needs a lot of space and is the default index type for most databases. - Bitmap: a string of bits for each possible value of the column. Each bit string has one bit for each row. Needs only a little space and is very fast (however, the domain of values cannot be large, e.g. SEX(m,f), degree(BS,MS,PHD)). - Hash:
A hashing algorithm is used to assign a set of characters to represent a text
string such as a composite of
keys or partial keys, and compresses the underlying data. Takes longer to build and
is supported by
relatively few databases.
11. What is a "primary key"? - A PRIMARY INDEX or PRIMARY KEY is something which
comes mainly from
database theory. Its behavior is almost the same as a UNIQUE INDEX, i.e. there may only be one of each value in this column. If you call such an INDEX PRIMARY instead of UNIQUE, you
say something
about
your table design, which I am not able to explain in few words. Primary Key is a
type of a constraint
enforcing uniqueness and data integrity for each row of a table. All columns
participating in a primary key
constraint must possess the NOT NULL property.
12. What is a "functional dependency"? How does it relate to database table design? - A functional dependency exists when the value of one attribute (or set of attributes) determines the value of another attribute; for example, an employee ID determines the employee's name. Functional dependencies between key and non-key attributes are what normalization analyzes when designing database tables.
13. What is a "trigger"? - Triggers are stored procedures created in order to
enforce integrity rules in a database.
A trigger is executed every time a data-modification operation occurs (i.e.,
insert, update or delete). Triggers
are executed automatically on the occurrence of one of the data-modification operations. A trigger is a database object directly associated with a particular table. It fires whenever a specific statement/type of statement is issued against that table. The types of statements are insert, update, delete and query statements. Basically, a trigger is a set of SQL statements. A trigger is a solution to the restrictions of a constraint. For instance: 1. A database constraint cannot use PSEUDO columns as criteria, whereas a trigger can. 2. A database constraint cannot refer to the old and new values of a row, whereas a trigger can.
14. Why can a "group by" or "order by" clause be expensive to process? - Processing of a "group by" or "order by" clause often requires the creation of temporary tables to process the results of the query, which, depending on the result set, can be very expensive.
15. What is "index covering" of a query? - Index covering means that "Data can be
found only using indexes,
without touching the tables"
16. What types of join algorithms can you have? - The common join algorithms are nested loop join, merge (sort-merge) join and hash join.
17. What is a SQL view? - The output of a query can be stored as a view. A view acts like a small table which meets our criterion. A view is a precompiled SQL query which is used to select data from one or more tables. A view is like a table but it doesn't physically take any space. A view is a good way to present data in a particular format if you use that query quite often. Views can also be used to restrict users from accessing the tables directly.

Linux command line Q&A

By admin | July 15, 2008

1. You need to see the last fifteen lines of the files dog, cat and horse. What
command should you use?
tail -15 dog cat horse
The tail utility displays the end of a file. The -15 tells tail to display the last
fifteen lines of each specified file.
2. Who owns the data dictionary?
The SYS user owns the data dictionary. The SYS and SYSTEM users are created when
the database is
created.
3. You routinely compress old log files. You now need to examine a log from two
months ago. In order to
view its contents without first having to decompress it, use the _________ utility.

zcat
The zcat utility allows you to examine the contents of a compressed file much the
same way that cat
displays a file.
4. You suspect that you have two commands with the same name as the command is not
producing the
expected results. What command can you use to determine the location of the command
being run?
which
The which command searches your path until it finds a command that matches the
command you are
looking for and displays its full path.
5. You locate a command in the /bin directory but do not know what it does. What
command can you use to
determine its purpose.
whatis
The whatis command displays a summary line from the man page for the specified
command.
6. You wish to create a link to the /data directory in bob's home directory, so you issue the command ln /data /home/bob/datalink, but the command fails. What option should you use in this command line to be successful?
Use the -s option
Hard links to directories are not allowed, so in order to create a link to a directory you must create a symbolic link with the -s option: ln -s /data /home/bob/datalink
7. When you issue the command ls -l, the first character of the resulting display
represents the file�s
___________.
type
The first character of the permission block designates the type of file that is
being displayed.
8. What utility can you use to show a dynamic listing of running processes?
__________
top
The top utility shows a listing of all running processes that is dynamically
updated.
9. Where is standard output usually directed?
to the screen or display
By default, your shell directs standard output to your screen or display.
10. You wish to restore the file memo.ben which was backed up in the tarfile
MyBackup.tar. What command
should you type?
tar xf MyBackup.tar memo.ben
This command uses the x switch to extract a file. Here the file memo.ben will be
restored from the tarfile
MyBackup.tar.
11. You need to view the contents of the tarfile called MyBackup.tar. What command
would you use?
tar tf MyBackup.tar
The t switch tells tar to display the contents and the f modifier specifies which
file to examine.
12. You want to create a compressed backup of the users� home directories. What
utility should you use?
tar
You can use the z modifier with tar to compress your archive at the same time as
creating it.
13. What daemon is responsible for tracking events on your system?
syslogd
The syslogd daemon is responsible for tracking system information and saving it to
specified log files.
14. You have a file called phonenos that is almost 4,000 lines long. What text
filter can you use to split it into
four pieces each 1,000 lines long?
split
The split text filter will divide files into equally sized pieces. The default
length of each piece is 1,000 lines.
15. You would like to temporarily change your command line editor to be vi. What
command should you
type to change it?
set -o vi
The set command is used to assign environment variables. In this case, you are
instructing your shell to
assign vi as your command line editor. However, once you log off and log back in
you will return to the
previously defined command line editor.
16. What account is created when you install Linux?
root
Whenever you install Linux, only one user account is created. This is the superuser
account also known as
root.
17. What command should you use to check the number of files and disk space used
and each user�s defined
quotas?
repquota
The repquota command is used to get a report on the status of the quotas you have
set including the amount
of allocated space and amount of used space.

What is the difference between Oracle, SQL and SQL Server?

. Oracle is an RDBMS product.
. SQL is the Structured Query Language used to work with relational databases.
. SQL Server is another RDBMS product, provided by Microsoft.

Why do you need indexing? Where is it stored and what do you mean by schema object? For what purpose do we use a view?

We can't create an index on an index. Index information is stored in the user_indexes view. Every object that has been created in a schema is a schema object, like a table, view etc. If we want to share particular data with various users, we can use a virtual table over the base table; that is a view.

Indexing is used for faster searches, i.e. to retrieve data faster from various tables. A schema contains a set of tables; basically, a schema is a logical separation of the database. A view is created for easier retrieval of data; it is a customized virtual table, and we can create a single view over multiple tables. The only drawback is that a (materialized) view needs to be refreshed to retrieve updated data.

Difference between Store Procedure and Trigger?

. We can call a stored procedure explicitly.
. But a trigger is automatically invoked when the action defined in the trigger is done; for example, a trigger created for INSERT on a table is invoked after we insert something into that table.
. A stored procedure can't be made inactive, but a trigger can be disabled.
. Triggers are used to initiate a particular activity after fulfilling certain conditions. They need to be defined and can be enabled and disabled according to need.

What is the advantage to use trigger in your PL?

Triggers are fired implicitly on the tables/views on which they are created. There are various advantages of using a trigger. Some of them are:
. Suppose we need to validate a DML statement (insert/update/delete) that modifies a table; then we can write a trigger on the table that gets fired implicitly whenever a DML statement is executed on that table.
. Another reason for using triggers can be the automatic update of one or more tables whenever a DML/DDL statement is executed on the table on which the trigger is created.
. Triggers can be used to enforce constraints. For example, insert/update/delete statements should not be allowed on a particular table after office hours; for enforcing this constraint, triggers should be used.
. Triggers can be used to publish information about database events to subscribers. A database event can be a system event, like database startup or shutdown, or it can be a user event, like a user logging in or logging off.

What the difference between UNION and UNIONALL?

UNION will remove the duplicate rows from the result set while UNION ALL doesn't.
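A tiny sketch of the difference, using literal SELECTs:

-- Hedged sketch.
SELECT 1 AS val UNION     SELECT 1   -- returns one row: duplicates removed
SELECT 1 AS val UNION ALL SELECT 1   -- returns two rows: duplicates kept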

What is the difference between TRUNCATE and DELETE commands?

Both will result in deleting all the rows in the table. TRUNCATE is a DDL command; in many databases (e.g. Oracle) it cannot be rolled back, and the space for that table is released back to the server, which makes TRUNCATE much faster. DELETE is a DML command and can be rolled back.

Which system table contains information on constraints on all the tables created?

USER_CONSTRAINTS: this data dictionary view contains information on the constraints on all the tables created.

Explain normalization?
Normalization means removing redundancy and maintaining data integrity. There are several normal forms: first normal form, second normal form, third normal form and fourth normal form.

How to find out the database name from the SQL*Plus command prompt?
Select * from global_name;
This will give the database name to which you are currently connected.

What is the difference between SQL and SQL Server ?


SQL Server is an RDBMS from Microsoft, just like Oracle or DB2, whereas Structured Query Language (SQL), pronounced "sequel", is a language that provides an interface to relational database systems. It was developed by IBM in the 1970s for use in System R. SQL is a de facto standard, as well as an ISO and ANSI standard. SQL is used to perform various operations on an RDBMS.

What is diffrence between Co-related sub query and nested sub query?

Correlated subquery runs once for each row selected by the outer query. It contains
a reference to a value
from the row selected by the outer query.

Nested subquery runs only once for the entire nesting (outer) query. It does not
contain any reference to
the outer query row.

For example,

Correlated Subquery:

select e1.empname, e1.basicsal, e1.deptno from emp e1
where e1.basicsal = (select max(basicsal) from emp e2 where e2.deptno = e1.deptno)

Nested Subquery:

select empname, basicsal, deptno from emp
where (deptno, basicsal) in (select deptno, max(basicsal) from emp group by deptno)

WHAT OPERATOR PERFORMS PATTERN MATCHING?


The pattern matching operator is LIKE, and it is used with two wildcard characters:

1. % and

2. _ (underscore)

% matches zero or more characters and underscore matches exactly one character.
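A small sketch, reusing the emp/empname example table from earlier in this document:

-- Hedged sketch.
SELECT empname FROM emp WHERE empname LIKE 'S%'     -- names starting with S
SELECT empname FROM emp WHERE empname LIKE '_ohan'  -- any one character, then 'ohan'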

1) What is the difference between Oracle and MS Access?

2) What are the disadvantages of Oracle and MS Access?
3) What are the features & advantages of Oracle and MS Access?
Oracle's features for distributed transactions, materialized views and replication
are not available with
MS Access. These features enable Oracle to efficiently store data for multinational
companies across the
globe. Also these features increase scalability of applications based on Oracle.

What is database?
A database is a collection of data that is organized so that its contents can easily be accessed, managed and updated. Open this URL: http://www.webopedia.com/TERM/d/database.html

What is a clustered index and a non-clustered index?

Clustered Index:- A clustered index is a special type of index that reorders the way records in the table are physically stored. Therefore a table may have only one clustered index.
Non-Clustered Index:- A non-clustered index is a special type of index in which the logical order of the index does not match the physical stored order of the rows on disk. The leaf nodes of a non-clustered index do not consist of the data pages; instead the leaf nodes contain index rows.

How can I hide a particular table name of our schema?

You can hide the table name by creating a synonym.

e.g.) you can create a synonym y for table x:

create synonym y for x;

What is difference between DBMS and RDBMS?


The main difference between DBMS and RDBMS is that

an RDBMS is based on the relational model and supports normalization (removing redundancy and maintaining data integrity), whereas a plain DBMS does not have the normalization concept.

What are the advantages and disadvantages of primary key and foreign key in SQL?

Primary key

Advantages

1) It is a unique key on which all the other attributes are functionally dependent.

Disadvantage

1) There can be more than one key on which all the other attributes are dependent.

Foreign Key
Advantage

1) It allows referencing another table using the primary key of the other table.

Which date function is used to find the difference between two dates?
datediff

For example: select datediff(dd, '2-06-2007', '7-06-2007')

output is 5

This article is a step-by-step guide for those who want to install the Oracle 10g database on their computer. This document provides guidelines to install the Oracle 10g database in a Microsoft Windows environment. If you use an operating system other than Microsoft Windows, the process is not very different from that of Microsoft Windows, since Oracle uses the Oracle Universal Installer to install its software.

For more information about installing Oracle 10g under operating systems other than
Microsoft
Windows, please refer to this URL :

http://www.oracle.com/pls/db102/homepage

How to get Oracle 10g :

You can download the Oracle 10g database from www.oracle.com. You must register and create an account before you can download the software. The example in this document uses
Oracle Database 10g
Release 2 (10.2.0.1.0) for Microsoft Windows.

How to uninstall Oracle database software :

1. Uninstall all Oracle components using the Oracle Universal Installer (OUI).
2. Run regedit.exe and delete the HKEY_LOCAL_MACHINE/ SOFTWARE/ ORACLE key. This contains registry entries for all Oracle products.
3. Delete any references to Oracle services left behind in the following part of the registry: HKEY_LOCAL_MACHINE/ SYSTEM/ CurrentControlSet/ Services/ Ora*. It should be pretty obvious which ones relate to Oracle.
4. Reboot your machine.
5. Delete the C: \Oracle directory, or whatever directory is your Oracle_Base.
6. Delete the C:\Program Files \Oracle directory.
7. Empty the contents of your c:\temp directory.
8. Empty your recycle bin.

Installing Oracle 10g database software :

1. Insert Oracle CD , the autorun window opens automatically. If you are installing
from network or
hard disk, click setup.exe in the installation folder.
2. The Oracle Universal Installer (OUI) will run and display the Select
Installation
MethodWindow.

3. Choose Basic Installation:


Select this option to quickly install Oracle Database 10g. This method requires
minimal user
input. It installs the software and optionally creates a general-purpose database
based on the
information you provide.
For basic installation, you specify the following:
Oracle Home Location - Enter the directory in which to install the Oracle Database 10g software. You must specify a new Oracle home directory for each new installation of Oracle Database 10g. Use the default value, which is:

c:\oracle\product\10.2.0\db_1

Installation Type - Select Enterprise Edition:


If you have limited space, select standard edition. Personal edition installs the
same software as
the Enterprise Edition, but supports only a single-user development and deployment
environment.
Create Starter Database - Check this box to create a database during installation. Oracle recommends that you create a starter database for first-time installations. Choose a Global Database Name, like cs157b, or just use the default value. Type a password. Don't lose this password, since you will need it to connect to the database server.
Click next

4. The Product-Specific Prerequisite Checks window appears: Click next


5. A summary screen appears showing information such as your global settings, space requirements and the new products to be installed. Click Install to start the installation.
6. The Install window appears showing installation progress.
7. At the end of the installation phase, the Configuration Assistants window
appears. This window
lists the configuration assistants that are started automatically.
If you are creating a database, then the Database Configuration Assistant starts automatically in a separate window.

At the end of database creation, you are prompted to unlock user accounts to make
the accounts
accessible. The SYS and SYSTEM accounts are already unlocked. Click OK to bypass
password
management.
Note: Oracle 10g still keeps the scott / tiger username and password (UID=scott, PWD=tiger) from the old versions of Oracle. In the old versions of Oracle, the scott/tiger user ID is available by default, but not in Oracle 10g. If you want to use the scott/tiger account, you must unlock it by clicking 'Password Management' at the last window.
The Password Management window will appear; find the user name 'SCOTT' and uncheck the 'Lock Account?' column for that user name.

8. Your installation and database creation is now complete. The End of Installation
window
displays several important URLs, one of which is for Enterprise Manager.
9. You can navigate to this URL in your browser and log in as the SYS user with the
associated
password, and connect as SYSDBA. You use Enterprise Manager to perform common
database
administration tasks
Note: You can access Oracle Enterprise Manager from a browser by typing the URL shown above. Instead of typing the IP address, you can also reach Enterprise Manager at http://localhost:1158/em or "http://[yourComputerName]:1158/em", or by clicking "Start >> All Programs >> Oracle - [YourOracleHome_home1] >> Database Control - [yourOracleID]" in the Windows menu.
By default, use the user ID "SYSTEM", with the password that you chose at the beginning of the installation, to connect to the database, SQL*Plus, etc. If you want to use another user ID, you may create a new user, as sketched below.
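For instance, a new user can be created from SQL*Plus along these lines; this is only a minimal sketch, and the user name, password and tablespace are placeholders to adapt to your environment:

  -- Connect as a privileged user such as SYSTEM first
  CREATE USER demo_user IDENTIFIED BY demo_password
    DEFAULT TABLESPACE users
    QUOTA UNLIMITED ON users;

  -- Let the new user log in and create its own tables
  GRANT CONNECT, RESOURCE TO demo_user;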

Data Modeling

What is Data Model

A Data Model is a logical map that represents the inherent properties of the data independently of software, hardware, or machine performance considerations. The model shows data elements grouped into records, as well as the associations around those records.

Since the data model is the basis for data implementation regardless of software or hardware platform, the data model should describe the data in an abstract manner that does not mention details specific to any hardware or software, such as bit manipulation or index creation.

There are two generally accepted meanings of the term data model. The first is that the data model is a theory about the formal description of the data's structure and use, without any mention of heavy technical terms related to information technology. The second is that a data model instance is the application of that theory in order to meet the requirements of some application, such as those used in a business enterprise.

The structural part of a data model theory refers to the collection of data structures which make up the database when it is created. These data structures represent the entities and objects in the database model. For instance, the data model may be that of a business enterprise involved in the sale of toys.

The real-life things of interest would include customers, company staff and of course the toy items. Since the database that will keep records of these things of interest cannot understand the real meaning of customers, company staff and toy items, a data representation of these real-life things has to be created.

The integrity part of a data model refers to the collection of rules which govern the constraints on the data structures so that structural integrity can be achieved. In the integrity aspect of a data model, an extensive set of rules and the consistent application of data are formally defined so that the data can be used for its intended purpose.

Techniques are defined on how to maintain data in the data resource and to ensure that the data consistently contains values that are loyal to the source while remaining accurate at the destination. This ensures that data will always have data value integrity, data structure integrity, data retention integrity, and data derivation integrity.

The manipulation part of a data model refers to the collection of operators which can be applied to the data structures. These operations include querying and updating data within the database. This is important because not all data can be altered or deleted. The data manipulation part works hand in hand with the integrity part so that the data model can deliver high quality in the database for the data consumers.

As an example, let us take the relational model. Its structural part is built on the mathematical concept of a relation: data is represented as an n-ary relation, that is, a subset of the Cartesian product of n domains.

The integrity part is expressed in first-order logic, and the manipulation part refers to relational algebra as well as tuple and domain calculus.
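As a rough illustration of these three parts in SQL terms (the table and column names here are invented for the example, not taken from any particular system):

  -- Structural part: a data structure (relation) for toy sales
  CREATE TABLE toy_sales (
    sale_id     NUMBER,
    customer_id NUMBER,
    toy_name    VARCHAR2(50),
    quantity    NUMBER
  );

  -- Integrity part: rules the data must satisfy
  ALTER TABLE toy_sales ADD CONSTRAINT pk_toy_sales PRIMARY KEY (sale_id);
  ALTER TABLE toy_sales ADD CONSTRAINT ck_quantity CHECK (quantity > 0);

  -- Manipulation part: operators that query and update the structure
  SELECT toy_name, SUM(quantity) FROM toy_sales GROUP BY toy_name;
  UPDATE toy_sales SET quantity = quantity + 1 WHERE sale_id = 1;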

The process of defining a data model is extremely important in any database implementation, because a single data model is the basis for a wide variety of data implementations. Hence, any database management system such as Access, Oracle or MySQL implements and maintains a database based on one data model only.

What is Data Modeling

Data Modeling is a method used to define and analyze data requirements needed to
support
the business functions of an enterprise. These data requirements are recorded as a
conceptual data model with associated data definitions. Data modeling defines the
relationships between data elements and structures.

Data modeling can be used for a wide array of purposes. It is an act of exploring data-oriented structures without considering the specific applications in which the data will be used. It is like a conceptual definition of an entity and its real-life counterparts, which can be anything of interest to the organization implementing a database.

Data models are the products of data modeling. In general, there are three data model styles: the conceptual data model, the logical data model and the physical data model.

The conceptual data model is often called the domain model. It describes the
semantics of a
business organization as this model consists of entity classes which represent
things of
significance to an organization and the relationships of these entities.
Relationships are
defined as assertions about associations between various pairs of entity classes.
The
conceptual data model is commonly used to explore domain concepts with project
stakeholders. Conceptual models may be created to explore high level static
business
structures and concepts, but they can also be used as precursors or alternatives to logical data models.

The logical data model is used in exploring domain concepts and other key areas
such as
relationships and domain problems. The logical data models could be defined for the
scope
of a single project or for the whole enterprise. The logical data model describes
semantics
related to particular data manipulation methods and such descriptions include those
of
tables, columns, object oriented classes, XML tags and many other things. The
logical data
model depicts some logical entity types, the data attributes to describe those
entities and
relations among the entities.

The physical data model is used in the design of the internal database schema. This
design
defines data tables, data columns for the tables and the relationships among the
tables.
Among other things, the physical data model is concerned with descriptions of the physical means by which data is stored. This storage aspect embraces concerns such as hard disk partitioning, CPU usage optimization, creation of tablespaces and others.

Application developers need to understand the fundamentals of data modeling so that their applications can be optimized. It should be noted that the tasks involved in data modeling may be performed in an iterative manner. These tasks include the following: identifying entity types, identifying attributes, applying naming conventions, identifying relationships, applying data model patterns, assigning keys, normalizing to reduce data redundancy and de-normalizing to improve performance.

Data modeling also focuses on the structure of the data within a domain. This structure is specified in a dedicated grammar for an artificial language used for that domain. But as always, the description of the data structure never mentions a specific implementation or a specific database management system vendor.

Sometimes, having different data modelers could lead to confusion as they could
potentially
produce different data models within the same domain. The difference could stem
from
different levels of abstraction in the data models. This can be overcome by coming
up with
generic data modeling methods.

For instance, generic data modeling can take advantage of generic patterns in a business organization. An example is the concept of a Party, which includes both Persons and Organizations. A generic data model for this entity may be easier to implement without creating conflicts along the way, as the sketch below suggests.
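A minimal relational sketch of this Party pattern might look like the following; the table and column names are purely illustrative:

  -- Generic supertype shared by persons and organizations
  CREATE TABLE party (
    party_id   NUMBER PRIMARY KEY,
    party_type VARCHAR2(20) CHECK (party_type IN ('PERSON', 'ORGANIZATION'))
  );

  -- Subtypes carry only the attributes specific to them
  CREATE TABLE person (
    party_id   NUMBER PRIMARY KEY REFERENCES party(party_id),
    first_name VARCHAR2(50),
    last_name  VARCHAR2(50)
  );

  CREATE TABLE organization (
    party_id   NUMBER PRIMARY KEY REFERENCES party(party_id),
    legal_name VARCHAR2(100)
  );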

What is Common Data Model

This data model represents events, entities and objects in the real world that are
of interest
to the company. It is subject oriented and includes all aspects of the real world,
primarily
activities pertaining to the business.

To use lay terms, a data model can be considered a road map to get one employee
from
point A to point B in the least mileage, most scenery and shortest time of travel.

In the science of computing, data models are structured and organized data
structures that
are implemented in a database management system. Aside from defining and organizing

business data, data modeling also includes implicitly and explicitly imposing
constraints and
limitations on the data within the data structure.

A data model may be an instance of a conceptual schema, a logical schema or a physical schema.

A conceptual schema is a description of the semantics of an organization. All the terms of the business, from the most minute details such as staff information to the most complex business transactions, are defined and translated into entity classes. Relationships among entities are also defined in a conceptual schema.

A logical schema is a description of the semantics of the conceptual schema as represented by a particular data manipulation technology. This schema is composed of particular descriptions of columns, tables, XML tags, object-oriented classes and others. Later on, these descriptions are used in the software application implementation to simulate the real-life activities of the business.

The physical schema, as the name implies, is the description of the physical means
for
storing data. This can include definitions for storage requirements in hard terms
like
computers, central processing units, network cables, routers and others.

Data Architects and Business Analysts usually work hand in hand to make an
efficient data
model for an organization. To come up with a good Common Data Model output, they
need
to be guided by the following:

1. They have to be sure about database concepts like cardinality, normalization and optionality;
2. They have to have in-depth knowledge of the actual rules of the business and its requirements;
3. They should be more interested in the final resulting database than in the data model.

A data model describes the structure of the database within a business and, in effect, the underlying structure of the business as well. It can be thought of as a grammar for an artificial language describing the business or any other undertaking.
The kinds of things in the real world are represented as entities in the data model. These entities can hold any information or attributes as well as relationships. Irrespective of how data is represented in the computer system, the data model describes the company data.

It is always advised to have a good conceptual data model to describe the semantics
of a
given subject area. A conceptual data model is a collection of assertions
pertaining to the
nature of information used by the company. Entities should be named with natural
language
instead of a technical term. Relationships which are properly named also form
concrete
assertions about the subject.

In large data warehouses, it is imperative that a Common Data Model be consistent and stable. Since companies may have several databases around the world feeding data to a central warehouse, a Common Data Model takes a lot of load off the central processing of the warehouse, because disparities among database sources have already been made seamless.
What is Common Data Modeling

Common Data Modeling is the process of defining the unifying structure that allows heterogeneous business environments to interoperate. A Common Data Model is very critical to a business organization.

Especially in today's business environment, where it is common to have multiple applications, a Common Data Model seamlessly integrates seemingly unrelated data into useful information and gives a company a competitive advantage over its competitors. Data warehouses make intensive use of data models to give companies an up-to-date picture of how the business is faring.

In Common Data Modeling, Business Architects and analysts need to face the data
first
before defining a common data or abstraction layer so that they will not be bound
to a
particular schema and thus make the Business Enterprise more flexible.

Business Schemas are the underlying definition of all business-related activities. Data Models are actually instances of Business Schemas - conceptual, logical and physical schemas. These schemas have several aspects of definition and they usually form a concrete basis for the design of the Business Data Architecture.

Data Modeling is actually a vast field, but having a Common Data Model for a certain domain can solve the problems that arise when many different models operate in the same environment. To make Common Data Models, modelers need to focus on one standard of data abstraction. They need to agree on which elements are to be concretely rendered so that uniformity and consistency are obtained.

Generic patterns can be used to attain a Common Data Model. Some of these patterns
include using entities such as "party" to refer to persons and organizations, or
"product
types", "activity type", "geographic area" among others. Robust Common Data Models
explicitly include versions of these entities.

A good approach to Common Data Modeling is to have a generic Data Model which consists of generic types of entity such as class, relationship, individual thing and others. Each instance of these classes can have subtypes.
The Common Data Modeling process may obey some of these rules:
1. Attributes are to be treated as relationships with other entities.
2. Entities are defined by the very nature of a business activity, rule, policy or structure, not by the role they play within a given context.
3. Entities must have a local identifier in an exchange file or database. This identifier must be unique and artificial, and relationships should not be part of the local identifier.
4. Relationships, activities and the effects of events should not be represented by attributes but by the type of entity.
5. Types of relationships should be defined at a generic or high level. The highest level is defined as a relationship between one individual thing and another individual thing.

Data Modeling often uses the Entity-Relationship Model (ERM). This model is a
representation of structured data. This type of Data Modeling can be used to
describe any
ontology (the term used to describe the overview and classification of terms and
their
respective relationships) for a certain area of interest.

What is Common Data Modeling Method

Common Data Modeling is one of the core considerations when setting up a business
data
warehouse. Any serious company wanting to have a data warehouse will have to be
first
serious about data models. Building a data model takes time and it is not unusual
for
companies to spend two to five years just doing it.

Data Models should reflect practical, real-world operations, and that is why a common data modeling method combining forward, reverse and vertical methods makes perfect sense: it seamlessly integrates disparate data coming in, whether top down or bottom up, from different sources and triggering events.

Professionals involved in Enterprise Data Modeling projects understand the great importance of accurately reflecting what exactly happens in an industry without having to create entities artificially. It can be easy to overlook and sidestep issues which are analytically difficult, issues people have no experience of, or issues which may be politically sensitive. When these are sidestepped, data models can become seriously flawed.

Business Architects, analysts and data modelers work together to look around and
look for
the best practices found in the industry. These best practices are then synthesized
into the
enterprise model to reflect the current state of the business and the future it
wants to get
into.

A good Enterprise Data Model should strike a balance between conceptual entities
and
functional entities based on practical, real and available industry standard data.
Conceptual entities are defined within the company and will take on the data values defined by the company. Examples of conceptual entities are product status, marital status, customer types, etc.

On the other hand, functional entities refer to entities that are already well
defined, industry
standard data ready to be placed into database tables. Examples of functional
entities are
D&B Paydex Rating and FICO Score.

Businesses usually start simply and grow more complex as they progress. A business may start by selling goods or providing services to clients. The goods and services delivered, as well as the money received, are recorded and then reused. Over time, transactions pile up on one another and the setup gets more and more complex. Despite the complexity, the business is still essentially a simple entity that has just grown in complexity.

This happens when the business does not have a well-defined common data modeling method. Many software applications cannot provide ways to integrate real-world data with the data within the data architecture.

This scenario, where there is no common business model, can worsen when multiple disparate systems are used within the company and each system has a differing view of the underlying data structures.

Business Intelligence can perform more efficiently with a Common Data Modeling Method. As its name implies, Business Intelligence processes billions of data items from the data warehouse so that a variety of statistical analyses can be reported and recommendations for innovation, which give the company more competitive edge, can be presented.

With a Common Data Modeling Method, processing can be made faster because the internal structure of the data is closer to reality than it would be without the data model. It should be noted that today's businesses commonly have data sources spread across many geographical locations.

A Look at the Entity-Relationship

Entity-Relationship

The Entity-Relationship or E-R model deals with real-world entities; it includes a set of objects and the relationships among them. An entity is an object that exists and is easily distinguishable from others. People, for instance, can be distinguished from one another by various means, such as social security numbers.
http://www.learn.geekinterview.com/images/dm01.gif
http://www.learn.geekinterview.com/images/dm02.gif

An entity can be concrete, like a book, a person or a place, or abstract, like a holiday. An entity set is a set of entities that share something; for example, the multiple holders of bank accounts would be considered an entity set. Entity sets do not need to be disjoint.

Here is an example: the entity set employee (all employees of a bank) and the entity set customer (all customers of the bank) may have members in common, since an employee may also be a customer of the bank. This puts such a person into both sets.

We must keep in mind that an entity or entity set is defined by a set of attributes. An attribute is a function which maps an entity set into a domain. Every entity is described by a set of (attribute, data value) pairs. There is one pair for each attribute of the entity set.

To illustrate in simpler terms, consider the following.

. A bank has employees, and customers.

. Employees are an entity set defined by employee numbers.

. Customers are an entity set defined by account numbers.

. Employees can be customers as well, and will have possession of both employee
numbers and account numbers.
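Purely as an illustration, the bank example above could be rendered as relational tables along these lines (all table and column names are invented for this sketch):

  CREATE TABLE employees (
    employee_number NUMBER PRIMARY KEY,
    employee_name   VARCHAR2(50)
  );

  CREATE TABLE customers (
    account_number  NUMBER PRIMARY KEY,
    customer_name   VARCHAR2(50),
    -- an employee who is also a customer carries both identifiers
    employee_number NUMBER REFERENCES employees(employee_number)
  );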
Relationships and Relationship Sets

A relationship is an association between several entities. A relationship set is a set of relationships of the same kind, associating entities drawn from one or more entity sets; basically, it is a mathematical relation. Consider the following:

If E1, E2, ..., En are entity sets, then a relationship set R is a subset of

{ (e1, e2, ..., en) | e1 ∈ E1, e2 ∈ E2, ..., en ∈ En }

where each tuple (e1, e2, ..., en) expresses an actual relationship. Though it appears a little complicated, with some practice it is no more challenging than reading a simple sentence.

One should remember that the role of an entity is the function it has in a relationship. Consider an example: the relationship "works-for" could consist of ordered pairs of employee entities. The first employee entity takes the role of a manager or supervisor, whereas the other one takes on the role of worker or associate.

Relationships can also have descriptive attributes. This can be seen in the example of a date (such as the last date of access to an account); this date is an attribute of the customer-account relationship set.
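A relationship set like this is often stored as its own table whose columns hold the keys of the participating entities plus the descriptive attribute. A hedged sketch, reusing the invented employees table from the earlier example:

  -- The "works-for" relationship set between employee entities,
  -- with a descriptive attribute (all names are hypothetical)
  CREATE TABLE works_for (
    manager_number NUMBER REFERENCES employees(employee_number),
    worker_number  NUMBER REFERENCES employees(employee_number),
    since_date     DATE,  -- descriptive attribute of the relationship
    PRIMARY KEY (manager_number, worker_number)
  );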

Attributes

A particular set of entities and the relationships between them can be defined in a number of ways. The differentiating factor is how you deal with the attributes. Consider a set of employees as an entity; this time let us say that the attributes of the set are employee name and phone number.

In some instances the phone number should be considered an entity of its own, with attributes of its own, such as its location and the number itself. Then we have two entity sets, and the relationship between them goes through the phone number. This defines the association not only between the employees but also between the employees and their phone numbers, and it allows us to reflect the real world more accurately.
Basically, what constitutes an entity and what constitutes an attribute depends largely on the structure of the situation being modeled, as well as on the semantics associated with the attributes in question.
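The two design choices can be sketched in SQL roughly as follows; the names are illustrative only, and the first variant keeps the phone number as a plain attribute while the second promotes it to an entity of its own:

  -- Variant 1: phone number as an attribute of the employee entity
  CREATE TABLE staff (
    employee_number NUMBER PRIMARY KEY,
    employee_name   VARCHAR2(50),
    phone_number    VARCHAR2(20)
  );

  -- Variant 2: phone number as an entity in its own right
  CREATE TABLE phone_numbers (
    phone_number VARCHAR2(20) PRIMARY KEY,
    location     VARCHAR2(30)
  );

  CREATE TABLE staff_phone (
    employee_number NUMBER REFERENCES staff(employee_number),
    phone_number    VARCHAR2(20) REFERENCES phone_numbers(phone_number),
    PRIMARY KEY (employee_number, phone_number)
  );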

Let us now look at an example of an entity-relationship graph.

[Figure: an example entity-relationship diagram]

We can easily express the logical structure of a database pictorially with an entity-relationship diagram, but only when we keep in mind what components are involved in creating this type of model.

The vital components of an entity relationship model include:

* Rectangles, representing entity sets.


* Ellipses, representing attributes.
* Diamonds, representing relationship sets.
* Lines, linking attributes to entity sets and entity sets to relationship sets.

A diagram built from these components illustrates how an Entity-Relationship model works.

Entity-Relational Diagram Styles

Some of the different variations of the Entity-Relational diagram you will see are:
* Diamonds are omitted - a link between entities indicates a relationship.
Fewer symbols means a clearer picture, but what happens with descriptive attributes? In this case, we have to create an intersection entity to hold the attributes instead.

* There can be numbers instead of arrowheads to indicate cardinality.


The symbols, 1, n and m can be used. E.g. 1 to 1, 1 to n, n to m. Some feel this is
easier
to understand than arrowheads.

* Also we can use a range of numbers to indicate the different options of a relationship.

E.g. (0, 1) is used to indicate minimum zero (optional), maximum 1. We can also use
(0,n),
(1,1) or (1,n). This is typically used on the near end of the link - it is very
confusing at first,
but this structure gives us more information.

* Multi-valued attributes can be indicated in a certain manner.


This means attributes are able to have more than one value. An example of this is
hobbies.
Still this structure has to be normalized at a later date.

* Extended Entity-Relationship diagrams allow more details or constraints in the real world to be recorded.

This allows us to map composite attributes and record derived attributes. We can
then use
subclasses and super classes. This structure is generalization and specialization.

Summary

Entity-Relationship diagrams are a very important data modeling tool that can help
organize
the data in a project into categories defining entities and the relationships
between entities.
This process has proved time and again to allow the analyst to create a nice
database
structure and helps to store the data correctly.

Entity

A data entity represents a real or abstract thing about which data is being stored. Entities fall into classes (roles, events, locations, and concepts); examples are employees, payments, campuses, books, and so on. Specific examples of an entity are called instances.

Relationship

A relationship is a natural association that exists between one or more entities, such as "employees process payments". Cardinality is the number of occurrences of one entity for a single occurrence of the related entity; for example, an employee may process many payments, or none at all, depending on the nature of the job.

Attribute

An attribute represents a common characteristic of a particular entity or entity set. The employee number and pay rate are both attributes. An attribute, or a combination of attributes, that identifies one and only one instance of an entity is called a primary key or an identifier. For example, an employee number is an identifier.
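Pulling these three ideas together, a hedged relational sketch of the employee-payment example might look like this (all names are invented for the illustration):

  CREATE TABLE employee (
    employee_number NUMBER PRIMARY KEY,  -- identifier (primary key)
    pay_rate        NUMBER(8,2)          -- another attribute
  );

  CREATE TABLE payment (
    payment_id      NUMBER PRIMARY KEY,
    amount          NUMBER(10,2),
    -- relationship: an employee processes zero or many payments
    employee_number NUMBER REFERENCES employee(employee_number)
  );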

Concept Oriented Model

What is a Concept Oriented Model?

The concept-oriented model is proclaimed to be the next level in data modeling. The method is based on the assumption that data elements are concepts, where each concept is a combination of a number of super-concepts.

The concepts are complemented by a top concept and a bottom concept. This structure constitutes a lattice, which is an order satisfying certain properties. Each item is then defined as a combination of other super-items taken from the related super-concepts.
In simple words, the Concept Oriented model is based on lattice theory, an ordering of sets. Each concept is defined as a combination of its super-concepts. The top concept provides the most abstract view, with no items, whereas the bottom concept is more specific and provides a much more detailed representation of the model.

The syntax structure offers what is commonly referred to as a multi-dimensional


hierarchical
space where the items are found. The sub-concepts of the top concepts are also
called
primitive concepts.
The semantics of this model are represented by the data items; each item is
actually a
combination of its related super-items. Each item has a non-specified number of
sub-items
from the corresponding sub-concepts. The path from model semantics onto
dimensionality
of the model leads from the current concept to some of its corresponding super-
concepts.

Each step within this path normally has a name that is in context to its
corresponding
concept. The number of these paths from the top of the model to the bottom of the
model is
its dimensionality. Now each dimension corresponds to one variable or one attribute, and is thus supposed to be one-valued.

All one valued attributes are also directed upward within the structure. Yet if we
were to
reverse the direction of the dimensions then we would have what is often called
sub-
dimensions or reverse dimensions.

These dimensions also correspond to attributes or properties, but they take many values from the sub-concepts rather than from the super-concepts, which form normal dimensions.
After we have explored the dimensionality of the concept model structure we move
forward
to address the relations between the concepts.

When speaking of relations, each concept is related to its super-concepts, yet the concept itself also acts as a relation with regard to those super-concepts; being a relation is a relative role. More specifically, each item is a single instance of a relation for its corresponding super-items, and it is an object linked to other objects by means of the relations formed by its sub-items. This brings us to grouping and/or aggregation.
Let's continue to think of these items in terms of relations; this way we can imagine that each item has a number of "parents" from the super-concepts as well as a number of sub-items from the sub-concepts.

An item is interpreted as a group, set or category for its sub-items, yet it is also a member of the sets, groups and categories represented by its super-items. You can see the dual roles of the items when you consider them in this light.

Continuing to think of our items in this light, we can now see that each problem domain represented in a concept model has differing levels of detail.

Near the top it is represented as a single element, like an organization as a whole. However, we can still propagate information from the lower levels to the top level and create the aggregated feature we need by finding the sub-items included in the parent item and then applying the aggregation task.

We will finish up this section by quickly touching on the topics of multi-dimensional analysis and constraints. First let's address multi-dimensional analysis.

We can indicate multiple source concepts and enforce input constraints upon them. The constraints are then propagated downward to the bottom level, which is the most specific level of all.

Once this step is completed, the result is transported back up the chain of levels toward a target concept. We can then perform an operation of moving one of the source concepts downward, basically choosing one of its sub-concepts with more detail; this is called Drill Down. We also have an operation known as Roll Up, which is the process of moving up by selecting some super-concept with less detail.

With all this talk of constraints, you are probably wondering what they are as well. Simply put, for each concept we can indicate constraints that the corresponding items need to satisfy. That forces us to describe the actual properties by indicating a path resembling a zigzag pattern in the concept structure.

The zigzag path then goes up when needed to get more detailed information; it also
goes
down to retrieve more general information. Using this we can easily then express
the
constraints in terms of the other items and where they can be found.

What are Concepts?


The Concept-oriented model deals with concepts rather than classes. A concept is a combination of two classes: a reference class and an object class. When a concept does not define a specific reference class, it is equivalent to a conventional class.

Objects and references both have corresponding structures and behavioral methods. A simple consequence of having concepts is that objects are represented and accessed indirectly; this is done by the concept using custom references with domain-specific structures and functions.
Now you might also want to know what a sub-concept or a super-concept is. A super-concept is a concept that is combined with other concepts in the definition of a given concept. An example would be the concept Orders = <Addresses, Customers>, which has two super-concepts, Addresses and Customers.

Remember that there is always an upward-directed arrow in the concept graph from a sub-concept to each of its corresponding super-concepts; the sub-concept is associated with the start of that arrow and the super-concept with its end.

Sub-concepts can likewise combine two parts, such as Order Parts = <Products, Orders> or Order Operations = <Orders, Operations>.
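Purely as a loose illustration in familiar relational terms - this is not the concept-oriented model's own notation, and every name below is invented - the upward arrows from Orders to its super-concepts, and from Order Parts to Products and Orders, resemble foreign-key references pointing upward:

  CREATE TABLE addresses       (address_id NUMBER PRIMARY KEY, city VARCHAR2(30));
  CREATE TABLE order_customers (customer_id NUMBER PRIMARY KEY, name VARCHAR2(50));
  CREATE TABLE products        (product_id NUMBER PRIMARY KEY, product_name VARCHAR2(50));

  -- Each ORDERS item combines one super-item from ADDRESSES and one from ORDER_CUSTOMERS
  CREATE TABLE orders (
    order_id    NUMBER PRIMARY KEY,
    address_id  NUMBER REFERENCES addresses(address_id),
    customer_id NUMBER REFERENCES order_customers(customer_id)
  );

  -- ORDER_PARTS is a sub-concept whose super-concepts are PRODUCTS and ORDERS
  CREATE TABLE order_parts (
    order_part_id NUMBER PRIMARY KEY,
    product_id    NUMBER REFERENCES products(product_id),
    order_id      NUMBER REFERENCES orders(order_id)
  );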

Now that we have dissected the main components of the Concept Oriented Model, we can understand its uses and functions with a bit more clarity. It is clearly aimed at associating data with the concepts that describe that data.

The Concept Oriented model is complex and definitely worthy of further research. Anyone whose curiosity has been sparked by this article is encouraged to look further into the model and to explore its additional functions and uses, since it can be applied to many situations.

Object-Relational Model

What is the Object-Relational Model?

The object-relational model is designed to provide a relational database management system that allows developers to integrate databases with their own data types and methods. It is essentially a relational model that lets users integrate object-oriented features into it.
This design is most recently shown in the Nordic Object/Relational Model. The primary goal of the object-relational model is to provide more power, greater flexibility, better performance, and greater data integrity than the models that came before it.

Some of the benefits that are offered by the Object-Relational Model include:

. Extensibility - Users are able to extend the capability of the database server by defining new data types as well as user-defined functions, which allows them to store and manage their own kinds of data.

. Complex types - Users can define new data types that combine one or more of the existing data types. Complex types give greater flexibility in organizing data in a structure made up of columns and tables.

. Inheritance - Users are able to define objects, types and tables that inherit the properties of other objects, as well as add new properties that are specific to the object that has been defined.

. A field may also contain an object with attributes and operations.

. Complex objects can be stored in relational tables.

Object-relational database management systems, also known as ORDBMS, add new and extensive object storage capabilities to the relational models at the center of today's more modern information systems.

These systems assimilate the management of conventional fielded data, of more complex objects such as time series or detailed geospatial data, and of binary media such as audio, video, images, and applets.

Because the model encapsulates methods with data structures, the ORDBMS server can execute complex analytical and data management operations to explore and change multimedia and other more complex objects.
What are some of the functions and advantages to the Object-Relational Model?

It can be said that the object-relational model is an evolutionary technology: this approach has taken on the robust transaction and performance management aspects of its predecessors and the flexibility of the object-oriented model (which we will address in a later article).

Database developers can work with somewhat familiar tabular structures and data definition languages, but with more power and capabilities, all the while assimilating new object management possibilities. The query and procedural languages and the call interfaces of object-relational database management systems are also familiar.

The main function of the object relational model is to combine the convenience of
the
relational model with the object model. The benefits of this combination range from

scalability to support for rich data types.

However, the relational model has to be substantially modified in order to support the classic features of object-oriented programming. This gives the object-relational model some specific characteristics.

Some of these characteristics include:

. Base Data type extension

. Support complex objects

. Inheritance (which we discussed in more detail above.)

. And finally Rule systems

Object-relational models allow users to define data types, functions, and operators. As a direct result, the functionality and performance of this model are optimized. The massive scalability of the object-relational model is its most notable advantage, and it can be seen at work in many of today's vendor products.
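In Oracle, for instance, the extensibility, complex-type and inheritance features look roughly like the following sketch (the type and table names are invented for the example):

  -- A user-defined object type combining existing data types
  CREATE TYPE address_t AS OBJECT (
    street VARCHAR2(50),
    city   VARCHAR2(30)
  ) NOT FINAL;
  /

  -- Inheritance: a subtype adds properties specific to it
  CREATE TYPE intl_address_t UNDER address_t (
    country VARCHAR2(30)
  );
  /

  -- A relational table whose column holds a complex object
  CREATE TABLE customer_sites (
    site_id      NUMBER PRIMARY KEY,
    site_address address_t
  );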
The History of the Object-Relational Model

As said before, the Object-Relational model is a combination of the Relational Model and the Object Oriented Model. The Relational Model made its way into the world of data in the 1970s; it was a hit, but it left developers wanting more flexibility and capability.

The Object Oriented model moved into the spotlight in the 1990s; the idea of being able to store object-oriented data was a hit, but what happened to the relational data? Later in the 1990s the Object-Relational model was developed, combining the advantages of its most successful predecessors, such as user-defined data types, user-defined functions, inheritance and sub-classes.

This model grew from research conducted in the 1990s. The researchers' main goal was to extend the capabilities of the relational model by including object-oriented concepts. It was a success.

What about Object Relational Mapping?

Object-Relational mapping is a programming method used to convert data between the incompatible type systems of relational databases and object-oriented languages. Here are some basics when it comes to mapping. Java classes, for example, can be mapped to relational database management system tables.

The easiest way to begin mapping between a persistent (enduring) class and a table is one-to-one. In a case such as this, all of the attributes of the persistent class are represented by all of the columns of the table, and each instance of the business class is in turn stored in a row of that table.

Though this particular type of mapping is pretty straightforward, it can conflict with existing object and entity-relation models (we discussed entities in the previous article). This is partially due to the fact that the goal of object modeling is to model an organizational process using real-world objects (we will discuss object modeling further in the following article), whereas the goal of entity-relational modeling is to normalize and retrieve data quickly.

Because of this, two types of class-to-table mapping methods have been adopted by most users to help overcome the issues caused by the differences between the relational and object models. The two methods are known as SUBSET mapping and SUPERSET mapping.

Let's talk briefly about these two methods.

With SUBSET mapping, the attributes of a persistent class (described above as an enduring class) represent either a section of the columns in a table or all of the columns in the table.
SUBSET mapping is used mostly when all of the attributes of a class are mapped to the same table. This method is also useful when a class is not concerned with a portion of the columns of its table in the database because they are not part of the business model.

SUBSET mapping is also used to create projection classes for tables with a sizable number of columns. A projection class contains enough information to enable the user to choose a row for complete retrieval from the database.

This essentially reduces the amount of information passed throughout the network. This type of mapping can also be used to help map a class inheritance tree to a table using filters.

Now let's consider SUPERSET mapping. With this method a persistent class holds attributes taken from the columns of more than one table. This particular method of mapping is also known as table spanning.

Mapping using the SUPERSET method is meant to create view classes that cover the underlying data model, or to map a class inheritance tree to a database by using a vertical mapping tactic.
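In database terms, the two ideas are roughly analogous to the following kinds of views; this is purely an illustrative sketch with invented names, and a real ORM tool would generate such mappings from class metadata:

  -- Base tables (invented for the sketch)
  CREATE TABLE person_core (person_id NUMBER PRIMARY KEY, full_name VARCHAR2(80));
  CREATE TABLE person_hr   (person_id NUMBER PRIMARY KEY, salary NUMBER(10,2));

  -- SUBSET-style mapping: a projection over some of the columns of one table
  CREATE VIEW person_projection AS
    SELECT person_id, full_name FROM person_core;

  -- SUPERSET-style mapping (table spanning): one class drawn from several tables
  CREATE VIEW person_superset AS
    SELECT c.person_id, c.full_name, h.salary
    FROM person_core c JOIN person_hr h ON c.person_id = h.person_id;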

The final word

There are millions of other aspects and advantages to this model. The Object-Relational model does what no single model before it could do. By combining the strongest points of those that came before it, this model has surpassed expectations and taken on a definitive role in database technology. Whatever models follow it, this model is here to stay.

The Object Model

What is the Object Model?

The Object model, also referred to as the object-oriented model, was designed to add database functionality to object programming languages. Object databases extend the semantics of object programming languages such as C++, Smalltalk and Java to provide full-featured database programming capability, while retaining native language compatibility.

A notable benefit of this particular approach is the unification of application and database development into a complete data structure and language environment. Applications then require less code, use more natural data modeling, and their code bases are much easier to maintain as a result. Developers can construct complete database applications with a modest amount of extra effort. Object models are also often used to show the connections between objects and collections.

Unlike the relational model, where a complicated data structure needs to be flattened to fit into tables or joined from those tables to form the in-memory structure, object models have little or no performance overhead when storing or retrieving a hierarchy of inter-related objects. The one-to-one mapping of programming-language objects to database objects has two major benefits over older storage methods. First, it offers higher-performance management of objects. Second, it allows better management of the more complex inter-relationships between objects. These two aspects make object modeling much better suited to applications such as financial portfolio risk analysis systems, telecommunications service applications, design and manufacturing systems, and even patient record systems, all of which have very complex relationships between data.

Are there different types of object models?

When you search the web for concrete information on the object model, don't be surprised to end up with mixed results, none of which plainly state "Object Model". You will instead turn up results for Document Object Models and Component Object Models. This is because the Object Model has been modified slightly to apply to different settings. We will touch on that before moving on.

So what exactly is a Document Object Model? You might often see it referred to as a DOM. This model is a platform- and language-neutral interface that allows programs or scripts to dynamically access and update the content, structure, and style of documents. The document can then be processed further and the results can be incorporated back into the contents of the page.

The Component Object Model, also referred to as COM, is basically a component software architecture that enables developers to build applications and systems from components supplied by different software vendors. COM is the underlying design that forms the foundation for some higher-level software services, including those provided by OLE. Any PC user may be surprised to learn that COM is also behind ActiveX, a technology we are all familiar with, especially those of us who spend a lot of time surfing the internet.

Many of the traditional operating systems were designed to deal only with application binaries and not with actual components. Because of this, the benefits of good component-oriented designs had never gone beyond the compilation step. In a world that is object-centric, it is confusing that our operating systems still cannot recognize objects; instead they deal only with application binaries, or EXEs. This prevented objects in one process from communicating with objects in a different process using their own defined methods.

History

The object model really hit the programming scene in the mid-1990s. Around October 1998 the first specification of the Document Object Model was released by the W3C; it was known as DOM 1. DOM 2 followed in 2000, surpassing the older version by including the style sheet object model and style information manipulation. Most recently, DOM 3 wowed the programming world with its release in 2004. Thus far there have been no more current releases; as of now we are still using the DOM 3 model, and it has served us well.

The history of the Component Object Model is a bit lengthier; we will summarize its more dramatic points. DDE was one of the very first methods of inter-process communication. It allowed sending and receiving messages between applications, which is sometimes referred to as a conversation between applications. At this point it is important to point out that Microsoft, with Windows, is the leading Component Object Model vendor, and the history of COM is based richly on the information and discoveries made there.

The budding technology of COM was the base of OLE, which means Object Linking and Embedding, one of the most successful technologies introduced with Windows. OLE was soon being added into applications like Word and Excel by 1991 and on into 1992. It was not until 1996 that the vendor truly realized the potential of the discovery: OLE custom controls could expand a web browser's capability enough to present content.

From that point the vendor has been integrating aspects of COM into many of its applications, such as Microsoft Office. There is no way to tell how far or how long the evolution of Object Modeling will travel; we need only sit back and watch as it transforms our software and applications into tools that help us mold and shape our future in technology.

Examples of Object models

The first example we will cover is the Document Object Model. This example is a remake of a more detailed example; the information has been reduced in order to focus on the important features of the model.

[Figure: Document Object Model example diagram]

By looking at the example, we can clearly see the process by which the Document Object Model is used. The sample model is designed to show the way in which the document is linked to each element and to the text that belongs to those elements.

Now we will take a quick look at a simple Component Object Model example. This particular example is based on one of the models provided by Windows itself.

[Figure: Component Object Model example diagram]

In the example there are two different types of arrows: the solid arrows indicate USES, whereas the dashed arrows represent OPTIONALLY USES. The boxes with green outlined text represent the aspects provided with WDTF, the blue highlighted text is what you implement or modify, the red text expresses the implementation of your own action interface, and the text highlighted in black indicates your operating system or driver API. By viewing this sample of the Component Object Model, we can see how the components are linked and the way in which they communicate with one another.

So after exploring the Object model, we can safely conclude that it serves an important purpose that no model before it was able to fill. Though the model has been adapted to fit specific settings, its main use is to model object data.

Windows is one of the more notable vendors who have put the Object Model in the
limelight; it will be interesting to see what heights this model reaches with their
assistance.
I will be keeping a close eye out for the next evolutionary change in the Object
Model.

The Associative Model

What is the Associative Data Model?

The Associative data model is a model for databases unlike any of those we spoke of in prior articles. Unlike the relational model, which is record based and deals with entities and attributes, this model works with entities that have a discrete, independent existence, and their relationships are modeled as associations.
The Associative model is based on a subject-verb-object syntax, with clear parallels to sentences built in English and other languages. Some examples of phrases that are suitable for the Associative model include:

. Cyan is a Color
. Marc is a Musician
. Musicians play instruments
. Swings are in a park
. A Park is in a City (the verbs here are "is a", "play", "are in" and "is in")

By studying the examples above it is easy to see that the verb is actually the means of association. The association's sole purpose is to identify the relationship between the subject and the object.

The Associative database has two structures: a set of items and a set of links that are used to connect them together. In the item structure, each entry must contain a unique identifier, a type, and a name. Entries in the links structure must also have a unique identifier, along with identifiers for the related source, subject, object, and verb.
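A minimal relational rendering of these two structures might look like the following sketch; the names are invented, and a real associative database engine does not expose its storage this way:

  CREATE TABLE items (
    item_id   NUMBER PRIMARY KEY,
    item_type VARCHAR2(30),
    item_name VARCHAR2(100)
  );

  -- Each link records subject - verb - object (all three point back to items),
  -- plus a source indicator
  CREATE TABLE links (
    link_id    NUMBER PRIMARY KEY,
    source_id  NUMBER,
    subject_id NUMBER REFERENCES items(item_id),
    verb_id    NUMBER REFERENCES items(item_id),
    object_id  NUMBER REFERENCES items(item_id)
  );

A phrase such as "Marc is a Musician" would then be one row in links whose subject, verb and object identifiers point at the items "Marc", "is a" and "Musician".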

How is the Associative Data Model different?

The Associative model structure is efficient with storage, for there is no need to set aside space for data that is not yet available. This differs from the relational model structure, where a minimum of a single null byte is stored for missing data in any given row, and where some relational databases even set aside the maximum room for a specified column in each row.

The Associative database makes the storage of custom data for each user, or for other needs, clear-cut and economical when considering maintenance or network resources. When different kinds of data need to be stored, the Associative model is able to manage the task more effectively than the relational model.

With the Associative model there are entities and associations. The entity is identified as discrete and has an independent existence, whereas the association depends on other things. Let's try to simplify this a little before moving on.

Let's say the entity is an organization; the associations would be its customers and its employees. It is possible for the entity to have many business roles at the same time, and each role would be recorded as an association. When circumstances change, one or more of the associations may no longer apply, but the entity will continue to endure.

The Associative model is designed to store metadata in the same structures where the data itself is stored. This metadata describes the structure of the database and how the different kinds of data can interconnect. Simple data structures alone need something more to deliver a database capable of storing the variety of data that a modern business requires, along with the protection and management that are important for internet implementations.
The Associative model is built from chapters, and the content of the database that a user sees is controlled by their profile, which is a list of chapters. When links exist between items in chapters inside as well as outside a specific profile, those links will not be visible to the user.

A combination of chapters and profiles can simplify tailoring the database to specific users or even subject groups. Data that is related to one user group remains unseen by another, and is replaced by a different data set.

Are there any potential disadvantages to the Associative Data Model?

With the Associative model there are no records. When assembling all of the current information on a complex order, the data storage needs to be revisited multiple times, which could be a disadvantage. Some calculations seem to suggest that an Associative database would need as many as four times the data reads of a relational database.

All changes and deletions in the Associative model are effected by adding links to the database. However, we must note that a deleted association is not actually deleted itself; rather, it is linked to an assertion that it has been deleted. Likewise, when an entity is renamed it is not actually renamed but rather linked to its new name.

In order to reduce the complexity that results from the parameterization required by heftier software packages, we can rely on chapters, profiles and database engines that expect the stored data to differ between individual entities or associations. To switch on or hold back program functions in a database, the use of "flags" has come into practice.

The packages that are based on an Associative model would use the structure of the
database along with the metadata to control this process. This can ultimately lead
to the
generalization of what are often lengthy and costly implementation processes.

A generalization such as this would produce considerable cost reductions for users purchasing or implementing bigger software packages, and it could reduce the risks related to post-implementation changes as well.

How well does the Associative Model suit the demands of data?

Some ask if there is still an ongoing demand for a better database. Honestly, there
will
always be that demand. The weaker points of the current relational model are now apparent, because the character of the data we need to store keeps changing. Binary structures that support multimedia have posed real challenges for relational databases, in the same way that object-oriented programming methods did.

When we look back on object databases, we can see that they have not conquered the market, and neither have their cousins, the hybrid relational products with object extensions.
So will the Associative model solve some of the issues surrounding the relational model? The answer is not entirely clear; though it may resolve some issues, it is not completely clear how efficiently the model will manage when set against bigger binary blocks of data.

The security of data is crucial, as is the speed of transactions. User interfaces and database management facilities should be up to the task. When a database is designed to support internet applications, it should also allow backups without needing to take the data offline.

Programming interfaces need to be robust and readily available to a range of development languages, and the Associative database will need to show that storing data using the subject-verb-object method is good practice in every case. There will always be questions about maintaining performance as the database grows; this should be expected.

So what's the verdict?

Areas of the Associative database design do seem simpler than the relational models; still, as we have pointed out, there are also areas that call for careful attention. There are issues related to the creation of chapters that remain daunting at best.

Even so, if the concept of the Associative model proves itself to be genuinely feasible and is able to bring forth a new and efficient database, then others could bring to life products built upon the base ideas of this model.

There is definitely an undeniable demand for a faster operating database model that
will
scale up to bigger servers and down to the smaller devices. It will be an
interesting journey
to witness; I personally would like to see if the future databases built using this
model can
make their mark in the market.

The Hierarchical Model



What is a Hierarchical Model?

The term Hierarchical Model covers a broad concept spectrum. It often refers to many setups, like multi-level models, where there are various levels of information or data all related by some larger form.

The Hierarchical model is similar to the Network model; it displays a collection of records in trees, rather than arbitrary graphs.

Here is an example of one type of conventional Hierarchical model:

You can see from the above figure that the supplementary information or details branch out from the main or core topic, creating a "tree"-like form. This allows for a visual relationship of each aspect and enables the user to track how the data is related.

There are many other ways to create this type of model; this is one of the simplest and is used the most often.

An example of information you would use the Hierarchical model to record would be the levels within an organization. The information would flow as follows:
http://www.learn.geekinterview.com/images/dm09.png

. An organization has several departments

. Each department has several subdivisions

. Each subdivision has sections

So the Hierarchical model for this scenario would look much like the one below. As you can see this model is substantially larger; the benefit of the Hierarchical model is that it allows for continuous growth, though it can take up a lot of room.

With each addition of data a new branch on the "tree" is formed, adding to the information as a whole as well as to its size.

Hierarchical models allow for a visual parent/child relationship between data sets, organizational information, or even mathematics.

The idea for these models is to begin with the smallest details; in the example above that would be the sections.

From the smallest details you would move up (it is often easiest to think of the model as a hierarchy) to the subdivisions, above the subdivisions you find departments, and finally you end at one "parent", the organization.
Once finished you can sit back and view the entire "family" of data and clearly distinguish how it is related.
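To make the parent/child idea concrete, here is a minimal sketch of how the organization hierarchy above could be stored as a self-referencing ("adjacency list") table in SQL. The table and column names are illustrative assumptions, not part of the original example.

-- Each row points to its parent row; the root has no parent.
CREATE TABLE org_unit (
    unit_id    INTEGER PRIMARY KEY,
    unit_name  VARCHAR(60) NOT NULL,
    parent_id  INTEGER REFERENCES org_unit(unit_id)  -- NULL for the top-level "parent"
);

INSERT INTO org_unit VALUES (1, 'Organization',  NULL);
INSERT INTO org_unit VALUES (2, 'Department 1',  1);
INSERT INTO org_unit VALUES (3, 'Subdivision 1', 2);
INSERT INTO org_unit VALUES (4, 'Section 1',     3);

-- List each unit next to its parent to see the tree one level at a time.
SELECT child.unit_name AS unit, parent.unit_name AS parent_unit
FROM org_unit child
LEFT JOIN org_unit parent ON child.parent_id = parent.unit_id;

Each child row carries exactly one parent_id, which mirrors the rule that data in this model can have many children but only one parent.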

How is the Hierarchical Model used?

The first mainframe database management systems were essentially the birthplace of the Hierarchical model.

The hierarchical relationships between varying data made it easier to seek and find specific information.

Though the model is ideal for viewing relationships concerning data, many applications no longer use it. Still, some are finding that the Hierarchical model is ideal for data analysis.

Perhaps the most well-known use of the Hierarchical model is the family tree, but people began realizing that the model could not only display the relationships between people but also those between mathematics, organizations, departments and their employees and employee skills; the possibilities are endless.

Simply put, this type of model displays hierarchies in data starting from one "parent" and branching into other data according to its relation to the previous data.

Commonly this structure is used with organizational structures to define the relationship between different data sets.
Normally this contains employees, students, skills, and so forth. Yet we are
beginning to see
the model used in more professional and meta-data oriented environments such as
large
organizations, scientific studies, and even financial projects.

Though the Hierarchical model is rarely used, some of its few remaining uses include file systems and XML documents.
The tree-like structure is ideal for relating repeated data, and though it is not currently applied often, the model can be applied to many situations.

Issues Related to Hierarchical Models

The Hierarchical model can present some issues when the focus is data analysis. There is the issue of independence of observations: when data is related it tends to share some type of background information linking it together, and therefore the data is not entirely independent.

However, most diagnostic methods require independence of observations as a key assumption for the analysis.

This assumption is violated in the case of hierarchical data; for example, ordinary least squares regressions produce standard errors that are too small.

Subsequently, this usually results in a greater likelihood of rejecting a null hypothesis than if:

(1) a suitable statistical analysis was performed, or

(2) the data truly contained independent observations.

Other Hierarchical Model Structures

Though the tree-like structure is perhaps the simplest and also the most desirable form for new users, there are other types of structures for this model.
A hierarchy can also be structured as an outline or indented list. It can be found in the indented lists of XML documents.

The example below presents information similar to the figures we created above, but instead of the tree-like form this Hierarchical model uses indentation.
. ORGANISATION
  o Department 1
    . Subdivision 1
      . Section 1
      . Section 2
      . Section 3
    . Subdivision 2
      . Section 1
      . Section 2
      . Section 3
    . Subdivision 3
      . Section 1
      . Section 2
      . Section 3
  o Department 2
    . Subdivision 1
      . Section 1
      . Section 2
      . Section 3
    . Subdivision 2
      . Section 1
      . Section 2
      . Section 3

One thing you must keep in mind at all times is that no matter what type of
structure you
use for the model you need to be able to add categories at any time, as well as
delete them.
http://www.learn.geekinterview.com/images/dm10.png
An idea to ensure that this is possible is to use a list view or tree view with
expandable and
collapsible categories.

You can also use the model in a visual form, something involving a cylinder, pyramid or even a cube; this visual presentation of the data would be most suitable for presenting data to a group of professionals.

This form works better for smaller, less detailed levels. There is an example below, using some of the same information from above but shown more compactly.

There are various structures of the Hierarchical Model; in fact there are many more than those shown here.
The type you use depends entirely on the data you are working with. The methods differ according to whether your data is people-related, mathematics-related, or just simple statistics.

Review of the Hierarchical Model Facts

1. This model expresses the relationships between pieces of information: how they are related and what they are most closely related to.

2. The Hierarchical Model is often thought of as a hierarchy. The idea is to think of your data as a family.

3. The model has many different structures and forms. Each is best used depending on the type of data being recorded, the amount of data being recorded, and who it is being recorded for.

4. Speaking in parent/child terms, data can have many children but only one parent.

5. The model begins with core data and branches off into supplementary or smaller related data.

6. One must remember to start with the smallest detail and work their way up.

If you keep to these simple and compact guidelines your own Hierarchical Model will be successful, clean, clear, and well built. The point is to present information in a simple and easy-to-read manner.
The Multi-Dimensional Model

What is a Multi-dimensional Model?

The multi-dimensional model is an integral aspect of On-line Analytical Processing, which is also known as OLAP.

Because OLAP is online it provides information quickly; iterative queries are often posed during interactive sessions.
Because of the analytical nature of OLAP, the queries are often complex. The multi-dimensional model is used to handle these kinds of complex queries, and the model is important because of its simplicity.

This helps users understand the databases and enables software to navigate the databases effectively.

Multi-dimensional data models are made up of logical cubes, measures, and dimensions. Within the models you can also find hierarchies, levels, and attributes.

The straightforwardness of the model is essential because it identifies objects that represent real-world entities.

The analysts know what measures they want to see, what dimensions and attributes make the data important, and how the dimensions of their business are organized into levels and hierarchies.

What are Logical Cubes and Logical Measures?

Let us touch on what the logical cubes and logical measures are before we move on
to more
complicated details.

Logical cubes are designed to organize measures that have the exact same dimensions. Measures that are in the same cube have the same relationship to other logical objects; they can easily be analyzed and shown together.
Logical measures fill the cells of the logical cube with facts collected about an organization's operations or functions.

The measures are organized according to the dimensions, which include a time dimension. Analytic databases contain summaries of historical data, taken from data in a legacy system as well as from other data sources such as syndicated sources. The normally accepted amount of historical data for analytic applications is about three years' worth.

The measures are static; they are also trusted to be consistent while they are being used to help make informed decisions.

The measures are updated regularly; most applications update data by adding to the dimensions of a measure. These updates give users a concrete historical record of a specific organizational activity for an interval. This is very productive.

Another productive strategy is the one adopted by other applications, which fully rebuild the data rather than performing updates.

The lowest level of a measure is called the grain. Often this level of data is never seen, but even so it has a direct effect on the type of analysis that can be done.

This level also determines whether or not the analysts can obtain answers to questions such as: when are men most likely to place orders for custom purchases?

Logical Attributes, Dimensions, Hierarchies and Levels

Logical cubes and measures were relatively simple and easy to digest. Now we will consider logical dimensions, which are a little more complex. Dimensions have a unique set of values that define and categorize the data.

These form the sides of the logical cubes, and through this the measures inside the cubes as well. The measures themselves are usually multi-dimensional; because of this, a value within a measure should be qualified by a member of each of the dimensions in order to be meaningful.

A hierarchy is a way to organize the data at each level of aggregation. When looking at data, developers use hierarchy dimensions to identify trends at a specific level, drill down to lower levels to see what is causing such trends, and then roll up to the higher levels to view how these trends affect the bigger sections of the organization.
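As a minimal sketch of this roll-up idea (not taken from the article), the following SQL aggregates a measure along a two-level hierarchy using GROUP BY ROLLUP, which most SQL engines support; the table and column names are illustrative assumptions.

-- A tiny fact table: one measure (sales_amount) qualified by two dimensions.
CREATE TABLE sales_by_region (
    region        VARCHAR(30),
    product       VARCHAR(30),
    sales_amount  NUMERIC(12,2)
);

INSERT INTO sales_by_region VALUES ('East', 'Rice',  200.00);
INSERT INTO sales_by_region VALUES ('East', 'Wheat', 150.00);
INSERT INTO sales_by_region VALUES ('West', 'Rice',  300.00);

-- Detail rows, per-region subtotals, and a grand total in one query:
SELECT region, product, SUM(sales_amount) AS total_sales
FROM sales_by_region
GROUP BY ROLLUP (region, product)
ORDER BY region, product;

Reading the detail rows corresponds to drilling down, while the subtotal and grand-total rows correspond to rolling up to higher levels of the hierarchy.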

Back to the levels: each level represents a position in the hierarchy, and the levels above the most detailed level contain aggregated values for the levels beneath them.

The members of different levels have a hierarchical relation, which we described above as a parent/child relationship, where a parent can have many children but a child can only have one parent.

Hierarchies and levels have a many-to-many relationship: a hierarchy usually consists of many levels, and one level can be included in various hierarchies.

Finally to wrap up this section we will take a quick look at Logical Attributes. By
now we
should all know that an attribute provides extra information about the data.

Some of these attributes are used simply for display. You can have attributes for things like flavors, colors, and sizes; the possibilities are endless.

It is this kind of attribute that can be helpful in data selection and also in
answering
questions.

An example of the type of question that attributes can help answer is: what colors are most popular in abstract painting? We can also ask: what flavor of ice-cream do seven-year-olds prefer?
We also have time attributes, which can give us information about the time dimensions we spoke of earlier; this information can be helpful in some kinds of analysis.
These types of analysis might indicate the last day or the number of days in a time period.
That pretty much wraps it up for attributes at this point. We will revisit the topic a little later.

Variables

Now we will consider the issue of variables. A variable is basically a value table
for data,
which is an array with a specific type of data and is indexed by a particular list
of
dimensions. Please be sure to understand that the dimensions are not stored in the
variable.

Each combination of dimension members defines a data cell. This is true whether a value for that cell is present or not. Therefore, if data is missing or absent, the fact of the absence can either be included in or excluded from the analysis.

There is no specific relationship between variables that share like dimensions. Even so, a logical relationship does exist between them, because even though they may store different data, possibly of different data types, they are identical containers.

When you have variables that contain identical dimensions, together they create a logical cube. With that in mind, you can see how if you change a dimension, for example by adding time periods to the time dimension, then the variables change as well to include the new time periods; this happens even if the other variables have no data for them.

Variables that share dimensions can be manipulated in an array of ways, including aggregation, allocation, modeling, and calculations. These are more specifically numeric calculations, and they are an easy and fast method in the analytic workspace. We can also use variables to store measures.

In an analytic workspace factual information is kept in variables, normally with a numeric data type.
Each measure is then stored in its own variable, so that while sales and expense data may have like dimensions and the same data type, they will be stored in distinct variables.

In addition to using variables to store measures, they can be used to store attributes as well. There are major differences between the two.

While attributes are multi-dimensional, only one dimension is the data dimension. Attributes give us information about each dimension member no matter what level it inhabits.

Throughout our journey of learning about the different types of data models, I think that the multi-dimensional model is perhaps one of the most useful.

It takes key aspects from other models like the relational model, the hierarchical model, and the object model, and combines those aspects into one competent database that has a wide variety of possible uses.

Network Model

What is a Network Model?

Oddly enough, the Network model was designed to do what the Hierarchical model could not. Though both show how data is related, the Network model allows data to have not only many children but also many parents, whereas the Hierarchical model allowed for only one parent with many children. With the Network model, data relationships must be predefined.

It was in 1971 that the Conference on Data Systems Languages, or CODASYL, officially and formally defined the Network model. This is essentially how CODASYL defined the Network model:
The central data modeling construct in the network model is the set. A set consists of an owner (holder) record type, a set name, and a member (affiliate) record type.

A member record type is able to play the same role in more than one set; because of this, the multi-parent concept is supported. An owner record type can also be a member or owner in another set.

The data model is an uncomplicated network, and link or connection record types (often referred to as junction records) may exist, as well as additional sets between them.

Therefore, the entire network of relationships is represented by a number of pairwise sets; within each set one record type is the holder or owner, located at the tail of the network arrow (see the figure below for an example), and one or more record types are the members or affiliates, located at the head of the relationship arrow. Usually, a set defines a 1:M relationship, although 1:1 is permitted.

The Traditional Network Model


http://www.learn.geekinterview.com/images/dm11.png

The most notable advantage of the Network model is that in comparison with the
Hierarchical model it allows for a more natural avenue to modeling relationships
between
information. Though the model has been widely used it has failed to dominate the
world of
data modeling.

This is believed to be due to large companies choosing to continue using the Hierarchical model, with some alterations to accommodate their individual needs, and because it had been made almost obsolete by the Relational Model, which offers a higher-level, more declarative interface.

For a while the performance benefits of the lower-level navigational interfaces used with the Hierarchical and Network models were well suited to most large applications.

Yet as hardware advanced and became faster the added productivity and flexibility
of the
newer models proved to be better equipped for the data needs.
Soon the Hierarchical and Network models were all but forgotten in relation to
corporate
enterprise usage.

The OSI Network Model

Open Systems Interconnection or OSI models were created to serve as tools that can be used to describe the various hardware and software components found in a network system.

Over the years we have learned that this is particularly useful for educational purposes, and for expressing the full details of the things that need to occur for a network application to be successful.

This particular model consists of seven separate layers, with the hardware placed
at the
very bottom, and the software located at the top.

The arrow identifies that a message originating in an application program in the column listed as #1 must make its way through all of the other layers contained in both of the computers in order to make it to the destination application in the column listed as #2.

This process could easily be compared to that of reading an email. Imagine Column
#1 and
#2 as computers when exploring the figure below:
http://www.learn.geekinterview.com/images/dm12.png

The first layer, which is clearly labeled as the Physical layer, is used to describe components such as internal voltage levels, and it is also used to define the timing for the transmission of individual bits.

The next layer is the Data Link layer, the second layer listed in the example above. It often relates to the sending of a small amount of data, frequently a byte, and it is also often used for the task of error correction.

The Network layer follows the Data Link layer; it defines how to transport the message through and within the network. If you stop a moment and think of this layer as one working with an internet connection, it is easy to imagine that it would be used to add the correct network address.

Next we have the Transport layer. This layer is designed to divide data into smaller sets, or if needed it serves to recombine them into a larger, more complete set. The Transport layer also deals with data integrity; this process often involves a checksum.
Following the Transport layer we find the Session layer. This layer is related to issues that go further, or are more complicated, than a single set of data.

More to the point, the layer is meant to address resuming transmissions such as those that have been prematurely interrupted or somehow corrupted by some kind of outside influence. This layer also often maintains long-term connections to other remote machines.

Following the Session layer is where we find the Presentation layer. This layer acts as an application interface so that syntax formats and codes are consistent between two networked or connected machines.

The Presentation layer is also designed to provide sub-routines; these are often what the user may call on to access network functions and perform operations like encrypting or compressing their data.

Finally we have the Application layer. This layer is where the actual user programs
can be
found. In a computer this could be as simple as a web browser surprisingly enough,
or it
could serve as a ladder logic program on a PLC.

Network Model Tips

After reading this article it is not hard to see the big differences between the
Hierarchical
Model and the Network Model. The network model is by far more complicated and deals
with
larger amounts of information that can be related in various and complicated ways.

This model is more useful because the data can have many-to-many relationships, not restricting it to the single-parent-to-child structure with which the Hierarchical Model works.
Though the Network model has been officially replaced by the more accommodating Relational Model, it is not hard to imagine how it can still be used today; when I think of the Network Model in relation to how we email one another, it may very well still be in use on PCs around the globe.
After reviewing the information and investigating the facts of the Network model, I have come to the conclusion that it is a sound and relatively helpful model, if a bit complicated.

Its one major downfall is that the data must be predefined; this adds restrictions and is why a more suitable model was needed for more advanced data. Ultimately this one restriction led to the model's untimely replacement within the world of data analysis.

What is a Relational Model?

The Relational Model is a clean and simple model that uses the concept of a relation in the form of a table rather than a graph or shapes. The information is put into a grid-like structure that consists of columns running up and down and rows that run from left to right; this is where information can be categorized and sorted.

The columns contain information related to name, age, and so on. The rows contain
all the
data of a single instance of the table such as a person named Michelle.

In the Relational Model, every row must have a unique identification or key that identifies the data that follows it. Often, keys are used to join data from two or more relations based on matching identification.

Here is a small example of the grid-like Relational Model:

Social Security Number   Name       Date of Birth          Annual Income   Dependents
000-00-0003              Michelle   June 22nd, 1973        39,000          M-000-00-0002, F-000-00-0001
000-00-0001              Michael    December 12th, 1949    78,510
000-00-0002              Grehetta   March 5th, 1952        0

The Relational Model also commonly includes the concept of foreign keys: foreign keys are primary keys from one relation that are kept in another relation to allow for the joining of data.

An example of foreign keys is storing your mother's and father's social security
number in
the row that represents you. Your parents' social security numbers are keys for the
rows
that represent them and are also foreign keys in the row that represents you. Now
we can
begin to understand how the Relational Model works.
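As a minimal sketch of this idea (not part of the original article), the table above could be declared in SQL roughly as follows; the table and column names are illustrative assumptions.

-- Parent SSNs are foreign keys that point back to other rows of the same table.
CREATE TABLE person (
    ssn            CHAR(11) PRIMARY KEY,            -- unique key for each row
    name           VARCHAR(50) NOT NULL,
    date_of_birth  DATE,
    annual_income  NUMERIC(12,2),
    mother_ssn     CHAR(11) REFERENCES person(ssn), -- foreign key to the mother's row
    father_ssn     CHAR(11) REFERENCES person(ssn)  -- foreign key to the father's row
);

-- Joining a row to its parents' rows through the foreign keys:
SELECT c.name AS child, m.name AS mother, f.name AS father
FROM person c
LEFT JOIN person m ON c.mother_ssn = m.ssn
LEFT JOIN person f ON c.father_ssn = f.ssn;

The LEFT JOINs return the child row even when a parent row is missing, which matches a row whose parent references are empty.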

How did we get the Relational Model?

Like most other things, the Relational Model was born out of someone's need. In 1969 Dr. Edgar F. Codd published the first description of the relational model; though it was meant to be no more than a report for IBM, it swept across and through data analysis unlike anything before it.

Codd's paper was primarily concerned with what later came to be called the
structural part
of the relational model; that is, it discusses relations per se (and briefly
mentions keys), but
it does not get into the relational operations at all (what later came to be called
the
manipulative part of the model).

Codd's discovery, his creation, was a breath of fresh air for those digging through data banks, trying to categorize and define data. When he invented this model he truly may not have foreseen what an incredible impact it would have on the world of data.

Known Issues with the Relational Model

Some believe there is a great deal of room for improvement where the Relational Model is concerned. It may be a surprise to find that not everyone supported the relational model. There have been claims that the rectangular tables do not allow for large amounts of data to be recorded.

Take the example of apples and oranges: both are fruits and therefore related in that way, but apples have different attributes than oranges. At times a user may only want to see one or the other; then again, they may want to view both. Handling this type of data with the relational model can be very tricky.

We are beginning to hear more and more about the need for a better model, a more adequate structure, yet no one has been able to design something that can truly hold its own against the Relational Model.
True, the model could use a bit of tweaking and leaves a little to be desired, yet what would the perfect model be? What could we use that would apply to as many instances as the Relational model, and still surpass its usefulness?

Advantages of the Relational Model

The Relational Model has survived through the years; though there are those who are always trying to construct something more efficient, it has managed to come out the victor thus far. One reason may be that the structure is big enough to be worth optimizing.

Another notable reason is that the relational operations work on sets of data objects, which seems to make it a reasonably adequate model for remote access. Finally, it is a clean and concise model that does not encourage design extravagance, or what has been phrased as "design cuteness."

Some prefer the clean and simple style that the Relational Model offers; they can easily do without colorful shapes and stylish layouts, wanting nothing more than the clear-cut facts and relevant information.

Here are a few of the more obvious and noted advantages of the Relational Model:

. Allows for data independence. This helps to provide a sharp and clear boundary between the logical and physical aspects of database management.

. Simplicity. This provides a simpler structure than those that came before it. A simple structure is easy to communicate to users and programmers, and a wide variety of users in an enterprise can interact with a simple model.

. A good theoretical background. This means that it provides a theoretical background for the database management field.

Do not be surprised to find that these are nearly the very same advantages that Dr. Codd listed in the debut of this model. It is obvious that he was right, and these advantages have been restated again and again since the first publication of his report.
There has been no other model brought into view that has had the advantages of the
Relational Model, though there have been hybrids of the model, some of which we
will
discuss in later articles.

The Relational Model versus Earlier Models

We began with the Hierarchical Model. This model allowed us to arrange our data in terms of relations, somewhat like a hierarchy; it showed a parent/child type of relation. Its one big downfall was that each "child" could only have one parent, while a parent could have many children.

This model served us well in its time of glory, and sure, there are still systems using it now, though they trust their heftier loads of data to better-equipped models.

Following the Hierarchical Model we investigated the Network Model. This model was closely akin to the Hierarchical Model in that it too allowed for a parent/child view of data; its main advantage over the previous model was that it allowed for many-to-many relationships between data sets.

Still, the data had to be predefined. Though some forms of this model are still used today, it has become somewhat obsolete.

Finally we come to our current model, the Relational Model. Like those before it, it too expresses the relationships between data, only it allows for larger input and does not have to be predefined. This model allows users to record and relate large amounts of data.

This model also allows for multi-level relationships between data sets, meaning they can be related in many ways, or even in only one way. It is easy to understand how this model has managed to outlive those that came shortly before it. It is versatile, simple, clean in structure, and applicable to nearly every type of data we use.

What is the Semi-Structured Data Model?

The semi-structured data model is a data model where the information that would normally be connected to a schema is instead contained within the data; this is often referred to as a self-describing model.

With this type of database there is no clear separation between the data and the schema, and the level to which it is structured depends on the application being used.
Certain forms of semi-structured data have no separate schema, while in others there is a separate schema that places only loose restrictions on the data.

Semi-structured data is naturally modeled in graphs whose labels give semantics to its fundamental structure. Databases of this type subsume the modeling power of other extensions of flat relational databases: nested databases, which enable the encapsulation of entities, as well as object databases, which also enable cyclic references between objects.

Data that is semi-structured has just recently come into view as an important area
of study
for various reasons. One reason is that there are data sources like the World Wide
Web,
which we often treat as a database but it cannot be controlled by a schema.

Another reason is it might be advantageous to have a very flexible format for data
exchange
between contrasting databases. Finally there is also the reason that when dealing
with
structured data it sometimes may still be helpful to view it as semi-structured
data for the
tasks of browsing.

What is Semi-Structured data?

We are familiar with structured data, which is data that has been clearly formed, formatted, modeled, and organized in ways that are easy for us to work with and manage. We are also familiar with unstructured data.

Unstructured data comprises the bulk of information that does not fit into a set of databases. The most easily recognized form of unstructured data is the text in a document, like this article.
What you may not have known is that there is a middle ground for data; this is the data we refer to as semi-structured. These are data sets in which some implied structure is usually followed, but which are not structured consistently enough to meet the criteria needed for the types of management and mechanization normally applied to structured data.

We deal with semi-structured data every day, in both technical and non-technical environments. Web pages follow certain distinctive forms, and the content embedded within HTML usually carries a certain amount of metadata within the tags. Details about the data are implied immediately when using this information. This is why semi-structured data is so intriguing: though there is no set formatting rule, there is still enough regularity that some interesting information can be extracted from it.

What does the Semi-Structured Data Model do?

Some advantages of the semi-structured data model include:

. Representation of information about data sources that normally cannot be constrained by a schema.

. The model provides a flexible format for data exchange among dissimilar kinds of databases.

. Semi-structured data models are helpful for viewing structured data as semi-structured data.

. The schema is easily altered with the model.

. The data transfer format can be convenient.

The most important trade-off in using a semi-structured database model is quite possibly that queries cannot be made as efficiently as in the more constrained structures, like the relational model.

Normally the records in a semi-structured database are stored with unique IDs that are referenced with pointers to their specific location on disk. Because of this, navigational or path-based queries are very efficient, yet for searches over large numbers of records it is less practical, because the database is forced to seek across various regions of the disk by following the pointers.

We can clearly see that there are some disadvantages with the semi-structured data model, as there are with all other models; let's take a moment to outline a few of these disadvantages.

Issues with Semi-Structured Data

Semi-structured data needs to be characterized, exchanged, stored, manipulated and analyzed efficiently. Even so, there are challenges in semi-structured data use. Some of these challenges include:
Data diversity: Data diversity in federated systems is a complex issue; it involves areas such as unit and semantic incompatibilities, grouping incompatibilities, and non-consistent overlapping of sets.

Extensibility: It is vital to realize that extensibility, as applied to data, refers to data representation and not data processing. Data processing should be able to happen without the aid of database updates.

Storage: Transfer formats like XML are universally in text or in Unicode; they are prime candidates for transfer, yet not so much for storage. The representations are instead stored by underlying, accessible systems that support such standards.

In short, much of the academic, open source, and other attention directed at these particular issues has remained at a surface level of resolving representation, definitions, or even units.

The creation of adequate processing engines for efficient and scalable storage and retrieval has been largely missing from the overall push for a semi-structured data model. It is obvious that this needs further study and attention from developers.

Conclusion

We have researched many areas of the semi-structured data model, including the differences between structured data, unstructured data, and semi-structured data. We have also explored the various uses for the model.

After looking at the advantages and the disadvantages, we are now educated enough about the semi-structured model to make a decision regarding its usefulness, though this model is worthy of more research and deeper contemplation. The flexibility and diversity that this particular model offers are more than praiseworthy.

After researching, one can see many conventional and non-conventional uses for this
model
in our systems. A model example for semi-structured data model is depicted below.
http://www.learn.geekinterview.com/images/dm13.png

The semi-structured information used above is actually the detail pertaining to this very article. Each line or arrow in the model has a specific purpose. This purpose is clearly listed as Article, Author, Title, and Year.

At the end of each arrow you can find the corresponding information. So this model example expresses the information about this article: the title of the article, which is The Semi-Structured Data Model; the year in which the article was written, which is 2008; and finally who the author is. As you can see from the example, this data model is pretty easy to follow and useful when dealing with semi-structured information like web pages.

Star Schema
What is the Star Schema?

The Star Schema is basically the simplest form of data warehouse schema. This schema is made up of fact tables and dimension tables. We have covered dimension tables in previous articles, but the concept of fact tables is fairly new.

A fact table contains measurable or factual data about an organization. The information contained in the schema is usually numerical, additive measurements, and the tables can consist of numerous columns and an extensive number of rows.

The two tables are different from each other only in the way that they are used in
the
schema. They are actually made up of the same structure and the same SQL syntax is
used
to create them as well.

Interestingly enough, in some schemas a fact table can also play the role of a dimension table under certain conditions, and vice versa. Though they may be physically alike, it is vital that we also understand the differences between fact tables and dimension tables.

A fact table in a sales database, used with the star schema, could deal with the revenue for products of an organization from each customer in each market over a period of time. However, a dimension table in the same database would define the organization's customers, the markets, the products, and the time periods that are found in the fact tables.

When a schema is designed right it will offer dimension tables that enable the user to leaf through the database and get comfortable with the information that it contains. This helps the user when they need to write queries with constraints, so that the information that satisfies those constraints is returned from the database.
Star Schema Important Issues

As with any other schema, performance is a big deal with the Star Schema. Decision support systems are particularly important; users utilize these systems to query large quantities of data. Star Schemas happen to perform best for decision support applications.

Another issue that is important to mention is the roles that fact and dimension tables play in a schema. When considering the physical database, the fact table is essentially a referencing table, whereas the dimension table plays the role of a referenced table.
http://www.learn.geekinterview.com/images/dm14a.png

We can correctly conclude that a fact table has foreign keys that reference other tables, and a dimension table is the target of foreign key references from one or more tables.

Tables that reference or are referenced by other tables have what is known as a primary key. A primary key is a column or set of columns whose contents specifically identify the rows. With simple star schemas, the fact table's primary key can be made up of multiple foreign keys.

A foreign key is a column or a group of columns in a table whose values are identified by the primary key of another table. When a database is developed, the statements used to create the tables should select the columns that are meant to form the primary keys as well as the foreign keys. Below is an example of a Star Schema.

Simple Star Schema


. The Bold column name Indicates the primary key

. Lines indicate one to many foreign key relationships

. Bold italic column names indicate the primary key that is a foreign key to
another table

Let's point out a few things about the Star Schema above:

. Items listed in the boxes above are columns in the tables with the same names as
the
box names.

. The Primary key columns are in bold text.

. The foreign key columns are in italic text (you can see that the primary key from
the
green Dimension box is also a key in the orange box, the primary key from the
turquoise
box is also a foreign key in the orange box.)

. You can see that columns that are part of the primary key and the foreign keys
are
labeled in bold and italic text, like the key 1 in orange box.

. The foreign key relationships are identified by the lines that are used to
connect the
boxes that represent tables.
Even though a primary key value must be unique within the rows of a dimension table, the same value can occur many times in a foreign key column of a fact table, as in a many-to-one relationship. This many-to-one relationship exists between the foreign keys of the fact table and the primary keys they refer to in the dimension tables.
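To make the primary key / foreign key layout concrete, here is a minimal sketch of a simple star schema in SQL. It is not taken from the figure above; the table and column names are illustrative assumptions.

-- Two dimension tables, each with its own primary key.
CREATE TABLE product_dim (
    product_key   INTEGER PRIMARY KEY,
    product_name  VARCHAR(60),
    product_group VARCHAR(60)
);

CREATE TABLE date_dim (
    date_key   INTEGER PRIMARY KEY,
    full_date  DATE,
    month_name VARCHAR(20),
    year_no    INTEGER
);

-- The fact table: its primary key is made up of foreign keys to the dimensions,
-- and each foreign key value can repeat across many fact rows (many-to-one).
CREATE TABLE sales_fact (
    product_key INTEGER REFERENCES product_dim(product_key),
    date_key    INTEGER REFERENCES date_dim(date_key),
    quantity    INTEGER,
    revenue     NUMERIC(12,2),
    PRIMARY KEY (product_key, date_key)
);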

The star schema can hold many fact tables as well. Multiple fact tables may be present because they hold unrelated facts, like invoices and sales. In some situations multiple fact tables are present simply to support performance.

You can see multiple fact tables serving this purpose when they are used to support levels of summary data, more specifically when the amount of data is large, as with daily sales data.
Referencing tables are also used to define many-to-many relationships between dimensions. Such a table is usually referred to as an associative table or a cross-reference table. This can be seen at work in the sales database as well: each product belongs to one or more groups, and each of those groups also contains many products.

The many-to-many relationship is modeled through the establishment of a referencing table that defines the various combinations of products and groups within the organization.

We can also represent many-to-many relationships by having dimension tables with multicolumn primary keys that serve as foreign key references in fact tables.

A rough example of this comes, yet again, from the sales database: as we said before, each product is in one or more groups and each of those groups has multiple products, which is a many-to-many relationship.
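Here is a minimal sketch of the associative (cross-reference) table described above, again with illustrative, assumed table and column names rather than anything taken from the article.

CREATE TABLE product (
    product_id   INTEGER PRIMARY KEY,
    product_name VARCHAR(60)
);

CREATE TABLE product_group (
    group_id   INTEGER PRIMARY KEY,
    group_name VARCHAR(60)
);

-- The cross-reference table: one row per product/group combination.
CREATE TABLE product_group_xref (
    product_id INTEGER REFERENCES product(product_id),
    group_id   INTEGER REFERENCES product_group(group_id),
    PRIMARY KEY (product_id, group_id)
);

-- All groups that a given product belongs to:
SELECT g.group_name
FROM product_group_xref x
JOIN product_group g ON g.group_id = x.group_id
WHERE x.product_id = 1;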

Designing a Star Schema

When designing a schema for a database we must keep in mind that the design affects
the
way in which it can be used as well as the performance.

Due to this fact, it is vital to make a preliminary investment of time and research in the design so that the database is beneficial to the needs of its users. Let's wrap things up with a few suggestions about things to consider when designing a schema:

. What is the function of the organization? Identify what the main processes are
for the
organization; it may be sales, product orders, or even product assembly, to name a
few.
This is a vital step; the processes must be identified in order to create a useful
database.
. What is meant to be accomplished? As with all databases, a schema should reflect the organization, in what it measures as well as what it tracks.

. Where is the data coming from? Considering the projected input data and its sources will reveal whether the existing data can support the projected schema.

. What dimensions and attributes of the organization will be reflected by the dimension tables?

. Will there be dimensions that may change in time? If the organization contains dimensions that change often, then it is better to record them as facts rather than store them as dimensions.
. What is the level of detail of the facts? Each row should contain the same kind of data. Differing data would be addressed with a multiple-fact-table design, or by modifying the single table so that a flag identifying the differences can be stored with the data. You want to consider the amount of data, the space, and the performance needs when deciding how to deal with different levels of detail in the data.

. If there are changes how will they be addressed, and how significant is
historical
information?

XML Database

What is an XML database?

The XML database is most commonly described as a data persistence system that
enables
data to be accessed, exported, and imported. XML stands for Extensible Markup
Language.

XML itself is a meta markup language that was developed by the W3C to handle the inadequacies of HTML. The HTML language began to evolve quickly as more functionality was added to it.

Soon there was a need to have a domain-specific markup language that was not full
of the
unnecessary data of HTML, thus XML was brought to life.

XML and HTML are indeed very different. The biggest way in which they differ is that in HTML the semantics and syntax of tags are fixed, while in XML the creator of the document is able to produce tags whose syntax and semantics are particular to the intended application.
The semantics of tags in XML depend on the context of the application that processes the document. Another difference between XML and HTML is that an XML document has to be well formed.

XML's original purpose may have been to mark up content, but it did not take long for users to realize that XML also gave them a way to describe structured data, which in turn made XML significant as a data storage and exchange format as well.
Here are a few of the advantages that the XML data format has:
. Built-in support for internationalization, due to the fact that it utilizes Unicode.

. Platform independence.

. The human-readable format makes it easier for developers to locate and repair errors than with preceding data storage formats.

. Extensibility, which enables developers to add additional information without breaking applications that were created for older versions of the format.

. A great quantity of off-the-shelf tools for processing XML documents already exists.

Native XML Database

A Native XML Database or NXD defines a logical model for an XML document, rather than for the data in the document; it stores and retrieves documents according to that model.

At the very least the model will consist of attributes, elements, and document order. The NXD has an XML document as its fundamental unit of storage.

The database is also not obligated to have any specific underlying physical storage model. It can be built on a hierarchical, relational, or even an object-oriented database, all of which we have explored in detail.

It can also use a proprietary storage format such as indexed or compressed files. So we can gather from this information that the database is unique in storing XML data: it stores all aspects of the XML model without breaking it down.

We have also learned that the NXD is not always an independent database. NXDs are not meant to replace actual databases; they are a tool, and this tool is used to aid the developer by providing robust storage and management of XML documents.

Features of the XML Database

Not all XML databases are the same, yet there are enough similar features among them to give us a rough idea of the basic structure. Before we continue, let us note that the database is still evolving and will continue to do so for some time.
One feature of the database is XML storage. It stores documents as a unit, and creates models that are closely related to XML or a related technology like the DOM.

The model accommodates non-uniform levels of complexity as well as support for content and semi-structured data. Mapping is used to ensure that the XML-specific model of the data is preserved.

After the data is stored, the user will need to continue to use the NXD tools. It is not as useful to try to access the data tables using SQL as one would think; this is because the data that would be viewed would be the model of an XML document, not the entities that the data depicts.

It is important to note that the business entity model lives within the XML document domain, not the storage system; in order to work with the actual data you will have to work with it as XML.

Another feature of the database worth mentioning is queries. Currently XPath is the query language of choice. To function as a database query language, XPath is extended somewhat to allow queries across collections of documents.

On a negative note, XPath was not created to be a database query language, so it does not function perfectly in that role.
In order to improve the performance of queries, NXDs support the creation of indexes on the data stored in the collections.

The indexes can be used to improve the speed of query execution. The fine points of what can be indexed and how the index is created vary between products.
What kind of data types are supported by XML?

You might be surprised to hear that XML does not actually support any data types.
The XML
document is almost always text, even if by chance it does represent another data
type; this
would be something like a date or integer.

Usually the data exchange software converts the data from a text form, like in the
XML
document, to other forms within the database and vice versa.

Two methods are most common for determining which conversion to do. The first of these methods is that the software determines the data type from the database schema; this works out well because it is always available at run time.

The other common method is that the user explicitly provides the data type, as with the mapping information.

This can be recorded by the user or even generated without human intervention from
a
database schema or even an XML schema.

When it is generated automatically, the data types can be taken from database
schemas as
well as from certain types of XML schemas.

There is another issue related to conversions as well; this has to do largely with what text formats are recognized when data is transferred from XML, or what can be produced when transferring data to XML.

In most situations the number of text formats that are supported for a specific data type is somewhat restricted; this is seen with a single specific format or with those that are supported by a particular JDBC driver.
It is also important to note that dates will usually cause issues; this is largely due to the fact that the range of possible formats is extremely extensive. When you also consider numbers with international formats, these can add to the problems as well.

Concluding statements

XML may seem confusing; however, it is beneficial and even a bit less complicated than HTML. Yet when you begin taking the step towards understanding XML after having spent much time working with HTML, the process can be a bit daunting.

Never fear, once you have completed that step, XML is definitely a dominant format.
It is
also used in almost all of the models we have discussed, making it a vital area to
explore in
more detail.

Entity Attribute Value (EAV)

What is an Entity-Attribute-Value Model (EAV)?

The Entity-Attribute-Value model or EAV is also sometimes referred to as the Object-Attribute-Value Model, or even the Open Schema. This is a data model that is often used in instances where the number of attributes, properties, or parameters that can be used to define an entity is potentially limitless, but the number that will actually apply to a given entity is somewhat modest.

The easiest way to understand the purpose of the Entity-Attribute-Value design is to understand row modeling, of which Entity-Attribute-Value models are a generalized form. Let's think of a department store database. These databases are responsible for managing endless numbers of products and product brands.

It is innately obvious that the product names would not be hard-coded as the names of the columns in a table. Instead, product descriptions are kept in a product table, and purchases/sales of an individual item are recorded in another table as separate rows that reference the product by its product ID.

An Entity-Attribute-Value design normally involves a solitary table with three columns; these columns most often contain data referring to the entity, an attribute, and a value for that attribute.
In this design one row stores a single fact, whereas in a traditional table that has one column per attribute, one row stores a set of facts. The Entity-Attribute-Value design is applicable when the number of parameters that could apply to an entity is significantly larger than the number that truly apply to a single entity.
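As a minimal sketch of this three-column layout (with illustrative, assumed table and column names, plus an attribute-definition table of the kind discussed later), the clinical example used below could be stored roughly like this in SQL:

-- Metadata: one row per attribute definition.
CREATE TABLE attribute_def (
    attribute_id   INTEGER PRIMARY KEY,
    attribute_name VARCHAR(60) NOT NULL,
    data_type      VARCHAR(20),
    units          VARCHAR(20)
);

-- The EAV table: one row stores a single fact about one entity.
CREATE TABLE clinical_eav (
    patient_id   INTEGER   NOT NULL,                -- the entity ...
    event_time   TIMESTAMP NOT NULL,                -- ... together with the event date/time
    attribute_id INTEGER   NOT NULL REFERENCES attribute_def(attribute_id),
    value_text   VARCHAR(200),                      -- the value, stored here as text
    PRIMARY KEY (patient_id, event_time, attribute_id)
);

INSERT INTO attribute_def VALUES (1, 'systolic_bp', 'numeric', 'mmHg');
INSERT INTO clinical_eav  VALUES (42, TIMESTAMP '2008-01-15 09:30:00', 1, '120');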

Where is the Entity-Attribute-Value Model used?

Perhaps the most notable example of the EAV model is in the production databases we see in clinical work. This includes past clinical history, present clinical complaints, physical examinations, lab tests, special investigations, and diagnoses; basically all of the aspects that could apply to a patient. When we take into account all of the specialties of medicine, this information can consist of hundreds of thousands of units of data.

However, most people who visit a health care provider have few findings. Physicians simply do not have the time to ask a patient about every possible thing; this is just not the way in which patients are examined. Rather than using the process of elimination against thousands of possibilities, the health care provider focuses on the primary complaints of the patient and then asks questions related to those complaints.

Now let's consider how someone would attempt to represent a general-purpose clinical record in a database like those we discussed earlier.

Creating a table, or even a set of tables, with thousands of columns would not be the best course of action: the vast majority of the columns would be inapplicable, and the user interface would be unusable without extremely elaborate logic that could hide groups of columns based on the data entered in the previous columns.
To complicate things further, the patient record and medical findings continue to grow. The Entity-Attribute-Value data model is a natural solution for this perplexing issue, and you shouldn't be surprised to find that larger clinical data repositories do use this model.

What is the Structure of the Entity-Attribute-Value Table?

Earlier we covered the facts that the EAV table consists of thee columns in which
data is
recorded. Those columns were the entity, the attribute, and the value. Now we will
talk a
little more in-depth about each column.
. The Entity. Sticking to the scenario of clinical findings, the entity would be the patient event. This would contain at the very least a patient ID and the date and time of the examination.

. The Attribute, often referred to as the parameter, is a foreign key into a table of attribute definitions; in our example, the definitions of the clinical findings. The attributes table should contain the attribute ID, the attribute name, description, data type, units of measurement, and columns aiding input validation.

. The Value of an attribute, which depends on the data type.
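Putting the three columns together, here is a minimal sketch (using the hypothetical tables from the earlier EAV sketch) of how EAV rows can be pivoted back into a conventional one-column-per-attribute view:

-- Turn attribute/value rows for each patient event back into columns.
SELECT e.patient_id,
       e.event_time,
       MAX(CASE WHEN a.attribute_name = 'systolic_bp'  THEN e.value_text END) AS systolic_bp,
       MAX(CASE WHEN a.attribute_name = 'diastolic_bp' THEN e.value_text END) AS diastolic_bp
FROM clinical_eav e
JOIN attribute_def a ON a.attribute_id = e.attribute_id
GROUP BY e.patient_id, e.event_time;

Queries like this are the price of the flexible row-per-fact layout, which is one reason the standard query optimizers mentioned later struggle with EAV-formatted data.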

Entity-Attribute-Value Database

This database is most commonly called the EAV database; this is a database where a
large
portion of data is modeled as EAV. Yet, you may still find some traditional
relational tables
within this type of database.

. We stated earlier what EAV modeling does for certain categories of data, such as clinical findings, where the applicable attributes are numerous but sparse. Where these specific conditions do not apply we can use a traditional relational model instead. Using EAV has nothing to do with leaving the common sense and principles of the relational model behind.

. The EAV database is basically un-maintainable without the support of many tables that store supportive metadata. These metadata tables usually outnumber the EAV tables by about three or more, and they are normally traditional relational tables.

. The Entity in clinical data is usually a clinical event, as we have discussed above. For more general purposes, however, the entity is a key into an Objects table that is used to note common information about all of the objects in the database. The use of an Objects table does not require EAV; traditional tables can be used to store the category-specific details of each object.

. The Value: coercing all values into strings, as in the EAV data example above, results in a simple, yet not scalable, structure. Larger systems use separate EAV tables for each of their data types, including binary large objects, with the metadata for a specific attribute identifying the EAV table in which the data will be stored.

. The Attribute: in the EAV table this is no more than an attribute ID; there are normally multiple metadata tables that contain the attribute-related information.

Issues Associated with the EAV Model

There have been a number of issues with the Entity-Attribute-Value model brought to
light
throughout its lifetime. We will briefly discuss those now. It is important that we
clarify first
that these issues arise when metadata is not used with the EAV model, for metadata
is vital
for its functionality.

Here are some of the issues:

Flexibility. The flexibility is wonderful, yet there comes a point where we no longer
have any structure at all. Normally you can no longer rely on built-in database
features such as referential integrity. To ensure that a column takes only values
within an acceptable range, you need to code the integrity checks inside the
application, which does not help make the model maintainable.

Designer issues. Adding attributes as you go is tolerable for a prototype. Yet if you
are unaware of what data you want to use from the start, you are just looking for
problems.

Much of the technology of relational databases will be inaccessible and will have to
be recreated by a development team; this could include system tables, graphical query
tools, fine-grained data security, incremental back-up and restore, exception
handling, partitioned tables, and clustered indexes, none of which exist for the
EAV-modeled data out of the box.

The format is not supported well by the DBMS internals. Standard SQL query optimizers
do not handle EAV-formatted data very well, and a lot of time will need to be
dedicated to performance tuning for an acceptable production-quality application.
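To see why optimizers struggle, consider reassembling a single patient event into a
conventional row. A hedged sketch, reusing the hypothetical eav_finding and
attribute_definition tables from earlier, might pivot the data like this:

-- Pivoting EAV rows back into columns for each event (illustrative only)
SELECT f.event_id,
       MAX(CASE WHEN a.attribute_name = 'temperature'    THEN f.value END) AS temperature,
       MAX(CASE WHEN a.attribute_name = 'blood_pressure' THEN f.value END) AS blood_pressure,
       MAX(CASE WHEN a.attribute_name = 'pulse'          THEN f.value END) AS pulse
FROM eav_finding f
JOIN attribute_definition a ON a.attribute_id = f.attribute_id
GROUP BY f.event_id;

Every additional attribute adds another conditional aggregate and, in effect, another
pass over the EAV data, which is exactly the query shape that conventional optimizers
and indexes handle least well.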

As you can see, there are still a few issues that need to be addressed by developers
in order to make EAV work well. Regardless of those issues, we have also learned that
if we use metadata with EAV we can avoid many, if not all, of them.

Entity Relation Diagram


An Entity Relation Diagram, also known as an E-R diagram, uses specialized graphical
symbols to illustrate all of the interrelationships between entities and attributes in
the database. It represents the arrangement and relationships of data entities in the
logical data structure.

There are three general graphical symbols used in an entity relation diagram: the box,
the diamond and the oval. The box is commonly used for representing the entities in
the database, the diamond for representing the relationships and, finally, the oval
for representing the attributes.

In many other entity relation diagrams, the rectangle symbol is used to represent
entity
sets while the ellipse symbol is used to represent attributes. The line is
generally used for
linking attributes to entity sets and entity sets to relationship sets.

The entity relation diagram is used to represent the entire information system for
easy
management of resources. The diagram can make the people concerned easily identify
concepts or entities which exist in the whole information system as well as the
entire
business structure and the complex interrelationships between them.

An entity relation diagram is also often used to visualize a relational database. Each
entity represents a database table, while the relationship lines represent keys in one
table pointing to specific records in the related table or tables, depending on the
kind of relationship (one to one, one to many, many to one, many to many).

An entity relation diagram can also be an abstract representation, which does not
necessarily mean capturing every table needed within the database but instead serves
to diagram the major concepts and relationships.

It may represent a very industry-specific theoretical overview of the major entities
and relationships needed for management of the industry's resources, whatever they may
be. It may assist in the design process of a database for an e-resource management
system but may not necessarily identify every table that would be used.

The "Crow's Foot" notation is an alternative entity relations diagram. In this


diagram
scheme, the relationships are represented with connecting lines between entities
and the
symbols at the ends of the lines are to represent the cardinality of the
relationship.

In representing cardinality, the "Crow's Foot" notation uses three symbols: the ring
represents zero; the dash represents one; and the crow's foot represents many.

In representing the four types of cardinality which an entity could have in a
relationship, the "Crow's Foot" notation uses the following symbol pairs: ring and
dash represent zero or one; dash and dash represent exactly one; ring and crow's foot
represent zero or more; and dash and crow's foot represent one or more.

This diagram scheme may not be as famous and widely used as the symbols above, but it
is fast gaining notice, especially now that it is used in Oracle texts and in some
visual diagramming and flowcharting tools such as Visio and PowerDesigner.

Those who prefer using the "Crow's Foot" notation say that this technique gives better
clarity in identifying the many, or child, side of a relationship as compared with
other techniques. This scheme also gives a more concise notation, identifying a
mandatory relationship with a perpendicular bar and an optional relationship with an
open circle.

There are many tools for entity relation diagrams available on the market or the
internet today. The proprietary tools include Oracle Designer, SILVERRUN ModelSphere,
SmartDraw, CA ERwin Data Modeler, DB Visual ARCHITECT, Microsoft Visio, PowerDesigner
and ER/Studio. For those who want free tools, the choices include MySQL Workbench,
Open System Architect, DBDesigner and Ferret.

Entity Structure Chart

An entity structure chart is a visual representation of everything related to the
entities behind the business rules, the activities and the data model of a company, in
its database, data warehouse or any general information system implementation. It is a
chart that depicts the structure and existence of data attributes and data entities in
the common data structure.

An entity structure chart draws the graphical structure of any data within the
enterprise common data structure. This graphical drawing is intended to help data
document analysts, database administrators, information technology staff and all other
staff of the organization visualize the data structures and the information system
design. Entity relation diagrams make use of entity structure charts in order to
provide a complete representation of the logical data structure.

Without the aid of an entity structure chart, data entities and all attributes and
relations
pertaining to them, would all be defined in bullet format with details in paragraph
format.
This can be straining to the eyes as well as difficult to analyze because one will
have to dig
through all those words and sentences.
With an entity structure chart, it becomes easy to have an analysis that shows the
position
of any selected entity in the structures that have been defined. An entity may
exist in many
structure charts. This chart would also give a great benefit in validating the
position or
absence of an entity within one or more specific structures.

For instance, there might be a need for identifying where a specific set of
departmental data
exists within the structure of the entire organization. With an entity structure,
it would be
very easy to spot and point to the location of the departmental data being looked
for.

The data entities which are represented in the entity structure chart may include
resource
data, roles of the entity, data from each organizational or departmental unit, data
location,
and any other business items which may apply.

An entity represents any real world object and the entity structure is technically
and
basically the formalism used in a structural knowledge representation scheme that
systematically organizes a family of possible structures of a system. Such an
entity
structure chart illustrates decomposition, coupling, and taxonomic relationships
among
entities.

The decomposition of an entity is the particular area concerned with how any entity
may be broken down further into sub-entities until it reaches atomicity. Coupling
pertains to the specifications detailing how sub-entities may be coupled together to
reconstitute the entity.

An entity structure chart directly supports the entity relation diagram, which is
the visual and graphical representation of all of the interrelationships between
entities and
attributes in the database. And while the entity relation diagram uses graphical
symbols
such as box, diamond, oval and lines for representing the entities and
relationships in the
database, the entity structure chart may or may not use the same symbols in trying
to
visualize the structure of the entity for the database or the enterprise
information system.

It could be said that an entity structure chart is a breakdown of the components of
the entity relation diagram. In the entity relation diagram only the entities and the
table relations are specified, but in the entity structure chart the visual
illustration may also include symbols showing how each entity is structurally
represented.

For instance, let us say the entity is CUSTOMER. The entity structure details
everything about the customer, including name, age and so on, its relationships with
products and other entities, and at the same time the data types defining how CUSTOMER
and its attributes will be physically stored in the database.
External Schema

The word schema, as defined in the dictionary, means a plan, diagram, scheme or
underlying organizational structure. Very briefly said, then, an external schema is a
plan for how to structure data so that it can seamlessly integrate with any
information system that needs it.

It also means that data needs to integrate with the business schema of the
implementing organization. An external schema is a schema that represents the
structure of data used by applications.

A database management system typically has a three-layer architecture composed of an
internal schema, a conceptual schema and an external schema. The conceptual schema
describes the aspects relevant to the universe of discourse of the DBMS.

Each external schema describes the part of the information which is appropriate to the
group of users to whom the schema is addressed. Each external schema is derived from
the conceptual schema.

The external schema definitions are all based on a data dictionary. The universe of

discourse of the data dictionary is all information in the use and management of
the
database system.

The external schema definition system is the means by which the external schemas are
defined. The information an external schema contains must be derivable from the
conceptual schema.
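In relational terms, one common (though not the only) way to realize an external
schema is as a set of views derived from the conceptual schema's base tables. A hedged
sketch, with hypothetical table and view names:

-- Conceptual schema base table (hypothetical)
CREATE TABLE employee (
    employee_id INTEGER PRIMARY KEY,
    name        VARCHAR(100),
    department  VARCHAR(50),
    salary      NUMERIC(10,2)
);

-- External schema for a reporting user group: only the derivable information they need
CREATE VIEW hr_headcount AS
SELECT department, COUNT(*) AS employee_count
FROM employee
GROUP BY department;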
In systems based on the object oriented paradigm, this does not necessarily mean that
the classes included in the schema have to have been previously defined in the
conceptual schema. An external schema may include classes that have been defined in
the conceptual schema, just as it may also contain derived classes.

A derived class is any class that has been directly or indirectly defined on the basis
of the conceptual schema classes and has been defined and included in the data
dictionary.

The typical external schema definition for the object oriented paradigm includes three
steps: definition of the necessary derived classes; selection of the set of classes
that will constitute the external schema; and generation of the output external
schema.

A general external schema explicitly defines data structure and content in terms of a
data model that addresses the structure, integrity and manipulation of the data. As
such, the external data schema includes the data vocabulary, which defines the element
and attribute names; the content model, which holds the definition of relationships
and the corresponding structure; and the data types.

Some of these data types are integer, string, decimal, boolean, double, float,
hexBinary, base64Binary, QName, dateTime, time, date, gYearMonth, gYear, gMonthDay,
gDay, gMonth, NOTATION and any others which may be applicable or appropriate for
representing the entities in the information system.

An external schema is a very important aspect of any information technology
application, especially in the field of databases.

Because it defines the structure of data used in applications, it makes it easier for
the information system to deal with all processes, optimizations and troubleshooting.
This schema will also ensure that the data used in the information system adheres to
the rules of the business and follows the framework of the data architecture.

In data modeling, the external schema may refer to the collection of data
structures that
are being used in creating databases representing all objects and entities which
are modeled
by the database.

The same external schema stores the definition of the collection of governing rules
and constraints placed on the data structure in order to maintain structural integrity
by preventing orphaned records and disconnected data sets. The same external schema
also holds the collection of operators applicable to the data structures, such as the
update, insert and query operations of the database.

Four-Schema Concept

The Four-Schema concept consists of

a physical schema,
a logical schema,
a data view schema, and
a business schema.

The four schema concept is taken great advantage of in the implementation of service
oriented business processes integration. It helps to resolve problems with the
three-schema concept.

Today, there is a very widely recognized trend in the marketing and business
environment, a push toward a scenario wherein companies and business enterprises are
networked together in order to be able to gain higher profits from collaborative
efforts and improve operational flexibility while reducing operational cost.

This trend toward the transformation of the current business environment requires
companies to adopt more collaborative working practices based on the integration of
business processes across a wide range of business players such as business partners,
suppliers, vendors, and public bodies.

In order for integration to take place, the underlying architecture must be resolved
first so that smooth and near-seamless integration can happen.

One of the most important aspects, not just of enterprise data integration but of
computer science in general, is data modeling.

This is the process of creating a data model with the use of model theory (the formal
description of data) so that a data model instance can be created. Under the
four-schema concept, that instance is described by a physical schema, a logical
schema, a data view schema, and a business schema.

A typical data model may have an instance of any of these schemas. The conceptual or
business schema is used for describing the semantics pertaining to the business
organization. It is in this schema that the entity classes, which represent things of
significance to the organization, and the entity relationships, which are the
assertions about associations between pairs of entity classes, are defined.

The logical schema contains descriptions of those semantics as represented by a
particular data manipulation technology, including descriptions of tables and columns,
object oriented classes, and XML tags, among other things.

The physical schema describes the physical means by which data are stored, along with
other concerns pertaining to partitions, CPUs, tablespaces, and so on.

The addition of the data view schema is what makes the four schema concept complete
and
distinct. The data view schema details how enterprises can offer the information
that they
wish to share with others as well as request what they want.

In real-life business data management implementations, there really are many types of
data model being developed using many different notations and methods. Some of the
data models are biased towards physical implementation, while others are biased toward
understanding business data, and still a few are biased towards business managers.

The four schema concept, as mentioned, is biased towards service oriented business
processes integration.

The typical overall information architecture of an on-demand enterprise collaboration
or service oriented business processes integration implementation involves on-demand
information exchange, which is covered under the four schema concept.

Some good practices in this area tend to rely on homogeneous semantics, which can be
difficult to achieve for independent databases owned by independent enterprises.

This difficulty can be overcome by developing a new information exchange model that
extends previous global query results and covers independent databases.

The new model provides a four-schema architecture which allows the management of
information sharing.

Information matching is also to be considered, and this can be done by employing
approaches for query database and export database design.

Service oriented business processes integration is fast becoming a standard in
enterprise management, and many new schemas are expected to come into development.

Logical Data Model

A logical data model is an important aspect of the design and implementation of a data
warehouse, in that the efficiency of the databases depends heavily on the data models.
A logical data model refers to the actual implementation of a conceptual model in a
database. It represents the normalized design of the common data model that is
required to support the design of an information system.

The logical data model elaborates the representation of all data pertaining to the
organization, and this model organizes the enterprise data in data management
technology terms.

In 1975, when the American National Standards Institute (ANSI) first introduced the
idea of a logical schema for data modeling, there were only two choices at that time,
which were the hierarchical and network models.

Today there are three choices for the logical data model: relational, object oriented
and Extensible Markup Language (XML). The relational option defines the data model in
terms of tables and columns. The object oriented option defines data in terms of
classes, attributes and associations. Finally, the XML option defines data in terms of
tags.

The logical data model is based closely on the conceptual data model, which describes
all business semantics in natural language without pointing to any specific means of
technical implementation such as the use of hardware, software or network
technologies.

The process of logical data modeling could be a labor intensive technique depending
on the
size of the enterprise the data model will be used for. The resulting logical data
model
represents all the definition, characteristics, and relationships of data in a
business,
technical, or conceptual environment. In short, logical data modeling is about
describing
end-user data to systems and end-user staff.

The very core of the logical data model is the definition of the three types of data
objects which are the building blocks of the data model: entities, attributes, and
relationships. Entities refer to persons, places, events or things which are of
particular interest to the company.

Some examples of entities are Employees, States, Orders, and Time Sheets. Attributes
refer to the properties of the entities. Examples of attributes for the Employee
entity are first name, birthday, gender, address, age and many others. Lastly,
relationships refer to the way in which the entities relate to each other. An example
relationship would be "customers purchase products" or "students enroll in classes".

The examples above form a logical data model using the Entity-Relationship (ER)
approach, which identifies entities, relationships, and attributes and normalizes the
data. A small sketch follows.
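As a hedged sketch only, using hypothetical names, the "customers purchase products"
relationship might eventually be expressed in normalized table form like this:

-- Entities
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    first_name  VARCHAR(50),
    birthday    DATE,
    gender      CHAR(1),
    address     VARCHAR(200)
);

CREATE TABLE product (
    product_id   INTEGER PRIMARY KEY,
    product_name VARCHAR(100)
);

-- The "customers purchase products" relationship (many to many)
CREATE TABLE purchase (
    customer_id  INTEGER REFERENCES customer(customer_id),
    product_id   INTEGER REFERENCES product(product_id),
    purchased_on DATE,
    PRIMARY KEY (customer_id, product_id, purchased_on)
);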
A logical data model should be carefully designed because it will have tremendous
impact on
the actual physical implementation of the database and the larger data warehouse.

A logical data model influences the design of data movement standards and techniques,
such as the heavily used extract, transform and load (ETL) process in data warehousing
and enterprise application integration (EAI), as well as the degree of normalization,
the use of surrogate keys and cardinality.

Likewise, it will determine the efficiency of data referencing and of managing all the
business and technical metadata and the metadata repository. Several pre-packaged
third party business solutions, like enterprise resource planning (ERP) or HR systems,
have their own logical data models, and when they are integrated into the overall
existing enterprise model with a well designed logical data model, the implementation
may turn out to be easier and less time consuming, resulting in savings for the
company.

Enterprise Data Model


An Enterprise Data Model is a representation of a single definition of an enterprise's
data, and the representation is not biased toward any system or application. It
defines, independently of any of them, how the data is sourced, stored, processed or
accessed physically.

An Enterprise Data Model gives an overall picture from an industry perspective by
offering an integrated blueprint view of all the data that is produced and consumed
across the departments of an enterprise or organization. It helps to resolve potential
inconsistencies and parochial interpretations of the data used.

It can also be a framework or an architectural design for data integration which
enables the identification of all shareable and/or redundant data across functional
and organizational boundaries. It serves to minimize data redundancy, disparity and
errors, which is core to data quality, consistency and accuracy.

With the Enterprise Data Model being a data architectural framework, the business
enterprise will have a starting point for all data system designs. Its theoretical
blueprint can provide provisions, rules and guidance in the planning, building and
implementation of data systems.

In the area of enterprise information systems, the Operational Data Store (ODS) and
the Data Warehouse (DW) are two of the largest components which need a carefully
designed enterprise data model, because data integration is the fundamental principle
underlying any such effort and a good model can facilitate data integration,
diminishing the data silos inherent in legacy systems.

As the name implies, the very core of an Enterprise Data Model is the data, regardless
of where the data comes from and how it will finally be used. The model is meant
primarily to give clear definitions on how to come up with efficient initiatives in
the aspects of Data Quality, Data Ownership, Data System Extensibility, Industry Data
Integration, Integration of Packaged Applications and Strategic Systems Planning.
The process of making an enterprise model typically follows a top-down approach for
all designs of the data systems, including the operational data store, data marts,
data warehouse and applications. The enterprise data model is built in three levels of
decomposition and forms a pyramid shape.

The first to be created is the Subject Area Model, which sits at the top of the
pyramid. It expands down to create the Enterprise Conceptual Model, and finally the
Enterprise Conceptual Entity Model is created and occupies the base of the pyramid.
The three models are interrelated, but each of them has its own unique purpose and
identity.

A fundamental objective of an Enterprise Subject Area Model is segregating the entire
organization into several subjects, in a manner similar to divide and conquer. Among
the aspects of this level are Subject Areas, Subject Area Groupings, Subject Area Data
Taxonomy and Subject Area Model Creation.

The Enterprise Conceptual Model, the second level in the pyramid, identifies and
defines the major business concepts of each of the subject areas. This model is a
high-level data model having, on average, several concepts for every subject area.
These concepts have finer detail than the subject area details. This model also
defines the relationships of each concept.

The Enterprise Conceptual Entity Model represents all things which are important to
each business area from the perspective of the entire enterprise. This is the detailed
level of an enterprise data model, with each of the concepts being expanded within
each subject area. It is also at this level that the business and its data rules,
rather than existing systems, are examined so as to create the major data entities,
the corresponding business keys, relationships and attributes.

Star Schema

The star schema, which is sometimes called a star join schema, is one of the simplest
styles of data warehouse schema. It consists of a few fact tables that reference any
number of dimension tables. The fact tables hold the main data, with the typically
smaller dimension tables describing each individual value of a dimension.

Star Schema is characterized by

. simplicity

. allows easy navigation


. has rapid response time

Normalization is not a goal of star schema design. Star schemas are usually divided
into fact
tables and dimensional tables, where the dimensional tables supply supporting
information.

A fact table contains a compound primary key that consists of an aggregate of the
relevant dimension keys, while a dimension table has a simple primary key.

A dimension table is also commonly in second normal form, as it consolidates redundant
data, while a fact table is commonly in third normal form, as all data depend on
either one dimension or all of them and not just on combinations of a few dimensions.

A star schema is a very important aspect of a data warehouse implementation in that it
is the best way to implement a multi-dimensional database using any common mainstream
relational database. It is also a very simple method: from the perspective of the
users, the queries are simple because the only joins and conditions involve a fact
table and a single level of dimension tables, without the indirect dependencies on
other tables that are possible in a better normalized snowflake schema. A small sketch
of such a schema and a typical query follows.
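As a hedged sketch under assumed names (fact_sales, dim_product, dim_date), a minimal
star schema and a typical star-join query might look like this:

-- Dimension tables: simple primary keys plus descriptive attributes
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name VARCHAR(100),
    category     VARCHAR(50)
);

CREATE TABLE dim_date (
    date_key      INTEGER PRIMARY KEY,
    calendar_date DATE,
    month_name    VARCHAR(20),
    year_number   INTEGER
);

-- Fact table: compound primary key made of dimension keys, plus measures
CREATE TABLE fact_sales (
    product_key  INTEGER REFERENCES dim_product(product_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    quantity     INTEGER,
    sales_amount NUMERIC(12,2),
    PRIMARY KEY (product_key, date_key)
);

-- A typical query: one fact table joined to a single level of dimension tables
SELECT d.year_number, p.category, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_product p ON p.product_key = f.product_key
JOIN dim_date d    ON d.date_key = f.date_key
GROUP BY d.year_number, p.category;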

Making a star schema for a database may be relatively easy, but it is still very
important to invest some time and research because the schema's effect on the
usability and performance of the database is very important in the long run.

In a data warehouse implementation, the creation of the star schema database is one of
the most important, and often the final, processes in implementing the data warehouse.
A star schema also has significant importance for some business intelligence processes
such as on-line transaction processing (OLTP) systems and on-line analytical
processing (OLAP).

On-line Transaction Processing is a standard, normalized database structure and, as
the name implies, it is used for transactions, which means that database table
inserts, updates, and deletes must be fast. For instance, let us take the scenario of
an organization's call center.

Several call center agents continuously take calls and enter orders, typically
involving numerous items, which must be stored immediately in the database. This makes
the scenario very critical, and the speed of inserts, updates and deletes should be
maximized. To optimize performance, the database should hold as few records as
possible at any given time.

On the other hand, On-line Analytical Processing, though it may mean many different
things to different people, is a means for analyzing corporate data. In some cases the
terms OLAP and star schema are used interchangeably, but a more precise way of
thinking would be to regard a star schema database as an OLAP system, which can be any
system of read-only, historical, aggregated data.

Both OLAP and OLTP can be optimized with a star schema in a data warehouse
implementation. Since a data warehouse is the main repository of a company's
historical data, it naturally contains very high volumes of data which can be used for
analysis with OLAP. Querying these data may take a long time, but with the help of a
star schema the access can be made faster and more efficient.

Reverse Data Modeling

Reverse data modeling is basically a form of reverse engineering applied to IT
systems: a process wherein an IT expert tries to extract information from an existing
system in order to work backward and derive a physical model, and then work further
back to a logical model, in the case of data modeling.

Because of some of the nuances associated with different database systems and
development platforms, reverse data modeling is generally a difficult task to
accomplish. Today, however, there are software vendors offering solutions to make
reverse data modeling relatively easier to do. These reverse data modeling solutions
can take snapshots of existing databases and produce physical models from which IT
staff can begin to verify table and column names.

While it is generally relatively easy to document table and column names by using such
tools, or by reviewing database creation scripts, what is really complicated and
difficult is dealing with the relationships between the tables in the database. The
catalog query sketched below illustrates the easy part.
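As a hedged illustration, on databases that expose the standard INFORMATION_SCHEMA
catalog (many, though not all, do), the table and column inventory can be pulled with
a query along these lines; the schema name used here is an assumption:

-- Listing tables and columns as a starting point for reverse data modeling
SELECT table_name, column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'public'
ORDER BY table_name, ordinal_position;

Declared foreign keys can be read from the referential constraint views in a similar
way, but undeclared relationships still have to be inferred from naming conventions,
application code and the people who use the system.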

While hard skills like coding and software engineering are very important in reverse
data modeling, soft resources like documentation, employee know-how and training can
also be invaluable in figuring out exactly how data is being used by the system in
question.

One of the biggest factors in a successful reverse data modeling effort is learning as
much as possible about the application; only by using the application can you
determine how the existing data structures are being used and how this compares with
the original design.

Reverse data modeling represents a big opportunity for some legacy reporting systems
to have their useful life extended, but it should not be construed as a permanent
replacement for the data warehousing technology stack.

Reverse data modeling also has a very big role in helping improve the efficiency of
enterprise applications by uncovering faults in the design, aiding the integration of
existing systems by pinpointing table and column definitions and the relationships
between those tables, and allowing comparisons of existing systems with potential new
system solutions.
A data warehouse implementation based on an enterprise data model is a system for
analyzing data in at least one data source of an enterprise. The system may comprise
various steps, such as a step for providing a meta model for the enterprise and a step
for forming the data schema from the meta model. Also included in the system may be
the creation of a database organized according to the data schema, the incorporation
of data into the database, and the analysis of the data in the database.

Some commercial software offers solutions that can enable true enterprise
application
engineering by storing and documenting data, processes, business requirements, and
objects that can be shared by application developers throughout an organization.

With reverse data modeling and the help of such software, the business organization

implementing an enterprise data management system can easily design and control
enterprise software for quality, consistency, and reusability in business
applications through
the managed sharing of meta-data.

Like all other reverse engineering technologies, regardless of whether they are
applied to software, hardware or non-IT implementations, reverse data modeling is very
useful at times when there is a deep-seated need to troubleshoot very complicated
system problems. Reverse data modeling can give IT staff a deeper insight into and
understanding of the data models, thereby empowering them to come up with actions to
improve the management of the system.

Physical Data Model

There are three basic styles of data models: conceptual data model, logical data
model and
physical data model. The conceptual data model is sometimes called the domain model
and
it is typically used for exploring domain concepts in an enterprise with
stakeholders of the
project.

The logical model is used for exploring the domain concepts as well as their
relationships.
This model depicts the logical entity types, typically referred to simply as entity
types, the
data attributes describing those entities, and the relationships between the
entities.
The physical data model is used in the design of the database's internal schema and,
as such, it depicts the data tables, the data columns of those tables, and the
relationships between the tables.
This model represents the data design taking into account the facilities and
constraints of
any given database management system. The physical data model is often derived from
the
logical data model although some can reverse engineer this from any database
implementation.
A detailed physical data model contains all the artifacts a database requires for
creating relationships between tables or achieving performance goals, such as indexes,
constraint definitions, linking tables, partitioned tables or clusters. This model is
also often used for calculating estimates of data storage and sometimes includes
details on storage allocation for a database implementation.

The physical data model is basically the output of physical data modeling which is
conceptually similar to design class modeling whose main goal is to design the
internal
schema of a database, depicting the data tables, the data columns of those tables,
and the
relationships between the tables.
In a physical data model, the tables where data will be stored in the database are
identified first.

For instance, in a university database, the database may contain a Student table to
store data about students. There may also be a Course table, a Professors table, and
other related tables to contain related information. The tables will then be
normalized.

Data normalization is the process wherein the data attributes in a data model are
organized to reduce data redundancy, increase data integrity, increase the cohesion of
tables and reduce the coupling between tables.
After the tables are normalized, the columns are identified. A column is the database
equivalent of an attribute. Each table will have one or more columns. In our example,
the university database may have columns in the Student table such as the FirstName,
LastName and StudentNumber columns.

The stored procedures are then identified. Conceptually, a stored procedure is like a
global method of the database. An example of a stored procedure would be code to
compute a student's average mark, student payables, or the number of students enrolled
and allowed in a certain course. Relationships are also identified in a physical data
model.

A relationship defines how attributes in one table relate to attributes in another
table. Relationships are very important in ensuring data integrity in a database after
an update, insert or delete is performed.

Keys are also assigned in the tables. A key is one or more data attributes that
uniquely identify a table row, thus eliminating data redundancy and increasing data
integrity. A small sketch pulling these steps together follows.
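As a hedged sketch of the university example, under assumed names, the physical model
for students, courses and their relationship might come out like this:

-- Tables, columns, keys and relationships from the physical data model
CREATE TABLE Student (
    StudentNumber INTEGER PRIMARY KEY,   -- the key identified for this table
    FirstName     VARCHAR(50) NOT NULL,
    LastName      VARCHAR(50) NOT NULL
);

CREATE TABLE Course (
    CourseId   INTEGER PRIMARY KEY,
    CourseName VARCHAR(100) NOT NULL
);

-- Relationship: students enroll in courses, enforced by foreign keys
CREATE TABLE Enrollment (
    StudentNumber INTEGER REFERENCES Student(StudentNumber),
    CourseId      INTEGER REFERENCES Course(CourseId),
    AverageMark   NUMERIC(5,2),
    PRIMARY KEY (StudentNumber, CourseId)
);

-- A physical-level detail: an index to support a common lookup
CREATE INDEX idx_student_lastname ON Student (LastName);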

Other specifications indicated in a physical data model include the application of
naming conventions and the application of data model patterns.

Database Concepts


Databases Are Fun!


Have you ever wondered about the name of a movie you watched years ago, and although
it was on the tip of your tongue, the few words of that short title just wouldn't come
out? What about a book you read when you were younger that had a particularly
compelling story to tell, and despite every effort on your part to relate that story
to a friend, the details simply weren't in your head anymore?

Perhaps you wish you had kept a simple, personal record of these movies and books,
so you
could have a quick look at it and easily identify the title that has been bothering
you all this
time? A database would be perfect for that!

The purest idea of a database is nothing more than a record of things that have
happened.
Granted, most professionals use databases for much more than storing their favorite
movies
and books, but at the most basic level, those professionals are simply recording
events, too.

Every time you join a web site, your new account information ends up in a database.
Have
you ever rented movies from Netflix? Their entire web site is essentially one big
database, a
record of every movie available to rent along with the locations of their copies of
that movie
among their many warehouses and customers' homes. The list goes on! Databases are
everywhere, and they assist us in performing many essential tasks throughout our
daily
lives.

You can easily see how databases affect all aspects of our modern lives, since
everything
you do, from calls you make on your mobile phone to transactions you make at the
bank to
the times you drive through a toll plaza and use your toll tag to pay, is recorded in a
in a
database somewhere.
If these databases did not exist, our lives would surely be much less convenient.
In the 21st
century, we are so accustomed to using credit cards and printing airline boarding
passes at
home that, if databases were to suddenly disappear, it would almost seem like we
were
cavemen again.

Fortunately, we have databases, and there are many people who are skilled at using
them
and developing software to use alongside them. These people take great pride in
their work,
as database programming is difficult but nonetheless very rewarding.

Consider, for example, the database team at Amazon.com. They built an enormous
database to contain information about books, book reviews, products that are not
books at
all, customers, customers' preferences, and tons of other things.
It must have taken them months to get the database just right and ready for
customers to
use! But, once it started to work well and Amazon.com went live back in 1995, can
you
imagine the sense of pride those developers had as millions of potential customers
poured
onto the web site and began to interact with the product they spent so much time
perfecting? That must have been an incredible feeling for the database team!

By no means do these brief words do justice to the power, complexity, or utility of
modern database platforms. However, they are certain to provide at least a little
insight into how significantly databases have changed our lives and how effective they
are at providing high quality solutions for difficult problems.

What is a Database?

When you are in a big electronics store buying the latest edition of the iPod, how
does that store's inventory tracking system know you just bought an iPod and not, for
example, a car stereo or a television?

Let's walk through the process of buying an iPod and consider all the implications
this has
on the inventory database that sits far underneath all the shiny, new gadgets on
the sales
floor.

When you hand the iPod box to the cashier, a barcode scanner reads the label on the
box,
which has a product identification number. In barcode language, this number might
be
something like 885909054336. The barcode representing this number can be seen in
Figure 1.

Figure 1. A sample barcode

The barcode acts as a unique identifier for the product; in this case, all iPods
that are the
same model as the one passing across the barcode reader have the same exact
barcode.

The barcode scanner relays the number represented by the barcode to the register at
the
cashier's station, which sends a request (or a query) to the store's inventory
database. This
database could be in the same store as the register or somewhere across the country
or
even around the world, thanks to the speed and reliability of the Internet.
The register asks the database, "What are the name and price of the product that has
this barcode?" To which the database responds, "That product is an iPod, and it costs
$200."

You, the customer, pay your $200 and head home with a new toy. Your work in the
store is
finished, but the inventory management system still needs to reconcile your
purchase with
the database!

When the sale is complete, the register needs to tell the database that the iPod
was sold.
The ensuing conversation goes something like the following.

Register: "How many products with this barcode are in our inventory?"
Database: "1,472."

Register: "Now, 1,471 products with this barcode are in our inventory."

Database: "OK."

What Did the Database Do?

Data Retrieval

Of course, this is not the whole story. Much more happens behind the scenes than
simple
conversational requests and acknowledgements.
The first interaction the register had with the database occurred when the request
for the
product name and price was processed. Let's take a look at how that request was
really
handled.

If the database is an SQL database, like MySQL or PostgreSQL or many others, then
the
request would be transmitted in the standard Structured Query Language (SQL). The
software running on the register would send a query to the database that looks
similar to
the following.

SELECT name, price FROM products WHERE id = 885909054336;


This query instructs the database to look in the products table for a row (also
called a
record) in which the id column exactly equals 885909054336.

Every database may contain multiple tables, and every table may contain multiple
rows, so
specifying the name of the table and the row's unique identifier is very important to this
to this
query. To illustrate this, an example of a small products table is shown in Figure
2.
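Since the figure itself is not reproduced here, a hedged sketch of what such a small
products table might look like (the names and values simply mirror the running
example) is:

-- A small products table like the one described in the example
CREATE TABLE products (
    id     BIGINT PRIMARY KEY,      -- the barcode number acts as the unique identifier
    name   VARCHAR(100) NOT NULL,
    price  NUMERIC(10,2) NOT NULL,
    onhand INTEGER NOT NULL
);

INSERT INTO products (id, name, price, onhand)
VALUES (885909054336, 'iPod', 200.00, 1472);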

When the database has successfully found the table and the row with the specified id,
it looks for the values in the name and price columns in that row. In our example,
those values would be "iPod" and "200.00", as seen in Figure 2. The execution of the
previous SELECT statement, which extracts those values from the table, is shown in
Figure 3.

The database then sends a message back to the register containing the product's name
and price, which the register interprets and displays on the screen for the cashier to
see.

Data Modification

The second time the register interacts with the database, when the inventory number
is
updated, requires a little more work than simply asking the database for a couple
numbers.
Now, in addition to requesting the inventory number with a SELECT statement, an
UPDATE
statement is used to change the value of the number.

First, the register asks the database how many iPods are in the inventory (or "on
hand").

SELECT onhand FROM products WHERE id = 885909054336;

The database returns the number of products on hand, the register decrements that
number
by one to represent the iPod that was just sold, and then the register updates the
database
with the new inventory number.

UPDATE products SET onhand = 1471 WHERE id = 885909054336;

This sequence is presented in Figure 4.

In Figure 4, the database responds to the UPDATE query with UPDATE 1, which simply
means one record was updated successfully.
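As a side note on the design, many applications skip the read-then-write round trip by
letting the database do the arithmetic in a single statement; a hedged sketch against
the same products table:

-- Decrement the on-hand count atomically, without a separate SELECT
UPDATE products
SET onhand = onhand - 1
WHERE id = 885909054336;

This also avoids the race condition in which two registers read the same count at the
same moment and both write back the same decremented value.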

Now that the number of iPods on hand has been changed, how does one verify the new
number? With another SELECT query, of course! This is shown in Figure 5.

Now, the register has updated the database to reflect the iPod you just purchased
and
verified the new number of iPods on hand. That was pretty simple, wasn't it?

More on Databases
You now know databases are made of tables, which are, in turn, made of records. Each
record has values for specific columns, and in many cases, a record can be uniquely
identified by the value contained in at least one column.

In our example, the barcode number uniquely identified the iPod, which cost $200,
in the
products table. You have also seen that values in a database can be modified. In
this case,
the number of iPods on hand was changed from 1,472 to 1,471.

Database Systems

Early Databases

In the 1960s, the System Development Corporation, one of the world's first computer
software companies and a significant military technology contractor, first used the
term "data base" to describe a system to manage United States Air Force personnel. The
term "databank" had also been used in the 1960s to describe similar systems, but the
public seemed less accepting of that term and eventually adopted the word "database",
which is universally used today.

A number of corporations, notably with IBM and Rockwell at the forefront, developed
database software throughout the 1960s and early 1970s. MUMPS (also known as M),
developed by a team at Massachusetts General Hospital in the late 1960s, was the first
programming language developed specifically to make use of database technology.

In 1970, the relational database model was born. Although this model was more
theoretical
than practical at the time, it took hold in the database community as soon as the
necessary
processing power was available to implement such systems.

The advent of the relational model paved the way for Ingres and System R, which
were
developed at the University of California at Berkeley and IBM, respectively, in
1976. These
two database systems and the fundamental ideas upon which they were built evolved
into
the databases we use today. Oracle and DB2, two other very popular database
platforms,
followed in the footsteps of Ingres and System R in the early 1980s.

Modern Databases

The Ingres system developed at Berkeley spawned some of the professional database
systems we see today, such as Sybase, Microsoft SQL Server, and PostgreSQL.
Now, PostgreSQL is arguably the most advanced and fastest free database system
available,
and it is widely used for generic and specific database applications alike. MySQL
is another
free database system used in roughly the same scope of applications as PostgreSQL.
While
MySQL is owned and developed by a single company, MySQL AB in Sweden, PostgreSQL
has
no central development scheme, and its development relies on the contributions of
software
developers around the world.

IBM's System R database was the first to use the Structured Query Language (SQL), which
which
is also widely used today. System R, itself, however, was all but abandoned by IBM
in favor
of focusing on more powerful database systems like DB2 and, eventually, Informix.
These
products are now generally used in large-scale database applications. For example,
the Wal-
Mart chain of large department stores has been a customer of both DB2 and Informix
for
many years.

Modern Database

The other major player in the database game, Oracle, has been available under a
proprietary license since it was released as Oracle V2 in 1979. It has undergone a
number
of major revisions since then and, in 2007, was released as Oracle 11g. Like DB2
and
Informix, Oracle is mostly used for very large databases, such as those of global
chain
stores, technology companies, governments, and so forth. Because of the similar
client
bases enjoyed by IBM and Oracle, the companies tend to be mutually cooperative in
database and middleware application development.

Microsoft SQL Server, initially based on Sybase, is another full-featured and
expensive database system designed to attract large customers. Its primary competitors are IBM and
IBM and
Oracle, but Microsoft has, to a great extent, been unable to secure a significant
percentage
of the high-end database market as its client base. As a result, SQL Server caters
mainly to
the lower end of the pool of larger database customers.
Some speculate Microsoft's inability to capture the higher end of the market is a result of
SQL Server's dependence on the Microsoft Windows operating system. In many cases,
SQL Server�s dependence on the Microsoft Windows operating system. In many cases,
Windows is seen as less reliable and less stable than UNIX-based operating systems
like
Solaris, FreeBSD, and Linux; all of which support databases like Oracle, DB2 and
Informix,
and MySQL and PostgreSQL.

In order of market share in terms of net revenue in 2006, the leaders in database
platform
providers are Oracle, with the greatest market share; IBM; and Microsoft.
While the database systems with the greatest market shares use SQL as their query
language, other languages are used to interact with a handful of other relatively
popular
databases. Most developers will never encounter these languages in their daily
work, but for
purposes of being complete, some of these languages are IBM Business System 12,
EJB-QL,
Quel, Object Query Language, LINQ, SQLf, FSQL, and Datalog. Of particular note is
IBM
Business System 12, which preceded SQL but was, for some time, used with System R
instead of SQL due to SQL being relationally incomplete at the time.

Today, organizations with large database projects tend to choose Oracle, DB2,
Informix,
Sybase, or Microsoft SQL Server for their database platforms because of the
comprehensive
support contracts offered in conjunction with those products. Smaller organizations
or
organizations with technology-heavy staff might choose PostgreSQL or MySQL because
they
are free and offer good, community-based support.

Terminology

The term "database" is widely misused to refer to an entire database system. Oracle,
for example, is not a database but a full-featured Database Management System (DBMS).
In fact, a DBMS can be used to manage many databases, and as such, a database is just
one part of a DBMS. In this series of articles, the terms "database system" and
"database platform" are used to refer to the idea of a DBMS.

Further, most modern database systems employ the idea of the relational database,
and
they are properly called Relational Database Management Systems (RDBMS). The
distinction
between a DBMS and a RDBMS, unless critical to the understanding of a specific
topic, is not
made in these articles.

Database Interaction

Efficient interaction, efficient storage, and efficient processing are the three
key properties
of a successful database platform. In this article, we explore the first: efficient
interaction.

Interaction Category 1: Command Line Clients

Many database platforms are shipped with a simple command line utility that allows
the
user to interact with the database. PostgreSQL ships with psql, which gives the
user
extensive control over the operation of the database and over the tables and schema
in the
database. Oracle's SQL*Plus and MySQL's mysql are similar utilities. Collectively,
these are also called SQL shells.

Interaction Category 2: GUI Clients

Another popular way to interact directly with a database is by using a graphical user
interface (GUI) that connects to the database server. Oracle's proprietary SQL
Developer software is one of these, although for every database on the market, there
are probably at least two or three good, free GUI packages available. Figure 2 shows
the "object browser" in pgAdmin III, a free administration tool for PostgreSQL
databases.

Interaction Category 3: Application Development

The final method for interacting with a database is through an application. This
indirect
interaction might occur, for example, when a bank customer is withdrawing money
from an
ATM. The customer only presses a few buttons and walks away with cash, but the
software
running on the ATM is communicating with the bank's database to execute the customer's
transaction. Applications that need to interact with databases can be written in
nearly all
programming languages, and almost all database platforms support this form of
interaction.

Command Line Clients

A command line client usually provides the most robust functionality for
interacting with a
database. And, because they are usually developed by the same people who developed
the
database platform, command line clients are typically also the most reliable. On
the other
hand, effectively using a command line client to its full extent requires expert
database skill.
The "help" features of command line clients are often not comprehensive, so figuring out
figuring out
how to perform a complex operation may require extensive study and reference on the
part
of the user. Some basic usage of the PostgreSQL command line client is shown in
Figure 1.

All command line clients operate in a similar manner to that shown in Figure 1. For
users
with extensive knowledge of SQL, these clients are used frequently.

One typically accesses an SQL command line client by logging into the database
server and
running them from the shell prompt of a UNIX-like operating system. Logging into
the
server may be achieved via telnet or, preferably, SSH. In a large company, the
Information
Technology department may have a preferred application for these purposes.

GUI Clients and Application Development

GUI Clients
The simplest way to think about a GUI client is to consider it to be a
sophisticated, flashy
wrapper around a command line client. Really, it falls into the third category of
interaction,
application development, but since the only purpose of this application is to
interface with
the database, we can refer to it separately as a GUI client.

The GUI client gives the user an easy-to-use, point-and-click interface to the
internals of the
database. The user may browse databases, schemas, tables, keys, sequences, and,
essentially, everything else the user could possibly want to know about a database.
In most
cases, the GUI client also has a direct interface to a simulated command line, so
the user
can enter raw SQL code, in addition to browsing through the database. Figure 2
shows the
object browser in pgAdmin III, a free, cross-platform GUI client for PostgreSQL.
Figure 2. The object browser in pgAdmin III

With an easy tree format to identify every element of the database and access to
even more
information with a few simple clicks, the GUI client is an excellent choice for
database
interaction for many users.

Figure 3 shows the Server Information page of MySQL Administrator, the standard GUI
tool
for MySQL databases.

Figure 3. The MySQL Administrator Server Information page


http://learn.geekinterview.com/images/ot12.jpg
http://learn.geekinterview.com/images/ot13.jpg
Application Development

Application development is the most difficult and time-consuming of the three methods of
interacting with a database. This approach is only considered when a computer program
needs to access a database in order to query or update data that is relevant to the program.

For example, the software running on an ATM at a bank needs to access the bank's central
database to retrieve information about a customer's account and then update that
information while the transaction is being performed.

Applications that require databases can be written in virtually any programming language.
For stand-alone applications, the most popular language for database programming is
C++,
with a growing following in the C# and Java communities. For web applications, Perl
and
PHP are the most popular languages, followed by ASP (and ASP.NET) and Python.
Interest
in using Ruby with the web and databases is growing, as well.

Many database access extensions for modern programming languages exist, and they
all
have their advantages and caveats. The expert database programmer will learn these
caveats, however, and eventually become comfortable and quite skilled at
manipulating
database objects within application code.

Figure 4 and Figure 5 show the code for a simple database application written in
Perl and its
output, respectively.
With all the features of modern programming languages, extremely complex database
applications can be written. This example merely glosses over the connection,
query, and
disconnection parts of a database application.
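
Since the Perl listing of Figure 4 is not reproduced here, the sketch below shows only the
SQL that such an ATM-style application might send once a connection is open; the account
table and values are assumed for the illustration, and the connect and disconnect steps are
left to the host language's database driver (DBI, in the Perl case).

-- read the balance, record the withdrawal, and make it permanent
BEGIN;

SELECT balance
FROM accounts
WHERE account_id = 1001;

UPDATE accounts
SET balance = balance - 100.00
WHERE account_id = 1001;

COMMIT;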

Getting Ahead with Databases

Database Overview

You have been using databases for a few years, and you think you are at the top of
your
game. Or, perhaps, you have been interested in databases for a while, and you think
you would like to pursue a career using them, but you do not know where to start. What is
the next
step in terms of finding more rewarding education and employment?

There are two routes people normally take in order to make them more marketable
and, at
the same time, advance their database skills. The first, earning an IT or computer
science
degree, requires more effort and time than the second, which is completing a
certification
program.

If you do not have a science, engineering, or IT degree yet and you want to keep
working
with databases and IT for at least a few years, the degree would probably be worth
the
time. For that matter, if you already have an undergraduate degree, then perhaps a
master's degree would be the right choice for you? Master's degrees typically only
require
three semesters of study, and they can really brighten up a resume. An MBA is a
good
option, too, if the idea of managing people instead of doing technical work suits
your fancy.
Your employees would probably let you touch the databases once in a while, too!

Many universities offer evening classes for students who work during the day, and
the
content of those classes is often focused on professional topics, rather than
abstract or
theoretical ideas that one would not regularly use while working in the IT field.
Online
universities like the University of Phoenix also offer IT degrees, and many busy
professionals have been earning their degrees that way for years now.

Certifications, while quite popular and useful in the late 1990s and early 2000s,
seem to be
waning in their ability to make one marketable. That said, getting a certification
is much
quicker than earning a degree and requires a minimal amount of study if the
certificate
candidate already works in the relevant field.

The certification will also highlight a job applicant's desire to "get ahead" and "stay ahead"
in the field. It may not bump the applicant up the corporate food chain like an MBA
might,
but it could easily increase the dollar amount on the paychecks by five to ten
percent or
more.
If you feel like you could be making more money based on your knowledge of
databases,
exploring the degree and certification avenues of continuing education may be one
of the
best things you do for your career.

Relational Databases

What is a Relational Database?

Popular, modern databases are built on top of an idea called "relational algebra", which
defines how "relations" (e.g. tables and sequences in databases) interact within the entire
"set" of relations. This set of relations includes all the relations in a single database.

Knowing how to use relational algebra is not particularly important when using
databases;
however, one must understand the implications certain parts of relational algebra
have on
database design.

Relational algebra is part of the study of logic and may be simply defined as "a set of
relations closed under operators". This means that if an operation is performed on one or
more members of a set, another member of that same set is produced as a result.
Mathematicians and logicians refer to this concept as "closure".

Integers

Consider the set of integers, for example. The numbers 2 and 6 are integers. If you
add 2 to
6, the result is 8, which is also an integer. Because this works for all integers,
it can be said
that the set of integers is closed under addition. Indeed, the set of integers is
closed under
addition, subtraction, and multiplication. It is not closed under division,
however. This can
be easily seen by the division of 1 by 2, which yields one half, a rational number
that is not
an integer.
Database Relations

Using the integer example as a starting point, we can abstract the idea of closure
to
relations. In a relational database, a set of relations exists. For the purposes of
initially
understanding relational databases, it is probably best to simply think of a
relation as being
a table, even though anything in a database that stores data is, in fact, a
relation.

Performing an operation on one or more of these relations must always yield another
relation. If one uses the JOIN operator on two tables, for example, a third table is always
produced. This resulting table is another relation in the database, so we can see relations
are closed under the JOIN operator.

Relations are closed under all SQL operators, and this is precisely why databases of this
nature can be called relational databases.
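
A small hedged example of this closure, using hypothetical customers and orders tables: the
JOIN below produces a new relation, and that result could itself be filtered, joined again
or stored, just like a base table.

SELECT c.customer_name, o.order_total
FROM customers AS c
JOIN orders AS o
  ON o.customer_id = c.customer_id
WHERE o.order_total > 500;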

Database Concurrency and Reliability

Database Concurrency and Reliability Overview

Concurrency and reliability have long been "hot topics" of discussion among developers and
developers and
users of distributed systems. The fundamental problem can be seen in a simple
example, as
follows.

Suppose two users are working on the same part of a database at the same time. They
both
UPDATE the same row in the same table, but they provide different values in the
UPDATE.
The UPDATE commands are sent to the database precisely at the same time. What does
the
database system do about this, and what are the rules governing its decision?

ACID

When discussing concurrency and reliability, developers often talk about the
components of
ACID: atomicity, consistency, isolation, and durability. Together, these properties
guarantee
that a database transaction is processed in a reliable, predictable manner. A
transaction, in
this case, can be defined as any set of operations that changes the state of the
database. It
could be something as simple as reading a value, deciding how to manipulate that
value
based on what was read, and then updating the value.

Atomicity

The atomicity property guarantees that a transaction is either completed in full or not
completed at all. Thus, the result of an operation is always success or failure, and no
transaction can result in a partial completion. Essentially, by making a transaction
"atomic", all the operations involved in the transaction are virtually combined into one
single operation.

Two important rules provide transaction atomicity. First, as operations in a transaction
occur, those operations must remain unknown to all other processes accessing the database
at the same time. Other processes may see only the final product after the transaction is
complete, or they will see no changes at all.

The second rule is somewhat of an extension of the first rule. It says that, if any operations
involved in a transaction fail, the entire transaction fails, and the database is restored to
the state before the transaction began. This prevents a transaction from being partially
completed.
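
A minimal sketch of atomicity, assuming hypothetical orders and line_items tables: if any
statement in the transaction fails, the whole transaction is rolled back and none of the rows
survive.

BEGIN;

INSERT INTO orders (order_id, customer_id) VALUES (501, 1001);
INSERT INTO line_items (order_id, line_no, quantity) VALUES (501, 1, 3);
INSERT INTO line_items (order_id, line_no, quantity) VALUES (501, 2, -1);
-- suppose the last INSERT violates a business rule and fails

ROLLBACK;   -- the order and its first line item are removed as well;
            -- the database returns to its state before BEGIN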

Database Consistency

Consistency is probably the most fundamental of the four ACID components. As such,
it is
arguably the most important in many cases. In its most basic form, consistency
tells us that
no part of a transaction is allowed to break the rules of a database.

For example, if a column is constrained to be NOT NULL and an application attempts to add
a row with a NULL value in that column, the entire transaction must fail, and no part of the
row may be added to the database.

In this example, if consistency were not upheld, the NULL value would initially still not be
added as part of the row, but the remaining parts of the row would be added. However,
since no value would be specified for the NOT NULL column, it would revert to NULL anyway
and violate the rules of the database. The subtleties of consistency go far beyond an
obvious conflict between NOT NULL columns and NULL values, but this example is a clear
illustration of a simple violation of consistency. In Figure 1, we can see that no part of a
row is added when we try to violate the NOT NULL constraint.
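
A hedged illustration of the NOT NULL case just described, with an invented customers table:
the INSERT fails as a whole, and no partial row appears.

CREATE TABLE customers (
    customer_id   INTEGER PRIMARY KEY,
    customer_name VARCHAR(100) NOT NULL
);

-- this violates the NOT NULL constraint, so the statement is rejected
-- and nothing at all is stored, not even the customer_id value
INSERT INTO customers (customer_id, customer_name)
VALUES (1, NULL);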

Isolation

The isolation property ensures that, if a transaction is being executed, no processes other
than the one executing the transaction see the transaction in a partially completed state. A
simple example of this is as follows. Suppose one customer of a bank transfers
money to
another customer. This money should appear in one customer's account and then in the
other customer's account but never in both accounts simultaneously. The money must
always be somewhere, and it must never be in two places at the same time.

Formally, isolation requires that the database's transaction history is
serializable. This
means that a log of transactions can be replayed and have the same effect on the
database
as they did originally.
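
The bank transfer can be sketched as a single transaction, again with an assumed accounts
table: other sessions see the balances either as they were before BEGIN or as they are after
COMMIT, never the half-finished state in which the money has left one account but not yet
arrived in the other.

BEGIN;

UPDATE accounts
SET balance = balance - 250.00
WHERE account_id = 1001;   -- debit the sender

UPDATE accounts
SET balance = balance + 250.00
WHERE account_id = 2002;   -- credit the receiver

COMMIT;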

Durability

A database system that maintains durability ensures that a transaction, once completed, will
persist. This may sound like a vague definition, but it is really quite simple. If an
application executes a database transaction, and the database notifies the application that
the transaction is complete, then no future, unintended event will be able to reverse that
transaction. A popular method of ensuring durability is to write all transactions to a log,
which can be replayed from an appropriate time in the case of system failure. No
transaction is considered to be complete until it is properly written to the log.

Distributed Databases

Distributed Databases Overview

Suppose you created a database for a web application a few years ago. It started
with a
handful of users but steadily grew, and now its growth is far outpacing the
server's
relatively limited resources. You could upgrade the server, but that would only
stem the
effects of the growth for a year or two. Also, now that you have thousands of
users, you are
worried not only about scalability but also about reliability.

If that one database server fails, a few sleepless nights and many hours of
downtime might
be required to get a brand new server configured to host the database again. No
one-time
solution is going to scale infinitely or be perfectly reliable, but there are a
number of ways
to distribute a database across multiple servers that will increase the scalability
and
reliability of the entire system.
Put simply, we want to have multiple servers hosting the same database. This will
prevent a
single failure from taking the entire database down, and it will spread the
database across a
large resource pool.

By definition, a distributed database is one that is run by a central management system but
which has storage nodes distributed among many processors. These "slave" nodes could be
in the same physical location as the "master", or they could be connected via a LAN, WAN,
or the Internet. Many times, database nodes in configurations like this have significant
failover properties, like RAID storage and/or off-site backup, to improve chances of
successful recovery after a database failure.

Distributed databases can exist in many configurations, each of which may be used
alone or
combined to achieve different goals.

Distributed Database Architecture

A distributed database is divided into sections called nodes. Each node typically
runs on a
different computer, or at least a different processor, but this is not true in all
cases.

Horizontal Fragments

One of the usual reasons for distributing a database across multiple nodes is to
more
optimally manage the size of the database.

For example, if a database contains information about customers in the United States, it
might be distributed across three servers, one each for the eastern United States, the
midwestern United States, and the western United States.

Each server might be responsible for customers with certain ZIP codes. Since ZIP
codes are
generally arranged from lowest to highest as they progress westward across the
country,
the actual limits on the ZIP codes might be 00000 through 33332, 33333 through
66665,
and 66666 through 99999, respectively.

In this case, each node would be responsible for approximately one third of the
data for
which a single, non-distributed node would be. If each of these three nodes
approached its
own storage limit, another node or two nodes might be added to the database, and
the ZIP
codes for which they are responsible could be altered appropriately. More "intelligent"
configurations could be imagined, as well, wherein, for example, population density
is
considered, and larger metropolitan areas like New York City would be grouped with
fewer
other cities and towns.

In a distribution like this, either the database application or the database management
system, itself, could be responsible for deciding which database node would process
requests for certain ZIP codes. Regardless of which approach is taken, the distribution of
the database must remain transparent to the user of the application. That is, the user
should not realize that separate databases might handle different transactions.

Reducing node storage size in this manner is an example of using horizontal
fragments to
distribute a database. This means that each node contains a subset of the larger
database's
rows.
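
As a sketch of the same idea within a single server (across separate servers the routing would
be done by the application or by the database's own distribution features), the ZIP-code split
could be declared with range partitioning, using syntax of the kind found in PostgreSQL 10 and
later; the table layout is invented for the example.

CREATE TABLE customers (
    customer_id   INTEGER,
    customer_name VARCHAR(100),
    zip           CHAR(5)
) PARTITION BY RANGE (zip);

CREATE TABLE customers_east PARTITION OF customers
    FOR VALUES FROM ('00000') TO ('33333');   -- 00000 through 33332

CREATE TABLE customers_midwest PARTITION OF customers
    FOR VALUES FROM ('33333') TO ('66666');   -- 33333 through 66665

CREATE TABLE customers_west PARTITION OF customers
    FOR VALUES FROM ('66666') TO (MAXVALUE);  -- 66666 through 99999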

Vertical Fragments

The vertical fragment approach to database distribution is similar in concept to the
horizontal fragment approach, but it does not lend itself as easily to scalability.
Vertical
fragments occur when columns, instead of rows, are distributed across multiple
nodes.

A situation that calls for vertical fragments might arise if a table contains
information that is
pertinent, separately, to multiple applications. Using the previous example of a
database
that stores customer information, we might imagine an airline's frequent flyer
program.

These programs typically track, among other things, customers' personal information, like
addresses and phone numbers, along with a list of all the trips they have flown and the
miles they have accrued along the way.

These sets of data have different applications: the customer information is used
when
mailing tickets and other correspondence, and the mileage information is used when
deciding how many complimentary flights a customer may purchase or whether the
customer has flown enough miles to obtain "elite" status in the program. Since the two sets
two sets
of data are generally not accessed at the same time, they can easily be separated
and
stored on different nodes.

Since airlines typically have a large number of customers, this distribution could
be made
even more efficient by incorporating both horizontal fragmentation and vertical
fragmentation. This combined fragmentation is often called the mixed fragment
approach.
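
A hedged sketch of the vertical split, with invented table names: the two fragments share the
same key, so they can still be joined on the rare occasions when both sets of columns are
needed together.

-- node 1: contact data used for mailing tickets and correspondence
CREATE TABLE member_contact (
    member_id INTEGER PRIMARY KEY,
    full_name VARCHAR(100),
    address   VARCHAR(200),
    phone     VARCHAR(20)
);

-- node 2: mileage data used for awards and elite-status decisions
CREATE TABLE member_mileage (
    member_id   INTEGER PRIMARY KEY,
    miles_flown INTEGER,
    award_miles INTEGER
);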

Other Fragmentation Types

A database can be broken up into many smaller pieces, and a large number of methods
for
doing this have been developed. A simple web search for something like "distributed
databases" would probably prove fruitful for further exploration into other, more complex,
complex,
methods of implementing a distributed database. However, there are two more terms
with
which the reader should be familiar with respect to database fragmentation.
The first is homogeneous distribution, which simply means that each node in a
distributed database is running the same software with the same extensions and so
forth. In
this case, the only logical differences among the nodes are the sets of data stored
at each
one. This is normally the condition under which distributed databases run.

However, one could imagine a case in which multiple database systems might be
appropriate for managing different subsets of a database. This is called
heterogeneous
distribution and allows the incorporation of different database software programs
into one
big database. Systems like this are useful when multiple databases provide
different feature
sets, each of which could be used to improve the performance, reliability, and/or
scalability
of the database system.

Replication

In addition to the distribution situations above, full-database replication is also available
for many database platforms. This is really what we mean when we say a database is
hosted by
multiple servers, but in general, the idea of distributing pieces of a database
should be
considered before putting much thought into wholesale replication of a database.
This is for
one simple reason: replication is expensive. It's expensive in terms of finance,
time, and
data, but for many applications, it truly is the best solution.

Here, we briefly discuss "master-master" replication, which is perhaps the most complicated
of all the replication solutions. This is also the most comprehensive replication solution,
since each master always has a current copy of the database. Because of this, the
entire
database will still be available if one node fails.

The Three Expenses in Distributed Databases

Essentially, replication entails creating exact copies of databases on many computers and
updating every database simultaneously whenever an update is performed on one
database.
The pitfalls of this process are explained by the three expenses, below.
Replication has finance expense because every server, every hard drive, every
battery-
backed RAID card, every network switch, every fast network connection, every
battery-
backed power supply, and every other piece of associated hardware must be
purchased. In
addition to that are the costs of bandwidth, maintenance, backup servers, co-
location,
remote management, and many other things. For a decent-sized database, this could
very
easily run into the tens of thousands of dollars before even getting to the "every hard drive"
part of the list.

Replication has time expense because each operation performed on one node's database
must be performed on each other node's database simultaneously. Before the operation can
operation can
be said to be committed, each other node must verify that the operation in its own
database
succeeded. This can take a lot of time and produce considerable lag in the
interface to the
database.

And, replication has data expense because every time the database is replicated,
another
hard drive or two or more fills up with data pertaining to the database. Then,
every time
one node gets a request to update that data, it must transmit just as many requests
as
there are other nodes. And, confirmations of those updates must be sent back to the
node
that requested the update. That means a lot of data is flying around among the
database
nodes, which, in turn, means ample bandwidth must be available to handle it.

How to Initiate Replication

Many of the more popular databases support some sort of native replication. In MySQL, for
example, a dedicated account is granted the REPLICATION SLAVE privilege and the replica
servers are then pointed at the master. PostgreSQL, on the other hand, has traditionally
relied on external software for replication, usually in the form of Slony-I, a comprehensive
replication suite. Each database platform has a different method for initiating replication
services, so it is best to consult that platform's manual before implementing a replication
solution.
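
For MySQL, a typical (and deliberately simplified) sequence looks something like the sketch
below; the exact statements vary with the MySQL version, binary logging and distinct
server-id values are assumed to be configured already, and the host name, account name,
password and log coordinates are placeholders.

-- on the master: create an account the replicas may connect with
CREATE USER 'repl'@'%' IDENTIFIED BY 'secret';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';

-- on each replica: point it at the master and start replicating
CHANGE MASTER TO
    MASTER_HOST = 'master.example.com',
    MASTER_USER = 'repl',
    MASTER_PASSWORD = 'secret',
    MASTER_LOG_FILE = 'mysql-bin.000001',
    MASTER_LOG_POS = 4;

START SLAVE;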

Considerations

When implementing a distributed database, one must take care to properly weigh the
advantages and disadvantages of the distribution. Distributing a database is a
complicated
and sometimes expensive task, and it may not be the best solution for every
project. On the
other hand, with some spare equipment and a passionate database developer,
distributing a
database could be a relatively simple and straightforward task.

The most important thing to consider is how extensively your database system
supports
distribution. PostgreSQL, MySQL, and Oracle, for example, have a number of native
and
external methods for distributing their databases, but not all database systems are
so
robust or so focused on providing a distributed service. Research must be performed
to
determine whether the database system supports the sort of distribution required.

The field of distributed database management is relatively young, so the appropriate
distribution model for a particular task may not be readily available. In a
situation like this,
designing one's own distributed database system may be the best development option.

Regardless of the approach taken, distributing a database can be a very rewarding process
when considering the improvement of the scalability and reliability of a system.

Business Intelligence

What is Business Data

Business data refers to the information about people, places, things, business
rules, and
events in relation to operating a business.

Serious businesses need to consider setting up business intelligence and data
warehouses. In pursuing these capabilities, they need to adopt a holistic view coupled with
coupled with
wise investment and careful execution. For a business to really grow, it should
consider
interrelated areas involving people, strategy, process, applications, metrics, data
and
architecture.

It is very important to gather business data and base an organization's decisions on the
resulting statistical reports, in order to make precise decisions on how to move the
company forward for sustainability.

Knowing things about people and their buying behaviors can make a company generate
very important business data. For instance, statisticians and market researchers know that
certain age groups have unique buying habits. Races and people from different demographic
locations also have buying patterns of their own, so gathering this information in one
business database can be a good basis for future target marketing.

In terms of production, business data about where to get raw materials, how much they
cost, what the customs and importation policies of the raw materials' country of origin are,
and other such information is also very important.

There are many software applications that manage business data for easy statistical
reporting and spotting of trends and patterns.


The Business Data Catalog functionality in some applications allows users to present
line-of-business data. It can search and retrieve information from back end systems such as
Enterprise Resource Planning (ERP), Customer Resource Management (CRM), Advance
Planner and Optimizer (APO) and Supply Chain Management (SCM).

Many companies maintain a business data warehouse where data from several sources are
collected and integrated every few minutes. These repositories of business data may supply
needed information to generate reports and recommendations in an intelligent manner.
Hence the term Business Intelligence is already widely used in the business industry today.

Business intelligence generally refers to technologies and software applications that are
used to gather, integrate and analyze business data and other information pertaining to the
operation of the company. It can help companies gain more comprehensive and in-depth
knowledge of the many factors that can affect their business. This knowledge may include
metrics on sales, internal operations and production. With recommendations from business
intelligence, companies can make better decisions for the business.

For processing billions of business data records in the data warehouse for business
intelligence, companies use high-powered and secure computer systems that are installed with
different
levels of security access.

Several software applications and tools have been developed to gather and analyze
large
amounts of unstructured business data ranging from sales statistics, production
metrics to
employee attendance and customer relations. Business intelligence software
applications
vary depending on the vendor, but the common attribute in most of them is that they
can be
customized based on the needs and requirements of the business company. Many
companies have in-house developers to take care of business data as the company
continues to evolve.

Some examples of business intelligence tools to process business data include scorecarding,
Business activity monitoring, Business Performance Management and Performance
Measurement, Enterprise Management systems and Supply Chain Management/Demand
Chain Management. Free Business Intelligence and open source products include
Pentaho,
RapidMiner, SpagoBI and Palo, an OLAP database.

Business data is the core of the science of Analytics. Analytics is the study of
business data
that uses statistical analysis in knowing and understanding patterns and trends in
order to
foresee or predict business performance. Analytics is commonly associated with data
mining and statistical analysis, although it leans more towards physics-like modeling,
which involves extensive computation.

What is Business-Driven Approach

A business-driven approach is any process of identifying the data needed to support
business activities, acquiring or capturing those data, and maintaining them in the data
resource.

Every day, billions of data and information items get carried across different communications
media. The number one medium is of course the internet. Other data communications media
are television and mobile phones.

Any non-business individual or entity may not find the real significance of these
data. They
are merely there because it is innate in people to communicate and get connected
with
each other.

But individuals or organizations who think about business take advantage of these
data in a
business-driven approach. They try to collect, aggregate, summarize and
statistically
analyze data so they know what products they may want to see and who among the
people
will be their target market.

Many people who started from scratch but were driven by passion for business have created
multinational corporations. Traditional vendors started from the sidewalk and took a
business-driven approach to move their wares from the streets to the internet, where
millions of potential buyers converge.

Today's businesses are very closely dependent on the use of technology. More and
more
transactions are going online. With the possibility of high security money
transfers, people
can buy items from continents away using just their credit cards. They can also
maintain
online accounts and make fund transfers in seconds at one click of the mouse.

Software application developers and vendors are coming up with innovative tools to help
businesses optimize their performance. Large data warehouses, those repositories of all
sorts of data getting bulkier every single minute, are being set up, and companies are
investing in high-powered computers connected to local area networks and the internet.
To manage these data warehouses, database administrators use relational database
technology with the aid of tools like Enterprise Resource Planning (ERP), Customer
Resource
Management (CRM), Advance Planner and Optimizer (APO), Supply Chain Management
(SCM), Business Information Warehouse (BIW), Supplier Relationship Management
(SRM),
Human Resource Management System (HRMS) and Product Lifecycle Management (PLM)
among thousands of others.

To keep a business stable and credible, especially when dealing with a global market using
the internet, it needs to ensure the security of critical data and information. Keeping these data
these data
safe is a responsibility which requires coordinated, continuous and very focused
efforts. Companies should invest in infrastructures that shield critical resources.

To keep the business afloat in all situations, it has to see to it that procurement
and
contract management is well maintained and coordinated. It has to wield substantial

purchasing power to cope with new and highly competitive contracting methods.
A business needs to constantly evaluate enterprise resource planning implementation
and
should have a distinct department to manage finance and human resources with
principles
that are sound.

Collecting, sharing and reporting of technology information by the business organizations
should be strategized to effectively manage investment in technology and at the same time
reduce the burden of managing redundant and unrelated information.

All these mentioned requirements in a business-driven approach should follow a business
framework. This framework will guide the company's staff and talents so that all they do
will be profitable for the company. The business framework will also make it easy for
application developers, especially if they are in-house developers, to tailor their outputs
to make the organization function at its optimum.

A business-driven approach involves keen appreciation of tiny details, spotting and tracing
of industry trends and patterns, very careful and analytical planning, constant
research on
the needs and preferences of potential buyers and intensive use and investment of
the
latest in technology.

What is Business Drivers

Business drivers are the people, information, and tasks that support the
fulfillment of a
business objective. They lead the company, steering it away from pitfalls and turning
unforeseen mistakes into good lessons for future success and sustainability.
unforeseen mistakes into good lessons for future success and sustainability.

A business needs to be constantly driven and updated to be at par with its competitors and
to be in sync with the latest trends in business technology which change sometimes
very
unexpectedly.

Technology is fast evolving and businesses that do not evolve with technology will
suffer
tremendously. The world of business is a competitive sink or swim arena. Every
single day,
a business is exposed to risks and challenges but this is just a natural thing.

The foremost key business drivers, of course, are the staff and talents: the people. Being
the literal brains behind every business, people set the objectives, execute critical
decisions and constantly innovate to move the business forward. Human resources
departments are scouting for the best people in the business industry every day. Different
people have different aptitudes, so it is important to employ only those with business
stamina. While most companies prefer people with advanced degrees, specifically master's
and doctorate degrees in business administration and the like, there are many people who
have not even finished college but have innate skills for running a business effectively.

Technological innovation is another key business driver. It cannot be denied that today's
businesses cannot function without the use of information technology. Hundreds of millions
of data items are coming in every day, and for businesses to take advantage of these data to
improve their products and services, they need to use technology. People alone, although
they are the thinking entities, are more prone to error when it comes to doing repetitive and
labor intensive tasks. Compared to machines, people also work a lot slower.

Software applications automate daily tasks in the business. Manual recording is tedious
and very repetitive work, but a software application can overcome this problem with less
error and faster processing. And business is not just about recording. Every day, hundreds
of computations need to be done. Product sales, procurement, employee salary,
inventory
and other things need to be considered. These activities involve millions of
figures and
complex formulas are needed to achieve precise results. Individual results are
aggregated
with other results to come up with a bigger perspective of the company operations.
Tools
like Enterprise Resource Planning (ERP), Customer Resource Management (CRM),
Advance
Planner and Optimizer (APO), Supply Chain Management (SCM), Business Information
Warehouse (BIW), Supplier Relationship Management (SRM), Human Resource Management
System (HRMS) and Product Lifecycle Management (PLM) are increasingly being employed
by business organizations to boost their overall performance.

The fast rise in popularity of open source and open standards has also been a significant
business driver in recent years. Open standards make possible the sharing of non-proprietary
formats, so portability is no longer an issue. In the past, integration of
internal systems of a company was expensive because different computing platforms could
not effectively communicate with each other. Now, with more bulk data being
transferred from one data warehouse to another, open standards are making things a lot
faster and more efficient.
Open source, on the other hand, gives companies free software or software with a minimal
fee and thus saves the company a lot of money to be used in other investments. Open source
software applications are in no way inferior to commercial applications. With open
source, anybody can make additions to the source code, so this can be a good way to
customize software to the business architecture.

Business Architecture

Business architecture is one of the four layers of an IT architecture. The other three layers
are information, applications and technology. Business architecture describes the
business
processes utilized within the organization.

If one looks at the dictionary, architecture is defined as "a unifying or coherent form or
structure." As one of the layers of the IT architecture, it is extremely important to know
and understand the business architecture to come up with relevant software systems. Aside
from the technical aspects, information system architects should be concerned with
the
content and usage of the systems they are building for the business organization.

It cannot be denied that today's business is very tightly intertwined with information system
technology. Even for the small home business, software applications are very much in
use. It is impossible for multinational corporations to operate without business
software.

A good analogy for business architecture would be the architecture of real buildings. Building
architects need to understand the purpose of the building before doing the design. If they
are designing homes, they need to also understand certain patterns and trends like the
behavior of families. If they design skyscrapers, they need to understand weather conditions,
among other things.

In the same manner, business architects need to understand basic business concepts. They
need to know the requirements of the business. For example, if the
business is about manufacturing of furniture, they need to know where to get the
raw
materials and how much they cost. They also need to know who the target clients are
and
how to deal with competition.

Business architects should also determine the scope of the business. How can the
company
grow and branch out to areas or countries? What is the expected annual growth based
on
product manufacture and sales revenues?

There are many more considerations to account for, and once all these are in one place, a
business architect starts drawing the design. The design must cater to all aspects of
the business. There is no trivial aspect in business as tiny details can create
huge impact
on the organization as a whole.

As one of the layers of the IT architecture, the business architecture is the very framework
on which the other layers, information, application and technology, are based. Business
data constitute information which may be used by business software applications which are
executed by hardware technology. All these other layers operate within the business
architecture framework.

Software applications are developed to simulate real-life activities. Manual transactions
like recording sales and revenues which in the past involved tallying the data in
books are
automated with the use of business software.
Many business application software packages have been developed to adapt to the business
architecture. There are many private application software developers selling highly
customizable software packages to cater to the needs of an organization. Such technologies
as Enterprise Resource Planning (ERP), Customer Resource Management (CRM), Advance
Planner and Optimizer (APO), Supply Chain Management (SCM), Business Information
Warehouse (BIW), Supplier Relationship Management (SRM), Human Resource Management
System (HRMS) and Product Lifecycle Management (PLM), among others, can all be
customized based on the requirements as indicated in the business architecture.

The nature of business is fast evolving. Many traditional businesses have evolved from a
local place to global proportions. One mobile phone company has evolved from wood pulp
mills to rubber works to what today is perhaps one of the global leaders in technological
innovations. Some insurance companies also consider themselves banks and vice versa.
These complex business evolutions need to be tracked so appropriate changes in the
business architecture can also be taken care of by the information systems architects.

What is Business Experts

Business experts are people who thoroughly understand the business and the data
supporting the business. They know the specific business rules and processes.

In a typical company setup, a person will have a long way to climb up the corporate ladder
before he or she lands a top manager position. During the period that he is
climbing the
corporate ladder, he learns valuable lessons about the company's objectives,
decision
making guides, internal and external policies, business strategy and many other
aspect of
the business.

The chief executive officer (CEO) is the business organization's highest ranking
executive.
He is responsible for running the business, carrying out policies of the board of
directors and
making decisions that can highly impact the company. The CEO must not only be a
business expert in general but must also be a business expert in particular about
all the
details of the business he is running. He can be considered the image of the
company and
his relationship both internally and externally with other companies is very vital
for the
success of the business.

Accountants are key business experts responsible for recording, auditing and inspecting
financial records of a business and preparing financial and tax reports. Most accountants give
recommendations by laying out projected sales, income, revenue and others.

Marketing people, whether marketing managers or staff, are constantly on the lookout for
marketing trends. They gather different statistical data and demographics so they know the
target market for goods and services. They work closely with advertising people.
Business software developers are among the top notch information technology
professionals. Aside from mastering the technical aspect of IT like computer
languages and
IT infrastructure, they must also know the very framework of the business
architecture. Their applications are made to automate business tasks like
transaction
processing and all kinds of business related reporting.

Customer relations specialists take care of the needs of clients, especially the current and
loyal ones. They make sure that clients are satisfied with the products and services. They
also act like marketing staff by recommending other products and services to clients. Their
main responsibility is keeping the clients happy, satisfied and wanting more.

Human resource staff take care of hiring the best minds suited for the business.
Since
businesses offer different products and services, the human resource staff is
responsible for
screening potential employees to handle the business operations. In order for the
human
resource staff to match the potential candidate, they (the HR staff) should also
know the ins
and outs of the business they are dealing with.

There are also business experts who do not want to be employed in other companies
but
instead they want to have their own business and be their own boss. These people are
called entrepreneurs. Entrepreneurs invest their own money in the business of their
choice. They make feasibility studies first before throwing in their money, or they may hire
the services of a business consultant.

A business consultant is a seasoned business expert who has a lot of business success
experience under his belt. Most business consultants are not fully attached to one company
alone. Business consultants know all aspects of the business. A consultant recommends
actions by studying the financial status and transaction history of the company he is
offering his services to. Many companies offer business consultancy as their main line of
service.

Corporate lawyers focus on laws pertaining to business. They take charge of the
contracts
and represent companies during times of legal misunderstandings with other entities.
In all
undertakings whether traditional or new, corporate lawyers will have to ensure that
the
company does not violate any law of a given country.

What is Business Schema

A business schema is a schema that represents the structure of business transactions used
by clients in the real world. It is considered to be unnormalized data.

The dictionary defines the word "schema" as a plan, diagram, scheme or an underlying
organizational structure. It can also mean a conceptual framework.

Running a business is not a simple undertaking. In fact, it can get very complex
as the
business grows and growth is one of the priority goals of any business.

For a business to be efficiently managed, it has to have an adopted standard of rules and
policies. It has to have business semantics detailed in declarative description.

Business rules are very important in defining a business schema. Business rules can
influence and guide behavior, resulting in support for policies and responses to
environmental events and situations. Rules may be the top means whereby a business
organization directs its movement, defines its objectives and performs appropriate
actions.

A business schema can be represented in a data model with un-normalized data. A data
model can reflect two and a half of the four different kinds of business rules, which are
terms, facts, results of derivations and constraints. A data model can reflect the data
parameters that control the rules of the business.

The terms in a business schema are the precise definitions of words used in the business
rules. Order, Product Type and Line Item are terms that refer to entity classes, which are
things of heavy significance to the business. Attributes are terms that describe an entity
class. For example, total number and total value are attributes of an order. Attributes of
Product Type may include manufacturer, unit price and materials used. Quantity and
extended value are attributes of the Line Item entity class.

Facts, another kind of business rule in the schema, describe a thing, such as the role a thing
plays and other descriptions. A data model has three kinds of facts, which are relationships,
attributes and super-types / sub-types.

A derivation can be any attribute that is a derivative of other attributes or system
variables. For example, the extended value attribute for the Line Item entity class can be
determined by multiplying the quantity of the line item by the unit price of the product
type.

Constraints refer to conditions which determine what values a relationship or an attribute
can or cannot have.
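
As a hedged sketch of how two of these rule types might be reflected in SQL (the table and
column names are invented), the derivation becomes a computed column in a view, and the
constraint becomes a CHECK clause.

-- derivation: extended value = quantity of the line item times unit price
CREATE VIEW line_item_values AS
SELECT li.order_id,
       li.line_no,
       li.quantity,
       pt.unit_price,
       li.quantity * pt.unit_price AS extended_value
FROM line_items AS li
JOIN product_types AS pt
  ON pt.product_type_id = li.product_type_id;

-- constraint: a line item may not have a zero or negative quantity
ALTER TABLE line_items
    ADD CONSTRAINT chk_quantity_positive CHECK (quantity > 0);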

Many companies hire consultants to document and consolidate the standards, rules,
policies and practices. This documentation is then handed to IT and database
consultants so that it can be transformed into database rules that follow the business
schema.

Many independent or third party consultancy firms, research institutions and
software
application vendors and developers offer rich business schema solutions. Even if
business
rules are constantly changing and software applications are already in final form,
these
software applications can be customized to be in sync with the constantly changing
business
rules in particular and the schema in general.

A business rules engine is a widely used software application that is used to manage and
automate business rules so that they follow legal regulations. For example, the law that
states "An
employee can be fired for any reason or no reason but not for an illegal reason" is
ensured
to be followed by the software.

The rule engine's most significant function is to help classify, register and manage business
and legal rules and verify them for consistency. Likewise, it can infer some rules based on
other existing rules and relate them to IT applications. Rules in IT applications can
automatically detect unusual situations arising from the operations.

With a business schema clearly defined, there is little room for mistake in running
a
business successfully.

Business-Driven Data Distribution

Business-driven data distribution refers to the situation where the business need
for data at
a specific location drives the company to develop a data site and the distribution
of data to
the said newly developed site. This data distribution is independent of a
telecommunications
network.

Having a data warehouse can have tremendous positive impact on a company because
the
data warehouse can allow people in the business to get timely answers to many
business
related questions and problems. Aside from answers, data warehouses can also help
the
company spot trends and patterns on the spending lifestyles of both their existing
clients or
potential customers.

Building a data warehouse can be expensive. With today's businesses relying heavily
on
data, it is not uncommon for companies to invest millions of dollars on IT infrastructure
alone. Aside from that, companies still need to spend on hiring IT staff and consultants.

For many companies, the best way to grow is to utilize the internet as the main
market
place of their goods and services. With the internet, a company can be exposed to
the
global market place. People from different places can have easy access to goods or
make
inquiries when they are interested. Customer relationships can be sealed a lot faster without
staff having to travel to far places. The company can have presence in virtually all
countries
around the world.

But with a wide market arena comes greater demand for data. One data warehouse location
alone may not be able to handle the bulk of data that needs to be gathered, aggregated,
analyzed and reported.

This is where business-driven data distribution comes in. For instance, a company
that has
mainly operated in New York and London has one data warehouse to serve both
offices. But
since many people on the internet have come to know of the company and availed of
their
services and have since then been loyal clients, many different locations are
starting to
generate bulks of data. If, say, Tokyo has an exponentially increasing pool of clients, the
company should already consider creating a new data warehouse in that location. This
data
warehouse can be the source of data to distribute to other data warehouses located
in other
areas but maintained by the same company.

Companies considering establishing new data warehouses for business-driven data
distribution have many things to consider before jumping into the venture. As mentioned,
setting up and maintaining a data warehouse does not come cheap. Some of the basic
questions to answer before building the data warehouse are "Will this bring
efficiency to the
company operations?" or "Will this save the company in the long run and bring in
more
revenue, new markets and products, competitive advantage and improved customer
service?"

If all these questions are coming up with positive answers, then the next logical
step would
be to design and implement the new data warehouse that will become the new
business-
driven data distribution center. As in most data warehouses, the appropriate high-powered
computers and servers will have to be bought and installed. Relational database
software
applications and other user friendly tools should also be acquired.

As time progresses, a company may have several locations that have data warehouses acting
as business-driven data distribution centers. These separate warehouses continually
communicate with each other, transferring data updates every minute and synchronizing and
aggregating data so business staff can get the most recent trends and patterns in the market.

With this setup, a company can have a very competitive advantage over its competitors.
They can formulate new policies, strategies and business rules based on the demand
created from the reports of the many business driven data distribution centers
around the
world.
What is Business Activity

Business activities refer to the components of the information technology infrastructure
representing all the business activities in a company, whether they are manual or
automated.

Business activities utilize all data resources and platform resources in order to perform
specific tasks and duties of the company.

For a company to survive in the business, it must have a competitive edge in the
arena
where multitudes of competitors exist. Competitive advantages can be had by having
rigorous and highly comprehensive methods for current and future rules, policies
and
behaviors for the processes within the organization.

Although business activities broadly relate to optimization practices in business done by
people in relation to business management and organizational structure, they are closely and
strictly associated with information technology implementation within the company and all
its branches, if they exist.

Business activities are part of the Information Technology Infrastructure. The Information
Technology Infrastructure is the framework of the company dwelling on the approaches for
best practices to be translated into software and hardware applications for optimum delivery
of quality IT services.

Business activities are the bases for setting management procedures to support the
company in achieving the best financial quality and value in IT operations and services.

Different companies generally have common business activities. In general, business
activities may be broadly categorized into accounting, inventory, materials acquisition,
human resource development, customer relationship management, and products and
services marketing.

Although these broad categories may apply to many businesses, companies have differing
needs within these categories. For instance, a law firm may not need to acquire raw
materials as
do furniture companies. Hotels need intensive marketing while an environmental non
profit
organization may not need marketing at all.

Several vendors sell IT solutions to take care of business activities and expedite manual
work by automating it through computers.

An Enterprise Resource Planning (ERP) system is a software-based application that
integrates all data processes of a company into one efficient and fast system. Generally,
ERP uses several components of the software application and hardware system to achieve
this integration. An ERP system is highly dependent on relational databases and involves
huge data requirements. Companies set up data warehouses to feed the ERP.

Data warehouses in business companies are often called Business Information Warehouses.
They are intelligent data warehouses capable of analysis and reporting. Every day, they
extract, transform and load data into the database in an intelligent fashion.

Customer Relationship Management (CRM) is another automating process to take care of the
business activity aspect of handling clients and customers of the company. This system
captures, stores and analyzes customer data so the company can come up with innovative
moves to further please the clients to their satisfaction. Again, the data is stored in the
data warehouse or business information warehouse.

Supply Chain Management (SCM) handles the planning, implementation and control of
operations related to the storage of raw materials, the inventory of work in process and the
movement of finished products from point of origin to point of consumption.

A supply management system takes care of the methods and processes involved in
institutional
or corporate buying. Corporate buying may include purchasing of raw materials or
already
finished goods to be resold.

If not for the benefits of installing an information technology infrastructure, these business
activities would take a long time to finish. Doing these activities manually will not just take
a long time to accomplish but can also lead to many errors and inconsistencies.

With today's business trends going towards online marketing and selling, having a good
business application software system can bring many advantages and benefits to
companies. Investing in an information technology infrastructure may be initially expensive,
but the return on investment will be long term.

Data Tracking

Data tracking involves tracing data from the point of its origin to its final
state. Data
tracking is very useful in a data warehouse implementation. As it is very well
known, a data
warehouse is a very complex system that involves disparate data coming from various
operational data sources and data marts.
Hence, data keeps traveling from one server to another. Data tracking helps develop
data
collection proficiency at each site when proper management actions are being taken
to
ensure data integrity.

Some servers handle data processes by archiving raw, computed and test data automatically.
Raw data are stored exactly in the format in which they were received, including the header,
footer and all other information about the data.

Data tracking can be employed to improve the quality of transactions. For example, suppose
I make a withdrawal using an automated teller machine (ATM) and something unexpected
happens to my transaction, which results in the machine not dispensing any money but
deducting the amount from my balance as reflected in the receipt spewed out by the machine.

When I report this incident to the bank authorities, they can easily trace the series of events
by tracking the data I entered into the machine and the activities that I and the ATM
performed. Because the data is tracked, they can easily spot the patterns which led to the
problem, and from there they can immediately take action to improve the service.

Data tracking can also be used in cases where fraud is being committed. Inside the company,
if there is an erring person, the data tracking process may involve inspecting the audit trail
logs.
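
One common way to build such an audit trail is a trigger that records every change to a
sensitive table; the sketch below uses PostgreSQL-style PL/pgSQL and invented table names,
so treat it as an illustration rather than a recipe.

CREATE TABLE account_audit (
    audit_id    SERIAL PRIMARY KEY,
    account_id  INTEGER,
    old_balance NUMERIC,
    new_balance NUMERIC,
    changed_by  TEXT      DEFAULT current_user,
    changed_at  TIMESTAMP DEFAULT now()
);

CREATE FUNCTION log_account_change() RETURNS trigger AS $$
BEGIN
    -- copy the before and after balances into the audit table
    INSERT INTO account_audit (account_id, old_balance, new_balance)
    VALUES (OLD.account_id, OLD.balance, NEW.balance);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER account_audit_trail
AFTER UPDATE ON accounts
FOR EACH ROW EXECUTE PROCEDURE log_account_change();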

Some websites offer aggregated data through data tracking by acquiring certain
fields of
data using remote connection technologies. For instance, web applications can track the
sales of a company, which are refreshed regularly, so while a staff member is on business
travel or simply on holiday, he can still see what is happening in the business operation.

In another example, a website may offer a service such as an hourly update of the global
weather, and this can be done by tracking data from data sites in different geographical
locations.

But despite these uses, there are also issues associated with data tracking. Security and privacy are among the biggest areas of concern. Many websites try to install small pieces of code in the web browser so that they can track the return visits of an internet user and bring up the preferences he specified during his previous visits.

This information about the preferences is tracked from the small piece of code copied to the computer. This code is called a "cookie". Cookies in themselves are intended for good use, but many coders have exploited them by making them track sensitive information and steal it for malicious purposes.
There are many specialized tools for data tracking available, either as free downloads or as commercial products. These data tracking tools have easy-to-use graphical dashboards and very efficient back-end programs that can give users the data they need to track on a real-time basis.

Graphical data tracking applications like these make perfect monitoring tools for database or data warehouse administrators who want to keep track of data traveling from one data mart or operational data source to another. The graphical presentation lets the administrator easily spot erring data and get an instant idea of where that data is currently located.

In today's fast-paced business environment, data tracking tools and devices can greatly enhance the information system of any organization, allowing it to make wise decisions and corresponding moves in the face of challenging situations.

Intelligent Agent

An intelligent agent is a sub-system of artificial intelligence, but with far fewer functionalities and less "intelligence". An intelligent agent is simply a software tool designed to assist end users in performing non-repetitive tasks related to computing processes. The intelligent part is that it can act on the user's behalf when configured to respond to specific events. An intelligent agent is sometimes referred to as a bot (short for robot) and is often used for processing assistance in data mining.

There are two general types of intelligent agents: the physical agent and the temporal agent. The physical agent uses sensors and other less abstract, more tangible means to do its job. The temporal agent, on the other hand, may be purely code that uses time-based stored information and is triggered depending on its configuration.

From these two general types, intelligent agents can be divided into five classes based on the degree of their functionality and capability: simple reflex agents, model-based reflex agents, goal-based agents, utility-based agents and learning agents.

The simple reflex agent functions on the basis of its most current perception and follows a condition-action rule of the form "if condition then action". The success of the agent's job depends on how fully observable the environment is.
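As an illustration only, the short Python sketch below shows the condition-action idea behind a simple reflex agent: the agent inspects nothing but the current percept and fires the first rule whose condition matches. The percept keys and actions are made-up examples, not part of any standard agent framework.

    # Simple reflex agent sketch: "if condition then action" rules over the current percept.
    RULES = [
        (lambda p: p.get("balance_low"), "send low-balance alert"),
        (lambda p: p.get("new_mail"),    "move mail matching filter to folder"),
    ]

    def simple_reflex_agent(percept):
        # The agent has no memory; it reacts only to the percept it sees right now.
        for condition, action in RULES:
            if condition(percept):
                return action
        return "do nothing"

    print(simple_reflex_agent({"balance_low": True}))   # -> send low-balance alert
    print(simple_reflex_agent({"new_mail": True}))      # -> move mail matching filter to folder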

The model-based reflex agent has its current state stored inside the agent, which maintains structures describing the parts of the world that are unseen, so this kind of agent can handle environments that are only partially observable. The behavior of this kind of agent requires information about the way the world works and behaves, and it is therefore sometimes said to have a world-view model.

The goal-based agent is actually a model-based agent, but it stores information about desirable situations and circumstances, allowing the agent to choose good options from among many possibilities.

The utility-based agent uses a function that can map a state to a certain measure of the utility of that state.

Finally, a learning agent is a self-governing intelligent agent that can learn and adapt to constantly changing situations. It can quickly learn even from large amounts of data, and its learning can be online and in real time.

Intelligent agents have become a new paradigm for software development. The concept behind intelligent agents has been hailed as "the next significant breakthrough in software development". Today, intelligent agents are used in an increasingly wide variety of applications across a wide variety of industries. These applications range from comparatively small systems such as email filters to large, open, complex, mission-critical systems such as air traffic control.

In large, data-intensive applications like internet web servers, an example of an intelligent agent would be software for determining the rank of websites. This is very important because ranking is the basis for advertisement rates and the overall value of a website. An intelligent agent may be used to audit a website's ranking in the leading search engines around the world.
Perhaps the most ubiquitous example of an intelligent agent is found on our own computers, in our anti-virus and anti-spyware systems. These intelligent agents are constantly on the lookout for updates on new viral strains so our computers can be protected all the time.

Data Warehouse Basics

What is Operational Database

An Operational Database is the database-of-record, consisting of system-specific reference data and event data belonging to a transaction-update system. It may also contain system control data such as indicators, flags, and counters. The operational database is the source of data for the data warehouse. It contains detailed data used to run the day-to-day operations of the business. The data continually changes as updates are made and reflects the current value of the last transaction.

An operational database contains enterprise data that is up to date and modifiable. In an enterprise data management system, an operational database could be said to be the counterpart of a decision support database, which contains non-modifiable data extracted for the purpose of statistical analysis. For example, a decision support database provides data from which the average salary of many different kinds of workers can be determined, while the operational database contains the same data used to calculate the workers' pay checks depending on the number of days they have reported in any given period of time.

An operational database, as the name implies, is a database that is currently and progressively in use, capturing real-time data and supplying data for real-time computations and other analytical processes.

For example, an operational database is the one used for taking orders and fulfilling them in a store, whether a traditional store or an online store. Other areas of business that use an operational database include catalog fulfillment systems and any Point of Sale system used in retail stores. An operational database is used for keeping track of payments and inventory. It takes information and amounts from credit cards, and accountants use the operational database because it must balance to the last penny.

An operational database is also used to support IRS tax filings and regulations, which is why it is sometimes managed by IT for the finance and operations groups in a business organization. Companies can seldom run successfully without an operational database, as this database is the basis for accounts and transactions.

Because of the very dynamic nature of an operational database, there are certain issues that need to be addressed appropriately. An operational database can grow very fast in size and bulk, so database administrators and IT analysts must purchase high-powered computer hardware and top-notch database management systems.

Most business organizations have regulations and requirements that dictate storing data for longer periods of time for operations. This can create an even more complex setup in relation to database performance and usability. With ever increasing operational data volumes, operational databases come under additional stress when processing transactions, leading to a general slowdown. As a general trend, the more data there is in the operational database, the less efficient the transactions running against that database tend to be.
There are several reasons for this; one of the most obvious is that table scans need to reference more pages of data to produce results. Indexes also grow in size to support larger data volumes, and with this increase, access through the index can degrade because more levels need to be traversed. Some IT professionals address this problem with solutions that offload older data to archive data stores.

Operational databases are just one part of overall enterprise data management, and some of the data that needs to be archived goes directly to the data warehouse.

What is Operational Data Store (ODS)

An Operational Data Store (ODS) is an integrated database of operational data. Its sources include legacy systems, and it contains current or near-term data. An ODS may contain 30 to 60 days of information, while a data warehouse typically contains years of data.

An operational data store is basically a database used as an interim area for a data warehouse. As such, its primary purpose is to handle data which is progressively in use, such as transactions, inventory and data collected from Point of Sale systems. It works with a data warehouse, but unlike a data warehouse, an operational data store does not contain static data. Instead, an operational data store contains data which is constantly updated through the course of business operations.

An ODS is specially designed so that it can quickly perform relatively simple queries on smaller volumes of data, such as finding the orders of a customer or looking for available items in a retail store. This is in contrast to the structure of a data warehouse, where one needs to perform complex queries on high volumes of data. As a simple analogy, an operational data store may be a company's short-term memory, storing only the most recent information, while the data warehouse is the long-term memory, which also serves as the company's historical data repository whose stored data is relatively permanent.
The history of the operational data store goes back to as early as 1990, when the original ODS systems were developed and used as reporting tools for administrative purposes. But even then, the ODS was already dynamic in nature and was usually updated every day as it provided reports about daily business transactions such as sales totals or orders being filled.

The ODS of that time is now referred to as a Class III ODS. As information technology evolved, so did the ODS, with the coming of the Class II ODS, which was already capable of tracking more complex information such as product and location codes and of updating the database more frequently (perhaps hourly) to reflect changes. Then came the Class I ODS systems, which grew out of the development of customer relationship management (CRM).
For many years, IT professionals had great problems integrating legacy applications, as the process consumed so many resources that maintenance and other efforts did little to take care of the needs of the legacy environments. Between experimentation and the development of new technologies, there was little left of a company's IT resources. In the experience of IT people, the legacy environment had become the child consuming its parent.

There were many approaches taken to respond to the problems associated with legacy systems. One approach was to model data and apply information engineering, but this proved slow in delivering tangible results. With the growth of legacy systems came growth in complexity as well as in the data model.

Another response to legacy system problems was the establishment of a data warehouse, and this has proven to be beneficial, but a data warehouse only addresses the informational aspect of the company.

The development of the operational data store has greatly addressed the problems associated with legacy systems. Much like with a data warehouse, data from legacy systems is transformed and integrated into the operational data store; once there, the data ages and is then passed into the data warehouse. One of the main roles of the ODS is to present a collective, integrated view of the up-to-the-second operations of the company. It is very useful for corporate-wide, mission-critical applications.

On-Line Analytical Processing

On-Line Analytical Processing

On-Line Analytical Processing (OLAP), also known as decision support processing, is processing that supports the analysis of business trends and projections. OLAP software enables companies to perform real-time analysis of data stored in a database. An OLAP server is typically a separate component of an information system which contains specially coded algorithms and indexing tools to efficiently process data mining tasks with minimal impact on database performance.

OLAP uses a multidimensional view of aggregate data to provide quick access to strategic information for further analysis. With this, a data user can have a fast and very efficient view of data analysis, as OLAP turns raw data into information that can be understood by users and manipulated in various ways. The multidimensional views of data that OLAP requires also come with packaged calculation-intensive capabilities and time intelligence.

OLAP is part of the wider category of business intelligence, which also includes ETL (extract, transform, load), relational reporting and data mining. Some critical areas of a business enterprise where OLAP is heavily used include business reporting for sales and marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas.

Databases that are planned and configured for use with OLAP use a multidimensional data model which enables complex analytical and ad-hoc queries with rapid execution times. The output of an OLAP query is displayed in a matrix or pivot format, with the dimensions forming the rows and columns of the matrix and the measures forming the values.

OLAP is a function of business intelligence software which enables data end users to easily and selectively extract data and view it from different points of view. In many cases, OLAP has aspects designed for managers who want to make sense of enterprise information as well as of how well the company fares against the competition in a certain industry. OLAP tools structure data in a hierarchical manner, which is exactly the way many business managers think of their enterprises. But OLAP also allows business analysts to rotate data and change relationships and perspectives so they can get deeper insights into corporate information, enabling them to analyze historical as well as future trends and patterns.

The OLAP cube is found at the core of any OLAP system. The OLAP cube is also referred to as the multidimensional cube or hypercube. This cube contains numeric facts, called measures, which are categorized by several dimensions. The metadata of the cube is often built from a snowflake schema or star schema of tables in a relational database. The measures are derived from the records in the fact table, and the dimensions are derived from the dimension tables.

One claim is that, for complex queries, an OLAP cube can produce an answer in around 0.1% of the time needed for the same query on OLTP relational data. Aggregation is the key to this performance: in OLAP, aggregations are built from the fact table by changing the granularity of specific dimensions and aggregating the data along those dimensions.
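The following Python sketch (with invented fact rows and column names) illustrates this idea of building an aggregate by coarsening the granularity of a dimension: daily sales facts are rolled up to month level, so later queries can read the much smaller aggregate instead of scanning the detailed fact table.

    # Roll daily fact rows up to (month, product) granularity.
    from collections import defaultdict

    fact_sales = [                       # (date, product, amount) - illustrative fact rows
        ("2023-01-05", "Rice",  20.0),
        ("2023-01-17", "Rice",  35.0),
        ("2023-01-09", "Wheat", 10.0),
        ("2023-02-02", "Rice",  40.0),
    ]

    monthly_aggregate = defaultdict(float)
    for date, product, amount in fact_sales:
        month = date[:7]                 # coarser grain on the time dimension (YYYY-MM)
        monthly_aggregate[(month, product)] += amount

    for (month, product), total in sorted(monthly_aggregate.items()):
        print(month, product, total)

A real OLAP engine stores many such pre-computed aggregates at different granularities and picks the smallest one that can answer a given query.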

There are different kinds of OLAP, such as Multidimensional OLAP (MOLAP), which uses database structures that are optimal for attributes such as time period, location, product or account code; Relational OLAP (ROLAP), wherein the base data and the dimension tables are stored as relational tables and new tables are created to hold aggregated information; and Hybrid OLAP (HOLAP), which is a combination of OLAP types. Many software vendors have their own versions of OLAP implementations.

On-Line Transaction Processing

On-Line Transaction Processing


On-Line Transaction Processing (OLTP), also known as operational processing, is processing that supports the daily business operations. An OLTP database must typically allow the real-time processing of SQL transactions to support traditional retail processes, e-commerce and other time-critical applications. OLTP also refers to a class of programs that help manage or facilitate transaction-oriented applications, such as data entry and retrieval transactions, in a number of industries including banking, airlines, mail order, supermarkets and manufacturing.
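As a rough illustration of the kind of short, atomic transaction an OLTP system runs, the Python sketch below uses SQLite purely for demonstration (a production OLTP system would sit on a full RDBMS): an amount is transferred between two accounts, committing on success and rolling back on any error. The table and account names are invented.

    # Minimal OLTP-style transaction sketch using SQLite.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account (id TEXT PRIMARY KEY, balance REAL)")
    conn.executemany("INSERT INTO account VALUES (?, ?)",
                     [("alice", 100.0), ("bob", 50.0)])
    conn.commit()

    def transfer(conn, src, dst, amount):
        try:
            with conn:               # commits on success, rolls back if any statement fails
                conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?",
                             (amount, src))
                conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?",
                             (amount, dst))
        except sqlite3.Error:
            print("transfer failed, transaction rolled back")

    transfer(conn, "alice", "bob", 30.0)
    print(conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall())

The point of the sketch is the all-or-nothing behavior: either both the debit and the credit are applied, or neither is, which is exactly the property OLTP systems must guarantee at very high transaction volumes.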

In today's business environment, it is impossible to run a business without having to rely on data. Processing online transactions these days increasingly requires support for transactions spanning a large network, or even the global internet, and may involve many companies. Because of this great demand, many new OLTP software implementations use client-server processing and brokering of software applications so that transactions can run on various computer platforms within a network.

Today, with the ubiquity of the internet, more and more people, even those in remote areas, are now doing transactions online through an e-commerce environment. The term transaction processing is often associated with the process wherein an online shop or e-commerce website accepts and processes payments through a customer's credit or debit card in real time in return for purchased goods and services.

During an online transaction, a merchant payment system will automatically connect to the customer's bank or credit card company and carry out security, fraud and other validity checks, after which authorization to take the payment follows. It is strongly advised that when a company looks for other companies to handle its online transactions and processing, it chooses one whose system infrastructure is robust, secure and reliable, giving customers a fast, seamless and secure checkout.
An OLTP implementation tends to be very large, involving very high volumes of data at any given time. Business organizations have invested in sophisticated transaction management software such as the Customer Information Control System (CICS) and in database optimization tactics that help OLTP systems process very large numbers of concurrent updates on an OLTP-oriented database.

There are also many OLTP brokering programs that can distribute transaction processing among multiple computers on a network, which can enhance the functions of an OLTP system working on a more demanding, decentralized database system. Service-oriented architectures and web services are now commonly integrated with OLTP.

The two main benefits of using OLTP are simplicity and efficiency. OLTP helps simplify a business operation by reducing paper trails and helping draw faster and more accurate forecasts of revenues and expenses. OLTP helps provide a concrete foundation through timely updating of corporate data. For an enterprise's customers, OLTP allows more choices in how they want to pay, giving them more flexibility and enticing them to make more transactions. Most OLTP systems offer services to customers 24 hours a day, seven days a week.

But despite the great benefits that OLTP can give to companies and their customers, there are certain issues it needs to address. The main issues pertaining to OLTP are security and economic cost. Because an OLTP implementation is exposed on a network, more specifically the internet, the database may be susceptible to hackers and intruders who may be waiting to get sensitive information about people and their bank and credit card accounts.

In terms of economic cost, when a business goes offline for some steps of a process, buyers and suppliers tend to miss out on the services and benefits of OLTP, and the smallest system disruption may mean a loss of time and money. But with proper care and implementation, OLTP will remain a big help to business organizations, especially those operating on a large scale.

What is Aggregation

In the broadest sense of the word, aggregation means collecting and combining data horizontally, vertically and chronologically and then expressing it in summary form to be used for statistical analysis. In the more technical sense, aggregation is a special kind of association that specifies a part-whole relationship between a component part and the whole.

As opposed to ordinary association, aggregation is an asymmetric and transitive relationship. Aggregation also implies stronger coupling, and behavior is normally propagated across an aggregation.
In relational databases, aggregation refers to the combination of data from different records. For example, sales records and corporate income from one or several branches can be reported using aggregation, and the aggregation of various data from several sources can be done by executing a single database query.
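A minimal sketch of such a single-query aggregation is shown below in Python with SQLite; the table and branch names are invented, and the GROUP BY clause does the combining of sales records from the several branches.

    # One GROUP BY query aggregates sales records across branches.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (branch TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)",
                     [("North", 120.0), ("North", 80.0), ("South", 200.0)])

    for branch, total in conn.execute(
            "SELECT branch, SUM(amount) FROM sales GROUP BY branch ORDER BY branch"):
        print(branch, total)            # North 200.0 / South 200.0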

An example of the use of aggregation in internet advertising is to get information and trends on a particular group based on a specific variable such as income, age or profession. The information gathered and the patterns spotted may be used for website personalization. This is useful when a company wants to target content at specific users, and advertising can be directed at them as well. For instance, a site that sells movies and music on CDs and DVDs may recommend or advertise genres based on the age of the user and the aggregated data for that age group.
Online analytical processing (OLAP) is a particular method used in data warehouses which applies a simple type of data aggregation, allowing the marketer to use an online reporting strategy to process the gathered information.

Data aggregation can also be used for personal data aggregation services, which offer a user a single point of collection for his personal information from other websites. The user can use a single master personal identification number (PIN) to access a variety of other accounts such as airlines, clubs and financial institutions. This kind of aggregation is often called "screen scraping".

Aggregation services are offered standalone or in conjunction with other services like bill payment and portfolio tracking, which are provided by other specialized websites. Many big, established companies that use the internet for web presence and transactions offer aggregation services to entice visitors.

Over the course of time, large amounts of aggregated account data are transferred from provider to server and may develop into a comprehensive database of user profiles with details of balances, securities transactions, credit information and other data. Privacy and security then become a major issue, but there are independent companies offering related services.

Because of the liabilities that may arise from activities related to data aggregation, such as security issues and infringement of intellectual property rights, aggregators may agree on a data feed arrangement at the discretion of the end users or customers. This may involve using the Open Financial Exchange (OFX) standard to request and deliver the information to the customer. Such an agreement gives institutions an opportunity to protect their customers' interests and gives aggregators a basis for more robust services. Screen scraping without the content provider's consent can allow subscribers to see any account opened through a single website.

Another form of aggregation which is ubiquitous on the internet today is RSS syndication. RSS, which stands for Really Simple Syndication, is a small feed document that contains headlines and descriptions of news or other information on a website. RSS aggregates data from several specified sources on other websites, and the feed is automatically updated at a central point on one's syndicated website. RSS content is read using a feed reader or aggregator.

Automatic Data Partitioning

Automatic data partitioning is the process of breaking down large chunks of data and metadata at a specific data site into partitions according to the request specifications of the client.
Data sites contain multitudes of varied data which can be extremely useful as a statistical basis for determining many trends in business. Because the data in data sites can grow at a very fast rate, the demand for internet traffic also increases, so good software with partitioning capability should be employed to manage the data warehouse. Many software applications handling data also have advanced functions like traffic shaping and policing so that sufficient bandwidth can be maintained.

Relational database management systems (RDBMS) effectively manage data sites. This kind of database system follows the relational model introduced by E. F. Codd, in which data is stored in tables while the relationships among data are stored in other tables. This is in contrast to flat files, where all data is stored in one contiguous area.

Since RDBMS data is not stored in one contiguous area but is instead broken down into tables, it becomes easy to partition data, whether manually or automatically, for easy sharing and distribution.

The biggest advantage of data partitioning is that large tables and indexes can be divided into smaller parts; as a result, the system's performance can be greatly improved while contention is reduced and data availability and distribution are increased. Automatic data partitioning makes the job of the database administrator a lot easier, especially in labor-intensive jobs such as taking backups, loading data, recovering and processing queries.

Data partitioning is commonly done either by splitting selected elements or by creating smaller separate databases, each containing the basic components like tables, indexes and transaction logs.

Horizontal partitioning is a technique where different rows are placed into different tables. For example, customers with zip codes less than 25000 are placed in a table called CustomerEast, while those with zip codes of 25000 or more are placed in a table called CustomerWest. If users want to view a complete list of records, the database uses a view with a union over the two tables, as in the sketch below.
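The sketch below illustrates this horizontal partitioning scheme in Python with SQLite (the table names and the 25000 boundary follow the example above; everything else is illustrative): rows are routed to one of two tables by zip code, and a UNION ALL view presents them as a single customer list.

    # Horizontal partitioning of customers by zip code, with a unifying view.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE CustomerEast (name TEXT, zip INTEGER)")
    conn.execute("CREATE TABLE CustomerWest (name TEXT, zip INTEGER)")

    def insert_customer(name, zip_code):
        # The target table is chosen internally from the partitioning rule.
        table = "CustomerEast" if zip_code < 25000 else "CustomerWest"
        conn.execute(f"INSERT INTO {table} VALUES (?, ?)", (name, zip_code))

    insert_customer("Ramesh", 10001)    # goes to CustomerEast
    insert_customer("Mahima", 90210)    # goes to CustomerWest

    conn.execute("""CREATE VIEW AllCustomers AS
                    SELECT * FROM CustomerEast
                    UNION ALL
                    SELECT * FROM CustomerWest""")
    print(conn.execute("SELECT * FROM AllCustomers ORDER BY name").fetchall())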

Vertical partitioning is another technique wherein tables are created with fewer columns, with additional separate tables used to store the remaining columns. Usually, the process involves the use of different physical storage.

Data partitioning is used in a distributed database management system, a software system which allows the management of a distributed database. A distributed database is a collection of many databases which are logically interrelated and distributed over many computers in a network. This allows certain clients to view only the data they need according to their specifications, while the rest of the viewers see all the data as if it were not partitioned.

Most of today's popular relational database management systems have different criteria for partitioning data. Their only similarity is that they take a partition key and assign a partition based on certain criteria.

Some of the partitioning methods used as criteria include range partitioning, list
partitioning, hash partitioning and composite partitioning.

In range partitioning, the database system selects a partition if the partitioning key is within a certain given range. For example, a partition could include all the rows where a zip code column has values between 60000 and 69999.

List partitioning is a method where a partition is assigned a specific list of values, such as a list of all countries in Southeast Asia.

Hash partitioning uses the value produced by applying a hash function to the partitioning key to determine which partition a row belongs to.
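The short Python sketch below illustrates all three criteria as simple routing functions that take a partition key and return the name of the partition it belongs to; the ranges, country list and partition count are illustrative values only, and Python's built-in hash stands in for a real partitioning hash function.

    # Range, list and hash partitioning criteria as simple routing functions.

    def range_partition(zip_code):
        # Rows with zip codes 60000-69999 go to one partition, everything else elsewhere.
        return "p_60000_69999" if 60000 <= zip_code <= 69999 else "p_other"

    def list_partition(country):
        southeast_asia = {"Thailand", "Vietnam", "Philippines", "Indonesia", "Malaysia"}
        return "p_southeast_asia" if country in southeast_asia else "p_rest_of_world"

    def hash_partition(key, partitions=4):
        # The hash of the key, modulo the number of partitions, picks the partition.
        return f"p_{hash(key) % partitions}"

    print(range_partition(60500))         # p_60000_69999
    print(list_partition("Vietnam"))      # p_southeast_asia
    print(hash_partition("customer-42"))  # one of p_0 .. p_3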