DW BI
OLTP is the transaction system that collects business data, whereas OLAP is the reporting and analysis system built on that data.
OLTP systems are optimized for INSERT and UPDATE operations and are therefore highly normalized. OLAP systems, on the other hand, are deliberately denormalized for fast data retrieval through SELECT operations.
Explanatory Note:
In a department store, when we pay at the check-out counter, the sales person at the counter keys all the data into a "Point-Of-Sale" machine. That data is transaction data and the related system is an OLTP system. On the other hand, the manager of the store might want to view a report on out-of-stock materials so that he can place purchase orders for them. Such a report will come out of the OLAP system.
What is data mart?
Data marts are generally designed for a single subject area. An organization may have data pertaining to different departments like Finance, HR, Marketing etc. stored in the data warehouse, and each department may have a separate data mart. These data marts can be built on top of the data warehouse.
What is dimensional modeling?
A dimensional model consists of dimension and fact tables. Fact tables store different transactional measurements and the foreign keys from the dimension tables that qualify the data. The goal of the dimensional model is not to achieve a high degree of normalization but to facilitate easy and fast data retrieval.
What is dimension?
A dimension is something that qualifies a quantity (measure). If I just say "20 kg", it does not mean anything. But "20 kg of Rice (Product) sold to Ramesh (Customer) on 5th April (Date)" gives a meaningful sense. Product, Customer and Date are some of the dimensions that qualify the measure. Dimensions are mutually independent.
What is fact?
A fact is something that is quantifiable (or measurable). Facts are typically (but not always) numerical values that can be aggregated.
Semi-additive measures are those to which only a subset of aggregation functions can be applied. Take account balance: a Sum() over balances does not give a useful result, but Max() or Min() of the balance might be useful. Similarly, consider a price rate or currency rate: Sum is meaningless on a rate, whereas an Average might be useful.
Additive measures can be used with any aggregation function like Sum(), Avg() etc. Sales Quantity is an example.
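As a quick illustration, the sketch below contrasts the two in SQL; the table and column names (sales_fact, account_balance_fact, sales_qty, balance) are assumed purely for the example.
-- Additive measure: SUM is meaningful across any dimension
SELECT product_key, SUM(sales_qty) AS total_qty
FROM sales_fact
GROUP BY product_key;
-- Semi-additive measure: SUM of balances across time is meaningless,
-- but MIN/MAX (or the latest balance) is still useful
SELECT account_key, MIN(balance) AS min_balance, MAX(balance) AS max_balance
FROM account_balance_fact
GROUP BY account_key;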
What is Star-schema?
This schema is used in data warehouse models where one centralized fact table references a number of dimension tables, so that the keys (primary keys) from all the dimension tables flow into the fact table (as foreign keys) where the measures are stored. The entity-relationship diagram of such a model looks like a star, hence the name.
Consider a fact table that stores sales quantity for each product and customer at a certain time. Sales quantity will be the measure here, and the keys from the customer, product and time dimension tables will flow into the fact table.
Additionally, all the products can be further grouped under different product families stored in a different table, so that the primary key of the product family table also goes into the product table as a foreign key. Such a construct is called a snow-flake schema, as the product table is further snow-flaked into product family.
Note
Snow-flaking increases the degree of normalization in the design.
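A minimal SQL sketch of the star schema described above, with the product table snow-flaked into a product family table; all table and column names are assumed for illustration only.
CREATE TABLE product_family_dim (
  product_family_key INTEGER PRIMARY KEY,
  family_name        VARCHAR(50)
);
CREATE TABLE product_dim (
  product_key        INTEGER PRIMARY KEY,
  product_name       VARCHAR(50),
  product_family_key INTEGER REFERENCES product_family_dim  -- the snow-flaked part
);
CREATE TABLE customer_dim (
  customer_key  INTEGER PRIMARY KEY,
  customer_name VARCHAR(50)
);
CREATE TABLE date_dim (
  date_key  INTEGER PRIMARY KEY,
  full_date DATE
);
CREATE TABLE sales_fact (
  product_key  INTEGER REFERENCES product_dim,   -- foreign keys from the dimensions
  customer_key INTEGER REFERENCES customer_dim,
  date_key     INTEGER REFERENCES date_dim,
  sales_qty    NUMERIC(10,2)                     -- the measure
);
If family_name were kept directly inside product_dim instead of a separate product_family_dim table, the same design would be a plain star schema.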
What are the different types of dimension?
In a data warehouse model, we commonly come across the following types of dimension:
1. Conformed Dimension
2. Junk Dimension
3. Degenerated Dimension
4. Role Playing Dimension
Based on how frequently the data inside a dimension changes, we can further classify dimensions as slowly changing dimensions and rapidly changing dimensions.
A conformed dimension is a dimension that is shared across multiple subject areas. Consider the 'Customer' dimension: both the marketing and sales departments may use the same customer dimension table in their reports. Similarly, a 'Time' or 'Date' dimension will be shared by different subject areas. These dimensions are conformed dimensions.
A degenerated dimension is a dimension that is derived from the fact table and does not have its own dimension table. A dimension key such as a transaction number, receipt number, invoice number etc. does not have any further associated attributes and hence cannot be designed as a dimension table.
A junk dimension is a single table that holds a combination of miscellaneous, low-cardinality attributes (such as flags and indicators). These junk dimension attributes might not be related to one another. The only purpose of this table is to store all the combinations of the dimensional attributes which you could not fit into the different dimension tables otherwise. One may want to read an interesting document on this, De-clutter with Junk (Dimension).
Dimensions are often reused for multiple applications within the same database with different contextual meanings. For instance, a "Date" dimension can be used for "Date of Sale" as well as "Date of Delivery" or "Date of Hire". This is often referred to as a 'role-playing dimension'.
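For example, the same date dimension can be joined more than once under different roles. The query below is only a sketch; the fact table and its sale_date_key / delivery_date_key columns are assumed names.
SELECT f.sales_qty,
       sale_dt.full_date     AS date_of_sale,
       delivery_dt.full_date AS date_of_delivery
FROM sales_fact f
  JOIN date_dim sale_dt     ON sale_dt.date_key     = f.sale_date_key       -- role 1
  JOIN date_dim delivery_dt ON delivery_dt.date_key = f.delivery_date_key;  -- role 2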
What is SCD?
SCD stands for slowly changing dimension, i.e. a dimension where the data changes slowly. SCDs can be of many types, e.g. Type 0, Type 1, Type 2, Type 3 and Type 6, although Type 1, 2 and 3 are the most common.
Type 0:
A Type 0 dimension is one where dimensional changes are not considered. This does not mean that the attributes of the dimension do not change in the actual business situation; it just means that, even if the value of an attribute changes, the change is not applied and the table continues to hold the previous data, with no history kept.
Type 1:
A Type 1 dimension is one where history is not maintained and the table always shows the most recent data. This effectively means that such a dimension table is always updated with the recent data whenever there is a change, and because of this update we lose the previous values.
Type 2:
A Type 2 dimension table tracks historical changes by creating separate rows in the table with different surrogate keys. Consider a customer C1 who is initially under group G1 and is later moved to group G2. There will then be two separate records in the dimension table, like below:
Key | Customer | Group | Start Date | End Date
1   | C1       | G1    | ...        | ...
2   | C1       | G2    | ...        | NULL
Note that separate surrogate keys are generated for the two records. The NULL end date in the second row denotes that this is the current record. Also note that, instead of start and end dates, one could keep a version number column (1, 2, ... etc.) to denote the different versions of the record.
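In SQL terms, applying the above group change as a Type 2 update might look like the sketch below; customer_dim and its columns are assumed names, and in practice the new surrogate key would come from a sequence rather than being hard-coded.
-- Step 1: expire the current record of customer C1
UPDATE customer_dim
SET end_date = CURRENT_DATE
WHERE customer_id = 'C1'
  AND end_date IS NULL;
-- Step 2: insert a new record with a new surrogate key and the new group
INSERT INTO customer_dim (cust_key, customer_id, cust_group, start_date, end_date)
VALUES (2, 'C1', 'G2', CURRENT_DATE, NULL);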
Type 3:
A Type 3 dimension keeps limited history by holding the previous and the current value of the changing attribute in two separate columns of the same row, like below:
Key | Customer | Previous Group | Current Group
1   | C1       | G1             | G2
This is only good when you do not need to store many consecutive histories and when the date of change need not be stored.
Type 6:
A Type 6 dimension is a hybrid of Types 1, 2 and 3 (1+2+3). It acts very much like Type 2, but with one extra column added to denote which record is the current record.
Key | Customer | Group | Start Date   | End Date | Current Flag
1   | C1       | G1    | 1st Jan 2000 | ...      | N
2   | C1       | G2    | ...          | NULL     | Y
What is a fact-less-fact?
A fact table that does not contain any measure is called a fact-less fact. This table will only contain keys from different dimension tables. This is often used to resolve a many-to-many cardinality issue.
Explanatory Note:
Consider a school, where a single student may be taught by many teachers and a single teacher may have many students. To model this situation in a dimensional model, one might introduce a fact-less-fact table joining the teacher and student keys. Such a fact table will then be able to answer queries like which teachers teach a given student, or which students are taught by a given teacher.
A fact-less-fact table can only answer 'optimistic' (positive) queries but cannot answer a negative query. Again consider the illustration in the above example. A fact-less fact containing the keys of tutors and students cannot answer a query like "which students are not being taught by any teacher?"
Why not? Because the fact-less fact table only stores the positive scenarios (such as a student being taught by a tutor); if there is a student who is not being taught by any teacher, that student's key does not appear in this table, thereby reducing the coverage of the table.
A coverage fact table attempts to answer this, often by adding an extra flag column. Flag = 0 indicates a negative condition and flag = 1 indicates a positive condition. To understand this better, let's consider a class where there are 100 students and 5 teachers. The coverage fact table will ideally store 100 x 5 = 500 records (all combinations), and if a certain teacher is not teaching a certain student, the corresponding flag for that record will be 0.
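With such a flag column in place, the negative query becomes answerable. A sketch in SQL, where coverage_fact, taught_flag and student_dim are assumed names:
-- Students who are not taught by any teacher
SELECT s.student_name
FROM coverage_fact cf
  JOIN student_dim s ON s.student_key = cf.student_key
GROUP BY s.student_name
HAVING MAX(cf.taught_flag) = 0;  -- the flag is 0 for every teacher combination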
What are incident and snapshot facts?
A fact table stores some kind of measurements. Usually these measurements are stored (or captured) against a specific time and they vary with respect to time. Now it might happen that the business is not able to capture all of its measures always, for every point in time. The unavailable measurements can then either be kept empty (NULL) or be filled with the last available measurement. The first case is an example of an incident fact and the second one is an example of a snapshot fact.
What is granularity?
A data warehouse usually captures data with the same degree of detail as is available in the source. This "degree of detail" is termed granularity. But not all reporting requirements from that data warehouse need the same degree of detail.
To understand this, consider a data warehouse built for a retail chain with shops across Europe. Each shop manager can access the data warehouse and can see which products are sold, by whom and in what quantity on any given date. Thus the data warehouse helps the shop managers with the detail-level data that can be used for inventory management, trend prediction etc.
Now think about the CEO of that retail chain. He does not really care which particular sales girl in London sold the highest number of chopsticks, or which shop is the best seller of brown bread. All he is interested in is, perhaps, the percentage increase of his revenue margin across Europe, or maybe the year-on-year sales growth in Eastern Europe. Such data is aggregated in nature, because sales of goods in Eastern Europe are derived by summing up the individual sales data from each shop in Eastern Europe.
What is slicing-dicing?
Slicing means showing a slice of the data, given a certain dimension (e.g. Product) with a chosen value (e.g. Brown Bread) and the measures (e.g. Sales).
Dicing means viewing that slice with respect to different dimensions and at different levels of aggregation.
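In SQL terms, a slice and a dice could look roughly like the two queries below; the fact and dimension names are assumed for illustration.
-- Slice: fix the Product dimension to 'Brown Bread' and look at the Sales measure
SELECT SUM(f.sales_qty) AS total_sales
FROM sales_fact f
  JOIN product_dim p ON p.product_key = f.product_key
WHERE p.product_name = 'Brown Bread';
-- Dice: view the same slice across other dimensions and at another level of aggregation
SELECT s.country, d.year, SUM(f.sales_qty) AS total_sales
FROM sales_fact f
  JOIN product_dim p ON p.product_key = f.product_key
  JOIN shop_dim s    ON s.shop_key    = f.shop_key
  JOIN date_dim d    ON d.date_key    = f.date_key
WHERE p.product_name = 'Brown Bread'
GROUP BY s.country, d.year;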
What is drill-through?
Drill-through is the process of going to the detail-level data from summary data.
Consider the above example on retail shops. If the CEO finds out that sales in Eastern Europe have declined this year compared to last year, he might want to know the root cause of the decrease. For this, he may start drilling through his report to a more detailed level, and eventually find out that, even though individual shop sales have actually increased, the overall sales figure has decreased because a certain shop in Turkey has stopped operating. The detail-level data, which the CEO was not much interested in earlier, has this time helped him to pinpoint the root cause of the declined sales. The method he followed to obtain the details from the aggregated data is called drill-through.
Informatica Questions
Connected Lookup
Unconnected Lookup
Router
Filter
Lookups can be cached or uncached (no cache). A cached lookup can be either static or dynamic. A static cache is one which is not modified once it is built; it remains the same during the session run. A dynamic cache, on the other hand, is refreshed during the session run by inserting or updating records in the cache based on the incoming source data.
How can we update a record in the target table without using Update Strategy?
A target table can be updated without using an 'Update Strategy' transformation. For this, we need to define the key of the target table at the Informatica level, and then we need to connect the key and the field we want to update in the mapping target. At the session level, we should set the target property to "Update as Update" and check the "Update" check-box.
Let's assume we have a target table "Customer" with the fields "Customer ID", "Customer Name" and "Customer Address", and suppose we want to update "Customer Address" without an Update Strategy. Then we have to define "Customer ID" as the primary key at the Informatica level and connect the Customer ID and Customer Address fields in the mapping. If the session properties are set correctly as described above, the mapping will update the customer address field for all matching customer IDs.
Q1. Suppose our source has duplicate records. How can we load only unique records into the target?
Ans. If the source is a relational table, we can simply check the Select Distinct option of the Source Qualifier so that duplicates are removed while reading the source. Now suppose the source system is a flat file. Here, in the Source Qualifier, you will not be able to select the Distinct clause, as it is disabled for flat file sources. Hence the next approach may be to use a Sorter Transformation and check the Distinct option. When we select the Distinct option, all the columns will be selected as keys, in ascending order by default.
Sorter Transformation DISTINCT clause
Another way to handle duplicate records in the source is to use an Aggregator Transformation with the Group By checkbox checked on the ports having duplicate data. Here we have the flexibility to select the last or the first of the duplicate records. Apart from that, using a Dynamic Lookup Cache on the target table, associating the input ports with the lookup ports and checking the Insert Else Update option will also help to eliminate duplicate records from the source and hence load unique records into the target.
Q2. Suppose we have some serial numbers in a flat file source. We want to load the serial numbers into two target files, one containing the EVEN serial numbers and the other having the ODD ones.
Ans. After the Source Qualifier, place a Router Transformation. Create two groups, namely EVEN and ODD, with the filter conditions MOD(SERIAL_NO,2)=0 and MOD(SERIAL_NO,2)=1 respectively. Then output the two groups into two flat file targets.
Router Transformation Groups Tab
Source:
Student Name | Maths | Life Science | Physical Science
Sam          | 100   | 70           | 80
John         | 75    | 100          | 85
Tom          | 80    | 100          | 85

Target:
Student Name | Subject Name     | Marks
Sam          | Maths            | 100
Sam          | Life Science     | 70
Sam          | Physical Science | 80
John         | Maths            | 75
John         | Life Science     | 100
John         | Physical Science | 85
Tom          | Maths            | 80
Tom          | Life Science     | 100
Tom          | Physical Science | 85
Ans. Here, to convert the columns into rows, we have to use a Normalizer Transformation, followed by an Expression Transformation to decode the column taken into consideration. For more details on how the mapping is performed please visit Working with Normalizer.
Q4. Name the transformations which convert one row to many rows, i.e. increase the i/p:o/p row count. Also, what is the name of its reverse transformation?
Q5. Suppose we have a source table and we want to load three target tables based on the source rows, such that the first row moves to the first target table, the second row to the second target table, the third row to the third target table, the fourth row again to the first target table, and so on and so forth. Describe your approach.
Router Transformation Groups Tab
Ans. We can clearly understand that we need a Router transformation to route or filter the source data to the three target tables. Now the question is what the filter conditions will be. First of all we need an Expression Transformation that has all the source table columns and, along with them, another i/o port, say SEQ_NUM, which gets a sequence number for each source row from the NEXTVAL port of a Sequence Generator with start value 0 and increment by 1. Now the filter conditions for the three router groups will be, for example, MOD(SEQ_NUM,3)=1 for the first target, MOD(SEQ_NUM,3)=2 for the second and MOD(SEQ_NUM,3)=0 for the third.
Q6. Suppose we have ten source flat files of the same structure. How can we load all the files into the target database in a single batch run using a single mapping?
Ans. After we create a mapping to load data into the target database from flat files, we move on to the session properties of the Source Qualifier. To load a set of source files we need to create a file, say final.txt, containing the source flat file names (ten files in our case), and set the Source filetype option to Indirect. Next, point to this flat file final.txt, fully qualified, through the Source file directory and Source filename properties.
Session Property Flat File
Ans. We will use the very basic concept of the Expression Transformation, namely that at any time we can access the previous row's data as well as the currently processed data in an expression transformation. All we need are simple Sorter, Expression and Filter transformations to achieve aggregation at the Informatica level.
Source:
Student Name | Subject Name     | Marks
Sam          | Maths            | 100
Tom          | Maths            | 80
Sam          | Physical Science | 80
John         | Maths            | 75
Sam          | Life Science     | 70
John         | Life Science     | 100
John         | Physical Science | 85
Tom          | Life Science     | 100
Tom          | Physical Science | 85

Target:
Student Name | Maths | Life Science | Physical Science
Sam          | 100   | 70           | 80
John         | 75    | 100          | 85
Tom          | 80    | 100          | 85

Mapping using Sorter and Aggregator
Ans. Here our scenario is to convert many rows into one row, and the transformation which will help us achieve this is the Aggregator.
Now, based on STUDENT_NAME in the GROUP BY clause, the output subject columns are populated using conditional aggregation, for example MATHS = MAX(MARKS, SUBJECT_NAME = 'Maths'), LIFE_SCIENCE = MAX(MARKS, SUBJECT_NAME = 'Life Science') and PHYSICAL_SCIENCE = MAX(MARKS, SUBJECT_NAME = 'Physical Science').
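For reference, the same pivoting logic expressed in plain SQL would look roughly like this (student_marks is an assumed table name):
SELECT student_name,
       MAX(CASE WHEN subject_name = 'Maths'            THEN marks END) AS maths,
       MAX(CASE WHEN subject_name = 'Life Science'     THEN marks END) AS life_science,
       MAX(CASE WHEN subject_name = 'Physical Science' THEN marks END) AS physical_science
FROM student_marks
GROUP BY student_name;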
Q9. What is a Source Qualifier? What are the tasks we can perform using an SQ, and why is it an ACTIVE transformation?
Ans. The Source Qualifier represents the rows that the Integration Service reads from a relational or flat file source when it runs a session. It is an active and connected transformation. Using the SQ:
. We can configure the SQ to join [both INNER as well as OUTER JOIN] data originating from the same source database.
. We can use a source filter to reduce the number of rows the Integration Service queries.
. We can specify a number of sorted ports, and the Integration Service adds an ORDER BY clause to the default SQL query.
. We can choose the Select Distinct option for relational databases, and the Integration Service adds a SELECT DISTINCT clause to the default SQL query.
. We can also write a Custom/User Defined SQL query which will override the default query in the SQ by changing the default settings of the transformation properties.
. We also have the option to write Pre as well as Post SQL statements, to be executed before and after the SQ query in the source database.
Since the transformation provides us with the Select Distinct property, the Integration Service can add a SELECT DISTINCT clause to the default SQL query, which in turn affects the number of rows returned by the database to the Integration Service; hence it is an Active transformation.
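As a rough sketch (the exact text Informatica generates may differ, and the CUSTOMERS table here is only an example), the default query and the query after enabling Select Distinct with two sorted ports would look like:
-- Default generated query
SELECT customers.customer_id, customers.customer_name
FROM customers;
-- With Select Distinct and Number Of Sorted Ports = 2
SELECT DISTINCT customers.customer_id, customers.customer_name
FROM customers
ORDER BY customers.customer_id, customers.customer_name;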
Q10. What happens to a mapping if we alter the datatypes between the Source and its corresponding Source Qualifier?
Ans. The Source Qualifier transformation displays the transformation datatypes. The transformation datatypes determine how the source database binds data when the Integration Service reads it. If the datatypes in the source definition and the Source Qualifier do not match, the Designer marks the mapping as invalid when we save it.
Q11. Suppose we have used the Select Distinct and the Number Of Sorted Ports properties in the SQ and then we add a Custom SQL Query. Explain what will happen.
Ans. Whenever we add a Custom SQL or SQL override query, it overrides the User-Defined Join, Source Filter, Number Of Sorted Ports and Select Distinct settings in the Source Qualifier transformation. Hence only the user-defined SQL query will be fired against the database and all the other options will be ignored.
Q12. Describe the situations where we would use the Source Filter, Select Distinct and Number Of Sorted Ports properties of the Source Qualifier transformation.
Ans. The Source Filter option is used basically to reduce the number of rows the Integration Service queries, so as to improve performance.
The Select Distinct option is used when we want the Integration Service to select unique values from a source, filtering out unnecessary data earlier in the data flow, which might improve performance.
The Number Of Sorted Ports option is used when we want the source data to be in a sorted fashion, so as to use it in subsequent transformations like Aggregator or Joiner which, when configured for sorted input, will improve performance.
Q13. What will happen if the SELECT list columns in the custom override SQL query and the OUTPUT ports order in the SQ transformation do not match?
Ans. A mismatch or change in the order of the list of selected columns with respect to the connected transformation output ports may result in session failure.
Q14. What happens if in the Source Filter property of SQ transformation we include
keyword WHERE say, WHERE
CUSTOMERS.CUSTOMER_ID > 1000.
Ans. We use source filter to reduce the number of source records. If we include the
string WHERE in the source filter,
the Integration Service fails the session.
Q15. Describe the scenarios where we go for a Joiner transformation instead of a Source Qualifier transformation.
Ans. We use the Joiner transformation when joining source data from heterogeneous sources, as well as when joining flat files. Use the Joiner transformation when we need to join the following types of sources: two relational sources residing in different databases, a relational source and a flat file, two flat files, or two branches of the same source pipeline.
Q16. What is the maximum number we can use in Number Of Sorted Ports for a Sybase source system?
Ans. 16, since Sybase supports a maximum of 16 columns in an ORDER BY clause.
Q17. Suppose we have two Source Qualifier transformations SQ1 and SQ2 connected to
Target tables TGT1 and
TGT2 respectively. How do you ensure TGT2 is loaded after TGT1?
In the Mapping Designer, We need to configure the Target Load Plan based on the
Source Qualifier transformations
in a mapping to specify the required loading order.
Target Load Plan
Target Load Plan Ordering
Q18. Suppose we have a Source Qualifier transformation that populates two target
tables. How do you ensure TGT2
is loaded after TGT1?
Ans. In the Workflow Manager, we can Configure Constraint based load ordering for a
session. The Integration
Service orders the target load on a row-by-row basis. For every row generated by an
active source, the Integration
Service loads the corresponding transformed row first to the primary key table,
then to the foreign key table.
Constraint based loading
Hence if we have one Source Qualifier transformation that provides data for
multiple target tables having primary
and foreign key relationships, we will go for Constraint based load ordering.
Only the rows that meet the Filter Condition pass through the Filter transformation
to the next transformation in the
pipeline. TRUE and FALSE are the implicit return values from any filter condition
we set. If the filter condition
evaluates to NULL, the row is assumed to be FALSE.
The numeric equivalent of FALSE is zero (0) and any non-zero value is the
equivalent of TRUE.
Ans.
. SQ Source Filter - the Source Qualifier source filter filters rows while they are being read from the source, and it can only filter rows from relational sources.
. Filter Transformation - the Filter transformation filters rows from within the mapping, after the rows have been read, and it can filter rows coming from any type of source.
Ans. A Joiner is an Active and Connected transformation used to join source data
from the same source system or
from two related heterogeneous sources residing in different locations or file
systems.
The Joiner transformation joins sources with at least one matching column. The
Joiner transformation uses a
condition that matches one or more pairs of columns between the two sources.
The two input pipelines include a master pipeline and a detail pipeline or a master
and a detail branch. The master
pipeline ends at the Joiner transformation, while the detail pipeline continues to
the target.
The join condition contains ports from both input sources that must match for the
Integration Service to join two
rows. Depending on the type of join selected, the Integration Service either adds
the row to the result set or discards
the row.
The Joiner transformation produces result sets based on the join type, condition,
and input data sources. Hence it is
an Active transformation.
Q22. State the limitations where we cannot use Joiner in the mapping pipeline.
Ans. The Joiner transformation accepts input from most transformations. However,
following are the limitations:
. A Joiner transformation cannot be used when either of the input pipelines contains an Update Strategy transformation.
. A Joiner transformation cannot be used if we connect a Sequence Generator transformation directly before the Joiner transformation.
Q23. Out of the two input pipelines of a joiner, which one will you set as the
master pipeline?
Ans. During a session run, the Integration Service compares each row of the master
source against the detail source.
The master and detail sources need to be configured for optimal performance.
To improve performance for an Unsorted Joiner transformation, use the source with
fewer rows as the master
source. The fewer unique rows in the master, the fewer iterations of the join
comparison occur, which speeds the join
process.
To improve performance for a Sorted Joiner transformation, use the source with
fewer duplicate key values as the
master source.
Blocking logic is possible if master and detail input to the Joiner transformation
originate from different sources.
Otherwise, it does not use blocking logic. Instead, it stores more rows in the
cache.
Q24. What are the different types of Joins available in Joiner Transformation?
Ans. In SQL, a join is a relational operator that combines data from multiple
tables into a single result set. The Joiner
transformation is similar to an SQL join except that data can originate from
different types of sources.
. Normal
. Master Outer
. Detail Outer
. Full Outer
Join Type property of Joiner Transformation
Note: A normal or master outer join performs faster than a full outer or detail
outer join.
Ans.
. In a normal join , the Integration Service discards all rows of data from the
master and detail source that do
not match, based on the join condition.
. A master outer join keeps all rows of data from the detail source and the
matching rows from the master
source. It discards the unmatched rows from the master source.
. A detail outer join keeps all rows of data from the master source and the
matching rows from the detail
source. It discards the unmatched rows from the detail source.
. A full outer join keeps all rows of data from both the master and detail sources.
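Treating the detail source as the left table and the master source as the right table, rough SQL equivalents of these join types are sketched below; the table and key names are assumed.
SELECT d.*, m.* FROM detail d INNER JOIN       master m ON d.item_key = m.item_key;  -- Normal
SELECT d.*, m.* FROM detail d LEFT OUTER JOIN  master m ON d.item_key = m.item_key;  -- Master Outer (all detail rows)
SELECT d.*, m.* FROM detail d RIGHT OUTER JOIN master m ON d.item_key = m.item_key;  -- Detail Outer (all master rows)
SELECT d.*, m.* FROM detail d FULL OUTER JOIN  master m ON d.item_key = m.item_key;  -- Full Outer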
Q26. Describe the impact of number of join conditions and join order in a Joiner
Transformation.
Ans. We can define one or more conditions based on equality between the specified
master and detail sources. Both
ports in a condition must have the same datatype.
If we need to use two ports in the join condition with non-matching datatypes we
must convert the datatypes so that
they match. The Designer validates datatypes in a join condition.
Additional ports in the join condition increase the time necessary to join the two sources.
The order of the ports in the join condition can impact the performance of the
Joiner transformation. If we use
multiple ports in the join condition, the Integration Service compares the ports in
the order we specified.
For example, if both EMP_ID1 and EMP_ID2 contain a row with a null value, the
Integration Service does not
consider them a match and does not join the two rows.
To join rows with null values, replace null input with default values in the Ports
tab of the joiner, and then join on
the default values.
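The idea behind this workaround, expressed in SQL terms with assumed table and column names, is to substitute a common default so that the otherwise NULL keys compare as equal:
SELECT d.emp_id1, m.emp_id2
FROM detail_src d
  JOIN master_src m
    ON COALESCE(d.emp_id1, -1) = COALESCE(m.emp_id2, -1);  -- NULL keys now match on the default -1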
Note: If a result set includes fields that do not contain data in either of the
sources, the Joiner transformation
populates the empty fields with null values. If we know that a field will return a
NULL and we do not want to insert
NULLs in the target, set a default value on the Ports tab for the corresponding
port.
Q28. Suppose we configure Sorter transformations in the master and detail pipelines with the following sorted ports, in order: ITEM_NO, ITEM_NAME, PRICE. When we configure the join condition, what guidelines do we need to follow to maintain the sort order?
Ans. If we have sorted both the master and detail pipelines in the order of the ports ITEM_NO, ITEM_NAME and PRICE, we must ensure that the join condition uses the sorted ports in the same order, starting with ITEM_NO: we can join on ITEM_NO alone, on ITEM_NO and ITEM_NAME, or on all three ports, but we must not skip or reorder the sorted ports (for example, joining on ITEM_NAME or PRICE alone would not maintain the sort order).
Ans. The best option is to place the Joiner transformation directly after the sort
origin to maintain sorted data.
However do not place any of the following transformations between the sort origin
and the Joiner transformation:
. Custom
. Unsorted Aggregator
. Normalizer
. Rank
. Union transformation
. XML Parser transformation
. XML Generator transformation
. Mapplet [if it contains any one of the above mentioned transformations]
Q30. Suppose we have the EMP table as our source. In the target we want to view
those employees whose salary is
greater than or equal to the average salary for their departments. Describe your
mapping approach.
Ans. After the Source Qualifier of the EMP table, place a Sorter Transformation and sort based on the DEPTNO port.
Sorter Ports Tab
Next we place a Sorted Aggregator Transformation. Here we will find out the AVERAGE
SALARY for each
(GROUP BY) DEPTNO.
When we perform this aggregation, we lose the data for individual employees.
To maintain employee data, we must pass a branch of the pipeline to the Aggregator
Transformation and pass a
branch with the same sorted source data to the Joiner transformation to maintain
the original data.
When we join both branches of the pipeline, we join the aggregated data with the
original data.
Aggregator Ports Tab
Aggregator Properties Tab
So next we need Sorted Joiner Transformation to join the sorted aggregated data
with the original data, based on
DEPTNO. Here we will be taking the aggregated pipeline as the Master and original
dataflow as Detail Pipeline.
Joiner Condition Tab
Joiner Properties Tab
After that we need a Filter Transformation to filter out the employees having
salary less than average salary for their
department.
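For comparison, the logic this mapping implements is equivalent to the SQL below (using the classic EMP column names, which are assumed here):
SELECT e.empno, e.ename, e.deptno, e.sal
FROM emp e
  JOIN (SELECT deptno, AVG(sal) AS avg_sal
        FROM emp
        GROUP BY deptno) a
    ON a.deptno = e.deptno
WHERE e.sal >= a.avg_sal;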
Ans. The main properties of the Sequence Generator transformation are:
. Start Value - the value the generator cycles back to when the Cycle option is enabled (default 0).
. Increment By - the difference between two consecutive generated values (default 1).
. End Value - the maximum value the generator can produce.
. Current Value - the value from which the Integration Service starts generating numbers for the session.
. Cycle - if enabled, the generator wraps back to the Start Value after reaching the End Value; otherwise the session fails once the End Value is crossed.
. Number of Cached Values - the number of sequence values the Integration Service caches at a time.
. Reset - if enabled (for non-reusable generators only), the generator restarts from the original Current Value after each session run.
Q. Suppose we connect the NEXTVAL port of a Sequence Generator to the surrogate key columns of two target tables. Will the surrogate keys in both the target tables be the same? If not, how can we flow the same sequence values into both of them?
Ans. When we connect the NEXTVAL output port of the Sequence Generator directly to the surrogate key columns of the target tables, the sequence numbers will not be the same.
A block of sequence numbers is sent to one target table's surrogate key column. The second target receives a block of sequence numbers from the Sequence Generator transformation only after the first target table has received its block of sequence numbers.
Suppose we have 5 rows coming from the source; the targets will then have the sequence values TGT1 (1,2,3,4,5) and TGT2 (6,7,8,9,10) [taking Start Value 0, Current Value 1 and Increment By 1].
Now suppose the requirement is that we need to have the same surrogate keys in both the targets.
Then the easiest way to handle the situation is to put an Expression Transformation in between the Sequence Generator and the target tables. The SeqGen will pass unique values to the Expression transformation, and the rows are then routed from the Expression transformation to the targets.
Q34. Suppose we have 100 records coming from the source. Now, for a target column population, we use a Sequence Generator. Suppose the Current Value is 0 and the End Value of the Sequence Generator is set to 80. What will happen?
Ans. End Value is the maximum value the Sequence Generator will generate. After it reaches the End Value, the session fails with an overflow error.
Q35. What changes do we observe when we promote a non-reusable Sequence Generator to a reusable one? And what happens if we set the Number of Cached Values to 0 for a reusable transformation?
Ans. When we convert a non-reusable Sequence Generator to a reusable one, we observe that the Number of Cached Values is set to 1000 by default, and the Reset property is disabled.
When we try to set the Number of Cached Values property of a Reusable Sequence
Generator to 0 in the
Transformation Developer we encounter the following error message:
The number of cached values must be greater than zero for reusable sequence
transformation.
Ans. Apart from aggregate expressions Informatica Aggregator also supports non-
aggregate expressions and
conditional clauses.
Ans. We can enable the session option, Incremental Aggregation for a session that
includes an Aggregator
Transformation. When the Integration Service performs incremental aggregation, it
actually passes changed source
data through the mapping and uses the historical cache data to perform aggregate
calculations incrementally.
Q41. What are the performance considerations when working with Aggregator
Transformation?
Ans.
. Filter the unnecessary data before aggregating it. Place a Filter transformation
in the mapping before the
Aggregator transformation to reduce unnecessary aggregation.
. Improve performance by connecting only the necessary input/output ports to
subsequent transformations,
thereby reducing the size of the data cache.
. Use Sorted input which reduces the amount of data cached and improves session
performance.
Q42. What differs when we choose Sorted Input for Aggregator Transformation?
Ans. Integration Service creates the index and data caches files in memory to
process the Aggregator transformation.
If the Integration Service requires more space as allocated for the index and data
cache sizes in the transformation
properties, it stores overflow values in cache files i.e. paging to disk. One way
to increase session performance is to
increase the index and data cache sizes in the transformation properties. But when
we check Sorted Input the
Integration Service uses memory to process an Aggregator transformation it does not
use cache files.
Q43. Under what conditions will selecting Sorted Input in the aggregator still not boost session performance?
Ans. Sorted input does not improve performance when the aggregate expression contains nested aggregate functions, or when the session uses incremental aggregation.
Q44. Under what condition selecting Sorted Input in aggregator may fail the
session?
Ans.
. If the input data is not sorted correctly, the session will fail.
. Also if the input data is properly sorted, the session may fail if the sort order
by ports and the group by ports
of the aggregator are not in the same order.
Q45. Suppose we do not group by on any ports of the aggregator. What will be the output?
Ans. If we do not group values, the Integration Service will return only the last row of the input rows.
Q46. What is the expected value if a column in an aggregator transformation is neither a group by port nor an aggregate expression?
Ans. The Integration Service produces one row for each group, based on the group by ports. The columns which are neither part of the group by key nor an aggregate expression will return the corresponding value from the last record of the group received. However, if we specifically use the FIRST function, the Integration Service returns the value from the first row of the group. So the default behaviour is that of the LAST function.
Q47. Give one example each of conditional aggregation, a non-aggregate expression and nested aggregation.
Ans. For example: conditional aggregation - SUM(SALARY, JOB = 'CLERK'); non-aggregate expression - IIF(MAX(QUANTITY) > 0, MAX(QUANTITY), 0); nested aggregation - MAX(COUNT(ITEM)).
Use conditional clauses in the aggregate expression to reduce the number of rows used in the aggregation. The conditional clause can be any clause that evaluates to TRUE or FALSE.
Q49. How does a Rank Transform differ from Aggregator Transform functions MAX and
MIN?
Ans. Like the Aggregator transformation, the Rank transformation lets us group
information. The Rank Transform
allows us to select a group of top or bottom values, not just one value as in case
of Aggregator MAX, MIN functions.
Ans. The Rank port is an input/output port used to specify the column for which we want to rank the source values. By default, Informatica creates an output port RANKINDEX for each Rank transformation; it stores the ranking position of each row in its group.
Ans. Rank transformation lets us group information. We can configure one of its
input/output ports as a group by
port. For each unique value in the group port, the transformation creates a group
of rows falling within the rank
definition (top or bottom, and a particular number in each rank).
Ans. If two rank values match, they receive the same value in the rank index and
the transformation skips the next
value.
Ans.
. We can connect ports from only one transformation to the Rank transformation.
. We can select the top or bottom rank.
. We need to select the Number of records in each rank.
. We can designate only one Rank port in a Rank transformation.
Ans. During a session, the Integration Service compares an input row with rows in
the data cache. If the input row
out-ranks a cached row, the Integration Service replaces the cached row with the
input row. If we configure the Rank
transformation to rank based on different groups, the Integration Service ranks
incrementally for each group it finds.
The Integration Service creates an index cache to store the group information and a data cache for the row data.
Ans. Rank transformation can return the strings at the top or the bottom of a
session sort order. When the Integration
Service runs in Unicode mode, it sorts character data in the session using the
selected sort order associated with the
Code Page of IS which may be French, German, etc. When the Integration Service runs
in ASCII mode, it ignores this
setting and uses a binary sort order to sort character data.
Ans. When the Sorter transformation is configured to treat output rows as distinct,
it assigns all ports as part of the
sort key. The Integration Service discards duplicate rows compared during the sort
operation. The number of Input
Rows will vary as compared with the Output rows and hence it is an Active
transformation.
Ans. The Case Sensitive property determines whether the Integration Service
considers case when sorting data.
When we enable the Case Sensitive property, the Integration Service sorts uppercase
characters higher than
lowercase characters.
Ans. The Integration Service passes all incoming data into the Sorter Cache before
Sorter transformation performs the
sort operation.
The Integration Service uses the Sorter Cache Size property to determine the
maximum amount of memory it can
allocate to perform the sort operation. If it cannot allocate enough memory, the
Integration Service fails the session.
For best performance, configure Sorter cache size with a value less than or equal
to the amount of available physical
RAM on the Integration Service machine.
If the amount of incoming data is greater than the amount of Sorter cache size, the
Integration Service temporarily
stores data in the Sorter transformation work directory. The Integration Service
requires disk space of at least twice
the amount of incoming data when storing data in the work directory.
Ans.
. All input groups and the output group must have matching ports. The precision,
datatype, and scale must be
identical across all groups.
. We can create multiple input groups, but only one default output group.
. The Union transformation does not remove duplicate rows.
. We cannot use a Sequence Generator or Update Strategy transformation upstream
from a Union
transformation.
. The Union transformation does not generate transactions.
General questions
Q63. What is the difference between a Static and a Dynamic Lookup Cache?
Ans. Lookups are cached by default in Informatica. A static cache is one which is not modified once it is built and remains the same during the session run, whereas a dynamic cache is refreshed during the session run by inserting or updating records in the cache based on the incoming source data.
In addition, a lookup cache can be either non-persistent or persistent: the Integration Service saves or deletes the lookup cache files after a successful session run depending on whether the lookup cache is marked as persistent or not.
A Mapplet is a reusable object created in the Mapplet Designer which contains a set
of transformations and lets us
reuse the transformation logic in multiple mappings. A Mapplet can contain as many
transformations as we need.
Like a reusable transformation when we use a mapplet in a mapping, we use an
instance of the mapplet and any
change made to the mapplet is inherited by all instances of the mapplet.
Q66. What are the transformations that are not supported in Mapplet?
Ans. Normalizer, Cobol sources, XML sources, XML Source Qualifier transformations,
Target definitions, Pre- and
post- session Stored Procedures, Other Mapplets.
. PMERR_DATA- Stores data and metadata about a transformation row error and its
corresponding source
row.
. PMERR_MSG- Stores metadata about an error and the error message.
. PMERR_SESS- Stores metadata about the session.
. PMERR_TRANS- Stores metadata about the source and transformation ports, such as
name and datatype,
when a transformation error occurs.
Ans. When we issue the STOP command on the executing session task, the Integration
Service stops reading data
from source. It continues processing, writing and committing the data to targets.
If the Integration Service cannot
finish processing and committing data, we can issue the abort command.
Ans. Yes we can copy session to new folder or repository provided the corresponding
Mapping is already in there.
This article attempts to refresh your Unix skills in the form of a question/answer based Unix tutorial on Unix command lines. The commands discussed here are particularly useful for developers working in middle-tier (e.g. ETL) systems, where they may need to interact with several *nix source systems for data retrieval.
How to display the first line of a file?
The easiest way is to use the [head] command:
$> head -1 file.txt
No prize for guessing that if you specify [head -2] then it would print the first 2 records of the file.
Another way is by using the [sed] command. [sed] is a very powerful text editor which can be used for various text manipulation purposes like this:
$> sed '2,$ d' file.txt
How does the above command work? The 'd' parameter basically tells [sed] to delete all the records from the display, from line 2 to the last line of the file (the last line is represented by the $ symbol). Of course it does not actually delete those lines from the file; it just does not display them in the standard output. So you only see the remaining line, which is the 1st line.
How to display the last line of a file?
The easiest way is to use the [tail] command:
$> tail -1 file.txt
If you want to do it using the [sed] command, here is what you should write:
$> sed -n '$ p' file.txt
From our previous answer, we already know that '$' stands for the last line of the file. So '$ p' basically prints (p for print) the last line on the standard output. The '-n' switch takes [sed] to silent mode so that [sed] does not print anything else in the output.
How to display the n-th line of a file?
The easiest way to do it will be by using [sed], I guess. Based on what we already know about [sed] from our previous examples, we can quickly deduce this command:
$> sed -n '<n> p' file.txt
You need to replace <n> with the actual line number. So if you want to print the 4th line, the command will be
$> sed -n '4 p' file.txt
Of course you can do it by using the [head] and [tail] commands as well, like below (again replace <n> with the actual line number):
$> head -<n> file.txt | tail -1
So, to print the 4th line:
$> head -4 file.txt | tail -1
We already know how [sed] can be used to delete a certain line from the output, by using the 'd' switch. So if we want to delete the first line, the command should be:
$> sed '1 d' file.txt
But the issue with the above command is that it just prints out all the lines except the first line of the file on the standard output; it does not really change the file in-place. So if you want to delete the first line from the file itself, you have two options.
Either you can redirect the output to some other file and then rename it back to the original file, like below:
$> sed '1 d' file.txt > new_file.txt
$> mv new_file.txt file.txt
Or you can use the inbuilt [sed] switch '-i', which changes the file in-place. See below:
$> sed -i '1 d' file.txt
How to remove the last line/trailer from a file in a Unix script?
Always remember that the [sed] switch '$' refers to the last line. So using this knowledge we can deduce the below command:
$> sed -i '$ d' file.txt
If you want to remove line <m> to line <n> from a given file, you can accomplish
the task in the similar method
shown above. Here is an example:
$> sed -i '5,7 d' file.txt
The above command will delete line 5 to line 7 from the file file.txt
This is a bit tricky. Suppose your file contains 100 lines and you want to remove the last 5 lines. Now if you know how many lines are there in the file, then you can simply use the above shown method and remove all the lines from 96 to 100 like below:
$> sed -i '96,100 d' file.txt # alternative to command [head -95 file.txt]
But you will not always know the number of lines present in the file (the file may be generated dynamically, etc.). In that case there are many different ways to solve the problem. Some of them are quite complex and fancy. But let's first do it in a way that we can understand and remember easily. Here is how it goes:
$> tt=`wc -l file.txt | cut -f1 -d' '`; sed -i "`expr $tt - 4`,$tt d" file.txt
As you can see there are two commands. The first one (before the semi-colon) calculates the total number of lines present in the file and stores it in a variable called 'tt'. The second command (after the semi-colon) uses the variable and works in exactly the way shown in the previous example.
We already know how to print one particular line from a file, which is this:
$> sed -n '<n> p' file.txt
where <n> is to be replaced by the actual line number that you want to print. Once you know that, it is easy to print out the length of this line by using the [wc] command with the '-c' switch:
$> sed -n '35 p' file.txt | wc -c
The above command will print the length of the 35th line in file.txt.
Assuming the words in the line are separated by spaces, we can use the [cut] command. [cut] is a very powerful and useful command and it's really easy. All you have to do to get the n-th word from the line is issue the following command:
$> echo "A quick brown fox jumped over the lazy cat" | cut -f4 -d' '
fox
Another handy command is [rev], which simply reverses a string:
$> echo "unix" | rev
xinu
How to get the last word from a line in Unix?
We will make use of the two commands that we learnt above to solve this. The commands are [rev] and [cut]. Here we go.
Let's imagine the line is: "C for Cat". We need "Cat". First we reverse the line; we get "taC rof C". Then we cut the first word; we get 'taC'. And then we reverse it again.
$> echo "C for Cat" | rev | cut -f1 -d' ' | rev
Cat
We know we can do it with [cut]. For example, the below command extracts the first field from the output of the [wc -c] command:
$> wc -c file.txt | cut -d' ' -f1
109
But I want to introduce one more command to do this here, and that is [awk]. [awk] is a very powerful command for text pattern scanning and processing. Here we will see how we may use [awk] to extract the first field (or first column) from the output of another command. Like above, suppose I want to print the first column of the [wc -c] output. Here is how it goes:
$> wc -c file.txt | awk '{print $1}'
109
The basic syntax of [awk] is like this:
$> awk 'pattern {action}' input_file
In the action space, we have asked [awk] to take the action of printing the first column ($1). More on [awk] later.
How to replace the n-th line in a file with a new line in Unix?
This can be done in two steps. The first step is to remove the n-th line. And the second step is to insert a new line at the n-th line position. Here we go.
$> sed -i'' '10 d' file.txt # d stands for delete
$> sed -i'' '10 i This is the new line' file.txt # i stands for insert
How to show the non-printable characters in a file?
Open the file in the VI editor. Go to VI command mode by pressing [Escape] and then [:]. Then type [set list]. This will show you all the non-printable characters, e.g. Ctrl-M characters (^M), in the file.
In order to know the file type of a particular file, use the [file] command like below:
$> file file.txt
file.txt: ASCII text
If you want to know the technical MIME type of the file, use the '-i' switch:
$> file -i file.zip
file.zip: application/x-zip
How to connect to a database and run a query from a Unix shell script?
You will be using the same [sqlplus] command to connect to the database that you use normally, even outside the shell script. To understand this, let's take an example. In this example, we will connect to the database, fire a query and get the output printed from the Unix shell. OK? Here we go:
$> output=`sqlplus -s user/password@dbname <<EOF
SET HEADING OFF;
SELECT COUNT(*) FROM your_table;
EXIT;
EOF`
If you connect to the database in this method, the advantage is that you will be able to pass Unix-side shell variable values to the database. See the below example:
$> id=12
$> output=`sqlplus -s user/password@dbname <<EOF
SET HEADING OFF;
SELECT COUNT(*) FROM your_table WHERE group_id = $id;
EXIT;
EOF`
$> echo $output
12
In the same way, a PL/SQL block (BEGIN ... END;) can be executed inside the here-document, followed by EXIT; before the closing EOF.
How to check the command line arguments in a UNIX command in a shell script?
In a bash shell, you can access the command line arguments using the $0, $1, $2, ... variables, where $0 prints the command name, $1 prints the first input parameter of the command, $2 the second input parameter of the command, and so on.
How to fail a shell script programmatically?
Just put an [exit] command in the shell script with a return value other than 0. This is because the exit code of a successful Unix program is zero. So, suppose you write
exit -1
inside your program, then your program will throw an error and exit immediately.
How to list down files/folders alphabetically?
Normally the [ls -lt] command lists the files/folders sorted by modified time. If you want to list them alphabetically, then you should simply specify: [ls -l]
How to check if the last command ran successfully?
$> echo $?
How to check if a file is present in a particular directory in Unix?
This can be done in many ways. Based on what we have learnt so far, we can make use of the [ls] and [$?] commands to do this. See below:
$> ls -l file.txt; echo $?
If the file exists, the [ls] command will be successful, hence [echo $?] will print 0. If the file does not exist, the [ls] command will fail and hence [echo $?] will print a non-zero value.
How to check all the processes running in Unix?
The standard command to see this is [ps]. But [ps] only shows you a snapshot of the processes at that instant. If you need to monitor the processes for a certain period of time and need to refresh the results at each interval, consider using the [top] command.
$> ps -ef
If you wish to see the % of memory usage and CPU usage, then consider the below switches:
$> ps aux
If you wish to use this command inside some shell script, or if you want to customize the output of the [ps] command, you may use the '-o' switch like below. By using the '-o' switch, you can specify the columns that you want [ps] to print out.
$> ps -e -o stime,user,pid,args,%mem,%cpu
To tell if a particular process is running, you can list down all the running processes using the [ps] command and then 'grep' your user name or process name to see if the process is running. See below:
$> ps -ef | grep <process_name>
In Linux based systems, you can easily access the CPU and memory details from /proc/cpuinfo and /proc/meminfo, like this:
$> cat /proc/meminfo
$> cat /proc/cpuinfo
Just try the above commands on your system to see how they work.
Remember Codd's Rules? Or the ACID properties of a database? Maybe you still hold these basic properties close to your heart, or maybe you no longer remember them. Let's revisit these ideas once again.
A database is a collection of data for one or more uses. Databases are usually integrated and offer both data storage and retrieval.
Codd's Rule
Codd's 12 rules are a set of thirteen rules (numbered zero to twelve) proposed by
Edgar F. Codd, a pioneer of the
relational model for databases.
All information in the database is to be represented in one and only one way,
namely by values in column positions
within rows of tables.
The DBMS must allow each field to remain null (or empty). Specifically, it must
support a representation of "missing
information and inapplicable information" that is systematic, distinct from all
regular values (for example, "distinct
from zero or any other number", in the case of numeric values), and independent of
data type. It is also implied that
such representations must be manipulated by the DBMS in a systematic way.
The system must support an online, inline, relational catalog that is accessible to
authorized users by means of their
regular query language. That is, users must be able to access the database's
structure (catalog) using the same query
language that they use to access the database's data.
All views that are theoretically updatable must be updatable by the system.
The system must support set-at-a-time insert, update, and delete operators. This
means that data can be retrieved
from a relational database in sets constructed of data from multiple rows and/or
multiple tables. This rule states that
insert, update, and delete operations should be supported for any retrievable set
rather than just for a single row in a
single table.
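For example, a single set-based SQL statement operates on every qualifying row at once rather than one row at a time (table and column names are illustrative):
UPDATE employee
SET salary = salary * 1.10     -- one statement updates the whole set
WHERE department = 'SALES';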
Changes to the physical level (how the data is stored, whether in arrays or linked
lists etc.) must not require a change
to an application based on the structure.
Changes to the logical level (tables, columns, rows, and so on) must not require a
change to an application based on
the structure. Logical data independence is more difficult to achieve than physical
data independence.
Isolation: Isolation refers to the requirement that other operations cannot access
or see data that has been modified
during a transaction that has not yet completed. Each transaction must remain
unaware of other concurrently
executing transactions, except that one transaction may be forced to wait for the
completion of another transaction
that has modified data that the waiting transaction requires.
Durability: Durability is the DBMS's guarantee that once the user has been notified
of a transaction's success, the
transaction will not be lost. The transaction's data changes will survive system
failure, and that all integrity
constraints have been satisfied, so the DBMS won't need to reverse the transaction.
Many DBMSs implement
durability by writing transactions into a transaction log that can be reprocessed
to recreate the system state right
before any later failure.
Why people hate Project Managers - a must read for would-be managers
"Project Managers" are inevitable. Love them or hate them, but if you are in a project, you have to accept them. They are omnipresent in any project. They intervene too much in technical things without much knowledge. They create unrealistic targets and nonsensical methods of achieving them. And they invariably fail to acknowledge individual hard work. Are they of any use?
Remember, not all project managers are hated! So, the following reasons of course do not apply to them.
Generally, project managers are not answerable to their subordinates. They are self-paced and semi-autocratic. These allowances give them the opportunity to spend time lazily. Many project managers spend more time surfing the internet than evaluating the performance of their subordinates.
The cure for their laziness is pro-activeness, which can help them spend quality time in the office.
I know of a project manager, "Harry" (name changed), who used to receive work from the client and assign it to his subordinate, "John". Once John finished the work and sent Harry an email, Harry would copy the contents of John's mail and reply back to the client. Since Harry never "forwarded" John's mail directly to the client, the client was always oblivious to the actual person (John) doing their work. The client always sent appreciation mails to Harry only, and John was never credited for the work he did.
The advice for would-be project managers here is to remain conscious of individual contributions and to give people their due credit whenever possible.
Proper planning makes things easy. What do you think is the main difference between a NASA space project and a service-industry IT project? The project members in that NASA project are the same kind of engineers that you have in your project; maybe many of them passed out of the same graduate school. Yet the same set of people who made one project a marvellous success may fail miserably in some other project. There is nothing wrong with those people, but there is something wrong with the leader leading them. A NASA project succeeds because of meticulous and realistic planning, whereas the other project slogs.
Don't let new tools and technologies outsmart you. The technology space is ever changing; try to keep pace with it. Install the software and tools that are being used in your project on your laptop. Play with them. Know what their features are and what their limitations are. Read blogs on them. Start your own blog and write something interesting in it on a regular basis. Be savvy; otherwise you will be fooled by your own people.
Testing in data warehouse projects is, till date, a less explored area. However, if not done properly, it can be a major reason for data warehousing project failures - especially in the user acceptance phase. Given here is a mind-map that will help a project manager think through all the aspects of testing in data warehousing.
Testing Mindmap
DWBI Testing
1. Why is it important?
. To keep the code bug-free
. To ensure data quality
. To increase the credibility of BI reports
. More BI projects fail after commissioning due to quality issues
2. What constitutes DWBI Testing?
. Performance Testing
. Functional Testing
. Canned Report Testing
. Ad-hoc Testing
. Load Reconciliation
4. Why is it difficult?
. Limited testing tools
. Automated testing not always possible
. Data traceability not always available
. Requires extensive functional knowledge
. Metadata management tools often fail
. Deals with bulk data - has performance impact
. Number of data conditions is huge
Use the above mind-map to plan and prepare the testing activity for your data
warehousing project.
An enterprise data warehouse often fetches records from several disparate systems and stores them centrally in an enterprise-wide warehouse. But what is the guarantee that the quality of the data will not degrade in the process of centralization?
Data Reconciliation
Many data warehouses are built on an n-tier architecture, with multiple data extraction and data insertion jobs between two consecutive tiers. As it happens, the nature of the data changes as it passes from one tier to the next. Data reconciliation is the method of reconciling, or tying up, the data between any two consecutive tiers (layers).
Why is Reconciliation Required?
In the process of extracting data from one source, transforming it and loading it to the next layer, the whole nature of the data can change considerably. It might also happen that some information is lost while transforming the data. A reconciliation process helps to identify such loss of information.
One of the major reasons for information loss is loading failures or errors during loading, which can occur for several reasons. A failure due to any such issue can result in potential information loss, leading to unreliable data quality for business decision making.
Furthermore, if such issues are not rectified at the earliest, they become even more costly to "patch" later. Therefore it is highly recommended that a proper data reconciliation process be in place in any data Extraction-Transformation-Load (ETL) process.
Data reconciliation is often confused with the process of data quality testing. Even worse, sometimes the data reconciliation process is used to investigate and pinpoint data issues.
While data reconciliation may be a part of data quality assurance, the two are not necessarily the same.
A successful reconciliation process should only indicate whether or not the data is correct. It will not indicate why the data is not correct. Reconciliation answers the "what" part of the question, not the "why" part.
Methods of Data Reconciliation
Master data reconciliation is the method of reconciling only the master data
between source and target. Master data
are generally unchanging or slowly changing in nature and no aggregation operation
is done on the dataset. That is -
the granularity of the data remains same in both source and target. That is why
master data reconciliation is often
relatively easy and quicker to implement.
In one business process, �customer�, �products�, �employee� etc. are some good
example of master data. Ensuring
the total number of customer in the source systems match exactly with the total
number of customers in the target
system is an example of customer master data reconciliation.
Some of the common examples of master data reconciliation can be the following
measures,
Sales quantity, revenue, tax amount, service usage etc. are examples of transactional data. Transactional data form the very basis of BI reports, so any mismatch in transactional data has a direct impact on the reliability of the reports and of the whole BI system in general. That is why a reconciliation mechanism must be in place to detect such a discrepancy beforehand (meaning, before the data reaches the final business users).
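A minimal sketch of a transactional data check, again with assumed table and column names (stg_sales, fct_sales, sales_qty), compares an aggregated measure between the source layer and the target fact table:
-- Transactional data reconciliation sketch: compare the total sales quantity
-- loaded for a given day between an assumed staging table and fact table.
SELECT s.total_src_qty,
       f.total_fact_qty
FROM  (SELECT SUM(sales_qty) AS total_src_qty
       FROM   stg_sales
       WHERE  sales_date = '2024-01-01') s,
      (SELECT SUM(sales_qty) AS total_fact_qty
       FROM   fct_sales
       WHERE  sales_date = '2024-01-01') f;
-- If the two totals differ, that day's load should be investigated before
-- the reports are released to business users.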
This paper outlines some of the most important (and equally neglected) things that one must consider before and during the design phase of a data warehouse. In our experience, data warehouse designers often miss out on these items merely because they thought them too trivial to merit attention. Guess what - at the end of the day such neglect costs them heavily, as it cuts short the overall ROI of the data warehouse.
Here we outline some data warehouse gotchas that you should be aware of.
In a top-down design approach, people often start by visualizing the end data and realize the complexity associated with data analytics first. As they see more details of it, they tend to devote more time to designing the analytical or reporting solutions and less time to designing the background ETL components that deal with data extraction / cleansing / transformation etc. They often live under the assumption that it would be comparatively easy to map the source data from the existing systems, since users already have a better understanding of the source systems, and that the need for and complexity of cleansing / profiling the source data would be less since the data is already coming from standard source systems.
From a budgeting and costing standpoint as well, an architect prefers to make the case for data reporting and analytics over background ETL, as the former can be more easily presented to senior management than the latter in order to get them to sanction the budget. This leads to a disproportionate budget split between the background ETL and front-end reporting tasks.
Users often do not know what they want from the data until they start to see the data. As development progresses and more and more data visualization becomes possible, users start wishing for even more out of their data. This phenomenon is unavoidable, and designers must allocate extra time to accommodate such ad-hoc requirements.
Many requirements that were implicit in the beginning become explicit and indispensable in the later phases of the project. Since you cannot avoid this, make sure that you have adequate time allocated in your project plan beforehand.
3. Issues will be discovered in the source system that went undetected till date
The power of an integrated data warehouse becomes apparent when you start discovering discrepancies and issues in the existing, so-called stable source systems. The real problem, however, is that designers often make the wrong assumption that the source or upstream systems are fault free, and so they do not allocate any time or resource in their project plan to deal with those issues.
Data warehouse developers do discover issues in the source systems, and those issues take a lot of time to get fixed. More often than not, those issues are not even fixed in the source (to minimize the impact on business), and some workaround is suggested to deal with them at the data warehouse level directly (although that is generally not a good idea). Source system issues confuse everybody and require more administrative time (than technical time) to resolve, as DW developers need to identify the issues and make their case to the source system owners to prove that the issues do exist. These are huge time wasters and are often not incorporated in the project plan.
4. You will need to validate data not being validated in source systems
Source systems do not always give you correct data. A lot of validations and checks are not done at the source system level (e.g. in OLTP systems), and each time a validation check is skipped it creates the danger of sending unexpected data to the data warehouse. Therefore, before you can actually process data in the data warehouse, you will need to perform some validation checks at your end to ensure the expected data availability.
This is again unavoidable. If you do not make those checks, that will cause issues on your side - data loading errors, reconciliation failures, even data integrity threats. Hence ensure that proper time and resource allocation are there to work on these items.
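As an illustration, a pre-load validation check of this kind might look like the sketch below; the staging table and the three rules are assumptions made up for the example, not rules from any particular source system.
-- Pre-load validation sketch: flag staged rows that would break the load.
SELECT sales_id,
       customer_id,
       sales_qty,
       sales_date
FROM   stg_sales
WHERE  customer_id IS NULL              -- missing mandatory key
   OR  sales_qty < 0                    -- impossible measure value
   OR  sales_date > CURRENT_TIMESTAMP;  -- future-dated transaction
-- Rows returned by this query are quarantined or sent back to the source
-- team instead of being loaded into the warehouse.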
5. User training will not be sufficient and users will not put their training to
use
You will face natural resistance from the existing business users, who will show huge inertia against accepting the new system. To ease things, adequate user training sessions are generally arranged for the users of the data warehouse. But you will notice that "adequate" training is not "sufficient" for them (mainly because they need to unlearn a lot of things in order to learn the use of the new data warehouse).
Even if you arrange adequate training for the users, you will find that they are not really putting their training to use when it comes to doing things in the new data warehouse. That's often because facts and figures from the new data warehouse challenge their existing convictions, and they are reluctant to accept them wholeheartedly.
User training and acceptance is probably the single most important non-technical challenge that makes or breaks a data warehouse. No matter how much effort you put in as a designer to design the data warehouse, if the users are not using it, the data warehouse is as good as a failure. As the old Sanskrit saying goes, "a tree is known by the name of its fruit": the success of a data warehouse is measured by the information it produces. If the information is not relevant to the users and they are reluctant to use it, you have lost the purpose. Hence make every possible effort to connect with the users and train them to use the data warehouse. Mere 'adequate' training is not 'sufficient' here.
That is because the users often belong to different departments of the company, and even though each of them knows the business of her own department pretty well, she may not know the business of the other departments that well. And when you take the data from all these departments and try to combine it into an integrated data warehouse, you will often discover that a business rule suggested by one user is completely opposite to the business rule suggested by another.
Such cases generally get involved and need collaboration between multiple parties to come to a conclusion. It's better to consider such cases early, during the planning phase, to avoid late surprises.
A very minutely done volumetric estimate from the starting phase of the project can go awry later. This happens for several reasons; e.g. a slight change in a standard business metric may create a huge impact on the volumetric estimates.
For example, suppose a company has 1 million customers who are expected to grow at a rate of 7% per annum. While calculating the volume and size of your data warehouse you have used this measure in several places. Now if the customer base actually increases by 10% instead of 7%, that means 30,000 more customers. In a fact table with the granularity customer, product, day, this means 30,000 X 10 X 365 more records (assuming on average one customer uses 10 products). If one record takes 1 KB, the fact table would now require (30,000 X 365 X 10 X 1 KB) / (1024 X 1024) = 100+ GB of additional disk space from just one table.
When a user looks at one value in your report and says, "I think it's not right", the onus is on you to prove the correctness or validity of that data. Nobody is going to help you prove how right your data warehouse is. For this reason, it is absolutely necessary to build a solid data reconciliation framework for your data warehouse - a reconciliation framework that can trigger an early alarm whenever something does not match between source and target, so that you get enough time to investigate (and, if required, fix) the issue.
Such a reconciliation framework, however indispensable, is not easy to create. Not only does it require a huge amount of effort and expertise, it also tends to run on the same production server at almost the same time as the production load and eats up a lot of performance. Moreover, such a reconciliation framework is often not a client-side requirement, making it even more difficult for you to allocate time and budget. But not building it would be a much bigger mistake.
It's important to set the expectation at the very beginning of the project about the huge maintenance cost implications.
10. Amount of time needed to refresh your data warehouse is going to be your top
concern
You need to load data into the data warehouse, generally at least daily (although sometimes more frequently than this) and also monthly / quarterly / yearly etc. Loading the latest data into the data warehouse ensures that your reports are all up to date. However, the time required to load the data (the refresh time) is going to be more than what you have calculated, and it is also going to increase day by day.
One of the major hindrances to the acceptance of a data warehouse by its users is its performance. I have seen too many cases where reports generated from the data warehouse miss their SLA and severely damage the dependability and credibility of the data warehouse. In fact, I have seen cases where the daily load runs for more than a day and never completes in time to generate the daily report. There have been other famous cases of SLA breach as well.
I cannot stress this enough: performance considerations are hugely important for the success of a data warehouse, and more important than you might think. Do everything necessary to make your data warehouse perform well - reduce overhead, maintain servers, cut off complexities, do regular system performance tests (SPT) and weigh the performance against industry benchmarks, make SPT a part of user acceptance testing (UAT), etc.
Well, you may be questioning how large is large. The answer: it depends. You must ask yourself if the large size of the model is really justified. The more complex your model is, the more prone it is to contain design errors. For example, you may want to try to limit your models to no more than 200 tables. To be able to do that, in the early phase of data modelling ask yourself these questions -
If you consciously try to keep things simple, most likely you will also be able to avoid the menace of over-modelling. Over-modelling leads to over-engineering, which leads to overwork without any defined purpose. A person who does modelling just for the sake of modelling often ends up over-modelling.
All of the above are sure signs of over-modelling, which only increases your burden (of coding, of loading, of maintaining, of securing, of using).
The purpose of the model determines the level of detail that you want to keep in the model. If you are unsure about the purpose, you will definitely end up designing a model that is too detailed or too brief for that purpose.
Violation of Normalization
Clarity is also very important. For example, do you clearly know the data types that you should be using for all the business attributes? Or do you end up using speculative data types (and lengths)?
Modern data modelling tools come with different concepts for declaring data (e.g. the domain and enumeration concepts in ERwin) that help bring clarity to the model. So, before you start building, pause for a moment and ask yourself if you really understand the purpose of the model.
When the tables in the model satisfy higher levels of normal forms, they are less likely to store redundant or contradictory data. But there is no hard and fast rule about maintaining those normal forms. A modeller is allowed to violate these rules for a good purpose (such as to increase performance), and such a relaxation is called denormalization.
The problem occurs when a modeller violates the normal forms deliberately, without a clearly defined purpose. Such reckless violation breaks apart the whole design principle behind the data model and often renders the model unusable. So if you are unsure of something, just stick to the rules. Don't be driven by vague purposes.
The above figure shows a general hierarchical relationship between customer and its related categories. Let's say a customer can fall under the following categories - Consumer, Business, Corporate and Wholesaler. Given this condition, "ConsumerFlag" is a redundant column on the Customer table.
Theoretically speaking, there is no issue with such a model - at least until one tries to create the ETL (extraction-transformation-loading) code behind these tables.
Even query optimization becomes difficult when one disassociates the surrogate key from the natural key. The reason is that, since the surrogate key takes the place of the primary key, the unique index is applied on that column, and any query based on the natural key identifier leads to a full table scan because it cannot take advantage of the unique index on the surrogate key.
If the answers to the above questions are "YES", don't use a surrogate key.
This paper introduces the subject of data mining in simple, lucid language and moves on to build more complex concepts. Start here if you are a beginner.
Not because I hate the subject of data mining itself, but because this term is so over-used, misused, exploited and commercialized, and often conveyed in an inaccurate manner, in inappropriate places and with intentional vagueness.
So when I decided to write about what data mining is, I was convinced that I needed to write first about what is NOT data mining, in order to build a formal definition of data mining.
What is Data Mining? (And what it is not)
Now the question is: what does the above definition really mean, and how does it differ from finding information in databases? We often store information in databases (as in data warehouses) and retrieve that information from the database when we need it. Is that data mining? The answer is "no". We will soon see why.
Let's start with the big picture first. It all starts with something called "Knowledge Discovery in Databases". Data mining is basically one of the steps in the process of knowledge discovery in databases (KDD). The knowledge discovery process is basically divided into 5 steps:
1. Selection
2. Pre-processing
3. Transformation
4. Data Mining
5. Evaluation
Notice here the term "Knowledge", as in Knowledge Discovery in Databases (KDD). Why did we say "Knowledge"? Why not "information" or "data"?
This is because there are differences among the terms "data", "information" and "knowledge". Let's understand this difference through an example.
You run a local departmental store and you log all the details of your customers in the store database. You know the names of your customers and what items they buy each day.
For example, Alex, Jessica and Paul visit your store every Sunday and buy candles. You store this information in your store database. This is data. Any time you want to know who the visitors are that buy candles, you can query your database and get the answer. This is information. If you want to know how many candles are sold on each day of the week from your store, you can again query your database and you'd get the answer - that's also information.
But suppose there are 1000 other customers who also buy candles from you every Sunday (mostly, with some percentage of variation) and all of them are Christian by religion. So you may conclude that Alex, Jessica and Paul must also be Christian.
Now, the religion of Alex, Jessica and Paul was not given to you as data, and it could not be retrieved from the database as information. But you learnt this piece of information indirectly. This is the "knowledge" that you discovered. And this discovery was done through a process called "data mining".
Now there is a chance that you are wrong about Alex, Jessica and Paul. But there is a fair chance that you are actually right. That is why it is very important to "evaluate" the result of the KDD process.
I gave you this example because I wanted to make a clear distinction between knowledge and information in the context of data mining. This is important for understanding our first question - why retrieving information from deep down in your database is not the same as data mining. No matter how complex the information retrieval process is, no matter how deep the information is located, it's still not data mining.
As long as you are not dealing with predictive analysis or discovering "new" patterns in the existing data, you are not doing data mining.
When it comes to applying data mining, your imagination is the only barrier (not really - there are technological hindrances as well, as we will see later). But it's true that data mining is applied in almost every field, from genetics to human rights violation. One of the most important applications is in "machine learning". Machine learning is a branch of artificial intelligence concerned with the design and development of algorithms that allow computers to evolve behaviours based on empirical data. Machine learning makes it possible for computers to take autonomous decisions based on the data available from past experiences. Many of the standard problems of today's world are being solved by the application of machine learning, as solving them otherwise (e.g. through a deterministic algorithmic approach) would be impossible given the breadth and depth of the problem.
Let me start with one example of the application of data mining that enables a machine-learning algorithm to drive an autonomous vehicle. This vehicle does not have any driver, and it moves around the road all by itself. The way it manoeuvres and overcomes obstacles is by taking the images that it sees (through a VGA camera) and then using data mining to determine the course of action based on the data of its past experiences.
. Voice recognition
Think of Siri on the iPhone. How does it understand your commands? Clearly it's not deterministically programmable, as everybody has a different tone, accent and voice. And not only does it understand, it also adapts better to your voice as you keep using it more and more.
DNA sequence contains biological information. One of the many approaches to DNA sequencing is sequence mining, where data mining techniques are applied to find statistically relevant patterns, which are then compared with previously studied sequences to understand the given sequence.
Now consider this. What if "Linda" was an automated machine? You could probably still have the same kind of conversations, but they would probably be much more unnatural.
I know the above example is a bit of an overshoot, but you get the idea. Machines do not understand natural language, and it's a challenge to make them understand it. Until we do that, we won't be able to build a really useful human-computer interface.
Now, if the above examples interest you, let's continue learning more about data mining. One of the first tasks we have to do next is to understand the different approaches used in the field of data mining. The list below shows most of the important methods:
Anomaly Detection
This is the method of detecting patterns in a given data set that do not conform to an established normal behaviour. It is applied in a number of different fields, such as network intrusion detection, share market fraud detection etc.
Clustering
Classification
This method is used for the task of generalizing known structure to apply to new
data. For example, an email
program might attempt to classify an email as legitimate or spam.
Regression
Attempts to find a function which models the data with the least error. The above example of autonomous driving uses this method.
Next we will learn about each of these methods in greater detail, with examples of their applications.
SQL Questions
Check out the article Q100139 from the Microsoft knowledge base and, of course, there's much more information available on the net. It will be a good idea to get hold of an RDBMS fundamentals text book, especially the one by C. J. Date. Most of the time, it will be okay if you can explain up to third normal form.
Both primary key and unique key enforce uniqueness of the column on which they are defined. But by default primary key creates a clustered index on the column, whereas unique key creates a non-clustered index by default. Another major difference is that primary key does not allow NULLs, but unique key allows one NULL only.
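A minimal sketch of the two constraints side by side (the table and column names are assumed for illustration):
-- PRIMARY KEY vs UNIQUE constraint on an assumed table.
CREATE TABLE flight
(
    flight_id  INT        NOT NULL PRIMARY KEY, -- clustered index by default, no NULLs
    flight_num VARCHAR(8) NULL UNIQUE           -- non-clustered index by default, one NULL allowed
);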
What are user defined data types and when should you go for them?
User defined data types let you extend the base SQL Server data types by providing a descriptive name and format to the database. Take, for example, a column called Flight_Num which appears in many tables in your database. In all these tables it should be varchar(8). In this case you could create a user defined data type called Flight_num_type of varchar(8) and use it across all your tables.
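A hedged sketch of how that might be done; the booking table below is an assumed example, while Flight_num_type comes from the text above:
-- Create the user defined data type once...
EXEC sp_addtype 'Flight_num_type', 'varchar(8)', 'NOT NULL';

-- ...then reuse it wherever the column appears.
CREATE TABLE booking
(
    booking_id INT NOT NULL PRIMARY KEY,
    flight_num Flight_num_type
);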
What is bit data type and what's the information that can be stored inside a bit
column?
The bit data type is used to store Boolean information like 1 or 0 (true or false). Until SQL Server 6.5 the bit data type could hold either a 1 or a 0 and there was no support for NULL. But from SQL Server 7.0 onwards, the bit data type can represent a third state, which is NULL.
A candidate key is one that can identify each row of a table uniquely. Generally a candidate key becomes the primary key of the table. If the table has more than one candidate key, one of them will become the primary key and the rest are called alternate keys.
A key formed by combining at least two columns is called a composite key.
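As an illustration (table and columns assumed), a composite primary key on an order-line table could look like this:
-- Composite key formed from two columns.
CREATE TABLE order_line
(
    order_id INT NOT NULL,
    line_no  INT NOT NULL,
    qty      INT NOT NULL,
    PRIMARY KEY (order_id, line_no) -- both columns together identify a row
);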
A transaction is a logical unit of work in which all the steps must be performed or none. ACID stands for Atomicity, Consistency, Isolation, Durability; these are the properties of a transaction. For more information and an explanation of these properties, see SQL Server Books Online or any RDBMS fundamentals text book.
What type of Index will get created after executing the above statement?
8060 bytes is the maximum size of a row. Do not be surprised by questions like 'What is the maximum number of columns per table?'. Check out SQL Server Books Online for the page titled "Maximum Capacity Specifications".
Lock escalation is the process of converting a lot of low-level locks (like row locks and page locks) into higher-level locks (like table locks). Every lock is a memory structure, so too many locks would mean more memory being occupied by locks. To prevent this from happening, SQL Server escalates the many fine-grain locks to fewer coarse-grain locks. The lock escalation threshold was definable in SQL Server 6.5, but from SQL Server 7.0 onwards it's dynamically managed by SQL Server.
What's the difference between DELETE TABLE and TRUNCATE TABLE commands?
DELETE TABLE is a logged operation, so the deletion of each row gets logged in the transaction log, which makes it slow. TRUNCATE TABLE also deletes all the rows in a table, but it will not log the deletion of each row; instead it logs the de-allocation of the data pages of the table, which makes it faster. Of course, in SQL Server, TRUNCATE TABLE can still be rolled back if it is issued inside a transaction.
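A quick sketch (reusing the assumed order_line table from above) contrasting the two statements:
-- Logged, row-by-row delete; can be restricted with a WHERE clause.
DELETE FROM order_line WHERE order_id = 42;

-- Minimally logged removal of all rows; no WHERE clause is allowed.
TRUNCATE TABLE order_line;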
Check out MOLAP, ROLAP and HOLAP in SQL Server books online for more information.
What are the new features introduced in SQL Server 2000 (or the latest release of
SQL Server at the time
of your interview)? What changed between the previous version of SQL Server and the
current version?
This question is generally asked to see how current your knowledge is. Generally there is a section at the beginning of Books Online titled "What's New", which has all such information. Of course, reading just that is not enough; you should have tried those things out to better answer the questions. Also check out the section titled "Backward Compatibility" in Books Online, which talks about the changes that have taken place in the new version.
Types of constraints: NOT NULL, CHECK, UNIQUE, PRIMARY KEY, FOREIGN KEY
For an explanation of these constraints see books online for the pages titled:
"Constraints" and "CREATE
TABLE", "ALTER TABLE"
What is an index? What are the types of indexes? How many clustered indexes can be
created on a
table? I create a separate index on each column of a table. What are the advantages and disadvantages of this approach?
Indexes in SQL Server are similar to the indexes in books. They help SQL Server retrieve the data quicker. Indexes are of two types: clustered indexes and non-clustered indexes. When you create a clustered index on a table, all the rows in the table are stored in the order of the clustered index key, so there can be only one clustered index per table. Non-clustered indexes have their own storage, separate from the table data storage. Non-clustered indexes are stored as B-tree structures (as are clustered indexes), with the leaf-level nodes having the index key and its row locator. The row locator could be the RID or the clustered index key, depending upon the absence or presence of a clustered index on the table.
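A brief sketch of creating both kinds of index, on an assumed emp_history table that does not already have a clustered index:
-- Clustered index: the table rows themselves are ordered by emp_id.
CREATE CLUSTERED INDEX ix_emp_history_id ON emp_history (emp_id);

-- Non-clustered index: a separate B-tree whose leaf level points back to the rows.
CREATE NONCLUSTERED INDEX ix_emp_history_dept ON emp_history (dept_id);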
RAID stands for Redundant Array of Inexpensive Disks, used to provide fault tolerance to database servers. There are six RAID levels, 0 through 5, offering different levels of performance and fault tolerance. MSDN has some information about RAID levels, and for detailed information, check out the RAID Advisory Board's homepage.
What are the steps you will take to improve performance of a poor performing query?
This is a very open-ended question and there could be a lot of reasons behind the poor performance of a query. But some general issues that you could talk about would be: no indexes, table scans, missing or out-of-date statistics, blocking, excess recompilations of stored procedures, procedures and triggers without SET NOCOUNT ON, poorly written queries with unnecessarily complicated joins, too much normalization, and excess usage of cursors and temporary tables.
Some of the tools/ways that help you troubleshoot performance problems are: SET SHOWPLAN_ALL ON, SET STATISTICS IO ON, SQL Server Profiler, the Windows Performance Monitor and the graphical execution plan in Query Analyzer.
Download the white paper on performance tuning SQL Server from the Microsoft web site.
What are the steps you will take, if you are tasked with securing an SQL Server?
Again this is another open ended question. Here are some things you could talk
about: Preferring NT
authentication, using server, database and application roles to control access to
the data, securing the
physical database files using NTFS permissions, using an unguessable SA password,
restricting physical
access to the SQL Server, renaming the Administrator account on the SQL Server
computer, disabling the
Guest account, enabling auditing, using multi-protocol encryption, setting up SSL,
setting up firewalls,
isolating SQL Server from the web server etc.
Read the white paper on SQL Server security from Microsoft website. Also check out
My SQL Server
security best practices
What is a deadlock and what is a live lock? How will you go about resolving
deadlocks?
Deadlock is a situation when two processes, each having a lock on one piece of
data, attempt to acquire a
lock on the other's piece. Each process would wait indefinitely for the other to
release the lock, unless one
of the user processes is terminated. SQL Server detects deadlocks and terminates
one user's process.
Check out SET DEADLOCK_PRIORITY and "Minimizing Deadlocks" in SQL Server books
online. Also
check out the article Q169960 from Microsoft knowledge base.
Blocking happens when one connection from an application holds a lock and a second
connection
requires a conflicting lock type. This forces the second connection to wait,
blocked on the first.
Read up the following topics in SQL Server books online: Understanding and avoiding
blocking, Coding
efficient transactions.
Many of us are used to creating databases from the Enterprise Manager or by just issuing the command CREATE DATABASE MyDB. But what if you have to create a database with two file groups, one on drive C and the other on drive D, with the log on drive E, an initial size of 600 MB and a growth factor of 15%? That's why, as a DBA, you should be familiar with the CREATE DATABASE syntax. Check out SQL Server Books Online for more information.
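A hedged sketch of what such a statement might look like; the database name, logical file names and paths are assumptions for the example (here the 600 MB initial size is applied to each file):
-- Database with two filegroups on different drives and the log on a third.
CREATE DATABASE SalesDB
ON PRIMARY
    (NAME = SalesDB_data1, FILENAME = 'C:\data\SalesDB_data1.mdf',
     SIZE = 600MB, FILEGROWTH = 15%),
FILEGROUP FG2
    (NAME = SalesDB_data2, FILENAME = 'D:\data\SalesDB_data2.ndf',
     SIZE = 600MB, FILEGROWTH = 15%)
LOG ON
    (NAME = SalesDB_log, FILENAME = 'E:\log\SalesDB_log.ldf',
     SIZE = 600MB, FILEGROWTH = 15%);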
How to restart SQL Server in single user mode? How to start SQL Server in minimal
configuration
mode?
SQL Server can be started from the command line using SQLSERVR.EXE. This EXE has some very important parameters with which a DBA should be familiar. -m is used for starting SQL Server in single-user mode, and -f is used to start SQL Server in minimal configuration mode. Check out SQL Server Books Online for more parameters and their explanations.
As a part of your job, what are the DBCC commands that you commonly use for
database
maintenance?
DBCC CHECKDB,
DBCC CHECKTABLE,
DBCC CHECKCATALOG,
DBCC CHECKALLOC,
DBCC SHOWCONTIG,
DBCC SHRINKDATABASE,
DBCC SHRINKFILE etc.
But there are a whole load of DBCC commands which are very useful for DBAs. Check
out SQL Server
books online for more information.
What are statistics, under what circumstances they go out of date, how do you
update them?
UPDATE STATISTICS,
STATS_DATE,
DBCC SHOW_STATISTICS,
CREATE STATISTICS,
DROP STATISTICS,
sp_autostats,
sp_createstats,
sp_updatestats
What are the different ways of moving data/databases between servers and databases
in SQL Server?
There are lots of options available, you have to choose your option depending upon
your requirements.
Some of the options you have are:
BACKUP/RESTORE,
Detaching and attaching databases,
Replication,
DTS,
BCP,
logshipping,
INSERT...SELECT,
SELECT...INTO,
creating INSERT scripts to generate data.
Types of backups you can create in SQL Server 7.0+ are full database backup, differential database backup, transaction log backup and filegroup backup. Check out the BACKUP and RESTORE commands in SQL Server Books Online. Be prepared to write the commands in your interview. Books Online also has information on the detailed backup/restore architecture and on when one should go for a particular kind of backup.
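For instance, a minimal full backup and restore pair (database and path names assumed) could be:
-- Full database backup to an assumed path.
BACKUP DATABASE SalesDB TO DISK = 'E:\backup\SalesDB_full.bak';

-- Restore the same backup, replacing the existing database.
RESTORE DATABASE SalesDB FROM DISK = 'E:\backup\SalesDB_full.bak' WITH REPLACE;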
What is database replication? What are the different types of replication you can
set up in SQL Server?
* Snapshot replication
* Transactional replication (with immediate updating subscribers, with queued
updating subscribers)
* Merge replication
See SQL Server books online for in-depth coverage on replication. Be prepared to
explain how different
replication agents function, what are the main system tables used in replication
etc.
How to determine the service pack currently installed on SQL Server?
The global variable @@Version stores the build number of the sqlservr.exe, which is
used to determine the
service pack installed. To know more about this process visit SQL Server service
packs and versions.
What are cursors? Explain different types of cursors. What are the disadvantages of
cursors? How can
you avoid cursors?
Types of cursors:
Static,
Dynamic,
Forward-only,
Keyset-driven.
Disadvantages of cursors: each time you fetch a row from the cursor, it results in a network round trip, whereas a normal SELECT query makes only one round trip, however large the result set is. Cursors are also costly because they require more resources and temporary storage (resulting in more IO operations). Further, there are restrictions on the SELECT statements that can be used with some types of cursors.
Most of the time, set-based operations can be used instead of cursors. For example, if you have to give a flat hike to your employees based on some salary criteria, a single set-based UPDATE can do it, as shown in the sketch below.
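A hedged sketch of the set-based alternative; the employee table, salary bands and hike amounts are assumptions made up for this illustration:
-- Set-based salary hike instead of a cursor loop.
-- Assumed criteria: +1000 for salaries below 30000, +500 otherwise.
UPDATE employee
SET    salary = salary +
       CASE WHEN salary < 30000 THEN 1000
            ELSE 500
       END;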
Another situation in which developers tend to use cursors: you need to call a stored procedure when a column in a particular row meets a certain condition. You don't have to use cursors for this; it can be achieved using a WHILE loop, as long as there is a unique key to identify each row.
Write down the general syntax for a SELECT statements covering all the options.
Here's the basic syntax: (Also checkout SELECT in books online for advanced
syntax).
SELECT select_list
[INTO new_table]
FROM table_source
[WHERE search_condition]
[GROUP BY group_by_expression]
[HAVING search_condition]
[ORDER BY order_expression [ASC | DESC]]
Joins are used in queries to explain how different tables are related. Joins also
let you select data from a
table depending upon data from another table.
Types of joins:
INNER JOINs,
OUTER JOINs,
CROSS JOINs
For more information see pages from books online titled: "Join Fundamentals" and
"Using Joins".
Yes, very much. Check out BEGIN TRAN, COMMIT, ROLLBACK, SAVE TRAN and @@TRANCOUNT
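As a quick, hedged illustration (reusing the assumed employee table from the earlier sketches) of an explicit T-SQL transaction:
-- Simple explicit transaction: both statements succeed or neither does.
BEGIN TRAN;
    UPDATE employee SET salary = salary * 1.10 WHERE dept_id = 10;
    DELETE FROM employee WHERE emp_id = 999;
COMMIT TRAN;
-- Use ROLLBACK TRAN instead of COMMIT TRAN to undo the changes;
-- @@TRANCOUNT shows how many open transactions the session has.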
What is an extended stored procedure? Can you instantiate a COM object by using T-
SQL?
Yes, you can instantiate a COM (written in languages like VB, VC++) object from T-
SQL by
using sp_OACreate stored procedure.
What is the system function to get the current user's user id?
USER_ID(). Also check out other system functions like
USER_NAME(),
SYSTEM_USER,
SESSION_USER,
CURRENT_USER,
USER,
SUSER_SID(),
HOST_NAME().
What are triggers? How many triggers you can have on a table? How to invoke a
trigger on demand?
Triggers are a special kind of stored procedure that gets executed automatically when an INSERT, UPDATE or DELETE operation takes place on a table.
In SQL Server 6.5 you could define only 3 triggers per table, one for INSERT, one
for UPDATE and one
for DELETE. From SQL Server 7.0 onwards, this restriction is gone, and you could
create multiple
triggers per each action. But in 7.0 there's no way to control the order in which
the triggers fire. In SQL
Server 2000 you could specify which trigger fires first or fires last using
sp_settriggerorder
Triggers cannot be invoked on demand. They get triggered only when an associated
action (INSERT,
UPDATE, DELETE) happens on the table on which they are defined.
Triggers are generally used to implement business rules, auditing. Triggers can
also be used to extend the
referential integrity checks, but wherever possible, use constraints for this
purpose, instead of triggers, as
constraints are much faster.
Till SQL Server 7.0, triggers fire only after the data modification operation
happens. So in a way, they are
called post triggers. But in SQL Server 2000 you could create pre triggers also.
Search SQL Server 2000
books online for INSTEAD OF triggers.
Also check out books online for 'inserted table', 'deleted table' and
COLUMNS_UPDATED()
There is a trigger defined for INSERT operations on a table, in an OLTP system. The
trigger is written to
instantiate a COM object and pass the newly inserted rows to it for some custom
processing.
Instantiating COM objects is a time consuming process and since you are doing it
from within a trigger, it
slows down the data insertion process. Same is the case with sending emails from
triggers. This scenario
can be better implemented by logging all the necessary data into a separate table,
and have a job which
periodically checks this table and does the needful.
Here is an advanced query using a LEFT OUTER JOIN that even returns the employees
without
managers (super bosses)
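A sketch of such a query, assuming an employee table with emp_id, emp_name and a self-referencing mgr_id column:
-- Self-join: list each employee with their manager's name.
-- The LEFT OUTER JOIN keeps employees whose mgr_id is NULL (the "super bosses").
SELECT e.emp_name AS employee,
       m.emp_name AS manager
FROM   employee e
LEFT OUTER JOIN employee m
       ON e.mgr_id = m.emp_id;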
1. You need to see the last fifteen lines of the files dog, cat and horse. What
command should you use?
tail -15 dog cat horse
The tail utility displays the end of a file. The -15 tells tail to display the last
fifteen lines of each specified file.
2. Who owns the data dictionary?
The SYS user owns the data dictionary. The SYS and SYSTEM users are created when
the database is
created.
3. You routinely compress old log files. You now need to examine a log from two
months ago. In order to
view its contents without first having to decompress it, use the _________ utility.
zcat
The zcat utility allows you to examine the contents of a compressed file much the
same way that cat
displays a file.
4. You suspect that you have two commands with the same name as the command is not
producing the
expected results. What command can you use to determine the location of the command
being run?
which
The which command searches your path until it finds a command that matches the
command you are
looking for and displays its full path.
5. You locate a command in the /bin directory but do not know what it does. What
command can you use to
determine its purpose.
whatis
The whatis command displays a summary line from the man page for the specified
command.
6. You wish to create a link to the /data directory in bob's home directory so you issue the command ln /data /home/bob/datalink but the command fails. What option should you use in this command line to be successful?
Use the -s option
Hard links to directories are not permitted, so you need to create a symbolic link instead: ln -s /data /home/bob/datalink.
7. When you issue the command ls -l, the first character of the resulting display represents the file's ___________.
type
The first character of the permission block designates the type of file that is
being displayed.
8. What utility can you use to show a dynamic listing of running processes?
__________
top
The top utility shows a listing of all running processes that is dynamically
updated.
9. Where is standard output usually directed?
to the screen or display
By default, your shell directs standard output to your screen or display.
10. You wish to restore the file memo.ben which was backed up in the tarfile
MyBackup.tar. What command
should you type?
tar xf MyBackup.tar memo.ben
This command uses the x switch to extract a file. Here the file memo.ben will be
restored from the tarfile
MyBackup.tar.
11. You need to view the contents of the tarfile called MyBackup.tar. What command
would you use?
tar tf MyBackup.tar
The t switch tells tar to display the contents and the f modifier specifies which
file to examine.
12. You want to create a compressed backup of the users' home directories. What utility should you use?
tar
You can use the z modifier with tar to compress your archive at the same time as
creating it.
13. What daemon is responsible for tracking events on your system?
syslogd
The syslogd daemon is responsible for tracking system information and saving it to
specified log files.
14. You have a file called phonenos that is almost 4,000 lines long. What text
filter can you use to split it into
four pieces each 1,000 lines long?
split
The split text filter will divide files into equally sized pieces. The default
length of each piece is 1,000 lines.
15. You would like to temporarily change your command line editor to be vi. What
command should you
type to change it?
set -o vi
The set command is used to assign environment variables. In this case, you are
instructing your shell to
assign vi as your command line editor. However, once you log off and log back in
you will return to the
previously defined command line editor.
16. What account is created when you install Linux?
root
Whenever you install Linux, only one user account is created. This is the superuser
account also known as
root.
17. What command should you use to check the number of files and disk space used and each user's defined quotas?
repquota
The repquota command is used to get a report on the status of the quotas you have
set including the amount
of allocated space and amount of used space.
Why do you need indexing? Where is it stored and what do you mean by a schema object? For what purpose do we use views?
Indexing is used for faster searching, i.e. to retrieve data faster from various tables. A schema contains a set of tables; basically a schema means a logical separation of the database. A view is created for faster retrieval of data; it is a customized virtual table, and we can create a single view of multiple tables. The only drawback is that the view needs to be refreshed for retrieving updated data.
Triggers are fired implicitly on the tables/views on which they are created. There are various advantages of using a trigger. Some of them are:
. Suppose we need to validate a DML statement (INSERT/UPDATE/DELETE) that modifies a table; we can write a trigger on the table that gets fired implicitly whenever a DML statement is executed on that table.
. Another reason for using triggers is the automatic updating of one or more tables whenever a DML/DDL statement is executed on the table on which the trigger is created.
. Triggers can be used to enforce constraints. For example, if insert/update/delete statements should not be allowed on a particular table after office hours, a trigger can enforce that constraint.
. Triggers can be used to publish information about database events to subscribers. The database event can be a system event, like database startup or shutdown, or a user event, like a user logging in or logging off.
UNION will remove the duplicate rows from the result set while UNION ALL doesn't.
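A tiny sketch of the difference, on two assumed tables with a compatible city column:
-- Returns each distinct city once, even if it appears in both tables.
SELECT city FROM customers
UNION
SELECT city FROM suppliers;

-- Returns every row from both tables, duplicates included.
SELECT city FROM customers
UNION ALL
SELECT city FROM suppliers;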
Both will result in deleting all the rows in the table. TRUNCATE is a DDL command and (in databases such as Oracle) cannot be rolled back, and all the space for that table is released back to the server, which makes TRUNCATE much faster. DELETE is a DML command and can be rolled back.
Which system table contains information on constraints on all the tables created?
USER_CONSTRAINTS - this data dictionary view contains information on the constraints on all the tables created.
Explain normalization?
Normalisation means removing redundancy from the data and keeping it consistent and stable. The commonly used normal forms are first, second, third and fourth normal form.
How to find out the database name from SQL*PLUS command prompt?
Select * from global_name;
This will give the database name to which you are currently connected.
What is diffrence between Co-related sub query and nested sub query?
Correlated subquery runs once for each row selected by the outer query. It contains
a reference to a value
from the row selected by the outer query.
Nested subquery runs only once for the entire nesting (outer) query. It does not
contain any reference to
the outer query row.
For example,
Correlated Subquery:
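A sketch, assuming the same emp table used in the nested example below (find employees who earn the maximum salary of their department):
select empname, basicsal, deptno
from emp e
where basicsal = (select max(basicsal)
                  from emp
                  where deptno = e.deptno) -- the subquery refers to the outer row's deptno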
Nested Subquery:
select empname, basicsal, deptno
from emp
where (deptno, basicsal) in (select deptno, max(basicsal)
                             from emp
                             group by deptno)
1. % and
2. _ ( underscore )
% matches zero or more characters, and underscore matches exactly one character.
What is database?
A database is a collection of data that is organized so that its contents can easily be accessed, managed and updated. See this URL: http://www.webopedia.com/TERM/d/database.html
What are the advantages and disadvantages of primary key and foreign key in SQL?
Primary key
Advantages
1) It is a unique key on which all the other candidate keys are functionally dependent.
Disadvantage
1) There can be more than one key on which all the other attributes are dependent.
Foreign Key
Advantage
1) It allows referencing another table using the primary key of the other table.
Which date function is used to find the difference between two dates?
datediff (see the sketch below, which returns 5)
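A sketch using the day datepart; the dates are assumptions chosen so that the output is 5:
SELECT DATEDIFF(day, '2010-01-01', '2010-01-06') AS diff_in_days; -- returns 5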
This article is a step-by-step instruction for those who want to install the Oracle 10g database on their computer. This document provides guidelines to install the Oracle 10g database in a Microsoft Windows environment. If you use an operating system other than Microsoft Windows, the process is not too different from that of Microsoft Windows, since Oracle uses the Oracle Universal Installer to install its software.
For more information about installing Oracle 10g under operating systems other than Microsoft Windows, please refer to this URL:
http://www.oracle.com/pls/db102/homepage
You can download the Oracle 10g database from www.oracle.com. You must register and create an account before you can download the software. The example in this document uses Oracle Database 10g Release 2 (10.2.0.1.0) for Microsoft Windows.
1. Uninstall all Oracle components using the Oracle Universal Installer (OUI).
2. Run regedit.exe and delete the HKEY_LOCAL_MACHINE\SOFTWARE\ORACLE key. This contains the registry entries for all Oracle products.
3. Delete any references to Oracle services left behind in the following part of the registry: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Ora*. It should be pretty obvious which ones relate to Oracle.
4. Reboot your machine.
5. Delete the C:\Oracle directory, or whatever directory is your Oracle_Base.
6. Delete the C:\Program Files\Oracle directory.
7. Empty the contents of your C:\temp directory.
8. Empty your recycle bin.
1. Insert the Oracle CD; the autorun window opens automatically. If you are installing from the network or a hard disk, click setup.exe in the installation folder.
2. The Oracle Universal Installer (OUI) will run and display the Select Installation Method window. The default Oracle Home location is c:\oracle\product\10.2.0\db_1.
The summary of space requirements and the new products to be installed is then shown. Click Install to start the installation.
6. The Install window appears showing installation progress.
7. At the end of the installation phase, the Configuration Assistants window appears. This window lists the configuration assistants that are started automatically.
If you are creating a database, then the Database Configuration Assistant starts automatically in a separate window.
At the end of database creation, you are prompted to unlock user accounts to make the accounts accessible. The SYS and SYSTEM accounts are already unlocked. Click OK to bypass password management.
Note: Oracle 10g still keeps the scott / tiger username and password (UID=scott, PWD=tiger) from the old versions of Oracle. In the old versions, the scott/tiger user ID was available by default, but it is not in Oracle 10g. If you want to use the scott/tiger account, you must unlock it by clicking "Password Management" at the last window.
The Password Management window will appear like the one shown below. Find the user name "SCOTT" and uncheck the "Lock Account?" column for that user name.
8. Your installation and database creation is now complete. The End of Installation window displays several important URLs, one of which is for Enterprise Manager.
9. You can navigate to this URL in your browser and log in as the SYS user with the associated password, connecting as SYSDBA. You use Enterprise Manager to perform common database administration tasks.
Note: you can access Oracle Enterprise Manager in a browser by typing the URL shown above. Instead of typing the IP address, you can also access Enterprise Manager by typing http://localhost:1158/em or http://[yourComputerName]:1158/em, or by clicking "Start >> All Programs >> Oracle - [YourOracleHome_home1] >> Database Control - [yourOracleID]" in the Windows menu.
By default, use the user ID SYSTEM, with the password that you chose at the beginning of the installation, to connect to the database, SQL*Plus, etc. If you want to use another user ID, you may create a new user.
Data Modeling
A data model is a logical map that represents the inherent properties of the data independent of software, hardware or machine performance considerations. The model shows data elements grouped into records, as well as the associations around those records.
Since the data model is the basis for data implementation regardless of software or hardware platform, the data model should describe the data in an abstract manner which does not mention detailed information specific to any hardware or software, such as bit manipulation or index addition.
There are two generally accepted meanings of the term data model. The first is that the data model could be some sort of theory about the formal description of the data's structure and use, without any mention of heavy technical terms related to information technology. The second is that a data model instance is the application of the data model theory in order to create a model that meets the requirements of some application, such as those used in a business enterprise.
The structural part of a data model theory refers to the collection of data structures which make up the data when it is being created. These data structures represent the entities and objects in the database model. For instance, the data model may be that of a business enterprise involved in the sale of toys.
The real-life things of interest would include customers, company staff and of course the toy items. Since the database which will keep the records of these things of interest cannot understand the real meaning of customers, company staff and toy items, a data representation of these real-life things has to be created.
The integrity part of a data model refers to the collection of rules which govern the constraints on the data structures so that structural integrity can be achieved. In the integrity aspect of a data model, an extensive set of rules is formally defined and applied consistently so that the data can be used for its intended purpose.
Techniques are defined on how to maintain data in the data resource and to ensure that the data consistently contains values which are loyal to the source while at the same time accurate at the destination. This ensures that data will always have data value integrity, data structure integrity, data retention integrity and data derivation integrity.
The manipulation part of a data model refers to the collection of operators which can be applied to the data structures. These operations include querying and updating data within the database. This is important because not all data can be freely altered or deleted. The data manipulation part works hand in hand with the integrity part so that the data model results in high quality data for the data consumers.
As an example, let us take the relational model. The structural part of this model refers to a modified concept of the mathematical relation: the data is represented as an n-ary relation, which is a subset of the Cartesian product of n domains. The integrity part refers to expressions in first-order logic, and the manipulation part refers to the relational algebra as well as the tuple and domain calculi.
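To make the three parts concrete, here is a minimal sketch in Python, with made-up customer, product and quantity domains (none of these names come from the text): a relation as a subset of a Cartesian product (structure), a rule its tuples must satisfy (integrity), and a selection operator (manipulation).

from itertools import product

# Domains (hypothetical example data)
customers = {"Ramesh", "Sita"}
products = {"Rice", "Wheat"}
quantities = {10, 20}

# Structural part: a relation is a subset of the Cartesian product of its domains.
sales = {("Ramesh", "Rice", 20), ("Sita", "Wheat", 10)}
assert sales <= set(product(customers, products, quantities))

# Integrity part: a constraint every tuple must satisfy (here, a positive quantity).
def satisfies_integrity(relation):
    return all(qty > 0 for (_, _, qty) in relation)

# Manipulation part: a relational-algebra style selection operator.
def select(relation, predicate):
    return {row for row in relation if predicate(row)}

print(satisfies_integrity(sales))                 # True
print(select(sales, lambda r: r[0] == "Ramesh"))  # {('Ramesh', 'Rice', 20)}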
Data Modeling is a method used to define and analyze data requirements needed to
support
the business functions of an enterprise. These data requirements are recorded as a
conceptual data model with associated data definitions. Data modeling defines the
relationships between data elements and structures.
Data modeling can be used for a wide array of purposes. It is an act of exploring data-oriented structures without considering any specific application that the data will be used in. It is like a conceptual definition of an entity and its real-life counterpart, which is anything that is of interest to the organization implementing a database.
Data models are the products of data modeling. In general, there are three data model styles, namely the conceptual data model, the logical data model and the physical data model.
The conceptual data model is often called the domain model. It describes the
semantics of a
business organization as this model consists of entity classes which represent
things of
significance to an organization and the relationships of these entities.
Relationships are
defined as assertions about associations between various pairs of entity classes.
The
conceptual data model is commonly used to explore domain concepts with project
stakeholders. Conceptual models may be created to explore high-level static business structures and concepts, but they can also be used as precursors or alternatives to logical data models.
The logical data model is used in exploring domain concepts and other key areas
such as
relationships and domain problems. The logical data models could be defined for the
scope
of a single project or for the whole enterprise. The logical data model describes
semantics
related to particular data manipulation methods and such descriptions include those
of
tables, columns, object oriented classes, XML tags and many other things. The
logical data
model depicts some logical entity types, the data attributes to describe those
entities and
relations among the entities.
The physical data model is used in the design of the internal database schema. This design defines data tables, data columns for the tables, and the relationships among the tables. Among other things, the physical data model is concerned with descriptions of the physical means by which data is stored. This storage aspect embraces concerns such as hard disk partitioning, CPU usage optimization, the creation of table spaces and others.
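As an illustration of the jump from a logical entity type to a physical schema, here is a minimal sketch using Python's built-in sqlite3 module; the Customer entity and its columns are assumed examples, not taken from the text.

import sqlite3
from dataclasses import dataclass

# Logical level: an entity type and its attributes, with no storage details.
@dataclass
class Customer:
    customer_id: int
    name: str
    city: str

# Physical level: a concrete table, columns and key in a specific DBMS.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        city        TEXT
    )
    """
)

c = Customer(1, "Ramesh", "Pune")
conn.execute(
    "INSERT INTO customer (customer_id, name, city) VALUES (?, ?, ?)",
    (c.customer_id, c.name, c.city),
)
print(conn.execute("SELECT name, city FROM customer").fetchall())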
Data modeling also focuses on the structure of the data within a domain. This structure is described in such a manner that the specification forms a dedicated grammar for an artificial language used for that domain. But as always, the description of the data structure never mentions a specific implementation of any database management system or specific vendor.
Sometimes, having different data modelers could lead to confusion as they could
potentially
produce different data models within the same domain. The difference could stem
from
different levels of abstraction in the data models. This can be overcome by coming
up with
generic data modeling methods.
For instance, generic data modeling can take advantage of generic patterns in a business organization. An example is the concept of a Party, which includes Persons and Organizations. A generic data model for this entity may be easier to implement without creating conflicts along the way.
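A minimal sketch of that Party pattern, assuming hypothetical Person, Organization and CustomerRole classes (the names and fields are illustrative only):

from dataclasses import dataclass

@dataclass
class Party:
    party_id: int
    name: str

@dataclass
class Person(Party):
    date_of_birth: str

@dataclass
class Organization(Party):
    registration_number: str

# A role such as "customer" is attached to Party rather than to Person or
# Organization, so the same structure serves both kinds of customer.
@dataclass
class CustomerRole:
    party: Party
    customer_since: str

print(CustomerRole(Person(1, "Ramesh", "1980-04-05"), "2015-01-01"))
print(CustomerRole(Organization(2, "Acme Traders", "REG-1234"), "2018-06-15"))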
This data model represents events, entities and objects in the real world that are
of interest
to the company. It is subject oriented and includes all aspects of the real world,
primarily
activities pertaining to the business.
To use lay terms, a data model can be considered a road map to get one employee
from
point A to point B in the least mileage, most scenery and shortest time of travel.
In the science of computing, data models are structured and organized data
structures that
are implemented in a database management system. Aside from defining and organizing
business data, data modeling also includes implicitly and explicitly imposing
constraints and
limitations on the data within the data structure.
A data model may be an instance of a conceptual schema, a logical schema or a physical schema.
The physical schema, as the name implies, is the description of the physical means
for
storing data. This can include definitions for storage requirements in hard terms
like
computers, central processing units, network cables, routers and others.
Data Architects and Business Analysts usually work hand in hand to make an
efficient data
model for an organization. To come up with a good Common Data Model output, they
need
to be guided by the following:
1. They have to be sure about database concepts like cardinality, normalization and optionality;
2. They have to have in-depth knowledge of the actual rules of the business and its requirements;
3. They should be more interested in the final resulting database than in the data model itself.
A data model describes the structure of the database within a business and in
effect, the
underlying structure of the business as well. It can be thought of as a grammar for an artificial language used in business or any other undertaking.
In the real world, kinds of things are represented as entities in the data model. These entities can hold any information or attributes as well as relationships. Irrespective of how the data is represented in the computer system, the data model describes the company's data.
It is always advised to have a good conceptual data model to describe the semantics
of a
given subject area. A conceptual data model is a collection of assertions
pertaining to the
nature of information used by the company. Entities should be named with natural
language
instead of a technical term. Relationships which are properly named also form
concrete
assertions about the subject.
Common Data Modeling is defining the unifying structure that allows heterogeneous business environments to interoperate. A Common Data Model is very critical to a business organization.
In Common Data Modeling, Business Architects and analysts need to face the data
first
before defining a common data or abstraction layer so that they will not be bound
to a
particular schema and thus make the Business Enterprise more flexible.
Business Schemas are the underlying definition of all business related activities.
Data
Models are actually instances of Business Schemas: conceptual, logical and
physical
schemas. These schemas have several aspects of definition and they usually form a
concrete basis for the design of Business Data Architecture.
Data Modeling is actually a vast field but having a Common Data Model for a certain
domain
can answer problems with many different models operating in a homogeneous
environment.
To make Common Data Models, modelers need to focus on one standard of Data
Abstraction. They need to agree on certain elements to be concretely rendered so
uniformity
and consistency are obtained.
Generic patterns can be used to attain a Common Data Model. Some of these patterns
include using entities such as "party" to refer to persons and organizations, or
"product
types", "activity type", "geographic area" among others. Robust Common Data Models
explicitly include versions of these entities.
A good approach to Common Data Modeling is to have a generic Data Model which consists of generic entity types like class, relationship, individual thing and others. Each instance of these classes can have subtypes.
The Common Data Modeling process may obey some of these rules:
1. Attributes are to be treated as relationships with other entities.
2. Entities are defined by the very nature of a business activity, rule, policy or structure, not by the role that they play within a given context.
3. Entities must have a local identifier in an exchange file or database. This identifier must be unique and artificial, but relationships should not be used as part of the local identifier.
4. Relationships, activities and the effects of events should not be represented by attributes but by types of entity.
5. Types of relationships should be defined at a generic or high level. The highest level is defined as a relationship between one individual thing and another individual thing.
Data Modeling often uses the Entity-Relationship Model (ERM). This model is a
representation of structured data. This type of Data Modeling can be used to
describe any
ontology (the term used to describe the overview and classification of terms and
their
respective relationships) for a certain area of interest.
Common Data Modeling is one of the core considerations when setting up a business
data
warehouse. Any serious company wanting to have a data warehouse will have to be
first
serious about data models. Building a data model takes time and it is not unusual
for
companies to spend two to five years just doing it.
Data Models should reflect practical and real-world operations, and that is why a common data modeling method combining forward, reverse and vertical methods makes perfect sense: it seamlessly integrates disparate data coming in, whether top-down or bottom-up, from different sources and triggering events.
Business Architects, analysts and data modelers work together to look around and
look for
the best practices found in the industry. These best practices are then synthesized
into the
enterprise model to reflect the current state of the business and the future it
wants to get
into.
A good Enterprise Data Model should strike a balance between conceptual entities and functional entities based on practical, real and available industry-standard data. Conceptual entities are defined within the company and will take on the values of the data as defined by the company. Examples of conceptual entities are product status, marital status, customer types, etc.
On the other hand, functional entities refer to entities that are already well
defined, industry
standard data ready to be placed into database tables. Examples of functional
entities are
D&B Paydex Rating and FICO Score.
Businesses usually start simply and grow more complex as they progress. A business may start by selling goods or providing services to clients. The goods and services delivered, as well as the money received, are recorded and then reused. So over time transactions pile up one over another and the set-up can get more and more complex. Despite the complexity, the business is still essentially a simple entity that has just grown in complexity.
This happens when the business does not have a well-defined common data modeling method. Many software applications cannot provide ways to integrate real-world data with the data within the data architecture.
This scenario, where there is no common business model, can worsen when multiple disparate systems are used within the company and each of the systems has differing views on the underlying data structures.
Business Intelligence can perform more efficiently with Common Data Modeling
Method. As
its name implies, Business Intelligence processes billions of data from the data
warehouse
so that a variety of statistical analysis can be reported and a recommendation on
innovation
to give the company more competitive edge can be presented.
With a Common Data Modeling Method, processes can be made faster because the internal structure of the data is closer to reality than it would be without a data model. It should be noted that the common set-up of today's business involves having data sources from as many geographical locations as possible.
Entity-Relationship
The Entity-Relationship or E-R model is a model which deals with real-world entities; it includes a set of objects and the relationships among them. An entity is an object that exists and is easily distinguishable from others. Like people: they can easily be distinguished from one another through various methods, for instance by social security numbers.
Also, an entity can be concrete, like a book, person, or place, while it can also be abstract, like a holiday for example. An entity set is a set of entities that share something; for instance, the multiple holders of a bank account would be considered an entity set. Entity sets do not need to be disjoint.
Here is an example: the entity set employee (all employees of a bank) and the entity set customer (all customers of the bank) may have members in common, since an employee may also be a customer of the bank. Such a person belongs to both sets. Employees who are also customers will have possession of both employee numbers and account numbers.
Relationships and Relationship Sets
A relationship is an association among several entities, and a relationship set is a set of relationships of the same type. One should remember that the role of an entity is the function it has in a relationship.
Consider an example: the relationship "works-for" could be made up of ordered pairs of different employee entities or entity sets. The first employee entity takes the role of a manager or supervisor, whereas the other one takes the role of worker or associate.
Relationships can also have descriptive attributes. This can be seen in the example of a date (as in the last date of access to an account); this date is an attribute of the customer-account relationship set.
Attributes
A particular set of entities and the relationships between them can be defined in a number of ways. The differentiating factor is how you deal with the attributes. Consider a set of employees as an entity; this time let us say that the set's attributes are employee name and phone number.
In some instances the phone number should be considered an entity in its own right, with its own attributes being the location and the uniqueness of the number itself. Now we have two entity sets, and the relationship between them is carried through the attribute of the phone number. This defines the association not only between the employees but also between the employees and their phone numbers. This new definition allows us to more accurately reflect the real world.
Basically what constitutes an entity and what constitutes an attribute depends
largely on
the structure of the situation that is being modeled, as well as the semantics
associated
with the attributes in question.
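A small sketch of the two choices just described, with made-up Employee and PhoneNumber classes (the names and fields are illustrative, not from the text): the phone number as a plain attribute versus the phone number as an entity of its own linked by a relationship.

from dataclasses import dataclass

# Choice 1: the phone number is just an attribute of the employee.
@dataclass
class EmployeeV1:
    name: str
    phone_number: str

# Choice 2: the phone number is an entity with its own attributes, and the
# relationship set between the two entities carries the association.
@dataclass
class PhoneNumber:
    number: str
    location: str          # e.g. "office" or "home"

@dataclass
class EmployeeV2:
    name: str

@dataclass
class HasPhone:
    employee: EmployeeV2
    phone: PhoneNumber

print(EmployeeV1("Michelle", "555-0100"))
print(HasPhone(EmployeeV2("Michelle"), PhoneNumber("555-0100", "office")))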
An entity-relationship model can be expressed graphically as an entity-relationship diagram. Yet this is only possible when we keep in mind which components are involved in creating this type of model.
Some of the different variations of the Entity-Relationship diagram you will see are:
* Diamonds are omitted - a link between entities indicates a relationship. Fewer symbols mean a clearer picture, but what happens to descriptive attributes? In this case, we have to create an intersection entity to hold the attributes instead.
* A range of numbers can be used to indicate the different options of a relationship, e.g. (0,1) indicates minimum zero (optional) and maximum 1. We can also use (0,n), (1,1) or (1,n). This is typically written on the near end of the link - it is very confusing at first, but this structure gives us more information.
This allows us to map composite attributes and record derived attributes. We can
then use
subclasses and super classes. This structure is generalization and specialization.
Summary
Entity-Relationship diagrams are a very important data modeling tool that can help
organize
the data in a project into categories defining entities and the relationships
between entities.
This process has proved time and again to allow the analyst to create a nice
database
structure and helps to store the data correctly.
Entity
A data entity represents both real and abstract things about which data is being stored. Entity types fall into classes (roles, events, locations, and concepts). These could be employees, payments, campuses, books, and so on. Specific examples of an entity are called instances.
Relationship
A relationship between data is a natural association that exists between one or more entities, like "employees process payments". Cardinality is the number of occurrences of one entity for a single occurrence of the related entity; for example, an employee may process many payments but might not process any, depending on the nature of their job.
Attribute
An attribute is a property or characteristic of an entity about which data is stored, such as an employee's name or a payment's amount.
Concept Oriented Model
The Concept Oriented model is built around a top concept and a bottom concept structure. This particular structure constitutes a lattice, which is also described as an order satisfying certain properties. Each item is then defined as a combination of some other super-items that are taken from the related super-concepts.
In simple words, the Concept Oriented model is based on lattice theory, an ordering of sets. Each component is defined as a combination of its super-concepts. The top concept structure provides the most abstract view, with no items, whereas the bottom concept structure is more specific and provides a much more detailed representation of the model.
Each path through this structure normally has a name that corresponds to its concept. The number of these paths from the top of the model to the bottom of the model is its dimensionality. Each dimension corresponds to one variable or one attribute and is thus supposed to be one-valued.
All one-valued attributes are directed upward within the structure. Yet if we were to reverse the direction of the dimensions, we would have what are often called sub-dimensions or reverse dimensions. These dimensions also correspond to attributes or properties, but they take many values from the sub-concepts rather than from the super-concepts that form the normal dimensions.
After we have explored the dimensionality of the concept model structure we move
forward
to address the relations between the concepts.
When speaking of relations, each concept is related to its super-concepts, yet the super-concepts can also be regarded as relations with respect to this concept; being a relation is a relative role. More specifically, each item is a single instance of a relation for the corresponding super-items, and it is an object linked to other objects by means of the relations of its sub-items, which act as relation instances. This brings us to grouping and/or aggregation.
Let's continue to think of these items in relations; this way we can imagine that each item has a number of "parents" from the super-concepts as well as a number of sub-items from the sub-concepts.
Items are interpreted as a group, set, or category for their sub-items. Yet an item is also a member of the sets, groups, and categories represented by its super-items. You can see the dual functions or roles of the items when you consider them in this light.
Continuing to think of our items in this light, we can now see that each problem domain represented in a concept model has differing levels of detail.
We can easily indicate multiple source concepts and enforce input constraints upon them. The constraints can then be propagated in a downward direction to the bottom level, which is the most specific level of all.
Once we have completed this step, the result is transported back up the chain of levels toward a target concept. Then we can begin an operation of moving one of the source concepts downward, basically choosing one of the sub-concepts with more detail. This is called Drill Down. We also have an operation known as Roll Up; this is the process of moving up by selecting some super-concept with less detail.
With all this talk of constraints, you are probably wondering what they are as well. Simply put, for each concept we can indicate constraints that the corresponding items need to satisfy. That requires us to describe the actual properties by indicating a path resembling a zigzag pattern through the concept structure.
The zigzag path goes down when needed to get more detailed information; it also goes up to retrieve more general information. Using this we can easily express the constraints in terms of the other items and where they can be found.
Objects and references both have corresponding structural and behavioral methods. A simple consequence of having concepts is that objects are presented and accessed in an indirect manner; this is done by the concept using custom references with subjective, domain-specific structures as well as functions.
Now you might also want to know what a sub-concept or a super-concept is. Remember, there is always an upward-directed arrow in the concept graph from a sub-concept to any of its corresponding super-concepts. The sub-concept is therefore associated with the start of that arrow and the super-concept is associated with the end of that arrow.
A sub-concept also has two parts; for example, Order Parts = <Products, Orders> or Order Operations = <Orders, Operations>.
So now we have dissected the main components of the Concept Oriented Model, which is clearly important for associating data with the concepts related to that data. We can now understand the uses and functions of this model with a bit more clarity.
The Concept Oriented model is complex and definitely worthy of further research. It is suggested that anyone whose curiosity has been sparked by this article look further into the model, and perhaps explore its additional functions and uses, since it can be applied to many situations.
Object-Relational Model
Some of the benefits that are offered by the Object-Relational Model include:
. Extensibility - Users are able to extend the capability of the database server; this can be done by defining new data types, as well as user-defined patterns. This allows the user to store and manage data.
. Complex types - It allows users to define new data types that combine one or more of the currently existing data types. Complex types aid in better flexibility in organizing the data in a structure made up of columns and tables.
. Inheritance - Users are able to define objects or types and tables that inherit the properties of other objects, as well as add new properties that are specific to the object that has been defined.
Object-relational database management systems, also known as ORDBMSs, add new and extensive object storage capabilities to the relational models at the center of the more modern information systems of today.
These services assimilate the management of conventional fielded data, more complex objects such as time-series or detailed geospatial data, and varied binary media such as audio, video, images, and applets.
This is possible because the model encapsulates methods with data structures, so the ORDBMS server can execute complex analytical and data management operations to explore and change multimedia and other more complex objects.
What are some of the functions and advantages of the Object-Relational Model?
It can be said that the object-relational model is an evolutionary technology; this approach has taken on the robust transaction and performance management aspects of its predecessors and the flexibility of the object-oriented model (we will address this in a later article).
Database developers can now work with somewhat familiar tabular structures and data
definition but with more power and capabilities. This also allows them to perform
such task
all the while assimilating new object management possibilities. Also the query and
procedural languages and the call interfaces in the object relational database
management
systems are familiar.
The main function of the object-relational model is to combine the convenience of the relational model with the flexibility of the object model. Object-relational models allow users to define data types, functions, and operators; as a direct result, the functionality and performance of this model are optimized. The massive scalability of the object-relational model is its most notable advantage, and it can be seen at work in many of today's vendor products.
The History of the Object-Relational Model
The Object-Oriented model moved into the spotlight in the 1990s; the idea of being able to store object-oriented data was a hit, but what happened to the relational data? Later in the 1990s the Object-Relational model was developed, combining the advantages of its most successful predecessors, such as user-defined data types, user-defined functions, and inheritance and sub-classes.
This model grew from the research conducted in the 1990s. The researchers' main goal was to extend the capabilities of the relational model by including object-oriented concepts. It was a success.
The easiest way to begin mapping between a persistent class and a table is one-to-one. In such a case, all of the attributes in the persistent class are represented by all of the columns of the table. Each instance of the business class is then in turn stored in a row of that table.
Because of this, two types of class-to-table mapping methods have been adopted by most users. This helps overcome the issues caused by differences between the relational and object models. The two methods are known as SUBSET mapping and SUPERSET mapping.
SUBSET mapping is used to create projection classes for tables with a sizable number of columns. A projection class contains just enough information to enable the user to choose a row for complete retrieval from the database.
This essentially reduces the amount of information passed throughout the network. This type of mapping can also be used to help map a class inheritance tree to a table using filters.
Now let's consider SUPERSET mapping. With a persistent class, the superset mapping method holds attributes taken from the columns of more than one table. This particular method of mapping is also known as table spanning.
Mapping using the SUPERSET method is meant to create view classes that cover the underlying data model, or to map a class inheritance tree to a database by using a vertical mapping tactic.
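Here is a minimal sketch of the two mappings under stated assumptions (the customer and account tables, column names and values are invented for illustration and are not a vendor API), using Python's built-in sqlite3 module: a SUBSET "projection" class carrying only enough columns to fetch the full row later, and a SUPERSET "view" class spanning two tables.

import sqlite3
from dataclasses import dataclass

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT,
                           phone TEXT, notes TEXT);
    CREATE TABLE account  (account_id INTEGER PRIMARY KEY, customer_id INTEGER,
                           balance REAL);
    INSERT INTO customer VALUES (1, 'Michelle', 'Pune', '555-0100', 'preferred');
    INSERT INTO account  VALUES (10, 1, 39000.0);
    """
)

# SUBSET mapping: a projection class with only the key and a display column.
@dataclass
class CustomerProjection:
    customer_id: int
    name: str

# SUPERSET mapping (table spanning): one class drawing on customer and account.
@dataclass
class CustomerAccountView:
    name: str
    city: str
    balance: float

projections = [CustomerProjection(*row) for row in
               conn.execute("SELECT customer_id, name FROM customer")]
views = [CustomerAccountView(*row) for row in
         conn.execute("SELECT c.name, c.city, a.balance FROM customer c "
                      "JOIN account a ON a.customer_id = c.customer_id")]
print(projections)
print(views)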
There are many other aspects and advantages to this model. The Object-Relational model does what no single model before it could do. By combining the strongest points of those that came before it, this model has surpassed expectations and taken on a definitive role in database technology. Whatever models follow it, this model is here to stay.
The Object model, also referred to as the object-oriented model, was designed to provide full-featured database programming capability while retaining native language compatibility. It is typically applied where there are very complex relationships between data, such as patient record systems.
When you search the web for some concrete information on the object model, don't be surprised to end up with mixed results, none of which plainly state "Object Model". You will instead turn up results for Document Object Models and Component Object Models. This is because the Object Model has been modified slightly to apply to different instances. We will touch on that before moving on.
So what exactly is a Document Object Model? You might often see it referred to as a DOM. This model is a platform- and language-neutral interface that allows programs or scripts to dynamically access and update the content, structure, and style of documents. The document can then be processed further and the results can be incorporated back into the contents of the page.
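As a small sketch of that idea, the following uses Python's built-in xml.dom.minidom to parse an assumed XML snippet into a node tree, read a value through the tree, and update it; the snippet and element names are illustrative only.

from xml.dom.minidom import parseString

doc = parseString("<book><title>Data Modeling</title><year>2004</year></book>")

# Access content through the node tree.
title = doc.getElementsByTagName("title")[0]
print(title.firstChild.data)        # Data Modeling

# Update the document through the same interface.
title.firstChild.data = "Data Modeling Basics"
print(doc.documentElement.toxml())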
The Component Object Model, also referred to as COM, is basically a component software architecture that enables users to build applications and systems from components supplied by different software vendors. The Component Object Model is the underlying design that forms the foundation for some higher-level software services, including those provided by OLE. Any PC user may be surprised to learn that COM is also the technology behind ActiveX, something we are all familiar with, especially those of us who spend a lot of time surfing the internet.
Many of the traditional operating systems were designed to deal only with application binaries and not with actual components. Because of this, the benefits of good component-oriented designs had, before COM, never gone beyond the compilation step. In a world that is object-centric it is confusing that our operating systems still cannot recognize objects. Instead, our operating systems have been dealing only with application binaries, or EXEs. This prevented objects in one process from communicating with objects in a different process using their own defined methods.
History
The object model really hit the programming scene in the mid-1990s. Around October of 1998 the first specification of the Document Object Model was released by the W3C; it was known as DOM 1. DOM 2 followed in 2000, surpassing its older version by including specifics on the style sheet object model and style information manipulation. Most recently, DOM 3 arrived in 2004. Thus far there have been no newer releases; as of now we are still using the DOM 3 model, and it has served us well.
The history of the Component Object Model is a bit lengthier; we will summarize its more dramatic points. DDE was one of the very first methods of inter-process communication. It allowed sending and receiving communications or messages between applications; this is also sometimes referred to as a conversation between applications. At this point it is important to note that Microsoft is the leading Component Object Model vendor, and the history of COM is based richly on the work done around Windows.
The budding technology of COM was the base of OLE, which means Object Linking and Embedding. This was one of the most successful technologies introduced with Windows. The technology was soon being added into applications like Word and Excel by 1991, and on into 1992. It was not until 1996 that Microsoft truly realized the potential of the discovery: OLE custom controls could expand a web browser's capability enough to present content.
From that point the vendor has been integrating aspects of COM into many of its applications, such as Microsoft Office. There is no way to tell how far or how long the evolution of object modeling will travel; we need only sit back and watch as it transforms our software and applications into tools that help us mold and shape our future in technology.
The first example we will cover is the Document Object Model. This example is a remake of a more detailed example; I have reduced the information provided in order to focus on the important features of the model. This example can be seen below:
[Figure: Document Object Model example]
[Figure: Component Object Model example]
By looking at the example provided above, we can clearly see the process in which the Document Object Model is used. The sample model is designed to show us the way in which the document is linked to each element and the corresponding text that is linked to those elements.
Now we will take a quick look at a simple Component Object Model example. This particular example has been based on one of the models provided by Windows itself.
In the example above you see two different types of arrows. The solid arrows indicate USES, whereas the dashed arrows represent OPTIONALLY USES. The boxes with green-outlined text represent the aspects provided with WDTF. The blue highlighted text is your Implement or Modify example. The red expresses the implementation of your own action interface, and the text highlighted in black indicates your operating system or driver API. By viewing this sample of the Component Object Model, we can see how the components are linked and the way in which they communicate with one another.
So after exploring the Object model we can safely come to the conclusion that it serves an important purpose that no model before it was able to grasp. Though the model has been modified to fit specific instances, its main use is to model object data.
Microsoft is one of the more notable vendors who have put the Object Model in the limelight; it will be interesting to see what heights this model reaches with their assistance. I will be keeping a close eye out for the next evolutionary change in the Object Model.
The Associative data model is a model for databases unlike any of those we spoke of in prior articles. Unlike the relational model, which is record-based and deals with entities and attributes, this model works with entities that have a discrete, independent existence, and their relationships are modeled as associations.
The Associative model is based on a subject-verb-object syntax, with bold parallels to sentences built in English and other languages. Some examples of phrases that follow this syntax are:
. Cyan is a Color
. Marc is a Musician
. Musicians play instruments
. Swings are in a park
. A Park is in a City
(the verbs in each phrase carry the association)
By studying the examples above it is easy to see that the verb is actually a way of association. The association's sole purpose is to identify the relationship between the subject and the object.
The Associative database has two structures: a set of items and a set of links that are used to connect them together. In the item structure the entries must contain a unique identifier, a type, and a name. Entries in the link structure must also have a unique identifier, along with identifiers for the related source, subject, object, and verb.
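A minimal sketch of those two structures in Python, with made-up items and links (the identifiers and names are illustrative only): each link ties a subject, verb and object together, and a tiny query walks the links.

from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    item_id: int
    kind: str        # e.g. "entity" or "verb"
    name: str

@dataclass(frozen=True)
class Link:
    link_id: int
    subject: int     # item_id of the subject
    verb: int        # item_id of the verb
    obj: int         # item_id of the object

items = {
    1: Item(1, "entity", "Marc"),
    2: Item(2, "verb", "is a"),
    3: Item(3, "entity", "Musician"),
    4: Item(4, "verb", "plays"),
    5: Item(5, "entity", "Guitar"),
}
links = [Link(100, 1, 2, 3), Link(101, 1, 4, 5)]

def facts_about(name):
    """Return subject-verb-object phrases whose subject has the given name."""
    return [
        f"{items[l.subject].name} {items[l.verb].name} {items[l.obj].name}"
        for l in links
        if items[l.subject].name == name
    ]

print(facts_about("Marc"))   # ['Marc is a Musician', 'Marc plays Guitar']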
The Associative model structure is efficient with storage, for there is no need to set aside space for data that is not yet available. This differs from the relational model structure: with the relational model, a minimum of a single null byte is stored for missing data in any given row, and some relational databases even set aside the maximum room for a specified column in each row.
The Associative database makes the storage of custom data for each user, or other needs, clear-cut and economical when considering maintenance or network resources. When varied data needs to be stored, the Associative model is able to manage the task more effectively than the relational model.
With the Associative model there are entities and associations. An entity is discrete and has an independent existence, whereas an association depends on other things. Let's try to simplify this a little before moving on.
Let's say the entity is an organization; the associations would be with its customers and employees. It is possible for the entity to have many business roles at the same time, and each role would be recorded as an association. When circumstances change, one or more of the associations may no longer apply, but the entity will continue to endure.
The Associative model is designed to store metadata in the same structures where the data itself is stored. This metadata describes the structure of the database and how the different kinds of data can interconnect. More than simple data structures is needed to deliver a database capable of storing the variety of data that a modernized business requires, along with the protection and management that are important for internet implementation.
The Associative model is built from chapters, and the content of the database that users can view is controlled by their profile. The profile is a list of chapters. When links exist between items in chapters inside as well as outside of a specific profile, those links will not be visible to the user.
A combination of chapters and profiles can tailor the database to specific users or even subject groups. The data that is related to one user group would remain unseen by another and would be replaced by a different data set.
With the Associative model there is no record. When assembling all of the current information on a complex order, the data storage needs to be revisited multiple times. This could pose a disadvantage. Some calculations suggest that an Associative database would need as many as four times the data reads of a relational database.
All changes and deletions in the Associative model are made by adding links to the database. However, we must note that a deleted association is not actually deleted itself; rather, it is linked to an assertion that it has been deleted. Also, when an entity is renamed it is not actually renamed but rather linked to its new name.
In order to reduce the complexity that results from the parameterization required by heftier software packages, we can rely on chapters, profiles and database engines that expect the stored data to differ between individual entities or associations. The use of "flags" to switch program functions on or off in a database has also begun to be practiced.
Packages that are based on an Associative model would use the structure of the database along with the metadata to control this process. This can ultimately lead to the generalization of what are often lengthy and costly implementation processes.
Such a generalization would produce considerable cost reductions for users purchasing or implementing bigger software packages, and it could reduce the risks related to post-implementation changes as well.
How well does the Associative Model suit the demands of data?
Some ask if there is still an ongoing demand for a better database. Honestly, there will always be that demand. The weaker points of the current relational model are now apparent, because the character of the data we need to store keeps changing. Binary structures that support multimedia have posed real challenges for relational databases, in the same way that object-oriented programming methods did.
When we look back on object databases we can see that they have not conquered the market, and neither have their cousins, the hybrid relational products with their object extensions.
So will the Associative model solve some of the issues surrounding the relational model? The answer is not entirely clear; though it may resolve some issues, it is not completely clear how efficiently the model will manage when set against bigger binary blocks of data.
Areas of the Associative database design do seem simpler than the relational model's; still, as we have pointed out, there are also areas that call for careful attention. There are issues related to the creation of chapters that remain daunting at best.
Even so, if the concept of the Associative model proves to be genuinely feasible and is able to bring about a new and efficient database, then others could bring to life products that are built upon the base ideas that exist within this model.
There is definitely an undeniable demand for a faster operating database model that will scale up to bigger servers and down to smaller devices. It will be an interesting journey to witness; I personally would like to see if future databases built using this model can make their mark in the market.
The term Hierarchical Model covers a broad concept spectrum. It often refers to a lot of set-ups, like multi-level models where there are various levels of information or data, all related by some larger form.
You can see from the figure that the supplementing information or details branch out from the main or core topic, creating a "tree"-like form. This allows for a visual relationship of each aspect and enables the user to track how the data is related.
There are many other ways to create this type of model, this is one of the simplest
and is
used the most often.
An example of information you would use the Hierarchical model to record would be
the
levels within an organization, the information would flow such as:
[Figure: organization hierarchy with departments, subdivisions and sections]
So the Hierarchical model for this scenario would look much like the one below. As you can see, this model is substantially larger; the benefit of the Hierarchical model is that it allows for continuous growth, though it can take up a lot of room.
With each addition of data a new branch on the "tree" is formed, adding to the information as a whole as well as to the size.
Hierarchical models allow for a visual parent/child relationship between data sets, organizational information, or even mathematics.
The idea for these models is to begin with the smallest details; in the example above that would be the sections.
From the smallest details you would move up (it is often easiest to think of the model as a hierarchy) to the subdivisions, above the subdivisions you find the departments, and finally you end at one "parent", the organization.
Once finished you can sit back and view the entire "family" of data and clearly distinguish how it is related.
The first mainframe database management systems were essentially the birth place of
the
Hierarchical model.
The hierarchical relationships between varying data made it easier to seek and find
specific
information.
Though the model is ideal for viewing relationships within data, many applications no longer use it. Still, some are finding that the Hierarchical model is ideal for data analysis.
Perhaps the most well known use of the Hierarchical model is the Family Tree, but
people
began realizing that the model could not only display the relationships between
people but
also those between mathematics, organizations, departments and their employees and
employee skills, the possibilities are endless.
Simply put, this type of model displays hierarchies in data starting from one "parent" and branching into other data according to its relation to the previous data.
Though the Hierarchical model is rarely used now, some of its few remaining uses include file systems and XML documents.
The tree-like structure is ideal for relating repeated data, and though it is not currently applied often, the model can be applied to many situations.
The Hierarchical model can present some issues while focusing on data analysis.
There is
the issue of independence of observations, when data is related it tends to share
some type
of background information linking it together, therefore the data is not entirely
independent.
Though the tree-like structure is perhaps the simplest and also the most desirable form for new users, there are other structures for this model.
A hierarchy can also be structured as an outline or indented list, as found in the indented structure of XML documents.
The example below presents information similar to the examples we created above, but this time the Hierarchical Model is expressed through indentation rather than a tree form.
. ORGANISATION
   o Department 1
      . Subdivision 1
         . Section 1
         . Section 2
         . Section 3
      . Subdivision 2
         . Section 1
         . Section 2
         . Section 3
      . Subdivision 3
         . Section 1
         . Section 2
         . Section 3
   o Department 2
      . Subdivision 1
         . Section 1
         . Section 2
         . Section 3
      . Subdivision 2
         . Section 1
         . Section 2
         . Section 3
One thing you must keep in mind at all times is that no matter what type of
structure you
use for the model you need to be able to add categories at any time, as well as
delete them.
[Figure: hierarchical model example]
An idea to ensure that this is possible is to use a list view or tree view with
expandable and
collapsible categories.
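A short sketch of such a hierarchy as a tree of nodes in Python (the Node class and the organization names are assumptions for illustration): each node has one parent and possibly many children, and categories can be added or removed at any time.

class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def remove(self):
        """Detach this category (and its whole subtree) from the hierarchy."""
        if self.parent is not None:
            self.parent.children.remove(self)
            self.parent = None

    def outline(self, depth=0):
        """Print the hierarchy as an indented list, like the example above."""
        print("  " * depth + self.name)
        for child in self.children:
            child.outline(depth + 1)

org = Node("ORGANISATION")
dept1 = Node("Department 1", org)
sub1 = Node("Subdivision 1", dept1)
Node("Section 1", sub1)
Node("Section 2", sub1)

org.outline()     # prints the tree as an indented outline
sub1.remove()     # categories can be deleted at any time
org.outline()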
You can also present the model in a visual form, something involving a cylinder, a pyramid or even a cube; this visual presentation of the data is most suitable for presenting data to a group of professionals.
This form is better for smaller, less detailed levels. There is an example using some of the same information from above, shown in a more compact form below.
There are various structures for the Hierarchical Model; in fact there are many more than those shown here.
The type you use depends on the data you are using. The methods differ according to whether your data is people-related, mathematics-related, or just simple statistics.
1. This model expresses the relationships between pieces of information: how they are related and what they are most closely related to.
2. The model has many different structures and forms. Each is best used depending on the type of data being recorded, the amount of data being recorded, and who it is being recorded for.
3. Speaking in parent/child terms, data can have many children but only one parent.
4. The model begins with core data and branches off into supplementing data or smaller related data.
5. One must remember to start with the smallest detail and work their way up.
If you keep to these simple and compact guidelines, your own Hierarchical Model will be successful, clean, clear, and well built. The point is to present information in a simple and easy-to-read manner.
The Multi-Dimensional Model
Because OLAP is online, it provides information quickly; iterative queries are often posed during interactive sessions. Due to the analytical nature of OLAP, the queries are often complex. The multi-dimensional model is used to answer this kind of complex query, and the model is important because it applies simplicity.
This helps users understand the databases and enables software to navigate the databases effectively.
The analysts know what measures they want to see, what dimensions and attributes make the data important, and in what ways the dimensions of their business are organized into levels and hierarchies.
Let us touch on what logical cubes and logical measures are before we move on to more complicated details.
Logical cubes are designed to organize measures that have exactly the same dimensions. Measures that are in the same cube have the same relationship to other logical objects; they can easily be analyzed and shown together.
With logical measures, the cells of the logical cube are filled with facts collected about an organization's operations or functions. The measures are organized according to the dimensions, which include a time dimension.
Analytic databases contain summaries of historical data, taken from data in a legacy system and from other data sources such as syndicated sources. The normally acceptable amount of historical data for analytic applications is about three years' worth.
The measures are static; they are also trusted to be consistent while they are being used to help make informed decisions.
The measures are updated often; most applications update data by adding to the dimensions of a measure. These updates give users a concrete historical record of a specific organizational activity for an interval. This is very productive.
The lowest level of a measure is called the grain. Often this level of data is never seen; even so, it has a direct effect on the type of analysis that can be done. This level also determines whether or not the analysts can obtain answers to questions such as: when are men most likely to place orders for custom purchases?
Logical cubes and measures were relatively simple and easy to digest. Now we will consider logical dimensions, which are a little more complex. Dimensions have a unique set of values that define and categorize the data.
These form the edges of the logical cubes, and through this the measures inside the cubes as well. The measures themselves are usually multi-dimensional; because of this, a value within a measure must be qualified by a member of each dimension in order to be meaningful.
A hierarchy is a way to organize the data at each level of aggregation. When looking at data, developers use dimension hierarchies to identify trends at a specific level, drill down to lower levels to identify what is causing such trends, and then roll up to higher levels to view how these trends affect the bigger sections of the organization.
Back to the levels: each level represents a position in the hierarchy, and the levels above the most detailed level contain aggregated values for the levels beneath them.
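A minimal sketch of roll-up along one dimension hierarchy, assuming a made-up sales measure at the grain of (city, month) and an invented city-to-region mapping (none of this data is from the text):

from collections import defaultdict

# Grain-level cells: (city, month) -> sales quantity
sales = {
    ("Pune", "2024-01"): 20,
    ("Mumbai", "2024-01"): 35,
    ("Delhi", "2024-01"): 15,
    ("Pune", "2024-02"): 25,
}

# One level of the geography hierarchy: city -> region
region_of = {"Pune": "West", "Mumbai": "West", "Delhi": "North"}

def roll_up(cells, mapping):
    """Aggregate an additive measure from a lower level to the level above it."""
    totals = defaultdict(int)
    for (city, month), qty in cells.items():
        totals[(mapping[city], month)] += qty
    return dict(totals)

print(roll_up(sales, region_of))
# {('West', '2024-01'): 55, ('North', '2024-01'): 15, ('West', '2024-02'): 25}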
Finally, to wrap up this section we will take a quick look at logical attributes. By now we should all know that an attribute provides extra information about the data.
Some of these attributes are used simply for display. You can have attributes such as flavors, colors, or sizes; the possibilities are endless. It is this kind of attribute that can be helpful in data selection and also in answering questions.
Examples of the type of questions that attributes can help answer are: what colors are most popular in abstract painting? Or, what flavor of ice cream do seven-year-olds prefer?
We also have time attributes, which give us information about the time dimension we spoke of earlier; this information can be helpful in some kinds of analysis, for example indicating the last day or the number of days in a time period.
That pretty much wraps it up for attributes at this point. We will revisit the topic a little later.
Variables
Now we will consider variables. A variable is basically a value table for data: an array with a specific type of data that is indexed by a particular list of dimensions. Please be sure to understand that the dimensions themselves are not stored in the variable.
Each combination of dimension members defines a data cell, whether a value for that cell is present or not. Therefore, if data is missing or absent, the fact of the absence can either be included in or excluded from the analysis.
Variables that have identical dimensions form a logical cube. With that in mind, you can see how, if you change a dimension, for example by adding time periods to the time dimension, then the variables change as well to include the new time periods; this happens even if the other variables have no data for them.
Variables that share dimensions can be manipulated in an array of ways, including aggregation, allocation, modeling, and calculation. This means, more specifically, numeric calculation, and it is an easy and fast method in the analytic workplace. We can also use variables to store measures.
In addition to using variables to store measures, they can be used to store attributes as well. There are major differences between the two: while attributes are multi-dimensional, only one dimension is the data dimension. Attributes give us information about each dimension member no matter what level it inhabits.
Throughout our journey of learning about the different types of data models, I think that the multi-dimensional model is perhaps one of the most useful. It takes key aspects from other models like the relational model, the hierarchical model, and the object model, and combines those aspects into one competent database that has a wide variety of possible uses.
Network Model
Oddly enough, the Network model was designed to do what the Hierarchical model could not. Though both show how data is related, the Network model allows data to have not only many children but also many parents, whereas the Hierarchical model allowed only one parent with many children. With the Network model, data relationships must be predefined.
It was in 1971 that the Conference on Data Systems Languages, or CODASYL, formally defined the Network model. This is essentially how CODASYL defined it:
The central data modeling construct in the network model is the set. A set consists of an owner record type, a set name, and a member record type.
A member record type is able to play the same role in more than one set; because of this, the multi-parent concept is supported. An owner record type can be a member or owner in another set as well.
The data model is an uncomplicated network, and link or junction record types may well exist, as well as additional sets between them.
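A minimal sketch of the CODASYL set construct in Python, with invented record types and set names (none are from the text): each set has an owner record and member records, and the same member record can participate in more than one set, which is what gives a record many parents.

from dataclasses import dataclass, field

@dataclass
class Record:
    record_type: str
    name: str

@dataclass
class CodasylSet:
    set_name: str
    owner: Record
    members: list = field(default_factory=list)

    def insert(self, record):
        self.members.append(record)

# Owner records of two different sets.
dept = Record("Department", "Sales")
project = Record("Project", "Website Revamp")

# The same member record participates in both sets, so it has two parents.
emp = Record("Employee", "Michelle")

works_in = CodasylSet("WORKS-IN", owner=dept)
assigned_to = CodasylSet("ASSIGNED-TO", owner=project)
works_in.insert(emp)
assigned_to.insert(emp)

for s in (works_in, assigned_to):
    print(s.set_name, s.owner.name, "->", [m.name for m in s.members])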
The most notable advantage of the Network model is that, in comparison with the Hierarchical model, it allows a more natural avenue for modeling relationships between information. Though the model has been widely used, it failed to dominate the world of data modeling.
For a while, the performance benefits of the lower-level navigational interfaces used with the Hierarchical and Network models were well suited to most large applications.
Yet as hardware advanced and became faster, the added productivity and flexibility of the newer models proved to be better equipped for data needs.
Soon the Hierarchical and Network models were all but forgotten in relation to corporate enterprise usage.
Open Systems Interconnection, or OSI, models were created to serve as tools that can be used to describe the various hardware and software components found in a network system.
Over the years we have learned that this is particularly useful for educational purposes, and for expressing the full details of what needs to occur for a network application to be successful.
This particular model consists of seven separate layers, with the hardware placed at the very bottom and the software located at the top.
The process can easily be compared to that of reading an email. Imagine Column #1 and Column #2 as computers when exploring the figure below:
[Figure: the seven layers of the OSI model]
The first layer, clearly labeled as the physical layer, is used to describe components like internal voltage levels; it is also used to define the timing for the transmission of individual bits.
The next layer is the Data Link layer, the second layer listed in the example above. It relates to the sending of small blocks of data, typically a frame of bytes, and it is also often responsible for error correction.
The Network layer follows the Data Link layer and defines how to transport the message through and within the network. If you stop a moment and think of this layer as one working with an internet connection, it is easy to imagine that it would be used to add the correct network address.
Next we have the Transport layer. This layer is designed to divide data into smaller sets or, if needed, to recombine them into a larger, more complete set. The Transport layer also deals with data integrity; this process often involves a checksum.
Following the Transport layer we find the Session layer. This layer is related to issues that go further, or are more complicated, than a single set of data.
More to the point, the layer is meant to handle resuming transmissions that have been prematurely interrupted or somehow corrupted by some kind of outside influence. This layer also often maintains long-term connections to other remote machines.
Following the Session layer is where we find the Presentation layer. This layer
acts as an
application interface so that syntax formats and codes are consistent with two
networked or
connected machines.
The Presentation layer Ialso designed to provide sub-routines as well, these are
often what
the user may call on to access their network functions, and perform some functions
like
encrypting data, or even compressing their data.
Finally we have the Application layer. This layer is where the actual user programs can be found. On a computer this could be something as simple as a web browser, or it could be a ladder logic program on a PLC.
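To make the layering idea more concrete, here is a minimal Python sketch of the encapsulation concept: the layer names come from the OSI model described above, but the bracketed header strings and the send/receive functions are purely illustrative, not a real protocol stack.

# Illustrative only: each OSI layer adds its own header on the way down
# and strips it on the way up. Real protocol stacks are far more complex.

LAYERS = [
    "Application", "Presentation", "Session",
    "Transport", "Network", "Data Link", "Physical",
]

def send(message: str) -> str:
    """Wrap the message with a header for each layer, top to bottom."""
    for layer in LAYERS:
        message = f"[{layer}]{message}"
    return message  # what conceptually travels over the wire

def receive(frame: str) -> str:
    """Strip the headers in reverse order, bottom to top."""
    for layer in reversed(LAYERS):
        prefix = f"[{layer}]"
        assert frame.startswith(prefix), f"missing {layer} header"
        frame = frame[len(prefix):]
    return frame

if __name__ == "__main__":
    wire_data = send("Hello, this is an email body")
    print(wire_data)           # headers nested from Application out to Physical
    print(receive(wire_data))  # the original message is recovered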
After reading this article it is not hard to see the big differences between the Hierarchical Model and the Network Model. The Network model is by far the more complicated and deals with larger amounts of information that can be related in various and complex ways.
This model is more useful because the data can have many-to-many relationships rather than being restricted to the single parent-to-child structure with which the Hierarchical Model works.
Though the Network model has been officially replaced by the more accommodating Relational Model, it is not hard to imagine how it could still be used today, and it may very well still be in use on systems around the globe; I think of the Network Model whenever I consider how we email one another.
After reviewing the information and investigating the facts of the Network model, I have come to the conclusion that it is a sound and relatively helpful model, if a bit complicated.
Its one major downfall is that the data must be predefined; this adds restrictions and is why a more suitable model was needed for more advanced data. Ultimately this one restriction led to the model's replacement within the world of data analysis.
The Relational Model is a clean and simple model that expresses a relation using a table rather than a graph or shapes. The information is put into a grid-like structure consisting of columns running up and down and rows running from left to right, where information can be categorized and sorted.
The columns contain information related to name, age, and so on. The rows contain
all the
data of a single instance of the table such as a person named Michelle.
In the Relational Model, every row must have a unique identification, or key, used to locate the data associated with it. Often, keys are used to join data from two or more relations based on matching identification.
Social Security Number | Name     | Date of Birth   | Annual Income | Dependents | Mother's SSN | Father's SSN
000-00-0003            | Michelle |                 | 39,000        |            | 000-00-0002  | 000-00-0001
000-00-0001            | Michael  |                 | 78,510        |            |              |
000-00-0002            | Grehetta | March 5th, 1952 |               | 0          |              |
The Relational Model can often also include the concept commonly known as foreign keys: primary keys in one relation that are kept in another relation to allow for the joining of data.
An example of foreign keys is storing your mother's and father's social security
number in
the row that represents you. Your parents' social security numbers are keys for the
rows
that represent them and are also foreign keys in the row that represents you. Now
we can
begin to understand how the Relational Model works.
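As a rough illustration of the example above, here is a minimal sketch using Python's built-in sqlite3 module. The column names, and the assumption that Michelle's mother and father are Grehetta and Michael, are taken from the sample table and the foreign-key discussion and are only illustrative.

import sqlite3

# One relation whose rows are identified by a primary key (ssn) and which
# stores two foreign keys (mother_ssn, father_ssn) back into the same relation.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE person (
        ssn           TEXT PRIMARY KEY,               -- unique key for each row
        name          TEXT NOT NULL,
        date_of_birth TEXT,
        annual_income INTEGER,
        mother_ssn    TEXT REFERENCES person(ssn),    -- foreign keys pointing at
        father_ssn    TEXT REFERENCES person(ssn)     -- other rows of the relation
    )
""")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("000-00-0001", "Michael",  None,         78510, None, None),
        ("000-00-0002", "Grehetta", "1952-03-05", None,  None, None),
        ("000-00-0003", "Michelle", None,         39000, "000-00-0002", "000-00-0001"),
    ],
)

# Join the relation to itself through the foreign keys to find Michelle's parents.
rows = conn.execute("""
    SELECT child.name, mother.name, father.name
    FROM person AS child
    JOIN person AS mother ON mother.ssn = child.mother_ssn
    JOIN person AS father ON father.ssn = child.father_ssn
""").fetchall()
print(rows)   # [('Michelle', 'Grehetta', 'Michael')]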
Like most other things, the Relational Model was born of someone's need. In 1969 Dr. Edgar F. Codd published the first description of the relational model. Though it was meant to be no more than a report for IBM, it swept across and through data analysis unlike anything before it.
Codd's paper was primarily concerned with what later came to be called the
structural part
of the relational model; that is, it discusses relations per se (and briefly
mentions keys), but
it does not get into the relational operations at all (what later came to be called
the
manipulative part of the model).
Codd's discovery was a breath of fresh air for those digging through data banks, trying to categorize and define data. When he invented this model he may not have foreseen what an incredible impact it would have on the world of data.
Some believe there is a great deal of room for improvement where the Relational Model is concerned. It may be a surprise to find that not everyone supported the relational model. There have been claims that the rectangular tables do not allow for large amounts of data to be recorded.
Consider the example of apples and oranges: both are fruits and therefore related in that way, but apples have different attributes than oranges. At times a user may only want to see one or the other; at other times they may want to view both. Handling this type of data with the relational model can be very tricky.
We are beginning to hear more and more about the need for a better model, a more
adequate structure, still no one has been able to design something that can truly
hold its
own with the Relational Model.
True the model could use a bit of tweaking and leaves a little to be desired, yet,
what would
the perfect model be? What could we use that would apply to as many instances as
the
Relational model, and still surpass its usefulness?
The Relational Model has survived through the years; though there are always those trying to construct a more efficient way, it has managed to come out the victor thus far. One reason may be its structure: it is big enough to be worthy of optimizing.
Another notable reason is that the relational operations work on sets of data objects, which makes it a reasonably adequate model for remote access. Finally, it is a clean and concise model that does not encourage design extravagance, or, phrased another way, "design cuteness."
Some prefer the clean and simple style that the Relational Model offers; they can easily do without colorful shapes and stylish layouts, wanting nothing more than the clear-cut facts and relevant information.
Here are a few of the more obvious and noted advantages of the Relational Model:
. Allows for data independence. This helps to provide a sharp and clear boundary between the logical and physical aspects of database management.
. Simplicity. The model provides a simpler structure than those that were being used before it; a simple structure is easy to communicate to users and programmers, and a wide variety of users in an enterprise can interact with a simple model.
. A good theoretical background. The model provides a theoretical foundation for the database management field.
Do not be surprised to find that these are nearly the same advantages that Dr. Codd listed in the debut of this model. It is obvious that he was right, and these advantages have been restated again and again since the first publication of his report.
There has been no other model brought into view that has had the advantages of the
Relational Model, though there have been hybrids of the model, some of which we
will
discuss in later articles.
We began with the Hierarchical Model. This model allowed us to distribute our data in terms of relation, somewhat like a hierarchy, showing a parent/child type of relation. Its one big downfall is that each "child" could only have one parent, while a parent could have many children.
This model served us well in its time of glory, and there are surely still systems using it now, though they entrust their heftier loads of data to better-equipped models.
Following the Hierarchical Model we investigated the Network Model. This model was closely akin to the Hierarchical Model in that it too allowed for a parent/child view of data, its main advantage over the previous model being that it allowed for many-to-many relationships between data sets.
Still, the data had to be predefined. Though some forms of this model are still used today, it has become somewhat obsolete.
Finally we come to our current model, the Relational Model. Like those before it, it too expresses the relationships between data, only it allows for larger input and does not have to be predefined. This model allows users to record and relate large amounts of data.
This model also allows for multi-level relationships between data sets, meaning they can be related in many ways, or even only one way. It is easy to understand how this model has managed to outlive those that came shortly before it. It is versatile, simple, clean in structure, and applicable to nearly every type of data we use.
The semi-structured data model is a data model where the information that would normally be attached to a schema is instead contained within the data; this is often referred to as a self-describing model.
With this type of database there is no clear separation between the data and the
schema,
also the level to which it is structured relies on the application being used.
Certain forms of semi-structured data have no separate schema, while in others
there is a
separate schema but only in areas of little restriction on the data.
Modeling semi-structured data as graphs whose labels give semantics to the underlying structure is a natural process. Databases of this type subsume the modeling power of other extensions of flat relational databases, from nested databases, which enable the encapsulation of entities, to object databases, which also enable references between objects.
Data that is semi-structured has just recently come into view as an important area
of study
for various reasons. One reason is that there are data sources like the World Wide
Web,
which we often treat as a database but it cannot be controlled by a schema.
Another reason is it might be advantageous to have a very flexible format for data
exchange
between contrasting databases. Finally there is also the reason that when dealing
with
structured data it sometimes may still be helpful to view it as semi-structured
data for the
tasks of browsing.
We are familiar with structured data, which is data that has been clearly formed, formatted, modeled, and organized into structures that are easy for us to work with and manage. We are also familiar with unstructured data.
Unstructured data comprises the bulk of information that does not fit neatly into databases. The most easily recognized form of unstructured data is the text in a document, like this article.
What you may not have known is that there is a middle ground for data; this is the data we refer to as semi-structured. These are data sets in which some implied structure is generally followed, but the structure is still not regular enough to meet the criteria needed for the kinds of management and automation normally applied to structured data.
We deal with semi-structured data every day, in both technical and non-technical environments. Web pages follow certain distinctive forms, and the content embedded within HTML usually has a certain amount of metadata within the tags.
Details about the data are implied immediately when using this information. This is why semi-structured data is so intriguing: though there is no set formatting rule, there is still enough regularity that some interesting information can be extracted from it.
The semi-structured data model has two main uses:
. Representation of information about data sources that normally cannot be constrained by a schema.
. Providing a flexible format for data exchange between dissimilar kinds of databases.
The most important trade-off made in using a semi-structured database model is quite possibly that queries cannot be made as efficiently as in the more constrained structures, such as the relational model.
Normally the records in a semi-structured database are stored with unique IDs that are referenced with pointers to their specific location on disk. Because of this, navigational or path-based queries are very efficient, but for the purpose of searching over many records it is not as practical, because the system is forced to seek across various regions of the disk by following the pointers.
We can clearly see that there are some disadvantages with the semi-structured data model, as there are with all other models; let's take a moment to outline a few of these disadvantages.
Storage: transfer formats like XML are universally text or Unicode; they are prime candidates for transfer, yet not so much for storage. The representations are instead stored by dedicated systems that support such standards.
In short, many academic, open source, or other efforts directed at these particular issues have stayed at a surface level, resolving representation, definitions, or units.
Conclusion
We have researched many areas of the semi-structured data model, including the differences between structured data, unstructured data, and semi-structured data. We have also explored the various uses for the model.
After looking at the advantages and the disadvantages, we are now educated enough
about
the semi-structured model to make a decision regarding its usefulness.
This model is worthy of more research and deeper contemplation; the flexibility and diversity that it offers are more than praiseworthy.
After researching, one can see many conventional and non-conventional uses for this
model
in our systems. A model example for semi-structured data model is depicted below.
http://www.learn.geekinterview.com/images/dm13.png
At the end of each arrow you can find the corresponding information. This model example expresses information about this article: the title of the article, which is The Semi-Structured Data Model; the year in which the article was written, which is 2008; and finally who the author is. As you can see from the example, this data model is fairly easy to follow and is useful when dealing with semi-structured information like web pages.
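A minimal Python sketch of the same idea follows, representing the article example above as a self-describing nested structure; the field values (and the placeholder author) are only illustrative.

# Self-describing, semi-structured records: the "schema" (the labels) travels
# with the data itself, and different records may carry different fields.
article = {
    "title": "The Semi-Structured Data Model",
    "year": 2008,
    "author": {"name": "Unknown"},   # nested object, as in the labeled graph
}

book = {   # a second record with a different shape -- no fixed schema required
    "title": "Some Book",
    "isbn": "0-0000-0000-0",
}

def describe(record: dict, indent: int = 0) -> None:
    """Walk an arbitrary semi-structured record and print label -> value pairs."""
    for label, value in record.items():
        if isinstance(value, dict):
            print(" " * indent + f"{label}:")
            describe(value, indent + 2)
        else:
            print(" " * indent + f"{label}: {value}")

describe(article)
describe(book)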
Star Schema
What is the Star Schema?
The Star Schema is basically the simplest form of a data warehouse schema. This schema is made up of fact tables and dimension tables. We have covered dimension tables in previous articles, but the concept of fact tables is fairly new.
The two tables are different from each other only in the way that they are used in
the
schema. They are actually made up of the same structure and the same SQL syntax is
used
to create them as well.
Interestingly enough, in some schemas a fact table can also play the role of a dimension table under certain conditions, and vice versa. Though they may be physically alike, it is vital that we also understand the differences between fact tables and dimension tables.
A fact table in a sales database, used with the star schema, could deal with the
revenue for
products of an organization from each customer in each market over a period of
time.
However, a dimension table in the same database would define the organization's customers, the markets, the products, and the time periods that are found in the fact tables.
When a schema is designed right, it will offer dimension tables that enable the user to browse the database and get comfortable with the information it contains. This helps the user when they need to write queries with constraints, so that the information satisfying those constraints is returned from the database.
Star Schema Important Issues
As with any other schema, performance is a big deal with the Star Schema. The decision support system is particularly important; users utilize this system to query large quantities of data, and star schemas happen to perform best for decision support applications.
Another issue that is important to mention is the roles that fact and dimension tables play in a schema. When considering the actual databases, the fact table is essentially a referencing table, whereas the dimension table plays the role of a referenced table.
http://www.learn.geekinterview.com/images/dm14a.png
We can correctly come to the conclusion that a fact table has foreign keys to reference other tables, and a dimension table is the target of foreign key references from one or multiple tables.
Tables that reference or are referenced by other tables have what is known as a primary key. A primary key is a column or columns whose contents uniquely identify the rows. With simple star schemas, the fact table's primary key can be made up of multiple foreign keys.
A foreign key can be a column or a group of columns in a table whose values are identified by the primary key of another table. When a database is developed, the statements used to make the tables should specify the columns that are meant to form the primary keys as well as the foreign keys. Below is an example of a Star Schema.
Let's point out a few things about the Star Schema above:
. Items listed in the boxes above are columns in the tables with the same names as the box names.
. The foreign key columns are in italic text (you can see that the primary key from the green Dimension box is also a foreign key in the orange box, and the primary key from the turquoise box is also a foreign key in the orange box).
. Columns that are part of both the primary key and a foreign key are labeled in bold italic text, like key 1 in the orange box; bold italic column names indicate a primary key that is also a foreign key to another table.
. The foreign key relationships are identified by the lines used to connect the boxes that represent tables.
Even though a primary key value must be unique within the rows of a dimension table, that value can appear many times in a foreign key of a fact table, as in a many-to-one relationship. A many-to-one relationship can exist between the foreign keys of the fact table and the primary keys they refer to in the dimension tables.
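As a hedged illustration of the fact/dimension arrangement described above, here is a small sketch using Python's built-in sqlite3 module; every table, column, and data value is invented for the example.

import sqlite3

# One fact table whose compound primary key is made up of foreign keys into
# three dimension tables; the measure (sales_quantity) lives on the fact table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_time     (time_id     INTEGER PRIMARY KEY, day  TEXT);

    CREATE TABLE fact_sales (
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        product_id  INTEGER REFERENCES dim_product(product_id),
        time_id     INTEGER REFERENCES dim_time(time_id),
        sales_quantity INTEGER,                       -- the measure
        PRIMARY KEY (customer_id, product_id, time_id)
    );

    INSERT INTO dim_customer VALUES (1, 'Customer A');
    INSERT INTO dim_product  VALUES (1, 'Product X');
    INSERT INTO dim_time     VALUES (1, '2008-04-05');
    INSERT INTO fact_sales   VALUES (1, 1, 1, 5);
""")

# A typical star join: constrain on the dimensions, aggregate the measure.
row = conn.execute("""
    SELECT p.name, SUM(f.sales_quantity)
    FROM fact_sales f
    JOIN dim_product  p ON p.product_id  = f.product_id
    JOIN dim_customer c ON c.customer_id = f.customer_id
    GROUP BY p.name
""").fetchone()
print(row)   # ('Product X', 5)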
The star schema can hold many fact tables as well. Multiple fact tables are present because they hold unrelated facts, like invoices and sales. In some situations multiple fact tables are present simply to support performance.
You can see multiple fact tables serving this purpose when they are used to support
levels
of summary data, more specifically when the amount is large, like with daily sales
data.
Referencing tables are also used to define many-to-many relationships between dimensions. Such a table is usually referred to as an associative table or a cross-reference table. This can be seen at work in the sales database as well: each product belongs to one or more groups, and each of those groups also contains many products.
A rough example of this, yet again from the sales database: as we said before, each product is in one or more groups and each of those groups has multiple products, which is a many-to-many relationship.
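Here is a small illustrative sketch, again using Python's sqlite3 module with invented names, of such an associative (cross-reference) table resolving the product/group many-to-many relationship.

import sqlite3

# Each row of the cross-reference table pairs one product with one group;
# a product may appear in many rows, and so may a group.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product       (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product_group (group_id   INTEGER PRIMARY KEY, name TEXT);

    CREATE TABLE product_group_xref (
        product_id INTEGER REFERENCES product(product_id),
        group_id   INTEGER REFERENCES product_group(group_id),
        PRIMARY KEY (product_id, group_id)
    );

    INSERT INTO product       VALUES (1, 'Product A');
    INSERT INTO product       VALUES (2, 'Product B');
    INSERT INTO product_group VALUES (10, 'Group X');
    INSERT INTO product_group VALUES (20, 'Group Y');
    INSERT INTO product_group_xref VALUES (1, 10);
    INSERT INTO product_group_xref VALUES (1, 20);
    INSERT INTO product_group_xref VALUES (2, 10);
""")

# All groups that Product A (product_id 1) belongs to:
print(conn.execute("""
    SELECT g.name
    FROM product_group_xref x
    JOIN product_group g ON g.group_id = x.group_id
    WHERE x.product_id = 1
""").fetchall())   # [('Group X',), ('Group Y',)] (order may vary)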
When designing a schema for a database we must keep in mind that the design affects
the
way in which it can be used as well as the performance.
Because of this, it is vital that one makes a preliminary investment of time and research in the design, so that the database is beneficial to the needs of its users. Let's wrap things up with a few suggestions about things to consider when designing a schema:
. What is the function of the organization? Identify what the main processes are
for the
organization; it may be sales, product orders, or even product assembly, to name a
few.
This is a vital step; the processes must be identified in order to create a useful
database.
. What is meant to be accomplished? As all databases, a schema should reflect the
organization, in what it measures as well as what it tracks.
. Where is the data coming from? It is imperative to consider the projected input data; its sources will disclose whether the existing data can support the projected schema.
. Will there be dimensions that may change in time? If the organization contains dimensions that change often, then it is better to measure it as a fact rather than store it as a dimension.
. What is the level of detail of the facts? Each row should contain the same kind of data. Differing data could be addressed with a multiple fact table design, or by modifying the single table so that a flag identifying the differences is stored with the data. You also want to consider the amount of data, the space, and the performance needs.
. If there are changes how will they be addressed, and how significant is
historical
information?
XML Database
The XML database is most commonly described as a data persistence system that enables data to be accessed, exported, and imported. XML stands for Extensible Markup Language.
XML itself is a meta markup language that was developed by the W3C to handle the inadequacies of HTML. The HTML language began to evolve quickly as more functionality was added to it.
Soon there was a need to have a domain-specific markup language that was not full
of the
unnecessary data of HTML, thus XML was brought to life.
XML and HTML are indeed very different. The biggest difference is that where HTML's semantics and tag syntax are fixed, in XML the creator of the document is able to produce tags whose syntax and semantics are particular to the intended application.
The semantics of a tag in XML depend on the framework of the application that processes the document. Another difference between XML and HTML is that an XML document has to be well formed.
XML's original purpose may have been to mark up content, but it did not take long for users to realize that XML also gave them a way to describe structured data, which in turn made XML significant as a data storage and exchange format as well.
Here are a few of the advantages that the XML data format has:
. Built-in support for internationalization, due to the fact that it utilizes Unicode.
. The human-readable format makes it easier for developers to trace and repair errors than with preceding data storage formats.
. A great quantity of off-the-shelf tools for processing XML documents already exists.
A Native XML Database, or NXD, defines a model for an XML document, as opposed to the data in the document, and stores and retrieves documents according to that model.
At the very least the model will consist of attributes, elements, and document order. An NXD has the XML document as its fundamental unit of storage.
The database is also not obligated to have any specific underlying physical storage model. It can be built on a hierarchical, relational, or even an object-oriented database, all of which we have explored in detail.
It can also use a proprietary storage format such as indexed or compressed files. So we can gather from this information that the database is unique in storing XML data: it stores all aspects of the XML model without breaking it down.
We have also learned that the NXD is not really an independent database all of the time. NXDs are not meant to replace actual databases; they are a tool, used to aid the developer by providing robust storage and management of XML documents.
Not all of these databases are the same, yet there are enough similar features between them to give us a rough idea of some of the basic structure. Before we continue, let us note that the database is still evolving and will continue to do so for some time.
One feature of the database is XML storage. It stores documents as a unit and creates models that are closely related to XML or a related technology such as the DOM, covering both document content and semi-structured data. Mapping is used to ensure that the XML-specific model of the data is maintained.
After the data is stored, the user will need to continue to use the NXD tools. It is not as useful to try to access the data tables using SQL as one might think; this is because the data that would be viewed would be the model of an XML document, not the entities that the data depicts.
It is important to note that the business entity model is within the XML document domain, not the storage system; in order to work with the actual data you will have to work with it as XML.
Another feature of the database worth mentioning is queries. Currently XPath is the query language of choice. To function as a database query language, XPath is extended somewhat to allow queries across collections of documents.
On a negative note, XPath was not created to be a database query language, so it does not function perfectly in that role.
In order to improve the performance of queries, NXDs support the creation of indexes on the data stored in the collections. The indexes can be used to improve the speed of query execution. The fine points of what can be indexed and how the index is fashioned vary between products.
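As a rough illustration of path-based querying, Python's standard xml.etree.ElementTree module implements a limited XPath subset; the document content below is invented purely for the example.

import xml.etree.ElementTree as ET

# A small, invented XML document; the tag names are illustrative only.
doc = """<library>
  <book id="b1"><title>First Title</title><year>2007</year></book>
  <book id="b2"><title>Second Title</title><year>2008</year></book>
</library>"""
root = ET.fromstring(doc)

# Path-based navigation down the document tree:
for title in root.findall("./book/title"):
    print(title.text)                     # 'First Title', 'Second Title'

# A simple attribute predicate, then a child lookup:
book = root.find(".//book[@id='b2']")
print(book.find("year").text)             # '2008'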
What kind of data types are supported by XML?
You might be surprised to hear that XML does not actually support any data types.
The XML
document is almost always text, even if by chance it does represent another data
type; this
would be something like a date or integer.
Usually the data exchange software converts the data from a text form, like in the
XML
document, to other forms within the database and vice versa.
Two methods are most common in determining which conversion to do. The first of these methods is that the software determines the data type from the database schema, which works out well because it is always available at run time.
The other method that is common is that the user clearly provides the data type,
like with
the mapping information.
This can be recorded by the user or even generated without human intervention from
a
database schema or even an XML schema.
When it is generated automatically, the data types can be taken from database
schemas as
well as from certain types of XML schemas.
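A tiny sketch of the second method described above, where the caller supplies the target data type for each element; the element names and the mapping below are invented for illustration.

from datetime import date

# The mapping plays the role of user-provided data type information:
# it says how each piece of XML text should be converted to a native type.
type_map = {
    "quantity": int,
    "price": float,
    "order_date": date.fromisoformat,
}

raw = {"quantity": "20", "price": "4.75", "order_date": "2008-04-05"}  # XML text

typed = {name: type_map.get(name, str)(value) for name, value in raw.items()}
print(typed)  # {'quantity': 20, 'price': 4.75, 'order_date': datetime.date(2008, 4, 5)}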
There is another issue related to conversions as well; it has to do largely with what text formats are recognized when data is taken from XML, and what can be produced when data is written back to XML.
Concluding statements
XML may seem to be confusing; however, it is beneficial and even a bit less complicated than HTML. Yet when you begin to take the step toward understanding XML after having spent much time working with HTML, the process can be a bit distressing.
Never fear, once you have completed that step, XML is definitely a dominant format.
It is
also used in almost all of the models we have discussed, making it a vital area to
explore in
more detail.
It is fairly obvious that product names should not be hard-coded as the names of the columns in a table. Instead, a department's product descriptions go in a product table, and purchases or sales of an individual item are recorded in another table, in separate rows that reference the item by its product ID.
Perhaps the most notable example of the EAV model is in the production databases we
see
with clinical work. This includes clinical past history, present clinical
complaints, physical
examinations, lab test, special investigations, and diagnoses. Basically all of the
aspects
that could apply to a patient. When we take into account all of the specialties of
medicine,
this information can consist of hundreds of thousands of units of data.
However, most people who visit a health care provider have few findings. Physicians simply do not have the time to ask a patient about every possible thing; this is just not the way in which patients are examined. Rather than using a process of elimination against thousands of possibilities, the health care provider focuses on the primary complaints of the patient, and then asks questions related to those complaints.
Now let's consider how someone would attempt to represent a general-purpose clinical record in a database like those we discussed earlier.
Creating a table, or even a set of tables, with thousands of columns would not be the best course of action: the vast majority of the columns would be empty, and the user interface would be unusable without extremely elaborate logic to hide groups of columns based on the data entered in previous columns.
To complicate things further, the patient record and medical findings continue to grow. The Entity-Attribute-Value data model is a natural solution for this perplexing issue, and you shouldn't be surprised to find that larger clinical data repositories do use this model.
Earlier we covered the fact that the EAV table consists of three columns in which data is recorded. Those columns are the entity, the attribute, and the value. Now we will talk a little more in-depth about each column.
. The Entity: sticking to the scenario of clinical findings, the entity would be the patient event. This would contain at the very least a patient ID and the date and time of the examination.
Entity-Attribute-Value Database
This database is most commonly called the EAV database; this is a database where a
large
portion of data is modeled as EAV. Yet, you may still find some traditional
relational tables
within this type of database.
. We stated earlier what EAV modeling does for certain categories of data, such as clinical findings, where attributes are numerous and sparse. Where these specific conditions do not apply, we can use a traditional relational model instead. Using EAV has nothing to do with leaving the common sense and principles of the relational model behind.
. The EAV database is basically unmaintainable without the support of many tables that store supporting metadata. These metadata tables usually outnumber the EAV tables by a factor of about three or more, and they are normally traditional relational tables.
. The Entity in clinical data is usually a clinical event, as we have discussed above. For more general purposes, the entity is a key into an Objects table that is used to record common information about all of the objects in the database. The use of an Objects table does not require EAV; traditional tables can be used to store the category-specific details of each object.
. The Value: coercing all values into strings, as in the EAV example above, results in a simple, yet not scalable, structure. Larger systems use separate EAV tables for each of their data types, including binary large objects, with the metadata for a given attribute identifying the EAV table in which its data will be stored.
. The Attribute: in the EAV table this is no more than an attribute ID; there are normally multiple metadata tables that contain the attribute-related information. (A small sketch of these three columns follows this list.)
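Below is a minimal, illustrative EAV sketch using Python's built-in sqlite3 module; the attribute names, the sample findings, and the single metadata table are all invented, and a real system would carry far more metadata.

import sqlite3

# One narrow table of (entity, attribute, value) rows plus a small
# attribute-metadata table that gives each attribute ID its meaning.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE attribute_meta (          -- metadata: what each attribute means
        attribute_id INTEGER PRIMARY KEY,
        name         TEXT,
        datatype     TEXT
    );
    CREATE TABLE eav (                     -- the EAV table itself
        entity_id    INTEGER,              -- e.g. a patient event
        attribute_id INTEGER REFERENCES attribute_meta(attribute_id),
        value        TEXT                  -- everything coerced to text here
    );

    INSERT INTO attribute_meta VALUES (1, 'systolic_bp', 'integer');
    INSERT INTO attribute_meta VALUES (2, 'chief_complaint', 'text');
    INSERT INTO eav VALUES (100, 1, '128');
    INSERT INTO eav VALUES (100, 2, 'headache');
""")

# Reassemble one sparse "row" for entity 100 from its EAV triples.
findings = conn.execute("""
    SELECT m.name, e.value
    FROM eav e JOIN attribute_meta m ON m.attribute_id = e.attribute_id
    WHERE e.entity_id = 100
""").fetchall()
print(dict(findings))   # {'systolic_bp': '128', 'chief_complaint': 'headache'}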
There have been a number of issues with the Entity-Attribute-Value model brought to
light
throughout its lifetime. We will briefly discuss those now. It is important that we
clarify first
that these issues arise when metadata is not used with the EAV model, for metadata
is vital
for its functionality.
The technology of relational databases will be inaccessible and will have to be recreated by the development team; this could include system tables, graphical query tools, fine-grained data security, incremental backup and restore, exception handling, partitioned tables, and clustered indexes, none of which would otherwise exist for the EAV structure.
The actual format is not supported well by the DBMS internals. Standard query
optimizers
for SQL do not handle the EAV formatted data that well, and a lot of time will need
to be
dedicated to performance tuning for an acceptable production quality application.
As you can see from above there are still a few issues that need to be addressed by
developers in order to make the EAV optimal. Regardless of those issues we have
also
learned that if we use metadata with the EAV we can avoid many if not all of these
issues.
There are three general graphical symbols used in an entity relation diagram: the box, the diamond and the oval. The box is commonly used for representing the entities in the database. The diamond is typically used for representing the relationships and, finally, the oval is used for representing the attributes.
In many other entity relation diagrams, the rectangle symbol is used to represent
entity
sets while the ellipse symbol is used to represent attributes. The line is
generally used for
linking attributes to entity sets and entity sets to relationship sets.
The entity relation diagram is used to represent the entire information system for
easy
management of resources. The diagram can make the people concerned easily identify
concepts or entities which exist in the whole information system as well as the
entire
business structure and the complex interrelationships between them.
It may represent a very industry-specific theoretical overview of the major entities and relationships needed for management of the industry resources, whatever they may be. It may assist in the designing process of the database for an e-resource management system but may not necessarily identify every table which would be used.
In representing cardinality, the "Crow's Foot" notation uses three symbols: the ring represents zero, the dash represents one, and the crow's foot represents many.
This diagram scheme may not be as famous and widely used as the symbols above but
it is
fast gaining notice especially now that it is used with Oracle texts and in some
visual
diagram and flowcharting tools such as Visio and PowerDesigner.
Those who prefer the "Crow's Foot" notation say that this technique gives better clarity in identifying the many, or child, side of the relationship compared to other techniques. This scheme also gives more concise notation for identifying a mandatory relationship, with the use of a perpendicular bar, or an optional relationship, with an open circle.
There are many tools for entity relation diagrams available in the market or the
internet
today. The proprietary tools include Oracle Designer, SILVERRUN ModelSphere,
SmartDraw,
CA ERwin Data Modeler, DB Visual ARCHITECT, Microsoft Visio, PowerDesigner and ER/Studio. For those who want free tools, the choices include MySQL Workbench, Open System Architect, DBDesigner and Ferret.
An entity structure chart draws the graphical structure for any data within the enterprise's common data structure. This graphical drawing is intended to help data document analysts, database administrators, information technology staff, and all other staff of the organization visualize the data structures and the information system design. Entity relation diagrams make use of the entity structure chart in order to provide a complete representation of the logical data structure.
Without the aid of an entity structure chart, data entities and all attributes and
relations
pertaining to them, would all be defined in bullet format with details in paragraph
format.
This can be straining to the eyes as well as difficult to analyze because one will
have to dig
through all those words and sentences.
With an entity structure chart, it becomes easy to have an analysis that shows the
position
of any selected entity in the structures that have been defined. An entity may
exist in many
structure charts. This chart would also give a great benefit in validating the
position or
absence of an entity within one or more specific structures.
For instance, there might be a need for identifying where a specific set of
departmental data
exists within the structure of the entire organization. With an entity structure,
it would be
very easy to spot and point to the location of the departmental data being looked
for.
The data entities which are represented in the entity structure chart may include
resource
data, roles of the entity, data from each organizational or departmental unit, data
location,
and any other business items which may apply.
An entity represents any real world object and the entity structure is technically
and
basically the formalism used in a structural knowledge representation scheme that
systematically organizes a family of possible structures of a system. Such an
entity
structure chart illustrates decomposition, coupling, and taxonomic relationships
among
entities.
The decomposition of an entity is the particular area concerned with how any entity may be broken down further into sub-entities until atomicity is reached. Coupling pertains to the specifications detailing how sub-entities may be coupled together to reconstitute the entity.
An entity structure chart has a direct support for the entity relation diagram.
This diagram is
the visual and graphical representation of all of the interrelationships between
entities and
attributes in the database. And while the entity relation diagram uses graphical
symbols
such as box, diamond, oval and lines for representing the entities and
relationships in the
database, the entity structure chart may or may not use the same symbols in trying
to
visualize the structure of the entity for the database or the enterprise
information system.
It could be said that an entity structure chart is the breakdown of the components of the entity relations diagram. In the entity relations diagram, only the entities and the table relations are specified. For the entity structure chart, however, the visual illustration may include symbols for how the entity is structurally represented.
For instance, let us say the entity is a CUSTOMER. The entity structure details everything about the customer, including name, age, and so on, its relationships with products and other entities, and at the same time the data types describing how the CUSTOMER and its attributes will be physically stored in the database.
External Schema
The word schema as defined in the dictionary means plan, diagram, scheme, or an underlying organizational structure. Therefore, very briefly, an external schema is a plan for how to structure data so that it can seamlessly integrate with any information system that needs it.
It also means that data needs to integrate with the business schema of the
implementing
organization. External Schema is a schema that represents the structure of data
used by
applications.
Each of the external schemas describes the part of the information which is appropriate to the group of users to whom the schema is addressed. Each external schema is derived from the conceptual schema.
The external schema definitions are all based on a data dictionary. The universe of
discourse of the data dictionary is all information in the use and management of
the
database system.
The external schema definition system is the means by which the external schemas
are
being defined. The external schema must contain information which must be derivable
from
the conceptual schema.
In systems based on the object-oriented paradigm, this does not necessarily mean that the classes included in the schema have to have been previously defined in the conceptual schema. An external schema may include classes that have been defined in the conceptual schema, just as it may contain derived classes.
A derived class is any class which has been directly or indirectly defined on the basis of the conceptual schema classes and has been defined and included in the data dictionary.
The typical external schema definition for the object oriented paradigm includes
three steps
which are: definition of the necessary derived classes; selection of the set of
classes that
will constitute the external schema; and generation of the output external schema.
A general external schema explicitly defines data structure and content in terms of the data model that tackles the structure, integrity and manipulation of the data. As such, the external data schema includes the data vocabulary, which defines the element and attribute names; the content model, which holds the definition of relationships and corresponding structure; and the data types.
Some of these data types are integer, string, decimal, boolean, double, float, hexBinary, base64Binary, QName, dateTime, time, date, gYearMonth, gYear, gMonthDay, gDay, gMonth, NOTATION and any others which may be applicable or appropriate for representing the entities in the information system.
As this defines the structure of data used in applications, it makes it easy for the information system to deal with all processes, optimizations, and troubleshooting. This schema will also ensure that the data used in the information system adheres to the rules of the business and follows the framework of the data architecture.
In data modeling, the external schema may refer to the collection of data
structures that
are being used in creating databases representing all objects and entities which
are modeled
by the database.
The same external schema stores the definition of the collection of governing rules and constraints placed on the data structure in order to maintain structural integrity by preventing orphaned records and disconnected data sets. In the same external schema can also be found the collection of operators applicable to the data structures, such as update, insert, and query of the database.
Four-Schema Concept
The four-schema concept divides data modeling into four schemas:
a physical schema,
a logical schema,
a data view schema, and
a business schema.
The four-schema concept is greatly taken advantage of in the implementation of service-oriented business process integration. It helps to resolve problems with the three-schema concept.
Today, there is a very widely recognized trend in the marketing and business environment: the push toward a scenario wherein companies and business enterprises are networked together in order to gain higher profit from collaborative efforts and improve operational flexibility while reducing operational cost.
In order for integration to take place, the underlying architecture must have to be
resolved
first so that smooth and near seamless integration can happen.
One of the most important aspects, not just of enterprise data integration but of computer science in general, is data modeling.
This is the process of creating a data model with the use of model theory (the formal description of data) so that a data model instance can be created. The four schemas are the physical schema, the logical schema, the data view schema, and the business schema.
A typical data model may be an instance of one of these schemas. The conceptual or business schema is used for describing the semantics pertaining to the business organization. It is in this schema that the entity classes, which represent things of significance to the organization, and the entity relationships, which are the assertions about associations between pairs of entity classes, are defined.
The logical schema contains the descriptions of the semantics as represented by a particular data manipulation technology, including descriptions of tables and columns, object-oriented classes, and XML tags, among other things.
The physical schema describes the physical means by which data are stored, along with other concerns pertaining to partitions, CPUs, tablespaces, and so on.
The addition of the data view schema is what makes the four schema concept complete
and
distinct. The data view schema details how enterprises can offer the information
that they
wish to share with others as well as request what they want.
In real-life business data management implementations, there really are many types of data model being developed, using many different notations and methods.
Some of the data models are biased towards physical implementation, others are biased toward understanding business data, and still a few are biased towards business managers.
The four schema concept, as mentioned, is biased towards service oriented business
processes integration.
Some good practices in this area tend to rely on homogeneous semantics which can
pose
difficulty in achieving for independent databases owned by independent enterprises.
The new model provides a four-schema architecture which can already allow
management
for information sharing.
The logical data model elaborates the representation of all data pertaining to the organization, and this model organizes the enterprise data in management technology terms.
In 1975 when the American National Standards Institute (ANSI) first introduced the
idea of
a logical schema for data modeling, there were only two choices that time which
were the
hierarchical and network models.
Today there are three choices for the logical data model: relational, object oriented, and Extensible Markup Language (XML). The relational option defines the data model in terms of tables and columns. The object oriented option defines data in terms of classes, attributes and associations. Finally, the XML option defines data in terms of tags.
The logical data model is based closely on the conceptual data model, which describes all business semantics in natural language without pointing to any specific means of technical implementation such as the use of hardware, software or network technologies.
The process of logical data modeling could be a labor intensive technique depending
on the
size of the enterprise the data model will be used for. The resulting logical data
model
represents all the definition, characteristics, and relationships of data in a
business,
technical, or conceptual environment. In short, logical data modeling is about
describing
end-user data to systems and end-user staff.
The very core of the logical data model is the definition of the three types of data objects which are the building blocks of the data model: entities, attributes, and relationships. Entities refer to persons, places, events or things which are of particular interest to the company.
Some examples of entities are Employees, States, Orders, and Time Sheets. Attributes refer to the properties of the entities. Examples of attributes for the Employee entity are first name, birthday, gender, address, age and many others. Lastly, relationships refer to the way in which the entities relate to each other. An example relationship would be "customers purchase products" or "students enroll in classes".
The above example is a logical data model using the Entity-Relationship (ER) approach, which identifies entities, relationships, and attributes and normalizes the data.
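As a toy illustration of these building blocks, the sketch below renders two entities, a few attributes, and the "customers purchase products" relationship as Python dataclasses; all names and values are invented.

from dataclasses import dataclass, field
from typing import List

# Entities become classes, attributes become fields, and the relationship
# "customers purchase products" becomes a link between the two.
@dataclass
class Product:
    name: str

@dataclass
class Customer:
    first_name: str
    purchases: List[Product] = field(default_factory=list)  # the relationship

item = Product("Sample Product")
buyer = Customer("Sample Customer")
buyer.purchases.append(item)
print(f"{buyer.first_name} purchased {[p.name for p in buyer.purchases]}")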
A logical data model should be carefully designed because it will have tremendous
impact on
the actual physical implementation of the database and the larger data warehouse.
A logical data model influences the design of data movement standards and techniques, such as the heavily used extract, transform and load (ETL) process in data warehousing and enterprise application integration (EAI), the degree of normalization, the use of surrogate keys, and cardinality.
Likewise, it will determine the efficiency in data referencing and in managing all
the
business and technical metadata and the metadata repository. Several pre-packaged
third
party business solutions like enterprise resource planning (ERP) or HR systems have
their
own logical data model and when they are integrated into the overall existing
enterprise
model with a well designed logical data model, the implementation may turn out to
be
easier and less time consuming resulting in saving of money for the company.
With Enterprise Data Model being a data architectural framework, the business
enterprise
will have some sort of starting point for all data system designs. Its theoretical
blueprint can
provide for provisions, rules and guide in the planning, building and
implementation of data
systems.
In the area of enterprise information system, the Operational Data Store (ODS) or
Data
Warehouse (DW) are two of the largest components which need carefully designed
enterprise data model because data integration is the fundamental principle
underlying any
such effort and a good model can facilitate data integration, diminishing the data
silos,
inherent in legacy systems.
As the name implies, the very core of an Enterprise Data Model is the data, regardless of where the data is coming from and how it will finally be used. The model is meant primarily to give clear definitions on how to come up with efficient initiatives in the aspects of Data Quality, Data Ownership, Data System Extensibility, Industry Data Integration, Integration of Packaged Applications and Strategic Systems Planning.
The process of making an enterprise model typically utilizes a top-down approach for all designs of the data systems, including the operational data store, data marts, data warehouse and applications. The enterprise data model is built in three levels of decomposition and forms a pyramid shape.
The first to be created is Subject Area Model which sits on top of the pyramid. It
expands
down to create the Enterprise Conceptual Model and finally the Enterprise
Conceptual Entity
Model is created and occupies the base part of the pyramid. The three models are
interrelated but each of them has its own unique purpose and identity.
The Enterprise Conceptual Model, the second level in the pyramid, identifies and defines the major business concepts of each of the subject areas. This is a high-level data model, having on average several concepts for every subject area. These concepts have finer details compared to the subject area details. This model also defines the relationships of each concept.
The Enterprise Conceptual Entity Model represents all things which are important to each business area from the perspective of the entire enterprise. This is the detailed level of the enterprise data model, with each concept being expanded within its subject area. It is also at this level that the business and its data rules are examined, rather than existing systems, so as to create the major data entities, the corresponding business keys, relationships and attributes.
Star Schema
The star schema, which is sometimes called a star join schema, is one of the simplest styles of data warehouse schema. It consists of a few fact tables that reference any number of dimension tables. The fact tables hold the main data, with the typically smaller dimension tables describing each individual value of a dimension. Its main advantage is simplicity.
Normalization is not a goal of star schema design. Star schemas are usually divided into fact tables and dimension tables, where the dimension tables supply supporting information. A fact table contains a compound primary key that consists of the aggregate of the relevant dimension keys.
Making a star schema for a database may be relatively easy, but it is still very important to invest some time and research, because the schema's effect on the usability and performance of the database is very significant in the long run.
In a data warehouse implementation, the creation of the star schema database is one of the most important, and often the final, process in implementing the data warehouse. A star schema also has significant importance in business intelligence processes such as the on-line transaction processing (OLTP) system and on-line analytical processing (OLAP).
In an on-line transaction processing scenario, several call center agents continuously take calls and enter orders, typically involving numerous items, which must be stored immediately in the database. This makes the scenario very time-critical, and the speed of inserts, updates and deletes should be maximized. In order to optimize performance, the database should hold as few records as possible at any given time.
On-line Analytical Processing, on the other hand, though it may mean many different things to different people, is a means for analyzing corporate data. In some cases the terms OLAP and star schema are used interchangeably, but a more precise way of thinking would be to view a star schema database as an OLAP system, which can be any system of read-only, historical, aggregated data.
Both OLAP and OLTP can be optimized with a star schema in a data warehouse implementation. Since a data warehouse is the main repository of a company's historical data, it naturally contains very high volumes of data which can be used for analysis with OLAP. Querying these data may take a long time, but with the help of a star schema in the implementation, access can be made faster and more efficient.
Because of some of the nuances associated with different database systems and development platforms, reverse data modeling is generally a difficult task to accomplish. But today, there are software vendors offering solutions to make reverse data modeling relatively easier to do. These reverse data modeling software solutions can take snapshots of existing databases and produce physical models from which an IT staff can begin to verify table and column names.
While it can generally be relatively easy to document table and column names using such tools, or by reviewing database creation scripts, the thing that is really complicated and difficult to do is dealing with the relationships between tables in the database.
While hard skills like coding and software engineering are very important in reverse data modeling, soft skills like documentation, employee know-how, and training can also be invaluable in figuring out exactly how data is being used by the system in question.
One of the biggest factors in a successful reverse data modeling effort is learning as much as possible about the application; only by using the application can you determine how the existing data structures are being used and how this compares with the original design.
Reverse data modeling is a technology that represents a big opportunity for some legacy reporting systems to have their useful life extended, but it should not be construed as a permanent replacement for the data warehousing technology stack. Reverse data modeling also plays a very big role in helping improve efficiency.
Some commercial software offers solutions that can enable true enterprise
application
engineering by storing and documenting data, processes, business requirements, and
objects that can be shared by application developers throughout an organization.
With reverse data modeling and the help of such software, the business organization
implementing an enterprise data management system can easily design and control
enterprise software for quality, consistency, and reusability in business
applications through
the managed sharing of meta-data.
Like all other reverse engineering technologies, whether for software, hardware, or non-IT implementations, reverse data modeling is very useful in times when there is a deep-seated need for system troubleshooting of very complicated problems.
Reverse data modeling can give IT staff a deeper insight into and understanding of the data models, thereby empowering them to come up with actions to improve the management of the system.
There are three basic styles of data models: conceptual data model, logical data
model and
physical data model. The conceptual data model is sometimes called the domain model
and
it is typically used for exploring domain concepts in an enterprise with
stakeholders of the
project.
The logical model is used for exploring the domain concepts as well as their
relationships.
This model depicts the logical entity types, typically referred to simply as entity
types, the
data attributes describing those entities, and the relationships between the
entities.
The physical data model is used in the design of the database's internal schema and, as such, it depicts the data tables, the data columns of those tables, and the relationships between the tables. This model represents the data design taking into account the facilities and constraints of a given database management system. The physical data model is often derived from the logical data model, although some can reverse engineer it from a database implementation.
A detailed physical data model contains all the artifacts a database requires for creating relationships between tables or achieving performance goals, such as indexes, constraint definitions, linking tables, partitioned tables or clusters. This model is also often used for calculating estimates of data storage and sometimes includes details on storage allocation for a database implementation.
The physical data model is basically the output of physical data modeling which is
conceptually similar to design class modeling whose main goal is to design the
internal
schema of a database, depicting the data tables, the data columns of those tables,
and the
relationships between the tables.
In a physical data model, the tables are first identified where data will be stored
in the
database.
For instance, in a university database, the database may contain the Student table
to store
data about students. Then there may also be the Course table, Professors table, and
other
related table to contain related information. The tables will then be normalized.
Data normalization is the process wherein the data attributes in a data model are organized
to reduce data redundancy, increase data integrity, increase the cohesion of tables and
reduce the coupling between tables.
After the tables are normalized, the columns are identified. A column is the database
equivalent of an attribute. Each table will have one or more columns. In our example, the
university database may have columns in the Student table such as FirstName, LastName
and StudentNumber.
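A minimal sketch of how such a table might be declared in SQL (the column names come from the example above; the data types, and the choice of StudentNumber as the key, are illustrative assumptions):

    CREATE TABLE Student (
        StudentNumber  INTEGER PRIMARY KEY,   -- uniquely identifies each student
        FirstName      VARCHAR(50) NOT NULL,
        LastName       VARCHAR(50) NOT NULL
    );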
The stored procedures are then identified. Conceptually, a stored procedure is like a global
method for implementing a database. An example of a stored procedure would be code to
compute a student's average mark, a student's payables, or the number of students enrolled
and allowed in a certain course. Relationships are also identified in a physical data model.
Keys are also assigned in the tables. A key is one or more data attributes which uniquely
identify a table row, thus eliminating data redundancy and increasing data integrity.
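Continuing the hypothetical university example, a relationship and a stored procedure might be sketched roughly as follows (the Enrollment table, its columns, and the function body are illustrative assumptions, written in PostgreSQL-style SQL, and a Course table with a CourseId key is assumed to exist):

    -- A linking table that relates students to courses; its foreign keys
    -- express relationships identified in the physical data model.
    CREATE TABLE Enrollment (
        StudentNumber INTEGER REFERENCES Student (StudentNumber),
        CourseId      INTEGER REFERENCES Course (CourseId),
        Mark          NUMERIC(5,2),
        PRIMARY KEY (StudentNumber, CourseId)
    );

    -- A stored routine that computes a student's average mark across all courses.
    CREATE FUNCTION average_mark(p_student INTEGER) RETURNS NUMERIC AS $$
        SELECT AVG(Mark) FROM Enrollment WHERE StudentNumber = p_student;
    $$ LANGUAGE SQL;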
Database Concepts
Perhaps you wish you had kept a simple, personal record of the movies and books you have
enjoyed, so you could have a quick look at it and easily identify the title that has been
bothering you all this time? A database would be perfect for that!
The purest idea of a database is nothing more than a record of things that have
happened.
Granted, most professionals use databases for much more than storing their favorite
movies
and books, but at the most basic level, those professionals are simply recording
events, too.
Every time you join a web site, your new account information ends up in a database.
Have
you ever rented movies from Netflix? Their entire web site is essentially one big
database, a
record of every movie available to rent along with the locations of their copies of
that movie
among their many warehouses and customers' homes. The list goes on! Databases are
everywhere, and they assist us in performing many essential tasks throughout our
daily
lives.
You can easily see how databases affect all aspects of our modern lives, since
everything
you do, from calls you make on your mobile phone to transactions you make at the
bank to
the times you drive through a toll plaza and use your toll tag to pay, is recorded
in a
database somewhere.
If these databases did not exist, our lives would surely be much less convenient.
In the 21st
century, we are so accustomed to using credit cards and printing airline boarding
passes at
home that, if databases were to suddenly disappear, it would almost seem like we
were
cavemen again.
Fortunately, we have databases, and there are many people who are skilled at using
them
and developing software to use alongside them. These people take great pride in
their work,
as database programming is difficult but nonetheless very rewarding.
Consider, for example, the database team at Amazon.com. They built an enormous
database to contain information about books, book reviews, products that are not
books at
all, customers, customers� preferences, and tons of other things.
It must have taken them months to get the database just right and ready for
customers to
use! But, once it started to work well and Amazon.com went live back in 1995, can
you
imagine the sense of pride those developers had as millions of potential customers
poured
onto the web site and began to interact with the product they spent so much time
perfecting? That must have been an incredible feeling for the database team!
What is a Database?
When you are in a big electronics store buying the latest edition of the iPod, how does that
store's inventory tracking system know you just bought an iPod and not, for example, a car
stereo or a television?
Let's walk through the process of buying an iPod and consider all the implications
this has
on the inventory database that sits far underneath all the shiny, new gadgets on
the sales
floor.
When you hand the iPod box to the cashier, a barcode scanner reads the label on the
box,
which has a product identification number. In barcode language, this number might
be
something like 885909054336. The barcode representing this number can be seen in
Figure
1.
Figure 1. A sample barcode
The barcode acts as a unique identifier for the product; in this case, all iPods
that are the
same model as the one passing across the barcode reader have the same exact
barcode.
The barcode scanner relays the number represented by the barcode to the register at
the
cashier's station, which sends a request (or a query) to the store's inventory
database. This
database could be in the same store as the register or somewhere across the country
or
even around the world, thanks to the speed and reliability of the Internet.
The register asks the database, "What are the name and price of the product that has this
barcode?" To which the database responds, "That product is an iPod, and it costs $200."
You, the customer, pay your $200 and head home with a new toy. Your work in the
store is
finished, but the inventory management system still needs to reconcile your
purchase with
the database!
When the sale is complete, the register needs to tell the database that the iPod
was sold.
The ensuing conversation goes something like the following.
Register: "How many products with this barcode are in our inventory?"
Database: "1,472."
Register: "Now, 1,471 products with this barcode are in our inventory."
Database: "OK."
Data Retrieval
Of course, this is not the whole story. Much more happens behind the scenes than
simple
conversational requests and acknowledgements.
The first interaction the register had with the database occurred when the request
for the
product name and price was processed. Let's take a look at how that request was
really
handled.
If the database is an SQL database, like MySQL or PostgreSQL or many others, then
the
request would be transmitted in the standard Structured Query Language (SQL). The
software running on the register would send a query to the database that looks
similar to
the following.
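A minimal sketch of such a query, assuming a products table with id, name, and price columns (the real schema used in the original figures is not shown here, so these names are illustrative):

    -- Look up the product that carries the scanned barcode.
    SELECT name, price
    FROM products
    WHERE id = 885909054336;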
Every database may contain multiple tables, and every table may contain multiple
rows, so
specifying the name of the table and the row�s unique identifier is very important
to this
query. To illustrate this, an example of a small products table is shown in Figure
2.
When the database has successfully found the table and the row with the specified
id, it
looks for the values in the name and price columns in that row. In our example,
those
values would be "iPod" and "200.00", as seen in Figure 2. The execution of the
previous
SELECT statement, which extracts those values from the table, is shown in Figure 3.
The database then sends a message back to the register containing the product's
name and
price, which the register interprets and displays on the screen for the cashier to
see.
Data Modification
The second time the register interacts with the database, when the inventory number
is
updated, requires a little more work than simply asking the database for a couple
numbers.
Now, in addition to requesting the inventory number with a SELECT statement, an
UPDATE
statement is used to change the value of the number.
First, the register asks the database how many iPods are in the inventory (or "on hand").
The database returns the number of products on hand, the register decrements that
number
by one to represent the iPod that was just sold, and then the register updates the
database
with the new inventory number.
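Sketched in SQL, and again using illustrative names (with on_hand standing in for the inventory count), the exchange might look like this:

    -- How many products with this barcode are on hand?
    SELECT on_hand FROM products WHERE id = 885909054336;

    -- The register subtracts one and writes the new count back.
    UPDATE products SET on_hand = 1471 WHERE id = 885909054336;

In practice the register could also let the database do the arithmetic with SET on_hand = on_hand - 1, which behaves better if two registers happen to sell the same product at the same moment.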
In Figure 4, the database responds to the UPDATE query with UPDATE 1, which simply
means one record was updated successfully.
Now that the number of iPods on hand has been changed, how does one verify the new
number? With another SELECT query, of course! This is shown in Figure 5.
Now, the register has updated the database to reflect the iPod you just purchased
and
verified the new number of iPods on hand. That was pretty simple, wasn't it?
More on Databases
You now know databases are made of tables, which are, in turn, made of records.
Each
record has values for specific columns, and in many cases, a record can be uniquely
identified by one of those values.
In our example, the barcode number uniquely identified the iPod, which cost $200,
in the
products table. You have also seen that values in a database can be modified. In
this case,
the number of iPods on hand was changed from 1,472 to 1,471.
Database Systems
Early Databases
In the 1960s, the System Development Corporation, one of the world's first computer
software companies and a significant military technology contractor, first used the term
"data base" to describe a system to manage United States Air Force personnel. The term
"databank" had also been used in the 1960s to describe similar systems, but the public
seemed less accepting of that term and eventually adopted the word "database", which is
universally used today.
A number of corporations, notably with IBM and Rockwell at the forefront, developed
database software throughout the 1960s and early 1970s. MUMPS (also known as M),
developed by a team at Massachusetts General Hospital in the late 1960s, was the
first
programming language developed specifically to make use of database technology.
In 1970, the relational database model was born. Although this model was more
theoretical
than practical at the time, it took hold in the database community as soon as the
necessary
processing power was available to implement such systems.
The advent of the relational model paved the way for Ingres and System R, which
were
developed at the University of California at Berkeley and IBM, respectively, in
1976. These
two database systems and the fundamental ideas upon which they were built evolved
into
the databases we use today. Oracle and DB2, two other very popular database
platforms,
followed in the footsteps of Ingres and System R in the early 1980s.
Modern Databases
The Ingres system developed at Berkeley spawned some of the professional database
systems we see today, such as Sybase, Microsoft SQL Server, and PostgreSQL.
Now, PostgreSQL is arguably the most advanced and fastest free database system
available,
and it is widely used for generic and specific database applications alike. MySQL
is another
free database system used in roughly the same scope of applications as PostgreSQL.
While
MySQL is owned and developed by a single company, MySQL AB in Sweden, PostgreSQL
has
no central development scheme, and its development relies on the contributions of
software
developers around the world.
IBM's System R database was the first to use the Structured Query Language (SQL),
which
is also widely used today. System R, itself, however, was all but abandoned by IBM
in favor
of focusing on more powerful database systems like DB2 and, eventually, Informix.
These
products are now generally used in large-scale database applications. For example,
the Wal-
Mart chain of large department stores has been a customer of both DB2 and Informix
for
many years.
The other major player in the database game, Oracle, has been available under a
proprietary license since it was released as Oracle V2 in 1979. It has undergone a
number
of major revisions since then and, in 2007, was released as Oracle 11g. Like DB2
and
Informix, Oracle is mostly used for very large databases, such as those of global
chain
stores, technology companies, governments, and so forth. Because of the similar
client
bases enjoyed by IBM and Oracle, the companies tend to be mutually cooperative in
database and middleware application development.
In order of market share in terms of net revenue in 2006, the leaders in database
platform
providers are Oracle, with the greatest market share; IBM; and Microsoft.
While the database systems with the greatest market shares use SQL as their query
language, other languages are used to interact with a handful of other relatively
popular
databases. Most developers will never encounter these languages in their daily
work, but for
purposes of being complete, some of these languages are IBM Business System 12,
EJB-QL,
Quel, Object Query Language, LINQ, SQLf, FSQL, and Datalog. Of particular note is
IBM
Business System 12, which preceded SQL but was, for some time, used with System R
instead of SQL due to SQL being relationally incomplete at the time.
Today, organizations with large database projects tend to choose Oracle, DB2,
Informix,
Sybase, or Microsoft SQL Server for their database platforms because of the
comprehensive
support contracts offered in conjunction with those products. Smaller organizations
or
organizations with technology-heavy staff might choose PostgreSQL or MySQL because
they
are free and offer good, community-based support.
Terminology
Most modern database systems employ the idea of the relational database, and they are
properly called Relational Database Management Systems (RDBMS). The distinction
between a DBMS and an RDBMS, unless critical to the understanding of a specific
topic, is not
made in these articles.
Database Interaction
Efficient interaction, efficient storage, and efficient processing are the three
key properties
of a successful database platform. In this article, we explore the first: efficient
interaction.
Many database platforms are shipped with a simple command line utility that allows
the
user to interact with the database. PostgreSQL ships with psql, which gives the
user
extensive control over the operation of the database and over the tables and schema
in the
database. Oracle's SQL*Plus and MySQL's mysql command line client are similar utilities. Collectively,
these are
also called SQL shells.
Another method for interacting with a database is through an application. This
indirect
interaction might occur, for example, when a bank customer is withdrawing money
from an
ATM. The customer only presses a few buttons and walks away with cash, but the
software
running on the ATM is communicating with the bank's database to execute the customer's
transaction. Applications that need to interact with databases can be written in
nearly all
programming languages, and almost all database platforms support this form of
interaction.
A command line client usually provides the most robust functionality for
interacting with a
database. And, because they are usually developed by the same people who developed
the
database platform, command line clients are typically also the most reliable. On
the other
hand, effectively using a command line client to its full extent requires expert
database skill.
The "help" features of command line clients are often not comprehensive, so
figuring out
how to perform a complex operation may require extensive study and reference on the
part
of the user. Some basic usage of the PostgreSQL command line client is shown in
Figure 1.
All command line clients operate in a similar manner to that shown in Figure 1. For
users
with extensive knowledge of SQL, these clients are used frequently.
One typically accesses an SQL command line client by logging into the database
server and
running it from the shell prompt of a UNIX-like operating system. Logging into
the
server may be achieved via telnet or, preferably, SSH. In a large company, the
Information
Technology department may have a preferred application for these purposes.
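As a rough illustration, a short session with the PostgreSQL command line client might look like the following (the host, user, and database names are placeholders):

    $ psql -h dbserver.example.com -U clerk store_db
    store_db=> \dt
    store_db=> SELECT name, price FROM products WHERE id = 885909054336;
    store_db=> \q

Here \dt lists the tables in the current database and \q exits the client; everything else typed at the prompt is ordinary SQL.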
GUI Clients
The simplest way to think about a GUI client is to consider it to be a
sophisticated, flashy
wrapper around a command line client. Really, it falls into the third category of
interaction,
application development, but since the only purpose of this application is to
interface with
the database, we can refer to it separately as a GUI client.
The GUI client gives the user an easy-to-use, point-and-click interface to the
internals of the
database. The user may browse databases, schemas, tables, keys, sequences, and,
essentially, everything else the user could possibly want to know about a database.
In most
cases, the GUI client also has a direct interface to a simulated command line, so
the user
can enter raw SQL code, in addition to browsing through the database. Figure 2
shows the
object browser in pgAdmin III, a free, cross-platform GUI client for PostgreSQL.
Figure 2. The object browser in pgAdmin III
With an easy tree format to identify every element of the database and access to
even more
information with a few simple clicks, the GUI client is an excellent choice for
database
interaction for many users.
Figure 3 shows the Server Information page of MySQL Administrator, the standard GUI
tool
for MySQL databases.
For example, the software running on an ATM at a bank needs to access the bank's
central
database to retrieve information about a customer�s account and then update that
information while the transaction is being performed.
Many database access extensions for modern programming languages exist, and they
all
have their advantages and caveats. The expert database programmer will learn these
caveats, however, and eventually become comfortable and quite skilled at
manipulating
database objects within application code.
Figure 4 and Figure 5 show the code for a simple database application written in
Perl and its
output, respectively.
With all the features of modern programming languages, extremely complex database
applications can be written. This example merely glosses over the connection,
query, and
disconnection parts of a database application.
Database Overview
You have been using databases for a few years, and you think you are at the top of
your
game. Or, perhaps, you have been interested in databases for a while, and you think
you
would like to pursue a career using them, but you do not know where to start. What is
the next
step in terms of finding more rewarding education and employment?
There are two routes people normally take in order to make themselves more marketable
and, at
the same time, advance their database skills. The first, earning an IT or computer
science
degree, requires more effort and time than the second, which is completing a
certification
program.
If you do not have a science, engineering, or IT degree yet and you want to keep
working
with databases and IT for at least a few years, the degree would probably be worth
the
time. For that matter, if you already have an undergraduate degree, then perhaps a
master's degree would be the right choice for you? Master's degrees typically only
require
three semesters of study, and they can really brighten up a resume. An MBA is a
good
option, too, if the idea of managing people instead of doing technical work suits
your fancy.
Your employees would probably let you touch the databases once in a while, too!
Many universities offer evening classes for students who work during the day, and
the
content of those classes is often focused on professional topics, rather than
abstract or
theoretical ideas that one would not regularly use while working in the IT field.
Online
universities like the University of Phoenix also offer IT degrees, and many busy
professionals have been earning their degrees that way for years now.
Certifications, while quite popular and useful in the late 1990s and early 2000s,
seem to be
waning in their ability to make one marketable. That said, getting a certification
is much
quicker than earning a degree and requires a minimal amount of study if the
certificate
candidate already works in the relevant field.
The certification will also highlight a job applicant's desire to "get ahead" and "stay ahead"
in the field. It may not bump the applicant up the corporate food chain like an MBA
might,
but it could easily increase the dollar amount on the paychecks by five to ten
percent or
more.
If you feel like you could be making more money based on your knowledge of
databases,
exploring the degree and certification avenues of continuing education may be one
of the
best things you do for your career.
Relational Databases
Popular, modern databases are built on top of an idea called "relational algebra", which
defines how "relations" (e.g. tables and sequences in databases) interact within the entire
"set" of relations. This set of relations includes all the relations in a single
database.
Knowing how to use relational algebra is not particularly important when using
databases;
however, one must understand the implications certain parts of relational algebra
have on
database design.
Relational algebra is part of the study of logic and may be simply defined as "a set of
relations closed under operators". This means that if an operation is performed on one or
more members of a set, another member of that same set is produced as a result.
Mathematicians and logicians refer to this concept as "closure".
Integers
Consider the set of integers, for example. The numbers 2 and 6 are integers. If you
add 2 to
6, the result is 8, which is also an integer. Because this works for all integers,
it can be said
that the set of integers is closed under addition. Indeed, the set of integers is
closed under
addition, subtraction, and multiplication. It is not closed under division,
however. This can
be easily seen by the division of 1 by 2, which yields one half, a rational number
that is not
an integer.
Database Relations
Using the integer example as a starting point, we can abstract the idea of closure
to
relations. In a relational database, a set of relations exists. For the purposes of
initially
understanding relational databases, it is probably best to simply think of a
relation as being
a table, even though anything in a database that stores data is, in fact, a
relation.
Performing an operation on one or more of these relations must always yield another
relation. If one uses the JOIN operator on two tables, for example, a third table
is always
produced. This resulting table is another relation in the database, so we can see
relations
are closed under the JOIN operator.
Relations are closed under all SQL operators, and this is precisely why databases
of this
nature can be called relational databases.
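For example, joining a customers relation to an orders relation yields a new relation that can itself be queried further (the table and column names here are purely illustrative):

    SELECT customers.name, orders.total
    FROM customers
    JOIN orders ON orders.customer_id = customers.id;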
Concurrency and reliability have long been "hot topics" of discussion among
developers and
users of distributed systems. The fundamental problem can be seen in a simple
example, as
follows.
Suppose two users are working on the same part of a database at the same time. They
both
UPDATE the same row in the same table, but they provide different values in the
UPDATE.
The UPDATE commands are sent to the database precisely at the same time. What does
the
database system do about this, and what are the rules governing its decision?
ACID
When discussing concurrency and reliability, developers often talk about the
components of
ACID: atomicity, consistency, isolation, and durability. Together, these properties
guarantee
that a database transaction is processed in a reliable, predictable manner. A
transaction, in
this case, can be defined as any set of operations that changes the state of the
database. It
could be something as simple as reading a value, deciding how to manipulate that
value
based on what was read, and then updating the value.
Atomicity
Atomicity treats a transaction as a single, indivisible unit of work. It says that, if any
operations involved in a transaction fail, the entire transaction fails, and the database is
restored to the
state before the transaction began. This prevents a transaction from being
partially
completed.
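Sketched in SQL, an all-or-nothing transfer between two hypothetical account rows might look like this; if either UPDATE fails before the COMMIT, the whole transaction is undone:

    BEGIN;                                                       -- start the transaction
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;    -- take money out
    UPDATE accounts SET balance = balance + 100 WHERE id = 2;    -- put money in
    COMMIT;                                                      -- make both changes permanent
    -- If anything goes wrong before COMMIT, a ROLLBACK restores the prior state.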
Database Consistency
Consistency is probably the most fundamental of the four ACID components. As such,
it is
arguably the most important in many cases. In its most basic form, consistency
tells us that
no part of a transaction is allowed to break the rules of a database.
In this example, if consistency were not upheld, the offending NULL value would still not be
added as part of the row, but the remaining parts of the row would be added. However,
since no value would be specified for the NOT NULL column, it would revert to NULL anyway
and violate the rules of the database. The subtleties of consistency go far beyond an
obvious conflict between NOT NULL columns and NULL values, but this example is a
clear
illustration of a simple violation of consistency. In Figure 1, we can see that no
part of a row
is added when we try to violate the NOT NULL constraint.
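A small sketch of the kind of violation described above, using an illustrative table definition:

    CREATE TABLE products (
        id    BIGINT PRIMARY KEY,
        name  TEXT NOT NULL,       -- every product must have a name
        price NUMERIC(10,2)
    );

    -- This INSERT supplies no value for the NOT NULL column, so the database
    -- rejects the entire row rather than storing a partial one.
    INSERT INTO products (id, price) VALUES (885909054336, 200.00);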
Isolation
Isolation guarantees that concurrently running transactions do not interfere with one
another; each transaction behaves as though it had the database to itself, and uncommitted
intermediate changes are not visible to other transactions.
Durability
Durability guarantees that, once a transaction has been committed, its changes are
permanent and survive subsequent crashes or power failures.
Distributed Databases
Suppose you created a database for a web application a few years ago. It started
with a
handful of users but steadily grew, and now its growth is far outpacing the
server's
relatively limited resources. You could upgrade the server, but that would only
stem the
effects of the growth for a year or two. Also, now that you have thousands of
users, you are
worried not only about scalability but also about reliability.
If that one database server fails, a few sleepless nights and many hours of
downtime might
be required to get a brand new server configured to host the database again. No
one-time
solution is going to scale infinitely or be perfectly reliable, but there are a
number of ways
to distribute a database across multiple servers that will increase the scalability
and
reliability of the entire system.
Put simply, we want to have multiple servers hosting the same database. This will
prevent a
single failure from taking the entire database down, and it will spread the
database across a
large resource pool.
A distributed database is divided into sections called nodes. Each node typically
runs on a
different computer, or at least a different processor, but this is not true in all
cases.
Horizontal Fragments
One of the usual reasons for distributing a database across multiple nodes is to
more
optimally manage the size of the database.
Suppose, for example, that a table of customer data is split across three servers. Each
server might be responsible for customers with certain ZIP codes. Since ZIP codes are
generally arranged from lowest to highest as they progress westward across the
country,
the actual limits on the ZIP codes might be 00000 through 33332, 33333 through
66665,
and 66666 through 99999, respectively.
In this case, each node would be responsible for approximately one third of the
data for
which a single, non-distributed node would be. If each of these three nodes
approached its
own storage limit, another node or two nodes might be added to the database, and
the ZIP
codes for which they are responsible could be altered appropriately. More
"intelligent"
configurations could be imagined, as well, wherein, for example, population density
is
considered, and larger metropolitan areas like New York City would be grouped with
fewer
other cities and towns.
The application layer would then route requests for certain ZIP codes to the node that
holds them. Regardless of which approach is taken, the distribution of
the database must remain transparent to the user of the application. That is, the
user
should not realize that separate databases might handle different transactions.
Reducing node storage size in this manner is an example of using horizontal
fragments to
distribute a database. This means that each node contains a subset of the larger
database�s
rows.
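Purely as an illustration of the idea, PostgreSQL's declarative range partitioning can express the ZIP-code split described above (the table and column names are assumptions):

    CREATE TABLE customers (
        id   BIGINT,
        name TEXT,
        zip  CHAR(5)
    ) PARTITION BY RANGE (zip);

    CREATE TABLE customers_east    PARTITION OF customers FOR VALUES FROM ('00000') TO ('33333');
    CREATE TABLE customers_central PARTITION OF customers FOR VALUES FROM ('33333') TO ('66666');
    CREATE TABLE customers_west    PARTITION OF customers FOR VALUES FROM ('66666') TO (MAXVALUE);

In a genuinely distributed setup each fragment would live on a separate node rather than in one database, but the range split itself is the same idea.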
Vertical Fragments
A situation that calls for vertical fragments might arise if a table contains
information that is
pertinent, separately, to multiple applications. Using the previous example of a
database
that stores customer information, we might imagine an airline's frequent flyer program,
where a single table holds both each customer's contact details and the miles that
customer has flown.
These sets of data have different applications: the customer information is used
when
mailing tickets and other correspondence, and the mileage information is used when
deciding how many complimentary flights a customer may purchase or whether the
customer has flown enough miles to obtain "elite" status in the program. Since the
two sets
of data are generally not accessed at the same time, they can easily be separated
and
stored on different nodes.
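A rough sketch of that vertical split, assuming a shared customer_id ties the two fragments together (all names are illustrative):

    -- Node A: contact details used for mailing tickets and correspondence.
    CREATE TABLE customer_contact (
        customer_id BIGINT PRIMARY KEY,
        name        TEXT,
        address     TEXT
    );

    -- Node B: mileage details used by the frequent flyer program.
    CREATE TABLE customer_mileage (
        customer_id BIGINT PRIMARY KEY,
        miles_flown BIGINT,
        elite_level TEXT
    );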
Since airlines typically have a large number of customers, this distribution could
be made
even more efficient by incorporating both horizontal fragmentation and vertical
fragmentation. This combined fragmentation is often called the mixed fragment
approach.
A database can be broken up into many smaller pieces, and a large number of methods
for
doing this have been developed. A simple web search for something like �distributed
databases� would probably prove fruitful for further exploration into other, more
complex,
methods of implementing a distributed database. However, there are two more terms
with
which the reader should be familiar with respect to database fragmentation.
The first is homogeneous distribution, which simply means that each node in a
distributed database is running the same software with the same extensions and so
forth. In
this case, the only logical differences among the nodes are the sets of data stored
at each
one. This is normally the condition under which distributed databases run.
However, one could imagine a case in which multiple database systems might be
appropriate for managing different subsets of a database. This is called
heterogeneous
distribution and allows the incorporation of different database software programs
into one
big database. Systems like this are useful when multiple databases provide
different feature
sets, each of which could be used to improve the performance, reliability, and/or
scalability
of the database system.
Replication
Replication means that each node keeps a complete copy of the database, and every change
must be propagated to every copy. Replication has a time expense because each operation
performed on one node's database
must be performed on each other node's database simultaneously. Before the
operation can
be said to be committed, each other node must verify that the operation in its own
database
succeeded. This can take a lot of time and produce considerable lag in the
interface to the
database.
And, replication has data expense because every time the database is replicated,
another
hard drive or two or more fills up with data pertaining to the database. Then,
every time
one node gets a request to update that data, it must transmit just as many requests
as
there are other nodes. And, confirmations of those updates must be sent back to the
node
that requested the update. That means a lot of data is flying around among the
database
nodes, which, in turn, means ample bandwidth must be available to handle it.
Many of the more popular databases support some sort of native replication. MySQL, for
example, ships with built-in replication; a replica is authorized to connect to the source
server by granting its account the REPLICATION SLAVE privilege. PostgreSQL, on the other
hand, requires external software for replication.
This usually happens in the form of Slony-1, a comprehensive replication suite.
Each
database platform has a different method for initiating replication services, so it
is best to
consult that platform's manual before implementing a replication solution.
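For illustration only, the MySQL side of authorizing a replica account might look roughly like this (the account name and host are placeholders, and a real setup also involves server configuration, so the MySQL manual should be consulted):

    -- On the source server: create an account the replica will connect as,
    -- then grant it the privilege to read the replication stream.
    CREATE USER 'repl'@'replica.example.com' IDENTIFIED BY 'choose-a-strong-password';
    GRANT REPLICATION SLAVE ON *.* TO 'repl'@'replica.example.com';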
Considerations
When implementing a distributed database, one must take care to properly weigh the
advantages and disadvantages of the distribution. Distributing a database is a
complicated
and sometimes expensive task, and it may not be the best solution for every
project. On the
other hand, with some spare equipment and a passionate database developer,
distributing a
database could be a relatively simple and straightforward task.
The most important thing to consider is how extensively your database system
supports
distribution. PostgreSQL, MySQL, and Oracle, for example, have a number of native
and
external methods for distributing their databases, but not all database systems are
so
robust or so focused on providing a distributed service. Research must be performed
to
determine whether the database system supports the sort of distribution required.
Business Data
Business data refers to the information about people, places, things, business
rules, and
events in relation to operating a business.
Knowing things about people and their buying behaviors can help a company generate very
important business data. For instance, statisticians and market researchers know that
certain age groups have unique buying habits. Races and people from different demographic
locations also have buying patterns of their own, so gathering this information in one
business database can be a good basis for future targeted marketing.
In terms of production, business data about where to get raw materials, how much they
cost, what the customs and importation policies of the raw materials' country of origin are,
and other such information are also very important.
There are many software applications that manage business data for easy statistical analysis.
Many companies maintain a business data warehouse where data from several sources are
collected and integrated every few minutes. These repositories of business data may supply
the information needed to generate reports and recommendations in an intelligent
manner. Hence the term Business Intelligence is already widely used in the business
industry today.
For processing billions of pieces of business data in the data warehouse for business
intelligence, companies use high-powered and secure computer systems that are installed with
different
levels of security access.
Several software applications and tools have been developed to gather and analyze
large
amounts of unstructured business data ranging from sales statistics, production
metrics to
employee attendance and customer relations. Business intelligence software
applications
vary depending on the vendor, but the common attribute in most of them is that they
can be
customized based on the needs and requirements of the business company. Many
companies have in-house developers to take care of business data as the company
continues to evolve.
Some examples of business intelligence tools to process business data include scorecarding,
Business activity monitoring, Business Performance Management and Performance
Measurement, Enterprise Management systems and Supply Chain Management/Demand
Chain Management. Free Business Intelligence and open source products include
Pentaho,
RapidMiner, SpagoBI and Palo, an OLAP database.
Business data is the core of the science of Analytics. Analytics is the study of
business data
that uses statistical analysis in knowing and understanding patterns and trends in
order to
foresee or predict business performance. Analytics is commonly associated with the data
generated by business activities: acquiring or capturing those data, and maintaining them
in the data resource.
Every day, billions of pieces of data and information get carried across different
communications media. The number one medium is of course the internet. Other data
communications media are television and mobile phones.
Any non-business individual or entity may not find the real significance of these data. They
are merely there because it is innate in people to communicate and get connected with
each other.
But individuals or organizations who think about business take advantage of these
data in a
business-driven approach. They try to collect, aggregate, summarize and
statistically
analyze data so they know what products they may want to see and who among the
people
will be their target market.
Many people who started from scratch but were driven by passion for business have
created
multinational corporations. Traditional vendors started from the sidewalk and took
a
business-driven approach to move their wares from the streets to the internet,
where
millions of potential buyers converge.
Today's businesses are very closely dependent on the use of technology. More and
more
transactions are going online. With the possibility of high security money
transfers, people
can buy items from continents away using just their credit cards. They can also
maintain
online accounts and make fund transfers in seconds at one click of the mouse.
Software application developers and vendors are coming up with innovative tools to
help
businesses optimize their performance. Large data warehouses, those repositories of all
sorts of data getting bulkier every single minute, are being set up, and companies are
investing in high-powered computers connected to local area networks and the internet.
To manage these data warehouses, database administrators use relational database
technology with the aid of tools like Enterprise Resource Planning (ERP), Customer
Resource
Management (CRM), Advanced Planner and Optimizer (APO), Supply Chain Management
(SCM), Business Information Warehouse (BIW), Supplier Relationship Management
(SRM),
Human Resource Management System (HRMS) and Product Lifecycle Management (PLM)
among thousands of others.
To keep a business stable and credible, especially when dealing with a global market using
the internet, it needs to ensure the security of critical data and information. Keeping
these data
safe is a responsibility which requires coordinated, continuous and very focused
efforts. Companies should invest in infrastructures that shield critical resources.
To keep the business afloat in all situations, it has to see to it that procurement
and
contract management is well maintained and coordinated. It has to wield substantial
purchasing power to cope with new and highly competitive contracting methods.
A business needs to constantly evaluate enterprise resource planning implementation
and
should have a distinct department to manage finance and human resources with
principles
that are sound.
Business Drivers
Business drivers are the people, information, and tasks that support the fulfillment of a
business objective. They lead the company, steering it away from pitfalls and turning
unforeseen mistakes into good lessons for future success and sustainability.
Technology is fast evolving and businesses that do not evolve with technology will
suffer
tremendously. The world of business is a competitive sink or swim arena. Every
single day,
a business is exposed to risks and challenges but this is just a natural thing.
The foremost key business drivers of course are the staff and talents, the people. Being
the literal brains behind every business, people set the objectives, execute critical
decisions, and innovate constantly to move the business forward. Human resources
departments are scouting for the best people in the business industry every day.
Different
people have different aptitudes so it is important to employ only those with
business
stamina. While most companies prefer people with advanced degrees, specifically master's
and doctoral degrees in business administration and the like, there are many people who have
not
even finished college but have innate skills for running a business effectively.
The fast rise in popularity of open source and open standards has also been a significant
business driver in recent years. Open standards make possible the sharing of non-proprietary
formats, so portability is no longer an issue. In the past, the integration of a company's
internal systems was expensive because different computing platforms could not effectively
communicate with each other. Now, with more bulk data being transferred from one data
warehouse to another, open standards are making things a lot faster and more efficient.
Open source, on the other hand, gives companies free software or software with a minimal
fee, and thus saves the company a lot of money that can be used for other investments.
Open source software applications are in no way inferior to commercial applications. With
open source, anybody can make additions to the source code, so this can be a good way to
customize software to the business architecture.
Business Architecture
Business architects should also determine the scope of the business. How can the
company
grow and branch out to areas or countries? What is the expected annual growth based
on
product manufacture and sales revenues?
There are many more considerations to account for, and once all these are in place, a
business architect starts drawing the design. The design has to cater to all
aspects of
the business. There is no trivial aspect in business as tiny details can create
huge impact
on the organization as a whole.
As one of the layers of the IT architecture, the business architecture is the very framework
on which the other layers (information, application, and technology) are based.
Business
data constitute information which may be used by business software applications
which are
executed by hardware technology. All these other layers operate within the business
architecture framework.
The nature of business is fast evolving. Many traditional businesses have evolved from a
local place to global proportions. One mobile phone company has evolved from wood pulp
mills to rubber works to what today is perhaps one of the global leaders in technological
innovations. Some insurance companies also consider themselves banks and vice versa.
These complex business evolutions need to be tracked so that appropriate changes in the
business architecture can also be taken care of by the information systems architects.
Business Experts
Business experts are people who thoroughly understand the business and the data
supporting the business. They know the specific business rules and processes.
In a typical company setup, a person has a long way to climb up the corporate ladder before
he or she lands a top management position. During the period that he is climbing the
corporate ladder, he learns valuable lessons about the company's objectives, decision
making guides, internal and external policies, business strategy and many other aspects of
the business.
The chief executive officer (CEO) is the business organization's highest ranking
executive.
He is responsible for running the business, carrying out policies of the board of
directors and
making decisions that can highly impact the company. The CEO must not only be a
business expert in general but must also be a business expert in particular about
all the
details of the business he is running. He can be considered the image of the
company and
his relationship both internally and externally with other companies is very vital
for the
success of the business.
Accountants are key business experts responsible for recording, auditing and inspecting the
financial records of a business and preparing financial and tax reports. Most accountants
give recommendations by laying out projected sales, income, revenue and other figures.
Marketing people, whether marketing managers or staff, are constantly on the look
out for
marketing trends. They gather different statistical data and demographics so they know the
target market for goods and services. They work closely with advertising people.
Business software developers are among the top notch information technology
professionals. Aside from mastering the technical aspect of IT like computer
languages and
IT infrastructure, they must also know the very framework of the business
architecture. Their applications are made to automate business tasks like
transaction
processing and all kinds of business related reporting.
Customer relations specialists take care of the needs of clients, especially the current
and loyal
ones. They make sure that clients are satisfied with the products and services.
They also
act like marketing staff by recommending other products and service to clients.
Their main
responsibility is keeping the clients happy, satisfied and wanting more.
Human resource staff take care of hiring the best minds suited for the business.
Since
businesses offer different products and services, the human resource staff is
responsible for
screening potential employees to handle the business operations. In order for the
human
resource staff to match the potential candidate, they (the HR staff) should also
know the ins
and outs of the business they are dealing with.
There are also business experts who do not want to be employed in other companies
but
instead they want to have their own business and be their own boss. These people
are
called entrepreneurs. Entrepreneurs invest their own money in the business of their choice.
They make feasibility studies first before throwing in their money, or they
may hire
the services of a business consultant.
A business consultant is a seasoned business expert who has a lot of business success
experience under his belt. Most business consultants are not fully attached to one
company alone. Business consultants know all aspects of the business. A consultant
recommends actions by studying the financial status and transaction history of the company
to which he is offering his services. Many companies offer business consultancy as their
main line of service.
Corporate lawyers focus on laws pertaining to business. They take charge of the
contracts
and represent companies during time of legal misunderstanding with other entities.
In all
undertakings whether traditional or new, corporate lawyers will have to ensure that
the
company does not violate any law of a given country.
Business Schema
Business rules are very important in defining a business schema. Business rules can
influence and guide behavior, lending support to policies and to responses to environmental
events and situations. Rules may be the top means whereby a business organization directs
its movement, defines its objectives and performs appropriate actions.
The terms in a business schema are the precise definitions of the words used in the business
rules. Order, Product Type and Line Item are terms that refer to entity classes, which are
things of heavy significance to the business. Attributes are terms that describe an entity
class. For example, total number and total value are attributes of an order. Attributes of
Product Type may include manufacturer, unit price and materials used. Quantity and
extended value are attributes of the Line Item entity class.
Facts, another kind of business rule in the schema, describe a thing, such as the role it
plays. A data model has three kinds of facts: relationships, attributes, and
supertypes/subtypes.
A derivation can be any attribute that is derived from other attributes or system
variables. For example, the extended value attribute of the Line Item entity class can be
determined by multiplying the quantity of the line item by the unit price of the product
type.
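Expressed as a simple query over illustrative line_items and product_types tables, the derivation might look like this:

    SELECT li.quantity,
           pt.unit_price,
           li.quantity * pt.unit_price AS extended_value   -- derived attribute
    FROM line_items li
    JOIN product_types pt ON pt.product_type_id = li.product_type_id;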
Many companies hire consultants to document and consolidate the standards, rules,
policies and practices. Then these documentations are handed to IT and database
consultants so that they can be transformed into database rules that follow the
business
schema.
Many independent or third party consultancy firms, research institutions and
software
application vendors and developers offer rich business schema solutions. Even if
business
rules are constantly changing and software applications are already in final form,
these
software applications can be customized to be in sync with the constantly changing
business
rules in particular and the schema in general.
A business rules engine is a widely used software application that manages and automates
business rules so that they follow legal regulations. For example, the law that
states "An
employee can be fired for any reason or no reason but not for an illegal reason" is
ensured
to be followed by the software.
The rule engine's most significant function is to help classify, register and manage business
and legal rules and verify them for consistency. Likewise, it can infer some rules based on
other existing rules and relate them to IT applications. Rules in IT applications can
automatically detect unusual situations arising from the operations.
With a business schema clearly defined, there is little room for mistake in running
a
business successfully.
Business-Driven Data Distribution
Business-driven data distribution refers to the situation where the business need
for data at
a specific location drives the company to develop a data site and the distribution
of data to
the said newly developed site. This data distribution is independent of a
telecommunications
network.
Having a data warehouse can have tremendous positive impact on a company because
the
data warehouse can allow people in the business to get timely answers to many
business
related questions and problems. Aside from answers, data warehouses can also help
the
company spot trends and patterns in the spending lifestyles of both their existing clients and
potential customers.
Building a data warehouse can be expensive. With today's businesses relying heavily
on
data, it is not uncommon for companies to invest millions of dollars on IT infrastructure
alone. Aside from that, companies still need to spend on hiring IT staff and
consultants.
For many companies, the best way to grow is to utilize the internet as the main
market
place of their goods and services. With the internet, a company can be exposed to
the
global market place. People from different places can have easy access to goods or
make
inquiries when they are interested. Customer relationships can be sealed a lot faster
without staff having to travel to far-off places. The company can have a presence in virtually all
countries
around the world.
But with a wide market arena comes greater demand for data. A single data warehouse
location alone may not be able to handle the bulk of data that needs to be gathered,
aggregated,
analyzed and reported.
This is where business-driven data distribution comes in. For instance, a company
that has
mainly operated in New York and London has one data warehouse to serve both
offices. But
since many people on the internet have come to know of the company and availed of
their
services and have since then been loyal clients, many different locations are starting to
generate bulk data. If, say, Tokyo has an exponentially increasing pool of clients, the
company should already consider creating a new data warehouse in that location. This data
warehouse can be the source of data to distribute to other data warehouses located
in other
areas but maintained by the same company.
If all these questions are coming up with positive answers, then the next logical
step would
be to design and implement the new data warehouse that will become the new
business-
driven data distribution center. As in most data warehouses, the appropriate high-powered
computers and servers will have to be bought and installed. Relational database
software
applications and other user friendly tools should also be acquired.
As time progresses, a company may have several locations that have data warehouses acting
as business-driven data distribution centers. These separate warehouses continually
communicate with each other, transferring data updates every minute and synchronizing and aggregating
data
so business staff can get the most recent trends and patterns in the market.
With this setup, a company can have a very competitive advantage over its competitors.
They can formulate new policies, strategies and business rules based on the demand
created from the reports of the many business driven data distribution centers
around the
world.
What is Business Activity?
Business activities utilize all data resources and platform resources in order to perform the
specific tasks and duties of the company.
For a company to survive in the business, it must have a competitive edge in the
arena
where multitudes of competitors exist. Competitive advantages can be had by having
rigorous and highly comprehensive methods for current and future rules, policies
and
behaviors for the processes within the organization.
Business activities are the bases for setting management procedures that support the
company in achieving the best financial quality and value in IT operations and services.
Although these broad categories may apply to many businesses, companies have
differing
needs for these categories. For instance, a law firm may not need to acquire raw materials
as furniture companies do. Hotels need intensive marketing while an environmental nonprofit
organization may not need marketing at all.
Several vendors sell IT solutions to take care of business activities and expedite manual
work by automating it through computers.
An Enterprise Resource Planning (ERP) system is a software based application that
integrates all data processes of a company into one efficient and fast system.
Generally,
ERP uses several components of the software application and hardware system to achieve
the
integration. An ERP system is highly dependent on relational databases and involves huge
data requirements. Companies set up data warehouses to feed the ERP.
Supply Chain Management (SCM) covers the planning, implementation and control of
operations related to the storage of raw materials and work-in-process inventory, and the
movement of finished products from point of origin to point of consumption.
A supply management system takes care of the methods and processes involved in
institutional or corporate buying. Corporate buying may include purchasing of raw materials or
already
finished goods to be resold.
Data Tracking
Data tracking involves tracing data from the point of its origin to its final
state. Data
tracking is very useful in a data warehouse implementation. As is very well known, a data
warehouse is a very complex system that involves disparate data coming from various
operational data sources and data marts.
Hence, data keeps traveling from one server to another. Data tracking helps develop
data
collection proficiency at each site when proper management actions are being taken
to
ensure data integrity.
Some servers handle data processes by archiving raw, computed and test data automatically.
Raw data are stored exactly in the format in which they were received, including the header,
footer and all other information about the data.
Data tracking can be employed to improve the quality of transactions. For example, suppose
I make a withdrawal using an automated teller machine (ATM) and something unexpected
happens to my transaction which results in the machine not dispensing any money but
deducting the amount from my balance, as reflected in the receipt spewed out by the machine.
When I report this incident to the bank authorities, they can easily trace the series of events
by tracking the data I entered into the machine and the activities that I and the ATM
performed. Because the data is tracked, they can easily spot the patterns which led to the
problem and, from there, they can immediately take actions to improve the service.
Data tracking can also be used in cases of fraud being committed. Inside the
company, if
there is an erring person, the data tracking process may involve inspecting the audit
trail logs.
Some websites offer aggregated data through data tracking by acquiring certain
fields of
data using remote connection technologies. For instance, web applications can track the
sales of a company, with the figures refreshed regularly, so that while a staff member is on
business travel or simply on holiday, he can still see what is happening in the business
operation.
In another example, a website may offer a service such as an hourly update of the
global
weather, and this can be done by tracking data from data sites in different geographical
locations.
But despite its uses, there are also issues associated with data tracking. The issue of
security and privacy is one of the biggest areas of concern with data tracking. Many websites
try to install small pieces of code in web browsers so that they can track the return visits of
an internet user and bring up the preferences he has specified during his previous visits.
This information about the preferences is tracked from the small code copies on the
computer. This code is called "cookie". Cookies in themselves are intended for good
use but
there many coders who have exploited their use by making them track sensitive
information
and steal them for bad purposes.
There are many specialized tools for data tracking, available either as free downloads or as commercial products that come with a fee. These data tracking tools have easy-to-use graphical dashboards and efficient back-end programs that can give users the data they need to track on a real-time basis.
Graphical data tracking applications like these make for excellent monitoring tools for database or data warehouse administrators who want to keep track of data traveling from one data mart or operational data source to another. The graphical presentation helps the administrator easily spot erring data and get an instant idea of where that data is currently located.
In today's fast-paced business environment, data tracking tools and devices can greatly enhance the information system of any organization, enabling it to make wise decisions and corresponding moves in the face of challenging situations.
Intelligent Agent
There are two general types of intelligent agents: the physical agent and the temporal agent. The physical agent uses sensors and other less abstract, more tangible means to do its job. The temporal agent, on the other hand, may be purely code that uses time-based stored information triggered according to its configuration.
Beyond those two general types, intelligent agents can be divided into five classes based on the degree of their functionality and capability: simple reflex agents, model-based reflex agents, goal-based agents, utility-based agents and learning agents.
The simple reflex agent functions on the basis of its most current perception and on condition-action rules of the form "if condition then action". The success of the agent's job depends on how fully observable the environment is.
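As a minimal sketch of a condition-action rule, consider a toy vacuum-cleaner agent in Python; the percepts, locations and actions here are made up purely for illustration.

def reflex_vacuum_agent(percept):
    """Simple reflex agent: maps the current percept straight to an action
    via condition-action rules, keeping no internal state or history."""
    location, status = percept
    if status == "dirty":
        return "suck"
    if location == "A":
        return "move_right"
    return "move_left"

# Illustrative percepts for a two-square world.
for percept in [("A", "dirty"), ("A", "clean"), ("B", "dirty"), ("B", "clean")]:
    print(percept, "->", reflex_vacuum_agent(percept))

Because the agent reacts only to the current percept, it works well only when that percept tells it everything it needs to know about the environment.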
The model-based reflex agent stores its current state internally, maintaining structures that describe the parts of the world it cannot currently see, so it can handle environments that are only partially observable. The behavior of this kind of agent requires information about how the world works, and it is therefore sometimes said to hold a world-view model.
The goal-based agent is essentially a model-based agent that additionally stores information about which situations are desirable, allowing it to choose good actions from among many possibilities.
The utility-based agent uses a function that maps a state to a measure of that state's utility.
Finally, a learning agent is a self-governing intelligent agent that can learn and adapt to constantly changing situations. It can quickly learn even from large amounts of data, and its learning can be online and in real time.
Intelligent agents have become a new paradigm for software development. The concept behind intelligent agents has been hailed as "the next significant breakthrough in software development". Today, intelligent agents are used in an increasingly wide variety of applications.
Operational Database
An operational database, as the name implies, is a database that is currently and progressively in use, capturing real-time data and supplying data for real-time computations and other analytical processes.
For example, an operational database is the one used for taking orders and fulfilling them in a store, whether a traditional store or an online store. Other business areas that use an operational database include catalog fulfillment systems and Point of Sale systems used in retail stores. An operational database is used for keeping track of payments and inventory. It captures credit card information and amounts, and accountants rely on it because it must balance to the last penny.
An operational database is also used to support IRS filings and regulatory requirements, which is why it is sometimes managed by IT on behalf of the finance and operations groups in a business organization. Companies can seldom run successfully without an operational database, as this database is built around accounts and transactions.
Because of the very dynamic nature of an operational database, there are certain issues that need to be addressed appropriately. An operational database can grow very fast in size and bulk, so database administrators and IT analysts must provide high-powered computer hardware and top-notch database management systems.
Most business organizations have regulations and requirements that dictate storing data for longer periods of time. This can create an even more complex setup in relation to database performance and usability. With ever-increasing operational data volumes, operational databases come under additional stress when processing transactions, and things slow down. As a general trend, the more data there is in the operational database, the less efficient the transactions running against it tend to be.
There are several reasons for this. One of the most obvious is that table scans need to reference more pages of data to produce results. Indexes also grow in size to support larger data volumes, and with that growth, index access can degrade because more levels need to be traversed. Some IT professionals address this problem with solutions that offload older data to archive data stores.
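A minimal sketch of this offloading idea, using Python's built-in sqlite3 module; the orders and orders_archive tables and the cutoff date are assumptions made for the example, not a prescribed design.

import sqlite3

# An illustrative operational database (in-memory for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, order_date TEXT, amount REAL)")
conn.execute("CREATE TABLE orders_archive (id INTEGER, order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "2020-01-15", 99.0), (2, "2023-06-01", 45.5), (3, "2019-11-30", 12.0)],
)

cutoff = "2022-01-01"  # rows older than this leave the hot table

# Copy old rows to the archive table, then remove them from the operational table.
conn.execute("INSERT INTO orders_archive SELECT * FROM orders WHERE order_date < ?", (cutoff,))
conn.execute("DELETE FROM orders WHERE order_date < ?", (cutoff,))
conn.commit()

print("active rows:", conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])
print("archived rows:", conn.execute("SELECT COUNT(*) FROM orders_archive").fetchone()[0])

Keeping the operational table small in this way shortens table scans and keeps index depth under control, which is the point of the archiving strategy described above.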
Operational databases are just one part of overall enterprise data management, and some of the data that needs to be archived goes directly to the data warehouse.
Operational Data Store
An operational data store (ODS) is basically a database that serves as an interim area for a data warehouse. As such, its primary purpose is to handle data that is actively in use, such as transactions, inventory and data collected from Point of Sale systems. It works with a data warehouse, but unlike a data warehouse, an operational data store does not contain static data. Instead, it contains data that is constantly updated through the course of business operations.
An ODS is specially designed to quickly perform relatively simple queries on smaller volumes of data, such as finding a customer's orders or looking up available items in a retail store. This is in contrast to a data warehouse, where one needs to perform complex queries on high volumes of data. As a simple analogy, the operational data store is a company's short-term memory, storing only the most recent information, while the data warehouse is the long-term memory, serving as the company's historical data repository whose stored data is relatively permanent.
The history of the operational data store goes back to as early as 1990, when the original ODS systems were developed and used as reporting tools for administrative purposes. But even then, the ODS was already dynamic in nature and was usually updated every day as it provided reports about daily business transactions such as sales totals or orders being filled.
The ODS of that era is now referred to as a Class III ODS. As information technology evolved, so did the ODS, with the arrival of the Class II ODS, which was capable of tracking more complex information such as product and location codes and of updating the database more frequently (perhaps hourly) to reflect changes. Then came Class I ODS systems, which grew out of the development of customer relationship management (CRM).
For many years, IT professionals had great problems integrating legacy applications, as the process entailed enormous maintenance resources, and other efforts did little to take care of the needs of legacy environments. Between experimentation and the development of new technologies, little was left of company IT resources. As IT people experienced with legacy applications, the legacy environment became the child consuming its parent.
Many approaches were tried in response to the problems associated with legacy systems. One approach was data modeling and information engineering, but this proved slow in delivering tangible results. With the growth of legacy systems came growth in complexity as well as in the data model.
Another response to legacy system problems was the establishment of a data warehouse. This has proven beneficial, but a data warehouse only addresses the informational aspect of the company.
The development of the operational data store has gone a long way toward addressing the problems associated with legacy systems. Much like a data warehouse, data from legacy systems is transformed and integrated into the operational data store; once there, the data ages and is then passed into the data warehouse. One of the main roles of the ODS is to present a collective, integrated view of the up-to-the-second operations of the company. It is very useful for corporate-wide, mission-critical applications.
OLAP
OLAP is part of a wide category of business intelligence that includes ETL (extract, transform, load), relational reporting and data mining. Critical areas of a business enterprise where OLAP is heavily used include business reporting for sales and marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas.
Databases that are planned and configured for use with OLAP employ a multidimensional data model, which enables complex analytical and ad-hoc queries with rapid execution time. Output from an OLAP query is displayed in a matrix or pivot format, with dimensions forming the rows and columns of the matrix and measures supplying the values.
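A small sketch of this matrix layout using pandas (assuming pandas is available); the product, region and quantity columns are a made-up fact set, not data from any real system.

import pandas as pd

# A tiny, made-up fact set: each row is one sales transaction.
facts = pd.DataFrame({
    "product": ["Rice", "Rice", "Wheat", "Wheat", "Rice"],
    "region": ["North", "South", "North", "South", "North"],
    "quantity": [20, 15, 10, 5, 30],
})

# Dimensions become the rows and columns; the measure is aggregated into the cells.
matrix = facts.pivot_table(index="product", columns="region", values="quantity", aggfunc="sum")
print(matrix)
# Rice:  North 50, South 15
# Wheat: North 10, South  5

The same pivot can be re-oriented by swapping which dimension supplies the rows and which supplies the columns, which is exactly the "rotation" of data that OLAP tools expose to analysts.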
OLAP is a function of business intelligence software that enables end users to easily and selectively extract data and view it from different points of view. In many cases, OLAP has aspects designed for managers who want to make sense of enterprise information and of how the company fares against the competition in a given industry. OLAP tools structure data in a hierarchical manner, which is exactly the way many business managers think of their enterprises. But OLAP also allows business analysts to rotate data and change relationships and perspectives, giving them deeper insights into corporate information and enabling them to analyze historical as well as future trends and patterns.
The OLAP cube lies at the core of any OLAP system. It is also referred to as the multidimensional cube or hypercube. The cube contains numeric facts, called measures, which are categorized by several dimensions. The metadata of the cube is often created from a snowflake schema or star schema of tables in a relational database: measures are derived from the records in the fact table, and dimensions are derived from the dimension tables.
It is claimed that, for complex queries, OLAP cubes can produce an answer in around 0.1% of the time needed for the same query against OLTP relational data. Aggregation is the key to this performance: in OLAP, aggregations are built from the fact table by changing the granularity of specific dimensions and aggregating the data up along those dimensions.
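A minimal sketch of building such an aggregate from a fact table, again in pandas, by rolling daily rows up to monthly granularity; the table and column names are purely illustrative.

import pandas as pd

# Daily-grain fact rows (made-up data).
fact_sales = pd.DataFrame({
    "sale_date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-14"]),
    "product": ["Rice", "Rice", "Wheat", "Rice"],
    "quantity": [20, 30, 10, 25],
})

# Change the granularity of the date dimension from day to month and aggregate up.
monthly = (
    fact_sales
    .assign(month=fact_sales["sale_date"].dt.to_period("M"))
    .groupby(["month", "product"], as_index=False)["quantity"]
    .sum()
)
print(monthly)  # one pre-aggregated row per (month, product) combination

Precomputing rollups like this at coarser granularities is what lets the cube answer summary queries without re-reading every fact row.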
There are different kinds of OLAP: Multidimensional OLAP (MOLAP), which uses database structures optimized for attributes such as time period, location, product or account code; Relational OLAP (ROLAP), wherein the base data and the dimension tables are stored as relational tables and new tables are created to hold aggregated information; and Hybrid OLAP (HOLAP), which combines aspects of the other types.
OLTP
Many OLTP systems rely on client-server processing and on the brokering of software applications that enable transactions to run on various computer platforms within a network.
Today, with the ubiquity of the internet, more and more people, even in remote areas, are now doing transactions online in an e-commerce environment. The term transaction processing is often associated with the process whereby an online shop or e-commerce website accepts and processes payments through a customer's credit or debit card in real time in return for purchased goods and services.
There are also many OLTP brokering programs that distribute transaction processing among multiple computers on a network, enhancing the functioning of OLTP on more demanding, decentralized database systems. Service-oriented architectures and web services are now commonly integrated with OLTP.
The two main benefits of using OLTP are simplicity and efficiency. OLTP helps simplify a business operation by reducing paper trails and supporting faster, more accurate forecasting of revenues and expenses. It provides a concrete foundation through timely updating of corporate data. For an enterprise's customers, OLTP offers more choice in how they want to pay, gives them more flexibility and entices them to make more transactions. Most OLTP systems offer services to customers 24 hours a day, seven days a week.
But despite the great benefits that OLTP can give to companies and their customers, there are certain issues it needs to address. The main issues pertaining to OLTP are security and economic cost. Because an OLTP implementation is exposed on a network, most often the internet, the database may be susceptible to hackers and intruders waiting to get hold of sensitive information about people and their bank and credit card accounts.
What is Aggregation?
In the broadest sense of the word, aggregation means collecting and combining data horizontally, vertically and chronologically, and then expressing it in summary form for statistical analysis. In the more technical, modeling sense, aggregation is a special kind of association that specifies a part-whole relationship between a component part and the whole.
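A minimal sketch of the part-whole sense of aggregation, using hypothetical Department and Employee classes; the whole holds references to its parts, but the parts can exist on their own.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Employee:
    name: str

@dataclass
class Department:
    """Aggregation: a Department groups Employees (the parts) into a whole,
    yet each Employee can exist independently of any Department."""
    name: str
    members: List[Employee] = field(default_factory=list)

    def add(self, employee: Employee) -> None:
        self.members.append(employee)

# The parts exist on their own ...
alice = Employee("Alice")
bob = Employee("Bob")

# ... and are then aggregated into a whole.
finance = Department("Finance")
finance.add(alice)
finance.add(bob)
print([e.name for e in finance.members])  # ['Alice', 'Bob']

Deleting the Department here would not delete the Employee objects, which is what distinguishes aggregation from the stronger ownership relationship usually called composition.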
Data aggregation is used by personal data aggregation services to offer a user a single point of collection for his personal information from other websites. The user can use a single master personal identification number (PIN) to access a variety of other accounts, such as airlines, clubs and financial institutions. This kind of aggregation is often called "screen scraping".
Over time, large amounts of aggregated account data are transferred from provider to server and may develop into a comprehensive database of user profiles, with details of balances, securities transactions, credit information and more. Privacy and security become a major issue, but there are independent companies offering these related services.
Because of liabilities that may arise from data aggregation activities, such as security issues and infringement of intellectual property rights, aggregators may agree on a data feed arrangement at the discretion of the end users or customers. This may involve using the Open Financial Exchange (OFX) standard to request and deliver the information to the customer. Such an agreement gives institutions an opportunity to protect their customers' interests and gives aggregators an opportunity to build more robust services. Screen scraping without the content provider's consent can end up allowing subscribers to see any account opened through a single website.
Automatic Data Partitioning
Automatic data partitioning is the process of breaking down large chunks of data and metadata at a specific data site into partitions according to the request specification of the client.
Data sites contain multitudes of varied data, which can be extremely useful as a statistical basis for determining business trends. Because data in the data sites can grow at a very fast rate, the demand for internet traffic also increases, so good software with partitioning capability should be employed to manage the data warehouse. Many software applications that handle data also have advanced functions such as traffic shaping and policing so that sufficient bandwidth can be maintained.
Relational database management systems (RDBMS) effectively manage data sites. This kind of database system follows the relational model introduced by E. F. Codd, in which data is stored in tables while the relationships among data are stored in other tables. This is in contrast to flat files, where all data is stored in one contiguous area.
Since RDBMS data is not stored in one contiguous area but broken down into tables, it becomes easy to partition the data, whether manually or automatically, for sharing and distribution.
The biggest advantage of data partitioning is that large tables and indexes can be divided into smaller parts; as a result, system performance can be greatly improved while contention is reduced and data availability and distribution are increased. Automatic data partitioning makes the job of the database administrator a lot easier, especially for labor-intensive tasks such as backups, loading data, recovery and query processing.
Vertical partitioning is another technique, wherein tables are created with fewer columns and additional separate tables store the remaining columns. Usually, the process involves the use of different physical storage.
Partitioning methods include range partitioning, list partitioning, hash partitioning and composite partitioning. Hash partitioning applies a hash function to a partitioning key to decide which partition each row belongs to, which tends to spread rows evenly across partitions, as sketched below.
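A minimal sketch of the hash partitioning idea in Python; the partition count and the customer keys are made up for illustration and do not reflect any specific database's implementation.

import hashlib

NUM_PARTITIONS = 4  # illustrative partition count

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Hash the partitioning key and map it to one of the partitions."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

# Illustrative rows keyed by customer id; each row lands in the partition its key hashes to.
rows = [("C001", 250.0), ("C002", 99.5), ("C003", 14.0), ("C004", 72.3)]
partitions = {p: [] for p in range(NUM_PARTITIONS)}
for customer_id, amount in rows:
    partitions[partition_for(customer_id)].append((customer_id, amount))

for number, contents in partitions.items():
    print("partition", number, "->", contents)

Because the hash function scatters keys uniformly, hash partitioning is typically chosen when there is no natural range or list of values to split on but an even spread of rows across partitions is still desired.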