Chapter 2 - Data and Knowledge Management-Notes
This chapter emphasizes the need for and importance of data and knowledge management,
along with business intelligence, in the domain of management information systems.
i. Columns. Columns are like fields, that is, individual items of data that we wish to
store. A student's roll number, name, address, etc. are all examples of columns.
They are also like the columns found in spreadsheets (the A, B, C, etc. along the
top).
ii. Rows. Rows are like records as they contain data of multiple columns (like the 1,
2, 3 etc. in a spreadsheet). A row can be made up of as many or as few columns as
you want. This makes reading data much more efficient - you fetch what you want.
iii. Tables. A table is a logical group of columns. For example, you may have a table
that stores customers' names and addresses. Another table would be
used to store details of parts, and yet another would be used for suppliers' names
and addresses.
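The column, row, and table ideas above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table name `student`, its columns, and the sample values are assumptions made for the example, not part of the notes.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")

# A table is a logical group of columns (fields).
conn.execute(
    "CREATE TABLE student (roll_number INTEGER PRIMARY KEY, name TEXT, address TEXT)"
)

# Each inserted tuple becomes one row (record) holding a value for every column.
conn.executemany(
    "INSERT INTO student (roll_number, name, address) VALUES (?, ?, ?)",
    [(1, "Asha", "Pune"), (2, "Ravi", "Mumbai"), (3, "Meera", "Nashik")],
)

# Column names, like the A, B, C headers of a spreadsheet.
columns = [d[0] for d in conn.execute("SELECT * FROM student").description]

# Fetch only the column we want from only the rows we want.
names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM student WHERE roll_number <= 2 ORDER BY roll_number"
    )
]
print(columns, names)
```

Selecting a single column from a filtered set of rows is what "you fetch what you want" means in practice.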
ii. Shared: Data in a database are shared among different users and
applications.
vi. Consistency: Whenever more than one data element in a database represents
related real-world values, the values should be consistent with respect to the
relationship.
vii. Non-redundancy: No two data items in a database should represent the same
real-world entity.
ix. Easily Accessible: Data should be available when and where it is needed, i.e., it should
be easily accessible.
A typical traditional file processing system is shown in the diagram, which illustrates
program and data independency.
v. With the explosion of data, the challenge has moved to the next level, and Big
Data is now becoming a reality in many organizations.
vi. The goal of every organization and expert is the same, to get the maximum out of the data,
but the route and the starting point are different for each organization and expert.
vii. As organizations evaluate and architect big data solutions, they are also
learning about the approaches and opportunities related to Big Data.
viii. There is no single solution to big data, and no single vendor can
claim to know all about Big Data.
ix. Big Data is too big a concept, and there are many players: different architectures,
different vendors and different technologies.
Volume
i. Data storage is growing exponentially because data is now more than text data.
ii. Data can be found in the form of videos, music and large images on
our social media channels.
iii. It is very common for enterprises to have terabytes and petabytes of
storage.
iv. As the database grows, the applications and architecture built to support the data
need to be reevaluated quite often.
v. Sometimes the same data is re-evaluated from multiple angles, and even though the
original data is the same, the newly found intelligence creates an explosion of data.
vi. This big volume indeed represents Big Data.
Velocity
i. Data growth and the social media explosion have changed how we look at data.
There was a time when we believed that yesterday's data was recent.
ii. As a matter of fact, newspapers still follow that logic.
iii. However, news channels and radio have changed how fast we receive the news.
iv. Today, people rely on social media to keep them updated on the latest happenings. On
social media, a message that is only a few seconds old (a tweet, a status update, etc.)
may no longer interest users.
v. They often discard old messages and pay attention to recent updates.
vi. Data movement is now almost real time, and the update window has shrunk to
fractions of a second.
vii. This high-velocity data represents Big Data.
Variety
i. Data can be stored in multiple formats, for example in a database, Excel, CSV, or Access
file; for that matter, it can be stored in a simple text file.
ii. Sometimes the data is not even in a traditional format; it may be in
the form of video, SMS, PDF, or something we might not have thought about. The
organization needs to arrange it and make it meaningful.
iii. This would be easy if all the data were in the same format; however, that is not the case
most of the time.
iv. The real world has data in many different formats, and that is the challenge we need
to overcome with Big Data.
v. This variety of data represents Big Data.
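As a minimal sketch of the variety problem described above, the snippet below brings records from two formats (CSV and JSON) into one common shape. The sample data, the field names, and the helper names `from_csv` and `from_json` are all assumptions made for illustration.

```python
import csv
import io
import json

# Hypothetical inputs: the same kind of customer record in two different formats.
CSV_TEXT = "id,name\n1,Asha\n2,Ravi\n"
JSON_TEXT = '[{"customer_id": 3, "full_name": "Meera"}]'


def from_csv(text):
    """Parse CSV rows into the common {'id': int, 'name': str} shape."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"id": int(r["id"]), "name": r["name"]} for r in reader]


def from_json(text):
    """Parse JSON objects and rename their keys to the common shape."""
    return [
        {"id": int(o["customer_id"]), "name": o["full_name"]}
        for o in json.loads(text)
    ]


# Once every source is normalized, the records can be handled together.
records = from_csv(CSV_TEXT) + from_json(JSON_TEXT)
print(records)
```

Each new format needs its own small adapter, which is why variety adds effort even when volume is modest.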
Data warehouses have no standard definition, and people who work on the subject
have defined them in many ways, as follows:
[1] “The basic data warehouse architecture interposes between end-user desktops and
production data sources a warehouse that we usually think of as a single, large system
maintaining an approximation of an enterprise data model.”
[2] “A data warehouse is a copy of transaction data specifically structured for querying
and reporting.”
[3] A data warehouse is a “subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management’s decision-making process.”
This data is obtained from different operational sources and kept in a separate physical
store. A data warehouse is not only a relational database that contains historical data
derived from transactional data; it is also an environment that includes all the
operations and applications needed to manage the process of gathering data and delivering it
to business users, such as an extraction, transportation, transformation, and loading (ETL)
solution, an online analytical processing (OLAP) engine, and client analysis tools.
i. Subject-Oriented: Data warehouses are designed to aid in decision making for a specific
subject. For example, sales data for applications contains specific sales of specific products
to specific customers. In contrast, sales data for decision support contains a historical
record of sales over specific time intervals. If designed well, subject-oriented data
provides a stable image of business processes, independent of legacy systems. In other
words, it captures the basic nature of the business environment.
ii. Integrated: A data warehouse consists of different kinds of data collected from
separate legacy systems, which can create conflicts and inconsistencies among units of
measure. Because of this, the data has to be put into a consistent format, and in this way
it becomes integrated.
iii. Nonvolatile: Nonvolatile means that, once entered into the warehouse, data should not
change. This is logical because the purpose of a warehouse is to enable a user to analyze
what has occurred. New data is always appended to the database, rather than replacing
existing data. The database continually absorbs new data, integrating it with the previous data.
iv. Time variant: Operational data and informational data differ in their
time variance. Operational data is valid only at the moment of access, capturing a
Prof. Rushikesh R. Nikam Department Computer Engineering
Subject: Management Information system Semester: VII
1- “Makes an organization’s information accessible.” The contents of the data warehouse are
correctly labeled and obvious. Data is easy to reach because it is one click away,
with no waiting. In the same order, these properties are called
understandable, navigable and fast performance.
2- “Makes the organization’s information consistent.” Consistent information is of key
importance for data warehouses, since they get data from different parts of an
organization. The data must be matched properly. If two measures of the organization have the
same name, then they must mean the same thing. Conversely, if two measures don’t mean
the same thing, they are labeled differently.
3- “To be an adaptive and resilient source of information.” The data warehouse is designed
for continuous change: new data can be added and new questions asked without changing the
existing data or the technologies.
4- “To be a secure bastion that protects owner’s information asset.” The data warehouse not
only controls access to the data effectively, but also gives its owners great visibility into the uses
and abuses of that data, even after it has left the data warehouse.
5- “To be the foundation for decision-making.” The data warehouse provides the right data for
the decision makers. Decisions are the output of the data warehouse.
v. Data Mart
A data mart is a logical subset of the complete data warehouse, prepared for a
single business process in an organization. When data marts come together, an
integrated enterprise data warehouse is formed. Data marts must be built from
shared dimensions and facts. In this way they can be combined and used together.
ix. Metadata
Metadata contains information and definitions about the data that is stored.
Data warehouse systems, on the other hand, serve users or knowledge workers in the role
of data analysis and decision making. Such systems can organize and present data in
various formats in order to accommodate the diverse needs of different users. These
systems are known as online analytical processing (OLAP) systems.
(1) Users and System Orientation: An OLTP system is used for transaction and query
processing by clerks, clients and information technology professionals. An OLAP system is
used for data analysis by knowledge workers, analysts, managers and executives.
(2) Data Contents: An OLTP system manages current data that typically are too detailed to
be easily used for decision making. An OLAP system manages large amounts of historic
data, provides facilities for summarization and aggregation and stores and manages
information at different levels of granularity. These features make the data easier to use
for informed decision making.
(3) Database Design: An OLTP system uses the entity-relationship (ER) data model and an
application-oriented database design. An OLAP system uses a star or snowflake model and
a subject-oriented database design.
(4) View: An OLTP system focuses mainly on the current data within an enterprise or
department, without referring to historic data or data in different organizations. In
contrast, an OLAP system often spans multiple versions of a database schema, due to the
evolutionary process of an organization. OLAP systems also deal with information that
originates from different organizations, integrating information from many data stores.
Because of their huge volume, OLAP data are stored on multiple storage media.
(5) Access patterns: The access patterns of an OLTP system consist mainly of short, atomic
transactions. Such a system requires concurrency control and recovery mechanisms.
However, accesses to OLAP systems are mostly read-only operations, although many could
be complex queries.
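The contrast in access patterns can be sketched with sqlite3 from the Python standard library. This is a toy illustration only, since a real OLTP system and a real OLAP system would normally be separate databases; the `sales` table, the function names, and the sample values are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, product TEXT, year INTEGER, amount REAL)"
)


def record_sale(product, year, amount):
    """OLTP-style access: one short, atomic write transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT INTO sales (product, year, amount) VALUES (?, ?, ?)",
            (product, year, amount),
        )


def sales_by_year():
    """OLAP-style access: a read-only query that summarizes historic data."""
    rows = conn.execute(
        "SELECT year, SUM(amount) FROM sales GROUP BY year ORDER BY year"
    )
    return {year: total for year, total in rows}


for sale in [("pen", 2022, 10.0), ("pen", 2023, 12.0), ("ink", 2023, 8.0)]:
    record_sale(*sale)

totals = sales_by_year()
print(totals)
```

The write path touches one row per transaction, while the analytical query reads every row and returns an aggregate per year.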
In the simple data warehouse architecture shown in Figure 3.6.1, end users
directly access data derived from several source systems through the data
warehouse.
An additional type of data, summary data, is very valuable in data warehouses because it
pre-computes long operations in advance. For example, the answer to a query about last year's
sales, which is obtained by adding up sales data, can be computed once and stored as a summary.
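A minimal sketch of the summary-data idea follows, assuming hypothetical detail rows of the form (year, month, amount). The yearly totals are computed once, and later queries read the stored summary instead of re-adding the detail rows.

```python
from collections import defaultdict

# Hypothetical detailed sales facts: (year, month, amount).
detail = [(2023, 1, 100), (2023, 2, 150), (2024, 1, 120)]


def build_yearly_summary(rows):
    """Pre-compute total sales per year once, for example at load time."""
    summary = defaultdict(int)
    for year, _month, amount in rows:
        summary[year] += amount
    return dict(summary)


yearly_summary = build_yearly_summary(detail)


def sales_for_year(year):
    """Answer the query from the summary without rescanning the detail rows."""
    return yearly_summary.get(year, 0)


print(sales_for_year(2023))
```

The trade-off is that the summary must be rebuilt or updated whenever new detail data is loaded.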
Most data warehouses use a staging area to clean and process the operational
data before putting it into the warehouse. A staging area simplifies building summaries and
general warehouse management. This common architecture is shown in Figure 3.6.2.
A warehouse’s architecture can be customized for different groups within the organization
by adding data marts, which are systems designed for specific parts of business.
The following Figure 3.6.3 shows an example. In this example, there are three data marts
which are designed separately for purchasing, sales, and inventories. This architecture gives an
opportunity to analyze historical data for purchases and sales.
Figure 17: Architecture of a Data Warehouse with a Staging Area and Data Marts
• Data Extraction, which typically gathers data from multiple, heterogeneous and external
sources.
• Data Cleaning, which detects errors in the data and rectifies them when possible.
• Data Transformation, which converts data from legacy or host format to warehouse format.
• Load, which sorts, summarizes, consolidates, computes views, checks integrity and builds
indexes and partitions.
• Refresh, which propagates the updates from data source to the data warehouse.
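The steps above can be sketched as a tiny pipeline. Everything here is an assumption made for illustration: the two in-memory "source systems", the field names, and the `product_dim` table. Real ETL tools do far more, but the order of the steps is the same.

```python
import sqlite3

# Hypothetical source rows from two systems that use different units and formats.
SOURCE_A = [{"sku": "P1", "weight_kg": "2.5"}, {"sku": "P2", "weight_kg": ""}]
SOURCE_B = [{"sku": "p3", "weight_g": "500"}]


def extract():
    """Extraction: gather rows from multiple heterogeneous sources."""
    return [("A", r) for r in SOURCE_A] + [("B", r) for r in SOURCE_B]


def clean(rows):
    """Cleaning: drop rows whose weight is missing (one simple error check)."""
    return [
        (src, r)
        for src, r in rows
        if any(r.get(k) for k in ("weight_kg", "weight_g"))
    ]


def transform(rows):
    """Transformation: convert to the warehouse format (upper-case SKU, kilograms)."""
    out = []
    for src, r in rows:
        kg = float(r["weight_kg"]) if src == "A" else float(r["weight_g"]) / 1000
        out.append((r["sku"].upper(), kg))
    return out


def load(conn, rows):
    """Load: write sorted rows into the warehouse table and build an index."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS product_dim (sku TEXT PRIMARY KEY, weight_kg REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO product_dim VALUES (?, ?)", sorted(rows))
    conn.execute("CREATE INDEX IF NOT EXISTS idx_weight ON product_dim (weight_kg)")
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(clean(extract())))
loaded = warehouse.execute(
    "SELECT sku, weight_kg FROM product_dim ORDER BY sku"
).fetchall()
print(loaded)
```

In this sketch, refresh amounts to running the same pipeline again; `INSERT OR REPLACE` lets updated source rows overwrite their earlier versions.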
Metadata is also structured data that describes the
characteristics of a resource. It is stored in the system itself and can be queried using tools
that are available on the system.
Examples:
(1) The table of contents and index in a book may be considered metadata for the book.
(2) A library catalogue may be considered metadata. The catalogue metadata consists of several
predefined elements representing specific attributes of a resource, and each element can
have one or more values. These elements could be the name of the author, the name of the
document, the publisher’s name, the publication date and the category to which it belongs.
They could even include an abstract of the data.
(3) Suppose we say that a data element about a person is 80. This must be described by noting that
it is the person’s weight and that the unit is kilograms. Therefore (weight, kilogram) is the metadata.
The metadata repository itself may be stored in a physical location in which metadata is drawn from
separate sources. Metadata may include information about how to access specific data or more details
about the data.
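The weight example in (3) can be written down directly. This is a small sketch in which the `Metadata` class and the `describe` helper are hypothetical names chosen for illustration; the point is only that the bare value 80 becomes meaningful once its metadata is attached.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metadata:
    """Describes a data element: what it measures and in which unit."""
    attribute: str
    unit: str


def describe(value, meta):
    """Combine a raw value with its metadata into a readable statement."""
    return f"{meta.attribute}: {value} {meta.unit}"


weight_meta = Metadata(attribute="weight", unit="kilogram")
description = describe(80, weight_meta)
print(description)
```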
Metadata has a very important role in a data warehouse. Its role is different from that of
the warehouse data, and it is important for many reasons.
For example, metadata is used as a directory to help the decision support system analyst locate
the contents of the data warehouse, and as a guide to the data mapping when data is
transformed from the operational environment to the data warehouse environment. Metadata
also serves as a guide to the algorithms used for summarization between the current detailed data
and the highly summarized data, and between the lightly summarized data and the highly
summarized data. Metadata should be stored and managed persistently.
It includes the description of the structure of data warehouse. The description is defined by
schema, view, hierarchies, derived data definitions, and data mart location and contents.
It includes the business terms and definitions, data ownership information and changing
policies.
It includes currency of data and data lineage. Currency of data means whether the data is
active, archived or purged. Lineage of data means the history of the data's migration and
the transformations applied to it.
It includes source databases and their contents, data partitions, data extraction, cleaning,
transformation rules, data refresh and purging rules and security (user authorization and
access control).
(5) The algorithms used for summarization:
It includes measure and dimension definition algorithms, data on granularity, partitions,
subject areas, aggregation, summarization, and predefined queries and reports.
(6) Data related to system performance:
It includes indices and profiles that improve data access and retrieval performance, in
addition to rules for the timing and scheduling of refresh, update and replication cycles.
The two most common approaches to building a metadata repository architecture are:
(1) Centralized
(2) Decentralized
Generally, for small to medium-sized organizations, a single metadata repository (the centralized
approach) is enough for handling all of the metadata required by the various groups in the
corporation. This architecture offers a single and centralized approach to administering and
sharing metadata.
On the other hand, most large enterprises that have multiple and disparate divisions will require
several metadata repositories for handling all of the corporation’s various types of metadata
content and applications.
This approach is the most common one that corporations have implemented.
A centralized metadata architecture is built on a consistent metamodel that mandates the
schema for defining and organizing the various metadata to be stored in a global metadata
repository.
The strength of this approach is that it integrates all of the metadata and stores it in a
metamodel schema that can be easily accessed.
A decentralized metadata architecture creates a uniform and consistent metamodel that mandates
the schema for defining and organizing the various metadata to be stored in a global metadata
repository and in the shared metadata elements that appear in the local metadata repositories.
All the metadata that is shared and reused among the various repositories must first go
through the central global repository, but sharing and access to the local metadata are independent
of the central repository.
3.10 Mapping
A basic part of the data warehouse environment is that of mapping from the operational
environment into the data warehouse.
• Conversions
Consider the vice president of marketing, who has just asked for a new report on product sales
and purchases. The manager turns to the data warehouse for the data for the report. Upon inspection,
the vice president proclaims the report to be fiction. The manager must then prove that the data in
the report is valid. The manager first checks the validity of the data in the warehouse. If the
data in the warehouse has not been reported properly, then the reports are adjusted.
However, if the reports have been made properly from the data warehouse, the manager has
to go back to the operational sources. At this point, if the mapping data has been carefully stored,
then the manager can quickly and easily go to the operational source. However, if the mapping
has not been stored properly, then the manager has a difficult time defending the conclusion to the vice
president.
The metadata store for the data warehouse, then, is the natural place for storing mapping
information.
Data marts can have a dependent or an independent structure. If the characteristics of the data
marts’ dimensions are defined at the beginning so that they conform to each other,
then these data marts are dependent.
In some situations, it is better to have independent data marts. In this case, the characteristics
of the other data marts are not taken into consideration during the preparation of the
data mart. However, this can prevent future integration and add development cost if there
is later an interest in sharing information across departments.
i. To give users more flexible access to the data they need to analyze most often.
ii. To provide data in a form that matches the collective view of a group of users.
iii. To improve end-user response time.
iv. Potential users of a data mart are clearly defined and can be targeted for support to
retrieve the data.
v. To provide appropriately structured data as dictated by the requirements of the end-user
access tools.
vi. Building a data mart is simpler compared with establishing a corporate data warehouse.
vii. The cost of implementing data marts is far less than that required to establish a data
warehouse.
viii. The data mart is the access layer of the data warehouse environment. That means we
create a data mart to deliver data to users faster.
ix. The data mart is a subset of the warehouse, which means all the data available in the data
mart will be available in the database. The data mart is created for a
specific business purpose.
x. It is easy to access frequently needed data from the database when required by the
client.
xi. We can give a group of users access to view the data mart when it is required, and
performance will be good.
xii. It is easy to create and maintain a data mart, since it relates to a specific business.
xiii. It costs less to create a data mart than to create a data warehouse, which needs a huge
amount of space.
There are three main approaches for building data marts: the top-down approach, the bottom-up
approach and the federated approach.
When the data mart is compared with the data warehouse, two fundamental distinctions
can easily be noticed. One of them is that a data mart is a subset of the data warehouse
and is requirement oriented. In contrast, the data warehouse holds the enterprise data
without regard to any specific requirements. Of course, during the design of
a data mart the structure of the whole warehouse has to be considered; if not, it will be very
hard to integrate the data marts later.
The implementation of a data mart is much faster and cheaper, since a data mart
contains only a specific part of the data warehouse, whose implementation is more time
consuming and costs much more.
There are some data mart solutions developed by decision support
system (DSS) vendors. But using them to design a data mart for specific
requirements takes much more customization effort, because these solutions
are produced for general purposes.
The other main difference between the data mart and the data warehouse is the granularity
of the data. Since the requirements of the data mart are more defined than those of the
data warehouse, the data can be pre-aggregated along those requirements, so the data in
the data mart can be coarser (more summarized) than in the data warehouse. As a result,
data can be extracted faster and more efficiently.
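As a rough sketch of a data mart built as a pre-aggregated subset of a warehouse, the snippet below uses sqlite3. The table names `warehouse_facts` and `sales_mart`, the columns, and the sample rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical warehouse fact table covering several subject areas.
conn.execute("CREATE TABLE warehouse_facts (subject TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO warehouse_facts VALUES (?, ?, ?)",
    [
        ("sales", "east", 100.0),
        ("sales", "east", 50.0),
        ("sales", "west", 70.0),
        ("purchasing", "east", 30.0),
    ],
)

# The sales data mart keeps only the sales subject area, pre-aggregated by region.
conn.execute(
    "CREATE TABLE sales_mart AS "
    "SELECT region, SUM(amount) AS total_amount "
    "FROM warehouse_facts WHERE subject = 'sales' GROUP BY region"
)

mart_rows = conn.execute(
    "SELECT region, total_amount FROM sales_mart ORDER BY region"
).fetchall()
print(mart_rows)
```

Queries against `sales_mart` read two rows instead of scanning every warehouse fact, which is the response-time benefit the notes describe.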
Data handling
   Data warehouse: Data warehousing covers a large area of the corporation, which is
   why it takes a long time to process.
   Data mart: Data marts are easy to use, design and implement, as they handle only
   small amounts of data.
Data type
   Data warehouse: The data stored inside the data warehouse is always detailed when
   compared with a data mart.
   Data mart: Data marts are built for particular user groups. Therefore, the data is
   short and limited.
Subject area
   Data warehouse: The main objective of the data warehouse is to provide an integrated
   environment and a coherent picture of the business at a point in time.
   Data mart: Mostly holds only one subject area, for example, sales figures.
Data storing
   Data warehouse: Designed to store enterprise-wide decision data, not just marketing
   data.
   Data mart: Dimensional modeling and star schema design are employed for optimizing
   the performance of the access layer.
Data type
   Data warehouse: Time variance and non-volatile design are strictly enforced.
   Data mart: Mostly includes consolidation data structures to meet the subject area's
   query and reporting needs.

Knowledge Management
Knowledge is very important for the survival of an organization. Historically,
employees have gathered knowledge through the trial-and-error method
or by working as an apprentice under a tenured, knowledgeable
employee. Management guru Peter Drucker put forward the concept that
knowledge is as valuable as a company’s various assets such as plant,
machinery, etc. A knowledge management system comprises a range of
practices used in an organization to identify, create, and represent knowledge.
i. Expert Systems
These are knowledge management systems developed to facilitate a Subject Matter
Expert. This module provides knowledge of different subjects.
ii. Groupware
In the current global scenario, team members are spread across regions. However, it is
important for them to collaborate on various projects. Groupware is a knowledge
management system that supports shared calendars, project activities and instant
messaging.
iii. SharePoint
All the systems we are discussing here come under knowledge management category.
A knowledge management system is not radically different from all these information
systems, but it just extends the already existing systems by assimilating more
information.
As we have seen, data is raw facts, information is processed and/or interpreted data,
and knowledge is personalized information.
What is Knowledge?
• Personalized information
• State of knowing and understanding
• An object to be stored and manipulated
• A process of applying expertise
• A condition of access to information
• Potential to influence action
Sources of Knowledge of an Organization
• Intranet
• Data warehouses and knowledge repositories
• Decision support tools
• Groupware for supporting collaboration
• Networks of knowledge workers
• Internal expertise
Purpose of KMS
• Improved performance
• Competitive advantage
• Innovation
• Sharing of knowledge
• Integration
• Continuous improvement by −
o Driving strategy
o Starting new lines of business
o Solving problems faster
o Developing professional skills
o Recruiting and retaining talent
Activities in Knowledge Management
• Start with the business problem and the business value to be delivered first.
• Identify what kind of strategy to pursue to deliver this value and address the KM
problem.
• Think about the system required from a people and process point of view.
• Finally, think about what kind of technical infrastructure is required to support the
people and processes.
• Implement the system and processes with appropriate change management and iterative
staged release.
Level of Knowledge Management
Decision making is the mental process of selecting a course of action from a set of alternatives.
Every decision-making process produces an outcome that might be an action, a recommendation,
or an opinion. Since doing nothing or remaining neutral is usually among the set of options one
chooses from, selecting that course is also a decision.
• Establishing objectives
• Classifying and prioritizing objectives
• Developing selection criteria
• Identifying alternatives
• Evaluating alternatives against the selection criteria
• Choosing the alternative that best satisfies the selection criteria
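One common way to carry out the last three steps is a weighted scoring table. The sketch below is an illustration under stated assumptions, not a method prescribed by these notes: the criteria, the weights, the alternatives, and the 1 to 5 scores are all hypothetical.

```python
# Hypothetical selection criteria, weighted to reflect prioritized objectives.
CRITERIA_WEIGHTS = {"cost": 0.5, "speed": 0.3, "risk": 0.2}

# Hypothetical alternatives, each scored 1 (poor) to 5 (good) on every criterion.
# "do nothing" is included because not acting is also an option.
ALTERNATIVES = {
    "build in-house": {"cost": 2, "speed": 2, "risk": 4},
    "buy a package": {"cost": 4, "speed": 5, "risk": 3},
    "do nothing": {"cost": 5, "speed": 1, "risk": 1},
}


def weighted_score(scores):
    """Evaluate one alternative against the weighted selection criteria."""
    return sum(CRITERIA_WEIGHTS[c] * scores[c] for c in CRITERIA_WEIGHTS)


def choose(alternatives):
    """Pick the alternative that best satisfies the selection criteria."""
    return max(alternatives, key=lambda name: weighted_score(alternatives[name]))


best = choose(ALTERNATIVES)
print(best)
```

Changing the weights changes the ranking, which is why classifying and prioritizing objectives comes before evaluating alternatives.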
The decision maker may face a problem when trying to evaluate alternatives in terms of their
strengths and weaknesses. This can be especially challenging when there are many factors to
consider. Time limits and personal emotions also play a role in the process of choosing between
alternatives. Greater deliberation and information gathering often take additional time, and
decision makers often must choose before they feel fully prepared. In addition, the more that is
at stake, the more emotions are likely to come into play, and this can distort one’s judgment.
Types of Decisions
Three approaches to decision making are avoiding, problem solving and problem
seeking.
Every decision-making process reaches a conclusion, which can be a choice to act or not to act,
a decision on what course of action to take and how, or even an opinion or recommendation.
Sometimes decision-making leads to redefining the issue or challenge. Accordingly, three
decision-making processes are known as avoiding, problem solving, and problem seeking.
One decision-making option is to make no choice at all. There are several reasons why the
decision maker might do this:
4. The person considering the alternatives does not have the authority to decide.
Business Intelligence (BI) is a technology-driven process for analyzing data and presenting
actionable information to help executives, managers and other corporate end users make
informed business decisions. BI encompasses a wide variety of tools, applications and
methodologies that enable organizations to collect data from internal systems and
external sources, prepare it for analysis, develop and run queries against that data, and
create reports, dashboards and data visualizations to make the analytical results available
to corporate decision-makers, as well as operational workers.
Business intelligence is sometimes used interchangeably with business analytics. In other
cases, business analytics is used either more narrowly to refer to advanced data analytics
or more broadly to include both BI and advanced analytics.
that need to be addressed. BI data can include historical information stored in a data
warehouse, as well as new data gathered from source systems as it is generated, enabling
BI tools to support both strategic and tactical decision-making processes.
Initially, BI tools were primarily used by data analysts and other IT professionals who ran
analyses and produced reports with query results for business users. Increasingly,
however, business executives and workers are using BI platforms themselves, thanks
partly to the development of self-service BI and data discovery tools and dashboards.
Types of BI tools
Business intelligence combines a broad set of data analysis applications, including ad hoc
analytics and querying, enterprise reporting, online analytical processing (OLAP), mobile
BI, real-time BI, operational BI, software-as-a-service BI, open source BI, collaborative BI
and location intelligence.
BI technology also includes data visualization software for designing charts and other
infographics, as well as tools for building BI dashboards and performance scorecards that
display visualized data on business metrics and key performance indicators in an easy-to-
grasp way. Data visualization tools have become the standard of modern BI in recent
years. A couple of leading vendors defined the technology early on, but more traditional BI
vendors have followed in their path. Now, virtually every major BI tool incorporates
features of visual data discovery.
BI programs may also incorporate forms of advanced analytics, such as data mining,
predictive analytics, text mining, statistical analysis and big data analytics. In many cases,
though, advanced analytics projects are conducted and managed by separate teams of
data scientists, statisticians, predictive modelers and other skilled analytics professionals,
while BI teams oversee more straightforward querying and analysis of business data.
Business intelligence data is typically stored in a data warehouse or in smaller data marts
that hold subsets of a company's information. In addition, Hadoop systems are
increasingly being used within BI architectures as repositories or landing pads for BI and
analytics data, especially for unstructured data, log files, sensor data and other types of
big data. Before it is used in BI applications, raw data from different source systems must
be integrated, consolidated and cleansed using data integration and data quality tools to
ensure that users are analyzing accurate and consistent information.
Questions
2 Marks Questions
1. Define Database Approach.
2. Define Big Data with example.
3. Define Data Warehouse and Data Mart.
4. Define Knowledge management with neat diagram.
5. What are the 3V’s of big data analytics?
6. What are the roles of BI?
7. Differentiate between traditional Computing and Stream Computing.
8. Define data management.
5 Marks Questions
1. Describe the importance of Business Intelligence and DSS in developing MIS.
2. Explain the MIS pyramid.
3. What is Information system? What are functions of information system and
its impact on the society in the domain of health care?
4. Explain the ethical issues and threats of information security.
5. Differentiate between Data Warehouse and Data Mart.
6. Differentiate between OLAP and OLTP.
7. Explain with neat diagram the Value Chain of Big data.
8. Explain the Knowledge Management framework with KM Ladder.
10 Marks Questions
1. What is the role of knowledge management and knowledge
management programs in business?
2. What are the business benefits of using intelligent techniques for
knowledge management?