Unit 2

Analytics is the science of analysis, in which methods from statistics, data mining, and computer technology are applied to examine data. Analysis, in turn, is the process of breaking complex data down into simpler forms, producing more compact information that is easier to understand.

The four stages of analytics are defined as follows:

1. Descriptive Analytics

Definition: Descriptive analytics focuses on summarizing past data to understand and explain
historical events. It answers the question "What happened?" by analysing data and presenting it in
an organized way, often through reports, charts, or dashboards.

Example:
A school analyses data and finds that 10% of students dropped out last year. This stage helps
summarize the raw numbers and identify patterns or trends.
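
As a minimal sketch in Python (the students table and its columns are hypothetical), descriptive analytics can be as simple as aggregating historical records:

import pandas as pd

# Hypothetical historical student records (illustrative data only)
students = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "status": ["active", "dropped", "active", "active", "active",
               "active", "active", "active", "active", "active"],
})

# "What happened?" -- summarize last year's outcomes
dropout_rate = (students["status"] == "dropped").mean()
print(f"Dropout rate last year: {dropout_rate:.0%}")   # -> 10%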

2. Diagnostic Analytics

Definition: Diagnostic analytics explores the causes of past events. It answers the question "Why did
it happen?" by identifying relationships, patterns, or anomalies in the data. This stage uses
techniques like correlation analysis and root cause analysis to find explanations.

Example:
The school discovers that the dropout rate increased because many students faced financial
challenges or struggled with a tougher curriculum.

3. Predictive Analytics

Definition: Predictive analytics uses historical data, machine learning, and statistical models to
forecast future outcomes. It answers the question "What is likely to happen?" by identifying
patterns and making predictions.

Example:
Using student data, the school predicts that 15 students are likely to drop out next year based on
their attendance, grades, and financial background.
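
A minimal sketch of this idea, assuming hypothetical features such as attendance rate, average grade, and a financial-aid flag, and using scikit-learn (one of the Python libraries mentioned later in these notes):

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical historical records: features plus whether the student dropped out
history = pd.DataFrame({
    "attendance":    [0.95, 0.60, 0.88, 0.55, 0.92, 0.70],
    "avg_grade":     [78,   52,   81,   49,   85,   60],
    "financial_aid": [0,    1,    0,    1,    0,    1],
    "dropped_out":   [0,    1,    0,    1,    0,    1],
})

model = LogisticRegression()
model.fit(history[["attendance", "avg_grade", "financial_aid"]],
          history["dropped_out"])

# "What is likely to happen?" -- score current students by dropout risk
current = pd.DataFrame({"attendance": [0.58], "avg_grade": [50], "financial_aid": [1]})
print(model.predict_proba(current)[:, 1])  # estimated probability of dropping out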

4. Prescriptive Analytics

Definition: Prescriptive analytics provides recommendations for actions to influence future outcomes
or prevent problems. It answers the question "What should we do?" by combining data insights,
predictions, and decision-making techniques.

Example:
The school decides to offer financial aid, tutoring programs, and mentorship to help students
identified as at risk of dropping out, based on the predictive analysis.
Analytics is a journey that combines skills, advanced technologies, applications, and processes that firms use to gain business insights from data and statistics, in order to support business planning.

Introduction to Tools and Environment

Tools are the software used for analytics, while techniques are the procedures followed to reach a solution.

With the increasing demand for data analytics in the market, many tools with various functionalities have emerged for this purpose. Ranging from open-source platforms to user-friendly commercial products, the top tools in the data analytics market are as follows.

• R programming – This is the leading analytics tool, used for statistics and data modelling. R compiles and runs on various platforms such as UNIX, Windows, and Mac OS. It also provides facilities to automatically install packages as required by the user.

• Python – Python is an open-source, object-oriented programming language that is easy to read, write, and maintain. It provides various machine learning and visualization libraries such as Scikit-learn, TensorFlow, Matplotlib, Pandas, and Keras. It can also work with data from sources such as a SQL Server database, a MongoDB database, or JSON files.

• Tableau Public – This is free software that connects to any data source, such as Excel or a corporate data warehouse, and then creates visualizations, maps, and dashboards with real-time updates on the web.

• QlikView – This tool offers in-memory data processing, with results delivered to end users quickly. It also offers data association and data visualization, with data compressed to almost 10% of its original size.

• SAS – A programming language and environment for data manipulation and analytics, this tool is
easily accessible and can analyze data from different sources.

• Microsoft Excel – This is one of the most widely used tools for data analytics. Mostly used for clients' internal data, it summarizes data using features such as pivot tables for a quick preview of results.

• RapidMiner – A powerful, integrated platform that can connect to many data source types such as Access, Excel, Microsoft SQL Server, Teradata, Oracle, and Sybase. This tool is mostly used for predictive analytics tasks such as data mining, text analytics, and machine learning.

• KNIME – Konstanz Information Miner (KNIME) is an open-source data analytics platform, which
allows you to analyze and model data. With the benefit of visual programming, KNIME provides a
platform for reporting and integration through its modular data pipeline concept.

• OpenRefine – Also known as Google Refine, this data cleaning software helps you clean up data for analysis. It is used for cleaning messy data, transforming data, and parsing data from websites.

• Apache Spark – One of the most widely used large-scale data processing engines, this tool executes applications in Hadoop clusters up to 100 times faster in memory and 10 times faster on disk. It is also popular for data pipelines and machine learning model development.

Apart from proficiency in the tools mentioned above, a data analyst should also possess skills such as statistics, data cleaning, exploratory data analysis, and data visualization. Knowledge of machine learning is a further advantage that helps an analyst stand out from the crowd.

Applications of Business Analytics

Business analytics has diverse applications across industries to improve decision-making, optimize
operations, and achieve strategic goals. Some notable applications include:

1. Marketing and Customer Insights

• Description: Businesses analyze consumer data to understand preferences, behaviors, and opinions.

• Example:

o Facebook and Twitter analyze customer sentiment during campaigns to assess the
real-time impact and public opinion.

o Amazon examines purchase patterns to recommend personalized products.

2. Sales and Revenue Optimization

• Description: Identifying factors that influence performance to boost sales and improve
customer engagement.

• Example:

o Companies like eBay and Google analyze factors like user interaction and conversion
rates to increase sales and optimize revenue generation.

3. Big Data Processing

• Description: Using frameworks like Hadoop to process and analyze vast datasets in
distributed environments, enabling real-time insights.

• Example:

o Google, Twitter, and Facebook use Hadoop with MapReduce to process massive
amounts of data efficiently in their cloud environments.

4. Supply Chain and Operations Management

• Description: Optimizing inventory, logistics, and resource allocation using predictive analytics
and historical data.

• Example: Retail companies analyze demand trends to maintain adequate stock levels and
minimize overstocking.

5. Healthcare Analytics

• Description: Enhancing patient care by analyzing medical records and treatment outcomes.
• Example: IBM’s big data solutions help healthcare providers analyze patient data to improve
diagnoses and treatment plans.

6. Fraud Detection and Risk Management

• Description: Detecting unusual patterns to minimize fraud and financial risks.

• Example: Banks use analytics to monitor transactions for signs of fraud and evaluate credit
risks.

7. Social Media and Sentiment Analysis

• Description: Understanding user sentiments and trends on platforms like Twitter and
Facebook to guide content creation and advertising.

• Example: Social media platforms analyze user-generated data to offer advertisers targeted ad
placements.

Databases
A database is an organized collection of structured information, or data, typically stored electronically in a computer system. A database is usually controlled by a database management system (DBMS).

Databases can be divided into various categories such as text databases, desktop database programs, relational database management systems (RDBMS), and NoSQL and object-oriented databases.

A text database is a system that maintains a (usually large) text collection and provides fast and accurate access to it. E.g.: textbooks, magazines, journals, manuals, etc.

A desktop database is a database system that is made to run on a single computer or PC. These
simpler solutions for data storage are much more limited and constrained than larger data center or
data warehouse systems, where primitive database software is replaced by sophisticated hardware
and networking setups. E.g.: Microsoft Excel, Microsoft Access, etc.

A relational database (RDB) is a collective set of multiple data sets organized by tables, records, and columns. RDBs establish well-defined relationships between database tables: tables communicate and share information, which facilitates data searchability, organization, and reporting. E.g.: SQL, Oracle, DB2, DBaaS, etc.

NoSQL databases are non-tabular and store data differently than relational tables. NoSQL databases come in a variety of types based on their data model; the main types are document, key-value, wide-column, and graph. E.g.: MongoDB, CouchDB (document stores that use JSON-like documents), etc.

Object-oriented databases (OODB) are databases that represent data in the form of objects and classes. In object-oriented terminology, an object is a real-world entity and a class is a collection of objects. Object-oriented databases follow the fundamental principles of object-oriented programming (OOP) and are typically used with languages such as C++, Java, C#, Smalltalk, and LISP.

Types of Data and Variables

Types of Data
• Structured Data – Data that is processed, stored, and retrieved in a fixed format. Example: Employee details, job positions, and salaries.

• Unstructured Data – Data that lacks any specific form or structure. Example: Email.

• Semi-Structured Data – Data that contains elements of both forms: it does not follow a rigid tabular structure but includes tags or markers that organize its elements. Example: XML data.

In any database we work with data to perform analysis and prediction. In a relational database management system, rows normally represent data (records) and columns represent attributes.

In big data terminology, a column from an RDBMS is referred to as an attribute or a variable.

A variable can be divided into two types: categorical data (qualitative data) and continuous or discrete data (quantitative data).

Qualitative (categorical) data is normally represented by variables that hold character values, and it is divided into two types: nominal data and ordinal data.

In nominal data there is no natural ordering of the values of the attribute in the dataset. E.g.: color, gender, nouns (name, place, animal, thing). These categories cannot be given a meaningful order; for example, there is no specific way to arrange the genders of 50 students in a class, since the first student can be male or female, and likewise for all 50 students, so no ordering is valid.

In ordinal data there is a natural ordering of the values of the attribute in the dataset. E.g.: size (S, M, L, XL, XXL), rating (excellent, good, average, poor). Here, ordering the values is meaningful and gives additional insight into the data.

Quantitative data (discrete or continuous data) can be further divided into two types: discrete attributes and continuous attributes. Quantitative values are measured on either an interval or a ratio scale:

Interval: Numerical values where the difference is meaningful, but there's no true zero point.
• Example: Temperature in Celsius or Fahrenheit.

Ratio: Numerical values with a true zero point, allowing for meaningful comparisons (like doubling
values).

• Example: Height, Weight, or Age.

A discrete attribute takes only a finite number of numerical values (integers). E.g.: number of buttons, number of days for product delivery, etc. Such data can be recorded at specific intervals, as in time-series data mining, or as ratio-based entries.

A continuous attribute takes fractional (real-numbered) values. E.g.: price, discount, height, weight, length, temperature, speed, etc. Such data can likewise be recorded at specific intervals, as in time-series data mining, or as ratio-based entries.
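
A small sketch of how these variable types might be declared in pandas (the column names and categories here are illustrative assumptions):

import pandas as pd

df = pd.DataFrame({
    "gender":  ["M", "F", "F", "M"],           # nominal: no natural order
    "size":    ["S", "XL", "M", "L"],           # ordinal: natural order
    "buttons": [2, 4, 3, 2],                    # discrete (integer counts)
    "price":   [19.99, 45.50, 30.00, 22.75],    # continuous (ratio scale)
})

# Declare the ordinal column together with its natural ordering
df["size"] = pd.Categorical(df["size"],
                            categories=["S", "M", "L", "XL", "XXL"],
                            ordered=True)
df["gender"] = df["gender"].astype("category")  # nominal category

print(df.dtypes)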

Data Modelling

Data Modelling is a set of tools and techniques used to understand and analyse how an organisation
should collect, update, and store data. It is a critical skill for the business analyst who is involved with
discovering, analysing, and specifying changes to how software systems create and maintain
information.

For example, business analysts create an entity relationship diagram to visualise relationships between key business concepts.

They create a conceptual-level data dictionary to communicate data requirements that are important
to business stakeholders.

They create a data map to resolve potential data issues for a data migration or integration project.

A data modeller would not necessarily query or manipulate data or become involved in designing or
implementing databases or data repositories.

TYPES OF DATA MODELS

There are mainly three different types of data models:

1. Conceptual: This Data Model defines WHAT the system contains. This model is typically created by
Business stakeholders and Data Architects. The purpose is to organize, scope and define business
concepts and rules.

2. Logical: Defines HOW the system should be implemented regardless of the DBMS. This model is typically created by Data Architects and Business Analysts. The purpose is to develop a technical map of rules and data structures.

3. Physical: This Data Model describes HOW the system will be implemented using a specific DBMS. This model is typically created by DBAs and developers. The purpose is the actual implementation of the database.
Conceptual Model

The main aim of this model is to establish the entities, their attributes, and their relationships. In this
Data modelling level, there is hardly any detail available of the actual Database structure.

The 3 basic tenets of a data model are

Entity: A real-world thing

Attribute: Characteristics or properties of an entity

Relationship: Dependency or association between two entities

For example:

• Customer and Product are two entities. Customer number and name are attributes of the Customer
entity

• Product name and price are attributes of product entity

• Sale is the relationship between the customer and product.

Characteristics of a conceptual data model

• Offers Organisation-wide coverage of the business concepts.

• This type of data model is designed and developed for a business audience.

• The conceptual model is developed independently of hardware specifications like data storage
capacity, location or software specifications like DBMS vendor and technology. The focus is to
represent data as a user will see it in the "real world."

Conceptual data models, also known as domain models, create a common vocabulary for all stakeholders by establishing basic concepts and scope.
Logical Data Model

Logical data models add further information to the conceptual model elements.

A logical model defines the structure of the data elements and sets the relationships between them.

The advantage of the logical data model is that it provides a foundation for the physical model; however, the modelling structure remains generic.

At this Data Modelling level, no primary or secondary key is defined.

At this Data modeling level, you need to verify and adjust the connector details that were set earlier
for relationships.

Characteristics of a Logical data model

• Describes data needs for a single project but could integrate with other logical data models based
on the scope of the project.

• Designed and developed independently from the DBMS.

• Data attributes will have datatypes with exact precisions and length.

• Normalization is typically applied to the model up to 3NF.

Physical Data Model

A Physical Data Model describes the database specific implementation of the data model.

It offers an abstraction of the database and helps generate schema. This is because of the richness of
meta-data offered by a Physical Data Model.

This type of Data model also helps to visualize database structure.

It helps to model database columns keys, constraints, indexes, triggers, and other RDBMS features.

Characteristics of a physical data model:

• The physical data model describes the data needed for a single project or application, though it may be integrated with other physical data models based on project scope.

• The model contains relationships between tables, addressing the cardinality and nullability of the relationships.

• Developed for a specific version of a DBMS, location, data storage or technology to be used in the
project.

• Columns should have exact datatypes, assigned lengths, and default values.

• Primary and Foreign keys, views, indexes, access profiles, and authorizations, etc. are defined.
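
As a minimal sketch, the Customer/Product/Sale example from the conceptual model could be turned into a physical model along these lines (the exact column names and types are illustrative assumptions; shown here with Python's built-in sqlite3 module):

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Physical model: exact datatypes, primary and foreign keys, default values
conn.executescript("""
CREATE TABLE customer (
    customer_number INTEGER PRIMARY KEY,
    name            TEXT NOT NULL
);

CREATE TABLE product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT NOT NULL,
    price        REAL NOT NULL DEFAULT 0.0
);

CREATE TABLE sale (
    sale_id         INTEGER PRIMARY KEY,
    customer_number INTEGER NOT NULL REFERENCES customer(customer_number),
    product_id      INTEGER NOT NULL REFERENCES product(product_id),
    sale_date       TEXT
);
""")
conn.close()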

ADVANTAGES AND DISADVANTAGES OF DATA MODEL:

Advantages of Data model:

• The main goal of designing a data model is to make certain that data objects offered by the functional team are represented accurately.
• The data model should be detailed enough to be used for building the physical database.

• The information in the data model can be used for defining the relationship between tables,
primary and foreign keys, and stored procedures.

• A data model helps the business to communicate information within and across organizations.

• A data model helps to document data mappings in the ETL process.

• It helps to recognize the correct sources of data to populate the model.

Disadvantages of Data model:

• To develop a data model, one should know the characteristics of how the data is physically stored.

• Navigational data models make application development and management complex, and they require detailed knowledge of how the data is stored.

• Even a small change in structure requires modification of the entire application.

• There is no set data manipulation language in a DBMS.

DATA MODELLING TECHNIQUES

There are three basic data modelling techniques

1. Entity Relationship Diagrams

2. UML Class Diagrams

3. Data Dictionary

1. Entity Relationship Diagrams

Also referred to as ER diagrams or ERDs, Entity-Relationship modeling is the default technique for modeling and designing relational (traditional) databases. In this notation the architect identifies:

• Entities representing objects (or tables in relational database),

• Attributes of entities including data type,

• Relationships between entities/objects (or foreign keys in a database).

ERDs work well for designing relational (classic) databases, Excel databases, or CSV files; any kind of tabular data. They are well suited to visualizing database schemas and communicating a top-level view of the data.
2. UML Class Diagrams

UML (Unified Modeling Language) is a standardized family of notations for modeling and designing information systems. It was derived from various existing notations to provide a standard for software engineering. It comprises several different diagrams representing different aspects of a system, one of them being the Class Diagram, which can be used for data modeling. Class diagrams are the equivalent of ERDs in the relational world and are mostly used to design classes in object-oriented programming languages (such as Java or C#).

In class diagrams architects define:

• Classes (equivalent of entity in relational world),

• Attributes of a class (same as in an ERD) including data type,

• Methods associated to specific class, representing its behavior (in relational world those would be
stored procedures),

• Relationships grouped into two categories:

• Relationships between objects (instances of classes), differentiated into Dependency, Association, Aggregation, and Composition (equivalent to relationships in an ERD),

• Relationships between classes, of two kinds: Generalization/Inheritance and Realization/Implementation (these have no equivalent in the relational world).

Class diagrams can be used to design tabular data (such as in an RDBMS), but they were designed for, and are mostly used with, object-oriented programs (such as Java or C#).
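
A minimal sketch of the same Customer/Product/Sale concepts expressed as classes (the attribute and method names are illustrative assumptions; in a UML class diagram these would appear as classes, attributes, methods, and associations):

from dataclasses import dataclass, field
from typing import List

@dataclass
class Product:
    product_name: str          # attribute
    price: float               # attribute

@dataclass
class Customer:
    customer_number: int       # attribute
    name: str                  # attribute
    purchases: List[Product] = field(default_factory=list)  # association with Product

    def buy(self, product: Product) -> None:
        # method (behaviour) -- in the relational world this might be a stored procedure
        self.purchases.append(product)

# Usage example
alice = Customer(customer_number=1, name="Alice")
alice.buy(Product(product_name="Laptop", price=55000.0))
print(len(alice.purchases))  # -> 1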
3. Data Dictionary

Data dictionaries are a tabular definition/representation of data assets.

A data dictionary is an inventory of data sets/tables with a list of their attributes/columns.

Core data dictionary elements:

• List of data sets/tables

• List of attributes/columns of each table with data type.

Optional data dictionary elements:

• Item descriptions

• Relationships between tables/columns,

• Additional constraints, such as uniqueness, default values, value constraints, or calculated columns.

A data dictionary is suitable as a detailed specification of data assets and can be supplemented with ER diagrams, as the two serve slightly different purposes.
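
As a small sketch, a data dictionary for the customer and product tables used earlier could be kept as a simple tabular structure (the entries here are illustrative assumptions):

import pandas as pd

# A minimal data dictionary: one row per column of each data set/table
data_dictionary = pd.DataFrame([
    {"table": "customer", "column": "customer_number", "data_type": "INTEGER",
     "description": "Unique identifier of the customer", "constraints": "primary key"},
    {"table": "customer", "column": "name", "data_type": "TEXT",
     "description": "Customer's full name", "constraints": "not null"},
    {"table": "product", "column": "price", "data_type": "REAL",
     "description": "Unit price of the product", "constraints": "default 0.0"},
])

print(data_dictionary.to_string(index=False))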

Missing Imputations

In statistics, imputation is the process of replacing missing data with substituted values. Because missing data can create problems for analysis, imputation is seen as a way to avoid the pitfalls involved with list-wise deletion of cases that have missing values.

Imputation simply means that we replace the missing values with guessed/estimated ones.

Mean, median, mode imputation

A simple guess for a missing value is the mean, median, or mode (most frequently appearing value) of that variable.
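
A minimal sketch using scikit-learn's SimpleImputer (the small array is illustrative):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[25.0], [30.0], [np.nan], [40.0]])

# Replace missing entries with the column mean (strategy can also be
# "median", or "most_frequent" for mode imputation)
imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(X))   # NaN becomes (25 + 30 + 40) / 3, about 31.67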

Regression imputation

Mean, median, or mode imputation only looks at the distribution of the values of the variable with missing entries. If we know there is a correlation between the variable with missing values and other variables, we can often get better guesses by regressing the missing variable on the other variables.
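
A minimal sketch of the idea, assuming a hypothetical dataset where income is missing for one row but correlated with age:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"age":    [22, 35, 47, 51, 29],
                   "income": [21000, 40000, np.nan, 58000, 30000]})

# Fit a regression of income on age using only the fully observed rows
observed = df.dropna()
reg = LinearRegression().fit(observed[["age"]], observed["income"])

# Predict (impute) income for the rows where it is missing
missing = df["income"].isna()
df.loc[missing, "income"] = reg.predict(df.loc[missing, ["age"]])
print(df)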

K-nearest neighbour (KNN) imputation

Besides model-based imputation such as regression imputation, neighbour-based imputation can also be used. K-nearest neighbour (KNN) imputation is an example of neighbour-based imputation. For a discrete variable, a KNN imputer uses the most frequent value among the k nearest neighbours; for a continuous variable, it uses the mean of the k nearest neighbours.
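
A short sketch using scikit-learn's KNNImputer (the data is illustrative):

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, np.nan],
              [8.0, 16.0]])

# Fill the missing value with the mean of the 2 nearest rows (by Euclidean distance)
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))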

Multiple imputations

Mean/median/mode imputation, regression imputation, stochastic regression imputation, and KNN imputation are all methods that create a single replacement value for each missing entry. Multiple Imputation (MI), rather than being a different method, is a general approach/framework in which the imputation procedure is performed multiple times to create different plausible imputed datasets. The key motivation for using MI is that a single imputation cannot reflect the sampling variability arising from both the sample data and the missing values.
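
One way to sketch this idea in Python is with scikit-learn's experimental IterativeImputer, drawing several different imputed datasets by sampling; this is a simplified illustration of the MI idea, not a full MI procedure such as MICE with pooling:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

X = np.array([[22, 21000], [35, 40000], [47, np.nan], [51, 58000], [29, 30000]], dtype=float)

# Create several plausible completed datasets by sampling from the
# predictive distribution instead of always taking the single best guess
imputed_datasets = []
for seed in range(3):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    imputed_datasets.append(imputer.fit_transform(X))

for d in imputed_datasets:
    print(d[2, 1])   # the imputed income differs slightly across datasets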
Methods to Handle Missing Data

Here’s a simplified explanation of the three types of missing data:

1. Missing Completely At Random (MCAR):

o Missing data is randomly distributed across all observations.

o There is no pattern or relationship with other variables.

o Example: Some students miss a survey, and there's no reason tied to other factors
(like their age or school).

o Check: If there's no significant difference in means between data with and without
missing values (via t-test), it's MCAR.

2. Missing At Random (MAR):

o Missing data is random, but only within specific subgroups.

o The missingness is related to other observed factors, but not to the missing values themselves.

o Example: Missing GPA data might occur randomly among students in some specific
schools, but not all schools.

3. Not Missing At Random (NMAR):

o Missing data follows a specific pattern related to the data itself.

o The reason for the missing data is connected to the value of the data.

o Example: If GPA data is missing only for students who have low GPA scores, then it's
NMAR.

Need for Business Modelling

1. Handling Large and Complex Data

o Description: Modern businesses deal with enormous volumes of structured and unstructured data, which require advanced systems and techniques for analysis.

o Reason: To make data-driven decisions efficiently in competitive environments.

2. Improving Decision-Making

o Description: Businesses use analytics to transition from intuition-based decisions to evidence-based strategies.

o Reason: To achieve higher accuracy in forecasting and planning.

3. Optimizing Performance

o Description: Business modeling identifies inefficiencies and areas for improvement in operations, marketing, and customer engagement.

o Reason: To increase productivity and profitability.


4. Adapting to Market Changes

o Description: Companies need models to predict trends and respond quickly to changing consumer needs.

o Reason: To maintain a competitive edge by being proactive rather than reactive.

5. Supporting Big Data Initiatives

o Description: Organizations require business models that can integrate big data
solutions like Hadoop and cloud computing.

o Reason: To process and analyze vast datasets effectively and derive actionable
insights.

6. Enhancing Customer Relationships

o Description: Business modeling helps in segmenting customers and tailoring offerings based on their preferences and behaviors.

o Reason: To improve customer satisfaction and loyalty.

7. Reducing Risks

o Description: Predictive analytics within business models helps identify potential risks
and offers mitigation strategies.

o Reason: To ensure stability and minimize unforeseen losses.

8. Driving Innovation

o Description: Modeling enables companies to identify new opportunities and innovate based on market data.

o Reason: To foster growth and remain relevant in evolving industries.

This separation highlights the distinct applications (real-world use cases) and needs (why modeling
is essential) for business analytics and modeling.
MapReduce

The Map() step: Each worker node applies the Map() function to its local data and writes the output to temporary storage. The Map() code is run exactly once for each K1 key value, generating output organized by key values K2. A master node ensures that, where redundant copies of the input data exist, only one copy is processed.

The Shuffle() step: The map output is sent to the reduce processors; the system assigns the K2 key values each processor should work on and provides that processor with all of the map-generated data associated with those key values, so that all data belonging to one key is located on the same worker node.

The Reduce() step: Worker nodes process each group of output data (per key) in parallel, executing the user-provided Reduce() code; the function is run exactly once for each K2 key value produced by the map step.

Produce the final output: The MapReduce system collects all of the reduce outputs and sorts them by K2 to produce the final outcome.
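
As a minimal single-machine sketch of these steps (a word-count example; the real framework distributes this work across Hadoop worker nodes):

from itertools import groupby
from operator import itemgetter

documents = ["big data big insights", "data drives decisions"]

# Map() step: emit (K2, value) pairs -- here (word, 1) for every word
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle step: group all pairs that share the same key K2
mapped.sort(key=itemgetter(0))
grouped = {key: [v for _, v in pairs] for key, pairs in groupby(mapped, key=itemgetter(0))}

# Reduce() step: run the reduce function once per K2 key
word_counts = {key: sum(values) for key, values in grouped.items()}
print(word_counts)   # e.g. {'big': 2, 'data': 2, ...}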
