UNIT-II
Table of Contents
● Introduction to Analytics
● Introduction to Tools and Environment
● Application of Modeling in Business
● Databases
● Types of Data and Variables
● Data Modeling Techniques
● Missing Imputations
● Need for Business Modeling
Introduction to Analytics
Data analytics involves examining data to uncover useful trends and patterns
that can guide decision-making. With the vast amount of data generated today
and the powerful computing tools available, businesses can use this
information to make informed decisions based on past successes. By
analyzing data, companies can gain insights that help improve their
operations and predict future outcomes.
Predictive Analytics
Predictive analytics uses historical data to forecast what is likely to happen in the future. Common techniques include:
● Linear Regression
● Data Mining
Basic Cornerstones of Predictive Analytics
● Predictive modeling
● Transaction profiling
Descriptive Analytics
Descriptive analytics examines past events for insight into how to approach future events. It looks at past performance and, by mining historical data, seeks to understand the causes of past success or failure. Almost all management reporting, such as sales, marketing, operations, and finance, uses this type of analysis.
A descriptive model quantifies relationships in data in a way that is often used to classify customers or prospects into groups. Unlike a predictive model, which focuses on predicting the behavior of a single customer, descriptive analytics identifies many different relationships between customers and products.
Common examples of descriptive analytics are company reports that provide historical reviews, such as:
● Data Queries
● Reports
● Descriptive Statistics
● Data dashboard
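As a small illustration of the descriptive-statistics building block, the sketch below summarizes a list of monthly sales figures using only Python's standard library (the sales values are invented purely for illustration):

```python
import statistics

# Hypothetical monthly sales figures (illustrative data only)
monthly_sales = [120, 135, 150, 110, 160, 145, 155, 130, 125, 140, 165, 150]

# Core descriptive statistics: central tendency and spread
mean_sales = statistics.mean(monthly_sales)
median_sales = statistics.median(monthly_sales)
stdev_sales = statistics.stdev(monthly_sales)

print(f"Mean:   {mean_sales:.1f}")    # Mean:   140.4
print(f"Median: {median_sales:.1f}")  # Median: 142.5
print(f"Stdev:  {stdev_sales:.1f}")
```

Summaries like these are exactly what reports and dashboards surface: a compact historical review of past performance.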
Prescriptive Analytics
Prescriptive analytics goes a step beyond prediction: it recommends specific actions by evaluating the likely outcomes of different decisions, often using optimization and simulation techniques.
Diagnostic Analytics
In this analysis, we generally rely on historical data to answer a question or solve a problem, looking for dependencies and patterns in the historical data of the particular problem.
Companies favor this analysis because it gives great insight into a problem, and they usually keep detailed historical data at their disposal; otherwise, data would have to be collected separately for every problem, which would be very time-consuming. Common techniques used for Diagnostic Analytics
are:
● Data discovery
● Data mining
● Correlations
Introduction to Tools and Environment
There are myriad data analytics tools that help us extract important information from data. Some of these free and open-source tools can be used even without any coding knowledge, and they make it easy to derive useful insights without sweating too much. For example, you could use them to compare cricket players based on various statistics and yardsticks. Such tools have strengthened the decision-making process by providing useful information that leads to better conclusions.
There are many tools that are used for deriving useful insights from the given
data. Some are programming based and others are non-programming based.
Some of the most popular tools are:
● SAS
● Microsoft Excel
● R
● Python
● Tableau
● RapidMiner
● KNIME
SAS: SAS is a software suite and programming language developed by the SAS Institute for performing advanced analytics, multivariate analysis, business intelligence, data management and predictive analytics.
It is proprietary software written in C, and its suite contains more than 200 components. Its programming language is considered high level, making it easier to learn. However, SAS was developed for very specific uses, and powerful tools are not added frequently to the already extensive collection, which makes it less scalable for certain applications. On the other hand, it can analyze data from various sources and can write results directly into an Excel spreadsheet.
It is used by many companies such as Google, Facebook, Twitter, Netflix and Accenture. In 2011, SAS brought to market a large set of products for customer intelligence, along with SAS modules for web, social media and marketing analytics, used largely for profiling customers and gaining insights about prospective customers. Even though it faces competition from languages such as R and Python, SAS continues to develop in order to remain a major stakeholder in the data analytics market.
Microsoft Excel: It is an important spreadsheet application that is useful for recording expenses, charting data, performing simple manipulation and lookup, and generating pivot tables that summarize large datasets. It is written in C++, C# and the .NET Framework, and its stable version was released in 2016. It includes a macro programming language, Visual Basic for Applications (VBA), for developing applications, and has various built-in functions to satisfy statistical, financial and engineering needs. It is the industry standard for spreadsheet applications. Companies also use it to perform real-time manipulation of data collected from external sources, such as stock market feeds, updating in real time to maintain a consistent view of the data. Compared to tools such as R or Python it is limited to moderately complex analyses, but it remains a common tool among financial analysts and sales managers for solving business problems.
R: It is one of the leading programming languages for complex statistical computation and graphics. It is a free and open-source language that runs on various UNIX platforms, Windows and macOS. It has an easy-to-use command-line interface, although the language can be tough to learn, especially for people without prior programming knowledge. It is nevertheless very useful for building statistical software and performing complex analyses. It has more than 11,000 packages, which can be browsed category-wise, and these packages can be used together with the Big Data technologies that have transformed how organizations view unstructured data. R also provides the tools required to install packages as per user requirements, which makes setup convenient.
Python: It is a powerful high-level programming language used for general-purpose programming. Python supports both structured and functional programming styles. Its extensive collection of libraries makes it very useful for data analysis; knowledge of TensorFlow, Theano, Keras, Matplotlib and scikit-learn can get you a lot closer to becoming a machine learning engineer. Everything in Python is an object, an attribute that makes it highly popular among developers. It is easier to learn than R and can connect to platforms such as SQL Server, MongoDB or JSON (JavaScript Object Notation) data sources. It is very useful for big data analysis, can be used to extract data from the web, and handles text data very well. Some of the companies that use Python for data analytics include Instagram, Facebook, Spotify and Amazon.
Tableau Public: Tableau Public is free software developed by the public company Tableau Software that allows users to connect to any spreadsheet or file and create interactive data visualizations. It can also be used to create maps and dashboards with real-time updates for easy presentation on the web. The results can be shared through social media sites or directly with the client, making it very convenient to use, and the resulting files can be downloaded in different formats. The software can connect to many types of data source, be it a data warehouse, an Excel file or web-based data. Approximately 446 companies use this software for operational purposes, including SoFi, The Sentinel and Visa.
RapidMiner: RapidMiner is an extremely versatile data science platform developed by RapidMiner Inc. The software emphasizes fast data science capabilities and provides an integrated environment for data preparation and for applying machine learning, deep learning, text mining and predictive analytics techniques. It can work with many data source types, including Access, SQL, Excel, Teradata, Sybase, Oracle, MySQL and dBase, and lets users control the data sets and formats used for predictive analysis. Approximately 774 companies use RapidMiner, most of them US-based; the list includes the Boston Consulting Group and Domino's Pizza Inc.
KNIME: KNIME, the Konstanz Information Miner, is a free and open-source data analytics software that is also used as a reporting and integration platform. It integrates various components for machine learning and data mining through modular data pipelining. It is written in Java, developed by KNIME AG, and runs on operating systems such as Linux, OS X and Windows. More than 500 companies currently use this software for operational purposes, including Aptus Data Labs and Continental AG.
Tool | Type | Primary Use | Programming Language | Notable Features
---- | ---- | ----------- | -------------------- | ----------------
SAS | Statistical Software | Advanced analytics, data management | Proprietary (written in C) | Over 200 components, used for customer intelligence
Microsoft Excel | Spreadsheet Software | Data organization, basic analysis | N/A | Widely used, supports pivot tables, charts, VBA
R | Programming Language & Software | Statistical computing, data analysis | R | Open-source, extensive packages for statistics and graphics
Python | Programming Language | General-purpose, data analysis | Python | Versatile, extensive libraries for data science (e.g., Pandas, NumPy, SciPy)
Tableau | Data Visualization Software | Business intelligence, data visualization | N/A | Intuitive drag-and-drop interface, integrates with various data sources
RapidMiner | Data Science Platform | Data preparation, machine learning | Java | GUI-based, supports various machine learning algorithms
KNIME | Data Analytics Platform | Data integration, processing, analysis | Java | Open-source, modular workflows, integrates with R and Python
Application of Modeling in Business
By applying models to operational data, for example, companies can evaluate the performance of machinery and make informed decisions about maintenance or replacement, ultimately enhancing operational efficiency.
Databases
Relational Databases
Relational databases store data in structured tables with rows and columns,
where relationships between data are defined using foreign keys. They use
SQL (Structured Query Language) for querying and managing data.
1. MySQL
o Open-source relational database.
o Widely used in web applications.
o Supports ACID transactions and has a large community.
2. PostgreSQL
o Advanced open-source relational database.
o Known for extensibility and standards compliance.
o Supports complex queries, foreign keys, triggers, and full transactional integrity.
3. Oracle Database
o Enterprise-level relational database.
o Known for high performance, scalability, and reliability.
o Extensive features for transaction management, analytics, and
data warehousing.
4. SQL Server (Microsoft)
o Microsoft's relational database management system.
o Integrates well with other Microsoft products.
o Includes tools for data warehousing, business intelligence, and
analytics.
5. SQLite
o Lightweight, file-based relational database.
o Often used in mobile apps and small applications.
o Doesn’t require a server to operate, making it easy to use and
deploy.
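The relational ideas above (tables, foreign keys, SQL queries) can be sketched with SQLite, since it needs no server. The schema and data below are invented purely for illustration:

```python
import sqlite3

# In-memory SQLite database: no server or file required
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two tables related by a foreign key (hypothetical schema)
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    amount REAL)""")

cur.execute("INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi')")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 1, 250.0), (2, 1, 100.0), (3, 2, 75.0)])

# SQL join across the foreign key: total order amount per customer
rows = cur.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name ORDER BY c.name""").fetchall()
print(rows)  # [('Asha', 350.0), ('Ravi', 75.0)]
conn.close()
```

The same SQL would run, with minor dialect differences, on MySQL, PostgreSQL, Oracle or SQL Server.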
NoSQL Databases
NoSQL databases store data in non-tabular models such as document, key-value, wide-column, and graph stores. Graph databases, for instance, excel at managing complex relationships between data by representing it as nodes and edges.
NoSQL databases are favored for their scalability, flexibility, and high
performance, particularly in scenarios where data is frequently changing and
rapidly growing. However, they may not be the best choice for all applications.
NoSQL databases generally lack full ACID compliance, which can lead to
issues with data consistency. Additionally, they are more complex to manage
than traditional relational databases and may not support complex queries
as effectively.
1. Cassandra
o Distributed NoSQL database.
o Designed for high availability and scalability.
o Uses a wide-column store model.
2. MongoDB
o Document-oriented NoSQL database.
o Stores data in flexible, JSON-like documents.
o Great for applications needing fast, iterative development.
3. Couchbase
o A NoSQL database that combines the best of both document and
key-value stores.
o Offers flexible data models, scalability, and high availability.
o Optimized for interactive web applications.
4. Redis
o In-memory key-value store.
o Known for its speed and support for various data structures like
strings, lists, sets, and hashes.
o Often used for caching, real-time analytics, and messaging.
5. DynamoDB
o Managed NoSQL database service by Amazon Web Services
(AWS).
o Supports both document and key-value store models.
o Designed for high availability and scalability with seamless
integration with other AWS services.
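To make the "flexible, JSON-like documents" idea concrete, the sketch below mimics a tiny document store with plain Python dictionaries. This illustrates only the data model, not a real MongoDB or Couchbase client; all names and fields are invented:

```python
import json

# A "collection" of documents: each document is schema-flexible,
# so different documents may carry different fields.
users = [
    {"_id": 1, "name": "Asha", "email": "asha@example.com"},
    {"_id": 2, "name": "Ravi", "tags": ["premium"], "age": 30},
]

def find(collection, **criteria):
    """Return documents whose fields match all the given criteria."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

premium = find(users, tags=["premium"])
print(json.dumps(premium, indent=2))
```

Note what a relational schema would forbid: the two documents have different fields, yet both live in the same collection. This is the flexibility (and the consistency risk) the section above describes.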
Data Warehouses
Data warehouses are designed for querying and analyzing large volumes of
data. They store data from various sources and are optimized for read-heavy
operations and complex queries.
1. Amazon Redshift
o Fully managed data warehouse service by AWS.
o Optimized for online analytical processing (OLAP).
o Integrates well with AWS’s ecosystem and scales easily.
2. Google BigQuery
o Serverless, highly scalable data warehouse by Google Cloud.
o Supports SQL queries and can analyze terabytes of data in
seconds.
o Built for real-time analytics and integrates with other Google
Cloud services.
3. Snowflake
o Cloud-native data warehouse.
o Separates compute and storage, allowing independent scaling.
o Supports structured and semi-structured data (e.g., JSON,
Parquet).
4. Azure Synapse (formerly SQL Data Warehouse)
o Integrated analytics service by Microsoft Azure.
o Combines big data and data warehousing.
o Offers capabilities for data integration, exploration, preparation,
and analysis.
Table 1: SQL vs NoSQL

SQL | NoSQL
--- | -----
Fixed or static, predefined schema | Dynamic schema
Not suited for hierarchical data storage | Best suited for hierarchical data storage
Best suited for complex queries | Not ideal for complex queries
Vertically scalable | Horizontally scalable
Follows ACID properties | Follows the CAP theorem (Consistency, Availability, Partition Tolerance)
MySQL, PostgreSQL, Oracle, MS SQL Server, etc. | MongoDB, HBase, Neo4j, Cassandra, etc.
Types of Data and Variables
Nominal Attributes
Nominal means "relating to names": the values of a nominal attribute are symbols or names of things, and each value represents some kind of category or state. Suppose that hair color and marital status are two attributes describing person objects. In our application, possible values for hair color are black, brown, blond, red, auburn, gray, and white. The attribute marital status can take on the values single, married, divorced, and widowed. Both hair color and marital status are nominal attributes. Another example of a nominal attribute is occupation, with the values teacher, dentist, programmer, farmer, and so on.
Binary Attributes
A binary attribute is a nominal attribute with only two states: 0 or 1, where 0 typically means the attribute is absent and 1 means it is present. Given the attribute smoker describing a patient object, 1 indicates that the patient smokes, while 0 indicates that the patient does not. Similarly, suppose the patient undergoes a medical test that has two possible outcomes. The attribute medical test is binary, where a value of 1 means the result of the test for the patient is positive, while 0 means the result is negative.
A binary attribute is symmetric if both of its states are equally valuable and
carry the same weight; that is, there is no preference on which outcome
should be coded as 0 or 1. One such example could be the attribute gender
having the states male and female.
A binary attribute is asymmetric if the outcomes of the states are not equally important, such as the positive and negative outcomes of a medical test for HIV. By convention, we code the most important outcome, which is usually the rarest one, by 1 (e.g., HIV positive) and the other by 0 (e.g., HIV negative).
Ordinal Attributes
An ordinal attribute is an attribute whose possible values have a meaningful order or ranking among them, although the magnitude between successive values is not known. Suppose that drink size corresponds to the size of drinks available at a fast-food restaurant. This ordinal attribute has three possible values: small, medium, and large. The values have a meaningful sequence (which corresponds to increasing drink size); however, we cannot tell from the values how much bigger, say, a large is than a medium. Other examples of ordinal attributes include grade (e.g., A, B, C, and so on) and professional rank. Professional ranks can be enumerated in sequential order: for example, assistant, associate, and full for professors, and private, private first class, specialist, corporal, and sergeant for army ranks.
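Because ordinal values carry only order information, any order-preserving (monotonic) recoding leaves rank-based statistics such as the median unchanged. A minimal sketch, in which both numeric codings are arbitrary choices for illustration:

```python
import statistics

# Ordinal responses: only the order small < medium < large matters
responses = ["small", "large", "medium", "medium", "large"]

# Two different but order-preserving numeric codings
coding_a = {"small": 1, "medium": 2, "large": 3}
coding_b = {"small": 0.5, "medium": 1, "large": 10}

median_a = statistics.median(coding_a[r] for r in responses)
median_b = statistics.median(coding_b[r] for r in responses)

# Under both codings the median falls on the same category: "medium"
print(median_a, median_b)  # 2 1
```

This is why, as noted in the attribute table in this unit, {good, better, best} can be represented equally well by {1, 2, 3} or {0.5, 1, 10}.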
Numeric Attributes
A numeric attribute is quantitative: it is measurable and represented by integer or real values. Numeric attributes can be interval-scaled or ratio-scaled.
Interval-Scaled Attributes
Interval-scaled attributes are measured on a scale of equal-size units; a common example is outdoor temperature. In addition, we can quantify the difference between values. For example, a temperature of 20°C is five degrees higher than a temperature of 15°C. Calendar dates are another example: the years 2002 and 2010 are eight years apart.
Ratio-Scaled Attributes
A ratio-scaled attribute is a numeric attribute with an inherent zero-point; that is, we can speak of a value as being a multiple (or ratio) of another value. Examples include temperature in Kelvin, counts, monetary quantities, age, mass, and length.

Discrete versus Continuous Attributes
A discrete attribute has a finite or countably infinite set of values; attributes such as hair color, smoker, and drink size are discrete, and a countably infinite attribute is one whose values can be put in one-to-one correspondence with the set of integers. Zip codes are another example.
If an attribute is not discrete, it is continuous. The terms numeric attribute and continuous attribute are often used interchangeably in the literature. (This can be confusing because, in the classic sense, continuous values are real numbers, whereas numeric values can be either integers or real numbers.) In practice, real values are represented using a finite number of digits, and continuous attributes are typically represented as floating-point variables. Note that the properties of an attribute can differ from the properties of the values used to represent it.
Table 2: Types of Attributes

Attribute Type | Description | Examples | Operations | Transformation | Comment
-------------- | ----------- | -------- | ---------- | -------------- | -------
Nominal | Nominal attribute values only distinguish (=, ≠). | Zip codes, employee ID numbers, eye color, sex: {male, female} | Mode, entropy, contingency correlation, chi-square test | Any one-to-one mapping, e.g., a permutation of values | If all employee ID numbers are reassigned, it will not make any difference.
Ordinal | Ordinal attribute values also order objects (<, >). | Hardness of minerals, {good, better, best}, grades, street numbers | Median, percentiles, rank correlation, run tests, sign tests | An order-preserving change of values, i.e., new value = f(old value), where f is a monotonic function | An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.
Interval | For interval attributes, differences between values are meaningful (+, -). | Calendar dates, temperature in Celsius or Fahrenheit | Mean, standard deviation, Pearson's correlation, t and F tests | new value = a * old value + b, where a and b are constants | The Fahrenheit and Celsius temperature scales differ in the location of their zero value and the size of a degree (unit).
Ratio | For ratio variables, both differences and ratios are meaningful (*, /). | Temperature in Kelvin, monetary quantities, counts, age, mass, length, current | Geometric mean, harmonic mean, percent variation | new value = a * old value | Length can be measured in meters or feet.
Data Modeling Techniques
The semistructured data model permits the specification of data where individual data items of the same type may have different sets of attributes. The Extensible Markup Language (XML) is widely used to represent semistructured data.
Historically, two other data models, the network data model and the hierarchical data model, preceded the relational data model. These models were tied closely to the underlying implementation and complicated the task of modeling data.
Missing Imputations
Missing Completely at Random (MCAR)
MCAR occurs when the probability of data being missing is uniform across all observations. There is no relationship between the missingness of the data and any other observed or unobserved data within the dataset. This type of missing data is purely random and lacks any discernible pattern.
Example: In a survey about library books, some overdue book values in the
dataset are missing due to human error in recording.
Missing at Random (MAR)
MAR occurs when the probability of data being missing depends only on the observed data and not on the missing values themselves. In other words, the missingness can be explained by variables for which you have complete information: there is a pattern in the missing values, but that pattern can be explained by other observed variables.
Example: In a survey, ‘Age’ values might be missing for those who did not
disclose their ‘Gender’. Here, the missingness of ‘Age’ depends on ‘Gender’,
but the missing ‘Age’ values are random among those who did not disclose
their ‘Gender’.
Missing Not at Random (MNAR)
MNAR occurs when the missingness of data is related to the unobserved values themselves. This type of missing data has a specific pattern that cannot be explained by the observed variables.
Example: In a survey about library books, people with more overdue books might be less likely to respond to the survey. Thus, whether the number of overdue books is missing depends on that (unobserved) number itself.
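The three mechanisms can be illustrated by masking values of a toy dataset in three different ways. The records and masking rules below are invented purely for illustration:

```python
import random

random.seed(42)

# Toy records: (gender, age, overdue_books); all values are made up
records = [(random.choice(["M", "F"]), random.randint(18, 70),
            random.randint(0, 12)) for _ in range(1000)]

# MCAR: every age value has the same 20% chance of being missing
mcar = [(g, None if random.random() < 0.2 else a, b) for g, a, b in records]

# MAR: age is missing more often for one *observed* gender value
mar = [(g, None if (g == "F" and random.random() < 0.4) else a, b)
       for g, a, b in records]

# MNAR: overdue_books is missing more often when its *own* value is high
mnar = [(g, a, None if (b > 8 and random.random() < 0.7) else b)
        for g, a, b in records]

def missing_rate(rows, idx):
    return sum(r[idx] is None for r in rows) / len(rows)

print(f"MCAR age missing:   {missing_rate(mcar, 1):.2f}")
print(f"MAR age missing:    {missing_rate(mar, 1):.2f}")
print(f"MNAR books missing: {missing_rate(mnar, 2):.2f}")
```

In the MAR case the missingness of age is fully explained by the observed gender column; in the MNAR case no observed column explains which overdue counts are missing, which is what makes MNAR the hardest mechanism to correct for.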
Missing data is a common headache in any field that deals with datasets. It can arise for various reasons, from human error during data collection to limitations of data-gathering methods. Fortunately, there are strategies to address missing data and minimize its impact on your analysis. The two main approaches are:
● Deletion: remove the rows (or columns) that contain missing values. This is simple, but it discards information and can bias results unless the data are MCAR.
● Imputation: fill in the missing values with plausible estimates, for example the mean, median or mode of the observed values, or predictions from a model.
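Imputation, one common approach, can be sketched with only the standard library; the column values below are invented:

```python
from statistics import mean

# One numeric column with missing values (None); data invented
ages = [25, None, 31, 22, None, 40, 28]

# Mean imputation: replace each missing value with the mean
# of the observed (non-missing) values.
observed = [x for x in ages if x is not None]
fill = mean(observed)                 # (25+31+22+40+28)/5 = 29.2
imputed = [fill if x is None else x for x in ages]

print(imputed)  # [25, 29.2, 31, 22, 29.2, 40, 28]
```

Mean imputation keeps every row but shrinks the column's variance, so it is best suited to data that are MCAR or MAR; for MNAR data it can introduce bias.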
Need for Business Modeling
A business model is a schematic representation of the interconnected processes of a company. It shows what is sold and to whom, and determines how profit is made.
Below, we consider in more detail what a business model is and what it is used for.
● It calculates how much money will be needed for the startup, what expenses will arise each month, and the expected level of profit in the early stages of operation.
● It analyzes how customers interact with your business and helps reduce expenses.
In case of losses:
Attracting investments: