Unit 2

The document provides an overview of data analytics, including its types (predictive, descriptive, prescriptive, and diagnostic) and their applications in various business sectors such as finance, marketing, and human resources. It also discusses popular data analytics tools like SAS, Excel, R, Python, Tableau, RapidMiner, and KNIME, highlighting their features and uses. Furthermore, it covers the importance of databases, differentiating between relational and NoSQL databases, and their roles in managing data.

Data Analytics (U21PC701CS)

UNIT-II

Table of Contents
Introduction to Analytics
Introduction to Tools and Environment
Application of Modeling in Business
Databases
Types of Data and Variables
Data Modeling Techniques
Missing Imputations
Need for Business Modeling

Introduction to Analytics

Data analytics involves examining data to uncover useful trends and patterns
that can guide decision-making. With the vast amount of data generated today
and the powerful computing tools available, businesses can use this
information to make informed decisions based on past successes. By
analyzing data, companies can gain insights that help improve their
operations and predict future outcomes.

For example, in manufacturing, data on machine performance and work
queues can help optimize production processes, ensuring machines run
efficiently. Similarly, gaming companies use data to create engaging reward
systems, and content providers analyze user interactions to improve content
presentation and engagement. In essence, data analytics helps organizations
in various sectors make better decisions and enhance their performance.

Types of Data Analytics

There are four major types of data analytics:

1. Predictive (forecasting)
2. Descriptive (business intelligence and data mining)
3. Prescriptive (optimization and simulation)
4. Diagnostic analytics

Data Analytics and its Types

Predictive Analytics

Predictive analytics turns data into valuable, actionable information. It
uses data to determine the probable outcome of an event or the likelihood
of a situation occurring. Predictive analytics draws on a variety of
statistical techniques from modeling, machine learning, data mining,
and game theory that analyze current and historical facts to make
predictions about future events. Techniques used for predictive analytics are:

● Linear Regression

● Time Series Analysis and Forecasting

● Data Mining
Basic Cornerstones of Predictive Analytics

● Predictive modeling

● Decision Analysis and optimization

● Transaction profiling
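A minimal sketch of the first technique above, simple linear regression, written in plain Python (the monthly sales figures and variable names are invented for illustration):

```python
# Predictive sketch: fit y = a + b*x by ordinary least squares and
# forecast the next period (all sales figures are hypothetical).

def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

months = [1, 2, 3, 4, 5, 6]
sales = [100, 110, 125, 130, 145, 150]
a, b = fit_line(months, sales)
forecast = a + b * 7        # predicted sales for month 7
```

In practice a library such as scikit-learn or statsmodels would be used, but the idea is the same: learn a relationship from historical data, then extrapolate it.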

Descriptive Analytics

Descriptive analytics looks at past events for insight into how to
approach future events. It examines past performance by mining historical
data to understand the causes of success or failure. Almost all
management reporting, such as sales, marketing, operations, and finance,
uses this type of analysis.
The descriptive model quantifies relationships in data in a way that is often
used to classify customers or prospects into groups. Unlike a predictive model,
which focuses on predicting the behavior of a single customer, descriptive
analytics identifies many different relationships between customers and
products.
Common examples of descriptive analytics are company reports that
provide historical reviews, such as:

● Data Queries

● Reports

● Descriptive Statistics

● Data dashboard
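The descriptive statistics behind such a report can be sketched with Python's standard library (the quarterly sales figures are invented):

```python
# Descriptive sketch: summary statistics of past performance,
# the raw material of a typical management report.
from statistics import mean, median, stdev

quarterly_sales = [230, 245, 210, 260, 250, 240]
summary = {
    "count": len(quarterly_sales),
    "mean": mean(quarterly_sales),
    "median": median(quarterly_sales),
    "stdev": stdev(quarterly_sales),
    "min": min(quarterly_sales),
    "max": max(quarterly_sales),
}
```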

Prescriptive Analytics

Prescriptive analytics automatically synthesizes big data, mathematical
science, business rules, and machine learning to make a prediction and then
suggests decision options to take advantage of that prediction.
Prescriptive analytics goes beyond predicting future outcomes by also
suggesting actions that benefit from the predictions and showing the decision
maker the implications of each decision option. Prescriptive analytics not only
anticipates what will happen and when, but also why it will happen.
Further, prescriptive analytics can suggest decision options on how to take
advantage of a future opportunity or mitigate a future risk, and illustrate the
implications of each decision option.
For example, prescriptive analytics can benefit healthcare strategic planning
by leveraging operational and usage data combined with data on external
factors such as economic conditions and population demographics.
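A toy sketch of the prescriptive idea: score each decision option and recommend the best one. The options, probabilities, payoffs, and costs below are all invented for illustration; real systems use optimization and simulation.

```python
# Prescriptive sketch: rank decision options by expected payoff
# (hypothetical probabilities and payoffs).
options = {
    "expand_capacity": {"p_success": 0.6, "payoff": 500, "cost": 200},
    "maintain": {"p_success": 0.9, "payoff": 150, "cost": 50},
}

def expected_value(opt):
    return opt["p_success"] * opt["payoff"] - opt["cost"]

best = max(options, key=lambda name: expected_value(options[name]))
```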

Diagnostic Analytics

In diagnostic analysis, we generally use historical data to answer a
question or diagnose the cause of a problem, looking for dependencies
and patterns in the historical data of the particular problem.
Companies favor this analysis because it gives great insight into a
problem, provided they keep detailed information at their disposal;
otherwise, data would have to be collected afresh for every problem, which
is very time-consuming. Common techniques used for diagnostic analytics
are:

● Data discovery

● Data mining

● Correlations
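The correlation technique above can be sketched in plain Python. The ad-spend and revenue series are invented; a coefficient near 1 suggests (but does not prove) a dependency worth investigating.

```python
# Diagnostic sketch: Pearson correlation to probe a dependency
# between two historical series (both series are hypothetical).
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

ad_spend = [10, 20, 30, 40, 50]
revenue = [120, 150, 180, 240, 260]
r = pearson(ad_spend, revenue)   # close to 1: strong positive association
```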

Introduction to Tools and Environment

Data analytics is an important aspect of many organizations nowadays.
Real-time data analytics is essential for the success of a major organization
and helps drive decision making. This section surveys the various data
analytics tools in use and how they differ.

There are myriad data analytics tools that help us get important
information from the given data. Some of these free and open-source tools
can be used even without any coding knowledge to derive useful insights
without much effort. For example, you could use them to determine the
better of two cricket players based on various statistics and yardsticks.
Such tools have strengthened the decision-making process by providing
useful information that helps reach better conclusions.

There are many tools that are used for deriving useful insights from the given
data. Some are programming based and others are non-programming based.
Some of the most popular tools are:

● SAS

● Microsoft Excel

● R

● Python

● Tableau

● RapidMiner

● KNIME
SAS: SAS is a programming language and software suite developed by the
SAS Institute for performing advanced analytics, multivariate analysis,
business intelligence, data management, and predictive analytics.
It is proprietary software written in C, and its software suite contains more
than 200 components. Its programming language is considered high level,
making it easier to learn. However, SAS was developed for very specific uses,
and powerful new tools are not added frequently to its already extensive
collection, making it less scalable for certain applications. It can,
however, analyze data from various sources and write the results directly
into an Excel spreadsheet.
It is used by many companies such as Google, Facebook, Twitter, Netflix and
Accenture. SAS brought to the market a huge set of products in 2011 for
customer intelligence and various SAS modules for web, social media and
marketing analytics used largely for profiling customers and gaining insights
about prospective customers. Even though it is under pressure from upcoming
languages such as R and Python, SAS continues to develop in order to prove
that it is still a major stakeholder in the data analytics market.
Microsoft Excel: It is an important spreadsheet application that can be
useful for recording expenses, charting data, performing easy manipulation
and lookup, and generating pivot tables to provide summarized reports of
large datasets. It is written in C#, C++, and the .NET Framework, and its
stable version was released in 2016. It uses a macro programming language
called Visual Basic for Applications (VBA) for developing applications, and
has various built-in functions to satisfy statistical, financial, and
engineering needs. It is the industry standard for spreadsheet applications.
Companies also use it to perform real-time manipulation of data collected
from external sources, such as stock market feeds, updating in real time to
maintain a consistent view of the data. It is relatively limited, however,
for the more complex analyses handled by tools such as R or Python. It is a
common tool among financial analysts and sales managers for solving
business problems.
R : It is one of the leading programming languages for performing complex
statistical computations and graphics. It is a free and open-source language
that can be run on various UNIX platforms, Windows and MacOS. It also has
a command line interface which is easy to use. However, it is tough to learn,
especially for people who do not have prior programming knowledge. It is
nevertheless very useful for building statistical software and performing
complex analyses. It has more than 11,000 packages, and we can browse the
packages category-wise. These packages can also be used with Big Data, the
catalyst which has transformed various organizations' views on unstructured
data. R also provides the tools required to install packages as per user
requirements, which makes setting up convenient.
Python: It is a powerful high-level programming language that is used for
general-purpose programming. Python supports both structured and
functional programming methods. Its extensive collection of libraries makes
it very useful in data analysis. Knowledge of TensorFlow, Theano, Keras,
Matplotlib, and Scikit-learn can get you a lot closer to your dream of
becoming a machine learning engineer. Everything in Python is an object,
and this attribute makes it highly popular among developers. It is easier to
learn than R and can be integrated with platforms such as SQL Server,
MongoDB, or JSON (JavaScript Object Notation) data sources. It is very
useful for big data analysis, can be used to extract data from the web, and
handles text data very well. Some of the companies that use Python for data
analytics include Instagram, Facebook, Spotify, and Amazon.
Tableau Public: Tableau Public is free software developed by the public
company Tableau Software that allows users to connect to any spreadsheet
or file and create interactive data visualizations. It can also be used to
create maps and dashboards with real-time updates for easy presentation on
the web. Results can be shared through social media sites or directly with a
client, making it very convenient to use, and the resulting files can be
downloaded in different formats. The software can connect to any type of
data source, be it a data warehouse, an Excel file, or web-based data.
Approximately 446 companies use this software for operational purposes,
including SoFi, The Sentinel, and Visa.
RapidMiner: RapidMiner is an extremely versatile data science platform
developed by RapidMiner Inc. The software emphasizes lightning-fast data
science capabilities and provides an integrated environment for data
preparation and the application of machine learning, deep learning, text
mining, and predictive analytics techniques. It can work with many data
source types, including Access, SQL, Excel, Teradata, Sybase, Oracle,
MySQL, and dBase, and lets users control the datasets and formats used for
predictive analysis. Approximately 774 companies use RapidMiner, most of
them US-based; the list includes the Boston Consulting Group and Domino's
Pizza Inc.
KNIME: KNIME, the Konstanz Information Miner, is a free and open-source
data analytics software. It is also used as a reporting and integration
platform. It integrates various components for machine learning and data
mining through modular data pipelining. It is written in Java and developed
by KNIME.com AG, and runs on various operating systems such as Linux,
OS X, and Windows. More than 500 companies currently use this software
for operational purposes, including Aptus Data Labs and Continental AG.
| Tool | Type | Primary Use | Programming Language | Notable Features |
|---|---|---|---|---|
| SAS | Statistical Software | Advanced analytics, data management | Proprietary (written in C) | Over 200 components, used for customer intelligence |
| Microsoft Excel | Spreadsheet Software | Data organization, basic analysis | N/A | Widely used, supports pivot tables, charts, VBA |
| R | Programming Language & Software | Statistical computing, data analysis | R | Open-source, extensive packages for statistics and graphics |
| Python | Programming Language | General-purpose, data analysis | Python | Versatile, extensive libraries for data science (e.g., Pandas, NumPy, SciPy) |
| Tableau | Data Visualization Software | Business intelligence, data visualization | N/A | Intuitive drag-and-drop interface, integrates with various data sources |
| RapidMiner | Data Science Platform | Data preparation, machine learning | Java | GUI-based, supports various machine learning algorithms |
| KNIME | Data Analytics Platform | Data integration, processing, analysis | Java | Open-source, modular workflows, integrates with R and Python |

Application of Modeling in Business

Finance Sector: In finance, business analytics is essential for
budgeting, financial planning, portfolio management, investment
banking, and forecasting trends. By analyzing vast amounts of financial
data, companies can determine the true value of products and offer
data-driven advice to clients on whether to retain or sell assets. This
helps optimize financial decisions and improve profitability.

Marketing: Business analytics allows companies to deeply analyze
competitors' sales, market trends, and consumer behavior. By studying
buying patterns, businesses can identify target audiences and develop
effective advertising strategies tailored to specific regions and customer
groups. This data-driven approach helps in making more informed
decisions and building stronger customer relationships.

Human Resources (HR): HR professionals use analytics to gather
detailed information about candidates, including their educational
background and extracurricular activities. For existing employees,
analytics can help predict promotion timelines and retention rates, and
analyze factors like gender and age. This information plays a crucial
role in hiring, promotions, and improving employee satisfaction.

Consumer Relationship Management (CRM): In CRM, business
analytics helps organizations maintain strong relationships with
customers by analyzing key performance indicators and socio-economic
factors. By understanding customer preferences and purchasing
patterns, companies can tailor their services to meet specific needs,
improving customer satisfaction and loyalty.

Manufacturing: In the manufacturing sector, business analytics is
used to manage supply chains, assess performance targets, and
improve inventory management. By analyzing historical data,
companies can evaluate the performance of machinery and make
informed decisions about maintenance or replacement, ultimately
enhancing operational efficiency.

Credit Card Companies: Credit card companies use business analytics
to assess the financial health and purchasing preferences of customers.
By analyzing transaction data, they can identify spending patterns and
target specific audiences more effectively. This information is also
valuable to the retail and manufacturing sectors, helping them refine
their advertising and marketing strategies.

Other Applications: Beyond these sectors, business analytics plays a
vital role in biomedical research, healthcare, IoT, fraud detection,
defense, cybersecurity, sales, and government policy-making. By
leveraging data, organizations across various industries can make more
informed decisions, improve operations, and drive innovation.

Databases

Relational Databases

Relational databases store data in structured tables with rows and columns,
where relationships between data are defined using foreign keys. They use
SQL (Structured Query Language) for querying and managing data.

1. MySQL
o Open-source relational database.
o Widely used in web applications.
o Supports ACID transactions and has a large community.
2. PostgreSQL
o Advanced open-source relational database.
o Known for extensibility and standards compliance.
o Supports complex queries, foreign keys, triggers, and full
transactional integrity.

3. Oracle Database
o Enterprise-level relational database.
o Known for high performance, scalability, and reliability.
o Extensive features for transaction management, analytics, and
data warehousing.
4. SQL Server (Microsoft)
o Microsoft's relational database management system.
o Integrates well with other Microsoft products.
o Includes tools for data warehousing, business intelligence, and
analytics.
5. SQLite
o Lightweight, file-based relational database.
o Often used in mobile apps and small applications.
o Doesn’t require a server to operate, making it easy to use and
deploy.
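A quick sketch of the relational model in action, using Python's built-in sqlite3 module (the table and column names are illustrative): two tables, a foreign key relating them, and an SQL join.

```python
# Relational sketch: two tables linked by a foreign key, queried with SQL,
# via an in-memory SQLite database (schema and rows are invented).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE emp (
                    id INTEGER PRIMARY KEY,
                    name TEXT,
                    dept_id INTEGER REFERENCES dept(id))""")
conn.execute("INSERT INTO dept VALUES (1, 'Analytics')")
conn.execute("INSERT INTO emp VALUES (10, 'Asha', 1)")
row = conn.execute("""SELECT e.name, d.name
                      FROM emp e JOIN dept d ON e.dept_id = d.id""").fetchone()
```

The same SQL would run largely unchanged on MySQL, PostgreSQL, Oracle, or SQL Server, which is the point of the relational model's standardized query language.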

NoSQL Databases

NoSQL databases are designed for unstructured, semi-structured, or
structured data that doesn't fit well into a traditional relational database.
They are often used in big data and real-time web applications.

Unlike traditional relational databases, which store data in tables with
predefined schemas, NoSQL databases use flexible data models that can
easily adapt to changes. They are particularly well-suited for applications
that require the ability to scale horizontally to manage growing amounts of
data, such as big data and real-time web applications.

There are four main categories of NoSQL databases: document databases,
key-value stores, column-family stores, and graph databases. Document
databases store data in formats like JSON or XML, making them flexible for
handling varying data structures. Key-value stores focus on simplicity and
speed, storing data as key-value pairs. Column-family stores organize data
into columns, allowing for efficient querying of large datasets. Graph
databases excel at managing complex relationships between data by
representing it as nodes and edges.

NoSQL databases are favored for their scalability, flexibility, and high
performance, particularly in scenarios where data is frequently changing and
rapidly growing. However, they may not be the best choice for all applications.
NoSQL databases generally lack full ACID compliance, which can lead to
issues with data consistency. Additionally, they are more complex to manage
than traditional relational databases and may not support complex queries
as effectively.
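A toy illustration of the document model's flexible schema, using plain Python dictionaries as stand-in JSON documents (the field names and values are invented):

```python
# Document-store sketch: documents in one collection may carry
# different fields, unlike rows in a fixed relational schema.
import json

users = [
    {"_id": 1, "name": "Ravi", "email": "ravi@example.com"},
    {"_id": 2, "name": "Meera", "tags": ["analytics", "nosql"]},  # extra field
]

def find_with_field(collection, field):
    """Return every document that carries the given field."""
    return [doc for doc in collection if field in doc]

tagged = find_with_field(users, "tags")
as_json = json.dumps(tagged[0])   # documents serialize naturally to JSON
```

A real document database such as MongoDB adds indexing, persistence, and a query language on top of this basic idea.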

1. Cassandra
o Distributed NoSQL database.
o Designed for high availability and scalability.
o Uses a wide-column store model.
2. MongoDB
o Document-oriented NoSQL database.
o Stores data in flexible, JSON-like documents.
o Great for applications needing fast, iterative development.
3. Couchbase
o A NoSQL database that combines the best of both document and
key-value stores.
o Offers flexible data models, scalability, and high availability.
o Optimized for interactive web applications.
4. Redis
o In-memory key-value store.
o Known for its speed and support for various data structures like
strings, lists, sets, and hashes.
o Often used for caching, real-time analytics, and messaging.
5. DynamoDB
o Managed NoSQL database service by Amazon Web Services
(AWS).
o Supports both document and key-value store models.

o Designed for high availability and scalability with seamless
integration with other AWS services.

Data Warehouses

Data warehouses are designed for querying and analyzing large volumes of
data. They store data from various sources and are optimized for read-heavy
operations and complex queries.

1. Amazon Redshift
o Fully managed data warehouse service by AWS.
o Optimized for online analytical processing (OLAP).
o Integrates well with AWS’s ecosystem and scales easily.
2. Google BigQuery
o Serverless, highly scalable data warehouse by Google Cloud.
o Supports SQL queries and can analyze terabytes of data in
seconds.
o Built for real-time analytics and integrates with other Google
Cloud services.
3. Snowflake
o Cloud-native data warehouse.
o Separates compute and storage, allowing independent scaling.
o Supports structured and semi-structured data (e.g., JSON,
Parquet).
4. Azure Synapse (formerly SQL Data Warehouse)
o Integrated analytics service by Microsoft Azure.
o Combines big data and data warehousing.
o Offers capabilities for data integration, exploration, preparation,
and analysis.
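The read-heavy, aggregate-style (OLAP) queries that warehouses are optimized for can be sketched with SQLite standing in for a real warehouse (the schema and figures are invented):

```python
# Warehouse-style sketch: a GROUP BY aggregation over a fact table,
# the typical shape of an analytical (OLAP) query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 100), ("south", 80), ("north", 120)])
rows = conn.execute("""SELECT region, SUM(amount) AS total
                       FROM sales GROUP BY region
                       ORDER BY region""").fetchall()
```

Redshift, BigQuery, and Snowflake run essentially this kind of SQL, but over terabytes, with columnar storage and distributed execution.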

Table 1: SQL vs NoSQL

| SQL | NoSQL |
|---|---|
| Fixed or static, predefined schema | Dynamic schema |
| Not suited for hierarchical data storage | Best suited for hierarchical data storage |
| Best suited for complex queries | Not ideal for complex queries |
| Vertically scalable | Horizontally scalable |
| Follows ACID properties | Follows CAP (Consistency, Availability, Partition Tolerance) |
| MySQL, PostgreSQL, Oracle, MS-SQL Server, etc. | MongoDB, HBase, Neo4j, Cassandra, etc. |

Types of Data and Variables

An attribute is a property or characteristic of an object, such as eye color or
temperature. It is also referred to as a variable, field, characteristic,
dimension, or feature. A collection of attributes is used to describe an object,
which may also be known as a record, point, case, sample, entity, or instance.
The values of attributes are numbers or symbols assigned to them for a
particular object. It is important to distinguish between attributes and
attribute values, as the same attribute can have different values depending
on the measurement, such as height being measured in feet or meters.
Additionally, different attributes can share the same set of values, like ID and
age both being represented as integers.

Nominal means “relating to names.” The values of a nominal attribute are
symbols or names of things. Each value represents some kind of category,
code, or state, and so nominal attributes are also referred to as categorical.
The values do not have any meaningful order. In computer science, the values
are also known as enumerations.

Suppose that hair color and marital status are two attributes describing person
objects. In our application, possible values for hair color are black, brown,
blond, red, auburn, gray, and white. The attribute marital status can take on
the values single, married, divorced, and widowed. Both hair color and marital

status are nominal attributes. Another example of a nominal attribute is
occupation, with the values teacher, dentist, programmer, farmer, and so on.
Binary Attributes

A binary attribute is a nominal attribute with only two categories or states:
0 or 1, where 0 typically means that the attribute is absent, and 1 means that
it is present. Binary attributes are referred to as Boolean if the two states
correspond to true and false.

Given the attribute smoker describing a patient object, 1 indicates that the
patient smokes, while 0 indicates that the patient does not. Similarly, suppose
the patient undergoes a medical test that has two possible outcomes. The
attribute medical test is binary, where a value of 1 means the result of the test
for the patient is positive, while 0 means the result is negative.

A binary attribute is symmetric if both of its states are equally valuable and
carry the same weight; that is, there is no preference on which outcome
should be coded as 0 or 1. One such example could be the attribute gender
having the states male and female.

A binary attribute is asymmetric if the outcomes of the states are not equally
important, such as the positive and negative outcomes of a medical test for
HIV. By convention, we code the most important outcome, which is usually
the rarest one, by 1 (e.g., HIV positive) and the other by 0 (e.g., HIV negative).
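The convention can be sketched in a few lines of Python; the patient records below are invented, and the rarest outcome (a positive test) is coded 1:

```python
# Coding an asymmetric binary attribute: by convention the rarest,
# most important outcome gets the value 1 (hypothetical patient data).
patients = [
    {"id": 1, "test_result": "negative"},
    {"id": 2, "test_result": "positive"},
    {"id": 3, "test_result": "negative"},
]
for p in patients:
    p["test_positive"] = 1 if p["test_result"] == "positive" else 0

positives = sum(p["test_positive"] for p in patients)
```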

Ordinal Attributes

An ordinal attribute is an attribute with possible values that have a
meaningful order or ranking among them, but the magnitude between
successive values is not known.

Suppose that drink size corresponds to the size of drinks available at a fast-food
restaurant. This ordinal attribute has three possible values: small,
medium, and large. The values have a meaningful sequence (which
corresponds to increasing drink size); however, we cannot tell from the values
how much bigger, say, a large is than a medium. Other examples of ordinal
attributes include grade (e.g., A, B, C, and so on) and professional rank.
Professional ranks can be enumerated in a sequential order: for example,
assistant, associate, and full for professors, and private, private first class,
specialist, corporal, and sergeant for army ranks.

Ordinal attributes are useful for registering subjective assessments of
qualities that cannot be measured objectively; thus ordinal attributes are
often used in surveys for ratings. In one survey, participants were asked to
rate how satisfied they were as customers. Customer satisfaction had the
following ordinal categories: 0: very dissatisfied, 1: somewhat dissatisfied, 2:
neutral, 3: satisfied, and 4: very satisfied.
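The survey scale above can be encoded directly, since the categories have a meaningful order; the response list below is invented:

```python
# Ordinal encoding sketch: map the ordered satisfaction categories from
# the survey example to integer codes that preserve their order.
levels = ["very dissatisfied", "somewhat dissatisfied",
          "neutral", "satisfied", "very satisfied"]
code = {level: i for i, level in enumerate(levels)}

responses = ["satisfied", "neutral", "very satisfied"]
encoded = [code[r] for r in responses]
# order is preserved: "satisfied" (3) < "very satisfied" (4)
```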

Numeric Attributes

A numeric attribute is quantitative; that is, it is a measurable quantity,
represented in integer or real values. Numeric attributes can be
interval-scaled or ratio-scaled.

Interval-Scaled Attributes

Interval-scaled attributes are measured on a scale of equal-size units. The
values of interval-scaled attributes have order and can be positive, 0, or
negative. Thus, in addition to providing a ranking of values, such attributes
allow us to compare and quantify the difference between values.
A temperature attribute is interval-scaled. Suppose that we have the outdoor
temperature value for a number of different days, where each day is an object.
By ordering the values, we obtain a ranking of the objects with respect to
temperature. In addition, we can quantify the difference between values. For
example, a temperature of 20°C is five degrees higher than a temperature of
15°C. Calendar dates are another example. For instance, the years 2002 and
2010 are eight years apart.

Ratio-Scaled Attributes

A ratio-scaled attribute is a numeric attribute with an inherent zero-point.
That is, if a measurement is ratio-scaled, we can speak of a value as being a
multiple (or ratio) of another value. In addition, the values are ordered, and
we can also compute the difference between values, as well as the mean,
median, and mode.

Unlike temperatures in Celsius and Fahrenheit, the Kelvin (K) temperature
scale has what is considered a true zero-point. It is the point at which the
particles that comprise matter have zero kinetic energy. Other examples of
ratio-scaled attributes include count attributes such as years of experience
(e.g., the objects are employees) and number of words (e.g., the objects are
documents). Additional examples include attributes to measure weight,
height, latitude, and longitude.
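The distinction between the two numeric scales can be made concrete through their permissible transformations: interval scales allow new = a * old + b (e.g., Celsius to Fahrenheit), while ratio scales allow only new = a * old (e.g., meters to feet). A small sketch:

```python
# Interval vs. ratio transformations. Differences survive both kinds of
# transformation; ratios are only meaningful on ratio scales.

def c_to_f(c):
    return c * 9 / 5 + 32        # interval: a = 9/5, b = 32 (offset allowed)

def m_to_ft(m):
    return m * 3.28084           # ratio: pure rescaling, no offset

diff_c = 20 - 15                 # a 5-degree gap in Celsius...
diff_f = c_to_f(20) - c_to_f(15) # ...is a 9-degree gap in Fahrenheit
# note: 20°C is NOT "twice as hot" as 10°C, since Celsius has no true zero,
# whereas 2 m really is twice 1 m on the ratio-scaled length attribute.
```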

Classification algorithms developed from the field of machine learning often
talk of attributes as being either discrete or continuous. Each type may be
processed differently. A discrete attribute has a finite or countably infinite
set of values, which may or may not be represented as integers. The attributes
hair color, smoker, medical test, and drink size each have a finite number of
values, and so are discrete. Note that discrete attributes may have numeric
values, such as 0 and 1 for binary attributes or, the values 0 to 110 for the
attribute age. An attribute is countably infinite if the set of possible values is
infinite but the values can be put in a one-to-one correspondence with natural
numbers. For example, the attribute customer ID is countably infinite. The
number of customers can grow to infinity, but in reality, the actual set of
values is countable (where the values can be put in one-to-one
correspondence with the set of integers). Zip codes are another example.
If an attribute is not discrete, it is continuous. The terms numeric attribute
and continuous attribute are often used interchangeably in the literature. (This
can be confusing because, in the classic sense, continuous values are real
numbers, whereas numeric values can be either integers or real numbers.) In
practice, real values are represented using a finite number of digits.
Continuous attributes are typically represented as floating-point variables.
But the properties of an attribute can be different from the properties of the
values used to represent the attribute.

1. Nominal Attribute: An attribute that possesses only distinctness (=, ≠).
2. Ordinal Attribute: An attribute that has distinctness (=, ≠) and order
(<, >).
3. Interval Attribute: An attribute that includes distinctness (=, ≠), order
(<, >), and meaningful differences (+, -).
4. Ratio Attribute: An attribute that includes all four properties:
distinctness (=, ≠), order (<, >), meaningful differences (+, -), and
ratios (*, /).

Table 2: Types of Attributes

Nominal attribute (=, ≠)
● Description: attribute values only distinguish one object from another.
● Examples: zip codes, employee ID numbers, eye color, sex: {male, female}.
● Operations: mode, entropy, contingency correlation, chi-square test.
● Allowed transformation: any one-to-one mapping, e.g., a permutation of values.
● Comment: if all employee ID numbers are reassigned, it will not make any difference.

Ordinal attribute (<, >)
● Description: attribute values also order objects.
● Examples: hardness of minerals, {good, better, best}, grades, street numbers.
● Operations: median, percentiles, rank correlation, run tests, sign tests.
● Allowed transformation: an order-preserving change of values, i.e., new value = f(old value), where f is a monotonic function.
● Comment: an attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Interval attribute (+, -)
● Description: differences between values are meaningful.
● Examples: calendar dates, temperature in Celsius or Fahrenheit.
● Operations: mean, standard deviation, Pearson's correlation, t and F tests.
● Allowed transformation: new value = a * old value + b, where a and b are constants.
● Comment: the Fahrenheit and Celsius temperature scales differ in the location of their zero value and the size of a degree (unit).

Ratio attribute (*, /)
● Description: both differences and ratios are meaningful.
● Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current.
● Operations: geometric mean, harmonic mean, percent variation.
● Allowed transformation: new value = a * old value.
● Comment: length can be measured in meters or feet.
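The permissible operations in the table can be illustrated with a short pandas sketch. The dataset and column names below are invented for illustration, with one column per attribute type:

```python
import pandas as pd

# Illustrative dataset: one column per attribute type
df = pd.DataFrame({
    "eye_color": ["brown", "blue", "brown", "green"],  # nominal
    "grade":     ["good", "best", "better", "good"],   # ordinal
    "temp_c":    [20.0, 25.0, 30.0, 25.0],             # interval
    "mass_kg":   [60.0, 75.0, 90.0, 75.0],             # ratio
})

# Nominal: only the mode (most frequent value) is meaningful
print(df["eye_color"].mode()[0])            # brown

# Ordinal: median/min/max require an explicit order, not arithmetic
order = pd.CategoricalDtype(["good", "better", "best"], ordered=True)
print(df["grade"].astype(order).min())      # good

# Interval: mean and standard deviation are meaningful
print(df["temp_c"].mean())                  # 25.0

# Ratio: ratios between values are meaningful
print(df["mass_kg"].max() / df["mass_kg"].min())  # 1.5
```

Note that computing, say, a mean of the nominal column would be meaningless, which is exactly what the table's "Operations" column encodes.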

Data Modeling Techniques

Underlying the structure of a database is the data model: a collection of
conceptual tools for describing data, data relationships, data semantics, and
consistency constraints. A data model provides a way to describe the design
of a database at the physical, logical, and view levels.

Relational Model: The relational model uses a collection of tables to represent
both data and the relationships among those data. Each table has multiple
columns, and each column has a unique name. Tables are also known as
relations. The relational model is an example of a record-based model. Record-
based models are so named because the database is structured in fixed-
format records of several types. Each table contains records of a particular
type. Each record type defines a fixed number of fields, or attributes. The
columns of the table correspond to the attributes of the record type. The
relational data model is the most widely used data model, and a vast majority
of current database systems are based on the relational model.
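The idea of a table (relation) with named columns holding fixed-format records can be sketched with Python's built-in sqlite3 module. The table and column names below are made up for illustration:

```python
import sqlite3

# In-memory relational database; each table is a relation with named columns
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A record type with a fixed number of fields (attributes)
cur.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
cur.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                [(1, "Asha", "Sales"), (2, "Ravi", "HR")])

# Every row of the relation has the same set of attributes,
# so declarative queries over columns are possible
rows = cur.execute("SELECT name FROM employee WHERE dept = 'Sales'").fetchall()
print(rows)   # [('Asha',)]
```

Each row here is a record of the `employee` record type, and the query addresses records purely through column names, not through physical storage details.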

Entity-Relationship Model: The entity-relationship (E-R) data model uses a
collection of basic objects, called entities, and relationships among these
objects. An entity is a “thing” or “object” in the real world that is
distinguishable from other objects.

Object-Based Data Model: Object-oriented programming (especially in Java,
C++, or C#) has become the dominant software-development methodology.
This led to the development of an object-oriented data model that can be seen
as extending the E-R model with notions of encapsulation, methods
(functions), and object identity. The object-relational data model combines
features of the object-oriented data model and relational data model.

Semistructured Data Model: The semistructured data model permits the
specification of data where individual data items of the same type may have
different sets of attributes. This is in contrast to the data models mentioned
earlier, where every data item of a particular type must have the same set of

attributes. The Extensible Markup Language (XML) is widely used to represent
semistructured data.
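A small sketch of this flexibility, using Python's standard-library XML parser (the document and attribute names are invented for illustration): two items of the same type carry different sets of attributes, something a fixed relational schema could not store directly.

```python
import xml.etree.ElementTree as ET

# Two <book> items of the same type with DIFFERENT attribute sets
doc = """
<library>
  <book title="Database System Concepts" year="2019"/>
  <book title="Data Mining" publisher="Pearson"/>
</library>
"""

root = ET.fromstring(doc)
books = [book.attrib for book in root.findall("book")]
print(books)  # each item carries its own set of attributes
```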

Historically, two other data models, the network data model and the
hierarchical data model, preceded the relational data model. These models
were tied closely to the underlying implementation, and complicated the task
of modeling data.

Missing Imputations

Imputation is the process of replacing missing data with substituted values.


Missing data can be classified into one of three categories.

Types of Missing Values

Understanding the types of missing values in datasets is crucial for effectively
handling missing data and ensuring accurate analyses:

Missing Completely At Random (MCAR)

MCAR occurs when the probability of data being missing is uniform across all
observations. There is no relationship between the missingness of data and
any other observed or unobserved data within the dataset. This type of
missing data is purely random and lacks any discernible pattern.

Example: In a survey about library books, some overdue book values in the
dataset are missing due to human error in recording.

Missing At Random (MAR)

MAR data occurs when the probability of data being missing depends only on
the observed data and not on the missing data itself. In other words, the
missingness can be explained by variables for which you have complete
information. There is a pattern in the missing values, but this pattern can be
explained by other observed variables.

Example: In a survey, ‘Age’ values might be missing for those who did not
disclose their ‘Gender’. Here, the missingness of ‘Age’ depends on ‘Gender’,
but the missing ‘Age’ values are random among those who did not disclose
their ‘Gender’.

Missing Not At Random (MNAR)

MNAR occurs when the missingness of data is related to the unobserved data
itself, which is not included in the dataset. This type of missing data has a
specific pattern that cannot be explained by observed variables.

Example: In a survey about library books, people with more overdue books
might be less likely to respond to the survey. Whether the overdue-book
count is missing thus depends on the unobserved value itself.
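The three mechanisms can be simulated in a few lines of NumPy. The variables (age, overdue-book counts) and thresholds below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
age = rng.integers(18, 70, size=1000).astype(float)   # fully observed
overdue = rng.poisson(2, size=1000).astype(float)     # fully observed

# MCAR: every value has the same 10% chance of going missing,
# independent of anything in the data
mcar = age.copy()
mcar[rng.random(1000) < 0.10] = np.nan

# MAR: missingness of age depends only on another OBSERVED variable
mar = age.copy()
mar[overdue > 3] = np.nan

# MNAR: missingness of overdue depends on the unobserved value itself
mnar = overdue.copy()
mnar[overdue > 3] = np.nan
```

With MAR, the analyst can model the missingness from `overdue`, which is fully observed; with MNAR, the values needed to explain the missingness are exactly the ones that are gone.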

Understanding the type of missing data is crucial because it determines the
appropriate strategy for handling missing values and preserving the integrity
of statistical analyses. Choosing an imputation technique that does not match
the missingness mechanism can introduce bias into downstream results.

How to Handle Missing Data?

Missing data is a common headache in any field that deals with datasets. It
can arise for various reasons, from human error during data collection to
limitations of data gathering methods. Luckily, there are strategies to address
missing data and minimize its impact on your analysis. Here are two main
approaches:

● Deletion: This involves removing rows or columns with missing values.
This is a straightforward method, but it can be problematic if a
significant portion of your data is missing. Discarding too much data
can affect the reliability of your conclusions.

● Imputation: This replaces missing values with estimates. There are
various imputation techniques, each with its strengths and
weaknesses. Here are some common ones:

o Mean/Median/Mode Imputation: Replace missing entries with
the average (mean), middle value (median), or most frequent value
(mode) of the corresponding column. This is a quick and easy
approach, but it can introduce bias if the missing data is not
randomly distributed.

o K-Nearest Neighbors (KNN) Imputation: This method finds the
closest data points (neighbors) based on available features and
uses their values to estimate the missing value. KNN is useful
when you have a lot of data and the missing values are scattered.

o Model-based Imputation: This involves creating a statistical
model to predict the missing values based on other features in
the data. This can be a powerful technique, but it requires more
expertise and can be computationally expensive.
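Deletion and the simple imputation strategies above can be sketched with pandas. The toy dataset and its values are invented for illustration:

```python
import numpy as np
import pandas as pd

# Toy dataset with missing entries (NaN / None)
df = pd.DataFrame({
    "age":    [25.0, np.nan, 35.0, 40.0],
    "city":   ["Hyd", "Delhi", None, "Hyd"],
    "income": [30.0, 45.0, np.nan, 55.0],
})

# Deletion: drop every row that contains a missing value
dropped = df.dropna()                                  # rows 0 and 3 survive

# Mean imputation for a numeric column
age_filled = df["age"].fillna(df["age"].mean())        # NaN -> (25+35+40)/3

# Median imputation for another numeric column
income_filled = df["income"].fillna(df["income"].median())   # NaN -> 45.0

# Mode imputation for a categorical column
city_filled = df["city"].fillna(df["city"].mode()[0])  # None -> "Hyd"

print(len(dropped), age_filled[1], income_filled[2], city_filled[2])
```

Note how deletion discards half of this small dataset, while imputation keeps every row at the cost of injecting estimated values.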

Need for Business Modeling

A business model is a schematic representation of the interconnected
processes of a company. It shows what the company sells and to whom, and
how it earns a profit.

Below, we will consider in more detail what a business model is and what it
is used for.

Free business model template. Source: online.visual-paradigm.com

At the launch stage:

● It calculates how much money will be needed for the startup, what
expenses will arise each month, and the expected level of profit in the
early stages of operation.

● It analyzes how customers interact with your business and helps identify
ways to reduce expenses.

In case of losses:

A business model can help turn around an unprofitable business. An innovative
product or service is certainly important, but it is not always necessary to
outperform competitors on innovation alone. The main thing is to constantly
improve your processes and adapt the business model to new technologies.

Attracting investments:

● Developing a business model creates a simple and logical description of
the project concept.

● It explains to financial partners why the product is in demand and how
it differs from similar offerings.

● The project's business model demonstrates the connection between
processes, which increases investor confidence.

Overall, a business model:

● Forms a clear vision of how the business will operate.

● Adapts the company to changing market conditions and adjusts the
course of development.

● Promotes more efficient solutions based on analysis and forecasting.

