DA - Unit I
BCS-052
Unit-1
• Introduction to Data Analytics: Sources and nature of data,
classification of data (structured, semi-structured, unstructured),
characteristics of data, introduction to Big Data platform, need of data
analytics, evolution of analytic scalability, analytic process and tools,
analysis vs reporting, modern data analytic tools, applications of data
analytics.
• Data Analytics Lifecycle: Need, key roles for successful analytic
projects, various phases of data analytics lifecycle – discovery, data
preparation, model planning, model building, communicating results,
operationalization.
Data Analytics
• In this new digital world, data is being generated in enormous amounts, which opens up new paradigms.
• With high computing power and large volumes of data available, we can use this data to support data-driven decision making.
• The main benefit of data-driven decisions is that they are based on observed past trends that have led to beneficial outcomes.
• In short, we can say that data analytics is the process of manipulating
data to extract useful trends and hidden patterns that can help us derive
valuable insights to make business predictions.
• Data analytics is an important field that involves the process of
collecting, processing, and interpreting data to uncover insights and
help in making decisions.
• Data analytics is the practice of examining raw data to identify trends,
draw conclusions, and extract meaningful information.
• This involves various techniques and tools to process and transform
data into valuable insights that can be used for decision-making.
• Data analytics encompasses a wide array of techniques for analyzing data to
gain valuable insights that can enhance various aspects of operations.
• By scrutinizing information, businesses can uncover patterns and metrics
that might otherwise go unnoticed, enabling them to optimize processes and
improve overall efficiency.
• For instance, in manufacturing, companies collect data on machine runtime,
downtime, and work queues to analyze and improve workload planning,
ensuring machines operate at optimal levels.
• Beyond production optimization, data analytics is utilized in diverse sectors.
• Gaming firms utilize it to design reward systems that engage players
effectively, while content providers leverage analytics to optimize content
placement and presentation, ultimately driving user engagement.
Types of Data Analytics
Descriptive analytics
• Descriptive analytics is a simple, surface-level type of analysis that looks
at what has happened in the past.
• The two main techniques used in descriptive analytics are data aggregation
and data mining—so, the data analyst first gathers the data and presents it in
a summarized format (that’s the aggregation part) and then “mines” the data
to discover patterns.
• The data is then presented in a way that can be easily understood by a wide
audience (not just data experts).
• It’s important to note that descriptive analytics doesn’t try to explain the
historical data or establish cause-and-effect relationships;
• At this stage, it’s simply a case of determining and describing the “what”.
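A minimal Python sketch of the aggregation step described above, assuming a hypothetical pandas DataFrame of sales records; the column names and figures are illustrative, not from the source:

```python
# Descriptive analytics sketch: aggregate raw records into a summary of "what happened".
# The dataset, its columns ('region', 'month', 'sales'), and the values are hypothetical.
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "month":  ["Jan", "Feb", "Jan", "Feb"],
    "sales":  [1200, 1350, 980, 1100],
})

# Data aggregation: present the past data in a summarized, easily readable format.
summary = sales.groupby("region")["sales"].agg(["sum", "mean"])
print(summary)
```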
Diagnostic analytics
• While descriptive analytics looks at the “what”, diagnostic analytics explores the
“why”.
• When running diagnostic analytics, data analysts will first seek to identify
anomalies within the data—that is, anything that cannot be explained by the data in
front of them.
• For example: If the data shows that there was a sudden drop in sales for the month
of March, the data analyst will need to investigate the cause.
• To do this, they’ll embark on what’s known as the discovery phase, identifying any
additional data sources that might tell them more about why such anomalies arose.
• Finally, the data analyst will try to uncover causal relationships—for example,
looking at any events that may correlate or correspond with the decrease in sales.
• At this stage, data analysts may use probability theory, regression analysis,
filtering, and time-series data analytics.
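A small sketch, under assumed data, of how a diagnostic step might first flag an anomaly (here with a simple z-score filter) and then look for a correlated variable; the month labels, sales figures, and ad-spend column are made up for illustration:

```python
# Diagnostic analytics sketch: flag an anomalous month, then check whether another
# variable (e.g. ad spend) moves with sales. Correlation here is a clue, not proof of cause.
import pandas as pd

df = pd.DataFrame({
    "month":    ["Jan", "Feb", "Mar", "Apr", "May"],
    "sales":    [1000, 1050, 400, 1020, 990],   # March shows a sudden drop
    "ad_spend": [200, 210, 60, 205, 195],
})

# Identify anomalies: months whose sales deviate strongly from the mean.
z = (df["sales"] - df["sales"].mean()) / df["sales"].std()
print(df[z.abs() > 1.5])                        # flags the March drop

# Look for a candidate explanation via correlation with another data source.
print(df["sales"].corr(df["ad_spend"]))
```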
Predictive analytics
• Just as the name suggests, predictive analytics tries to predict what is likely
to happen in the future.
• This is where data analysts start to come up with actionable, data-driven
insights that the company can use to inform their next steps.
• Predictive analytics estimates the likelihood of a future outcome based on
historical data and probability theory, and while it can never be completely
accurate, it does eliminate much of the guesswork from key business
decisions.
• Predictive analytics can be used to forecast all sorts of outcomes—from
what products will be most popular at a certain time, to how much the
company revenue is likely to increase or decrease in a given period.
• Ultimately, predictive analytics is used to increase the business’s chances of
“hitting the mark” and taking the most appropriate action.
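As a minimal sketch of forecasting from historical data, the following fits a linear trend to twelve periods of synthetic revenue and extrapolates one period ahead; the figures and the choice of a plain linear model are assumptions for illustration only:

```python
# Predictive analytics sketch: fit a trend to past revenue and estimate the next period.
# The revenue figures are synthetic; the output is an estimate, never a certainty.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
months = np.arange(1, 13).reshape(-1, 1)                     # periods 1..12
revenue = 50_000 + 1_200 * months.ravel() + rng.normal(0, 800, 12)

model = LinearRegression().fit(months, revenue)
next_month = model.predict([[13]])                           # forecast for period 13
print(round(next_month[0], 2))
```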
Prescriptive Analytics
• Building on predictive analytics, prescriptive analytics advises on the actions
and decisions that should be taken.
• In other words, prescriptive analytics shows you how you can take advantage of
the outcomes that have been predicted.
• When conducting prescriptive analysis, data analysts will consider a range of
possible scenarios and assess the different actions the company might take.
• Prescriptive analytics is one of the more complex types of analysis, and may
involve working with algorithms, machine learning, and computational modeling
procedures.
• However, the effective use of prescriptive analytics can have a huge impact on the
company’s decision-making process and, ultimately, on the bottom line.
• The type of analysis you carry out will also depend on the kind of data you’re
working with. If you’re not already familiar, it’s worth learning about the four
levels of data measurement: nominal, ordinal, interval, ratio.
GOALS OF DATA ANALYSIS
• Explanation: understand or find the true relation between variables of interest (e.g., a causal mechanism or a correlation).
• Prediction: accurately predict hitherto unobserved (e.g., future) data points (e.g., for medical image classification such as tumor recognition).
There are four types of measurement (or scales) to be aware
of: nominal, ordinal, interval, and ratio.
Nominal
• The nominal scale simply categorizes variables according to
qualitative labels (or names).
• These labels and groupings don’t have any order or hierarchy to them,
nor do they convey any numerical value.
Ordinal
• The ordinal scale also categorizes variables into labeled groups, and these
categories have an order or hierarchy to them.
• For example, you could measure the variable “income” on an ordinal scale as
follows:
• low income
• medium income
• high income.
• Another example could be level of education, classified as follows:
• high school
• master’s degree
• doctorate
• These are still qualitative labels (as with the nominal scale), but you can see that
they follow a hierarchical order.
Interval
• The interval scale is a numerical scale which labels and orders variables, with a
known, evenly spaced interval between each of the values.
• A commonly-cited example of interval data is temperature in Fahrenheit, where
the difference between 10 and 20 degrees Fahrenheit is exactly the same as the
difference between, say, 50 and 60 degrees Fahrenheit.
Ratio
• The ratio scale is exactly the same as the interval scale, with one key difference:
The ratio scale has what’s known as a “true zero.”
• A good example of ratio data is weight in kilograms.
• If something weighs zero kilograms, it truly weighs nothing—compared to
temperature (interval data), where a value of zero degrees doesn’t mean there is
“no temperature,” it simply means it’s extremely cold!
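A small Python sketch of how the four levels of measurement might be represented in a pandas DataFrame; the variable names and values are hypothetical examples chosen to match the definitions above:

```python
# The four levels of measurement sketched as pandas columns (names and values are illustrative).
import pandas as pd

df = pd.DataFrame({
    "blood_type":  ["A", "O", "B"],              # nominal: qualitative labels, no order
    "income_band": ["low", "high", "medium"],    # ordinal: ordered labels
    "temp_f":      [10.0, 50.0, 60.0],           # interval: even spacing, no true zero
    "weight_kg":   [0.0, 72.5, 80.1],            # ratio: true zero, ratios meaningful
})

# Ordinal data keeps its hierarchy when declared as an ordered categorical.
df["income_band"] = pd.Categorical(
    df["income_band"], categories=["low", "medium", "high"], ordered=True
)
print(df["income_band"].sort_values())
```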
Understanding Structured, Semi-Structured, and
Unstructured Data
• When we talk about data or analytics, the terms structured, unstructured, and
semi-structured data often get discussed.
• These are the three forms of data that have now become relevant for all types of
business applications.
• Structured data has been around for some time, and traditional systems and
reporting still rely on this form of data.
• However, there has been a swift increase in the generation of semi-structured and
unstructured data sources in the past few years, due to the rise of Big Data.
• As a result, more and more businesses are now looking to take their business
intelligence and analytics to the next level by including all three forms of data.
Structured Data
• Structured data is information that has been formatted and transformed into a
well-defined data model.
• The raw data is mapped into predesigned fields that can then be extracted and read
through SQL easily.
• SQL relational databases, consisting of tables with rows and columns, are the
perfect example of structured data.
• The relational model of this data format makes efficient use of storage since it minimizes data redundancy.
• However, this also means that structured data is more inter-dependent and less
flexible.
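A minimal sketch of structured data in a relational table, using Python's built-in sqlite3 module; the table, columns, and rows are hypothetical, and the point is only that predesigned fields can be read through SQL:

```python
# Structured data sketch: rows and columns in a relational table, queried with SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany("INSERT INTO customers (name, city) VALUES (?, ?)",
                 [("Asha", "Delhi"), ("Ravi", "Mumbai")])

# Predesigned fields can be extracted and read through SQL easily.
for row in conn.execute("SELECT name, city FROM customers WHERE city = 'Delhi'"):
    print(row)
conn.close()
```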
Semi-Structured Data
• You may not always find your data sets to be structured or unstructured.
Semi-structured data or partially structured data is another category between
structured and unstructured data.
• Semi-structured data is a type of data that has some consistent and definite
characteristics.
• It does not conform to a rigid structure such as that required by relational databases.
• Businesses use organizational properties like metadata or semantics tags with
semi-structured data to make it more manageable.
• However, it still contains some variability and inconsistency.
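A small sketch of semi-structured data as JSON: the records share organizational properties (field names acting as tags) but do not follow one rigid schema. The record contents are invented for illustration:

```python
# Semi-structured data sketch: JSON records share some fields but not a fixed schema.
import json

records = """
[
  {"id": 1, "name": "Asha", "tags": ["premium"], "email": "asha@example.com"},
  {"id": 2, "name": "Ravi", "phone": "+91-00000-00000"}
]
"""
for rec in json.loads(records):
    # Fields such as 'email' or 'phone' may or may not be present in a given record.
    print(rec["id"], rec.get("email", "no email on file"))
```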
Unstructured Data
• Unstructured data is defined as data present in absolute raw form.
• This data is difficult to process due to its complex arrangement and formatting.
• Unstructured data includes social media posts, chats, satellite imagery, IoT sensor
data, emails, and presentations.
• Unstructured data management organizes this data in a logical, predefined manner within data storage.
• Natural language processing (NLP) tools help understand unstructured data that
exists in a written format.
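As a toy illustration of pulling signals out of free-form text, the sketch below counts keyword frequencies in a made-up email using only the standard library; a real pipeline would use a dedicated NLP library, so this only shows the idea:

```python
# Unstructured data sketch: extract crude keyword counts from free-form text.
import re
from collections import Counter

email_body = "The shipment was late. Late delivery again, and the invoice is wrong."
words = re.findall(r"[a-z']+", email_body.lower())
print(Counter(words).most_common(3))   # e.g. [('late', 2), ('the', 2), ...]
```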
Characteristics of data
• Big Data refers to volumes of data so large that they cannot be stored or processed by traditional data storage or processing systems.
• Many multinational companies use it to process data and run the business of many organizations.
• Global data flow is estimated to exceed 150 exabytes per day before replication.
5 V's of Big Data
•Volume
•Veracity
•Variety
•Value
•Velocity
Volume
• The name Big Data itself is related to an enormous size.
• Big Data refers to the vast 'volume' of data generated daily from many sources, such as business processes, machines, social media platforms, networks, human interactions, and many more.
• Facebook alone generates approximately a billion messages a day, records around 4.5 billion clicks of the "Like" button, and sees more than 350 million new posts uploaded each day.
• Big data technologies can handle large amounts of data.
Variety
• Big Data can be structured, unstructured, or semi-structured, collected from many different sources. In the past, data was collected only from databases and spreadsheets, but these days it arrives in an array of forms: PDFs, emails, audio, social media posts, photos, videos, etc.
Veracity
• Veracity refers to how reliable and trustworthy the data is.
• Because data arrives from many sources, there are many ways it must be filtered or translated before it can be used.
• Veracity is therefore about being able to handle and manage data efficiently, which is also essential in business development.
• For example, Facebook posts with hashtags are data whose quality and reliability vary widely.
Value
• Value is an essential characteristic of big data. What matters is not merely the data that we process or store.
• It is the valuable and reliable data that we store, process, and analyze that delivers value.
Velocity
• Velocity refers to the speed at which data is generated and flows in from its sources, and to how quickly it must be processed to remain useful.
From the Collection Phase to the Impact Phase
• Connect: ensuring that the right people or systems can access and use the data.
• Incentive: encouraging use of the data through various incentives.
• Influence: data starts to influence decisions, strategies, or policies.
Evolution of analytic scalability
• Analytic architectures have evolved from the traditional analytic architecture to grid technology, Hadoop, MPP (massively parallel processing) architectures, and cloud technology for handling Big Data.
Analysis vs. Reporting
• Purpose: analysis is done to explore data and extract insights; reporting is done to inform stakeholders of the results.
• Skills needed: analysis requires statistical, programming, and critical-thinking skills; reporting requires data presentation skills and software proficiency.
Data Analytics Lifecycle
• The data analytics lifecycle is a structure for doing data analytics that
has business objectives at its core.
• In today’s digital-first world, data is of immense importance.
• It undergoes various stages throughout its life, during its creation,
testing, processing, consumption, and reuse.
• Data Analytics Lifecycle maps out these stages for professionals
working on data analytics projects.
• These phases are arranged in a circular structure that forms a Data
Analytics Lifecycle. Each step has its significance and characteristics.
Why is Data Analytics Lifecycle Essential?
• The Data Analytics Lifecycle is designed to be used with significant
big data projects.
• The cycle is iterative so that it portrays a real project accurately.
• A step-by-step methodology is needed to organize the actions and tasks involved in gathering, processing, analyzing, and reusing data, and to address the various needs that arise when assessing information on big data.
• Data analysis involves modifying, processing, and cleaning raw data to obtain useful, significant information that supports business decision-making.
Importance of Data Analytics Lifecycle
• Data Analytics Lifecycle defines the roadmap of how data is
generated, collected, processed, used, and analyzed to achieve
business goals.
• It offers a systematic way to manage data for converting it into
information that can be used to fulfill organizational and project goals.
• The process provides the direction and methods to extract information
from the data and proceed in the right direction to accomplish business
goals.
• Data professionals use the lifecycle’s circular form to proceed with
data analytics in either a forward or backward direction.
Phases of the data analytics lifecycle
Discovery
• This first phase involves getting the context around your problem: you need to know what
problem you are solving and what business outcomes you wish to see.
• You should begin by defining your business objective and the scope of the work.
• Work out what data sources will be available and useful to you (for example, Google Analytics, Salesforce, your customer support ticketing system, or any marketing campaign information you might have available). Then perform a gap analysis comparing the data required to solve your business problem with the data you have available, and work out a plan to get any data you still need.
• Once your objective has been identified, you should formulate an initial hypothesis.
• Design your analysis so that it will determine whether to accept or reject this hypothesis.
• Decide in advance what the criteria for accepting or rejecting the hypothesis will be to
ensure that your analysis is rigorous and follows the scientific method.
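A minimal sketch of fixing acceptance criteria before the analysis and then testing an initial hypothesis, here with a two-sample t-test from scipy; the hypothesis, the two groups, and the 0.05 threshold are illustrative assumptions rather than anything prescribed by the lifecycle:

```python
# Discovery-phase sketch: state the hypothesis and its acceptance criteria up front, then test it.
from scipy import stats

# Hypothesis (illustrative): the new checkout flow changes average order value.
control   = [42.0, 39.5, 41.2, 40.8, 43.1, 38.9]
treatment = [44.2, 45.1, 43.8, 46.0, 44.9, 45.5]

ALPHA = 0.05                      # criterion decided before looking at the results
t_stat, p_value = stats.ttest_ind(control, treatment)
print("reject null" if p_value < ALPHA else "fail to reject null", round(p_value, 4))
```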
Data preparation
• In the next stage, you need to decide which data sources will be useful for
the analysis, collect the data from all these disparate sources, and load it into
a data analytics sandbox so it can be used for prototyping.
• When loading your data into the sandbox area, you will need to transform it.
The two main types of transformations:
• Preprocessing transformations means cleaning your data to remove things like nulls,
defective values, duplicates, and outliers.
• Analytics transformations can mean a variety of things, such as standardizing or
normalizing your data so it can be used more effectively with certain machine
learning algorithms, or preparing your datasets for human consumption (for example,
transforming machine labels into human-readable ones, such as “sku123” →
“T-Shirt, brown”).
• Depending on whether your transformations take place before or after the
loading stage, this whole process is known as either ETL (extract, transform,
load) or ELT (extract, load, transform).
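A short pandas sketch of the two transformation types just described: preprocessing (nulls, duplicates, defective values) followed by analytics transformations (normalizing a field and mapping machine labels to human-readable ones). The columns, values, and SKU-to-label mapping are hypothetical:

```python
# Data-preparation sketch: preprocessing transformations, then analytics transformations.
import pandas as pd

raw = pd.DataFrame({
    "sku":   ["sku123", "sku123", "sku456", None, "sku789"],
    "price": [19.99, 19.99, 24.50, 12.00, 9_999.0],   # 9,999.0 looks like a defective value
})

# Preprocessing: remove nulls, duplicates, and obvious outliers.
clean = raw.dropna().drop_duplicates()
clean = clean[clean["price"] < 1_000].copy()

# Analytics transformations: normalize a numeric field and map machine labels
# to human-readable ones (e.g. "sku123" -> "T-Shirt, brown").
clean["price_norm"] = (clean["price"] - clean["price"].mean()) / clean["price"].std()
clean["product"] = clean["sku"].map({"sku123": "T-Shirt, brown", "sku456": "Mug, white"})
print(clean)
```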
Model planning
• A model in data analytics is a mathematical or programmatic description of
the relationship between two or more variables.
• It allows us to study the effects of different variables on our data and to
make statistical assumptions about the probability of an event happening.
• You may want to think about the following when deciding on a model:
• How large is your dataset?
• While the more complex types of neural networks (with many hidden layers) can
solve difficult questions with minimal human intervention, be aware that with more
layers of complexity, a larger set of training data is required for the neural network's
approximations to be accurate.
• You may only have a small dataset available, or you may require your dashboards to
be fast, which generally requires smaller, pre-aggregated data.
• How will the output be used?
• In the business intelligence use case, fast, pre-aggregated data is great, but if
the end users are likely to perform additional drill-downs or aggregations in
their BI solution, the prepared dataset has to support this.
• Is the data labeled (that is, does each record carry a known outcome or target)?
• If it is, you could use supervised learning, but if not, unsupervised learning is your only option.
• Do you want the outcome to be qualitative or quantitative?
• If your question expects a quantitative answer (for example, “How many sales
are forecast for next month?” or “How many customers were satisfied with our
product last month?”) then you should use a regression model.
• However, if you expect a qualitative answer (for example, “Is this email
spam?”, where the answer can be Yes or No, or “Which of our five products
are we likely to have the most success in marketing to customer X?”), then you
may want to use a classification or clustering model.
• Is accuracy or speed of the model particularly important?
• If so, check whether your chosen model will perform well. The size of your
dataset will be a factor when evaluating the speed of a particular model.
• Is your data unstructured?
• Unstructured data cannot be easily stored in either relational or graph
databases and includes free text data such as emails or files.
• This type of data is most suited to machine learning.
• Have you analyzed the contents of your data?
• Analyzing the contents of your data can include univariate analysis or
multivariate analysis (such as factor analysis or principal component analysis).
• This allows you to work out which variables have the largest effects and to
identify new factors (that are a combination of different existing variables) that
have a big impact.
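As a sketch of the multivariate-analysis step mentioned above, the example below runs principal component analysis on synthetic data with scikit-learn to see how much variation the first derived factors explain; the four feature columns are stand-ins, not real variables from the source:

```python
# Model-planning sketch: use PCA to see which combinations of variables carry most variation.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))          # stand-in for e.g. spend, visits, tenure, age

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_scaled)

# Variance explained by the first two derived factors, and how strongly
# each original variable loads onto them.
print(pca.explained_variance_ratio_)
print(pca.components_)
```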
Building and executing the model
• The steps within this phase of the data analytics lifecycle depend on
the model you've chosen to use.
• SQL model
• You will first need to find your source tables and the join keys.
• Next, determine where to build your models.
• Depending on the complexity, building your model can range from saving
SQL queries in your warehouse and executing them automatically on a
schedule, to building more complex data modeling chains using tooling
like dbt or Dataform.
• Statistical model
• As with any model, first develop a dataset containing exactly the information required for your analysis.
• Next, decide which statistical model is appropriate for your use case.
• For example, you could use a correlation test, a linear regression model, or an
analysis of variance (ANOVA).
• Finally, you should run your model on your dataset and publish your results.
• Machine learning model
• There is some overlap between machine learning models and statistical models, so
you must begin the same way as when using a statistical model and develop a dataset
containing exactly the information required for your analysis.
• However, machine learning models require you to create two samples from this
dataset: one for training the model, and another for testing the model.
• If you are using a machine learning model, it will need to be trained.
• This involves executing your model on your training dataset, and tuning various
parameters of your model so you get the best predictive results.
• Once this is working well, you can execute your model on the held-out test dataset to evaluate how well it generalizes.
• You can now work out which model gave the most accurate result and use this model
for your final results, which you will then need to publish.
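A minimal scikit-learn sketch of the train/test workflow described for machine learning models; the dataset is synthetic and logistic regression is just one possible model choice:

```python
# Model-building sketch: split the prepared dataset into training and test samples,
# train on one, then evaluate on the held-out sample. The data here is synthetic.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```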
Communicating results
• You must communicate your findings clearly, and it can help to use
data visualizations to achieve this.
• Any communication with stakeholders should include a narrative, a
list of key findings, and an explanation of the value your analysis adds
to the business.
• You should also compare the results of your model with your initial
criteria for accepting or rejecting your hypothesis to explain to them
how confident they can be in your analysis.
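A small matplotlib sketch of a visualization that could support the narrative for stakeholders; the plotted figures are invented purely to show the idea of pairing actual values with the model's output:

```python
# Communicating-results sketch: a simple chart often carries a finding better than a table.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
actual = [1000, 1050, 400, 1020]
forecast = [1010, 1040, 1030, 1025]

plt.plot(months, actual, marker="o", label="Actual sales")
plt.plot(months, forecast, linestyle="--", label="Model forecast")
plt.title("Sales vs. forecast: March anomaly")
plt.legend()
plt.savefig("sales_vs_forecast.png")   # attach to the stakeholder report
```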
Operationalizing
• Once the stakeholders are happy with your analysis, you can execute
the same model outside of the analytics sandbox on a production
dataset.
• You should monitor the results of this to check if they lead to your
business goal being achieved.
• If your business objectives are being met, deliver the final reports to
your stakeholders, and communicate these results more widely across
the business.