
DATA ANALYSIS COURSE

Introduction to Data Analysis.

What is Data Analysis?

Data analysis is the process of gathering, cleaning, and analysing data, interpreting the results, and reporting the findings. Through data analysis we find patterns within data and correlations between different data points, and it is through these patterns and correlations that insights are generated and conclusions are drawn. Data analysis helps businesses understand their past performance and informs their decision-making for future actions.

The terms Data Analysis and Data Analytics are often used interchangeably; however, it is important to note that there is a subtle difference between the meanings of the words Analysis and Analytics.

The dictionary meanings are:

Analysis - detailed examination of the elements or structure of something.

Analytics - the systematic computational analysis of data or statistics.

Analysis can be done without numbers or data, such as business analysis, psychoanalysis, etc. Analytics, on the other hand, even when used without the prefix "Data", almost invariably implies the use of data for performing numerical manipulation and inference.

Types of Data Analysis

Descriptive Analysis: Descriptive Analytics helps answer questions about what

happened over a given period of time by summarizing past data and presenting

the findings to stakeholders. It helps provide essential insights into past events.

For example, tracking past performance based on the organization's key

performance indicators or cash flow analysis.

Diagnostic Analysis: Diagnostic analytics helps answer the question, why did it happen? It takes the insights from descriptive analytics and digs deeper to find the cause of the outcome. For example, a sudden change in traffic to a website

without an obvious cause or an increase in sales in a region where there has been

no change in marketing.

Predictive Analytics: Predictive analytics helps answer the question, what will

happen next? Historical data and trends are used to predict future outcomes. Some

of the areas in which businesses apply predictive analysis are risk assessment and

sales forecasts. It's important to note that the purpose of predictive analytics is not to say what will happen in the future; its objective is to forecast what might happen. All predictions are probabilistic in nature.
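To make this concrete, here is a minimal sketch of a trend-based forecast in Python using scikit-learn. The monthly sales figures are invented for illustration, and a real model would also report the uncertainty around its predictions:

```python
# A minimal sketch of predictive analytics: fit a trend to past data,
# then extrapolate it forward. The sales figures are hypothetical.
import numpy as np
from sklearn.linear_model import LinearRegression

months = np.arange(1, 13).reshape(-1, 1)          # months 1..12
sales = np.array([110, 115, 123, 130, 128, 135,   # invented past sales
                  142, 150, 149, 158, 165, 170])

model = LinearRegression().fit(months, sales)
forecast = model.predict(np.array([[13], [14]]))  # what *might* happen next
print(f"Forecast for months 13-14: {forecast.round(1)}")
```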

Prescriptive Analysis: Prescriptive Analytics helps answer the question, what should be done about it? By analysing past decisions and events, the likelihood of different outcomes is estimated, and on that basis a course of action is decided. Self-driving cars are a good example of Prescriptive Analytics: they analyse the environment to make decisions regarding speed, changing lanes, which route to take, etc. Another example is airlines automatically adjusting ticket prices based on customer demand, gas prices, the weather, or traffic on connecting routes.

Key Steps in the Data Analysis Process

Understanding the problem and desired result

Data analysis begins with understanding the problem that needs to be solved and the desired outcome that needs to be achieved. Where you are and where you want to be need to be clearly defined before the analysis process can begin.

Setting a Clear Metric

This stage of the process includes deciding what will be measured (for example, the number of units of product X sold in a region) and how it will be measured (for example, in a quarter or during a festival season).

Gathering Data

Once you know what you're going to measure and how you're going to measure it, you identify the data you require, the data sources you need to pull this data from, and the best tools for the job.

Cleaning Data

Having gathered the data, the next step is to fix quality issues in the data that

could affect the accuracy of the analysis. This is a critical step because the

accuracy of the analysis can only be ensured if the data is clean. You will clean the data for missing or incomplete values and outliers. For example, in a customer demographics dataset, an age field with a value of 150 is an outlier. You will also standardize the data coming in from multiple sources.
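As a small illustration of this step, the pandas sketch below drops missing values and filters out the implausible age of 150; the column names and the 0-120 age range are assumptions:

```python
# A minimal cleaning sketch with pandas; the dataset is invented.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [34, 150, None, 28],   # 150 is an outlier; None is missing
})

df = df.dropna(subset=["age"])        # drop records with a missing age
df = df[df["age"].between(0, 120)]    # filter out implausible ages
print(df)
```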

Analysing and Mining Data

Once the data is clean, you will extract and analyze the data from different

perspectives. You may need to manipulate your data in several different ways to

understand the trends, identify correlations and find patterns and variations.
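For instance, a hedged pandas sketch of looking at the same data from two perspectives, a trend by region and a correlation between fields, might look like this (the dataset is invented):

```python
# Analysing data from different perspectives with pandas.
import pandas as pd

sales = pd.DataFrame({
    "region":   ["East", "East", "West", "West"],
    "units":    [120, 135, 90, 110],
    "ad_spend": [10.0, 12.0, 7.0, 9.5],
})

print(sales.groupby("region")["units"].sum())  # trend: units sold by region
print(sales["units"].corr(sales["ad_spend"]))  # correlation between two fields
```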

Interpreting Results

After analyzing your data and possibly conducting further research, which can be

an iterative loop, it's time to interpret your results. As you do, you need to evaluate whether your analysis is defensible against objections, and whether there are any limitations or circumstances under which your analysis may not hold true.

Presenting your Findings

Ultimately, the goal of any analysis is to impact decision making. The ability to

communicate and present your findings in clear and impactful ways is as

important a part of the data analysis process as the analysis itself. Reports, dashboards, charts, graphs, maps, and case studies are just some of the ways in which you can present your data.

ROLES OF A DATA ANALYST

While the role of a Data Analyst varies depending on the type of organization

and the extent to which it has adopted data-driven practices, there are some

responsibilities that are typical to a Data Analyst role in today’s organizations.

These include:

 Acquiring data from primary and secondary data sources.

 Creating queries to extract required data from databases and other data collection systems.

 Filtering, cleaning, standardizing, and reorganizing data in preparation for data analysis.

 Using statistical tools to interpret data sets.

 Using statistical techniques to identify patterns and correlations in data.

 Analyzing patterns in complex data sets and interpreting trends.

 Preparing reports and charts that effectively communicate trends and patterns.

 Creating appropriate documentation to define and demonstrate the steps of the data analysis process.

SKILLS THAT ARE VALUABLE TO A DATA ANALYST

The data analysis process requires a combination of TECHNICAL,

FUNCTIONAL, and SOFT SKILLS.

Let’s first look at some of the Technical skills that you need in your role as a Data
Analyst.
TECHNICAL SKILLS: These include:

 Expertise in using spreadsheets such as Microsoft Excel or Google Sheets.

 Proficiency in statistical analysis and visualization tools and software such as IBM Cognos, IBM SPSS, Oracle Visual Analyzer, Microsoft Power BI, SAS, and Tableau.

 Proficiency in at least one programming language such as R or Python, and in some cases C++, Java, or MATLAB.

 Good knowledge of SQL, and the ability to work with data in relational and NoSQL databases.

 The ability to access and extract data from data repositories such as data marts, data warehouses, data lakes, and data pipelines.

 Familiarity with Big Data processing tools such as Hadoop, Hive, and Spark.

We will cover the features and use cases of some of these programming languages, databases, data repositories, and Big Data processing tools further along in the course.

FUNCTIONAL SKILLS: These include:

 Proficiency in Statistics to help you analyze your data, validate your analysis, and identify fallacies and logical errors.

 Analytical skills that help you research and interpret data, theorize, and make forecasts.

 Problem-solving skills, because ultimately, the end-goal of all data analysis is to solve problems.

 Probing skills that are essential for the discovery process, that is, for understanding a problem from the perspective of varied stakeholders and users, because the data analysis process really begins with a clear articulation of the problem statement and desired outcome.

 Data Visualization skills that help you decide on the techniques and tools that present your findings effectively based on your audience, type of data, context, and end-goal of your analysis.

 Project Management skills to manage the process, people, dependencies, and timelines of the initiative.

SOFT SKILLS: These include your ability to work collaboratively with business and cross-functional teams; communicate effectively to report and present your findings; tell a compelling and convincing story; and gather support and buy-in for your work. Above all, being curious is at the heart of data analysis. In the course of your work, you will stumble upon patterns, phenomena, and anomalies that may show you a different path. The ability to allow new questions to surface and to challenge your assumptions and hypotheses makes for a great analyst. You will also hear data analysis practitioners talk about intuition as a must-have quality. It's essential to note that intuition, in this context, is the ability to have a sense of the future based on pattern recognition and past experiences.

APPLICATIONS OF DATA ANALYTICS

The applications of data analytics in the world today are everywhere. Behind every commercial you see, someone had to analyze data to identify, either from the consumer or for the company, what information they wanted to share. This isn't something that should be thought of as separate and apart from everyday life; it's what we do every day, so the applications are universal. The great thing about analytics in this day and age is that it's very widely applicable. Every industry, every vertical, every function within a given organization can benefit from data and analytics, whether you're doing sales pipeline analysis, preparing financials at the end of the month, creating predefined and standardized formatted reports, or doing something like headcount planning or headcount review. All of these, across every vertical, can benefit from analytics.

COMMON SOURCES OF DATA

Data sources have never been as diverse as they are today. Typically, most organisations have internal applications to support them in managing their day-to-day business activities, customer transactions, human resource activities, and their workflows. We will be looking at some common data sources such as relational databases, flat files and XML datasets, APIs and web services, and web scraping.

Relational Databases: are a common source of data in most organisations. Such data can be used to monitor business activities, customer transactions, human resource activities, and workflows. These systems use relational databases such as SQL Server, Oracle, MySQL, and IBM Db2. Data stored in a relational database is structured data. For example, data from a retail transaction system can be used to analyse sales across different regions.
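As a minimal sketch of that example, the snippet below runs a sales-by-region SQL query; SQLite stands in here for SQL Server, Oracle, MySQL, or Db2, and the table and column names are assumptions:

```python
# Querying a relational database for sales by region.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("East", 120.0), ("East", 80.5), ("West", 200.0)],
)

for region, total in conn.execute(
    "SELECT region, SUM(amount) FROM transactions GROUP BY region"
):
    print(region, total)
```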

Flat Files and XML Datasets: are a common source of data from both public and private organisations. For example, government organisations release demographic or economic data, and some companies sell specific data such as point-of-sale data, financial data, or weather data, which businesses can use to define strategy, predict demand, and make distribution decisions.

APIs and Web Services: data is generated when individuals or organisations interact with Application Programming Interfaces (APIs) or web services. Data generated as a result of user activity can come in the form of web requests from users or network requests from applications, with the data returned as plain text, XML, HTML, JSON, or media files. Popular examples of APIs used as a data source for analytics include the Facebook and Twitter APIs, which can be mined for data such as post opinions for sentiment analysis, and stock market APIs, which are used to pull data such as share and commodity prices, earnings per share, and historical prices for trading and analysis.
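A minimal sketch of pulling JSON from a web API with Python's requests library might look like the following; the URL and parameters are hypothetical, and a real stock market or social media API would have its own endpoints and authentication:

```python
# Fetching JSON data from a (hypothetical) web API.
import requests

resp = requests.get(
    "https://api.example.com/v1/quotes",   # hypothetical endpoint
    params={"symbol": "ABC"},
    timeout=10,
)
resp.raise_for_status()                    # fail loudly on HTTP errors
data = resp.json()                         # parse the JSON payload
print(data)
```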

Web Scraping: is used to extract relevant data from unstructured sources such as websites and webpages, for example e-commerce sites. Web scraping makes it possible to gather specific data based on defined parameters, such as text, contact information, videos, and product items, from a website. Some popular scraping tools include Selenium, Pandas, BeautifulSoup, and Scrapy.
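Here is a small, hedged sketch of extracting product data with BeautifulSoup; the HTML and CSS classes are invented, and in practice you would fetch the page over HTTP and respect the site's robots.txt and terms of use:

```python
# Extracting product names and prices from e-commerce-style HTML.
from bs4 import BeautifulSoup

html = """
<div class="product"><span class="name">Widget</span>
<span class="price">9.99</span></div>
<div class="product"><span class="name">Gadget</span>
<span class="price">19.99</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
for product in soup.select("div.product"):
    name = product.select_one("span.name").text
    price = product.select_one("span.price").text
    print(name, price)
```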

Languages Relevant to the work of a Data Analyst

Query Languages: are designed for accessing and manipulating data in databases (e.g., SQL).

Programming Languages: are designed for developing applications and controlling application behaviour (e.g., Python, Java, and R).

Shell and Scripting Languages: are designed for automating repetitive and time-consuming operational tasks (e.g., Unix/Linux shell, PowerShell).

Data Repositories

Data repository is a general term referring to data that has been collected, organised, and isolated so that it can be used for reporting, analytics, and also for archival purposes.

Types of data repositories

The different types of data repository include:

Databases: can be relational or non-relational, each differing in the organisational principles they follow, the types of data they can store, and the tools that can be used to query, organise, and retrieve data.

Data Warehouses: consolidate incoming data into one comprehensive storehouse.

Data Marts: are essentially sub-sections of a data warehouse, built to isolate data

for a particular business function or use case.

Data Lakes: serve as storage repositories for large amounts of structured, semi-structured, and unstructured data in their native format.

Big Data Stores: provide distributed computational and storage infrastructure to

store, scale and process very large data sets.

Big Data refers to the vast amounts of data that are produced each moment of every day by people, tools, and machines. The sheer velocity, volume, and variety of this data challenge the tools and systems used for conventional data. These challenges led to the emergence of processing tools and platforms designed specifically for Big Data, such as Apache Hadoop, Apache Hive, and Apache Spark.

SOURCES OF DATA

Primary Data: refers to information obtained directly from the source, such as data from the organisation's CRM, HR, or workflow applications, or data you gather directly through surveys, interviews, discussions, observations, and focus groups.

Secondary Data: refers to information retrieved from existing sources: external databases, research articles, publications, training materials, internet searches, or financial records available as public data, as well as data collected through externally conducted surveys, interviews, discussions, observations, and focus groups.

Third-Party Data: refers to data purchased from aggregators who collect data from various sources and combine it into comprehensive datasets for the purpose of selling the data.

Data Types and Destination Repositories

Structured Data: relational databases store structured data with a well-defined schema; if you are using a relational database, you will only be able to store structured data. Sources include data from online forms, spreadsheets, etc.

Semi-Structured Data: is non-relational data that has some organisational properties but no rigid schema and is not well tabulated. Common sources include emails, XML, zipped files, binary executables, etc. Semi-structured data is often stored in NoSQL databases.

Unstructured Data: is data that has no structure at all and cannot be organised into a schema. Such data comes from webpages, social media feeds, images, videos, documents, media logs, etc. Unstructured data can be stored in NoSQL databases and data lakes.

ETL tools and data pipelines provide automated functions that facilitate the process of importing data. Tools for importing data include Talend, Informatica, Python, R, SQL, etc.
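A minimal sketch of an extract-transform-load step in plain Python with pandas is shown below; dedicated tools like Talend or Informatica do this at scale, and the file, table, and column names here are assumptions:

```python
# A tiny ETL sketch: extract from a flat file, transform, load to a DB.
import sqlite3
import pandas as pd

raw = pd.read_csv("sales_raw.csv")             # extract (hypothetical file)
raw["amount"] = raw["amount"].astype(float)    # transform: enforce data type
clean = raw.dropna(subset=["region"])          # transform: drop incomplete rows
clean.to_sql("sales", sqlite3.connect("warehouse.db"),
             if_exists="replace", index=False) # load into a database
```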

Data Wrangling

Data wrangling, also known as data munging, is an iterative process that involves exploring, transforming, and validating data, and making it available for credible and meaningful analysis. It includes a range of tasks in preparing raw data for a clearly defined purpose, where raw data at this stage is data that has been collected through various data sources into a data repository.

The data wrangling process includes: discovery, transformation, validation, and publishing the data.
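As a small illustration of the transformation step, the pandas sketch below joins two invented sources and reshapes the result:

```python
# Combining and reshaping data during wrangling; datasets are invented.
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2, 2], "amount": [50.0, 75.0, 20.0]})
customers = pd.DataFrame({"customer_id": [1, 2], "region": ["East", "West"]})

merged = orders.merge(customers, on="customer_id")  # join the two sources
by_region = merged.pivot_table(index="region", values="amount", aggfunc="sum")
print(by_region)
```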

Tools for Data Wrangling

Some of the popularly used data wrangling software and tools include:

 Spreadsheets and add-ins such as Microsoft Excel and Google Sheets, along with Microsoft Power Query for Excel and the Google Sheets QUERY function.

 Python, which has a huge set of libraries and packages that offer powerful data manipulation capabilities. Examples include Jupyter Notebooks, NumPy (Numerical Python), and Pandas.

 R, which offers a series of libraries and packages that are explicitly created for wrangling messy data. Examples of R packages include dplyr, data.table, and jsonlite.

 Other examples of wrangling software include OpenRefine, Google DataPrep, Watson Studio, and Trifacta Wrangler.

Your decision regarding the best tools for your needs will depend on factors that are specific to your use case, infrastructure, and teams, such as: supported data size, data structures, cleaning and transformation capabilities, infrastructure needs, ease of use, and learnability.

Data Cleaning

Poor-quality data, such as missing, inconsistent, or incorrect data, leads to false conclusions and ineffective decisions, which may in turn weaken an organisation's competitive standing and undermine critical business objectives. Data sets picked up from disparate sources could have a number of issues, including missing values, inaccuracies, duplicates, incorrect or missing delimiters, inconsistent records, and insufficient parameters.

Data Cleaning Workflow

The data cleaning workflow includes three steps: inspection, cleaning, and verification.

Inspection: the first step in the cleaning workflow is to detect the different types of issues and errors your data set may have. Data profiling helps you inspect the source data to understand the structures, content, and inter-relationships in your data. It uncovers anomalies and data quality issues such as blank or null values, duplicated data, or whether the value of a field falls within the expected range.

Cleaning: the technique you apply for cleaning your dataset will depend on your use case and the type of issues you encounter. Some of the most common issues you will encounter include the following (a short pandas sketch follows the list):

 Missing values: can cause unexpected or biased results. You should filter out records with missing data, source the missing information, or impute, that is, calculate the missing values based on statistical values.

 Duplicate data: needs to be removed.

 Irrelevant data: data that is not contextual to your use case and hence needs to be removed as well.

 Data type conversion: is needed to ensure that values in a field are stored as the data type of that field. For example, numbers should be stored in a numerical field, and dates stored as a date data type.

 Standardizing data: is needed to ensure date-time formats and units of measurement are standard across the dataset.

 Syntax errors: such as white spaces, extra spaces, typos, and formats like the full name of a state or country versus its abbreviation, need to be fixed.
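Here is the pandas sketch referred to above; it applies several of these fixes to an invented dataset, imputing with the column mean as the statistical value:

```python
# Fixing missing values, syntax errors, inconsistent values, data
# types, and duplicates with pandas; all values are invented.
import pandas as pd

df = pd.DataFrame({
    "state": [" NY", "New York", "CA ", "CA "],
    "age":   [25, None, 31, 31],
    "date":  ["2023-01-05", "2023-01-06", "2023-02-10", "2023-02-10"],
})

df["age"] = df["age"].fillna(df["age"].mean())        # impute missing values
df["state"] = df["state"].str.strip()                 # fix syntax errors
df["state"] = df["state"].replace({"New York": "NY"}) # standardize values
df["date"] = pd.to_datetime(df["date"])               # data type conversion
df = df.drop_duplicates()                             # remove duplicate data
print(df)
```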

Verification: in this step, you inspect the results to establish the effectiveness and accuracy achieved as a result of the data cleaning operation. You need to re-inspect the data to ensure that the rules and constraints applicable to the data still hold after the corrections you made.

Finally, all the changes undertaken as part of the data cleaning operation, the reasons for undertaking these changes, and the quality of the currently stored data must be documented.

Statistical Analysis

Statistics is a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of numerical or quantitative data. Everyday examples of statistics at work include calculations such as average income, average age, and the highest-paid professions; it's all statistics. Today, statistics is applied across industries for decision making based on data. For example, researchers use statistics to analyse data from vaccine production to ensure safety and efficacy, and companies use statistics to reduce customer churn by gaining greater insight into customer requirements.

STATISTICAL ANALYSIS

Statistical Analysis is the application of statistical methods to a sample of data

in order to develop an understanding of what that data represents.

Types of Statistics

Descriptive Statistics: summarize information about the sample. They help you present data in a meaningful way, which allows for simpler interpretation of the data. Data is described using summary charts, tables, or graphs. Descriptive statistics do not attempt to draw conclusions about the population from which the sample is taken, but to make the data easier to understand and visualize.

Common measures of descriptive statistics

 Central Tendency: locates the centre of a data sample. Common measures of central tendency include the mean, median, and mode. These measures tell you where most values in your dataset fall.

 Dispersion: measures the variability in a dataset. Common measures of statistical dispersion are the variance, standard deviation, and range. (A short illustration using Python's statistics module follows this list.)
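Here is the illustration referred to above, using Python's built-in statistics module on an invented sample:

```python
# Central tendency and dispersion on a small invented sample.
import statistics

sample = [12, 15, 15, 18, 20, 22, 35]

print(statistics.mean(sample))      # central tendency: mean
print(statistics.median(sample))    # central tendency: median
print(statistics.mode(sample))      # central tendency: mode
print(statistics.variance(sample))  # dispersion: sample variance
print(statistics.stdev(sample))     # dispersion: sample standard deviation
print(max(sample) - min(sample))    # dispersion: range
```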

Inferential Statistics: are all about making inferences or generalizations about the broader population. Inferential statistics takes data from a sample to make inferences about the larger population from which the sample was drawn. It helps to draw generalisations that apply the results of the sample to the population as a whole.

Some common methodologies of inferential statistics include (a short Python sketch follows the list):

 Hypothesis Testing: can be used, for example, to study the effectiveness of a vaccine by comparing outcomes in a treatment group against a control group; a hypothesis test can tell you whether the efficacy observed in the sample is likely to exist in the population as well.

 Confidence Intervals: incorporate the uncertainty and sampling error to create a range of values within which the actual population value is likely to fall.

 Regression Analysis: incorporates hypothesis tests that help determine whether the relationships observed in the sample data actually exist in the population rather than just in the sample.
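Here is the sketch referred to above: a two-sample t-test and a 95% confidence interval using scipy, with invented group values:

```python
# Hypothesis test and confidence interval on invented sample data.
from scipy import stats

treatment = [2.9, 3.4, 3.1, 3.8, 3.5, 3.3]
control   = [2.1, 2.6, 2.4, 2.8, 2.2, 2.5]

# Hypothesis test: is the difference between group means significant?
t_stat, p_value = stats.ttest_ind(treatment, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# 95% confidence interval for the treatment group's mean.
mean = sum(treatment) / len(treatment)
sem = stats.sem(treatment)
ci = stats.t.interval(0.95, df=len(treatment) - 1, loc=mean, scale=sem)
print(f"95% CI for treatment mean: {ci}")
```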
