INTRODUCTION TO DATA ANALYSIS
WHAT IS DATA?
Data refers to raw facts, figures, or information collected over time and used for reference, analysis,
or decision-making. It can be in various forms, such as numbers, text, images, or audio, and is the
foundation upon which knowledge and insights are built.
Data becomes meaningful when it is processed, structured, and analyzed.
KEY ASPECTS OF DATA
Raw Data: Unorganized and unprocessed facts or figures. For example, survey responses, sales
numbers, or sensor readings.
Processed Data: Data cleaned, organized, or manipulated for specific use, turning raw data into
valuable insights.
TYPES OF DATA
1. Qualitative (Categorical) Data: Descriptive information that cannot be measured numerically
(e.g., colors, labels, categories).
i. Nominal data: The word “nominal” comes from the Latin word “nomen,” which
means “name.” Hence, nominal data is used for labeling variables without any
quantitative value. E.g., gender, marital status, hair color, tribe/ethnicity.
ii. Ordinal data: Ordinal data represents categories with a meaningful order or ranking.
However, the difference between the categories cannot be measured. For example,
movie ratings, positions (1st, 2nd, 3rd, … nth), customer satisfaction ratings/reviews (1-
10), economic status (low, medium, and high), and letter grades (A, B, C, …).
2. Quantitative (Numerical) Data: Numerical information that can be measured and counted
(e.g., height, temperature, sales figures).
i. Discrete data: consists of distinct, separate values that can be counted - whole
numbers. E.g., number of students in a class, number of cars in a parking lot, goals
scored in a game, number of staff.
ii. Continuous data: This can take any value within a range. It is measured, not counted.
It can have infinite possibilities between any two values. E.g., Height, temperature,
time, car speed, etc.
Within quantitative data, there are further distinctions based on the level of measurement:
• Interval Data: Interval data has meaningful intervals between values but lacks a true
zero point. Examples include temperature in Celsius or Fahrenheit and dates on a
calendar. The difference between values is interpretable, but ratios are not
meaningful.
• Ratio Data: Ratio data has all the properties of interval data, and also includes a true
zero point, allowing for meaningful ratios. Examples include weight, height, and
income. Here, both differences and ratios between values are interpretable.
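The interval-versus-ratio distinction can be illustrated with a short Python sketch (the temperature and weight values here are made up for illustration):

```python
# Interval scale: Celsius has no true zero, so ratios are misleading.
c1, c2 = 10.0, 20.0
print(c2 / c1)  # 2.0 -- but 20 °C is NOT "twice as hot" as 10 °C

# Converting to Kelvin (a scale with a true zero) shows why:
k1, k2 = c1 + 273.15, c2 + 273.15
print(k2 / k1)  # ~1.035 -- the physically meaningful ratio

# Ratio scale: weight has a true zero, so ratios are interpretable.
w1, w2 = 40.0, 80.0
print(w2 / w1)  # 2.0 -- 80 kg really is twice 40 kg
```

Differences (20 °C − 10 °C) are meaningful on both scales; ratios are only meaningful on the ratio scale.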
FORMS OF DATA
1. Structured Data: Structured data is highly organized and easily searchable in databases. It
fits neatly into predefined formats, such as rows and columns in spreadsheets or tables in a
relational database (e.g., Excel files, databases).
Examples of Structured Data:
• Databases: Information stored in tables, such as customer information, product
inventories, and sales records.
• Spreadsheets: Organized data in rows and columns, such as financial statements,
schedules, and contact lists.
2. Unstructured Data: Data without a predefined format or structure, such as text from social
media, emails, or images. It includes a variety of data types that are often text-heavy and
require more complex processing to analyze.
Examples of Unstructured Data:
• Text Data: Emails, documents, social media posts, and customer reviews.
• Multimedia Data: Images, audio files, and video recordings.
• Web Data: Content from websites, blogs, and forums.
3. Semi-structured Data: Semi-structured data doesn’t fit neatly into tables but contains tags
or markers to separate data elements. It provides some organizational properties, making it
easier to analyze than purely unstructured data.
Examples of Semi-Structured Data:
• XML and JSON: Formats that structure data in a hierarchical, tagged form, often used in
web services and APIs.
• NoSQL Databases: Databases like MongoDB that store data in flexible, JSON-like
documents.
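As a sketch of what “tags or markers” look like in practice, the following hypothetical customer record is parsed with Python’s standard json module (the record itself is made up for illustration):

```python
import json

# A semi-structured record: every field is tagged, but there is no fixed table schema.
raw = '''
{
  "customer": {
    "name": "Ada",
    "orders": [
      {"id": 1, "total": 25.5},
      {"id": 2, "total": 10.0}
    ]
  }
}
'''

record = json.loads(raw)  # parse the text into nested Python dicts and lists
print(record["customer"]["name"])                              # Ada
print(sum(o["total"] for o in record["customer"]["orders"]))   # 35.5
```

The nesting (a customer containing a list of orders) is exactly what does not fit neatly into a single table, yet the tags still make each element easy to locate.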
Importance of Data:
i. Decision-Making: Data helps organizations make informed decisions based on evidence.
ii. Analysis: Data is used for statistical analysis, forecasting trends, and generating insights.
iii. Automation: Data powers algorithms and machine learning models for automation and
predictions.
DATA LITERACY
1. Data generation/collection
2. Data structure
3. Data storage
4. Data Analysis
5. Statistics
6. Data-Driven Decision Making
DATA COLLECTION/GENERATION
Data collection refers to the process of gathering information from various sources for analysis,
decision-making, or research purposes. It involves obtaining raw data from surveys, experiments,
sensors, or databases.
Data generation refers to the creation or simulation of data, often through models, algorithms, or
synthetic processes. This can occur when real-world data is unavailable or to test systems in
controlled environments.
Sources of Data
Data can come from a myriad of sources; however, these sources are categorized into primary and
secondary sources.
Primary data are collected firsthand by the researcher, company, or person specifically for a
purpose. Examples include surveys, interviews, experiments, measurements/observations, and
real-time sensor data. Secondary data are collected from sources other than the researcher -
someone else had earlier collected the data for a different purpose, but the researcher uses it for
their own analysis. E.g., government census reports, research articles, company financial
statements, databases, and web scraping from social media.
• Surveys and Questionnaires: These tools collect responses directly from people, providing
firsthand information about opinions, behaviors, and experiences. They are commonly used
in market research, customer satisfaction studies, and social science research.
• Transactions: Data from transaction records, sales, purchases, and other business
operations. This type of data is crucial for financial analysis, inventory management, and
customer relationship management.
• Sensors: Physical devices like temperature sensors, GPS trackers, and industrial machines
collect data. Sensor data is essential for real-time monitoring, environmental studies, and the
Internet of Things (IoT) applications.
• Social Media: Platforms like Facebook, Twitter, and Instagram generate vast amounts of data
through posts, likes, shares, and comments. Analyzing this data helps in understanding public
sentiment, trending topics, and consumer behavior.
• Logs: System and application logs provide detailed records of activities and events. These
logs are invaluable for monitoring system performance, diagnosing issues, and ensuring
security.
DATA STRUCTURE
A data structure is a way of organizing, managing, and storing data in a computer so that it can be
efficiently accessed and modified. Data structures are essential in computer science and data
analysis as they determine how data is arranged, how it can be processed, and how efficiently tasks
like searching, sorting, and updating data can be performed.
Types of Data Structures
Data structures are broadly classified as primitive and non-primitive.
Primitive Data Structures:
These are basic data types provided by the programming language. E.g.
• Integer: Stores numerical data (e.g., 1, 10, -5).
• Float: Stores decimal values (e.g., 3.14, 0.001).
• Character: Stores single characters (e.g., 'a', 'B').
• Boolean: Stores true or false values (e.g., True, False).
Non-Primitive Data Structures: These are more complex and can be derived from primitive data
structures. Non-primitive data structures are divided into two categories: linear and non-linear.
• Linear Data Structures: data elements are arranged in a sequential or linear order, where
each element is connected to its previous and next elements. Array, Linked List, Stack,
Queue.
• Non-Linear Data Structures: data elements are not stored sequentially; they are connected
in a hierarchical or graph-like structure. Tree (Binary Tree, Binary Search Tree), Graph, Heap.
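A minimal Python sketch of two linear structures, a stack and a queue, using a plain list and collections.deque from the standard library:

```python
from collections import deque

# Stack (LIFO): the last item pushed is the first popped.
stack = []
stack.append("a")
stack.append("b")
print(stack.pop())      # b

# Queue (FIFO): the first item enqueued is the first dequeued.
queue = deque()
queue.append("a")
queue.append("b")
print(queue.popleft())  # a
```

The only difference between the two is which end elements leave from, which is exactly what “sequential order” means in a linear structure.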
7 golden rules when structuring your data – DBrown Consulting
1. No empty columns
2. No empty rows
3. All dates must be in a single column (every row in a table should represent one transaction).
4. All columns must have a unique data type.
5. No totals or subtotals.
6. No obstruction around your data – anything that is not part of your data should not be
around your dataset.
7. Single rows of heading.
DATA STORAGE
This is simply how we store our data, how we save, organize, and manage data so that it can be
efficiently accessed, processed, and analyzed. Proper data storage ensures that the data is secured,
retrievable, and usable for analysis.
Types of Data Storage:
• Local Storage: Data is stored on physical devices like hard drives, USB drives, or personal
computers.
• Cloud Storage: Data is stored online via cloud platforms (e.g., Google Drive, Amazon S3,
Microsoft Azure), allowing access from multiple devices and locations.
• Databases: Structured storage solutions that allow data to be easily queried and managed
(e.g., MySQL, PostgreSQL, MongoDB).
• Data Warehouses: Large-scale storage systems designed to hold vast amounts of structured
data from different sources for analysis (e.g., Amazon Redshift, Snowflake).
• Data Lakes: Store vast amounts of raw, unstructured data (e.g., Hadoop, AWS S3).
Storage Formats:
• Spreadsheets: Data stored in Excel or CSV files, often used for smaller datasets.
• Databases: Relational databases (SQL) or NoSQL databases.
• Text Files: Data stored in plain text, JSON, or XML formats.
• Binary Files: More efficient data storage formats, often used for large datasets (e.g., HDF5,
Parquet).
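As a sketch, the standard csv and json modules can write the same small made-up dataset in two of these text formats:

```python
import csv
import io
import json

# Made-up rows for illustration.
rows = [{"product": "pen", "units": 120}, {"product": "book", "units": 45}]

# CSV: the spreadsheet-friendly format.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["product", "units"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()

# JSON: the semi-structured text format.
json_text = json.dumps(rows)

print(csv_text.splitlines()[0])  # product,units
print(json_text)
```

Binary formats like Parquet serve the same purpose but store the data column-by-column with compression, which is why they are preferred for large datasets.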
Databases
A database is an organized collection of data, generally stored and accessed electronically. It
manages, stores, and retrieves large amounts of structured data efficiently, and allows data to be
manipulated, queried, and updated to serve various applications. Put simply, it is where our data
is based.
Types of Databases
i. Relational Databases: Organize data into tables (rows and columns) with predefined
relationships. These tables can be related to each other through primary and foreign keys.
For data manipulation, they use Structured Query Language (SQL). E.g., MySQL,
PostgreSQL, SQL Server, and Oracle. Arguably, Excel, Access, and SharePoint can also store
relational-style data.
ii. NoSQL: Designed for unstructured and semi-structured data. More flexible than relational
databases and often used for big data and real-time applications. E.g., MongoDB, Cassandra,
Neo4j.
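A minimal relational-database sketch using Python’s built-in sqlite3 module; the sales table and its rows are made up for illustration:

```python
import sqlite3

# An in-memory relational database: data lives in a table of rows and columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("pen", 2.5), ("book", 10.0), ("pen", 3.0)])

# SQL query: total sales per product.
totals = list(conn.execute(
    "SELECT product, SUM(amount) FROM sales GROUP BY product ORDER BY product"))
conn.close()
print(totals)  # [('book', 10.0), ('pen', 5.5)]
```

The same SELECT / GROUP BY pattern carries over directly to MySQL, PostgreSQL, and the other relational systems listed above.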
Importance of Data Storage in Analysis:
• Data Integrity: Proper storage prevents data corruption or loss, ensuring that the data
remains accurate and reliable.
• Data Security: Storing data securely is crucial for protecting sensitive information from
unauthorized access or breaches.
• Scalability: As the volume of data grows, scalable storage solutions (like cloud services) help
accommodate expanding datasets without performance issues.
• Efficiency: Efficient storage allows for quick access, reducing the time needed for data
retrieval and processing during analysis.
DATA ANALYSIS
Data analysis is the process of examining, cleaning, transforming, and modeling data to discover
useful information, draw conclusions, and support decision-making. It involves various techniques
and methods to ensure that the data is accurate, reliable, and suitable for the analytical process. The
ultimate goal of data analysis is to turn raw data into actionable insights that inform strategies and
decisions. Simply put, telling a story with data. Making sense of a dataset.
Categories/Types of Analytics
1. Descriptive Analytics: What has happened based on historical data. This type of analysis
summarizes past data to understand what happened. It involves using statistical measures
such as mean, median, mode, and standard deviation to describe the main features of a
dataset. For example, a company might analyze sales data to see which products sold best
last quarter. Descriptive analysis is often the first step in data analysis as it provides a clear
overview of the data.
2. Diagnostic Analytics: This analysis explores the reasons behind past outcomes. It goes
beyond descriptive analysis by trying to understand the underlying causes of certain
events. It digs deeper to find the cause: performance indicators are investigated further
to discover why they improved or worsened. If sales dropped, for instance, diagnostic analysis
might look at factors like market trends, changes in consumer behavior, or internal business
processes. Techniques such as drill-down, data mining, and correlations are commonly used
in diagnostic analysis.
3. Predictive Analytics: forecasting future trends and determining whether past patterns are
likely to occur again. This analysis uses historical data to predict future events. It involves
statistical models and machine learning algorithms to forecast trends and behaviors.
Retailers, for example, might forecast future sales based on past trends and patterns.
Predictive analysis helps businesses anticipate changes and plan accordingly.
4. Prescriptive Analytics: This type of analysis suggests actions to achieve desired outcomes. It
combines insights from predictive analysis with optimization and simulation algorithms to
recommend specific actions, allowing businesses to make informed decisions in the face of
uncertainty. For example, it might involve recommending optimal inventory levels or
marketing strategies based on predictive insights. Prescriptive analysis aims to guide
decision-making toward the best possible outcomes.
5. Cognitive Analytics: Typically performed by data scientists. What might happen if
circumstances change, and how do we handle the situation? It refers to the use of advanced
data processing techniques, often involving artificial intelligence (AI) and machine learning
(ML), to simulate human thinking and decision-making. It enables systems to analyze data,
extract insights, and make informed decisions in a way that mimics human cognition, such as
reasoning, learning, problem-solving, and pattern recognition.
Stages of Data Analysis/Data Analysis Process
The data analysis process involves a series of systematic steps to extract meaningful insights from
data and help in decision-making. Each stage contributes to the overall effectiveness of the analysis
and supports business objectives. These processes are:
1. Understanding the objective/ business problem(s)
2. Data collection and preparation.
3. Data processing, cleaning, and transformation.
4. Data Analysis Proper: Exploratory Data Analysis (EDA), Statistical Data Analysis, Data
Analysis Methods.
5. Interpretation, visualization, presentation, recommendations, etc.
6. Act – decision-making phase (very important to business)
Understanding the Objective / Business Problem(s)
This is the most critical step as it sets the foundation for the entire data analysis. It involves clearly
defining the business problem, research question, or objective that needs to be addressed.
Here, the analyst will:
• Identify and articulate the specific problem or goal the business is facing.
• Determine the key metrics and questions that will help solve the problem.
• Align the data analysis goals with business needs.
• Collaborate with stakeholders to understand the expectations, context, and desired
outcomes.
For example, a business might want to understand why sales have dropped over the past quarter
or how to improve customer retention.
Data Collection and Preparation:
After understanding the objective, you need to gather the right data that will help in answering the
business question. Data collection involves sourcing relevant data from various internal and external
sources.
Key actions here include:
• Identify the data needed (e.g., transactional data, customer feedback, web analytics).
• Collect data from reliable sources (e.g., databases, surveys, APIs).
• Ensure that the data collected is relevant to the problem being analyzed.
• Types of data can include primary data (collected firsthand) or secondary data (from existing
databases or reports).
Example: For a sales analysis, data may include sales transactions, customer demographics,
marketing spend, etc.
Data Processing, Cleaning, and Transformation:
This is possibly the most important stage in the data analysis process. Wrong/dirty data will result in
an incorrect analysis. Raw data is often messy and needs to be cleaned and structured for analysis.
This step ensures that the data is accurate, consistent, and in a usable format.
Key actions here include:
• Data Cleaning: Handling missing values, removing duplicates, correcting errors, and
standardizing data formats.
• Transformation: Converting data into a suitable format, such as aggregating sales by day,
normalizing values, or creating new calculated fields.
• Filtering: Removing irrelevant or outlier data points that could distort the analysis.
• Integration: Combining data from multiple sources, if necessary.
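A minimal Python sketch of the cleaning steps above, run on a small made-up dataset containing a missing value, a duplicate, and inconsistent formatting:

```python
# Made-up raw records: one missing value, one duplicate, inconsistent casing/spacing.
raw = [
    {"id": "C1", "city": " Lagos ", "sales": 100},
    {"id": "C2", "city": "abuja", "sales": None},   # missing value
    {"id": "C1", "city": " Lagos ", "sales": 100},  # duplicate
]

cleaned, seen = [], set()
for row in raw:
    if row["sales"] is None:        # handle missing values: drop incomplete rows
        continue
    row = {**row, "city": row["city"].strip().title()}  # standardize format
    if row["id"] in seen:           # remove duplicates by ID
        continue
    seen.add(row["id"])
    cleaned.append(row)

print(cleaned)  # [{'id': 'C1', 'city': 'Lagos', 'sales': 100}]
```

In practice a library such as pandas does the same work (dropna, drop_duplicates, str.strip), but the logic is identical.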
Data Analysis Proper
This is where the actual analysis of the data happens, and it involves exploring the data, applying
statistical techniques, and drawing insights.
Key Actions:
• Exploratory Data Analysis (EDA): Initial investigation of the data using summary statistics
(Mean, Median, Mode, Standard Deviation, percentile, etc.), graphs, and charts to identify
patterns, trends, or anomalies.
• Statistical Data Analysis: Applying statistical techniques such as regression, hypothesis
testing, and clustering to test assumptions or relationships between variables.
• Data Analysis Methods: Analyzing the dataset using specific models, machine learning
techniques, or domain-specific tools to extract actionable insights.
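The EDA summary statistics listed above can be computed directly with Python’s standard statistics module; the daily sales figures here are made up for illustration:

```python
import statistics

# Made-up daily sales figures for a quick exploratory summary.
sales = [12, 15, 15, 18, 22, 15, 30]

print(statistics.mean(sales))    # arithmetic mean
print(statistics.median(sales))  # 15 -- middle value when sorted
print(statistics.mode(sales))    # 15 -- most frequent value
print(statistics.stdev(sales))   # sample standard deviation (spread)
```

Comparing the mean against the median (here the mean is pulled above the median by the outlier 30) is a typical first EDA check for skew and anomalies.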
Data Visualization:
Visual representation of data is crucial to make the findings easier to understand, especially for non-
technical stakeholders.
Key Actions:
• Creating graphs, charts, and dashboards (e.g., bar charts, line graphs, pie charts, heatmaps)
that illustrate key findings and trends.
• Using visualization tools such as Power BI, Excel, or Tableau to communicate insights
effectively.
• Ensuring that the visualizations are clear, intuitive, and tailored to the audience’s
understanding.
Example: A dashboard showing sales trends over time, broken down by region and product
category.
Interpretation, Recommendations, and Presentation:
This stage involves translating the analytical results into business insights and actionable
recommendations.
Key Actions:
• Interpreting the results in the context of the original business problem.
• Providing meaningful explanations for trends, patterns, or anomalies.
• Making recommendations based on the analysis (e.g., increasing marketing spend in specific
regions, adjusting pricing strategies).
• Presenting the findings to stakeholders, often in the form of reports or presentations.
Example: Presenting insights on customer churn and recommending strategies to retain high-value
customers.
Act – Decision-Making Phase:
Every analysis should leave stakeholders with an action plan – what to do. This is the final stage
where the insights and recommendations are put into action. It involves making informed business
decisions based on the analysis to address the original business problem.
Key Actions:
• Using the insights from the analysis to implement strategies or changes in the business.
• Monitoring the outcomes of the decisions and assessing if the goals are achieved.
• Iterating the process, if necessary, by refining strategies or conducting further analysis.
Example: A company might decide to launch a targeted marketing campaign for specific customer
segments based on insights from the analysis.
SCENARIO
A company wants to understand why customer churn is increasing.
1. Understanding the Problem: The goal is to identify factors contributing to customer churn
and recommend retention strategies.
2. Data Collection: Collect data on customer transactions, demographics, support tickets, and
feedback.
3. Data Cleaning: Handle missing data in support tickets and standardize customer IDs across
datasets.
4. Data Analysis Proper: Use EDA to explore customer churn patterns, and run a logistic
regression to identify the variables most likely to cause churn.
5. Visualization: Create a dashboard showing customer churn rates by product, region, and
support ticket frequency.
6. Interpretation & Recommendations: Interpret results to show that customers with high
support ticket frequency are more likely to churn. Recommend improving customer service.
7. Act: Implement enhanced customer service training and offer promotions to high-risk
customers.
COMMON TOOLS FOR DATA ANALYSIS
Several tools are commonly used in data analysis to facilitate the process and enhance accuracy:
• Excel: A widely used tool for basic data analysis, offering features like pivot tables, charts,
and various statistical functions. It’s user-friendly and suitable for small to medium-sized
datasets. Excel is often used for initial data exploration and simple analyses.
• SQL: Structured Query Language (SQL) is used for managing and querying relational
databases. SQL is essential for retrieving, updating, and manipulating data stored in
databases. It’s powerful for handling large datasets and performing complex queries.
• Power BI/Tableau: These are powerful data visualization tools that help in creating
interactive and shareable dashboards. They are used to represent data graphically and make
it accessible for analysis. Power BI integrates well with other Microsoft products, while
Tableau is known for its ease of use and ability to handle large datasets.
• Python/R: These programming languages are versatile and widely used for data analysis.
Python offers libraries like pandas, NumPy, and scikit-learn
for data manipulation, analysis, and machine learning. R is specifically designed for statistical
computing and graphics, with extensive libraries for various types of analysis. Both languages
are popular due to their simplicity, extensive community support, and powerful capabilities.
ESSENTIAL SKILLS NEEDED
These skills are fundamental for individuals working in data-driven fields, and mastering them can
significantly enhance overall performance in the workplace.
• Technical Skills: These skills focus on the tools and technologies essential for analysis and
problem-solving. They include Excel, SQL, Data Visualization, Statistics, and programming
languages like Python and R.
• Analytical Skills: These skills involve processing information critically and solving problems
effectively. They encompass critical thinking, problem-solving, and attention to detail.
• Communication Skills: Communication is critical for conveying insights and collaborating
with teams. Skills in this category include domain knowledge, storytelling, written and verbal
communication, and reporting.
• Interpersonal Skills: These skills foster positive interactions and collaboration in teams. They
include teamwork, time management, and adaptability.