SQL Notes
DATA ANALYTICS
TYPES OF DATA
STRUCTURED DATA
SEMI-STRUCTURED DATA
UNSTRUCTURED DATA
DATASET
TYPES OF DATASET
DATABASE
RELATIONAL DATABASE
SQL
WHAT IS SQL?
WHAT IS SQL FOR DATA ANALYSIS?
WHY IS SQL A MUST-HAVE SKILL FOR EVERY DATA ANALYST?
USE CASES OF SQL FOR DATA ANALYSIS
BEST TOOLS IN SQL FOR DATA ANALYTICS
COMMON SQL QUERIES FOR DATA ANALYSIS
SQL TECHNIQUES TO PERFORM DATA ANALYSIS
BENEFITS OF SQL FOR DATA ANALYSIS
LIMITATIONS OF SQL FOR DATA ANALYSIS
SQL FOR DATA ANALYSIS: SQL QUERIES
FAQs
SQL
DATA ANALYTICS
Data analytics is the process of examining raw data to draw insights and make
informed decisions.
Raw data: unprocessed data.
Data is a critical asset in every industry, and its structures and types vary.
A data type (float, integer, categorical variable, quantitative variable, array,
list, dictionary) defines the structure of the data – for example, whether a value
is quantitative data, an integer, a float, or a string.
TYPES OF DATA:
1. Structured data
2. Semi Structured data
3. Unstructured data
STRUCTURED DATA:
Data is usually in a tabular format.
It is stored in Excel files, CSV files, or SQL databases.
Anyone who reads the data can decipher it; there are specific rows and columns,
and each column is defined with a heading. This type of data is used in Relational
Database Management Systems (RDBMS).
SQL stands for Structured Query Language.
It is essentially the query language for structured data.
SQL is specifically a language that requires data to be structured. In Python or R,
different libraries can convert various types of datasets into a structured format,
such as a data frame, list, or array. However, SQL does not have that flexibility; it
requires a structured dataset.
SQL can only be used when the data has a structure.
Data stored in Excel or any other tabular format is considered a structured
database.
It has a defined format and a defined structure.
Fixed schemas (e.g., rows and columns).
Rigid (the schema cannot be changed easily).
Scalability – storing a huge amount of data requires scaling out horizontally,
which increases storage size and resource utilization.
Conforms to a tabular format:
o Well-defined relationships between different rows and columns.
o Excel files are not SQL databases, but they can be used to store structured
data.
Highly organized and easily decipherable.
Use of RDBMS.
Usability of SQL.
Example: a table with columns SL NO, NAME, and AGE, holding one record per
numbered row (1, 2, ...).
SEMI-STRUCTURED DATA
This type of data includes websites, Wikipedia pages, resumes, reports, etc.
Semi-structured data is a kind of structured data stored with the help of tags and
semantic markers.
Metadata is identified and gives structure to the data.
Uses of metadata: Tags, semantic markers, etc., to identify data characteristics.
This makes the data better cataloged, searchable, and analyzable.
More complex than structured data.
Easier to store and comprehend than unstructured data.
Self-describing data formats include JSON, XML, and HTML.
JSON is a format of key-value pairs (like a dictionary). It does not define data
entirely as tabular but uses keys (tags) to describe entries and their attributes, such
as names and ages.
Semi-structured data has some structure and can be easily converted to a
structured format, but it is not exactly tabular or relational. These types of
databases are known as semi-structured.
There is some structure available, making it easy to identify, but it is not in a
structured format like Excel or CSV.
We moved from structured to semi-structured data because it makes processing
faster, whereas processing structured data is comparatively slower (higher
latency).
Easy to catalog.
Easier to analyze.
Easy to store and comprehend.
Example: Amazon.
For example, Amazon has a large amount of data, making it difficult to store in a
structured format. Therefore, data is stored in a semi-structured format using tags
like HTML or XML. Tags are used to store related data under the same tag.
No fixed format structure; employs tagging and demarcation schemas (schemas
vary).
Allows separation among certain elements, enabling search and navigation.
Undefined format (can be any form of data).
UNSTRUCTURED DATA
This type of data cannot be easily defined or structured and is stored as it is.
To address this type of data, a different type of tool was developed known as
NoSQL.
NoSQL was named to indicate that it does not use SQL (Structured Query
Language).
NoSQL is a tool used for analyzing unstructured databases, also known as
non-relational databases.
Non-relational databases handle unstructured data such as social media posts,
YouTube videos, images, PDFs, etc.
Example: An AI car collects vast amounts of data every second, such as
acceleration and engine revolutions. This is known as IoT-based data.
IoT stands for Internet of Things.
Example: Rear parking sensors, RFID (used in tolls and shopping complexes). In
shopping complexes, if a tag is not removed, it beeps when carried out.
We get the data as it is and process it as it is.
Specialized tools like MongoDB, Dynamo DB, Hadoop, Azure, etc., are used to
manage unstructured data.
Hadoop is a tool used to handle unstructured databases.
Unorganized, no predefined data model, qualitative (e.g., social media posts
including images and videos).
Managed using NoSQL (non-relational databases).
Fast accumulation rate.
Data is stored in its native format, enhancing usability and flexibility.
High flexibility (can store any kind of data like social media posts, including
images and videos).
Examples: Multimedia texts, documents, images, sound bites.
May contain an outer structure that wraps otherwise unstructured content.
DATASET
A Dataset is a set of data grouped into a collection with which developers can
work to meet their goals. In a dataset, the rows represent the number of data points
and the columns represent the features of the Dataset. They are mostly used in
fields like machine learning, business, and government to gain insights, make
informed decisions, or train algorithms. Datasets may vary in size and complexity
and they mostly require cleaning and preprocessing to ensure data quality and
suitability for analysis or modeling.
A dataset is a type of format in which data is stored. It can be in various formats
such as tabular form, Excel file, CSV file, or JSON file. Regardless of the format,
it is known as a dataset because it stores data.
A dataset is a form where data is stored.
It can be in spreadsheets or a data frame.
Rows represent records.
Columns represent attributes.
All items in a single column have the same type of data or feature.
Indexable.
Relational databases store a unique reference for every row or record (like a row
number).
Multi-level indexing for faster access.
TYPES OF DATASET
1. RECTANGULAR DATASET
All the items in a single column have the same datatype.
Note: It is important to check the datatype first.
Any data you encounter day-to-day that is stored in rows and columns is in a
rectangular format.
This is the format required by SQL.
2. NON-RECTANGULAR DATASET
Another form of dataset is known as a non-rectangular dataset.
There are no columns and no rows.
It can be both relational and non-relational.
Note: Anything in curly brackets is a dictionary, and anything in square brackets
is a list (array).
Example:
data = ['little boy',
        [1, 2, 3, 4],                       # list
        ('a', 'b', 'c'),                    # tuple
        {'hi': 1, 'hello': 2, 'aloha': 3}]  # dictionary
This data is relational (items share tags and relations can be built), but it is
non-rectangular because it has no fixed row-and-column format.
Example: graph data (nodes and edges), such as the World Wide Web (WWW), is
unstructured data.
This is related to the internet. If we post anything on LinkedIn/Facebook, it is
connected or interconnected on the internet by hashtags (#). This forms a node or
graphical structure, and it is very difficult to convert this immense amount of data
into a rectangular structured format. Therefore, we created a separate tool known
as NoSQL.
DATABASE
A database is a container that stores data electronically.
A database is an organized collection of data, so it can be easily accessed and
managed.
You can organize the data into tables, rows, and columns, and index it to make it
easier to find relevant information.
The main purpose of a database is to handle a large amount of information by
storing, retrieving, and managing data.
The main reasons to store data are to ensure it can be:
o Easily accessed
o Modified
o Protected
o Analyzed
Example: Amazon stores data. Any user who wants to buy products can query the
data through a DBMS to see what products are available. All the data is stored so
it can be analyzed.
Data is stored in a database.
RDBMS (Relational Database Management System) is used to retrieve, modify,
and protect the data in the database.
SQL (Structured Query Language) is used to interact with an RDBMS.
DBMS (Database Management System) has two types:
1. Relational Database (RDBMS)
2. Non-Relational Database (NRDBMS)
RELATIONAL DATABASE
A relational database is connected with keys, specifically unique primary keys.
The structure consists of many tables connected using primary keys and foreign
keys
An employee is directly connected to a manager with an ID, and the manager is
directly connected to a department with an ID. However, the department and
employee are not directly connected; they are indirectly connected through the
manager.
A relational database or Relational Database Management System (RDBMS)
stores data in the form of tables
Ex (diagram): an EMPLOYEE table is linked to a MANAGER table by an ID, and the
MANAGER table is linked to a DEPARTMENT table by an ID.
DATATYPE
A data type defines the structure of the data. It describes what the data is about,
such as whether it is an integer (no decimals) or a float (with decimals). Data
types can be in any format, including JSON. Common data types include integers,
strings, objects, and floats.
SQL Server is an instance that helps you build and connect to a database. A
connection is an endpoint that allows you to write queries in a specific format and
retrieve results from the database.
Questions:
1. Explain data types with example
2. Explain data sets with relative example
3. Types of data
STRUCTURED /UNSTRUCTURED
TYPES OF DATA
1. Qualitative Data – Non-numerical data, also known as categorical variables, such
as string responses (e.g., eye color: black). Boolean data is different and not
considered qualitative.
a. Nominal – Represents categories without any order. The values should be
string values.
Example: “I have an SUV,” “I live in Sweden” – these define the values
stored in the data when collecting it.
b. Ordinal – Represents categories with a specific order but does not describe
the characteristic itself.
Example: Levels of satisfaction such as “happy,” “neutral.”
2. Quantitative Data – Numerical data that describes the data in terms of numbers
(e.g., 44, 1, 10, 100).
a. Discrete – Can hold only specific values, often whole numbers.
Example: The number of students in a class (e.g., 29 students, but not 22.1
or 22.9).
b. Continuous – Can hold any value within a range.
Example: Height, which can be measured continuously (e.g., 170.5 cm)
SQL enables users to efficiently create, read, update, and delete records in these
databases.
Its widespread adoption is partly due to its versatility, ease of understanding, and
robust functionality, which allows for the precise management of large data sets.
SQL for data analysis refers to the application of SQL in querying and
manipulating data to extract insights and inform business decisions.
It involves using specific queries to analyze data trends, patterns, and anomalies.
Through SQL, analysts can aggregate data, perform complex joins between tables,
and filter data sets to hone in on relevant information.
The power of SQL for data analysis lies in its ability to handle large volumes of
data across different databases, making it a staple in the data analyst’s toolkit.
SQL’s ubiquity in data management makes it a critical skill for data analysts.
Firstly, it provides direct access to data stored in relational databases, which is
common in many organizations.
This direct access allows analysts to query live data, ensuring analyses are based
on current information.
Mastery of SQL also enhances efficiency, allowing for the automation of repetitive
tasks and complex data transformations.
o Market Trend Analysis: Through data aggregation and filtering, SQL assists in
uncovering market trends and consumer preferences
o JOIN: To combine rows from two or more tables based on a related column.
o GROUP BY: To aggregate rows with the same values in specified columns
into summary rows.
o Subqueries: Queries within queries that allow for more dynamic data
manipulation.
o Flexibility: SQL can handle various data analysis tasks, from simple queries to
complex data transformations.
o Scalability: SQL databases can manage anything from small data sets to large
enterprise-level databases.
o Performance Issues with Large data sets: SQL queries can sometimes
become slow or resource-intensive, particularly with large data sets.
o Limited to Structured Data: SQL excels with structured data but is not
designed for unstructured data, such as text or images, without additional
processing or tools.
A basic knowledge of SQL and its query language is essential for any aspiring data
analyst and data scientist. This blog discusses SQL and its data analysis features.
SQL for Data Analysis is a powerful programming language that helps data analysts
interact with data stored in Relational databases. SQL stands for Structured Query
language.
Since most systems today capture the data using one or more databases (like MySQL,
Oracle, Redshift, SQL Server, etc.), you need to know SQL to extract data from these
systems and then work with it.
But before I explain why a data analyst should learn SQL, let me take you through
what data analysis is and who data analysts are:
Data analysis refers to using and manipulating data to generate insights that can help
businesses solve a problem or uncover opportunities. A data analyst is a professional
who collects and analyzes data in order to turn it into meaningful information that
companies can use to improve and grow. These individuals gather information from a
number of sources and are skilled in researching, analyzing and reporting.
Historically, MS Excel was sufficient to handle these activities. But as businesses
generate more data, something more than a spreadsheet is needed to keep up with the
pace. And that is where SQL enters the picture, specifically when it comes to dealing
with relational databases storing large volumes of data.
Differences between Excel and SQL
You can use SQL to help you with the following work:
Creating databases and tables.
Adding data to a table.
Selecting data.
Editing data.
Deleting data.
Sorting data.
Finding unique values.
Combining data from two or more tables.
SQL queries can be classified into five parts as they perform specific roles to execute
queries on any RDBMS system, and they are:
1. Data Definition Language (DDL): DDL commands include create, alter, drop,
rename and truncate, dealing with the structure of the databases. It operates on
database objects like views, tables, indexes, and triggers.
2. Data Manipulation Language (DML): DML commands include insert, update,
and delete operations to modify data in existing databases.
3. Data Query Language (DQL): This command includes a select operation to
retrieve data matching criteria specified by the user. To condense data
efficiently, DQL commands also involve nested queries.
4. Data Control Language (DCL): This command is used by data administrators
to grant and revoke permission to access data in the organization’s database.
5. Transaction Control Language (TCL): TCL commands help in managing
transactions in databases to commit or roll back a current transaction. TCL
command is used to commit a DML operation, and it has the ability to club
multiple commands in a single operation.
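The command categories above can be sketched with Python's built-in sqlite3 module, used here as a lightweight stand-in for a full RDBMS (SQLite has no DCL, and the commit() call plays the TCL role; the emp table and its columns are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# DDL: define the structure of a database object.
cur.execute("CREATE TABLE emp (id INTEGER PRIMARY KEY, name TEXT)")

# DML: modify data in the existing table.
cur.execute("INSERT INTO emp (id, name) VALUES (1, 'Anil')")
cur.execute("UPDATE emp SET name = 'Anita' WHERE id = 1")

# TCL: commit the current transaction.
conn.commit()

# DQL: retrieve data matching user-specified criteria.
rows = cur.execute("SELECT name FROM emp WHERE id = 1").fetchall()
```

Running this leaves rows holding the single updated record, showing DDL, DML, TCL, and DQL working together on one connection.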
SQL COMMANDS
1. DDL : Data Definition Language
2. DML : Data Manipulation Language
3. DCL : Data Control Language
4. DQL : Data Query Language
DCL is not typically used by data analysts, so we will be using DDL, DML, and DQL.
DCL is mainly used by data engineers and database administrators.
1. DDL : DATA DEFINITION LANGUAGE:
Command – Description
CREATE – Creates a new table, a view of a table, or another object in the database
ALTER – Modifies an existing database object, such as a table
DROP – Deletes an entire table, a view of a table, or another object in the database
CREATE TABLE
In SQL, we use the CREATE TABLE statement to define the structure of a
new table. This statement specifies the table name and the columns it will
contain, along with their data types and any constraints.
It follows a specific format, with each element serving a distinct purpose.
Whenever we create a table, these things are important:
Specify the table name.
Specify the column names.
Define the data types of the columns (e.g., INT, VARCHAR, etc.).
Define constraints, which describe the rules for the data types.
Example: If you are describing a column as an integer and you specify that it
should not be NULL (no data/value), then NOT NULL is added along with the
data type INTEGER. This means that the column can only have integer values
and cannot have any null values.
VARCHAR(n) will hold up to n characters.
A column can be constrained so that it never contains NULL values. For example, an
ID column can be declared NOT NULL so that every row must have an ID; this rule is
given as a constraint.
Example: Suppose we want to create a table to store information about
employees. Here’s what the CREATE TABLE statement might look like:
CREATE TABLE Employees (
    employee_id INT PRIMARY KEY,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    department VARCHAR(50),
    hire_date DATE
);
USE db_name;
CREATE TABLE table_name (
Column_name1 datatype constraint,
Column_name2 datatype constraint,
Column_name3 datatype constraint,
);
CREATE TABLE student (
    id INT PRIMARY KEY,
    name VARCHAR(50),
    age INT NOT NULL
);
ID | NAME | AGE
1  | AVI  | 20
2  | ANU  | 21
3  | AKE  | 20
In this example: id is an integer, name is a VARCHAR, and age is an integer.
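A minimal runnable sketch of this student table, using Python's sqlite3 module as a stand-in for MySQL (SQLite accepts the same CREATE TABLE shape; the sample rows are the ones from the notes):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# The student table from the notes (SQLite syntax; MySQL is equivalent here).
cur.execute("""
CREATE TABLE student (
    id   INTEGER PRIMARY KEY,
    name VARCHAR(50),
    age  INTEGER NOT NULL
)
""")

# Load the three sample rows from the table above.
cur.executemany("INSERT INTO student VALUES (?, ?, ?)",
                [(1, "AVI", 20), (2, "ANU", 21), (3, "AKE", 20)])

count = cur.execute("SELECT COUNT(*) FROM student").fetchone()[0]
```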
INTERPRETATION
The CREATE TABLE keyword signals the start of the statement.
Employees is the name of the table we are creating.
Each column definition includes the column name and its data type.
The primary key constraint specifies the employee_id column as the primary
key, ensuring uniqueness for each employee.
Data types such as INT, VARCHAR, and DATE specify the type of data each
column can hold.
Note:
o When naming tables or columns in SQL or any kind of modeling, never
use spaces in the names. For example, it should not be employee table;
instead, use an underscore (_) like employee_table or first_name.
o Avoid giving spaces in names
1. The first column name is employee_id, and you need to describe its data type
(INT) and constraint (PRIMARY KEY).
2. Use a comma (,) to separate each column definition.
3. The first_name column is described with the data type VARCHAR, which stands
for variable character (another term for string).
4. Similarly, define the last_name column with its data type and any constraints.
5. For string values like department, use VARCHAR(50).
6. The hire_date column uses the DATE data type, with the default date format
being YYYY-MM-DD.
7. Close the statement with a closing parenthesis ) and a semicolon ; (the
semicolon indicates the end of the SQL statement).
COLUMN DEFINITIONS:
INT, VARCHAR (Variable character string), DATE, DECIMAL (e.g., 1.1, 1.2,
1.3), and BOOLEAN (TRUE or FALSE) are examples of data types we can use
to define columns.
INT is for integer values, VARCHAR is for variable-length character strings,
DATE is for date values, DECIMAL is for decimal values, and BOOLEAN is for
boolean values (true or false).
Numerical data types include INTEGER and DECIMAL.
Categorical data is represented by VARCHAR.
Date data is represented by DATE.
Boolean data is represented by BOOLEAN (true or false, yes or no).
CONSTRAINTS
SQL Constraints are used to specify rules for data in a table
1. PRIMARY KEY
This constraint ensures that each value in the specified column is unique and
serves as a unique identifier for each row in the table.
It can be in any column and acts as a unique identifier (no duplicates
allowed, serves as a reference ID for joining tables).
If a column can contain duplicate values (e.g., Name), do not use it as a
primary key.
If the primary key might contain characters, use VARCHAR instead of INT. A
small VARCHAR can also reduce storage size.
A primary key makes a column UNIQUE and NOT NULL, but a table can have only
one primary key.
Ex : id INT PRIMARY KEY
Ex :
CREATE TABLE temp(
Id INT NOT NULL,
PRIMARY KEY(id)
);
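A small demonstration of primary-key uniqueness, using Python's sqlite3 module as a stand-in for MySQL (the temp table mirrors the example above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE temp (id INTEGER NOT NULL, PRIMARY KEY (id))")
cur.execute("INSERT INTO temp (id) VALUES (1)")

# A second row with the same id violates the primary key constraint.
try:
    cur.execute("INSERT INTO temp (id) VALUES (1)")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
```

The second insert fails with an integrity error, which is exactly the "unique identifier" guarantee the constraint provides.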
2. NOT NULL
This constraint ensures that there must be a value in the column.
It cannot be missing or null (NULL/NOT NULL).
This constraint specifies that a column cannot contain NULL values,
ensuring data integrity.
Ex : Col1 int NOT NULL
3. DEFAULT
This constraint assigns a default value to a column if no value is provided
during insertion.
Example: Price DECIMAL(10,2) DEFAULT 0.00
10 – total number of digits (precision).
2 – number of decimal places.
If no value is provided, it will be stored as 0.00.
Similarly, a name column can default to 'AA'. The column then never holds NULL,
and the insert won't throw an error.
Set the default value of a column
SALARY INT DEFAULT 25000;
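The SALARY default above can be verified with Python's sqlite3 module (used here in place of MySQL; the staff table name is invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE staff (id INTEGER PRIMARY KEY, salary INTEGER DEFAULT 25000)"
)

# Insert a row without specifying salary; the DEFAULT fills it in.
cur.execute("INSERT INTO staff (id) VALUES (1)")
salary = cur.execute("SELECT salary FROM staff WHERE id = 1").fetchone()[0]
```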
4. UNIQUE
All values in a column are different
Ex : Col2 INT UNIQUE
5. CHECK
It can limit the values allowed in a column
Ex1 :
CREATE TABLE city (
    id INT PRIMARY KEY,
    city VARCHAR(50),
    age INT,
    CONSTRAINT age_check CHECK (age >= 18 AND city = 'Delhi')
);
Ex2 :
CREATE TABLE newtab (
    age INT CHECK (age >= 18)
);
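The CHECK behavior can be seen with Python's sqlite3 module (a stand-in for MySQL; SQLite enforces CHECK constraints the same way):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE newtab (age INTEGER CHECK (age >= 18))")
cur.execute("INSERT INTO newtab (age) VALUES (21)")   # passes the check

# A value that fails the condition is rejected by the database.
try:
    cur.execute("INSERT INTO newtab (age) VALUES (15)")
    under_age_allowed = True
except sqlite3.IntegrityError:
    under_age_allowed = False
```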
6. FOREIGN KEY
Prevents actions that would destroy links between tables.
Ex:
CREATE TABLE temp (
    cust_id INT,
    FOREIGN KEY (cust_id) REFERENCES customer (id)
);
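The same foreign-key rule can be exercised with Python's sqlite3 module (a stand-in for MySQL; note that SQLite only enforces foreign keys after PRAGMA foreign_keys = ON):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this switched on
cur = conn.cursor()
cur.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY)")
cur.execute("""
CREATE TABLE temp (
    cust_id INTEGER,
    FOREIGN KEY (cust_id) REFERENCES customer (id)
)
""")
cur.execute("INSERT INTO customer (id) VALUES (1)")
cur.execute("INSERT INTO temp (cust_id) VALUES (1)")  # parent row exists: OK

# A reference to a non-existent customer breaks the link and is rejected.
try:
    cur.execute("INSERT INTO temp (cust_id) VALUES (99)")
    orphan_allowed = True
except sqlite3.IntegrityError:
    orphan_allowed = False
```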
EXERCISE 1: Create a table to store information about products in an inventory system.
CREATE TABLE Products (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(100) NOT NULL,
    category VARCHAR(50),
    price DECIMAL(10,2) DEFAULT 0.00,
    in_stock BOOLEAN DEFAULT TRUE
);
EXERCISE 2:
CREATE DATABASE University;
USE University;
CREATE TABLE student (
    id INT PRIMARY KEY NOT NULL,
    name VARCHAR(20),
    age INT NOT NULL
);
SHOW DATABASES; -- shows all the databases
SHOW TABLES;    -- shows the tables in the selected database
CHECKING DATA TYPES
We can use the DESCRIBE statement in SQL to view the structure of a table,
including its column names and data types.
Example: Let’s say we have a table called students. We can use the DESCRIBE
statement to check the data types of each column:
DESCRIBE Students;
This is metadata – data that describes other data (e.g., metadata describes a
video on YouTube).
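DESCRIBE is MySQL syntax; other engines expose the same metadata differently. In SQLite, for example, the equivalent is PRAGMA table_info, sketched here with Python's sqlite3 module (column names are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE students (student_id INTEGER PRIMARY KEY, name TEXT, age INTEGER)"
)

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, default_value, pk)
columns = cur.execute("PRAGMA table_info(students)").fetchall()
names = [c[1] for c in columns]
types = [c[2] for c in columns]
```

This is the "metadata" idea in action: the query returns data describing the table rather than the table's contents.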
STRING DATA TYPES
TINYBLOB For BLOBs (Binary Large Objects). Max length: 255 bytes
ENUM(val1, val2, A string object that can have only one value, chosen from a
val3, ...) list of possible values. You can list up to 65535 values in
an ENUM list. If a value is inserted that is not in the list, a
blank value will be inserted. The values are sorted in the
order you enter them
SET(val1, val2, A string object that can have 0 or more values, chosen from
val3, ...) a list of possible values. You can list up to 64 values in a
SET list
DOUBLE
PRECISION(size, d)
Note: All the numeric data types may have an extra option: UNSIGNED or
ZEROFILL. If you add the UNSIGNED option, MySQL disallows negative values
for the column. If you add the ZEROFILL option, MySQL automatically also adds
the UNSIGNED attribute to the column.
datetime2 From January 1, 0001 to December 31, 9999 with 6-8 bytes
an accuracy of 100 nanoseconds
datetimeoffse The same as datetime2 with the addition of a time 8-10 bytes
t zone offset
Ole Object Can store pictures, audio, video, or other BLOBs up to 1GB
(Binary Large Objects)
Lookup Let you type a list of options, which can then be 4 bytes
Wizard chosen from a drop-down list
ADDING DATA TO TABLE
We use the INSERT INTO statement in SQL to add new rows of data to a table.
Each INSERT statement specifies the table name and the values to be inserted
into each column. It’s crucial to ensure that the data we’re inserting matches the
data types and constraints of the table.
INSERT INTO is the clause/syntax/statement used.
Two statements we use are INSERT INTO and VALUES.
When entering data, ensure that the data matches the data types and constraints
of the table.
Example: Suppose we want to add a new student to our Students table. Here’s
how we can do it:
INSERT INTO Students (StudentID, Name, Age, EnrollmentDate)
VALUES (1, 'John Smith', 20, '2023-09-15');
Whenever inserting a string value, it should always be enclosed in single quotes,
and dates should be enclosed in quotes and always in the format YYYY-MM-DD.
If there is no data for a column, you can omit it from the INSERT statement.
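The Students insert above can be run end-to-end with Python's sqlite3 module (a stand-in for MySQL; SQLite stores dates as 'YYYY-MM-DD' text, matching the format rule above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
CREATE TABLE students (
    student_id      INTEGER PRIMARY KEY,
    name            TEXT,
    age             INTEGER,
    enrollment_date TEXT  -- dates kept as 'YYYY-MM-DD' strings
)
""")

# String and date values are enclosed in single quotes, as the notes require.
cur.execute("INSERT INTO students (student_id, name, age, enrollment_date) "
            "VALUES (1, 'John Smith', 20, '2023-09-15')")

row = cur.execute(
    "SELECT name, age FROM students WHERE student_id = 1"
).fetchone()
```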
CREATE THE SAMPLE DATA:
CREATE DATABASE College;
USE College;
CREATE TABLE student (
    rollno INT PRIMARY KEY,
    name VARCHAR(50),
    marks INT NOT NULL,
    grade VARCHAR(1),
    city VARCHAR(20)
);
INSERT INTO DATA:
INSERT INTO student (rollno, name, marks, grade, city)
VALUES
(101, 'Anil', 78, 'C', 'Pune'),
(102, 'Bhumika', 93, 'A', 'Mumbai'),
(103, 'Chetan', 78, 'B', 'Mumbai'),
(104, 'Dhruv', 78, 'A', 'Delhi'),
(105, 'Emanuel', 96, 'F', 'Delhi'),
(106, 'Farah', 78, 'B', 'Delhi');
EXERCISE 1: Create a database called Employee and a table called salary, and the
output should be as below:
ID | NAME | SALARY
1  | AAA  | 2000
2  | AAD  | 3000
(ID is the primary key (PK) of the salary table; the city table below references
it as a foreign key (FK).)
USE Employee;
DROP TABLE salary;  -- drop the old table first, if it exists
CREATE TABLE salary (
    id INT PRIMARY KEY NOT NULL,
    name VARCHAR(20),
    salary INT
);
INSERT INTO salary (id, name, salary)
VALUES
(1, 'AAA', 2000),
(2, 'AAD', 3000);
CREATE TABLE city (
    sno INT PRIMARY KEY NOT NULL,
    id INT,
    city VARCHAR(10),
    FOREIGN KEY (id) REFERENCES salary (id)
);
INSERT INTO city (sno, id, city)
VALUES
(1, 2, 'PUNE'),
(2, 1, 'CHENNAI');
REVERSE ENGINEER
To visualize the connection between the two tables, use reverse engineering:
Database → Reverse Engineer → Next → Next → Select Schema (here, Employee) →
Next → Execute → Next
The diagram represents the primary key of each table.
This is mainly for data engineers, but we can use it to check our connections.
Once the DROP command is executed, we cannot retrieve the data.
If you encounter a connection issue in SQL:
o Services → MySQL84 → Start the service (on the right)
o MySQL84 → Properties → Startup type → Automatic
TO UPLOAD FILES
Schemas → select the database → Tables → right-click the table → Table Data
Import Wizard → Browse → select the file → Open →
Next → Next → Next → Next → Next → Finish → Refresh
Upload all the files into their tables the same way.
When importing user data, in Configure Import Settings:
o join_date: text → datetime
o last_payment_date: text → datetime
o Date format: %d-%m-%y
User_date
SELECT FROM
The SELECT statement has a simple structure. We specify the columns we want
to retrieve after the SELECT keyword and then indicate which table contains the
data we’re interested in using the FROM keyword.
Example: Suppose we have a table called Customers and we want to retrieve the
names and email addresses of all customers. We would write a SELECT
statement like this:
SELECT Name, Email
FROM Customers;
Example: If you want to select some columns but not all, you use the SELECT
statement to specify the columns. For instance, if we are working with a table
named Students and we want only the students’ first names and student IDs, we
would write:
SELECT StudentID, FirstName
FROM Students;
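Both SELECT examples can be tried with Python's sqlite3 module (a stand-in for MySQL; the sample customers are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
)
cur.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                [(1, "Anil", "anil@example.com"),
                 (2, "Farah", "farah@example.com")])

# Name only the columns we want, rather than SELECT *.
result = cur.execute("SELECT name, email FROM customers").fetchall()
```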
In the first line of the SELECT statement, do not use a colon or semicolon.
Avoid entering currency directly in values; instead, use an extra column for
currency, such as USD or INR.
Naming conventions:
o FIRST NAME ------ ✕
o FIRST-NAME ------ ✕
o FIRST_NAME ------ ✓
For boolean values, use TRUE or FALSE in uppercase.
o 'TRUE' ------- ✓ It should be only in uppercase
o 'True' ------- ✕
The CHECK constraint in the CREATE TABLE statement can be applied to any
kind of field, such as integer, character, categorical variable, or quantitative
variables.
It acts as a filter, ensuring that the value must meet the specified condition in the
CHECK constraint.
The CHECK constraint always comes after the data type and column name.
For string/categorical values, use IN.
For integers, use AND with comparison operators like <, =, etc.
When specifying string values in the CHECK constraint, use quotes.
For strings, use inverted commas (single quotes).
The equal sign (=) is used only for integers.
Example:
CREATE TABLE Students (
StudentID INT PRIMARY KEY,
FirstName VARCHAR(50),
Age INT CHECK (Age >= 18),
Gender VARCHAR(10) CHECK (Gender IN ('Male', 'Female', 'Other'))
);
In this example:
o The Age column must have values greater than or equal to 18.
o The Gender column must have values that are either ‘Male’, ‘Female’, or
‘Other’
WHERE
The WHERE clause allows us to narrow down our results by specifying
conditions that the data must meet. The WHERE clause comes after the FROM
clause in our SQL query. We can use various comparison operators such as =, <,
>, and logical operators like AND, OR, to define our conditions.
Example: Suppose we have a table called products and we want to retrieve the
names and prices of products that are currently in stock (i.e., where the
quantity_on_hand is greater than zero)
1. SELECT Name, Price
FROM Products
WHERE QuantityOnHand > 0;
2. SELECT Name, Price
FROM Products
WHERE QuantityOnHand > 0 AND CustomerID = 123;
The WHERE clause is a filtration query that allows us to filter the data we’re
selecting. We can use any of the comparison operators (e.g., =, <, >) or logical
operators (AND, OR, etc.).
Example: With reference to the above example, instead of using AND, we can use
OR if we need any of the data (e.g., QuantityOnHand > 0 OR sales > 0).
For three conditions:
1. If we need to use more than 2-3 separate filters, we should use separate
queries. This simplifies the code for ourselves and for anyone reading it.
2. Avoid using too many filters in a single query, as the code will get complex
and we might not get the right output. If you have too many filters, you can
create two tables with different filters and work on them separately instead
of putting all the filters in one table.
3. The order of the filters (chronology) doesn’t matter
Examples:
WHERE CustomerID = 123
WHERE QuantityOnHand > 0 AND Sales > 0
We can usually combine multiple filters into a single WHERE clause using logical operators.
In SQL, the hierarchy of the statements matters.
Always start with SELECT and FROM.
If we want to see all the columns, we use SELECT *
ORDER BY
The ORDER BY clause is used to sort the result set returned by the SELECT
statement. We specify the column(s) by which we want to sort, along with the
desired sorting order (Ascending or Descending).
Example: Suppose we have a table called employees with columns for employee
names and salaries. We can use the ORDER BY clause to sort the employees by
their salaries in descending order:
SELECT Name, Salary
FROM Employees
ORDER BY Salary DESC;
We can sort a column in ascending or descending order by using the ORDER BY
statement.
By default, it will sort in ascending order.
We need to mention the column where we need sorting.
DESC - Descending order (Z to A)
ASC - Ascending order (A to Z)
ORDER BY will not work well on columns with categorical data like YES/NO. If
applied, it will show all NOs first and then YESs because N comes before Y.
Avoid using it on such columns; use it on columns with numerical or name data
instead.
GROUP BY
When we use the GROUP BY clause, SQL divides the result set into groups based
on the values of one or more columns. We can then apply aggregate functions such
as SUM, AVG, COUNT, MIN, and MAX to each group to calculate summary
statistics.
Example: Suppose we have a table called Orders with columns for Order IDs,
Customer IDs, and order amounts. We want to calculate the total order amount for
each customer. We can use the GROUP BY clause along with the SUM function:
SELECT CustomerID, SUM(OrderAmount) AS TotalOrderAmount
FROM Orders
GROUP BY CustomerID;
In practice, we rarely view just two or three raw columns of a table; we usually
perform some operation on it, such as grouping.
Grouping is useful when the table has several entries for the same entity (for
example, many orders per customer) and we want one summary row per entity.
EXERCISE 5:
CREATE DATABASE JatinDemo; -- (Execute and refresh; the database will appear
under Schemas)
CREATE TABLE sample_table (
id INT PRIMARY KEY,
name VARCHAR(50),
age INT,
gender VARCHAR(10),
city VARCHAR(50),
salary DECIMAL(10,2)
);
INSERT INTO sample_table (id, name, age, gender, city, salary)
VALUES
(1, 'John', 30, 'Male', 'New York', 50000.00),
(2, 'Alice', 25, 'Female', 'Los Angeles', 60000.00),
(3, 'Michael', 35, 'Male', 'Chicago', 70000.00),
(4, 'Emily', 28, 'Female', 'Houston', 65000.00),
(5, 'David', 40, 'Male', 'Phoenix', 80000.00),
(6, 'Sophia', 32, 'Female', 'Philadelphia', 45000.00),
(7, 'James', 33, 'Male', 'San Antonio', 75000.00),
(8, 'Emma', 29, 'Female', 'San Diego', 65000.00),
(9, 'Daniel', 32, 'Male', 'Dallas', 72000.00),
(10, 'Olivia', 27, 'Female', 'San Francisco', 68000.00);
1. To show the whole table
SELECT *
FROM sample_table;
2. Find the average salary of employees grouped by gender, but only for those
whose age is above 25
SELECT gender, AVG(salary) AS avg_salary
FROM sample_table
WHERE age > 25
GROUP BY gender;
3. List the cities where the total salary expenditure is greater than $65,000,
ordered by the total salary expenditure in descending order
SELECT city, SUM(salary) AS total_salary
FROM sample_table
GROUP BY city
HAVING SUM(salary) > 65000
ORDER BY total_salary DESC;
O/P :
Sl no city total_salary
1 Phoenix 80000.00
2 San Antonio 75000.00
3 Dallas 72000.00
4 Chicago 70000.00
5 San Francisco 68000.00
4. Find the average age of employees whose salaries are above $60,000, grouped
by gender, and only show the results for genders where the average age is below 35
SELECT gender, AVG(age) AS average_age
FROM sample_table
WHERE salary > 60000
GROUP BY gender
HAVING AVG(age) < 35;
O/P :
SL no Gender Average_age
1 Female 28
5. Find the max salary among employees aged 30 and above in each city
SELECT City, Max(Salary) AS Max_Salary
FROM sample_table
WHERE age >= 30
GROUP BY city;
O/P :
SL no City Max_Salary
1 Chicago 70000.00
2 Dallas 72000.00
3 New York 50000.00
4 Phoenix 80000.00
5 San Antonio 75000.00
JOIN
Joins are used to combine rows from two or more tables based on a related
column between them. This allows us to create a single result set that contains
data from multiple tables.
There are several types of joins in SQL, including INNER JOIN, LEFT JOIN,
RIGHT JOIN, and FULL OUTER JOIN. Each type of join determines how rows
from the joined tables are combined.
1. Primary key: A unique identifier for each record in a table.
2. Foreign key: A field in one table that refers to the primary key of
another table. Its values must exist in the referenced table; that reference is
what links the two tables together.
Q: What is the difference between a foreign key and a primary key?
For joining tables, the main criterion is a common column between the two
tables. This column can hold categorical or integer values, but on the
primary-key side the values must be unique; they cannot have duplicate entries.
Criteria:
Uniqueness
Common column
1. INNER JOIN
An INNER JOIN between two tables returns only the records where the joining
field, such as a key, finds a match in both tables.
EXERCISE :
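As a sketch for this exercise (the Employees and Departments tables are illustrative assumptions, not defined earlier in these notes):

```sql
-- Returns only employees whose DepartmentID has a match in Departments
SELECT e.Name, d.DepartmentName
FROM Employees e
INNER JOIN Departments d
    ON e.DepartmentID = d.DepartmentID;
```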
2. LEFT JOIN
A LEFT JOIN keeps all of the original records in the left table and returns
NULL for any columns from the right table where the joining field did not find
a match.
Display all the entries present in the left table. If there are no matching values in
the right table, it will display as NULL.
EXERCISE :
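A sketch for this exercise, using the same assumed Employees and Departments tables as above:

```sql
-- Keeps every employee; DepartmentName is NULL where no department matches
SELECT e.Name, d.DepartmentName
FROM Employees e
LEFT JOIN Departments d
    ON e.DepartmentID = d.DepartmentID;
```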
3. RIGHT JOIN
A RIGHT JOIN keeps all of the original records in the right table and
returns NULL for any columns from the left table where the joining field
did not find a match.
EXERCISE :
4. FULL OUTER JOIN
A FULL JOIN combines a LEFT JOIN and a RIGHT JOIN. A FULL JOIN
returns all records from both tables, irrespective of whether there is a match on
the joining field in the other table, returning NULL values where there is no match.
EXERCISE :
EXERCISE :
LEFT TABLE – Table A – 5 Rows and 2 Columns
RIGHT TABLE – Table B – 10 Rows and 3 Columns
5. SELF JOIN
SELF JOINS are used to compare values in a table to other values in the
same table by joining different parts of the table together.
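A classic self-join sketch, assuming an Employees table in which ManagerID points at another employee's EmployeeID (both names are illustrative):

```sql
-- The table appears twice under two aliases: e (employee) and m (manager)
SELECT e.Name AS Employee, m.Name AS Manager
FROM Employees e
LEFT JOIN Employees m
    ON e.ManagerID = m.EmployeeID;
```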
EXERCISE :
1. Retrieve all employees along with the names of their corresponding
departments
2. Retrieve all employees and departments, matching them where possible
(We can use an INNER JOIN too for the first query)
SELECT Employees.Name, Departments.Name
FROM Employees
FULL OUTER JOIN Departments
ON Employees.DepartmentID = Departments.DepartmentID;
6. CROSS JOIN
In a CROSS JOIN, every row from the first table is combined with every row
from the second table, resulting in a cartesian product. This means that if Table A
has m rows and Table B has n rows, the CROSS JOIN will produce m*n rows in
the output.
EXERCISE : PRODUCT CATALOG GENERATION
In an e-commerce system, product catalogs often contain various attributes
such as sizes, colors, and styles. A CROSS JOIN can be used to generate all
possible combinations of products and attributes, resulting in a comprehensive
product catalog. Suppose you have tables for products and attributes. A CROSS
JOIN between these tables would produce all possible combinations of products
and attributes, allowing customers to view different product variations.
It does not need matching column IDs.
It matches each row of one table with every row of the other, so the result is
a Cartesian product of m × n rows.
EXERCISE :
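The product-catalog idea above can be sketched like this (the Products and Attributes tables and their columns are illustrative):

```sql
-- Every product paired with every attribute row: m * n combinations
SELECT p.ProductName, a.Size, a.Color
FROM Products p
CROSS JOIN Attributes a;
```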
TYPES OF SUBQUERIES
1. SINGLE ROW SUBQUERIES: These subqueries return only one row of results.
2. MULTI-ROW SUBQUERIES: These subqueries return multiple rows of results.
3. SINGLE-COLUMN SUBQUERIES: These subqueries return only one column
of results.
4. CORRELATED SUBQUERIES: These subqueries are evaluated once for each
row processed by the outer query.
The basic syntax of a SUBQUERY involves enclosing the inner query within
parentheses and using it as an operand in the outer query.
Here’s an example of a SUBQUERY in a SELECT statement:
EXERCISE :
CREATE TABLE Customers (
CustomerID INT PRIMARY KEY,
Name VARCHAR(50)
);
CREATE TABLE Orders (
OrderID INT PRIMARY KEY,
CustomerID INT,
OrderAmount DECIMAL(10,2),
FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)
);
--- inserting values into the Customers table
INSERT INTO Customers (CustomerID, Name) VALUES
(101, 'ALICE'),
(102, 'BOB'),
(103, 'CHARLIE');
--- inserting values into the Orders table
INSERT INTO Orders (OrderID, CustomerID, OrderAmount) VALUES
(1, 101, 500),
(2, 102, 700),
(3, 103, 600),
(4, 101, 800);
Where we see a subquery, try replacing it with a LEFT JOIN to reduce the use
of subqueries and improve performance.
In c.Name, c is an alias for the table from which the column Name is being
selected
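A possible version of the missing example, using the Customers and Orders tables created above, where c is the alias for Customers:

```sql
-- Correlated subquery: the SUM is computed once per customer row
SELECT c.Name,
       (SELECT SUM(o.OrderAmount)
        FROM Orders o
        WHERE o.CustomerID = c.CustomerID) AS TotalOrderAmount
FROM Customers c;
```

The same result can be written as a LEFT JOIN with GROUP BY, which is often easier for the database engine to optimize.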
TYPES OF FUNCTIONS
1. AGGREGATE FUNCTIONS
Functions like SUM(), AVG(), COUNT(), MIN() and MAX() are
fundamental for summarizing and aggregating data across groups.
2. WINDOW FUNCTIONS
These functions allow advanced analysis by performing calculations
over a set of rows related to the current row, without the need for explicit
subqueries; examples include LAG, LEAD, and RANK with PARTITION BY.
Why do we use window functions?
They spare us from writing a chain of subqueries: one for a calculation,
another for retrieving data, and another for comparing and ranking values.
Window functions reduce this work and the time spent writing and joining
separate subqueries.
3. STRING FUNCTIONS
Functions like CONCAT(), SUBSTRING(), UPPER(), LOWER() and
REPLACE() facilitate text manipulation and formatting
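A quick sketch of these string functions; the Employees columns (FirstName, LastName, Email) are illustrative assumptions:

```sql
SELECT CONCAT(FirstName, ' ', LastName)        AS FullName,
       UPPER(FirstName)                        AS FirstUpper,
       LOWER(LastName)                         AS LastLower,
       SUBSTRING(FirstName, 1, 3)              AS FirstThree,
       REPLACE(Email, '@old.com', '@new.com')  AS UpdatedEmail
FROM Employees;
```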
5. MATHEMATICAL FUNCTIONS
Functions like ABS(), ROUND(), CEILING(), FLOOR(), and POWER()
support mathematical operations on numeric data.
1. ABS(): Returns the absolute value of a number. For example, ABS(-5)
returns 5.
2. ROUND(): Rounds a number to a specified number of decimal places.
For example, ROUND(2.718, 2) returns 2.72.
3. CEILING(): Rounds a number up to the nearest integer. For example,
CEILING(2.82) returns 3.
4. FLOOR(): Rounds a number down to the nearest integer. For example,
FLOOR(2.82) returns 2.
5. POWER(): Raises a number to the power of another number. For
example, POWER(2, 3) returns 8.
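The five examples above can be checked with a single SELECT (most dialects allow a SELECT without a FROM clause; Oracle needs FROM dual):

```sql
SELECT ABS(-5)         AS abs_value,    -- 5
       ROUND(2.718, 2) AS rounded,      -- 2.72
       CEILING(2.82)   AS ceil_value,   -- 3
       FLOOR(2.82)     AS floor_value,  -- 2
       POWER(2, 3)     AS power_value;  -- 8
```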
6. CONDITIONAL FUNCTIONS
Functions like CASE() provide conditional logic to customize query output
based on specified conditions.
CONDITIONAL FUNCTIONS
CASE STATEMENTS:
CASE statements evaluate a set of conditions and return a result based on the
first condition that is true. They’re similar to IF-THEN-ELSE statements in
other programming languages, but tailored for SQL queries.
CASE statements can be used in various scenarios such as categorizing data,
creating calculated fields, or simplifying complex logic.
Here’s the syntax for a CASE statement:
CASE
WHEN condition1 THEN result1
WHEN condition2 THEN result2
----- (additional conditions can be added here)
ELSE default_result
END
Example: a CASE works like a WHEN-THEN-ELSE (if/else) construct:
CASE
WHEN a > 1 THEN TRUE
ELSE FALSE
END
Example: Suppose we want to categorize orders based on their amount:
SELECT order_id, order_amount,
CASE
WHEN order_amount > 1000 THEN 'High'
WHEN order_amount BETWEEN 500 AND 1000 THEN 'Medium'
ELSE 'Low'
END AS order_category
FROM orders;
In this example:
Orders with an amount greater than 1000 are categorized as ‘High’.
Orders with an amount between 500 and 1000 are categorized as ‘Medium’.
All other orders are categorized as ‘Low’.
Example: Suppose we have a table of student scores and we want to categorize their
performance as "Excellent", "Good", "Average" or "Below Average" based on their
scores
StudentID Score
1 85
2 70
3 95
4 60
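A possible solution sketch; the score thresholds (90, 75, 60) are assumptions, since the notes do not define them:

```sql
SELECT StudentID, Score,
    CASE
        WHEN Score >= 90 THEN 'Excellent'
        WHEN Score >= 75 THEN 'Good'
        WHEN Score >= 60 THEN 'Average'
        ELSE 'Below Average'
    END AS Performance
FROM Students;
```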
2. We have a table of employees with their salaries and we want to categorize them
into different salary ranges
SELECT EmployeeID, Name, Salary,
CASE
WHEN Salary >= 80000 THEN 'HIGH'
WHEN Salary >= 60000 THEN 'MEDIUM'
ELSE 'LOW'
END AS SalaryCategory
FROM Employees;
EmployeeID Name Salary SalaryCategory
1 ALICE 60000.00 MEDIUM
2 BOB 75000.00 MEDIUM
3 CHARLIE 45000.00 LOW
4 DAVID 80000.00 HIGH
3. We have a table of products with their prices and discounts and we want to
calculate the final price (after discount, if any) for each product
SELECT * FROM Products;
SELECT Name, Price, Discount,
CASE
WHEN Discount > 0 THEN Price - (Price * Discount)
ELSE Price
END AS FinalPrice
FROM Products;
Name Price Discount FinalPrice
Laptop 1200.00 0.10 1080.0000
Phone 800.00 0.05 760.0000
PC 1000.00 0.00 1000.0000
Tablet 500.00 0.15 425.0000
PRACTISE 1 :
Suppose we have a database for a bookstore and we want to generate a report that
categorizes books based on their popularity. We define the popularity of a book based
on the number of copies sold. We categorize a book as a ‘Bestseller’ if it has sold
more than 100 copies, ‘Popular’ if it has sold between 50 and 100 copies (inclusive),
and ‘Standard’ otherwise.
SALES TABLE:
SaleID ( PRIMARYKEY) : Unique identifier for each sale
BookID (FOREIGN KEY ) : References the BookID in the Books table
Quantity : Number of copies sold in each sale
Solution:
SELECT Title, Author,
(SELECT SUM(Quantity) FROM Sales WHERE BookID =
Books.BookID) AS TotalSales,
CASE
WHEN (SELECT SUM(Quantity) FROM Sales WHERE
BookID = Books.BookID) > 100 THEN 'Bestseller'
WHEN (SELECT SUM(Quantity) FROM Sales WHERE
BookID = Books.BookID) BETWEEN 50 AND 100
THEN 'Popular'
ELSE 'Standard'
END AS Popularity
FROM Books;
Note: Books.BookID can be referenced inside the Sales subquery because BookID
in Sales is a foreign key to the Books table. If it were not a common key, we
could not correlate the two tables in that context.
PRACTISE 2 :
Suppose we have a database for an online marketplace, and we want to generate a
report that categorizes sellers based on their sales performance. We define the
performance of a seller based on the total revenue generated from their sales. We
categorize a seller as a 'Top Performer' if their total revenue exceeds $10,000, a
'Good Performer' if it is between $5,000 and $10,000 (inclusive), and a 'Standard
Performer' otherwise.
Solution:
CREATE TABLE Sellers
(
Seller_id INT NOT NULL PRIMARY KEY,
Seller_name VARCHAR ( 50 )
);
INSERT INTO Sellers (Seller_id, Seller_Name) VALUES
(1, 'SARA'),
(2, 'MAYA'),
(3, 'JOHN'),
(4, 'NARA');
CREATE TABLE Orders_s
(
Order_id INT PRIMARY KEY,
Seller_id INT,
Unit_Price INT,
Quantity INT,
FOREIGN KEY (Seller_id) REFERENCES Sellers(Seller_id)
);
INSERT INTO Orders_s (Order_id, Seller_id, Unit_Price, Quantity) VALUES
(1, 1, 1000, 5),
(2, 1, 1000, 10),
(3, 2, 7000, 2),
(4, 4, 10000, 1);
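The categorization query itself is missing above; a possible completion, taking each order's revenue as Unit_Price * Quantity:

```sql
SELECT s.Seller_name,
       SUM(o.Unit_Price * o.Quantity) AS Total_Revenue,
       CASE
           WHEN SUM(o.Unit_Price * o.Quantity) > 10000 THEN 'Top Performer'
           WHEN SUM(o.Unit_Price * o.Quantity) BETWEEN 5000 AND 10000 THEN 'Good Performer'
           ELSE 'Standard Performer'
       END AS Performance
FROM Sellers s
LEFT JOIN Orders_s o ON s.Seller_id = o.Seller_id
GROUP BY s.Seller_name;
```

Sellers with no orders (JOHN here) get a NULL total, which the CASE sends to 'Standard Performer'.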
WINDOW FUNCTIONS
Window functions allow us to perform calculations across a set of rows that are
related to the current row, without collapsing the result set. They are incredibly
versatile and can be used for tasks such as ranking, aggregation, moving
averages, and more.
Unlike aggregate functions that collapse multiple rows into a single row, window
functions maintain the granularity of the result set, providing valuable context for
each row.
When we use window functions, we apply aggregation, moving averages, and
ranking functions to each row. These functions perform aggregation at the row
level, not at a group or higher level.
1. RANKING FUNCTIONS:
RANK(), DENSE_RANK(), ROW_NUMBER(), and NTILE() assign a rank or
row number to each row based on specified criteria.
RANK() – Assigns a rank to each row within the partition of a result set.
Example: In a sales table, a column RANK gives ranks according to sales
values.
DENSE_RANK() – Similar to RANK(), but it does not skip ranks after
ties.
Example: If two people are tied for second place, DENSE_RANK() assigns
ranks as 1, 2, 2, and then 3, whereas RANK() would assign 1, 2, 2, and then 4.
ROW_NUMBER() – Assigns a unique number to each row, starting at 1 for
the first row in each partition.
NTILE() – Divides the result set into a specified number of approximately
equal groups, or “tiles.”
Example: If you specify NTILE(5), it will divide the rows into five groups,
each representing a percentile range (0-20%, 20-40%, 40-60%, 60-80%, 80-
100%).
Additional Explanation for NTILE():
1. Percentile Example: If you score in the 98th percentile, it means you scored
better than 98% of the students who took the exam. If there are 200
students, you scored better than 196 of them. This is not your actual score
but your relative position among the group.
2. Sales Example: Using NTILE(5) on sales data will categorize each row into
one of five percentiles. For instance, the first group will include the bottom
20% of sales, the second group the next 20%, and so on up to the top 20%.
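LAG() and LEAD() fetch values from neighbouring rows. A sketch, assuming a sales table with sale_date and amount columns:

```sql
SELECT sale_date, amount,
       LAG(amount, 1)  OVER (ORDER BY sale_date) AS previous_amount,
       LEAD(amount, 1) OVER (ORDER BY sale_date) AS next_amount
FROM sales;
```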
In this example:
LAG(amount, 1) OVER (ORDER BY sale_date): Retrieves the sales
amount from the previous row.
LEAD(amount, 1) OVER (ORDER BY sale_date): Retrieves the sales
amount from the next row.
These functions are particularly useful for trend analysis, such as
comparing current sales to previous sales, or forecasting future sales
based on past data
1. RANKING FUNCTIONS:
SELECT ProductID, ProductName, UnitPrice,
RANK() OVER (ORDER BY UnitPrice DESC) AS PriceRank
FROM ExampleData;
2. AGGREGATE FUNCTIONS WITH WINDOW FRAME(Cumulative
total):
SELECT OrderID, OrderDate, TotalAmount,
SUM(TotalAmount) OVER(PARTITION BY CustomerID ORDER
BY OrderDate) AS CumulativeTotal
FROM ExampleData;
Explanation
Moving Average – It calculates the average for a specified period of time.
Example: Stock market analysis.
2 Preceding and 2 Following – For example, if the current date is the 23rd:
The dates 21 and 22 are the 2 preceding days.
The dates 24 and 25 are the 2 following days.
Detailed Breakdown
Moving Average: This function helps smooth out short-term fluctuations and
highlight longer-term trends. It’s particularly useful in time series data, such
as stock prices.
2 Preceding and 2 Following: In the context of the moving average
calculation:
If the current row is for the 23rd, the window frame includes the 21st, 22nd,
23rd, 24th, and 25th.
This means the moving average for the 23rd will be calculated using the
revenue values from these five dates.
This approach helps in analyzing trends over a brief period, providing a
clearer picture of the data’s behavior over time.
Example:
SELECT CustomerID, OrderDate, TotalAmount,
LAG(TotalAmount) OVER (PARTITION BY CustomerID ORDER BY
OrderDate) AS PreviousOrderAmount,
LEAD(TotalAmount) OVER (PARTITION BY CustomerID ORDER BY
OrderDate) AS NextOrderAmount
FROM ExampleData;
Note :
1. LAG – Value of the previous entry
2. LEAD – Value of the next entry
MONTH Date Amount LAG LEAD
A 1 50 NULL 100
A 2 100 50 200
A 3 200 100 NULL
3. Example – If we have month and sales data, we can compute the growth rate for
each month using the LAG function
MONTH SALES PreviousMonthSales GrowthRate
JAN S1 - -
FEB S2 S1 (S2-S1)/S1*100
MARCH S3 S2 (S3-S2)/S2*100
APRIL S4 S3 (S4-S3)/S3*100
EXERCISE :
USE IIMSKILLS;
CREATE TABLE ExampleData (
ProductID INT,
ProductName VARCHAR(100),
UnitPrice DECIMAL(10,2),
OrderID INT,
OrderDate DATE,
TotalAmount DECIMAL(10,2),
CustomerID INT,
Date DATE,
Revenue DECIMAL(10,2)
);
INSERT INTO ExampleData (ProductID, ProductName, UnitPrice, OrderID,
OrderDate, TotalAmount, CustomerID, Date, Revenue)
VALUES
(1, 'ProductA', 10.00, 1, '2024-01-01', 100.00, 1, '2024-01-01', 200.00),
(2, 'ProductB', 15.00, 2, '2024-01-02', 150.00, 1, '2024-01-02', 250.00),
(3, 'ProductC', 20.00, 3, '2024-01-03', 200.00, 2, '2024-01-03', 300.00),
(4, 'ProductD', 25.00, 4, '2024-01-04', 250.00, 2, '2024-01-04', 350.00),
(5, 'ProductE', 30.00, 5, '2024-01-05', 300.00, 3, '2024-01-05', 400.00);
2. RANK :
SELECT ProductID, ProductName, UnitPrice,
RANK() OVER (ORDER BY UnitPrice DESC) AS PriceRank
FROM ExampleData;
3. LAG/LEAD :
SELECT OrderID, CustomerID, OrderDate, TotalAmount,
LAG(TotalAmount) OVER (PARTITION BY CustomerID ORDER BY
OrderDate) AS PreviousOrderAmount,
LEAD(TotalAmount) OVER (PARTITION BY CustomerID ORDER
BY OrderDate) AS NextOrderAmount
FROM ExampleData;
OrderID CustomerID OrderDate TotalAmount PreviousOrder NextOrder
1 1 2024-01-01 100.00 NULL 150.00
2 1 2024-01-02 150.00 100.00 NULL
3 2 2024-01-03 200.00 NULL 250.00
4 2 2024-01-04 250.00 200.00 NULL
5 3 2024-01-05 300.00 NULL NULL
4. CUMULATIVE SUM :
SELECT OrderID, OrderDate, TotalAmount,
SUM(TotalAmount) OVER (PARTITION BY CustomerID ORDER
BY OrderDate) AS CumulativeTotal
FROM ExampleData;
OrderID OrderDate TotalAmount CumulativeTotal
1 2024-01-01 100.00 100.00
2 2024-01-02 150.00 250.00
3 2024-01-03 200.00 200.00
4 2024-01-04 250.00 450.00
5 2024-01-05 300.00 300.00
5. MOVING AVERAGE :
SELECT Date, Revenue,
AVG(Revenue) OVER (ORDER BY Date ROWS BETWEEN 3
PRECEDING AND 3 FOLLOWING) AS MovingAverageRevenue
FROM ExampleData;
OrderDate Revenue MovingAverageRevenue
2024-01-01 200.00 275.000000
2024-01-02 250.00 300.000000
2024-01-03 300.00 300.000000
2024-01-04 350.00 300.000000
2024-01-05 400.00 325.000000
6. GROWTH RATE :
Solution 1:
SELECT Date, Revenue,
LAG(Revenue) OVER (ORDER BY Date) AS PrevRevenue,
(Revenue - LAG(Revenue, 1) OVER (ORDER BY Date)) / LAG(Revenue, 1)
OVER (ORDER BY Date) AS Revenue_GrowthRate
FROM ExampleData;
OrderDate Revenue PrevRevenue Revenue_GrowthRate
2024-01-01 200.00 NULL NULL
2024-01-02 250.00 200.00 0.2500000000000
2024-01-03 300.00 250.00 0.2000000000000
2024-01-04 350.00 300.00 0.1666666666666
2024-01-05 400.00 350.00 0.1428571428571
Solution 2 (growth rate as a percentage, rounded to two decimals):
SELECT Date, Revenue,
LAG(Revenue) OVER (ORDER BY Date) AS PrevRevenue,
ROUND(CAST((Revenue - LAG(Revenue, 1) OVER (ORDER BY Date)) /
LAG(Revenue, 1) OVER (ORDER BY Date) * 100 AS DECIMAL(10,2)), 2)
AS Revenue_GrowthRate
FROM ExampleData;
OrderDate Revenue PrevRevenue Revenue_GrowthRate
2024-01-01 200.00 NULL NULL
2024-01-02 250.00 200.00 25.00
2024-01-03 300.00 250.00 20.00
2024-01-04 350.00 300.00 16.67
2024-01-05 400.00 350.00 14.29
3. Detecting Seasonality
SELECT EXTRACT(MONTH FROM OrderDate) AS Month,
AVG(TotalAmount) AS AvgTotalAmount
FROM Orders
GROUP BY EXTRACT(MONTH FROM OrderDate)
ORDER BY Month;
4. Identifying Trends:
SELECT DATE_TRUNC('month', OrderDate) AS Month,
SUM(TotalAmount) AS TotalSales
FROM Orders
GROUP BY DATE_TRUNC('month', OrderDate)
ORDER BY Month;
Privacy principles :
1. Sensitive Data: Identify and protect sensitive data such as personally identifiable
information (PII), health records, financial information, and biometric data.
2. Data Encryption: Encrypt data at rest and in transit to prevent unauthorized
access and protect confidentiality.
3. Access Control: Implement access controls and authentication mechanisms to
restrict access to sensitive data based on user roles and permissions.
4. Data Retention: Define data retention policies to determine how long data
should be stored and when it should be securely deleted or anonymized.
5. Data Sharing: Exercise caution when sharing data with third parties, ensuring
compliance with privacy regulations and contractual agreements.
CASE STUDY
Orders;
Order_ID Customer_ID Order_Date Total_Amount
1 101 2024-03-15 150.00
2 102 2024-03-18 75.50
3 103 2024-03-20 200.00
4 101 2024-03-25 120.00
5 104 2024-04-01 180.25
6 105 2024-04-05 90.80
7 102 2024-04-10 300.00
8 106 2024-04-12 50.00
Order_items;
Order_ID Product_ID Quantity Total_Amount
1 101 2 25.00
1 102 1 50.00
2 103 3 15.50
3 101 1 30.00
4 104 2 60.00
5 102 1 90.25
5 103 2 45.00
6 104 1 70.80
7 105 4 75.00
8 106 1 50.00
Products;
Product_ID Product_Name Category
101 Shirt Apparel
102 Shoes Apparel
103 Hat Apparel
104 Watch Accessories
105 Bag Accessories
106 Socks Apparel
3. Find the total number of orders and the total amount spent by each customer
SELECT customer_id,
COUNT(DISTINCT order_id) AS total_orders,
SUM(total_amount) AS total_spent
FROM orders
GROUP BY customer_id;
Sl no Customer_id Total_orders Total_spent
1 101 2 270.00
2 102 2 375.50
3 103 1 200.00
4 104 1 180.25
5 105 1 90.80
6 106 1 50.00
4. What are the top 3 customers in terms of total spending?
Sl no Customer_id Total_spent
1 102 375.50
2 101 270.00
3 103 200.00
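The query for question 4 is not shown above; a sketch using LIMIT (MySQL/PostgreSQL syntax; SQL Server would use SELECT TOP 3 instead):

```sql
SELECT customer_id,
       SUM(total_amount) AS total_spent
FROM orders
GROUP BY customer_id
ORDER BY total_spent DESC
LIMIT 3;
```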
5. What is the average order value (AOV) for each product category ?
SELECT p.category,
CAST(AVG(o.total_amount) AS DECIMAL(10,2)) AS average_order_value
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
GROUP BY p.category;
Sl no Category Average_order_value
1 Accessories 170.27
2 Apparel 140.86
6. For each customer, find the average total amount of their orders and compare it
with the overall average total amount of orders (use a window function)
SELECT DISTINCT customer_id,
ROUND(AVG(total_amount) OVER (PARTITION BY
customer_id), 2) AS Avg_order1,
ROUND(AVG(total_amount) OVER (), 2) AS Overall_Avg1,
CASE
WHEN AVG(total_amount) OVER (PARTITION BY
customer_id) > AVG(total_amount) OVER () THEN 'Above
Avg'
ELSE 'Below Avg'
END AS CompareAvg
FROM orders;
We can save the output as – right click – select all – save with headers
SQL (Structured Query Language) is a powerful tool used in data analysis for querying and
manipulating data stored in relational databases. It enables data analysts to:
Access and extract data: SQL allows analysts to retrieve data from different tables within a
database, making it accessible for analysis.
Data manipulation: Analysts can use SQL to clean, transform, and prepare data for analysis, such
as filtering, sorting, and aggregating data.
Data aggregation: SQL facilitates summarization, helping calculate averages, sums, and other
statistical measures across large data sets.
Complex queries: Through SQL, analysts can perform complex queries to analyze relationships
between different data sets, identify patterns, and extract insights.
By utilizing SQL, analysts can efficiently handle large volumes of structured data, conduct
thorough analyses, and generate actionable insights for decision-making.
The choice between SQL and Python for data analysis depends on the specific needs and scope
of the project. Each has its strengths:
SQL is specifically designed for managing and querying relational databases. It is highly efficient
for data extraction, manipulation, and querying operations within databases. SQL shines when the
analysis relies heavily on structured data stored in SQL databases.
Python is a general-purpose programming language with a rich ecosystem of libraries (such as
pandas, NumPy, and Matplotlib) for data analysis, statistical modeling, machine learning, and
visualization. It is versatile and well-suited for tasks that require complex data manipulations,
integrations with web services, or the analysis of unstructured data.
SQL might be more straightforward and efficient for data analysis, primarily involving structured
data within relational databases. However, Python would be preferable for analyses requiring
complex data manipulation, integration with other data sources, or the application of advanced
statistical or machine learning models.
The “best” SQL for a data analyst largely depends on the specific requirements of the
organization and the nature of the projects. However, some of the most widely used SQL dialects
and database management systems include:
PostgreSQL: Known for its open-source nature, robustness, and compliance with SQL standards.
It offers advanced features and is well-suited for complex data analysis tasks.
MySQL: Another popular open-source option, known for its ease of use and efficiency in handling
a wide range of data analysis tasks.
Microsoft SQL Server: Preferred in enterprise environments, offering extensive tools and features
for data analysis, along with strong integration with other Microsoft products.
Oracle Database: Widely used in large enterprises for its scalability, reliability, and comprehensive
data management and analysis features.
Data analysts should choose a system that meets their project’s needs, considering data
complexity, scalability requirements, and existing infrastructure.
While extremely versatile and powerful for data analysis, Python manipulates data using libraries
like pandas. These operations occur in memory and can become slower as the volume of data
increases, especially compared to SQL operations optimized by the database engine itself.
However, overall performance also depends on the complexity of the task at hand, the size of the
data, and how well the SQL queries or Python scripts are optimized. SQL is generally faster for
data retrieval and manipulation tasks. Python’s ecosystem provides a broader range of
capabilities that may not be directly comparable to SQL’s performance for more complex data
analysis involving statistical models or machine learning.