0% found this document useful (0 votes)
33 views75 pages

SQL Notes

Uploaded by

Meghana C
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
33 views75 pages

SQL Notes

Uploaded by

Meghana C
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

Contents

DATA ANALYTICS...................................................................................................................................................1
TYPES OF DATA:......................................................................................................................................................2
STRUCTURED DATA:..........................................................................................................................................3
SEMI – STRUCTURED DATA.............................................................................................................................4
UNSTRUCTURE DATA........................................................................................................................................5
DATASET................................................................................................................................................................6
TYPES OF DATASET............................................................................................................................................7
DATABASE.............................................................................................................................................................8
RELATIONAL DATABASE...............................................................................................................................10
SQL:....................................................................................................................................................................... 14
WHAT IS SQL?....................................................................................................................................................17
WHAT IS SQL FOR DATA ANALYSIS?..........................................................................................................17
WHY IS SQL A MUST-HAVE SKILL FOR EVERY DATA ANALYST.......................................................17
USE CASES OF SQL FOR DATA ANALYSIS.................................................................................................17
BEST TOOLS IN SQL FOR DATA ANALYTICS............................................................................................17
COMMON SQL QUERIES FOR DATA ANALYSIS.......................................................................................18
SQL TECHNIQUES TO PERFORM DATA ANALYSIS.................................................................................18
BENEFITS OF SQL FOR DATA ANALYSIS...................................................................................................18
LIMITATIONS OF SQL FOR DATA ANALYSIS............................................................................................18
SQL FOR DATA ANALYSIS: SQL QUERIES.....................................................................................................20
FAQs......................................................................................................................................................................62
SQL
DATA ANALYTICS
 The process of examining raw data to draw insights and make informed decisions.
 Raw data: unprocessed data
 Data analytics is basically the process of examining raw data and you have to take
out insights.
 Data is the most important part in every industry.
 Data structures and type of data is different
 Data-Float, Integer, categorically variable, quantitative variable, array list, dictionary
– it defines the structures of the data like the data is arithmetic data / Quantitative
data / integer/ float/ string- It is defining the structure of data.
TYPES OF DATA:
1. Structured data
2. Semi Structured data
3. Unstructured data
STRUCTURED DATA:
 Data is usually in a tabular format.
 It is stored in Excel files, CSV files, or SQL databases.
 Anyone who reads the data can decipher it; there are specific rows and columns,
and each column is defined with a heading. This type of data is used in Relational
Database Management Systems (RDBMS).
 SQL stands for Structured Query Language.
 It is essentially the query language for structured data.
 SQL is specifically a language that requires data to be structured. In Python or R,
different libraries can convert various types of datasets into a structured format,
such as a data frame, list, or array. However, SQL does not have that flexibility; it
requires a structured dataset.
 SQL can only be used when the data has a structure.
 Any type of format that is in Excel or a structured format is known as a structured
database.
 It has a defined format.
 It has a defined structure.
 Fixed schemas (e.g., rows and columns).
 Rigid (we can’t change it easily).
 Scalability – When we have to store a huge amount of data horizontally, the size
and utilization increase.
 Conforms to a tabular format:
o Well-defined relationships between different rows and columns.
o Excel files are not SQL databases, but they can be used to store structured
data.
 Highly organized and easily decipherable.
 Use of RDBMS. EX: SL NO DATA NAME AGE
 Usability of SQL. 1
2
SEMI – STRUCTURED DATA
 This type of data includes websites, Wikipedia pages, resumes, reports, etc.
 Semi-structured data is a kind of structured data stored with the help of tags and
semantic markers.
 Metadata is identified and gives structure to the data.
 Uses of metadata: Tags, semantic markers, etc., to identify data characteristics.
This makes the data better cataloged, searchable, and analyzable.
 More complex than structured data.
 Easier to store and comprehend than unstructured data.
 Self-describing data formats include JSON, XML, and HTML.
 JSON is a format of key-value pairs (like a dictionary). It does not define data
entirely as tabular but uses keys (tags) to describe entries and their attributes, such
as names and ages.
 Semi-structured data has some structure and can be easily converted to a
structured format, but it is not exactly tabular or relational. These types of
databases are known as semi-structured.
 There is some structure available, making it easy to identify, but it is not in a
structured format like Excel or CSV.
 We moved from structured to semi-structured data because it makes processing
faster, whereas processing structured data is a bit slower.
 High latency.
 Easy to catalog.
 Easier to analyze.
 Easy to store and comprehend.
 Example: Amazon.
 For example, Amazon has a large amount of data, making it difficult to store in a
structured format. Therefore, data is stored in a semi-structured format using tags
like HTML or XML. Tags are used to store related data under the same tag.
 No fixed format structure; employs tagging and demarcation schemas (schemas
vary).
 Allows separation among certain elements, enabling search and navigation.
 Undefined format (can be any form of data).
UNSTRUCTURE DATA
 This type of data cannot be easily defined or structured and is stored as it is.
 To address this type of data, a different type of tool was developed known as
NoSQL.
 NoSQL was named to indicate that it does not use SQL (Structured Query
Language).
 NoSQL is a tool used for analyzing unstructured databases, also known as non-
relational databases.
 Non-relational databases handle unstructured data such as social media posts,
YouTube videos, images, PDFs, etc.
 Example: An AI car collects vast amounts of data every second, such as
acceleration and engine revolutions. This is known as IoT-based data.
 IoT stands for Internet of Things.
 Example: Rear parking sensors, RFID (used in tolls and shopping complexes). In
shopping complexes, if a tag is not removed, it beeps when carried out.
 We get the data as it is and process it as it is.
 Specialized tools like MongoDB, Dynamo DB, Hadoop, Azure, etc., are used to
manage unstructured data.
 Hadoop is a tool used to handle unstructured databases.
 Unorganized, no predefined data model, qualitative (e.g., social media posts
including images and videos).
 Managed using NoSQL (non-relational databases).
 Fast accumulation rate.
 Data is stored in its native format, enhancing usability and flexibility.
 High flexibility (can store any kind of data like social media posts, including
images and videos).
 Examples: Multimedia texts, documents, images, sound bites.
 Contains an outer structure and some unstructured data.
DATASET
 A Dataset is a set of data grouped into a collection with which developers can
work to meet their goals. In a dataset, the rows represent the number of data points
and the columns represent the features of the Dataset. They are mostly used in
fields like machine learning, business, and government to gain insights, make
informed decisions, or train algorithms. Datasets may vary in size and complexity
and they mostly require cleaning and preprocessing to ensure data quality and
suitability for analysis or modeling.
 A dataset is a type of format in which data is stored. It can be in various formats
such as tabular form, Excel file, CSV file, or JSON file. Regardless of the format,
it is known as a dataset because it stores data.
 A dataset is a form where data is stored.
 It can be in spreadsheets or a data frame.
 Rows represent records.
 Columns represent attributes.
 All items in a single column have the same type of data or feature.
 Indexable.
 Relational databases store a unique reference for every row or record (like a row
number).
 Multi-level indexing for faster access.

TYPES OF DATASET
1. RECTANGULAR DATASET
 All the items in a single column have the same datatype.
 Note: It is important to check the datatype first.
 Any data stored in a rectangular format in your day-to-day life is called a
rectangular format.
 This is the datatype required in SQL.
2. NON-RECTANGULAR DATASET
 Another form of dataset is known as a non-rectangular dataset.
 There are no columns and no rows.
 It can be both relational and non-relational.
 Note: Anything in curly brackets is a dictionary, and anything in square brackets
is an array. It is relational but it is non-
 Example: rectangular (Has same tag
Data = ['little boy', and can build relation but it is
[1, 2, 3, 4], # Array non-rectangular because
('a', 'b', 'c'), # List don’t have rectangular row
{'hi': 1, 'hello': 2, 'aloha': 3}] # Dictionary and column format)

 Example: Graphical data nodes/graphical data, such as the World Wide Web
(WWW), are examples of unstructured data.
 This is related to the internet. If we post anything on LinkedIn/Facebook, it is
connected or interconnected on the internet by hashtags (#). This forms a node or
graphical structure, and it is very difficult to convert this immense amount of data
into a rectangular structured format. Therefore, we created a separate tool known
as NoSQL.

DATABASE
 A database is a container that stores data electronically.
 A database is an organized collection of data, so it can be easily accessed and
managed.
 You can organize the data into tables, rows, and columns, and index it to make it
easier to find relevant information.
 The main purpose of a database is to handle a large amount of information by
storing, retrieving, and managing data.
 The main reasons to store data are to ensure it can be:
o Easily accessed
o Modified
o Protected
o Analyzed

 Example: Amazon stores data. Any user who wants to buy products can query the
data through a DBMS to see what products are available. All the data is stored so
it can be analyzed.
 Data is stored in a database.
 RDBMS (Relational Database Management System) is used to retrieve, modify,
and protect the data in the database.
 SQL (Structured Query Language) is used to interact with an RDBMS.
 DBMS (Database Management System) has two types:
1. Relational Database (RDBMS)
2. Non-Relational Database (NRDBMS)

1. RELATIONAL DATABASE MANAGEMENT SYSTEM (RDBMS)


 A structured database has each column with a column name and an assigned data
type. Each row has a unique record with values corresponding to each column.
 Examples: MySQL, Oracle, PostgreSQL, Microsoft SQL Server.

2. NON – RELATIONAL DATABASE MANAGEMENT SYSTEM


(NRDBMS)
 It is a semi-structured and unstructured database.
 Example: MongoDB

RELATIONAL DATABASE
 A relational database is connected with keys, specifically unique primary keys.
 The structure consists of many tables connected using primary keys and foreign
keys
 An employee is directly connected to a manager with an ID, and the manager is
directly connected to a department with an ID. However, the department and
employee are not directly connected; they are indirectly connected through the
manager.
 A relational database or Relational Database Management System (RDBMS)
stores data in the form of tables
Ex-

EMPLOYEE

ID
SQL

MANAGER DEPARTMENT

ID ID

 An employee is directly connected to a manager with an ID, and the manager is


directly connected to a department with an ID. However, the department and
employee are not directly connected; they are indirectly connected through the
manager.
 A relational database or Relational Database Management System (RDBMS)
stores data in the form of tables.
 The table consists of rows and columns. In a relational database, rows are referred
to as records, and columns are referred to as fields.
 Examples: MySQL, Oracle, Microsoft SQL Server.
 Example: In Amazon, AWS can store unorganized data such as survey
information, text, pictures, etc. These are examples of unorganized data. There are
certain databases for storing organized data, but for SQL, we need organized data
(a table that consists of rows and columns). SQL is a simple query language.
 SQL is a computer language, but it is specifically a query language. We do not use
it to build applications or websites; it is used to query existing databases.
 As a data analyst, you are not responsible for creating the full database structure.
Your job is to retrieve data, which is not always readily available. You need to
query and get the data stored somewhere. Companies can adopt any database they
want, and you need to apply the appropriate commands.
 After retrieving the data, you may need to manipulate it. For example, if you have
employee information and want to filter out employees who perform well or are in
level one, you first query to get that information. After retrieving it, you may need
to manipulate it further, such as filtering for employees with a salary above
25,000. This involves querying, updating, and manipulating the data.
 Finally, you can connect the data with Power BI or download it in SQL for
visualization. The first part involves using SQL to get the data, then taking it to
the next level with visualization tools or further analysis in Python, such as
machine learning. The data is stored somewhere, processed through SQL,
manipulated, and then used for visualization or analysis
SQL:
 SQL: Structured Query language
 SQL is a standard programming language for managing and querying relational
databases.
 SQL, or Structured Query Language, is a computer language for storing,
manipulating, and retrieving data stored in relational databases.
 SQL is the structured language used for relational database systems.

 All relational database management systems like MySQL, MS Access, Oracle,


Sybase, Informix, PostgreSQL, and SQL Server use SQL as the standard
database language.
 JSON and CSV files need to be converted into a relational database format (like
snowflake or star schema) to run queries on them.
 Data has to be structured in a rectangular format and should be in a relational
database.
 Benefits of using SQL for data analytics include its simplicity, efficiency,
flexibility, and ability to query data faster.
 SQL has a specific syntax, which includes:
o Specifying the columns, you require.
o Indicating the table from which to retrieve the data.
o Defining what you want to do with the columns, such as grouping, ordering,
averaging, etc.
o Applying conditions to filter the data.
 For trends, use a line graph instead of a bar graph

DATASET
 A dataset is a type of format in which data is stored. It can be in various formats
such as tabular form, Excel file, CSV file, or JSON file. Regardless of the format,
it is known as a dataset because it stores data.
DATATYPE
 A data type defines the structure of the data. It describes what the data is about,
such as whether it is an integer (no decimals) or a float (with decimals). Data
types can be in any format, including JSON. Common data types include integers,
strings, objects, and floats.
 SQL Server is an instance that helps you build and connect to a database. A
connection is an endpoint that allows you to write queries in a specific format and
retrieve results from the database.
Questions:
1. Explain data types with example
2. Explain data sets with relative example
3. Types of data

STRUCTURED /UNSTRUCTURED
TYPES OF DATA
1. Qualitative Data – Non-numerical data, also known as categorical variables, such
as string responses (e.g., black color, eye color). Boolean data is different and not
considered qualitative.
a. Nominal – Represents categories without any order. The values should be
string values.
Example: “I have an SUV,” “I live in Sweden” – these define the values
stored in the data when collecting it.
b. Ordinal – Represents categories with a specific order but does not describe
the characteristic itself.
Example: Levels of satisfaction such as “happy,” “neutral.”
2. Quantitative Data – Numerical data that describes the data in terms of numbers
(e.g., 44, 1, 10, 100).
a. Discrete – Can hold only specific values, often whole numbers.
Example: The number of students in a class (e.g., 29 students, but not 22.1
or 22.9).
b. Continuous – Can hold any value within a range.
Example: Height, which can be measured continuously (e.g., 170.5 cm)

INSTALLING MYSQL WORKBENCH


 Instructions to Install MySQL.docx
 https://dev.mysql.com/downloads/installer/

IMPORTANCE OF SQL IN DATA ANALYTICS


 Data Extraction from Databases
o Efficiently retrieve data from relational databases.
o Handle large volumes of data.
 Data Manipulation and Transformation
o Clean, format, and prepare data for analysis.
o Use functions to modify and compute data.
 Data Querying for Insights
o Perform complex queries to derive meaningful insights.
o Aggregate, filter, and join data to uncover trends.
 Integration with Data Analysis Tools
o Seamlessly integrate with tools like Tableau, Power BI.
o Enhance data visualization and reporting capabilities.
WHAT IS SQL?

 SQL, an acronym for Structured Query Language, is a standard programming


language designed to manage and manipulate relational databases. Relational
databases store data in tables, which can be linked—or related—based on common
keys or identifiers.

 SQL enables users to efficiently create, read, update, and delete records in these
databases.

 Its widespread adoption is partly due to its versatility, ease of understanding, and
robust functionality, which allows for the precise management of large data sets.

WHAT IS SQL FOR DATA ANALYSIS?

 SQL for data analysis refers to the application of SQL in querying and
manipulating data to extract insights and inform business decisions.

 It involves using specific queries to analyze data trends, patterns, and anomalies.

 Through SQL, analysts can aggregate data, perform complex joins between tables,
and filter data sets to hone in on relevant information.

 The power of SQL for data analysis lies in its ability to handle large volumes of
data across different databases, making it a staple in the data analyst’s toolkit.

WHY IS SQL A MUST-HAVE SKILL FOR EVERY DATA ANALYST

 SQL’s ubiquity in data management makes it a critical skill for data analysts.
Firstly, it provides direct access to data stored in relational databases, which is
common in many organizations.

 This direct access allows analysts to query live data, ensuring analyses are based
on current information.

 Furthermore, SQL enables analysts to retrieve and manipulate data precisely,


ensuring data integrity and accuracy in analyses.

 Mastery of SQL also enhances efficiency, allowing for the automation of repetitive
tasks and complex data transformations.

USE CASES OF SQL FOR DATA ANALYSIS

SQL’s versatility shines across various use cases in data analysis:

o Customer Behavior Analysis: SQL queries can segment customers based on


behavior, enabling targeted marketing strategies.

o Financial Analysis: Analysts use SQL to monitor transactions, track financial


health, and detect fraud.
o Operational Efficiency: SQL helps organizations optimize operations by
identifying process bottlenecks and inefficiencies.

o Market Trend Analysis: Through data aggregation and filtering, SQL assists in
uncovering market trends and consumer preferences

BEST TOOLS IN SQL FOR DATA ANALYTICS

Several tools enhance SQL’s capabilities for data analytics, including:

o MySQL and PostgreSQL: Popular open-source database systems known for


their robustness and flexibility.

o Microsoft SQL Server: A comprehensive database solution offering advanced


data analytics features.

o Oracle Database: Widely used in enterprise environments, known for its


scalability and security features.

o SQLite: A lightweight database ideal for small to medium-sized projects and


mobile applications.

COMMON SQL QUERIES FOR DATA ANALYSIS

Data analysis with SQL often involves a set of common queries:

o SELECT: To retrieve data from one or more tables.

o JOIN: To combine rows from two or more tables based on a related column.

o WHERE: To filter records that fulfill a specified condition.

o GROUP BY: To aggregate rows with the same values in specified columns
into summary rows.

o HAVING: To filter records after an aggregation.

SQL TECHNIQUES TO PERFORM DATA ANALYSIS

To deepen data analysis capabilities, analysts employ several advanced SQL


techniques:

o Subqueries: Queries within queries that allow for more dynamic data
manipulation.

o Window Functions: These functions allow calculations across sets of rows


related to the current row.
o Common Table Expressions (CTEs): Temporarily result sets that are more
readable and can be referenced within a SELECT, INSERT, UPDATE, or
DELETE statement.

o Pivot Tables: To rotate data for a more comprehensive analysis.

BENEFITS OF SQL FOR DATA ANALYSIS

The adoption of SQL in data analysis offers numerous benefits:

o Efficiency: SQL queries can quickly process large volumes of data.

o Accuracy: Direct manipulation of data reduces the likelihood of errors.

o Flexibility: SQL can handle various data analysis tasks, from simple queries to
complex data transformations.

o Scalability: SQL databases can manage anything from small data sets to large
enterprise-level databases.

LIMITATIONS OF SQL FOR DATA ANALYSIS

Despite its strengths, SQL has limitations:

o Steep Learning Curve for Complex Queries: While basic SQL is


straightforward, mastering complex queries and optimizations can be
challenging.

o Performance Issues with Large data sets: SQL queries can sometimes
become slow or resource-intensive, particularly with large data sets.

o Limited to Structured Data: SQL excels with structured data but is not
designed for unstructured data, such as text or images, without additional
processing or tools.

 A basic knowledge of SQL and its query language is essential for any aspiring data
analyst and data scientist. This blog discusses SQL and its data analysis features.
 SQL for Data Analysis is a powerful programming language that helps data analysts
interact with data stored in Relational databases. SQL stands for Structured Query
language.
 Since most systems today capture the data using one or more databases (like MySQL,
Oracle, Redshift, SQL Server, etc.), you need to know SQL to extract data from these
systems and then work with it.
 But before I explain why a data analyst should learn SQL, let me take you through
what data analysis is and who data analysts are:
 Data analysis refers to using and manipulating data to generate insights that can help
businesses solve a problem or uncover opportunities. A data analyst is a professional
who collects and analyzes data in order to turn it into meaningful information that
companies can use to improve and grow. These individuals gather information from a
number of sources and are skilled in researching, analyzing and reporting.
 Historically, MS Excel was sufficient to handle these activities. But as businesses
generate more data, something more than a spreadsheet is needed to keep up with the
pace. And that is where SQL enters the picture, specifically when it comes to dealing
with relational databases storing large volumes of data.
 Differences between Excel and SQL

HOW CAN SQL HELP DATA ANALYST

You can use SQL to help you with the following work:
 Creating databases and tables.
 Adding data to a table. Selecting data.
 Editing data
 Deleting data.
 Sorting data.
 Finding unique values.
 Combining data from two or more tables.

SQL FOR DATA ANALYSIS: SQL QUERIES

SQL queries can be classified into five parts as they perform specific roles to execute
queries on any RDBMS system, and they are:
1. Data Definition Language (DDL): DDL commands include create, alter, drop,
rename and truncate, dealing with the structure of the databases. It operates on
database objects like views, tables, indexes, and triggers.
2. Data Manipulation Language (DML): DML commands include insert, update,
and delete operations to modify data in existing databases.
3. Data Query Language (DQL): This command includes a select operation to
retrieve data matching criteria specified by the user. To condense data
efficiently, DQL commands also involve nested queries.
4. Data Control Language (DCL): This command is used by data administrators
to grant and revoke permission to access data in the organization’s database.
5. Transaction Control Language (TCL): TCL commands help in managing
transactions in databases to commit or roll back a current transaction. TCL
command is used to commit a DML operation, and it has the ability to club
multiple commands in a single operation.
SQL COMMANDS
1. DDL : Data Definition Language
2. DML : Data Manipulation Language
3. DCL : Data Control Language
4. DQL : Data Query Language
 DCL is not for data analyst, so we will be using DDL. DML & DQL
 Data engineers or database administrator will be using data
1. DDL : DATA DEFINITION LANGUAGE:
Command Description
CREATE Create a new table, a view of a table or other object in database
ALTER Modifies an existing database object such as table
DROP Deletes an entire table, a view of a table or other object in the database

2. DML : DATA MANIPULATION LANGUAGE


Command Description
INSERT Create a record
UPDATE Modifies records
DELETE Deletes records

3. DML : DATA MANIPULATION LANGUAGE


Command Description
GRANT Give a privilege to user
REVOKE Takes back privileges granted from user

4. DML : DATA QUERY LANGUAGE


Command Description
SELECT Retrieves certain records from one or more tables
IF YOU WANT TO DIG DEEP INTO SQL, MY RECOMMENDATION IS TO
STUDY AS FOLLOWS:
 Basic SQL command
 SQL join
 SQL aggregate function
 SQL sub queries
 Data cleaning with SQL
 SQL window function
 SQL performance tuning

CREATE OUR DATABASE


CREATE DATABASE db_name;
DROP DATABASE db_name:
 Example:

CREATE Database new_data;


CREATE database university;
DROP database new_data;
USE college;
 ; - represents that code is over

CREATE TABLE
 In SQL, we use the CREATE TABLE statement to define the structure of a
new table. This statement specifies the table name and the columns it will
contain, along with their data types and any constraints.
 It follows a specific format, with each element serving a distinct purpose.
 Whenever we create a table, these things are important:
 Specify the table name.
 Specify the column names.
 Define the data types of the columns (e.g., INT, VARCHAR, etc.).
 Define constraints, which describe the rules for the data types.
 Example: If you are describing a column as an integer and you specify that it
should not be NULL (no data/value), then NOT NULL is added along with the
data type INTEGER. This means that the column can only have integer values
and cannot have any null values.
 Variable will take ( n ) values.
 Integer should not have NULL values. For example, let’s take ID. The ID
should not be a non-null character and should not contain any zero characters.
This can be given as a constraint.
 Example: Suppose we want to create a table to store information about
employees. Here’s what the CREATE TABLE statement might look like:
 CREATE TABLE Employees(
Employee ID INT PRIMARY KEY
First name VARCHAR(50),
Last name VARCHAR(50),
Department VARCHAR(50),
Hiredate DATE
);

 CREATE TABLE Employees (


employee_id INT NOT NULL,
first_name VARCHAR(50),
last_name VARCHAR(50),
hire_date DATE,
PRIMARY KEY (employee_id)
);

 USE db_name;
CREATE TABLE table_name (
Column_name1 datatype constraint,
Column_name2 datatype constraint,
Column_name3 datatype constraint,
);
CREATE TABLE student (
Id INT PRIMARY KEY,
Name VARCHAR(50),
Age INT NOT NULL,
);
I NAME AGE
D
1 AVI 20
2 ANU 21
3 AKE 20
In this example : id is integer , Name is Varchar
and Age is Integer

INTERPRETATION
 The CREATE TABLE keyword signals the start of the statement.
 Employees is the name of the table we are creating.
 Each column definition includes the column name and its data type.
 The primary key constraint specifies the employee_id column as the primary
key, ensuring uniqueness for each employee.
 Data types such as INT, VARCHAR, and DATE specify the type of data each
column can hold.
 Note:
o When naming tables or columns in SQL or any kind of modeling, never
use spaces in the names. For example, it should not be employee table;
instead, use an underscore (_) like employee_table or first_name.
o Avoid giving spaces in names

STEPS TO WRITE A CODE


1. Start with the statement CREATE TABLE, followed by the table name and an
opening parenthesis ( (not curly brackets).
2. Define each column with its name, data type, and optional constraints.
Example:
CREATE TABLE Employees (
employee_id INT PRIMARY KEY, -- Column name, data type, and constraint
first_name VARCHAR(50), -- Column name and data type
last_name VARCHAR(50), -- Column name and data type
department VARCHAR(50), -- Column name and data type
hire_date DATE -- Column name and data type
);

3. The first column name is employee_id, and you need to describe its data type
(INT) and constraint (PRIMARY KEY).
4. Use a comma , to separate each column definition.
5. The first_name column is described with the data type VARCHAR, which stands
for variable character (another term for string).
6. Similarly, define the last_name column with its data type and any constraints.
7. For string values like department, use VARCHAR(50).
8. The hire_date column uses the DATE data type, with the default date format
being YYYY-MM-DD.
9. Close the statement with a closing parenthesis ) and a semicolon ;.(The semicolon
indicates the end of the SQL statement.)
COLUMN DEFINITIONS:
 INT, VARCHAR (Variable character string), DATE, DECIMAL (e.g., 1.1, 1.2,
1.3), and BOOLEAN (TRUE or FALSE) are examples of data types we can use
to define columns.
 INT is for integer values, VARCHAR is for variable-length character strings,
DATE is for date values, DECIMAL is for decimal values, and BOOLEAN is for
boolean values (true or false).
 Numerical data types include INTEGER and DECIMAL.
 Categorical data is represented by VARCHAR.
 Date data is represented by DATE.
 Boolean data is represented by BOOLEAN (true or false, yes or no).
CONSTRAINTS
SQL Constraints are used to specify rules for data in a table
1. PRIMARY KEY
 This constraint ensures that each value in the specified column is unique and
serves as a unique identifier for each row in the table.
 It can be in any column and acts as a unique identifier (no duplicates
allowed, serves as a reference ID for joining tables).
 If the data has two similar values, do not use it as a primary key (e.g.,
Name).
 If the primary key might contain characters, use VARCHAR instead of INT.
 Using VARCHAR can reduce memory size.
 It makes a column UNIQUE and NOT NULL , but used only one
 Ex : id INT PRIMARY KEY
 Ex :
CREATE TABLE temp(
Id INT NOT NULL,
PRIMARY KEY(id)
);
2. NOT NULL
 This constraint ensures that there must be a value in the column.
 It cannot be missing or null (NULL/NOT NULL).
 This constraint specifies that a column cannot contain NULL values,
ensuring data integrity.
 Ex : Col1 int NOT NULL
3. DEFAULT
 This constraint assigns a default value to a column if no value is provided
during insertion.
 Example: DEFAULT DECIMAL(10,2)
 10 – Number of units.
 2 – Number of decimal places.
 If the value is zero, it will be stored as 0.00.
 If there is no name, it will default to AA. The data cannot be null, and it
won’t throw an error
 Set the default value of a column
 SALARY INT DEFAULT 25000;
4. UNIQUE
 All values in a column are different
 Ex : Col2 INT UNIQUE
5. CHECK
 It can limit the values allowed in a column
 Ex1 :
CREATE TABLE CITY(
ID INT PRIMARY KEY,
City VARCHAR(50),
Age INT,
CONSTRAINT age_check(age>=18 AND city=”Delhi”)
);
 Ex2 :
CREATE TABLE newtab(
Age INT CHECK(age>=18),
);

6. FOREIGN KEY
 Prevent actions that would destro links between tables
 Ex:
CREATE TABLE temp(
Cust_id int;
FOREIGN KEY ( cust_id ) references customer (id)
);
EXERCISE 1: Create a table to store information about products in a inventory system
CREATE TABLE Products(
Product ID INT PRIMARY KEY
Product name VARCHAR(100) NOT NULL,
Category VARCHAR(50),
Price DECIMAL (10,2) DEFAULT 0.00,
Instock BOOLEAN DEFAULT TRUE
);

EXERCISE 2:
CREATE database University;
USE University;
CREATE table student(
ID INT PRIMARY KEY NOT NULL;
Name VARCHAR(20),
Age INT NOT NULL
);
 Show databases; ----- It will show all the databases
 Show tables; ---- it will show the data under the selected ar database
CHECKING DATA TYPES
 We can use the DESCRIBE statement in SQL to view the structure of a table,
including its column names and data types.
 Example: Let’s say we have a table called students. We can use the DESCRIBE
statement to check the data types of each column:
 SQL
 DESCRIBE Students;
 AI-generated code. Review and use carefully. More info on FAQ.
 This is metadata – data that describes other data (e.g., metadata describes a
video on YouTube).
STRING DATA TYPES

Data type Description

CHAR(size) A FIXED length string (can contain letters, numbers, and


special characters). The size parameter specifies the column
length in characters - can be from 0 to 255. Default is 1

VARCHAR(size) A VARIABLE length string (can contain letters, numbers,


and special characters). The size parameter specifies the
maximum string length in characters - can be from 0 to
65535

BINARY(size) Equal to CHAR (), but stores binary byte strings.


The size parameter specifies the column length in bytes.
Default is 1

VARBINARY(size) Equal to VARCHAR (), but stores binary byte strings.


The size parameter specifies the maximum column length
in bytes.

TINYBLOB For BLOBs (Binary Large Objects). Max length: 255 bytes

TINYTEXT Holds a string with a maximum length of 255 characters

TEXT(size) Holds a string with a maximum length of 65,535 bytes

BLOB(size) For BLOBs (Binary Large Objects). Holds up to 65,535


bytes of data

MEDIUMTEXT Holds a string with a maximum length of 16,777,215


characters

MEDIUMBLOB For BLOBs (Binary Large Objects). Holds up to


16,777,215 bytes of data

LONGTEXT Holds a string with a maximum length of 4,294,967,295


characters

LONGBLOB For BLOBs (Binary Large Objects). Holds up to


4,294,967,295 bytes of data

ENUM(val1, val2, A string object that can have only one value, chosen from a
val3, ...) list of possible values. You can list up to 65535 values in
an ENUM list. If a value is inserted that is not in the list, a
blank value will be inserted. The values are sorted in the
order you enter them

SET(val1, val2, A string object that can have 0 or more values, chosen from
val3, ...) a list of possible values. You can list up to 64 values in a
SET list

NUMERIC DATA TYPES

Data type Description

BIT(size) A bit-value type. The number of bits per value is specified


in size. The size parameter can hold a value from 1 to 64.
The default value for size is 1.

TINYINT(size) A very small integer. Signed range is from -128 to 127.


Unsigned range is from 0 to 255. The size parameter
specifies the maximum display width (which is 255)

BOOL Zero is considered as false, nonzero values are considered as


true.

BOOLEAN Equal to BOOL

SMALLINT(size) A small integer. Signed range is from -32768 to 32767.


Unsigned range is from 0 to 65535. The size parameter
specifies the maximum display width (which is 255)

MEDIUMINT(size) A medium integer. Signed range is from -8388608 to


8388607. Unsigned range is from 0 to 16777215.
The size parameter specifies the maximum display width
(which is 255)

INT(size) A medium integer. Signed range is from -2147483648 to


2147483647. Unsigned range is from 0 to 4294967295.
The size parameter specifies the maximum display width
(which is 255)

INTEGER(size) Equal to INT(size)

BIGINT(size) A large integer. Signed range is from -


9223372036854775808 to 9223372036854775807.
Unsigned range is from 0 to 18446744073709551615.
The size parameter specifies the maximum display width
(which is 255)

FLOAT(size, d) A floating point number. The total number of digits is


specified in size. The number of digits after the decimal
point is specified in the d parameter. This syntax is
deprecated in MySQL 8.0.17, and it will be removed in
future MySQL versions
FLOAT(p) A floating point number. MySQL uses the p value to
determine whether to use FLOAT or DOUBLE for the
resulting data type. If p is from 0 to 24, the data type
becomes FLOAT(). If p is from 25 to 53, the data type
becomes DOUBLE()

DOUBLE(size, d) A normal-size floating point number. The total number of


digits is specified in size. The number of digits after the
decimal point is specified in the d parameter

DOUBLE
PRECISION(size, d)

DECIMAL(size, d) An exact fixed-point number. The total number of digits is


specified in size. The number of digits after the decimal
point is specified in the d parameter. The maximum number
for size is 65. The maximum number for d is 30. The default
value for size is 10. The default value for d is 0.

DEC(size, d) Equal to DECIMAL(size,d)

 Note: All the numeric data types may have an extra option: UNSIGNED or
ZEROFILL. If you add the UNSIGNED option, MySQL disallows negative values
for the column. If you add the ZEROFILL option, MySQL automatically also adds
the UNSIGNED attribute to the column.

DATE AND TIME DATA TYPES

Data type Description

DATE A date. Format: YYYY-MM-DD. The supported range is


from '1000-01-01' to '9999-12-31'

DATETIME(fsp) A date and time combination. Format: YYYY-MM-DD


hh:mm:ss. The supported range is from '1000-01-01
00:00:00' to '9999-12-31 23:59:59'. Adding DEFAULT and
ON UPDATE in the column definition to get automatic
initialization and updating to the current date and time

TIMESTAMP(fsp) A timestamp. TIMESTAMP values are stored as the number


of seconds since the Unix epoch ('1970-01-01 00:00:00'
UTC). Format: YYYY-MM-DD hh:mm:ss. The supported
range is from '1970-01-01 00:00:01' UTC to '2038-01-09
03:14:07' UTC. Automatic initialization and updating to the
current date and time can be specified using DEFAULT
CURRENT_TIMESTAMP and ON UPDATE
CURRENT_TIMESTAMP in the column definition
TIME(fsp) A time. Format: hh:mm:ss. The supported range is from '-
838:59:59' to '838:59:59'

YEAR A year in four-digit format. Values allowed in four-digit


format: 1901 to 2155, and 0000.
MySQL 8.0 does not support year in two-digit format.

SQL SERVER DATA TYPES


String Data Types

Data type Description Max size Storage

char(n) Fixed width character string 8,000 characters Defined width

varchar(n) Variable width character 8,000 characters 2 bytes + number of


string chars

varchar(max) Variable width character 1,073,741,824 2 bytes + number of


string characters chars

text Variable width character 2GB of text data 4 bytes + number of


string chars

nchar Fixed width Unicode string 4,000 characters Defined width x 2

nvarchar Variable width Unicode 4,000 characters


string

nvarchar(max) Variable width Unicode 536,870,912


string characters

ntext Variable width Unicode 2GB of text data


string

binary(n) Fixed width binary string 8,000 bytes

varbinary Variable width binary string 8,000 bytes

varbinary(max) Variable width binary string 2GB

image Variable width binary string 2GB

NUMERIC DATA TYPES

Data type Description Storage

bit Integer that can be 0, 1, or NULL

tinyint Allows whole numbers from 0 to 255 1 byte


smallint Allows whole numbers between -32,768 and 32,767 2 bytes

int Allows whole numbers between -2,147,483,648 and 4 bytes


2,147,483,647

bigint Allows whole numbers between - 8 bytes


9,223,372,036,854,775,808 and
9,223,372,036,854,775,807

decimal(p,s) Fixed precision and scale numbers. 5-17 bytes


Allows numbers from -10^38 +1 to 10^38 –1.
The p parameter indicates the maximum total
number of digits that can be stored (both to the left
and to the right of the decimal point). p must be a
value from 1 to 38. Default is 18.
The s parameter indicates the maximum number of
digits stored to the right of the decimal point. s
must be a value from 0 to p. Default value is 0

numeric(p,s) Fixed precision and scale numbers. 5-17 bytes


Allows numbers from -10^38 +1 to 10^38 –1.
The p parameter indicates the maximum total
number of digits that can be stored (both to the left
and to the right of the decimal point). p must be a
value from 1 to 38. Default is 18.
The s parameter indicates the maximum number of
digits stored to the right of the decimal point. s
must be a value from 0 to p. Default value is 0

smallmoney Monetary data from -214,748.3648 to 214,748.3647 4 bytes

money Monetary data from -922,337,203,685,477.5808 to 8 bytes


922,337,203,685,477.5807

float(n) Floating precision number data from -1.79E + 308 4 or 8 bytes


to 1.79E + 308.
The n parameter indicates whether the field should
hold 4 or 8 bytes. float(24) holds a 4-byte field and
float(53) holds an 8-byte field. Default value of n is
53.

real Floating precision number data from -3.40E + 38 to 4 bytes


3.40E + 38

DATE AND TIME DATA TYPES

Data type Description Storage

datetime From January 1, 1753 to December 31, 9999 with 8 bytes


an accuracy of 3.33 milliseconds

datetime2 From January 1, 0001 to December 31, 9999 with 6-8 bytes
an accuracy of 100 nanoseconds

smalldatetime From January 1, 1900 to June 6, 2079 with an 4 bytes


accuracy of 1 minute

date Store a date only. From January 1, 0001 to 3 bytes


December 31, 9999

time Store a time only to an accuracy of 100 3-5 bytes


nanoseconds

datetimeoffse The same as datetime2 with the addition of a time 8-10 bytes
t zone offset

timestamp Stores a unique number that gets updated every time


a row gets created or modified. The timestamp
value is based upon an internal clock and does not
correspond to real time. Each table may have only
one timestamp variable

OTHER DATA TYPES

Data type Description

sql_variant Stores up to 8,000 bytes of data of various data types,


except text, ntext, and timestamp

uniqueidentifier Stores a globally unique identifier (GUID)

xml Stores XML formatted data. Maximum 2GB

cursor Stores a reference to a cursor used for database operations

table Stores a result-set for later processing

MS Access Data Types

Data type Description Storage

Text Use for text or combinations of text and numbers.


255 characters maximum

Memo Memo is used for larger amounts of text. Stores up to


65,536 characters. Note: You cannot sort a memo
field. However, they are searchable

Byte Allows whole numbers from 0 to 255 1 byte


Integer Allows whole numbers between -32,768 and 32,767 2 bytes

Long Allows whole numbers between -2,147,483,648 and 4 bytes


2,147,483,647

Single Single precision floating-point. Will handle most 4 bytes


decimals

Double Double precision floating-point. Will handle most 8 bytes


decimals

Currency Use for currency. Holds up to 15 digits of whole 8 bytes


dollars, plus 4 decimal places. Tip: You can choose
which country's currency to use

AutoNumbe AutoNumber fields automatically give each record 4 bytes


r its own number, usually starting at 1

Date/Time Use for dates and times 8 bytes

Yes/No A logical field can be displayed as Yes/No, 1 bit


True/False, or On/Off. In code, use the constants
True and False (equivalent to -1 and 0). Note: Null
values are not allowed in Yes/No fields

Ole Object Can store pictures, audio, video, or other BLOBs up to 1GB
(Binary Large Objects)

Hyperlink Contain links to other files, including web pages

Lookup Let you type a list of options, which can then be 4 bytes
Wizard chosen from a drop-down list
ADDING DATA TO TABLE
 We use the INSERT INTO statement in SQL to add new rows of data to a table.
Each INSERT statement specifies the table name and the values to be inserted
into each column. It’s crucial to ensure that the data we’re inserting matches the
data types and constraints of the table.
 INSERT INTO is the clause/syntax/statement used.
 Two statements we use are INSERT INTO and VALUES.
 When entering data, ensure that the data matches the data types and constraints
of the table.
 Example: Suppose we want to add a new student to our Students table. Here’s
how we can do it:
INSERT INTO Students (StudentID, Name, Age, EnrollmentDate)
VALUES (1, 'John Smith', 20, '2023-09-15');
 Whenever inserting a string value, it should always be enclosed in single quotes,
and dates should be enclosed in quotes and always in the format YYYY-MM-
DD.
 If there is no data for a column, you can omit it from the INSERT statement.
CREATE THE SAMPLE DATA:
CREATE DATABASE College;
USE College;
CREATE TABLE student (
Rollno INT PRIMARY KEY,
Name VARCHAR (50),
Marks INT NOT NULL,
Grade VRACHAR(1),
City VARCHAR(20)
);
INSERT INTO DATA:
INSERT INTO Student ( rollno, name, marks, grade, city)
VALUES
(101, “Anil”, 78, “C”, “Pune”),
(102, “Bhumika”, 93, “A”, “Mumbai”),
(103, “Chetan”, 78, “B”, “Mumbai”),
(104, “Dhruv”, 78, “A”, “Delhi”),
(105, “Emanuel”, 96, “F”, “Delhi”),
(106, “Farah”, 78, “B”, “Delhi”),

EXERCISE 1: Create database called Employee and Table called salary amd output
should be as below
ID NAME SALARY
1 AAA 2000
2 AAD 3000

CREATE DATABASE Employee;


USE Employee;
CREATE TABLE salary (
ID INT PRIMARY KEY NOT NULL,
Name VARCHAR (50),
Age INT NOT NULL;
INSERT INTO salary
(ID, Name, Salary)
Values
(1, “AAA”, 2000)
(2, “AAD”, 3000)
Select * From salary;
 Primary key – Unique – 1 primary key not null
 Foreign key – Repeatable – Multiple Foreign key

EXERCISE 2: CREATE database called employee


Table called salary Table called city
ID NAME SALARY SL NO ID CITY
1 AAA 2000 1 2 Pune
2 AAD 3000 2 1 Chennai

PK PK FK
USE Employee;
DROP table_salary
DROP table Name;
CREATE table salary (
ID INT PRIMARY KEY NOT NULL.
Name Varchar ( 20 ),
Salary INT );
INSERT INTO salary
(ID, Name, Salary)
Values
(1, “AAA”, 2000)
(2, “AAD”, 3000)
CREATE table salary (
SNO INT PRIMARY KEY NOT NULL.
ID INT,
CITY VARCHAR (10),
Foreign key(ID) references salary(ID)
);
INSERT INTO city
(SNO, Name, Salary)
Values
(1, 2,“PUNE”)
(2, 1,”CHENNAI”)
REVERSE ENGINEER
 To connect the two tables – This is reverse engineer
 Database  reverse engineer Next  Next  Select Schema(Here employee) 
Next  Execute  Next
 Represents the primary key of the particular table
 This is for data engineer, but we can check our connections
 Once DROP Command is executed we cannot retrieve the data
 If encountered connection issue in SQL
o Services  MySQL84  start the service(in right)
o MySQL84  Properties  Startup type  Automatic

TO UPLOAD FILES
 Schemas  Select filename  tables  select and right click  table
data import wizard  browse  select file  open 
file path browse
Next  Next  Next  Next  Next Finish refresh
 Same way upload all the files in the table
 If we apply user data  inn configure import settings
 Join date  text  datetime
 Last payment date  text  datetime
 Date format - %d-%m-%y
 User_date
SELECT FROM
 The SELECT statement has a simple structure. We specify the columns we want
to retrieve after the SELECT keyword and then indicate which table contains the
data we’re interested in using the FROM keyword.
 Example: Suppose we have a table called Customers and we want to retrieve the
names and email addresses of all customers. We would write a SELECT
statement like this:
SELECT Name, Email
FROM Customers;
 Example: If you want to select some columns but not all, you use the SELECT
statement to specify the columns. For instance, if we are working with a table
named Students and we want only the students’ first names and student IDs, we
would write:
SELECT StudentID, FirstName
FROM Students;
 In the first line of the SELECT statement, do not use a colon or semicolon.
 Avoid entering currency directly in values; instead, use an extra column for
currency, such as USD or INR.
 Naming conventions:
o FIRST NAME ------ ✕
o FIRST-NAME ------ ✕
o FIRST_NAME ------ ✓
 For boolean values, use TRUE or FALSE in uppercase.
o 'TRUE' ------- ✓ It should be only in uppercase
o 'True' ------- ✕

Exercise: Get the following columns from the Enrollments table


created in the class exercise:
 Student ID
 Course ID
 Grade
If the table has many columns and you want to select all columns, you can use:
SELECT * (Select(space)*)
FROM Customers;

CREATE TABLE – CHECK CONSTRAINTS


 The CHECK constraint in SQL is used to enforce domain integrity by limiting the
values that can be inserted into a column. It allows you to define a condition that
must be met for each row when data is inserted or updated. If the condition
specified in the CHECK constraint evaluates to false for any row, the operation
(insert or update) will be rejected

1. CREATE TABLE Student (


Student ID INT PRIMARY KEY,
StudentName VARCHAR (50),
Age INT CHECK (Age >= 5 AND Age <= 18)
);
2. CREATE TABLE STUDENT (
StudentID INT PRIMARY KEY,
StudentName VARCHAR(30) NOT NULL,
Sex VARCHAR(6) CHECK (Sex IN ('Male','Female’))
);

 The CHECK constraint in the CREATE TABLE statement can be applied to any
kind of field, such as integer, character, categorical variable, or quantitative
variables.
 It acts as a filter, ensuring that the value must meet the specified condition in the
CHECK constraint.
 The CHECK constraint always comes after the data type and column name.
 For string/categorical values, use IN.
 For integers, use AND with comparison operators like <, =, etc.
 When specifying string values in the CHECK constraint, use quotes.
 For strings, use inverted commas (single quotes).
 The equal sign (=) is used only for integers.
Example:
CREATE TABLE Students (
StudentID INT PRIMARY KEY,
FirstName VARCHAR(50),
Age INT CHECK (Age >= 18),
Gender VARCHAR(10) CHECK (Gender IN ('Male', 'Female', 'Other'))
);
 In this example:
o The Age column must have values greater than or equal to 18.
o The Gender column must have values that are either ‘Male’, ‘Female’, or
‘Other’

WHERE
 The WHERE clause allows us to narrow down our results by specifying
conditions that the data must meet. The WHERE clause comes after the FROM
clause in our SQL query. We can use various comparison operators such as =, <,
>, and logical operators like AND, OR, to define our conditions.
 Example: Suppose we have a table called products and we want to retrieve the
names and prices of products that are currently in stock (i.e., where the
quantity_on_hand is greater than zero)
1. SELECT NAME,PRICE
FROM Products
WHERE QuantityOnHand>0;

2. SELECT NAME,PRICE
FROM Products
WHERE QuantityOnHand>0 AND CustomerID = 123;
 The WHERE clause is a filtration query that allows us to filter the data we’re
selecting. We can use any of the comparison operators (e.g., =, <, >) or logical
operators (AND, OR, etc.).
 Example: With reference to the above example, instead of using AND, we can use
OR if we need any of the data (e.g., QuantityOnHand > 0 OR sales > 0).
 For three conditions:
1. If we need to use more than 2-3 separate filters, we should use separate
queries. This simplifies the code for ourselves and for anyone reading it.
2. Avoid using too many filters in a single query, as the code will get complex
and we might not get the right output. If you have too many filters, you can
create two tables with different filters and work on them separately instead
of putting all the filters in one table.
3. The order of the filters (chronology) doesn’t matter
 Examples:
WHERE CustomerID = 123
WHERE QuantityOnHand > 0 AND Sales > 0
 We almost always use single filtration by using logical operators.
 In SQL, the hierarchy of the statements matters.
 Always start with SELECT and FROM.
 If we want to see all the columns, we use SELECT *

ORDER BY
 The ORDER BY clause is used to sort the result set returned by the SELECT
statement. We specify the column(s) by which we want to sort, along with the
desired sorting order (Ascending or Descending).
 Example: Suppose we have a table called employees with columns for employee
names and salaries. We can use the ORDER BY clause to sort the employees by
their salaries in descending order:
SELECT Name, Salary
FROM Employees
ORDER BY Salary DESC;
 We can sort a column in ascending or descending order by using the ORDER BY
statement.
 By default, it will sort in ascending order.
 We need to mention the column where we need sorting.
 DESC - Descending order (Z to A)
 ASC - Ascending order (A to Z)
 ORDER BY will not work well on columns with categorical data like YES/NO. If
applied, it will show all NOs first and then YESs because N comes before Y.
Avoid using it on such columns; use it on columns with numerical or name data
instead.

GROUP BY
 When we use the GROUP BY clause, SQL divides the result set into groups based
on the values of one or more columns. We can then apply aggregate functions such
as SUM, AVG, COUNT, MIN, and MAX to each group to calculate summary
statistics.
 Example: Suppose we have a table called Orders with columns for Order IDs,
Customer IDs, and order amounts. We want to calculate the total order amount for
each customer. We can use the GROUP BY clause along with the SUM function:
SELECT CustomerID, SUM(OrderAmount) AS TotalOrderAmount
FROM Orders
GROUP BY CustomerID;
 When we have a table and we need to view something, it’s often not just two or
three columns. We are usually performing some operations on the table, such as
grouping.
 Grouping By means we have different entries for the same entity

Dept Name Sales Amount Total sales


A 5000
B 3000  B= 3000
B 2000  B=4000
A 2500  B=3000
B 4000 B=SUM (3000,4000,3000)
B 3000
 AVG – gives any value
 COUNT-gives count value
 MAX-gives max value,
 MIN – gives min value

 AS Statement: We can use it to rename a column or table. For example,


SalesAmount AS TotalSales.
 We have a SQL database explorer. Instead of retrieving the entire table, we can use
a command to get a table for modeling, which shows departments grouped by sales
directly. We can then perform further operations on this table. This creates an
entirely new table. If we need to save this table back to the database, we use
another statement to save it. If we don’t save it, the output will vanish once we
close the query window.
HAVING
 The HAVING clause is similar to the WHERE clause, but it operates on grouped
data rather than individual rows. It allows us to filter groups based on aggregate
conditions. After using the GROUP BY clause to group rows based on certain
criteria, we can apply the HAVING clause to specify conditions that must be met
by the groups. Only groups that satisfy the specified conditions will be included in
the result set.
 Example: Find departments where the total sales amount is greater than $5000.
You have columns Department and SalesAmount:
SELECT Department, SUM(SalesAmount) AS TotalSales
FROM Sales
GROUP BY Department
HAVING SUM (SalesAmount) > 5000;
 The HAVING clause is similar to the WHERE clause.
 The HAVING clause is only for aggregate functions.
 The HAVING clause can only be used to filter aggregate functions like COUNT,
MIN, and MAX. If any column has an aggregate function applied and we need to
filter those, we use the HAVING clause.
 It has two limitations:
 It can be used only after the GROUP BY statement.
 It should always be used after the GROUP BY statement (we can’t write it before
the GROUP BY statement).
 Using logical operators, we can write two or more filter conditions.
 For HAVING, the GROUP BY clause is compulsory.
 It’s important to give a column name when you are using an aggregate function to
provide a clear reference.
EXERCISE 1:
 Table name : Orders
 Columns : OrderID, CustomerID, OrderDate, TotalAmount
 Query: Retrieve the total sales amount for each customer from the “Orders” table,
where the orders we placed between January 1,2023 and December 31, 2023 and
the total sales amount is greater than $1000.Sort the results in ascending order of
total sales amount
SELECT CustomerID, SUM(TotalAmount) AS TotalSalesAmount
FROM Orders
WHERE OrderDate >= ‘2023-01-01’ AND OrderDate <= ‘2023-12-31’
GROUP BY CustomerID
HAVING SUM( TotalAmount ) > 1000
ORDER BY SUM ( TotalAmount ) DESC;
 Note: Always write the ORDER BY clause at the end. When you have created the
entire table, sorting should be the last step
EXERCISE 2:
 Table name : Products
 Columns : ProductID, Category , Price, Instock
 Query: Retrieve the number of products in each category from the “Products”
table, where the products are currently instock and the number of products is
greater than 5. Sort the results by the number of products in descending order
SELECT Category, COUNT(*) AS NumberofProducts
FROM Products
WHERE Instock > 0
GROUP BY Category
HAVING COUNT(*) > 5
ORDER BY COUNT(*) DESC;
EXERCISE 3:
 Table name : Employees
 Columns : EmployeeID, Department,Salary
 Query: Find the average salary of employees in each department from the
“Employees” table, wher the average salary is greater than $50000, Sort the results
by department name alphabetically
SELECT Department, AVG(Salary) AS Average_Salary_USD
FROM Employees
GROUP BY Department
HAVING AVG(Salary) > 5000
ORDER BY Department;
EXERCISE 4:
 Table name : Sales
 Columns : TransactionID, ProductID,SaleDate, SaleAmount
 Query: Retrieve the total sale amount for each product from the “Sales” table
where the sales were made in the year 2023 and the total sale amount is greater
than $500.Sort the results by the total sale amount in descending order
SELECT ProductID, SUM(SaleAmount) AS Total_Sales
FROM Sales
WHERE SaleDate >= ‘2023-01-01’ AND SaleDate <= ‘2023-12-31’
GROUP BY ProductID
HAVING SUM(SaleAmount) > 500
ORDER BY SUM(SaleAmount) DESC;

SQL SERVER MANAGEMENT STUDIO


STEPS :
1. Open (Close any open connections if necessary) – File – Connect to Object
Explorer
2. Server type – Database Engine
3. Server Name – Enter your laptop name (this is the first thing you need to select)
4. Options – Ensure the checkbox is marked (Trust server certificate)
5. Options – Connect – New Query
6. In web code, write in schemas SQL and execute in query SQL

EXERCISE 5:
CREATE DATABASE JatinDemo-(Execute and refresh it will have the database
in schemas )
CREATE TABLE sample_table (
id INT PRIMARY KEY,
name VARCHAR (50) ,
gender VARCHAR (10),
city VARCHAR (50) ,
salary DECIMAL(10,2)
);
INSERT INTO sample_table ( id, name , age, gender, city, salary )
VALUES
(1, ‘John’, 30, ‘Male’, ‘New York’, ‘50000.00),
(2, ‘Alice’, 25, ‘Female’, ‘Los Angeles’, ‘60000.00),
(3, ‘Michael’, 35, ‘Male’, ‘Chicago’, ‘70000.00),
(4, ‘Emily’, 28, ‘Female’, ‘Houston’, ‘65000.00),
(5, ‘David’, 40, ‘Male’, ‘Phoenix’, ‘80000.00),
(6, ‘Sophia’, 32, ‘Female’, ‘Philadelphia’, ‘45000.00),
(7, ‘James’, 33, ‘Male’, ‘Son Antonia’, ‘75000.00),
(8, ‘Emma’, 29, ‘Female’, ‘San Diego’, ‘65000.00),
(9, ‘Daniel’, 32, ‘Male’, ‘Dallos’, ‘72000.00),
(10, ‘Olivia’, 27, ‘Female’, ‘San Francisco’, ‘68000.00);
1. To show all the table
SELECT *
FROM sample_table

2. Find the average salary of employees grouped by gender, but only for those age
is above 25
SELECT gender, AVG ( Salary ) AS avgs_salary
FROM sample_table
WHERE age > 25
GROUP BY gender ;

O/P : SL no gender Avg_salary


1 Female 62666.666665
2 Male 69400.000000

3. List the cities where the total salary expenditure is greater than $65000,
ordered by the total salary expenditure in descending order
SELECT city, SUM ( Salary ) AS total_salary
FROM sample_table
GROUP BY city
HAVING SUM ( salary ) > 65000
ORDER BY total_salary DESC;
O/P :
SL no city avg_salary
1 Phoenix 80000.00
4. 2 Son Antonia 75000.00 Find the age of employees
3 Dallas 72000.00 whose salaries are above $
4 Chicago 70000.00 60000, grouped by gender
5 San Francisco 68000.00 and only show the results
for gender where the average age is below 35
SELECT gender, AVG ( age ) AS average_age
FROM sample_table
WHERE salary > 60000
GROUP BY gender
HAVING AVG ( age ) < 35 ;
O/P :

SL no Gender Average_age
1 Female 28

5. Find the max salary among employees aged 30 and above in each city
SELECT City, Max(Salary) AS Max_Salary
FROM sample_table
WHERE age >= 30
GROUP BY city;
O/P :
SL no City Max_Salary
1 Chicago 70000.00
2 Dallas 72000.00
3 New York 50000.00
4 Phoenix 80000.00
5 San Antonia 75000.00

JOIN
 Joins are used to combine rows from two or more tables based on a related
column between them. This allows us to create a single result set that contains
data from multiple tables.
 There are several types of joins in SQL, including INNER JOIN, LEFT JOIN,
RIGHT JOIN, and FULL OUTER JOIN. Each type of join determines how rows
from the joined tables are combined.
1. Primary key: A unique identifier for each record in a table.
2. Foreign key: A field in one table that uniquely identifies a row of
another table. A foreign key cannot be null because it is a primary key in
another table
1.What is the difference between foreign key and primary key?
For joining tables, the only criteria are common columns between the two
tables. These columns can be categorical variables or integers, but they must
be unique. They cannot have duplicate entries.
Criteria:
 Uniqueness
 Common column

1. INNER JOIN

An INNER JOIN between two tables will return only records where the joining
fields, such as a key, find a match in both tables.

EXERCISE :
2. LEFT JOIN

A LEFT JOINS keeps all of the original records in the left table and returns
missing values for any columns from the right table where the joining field did not find
a match

 Display all the entries present in the left table. If there are no matching values in
the right table, it will display as NULL.

EXERCISE :
3. RIGHT JOIN

A RIGHT JOINS keeps all of the original records in the right table and
returns missing values for any columns from the left table where the joining field
did not find a match

EXERCISE :

 If we want same result, we can also code as


SELECT Students.Name, Grade.Subject, Grades.Grade
FROM Students
LEFT JOIN Grades ON Students.StudentID = Grades.StudentID;
4. FULL OUTER JOIN

A FULL JOIN combines a LEFT JOIN and RIGHT JOIN. A FULL JOIN
will return all records from a table, irrespective of whether there is a match on the
joining field in the other table, returning null values accordingly.

EXERCISE :

SELECT Students.Name, Grade.Subject, Grades.Grade


FROM Students
FULL OUTER JOIN Grades.ON Students.StudentID = Grades.StudentID;
Name Subject Grade
ALICE MATH A
ALICE SCIENCE B
BOB MATH B
CHARLIE SCIENCE A
JOHN NULL NULL
NULL ECONOMICS A

EXERCISE :
 LEFT TABLE – Table A – 5 Rows and 2 Columns
 RIGHT TABLE – Table B – 10 Rows and 3 Columns

 SELECT A.*, B.*


 INNER JOIN – 3 Rows and 5 Columns
 SELECT A.*, B.Name, B.Marks
 LEFT JOIN - 5 Rows and 4 Columns
 RIGHT JOIN - 10 Rows and 5 Columns(3+2)
 OUTER JOIN - 12 Rows(15-3=12 – we need to subtract the
common rows) and 5 Columns(2+3)
5. SELF JOIN

SELF JOINS are used to compare values in a table to other values of the
same table by joining different parts of a table together

EXERCISE :
1. Retreive all employees along with the names of their corresponding
departments
2. Retrieve all employees and departments, matching them where possible

SELECT Employees.Name , Department.Name,


FROM Employee,
LEFT JOIN Department ON Employee. Department ID = Department.
Department ID;

(We can use INNER TOO too for the above code)
SELECT Employees.Name , Department.Name,
FROM Employee,
OUTER JOIN Department ON Employee. Department ID =
Department. Department ID;

6. CROSS JOIN

In a CROSS JOIN, every row from the first table is combined with every row
from the second table, resulting in a cartesian product. This means that if Table A
has m rows and Table B has n rows, the CROSS JOIN will produce m*n rows in
the output.
EXERCISE : PRODUCT CATALOG GENERATION
In an e-commerce system, product catalogs often contain various attributes
such as sizes, colors, and styles. A CROSS JOIN can be used to generate all
possible combinations of products and attributes, resulting in a comprehensive
product catalog. Suppose you have tables for products and attributes. A CROSS
JOIN between these tables would produce all possible combinations of products
and attributes, allowing customers to view different product variations.
 It does not need column IDs.
 It will match each row in both tables, so it will basically be a
Cartesian product of ( m \times n )

EXERCISE :

EXAMPLE : GENE MAPPING


 TABLE A – 5 Rows and 1 Column
 TABLE B - 6 Rows ans 1 Column
 CROSS JOIN- 30 Rows and 2 Column
LIMIT
 The LIMIT clause is typically used at the end of a SELECT statement to specify
the maximum number of rows to return. It allows you to control the size of the
result set, which can be helpful for performance optimization and data analysis.
 Instead of LIMIT, you can also use TOP in some SQL databases (like SQL
Server) when LIMIT is not supported.
 The correct syntax for using LIMIT is:
SELECT column1, column2, ……
FROM table_name,
LIMIT number_of_rows;
 The row numbering in SQL typically starts with index 1, for example: 1, 2, 3, …
EXAMPLE
1. SELECT city, MAX(Salary) AS max_salary
FROM sample_table
WHERE age>=30
GROUP BY city;
LIMIT 3;

2. SELECT TOP 3 city, MAX(Salary) AS max_salary


FROM sample_table
WHERE age>=30
GROUP BY city;
o/p: SL NO CITY MAX_SALARY
1 CHICAGO 70000.00
2 DALLAS 72000.00
3 NEW YORK 50000.00
SUBQUERIES
 SUBQUERIES, also known as nested queries or inner queries, are powerful
tools that allow us to perform complex analysis by embedding one query (the
inner query) within another query (the outer query).
 A SUBQUERY is essentially a query nested within another query. It can be used
within SELECT, INSERT, UPDATE, or DELETE statements to retrieve or
manipulate data based on the results of another query. Subqueries can return a
single value, a single row, multiple rows, or even a result set.

TYPES OF SUBQUERIES

1. SINGLE ROW SUBQUERIES: These subqueries return only one row of results.
2. MULTI-ROW SUBQUERIES: These subqueries return multiple rows of results.
3. SINGLE-COLUMN SUBQUERIES: These subqueries return only one column
of results.
4. CORRELATED SUBQUERIES: These subqueries are evaluated once for each
row processed by the outer query.
 The basic syntax of a SUBQUERY involves enclosing the inner query within
parentheses and using it as an operand in the outer query.
 Here’s an example of a SUBQUERY in a SELECT statement:

EXERCISE :
CREATE TABLE Customers (
Customer ID INT PRIMARY KEY,
NAME VARCHAR (50)
);
CREATE TABLE Orders (
OrderID INT PRIMARY KEY,
CustomerID INT
OrderAmount DECIMAL(10,2),
FOREIGN KEY (CustomerID) REFERENCES Customers (CustomerID)
);
---inserting values into the customer table
INSERT INTO Customers ( CustomerID, Name) VALUES
(101, ‘ALICE’ ),
(102, ‘BOB’ ),
(103, ‘CHARLIE’ ),
---inserting values into the orders table
INSERT INTO Orders ( OrderID, CustomerID, OrderAmount ) VALUES
(1, 101, 500 ),
(2, 102, 700 ),
(3, 103, 600 ),
(4, 101, 800 );

1. SELECT * FROM Customers


SL NO CustomerID Name
1 101 ALICE
2 102 BOB
3 103 CHARLIE

2. SELECT * FROM Orders


SL NO OrderID CustomerI OrderAmount
D
1 1 101 500.00
2 2 102 700.00
3 3 103 800.00
4 4 101 800.00
3. Need to find the customers whose average order amount is greater than the overall
average amount of all orders
Method 1: SELECT Name
FROM Customers
WHERE CustomerID IN(
SELECT CustomerID
FROM Orders
GROUP BY CustomerID
HAVING AVG (OrderAmount)>(SELECT AVG (OrderAmount)
FROM Orders)
);
Output: SL no Name
1 BOB
Note :
1. SELECT AVG (OrderAmount) FROM Orders – [o/p – 650]
2. SELECT CustomerID
FROM Orders
GROUP BY CustomerID
HAVING AVG (OrderAmount)>(SELECT AVG (OrderAmount) FROM
Orders)
Output - SL no Name
1 102

Method 2: SELECT customer_id, AVG(order_amount) AS avg_order_amount


FROM orders
GROUP BY customer_id
HAVING AVG(order_amount) > (SELECT AVG(order_amount)
FROM orders);
In this query:
 The inner subquery (SELECT AVG(order_amount) FROM orders) calculates
the overall average order amount.
 The outer query groups the orders by customer_id and calculates the average
order amount for each customer.
 The HAVING clause filters the results to include only those customers whose
average order amount is greater than the overall average

Method 2: SELECT c.Name, c.CustomerID


FROM Customers AS c
LEFT JOIN Orders AS o ON c.CustomerID = o.CustomerID
GROUP BY c.CustomerID, c.Name
HAVING AVF ( o.OrderAmount ) > (
SELECT AVG (OrderAmount)
FROM Orders
);

 Where we see a subquery, try replacing it with a LEFT JOIN to reduce the use
of subqueries and improve performance.
 In c.Name, c is an alias for the table from which the column Name is being
selected

TYPES OF SUBQUERIES

1. SINGLE ROW SUBQUERIES


 SINGLE ROW subqueries return only one row of results.
 These subqueries are commonly used in scenarios where you need to
compare a single value with the result of another query
Example 1 : Suppose we want to find the customer with the highest total order
amount. We can use a single row subquery to achieve this:
SELECT Name
FROM Customers
WHERE CustomerID = (SELECT CustomerID
FROM Orders
GROUP BY CustomerID
ORDER BY SUM ( OrderAmount ) DESC
LIMIT 1
);
Example 2 : Suppose we want to find the customer with the highest average
order amount. We can use a single row subquery to achieve this:
SELECT Name
FROM Customers
WHERE CustomerID = (SELECT CustomerID
FROM Orders
GROUP BY CustomerID
ORDER BY AVG ( OrderAmount ) DESC
LIMIT 1
);
Note :
 Instead of using IN, use = because we need only one row as the answer.
 Instead of LIMIT 1, you can also use TOP 1 in some SQL databases
(like SQL Server).

2. MULTI ROW SUBQUERIES


 MULTI-ROW subqueries return multiple rows of results.
 These subqueries are useful when you need to compare or filter based on a set
of values returned by another query.
Example 1 : Suppose we want to find all the customer who have placed orders
woth more than the average order amount, we can use a multi-row subquery to
achieve this:
SELECT Name
FROM Customers
WHERE CustomerID = (SELECT CustomerID
FROM Orders
GROUP BY CustomerID
HAVING AVG ( OrderAmount ) > (SELECT
AVG(OrderAmount) from Orders));

 Instead of using =, use IN for multiple answers.


 If the ORDER BY AVG is DESC and two average amounts are the same, the
system will determine the order. If the original data is in alphabetical order, it
will return results alphabetically. If not, it will return results by index, so
whichever customer is at a higher index will be noted as the top customer.
When we use LIMIT 1, this top customer will be selected

ADVANCED SQL FUNCTIONS FOR ANALYSIS

1. AGGREGATE FUNCTIONS
Functions like SUM(), AVG(), COUNT(), MIN() and MAX() are
fundamentals for summarizing and aggregating data across groups

2. WINDOW FUNCTIONS
These functions allow for advanced analyzing by performing calcualtions
over set of rows related to the current row, without the need for explicit
subqueries like LAG, LEAD , Partition by rank
Why we use window function?
We don’t have to write a lot of subqueries, where we first write a subquery for
one calculation, then another subquery for retrieving data, and then another
for comparing and ranking values. This is why we created window functions,
which basically reduce our work and the time spent writing different joining
subqueries.

3. STRING FUNCTIONS
Functions like CONCAT(), SUBSTRING(), UPPER(), LOWER() and
REPLACE() facilitate text manipulation and formatting

4. DATE AND TIME FUNCTIONS


Functions such as DATEADD(), DATEDIFF(), DATE_TRUNC() nd
EXTRACT() enable date arithmetic and extraction of date/time components
 Rounding of the data: Use the DATE_TRUNC() function.
 Extract date, month, year from a date column: Use the EXTRACT() function,
for example, EXTRACT(YEAR FROM date_column)
 Example - EXTRACT ( YEAR, DATE )

5. MATHEMATICAL FUNCTIONS
Functions like ABS(), ROUND(), CEILING(), FLOOR(), and POWER()
support mathematical operations on numeric data.
1. ABS(): Returns the absolute value of a number. For example, ABS(-5)
returns 5.
2. ROUND(): Rounds a number to a specified number of decimal places.
For example, ROUND(2.718, 2) returns 2.72.
3. CEILING(): Rounds a number up to the nearest integer. For example,
CEILING(2.82) returns 3.
4. FLOOR(): Rounds a number down to the nearest integer. For example,
FLOOR(2.82) returns 2.
5. POWER(): Raises a number to the power of another number. For
example, POWER(2, 3) returns 8.

6. CONDITIONAL FUNCTIONS
Functions like CASE() provide conditional logic to customize query output
based on specified conditions.

CONDITIONAL FUNCTIONS
CASE STATEMENTS:
 CASE statements evaluate a set of conditions and return a result based on the
first condition that is true. They’re similar to IF-THEN-ELSE statements in
other programming languages, but tailored for SQL queries.
 CASE statements can be used in various scenarios such as categorizing data,
creating calculated fields, or simplifying complex logic.
 Here’s the syntax for a CASE statement:
CASE
WHEN condition1 THEN result1
WHEN condition2 THEN result2
----- (additional conditions can be added here)
ELSE default_result
END
 Example : Here it is an example like WHEN – THEN - ELSE
If a>1 THEN TRUE
ELSE
FALSE
Example: Suppose we want to categorize orders based on their amount:
SELECT order_id, order_amount,
CASE
WHEN order_amount > 1000 THEN 'High'
WHEN order_amount BETWEEN 500 AND 1000 THEN 'Medium'
ELSE 'Low'
END AS order_category
FROM orders;
In this example:
 Orders with an amount greater than 1000 are categorized as ‘High’.
 Orders with an amount between 500 and 1000 are categorized as ‘Medium’.
 All other orders are categorized as ‘Low’.

Example: Suppose we have a table of students scores and we want to cateogorize their
performance as “Excellent”, “Good”, “Average” or “BelowAverage” based on their
scores
StudentID Score
1 85
2 70
3 95
4 60

SELECT StudentID, Score,


CASE
WHEN Score >= 90 THEN ‘Excellent’
WHEN Score >= 80 THEN ‘Good’
WHEN Score >= 90 THEN ‘Average’
ELSE ‘BelowAverage’
END AS performance
FROM StudentScores;

StudentID Score Performance


1 85 Good
2 70 Average
3 95 Excellent
4 60 BelowAverage
EXERCISE :

CREATE TABLE Employee (


Customer ID INT PRIMARY KEY,
NAME VARCHAR (50),
SALARY DECIMAL (10,2)
);
INSERT INTO Employees ( CustomerID, Name, Salary ) VALUES
(1, ‘ALICE’ , 60000),
(2, ‘BOB’, 75000),
(3, ‘CHARLIE’, 45000),
(4, ‘DAVID’, 80000),
CREATE TABLE Products (
ProductID INT PRIMARY KEY,
CustomerID INT
NAME VARCHAR (50),
DISCOUNT DECIMAL (5,2)
);
INSERT INTO Products ( ProductID, Name, Price, Discount) VALUES
(1, ‘Laptop’, 1200, 0.1),
(2, ‘Phone’, 800, 0.05),
(3, ‘PC’, 1000, 0.00),
(4, ‘Tablet’, 500, 0.15);

1. SELECT * FROM Employees


SELECT MAX( Salary) FROM Employees
SELECT MAX(Salary) as Maxsalary, MIN(Salary) as Minsalary,
AVG(Salary) as Avgsalary FROM Employees

2. We have a table of employees with their salaries and we want to categorize them
into different salary ranges
SELECT EmployeeID, Name, Salary,
CASE
WHEN Salary >= 80000 then ‘HIGH’
WHEN salary >= 60000 then ‘MEDIUM’
ELSE ‘LOW’
END AS SalaryCategory
FROM Employees
EmployeeID Name Score Performance
1 ALICE 60000.00 MEDIUM
2 BOB 75000.00 MEDIUM
3 CHARLIE 45000.00 LOW
4 DAVID 80000.00 HIGH

3. We have a table of products with their prices and discounts and we want to
calculate the final price (after discount, if any) for each product
SELECT * FROM Products
SELECT Name, Price, Discount,
CASE
WHEN Discount >0 then Price-(Price*Discount)
ELSE Price
END AS FinalPrice
FROM Products ;
Name Price Discount FinalPrice
Laptop 1200.00 0.10 1080.0000
Phone 800.00 0.05 760.0000
PC 1000.00 0.00 1000.0000
Tablet 500.00 0.15 425.0000
PRACTISE 1 :
Suppose we have a database for a bookstore and we want to generate a report that
categorizes books based on their popularity. We define the popularity of a book based
on the number of copies sold. We categorize a book as a ‘Bestseller’ if it has sold
more than 100 copies, ‘Popular’ if it has sold between 50 and 100 copies (inclusive),
and ‘Standard’ otherwise.

Bookstore Database Schemas :


Books Tabe :
BookID ( Primary Key ) : Unique identifier for each book
Title : Title of the book

SALES TABLE:
SaleID ( PRIMARYKEY) : Unique identifier for each sale
BookID (FOREIGN KEY ) : References the BookID in the Books table
Quantity : Number of copies sold in each sale

Solution:
SELECT Title, Author,
SELECT SUM( Quantity ) FROM Sales WHERE BookID =
Books.BookID) AS TotalSales,
CASE
WHEN ( SELECT SUM(Quantity) FROM Sales WHERE
BookID = Books.BookID) > 100 THEN ‘Bestseller’
WHEN ( SELECT SUM(Quantity) FROM Sales WHRE
BookID = Books.BookID ) BETWEEN 50 AND 100
THEN ‘Popular’
ELSE ‘Standard’
END AS Popularity)
FROM Books;
 Note : BookID.BookID , can be used as a foreign key. If it is not a foreign key,
then we cannot use it in that context.”
PRACTISE 2 :
Suppose we have a database for an online marketplace, and we want to generate a
report that categorizes sellers based on their sales performance. We define the
performance of a seller based on the total revenue generated from their sales. We
categorize a sellers as a ‘Top Performer’ if it their total revenue exceeds $10,000,
‘Good Performer’ if its between $5000 and $10,000 (inclusive), and ‘Standard
Performer’ otherwise.

Online Marketplace Database Schemas :


Seller Tabe :
SellerID ( Primary Key ) : Unique identifier for each seller
SellerName : Name of the seller
ORDERS TABLE:
OrderID ( PRIMARYKEY) : Unique identifier for each order
SellerID (FOREIGN KEY ) : References the SellerID in the Sellers table
UnitPrice : Price per unit of the product in the order
Quantity : Number of units ordered

Solution:
CREATE TABLE Sellers
(
Seller_id INT NOT NULL PRIMARY KEY,
Seller_name VARCHAR ( 50 )
);
INSERT INTO Sellers (Seller_id, Seller_Name ) VALUES
(1, ‘SARA’),
(2, ‘MAYA’),
(3, ‘JOHN’),
(4, ‘NARA’),
CREATE TABLE Orders_s
(
Order_id INT PRIMARY KEY,
Seller_id INT,
Unit_Price INT,
Quantity INT,
FOREIGN KEY (Seller_id) REFERENCES Sellers(Seller_id)
);
INSERT INTO Orders_s, Seller_id, Unit_Price, Quantity) VALUES
(1, 1, 1000, 5),
(2, 1, 1000, 10),
(3, 2, 7000, 2),
(4, 4, 10000, 1);

SELECT * FROM Sellers


SELECT * FROM Orders_s;
SELECT Seller_id, Seller_Name,
SELECT SUM( Unit_Price * Quantity ) FROM Orders_s WHERE Seller_id =
Sellers.Seller_id) AS ‘Total_Revenue’,
CASE
WHEN ( SELECT SUM ( Unit_Price * Quantity )FROM
Orders_s WHERE Seller_id = Sellers.Seller_id ) > 10000 THEN
‘Top Performers’
WHEN ( SELECT SUM ( Unit_Price * Quantity )FROM
Orders_s WHERE Seller_id = Sellers.Seller_id ) BETWEEN
5000 AND 10000 THEN ‘Good Performers’
ELSE ‘Standard Performance’
END AS ‘Categorize Sellers’
FROM Sellers;
 Note : SELECT SUM( Unit_Price * Quantity ) FROM Orders_s WHERE
Seller_id = Sellers.Seller_id) is SUBQUERY
Seller_id Seller_Name Total_Revenue Categorize sellers
1 SARA 15000 Top Performance
2 MAYA 14000 Top Performance
3 JOHN 50000 Top Performance
4 NARA 10000 Good Performance

WINDOWS FUNCTIONS
 Window functions allow us to perform calculations across a set of rows that are
related to the current row, without collapsing the result set. They are incredibly
versatile and can be used for tasks such as ranking, aggregation, moving
averages, and more.
 Unlike aggregate functions that collapse multiple rows into a single row, window
functions maintain the granularity of the result set, providing valuable context for
each row.
 When we use window functions, we apply aggregation, moving averages, and
ranking functions to each row. These functions perform aggregation at the row
level, not at a group or higher level.

TYPES OF WINDOWS FUNCTIONS

1. RANKING FUNCTIONS:
 RANK(), DENSE_RANK(), ROW_NUMBER(), and NTILE() assign a rank or
row number to each row based on specified criteria.
 RANK() – Assigns a rank to each row within the partition of a result set.
Example: In a sales table, a column RANK gives ranks according to sales
values.
 DENSE_RANK() – Similar to RANK(), but it does not skip ranks if there are
ties.
Example: If there are two people tied for second position, it will assign ranks
as 1, 2, 2, and then 3.
 ROW_NUMBER() – Assigns a unique number to each row, starting at 1 for
the first row in each partition.
 NTILE() – Divides the result set into a specified number of approximately
equal groups, or “tiles.”
Example: If you specify NTILE(5), it will divide the rows into five groups,
each representing a percentile range (0-20%, 20-40%, 40-60%, 60-80%, 80-
100%).
 Additional Explanation for NTILE():
1. Percentile Example: If you score in the 98th percentile, it means you scored
better than 98% of the students who took the exam. If there are 200
students, you scored better than 196 of them. This is not your actual score
but your relative position among the group.
2. Sales Example: Using NTILE(5) on sales data will categorize each row into
one of five percentiles. For instance, the first group will include the bottom
20% of sales, the second group the next 20%, and so on up to the top 20%.

2. AGGREGATE FUNCTIONS WITH WINDOW FRAMES:


 Functions like SUM(), AVG(), MIN(), and MAX() can indeed be used as
window functions. By specifying a window frame, you can perform
aggregations over a subset of rows within a result set. Here’s a brief
overview:
 SUM(): Calculates the total sum of a numeric column over the specified
window frame.
 AVG(): Computes the average value of a numeric column over the
specified window frame.
 MIN(): Finds the minimum value within the specified window frame.
 MAX(): Finds the maximum value within the specified window frame.
 Example Usage
 Consider a table sales with columns salesperson, sale_date, and amount.
You can use window functions to calculate running totals, averages, etc.

SELECT salesperson, sale_date, amount,


SUM(amount) OVER (PARTITION BY salesperson ORDER BY
sale_date) AS running_total,
AVG(amount) OVER (PARTITION BY salesperson ORDER BY
sale_date) AS running_avg,
MIN(amount) OVER (PARTITION BY salesperson ORDER BY
sale_date) AS running_min,
MAX(amount) OVER (PARTITION BY salesperson ORDER BY
sale_date) AS running_max
FROM sales;
 PARTITION BY salesperson: Divides the result set into partitions by
salesperson.
 ORDER BY sale_date: Orders the rows within each partition by sale date.
 SUM(amount) OVER (…): Calculates the running total of sales amounts
for each salesperson.
 AVG(amount) OVER (…): Calculates the running average of sales
amounts for each salesperson.
 MIN(amount) OVER (…): Finds the running minimum sales amount for
each salesperson.
MAX(amount) OVER (…): Finds the running maximum sales amount for
each salesperson.
This allows you to perform complex calculations and analyses on subsets of
your data without needing to write multiple queries.
3. ANALYTIC FUNCTIONS :
 Functions such as LAG(), LEAD(), PRECEDING, and FOLLOWING
provide access to data from other rows in the result set, enabling
comparisons in calculations.
 These functions help in analyzing or understanding/comparing certain
values.
 Explanation of Each Function:
 LAG(): Provides access to a value in a previous row within the same
result set.
 Example: LAG(sales, 1) OVER (ORDER BY sale_date) gives the sales
amount from the previous row.
 LEAD(): Provides access to a value in a subsequent row within the same
result set.
 Example: LEAD(sales, 1) OVER (ORDER BY sale_date) gives the sales
amount from the next row.
 PRECEDING: Used in defining a window frame to include rows before
the current row.
 Example: ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
includes the current row and the three preceding rows.
FOLLOWING: Used in defining a window frame to include rows after
the current row.
Example: ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING
includes the current row and the three following rows.
Example Usage
 Consider a table sales with columns salesperson, sale_date, and amount. You
can use these functions to compare sales amounts from different rows.

SELECT salesperson, sale_date, amount,


LAG(amount, 1) OVER (ORDER BY sale_date) AS
previous_amount,
LEAD(amount, 1) OVER (ORDER BY sale_date) AS
next_amount
FROM sales;

 In this example:
 LAG(amount, 1) OVER (ORDER BY sale_date): Retrieves the sales
amount from the previous row.
 LEAD(amount, 1) OVER (ORDER BY sale_date): Retrieves the sales
amount from the next row.
 These functions are particularly useful for trend analysis, such as
comparing current sales to previous sales, or forecasting future sales
based on past data

WINDOWS FUNCTIONS EXAMPLE

1. RANKING FUNCTIONS:
SELECT ProductID, ProductName, UnitPrice,
RANK() OVER(ORDER BY UnitPrice DESC() AS PriceRank
FROM ExampleData;
2. AGGREGATE FUNCTIONS WITH WINDOW FRAME(Cumulative
total):
SELECT OrderID, OrderDate, TotalAmount,
SUM(TotalAmount) OVER(PARTITION BY CustomerID ORDER
BY OrderDate) AS CumulativeTotal
FROM ExampleData;

3. AGGREGATE FUNCTIONS(Moving Average, Lead and Lag):


SELECT Date, Revenue,
AVG(Revenue) OVER(ORDER BY DATE ROWS BETWEEN 2
PRECEEDING AND 2 FOLLOWING) AS MovingAvgRevenue
FROM ExampleData;

 Explanation
 Moving Average – It calculates the average for a specified period of time.
 Example: Stock market analysis.
 2 Preceding and 2 Following – For example, if the current date is the 23rd:
 The dates 21 and 22 are the 2 preceding days.
 The dates 24 and 25 are the 2 following days.
Detailed Breakdown
 Moving Average: This function helps smooth out short-term fluctuations and
highlight longer-term trends. It’s particularly useful in time series data, such
as stock prices.
 2 Preceding and 2 Following: In the context of the moving average
calculation:
 If the current row is for the 23rd, the window frame includes the 21st, 22nd,
23rd, 24th, and 25th.
 This means the moving average for the 23rd will be calculated using the
revenue values from these five dates.
 This approach helps in analyzing trends over a brief period, providing a
clearer picture of the data’s behavior over time.
Example:
SELECT CustomerID, OrderDate, TotalAmount,
LAG(TotalAmount) OVER PARTITION BY CustomerID ORDER BY
OrderDate) AS PreviousOrderAmount;
LEAD(TotalAmount) OVER PARTITION BY CustomerID ORDER BY
OrderDate) AS NextOrderAmount;
FROM ExampleData;
Note :
1. LAG – Value of the previous entry
2. LEAD – Value of the next entry
MONTH Date Amount LAG LEAD
A 1 50 50
A 2 100 100
A 3 200
A 4
3. Example – If I have month and sales data, we need to see the growth rate for
each month using the LAG function
MONTH SALES PreviousMonthSalary GrowthRate
JAN S1 - -
FEB S2 S1 (S2-S1)*100
MARCH S3 S2 (S3-S2)*100
S3 (S4-S3)*100

EXERCISE :
USE IIMSKILLS;
CREATE TABLE ExampleData (
ProductID INT,
ProductName VARCHAR (100),
UnitPrice DECIMAL (10,2)
OrderID INT,
OrderDate DATE,
TotalAmount DECIMAL (10,2),
CustomerID INT,
Date DATE,
Revenue DECIMAL (10,2)
);
INSERT INTO ExampleData (ProductID, ProductName, UnitPrice, OrderID,
OrderDate, TotalAmount, CustomerID, Date, Revenue)
VALUES
(1, ‘ProductA’, 10.00, 1, ‘2024-01-01’, 100.00, 1, ‘2024-01-01’, 200.00),
(1, ‘ProductB’, 15.00, 2, ‘2024-01-02’, 150.00, 1, ‘2024-01-02’, 250.00),
(1, ‘ProductC’, 20.00, 3, ‘2024-01-03’, 200.00, 2, ‘2024-01-03’, 300.00),
(1, ‘ProductD’, 25.00, 4, ‘2024-01-04’, 250.00, 2, ‘2024-01-04’, 350.00),
(1, ‘ProductE’, 30.00, 5, ‘2024-01-05’, 300.00, 3, ‘2024-01-05’, 400.00),

1. SELECT * FROM ExampleData;

2. RANK :
SELECT ProductID, ProductName, UnitPrice,
RANK() OVER (ORDER BY UnitPrice DESC) AS PriceRank
FROM ExampleData;

3. LAG/LEAD :
SELECT OrderID, CustomerID, OrderDate, TotalAmount,
LAG(TotalAmount) OVER PARTITION BY CustomerID ORDER BY
OrderDate) AS PreviousOrderAmount;
LEAD(TotalAmount) OVER PARTITION BY CustomerID ORDER
BY OrderDate) AS NextOrderAmount;
FROM ExampleData;
OrderID CustomerI OrderDate TotalAmount PreviousOrder NextOrder
D
1 1 2024-01-01 100.00 NULL 150.00
2 1 2024-01-02 150.00 100.00 NULL
3 2 2024-01-03 200.00 NULL 250.00
4 2 2024-01-04 250.00 200.00 NULL
5 3 2024-01-05 300.00 NULL NULL

4. CUMMULATIVE SUM :
SELECT OrderID, OrderDate, TotalAmount,
SUM(TotalAmount) OVER PARTITION BY CustomerID ORDER
BY OrderDate) AS CummulativeTotal;
FROM ExampleData;
OrderID OrderDate TotalAmount CummulativeTotal
1 2024-01-01 100.00 100.00
2 2024-01-02 150.00 250.00
3 2024-01-03 200.00 200.00
4 2024-01-04 250.00 450.00
5 2024-01-05 300.00 300.00

5. MOVING AVERAGE :
SELECT Date, Revenue,
AVG(Revenue) OVER (ORDER BY Date ROWS BETWEEN 3
PRECEEDING AND 3 FOLLOWING ) AS MovingAverageRevenue
FROM ExampleData;
OrderDate Revenue PrevRevenue Revenue_GrowthRate
2024-01-01 200.00 NULL NULL
2024-01-02 250.00 200.00 0.2500000000000
2024-01-03 300.00 250.00 0.2000000000000
2024-01-04 350.00 300.00 0.1666666666666
2024-01-05 400.00 350.00 0.1428571428571

6. GROWTH RATE :
Solution 1:
SELECT Date, Revenue,
LAG(Revenue) OVER (ORDER BY Date) AS PrevRevenue (Revenue –
LAG(Revenue, 1) over (ORDER BY Date)) / LAG(Revenue, 1) over
(ORDER BY Date) AS Revenue_GrowthRate
FROM ExampleData;
OrderDate Revenue MovingAverageRevenue
2024-01-01 200.00 275.000000
2024-01-02 250.00 300.000000
2024-01-03 300.00 300.000000
2024-01-04 350.00 300.000000
2024-01-05 400.00 325.000000

Solution 2:
SELECT Date, Revenue,
LAG(Revenue) OVER (ORDER BY Date) AS PrevRevenue,
--ROUND
(Revenue – LAG(Revenue, 1) over (ORDER BY Date)) / LAG(Revenue,
1) over (ORDER BY Date)*100,2) AS Revenue_GrowthRate
-----Decimal Round Off
ROUND(LAST(Revenue – LAG(Revenue, 1) over (ORDER BY Date)) /
LAG(Revenue, 1) over (ORDER BY Date)*100) AS
DECIMAL(10,2)),2) AS Revenue_GrowthRate
FROM ExampleData;
OrderDate Revenue PrevRevenue Revenue_GrowthRate
2024-01-01 200.00 NULL NULL
2024-01-02 250.00 200.00 25.00
2024-01-03 300.00 250.00 20.00
2024-01-04 350.00 300.00 16.67
2024-01-05 400.00 350.00 14.29

TIME BASED ANALYSIS


 Time-based data can provide valuable insights into trends, patterns, and
behaviors over time. By analyzing data with respect to time, we can uncover
seasonality, trends, cyclic patterns, and anomalies (such as sudden dips in the
trend) that may not be evident from a static snapshot of the data.
 Anything observed over a period of time.
o Seasonality: Regular patterns that repeat over a specific period, such as
increased retail sales during the holiday season.
o Trends: Long-term movements in data, like a gradual increase in average
global temperatures.
o Cyclic Patterns: Fluctuations that occur at irregular intervals, often
influenced by economic cycles.
o Anomalies: Sudden changes or outliers, such as a sharp drop in stock prices
due to unexpected news.

COMMON TIME BASED ANALYSIS TECHNIQUES


1. Aggregating data by time intervals: Summarizing data over specific time
intervals (e.g., hourly, daily, weekly) to identify trends and patterns.
2. Calculating moving averages: Smoothing out fluctuations in data by
calculating moving averages over rolling time windows.
3. Removing seasonality: Analyzing data over different time periods to identify
and remove recurring patterns or seasonal variations.
4. Identifying trends: Using regression analysis or trend line fitting to identify
long-term trends in time series data.

TIME BASED ANALYSIS EXAMPLES


1. Aggregating data by time intervals
SELECT DATE_TRUNC(‘day’, OrderDate, Count(*) AS NumOrders
FROM Orders,
GROUP BY DATE_TRUNC(‘day’, OrderDate)
ORDER BY OrderDay;
 Note: To simplify the date format from 2024-04-03 14:25:00 to just 2024-04 or
2024, you can use the TRUNC function in SQL. Here’s how you can do it:
To get the year and month (2024-04):
SELECT TRUNC(date_column, 'MM') AS truncated_date
FROM your_table;
To get only the year (2024):
SELECT TRUNC(date_column, 'YYYY') AS truncated_date
FROM your_table;
 In these examples, date_column is the name of your column containing the date,
and your_table is the name of your table.
 Use line graph for time base analysis

2. Calculating moving averages


SELECT OrderDate, TotalAmount,
AVG(TotalAmount) OVER (ORDER BY OrderDate ROWS BETWEEN 3
PRECEEDING AND 3 FOLLOWING) AS MovingAvgTotalAmount
FROM Orders;

3. Detecting Seasonality
SELECT EXTRACT(Month FROM OrderDate) AS Month,
AVG(TotalAmount) AS AvgTotalAmount
FROM Orders,
GROUP BY EXTRACT(Month FROM OrderDate)
ORDER BY Month;

4. Identifying Trends:
SELECT DATE_TRUNC (‘Month’, OrderDate ) AS Month,
SUM(TotalAmount) AS TotalSales,
FROM Orders,
GROUP BY DATE_TRUNC(‘Month’, OrderDate)
ORDER BY Month;

VISUALIZATION USING SQL


Data Visualization in SQL empowers analysts to create insightful visual
representations directly from their data queries
1. Export Data : Execute the SQL query in SSMS and export the results as CSV,
Excel or other compatible formats.
2. Import into Visualization tool : Import the exported data file into your preferred
visualization tool. Ex: Power BI and Tableau
3. Create Visualization : Use the featured and capabilities of the visualization tool to
create interactive and customized visualizations
4. Publish or Share : Publish the visualizations to dashboards, reports and share
them with stakeholders for insights and decision-making.
DIFFERENT TYPE OF VISUALIZATION
1. Bar Chart: Categorizes the data.
Variables:
 Independent: Categorizes data into groups (e.g., types of cars).
 Dependent: Varies based on the independent variable.
2. Line Chart: Used for continuous data over time (e.g., time, date, month).
Example: Water level is continuous and affects the population.
3. Scatter Plot: Used to understand the correlation and distribution of two variables.
Variables: Not strictly dependent/independent (e.g., age and degrees).
4. Pie Plot: Shows percentage share and distribution among categories.
Question: “How much?” It illustrates how data is distributed among different
categories.

OPTIMIZING SQL QUERIES FOR PERFORMANCE


1. Use Indexes: Properly index columns frequently used in search conditions, joins,
and ORDER BY clauses to speed up data retrieval (use wherever possible).
2. Limit Result Set: Retrieve only the necessary columns and rows needed for the
query result to reduce data transfer and processing overhead.
3. Optimize Subqueries: Rewrite correlated subqueries as joins or derived tables
for better performance.
4. Avoid SELECT *: Specify only the required columns in the SELECT list to
minimize data retrieval overhead.
5. Avoid Functions in WHERE Clause: Avoid using functions on indexed
columns in WHERE clauses, as it can prevent index usage. For example, avoid
SELECT * FROM table WHERE DATE(column) < CURRENT_DATE.
6. Update Statistics: Regularly update table and index statistics to ensure the query
optimizer has accurate information for generating efficient execution plans.

INTRODUCTION TO DATA ANALYTICS PROJECTS


1. Problem Definition: Clearly define the problem or business question that the
analytics project aims to address. This step involves understanding stakeholders’
needs and defining project objectives.
Example: Why are sales declining?
Objectives:
 Identify top performers by demographics.
 Analyze trends.
 Assess customer acquisition (added or lost customers).
 Gather sales data for each person, age, dates, trends, etc.
2. Data Collection: Gather relevant data sources, both internal and external,
necessary for analysis. This may involve accessing databases, APIs, data
warehouses, or acquiring data from third-party sources.
3. Data Preparation: Cleanse, transform, and preprocess the raw data to ensure its
quality, consistency, and suitability for analysis. This step may involve data
cleaning, data integration, and feature engineering.
4. Analysis and Modeling: Apply statistical techniques, machine learning
algorithms, or SQL analytics to analyze the data and extract insights. This may
include exploratory data analysis, hypothesis testing, predictive modeling, and
segmentation (e.g., using moving averages and seasonality).
5. Interpretation and Insights: Interpret the results of the analysis in the context of
the business problem and derive actionable insights. This step involves
communicating findings to stakeholders and making data-driven
recommendations.
6. Visualization and Reporting: Present the findings and insights using data
visualization techniques, dashboards, and reports to facilitate understanding and
decision-making.
Note: EDA (Exploratory Data Analysis) and data preparation come under data
preprocessing, which includes problem statement, data collection, and data
preprocessing

PRESENTING YOUR SQL ANALYTICS FINDINGS


1. Executive Summary: Provide a high-level overview of the key findings,
insights, and recommendations. Summarize the most critical points concisely for
busy stakeholders.
2. Data Visualization: Use charts, graphs, and dashboards to visually represent the
data and insights. Choose appropriate visualization types based on the nature of
the data and the message you want to convey.
3. Narrative Storytelling: Structure your presentation as a narrative, guiding
stakeholders through the data analysis process from problem definition to insights
and recommendations. Tell a compelling story that connects with your audience.
4. Interactivity: Incorporate interactive elements such as drill-downs, filters, and
tooltips into your visualizations to enhance engagement and allow stakeholders to
explore the data further.
5. Contextualization: Provide context around the data analysis, including the
business problem, data sources, analysis methodology, and assumptions made.
Help stakeholders understand the significance of the findings within the broader
business context.
6. Actionable Recommendations: Clearly articulate actionable recommendations
based on the insights derived from the analysis. Provide specific steps or
strategies that stakeholders can implement to address the identified opportunities
or challenges

REAL BUSINESS SCENARIOS


1. Customer Segmentation: Analyzing customer data to segment customers based
on demographics, behavior, or preferences.
2. Sales and Marketing Analytics: Analyzing sales data to identify trends,
optimize marketing campaigns, and forecast future sales.
3. Financial Analysis: Analyzing financial data to track revenue, expenses,
profitability, and financial performance metrics.
4. Supply Chain Optimization: Analyzing supply data to optimize inventory
levels, streamline logistics, and reduce costs.
5. Risk Management: Analyzing data to identify and mitigate risks such as fraud
detection, credit risk assessments, and compliance monitoring.

DATA ETHICS AND PRIVACY IN SQL ANALYTICS


Ethical principles :
1. Transparency: Be transparent about data collection practices, usage, and
potential risks to stakeholders.
2. Anonymization: Minimize the risk of re-identification by anonymizing or
pseudonymizing sensitive data.
3. Informed Consent: Obtain explicit consent from individuals before collecting
and processing their personal data.
4. Data Minimization: Collect and retain only the data necessary for the intended
purpose, minimizing unnecessary data exposure.
Example: Use hashing for email IDs or phone numbers to protect sensitive
information. Ensure that sensitive data is hidden or protected.

Privacy principles :
1. Sensitive Data: Identify and protect sensitive data such as personally identifiable
information (PII), health records, financial information, and biometric data.
2. Data Encryption: Encrypt data at rest and in transit to prevent unauthorized
access and protect confidentiality.
3. Access Control: Implement access controls and authentication mechanisms to
restrict access to sensitive data based on user roles and permissions.
4. Data Retention: Define data retention policies to determine how long data
should be stored and when it should be securely deleted or anonymized.
5. Data Sharing: Exercise caution when sharing data with third parties, ensuring
compliance with privacy regulations and contractual agreements.

CASE STUDY

Orders;
Order_ID Customer_ID Order_Date Total_Amount
1 101 2024-03-15 150.00
2 102 2024-03-18 75.50
3 103 2024-03-20 200.00
4 101 2024-03-25 120.00
5 104 2024-04-01 180.25
6 105 2024-04-05 90.80
7 102 2024-04-10 300.00
8 106 2024-04-12 50.00
Order_items;
Order_ID Customer_ID Quantity Total_Amount
1 101 2 25.00
1 102 1 50.00
2 103 3 15.50
3 101 1 30.00
4 104 2 60.00
5 102 1 90.25
5 103 2 45.00
6 104 1 70.80
7 105 4 75.00
8 106 1 50.00

Products;
Product_I Product_Name Category
D
101 Shirt Apparel
102 Shoes Apparel
103 Hat Apparel
104 Watch Accessories
105 Bag Accessories
106 Socks Apparel

1. What are the top 5 best selling products in terms of quantity

SELECT TOP 5 p.product_name, SUM(oi.quantity) AS Total_quantity,


FROM Order_items AS oi
JOIN Products AS p ON oi.products_id = p.product_id
GROUP BY p.product_name
ORDER BY total_quantity DESC;
Sl no Product_Name Total_quantity
1 Hat 5
2 Bag 4
3 Watch 3
2. What is the 4 Shirt 3 total revenue
generated of 5 Shoes 2 each month for the
past year?

SELECT MONTH (order_date) AS month,


YEAR (order_date ) AS year,
SUM(total_amount) AS monthly_revenue
FROM orders
WHERE order_date >= DATEADD ( year, -1, GETDATE())
GROUP BY YEAR ( order_date ), MONTH( order_date )
ORDER BY YEAR (order_date ), MONTH ( order_date );
Sl no Month Year Monthly_revenue
1 3 2024 545.50
2 4 2024 621.05
3. For each customer, What is the total amount they’ve spent and the number of orders
they’ve placed.

SELECT customer_id,
COUNT ( DISTINCT order_id ) AS total_orders,
SUM( total_amount ) AS total_spent
FROM orders
GROUP BY customer_id;
Sl no Customer_id Total_orders Total_spent
1 101 2 270.00
2 102 2 375.50
3 103 1 200.00
4 104 1 180.25
5 105 1 90.80
6 106 1 50.00
4. What are the top 3 customers in terms of total spending?

SELECT TOP 3 customer_id, SUM( total_amount ) AS total_spent


FROM orders
GROUP BY customer_id;
ORDER BY total_spent DESC;

Sl no Customer_id Total_spent
1 101 375.50
2 102 270.00
3 103 200.00

5. What is the average order value (AOV) for each product category ?

SELECT p.category,
CAST ( AVG( o.total_amount ) AS DECIMAL(10,2) AS average_order_value
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products.p ON oi.product_id = p.product_id
GROUP BY p.category;

Sl no Category Average_order_value
1 Accessories 170.27
2 Apparel 140.86

6. For each customer, find the average total aount of their orders and compare it with
the overall average totalamount of orders ( use window function )
SELECT
DISTINCT customer_id
ROUND ( AVG (total_amount ) OVER ( PARTITION BY
customer_id ), 2 ) AS AVG_order1,
ROUND ( AVG (total_amount ) OVER (), 2 ) AS Overall_Avg1,
CASE
WHEN AVG (total_amount ) OVER ( PARTITION BY
customer_id ) > AVG (total_amount ) OVER () THEN ‘Above
Avg’,
ELSE ‘Below Avg’
END AS CompareAvg
FROM orders;

Sl no Customer_id Average_order1 Overall_Avg1 CompareAvg


1 101 135 145.82 Below Avg
2 102 187.75 145.82 Above Avg
3 103 200 145.82 Above Avg
4 104 180.25 145.82 Above Avg
5 105 90.8 145.82 Below Avg
6 106 50 145.82 Below Avg

We can save the output as – right click – select all – save with headers

Stackoverflow is very good to trobleshoot


Databricks – can run sql and python in same tool
W3schools – w3schools.com
SQLZOO – sqlzoo.net/wiki/sql_tutorial
Kagle
Sql
Python/R
BI tool (advanced) – Powerbi/alteryx/tableau/looker
FAQs
1. How is SQL used in data analysis?

SQL (Structured Query Language) is a powerful tool used in data analysis for querying and
manipulating data stored in relational databases. It enables data analysts to:

 Access and extract data: SQL allows analysts to retrieve data from different tables within a
database, making it accessible for analysis.
 Data manipulation: Analysts can use SQL to clean, transform, and prepare data for analysis, such
as filtering, sorting, and aggregating data.
 Data aggregation: SQL facilitates summarization, helping calculate averages, sums, and other
statistical measures across large data sets.
 Complex queries: Through SQL, analysts can perform complex queries to analyze relationships
between different data sets, identify patterns, and extract insights.

By utilizing SQL, analysts can efficiently handle large volumes of structured data, conduct
thorough analyses, and generate actionable insights for decision-making.

2. Is SQL better than Python for data analysis?

The choice between SQL and Python for data analysis depends on the specific needs and scope
of the project. Each has its strengths:

 SQL is specifically designed for managing and querying relational databases. It is highly efficient
for data extraction, manipulation, and querying operations within databases. SQL shines when the
analysis relies heavily on structured data stored in SQL databases.
 Python is a general-purpose programming language with a rich ecosystem of libraries (such as
pandas, NumPy, and Matplotlib) for data analysis, statistical modeling, machine learning, and
visualization. It is versatile and well-suited for tasks that require complex data manipulations,
integrations with web services, or the analysis of unstructured data.

SQL might be more straightforward and efficient for data analysis, primarily involving structured
data within relational databases. However, Python would be preferable for analyses requiring
complex data manipulation, integration with other data sources, or the application of advanced
statistical or machine learning models.

3. Which SQL is best for a data analyst?

The “best” SQL for a data analyst largely depends on the specific requirements of the
organization and the nature of the projects. However, some of the most widely used SQL dialects
and database management systems include:

 PostgreSQL: Known for its open-source nature, robustness, and compliance with SQL standards.
It offers advanced features and is well-suited for complex data analysis tasks.
 MySQL: Another popular open-source option, known for its ease of use and efficiency in handling
a wide range of data analysis tasks.
 Microsoft SQL Server: Preferred in enterprise environments, offering extensive tools and features
for data analysis, along with strong integration with other Microsoft products.
 Oracle Database: Widely used in large enterprises for its scalability, reliability, and comprehensive
data management and analysis features.

Data analysts should choose a system that meets their project’s needs, considering data
complexity, scalability requirements, and existing infrastructure.

4. Which is faster, Python or SQL?


Regarding processing speed, particularly for data retrieval and manipulation within databases,
SQL often outperforms Python. This is because SQL queries are executed directly on the
database server, which is optimized for these types of operations. SQL is designed for data
manipulation and retrieval, making it highly efficient.

While extremely versatile and powerful for data analysis, Python manipulates data using libraries
like pandas. These operations occur in memory and can become slower as the volume of data
increases, especially compared to SQL operations optimized by the database engine itself.

However, overall performance also depends on the complexity of the task at hand, the size of the
data, and how well the SQL queries or Python scripts are optimized. SQL is generally faster for
data retrieval and manipulation tasks. Python’s ecosystem provides a broader range of
capabilities that may not be directly comparable to SQL’s performance for more complex data
analysis involving statistical models or machine learning.

You might also like