0% found this document useful (0 votes)
8 views10 pages

Python Map

The document provides a comprehensive cheat sheet for data wrangling in Pandas, covering advanced indexing, merging, reshaping, and handling missing data. It also includes sections on data visualization techniques and DAX functions for Power BI, offering examples and use cases for various chart types and statistical functions. This resource serves as a quick reference for data scientists and analysts working with Python and Power BI.

Uploaded by

saifou.djenad
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
8 views10 pages

Python Map

The document provides a comprehensive cheat sheet for data wrangling in Pandas, covering advanced indexing, merging, reshaping, and handling missing data. It also includes sections on data visualization techniques and DAX functions for Power BI, offering examples and use cases for various chart types and statistical functions. This resource serves as a quick reference for data scientists and analysts working with Python and Power BI.

Uploaded by

saifou.djenad
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

> Advanced Indexing Also see NumPy Arrays > Combining Data

Python For Data Science


Selecting
>>> [Link][:,(df3>1).any()] #Select cols with any vals >1

Data Wrangling in Pandas Cheat Sheet >>>


>>>
>>>
[Link][:,(df3>1).all()] #Select cols with vals > 1

[Link][:,[Link]().any()] #Select cols with NaN

[Link][:,[Link]().all()] #Select cols without NaN

Learn Data Wrangling online at [Link] Indexing With isin()


>>> df[([Link]([Link]))] #Find same elements

>>> [Link](items=”a”,”b”]) #Filter on values


Merge
>>> [Link](lambda x: not x%5) #Select specific elements

Where >>> [Link](data1,

data2,

> Reshaping Data >>> [Link](s > 0) #Subset the data

Query
how='left',

on='X1')

>>> [Link]('second > first') #Query DataFrame


Pivot >>> [Link](data1,

data2,

>>> df3= [Link](index='Date', #Spread rows into columns


Setting/Resetting Index how='right',

on='X1')
columns='Type',

values='Value') >>> df.set_index('Country') #Set the index

>>> df4 = df.reset_index() #Reset the index


>>> [Link](data1,

>>> df = [Link](index=str, #Rename


data2,

DataFrame columns={"Country":"cntry",
how='inner',

"Capital":"cptl",
on='X1')
"Population":"ppltn"})
>>> [Link](data1,

Reindexing data2,

how='outer',

on='X1')
Pivot Table >>> s2 = [Link](['a','c','d','e','b'])

Forward Filling Backward Filling


>>> df4 = pd.pivot_table(df2, #Spread rows into

columns values='Value',
>>> [Link](range(4),
>>> s3 = [Link](range(5),

index='Date',
method='ffill') method='bfill') Join
columns='Type']) Country Capital Population
0 3

0 Belgium Brussels 11190846


1 3
>>> [Link](data2, how='right')
1 India New Delhi 1303171035
2 3

Stack / Unstack 2 Brazil Brasília 207847528


3 3

3 Brazil Brasília 207847528 4 3


Concatenate
>>> stacked = [Link]() #Pivot a level of column labels

>>> [Link]() #Pivot a level of index labels


MultiIndexing Vertical
>>> [Link](s2)
>>> arrays = [[Link]([1,2,3]),

[Link]([5,4,3])]
Horizontal/Vertical
>>> df5 = [Link]([Link](3, 2), index=arrays)

>>> [Link]([s,s2],axis=1, keys=['One','Two'])

>>> tuples = list(zip(*arrays))

>>> [Link]([data1, data2], axis=1, join='inner')


>>> index = [Link].from_tuples(tuples,

names=['first', 'second'])

>>> df6 = [Link]([Link](3, 2), index=index)

Melt >>> df2.set_index(["Date", "Type"])

> Dates
> Duplicate Data
>>> [Link](df2, #Gather columns into rows

id_vars=["Date"],

value_vars=["Type", "Value"],
>>> df2['Date']= pd.to_datetime(df2['Date'])

value_name="Observations") >>> df2['Date']= pd.date_range('2000-1-1',

>>> [Link]() #Return unique values


periods=6,

>>> [Link]('Type') #Check duplicates


freq='M')

>>> df2.drop_duplicates('Type', keep='last') #Drop duplicates


>>> dates = [datetime(2012,5,1), datetime(2012,5,2)]

>>> [Link]() #Check index duplicates >>> index = [Link](dates)

>>> index = pd.date_range(datetime(2012,2,1), end, freq='BM')

> Grouping Data


> Visualization Also see Matplotlib
Aggregation
> Iteration >>> [Link](by=['Date','Type']).mean()

>>> import [Link] as plt


>>> [Link]()
>>> [Link]()

>>> [Link](level=0).sum()

>>> [Link](level=0).agg({'a':lambda x:sum(x)/len(x), 'b': [Link]}) >>> [Link]() >>> [Link]()


>>> [Link]() #(Column-index, Series) pairs

>>> [Link]() #(Row-index, Series) pairs


Transformation
>>> customSum = lambda x: (x+x%2)

> Missing Data


>>> [Link](level=0).transform(customSum)

>>> [Link]() #Drop NaN values

>>> [Link]([Link]()) #Fill NaN values with a predetermined value

>>> [Link]("a", "f") #Replace values with others


Learn Data Skills Online at [Link]
> Part-to-whole charts
Pie chart Donut pie chart Heat maps Stacked column chart Treemap charts

The Data Visualization Cheat Sheet

Learn Data Visualization online at [Link] One of the most common ways to
show part to whole data. It is also
The donut pie chart is a variant of the
pie chart, the difference being it has a
Heatmaps are two-dimensional charts
that use color shading to represent
Best to compare subcategories within
categorical data. Can also be used to
2D rectangles whose size is
proportional to the value being
commonly used with percentages hole in the center for readability data trends. compare percentages measured and can be used to display
hierarchically structured data

Use cases Use cases Use cases Use cases Use cases

How to use this cheat sheet


Voting preference by age grou Android OS market shar Average monthly temperatures Quarterly sales per regio Grocery sales count with
Market share of cloud providers Monthly sales by channel across the year Total car sales by producer categorie
Departments with the highest Stock price comparison by
amount of attrition over time industry and company
Use this cheat sheet for inspiration when making your next data visualizations. For more data visualization cheat sheets,
check out our cheat sheets repository here.

> Capture a trend > Visualize a single value > Capture distributions
Line chart Multi-line chart Area chart Stacked area chart Spline chart Card Table chart Gauge chart Histogram Box plot Violin plot Density plot

$7.47M
Total Sales

Cards are great for showing Best to be used on small This chart is often used in Shows the distribution of a Shows the distribution of a A variation of the box plot.
Visualizes a distribution by
The most straightforward way to Captures multiple numeric Shows how a numeric value Most commonly used variation of Smoothened version of a line chart. and tracking KPIs in datasets, it displays tabular executive dashboard reports variable. It converts variable using 5 key It also shows the full using smoothing to allow
capture how a numeric variable is variables over time. It can include progresses by shading the area area charts, the best use is to track It differs in that data points are dashboards or presentations data in a table
to show relevant KPIs numerical data into bins as summary statistics— distribution of the data smoother distributions and
changing over time multiple axes allowing comparison between line and the x-axis the breakdown of a numeric value connected with smoothed curves columns. The x-axis shows minimum, first quartile, alongside summary statistics better capture the
of different units and scale ranges by subgroups to account for missing values, as the range, and the y-axis median, third quartile, and distribution shape of the data
opposed to straight lines represents the frequency maximum

Use cases Use cases Use cases Use cases Use cases Use cases Use cases Use cases Use cases Use cases Use cases Use cases

Revenue in $ over tim Apple vs Amazon stocks Total sales over tim Active users over time by Electricity consumption over Revenue to date on a Account executive NPS score Distribution of salaries in Gas efficiency of vehicle Time spent in restaurants Distribution of price of
Energy consumption in kWh over tim Active users over time segmen tim sales dashboar leaderboard Revenue to target an organizatio Time spent reading across across age group hotel listing
over tim Lebron vs Steph Curry Total revenue over time by CO2 emissions over time Total sign-ups after a Registrations per webinar Distribution of height in readers Length of pill effects by Comparing NPS scores by
Google searches over time searches over tim country promotion one cohort dose customer segment
Bitcoin vs Ethereum price
over time

> Visualize relationships > Visualize a flow


Bar chart Column chart Scatter plot Connected scatterplot Bubble chart Word cloud chart Sankey chart Chord chart Network chart

Data Analyst
Science
Engineer

One of the easiest charts to Also known as a vertical bar Most commonly used chart A hybrid between a scatter Often used to visualize data A convenient visualization for Useful for representing flows in Useful for presenting Similar to a graph, it
read which helps in quick
comparison of categorical
chart, where the categories
are placed on the x-axis.
when observing the
relationship between two
plot and a line plot, the
scatter dots are connected
points with 3 dimensions,
namely visualized on the x-
visualizing the most prevalent
words that appear in a text
systems. This flow can be any
measurable quantity

weighted relationships or
flows between nodes.
consists of nodes and
interconnected edges. It
Learn Data Skills Online at
data. One axis contains These are preferred over bar variables. It is especially with a line axis, y-axis, and with the size Especially useful for illustrates how different [Link]
categories and the other axis charts for short labels, date useful for quickly surfacing of the bubble. It tries to show highlighting the dominant or items have relationships
represents values ranges, or negatives in values potential correlations relations between data points important flows
with each other
between data points using location and size

Use cases Use cases Use cases Use cases Use cases Use cases Use cases Use cases Use cases

Volume of google Brand market shar Display the relationship Cryptocurrency price Adwords analysis: CPC vs Top 100 used words by Energy flow between Export between countries How different airports are
searches by regio Profit Analysis by region between time-on-platform inde Conversions vs Share of customers in customer countrie to showcase biggest connected worldwide
Market share in revenue and chur Visualizing timelines and total conversion service tickets Supply chain volumes export partner Social media friend group
by product Display the relationship events when analyzing Relationship between life between warehouses Supply chain volumes analysis
between salary and years two variables expectancy, GDP per between the largest
spent at company capita, & population size warehouses
> Date & time functions > Information functions
CALENDAR(<start_date>, <end_date>) Returns a table with a single column named "Date" that COLUMNSTATISTICS() Returns statistics regarding every column in every table. This function

Power BI for Business Intelligence


contains a contiguous set of dates has no arguments
DATE(<year>, <month>, <day>) Returns the specified date in datetime format NAMEOF(<value>) Returns the column or measure name of a value
DATEDIFF(<date_1>, <date_2>, <interval>) Returns the number of units between two dates as ISBLANK(<value>) // ISERROR(<value>) Returns whether the value is blank // an error

DAX Cheat Sheet


defined in <interval>
ISLOGICAL(<value>) Checks whether a value is logical or not
DATEVALUE(<date_text>) Converts a date in text to a date in datetime format
ISNUMBER(<value>) Checks whether a value is a number or not
DAY(<date>) Returns a number from 1 to 31 representing the day of the month
ISFILTERED(<table> | <column>) Returns true when there are direct filters on a column
WEEKNUM(<date>) Returns weeknumber in the year
ISCROSSFILTERED(<table> | <column>) Returns true when there are crossfilters on a column
MONTH(<date>) Returns a number from 1 to 12 representing a month
USERPRINCIPALNAME() Returns the user principal name or email address. This function has no
QUARTER(<date>) Returns a number from 1 to 4 representing a quarter.

arguments.

> Math & statistical functions > Time intelligence functions > DAX statements
SUM(<column>) Adds all the numbers in a column
DATEADD(<dates>, <number_of_intervals>, <interval>) Moves a date by a specific interval VAR(<name> = <expression>) Stores the result of an expression as a named variable. To
SUMX(<table>, <expression>) Returns the sum of an expression evaluated for each row in a return the variable, use RETURN after the variable is defined
table DATESBETWEEN(<dates>, <date_1>, <date_2>) Returns the dates between specified dates
COLUMN(<table>[<column>] = <expression>) Stores the result of an expression as a column in
AVERAGE(<column>) Returns the average (arithmetic mean) of all the numbers in a column TOTALYTD(<expression>, <dates>[, <filter>][, <year_end_date>]) Evaluates the year-to-date a table.
value of the expression in the current context
AVERAGEX(<table>, <expression>) Calculates the average (arithmetic mean) of a set of ORDER BY(<table>[<column>]) Defines the sort order of a column. Every column can be sorted
expressions evaluated over a table SAMEPERIODLASTYEAR(<dates>) Returns a table that contains a column of dates shifted one in ascending (ASC) or descending (DESC) way.
year back in time
MEDIAN(<column>) Returns the median of a column
STARTOFMONTH(<dates>) // ENDOFMONTH(<dates>) Returns the start // end of the month
MEDIANX(<table>, <expression>) Calculates the median of a set of expressions evaluated
over a table STARTOFQUARTER(<dates>) // ENDOFQUARTER(<dates>) Returns the start // end of the quarter
GEOMEAN(<column>) Calculates the geometric mean of a column
GEOMEANX(<table>, <expression>) Calculates the geometric mean of a set of expressions
STARTOFYEAR(<dates>) // ENDOFYEAR(<dates>) Returns the start // end of the quarter.

> DAX Operators


evaluated over a table
COUNT(<column>) Returns the number of cells in a column that contain non-blank values
COUNTX(<table>, <expression>) Counts the number of rows from an expression that evaluates
> Relationship functions
to a non-blank value Comparison operators 3 Meaning
DIVIDE(<numerator>, <denominator> [,<alternateresult>]) Performs division and returns CROSSFILTER(<left_column>, <right_column>, <crossfiltertype>) Specifies the cross-filtering
alternate result or BLANK() on division by 0 direction to be used in a calculation
MIN(<column>) Returns a minimum value of a column RELATED(<column>) Returns a related value from another table.
= Equal to
MAX(<column>) Returns a maximum value of a column

> Table manipulation functions


COUNTROWS([<table>]) Counts the number of rows in a table
= = Strict equal to
DISTINCTCOUNT(<column>) Counts the number of distinct values in a column
RANKX(<table>, <expression>[, <value>[, <order>[, <ties>]]]) Returns the ranking of a
number in a list of numbers for each row in the table argument.
SUMMARIZE(<table>, <groupBy_columnName>[, <groupBy_columnName>]…[, <name>, <expression>]…)
> Greater than
Returns a summary table for the requested totals over a set of groups
DISTINCT(<table>) Returns a table by removing duplicate rows from another table or
expression
< Smaller than

> Filter functions ADDCOLUMNS(<table>, <name>, <expression>[, <name>, <expression>]…) Adds calculated columns
to the given table or table expression
SELECTCOLUMNS(<table>, <name>, <expression>[, <name>, <expression>]…) Selects calculated
columns from the given table or table expression
> = Greater than or equal to

FILTER(<table>, <filter>) Returns a table that is a subset of another table or


expression GROUPBY(<table> [, <groupBy_columnName>[, [<column_name>] [<expression>]]…) Create a
summary of the input table grouped by specific columns
CALCULATE(<expression>[, <filter1> [, <filter2> [, …]]]) Evaluates an expression in a = < Smaller than or equal to
filter context INTERSECT(<left_table>, <right_table>) Returns the rows of the left-side table that appear
in the right-side table
HASONEVALUE(<columnName>) Returns TRUE when the context for columnName has been filtered
NATURALINNERJOIN(<left_table>, <right_table>) Joins two tables using an inner join
down to one distinct value only. Otherwise it is FALSE < > Not equal to
ALLNOBLANKROW(<table> | <column>[, <column>[, <column>[,…]]]) Returns a table that is a NATURALLEFTOUTERJOIN(<left_table>, <right_table>) Joins two tables using a left outer join
subset of another table or expression
UNION(<table>, <table>[, <table> [,…]]) Returns the union of tables with matching columns.

ALL([<table> | <column>[, <column>[, <column>[,…]]]]) Returns all the rows in a table, or


all the values in a column, ignoring any filters that might have been applied
ALLEXCEPT(<table>, <column>[, <column>[,..]]) Returns all the rows in a table except for Text operator Meaning 3 Example
> Text functions
those rows that are affected by the specified column filters
REMOVEFILTERS([<table> | <column>][, <column>[, <column>[,…]]]]) Clear all filters from
designated tables or columns.
Concatenates
Concatenates text values | [City]&",
& text values "&[State]

EXACT(<text_1>, <text_2>) Checks if two strings are identical (EXACT() is case sensitive).
FIND(<text_tofind>, <in_text>) Returns the starting position a text within another text

> Logical functions


(FIND() is case sensitive)
FORMAT(<value>, <format>) Converts a value to a text in the specified number format Logical 3 Example
LEFT(<text>, <num_chars>) Returns the number of characters from the start of a string. operator Meaning
IF(<logical_test>, <value_if_true>[, <value_if_false>]) Checks a condition, and returns a RIGHT(<text>, <num_chars>) Returns the number of characters from the end of a string
certain value depending on whether it is true or false ([City] = "Bru") && ([Return] =
LEN(<text>) Returns the number of characters in a string of text && AND condition "Yes"))
AND(<logical 1>, <logical 2>) Checks whether both arguments are TRUE, and returns TRUE if
both arguments are TRUE. Otherwise, it returns FALSE LOWER(<text>) Converts all letters in a string to lowercase
([City] = "Bru") || ([Return] =
OR(<logical 1>, <logical 2>) Checks whether one of the arguments is TRUE to return TRUE. UPPER(<text>) Converts all letters in a string to uppercase
The function returns FALSE if both arguments are FALSE || OR condition "Yes"))
TRIM(<text>) Remove all spaces from a text string
NOT(<logical>) Changes TRUE to FALSE and vice versa
CONCATENATE(<text_1>, <text_2>) Joins two strings together into one string OR condition
Product[Color] IN {"Red", "Blue",
SWITCH(<expression>, <value>, <result>[, <value>, <result>]…[, <else>]) Evaluates an
expression against a list of values and returns one of possible results SUBSTITUTE(<text>, <old_text>, <new_text>, <instance_num>) Replaces existing text with new
IN {} for each row "Gold"}
text in a string
IFERROR(<value>, <value_if_error>) Returns value_if_error if the first expression is an
error and the value of the expression itself otherwise.

REPLACE(<old_text>, <start_posotion>, <num_chars>, <new_text>) Replaces part of a string with


a new string.

Learn Data Skills Online at [Link]


Joining Data in SQL RIGHT JOIN UNION ALL
Cheat Sheet
A right join keeps all of the original left_table right_table result after RIGHT JOIN
The UNION ALL operator works just like UNION, but it SELECT artist_id

records in the right table and returns id left_val id right_val id left_val right_val
returns duplicate values. The same restrictions of UNION FROM artist

missing values for any columns from 1 L1 1 R1 1 L1 R1 hold true for UNION ALL UNION ALL

the left table where the joining field


Learn SQL online at [Link] did not find a match. Right joins are 2 L2 4 R2 4 L2 R2 SELECT artist_id

Result after
FROM album;
far less common than left joins, 3 L3 5 R3 5 null R3
UNION ALL
because right joins can always be re- 4 L4 6 R4 6 null R4

> Definitions used throughout this cheat sheet


written as left joins. left right
id val Result after UNION ALL:
1 A
Result after RIGHT JOIN: id val id val 1 A
artist_id
RIGHT JOIN on one field 1 A 1 A 1
artist_id name album_id title name 1 B
2
Primary key:
Foreign key:
SELECT *
1 AC/DC 1 For those who rock 1 1 B 4 A 2 A 3
A primary key is a field in a table that uniquely identifies A foreign key is a field in a table which references the FROM artist as art
1 Aerosmith 2 Dream on 2 2 A 5 A 3 A 1
each record in the table. In relational databases, primary primary key of another table. In a relational database, RIGHT JOIN album AS alb
2 Aerosmith 3 Restless and wild 2 3 A 6 A 4 A 2
keys can be used as fields to join tables on. one way to join two tables is by connecting the foreign ON art.artist_id = alb.artist_id; 2 AC/DC 4 Let there be rock 1 4 A 4 A 2
key from one table to the primary key of another. 3 null 5 Rumours 6 5 A 1

FULL JOIN
6 A 6
One-to-one relationship:
One-to-many relationship:

INTERSECT
Database relationships describe the relationships In a one-to-many relationship, a record in one table can
between records in different tables. When a one-to-one be related to one or more records in a second table. A full join combines a left join and result after FULL JOIN
relationship exists between two tables, a given record in However, a given record in the second table will only be right join. A full join will return all left_table right_table id left_val right_val
one table is uniquely related to exactly one record in the related to one record in the first table. records from a table, irrespective of id left_val id right_val
1 L1 R1
other table. whether there is a match on the 1 L1 1 R1
2 L2 null
The INTERSECT operator returns only identical rows from two tables. SELECT artist_id

joining field in the other table, 2 L2 4 R2 FROM artist

Many-to-many relationship:
3 L3 null left_table right_table INTERSECT

returning null values accordingly. 3 L3 5 R3


4 L4 R2 Result after

In a many-to-many relationship, records in a given table ‘A’ can be related to one or more records in another table ‘B’,
4 L4 6 R4 id val id val INTERSECT SELECT artist_id

and records in table B can also be related to many records in table A. 5 null R3
FROM album;
6 null R4 1 N1 1 N1 id val

> Sample Data FULL JOIN on one field


Result after FULL JOIN:
artist_id
1
name
AC/DC
album_id
1
title
For those who rock
name
1
1 N1
3 L3
4 R2
5 R3
1 N1 Result after INTERSECT:
artist_id
1
SELECT *
4 L4 6 R4 2
Artist Table Album Table FROM artist as art
1 AC/DC 4 Let there be rock 1
2 Aerosmith 2 Balls to the wall 2
EXCEPT
artist_id name album_id title artist_id FULL OUTER JOIN album AS alb

1 AC/DC 1 For those who rock 1 ON art.artist_id = alb.artist_id; 2 Aerosmith 3 Restless and wild 2
2 Aerosmith 2 Dream on 2 3 Alanis Morissette null null null
3 Alanis Morissette 3 Restless and wild 2 null null 5 Rumours 6
The EXCEPT operator returns only those rows from SELECT artist_id

CROSS JOIN
4 Let there be rock 1 the left table that are not present in the right table.
5 Rumours 6
FROM artist

EXCEPT

SELECT artist_id

INNER JOIN CROSS JOIN creates all possible combinations of two


tables. CROSS JOIN does not require a field to join ON.
SELECT name, title

FROM artist

left_table
id val
right_table
id val
Result after

EXCEPT
FROM album;
An inner join between two tables will left_table right_table CROSS JOIN album; Result after EXCEPT:
return only records where a joining 1 N1 1 N1 id val artist_id
id left_val id right_val result after
Result after CROSS JOIN:
field, such as a key, finds a match in result after INNER JOIN
CROSS JOIN 1 N1 4 R2 3 L3 3
both tables. 1 L1 1 R1
name title
id left_val right_val
3 L3 5 R3
2 L2 4 R2 id1 id2 AC/DC For those who rock 4 L4
1 L1 R1 table 1 table 2
3 L3 5 R3 1 A AC/DC Dream on 4 L4 6 R4
4 L4 R2 id1 id AC/DC Restless and wild
4 L4 6 R4 1 B
A2
SEMI JOIN
1 AC/DC Let there be rock
1 C AC/DC Rumours
INNER JOIN join ON one field 2 B Aerosmith For those who rock
2 A
SELECT *
3 Aerosmith Dream on
FROM artist AS art
Result after INNER JOIN:
C
2 B Aerosmith Restless and wild
A semi join chooses records in the first table where a SELECT *

Aerosmith Let there be rock


condition is met in the second table. A semi join makes FROM album

INNER JOIN album AS alb

artist_id name title album_id 2 C use of a WHERE clause to use the second table as a filter WHERE artist_id IN

ON art.artist_id = alb.artist_id; Aerosmith Rumours


for the first.
1 AC/DC For those who rock 1 3 A (SELECT artist_id

Alanis Morissette For those who rock


INNER JOIN with USING 1 AC/DC Let there be rock 4
3 B Alanis Morissette Dream on
FROM artist);
2 Aerosmith Dream on 2 left_table right_table
Result after

SELECT *
Alanis Morissette Restless and wild SEMI JOIN
2 Aerosmith Restless and wild 3 3 C
Result after Semi join:
FROM artist AS art
Alanis Morissette Let there be rock id col1 col2 id col1
INNER JOIN album AS alb
Alanis Morissette Rumours album_id title artist_id
USING (artist_id); 1 A B 2 B 1 For those who rock 1

SELF JOIN Set Theory Operators in SQL 2 B C 3 C


2
4
Dream on
Let there be rock
2
1
3 C 3 Restless and wild 2
Self-joins are used to compare values in a table to other values of the same table by joining different parts
of a table together. 4 D
SELECT

alb1.artist_id,

[Link] AS alb1_title,

Result after Self join:


artist_id name album_id alb2_title
UNION

UNION ALL
INTERSECT EXCEPT
ANTI JOIN
[Link] AS alb2_title

UNION
1 AC/DC 1 For those who rock
FROM album AS alb1
2 Aerosmith 2 Dream on The anti join chooses records in the first table where a SELECT *

INNER JOIN album AS alb2


2 Aerosmith 3 Restless and wild condition is NOT met in the second table. It makes use of FROM album

ON alb1.artist_id = alb2.artist_id
1 AC/DC 4 Let there be rock a WHERE clause to use exclude values from the second WHERE artist_id NOT IN

WHERE alb1.album_id<>alb2.album_id; The UNION operator is used to vertically combine the results SELECT artist_id
table. (SELECT artist_id

of two SELECT statements. For UNION to work without errors, FROM artist
FROM artist);
LEFT JOIN all SELECT statements must have the same number of
columns and corresponding columns must have the same
UNION

SELECT artist_id
left_table right_table
Left table after

ANTI JOIN
Result after Anti join:
A left join keeps all of the original left_table right_table result after LEFT JOIN data type. UNION does not return duplicates. FROM album; id col1 col2 id col1
records in the left table and returns id left_val id right_val id left_val album_id title artist_id
missing values for any columns from
right_val
Result after UNION Result after UNION: 1 A B 1 A 5 Rumours 6
1 L1 1 R1 1 L1 R1
the right table where the joining field id val artist_id 2 B C 4 D
did not find a match.
2 L2 4 R2 2 L2 null left right 1
1 A
3 L3 5 R3 3 L3 null id val id val 2 3 C
1 B
4 L4 6 R4 4 L4 R2 1 A 1 A 3 4 D
2 A 6
1 B 4 A
Result after LEFT JOIN: 3 A
2 A 5 A
LEFT JOIN on one field artist_id name album_id title name 4 A
SELECT *
3 A 6 A
1 AC/DC 1 For those who rock 1 5 A
FROM artist AS art
1 AC/DC 4 Let there be rock 1 4 A 6 A
LEFT JOIN album AS alb

ON art.artist_id = alb.artist_id;
2
2
Aerosmith
Aerosmith
2
3
Dream on
Restless and wild
2
2 Learn Data Skills Online at [Link]
3 Alanis Morissette null null null
Descriptive Statistics
Cheat Sheet > Numerical Dataset—Glasses of Water V isualizing Numeric Variables

There are a variety of ways of visualizing numerical data, here’s a few of them in action:
earn more online at [Link]
L

300 ml 60ml 300 ml 120 ml 180 ml 180 ml 300 ml Histogram Box plot

Median
To illustrate statistical concepts on numerical data, we’ll be using a numerical
variable, consisting of the volume of water in different glasses.

> Key Definitions


M inimum M aximum
0 300 Q 1 3
Q

M easures of Center Shows the distribution of a variable. It converts numerical Shows the distribution of a variable using 5 key summary
data into bins as columns. The x-axis shows the range, and statistics—minimum, first quartile, median, third quartile,
Throughout this cheat sheet, you’ll find terms and specific statistical jargon being used. Here’s a rundown of all the the y-axis represents the frequency and maximum
terms you may encounter. M easures of center allow you to describe or summarize your data by capturing one value that describes the center of
its distribution.
Variable: In statistics, a variable is a quantity that can be measured or counted. In data analysis, a variable is
typically a column in a data frame
Descriptive statistics: Numbers that summarize variables. They are also called summary statistics or aggregations M easure Definition H ow to find it R esult
Categorical data: Data that consists of discrete groups. The categories are called ordered (e.g., educational levels)
if you can sort them from lowest to highest, and unordered otherwise (e.g., country of origin)
Arithmetic mean The total of the values
divided by how many
)
) 205.7 ml > Correlation
Numerical data: Data that consists of numbers (e.g., age).
values there are 7

S trong negative Weak negative No correlation Weak positive S trong positive


M edian The middle value, when 180 ml
sorted from smallest to 180 ml
largest

> Categorical Data—Trail Mix M ode The most common value 300 ml
300 ml 300 ml 300 ml

Correlation is a measure of the linear relationship between two variables. That is, when one variable goes up, does the
To illustrate statistical concepts on categorical data, we’ll be using an unordered
categorical variable, consisting different elements of a trail mix. Our categorical
Other Measures of Location other variable go up or down? There are several algorithms to calculate correlation, but it is always a score between -1
and +1.

variable contains 15 almonds, 13 cashews, and 25 cranberries.


There are other measures that you can use, that can help better describe or summarize your data. For two variables, X and Y, correlation has the following interpretation:

M easure Definition H ow to find it R esult Correlation score Interpretation

1 When X increases, Y decreases. Scatter plot forms a perfect straight line with negative slope
Counts and Proportions inimum The lowest value in your 60 ml -
M

data 60 ml B etween -1 and 0 When X increases, Y decreases

Counts and proportions are measures of how much data you have. They allow you to understand how many data 0 There is no linear relationship between X and Y, so the scatter plot looks like a noisy mess
M aximum The highest value in your 300 ml
points belong to different categories in your data.
data 300 ml B etween 0 and +1 When X increases, Y increases
A count is the number of times a data point occurs in the dataset
A proportion is the fraction of times a data point occurs in the dataset. +1 When X increases, Y increases. Scatter plot forms a perfect straight line with positive slope
Percentile: Cut points that divide the data into 100 intervals with the same amount of data in each interval (e.g., in
the water cup example, the 100th percentile is 300 ml Note that correlation does not account for non-linear effects, so if X and Y do not have a straight-line relationship,
Food category Count Proportion
Quartile: Similar to the concept of percentile, but with four intervals rather than 100. The first quartile is the same the correlation score may not be meaningful.
as the 25th percentile, which is 120 ml. The third quartile is the same as the 75th percentile, which is 300 ml.
Almond 15 15 / 53 = 0.283

Cashew 13 13 / 53 = 0.245
M easures of Spread
Cranberry 25 25 / 53 = 0.472

Sometimes, rather than caring about the size of values, you care about how different they are.

V isualizing Categorical Variables


M easure Definition H ow to find it R esult

Range The highest value minus 240 ml


Bar plot Stacked bar chart Treemap chart
the lowest value
300 ml 60 ml

Variance The sum of the squares of )


- ) + ... +
2
)
- )
2
9428.6 ml
2

L earn Data Skills Online at


the differences between
each value and the mean,
60ml Mean

(7 - 1)
300 ml Mean
[Link]
all divided by one less
than the number of data
One of the easiest charts to read Best to compare subcategories within 2D rectangles whose size is
points
which helps in quick comparison of categorical data. Can also be used to proportional to the value being

categorical data. One axis contains compare proportions measured and can be used to display
Inter-quartile range The third quartile minus 180 ml
categories and the other axis hierarchically structured data
the first quartile
represents values 300 ml 120 ml
> Getting started with lists > Getting started with characters and strings
A list is an ordered and changeable sequence of elements. It can hold integers, characters, floats, strings, and even objects.
# Create a string with double or single quotes


Python Basics Creating lists


"DataCamp"

# Embed a quote in string with the escape character \

"He said, \"DataCamp\""

Getting started with Python Cheat Sheet # Create lists with [], elements separated by commas

x = [1, 3, 2]

# Create multi-line strings with triple quotes

"""

Learn Python online at [Link] List functions and methods A Frame of Data

Tidy, Mine, Analyze It

[Link](x) # Return a sorted copy of the list e.g., [1,2,3]


Now You Have Meaning

[Link]() # Sorts the list in-place (replaces x)


Citation: [Link]

> How to use this cheat sheet reversed(x) # Reverse the order of elements in x e.g., [2,3,1]

[Link]() # Reverse the list in-place

"""

[Link](2) # Count the number of element 2 in the list


str[0] # Get the character at a specific position

Python is the most popular programming language in data science. It is easy to learn and comes with a wide array of str[0:2] # Get a substring from starting to ending index (exclusive)

powerful libraries for data analysis. This cheat sheet provides beginners and intermediate users a guide to starting
using python. Use it to jump-start your journey with python. If you want more detailed Python cheat sheets, check out Selecting list elements
the following cheat sheets below:
Combining and splitting strings
Python lists are zero-indexed (the first element has index 0). For ranges, the first element is included but the last is not.

# Define the list


"Data" + "Framed" # Concatenate strings with +, this returns 'DataFramed'

x = ['a', 'b', 'c', 'd', 'e']


x[1:3] # Select 1st (inclusive) to 3rd (exclusive)
3 * "data " # Repeat strings with *, this returns 'data data data '

x[0] # Select the 0th element in the list


x[2:] # Select the 2nd to the end
"beekeepers".split("e") # Split a string on a delimiter, returns ['b', '', 'k', '', 'p', 'rs']

x[-1] # Select the last element in the list


x[:3] # Select 0th to 3rd (exclusive)

Mutate strings
Importing data in python Data wrangling in pandas

Concatenating lists
str = "Jack and Jill" # Define str

# Define the x and y lists


x + y # Returns [1, 3, 6, 10, 15, 21]

> Accessing help and getting object types x = [1, 3, 6]

y = [10, 15, 21]

3 * x # Returns [1, 3, 6, 1, 3, 6, 1, 3, 6]
[Link]() # Convert
[Link]() # Convert
a
a
string to uppercase, returns 'JACK AND JILL'

string to lowercase, returns 'jack and jill'

[Link]() # Convert a string to title case, returns 'Jack And Jill'

1 + 1 # Everything after the hash symbol is ignored by Python


[Link]("J", "P") # Replaces matches of a substring with another, returns 'Pack and Pill'

help(max) # Display the documentation for the max function

type('a') # Get the type of an object — this returns str > Getting started with dictionaries
A dictionary stores data values in key-value pairs. That is, unlike lists which are indexed by position, dictionaries are indexed
> Getting started with DataFrames
> Importing packages by their keys, the names of which must be unique.
Pandas is a fast and powerful package for data analysis and manipulation in python. To import the package, you can
use import pandas as pd. A pandas DataFrame is a structure that contains two-dimensional data stored as rows and
Python packages are a collection of useful tools developed by the open-source community. They extend the
Creating dictionaries columns. A pandas series is a structure that contains one-dimensional data.

capabilities of the python language. To install a new package (for example, pandas), you can go to your command
prompt and type in pip install pandas. Once a package is installed, you can import it as follows.

# Create a dictionary with {}

{'a': 1, 'b': 4, 'c': 9}


Creating DataFrames
import pandas # Import a package without an alias

import pandas as pd # Import a package with an alias

from pandas import DataFrame # Import an object from a package

Dictionary functions and methods # Create a dataframe from a dictionary

[Link]({

# Create a dataframe from a list of dictionaries

[Link]([

'a': [1, 2, 3],


{'a': 1, 'b': 4, 'c': 'x'},

x = {'a': 1, 'b': 2, 'c': 3} # Define the x ditionary


'b': [Link]([4, 4, 6]),
{'a': 1, 'b': 4, 'c': 'x'},

[Link]() # Get the keys of a dictionary, returns dict_keys(['a', 'b', 'c'])


'c': ['x', 'x', 'y']
{'a': 3, 'b': 6, 'c': 'y'}

> The working directory [Link]() # Get the values of a dictionary, returns dict_values([1, 2, 3])
}) ])

Selecting dictionary elements Selecting DataFrame Elements


The working directory is the default file path that python reads or saves files into. An example of the working directory
is ”C://file/path". The os library is needed to set and get the working directory.

x['a'] # 1 # Get a value from a dictionary by specifying the key


Select a row, column or element from a dataframe. Remember: all positions are counted from zero, not one.

import os # Import the operating system package


# Select the 3rd row

[Link]() # Get the current directory



[Link][3]

[Link]("new/working/directory") # Set the working directory to a new file path


> NumPy arrays # Select one column by name

df['col']

# Select multiple columns by names

> Operators NumPy is a python package for scientific computing. It provides multidimensional array objects and efficient operations
on them. To import NumPy, you can run this Python code import numpy as np

df[['col1', 'col2']]

# Select 2nd column

[Link][:, 2]

Arithmetic operators Creating arrays


# Select the element in the 3rd row, 2nd column

[Link][3, 2]

102 + 37 # Add two numbers with +


22 // 7 # Integer divide a number with //

# Convert a python list to a NumPy array



102 - 37 # Subtract a number with -

4 * 6 # Multiply two numbers with *

3 ^ 4 # Raise to the power with ^

22 % 7 # Returns 1 # Get the remainder after [Link]([1, 2, 3]) # Returns array([1, 2, 3])
Manipulating DataFrames
22 / 7 # Divide a number by another with /
division with %
# Return a sequence from start (inclusive) to end (exclusive)

[Link](1,5) # Returns array([1, 2, 3, 4])
# Concatenate DataFrames vertically
# Calculate the mean of each column

# Return a stepped sequence from start (inclusive) to end (exclusive)
 [Link]([df, df])
[Link]()

Assignment operators [Link](1,5,2) # Returns array([1, 3])


# Concatenate DataFrames horizontally
# Get summary statistics by column

# Repeat values n times
 [Link]([df,df],axis="columns")


[Link](aggregation_function)

a = 5 # Assign a value to a
[Link]([1, 3, 6], 3) # Returns array([1, 1, 1, 3, 3, 3, 6, 6, 6])
# Get rows matching a condition
# Get unique rows

x[0] = 1 # Change the value of an item in a list # Repeat values n times


[Link]('logical_condition')
df.drop_duplicates()

[Link]([1, 3, 6], 3) # Returns array([1, 3, 6, 1, 3, 6, 1, 3, 6])


# Drop columns by name
# Sort by values in a column

Numeric comparison operators [Link](columns=['col_name'])

# Rename columns

df.sort_values(by='col_name')

# Get rows with largest values in a column

3 == 3 # Test for equality with ==


3 >= 3 # Test greater than or equal to with >=

> Math functions and methods [Link](columns={"oldname": "newname"})

# Add a new column

[Link](n, 'col_name')

3 != 3 # Test for inequality with !=


3 < 4 # Test less than with <
[Link](temp_f=9 / 5 * df['temp_c'] + 32)
All functions take an array as the input.
3 > 1 # Test greater than with >
3 <= 4 # Test less than or equal to with <=
[Link](x) # Calculate logarithm
[Link](x, q) # Calculate q-th quantile

[Link](x) # Calculate exponential


[Link](x, n) # Round to n decimal places

Logical operators [Link](x) # Get maximum value


[Link](x) # Calculate variance

[Link](x) # Get minimum value


[Link](x) # Calculate standard deviation

not(2 == 2) # Logical NOT with not


(1 >= 1) or (1 < 1) # Logical OR with or
[Link](x) # Calculate sum

(1 != 1) and (1 < 1) # Logical AND with and [Link](x) # Calculate mean


Reshaping Data with > Working with indexes > Exploding and normalizing

pandas in Python # Move columns to the index with .set_index()

movies_indexed = movies.set_index("title")

# Expand list columns with .explode()

# Vectors inside the lists are given their own row

# The number of columns remains unchanged

[Link]("singles")

# Move index to columns with .reset_index()

movies_indexed.reset_index()

Learn Python online at [Link]


# For dictionary columns, move items to their own columns with json_normalize()

# By default, each top-level key becomes a new column

# Replace index, left joining new index to existing data with .reindex()
pd.json_normalize(music_exploded["singles"])
avengers_index = ["The Avengers", "Avengers: Age of Ultron", "Avengers: Infinity War",
"Avengers: Endgame"]

> Content movies_indexed.reindex(avengers_index)

# Equivalent to [Link](index=avengers_index) \

> Stacking and unstacking


# .merge(movies_indexed, how="left", left_index=True, right_index=True)
Definitions
# Move (multi-)indexes from a column index to a row index with .stack()

The majority of data analysis in Python is performed in pandas DataFrames. These are rectangular datasets consisting # level argument starts with 0 for the outer index

of rows and columns


An observation contains all the values or variables related to a single instance of the objects being analyzed. For
example, in a dataset of movies, each movie would be an observation.
> Joining and splitting columns pig_feed_stacked = pig_feed.stack(level=0)

A variable is an attribute for the object, across all the observations. For example, the release dates for all the movies
# Concatenate several columns into a single string column with .[Link]()
# Move (multi-)indexes from a row index to a column index with .stack()

Tidy data provides a standard way to organize data. Having a consistent shape for datasets enables you to worry less
# Each column must be converted to string type before joining
pig_feed_stacked.unstack(level=1)
about data structures and more on getting useful results. The principles of tidy data are
movies["release_year"].astype(str) \

Every column is a variable .[Link](movies[["release_month", "release_day"]].astype(str), sep="-")


Every row is an observation


Every cell is a single value.
# Split a column on a delimiter into several columns with .[Link](expand=True)
> Converting to and from JSON
movies["directors"].[Link](",", expand=True)

import json

> Datasets used throughout this cheat sheet # Combine several columns into a list column with .[Link]()
# Convert series containing nested elements to JSON string with [Link]()

movies["release_list"] = movies[["release_year", "release_month", "release_day"]] \


json_singles = [Link](music["singles"].to_list())

Throughout this cheat sheet we will use a dataset of the top grossing movies of all time, stored as movies.
.[Link]()

title release_year release_month release_day directors box_office_busd


# Add column from JSON string with with [Link]()

Avatar 2009 12 18 James Cameron 2.922


# Split a list column into separate columns with .to_list()
music["singles2"] = [Link](json_singles)
Avengers: 2019 4 22 Anthony Russo,
2.798
Endgame Joe Russo movies[["release_year2", "release_month2", "release_day2"]] = \

Titanic 1997 11 01 James Cameron 2.202 movies["release_list"].to_list()

Star Wars Ep. 2015 12 14 J.J Abrams 2.068


VII: The Force
Awakens
> Dealing with missing data
Avengers:
Infinity War
2018 4 23 Anthony Russo,

Joe Russo
2.048
> Melting and pivoting # Drop rows containing any missing values in the specified columns with .dropna()

The second dataset involves an experiment with the number of unpopped kernels in bags of popcorn, adapted from the [Link](subset="weight_kg")

Popcorn dataset in the R's Stat2Data package. # Move side-by-side columns to consecutive rows with .melt()

[Link](id_vars="brand", var_name="trial", value_name = "n_unpopped")


brand trial_1 trial_2 trial_3 trial_4 trial_5 trial_6


# Fill missing values with a default value with .fillna()

Orville 26 35 18 14 8 6 [Link]({"weight_kg": 100})


Seaway 47 47 14 34 21 37 # Melt using row index as id_variable with .melt(ignore_index=False)

popcorn_indexed = popcorn.set_index("brand")

The third dataset is JSON data about music containing nested elements. The JSON is parsed into nested lists using popcorn_indexed.melt(var_name="trial", value_name="n_unpopped", ignore_index=False)

read_json() from the pandas package. Notice that each element in the singles column is a list of dictionaries.

artist singles
# Where there is a column multi-index, specify id_vars with a list of tuples

Bad Bunny [{'title': 'Gato de Noche',


pig_feed.melt(id_vars=[("No", "No")])

'tracks': [{'title': 'Gato de Noche', 'collaborator': 'Ñengo Flow'}]},

{'title': 'La Jumpa',

'tracks': [{'title': 'La Jumpa', 'collaborator': 'Arcángel'}]}


# Same as .melt(), plus cleanup of var_name with wide_to_long()

Drake [{'title': 'Scary Hours 2',


pd.wide_to_long(popcorn, stubnames="trial", i="brand", j="trial_no", sep="_")

'tracks': [{'title': "What's Next"},

{'title': 'Wants and Needs', 'collaborator': 'Lil Baby'},

{'title': 'Lemon Pepper Freestyle', 'collaborator': 'Rick Ross'}]}]


# Move values in from rows to columns with .pivot()
Learn Python Online at
The fourth dataset is a synthetic dataset containing attributes of people. sex is a character vector, and hair_color is a
factor.
# Reset the index to completely reverse a melting operation

popcorn_long \

[Link]
.pivot(values="n_unpopped", index="brand", columns="trial") \

sex hair_color height_cm weight_kg


.reset_index()

Female brown 166 72


Male blonde 184
Female black 153 # Move values in from rows to columns and aggregate with .pivot_table()

# df.pivot_table(values, index, columns, aggfunc) is equivalent to

Male black 192 93


# [Link]([index, columns])[values].agg(aggfunc).reset_index().pivot(index, columns)

The fifth dataset, pig_feed, shows weight gain in pigs from additives to their feed. There is a multi-index on the columns. popcorn_long \

.pivot_table(values="n_unpopped", index="brand", columns="trial") \

Antibiotic No Yes
.reset_index()

B12 No Yes No Yes
19 22 3 54
ILTER Replace text with REPLACE() and SUBSTITUTE()
Data Manipulation in Excel
Subset Arrays for Multiple Rows with F

Filter
an array for values that match a value with F ILTER() — Same as =XLOOKUP("Nigeria", A2:A11, B2:D11)
=REPLACE(B2:B11, 2, 1, "X") Replace a substring by position with REPLACE()

=FILTER(B2:D11, A2:A11="Nigeria")


=SUBSTITUTE(B2:B11, "N", "X") Replace specific characters with SUBSTITUTE()
Learn Excel online at [Link] Where the lookup value does not match a key, provide a default value with FILTER(if_empty)
Kingdom", A2:A11, B2:D11, "Country not found")

— Same as =XLOOKUP("United

=FILTER(B2:D11, A2:A11="United Kingdom", "Country not found")

FILTER can also return multiple rows

=FILTER(A2:D11, D2:D11<10)

> Work with Cell Positions & References


Combine criteria using logical AND with FILTER(include1 * include2) — For text data < means "preceding alphabetically"
=CHOOSE(RANDBETWEEN(1, 4), A2:A11, B2:B11, C2:C11, D2:D11) Choose a return value from the input with CHOOSE()

=FILTER(A2:D11, (A2:A11 < "N") * (D2:D11 > 100))

> Dataset

Combine criteria using logical OR with FILTER(include1 + include2)

=INDIRECT(F1) Get the value in a reference to a cell with INDIRECT() — Suppose cell F1 contains the text value "A1"

=FILTER(A2:D11, (C2:C11 = ".in") + (C2:C11 = ".id"))


The main dataset contains details for the ten most populous countries. =OFFSET(A2, 0, 3) Get the value in a cell by position relative to another cell with OFFSET()

- A B C D
Fin d Positions in Lists with XMATCH() =ROWS(A2:A11) Get the number of rows in an array with ROWS()

1 Country Country code Internet TLD Phone prefix code Get the position in a list of the first exact match of a value with XMATCH()
=COLUMNS(A2:D2) Get the number of columns in an array with COLUMNS()

2 China CHN .cn 599 =XMATCH("Brazil", A2:A11)

3 India IND .in 91 =ROW(A2:A11) Get the number of row number of cells with ROW()

Get the position in a list of the first match that starts with a value with XMATCH(match_mode=1)

4 United States USA .us 1 =XMATCH("I", A2:A11, 1)

=COLUMN(A2:D2) Get the number of column number of cells with COLUMN()


5 Indonesia IDN .idn 62 Get the position in a list of the first match using wildcards with XMATCH(match_mode=2)

6 Pakistan PAK .pk 92 =XMATCH("Me?ico", A2:A11, 2)

Brazil BRA .br 55 Fordata sorted in ascending order, use faster binary search for same task XMATCH(search_mode=2)

Nigeria NGA .ng 234


=XMATCH("China", SORT(A2:A11), , 2)
> Calculate with Database-like Filters
Bangladesh BGD .bd 880
Russia RUS .ru 7
Get Values by Position with INDEX Assume an additional dataset in the worksheet containing filter conditions. Perform calculations using database-like filter conditions
with D*()

Mexico MEX .mx 52 Get the value by row and column number within an array with INDEX() — Row and column numbers start from 1rom 1

=INDEX(A2:D11, 5, 3)

Find the maximum of elements matching filters

=DMAX(A1:D11, "Phone prefix code", A10:D15)

Get the value that matches a condition with XMATCH() and INDEX() combined

=INDEX(A2:D11, XMATCH("Brazil", A2:A11), XMATCH("Country code", A1:D1))


COUNT of elements matching filters

> Wildcards Sort Arrays with SORT and SORTBY


=DCOUNT(A1:D11, "Phone prefix code", A10:D15)

SUM of elements matching filters

Many data manipulation functions let you match any text character using wildcards. Sort an array in ascending order of values in a column with SORT()
=DSUM(A1:D11, "Phone prefix code", A10:D15)

=SORT(A2:D11, 3)

- A B AVERAGE of elements matching filters

Sort an array in descending order of values in a column with SORT(sort_order=-1)

? "gr?y" "grey" "gray" =DAVERAGE(A1:E11, "GDP", A10:E15)

Match 1 character matches and =SORT(A2:D11, 3, -1) 


* Match 0 or more characters "sp*y" matches "spy", "spry", and "springy" STDEV of elements matching filters

Sort an array by values of another array with SORTBY()

~ Escape wildcard character "~?~*~~" matches "?*~" =SORTBY(A2:D11, C2:C11)

=DSTDEV(A1:E11, "GDP", A10:E15)

<> Match not blank "<>" matches "anything"


Sort an array by multiple arrays (for example breaking ties with values from second column) 

=SORTBY(A2:D11, A2:A11, 1, B2:B11, -1)

Database calculation functions and conditional calculation functions allow numeric criteria wildcards.
Randomize row order with SORTBY() + RANDARRAY()

- A B =SORTBY(A2:D11, RANDARRAY(COUNTA(A2:A11)))

> Match values greater than `>10` matches values greater than 10

<= Matches values less than or equal to <=10 matches values less than or equal to 10

= Match values equal to =10 matches values equal to 10

<> Match values not equal tor <>10 matches values not equal to 10 > Work with Text Data
Clean text with TRIM() and CL EAN()
Trim all white space except single spaces between words with TRIM()

> Data Transformation =TRIM(" Only single spaces between words remain ")

Remove non-printable characters with CLEAN() — CHAR(7) is an alarm bell sound

Subset Arrays for a Single Row with XLOOKUP =CLEAN("alarm" & CHAR(7))

Get the rows of a return array where the keys match a value with XLOOKUP()

Fin d Substrings with FIND()


=XLOOKUP("Nigeria", A2:A11, B2:D11)

the position of the first instance of a character sequence with IND()

Learn Excel Online at


[Link]
Where the lookup value does not match a key, provide a default value with XLOOKUP(if_not_found)
Find F

=XLOOKUP("United Kingdom", A2:A11, B2:D11, "Country not found")

=FIND("ia", A2:A11)

Where the lookup value does not match a key, return the next largest value with XLOOKUP(match_mode=1)

=XLOOKUP("United Kingdom", A2:A11, B2:D11, #N/A, 1)


Join & Split Text with TEXTJOIN() and TEXTSPLIT()
Collapse an array of text to a single cell with TEXTJOIN()

Left joins with XLOOKUP() =TEXTJOIN(";", TRUE, A2:A11)

- F G H I J Split a cell by a delimiter with TEXTSPLIT()

=TEXTSPLIT(A4, " ")

1 Landmark Address City State Country

2 Taj Mahal Dharmapuri Agra Uttar Pradesh India


Split text on multiple delimiters with TEXTSPLIT(delimiter={array})

=TEXTSPLIT(A4, {"a","e"})
3 Empire State 350 5th Avenue New York New York United States

4 Winter Palace 32 Palace 
 St Petersburg Northwestern 
 Russia


Embankment District

5 Al Hambra C. Real de la 
 Granada Andalusia Spain


Alhambra

Left join two datasets with XLOOKUP() — Copy formula down the J column to complete the join

=XLOOKUP(J2, $A$2:$A$11, $B$2:D$11)


> Filtering Data > Aggregating Data
MySQL Cheat Sheet Filtering on numeric columns Simple aggregations

Get rows where a number is greater than a value with WHERE col > n Get the total number of rows SELECT COUNT(*)
SELECT franchise, inception_year
SELECT COUNT(*)

FROM franchises
FROM franchises
Learn SQL online at [Link] WHERE inception_year > 1928
Get the total value of a column with SELECT SUM(col)
Get rows where a number is greater than or equal to a value with WHERE col >= n
SELECT SUM(total_revenue_busd)

SELECT franchise, inception_year


FROM franchises
FROM franchises

WHERE inception_year >= 1928 Get the mean value of a column with SELECT AVG(col)
What is MySQL? SELECT AVG(total_revenue_busd)

Get rows where a number is less than a value with WHERE col < n
FROM franchises
MySQL is an open-source relational database management system (RDBMS) known for its fast SELECT franchise, inception_year

FROM franchises
Get the minimum value of a column with SELECT MIN(col)
performance and reliability. Developed by Oracle Corporation, it's widely used for web
WHERE inception_year <= 1977 SELECT MIN(total_revenue_busd)

applications and online publishing.


FROM franchises
Get rows where a number is equal to a value with WHERE col = n
SELECT franchise, inception_year
Get the maximum value of a column with SELECT MAX(col)
FROM franchises

> Sample Data WHERE inception_year = 1996


SELECT MAX(total_revenue_busd)

FROM franchises
Get rows where a number is not equal to a value with WHERE col <> n or WHERE col != n
The dataset contains details of the world's highest valued media franchises by gross revenue.

Each row contains one franchise, and the table is named franchises.
SELECT franchise, inception_year
Grouping, filtering, and sorting
FROM franchises

WHERE inception_year <> 1996


Franchise inception_year total_revenue_busd original_medium owner n_movies Get summaries grouped by values with GROUP BY col
Get rows where a number is between two values (inclusive) with WHERE col BETWEEN m AND n
Star Wars
1977
46.7 movie
The Walt Disney
12
SELECT owner, COUNT(*)

Company
SELECT franchise, inception_year
FROM franchises

Mickey Mouse The Walt Disney


FROM franchises
GROUP BY owner
1928 52.2 cartoon
and Friends Company WHERE inception_year BETWEEN 1928 AND 1977
Get summaries grouped by values, in order of summaries with GROUP BY col ORDER BY smmry DESC
Anpanman 1973 38.4 book Froebel-kan 33
SELECT original_medium, SUM(n_movies) AS total_movies

Winnie the Pooh


1924
48.5 book
The Walt Disney
6
Filtering on text columns FROM franchises

Company
GROUP BY original_medium

The Pokémon
Pokémon 1996 88 video game 24 ORDER BY total_movies DESC
Company
Get rows where text is equal to a value with WHERE col = 'x'
Disney Princess 2000 45.4 movie
The Walt Disney
Company SELECT franchise, original_medium
Get rows where values in a group meet a criterion with GROUP BY col HAVING condn
FROM franchises
SELECT original_medium, SUM(n_movies) AS total_movies

WHERE original_medium = 'book' FROM franchises

GROUP BY original_medium

> Querying tables Get rows where text is one of several values with WHERE col IN ('x', 'y')
ORDER BY total_movies DESC

SELECT franchise, original_medium


HAVING total_movies > 10
FROM franchises

Get all the columns from a table using SELECT * WHERE original_medium IN ('movie', 'video game') Filter before and after grouping with WHERE condn_before GROUP BY col HAVING condn_after
SELECT *
SELECT original_medium, SUM(n_movies) AS total_movies

FROM franchises Get rows where text contains specific letters with WHERE col LIKE '%abc%'

FROM franchises

(% represents any characters) WHERE owner = 'The Walt Disney Company'

Get a column from a table by name using SELECT col GROUP BY original_medium

SELECT franchise, original_medium

SELECT franchise
FROM franchises
ORDER BY total_movies DESC

FROM franchises WHERE original_medium LIKE '%oo%' HAVING total_movies > 10

Get multiple columns from a table by name using SELECT col1, col2
Filtering on multiple columns
SELECT franchise, inception_year

FROM franchises > MySQL-Specific Syntax


Get the rows where one condition and another condition holds with WHERE condn1 AND condn2
Override column names with SELECT col AS new_name
N ot all code works in every dialect of SQL. The following examples work in MySQL, but are not
SELECT franchise, inception_year, total_revenue_busd

SELECT franchise, inception_year AS creation_year


guaranteed to work in other dialects.
FROM franchises

FROM franchises
WHERE inception_year < 1950 AND total_revenue_busd > 50 Limit the number of rows returned, offset from the top with LIMIT m, n
Arrange the rows in ascending order of values in a column with ORDER BY col
Get the rows where one condition or another condition holds with WHERE condn1 OR condn2 SELECT *

SELECT franchise, inception_year


FROM franchises

SELECT franchise, inception_year, total_revenue_busd

FROM franchises
LIMIT 2, 3
FROM franchises

ORDER BY inception_year
WHERE inception_year < 1950 OR total_revenue_busd > 50 B y default, MySQL uses case insensitive matching in WHERE clauses.
Arrange the rows in descending order of values in a column with ORDER BY col DESC
SELECT *

SELECT franchise, total_revenue_busd


Filtering on missing data FROM franchises

FROM franchises
WHERE owner = 'THE WALT DISNEY COMPANY'
ORDER BY total_revenue_busd DESC
Get rows where values are missing with WHERE col IS NULL o get case sensitive matching, use WHERE BINARY condn
T

Limit the number of rows returned with LIMIT n


SELECT franchise, n_movies
SELECT *

SELECT *
FROM franchises
FROM franchises

FROM franchises
WHERE n_movies IS NULL WHERE BINARY owner = 'THE WALT DISNEY COMPANY'
LIMIT 2
Get rows where values are not missing with WHERE col IS NOT NULL Get the current date with CURDATE() and the current datetime with NOW() or CURTIME()
Get unique values with SELECT DISTINCT
SELECT franchise, n_movies
SELECT CURDATE(), NOW(), CURTIME()
SELECT DISTINCT owner
FROM franchises

FROM franchises WHERE n_movies IS NOT NULL List available tables with show tables
show tables
> Navigating Worksheets
Key > Cell Entry
Key
Functionality Shortcut Functionality Shortcut

Excel Keyboard Move one cell down ENTER or Down arrow Cancel your input to a cell Esc

Shortcuts Cheat Sheet

Move one cell up SHIFT+ENTER or Up arrow Write a new line within a cell CTRL+Enter
Move one cell right TAB or Right arrow o to the end of the line
G CTRL+End
Learn Excel skills online at [Link] Move one cell left SHIFT+TAB or Left arrow
Go to the start of the line CTRL+Home
Move one screen down PageDown
Insert a function SHIFT+F3
Move one screen up PageUp
Display function arguments (when cursor is to right of function name) CTRL+A
Move one screen right ALT+PageUp
Using Shortcuts in Excel Insert AutoSum formula ALT+=
Move one screen left ALT+PageDown
While every action in Excel can be performed by clicking on menu items or dialog boxes, this is Insert hyperlink CTRL+K
Move to first row, first column CTRL+Home
often slower than pressing keys on your keyboard. Regular Excel users can gain productivity
Insert current date CTRL+;
increases by making use of keyboard shortcuts. The shortcuts shown here are for the Windows Move to last filled cell CTRL+End
versions of Excel. Most shortcuts are applicable to Excel in Office 365. Likewise most shortcuts Insert current time CTRL+:
can be used on MacOS by replacing CTRL with CMD.

Key
> Zooming
Insert a comment on the cell CTRL+SHIFT+F2

Key
> Getting Help Functionality Shortcut

Zoom in CTRL+ALT+=
Key
> Text Formatting
Functionality Shortcut
Zoom out CTRL+ALT+-
Functionality Shortcut
Open help browser F1
Apply bold formatting CTRL+B
Open keyboard shortcut browser
Key
> Editing
CMD+/
Apply italic formatting CTRL+I

Apply strikethrough formatting CTRL+5


Key
> Manipulating Workbooks Functionality Shortcut
Apply general number formatting CTRL+SHIFT+~
Cut selection CTRL+X
Apply scientific number formatting CTRL+SHIFT+^
Functionality Shortcut Copy selection CTRL+C
Apply currency format CTRL+SHIFT+$
Paste selection CTRL+V
Open a workbook CTRL+O
Apply percentage format CTRL+SHIFT+%
Open the Paste Special dialog CTRL+ALT+V
Save a workbook CTRL+S
Apply date format CTRL+SHIFT+#
Close a workbook CTRL+W

Insert new worksheet SHIFT+F11 or ALT+SHIFT+F1



 Key
> Undo & Redo Apply time format

Toggle hiding/displaying content


CTRL+SHIFT+@

CTRL+6

Functionality Shortcut

Key
> Access Ribbon Tabs
Undo last action CTRL+Z
Key
> Charts
Redo last undone action CTRL+Y

Functionality Shortcut
Functionality Shortcut

Key
> Selecting Cells
Create a chart of the selected data in new worksheet F11
Open the File menu ALT+F
Create a chart of the selected data in current worksheet ALT+F11
Open the Home tab ALT+H Functionality Shortcut
Open the Insert tab

Open the Page Layout tab


ALT+N Select the current column

Select the current row


CTRL+Space

SHIFT+Space
Key
> Refreshing Data
ALT+P
Select all cells from the current location downwards SHIFT+PageDown
Open the Formulas tab ALT+M Functionality Shortcut
Select all cells from the current location upwards SHIFT+PageUp
Open the Data tab ALT+A Refresh external data in current worksheet CTRL+F5
Select all cells from the current location rightwards SHIFT+End
Open the Review tab ALT+R Refresh external data in all worksheets CTRL+ALT+F5
Select all cells from the current location leftward SHIFT+Home

Open the View tab ALT+W Select all completed cells CTRL+A Run all calculations in all open workbooks CTRL+ALT+F9

You might also like