MF810 Advanced Programming
Data Structure and Algorithms
Lecture 4
Database and SQL
Jun Fan
[email protected]
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 1
Review
Sorting Algorithm
Asymptotic Bounding
Data Abstraction
Tidy Data
Case Study: The Great Quant Meltdown
Lecture 1 – 1/18/2024 MF810 Advanced Programming – J Fan 2
August 2007 Great Quant Meltdown
2007 GQM 2021 Archegos
• Multiple Leveraged Portfolios • Single Leveraged Portfolio
• Crowded Positions among • Concentrated Positions by one
Quant PMs PM
• Each PM holds 10% ADV of 1000 stocks • Archegos holds 1000% ADV of 8 stocks
• 20 Quant PMs hold similar stocks • Shared among 5 Brokers
• Global Liquidations among • Forced Liquidations by PBs
Quant funds
• Severe temporarily market • Strong correction of stock price
impact -> rebound quickly from prior market impact
Lecture 3 – 2/1/2024 MF810 Advanced Programming – J Fan 3
August 2007 Great Quant Meltdown
Multi-day trading impact to your positions
Lecture 3 – 2/1/2024 MF810 Advanced Programming – J Fan 4
Agenda
Data Manipulation
Relational Database
Structured Querying Language
Database Landscape
Case Study: The WSB GameStop short sell frenzy
Lecture 1 – 1/18/2024 MF810 Advanced Programming – J Fan 5
Tidy Data
Fundamentals of “tidy” data
Why?
• Tidying data is about structuring datasets to facilitate analysis
• Such datasets are easy to manipulate, model and visualize
• The format provides a well-defined structure and uniformity over your datasets
• Provides access to a number of tools that work well with tidy datasets
https://vita.had.co.nz/papers/tidy-data.pdf
Lecture 3 – 2/1/2024 MF810 Advanced Programming – J Fan 6
Grammar for data manipulation
Common Verbs for operating on “tidy” data
select: pick variables by their names
filter: choose rows that satisfy some criteria
mutate: create transformed or derived variables
join: combine two tables of data
summarize: collapse rows down to summaries
arrange: reorder the rows
group_by: verb to perform a summary operation on a grouping of rows
Lecture 3 – 2/1/2024 MF810 Advanced Programming – J Fan 7
Grammar for data manipulation
Lecture 3 – 2/1/2024 MF810 Advanced Programming – J Fan 8
Agenda
Data Manipulation
Relational Database
Structured Querying Language
Database Landscape
Case Study: The WSB GameStop short sell frenzy
Lecture 1 – 1/18/2024 MF810 Advanced Programming – J Fan 9
Commodity Futures
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 10
Chinook Database Schema
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 11
Agenda
Data Manipulation
Relational Database
Structured Querying Language
Database Landscape
Case Study: The WSB GameStop short sell frenzy
Lecture 1 – 1/18/2024 MF810 Advanced Programming – J Fan 12
Structured Querying Language (SQL)
SQL is a domain-specific language that enables us to perform set operations such as
union, intersect, etc to organize and retrieve information in relational databases. This
language has been expounded upon and used almost everywhere there is tabular data.
Operations supported by SQL adhere to the CRUD (create, read, update, delete)
paradigm. We will focus on the ”read” aspect.
While there is an ANSI SQL standard, the standard is both a superset of what is often
implemented and also insufficient to cover areas such as indexes and file storage.
Commercial vendors are not incentivized to fully standardize as it allows for vendor
lock-in.
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 13
Structured Querying Language (SQL)
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 14
Structured Querying Language (SQL)
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 15
Structured Querying Language (SQL)
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 16
Structured Querying Language (SQL)
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 17
Structured Querying Language (SQL)
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 18
Structured Querying Language (SQL)
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 19
Structured Querying Language (SQL)
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 20
Agenda
Data Manipulation
Relational Database
Structured Querying Language
Database Landscape
Case Study: The WSB GameStop short sell frenzy
Lecture 1 – 1/18/2024 MF810 Advanced Programming – J Fan 21
Data Considerations: Introduction
‣ Structured vs unstructured data vs semi-structured
‣ Native file format? CSV, XLS, JSON, XML
‣ Size of the data set? GB? TB? PB?
‣ Frequency of data - Monthly? Daily? Hourly? Second?
‣ How available does the data need to be?
‣ Sensitivity to missing data
‣ Longevity? Do we need to retain data for audit purposes?
‣ Do I need to secure the data?
‣ How will the data be used? Batch vs interactive?
‣ What kind of computations will I be performing with this data?
‣ What is my budget? How much will it cost?
Lecture 3 – 2/1/2024 MF810 Advanced Programming – J Fan 22
Data Representation
Image source: https://www.igneous.io/blog/structured-data-vs-unstructured-data
Lecture 3 – 2/1/2024 MF810 Advanced Programming – J Fan 23
Data Considerations: Storage Formats
Storage Formats
Files – “Old-school” approach where files get a
name, tagged with metadata and organized
hierarchically in a series of folders/directories.
This format excels at handling relatively small
amounts of data.
Blocks - A block is a raw storage volume filled with files that have been split into equal size chunks of
data. Each block does not have associated metadata, rather the operating system allocates storage
for different applications and decides what goes into each block. Often used for databases, email
servers, RAID redundancy and virtual machines.
Objects - data is stored in isolated containers identified by a unique ID or hash. These objects can be
stored locally or remotely and very amenable to scaling. Often used for big data, web apps and
backups.
Lecture 3 – 2/1/2024 MF810 Advanced Programming – J Fan 24
Data Storage Landscape
Relational Databases
Most common and prevalent method of scalable storage for predominantly tabular data,
that has existed in some form or another since the 1960s. Traditional forms of usually
supported ACID properties (atomicity, consistency, isolation and durability). Modern forms
of these sometimes sacrifice one or more of these properties in exchange for some
optimization.
Traditional vendors who have adapted to distributed computing include Oracle,
MySQL, MS SQLServer, PostgreSQL.
Modern versions
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 25
Data Storage Landscape
NoSQL Databases
NoSQL or sometimes called ”Not only SQL” became popular with the advent of the
Web 2.0 movement. Emphasis is on simpler design and better horizontal scaling to
clusters. With NoSQL, unstructured schema-less data can be stored.
Main categories of NoSQL databases consist of:
• Key-Value
• Document
• Wide-column
• Graph
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 26
NoSQL Paradigm
NoSQL Databases (Key-Value Stores)
Key-value stores are usually simple databases that use an associative array to map a single
unique key to one and only one value in a collection. Often some form of dictionary or
hash-table and are usually cached in memory allowing fast query.
Popular flavors include
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 27
NoSQL Paradigm
NoSQL Databases (Key-Value Stores)
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 28
NoSQL Paradigm
NoSQL Databases (Document Stores)
Document store databases use JSON, XML, or BSON documents to store data.
Fairly flexibly support for any type of data you want. Documents can have different
structures, but you can index common fields for faster performance.
Popular flavors include
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 29
NoSQL Paradigm
NoSQL Databases (Document Stores)
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 30
NoSQL Paradigm
NoSQL Databases (Wide-Column Stores)
Wide-column stores use tables, rows and columns like relational databases, but the names
and format can vary from row-to-row.
Popular flavors include
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 31
NoSQL Paradigm
NoSQL Databases (Wide-Column Stores)
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 32
CAP Theorem
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 33
SQL v.s. NoSQL
SQL NoSQL
Relational Non-Relational
Structured Data (tables) Un-Structured Data
Schema “Schema-less” or Dynamic
Vertical Scaled Horizontal Scaled
Limited data capacity Can handle very large data
Easier deployment “Harder” deployment
Co-exist in the foreseeable future (just like nerd and quant)
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 34
Data Storage Landscape
Distributed File Systems
A distributed file system allows files to be accessed using the same interfaces and
semantics as local files – for example, mounting/unmounting, listing directories,
read/write at byte boundaries, system's native permission model
Notable versions include Google FS and
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 35
Amazon Simple Storage Service (S3)
Storage Format: Object storage organized as buckets where each object is identified by a
bucket, key and version ID
Good for frequently accessed data due to its’ design for low latency and high throughput –
amenable to cloud applications, dynamic websites, content distribution, big data analytics,
etc. High fault tolerance and availability of data through replication. Scalability from the
perspective of being an object store.
Multiple tiers of service that specialize on different guarantees:
• Standard: default
• Standard Infrequent Access (IA): optimized for cost by splitting data across two stores
(one for frequent data and one for infrequent data access)
• One-Zone Infrequent Access: optimized for cost at the tradeoff of availability and
resiliency
• Glacier: designed for archiving, optimized for durability at the tradeoff for high latency
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 36
Agenda
Data Manipulation
Relational Database
Structured Querying Language
Database Landscape
Case Study: The WSB GameStop short sell frenzy
Lecture 1 – 1/18/2024 MF810 Advanced Programming – J Fan 37
GameStop
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 38
GameStop
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 39
The Wolf of the WallStreetBets
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 40
The Wolf of the WallStreetBets
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 41
Robinhood
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 42
Buy Long and Sell Short
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 43
Short Squeeze
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 44
Timeline
• Short selling
• Cost of borrowing
• Limited upside risk, but unlimited downside risk
• GameStop was being short sold at more than 100% of its total free float
market share
• Crowd sourced strong buying triggered by users of the subreddit
r/wallstreetbets
• Short seller realized significant loss
• forced liquidating short positions by buying back shares
• raise stock price further
• GameStop price reached a max value of over $500 per share, nearly 30
times the $17.25 price at the beginning of the month.
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 45
Timeline
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 46
GameStop
Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 47