DS 5110 – Lecture 1
Introduction
Roi Yehoshua
Agenda
Course objectives
Intro to database management systems
Database architecture
The relational model
MySQL installation
2 Roi Yehoshua, 2023
DS 5110 Course Objectives
Understand basic database concepts
Understand the structure and operation of a relational database system
Use ER models to design a database system
Improve the database design by normalization
Learn to formulate SQL queries on the data
Integrate SQL queries within Python applications
Learn how to work with semi-structured data, including XML, JSON and HTML
Understand NoSQL databases
Learn to work with MongoDB
Describe big data tools and techniques
Learn to use Apache Spark
3 Roi Yehoshua, 2023
Course Schedule
Published on Canvas
4 Roi Yehoshua, 2023
Your Responsibilities
Read the materials before class
Attend the lectures, please be on time
Ask questions, participate in interactive discussion in class
Come to office hours
Submit assignments on time
Help others
5 Roi Yehoshua, 2023
Homework Assignments
4 problem sets including both theoretical questions and programming exercises
Assignments are due at 11:59PM on the specified date
Submission via Canvas
Please submit both PDF and source files individually (no zip)
Up to one day late is allowed, penalized by 10%
6 Roi Yehoshua, 2023
Final Project
Build an application that uses a DBMS including:
Database design
Data modeling and normalization
Application that interacts with the database
Data analysis and visualization
Team size: 1-3 members
Timeline
Project proposal: mid class
Presentation: on the last lecture of the semester
Final report: ~1 week after semester ends
7 Roi Yehoshua, 2023
Grading
Assignments – 20%
Midterm exam – 20%
Final exam – 20%
Final project – 40%
Class participation is used to adjust grade upwards (at the discretion of the
instructor)
8 Roi Yehoshua, 2023
Academic Integrity
Homework is done individually!
Rules
Can discuss with colleagues or instructor
Can post and answer questions on Piazza
Code cannot be shared with colleagues
Cannot use code from the internet without explicit permission from the instructor
Any cheating will automatically result in grade F and report to the university
administration
http://www.northeastern.edu/osccr/academic-integrity-policy/
9 Roi Yehoshua, 2023
Textbook
Abraham Silberschatz, Henry F. Korth, S. Sudarshan.
Database System Concepts (7th Edition). McGraw-Hill,
2020.
Course website:
https://www.db-book.com/
10 Roi Yehoshua, 2023
What is Data?
Collection of data objects and their attributes
Attributes
Objects
11 Roi Yehoshua, 2023
Facets of Data
There are many different types of data
Each requires its own tools and techniques
Main categories of data:
Data matrix (e.g., spreadsheets)
Relational data (tables in a database)
Text documents (natural language)
Graph-based (e.g., social networks, world wide web)
Sequential data (e.g., genome sequences)
Multimedia (audio, video, images)
Streamed data (e.g., changing stock prices, stream of Twitter tweets)
12 Roi Yehoshua, 2023
Data Growth
The amount of raw data is increasing exponentially
Data is eating the world: prediction is that 163 ZB will be created in 2025
1 KB (Kilobyte) 103 bytes
1 MB (Megabyte) 106 bytes
1 GB (Gigabyte) 109 bytes
1 TB (Terabyte) 1012 bytes
1 PB (Petabyte) 1015 bytes
1 EB (Exabyte) 1018 bytes
1 ZB (Zettabyte) 1021 bytes
1 YB (Yottabyte) 1024 bytes
13 Roi Yehoshua, 2023
Database and DBMS
A database is a an organized collection of structured data stored in a computer
A database management system (DBMS) is software designed to store, retrieve,
define, and manage data in a database
DBMS provide users with an abstract view of the data
By hiding certain details of how the data are stored and maintained
A database system includes the database, DBMS and associated applications
14 Roi Yehoshua, 2023
Database Systems Examples
Enterprise information
Sales: customers, products, purchases
Accounting: payments, receipts, account balances, assets
Human resources: information about employees, salaries, payroll taxes, benefits
Manufacturing: management of production, inventory, orders, supply chain
Banking and finance
Customer information, accounts, loans, and banking transactions
Sales and purchases of financial instruments (e.g., stocks and bonds)
Universities: student information, course registration, grades
Airlines: reservations and schedule information
Telecommunication: records of calls, texts, and data usage
Social media: users, connections between users, posts made by users, etc.
15 Roi Yehoshua, 2023
File Systems vs. Databases
A file system stores data in a collection of flat files grouped into directories
Databases are built on top of the file system
Advantages of DBMS over file systems
Data redundancy and inconsistency: DBMS controls redundancy by maintaining a single
repository of data that is defined once and accessed by many users
In a file system, the data is scatted in various files
Data concurrency: DBMS provides mechanisms to control concurrent access to the same data
Data access and searching: DBMS provides built-in efficient searching operations
In a file system, a different program has to be written for every search operation
Data integrity: DBMS allows the user to define constraints on the data
Data security: DBMS allows to control which users can access which part of the data
Supports transactions and atomic updates
16 Roi Yehoshua, 2023
Database Levels of Abstraction
Physical level: describes how a record (e.g., instructor) is actually stored
Logical level: describes what data are stored in the database, and what relationships
exist among those data
View level: describes only part of the database that is needed for the application
The system may provide many views for the same database
Views can also hide information (such as an employee’s salary) for security purposes
17 Roi Yehoshua, 2023
Schemas and Instances
Similar to types and variables in programming languages
The overall design of the database is called the database schema
Analogous to a variable declaration with its associated type
Logical schema – the overall logical structure of the database
e.g., the database contains information about customers and accounts in a bank and their relations
Physical schema – the overall physical structure of the database
Database instance – the actual content of the database at a particular point in time
Analogous to the value of a variable
18 Roi Yehoshua, 2023
DBMS Architecture
A DBMS consists of two major components
Storage manager
Query processor
19 Roi Yehoshua, 2023
Storage Manager
Responsible for storage, retrieval and updates of data in the database
The storage manager components include:
File manager - manages the allocation of space on disk storage
Buffer manager - decides which data needs to be cached in main memory
Authorization and integrity manager - checks the authorization of users to access data
Transaction manager - ensures that the database remains in a consistent state despite
transaction failures
Implements several data structures
Data files - stores the database itself
Data dictionary - stores metadata about the structure of the database
e.g., the database schema, integrity constraints and authorization
Indices - provides fast access to data items
20 Roi Yehoshua, 2023
The Query Processor
A query is a statement requesting the retrieval of information from the database
The query processor translates queries into sequences of physical actions
Includes the following components
Query parser – translates query statements into an execution plan
Query optimizer – picks the lowest cost execution plan for a given query
Query evaluation engine – executes the low-level instructions in the execution plan
21 Roi Yehoshua, 2023
Database Users
22 Roi Yehoshua, 2023
Database Administrator (DBA)
A DBA is a person who has central control over the system is called a DBA
Functions of a DBA include:
23 Roi Yehoshua, 2023
Popular Database Systems
A 2021 survey of 25,000 data scientists and ML engineers done by Kaggle
https://www.kaggle.com/kaggle-survey-2021
24 Roi Yehoshua, 2023
Data Models
A collection of tools for describing the data set
Various types of data models
The relational model
Entity-relationship models (used mainly for design)
Semi-structured data model (e.g., JSON and XML)
Document-based models
Graph-based models
25 Roi Yehoshua, 2023
The Relational Model
The most widely used data model
Introduced by Edgar F. Codd at IBM in 1970
Data is represented by a collection of tables (also called relations)
Each table contains records (rows) of a particular type
Each record type defines a fixed number of fields (columns)
Ted Codd
Turing Award 1981
26 Roi Yehoshua, 2023
Relational Schemas
Relational schemas are denoted by R(A1, A2, …, An )
A1, A2, …, An are the names of the attributes
Attribute values are normally required to be atomic (i.e., indivisible)
A relation instance r defined over schema R is denoted by r(R)
For example, a schema R(A, B) may have the following instance r:
A B
1 4
1 5
3 7
An element t of relation r is called a tuple and is represented by a row in a table
Order of tuples is irrelevant (tuples may be stored in an arbitrary order)
27 Roi Yehoshua, 2023
Relational Schemas
Example for a relation schema: instructor (ID, name, dept_name, salary)
We often use the same name (e.g., instructor) to refer both to the schema and the instance
attributes
(or columns)
tuples
(or rows)
28 Roi Yehoshua, 2023
Relational Schemas
Tuples in different relations are "linked" using keys
Columns
Rows
Relationship
established by keys
29 Roi Yehoshua, 2023
Keys
Let K R be a subset of the attributes in schema R
K is a super key of R if it is sufficient to identify a unique tuple in a relation r(R)
That is, if t1 and t2 are in r and t1 t2, then t1.K t2.K
Example: {ID} and {ID, name} are both super keys of instructor
Super key K is a candidate key if K is minimal (no proper subset of K is a superkey)
Example: {ID} is a candidate key for instructor
One of the candidate keys is selected to be the primary key
The primary key should be chosen such that its values are never/rarely changed
The primary key attributes are underlined in the schema definition
instructor (ID, name, dept_name, salary)
classroom (building, room_number, capacity)
section (course_id, sec_id, semester, year, building, room_number)
30 Roi Yehoshua, 2023
Foreign Keys
A foreign key is an attribute (or set of attributes) that refers to an attribute of unique
values in another table (usually its primary key)
The table with the foreign key is called the child table (or referencing table)
The table with the primary key is called the parent table (or referenced table)
The foreign key constraint ensures that a value inserted to the foreign key column
must be one of the values contained in the primary key of the parent table
31 Roi Yehoshua, 2023
Full Schema of the University Database
32 Roi Yehoshua, 2023
Schema Diagrams
A database schema can be depicted using a schema diagram
The foreign keys are represented by arrows connecting the two tables
33 Roi Yehoshua, 2023
Relational Algebra
Relational algebra is a procedural query language introduced by Edgar F. Codd
Consists of a set of mathematical operators defined on relations
Provides a theoretical foundation for relational query languages such as SQL
34 Roi Yehoshua, 2023
SQL
Structured Query Language (SQL) is the standard relational databases language
A set-based, declarative programming language
SQL Commands
DDL DML DQL DCL TCL
(Data Definition (Data Manipulation (Data Query (Data Control (Transaction Control
Language) Language) Language) Language) Language)
CREATE INSERT SELECT GRANT COMMIT
ALTER UPDATE REVOKE ROLLBACK
DROP DELETE SAVEPOINT
RENAME MERGE SET TRANSACTION
TRUNCATE CALL
EXPLAIN PLAN
LOCK TABLE
35 Roi Yehoshua, 2023
SQL Queries
A SQL query takes as input several tables (possibly one) and returns a results table
Basic structure of SQL queries
select A1, A2, ..., An
from r1, r2, ..., rm
where P
Ai represents an attribute (column in a table)
ri represents a relation (table)
P is a predicate
36 Roi Yehoshua, 2023
SQL Queries
For example, an SQL query that finds the names of all instructors and the courses
that they teach
select name, course_id
from instructor, teaches
where instructor.ID = teaches.ID
37 Roi Yehoshua, 2023
Database Applications
SQL is not as powerful as procedural programming languages like Java or Python
It also doesn’t support actions such as input from users or output to displays
Application programs written in a host language (like Python) can include
embedded SQL statements
These statements are sent to the DB via a data API (application program interface)
38 Roi Yehoshua, 2023
Database Applications
Database applications are usually partitioned into two or three layers (tiers)
39 Roi Yehoshua, 2023
NoSQL Databases
Non-tabular databases that provide flexible schemas
Allow related data to be nested within a single data structure
Scale easily with large amounts of data and high loads
40 Roi Yehoshua, 2023
NoSQL Database Types
Store data in Store data in nodes
documents similar to and edges
JSON
Store data tables, Each item contains
rows, and dynamic keys and values
columns
41 Roi Yehoshua, 2023
Big Data
Data sets that are too large or complex for traditional data techniques to deal with
The characteristics of big data are often referred to as the 3 Vs:
Volume – The amount of data is much larger than what traditional databases can handle
Velocity – Data arrives at a very fast rate, and needs to be analyzed very quickly
Many applications need to process data items as they arrive (streaming data systems)
Variety – Data comes in variety of types, such as text, audio and video
42 Roi Yehoshua, 2023
Big Data
A new generation of technologies and frameworks that deal with big data
Distributed file systems
Distributed programming (MapReduce)
NoSQL databases
Analytic tools
Cloud services
43 Roi Yehoshua, 2023
MySQL
For this course, I recommend that you use MySQL database
Used by many popular websites, including Facebook, Flickr, Twitter and YouTube
MySQL is offered under two editions:
Open source MySQL Community Server
The proprietary Enterprise Server (owned by Oracle)
Can run on various platforms UNIX, Linux, Windows, etc.
Easy to master in comparison with other database software like Oracle Database or
Microsoft SQL Server
44 Roi Yehoshua, 2023
MySQL Installation
Download MySQL Community Server from https://dev.mysql.com/downloads/mysql/
MySQL comes preinstalled on most linux distributions, although the version may not be the latest
45 Roi Yehoshua, 2023
MySQL Installation
Run the setup with the default settings
Remember the password you set for the root user
46 Roi Yehoshua, 2023
MySQL Tools
After the installation, you will have two programming tools to work with MySQL:
MySQL Shell, where you can type SQL commands directly in the console
A graphical user interface tool called MySQL Workbench
47 Roi Yehoshua, 2023
MySQL Tools
After the installation, you will have two programming tools to work with MySQL:
MySQL Shell, where you can type SQL commands directly in the console
A graphical user interface tool called MySQL Workbench
To run the MySQL Shell open a command prompt
Then type mysqlsh -u root -p --sql
Enter the password for your root user
Then you can enter SQL commands directly in the shell
For Mac users, use the following link
https://www.bytebase.com/blog/how-to-install-mysql-shell-on-macos
48 Roi Yehoshua, 2023
MySQL Shell
49 Roi Yehoshua, 2023
MySQL Shell
Running a SQL query:
50 Roi Yehoshua, 2023
MySQL Tools
Start MySQL WorkBench and click on the default connection
51 Roi Yehoshua, 2023
Create a New Schema
Click the Create New Schema button in the toolbar
Call the schema university
Click Apply
52 Roi Yehoshua, 2023
Creating a New Database
You will see the newly created database on the left side under the Schemas tab
Double-click on the database to select it
All your SQL statements will now be directed to this database
Alternatively, you can type USE [database] before your SQL statement
53 Roi Yehoshua, 2023
Load the Sample Database
Download the scripts for creating the university database from the book’s web site
DDL.sql creates the schema
smallRelationsInsertFile.sql loads data into the tables
To run SQL scripts choose File > Open SQL Script
54 Roi Yehoshua, 2023
Load the Sample Database
Double-click on the university schema and then click on the execute button
55 Roi Yehoshua, 2023
Load the Sample Database
Before running smallRelationsInsertFile put into comments the delete commands by
wrapping them around /* and */
MySql by default has the safe-updates option set. This means that you can't update or delete
records without specifying a key in the where clause.
56 Roi Yehoshua, 2023
Load the Sample Database
Right click on the university schema and choose Refresh All
You will now be able to see the tables that have been created
57 Roi Yehoshua, 2023
Load the Sample Database
To view a table’s data, click on a table name and then click the grid button
58 Roi Yehoshua, 2023
Run SQL Query
To run an SQL query, click the Create New SQL button in the toolbar
Type your query in the editor and then click Execute
59 Roi Yehoshua, 2023
Generating Schema Diagrams in MySQL
To create a diagram from an existing database you can use the reverse engineering
option in MySQL Workbench
Provide connection details to your database and click Next
Select the schema that you want to reverse engineer
60 Roi Yehoshua, 2023
Generating Schema Diagrams in MySQL
When the process ends, you’ll get a diagram with all the tables and views
61 Roi Yehoshua, 2023
Final Notes
I publish numerous articles on data science and machine learning on Medium
https://medium.com/@roiyeho
62 Roi Yehoshua, 2023