Introduction To Database
Faculty Advisors, Korea National University of Education | Project Technical Committee | Author Committee
Dr. Ki-Sang Song (PM) H.E Put Samith Dr. Chhouk Chan Chhaya
Dr. Taeyoung Kim Mr. Kim Junsu Mr. Sopheap Vannpheakdey
Dr. Youngsik Kim Dr. Ki-Sang Song (PM) Mr. Chhom Leang
Dr. KwiHoon Kim Ms. Pen Vuthyda Mr. Hout Panharith
Dr. Hyung-Jong Choe Mr. Ngor Penglong Mr. Miech Thuy
Dr. Seung-Hyun Kim Mr. Sun Bunna Mr. Phel Phearoun
Dr. Sang-Mok Jung Mr. Sok Tha Ms. Pok Pisey
Dr. Kwan-Hee Yoo Mr. Pring Morokoath Mr. Samang Mengheng
Dr. In-Seong Jeon (PMO) Dr. Chhouk Chanchhaya Mr. San Phun
Mr. Seng Sim Mr. Sok Chea
Mr. Oeur Sokmeng Mr. Sorn Rithy
Mr. Bouy Vuthy Mr. Toy Kompheak
Mr. Nget Soda Mr. Ty Puthy
Mr. Uk Borath
Mr. Uk Samphors
Mr. Lay Sokchea
Mr. Phin Phal
Preface
Those who want to teach ICT subjects in secondary schools must have a basic knowledge of data
storage. It is impossible to cover everything about data storage in one book. Still, this book
tries to cover the basics of data, information, and file systems that you need to know before studying
data storage at the DBMS level. It covers the fundamentals by first introducing basic data
storage concepts and techniques before exploring the DBMS administrator field.
This textbook follows a bottom-up arrangement of subjects that progresses from the concrete
to the abstract, an order that results in a sound pedagogical presentation in which each topic leads to
the next. It begins with basic data, information, and file-system concepts, the structure of a DBMS,
client/server and web architectures, and the relational data model (Chapters 1 and 2). It then
progresses to designing data, choosing data types, and using Structured Query Language (SQL)
to query, insert, delete, and modify data (Chapters 3 and 4), and continues to putting those
concepts into practice with SQL and the PHP language (Chapters 5 and 6).
Finally, the book turns to data analysis with the Orange 3 platform (Chapters 7 and 8).
The first section of each chapter offers motivational material that addresses more
specialized themes, or explores traditional issues in more depth, linked to the chapter as a whole.
The Introduction opens the discussion with representative problems to give the reader an overview
of what to expect from the chapter. The Key Points highlight the essential concepts covered in each
section. The Chapter Summary reviews the important subjects that students should understand and
remember, helping them reinforce the key concepts they have learned in the chapter. At the end
of each chapter, the Questions and Exercises in the Social Issues category give students
opportunities to apply their new skills independently. They are designed to
provoke thought and discussion, and many can serve as the basis for research tasks
presented as concise written or oral reports.
A variety of supplemental materials for this textbook are accessible to all readers:
• https://www.w3schools.com/sql/default.asp
• https://orangedatamining.com/
• http://comedudb6.knue.ac.kr/tykim/Myhome/note.html
• https://tmfrlska.notion.site/2022-2nd-Data-Analysis-0058aa8b2e8c4075b0e803af6d6b6b3f
Contents
CHAPTER 1: INTRODUCTION
1.1 Database and Database System .......................................................... 3
1.2 History ................................................................................................ 4
1.3 DBMS .............................................................................................. 11
Summary ................................................................................................ 29
Questions ................................................................................................ 29
Exercises ................................................................................................ 30
CHAPTER 5: SQL
5.1 Introduction .................................................................................... 121
5.2 DDL: Data Definition Language ................................................... 122
5.3 DML: Data Manipulation Language ..................................... 125
Summary .............................................................................................. 136
Questions .............................................................................................. 136
Exercises .............................................................................................. 137
CHAPTER 6: PHP
6.1 WWW (World Wide Web) ............................................................ 141
6.2 Installing Web Server (Apache) .................................................... 141
6.3 Installing PHP (PHP: Hypertext Preprocessor) ................. 144
6.4 Installing MySQL .......................................................... 150
Summary .............................................................................................. 171
Questions .............................................................................................. 171
Exercises .............................................................................................. 172
Learning Objective:
In this chapter, we introduce how useful information is derived from raw facts. These raw
facts are known as data. Data are likely to be managed most efficiently when they are
stored in a database. You will learn what a database is and what it does. You will also
learn about various types of databases and why database design is so important.
We also examine file system data management and the limitations of storing data in files.
Databases and database management systems have become essential for
managing businesses, governments, schools, universities, banks, and so forth. The chapter
also discusses the responsibilities of a database administrator and the 2-tier and 3-tier
architectures.
Database is the keyword for the systematic storage of data. A database is a collection of
interrelated and persistent data, and a Database Management System (DBMS) is a set of
application programs used to access, update, and manage that data.
Since the computer was developed, it has been used in a wide variety of fields.
The goal of a DBMS is to provide an environment that is both convenient and efficient to
use in retrieving and storing database information.
4 Chapter 01 – Introduction
1.2 History
Before databases, file systems were used to store, retrieve, and update data. Data stored
as files grew complex and were challenging to control. Data models were then
developed: the hierarchical model and the network model.
As the volume of data increased rapidly, these models proved difficult to use, and
relational databases appeared.
Databases have historically been an essential part of any information processing system.
The origins of databases go back to libraries, governmental, business, and medical records.
There is a long history of information storage, indexing, and retrieval.
Before the 1960s, file systems were used to store, retrieve, and update data. The structure
is shown in Figure 1.1: each department manages its own files (e.g., personnel records,
accounting records).
To see why database management systems are necessary, let’s look at a typical “file-
processing system” supported by a conventional operating system. The application is a
savings bank:
• Savings accounts and customer records are kept in permanent system files.
• Application programs are written to manipulate files to perform the following tasks:
− Debit or credit an account.
− Add a new account.
− Find an account balance.
− Generate monthly statements.
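As an illustration (not from the book; the record layout and names are hypothetical), a file-processing approach might look like the following sketch in Python. Each application program manipulates the flat file directly and must know the record format; nothing enforces consistency between programs.

```python
# Minimal sketch of a file-processing "savings bank" (hypothetical layout).
# Every program that touches the file must know that a record is
# "number,owner,balance" - there is no shared schema.
import csv
import io

ACCOUNTS = io.StringIO()  # stands in for a permanent system file

def add_account(f, number, owner, balance):
    f.write(f"{number},{owner},{balance}\n")

def find_balance(f, number):
    f.seek(0)  # scan the whole file from the start
    for rec in csv.reader(f):
        if rec[0] == number:
            return float(rec[2])
    return None

add_account(ACCOUNTS, "A-101", "Sok", 500.0)
add_account(ACCOUNTS, "A-102", "Pisey", 750.0)
print(find_balance(ACCOUNTS, "A-102"))  # 750.0
```

Note how every new kind of query (e.g., "generate monthly statements") requires another program that re-implements the same record-parsing logic.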
So, we can see there are problems with the straight file-processing approach: data duplication, limited data sharing, and no program-data independence.
These problems and others led to the development of database management systems.
In the 1960s, computers became cost-effective for private companies, along with the
increased storage capacity of computers. Databases (DB) and Database Management
Systems (DBMS) were developed. Figure 1.2 shows DB and DBMS.
Two main data models were developed: the hierarchical model (IMS) and the network
model (CODASYL), which accesses the database through low-level pointer operations
linking records. Storage details depended on the type of data to be stored. Thus, adding an
extra field to the database requires rewriting the underlying access/modification scheme.
Figure 1.3 shows the hierarchical model, and Figure 1.4 shows the network model.
These two models emphasized the records to be processed, not the overall structure of the
system. A user needed to know the physical structure of the database to query for
information. One major commercial success was the SABRE system from IBM and
American Airlines.
In the 1970s, several camps of proponents argued about the merits of these competing
systems, while the theory of databases led to mainstream research projects. Two main
prototypes for relational systems were developed during 1974-77.
In 1970-72, E.F. Codd at IBM proposed a relational model for databases in a landmark
paper on how to think about databases. He disconnected a database's schema (its logical
organization) from the physical storage methods. This separation has been the standard ever since.
In the relational model, records are stored only in tables, as shown in Figure 1.5.
These systems provide excellent examples of how theory leads to best practice.
The term Relational Database Management System (RDBMS) was coined during this period.
In 1976, P. Chen proposed the Entity-Relationship (ER) model for database design,
giving another essential insight into conceptual data models. Such higher-level modeling
allows the designer to concentrate on using data instead of logical table structure.
In the 1980s, SQL (Structured Query Language) became the “intergalactic standard”, and DB2
became IBM’s flagship product.
Network and hierarchical models fade into the background, with essentially no
development of these systems today, but some legacy systems are still in use. Development
of the IBM PC gives rise to many DB companies and products such as RIM, RBASE 5000,
PARADOX, OS/2 Database Manager, dBase III, IV (later FoxBASE, even later Visual
FoxPro), and Watcom SQL.
In the early 1990s, an industry shakeout begins, with fewer surviving companies
offering increasingly complex products at higher prices. Much development during this
period centers on client tools for application development, such as PowerBuilder (Sybase),
Oracle Developer, VB (Microsoft), etc. The client-server computing model becomes the
norm for future business decisions. Personal productivity tools such as Excel/Access (MS)
and ODBC are developed. This period also marks the beginning of Object Database
Management Systems (ODBMS) prototypes.
In the mid-1990s, the usable Internet/WWW appears. A mad scramble ensues to allow
remote access to computer systems with legacy data. Client-server frenzy reaches the
desktop of average users with little patience for complexity while Web/DB grows
exponentially.
In the late ‘90s, the significant investment in Internet companies fueled the tools market
boom for Web/Internet/DB connectors. Active Server Pages, FrontPage, Java Servlets,
JDBC, Enterprise Java Beans, ColdFusion, Dreamweaver, Oracle Developer 2000, etc.,
are examples of such offerings. Open-source solutions come online with the widespread
use of GCC, CGI, Apache, MySQL, etc. Online Transaction processing (OLTP) and online
analytic processing (OLAP) have come of age, with many merchants using point-of-sale
(POS) technology daily.
In the early 21st century, the Internet industry declined as a whole, but the solid growth
of DB applications continues. More interactive applications appear with PDAs, POS
transactions, consolidation of vendors, etc. Three leading (western) companies predominate
in the large DB market: IBM (buys Informix), Microsoft, and Oracle.
Future trends point to massive (terabyte) systems that require novel means of
handling and analyzing data. Large science databases include genome projects and geological,
national security, and space exploration data. Clickstream analysis is happening now, and data
mining, data warehousing, and data marts are commonly used techniques today. More of
this will undoubtedly come in the future, such as smart/personalized shopping based on
purchase history, time of day, etc.
Mobile database use is a product now coming to market in various ways. Distributed
transaction processing is becoming the norm for business planning in many arenas.
There will probably be a continuing shakeout in the RDBMS market. Linux with
Apache supporting MySQL (or even Oracle) on relatively cheap hardware is a significant
threat to high-cost legacy systems of Oracle and DB2, so these have begun pre-emptive
projects to hold onto their customers.
1.3 DBMS
DBMSs evolved and spread rapidly to improve data sharing, reduce data
redundancy, and ensure program-data independence. The responsibilities of the
database administrator (DBA) and the 2-tier, 3-tier, and n-tier architectures are also
described in this section.
1.3.1 Structure
A DBMS can be a highly complex set of software programs that controls the organization,
storage, and retrieval of data (fields, records, and files) in a database. The basic
functionalities that a DBMS must provide are:
1. A modeling language to define the schema of each database hosted in the DBMS, according
to the DBMS data model.
2. A database query language and report writer to allow users to interactively interrogate
the database, analyze its data, and update it according to the users’ privileges on data.
3. A transaction mechanism that ideally guarantees the ACID properties to ensure data
integrity despite concurrent user access (concurrency control) and faults (fault tolerance).
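The transaction mechanism can be sketched with SQLite, the file-based RDBMS bundled with Python (the table, amounts, and check constraint below are our own illustration, not the book's): a debit/credit pair either commits as a whole or is rolled back as a whole, which preserves atomicity and consistency.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (number TEXT PRIMARY KEY, "
            "balance REAL CHECK (balance >= 0))")
con.executemany("INSERT INTO account VALUES (?, ?)",
                [("A-101", 500.0), ("A-102", 750.0)])
con.commit()

def transfer(con, src, dst, amount):
    """Debit src and credit dst atomically; roll back on any failure."""
    try:
        with con:  # commits on success, rolls back on exception
            con.execute("UPDATE account SET balance = balance - ? "
                        "WHERE number = ?", (amount, src))
            con.execute("UPDATE account SET balance = balance + ? "
                        "WHERE number = ?", (amount, dst))
    except sqlite3.Error:
        pass  # rollback has already restored both rows

transfer(con, "A-101", "A-102", 200.0)    # succeeds: 300.0 / 950.0
transfer(con, "A-101", "A-102", 1000.0)   # debit would go negative -> rolled back
print(con.execute("SELECT balance FROM account ORDER BY number").fetchall())
# [(300.0,), (950.0,)]
```

The second transfer violates the CHECK constraint mid-transaction; because the whole unit is rolled back, neither account is left half-updated.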
The DBMS accepts requests for data from the application program and instructs the
operating system to transfer the appropriate data. The overall structure of the DBMS is
shown in Figure 1.7.
Data Abstraction
The DBMS provides users with tables through layered abstraction, as shown in Figure 1.8.
The primary purpose of a database system is to provide users with an abstract view of
the data. The system hides the details of how data are stored, created, and maintained.
Complexity should be hidden from database users.
There are several levels of abstraction from the bottom to the top:
1. Physical Level:
− Lowest level.
− Describes how the data are actually stored.
2. Conceptual Level:
− Describes what data are stored and the relationships among the data.
3. View Level:
− Highest level.
− Describes part of the database for a particular group of users.
− Can be many different views of a database.
− E.g., tellers in a bank get a view of customer accounts but not payroll data.
When a DBMS is used, information systems can be changed much more quickly as the
organization’s information requirements vary. New categories of data can be added to the
database without disrupting the system.
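The bank-teller example above can be sketched in SQL via Python's sqlite3 module (table and column names are hypothetical): a view exposes the customer account data while the payroll table stays hidden from teller queries.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, balance REAL);
CREATE TABLE payroll  (emp_id INTEGER PRIMARY KEY, name TEXT, salary REAL);
INSERT INTO customer VALUES (1, 'Sok', 500.0);
INSERT INTO payroll  VALUES (9, 'Dara', 1200.0);

-- View level: tellers query this view and never see payroll data.
CREATE VIEW teller_view AS
    SELECT id, name, balance FROM customer;
""")

print(con.execute("SELECT * FROM teller_view").fetchall())
# [(1, 'Sok', 500.0)]
```

Because tellers are given access to `teller_view` rather than the base tables, the payroll data is invisible at their level of abstraction.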
DBA
A database administrator (DBA) manages the database and the DBMS shown in Figure 1.7.
− DBMS management
− database management
− object management: tables /views/indices/clusters etc.
− user management
− performance tuning/backup
Organizations may use one kind of DBMS for daily transaction processing and then
move the detail onto another computer that uses another DBMS better suited for random
inquiries and analysis. Overall, systems design decisions are performed by data
administrators and systems analysts. Data administrators perform detailed database design.
Database servers are specially designed computers that hold the actual databases and
run only the DBMS and related software. Database servers are usually multiprocessor
computers with RAID (Redundant Arrays of Independent Disks) arrays for stable
storage.
With the appearance of local area networks (LANs), PCs came out of their isolation and were
soon connected not only to one another but also to servers.
Client/server computing was born. Servers today are mainly file and database servers.
Application servers are the exception. However, database servers only offer data on the
server side.
This model is the opposite of its famous terminal-based predecessor, which had its
entire intelligence on the host system. The 2-tier and 3-tier client/server models are shown
in Figure 1.11.
One reason why the 2-tier model is so widespread is because of the quality of the tools
and middleware that has been most commonly used since the ‘90s: Remote SQL, ODBC,
relatively inexpensive and well-integrated PC tools (like Visual Basic, PowerBuilder, MS
Access, 4-GL tools by the DBMS manufacturers).
In comparison, the server side uses relatively expensive tools. In addition, the PC-based
tools show good Rapid Application Development (RAD) qualities, i.e., straightforward
applications can be produced comparatively quickly. The 2-tier model is
the logical consequence of the RAD tools’ popularity. For many managers, it was simpler
to attempt to achieve efficiency in software development using tools than to choose the
steep and stony path of “Brainware”.
Why 3-tier?
Unfortunately, the 2-tier model shows striking weaknesses that make developing and
maintaining such applications much more expensive. The software connection of the 2-tier
model is shown in Figure 1.12, and the problems are listed below.
• The complete development accumulates on the PC. The PC processes and presents
information which leads to monolithic applications that are expensive to maintain.
That’s why it’s called a “fat client”.
• In the 2-tier architecture, business logic is implemented on the PC. Even though the
business logic never directly uses the windowing system, programmers must be trained
for the complex API under Windows.
• Windows OS and Mac systems have strict resource restrictions. For this reason,
application programmers must also be well-trained in systems technology to optimize
scarce resources.
• Increased network load: since the actual processing of the data takes place on the remote
client, the data has to be transported over the network. As a rule, this leads to increased
network stress.
• How transactions are conducted is controlled by the client. Advanced techniques like
two-phase commit can’t be run.
• PCs are considered “untrusted” in terms of security, i.e., relatively easy to crack.
Nevertheless, without an alternative, sensitive data is transferred to the PC.
• Data is only “offered” on the server, not processed. Stored procedures are a form of
assistance given by the database provider. But they have a limited application field and
proprietary nature.
• Application logic can’t be reused because it is bound to an individual PC- program.
• The influences on change management are drastic: due to changes in business policy
or law (e.g., changes in VAT computation), processes must be changed. Thus,
possibly dozens of PC programs must be adapted because the same logic has been
implemented numerous times. Each program then has to undergo quality control,
because all programs are expected to generate the same results again.
• The 2-tier model implies a complicated software-distribution procedure. As all of the
application logic is executed on the PC, all those machines (maybe thousands) must be
updated in case of a new release. This can be expensive, complicated, prone to error,
and time-consuming. Distribution procedures include the distribution over networks
(perhaps of large files) or the production of adequate media like CDs. Once it arrives at
the user’s desk, the software must be installed and tested for correct execution. Due to
the distributed character of such an update procedure, system management cannot
guarantee that all clients work on the correct copy of the program.
The 3-tier architecture endeavors to solve these problems. This goal is achieved primarily
by moving the application logic from the client back to the server.
• Client Tier
This tier is responsible for the presentation of data, receiving user events, and controlling
the user interface. The actual business logic (e.g., calculating added value tax) has been
moved to an application server. Today, web server scripts such as PHP and JSP offer an
alternative to traditionally written PC applications.
• Application Server-tier
This tier is new, i.e., it isn’t present in 2-tier architecture in this explicit form. Business
objects that implement the business rules “live” here and are available to the client tier.
This level now forms the central key to solving 2-tier problems. This tier protects the data
from direct access by the clients.
The object-oriented analysis “OOA”, on which many books have been written, aims for
this tier: to record and abstract business processes in business objects. This way, it is
possible to map out the applications-server-tier directly from the CASE tools that support
OOA.
Furthermore, the term “component” is also found here. Today the term predominantly
describes visual components on the client side. In the non-visual area of the
system, components on the server side can be defined as configurable objects, which can
be put together to form new application processes.
• Data Server-tier
This tier is responsible for data storage. Besides the widespread relational database systems,
existing legacy systems databases are often reused here.
It is important to note that boundaries between tiers are logical. It is easy to run all three
levels on the same (physical) machine. The main importance is that the system is neatly
structured and that there is a well-planned definition of the software boundaries between
the different tiers.
As previously mentioned, 3-tier architecture solves many problems that are inherent to 2-
tier architectures. Naturally, it also causes new problems, but the advantages outweigh these.
The software connection of the 3-tier model is shown in Figure 1.14, and the benefits are
listed below.
• Clear separation of user-interface control and data presentation from application logic.
More clients can access a wide variety of server applications through this separation.
The two main advantages for the client applications are clear: quicker development
through reusing pre-built business-logic components and a shorter test phase because
the server components have already been tested.
• Re-definition of the storage strategy won’t influence the clients. RDBMS offers certain
independence from storage details for the clients. However, cases like changing table
attributes make it necessary to adapt the client’s application. In the future, radical
changes, like switching from an RDBMS to an OODBS, won’t influence the client. In
well-designed systems, the client still accesses data over a stable and well-designed
interface that encapsulates all the storage details.
• Business objects and data storage should be brought together as closely as possible.
Ideally, they should be together physically on the same server. This way - especially
with complex accesses - network load is eliminated. The client only receives the
calculation results through the business object, of course.
• In contrast to the 2-tier model, where only data is accessible to the public, business
objects can place applications-logic or “services” on the net. For example, an inventory
number has a “test digit”, and the calculation of that digit can be made available on the
server.
• As a rule, servers are “trusted” systems. Their authorization is more straightforward
than thousands of “untrusted” client PCs. Data protection and security are simpler to
obtain. Therefore, it makes sense to run critical business processes that work with
security-sensitive data on the server.
• Dynamic load balancing: if bottlenecks in terms of performance occur, the server
process can be moved to other servers at runtime.
• Change management: of course, it’s easy and faster to exchange a component on the
server than to furnish numerous PCs with new program versions. For example, running
the latest version of a tax object is pretty easy, so the clients automatically work with
the version from the exact date it must be run. It is, however, compulsory that interfaces
remain stable and that old client versions are still compatible. In addition, such
components require a high standard of quality control. This is because low-quality
components can, at worst, endanger the functions of a whole set of client applications.
At best, they will still irritate the operator of the system.
In a later phase, the old application can be replaced by a modern solution. If it is possible
to leave the business object’s interfaces unchanged, the client application remains
unaffected. However, a requirement for wrapping is that a procedure interface in the old
application remains in existence. A business object can’t emulate a terminal. It is also essential
for the project planner to be aware that the implementation of wrapping objects can be very
complex.
Java is both available and generally suitable, but not all application developers are likely
to prefer it over languages more similar to what they currently use or, in the case of non-
programmers, over higher-level languages and tools. This is OK because there is no real
reason we must converge on a single programming language for the Web any more than
we must converge on a single programming language in any other domain.
The Web does, however, place some specific constraints on our choices: the ability to
deal with a variety of protocols and formats (e.g., graphics) and programming tasks;
performance (both speed and size); safety; platform independence; protection of
intellectual property; and the basic ability to deal with other Web tools and languages.
These issues are not independent of one another. A seemingly optimal choice in one
dimension may be sub-optimal or worse in another.
A Web daemon executes a CGI program on the server and returns the results to the client
(e.g., the results of a query against a database server), rather than simply returning a copy
of the referenced file as it does for an HTML reference. Parameters are passed from the
server to the CGI program as environment variables.
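A minimal sketch of this mechanism in Python (the parameter name `account` is hypothetical): the daemon passes request parameters through environment variables such as QUERY_STRING, and the CGI program writes an HTTP header and body to standard output.

```python
#!/usr/bin/env python3
# CGI-style program: parameters arrive as environment variables and the
# response (header, blank line, body) goes to standard output.
from urllib.parse import parse_qs

def handle_request(environ):
    params = parse_qs(environ.get("QUERY_STRING", ""))
    account = params.get("account", ["unknown"])[0]
    body = f"Balance lookup for account {account}"
    return "Content-Type: text/plain\r\n\r\n" + body

# The web daemon would set QUERY_STRING before invoking the program;
# here we simulate it with a plain dictionary.
print(handle_request({"QUERY_STRING": "account=A-101"}))
```

In a real deployment the daemon spawns the program per request and captures its stdout as the HTTP response, which is exactly the per-connection overhead discussed later in this chapter.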
• Java
Java is the leading contender for a full-feature programming language targeted at Internet
applications. Its advantages are familiarity (derived from C++), platform independence
(will run on any platform which implements the Java Virtual Machine), performance (byte-
code compiled faster than fully interpreted code), and safety (downloaded applets are
checked for integrity, and only interpreted by trusted Virtual Machine).
Java is being aggressively distributed and promoted by Sun Microsystems (Oracle Co.),
which developed it, and sees it as a way to loosen Microsoft’s and Intel’s grip on the
computer platform. The leading web browsers now include the Java VM, and
JavaScript/JSP appears on websites everywhere. Even Microsoft, promoting Visual Basic
Script for this purpose, has licensed Java from Sun and will support it in its browsers.
The list of Java licensees is long and includes other major players, such as IBM. Sun
(Oracle Co.) is distributing a Java developer kit free of charge to promote Java’s widespread
use. It recently announced the development of microprocessors optimized for Java for
different markets (from cellular phones to high-performance 3D “Network Appliances”). If
their strategy is successful, the application platform is raised, and Java displaces Windows
or other OSs as the target platform of application developers. The ballgame changes and
the impact is felt across the entire computer industry, not just the Internet.
Software vendors currently spend considerable resources developing and maintaining
versions of their products for the different hardware/software platform combinations. With
Java, one set of sources and, even more importantly, one byte-compiled executable can be
delivered for all hardware/software platforms.
It is potentially disastrous news for Microsoft and Intel, who, arguably, often sell their
products solely based on their market position rather than their technical merit. Hopefully,
the result will be a more level playing field for vendors and more choices for consumers,
not just the replacement of Microsoft and Intel with Sun (Oracle Co.).
That said, only some agree that Java is the answer. The most common complaint is that
Java is not simple; it is a slimmed-down, cleaned-up C++ with an extensive GUI library.
C++ programming is not described by most as “simple”, and Java programming is not much
more straightforward, especially compared to HTML or some other languages put forward
as its competition. Java is the market leader, so it is the apparent target.
• JavaScript
JavaScript is Netscape’s scripting language for integrating HTML, Netscape plug-ins, and
Java applets. It is based on Java and is mostly syntactically compatible. Still, it differs from
Java in several ways: it is interpreted rather than compiled; it supports only built-in objects
and user-defined functions rather than full user-defined classes with inheritance and
methods; it is integrated with HTML rather than invoked from HTML files; and it is
weakly typed and dynamically bound.
• Python
Python fans tout its clear, intuitive syntax in comparison to C, C++, Java, Perl, shell
languages, and most other interpreted languages; the completeness of its type system and
its suitability for significant application development in comparison to Tcl; and its
extensibility with Python and C/C++ libraries. Like Java, Perl, and Tcl, Python offers
several portable GUI libraries.
Perl advocates complain about the lack of native regular expression matching and output
formatting in Python. Perl is a little more of a “sysadmin’s” shell language than
Python, and Tcl is a little more straightforward and less capable. Python is more of a
general-purpose programming language, and clearer to program in than Java. However,
all are suited to Internet programming.
• VBScript
VBScript source code is embedded in HTML and downloaded to the client in the
HTML file, where it is compiled and executed in association with its runtime libraries.
Microsoft envisions an OLE Scripting Manager on the client side, with browsers interacting
with a specified interface. The Scripting Manager would manage the compilation and
invocation of downloaded scripts in Visual Basic Script or any other scripting language.
Microsoft also intends to support Visual Basic and Java in this way. The idea is to make
multiple language runtimes pluggable into browsers. Microsoft wants to elicit the
cooperation of various consortia and vendors in defining and standardizing this interface.
Microsoft intends to support VBScript on its different Windows platforms and the
Macintosh and will license the technology to UNIX vendors.
In the early days of the Internet, websites were created using the Perl and C languages.
However, Perl and C are inefficient for this purpose and complicated to develop with.
Their processing model is the biggest problem: each time a client connects, a new
process is created.
Therefore, overload occurs more frequently as the number of connections increases, and
the server often goes down.
Therefore, various Web scripting languages have been developed to address these
problems and provide faster service. Microsoft’s ASP, Sun’s JSP, and Zend’s PHP are
famous Web scripting languages. With these three languages, the server creates a process
once and then creates a thread to respond to each client connection. Since tasks are processed
in units of threads, the load on the server is reduced. PHP is developed in the C language,
so its syntax is similar to C, and its execution speed is also fast.
PHP is free and platform-independent, supporting multiple operating systems such as
Windows, Linux, and Unix. It runs especially well on Linux, a free OS, and the combination
is called APM (Apache/PHP/MySQL) because PHP also works well with MySQL, a free RDBMS.
Usually, PHP works in conjunction with the Apache web server. When a user requests a
web document, Apache processes the HTML document, and PHP processes the PHP
script.
Summary
The efficient storage and manipulation of data is the major goal of the database
management system. A database is an organized collection of related data. Only
data that has been processed provides meaningful information that enables an
organization to make critical decisions. Large data files were stored, processed, and
retrieved using computer file processing systems before the DBMS. Computer file
processing systems have limitations such as data duplications, limited data sharing,
and no program data independence.
Questions
Exercises
1. Please look at figure 1.13 (the 3-tier architecture from the textbook) and write a
summary of its processing.
2. Please summarize the role of a database administrator (DBA).
3. Draw a diagram of 2-tier client/server architecture and 3-tier client/server
architecture and explain in brief.
4. Please explore the internet to find the various categories of the data model and
explain them in brief.
Chapter 02
Relational Model
Learning Objective:
This chapter is dedicated to the relational model, proposed in 1970 by E.F. Codd (Edgar
Frank Codd). Various operations in relational algebra and relational calculus are given in this
chapter. After completing this chapter, the reader should be familiar with the following:
the evolution and importance of the relational model; terms of the relational model
such as tuple, domain, cardinality, and degree of a relation; operations in relational algebra and
relational calculus; and the difference between relational algebra and relational calculus.
2.1. Overview
The relational model was introduced by E.F. Codd (Edgar Frank Codd) of IBM in his 1970
paper “A Relational Model of Data for Large Shared Data Banks”.
The first database systems were based on network and hierarchical models. E.F. Codd
first proposed the relational model in 1970, and the first such systems (notably INGRES and
System/R) were developed in the 1970s. The relational model is now the dominant model for
commercial data processing applications.
The text uses long attribute names instead of abbreviations in the notes, as follows:
The terms commonly used by users, modelers, and programmers are given below:
2.2. Relation
We can add, delete and modify rows to reflect changes in the real world. A table row
will consist of an n-tuple where n is the number of attributes. The table contains a subset of
the set of all possible rows. We refer to the set of all possible rows as the Cartesian product
of the sets of all attribute values.
• Figure 2.1 shows the Deposit and Customer tables for our banking example
• The deposit table has four attributes.
• Each attribute has a permitted set of values called the domain of that attribute.
• E.g., the domain of bname is the set of all branch names.
Let D1 denote the domain of bname, and D2, D3, and D4 the domains of the remaining
attributes, respectively. We may denote the set of all possible rows as
D1 × D2 × D3 × D4
For the Deposit table, D1, D2, D3, and D4 denote the set of all branch names, all account
numbers, all customer names, and all balances, respectively.
In general, for a table of n columns, we may denote the Cartesian product of D1, D2, ...,
Dn by
×ni=1 Di (that is, D1 × D2 × ... × Dn)
Then, any row of Deposit consists of a four-tuple (v1, v2, v3, v4). The Deposit relation
generally contains a subset of the set of all possible rows; that is, Deposit
is a subset of
D1 × D2 × D3 × D4, abbreviated ×4i=1 Di
We will also require that the domains of all attributes be indivisible units.
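This view of a relation as a subset of a Cartesian product of domains can be sketched in Python with the standard library’s `itertools.product`. The tiny domains below are hypothetical stand-ins, not the textbook’s figures:

```python
from itertools import product

# Hypothetical toy domains for two of the Deposit table's attributes.
bnames = {"SFU", "MIT"}      # D1: branch names
accounts = {101, 102}        # D2: account numbers

# The Cartesian product D1 x D2 is the set of all possible (bname, account) pairs.
all_rows = set(product(bnames, accounts))
print(len(all_rows))  # |D1| * |D2| = 2 * 2 = 4

# An actual relation is a subset of this product.
deposit = {("SFU", 101), ("MIT", 102)}
print(deposit <= all_rows)  # True
```

The point of the sketch is that the relation stores only some of the combinatorially possible rows.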
Note the difference between the logical design of a database (the database scheme) and the
contents of the database at a given moment (the database instance). The list of attributes
and their corresponding domains is called a relation scheme.
We may state that Deposit is a relation on the scheme Deposit-scheme by writing
Deposit(Deposit-scheme). If we wish to specify domains, we can write:
Note that customers are identified by name. This would not be allowed in the real world,
as two or more customers might share the same name. Figure 2.2 shows the E-R diagram
for a banking enterprise.
The relation schemes for the banking example used throughout the text are:
Some attributes appear in several relation schemes (e.g., bname, cname). This is legal
and provides a way of relating tuples of distinct relations. Why not put all attributes in one
relation?
• If a customer has several accounts, we must duplicate their address for each account.
• If a customer has an account but no current address, we cannot build a tuple, as we
have no values for the address.
• We would have to use null values for these fields.
• Null values cause difficulties in the database.
• We can do this without using null values using two separate relations.
2.4. Keys
A super key is a set of attributes uniquely identifying each tuple of a relation. A candidate
key is a minimal super key, i.e., a super key from which no attribute can be removed without
losing the uniqueness property. A primary key is a specific choice of candidate key: a
minimal set of attributes (columns) that uniquely specify a tuple (row).
More formally, if we say that a subset K of a relation scheme R is a super key for R, we are
restricting consideration to relations r(R) in which no two distinct tuples, t1 and t2, have
the same values on all attributes in K. In other words,
The notions of super key, candidate key, and primary key apply to the relational model.
For example, in Branch-scheme,
• {bname} is a superkey.
• {bname, bcity} is a superkey.
• {bname, bcity} is not a candidate key, as the superkey {bname} is contained in it.
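The super key test above — no two distinct tuples agree on all of the key attributes — can be sketched in Python. The Branch rows below are hypothetical sample data:

```python
def is_superkey(rows, key_positions):
    """A set of attributes is a super key if no two distinct tuples
    agree on all of those attributes (i.e., the key values are unique)."""
    seen = set()
    for row in rows:
        key = tuple(row[i] for i in key_positions)
        if key in seen:
            return False
        seen.add(key)
    return True

# Hypothetical Branch relation: (bname, bcity, assets)
branch = [("SFU", "Burnaby", 900), ("MIT", "Boston", 750), ("UBC", "Burnaby", 800)]
print(is_superkey(branch, [0]))  # {bname} is a superkey -> True
print(is_superkey(branch, [1]))  # {bcity} repeats 'Burnaby' -> False
```

Note that {bname, bcity} would also pass this test, yet it is not a candidate key, because the smaller set {bname} already suffices.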
To request information from a database, the user uses a query language. Relational
algebra is a procedural query language. It has basic operations such as
select, project, rename, Cartesian product, union, and set difference. Moreover, additional
operations include set-intersection, natural join, division, and assignment.
• Procedural, where the user instructs the system to perform a sequence of operations
on the database. This will compute the desired information.
• Nonprocedural, where the user specifies the information desired without giving a
procedure for obtaining the information.
A complete query language also contains facilities to insert and delete tuples and
modify parts of existing tuples. Relational algebra is a procedural query language, and
operations produce a new relation.
• set-intersection: R ∩ S
• natural join: R ∞ S
• division: R ÷ S
• assignment: R ← S
Select
The select operation selects tuples that satisfy a given predicate. Select is denoted by a
lowercase Greek sigma (σ), with the predicate appearing as a subscript. The argument
relation is shown in parentheses following the σ.
Let Figure 2.3 be the Borrow and Branch relations in the banking operations.
For example, to select tuples (rows) of the borrow relation where the branch is “SFU”,
we would write:
The new relation created as the result of this operation consists of one tuple:
We allow comparisons using =, ≠, <, ≤, >, and ≥ in the selection predicate. We also
allow the logical connectives ∨ (or) and ∧ (and). For example:
Suppose there is one more relation, Client, shown in Figure 2.4, with the scheme
Client_scheme = (cname, banker)
We might write to find clients with the same name as their banker.
σ cname=banker (Client)
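The select operation behaves exactly like a filter over tuples. A minimal Python sketch, using a hypothetical Borrow relation of (bname, loan#, cname, amount) tuples:

```python
# Hypothetical Borrow relation as (bname, loan#, cname, amount) tuples.
borrow = [
    ("SFU", "L-15", "Hayes", 1500),
    ("MIT", "L-14", "Jackson", 900),
    ("SFU", "L-17", "Jones", 1000),
]

def select(predicate, relation):
    """sigma_predicate(relation): keep only the tuples satisfying the predicate."""
    return [t for t in relation if predicate(t)]

# sigma bname='SFU' (Borrow)
sfu_loans = select(lambda t: t[0] == "SFU", borrow)
print(sfu_loans)  # the two tuples whose branch is 'SFU'
```

The predicate passed in plays the role of the subscript of σ.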
Project
The project operation copies its argument relation, keeping only the specified attributes.
Since a relation is a set, duplicate rows are eliminated. Projection is denoted by the Greek
capital letter pi (Π). The attributes to be kept appear as subscripts.
For example, to obtain a relation showing Customers and Branches but ignoring the
amount and loan#, we write
We can perform these operations on the relations resulting from other operations. To
get the names of customers having the same name as their bankers,
Cartesian Product
The Cartesian product of two relations is denoted by a cross (×), written R1 × R2.
The result of R1 × R2 is a new relation with a tuple for each possible pairing of
tuples from R1 and R2. To avoid ambiguity, the attribute names have the name of
the relation from which they came attached. If no ambiguity results, we drop the relation name. The
result of client × customer is a substantial relation, as shown in Table 2.1.
If R1 has n1 tuples and R2 has n2 tuples, then R = R1 × R2 will have n1·n2 tuples. The
resulting scheme is the concatenation of the schemes of R1 and R2, with relation names attached as mentioned.
To find the clients of banker Johnson and the city where they live, we need information
from both the client and customer relations. We can get this by writing:
Finally, to get just the customer’s name and city, we need a projection:
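The whole pipeline — Cartesian product, selection, then projection — can be sketched in Python. The Client and Customer tuples below are hypothetical stand-ins for Table 2.1:

```python
from itertools import product

def project(relation, positions):
    """pi: keep only the listed columns; duplicates vanish because we build a set."""
    return {tuple(t[i] for i in positions) for t in relation}

# Hypothetical Client(cname, banker) and Customer(cname, street, ccity).
client = [("Turner", "Johnson"), ("Hayes", "Jones")]
customer = [("Turner", "Putnam", "Stamford"), ("Hayes", "Main", "Harrison")]

# R1 x R2 pairs every client tuple with every customer tuple: n1 * n2 tuples.
pairs = list(product(client, customer))
print(len(pairs))  # 2 * 2 = 4

# Selection: banker = 'Johnson' and the two cname columns agree.
johnson = [(c + k) for c, k in pairs if c[1] == "Johnson" and c[0] == k[0]]

# Projection onto (cname, ccity): columns 0 and 4 of the concatenated tuple.
print(project(johnson, [0, 4]))  # {('Turner', 'Stamford')}
```

The column positions stand in for attribute names, which is why real systems attach the relation name to disambiguate.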
Rename
The rename operation solves the naming problems that arise when taking the Cartesian
product of a relation with itself. Suppose we want to find the names of all the customers who
live on the same street and in the same city as Smith. We can get the street and city of Smith
by writing
To find other customers with the same information, we need to reference the customer
relation again:
Problem: how do we distinguish between the two street values appearing in the
Cartesian product, as both come from the Customer relation?
Solution: use the rename operator, denoted by the Greek letter rho (ρ).
We write
ρx(R)
If we use this to rename one of the two Customer relations we are using, the ambiguities
will disappear.
Union
The union operation is denoted u as in set theory. It returns the union (set union) of two
To find all customers of the SFU branch, we must find everyone with a loan, an account,
or both at the branch. We need both Borrow and Deposit relations for this:
π cname (σ bname='SFU' (Borrow)) ∪ π cname (σ bname='SFU' (Deposit))
As in all set operations, duplicates are eliminated, giving the relation of Figure 2.5(a).
Set Difference
The set difference operation is denoted by the minus sign (-). It finds tuples that are in
one relation but not in another. Thus, R-S results in a relation containing tuples in R but not
S.
To find customers of the SFU branch who have an account there but no loan, we write
The result is shown in Figure 2.5(b). We can do more with this operation. Suppose we
want to find the largest account balance in the bank.
Strategy:
To find, we write
This resulting relation contains all balances except the largest one. (See Figure 2.6(a)).
Now we can finish our query by taking the set difference:
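The strategy above — collect every balance that is smaller than some other balance, then subtract that set from all balances — can be sketched in Python with a hypothetical set of balances:

```python
# Hypothetical set of account balances.
balances = {500, 700, 1200}

# Pair every balance with every balance (a Cartesian product of the set
# with itself); a balance b1 is "not the largest" if some b2 is greater.
not_largest = {b1 for b1 in balances for b2 in balances if b2 > b1}
print(not_largest)  # {500, 700}: all balances except the largest one

# The set difference leaves only the largest balance.
print(balances - not_largest)  # {1200}
```

This mirrors the algebra exactly: a self-product, a selection, a projection, and finally a set difference.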
Additional operations are defined in terms of the fundamental operations. They do not
add power to algebra but help simplify common queries.
Intersection
The set intersection operation is denoted by ∩ and returns a relation containing the tuples
that appear in both of its argument relations. It adds no power, as
R ∩ S = R − (R − S)
To find all customers having both a loan and an account at the SFU branch, we write
π cname (σ bname='SFU' (Borrow)) ∩ π cname (σ bname='SFU' (Deposit))
Natural Join
Often, we want to simplify queries on a Cartesian product. For example, to find all
customers having a loan at the bank and the cities in which they live, we need Borrow and
Customer relations:
Our selection predicate ensures that we obtain only those tuples pertaining to a single
cname. This type of operation is very common, so we have the natural join operation,
denoted by the ∞ sign. Natural join combines a Cartesian product and a selection into one
operation: it performs a selection forcing equality on those attributes that appear in both
relation schemes. Duplicates are removed, as in all relational operations.
Formally,
To find the assets and names of all branches which have depositors living in Stamford,
we need Customer, Deposit, and Branch relations:
Note that ∞ is associative. To find all customers who have both an account and a loan
at the SFU branch:
This is equivalent to the set intersection version we wrote earlier. We see now that
there can be several ways to write a query in relational algebra. If two relations R(r) and
S(s) have no attributes in common, then r ∩ s = ∅, and R ∞ S = R × S.
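The definition of natural join as a Cartesian product plus an equality selection and a projection can be sketched in Python. The relations and column lists below are hypothetical:

```python
def natural_join(r, s, r_cols, s_cols):
    """R ∞ S: pair tuples that agree on the shared attributes, then keep
    each shared attribute only once (product + selection + projection)."""
    shared = [c for c in r_cols if c in s_cols]
    out_cols = r_cols + [c for c in s_cols if c not in shared]
    result = []
    for t in r:
        for u in s:
            # Selection: equality on every attribute common to both schemes.
            if all(t[r_cols.index(c)] == u[s_cols.index(c)] for c in shared):
                # Projection: drop the duplicate copy of the shared columns.
                extra = tuple(u[s_cols.index(c)] for c in s_cols if c not in shared)
                result.append(t + extra)
    return out_cols, result

# Hypothetical Borrow(bname, cname) and Customer(cname, ccity).
borrow = [("SFU", "Hayes"), ("MIT", "Jones")]
customer = [("Hayes", "Harrison"), ("Jones", "Brooklyn")]
cols, rows = natural_join(borrow, customer, ["bname", "cname"], ["cname", "ccity"])
print(cols)  # ['bname', 'cname', 'ccity']
print(rows)  # [('SFU', 'Hayes', 'Harrison'), ('MIT', 'Jones', 'Brooklyn')]
```

With no shared attributes, `shared` is empty, every pair passes the test, and the result degenerates to R × S, matching the remark above.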
Division
The division operation, denoted ÷, is suited to queries that include the phrase “for all”.
Suppose we want to find all the customers with an account at all branches located in
Brooklyn.
which is simply R2 ÷ R1
Formally, the result consists of tuples on the scheme r − s such that, for every value of
the s portion appearing in relation S, the combined tuple (the r − s portion together with
that s portion) is in R.
We will look at this explanation in class more closely. The division operation can be
defined in terms of the fundamental operations.
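The “for all” behavior of division can be sketched in Python for the simple case of a binary relation divided by a unary one. The data below is hypothetical:

```python
def divide(r, s):
    """R ÷ S for binary R(a, b) and unary S(b): the a-values that are
    paired in R with *every* b-value appearing in S."""
    s_vals = {t[0] for t in s}
    candidates = {a for a, _ in r}
    # Keep a only if the set of b's paired with it covers all of S.
    return {a for a in candidates
            if s_vals <= {b for a2, b in r if a2 == a}}

# Hypothetical: which customers have an account at all Brooklyn branches?
has_account = [("Hayes", "Downtown"), ("Hayes", "Brighton"),
               ("Jones", "Downtown")]
brooklyn_branches = [("Downtown",), ("Brighton",)]
print(divide(has_account, brooklyn_branches))  # {'Hayes'}
```

Jones is excluded because one Brooklyn branch ("Brighton") is missing from his account tuples, which is exactly the universal quantification division captures.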
Assignment
T1 ← π r−s (R)
No extra relation is added to the database, but the relation variable created can be used
in subsequent expressions. Assignment to a permanent relation would constitute a
modification to the database.
Up until now, we have looked at extracting information from the database. We also
need to add, remove and change information. Modifications are expressed using the
assignment operator.
Deletion
Deletion is expressed in much the same way as a query. Instead of displaying, the
selected tuples are removed from the database. We can only delete whole tuples.
r ← r − E
where r is a relation and E is a relational algebra query. The tuples in the result of E
are deleted from r.
Some examples:
2. Delete all loans with loan numbers between 1300 and 1500.
Deposit ← Deposit − r2
Insertion
To insert data into a relation, we either specify a tuple or write a query whose result is
the set of tuples to be inserted. Attribute values for inserted tuples must be members of the
attribute’s domain.
An insertion is expressed by r ← r ∪ E
Some examples:
1. To insert a tuple for Smith, who has $1200 in account 9372 at the SFU branch.
2. To provide all loan customers in the SFU branch with a $200 savings
account.
r1 ← σ bname=’SFU’ (Borrow)
r2 ← π bname,loan#,cname (r1)
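The same insertion can be written with Python sets standing in for relations. The tuples are hypothetical, and as in the textbook's example the loan number is reused as the new account number:

```python
# Hypothetical Borrow(bname, loan#, cname) and Deposit(bname, account#, cname, balance).
borrow = {("SFU", "L-15", "Hayes"), ("MIT", "L-14", "Jackson")}
deposit = {("SFU", "A-102", "Hayes", 400)}

# r1 <- sigma bname='SFU' (Borrow)
r1 = {t for t in borrow if t[0] == "SFU"}

# Build the $200 savings-account tuples, then Deposit <- Deposit ∪ r2.
r2 = {(bname, loan, cname, 200) for bname, loan, cname in r1}
deposit = deposit | r2
print(sorted(deposit))
```

Because relations are sets, re-running the insertion is harmless: the union simply finds the tuples already present.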
Update
Updating allows us to change some values in a tuple without necessarily changing all of them.
γ A←E (r)
where r is a relation with attribute A, which is assigned the value of expression E. The
expression E is any arithmetic expression involving constants and attributes of the relation r.
Some examples:
2. To make two different interest payment rates, depending on the account balance.
The relational calculus is nonprocedural. Users define queries in terms of what they
want, not how the computer should compute it.
The tuple relational calculus is a nonprocedural language. (The relational algebra was
procedural.) We must provide a formal description of the information desired.
A query in the tuple relational calculus is expressed as
{t | P(t)}
Several tuple variables may appear in a formula. A tuple variable is said to be a free
variable unless it is quantified by a ∃ or a ∀; then it is said to be bound.
• An atom is a formula.
• If P is a formula, so are ¬P and (P).
• If P1 and P2 are formulae, then so are P1 ∨ P2, P1 ∧ P2, and P1 ⇒ P2.
• If P(s) is a formula containing a free tuple variable s, then the following are also
formulae:
∃s ∈ r(P(s)) and ∀s ∈ r(P(s))
• P1 ∧ P2 = ¬(¬P1 ∨ ¬P2)
• ∀t ∈ r(P(t)) = ¬∃t ∈ r(¬P(t))
• P1 ⇒ P2 = ¬P1 ∨ P2
For example, to find the branch name, loan number, customer name, and amount for
loans over $1200:
This gives us all attributes, but suppose we only want the customer names. (We would
use the project in the algebra.) We need to write an expression for a relation on the scheme
(cname).
In English, we may read this equation as “the set of all tuples t such that there exists a
tuple s in the relation Borrow for which the values of t and s for the cname attribute are
equal, and the value of s for the amount attribute is greater than 1200.” The notation ∃t ∈
r(Q[t]) means “there exists a tuple t in relation r such that predicate Q[t] is true”.
How did we get the above expression? We needed tuples on the scheme (cname) such that
there were tuples in Borrow for that customer name with an amount greater than 1200. The
tuple variable t gets the scheme (cname) implicitly, as that is the only attribute t is mentioned with.
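A tuple-calculus expression of this shape maps naturally onto a Python set comprehension, with dictionaries playing the role of tuple variables with named attributes. The Borrow rows below are hypothetical:

```python
# Hypothetical Borrow relation: each dict is one tuple with named attributes.
borrow = [
    {"bname": "SFU", "loan#": "L-15", "cname": "Hayes", "amount": 1500},
    {"bname": "MIT", "loan#": "L-14", "cname": "Jackson", "amount": 900},
]

# { t | exists s in Borrow ( t[cname] = s[cname] and s[amount] > 1200 ) }
# The existential quantifier becomes iteration over the relation; the
# resulting t-tuples carry only the cname attribute.
result = {s["cname"] for s in borrow if s["amount"] > 1200}
print(result)  # {'Hayes'}
```

The comprehension is automatically “safe” in the chapter's sense: it can only produce values that already appear in Borrow.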
Let us look at a more complex example. “Find all customers having a loan from the
‘SFU’ branch and the cities where they live.”
In English, we might read this as “the set of all (cname, ccity) tuples for which cname
is a borrower at the ‘SFU’ branch, and ccity is the city of cname”. Tuple variable s ensures
that the customer is a borrower at the ‘SFU’ branch. Tuple variable u is restricted to pertain
to the same customer as s and also ensures that ccity is the city of the customer.
The logical connectives ∧ (and) and ∨ (or) are allowed, as well as ¬ (negation). We
also use the existential quantifier ∃ and the universal quantifier ∀.
1. Find all customers having a loan, an account, or both at the SFU branch:
Note the use of the ∨ connective. As usual, the set operations remove all duplicates.
2. Find all customers with both loans and accounts at the SFU branch. Solution: change the
∨ connective in 1 to a ∧.
3. Find customers with an account but not a loan at the SFU branch.
{t | ∃u ∈ Deposit(t[cname]=u[cname] ∧ u[bname] ='SFU')
∧ ¬∃s ∈ Borrow(t[cname]=s[cname] ∧ s[bname] ='SFU')}
4. Find all customers with an account at all branches located in Brooklyn.
(We used division in relational algebra.) For this example, we will use implication, denoted
by a pointing finger in the text, but by ⇒ here. The formula P ⇒ Q means P implies Q; if P
is true, then Q must be true.
{t | ∀u ∈ Branch(u[bcity] = 'Brooklyn' ⇒
In English: “the set of all cname tuples t such that for all tuples u in the Branch relation,
if the value of u on attribute bcity is Brooklyn, then the customer has an account at the
branch whose name appears in the bname attribute of u.” The division operation is difficult to
understand. Think it through carefully.
Consider the expression {t | ¬(t ∈ Borrow)}.
There are an infinite number of tuples that are not in Borrow! Most of these tuples
contain values that do not appear in the database. Therefore, we need to restrict the
relational calculus a bit.
• The domain of a formula P, denoted dom(P), is the set of all values referenced in P.
• These include values mentioned in P and values that appear in a tuple of a relation
mentioned in P.
• So, the P domain is the set of all values explicitly appearing in P or relations
mentioned in P.
• dom(t ∈ Borrow ∧ t[amount]>1200) is the set of all values appearing in Borrow, together
with the value 1200.
• dom({t | ¬(t ∈ Borrow)}) is the set of all values appearing in Borrow.
We say an expression {t | P(t)} is safe if all values that appear in the result are
values from dom(P). A safe expression yields a finite number of tuples as its result;
otherwise, it is called unsafe. The tuple relational calculus restricted to safe expressions is
equivalent in expressive power to relational algebra.
Summary
In the relational model, the data are stored in the form of tables or relations.
Each table or relation has a unique name. Tables consist of a given number of
columns or attributes. Every table column must have a name, and no two columns
of the same table may have identical names.
The rows of the table are called tuples. The total number of columns or
attributes is known as the degree of the table. The relational model provides an
alternative way to express queries. The principles of the relational model were laid
down by E.F. Codd (Edgar Frank Codd). This chapter also introduced the concepts of
relational algebra and relational calculus. Different operators like selection,
projection, rename, Cartesian product, union, intersection, natural join, division, and
assignment were discussed with suitable examples.
Questions
Exercises
According to figure 2.3 mentioned above, please write the relational
algebra expressions that solve the problems below:
1. Show all tuples that have the value “101” in column “loan#”
2. List all entries for ABA_TK (requires amount>2000)
3. Delete all customers with the name “Gech” from branch name “ABA_TK”.
4. Delete all accounts at branches located in SFU.
Chapter 03
Relational Language-SQL
Learning Objective
The objective of this chapter is to introduce the history of SQL and the relational language SQL.
It provides the definition and theory of SQL and shows how to use it to create,
modify, retrieve, and manipulate data in relational database management systems.
After successfully completing this chapter, students should be able to:
• Understand the history of Relational Language-SQL
• Explain DDL: Data Definition Language
• Understand Views
• Explain DML: Data Modification
• Understand Embedded SQL
SQL is the most popular computer language used to create, modify, retrieve, and
manipulate data in relational database management systems. Historically, during the 1970s
a group at IBM’s San Jose research center developed the database system from which SQL
emerged; Oracle later produced the first commercial implementation. ANSI and ISO
standardized SQL. Programming languages were integrated with SQL or used SQL commands
to work with the database. As time went on, SQL became more developed and more useful.
Commercial database systems require more user-friendly query languages. We will look
at SQL in detail. Although referred to as query language, SQL contains facilities for
designing and modifying the database.
During the 1970s, a group at IBM’s San Jose research center developed a database
system, “System R” based upon, but not strictly faithful to, Codd’s model. Structured
English Query Language (“SEQUEL”) was designed to manipulate and retrieve data stored
in System R.
The acronym SEQUEL was later condensed to SQL because the word ‘SEQUEL’ was
held as a trademark by the Hawker-Siddeley aircraft company of the UK. Although SQL
was influenced by Codd’s work, Donald D. Chamberlin and Raymond F. Boyce at IBM
were the authors of the SEQUEL language design. Their concepts were published to
increase interest in SQL.
The first non-commercial, relational, non-SQL database, Ingres, was developed in 1974
at U.C. Berkeley. In 1978, methodical testing commenced at customer test sites.
Demonstrating both the usefulness and practicality of the system, this testing proved to be
a success for IBM. As a result, IBM began to develop commercial products that
implemented SQL based on their System R prototype, including the System/38 (announced
in 1978 and commercially available in August 1979), SQL/DS (introduced in 1981), and
DB2 (in 1983).
At the same time, Relational Software, Inc. (now Oracle Corporation) saw the potential
of the concepts described by Chamberlin and Boyce. It developed its version of an RDBMS
for the Navy, CIA, and others. In the summer of 1979, Relational Software, Inc. introduced
Oracle V2 (Version2) for VAX computers as SQL’s first commercially available
implementation. Oracle is often incorrectly cited as beating IBM to market by two years
when they only beat IBM’s release of the System/38 by a few weeks. Considerable public
interest then developed; soon, many other vendors developed versions, and Oracle’s future
was ensured.
It is often suggested that IBM was slow to develop SQL and relational products,
possibly because SQL was not initially available in its mainframe and Unix environments,
and because IBM feared it would cut into lucrative sales of its IMS database product, which
used a navigational rather than a relational data model.
However, at the same time as Oracle was being developed, IBM was developing the
System/38, which was intended to be the first relational database system and was thought
by some at the time, because of its advanced design and capabilities, to be a possible
replacement for mainframe and Unix systems.
Although both ANSI and ISO define SQL, there are many extensions to and variations
on the version of the language defined by these standards bodies. Many of these extensions
are proprietary, such as Oracle Corporation’s PL/SQL, IBM’s SQL PL (SQL
Procedural Language), and Sybase and Microsoft’s Transact-SQL.
It is also not uncommon for commercial implementations to omit support for basic
features of the standard, such as the DATE or TIME data types, preferring some variant of
their own. As a result, unlike ANSI C or ANSI Fortran, which can usually be ported from
platform to platform without major structural changes, SQL code can rarely be ported
between database systems without major modifications. There are several reasons for this
lack of portability between database systems:
• The complexity and size of the SQL standard mean that most databases do not implement
the entire standard.
• The standard does not specify database behavior in several important areas (e.g., indexes),
leaving it up to each implementation to decide how to behave.
• The SQL standard specifies the syntax a conformant database system must implement.
However, the standard’s specification of the semantics of language constructs is less well-
defined, leading to areas of ambiguity.
• Many database vendors have large existing customer bases; where the SQL standard
conflicts with the prior behavior of the vendor’s database, the vendor may be unwilling
to break backward compatibility.
• Some believe the lack of compatibility between database systems is intentional to ensure
vendor lock-in.
SQL is designed for a specific, limited purpose - querying data in a relational database.
As such, it is a set-based, declarative computer language rather than an imperative language
such as C or BASIC, which, being programming languages, are designed to solve a much
broader set of problems. Language extensions such as PL/SQL are designed to address this
by turning SQL into a fully-fledged programming language while maintaining the
advantages of SQL.
One joke about SQL is that “SQL is neither structured, nor is it limited to queries,
nor a language.” This is founded on the notion that pure SQL is not a classic programming
language since it is not Turing-complete. On the other hand, it is a programming language
because it has grammar, syntax, and programmatic purpose and intent. The joke recalls
Voltaire’s remark that the Holy Roman Empire was “neither holy, nor Roman, nor an
empire.”
However, it also makes it possible for SQL source code to be produced (and optimized)
by software, leading to the development of several natural language database query
languages, as well as ‘drag and drop’ database programming packages with ‘object-
oriented’ interfaces. Often these allow the resultant SQL source code to be examined, for
educational purposes, further enhancement, or to be used in a different environment.
Technically, SQL is a declarative computer language for use with “SQL databases”.
Theorists and some practitioners note that many of the original SQL features were inspired
by, but in violation of, the relational model for database management and its tuple calculus
realization. Recent extensions to SQL achieved relational completeness but have worsened
the violations.
In addition, there are also some criticisms about the practical use of SQL:
• Its syntax can result in “run-on SQL sentences” and may force one into deep hierarchical
nesting when a graph-like (reference-by-name) approach may be more appropriate and
would factor repetition better.
• Implementations are inconsistent and usually incompatible between vendors.
• For larger statements, it is often difficult to factor repeated patterns and expressions into
one or fewer places to avoid repetition and avoid making the same change to different
places in a given statement.
• Confusion about the difference between value-to-column assignment in UPDATE and
INSERT syntax.
Regardless, SQL has become the standard relational database language. It has several parts:
We will only look at basic DDL, views, and the DML. The relation schemes for the
banking example used throughout the textbook are:
customer-name account-number
Hayes A-102
Johnson A-101
Johnson A-201
Jones A-217
Lindsay A-222
Smith A-215
Turner A-305
customer-name loan-number
Adams L-16
Curry L-93
Hayes L-15
Jackson L-14
Jones L-17
Smith L-11
Smith L-23
Williams L-17
DDL, or Data Definition Language, consists of the SQL commands that define the
database schema. DDL commands are used to create, modify, and drop
objects such as schemas and tables.
The SQL DDL (Data Definition Language) allows the specification of not only a set of
relations but also the following information for each relation:
where R is the relation name, Ai is the name of an attribute, and Di is the domain of that
attribute.
The values of the primary key must be not null and unique. SQL-92 treats NOT NULL
in the PRIMARY KEY specification as redundant, but SQL-89 requires it to be declared
explicitly.
Integrity constraints provide a way of ensuring that changes made to the database by
authorized users do not result in a loss of data consistency. An integrity constraint can be
any arbitrary predicate applied to the database. They may be costly to evaluate, so we will
only consider integrity constraints that can be tested with minimal overhead.
In addition, the original standard allows the specification of primary and candidate keys
and foreign keys as part of the CREATE TABLE command:
• The PRIMARY KEY clause includes a list of attributes forming the primary key.
• The UNIQUE clause includes a list of attributes forming a candidate key.
• The FOREIGN KEY clause includes a list of attributes forming the foreign key and the
name of the relation referenced by the foreign key.
A domain of possible values should be associated with every attribute. These domain
constraints are the most basic form of integrity constraint. They are easy to test for when
data is entered.
Domain Types
• Attributes may have the same domain, e.g., cname and employee-name.
• It is unclear whether the bname and cname domains ought to be distinct.
• At the implementation level, they are both character strings.
• At the conceptual level, we do not expect customers to have the same names as branches.
• Strong typing of domains allows us to test for values inserted and whether queries make
sense.
• Newer systems, particularly object-oriented database systems, offer a rich set of domain
types that can be extended easily.
The CHECK clause in SQL-92 allows domains to be restricted in powerful ways that most
programming language-type systems do not allow. The CHECK clause allows the schema
designer to specify a predicate that must be satisfied by any value assigned to a variable
whose type is the domain.
e.g.,
Note that ‘CONSTRAINT WAGE-VALUE-TEST’ is optional (to give a name to the test
to signal which constraint is violated).
CONSTRAINT ACCOUNT-NUMBER-NULL-TEST
CHECK(VALUE IS NOT NULL)
CONSTRAINT ACCOUNT-TYPE-TEST
CHECK(VALUE IN (‘Checking’, ‘Saving’))
Often we wish to ensure that a value appearing in one relation for a given set of attributes
also appears for a certain set of attributes in another relation. This is called referential
integrity.
e.g.,
CREATE TABLE CUSTOMER
(CNAME CHAR(20) NOT NULL,
STREET CHAR(30),
CITY CHAR(30),
PRIMARY KEY (CNAME))

CREATE TABLE BRANCH
(BNAME CHAR(15) NOT NULL,
BCITY CHAR(30),
ASSETS INTEGER,
PRIMARY KEY (BNAME))

CREATE TABLE ACCOUNT
(ACCOUNT# CHAR(10) NOT NULL,
BNAME CHAR(15),
BALANCE INTEGER,
PRIMARY KEY (ACCOUNT#),
FOREIGN KEY (BNAME) REFERENCES BRANCH)
Note on foreign keys: there is a short form for declaring a single column to be a foreign
key, e.g., BNAME CHAR(15) REFERENCES BRANCH.
When a referential integrity constraint is violated, the normal procedure is to reject the
action. However, a FOREIGN KEY clause in SQL-92 can specify steps to change the
tuples in the referenced relation to restore the constraint.
e.g.,
FOREIGN KEY (BNAME) REFERENCES BRANCH
ON DELETE CASCADE
ON UPDATE CASCADE
SQL-92 also allows the FOREIGN KEY clause to specify actions other than cascade,
such as setting the referencing field to NULL or a default value if the constraint is violated.
A deletion or update at one end of the chain can propagate across the entire chain if
there is a chain of foreign key dependencies across multiple relations. If a cascading update
or delete causes a constraint violation that a further cascading operation cannot handle, the
system aborts the transaction, and all the changes caused by the transaction and its
cascading actions are undone.
Given the complexity and arbitrary nature of the way constraints in SQL behave with
NULL values, it is best to ensure that all columns of UNIQUE and FOREIGN KEY
specifications are declared to be NOT NULL.
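A cascading foreign key can likewise be sketched in Python's sqlite3 (an assumption of this example; note that SQLite, unlike the SQL-92 behavior described above, requires foreign-key enforcement to be switched on per connection):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: FK checks are off by default
conn.execute("CREATE TABLE branch (bname TEXT PRIMARY KEY, assets INTEGER)")
conn.execute("""
    CREATE TABLE account (
        account_num TEXT PRIMARY KEY,
        bname TEXT REFERENCES branch(bname) ON DELETE CASCADE,
        balance INTEGER
    )
""")
conn.execute("INSERT INTO branch VALUES ('SFU', 9000000)")
conn.execute("INSERT INTO account VALUES ('A-101', 'SFU', 500)")

# Deleting the referenced branch cascades to the accounts that reference it.
conn.execute("DELETE FROM branch WHERE bname = 'SFU'")
remaining = conn.execute("SELECT COUNT(*) FROM account").fetchone()[0]
```

Instead of rejecting the deletion, the system restores the constraint by deleting the referencing account tuples as well.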
• CHAR (N) (or CHARACTER (N)): fixed-length character string with user-specified
length.
• VARCHAR (N) (or CHARACTER VARYING): variable-length character string with user-specified maximum length.
• INT or INTEGER: an integer (length is machine-dependent).
• SMALLINT: a small integer (length is machine-dependent).
• NUMERIC (P, D): a fixed-point number with user-specified precision, consisting of P
digits (plus a sign), with D of the P digits to the right of the decimal point.
e.g., NUMERIC (3, 1) allows 44.5 to be stored exactly but not 444.5.
e.g.,
CREATE DOMAIN PERSON-NAME CHAR(20)
3.3 Views
A view is a virtual table based on the result set of an SQL statement. An index is an on-
disk structure, associated with a table or view, that speeds the retrieval of rows from the
table or view.
We have assumed that the relations we are given are the actual relations stored in the
database. We may wish to create a personalized collection of relations for a user for
security and convenience reasons.
We use the term view to refer to any relation, not part of the conceptual model, that is
made visible to the user as a “virtual relation”. As deletions, insertions, and updates may
modify relations, it is generally not possible to store views. Views must then be recomputed
for each query referring to them.
To create a view ALL-CUSTOMER that contains all branches and their customers:
CREATE VIEW ALL-CUSTOMER AS
(SELECT BNAME, CNAME
FROM DEPOSITOR, ACCOUNT
WHERE DEPOSITOR.ACCOUNT# = ACCOUNT.ACCOUNT#)
UNION
(SELECT BNAME, CNAME
FROM BORROWER, LOAN
WHERE BORROWER.LOAN# = LOAN.LOAN#)
Having defined a view, we can now use it to refer to the virtual relation it creates. View
names can appear anywhere a relation name can.
e.g., We can now find all customers of the SFU branch by writing
SELECT CNAME
FROM ALL-CUSTOMER
WHERE BNAME = 'SFU'

Suppose a clerk needs to see all loan data except the amount. We can define a view for the
clerk:
CREATE VIEW BRANCH-LOAN AS
(SELECT BNAME, LOAN#
FROM LOAN)
Since SQL allows a view name to appear anywhere a relation name may appear, the clerk
can write:
INSERT INTO BRANCH-LOAN
VALUES ('SFU', 'L-307')
This insertion is represented by an insertion into the actual relation LOAN, from which the
view is constructed. However, we have no value for the amount, so the tuple ('SFU',
'L-307', NULL) is inserted into the LOAN relation. The symbol NULL says the value
is unknown or does not exist.
As we saw, when a view is defined in terms of several relations, serious problems can
result. As a result, many SQL-based systems impose the constraint that a modification is
permitted through a view only if the view in question is defined in terms of one relation in
the database.
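The clerk's view can be sketched concretely; the snippet below is an illustrative assumption using Python's sqlite3 module, with names lowercased:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loan (bname TEXT, loan_num TEXT, amount INTEGER)")
conn.executemany("INSERT INTO loan VALUES (?, ?, ?)",
                 [("SFU", "L-11", 900), ("Downtown", "L-14", 1500)])

# A view that hides the amount column, as in the clerk example above.
conn.execute("CREATE VIEW branch_loan AS SELECT bname, loan_num FROM loan")

# The view name can appear anywhere a relation name can.
rows = conn.execute("SELECT loan_num FROM branch_loan WHERE bname = 'SFU'").fetchall()
```

The query returns only the SFU loan numbers; the underlying amounts stay hidden behind the view.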
Some SQL implementations include data definition commands to create and drop indices.
An index on relation r is created by
CREATE INDEX <index-name> ON r (<attribute-list>)
The attribute list is the list of attributes of r that form the search key for the index.
If the search key is a candidate key, we add the word UNIQUE to the definition:
CREATE UNIQUE INDEX <index-name> ON r (<attribute-list>)
An index is removed by
DROP INDEX <index-name>
3.4.1 Insert
To insert data into a relation, we either specify a tuple or write a query whose result is the
set of tuples to be inserted. Attribute values for inserted tuples must be members of the
attribute’s domain.
e.g.,
To insert a tuple for Smith, who has $1200 in account A-9372 at the SFU branch:
INSERT INTO ACCOUNT
VALUES ('SFU', 'A-9372', 1200)

To provide each loan customer in the 'SFU' branch with a $200 savings account:
INSERT INTO ACCOUNT
(SELECT BNAME, LOAN#, 200
FROM LOAN
WHERE BNAME='SFU')
We must evaluate the SELECT statement fully before carrying out any insertion. If some
insertions were carried out even as the SELECT statement was being evaluated, the
insertion might insert an infinite number of tuples. Evaluating the SELECT statement
completely before performing insertions avoids such problems.
Inserted tuples can be given values on only some attributes of the schema. The remaining
attributes are assigned a NULL value denoted by NULL. We can prohibit the insertion of
NULL values using the SQL DDL.
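Both forms of INSERT (a literal tuple and a query result) can be sketched as follows, assuming Python's sqlite3 module and lowercased names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loan (bname TEXT, loan_num TEXT, amount INTEGER)")
conn.execute("CREATE TABLE account (bname TEXT, account_num TEXT, balance INTEGER)")
conn.execute("INSERT INTO loan VALUES ('SFU', 'L-11', 900)")

# Tuple insertion, like Smith's account above.
conn.execute("INSERT INTO account VALUES ('SFU', 'A-9372', 1200)")

# Insertion from a query: a $200 account for each SFU loan,
# reusing the loan number as the account number.
conn.execute("""
    INSERT INTO account
    SELECT bname, loan_num, 200 FROM loan WHERE bname = 'SFU'
""")
count = conn.execute("SELECT COUNT(*) FROM account").fetchone()[0]
```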
3.4.2 Update
Updating allows us to change some values in a tuple without necessarily changing all.
e.g., To increase all balances by 5 percent:
UPDATE ACCOUNT
SET BALANCE = BALANCE * 1.05

To make two different rates of interest payment, depending on the balance amount:
UPDATE ACCOUNT
SET BALANCE = BALANCE * 1.06
WHERE BALANCE > 10000

UPDATE ACCOUNT
SET BALANCE = BALANCE * 1.05
WHERE BALANCE <= 10000
Note that the order of these two statements matters: if it were reversed, a balance just
over $10,000 could receive interest twice.
In general, the WHERE clause of the UPDATE statement may contain any construct legal
in a WHERE clause of a SELECT statement (including nesting).
A nested SELECT within an UPDATE may reference the relation that is being updated.
As before, all tuples in the relation are first tested to see whether they should be updated,
and the updates are carried out afterward.
For example, to pay 5% interest on accounts whose balance is greater than the average,
we have
UPDATE ACCOUNT
SET BALANCE = BALANCE * 1.05
WHERE BALANCE > (SELECT AVG (BALANCE)
FROM ACCOUNT)
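The above-average interest payment can be sketched as below (an illustrative assumption using sqlite3); the subquery's average is computed against the relation before any row is changed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (account_num TEXT, balance REAL)")
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("A-1", 100.0), ("A-2", 300.0)])

# Only the above-average balance (300 > avg 200) receives 5% interest.
conn.execute("""
    UPDATE account SET balance = balance * 1.05
    WHERE balance > (SELECT AVG(balance) FROM account)
""")
balances = dict(conn.execute("SELECT account_num, balance FROM account"))
```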
3.4.3 Delete
Deletion is expressed in much the same way as a query. Instead of displaying, the selected
tuples are removed from the database. We can only delete whole tuples.
If the WHERE clause is omitted, all tuples are deleted. The request
DELETE FROM DEPOSITOR
WHERE CNAME='Smith'
deletes all of Smith's depositor records.

To delete all loans with loan numbers between 1300 and 1500:
DELETE FROM LOAN
WHERE LOAN# BETWEEN 1300 AND 1500
We may only delete tuples from one relation at a time, but we may reference any number
of relations in a select-from-where clause embedded in the WHERE clause of a DELETE.
However, if the DELETE request contains an embedded SELECT that references the
relation from which tuples are to be deleted, ambiguities may result.
For example, to delete the records of all accounts with balances below the average, we
might write
DELETE FROM ACCOUNT
WHERE BALANCE < (SELECT AVG (BALANCE)
FROM ACCOUNT)
You can see the problem: as we delete tuples from ACCOUNT, the average balance changes.
Solution: The DELETE statement first tests each tuple in the relation ACCOUNT to check
whether the account has a balance less than the bank's average. Then all tuples that satisfy
the test are deleted. All the tests are performed (and the tuples to be deleted are marked)
before any deletion; the marked tuples are then deleted en masse after the evaluation.
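The two-phase behavior (test all tuples first, then delete) can be observed in a sketch, again assuming sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (account_num TEXT, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("A-1", 100), ("A-2", 200), ("A-3", 600)])

# The average (300) is fixed before any row is removed, so exactly the
# two below-average accounts are deleted -- the average is not a moving target.
conn.execute("""
    DELETE FROM account
    WHERE balance < (SELECT AVG(balance) FROM account)
""")
survivors = [r[0] for r in conn.execute("SELECT account_num FROM account")]
```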
e.g., To list the branch names of all accounts:
SELECT BNAME
FROM ACCOUNT
By default, duplicates are not removed; we can state this explicitly using ALL, or remove
them using DISTINCT. SELECT * means select all the attributes. Arithmetic expressions
can also appear in the select list.
The FROM clause defines a Cartesian product of the relations in the clause. SQL does
not have a natural join equivalent. However, natural join can be expressed as a Cartesian
product, selection, and projection.
More selections with join: for “Find the names and loan numbers of all customers who have a
loan at the SFU branch,” we can write in SQL
SELECT CNAME, BORROWER.LOAN#
FROM BORROWER, LOAN
WHERE BORROWER.LOAN# = LOAN.LOAN# AND BNAME = 'SFU'
SQL also provides a BETWEEN comparison operator, e.g., to find the account numbers of
accounts with balances between $90,000 and $100,000:
SELECT ACCOUNT#
FROM ACCOUNT
WHERE BALANCE BETWEEN 90000 AND 100000

Rename Operation
SQL allows renaming of relations and attributes using the AS clause:
old-name AS new-name
e.g.,
SELECT DISTINCT CNAME, BORROWER.LOAN# AS LOAN_ID
FROM BORROWER, LOAN
WHERE BORROWER.LOAN#= LOAN.LOAN# and BNAME= 'SFU'
Tuple Variables
Tuple variables can be used in SQL and are defined in the FROM clause:
SELECT DISTINCT CNAME, T.LOAN#
FROM BORROWER AS T, LOAN AS S
WHERE T.LOAN# = S.LOAN#
These variables can then be used throughout the expression. Think of it as being
something like the rename operator.
Find the names of all branches with assets greater than those of at least one branch located
in Burnaby:
SELECT DISTINCT T.BNAME
FROM BRANCH AS T, BRANCH AS S
WHERE T.ASSETS > S.ASSETS AND S.BCITY = 'Burnaby'
String Operation
The most commonly used operation on strings is pattern matching using the operator LIKE.
Patterns are case sensitive, e.g., ‘Jim’ does not match ‘jim’.
String matching operators % (any substring) and _ (underscore, matching any character).
E.g., ‘_ _ _ %’ matches any string with at least three characters. We can use NOT LIKE
for string mismatching.
e.g., Find all customers whose street includes the substring “Main”.
SELECT CNAME
FROM CUSTOMER
WHERE STREET LIKE ‘%Main%’
The backslash overrides the special meaning of these symbols. Use the keyword ESCAPE
to define the escape character. E.g., LIKE ‘ab\%cd%’ ESCAPE ‘\’ matches all strings
beginning with ‘ab%cd’.
SQL also permits a variety of functions on character strings, such as concatenation (using
‘||’), extracting substrings, finding the length of strings, converting between upper case and
lower case, and so on.
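A LIKE pattern can be tried out as below (a sqlite3 sketch, an assumption of this example). One portability caveat: SQLite's LIKE is case-insensitive for ASCII characters by default, unlike the case-sensitive standard behavior described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (cname TEXT, street TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)",
                 [("Smith", "12 Main Street"), ("Jones", "3 North Ave")])

# '%Main%' matches any street containing the substring 'Main'.
names = [r[0] for r in conn.execute(
    "SELECT cname FROM customer WHERE street LIKE '%Main%'")]
```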
SQL allows the user to control the order in which tuples are displayed.
SELECT *
FROM LOAN
ORDER BY AMOUNT DESC, LOAN# ASC
Sorting can be costly and should only be done when needed.
Set Operations
The set operations UNION, INTERSECT, and EXCEPT combine the results of two queries.

Find all customers having a loan, an account, or both:
(SELECT CNAME
FROM DEPOSITOR)
UNION
(SELECT CNAME
FROM BORROWER)

Find all customers having both a loan and an account:
(SELECT CNAME
FROM DEPOSITOR)
INTERSECT
(SELECT CNAME
FROM BORROWER)

Find all customers having an account but no loan:
(SELECT CNAME
FROM DEPOSITOR)
EXCEPT
(SELECT CNAME
FROM BORROWER)
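The three set operations can be compared side by side in a sketch (assuming sqlite3 and lowercased names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE depositor (cname TEXT)")
conn.execute("CREATE TABLE borrower (cname TEXT)")
conn.executemany("INSERT INTO depositor VALUES (?)", [("Smith",), ("Jones",)])
conn.executemany("INSERT INTO borrower VALUES (?)", [("Jones",), ("Hayes",)])

# Customers with an account, a loan, or both.
union = sorted(r[0] for r in conn.execute(
    "SELECT cname FROM depositor UNION SELECT cname FROM borrower"))
# Customers with both.
both = [r[0] for r in conn.execute(
    "SELECT cname FROM depositor INTERSECT SELECT cname FROM borrower")]
# Customers with an account but no loan.
only_deposit = [r[0] for r in conn.execute(
    "SELECT cname FROM depositor EXCEPT SELECT cname FROM borrower")]
```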
Aggregate Functions
In SQL, we can compute aggregate functions (AVG, MIN, MAX, SUM, and COUNT) on
groups of tuples using the GROUP BY clause. The attributes given are used to form groups
with the same values, and SQL then computes the aggregate over each group.

e.g., To find the number of depositors for each branch:
SELECT BNAME, COUNT (DISTINCT CNAME)
FROM DEPOSITOR, ACCOUNT
WHERE DEPOSITOR.ACCOUNT# = ACCOUNT.ACCOUNT#
GROUP BY BNAME
We use DISTINCT so that a person having more than one account will not be counted
more than once.

Find branches and their average balances, where the average is more than $1200:
SELECT BNAME, AVG (BALANCE)
FROM ACCOUNT
GROUP BY BNAME
HAVING AVG (BALANCE) > 1200
Predicates in the HAVING clause are applied after the formation of groups.

Find the average balance of each customer who lives in 'Vancouver' and has at least
three accounts:
SELECT DEPOSITOR.CNAME, AVG (BALANCE)
FROM DEPOSITOR, ACCOUNT, CUSTOMER
WHERE DEPOSITOR.ACCOUNT# = ACCOUNT.ACCOUNT#
AND DEPOSITOR.CNAME = CUSTOMER.CNAME
AND CCITY = 'Vancouver'
GROUP BY DEPOSITOR.CNAME
HAVING COUNT (DISTINCT ACCOUNT.ACCOUNT#) >= 3
If a WHERE clause and a HAVING clause appear in the same query, the WHERE clause
predicate is applied first.
• Tuples satisfying the WHERE clause are placed into groups by the GROUP BY clause.
• The HAVING clause is applied to each group.
• The SELECT clause uses groups satisfying the HAVING clause to generate the result
tuples.
• If no HAVING clause is present, the tuples satisfying the WHERE clause are treated as
a single group.
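The WHERE → GROUP BY → HAVING pipeline described above can be sketched concretely (a sqlite3 example, an assumption of this illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (bname TEXT, account_num TEXT, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?, ?)",
                 [("SFU", "A-1", 1500), ("SFU", "A-2", 1300),
                  ("Downtown", "A-3", 800)])

# WHERE filters rows first, GROUP BY forms groups, and HAVING then
# filters whole groups by their aggregate value.
rows = conn.execute("""
    SELECT bname, AVG(balance)
    FROM account
    WHERE balance > 0
    GROUP BY bname
    HAVING AVG(balance) > 1200
""").fetchall()
```

Only the SFU group (average 1400) survives the HAVING filter.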
NULL Values
With insertions, we saw how NULL values might be needed if values were unknown.
Queries involving NULLs pose problems.
All aggregate functions except COUNT(*) ignore tuples with NULL values on the argument
attributes.
Set Membership
The IN and NOT IN connectives test for set membership. Note that we can write the same
query in several ways in SQL. E.g., to find all customers with both a loan and an account
at the 'SFU' branch, in yet another way:
SELECT DISTINCT CNAME
FROM BORROWER, LOAN
WHERE BORROWER.LOAN# = LOAN.LOAN#
AND BNAME = 'SFU'
AND CNAME IN (SELECT CNAME
FROM DEPOSITOR, ACCOUNT
WHERE DEPOSITOR.ACCOUNT# = ACCOUNT.ACCOUNT#
AND ACCOUNT.BNAME = 'SFU')
We can use the NOT IN operation similarly, e.g., to find all customers with loans but not
accounts.
Set Comparison
SELECT BNAME
FROM BRANCH
WHERE ASSETS > SOME (SELECT ASSETS
FROM BRANCH
WHERE BCITY='BURNABY’)
to find branches whose assets are greater than those of some branch in Burnaby.
We can use any of the equality or inequality operators with SOME. If we change > SOME
to > ALL, we find branches whose assets are greater than those of all branches in Burnaby.
e.g., Find branches with the highest average balance. We cannot compose aggregate
functions in SQL, e.g., we cannot do MAX(AVG(...)). Instead, we find the branches for
which the average balance is greater than or equal to all average balances:
SELECT BNAME
FROM ACCOUNT
GROUP BY BNAME
HAVING AVG (BALANCE) >= ALL (SELECT AVG (BALANCE)
FROM ACCOUNT
GROUP BY BNAME)
The EXISTS construct returns true if the argument subquery is nonempty. Find all
customers who have a loan and an account at the bank.
SELECT CNAME
FROM BORROWER
WHERE EXISTS (SELECT *
FROM DEPOSITOR
WHERE DEPOSITOR.CNAME =
BORROWER.CNAME)
The UNIQUE construct returns true if the argument subquery contains no duplicate tuples.
Find customers with only one account at the ‘SFU’ branch.
SELECT T.CNAME
FROM DEPOSITOR AS T
WHERE UNIQUE (SELECT R.CNAME
FROM ACCOUNT, DEPOSITOR AS R
WHERE T.CNAME = R.CNAME and
R.ACCOUNT# = ACCOUNT.ACCOUNT#
and ACCOUNT.BNAME = 'SFU')
Each variant of the join operations in SQL-92 consists of a join type and a join condition:
1. Join types: INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL
OUTER JOIN
2. Join conditions: NATURAL, ON <predicate>, USING (A1, A2, ..., An)
The keywords INNER and OUTER are optional, since the rest of the join type enables us
to deduce whether the join is an inner join or an outer join.
• Inner join:
LOAN INNER JOIN BORROWER ON LOAN.LOAN# = BORROWER.LOAN#
Notice that loan# will appear twice in the inner-joined relation.
Figure 3.11 The Result of Loan Natural Full Outer Join Borrower Using (loan#)
Find all customers with an account or a loan (but not both) at the bank.
SELECT CNAME
FROM (DEPOSITOR NATURAL FULL OUTER JOIN BORROWER)
WHERE ACCOUNT# IS NULL OR LOAN# IS NULL
SQL provides a powerful declarative query language. However, access to a database from
a general-purpose programming language is required because (1) not all queries can be
expressed in SQL, and (2) non-declarative actions such as printing a report or interacting
with a user cannot be done from within SQL.
The SQL standard defines the embedding of SQL in a host language as embedded SQL;
the language in which SQL queries are embedded is referred to as the host language. The
query result is made available to the program one tuple (record) at a time.
To identify embedded SQL requests to the pre-processor, we use the EXEC SQL statement:
EXEC SQL
DECLARE C CURSOR FOR
SELECT CNAME, CCITY
FROM DEPOSIT, CUSTOMER
WHERE DEPOSIT.CNAME = CUSTOMER.CNAME
and DEPOSIT.BALANCE> :AMOUNT
END-EXEC
EXEC SQL
OPEN C
END-EXEC
This statement causes the DB system to execute the query and save the results within a
temporary relation.
A series of FETCH statements are executed to make tuples of the results available to the
program.
EXEC SQL
FETCH C INTO :CN, :CC
END-EXEC
The program can then manipulate the variables :CN and :CC using the features of the host
programming language.
A single FETCH request returns only one tuple. We need to use a WHILE loop (or
equivalent) to process each tuple of the result until there are no further tuples (when a
variable in the SQLCA is set).
We need to use the close statement to tell the DB system to delete the temporary relation
that held the query result.
EXEC SQL
CLOSE C
END-EXEC
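The DECLARE/OPEN/FETCH/CLOSE cycle maps naturally onto a host language's database API. Below is a sketch in Python, whose sqlite3 module follows the DB-API cursor model (the table and variable names mirror the embedded-SQL example above and are otherwise assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE depositor (cname TEXT, ccity TEXT, balance INTEGER)")
conn.executemany("INSERT INTO depositor VALUES (?, ?, ?)",
                 [("Smith", "Burnaby", 900), ("Jones", "Surrey", 200)])

amount = 500  # host-language variable, playing the role of :AMOUNT
cur = conn.execute(
    "SELECT cname, ccity FROM depositor WHERE balance > ?", (amount,))

# Like the WHILE loop of FETCH statements: one tuple per call.
results = []
while True:
    row = cur.fetchone()
    if row is None:        # no further tuples
        break
    results.append(row)
cur.close()                # like CLOSE C: release the query result
```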
Embedded SQL can execute any valid UPDATE, INSERT, or DELETE statement.
Dynamic SQL component allows programs to construct and submit SQL queries at run
time. SQL-92 also contains a module language that defines procedures in SQL.
Summary
DDL, or Data Definition Language, consists of the SQL commands that can be
used to define the database schema. In a SQL database, a schema is a collection of logical
data structures; an attribute corresponds to a column of a table.
Views are virtual tables based on an SQL statement's result set. An index is an
on-disk structure, associated with a table or view, that speeds the retrieval of rows from
the table or view.
Interactive data manipulation language (DML) is a query language based on
relational algebra and tuple relational calculus, plus commands to insert, delete and
modify tuples. Data Modification (INSERT, UPDATE, DELETE, SELECT, and
MERGE) statements are collectively referred to as DML (Data Manipulation
Language) statements.
Embedded SQL (among other SQL features) allows SQL statements to be used
from within a general-purpose host programming language, with the query result made
available to the application program one tuple at a time.
Questions
Exercises
Learning Objective
The objective of this chapter is to introduce the Entity-Relationship model. It provides a
graphical representation that depicts relationships among people, objects, places, concepts,
or events within an information technology (IT) system.
After successfully completing this chapter, students should be able to:
• Understand the Entity-Relationship model
• Understand Entity and Entity Set
• Explain Relationship and Relationship Set
• Explain Keys
• Explain E-R Diagram
• Understand Reducing E-R Diagrams to Tables
4.1 Overview
Data modeling is the process of analyzing and defining all the different data your business
collects and produces, and the relationships between those bits of data. Data modeling in
the entity-relationship model is intended primarily for D.B. design. The entity-relationship
model, or entity-relationship diagram (E.R.D.), is a data model or diagram for a high-level
description of conceptual data models.
The data modeling technique, however, can be used to describe any ontology (i.e., an
overview and classifications of used terms and their relationships) for a particular universe
of discourse (i.e., area of interest). In the case of the design of an information system that
is based on a database, the conceptual data model is, at a later stage (usually called logical
design), mapped to a logical data model, such as the relational model. This, in turn, is
mapped to a physical model during physical design; both of these later phases are sometimes
referred to together as “physical design”. The E-R (entity-relationship) data model views the real world as a
set of basic objects (entities) and relationships among these objects.
The entity-relationship model is intended primarily for the D.B. design process by allowing
the specification of an enterprise scheme. This represents the overall logical structure of
the D.B.
An entity is an object that exists and is distinguishable from other objects. For instance,
“John Harris with S.S.N. (Social Security Number) 890-12-3456” is an entity, as he can be
98 Chapter 04 – Entity-Relationship Model
uniquely identified as one person in the universe. An entity may be concrete (a person or a
book, for example) or abstract (like a holiday or a concept).
An entity set is a set of entities of the same type (e.g., all persons with bank accounts).
Entity sets need not be disjoint. For example, the entity set “employee” (all employees of
a bank) and the entity set “customer” (all customers of the bank) may have members in
common. Figure 4.1 shows two entity sets, “customer” and “loan”.
Figure 4.1 Two Entity Sets, “Customer” and “Loan”
Attributes
An entity is represented by a set of attributes, e.g., name, S.S.N., street, and city for the
“customer” entity. The domain of an attribute is the set of permitted values (e.g., the
domain of a telephone number might be the set of all seven-digit positive integers).
An attribute is a function that maps an entity set into a domain. Every entity is described
by a set of (attribute, data value) pairs. There is one pair for each attribute of the entity set.
E.g., a particular “customer” entity is described by the set {(name, Harris), (S.S.N., 890-
123-456), (street, North), (city, Georgetown)}.
An analogy can be made with the programming language notion of the type definition. The
concept of an entity set corresponds to the programming language type definition. A
variable of a given type has a particular value at a time. Thus, a programming language
variable corresponds to an entity in the E-R model.
For example, five entity sets and their attributes are introduced below:
• Branch, the set of all branches of a particular bank. The attributes branch-name, branch-
city, and assets describe each branch.
• Customer, the set of all people with bank accounts. Attributes are customer name, S.S.N.,
street, and customer city.
• Employee, with attributes employee name and phone number.
• Account, the set of all accounts created and maintained in the bank. Attributes are account
-number and balance.
• Transaction, the set of all account transactions executed in the bank. Attributes are
transaction number, date, and amount.
For example, consider the two entity sets, customer and loan, shown in Fig. 4.1. We define
the relationship CustLoan to denote the association between customers and their loans.
This is a binary relationship set shown in Figure 4.2.
Figure 4.2 A Binary Relationship Set Between Customers and Their Loans
Going back to the formal definition, the relationship set CustLoan is a subset of all the
possible customer and loan pairings. This is a binary relationship. Occasionally, there
are relationships involving more than two entity sets.
The role of an entity is the function it plays in a relationship. For example, the relationship
works-for could be ordered pairs of employee entities. The first employee will take the
manager role, and the second will take the worker role.
A relationship may also have descriptive attributes. For example, the date (last date of
account access) could be an attribute of the depositor relationship set between the customer
and account entity set shown in Figure 4.3.
Figure 4.3 The Depositor Relationship Set between customer and account, with Descriptive Attribute access-date
It is possible to define a set of entities and their relationships in several ways. The main
difference is in how we deal with attributes.
Consider the entity set employee with attributes employee-name and phone-number. The
phone can instead be treated as an entity in itself, with attributes phone-number and location.
Then we have two entity sets, and the relationship set EmpPhn defines the association
between employees and their phones. This new definition allows employees to have several
(or zero) phones, and may reflect the real world more accurately.
We cannot extend this argument easily to making employee-name an entity. The question
of what constitutes an entity and what constitutes an attribute depends mainly on the
structure of the modeled real-world situation and the semantics associated with the attribute
in question.
An E-R scheme may define certain constraints to which the contents of a database must
conform: mapping cardinalities and existence dependencies.
Mapping Cardinalities
Mapping cardinalities express the number of entities with which another entity can be
associated via a relationship. For binary relationship sets between entity sets A and B, the
mapping cardinality must be one of the following:
• One-to-one: an entity in A is associated with at most one entity in B, and an entity in B
is associated with at most one entity in A.
• One-to-many: an entity in A is associated with any number of entities in B, but an entity
in B is associated with at most one entity in A.
• Many-to-one: an entity in A is associated with at most one entity in B, but an entity in
B is associated with any number of entities in A.
• Many-to-many: an entity in A is associated with any number of entities in B, and an
entity in B is associated with any number of entities in A.

Figure 4.4 (a) One-to-One (b) One-to-Many
Figure 4.5 (a) Many-to-One (b) Many-to-Many
The appropriate mapping cardinality for a particular relationship set depends on the real
world being modeled. (Think about the CustLoan relationship in Figure 4.2.) Further,
mapping cardinalities affect the E-R design. For example, if each account can have only
one customer, we can make access-date an attribute of account instead of an attribute of
the relationship. That is, the relationship from account to customer is many-to-one, or
equivalently, customer to account is one-to-many, as shown in Figure 4.6.
Figure 4.6 access-date as an Attribute of the account Entity Set
Existence Dependencies
Existence dependencies are such that if entity X’s existence depends on entity Y’s
existence, then X is said to be existence dependent on Y. (Or we say that Y is the dominant
entity and X is the subordinate entity.)
For example, consider account and transaction entity sets and a relationship log between
them. This is one-to-many from account to the transaction. If an account entity is deleted,
its associated transaction entities must also be deleted. This account is dominant, and the
transaction is subordinate.
4.4 Keys
A key is an attribute or a set of attributes that helps to uniquely identify a tuple (or row) in
a relation (or table). The types of keys in a D.B. are: Primary Key, Super Key, Candidate
Key, Alternate Key, Foreign Key, Composite Key, and Unique Key. The terms superkey,
candidate key, and primary key apply to entity and relationship sets just as they do to
relation schemas. Identifying the primary key of a relationship set requires some care, since
it is composed of attributes from one or more of the related entity sets.
A superkey may contain extraneous attributes, and we are often interested in the smallest
superkey. A superkey for which no subset is a superkey is called a candidate key. In the
example above, S.S.N. is a candidate key, as it is minimal and uniquely identifies a
customer entity.
A primary key is a candidate key (there may be more than one) chosen by the D.B. designer
to identify entities in an entity set. An entity set that does not possess sufficient attributes
to form a primary key is called a weak entity set. One that does have a primary key is called
a strong entity set.
For example, the entity set transaction has attributes transaction-number, date, and
amount. Different transactions on different accounts could share the same transaction
number, so these attributes do not suffice to form a primary key (i.e., to uniquely identify
a transaction). Thus transaction is a weak entity set.
For a weak entity set to be meaningful, it must be part of a one-to-many relationship set.
This relationship set should have no descriptive attributes. Why? The idea of strong and
weak entity sets is related to the existence dependencies seen earlier.
A member of a strong entity set is a dominant entity, and a member of a weak entity set is
a subordinate entity. A weak entity set does not have a primary key, but we still need a
means of distinguishing among its entities. The discriminator of a weak entity set is a set
of attributes that allows this distinction to be made. The primary key of a weak entity set
is formed by taking the primary key of the strong entity set on which its existence depends
(see Mapping Constraints), plus its discriminator.
Just Remember: the primary key of a weak entity is found by taking the primary key of the
strong entity on which it is existence-dependent, plus the discriminator of the weak entity
set.
The attributes of a relationship set are the attributes that comprise the primary keys of the
entity sets involved in the relationship set. For example, S.S.N. is the primary key of the
customer, and the account-number is the primary key of the account. The attributes of the
relationship set CustAcct are then (account-number, S.S.N.).
The primary key of a relationship set R depends on the mapping cardinality and the
presence of descriptive attributes. With no descriptive attributes:
1. many-to-many: all attributes in R.
2. one-to-many: primary key for the “many” entity.
3. one-to-one: the primary key of either entity.
Depending on the mapping cardinality and the semantics involved, descriptive attributes
may be added.
There are several conventions for entity-relationship diagrams (E.R.D.s). The classical
notation is described in the remainder of this article and mainly relates to conceptual
modeling. A range of notations is typically employed in logical and physical database
design, including information engineering, IDEF1x (ICAM DEFinition Language), and
dimensional modeling.
We can express the overall logical structure of a database graphically with an E- R diagram.
Its basic components are:
The summaries of symbols used in E-R notation are shown in Figures 4.7 and 4.8.
Figures 4.7 and 4.8 Symbols Used in the E-R Notation: entity set, attribute, relationship
set, identifying relationship set for a weak entity set, total participation of an entity set,
primary key, and discriminating attribute of a weak entity set
Figure 4.9 shows an example with two entity sets and a relationship set. In this example,
the primary key of the customer is the customer-id, and the primary key of the account is
the account-number, respectively. Go back and review mapping cardinalities. They express
the number of entities to which an entity can be associated via a relationship. The arrow
positioning is simple once you get it straight in your mind. Think of the arrowhead pointing
to the entity “one” refers to. Thus, in this example, the cardinality of the relationship set
depositor is many-to-many.
Figure 4.9 An Example with Two Entity Sets and a Relationship Set
The text uses one particular style of diagram. Many variations exist. Some of the
variations we will see are:
• Diamonds omitted: a link between entities indicates a relationship. (See Figure 4.10)
- Fewer symbols, clearer picture.
- What happens with descriptive attributes? In this case, we must create an intersection
entity to possess the attributes.
• Numbers instead of arrowheads indicating cardinality. (See Figure 4.10)
- Symbols 1, n, and m used
- E.g., 1 to 1, 1 to n, n to m.
- Easier to understand than arrowheads.
Figure 4.10 Alternative E-R Notation: entity set E with attributes A1, A2, A3 and primary
key A1; many-to-many (* *), one-to-one (1 1), and many-to-one (* 1) relationships R
• A range of numbers indicating the optionality of the relationship. (See Figure 4.11)
- e.g. (0,1) indicates minimum zero (optional), maximum 1.
- Can also use (0,n), (1,1), or (1,n).
- Typically used near the end of the link - confusing at first, but gives more information.
- E.g., entity 1 (0, 1) -- (1,n) entity 2 indicates that entity 1 is related to between 0 and
1 occurrences of entity 2 (optional).
- Entity 2 is related to at least one and possibly many occurrences of entity 1
(mandatory).
Figure 4.11 Cardinality Limits (0..* and 1..1) on the borrower Relationship Set between
customer (customer-id, customer-name, customer-street, customer-city) and loan
(loan-number, amount)
Figure 4.12 Composite Attributes of the customer Entity Set: name (first-name,
middle-initial, last-name) and address (street, city, state, zip-code), where street comprises
street-number, street-name, and apartment-number
The function that an entity plays in a relationship is called its role. Roles are typically
implicit and are not specified. They are helpful when the meaning of a relationship set
needs clarification, for example, when the entity sets of a relationship are not distinct. The
relationship works-for might be ordered pairs of employees (the first is a manager, the
second is a worker). In the E-R diagram, roles can be shown by labeling the lines connecting
entities (rectangles) to relationships (diamonds). (See Figure 4.14.)
Figure 4.14 Role Indicators (manager, worker) on the works-for Relationship Set of
employee (employee-id, employee-name, telephone-number)
A doubly-outlined box indicates a weak entity set. For example, the weak entity set
payment depends on the strong entity set loan via the relationship set loan-payment. Figure
4.15 shows this example.
Figure 4.15 The Weak Entity Set payment (payment-number, payment-date,
payment-amount) Dependent on loan (loan-number, amount) via loan-payment
Non-binary Relationships
Non-binary relationships can easily be represented. Figure 4.16 shows an E-R diagram
with a ternary relationship.
Figure 4.16 E-R Diagram with a Ternary Relationship among employee (employee-id,
employee-name, street, city), branch (branch-name, branch-city, assets), and job (title, level)
4.6 Reducing E-R Diagrams to Tables
A database conforming to an E-R diagram can be represented by a collection of tables:
the notations of the diagram are reduced to tables. For each entity set and each relationship
set, a unique table is assigned the name of the corresponding set. Each table has several
columns with unique names. The E-R diagram of Figure 4.17 is used as an example.
Figure 4.17 E-R Diagram for the Banking Enterprise, with entity sets branch, customer,
loan, payment, and account, and relationship sets loan-branch, borrower, loan-payment,
and cust-deposit
For each entity set, we use a table with one column for each attribute of the set. Each row
in the table corresponds to one entity of the entity set. For the entity set customer, see the
table of Figure 4.18.
For a weak entity set, we add columns to the table corresponding to the primary key of
the strong entity set on which the weak set is dependent. For example, the weak entity set
payment has four attributes: loan-number, payment-number, payment-date, and
payment-amount. The primary key of payment is {loan-number, payment-number}. This
gives us the table in Figure 4.19.
Let R be a relationship set involving entity sets E1, E2, …, Em. The table corresponding to
the relationship set R has the following attributes:

∪_{i=1}^{m} Primary_Key(E_i)

If the relationship has k descriptive attributes b1, b2, …, bk, we add them too:

∪_{i=1}^{m} Primary_Key(E_i) ∪ {b1, b2, …, bk}
An example: The relationship set borrower involves the entity sets customer and loan.
Their respective primary keys are customer-id and loan-number. This gives us the table
in Figure 4.20.
As required, we take the primary keys of each entity set. There are no descriptive attributes
in this example.
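As a runnable sketch of this reduction (using SQLite through Python rather than the MySQL syntax used later in this book; identifiers are written with underscores because SQL names cannot contain hyphens):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Entity-set tables: one column per attribute.
cur.execute("CREATE TABLE customer (customer_id TEXT PRIMARY KEY, customer_name TEXT)")
cur.execute("CREATE TABLE loan (loan_number TEXT PRIMARY KEY, amount INTEGER)")

# Relationship-set table: the union of the primary keys of the participating
# entity sets (borrower has no descriptive attributes, so nothing is added).
cur.execute("""CREATE TABLE borrower (
    customer_id TEXT REFERENCES customer(customer_id),
    loan_number TEXT REFERENCES loan(loan_number),
    PRIMARY KEY (customer_id, loan_number))""")

cols = [row[1] for row in cur.execute("PRAGMA table_info(borrower)")]
print(cols)  # exactly the two primary keys, nothing more
```

The borrower table carries only customer_id and loan_number, matching Figure 4.20.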
These relationship sets are many-to-one and have no descriptive attributes. The primary
key of the weak entity set is the primary key of the strong entity set it is existence-
dependent on, plus its discriminator.
The table for the relationship set loan-payment in Figure 4.18 would have the same
attributes as the table in Figure 4.20 and is thus redundant.
2.6.3 Generalization
We can express the similarities between the entity sets by generalization. This is the
process of forming containment relationships between a higher-level entity set and one or
more lower-level entity sets.
Method 1: Create a table for the high-level entity, plus tables for the lower-level
entities also containing their specific attributes
account(account-number, balance)
savings-account(account-number, interest-rate)
checking-account(account-number, overdraft-amount)
Method 2: Create only tables for the lower-level entities.
savings-account(account-number, balance, interest-rate)
checking-account(account-number, balance, overdraft-amount)
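A runnable sketch of Method 1 (SQLite via Python; the sample account row is made up). Under Method 1 a complete savings-account row must be reassembled with a join, whereas Method 2 would store balance directly in savings_account:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Method 1: one table for the higher-level entity set, plus one table per
# lower-level entity set holding only its specific attributes.
cur.execute("CREATE TABLE account (account_number TEXT PRIMARY KEY, balance INTEGER)")
cur.execute("""CREATE TABLE savings_account (
    account_number TEXT PRIMARY KEY REFERENCES account(account_number),
    interest_rate REAL)""")

cur.execute("INSERT INTO account VALUES ('A-101', 500)")
cur.execute("INSERT INTO savings_account VALUES ('A-101', 3.5)")

# A full savings-account row requires joining the two tables.
row = cur.execute("""SELECT a.account_number, a.balance, s.interest_rate
                     FROM account a
                     JOIN savings_account s USING (account_number)""").fetchone()
print(row)
```

Method 1 avoids repeating the shared attributes; Method 2 avoids the join at the cost of duplicating the higher-level schema in every lower-level table.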
Summary
An entity is a real-world object that is distinguishable from all other objects.
An entity set is a set of entities of the same type.
The attribute is a function that maps an entity set into a domain. The domain is the
set of permitted values.
A relationship is an association among several entities. A relationship set is a
collection of relationships of the same type, and an entity set is a collection of entities
of the same type.
A key is an attribute or a set of attributes that uniquely identifies a tuple
(or row) in a relation (or table). Types of keys in a DB include the primary key and the super key.
A database design specified by an E-R diagram can be represented by a
collection of relation schemas. For each entity set and relationship set in the database,
there is a unique relation schema assigned the name of the corresponding entity set
or relationship set.
Reduction of an E-R diagram to tables: the database can be represented using the E-R
notations, and these notations can be reduced to a collection of tables.
Questions
Exercises
1. Please look at figures 4.4 and 4.5, mapping cardinalities from the textbook and
give a summary of the processing.
2. Please look at figure 4.7, the summary of symbols. Please describe the role of
each symbol in the E-R Diagram.
3. Please provide a summary of the role in the E-R Diagram.
Chapter 05
SQL
Learning Objective
The objective of this chapter is to practice the most common command types in SQL. It will
provide you with practice for the Create, Alter, Drop, Read, Update, and Delete commands,
which allow the handling of information using tables in a database.
After successfully completing this chapter students should be able to:
• Practice DDL: Data Definition Language
• Practice DML: Data Manipulation Language
5.1 Introduction
• MySQL: phpMyAdmin
- MySQL User ID/Password
122 Chapter 05 –SQL
The most common command types in DDL are CREATE, ALTER, and DROP. All three
types have a predefined syntax that must be followed for the command to run and changes
to take effect.
• Used by DBA.
• Creating a DB
CREATE DATABASE D ( … )
Customer
name char(20), city char(20), address char(50)

Branch
branch char(15), asset int, city char(30)

CREATE TABLE Branch (
branch char(15),
asset int,
city char(30),
PRIMARY KEY (branch))

Deposit
branch char(15), account char(10), name char(20), balance int
Entity integrity?
Domain integrity?
Referential integrity?
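As a runnable sketch of these three kinds of integrity (SQLite via Python; the schemas follow the Branch and Deposit tables above, and the sample rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
cur = conn.cursor()

# Entity integrity: PRIMARY KEY forbids NULL or duplicate branch names.
# Domain integrity: CHECK restricts asset to non-negative values.
cur.execute("""CREATE TABLE Branch (
    branch TEXT PRIMARY KEY,
    asset  INTEGER CHECK (asset >= 0),
    city   TEXT)""")

# Referential integrity: every Deposit.branch must exist in Branch.
cur.execute("""CREATE TABLE Deposit (
    branch  TEXT REFERENCES Branch(branch),
    account TEXT,
    name    TEXT,
    balance INTEGER)""")

cur.execute("INSERT INTO Branch VALUES ('Ta kmao', 50, 'Kandal')")

violations = []
for stmt in ("INSERT INTO Branch VALUES ('Ta kmao', 60, 'Kandal')",   # entity
             "INSERT INTO Branch VALUES ('Mesang', -1, 'Prey Veng')", # domain
             "INSERT INTO Deposit VALUES ('Nowhere', '1', 'X', 0)"):  # referential
    try:
        cur.execute(stmt)
    except sqlite3.IntegrityError:
        violations.append(stmt.split()[2])
print(violations)  # all three inserts are rejected
```

Each offending insert raises an IntegrityError, one per kind of constraint.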
- Security considerations may require that certain data be hidden from the user.
⇒ It is possible to support many views on top of any given set of actual relations.
⇒ A view must be 'recomputed' for each query that refers to it.
• Used by programmers.
• Insertion
- A newly created relation is empty initially.
⇒ Load data into the relation.
Deposit
branch account name balance
Ta kmao 123 CHEA 1000
Loan
branch account name balance
Ta kmao 111 CHEA 1000
- An insertion is permitted through a view only if the view in the query is defined on one
relation of the actual DB.
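A minimal runnable sketch of loading data into a newly created relation (SQLite via Python; the row matches the Deposit example above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Deposit (branch TEXT, account TEXT, name TEXT, balance INTEGER)")

# A newly created relation is empty until rows are inserted.
assert cur.execute("SELECT COUNT(*) FROM Deposit").fetchone()[0] == 0

cur.execute("INSERT INTO Deposit VALUES ('Ta kmao', '123', 'CHEA', 1000)")
count = cur.execute("SELECT COUNT(*) FROM Deposit").fetchone()[0]
print(count)
```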
- All aggregate operations (AVG, MIN, MAX, SUM, COUNT) ignore tuples with NULL
values on the argument attributes.
• Update
UPDATE Deposit SET balance=balance*1.06
WHERE balance>10000;
UPDATE Deposit SET balance=balance*1.05
WHERE balance<=10000;
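A runnable sketch of conditional updates like the pair above (SQLite via Python; the two sample rows and the 6% rate for large balances are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Deposit (branch TEXT, account TEXT, name TEXT, balance INTEGER)")
cur.executemany("INSERT INTO Deposit VALUES (?,?,?,?)",
                [('Mesang', '111', 'CHEA', 8000),
                 ('Mesang', '112', 'Mongkol', 20000)])

# Order matters: updating the high balances first prevents an account that
# crosses the 10000 line after the first raise from being raised twice.
cur.execute("UPDATE Deposit SET balance = balance*1.06 WHERE balance > 10000")
cur.execute("UPDATE Deposit SET balance = balance*1.05 WHERE balance <= 10000")

balances = [b for (b,) in cur.execute("SELECT balance FROM Deposit ORDER BY balance")]
print(balances)
```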
• Deletion
Branch
branch asset city
Mesang 100 Prey Veng
Kam chay mear 200 Prey Veng
Bar Phnom 400 Prey Veng
Kien Svay 300 Kandal
Ta kmao 50 Kandal
Kandal 500 Kandal
Customer
name city address
CHEA Prey Veng Mesang
Mongkol Prey Veng Kam chay mear
Rithy Prey Veng Bar Phnom
PUTY Prey Veng Mesang
Deposit
branch account name balance
Mesang 111 CHEA 100
Mesang 112 Mongkol 200
Mesang 113 Rithy 400
Loan
branch account name balance
Mesang 111 CHEA 10000
Kam chay mear 112 Mongkol 20000
Bar Phnom 113 Rithy 40000
Mesang 114 PUTY 50000
desc Branch;
desc Customer;
desc Deposit;
desc Loan;
• Query Processing
• Query-execution plan
- Parse-tree representation
- Annotated instruction
• Retrieving

Duplicate Tuples

SELECT name
FROM Deposit

SELECT DISTINCT name
FROM Deposit
Set operations
A list of customers with a deposit account at the ‘Mesang’ branch, a loan account at the
‘Kien Svay’ branch, or both?
(SELECT name
FROM Deposit
WHERE branch='Mesang')
UNION
(SELECT name
FROM Loan
WHERE branch='Kien Svay')

(SELECT name
FROM Deposit
WHERE branch='Mesang')
INTERSECT
(SELECT name
FROM Loan
WHERE branch='Kien Svay')
A list of customers who only have a loan account at the ‘Mesang’ branch?
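A runnable sketch of the three set operations (SQLite via Python; the sample rows are made up). UNION lists customers with either account, INTERSECT those with both, and EXCEPT (set difference) answers the "only a loan account" kind of question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Deposit (branch TEXT, account TEXT, name TEXT, balance INTEGER)")
cur.execute("CREATE TABLE Loan    (branch TEXT, account TEXT, name TEXT, balance INTEGER)")
cur.executemany("INSERT INTO Deposit VALUES (?,?,?,?)",
                [('Mesang', '111', 'CHEA', 100), ('Mesang', '112', 'Mongkol', 200)])
cur.executemany("INSERT INTO Loan VALUES (?,?,?,?)",
                [('Mesang', '111', 'CHEA', 10000), ('Mesang', '114', 'PUTY', 50000)])

q = ("SELECT name FROM Deposit WHERE branch='Mesang' %s "
     "SELECT name FROM Loan WHERE branch='Mesang'")
union     = sorted(n for (n,) in cur.execute(q % 'UNION'))      # either account
intersect = sorted(n for (n,) in cur.execute(q % 'INTERSECT'))  # both accounts
only_loan = sorted(n for (n,) in cur.execute(
    "SELECT name FROM Loan WHERE branch='Mesang' EXCEPT "
    "SELECT name FROM Deposit WHERE branch='Mesang'"))          # loan only
print(union, intersect, only_loan)
```

Note that MySQL historically lacked INTERSECT and EXCEPT (added in version 8.0.31); before that they had to be rewritten with joins or subqueries.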
Join
The list and city of customers with a loan account at the ‘Mesang’ branch?
A list of customers with a loan account at the ‘Mesang’ branch, city, and deposit
balance?
⇒ INTERSECT
SELECT Loan.name
FROM Loan JOIN Deposit USING (name) /* Natural Join */
WHERE Loan.branch = ‘Mesang’
Arithmetic Expression
SELECT account
FROM Deposit
WHERE balance BETWEEN 200 AND 300
( ⇔ WHERE balance >= 200 AND balance <= 300)
SELECT account
FROM Deposit
WHERE balance NOT BETWEEN 200 AND 300
String Match
SELECT name
FROM customer
WHERE address LIKE ‘%Mes%‘
SELECT name
FROM customer
WHERE address NOT LIKE ‘%Mes%‘
Set Comparison
Which branches have greater assets than some branch in ‘Prey Veng’?
Which branches have greater assets than all branches in ‘Prey Veng’?

SELECT branch
FROM Branch
WHERE city != 'Prey Veng' and /* <> */
      asset > SOME (SELECT asset
                    FROM Branch
                    WHERE city='Prey Veng')

SELECT branch
FROM Branch
WHERE city != 'Prey Veng' and /* <> */
      asset > ALL (SELECT asset
                   FROM Branch
                   WHERE city='Prey Veng')

→ If the subquery result is empty, > ALL is true and > SOME is false.
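SQLite, used here for a runnable sketch, does not support the SOME/ALL quantifiers, but they can be rewritten with MIN/MAX (the sample rows follow the Branch table shown earlier):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Branch (branch TEXT PRIMARY KEY, asset INTEGER, city TEXT)")
cur.executemany("INSERT INTO Branch VALUES (?,?,?)",
                [('Mesang', 100, 'Prey Veng'), ('Bar Phnom', 400, 'Prey Veng'),
                 ('Kien Svay', 300, 'Kandal'), ('Kandal', 500, 'Kandal')])

# Rewriting the quantifiers:
#   asset > SOME (...)  ==  asset > (SELECT MIN(asset) ...)
#   asset > ALL  (...)  ==  asset > (SELECT MAX(asset) ...)
# (on an empty subquery MAX yields NULL, while > ALL would be true
#  -- a subtle difference to keep in mind)
some = sorted(b for (b,) in cur.execute(
    "SELECT branch FROM Branch WHERE city != 'Prey Veng' AND "
    "asset > (SELECT MIN(asset) FROM Branch WHERE city = 'Prey Veng')"))
alls = sorted(b for (b,) in cur.execute(
    "SELECT branch FROM Branch WHERE city != 'Prey Veng' AND "
    "asset > (SELECT MAX(asset) FROM Branch WHERE city = 'Prey Veng')"))
print(some, alls)
```

With Prey Veng assets of 100 and 400, both Kandal branches beat some Prey Veng branch, but only Kandal (asset 500) beats them all.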
A list of customers with deposit and loan accounts in the ‘Mesang’ branch?
SELECT name
FROM Customer
WHERE EXISTS (SELECT * FROM Deposit
WHERE Deposit.name=Customer.name and
branch=‘Mesang‘)
AND EXISTS (SELECT * FROM Loan
WHERE Loan.name=Customer.name and
branch=‘Mesang‘)
⇒ Tuple variables are most useful for comparing two tuples in the same relation.
What are the names of customers with a deposit account at the same branch as 'CHEA'?
SELECT name
FROM Deposit
WHERE branch=‘Mesang‘
ORDER BY name ASC /* DESC */
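To illustrate tuple variables, the sketch below (SQLite via Python, with made-up rows) uses two variables S and T ranging over the same Deposit relation to find the other depositors at the same branch as 'CHEA':

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Deposit (branch TEXT, account TEXT, name TEXT, balance INTEGER)")
cur.executemany("INSERT INTO Deposit VALUES (?,?,?,?)",
                [('Mesang', '111', 'CHEA', 100), ('Mesang', '112', 'Mongkol', 200),
                 ('Kien Svay', '113', 'Rithy', 400)])

# S and T are tuple variables over the same relation: compare every
# depositor's branch (S) with CHEA's branch (T).
names = sorted(n for (n,) in cur.execute(
    "SELECT DISTINCT S.name FROM Deposit S, Deposit T "
    "WHERE T.name = 'CHEA' AND S.branch = T.branch AND S.name != 'CHEA'"))
print(names)
```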
Aggregate Functions
Outer Join
Referential Integrity
Domain Integrity
/* CHECK (balance>=100) */
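The NULL behavior of aggregate functions noted earlier in this chapter can be checked with a short sketch (SQLite via Python; the sample rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Deposit (branch TEXT, account TEXT, name TEXT, balance INTEGER)")
cur.executemany("INSERT INTO Deposit VALUES (?,?,?,?)",
                [('Mesang', '111', 'CHEA', 100), ('Mesang', '112', 'Mongkol', 200),
                 ('Mesang', '113', 'Rithy', None)])  # NULL balance

# AVG/MIN/MAX/SUM/COUNT(col) ignore NULL on the argument attribute,
# so the average divides by 2, not 3; COUNT(*) still counts every row.
avg, total, n_bal, n_rows = cur.execute(
    "SELECT AVG(balance), SUM(balance), COUNT(balance), COUNT(*) FROM Deposit"
).fetchone()
print(avg, total, n_bal, n_rows)
```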
Summary
The most common command types in DDL are CREATE, ALTER, and DROP. All
three types have a predefined syntax that must be followed for the command to run
and changes to take effect.
DML is an abbreviation for Data Manipulation Language. Data Manipulation
Language, or DML, represents a collection of programming language constructs explicitly used
to make changes in the database, such as the CRUD operations to create, read, update,
and delete data, using the INSERT, SELECT, UPDATE, and DELETE commands.
Questions
Exercises
1. Please write the correct SQL statement to create a new database called
rttcDB.
2. Please write the correct SQL statement to create a table grade following the
table below:
Field Type Null Key Default
ID Integer(3) Yes Primary Key Null
- Please write the correct SQL statement to insert 3 students into table grade
following the table below:
ID khmer math science comment
111 90 95 85 outstanding
3. Please write the correct SQL statement to create a table student following
the table below:
Field Type Null Key Default
ID Integer(3) Yes Primary Key Null
- Please write the correct SQL statement to insert 3 students into the student
table following the table below:
ID name address
111 Chan dara Null
4. Please write the correct SQL statement to update the table grade, setting
math=60 for ID 112.
5. Please write the correct SQL statement to select from the table grade,
ordered from largest to smallest by the math column.
6. Please write the correct SQL statement to select from the table grade
using an aggregate function on the khmer column.
7. Please write the correct SQL statement to select the student name with
the highest khmer score.
8. Please write the correct SQL statement to select the name of the student
with the lowest score and show all the scores, with the comment: study hard.
9. Please write the correct SQL statements for joined relations (Inner Join, Left
Outer Join, Natural Inner Join, Natural Full Outer Join) and show the results.
Chapter 06
PHP
Learning Objective
The objective of this chapter is to introduce the Web Server (Apache), PHP, and MySQL, and how
to install and configure them on Windows 10. It will provide a basic understanding and
practice of the abovementioned features. This chapter will also explain some key points
about PHP and MySQL for the Create, Update, Select, and Delete commands that
allow handling the information using tables in the database.
After successfully completing this chapter students should be able to:
• Understand the World Wide Web
• Practice Web Server(Apache)
• Practice PHP (PHP: Hypertext Preprocessor)
• Practice MySQL
• Introduction
- Wide-area information service and software that allows you to search all kinds of
information distributed on the Internet in a unified way
- Proposed by Tim Berners-Lee of the European Council for Nuclear Research
(CERN) in 1989.
- Stored and managed as a unit called Home Page in the form of Hypertext within the
Web Server
- Connected to Hypertext worldwide geographically distributed on the Internet by a
function called Link.
- Standard protocol that defines how a web client and web server communicate and
how data information is transferred from the web server to the web client.
- Obtain data information from a web server by entering the URL (Uniform Resource
Locator) starting with http://
- TCP/IP is used for the correct transmission of data in the underlying network.
Apache is the most commonly used Web server on Linux systems. Web servers are used to
serve Web pages requested by client computers. Clients typically request and view Web
pages using Web browser applications such as Firefox, Opera, Chromium, or Internet
Explorer.
- Move the Apache24 folder to the ‘C:\’. After moving, the final path will be
‘C:\Apache24’ path.
- Open the ‘httpd.conf’ file with notepad. Find the part below and change the settings to
suit your installation PC. (Specifies the Server Root path. Since it was specified as
C:\Apache24 above, modify it accordingly.)
Chapter 06 –PHP 143
ServerRoot “c:\Apache24”
- Set the port of the web server. The default is 80. Leave it as is if you do not use a
different port number.
Listen 80
- This is the path where the website files displayed when accessing the web server with
a web browser are saved. When connecting to http://localhost:80 (or http://localhost), the
index.html page in DocumentRoot is found and displayed.
DocumentRoot “c:\Apache24/htdocs”
- Run the command prompt window in administrator mode. (If you want to uninstall
the installed Apache server, type httpd.exe -k uninstall command.)
C:\Apache24\bin
httpd.exe -k install
C:\Apache24\bin\ApacheMonitor.exe
- Run ‘Apache Monitor’ by right-clicking the Apache icon on the taskbar at the bottom
right of the Window.
- Start and stop the server using the start and stop buttons in Apache Monitor.
- Launch a web browser and connect to ‘http://localhost’.
- This will produce the following Result:
• Introduction
- Server-side script: A technology that directly processes the files coded in the web server
by the application server operating on the server side and sends the results to the browser
- You can build fast websites running on Windows, Unix, and Linux operating
systems.
- Since it is a language developed with an open-source model, it is ported to operate in
various operating systems and web server environments.
- A very accessible web development language for small developers.
- Please download the appropriate version of PHP from the site below. You must select
‘Thread Safe’ rather than Non-Thread Safe for it to work properly! (The VS16 version of
64bit Thread Safe)
windows.php.net/download/
Chapter 06 –PHP 145
- Change the name to php8 (PHP version name), save it in the same folder as Apache,
and change ‘php.ini-development’ in the folder to ‘php.ini’.
- After that, open php.ini with notepad.
- Find extension_dir and change extension_dir = "C:\php8\ext",
extension=curl
extension=mysqli
extension=gettext
extension=mbstring
extension=openssl
extension=pdo_sqlite
[Date]
; Defines the default timezone used by the date functions
; http://php.net/date.timezone
date.timezone = Asia/Seoul
- After that, open ‘httpd.conf’ in conf in the Apache24 folder with notepad and paste the
following in the bottom line.
PHPIniDir "C:/php8"
LoadModule php_module "C:/php8/php8apache2_4.dll"
AddHandler application/x-httpd-php .php
AddType application/x-httpd-php .php .html
<IfModule dir_module>
    DirectoryIndex index.php index.html
</IfModule>
<?php
phpinfo();
?>
- After that, type ‘localhost/phpinfo.php’ in the address bar, and the screen of PHP
information will appear!
Basic Programming
<?php
echo "Hello World!! \n";
phpinfo();
?>

<?php
echo "Hello World. Today is ".date("Y-m-d", time());
?>
<h3>How are you?</h3>
<script>
document.write("Hello World. Today is "+Date());
</script>
<?php
/* This is a comment... */
$mycounter=1;
$mystring="Hello";
$myarray=array("one", "two", "three");
echo $mycounter."<br>";
echo $mystring;
echo "<br>";
echo $myarray[0];
echo "<br>";
$username="Kim Taeyoung";
echo $username;
echo "<br>";
$current_user=$username;
echo $current_user;
?>
<?php
/* automatic transformation... */
$number=123*678;
echo $number."<br>";
echo substr($number,3,2)."<br>";
$pi="3.141592";
$radius=5;
$area=$pi*($radius*$radius);
echo $area;
?>
<?php
/* function... */
$temp="The time is ";
echo $temp.longdate1(time());
function longdate1($timestamp){
return date("H:i:m", $timestamp);
}
echo "<br><br>";
echo longdate2($temp,time());
function longdate2($text,$timestamp){
return $text.date("H : i : m", $timestamp);
}
echo "<br><br>";
echo test()."->".test()."->".test();
function test(){
static $count=0;
$count++;
return $count;
}
?>
<?php
/* This is a comment... */
$count=0;
while(++$count<=10)
echo $count."<br>";
echo "--------<br>";
for($i=1, $j=1; $i+$j<=10; $i++, $j++)
echo $i."+".$j."<br>";
$fp=fopen("4-1.txt","wb");
for($j=0; $j<10; ++$j){
$written=fwrite($fp, "data".$j."\n");
if($written==FALSE) break;
}
fclose($fp);
?>
MySQL is a database management system: server software used to manage databases. So,
while it is not itself a database, it is widely used for managing and organizing data in databases.
• Installing MySQL
https://dev.mysql.com/downloads/installer/
When you try to start the download, the website will ask you to log in or create an account,
but you do not have to do so. Note the No thanks, just start my download button.
When you open the installer, it will first configure the installation and then ask for the
user’s permission to proceed:
When this is over, we will finally see the installer interface. As you can see in the image
below, the process consists of four steps:
There are five types of setups available in this first step, and you can check the side box to
see what each of them will install. However, we strongly recommend, especially if you are
just getting started with SQL, to select the default option.
Note that the default setup is large: your computer needs enough resources to install and
run MySQL properly and to avoid crashes during use.
The most important features, among others, this setup will install are:
It is also fine to choose the Full setup, as this will install all available MySQL resources.
However, if your computer has average or low specifications, you should choose Custom
and select only the products you would like to install.
Click on Next
Requirements
At this point, there is a chance you will be asked to install some required software, the most
common being the Microsoft Visual C++ Redistributable. The installer can automatically
solve some requirement issues. However, this is not the case here:
- Click on Execute
- √ I agree to the terms and conditions of the licenses
- Click on Install
- Setup Successful (Click on Close)
Click on Next
You have now reached the download section. The section name is self-explanatory: you
will download all the components in the setup option you selected.
Click on Execute, and the download will start. This might take a few minutes to be
concluded. When it is done, you should see check marks on every item. Then you can
proceed.
The next screen is almost the same as the last one, but now it will install all the components
you have just downloaded. This step will take significantly longer than the previous one.
When it is over, you will see all the check marks again:
Click on Execute
Configuration
The next step is to configure the server. You will see the following screen. Hit Next.
It is important to keep Development Computer in the Config Type field, as you probably
installed it on your personal computer and not on a dedicated machine. You can choose the
port, but the default will work just fine. Click Next. For the authentication method, let us
stick with the recommended option and click Next:
Now it is time to create the root account. You will be asked to set a password. Remember
to use a strong one.
You can create other users on this screen and set their passwords and permissions. You
have to click on Add User and fill in the blanks. Then, click Next.
Now you can choose the Windows service details, such as the service name, account type,
and if you want to start MySQL when you turn on your computer. Again, the default
options will work in most cases:
Click on Next
Click on Next
The next screen applies the configuration. Execute it. This step also takes a while to be
concluded.
Click on Next
This screen is followed by another one asking to apply the configuration. Just execute it
and click Finish.
If you choose to start the Workbench after finishing the installation, you will see the
following screen:
Choose the connection to the server you created and log into it.
Notice in the SCHEMAS window that you already have a few sample databases to play
with. In the Information window, you can see the database you have selected. Of course,
you have the main window to write SQL code.
- Run the command prompt window in administrator mode. (If you want to show the
database by command.)
Click on Delete
Before we can access data in the MySQL database, we need to be able to connect
to the server:
1. Start Apache
2. Start MySQL server
3. Create a new file name: Mysqlconn.php and save it in C:\Apache24\htdocs
<?php
$host = 'localhost';
$user = 'root';
$pw = '';
$dbName = '';
$conn = new mysqli($host, $user, $pw, $dbName);
// Check the connection
if($conn->connect_error){
echo "MySQL connection failed: ".$conn->connect_error;
}else{
echo "MySQL connection OK";
}
$conn->close();
?>
A database table has its unique name and consists of columns and rows.
<?php
$host = 'localhost';
$user = 'root';
$pw = '';
$dbName = 'jspbook';
$conn = new mysqli($host, $user, $pw, $dbName);
// Check the connection
if($conn->connect_error){
echo "MySQL connection failed: ".$conn->connect_error;
}else{
echo "MySQL connect OK";
}
// SQL to create a table (the table name MyGuests is an example;
// its columns match the SELECT code used later in this chapter)
$sql = "CREATE TABLE MyGuests (
id INT(6) UNSIGNED AUTO_INCREMENT PRIMARY KEY,
firstname VARCHAR(30) NOT NULL,
lastname VARCHAR(30) NOT NULL)";
if ($conn->query($sql) === TRUE) {
echo "Table created successfully";
} else {
echo "Error creating table: " . $conn->error;
}
$conn->close();
?>
$sql = "SELECT id, firstname, lastname FROM MyGuests"; // table name is an example
$result = $conn->query($sql);
if ($result->num_rows > 0) {
// output data of each row
while($row = $result->fetch_assoc()) {
echo "id: " . $row["id"]. " - Name: " . $row["firstname"]. " " .
$row["lastname"]. "<br>";
}
} else {
echo "0 results";
}
$conn->close();
Summary
Apache is the most commonly used Web server on Linux systems. Web servers
are used to serve Web pages requested by client computers. Clients typically request
and view Web pages using Web browser applications such as Firefox, Opera,
Chromium, or Internet Explorer.
PHP (Hypertext Preprocessor) is known as a general-purpose scripting language
that can be used to develop dynamic and interactive websites. It was among the first
server-side languages that could be embedded into HTML, making it easier to add
functionality to web pages without needing to call external files for data.
MySQL is a tool used to manage databases and servers. So, while it’s not a
database, it’s widely used for managing and organizing data in databases.
Questions
Exercises
4. Please write code to insert 3 students into the table grade in PHP following
the table below:
ID khmer math science comment
111 90 95 85 outstanding
5. Please write code to update the table grade set math=60 with Id 112 in PHP.
6. Please write code to select all columns from the table grade in PHP.
Chapter 07
Data and Process
Learning Objective:
Data is digitalized and stored with various facts, making it the most important material
with which to understand and solve problems. Various data analysis tools can be used
to improve students' computational thinking skills, and this chapter uses a tool called
Orange 3. In Chapter 7, we will look at data and processing, and you will learn various
ways to load data as a basic step to analyze data using Orange 3. In addition, you will
see that to check the data in detail, it is expressed in a table form, and columns and rows
can be selected and utilized according to conditions. Finally, to analyze the data, you
will learn how to pre-process it into the most suitable form and deal with the data
yourself.
7.1. File
7.1.1. Output
The File widget reads the input data file (data table with data instances) and sends the
dataset to its output channel. The history of the most recently opened files is maintained in
the widget. The widget also includes a directory with sample datasets pre-installed with
Orange.
The widget reads data from Excel (.xlsx), simple tab-delimited (.txt), comma-separated
files (.csv), or URLs. For other formats, see the Other Formats section below:
1. Browse through previously opened data files, or load any of the sample ones.
2. Browse for a data file.
3. Insert data from URL addresses, including data from Google Sheets.
4. Information on the loaded dataset: dataset size, number, and types of data features.
5. Additional information on the features in the dataset. Features can be edited by
double-clicking on them. The user can change the attribute names, select the
variable type per each attribute (Continuous, Nominal, String, Datetime), and
choose how to define the attributes further (as Features, Targets, or Meta). The user
can also decide to ignore an attribute.
6. Browse documentation datasets.
7. Produce a report.
Example:
Most Orange workflows start with the File widget. In the schema below, the widget
reads the data sent to the Data Table and the Box Plot widget.
3. Orange’s native format is a tab-delimited text file with three header rows. The first
row contains attribute names, the second the type (continuous, discrete, or string),
and the third the optional element (class, meta, or time).
7.2. CSV File Import
7.2.1. Output
Output is read data from the dataset through the .csv file, and Data Frame is pandas
DataFrame object.
The CSV File Import widget reads comma-separated files and sends the dataset to its
output channel. File separators can be commas, semicolons, spaces, tabs, or manually-
defined delimiters. The history of the most recently opened files is maintained in the widget.
Data Frame output can be used in the Python Script widget by connecting it to the
in_object input (e.g., df = in_object). Then it can be used as a regular DataFrame.
The import window is where the user sets the import parameters. It can be re-opened
by pressing Import Options in the widget. Right-click on the column name to set the column
type. Right-click the row index (on the left) to mark a row as a header, skipped, or normal
data row.
1. File encoding. The default is UTF-8. See the Encoding subchapter for details.
2. Import settings:
• Cell delimiter:
− Tab
− Comma
− Semicolon
− Space
− Other (set the delimiter in the field to the right)
• Quote character: either " or '. Defines what is considered text.
• Number separators:
− Grouping: delimiters for thousands, e.g., 1,000
− Decimal: delimiters for decimals, e.g., 1.234
3. Column type: select the column in the preview and set its type. Column type can
also be set by right-clicking on the selected column.
• Auto: Orange will automatically try to determine column type. (default)
• Numeric: for continuous data types, e.g. (1.23, 1.32, 1.42, 1.32)
• Categorical: for discrete data types, e.g. (brown, green, blue)
• Text: for string data types, e.g. (John, Olivia, Mike, Jane)
• Datetime: for time variables, e.g. (1970-01-01)
• Ignore: do not output the column.
4. Pressing Reset will return the settings to the previously set state (saved by pressing
OK in the Import Options dialogue). Restore Defaults will set the settings to their
default values. Cancel aborts the import, while OK imports the data and saves the
settings.
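Outside Orange, the same import settings map directly onto ordinary CSV parser options; here is a minimal sketch with Python's built-in csv module (pandas.read_csv exposes comparable parameters), using made-up semicolon-delimited data:

```python
import csv, io

# A semicolon-delimited sample with quoted text, mirroring Import Options:
# cell delimiter = Semicolon, quote character = "
raw = io.StringIO('name;color;"length"\nJohn;brown;1.23\nOlivia;green;1.42\n')

rows = list(csv.reader(raw, delimiter=';', quotechar='"'))
header, data = rows[0], rows[1:]
print(header)
print(data)
```

Changing the delimiter or quote character here has the same effect as changing those fields in the CSV File Import dialogue.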
7.2.3. Widget
1. The folder icon opens the dialogue for importing the local .csv file. It can either
load the first file or change the existing one (load new data). The File dropdown
stores paths to previously loaded data sets.
2. Information on the imported data set. Reports on the number of instances (rows),
variables (features or columns), and meta-variables (special columns).
3. Import Options re-opens the import dialogue, where the user can set delimiters,
encodings, text fields, and so on. Cancel aborts data import. Reload imports the file
once again, adding any changes made in the original file to the data.
7.2.4. Encoding
The dialogue for setting a custom encodings list opens from the Import Options - Encoding dropdown.
Select Customize Encodings List. . . to change which encodings appear in the list. To save
the changes, close the dialogue. Closing and reopening Orange (even with Reset widget
settings) will not reset the list. To do this, press Restore Defaults. To have all the available
encodings in the list, press Select all.
Example:
CSV File Import works like the File widget, with the added options for importing
different types of .csv files. In this workflow, the widget reads the data from the file and
sends it to the Data Table for inspection.
7.3. Datasets
7.3.1. Output
The Datasets widget retrieves the selected dataset from the server and sends it to the output.
The file is downloaded to the local memory and thus is instantly available even without an
internet connection. Each dataset provides a description and information on the data size,
number of instances, variables, targets, and tags.
1. Information on the number of datasets available and the number downloaded to the
local memory.
2. Content of available datasets. Each dataset is described with the size, the number of
instances and variables, the type of the target variable, and tags.
3. Formal description of the selected dataset.
4. If Send Data Automatically is ticked, the selected dataset is communicated
automatically. Alternatively, press Send Data.
Example:
Orange workflows can start with the Datasets widget instead of the File widget. In the
example below, the widget retrieves a dataset from an online repository (Kickstarter data),
which is subsequently sent to both the Data Table and the Distributions.
7.4. Data Table
7.4.1. Input
Data is used for the input dataset, like a data table with data instances.
7.4.2. Output
Selected Data is used to select an instance from the table. The Data Table widget receives
one or more datasets in its input and presents them as a spreadsheet. Attribute values may
sort data instances. The widget also supports the manual selection of data instances.
1. The name of the dataset (usually the input data file). Data instances are in rows, and
their attribute values are in columns. The dataset is sorted by the attribute “sepal
length” in this example.
2. Info on current dataset size and number and types of attributes
3. Values of continuous attributes can be visualized with bars; colors can be attributed
to different classes.
4. Data instances (rows) can be selected and sent to the widget’s output channel.
5. Use the Restore Original Order button to reorder data instances after attribute-based
sorting.
6. Produce a report.
7. While auto-send is on, all changes will be automatically communicated to other
widgets. Otherwise, press Send Selected Rows.
Example:
We used two File widgets to read the Iris and Glass dataset (provided in the Orange
distribution) and send them to the Data Table widget.
Selected data instances in the first Data Table are passed to the second Data Table.
Notice that we can select which dataset to view (iris or glass). Changing from one dataset
to another alters the communicated selection of data instances if Commit on any change is
selected.
7.5. Select Columns
7.5.1. Input
Data is used for the input dataset, like a data table with data instances.
7.5.2. Outputs
Data is output from the dataset with columns as set in the widget. The Select Columns
widget is used to compose your data domain manually. The user can decide which attributes
will be used and how. Orange distinguishes between ordinary attributes, (optional) class
attributes and meta-attributes. For instance, the domain would be composed of attributes
and a discrete class attribute for building a classification model. Meta attributes are not used
in modeling, but several widgets can use them as instance labels.
Orange attributes have a type and are either discrete, continuous, or a character string.
The attribute type is marked with a symbol before the attribute’s name (D, C, and S,
respectively).
1. Left-out data attributes that will not be in the output data file
2. Data attributes in the new data file
3. Target variable. If none, the new dataset will be without a target variable.
4. Meta attributes of the new data file. These attributes are included in the dataset but
are, for most methods, not considered in the analysis.
5. Produce a report.
6. Reset the domain composition to that of the input data file.
7. Tick if you wish to auto-apply changes to the data domain.
8. Apply changes to the data domain and send the new data file to the output channel
of the widget.
Example:
In the workflow below, the Iris data from the File widget is fed into the Select Columns
widget, where we select to output only two attributes (namely petal width and petal length).
We view both the original and the dataset with selected columns in the Data Table
widget.
For a more complex use of the widget, we composed a workflow to redefine the
classification problem in the heart-disease dataset. Originally, the task was to predict if the
patient has a coronary artery diameter narrowing. We changed the problem to gender
classification based on age, chest pain, and cholesterol level and informatively kept the
diameter narrowing as a meta-attribute.
7.6. Select Rows
This widget selects a subset from an input dataset based on user-defined conditions.
7.6.1. Inputs
• Data: the input dataset, such as a data table with data instances.
7.6.2. Outputs
• Matching Data: instances that match the conditions.
• Non-Matching Data: instances that do not.
• Data: data with an additional column showing whether an instance is selected.
Instances that match the selection rule are placed in the output Matching Data channel.
Criteria for data selection are presented as a collection of conjunct terms (i.e., selected
items are those matching all the terms in ‘Conditions’). Condition terms are defined by
selecting an attribute, selecting an operator from a list of operators, and, if needed, defining
the value to be used in the condition term. Operators are different for discrete, continuous,
and string attributes.
Any change in the composition of the condition will update the information pane (Data
Out). If Send is automatically selected, the output is updated on any change in the
composition of the condition or any of its terms.
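The conjunction of condition terms can be sketched in plain Python. This is a hypothetical illustration of the matching logic only, not Orange's internal API; `select_rows` and the sample rows are our own names:

```python
def select_rows(rows, conditions):
    """Split rows into (matching, non_matching): a row matches only if it
    satisfies ALL condition terms, i.e. the conjunction of the terms."""
    matching, non_matching = [], []
    for row in rows:
        if all(condition(row) for condition in conditions):
            matching.append(row)
        else:
            non_matching.append(row)
    return matching, non_matching

# Toy data in the spirit of the Zoo example: keep fish and reptiles only.
zoo = [
    {"name": "bass", "type": "fish"},
    {"name": "hawk", "type": "bird"},
    {"name": "pitviper", "type": "reptile"},
]
matching, non_matching = select_rows(
    zoo, [lambda row: row["type"] in ("fish", "reptile")]
)
```

Each additional condition term narrows the Matching Data output further, mirroring the "selected items are those matching all the terms" rule above.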
Example:
In the workflow below, we used the Zoo data from the File widget and fed it into the
Select Rows widget. We chose to output only two animal types in the widget: fish and
reptiles. We can inspect the original dataset with selected rows in the Data Table widget.
In the following example, we used the data from the Titanic dataset and similarly fed it
into the Box Plot widget. We first observed the entire dataset based on survival. Then we
selected only first-class passengers in the Select Rows widget and fed it again into the Box
Plot. There we could see all the first-class passengers listed by their survival rate and
grouped by gender.
7.7. Impute
The widget that replaces unknown (missing) values in the data is called Impute.
Some of Orange’s algorithms and visualizations cannot handle unknown values in the data.
7.7.1. Inputs
• Data: the input dataset, such as a data table with data instances.
7.7.2. Outputs
Data is output as the dataset with imputed values. Some of Orange’s algorithms and
visualizations cannot handle unknown values in the data. This widget does what
statisticians call imputation: it substitutes missing values with values either computed
from the data or set by the user. The default imputation method is model-based (1-NN).
1. In the top-most box, Default method, the user can specify a general imputation
technique for all attributes.
• Do not Impute does nothing with the missing values.
• Average/Most-frequent uses the average value (for continuous attributes) or the
most common value (for discrete attributes).
• As a distinct value creates new values to substitute the missing ones.
• Model-based imputer constructs a model for predicting the missing value based
on values of other attributes; a separate model is constructed for each attribute.
The default model is 1-NN learner, which takes the value from the most similar
example (this is sometimes referred to as hot deck imputation). This algorithm
can be substituted once the user connects to the input signal Learner for
Imputation. Note, however, that if there are discrete and continuous attributes
in the data, the algorithm needs to handle them both. At the moment, only the
1-NN learner can do that. (In the future, when Orange has more regressors, the
Impute widget may have separate input signals for discrete and continuous
models.)
• Random values computes the distributions of values for each attribute and then
imputes by picking random values from them.
• Remove examples with missing values removes examples containing missing
values. This also applies to the class attribute if Impute class values is checked.
2. Specifying individual treatment for each attribute is possible, which overrides the
default treatment set. One can also specify a manually defined value used for
imputation. In the screenshot, we decided not to impute the values of “normalized-
losses” and “make”. The missing values of “aspiration” will be replaced by random
values, while the missing values of “body-style” and “drive-wheels” will be
replaced by “hatchback” and “fwd”, respectively. If the values of “length”, “width”
or “height” are missing, the example is discarded. Values of all other attributes use
the default method set above (model-based imputer, in our case).
3. The imputation methods for individual attributes are the same as the default
methods.
4. Restore All to Default resets the individual attribute treatments to default.
5. Produce a report.
6. All changes are committed immediately if Apply automatically is checked.
Otherwise, Apply needs to be ticked to apply any new settings.
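The default model-based method, 1-NN hot deck imputation, can be sketched in plain Python. This is an illustrative simplification, not Orange's implementation, and all names (`hot_deck_impute`, the sample rows) are ours:

```python
import math

def hot_deck_impute(rows, target):
    """Replace missing (None) values of `target` by copying the value from
    the most similar complete row (1-NN hot deck). Similarity is Euclidean
    distance over the attributes both rows actually have."""
    donors = [r for r in rows if r[target] is not None]
    for row in rows:
        if row[target] is None:
            def distance(donor):
                shared = [k for k in row if k != target
                          and row[k] is not None and donor[k] is not None]
                return math.sqrt(sum((row[k] - donor[k]) ** 2 for k in shared))
            # Copy the target value from the nearest complete row.
            row[target] = min(donors, key=distance)[target]
    return rows

cars = [
    {"length": 1.0, "width": 2.0},
    {"length": 1.1, "width": None},   # missing value to impute
    {"length": 5.0, "width": 9.0},
]
hot_deck_impute(cars, "width")
```

The second row is closest to the first, so its missing width is filled with that row's value, which is exactly the "take the value from the most similar example" behavior described above.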
Example:
We played around with the Iris dataset and deleted some data to demonstrate how the
Impute widget works. We used the Impute widget and selected the Model-based imputer
to impute the missing values. In another Data Table, we see how the question marks turned
into distinct values (“Iris-setosa”, “Iris-versicolor”).
7.8. Outliers
The Outliers widget detects instances that deviate from the rest of the dataset by
applying one of several outlier-detection methods.
7.8.1. Inputs
• Data: input dataset
7.8.2. Outputs
• Outliers: instances scored as outliers
• Inliers: instances not scored as outliers
• Data: input dataset with an appended Outlier variable
The Outliers widget applies one of the four methods for outlier detection. All methods
apply classification to the dataset. One-class SVM with non-linear kernels (RBF) performs
well with non-Gaussian distributions, while the covariance estimator works only for data
with Gaussian distribution. One efficient way to perform outlier detection on moderately
high dimensional datasets is to use the Local Outlier Factor algorithm. The algorithm
computes a score reflecting the degree of abnormality of the observations. It measures the
local density deviation of a given data point concerning its neighbors. Another efficient
way to perform outlier detection in high-dimensional datasets is using random forests
(Isolation Forest).
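The idea of scoring points by their local neighbourhood can be sketched with a deliberately simplified distance-based score. This is a stand-in to illustrate the concept, not the Local Outlier Factor algorithm itself, and `knn_outlier_scores` is our own name:

```python
import math

def knn_outlier_scores(points, k=2):
    """Score each point by the mean distance to its k nearest neighbours;
    larger scores suggest outliers. A simplified stand-in for density-based
    methods such as Local Outlier Factor, not LOF itself."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q)
                       for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Three clustered points and one far-away point: the last scores highest.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
scores = knn_outlier_scores(points)
```

LOF refines this by comparing each point's local density to that of its neighbours, so points in sparse regions relative to their neighbourhood get high scores.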
Example:
Below is an example of how to use this widget. We used a subset (versicolor and
virginica instances) of the Iris dataset to detect the outliers. We chose the Local Outlier
Factor method with Euclidean distance. Then we observed the annotated instances in the
Scatter Plot widget. In the next step, we used the setosa instances to demonstrate novelty
detection using Apply Domain widget. After concatenating both outputs, we examined the
outliers in the Scatter Plot (1).
7.9. Pre-process
The Pre-process widget pre-processes data with selected methods and offers several
pre-processing methods that can be combined into a single pipeline.
7.9.1. Inputs
• Data: the input dataset, such as a data table with data instances.
7.9.2. Outputs
• Pre-processor: the selected pre-processing method.
• Pre-processed Data: the data pre-processed with the selected methods.
1. List of pre-processors. Double-click the pre-processors you wish to use and set their
order by dragging them up or down. You can also add pre-processors by dragging them
from the left menu to the right.
2. Pre-processing pipeline.
3. When the box is ticked (Send Automatically), the widget will communicate changes
automatically. Alternatively, click Send.
7.9.3. Pre-processors
Figure 7.14 shows the pre-processors available for data analysis, such as Select
Random Features and CUR Matrix Decomposition.
1. List of pre-processors.
2. Discretization of continuous values:
• Entropy-MDL discretization by Fayyad and Irani that uses expected information to
determine bins.
• Equal frequency discretization splits by frequency (same number of instances in
each bin).
• Equal width discretization creates bins of equal width (the span of each bin is the
same).
• Remove numeric features altogether.
3. Continuization of discrete values:
• Most frequent as base treats the most frequent discrete value as 0 and others as 1.
For discrete attributes with more than two values, the most frequent value is taken
as the base and contrasted with the remaining values in the corresponding
columns.
• One feature per value creates columns for each value. Place 1 where an instance has
that value and 0 where it does not. Essentially One Hot Encoding.
• Remove non-binary features retains only categorical features with values of either
0 or 1 and transforms them into continuous ones.
• Remove categorical features removes categorical features altogether.
• Treat as ordinal takes discrete values and treats them as numbers. If discrete values
are categories, each category is assigned a number in the order it appears in the data.
• Divide by number of values is similar to Treat as ordinal, but the final values are
divided by the total number of values, so the range of the new continuous
variable is [0, 1].
4. Impute missing values:
• Average/Most frequent replaces missing values (NaN) with the average (for
continuous) or most frequent (for discrete) value.
• Replace with random value replaces missing values with random ones within
the range of each variable.
• Remove rows with missing values.
5. Select relevant features:
• Similar to Rank, this pre-processor outputs only the most informative features. A
score can be determined by information gain, gain ratio, Gini index, ReliefF, fast
correlation-based filter, ANOVA, Chi2, RReliefF, and Univariate Linear
Regression.
• Strategy refers to how many variables should be on the output. Fixed returns a fixed
number of top-scored variables, while Percentile returns the selected top percent of
the features.
6. Select random features outputs either a fixed number of features from the original data
or a percentage. This is mainly used for advanced testing and educational purposes.
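Several of the pre-processors above are easy to sketch in plain Python. Below are hypothetical, simplified versions of equal-width and equal-frequency discretization, One feature per value (one-hot encoding), and Treat as ordinal; all function names are ours, not Orange's:

```python
def equal_width_bins(values, n_bins):
    """Equal width discretization: every bin spans the same value range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Equal frequency discretization: bins hold (roughly) equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

def one_hot(column):
    """One feature per value: a 0/1 column for each distinct value."""
    return {v: [1 if x == v else 0 for x in column]
            for v in sorted(set(column))}

def treat_as_ordinal(column):
    """Assign each category a number in order of first appearance."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in column]
```

On uniformly spread values the two discretizations coincide; on skewed data equal-width bins can end up nearly empty while equal-frequency bins stay balanced, which is the practical difference between the two options.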
When building predictive models, one has to be careful about how to do pre-processing.
There are two possible ways to do it in Orange, each slightly different:
1. Connect Pre-process to the learner. This will override the default pre-processing
pipeline for the learner and apply only the custom pre-processing pipeline (default
pre-processing steps are described in each learner’s documentation).
2. Connect Pre-process to Test and Score. This will apply the pre-processors to each
batch within cross-validation. Then the learner’s pre-processors will be applied to
the pre-processed subset.
Finally, there is a wrong way to do it. Connecting Pre-process directly to the original
data and outputting pre-processed data set will likely overfit the model. Do not do it.
Example:
In the first example, we have used the heart_disease.tab dataset available in the
dropdown menu of the File widget. Then we used Pre-process to impute missing values
and normalize features. We can observe the changes in the Data Table and compare them
to the non-processed data.
The second example shows how to use Pre-process for predictive modeling. This time
we are using the heart_disease.tab data from the File widget. You can access the data in the
dropdown menu. This is a dataset with 303 patients that came to the doctor suffering from
chest pain. After the tests, some patients were found to have diameter narrowing, and others
did not (this is our class variable).
Some values are missing in our data set, so we would like to impute missing values
before evaluating the model. We do this by passing a pre-processor directly to Test and
Score. In Pre-process, we set the correct pre-processing pipeline (in our example, only a
single pre-processor with Impute missing values), then connect it to the Pre-processor input
of Test and Score.
We also pass the data and the learner (in this case, a Logistic Regression). This is the
correct way to give a pre-processor to cross-validation, as each fold will independently get
pre-processed in the training phase. This is particularly important for feature selection.
Summary
1. Orange 3 is a tool that can analyze data simply without coding.
2. File and CSV File Import widgets can be used to retrieve and utilize files stored on
the computer.
3. The Datasets widget can be used to retrieve and utilize various example data used
for data analysis learning.
4. A Data Table is a widget that converts data into a table form and shows it in detail.
5. Select Columns, Select Rows widgets allow you to select columns and rows
according to specific conditions.
6. The Impute widget can be used to fill in missing values (missing data) that may occur
in the data.
7. The Outliers widget can be used to locate and exclude outliers (data outside of
statistical scope).
8. The Pre-process widget can transform data into the desired form through various
pre-processing functions such as data conversion and data modification.
Questions
1. What is the file extension in the form of storing data separated by commas?
2. What expression method has columns and rows and can accurately determine
the exact value of the data?
3. What do rows and columns of data mean in the data expressed in table form?
4. Give three examples of filling in missing data values.
Exercises
1. Use Google Links (bit.ly/o3knue) to upload ice cream sales data to Orange 3.
2. Express and interpret the ice cream sales data in table form.
3. Fill in the missing values from the ice cream sales data.
Chapter 08
Data visualization
Learning Objective:
In this chapter, you will learn about various ways to visualize your data. The larger the
dataset, the more difficult it is to interpret it in table form, and the more difficult it is to
see the relationships between data items in detail.
The advantage of representing data in charts, tables, and plots is that you can easily see and
interpret data trends and features at a glance.
8.1. Box Plot
8.1.1. Inputs
• Data: input dataset
8.1.2. Outputs
• Selected Data: instances selected from the plot.
• Data: data with an additional column showing whether a point is selected.
The Box Plot widget shows the distributions of attribute values. It is a good practice to
check any new data with this widget to quickly discover anomalies, such as duplicated
values (e.g., gray and grey), outliers, and the like. Bars can be selected - for example, values
for categorical data or the quantile range for numeric data.
1. Select the variable you want to plot. Click Order by relevance to subgroups to order
variables by Chi2 or ANOVA over the selected subgroup.
2. Choose Subgroups to see box plots displayed by a discrete subgroup. Click Order
by relevance to variable to order subgroups by Chi2 or ANOVA over the selected
variable.
3. When instances are grouped by a subgroup, you can change the display mode.
Annotated boxes will display the end values, the mean, and the median, while
Compare medians and Compare means will compare the selected values between
subgroups.
4. The mean (the dark blue vertical line). The thin blue line represents the standard
deviation.
5. Values of the first (25%) and the third (75%) quartile. The blue highlighted area
represents the values between the first and the third quartile.
6. The median (yellow vertical line).
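The statistics behind the plot (mean, standard deviation, median, quartiles) can be computed with Python's standard statistics module. The sample values below are made up for illustration:

```python
import statistics

measurements = [4.2, 4.8, 5.0, 5.1, 5.3, 5.6, 5.8, 6.1, 6.7]

mean = statistics.mean(measurements)                 # dark blue vertical line
stdev = statistics.stdev(measurements)               # thin blue line
median = statistics.median(measurements)             # yellow vertical line
q1, _, q3 = statistics.quantiles(measurements, n=4)  # 25% and 75% quartiles
```

The blue highlighted area in the widget corresponds to the interval between q1 and q3, i.e. the interquartile range.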
The bars represent the instances with each attribute value for discrete attributes. The
plot shows the number of different animal types in the Zoo dataset: 41 mammals, 13 fish,
20 birds, and so on.
Display shows:
• Stretch bars: Shows relative values (proportions) of data instances. The unticked
box shows absolute values.
• Show box labels: Display discrete values above each bar.
• Sort by subgroup frequencies: Sort subgroups by their descending frequency.
Examples:
The Box Plot widget is most commonly used immediately after the File widget to
observe the statistical properties of a dataset. In the first example, we used heart-disease
data to inspect our variables.
Box Plot is also useful for finding the properties of a specific dataset, for instance, a set
of instances manually defined in another widget (e.g., Scatter Plot or instances belonging
to some cluster or a classification tree node). Let us now use zoo data and create a typical
clustering workflow with Distances and Hierarchical Clustering.
Now define the threshold for cluster selection (click on the ruler at the top). Connect
Box Plot to Hierarchical Clustering, tick Order by relevance, and select Cluster as a
subgroup. This will order attributes by how well they define the selected subgroup, in our
case, a cluster. It seems like our clusters indeed correspond very well with the animal type!
8.2. Violin Plot
8.2.1. Inputs
• Data: input dataset
8.2.2. Outputs
• Selected Data: instances selected from the plot.
• Data: data with an additional column showing whether a point is selected.
The Violin Plot widget plays a similar role as a Box Plot. It shows the distribution of
quantitative data across several levels of a categorical variable such that those distributions
can be compared. Unlike the Box Plot, in which all plot components correspond to actual
data points, the Violin Plot features a kernel density estimation of the underlying
distribution.
1. Select the variable you want to plot. Click Order by relevance to subgroups to order
variables by Chi2 or ANOVA over the selected subgroup.
2. Choose Subgroups to see violin plots displayed by a discrete subgroup. Click Order
by relevance to variable to order subgroups by Chi2 or ANOVA over the selected
variable.
3. Box plot: Tick to show the underlying box plot.
− Strip plot: Tick to show the underlying data
represented by points.
− Rug plot: Tick to show the underlying data
represented by lines.
− Order subgroups: Click to order violins by median (ascending).
− Orientation: Determine violin orientation.
4. Kernel: Select the kernel used to estimate the density. Possible kernels are Normal,
Epanechnikov and Linear.
Scale: Select the method used to scale the width of each violin. If the area is selected,
each violin will have the same area. If the count is selected, the width of the violins will
be scaled by the number of observations in that bin. If width is selected, each violin will
have the same width.
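The kernel density estimate that gives the violin its shape can be sketched for the Normal kernel. This is a minimal illustration; `gaussian_kde` is our own name, and the actual estimator also chooses a bandwidth automatically:

```python
import math

def gaussian_kde(sample, x, bandwidth=1.0):
    """Density estimate at x: the average of Normal kernels centred on the
    sample points, each with the given bandwidth."""
    norm = bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) / norm
               for s in sample) / len(sample)
```

Evaluating this function over a range of x values and drawing the result mirrored around the axis produces the characteristic violin shape.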
Example:
The Violin Plot widget is most commonly used immediately after the File widget to
observe the statistical properties of a dataset. In the first example, we used heart-disease
data to inspect our variables.
The Violin Plot can also be used for outlier detection. In the following example, we
eliminate the outliers by selecting only instances that fall inside the Q1 - 1.5 IQR and
Q3 + 1.5 IQR range.
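The Q1 - 1.5 IQR / Q3 + 1.5 IQR rule can be expressed directly with the standard library; `iqr_filter` is a hypothetical helper for illustration, not an Orange function:

```python
import statistics

def iqr_filter(values):
    """Keep only values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if lo <= v <= hi]
```

Selecting rows inside this range in the Select Rows widget has the same effect as this filter.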
8.3. Distribution
8.3.1. Inputs
• Data: input dataset
8.3.2. Outputs
− Selected Data: instances selected from the plot
− Data: data with an additional column showing whether an instance is selected.
− Histogram Data: bins and instance counts from the histogram.
The graph shows how many times (e.g., in how many instances) each attribute value
appears in the data. If the data contains a class variable, class distributions for each attribute
value will be displayed (like in the snapshot below). To create this graph, we used the Zoo
dataset.
1. A list of variables for display. Sort categories by frequency orders displayed values
by frequency.
2. Set Bin width with the slider. The precision scale is set to sensible intervals. Fitted
distribution fits the selected distribution to the plot. Options are Normal, Beta,
Gamma, Rayleigh, Pareto, Exponential, and Kernel density.
3. Columns:
• Split by displays value distributions for instances of a certain class.
• Stack columns display one column per bin, colored by proportions of class
values.
• Show probabilities shows probabilities of class values at the selected variable.
• Show cumulative distribution cumulatively stacks frequencies.
For continuous attributes, the attribute values are also displayed as a histogram. It is
possible to fit various distributions to the data, for example, a Gaussian kernel density
estimation. Hide bars hides the histogram bars and shows only the fitted distribution (the
old behavior of Distributions).
In classless domains, the bars are displayed in blue. We used the Housing dataset.
8.4. Heat Map
The Heat Map widget plots a heat map for a pair of attributes; it is a graphical method
for visualizing attribute values in a two-way matrix.
8.4.1 Inputs
• Data: input dataset
8.4.2. Outputs
• Selected Data: instances selected from the plot.
• Data: data with an additional column showing whether a point is selected.
Heat map is a graphical method for visualizing attribute values in a two-way matrix. It
only works on datasets containing numeric variables. The values are represented by color
according to the selected color palette. By combining class variables and attributes on the
x and y axes, we see where the attribute values are the strongest and weakest, thus enabling
us to find typical features for each class.
The widget enables row selection with click and drag. One can zoom in with Ctrl++
(Cmd++) and zoom out with Ctrl+- (Cmd+-). Ctrl+0 (Cmd+0) resets the zoom to the
extended version, while Ctrl+9 (Cmd+9) resets it to the default.
4. Split rows or columns by a categorical variable. If the data contains a class variable,
rows will be automatically split by class.
5. Set what is displayed in the plot in Annotation & Legend.
• If the Show Legend is ticked, a color chart will be displayed above the map.
• If Stripes with averages is ticked, a new line with attribute averages will be
displayed on the left. Row Annotations add annotations to each instance on the
right. Color colors the instances with the corresponding value of the selected
categorical variable. Column Annotations adds an annotation to each variable
at the selected position (default is Top). Color colors the columns with the
corresponding value of the selected column annotation.
6. If Keep Aspect Ratio is ticked, each value will be displayed with a square
(proportionate to the map).
7. If Send Automatically is ticked, changes are communicated automatically.
Alternatively, click Send.
Heat map enables some neat plot enhancements. Such options include clustering rows
and/or columns for better data organization, row and column annotations, and splitting the
data by categorical variables.
Hierarchical clustering is based on the Ward linkage method. Clustering with optimal
leaf ordering reorders left and right branches in the dendrogram to minimize the sum of
distances between adjacent leaves (Bar-Joseph et al., 2001).
The Heat Map below displays attribute values for the brown-selected data set (Brown
et al., 2000). Heat maps are particularly appropriate for showing gene expressions, and the
brown-selected data set contains yeast gene expressions at different conditions.
The heat map shows low expressions in blue and high expressions in yellow and white.
For better organization, we added Clustering (opt. ordering) to the columns, which puts
columns with similar profiles closer together. In this way, we can see the conditions that
result in low expressions for ribosomal genes in the lower right corner.
Additionally, the plot is enhanced with row color on the right, showing which class the
rows belong to.
Heat maps are great for visualizing comparable numeric variables, for example,
sentiment in a collection of documents. We will take the book-excerpts corpus from the
Corpus widget and pass it to the Sentiment Analysis widget, which computes sentiment
scores for each document. The output of sentiment analysis is four columns, positive,
negative, and neutral sentiment score, and a compound score aggregating the previous
scores into a single number. Positive compound values (white) represent positive
documents, while negative (blue) represent negative documents.
We used row clustering to place similar rows closer, resulting in clear negative and
positive groups. Now we can select negative children’s books and explore which ones they are.
8.4.6. References
Bar-Joseph, Z., Gifford, D.K., Jaakkola, T.S. (2001) Fast optimal leaf ordering for
hierarchical clustering, Bioinformatics, 17, 22–29.
Brown, M.P., Grundy, W.N., Lin, D., Cristianini, N., Sugnet, C., Furey, T.S., Ares, M.,
Haussler, D. (2000). Knowledge-based analysis of microarray gene expression data by
using support vector machines, Proceedings of the National Academy of Sciences, 1, 262–
267.
8.5. Scatter Plot
The Scatter Plot widget offers scatter plot visualization with exploratory analysis and
intelligent data visualization enhancements.
8.5.1 Inputs
• Data: input dataset
• Data Subset: the subset of instances
• Features: list of attributes
8.5.2. Outputs
• Selected Data: instances selected from the plot
The Scatter Plot widget provides a 2-dimensional scatter plot visualization. The data is
displayed as a collection of points, each having the value of the x-axis attribute determining
the position on the horizontal axis and the value of the y-axis attribute determining the
position on the vertical axis. Various properties of the graph, like color, size and shape of
the points, axis titles, maximum point size, and jittering, can be adjusted on the left side of
the widget. A snapshot below shows the scatter plot of the Iris dataset, with coloring
matching the class attribute.
1. Select the x and y attribute. Optimize your projection with Find Informative
Projections. This feature scores attribute pairs by average classification accuracy
and returns the top-scoring pair with a simultaneous visualization update.
2. Attributes: Set the color of the displayed points (you will get colors for categorical
values and blue-green-yellow points for numeric). Set label, shape, and size to
differentiate between points. Label only selected points allows you to select
individual data instances and label only those.
3. Set symbol size and opacity for all data points. Set jittering to prevent the dots from
overlapping. Jittering will randomly scatter points only around categorical values.
If Jitter numeric values is checked, points are also scattered around their actual
numeric values.
• Show color regions colors the graph by class (see the screenshot below).
• Show legend displays a legend on the right. Click and drag the legend to move
it.
• Show gridlines displays the grid behind the plot.
• Show all data on mouse hover enables information bubbles when the cursor is
placed on a dot.
• Show regression line draws the regression line for a pair of numeric attributes.
Individual regression lines for each class value will be displayed if a categorical
variable is selected for coloring the plot. The reported r value corresponds to the
value from linear least-squares regression, equal to Pearson’s correlation
coefficient.
• Treat variables as independent fits a regression line to a group of points
(minimizing the distance from points), rather than fitting y as a function of x
(minimizing vertical distances).
4. Select, zoom, pan, and zoom to fit are the options for exploring the graph. The
manual selection of data instances works as an angular/square selection tool.
Double-click to move the projection. Scroll in or out for zooming.
5. If Send automatically is ticked, changes are communicated automatically.
Alternatively, press Send.
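The reported r value and the least-squares line can be computed directly; a minimal sketch in plain Python (function names are ours):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient, equal to the r value reported next
    to the regression line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def least_squares(xs, ys):
    """Slope and intercept of the regression line y = a*x + b, minimizing
    vertical distances (y as a function of x)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx
```

For perfectly collinear points, r is exactly 1 (or -1), and the fitted slope and intercept reproduce the line through the points.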
Here is an example of the Scatter Plot widget if the Show color regions and Show
regression line boxes are ticked.
If a dataset has many attributes, it is impossible to manually scan through all the pairs
to find interesting or valuable scatter plots. Orange implements intelligent data
visualization with the Find Informative Projections option in the widget.
The score is computed if a categorical variable is selected in the Color section. The
method finds the ten nearest neighbors in the projected 2D space for each data instance
based on the combination of attribute pairs. It then checks how many of them have the same
color. The total score of the projection is then the average number of the same-colored
neighbors.
The computation for numeric colors is similar, except that the coefficient of
determination is used to measure the projection's local homogeneity.
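The scoring idea for categorical colors can be sketched as follows. This is a simplified, hypothetical version of the described method (names are ours, and Orange's implementation differs in detail):

```python
import math

def projection_score(points, labels, k=10):
    """Average, over all points, of the fraction of a point's k nearest
    neighbours (in the 2D projection) that share its label."""
    total = 0.0
    for i, p in enumerate(points):
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )[:k]
        total += sum(labels[j] == labels[i] for j in neighbours) / len(neighbours)
    return total / len(points)

# Two well-separated groups: every nearest neighbour shares its label.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = ["a", "a", "b", "b"]
score = projection_score(points, labels, k=1)
```

A projection that separates the classes well yields a score close to 1, which is why the top-ranked attribute pairs are the most informative ones.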
To use this method, go to the Find Informative Projections option in the widget, open
the sub-window, and press Start Evaluation. The feature will return a list of attribute pairs
by average classification accuracy score.
Below, there is an example demonstrating the utility of ranking. The first scatter plot
projection was set as the default sepal width to sepal length plot (we used the Iris dataset
for simplicity). Upon running Find Informative Projections optimization, the scatter plot
converted to a much better projection of petal width to petal length plot.
8.5.4. Selection
Selection can be used to define subgroups in the data manually. Use the Shift modifier
when selecting data instances to put them into a new group. Shift + Ctrl (or Shift + Cmd on
macOS) appends instances to the last group.
Signal data outputs a data table with an additional column that contains group indices.
Like the other Orange widgets, the Scatter Plot supports zooming in and out of part of
the plot and a manual selection of data instances. These functions are available in the lower-
left corner of the widget.
The default tool is Select, which selects data instances within the rectangular area. Pan
enables you to move the scatter plot around the pane. With Zoom, you can zoom in and out
of the pane with a mouse scroll, while Reset zoom resets the visualization to its optimal
size. An example of a simple schema, where we selected data instances from a rectangular
region and sent them to the Data Table widget, is shown below. Notice that the scatter plot
does not show all 52 data instances because some data instances overlap (they have the
same values for both attributes used).
Example:
The Scatter Plot can be combined with any widget that outputs a list of selected data
instances. In the example below, we combine Tree and Scatter Plot to display instances
taken from a chosen decision tree node (clicking on any node of the tree will send a set of
selected data instances to the scatter plot and mark selected instances with filled symbols).
8.5.7. References
Leban G., Zupan B., Vidmar G., Bratko I. (2006). VizRank: Data Visualization Guided
by Machine Learning. Data Mining and Knowledge Discovery, 13, 119-136.
8.6. Line Plot
The Line Plot widget visualizes data profiles: it displays the data as
a series of points connected by straight line segments.
8.6.1. Inputs
• Data: input dataset
• Data Subset: the subset of instances
8.6.2. Outputs
• Selected Data: instances selected from the plot
• Data: data with an additional column showing whether a point is selected
Line plots only work with numerical data, while a categorical variable can be used to
group the data points.
Example:
Line Plot is a standard visualization widget that displays data profiles, normally of
ordered numerical data. This example will display the iris data in a line plot grouped by the
iris attribute. The plot shows how petal length nicely separates between class values.
If we observe this in a Scatter Plot, we can confirm this is indeed so. Petal length is an
interesting attribute for the separation of classes, especially when enhanced with petal width,
which is also nicely separated in the line plot.
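The grouping a Line Plot performs can be sketched in plain Python. The rows and the `mean_profiles` name below are illustrative (made-up numbers, not real iris values):

```python
from collections import defaultdict
from statistics import mean

# Toy stand-in for a class-labeled dataset: each row is a numeric
# profile plus a categorical label.
rows = [
    ([1.0, 0.5], "setosa"),
    ([2.0, 1.5], "setosa"),
    ([4.0, 1.0], "versicolor"),
    ([5.0, 2.0], "versicolor"),
]

# Group the numeric profiles by the categorical variable, as the
# Line Plot widget does when grouping by a class attribute...
groups = defaultdict(list)
for profile, label in rows:
    groups[label].append(profile)

# ...then compute the mean profile per group (one line per class).
mean_profiles = {
    label: [mean(column) for column in zip(*profiles)]
    for label, profiles in groups.items()
}
print(mean_profiles)  # {'setosa': [1.5, 1.0], 'versicolor': [4.5, 1.5]}
```

If one attribute's mean profiles differ strongly between groups, as petal length does for the iris classes, the lines separate clearly in the plot.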
8.7. Bar Plot
8.7.1. Inputs
• Data: input dataset.
• Data Subset: the subset of instances.
8.7.2. Outputs
• Selected Data: instances selected from the plot.
• Data: data with an additional column showing whether a point is selected.
The Bar Plot widget visualizes numeric variables and compares them by a categorical
variable. The widget is useful for observing outliers, distributions within groups, and
comparing categories.
1. Parameters of the plot. Values is the numeric variable to plot. Group by is the
variable for grouping the data. Annotations are categorical labels below the plot.
Color is the categorical variable whose values are used for coloring the bars.
2. Select, zoom, pan, and zoom to fit are the options for exploring the graph. Manual
selection of data instances works as a rectangular selection tool. Double-click to
move the projection. Scroll in or out to zoom.
3. If Send automatically is ticked, changes are communicated automatically.
Alternatively, press Send.
4. Access help, save the image, produce a report, or adjust visual settings. On the right,
the information on input and output is shown.
Example:
The Bar Plot widget is most commonly used immediately after the File widget to
compare categorical values. In this example, we have used heart-disease data to inspect our
variables.
First, we observed the cholesterol values of patients from our data set. We grouped
them by diameter narrowing, which defines patients with heart disease (1) and those
without (0). We used the same variable for coloring the bars.
Then, we selected patients over 60 years of age with Select Rows. We sent the subset
to Bar Plot to highlight these patients in the widget. The big outlier with a high cholesterol
level is over 60 years old.
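The grouping and subset highlighting in this example can be sketched with hypothetical records (the numbers below are made up, not taken from the actual heart-disease data):

```python
from collections import defaultdict

# Hypothetical patient records: (age, cholesterol, diameter_narrowing)
patients = [
    (54, 239, 0), (63, 330, 1), (45, 210, 0), (67, 286, 1), (61, 254, 0),
]

# Group the cholesterol values by the categorical variable,
# as the Bar Plot widget does with Group by.
bars = defaultdict(list)
for age, chol, narrowing in patients:
    bars[narrowing].append(chol)

# A subset (here: patients over 60, as selected with Select Rows)
# is marked so the widget can highlight the corresponding bars.
highlighted = [chol for age, chol, _ in patients if age > 60]

print(dict(bars))   # {0: [239, 210, 254], 1: [330, 286]}
print(highlighted)  # [330, 286, 254]
```

The widget draws one bar per value in each group; the highlighted list corresponds to the bars drawn with full color.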
8.8. Venn Diagram
8.8.1. Inputs
• Data: input dataset.
8.8.2. Outputs
• Selected Data: instances selected from the plot.
• Data: entire data with a column indicating whether an instance was selected or not.
The Venn Diagram widget displays logical relations between datasets by showing the
number of common data instances (rows) or the number of shared features (columns).
Selecting a part of the visualization outputs the corresponding instances or features.
Example:
The easiest way to use the Venn Diagram is to select data subsets and find matching
instances in the visualization. We use the breast-cancer dataset to select two subsets with
the Select Rows widget - the first subset is that of breast cancer patients aged between 40
and 49, and the second is that of patients with a tumor size between 20 and 29. The Venn
Diagram helps us find instances that correspond to both criteria, which can be found in the
intersection of the two circles.
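The central region of the Venn Diagram is a plain set intersection. A minimal sketch with hypothetical rows (the ids, ages, and tumor sizes below are invented for illustration):

```python
# Hypothetical breast-cancer-style rows: (patient_id, age, tumor_size)
rows = [(1, 45, 25), (2, 38, 22), (3, 47, 31), (4, 52, 24), (5, 41, 27)]

# Two subsets, as produced by two Select Rows widgets.
aged_40_49 = {pid for pid, age, size in rows if 40 <= age <= 49}
tumor_20_29 = {pid for pid, age, size in rows if 20 <= size <= 29}

# The overlap of the two circles is the set intersection:
# instances matching both criteria.
both = aged_40_49 & tumor_20_29
print(sorted(both))  # [1, 5]
```

Selecting that overlap in the widget outputs exactly these matching instances.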
The Venn Diagram widget can also be used for exploring different prediction models.
In the following example, we analyzed three prediction methods, Naive Bayes, SVM, and
Random Forest, according to their misclassified instances.
By selecting misclassifications in the three Confusion Matrix widgets and sending them
to the Venn Diagram, we can see all the misclassified instances visualized per method
used. Then we open the Venn Diagram and select, for example, the misclassified instances
identified by all three methods. This is represented as an intersection of all three circles.
Click on the intersection to see these two instances marked in the Scatter Plot widget. Try
selecting different diagram sections to see how the scatter plot visualization changes.
8.9. Linear Projection
8.9.1. Inputs
• Data: input dataset.
This widget displays linear projections of class-labeled data. It supports various types
of projections, such as circular, linear discriminant analysis, and principal component
analysis.
Consider, for a start, a projection of the Iris dataset shown below. Notice that the sepal
width and length already separate Iris setosa from the other two. In contrast, the petal length
best separates Iris versicolor from Iris virginica.
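Of the supported projections, the circular one is the simplest to sketch. Below is a minimal RadViz-style version in plain Python; the function name is ours, and unlike the widget it skips normalizing attribute values to a common range:

```python
import math

def circular_projection(row, n_attrs):
    """Project a row of attribute values onto a plane by placing one
    anchor per attribute evenly around the unit circle and taking the
    value-weighted average of the anchor positions."""
    anchors = [
        (math.cos(2 * math.pi * i / n_attrs),
         math.sin(2 * math.pi * i / n_attrs))
        for i in range(n_attrs)
    ]
    total = sum(row)  # assumes at least one non-zero attribute value
    x = sum(v * ax for v, (ax, ay) in zip(row, anchors)) / total
    y = sum(v * ay for v, (ax, ay) in zip(row, anchors)) / total
    return x, y

# A row that loads entirely on the first attribute lands on that
# attribute's anchor at (1, 0) on the circle.
print(circular_projection([1.0, 0.0, 0.0, 0.0], 4))  # (1.0, 0.0)
```

Rows with balanced attribute values land near the center of the circle, which is why well-separated classes form distinct clouds in this projection.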
1. Axes in the projection that are displayed and other available axes. Optimize your
projection by using Suggest Features. This feature scores attributes and returns the
top-scoring attributes with a simultaneous visualization update. Feature scoring
computes the classification accuracy (for classification) or MSE (for regression) of
the projection.
Example:
The Linear Projection widget works just like other visualization widgets. Below, we
connected it to the File widget to see the set projected on a 2-D plane. Then we selected the
data for further analysis and connected it to the Data Table widget to see the details of
the selected subset.
8.9.3. References
Koren Y., Carmel L. (2003). Visualization of labeled data using linear transformations.
In Proceedings of IEEE Information Visualization 2003 (InfoVis'03).
Boulesteix A.-L., Strimmer K. (2006). Partial least squares: a versatile tool for the
analysis of high-dimensional genomic data. Briefings in Bioinformatics, 8(1), 32-44.
Leban G., Zupan B., Vidmar G., Bratko I. (2006). VizRank: Data Visualization Guided
by Machine Learning. Data Mining and Knowledge Discovery, 13, 119-136.
Summary
1. The larger the data (in terms of both the number of columns and rows), the more
difficult it is to interpret and handle.
3. Box Plots can be based on the data's representative values (Q1, the median (Q2),
Q3, maximum, minimum, mean, standard deviation) and their distribution.
4. Violin Plots show the distribution of the data's values, so you can compare how
much of the data falls in which interval.
6. Heat Maps express data values as colors, making relationships in the data easy
to understand.
7. Scatter Plots are often used to show correlations in the data. Because the data
are represented on a plane defined by two attributes, the relationship between the
two variables is easy to interpret.
8. Line Plots are often used to compare changes in time-series data. You can view
and interpret how the data change over time or along another ordered criterion.
9. Bar Plots can be used to compare discrete values or interval sizes in the data.
Although they can be used similarly to Line Plots, Line Plots are mainly used to
follow continuous value changes, while Bar Plots are used to compare discrete
values.
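Several of the quantities summarized above can be computed directly with Python's standard library. A small sketch with made-up data values (the `pearson` helper is our own, written here so the computation is explicit):

```python
import math
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Representative values a Box Plot is built from
q1, q2, q3 = statistics.quantiles(data, n=4)   # Q2 is the median
print(min(data), q1, q2, q3, max(data))        # 2.0 4.0 4.5 6.5 9.0
print(statistics.mean(data), statistics.pstdev(data))  # 5.0 2.0

def pearson(xs, ys):
    """Pearson correlation coefficient, the relationship a
    Scatter Plot makes visible."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# A perfectly linear relationship has correlation 1.0
print(pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]))  # 1.0
```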
Questions
Exercises
1. Load iris data from the Datasets widget and find out what the characteristics of
the data mean.
2. Express the number of data instances for each flower variety in the iris dataset as
a bar plot.
3. Check the representative value of the petal width using the box plot and explain
the characteristics of the data.
4. Analyze the correlation between the width and length of petals and the width
and length of sepals with a scatter plot and interpret the relationship according to
the variety.
This book has been developed with the cooperation
of the Cambodia Ministry of Education, Youth and
Sport (MoEYS) under the Project of ICT Capacity
Building of Lower Secondary Education in Cambodia
supported by KOICA.