Advanced Database Systems
UNIVERSITY OF TEESSIDE
MIDDLESBROUGH
TS1 3BA
Kiril Anastasov
We are required to submit a single PDF (Portable Document Format) file that
documents all our answers, based on the following scenario. SomeCity University holds
interview days for candidates applying to its degree courses. The purpose is to appraise
the students' ability and then make them an offer regarding acceptance onto a degree
course; an offer is the total number of points they need to gain from their A-level or
other qualifications. SomeCity would like to develop a database and associated
applications to manage this activity. The following documents highlight the data in the
system. Read through the documents carefully. If you have queries, you can contact
your local tutor, who will act as a client from SomeCity University.
Acknowledgements
I am grateful for the help and support of my supervisors, Dr Graham Capper and
Dr Jim Longstaff. Without the ongoing support they gave me, it would not have been
possible to complete this project. They were able to get in touch regardless of time and
place in order to help me achieve most of the goals. They made the effort to get to know
me and my project personally, and gave feedback wherever it was needed.
Task 1:
1. There is inconsistency of notations (Chen, Crow's Foot) in many places in Document 1.
To solve this problem, a UML class diagram was used.
2. Optional, And was used in the generalisation, where the correct constraint is
Mandatory, Or.
4. Candidate should not be linked with InterviewDay, as the details about Candidate
are not required there according to Document 2.
5. Staff should be directly associated with InterviewDay, as the details about Staff
are required in InterviewDay according to Document 2.
The diagram shows the names of the tables, all the attributes and the primary/foreign
key relationships. Using the UML diagram for these two tables, we were able to create
the Staff and Appointment tables in the SQL database, as can be seen on the next
page in Figure 5.
USE [l1087591]
GO
/****** Object: Table [SomeUni].[Appointment] Script Date: 07/01/2014 12:46:31
******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [SomeUni].[Appointment](
[appointmentId] [int] IDENTITY(1,1) NOT NULL,
[date] [date] NOT NULL,
[time] [nchar](10) NOT NULL,
[staffName] [nchar](20) NOT NULL,
[staffWorkPlace] [nchar](20) NOT NULL,
[candidateName] [nchar](20) NOT NULL,
[candidateId] [int] NOT NULL,
[staffId] [int] NOT NULL,
CONSTRAINT [PK_Appointment_1] PRIMARY KEY CLUSTERED
(
[appointmentId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS =
ON) ON [PRIMARY]
) ON [PRIMARY]
GO
/****** Object: Table [SomeUni].[Staff] Script Date: 07/01/2014 12:46:31 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [SomeUni].[Staff](
[staffId] [int] IDENTITY(1,1) NOT NULL,
[name] [nchar](20) NOT NULL,
[workPlace] [nchar](20) NOT NULL,
[email] [nchar](30) NOT NULL,
[interviewDayPersonId] [int] NOT NULL,
CONSTRAINT [PK_Staff_1] PRIMARY KEY CLUSTERED
(
[staffId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS =
ON) ON [PRIMARY]
) ON [PRIMARY]
GO
ALTER TABLE [SomeUni].[Appointment] WITH CHECK ADD CONSTRAINT
[FK_Appointment_Candidate] FOREIGN KEY([candidateId])
REFERENCES [SomeUni].[Candidate] ([candidateId])
GO
ALTER TABLE [SomeUni].[Appointment] CHECK CONSTRAINT
[FK_Appointment_Candidate]
GO
ALTER TABLE [SomeUni].[Appointment] WITH CHECK ADD CONSTRAINT
[FK_Appointment_Staff] FOREIGN KEY([staffId])
REFERENCES [SomeUni].[Staff] ([staffId])
GO
ALTER TABLE [SomeUni].[Appointment] CHECK CONSTRAINT
[FK_Appointment_Staff]
GO
ALTER TABLE [SomeUni].[Staff] WITH CHECK ADD CONSTRAINT
[FK_Staff_InterviewDayPerson] FOREIGN KEY([interviewDayPersonId])
REFERENCES [SomeUni].[InterviewDayPerson] ([interviewDayPersonID])
GO
ALTER TABLE [SomeUni].[Staff] CHECK CONSTRAINT
[FK_Staff_InterviewDayPerson]
GO
This was the most important T-SQL code for the Appointment and Staff tables in the
SomeUni schema. Let us now examine the foreign key constraints in more detail.
GO
ALTER TABLE [SomeUni].[Appointment] WITH CHECK ADD CONSTRAINT
[FK_Appointment_Candidate] FOREIGN KEY([candidateId])
REFERENCES [SomeUni].[Candidate] ([candidateId])
The preceding code alters the Appointment table and adds a constraint for the
relationship between the Appointment and Candidate tables, where candidateId is a
foreign key in the Appointment table.
GO
ALTER TABLE [SomeUni].[Appointment] WITH CHECK ADD CONSTRAINT
[FK_Appointment_Staff] FOREIGN KEY([staffId])
REFERENCES [SomeUni].[Staff] ([staffId])
The preceding code alters the Appointment table and adds a constraint for the
relationship between the Appointment and Staff tables, where staffId is a foreign key in
the Appointment table.
USE [l1087591]
GO
SET QUOTED_IDENTIFIER ON
GO
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace SomeUniApp
{
/** dropMenu() is responsible for the menu-driven queries on the student db via the
 * entity data model.
 */
private static void dropMenu()
{
    // menu loop: present the options and dispatch until the user chooses to exit
    // (loop body abbreviated in this listing)
    // do { ... } while (more);
}
/**insertStaff is responsible for creating a new staff in the db, inserting it and
saving it.
* @param myStaffName is the name of the staff member.
* @param myWorkPlace is where the staff member works.
* @param myEmail is the email of the staff member.
* @param id is the id of the staff member.
*/
private static void insertStaff(String myStaffName, String myWorkPlace, String
myEmail, String id)
{
using (var context = new l1087591Entities2())
{
//create a staff object
Staff staff = new Staff();
staff.name = myStaffName;
staff.workPlace = myWorkPlace;
staff.email = myEmail;
staff.interviewDayPersonId = Convert.ToInt32(id);
//add object to container's result entity set and save, catch exceptions
try
{
context.Staffs.AddObject(staff);
context.SaveChanges();
Console.WriteLine(" New staff " + myStaffName + " added ... ");
}
catch (Exception ex)
{
Console.WriteLine("error: " + ex.Message);
}
staff.name = myStaffName;
//find the staff row by name, update it from user input and save, catch exceptions
try
{
//get the row containing the given primary key
var myStaff = (from m in context.Staffs
where m.name == myStaffName
select m).First();
//get user input -see function above
string newName = setName();
string newWorkPlace = setWorkPlace();
string newEmail = setEmail();
string newId = setId();
//change module entity and save
myStaff.name = newName;
myStaff.workPlace = newWorkPlace;
myStaff.email = newEmail;
myStaff.staffId = Convert.ToInt32(newId);
// context.Modules.
//context.Modules.DeleteObject(module);
context.SaveChanges();
Console.WriteLine("... editing staff " + myStaffName + " done");
}
catch (Exception ex)
{
Console.WriteLine("error: " + ex.Message);
}
staff.name = myStaffName;
//find the staff row by name, delete it and save, catch exceptions
try
{
//get the row containing the given name
var myStaff = (from m in context.Staffs
where m.name == myStaffName
select m).First();
// context.Modules.
context.Staffs.DeleteObject(myStaff);
context.SaveChanges();
Console.WriteLine("The staff " + myStaffName + " has been deleted ");
}
catch (Exception ex)
{
Console.WriteLine("error: " + ex.Message);
}
}
}
}
}
If you enter a wrong option number, you will get this result due to the validation.
In Figure 13 we can see that the new staff member was added successfully in the SQL
database.
In Figure 17 we can see that the staff member has been deleted from the SQL database.
In Figure 19.1 we can see that the staff member has been deleted from the SQL database.
Based on the C# source code and all the examples, it can be concluded that the add,
edit and delete menu features are working as expected.
When we delete a staff member, the main adverse consequence is that there are
dependent records in other tables (Offer, InterviewDay and Appointment) whose
referential integrity is lost when the parent record goes away. There are a few ways to
deal with this. The first is to use cascading deletions with care; with this method you
can eliminate most referential integrity problems by carefully controlling the update
process. The second way, if you do not want to cascade a deletion, is to change the
child table's foreign key to a NULL value instead. Both options are sketched below.
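As an illustration only (a sketch against the schema from Task 1, not part of the
delivered scripts), the FK_Appointment_Staff constraint could be re-created with a
delete rule; the SET NULL variant additionally assumes the staffId column is altered
to allow NULLs:
-- Sketch: re-create the Appointment -> Staff foreign key with ON DELETE CASCADE,
-- so deleting a staff row also deletes the dependent appointments.
ALTER TABLE [SomeUni].[Appointment] DROP CONSTRAINT [FK_Appointment_Staff];
GO
ALTER TABLE [SomeUni].[Appointment] ADD CONSTRAINT [FK_Appointment_Staff]
FOREIGN KEY ([staffId]) REFERENCES [SomeUni].[Staff] ([staffId])
ON DELETE CASCADE;
GO
-- Alternative sketch: keep the appointments but set the orphaned foreign key
-- to NULL instead (requires staffId to be made nullable first).
-- ALTER TABLE [SomeUni].[Appointment] ALTER COLUMN [staffId] [int] NULL;
-- ALTER TABLE [SomeUni].[Appointment] ADD CONSTRAINT [FK_Appointment_Staff]
--     FOREIGN KEY ([staffId]) REFERENCES [SomeUni].[Staff] ([staffId])
--     ON DELETE SET NULL;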
When we delete a staff member, there is then no record of that person at all. This could
be a problem if, for example, the police are looking for them or anyone from SomeUni
wants to contact them. We can create a table called StaffWorkedInSomeUni that stores
the same information about Staff, so that when a staff member is deleted from the Staff
table the information remains in StaffWorkedInSomeUni.
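A minimal sketch of that idea, assuming StaffWorkedInSomeUni mirrors the Staff
columns and is populated by a trigger (neither object is part of the submitted schema):
-- Hypothetical archive table mirroring SomeUni.Staff
CREATE TABLE [SomeUni].[StaffWorkedInSomeUni](
[staffId] [int] NOT NULL,
[name] [nchar](20) NOT NULL,
[workPlace] [nchar](20) NOT NULL,
[email] [nchar](30) NOT NULL,
[deletedOn] [datetime] NOT NULL DEFAULT (GETDATE())
)
GO
-- Copy rows into the archive whenever they are deleted from Staff
CREATE TRIGGER [SomeUni].[trg_Staff_Archive] ON [SomeUni].[Staff]
AFTER DELETE
AS
BEGIN
INSERT INTO [SomeUni].[StaffWorkedInSomeUni] ([staffId], [name], [workPlace], [email])
SELECT d.[staffId], d.[name], d.[workPlace], d.[email]
FROM deleted AS d
END
GO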
In Figure 20 we can see that the Mendix model corresponds to the tables and
relationships in the Internet Sales database diagram of AdventureWorksDW2012.
There are no primary and foreign keys in the Mendix model, because Mendix does not
need primary and foreign key relationships. The attributes in the tables have similar
data types: for example, attributes defined as String in Mendix correspond to
nvarchar() in SQL, and Integer in Mendix corresponds to the SQL number types such
as float, double and integer.
Date is strictly NOT in BCNF, because it contains extra functional dependencies whose
determinants are not the primary key (DateKey); for example, the month number
determines the month name.
In Figure 21 we can see our Mendix application deployed. We can log in with the
existing user Jim1 and password 123456, or register as a new user.
In Figure 22 we can create or edit a product category; in this case we are creating a
new one called currency.
In Figure 23 we have created a new product, FOREX, and have selected the product
subcategory name euro.
In Figure 24 we have created a product subcategory cheap helmet and have chosen a
product category small helmet.
In Figure 25 we have a few product names with their product categories and product
subcategories. For example, we have Bike as product category, green Bike as product
subcategory and fast bike as product name. We also have FOREX as product name,
aus as product subcategory and eur as product category name, to demonstrate that the
relationship between category, subcategory and product is working as expected.
In Figure 26 we can see the data cube that was created, with its dimensions and
measures.
The Diagram Organizer pane lets you create sub-diagrams showing subsets of the
data source view. The Tables pane is where the tables and their schema elements are
displayed in a tree view.
In Figure 28 we see the cube structure again, with the tables and their attributes. The
Diagram pane is where the tables and their relationships are represented graphically.
In Figure 29 it can be seen that the names of the tables have been changed to
friendlier, more readable ones. All the Dim and Fact prefixes have been removed so
that the table names are easier to read.
In Figure 30 we can see that, by default, the wizard selects as measures all numeric
columns in the fact table that are not linked to dimensions. However, four of these
columns are not actual measures: the first three are key values that link the fact table
with dimension tables that are not used in the initial version of this cube, so these
columns were not selected as measures.
In Figure 31 we can see the attribute relationships: Product Key is connected to
English Product Subcategory, which is connected to English Product Category;
Product Key is also connected to Model Name, which in turn is connected to Product
Line.
In Figure 32 we can see that the data cube created in the Business Intelligence project
can be accessed from the localhost Microsoft Analysis Server. We can see the
measure groups, the data sources, the data source views and the dimensions of the
data cube. The cube is called DSV Adventure Works DW2012.
In Figure 33, Product was selected for the Dimension, English Categories for the
Hierarchy and Equal for the Operator; Bike and Racks were selected for the Filter
Expression, with no Parameters. Sales Amount was selected for the measures, and
after execution the result is 39360.
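For comparison only, a roughly equivalent query against the underlying relational
tables might look like the following sketch. The table and column names
(FactInternetSales, DimProduct, DimProductSubcategory, DimProductCategory,
EnglishProductCategoryName, SalesAmount) are assumed from the standard
AdventureWorksDW2012 sample schema, and the category names in the filter are
illustrative; the cube browser remains the authoritative result.
-- Sketch: total Sales Amount per product category for a chosen set of categories
SELECT pc.EnglishProductCategoryName, SUM(f.SalesAmount) AS SalesAmount
FROM dbo.FactInternetSales AS f
JOIN dbo.DimProduct AS p ON p.ProductKey = f.ProductKey
JOIN dbo.DimProductSubcategory AS ps ON ps.ProductSubcategoryKey = p.ProductSubcategoryKey
JOIN dbo.DimProductCategory AS pc ON pc.ProductCategoryKey = ps.ProductCategoryKey
WHERE pc.EnglishProductCategoryName IN (N'Bikes', N'Accessories') -- illustrative filter
GROUP BY pc.EnglishProductCategoryName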
In Figure 34, Product was selected for the Dimension, English Categories for the
Hierarchy and Equal for the Operator; Bike, Racks and Helmets were selected for the
Filter Expression, with no Parameters. Sales Amount was selected for the measures,
and after execution the result is 264695.600...
In Figure 35, Product was selected for the Dimension, English Categories for the
Hierarchy and Equal for the Operator; Bike, Racks, Helmets, Panniers, Mountain Bikes
and Bib-shorts were selected for the Filter Expression, with no Parameters. Sales
Amount was selected for the measures, and after execution the result is 10217455.1...
In Figure 36, Products was selected for the Dimension, Products for the Hierarchy and
Equal for the Operator; M was selected for the Filter Expression, with no Parameters.
Sales Amount was selected for the measures, and after execution the result is 10251183.
In Figure 37, Products was selected for the Dimension, Products for the Hierarchy and
Equal for the Operator; M and S were selected for the Filter Expression, with no
Parameters. Sales Amount, Total Product Cost, Order Quantity and Internet Sales
Count were selected for the measures, and the results after execution are shown in the
figure.
In Figure 38, Products was selected for the Dimension, Products for the Hierarchy and
Equal for the Operator; T was selected for the Filter Expression, with no Parameters.
Sales Amount, Total Product Cost, Order Quantity and Internet Sales Count were
selected for the measures, and the results after execution are shown in the figure.
In Figure 39, Products was selected for the Dimension, English Categories for the
Hierarchy and Equal for the Operator; Clothing and Bikes were selected for the Filter
Expression, with no Parameters. Sales Amount, Total Product Cost, Order Quantity
and Internet Sales Count were selected for the measures, and the results after
execution are shown in the figure.
In Figure 40, Products was selected for the Dimension, English Categories for the
Hierarchy and Contains for the Operator, with no Parameters. Sales Amount, Total
Product Cost, Order Quantity and Internet Sales Count were selected for the
measures, and the results after execution are shown in the figure.
In Figure 41, Products was selected for the Dimension, English Categories for the
Hierarchy and Range (Inclusive) for the Operator; the range Bike Racks : Bottle of
Cages was selected for the Filter Expression, with no Parameters. Sales Amount, Total
Product Cost, Order Quantity and Internet Sales Count were selected for the
measures, and the results after execution are shown in the figure.
An index provides pointers to the data values stored in a specified column and orders
those pointers according to the specified sort order. An index in a database is very
similar to the index of a book: you look up a particular value in the index and then
follow the pointer to the row that contains it.
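For example, a minimal sketch against the Appointment table from Task 1 (the index
name and the sample query are hypothetical, not part of the submitted scripts):
-- Sketch: a non-clustered index acting as the 'book index' for candidate names
CREATE NONCLUSTERED INDEX [IX_Appointment_candidateName]
ON [SomeUni].[Appointment] ([candidateName])
GO
-- A query such as this can now seek the index instead of scanning the whole table
SELECT [appointmentId], [date], [time]
FROM [SomeUni].[Appointment]
WHERE [candidateName] = N'John Smith'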
A primary key is a LOGICAL concept: it is the unique identifier for a row in a table. As
such, it has certain properties - it may not be null, and it must be unique. Since you are
likely to search for records by their unique identifier a lot, it is good to have an index on
the primary key. A clustered index is a PHYSICAL concept: it is an index that affects
the order in which records are stored on disk. This makes it a very fast index when
accessing data, though it may slow down writes if your primary key is not a sequential
number. You can have a primary key without a clustered index, and sometimes that is
the better option (for example, when your primary key is a combination of foreign keys
on a joining table and you do not want to incur the disk shuffle overhead when writing).
You can also create a clustered index on columns that are not a primary key. (Neville,
2013)
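A short sketch of the distinction Neville describes, using hypothetical table, column and
index names (this table is not part of the SomeUni schema):
-- The primary key is enforced with a NONCLUSTERED index (logical uniqueness only) ...
CREATE TABLE [SomeUni].[StaffCourse](
[staffId] [int] NOT NULL,
[courseId] [int] NOT NULL,
CONSTRAINT [PK_StaffCourse] PRIMARY KEY NONCLUSTERED ([staffId], [courseId])
)
GO
-- ... while the physical ordering of rows on disk is governed by a separate
-- clustered index on a different column.
CREATE CLUSTERED INDEX [CIX_StaffCourse_courseId]
ON [SomeUni].[StaffCourse] ([courseId])
GO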
In Figure 42 we start creating the data mining model with the Decision Tree algorithm
through the Data Mining Wizard.
In Figure 43 we specify the training data, that is, the input parameters and the
parameter we want to predict, in our case BikeBuyers.
In Figure 44 we specify the percentage of data held out for testing. We leave the
default value of 30%. The testing set is used to check the model's accuracy.
In Figure 46 we run the Process Mining Model command, and it can be seen that the
processing status is successful.
In Figure 47 we can see the Decision Tree: there is a 72.23% probability for someone
with our characteristics to buy a bike when they do not already have one, and 27.77%
to buy a bike when they already have one.
In Figure 48 we can see the Decision Tree and Clustering models.
In Figure 49 we can see the cluster model. There are 10 clusters, and cluster number
10 is the strongest (with the most intense colour). The probability of buying a bike is
67%, which is similar to the percentage we obtained from the decision tree.
In Figure 50 we can see the cluster profiles, with variables such as Age, Bike Buyers,
Commute Distance, English Education, English Occupation, Gender and Geography
Key.
In Figure 51 we can see the characteristics for the whole population, the values and
the probability.
Figure 52: Cluster Discrimination
In Figure 52 we can see the discrimination scores for Cluster 1 and the complement of
Cluster 1. All the variables and values can be seen, together with whether they favour
Cluster 1 or the complement of Cluster 1.
In Figure 54 we can see that the singleton query for the decision tree model gives a
probability of approximately 0.34 of buying a bike for a customer who is 50 years old,
has 3 children and 2 cars.
In Figure 56 we can see that the singleton query for the clustering model gives a
probability of approximately 0.43 of buying a bike for a customer who is 50 years old,
has 3 children and 2 cars.
In Figure 57 we can see three lines: the blue line represents the ideal model, the red
line represents the decision tree model and the green line represents the clustering
model. It can be deduced that the red line (the decision tree model) was better suited
to the explorations and predictions that were carried out in parts one and two, for two
reasons. First, the red line (decision tree model) is closer to the ideal model, which can
be seen from Figure 57. Second, the red line (decision tree model) gives better results.
For example, the score of the decision tree model (0.71) is higher than the score of the
clustering model (green line, 0.65), the population correct value (35.56%) is higher for
the decision tree model than for the clustering model (31.87%), and the predict
probability value (63.72%) is higher for the decision tree model than for the clustering
one (59.17%). For those reasons, the decision tree model was more appropriate for
the given data, desired prediction and input variables. It should be taken into
consideration that both the decision tree model and the clustering model use
algorithms already implemented by Microsoft; if the algorithms were implemented
differently, or a different data set were provided, the clustering model might give better
results.
BI tools such as MicroStrategy, Tableau, IBM Cognos, and others provide business
users with direct access to data warehouse insights. First, the business user can
create reports and complex analysis quickly and easily using these tools. As a result,
there is a trend in many data warehouse sites towards end-user self-service. Business
users can easily demand more reports than IT has staffing to provide. More important
than self-service however, is that the users become intimately familiar with the data.
They can run a report, discover they missed a metric or filter, make an adjustment, and
run their report again all within minutes. This process results in significant changes in
business users’ understanding the business and their decision-making process. First,
users stop asking trivial questions and start asking more complex strategic questions.
Generally, the more complex and strategic the report, the more revenue and cost
savings the user captures. This leads to some users becoming “power users” in a
company. These individuals become wizards at teasing business value from the data
and supplying valuable strategic information to the executive staff. Every data
warehouse has anywhere from two to 20 power users.
Query performance with BI tools lowers the analytic pain threshold. If it takes 24 hours
to ask and get an answer, users only ask once. If it takes minutes, they will ask dozens
of questions. For example, a major retailer was comparing stock-on-hand to planned
newspaper coupon advertising. Initially they ran an eight-hour report that analyzed
hundreds of stores. One power user saw they could make more money if the
advertising was customized for stores by geographic region. By adding filters and
constraints and selecting small groups of regional stores, the by-region query ran in two
minutes.
Hadoop and the data warehouse will often work together in a single information supply
chain. When it comes to Big Data, Hadoop excels in handling raw, unstructured and
complex data with vast programming flexibility. Data warehouses also manage big
structured data, integrating subject areas and providing interactive performance through
BI tools. It is rapidly becoming a symbiotic relationship. Some differences are clear,
and identifying workloads or data that runs best on one or the other will be dependent
on your organization and use cases. As with all platform selections, careful analysis of
the business and technical requirements should be done before platform selection to
ensure the best outcome. Having both Hadoop and a data warehouse onsite greatly
helps everyone learn when to use which. (Awadallah, A. Graham D. 2012)
NoSQL is often used for storing Big Data. This is a new type of database which is
becoming more and more popular among web companies today. Proponents of NoSQL
solutions state that they provide simpler scalability and improved performance relative
to traditional relational databases. These products excel at storing “unstructured data,”
and the category includes open source products such as Cassandra, MongoDB, and
Redis (Feinleib, 2012).
Awadallah, A. and Graham, D. (2012) Hadoop and the Data Warehouse: When to Use
Which. Available at: http://www.teradata.co.uk/white-papers/Hadoop-and-the-Data-Warehouse-When-to-Use-Which/?type=WP (Accessed: 04 April 2014).
Neville, K. (2013) Relationship of Primary Key and Clustered Index. Available at:
http://stackoverflow.com/questions/15051869/relationship-of-primary-key-and-clustered-index (Accessed: 04 April 2014).
Oracle Database 12c for Data Warehousing and Big Data (2013). Available at:
http://www.oracle.com/technetwork/database/bi-datawarehousing/data-warehousing-wp-12c-1896097.pdf (Accessed: 04 April 2014).
https://github.com/kanastasov/Advanced-Databses.git