E-Migrate: Semi Automated Mapping of Database Structure using String Similarity
and Distance Algorithms for FAITH Colleges Multiple Systems
A Capstone Project by:
JOSEPH PLATA MALALUAN
Master in Information Technology
Submitted to the School of Graduate Studies
First Asia Institute of Technology and Humanities
Tanauan City, Batangas, Philippines
In Partial Fulfillment
of the Requirements for the Degree
Master in Information Technology
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\
TABLES OF CONTENTS
Chapter 1: Introduction
Background of the Study……………………………………………………….1
Objectives of the Study…………………………………………………………2
General Objectives……………………………………………………...2
Specific Objectives……………………………………………………...2
Significance of the Study……………………………………………………….4
Scope and Limitations of the Study…………………………………………...5
Chapter 2: Review of Related Literature
Conceptual Literature…………………………………………………………...7
Related Study and Literature…………………………………………………..8
Existing System…………………………………………………………….......9
Synthesis………………………………………………………………............11
Conceptual Framework……………………………………………………….13
Operational Framework……………………………………………………….14
Definition of Terms…………………………………………………………….17
Chapter 3: Research Methodology
Research Design……………………………………………………………....17
System Evaluation………………..……………………………………………21
Subject of the Study…………………...………………………………………24
System Requirements………………………………………………………...25
1
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\
Chapter I
INTRODUCTION
1.1 BACKGROUND OF THE STUDY
Moving data is a critical process in migrating from one system to another. Legacy
systems that use outdated technology need to be migrated to a new system to be
more efficient and maintainable; however changes on structures and schema are
unavoidable in any migration process(1). Semantic problems in every column
should be resolved and apply more suitable column names in order to follow
naming convention standards(2).
The initial challenge that needs to be addressed before deciding to migrate into
another system is how compatible the new system is to the existing data. At
FAITH Colleges, multiple systems are in need of upgrading and improvements.
Campus++ an admission and enrollment system - cannot be easily replaced
without migrating data related to grades and back accounts. While another
system ISIS has different corporate data that needs to be intact from the
beginning.
Manual mapping of each table schema and columns is a potential solution,
however it may take a long time to be completed having more than 50 tables with
some containing at least 20 columns. While, fully automated data mapping is not
an option due to the old and poor design strategy designing the existing
database. Therefore, a combination of manual and automated database mapping
is ideal to be used. Automated- for faster execution and manual adjusting for
accuracy.
2
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\
Automated database mapping requires a strategy that can ensure accurate
mapping and relevant scoring. The initial stage of schematic mapping is linguistic
matching for element name comparison(3). In this stage, Smith-Waterman
Algorithm will be used to compare and rank similarity scoring for each candidate
element.
Smith-Waterman Algorithm aligns the comparing table names and column names
into a matrix formulation. This algorithm is a good fit in analyzing the similarity of
the element since it provides a scoring pattern.
1.2 OBJECTIVES OF THE STUDY
1.2.1 General Objective
The researcher aims to develop a system that will help migrate data
from a legacy database structure to a new structure faster and more
accurately.
1.2.2 Specific Objectives
1.2.1.1 Analyze the structure of the existing or legacy systems of FAITH
Colleges.
● To create a system that can suggest database table and column
mapping of old database structure to a new database structure
design in SQL Server or MYSQL server.
3
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\
● To allow users to visually compare database similarity of two
different databases designed in the same database management
system.
● To give users options on how to map database structure, with
manually mapped or follow the automated mapping suggestion.
1.2.1.2 Comparing different algorithms that can be used in database
entities mapping of databases design in SQL Server or MYSQL Server.
● Identify the algorithm that is accurate and efficient.
● Visually displays the score generated by the compared algorithms.
1.2.1.3 To utilize Smith-Waterman algorithm as a library tool for entity
mapping.
● Compare the Smith-Waterman algorithm with the Hamming
Distance and Levenshtein Distance algorithm.
● Use the most suitable algorithm for string comparison.
1.2.1.4 Evaluate the result of the study.
● Validate the efficiency of the system in comparison with the manual
process.
● Validate the suitability of Smith-Waterman algorithm in database
entity mapping.
4
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\
1.3 SIGNIFICANCE OF THE STUDY
The goal of this project is to provide aid migrating data from a legacy
database to a new database structure. With the help of the proposed system,
accuracy and efficiency of data migration can be achieved. Missing pertinent
data can be avoided and new structures can be utilized in warehousing old data
without interrupting new processes. Efficiency and reliability of migrated data will
benefit the following stakeholders:
1.3.1 To the current system developers. Mapping of database elements
can be a tedious task for developers. The proposed system can help
current system developers to lessen their job on manually analyzing
database schemas.
1.3.2 To the system users. Having an accurate data migration from
legacy system to new system can significantly improve reliability of the
system. System users no longer encounter missing data from the
replacement system. Implementation of new processes can also be
applied on old transactions that will be very important in performing
different tasks of the users.
1.3.3 To the researcher. Data migration research is the best way to come
up with the latest technique to improve knowledge and skills. Ideas and
learning can be acquired in the process of this research.
1.3.4 To the future researchers. Ideas discussed in this research can be
a big help to future researchers. Improved research for this topic can open
up a new discovery for data migration techniques.
5
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\
1.4 SCOPE AND LIMITATION OF THE STUDY
This project focuses on developing a system that maps and compares database
schema of a legacy database and a new database schema for migration.
The following features are included in the proposed system:
1. Dynamic database object mapping
The proposed system has the flexibility to be integrated in different
systems being used in any industry. Connection string attributes should be
supplied within the system interface. Table Name, Column name and
Instructor/Student object attribute are captured by the system for proper
mapping.
2. Visual presentation of database object
A web interface is available for viewing and validating of structure
mapping executed by the system. The user can update manually if
mapping adjustment is needed.
3. Linguistic Similarity Scoring
The proposed system displays the similarity score of two elements
being mapped. The score is computed based on Smith-Waterman
Algorithm.
While the system can visualize the actual database structure,
migration of data still needs the user’s intervention. While it is important to
consider database relationships and foreign keys, this research did not
tackle and implement in-depth study of relational databases. Database
6
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\
components such as storedprocedure, views and triggers are not included
in this research and implementation.
7
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\
CHAPTER II
REVIEW OF RELATED LITERATURE
This chapter presents the project’s features, related reading materials
from other related literature, and a review of related systems used by the
developers to complete the project.
2.1 CONCEPTUAL LITERATURE
This study literature review deals with the Database Management System,
Data Migration, Linguistic Module, and Smith-Waterman Algorithm. The
developers will use these concepts to create and deploy the proposed project
into actual development.
2.1.1 Database Management System. According to Chapman and Hall in
the book “Handbook of Data Structures and Applications” a database
management system is a software application that purposely designs to
store data. It is an interface between user and data that allows interaction
to execute different queries such as add, update, delete and retrieval of
data. In the journal written by Tim Kraska about Learned Databases(2019)
, DBMS is designed to handle different purposes, multiple schemas and
aims to provide efficient access to stored data.
2.1.2 Data Migration. In the article written by Henry Engeldow, Data
migration is always accompanied by challenges and obstacles. Different
structures and field definitions added complexity to the migration process.
8
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\
Datasource numerous fields increase the problems that need to be
resolved.
2.1.3 Linguistic Module. A journal written by Vitor Marini Blaselbauer
defines the linguistic module as the measures of similarity between
strings. In his article, JSON elements are being used to compare schemas
to create pairs for comparison. Pairs and nodes are in result after lexical
function under semantic module of comparison.
2.1.4 Smith-Waterman Algorithm. According to Fabio F. de Oliveira in a
journal posted on June 30,2022, SW algorithm initially intended to perform
sequence between nucleotides and proteins under the biology field.
Forward and backstage are performed in this algorithm. While in an article
written by Han Su Yin Nyunt, a parallel implementation was done in
sequence of strings with relevant scoring.
2.2 RELATED STUDIES AND LITERATURE
The Information gathered from the previous research helped a lot in the
project. In addition, the related literature from other research and articles
improved the developers' understanding of the project. Thus it gave the
developers the knowledge to design and create a system that can be used for
data mapping and migration.
The examination of old data is the first stage of a multi-phase process
called data migration, which concludes with the loading and reconciliation of data
into new systems. Organizations are always in need of data migration due to the
9
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\
exponential expansion of data. Data migration may be a difficult procedure, and
testing is necessary to guarantee the accuracy of the data. If best practices are
not followed and hidden expenses are not discovered early enough, migration
can be highly expensive ,Sarmah, S. S. (2018).
There are existing tools that can be used but it is hard to find the best one
mostly for the specific needs of the users. According to a study of Elamparithi,
M., & Anuratha, V. (2015) due to the varied structures and many RDB data types,
Relational Database Migration (RDBM) has generally been a complicated, time-
consuming, and expensive procedure. There are several Database Migration
Tools (DBMTs) available in the software industry, but choosing the best one is
still a difficult task. Therefore, some of the DBMTs are examined, and certain
fundamental standards for assessing migration toolkits are given.
Even though NoSQL is a newer database system, there aren't as many
tools and methods for designing schema in it as there are in RDBMS. NoSQL
necessitates a different strategy for creating effective schema, such as deciding
which information belongs in a document database as an embedded document
or a referenced document. For the creation of schemas in NoSQL databases,
there are certain general guidelines. Creating a decent and effective database
schema is a particularly challenging process in reengineering projects, especially
when switching from an old RDBMS to a new NoSQL system. In this research,
we offer a model for a schema design advisor that leverages the SQL query load
10
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\
of the current program as an input and an algorithm for schema design
recommendation, Namdeo, B., Suman, U. (2021).
2.3 EXISTING SYSTEMS
2.3.1 Azure Migrate. This migration tool is used under the Azure
environment. It allows the user to assess the actual data environment
using different tools such as Turbonic and Unify Cloud. This migration tool
focuses on migrating the entire database (schema and data) all in one in
an azure environment. It supports the migration of systems running under
microsoft technology.
2.3.2 Fivetran. With the help of the cloud-based ETL data transfer
platform Fivetran, data engineers can focus less on laborious engineering
activities and more on producing new insights. Fivetran streamlines data
migration by automatically handling all data integration duties using
connectors that deploy in a matter of minutes, adapt to source changes
automatically, and require zero maintenance.
2.3.3 Microsoft Data Migration Assistant You can update your database
schema and data from a source SQL Server environment to a target SQL
Server environment using the Microsoft tool known as Data Migration
Assistant, or DMA. You can transfer schema, data, and uncontained
11
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\
objects from your source server to your target server using this migration
tool.
The best thing about utilizing this tool is that it can help you find
compatibility problems that might affect how well your new SQL Server
database version functions. DMA maintains a minimum amount of
downtime while automating database migration. Additionally, it offers
assistance in moving databases like Microsoft SQL Server, MySQL,
PostgreSQL, and MongoDB from on-premises and other cloud
environments to Azure.
2.3.4 Integrate.io (Formerly Xplenty). A data migration solution called
Integrate.io offers customers a single interface for managing, converting,
and moving data between several applications. This tool aids businesses
in the integration, processing, and preparation of data for cloud analytics.
The solution has a user-friendly interface, a highly automated workflow,
and is both scalable and economical.
Users may concentrate on their businesses without having to worry about
the difficulties of data conversion thanks to Integrate.io's simplicity of use.
The availability of integrations via REST API or direct FTP uploads
enables even non-technical staff to use this application in a smooth
manner with their technology stack.
12
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\
2.3.5 IRI NextForm. A data migration tool developed on Eclipse is called
IRI NextForm. From one system or format to another, this utility can
migrate, reformat, convert, replicate, federate, and report data.
Additionally, it may replace outdated database engines and other
components that might not have native export capabilities as well as
transport data from legacy systems into new contexts.
2.4 SYNTHESIS
Concepts, literature and related systems were reviewed and analyzed in
comparison with E-Migrate. The researcher used those studies to further
understand the value of the migration concept. The present study is improved
with the use of available ideas from previous studies.
13
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\
Table 2.1 Summary of Related System Comparisons
Related System Developed
System
Software Azure Fivetr Microsoft Integrat NextForm E-Migrate
Features Migrate an e.io
Assist
Similarity No No No No No Yes
Analysis
Graphical No Yes No No No Yes
Analysis
Table No No No Yes Yes Yes
Mapping
Field No No No Yes Yes Yes
Mapping
Constraint Yes Yes Yes Yes Yes Yes
Check
Foreign Key Yes Yes Yes Yes Yes Yes
Check
14
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\
For database concepts and applicable data migration, Chapman and Hall
insights Handbook of Data Structures and Applications (2009) and Henry
Engeldow in Data Migration (2011) identified key features in evaluating the
database structure and migration concepts.
Linguistic analysis from Blaselbauer was used to evaluate different algorithms
such as Smith-Waterman Algorithm, Hamming Distance and Levenshtein
Distance.
While related systems were analyzed and compared with the proposed system.
No features were duplicated or copied from the existing system with the
proposed system. Algorithm and structures are completely different and unique
with the proposed system.
15
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\
2.5 CONCEPTUAL FRAMEWORK
Figure 2.1 Conceptual Framework
Figure 2.1 Illustrates the conceptual framework of the e-migrate project. The
image is utilized to visually represent the overall processes involved in the
system. It shows that a db admin is needed to provide connection strings value in
an e-migrate system. The system will process the inputs and provide database
elements to be processed in the server. The application then uses the process
data to gather data for migration and save it to the new database destination.
16
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\
2.6 OPERATIONAL FRAMEWORK
2.6.1 Procedure for System Development
Deciding which System development life cycle to be used in developing the
system were analyzed based on different factors. Below are the factors
considered in deciding to Iterative Incremental Development Models Model.
1. Number of human resources that will develop and analyze the system
2. Availability of the system specification upon start of the development
3. Data Gathering for machine learning training
4. Time constraint.
5. Development tools and familiarity.
Figure 2.2 Iterative Incremental Development Models using Agile method
17
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\
2.6.1.1 Planning
This is the initial step of the project development to identify the value of
the project and setting up the goals and timeline. A project scope also
developed within this phase.
2.6.1.2 Requirements and Roadmap Creation
Roadmap is very important having the process of breaking down of
deliverables to build the final product is identified. Each individual feature
is built by the whole team.
2.6.1.3 Release Planning
Before the project kicks off, a road map and project plan are presented.
Feature releases are and deliverables are finalized.
2.6.1.4 Sprint Planning
This phase is responsible for identifying tasks to be accomplished by each
member of the team during the specific sprint. It also includes discussions
on how it is possible to achieve specific accomplishments. Assessment of
tasks loads is part of this step.
2.6.1.5 Stand up Meeting
This is a short meeting that is being held to help each member of the team
accomplish their tasks. At some point changes need to be made during
the sprint, discussion of the changes is being done during this meeting.
2.6.1.6 Sprint Review and Retrospective
End of each sprint a two or combined meeting are being held to review the
sprint and to present the finished product. Next is the sprint retrospective
18
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\
meeting wherein discussion on what are the difficulties and bottlenecks
during the sprint and how to improve the process on the next.
19
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\
2.7 DEFINITION OF TERMS
20
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\
CHAPTER III
RESEARCH METHODOLOGY
This study utilizes comparative analysis research and quantitative analysis to
design and implement the system. Research on algorithms to be used is also
discussed in this chapter.
3.1 RESEARCH DESIGN
Choosing which algorithm to use in developing a mapping tool for data migration
could be a tough one without comparing the most suitable algorithm available.
For this study, comparative descriptive design is utilized to make comparisons
between entities while conducting experimental research (Cantrell, 2011). Score
comparisons between the three selected algorithms were done and analyzed by
eight professionals in software engineering and database administrations.
Prototyping Method is utilized for the execution of the experimental research.
The researcher compared three widely used algorithms in technical linguistics.
21
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\
Table 3.1 Algorithm and Computation Basis
Algorithms Equation Computation Basis
Hamming Distance String Distance
Levenshtein Distance String Distance
Smith-Waterman Algorithm Similarity
Table 3.1 defined the algorithm being compared in this research. These
algorithms have their own basis in computing differences. These bases are also
defined in table 1.
22
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\
Table 3.2 String Differences Scenarios
Scenarios Percent Shares
strings in same length 33.33%
strings in different length 33.33%
identical strings 33.33%
Table 3.2 shows different scenarios that are most likely encountered by
comparing different database entities.
Table 3.3 Category and Percent Share
Categories Percent Shares
Accuracy 70%
Efficiency 30%
Table 3.3 contains the test consideration category in selecting the algorithm.
Accuracy has higher consideration, while speed is less.
23
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\
A prototype application containing the implementation of the three algorithms
were created and evaluated using the below matrix.
3.2 SYSTEM EVALUATION
E-Migrate will utilize ISO 25010 Quality Model Characteristics in evaluating the
system. Feedback is captured using Google Form and distributed to technical
experts and different target users.
Table 3.5 System Attribute for Evaluation
Characteristics Sub Characteristics Evaluation Questions
Functionality Functional Is the system properly connected
Specifications and visualize the connection
structure?
Fields and Tables mapping options
is available
Easy to map fields and table
objects
Reliability Fault tolerance Suggested mapped objects are
accurate
Recoverability
Score provided in each mapping
are accurate
Migration options are usable and
produce correct output
No data is loss during the process
and execution
24
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\
Usability Learnability Link and Pages are properly
placed where the user can easily
Operability find it.
Error Protection Errors and Warning are properly
shown.
The system can be very useful and
replace current process
Efficiency Time Behavior The system respond time is
acceptable
The system execution time is
acceptable
Resource Utilization Resources consumes are
acceptable
Maintainability Modifiability The system can easily identify
errors
Capacity
Fixes or work around can be easily
apply
The system test cases is easy to
apply
Portability Adaptability Different environment(databases)
can be integrated in the system
Viewing from different browser
(includes mobile)
25
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\
Table 3.5 Likert Scale and Score equivalent
Score Rating Interpretation
Equivalent
80-100 4.51-5.00 Highly Acceptable
60-79 3.51-4.50 Acceptable
40-59 2.51-3.50 Moderately Acceptable
20-39 1.51-2.50 Slightly Acceptable
0-19 1.00-1.50 Not Acceptable
Table 3.4 gives the rating equivalent to the score resulting from the prototype
execution. Evaluators may give their own score upon manually evaluated and
compared given strings.
3.3 SUBJECT OF THE STUDY
Researchers conducted surveys and evaluation from experts in programming,
mathematics and database administration. There were twenty-five invited
respondents to evaluate the e-migrate project and twenty three of them
26
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\
responded. Among the respondents are five software developers, eight data
engineers and ten database administrators.
Table 3.5 Profile of respondents
Participants Description Frequency Percentage
Software Engineers Locally Employed 5 21.75
Data Engineers Internationally Employed 8 34.78
Database Admin Locally Employed 10 43.47
27
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\
SYSTEM REQUIREMENTS
To properly implement the system, minimum requirements must be provided.
Table 3.6 Software Requirements
Category Minimum Requirements Recommended
Requirements
Operating Windows 7 or higher and its Windows 10
System equivalent on other platforms
Browser Firefox, Google Chrome, Safari Firefox,Google
Chrome
Programming .net Framework 6.0 .NET Framework 6.0
Libraries
Hosting IIS 6.0 IIS 6.0
Database MS SQL Server 2018 MS SQL Server 2022
Table 3.6 displays the rundown of development and programing libraries needed
to complete the software development. The system is web based and expected
to be hosted in a Microsoft Server environment. .NET Framework is also a
requirement to handle. Core library requirements.
Category Minimum Requirements Recommended
Requirements
28