ERE 551 GIS FOR ENGINEERS
SOLUTIONS TO HOMEWORK PROBLEM SET: DATA MANAGEMENT
DATA MANAGEMENT
1. Define the terms: database and database management system (DBMS).
A database is a collection of persistent, interrelated data, typically consisting of both information
and relationships.
A DBMS is a system used to manage database structure (programs used to manipulate and
maintain data, control access) and ensure efficient organization and integrity of data.
2. Describe the background and required capabilities of a DBMS.
DBMSs originated in business applications and were developed primarily for non-spatial data.
Capabilities include ability to: capture data and implement applications, support multiple data
views, provide integration with minimal redundancy, maintain data integrity and security, and
allow concurrent access. Required capabilities include: Define database (provide data model);
Construct database (establish framework and input data); Manipulate data (control over editing,
updates); Query database (provide indexes and a query language); View data (allow for
customization in view); Secure database (control access, provide backup/recovery);
Administration tools (general administration, programming)
3. Spatial data provides unique challenges in DBMSs. Discuss some of these issues and challenges.
Spatial queries Æ very different to text types of questions; need for more than just alphanumeric
indexing
Storage issues Æ need to input/analyze spatial and image data; increased file size versus
traditional databases; as mentioned above, issues with indexing versus alphanumeric data
Other issues Æ temporal issues; maintaining integrity of spatial relationships e.g. after analysis
4. Describe several advantages and disadvantages of the DBMS approach.
Advantages Æ
- Centralized control: DBMS can be controlled by a single person or group; ensure quality
standards maintained and security enforced; manage conflicting requirements; ensure
integrity of database
- Efficient data sharing: flexible, but controlled sharing; reduce redundancy
- Data independence: application programs are independent of data
- Implementation of new applications: new application programs and unique searches are
easily implemented using a DBMS
- Direct user access: Users (non-programmers) can perform analysis
- Redundancy control: monitor and reduce redundancy, manage updates/copies
- Provide enhanced user interface and view flexibility
Disadvantages
- Cost: software and hardware can be expensive to acquire and maintain
- Added complexity: compared to file based system
- Centralized risk: greater theoretical risk of data loss or corruption with minimal
redundancy (but negated by ability to backup more easily)
- Decreased performance: for some databases
5. List the traditional and new models used in implementing DBMSs.
Traditional: Hierarchical; Network; Relational
Newer: Object; Object-Relational
6. Describe the Relational Data Model. Provide advantages and disadvantages.
Represent data as a collection of tables (formally called relations, hence the name): table rows are
collections of related data values; table columns provide attribute values.
Advantages Æ Simple data structure, generally good flexibility; formal query operations
Disadvantages Æ no explicit representation of relationships; reduced performance for large, well-
defined databases; limitations for geographic storage/queries
7. Define the terms from RDBMSs: relation, attribute, domain, tuple, degree, cardinality, and table
schema.
Relation – table with rows and columns
Attribute – named column of relation
Domain – set of allowable values for attribute
Tuple (AKA record) – a row of relation
Degree – number of attributes in table
Cardinality – number of rows in table
Table schema – definition of table characteristics: set of attributes and domain name parts; not the
data
Key – set of attributes that uniquely identifies rows in a table, can’t remove any attribute and still
have a key; every relation has at least one potential key (set of all attributes) or it isn’t a
relation since by definition objects in a database must be unique;
Candidate key – there may be more than one key in a database, hence have candidate keys. One of
these is usually selected and called the primary key.
8. Discuss the purpose of and types of integrity constraints in RDBMSs.
Integrity constraints do exactly as the name suggests: enforced constraints to ensure the integrity
and consistency of the data in a database. These constraints are specified on the database schema
and must hold for every instance of the database.
There are fundamentally two types of integrity constraints: those that constrain individual tables
and those that constrain multiple tables. The individual table constraints say that every table must
have primary key attribute(s) identified; and these attribute(s) cannot be null. If no primary key
attributes are identified then have no way to distinguish between records. Same problem if the
primary key attributes are allowed to be null (can’t tell the difference between two nulls). The
integrity constraints between tables consider relationships between objects in tables and are
important to maintain consistency (e.g. don’t want to delete information about an object in one
table and leave information “hanging” in another table.
9. Discuss each of the following components of informal database design: attribute semantics,
reducing redundancy, reduction of null values.
Attribute semantics Æ group attributes to provide a meaning, assume logical association of
attribute values, and specify interpretation of values in row
Reducing redundancy Æ benefits of reducing redundancy include the decrease in storage demands
and the mitigation of update anomalies (insertion, deletion, and modification)
Reduction of null values Æ combining several entities can increase chance of having many null
values, waste storage space, and potentially be confusing
QUERYING AND INDEXING
1. Discuss the importance of database querying.
This question could be answered in a thousand ways. Nonetheless, it is generally accepted that it
is important to ask questions about the data being stored. Additionally, querying helps with
manipulation, updating, and viewing data.
2. Describe the function and primary components of structured query language (SQL).
SQL provides a means to create database and table structures, to perform basic data management
tasks (insertion, deletion, modification), and to perform simple and complex queries. The major
components of SQL are the Data Definition Language (DDL) used to define the database structure
and control database access and the Data Management Language (DML) used to retrieve and
update data. Within DML, the most important statement is SELECT, which is used to retrieve and
display data from one or more tables. Multiple examples of using the SELECT statement are given
in lecture and also used in the lab exercises.
3. Discuss the need for spatial queries and the similarities/differences between spatial and non-spatial
querying.
Traditional database querying asks questions about data typically based on alphanumeric
searching e.g. find the largest, find the oldest, find the 10 most expensive, etc. While the same
types of questions can also be asked of spatial data, there are queries that are inherently spatial in
nature. Spatial queries search for data that meet a specified condition in terms of some spatial
relationship or characteristic. Spatial relationships include nearest, adjacent, intersects, contains,
etc. Spatial characteristics include area, perimeter, centered, length, etc.
4. Define the term index. Describe the benefit and technique of indexing.
An index is a data structure that helps DBMSs to locate records quickly and speed responses to the
user (much like the index in a book helps a reader to find a topic more quickly). The benefits of
using indexing include elimination of sequential data searching and reduction in the set of records
a query looks at. Traditional indexes create a sorted index file that references records in the
database (see question 5). While this technique works for enhancing attribute searching in spatial
databases, it doesn’t allow for faster searches based on locations (see questions 6and 7).
5. Describe the general procedure for non-spatial indexing.
The general procedure is to specify the field(s) for indexing and create a sorted index file based on
selected field(s). A link is provided from each value in the index to the relevant database record.
Note that it is possible to specify field characteristics in the index e.g. values must be unique or not
null, etc. For every index that is created, the database has to essentially store a table of sorted
information that is updated each time the database is changed. This process is efficient only if the
number of indexed fields is limited to those that are likely to be searched. It will not be efficient to
index every field.
6. Spatial indexes can be broadly divided into space-driven and data-driven. Explain the distinction
between these two approaches to indexing.
Space-driven approaches consider the extent of the region under consideration and create an
index to systematically cover the entire region without any consideration about potential
variability in the density of data distribution. Data-driven approaches specifically consider where
the data is and how it may be most efficiently grouped.
7. Describe each of the following approaches to spatial indexing: Grid indexing, quadtree index, and
r-tree index.
Grid indexing Æ method partitions 2D space into rectangular cells, independent of distribution of
spatial objects, and maps objects to cells based on some geometric criteria. The grid location of
each object in the database is recorded in the index. The efficiency of a grid index relates to the
size of the grid and the size of the objects. Many pieces of software use a multi-level grid system
(often three levels). Grid indexes are very simple to implement, are robust, fast to create and
update, and provide good flexibility
Quadtree index Æ this is a special case of a grid index. It involves recursive division of space into
quadrants, with the division based on data density. Quadtrees are popular because the idea can
be applied to a variety of data types, is easy to implement and demonstrates good performance.
The quadtree approach is not only used for indexing but is also used for raster layer compression.
R-tree index Æ this is a data driven approach that groups objects using minimum bounding
rectangle (MBR). You need to specify the number of objects within each group and the maximum
size of the MBR.
GIS DESIGN
1. Describe the need for planning when establishing a GIS.
Establishing a GIS is a resource intensive venue: planning helps to reduce the chance of wasting
resources (whether human, fiscal, or data based resources). Planning also helps to ensure that the
GIS will satisfy user needs in terms of factors such as software compatibility, data and data
handling needs, and institutional fit. Effective planning ensures there is flexibility for future
expansion and can reduce the potential for redundancy.
2. Planning must consider both technical and institutional restrictions: describe these elements.
Above all else, the designed GIS must meet the functionality requirements of the project i.e. it must
be able to perform the desired analysis. However, the technical considerations typically also
include evaluation of personnel (e.g. training requirements), software (e.g. functionality), and data
(e.g. required data types) as well as the interaction between these components. This is not
unrelated to the institutional considerations e.g. data cost, personnel considerations (outsourcing
vs. training), legal consequences
3. GIS Design must consider the project from a conceptual standpoint as well as a practical position.
Describe these issues.
The conceptual part of the design involves assessing the general needs and available capabilities,
looking at the problem in a global view. This does not get as specific as looking at e.g. what
software will be used, but rather focusing on the global picture, what needs to be accomplished.
The conceptual design involves determining what needs to be represented, what characteristics
need to be recorded, and potentially how objects will relate to each other.
The practical side of the design looks at making decisions about the software (functionality,
performance, licensing, availability …) as well as understanding the specifics about the study area
(e.g. extent, accessibility …), appropriate data structure (raster vs. vector; coordinate system and
datum), and data collection (e.g. primary vs. secondary, sampling specifications—considering
both spatial and attribute samples, need to collect data beyond the final extent, choices about
classification/categorization of data)
4. Describe the reasons for working with prototypes in designing a GIS
Major benefit is to conserve resources by testing out plan on a small area. Helps to characterize
costs to collect and process data, as well as tools that will be required to perform analysis.
Potentially may help decision making e.g. in terms of raster vs. vector or choice of georeferencing.