s09 Geospatial Data Management
This content was accessible as of December 29, 2012, and it was downloaded then by Andy Schmitz
(http://lardbucket.org) in an effort to preserve the availability of this book.
Normally, the author and publisher would be credited here. However, the publisher has asked for the customary
Creative Commons attribution to the original publisher, authors, title, and book URI to be removed. Additionally,
per the publisher's request, their name has been removed in some passages. More information is available on this
project's attribution page (http://2012books.lardbucket.org/attribution.html?utm_source=header).
For more information on the source of this book, or why it is available for free, please see the project's home page
(http://2012books.lardbucket.org/). You can browse or download additional books there.
Chapter 5
Geospatial Data Management
Every user of geospatial data has experienced the challenge of obtaining,
organizing, storing, sharing, and visualizing their data. The variety of formats and
data structures, as well as the disparate quality, of geospatial data can result in a
dizzying accumulation of useful and useless pieces of spatially explicit information
that must be poked, prodded, and wrangled into a single, unified dataset. This
chapter addresses the basic concerns related to data acquisition and management
of the various formats and qualities of geospatial data currently available for use in
modern geographic information system (GIS) projects.
LEARNING OBJECTIVE
Data Types
The type of data that we employ to help us understand a given entity is determined
by (1) what we are examining, (2) what we want to know about that entity, and (3)
our ability to measure that entity at a desired scale. The most common types of data
available for use in a GIS are alphanumeric strings, numbers, Boolean values, dates,
and binaries.
A single precision floating-point6 value occupies 32 bits, like the long integer.
However, this data type divides those bits into 1 sign bit, 8 exponent bits, and 23
fraction (mantissa) bits, which yields approximately 7 significant decimal digits. A
double precision floating-point7 value occupies 64 bits, twice the storage of a
single precision value. Double precision floats devote 1 bit to the sign, 11 bits to
the exponent, and 52 bits to the fraction, which yields approximately 16 significant
decimal digits (Figure 5.1 "Double Precision Floating-Point (64-Bit Value), as Stored
in a Computer").
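The practical difference between the two floating-point types can be seen by storing the same coordinate at both precisions. The following is a minimal sketch using Python's standard struct module; the coordinate value and the helper name to_single are illustrative assumptions, not part of any GIS package.

```python
import struct


def to_single(value: float) -> float:
    """Round-trip a Python float (64-bit double) through a 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# Hypothetical longitude in decimal degrees, chosen only for illustration.
coordinate = 123.456789012345

double = coordinate            # Python floats are already double precision
single = to_single(coordinate)  # same value squeezed into single precision

single_bits = struct.calcsize("<f") * 8
double_bits = struct.calcsize("<d") * 8
error = abs(single - double)

print(single_bits, "bits:", repr(single))
print(double_bits, "bits:", repr(double))
print("difference introduced by single precision:", error)
```

The difference is small in absolute terms, but for coordinates stored in decimal degrees it can correspond to a noticeable ground distance, which is one reason double precision is usually preferred for coordinate fields.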
Boolean, date, and binary values are less complex. Boolean8 values are simply those
values that are deemed true or false based on the application of a Boolean operator
such as AND, OR, and NOT. The date data type is presumably self-explanatory, while
the binary data type represents attributes whose values are either 1 or 0.
locales. Other examples of nominal data include last name, eye color, land-use type,
ethnicity, and gender.
Ordinal data10 places attribute information into ranks and therefore yields more
precisely scaled information than nominal data. Ordinal data describes the position
in which data occur, such as first, second, third, and so forth. These scales may also
take on names, such as “very unsatisfied,” “unsatisfied,” “satisfied,” and “very
satisfied.” Although this measurement scale indicates the ranking of each data
point relative to other data points, the ordinal scale does not explicitly denote the
exact quantitative difference between these rankings. For example, if an ordinal
attribute represents which runner came in first, second, or third place, it does not
state by how much time the winning runner beat the second place runner.
Therefore, one cannot undertake arithmetic operations with ordinal data. Only
sequence is explicit.
Ratio data12 are similar to interval data; however, they are based
around a meaningful zero value. Population density is an example of ratio data
whereby a 0 population density indicates that no people live in the area of interest.
Similarly, the Kelvin temperature scale is a ratio scale as 0 K does imply that no
heat (temperature) is measurable within the given attribute.
Specific to numeric datasets, data values also can be considered to be discrete or
continuous. Discrete data13 are those that maintain a finite number of possible
values, while continuous data14 can be represented by an infinite number of
values. For example, the number of mature trees on a small property will
necessarily be between one and one hundred (for argument’s sake). However, the
height of those trees represents a continuous data value as there are an infinite
number of potential values (e.g., one tree may be 20 feet tall, 20.1 feet, 20.15 feet,
20.157 feet, and so forth).
10. A data scale that places attribute information into ranks.
11. A data scale based on values with equal intervals but with no meaningful zero.
12. A data scale based on values with equal intervals and a meaningful zero.
13. Data that are limited to a finite number of potential values.
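The practical consequence of a meaningful zero can be illustrated with temperature. The sketch below assumes two hypothetical readings in degrees Celsius, an interval scale whose zero is arbitrary, and converts them to Kelvin, the ratio scale mentioned above. Differences agree on both scales, but only the Kelvin values support a statement such as "x times as warm."

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert an interval-scale reading (Celsius) to a ratio-scale one (Kelvin)."""
    return celsius + 273.15


# Hypothetical readings, used only to illustrate the two scales.
morning_c, afternoon_c = 10.0, 20.0

# Dividing interval-scale values gives a number with no physical meaning.
naive_ratio = afternoon_c / morning_c

# Dividing ratio-scale values is meaningful because 0 K is a true zero.
true_ratio = celsius_to_kelvin(afternoon_c) / celsius_to_kelvin(morning_c)

# Differences, by contrast, are valid on both scales.
difference_c = afternoon_c - morning_c
difference_k = celsius_to_kelvin(afternoon_c) - celsius_to_kelvin(morning_c)

print("naive Celsius ratio:", naive_ratio)
print("Kelvin ratio:", round(true_ratio, 4))
print("differences:", difference_c, round(difference_k, 6))
```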
Now that we have a sense of the different data types and measurement scales
available for use in a GIS, we must direct our thoughts to how this data can be
acquired. Primary data capture15 is a direct data acquisition methodology that is
usually associated with some type of in-the-field effort. In the case of vector data,
directly captured data commonly comes from a global positioning system (GPS) or
other types of surveying equipment such as a total station (Figure 5.2 "GPS Unit
(left) and Total Station (right)"). Total stations are specialized, primary data capture
instruments that combine a theodolite (or transit), which measures horizontal and
vertical angles, with a tool to measure the slope distance from the unit to an
observed point. Use of a total station allows field crews to quickly and accurately
derive the topography for a particular landscape.
In the case of GPS, handheld units access positional data from satellites and log the
information for subsequent retrieval. A network of twenty-four navigation satellites
is situated around the globe and provides precise coordinate information for any
point on the earth’s surface (Figure 5.3 "Earth Imaging Satellite Capturing Primary
Data"). Maintaining a line of sight to four or more of these satellites provides the
user with reasonably accurate location information. These locations can be
15. A direct data acquisition methodology that is associated with an in-the-field effort.
In addition to the typical GPS unit shown in Figure 5.2 "GPS Unit (left) and Total
Station (right)", GPS is becoming increasingly incorporated into other new
technologies. For example, smartphones now embed GPS capabilities as a standard
technological component. These phone/GPS units maintain comparable accuracy to
similarly priced stand-alone GPS units and are largely responsible for a renaissance
in facilitating portable, real-time data capture and sharing to the masses. The
ubiquity of this technology led to a proliferation of crowdsourced data acquisition
alternatives. Crowdsourcing16 is a data collection method whereby users
contribute freely to building spatial databases. This rapidly expanding methodology
is utilized in such applications as TomTom’s MapShare application, Google Earth,
Bing Maps, and ArcGIS.
Raster data obtained via direct capture comes more commonly from remotely
sensed sources (Figure 5.3 "Earth Imaging Satellite Capturing Primary Data").
Remotely sensed data offers the advantage of obviating the need for physical access
to the area being imaged. In addition, huge tracts of land can be characterized with
little to no additional time and labor by the researcher. On the other hand,
validation is required for remotely sensed data to ensure that the sensor is not only
operating correctly but properly calibrated to collect the desired information.
Satellites and aerial cameras provide the most ubiquitous sources of direct-capture
raster data (Chapter 4 "Data Models for GIS", Section 4.3.1 "Satellite Imagery").
Secondary data capture17 is an indirect methodology that utilizes the vast amount
of existing geospatial data available in both digital and hard-copy formats. Prior to
initiating any GIS effort, it is always wise to mine online resources for existing GIS
data that may fulfill your mapping needs without the potentially intensive step of
creating the data from scratch. Such digital GIS data are available from a variety of
sources including international agencies (CGIAR, CIESIN, United Nations, World
Bank, etc.); federal governments (USGS, USDA, NOAA, USFWS, NASA, EPA, US
Census, etc.); state governments (CDFG, Teale Data Center, INGIS, MARIS, NH GIS
Resources, etc.); local governments (SANDAG, RCLIS, etc.); university websites
(UCLA, Duke, Stanford, University of Chicago, Indiana Spatial Data Portal, etc.); and
commercial websites (ESRI, GeoEye, Geocomm, etc.). These secondary data are
available in a wide assortment of file types, extents, and sizes but are ready-made to
be used in most GIS software packages. Often these data are free, but many sites will
charge a fee for access to the proprietary information they have developed.
17. An indirect data acquisition methodology that utilizes the vast amount of existing data
available in both digital and hard-copy formats.
Although these data sources are all cases where the information has been converted
to digital format and properly projected for use in a GIS, there is also a great deal of
spatial information that can be gleaned from existing, nondigital sources. Paper
maps, for example, may contain current or historic information on a locale that
cannot be found in digital format. In this case, the process of digitization18 can be
used to create digital files from the original paper copy. Three primary methods
exist for digitizing spatial information: two are manual, and one is automated.
heads-up digitizing session to edit and repair any errors that occurred during
automation.
The final secondary data capture method worth noting is the use of information
from reports and documents. Via this method, one enters information from
reports and documents into the attribute table of an existing, digital GIS file that
contains all the pertinent points, lines, and polygons. For example, new information
specific to census tracts may become available following a scientific study. The GIS
user simply needs to download the existing GIS file of census tracts and begin
entering the study’s report/document information directly into the attribute table.
If the data tables are available digitally, the “join” and “relate” functions
in a GIS (Section 5.2.2 "Joins and Relates") are often extremely helpful as they will
automate much of the data entry effort.
KEY TAKEAWAYS
• The most common types of data available for use in a GIS are
alphanumeric strings, numbers, Boolean values, dates, and binaries.
• Nominal and ordinal data represent categorical data, while interval and
ratio data represent numeric data.
• Data capture methodologies are derived from either primary or
secondary sources.
EXERCISES
LEARNING OBJECTIVE
Several types of database models exist, such as the flat, hierarchical, network, and
relational models (Worboys 1995; Jackson 1999). A flat database24 is essentially a
spreadsheet whereby all data are stored in a single, large table (Figure 5.4 "Flat
Database"). A hierarchical database25 is also a fairly simple model that organizes
data into a “one-to-many” association across levels (Figure 5.5 "Hierarchical
Database"). Common examples of this model include phylogenetic trees for
classification of plants and animals and familial genealogical trees showing
parent-child relationships. Network databases26 are similar to hierarchical
databases; however, they also support “many-to-many” relationships
(Figure 5.6 "Network Database"). This expanded capability allows greater search
flexibility within the dataset and reduces potential redundancy of information.
Alternatively, both the hierarchical and network models can become incredibly
complex depending on the size of the databases and the number of interactions
between the data points. Modern geographic information system (GIS) software
typically employs a fourth model referred to as a relational database27 (Codd 1970).
Worboys, M. F. 1995. GIS: A Computing Perspective. London: Taylor & Francis.
Jackson, M. 1999. “Thirty Years (and More) of Databases.” Information and Software
Technology 41: 969–78.
Codd, E. 1970. “A Relational Model of Data for Large Shared Data Banks.”
Communications of the Association for Computing Machinery 13 (6): 377–87.
23. A software package that allows for the creation, storage, maintenance, manipulation,
and retrieval of large datasets distributed over one or more files.
24. A database model whereby all data are stored in a single table.
25. A simple database model that organizes data into a “one-to-many” association across
levels.
26. A simple database model that organizes data into a “one-to-many” or “many-to-many”
association across levels.
27. A database model that relates information across multiple tables according to primary
and foreign keys.
In the relational model, each table (not surprisingly called a relation) is linked to
each other table via predetermined keys (Date 1995). The primary key29 represents the
attribute (column) whose value uniquely identifies a particular record (row) in the
relation (table). The primary key may not contain missing values as multiple
missing values would represent nonunique entities that violate the basic rule of the
primary key. The primary key corresponds to an identical attribute in a secondary
table (and possibly third, fourth, fifth, etc.) called a foreign key30. This results in all
the information in the first table being directly related to the information in the
second table via the primary and foreign keys, hence the term “relational” DBMS.
With these links in place, tables within the database can be kept very simple,
resulting in minimal computation time and file complexity. This process can be
repeated over many tables as long as each contains a foreign key that corresponds
to another table’s primary key.
Date, C. 1995. An Introduction to Database Systems. Reading, MA: Addison-Wesley.
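The primary key and foreign key mechanism described above can be sketched with Python's built-in sqlite3 module. The table names (parcels, owners), column names, and sample records are hypothetical and chosen only for illustration; a production GIS would normally manage these links through its own DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# parcel_id is the primary key: it uniquely identifies each parcel record.
conn.execute("""
    CREATE TABLE parcels (
        parcel_id INTEGER PRIMARY KEY,
        land_use  TEXT NOT NULL
    )
""")
# owners.parcel_id is a foreign key that must match a parcels.parcel_id value.
conn.execute("""
    CREATE TABLE owners (
        owner_id  INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        parcel_id INTEGER NOT NULL REFERENCES parcels (parcel_id)
    )
""")

conn.executemany("INSERT INTO parcels VALUES (?, ?)",
                 [(1, "residential"), (2, "commercial")])
conn.executemany("INSERT INTO owners VALUES (?, ?, ?)",
                 [(10, "A. Rivera", 1), (11, "B. Chen", 2), (12, "C. Okafor", 2)])

# The two simple tables are related through the shared key at query time.
rows = conn.execute("""
    SELECT owners.name, parcels.land_use
    FROM owners
    JOIN parcels ON owners.parcel_id = parcels.parcel_id
    ORDER BY owners.owner_id
""").fetchall()

# A repeated primary key value is rejected because it would not be unique.
try:
    conn.execute("INSERT INTO parcels VALUES (1, 'industrial')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# A foreign key value with no matching primary key is rejected as well.
try:
    conn.execute("INSERT INTO owners VALUES (13, 'D. Silva', 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(rows, duplicate_rejected, orphan_rejected)
```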
The relational model has two primary advantages over the other database models
described earlier. First, each table can now be separately prepared, maintained, and
edited. This is particularly useful when one considers the potentially huge size of
many of today’s modern databases. Second, the tables may be maintained
separately until the need for a particular query or analysis calls for the tables to be
related. This creates a large degree of efficiency for processing of information
within a given database.
28. A software package that records information in such a way that data can be accessed
without reorganization of the tables.
29. The attribute whose value uniquely identifies a particular record in an attribute table.
30. The attribute that corresponds to a primary key in an associated table.
31. The first stage in the normalization of a relational database in which repeating
groups and attributes are eliminated by placing them into separate tables connected via
primary keys and foreign keys.
It may become apparent to the reader that there is great potential for redundancy
in this model as each table must contain an attribute that corresponds to an
attribute in every other related table. Therefore, redundancy must actively be
monitored and managed in an RDBMS. To accomplish this, a set of rules called
normal forms has been developed (Codd 1970). There are three basic normal forms.
The first normal form31 (Figure 5.7 "First Normal Form Violation (above) and Fix
(below)") refers to five conditions that must be met (Date 1995). They are as follows:
Figure 5.7 First Normal Form Violation (above) and Fix (below)
The second normal form32 states that any column that is not a primary key must
be dependent on the primary key. This reduces redundancy by eliminating the
potential for multiple primary keys throughout multiple tables. This step often
involves the creation of new tables to maintain normalization.
Figure 5.8 Second Normal Form Violation (above) and Fix (below)
The third normal form33 states that all nonprimary keys must depend on the
primary key, while the primary key remains independent of all nonprimary keys.
This form was wittily summed up by Kent (1983), who quipped that all nonprimary keys
“must provide a fact about the key, the whole key, and nothing but the key.”
Echoing this quote is the rejoinder: “so help me Codd” (personal communication
with Foresman 1989).
Kent, W. 1983. “A Simple Guide to Five Normal Forms in Relational Database Theory.”
Communications of the Association for Computing Machinery 26 (2): 120–25.
Figure 5.9 Third Normal Form Violation (above) and Fix (below)
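As a rough illustration of the kind of fix the normal forms call for, the sketch below splits a flat table in which one nonkey attribute (a zone description) depends on another nonkey attribute (a zone code) rather than on the primary key. The table contents and field names are hypothetical and are not taken from the figures in this chapter.

```python
# Flat table: zone_description is repeated for every parcel in the same zone,
# because it depends on zone_code rather than on the primary key parcel_id.
flat_rows = [
    {"parcel_id": 1, "zone_code": "R1", "zone_description": "Single-family residential"},
    {"parcel_id": 2, "zone_code": "C2", "zone_description": "General commercial"},
    {"parcel_id": 3, "zone_code": "R1", "zone_description": "Single-family residential"},
]

# Fix: keep only facts about the parcel in the parcel table...
parcels = [{"parcel_id": r["parcel_id"], "zone_code": r["zone_code"]} for r in flat_rows]

# ...and move facts about the zone into their own table keyed by zone_code.
zones = {r["zone_code"]: r["zone_description"] for r in flat_rows}

# The original view can still be rebuilt on demand through the shared key.
rebuilt = [{**p, "zone_description": zones[p["zone_code"]]} for p in parcels]

print(len(zones), "zone descriptions stored instead of", len(flat_rows))
```

Each zone description is now stored once, so correcting it requires a single edit rather than one edit per parcel.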
Alternatively, the relate35 operation temporarily associates two map layers or
tables while keeping them physically separate. Relates are bidirectional, so data can
be accessed from one of the tables by selecting records in the other table. The
relate operation also allows for the association of three or more tables, if necessary.
35. An operation that temporarily associates two attribute tables through the use of an
attribute or field that is common to both tables while keeping the tables physically
separate.
Sometimes it can be unclear as to which operation one should use. As a general rule,
joins are most suitable for instances involving one-to-one or many-to-one
relationships. Joins are also advantageous due to the fact that the data from the two
tables are readily observable in the single output table. Relates, on the
other hand, are suitable for all table relationships (one-to-one, one-to-many,
many-to-one, and many-to-many); however, they can slow down computer access time if
the tables are particularly large or spread out over remote locations.
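The distinction between the two operations can be sketched in plain Python. This is only an analogy under assumed data: the tract identifiers, field names, and the helper surveys_for are hypothetical, and GIS packages implement joins and relates through their own interfaces.

```python
from collections import defaultdict

# Attribute table of a hypothetical census tract layer.
tracts = [
    {"tract_id": "T1", "name": "Riverside"},
    {"tract_id": "T2", "name": "Hilltop"},
]

# One row per tract: a one-to-one relationship, well suited to a join.
population = {"T1": 4200, "T2": 3100}

# Several rows per tract: a one-to-many relationship, better suited to a relate.
surveys = [
    {"tract_id": "T1", "year": 2000},
    {"tract_id": "T1", "year": 2010},
    {"tract_id": "T2", "year": 2010},
]

# Join: the matching attributes are appended, producing a single output table.
joined = [{**t, "population": population[t["tract_id"]]} for t in tracts]

# Relate: the tables stay separate; an index on the common field lets records
# selected in one table retrieve their matches in the other.
related = defaultdict(list)
for survey in surveys:
    related[survey["tract_id"]].append(survey)


def surveys_for(tract_id: str) -> list:
    """Return the survey records associated with the selected tract."""
    return related.get(tract_id, [])


print(joined)
print(surveys_for("T1"))
```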
KEY TAKEAWAYS
EXERCISE
LEARNING OBJECTIVE
Geospatial data are stored in many different file formats. Each geographic
information system (GIS) software package, and each version of these software
packages, supports different formats. This is true for both vector and raster data.
Although several of the more common file formats are summarized here, many
other formats exist for use in various GIS programs.
The most common vector file format is the shapefile36. Shapefiles, developed by
ESRI in the early 1990s for use with the dBASE III database management software
package in ArcView 2, are simple, nontopological files developed to store the
geometric location and attribute information of geographic features. Shapefiles are
incapable of storing null values, as well as annotations or network features. Field
names within the attribute table are limited to ten characters, and each shapefile
can represent only point, line, or polygon feature sets. Supported data types are
limited to floating point, integer, date, and text. Shapefiles are supported by almost
all commercial and open-source GIS software.
AIN and AIH Attribute information for active fields in the table
The earliest vector format file for use in GIS software packages, which is still in use
today, is the ArcInfo coverage37. This georelational file format supports multiple
feature types (e.g., points, lines, polygons, annotations) while also storing the
topological information associated with those features. Attribute data are stored as
multiple files in a separate directory labeled “Info.” Due to its creation in an
MS-DOS environment, these files maintain strict naming conventions. File names
cannot be longer than thirteen characters, cannot contain spaces, cannot start with
a number, and must be completely in lowercase. Coverages cannot be edited in
ArcGIS 9.x or later versions of ESRI’s software package.
The US Census Bureau maintains a specific type of shapefile referred to as TIGER or
TIGER/Line (Topologically Integrated Geographic Encoding and Referencing
system)38. Although these open-source files do not contain actual census
information, they map features such as census tracts, roads, railroads, buildings,
rivers, and other features that support and improve the Bureau’s ability to collect
census information. TIGER/Line shapefiles, first released in 1990, are topologically
explicit and are linked to the Census Bureau’s Master Address File (MAF), therefore
enabling the geocoding of street addresses. These files are free to the public and
can be freely downloaded from private vendors that support the format.
37. A georelational file format developed by ESRI that supports multiple feature types
(e.g., points, lines, polygons, annotations) while also storing the topological
information associated with those features.
38. A vector file format developed by the US Census Bureau including map features such as
census tracts, roads, railroads, buildings, rivers, and other features that support and
improve the bureau’s ability to collect census information.
Vector data files can also be structured to represent surface elevation information.
A TIN (Triangulated Irregular Network)41 is an open-source vector data structure
that uses contiguous, nonoverlapping triangles to represent geographic surfaces
(Figure 5.10 "Triangulated Irregular Network (TIN)"). Whereas the raster depiction
of a surface represents elevation as an average value over the spatial extent of the
individual pixel (see Section 5.3.2 "Raster File Formats"), the TIN data structure
models each vertex of the triangle as an exact elevation value at a specific point on
the earth. The arcs between vertices approximate the elevation between them.
These arcs are then aggregated into triangles from which
information on elevation, slope, aspect, and surface area can be derived across the
entire extent of the model’s space. Note that the term “irregular” in the name of the
data model refers to the fact that the vertices are typically laid out in a scattered
fashion.
39. A vector file format developed by Autodesk to allow interchange between
engineering-based CAD (computer-aided design) software and other mapping software
packages.
40. The vector file format developed by the USGS that maintains information on physical
and cultural features across the United States.
The use of TINs confers certain advantages over raster-based elevation models (see
Section 5.3.2 "Raster File Formats"). First, linear topographic features are very
accurately represented relative to their raster counterpart. Second, a comparatively
small number of data points are needed to represent a surface, so file sizes are
typically much smaller. This is particularly true as vertices can be clustered in areas
where relief is complex and can be sparse in areas where relief is simple. Third,
specific elevation data can be incorporated into the data model in a post hoc
fashion via the placement of additional vertices if the original is deemed
insufficient or inadequate. Finally, certain spatial statistics can be calculated that
cannot be obtained when using a raster-based elevation model, such as flood plain
delineation, storage capacity curves for reservoirs, and time-area curves for
hydrographs.
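To make concrete how slope and surface area can be derived from a TIN, the sketch below computes both for a single triangular facet from its three vertices. It assumes each vertex is an (x, y, z) tuple in one consistent linear unit; the function name facet_properties and the sample coordinates are hypothetical, and aspect is omitted for brevity.

```python
import math


def facet_properties(p1, p2, p3):
    """Return (planimetric_area, surface_area, slope_degrees) for one TIN triangle.

    Each point is an (x, y, z) tuple; all coordinates share one linear unit.
    """
    # Two edge vectors that span the triangle.
    ux, uy, uz = (p2[i] - p1[i] for i in range(3))
    vx, vy, vz = (p3[i] - p1[i] for i in range(3))

    # The cross product of the edges is a vector normal to the facet.
    nx = uy * vz - uz * vy
    ny = uz * vx - ux * vz
    nz = ux * vy - uy * vx

    # Half the length of the normal is the true (3-D) area of the triangle;
    # half of its vertical component is the area as seen on a flat map.
    surface_area = 0.5 * math.sqrt(nx * nx + ny * ny + nz * nz)
    planimetric_area = 0.5 * abs(nz)

    # Slope is the angle between the facet normal and the vertical axis.
    slope_degrees = math.degrees(math.atan2(math.hypot(nx, ny), abs(nz)))
    return planimetric_area, surface_area, slope_degrees


# Hypothetical facet rising 10 units over a horizontal run of 10 units.
flat_area, true_area, slope = facet_properties((0, 0, 0), (10, 0, 0), (0, 10, 10))
print(flat_area, round(true_area, 2), round(slope, 1))
```

Summing these values over every triangle gives surface-wide totals, which is why a TIN needs only the vertex coordinates to support such calculations.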
A multitude of raster file format types are available for use in GIS. The selection of
raster formats has dramatically increased with the widespread availability of
imagery from digital cameras, video recorders, satellites, and so forth. Raster
imagery is typically 8-bit (256 colors) or 24-bit (16 million colors). Due to ongoing
technological advancements, raster image file sizes have been getting larger and
larger. To deal with this potential constraint, two types of file compression are
commonly used: lossless and lossy. Lossless compression42 reduces file size
without decreasing image quality. Lossy compression43 attempts to exploit
limitations of the human eye by removing information from the image that cannot
be sensed. As you may guess, lossy compression results in smaller file sizes than
lossless compression.
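The difference between the two approaches can be sketched with Python's standard zlib module. The pixel values below are synthetic, and the "lossy" step is a deliberately crude stand-in (rounding each 8-bit value down to one of 16 levels) rather than the transform-based method used by formats such as JPEG.

```python
import zlib

# A hypothetical 64 x 64, 8-bit raster band flattened to bytes (values 0-255).
pixels = bytes((x * 7 + y * 3) % 256 for y in range(64) for x in range(64))

# Lossless: decompressing returns exactly the original pixel values.
lossless = zlib.compress(pixels)
restored = zlib.decompress(lossless)

# Lossy stand-in: discard fine detail by quantizing to 16 gray levels, then
# compress. The discarded detail cannot be recovered afterward.
quantized = bytes((p // 16) * 16 for p in pixels)
lossy = zlib.compress(quantized)
max_error = max(abs(a - b) for a, b in zip(pixels, quantized))

print("original bytes:  ", len(pixels))
print("lossless bytes:  ", len(lossless), "| exact copy:", restored == pixels)
print("lossy bytes:     ", len(lossy), "| largest pixel error:", max_error)
```

How much smaller the lossy result is depends heavily on the image content, so the printed sizes should be read as an illustration rather than a benchmark.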
Among the most common raster files used on the web are the JPEG, TIFF, and PNG
formats, all of which are open source and can be used with most GIS software
packages. The JPEG (Joint Photographic Experts Group)44 and TIFF (Tagged
Image File Format)45 raster formats are most frequently used by digital cameras to
store 8-bit values for each of the red, green, and blue color channels (and sometimes
16-bit colors, in the case of TIFF images). JPEGs support lossy compression, while
TIFFs can be either lossy or lossless. Unlike JPEG, TIFF images can be saved in either
RGB or CMYK color spaces. PNG (Portable Network Graphics)46 files are 24-bit
images that support lossless compression. PNG files are designed for
efficient viewing in web-based browsers such as Internet Explorer, Mozilla Firefox,
Netscape, and Safari.
42. A method to reduce the file size of an image without decreasing quality.
43. A method to reduce the file size of an image by exploiting limitations of the human
eye through removal of information that cannot be sensed.
44. Raster image format that stores 8-bit values for each of the red, green, and blue
color channels.
An example of a raster file format with explicit georeferencing information is the
proprietary MrSID (Multiresolution Seamless Image Database)48 format. This
lossless compression format was developed by LizardTech, Inc., for use with large
aerial photographs or satellite images, whereby portions of a compressed image can
be viewed quickly without having to decompress the entire file. The MrSID format
is frequently used for visualizing orthophotos.
Like MrSID, the proprietary ECW (Enhanced Compression Wavelet)49 format also
includes georeferencing information within the file structure. This lossy
compression format was developed by Earth Resource Mapping and supports up to
255 layers of image information. Due to the potentially huge file sizes associated
with an image that supports so many layers, ECW files represent an excellent option
for performing rapid analysis on large images while using a relatively small amount
of the computer’s RAM (Random Access Memory), thus accelerating computation
speed.
48. A raster format developed by LizardTech, Inc., for use with large aerial photographs
or satellite images, whereby portions of a compressed image can be viewed quickly without
having to decompress the entire file.
49. A raster file format developed by Earth Resource Mapping that supports up to 255
layers of image information and includes georeferencing information within the file
structure.
Like the open-source, vector-based DLG, DRGs (Digital Raster Graphics)50 are
scanned versions of USGS topographic maps and include all of the collar material
from the originals. The geospatial information found within the image’s neatline is
georeferenced to the surface of the earth and fit to the Universal Transverse
Mercator (UTM) projection, most likely based on NAD27 (NAD stands for North
American Datum). These graphics are scanned at a minimum of 250 dpi (dots per inch)
and therefore have a spatial resolution of approximately 2.4 meters. DRGs contain up
to thirteen colors and therefore may look slightly different from the originals.
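The stated resolution follows from the scan density and the map scale. As a short worked example, and assuming the scanned sheet is a 1:24,000-scale quadrangle (an assumption made here; the scale is not given in the passage above), each scanned dot covers 1/250 of an inch on paper, which corresponds to roughly 2.44 meters on the ground.

```python
METERS_PER_INCH = 0.0254


def ground_resolution_m(scale_denominator: int, dpi: int) -> float:
    """Ground distance in meters covered by one scanned dot of a paper map."""
    paper_size_of_dot_m = METERS_PER_INCH / dpi
    return paper_size_of_dot_m * scale_denominator


# Assumed 1:24,000 map scanned at the 250 dpi minimum mentioned above.
resolution = ground_resolution_m(24_000, 250)
print(round(resolution, 2), "meters per pixel")
```

The same function shows why smaller-scale maps scanned at the same density yield coarser pixels: the scale denominator multiplies the ground distance that each dot represents.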
Like the TIN vector format, some raster file formats are developed explicitly for
modeling elevation. These include the USGS DEM, USGS SDTS, and DTED file
formats. The USGS DEM (US Geological Survey Digital Elevation Model)51 is a
popular file format due to widespread availability, the simplicity of the model, and
the extensive software support for the format. Each pixel value in these grid-based
DEMs denotes spot elevations on the ground, usually in feet or meters. Care must be
taken when using grid-based DEMs due to the enormous volume of data that
accompanies these files as the spatial extent covered in the image begins to
increase. DEMs are referred to as digital terrain models (DTMs)52 when they
represent a simple, bare-earth model and as digital surface models (DSMs)53 when
they include the heights of landscape features such as buildings and trees (Figure
5.11 "Digital Surface Model (left) and Digital Terrain Model (right)").
Figure 5.11 Digital Surface Model (left) and Digital Terrain Model (right)
USGS DEMs can be classified into one of four levels of quality (labeled 1 to 4)
depending on their source data and resolution. This source data can be 1:24,000-,
1:63,360-, or 1:250,000-scale topographic quadrangles. The DEM format is a single
file of ASCII text composed of three data blocks: A, B, and C. The A block contains
header information such as data origin, type, and measurement systems. The B
block contains contiguous elevation data described as a six-character integer. The C
block contains trailer information such as root-mean-square (RMS) error of the
scene. The USGS DEM format has recently been succeeded by the USGS SDTS
(Spatial Data Transfer Standard) DEM54 format. The SDTS format (USGS 2010) was
specifically developed as a distribution format for transferring data from one
computer to another with zero data loss.
USGS. 2010. “What is SDTS?” USGS, http://mcmcweb.er.usgs.gov/sdts/whatsdts.html.
The DTED (Digital Terrain Elevation Data)55 format is another elevation-specific
raster file format. It was developed in the 1970s for military purposes such as
line-of-sight analysis, 3-D visualization, and mission planning. The DTED format
maintains three levels of data over five different latitudinal zones. Level 0 data has a
resolution of approximately 900 meters; Level 1 data has a resolution of
approximately 90 meters; and Level 2 data has a resolution of approximately 30
meters.
54. A distribution format for transferring USGS DEMs from one computer to another with
zero data loss.
There are three different types of geodatabases. The personal geodatabase57 was
developed for single-user editing, whereby two editors cannot work on the same
geodatabase at a given time. The personal geodatabase employs the Microsoft
Access DBMS file format and maintains a size limit of 2 gigabytes per file, although
it has been noted that performance begins to degrade after file size approaches 250
megabytes. The personal geodatabase is currently being phased out by ESRI and is
therefore not used for new data creation.
The file geodatabase58 similarly allows only single-user editing, but this restriction
applies only to unique feature datasets within a geodatabase. The file geodatabase
incorporates new tools such as domains (rules applied to attributes), subtypes
(groups of objects within a feature class or table), and split/merge policies (rules to
control and define the output of split and merge operations). This format stores
information as binary files with a size limit of 1 terabyte and has been noted to
perform and scale much more efficiently than the personal geodatabase
(approximately one-third of the feature geometry storage required by shapefiles
and personal geodatabases). File geodatabases are not tied to any specific relational
database management system and can be employed on both Windows and UNIX
platforms. Finally, file geodatabases can be compressed to read-only formats that
further reduce file size without subsequently reducing performance.
56. A recently developed, proprietary ESRI file format that supports both vector and
raster feature datasets (e.g., points, lines, polygons, annotation, JPEG, TIFF) within
a single file.
57. A type of geodatabase developed for single-user editing, whereby two editors cannot
work on the same geodatabase at a given time.
58. A type of geodatabase that allows only single-user editing for unique feature
datasets within a geodatabase.
The third hybrid ESRI format is the ArcSDE geodatabase59, which allows multiple
editors to simultaneously work on feature datasets within a single geodatabase
(a.k.a. versioning). Like the file geodatabase, this format can be employed on both
Windows and UNIX platforms. File size is limited to 4 gigabytes and its proprietary
nature requires an ArcInfo or ArcEditor license for use. The ArcSDE geodatabase is
implemented on the SQL Server Express software package, which is a free DBMS
platform developed by Microsoft.
59. A type of geodatabase developed to allow multiple editors to simultaneously work
on feature datasets within a single geodatabase.
60. A nonproprietary file format developed by Adobe Systems, Inc., that allows for the
representation of geometric entities such as points, lines, and polygons.
In addition to the geodatabase, Adobe Systems Incorporated’s geospatial PDF
(Portable Document Format)60 is an open, nonproprietary format that allows for the
representation of geometric entities such as points, lines, and polygons. Geospatial
PDFs can be used to find and mark coordinate pairs, measure distances, reproject
files, and georegister raster images. This format is particularly useful as the PDF is
Finally, Google Earth supports a new, open-source, hybrid file format referred to as
KML (Keyhole Markup Language)61. KML files associate points, lines, polygons,
images, 3-D models, and so forth, with a longitude and latitude value, as well as
other view information such as tilt, heading, and altitude. KMZ files, which are
zipped versions of KML files, are also commonly encountered.
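The relationship between KML and KMZ can be sketched in a few lines of Python. This is a minimal illustration, not a complete KML writer: the function names, the sample coordinates, and the placemark name are hypothetical, and the archive entry name doc.kml is assumed here as a common convention rather than something stated in this chapter.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by KML 2.2 documents.
KML_NS = "http://www.opengis.net/kml/2.2"


def build_kml(name, lon, lat, alt=0.0):
    """Return a KML document (as a string) holding one point placemark."""
    ET.register_namespace("", KML_NS)
    kml = ET.Element(f"{{{KML_NS}}}kml")
    placemark = ET.SubElement(kml, f"{{{KML_NS}}}Placemark")
    ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
    point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
    # KML writes coordinates in longitude,latitude,altitude order.
    ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat},{alt}"
    return ET.tostring(kml, encoding="unicode")


def build_kmz(kml_text):
    """Zip a KML string into KMZ bytes (entry name doc.kml is an assumption)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("doc.kml", kml_text)
    return buffer.getvalue()


# Hypothetical sample point.
kml_text = build_kml("Sample point", -118.2437, 34.0522)
kmz_bytes = build_kmz(kml_text)
```

Writing kmz_bytes to a file with a .kmz extension would give a compressed copy of the same placemark, which is the sense in which a KMZ is a zipped KML.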
KEY TAKEAWAYS
EXERCISES
1. If you were a city planner tasked with creating a GIS database for
mapping features throughout the city, would you prefer using a DLG or a
DRG? What are the advantages and disadvantages of using either of
these formats?
2. Search the web and create a list of URLs that contain working files for
each of the raster and vector formats discussed in this section.
LEARNING OBJECTIVE
Not all geospatial data are created equally. Data quality refers to the ability of a
given dataset to satisfy the objective for which it was created. With the voluminous
amounts of geospatial data being created and served to the cartographic
community, care must be taken by individual geographic information system (GIS)
users to ensure that the data employed for their project are suitable for the task at
hand.
Two primary attributes characterize data quality. Accuracy62 describes how close a
measurement is to its actual value and is often expressed as a probability (e.g., 80
percent of all points are within +/− 5 meters of their true locations). Precision63
refers to the variance of a value when repeated measurements are taken. A watch
may be correct to 1/1000th of a second (precise) but may be 30 minutes slow (not
accurate). As you can see in Figure 5.12 "Accuracy and Precision", the blue darts are
both precise and accurate, while the red darts are precise but inaccurate.
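The watch example can be made concrete with a short calculation. The sketch below assumes that accuracy is summarized as the bias of the mean reading from the true value and precision as the standard deviation of repeated readings; these are common summaries rather than the only possible ones, and the readings themselves are invented for illustration.

```python
import statistics


def describe_measurements(measurements, true_value):
    """Return (bias, spread) for repeated measurements of one quantity.

    bias   : mean measurement minus the true value (small = accurate)
    spread : population standard deviation (small = precise)
    """
    bias = statistics.mean(measurements) - true_value
    spread = statistics.pstdev(measurements)
    return bias, spread


# Hypothetical watch readings, in seconds, when the true time is 3600 s.
# The watch agrees with itself to about 1/1000th of a second but runs
# 30 minutes (1800 s) slow.
watch_readings = [1800.001, 1800.000, 1799.999, 1800.000]
bias, spread = describe_measurements(watch_readings, 3600.0)
```

Here the spread is well under a hundredth of a second (precise) while the bias is about 1800 seconds (not accurate), matching the red darts in the figure.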
Several types of error can arise when accuracy and/or precision requirements are
not met during data capture and creation. Positional accuracy64 is the probability
of a feature being within +/− units of either its true location on earth (absolute
positional accuracy) or its location in relation to other mapped features (relative
positional accuracy). For example, it could be said that a particular mapping effort
may result in 95 percent of trees being mapped to within +/− 5 feet of their true
location (absolute), or 95 percent of trees being mapped to within +/− 5 feet of their
location as observed on a digital ortho quarter quadrangle (relative).
64. The probability of a feature being within +/− units of either its true location on
earth (absolute positional accuracy) or its location in relation to other mapped
features (relative positional accuracy).
Speaking about absolute positional error does raise the question, however, of what
exactly is the true location of an object. As discussed in Chapter 2 "Map Anatomy",
differing conceptions of the earth’s shape have led to a plethora of projections,
datums, and spheroids, each attempting to reduce positional errors for particular
locations on the earth. To begin addressing this unanswerable question, the US
National Map Accuracy Standard (or NMAS) suggests that to meet horizontal
accuracy requirements, a paper map is expected to have no more than 10 percent of
measurable points fall outside the accuracy values range shown in Figure 5.13
"Relation between Positional Error and Scale". Similarly, to meet vertical accuracy
requirements, no more than 10 percent of elevations on a contour map shall be in
error by more than one-half the contour interval. Any map that does not meet these
horizontal and vertical accuracy standards will be deemed unacceptable for
publication.
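The 10 percent rule lends itself to a simple check. The following sketch is illustrative only: the function names and sample errors are hypothetical, and the horizontal tolerances (1/30 inch on the map for scales larger than 1:20,000, and 1/50 inch otherwise) are taken from the published NMAS as commonly cited, not from the figure referenced above, so they should be verified against the standard before use.

```python
def nmas_ground_tolerance_ft(scale_denominator):
    """Horizontal tolerance on the ground, in feet, for a 1:scale_denominator map."""
    # Assumed NMAS thresholds: 1/30 inch on the map for scales larger
    # than 1:20,000, and 1/50 inch for scales of 1:20,000 or smaller.
    divisor = 30 if scale_denominator < 20000 else 50
    map_inches_on_ground = scale_denominator / divisor
    return map_inches_on_ground / 12  # inches to feet


def meets_nmas_horizontal(errors_ft, scale_denominator):
    """True if no more than 10 percent of tested points exceed the tolerance."""
    tolerance = nmas_ground_tolerance_ft(scale_denominator)
    outside = sum(1 for error in errors_ft if abs(error) > tolerance)
    return outside / len(errors_ft) <= 0.10


def meets_nmas_vertical(errors, contour_interval):
    """True if no more than 10 percent of elevations err by more than half the interval."""
    outside = sum(1 for error in errors if abs(error) > contour_interval / 2)
    return outside / len(errors) <= 0.10


# Hypothetical horizontal errors (feet) for ten tested points on a 1:24,000 map.
sample_errors_ft = [5, 12, 38, 41, 9, 15, 22, 3, 30, 8]
passes = meets_nmas_horizontal(sample_errors_ft, 24000)
```

Under these assumptions the tolerance at 1:24,000 works out to 40 feet on the ground, and the sample passes because only one of the ten points exceeds it.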
Positional errors arise via multiple sources. The process of digitizing paper maps
commonly introduces such inaccuracies. Errors can arise while registering the map
on the digitizing board. A paper map can shrink, stretch, or tear over time,
changing the dimensions of the scene. Input errors created from hastily digitized
points are common. Finally, converting between coordinate systems and
transforming between datums may also introduce errors to the dataset.
The root-mean-square (RMS) error is frequently used to evaluate the degree of
inaccuracy in a digitized map. This statistic measures the deviation between the
actual (true) and estimated (digitized) locations of the control points. Figure 5.14
"Potential Digitization Error" illustrates the inaccuracies of lines representing soil
types that result from input control point location errors. By applying an RMS error
calculation to the dataset, one could determine the accuracy of the digitized map
and thus determine its suitability for inclusion in a given study.
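As a rough illustration of the RMS calculation described above, the sketch below takes the square root of the mean squared distance between each true control point and its digitized counterpart. The function name and the control point coordinates are hypothetical, and planar (projected) coordinates in consistent units are assumed.

```python
import math


def rms_error(true_points, digitized_points):
    """Root-mean-square distance between paired (x, y) control points."""
    if not true_points or len(true_points) != len(digitized_points):
        raise ValueError("Point lists must be non-empty and of equal length.")
    squared_distances = [
        (xd - xt) ** 2 + (yd - yt) ** 2
        for (xt, yt), (xd, yd) in zip(true_points, digitized_points)
    ]
    return math.sqrt(sum(squared_distances) / len(squared_distances))


# Hypothetical control points: one corner was digitized 5 units off.
true_points = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
digitized_points = [(3.0, 4.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
rms = rms_error(true_points, digitized_points)
```

A single 5-unit displacement among four control points yields an RMS error of 2.5 units. Whether that value is acceptable depends on the accuracy requirement of the study in question.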
Positional errors can also arise when features to be mapped are inherently vague.
Take the example of a wetland (Figure 5.15 "Defining a Wetland Boundary"). What
defines a wetland boundary? Wetlands are determined by a combination of
hydrologic, vegetative, and edaphic factors. Although the US Army Corps of
Engineers is currently responsible for defining the boundary of wetlands
throughout the country, this task is not as simple as it may seem. In particular,
regional differences in the characteristics of a wetland make delineating these
features particularly troublesome. For example, the definition of a wetland
boundary for the riverine wetlands in the eastern United States, where water is
abundant, is often useless when delineating similar types of wetlands in the desert
southwest United States. Indeed, the complexity and confusion associated with the
conception of what a “wetland” is may result in difficulties defining the feature in
the field, which subsequently leads to positional accuracy errors in the GIS
database.
many datasets undergo a regular data update regimen. For example, the California
Department of Fish and Game updates its sensitive species databases on a near
monthly basis as new findings are continually being made. It is important to ensure
that, as an end-user, you are constantly using the most up-to-date data for your GIS
application.
KEY TAKEAWAYS
EXERCISES