lOMoARcPSD|13114020
Chapter 8 Data Structures and Caatts for Data Extraction
Bachelor of science in accountancy (Philippine International College)
Scan to open on Studocu
Studocu is not sponsored or endorsed by any college or university
Downloaded by pink pink (
[email protected])
lOMoARcPSD|13114020
Chapter 8: Data Structures and CAATTs for Data Extraction
DATA STRUCTURE
Two (2) fundamental components:
1. Organization: Arrangement of records on storage devices, either sequential or random.
- Sequential Files: Contiguous storage, efficient for large file processing but
inefficient for small portions or direct access.
- Random Files: Stored without physical order, lacking efficiency in direct access.
2. Access Method: Technique used to locate records and to navigate through the database
or file.
FLAT FILE STRUCTURES
- End users in this environment own their data files rather than share them with other users.
- Data files are structured, formatted, and arranged to suit the specific needs of the owner
or primary user.
Sequential Structure
- All records in contiguous storage spaces in specified.
- sequence (key field).
- Sequential files are simple & easy to process.
- The application starts at the beginning of the file and processes each record in
sequence.
Indexed Random File
- Records created without regard to physical proximity to other related records.
- Physical organization of the index itself may be either sequential (by key value) or
random.
Virtual Storage Access Method (VSAM)
- Used for very large files that require routine batch processing and a moderate degree
of individual record processing.
Three physical components:
a) indexes
b) prime data storage area
c) overflow area.
Hashing Structure: Fast access but inefficient storage and collision issues.
Pointer Structure: Uses address pointers to link records, offering fast access but with
potential issues if a record moves or a pointer is lost.
lOMoARcPSD|13114020
Three (3) Types of Pointers:
a) Physical address pointer contains actual disk storage location which allows direct
access to the record.
b) Relative address pointer contains relative position of a record in the file which must
be manipulated to convert to physical address.
c) Logical key pointer contains primary key of related record.
- Key value is converted by hashing to physical address
HIERARCHICAL & NETWORK DATABASE STRUCTURES
- Early models using flat-file techniques and proprietary structures for process integration
and data sharing.
RELATIONAL DATABASE STRUCTURE, CONCEPTS AND TERMINOLOGY
- Based on indexed sequential file structure, facilitates direct and batch access to records,
and supports relational algebra functions.
Three (3) Algebra Functions:
a) Restrict: Extracts specified rows from a specified table.
b) Project: Extracts specified attributes (columns) from a table to create a virtual table.
c) Join: Builds a new physical table from two tables consisting of all concatenated pairs
of rows, from each table.
*Note - concatenated means linked together in a series
Entity, Occurrence, and Attributes
- Entity – anything about which the organization wishes to capture data.
- Data model – blueprint for the physical database
- Entity relationship diagram – graphic representation of a data model
- Occurrence – number of records that pertain to a specific entity
- Attributes – data elements of an entity
Association and Cardinality
- Cardinality – degree of association between two entities
Zero or one (0,1),
One and only one (1,1),
Zero or many (0,M), and
One or many (1,M)
lOMoARcPSD|13114020
Anomalies, Structural Dependencies, and Data Normalization
Three (3) types of anomalies
a) Update anomaly results from data redundancy in an unnormalized table. Modification
on an attribute must be made in each of the rows in which the attribute appears.
b) Insertion anomaly occurs when a new item cannot be added to the table because there
is no association with the primary key.
c) Deletion anomaly involves unintentional deletion of data. When an attribute item
used by only one entity is deleted, all information about that attribute item is lost.
Designing Relational Databases
Six phases of database design:
1. Identify entities
2. Construct a data model showing entity associations
3. Add primary keys and attributes to the model
4. Normalize the data model and add foreign keys
5. Construct the physical database
6. Prepare the user reviews
Embedded Audit Module
- Also known as continuous auditing, is to identify important transactions while they
are being process and extract copies of them in real time.
- An EAM is a specially programmed module embedded in a host application to
capture predetermined transaction types for subsequent analysis.
GENERALIZED AUDIT SOFTWARE (GAS)
- GAS allows auditors to access electronically coded data files and perform various
operations on their contents.
Common uses for GAS:
- Footing and balancing entire files or selected items.
- Selecting and reporting detailed file data.
- Selecting stratified statistical samples from data files.
- Formatting results of tests into audit reports.
- Printing confirmations in standard or special wording.
- Screening and selectively including or excluding data items.
- Comparing multiple files for identifying differences.
- Recalculating data fields.
lOMoARcPSD|13114020
ACL SOFTWARE
- ACL (audit command language) is the leader in the industry. ACL was designed as a
meta-language for auditors to access data stored in various digital formats and to test
them comprehensively.
Data Definition
- The data definition screen allows the auditor to define important characteristics of the
source file, including overall record length, the name given to each field, the type of
data (i.e., numeric or character) contained in each field, and the starting point and
length of each field in the file.
Customizing a View
- A view is simply a way of looking at data in a file; auditors seldom need to use all the
data contained in a file. ACL allows the auditor to customize the original view
created during data definition to one that better meets his or her audit needs.
Filtering Data
- ACL provides powerful options for filtering data that support various audit tests.
Filters are expressions that search for records that meet the filter criteria. ACL’s
expression builder allows the auditor to use logical operators such as AND, OR, , ,
NOT and others to define and test conditions of any complexity and to process only
those records that match specific conditions.
Stratifying Data
- ACL’s stratification feature allows the auditor to view the distribution of records that
fall into specified strata. Data can be stratified on any numeric field such as sales
price, unit cost, quantity sold, and so on. The data are summarized and classified by
strata, which can be equal in size (called intervals) or vary in size (called free).
Statistical Analysis
- ACL offers many sampling methods for statistical analysis. Two of the most
frequently used are record sampling and monetary unit sampling (MUS), each
permitting random and interval sampling.
lOMoARcPSD|13114020
I. MULTIPLE-CHOICE QUESTIONS
1. CIA 1186 III-33
In an inventory system on a database management system, one stored record contains part
number, part name, part color, and part weight. These individual items are called
a. fields.
b. stored files.
c. bytes.
d. occurrences.
2. CIA 586 III-31
The use of pointers can save time when sequentially updating a
a. master file.
b. database management system.
c. batch file.
d. random file.
3. It is appropriate to use a sequential file structure when
a. records are routinely inserted.
b. a large portion of the file will be processed in one operation.
c. records need to be scanned using secondary keys.
d. single records need to be retrieved.
4. Which statement is not correct?
a. The sequential file structure is appropriate for payroll records.
b. An advantage of a sequential index is that it can be searched rapidly.
c. The index sequential access method performs record insertion operations efficiently.
d. The principal advantage of the hashing structure is speed of access.
5. Which of the following statements is not true?
a. Indexed random files are dispersed throughout the storage device without regard for
physical proximity with related records.
b. Indexed random files use disk storage space efficiently.
c. Indexed random files are efficient when processing a large portion of a file at one time.
d. Indexed random files are easy to maintain in terms of adding records.
6. Which statement is not correct? The indexed sequential access method
a. is used for very large files that need both direct access and batch processing.
b. may use an overflow area for records.
c. provides an exact physical address for each record.
d. is appropriate for files that require few insertions or deletions.
7. Which statement is true about a hashing structure?
a. The same address could be calculated for two records.
b. Storage space is used efficiently.
c. Records cannot be accessed rapidly.
d. A separate index is required.
8. In a hashing structure,
a. two records can be stored at the same address.
b. pointers are used to indicate the location of all records.
lOMoARcPSD|13114020
c. pointers are used to indicate the location of a record with the same address as another
record.
d. all locations on the disk are used for record storage.
9. Pointers can be used for all of the following except
a. to locate the subschema address of the record.
b. to locate the physical address of the record.
c. to locate the relative address of the record.
d. to locate the logical key of the record.
10. An advantage of a physical address pointer is that
a. it points directly to the actual disk storage location.
b. it is easily recovered if it is inadvertently lost.
c. it remains unchanged when disks are reorganized.
d. all of the above are advantages of the physical address pointer.
11. Pointers are used
a. to link records within a file.
b. to link records between files.
c. to identify records stored in overflow.
d. all of the above.
12. In a hierarchical model,
a. links between related records are implicit.
b. the way to access data is by following a pre-defined data path.
c. an owner (parent) record may own just one member (child) record.
d. a member (child) record may have more than one owner (parent).
13. In a network model,
a. there is one predefined path to a particular record.
b. many-to-many relationships are supported in a simple network.
c. management can track and report information by one criterion only.
d. link files are used to connect records in different files.
14. Which term is not associated with the relational database model?
a. tuple
b. attribute
c. collision
d. relation
15. In the relational database model,
a. relationships are explicit.
b. the user perceives that files are linked using pointers.
c. data are represented on two-dimensional tables.
d. data are represented as a tree structure.
16. In the relational database model, all of the following are true except
a. data are presented to users as tables.
b. data can be extracted from specified rows from specified tables.
c. a new table can be built by joining two tables.
d. only one-to-many relationships can be supported.
17. a relational database,
a. the user’s view of the physical database is the same as the physical database.
b. users perceive that they are manipulating a single table.
lOMoARcPSD|13114020
c. a virtual table exists in the form of rows and columns of a table stored on the disk.
d. a programming language (COBOL) is used to create a user’s view of the database.
18. The update anomaly in unnormalized databases
a. occurs because of data redundancy.
b. complicates adding records to the database.
c. may result in the loss of important data.
d. often results in excessive record insertions.
19. The most serious problem with unnormalized databases is the
a. update anomaly.
b. insertion anomaly.
c. deletion anomaly.
d. none of the above.
20. The deletion anomaly in unnormalized databases
a. is easily detected by users.
b. may result in the loss of important data.
c. complicates adding records to the database.
d. requires the user to perform excessive updates.
21. The data attributes that a particular user has permission to access are defined by the
a. operating system view.
b. systems design view.
c. database schema.
d. user view.
e. application program.
22. Database entities
a. may contain zero or many occurrences.
b. are represented as verbs in an ER diagram.
c. may represent both physical assets and intangible phenomena.
d. are often defined by common attributes that also define other entities.
e. are unique to a specific user view.
23. A transitive dependency
a. is a database condition that is resolved through special monitoring software.
b. is a name given to one of the three anomalies that result from unnormalized database
tables.
c. can exist only in a table with a composite primary key.
d. cannot exist in tables that are normalized at the 2NF level.
e. is none of the above.
24. A partial dependency
a. is the result of simultaneous user requests for the same data in a partitioned database
environment.
b. is a name given to one of the three anomalies that result from unnormalized database
tables.
c. can exist only in a table with a composite primary key.
d. may exist in tables that are normalized at the 2NF level.
e. is none of the above.
25. Repeating group data
lOMoARcPSD|13114020
a. is a form of data redundancy common to replicated databases in a distributed database
environment.
b. is a name given to one of the three anomalies that result from unnormalized database
tables.
c. can exist only in a table with a composite primary key.
d. cannot exist in tables that are normalized at the 2NF level.
e. is none of the above.
26. The database model most likely to be used in the development of a modern (not legacy)
system is
a. hierarchical.
b. structured.
c. relational.
d. network.
e. navigational.
II. PROBLEMS
1. Access Methods
For each of the following file processing operations, indicate whether a sequential file,
indexed random file, indexed sequential access method (VSAM), hashing, or pointer
structure works the best. You may choose as many as you wish for each step. Also indicate
which would perform the least optimally.
a. Retrieve a record from the file based on its primary key value.
b. Update a record in the file.
c. Read a complete file of records.
d. Find the next record in a file.
e. Insert a record into a file.
f. Delete a record from a file.
g. Scan a file for records with secondary keys.
2. File Organization
For the following situations, indicate the most appropriate type of file organization. Explain
your choice.
a. A local utility company has 80,000 residential customers and 10,000 commercial
customers. The monthly billings are staggered throughout the month and, as a result, the
cash receipts are fairly uniform throughout the month. For 99 percent of all accounts, one
check per month is received. These receipts are recorded in a batch file, and the customer
account records are updated biweekly. In a typical month, customer inquiries are received
at the rate of about twenty per day.
b. A national credit card agency has 12 million customer accounts. On average, 30 million
purchases and 700,000 receipts of payments are processed per day. Additionally, the
customer support hot line provides information to approximately 150,000 credit card
holders and 30,000 merchants per day.
c. An airline reservation system assumes that the traveler knows the departing city. From
that point, fares and flight times are examined based on the destination. Once a flight is
identified as being acceptable to the traveler, then the availability is checked and, if
lOMoARcPSD|13114020
necessary, a seat is reserved. The volume of transactions exceeds one-half million per
day.
d. A library system stocks more than 2 million books and has 30,000 patrons. Each patron is
allowed to check out five books. On average, there are 1.3 copies of each title in the
library. Over 3,000 books are checked out each day, with approximately the same amount
being returned daily. The check-outs are posted immediately, as well as any returns of
overdue books by patrons who wish to pay their fines.
ANSWERS:
I. Multiple Choice Question
1. a 11. d 21. d
2. d 12. b 22. c
3. b 13. d 23. d
4. c 14. c 24. c
5. c 15. c 25. d
6. c 16. d 26. c
7. a 17. b
8. c 18. a
9. a 19. c
10. a 20. b
II. Problems
1. Access Methods
a. indexed sequential access method or indexed random; least optimal: sequential
b. indexed random or hashing; least optimal: sequential
c. sequential; least optimal: indexed random
d. sequential or pointer; least optimal: indexed random
e. indexed random or hashing; least optimal: sequential
f. indexed random or hashing; least optimal: sequential
g. indexed random; least optimal: hashing
2. File Organization
a. A sequential file could be used, but with only 8 updates per month, roughly one-eighth
of the files will be updated each run. That means for each run the other seven-eighths
of the file must be read and rewritten. Random access will aid the customer inquiry
response time. A hierarchical database would be appropriate because this is a one-to-
one relationship—each address has one resident in charge of utilities.
b. Random access for customer balances and payment information is crucial. A network
or relational database is necessary because this is a many-to-many relationship: many
purchases per customer, many merchants per customer, many customers per merchant.
c. Random access will be necessary for flight inquiries and updates throughout the day.
A navigational database would be appropriate. Only one direction needs to be
investigated. Most customers start with a departure city and then flights to the
destination city can be investigated. Many different destinations will exist for a given
departure city. Rarely will a customer wish to book a flight based on a destination if he
does not know from which city he will depart.
d. A random access storage device is necessary to access the patrons’ records quickly
when they check out books. A network or relational database will be necessary
because the data should be bidirectional. An investigation may need to be conducted
Downloaded by pink pink ([email protected])
to determine what books a patron has checked out or who has a certain book checked