Chapter # 1 (MCQs & Structured Questions)

The document contains multiple-choice questions and answers related to data categorization, focusing on structured, semi-structured, and unstructured data. It discusses various scenarios involving data integration, storage, and analysis in financial contexts, providing rationales for each answer. Additionally, it includes a scenario-based question about categorizing datasets from a digital banking project and highlights the advantages and challenges of managing different data types.

Uploaded by

iamali139eb
Copyright
© All Rights Reserved

Chapter # 1

Multiple Choice Questions


1) A bank’s analytics squad must reconcile daily ATM withdrawals (timestamped, numeric) along with
customer complaint voice notes about failed cash-outs. Which data pairing best describes the two
sources they must integrate?
A. Only structured data
B. Structured + semi-structured
C. Structured + unstructured
D. Semi-structured only
Answer: C
Rationale: ATM withdrawals are structured transactions; voice notes are unstructured audio.

2) Your CFO wants same-day IFRS packs auto-generated. Which data characteristic most enables this?
A. Human-readable tags
B. Rigid schema with defined fields
C. Multimedia richness
D. Lack of fixed schema
Answer: B
Rationale: Structured data’s predefined schema enables automated reporting.

3) An auditor correlates sudden weekend POS refunds (tables) with CCTV clips (mp4) to probe fraud.
Which tools/classes fit first vs. second?
A. SQL for POS; AI/Computer Vision for CCTV
B. SQL for both
C. NoSQL for POS; SQL for CCTV
D. NLP for POS; OCR for CCTV
Answer: A
Rationale: Structured POS tables → SQL; unstructured video → AI/CV.

4) A fintech stores mobile event logs as JSON documents to iterate experiments without schema
migrations. What type is this?
A. Structured
B. Semi-structured
C. Unstructured
D. Derived data
Answer: B
Rationale: JSON with keys/tags but no rigid schema = semi-structured.
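
A minimal sketch of why JSON events tolerate new fields without a schema migration. The event names, the `experiment` key, and the `parse_event` helper are illustrative assumptions, not taken from any real fintech system.

```python
import json

# Hypothetical mobile event logs: the second event carries an extra
# "experiment" key that the first one lacks.
raw_events = [
    '{"event": "login", "user_id": 101}',
    '{"event": "transfer", "user_id": 102, "experiment": "new_checkout"}',
]

def parse_event(line):
    """Parse one JSON event and fill in a default for the optional key."""
    record = json.loads(line)
    return {
        "event": record["event"],
        "user_id": record["user_id"],
        # .get() tolerates events written before the field existed
        "experiment": record.get("experiment", "none"),
    }

events = [parse_event(line) for line in raw_events]
for e in events:
    print(e)
```

Both events load with the same code even though their keys differ, which is the flexibility the rationale refers to.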

5) You’re asked to size an adjacent market using IMF and World Bank downloads plus competitor
annual reports. These sources primarily are:
A. Primary data
B. Secondary data
C. Observational data
D. Transactional data
Answer: B
Rationale: Data collected by others for reuse.

6) Sales wants “top 500 customers with emails & last 12-month spend” to push an offer today. Which
data form is essential?
A. Unstructured
B. Semi-structured
C. Structured rows/columns
D. Paper memos
Answer: C
Rationale: Names, emails, spend in relational/CRM tables.

7) An insurer ingests XML claim forms from partners and PDF accident images from clients. What best
describes each?
A. XML structured & PDFs structured
B. XML semi-structured & PDFs unstructured
C. XML unstructured & PDFs semi-structured
D. Both unstructured
Answer: B
Rationale: XML = tagged semi-structured; photos/docs = unstructured.

8) Payroll wants fast drill-downs on tax deductions by region. Most suitable store?
A. Data lake with raw images
B. Relational database tables (RDBMS)
C. Social media archive
D. Free-text docs
Answer: B
Rationale: Structured payroll in RDBMS supports efficient queries.

9) A buy-side analyst mines tweets and YouTube transcripts for sentiment. The dominant data type is:
A. Structured
B. Semi-structured
C. Unstructured
D. Time-series structured
Answer: C
Rationale: Social posts and transcripts are unstructured text/audio.

10) A team often adds new attributes to product events without downtime. Which storage design
mitigates rigid-schema risk?
A. Strict 3NF relational database model
B. Structured only
C. Document store with JSON
D. Fixed-width binary files
Answer: C
Rationale: Semi-structured JSON tolerates evolving fields.

11) Which evidence is most “primary” for a revenue cutoff test?


A. Industry magazine ranking
B. Competitor 10-K
C. Company’s own sales ledger exports
D. World Bank dataset
Answer: C
Rationale: Direct, first-hand transaction records => primary.

12) Returns root-cause study requires order tables + customer review text + product photos. Minimal
set spans:
A. Structured only
B. Structured + unstructured
C. Unstructured only
Answer: B
Rationale: Orders are structured; reviews/photos are unstructured.

13) You need governed metrics for regulatory ratios and also raw media for investigations. Best
placement?
A. Put all in warehouse
B. Put all in object store
C. Warehouse & lake-house
D. Local laptops
Answer: C
Rationale: The chapter maps warehouses to structured data and lakes to large, varied data.

14) A bank trains probability of default (PD) and loss given default (LGD) models using
transaction histories and income statements. Why is this structured data advantageous here?
A. Cheaper storage only
B. Clear lineage and fields for algorithms
C. More colorful content
D. No security controls needed
Answer: B
Rationale: Structured data supports feature extraction, auditability.

15) Suppliers send e-invoices as XML; your ERP must parse line items reliably. The XML role is to
provide:
A. Video compression
B. Rigid relational schema
C. Tagged, semi-structured fields
D. Unconstrained blobs (Binary Large Object)
Answer: C
Rationale: XML tags/keys organize content without full rigidity.

16) You need broad, immediate industry context before designing a primary survey. Which data
category first?
A. Primary
B. Secondary
C. Derived
D. Synthetic
Answer: B
Rationale: Secondary is faster, broad scope; then refine with primary.

17) Operations wants on-time shipment %, cycle time, and daily backlog. Which storage is
best here?

A. Audio archive
B. RDBMS/warehouse tables
C. PNG images
D. Microsoft Word documents
Answer: B
Rationale: Structured metrics & dimensions.

18) Which source is most likely to embed sensitive personal info inconsistently across fields?
A. SQL table with defined columns
B. XML feed with fixed schema
C. Unstructured PDF attachments
D. CSV with enforced types
Answer: C
Rationale: Unstructured docs often hide PII in varied layouts.

19) A startup ingests diverse IoT payloads with frequent firmware changes. Why choose JSON?
A. Least scalable
B. Requires strict schema
C. Flexible keys, scalable ingestion
D. Only human readable; not machine-parsable
Answer: C
Rationale: Flexibility and scalability highlighted for semi-structured.

20) Which drawback is most associated with semi-structured stores at scale?


A. Inability to add new fields
B. Complex analysis and inconsistency
C. Not portable across systems
D. Cannot store metadata
Answer: B
Rationale: Variable structure → quality & analysis challenges.

21) Which is secondary data for a going-concern assessment of a supplier?


A. Supplier’s bank statements that you obtained directly from the bank
B. Your AP ledger showing delayed payments
C. Government economic outlook reports
D. Supplier’s signed sales contracts.
Answer: C
Rationale: Government reports are secondary.

22) A finance team wants to mine figures from scanned receipts. Pre-classification of this source?
A. Structured
B. Semi-structured
C. Unstructured
D. Derived
Answer: C
Rationale: Scans/images are unstructured before extraction.

23) Which aspect of structured data most helps an auditor trace how EBITDA in a dashboard was
computed?
A. Multimedia richness
B. Clear data lineage & defined fields
C. Absence of schema
D. Ad-hoc informal tags
Answer: B
Rationale: Structured data provides lineage & field definitions.

24) Marketing exports CSVs from Meta Ads with varying optional columns across campaigns. This is
best seen as:
A. Structured
B. Semi-structured
C. Unstructured
D. Non-data
Answer: B
Rationale: CSV exports whose columns vary from file to file behave as semi-structured data.

25) Which risk emerges if a company stores tax data mainly in free-text PDFs?
A. Over-normalization
B. Easy automated filing
C. Search/indexing difficulty and data deletion risk
D. Too much schema rigidity
Answer: C
Rationale: Unstructured → hard indexing; security challenges.

26) To benchmark salary bands vs. industry, what is primarily secondary data?
A. Your product export report
B. Industry compensation reports from consultancies
C. Employee performance reports
D. Internal payroll journals
Answer: B
Rationale: External reports collected by others.

27) Why can structured enterprise data migrations be complex?


A. No relationships to manage
B. Multi-table relationships and schema changes
C. Lack of any metadata
D. Images are too big
Answer: B
Rationale: Schema rigidity and relationships complicate migration.

28) A security team investigates intrusion using Apache logs in line-based text with key=value fields.
Pre-classification?
A. Structured
B. Semi-structured
C. Unstructured
D. Graph data
Answer: B
Rationale: Log lines with keys/tags → semi-structured.
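
As a rough illustration of how key=value markers make log lines partially parseable, the sketch below splits each line into a dictionary. The log lines, field names, and the `parse_kv_line` helper are assumptions made up for the example. They follow the format described in the question, not any specific server's default log layout.

```python
def parse_kv_line(line):
    """Split a log line of space-separated key=value pairs into a dict."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not key=value pairs
            fields[key] = value
    return fields

# Hypothetical log lines; the second has an extra optional field.
log_lines = [
    "ip=10.0.0.5 method=GET path=/login status=200",
    "ip=10.0.0.9 method=POST path=/admin status=403 user=guest",
]

parsed = [parse_kv_line(line) for line in log_lines]
# Filter on a key even though the lines do not share an identical layout.
denied = [p for p in parsed if p.get("status") == "403"]
print(denied)
```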

29) For monthly board packs (BS, P&L, CF), which is the canonical data type?
A. Structured ledger and sub-ledger tables
B. Social media feeds
C. Video statements
D. Freehand notes
Answer: A
Rationale: Financial statements & transactions are structured.

30) A company dumps images, emails, and CAD files into a bucket. What key challenge vs. a
warehouse?
A. Too rigid schema
B. Indexing/search, privacy, and analysis complexity
C. No scalability
D. Not portable
Answer: B
Rationale: Unstructured diversity → indexing/security challenges.

31) Hospital integrates EHR feeds in XML with lab PDFs. Immediate classification?
A. Both structured
B. XML semi-structured & PDFs unstructured
C. XML unstructured & PDFs structured
D. Both unstructured
Answer: B
Rationale: Tagged EHR formats; scanned docs unstructured.

32) A factory stores machine telemetry as nested JSON with device ID and readings. Why is this easier
than purely unstructured data?
A. No metadata available
B. Keys/tags enable partial parsing and queries
C. Requires fixed table design
D. Enforces ACID joins
Answer: B
Rationale: Semi-structured supports extraction without rigid schema.

33) CFO requests “top 10 opex spikes by GL account vs. prior month.” What property of structured
data helps?
A. Human-readable tags only
B. Defined fields, efficient retrieval with SQL
C. Multimedia context
D. No schema enforcement
Answer: B
Rationale: Structured tables + SQL slicing.
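
One possible way to express this request in SQL, sketched with Python's built-in `sqlite3` module. The `gl_opex` table, its columns, the account codes, and the amounts are invented for illustration. A real ledger schema would differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table of monthly operating expenses per GL account.
conn.execute("CREATE TABLE gl_opex (gl_account TEXT, period TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO gl_opex VALUES (?, ?, ?)",
    [
        ("6100-Travel", "2024-01", 1000.0),
        ("6100-Travel", "2024-02", 1800.0),
        ("6200-Rent", "2024-01", 5000.0),
        ("6200-Rent", "2024-02", 5000.0),
        ("6300-Utilities", "2024-01", 700.0),
        ("6300-Utilities", "2024-02", 950.0),
    ],
)

# Join each account's current month to its prior month and rank by increase.
query = """
SELECT cur.gl_account, cur.amount - prev.amount AS increase
FROM gl_opex AS cur
JOIN gl_opex AS prev ON prev.gl_account = cur.gl_account
WHERE cur.period = '2024-02' AND prev.period = '2024-01'
ORDER BY increase DESC
LIMIT 10
"""
spikes = conn.execute(query).fetchall()
print(spikes)
```

Because every row has the same defined fields, the comparison is a single join and sort, with no parsing step beforehand.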

34) Before building a new product line, you always use analyst whitepapers and government stats.
Main benefit of these sources?
A. Perfect accuracy
B. Low cost and speed, broad scope
C. Direct control over method
D. Always up-to-date
Answer: B
Rationale: Secondary data = cheaper, faster, broad.

35) Why is merging emails, call recordings, and chat transcripts into CRM hardest?
A. They are all relational
B. They are semi-structured only
C. They’re largely unstructured, requiring AI to analyze
D. They lack any business value
Answer: C
Rationale: Text/audio are unstructured, need NLP/AI.

36) Finance exports monthly trial balance CSVs; sometimes columns are missing when empty. This is:
A. Structured with rigid schema
B. Semi-structured with optional fields
C. Unstructured
D. No data
Answer: B
Rationale: CSV used with variable columns = semi-structured.
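
A small sketch of how a loader might cope with a column that disappears when it is empty. The file contents, column names, and the `load_trial_balance` helper are hypothetical. Treating a missing amount as zero is an assumption a real finance team would need to confirm.

```python
import csv
import io

# Hypothetical trial balance exports: February omits the empty "credit" column.
jan_csv = "account,debit,credit\nCash,500,0\nSales,0,500\n"
feb_csv = "account,debit\nCash,300\n"

def load_trial_balance(text):
    """Read rows, treating a missing or empty amount column as zero."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "account": row["account"],
            "debit": float(row.get("debit") or 0),
            "credit": float(row.get("credit") or 0),
        })
    return rows

jan = load_trial_balance(jan_csv)
feb = load_trial_balance(feb_csv)
print(feb)
```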

37) Which statement best captures a key advantage of structured data for compliance?
A. No need for access controls
B. Clear lineage and easier verification
C. Cheapest to store always
D. Works only offline
Answer: B
Rationale: Chapter highlights lineage and auditing support.

38) Which pairing aligns with chapter roles?


A. Structured → Data scientists only
B. Unstructured → Business analysts only
C. Structured → Analysts/engineers & Unstructured → Data scientists/engineers
D. Both → Only accountants
Answer: C
Rationale: Role mapping across data types provided.

39) You must quickly understand target’s market share trends. Which start is most pragmatic?
A. Run primary nationwide survey first
B. Scrape & analyze public filings and industry reports
C. Record customer calls
D. Read all emails sent by customers
Answer: B
Rationale: Secondary gives speed and scope for scoping.

40) With semi-structured logs, what risk occurs if software ignores optional attributes?
A. Guaranteed consistency
B. Potential data loss/misinterpretation
C. Zero privacy risk
D. Enforced rigid schema
Answer: B
Rationale: Chapter warns of omissions causing loss.

41) Why are VAT computations typically implemented over structured data?
A. They require images
B. Need free-form narratives
C. Deterministic fields enable rules and automation
D. Avoid any schema
Answer: C
Rationale: Structured enables automated tax calculations.

42) Which is the most unstructured source to evaluate agent quality?


A. CSV files
B. XML interaction summaries
C. MP3 call recordings
D. JSON transcripts
Answer: C
Rationale: Raw audio is unstructured.

43) Treasury builds models from AR/AP, bank statements, and sales. Which of these data types are
structured?
A. Bank statements only
B. AR/AP & bookings only
C. Sales Data
D. All of the above
Answer: D
Rationale: All of these are held as structured records in accounting and banking systems.

44) Which approach fits cleaning duplicate supplier names?


A. Analyze mp4s
B. SQL
C. Only manual read of PDFs
D. CSV
Answer: B
Rationale: Structured master + analytics.
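
To make the SQL approach concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `suppliers` table and the sample names are invented. Normalizing only case and surrounding spaces is an assumption, since real supplier matching usually needs further rules.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical supplier master with the same vendor entered three ways.
conn.execute("CREATE TABLE suppliers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO suppliers (name) VALUES (?)",
    [("Acme Traders",), ("ACME TRADERS ",), ("acme traders",), ("Zenith Mills",)],
)

# Normalize case and surrounding spaces, then group to find duplicates.
query = """
SELECT LOWER(TRIM(name)) AS clean_name, COUNT(*) AS copies
FROM suppliers
GROUP BY LOWER(TRIM(name))
HAVING COUNT(*) > 1
"""
duplicates = conn.execute(query).fetchall()
print(duplicates)
```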

45) Marketplace imports partner feeds with YAML/JSON variants. Best classification & reason?
A. Structured—fixed tables
B. Semi-structured—tags/keys enable flexibility
C. Unstructured—no tags
D. None of above
Answer: B
Rationale: YAML/JSON = semi-structured.

46) Which drawback often increases the TCO of a structured data system at very large scale?
A. Zero storage costs
B. Expensive specialized software and storage
C. No migration issues
D. No security problem
Answer: B
Rationale: Chapter notes expense of large relational systems.

47) You must validate a hypothesis from a secondary report about churn drivers. What is your next
best step for high accuracy and more current data?
A. Ignore it and publish whatever you have
B. Run a focused primary survey/experiment
C. Download more blogs from YouTube
D. Collect Govt. data
Answer: B
Rationale: Primary data = specific, accurate for your objective.

48) Fleet sends device logs with fields and nested arrays. Best store for easy evolution?
A. RDBMS only
B. Document database (JSON)
C. Flat PNGs
D. Audio vault
Answer: B
Rationale: Semi-structured JSON fits evolving telemetry.

49) Which combination maps best to “opinions about a bank on social networks” vs. “the bank’s GL
transactions”?
A. Both structured
B. Unstructured vs. structured
C. Structured vs. unstructured
D. Unstructured vs. semi-structured
Answer: B
Rationale: Social posts = unstructured & GL = structured.

50) A CA-firm advises a client to (i) centralize ERP facts in a warehouse, (ii) dump media and scans into
a lake-house, and (iii) parse partner XML into a curated zone. Which mapping (sequence) is correct?
A. (i) unstructured, (ii) structured, (iii) unstructured
B. (i) structured, (ii) unstructured, (iii) semi-structured
C. (i) semi-structured, (ii) structured, (iii) unstructured
D. (i) unstructured, (ii) semi-structured, (iii) structured
Answer: B
Rationale: ERP tables → structured, media → unstructured, XML → semi-structured.

Scenario-Based Applied Questions (LONG)


Question 1: (Data Categories – Structured, Semi-structured, Unstructured Data)
Scenario:
The Allied Bank of Pakistan (ABP) recently launched a Digital Banking Transformation Project to
improve data-driven decision-making. The project team collected various forms of data from multiple
departments:
• Customer account details, loan applications, and daily transaction logs were stored in SQL
databases.
• Voice recordings from customer complaint calls and scanned loan documents were uploaded
for analysis.
• XML-based data feeds were used to share customer risk ratings and product details between
different internal systems.
During a review meeting, the Chief Data Officer asked the analytics team to categorize these datasets
properly to improve data storage, retrieval, and analytics efficiency.
Required:
(a) Identify and justify which category of data (structured, unstructured, or semi-structured) applies
to each of the three types of datasets mentioned above.
(b) Explain two practical advantages and two challenges of managing multiple categories of data in a
financial institution like ABP.

Answer:
(a) Data Categorization:
| Data Type | Category | Justification |
|---|---|---|
| Customer account details, loan applications, daily transactions | Structured Data | Stored in SQL databases with a predefined schema (e.g., Account No., Loan ID, Balance). Data is organized in rows and columns and is easily queried using SQL. |
| Voice recordings and scanned loan documents | Unstructured Data | No predefined format; consists of audio files (e.g., MP3) and scanned images (e.g., PDF). They cannot be stored as rows in relational tables and require AI or NLP tools for interpretation. |
| XML data feeds (risk ratings, product info) | Semi-structured Data | Contain organized fields within tags (e.g., <CustomerID>, <RiskLevel>), making the data partially structured but flexible for exchange between systems. |

(b) Advantages and Challenges:


Advantages:
1. Comprehensive Insights: Combining multiple data categories allows ABP to analyze customer
behavior (structured), sentiment from call recordings (unstructured), and integrate data feeds
(semi-structured) for holistic decision-making.
2. Regulatory Efficiency: Structured data enables automated compliance reporting under SBP
regulations through predefined financial schemas.
Challenges:
1. Integration Complexity: Structured and unstructured data require different storage systems
(e.g., data warehouse vs. data lake), complicating integration.
2. Data Security Risks: Unstructured data (e.g., customer audio files) is harder to encrypt and
classify, raising data privacy and compliance concerns.

Question 2: (Data Sources – Internal, External, Primary, Secondary)


Scenario:
NestMart Pvt. Ltd., a retail chain in Pakistan, plans to expand into new cities. The management team
decided to use data analytics to choose the most profitable locations and design effective marketing
campaigns.
The team gathered the following data:
1. Daily sales transactions and customer feedback collected through the company’s own mobile
app.
2. Economic and population statistics from the Pakistan Bureau of Statistics.
3. A paid market report from Nielsen showing consumer buying trends.
4. Competitor price data collected via automated web scraping.
The CEO asked the Data Manager to classify these data sources and explain how using both internal
and external sources helps in strategic decision-making.
Required:
(a) Classify each dataset as Internal / External and Primary / Secondary source of data with
justification.
(b) Explain two reasons why combining internal and external data sources improves the quality of
business decisions for NestMart.

Answer:
(a) Data Source Classification
| Dataset | Source Type | Justification |
|---|---|---|
| 1. Daily sales & customer feedback (mobile app) | Internal & Primary | Generated directly by NestMart through its own operations; firsthand data collected for a specific purpose (sales tracking & feedback). |
| 2. Economic and population data (Bureau of Statistics) | External & Secondary | Collected by a government agency for national use; reused by NestMart for market analysis. |
| 3. Nielsen consumer buying trends report | External & Secondary | Purchased from a commercial data provider; pre-analyzed and not originally collected by NestMart. |
| 4. Competitor price data (web scraping) | External & Primary | Collected firsthand by NestMart’s team directly from competitor websites for comparison. |

(b) Importance of Combining Internal and External Sources


1. Enhanced Market Intelligence:
Internal data (sales records) shows what customers are buying, while external data (market
reports, population stats) shows why and where demand exists — enabling better location
planning and marketing focus.
2. Improved Forecasting Accuracy:
Combining internal sales patterns with external economic trends allows NestMart to anticipate
changes in consumer purchasing power, ensuring more realistic demand forecasting and
inventory management.

Scenario-Based Applied Questions (Medium Scale)

Question 1:
Ali & Co., a retail chain, maintains a database where each sales transaction is recorded with
predefined fields such as TransactionID, ProductID, Quantity, Price, and Date. Simultaneously, their
marketing team collects thousands of customer reviews and feedback from their website and social
media pages, which include text comments, images, and video testimonials.
a) Identify and differentiate the two main categories of data being handled by Ali & Co., as described
in the scenario.
b) Explain TWO key advantages for the company of using the first category of data you identified in
part (a).

Answer:
a) The two main categories of data are:
• Structured Data: This refers to the sales transaction data stored in the database with
predefined fields (TransactionID, ProductID, etc.). It is highly organized in a rows-and-columns
format.
• Unstructured Data: This refers to the customer reviews and feedback collected from websites
and social media, which include text, images, and videos. This data has no pre-defined format
or schema.
b) Two advantages of using Structured Data are:
1. Ease of Analysis and Reporting: Structured data can be easily analyzed using traditional tools
like SQL and Excel. Ali & Co. can quickly generate sales reports, calculate total revenue, and
identify top-selling products.
2. Efficiency and Consistency: The predefined schema ensures data consistency and accuracy.
This allows for efficient storage, retrieval, and processing of large volumes of transaction data,
making operations like inventory management and financial reporting more reliable.
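
The reporting advantage can be sketched with Python's built-in `sqlite3` module. The field names come from the scenario, but the `sales` table name and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (TransactionID INTEGER, ProductID TEXT, "
    "Quantity INTEGER, Price REAL, Date TEXT)"
)
# Hypothetical transactions.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        (1, "P-01", 2, 100.0, "2024-03-01"),
        (2, "P-02", 1, 250.0, "2024-03-01"),
        (3, "P-01", 3, 100.0, "2024-03-02"),
    ],
)

# Total revenue across all transactions.
total_revenue = conn.execute("SELECT SUM(Quantity * Price) FROM sales").fetchone()[0]
# Top-selling product by units sold.
top_product = conn.execute(
    "SELECT ProductID, SUM(Quantity) AS units FROM sales "
    "GROUP BY ProductID ORDER BY units DESC LIMIT 1"
).fetchone()
print(total_revenue, top_product)
```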

Question 2:
The State Bank of Pakistan is reviewing the risk profiles of various commercial banks. It classifies
banks into categories like "Low Risk," "Moderate Risk," and "High Risk" based on their capital
adequacy and non-performing loans. Furthermore, it precisely records the exact number of branches
each bank operates across the country.
a) Categorize the types of data mentioned in the scenario into Qualitative and Quantitative.
b) Further classify the qualitative data into its specific subtype and justify your choice.

Answer:
a)
• Qualitative Data: The risk classification of banks ("Low Risk," "Moderate Risk," "High Risk").
• Quantitative Data: The exact number of branches each bank operates.
b) The qualitative data is Ordinal Data.
Justification: The risk categories have a logical order or ranking (Low Risk is better than Moderate
Risk, which is better than High Risk). However, the precise difference between "Low" and "Moderate"
risk is not measurable or numerically defined, which is the key characteristic of ordinal data.

Question 3:
A financial analyst at a brokerage firm is working on a report. Part of her data includes the daily
closing price of a company's share, which was Rs. 245.75 on Monday and Rs. 248.50 on Tuesday. She
is also analyzing the total number of shares an investor holds, which is 5,000.
a) Identify the quantitative data subtypes for the share price and the number of shares.
b) Explain the fundamental difference between these two subtypes of quantitative data.

Answer:
a)
• Share Price (Rs. 245.75, Rs. 248.50): Continuous Data.
• Number of Shares (5,000): Discrete Data.
b) The fundamental difference is that Discrete Data consists of whole, countable numbers that cannot
be broken down into fractions (e.g., you cannot have 5,000.5 shares). Continuous Data, however, can
take any value within a range and can be measured with increasing precision, including decimals and
fractions (e.g., a share price can be Rs. 245.75, Rs. 245.755, etc.).

Question 4:
Bata Pakistan Ltd. receives a weekly shipment manifest from its supplier in XML format. This file
contains product codes, descriptions, and quantities, but the structure can vary slightly each week if
new product attributes are added. The company also stores all its finalized sales data in a rigid, fixed-
format SQL database.
a) Identify the category of data represented by the weekly shipment manifest.
b) State TWO characteristics of this data category that are evident from the scenario.

Answer:
a) The weekly shipment manifest in XML format represents Semi-structured Data.
b) Two characteristics are:
1. Uses Tags/Markers for Organization: The XML format uses tags
(e.g., <productCode>, <description>) to organize the information.
2. Flexible Schema/Lacks Rigid Structure: The scenario mentions that the structure "can vary
slightly each week," indicating that it does not have a rigid, predefined schema like structured
data, making it more adaptable.
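
Both characteristics can be seen in a short sketch using Python's standard `xml.etree.ElementTree` module. The `<productCode>` and `<description>` tags come from the answer above. The `<manifest>`, `<item>`, `<quantity>`, and `<color>` tags and all values are assumptions, since the scenario does not give the real file layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical manifest: the second item carries a new optional <color> tag.
manifest = """
<manifest>
  <item>
    <productCode>SH-100</productCode>
    <description>Leather shoe</description>
    <quantity>40</quantity>
  </item>
  <item>
    <productCode>SH-200</productCode>
    <description>Canvas sneaker</description>
    <quantity>25</quantity>
    <color>Blue</color>
  </item>
</manifest>
"""

root = ET.fromstring(manifest.strip())
items = []
for item in root.findall("item"):
    items.append({
        "productCode": item.findtext("productCode"),
        "quantity": int(item.findtext("quantity")),
        # findtext returns the default when the optional tag is absent
        "color": item.findtext("color", default="unknown"),
    })
print(items)
```

The tags let the parser locate each field by name, and the new attribute in the second item does not break the handling of the first.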

Question 5:
SoftTech Solutions is developing a new project management software. To understand user needs, its
development team conducted one-on-one interviews with 50 potential users. Later, to understand the
competitive landscape, the marketing team downloaded a market research report on the software
industry from a well-known consultancy firm.
a) Classify the data sources used by the development team and the marketing team as Primary or
Secondary.
b) State ONE advantage and ONE disadvantage of the data source used by the marketing team.

Answer:
a)
• Development Team (Interviews): Primary Data Source.
• Marketing Team (Market Research Report): Secondary Data Source.
b) For the Secondary Data (market research report):
• Advantage: It is cost-effective and saves time, as the data was already collected by the
consultancy firm, so SoftTech does not have to invest resources in conducting its own
extensive market research.
• Disadvantage: The data may not be fully specific or relevant to SoftTech's unique product or
target audience, as it was collected for a general purpose. It could also be somewhat outdated.

Question 6:
A hospital uses an Electronic Health Record (EHR) system that stores patient information in structured
tables (e.g., Patient-ID, Name, Age, Blood Pressure Reading). The same system also stores scanned
copies of doctors' handwritten notes and X-ray images for each patient.
a) Identify the types of data (based on structure) mentioned in the scenario.
b) Why is the analysis of the doctors' notes and X-rays more challenging than analyzing the blood
pressure readings? Explain briefly.

Answer:
a)
• Patient information in structured tables: Structured Data.
• Scanned copies of handwritten notes and X-ray images: Unstructured Data.
b) Analyzing doctors' notes and X-rays is more challenging because they are unstructured. They lack a
predefined format, making it difficult to process using traditional data analysis tools. To extract
meaningful insights from this data, advanced techniques like Natural Language Processing (NLP) for
the text and Computer Vision for the images are required.

Question 7:
PakFab Textiles is planning to launch a new product line. The management is considering two
approaches for data collection: (i) Conducting nationwide surveys and focus groups, or (ii) Purchasing
a pre-compiled industry report from a market research firm.
a) Compare the two approaches based on cost and specificity of the data obtained.
b) Which source is likely to be more reliable for making a multi-million rupee investment decision?
Justify your answer.

Answer:
a)
• Approach (i) - Surveys/Focus Groups (Primary Data): This is typically high cost but yields data
that is highly specific to PakFab's exact needs and new product line.
• Approach (ii) - Industry Report (Secondary Data): This is more cost-effective but the data
is less specific, as it is collected for a general audience and may not address PakFab's unique
questions directly.
b) For a multi-million rupee investment, Primary Data (Approach i) is likely to be more reliable.
Justification: The high stakes of the decision require the most accurate, up-to-date, and directly
relevant information. Primary data, collected firsthand for this specific purpose, offers greater control
over the process, ensures the data is current, and is tailored to the company's precise context,
thereby reducing the risk associated with the investment.

Question 8:
During an audit of a manufacturing company, an auditor is reviewing various documents. She
examines the general ledger (which is highly organized in a table format) and also reads through a
series of internal email communications between the production and sales departments regarding
inventory discrepancies.
a) Categorize the general ledger and the email communications based on data structure.
b) State one key challenge the auditor might face while analyzing the email communications that she
would not face with the general ledger.

Answer:
a)
• General Ledger: Structured Data.
• Email Communications: Unstructured Data.
b) A key challenge with the email communications is the difficulty in indexing and searching the
content effectively. Unlike the general ledger, where she can run a simple query to find a transaction,
the emails lack a fixed schema, making it time-consuming to manually sift through and extract specific
relevant information about the inventory discrepancies.

Question 9:
A bank is implementing a new AI-driven system. The system is designed to analyze two types of data:
(1) the structured transaction history of customers from its core banking database, and (2) the
recorded audio of customer calls to the service center to detect frustration in their voices.
a) Identify the data types (based on structure) being used for each analysis.
b) Which of the two data types requires tools like Artificial Intelligence (AI) for meaningful analysis?
Explain why.

Answer:
a)
• Structured transaction history: Structured Data.
• Recorded audio of customer calls: Unstructured Data.
b) Unstructured Data (the recorded audio) requires AI tools for meaningful analysis.
Explanation: Structured data can be analyzed using conventional tools and SQL queries. However,
unstructured data like audio has no predefined format. To detect nuanced patterns like emotional
sentiment (frustration) from voice, advanced AI techniques such as Natural Language Processing (NLP)
and audio sentiment analysis are necessary, as traditional methods are ineffective.

Question 10:
A university collects data on student course feedback. Students select a rating from the following
options: "Very Dissatisfied," "Dissatisfied," "Neutral," "Satisfied," and "Very Satisfied." The university
also precisely records the percentage marks obtained by each student in the final exam.
a) Classify the course feedback data and the exam marks data into their correct data types (e.g.,
Nominal, Ordinal, Discrete, Continuous).
b) Why is it statistically inappropriate to calculate the average of the course feedback ratings (e.g.,
"Satisfied," "Very Satisfied")? Explain.

Answer:
a)
• Course Feedback ("Very Dissatisfied" ... "Very Satisfied"): Ordinal Data (Qualitative).
• Exam Marks (Percentage): Continuous Data (Quantitative).
b) It is statistically inappropriate because the feedback data is Ordinal. The categories have a
meaningful order, but the "distance" between them is not defined and cannot be assumed equal. For
instance, the difference between "Very Dissatisfied" and "Dissatisfied" may not be the same as the
difference between "Satisfied" and "Very Satisfied." Therefore, arithmetic operations such as
calculating an average on these labels are not valid and can lead to misleading conclusions. The
median or mode is a more appropriate summary for such data.
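
As a rough illustration of part (b), the sketch below summarizes ordinal feedback with the median and the mode, which depend only on order and frequency rather than on distances between categories. The sample responses and the helper names `ordinal_median` and `ordinal_mode` are hypothetical, and the position of a label in `SCALE` encodes rank only.

```python
from collections import Counter

# Ordered scale from the question; the index gives rank only, not distance.
SCALE = ["Very Dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very Satisfied"]

def ordinal_median(responses):
    """Return the middle response after sorting by rank (lower middle for even counts)."""
    ranked = sorted(responses, key=SCALE.index)
    return ranked[(len(ranked) - 1) // 2]

def ordinal_mode(responses):
    """Return the most frequently chosen response."""
    return Counter(responses).most_common(1)[0][0]

# Hypothetical sample of five responses.
sample = ["Satisfied", "Neutral", "Very Satisfied", "Satisfied", "Dissatisfied"]
print(ordinal_median(sample))  # Satisfied
print(ordinal_mode(sample))    # Satisfied
```

Both summaries return a category that exists on the scale, so no assumption about equal spacing is needed.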

Question 11:
Scenario: The Accounts Department of a company maintains payroll data in a structured database
with fixed fields. The Human Resources department, on the other hand, maintains employee
resumes, offer letters, and performance appraisal notes in digital folders.
a) Contrast the two categories of data held by the Accounts and HR departments based on their
storage format and ease of analysis.
b) Which department's data is more suitable for a Data Lakehouse and why?

Answer:
a)
• Accounts Department (Payroll Data - Structured): Stored in a fixed schema (rows and
columns) in a relational database. It is very easy to analyze using tools like SQL for calculating
salaries, taxes, etc.
• HR Department (Resumes, etc. - Unstructured): Stored in its native format in digital folders. It
is difficult to analyze as it requires specialized tools (e.g., text analytics) to extract meaningful
information.
b) The HR Department's data (Unstructured Data) is more suitable for a Data Lakehouse.
Explanation: A Data Lakehouse can store, process, and analyze large volumes of unstructured and
semi-structured data in raw form, alongside structured data. It can handle the variety and volume of
files like resumes and notes, which a traditional data warehouse (suited for structured data) would
struggle to handle efficiently.
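
The contrast in part (a) can be sketched in a few lines. This is a minimal illustration, not how either department would actually work: the `payroll` table, its columns, the salary figures, and the appraisal note are all made up. The structured side is answered with a single SQL query, while the unstructured side needs pattern extraction just to recover one fact.

```python
import re
import sqlite3

# Structured: payroll rows with fixed fields (hypothetical table and values).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payroll (employee_id INTEGER, department TEXT, gross_salary REAL)")
conn.executemany(
    "INSERT INTO payroll VALUES (?, ?, ?)",
    [(1, "Finance", 150000.0), (2, "Finance", 120000.0), (3, "Sales", 90000.0)],
)
totals = dict(conn.execute(
    "SELECT department, SUM(gross_salary) FROM payroll GROUP BY department"
).fetchall())
conn.close()

# Unstructured: free text with no fields; even one fact needs pattern extraction.
appraisal_note = "Ayesha joined in 2019 and has 6 years of experience in audit."
match = re.search(r"(\d+)\s+years of experience", appraisal_note)
years_experience = int(match.group(1)) if match else None

print(totals)            # {'Finance': 270000.0, 'Sales': 90000.0}
print(years_experience)  # 6
```

The regular expression only works for notes phrased this exact way, which is why real text analytics needs more specialized tools.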

Question 12:
A car insurance company offers a discount to customers who install a telematics device in their
vehicle. This device collects two streams of data: (i) the exact GPS location coordinates (latitude and
longitude with many decimal places) every 30 seconds, and (ii) the total number of times the vehicle
was driven between 12 AM and 5 AM in a month.
a) Classify data streams (i) and (ii) as Discrete or Continuous.
b) Provide another example of Continuous Data relevant to a general insurance company.

Answer:
a)
• (i) GPS coordinates: Continuous Data (can take any value within a range, measurable to many
decimal places).
• (ii) Number of night drives: Discrete Data (a countable whole number).
b) Another example of Continuous Data for an insurance company could be the exact value of a
property insured (e.g., Rs. 45,750,000.00) or the client's age calculated precisely in years and
months (e.g., 45.75 years).
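
The distinction in part (a) also shows up in how the two streams would be represented in code. The sketch below assumes that a "night drive" is a trip whose start time falls between 12 AM and 5 AM; the trip timestamps and coordinates are hypothetical.

```python
from datetime import datetime

# Hypothetical trip start times for one month.
trip_starts = [
    datetime(2024, 3, 2, 1, 15),
    datetime(2024, 3, 5, 14, 40),
    datetime(2024, 3, 9, 4, 59),
    datetime(2024, 3, 9, 5, 0),
]

# Discrete: a count of trips starting between 12 AM and 5 AM (hours 0 to 4).
night_drives = sum(1 for t in trip_starts if 0 <= t.hour < 5)

# Continuous: a GPS reading can take any value in a range (hypothetical coordinates).
latitude, longitude = 24.860735, 67.001137

print(night_drives, type(night_drives).__name__)  # 2 int
print(latitude, type(latitude).__name__)          # 24.860735 float
```

The count can only be a whole number, while the coordinates can be refined to as many decimal places as the device measures.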

Question 13:
A supermarket chain, "EasyBuy," uses loyalty cards to track customer purchases. The data on what
each customer buys is stored in its own structured database. EasyBuy then forms a partnership with a
petrol station company, "QuickFuel," and agrees to share this customer purchase data to create joint
marketing offers.
a) In the context of data sources, how would you classify the customer data when used by QuickFuel?
b) State ONE ethical consideration that EasyBuy must address before sharing this data with QuickFuel.

Answer:
a) For QuickFuel, the customer purchase data received from EasyBuy is an External Data
Source (specifically, from a Data Partnership).
b) EasyBuy must address Transparency and Privacy.
Explanation: EasyBuy has an ethical and legal obligation to be transparent with its customers. It must
clearly inform them that their data will be shared with partners like QuickFuel and obtain their explicit
consent. Failure to do so would be a violation of customer privacy.

Question 14:
A government agency is collecting census data. It is crucial that the data collected is of high quality to
ensure effective policy-making.
a) List any FOUR characteristics of quality data that the agency must ensure during its collection.
b) For the characteristic "Timeliness," provide a specific example of how outdated census data could
lead to a poor policy decision.

Answer:
a) Four characteristics of quality data are: Accuracy, Completeness, Consistency, and Timeliness.
(Other valid characteristics include Relevance, Accessibility, and Reliability).
b) Example: If the census data is 15 years old and shows a low population in a certain district, the
government might allocate insufficient funds for building new schools and hospitals in that area. In
reality, if the population has doubled since the last census, this decision, based on outdated
(non-timely) data, would lead to a critical shortage of public services.
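
Two of the characteristics listed in part (a), Completeness and Timeliness, lend themselves to simple automated checks. The sketch below is one possible way to flag such problems. The record fields, the 10-year freshness threshold, and the function name `quality_issues` are assumptions for illustration and do not come from the chapter.

```python
from datetime import date

REQUIRED_FIELDS = ("district", "population", "collected_on")  # hypothetical schema
MAX_AGE_YEARS = 10  # assumed freshness threshold, not from the chapter

def quality_issues(record, today):
    """Return a list of completeness and timeliness problems for one record."""
    issues = [f"missing {f}" for f in REQUIRED_FIELDS if record.get(f) is None]
    collected = record.get("collected_on")
    if collected is not None and (today - collected).days > MAX_AGE_YEARS * 365:
        issues.append("outdated")
    return issues

# Hypothetical census records: one 15 years old, one with a missing value.
records = [
    {"district": "A", "population": 120000, "collected_on": date(2009, 6, 1)},
    {"district": "B", "population": None, "collected_on": date(2023, 6, 1)},
]
today = date(2024, 6, 1)
for r in records:
    print(r["district"], quality_issues(r, today))
# A ['outdated']
# B ['missing population']
```

Accuracy and Consistency usually require comparison against a trusted reference or against other records, so they are not covered by this sketch.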

Question 15:
A new health and fitness app "FitLife" has become very popular in Pakistan. It collects users' precise
workout routes via GPS, their heart rate, and also their personal details like name and CNIC number.
a) Based on the chapter, name ONE Pakistani legal framework that is relevant to the data collection
activities of FitLife.
b) According to ethical considerations, what are TWO key responsibilities FitLife has regarding the
user data it collects?

Answer:
a) One relevant Pakistani legal framework is the Prevention of Electronic Crimes Act (PECA) 2016,
which criminalizes unauthorized access to personal data.
b) Two key responsibilities of FitLife are:
1. Security: FitLife must implement robust security measures to protect the sensitive user data
(like CNIC and health metrics) from hackers, breaches, and misuse.
2. Transparency: FitLife must clearly explain to its users what data it is collecting (GPS, heart rate,
CNIC), why it is being collected, and how it will be used, stored, and potentially shared.
