In an IoT sensor network monitoring environmental temperatures, sensors often
fail during extreme weather conditions, so temperature readings are missing not at
random (MNAR). Which imputation method best addresses the bias from
these failures?
1. Fill missing readings with seasonal averages derived from historical sensor data.
2. Apply multiple imputation with MICE, estimating missing temperature values from
correlated variables.
3. Use forward-fill to propagate the previous reading across missing intervals.
4. Apply simple linear interpolation between neighbouring points to impute missing
sensor values.
5. Remove records with missing readings to maintain a complete dataset.
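For reference, a minimal sketch of MICE-style multiple imputation in Python, assuming scikit-learn's IterativeImputer as the chained-equations imputer and hypothetical sensor columns (temperature, humidity, pressure):

# Hedged sketch: MICE-style imputation with scikit-learn's IterativeImputer.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates the estimator)
from sklearn.impute import IterativeImputer

# Hypothetical readings: temperature is missing where sensors failed,
# but correlated variables (humidity, pressure) were still recorded.
df = pd.DataFrame({
    "temperature": [21.5, np.nan, 35.2, np.nan, 18.9],
    "humidity":    [0.62, 0.40, 0.31, 0.28, 0.71],
    "pressure":    [1013, 996, 990, 988, 1018],
})

imputer = IterativeImputer(random_state=0, sample_posterior=True)  # chained-equations style
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)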
You oversee an ETL pipeline that extracts data from multiple source systems with
loosely controlled schemas. Occasionally, these sources introduce new columns or
alter data types unexpectedly, which can affect downstream processes. Which
design strategy best supports seamless schema evolution?
1. Adopt a metadata-driven ETL framework with dynamic schema detection.
2. Manually update transformation logic for each schema change.
3. Configure the pipeline to log errors and skip records with schema mismatches.
4. Freeze the schema and require all sources to conform.
5. Rebuild the ETL pipeline after every detected change.
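To make the metadata-driven idea concrete, here is a minimal sketch (hypothetical column metadata, not a specific framework) of dynamic schema detection: each incoming batch is compared against registered metadata and drift is surfaced instead of being hard-coded into transformation logic.

import pandas as pd

# Hypothetical registered schema for one source system.
expected_schema = {"order_id": "int64", "amount": "float64", "region": "object"}

def detect_schema_drift(batch: pd.DataFrame, expected: dict) -> dict:
    """Report new columns, missing columns, and dtype changes versus the registered schema."""
    actual = {col: str(dtype) for col, dtype in batch.dtypes.items()}
    return {
        "new_columns": sorted(set(actual) - set(expected)),
        "missing_columns": sorted(set(expected) - set(actual)),
        "dtype_changes": {c: (expected[c], actual[c])
                          for c in expected if c in actual and actual[c] != expected[c]},
    }

batch = pd.DataFrame({"order_id": [1, 2], "amount": ["19.99", "5.00"], "channel": ["web", "app"]})
print(detect_schema_drift(batch, expected_schema))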
Imagine you are given a DataFrame that contains separate columns for
'morning_sales', 'afternoon_sales', and 'evening_sales'. To analyze sales trends
over the course of the day, you need to transform this data into a long format
with two columns: one for the time period and another for sales figures. What is
an effective approach to perform this transformation in Pandas?
1. Reindex the DataFrame and perform string-based splitting on column names to
create new rows.
2. Use groupby() to aggregate the sales sums, thereby merging the different time
columns.
3. Concatenate separate DataFrames created from each sales column manually.
4. Use the melt() function to consolidate the time-specific columns into 'time' and 'sales'
columns.
5. Apply pivot_table() to rotate the data into a long format without specifying index
values.
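For reference, a minimal sketch of the melt() approach, using the column names from the question plus a hypothetical 'store' identifier column:

import pandas as pd

df = pd.DataFrame({
    "store": ["A", "B"],
    "morning_sales":   [120, 95],
    "afternoon_sales": [180, 140],
    "evening_sales":   [210, 160],
})

long_df = df.melt(
    id_vars="store",
    value_vars=["morning_sales", "afternoon_sales", "evening_sales"],
    var_name="time",
    value_name="sales",
)
# Optional: strip the "_sales" suffix so 'time' holds only the period label.
long_df["time"] = long_df["time"].str.replace("_sales", "", regex=False)
print(long_df)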
A company is running an A/B test on two distinct email campaigns to boost
customer conversions. Each recipient is randomly assigned to either Campaign A
or Campaign B, and the outcome for every recipient is recorded as a conversion
(yes/no). Which inferential statistical test should be used to determine if there is
a statistically significant difference in conversion rates between the two
campaigns?
1. Paired Samples t-test
2. One-way ANOVA
3. Independent Samples t-test
4. Two-proportion z-test
5. Chi-square test of independence
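For reference, a two-proportion z-test on conversion counts can be sketched as follows (made-up counts; assumes statsmodels is available, and note that for a 2x2 table the chi-square test of independence gives an equivalent result):

from statsmodels.stats.proportion import proportions_ztest

conversions = [120, 95]     # converted recipients in Campaign A and Campaign B (made up)
recipients  = [1000, 1000]  # total recipients per campaign (made up)

stat, p_value = proportions_ztest(count=conversions, nobs=recipients)
print(f"z = {stat:.3f}, p = {p_value:.4f}")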
A retail chain analyzes daily customer spending, which shows a long right tail due
to occasional large purchases. To provide a robust summary of the data’s
dispersion, which statistic is most appropriate for capturing the spread of the
middle 50% of the data?
1. Interquartile Range (IQR)
2. Mean Absolute Deviation
3. Standard Deviation
4. Variance
5. Range
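For reference, a minimal sketch of computing the IQR on made-up spending values:

import numpy as np

spend = np.array([12, 15, 18, 20, 22, 25, 30, 35, 40, 250])  # long right tail from one large purchase
q1, q3 = np.percentile(spend, [25, 75])
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {q3 - q1}")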
A digital platform tests two homepage designs to see which one increases user
engagement time. The samples from the two versions are independent and
moderately large. Which hypothesis test is best for comparing the average
engagement times?
1. Mann-Whitney U test
2. Z-test
3. One-sample t-test
4. Independent t-test
5. Paired t-test
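For reference, an independent-samples t-test on simulated engagement times can be sketched as follows (Welch's variant, which does not assume equal variances):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
design_a = rng.normal(loc=4.2, scale=1.1, size=200)  # minutes on page, design A (simulated)
design_b = rng.normal(loc=4.5, scale=1.2, size=200)  # minutes on page, design B (simulated)

t_stat, p_value = stats.ttest_ind(design_a, design_b, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")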
A financial team wants to summarize the monthly operational expenses across
various departments, showcasing medians, variability, and potential outlier costs.
Which chart type would most effectively display these details in a compact
format?
1. Box plot
2. Pie chart
3. Bar chart
4. Line chart
5. Histogram
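For reference, a minimal sketch of a per-department box plot on simulated expenses (matplotlib assumed):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
departments = ["Finance", "IT", "Ops", "HR"]
expenses = [rng.lognormal(mean=10, sigma=0.4, size=60) for _ in departments]  # simulated

fig, ax = plt.subplots(figsize=(6, 4))
ax.boxplot(expenses)  # draws median, IQR box, whiskers, and outlier points per department
ax.set_xticks(range(1, len(departments) + 1))
ax.set_xticklabels(departments)
ax.set_ylabel("Monthly expense")
plt.tight_layout()
plt.show()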
A global product performance dashboard uses various visualization types like bar
charts, line graphs, and pie charts to compare metrics across different regions.
Users report that inconsistent numerical scales are causing confusion and
misinterpretation of trends. Which principle should be prioritized to resolve this
issue?
1. Adopt uniform color gradients across all charts
2. Utilize diverse chart styles for regional uniqueness
3. Introduce interactive options for scaling adjustments
4. Ensure consistent scale and axis configurations
5. Standardize font styles on all labels
You are managing a database that consists of two tables: Customers and Orders.
Not every customer has placed an order, but you need to display every customer
alongside any corresponding orders. Which join operation should you use in your
SQL query to make sure that all customers are listed, regardless of whether they
have an order record?
1. FULL OUTER JOIN
2. CROSS JOIN
3. INNER JOIN
4. RIGHT JOIN
5. LEFT JOIN
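For reference, the same left-join semantics can be sketched in pandas (hypothetical columns): every customer is kept, and order fields are NaN where no order exists.

import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ana", "Bo", "Cy"]})
orders = pd.DataFrame({"order_id": [10, 11], "customer_id": [1, 3], "total": [50.0, 20.0]})

result = customers.merge(orders, on="customer_id", how="left")
print(result)  # customer 2 appears with NaN order fields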
A marketing team observes that the distribution of customer spending is highly
skewed, leading standard outlier detection methods to incorrectly flag a
significant segment of high-value customers as outliers. Which strategy would
best adjust for the skewness when detecting outliers?
1. Leverage simple mean-based thresholds after temporarily excluding extreme values.
2. Apply a log transformation to normalize the distribution before using the IQR method.
3. Use z-score standardization and filter out values with scores above 3.
4. Utilize clustering techniques like k-means to isolate small clusters as outliers.
5. Increase the IQR threshold multiplier from 1.5 to 3 to capture more data points.
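For reference, a minimal sketch of applying a log transformation before the IQR rule, on simulated right-skewed spending:

import numpy as np

rng = np.random.default_rng(2)
spend = rng.lognormal(mean=4, sigma=1.0, size=1000)  # simulated right-skewed spending

log_spend = np.log1p(spend)                  # compress the long right tail
q1, q3 = np.percentile(log_spend, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = spend[(log_spend < lower) | (log_spend > upper)]
print(f"{outliers.size} of {spend.size} values flagged")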
While consolidating supplier records from various regions, you notice that
addresses use different abbreviations and formatting. Which approach best
addresses duplicate removal for records with such minor variations?
1. Apply exact matching on formatted data
2. Sort records and remove consecutive duplicates
3. Implement fuzzy matching with a set similarity threshold
4. Combine rule-based filters with manual review
5. Use supplier IDs for deduplication
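For reference, a minimal sketch of fuzzy matching with a set threshold, using only the standard library's difflib (dedicated libraries such as rapidfuzz offer faster, more robust scorers):

from difflib import SequenceMatcher

def is_probable_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two address strings as duplicates when their similarity meets the threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

print(is_probable_duplicate("123 Main St., Springfield", "123 Main Street, Springfield"))  # True
print(is_probable_duplicate("123 Main St., Springfield", "45 Oak Ave, Shelbyville"))       # False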
Imagine you’re tasked with cleaning a survey dataset that includes participant
ages. Some entries are clearly invalid, falling outside the plausible range of 0 to
120. Which approach best validates these entries while retaining as much correct
data as possible?
1. Use a statistical outlier filter centered on the mean to remove values without using
preset thresholds.
2. Convert all age inputs to numbers and substitute extreme values with the overall
average.
3. Discard all age entries that fall outside the 0–120 range without further investigation.
4. Automatically adjust any age values outside the expected range to the nearest valid
boundary.
5. Apply predefined numeric boundaries to flag ages below 0 or above 120 and
manually review borderline cases.
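For reference, a minimal sketch (hypothetical survey columns) of flagging implausible ages for review rather than silently dropping or overwriting them:

import pandas as pd

survey = pd.DataFrame({"respondent": [1, 2, 3, 4], "age": ["34", "-2", "129", "61"]})

survey["age"] = pd.to_numeric(survey["age"], errors="coerce")
survey["age_flag"] = ~survey["age"].between(0, 120)   # True = outside the plausible 0-120 range
print(survey[survey["age_flag"]])                     # queue these rows for manual review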
You are working with a dataset of daily sales transactions that includes columns
for the date, region, and sales amount. To evaluate quarterly trends and total
sales per region, what is the most efficient method to group your data within a
pivot table?
1. Change the date formatting in the pivot table to display quarter numbers, which
Excel will then use to group the dates.
2. Utilize the pivot table’s 'Group Field' feature on the date field and select 'Quarters' so
that Excel aggregates the data automatically.
3. Manually add a new column calculating the quarter for each transaction and use that
column in the pivot table.
4. Sort the data by date before creating the pivot table to let Excel automatically detect
quarterly groupings.
5. Create separate pivot tables for each quarter by applying a date filter for each one.
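Outside Excel, the equivalent quarterly grouping can be sketched in pandas (hypothetical column names), which may help clarify what the pivot table's quarter grouping produces:

import pandas as pd

sales = pd.DataFrame({
    "date":   pd.to_datetime(["2024-01-15", "2024-02-03", "2024-04-20", "2024-05-11"]),
    "region": ["East", "West", "East", "West"],
    "amount": [200.0, 150.0, 320.0, 90.0],
})

sales["quarter"] = sales["date"].dt.to_period("Q")   # e.g. 2024Q1, 2024Q2
summary = sales.pivot_table(index="quarter", columns="region", values="amount", aggfunc="sum")
print(summary)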
Imagine you are compiling a quarterly sales report from a dataset that includes
the columns Region and Sales. You need to calculate the total sales where the
region is "North" and each sale exceeds $1,000. Which Excel formula structure
would best achieve this result?
1. =SUM(Sales*(Region="North")*(Sales>1000))
2. =SUMIF(Region, "North", Sales)
3. =SUMIFS(Sales, Region, "North", Sales, ">1000")
4. =SUMIF(Sales, ">1000") + SUMIF(Region, "North", Sales)
5. =SUMIFS(Sales, Sales, ">1000")
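The same conditional sum can be sketched outside Excel in pandas (hypothetical data), which may help clarify the two-criteria logic the formula needs to express:

import pandas as pd

df = pd.DataFrame({
    "Region": ["North", "North", "South", "North"],
    "Sales":  [1500.0, 800.0, 2000.0, 2500.0],
})

total = df.loc[(df["Region"] == "North") & (df["Sales"] > 1000), "Sales"].sum()
print(total)  # 4000.0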
In your automated infrastructure deployment, a script that works flawlessly in the
staging environment repeatedly fails in a production Linux environment. After
some investigation, you find that subtle differences in environment variable
settings across these environments are affecting the script’s behavior. Which
measure would most effectively address this issue to ensure consistent script
execution?
1. Wrap the failing parts of the script with conditional checks that ignore discrepancies
in environment variables.
2. Implement a centralized configuration management system to enforce uniform
environment variable settings across all environments.
3. Manually adjust the environment variables on the production system each time
before deployment.
4. Hard-code the environment variables directly within the script for the production
deployment.
5. Increase the script’s error handling to bypass failures related to environment
variables.
Imagine you are writing a shell script to process a list of filenames that may
contain spaces. You have two filenames: "My File.txt" and "Another File.txt".
Which of the following looping approaches best ensures that each filename is
treated as a single item in your script?
1. files="My File.txt Another File.txt"; for file in $files; do echo "$file"; done
2. files=("My File.txt" "Another File.txt"); for file in "$files"; do echo "$file"; done
3. files=("My File.txt" "Another File.txt"); for file in "${files[@]}"; do echo "$file"; done
4. files="My File.txt Another File.txt"; IFS=$'\n'; for file in $files; do echo "$file"; done
5. for file in $(ls *.txt); do echo "$file"; done
Imagine you are responsible for allocating limited resources among several
ongoing projects in an organization. The historical performance data, current
trends, and future projection metrics vary significantly across these projects.
Time constraints limit the possibility of a deep dive into every detail, but you still
need to make a decision that balances risk and reward while considering long-
term benefits. How would you approach this decision?
1. Invest primarily in the project with the highest current performance numbers,
trusting that past success predicts future results.
2. Allocate resources equally across all projects to ensure fairness, assuming similar
potential across the board.
3. Focus on projects with a history of stable performance, believing that consistency will
guarantee future success even if growth prospects are limited.
4. Prioritize projects that receive favorable internal opinions, assuming political support
within the organization reflects quality potential.
5. Perform a cost-benefit analysis that integrates historical performance, current trends,
and future projections to identify which projects offer the best balance of risk and
reward.
Imagine you are assessing the impact of a new service launch. Sales figures show
a decline, but customer satisfaction surveys indicate strong positive feedback.
What would be the most effective step to reconcile these conflicting outcomes?
1. Investigate data collection methods from both sources to understand possible biases.
2. Consult with the marketing team to align survey interpretations with the sales
targets.
3. Immediately attribute the discrepancy to data entry errors and disregard one of the
reports.
4. Prioritize the positive customer feedback over the declining sales figures.
5. Conduct a detailed time-series analysis to check if external factors like seasonality
affected sales.
You are a data analyst working for a large US-based superstore. You have been
granted access to a historic sales database for the superstore that contains all
sales orders from 2014-10-01 until 2017-09-09 across multiple related tables, as
detailed in the following Entity Relationship Diagram (ERD). Note: there are
multiple rows in orders per orderId.
Your manager would like to see the orderId, customerId, and productId for the last
15 returned orders, based on the orderDate column. If two returned orders were
placed on the same day, sort by order value, showing the highest-value order first.
WITH order_totals AS (
    -- One row per order: header fields plus the total order value.
    SELECT orderId,
           MAX(customerId) AS customerId,
           MAX(orderDate) AS orderDate,
           SUM(value) AS order_value
    FROM orders
    GROUP BY orderId
), returned_orders AS (
    -- Keep only orders that appear in returns with returned = 1.
    SELECT t.orderId, t.customerId, t.orderDate, t.order_value, r.returned
    FROM order_totals AS t
    JOIN returns AS r
      ON t.orderId = r.orderId
    WHERE r.returned = 1
), ranked AS (
    -- Rank returned orders: most recent first, highest value first within a day.
    SELECT orderId, customerId, orderDate, order_value,
           ROW_NUMBER() OVER (ORDER BY orderDate DESC, order_value DESC) AS rn
    FROM returned_orders
)
-- Join back to orders to list every productId on the last 15 returned orders;
-- rn <= 15 already limits the result to those orders, so no LIMIT is applied
-- (orders has multiple product rows per orderId).
SELECT o.orderId, o.customerId, o.productId
FROM ranked AS r
JOIN orders AS o
  ON r.orderId = o.orderId
WHERE r.rn <= 15
ORDER BY r.orderDate DESC, r.order_value DESC;
You are given two tables: employees (with columns employee_id, name) and
orders (with columns order_id, employee_id, order_date). You need to generate a
report listing the employees who have processed more orders than the average
number of orders processed by all employees. Which SQL query correctly uses
subqueries to achieve this goal?
1. SELECT e.employee_id, e.name
   FROM employees e
   WHERE (SELECT COUNT(*)
          FROM orders o
          WHERE o.employee_id = e.employee_id)
       > (SELECT AVG(order_count)
          FROM (SELECT COUNT(*) AS order_count
                FROM orders
                GROUP BY employee_id));
2. SELECT e.employee_id, e.name
   FROM employees e
   WHERE (SELECT COUNT(*)
          FROM orders o
          WHERE o.employee_id = e.employee_id)
       > (SELECT AVG(COUNT(*))
          FROM orders
          GROUP BY employee_id);
3. SELECT e.employee_id, e.name
   FROM employees e
   JOIN (SELECT employee_id, COUNT(*) AS total
         FROM orders
         GROUP BY employee_id) t
     ON e.employee_id = t.employee_id
   WHERE t.total > (SELECT AVG(total) FROM orders);
4. SELECT e.employee_id, e.name
   FROM employees e
   WHERE COUNT(*) > (SELECT AVG(order_count)
                     FROM (SELECT COUNT(*) AS order_count
                           FROM orders
                           GROUP BY employee_id) AS avg_orders);
5. SELECT e.employee_id, e.name
   FROM employees e
   JOIN (SELECT employee_id, COUNT(*) AS order_count
         FROM orders
         GROUP BY employee_id) t
     ON e.employee_id = t.employee_id
   WHERE t.order_count > (SELECT COUNT(*) FROM orders)
                       / (SELECT COUNT(DISTINCT employee_id) FROM orders);