Top 65 SQL Data Analysis Q&A
In an interview, these questions are more likely to appear early in the process and
cover data analysis at a high level.
In data mining, raw data is converted into valuable information.
It cannot identify inaccurate or incorrect data values.
Data Wrangling is the process wherein raw data is cleaned, structured, and enriched
into a desired usable format for better decision making. It involves discovering,
structuring, cleaning, enriching, validating, and analyzing data. This process can turn
and map out large amounts of data extracted from various sources into a more
useful format. Techniques such as merging, grouping, concatenating, joining, and
sorting are used to analyze the data. After that, the data is ready to be used with
another dataset.
3. What are the various steps involved in any analytics
project?
This is one of the most basic data analyst interview questions. The steps involved in
a typical analytics project are as follows:
Understanding the Problem
Understand the business problem, define the organizational goals, and plan for a
lucrative solution.
Collecting Data
Gather the right data from various sources and other information based on your
priorities.
Cleaning Data
Clean the data to remove unwanted, redundant, and missing values, and make it
ready for analysis.
Exploring and Analyzing Data
Use data visualization and business intelligence tools, data mining techniques, and
predictive modeling to analyze data.
Interpreting the Results
Interpret the results to uncover hidden patterns, anticipate future trends, and gain
insights.
• Handling duplicate
5. Which are the technical tools that you have used for
analysis and presentation purposes?
As a data analyst, you are expected to know the tools mentioned below for analysis
and presentation purposes. Some of the popular tools you should know are:
MS Excel, Tableau
Python, R, SPSS
MS PowerPoint
• Before working with the data, identify and remove the duplicates. This will lead to
an easy and effective data analysis process.
• Focus on the accuracy of the data. Set cross-field validation, maintain the value
types of data, and provide mandatory constraints.
• Normalize the data at the entry point so that it is less chaotic. You will be able to
ensure that all information is standardized, leading to fewer errors on entry.
• It helps you obtain confidence in your data to a point where you’re ready to engage
a machine learning algorithm.
• It allows you to refine your selection of feature variables that will be used later for
model building.
• You can discover hidden trends and insights from the data.
Descriptive: It provides insights into the past to answer “what has happened”. Uses
data aggregation and data mining techniques.
Predictive: Understands the future to answer “what could happen”. Uses statistical
models and forecasting techniques.
Prescriptive: Suggests various courses of action to answer “what should you do”.
Uses simulation algorithms and optimization techniques to advise possible outcomes.
• Systematic sampling
• Cluster sampling
• Stratified sampling
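The three sampling techniques listed above can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: the population, the `regions` clusters, and all function names are hypothetical, and a fixed random seed is used only to make the run repeatable.

```python
import random

def systematic_sample(items, k, start=0):
    """Systematic sampling: take every k-th item, beginning at index `start`."""
    return items[start::k]

def stratified_sample(items, key, per_stratum, rng):
    """Stratified sampling: draw `per_stratum` random items from each group."""
    strata = {}
    for item in items:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for group in strata.values():
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Cluster sampling: pick whole clusters at random and keep every member."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

rng = random.Random(0)               # fixed seed, for repeatability only
population = list(range(1, 21))      # hypothetical population of 20 units

every_fifth = systematic_sample(population, k=5)
by_parity = stratified_sample(population, key=lambda n: n % 2, per_stratum=3, rng=rng)

# Hypothetical clusters: each region is sampled as a whole.
regions = {"north": [1, 2, 3], "south": [4, 5], "east": [6, 7, 8], "west": [9]}
two_regions = cluster_sample(regions, n_clusters=2, rng=rng)
```

Systematic sampling is deterministic once the start and step are fixed, whereas the stratified and cluster samples depend on the random generator.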
Univariate analysis is the simplest and easiest form of data analysis where the data
being analyzed contains only one variable.
Bivariate analysis involves analyzing two variables to find causes, relationships, and
correlations between them.
Example – Analyzing the sale of ice creams based on the temperature outside.
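One way to quantify such a relationship is the Pearson correlation coefficient. The sketch below computes it by hand; the temperature and sales figures are invented for illustration and do not come from any real dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical readings: outside temperature (deg C) and ice creams sold that day.
temperature = [18, 21, 24, 27, 30, 33]
ice_creams_sold = [40, 55, 62, 80, 95, 110]

r = pearson_r(temperature, ice_creams_sold)
```

A value of `r` close to 1 indicates a strong positive linear relationship. It does not by itself establish that one variable causes the other.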
• Data Ownership and Rights: Respecting data ownership rights and intellectual
property, using data only within the boundaries of legal permissions or
agreements.
• Data Quality and Integrity: Ensuring the accuracy, completeness, and reliability of
data used in the analysis to avoid misleading or incorrect conclusions.
• Social Impact: Considering the potential social impact of data analysis results,
including potential unintended consequences or negative effects on marginalized
groups.
You should name the tools you have used personally. That said, here is a list of the
data visualization tools commonly used in the industry:
• Tableau
• Microsoft Power BI
• QlikView
• Plotly
• SAP Lumira
This is one of the most frequently asked data analyst interview questions, and the
interviewer expects a detailed answer, not just the names of the methods. There are
four methods to handle missing values in a dataset.
Listwise Deletion
In the listwise deletion method, an entire record is excluded from analysis if any
single value is missing.
Average Imputation
Take the average value of the other participants' responses and fill in the missing
value.
Regression Substitution
Multiple Imputations
It creates plausible values based on the correlations for the missing data and then
averages the simulated datasets by incorporating random errors in your predictions.
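The two simplest methods, listwise deletion and average imputation, can be sketched as follows. The survey records and function names are hypothetical, and `None` stands in for a missing value. Regression substitution and multiple imputation need a fitted model and are not shown.

```python
from statistics import mean

# Hypothetical survey records; None marks a missing response.
records = [
    {"id": 1, "age": 34, "score": 7.0},
    {"id": 2, "age": None, "score": 5.5},
    {"id": 3, "age": 29, "score": None},
    {"id": 4, "age": 41, "score": 8.5},
]

def listwise_delete(rows):
    """Drop every record that has at least one missing value."""
    return [row for row in rows if all(v is not None for v in row.values())]

def mean_impute(rows, field):
    """Fill missing values of `field` with the mean of the observed values."""
    observed = [row[field] for row in rows if row[field] is not None]
    fill = mean(observed)
    return [
        {**row, field: fill if row[field] is None else row[field]}
        for row in rows
    ]

complete_rows = listwise_delete(records)   # keeps only ids 1 and 4
imputed = mean_impute(records, "score")    # id 3 receives the mean score
```

Listwise deletion shrinks the dataset, while mean imputation keeps every record but reduces the variance of the imputed field.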
• 68% of the data falls within one standard deviation of the mean
• 95% of the data falls within two standard deviations of the mean
• 99.7% of the data falls within three standard deviations of the mean
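These percentages can be checked against the cumulative distribution function of a normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library. The helper name `share_within` is made up for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

# Shares for one, two, and three standard deviations.
shares = {k: share_within(k) for k in (1, 2, 3)}
```

The computed shares are approximately 0.6827, 0.9545, and 0.9973, which round to the 68%, 95%, and 99.7% figures.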
Time Series analysis is a statistical procedure that deals with the ordered sequence
of values of a variable at equally spaced time intervals. Time series data are
collected at adjacent periods. So, there is a correlation between the observations.
This feature distinguishes time-series data from cross-sectional data.
This is another frequently asked data analyst interview question, and you are
expected to cover all the given differences!
Overfitting Underfitting
An outlier is a data point that is distant from other similar points. Outliers may be
due to variability in the measurement or may indicate experimental errors.
The graph depicted below shows there are three outliers in the dataset.
To deal with outliers, you can use the following four methods:
• Null hypothesis: It states that there is no relation between the predictor and
outcome variables in the population. It is denoted by H0.
• Alternative hypothesis: It states that there is some relation between the predictor
and outcome variables in the population. It is denoted by H1.
A Type II error occurs when the null hypothesis is not rejected, even if it is false. It is
also known as a false negative.
Ans: The choice of handling technique depends on factors such as the amount and
nature of missing data, the underlying analysis, and the assumptions made. It's
crucial to exercise caution and carefully consider the implications of the chosen
approach to ensure the integrity and reliability of the data analysis. However, a few
solutions could be:
• sensitivity analysis
It's important to note that outlier detection is not a definitive process, and the
identified outliers should be further investigated to determine their validity and
potential impact on the analysis or model. Outliers can be due to various reasons,
including data entry errors, measurement errors, or genuinely anomalous
observations, and each case requires careful consideration and interpretation.
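As an illustration of flagging candidates for that follow-up investigation, here is a minimal sketch of one common detection rule, the 1.5 × IQR fence. The choice of this rule is an assumption, since the text does not name a specific method, and the readings are invented.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sensor readings with two suspicious values.
readings = [10, 12, 11, 13, 12, 11, 95, 12, 10, -40]
flagged = iqr_outliers(readings)
```

The function only flags points. Whether a flagged value is an entry error or a genuine observation still has to be decided case by case.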
Excel Data Analyst Interview Questions
Yes, you can provide a dynamic range in the “Data Source” of Pivot tables. To do
that, create a named range using the OFFSET function, and then base the pivot
table on the named range created in the first step.
27. What is the function to find the day of the week for a
particular date value?
To get the day of the week, you can use the WEEKDAY() function.
The above function will return 6 as the result, i.e., 17th December is a Saturday.
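For comparison, the same check can be sketched in Python. The year is an assumption, because the text gives only the day and month: 17 December falls on a Saturday in 2022. In Excel, the default WEEKDAY numbering counts Sunday as 1, so a result of 6 for a Saturday corresponds to the Monday-first numbering (return type 2), which matches Python's `isoweekday()`.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Assumed year: 17 December 2022 is a Saturday.
d = date(2022, 12, 17)

iso_day = d.isoweekday()          # Monday = 1 ... Sunday = 7
day_name = DAY_NAMES[d.weekday()] # weekday() is zero-based, Monday = 0
```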
AND() is a logical function that checks multiple conditions and returns TRUE or
FALSE based on whether the conditions are met.
Syntax: AND(logical1, [logical2], [logical3], ...)
In the below example, we are checking if the marks are greater than 45. The result
will be true if the mark is >45, else it will be false.
29. Explain how VLOOKUP works in Excel?
VLOOKUP is used when you need to find things in a table or a range by row.
If you wanted to find the department to which Stuart belongs, you could use the
VLOOKUP function as shown below:
Here, cell A11 holds the lookup value, A2:E7 is the table array, 3 is the column index
number of the column with department information, and 0 is the range lookup
argument, which requests an exact match.
If you hit enter, it will return “Marketing”, indicating that Stuart is from the marketing
department.
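The exact-match behaviour described above can be mimicked in Python to show the logic: scan the first column for the lookup value and return the cell at the requested column index. The table below is hypothetical. Only the pairing of Stuart with Marketing comes from the text, and `vlookup` is an illustrative helper, not an Excel API.

```python
# Hypothetical table; only the Stuart -> Marketing pairing comes from the text.
table = [
    ("Name", "Age", "Department", "City", "Salary"),
    ("Asha", 31, "Finance", "Pune", 52000),
    ("Stuart", 28, "Marketing", "Leeds", 48000),
    ("Mei", 35, "Engineering", "Osaka", 61000),
]

def vlookup(lookup_value, rows, col_index):
    """Exact-match lookup on the first column; col_index is 1-based, as in Excel."""
    for row in rows:
        if row[0] == lookup_value:
            return row[col_index - 1]
    return None  # Excel would display #N/A when nothing matches

department = vlookup("Stuart", table, 3)
```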
30. What function would you use to get the current date
and time in Excel?
In Excel, TODAY() returns the current date, and NOW() returns the current date and
time.
For the Sales Rep column, you need to give the criteria as “A*” - meaning the name
should start with the letter “A”. For the Cost each column, the criteria should be “>10”
- meaning the cost of each item is greater than 10.
• Select the table range and the worksheet where you want to place the pivot table
• Drag Sale total on to Values, and Sales Rep and Item on to Row Labels. It will give
the sum of sales made by each representative for every item they have sold.
• Right-click on “Sum of Sale Total” and expand Show Values As to select % of
Grand Total.
Using this table, let’s find the records for movies that were directed by Brad Bird.
Now, let’s filter the table for directors whose movies have an average duration
greater than 115 minutes.
35. What is the difference between a WHERE clause and
a HAVING clause in SQL?
When this data analyst interview question is asked, cover all of the differences below
and give the syntax for each to show the interviewer that you know the topic
thoroughly.
WHERE: In the WHERE clause, the filter occurs before any groupings are made.
HAVING: HAVING is used to filter values from a group.
SELECT column_name(s)
FROM table_name
WHERE condition
GROUP BY column_name(s)
HAVING condition
ORDER BY column_name(s);
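To see the two clauses side by side, the sketch below runs both against an in-memory SQLite database. SQLite is used only because it ships with Python. The `movies` table, its column names, and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, director TEXT, duration INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("Film A", "Director X", 120),
        ("Film B", "Director X", 130),
        ("Film C", "Director Y", 95),
        ("Film D", "Director Y", 140),
        ("Film E", "Director Z", 100),
    ],
)

# WHERE filters individual rows before any grouping happens.
long_films = conn.execute(
    "SELECT title FROM movies WHERE duration > 115 ORDER BY title"
).fetchall()

# HAVING filters the groups produced by GROUP BY.
long_on_average = conn.execute(
    """
    SELECT director, AVG(duration) AS avg_duration
    FROM movies
    GROUP BY director
    HAVING AVG(duration) > 115
    ORDER BY director
    """
).fetchall()
conn.close()
```

Director Y has one film under 115 minutes that the WHERE query drops, but the director still passes the HAVING filter because the average across both films is 117.5.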
36. Is the below SQL query correct? If not, how will you
rectify it?
The query stated above is incorrect as we cannot use the alias name while filtering
data using the WHERE clause. It will throw an error.
The Union operator combines the output of two or more SELECT statements.
Syntax:
The Intersect operator returns the common records that are the results of 2 or more
SELECT statements.
Syntax:
Syntax:
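A runnable sketch of the Union and Intersect operators, again using Python's built-in SQLite driver. The two customer tables and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE online_customers (name TEXT);
    CREATE TABLE store_customers (name TEXT);
    INSERT INTO online_customers VALUES ('Ana'), ('Ben'), ('Cy');
    INSERT INTO store_customers VALUES ('Ben'), ('Cy'), ('Dee');
""")

# UNION combines both result sets and removes duplicate rows.
union_rows = conn.execute(
    "SELECT name FROM online_customers "
    "UNION SELECT name FROM store_customers ORDER BY name"
).fetchall()

# INTERSECT keeps only the rows present in both result sets.
intersect_rows = conn.execute(
    "SELECT name FROM online_customers "
    "INTERSECT SELECT name FROM store_customers ORDER BY name"
).fetchall()
conn.close()
```

Both operators require the SELECT statements to return the same number of columns.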
A Subquery in SQL is a query within another query. It is also known as a nested query
or an inner query. Subqueries are used to enhance the data to be queried by the main
query.
It is of two types - Correlated and Non-Correlated Query.
Below is an example of a subquery that returns the name, email id, and phone
number of an employee from Texas city.
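SELECT name, email, phone  -- column names assumed from the description above
FROM employee
WHERE emp_id IN (
    SELECT emp_id
    FROM employee
    WHERE city = 'Texas'   -- assumed filter column
);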
Now, select the top one from the above result that is in ascending order of
mkt_price.
41. Using the product and sales order detail table, find
the products with total units sold greater than 1.5
million.
We can use an inner join to get records from both the tables. We’ll join the tables
based on a common key column, i.e., ProductID.
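A minimal sketch of such a query, run against an in-memory SQLite database. Only the ProductID key comes from the text. The table names, the `Name` and `OrderQty` columns, and all rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (ProductID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE sales_order_detail (ProductID INTEGER, OrderQty INTEGER);
    INSERT INTO product VALUES (1, 'Widget'), (2, 'Gadget'), (3, 'Gizmo');
    INSERT INTO sales_order_detail VALUES
        (1, 900000), (1, 700000),
        (2, 1000000), (2, 400000),
        (3, 2000000);
""")

# Join on the common key, total the units per product, then keep only
# the products whose total exceeds 1.5 million.
rows = conn.execute("""
    SELECT p.Name, SUM(s.OrderQty) AS total_units
    FROM product AS p
    INNER JOIN sales_order_detail AS s ON p.ProductID = s.ProductID
    GROUP BY p.Name
    HAVING SUM(s.OrderQty) > 1500000
    ORDER BY p.Name
""").fetchall()
conn.close()
```

The threshold goes in HAVING rather than WHERE because it applies to the aggregated total, not to individual order lines.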
Prepare for this question thoroughly before your next data analyst interview. A
stored procedure is an SQL script that is saved in the database so that a task can
be run repeatedly.
Let’s look at an example that creates a stored procedure to find the sum of the
squares of the first N natural numbers.
• Create a procedure by giving a name, here it’s squaresum1
Output: Display the sum of the squares of the first four natural numbers
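The arithmetic such a procedure performs can be sketched in Python. This is not the stored procedure itself, only the computation it is described as carrying out. Both function names are hypothetical.

```python
def sum_of_squares(n):
    """Add up 1^2 + 2^2 + ... + n^2 with an explicit loop."""
    total = 0
    for i in range(1, n + 1):
        total += i * i
    return total

def sum_of_squares_closed_form(n):
    """Same result from the closed form n(n + 1)(2n + 1) / 6."""
    return n * (n + 1) * (2 * n + 1) // 6

result = sum_of_squares(4)  # 1 + 4 + 9 + 16
```

For N = 4 the result is 30, and the closed form gives the same value.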
All the combined sheets or tables contain a common set of dimensions and measures.
Meanwhile, in data blending, each data source contains its own set of dimensions and
measures.
LOD in Tableau stands for Level of Detail. An LOD expression is used to execute
complex queries involving many dimensions at the data source level. Using LOD
expressions, you can find duplicate values, synchronize chart axes, and create bins on
aggregated data.
- Improved Model Performance: By selecting the most relevant features, the model
can focus on the most informative variables, leading to better predictive accuracy
and generalization.
- Overfitting Prevention: Including irrelevant or redundant features can lead to
overfitting, where the model learns noise or specific patterns in the training data that
do not generalize well to new data. Feature selection mitigates this risk.
- Interpretability and Insights: A smaller set of selected features makes it easier to
interpret and understand the model's results, facilitating insights and actionable
conclusions.
- Computational Efficiency: Working with a reduced set of features can significantly
improve computational efficiency, especially when dealing with large datasets.
Extract: An extract is an image of the data that is extracted from the data source
and placed into the Tableau repository. This image (snapshot) can be refreshed
periodically, either fully or incrementally.
Live: The live connection makes a direct connection to the data source. The data will
be fetched straight from tables. So, data is always up to date and consistent.
Joins in Tableau work similarly to the SQL join statement. Below are the types of
joins that Tableau supports:
• Inner Join
A Gantt chart in Tableau depicts the progress of a value over a period, i.e., it shows
the duration of events. It consists of bars along a time axis. The Gantt chart is
mostly used as a project management tool where each bar is a measure of a task in
the project.
• Drag Category and Subcategory columns into Rows, and Sales on to Columns. It
will result in a horizontal bar chart.
• Drag Profit on to Colour, and Quantity on to Label. Sort the Sales axis in
descending order of the sum of sales within each sub-category.
• Drag the Order Date field from Dimensions on to Columns, and convert it into
continuous Month.
• Drag Sales on to Rows, and Profits to the right corner of the view until you see a
light green rectangle.
• Drag the Country field on to the view section and expand it to see the States.
• Drag the Sales field on to Size, and Profit on to Colour.
• Increase the size of the bubbles, add a border, and halo color.
From the above map, it is clear that states like Washington, California, and New York
have the highest sales and profits, while Texas, Pennsylvania, and Ohio have good
sales but the lowest profits.
Treemaps: Treemaps are used to display data in nested rectangles. You use
dimensions to define the structure of the treemap, and measures to define the size or
color of the individual rectangles.
Heatmaps: Heat maps can visualize measures against dimensions with the help of
colors and size to differentiate one or more dimensions and up to two measures. The
layout is like a text table with variations in values encoded as colors.
• Give a name to the set and select the top tab to choose the top 5 customers by
sum(profit)
• Similarly, create a set for the bottom five customers by sum(profit)
• Select both the sets, right-click to create a combined set. Give a name to the set
and choose All members in both sets.
• Drag top and bottom customers set on to Filters, and Profit field on to Colour to
get the desired result.
• By initializing a list
• By initializing a dictionary
57. Write the Python code to create an employee’s data
frame from the “emp.csv” file and display the head and
summary.
To create a DataFrame in Python, you need to import the Pandas library and use the
read_csv function to load the .csv file. Provide the correct file path, including the file
name and its extension.
You can use the column names to extract the desired columns.
Since the value eight is present in the 2nd row of the 1st column, we use the same
index positions and pass it to the array.
Since we only want the odd numbers from 0 to 9, you can perform the modulus
operation and check if the remainder is equal to 1.
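The modulus test can be sketched in plain Python as follows. With a NumPy array `arr`, the same condition would be written as the boolean mask `arr[arr % 2 == 1]`.

```python
# Keep each number from 0 to 9 whose remainder after division by 2 is 1.
odd_numbers = [n for n in range(10) if n % 2 == 1]
```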
61. There are two arrays, ‘a’ and ‘b’. Stack the arrays a
and b horizontally using the NumPy library in Python.
You can use either the concatenate() or the hstack() function to stack the arrays.
Suppose there is an emp data frame that has information about a few employees.
Let’s add an Address column to that data frame.
Declare a list of values that will be converted into an address column.
To find the unique values and the number of unique elements, use the unique() and
nunique() functions.
• Group the company column and use the mean function to find the average sales
So, those were the 65+ data analyst interview questions that can help you crack your
next data analyst interview and help you become a data analyst.
Conclusion
Now that you know the different data analyst interview questions that can be asked
in an interview, it will be easier to prepare for your upcoming interviews. Here, you
looked at various data analyst interview questions based on difficulty level. We
hope this article on data analyst interview questions is useful to you.