Crack The Coding Interview
Implicit measures are automatically created by Power BI and are quick and easy to use, while
explicit measures provide more customization and flexibility but require DAX formula-writing skills.
Power BI creates implicit measures based on the aggregation applied to a field. For example, if we create a bar chart and drag a field to
the Values area, Power BI automatically creates a sum aggregation, which becomes an implicit
measure. Implicit measures are also created when we use the Quick Measure feature in Power BI.
They are automatically created by Power BI, saving time and effort.
Do not require any DAX formula
Work well with simple aggregations (sum or count)
Explicit measures are created by the user using DAX formulas. They are highly customizable and fit for more
complex calculations. Use the New Measure option in the Modeling tab to create an explicit measure.
Example
Let’s consider we have a dataset containing sales data for a company. We want to create a measure that
calculates the SALES GROWTH from the previous month. To do this, we apply a DAX formula.
Measure:
With variable:
WITH salary_ranking AS
( SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
FROM employees )
SELECT salary
FROM salary_ranking
WHERE rnk = N;
SELECT
MONTH(record_date) AS month,
MAX(CASE WHEN data_type = 'max' THEN data_value END) AS max_temp,
MIN(CASE WHEN data_type = 'min' THEN data_value END) AS min_temp,
ROUND(AVG(CASE WHEN data_type = 'avg' THEN data_value END)) AS avg_temp
FROM
temperature_records
GROUP BY
MONTH(record_date);
WITH numbered_transactions AS (
SELECT
sender,
dt AS transaction_time,
amount,
ROW_NUMBER() OVER (PARTITION BY sender ORDER BY dt) AS rn
FROM transactions
),
transaction_sequences AS (
SELECT
t1.sender,
t1.transaction_time AS sequence_start,
MAX(t2.transaction_time) AS sequence_end,
COUNT(t2.rn) AS transactions_count,
SUM(t2.amount) AS transactions_sum
FROM numbered_transactions t1
JOIN numbered_transactions t2
ON t1.sender = t2.sender
AND t2.rn >= t1.rn
AND t2.transaction_time <= t1.transaction_time + INTERVAL 1 HOUR
GROUP BY t1.sender, t1.transaction_time
)
)
SELECT
sender,
sequence_start,
sequence_end,
transactions_count,
ROUND(transactions_sum, 6) AS transactions_sum
FROM transaction_sequences
WHERE transactions_sum >= 150
ORDER BY sender, sequence_start;
Tuple vs List: tuples are immutable and typically used for fixed collections of values (they can also serve as dictionary keys), while lists are mutable and suited to collections that grow or change.
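A quick illustration of the difference (minimal sketch):
point = (3, 4)               # tuple: immutable, fixed structure
scores = [10, 20, 30]        # list: mutable, can grow or change
scores.append(40)            # works
# point[0] = 5               # would raise TypeError: tuples cannot be modified
locations = {point: "home"}  # tuples can be dictionary keys; lists cannot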
My favorite ML algorithm is XGBoost (eXtreme Gradient Boosting). It's a powerful ensemble learning
technique that combines decision trees with gradient boosting. XGBoost excels at handling large datasets, has
strong regularization capabilities, is robust to overfitting, and generally delivers better performance.
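A minimal usage sketch with the xgboost library (the dataset and parameter values here are placeholders, not from any specific project):
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# reg_lambda adds L2 regularization, one of the levers that helps control overfitting
model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1, reg_lambda=1.0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))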
Classification algorithms predict categorical outcomes. Examples: decision trees, random forests, support
vector machines, and neural networks.
They can handle both binary classification (two classes) and multi-class classification (more than two classes).
Logistic Regression: a linear model for binary (or multi-class) classification that outputs class probabilities via the sigmoid (or softmax) function.
The bias-variance trade-off is a fundamental concept in machine learning that describes the balance between
underfitting and overfitting. The goal is to find a model that balances bias and variance to achieve optimal
performance on UNSEEN DATA.
Bias: error due to overly simple assumptions the model makes about the data; high bias leads to underfitting and
poor performance on both training and testing data.
Variance: the model's sensitivity to the training data due to model complexity; high variance leads to overfitting: good
performance on training data but poor performance on unseen data (testing data).
The 4 main concepts of data flow to ensure efficient data movement, transformation & data-driven workflows:
1. Source: The origin of the data, which could be structured or unstructured, such as databases, APIs, sensors, or
external systems.
2. Transformation: The process of converting, cleaning, or enriching the data to make it suitable for analysis.
This includes filtering, aggregating, or applying business logic.
3. Destination: Where the processed data is stored or delivered, such as a data warehouse, cloud storage, or a
database for further use.
4. Control Flow: Manages execution order and dependencies between tasks and processes in the data pipeline.
The optimized way to load data into Azure Synapse from Azure Data Factory is by using the COPY statement,
which enables efficient BULK DATA LOADING from external data sources.
Apache Spark + Apache Kafka: Spark is used for batch processing, while Kafka handles streaming data.
They can be integrated to create a complete data processing pipeline.
Apache Airflow + Apache Hadoop/Spark: Airflow can orchestrate workflows that involve BOTH batch and
streaming components, using Hadoop or Spark for the actual processing.
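A minimal Spark Structured Streaming sketch that reads from Kafka (the broker address and topic name are placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

# Read a Kafka topic as a streaming DataFrame
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())

# Kafka delivers key/value as binary; cast the value to string before parsing
parsed = events.selectExpr("CAST(value AS STRING) AS json_payload")

query = parsed.writeStream.format("console").outputMode("append").start()
query.awaitTermination()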
Choosing Simpler Algorithms Over Advanced ML Models (which algorithm did you use and why?)
I used a Random Forest algorithm because it's effective for handling both numerical and categorical
features. Its ensemble nature also helps to reduce overfitting
Complexity: Deep learning models like neural networks can be computationally
expensive to train and deploy, especially for smaller datasets.
Interpretability: A simpler algorithm like collaborative filtering might be more
interpretable, allowing the business to understand why specific products are
recommended.
Performance: For a relatively simple task like product recommendation, a simpler
algorithm might perform just as well as a more complex one.
In short, the choice depends on suitability for the data type, computational efficiency, interpretability, and performance metrics.
I'm familiar with building REST APIs using Flask and deploying them on a cloud platform. I also ensure scalability
by using containerization with Docker and Kubernetes. Beyond deployment, I really enjoy actively monitoring and
optimizing the performance of ML models that are already in production.
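A minimal sketch of serving a trained model behind a Flask REST endpoint (the model file name and request payload layout are placeholders):
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load a previously trained model from disk (hypothetical artifact)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]   # expects {"features": [[...], ...]}
    predictions = model.predict(features).tolist()
    return jsonify({"predictions": predictions})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)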
Boosting vs Bagging: both are powerful techniques for improving model performance, but they take different approaches.
Aspect | Boosting | Bagging
Iterative | Yes | No
Adaptive | Yes | No
Error Correction | Focuses on errors | Doesn't directly focus on errors
Base Model Weights | Assigns weights | All base models have equal weight
Bias-Variance Trade-off | Often reduces both bias and variance | Primarily reduces variance
Boosting is often preferred for complex tasks or when high accuracy is required.
Bagging reduces variance and prevents overfitting.
My goal would be to provide support and encouragement, and help them make an informed decision. Key points: identify the actual issue |
potential bottleneck | root cause analysis | be open-minded and compassionate enough to listen fully | open communication |
mental health and well-being | sometimes a change of pace can be refreshing.
Data Quality:
o Is the data clean and free from errors (e.g., missing values, inconsistencies)?
o What is the data's source and reliability?
o Are there any privacy or ethical concerns regarding the data?
Data Relevance:
o Is the data relevant to the problem you are trying to solve?
o Are there any missing features that might be important?
Data Volume:
o Is there enough data to train a reliable model?
o Are there any class imbalances that might affect model performance?
Data Format:
o Is the data in a suitable format for machine learning (e.g., CSV, JSON)?
o Are there any specific data preprocessing steps required?
EVALUATE Object Detection Models
Performance Metrics:
Mean Average Precision (mAP): Measures the overall accuracy of the model,
considering both localization and classification.
Precision: The proportion of correct detections out of all predicted detections.
Recall: The proportion of correctly predicted detections out of all actual
detections.
Structured files are typically used for easily searchable, well-organized data, while
unstructured files can range from multimedia to raw text files.
Tell me about your experience with cloud-based ML tools. How have you utilized them in previous
projects?
Focus on cloud ML experience. Highlight my work with AWS at BMO, where I integrated AI-enabled solutions
and migrated data using AWS Direct Connect.
When we cache data in Spark, where is it stored? In Spark memory or user memory?
It is stored in Spark memory, specifically in the memory allocated for the Spark executor.
What do you understand by driver node and worker node, and how are their responsibilities divided?
The driver node handles the orchestration, including task scheduling and communication, while worker nodes
execute tasks and perform the actual data processing.
How do I read and write data to and from Azure Blob Storage?
Use the wasbs:// protocol in Databricks to read from and write to Azure Blob Storage by configuring storage
account keys.
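A minimal Databricks/PySpark sketch (storage account, container, secret scope, and file paths are placeholders):
# Configure access with the storage account key (placeholder names)
spark.conf.set(
    "fs.azure.account.key.mystorageacct.blob.core.windows.net",
    dbutils.secrets.get(scope="my-scope", key="storage-key")
)

# Read from and write to Blob Storage using the wasbs:// protocol
df = spark.read.csv("wasbs://raw@mystorageacct.blob.core.windows.net/sales.csv", header=True)
df.write.mode("overwrite").parquet("wasbs://curated@mystorageacct.blob.core.windows.net/sales/")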
Have you ever worked with JSON data? How do you flatten nested JSON data?
Yes, flatten nested JSON data using PySpark's explode() or selectExpr() functions to convert nested fields into a
tabular format.
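A minimal PySpark sketch, assuming a JSON file with an 'orders' array and a nested 'customer' struct (both hypothetical field names):
from pyspark.sql.functions import explode, col

df = spark.read.option("multiline", "true").json("/mnt/raw/orders.json")

flat = (df
        .withColumn("order", explode(col("orders")))       # one row per array element
        .select(
            col("customer.id").alias("customer_id"),        # pull nested struct fields to the top level
            col("customer.name").alias("customer_name"),
            col("order.order_id"),
            col("order.amount")
        ))
flat.show()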
How do you handle corrupted data while reading from a source? Provide a code example where you read data
from a CSV file and store the corrupted records.
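A PySpark sketch for this, using PERMISSIVE mode with a _corrupt_record column (file paths and the schema are placeholders):
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("_corrupt_record", StringType(), True)   # malformed rows land here
])

df = (spark.read
      .option("header", "true")
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema(schema)
      .csv("/mnt/raw/input.csv"))

df.cache()  # needed before filtering only on the corrupt-record column
corrupted = df.filter(col("_corrupt_record").isNotNull())
corrupted.write.mode("overwrite").json("/mnt/quarantine/corrupt_records")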
What methods are you using in your project for incremental load?
Incremental load is handled using delta tables with time-stamp or version-based mechanisms, or by using
watermarking in streaming processes.
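A minimal Delta Lake upsert sketch for an incremental load (table path and key column are placeholders):
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/mnt/curated/customers")
updates = spark.read.parquet("/mnt/staging/customers_incremental")

# Upsert: update rows for existing keys, insert rows for new keys
(target.alias("t")
    .merge(updates.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())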
Data preprocessing, especially with unstructured data? NoSQL database & Automated Data Pipelines
BMO
Explain a complex SQL query you have written to solve a specific problem, highlight your SQL work,
especially your optimization of query response times at Edmonton Airport
Excluding Gas/electricity & hydro bills, Mobile, wifi payment, car insurance, maintenance,
fuel
Adjust the date parameters to retrieve prices for the last day, week, and month as
shown above.
Ensure to replace "TICKER" with the actual stock symbol you are interested in.
Write a query to join two tables and filter results based on a specific condition.
SELECT a.column1, b.column2
FROM table_a a
JOIN table_b b ON a.common_field = b.common_field
WHERE a.condition_column = 'specific_value';
This query retrieves data from both tables where the join condition is met and
applies an additional filter.
Explain the difference between `INNER JOIN`, `LEFT JOIN`, `RIGHT JOIN`, and `FULL OUTER JOIN`.
INNER JOIN: Returns records with matching values in both tables.
LEFT JOIN: Returns all records from the left table and matched records from the right table; unmatched records
from the right will show as NULL.
RIGHT JOIN: Returns all records from the right table and matched records from the left table; unmatched
records from the left will show as NULL.
FULL OUTER JOIN: Returns all records when there is a match in either the left or right table; unmatched rows on either side show as NULL.
How would you use a subquery to filter results based on a condition in another table?
SELECT column_name
FROM table_a
WHERE column_name IN (SELECT column_name FROM table_b WHERE condition);
This retrieves records from table_a where the column matches values from a
subquery on table_b.
Explain how you would use a window function to calculate a running total or moving average.
SELECT column_name,
SUM(column_name) OVER (ORDER BY order_column) AS running_total
FROM table_name;
This computes a cumulative sum of column_name ordered by order_column
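For a moving average, the same pattern applies with AVG and an explicit window frame, e.g. AVG(column_name) OVER (ORDER BY order_column ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) for a 7-row moving average.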
What are CTEs, and how would you use them to simplify complex queries?
Common Table Expressions (CTEs): simplify complex queries by breaking them
into manageable pieces, better readability and organization of SQL code
WITH cte_name AS (
SELECT column_name
FROM table_name
WHERE condition
)
SELECT * FROM cte_name;
Question: write SQL queries that manipulate two sample tables and display certain results, assuming two
Tables of Employees and Departments
This query joins the `Employees` and `Departments` tables on the `DepartmentID` and displays the first name,
last name, and department name for each employee
This query retrieves the first name, last name, and salary of employees who earn more than $70,000.
This query counts the number of employees in each department, including departments with no employees.
UPDATE Employees
SET Salary = Salary * 1.10
WHERE DepartmentID = (SELECT DepartmentID FROM Departments WHERE DepartmentName =
'Engineering');
This query increases the salary of employees in the 'Engineering' department by 10%.
This query calculates and displays the average salary for each department.
Modeling the data (semantic model) includes defining and creating relationships between the tables.
VLOOKUP: Function that searches for a value in the first column of a table and returns a value in the same row
from a specified column. It's useful for looking up data VERTICALLY.
HLOOKUP: Similar to VLOOKUP, but it searches for a value in the first row of a table and returns a value
from a specified row, making it suitable for HORIZONTAL lookups.
TRIM function to remove extra spaces, Find and Replace to correct errors, Text to Columns to split data.
Conditional Formatting: highlights cells based on certain criteria; set RULES on a RANGE to format cells that meet
conditions, such as values above a certain threshold.
How would you set up data validation to restrict input values in a cell or range of cells?
select the cells, go to the **Data** tab, click on **Data Validation**, and set criteria (e.g., whole numbers,
lists) to control what can be entered.
Can you explain how to create and use macros to automate repetitive tasks in Excel?
To automate repetitive tasks in Excel, you can create macros by recording actions. Go to the **View** tab,
select **Macros**, and choose **Record Macro**. Perform the tasks you want to automate, then stop
recording. You can run the macro later to repeat those actions.
Pagination with Pivoting: You will be given a list of items, and the aim is to implement pagination around a
pivot element. It should be circular.
My approach ensures the pivot remains central while allowing circular navigation through the list
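The exact specification isn't given here; a minimal sketch assuming the page is centered on the pivot element and wraps around circularly:
def paginate_around_pivot(items, pivot_index, page_size):
    # Return a page of page_size items centered on the pivot,
    # wrapping circularly when the window runs past either end.
    n = len(items)
    half = page_size // 2
    return [items[(pivot_index - half + i) % n] for i in range(page_size)]

# e.g. paginate_around_pivot(list(range(10)), pivot_index=8, page_size=5) -> [6, 7, 8, 9, 0]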
At Minerva Analytics, I used media mix modeling (MMM) to OPTIMIZE marketing budgets, which helped increase
sales by 15% and maximize ROI.
I implemented NLP techniques in the TagAI project at BMO, where I used SYNONYM DETECTION,
ACRONYM EXPANSION, and named entity recognition (NER) to automate metadata tagging and improve
data discovery. This system leveraged active learning to enhance accuracy over time, contributing to more
efficient data governance and searchability across large databases.
The key metrics for EVALUATING AND OPTIMIZING machine learning models vary depending on whether
it's a classification or regression problem.
Classification Metrics:
1. Accuracy: The proportion of all predictions that are correct.
2. Precision: The proportion of predicted positives that are actually positive.
3. Recall: The proportion of actual positives that the model correctly identifies.
4. F1 Score: The harmonic mean of precision and recall, useful when classes are imbalanced.
5. AUC-ROC Curve: Measures the ability of the model to distinguish between classes; the Area Under
the Curve (AUC) quantifies overall performance.
6. Confusion Matrix: Provides a breakdown of true positives, false positives, true negatives, and false
negatives to visualize model performance.
Regression Metrics:
1. Mean Absolute Error (MAE): The average absolute difference between predicted and actual values.
2. Mean Squared Error (MSE): The average squared difference between predicted and actual values,
penalizing larger errors more.
3. Root Mean Squared Error (RMSE): The square root of MSE, providing a more interpretable measure
in the same units as the target variable.
4. R-squared (R²): Indicates the proportion of variance explained by the model; higher values mean better
performance.
Optimization Metrics:
1. Learning Curve: Tracks model performance over time (training and validation losses) to detect
overfitting or underfitting.
2. Hyperparameter Tuning: Metrics like cross-validation score help optimize model parameters (e.g.,
learning rate, depth of trees).
3. Model Interpretability: Use of metrics like SHAP values or feature importance to understand how
well the model is aligned with domain knowledge.
4. Computation Time and Resource Usage: Optimization considers efficiency, ensuring the model
performs well without excessive computation.
Model Monitoring: KPIs in a dashboard; as new datasets / unseen data arrive, re-evaluate performance metrics
(Accuracy, Precision, Recall, F1 Score).
Model Drifting refers to the DEGRADATION IN THE Performance of a Machine Learning Model over time
due to CHANGES in the underlying data distribution. It is important to monitor, especially in production
systems, as it can result in the model becoming less reliable or relevant, requiring further adjustments
Data Drift (Covariate Shift): Occurs when the distribution of the input data changes over time compared to the
data the model was originally trained on. This can lead to poor predictions, as the model is no longer seeing data similar
to its training set.
Concept Drift: Happens when the RELATIONSHIP between the Target Variable & Input Features changes
over time. Model's ASSUMPTIONS about the task become outdated, leading to inaccurate predictions.
Model Interpretation (Global & Local): refers to the process of understanding how a model makes
predictions. It involves analyzing the model's internal workings to see which features or data points are influencing
its output. This ensures transparency, trust, and accountability, especially in critical real-world applications
across banking, healthcare, and insurance.
Model performance and evaluation metrics should be defined as quantitative measures of how well a machine
learning model makes predictions, ensuring alignment with the specific task's objectives, such as accuracy.
Precision: proportion of true positive predictions out of all predicted positives, measuring accuracy for the positive class.
Recall: measures the model's ability to detect all positive instances.
F1 score: balances both metrics to give a single measure of the model's accuracy, especially when the data is
imbalanced.
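A quick scikit-learn sketch computing these metrics (the label arrays are made-up example data):
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual labels (example data)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions (example data)

print("accuracy:", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
print("f1:", f1_score(y_true, y_pred))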
Classification Problem: is to categorize or classify data points into predefined **labels or categories**, such
as spam detection or medical diagnosis.
Confusion Matrix: A table used in **classification problems** to summarize the performance of a
model by showing the counts of true positives, false positives, true negatives, and false negatives.
Regression: focus on predicting a **continuous output variable** based on input features, such as predicting
insurance fees or house prices or stock values.
Azure Machine Learning for building, training, and deploying machine learning models at scale, especially
for automating data workflows and integrating AI solutions into business operations.
A/B testing is crucial for **data-driven optimization** and **improving performance**: it compares two versions
(A and B) of a PRODUCT or FEATURE to determine which one performs better. It provides data-driven insights.
Measuring impact: It helps identify which changes or strategies improve key performance metrics like
conversions, ENGAGEMENT, OR SALES.
Reducing guesswork: Instead of relying on assumptions, A/B testing relies on EMPIRICAL EVIDENCE to
make informed decisions.
Optimizing user experience: ensures the best version is implemented based on actual user preferences.
Minimizing risk: Testing on a smaller sample before a full-scale rollout reduces the risk of implementing
ineffective changes.
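To decide which version "performs better", a significance test is usually run on the two samples; a minimal sketch with statsmodels (conversion counts are hypothetical):
from statsmodels.stats.proportion import proportions_ztest

conversions = [180, 210]    # conversions for variants A and B (made-up numbers)
visitors = [2000, 2000]     # sample size for each variant

stat, p_value = proportions_ztest(conversions, visitors)
print(p_value)   # p < 0.05 would suggest a statistically significant difference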
Supervised machine learning model is trained on labeled data, meaning the input-output pairs are
known.
The goal is to predict the output for new inputs. Predicting house prices using historical data.
Application: Email spam detection, fraud detection.
Unsupervised Machine Learning Model is trained on unlabeled data, and the goal is to find hidden
patterns or groupings within the data. Customer segmentation in marketing using clustering. Market
basket analysis, anomaly detection.
Deep Learning: excels in tasks like image recognition and NLP. A subset of machine learning that
uses **neural networks with many layers** to model COMPLEX PATTERNS in data. Image
classification using **Convolutional Neural Networks (CNNs)**. Self-driving cars, speech recognition.
Backpropagation: the process where the ERROR from the output is propagated backward through the network to update the
weights and improve the model's accuracy. Example: fine-tuning WEIGHTS in deep
learning networks for image classification or facial recognition.
NLP (Natural Language Processing): A field of AI focused on the interaction between computers and
human language, enabling machines to understand, interpret, and generate text. Chatbots using
**GPT-3** for generating human-like text. Sentiment analysis, machine translation, speech
recognition.
Computer Vision: A field of AI focused on enabling machines to interpret and understand visual
data, such as images and videos. Object detection using **CNNs** in images. Facial recognition,
medical imaging, autonomous vehicles.
Time Series Forecasting Model: A model used to PREDICT FUTURE VALUES BASED ON
PREVIOUSLY OBSERVED DATA, typically involving data that is dependent on time. ARIMA model
for predicting stock prices.
Python is widely used in data science for data manipulation, analysis, and visualization.
Key libraries include:
NumPy: supports large datasets and multi-dimensional numerical data, working with arrays and matrices,
along with a collection of mathematical functions to operate on these arrays.
import numpy as np
zeros = np.zeros((3, 3))
ones = np.ones((3, 3))
rand_array = np.random.rand(3, 3)
Pandas: offers data manipulation and analysis, providing data structures and operations for
manipulating numerical tables and time series.
- Reading & Writing a CSV file, Selecting column, - Filtering rows
- Group by and aggregate functions, - Pivot tables, - Joining dataframes: pd.merge(df1, df2, on='key_column',
how='left')
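A small sketch of the operations above (the file name and column names are placeholders):
import pandas as pd

df = pd.read_csv("sales.csv")                                 # reading a CSV file
high_value = df[df["amount"] > 100]                           # filtering rows
by_region = df.groupby("region")["amount"].sum()              # group by and aggregate
pivot = df.pivot_table(values="amount", index="region", columns="month", aggfunc="sum")
merged = pd.merge(df, high_value, on="order_id", how="left")  # joining dataframes
df.to_csv("sales_clean.csv", index=False)                     # writing a CSV file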
Machine Learning (ML): Libraries like TensorFlow, PyTorch, and Scikit-learn are used for developing
machine learning models.
Deep Learning: TensorFlow and PyTorch also provide frameworks for deep learning, which can be
used to develop complex models that simulate human neural networks.
Natural Language Processing (NLP): Libraries such as NLTK and spaCy allow for the processing and
analysis of human language data.
Computer Vision: OpenCV and PIL (Python Imaging Library) are used for image processing tasks.
Data Manipulation & Analysis: Pandas for data cleaning, transformation, and preparation. It offers powerful
data manipulation capabilities that simplify tasks like data filtering, aggregation, and visualization, and it
allows handling of large datasets.
Data Visualization: Matplotlib, Seaborn to create meaningful charts and graphs.
Statistical Analysis: to understand trends and patterns and to apply inferential statistics.
Machine Learning: scikit-learn library to implement predictive models.
Data Mining: Techniques to extract valuable information from large datasets. Scikit-learn, TensorFlow, and
Keras offer everything from regression models to complex neural networks.
Predicting economic trends, analyzing customer behavior, or building recommendation systems
Python scripts provide a quick way to automate data processing tasks, as well as general automation and monitoring.
In the finance sector, Python is used for quantitative and qualitative analysis, financial modeling, and
algorithmic trading. Libraries like pandas have become a popular choice for modeling financial data and risk
management.
Control Flow
- if, else, elif: conditional branching.
- for, while: Looping
- break: Exits from the loop.
- continue: Skips the current iteration and continues with the next one.
- pass: A placeholder that does nothing
Exception Handling
- try, except: blocks that catch and handle exceptions.
- raise: Used to raise an exception.
- assert: For debugging purposes.
Functions
- def: Defines a function.
- return: Exits a function, returns a value.
- lambda: Creates an anonymous function.
- yield: returns a value and pauses the execution.
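A tiny illustration of these keywords:
def squares(n):
    for i in range(n):
        yield i * i          # yield: returns a value and pauses execution until the next one is requested

double = lambda x: x * 2     # lambda: anonymous one-line function
print(list(squares(4)))      # [0, 1, 4, 9]
print(double(21))            # 42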
Web development using Python: Django and Flask provide tools and libraries to create secure,
scalable, and maintainable web sites and services.
Question: Given a 2D binary matrix, write a solution to make the image symmetric along the X and Y
axes. The only operations allowed are full row and full column insertions without modifying the values in
the original matrix. The goal is to find the minimum number of row and column insertions.
To make a 2D binary matrix symmetric along both X and Y axes, compare rows and columns with their
counterparts, and count the necessary insertions. The goal is to achieve symmetry with minimal row and column
insertions.
def min_insertions_for_symmetry(matrix):
    n = len(matrix)
    m = len(matrix[0])
    row_insertions = 0
    col_insertions = 0
    # Compare each row with its mirror row (X-axis symmetry); every mismatched pair needs an insertion
    for i in range(n // 2):
        if matrix[i] != matrix[n - 1 - i]:
            row_insertions += 1
    # Compare each column with its mirror column (Y-axis symmetry)
    for j in range(m // 2):
        if [row[j] for row in matrix] != [row[m - 1 - j] for row in matrix]:
            col_insertions += 1
    return row_insertions + col_insertions

# Example matrix
matrix = [
[1, 0, 1],
[0, 1, 0],
[1, 0, 1]
]
result = min_insertions_for_symmetry(matrix)
print(result)  # 0 - this example is already symmetric along both axes