Crack The Coding Interview

The document compares implicit and explicit measures in Power BI, highlighting that implicit measures are quick and easy but limited in functionality, while explicit measures offer customization and flexibility but require DAX skills. It also discusses SQL queries for data analysis, machine learning concepts, data processing tools, and the importance of data quality and relevance. Additionally, it covers model evaluation metrics, anomaly detection methods, and the differences between structured and unstructured data file types.

Choosing between implicit and explicit measures is like choosing between speed and customization in Power BI - it all
depends on the specific needs of our data visualization.

Implicit measures are automatically created by Power BI and are quick and easy to use, while
explicit measures provide more customization and flexibility but require DAX formula-writing skills.

Power BI creates implicit measures based on the default aggregation of a field. For example, if we create a bar chart and drag a field to
the Value area, Power BI automatically creates a sum aggregation, which becomes an implicit
measure. Implicit measures are also created when we use the Quick Measure feature in Power BI.

Advantages of Implicit Measures:

 They are automatically created by Power BI, saving time and effort.
 They do not require any DAX formulas.
 They work well with simple aggregations (sum or count).

Disadvantages of Implicit Measures:

 They have limited functionality, as they are pre-defined by Power BI.
 They are not suitable for complex or more advanced calculations.
 They are difficult to modify or customize, as they are created automatically.

Explicit measures are created by the user using DAX formulas. They are highly customizable and suited to more
complex calculations. Use the New Measure option in the Modeling tab to create an explicit measure.

Advantages of Explicit Measures:

 Provide more flexibility and functionality than implicit measures.
 Customizable to meet specific business needs.
 Support more complex calculations (ratios or percentages).

Disadvantages of Explicit Measures:

 Require DAX formula writing.
 Take more time and effort to create than implicit measures.
 Require more maintenance and updates as the data model changes.

Example
Let’s consider we have a dataset containing sales data for a company. We want to create a measure that
calculates SALES GROWTH versus the same period a year earlier (year-over-year). To do this, we apply a DAX formula.

Measure:

Sales YoY Growth =


DIVIDE (
( [Sales] - CALCULATE ( [Sales], PARALLELPERIOD ( 'Date'[Date], -12, MONTH ) ) ),
CALCULATE ( [Sales], PARALLELPERIOD ( 'Date'[Date], -12, MONTH ) )
)

With variable:

Sales YoY Growth =


VAR SalesPriorYear =
CALCULATE ( [Sales], PARALLELPERIOD ( 'Date'[Date], -12, MONTH ) )
VAR SalesVariance =
DIVIDE ( ( [Sales] - SalesPriorYear ), SalesPriorYear )
RETURN
SalesVariance

Employees with the Highest Salary in Each Department


Query to find employees with the highest salary department-wise:

SELECT department, employee, salary
FROM (
    SELECT department, employee, salary,
           RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rnk
    FROM employees
) AS ranked
WHERE rnk = 1;

Nth Highest Salary 🥇 Find the Nth highest salary in a table:

WITH salary_ranking AS (
    SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
    FROM employees
)
SELECT salary
FROM salary_ranking
WHERE rnk = N;

Identify employees whose salary has increased for 3 consecutive years:

WITH ranked_salaries AS (
    SELECT employee_id, year, salary,
           LAG(salary, 1) OVER (PARTITION BY employee_id ORDER BY year) AS prev_salary_1,
           LAG(salary, 2) OVER (PARTITION BY employee_id ORDER BY year) AS prev_salary_2
    FROM employee_salaries
)
SELECT employee_id
FROM ranked_salaries
WHERE salary > prev_salary_1 AND prev_salary_1 > prev_salary_2;

Find All Active Users for the Last 7 Days


How to find distinct users who logged in over the past 7 days:

SELECT COUNT(DISTINCT user_id) AS active_users
FROM user_logs
WHERE action = 'login' AND log_date >= CURRENT_DATE - INTERVAL '7 days';

Pivot monthly temperature readings into max, min, and average columns:

SELECT
    MONTH(record_date) AS month,
    MAX(CASE WHEN data_type = 'max' THEN data_value END) AS max_temp,
    MIN(CASE WHEN data_type = 'min' THEN data_value END) AS min_temp,
    ROUND(AVG(CASE WHEN data_type = 'avg' THEN data_value END)) AS avg_temp
FROM temperature_records
GROUP BY MONTH(record_date);

Find senders whose transactions within a rolling one-hour window add up to at least 150:

WITH numbered_transactions AS (
    SELECT
        sender,
        dt AS transaction_time,
        amount,
        ROW_NUMBER() OVER (PARTITION BY sender ORDER BY dt) AS rn
    FROM transactions
),
transaction_sequences AS (
    SELECT
        t1.sender,
        t1.transaction_time AS sequence_start,
        MAX(t2.transaction_time) AS sequence_end,
        COUNT(t2.rn) AS transactions_count,
        SUM(t2.amount) AS transactions_sum
    FROM numbered_transactions t1
    JOIN numbered_transactions t2
        ON t1.sender = t2.sender
        AND t2.rn >= t1.rn
        AND t2.transaction_time <= t1.transaction_time + INTERVAL 1 HOUR
    GROUP BY t1.sender, t1.transaction_time
)
SELECT
    sender,
    sequence_start,
    sequence_end,
    transactions_count,
    ROUND(transactions_sum, 6) AS transactions_sum
FROM transaction_sequences
WHERE transactions_sum >= 150
ORDER BY sender, sequence_start;

Precision tells us how accurate our model is at predicting positive cases,
while recall measures how well our model finds all the positive cases.

X-Y axes of ROC curves:
 X-axis: False Positive Rate (FPR): the proportion of negative instances that were incorrectly classified as positive.
 Y-axis: True Positive Rate (TPR): the proportion of positive instances that were correctly classified as positive.

Metrics for evaluating skewed data:
 F1-score: The harmonic mean of precision and recall, giving equal weight to both metrics.
This is particularly useful when the dataset is imbalanced.
 Precision-recall curve: visualizes the trade-off between the two metrics.
 ROC AUC (Area Under the ROC Curve): measures the overall performance of a
model across classification thresholds.

Tuple Vs List

my_list = [1, 2, 3] # Mutable list

my_tuple = (4, 5, 6) # Immutable tuple
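A quick demonstration of the difference (a minimal sketch; the values are arbitrary):

my_list = [1, 2, 3]
my_list[0] = 99            # lists are mutable: in-place updates work
my_list.append(4)          # and they can grow
print(my_list)             # [99, 2, 3, 4]

my_tuple = (4, 5, 6)
try:
    my_tuple[0] = 99       # tuples are immutable: this raises TypeError
except TypeError as err:
    print(f"Cannot modify a tuple: {err}")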

My favorite ML algorithm is XGBoost (eXtreme Gradient Boosting). It's a powerful ensemble learning
technique that combines decision trees with gradient boosting. XGBoost excels at handling large datasets, has
strong regularization capabilities, is robust to overfitting, and generally delivers strong performance.

Difference between Classification & Logistic Regression algorithms

Classification Algorithms:

 Predict categorical outcomes; examples include decision trees, random forests, support vector machines, and neural networks.
 Can handle both binary classification (two classes) and multi-class classification (more than two classes).

Logistic Regression:

 Used for binary classification; estimates the probability of an event.
 Uses the sigmoid function to map the linear combination of features to a probability between 0 and 1.

Bias-variance trade-off is a fundamental concept in machine learning that describes the balance between
underfitting and overfitting. The goal is to find a model that balances bias and variance to achieve optimal
performance on UNSEEN DATA.

 Bias: Refers to error from overly simple model assumptions about the data; high bias leads to underfitting and
poor performance on both training and testing data.

 Variance: Refers to the model's sensitivity to the training data due to model complexity; high variance leads to overfitting, with good
performance on training data but poor performance on unseen data (testing data).

The 4 main concepts of data flow to ensure efficient data movement, transformation & data-driven workflows:

1. Source: The origin of the data, which could be structured or unstructured, such as databases, APIs, sensors, or
external systems.

2. Transformation: The process of converting, cleaning, or enriching the data to make it suitable for analysis.
This includes filtering, aggregating, or applying business logic.

3. Destination: Where the processed data is stored or delivered, such as a data warehouse, cloud storage, or a
database for further use.

4. Control Flow: Manages execution order and dependencies between tasks and processes in the
data pipeline.

How is Cosmos DB different than Synapse or SQL Database?


Cosmos DB is a globally distributed, multi-model NoSQL database, while Synapse Analytics or SQL Database
are structured, relational data platforms primarily used for analytics and transactional workloads

What happens in the back-end when you submit a Spark Job?


When we submit a Spark job, the driver converts the code into a logical plan and a DAG of stages and tasks, the cluster
manager allocates executors on the cluster (for example, an Azure Databricks cluster), and the executors run the tasks in
parallel across the distributed environment before results are returned to the driver or written out.

The different types of triggers in Azure Data Factory are:


Schedule Trigger, Tumbling Window Trigger, Event Trigger, and Manual (on-demand) Trigger

The optimized way to load data into Azure Synapse from Azure Data Factory is by using the COPY statement,
which enables efficient BULK DATA LOADING from external data sources.

What tools for batch and stream processing?


The choice of tools for batch and stream processing depends on factors like data volume, latency, processing
speed, complexity and system integration.

For batch processing: Hadoop, Spark, Airflow


Stream processing: Kafka and Dataflow.
Often, a combination of tools is used to handle various data types and workloads. The best combination will
depend on specific requirements and use cases.

 Apache Spark + Apache Kafka: Spark is used for batch processing, while Kafka handles streaming data.
They can be integrated to create a complete data processing pipeline.

 Apache Airflow + Apache Hadoop/Spark: Airflow can orchestrate workflows that involve BOTH batch and
streaming components, using Hadoop or Spark for the actual processing.

Choosing Simpler Algorithms Over Advanced ML Models (which algorithm did you use and why?)

 I used a Random Forest algorithm because it's effective for handling both numerical and categorical
features. Its ensemble nature also helps to reduce overfitting
 Complexity: Deep learning models like neural networks can be computationally
expensive to train and deploy, especially for smaller datasets.
 Interpretability: A simpler algorithm like collaborative filtering might be more
interpretable, allowing the business to understand why specific products are
recommended.
 Performance: For a relatively simple task like product recommendation, a simpler
algorithm might perform just as well as a more complex one.
 The choice ultimately comes down to suitability for the data type, computational efficiency, interpretability, and performance metrics.

Standardizing data within a pipeline is essential for:

1. Scaling: ensures all features contribute on a comparable scale instead of being dominated by large-valued ones.
2. Improving model performance: many optimization methods, such as gradient descent, converge faster on scaled features.
3. Interpretability: makes the model's coefficients easier to compare across features.
4. Compatibility: many algorithms assume or work best with standardized inputs.
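A minimal scikit-learn sketch of standardization inside a pipeline (the synthetic dataset and the choice of logistic regression are assumptions for illustration):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real feature matrix
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The scaler is fitted on the training fold only, which avoids data leakage
pipe = Pipeline([
    ("scaler", StandardScaler()),                 # zero mean, unit variance
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))

Putting the scaler inside the pipeline also guarantees the same transformation is applied at prediction time.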

How did you deploy the ML solution?

Challenges or considerations during deployment, in terms of scalability, performance, and maintainability.

I have experience exposing models as REST APIs using Flask and deploying them on a cloud platform. I also ensure
scalability by using containerization with Docker and Kubernetes, and I actively monitor and optimize the
performance of models that are already in production.
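A minimal sketch of the kind of Flask scoring endpoint described above (the model file name, route, and payload shape are assumptions):

import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

# Hypothetical model serialized during training
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()                      # e.g. {"features": [[5.1, 3.5, 1.4, 0.2]]}
    prediction = model.predict(payload["features"])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

In a containerized deployment, an app like this would typically be served by a WSGI server inside a Docker image and scaled out by Kubernetes.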

Machine Learning algorithms: learn from data to make predictions or decisions.


Decision trees are a popular type of ML algorithm that split data into subsets based on
features to make classifications or predictions.
Recommendation Systems: suggest items or content to users based on USER
preferences or past behavior. Content-based filtering recommends items based on the
user's own preferences and item attributes, while collaborative filtering recommends
items based on similarities between users (or between items) learned from past behavior.

Boosting vs Bagging: both are powerful techniques for improving model performance, but they take different approaches.

Aspect                  | Boosting                             | Bagging
Iterative               | Yes                                  | No
Adaptive                | Yes                                  | No
Error Correction        | Focuses on errors                    | Doesn't directly focus on errors
Base Model Weights      | Assigns weights                      | All base models have equal weight
Bias-Variance Trade-off | Often reduces both bias and variance | Primarily reduces variance

 Boosting is often preferred for complex tasks or when high accuracy is required.
 Bagging reduces variance and helps prevent overfitting.

My goal would be to provide support and encouragement, and help to make an informed decision. Actual issue |
potential bottleneck | root cause analysis | Be open minded, compassionate to listen fully | Open communication
mental health and well-being | Sometimes a change of pace can be refreshing

Large Language Models vs. Traditional Language Models

Large Language Models (LLMs):


 Scale: LLMs are trained on massive amounts of text data, allowing them to
generate human-quality text, and answer our questions in an informative way.
 Contextual Understanding: LLMs can understand and respond to context,
making them more capable of complex tasks like summarization, creative writing, and
question answering.
Traditional Language Models:
 trained on smaller datasets compared to LLMs.
 struggle with tasks that require understanding of broader context.
 Primarily used for sentiment analysis, text classification

Dealing with Imbalanced Data in Multi-Class Classification


Oversampling:
o Duplicates instances from the minority class.
Undersampling:
o Removes instances from the majority class.
Class Weighting:
o Assigns higher weights to minority class instances during training to compensate for the
imbalance (see the sketch after this list).
Ensemble Methods:
o Combines multiple models trained on different subsets of the data.
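A minimal sketch of the class-weighting idea (the synthetic imbalanced dataset and the choice of random forest are assumptions; class_weight="balanced" is one common way to apply it in scikit-learn):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 3-class dataset with a deliberate imbalance (80% / 15% / 5%)
X, y = make_classification(
    n_samples=3000, n_features=12, n_informative=6,
    n_classes=3, weights=[0.80, 0.15, 0.05], random_state=0,
)

# "balanced" weights each class inversely to its frequency during training
clf = RandomForestClassifier(class_weight="balanced", random_state=0)
clf.fit(X, y)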

Data Quality:
o Is the data clean and free from errors (e.g., missing values, inconsistencies)?
o What is the data's source and reliability?
o Are there any privacy or ethical concerns regarding the data?
Data Relevance:
o Is the data relevant to the problem you are trying to solve?
o Are there any missing features that might be important?
Data Volume:
o Is there enough data to train a reliable model?
o Are there any class imbalances that might affect model performance?
Data Format:
o Is the data in a suitable format for machine learning (e.g., CSV, JSON)?
o Are there any specific data preprocessing steps required?
EVALUATE Object Detection Models
Performance Metrics (detection-specific, rather than regression metrics such as MAPE, MSE, or R-squared):
 Mean Average Precision (mAP): Measures the overall accuracy of the model,
considering both localization and classification.
 Precision: The proportion of correct detections out of all predicted detections.
 Recall: The proportion of correctly predicted detections out of all actual
detections.

Determining Stationarity in Time Series Data

Statistical Tests:
 Unit-root/stationarity tests evaluate a null hypothesis of stationarity or non-stationarity; based on the test
result we can conclude whether the series is stationary.
Visual Inspection:
 Look for trends, seasonality, or other patterns that suggest non-stationarity.

Anomaly detection is the process of identifying outliers, unusual events:


Common Anomaly Detection Models:
1. Statistical Methods:
o Z-score: Measures how many standard deviations a data point is from the
mean (see the sketch after this list).
2. Machine Learning Methods:
o Autoencoders: Neural networks trained to reconstruct input data.
Anomalies are detected when reconstruction error is high.
o Time Series Anomaly Detection: Specific techniques for detecting
anomalies in time series data, such as statistical methods, machine learning
models like LSTMs, and spectral residual analysis.
3. Hybrid Methods:
o Combining statistical and machine learning methods for improved
performance in complex scenarios.
Choosing the Right Model:
 Data type: Time series, numerical, categorical, or mixed.
 Anomaly type: Point anomalies, contextual anomalies, or collective anomalies.
 Data distribution: Normal or skewed.
 Computational resources: The complexity of the model and the size of the dataset.
Evaluation Metrics:
 Precision: The proportion of correctly identified anomalies out of all predicted
anomalies.
 Recall: The proportion of actual anomalies that were correctly identified.
 F1-score: The harmonic mean of precision and recall.
 False Positive Rate (FPR) and False Negative Rate (FNR): the proportion of normal points incorrectly flagged and the proportion of anomalies missed, respectively.
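A minimal NumPy sketch of the z-score method mentioned above (the synthetic data and the threshold of 3 are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(10, 1, 100), [25.0]])   # normal data plus one injected outlier

z_scores = (data - data.mean()) / data.std()
threshold = 3                                             # a common but arbitrary cut-off
anomalies = data[np.abs(z_scores) > threshold]
print(anomalies)                                          # the injected outlier should be flagged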

Structured Data File Types

1. CSV (Comma-Separated Values): .csv (columns typically typed as Boolean, Date/Time, String, Float, or Integer)


2. Excel Spreadsheets: .xls, .xlsx
3. SQL Database Files: .sql
4. Parquet: .parquet (commonly used in big data processing)
5. JSON (if structured in a predictable manner): .json

Unstructured Data File Types:

1. Text Files: .txt (for raw text or logs)


2. Image Files: .jpg, .png, .bmp, .gif
3. Audio Files: .mp3, .wav
4. Video Files: .mp4, .avi, .mov
5. PDF: .pdf
6. HTML Files: .html (if containing unstructured content), system log files

Structured files are typically used for easily searchable, well-organized data, while
unstructured files can range from multimedia to raw text.

Tell me about your experience with cloud-based ML tools. How have you utilized them in previous
projects?

Focus on cloud ML experience. Highlight my work with AWS at BMO, where I integrated AI-enabled solutions
and migrated data using AWS Direct Connect.

When we cache data in Spark, where is it stored? In Spark memory or user memory?
It is stored in Spark memory, specifically in the memory allocated for the Spark executor.

What are the different ways to parameterize a notebook?


You can parameterize a notebook using widgets, environment variables, or Databricks Job parameters.
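A minimal widget sketch (runs inside a Databricks notebook, where dbutils is predefined; the widget name and default value are illustrative):

# Declare a text widget with a default value; a Job can override it with a parameter
dbutils.widgets.text("run_date", "2024-01-01", "Run date")

# Read the value supplied by the Job parameter or entered in the notebook UI
run_date = dbutils.widgets.get("run_date")
print(f"Processing data for {run_date}")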

What do you understand by driver node and worker node, and how are their responsibilities divided?
The driver node handles the orchestration, including task scheduling and communication, while worker nodes
execute tasks and perform the actual data processing.

How do I connect to an Azure SQL Database from Databricks?


You can connect using JDBC by providing the connection string, database credentials, and driver details.
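A minimal PySpark JDBC sketch (runs in a Databricks notebook where spark and dbutils already exist; the server, database, table, and secret names are placeholders):

jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<database>"

df = (spark.read
      .format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", "dbo.Customers")                          # illustrative table name
      .option("user", dbutils.secrets.get("my-scope", "sql-user"))
      .option("password", dbutils.secrets.get("my-scope", "sql-password"))
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .load())
df.show(5)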

How do I read and write data to and from Azure Blob Storage?
Use the wasbs:// protocol in Databricks to read from and write to Azure Blob Storage by configuring storage
account keys.

How do I schedule and automate jobs in Azure Databricks?


Use the Databricks Jobs API or the Jobs UI to schedule and automate notebooks or workflows.

How do I secure my data and access in Databricks?


Secure data by implementing role-based access control (RBAC), encrypting data at rest and in transit, and using
Databricks Secrets for sensitive credentials.

Have you ever worked with JSON data? How do you flatten nested JSON data?
Yes, flatten nested JSON data using PySpark's explode() or selectExpr() functions to convert nested fields into a
tabular format.
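A minimal PySpark sketch of flattening with explode() (the JSON layout, path, and column names are assumptions):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.getOrCreate()

# Hypothetical nested JSON: each customer record contains an array of orders
df = spark.read.json("/data/customers.json")

flat = (df
        .select("customer_id", explode("orders").alias("order"))   # one row per order
        .select("customer_id",
                col("order.order_id"),
                col("order.amount")))
flat.show()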

How do you handle corrupted data while reading from a source? Provide a code example where you read data
from a CSV file and store the corrupted records.
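One way to do this with PySpark's PERMISSIVE read mode (a minimal sketch; the schema, column names, and paths are assumptions):

from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Explicit schema with an extra column that PERMISSIVE mode fills for malformed rows
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("salary", DoubleType(), True),
    StructField("_corrupt_record", StringType(), True),
])

df = (spark.read
      .option("header", "true")
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema(schema)
      .csv("/data/employees.csv"))

df.cache()   # caching avoids Spark's restriction on queries that touch only the corrupt-record column

good = df.filter(df["_corrupt_record"].isNull()).drop("_corrupt_record")
bad = df.filter(df["_corrupt_record"].isNotNull())

bad.write.mode("overwrite").json("/data/corrupt_records")   # store corrupted rows for later review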

What methods are you using in your project for incremental load?
Incremental load is handled using delta tables with time-stamp or version-based mechanisms, or by using
watermarking in streaming processes.

Data preprocessing, especially with unstructured data? NoSQL database & Automated Data Pipelines

BMO

Explain a complex SQL query you have written to solve a specific problem, highlight your SQL work,
especially your optimization of query response times at Edmonton Airport

1 year 9 months (~3 months short of rounding up to 2 years) cost me:

2000 × 12 = 24,000, plus 1,800 × 9 = 16,200, for a total of roughly 40,200

Excluding gas/electricity & hydro bills, mobile, wifi payment, car insurance, maintenance,
fuel

Depreciation Cost = $33,000 - $23,000 = $10,000


Depreciation Percentage = ($10,000 / $33,000) × 100
Depreciation Percentage ≈ 30.30%

=STOCKHISTORY("TICKER", TODAY()-1, TODAY(), 0, 1)


=STOCKHISTORY("TICKER", TODAY()-7, TODAY(), 0, 1)
=STOCKHISTORY("TICKER", EOMONTH(TODAY(), -2)+1, EOMONTH(TODAY(), -1), 0, 1)

 Use the STOCKHISTORY function to get historical stock prices.

 Adjust the date parameters to retrieve prices for the last day, week, and month as
shown above.

 Be sure to replace "TICKER" with the actual stock symbol you are interested in.

How would you find duplicate records in a table?


SELECT column_name, COUNT(column_name)
FROM table_name
GROUP BY column_name
HAVING COUNT(column_name) > 1;
This query groups the records by the specified column and counts occurrences, filtering for those greater than
one

Write a query to join two tables and filter results based on a specific condition.
SELECT a.column1, b.column2
FROM table_a a
JOIN table_b b ON a.common_field = b.common_field
WHERE a.condition_column = 'specific_value';
This query retrieves data from both tables where the join condition is met and
applies an additional filter.

Explain the difference between `INNER JOIN`, `LEFT JOIN`, `RIGHT JOIN`, and `FULL OUTER JOIN`.
INNER JOIN: Returns records with matching values in both tables.
LEFT JOIN: Returns all records from the left table and matched records from the right table; unmatched records
from the right will show as NULL.
RIGHT JOIN: Returns all records from the right table and matched records from the left table; unmatched
records from the left will show as NULL.
FULL OUTER JOIN: Returns all records when there is a match in either left or right table

How do you optimize a slow running query?


Indexing: Create indexes on columns used in WHERE clauses and JOIN conditions.
Query Refactoring: Simplify complex queries by breaking them into smaller parts or using CTEs.
Analyze Execution Plans: Use execution plans to identify bottlenecks and optimize accordingly

How would you use a subquery to filter results based on a condition in another table?
SELECT column_name
FROM table_a
WHERE column_name IN (SELECT column_name FROM table_b WHERE condition);
This retrieves records from table_a where the column matches values from a
subquery on table_b.

Explain how you would use a window function to calculate a running total or moving average.
SELECT column_name,
SUM(column_name) OVER (ORDER BY order_column) AS running_total
FROM table_name;
This computes a cumulative sum of column_name ordered by order_column

What are CTEs, and how would you use them to simplify complex queries?
Common Table Expressions (CTEs): simplify complex queries by breaking them
into manageable pieces, better readability and organization of SQL code
WITH cte_name AS (
SELECT column_name
FROM table_name
WHERE condition
)
SELECT * FROM cte_name;

Question: write SQL queries that manipulate two sample tables and display certain results, assuming two
Tables of Employees and Departments

Query to Display Employee Names with Their Department Names:

SELECT e.FirstName, e.LastName, d.DepartmentName
FROM Employees e
JOIN Departments d ON e.DepartmentID = d.DepartmentID;

This query joins the `Employees` and `Departments` tables on the `DepartmentID` and displays the first name,
last name, and department name for each employee

Query to Find Employees with Salaries Greater Than $70,000:


SELECT FirstName, LastName, Salary
FROM Employees
WHERE Salary > 70000;

This query retrieves the first name, last name, and salary of employees who earn more than $70,000.

Query to Count the Number of Employees in Each Department:


SELECT d.DepartmentName, COUNT(e.EmployeeID) AS NumberOfEmployees
FROM Departments d
LEFT JOIN Employees e ON d.DepartmentID = e.DepartmentID
GROUP BY d.DepartmentName;

This query counts the number of employees in each department, including departments with no employees.

Query to Increase the Salary of Employees in the 'Engineering' Department by 10%:

UPDATE Employees
SET Salary = Salary * 1.10
WHERE DepartmentID = (SELECT DepartmentID FROM Departments WHERE DepartmentName =
'Engineering');

This query increases the salary of employees in the 'Engineering' department by 10%.

Query to Find the Average Salary by Department:

SELECT d.DepartmentName, AVG(e.Salary) AS AverageSalary
FROM Employees e
JOIN Departments d ON e.DepartmentID = d.DepartmentID
GROUP BY d.DepartmentName;

This query calculates and displays the average salary for each department.

Modeling the data (building the semantic model) includes defining and creating relationships between the tables.

EXCEL Interview Questions:


How do you use pivot tables for data analysis?
Pivot tables in Excel allow you to summarize and analyze large datasets by organizing data into a table format.
You can group, filter, and aggregate data to identify trends and patterns, making pivot tables a powerful tool for data analysis.

VLOOKUP: searches for a value in the first column of a table and returns a value in the same row
from a specified column. It's useful for looking up data VERTICALLY.

HLOOKUP: similar to VLOOKUP, but it searches for a value in the first row of a table and returns a value
from a specified row, making it suitable for HORIZONTAL lookups.

To clean data in Excel, you can use:

the TRIM function to remove extra spaces, Find and Replace to correct errors, and Text to Columns to split data.

Conditional Formatting: highlights cells based on certain criteria; set RULES to format cells in a RANGE that meet
conditions, such as values above a certain threshold.

How would you set up data validation to restrict input values in a cell or range of cells?

Select the cells, go to the Data tab, click on Data Validation, and set criteria (e.g., whole numbers,
lists) to control what can be entered.

Can you explain how to create and use macros to automate repetitive tasks in Excel?
To automate repetitive tasks in Excel, you can create macros by recording actions. Go to the View tab,
select Macros, and choose Record Macro. Perform the tasks you want to automate, then stop
recording. You can run the macro later to repeat those actions.

Pagination with Pivoting: You will be given a list of items, and the aim is to implement pagination around a
pivot element. It should be circular.

To implement pagination with pivoting in a circular manner:

1. Identify the pivot element in the list.


2. Determine the starting page by positioning the pivot at the center.
3. Display items before and after the pivot to fill the page, wrapping around circularly if necessary.
4. Navigate forward or backward through the list, adjusting the pivot's position while maintaining the circular
order.

My approach ensures the pivot remains central while allowing circular navigation through the list.
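A minimal Python sketch of this approach (the page size and the decision to center the pivot are assumptions):

def paginate_around_pivot(items, pivot_index, page_size):
    """Return one page of items centered on the pivot, wrapping around circularly."""
    n = len(items)
    half = page_size // 2
    # Start half a page before the pivot and wrap indices with modulo arithmetic
    return [items[(pivot_index - half + offset) % n] for offset in range(page_size)]

items = ["a", "b", "c", "d", "e", "f", "g"]
print(paginate_around_pivot(items, pivot_index=1, page_size=5))   # ['g', 'a', 'b', 'c', 'd']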

Machine Learning Engineer Interview

At the University of Alberta, I developed a machine learning-based regression framework to predict health
insurance costs (in dollar amounts) on a dataset of 5 million records, achieving 99% accuracy with Random Forest;
we also evaluated XGBoost, Lasso Regression, and Gradient Boosting.

At Minerva Analytics, I used media mix modeling (MMM) to OPTIMIZE marketing budgets, increasing
sales by 15% and maximizing ROI.

I implemented NLP techniques in the TagAI project at BMO, where I used SYNONYM DETECTION,
ACRONYM EXPANSION, and named entity recognition (NER) to automate metadata tagging and improve
data discovery. This system leveraged active learning to enhance accuracy over time, contributing to more
efficient data governance and searchability across large databases.

The key metrics for EVALUATING AND OPTIMIZING machine learning models vary depending on whether
it's a classification or regression problem.

Classification Metrics:

1. Accuracy: The proportion of correct predictions out of the total predictions.


2. Precision: The proportion of true positives out of all positive predictions
3. Recall (Sensitivity/True Positive Rate): The proportion of true positives out of all actual positives
4. F1 Score: The harmonic mean of precision and recall, balancing them when there's an uneven class
distribution. (Often the better choice for imbalanced data.)

5. AUC-ROC Curve: Measures the ability of the model to distinguish between classes; the Area Under
the Curve (AUC) quantifies overall performance.
6. Confusion Matrix: Provides a breakdown of true positives, false positives, true negatives, and false
negatives to visualize model performance.

Regression Metrics:

1. Mean Absolute Error (MAE): The average absolute difference between predicted and actual values.
2. Mean Squared Error (MSE): The average squared difference between predicted and actual values,
penalizing larger errors more.
3. Root Mean Squared Error (RMSE): The square root of MSE, providing a more interpretable measure
in the same units as the target variable.
4. R-squared (R²): Indicates the proportion of variance explained by the model; higher values mean better
performance.

Optimization Metrics:

1. Learning Curve: Tracks model performance over time (training and validation losses) to detect
overfitting or underfitting.
2. Hyperparameter Tuning: Metrics like cross-validation score help optimize model parameters (e.g.,
learning rate, depth of trees).
3. Model Interpretability: Use of metrics like SHAP values or feature importance to understand how
well the model is aligned with domain knowledge.
4. Computation Time and Resource Usage: Optimization considers efficiency, ensuring the model
performs well without excessive computation.

Model Monitoring: track KPIs in a dashboard and, as new datasets / unseen data arrive, re-evaluate
performance metrics (Accuracy, Precision, Recall, F1 Score).

Model Drifting refers to the DEGRADATION IN THE Performance of a Machine Learning Model over time
due to CHANGES in the underlying data distribution. It is important to monitor, especially in production
systems, as it can result in the model becoming less reliable or relevant, requiring further adjustments

Data Drift (Covariate Shift): Occurs when the distribution of the input data changes over time compared to the
data the model was originally trained on. This can lead to poor predictions, as the model is no longer seeing data
similar to its training set.

Concept Drift: Happens when the RELATIONSHIP between the Target Variable & Input Features changes
over time. Model's ASSUMPTIONS about the task become outdated, leading to inaccurate predictions.

Model Interpretation (Global & Local): refers to the process of understanding how a model makes
predictions. It involves analyzing the model's internal workings to understand which features or data points are influencing
its output, ensuring transparency, trust, and accountability, especially in critical real-world applications
across banking, healthcare, and insurance.

Model performance and evaluation metrics should be defined as quantitative measures of how well a machine
learning model makes predictions, ensuring alignment with the specific task's objectives, such as:

Precision: proportion of true positive predictions, measuring accuracy for the positive class.
Recall: measures the model's ability to detect all positive instances.
F1 score: balances both metrics to give a single measure of the model's accuracy, especially when the data is
imbalanced.

Classification Problem: the goal is to categorize or classify data points into predefined labels or categories, such
as spam detection or medical diagnosis.
Confusion Matrix: A table used in classification problems to summarize the performance of a
model by showing the counts of true positives, false positives, true negatives, and false negatives.
Regression: focuses on predicting a continuous output variable based on input features, such as predicting
insurance fees, house prices, or stock values.

Azure Machine Learning is used for building, training, and deploying machine learning models at scale, especially
for automating data workflows and integrating AI solutions into business operations.

A/B testing is crucial for data-driven optimization and improving performance: it compares two versions
(A and B) of a PRODUCT or FEATURE to determine which one performs better, providing data-driven
insights.

Measuring impact: It helps identify which changes or strategies improve key performance metrics like
conversions, ENGAGEMENT, OR SALES.
Reducing guesswork: Instead of relying on assumptions, A/B testing relies on EMPIRICAL EVIDENCE to
make informed decisions.

Optimizing user experience: ensures the best version is implemented based on actual user preferences.

Minimizing risk: Testing on a smaller sample before a full-scale rollout reduces the risk of implementing
ineffective changes.

Supervised machine learning: the model is trained on labeled data, meaning the input-output pairs are
known. The goal is to predict the output for new inputs, e.g., predicting house prices using historical data.
Application: Email spam detection, fraud detection.

Unsupervised machine learning: the model is trained on unlabeled data, and the goal is to find hidden
patterns or groupings within the data, e.g., customer segmentation in marketing using clustering.
Application: Market basket analysis, anomaly detection.

Deep Learning: A subset of machine learning that uses neural networks with many layers to model
COMPLEX PATTERNS in data; it excels in tasks like image recognition and NLP, e.g., image
classification using Convolutional Neural Networks (CNNs). Application: Self-driving cars, speech recognition.

BACKPROPAGATION is a technique used in training neural networks,
where the ERROR from the output is propagated backward through the network to update the
weights and improve the model's accuracy. Application: image classification, fine-tuning WEIGHTS in deep
learning networks for facial recognition.

NLP (Natural Language Processing): A field of AI focused on the interaction between computers and
human language, enabling machines to understand, interpret, and generate text, e.g., chatbots using
GPT-3 for generating human-like text. Application: sentiment analysis, machine translation, speech
recognition.

Computer Vision: A field of AI focused on enabling machines to interpret and understand visual
data, such as images and videos, e.g., object detection using CNNs in images. Application: facial recognition,
medical imaging, autonomous vehicles.

Time Series Forecasting Model: A model used to PREDICT FUTURE VALUES BASED ON
PREVIOUSLY OBSERVED DATA, typically involving data that is dependent on time, e.g., an ARIMA
model for predicting stock prices.

- Application: Sales forecasting, weather prediction.

Reinforcement Learning: A machine learning technique where an AGENT learns to make decisions
by taking actions in an environment to maximize a reward over time. It learns through trial and error.

- Application: Robotics, game AI, autonomous systems like self-driving cars.

Python is widely used in data science for data manipulation, analysis, and visualization.
Key libraries include:

 NumPy: supports large, multi-dimensional numerical data, working with arrays and matrices,
along with a collection of mathematical functions to operate on these arrays.

import numpy as np

zeros = np.zeros((3, 3))
ones = np.ones((3, 3))
rand_array = np.random.rand(3, 3)

 Pandas: offers data manipulation and analysis, providing data structures and operations for
manipulating numerical tables and time series.
- Reading & writing a CSV file, selecting columns, filtering rows
- Group by and aggregate functions, pivot tables
- Joining dataframes: pd.merge(df1, df2, on='key_column', how='left')
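A minimal pandas sketch of the operations listed above (file names and column names are assumptions):

import pandas as pd

df = pd.read_csv("sales.csv")                                # reading a CSV file

high_value = df[df["amount"] > 1000]                         # filtering rows
totals = df.groupby("region")["amount"].sum()                # group by and aggregate
pivot = df.pivot_table(values="amount", index="region",
                       columns="product", aggfunc="sum")     # pivot table

regions = pd.read_csv("regions.csv")
merged = pd.merge(df, regions, on="region", how="left")      # joining dataframes

merged.to_csv("sales_enriched.csv", index=False)             # writing a CSV file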

 Scikit-learn: features various classification, regression, and clustering
algorithms, providing a range of supervised and unsupervised learning methods for data mining,
data analysis, and developing predictive models; it is built on NumPy, SciPy, and Matplotlib.
 Advanced Data Science Techniques: natural language processing and neural networks using
TensorFlow or PyTorch, particularly in deep learning; these libraries are used for designing, training,
and validating deep neural networks with large-scale data.
 Matplotlib: plotting library for creating static, interactive, and animated visualizations and simulations.
 SciPy: used for scientific and technical computing.

AI domains where Python is extensively used include:

 Machine Learning (ML): Libraries like TensorFlow, PyTorch, and Scikit-learn are used for developing
machine learning models.
 Deep Learning: TensorFlow and PyTorch also provide frameworks for deep learning, which can be
used to develop complex models that simulate human neural networks.
 Natural Language Processing (NLP): Libraries such as NLTK and spaCy allow for the processing and
analysis of human language data.
 Computer Vision: OpenCV and PIL (Python Imaging Library) are used for image processing tasks.

 Data Manipulation & Analysis: Pandas for data cleaning, transformation, and preparation. powerful
data manipulation capabilities that simplify tasks like data filtering, aggregation, and visualization. It
allows handling of large data sets
 Data Visualization: Matplotlib, Seaborn to create meaningful charts and graphs.
 Statistical Analysis: to understand trends, patterns, and inferential statistics.
 Machine Learning: scikit-learn library to implement predictive models.

Data Mining: techniques to extract valuable information from large datasets. Scikit-learn, TensorFlow, and
Keras offer everything from regression models to complex neural networks, used for
predicting economic trends, analyzing customer behavior, or building recommendation systems.

Python scripts provide a quick way to automate data processing, monitoring, and other repetitive tasks.

In the finance sector, Python is used for quantitative and qualitative analysis, financial modeling, and
algorithmic trading. Libraries like pandas have become a popular choice for modeling financial data and risk
management.

Control Flow
- if, else, elif: conditional branching.
- for, while: Looping
- break: Exits from the loop.
- continue: Skips the current iteration and continues with the next one.
- pass: A placeholder that does nothing

Exception Handling
- try, except: Blocks that catch exceptions.
- raise: Used to raise an exception.
- assert: For debugging purposes.

Functions
- def: Defines a function.
- return: Exits a function, returns a value.
- lambda: Creates an anonymous function.
- yield: returns a value and pauses the execution.
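A small sketch tying several of these keywords together (the values are arbitrary):

def running_total(values):
    # yield pauses the function here and resumes on the next iteration
    total = 0
    for v in values:
        total += v
        yield total

square = lambda x: x * x                      # anonymous single-expression function

try:
    print(list(running_total([1, 2, 3])))     # [1, 3, 6]
    assert square(-4) == 16                   # debugging-style sanity check
    raise ValueError("demonstration error")   # deliberately raised
except ValueError as err:
    print(f"Caught: {err}")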

Web development using Python: Django and Flask provide tools and libraries to create secure,
scalable, and maintainable web sites and services.

Question: Given a 2D binary matrix, write a solution to make the image symmetric along the X and Y
axes. The only operations allowed are full row and full column insertions without modifying the values in
the original matrix. The goal is to find the minimum number of row and column insertions.

To make a 2D binary matrix symmetric along both X and Y axes, compare rows and columns with their
counterparts, and count the necessary insertions. The goal is to achieve symmetry with minimal row and column
insertions.

def min_insertions_for_symmetry(matrix):
    n = len(matrix)
    m = len(matrix[0])

    row_insertions = 0
    col_insertions = 0

    # Check rows for X-axis symmetry
    for i in range(n // 2):
        if matrix[i] != matrix[n - 1 - i]:
            row_insertions += 1

    # Check columns for Y-axis symmetry
    for j in range(m // 2):
        col = [matrix[i][j] for i in range(n)]
        mirror_col = [matrix[i][m - 1 - j] for i in range(n)]
        if col != mirror_col:
            col_insertions += 1

    return row_insertions + col_insertions

# Example matrix
matrix = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1]
]

result = min_insertions_for_symmetry(matrix)
