VEL TECH HIGH TECH
Dr. RANGARAJAN Dr. SAKUNTHALA ENGINEERING COLLEGE
An Autonomous Institution
Approved by AICTE-New Delhi, Affiliated to Anna University, Chennai
Accredited by NBA, New Delhi & Accredited by NAAC with “A” Grade & CGPA of 3.27
COURSE DETAILS
FACULTY CODE: HTS 1821            FACULTY NAME: Dr. SATHISH KUMAR
SUBJECT CODE: 21AI35IT            SUBJECT NAME: DATA SCIENCE FOR ENGINEERS
YEAR: SECOND YEAR                 SEMESTER: 3rd SEMESTER
DEGREE: B.E                       BRANCH/SEC: AI&DS – C SEC
BATCH: 2024-2028                  ACADEMIC YEAR: 2025-2026

Course code: 21AI35IT             Semester: III
Category: ENGINEERING SCIENCE COURSE (ESC)
Course Title: DATA SCIENCE FOR ENGINEERS    L: 2   T: 0   P: 4   C: 4
COURSE OBJECTIVES:
· To describe the life cycle of Data Science and computational environments for data scientists using Python.
· To describe the fundamentals for exploring and managing data with Python.
· To examine the various data analytics techniques for labelled /columnar data using Python.
· To demonstrate a flexible range of data visualizations techniques in Python.
· To describe the various Machine learning algorithms for data modelling with Python.
On successful completion of this Course, students will be able to:

CO.No.    Course Outcomes                                  Bloom's level
C305.1    Understand the basic concept of Data Science.    K2
SYLLABUS:
UNIT I INTRODUCTION TO DATA SCIENCE
Introduction to Data Science and its importance - Data Science and Big Data - The life cycle of Data Science - The Art of Data Science - Work with data: data cleaning, data munging, data manipulation. Establishing computational environments for data scientists using Python with IPython and Jupyter.
Introduction to Data Science and Its Importance
What is Data Science?
Data Science is an interdisciplinary field that uses scientific methods, algorithms, processes, and
systems to extract knowledge and insights from structured and unstructured data. It combines
techniques from statistics, computer science, mathematics, and domain knowledge to make data-driven decisions.
Importance of Data Science
1. Data-Driven Decision Making
Organizations rely on data science to analyze trends and make informed business or
engineering decisions.
2. Automation and Efficiency
Data science enables automation through machine learning, which improves operational
efficiency across industries.
3. Problem-Solving with Predictive Models
By leveraging historical data, predictive analytics helps in forecasting outcomes such as
market trends, system failures, or customer behavior.
4. Innovation in Engineering and Technology
Data science powers innovations like autonomous vehicles, smart cities, and IoT systems,
optimizing performance and user experience.
5. Personalization
It is widely used in recommendation systems (e.g., Netflix, Amazon), tailoring services and
products to individual users.
6. Healthcare Transformation
Assists in disease prediction, patient monitoring, and drug discovery using bioinformatics
and clinical data.
Applications of Data Science
● Finance: Fraud detection, risk assessment
● Retail: Inventory optimization, customer analytics
● Manufacturing: Predictive maintenance, quality control
● Education: Student performance prediction, adaptive learning systems
● Telecommunications: Network optimization, customer churn prediction
2. Data Science and Big Data
Data Science
Data Science is the process of extracting meaningful insights from data using statistical analysis,
machine learning, and domain knowledge.
Big Data
● Refers to extremely large datasets that are complex and cannot be handled by traditional data-processing tools.
● Characterized by the 5 V’s:
○ Volume – Large amounts of data
○ Velocity – Speed of data generation
○ Variety – Different types (text, image, video)
○ Veracity – Data uncertainty
○ Value – Insights gained
Relationship:
Data Science uses tools and methods to analyze Big Data and extract useful patterns or trends.
The Life Cycle of Data Science
The Data Science Life Cycle is a structured process that guides how raw data is transformed into
meaningful insights and solutions. It consists of several key phases, each playing a crucial role in
building data-driven applications.
🧩 1. Problem Definition
● Objective: Understand and define the problem you are trying to solve.
● Ask questions like:
○ What is the goal?
○ What outcome is expected?
○ What decisions will be supported?
● Example: Predict customer churn, classify images, forecast sales.
📥 2. Data Collection
● Objective: Gather relevant data from various sources.
● Sources:
○ Databases (SQL, NoSQL)
○ Web APIs
○ Files (CSV, Excel, JSON)
○ Sensors, Logs, Social Media
● Ensure data is representative and sufficient.
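As an illustrative sketch (the file names here are hypothetical), Pandas can read several of these sources directly:
import pandas as pd

df_csv = pd.read_csv("sales.csv")        # from a CSV file
df_excel = pd.read_excel("sales.xlsx")   # from an Excel workbook
df_json = pd.read_json("sales.json")     # from a JSON file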
🧹 3. Data Cleaning (Preprocessing)
● Objective: Prepare the data for analysis by handling errors or inconsistencies.
● Tasks:
○ Handle missing or null values
○ Remove duplicates
○ Fix data types
○ Standardize formats
● Tools: Pandas, OpenRefine
📊 4. Data Exploration and Analysis
● Objective: Understand the data distribution, patterns, and relationships.
● Activities:
○ Descriptive statistics
○ Correlation analysis
○ Visualizations (histograms, box plots, scatter plots)
● Tools: Matplotlib, Seaborn, Pandas
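A minimal exploration sketch with Pandas and Matplotlib; the file name and the 'Sales' column are assumptions for illustration:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")           # hypothetical dataset
print(df.describe())                   # descriptive statistics
print(df.corr(numeric_only=True))      # correlation analysis
df['Sales'].hist()                     # histogram of one column
plt.show()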
5. Feature Engineering
● Objective: Select, create, or transform variables (features) that improve model performance.
● Examples:
○ Encoding categorical variables
○ Normalizing or scaling data
○ Creating new derived features (e.g., Age from DOB)
● Tools: scikit-learn, Featuretools
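A small sketch of encoding and scaling with Pandas and scikit-learn (the toy columns are assumptions for illustration):
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"city": ["Chennai", "Delhi", "Chennai"],
                   "salary": [30000, 50000, 40000]})
df = pd.get_dummies(df, columns=["city"])   # encode the categorical variable
df[["salary"]] = StandardScaler().fit_transform(df[["salary"]])  # scale the numeric feature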
🤖 6. Model Building
● Objective: Choose and train machine learning models on the data.
● Algorithms:
○ Classification: Logistic Regression, Decision Trees
○ Regression: Linear Regression
○ Clustering: K-means
● Tools: scikit-learn, TensorFlow, Keras
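A minimal sketch, on random placeholder data, of fitting one scikit-learn model from each family listed above:
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import KMeans

np.random.seed(0)
X = np.random.rand(100, 3)              # placeholder features
y_class = np.random.randint(0, 2, 100)  # placeholder class labels
y_reg = np.random.rand(100)             # placeholder continuous target

clf = LogisticRegression().fit(X, y_class)    # classification
reg = LinearRegression().fit(X, y_reg)        # regression
km = KMeans(n_clusters=3, n_init=10).fit(X)   # clustering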
📈 7. Model Evaluation
● Objective: Measure model performance using metrics.
● Metrics:
○ Accuracy, Precision, Recall, F1-Score
○ RMSE, MAE for regression
○ Confusion matrix, ROC curve
● Helps in comparing different models.
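A minimal end-to-end sketch, using synthetic data, of training a classifier and computing these metrics with scikit-learn:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Synthetic data stands in for a real dataset
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

print(accuracy_score(y_test, y_pred))         # overall accuracy
print(confusion_matrix(y_test, y_pred))       # confusion matrix
print(classification_report(y_test, y_pred))  # precision, recall, F1-score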
🚀 8. Deployment
● Objective: Integrate the model into a live environment.
● Methods:
○ REST APIs
○ Embedded in web/mobile apps
○ Real-time or batch processing
● Tools: Flask, FastAPI, Docker
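A minimal Flask sketch of serving a trained model over a REST API; the model file "model.pkl" and the feature layout are hypothetical:
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load("model.pkl")   # hypothetical pre-trained model

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]    # e.g. [1.2, 3.4, 5.6]
    prediction = model.predict([features])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run()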
9. Monitoring and Maintenance
● Objective: Track model performance over time and update when necessary.
● Monitor for:
○ Data drift
○ Model degradation
○ System failures
● Re-train models periodically with new data.
The Art of Data Science
📌 Definition:
The Art of Data Science refers to the creative and intuitive aspects of the data science process.
While data science is rooted in mathematics, statistics, and programming, the "art" lies in:
● Asking the right questions,
● Choosing meaningful variables,
● Visualizing data effectively, and
● Communicating insights clearly.
🔑 Key Elements:
● Curiosity: Constantly exploring and questioning data.
● Storytelling: Making data relatable and meaningful.
● Decision-making: Knowing when to apply which technique or algorithm.
● Design Thinking: Creating impactful visualizations and user-centric models.
⚠️ Data science is not just about building models — it's about making data work to solve real-world problems.
🧰 Working with Data
Overview:
The practical side of data science involves handling real-world data, which is often messy and
unstructured. This includes:
● Data Cleaning
● Data Munging (Wrangling)
● Data Manipulation
🧹 Data Cleaning
📌 Definition:
The process of detecting and correcting (or removing) inaccurate, corrupt, or irrelevant parts
of the dataset.
Common Tasks:
● Handling missing values (e.g., NaN, nulls)
● Removing duplicates
● Fixing inconsistent formats (e.g., date formats, currency symbols)
● Correcting data types
● Filtering out outliers or noise
Tools:
● Python: Pandas, NumPy
● Methods: dropna(), fillna(), astype(), replace()
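A short sketch of these methods on a toy DataFrame (the columns and values are invented for illustration):
import pandas as pd

df = pd.DataFrame({"age": [25, None, 30, 30], "city": ["NY", "LA", "NY", "NY"]})
df = df.drop_duplicates()                           # remove duplicate rows
df["age"] = df["age"].fillna(df["age"].mean())      # fill missing values
df["age"] = df["age"].astype(int)                   # fix the data type
df["city"] = df["city"].replace("NY", "New York")   # standardize values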
🔄 Data Munging (Wrangling)
📌 Definition:
The process of transforming raw data into a clean, structured, and usable format for analysis.
🧱 Typical Steps:
1. Parsing data from files (CSV, JSON, XML)
2. Merging datasets
3. Reshaping data (pivot, melt)
4. Converting data types
5. Encoding categorical variables
Example:
import pandas as pd

df = pd.read_csv("data.csv")
df['date'] = pd.to_datetime(df['date'])  # Convert to datetime
df = df.pivot(index='ID', columns='month', values='sales')  # Reshape to wide format
🔧 Data Manipulation
📌 Definition:
Refers to accessing, transforming, filtering, sorting, or combining data to prepare it for analysis.
🔍 Examples of Operations:
● Filtering rows based on conditions
● Sorting by one or more columns
● Grouping and aggregating data (groupby())
● Joining/Merging multiple datasets (merge(), concat())
● Creating new columns using formulas or functions
Example:
# Grouping and aggregating
grouped = df.groupby("department")["salary"].mean()

# Adding a new column
df["bonus"] = df["salary"] * 0.10
Establishing Computational Environments for Data Scientists using Python with
IPython and Jupyter
This topic focuses on setting up a productive and flexible working environment for data scientists to
write, run, and share Python code effectively. It revolves around two main tools: IPython and
Jupyter Notebook.
🐍 Why Python for Data Science?
Python is the most widely used language in data science due to:
● Readability and simplicity
● Rich ecosystem of libraries (e.g., NumPy, Pandas, Matplotlib, scikit-learn)
● Strong community support
● Easy integration with web, databases, and cloud platforms
🧪 1. IPython (Interactive Python)
📌 What is IPython?
IPython is an enhanced interactive shell for Python that provides a rich toolkit for interactive
computing.
Key Features:
● Tab completion for variable names and functions
● Rich media (images, videos, LaTeX)
● Inline plotting with Matplotlib
● Interactive debugging and shell commands
● Magic commands (%timeit, %run, %matplotlib, etc.)
🔍 Example:
%timeit sum(range(10000)) # Measures execution time
📓 2. Jupyter Notebook
📌 What is Jupyter?
Jupyter Notebook (the name combines Julia, Python, and R) is a web-based interactive development environment for data science and scientific computing.
Features:
● Supports live code, markdown, visualizations, and LaTeX
● Code can be executed cell by cell
● Ideal for data exploration, analysis, documentation, and sharing
● Outputs are displayed inline (charts, tables, HTML, etc.)
💻 How to Set Up Jupyter and IPython
🔧 Installation Using pip:
pip install jupyter ipython
🚀 To Launch Jupyter Notebook:
jupyter notebook
This will open the Jupyter dashboard in your web browser where you can create and manage
.ipynb notebooks.
🧰 Recommended Libraries for Data Science:
Make sure these libraries are installed in your environment:
pip install numpy pandas matplotlib seaborn scikit-learn
📁 Example Workflow in Jupyter:
1. Import libraries:
import pandas as pd
import matplotlib.pyplot as plt
2. Load data:
df = pd.read_csv("data.csv")
3. Visualize:
df['Sales'].plot(kind='line')
plt.show()
4. Document:
Use Markdown cells to explain your code, write equations, or embed images.
🌐 Advantages of Jupyter for Data Scientists
Feature                  Benefit
Live Code + Results      Immediate feedback and iteration
Markdown Support         Documenting workflows and analysis
Inline Visualizations    Better understanding and storytelling
Export to HTML/PDF       Easy sharing and reporting
Language Support         Python, R, Julia, and more via kernels