INTRODUCTION TO DATA SCIENCE
LECTURE NOTES
UNIT - 1
Data science is the domain of study that deals with vast volumes of data using modern tools and
techniques to find unseen patterns, derive meaningful information, and make business decisions.
Data science uses complex machine learning algorithms to build predictive models.
The data used for analysis can come from many different sources and be presented in various
formats.
Data science’s lifecycle consists of five distinct stages, each with its own tasks:
Capture: Data Acquisition, Data Entry, Signal Reception, Data Extraction. This stage involves
gathering raw structured and unstructured data.
Maintain: Data Warehousing, Data Cleansing, Data Staging, Data Processing, Data Architecture.
This stage covers taking the raw data and putting it in a form that can be used.
Process: Data Mining, Clustering/Classification, Data Modeling, Data Summarization. This stage
examines the prepared data for patterns, ranges, and biases to determine how useful it will be for
analysis.
Analyze: Exploratory/Confirmatory Analysis, Predictive Analysis, Regression, Text Mining,
Qualitative Analysis. This stage is where the actual analyses are performed on the data.
Communicate: Data Reporting, Data Visualization, Business Intelligence, Decision Making. In this
final stage, results are presented in easily readable forms such as charts, graphs, and reports.
Data science was born from the idea of merging applied statistics with computer science. The
resulting field of study would use the extraordinary power of modern computing. Scientists
realized they could not only collect data and solve statistical problems but also use that data to
solve real-world problems and make reliable fact-driven predictions.
1962: American mathematician John W. Tukey first articulated the data science dream. In his
now-famous article “The Future of Data Analysis,” he foresaw the inevitable emergence of a new
field nearly two decades before the first personal computers. While Tukey was ahead of his time,
he was not alone in his early appreciation of what would come to be known as “data science.”
1974: Danish computer scientist Peter Naur published the Concise Survey of Computer Methods,
in which he repeatedly used the term “data science” to describe the discipline of dealing with data.
1977: The theories and predictions of “pre” data scientists like Tukey and Naur became more
concrete with the establishment of The International Association for Statistical Computing
(IASC), whose mission was “to link traditional statistical methodology, modern computer
technology, and the knowledge of domain experts in order to convert data into information and
knowledge.”
1980s and 1990s: Data science began taking more significant strides with the emergence of the
first Knowledge Discovery in Databases (KDD) workshop and the founding of the International
Federation of Classification Societies (IFCS).
1994: Business Week published a story on the new phenomenon of “Database Marketing.” It
described the process by which businesses were collecting and leveraging enormous amounts of
data to learn more about their customers, competition, or advertising techniques.
1990s and early 2000s: Data science emerged as a recognized and specialized field. Several data
science academic journals began to circulate, and data science
proponents like Jeff Wu and William S. Cleveland continued to help develop and expound upon
the necessity and potential of data science.
2000s: Technology made enormous leaps by providing nearly universal access to internet
connectivity, communication, and (of course) data collection.
2005: Big data enters the scene. With tech giants such as Google and Facebook amassing large
amounts of data, new technologies capable of processing it became necessary. Hadoop rose to
the challenge, and later on Spark and Cassandra made their debuts.
2014: Due to the increasing importance of data, and organizations’ interest in finding patterns and
making better business decisions, demand for data scientists began to see dramatic growth in
different parts of the world.
2015: Machine learning, deep learning, and Artificial Intelligence (AI) officially enter the realm
of data science.
2018: New data-protection regulations, such as the EU's General Data Protection Regulation
(GDPR), became one of the biggest factors in the evolution of data science.
2020s: We are seeing additional breakthroughs in AI and machine learning, along with an
ever-increasing demand for qualified Big Data professionals.
Common data science job roles include:
Data Analyst
Data Engineer
Database Administrator
Machine Learning Engineer
Data Scientist
Data Architect
Statistician
Business Analyst
1. Data Analyst
Data analysts are responsible for a variety of tasks including visualisation, munging, and
processing of massive amounts of data. They also have to perform queries on the databases from
time to time. One of the most important skills of a data analyst is optimization. Their
responsibilities also include extracting data from primary and secondary sources using automated tools.
To become a data analyst: SQL, R, SAS, and Python are some of the sought-after technologies for
data analysis.
2. Data Engineers
Data engineers build and test scalable Big Data ecosystems for the businesses so that the data
scientists can run their algorithms on the data systems that are stable and highly optimized. Data
engineers also update the existing systems with newer or upgraded versions of the current
technologies to improve the efficiency of the databases.
To become a data engineer: technologies that require hands-on experience include Hive, NoSQL,
R, Ruby, Java, C++, and MATLAB.
3. Database Administrator
The job profile of a database administrator is pretty much self-explanatory: they are responsible
for the proper functioning of all the databases of an enterprise, and they grant or revoke access to
those databases for the employees of the company depending on their requirements.
To become a database administrator: you need skills such as database backup and recovery, data
security, data modeling and design, etc.
4. Machine Learning Engineer
Machine learning engineers are in high demand today. However, the job profile comes with its
challenges. Apart from having in-depth knowledge of some of the most powerful technologies
such as SQL, REST APIs, etc., machine learning engineers are also expected to perform A/B
testing, build data pipelines, and implement common machine learning algorithms such as
classification, clustering, etc.
To become a machine learning engineer: you need technologies like Java, Python, JavaScript, etc.,
and you should also have a strong grasp of statistics and mathematics.
5. Data Scientist
Data scientists have to understand the challenges of business and offer the best solutions using
data analysis and data processing. For instance, they are expected to perform predictive analysis
and run a fine-toothed comb through unstructured/disorganized data to offer actionable
insights.
To become a data scientist, you have to be an expert in R, MatLab, SQL, Python, and other
complementary technologies.
6. Data Architect
A data architect creates the blueprints for data management so that the databases can be easily
integrated, centralized, and protected with the best security measures. They also ensure that the
data engineers have the best tools and systems to work with.
To become a data architect: you need expertise in data warehousing, data modelling, extraction,
transformation and loading (ETL), etc. You must also be well versed in Hive, Pig, Spark, etc.
7. Statistician
A statistician, as the name suggests, has a sound understanding of statistical theories and data
organization. Not only do they extract and offer valuable insights from the data clusters, but they
also help create new methodologies for the engineers to apply.
To become a statistician: you need SQL, data mining, and various machine learning technologies.
8. Business Analyst
The role of business analysts is slightly different from that of other data science jobs. While they do
have a good understanding of how data-oriented technologies work and how to handle large
volumes of data, they also separate the high-value data from the low-value data.
To become a business analyst: you need an understanding of business finances and business
intelligence, and also of IT technologies like data modelling, data visualization tools, etc.
Data Science workflows tend to happen in a wide range of domains and areas of expertise such as
biology, geography, finance or business, among others. This means that Data Science projects can
take on very different challenges and focuses resulting in very different methods and data sets
being used. A Data Science project will have to go through five key stages: defining a problem,
data processing, modelling, evaluation and deployment.
Defining a problem
The first stage of any Data Science project is to identify and define a problem to be solved.
Without a clearly defined problem to solve, it can be difficult to know how to tackle the
problem.
For a Data Science project this can include what method to use, such as whether to use
classification, regression or clustering. Also, without a clearly defined problem, it can be hard to
determine what your measure of success would be.
Without a defined measure of success, you can never know when your project is complete
or is good enough to be used in production.
A challenge with this is being able to define a problem small enough that it can be
solved/tackled individually.
Data Processing
Once you have your problem, how you are going to measure success, and an idea of the
methods you will be using, you can then go about performing the all-important task of data
processing. This is often the stage that will take the longest in any Data Science project
and can regularly be the most important stage.
There are a variety of tasks that need to occur at this stage depending on what problem
you are going to tackle. The first is often finding ways to create or capture data that
doesn’t exist yet.
Once you have created this data, you then need to collect it somewhere and in a format
that is useful for your model. This will depend on what method you will be using in the
modelling phase but it will involve figuring out how you will feed the data into your
model.
The final part of this is to then perform any pre-processing steps to ensure that the data is
clean enough for the modelling method to work. This may involve removing outliers (or
deliberately choosing to keep them), handling null values (deciding whether a null value is a
meaningful measurement or whether it should be imputed with, say, the average), or
standardising the measures.
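To make this concrete, here is a small, hypothetical pre-processing sketch in Python using pandas
and scikit-learn; the column names (age, income) and the outlier rule are assumptions made purely
for illustration, not a prescription.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data with one missing value and one extreme income
df = pd.DataFrame({"age": [25, 32, None, 41, 29],
                   "income": [30000, 42000, 39000, 1000000, 35000]})

# Impute the missing age with the column mean
df["age"] = df["age"].fillna(df["age"].mean())

# Drop income outliers using the 1.5 * IQR rule
q1 = df["income"].quantile(0.25)
q3 = df["income"].quantile(0.75)
iqr = q3 - q1
df = df[(df["income"] >= q1 - 1.5 * iqr) & (df["income"] <= q3 + 1.5 * iqr)].copy()

# Standardise both measures to zero mean and unit variance
df[["age", "income"]] = StandardScaler().fit_transform(df[["age", "income"]])
print(df)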
Modelling
The next part, and often the most fun and exciting part, is the modelling phase of the Data
Science project. The format this will take will depend primarily on what the problem is
and how you defined success in the first step, and secondarily on how you processed the
data.
Unfortunately, this is often the part that will take the least amount of time of any Data
Science project, especially as many frameworks and libraries already exist, such as
sklearn, statsmodels and tensorflow, that can be readily utilised.
You should have selected the method that you will be using to model your data in the
defining a problem stage, and this may include simple graphical exploration, regression,
classification or clustering.
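As a rough sketch of this phase, the following uses scikit-learn on synthetic data, assuming a
classification method was chosen in the problem-definition stage.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for the processed data set
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Fit a simple classification model
model = LogisticRegression(max_iter=1000)
model.fit(X, y)
print(model.predict(X[:5]))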
Evaluation
Once you have created and implemented your models, you then need to know how to
evaluate them. Again, this goes back to the problem formulation stage where you will have
defined your measure of success, but this is often one of the most important stages.
Depending on how you processed your data and set-up your model, you may have a
holdout dataset or testing data set that can be used to evaluate your model. On this dataset,
you are aiming to see how well your model performs in terms of both accuracy and
reliability.
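A minimal evaluation sketch, again on synthetic data with scikit-learn, holding out 25% of the rows
as a test set; the metric choices here (accuracy and a confusion matrix) are just examples.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold out 25% of the data purely for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

print("accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))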
Deployment
Finally, once you have robustly evaluated your model and are satisfied with the results, then you
can deploy it into production. This can mean a variety of things such as whether you use the
insights from the model to make changes in your business, whether you use your model to check
whether changes that have been made were successful, or whether the model is deployed
somewhere to continually receive and evaluate live data.
Applications of Data Science
2. In Transport
Data Science has also entered the transport field, for example with driverless cars. With the help
of driverless cars, it becomes easier to reduce the number of accidents.
For example, in driverless cars the training data is fed into the algorithm, and with the help
of Data Science techniques the data is analyzed: what the speed limit is on highways, busy
streets, narrow roads, etc., and how to handle different situations while driving.
3. In Finance
Data Science plays a key role in financial industries. Financial industries always face issues
of fraud and risk of losses, so they need to automate risk-of-loss analysis in order to carry out
strategic decisions for the company. Financial industries also use Data Science analytics tools
in order to predict the future.
For example, Data Science is a major part of the stock market, where it is used to examine past
behaviour from historical data with the goal of predicting future outcomes.
4. In E-Commerce
E-commerce websites like Amazon, Flipkart, etc. use Data Science to create a better user
experience with personalized recommendations.
For example, when we search for something on an e-commerce website, we get suggestions
similar to our past choices based on our previous data, and we also get recommendations based
on the most-bought, most-rated and most-searched products. This is all done with the help of
Data Science.
5. In Health Care
In the healthcare industry, Data Science acts as a boon. Data Science is used for:
Tumor detection.
Drug discovery.
Medical image analysis.
Virtual medical bots.
Genetics and genomics.
Predictive modeling for diagnosis, etc.
6. Image Recognition
Currently, Data Science is also used in image recognition. For example, when we upload
a photo with a friend on Facebook, Facebook suggests tagging the people who are in the
picture. This is done with the help of machine learning and Data Science. When an image is
recognized, data analysis is performed on one's Facebook friends, and if a face in the picture
matches someone's profile, Facebook suggests auto-tagging that person.
7. Targeting Recommendation
Targeted recommendation is one of the most important applications of Data Science. Whatever a
user searches for on the Internet, he/she will then see related advertisements everywhere.
For example, suppose I want a mobile phone, so I search for it on Google but then change my
mind and decide to buy it offline. Data Science helps the companies who pay for advertisements
for that mobile phone: everywhere on the Internet, in social media, on websites and in apps, I
will see recommendations for the phone I searched for, which nudges me to buy it online.
12. Autocomplete
The autocomplete feature is an important application of Data Science: the user types just a few
letters or words, and the system suggests how to complete the rest of the line. In Gmail, when we
are writing a formal mail to someone, the autocomplete feature efficiently suggests completing
the whole sentence. The feature is also widely used in search engines, in social media, and in
various apps.
Data Security
Data security is the process of protecting corporate data and preventing data loss through
unauthorized access. This includes protecting your data from attacks that can encrypt or destroy
data, such as ransomware, as well as attacks that can modify or corrupt your data. Data security
also ensures data is available to anyone in the organization who has access to it.
Some industries require a high level of data security to comply with data protection regulations.
For example, organizations that process payment card information must use and store payment
card data securely, and healthcare organizations in the USA must secure protected health
information (PHI) in line with the HIPAA standard.
Data Privacy
Data privacy is the distinction between data in a computer system that can be shared with third
parties (non-private data), and data that cannot be shared with third parties (private data). There
are two main aspects to enforcing data privacy:
Access control—ensuring that anyone who tries to access the data is authenticated to confirm
their identity, and authorized to access only the data they are allowed to access.
Data protection—ensuring that even if unauthorized parties manage to access the data, they
cannot view it or cause damage to it. Data protection methods include encryption, which prevents
anyone from viewing the data if they do not have a private encryption key, and data loss
prevention mechanisms, which prevent users from transferring sensitive data outside the organization.
Data security has many overlaps with data privacy. The same mechanisms used to ensure data
privacy are also part of an organization’s data security strategy.
The primary difference is that data privacy mainly focuses on keeping data confidential, while
data security mainly focuses on protecting from malicious activity.
Common Data Security Risks
Accidental Exposure
A large percentage of data breaches are not the result of a malicious attack but are caused by
negligent or accidental exposure of sensitive data. It is common for an organization’s employees to
share, grant access to, lose, or mishandle valuable data, either by accident or because they are not
aware of security policies.
Social Engineering Attacks
Social engineering attacks are a primary vector used by attackers to access sensitive data.
They involve manipulating or tricking individuals into providing private information or access to
privileged accounts.
Phishing is a common form of social engineering. It involves messages that appear to be from a
trusted source, but in fact are sent by an attacker.
Insider Threats
Insider threats are employees who inadvertently or intentionally threaten the security of an
organization’s data. There are three types of insider threats:
Non-malicious insider—these are users that can cause harm accidentally, via negligence, or
because they are unaware of security procedures.
Malicious insider—these are users who actively attempt to steal data or cause harm to the
organization for personal gain.
Compromised insider—these are users who are not aware that their accounts or credentials were
compromised by an external attacker. The attacker can then perform malicious activity,
pretending to be a legitimate user.
Ransomware
Ransomware is a major threat to data in companies of all sizes. Ransomware is malware that
infects corporate devices and encrypts data, making it useless without the decryption key.
Attackers display a ransom message asking for payment to release the key, but in many cases,
even paying the ransom is ineffective and the data is lost.
Data Loss in the Cloud
Many organizations are moving data to the cloud to facilitate easier sharing and collaboration.
However, when data moves to the cloud, it is more difficult to control and prevent data loss. Users
access data from personal devices and over unsecured networks. It is all too easy to share a file
with unauthorized parties, either accidentally or maliciously.
SQL Injection
SQL injection (SQLi) is a common technique used by attackers to gain illicit access to
databases, steal data, and perform unwanted operations. It works by adding malicious code to a
seemingly innocent database query.
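To illustrate the mechanism (using Python's built-in sqlite3 module and a made-up users table,
not any particular system), compare a query built by string concatenation with a parameterised one:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cr3t')")

user_input = "nobody' OR '1'='1"   # attacker-controlled value

# Vulnerable: the input is pasted straight into the SQL text,
# so the OR '1'='1' clause makes the query return every row.
unsafe = f"SELECT * FROM users WHERE name = '{user_input}'"
print(conn.execute(unsafe).fetchall())

# Safe: a parameterised query treats the input purely as data.
safe = "SELECT * FROM users WHERE name = ?"
print(conn.execute(safe, (user_input,)).fetchall())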
Common Data Security Solutions and Techniques
Data Discovery and Classification
Modern IT environments store data on servers, endpoints, and cloud systems. Visibility
over data flows is an important first step in understanding what data is at risk of being
stolen or misused.
To properly protect your data, you need to know the type of data, where it is, and what it is
used for. Data discovery and classification tools can help.
Data discovery is the basis for knowing what data you have. Data classification allows you
to create scalable security solutions by identifying which data is sensitive and needs to be
secured.
Data Masking
Data masking lets you create a synthetic version of your organizational data, which you
can use for software testing, training, and other purposes that don’t require the real data.
The goal is to protect data while providing a functional alternative when needed.
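A very small masking sketch in Python, with hypothetical customer fields; real masking tools are
far more sophisticated, but the idea is to replace sensitive values with synthetic ones while keeping
the data usable:

import pandas as pd

customers = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "card_number": ["4111111111111111", "5500000000000004"],
})

# Keep only the last four digits of each card number
customers["card_number"] = "**** **** **** " + customers["card_number"].str[-4:]

# Replace real names with synthetic identifiers
customers["name"] = ["customer_%d" % i for i in range(len(customers))]
print(customers)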
Data Encryption
Data encryption converts data into an unreadable form so that only parties holding the right
key can read it. In public-key (asymmetric) cryptography there is no need to share a secret
decryption key: the sender encrypts with the recipient's public key, and only the recipient's
private key can decrypt the result. This is inherently more secure than distributing a shared
secret key.
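A small sketch of the public-key idea, assuming the third-party Python cryptography package is
available; the message is a placeholder:

from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

# The recipient generates a key pair; the public key can be shared openly.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# The sender encrypts with the recipient's public key.
ciphertext = public_key.encrypt(b"confidential record", oaep)

# Only the holder of the private key can decrypt.
print(private_key.decrypt(ciphertext, oaep))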
Password Hygiene
One of the simplest best practices for data security is ensuring users have unique, strong
passwords. Without central management and enforcement, many users will use easily
guessable passwords or use the same password for many different services.
Password spraying and other brute force attacks can easily compromise accounts with
weak passwords.
Authentication
Organizations must put in place strong authentication methods, such as OAuth for web-based
systems. It is highly recommended to enforce multi-factor authentication when any user, whether
internal or external, requests sensitive or personal data.
UNIT - II
DATA COLLECTION:
Data collection is the process of collecting, measuring and analyzing different types of
information using a set of standard validated techniques. The main objective of data collection is
to gather information-rich and reliable data, and analyze them to make critical business decisions.
Once the data is collected, it goes through a rigorous process of data cleaning and data
processing to make this data truly useful for businesses.
There are two main methods of data collection in research based on the information that is
required, namely: primary data collection and secondary data collection.
The methods of collecting primary data can be further divided into quantitative data
collection methods (deals with factors that can be counted) and qualitative data collection
methods (deals with factors that are not necessarily numerical in nature).
Here are some of the most common primary data collection methods:
1. Interviews
Interviews are a direct method of data collection. It is simply a process in which the interviewer
asks questions and the interviewee responds to them. It provides a high degree of flexibility
because questions can be adjusted and changed anytime according to the situation.
2. Observations
In this method, researchers observe a situation around them and record the findings. It can be used
to evaluate the behaviour of different people in controlled (everyone knows they are being
observed) and uncontrolled (no one knows they are being observed) situations.
3. Surveys and Questionnaires
Surveys and questionnaires provide a broad perspective from large groups of people. They can be
conducted face-to-face, mailed, or even posted on the Internet to get respondents from anywhere
in the world.
4. Focus Groups
A focus group is similar to an interview, but it is conducted with a group of people who all have
something in common. The data collected is similar to in-person interviews, but they offer a better
understanding of why a certain group of people thinks in a particular way.
5. Oral Histories
Oral histories also involve asking questions like interviews and focus groups. However, it is
defined more precisely and the data collected is linked to a single phenomenon. It involves
collecting the opinions and personal experiences of people in a particular event that they were
involved in.
Secondary data refers to data that has already been collected by someone else. It is much less
expensive and easier to collect than primary data.
Here are some of the most common secondary data collection methods:
1. Internet
The use of the Internet has become one of the most popular secondary data collection methods in
recent times. There is a large pool of free and paid research resources that can be easily accessed
on the Internet.
2. Government Archives
There is lots of data available from government archives that you can make use of. The most
important advantage is that the data in government archives are authentic and verifiable. The
challenge, however, is that data is not always readily available due to a number of factors.
3. Libraries
Most researchers donate several copies of their academic research to libraries. You can collect
important and authentic information based on different research contexts.
Data preprocessing
Data preprocessing, a component of data preparation, describes any type of processing performed
on raw data to prepare it for another data processing procedure.
Data preprocessing transforms the data into a format that is more easily and effectively processed
in data mining, machine learning and other data science tasks. The techniques are generally used
at the earliest stages of the machine learning and AI development pipeline to ensure accurate
results.
There are several different tools and methods used for preprocessing data, including the
following:
One such method is feature extraction, which pulls out a relevant feature subset that is significant
in a particular context.
1. Data profiling. Data profiling is the process of examining, analyzing and reviewing data to
collect statistics about its quality and characteristics.
2. Data cleansing. The aim here is to find the easiest way to rectify quality issues, such as
eliminating bad data, filling in missing data or otherwise ensuring the raw data is suitable for
feature engineering.
3. Data reduction. Raw data sets often include redundant data that arise from characterizing
phenomena in different ways or data that is not relevant to a particular ML, AI or analytics task.
Data reduction uses techniques like principal component analysis to transform the raw data into a
simpler form suitable for particular use cases.
4. Data transformation. Here, data scientists think about how different aspects of the data need
to be organized to make the most sense for the goal. This could include things like
structuring unstructured data, combining salient variables when it makes sense or identifying
important ranges to focus on.
5. Data enrichment. In this step, data scientists apply the various feature engineering libraries to
the data to effect the desired transformations. The result should be a data set organized to achieve
the optimal balance between the training time for a new model and the required compute.
6. Data validation. At this stage, the data is split into two sets. The first set is used to train a
machine learning or deep learning model. The second set is the testing data that is used to gauge
the accuracy and robustness of the resulting model.
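The split described in the data validation step might look like the following scikit-learn sketch,
with stand-in arrays and an assumed 80/20 split:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # stand-in feature matrix
y = np.arange(50) % 2               # stand-in labels

# Keep 80% of the rows for training and hold back 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)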
Data preprocessing is the process of transforming raw data into an understandable format. It is
also an important step in data mining as we cannot work with raw data. The quality of the data
should be checked before applying machine learning or data mining algorithms. The major steps
involved in data preprocessing are:
1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation
In Python, libraries are imported using the import keyword. The following are some of the most
popular libraries that data scientists use.
NumPy is the fundamental package for scientific computing with Python. It contains, among other
things, a powerful N-dimensional array object, broadcasting functions, and tools for linear algebra,
Fourier transforms, and random number generation.
Pandas is for data manipulation and analysis. Pandas is an open source, BSD-licensed library
providing high-performance, easy-to-use data structures and data analysis tools for
the Python programming language.
Matplotlib is a plotting library that can be used in Python scripts, the Python and IPython shells,
the Jupyter notebook, web application servers, and graphical user interface toolkits. Seaborn is a
Python data visualization library based on Matplotlib; it provides a high-level interface for
drawing statistical graphics.
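A typical import cell might look like the following; the short aliases np, pd, plt and sns are
conventions, not requirements:

import numpy as np               # numerical arrays and linear algebra
import pandas as pd              # data manipulation and analysis
import matplotlib.pyplot as plt  # plotting
import seaborn as sns            # statistical visualisation built on matplotlib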
Using Pandas, we import our data set; the file used here is a .csv file. (Note: you will not
necessarily deal with a CSV file every time; sometimes the data comes as HTML or XLSX (Excel)
files.)
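A loading step might look like the sketch below; the file name dataset.csv is a placeholder, and
pandas also offers read_excel and read_html for the other formats mentioned:

import pandas as pd

# Load a CSV file into a DataFrame (the file name is hypothetical)
dataset = pd.read_csv("dataset.csv")
print(dataset.head())      # inspect the first five rows

# For other formats:
# dataset = pd.read_excel("dataset.xlsx")
# tables  = pd.read_html("page.html")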
The concept of missing values is important to understand in order to successfully manage data. If
missing values are not handled properly, the researcher may end up drawing inaccurate inferences
about the data, and the results obtained will differ from those of an analysis in which the missing
values are handled correctly.
The following data preprocessing method is commonly used to handle null values. The strategy
can be applied to a feature that has numeric data, such as the year column or the home team goal
column: we calculate the mean, median or mode of the feature and replace the missing values
with it.
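A sketch of mean imputation with scikit-learn's SimpleImputer; the column names (year,
home_team_goal) are assumptions echoing the example in the text:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"year": [2010, 2011, np.nan, 2013],
                   "home_team_goal": [2, np.nan, 1, 3]})

# Replace each missing value with the mean of its column
imputer = SimpleImputer(strategy="mean")
df[["year", "home_team_goal"]] = imputer.fit_transform(df[["year", "home_team_goal"]])
print(df)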
label_encoder is the object used here to transform categorical data into numerical data. The
label_encoder object is fitted to the first column of our matrix X, which returns the first column
(country) of the matrix X encoded as integers.
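A sketch of the same idea with scikit-learn's LabelEncoder, assuming a small matrix X whose
first column holds country names:

import numpy as np
from sklearn.preprocessing import LabelEncoder

X = np.array([["France", 44, 72000],
              ["Spain", 27, 48000],
              ["Germany", 30, 54000]], dtype=object)

# Fit the encoder to the country column and replace it with integer codes
label_encoder = LabelEncoder()
X[:, 0] = label_encoder.fit_transform(X[:, 0])
print(X)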