
INTRODUCTION TO DATA SCIENCE


LECTURE NOTES

UNIT - 1

Introduction to data science


Data science:

Data science is the domain of study that deals with vast volumes of data using modern tools and
techniques to find unseen patterns, derive meaningful information, and make business decisions.
Data science uses complex machine learning algorithms to build predictive models.

The data used for analysis can come from many different sources and be presented in various
formats.

Data science is about extraction, preparation, analysis, visualization, and maintenance of
information. It is a cross-disciplinary field which uses scientific methods and processes to draw
insights from data.

The Data Science Lifecycle

Data science’s lifecycle consists of five distinct stages, each with its own tasks:

Capture: Data Acquisition, Data Entry, Signal Reception, Data Extraction. This stage involves
gathering raw structured and unstructured data.

Maintain: Data Warehousing, Data Cleansing, Data Staging, Data Processing, Data Architecture.
This stage covers taking the raw data and putting it in a form that can be used.

Process: Data Mining, Clustering/Classification, Data Modeling, Data Summarization. Data
scientists take the prepared data and examine its patterns, ranges, and biases to determine how
useful it will be in predictive analysis.

Analyze: Exploratory/Confirmatory, Predictive Analysis, Regression, Text Mining, Qualitative
Analysis. Here is the real meat of the lifecycle. This stage involves performing the various
analyses on the data.

CSE NRCM P.LAKSHMI PRASANNA(ASST.PROFESSOR)

Communicate: Data Reporting, Data Visualization, Business Intelligence, Decision Making. In
this final step, analysts prepare the analyses in easily readable forms such as charts, graphs, and
reports.
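The five stages can be traced end to end in a tiny, self-contained sketch. The sales records and region names below are invented purely for illustration:

```python
import statistics

# Capture: raw records as they might arrive from data entry (one is messy).
raw = [
    {"region": "North", "sales": "120"},
    {"region": "South", "sales": "95"},
    {"region": "North", "sales": ""},      # missing value
    {"region": "South", "sales": "110"},
]

# Maintain: cleanse the raw data - drop missing values, convert types.
clean = [{"region": r["region"], "sales": int(r["sales"])}
         for r in raw if r["sales"]]

# Process: group the prepared data by region.
by_region = {}
for r in clean:
    by_region.setdefault(r["region"], []).append(r["sales"])

# Analyze: compute a simple statistic per region.
avg = {region: statistics.mean(vals) for region, vals in by_region.items()}

# Communicate: report the result in an easily readable form.
for region, m in sorted(avg.items()):
    print(f"{region}: average sales = {m:.1f}")
```

Real projects replace each comment with far heavier machinery (data warehouses, modelling libraries, dashboards), but the shape of the pipeline is the same.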

Evolution of Data Science: Growth & Innovation

Data science was born from the idea of merging applied statistics with computer science. The
resulting field of study would use the extraordinary power of modern computing. Scientists
realized they could not only collect data and solve statistical problems but also use that data to
solve real-world problems and make reliable fact-driven predictions.

1962: American mathematician John W. Tukey first articulated the data science dream. In his
now-famous article “The Future of Data Analysis,” he foresaw the inevitable emergence of a new
field nearly two decades before the first personal computers. While Tukey was ahead of his time,
he was not alone in his early appreciation of what would come to be known as “data science.”

1977: The theories and predictions of “pre” data scientists like Tukey and Peter Naur became more
concrete with the establishment of The International Association for Statistical Computing
(IASC), whose mission was “to link traditional statistical methodology, modern computer
technology, and the knowledge of domain experts in order to convert data into information and
knowledge.”

1980s and 1990s: Data science began taking more significant strides with the emergence of the
first Knowledge Discovery in Databases (KDD) workshop and the founding of the International
Federation of Classification Societies (IFCS).

1994: Business Week published a story on the new phenomenon of “Database Marketing.” It
described the process by which businesses were collecting and leveraging enormous amounts of
data to learn more about their customers, competition, or advertising techniques.


1990s and early 2000s: Data science emerged as a recognized and specialized field. Several data
science academic journals began to circulate, and data science
proponents like Jeff Wu and William S. Cleveland continued to help develop and expound upon
the necessity and potential of data science.

2000s: Technology made enormous leaps by providing nearly universal access to internet
connectivity, communication, and (of course) data collection.

2005: Big data enters the scene. With tech giants such as Google and Facebook amassing large
amounts of data, new technologies capable of processing it became necessary. Hadoop rose to
the challenge, and later Spark and Cassandra made their debuts.

2014: Due to the increasing importance of data, and organizations’ interest in finding patterns and
making better business decisions, demand for data scientists began to see dramatic growth in
different parts of the world.

2015: Machine learning, deep learning, and Artificial Intelligence (AI) officially enter the realm
of data science.

2018: New data-protection regulations, most notably the EU’s GDPR, became one of the biggest
factors in the evolution of data science.

2020s: We are seeing additional breakthroughs in AI and machine learning, along with
ever-increasing demand for qualified Big Data professionals.

Roles in Data Science

Data Analyst

Data Engineers

Database Administrator

Machine Learning Engineer


Data Scientist

Data Architect

Statistician

Business Analyst

Data and Analytics Manager

1. Data Analyst

Data analysts are responsible for a variety of tasks including visualisation, munging, and
processing of massive amounts of data. They also have to perform queries on the databases from
time to time. One of the most important skills of a data analyst is optimization.

A few important roles and responsibilities of a Data Analyst include:

Extracting data from primary and secondary sources using automated tools

Developing and maintaining databases

Performing data analysis and making reports with recommendations

To become a data analyst: SQL, R, SAS, and Python are some of the sought-after technologies for
data analysis.

2. Data Engineers

Data engineers build and test scalable Big Data ecosystems for businesses so that data
scientists can run their algorithms on data systems that are stable and highly optimized. Data
engineers also update existing systems with newer or upgraded versions of current
technologies to improve the efficiency of the databases.

A few important roles and responsibilities of a Data Engineer include:

Design and maintain data management systems


Data collection/acquisition and management

Conducting primary and secondary research

To become a data engineer, you need hands-on experience with technologies such as Hive, NoSQL,
R, Ruby, Java, C++, and Matlab.

3. Database Administrator

The job profile of a database administrator is pretty much self-explanatory: they are responsible
for the proper functioning of all of an enterprise’s databases, and they grant or revoke access to
them for the company’s employees depending on their requirements.

A few important roles and responsibilities of a Database Administrator include:

 Working on database software to store and manage data
 Working on database design and development
 Implementing security measures for databases
 Preparing reports, documentation, and operating manuals

To become a database administrator, you need skills in database backup and recovery, data
security, data modeling and design, etc.

4. Machine Learning Engineer

Machine learning engineers are in high demand today. However, the job profile comes with its
challenges. Apart from having in-depth knowledge of powerful technologies such as SQL and
REST APIs, machine learning engineers are also expected to perform A/B testing, build data
pipelines, and implement common machine learning algorithms such as classification and
clustering.

A few important roles and responsibilities of a Machine Learning Engineer include:

 Designing and developing Machine Learning systems
 Researching Machine Learning algorithms


 Testing Machine Learning systems
 Developing apps/products based on client requirements

To become a machine learning engineer, you need technologies like Java, Python, and JavaScript.
You should also have a strong grasp of statistics and mathematics.

5. Data Scientist

Data scientists have to understand the challenges of business and offer the best solutions using
data analysis and data processing. For instance, they are expected to perform predictive analysis
and run a fine-toothed comb through unstructured/disorganized data to offer actionable
insights.

A few important roles and responsibilities of a Data Scientist include:

 Identifying data collection sources for business needs
 Processing, cleansing, and integrating data
 Automating the data collection and management process
 Using Data Science techniques/tools to improve processes

To become a data scientist, you have to be an expert in R, MatLab, SQL, Python, and other
complementary technologies.

6. Data Architect

A data architect creates the blueprints for data management so that the databases can be easily
integrated, centralized, and protected with the best security measures. They also ensure that the
data engineers have the best tools and systems to work with.

A few important roles and responsibilities of a Data Architect include:

 Developing and implementing an overall data strategy in line with the business/organization
 Identifying data collection sources in line with the data strategy
 Collaborating with cross-functional teams and stakeholders for smooth functioning of
database systems


 Planning and managing end-to-end data architecture

To become a data architect requires expertise in data warehousing, data modelling, extraction,
transformation, and loading (ETL), etc. You also must be well versed in Hive, Pig, Spark, etc.

7. Statistician

A statistician, as the name suggests, has a sound understanding of statistical theories and data
organization. Not only do they extract and offer valuable insights from the data clusters, but they
also help create new methodologies for the engineers to apply.

A few important roles and responsibilities of a Statistician include:

 Collecting, analyzing, and interpreting data
 Analyzing data, assessing results, and predicting trends/relationships using statistical
methodologies/tools
 Designing data collection processes

To become a statistician, you need SQL, data mining, and the various machine learning technologies.

8. Business Analyst

The role of business analysts is slightly different from other data science roles. While they do
have a good understanding of how data-oriented technologies work and how to handle large
volumes of data, they also separate high-value data from low-value data.

A few important roles and responsibilities of a Business Analyst include:

 Understanding the business of the organization


 Conducting detailed business analysis – outlining problems, opportunities, and solutions
 Working on improving existing business processes

To become a business analyst, you need an understanding of business finance and business
intelligence, as well as IT technologies like data modelling and data visualization tools.


Stages in a data science project

Data Science workflows tend to happen in a wide range of domains and areas of expertise such as
biology, geography, finance or business, among others. This means that Data Science projects can
take on very different challenges and focuses, resulting in very different methods and data sets
being used. A Data Science project will have to go through five key stages: defining a problem,
data processing, modelling, evaluation and deployment.

Defining a problem

 The first stage of any Data Science project is to identify and define a problem to be solved.
Without a clearly defined problem to solve, it can be difficult to know how to tackle the
problem.
 For a Data Science project, this can include deciding which method to use, such as
classification, regression, or clustering. Also, without a clearly defined problem, it can be hard to
determine what your measure of success would be.
 Without a defined measure of success, you can never know when your project is complete
or is good enough to be used in production.
 A challenge with this is being able to define a problem small enough that it can be
solved/tackled individually.

Data Processing

 Once you have your problem, how you are going to measure success, and an idea of the
methods you will be using, you can then go about performing the all-important task of data
processing. This is often the stage that will take the longest in any Data Science project
and can regularly be the most important stage.
 There are a variety of tasks that need to occur at this stage depending on what problem
you are going to tackle. The first is often finding ways to create or capture data that
doesn’t exist yet.


 Once you have created this data, you then need to collect it somewhere and in a format
that is useful for your model. This will depend on what method you will be using in the
modelling phase but it will involve figuring out how you will feed the data into your
model.
 The final part of this is to then perform any pre-processing steps to ensure that the data is
clean enough for the modelling method to work. This may involve removing outliers (or
choosing to keep them), handling null values (deciding whether a null is a genuine
measurement or should be imputed with the average), or standardising the measures.
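The pre-processing steps just described can be sketched in a few lines of plain Python. The values and the 2-standard-deviation outlier threshold are illustrative choices, not fixed rules:

```python
import statistics

# One raw feature column with missing values (None) and one outlier.
values = [4.0, 5.0, None, 6.0, 5.5, 100.0, None, 4.5]

# Impute: replace each null with the mean of the observed values.
observed = [v for v in values if v is not None]
fill = statistics.mean(observed)
imputed = [v if v is not None else fill for v in values]

# Outliers: drop points more than 2 standard deviations from the mean
# (the threshold is a judgement call, not a fixed rule).
mu = statistics.mean(imputed)
sigma = statistics.stdev(imputed)
kept = [v for v in imputed if abs(v - mu) <= 2 * sigma]

# Standardise: rescale what remains to zero mean and unit variance.
mu2 = statistics.mean(kept)
sigma2 = statistics.stdev(kept)
standardised = [(v - mu2) / sigma2 for v in kept]
```

In practice libraries such as pandas and scikit-learn provide these operations ready-made, but the underlying logic is exactly this.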

Modelling

 The next part, and often the most fun and exciting part, is the modelling phase of the Data
Science project. The format this will take will depend primarily on what the problem is
and how you defined success in the first step, and secondarily on how you processed the
data.
 Unfortunately, this is often the part that will take the least amount of time of any Data
Science project, especially since many frameworks and libraries, such as sklearn,
statsmodels, and tensorflow, can be readily utilised.
 You should have selected the method that you will be using to model your data in the
defining a problem stage, and this may include simple graphical exploration, regression,
classification or clustering.
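As a minimal illustration of this stage, here is a dependency-free sketch of one such method, simple linear regression fit by ordinary least squares. Libraries like sklearn provide this ready-made; the data points below are invented:

```python
# Dependency-free sketch of the modelling step: fit a simple linear
# regression y = a*x + b by ordinary least squares (closed form).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.1, 6.2, 7.9, 10.1]   # roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope: covariance(x, y) / variance(x); intercept from the means.
a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
b = mean_y - a * mean_x

def predict(x):
    return a * x + b
```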

Evaluation

 Once you have created and implemented your models, you then need to know how to
evaluate them. Again, this goes back to the problem formulation stage, where you will have
defined your measure of success, but this is often one of the most important stages.
 Depending on how you processed your data and set-up your model, you may have a
holdout dataset or testing data set that can be used to evaluate your model. On this dataset,


you are aiming to see how well your model performs in terms of both accuracy and
reliability.
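A minimal sketch of holdout evaluation, using an intentionally trivial majority-class "model" so the focus stays on the split-and-score mechanics (the labelled data is synthetic):

```python
import random

# Holdout evaluation sketch: split labelled data into train/test sets,
# "train" a deliberately trivial majority-class model, and score it on
# the held-out test set only.
random.seed(0)
data = [(i, i % 3 != 0) for i in range(100)]   # (feature, label) pairs
random.shuffle(data)

split = int(0.8 * len(data))                   # 80/20 train/test split
train, test = data[:split], data[split:]

# "Train": always predict the most common label in the training data.
train_labels = [label for _, label in train]
majority = max(set(train_labels), key=train_labels.count)

# Score on data the model never saw during training.
correct = sum(1 for _, label in test if label == majority)
accuracy = correct / len(test)
print(f"holdout accuracy: {accuracy:.2f}")
```

The key point is that accuracy is computed only on the held-out rows; scoring on the training data would give an optimistically biased estimate.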

Deployment

Finally, once you have robustly evaluated your model and are satisfied with the results, then you
can deploy it into production. This can mean a variety of things such as whether you use the
insights from the model to make changes in your business, whether you use your model to check
whether changes that have been made were successful, or whether the model is deployed
somewhere to continually receive and evaluate live data.


Applications of data science in various fields


Major Applications of Data Science
1. In Search Engines
The most useful application of Data Science is search engines. When we want to search for
something on the internet, we mostly use search engines like Google, Yahoo, and Bing. Data
Science is used to deliver faster and more relevant search results.

2. In Transport
Data Science has also entered the transport field, for example through driverless cars. With the
help of driverless cars, it is easy to reduce the number of accidents.
For example, in driverless cars, training data is fed into the algorithm, and with the help of
Data Science techniques the data is analyzed: what the speed limit is on highways, busy streets,
and narrow roads, and how to handle different situations while driving.
3. In Finance
Data Science plays a key role in financial industries, which always face issues of fraud and risk
of losses. Financial industries therefore need to automate risk-of-loss analysis in order to carry
out strategic decisions for the company. They also use Data Science analytics tools to predict
future outcomes.

For example, in the stock market, Data Science plays a central role: it is used to examine past
behavior through historical data, with the goal of predicting future outcomes.
4. In E-Commerce
E-commerce websites like Amazon, Flipkart, etc. use Data Science to create a better user
experience with personalized recommendations.
For example, when we search for something on e-commerce websites, we get suggestions
similar to our past choices, as well as recommendations based on the most-bought, most-rated,
and most-searched products. This is all done with the help of Data Science.
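A toy version of such a recommender can be built from co-purchase counts alone. The baskets below are invented, and real systems use far richer signals such as ratings and browsing history:

```python
from collections import Counter
from itertools import combinations

# Toy "customers who bought X also bought Y" recommender built from
# past order baskets (hypothetical data).
orders = [
    {"phone", "case", "charger"},
    {"phone", "case"},
    {"phone", "charger"},
    {"laptop", "mouse"},
]

# Count how often each ordered pair of items appears in the same basket.
co_bought = Counter()
for basket in orders:
    for a, b in combinations(sorted(basket), 2):
        co_bought[(a, b)] += 1
        co_bought[(b, a)] += 1

def recommend(item, k=2):
    """Return the items most often bought together with `item`."""
    scores = Counter({b: n for (a, b), n in co_bought.items() if a == item})
    return [itm for itm, _ in scores.most_common(k)]

print(recommend("phone"))
```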
5. In Health Care
In the healthcare industry, data science acts as a boon. Data Science is used for:
 Detecting tumors.
 Drug discovery.
 Medical image analysis.
 Virtual medical bots.
 Genetics and genomics.
 Predictive modeling for diagnosis, etc.


6. Image Recognition
Currently, Data Science is also used in image recognition. For example, when we upload a
photo with a friend on Facebook, Facebook suggests tagging the people in the picture. This is
done with the help of machine learning and Data Science: when an image is recognized,
analysis is performed against one’s Facebook friends, and if a face in the picture matches
another profile, Facebook suggests auto-tagging.
7. Targeting Recommendation
Targeting recommendation is an important application of Data Science. Whatever the user
searches for on the internet, he/she will then see related posts everywhere.
For example, suppose I want a mobile phone, so I search for it on Google, and afterwards I
change my mind and decide to buy offline. Data Science helps the companies who are paying
for advertisements for that phone: everywhere on the internet, in social media, on websites,
and in apps, I will see recommendations for the mobile phone I searched for, which nudges me
to buy it online.

8. Airline Routing Planning


With the help of Data Science, the airline sector is also growing: it becomes easy to predict
flight delays, and it helps decide whether to fly directly to the destination or take a halt in
between. For example, a flight can take a direct route from Delhi to the U.S.A. or halt in
between before reaching the destination.

9. Data Science in Gaming


In most games where a user plays against a computer opponent, data science concepts are used
together with machine learning: with the help of past data, the computer improves its
performance. Many games, such as chess engines and EA Sports titles, use Data Science
concepts.

10. Medicine and Drug Development


The process of creating medicine is very difficult and time-consuming, and it has to be done
with full discipline because it is a matter of someone’s life. Without Data Science, developing a
new medicine or drug takes lots of time, resources, and money; with the help of Data Science,
it becomes easier because the likelihood of success can be estimated based on biological data
and factors. Data-science-based algorithms can also forecast how a compound will react in the
human body before lab experiments.

11. In Delivery Logistics


Various logistics companies like DHL, FedEx, etc. make use of Data Science. Data Science
helps these companies find the best route for shipment of their products, the best time for
delivery, the best mode of transport to reach the destination, etc.


12. Autocomplete
The autocomplete feature is an important application of Data Science: the user types just a few
letters or words, and the system offers to complete the rest of the line. For example, in Gmail,
when we are writing a formal mail, the autocomplete feature built on data science concepts
suggests an efficient way to complete the whole line. The autocomplete feature is also widely
used in search engines, in social media, and in various apps.
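A minimal form of autocomplete can be sketched with a sorted vocabulary and binary search. Real systems such as Gmail's Smart Compose use learned language models rather than simple prefix lookup; the vocabulary below is invented:

```python
import bisect

# Minimal prefix autocomplete: keep the vocabulary sorted and use
# binary search to find the range of words sharing the typed prefix.
vocab = sorted(["data", "database", "date", "day", "science", "scala"])

def autocomplete(prefix, limit=3):
    lo = bisect.bisect_left(vocab, prefix)
    # "\uffff" is a sentinel that sorts after any continuation of prefix.
    hi = bisect.bisect_right(vocab, prefix + "\uffff")
    return vocab[lo:hi][:limit]

print(autocomplete("dat"))   # words starting with "dat"
```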

Data security issues


What is Data Security?

Data security is the process of protecting corporate data and preventing data loss through
unauthorized access. This includes protecting your data from attacks that can encrypt or destroy
data, such as ransomware, as well as attacks that can modify or corrupt your data. Data security
also ensures data is available to anyone in the organization who has access to it.

Some industries require a high level of data security to comply with data protection regulations.
For example, organizations that process payment card information must use and store payment
card data securely, and healthcare organizations in the USA must secure protected health
information (PHI) in line with the HIPAA standard.

Data Security vs Data Privacy

Data privacy is the distinction between data in a computer system that can be shared with third
parties (non-private data), and data that cannot be shared with third parties (private data). There
are two main aspects to enforcing data privacy:

 Access control—ensuring that anyone who tries to access the data is authenticated to confirm
their identity, and authorized to access only the data they are allowed to access.
 Data protection—ensuring that even if unauthorized parties manage to access the data, they
cannot view it or cause damage to it. Data protection methods include encryption, which prevents
anyone from viewing data if they do not have the private encryption key, and data loss prevention
mechanisms, which prevent users from transferring sensitive data outside the organization.

Data security has many overlaps with data privacy. The same mechanisms used to ensure data
privacy are also part of an organization’s data security strategy.

The primary difference is that data privacy mainly focuses on keeping data confidential, while
data security mainly focuses on protecting data from malicious activity.


Data Security Risks

 Accidental Exposure

A large percentage of data breaches are not the result of a malicious attack but are caused by
negligent or accidental exposure of sensitive data. It is common for an organization’s employees to
share, grant access to, lose, or mishandle valuable data, either by accident or because they are not
aware of security policies.

 Phishing and Other Social Engineering Attacks

Social engineering attacks are a primary vector used by attackers to access sensitive data.
They involve manipulating or tricking individuals into providing private information or access to
privileged accounts.

Phishing is a common form of social engineering. It involves messages that appear to be from a
trusted source, but in fact are sent by an attacker.

 Insider Threats

Insider threats are employees who inadvertently or intentionally threaten the security of an
organization’s data. There are three types of insider threats:

 Non-malicious insider—these are users that can cause harm accidentally, via negligence, or
because they are unaware of security procedures.
 Malicious insider—these are users who actively attempt to steal data or cause harm to the
organization for personal gain.
 Compromised insider—these are users who are not aware that their accounts or credentials were
compromised by an external attacker. The attacker can then perform malicious activity,
pretending to be a legitimate user.

 Ransomware

Ransomware is a major threat to data in companies of all sizes. Ransomware is malware that
infects corporate devices and encrypts data, making it useless without the decryption key.


Attackers display a ransom message asking for payment to release the key, but in many cases,
even paying the ransom is ineffective and the data is lost.

 Data Loss in the Cloud

Many organizations are moving data to the cloud to facilitate easier sharing and collaboration.
However, when data moves to the cloud, it is more difficult to control and prevent data loss. Users
access data from personal devices and over unsecured networks. It is all too easy to share a file
with unauthorized parties, either accidentally or maliciously.

 SQL Injection

SQL injection (SQLi) is a common technique used by attackers to gain illicit access to
databases, steal data, and perform unwanted operations. It works by adding malicious code to a
seemingly innocent database query.
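The difference between a vulnerable query and a parameterized one can be shown with Python's built-in sqlite3 module. The table and the payload are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")

user_input = "x' OR '1'='1"   # classic injection payload

# Vulnerable: string concatenation lets the payload rewrite the query,
# turning the WHERE clause into a condition that is always true.
vulnerable = conn.execute(
    "SELECT secret FROM users WHERE name = '" + user_input + "'"
).fetchall()

# Safe: a parameterized query treats the input as a literal value.
safe = conn.execute(
    "SELECT secret FROM users WHERE name = ?", (user_input,)
).fetchall()

print(vulnerable)  # the payload leaks every secret in the table
print(safe)        # no user named "x' OR '1'='1" exists, so no rows
```

Parameterized queries (or an ORM that uses them) are the standard defense: the database driver never interprets the input as SQL.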

Common Data Security Solutions and Techniques:

Data Discovery and Classification

 Modern IT environments store data on servers, endpoints, and cloud systems. Visibility
over data flows is an important first step in understanding what data is at risk of being
stolen or misused.

 To properly protect your data, you need to know the type of data, where it is, and what it is
used for. Data discovery and classification tools can help.

 Data detection is the basis for knowing what data you have. Data classification allows you
to create scalable security solutions, by identifying which data is sensitive and needs to be
secured.
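A rule-based sketch of such classification might flag records matching patterns for sensitive data. The two patterns below are deliberately simplified illustrations; production discovery tools use far more robust detectors and validation (e.g. Luhn checks for card numbers):

```python
import re

# Hypothetical rule-based classifier: flag text containing patterns
# that look like sensitive data (email addresses, 16-digit card numbers).
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "card_number": re.compile(r"\b(?:\d[ -]?){16}\b"),
}

def classify(text):
    """Return the sorted list of sensitive-data kinds found in `text`."""
    return sorted(kind for kind, rx in PATTERNS.items() if rx.search(text))

print(classify("contact: alice@example.com"))
print(classify("paid with 4111 1111 1111 1111"))
print(classify("nothing sensitive here"))
```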

Data Masking

 Data masking lets you create a synthetic version of your organizational data, which you
can use for software testing, training, and other purposes that don’t require the real data.

 The goal is to protect data while providing a functional alternative when needed.
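Two common masking techniques, redaction and deterministic pseudonymisation, can be sketched as follows. The function names and the salt are illustrative; real masking tools also preserve referential integrity across whole databases:

```python
import hashlib

def redact_email(email):
    """Redaction: keep only the first character of the local part."""
    user, _, domain = email.partition("@")
    return user[0] + "***@" + domain

def pseudonymise(value, salt="demo-salt"):
    """Deterministic pseudonymisation: the same input always maps to
    the same opaque token, so joins across tables still work."""
    digest = hashlib.sha256((salt + value).encode()).hexdigest()
    return "user_" + digest[:8]

print(redact_email("alice@example.com"))
print(pseudonymise("alice") == pseudonymise("alice"))  # deterministic
```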


Data Encryption

 Data encryption is a method of converting data from a readable format (plaintext) to an
unreadable encoded format (ciphertext). Only after decrypting the ciphertext with the
decryption key can the data be read or processed.

 In public-key cryptography, there is no need to share a secret decryption key: the sender
encrypts with the recipient’s public key, and only the recipient’s matching private key can
decrypt. This is inherently more secure than sharing a single key.

 Data encryption can prevent hackers from accessing sensitive information.

Password Hygiene

 One of the simplest best practices for data security is ensuring users have unique, strong
passwords. Without central management and enforcement, many users will use easily
guessable passwords or use the same password for many different services.

 Password spraying and other brute force attacks can easily compromise accounts with
weak passwords.
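Proper password storage follows from this: never store the password itself, but a salted, slow hash of it. A minimal sketch using Python's standard library (the example password and iteration count are illustrative):

```python
import hashlib
import hmac
import os

# Salted password hashing sketch: store (salt, hash), never the password.
def hash_password(password, salt=None):
    salt = salt or os.urandom(16)           # random per-user salt
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode(), salt, 100_000)  # slow by design
    return salt, digest

def verify(password, salt, digest):
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode(), salt, 100_000)
    # Constant-time comparison avoids timing side channels.
    return hmac.compare_digest(candidate, digest)

salt, digest = hash_password("correct horse battery staple")
print(verify("correct horse battery staple", salt, digest))  # True
print(verify("hunter2", salt, digest))                       # False
```

The per-user salt defeats precomputed rainbow tables, and the high iteration count slows brute-force and password-spraying attempts.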

Authentication and Authorization

Organizations must put in place strong authentication and authorization methods, such as OAuth
for web-based systems. It is highly recommended to enforce multi-factor authentication when any
user, whether internal or external, requests sensitive or personal data.


UNIT –II

DATA COLLECTION AND PREPROCESSING

DATA COLLECTION:

Data collection is the process of collecting, measuring and analyzing different types of
information using a set of standard validated techniques. The main objective of data collection is
to gather information-rich and reliable data, and analyze them to make critical business decisions.
Once the data is collected, it goes through a rigorous process of data cleaning and data
processing to make this data truly useful for businesses.

There are two main methods of data collection in research based on the information that is
required, namely:

 Primary Data Collection

 Secondary Data Collection

Primary Data Collection Methods


Primary data refers to data collected from first-hand experience directly from the main source. It
refers to data that has never been used in the past. The data gathered by primary data collection
methods are generally regarded as the best kind of data in research.

 The methods of collecting primary data can be further divided into quantitative data
collection methods (deals with factors that can be counted) and qualitative data collection
methods (deals with factors that are not necessarily numerical in nature).

Here are some of the most common primary data collection methods:

1. Interviews

Interviews are a direct method of data collection. It is simply a process in which the interviewer
asks questions and the interviewee responds to them. It provides a high degree of flexibility
because questions can be adjusted and changed anytime according to the situation.


2. Observations

In this method, researchers observe a situation around them and record the findings. It can be used
to evaluate the behaviour of different people in controlled (everyone knows they are being
observed) and uncontrolled (no one knows they are being observed) situations.

3. Surveys and Questionnaires

Surveys and questionnaires provide a broad perspective from large groups of people. They can be
conducted face-to-face, mailed, or even posted on the Internet to get respondents from anywhere
in the world.

4. Focus Groups

A focus group is similar to an interview, but it is conducted with a group of people who all have
something in common. The data collected is similar to that from in-person interviews, but focus
groups offer a better understanding of why a certain group of people thinks in a particular way.

5. Oral Histories

Oral histories also involve asking questions, like interviews and focus groups. However, the
scope is defined more precisely, and the data collected is linked to a single phenomenon: it
involves collecting the opinions and personal experiences of people who were involved in a
particular event.

Secondary Data Collection Methods

Secondary data refers to data that has already been collected by someone else. It is much cheaper and easier to collect than primary data.

Here are some of the most common secondary data collection methods:


1. Internet

The use of the Internet has become one of the most popular secondary data collection methods in
recent times. There is a large pool of free and paid research resources that can be easily accessed
on the Internet.

2. Government Archives

Government archives hold a great deal of data that you can make use of. The most important advantage is that data in government archives is authentic and verifiable. The challenge, however, is that the data is not always readily available, due to a number of factors.

3. Libraries

Most researchers donate copies of their academic research to libraries, so you can collect important and authentic information there across different research contexts.

Data preprocessing

Data preprocessing, a component of data preparation, describes any type of processing performed
on raw data to prepare it for another data processing procedure.

Data preprocessing transforms the data into a format that is more easily and effectively processed
in data mining, machine learning and other data science tasks. The techniques are generally used
at the earliest stages of the machine learning and AI development pipeline to ensure accurate
results.

There are several different tools and methods used for preprocessing data, including the
following:

 sampling, which selects a representative subset from a large population of data;

 transformation, which manipulates raw data to produce a single input;

 denoising, which removes noise from data;

 imputation, which synthesizes statistically relevant data for missing values;


 normalization, which scales data to a common range so that features contribute comparably; and

 feature extraction, which pulls out a relevant feature subset that is significant in a particular
context.
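Several of these techniques can be sketched in a few lines of NumPy. The toy feature matrix below is invented for illustration, and "normalization" here means min-max scaling:

```python
import numpy as np

# Toy feature matrix with one missing value (np.nan).
X = np.array([[1.0, 200.0],
              [2.0, np.nan],
              [3.0, 600.0]])

# Imputation: replace each missing value with its column mean.
col_means = np.nanmean(X, axis=0)
nan_rows, nan_cols = np.where(np.isnan(X))
X[nan_rows, nan_cols] = col_means[nan_cols]

# Normalization (min-max scaling): rescale every column to [0, 1].
X_min, X_max = X.min(axis=0), X.max(axis=0)
X_scaled = (X - X_min) / (X_max - X_min)

print(X_scaled)
```

After imputation the second row becomes [2.0, 400.0], and scaling maps each column onto [0, 1].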

Key Steps in Data Preprocessing


1. Data profiling. Data profiling is the process of examining, analyzing and reviewing data to
collect statistics about its quality. It starts with a survey of existing data and its characteristics.
Data scientists identify data sets that are pertinent to the problem at hand, inventory its significant
attributes, and form a hypothesis of features that might be relevant for the proposed analytics or
machine learning task.
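As a minimal illustration of profiling, pandas' summary utilities report per-column statistics and missing-value counts (the DataFrame contents below are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [44, 27, np.nan, 38],
                   "salary": [72000, 48000, 54000, 61000]})

# Quick profile: per-column count, mean, spread, quartiles.
print(df.describe())

# Missing values per column.
print(df.isnull().sum())
```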

2. Data cleansing. The aim here is to find the easiest way to rectify quality issues, such as
eliminating bad data, filling in missing data or otherwise ensuring the raw data is suitable for
feature engineering.

3. Data reduction. Raw data sets often include redundant data that arise from characterizing
phenomena in different ways or data that is not relevant to a particular ML, AI or analytics task.
Data reduction uses techniques like principal component analysis to transform the raw data into a
simpler form suitable for particular use cases.
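A sketch of principal component analysis computed directly from the covariance matrix with NumPy; the data is synthetic, and real pipelines would typically use a library implementation such as scikit-learn's PCA:

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 samples of 3 correlated features, built from 2 latent factors.
X = rng.normal(size=(100, 2)) @ np.array([[1.0, 0.5, 0.2],
                                          [0.0, 1.0, 0.8]])

# Center the data, then diagonalize its covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order

# Keep the top-2 principal components (largest eigenvalues first).
top2 = eigvecs[:, ::-1][:, :2]
X_reduced = Xc @ top2

print(X_reduced.shape)   # (100, 2)
```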

4. Data transformation. Here, data scientists think about how different aspects of the data need
to be organized to make the most sense for the goal. This could include things like
structuring unstructured data, combining salient variables when it makes sense or identifying
important ranges to focus on.

5. Data enrichment. In this step, data scientists apply the various feature engineering libraries to
the data to effect the desired transformations. The result should be a data set organized to achieve
the optimal balance between the training time for a new model and the required compute.

6. Data validation. At this stage, the data is split into two sets. The first set is used to train a
machine learning or deep learning model. The second set is the testing data that is used to gauge
the accuracy and robustness of the resulting model.
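The split described here is usually done with a library helper such as scikit-learn's train_test_split; the following is a dependency-free sketch of the same idea:

```python
import random

def validation_split(rows, test_ratio=0.2, seed=42):
    """Shuffle the rows, then split them into training and testing sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)   # fixed seed for reproducibility
    n_test = int(len(rows) * test_ratio)
    return rows[n_test:], rows[:n_test]  # (train, test)

data = list(range(100))
train, test = validation_split(data)
print(len(train), len(test))   # 80 20
```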


Data Preprocessing in Data Mining

Data preprocessing is the process of transforming raw data into an understandable format. It is
also an important step in data mining as we cannot work with raw data. The quality of the data
should be checked before applying machine learning or data mining algorithms.

Major Tasks in Data Preprocessing:

1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation


Data Preprocessing in machine learning:


Steps in Data Preprocessing

 Step 1 : Import the libraries
 Step 2 : Import the data-set
 Step 3 : Check out the missing values
 Step 4 : See the Categorical Values
 Step 5 : Split the data-set into Training and Test Set
 Step 6 : Feature Scaling


Step 1 : Import the Libraries

We import libraries in Python using the import keyword. The following are the most popular libraries that data scientists use.

NumPy is the fundamental package for scientific computing with Python. It contains among other
things:

1. A powerful N-dimensional array object

2. Sophisticated (broadcasting) functions

3. Tools for integrating C/C++ and FORTRAN code

4. Useful linear algebra, Fourier transform, and random number capabilities

Pandas is for data manipulation and analysis. Pandas is an open source, BSD-licensed library
providing high-performance, easy-to-use data structures and data analysis tools for
the Python programming language.

Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and graphical user interface toolkits. Seaborn is a Python data visualization library based on matplotlib.
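A minimal sketch of the first two libraries in use, with the usual import aliases (the DataFrame contents are made up for illustration):

```python
import numpy as np
import pandas as pd

# NumPy: an N-dimensional array with broadcasting.
a = np.array([[1, 2, 3], [4, 5, 6]])
print(a * 10)          # broadcasting multiplies every element by 10

# Pandas: a DataFrame built from a dictionary of columns.
df = pd.DataFrame({"country": ["France", "Spain"], "age": [44, 27]})
print(df.shape)        # (2, 2)

# The visualization libraries are imported the same way:
# import matplotlib.pyplot as plt
# import seaborn as sns
```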


Step 2 : Import the Dataset

Using Pandas we import our data-set; the file used here is a .csv file. [Note: you will not necessarily deal with a CSV file every time; sometimes you will deal with an HTML or XLSX (Excel) file.]
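A sketch of importing a data-set with pandas. The file name Data.csv and its contents are invented here, and the file is written first so the example is self-contained:

```python
import pandas as pd

# Create a small CSV so the example runs on its own
# (in practice the file would already exist).
csv_text = "Country,Age,Salary\nFrance,44,72000\nSpain,27,48000\nGermany,30,54000\n"
with open("Data.csv", "w") as f:
    f.write(csv_text)

# pd.read_csv loads the file into a DataFrame; pd.read_excel and
# pd.read_html exist for .xlsx files and HTML tables respectively.
dataset = pd.read_csv("Data.csv")
print(dataset.head())
print(dataset.shape)   # (3, 3)
```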

Step 3 : Check out the Missing Values

The concept of missing values is important to understand in order to manage data successfully. If missing values are not handled properly, the researcher may end up drawing inaccurate inferences from the data: results obtained from improperly handled data will differ from those obtained when the missing values are handled correctly.

Two Ways to Handle Missing Values in Data Preprocessing

These data preprocessing methods are commonly used to handle null values.

1. Drop the Missing Values

Rows (or entire columns) that contain null values can simply be removed. This is suitable when only a few values are missing, since dropping too many records throws away useful information.

2. Replace the Missing Values

This strategy can be applied to a feature that has numeric data, like the year column or the Home team goal column. We can calculate the mean, median or mode of the feature and replace the missing values with it.
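Both strategies can be sketched with pandas on a small hypothetical DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"year":  [2018, 2019, np.nan, 2021],
                   "goals": [2, np.nan, 1, 3]})

# Strategy 1: drop any row that contains a missing value.
dropped = df.dropna()
print(len(dropped))              # 2 rows survive

# Strategy 2: replace missing values with the column mean
# (median or mode work the same way via .median() / .mode()).
filled = df.fillna(df.mean())
print(filled["goals"].tolist())  # [2.0, 2.0, 1.0, 3.0]
```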

Step 4 : See the Categorical Values

Use LabelEncoder class to convert Categorical data into numerical one

label_encoder is object which is I use and help us in transferring Categorical data into Numerical
data. Next, I fitted this label_encoder object to the first column of our matrix X and all this return
the first column country of the matrix X encoded.
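A sketch of this encoding step. scikit-learn's LabelEncoder is the class named above; to keep the example dependency-free it uses pandas.factorize instead, which does the same job except that codes follow first-appearance order rather than sorted order:

```python
import pandas as pd

X = pd.DataFrame({"country": ["France", "Spain", "Germany", "Spain"],
                  "age":     [44, 27, 30, 38]})

# Each distinct category gets an integer code, in order of first appearance.
# (sklearn's LabelEncoder.fit_transform would assign codes in sorted order.)
codes, categories = pd.factorize(X["country"])
X["country"] = codes

print(list(categories))        # ['France', 'Spain', 'Germany']
print(X["country"].tolist())   # [0, 1, 2, 1]
```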

