
Unit 4: Handling large data on a single computer

Prepared by: Varun Rao (Dean, Data Science & AI)


For: Data Science - 1st years

The problems we face when handling large data


A large volume of data poses new challenges, such as overloaded memory and algorithms that
never stop running. It forces you to adapt and expand your repertoire of techniques. But even
when you can perform your analysis, you should watch out for issues such as I/O (input/output)
bottlenecks and CPU starvation, because these can cause speed problems.

A computer has only a limited amount of RAM. When you try to squeeze more data into this
memory than actually fits, the OS will start swapping memory blocks out to disk, which is far
less efficient than keeping everything in memory.

A third thing you’ll observe when dealing with large data sets is that components of your
computer can start to form a bottleneck while leaving other parts idle. Certain programs don’t
feed data fast enough to the processor because they have to read it from the hard drive,
which is one of the slowest components in a computer. This has been addressed with the
introduction of solid-state drives (SSDs), but SSDs are still much more expensive than the slower
and more widespread hard disk drive (HDD) technology.

General techniques for handling large volumes of data

Never-ending algorithms, out-of-memory errors, and speed issues are the most common
challenges you face when working with large data. The solutions can be divided into three
categories: using the correct algorithms, choosing the right data structures, and using the right
tools.

1. Choosing the right algorithm can solve more problems than adding more or better
hardware. An algorithm that’s well suited for handling large data doesn’t need to load the
entire data set into memory to make predictions.
Most online algorithms can also handle mini-batches; this way, you can feed them
batches of 10 to 1,000 observations at once while using a sliding window to go over your
data. You have three options (a small illustration follows this list):
■ Full batch learning (also called statistical learning)—Feed the algorithm all the data at
once.
■ Mini-batch learning—Feed the algorithm a spoonful (100, 1000, …, depending on what
your hardware can handle) of observations at a time.
■ Online learning—Feed the algorithm one observation at a time.
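
As a small illustration of these three feeding strategies (not from the original text), the sketch below computes a simple statistic, the mean, in all three ways; the array size and batch size are arbitrary choices.

import numpy as np

data = np.random.rand(10_000)   # made-up data set

# Full batch learning: the whole data set is processed at once.
full_mean = data.mean()

# Mini-batch learning: process a spoonful (here 1,000 observations) at a time.
total, count = 0.0, 0
for start in range(0, len(data), 1_000):
    batch = data[start:start + 1_000]
    total += batch.sum()
    count += len(batch)
mini_batch_mean = total / count

# Online learning: update the estimate one observation at a time.
online_mean = 0.0
for i, x in enumerate(data, start=1):
    online_mean += (x - online_mean) / i

print(full_mean, mini_batch_mean, online_mean)   # all three agree up to rounding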

2. MapReduce algorithms are easy to understand with an analogy: Imagine that you were
asked to count all the votes for the national elections. Your country has 25 parties, 1,500
voting offices, and 2 million people. You could choose to gather all the voting tickets
from every office individually and count them centrally, or you could ask the local offices
to count the votes for the 25 parties and hand over the results to you, and you could then
aggregate them by party. MapReduce follows a similar process to the second way of
working: values are first mapped to a key, and then an aggregation is done on that key during
the reduce phase.
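
A minimal sketch of that analogy in Python, with made-up ballots for three offices: the map phase lets each office tally its own votes per party (the party is the key), and the reduce phase aggregates those partial tallies.

from collections import Counter
from functools import reduce

# Hypothetical ballots collected at three voting offices.
offices = [
    ["Party A", "Party B", "Party A"],
    ["Party B", "Party B", "Party C"],
    ["Party A", "Party C", "Party C"],
]

# Map phase: each office counts its own votes per party (key = party).
partial_counts = [Counter(ballots) for ballots in offices]

# Reduce phase: aggregate the partial counts by key (party).
national_result = reduce(lambda a, b: a + b, partial_counts, Counter())

print(national_result)   # Counter({'Party A': 3, 'Party B': 3, 'Party C': 3})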

3. By cutting a large data table into smaller blocks of rows, for instance, we can still do a linear
regression: the matrix quantities the regression needs can be accumulated block by block, so
the full table never has to be held in memory at once.
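
As a hedged sketch of this matrix-splitting idea (one common way to do it, not necessarily the derivation in the original sidebar): for ordinary least squares, the normal-equation terms X'X and X'y can be summed over row blocks, and the coefficients solved for at the end. The data and block size below are made up.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 5))                 # hypothetical design matrix
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.standard_normal(10_000) * 0.1

# Accumulate X'X and X'y over blocks of rows instead of loading X all at once.
xtx = np.zeros((5, 5))
xty = np.zeros(5)
block_size = 1_000
for start in range(0, X.shape[0], block_size):
    block_X = X[start:start + block_size]
    block_y = y[start:start + block_size]
    xtx += block_X.T @ block_X
    xty += block_X.T @ block_y

# Solve the normal equations X'X beta = X'y for the regression coefficients.
beta = np.linalg.solve(xtx, xty)
print(beta)   # close to the true coefficients [1, -2, 0.5, 3, 0]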

Choosing the right data structure


Algorithms can make or break your program, but the way you store your data is of equal
importance. Data structures have different storage requirements, but also influence the
performance of CRUD (create, read, update, and delete) and other operations on the data set.
1. A sparse data set contains relatively little information compared to its number of entries
(observations): almost every value is “0”, with only an occasional “1”.
2. Trees are a class of data structure that allows you to retrieve information much faster
than scanning through a table. A tree always has a root value and subtrees of children,
each with its own children, and so on. A simple example is your own family tree.
3. Hash tables are data structures that calculate a hash for every key in your data and use it to
place the corresponding value in a bucket. This way you can quickly retrieve the information by
looking in the right bucket when you need the data. Dictionaries in Python are a hash table
implementation, and they’re a close relative of key-value stores.
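
Two of these structures are easy to try out in Python. The sketch below (with made-up data) stores a mostly-zero matrix in SciPy's sparse CSR format and uses a plain dictionary as a hash table; the URLs are placeholders.

import numpy as np
from scipy.sparse import csr_matrix

# Sparse data: almost everything is 0, so only the non-zero entries are stored.
dense = np.zeros((1_000, 1_000))
dense[3, 7] = 1.0
sparse = csr_matrix(dense)
print(dense.nbytes, sparse.data.nbytes)   # ~8 MB dense vs. a handful of stored bytes

# Hash table: a Python dict hashes each key into a bucket for fast lookup.
url_labels = {"http://example.org": "benign", "http://bad.example": "malicious"}
print(url_labels["http://bad.example"])          # direct retrieval, no table scan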

General programming tips for dealing with large datasets

The tricks that work in a general programming context still apply for data science. Several might
be worded slightly differently, but the principles are essentially the same for all programmers.

You can divide the general tricks into three parts:
■ Don’t reinvent the wheel: Use tools and libraries developed by others.
■ Get the most out of your hardware: Your machine is never used to its full potential; with
simple adaptations you can make it work harder.
■ Reduce the computing need: Slim down your memory and processing needs as much as
possible.
– Don’t reinvent the wheel:

Solving a problem that has already been solved is a waste of time. As a data scientist, you can
follow two important rules that help you deal with large data and make you much more productive.

Exploit the power of databases: The first reaction most data scientists have when working
with large data sets is to prepare their analytical base tables inside a database (a tiny example
follows below). This method works well when the features you want to prepare are fairly simple.
Use optimized libraries: Creating libraries like Mahout, Weka, and other machine-learning
toolkits requires time and knowledge; use these optimized libraries instead of reimplementing
the algorithms yourself.
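
A tiny sketch of the database idea, using Python's built-in sqlite3 module; the database file, table, and column names are hypothetical. The point is that the grouping and counting happen inside the database engine, and only the small aggregated result is pulled into Python.

import sqlite3

# Hypothetical database file and table; the heavy lifting (grouping and counting)
# is pushed down to the database engine instead of being done in Python.
conn = sqlite3.connect("observations.db")
query = """
    SELECT category, COUNT(*) AS n, AVG(amount) AS mean_amount
    FROM observations
    GROUP BY category
"""
base_table = conn.execute(query).fetchall()   # small, pre-aggregated result
conn.close()
print(base_table)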

– Get the most out of your hardware:

Feed the CPU compressed data: A simple trick to avoid CPU starvation is to feed the CPU
compressed data instead of the inflated (raw) data (see the sketch after this list).
Make use of the GPU: Sometimes your CPU and not your memory is the bottleneck. If your
computations are parallelizable, you can benefit from switching to the GPU. The GPU is
enormously efficient in parallelizable jobs but has less cache than the CPU.
Use multiple threads: It’s still possible to parallelize computations on your CPU. You can
achieve this with normal Python threads.
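
A minimal sketch of the compressed-data trick, assuming a hypothetical gzip-compressed CSV file: pandas decompresses it on the fly while reading in chunks, so the slow disk has to deliver far fewer bytes, at the cost of a little extra CPU work.

import pandas as pd

# Hypothetical gzip-compressed CSV; pandas decompresses it on the fly while reading.
total_rows = 0
for chunk in pd.read_csv("observations.csv.gz", compression="gzip", chunksize=100_000):
    total_rows += len(chunk)   # replace with your own per-chunk computation
print(total_rows)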

– Reduce your computing needs:

The best way to avoid having large data problems is by removing as much of the work as
possible up front and letting the computer work only on the part that can’t be skipped.

■ Profile your code and remediate slow pieces of code: Not every piece of your code needs
to be optimized; use a profiler to detect slow parts inside your program and remediate these
parts.
■ Use compiled code whenever possible, certainly when loops are involved: Whenever
possible use functions from packages that are optimized for numerical computations instead of
implementing everything yourself. The code in these packages is often highly optimized and
compiled.
■ Otherwise, compile the code yourself: If you can’t use an existing package, use either a
just-in-time compiler or implement the slowest parts of your code in a lower-level language such
as C or Fortran and integrate this with your codebase.
■ Avoid pulling data into memory: When you work with data that doesn’t fit in your memory,
avoid pulling everything into memory.
■ Use generators to avoid intermediate data storage: Generators help you return data per
observation instead of in batches. This way you avoid storing intermediate results (see the
sketch after this list).
■ Use as little data as possible: If no large-scale algorithm is available and you aren’t willing to
implement such a technique yourself, then you can still train your data on only a sample of the
original data.
■ Use your math skills to simplify calculations as much as possible: Take the following
equation, for example: (a + b)^2 = a^2 + 2ab + b^2. The left side will be computed much faster
than the right side of the equation; even for this trivial example, it could make a difference when
talking about big chunks of data.
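
A small sketch of the generator trick, with a hypothetical file name: the function yields one parsed observation at a time, so neither the whole file nor an intermediate list of parsed rows ever sits in memory.

def read_observations(path):
    """Yield one parsed observation per line instead of building a list."""
    with open(path) as handle:
        for line in handle:
            yield line.rstrip("\n").split(",")

# The file is streamed line by line; only one observation is in memory at a time.
count = sum(1 for _ in read_observations("large_dataset.csv"))   # hypothetical file
print(count)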

Case study- Predicting malicious URLs


The internet is probably one of the greatest inventions of modern times. It has boosted
humanity’s development, but not everyone uses this great invention with honorable intentions.
Many companies (Google, for one) try to protect us from fraud by detecting malicious websites
for us. Doing so is no easy task, because the internet has billions of web pages to scan.

Step 1: Defining the research goal


The goal of our project is to detect whether certain URLs can be trusted or not. Because the
data is so large, we aim to do this in a memory-friendly way. In the next step we’ll first look at
what happens if we don’t concern ourselves with memory (RAM) issues.

Step 2: Acquiring the URL data


Start by downloading the data from http://sysnet.ucsd.edu/projects/url/#datasets and place it in
a folder. Choose the data in SVMLight format. SVMLight is a text-based format with one
observation per row. To save space, it leaves out the zeros.
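
As a hedged sketch, assuming the archive has been extracted and contains per-day files with names like Day0.svm (the actual names may differ): scikit-learn can read the SVMLight format directly into a sparse matrix, so the zeros are never materialized.

from sklearn.datasets import load_svmlight_file

# Hypothetical file name from the downloaded data; the loader returns a sparse
# feature matrix X and a label vector y without inflating the zeros.
X, y = load_svmlight_file("Day0.svm")
print(X.shape, y.shape)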

Step 3: Data preparation


Data preparation and cleansing isn’t necessary in this case because the URLs come
pre-cleaned. We’ll need a form of exploration before unleashing our learning algorithm, though.

Step 4: Data exploration


To see if we can even apply our first trick (sparse representation), we need to find out whether
the data does indeed contain lots of zeros.
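
A minimal sketch of that check, again assuming a hypothetical extracted file Day0.svm: the ratio of stored (non-zero) entries to the total number of cells tells us whether a sparse representation will pay off.

from sklearn.datasets import load_svmlight_file

X, y = load_svmlight_file("Day0.svm")        # hypothetical file name
n_rows, n_cols = X.shape
density = X.nnz / (n_rows * n_cols)          # fraction of entries that are non-zero
print(f"{n_rows} observations, {n_cols} features, density = {density:.6f}")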
Step 5: Model building


Now that we’re aware of the dimensions of our data, we can apply the same two tricks (sparse
representation and a compressed file) and add the third (using an online algorithm).
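
A hedged sketch of such a model-building step, assuming several extracted SVMLight files with hypothetical names: an SGDClassifier is trained file by file with partial_fit, so only one day's sparse matrix is in memory at a time. This is one way to implement the online-learning trick, not necessarily the exact listing from the original material.

from sklearn.datasets import load_svmlight_file
from sklearn.linear_model import SGDClassifier

files = ["Day0.svm", "Day1.svm", "Day2.svm"]   # hypothetical file names
classes = [-1, 1]                              # assumed label coding: -1 benign, +1 malicious

model = SGDClassifier()                        # linear classifier trained with stochastic gradient descent
for path in files:
    # n_features is an assumed upper bound so every day's matrix has the same width.
    X, y = load_svmlight_file(path, n_features=3_500_000)
    model.partial_fit(X, y, classes=classes)   # learn from one day's data, then discard it

print(model.score(X, y))                       # accuracy on the last day's data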
