
Unit 4: Handling large data on a single computer

Prepared by: Varun Rao (Dean, Data Science & AI)


For: Data Science - 1st years

The problems we face when handling large data


A large volume of data poses new challenges, such as overloaded memory and algorithms that
never stop running. It forces you to adapt and expand your repertoire of techniques. But even
when you can perform your analysis, you should watch out for issues such as I/O (input/output)
bottlenecks and CPU starvation, because these can cause speed problems.

A computer has only a limited amount of RAM. When you try to squeeze more data into this
memory than actually fits, the OS will start swapping memory blocks out to disk, which is far
less efficient than keeping everything in memory.

A third thing you’ll observe when dealing with large data sets is that components of your
computer can start to form a bottleneck while leaving other parts idle. Certain programs don’t
feed data fast enough to the processor because they have to read it from the hard drive,
which is one of the slowest components in a computer. This has been addressed with the
introduction of solid-state drives (SSDs), but SSDs are still much more expensive than the slower
and more widespread hard disk drive (HDD) technology.

General techniques for handling large volumes of data

Never-ending algorithms, out-of-memory errors, and speed issues are the most common
challenges you face when working with large data. The solutions can be divided into three
categories: using the correct algorithms, choosing the right data structures, and using the right
tools.

1. Choosing the right algorithm can solve more problems than adding more or better
hardware. An algorithm that’s well suited for handling large data doesn’t need to load the
entire data set into memory to make predictions.
Most online algorithms can also handle mini-batches; this way, you can feed them
batches of 10 to 1,000 observations at once while using a sliding window to go over your
data. You have three options (a small illustration follows this list):
■ Full batch learning (also called statistical learning)—Feed the algorithm all the data at
once.
■ Mini-batch learning—Feed the algorithm a spoonful (100, 1000, …, depending on what
your hardware can handle) of observations at a time.
■ Online learning—Feed the algorithm one observation at a time.
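
As a small illustration of these three feeding strategies (not from the original text), the sketch below computes a simple statistic, the mean, in all three ways; the array size and batch size are arbitrary choices.

import numpy as np

data = np.random.rand(10_000)   # made-up data set

# Full batch learning: the whole data set is processed at once.
full_mean = data.mean()

# Mini-batch learning: process a spoonful (here 1,000 observations) at a time.
total, count = 0.0, 0
for start in range(0, len(data), 1_000):
    batch = data[start:start + 1_000]
    total += batch.sum()
    count += len(batch)
mini_batch_mean = total / count

# Online learning: update the estimate one observation at a time.
online_mean = 0.0
for i, x in enumerate(data, start=1):
    online_mean += (x - online_mean) / i

print(full_mean, mini_batch_mean, online_mean)   # all three agree up to rounding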

2. MapReduce algorithms are easy to understand with an analogy: Imagine that you were
asked to count all the votes for the national elections. Your country has 25 parties, 1,500
voting offices, and 2 million people. You could choose to gather all the voting tickets
from every office individually and count them centrally, or you could ask the local offices
to count the votes for the 25 parties and hand over the results to you, and you could then
aggregate them by party. MapReduce follows a similar process to the second way of
working: values are first mapped to a key, and then an aggregation is done on that key during
the reduce phase.
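
A minimal sketch of that analogy in Python, with made-up ballots for three offices: the map phase lets each office tally its own votes per party (the party is the key), and the reduce phase aggregates those partial tallies.

from collections import Counter
from functools import reduce

# Hypothetical ballots collected at three voting offices.
offices = [
    ["Party A", "Party B", "Party A"],
    ["Party B", "Party B", "Party C"],
    ["Party A", "Party C", "Party C"],
]

# Map phase: each office counts its own votes per party (key = party).
partial_counts = [Counter(ballots) for ballots in offices]

# Reduce phase: aggregate the partial counts by key (party).
national_result = reduce(lambda a, b: a + b, partial_counts, Counter())

print(national_result)   # Counter({'Party A': 3, 'Party B': 3, 'Party C': 3})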

3. By cutting a large data table into smaller blocks of rows, for instance, we can still do a linear
regression: the matrix quantities the regression needs can be accumulated block by block, so
the full table never has to be held in memory at once.
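
As a hedged sketch of this matrix-splitting idea (one common way to do it, not necessarily the derivation in the original sidebar): for ordinary least squares, the normal-equation terms X'X and X'y can be summed over row blocks, and the coefficients solved for at the end. The data and block size below are made up.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 5))                 # hypothetical design matrix
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.standard_normal(10_000) * 0.1

# Accumulate X'X and X'y over blocks of rows instead of loading X all at once.
xtx = np.zeros((5, 5))
xty = np.zeros(5)
block_size = 1_000
for start in range(0, X.shape[0], block_size):
    block_X = X[start:start + block_size]
    block_y = y[start:start + block_size]
    xtx += block_X.T @ block_X
    xty += block_X.T @ block_y

# Solve the normal equations X'X beta = X'y for the regression coefficients.
beta = np.linalg.solve(xtx, xty)
print(beta)   # close to the true coefficients [1, -2, 0.5, 3, 0]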

Choosing the right data structure


Algorithms can make or break your program, but the way you store your data is of equal
importance. Data structures have different storage requirements, but also influence the
performance of CRUD (create, read, update, and delete) and other operations on the data set.
1. A sparse data set contains relatively little information compared to its number of entries
(observations): almost every value is “0”, with only an occasional “1”.
2. Trees are a class of data structure that allows you to retrieve information much faster
than scanning through a table. A tree always has a root value and subtrees of children,
each with its own children, and so on. A simple example is your own family tree.
3. Hash tables are data structures that calculate a hash for every key in your data and use it to
place the corresponding value in a bucket. This way you can quickly retrieve the information by
looking in the right bucket when you need the data. Dictionaries in Python are a hash table
implementation, and they’re a close relative of key-value stores.
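
Two of these structures are easy to try out in Python. The sketch below (with made-up data) stores a mostly-zero matrix in SciPy's sparse CSR format and uses a plain dictionary as a hash table; the URLs are placeholders.

import numpy as np
from scipy.sparse import csr_matrix

# Sparse data: almost everything is 0, so only the non-zero entries are stored.
dense = np.zeros((1_000, 1_000))
dense[3, 7] = 1.0
sparse = csr_matrix(dense)
print(dense.nbytes, sparse.data.nbytes)   # ~8 MB dense vs. a handful of stored bytes

# Hash table: a Python dict hashes each key into a bucket for fast lookup.
url_labels = {"http://example.org": "benign", "http://bad.example": "malicious"}
print(url_labels["http://bad.example"])          # direct retrieval, no table scan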

General programming tips for dealing with large datasets

The tricks that work in a general programming context still apply for data science. Several might
be worded slightly differently, but the principles are essentially the same for all programmers.

You can divide the general tricks into three parts:
■ Don’t reinvent the wheel: Use tools and libraries developed by others.
■ Get the most out of your hardware: Your machine is never used to its full potential; with
simple adaptations you can make it work harder.
■ Reduce the computing need: Slim down your memory and processing needs as much as
possible.
– Don’t reinvent the wheel:

Solving a problem that has already been solved is a waste of time. As a data scientist, you can
follow two important rules that help you deal with large data and make you much more productive.

Exploit the power of databases: The first reaction most data scientists have when working
with large data sets is to prepare their analytical base tables inside a database (a tiny example
follows below). This method works well when the features you want to prepare are fairly simple.
Use optimized libraries: Creating libraries like Mahout, Weka, and other machine-learning
toolkits requires time and knowledge; use these optimized libraries instead of reimplementing
the algorithms yourself.
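
A tiny sketch of the database idea, using Python's built-in sqlite3 module; the database file, table, and column names are hypothetical. The point is that the grouping and counting happen inside the database engine, and only the small aggregated result is pulled into Python.

import sqlite3

# Hypothetical database file and table; the heavy lifting (grouping and counting)
# is pushed down to the database engine instead of being done in Python.
conn = sqlite3.connect("observations.db")
query = """
    SELECT category, COUNT(*) AS n, AVG(amount) AS mean_amount
    FROM observations
    GROUP BY category
"""
base_table = conn.execute(query).fetchall()   # small, pre-aggregated result
conn.close()
print(base_table)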

– Get the most out of your hardware:

Feed the CPU compressed data: A simple trick to avoid CPU starvation is to feed the CPU
compressed data instead of the inflated (raw) data (see the sketch after this list).
Make use of the GPU: Sometimes your CPU and not your memory is the bottleneck. If your
computations are parallelizable, you can benefit from switching to the GPU. The GPU is
enormously efficient in parallelizable jobs but has less cache than the CPU.
Use multiple threads: It’s still possible to parallelize computations on your CPU. You can
achieve this with normal Python threads.
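
A minimal sketch of the compressed-data trick, assuming a hypothetical gzip-compressed CSV file: pandas decompresses it on the fly while reading in chunks, so the slow disk has to deliver far fewer bytes, at the cost of a little extra CPU work.

import pandas as pd

# Hypothetical gzip-compressed CSV; pandas decompresses it on the fly while reading.
total_rows = 0
for chunk in pd.read_csv("observations.csv.gz", compression="gzip", chunksize=100_000):
    total_rows += len(chunk)   # replace with your own per-chunk computation
print(total_rows)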

– Reduce your computing needs:

The best way to avoid having large data problems is by removing as much of the work as
possible up front and letting the computer work only on the part that can’t be skipped.

■ Profile your code and remediate slow pieces of code: Not every piece of your code needs
to be optimized; use a profiler to detect slow parts inside your program and remediate these
parts.
■ Use compiled code whenever possible, certainly when loops are involved: Whenever
possible use functions from packages that are optimized for numerical computations instead of
implementing everything yourself. The code in these packages is often highly optimized and
compiled.
■ Otherwise, compile the code yourself: If you can’t use an existing package, use either a
just-in-time compiler or implement the slowest parts of your code in a lower-level language such
as C or Fortran and integrate this with your codebase.
■ Avoid pulling data into memory: When you work with data that doesn’t fit in your memory,
avoid pulling everything into memory.
■ Use generators to avoid intermediate data storage: Generators help you return data per
observation instead of in batches. This way you avoid storing intermediate results (see the
sketch after this list).
■ Use as little data as possible: If no large-scale algorithm is available and you aren’t willing to
implement such a technique yourself, then you can still train your data on only a sample of the
original data.
■ Use your math skills to simplify calculations as much as possible: Take the following
equation, for example: (a + b)^2 = a^2 + 2ab + b^2. The left side will be computed much faster
than the right side of the equation; even for this trivial example, it could make a difference when
talking about big chunks of data.
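
A small sketch of the generator trick, with a hypothetical file name: the function yields one parsed observation at a time, so neither the whole file nor an intermediate list of parsed rows ever sits in memory.

def read_observations(path):
    """Yield one parsed observation per line instead of building a list."""
    with open(path) as handle:
        for line in handle:
            yield line.rstrip("\n").split(",")

# The file is streamed line by line; only one observation is in memory at a time.
count = sum(1 for _ in read_observations("large_dataset.csv"))   # hypothetical file
print(count)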

Case study- Predicting malicious URLs


The internet is probably one of the greatest inventions of modern times. It has boosted
humanity’s development, but not everyone uses this great invention with honorable intentions.
Many companies (Google, for one) try to protect us from fraud by detecting malicious websites
for us. Doing so is no easy task, because the internet has billions of web pages to scan.

Step 1: Defining the research goal


The goal of our project is to detect whether certain URLs can be trusted or not. Because the
data is so large, we aim to do this in a memory-friendly way. In the next step we’ll first look at
what happens if we don’t concern ourselves with memory (RAM) issues.

Step 2: Acquiring the URL data


Start by downloading the data from http://sysnet.ucsd.edu/projects/url/#datasets and place it in
a folder. Choose the data in SVMLight format. SVMLight is a text-based format with one
observation per row. To save space, it leaves out the zeros.
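
As a hedged sketch, assuming the archive has been extracted and contains per-day files with names like Day0.svm (the actual names may differ): scikit-learn can read the SVMLight format directly into a sparse matrix, so the zeros are never materialized.

from sklearn.datasets import load_svmlight_file

# Hypothetical file name from the downloaded data; the loader returns a sparse
# feature matrix X and a label vector y without inflating the zeros.
X, y = load_svmlight_file("Day0.svm")
print(X.shape, y.shape)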

Step 3: Data preparation


Data preparation and cleansing isn’t necessary in this case because the URLs come
pre-cleaned. We’ll need a form of exploration before unleashing our learning algorithm, though.

Step 4: Data exploration


To see if we can even apply our first trick (sparse representation), we need to find out whether
the data does indeed contain lots of zeros.
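
A minimal sketch of that check, again assuming a hypothetical extracted file Day0.svm: the ratio of stored (non-zero) entries to the total number of cells tells us whether a sparse representation will pay off.

from sklearn.datasets import load_svmlight_file

X, y = load_svmlight_file("Day0.svm")        # hypothetical file name
n_rows, n_cols = X.shape
density = X.nnz / (n_rows * n_cols)          # fraction of entries that are non-zero
print(f"{n_rows} observations, {n_cols} features, density = {density:.6f}")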
Step 5: Model building


Now that we’re aware of the dimensions of our data, we can apply the same two tricks (sparse
representation and a compressed file) and add the third (using an online algorithm).
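
A hedged sketch of such a model-building step, assuming several extracted SVMLight files with hypothetical names: an SGDClassifier is trained file by file with partial_fit, so only one day's sparse matrix is in memory at a time. This is one way to implement the online-learning trick, not necessarily the exact listing from the original material.

from sklearn.datasets import load_svmlight_file
from sklearn.linear_model import SGDClassifier

files = ["Day0.svm", "Day1.svm", "Day2.svm"]   # hypothetical file names
classes = [-1, 1]                              # assumed label coding: -1 benign, +1 malicious

model = SGDClassifier()                        # linear classifier trained with stochastic gradient descent
for path in files:
    # n_features is an assumed upper bound so every day's matrix has the same width.
    X, y = load_svmlight_file(path, n_features=3_500_000)
    model.partial_fit(X, y, classes=classes)   # learn from one day's data, then discard it

print(model.score(X, y))                       # accuracy on the last day's data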
