R Programming UNIT 3,4,5

This document provides an overview of machine learning (ML), its applications in data science, and the modeling process. It explains key concepts such as algorithms, types of machine learning (supervised, unsupervised, reinforcement), and the machine learning life cycle, which includes steps from problem definition to model deployment. The document highlights the importance of ML for data scientists in automating tasks, improving accuracy, and enabling real-time decision-making across various industries.

UNIT 3

MACHINE LEARNING: UNDERSTANDING WHY DATA SCIENTISTS USE MACHINE LEARNING - WHAT IS MACHINE LEARNING AND WHY WE SHOULD CARE ABOUT IT, APPLICATIONS OF MACHINE LEARNING IN DATA SCIENCE, WHERE IT IS USED IN DATA SCIENCE, THE MODELING PROCESS, TYPES OF MACHINE LEARNING - SUPERVISED AND UNSUPERVISED

MACHINE LEARNING:

Machine learning (ML) is a subset of artificial intelligence (AI) that enables computers to learn from data and improve their performance without being explicitly programmed. It involves algorithms that can identify patterns, make predictions, and learn from experience.

In other words, machine learning focuses on developing algorithms whose performance improves over time: the algorithms are trained on datasets to identify patterns and then make predictions or decisions.

Key Concepts:

 Algorithms: These are the sets of instructions that the machine follows to learn from
data.

 Data: Machine learning algorithms require large amounts of data to learn and improve
their performance.

 Learning: Machine learning algorithms learn from data and improve their performance
without being explicitly programmed.

 Prediction: Once trained, machine learning models can be used to make predictions or
decisions about new data.

 Examples:

 Recommendation systems: Suggest products or content based on user preferences.

 Spam filtering: Identifying and filtering out unwanted emails.

 Image recognition: Identifying objects or people in images.

 Fraud detection: Identifying fraudulent transactions.

 Types of Machine Learning:

 Supervised Learning: Algorithms learn from labeled data (data with known outcomes).
 Unsupervised Learning: Algorithms learn from unlabeled data (data without known
outcomes).

 Reinforcement Learning: Algorithms learn through trial and error and are rewarded for
desired actions.
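
As a small illustration (a minimal sketch using R's built-in mtcars data set; the chosen columns and the cluster count are arbitrary), supervised learning fits a model to labelled outcomes, while unsupervised learning only looks for structure in unlabelled inputs:

# Supervised learning: the outcome (mpg) is known for every observation
fit <- lm(mpg ~ wt + hp, data = mtcars)                  # learn from labelled data
predict(fit, newdata = data.frame(wt = 3, hp = 110))     # predict an unseen case

# Unsupervised learning: no outcome column, only structure in the inputs
set.seed(1)
groups <- kmeans(mtcars[, c("wt", "hp")], centers = 3)   # find 3 clusters
groups$cluster                                           # cluster label per car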

WHY DATA SCIENTISTS USE MACHINE LEARNING:

Data scientists utilize machine learning (ML) to automate pattern recognition, make predictions, and
gain insights from data, which is crucial for solving complex problems and making informed decisions
in various fields.

Here's a more detailed explanation:

 Automated Pattern Recognition:

ML algorithms can automatically identify patterns and relationships in data that might be difficult or
impossible for humans to spot, enabling data scientists to uncover hidden insights.

 Predictive Modeling:

ML models can be trained to make accurate predictions about future outcomes, allowing data scientists
to forecast trends, anticipate events, and make data-driven decisions.

 Automated Tasks and Insights:

ML can automate tasks like data cleaning, feature engineering, and model evaluation, freeing up data
scientists to focus on higher-level tasks and strategic thinking.

 Improved Accuracy and Efficiency:

ML algorithms can achieve higher accuracy and efficiency in data analysis and prediction compared to
traditional methods, leading to better outcomes and faster results.

 Handling Large Datasets:

ML algorithms are designed to handle large and complex datasets, which are common in many
industries, allowing data scientists to analyze and extract valuable information from massive amounts
of data.

 Real-time Decision Making:

Machine learning can enable real-time decision-making by providing insights and predictions as data
streams in, allowing for timely and informed actions.

 Improved Business Decisions:


By using machine learning, data scientists can help businesses make better decisions based on data,
leading to increased efficiency, profitability, and competitive advantage.

 Specific Applications:

 Fraud Detection: ML algorithms can identify suspicious activities and flag potential
threats, helping to prevent fraud and financial losses.

 Customer Segmentation: ML can help businesses identify customer segments with similar characteristics and behaviors, enabling targeted marketing and personalized experiences.

 Product Recommendation: ML can recommend products to customers based on their past purchases and browsing history, improving sales and customer satisfaction.

 Predictive Maintenance: ML can predict when equipment is likely to fail, allowing businesses to schedule maintenance proactively and avoid costly downtime.

APPLICATIONS OF MACHINE LEARNING IN DATA SCIENCE:

Machine learning finds diverse applications in data science, including fraud detection, speech
recognition, natural language processing, pattern recognition, and even in fields like finance and
transportation, enabling tasks like predictive analytics and recommendation systems.

Here's a more detailed look at some key applications:

1. Predictive Analytics:

 Fraud Detection:

Machine learning algorithms can analyze vast datasets of transactions to identify unusual patterns that
might indicate fraudulent activity.
 Predicting Customer Behavior:

By analyzing customer data, machine learning can predict future purchasing patterns, churn risk, and
other behaviors.

 Risk Assessment:

In finance and insurance, machine learning models can assess risks associated with loans, investments,
and insurance policies.

2. Computer Vision:

 Image Recognition and Object Detection:

Machine learning algorithms can be trained to recognize objects, faces, and scenes within images and
videos.

 Self-Driving Cars:

Computer vision plays a crucial role in enabling autonomous vehicles to perceive and navigate their
surroundings.

 Medical Image Analysis:

Machine learning can assist in diagnosing diseases, identifying tumors, and analyzing medical images.

3. Natural Language Processing (NLP):

 Machine Translation:

Machine learning models can translate text from one language to another with increasing accuracy.

 Sentiment Analysis:

NLP techniques can analyze text and identify the sentiment expressed, whether positive, negative, or
neutral.

 Chatbots and Virtual Assistants:

Machine learning powers chatbots and virtual assistants that can understand and respond to user
queries.

4. Recommendation Systems:

 Product Recommendations:

Online retailers and streaming services use machine learning to recommend products or content to users
based on their past behavior and preferences.
 Content Recommendation:

Social media platforms and news aggregators use machine learning to personalize content feeds and
suggest relevant topics.

5. Other Notable Applications:

 Speech Recognition:

Machine learning algorithms can transcribe spoken language into text, enabling voice-controlled
devices and applications.

 Pattern Recognition:

Machine learning can identify patterns in data, which can be used for various applications like anomaly
detection and data clustering.

 Transportation and Logistics:

Machine learning can optimize routes, predict traffic, and improve logistics operations.

 Finance:

Machine learning is used for risk management, fraud detection, algorithmic trading, and other financial
applications.

 Social Media:

Machine learning is used for content moderation, sentiment analysis, and targeted advertising.

MACHINE LEARNING USED IN DATA SCIENCE:

Machine learning plays a crucial role in several stages of the data science process, including data
analysis, model building, prediction, and automation.

Here's a more detailed breakdown:

 Data Analysis & Exploration:

Machine learning algorithms can help data scientists identify patterns, relationships, and anomalies in
large datasets that might be difficult to spot through traditional statistical methods.

 Model Building:

Machine learning is the core of building predictive models, where algorithms learn from data to make
predictions or classify data points.
 Prediction & Forecasting:

Once a model is trained, machine learning can be used for making predictions about future outcomes,
such as predicting customer churn, detecting fraud, or forecasting sales.

 Automation:

Machine learning can automate tasks like data cleaning, feature engineering, and model evaluation,
freeing up data scientists to focus on more strategic tasks.

 Real-time Applications:

Machine learning enables real-time data analysis and decision-making, which is crucial in many
applications like fraud detection, recommendation systems, and social media optimization.

 Examples of Applications:

 Image Recognition: Using machine learning algorithms to identify and understand objects, people, and scenes within images.

 Natural Language Processing: Applying machine learning to analyze and understand human language, such as in chatbots, sentiment analysis, and machine translation.

 Recommendation Systems: Using machine learning to suggest relevant products, movies, or content to users based on their preferences and past behavior.

 Fraud Detection: Employing machine learning algorithms to identify and prevent fraudulent activities in financial transactions and other areas.

 Data Science vs Machine Learning:

While machine learning is a tool used within data science, data science is a broader field that
encompasses various techniques, including statistical analysis, data visualization, and machine learning.

MACHINE LEARNING MODEL:

A machine learning model is a program that can find patterns or make decisions from a previously
unseen dataset. For example, in natural language processing, machine learning models can parse and
correctly recognize the intent behind previously unheard sentences or combinations of words. In image
recognition, a machine learning model can be taught to recognize objects - such as cars or dogs. A
machine learning model can perform such tasks by having it 'trained' with a large dataset. During
training, the machine learning algorithm is optimized to find certain patterns or outputs from the
dataset, depending on the task. The output of this process - often a computer program with specific
rules and data structures - is called a machine learning model.

Machine Learning Algorithm

A machine learning algorithm is a mathematical method to find patterns in a set of data. Machine
Learning algorithms are often drawn from statistics, calculus, and linear algebra. Some popular
examples of machine learning algorithms include linear regression, decision trees, random forest, and
XGBoost.

Model Training in Machine Learning

The process of running a machine learning algorithm on a dataset (called training data) and optimizing
the algorithm to find certain patterns or outputs is called model training. The resulting function with
rules and data structures is called the trained machine learning model.

DIFFERENT TYPES OF MACHINE LEARNING

In general, most machine learning techniques can be classified into supervised learning, unsupervised
learning, and reinforcement learning.
Supervised Machine Learning

In supervised machine learning, the algorithm is provided an input dataset, and is rewarded or
optimized to meet a set of specific outputs. For example, supervised machine learning is widely
deployed in image recognition, utilizing a technique called classification. Supervised machine learning
is also used in predicting demographics such as population growth or health metrics, utilizing a
technique called regression.

Unsupervised Machine Learning

In unsupervised machine learning, the algorithm is provided an input dataset, but not rewarded or
optimized to specific outputs, and instead trained to group objects by common characteristics. For
example, recommendation engines on online stores rely on unsupervised machine learning, specifically
a technique called clustering.

Reinforcement Learning

In reinforcement learning, the algorithm is made to train itself using many trial and error experiments.
Reinforcement learning happens when the algorithm interacts continually with the environment, rather
than relying on training data. One of the most popular examples of reinforcement learning is
autonomous driving.

Different Machine Learning Models

There are many machine learning models, and almost all of them are based on certain machine learning
algorithms. Popular classification and regression algorithms fall under supervised machine learning,
and clustering algorithms are generally deployed in unsupervised machine learning scenarios.

Supervised Machine Learning

 Logistic Regression: Logistic Regression is used to determine if an input belongs to a certain group or not.
 SVM: SVM, or Support Vector Machine, creates coordinates for each object in an n-dimensional space and uses a hyperplane to group objects by common features

 Naive Bayes: Naive Bayes is an algorithm that assumes independence among variables and uses
probability to classify objects based on features

 Decision Trees: Decision trees are also classifiers that are used to determine what category an input falls into by traversing the leaves and nodes of a tree

 Linear Regression: Linear regression is used to identify relationships between the variable of
interest and the inputs, and predict its values based on the values of the input variables.

 kNN : The k Nearest Neighbors technique involves grouping the closest objects in a dataset and
finding the most frequent or average characteristics among the objects.

 Random Forest: Random forest is a collection of many decision trees from random subsets of
the data, resulting in a combination of trees that may be more accurate in prediction than a
single decision tree.

 Boosting algorithms: Boosting algorithms, such as Gradient Boosting Machine, XGBoost, and
LightGBM, use ensemble learning. They combine the predictions from multiple algorithms
(such as decision trees) while taking into account the error from the previous algorithm.
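
A hedged sketch of two of the algorithms listed above, using base R and built-in data sets (the variable choices are illustrative, and the decision tree assumes the recommended rpart package is installed):

# Logistic Regression: does an input belong to a group (am = 1, manual) or not?
log_fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
head(predict(log_fit, type = "response"))   # predicted class probabilities

# Decision Tree: classify iris flowers by traversing simple decision rules
library(rpart)
tree_fit <- rpart(Species ~ ., data = iris)
predict(tree_fit, iris[1:3, ], type = "class")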

Unsupervised Machine Learning

 K-Means: The K-Means algorithm finds similarities between objects and groups them into K
different clusters.

 Hierarchical Clustering: Hierarchical clustering builds a tree of nested clusters without having
to specify the number of clusters.
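
A minimal base-R sketch of both clustering approaches on the built-in iris measurements (choosing K = 3 here is purely for illustration):

# K-Means: partition the observations into K = 3 clusters
set.seed(42)
km <- kmeans(iris[, 1:4], centers = 3)
table(km$cluster)                 # number of observations in each cluster

# Hierarchical Clustering: build a tree of nested clusters (no K needed up front)
hc <- hclust(dist(iris[, 1:4]))
plot(hc)                          # dendrogram of the nested clusters
cutree(hc, k = 3)                 # cut the tree into 3 groups if desired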

MACHINE LEARNING LIFE CYCLE

Machine learning is a branch of artificial intelligence that enables computers to learn from data and make predictions or
decisions without being explicitly programmed. As machine learning grows in importance across
various industries, understanding the process involved in developing effective models becomes
essential. This structured process is known as the Machine Learning Life Cycle. It consists of several
key stages, each of which plays a critical role in building and deploying machine learning models
successfully.

In this section, we will walk through the different stages of the machine learning life cycle, explaining
each step in simple terms to help you grasp the concept easily.
Steps in a Machine Learning Life Cycle

Each step in the machine learning life cycle plays an essential role in building a successful machine
learning solution. By following this life cycle, organizations can tackle complex problems, use data to
generate valuable insights and develop scalable machine learning models that provide lasting impact.

1. Problem Definition

2. Data Collection

3. Data Cleaning and Preprocessing

4. Exploratory Data Analysis (EDA)

5. Feature Engineering and Selection

6. Model Selection

7. Model Training

8. Model Evaluation and Tuning

9. Model Deployment

10. Model Monitoring and Maintenance


1. Problem Definition

The first step in the machine learning life cycle is defining the problem. Before you can create a
machine learning model, it’s important to have a clear understanding of the problem you want the
model to solve. This step sets the foundation for the entire process, as the way you define the problem
will influence every other stage in the life cycle, from the data you collect to the type of model you
choose.

For example, you might want to predict house prices based on certain features like location, size, and
number of rooms. Clearly defining this goal helps you identify what data is needed and what type of
machine learning model will be most suitable for solving it.

Why Problem Definition Matters:

 Clarifies the goal: Knowing exactly what you want the model to achieve is crucial.

 Determines data needs: Different problems require different kinds of data.

 Influences model selection: The problem guides whether you use supervised learning,
unsupervised learning, or another method.

2. Data Collection

After defining the problem, the next step is Data Collection. The quality and quantity of the data
directly impact the success of the machine learning model. Data can come from various sources:

Sources of Data:

 Internal Databases: Company records, customer data, transaction logs.

 Public Datasets: Free datasets from platforms like Kaggle or UCI.

 Web Scraping: Collecting data from websites.

Key Considerations:

 Quality: Data should be accurate and relevant to the problem.

 Quantity: Sufficient data is needed to train the model effectively.

 Relevance: The features in the data must align with the problem you’re solving.
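
As a minimal illustration, collecting data in R often starts with reading a file and then checking its quality, quantity, and relevance (the file name house_prices.csv below is hypothetical):

# Load a data set from a local CSV file
houses <- read.csv("house_prices.csv")

dim(houses)      # quantity: number of observations and features
names(houses)    # relevance: do the features match the problem definition?
head(houses)     # quality: eyeball the first few records for obvious issues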

3. Data Cleaning and Preprocessing

Once the data is collected, it must be cleaned and prepared before it can be used for model training.
Data Cleaning and Preprocessing involves removing any errors, handling missing values, and
formatting the data to make it suitable for analysis.
Common Steps:

 Handling Missing Values: Filling in or removing missing data to avoid issues during model
training.

 Normalization: Scaling data so that features with larger ranges do not dominate those with
smaller ranges.

 Outlier Removal: Identifying and eliminating extreme values that could skew the results.
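
A small, hedged sketch of these cleaning steps in base R (the data frame df and its single column are hypothetical):

# Hypothetical raw data with a missing value and a suspiciously large value
df <- data.frame(x = c(10, 12, NA, 11, 300))

# Handling Missing Values: fill with the mean (or drop rows with na.omit(df))
df$x[is.na(df$x)] <- mean(df$x, na.rm = TRUE)

# Outlier Removal: keep values within 3 standard deviations of the mean
df <- df[abs(df$x - mean(df$x)) <= 3 * sd(df$x), , drop = FALSE]

# Normalization: rescale to mean 0 and standard deviation 1
df$x_scaled <- as.numeric(scale(df$x))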

4. Exploratory Data Analysis (EDA)

After cleaning and preprocessing the data, the next step is Exploratory Data Analysis (EDA). EDA
helps in understanding the underlying patterns and characteristics of the data. It involves visualizing
and summarizing the data to discover relationships, trends, and potential insights that will guide the
modeling process.

Key Techniques:

 Data Visualization: Use charts like histograms, scatter plots, and bar charts to identify trends
and distributions.

 Statistical Summary: Calculate basic statistics like mean, median, and standard deviation to
understand the spread and central tendencies of the data.

 Feature Correlation: Analyze the relationships between different features to identify which
variables might have the most influence on the outcome.
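
These techniques map directly onto base R functions; a minimal sketch on the built-in mtcars data:

# Statistical Summary: mean, median, quartiles for every column
summary(mtcars)

# Data Visualization: distribution of one variable, relationship between two
hist(mtcars$mpg, main = "Distribution of mpg")
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Miles per gallon")

# Feature Correlation: which variables move together?
cor(mtcars[, c("mpg", "wt", "hp")])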

5. Feature Engineering and Selection

Once the data has been explored, the next step is Feature Engineering and Selection. Features are the
attributes or variables in the dataset that the model will use to make predictions. This step involves
creating new features or selecting the most important ones to improve the model’s performance.

Key Processes:

 Feature Engineering: Creating new features by transforming existing data, such as combining
or splitting variables (e.g., turning “date of birth” into “age”).

 Feature Selection: Choosing the most relevant features that have the greatest impact on the
model’s accuracy, while eliminating redundant or irrelevant ones.
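
A brief base-R sketch of both processes (the birth dates and the correlation threshold of 0.7 are hypothetical choices):

# Feature Engineering: turn "date of birth" into "age" in whole years
dob <- as.Date(c("1990-05-01", "2001-11-23"))
age <- floor(as.numeric(difftime(Sys.Date(), dob, units = "days")) / 365.25)
age

# Feature Selection: keep numeric features strongly correlated with the outcome
cors <- cor(mtcars)[, "mpg"]
selected <- names(cors)[abs(cors) > 0.7 & names(cors) != "mpg"]
selected    # candidate features with a strong absolute correlation to mpg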

6. Model Selection

After preparing the data and selecting the relevant features, the next step is Model Selection. This
involves choosing the right machine learning model based on the problem you are trying to solve and
the nature of your data. There are various types of models, each suited for different tasks.
Types of Models:

 Supervised Learning: Used when the data has labeled outcomes, such as classification (e.g.,
spam detection) or regression (e.g., predicting house prices).

 Unsupervised Learning: Used when the data lacks labeled outcomes, such as clustering or
association tasks (e.g., customer segmentation).

 Reinforcement Learning: Used for decision-making tasks where an agent learns by interacting
with the environment (e.g., game playing or robotics).

7. Model Training

Once the model is selected, the next step is Model Training. This is where the model learns from the
data to make predictions. In this step, the data is divided into two sets: a training set and a testing set.
The model uses the training set to learn patterns in the data, and then it is tested on the unseen testing
set to evaluate how well it has learned.

Key Concepts:

 Training the Model: The model learns by identifying patterns in the training data.

 Data Splitting: Dividing data into training and testing sets ensures that the model can
generalize well to new data.

 Algorithms: Various algorithms, like decision trees or neural networks, are used to teach the
model how to make predictions.
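
A minimal sketch of splitting data and training a model in base R (the 70/30 split and the chosen predictors are illustrative assumptions):

# Data Splitting: 70% training set, 30% testing set
set.seed(123)
train_idx <- sample(seq_len(nrow(mtcars)), size = floor(0.7 * nrow(mtcars)))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Training the Model: learn patterns from the training data only
model <- lm(mpg ~ wt + hp, data = train)

# The unseen testing set is held back for evaluation
predictions <- predict(model, newdata = test)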

8. Model Evaluation and Tuning

After training the model, the next step is Model Evaluation and Tuning. This step assesses how well the
model performs on unseen data and ensures it is optimized for accuracy and reliability. Various metrics
are used to evaluate the model’s performance, and fine-tuning is done to improve it.

Key Concepts:

 Model Evaluation: Common metrics like accuracy, precision, recall, and F1 score are used to
measure how well the model performs on the test data.

 Cross-Validation: This technique divides the data into several parts to test the model’s
performance more reliably.

 Hyperparameter Tuning: Adjusting parameters such as learning rate or tree depth to optimize
the model’s performance.
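
A hedged sketch of basic evaluation in base R, reusing the kind of train/test split shown above (the 0.5 probability cut-off is an assumption):

set.seed(1)
idx   <- sample(seq_len(nrow(mtcars)), size = floor(0.7 * nrow(mtcars)))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

# Train a classifier and predict class probabilities on the unseen test set
fit  <- glm(am ~ wt + hp, data = train, family = binomial)
prob <- predict(fit, newdata = test, type = "response")
pred <- factor(ifelse(prob > 0.5, 1, 0), levels = c(0, 1))

# Model Evaluation: confusion matrix and accuracy on the test data
confusion <- table(actual = factor(test$am, levels = c(0, 1)), predicted = pred)
accuracy  <- sum(diag(confusion)) / sum(confusion)
confusion
accuracy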

9. Model Deployment
Once the model has been trained, evaluated, and fine-tuned, the next step is Model Deployment. This
involves integrating the model into a real-world environment where it can start making predictions
based on new data.

Key Concepts:

 Deployment Options: Models can be deployed through cloud platforms, APIs, or embedded
systems.

 Real-Time Predictions: In a production environment, the model is used to make predictions or decisions on live data.

 Scalability: The model should be capable of handling large volumes of data and requests
without significant delays or errors.
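
In R, a very simple (hedged) form of deployment is to serialize the trained model and load it wherever predictions are needed; the file name and the new-data values below are hypothetical. In practice the loaded model would typically sit behind an API or a cloud service rather than a plain script.

# At the end of the training pipeline: save the trained model to disk
model <- lm(mpg ~ wt + hp, data = mtcars)
saveRDS(model, "mpg_model.rds")

# In the production environment: load the model and score new, live data
deployed <- readRDS("mpg_model.rds")
new_cars <- data.frame(wt = c(2.5, 3.2), hp = c(95, 150))
predict(deployed, newdata = new_cars)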

10. Model Monitoring and Maintenance

After deployment, the model’s performance needs to be monitored continuously. Model Monitoring and
Maintenance ensures that the model remains effective over time as new data is introduced. Over time,
models may experience performance degradation, known as model drift, which occurs when the data
changes or the environment evolves.

Key Concepts:

 Monitoring: Regularly tracking the model’s predictions to detect any decrease in accuracy or
performance.

 Model Retraining: Periodically updating the model with new data to maintain its performance.

 Model Drift: Occurs when the model’s predictions become less accurate due to changes in data
patterns over time.
UNIT 4

Handling large data on a single computer: The problems you face when handling large data, general techniques for handling large volumes of data, general programming tips for dealing with large data sets

Handling large data on a single computer:


The problems you face when handling large data: A large volume of data poses new challenges, such
as overloaded memory and algorithms that never stop running. It forces you to adapt and expand your
repertoire of techniques. But even when you can perform your analysis, you should take care of issues
such as I/O (input/output) and CPU starvation, because these can cause speed issues. Figure 4.1 shows
a mind map that will gradually unfold as we go through the steps: problems, solutions, and tips.

A computer only has a limited amount of RAM. When you try to squeeze more data into this memory
than actually fits, the OS will start swapping out memory blocks to disks, which is far less efficient than
having it all in memory. But only a few algorithms are designed to handle large data sets; most of them
load the whole data set into memory at once, which causes the out-of-memory error. Other
algorithms need to hold multiple copies of the data in memory or store intermediate results. All of
these aggravate the problem.

Even when you cure the memory issues, you may need to deal with another limited resource: time.
Although a computer may think you live for millions of years, in reality you won’t. Certain algorithms
don’t take time into account; they’ll keep running forever. Other algorithms can’t end in a reasonable
amount of time when they need to process only a few megabytes of data. A third thing you’ll observe
when dealing with large data sets is that components of your computer can start to form a bottleneck
while leaving other systems idle. Although this isn’t as severe as a never-ending algorithm or out-of-
memory errors, it still incurs a serious cost. Think of the cost savings in terms of person days and
computing infrastructure for CPU starvation. Certain programs don’t feed data fast enough to the
processor because they have to read data from the hard drive, which is one of the slowest
components on a computer. This has been addressed with the introduction of solid state drives (SSD),
but SSDs are still much more expensive than the slower and more widespread hard disk drive (HDD)
technology.

General techniques for handling large volumes of data


Never-ending algorithms, out-of-memory errors, and speed issues are the most common challenges
you face when working with large data. The solutions can be divided into three categories: using the
correct algorithms, choosing the right data structure, and using the right tools (figure 4.2).

No clear one-to-one mapping exists between the problems and solutions because many solutions
address both lack of memory and computational performance. For instance, data set compression will
help you solve memory issues because the data set becomes smaller. But this also affects computation
speed with a shift from the slow hard disk to the fast CPU. Contrary to RAM (random access memory),
the hard disk will store everything even after the power goes down, but writing to disk costs more
time than changing information in volatile RAM. When constantly changing the information, RAM
is thus preferable over the (more durable) hard disk. With an unpacked data set, numerous read and
write operations (I/O) are occurring, but the CPU remains largely idle, whereas with the compressed
data set the CPU gets its fair share of the workload.

1. Choosing the right algorithm: -


Choosing the right algorithm can solve more problems than adding more or better hardware. An
algorithm that’s well suited for handling large data doesn’t need to load the entire data set into
memory to make predictions. Ideally, the algorithm also supports parallelized calculations. In this
section we’ll dig into three types of algorithms that can do that, they are online algorithms, block
algorithms, and MapReduce algorithms.
ONLINE LEARNING ALGORITHMS: - Several, but not all, machine learning algorithms can be trained
using one observation at a time instead of taking all the data into memory. Upon the arrival of a new
data point, the model is trained and the observation can be forgotten; its effect is now incorporated
into the model’s parameters. For example, a model used to predict the weather can use different
parameters (like atmospheric pressure or temperature) in different regions. When the data from one
region is loaded into the algorithm, it forgets about this raw data and moves on to the next region.
This “use and forget” way of working is the perfect solution for the memory problem as a single
observation is unlikely to ever be big enough to fill up all the memory of a modern day computer.
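
A toy base-R sketch of this "use and forget" idea: each observation updates a running estimate (here just a mean) and is then discarded, so memory use stays constant no matter how many observations arrive. This is only an analogue of an online learner, not a full algorithm:

# The "model" state is just a count and a running mean
state <- list(n = 0, mean = 0)

update_state <- function(state, x) {
  state$n    <- state$n + 1
  state$mean <- state$mean + (x - state$mean) / state$n   # incremental update
  state                                                    # x can now be forgotten
}

for (obs in rnorm(1e5)) {      # pretend these observations arrive one at a time
  state <- update_state(state, obs)
}
state$mean                      # close to 0, without ever storing the data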

BLOCK ALGORITHMS (DIVIDING A LARGE MATRIX INTO MANY SMALL ONES):

By cutting a large data table into small matrices, for instance, we can still do a linear regression. The
logic behind this matrix splitting and how a linear regression can be calculated with matrices can be
found in the sidebar. It suffices to know for now that the Python libraries we’re about to use will take
care of the matrix splitting, and linear regression variable weights can be calculated using matrix
calculus.
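
A hedged base-R sketch of the block idea for linear regression: the normal-equation terms t(X) %*% X and t(X) %*% y are accumulated block by block, so the full data never has to sit in memory at once. (The libraries referred to in the text handle this splitting for you; this only shows the principle, and the simulated data and block size are arbitrary.)

set.seed(7)
n <- 10000; p <- 3
X <- cbind(1, matrix(rnorm(n * p), n, p))        # intercept plus 3 predictors
y <- X %*% c(2, 0.5, -1, 3) + rnorm(n)

XtX <- matrix(0, p + 1, p + 1)                   # running sums over all blocks
Xty <- matrix(0, p + 1, 1)
for (start in seq(1, n, by = 1000)) {            # process 1,000 rows at a time
  rows <- start:min(start + 999, n)
  Xb <- X[rows, , drop = FALSE]
  yb <- y[rows, , drop = FALSE]
  XtX <- XtX + crossprod(Xb)                     # t(Xb) %*% Xb for this block
  Xty <- Xty + crossprod(Xb, yb)                 # t(Xb) %*% yb for this block
}
beta <- solve(XtX, Xty)                          # same coefficients lm() would give
beta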

MAPREDUCE:

MapReduce algorithms are easy to understand with an analogy: Imagine that you were asked to count
all the votes for the national elections. Your country has 25 parties, 1,500 voting offices, and 2 million
people. You could choose to gather all the voting tickets from every office individually and count them
centrally, or you could ask the local offices to count the votes for the 25 parties and hand over the
results to you, and you could then aggregate them by party. Map reducers follow a similar process to
the second way of working. They first map values to a key and then do an aggregation on that key
during the reduce phase.
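
A toy base-R sketch of the same election analogy: the map step counts votes per party within each office, and the reduce step aggregates those partial counts by party. (Real MapReduce frameworks distribute these two steps over many machines; this only illustrates the logic, and the vote counts are simulated.)

set.seed(1)
parties <- paste0("party_", 1:5)
votes   <- sample(parties, 10000, replace = TRUE)
offices <- split(votes, rep(1:20, each = 500))          # votes grouped per office

# Map: each office counts its own votes (key = party, value = count)
mapped <- lapply(offices, function(v) table(factor(v, levels = parties)))

# Reduce: aggregate the per-office counts by key
totals <- Reduce(`+`, mapped)
totals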

2. Choosing the right data structure: -


Algorithms can make or break your program, but the way you store your data is of equal importance.
Data structures have different storage requirements, but also influence the performance of CRUD
(create, read, update, and delete) and other operations on the data set. Figure 4.5 shows you have
many different data structures to choose from, three of which we’ll discuss here: sparse data, tree
data, and hash data. Let’s first have a look at sparse data sets.
SPARSE DATA: -

A sparse data set contains relatively little information compared to its entries (observations). Look at
figure 4.6: almost everything is “0” with just a single “1” present in the second observation on variable
9. Data like this might look ridiculous, but this is often what you get when converting textual data to
binary data. Imagine a set of 100,000 completely unrelated Twitter tweets. Most of them probably
have fewer than 30 words, but together they might have hundreds or thousands of distinct words. In
text mining we’ll go through the process of cutting text documents into words and storing them as
vectors. But for now imagine what you’d get if every word was converted to a binary variable, with “1”
representing “present in this tweet,” and “0” meaning “not present in this tweet.” This would result in
sparse data indeed. The resulting large matrix can cause memory problems even though it contains
little information.
Luckily, data like this can be stored compacted. In the case of figure 4.6 it could look like this: data =
[(2,9,1)]

Row 2, column 9 holds the value 1.

Support for working with sparse matrices is growing in Python. Many algorithms now support or
return sparse matrices.
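
R offers analogous support through the Matrix package (shipped with most R installations as a recommended package; hedged in case it is unavailable). The triplet from figure 4.6, row 2 and column 9 holding the value 1, can be stored like this:

library(Matrix)

# Store only the non-zero entries of a 3 x 10 matrix
m <- sparseMatrix(i = 2, j = 9, x = 1, dims = c(3, 10))
m                  # prints dots for the implicit zeros
object.size(m)     # far smaller than the dense equivalent for large matrices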

TREE STRUCTURES: - Trees are a class of data structure that allows you to retrieve information much
faster than scanning through a table. A tree always has a root value and sub trees of children, each
with its children, and so on. Simple examples would be your own family tree or a biological tree and
the way it splits into branches, twigs, and leaves. Simple decision rules make it easy to find the child
tree in which your data resides. Look at figure 4.7 to see how a tree structure enables you to get to the
relevant information quickly.

HASH TABLES: - Hash tables are data structures that calculate a key for every value in your data and
put the keys in a bucket. This way you can quickly retrieve the information by looking in the right
bucket when you encounter the data. Dictionaries in Python are a hash table implementation, and
they’re a close relative of key-value stores. Hash tables are used extensively in databases as indices for
fast information retrieval.
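
In R, environments are hash tables under the hood and play the role of Python dictionaries. A minimal sketch (the key and value are arbitrary):

# Create a hashed environment and use it as a key-value store
store <- new.env(hash = TRUE)
assign("user_42", list(name = "Ada", score = 97), envir = store)

get("user_42", envir = store)            # fast lookup by key
exists("no_such_key", envir = store)     # FALSE: key not present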

3. Selecting the right tools: -


With the right class of algorithms and data structures in place, it’s time to choose the right tool for the
job. The right tool can be a Python library or at least a tool that's controlled from Python, as shown in figure 4.8. The number of helpful tools available is enormous, so we'll look at only a handful of them.

PYTHON TOOLS: - Python has a number of libraries that can help you deal with large data. They range from smarter data structures and code optimizers to just-in-time compilers. The following is a list of libraries we like to use when confronted with large data: Cython, Numexpr, etc.

USE PYTHON AS A MASTER TO CONTROL OTHER TOOLS: - Most software and tool producers support a
Python interface to their software. This enables you to tap into specialized pieces of software with the
ease and productivity that comes with Python. This way Python sets itself apart from other popular
data science languages such as R and SAS. You should take advantage of this luxury and exploit the
power of specialized tools to the fullest extent possible.

General programming tips for dealing with large data sets: -


The tricks that work in a general programming context still apply for data science. Several might be
worded slightly differently, but the principles are essentially the same for all programmers. You can
divide the general tricks into three parts, as shown in the figure 4.9 mind map:

 Don’t reinvent the wheel. Use tools and libraries developed by others.

 Get the most out of your hardware. Your machine is never used to its full potential; with simple
adaptions you can make it work harder.

 Reduce the computing need. Slim down your memory and processing needs as much as possible.
1.Don’t reinvent the wheel: - “Don’t repeat anyone” is probably even better than “don’t repeat
yourself.” Add value with your actions: make sure that they matter. Solving a problem that has already
been solved is a waste of time. As a data scientist, you have two large rules that can help you deal with
large data and make you much more productive, to boot:

 Exploit the power of databases: The first reaction most data scientists have when working with large
data sets is to prepare their analytical base tables inside a database. This method works well when the
features you want to prepare are fairly simple. When this preparation involves advanced modeling,
find out if it’s possible to employ userdefined functions and procedures. The last example of this
chapter is on integrating a database into your workflow.

 Use optimized libraries: Creating libraries like Mahout, Weka, and other machine learning libraries requires time and knowledge. They are highly optimized and incorporate best practices and state-of-the-art technologies. Spend your time on getting things done, not on reinventing and repeating other people's efforts, unless it's for the sake of understanding how things work. Then you must consider your hardware limitations.

2. Get the most out of your hardware: - Resources on a computer can be idle, whereas other
resources are over-utilized. This slows down programs and can even make them fail. Sometimes it’s
possible (and necessary) to shift the workload from an overtaxed resource to an underutilized
resource using the following techniques:

 Feed the CPU compressed data: A simple trick to avoid CPU starvation is to feed the CPU
compressed data instead of the inflated (raw) data. This will shift more work from the hard disk to the
CPU, which is exactly what you want to do, because a hard disk can’t follow the CPU in most modern
computer architectures.

 Make use of the GPU: Sometimes your CPU, and not your memory, is the bottleneck. If your computations are parallelizable, you can benefit from switching to the GPU. This has a much higher
throughput for computations than a CPU. The GPU is enormously efficient in parallelizable jobs but
has less cache than the CPU. But it’s pointless to switch to the GPU when your hard disk is the
problem. Several Python packages, such as Theano and NumbaPro, will use the GPU without much
programming effort. If this doesn’t suffice, you can use a CUDA (Compute Unified Device Architecture)
package such as PyCUDA. It’s also a well-known trick in bitcoin mining, if you’re interested in creating
your own money.

 Use multiple threads: It’s still possible to parallelize computations on your CPU. You can achieve this
with normal Python threads.

3. Reduce your computing needs: - “Working smart + hard = achievement.” This also applies to
the programs you write. The best way to avoid having large data problems is by removing as much of
the work as possible up front and letting the computer work only on the part that can’t be skipped.
The following list contains methods to help you achieve this:

 Profile your code and remediate slow pieces of code: Not every piece of your code needs to be
optimized; use a profiler to detect slow parts inside your program and remediate these parts.

 Use compiled code whenever possible, certainly when loops are involved: Whenever possible use
functions from packages that are optimized for numerical computations instead of implementing
everything yourself. The code in these packages is often highly optimized and compiled.

 Otherwise, compile the code yourself: If you can’t use an existing package, use either a just-in-time
compiler or implement the slowest parts of your code in a lower-level language such as C or Fortran
and integrate this with your codebase. If you make the step to lower-level languages (languages that
are closer to the universal computer bytecode), learn to work with computational libraries such as
LAPACK, BLAS, Intel MKL, and ATLAS. These are highly optimized, and it's difficult to achieve similar
performance to them.

 Avoid pulling data into memory: When you work with data that doesn't fit in your memory, avoid pulling everything into memory. A simple way of doing this is by reading data in chunks and parsing the data on the fly (a short R sketch of this follows this list). This won't work for every algorithm but enables calculations on extremely large data sets.

 Use generators to avoid intermediate data storage. Generators help you return data per observation
instead of in batches. This way you avoid storing intermediate results.

 Use as little data as possible. If no large-scale algorithm is available and you aren’t willing to
implement such a technique yourself, then you can still train your data on only a sample of the original
data.

 Use your math skills to simplify calculations as much as possible. Take the following equation, for example: (a + b)² = a² + 2ab + b². The left side will be computed much faster than the right side of the equation; even for this trivial example, it could make a difference when talking about big chunks of data.
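
To illustrate the "avoid pulling data into memory" tip above, here is a hedged base-R sketch that reads a CSV file in chunks and updates a running statistic on the fly; the file name big_data.csv, the chunk size, and the assumption that the first column is numeric are all hypothetical:

con <- file("big_data.csv", open = "r")
col_names <- strsplit(readLines(con, n = 1), ",")[[1]]   # read the header once
chunk_size <- 10000
total <- 0
n <- 0
repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, col.names = col_names, nrows = chunk_size),
    error = function(e) NULL)          # NULL once the file is exhausted
  if (is.null(chunk) || nrow(chunk) == 0) break
  total <- total + sum(chunk[[1]])     # update the running statistic per chunk
  n     <- n + nrow(chunk)
}
close(con)
total / n                              # mean of the first column, computed on the fly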

UNIT 5

SUBSETTING R-OBJECTS, VECTORISED OPERATIONS, MANAGING DATA FRAMES WITH THE DPLYR, CONTROL STRUCTURES, FUNCTIONS, SCOPING RULES OF R, CODING STANDARDS IN R, LOOP FUNCTIONS, DEBUGGING SIMULATIONS, CASE STUDIES ON PRELIMINARY DATA ANALYSIS.

SUBSETTING R-OBJECTS: REFER UNIT 1,2

VECTORISED OPERATIONS: REFER UNIT 1,2

MANAGING DATA FRAMES WITH THE DPLYR: REFER UNIT 1,2

CONTROL STRUCTURES: REFER UNIT 1,2

FUNCTIONS: REFER UNIT 1,2

SCOPING RULES OF R

R's scoping rules, specifically lexical scoping, determine how values are associated with free variables within functions: a free variable is looked up in the environment where the function was defined, and then in that environment's parents.

Here's a breakdown of R's scoping rules:

1. Lexical (or Static) Scoping:

 R uses lexical scoping, meaning the value of a free variable (a variable not defined
within the function itself) is determined by the environment where the function
was defined, not where it's called.

 This contrasts with dynamic scoping, where the value is determined by the call
stack.

2. The Search Process:


 When a function encounters a free variable, R first searches for it in the environment in which the function was defined (its enclosing environment).

 If it is not found there, the search continues in the parent of that environment.

 This process continues up the chain of environments until the variable is found; if the top-level (global) environment is reached, the search continues along the search list of loaded packages, and an error is thrown if the variable is never found.

3. Key Concepts:

 Name Masking:

If a variable is defined both in the function's environment and in a parent environment, the variable in the function's environment "masks" the one in the parent environment, meaning the function uses the local variable.

 Functions vs. Variables:

R distinguishes between functions and other objects when searching for a name: if a name is used in a function call, R skips non-function objects with that name and keeps searching until it finds a function.

 A Fresh Start:

Every time a function is called, a fresh evaluation environment is created for that call, so local variables do not persist between calls; its parent is the environment in which the function was defined, not the caller's environment.
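
A short base-R sketch of lexical scoping and name masking (the function and variable names are arbitrary):

y <- 10                        # defined in the global environment

make.multiplier <- function(m) {
  function(x) x * m + y        # m and y are free variables in the inner function
}

double <- make.multiplier(2)
double(3)                      # 16: m = 2 comes from where the function was defined,
                               # and y = 10 is found by searching parent environments

f <- function() {
  y <- 1                       # name masking: this local y hides the global y
  y + 5
}
f()                            # 6: the local y is used inside the function
y                              # still 10: the global y is untouched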

CODING STANDARDS IN R :REFER UNIT 1,2

LOOP FUNCTIONS:REFER UNIT 1,2

DEBUGGING SIMULATIONS

Debugging simulations in R involves identifying and resolving errors, glitches, or unexpected behavior in your simulation code. You can use R's built-in debugging tools such as debug(), browser(), and traceback() to step through your code, examine variable values, and pinpoint the source of issues.

Here's a breakdown of common debugging techniques and tools for simulations in R:


1. Understanding the Problem:

 Error Messages:

Carefully examine error messages, as they often provide clues about the nature and
location of the problem.

 Unexpected Output:

If your simulation produces unexpected results, meticulously compare the output with
the expected behavior to identify discrepancies.

 Reproducibility:

Ensure your simulation is reproducible by setting a seed for random number generators
(e.g., set.seed()).

2. Debugging Tools:

 debug():

Flags a function for debugging, allowing you to step through it line by line.

 debug(function_name): Starts debugging of the specified function.

 undebug(function_name): Removes the debugging flag from the function.

 browser():

Suspends execution at a specific point in your code, allowing you to inspect the current
state of variables and evaluate expressions.

 traceback():

Prints the call stack, showing the sequence of function calls that led to an error.

 trace():

Allows you to insert debugging code (e.g., print() statements) into a function at specific
places.

 recover():
Modifies the error behavior, allowing you to inspect the call stack after an error occurs.

 RStudio Debugger:

RStudio provides a graphical debugger with features like stepping through code,
inspecting variables, and setting breakpoints.
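
A small sketch showing traceback() and browser() in use (the function names are arbitrary examples; the failing call is left commented out so the script runs cleanly):

f <- function(x) g(x)
g <- function(x) log(x)          # fails if x is not numeric

# f("ten")                       # would raise an error ...
# traceback()                    # ... and traceback() would show g(x) called by f(x)

h <- function(n) {
  total <- 0
  for (i in 1:n) {
    # browser()                  # uncomment to pause here and inspect i and total
    total <- total + i
  }
  total
}
h(5)                             # 15

# debug(h); h(5); undebug(h)     # alternatively, step through h() line by line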

3. Debugging Strategies:

 Reduce Complexity:

Simplify your simulation by reducing the number of iterations, parameters, or complex logic to isolate the problem.

 Print Statements:

Insert print() statements to display the values of variables at different points in your
code.

 Step-by-Step Execution:

Use the debugger to step through your code line by line, examining variable values at
each step.

 Conditional Debugging:

Use if statements to conditionally enable debugging, for example, only when a specific
condition is met.

 Logging:

Use logging techniques to record events and data during the simulation, which can be
useful for later analysis.

 Test Cases:

Design specific test cases to verify the correctness of different parts of your simulation.

 Smaller Datasets:
Start with smaller datasets to make debugging easier, and then gradually increase the
dataset size.

 Breakpoints:

Use breakpoints in the RStudio debugger to pause execution at specific lines of code.

 Inspect Variables:

Use the debugger to examine the values of variables in the current environment.

 Traceback:

Use traceback() to see the call stack when an error occurs.

Example:

# Function to simulate a simple process
my_simulation <- function(n, p) {
  # Set a seed for reproducibility
  set.seed(123)

  # Generate random data
  data <- rbinom(n, 1, p)

  # Calculate the mean
  mean_data <- mean(data)

  # Return the mean
  return(mean_data)
}

# Debug the function
# debug(my_simulation) # Uncomment this line to start debugging

# Run the simulation
result <- my_simulation(100, 0.5)

# Print the result
print(result)

Troubleshooting Tips:

 Syntax Errors: Double-check for typos, missing parentheses, or other syntax errors.

 Logical Errors: Carefully review the logic of your simulation to ensure it is correct.

 Package Conflicts: Ensure that there are no conflicts between different packages
you are using.

 Data Issues: Check the format and validity of your input data.

 External Libraries: If your simulation uses external libraries, ensure that they are
installed and configured correctly.

CASE STUDIES ON PRELIMINARY DATA ANALYSIS: REFER UNIT 1,2
