Vidyavardhini’s College of Engineering & Technology
Department of Computer Science and Engineering (Data Science)
ACADEMIC YEAR: 2024-25
Course: Data Analytics And Visualization Lab
Course code: CSL601
Year/Sem: TE/VI
Experiment No.: 1
Aim: Introduction to Data analytics libraries in Python and R
Name: Shubham Deshwal
Roll Number: 10
Date of Performance: 14/1/25
Date of Submission: 28/1/25
Evaluation

Performance Indicator | Max. Marks | Marks Obtained
Performance | 5 |
Understanding | 5 |
Journal work and timely submission. | 10 |
Total | 20 |

Performance Indicator | Exceed Expectations (EE) | Meet Expectations (ME) | Below Expectations (BE)
Performance | 5 | 3 | 2
Understanding | 5 | 3 | 2
Journal work and timely submission. | 10 | 8 | 4
Checked by
Name of Faculty : Mrs. Komal Champanerkar
Signature :
Date :
Experiment No. 1
Aim: Introduction to Data analytics libraries in Python and R.
Objective: To understand the use of Python and R, and to use their libraries effectively for data science.
Description:
Why Choose Python?
Python is a popular choice for data analytics due to its versatility, ease of use, and the robust ecosystem of
tools and libraries. Here are some key reasons why Python excels in data analytics:
1. Ease of Learning and Use
● Python has a simple and readable syntax, making it accessible for beginners and efficient for
experienced programmers.
● The language's simplicity allows analysts to focus on problem-solving rather than syntax
complexities.
2. Comprehensive Libraries
Python offers a wide range of libraries for data analytics, including:
● Pandas: For data manipulation and analysis.
● NumPy: For numerical computations and handling large datasets.
● Matplotlib & Seaborn: For data visualization and creating informative plots.
● SciPy: For scientific computing and advanced statistical analysis.
● Scikit-learn: For machine learning and predictive modeling.
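As a small illustration of how the first two of these libraries fit together, here is a minimal sketch (the DataFrame contents are invented sample data):

```python
import numpy as np
import pandas as pd

# A small DataFrame of invented sales records.
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "revenue": [120.0, 80.0, 150.0, 95.0],
})

# Pandas: group by a column and aggregate.
totals = sales.groupby("region")["revenue"].sum()

# NumPy: numerical computation on the underlying array.
mean_revenue = np.mean(sales["revenue"].to_numpy())

print(totals["North"])   # 270.0
print(mean_revenue)      # 111.25
```

Pandas handles the tabular manipulation (grouping, aggregation) while NumPy supplies the fast numerical routines underneath; Matplotlib or Seaborn could then plot `totals` directly.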
3. Scalability and Performance
● Python can handle large datasets and complex computations with tools like Dask and PySpark.
● For performance-intensive tasks, Python integrates well with languages like C and C++ to
improve speed.
4. Cross-Platform Compatibility
● Python runs on various operating systems (Windows, macOS, Linux), ensuring flexibility in
deployment and collaboration.
5. Extensive Community Support
● Python has a vast, active community that provides support, tutorials, and pre-built solutions,
making it easier to learn and solve challenges.
6. Integration Capabilities
● Python seamlessly integrates with other languages, databases (e.g., SQL, NoSQL), and tools like
Hadoop, Tableau, and Power BI.
● APIs and web scraping tools (e.g., BeautifulSoup, Requests) make it easy to gather and analyze
data from the web.
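To keep the sketch below dependency-free, it uses the standard library's `html.parser` as a stand-in for the link extraction that BeautifulSoup makes more convenient (the HTML snippet is invented):

```python
from html.parser import HTMLParser

# Collect every href target from an HTML document -- the kind of
# link extraction a scraping workflow starts with.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

html = '<p><a href="/docs">Docs</a> and <a href="/blog">Blog</a></p>'
parser = LinkCollector()
parser.feed(html)
print(parser.links)  # ['/docs', '/blog']
```

In practice, Requests would fetch the page over HTTP and BeautifulSoup would replace the hand-written parser class with a one-line `soup.find_all("a")`.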
7. Automation and Scripting
● Python is excellent for automating repetitive tasks, such as data cleaning, ETL (Extract,
Transform, Load) processes, and report generation.
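The ETL steps just described can be sketched with the standard library alone; the CSV text, field names, and cleaning rule below are invented for the example:

```python
import csv
import io

# Extract: raw CSV text (stands in for a file or an API response).
raw = "name,score\nAlice, 90 \nBob,\nCara,75\n"

# Transform: strip whitespace and drop rows with a missing score.
cleaned = []
for row in csv.DictReader(io.StringIO(raw)):
    score = row["score"].strip()
    if score:
        cleaned.append({"name": row["name"].strip(), "score": int(score)})

# Load: write the cleaned rows back out as CSV.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["name", "score"])
writer.writeheader()
writer.writerows(cleaned)

print(len(cleaned))  # 2 rows survive cleaning
```

A real pipeline would swap the in-memory strings for files or database connections, but the extract-transform-load shape stays the same.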
8. Open Source and Cost-Effective
● Python is free and open-source, reducing costs for organizations while providing a highly capable
environment.
9. Growing Use in Machine Learning and AI
● Python supports advanced analytics with libraries like TensorFlow, PyTorch, and Keras for
machine learning, deep learning, and artificial intelligence.
10. Interactive Analysis with Jupyter Notebooks
● Jupyter Notebooks allow users to combine code, visualization, and narrative text in a single
document, making it a powerful tool for exploratory data analysis and reporting.
Python's adaptability, coupled with its robust ecosystem, makes it a powerful and practical choice for data
analytics across industries, from finance and healthcare to marketing and e-commerce.
Why Choose R?
R is a powerful tool for data analytics, particularly favored in academia and industries requiring statistical
analysis and data visualization. Here's why R stands out in data analytics:
1. Designed for Data Analysis
● R was created specifically for statistical computing and data visualization.
● It has a rich set of functions for statistical modeling, hypothesis testing, regression analysis, and
more.
2. Comprehensive Libraries
R provides a vast range of packages for data analytics, including:
● ggplot2: For creating sophisticated and customizable data visualizations.
● dplyr and tidyr: For data manipulation and cleaning.
● caret: For machine learning and predictive modeling.
● shiny: For building interactive web applications for data visualization.
3. Advanced Statistical Techniques
● R excels at implementing complex statistical methods, including time-series analysis, Bayesian
modeling, and survival analysis.
● The language supports cutting-edge statistical research, with many new methods available in R
packages.
4. Data Visualization Capabilities
● R provides excellent data visualization tools, such as ggplot2, lattice, and plotly.
● It allows for creating both static and interactive visualizations tailored to specific analytical needs.
5. Flexible Data Handling
● R handles a variety of data types, from structured data (e.g., tables) to unstructured data (e.g., text
and images).
● Packages like data.table optimize processing large datasets efficiently.
6. Strong Community and Open Source
● R has a strong academic and professional community contributing to its development.
● The Comprehensive R Archive Network (CRAN) hosts over 18,000 packages, offering solutions
for diverse analytical challenges.
7. Integration with Other Tools
● R integrates seamlessly with databases, Excel, and big data tools like Hadoop and Spark.
● It also works well with programming languages like Python, C++, and Java for extended
functionality.
8. Cross-Platform Availability
● R is available on Windows, macOS, and Linux, ensuring flexibility for users across different
operating systems.
9. Interactive Environment
● IDEs like RStudio provide an intuitive interface for coding, debugging, and visualization.
● R Markdown allows combining code, narrative text, and visualizations in dynamic reports.
10. Academia and Research Focus
● R is widely used in academic research due to its statistical rigor and ability to implement new
methodologies quickly.
● It’s the preferred language for statisticians and researchers.
11. Cost-Effective
● R is free and open-source, making it accessible to individuals, businesses, and institutions without
licensing costs.
Conclusion
R's strength lies in its statistical capabilities, rich ecosystem, and visualization tools, making it a leading
choice for data analysts, statisticians, and researchers. It complements Python well and excels in tasks
requiring heavy statistical computation and academic rigor.
R vs Python: Key Differences

Sr. No | Parameter | Python | R
1 | General | Being a general-purpose programming language, Python is widely used for data analysis and scientific computing. | R is predominantly used for statistical computing and graphics due to its functional programming environment.
2 | Objective | A general-purpose language used for Data Science, Web Development, and Embedded Systems. | A statistical programming language used for Data Science and Statistical Modeling.
3 | IDE | PyCharm, Spyder, Thonny, IPython | RStudio, Eclipse with StatET, RKWard
4 | Packages and Libraries | NumPy, Pandas, Pytest, Matplotlib, Requests, TensorFlow, scikit-learn, PyTorch, Theano | ggplot2, data.table, dplyr, Plotly, tidyr, readr, stringr, lubridate, shiny
5 | Syntax | Python has a relatively simple syntax and is easy to learn. | R has a more complex syntax and a steeper learning curve.
6 | Workability | Python offers many easy-to-use packages. | R easily performs matrix computation and optimization.
7 | Integration | Well integrated with web applications. | Programs typically run locally.
8 | Community | Python has a larger, more robust community for ongoing support and development. | The R community is comparatively smaller.
9 | Learning Curve | Linear and smooth. | Difficult at the beginning.
10 | Machine Learning | Excellent for machine learning, with libraries such as scikit-learn and TensorFlow. | Equally capable, with libraries such as caret and H2O.
R packages:

1. purrr — A functional programming toolkit for R that simplifies iteration and mapping when working with lists and vectors.
● map(): Applies a function to each element of a list or vector and returns a list.
● map_df(): Maps a function and returns a data frame as the output.
● map_chr(), map_dbl(), map_int(): Apply a function and return a specific type (character, double, integer).
● safely(): Executes a function and captures errors without stopping execution.
● reduce(): Combines the elements of a list/vector into a single result using a specified function.
2. Rcrawler — A web scraping and crawling package for R, designed to extract structured data from websites.
● Rcrawler(): The main function for crawling and scraping websites.
● LinkExtractor(): Extracts all links from a webpage.
● ContentScraper(): Extracts specific content (e.g., text, tables) from a webpage.
● NetworkGraph(): Generates a graph of the website's link structure.
● write_out(): Saves the crawled data in a structured format for further analysis.
3. tidyquant — A financial analysis package that combines the tidyverse with quantitative finance tools.
● tq_get(): Retrieves financial data from various sources (e.g., Yahoo Finance).
● tq_transmute(): Applies transformations to time-series data, such as calculating moving averages or returns.
● tq_performance(): Evaluates portfolio performance metrics such as the Sharpe ratio or annualized returns.
● tq_portfolio(): Aggregates and analyzes portfolio-level time-series data.
● tq_mutate(): Adds calculated financial indicators, such as Bollinger Bands or RSI, to data frames.
4. knitr — A dynamic report generation tool that integrates R code, results, and narrative text into documents such as PDF, HTML, and Word.
● knit(): Converts R Markdown files (.Rmd) into the final output format.
● kable(): Creates formatted tables in reports.
● opts_chunk$set(): Customizes global chunk options (e.g., echo, results).
● include_graphics(): Embeds external images in reports.
● purl(): Extracts the R code from a document for reuse.
5. mlr — A comprehensive machine learning package for R that provides tools for training, tuning, and validating models in a consistent framework.
● makeLearner(): Defines a machine learning algorithm or model.
● train(): Trains a model on a given dataset.
● predict(): Generates predictions using a trained model.
● resample(): Performs cross-validation and other resampling methods.
● benchmark(): Compares multiple models or algorithms on a dataset.
Python packages:

1. Scrapy — A powerful framework for web scraping and crawling that extracts data from websites efficiently.
● CrawlSpider: A base spider class for building crawlers that follow links across websites.
● Selector: Extracts data using XPath or CSS selectors.
● Pipeline: Processes and stores scraped data (e.g., cleaning it or saving it to a database).
● Request: Sends HTTP requests to fetch web pages for scraping.
● Item: Defines the structure of the scraped data.
2. PyTorch — An open-source deep learning framework for building and training neural networks, known for its flexibility and dynamic computation graph.
● torch.Tensor: A multi-dimensional array supporting computations with automatic differentiation.
● torch.nn: A module for building neural networks.
● torch.optim: Optimization algorithms such as SGD and Adam for training models.
● torch.utils.data: Utilities for handling datasets and creating data loaders.
● torch.autograd: Automatic differentiation for gradient computation.
3. Keras — A high-level API for building and training deep learning models, often used with TensorFlow as the backend.
● Sequential: Builds neural networks layer by layer.
● Model: A flexible way to define and train models.
● layers: A library of pre-built neural network layers (e.g., Dense, Conv2D).
● compile: Configures the model's optimizer, loss function, and metrics.
● fit: Trains the model on a dataset.
4. Scikit-learn — A robust machine learning library offering tools for classification, regression, clustering, and dimensionality reduction.
● train_test_split: Splits data into training and testing sets.
● fit: Trains a machine learning model on data.
● predict: Generates predictions using the trained model.
● GridSearchCV: Tunes hyperparameters to find the best configuration.
● PCA (Principal Component Analysis): Reduces the dimensionality of data.
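A minimal sketch of the Scikit-learn workflow named above (train_test_split, fit, predict), using a tiny invented dataset with two well-separated classes:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Tiny invented dataset: one feature, two clearly separated classes.
X = [[0.0], [0.2], [0.4], [0.6], [2.0], [2.2], [2.4], [2.6]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Split into training and testing sets (stratified so both classes appear).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a model, then score it on the held-out data.
model = LogisticRegression().fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(accuracy)
```

The same split/fit/predict shape applies whichever estimator is plugged in; GridSearchCV wraps the fit step to search over hyperparameters.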
5. NLTK (Natural Language Toolkit) — A library for processing and analyzing human language data, widely used in natural language processing (NLP) tasks.
● word_tokenize: Splits text into individual words.
● pos_tag: Tags words with their part of speech (e.g., noun, verb).
● stopwords: Provides common words to filter out during text preprocessing.
● FreqDist: Computes the frequency distribution of words in a text.
● stem: Reduces words to their root form using stemmers such as PorterStemmer.
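NLTK itself needs corpus downloads before it runs; as a dependency-free illustration of what word_tokenize and FreqDist compute, here is a rough standard-library approximation (the sample sentence is invented, and the regex tokenizer is a simplification of NLTK's):

```python
import re
from collections import Counter

text = "The quick brown fox jumps over the lazy dog. The dog sleeps."

# Rough stand-in for nltk.word_tokenize: lowercase word extraction.
tokens = re.findall(r"[a-z]+", text.lower())

# Rough stand-in for nltk.FreqDist: count token frequencies.
freq = Counter(tokens)
print(freq.most_common(2))  # [('the', 3), ('dog', 2)]
```

NLTK's real tokenizer also handles punctuation, contractions, and sentence boundaries, which this sketch deliberately ignores.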
Common IDEs
RStudio:
● Designed for R but supports Python via the reticulate package.
● Ideal for combining R and Python in data analysis.
Jupyter Notebook/Lab:
● Interactive, browser-based IDE with support for both R and Python.
● Great for data science and exploratory analysis.
VS Code:
● Lightweight and extensible with R and Python extensions.
● Suitable for multi-language projects with advanced customization.
Atom:
● Supports R and Python via packages like Hydrogen.
● Lightweight and good for basic needs.
Spyder:
● Primarily a Python IDE but supports R with RPy2.
● Useful for data science and scientific computing.
Emacs with ESS:
● Powerful, customizable editor with R and Python support via ESS.
● Best for advanced users.
Conclusion:
We discussed key Python and R packages for data analytics, like Scrapy (web scraping), PyTorch and
Keras (deep learning), Scikit-learn (machine learning), and NLTK (NLP). Python is great for
general-purpose coding, machine learning, and scalability, while R excels in statistical analysis and
visualization.
We also compared IDEs for both languages, such as RStudio, Jupyter, and VS Code, with RStudio and
Jupyter being popular for data science. In conclusion, Python is ideal for machine learning and
large-scale projects, while R is best for statistical work and visualization.