Data Science Question Bank With Answer

Unit-I

1. Explain data, information and knowledge triangle of Data.

Ans –

The four levels of the DIKW pyramid are:

1.Data: This is the raw, unprocessed facts and figures that are collected from various sources. Data
can be structured or unstructured and may include text, numbers, images, audio, and video.

2.Information: Data becomes information when it is organized, processed, and interpreted in a meaningful way. Information provides context and relevance to data and enables decision-making and action.

3.Knowledge: Knowledge is the understanding gained from information, through analysis, interpretation, and synthesis. Knowledge is often based on experience, expertise, and intuition, and enables more complex decision-making and problem-solving.

4.Wisdom: Wisdom is the highest level of the DIKW pyramid, representing the ability to apply
knowledge and experience to make sound judgments and decisions. Wisdom requires reflection,
insight, and foresight, and is often based on a deep understanding of the broader context and
implications of decisions.

2. Write note on high level language.

Ans-

High-level languages are the backbone of data science, offering a versatile and intuitive platform for
professionals to analyze, manipulate, and interpret data effectively. Python, R, and Julia stand out as
the primary choices, providing rich libraries and frameworks tailored for data-centric tasks. These
languages abstract away low-level complexities, allowing data scientists to focus on problem-solving
rather than implementation details. Their expressive syntax, extensive community support, and
seamless integration with other tools make them indispensable for tackling the diverse challenges of
data analysis and machine learning.

In data science workflows, high-level languages enable rapid prototyping, experimentation, and
iteration, facilitating a dynamic and agile approach to problem-solving. Python, for instance, is
renowned for its simplicity and readability, making it accessible to both beginners and seasoned
professionals alike. Its ecosystem encompasses powerful libraries such as NumPy, Pandas, and scikit-
learn, which streamline tasks ranging from data wrangling to model deployment. Similarly, R boasts
comprehensive packages like tidyverse and caret, tailored specifically for statistical analysis and
machine learning, while Julia's high-performance computing capabilities make it ideal for numerical
computing and optimization tasks.
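
To make this concrete, here is a brief, hypothetical Python sketch (the column names and figures are invented for illustration) showing how a few lines of high-level code handle work that would otherwise require explicit loops and memory management:

import numpy as np
import pandas as pd

# Hypothetical monthly sales figures for two products.
df = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar"],
    "product_a": [120, 150, 170],
    "product_b": [90, 110, 105],
})

# One vectorized line replaces an explicit loop.
df["total"] = df["product_a"] + df["product_b"]

print(df.describe())        # summary statistics in a single call
print(np.log(df["total"]))  # element-wise transformation with NumPy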
Moreover, the collaborative nature of high-level languages fosters knowledge-sharing and innovation
within the data science community. Online forums, tutorials, and open-source contributions
contribute to a vibrant ecosystem where practitioners can exchange ideas, troubleshoot challenges,
and leverage best practices. Whether developing predictive models, exploring data visualizations, or
deploying scalable solutions, the versatility and support provided by high-level languages empower
data scientists to extract actionable insights and drive meaningful impact in diverse domains, from
healthcare and finance to marketing and beyond.

3. Explain IDE.

Ans:

An IDE (Integrated Development Environment) is software that combines commonly used developer tools into a compact GUI (graphical user interface) application. It is a combination of tools like a code editor, code compiler, and code debugger with an integrated terminal. By integrating features like software editing, building, testing, and packaging in a simple-to-use tool, IDEs help boost developer productivity. IDEs are commonly used by programmers and software developers to make their programming journey smoother.

Common Features of an IDE

1. Editor: A text editor that helps you write software code by highlighting syntax with visual cues, providing language-specific auto-completion, and checking for bugs as you type.

2. Compiler: A compiler translates human-readable code into machine code that can be executed on different operating systems such as Linux, Windows, or macOS. Most IDEs come with built-in compilers for the languages they support.

3. Debugger: A tool that helps developers test and debug their applications and graphically points out the locations of bugs or errors, if any.

4. Built-in Terminal: A terminal is a text-based interface for interacting with the machine’s operating system. With a built-in terminal/console, developers can run scripts or commands directly within the IDE.

5. Version Control: Version control helps bring clarity to the development of the software. Some
IDEs also support version control tools like Git, through which a user can track and manage the
changes to the software code.

6. Code snippets: IDEs support code snippets, which are usually used to accomplish a single task and can greatly reduce redundant work.

7. Extensions and Plugins: Extensions and Plugins are used to extend the functionality of the
IDEs with respect to specific programming languages.

8. Code navigation: IDEs come with tools like code folding, class and method navigation, and
refactoring tools that make it simple to go through and analyze code.
4. Write note on EDA and data visualization.

Ans-

EDA

The main purpose of EDA is to help look at data before making any assumptions. It can help identify obvious errors, better understand patterns within the data, detect outliers or anomalous events, and find interesting relations among the variables.

Data scientists can use exploratory analysis to ensure the results they produce are valid and applicable to any desired business outcomes and goals. EDA also helps stakeholders by confirming they are asking the right questions. EDA can help answer questions about standard deviations, categorical variables, and confidence intervals.

Once EDA is complete and insights are drawn, its features can then be used for more sophisticated data analysis or modeling, including machine learning.

Data visualization

Data visualization is the graphical representation of quantitative information and data using visual elements like graphs, charts, and maps.

 Data visualization converts large and small data sets into visuals, which are easy for humans to understand and process.

 Data visualization tools provide accessible ways to understand outliers, patterns, and trends in the data.

 In the world of Big Data, data visualization tools and technologies are required to analyze vast amounts of information.

 Data visualizations are common in everyday life and usually appear in the form of graphs and charts. A combination of multiple visualizations and bits of information is referred to as an infographic.

 Data visualizations are used to discover unknown facts and trends. Line charts display change over time, bar and column charts are useful for observing relationships and making comparisons, a pie chart is a great way to show parts of a whole, and maps are the best way to share geographical data visually.

 Today's data visualization tools go beyond the charts and graphs used in Microsoft Excel spreadsheets, displaying data in more sophisticated ways such as dials and gauges, geographic maps, heat maps, pie charts, and fever charts.
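
As a small illustrative sketch (the file name and column names are hypothetical), a typical first pass at EDA and visualization in Python might look like this:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales.csv")   # hypothetical input file

print(df.head())        # inspect the first few rows
print(df.describe())    # means, standard deviations, quartiles
print(df.isna().sum())  # spot missing values early

# A line chart for change over time and a histogram to reveal outliers and skew.
df.plot(x="date", y="revenue", kind="line")    # assumes these columns exist
df["revenue"].plot(kind="hist", bins=20)
plt.show()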

5. What is data? Explain different types of data sources.

Ans-

Data

Data refers to information, facts, or figures that are collected, stored, and used for various purposes. Essentially, data is raw, unprocessed material that requires interpretation or analysis to derive meaning or insights. Data can be collected through various methods such as surveys, experiments, and observation.

There are different kinds of data, such as the following:

Sound

Video

Single character

Number (integer or floating-point)

Picture

Boolean (true or false)

Text (string)

Data sources - The following are the two sources of data:

1. Internal sources
 When data is collected from reports and records of the organisation itself, it is known as an internal source.
 For example, a company publishes its annual report on profit and loss, total sales, loans, wages, etc.

2. External sources
 When data is collected from sources outside the organisation, it is known as an external source.
 For example, if a tour and travel company obtains information on Karnataka tourism from Karnataka Transport Corporation, it would be known as an external source of data.

6. Explain data collection methods.

Ans-

The following are seven primary methods of collecting data in business analytics:

 Surveys

 Transactional Tracking

 Interviews and Focus Groups

 Observation

 Online Tracking

 Forms

 Social Media Monitoring

Data collection breaks down into two methods. As a side note, many terms, such as techniques, methods, and types, are interchangeable depending on who uses them. One source may call data collection techniques “methods,” for instance. But whatever labels we use, the general concepts and breakdowns apply across the board, whether we’re talking about marketing analysis or a scientific research project.

The two methods are:

 Primary:

As the name implies, this is original, first-hand data collected by the data researchers. This process is the initial information-gathering step, performed before anyone carries out any further or related research. Primary data results are highly accurate, provided the researcher collects the information. However, there’s a downside, as first-hand research is potentially time-consuming and expensive.

 Secondary:

Secondary data is second-hand data collected by other parties that has already undergone statistical analysis. This data is either information that the researcher has tasked other people to collect or information the researcher has looked up. Simply put, it’s second-hand information. Although it’s easier and cheaper to obtain than primary information, secondary information raises concerns regarding accuracy and authenticity. Quantitative data makes up a majority of secondary data.

7. Explain Data Cleaning.

Ans:

Data cleaning is a crucial step in the machine learning (ML) pipeline, as it involves identifying and
removing any missing, duplicate, or irrelevant data.

The goal of data cleaning is to ensure that the data is accurate, consistent, and free of errors, as
incorrect or inconsistent data can negatively impact the performance of the ML model.

Professional data scientists usually invest a very large portion of their time in this step because of the
belief that “Better data beats fancier algorithms”.

Data cleaning, also known as data cleansing or data preprocessing, is a crucial step in the data
science pipeline that involves identifying and correcting or removing errors, inconsistencies, and
inaccuracies in the data to improve its quality and usability.

Data cleaning is essential because raw data is often noisy, incomplete, and inconsistent, which can
negatively impact the accuracy and reliability of the insights derived from it.
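
A minimal pandas sketch of common cleaning steps is shown below; the file and column names are hypothetical and only illustrate the kinds of operations involved:

import pandas as pd

df = pd.read_csv("raw_data.csv")                  # hypothetical input file

df = df.drop_duplicates()                         # remove duplicate records
df = df.dropna(subset=["customer_id"])            # drop rows missing a key field
df["age"] = df["age"].fillna(df["age"].median())  # impute a numeric column
df["city"] = df["city"].str.strip().str.title()   # fix inconsistent text values

df.info()                                         # verify types and null counts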

8. Explain data Analysis and Modeling.

Ans:
The collection, transformation, and organization of data to draw conclusions, make predictions for the future, and make informed data-driven decisions is called data analysis. A professional who performs data analysis is called a data analyst.

 Data Requirement Gathering: Ask yourself why you’re doing this analysis, what type of data you want to use, and what data you plan to analyze.

 Data Collection: Guided by your identified requirements, it’s time to collect the data from your sources. Sources include case studies, surveys, interviews, questionnaires, direct observation, and focus groups. Make sure to organize the collected data for analysis.

 Data Cleaning: Not all of the data you collect will be useful, so it’s time to clean it up. This process is where you remove white spaces, duplicate records, and basic errors. Data cleaning is mandatory before sending the information on for analysis.

 Data Analysis: Here is where you use data analysis software and other tools to help you interpret and understand the data and arrive at conclusions. Data analysis tools include Excel, Python, R, Looker, Rapid Miner, Chartio, Metabase, Redash, and Microsoft Power BI.

 Data Interpretation: Now that you have your results, you need to interpret them and come up with the best courses of action based on your findings.

 Data Visualization: Data visualization is a fancy way of saying, “graphically show your information in a way that people can read and understand it.” You can use charts, graphs, maps, bullet points, or a host of other methods. Visualization helps you derive valuable insights by helping you compare datasets and observe relationships.

Data modeling

Data modelers use three types of models to separately represent business concepts and workflows, relevant data entities and their attributes and relationships, and technical structures for managing the data. The models typically are created in a progression as organizations plan new applications and databases.

These are the different types of data models and what they include:

Conceptual data model:-

This is a high-level visualization of the business or analytics processes that a system will support.

It maps out the kinds of data that are needed, how different business entities interrelate and
associated business rules. Business executives are the main audience for conceptual data models, to
help them see how a system will work and ensure that it meets business needs.

Conceptual models aren't tied to specific database or application technologies.

Logical data model:-

Once a conceptual data model is finished, it can be used to create a less-abstract logical one.

Logical data models show how data entities are related and describe the data from a technical
perspective.

For example, they define data structures and provide details on attributes, keys, data types and other
characteristics. The technical side of an organization uses logical models to help understand required
application and database designs. But like conceptual models, they aren't connected to a particular
technology platform.

Physical data model:-

A logical model serves as the basis for the creation of a physical data model.

Physical models are specific to the database management system (DBMS) or application software
that will be implemented.

They define the structures that the database or a file system will use to store and manage the data.

That includes tables, columns, fields, indexes, constraints, triggers and other DBMS elements.

Database designers use physical data models to create designs and generate schema for databases.
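
As an illustration of how a physical data model maps to concrete DBMS structures, here is a small sketch using Python's built-in sqlite3 module; the table, columns, and index are hypothetical:

import sqlite3

conn = sqlite3.connect(":memory:")   # in-memory database, for illustration only
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,   -- physical key definition
        name        TEXT NOT NULL,         -- column types and constraints
        email       TEXT UNIQUE,
        created_at  TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")
conn.execute("CREATE INDEX idx_customers_name ON customers (name)")  # index as a physical structure
conn.commit()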
UNIT – 2

9) Write note on Data Curation.

Ans:-

Data curation is the process of collecting, organizing, managing, and maintaining data to ensure its
quality, usability, and longevity. It involves various activities aimed at enhancing the value of data for
analysis, interpretation, and decision-making. Data curation is crucial in today's data-driven world
where vast amounts of data are generated daily across various domains such as business, science,
healthcare, and academia.

Here are some key aspects of data curation

1.Data Collection: Data curation begins with the collection of relevant data from various sources. This
may include structured data from databases, unstructured data from documents, or semi-structured
data from web sources, sensors, or social media platforms.

2.Data Cleaning and Quality Assurance: Raw data often contains errors, inconsistencies, duplicates,
or missing values. Data curation involves cleaning and validating the data to ensure its accuracy,
consistency, and completeness. This process may include techniques such as data deduplication,
outlier detection, and imputation of missing values.

3.Data Organization and Integration: Curated data needs to be organized in a structured manner to
facilitate easy access and retrieval. This involves creating metadata, taxonomies, or ontologies to
describe the data's content and relationships. Data integration techniques may also be employed to
combine data from different sources while resolving inconsistencies in formats and semantics.

4.Data Storage and Preservation: Curation includes decisions about where and how to store data to
ensure its security, accessibility, and long-term preservation. This may involve choosing appropriate
storage technologies, backup strategies, and data archiving practices to safeguard against data loss,
corruption, or obsolescence.

5.Data Annotation and Enrichment: To enhance the interpretability and usability of data, curation
may involve annotating or enriching data with additional metadata, contextual information, or
domain-specific knowledge. This can help users better understand the data and its relevance to
specific tasks or analyses.
10) Write difference between structured, semi structured and unstructured data .

Ans:

11). Explain query languages and their operations

Ans:

Difficult to find out so skip this question.


12)Explain Authentication and authorization for storage system.

Ans:-

Authentication and authorization are two crucial aspects of securing storage systems in data science.
They ensure that only authorized users or processes can access and manipulate data, thereby
protecting sensitive information from unauthorized access or misuse. Here's an explanation of
authentication and authorization in the context of storage systems:

a. Authentication:

Authentication is the process of verifying the identity of users or entities attempting to access a
storage system. It ensures that only legitimate users with valid credentials are allowed access.
Authentication mechanisms typically involve the following:

1.User Credentials: Users provide credentials such as usernames, passwords, or cryptographic keys to
prove their identity.

2.Multi-Factor Authentication (MFA): Enhances security by requiring users to provide multiple forms
of verification, such as a password and a unique code sent to their mobile device.

3.Biometric Authentication: Utilizes biometric characteristics like fingerprints, facial recognition, or iris scans to authenticate users.

4.Token-based Authentication: Users receive temporary access tokens after successful authentication, which they present for subsequent interactions with the storage system.

5.Single Sign-On (SSO): Allows users to authenticate once and access multiple services or storage
systems without the need to re-enter credentials.

6.OAuth/OpenID Connect: Standards for delegated authentication, commonly used for authenticating users with third-party services.

Authentication ensures that only legitimate users gain access to the storage system, thereby
preventing unauthorized access and protecting sensitive data from malicious actors.

b.Authorization:

Authorization determines what actions users are allowed to perform within the storage system once
their identity has been authenticated. It specifies the permissions and privileges granted to users
based on their roles, responsibilities, or access levels. Authorization mechanisms typically involve the
following:

1.Role-Based Access Control (RBAC): Assigns permissions to users based on predefined roles (e.g.,
admin, manager, user) associated with specific privileges.
2.Attribute-Based Access Control (ABAC): Grants access based on attributes of users, resources, and
environmental conditions, allowing for more fine-grained access control.

3.Access Control Lists (ACLs): Lists of permissions attached to specific resources, specifying which
users or groups have permission to perform certain actions.

4.Policy-Based Access Control (PBAC): Uses policies to determine access rights based on predefined
rules or conditions.

5.Granular Permissions: Assigns specific permissions (e.g., read, write, execute) to users or groups at
the level of individual files, directories, or data objects.

Authorization ensures that authenticated users only have access to the resources and functionalities that are necessary for their role or task, reducing the risk of data breaches, unauthorized modifications, or data leaks.
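
To illustrate the RBAC idea in code, here is a simplified, hypothetical Python sketch (not a real storage system's API):

# Role-based access control in miniature: roles map to permission sets.
ROLE_PERMISSIONS = {
    "admin":   {"read", "write", "delete"},
    "analyst": {"read", "write"},
    "viewer":  {"read"},
}

USER_ROLES = {"alice": "admin", "bob": "viewer"}   # assigned after authentication

def is_authorized(user, action):
    """Return True if the user's role grants the requested action."""
    role = USER_ROLES.get(user)
    return action in ROLE_PERMISSIONS.get(role, set())

print(is_authorized("alice", "delete"))  # True
print(is_authorized("bob", "write"))     # False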

13)Write note on GitHub.

Ans:

GitHub is a widely used platform in data science for collaboration, version control, and sharing of
code, projects, and data-related resources. It offers a range of features and functionalities that make
it particularly well-suited for data science workflows. Here's a note on GitHub's significance in data
science:

1.Version Control:

GitHub utilizes Git, a distributed version control system, allowing data scientists to track changes to
their code, scripts, notebooks, and other project files.

Version control enables collaboration among team members by providing a history of changes,
facilitating code review, and allowing for easy integration of contributions from multiple developers.

2.Collaboration:

GitHub provides a platform for data scientists to collaborate on projects in real-time, irrespective of
geographical locations.

Users can fork repositories to create their own copies of projects, make changes, and propose
modifications through pull requests.

Features like issues, discussions, and project boards facilitate communication and coordination
among team members.
3.Reproducibility:

GitHub fosters reproducibility in data science by preserving the entire history of a project, including
code, data, and documentation.

Researchers can share their analyses, workflows, and experimental results on GitHub, allowing
others to reproduce and validate their findings.

4.Open Source Contributions:

GitHub serves as a hub for open source data science projects, libraries, and tools.

Data scientists can contribute to existing projects, share their own projects, and collaborate with the
broader community to advance the field of data science.

5.Project Management:

GitHub provides project management features such as milestones, labels, and issue tracking to
organize and prioritize tasks within a project.

Data science teams can use GitHub's project management tools to plan, track progress, and manage
workflows effectively.

6.Integration with Data Science Tools:

GitHub seamlessly integrates with popular data science tools and platforms such as Jupyter
Notebooks, RStudio, and Kaggle.

Data scientists can version control their notebooks, scripts, datasets, and models directly from their
preferred tools and synchronize changes with GitHub repositories.

7.Community and Learning:

GitHub hosts a vibrant community of data scientists, researchers, developers, and enthusiasts.

Users can discover new projects, explore code repositories, participate in discussions, and learn from
others' work.
14)Write note on NoSQL.

Ans:

NoSQL (Not Only SQL) databases have gained significant popularity in data science
due to their ability to efficiently handle large volumes of diverse and unstructured
data. Unlike traditional SQL databases, NoSQL databases are designed to be highly
scalable, flexible, and capable of handling varying data formats and structures. Here's
a note on the significance of NoSQL in data science:

1. Handling Unstructured and Semi-Structured Data:


 NoSQL databases excel at storing and processing unstructured and
semi-structured data, such as text, documents, JSON, XML, key-value
pairs, graphs, and time-series data.
 This flexibility allows data scientists to work with diverse data sources
and formats without the need for complex data modeling or schema
modifications.
2. Scalability and Performance:
 NoSQL databases are designed for horizontal scalability, enabling them
to efficiently distribute data across multiple nodes or servers.
 This scalability allows NoSQL databases to handle massive data
volumes and high throughput, making them suitable for big data
processing and real-time analytics.
3. Schema Flexibility:
 NoSQL databases offer schema flexibility, allowing developers to store
and manipulate data without predefined schemas or strict data
structures.
 This flexibility enables agile development and experimentation in data
science projects, as schemas can evolve over time to accommodate
changing requirements and data models.
4. High Availability and Fault Tolerance:
 NoSQL databases are built with distributed architectures that provide
high availability and fault tolerance.
 Data replication, sharding, and automatic failover mechanisms ensure
that data remains accessible and resilient to hardware failures or
network outages.
5. Support for Complex Queries:
 Many NoSQL databases offer powerful querying capabilities, including
support for complex queries, aggregations, and data transformations.
 Query languages like MongoDB's Query Language (MQL) or Cassandra
Query Language (CQL) enable data scientists to perform advanced
analytics and data exploration tasks.
6. Specialized Use Cases:
 NoSQL databases are well-suited for specific use cases in data science,
such as real-time analytics, content management, IoT data processing,
social media analytics, and recommendation systems.
 Each NoSQL database type (document-oriented, key-value, column-
family, graph) offers unique features and optimizations tailored to
different data processing requirements.
7. Integration with Data Science Ecosystem:
 NoSQL databases seamlessly integrate with popular data science tools
and platforms, such as Apache Spark, Apache Hadoop, Python libraries
(e.g., PyMongo, Cassandra driver), and cloud-based data services.
 Data scientists can leverage NoSQL databases for storing, processing,
and analyzing large-scale datasets within their preferred data science
workflows.

15)Write note on MongoDB.

Ans:

MongoDB is a popular NoSQL database widely used in data science for its flexibility,
scalability, and ease of use. It's particularly well-suited for storing and processing
unstructured or semi-structured data commonly encountered in data science
applications. Here's a note on the significance of MongoDB in data science:

1. Document-Oriented Storage:
 MongoDB stores data in flexible, JSON-like documents, allowing data
scientists to work with complex, nested data structures without the
need for predefined schemas.
 This document-oriented approach simplifies data modeling and
supports agile development and experimentation in data science
projects.
2. Schema Flexibility:
 MongoDB offers dynamic schemas, enabling data scientists to store
and manipulate data without strict schema definitions.
 Fields within documents can vary from one document to another,
providing schema flexibility and accommodating evolving data
requirements.
3. Querying and Aggregation:
 MongoDB provides a powerful query language (MongoDB Query
Language or MQL) that supports a wide range of operations, including
filtering, sorting, projection, and aggregation.
 Data scientists can perform complex queries and aggregations to
extract insights from large datasets efficiently.
4. Scalability and Performance:
 MongoDB is designed for horizontal scalability, allowing data to be
distributed across multiple nodes or servers.
 Its distributed architecture supports high throughput and low-latency
access, making it suitable for handling large-scale data processing and
real-time analytics.
5. Indexing and Full-Text Search:
 MongoDB supports various indexing techniques to optimize query
performance, including single-field indexes, compound indexes,
geospatial indexes, and text indexes.
 Full-text search capabilities enable data scientists to search and analyze
text data efficiently, making MongoDB suitable for text mining and
natural language processing (NLP) tasks.
6. Geospatial Data Processing:
 MongoDB includes robust support for geospatial data types and
queries, allowing data scientists to store, index, and query spatial data
such as coordinates, polygons, and multi-geometries.
 Geospatial queries enable location-based analytics and geospatial
visualization, essential for applications like GIS (Geographic Information
Systems) and location-based services.
7. Integration with Data Science Tools:
 MongoDB integrates seamlessly with popular data science tools and
libraries, such as Python (via PyMongo), R (via RMongo), and Apache
Spark (via Spark Connector).
 Data scientists can leverage MongoDB as a backend storage solution
for storing and processing large-scale datasets within their data science
workflows.
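
A minimal PyMongo sketch is shown below; it assumes a MongoDB server running locally, and the database, collection, and fields are hypothetical:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["analytics"]["events"]

# Documents need no predefined schema; fields can vary between documents.
collection.insert_one({"user": "u123", "action": "click", "value": 3})

# Query with MQL-style filters.
for doc in collection.find({"action": "click"}):
    print(doc)

# Aggregate: total value per action type.
pipeline = [{"$group": {"_id": "$action", "total": {"$sum": "$value"}}}]
print(list(collection.aggregate(pipeline)))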

16)Explain basic architecture of AWS.

Ans:

Amazon Web Services (AWS) provides a comprehensive suite of cloud computing services that are widely used in data science for storage, computation, analytics, machine learning, and more. The basic architecture of AWS in data science encompasses various services and components that work together to support data storage, processing, and analysis. Here's an overview of the basic architecture of AWS in data science:

1. Compute Services:
 Amazon EC2 (Elastic Compute Cloud): Virtual servers in the cloud
that provide resizable compute capacity. Data scientists can use EC2
instances to run data processing tasks, execute machine learning
algorithms, and host applications.
 AWS Lambda: Serverless compute service that runs code in response
to events or triggers without provisioning or managing servers. Lambda
functions can be used for real-time data processing, event-driven
workflows, and automation tasks.
2. Storage Services:
 Amazon S3 (Simple Storage Service): Object storage service that
provides scalable storage for data lakes, data warehouses, and analytics
workloads. S3 is commonly used for storing large volumes of
structured, semi-structured, and unstructured data in various formats.
 Amazon EBS (Elastic Block Store): Block storage service for EC2
instances, providing persistent storage volumes that can be attached to
EC2 instances. EBS volumes are suitable for storing data used by
applications and databases.
3. Database Services:
 Amazon RDS (Relational Database Service): Managed relational
database service that simplifies database administration tasks such as
setup, patching, backup, and scaling. RDS supports popular database
engines like MySQL, PostgreSQL, Oracle, and SQL Server.
 Amazon DynamoDB: Fully managed NoSQL database service that
offers seamless scalability, high performance, and low latency.
DynamoDB is suitable for storing and querying semi-structured data
with flexible schemas.
4. Analytics Services:
 Amazon Redshift: Fully managed data warehouse service that
provides fast query performance and petabyte-scale data storage.
Redshift is optimized for analytics workloads and supports querying
large datasets using SQL.
 Amazon Athena: Interactive query service that allows users to analyze
data stored in S3 using standard SQL queries. Athena is serverless and
requires no infrastructure setup, making it easy to analyze data directly
from S3.
5. Machine Learning Services:
 Amazon SageMaker: Fully managed platform for building, training,
and deploying machine learning models at scale. SageMaker provides
pre-built machine learning algorithms, automated model tuning, and
managed infrastructure for training and inference.
 Amazon Comprehend: Natural language processing (NLP) service that
uses machine learning to analyze text data and extract insights such as
sentiment analysis, entity recognition, and topic modeling.
6. Big Data Services:
 Amazon EMR (Elastic MapReduce): Managed big data platform that
simplifies the deployment and management of Apache Hadoop, Spark,
HBase, and other big data frameworks. EMR is used for processing and
analyzing large-scale datasets using distributed computing.
7. Integration Services:
 AWS Glue: Fully managed extract, transform, and load (ETL) service
that makes it easy to prepare and transform data for analytics and
machine learning. Glue automatically discovers, catalogs, and cleanses
data from various sources.
8. Networking and Security:
 Amazon VPC (Virtual Private Cloud): Virtual network infrastructure
that allows users to provision a logically isolated section of the AWS
cloud. VPC enables data scientists to define network settings, configure
security groups, and control access to resources.
 AWS Identity and Access Management (IAM): Security service that
provides centralized control over user access to AWS resources. IAM
allows data scientists to manage user permissions, create security
policies, and enforce multi-factor authentication.
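
As a small illustration of how these services are used from code, the following boto3 sketch uploads a dataset to S3 and lists objects; it assumes AWS credentials are already configured, and the bucket and file names are hypothetical:

import boto3

s3 = boto3.client("s3")

# Upload a local dataset to S3, a common first step for analytics on AWS.
s3.upload_file("local_data.csv", "my-example-bucket", "raw/local_data.csv")

# List what is stored under a prefix.
response = s3.list_objects_v2(Bucket="my-example-bucket", Prefix="raw/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])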

17. Write note on web scraping.

Ans:

Web scraping is a pivotal technique within data science, serving as a gateway to a vast
reservoir of online data. Through automated extraction methods, it enables data scientists to
efficiently retrieve information from diverse sources such as websites, forums, and social
media platforms. This process is instrumental in data acquisition, providing access to
valuable datasets essential for analysis and research endeavors.

The real-time nature of web scraping empowers data scientists to capture dynamic trends
and fluctuations as they occur. By collecting data continuously, organizations can stay abreast
of evolving market conditions, consumer sentiments, and emerging patterns.

This timeliness enhances decision-making processes and strategic planning efforts, fostering
agility and adaptability in today's fast-paced business landscape.

Ethical considerations play a crucial role in the practice of web scraping, necessitating
adherence to established guidelines and regulations. It's imperative for practitioners to
respect the terms of service of websites being scraped and to obtain data ethically and
responsibly.

By upholding ethical standards, data scientists can maintain trust and integrity within the
data community while avoiding potential legal ramifications.

Technically, web scraping involves parsing HTML code to extract relevant data elements from
web pages. Utilizing specialized libraries and frameworks like BeautifulSoup and Scrapy in
programming languages such as Python streamlines the scraping process.

These tools offer functionalities to navigate through web structures, locate desired
information, and extract data efficiently, enabling seamless integration into data analysis
pipelines.
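
A minimal scraping sketch with requests and BeautifulSoup is shown below; the URL is a placeholder, and a real scraper should respect the target site's terms of service and robots.txt:

import requests
from bs4 import BeautifulSoup

url = "https://example.com/articles"   # placeholder page
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Extract the text and link target of every anchor tag on the page.
for link in soup.find_all("a"):
    print(link.get_text(strip=True), link.get("href"))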

In conclusion, web scraping serves as a cornerstone of data science, facilitating the extraction
of valuable insights from the vast expanse of online resources. Through automation, real-
time data collection, adherence to ethical guidelines, and utilization of technical tools, web
scraping empowers organizations to harness the power of data for informed decision-making
and strategic initiatives.

18)Explain in brief Version Control.

Ans:

Version control in data science is a fundamental practice that involves tracking changes made
to code, data, and other project assets. It operates through specialized software systems like
Git, which provide a structured approach to managing versions of files. This enables data
scientists to maintain a chronological record of modifications, facilitating collaboration,
reproducibility, and project management.

Central to version control is the concept of repositories, which serve as centralized locations
for storing project files and tracking changes. Within a repository, data scientists can create
branches to work on different aspects of a project simultaneously. This branching mechanism
allows for experimentation and feature development while preserving the integrity of the
main project.

Commits are the individual snapshots of changes made to files within a repository. Each
commit includes a unique identifier, a timestamp, and a description of the changes. This
granular level of tracking enables data scientists to understand the evolution of their work,
revert to previous states if necessary, and review the history of contributions.

Merging is a key operation in version control that involves integrating changes from one
branch into another. This process ensures that updates made by different team members can
be combined seamlessly. Merging allows data scientists to consolidate their work, resolve
conflicts, and maintain a cohesive and up-to-date version of the project.

Version control systems also provide mechanisms for collaboration and coordination among
team members. Through features like pull requests and code reviews, data scientists can
propose changes, solicit feedback, and ensure the quality of contributions. This collaborative
workflow promotes transparency, accountability, and knowledge sharing within data science
teams.

19)Explain software development tools.

Ans:

Software development tools play a crucial role in data science by providing frameworks,
libraries, and platforms that streamline various aspects of the data analysis and model
development process. These tools encompass a wide range of functionalities, from data
manipulation and visualization to machine learning model deployment and monitoring.
Some key software development tools in data science include:

a.Integrated Development Environments (IDEs): IDEs such as Jupyter Notebook, PyCharm, and RStudio provide comprehensive environments for writing, executing, and debugging code. They offer features like syntax highlighting, code completion, and interactive visualization, enhancing productivity and facilitating exploratory data analysis.

b.Version Control Systems (VCS): Version control systems like Git are essential for managing
changes to code, data, and project assets. They enable data scientists to track modifications,
collaborate effectively, and maintain a historical record of project iterations, promoting
transparency and reproducibility.

c.Data Manipulation Libraries: Libraries like Pandas (for Python) and dplyr (for R) are widely
used for data manipulation and analysis. They provide powerful tools for cleaning,
transforming, and aggregating data, enabling data scientists to preprocess datasets efficiently
and extract valuable insights.

d.Visualization Libraries: Visualization libraries such as Matplotlib, Seaborn, and ggplot2 facilitate the creation of informative and visually appealing plots and charts. These tools help data scientists explore data patterns, communicate findings effectively, and gain insights into complex datasets.

e.Machine Learning Frameworks: Machine learning frameworks like TensorFlow, PyTorch, and scikit-learn offer a rich set of algorithms and tools for building and training machine learning models. They provide APIs for tasks such as classification, regression, clustering, and neural network design, empowering data scientists to develop predictive models and analyze data at scale.

f.Model Deployment Platforms: Model deployment platforms like TensorFlow Serving, Flask,
and Streamlit facilitate the deployment and integration of machine learning models into
production systems. They provide APIs, hosting services, and deployment pipelines for
deploying models as web services or embedding them into applications, enabling real-time
inference and decision-making.

g.Cloud Computing Services: Cloud computing platforms like Amazon Web Services (AWS),
Google Cloud Platform (GCP), and Microsoft Azure offer scalable infrastructure and services
for data storage, processing, and analysis. They provide tools for distributed computing, big
data processing, and machine learning, enabling data scientists to leverage cloud resources
for large-scale data projects.

20)Explain cloud computing.

Ans:

Cloud computing in data science transforms the traditional approach to data management by
providing remote access to computing resources via the internet. It offers scalability, allowing
data scientists to scale computing power and storage resources dynamically based on project
requirements. This flexibility ensures optimal performance and cost-effectiveness for data-
intensive tasks.

Data storage in the cloud eliminates the need for on-premises infrastructure, offering secure
and reliable storage solutions for large volumes of data. Cloud storage services such as
Amazon S3, Google Cloud Storage, and Azure Blob Storage provide scalable storage options,
enabling data scientists to store and access data from anywhere with an internet connection.
This accessibility facilitates seamless collaboration and data sharing among team members.

Data processing in the cloud is facilitated through services like AWS Glue, Google Cloud
Dataflow, and Azure Data Factory, which streamline the orchestration of data pipelines and
ETL processes. These platforms enable data scientists to process, transform, and analyze data
at scale, leveraging distributed computing resources for efficient data processing. This
accelerates time-to-insight and enables data-driven decision-making.

Cloud-based analytics services like AWS Athena, Google BigQuery, and Azure Synapse
Analytics empower data scientists to derive insights from large datasets through real-time
querying and analysis. These services offer powerful capabilities for ad-hoc analysis, data
exploration, and visualization, enabling data scientists to uncover patterns, trends, and
anomalies within their data. This facilitates informed decision-making and drives business
innovation.

Machine learning is revolutionized by cloud-based platforms such as AWS SageMaker, Google AI Platform, and Azure Machine Learning, which provide tools and services for model development, training, and deployment. These platforms enable data scientists to build and deploy machine learning models at scale, leveraging cloud resources for efficient model training and inference. This accelerates the development and deployment of AI-driven applications and solutions.
Unit-III

21. Explain Linear Regression.


A statistical method known as linear regression makes predictions about the outcome of a response variable by combining a variety of influencing variables. It attempts to model a linear relationship between the target (dependent variable) and the features (independent variables). We can determine the ideal model parameter values using the cost function.
Example: An analyst would be interested in seeing how market movement influences the price of ExxonMobil (XOM). The value of the S&P 500 index will be the independent variable, or predictor, in this example, while the price of XOM will be the dependent variable. In reality, various elements influence an event's result. Hence, we usually have many independent features.
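
A minimal scikit-learn sketch of fitting a simple linear regression is shown below; the numbers are synthetic and purely illustrative of the index-level/price example above:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical predictor (index level) and response (stock price).
X = np.array([[4000], [4100], [4200], [4300], [4400]])
y = np.array([100.0, 103.5, 106.0, 110.2, 113.8])

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # slope and intercept of the fitted line
print(model.predict([[4500]]))         # prediction for a new index level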

22. Explain regularization.


The term "regularization" describes methods for calibrating machine learning models to
reduce the adjusted loss function and avoid overfitting or underfitting. We can properly fit our
machine learning model on a particular test set using regularization, which lowers the
mistakes in the test set.
Regularization techniques
There are two main types of regularization techniques:
Ridge Regularization and Lasso Regularization.
1] Ridge Regularization
It is also referred to as Ridge Regression and modifies over- or under-fitted models by
applying a penalty equal to the sum of the squares of the coefficient magnitude.
As a result, coefficients are produced and the mathematical function that represents our
machine learning model is minimized. The coefficients' magnitudes are squared and summed.
Ridge Regression applies regularization by shrinking the magnitude of the coefficients. The penalty term is represented by lambda (λ) in the cost function shown below. We can control the penalty by varying the value of λ: the magnitude of the coefficients decreases as the penalty increases, so the parameters are shrunk. As a result, Ridge Regression serves to prevent multicollinearity and, through coefficient shrinkage, to lower the model's complexity. Consider the following comparison with a plain linear regression line:
Cost function = Loss + λ x ∑‖w‖^2

For the linear regression line, let’s consider two points that are on the line:
Loss = 0 (considering the two points on the line)
λ = 1
w = 1.4
Then, Cost function = 0 + 1 x 1.4^2 = 1.96

For Ridge Regression, let’s assume:
Loss = 0.3^2 + 0.2^2 = 0.13
λ = 1
w = 0.7
Then, Cost function = 0.13 + 1 x 0.7^2 = 0.62

Comparing the two models, with all data points, we can see that the Ridge regression line fits the model more accurately than the linear regression line.

2]Lasso Regularization
By imposing a penalty equal to the total of the absolute values of the
coefficients, it alters the models that are either overfitted or underfitted.
Lasso regression likewise attempts coefficient minimization, but it uses
the actual coefficient values rather than squaring the magnitudes of the
coefficients. As a result of the occurrence of negative coefficients, the
coefficient sum can also be 0. Think about the Lasso regression cost
function:

We can control the coefficient values by controlling the penalty terms, just
like we did in Ridge Regression. Again, consider a Linear Regression
model:
Cost function = Loss + λ x ∑‖w‖
For Linear Regression line, let’s assume,
Loss = 0 (considering the two points on the line)
λ=1
w = 1.4
Then, Cost function = 0 + 1 x 1.4
= 1.4
For Lasso Regression, let’s assume:
Loss = 0.3^2 + 0.1^2 = 0.1
λ = 1
w = 0.7
Then, Cost function = 0.1 + 1 x 0.7 = 0.8

Comparing the two models, with all data points, we can see that the Lasso regression line fits the model more accurately than the linear regression line.
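
A minimal scikit-learn sketch comparing the two penalties is shown below; alpha plays the role of the penalty term λ, and the data is synthetic:

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([1.4, 0.0, 0.7, 0.0, 0.0])
y = X @ true_w + rng.normal(scale=0.1, size=50)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("Ridge coefficients:", ridge.coef_)  # shrunk toward zero, rarely exactly zero
print("Lasso coefficients:", lasso.coef_)  # some coefficients driven exactly to zero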

23. Explain bias and variance trade off .


The bias is the discrepancy between our actual values and the predictions.
In order for our model to be able to forecast new data, it must make some
basic assumptions about our data.
We need to strike the ideal balance between bias and variance for every
model. This only makes sure that we record the key patterns in our model
and ignore the noise it generates. The term for this is bias-variance
tradeoff. It aids in optimizing and maintaining the lowest feasible level of
inaccuracy in our model.
A model that has been optimized will be sensitive to the patterns in our
data while also being able to generalize to new data. This should have a
modest bias and variance to avoid overfitting and underfitting.
When bias is large, the error in both the training set and the test set is also high. When the variance is high, the model performs well on the training set and its error is low, but the error on the test set is significant. There is a zone in the middle where the bias and variance are balanced and the error on both the training and test sets is minimal.
A bull's-eye diagram clarifies the bias and variance tradeoff. When the predictions are concentrated in the center, at the target, the fit is optimal; the error in our model grows as we move farther from the center. The ideal model has low bias and low variance.

24. Write note on AIC


Akaike Information Criterion (AIC)
The AIC of a model can be calculated as:
AIC = -2/n * LL + 2 * k/n
where:
 n: Number of observations in the training dataset.
 LL: Log-likelihood of the model on the training dataset.
 k: Number of parameters in the model.
The AIC of each model may be determined using this procedure, and the
model with the lowest AIC value will be chosen as the best model.
When compared to the next method, BIC, this strategy tends to prefer
more intricate models.

25. Write note on BIC.


Bayesian Information Criterion (BIC)
The BIC of a model can be calculated as:
BIC = -2 * LL + log(n) * k
where:
 n: Number of observations in the training dataset.
 log: The natural logarithm (with base e)
 LL: Log-likelihood of the model on the training dataset.
 k: Number of parameters in the model.
Using this method, you can calculate the BIC of each model and then select the model with the lowest BIC value as the best model.
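
As a sketch of how these criteria are used in practice, the statsmodels library reports AIC and BIC for a fitted model; note that statsmodels uses the conventional forms (AIC = 2k - 2LL, BIC = k*log(n) - 2LL), which rank models the same way as the per-observation formulas above when every candidate is fit on the same dataset. The data below is synthetic:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + rng.normal(size=100)

# Compare a one-predictor model with a three-predictor model.
for cols in ([0], [0, 1, 2]):
    model = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
    print("predictors", cols, "AIC =", round(model.aic, 2), "BIC =", round(model.bic, 2))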
26. Explain cross validation.
By training the model on a subset of the input data and testing it on a
subset of the input data that hasn't been used before, you may validate the
model's effectiveness. It is also a method for determining how well a
statistical model generalizes to a different dataset.
Testing the model's stability is a necessary step in machine learning (ML). This indicates that we cannot fit our model to the training dataset alone. We set aside a specific sample of the dataset, one that wasn't included in the training dataset, for this use. After that, before deployment, we test our model on that sample, and the entire procedure is referred to as cross-validation. It differs from the typical train-test split in this way.

Hence, the fundamental cross-validation stages are:


 As a validation set, set aside a portion of the dataset.
 Use the training dataset to provide the model with training.
 Use the validation set to assess the model's performance right now. Do
the next step if the model works well on the validation set; otherwise,
look for problems.
Methods used for Cross-Validation
1] Validation Set Approach
2] Cross-validation using Leave-P-out
3] Leave one out cross-validation
4] K-Fold Cross-Validation
5] Cross-validation with a stratified k-fold
6] Holdout Method
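
A minimal sketch of k-fold cross-validation (method 4 above) with scikit-learn follows; the dataset is synthetic and the model choice is only illustrative:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation
print(scores)          # accuracy on each held-out fold
print(scores.mean())   # average performance across the folds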
27. Explain the concept of data transformation.
It's challenging to track or comprehend raw data. Because of this, it needs
to be preprocessed before any information can be extracted from it. The
process of transforming raw data into a format that makes it easier to
conduct data mining and recover strategic information is known as data
transformation. In order to change the data into the right form, data
transformation techniques also include data cleansing and data reduction.
To produce patterns that are simpler to grasp, data transformation is a
crucial data preprocessing technique that must be applied to the data
before data mining.
Data transformation transforms the data into clean, usable data by altering its format, structure, or values. Data can be transformed at two stages of the data pipeline for analytics projects. Data transformation is the middle phase of an ETL (extract, transform, and load) process, which is commonly used by businesses with on-premises data warehouses. The majority of businesses now use cloud-based data warehouses that scale compute and storage resources with latency measured in seconds or minutes. Thanks to the scalability of the cloud platform, organizations can load raw data directly into the data warehouse and perform transformations at query time.
Data transformation may be used in data warehousing, data wrangling,
data integration, and migration. Data transformation makes business and
analytical processes more effective and improves the quality of data-
driven decisions made by organizations. The structure of the data will be
determined by an analyst throughout the data transformation process.
Hence, data transformation might be:
o Constructive: The data transformation process adds, copies, or
replicates data.
o Destructive: The system deletes fields or records.
o Aesthetic: The transformation standardizes the data to meet
requirements or parameters.
o Structural: The database is reorganized by renaming, moving, or
combining columns
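
A small illustrative sketch of common transformations in Python follows; the column names and values are hypothetical, and the specific transformations (log, scaling, one-hot encoding) are only examples:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "income": [32000, 45000, 120000, 58000],
    "city":   ["Pune", "Mumbai", "Pune", "Delhi"],
})

df["log_income"] = np.log(df["income"])                                       # reduce skew
df["income_scaled"] = StandardScaler().fit_transform(df[["income"]]).ravel()  # standardize
df = pd.get_dummies(df, columns=["city"])                                     # structural change: one column per category

print(df)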

28. Explain the concept of Dimensionality Reduction.


Dimensionality refers to how many input features, variables, or columns
are present in a given dataset, while dimensionality reduction refers to the
process of reducing these features.
In many circumstances, a dataset has a significant number of input
features, which complicates the process of predictive modelling. For
training datasets with a large number of features, it is extremely
challenging to visualize or anticipate the results; hence, dimensionality
reduction techniques must be used.
Dimensionality reduction can be described as a way of converting a higher-dimensional dataset into a lower-dimensional dataset while ensuring that it conveys similar information. These methods are frequently used in machine learning to solve classification and regression issues while producing a more accurate predictive model.
It is frequently utilized in disciplines like speech recognition, signal
processing, bioinformatics, etc. that deal with high-dimensional data.
Moreover, it can be applied to cluster analysis, noise reduction, and data
visualization.
Benefits of applying dimensionality reduction
Following are some advantages of using the dimensionality reduction technique on the
provided dataset:
 The space needed to store the dataset is decreased by lowering the dimensionality of the
features.
 Reduced feature dimensions call for shorter computation training times.
 The dataset's features with reduced dimensions make the data easier to visualize rapidly.
 By taking care of the multicollinearity, it removes the redundant features (if any are
present).

Disadvantages of dimensionality reduction


The following list of drawbacks of using the dimensionality reduction also includes:
 The reduction in dimensionality may result in some data loss.
 Sometimes the principal components that need to be considered in the PCA dimensionality reduction technique are unknown.
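
A minimal PCA sketch with scikit-learn is shown below; the digits dataset is used only as a convenient example of high-dimensional data:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)   # 64 features per sample
pca = PCA(n_components=2)             # keep two principal components
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)   # (1797, 64) -> (1797, 2)
print(pca.explained_variance_ratio_)    # variance retained by each component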

29. Explain regression trees.

30. Write note on time series analysis.


A method of examining a collection of data points gathered over time is a
time series analysis. Additionally, it is specifically utilized for non-
stationary data, or data that is constantly changing over time. The time
series data varies from all other data due to this component as well. Time
series analysis is also used to predict future data based on the past. As a
result, we can conclude that it goes beyond simply gathering data.
Predictive analytics includes the subfield of time series analysis. It
supports in forecasting by projecting anticipated variations in data, such as
seasonality or cyclical activity, which provides a greater understanding of
the variables.
Types of time series analysis
Time series are used to collect a variety of data kinds; thus, analysts have
created some intricate models to help with understanding. Analysts, on the
other hand, are unable to take into account all variations or generalize a
specific model to all samples. These are the typical time series analysis
methods:
o Classification: This model is used for the identification of data. It also allocates categories
to the data.
o Descriptive Analysis: As time series data has various components, this descriptive analysis
helps to identify the varied patterns of time series including trend, seasonal, or cyclic
fluctuations.
o Curve Fitting: Under this type of time series analysis, we generally plot data along some
curve in order to investigate the correlations between variables in the data.
o Explanative Analysis: This model basically explains the correlations between the data and
the variables within it, and also explains the causes and effects of the data on the time series.
o Exploratory Analysis: The main function of this model is to highlight the key features of
time series data, generally in a graphic style.
o Forecasting: As the name implies, this form of analysis is used to forecast future data.
Interestingly, this model uses the past data (trend) to forecast the forthcoming data, thus,
projecting what could happen at future plot points.
o Intervention Analysis: This analysis model of time series denotes or investigates how a
single incident may alter data.
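
A small illustrative sketch of descriptive time series operations in pandas follows; the monthly values are synthetic:

import pandas as pd

dates = pd.date_range("2023-01-01", periods=12, freq="MS")   # monthly index
sales = pd.Series([10, 12, 13, 15, 14, 18, 21, 20, 24, 23, 27, 30.0], index=dates)

print(sales.rolling(window=3).mean())   # smoothed trend
print(sales.pct_change())               # month-over-month change
print(sales.resample("QS").sum())       # aggregate to quarterly totals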

31. Explain forecasting.

32. Write note on Classification.


A supervised learning method called a decision tree can be used to solve classification and regression problems, but it is typically favored for classification. It is a tree-structured classifier, where internal nodes stand for a dataset's features, branches for the decision-making process, and each leaf node for the classification result. A decision tree has two kinds of nodes: decision nodes and leaf nodes. Decision nodes are used to make decisions and have multiple branches, whereas leaf nodes are the outcomes of those decisions and do not have any further branches. The decisions or tests are performed on the basis of the features of the given dataset. It is a graphical depiction for obtaining all feasible answers to a choice or problem based on predetermined conditions. It is known as a decision tree because, like a tree, it begins with the root node and grows on subsequent branches to form a structure resembling a tree. The CART algorithm, which stands for Classification and Regression Tree algorithm, is used to construct a tree. A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into subtrees.

Decision Tree Terminologies:


• Root Node: The root node is where the decision tree starts. It represents the entire dataset, which further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output nodes, and the tree cannot be segregated further once a leaf node is reached.
• Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the given conditions.
• Branch/Sub Tree: A tree formed by splitting the tree.
• Pruning: Pruning is the process of removing unwanted branches from the tree.
• Parent/Child node: The root node of the tree is called the parent node, and the other nodes are called the child nodes.

Working of the algorithm:
In a decision tree, the algorithm begins at the root node and works its way down the tree to predict the class of a given record. It compares the value of the root attribute with the corresponding attribute of the record (from the real dataset), follows the matching branch and jumps to the next node. At that node it again compares the attribute value and moves on to the next sub-node, and it keeps doing this until it reaches a leaf node. The following steps describe the entire procedure; a short code sketch follows them:
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.

o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).

o Step-3: Divide S into subsets that contain the possible values of the best attribute.

o Step-4: Generate the decision tree node that contains the best attribute.

o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified any further; the final node is then called a leaf node.
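As a concrete illustration of these steps, the following sketch trains a decision tree classifier with scikit-learn (assumed to be installed). The Iris dataset, the Gini criterion used as the attribute selection measure, and max_depth=3 are illustrative choices rather than part of the algorithm itself.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# criterion="gini" is one common attribute selection measure (ASM);
# "entropy" (information gain) is another option.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
```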

33. Write note on K-Nearest Neighbor.


K-Nearest Neighbor (K-NN) is one of the simplest machine learning algorithms and is based on supervised learning. The K-NN algorithm assumes that a new case is similar to the existing cases and places the new instance in the category most similar to the existing categories. It stores all the available data and classifies a new data point on the basis of its similarity to that data, so fresh data can be quickly sorted into a suitable category. Although K-NN is most frequently employed for classification problems, it can also be used for regression. Because K-NN is a non-parametric technique, it makes no assumptions about the underlying data. It is also known as a lazy learner algorithm, since it does not learn from the training set immediately; instead, it stores the dataset during the training phase and performs the actual work at classification time, assigning new data to the category it most closely resembles.
Working of KNN Algorithm
The working of K-NN can be explained with the following steps:
o Step-1: Select the number K of neighbors.
o Step-2: Calculate the Euclidean distance from the new data point to the existing data points.
o Step-3: Take the K nearest neighbors according to the calculated Euclidean distance.
o Step-4: Among these K neighbors, count the number of data points in each category.
o Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
o Step-6: Our model is ready.
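The sketch below mirrors these steps using scikit-learn's KNeighborsClassifier (the library is assumed to be installed); K=5, the Euclidean metric and the Iris dataset are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_neighbors is K; metric="euclidean" matches Step-2 above.
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)          # "training" just stores the data (lazy learner)
print("Predicted class:", knn.predict(X_test[:1]))
print("Test accuracy:", knn.score(X_test, y_test))
```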

Advantages of KNN Algorithm


o It is simple to implement.
o It is robust to noisy training data.
o It can be effective when the training data is large.
Disadvantages of KNN Algorithm
o The value of K always needs to be determined, which can sometimes be complex.
o The computation cost is high, because the distance from the new point to every training sample must be calculated.
34. Write note on PCA.
Principal component analysis (PCA) is an unsupervised learning approach used in machine learning for dimensionality reduction. It is a statistical procedure that uses an orthogonal transformation to convert observations of correlated features into a set of linearly uncorrelated features; these newly transformed features are called the principal components. PCA is one of the most widely used tools for exploratory data analysis and predictive modelling. It identifies the strongest patterns in a dataset by projecting it onto a smaller number of directions while retaining as much of the variance as possible.
Typically, PCA looks for the lower-dimensional surface onto which the high-dimensional data can be projected.
PCA works by examining the variance of each attribute, because a high variance indicates a good separation between classes, and this is what allows the dimensionality to be reduced. Image processing, movie recommendation systems, and optimizing power allocation across multiple communication channels are some examples of PCA's practical uses. Since PCA is a feature-extraction technique, it keeps the important variables and discards the unimportant ones.
The PCA algorithm is founded on mathematical ideas such as:
• Variance and covariance
• Eigenvalues and eigenvectors
Some common terms used in PCA algorithm:
o Dimensionality: It is the number of features or variables present in the
given dataset. More easily, it is the number of columns present in the
dataset.
o Correlation: It signifies how strongly two variables are related to each other, i.e. if one changes, the other variable also changes. The correlation value ranges from -1 to +1: -1 occurs when the variables are inversely proportional to each other, and +1 indicates that they are directly proportional to each other.
o Orthogonal: It means that the variables are not correlated with each other, so the correlation between a pair of variables is zero.
o Eigenvectors: If M is a square matrix and v is a non-zero vector, then v is an eigenvector of M if Mv is a scalar multiple of v.
o Covariance Matrix: A matrix containing the covariance between the
pair of variables is called the Covariance Matrix.
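A minimal sketch of PCA in practice follows, assuming scikit-learn is available; reducing the four Iris features to two principal components is purely illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is variance-based, so scale first

# Keep the 2 directions that explain the most variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print("Reduced shape:", X_reduced.shape)                   # (150, 2)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```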

35. Explain hierarchical clustering.


Hierarchical clustering is a method that groups data into clusters arranged in a tree structure. Every data point is first treated as a separate cluster, and the algorithm then repeatedly carries out the following step: choose the two clusters that are closest (most similar) to one another and merge them. This is repeated until all of the clusters have been combined.
The goal of hierarchical clustering is to create a hierarchy of nested clusters. This hierarchy is depicted graphically by a dendrogram, a tree-like diagram that records the sequence of merges or splits. It is an inverted tree that shows the order in which elements are combined (bottom-up view) or clusters are divided (top-down view).
As a data mining technique, hierarchical clustering builds a hierarchical representation of the clusters in a dataset: each data point starts as an independent cluster, and the algorithm iteratively merges the nearest clusters until a stopping criterion is met. The outcome is a dendrogram, a tree-like structure that shows the hierarchical links between the clusters.
Compared to other clustering techniques, hierarchical clustering has several benefits:
1. The ability to handle non-convex clusters, as well as clusters of various densities and sizes.
2. The ability to deal with noisy and missing data.
3. The ability to display the data's hierarchical structure, which is useful for understanding the relationships between the clusters.

It does, however, have several shortcomings, such as:

1. The need for a threshold to halt the clustering and establish the total number of clusters.
2. High processing costs and memory requirements, particularly for huge datasets.
3. Sensitivity of the results to the initial conditions, the linkage criterion, and the distance metric.

In conclusion, hierarchical clustering is a data mining technique that groups related data points into clusters by giving the clusters a hierarchical structure. It can handle various data formats and show the connections between the clusters, but its results can be sensitive to the chosen settings and its computational cost can be large.
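A minimal sketch of agglomerative (bottom-up) hierarchical clustering follows, assuming scikit-learn and SciPy are available; the six toy points and Ward linkage are illustrative choices.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.cluster import AgglomerativeClustering

# Six toy 2-D points forming two obvious groups.
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Bottom-up clustering that stops when 2 clusters remain.
labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)
print("Cluster labels:", labels)

# The linkage matrix records every merge; scipy's dendrogram() can plot it as a tree.
Z = linkage(X, method="ward")
print(Z)
```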

36. Explain Ensemble learning.


Ensemble learning is a machine learning technique that combines multiple base models to produce a single, optimal predictive model. By combining several models rather than relying on just one, ensemble approaches seek to increase the accuracy of the results; the combined models considerably improve accuracy, which is why ensemble approaches have gained prominence in machine learning.

Main types of ensemble methods


1] Bagging
Bootstrap aggregating, also known as bagging, is commonly used in classification and regression. Using decision trees, it improves the models' accuracy by greatly reducing variance; many prediction models struggle with overfitting, which is reduced when variance is lowered and accuracy improves.
Bagging consists of two steps: bootstrapping and aggregation. Bootstrapping is a sampling strategy in which samples are drawn from the entire population (set) with replacement; sampling with replacement helps randomize the selection process. The base learning algorithm is then applied to each of the samples.
Aggregation in bagging is used to combine all possible outcomes of the prediction and randomize the result. Predictions made without aggregation would not be accurate, because not all possible outcomes would be taken into account; the aggregation is therefore based either on all of the results from the predictive models or on the bootstrapped probabilities. Bagging is useful because it creates a single strong learner that is more stable than the individual weak base learners, and it reduces variance, which lessens overfitting in the models. One drawback of bagging is its computational cost, and applying the bagging procedure incorrectly can result in higher bias in the models.
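A minimal sketch of bagging follows, assuming scikit-learn is available; the Iris dataset, 50 estimators and 5-fold cross-validation are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# bootstrap=True draws each tree's training sample with replacement;
# the default base learner is a decision tree, and the final prediction
# aggregates (votes over) the individual trees' predictions.
bagging = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
scores = cross_val_score(bagging, X, y, cv=5)
print("Mean cross-validated accuracy:", scores.mean())
```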

2] Boosting
Boosting is an ensemble strategy that improves future predictions by learning from previous predictors' errors. The method greatly increases model predictability by combining numerous weak base learners into one strong learner. Boosting works by placing weak learners in a sequential order, so that each learner can learn from the mistakes of the previous one and improve the predictive model.
There are many different types of boosting, such as gradient boosting, Adaptive Boosting (AdaBoost), and XGBoost (Extreme Gradient Boosting). AdaBoost employs weak learners in the form of decision trees, most of which contain a single split known as a decision stump; the first decision stump in AdaBoost treats all observations with equal weights.
Gradient boosting adds predictors to the ensemble sequentially, with each new predictor correcting the errors of its predecessors, which improves the model's accuracy: new predictors are fitted to offset the consequences of the errors made by the earlier models, and gradient descent allows the gradient booster to identify and address problems in the learners' predictions. XGBoost uses gradient-boosted decision trees and offers faster performance, with a strong focus on computational speed and model efficiency. Gradient boosted machines can be slow to train, since the models must be trained sequentially.
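A minimal sketch comparing AdaBoost and gradient boosting follows, assuming scikit-learn is available; the breast-cancer dataset and the hyperparameters shown are illustrative choices (XGBoost is a separate library and is not shown here).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# AdaBoost: re-weights misclassified samples; its default base learner is a
# one-split decision tree (a decision stump).
ada = AdaBoostClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)

# Gradient boosting: each new tree is fitted to the previous ensemble's errors.
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 random_state=1).fit(X_train, y_train)

print("AdaBoost accuracy:", ada.score(X_test, y_test))
print("Gradient boosting accuracy:", gbm.score(X_test, y_test))
```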

3] Stacking
Another ensemble method, stacking, is also known as stacked generalization. It works by training a combiner (meta-learner) algorithm to combine the predictions of several other learning algorithms. Stacking has been used effectively for regression, density estimation, distance learning, and classification, and it can also be used to estimate the error rate involved in bagging.
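A minimal sketch of stacking follows, assuming scikit-learn is available; the choice of base learners and of logistic regression as the meta-learner is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base_learners = [
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
]
# The final_estimator (meta-learner) is trained on the base learners'
# cross-validated predictions.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000))
stack.fit(X_train, y_train)
print("Test accuracy:", stack.score(X_test, y_test))
```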