Data Science Question Bank With Answer
Ans –
1.Data: This is the raw, unprocessed facts and figures that are collected from various sources. Data can be structured or unstructured and may include text, numbers, images, audio, and video.
2.Information: Information is data that has been processed, organized, or given context so that it becomes meaningful and useful, for example by answering questions of who, what, where, and when.
3.Knowledge: Knowledge is information that has been analyzed and understood, combined with experience and insight, so that it can answer questions of how and why and be applied to a purpose.
4.Wisdom: Wisdom is the highest level of the DIKW pyramid, representing the ability to apply knowledge and experience to make sound judgments and decisions. Wisdom requires reflection, insight, and foresight, and is often based on a deep understanding of the broader context and implications of decisions.
Ans-
High-level languages are the backbone of data science, offering a versatile and intuitive platform for
professionals to analyze, manipulate, and interpret data effectively. Python, R, and Julia stand out as
the primary choices, providing rich libraries and frameworks tailored for data-centric tasks. These
languages abstract away low-level complexities, allowing data scientists to focus on problem-solving
rather than implementation details. Their expressive syntax, extensive community support, and
seamless integration with other tools make them indispensable for tackling the diverse challenges of
data analysis and machine learning.
In data science workflows, high-level languages enable rapid prototyping, experimentation, and
iteration, facilitating a dynamic and agile approach to problem-solving. Python, for instance, is
renowned for its simplicity and readability, making it accessible to both beginners and seasoned
professionals alike. Its ecosystem encompasses powerful libraries such as NumPy, Pandas, and scikit-
learn, which streamline tasks ranging from data wrangling to model deployment. Similarly, R boasts
comprehensive packages like tidyverse and caret, tailored specifically for statistical analysis and
machine learning, while Julia's high-performance computing capabilities make it ideal for numerical
computing and optimization tasks.
Moreover, the collaborative nature of high-level languages fosters knowledge-sharing and innovation
within the data science community. Online forums, tutorials, and open-source contributions
contribute to a vibrant ecosystem where practitioners can exchange ideas, troubleshoot challenges,
and leverage best practices. Whether developing predictive models, exploring data visualizations, or
deploying scalable solutions, the versatility and support provided by high-level languages empower
data scientists to extract actionable insights and drive meaningful impact in diverse domains, from
healthcare and finance to marketing and beyond.
3. Explain IDE.
Ans:
An IDE (Integrated Development Environment) is software that combines commonly used developer tools into a compact GUI (graphical user interface) application. It is a combination of tools like a code editor, code compiler, and code debugger with an integrated terminal. By integrating features like software editing, building, testing, and packaging in a simple-to-use tool, IDEs help boost developer productivity. IDEs are commonly used by programmers and software developers to make their programming journey smoother. Common components of an IDE include:
1. Editor: Typically a text editor that helps you write software code by highlighting syntax with visual cues, providing language-specific auto-completion, and checking for bugs as you type.
2. Compiler: A compiler translates human-readable code into machine-specific code that can be executed on different operating systems like Linux, Windows, or macOS. Most IDEs come with built-in compilers for the languages they support.
3. Debugger: A tool that assists developers in testing and debugging their applications and graphically points out the locations of bugs or errors, if any.
4. Built-in Terminal: A terminal is a text-based interface used for interacting with the machine's operating system. With a built-in terminal/console, developers can run scripts or commands directly within the IDE.
5. Version Control: Version control helps bring clarity to the development of the software. Some
IDEs also support version control tools like Git, through which a user can track and manage the
changes to the software code.
6. Code snippets: IDEs support code snippets, which are usually used to accomplish a single task and can also reduce redundant work to a great extent.
7. Extensions and Plugins: Extensions and Plugins are used to extend the functionality of the
IDEs with respect to specific programming languages.
8. Code navigation: IDEs come with tools like code folding, class and method navigation, and
refactoring tools that make it simple to go through and analyze code.
4. Write note on EDA and data visualization.
Ans-
EDA
The main purpose of EDA is to help look at data before making any assumptions. It can help identify obvious errors, better understand patterns within the data, detect outliers or anomalous events, and find interesting relations among the variables.
Data scientists can use exploratory analysis to ensure the results they produce are valid and applicable to any desired business outcomes and goals. EDA also helps stakeholders by confirming they are asking the right questions. EDA can help answer questions about standard deviations, categorical variables, and confidence intervals.
Once EDA is complete and insights are drawn, its features can then be used for more sophisticated data analysis or modeling, including machine learning.
Data visualization
Data visualization is the graphical representation of information and data using visual elements like graphs, charts, and maps.
Data visualization converts large and small data sets into visuals that are easier for the human mind to comprehend. In the world of Big Data, data visualization tools and technologies are essential for analyzing massive amounts of information and making data-driven decisions.
Data visualizations are common in everyday life, for example in the form of infographics. You can see visualizations in the form of line charts to display change over time, bar and column charts for observing relationships and making comparisons, pie charts for showing parts of a whole, and maps as the best way to share geographical data visually.
Today's data visualization tools go beyond the basic charts and graphs used in spreadsheets, displaying data in more sophisticated and interactive ways.
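A minimal sketch of a typical EDA pass with Pandas and Matplotlib is shown below; the file name and column names (sales.csv, revenue, units_sold) are illustrative assumptions, not part of the original answer.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load a dataset (file name and columns are placeholders)
df = pd.read_csv("sales.csv")

# Inspect structure, summary statistics, and missing values
print(df.head())
print(df.describe())          # mean, std, quartiles for numeric columns
print(df.isnull().sum())      # count of missing values per column

# Simple visualizations: a distribution and a relationship
df["revenue"].hist(bins=30)
plt.title("Revenue distribution")
plt.show()

df.plot.scatter(x="units_sold", y="revenue")
plt.title("Units sold vs revenue")
plt.show()
```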
Ans-
Data
Data refers to facts or figures that are collected, stored, and used for various purposes. Essentially, data is raw, unprocessed material that requires interpretation or analysis to derive meaning or insights.
Data can be collected through various methods such as surveys, experiments, and observations.
Data can take many forms, including:
Sound
Video
Single character
Picture
Text (string)
Sources of data:
1. Internal sources: When data is collected from sources within the organisation itself, they are known as internal sources. For example, a company's annual report on profit and loss.
2. External sources: When data is collected from sources outside the organisation, they are known as external sources of data.
Ans-
Data collection is the process of gathering and measuring information for use in research, decision-making, and analytics. Common data collection methods include:
Surveys
Transactional Tracking
Observation
Online Tracking
Forms
Data collection breaks down into two methods: primary and secondary. As a side note, the terms used for these methods can vary depending on who uses them; one source may call data collection methods "techniques", for example, but the general concepts and breakdowns apply across the board whether we're dealing with business analytics or academic research.
Primary:-
As the name implies, this is original, first-hand data collected by the data researchers. This process is the initial information-gathering step, performed before anyone carries out any further or related research. Primary data results are highly accurate provided the researcher collects the information directly.
Secondary:-
This data is either information that the researcher has tasked other people to collect or information
the researcher has looked up. Simply put, it’s second-hand information.
Although it’s easier and cheaper to obtain than primary information, secondary information raises
concerns regarding accuracy and authenticity.
Ans:
Data cleaning is a crucial step in the machine learning (ML) pipeline, as it involves identifying and
removing any missing, duplicate, or irrelevant data.
The goal of data cleaning is to ensure that the data is accurate, consistent, and free of errors, as
incorrect or inconsistent data can negatively impact the performance of the ML model.
Professional data scientists usually invest a very large portion of their time in this step because of the
belief that “Better data beats fancier algorithms”.
Data cleaning, also known as data cleansing or data preprocessing, is a crucial step in the data
science pipeline that involves identifying and correcting or removing errors, inconsistencies, and
inaccuracies in the data to improve its quality and usability.
Data cleaning is essential because raw data is often noisy, incomplete, and inconsistent, which can
negatively impact the accuracy and reliability of the insights derived from it.
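A minimal Pandas sketch of the common cleaning steps described above; the file and column names (raw_data.csv, city, order_date, customer_id) are placeholders used only for illustration.

```python
import pandas as pd

df = pd.read_csv("raw_data.csv")          # placeholder file name

# Remove exact duplicate rows
df = df.drop_duplicates()

# Standardize text fields (strip whitespace, normalize case)
df["city"] = df["city"].str.strip().str.title()

# Fix data types; invalid dates become NaT instead of raising errors
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# Drop rows missing a critical field, keep the rest
df = df.dropna(subset=["customer_id"])
```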
Ans:
The collection, transformation, and organization of data to draw conclusions, make predictions for the future, and make informed, data-driven decisions is called Data Analysis. A professional who handles data analysis is called a Data Analyst.
The data analysis process typically involves the following steps:
Data Requirement Gathering: Ask yourself why you're doing this analysis, what type of data analysis you want to use, and what data you plan to analyze.
Data Collection: Guided by your requirements, it's time to collect the data from your sources. Sources include case studies, surveys, interviews, questionnaires, direct observation, and focus groups.
Data Cleaning: Not all of the data you collect will be useful, so it's time to clean it up. This process is where you remove white spaces, duplicate records, and basic errors.
Data Analysis: Here is where you use data analysis software and other tools to help you interpret and understand the data and arrive at conclusions.
Data Interpretation: Now that you have your results, you need to interpret them and come up with the best courses of action based on your findings.
Data Visualization: Graphically show your information in a way that people can read and understand it. You can use charts, graphs, maps, bullet points, or a host of other methods.
Data modeling
Data modeling is the process of creating a visual representation of an information system to communicate its concepts and workflows, the relevant data entities and their attributes, and the relationships between them.
These are the different types of data models and what they include:
1. Conceptual data model: This is a high-level visualization of the business or analytics processes that a system will support.
It maps out the kinds of data that are needed, how different business entities interrelate and
associated business rules. Business executives are the main audience for conceptual data models, to
help them see how a system will work and ensure that it meets business needs.
Once a conceptual data model is finished, it can be used to create a less-abstract logical one.
2. Logical data model: Logical data models show how data entities are related and describe the data from a technical
perspective.
For example, they define data structures and provide details on attributes, keys, data types and other
characteristics. The technical side of an organization uses logical models to help understand required
application and database designs. But like conceptual models, they aren't connected to a particular
technology platform.
A logical model serves as the basis for the creation of a physical data model.
3. Physical data model: Physical models are specific to the database management system (DBMS) or application software
that will be implemented.
They define the structures that the database or a file system will use to store and manage the data.
That includes tables, columns, fields, indexes, constraints, triggers and other DBMS elements.
Database designers use physical data models to create designs and generate schema for databases.
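To make the distinction concrete, here is a minimal sketch of a physical data model expressed as SQLite DDL through Python's sqlite3 module; the database file, table names, columns, and constraints are illustrative assumptions, not part of the original answer.

```python
import sqlite3

conn = sqlite3.connect("shop.db")   # placeholder database file
cur = conn.cursor()

# Physical model: concrete tables, columns, types, keys, and constraints
cur.execute("""
CREATE TABLE IF NOT EXISTS customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    email       TEXT UNIQUE
)
""")
cur.execute("""
CREATE TABLE IF NOT EXISTS orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    order_date  TEXT,
    total       REAL
)
""")
conn.commit()
conn.close()
```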
UNIT – 2
Ans:-
Data curation is the process of collecting, organizing, managing, and maintaining data to ensure its
quality, usability, and longevity. It involves various activities aimed at enhancing the value of data for
analysis, interpretation, and decision-making. Data curation is crucial in today's data-driven world
where vast amounts of data are generated daily across various domains such as business, science,
healthcare, and academia.
1.Data Collection: Data curation begins with the collection of relevant data from various sources. This
may include structured data from databases, unstructured data from documents, or semi-structured
data from web sources, sensors, or social media platforms.
2.Data Cleaning and Quality Assurance: Raw data often contains errors, inconsistencies, duplicates,
or missing values. Data curation involves cleaning and validating the data to ensure its accuracy,
consistency, and completeness. This process may include techniques such as data deduplication,
outlier detection, and imputation of missing values.
3.Data Organization and Integration: Curated data needs to be organized in a structured manner to
facilitate easy access and retrieval. This involves creating metadata, taxonomies, or ontologies to
describe the data's content and relationships. Data integration techniques may also be employed to
combine data from different sources while resolving inconsistencies in formats and semantics.
4.Data Storage and Preservation: Curation includes decisions about where and how to store data to
ensure its security, accessibility, and long-term preservation. This may involve choosing appropriate
storage technologies, backup strategies, and data archiving practices to safeguard against data loss,
corruption, or obsolescence.
5.Data Annotation and Enrichment: To enhance the interpretability and usability of data, curation
may involve annotating or enriching data with additional metadata, contextual information, or
domain-specific knowledge. This can help users better understand the data and its relevance to
specific tasks or analyses.
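A small sketch of the cleaning and quality-assurance step described above, using the IQR rule for outlier detection and median imputation; the file and column names (sensor_readings.csv, temperature) are placeholders.

```python
import pandas as pd

df = pd.read_csv("sensor_readings.csv")   # placeholder source

# Deduplicate
df = df.drop_duplicates()

# Flag outliers in a numeric column using the IQR rule
q1, q3 = df["temperature"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = (df["temperature"] < q1 - 1.5 * iqr) | (df["temperature"] > q3 + 1.5 * iqr)
df = df[~outliers]

# Impute remaining missing values with the column median
df["temperature"] = df["temperature"].fillna(df["temperature"].median())
```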
10) Write difference between structured, semi structured and unstructured data .
Ans:
Ans:
Ans:-
Authentication and authorization are two crucial aspects of securing storage systems in data science.
They ensure that only authorized users or processes can access and manipulate data, thereby
protecting sensitive information from unauthorized access or misuse. Here's an explanation of
authentication and authorization in the context of storage systems:
a. Authentication:
Authentication is the process of verifying the identity of users or entities attempting to access a
storage system. It ensures that only legitimate users with valid credentials are allowed access.
Authentication mechanisms typically involve the following:
1.User Credentials: Users provide credentials such as usernames, passwords, or cryptographic keys to
prove their identity.
2.Multi-Factor Authentication (MFA): Enhances security by requiring users to provide multiple forms
of verification, such as a password and a unique code sent to their mobile device.
5.Single Sign-On (SSO): Allows users to authenticate once and access multiple services or storage
systems without the need to re-enter credentials.
Authentication ensures that only legitimate users gain access to the storage system, thereby
preventing unauthorized access and protecting sensitive data from malicious actors.
b.Authorization:
Authorization determines what actions users are allowed to perform within the storage system once
their identity has been authenticated. It specifies the permissions and privileges granted to users
based on their roles, responsibilities, or access levels. Authorization mechanisms typically involve the
following:
1.Role-Based Access Control (RBAC): Assigns permissions to users based on predefined roles (e.g.,
admin, manager, user) associated with specific privileges.
2.Attribute-Based Access Control (ABAC): Grants access based on attributes of users, resources, and
environmental conditions, allowing for more fine-grained access control.
3.Access Control Lists (ACLs): Lists of permissions attached to specific resources, specifying which
users or groups have permission to perform certain actions.
4.Policy-Based Access Control (PBAC): Uses policies to determine access rights based on predefined
rules or conditions.
5.Granular Permissions: Assigns specific permissions (e.g., read, write, execute) to users or groups at
the level of individual files, directories, or data objects.
Authorization ensures that authenticated users only have access to the resources and functionalities that are necessary for their role or task, reducing the risk of data breaches, unauthorized modifications, or data leaks.
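A minimal illustration, not tied to any particular storage product, of how a role-based access control (RBAC) decision can be expressed in code; the role names and permissions are hypothetical.

```python
# Hypothetical role-to-permission mapping (RBAC)
ROLE_PERMISSIONS = {
    "admin":   {"read", "write", "delete"},
    "analyst": {"read"},
    "etl_job": {"read", "write"},
}

def is_authorized(role: str, action: str) -> bool:
    """Return True if the given role is allowed to perform the action."""
    return action in ROLE_PERMISSIONS.get(role, set())

# Example: an analyst may read but not delete
print(is_authorized("analyst", "read"))    # True
print(is_authorized("analyst", "delete"))  # False
```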
Ans:
GitHub is a widely used platform in data science for collaboration, version control, and sharing of
code, projects, and data-related resources. It offers a range of features and functionalities that make
it particularly well-suited for data science workflows. Here's a note on GitHub's significance in data
science:
1.Version Control:
GitHub utilizes Git, a distributed version control system, allowing data scientists to track changes to
their code, scripts, notebooks, and other project files.
Version control enables collaboration among team members by providing a history of changes,
facilitating code review, and allowing for easy integration of contributions from multiple developers.
2.Collaboration:
GitHub provides a platform for data scientists to collaborate on projects in real-time, irrespective of
geographical locations.
Users can fork repositories to create their own copies of projects, make changes, and propose
modifications through pull requests.
Features like issues, discussions, and project boards facilitate communication and coordination
among team members.
3.Reproducibility:
GitHub fosters reproducibility in data science by preserving the entire history of a project, including
code, data, and documentation.
Researchers can share their analyses, workflows, and experimental results on GitHub, allowing
others to reproduce and validate their findings.
4.Open Source Ecosystem:
GitHub serves as a hub for open source data science projects, libraries, and tools.
Data scientists can contribute to existing projects, share their own projects, and collaborate with the
broader community to advance the field of data science.
5.Project Management:
GitHub provides project management features such as milestones, labels, and issue tracking to
organize and prioritize tasks within a project.
Data science teams can use GitHub's project management tools to plan, track progress, and manage
workflows effectively.
6.Integration with Data Science Tools:
GitHub seamlessly integrates with popular data science tools and platforms such as Jupyter
Notebooks, RStudio, and Kaggle.
Data scientists can version control their notebooks, scripts, datasets, and models directly from their
preferred tools and synchronize changes with GitHub repositories.
7.Community and Learning:
GitHub hosts a vibrant community of data scientists, researchers, developers, and enthusiasts.
Users can discover new projects, explore code repositories, participate in discussions, and learn from
others' work.
14)Write note on NoSQL.
Ans:
NoSQL (Not Only SQL) databases have gained significant popularity in data science
due to their ability to efficiently handle large volumes of diverse and unstructured
data. Unlike traditional SQL databases, NoSQL databases are designed to be highly
scalable, flexible, and capable of handling varying data formats and structures. Here's
a note on the significance of NoSQL in data science:
Ans:
MongoDB is a popular NoSQL database widely used in data science for its flexibility,
scalability, and ease of use. It's particularly well-suited for storing and processing
unstructured or semi-structured data commonly encountered in data science
applications. Here's a note on the significance of MongoDB in data science:
1. Document-Oriented Storage:
MongoDB stores data in flexible, JSON-like documents, allowing data
scientists to work with complex, nested data structures without the
need for predefined schemas.
This document-oriented approach simplifies data modeling and
supports agile development and experimentation in data science
projects.
2. Schema Flexibility:
MongoDB offers dynamic schemas, enabling data scientists to store
and manipulate data without strict schema definitions.
Fields within documents can vary from one document to another,
providing schema flexibility and accommodating evolving data
requirements.
3. Querying and Aggregation:
MongoDB provides a powerful query language (MongoDB Query
Language or MQL) that supports a wide range of operations, including
filtering, sorting, projection, and aggregation.
Data scientists can perform complex queries and aggregations to
extract insights from large datasets efficiently.
4. Scalability and Performance:
MongoDB is designed for horizontal scalability, allowing data to be
distributed across multiple nodes or servers.
Its distributed architecture supports high throughput and low-latency
access, making it suitable for handling large-scale data processing and
real-time analytics.
5. Indexing and Full-Text Search:
MongoDB supports various indexing techniques to optimize query
performance, including single-field indexes, compound indexes,
geospatial indexes, and text indexes.
Full-text search capabilities enable data scientists to search and analyze
text data efficiently, making MongoDB suitable for text mining and
natural language processing (NLP) tasks.
6. Geospatial Data Processing:
MongoDB includes robust support for geospatial data types and
queries, allowing data scientists to store, index, and query spatial data
such as coordinates, polygons, and multi-geometries.
Geospatial queries enable location-based analytics and geospatial
visualization, essential for applications like GIS (Geographic Information
Systems) and location-based services.
7. Integration with Data Science Tools:
MongoDB integrates seamlessly with popular data science tools and
libraries, such as Python (via PyMongo), R (via RMongo), and Apache
Spark (via Spark Connector).
Data scientists can leverage MongoDB as a backend storage solution
for storing and processing large-scale datasets within their data science
workflows.
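A short PyMongo sketch of the document storage, querying, and aggregation described above; the connection string, database name, collection, and field names are assumptions for illustration.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed local instance
db = client["analytics"]
users = db["users"]

# Insert a flexible, schema-less document
users.insert_one({"name": "Asha", "age": 29, "skills": ["python", "sql"]})

# Query with a filter
for doc in users.find({"age": {"$gt": 25}}):
    print(doc["name"])

# Aggregation: average age across all documents
pipeline = [{"$group": {"_id": None, "avg_age": {"$avg": "$age"}}}]
print(list(users.aggregate(pipeline)))
```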
Ans:
1. Compute Services:
Amazon EC2 (Elastic Compute Cloud): Virtual servers in the cloud
that provide resizable compute capacity. Data scientists can use EC2
instances to run data processing tasks, execute machine learning
algorithms, and host applications.
AWS Lambda: Serverless compute service that runs code in response
to events or triggers without provisioning or managing servers. Lambda
functions can be used for real-time data processing, event-driven
workflows, and automation tasks.
2. Storage Services:
Amazon S3 (Simple Storage Service): Object storage service that
provides scalable storage for data lakes, data warehouses, and analytics
workloads. S3 is commonly used for storing large volumes of
structured, semi-structured, and unstructured data in various formats.
Amazon EBS (Elastic Block Store): Block storage service for EC2
instances, providing persistent storage volumes that can be attached to
EC2 instances. EBS volumes are suitable for storing data used by
applications and databases.
3. Database Services:
Amazon RDS (Relational Database Service): Managed relational
database service that simplifies database administration tasks such as
setup, patching, backup, and scaling. RDS supports popular database
engines like MySQL, PostgreSQL, Oracle, and SQL Server.
Amazon DynamoDB: Fully managed NoSQL database service that
offers seamless scalability, high performance, and low latency.
DynamoDB is suitable for storing and querying semi-structured data
with flexible schemas.
4. Analytics Services:
Amazon Redshift: Fully managed data warehouse service that
provides fast query performance and petabyte-scale data storage.
Redshift is optimized for analytics workloads and supports querying
large datasets using SQL.
Amazon Athena: Interactive query service that allows users to analyze
data stored in S3 using standard SQL queries. Athena is serverless and
requires no infrastructure setup, making it easy to analyze data directly
from S3.
5. Machine Learning Services:
Amazon SageMaker: Fully managed platform for building, training,
and deploying machine learning models at scale. SageMaker provides
pre-built machine learning algorithms, automated model tuning, and
managed infrastructure for training and inference.
Amazon Comprehend: Natural language processing (NLP) service that
uses machine learning to analyze text data and extract insights such as
sentiment analysis, entity recognition, and topic modeling.
6. Big Data Services:
Amazon EMR (Elastic MapReduce): Managed big data platform that
simplifies the deployment and management of Apache Hadoop, Spark,
HBase, and other big data frameworks. EMR is used for processing and
analyzing large-scale datasets using distributed computing.
7. Integration Services:
AWS Glue: Fully managed extract, transform, and load (ETL) service
that makes it easy to prepare and transform data for analytics and
machine learning. Glue automatically discovers, catalogs, and cleanses
data from various sources.
8. Networking and Security:
Amazon VPC (Virtual Private Cloud): Virtual network infrastructure
that allows users to provision a logically isolated section of the AWS
cloud. VPC enables data scientists to define network settings, configure
security groups, and control access to resources.
AWS Identity and Access Management (IAM): Security service that
provides centralized control over user access to AWS resources. IAM
allows data scientists to manage user permissions, create security
policies, and enforce multi-factor authentication.
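A minimal boto3 sketch showing how a data scientist might move a dataset to and from S3; the bucket name, object keys, and local file names are placeholders, and credentials are assumed to be configured via IAM roles or environment variables.

```python
import boto3

s3 = boto3.client("s3")   # credentials assumed to be configured externally

# Upload a local dataset to an S3 bucket (names are placeholders)
s3.upload_file("train.csv", "my-data-bucket", "datasets/train.csv")

# Download it back for local processing
s3.download_file("my-data-bucket", "datasets/train.csv", "train_copy.csv")

# List objects under a prefix
resp = s3.list_objects_v2(Bucket="my-data-bucket", Prefix="datasets/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```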
Ans:
Web scraping is a pivotal technique within data science, serving as a gateway to a vast
reservoir of online data. Through automated extraction methods, it enables data scientists to
efficiently retrieve information from diverse sources such as websites, forums, and social
media platforms. This process is instrumental in data acquisition, providing access to
valuable datasets essential for analysis and research endeavors.
The real-time nature of web scraping empowers data scientists to capture dynamic trends
and fluctuations as they occur. By collecting data continuously, organizations can stay abreast
of evolving market conditions, consumer sentiments, and emerging patterns.
This timeliness enhances decision-making processes and strategic planning efforts, fostering
agility and adaptability in today's fast-paced business landscape.
Ethical considerations play a crucial role in the practice of web scraping, necessitating
adherence to established guidelines and regulations. It's imperative for practitioners to
respect the terms of service of websites being scraped and to obtain data ethically and
responsibly.
By upholding ethical standards, data scientists can maintain trust and integrity within the
data community while avoiding potential legal ramifications.
Technically, web scraping involves parsing HTML code to extract relevant data elements from
web pages. Utilizing specialized libraries and frameworks like BeautifulSoup and Scrapy in
programming languages such as Python streamlines the scraping process.
These tools offer functionalities to navigate through web structures, locate desired
information, and extract data efficiently, enabling seamless integration into data analysis
pipelines.
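A minimal requests + BeautifulSoup sketch of the parsing workflow described above; the URL and the CSS selector are placeholders, and any real scraping should respect the site's robots.txt and terms of service.

```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com/articles"          # placeholder URL
resp = requests.get(url, timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")

# Extract headline text and links (tag and class names are assumptions)
for item in soup.select("h2.article-title a"):
    print(item.get_text(strip=True), item.get("href"))
```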
In conclusion, web scraping serves as a cornerstone of data science, facilitating the extraction
of valuable insights from the vast expanse of online resources. Through automation, real-
time data collection, adherence to ethical guidelines, and utilization of technical tools, web
scraping empowers organizations to harness the power of data for informed decision-making
and strategic initiatives.
Ans:
Version control in data science is a fundamental practice that involves tracking changes made
to code, data, and other project assets. It operates through specialized software systems like
Git, which provide a structured approach to managing versions of files. This enables data
scientists to maintain a chronological record of modifications, facilitating collaboration,
reproducibility, and project management.
Central to version control is the concept of repositories, which serve as centralized locations
for storing project files and tracking changes. Within a repository, data scientists can create
branches to work on different aspects of a project simultaneously. This branching mechanism
allows for experimentation and feature development while preserving the integrity of the
main project.
Commits are the individual snapshots of changes made to files within a repository. Each
commit includes a unique identifier, a timestamp, and a description of the changes. This
granular level of tracking enables data scientists to understand the evolution of their work,
revert to previous states if necessary, and review the history of contributions.
Merging is a key operation in version control that involves integrating changes from one
branch into another. This process ensures that updates made by different team members can
be combined seamlessly. Merging allows data scientists to consolidate their work, resolve
conflicts, and maintain a cohesive and up-to-date version of the project.
Version control systems also provide mechanisms for collaboration and coordination among
team members. Through features like pull requests and code reviews, data scientists can
propose changes, solicit feedback, and ensure the quality of contributions. This collaborative
workflow promotes transparency, accountability, and knowledge sharing within data science
teams.
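A small sketch of the commit/branch/merge cycle described above, using the GitPython library (assuming it is installed); the repository path, file name, and branch name are placeholders.

```python
from pathlib import Path
from git import Repo

# Initialize a repository for a project directory (path/file names are placeholders)
repo = Repo.init("my_analysis")
Path("my_analysis/cleaning.py").write_text("# data cleaning script\n")

# Stage and commit a change
repo.index.add(["cleaning.py"])
repo.index.commit("Add data cleaning script")

main = repo.active_branch                      # remember the main line of work

# Create and switch to a feature branch for experimentation
feature = repo.create_head("feature/outlier-detection")
feature.checkout()
# ... commit experimental work on the branch here ...

# Merge the feature branch back into the main branch
main.checkout()
repo.git.merge(feature.name)
```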
Ans:
Software development tools play a crucial role in data science by providing frameworks,
libraries, and platforms that streamline various aspects of the data analysis and model
development process. These tools encompass a wide range of functionalities, from data
manipulation and visualization to machine learning model deployment and monitoring.
Some key software development tools in data science include:
b.Version Control Systems (VCS): Version control systems like Git are essential for managing
changes to code, data, and project assets. They enable data scientists to track modifications,
collaborate effectively, and maintain a historical record of project iterations, promoting
transparency and reproducibility.
c.Data Manipulation Libraries: Libraries like Pandas (for Python) and dplyr (for R) are widely
used for data manipulation and analysis. They provide powerful tools for cleaning,
transforming, and aggregating data, enabling data scientists to preprocess datasets efficiently
and extract valuable insights.
f.Model Deployment Platforms: Model deployment platforms like TensorFlow Serving, Flask,
and Streamlit facilitate the deployment and integration of machine learning models into
production systems. They provide APIs, hosting services, and deployment pipelines for
deploying models as web services or embedding them into applications, enabling real-time
inference and decision-making.
g.Cloud Computing Services: Cloud computing platforms like Amazon Web Services (AWS),
Google Cloud Platform (GCP), and Microsoft Azure offer scalable infrastructure and services
for data storage, processing, and analysis. They provide tools for distributed computing, big
data processing, and machine learning, enabling data scientists to leverage cloud resources
for large-scale data projects.
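As a concrete illustration of the model deployment platforms mentioned in point (f), here is a minimal Flask sketch that exposes a previously trained scikit-learn model as a prediction endpoint; the model file name, route, and request format are assumptions.

```python
import joblib
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("model.joblib")     # placeholder: a previously trained model

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(port=5000)
```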
Ans:
Cloud computing in data science transforms the traditional approach to data management by
providing remote access to computing resources via the internet. It offers scalability, allowing
data scientists to scale computing power and storage resources dynamically based on project
requirements. This flexibility ensures optimal performance and cost-effectiveness for data-
intensive tasks.
Data storage in the cloud eliminates the need for on-premises infrastructure, offering secure
and reliable storage solutions for large volumes of data. Cloud storage services such as
Amazon S3, Google Cloud Storage, and Azure Blob Storage provide scalable storage options,
enabling data scientists to store and access data from anywhere with an internet connection.
This accessibility facilitates seamless collaboration and data sharing among team members.
Data processing in the cloud is facilitated through services like AWS Glue, Google Cloud
Dataflow, and Azure Data Factory, which streamline the orchestration of data pipelines and
ETL processes. These platforms enable data scientists to process, transform, and analyze data
at scale, leveraging distributed computing resources for efficient data processing. This
accelerates time-to-insight and enables data-driven decision-making.
Cloud-based analytics services like AWS Athena, Google BigQuery, and Azure Synapse
Analytics empower data scientists to derive insights from large datasets through real-time
querying and analysis. These services offer powerful capabilities for ad-hoc analysis, data
exploration, and visualization, enabling data scientists to uncover patterns, trends, and
anomalies within their data. This facilitates informed decision-making and drives business
innovation.
2]Lasso Regularization
By imposing a penalty equal to the sum of the absolute values of the
coefficients, it alters models that are either overfitted or underfitted.
Lasso regression likewise attempts coefficient minimization, but it uses
the absolute values of the coefficients rather than squaring their
magnitudes. Because the penalty uses absolute values, some coefficients
can be shrunk all the way to 0, effectively removing those features from
the model. Think about the Lasso regression cost function:
We can control the coefficient values by controlling the penalty terms, just
like we did in Ridge Regression. Again, consider a Linear Regression
model:
Cost function = Loss + λ x Σ|w|
For Linear Regression line, let’s assume,
Loss = 0 (considering the two points on the line)
λ=1
w = 1.4
Then, Cost function = 0 + 1 x 1.4
= 1.4
For the Lasso regression line, let's assume,
Loss = 0.3² + 0.1² = 0.1
λ=1
w = 0.7
Then, Cost function = 0.1 + 1 x 0.7
= 0.8
Comparing the two models, with all data points, we can see that the Lasso
regression line fits the model more accurately than the linear regression
line.
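A short scikit-learn sketch comparing plain linear regression with Lasso on synthetic data; the dataset parameters and the alpha value (which plays the role of the penalty term λ) are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Lasso

# Synthetic data: 100 samples, 10 features, only 3 of them informative
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=10.0, random_state=42)

linear = LinearRegression().fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)   # alpha corresponds to the penalty term λ

print("Linear coefficients:", np.round(linear.coef_, 2))
print("Lasso coefficients: ", np.round(lasso.coef_, 2))
# Lasso shrinks uninformative coefficients, often exactly to zero
```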
Working of an algorithm:
In a decision tree, the algorithm begins at the root node and works its way
down the tree to forecast the class of the given dataset. The algorithm compares
the value of the root attribute with the corresponding attribute of the record
(the real dataset) and, based on the comparison, follows the branch and jumps
to the next node.
For the next node, the algorithm again compares the attribute value with the
other sub-nodes and continues. It keeps doing this until it reaches a leaf node
of the tree. The following steps can help you comprehend the entire procedure:
o Step-1: Begin the tree with the root node, say S, which contains the
complete dataset.
o Step-2: Find the best attribute in the dataset using Attribute Selection
Measure (ASM).
o Step-3: Divide S into subsets that contain possible values for the
best attribute.
o Step-4: Generate the decision tree node, which contains the best
attribute.
o Step-5: Recursively make new decision trees using the subsets of the
dataset created in step-3. Continue this process until a stage is reached
where you cannot further classify the nodes; the final node is called
a leaf node.
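A minimal scikit-learn sketch of training and inspecting a decision tree, using the built-in Iris dataset for illustration; the hyperparameters chosen here are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# criterion (e.g. "gini" or "entropy") acts as the attribute selection measure (ASM)
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print("Test accuracy:", tree.score(X_test, y_test))
print(export_text(tree))   # textual view of the root, splits, and leaf nodes
```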
2] Boosting
Boosting is an ensemble strategy that improves future predictions by learning from previous
predictor errors. The method greatly increases model predictability by combining numerous
weak base learners into one strong learner. Boosting works by placing weak learners in a
sequential order so that each learner can learn from the errors of the previous one and
improve the predictive model.
There are many different types of boosting, such as gradient boosting, Adaptive Boosting
(AdaBoost), and XGBoost (Extreme Gradient Boosting). AdaBoost employs weak learners in
the form of decision trees, most of which consist of a single split known as a decision
stump. The first decision stump in AdaBoost is built on observations that are all weighted equally.
Gradient boosting adds predictors to the ensemble sequentially, with each new predictor
correcting the errors of the ones before it, which improves the model's accuracy. New
predictors are fitted to offset the consequences of errors in the earlier models. Gradient
descent lets the gradient booster identify and address issues with the learners' predictions.
XGBoost uses gradient-boosted decision trees and is engineered for speed and computational
efficiency. Standard gradient boosting machines train relatively slowly, since model training
must proceed sequentially.
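A brief scikit-learn sketch contrasting AdaBoost (sequential decision stumps) with gradient boosting on the same data; the synthetic dataset and hyperparameters are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# AdaBoost: sequential decision stumps, re-weighting misclassified samples
ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Gradient boosting: each new tree fits the residual errors of the ensemble so far
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 random_state=0).fit(X_train, y_train)

print("AdaBoost accuracy:", ada.score(X_test, y_test))
print("Gradient boosting accuracy:", gbm.score(X_test, y_test))
```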
3]Stacking
Another ensemble method called stacking is sometimes known as layered generalization.
This method works by allowing a training algorithm to combine the predictions of numerous
different learning algorithms that are similar. Regression, density estimations, distance
learning, and classifications have all effectively used stacking. It can also be used to gauge
the amount of inaccuracy that occurs
when bagging.
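A compact scikit-learn sketch of stacking, where a meta-learner combines the predictions of several base models; the choice of base estimators and meta-learner is an illustrative assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("svc", SVC(probability=True, random_state=0)),
]

# The final_estimator (meta-learner) is trained on the base models' predictions
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression())

print("Cross-validated accuracy:", cross_val_score(stack, X, y, cv=5).mean())
```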