Holy Ganges Public School
Class:- X
Subject:- AI
Part-B
Unit-4 Statistical Data Concepts & Its Applications
Notes:
Data Science:- Data Science is the in-depth study of massive amounts of data. It
involves extracting meaningful insights from raw, structured and unstructured
data.
Data science is a concept to unify statistics, data analysis, machine learning and
their related methods in order to understand and analyse actual phenomena with
data.
Data science deals with all types of data, i.e., structured data, unstructured
data, and semi-structured data.
Types of Data:
1. Structured Data:- Structured data is a well-defined set of data values.
Structured Data is typically stored in a relational database.
Examples:- Records in Database, spreadsheets, etc.
2. Unstructured Data:- Unstructured data is information that either does not have
a predefined data model or is not organised in a predefined manner.
Examples:- Online search results returned by the search engines, etc.
3. Semi-structured Data:- Semi-structured data refers to what would normally be
considered unstructured data, but that also has metadata (such as tags or keys)
that identifies certain characteristics.
Examples:- Emails, XML and JSON files, etc.
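The three types of data can be pictured with a small Python sketch. The student records, the JSON text and the sentence below are made-up illustration data, not taken from any real dataset.

```python
import json

# Structured data: every record has the same well-defined fields,
# like rows in a database table or a spreadsheet.
structured = [
    {"roll_no": 1, "name": "Asha", "marks": 88},
    {"roll_no": 2, "name": "Ravi", "marks": 75},
]

# Semi-structured data: JSON text carries keys (metadata) that describe
# the content, but records need not all share the same fields.
semi_structured = '{"name": "Asha", "hobbies": ["chess", "music"]}'
record = json.loads(semi_structured)

# Unstructured data: free text with no predefined data model.
unstructured = "Asha scored well in the test and enjoys playing chess."

# Structured data can be queried directly by field name.
average_marks = sum(row["marks"] for row in structured) / len(structured)
print(average_marks)              # 81.5
print(record["hobbies"])          # ['chess', 'music']
print(len(unstructured.split()))  # number of words in the free text
```

Notice that the structured and semi-structured data can be read by field name, while the unstructured text can only be split into words unless more processing is done.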
Need for Data Science:- Some main reasons for using Data Science technology
are:
1. We can convert the massive amount of raw data and unstructured data into
meaningful insights.
2. Organisations use Data Science algorithms to handle huge amounts of data and
provide a better customer experience.
3. Data Science is used to automate transportation, for example in building
self-driving cars.
4. Data Science can help in making predictions, such as the results of surveys,
elections, etc.
Tools for Data Science:
1. Data Analysis Tools:- R, Python, Jupyter, Excel, etc.
2. Data Warehousing:- ETL, SQL, AWS Redshift, etc.
3. Data Visualisation:- R, Jupyter, Tableau, Cognos, etc.
4. Machine Learning Tools:- Spark, Mahout, Azure ML studio.
Applications of Data Science:
1. Internet Search - Search engines use data science to work out what users want
to find on the internet and whether the results shown are useful to them. Search
engines that use data science include Google, Yahoo, Bing, DuckDuckGo, etc.
2. Targeted Advertising - Since most people now use the internet, digital
advertisements can be shown only to specific people, based on their interests
and past behavior. This is why digital ads achieve a much higher CTR (Click
Through Rate) than traditional advertisements.
3. Website Recommendations - Recommendation systems help users find relevant
products among the billions of products available on e-commerce websites. Many
companies promote their products on these websites based on the interests of
the users and the relevance of the products to them. Internet giants like
Amazon, Twitter, Google Play, Netflix, LinkedIn, IMDb, and many more use such
systems to improve the user experience. The recommendations are based on a
user's previous search results.
4. Genetics & Genomics - Data science techniques allow integration of different
kinds of data with genomic data in disease research, which provides a deeper
understanding of genetic issues in reactions to particular drugs and diseases. As
soon as we acquire reliable personal genome data, we will achieve a deeper
understanding of human DNA.
5. Finance - Data science plays a crucial role in the finance sector. It helps
banks identify fraud and the risk of losses, and allows this risk to be
identified and analyzed automatically. Data science can also examine the past
behavior of the stock market and make predictions about future outcomes.
6. Health Care - Many health care organisations use data science for identifying
tumors, medical image analysis, patient health record maintenance,
pharmaceutical development, predictive diagnosis, etc. Data science also helps
hospitals make more accurate predictions, which reduces the rate of treatment
failure.
7. Airline Route Planning - With the help of data science, airline companies can:
1. Predict delays in flights.
2. Decide which class of airplanes to buy.
3. Effectively drive customer loyalty programs.
4. Decide whether to fly directly to the destination or make a stop in between.
8. Image Recognition and Speech Recognition - When we upload an image on
Facebook and get suggestions to tag our friends, this automatic tagging
suggestion uses an image recognition algorithm, which is a part of data
science.
When we speak to voice assistants such as "OK Google", Siri or Cortana, these
devices respond to our voice commands. This is made possible by speech
recognition algorithms.
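As a small worked example of the CTR (Click Through Rate) mentioned under Targeted Advertising: CTR is usually expressed as the number of clicks divided by the number of times the ad was shown (impressions), as a percentage. The function name and the figures below are hypothetical, chosen only for illustration.

```python
def click_through_rate(clicks, impressions):
    """Return CTR as a percentage: clicks * 100 / impressions."""
    if impressions == 0:
        return 0.0  # an ad that was never shown has no meaningful CTR
    return clicks * 100 / impressions

# Hypothetical figures: an ad shown 2000 times and clicked 50 times.
print(click_through_rate(50, 2000))  # 2.5
```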
High-Code, Low-Code and No-Code AI Tools:
High-Code - High-code development refers to traditional software development
where programmers write code manually using programming languages like Java,
Python, C# etc. High-code is also known as custom-code.
Low-Code AI - Low-code AI tools let people with some coding knowledge create AI
applications with minimum coding. Low-code users have some programming skills
and can build their own applications. They can also use a drag-and-drop
interface to build the components of AI.
Features of Low-Code AI:
1. Pre-built Components.
2. Code customisation.
3. Simplified AI pipelines.
4. Integration with code.
5. Visual Development Environment.
No-Code AI - No-code AI tools and platforms let users build AI applications
without writing any code. They use a drag-and-drop interface to build the
components of AI, which makes it easy for people who do not have a technical
background.
Features of No-Code AI:
1. User-friendly Interface.
2. Pre-built Models.
3. Automated Workflow.
4. Integration Capabilities.
5. No Programming Knowledge Required.
Disadvantages of No-Code Tools:
1. Lack of flexibility.
2. Automation Bias.
3. Security Issues.
Some No-Code Tools:
Azure Machine Learning:- It is a cloud-based service provided by Microsoft,
released in July 2014. It aims to simplify ML processes.
Google Cloud AutoML:- It is a suite of machine learning tools and services
provided by Google Cloud, released in January 2018.
Apple Create ML:- Developed by Apple Inc., it is specially designed for the
macOS and iOS platforms.
Microsoft Lobe:- Developed by Microsoft in 2015, Lobe helps people with no data
science experience import images and easily label them to create a machine
learning dataset.
Google Teachable Machine:- It is a web-based tool developed by Google, released
in November 2017, that allows users to create machine learning models without
any coding.
Orange Data Mining:- It is an open-source data visualisation, machine learning
and data mining toolkit, released in October 1996.
What is the Orange data mining tool?
Orange is an open-source machine learning software built on a no-code or
low-code framework. With Orange, you can carry out data visualization,
predictive modeling, and data analysis. The tool is easy to use, has a
drag-and-drop interface, and is mainly used in education, research, business,
etc.
Statistics in AI:
Statistics plays an important role in analyzing and dealing with data in data
science. Statistics is used for collecting, exploring, and analyzing data. It
also helps in drawing conclusions from data.
Important concepts in statistics:
Statistical sampling:- The entire set of raw data that you may have available
for a test or experiment is known as the population.
It is often not possible to measure patterns and trends across the entire
population.
Instead, we take a sample (a portion of the population) and perform the
computations on it.
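The idea of taking a sample from a population can be sketched in Python as follows. The population here is made-up data (random marks for 1,000 students), and the sample size of 50 is an arbitrary choice for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so that every run picks the same values

# Hypothetical population: marks (0 to 100) of 1,000 students.
population = [random.randint(0, 100) for _ in range(1000)]

# Simple random sample: 50 students chosen without replacement.
sample = random.sample(population, 50)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

# The sample mean is usually close to, but not exactly equal to,
# the population mean.
print("Population mean:", population_mean)
print("Sample mean:", sample_mean)
```

Working on 50 values instead of 1,000 is much less effort, and the sample mean still gives a reasonable estimate of the population mean.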
Descriptive statistics:- Descriptive statistics refers to a set of methods used
to summarize and describe the main features of a dataset. It helps us to
describe the data and understand its underlying characteristics.
Mean - The sum of all the values divided by the number of values, commonly
called the average.
Median - The middle value if we ordered the data from low to high and divided it
exactly in half.
Mode - The value which occurs most often.
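Mean, median and mode can be calculated with Python's built-in statistics module. The list of marks below is made-up data used only as an example.

```python
import statistics

# Made-up marks of seven students.
marks = [85, 60, 95, 70, 85, 90, 75]

mean_marks = statistics.mean(marks)      # sum of values / number of values
median_marks = statistics.median(marks)  # middle value after sorting
mode_marks = statistics.mode(marks)      # value that occurs most often

print("Mean:", mean_marks)      # 80
print("Median:", median_marks)  # 85
print("Mode:", mode_marks)      # 85
```

Here the sorted marks are 60, 70, 75, 85, 85, 90, 95, so the middle (fourth) value is 85, and 85 also occurs most often.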
Standard Deviation:- It is a measure of the dispersion of a dataset from its
mean. It is the square root of the variance and is often calculated on a given
sample of data, for example one available in the form of a list.
Variance:- Variance is the average of the squared deviations of the values from
their mean.
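Variance and standard deviation can be worked out step by step and then checked against Python's statistics module. The data below is made up. Note that a sample (as opposed to a whole population) divides by n - 1 instead of n, which is the convention followed by statistics.variance and statistics.stdev.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values

mean = sum(data) / len(data)  # 5.0
squared_deviations = [(x - mean) ** 2 for x in data]

# Population variance: average of the squared deviations (divide by n).
population_variance = sum(squared_deviations) / len(data)  # 4.0
population_std = math.sqrt(population_variance)            # 2.0

# Sample variance: divide by n - 1 instead of n.
sample_variance = sum(squared_deviations) / (len(data) - 1)
sample_std = math.sqrt(sample_variance)

# The statistics module gives matching results.
print(population_variance, statistics.pvariance(data))
print(population_std, statistics.pstdev(data))
print(sample_variance, statistics.variance(data))
print(sample_std, statistics.stdev(data))
```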
Data Visualisation:- It is the graphical representation of information and data
using visual elements like charts, graphs, and maps.
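Real charts are normally drawn with visualisation tools, but the basic idea of turning numbers into a picture can be sketched in plain Python with a text bar chart. The grade counts and the bar_chart function below are hypothetical, for illustration only.

```python
# Hypothetical data: number of students in each grade band.
grade_counts = {"A": 8, "B": 12, "C": 5, "D": 2}

def bar_chart(counts):
    """Return one text line per category, with a bar of '#' symbols."""
    lines = []
    for label, count in counts.items():
        lines.append(f"{label} | {'#' * count} {count}")
    return lines

for line in bar_chart(grade_counts):
    print(line)
```

The longest bar shows at a glance which grade band has the most students, which is harder to see from the raw numbers alone.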
Types of Problems during Collection of Data:
1. Erroneous Data:
   Incorrect values
   Invalid or null values
2. Missing Data
3. Outliers
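The three kinds of problems listed above can be spotted with a short Python sketch. The marks are made-up data. The valid range of 0 to 100 and the 1.5 x IQR rule used to flag outliers are assumptions chosen for this example, and other rules are also in common use.

```python
import statistics

# Made-up marks out of 100; None stands for a value that was not recorded.
raw_marks = [72, 85, None, 90, -5, 78, 8, 81, 69, 88, 150]

# Missing data: positions where no value was recorded.
missing = [i for i, v in enumerate(raw_marks) if v is None]

# Erroneous data: recorded values outside the valid range 0 to 100.
present = [v for v in raw_marks if v is not None]
erroneous = [v for v in present if not 0 <= v <= 100]
valid = [v for v in present if 0 <= v <= 100]

# Outliers: valid values lying far from the rest, found here with the
# 1.5 x IQR rule of thumb (an assumption, not a method from the notes).
q1, _, q3 = statistics.quantiles(valid, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in valid if v < low or v > high]

print("Missing at positions:", missing)  # [2]
print("Erroneous values:", erroneous)    # [-5, 150]
print("Outliers:", outliers)             # [8]
```

The value 8 is a possible mark, so it is not erroneous, but it is far below the other marks and is therefore flagged as an outlier.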