Unit 3 Advanced Python
CSV (comma-separated values) files are delimited text files that store tabular data (data arranged in rows and columns).
Each line in a CSV file is a data record. Each record consists of one or more fields (columns).
Working with CSV files in Python
Sample Program-10
Write a Program to open a csv file students.csv and display its details
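One possible version of Sample Program-10 is sketched below. The exact contents of students.csv are not given in the notes, so the example first creates a small file with assumed columns (Roll, Name, Marks) to stay self-contained, then opens it and displays its details.

```python
import csv

# Create a sample students.csv so the example is self-contained
# (the column names and values here are assumptions for illustration).
with open("students.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Roll", "Name", "Marks"])   # header row
    writer.writerow([1, "Amit", 88])
    writer.writerow([2, "Priya", 92])

# Open the csv file and display each record (row)
with open("students.csv", "r", newline="") as f:
    reader = csv.reader(f)
    rows = list(reader)
    for row in rows:
        print(row)
```

Note that csv.reader returns every field as a string; numeric fields must be converted explicitly if calculations are needed.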
INTRODUCING LIBRARIES
A library in Python refers to a collection of reusable modules or functions that provide specific
functionality.
For example, the "math" library contains numerous functions like sqrt(), pow(), fabs(), and sin(),
which facilitate mathematical operations and calculations.
To utilize a library in a program, it must first be imported, e.g. with an "import math" statement.
This gives the program access to the functionality provided by the math library.
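A quick illustration of importing and using the math library:

```python
import math

print(math.sqrt(16))    # square root -> 4.0
print(math.pow(2, 5))   # power -> 32.0
print(math.fabs(-7.5))  # absolute value -> 7.5
print(math.sin(0))      # sine (angle in radians) -> 0.0
```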
NUMPY
NumPy, which stands for Numerical Python, is a powerful library in Python used for numerical
computing.
NumPy provides the ndarray (N-dimensional array) data structure, which represents arrays of any
dimension. These arrays are homogeneous (all elements are of the same data type) and can
contain elements of various numerical types (integers, floats, etc.)
NumPy can be installed using Python's package manager, pip: pip install numpy
Creating a NumPy Array
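A minimal sketch of creating arrays with np.array() (the sample values are illustrative):

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50])     # 1-D array from a list
b = np.array([[1, 2, 3], [4, 5, 6]])   # 2-D array from a nested list

print(a)
print(a.dtype, a.shape)   # element type and dimensions
print(b)
print(b.ndim, b.shape)    # number of dimensions and shape
```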
PANDAS ("Panel Data" / "Python Data Analysis")
The Pandas library is a Python tool for data analysis and manipulation.
Pandas is well suited for working with tabular data, such as CSV or SQL tables
Pandas can be installed using: pip install pandas
Pandas generally provides two data structures for manipulating data: Series and
DataFrame.
Series
A Series is a one-dimensional array containing a sequence of values of any data type (int, float,
list, string, etc.). By default, the values have numeric data labels starting from zero. The data
label associated with a particular value is called its index. We can also assign values of other
data types as the index.
Create a simple Pandas Series from a list:
import pandas as pd
a = ['Mark', 'Justin', 'John', 'Vicky']
myvar = pd.Series(a)
print(myvar)
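Since the notes mention that values of other data types can be assigned as the index, here is a short sketch with string labels (the subject names and marks are illustrative):

```python
import pandas as pd

# A Series with string labels as the index instead of the default 0, 1, 2
marks = pd.Series([90, 85, 78], index=['Maths', 'Science', 'Hindi'])
print(marks)
print(marks['Science'])   # access a value by its index label
```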
DataFrame
A DataFrame is a two-dimensional labelled data structure, like a table, whose columns can hold
values of different data types.
Creation of a DataFrame
import pandas as pd
lst = [10,20,30,40,50]
df = pd.DataFrame(lst)
print(df)
Dealing with Rows and Columns
import pandas as pd
data = [ [90, 92, 89, 81, 94], [91, 81, 91, 71, 95], [97, 96, 88, 67, 99] ]
columns = ['Rajat', 'Amrita', 'Meenakshi', 'Rose', 'Karthika']
index = ['Maths', 'Science', 'Hindi']
Result = pd.DataFrame(data, index=index, columns=columns)
print(Result)
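Building on the Result DataFrame above, individual rows and columns can be accessed with standard pandas selection (the frame is recreated here so the sketch runs on its own):

```python
import pandas as pd

data = [[90, 92, 89, 81, 94], [91, 81, 91, 71, 95], [97, 96, 88, 67, 99]]
columns = ['Rajat', 'Amrita', 'Meenakshi', 'Rose', 'Karthika']
index = ['Maths', 'Science', 'Hindi']
Result = pd.DataFrame(data, index=index, columns=columns)

print(Result['Rajat'])              # one column, returned as a Series
print(Result.loc['Science'])        # one row, selected by its label
print(Result.loc['Maths', 'Rose'])  # a single value -> 81
```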
Missing or Not Available data can occur when no information is provided for an item.
In a DataFrame, a missing value is stored as NaN (Not a Number).
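A small sketch of how a missing value appears as NaN (the data is made up for illustration):

```python
import pandas as pd
import numpy as np

# Marks for one student are not provided, so pandas stores NaN there
df = pd.DataFrame({'Name': ['Rajat', 'Rose'], 'Marks': [90, np.nan]})
print(df)
print(df.isnull())   # True wherever a value is missing
```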
Attributes of DataFrames
Key Features:
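Commonly used DataFrame attributes can be sketched as follows (the example frame is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4.0, 5.0, 6.0]})
print(df.index)    # row labels
print(df.columns)  # column labels
print(df.dtypes)   # data type of each column
print(df.shape)    # (number of rows, number of columns)
print(df.size)     # total number of elements
print(df.T)        # transpose: rows and columns interchanged
```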
Here, each row represents a sample (i.e., an iris flower), and each column represents a feature (i.e., a
measurement of the flower). For example, the first row [ 5.1 3.5 1.4 0.2] corresponds to an iris flower with
the following measurements:
Sepal length: 5.1 cm
Sepal width: 3.5 cm
Petal length: 1.4 cm
Petal width: 0.2 cm
Datasets are usually split into a training set and a testing set.
The training set is used to train the model and the testing set is used to test the model.
Most common splitting ratio is 80: 20. (Training -80%, Testing-20%)
from sklearn.model_selection import train_test_split
    imports the train_test_split function.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
X_train, y_train: the feature vectors and target variables of the training set, respectively.
X_test, y_test: the feature vectors and target variables of the testing set, respectively.
test_size = 0.2: specifies that 20% of the data will be used for testing, and the remaining 80% will be used for training.
random_state = 1: ensures reproducibility by fixing the random seed. This means that every time you run the code, the same split will be generated.
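The call described above can be tried on a small synthetic dataset (the X and y values here are made up for illustration; scikit-learn must be installed):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)   # 20 samples with 2 features each
y = np.array([0, 1] * 10)          # 20 target labels

# 80% of the samples go to training, 20% to testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

print(X_train.shape, X_test.shape)   # 16 training rows, 4 testing rows
```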
Scikit-learn has a wide range of Machine Learning (ML) algorithms which share a consistent interface for
fitting, predicting, scoring accuracy, recall, etc. Here we are going to use the KNN (K-Nearest Neighbors) classifier.
from sklearn.neighbors import KNeighborsClassifier
    imports KNeighborsClassifier, a type of supervised learning algorithm used for classification tasks.
knn = KNeighborsClassifier(n_neighbors=3): creates an instance of the KNeighborsClassifier class. n_neighbors = 3 indicates that the classifier will consider the 3 nearest neighbors when making predictions. This is a hyperparameter that can be tuned to improve the performance of the classifier.
knn.fit(X_train, y_train): trains the KNeighborsClassifier model using the fit method. It constructs a representation of the training data that allows the model to make predictions based on the input features.
y_pred = knn.predict(X_test): the knn object contains the trained model and is used to make predictions on new, unseen data.
This calculates the accuracy of the model by comparing the predicted target values (y_pred) with the
actual target values (y_test). The accuracy_score represents the proportion of correctly predicted
instances out of all instances in the testing set.
from sklearn import metrics
Accuracy = metrics.accuracy_score(y_test, y_pred)
Now, to validate the model's predictive accuracy, we can use some sample data.
sample = [[5, 5, 3, 2], [2, 4, 3, 5]]
preds = knn.predict(sample)
pred_species = []
for p in preds:
    pred_species.append(iris.target_names[p])
print("Predictions:", pred_species)
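The whole workflow above can be collected into one runnable sketch, assuming scikit-learn is installed (the two sample flowers are the ones used in the text):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# Load the iris dataset: 150 samples, 4 features each
iris = load_iris()
X, y = iris.data, iris.target

# 80:20 train/test split with a fixed random seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Train a KNN classifier that considers the 3 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Evaluate on the held-out testing set
y_pred = knn.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Predict the species of two new sample flowers
sample = [[5, 5, 3, 2], [2, 4, 3, 5]]
pred_species = [iris.target_names[p] for p in knn.predict(sample)]
print("Predictions:", pred_species)
```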