
Module 1 [20MCA31] Data Analytics using Python

Module -1
Python Basic Concepts and Programming

• Variables, Keywords
• Statements and Expressions
• Operators, Precedence and Associativity
• Data Types, Indentation, Comments
• Reading Input, Print Output
• Type Conversions, The type( ) Function and Is Operator
• Control Flow Statements
— The if Decision Control Flow Statement,
— The if…else Decision Control Flow Statement
— The if…elif…else Decision Control Statement
— Nested if Statement
— The while Loop
— The for Loop
— The continue and break Statements
• Built-In Functions, Commonly Used Modules
• Function Definition and Calling the Function
• The return Statement and void Function
• Scope and Lifetime of Variables
• Default Parameters, Keyword Arguments
• *args and **kwargs, Command Line Arguments

ROOPA.H.M, Dept of MCA, RNSIT Page 1



Introduction

Why choose the Python language?


There are many reasons:

• Easy-to-learn − Python has few keywords, simple structure, and a clearly defined syntax.
This allows the student to pick up the language quickly.

• Easy-to-read − Python code is more clearly defined and visible to the eyes.

• Easy-to-maintain − Python's source code is fairly easy-to-maintain.

• A broad standard library − The bulk of Python's library is very portable and cross-platform
compatible on UNIX, Windows, and Macintosh.

• Interactive Mode − Python has support for an interactive mode which allows interactive
testing and debugging of snippets of code.

• Portable − Python can run on a wide variety of hardware platforms and has the same
interface on all platforms.

• Databases − Python provides interfaces to all major commercial databases.

• Scalable − Python provides a better structure and support for large programs than other
scripting languages.

Conversing with Python


• Before we can converse with Python, we must first install the Python software on the
computer and learn how to start it.
• There are multiple IDEs (Integrated Development Environments) available for working with
Python. Some of them are PyCharm, LiClipse, IDLE, Jupyter, Spyder, etc.
• When you install Python, the IDLE editor is available automatically. Apart from these
editors, a Python program can also be run from the command prompt. Install an IDE that suits
your needs and your operating system, because different sets of editors (IDEs) are available
for different operating systems such as Windows, UNIX, Ubuntu, and Mac.
• The basic Python can be downloaded from the link: https://www.python.org/downloads/

Python has a rich set of libraries for purposes such as large-scale data processing,
predictive analytics, and scientific computing. The required packages can be downloaded as
needed. However, there is a free, open-source distribution called Anaconda, which simplifies
package management and deployment. Hence, readers are advised to install Anaconda from the
link given below, rather than installing plain Python.

https://anaconda.org/anaconda/python

A successful installation of Anaconda provides Python at a command prompt, the default
editor IDLE, and a browser-based interactive computing environment known as Jupyter
notebook.

Terminology: Interpreter and compiler


• Python is a high-level language intended to be relatively straightforward for humans to
read and write and for computers to read and process.
• The CPU understands a language called machine language. Machine language is very
tiresome to write because it is represented entirely in zeros and ones:
001010001110100100101010000001111 11
100110000011101010010101101101 ...

Hence, programs written in a high-level programming language have to be translated into
machine language using translators such as (1) interpreters and (2) compilers.

Figure: The source program is given either to an interpreter, which interprets it immediately
so that every line of code can generate output immediately, or to a compiler, which produces
executable files (with extensions .exe, .dll).

The difference between an interpreter and a compiler is given below:

Interpreter | Compiler
Translates the program one statement at a time. | Scans the entire program and translates it as a whole into machine code.
Takes less time to analyze the source code, but the overall execution time is slower. | Takes a large amount of time to analyze the source code, but the overall execution time is comparatively faster.
No intermediate object code is generated, hence it is memory efficient. | Generates intermediate object code which further requires linking, hence requires more memory.
Continues translating the program until the first error is met, in which case it stops. Hence debugging is easy. | Generates the error message only after scanning the whole program. Hence debugging is comparatively hard.
Programming languages like Python and Ruby use interpreters. | Programming languages like C and C++ use compilers.

Writing a program
• A program can be written using a text editor.
• The Python instructions are written into a file, which is called a script. By convention,
Python scripts have names that end with .py .
• To execute the script, you have to tell the Python interpreter the name of the file. In a
command window, you would type python hello.py as follows:
$ cat hello.py
print('Hello world!')
$ python hello.py
Hello world!
The “$” is the operating system prompt, and “cat hello.py” shows us that the file
“hello.py” contains a one-line Python program that prints a string. We call the Python
interpreter and tell it to read its source code from the file “hello.py” instead of prompting
us for lines of Python code.

Program Execution

The execution of the Python program involves 2 Steps:


— Compilation
— Interpreter

Compilation
The program is converted into byte code. Byte code is a fixed set of instructions that represent
arithmetic, comparison, memory operations, etc. It can run on any operating system and
hardware. The byte code instructions are stored in a .pyc file. The compiler creates a
directory named __pycache__ where it stores the .pyc file (the standard CPython interpreter
does this for imported modules).

Interpreter
The next step is to execute the byte code. This step is necessary because the processor can
understand only machine code (binary code), not byte code. The Python Virtual Machine (PVM)
takes into account the operating system and processor of the computer and carries out each
byte code instruction using the corresponding machine code instructions. These machine code
instructions are executed by the processor and the results are displayed.
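The two steps described above can be observed from inside Python itself. The following is a minimal sketch, assuming the standard CPython interpreter; the variable names (source, code_object, namespace) are chosen only for this illustration, and the exact byte code listing printed by dis differs between Python versions.

```python
import dis

# Step 1 (compilation): turn source text into a byte code object.
source = "total = 1 + 2\n"
code_object = compile(source, "<example>", "exec")

# Show the byte code instructions (the listing varies by Python version).
dis.dis(code_object)

# Step 2 (interpretation): the Python Virtual Machine runs the byte code.
namespace = {}
exec(code_object, namespace)
print(namespace["total"])
```

In everyday use both steps happen automatically when you run python hello.py; the explicit calls here only make the stages visible.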

Flow Controls

There are some low-level conceptual patterns that we use to construct programs. These
constructs are not just for Python programs, they are part of every programming language
from machine language up to the high-level languages. They are listed as follows:
• Sequential execution: Perform statements one after another in the order they are
encountered in the script.
• Conditional execution: Check for certain conditions and then execute or skip a sequence
of statements. (Ex: statements with if-elif-else)
• Repeated execution: Perform some set of statements repeatedly, usually with some
variation. (Ex: statements with for, while loop)
• Reuse: Write a set of instructions once and give them a name and then reuse those
instructions as needed throughout your program. (Ex: statements in functions)
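The four patterns listed above can appear together even in a very small script. The sketch below is only an illustration; the function name classify and the sample numbers are assumptions made for this example.

```python
def classify(number):
    # Reuse: these instructions get a name and can be called many times.
    # Conditional execution: only one of the two branches runs.
    if number % 2 == 0:
        return "even"
    else:
        return "odd"

# Sequential execution: these statements run one after another.
total = 0
labels = []

# Repeated execution: the loop body runs once per number.
for n in [1, 2, 3, 4]:
    total = total + n
    labels.append(classify(n))

print(total)
print(labels)
```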

Identifiers

• A Python identifier is a name used to identify a variable, function, class, module or other
object. An identifier starts with a letter A to Z or a to z or an underscore (_) followed by zero
or more letters, underscores and digits (0 to 9).
• Python does not allow punctuation characters such as @, $, and % within identifiers. Python
is a case sensitive programming language. Thus, Manpower and manpower are two different
identifiers in Python.
• Here are the naming conventions for Python identifiers:
— Class names start with an uppercase letter. All other identifiers start with a lowercase
letter.
— Starting an identifier with a single leading underscore indicates that the identifier is
private.


— Starting an identifier with two leading underscores indicates a strongly private identifier.
— If the identifier also ends with two trailing underscores, the identifier is a language-
defined special name.

Variables, Keywords

Variables
• A variable is a named place in the memory where a programmer can store data and later
retrieve the data using the variable name.
• An assignment statement creates new variables and gives them values.
• In Python, a variable need not be declared with a specific type before its usage. Its type
is decided by the value assigned to it.
Values and types
A value is one of the basic things a program works with, like a letter or a number.
Consider the following example, which uses integers, strings, and floats:
>>> print(4)
4
>>> message = 'And now for something completely different'
>>> n = 17
>>> pi = 3.1415926535897931

• The above example makes three assignments.


— The first assigns a string to a new variable named message.
— The second assigns the integer 17 to n
— The third assigns the (approximate) value of π to pi.

• To display the value of a variable, you can use the print function:
>>> print(n)
17
>>> print(pi)
3.141592653589793
• The type of a variable is the type of the value it refers to. For the variables in the above
example:
>>> type(message)
<class 'str'> #type refers to string
>>> type(n)
<class 'int'> #type refers to integer
>>> type(pi)
<class 'float'> #type refers to float


• Rules to follow when naming the variables.


— Variable names can contain letters, numbers, and the underscore.
— Variable names cannot contain spaces and other special characters.
— Variable names cannot start with a number.
— Case matters—for instance, temp and Temp are different.
— Keywords cannot be used as a variable name.

• Examples of valid variable names are: Spam, eggs, spam23, _speed

• Variable names can be arbitrarily long. In long names, the underscore character (_) can be
used to separate words. Ex: my_first_variable .
• As Python is case-sensitive, the variable name sample is different from SAMPLE .
• Variable names can start with an underscore character, but we generally avoid doing this
unless we are writing library code for others to use.
• Examples of invalid variable names are listed below:
>>> 76trombones = 'big parade' # illegal because it begins with a number
SyntaxError: invalid syntax

>>> more@ = 1000000 # illegal because it contains an illegal character, @.
SyntaxError: invalid syntax

>>> class = 'Advanced' # class is one of Python’s keywords
SyntaxError: invalid syntax
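The naming rules can also be checked programmatically instead of waiting for a SyntaxError. The sketch below relies on the built-in string method isidentifier() and the standard keyword module; the helper name is_valid_variable_name is an assumption made for this illustration.

```python
import keyword

def is_valid_variable_name(name):
    """Return True if name is a legal identifier and not a reserved keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)

# Names taken from the examples in the text.
for candidate in ["Spam", "eggs", "spam23", "_speed", "76trombones", "more@", "class"]:
    print(candidate, is_valid_variable_name(candidate))
```

Note that "class".isidentifier() is True on its own, which is why the keyword check is needed as a second condition.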

Keywords
Keywords are a list of reserved words that have predefined meaning. Keywords are special
vocabulary and cannot be used by programmers as identifiers for variables, functions, constants
or any other names. Attempting to use a keyword as an identifier name will cause an
error. The following table shows the Python keywords.

and       as        assert    async     await
break     class     continue  def       del
elif      else      except    False     finally
for       from      global    if        import
in        is        lambda    None      nonlocal
not       or        pass      raise     return
True      try       while     with      yield


Statements and Expressions

Statement
• A statement is a unit of code that the Python interpreter can execute.
• We have seen two kinds of statements:
— assignment statement: We assign a value to a variable using the assignment statement
(=). An assignment statement consists of an expression on the
right-hand side and a variable to store the result. In Python, there is a special feature for
multiple assignment, where more than one variable can be initialized in a single
statement.
Ex: name = "google"
y = 5
x = 20 + y
a, b, c = 2, "B", 3.5
print statement: print is a function which takes a string or a variable as an argument and
displays it on the screen.
Following are examples of statements:
>>> x=5 #assignment statement
>>> x=5+3 #assignment statement
>>> print(x) #printing statement
8

Optional arguments with print statement:


sep : Python will insert a space between each of the arguments of the print function. There is an
optional argument called sep, short for separator, that you can use to change that space to
something else. For example, using sep=':' would separate the arguments by a colon and sep='##'
would separate the arguments by two pound signs.
>>> print("a","b","c","d",sep=";")
a;b;c;d

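end : After printing its arguments, the print function automatically advances to the next line.
The optional argument end replaces that newline with something else. For instance, the
following prints on three lines, because end=" " keeps C and E on the same line:
Code:
print("A")
print("B")
print("C", end=" ")
print("E")
Output:
A
B
C E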
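The effect of sep and end can be tried together in one short script. This is a small sketch; the function name show_print_options is an assumption made for this illustration, and wrapping the calls in a function simply makes them easy to rerun.

```python
def show_print_options():
    # sep replaces the single space normally placed between arguments.
    print("a", "b", "c", "d", sep=";")
    # Without end, each call finishes with a newline.
    print("A")
    print("B")
    # end=" " replaces the newline, so the next print continues on this line.
    print("C", end=" ")
    print("E")

show_print_options()
```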


Expressions
An expression is a combination of values, variables, and operators. A value all by itself is
considered an expression, and so is a variable, so the following are all legal expressions. If you
type an expression in interactive mode, the interpreter evaluates it and displays the result:
>>> x=5
>>> x+1
6

Operators, Precedence and Associativity

• Operators are special symbols that represent computations like addition and multiplication.
The values the operator is applied to are called operands.
• Here is the list of arithmetic operators:
Operator | Meaning | Example
+ | Addition | Sum = a+b
- | Subtraction | Diff = a-b
* | Multiplication | Prod = a*b
/ | Division | a=2, b=3, div=a/b (div will get the value 0.6666666666666666)
// | Floor division: returns only the integral part of the quotient after division | F = a//b, A = 4//3 (A will get the value 1)
% | Modulus: remainder after division | A = a%b (remainder after dividing a by b)
** | Exponent | E = x**y (means x to the power of y)
• Relational or Comparison Operators: are used to check the relationship (like less than,
greater than etc) between two operands. These operators return a Boolean value either True
or False.
• Assignment Operators: Apart from simple assignment operator = which is used for assigning
values to variables, Python provides compound assignment operators.
• For example,
Statement | Compound statement
x=x+y | x+=y
y=y//2 | y//=2

• Logical Operators: The logical operators and, or, not are used for combining or negating the
logical values of their operands. When the operands are Boolean expressions, as in the
examples below, the result is always a Boolean value, True or False.

Operator: and
Statement: x > 0 and x < 10
Comment: Is true only if x is greater than 0 and less than 10
Example:
>>> x = 5
>>> x > 0 and x < 10
True
>>> x = -5
>>> x > 0 and x < 10
False

Operator: or
Statement: n%2 == 0 or n%3 == 0
Comment: Is true if at least one of the conditions is true
Example:
>>> n = 2
>>> n%2 == 0 or n%3 == 0
True

Operator: not
Statement: not (x > y)
Comment: Negates a Boolean expression
Example:
>>> x = 5
>>> x > 0 and x < 10
True
>>> not (x > 0 and x < 10)
False

Precedence and Associativity (Order of operations)


• When an expression contains more than one operator, the evaluation of operators depends
on the precedence of operators.
• The Python operators follow the precedence rule (which can be remembered as PEMDAS) as
given below :
— Parenthesis have the highest precedence in any expression. The operations within
parenthesis will be evaluated first.
— Exponentiation has the 2nd precedence. But, it is right associative. That is, if there are two
exponentiation operations continuously, it will be evaluated from right to left (unlike most
of other operators which are evaluated from left to right). For example

>>> print(2**3**2) #It is 512, i.e., 2**(3**2) = 2**9

— Multiplication and Division are the next priority. Out of these two operations, whichever
comes first in the expression is evaluated.

>>> print(5*2/4) #multiplication and then division
2.5
>>> print(5/4*2) #division and then multiplication
2.5


— Addition and Subtraction are the least priority. Out of these two operations, whichever
appears first in the expression is evaluated i.e., they are evaluated from left to right .

Example : x = 1 + 2 ** 3 / 4 * 5
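The example expression can be worked through one operator at a time using the precedence rules above. This is a sketch of the evaluation order; the intermediate names step1, step2 and step3 are assumptions introduced only to make each stage visible.

```python
# Evaluate 1 + 2 ** 3 / 4 * 5 in the order Python uses.
step1 = 2 ** 3          # exponentiation first
step2 = step1 / 4       # then division (leftmost of / and *)
step3 = step2 * 5       # then multiplication
x_by_steps = 1 + step3  # addition last

x = 1 + 2 ** 3 / 4 * 5
print(x_by_steps, x)

# Exponentiation is right associative: 2 ** 3 ** 2 means 2 ** (3 ** 2).
right_assoc = 2 ** 3 ** 2
left_grouped = (2 ** 3) ** 2
print(right_assoc, left_grouped)
```

Adding parentheses, as in the last line, is the usual way to override the default order.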

Data Types, Indentation, Comments

Data Types
Data types specify the type of data like numbers and characters to be stored and manipulated
within a program.
Basic data types of Python are
• Numbers
• Boolean
• Strings
• list
• tuple
• dictionary
• None

• Numbers
Integers, floating-point numbers and complex numbers fall under the Python numbers category.
They are defined as the int, float and complex classes in Python. Integers can be of any
length; they are limited only by the memory available. A floating-point number is accurate to
about 15 significant decimal digits. Integers and floating-point numbers are distinguished by
the decimal point: 1 is an integer, 1.0 is a floating-point number. Complex numbers are
written in the form x + yj, where x is the real part and y is the imaginary part.
• Boolean
Booleans may not seem very useful at first, but they are essential when you start using
conditional statements. Boolean value is, either True or False. The Boolean values, True and
False are treated as reserved words.


• Strings
A string consists of a sequence of characters, which can include letters, numbers,
and other types of characters. A string can also contain spaces. You can use single quotes or
double quotes to represent a string; a string written this way is also called a string literal.
Multiline strings can be denoted using triple quotes, ''' or """. String literals are fixed
values, not variables, that you literally provide in your script.
For example,
>>> s = 'This is single quote string'
>>> s = "This is double quote string"
>>> s = '''This
is Multiline
string'''
• List
A list is formed (or created) by placing all the items (elements) inside square brackets [ ],
separated by commas. It can have any number of items, and they may or may not be of different
types (integer, float, string, etc.).
Example : List1 = [3,8,7.2,"Hello"]

• Tuple
A tuple is an ordered collection of Python objects. The main difference between a tuple
and a list is that tuples are immutable, i.e. a tuple cannot be modified after it is created. It is
represented by the tuple class. We can write tuples using parentheses ( ).
Example: Tuple = (25,10,12.5,"Hello")

• Dictionary
A dictionary is a collection of data values that is used to store data like a map. Unlike
other data types, which hold only a single value as an element, a dictionary holds key-value
pairs (since Python 3.7, a dictionary also preserves the order in which the pairs were
inserted). Storing values by key makes looking them up efficient. In the representation of a
dictionary, each key is separated from its value by a colon (:), whereas the key-value pairs are
separated from each other by commas.
Example: Dict1 = {1 : 'Hello' , 2 : 5.5, 3 : 'World' }

• None
None is another special data type in Python. None is frequently used to represent the absence
of a value. For example, >>> money = None
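The basic data types listed in this section can be inspected with the type() function. The sketch below reuses the sample values from the text where possible; the names samples, type_names, numbers and point are assumptions made for this illustration.

```python
samples = [
    17,                         # int
    3.14,                       # float
    2 + 3j,                     # complex
    True,                       # bool
    "Hello",                    # str
    [3, 8, 7.2, "Hello"],       # list
    (25, 10, 12.5, "Hello"),    # tuple
    {1: "Hello", 2: 5.5},       # dict
    None,                       # NoneType
]
type_names = [type(value).__name__ for value in samples]
for value, name in zip(samples, type_names):
    print(name, value)

# A list can be changed after it is created ...
numbers = [3, 8, 7.2]
numbers[0] = 99

# ... but a tuple cannot: item assignment raises TypeError.
point = (25, 10)
try:
    point[0] = 99
    tuple_changed = True
except TypeError:
    tuple_changed = False
print(numbers, point, tuple_changed)
```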


Indentation
• In Python, programs are structured through indentation (see the figure below).

Figure : Code blocks and indentation in Python.


• Usually, we expect indentation from any program code, but in Python it is a requirement and
not a matter of style. This principle makes the code look cleaner and easier to understand
and read.

• Statements written under another statement with the same indentation are interpreted as
belonging to the same code block. A following statement with less indentation marks the end
of the previous code block.

• In other words, if a code block has to be deeply nested, then the nested statements need to be
indented further to the right. In the above diagram, Block 2 and Block 3 are nested under
Block 1. Usually, four spaces are used for indentation and are preferred over tabs.
Incorrect indentation will result in an IndentationError.
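Since the figure itself is not reproduced here, the block structure it describes can be sketched in code. The function name describe and the comments labelling Block 1, 2 and 3 are assumptions that mirror the wording of the text; the second part compiles a deliberately mis-indented snippet to show the resulting IndentationError.

```python
def describe(number):
    # Block 1: the body of the function
    if number > 0:
        # Block 2: nested under Block 1
        result = "positive"
    else:
        # Block 3: also nested under Block 1
        result = "zero or negative"
    # Back in Block 1: less indentation ended the nested block
    return result

print(describe(5))
print(describe(-1))

# The body of the if statement below is not indented, so compiling it fails.
bad_source = "if True:\nprint('not indented')\n"
try:
    compile(bad_source, "<example>", "exec")
    indentation_error_raised = False
except IndentationError as error:
    indentation_error_raised = True
    print("IndentationError:", error.msg)
```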

Comments
• As programs get bigger and more complicated, they get more difficult to read. Formal
languages are dense, and it is often difficult to look at a piece of code and figure
out what it is doing, or why.

• For this reason, it is a good idea to add notes to your programs to explain in natural language
what the program is doing. These notes are called comments, and in Python they start with
the # symbol. Python has no separate multiline comment syntax; a triple-quoted string that is
not assigned to anything is commonly used for that purpose:
Ex1. #This is a single-line comment
Ex2. ''' This is a
multiline
comment '''


• Comments are most useful when they document non-obvious features of the code. It is
reasonable to assume that the reader can figure out what the code does; it is much more
useful to explain why.

Reading Input, Print Output

Reading Input
Python provides a built-in function called input that gets input from the keyboard. When this
function is called, the program waits for the user's input. When the user presses the Enter key,
the program resumes and input returns what the user typed as a string.
For example
>>> inp = input()
Welcome to world of python
>>> print(inp)
Welcome to world of python

• It is a good idea to have a prompt message telling the user about what to enter as a value.
You can pass that prompt message as an argument to the input function.
>>> x = input('Please enter some text:\n')
Please enter some text:
Roopa
>>> print(x)
Roopa
• The sequence \n at the end of the prompt represents a newline, which is a special character
that causes a line break. That’s why the user’s input appears below the prompt.
• If you expect the user to type an integer, you can try to convert the return value to int using
the int() function:
Example1:
>>> prompt = 'How many days in a week?\n'
>>> days = input(prompt)
How many days in a week?
7
>>> type(days)
<class 'str'> #by default the value is treated as a string

Example 2:
>>> x=int(input('enter number\n'))
enter number
12
>>> type(x)
<class 'int'>
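Because input always returns a string, the conversion with int() fails when the user types something that is not a whole number. One possible way to handle this is sketched below; the helper name parse_whole_number is an assumption made for this illustration, and it takes the typed text as a parameter so that it can be tried without waiting for keyboard input.

```python
def parse_whole_number(text):
    """Convert text typed by the user to an int, or return None if that is not possible."""
    try:
        return int(text)
    except ValueError:
        # int() raises ValueError for text such as 'seven' or '7.5'
        return None

print(parse_whole_number("7"))
print(parse_whole_number("seven"))
```

In a real script the argument would typically come from a call such as input('How many days in a week?\n').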


Print Output
Format operator
• The format operator, % allows us to construct strings, replacing parts of the strings with the
data stored in variables.
• When applied to integers, % is the modulus operator. But when the first operand is a string,
% is the format operator.
• For example, the format sequence “%d” means that the operand should be formatted as an
integer (d stands for “decimal”):
Example 1:
>>> camels = 42
>>> '%d' % camels
'42'
• A format sequence can appear anywhere in the string, so you can embed a value in a
sentence:
Example 2 :
>>> camels = 42
>>> 'I have spotted %d camels.' % camels
'I have spotted 42 camels.'
• If there is more than one format sequence in the string, the second argument has to be a
tuple. Each format sequence is matched with an element of the tuple, in order.
• The following example uses “%d” to format an integer, “%g” to format a floating-point
number, and “%s” to format a string:
Example 3:
>>> 'In %d years I have spotted %g %s.' % (3, 0.1, 'camels')
'In 3 years I have spotted 0.1 camels.'

Format function
format() : is one of the string formatting methods in Python 3, which allows multiple
substitutions and value formatting. This method lets us insert values into a string
through positional formatting.
Two types of Parameters:
— positional_argument
— keyword_argument
• Positional argument: It can be an integer, a floating-point constant, a string, a character
or even a variable.


• Keyword argument: It is essentially a variable storing some value, which is passed as a
parameter.

# To demonstrate the use of formatters with positional and keyword arguments.

Positional arguments are placed in order:
>>> print("{0} college {1} department".format("RNSIT","EC"))
RNSIT college EC department

Reverse the index numbers with the parameters of the placeholders:
>>> print("{1} department {0} college".format("RNSIT","EC"))
EC department RNSIT college

Positional arguments are not specified. By default, positioning starts from zero:
>>> print("Every {} should know the use of {} {} programming and {}".format("programmer", "Open", "Source", "Operating Systems"))
Every programmer should know the use of Open Source programming and Operating Systems

Use the index numbers of the values to change the order in which they appear in the string:
>>> print("Every {3} should know the use of {2} {1} programming and {0}".format("programmer", "Open", "Source", "Operating Systems"))
Every Operating Systems should know the use of Source Open programming and programmer

Keyword arguments are called by their keyword name:
>>> print("EC department {0} 'D' section {college}".format("6", college="RNSIT"))
EC department 6 'D' section RNSIT

f-strings
Formatted strings or f-strings were introduced in Python 3.6. An f-string is a string literal that
is prefixed with “f”. These strings may contain replacement fields, which are expressions
enclosed within curly braces {}. The expressions are replaced with their values.

Example:
>>> a=10
>>> print(f"the value is {a}")
the value is 10
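The three formatting approaches covered in this section (the % operator, the format() method and f-strings) can produce the same result, which makes them easy to compare side by side. The sentence and the variable names below are assumptions made for this illustration.

```python
camels = 42
years = 3

# 1. Format operator
with_operator = 'In %d years I have spotted %d camels.' % (years, camels)

# 2. format() method with positional placeholders
with_format = 'In {} years I have spotted {} camels.'.format(years, camels)

# 3. f-string with the expressions written directly inside the braces
with_fstring = f'In {years} years I have spotted {camels} camels.'

print(with_operator)
print(with_format)
print(with_fstring)

# A replacement field can hold any expression, and an optional format
# specification after a colon, here two digits after the decimal point.
average = f'{camels / years:.2f} camels per year'
print(average)
```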


Type Conversions, The type( ) Function and Is Operator

Type conversion functions


Python also provides built-in functions that convert values from one type to another.
Table : Type conversion functions

int()
Ex1: >>> int('32')
32
Ex2: >>> int('Hello')
ValueError: invalid literal for int() with base 10: 'Hello'
Ex3: >>> int(3.9999)
3
Ex4: >>> int(-2.3)
-2

float()
Ex1: >>> float(32)
32.0
Ex2: >>> float('3.124')
3.124

str()
Ex1: >>> str(32)
'32'
Ex2: >>> str(3.124)
'3.124'
• int can convert floating-point values to integers, but it doesn’t round off; it chops off the
fraction part.
• float converts integers and strings to floating-point numbers.
• str converts its argument to a string.
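The practical effect of these conversions shows up when the same value is used as text and as a number. The sketch below also contrasts the truncation done by int() with the built-in round() and with math.floor(); the variable names are assumptions made for this illustration.

```python
import math

text = "32"
joined = text + "1"          # string concatenation
added = int(text) + 1        # numeric addition after conversion

truncated = int(3.9999)      # the fraction part is chopped off
rounded = round(3.9999)      # rounds to the nearest integer instead
toward_zero = int(-2.3)      # chopping moves toward zero
floored = math.floor(-2.3)   # floor moves toward negative infinity

# Text that does not look like a number cannot be converted.
try:
    int("Hello")
    conversion_failed = False
except ValueError:
    conversion_failed = True

print(joined, added)
print(truncated, rounded, toward_zero, floored)
print(conversion_failed)
```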

The type( ) Function


• The type function is called to find out the data type of a value. The expression in
parentheses is called the argument of the function. The argument is a value or variable that
we pass into the function as its input.
Example: >>> type(33)
<class 'int'>

Is Operator
• If we run these assignment statements:
a = 'banana'
b = 'banana'


Figure: (a)

• We know that a and b both refer to a string, but we don’t know whether they refer to the
same string. There are two possible states, shown in Figure (a).
• In one case, a and b refer to two different objects that have the same value. In the second
case, they refer to the same object. That is, a is an alias for b and vice versa. In other
words, both names refer to the same memory location.
• To check whether two variables refer to the same object, you can use the is operator.

>>> a = 'banana'
>>> b = 'banana'
>>> a is b
True

• When two variables refer to the same object, they are said to be identical.
• When two variables refer to different objects that contain the same value, they are
said to be equivalent.

>>> s1 = input("Enter a string:")
>>> s2 = input("Enter a string:")
>>> s1 is s2 #check s1 and s2 are identical
False
>>> s1 == s2 #check s1 and s2 are equivalent
True

• Here the same text was entered at both prompts, so s1 and s2 are equivalent, but they are
separate objects and hence not identical.

• If two objects are identical, they are also equivalent, but if they are equivalent, they are not
necessarily identical.
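The difference between identical and equivalent objects is easiest to see with lists, because two separately created lists are always distinct objects. This is a small sketch; the names a, b and c are assumptions made for this illustration.

```python
a = [1, 2, 3]
b = [1, 2, 3]   # a separate list with the same contents
c = a           # c is an alias: another name for the same object

equivalent = a == b   # same value
identical = a is b    # not the same object
alias = c is a        # same object

# Changing the object through one alias is visible through the other.
c.append(4)

print(equivalent, identical, alias)
print(a, b)
```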


Control Flow Statements

A conditional statement gives the developer the ability to check conditions and change the
behaviour of the program accordingly. The simplest form is the if statement:
1) The if Decision Control Flow Statement
Syntax:
if condition:
    statement 1
    statement 2
    ……………..
    statement n

Example:
x=1
if x>0:
    print("positive number")

Output:
positive number

• The Boolean expression after the if keyword is called the condition.


• The if statement consists of a header line that ends with the colon character (:) followed by an
indented block. Statements like this are called compound statements because they stretch
across more than one line.
• If the logical condition is true, then the indented block of statements gets executed. If the
logical condition is false, the indented block is skipped.
2) The if…else Decision Control Flow Statement (alternative execution)
A second form of the if statement is alternative execution, in which there are two possibilities
and the condition determines which one gets executed, as shown in the flowchart below.
The syntax looks like this:
Syntax:

if condition :
    statements
else :
    statements

Example:
x=6
if x%2 == 0:
    print('x is even')
else :
    print('x is odd')

Figure : if-Then-Else Logic


3) The if…elif…else Decision Control Statement (Chained conditionals)


If there are more than two possibilities, we need more than two branches. One way to
express a computation like that is a chained conditional. elif is an abbreviation of “else if.”
Again, exactly one branch will be executed, as shown in the flowchart below.

Figure : If-Then-ElseIf Logic

Syntax:
if condition1:
    Statement
elif condition2:
    Statement
....................
elif condition_n:
    Statement
else:
    Statement

Example:
x=1
y=6
if x < y:
    print('x is less than y')
elif x > y:
    print('x is greater than y')
else:
    print('x and y are equal')

Output:
x is less than y
4) Nested if Statement
• The conditional statements can be nested. That is, one set of conditional statements can be
nested inside the other.
• Let us consider an example in which the outer conditional statement contains two branches.

Example:
x=3
y=4
if x == y:
    print('x and y are equal')
else:
    if x < y:
        print('x is less than y')
    else:
        print('x is greater than y')

Output:
x is less than y


— The first branch contains a simple statement.


— The second branch contains another if statement, which has two branches of its own.
Those two branches are both simple statements, although they could have been
conditional statements as well, as shown in the flowchart below.

• Nested conditionals make the code difficult to read, even though there are proper
indentations. Hence, it is advised to use logical operators like and to simplify the nested
conditionals.

Figure : Nested If Statements
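The advice about replacing nested conditionals with a logical operator can be illustrated with a simple range check. Both functions below are hypothetical examples written for this illustration; they return the same answer for every input, but the second one needs only a single if statement.

```python
def in_range_nested(x):
    # Nested version: the second test is reached only if the first is true.
    if 0 < x:
        if x < 10:
            return True
    return False

def in_range_flat(x):
    # Same logic expressed with the and operator.
    if 0 < x and x < 10:
        return True
    return False

for value in [-3, 5, 12]:
    print(value, in_range_nested(value), in_range_flat(value))
```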

Short-circuit evaluation of logical expressions


• When Python is processing a logical expression such as
x >= 2 and (x/y) > 2
it evaluates the expression from left to right. Let’s assume x=1. Because of the definition of
and operator, if x is less than 2, the expression x >= 2 is False and so the whole expression is
False regardless of whether (x/y) > 2 evaluates to True or False.

• If Python detects that there is nothing to be gained by evaluating the rest of a logical
expression, it stops its evaluation and does not do the computations in the rest of the logical
expression. When the evaluation of a logical expression stops because the overall value is
already known, it is called short-circuiting the evaluation.

• However, if the first part of the logical expression results in True, then the second part has to be
evaluated to know the overall result. Short-circuiting not only saves computational
time, but it also leads to a technique known as the guardian pattern.


Consider the below examples:


Example 1
>>> x = 6
>>> y = 2
>>> x >= 2 and (x/y) > 2
True

Example 2
>>> x = 1
>>> y = 0
>>> x >= 2 and (x/y) > 2
False

Example 3
>>> x = 6
>>> y = 0
>>> x >= 2 and (x/y) > 2
Traceback (most recent call last):
ZeroDivisionError: division by zero

• The first example is True because both conditions are true.
• The second example did not fail because the first part of the expression,
x >= 2, evaluated to False, so (x/y) was never executed due to the short-circuit rule and
there was no error.
• The third calculation failed because Python was evaluating (x/y) and y was zero, which
causes a runtime error.
We can construct the logical expression to strategically place a guard evaluation just before
the evaluation that might cause an error as follows:
Consider an example,
Example 1:
>>> x = 1
>>> y = 0
>>> x >= 2 and y != 0 and (x/y) > 2
False

Example 2
>>> x = 6
>>> y = 0
>>> x >= 2 and y != 0 and (x/y) > 2
False

Example 3
>>> x >= 2 and (x/y) > 2 and y != 0
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ZeroDivisionError: division by zero

• In the first logical expression, x >= 2 is False, so the evaluation stops at the first condition itself.
• In the second logical expression, x >= 2 is True but y != 0 is False, so the evaluation never
reaches the condition (x/y) > 2.
• In the third logical expression, y != 0 is placed after the (x/y) > 2 condition, so the
expression fails with an error.
• In the second expression, we say that y != 0 acts as a guard to ensure that we only execute
(x/y) if y is non-zero.
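The guard can also be packaged inside a function. The sketch below uses a hypothetical function name, ratio_exceeds_two(), chosen only for illustration; it relies on the short-circuit rule so that the division is attempted only when y is non-zero.

```python
def ratio_exceeds_two(x, y):
    # y != 0 is the guard: when it is False, x / y is never evaluated.
    return x >= 2 and y != 0 and (x / y) > 2

print(ratio_exceeds_two(6, 2))   # all three conditions hold
print(ratio_exceeds_two(6, 0))   # the guard stops the evaluation, no error
print(ratio_exceeds_two(1, 0))   # the first condition stops the evaluation
```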

ITERATION (looping statements)

Iteration repeats the execution of a sequence of code. Iteration is useful for solving many
programming problems. Iteration and conditional execution form the basis for algorithm
construction.

The while statement


• Sometimes, though, we need to repeat something, but we don’t know ahead of time exactly
how many times it has to be repeated i.e., indefinite loop. This is a situation when ‘while loop’
is used.
• The syntax of while loop :
while condition:
    block_of_statements
statements after the loop

Here, while is a keyword.


• The flow of execution for a while statement is as below :
— The condition is evaluated first, yielding True or False
— If the condition is false, the loop is terminated and statements after the loop will be
executed.
— If the condition is true, the body of the loop (indented block of statements) will be
executed and then control goes back to condition evaluation.

• Example: program to print values 1 to 5


i = 1
while i <= 5:
    print(i)
    i = i + 1
print("printing is done")

Output:
1
2
3
4
5
printing is done

In the above example, variable i is initialized to 1. Then the condition i <= 5 is checked. If
the condition is true, the block of code containing the print statement print(i) and the increment
statement (i = i + 1) is executed. After these two lines, the condition is checked again. The
procedure continues till the condition becomes false, that is, when i becomes 6. Now, the while
loop is terminated and the next statement after the loop will be executed. Thus, in this example,
the loop is iterated 5 times.

Also notice that, variable i is initialized before starting the loop and it is incremented inside
the loop. Such a variable that changes its value for every iteration and controls the total
execution of the loop is called as iteration variable or counter variable. If the count variable is
not updated properly within the loop, then the loop may enter into infinite loop.
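A while loop is most natural when the number of iterations is not known in advance. The sketch below uses a hypothetical function, count_halvings() (the name and the task are assumptions for illustration), in which the loop variable n is updated on every pass so that the condition eventually becomes false.

```python
def count_halvings(n):
    """Count how many integer halvings it takes for n to reach 0."""
    count = 0
    while n > 0:
        n = n // 2          # update the loop variable, otherwise the loop never ends
        count = count + 1
    return count

print(count_halvings(10))   # 10 -> 5 -> 2 -> 1 -> 0, so 4 is printed
```

If the line n = n // 2 were left out, n would stay positive forever and the loop would become an infinite loop.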

Infinite loops with break


• An infinite loop is a looping statement that iterates an infinite number of times because its
condition always remains true.
• Let us consider an example:
n = 10
while True:
    print(n, end=' ')
    n = n - 1
print('Done!')

• Here, the condition is always True, so the loop never terminates and the last print statement
is never reached. Sometimes the condition is written in such a way that it can never become
false, which prevents the program control from leaving the loop. This situation may happen
either due to a wrong condition or due to not updating the counter variable.

• Hence to overcome this situation, break statement is used. The break statement can be
used to break out of a for or while loop before the loop is finished.

• Here is a program that keeps reading numbers from the user. The user can stop the loop
by entering a negative number.
while True:
    num = int(input('Enter a number: '))
    if num < 0:
        break
    print(num)

Output:
Enter a number: 23
23
Enter a number: 34
34
Enter a number: 56
56
Enter a number: -12

• In the above example, observe that the condition is kept inside the loop in such a way that, if the
user input is a negative number, the loop terminates. This means the loop may
terminate after just one iteration (if the user gives a negative number the very first time) or it
may take thousands of iterations (if the user keeps on giving only positive numbers as input).
Hence, the number of iterations here is unpredictable. But we are making sure that it will not
be an infinite loop; instead, the user has control over the loop.


• Another example for usage of while with break statement: program to take the input from the
user and echo it until they type done:
while True:
    line = input('> ')
    if line == 'done':
        break
    print(line)
print('Done!')

Output:
> hello there
hello there
> finished
finished
> done
Done!

Finishing iterations with continue


• Sometimes the user may want to skip few tasks in the loop based on the condition. To do so
continue statement can be used.
• The continue statement skips to the next iteration without finishing the body of the loop for
the current iteration.
• Here is an example of a loop that copies its input until the user types “done” but treats lines
that start with the hash character as lines not to be printed (kind of like Python comments).

while True:
    line = input('> ')
    if line[0] == '#':
        continue
    if line == 'done':
        break
    print(line)
print('Done!')

Output:
> hello there
hello there
> # don't print this
> print this!
print this!
> done
Done!

All the lines are printed except the one that starts with the hash sign because when the
continue is executed, it ends the current iteration and jumps back to the while statement to
start the next iteration, thus skipping the print statement.
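The same logic can be tried without typing at the keyboard. The sketch below is an assumption-laden variant: the hypothetical function echo_lines() takes a list of strings in place of user input and collects the lines that would have been printed. It uses startswith('#') rather than line[0] so that an empty line does not cause an error.

```python
def echo_lines(lines):
    """Return the lines that would be echoed, skipping '#' lines and stopping at 'done'."""
    printed = []
    for line in lines:
        if line.startswith('#'):
            continue            # skip the rest of the body for this line
        if line == 'done':
            break               # leave the loop entirely
        printed.append(line)
    return printed

sample = ['hello there', "# don't print this", 'print this!', 'done', 'never seen']
print(echo_lines(sample))
```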
Definite loops using for
• Sometimes we want to loop through a set of things such as a list of words, the lines in a file,
or a list of numbers. When we have a list of things to loop through, we can construct a
definite loop using a for statement.


• for statement loops through a known set of items so it runs through as many iterations as
there are items in the set.
• There are two versions in for loop:
— for loop with sequence
— for loop with range( ) function

• Syntax of for loop with sequence:


for var in list/sequence :
    statements to be repeated

Here,
for and in -> are keywords
list/sequence -> is a set of elements on which the loop is iterated. That is, the loop will
be executed as long as there is an element left in the list/sequence
statements -> constitute the body of the loop.
• Example :
friends = ['Roopa', 'Smaya', 'Vikas']
for name in friends:
    print('Happy New Year:', name)
print('Done!')

Output:
Happy New Year: Roopa
Happy New Year: Smaya
Happy New Year: Vikas
Done!

— In the example, the variable friends is a list of three strings and the for loop goes through
the list and executes the body once for each of the three strings in the list.
— name is the iteration variable for the for loop. The variable name changes for each iteration
of the loop and controls when the for loop completes. The iteration variable steps
successively through the three strings stored in the friends variable.
• The for loop can be used to print (or extract) all the characters in a string as shown below :
for i in "Hello":
    print(i, end='\t')

Output:
H	e	l	l	o

• Syntax of for loop with range( ) function:


for variable in range(start, end, steps):
    statements to be repeated

The start and end arguments indicate the starting and ending values of the sequence, where end is
excluded from the sequence (that is, the sequence runs up to end-1). The default value of start is 0.
The argument steps indicates the increment/decrement between successive values of the sequence,
with a default value of 1. Hence, the arguments start and steps are optional. Let us consider a few
examples on the usage of the range() function.
EX:1 Program code to print the message multiple times

for i in range(3):
    print('Hello')

Output:
Hello
Hello
Hello

EX:2 Program code to print the numbers in sequence. Here the iteration variable i takes the
values from 0 to 4, excluding 5. In each iteration the value of i is printed.

for i in range(5):
    print(i, end="\t")

Output:
0	1	2	3	4

EX:3 Program to allow the user to find the squares of any three numbers

for i in range(3):
    num = int(input('Enter a number: '))
    print('The square of your number is', num*num)
print('The loop is now done.')

Output:
Enter a number: 3
The square of your number is 9
Enter a number: 4
The square of your number is 16
Enter a number: 56
The square of your number is 3136
The loop is now done.

EX:4 Program that counts down from 5 and then prints a message.

for i in range(5,0,-1):
    print(i)
print('Blast off!!')

Output:
5
4
3
2
1
Blast off!!
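To see exactly which values a range() call produces, it can be converted to a list. The short sketch below only uses the built-in functions range(), list() and print(); the particular arguments are chosen for illustration.

```python
print(list(range(5)))          # [0, 1, 2, 3, 4]   start defaults to 0
print(list(range(2, 6)))       # [2, 3, 4, 5]      end value 6 is excluded
print(list(range(1, 10, 3)))   # [1, 4, 7]         step of 3
print(list(range(5, 0, -1)))   # [5, 4, 3, 2, 1]   negative step counts down
```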

Functions

• A sequence of instructions intended to perform a specific independent task is known as a
function.
• You can pass data to be processed, as parameters to the function. Some functions can
return data as a result.
• In Python, all functions are treated as objects, so it is more flexible compared to other high-
level languages.
• In this section, we will discuss various types of built-in functions, user-defined functions,
applications/uses of functions etc.


Function types
• Built in functions
• User defined functions

Built-in functions
• Python provides a number of important built in functions that we can use without needing
to provide the function definition.
• Built-in functions are ready to use functions.
• The general form of built-in functions: function_name(arguments)

An argument is an expression that appears between the parentheses of a function call and
each argument is separated by comma .

Table : Built in functions

Built-in function   Syntax/Examples                  Comments

max( )              >>> max('hello world')           displays the character having the
                    'w'                              maximum ASCII code

min( )              >>> min('hello world')           displays the character having the
                    ' '                              minimum ASCII code (here, the space)

len( )              >>> len('hello world')           displays the length of the string
                    11

round( )            >>> round(3.8)                   rounds the value with a
                    4                                single argument
                    >>> round(3.3)
                    3
                    >>> round(3.5)
                    4
                    >>> round(3.141592653, 2)        rounds the value with 2
                    3.14                             arguments

pow( )              >>> pow(2,4)                     2 raised to the power 4
                    16

abs( )              >>> abs(-3.2)                    returns the absolute value
                    3.2                              of the object


Commonly Used Modules


Math functions
• Python has a math module that provides most of the frequently used mathematical
functions. Before we use those functions , we need to import the module as below:
>>> import math
• This statement creates a module object named math. If we pass module object as an
argument to print, information about that object is displayed:
>>> print(math)
<module 'math' (built-in)>

Table : Math Functions

Examples                     Comments
>>> import math
>>> math.sqrt(3)             # finds the square root of 3
1.7320508075688772
>>> math.sin(30)             # finds the sine of 30 radians
-0.9880316240928618
>>> math.cos(30)             # finds the cosine of 30 radians
0.15425144988758405
>>> print(math.pi)           # prints the value of pi
3.141592653589793
>>> math.sqrt(2)             # finds the square root of 2
1.4142135623730951
>>> math.log(2)              # finds the log to the base e
0.6931471805599453
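The trigonometric functions of the math module expect the angle in radians, which is why math.sin(30) above is not 0.5. As a small sketch, an angle given in degrees can first be converted with math.radians(); the variable names below are chosen only for illustration.

```python
import math

angle_deg = 30
angle_rad = math.radians(angle_deg)     # convert degrees to radians

# Floating-point arithmetic is approximate, so the result is rounded for display.
print(round(math.sin(angle_rad), 4))    # 0.5
```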

Random numbers
• Most of the programs that we write take predefined input values and produce expected
output values. Such programs are said to be deterministic. Determinism is usually a good
thing, since we expect the same calculation to yield the same result.
• But this is not always the case: for some applications, we want the computer to be unpredictable.
Games are an obvious example, but there are many others.
• Making a program truly nondeterministic turns out to be not so easy, but there are ways to
make it at least seem nondeterministic. One of them is to use algorithms that generate
pseudorandom numbers.
• Python has a module called random, in which functions related to random numbers are
available.
• In this module, the function random() returns a random float between 0.0 and 1.0, and the
function randint() returns a random integer within a given range (for example, between 1 and 100).


• Consider an example program that uses the random() function, which generates a random
number between 0.0 and 1.0, but not including 1.0. The program below generates 5 random
numbers.

import random

for i in range(5):
    x = random.random()
    print(x)

Output (sample; the values differ on each run):
0.11132867921152356
0.5950949227890241
0.04820265884996877
0.841003109276478
0.997914947094958

• The function randint() takes the parameters low and high, and returns an integer between
low and high (including both).
>>> import random
>>> random.randint(5,10)
10
>>> random.randint(5,10)
6
>>> random.randint(5,10)
7
• To choose an element from a sequence at random, you can use choice():

>>> t = [1, 2, 3]
>>> random.choice(t)
2
>>> random.choice(t)
3
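Because the numbers are pseudorandom, the sequence can be repeated by fixing the starting point of the generator with random.seed(). The sketch below is only an illustration; the seed value 42 and the variable names are arbitrary assumptions.

```python
import random

random.seed(42)                                   # fix the starting point
first = [random.randint(1, 6) for _ in range(5)]

random.seed(42)                                   # same seed again
second = [random.randint(1, 6) for _ in range(5)]

print(first == second)                            # True: the same sequence is produced
```

This is useful when a program that uses random numbers has to be tested or debugged, since every run then behaves the same way.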

Function Definition and Calling the Function


• Python facilitates programmer to define his/her own functions.
• The function written once can be used wherever and whenever required.
• The syntax of user-defined function would be:

def fname(arg_list):
    statement_1
    statement_2
    ……………
    statement_n
    return value

Here,
def : is a keyword indicating that this is a function definition.
fname : is any valid name given to the function.
arg_list : is the list of arguments taken by the function. These are
treated as inputs to the function from the position of the function call.
There may be zero or more arguments to a function.
statements : are the list of instructions to perform the required task.
return : is a keyword used to return the output value. This
statement is optional.


• The first line in the function def fname(arg_list) is known as function header/definition. The
remaining lines constitute function body.
• The function header is terminated by a colon and the function body must be indented.
• To come out of the function, indentation must be terminated.
• Unlike few other programming languages like C, C++ etc, there is no main() function or
specific location where a user-defined function has to be called.
• The programmer has to invoke (call) the function wherever required.
• Consider a simple example of user-defined function –

def myfun():
    print("Hello everyone")
    print("this is my own function")

print("before calling the function")
myfun()    # function call
print("after calling the function")

Output:
before calling the function
Hello everyone
this is my own function
after calling the function

Function calls
• A function is a named sequence of instructions for performing a task.
• When we define a function we will give a valid name to it, and then specify the instructions
for performing required task. Then, whenever we want to do that task, a function is called by
its name.
Consider an example,
>>> type(33)
<class 'int'>
• Here, type function is called to know the datatype of the value. The expression in
parenthesis is called the argument of the function. The argument is a value or variable that
we are passing into the function as input to the function.
• It is common to say that a function “takes” an argument and “returns” a result. The result is
called the return value.
The return Statement and void Function
• A function that performs some task but does not return any value to the calling function is
known as a void function. The examples of user-defined functions considered till now are void
functions.

• A function which returns some result to the calling function after performing a task is
known as a fruitful function. The built-in functions such as the mathematical functions and the
random number generating functions that have been considered earlier are examples of fruitful
functions.

• One can write a user-defined function so as to return a value to the calling function as
shown in the following example.

def addition(a,b):      #function definition
    sum=a + b
    return sum

x=addition(3, 3)        #function call
print("addition of 2 numbers:",x)

Output:
addition of 2 numbers: 6

— In the above example, the function addition() takes two arguments and returns their sum to
the receiving variable x.

— When a function returns something and the value is not received in a variable, the
return value will not be available later.

• The built-in functions that yield results are fruitful functions.
>>> math.sqrt(2)
1.4142135623730951
• A void function might display something on the screen or have some other effect. Such functions
perform an action but they don’t have a return value.
Consider an example
>>> result=print('python')
python
>>> print(result)
None
>>> print(type(None))
<class 'NoneType'>
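The difference between a void and a fruitful function shows up when the result is assigned to a variable. In the sketch below, the two function names are hypothetical and chosen only for illustration: the first function prints the sum, the second returns it.

```python
def add_and_print(a, b):
    print(a + b)        # displays the sum but returns nothing (void function)

def add_and_return(a, b):
    return a + b        # hands the sum back to the caller (fruitful function)

r1 = add_and_print(2, 3)    # 5 is displayed, but r1 receives None
r2 = add_and_return(2, 3)   # nothing is displayed, r2 receives 5
print(r1, r2)
```

Only r2 can be used in later calculations; using r1 in arithmetic would raise an error because it is None.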

Scope and Lifetime of Variables


• All variables in a program may not be accessible at all locations in that program. This
depends on where you have declared a variable.

• The scope of a variable determines the portion of the program where you can access a
particular identifier. There are two basic scopes of variables in Python
— Global variables
— Local variables


Global vs. Local variables


• Variables that are defined inside a function body have a local scope, and those defined
outside have a global scope.

• This means that local variables can be accessed only inside the function in which they are
declared, whereas global variables can be accessed throughout the program body by all
functions. When you call a function, the variables declared inside it are brought into scope.

total = 0    # This is global variable.

# Function definition is here
def sum( arg1, arg2 ):
    total = arg1 + arg2    # Here total is local variable.
    print("Inside the function local total :", total)
    return total

sum( 10, 20 )    # Now you can call sum function
print("Outside the function global total :", total)

Output :
Inside the function local total : 30
Outside the function global total : 0

Default Parameters
• For some functions, you may want to make some parameters optional and use default values
in case the user does not want to provide values for them. This is done with the help of
default argument values.

• Default argument values can be specified for parameters by appending to the parameter
name in the function definition the assignment operator ( = ) followed by the default value.

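• Note that the default argument value should be a constant. More precisely, it should be an
immutable value, because the default value is evaluated only once, when the function is defined.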


• Only those parameters which are at the end of the parameter list can be given default
argument values i.e. you cannot have a parameter with a default argument value preceding a
parameter without a default argument value in the function's parameter list. This is because
the values are assigned to the parameters by position.
For example, def func(a, b=5) is valid, but def func(a=5, b) is not valid.

def say(message, times=1):
    print(message * times)

say('Hello')
say('World', 5)

Output :
Hello
WorldWorldWorldWorldWorld


How It Works
The function named say is used to print a string as many times as specified. If we don't supply a
value, then by default, the string is printed just once. We achieve this by specifying a default
argument value of 1 to the parameter times . In the first usage of say , we supply only the string
and it prints the string once. In the second usage of say , we supply both the string and an
argument 5 stating that we want to say the string message 5 times.
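The reason a default value should not be a mutable object such as a list can be seen in the sketch below. Both function names are hypothetical. Since the default list is created only once, at definition time, the first function keeps adding to the same list on every call; the second uses None as the default and creates a fresh list inside the body.

```python
def append_bad(item, bucket=[]):
    bucket.append(item)         # the same default list is reused across calls
    return bucket

def append_good(item, bucket=None):
    if bucket is None:
        bucket = []             # a new list is created for each call
    bucket.append(item)
    return bucket

print(append_bad(1))    # [1]
print(append_bad(2))    # [1, 2]  the earlier item is still there
print(append_good(1))   # [1]
print(append_good(2))   # [2]
```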

Keyword Arguments
• If you have some functions with many parameters and you want to specify only some of
them, then you can give values for such parameters by naming them - this is called
keyword arguments - we use the name (keyword) instead of the position to specify the
arguments to the function.
• There are two advantages
— One, using the function is easier since we do not need to worry about the order of the arguments.
— Two, we can give values to only those parameters we want to, provided that the other
parameters have default argument values.
def func(a, b=5, c=10):
    print('a is', a, 'and b is', b, 'and c is', c)

func(3, 7)
func(25, c=24)
func(c=50, a=100)

Output:
a is 3 and b is 7 and c is 10
a is 25 and b is 5 and c is 24
a is 100 and b is 5 and c is 50

How It Works
The function named func has one parameter without a default argument value, followed by two
parameters with default argument values. In the first usage, func(3, 7) , the parameter a gets the
value 3 , the parameter b gets the value 7 and c gets the default value of 10 . In the second
usage func(25, c=24) , the variable a gets the value of 25 due to the position of the argument.
Then, the parameter c gets the value of 24 due to naming i.e. keyword arguments. The variable b
gets the default value of 5 . In the third usage func(c=50, a=100) , we use keyword arguments for
all specified values. Notice that we are specifying the value for parameter c before that for a even
though a is defined before c in the function definition.


*args and **kwargs, Command Line Arguments


• Sometimes you might want to define a function that can take any number of parameters, i.e.
variable number of arguments, this can be achieved by using the stars.
• *args and **kwargs are mostly used as parameters in function definitions. *args and
**kwargs allow you to pass a variable number of arguments to the called function. Here, a
variable number of arguments means that it is not known in advance how
many arguments will be passed to the called function.
• *args as a parameter in a function definition allows you to pass a non-keyworded, variable
length argument list to the called function; inside the function it is available as a tuple.
• **kwargs as a parameter in a function definition allows you to pass a keyworded, variable length
argument list to the called function; inside the function it is available as a dictionary.
*args must come after all the positional parameters and **kwargs must come right at the end.

def total(a=5, *numbers, **phonebook):
    print('a', a)
    #iterate through all the items in tuple
    for single_item in numbers:
        print('single_item', single_item)
    #iterate through all the items in dictionary
    for first_part, second_part in phonebook.items():
        print(first_part,second_part)

total(10,1,2,3,Jack=1123,John=2231,Inge=1560)

Output:
a 10
single_item 1
single_item 2
single_item 3
Jack 1123
John 2231
Inge 1560

Note: inside the statement block of the function, the names args and kwargs are used without the
* and ** prefixes.
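The stars can also be used in the other direction, at the point of the call, to unpack an existing tuple or dictionary into separate arguments. The sketch below uses a hypothetical function describe() and made-up data for illustration.

```python
def describe(*args, **kwargs):
    # args is a tuple and kwargs is a dictionary inside the function
    return len(args), sorted(kwargs)

numbers = (1, 2, 3)
phonebook = {'Jack': 1123, 'John': 2231}

# *numbers passes 1, 2, 3 as positional arguments;
# **phonebook passes Jack=1123, John=2231 as keyword arguments.
print(describe(*numbers, **phonebook))   # (3, ['Jack', 'John'])
```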

Command Line Arguments


• The sys module provides a global variable named argv that is a list of extra text that the user
can supply when launching an application from command prompt in Windows and terminal in
OS X and Linux.
• Some programs expect or allow the user to provide extra information.
• Example : let us consider the user supplies a range of integers on the command line. The
program then prints all the integers in that range along with their square roots.


sqrtcmdline.py

from sys import argv
from math import sqrt

if len(argv) < 3:
    print('Supply range of values')
else:
    for n in range(int(argv[1]), int(argv[2]) + 1):
        print(n, sqrt(n))

Output:
C:\Code>python sqrtcmdline.py 2 5
2 1.4142135623730951
3 1.7320508075688772
4 2.0
5 2.23606797749979
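Command line arguments always arrive as strings, so a user may type something that is not a number. The sketch below is one possible way to handle this, not part of the original program: the hypothetical function square_roots() takes a list shaped like sys.argv and returns None when the arguments are missing or not integers. It assumes the range contains no negative numbers.

```python
from math import sqrt

def square_roots(argv):
    """Return (n, sqrt(n)) pairs for the range given in argv[1] and argv[2]."""
    if len(argv) < 3:
        return None                     # not enough arguments supplied
    try:
        low, high = int(argv[1]), int(argv[2])
    except ValueError:
        return None                     # an argument was not an integer
    return [(n, sqrt(n)) for n in range(low, high + 1)]

# The list below imitates what sys.argv would hold for: python sqrtcmdline.py 2 5
print(square_roots(['sqrtcmdline.py', '2', '5']))
```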


Question Bank
Q. No.   Questions

1    Explain the features of Python.
2    Give the comparison between interpreter and compiler.
3    Define Python. List the standard data types of Python.
4    Explain type conversion in Python with examples.
5    Write a short note on data types in Python.
6    Differentiate between local and global variables with suitable examples.
7    Write short notes on:
     i) Variables and statements
     ii) Expressions
     iii) String and modulus operators
8    Explain with an example how to read input from the user in Python.
9    Explain the different types of arithmetic operators with examples.
10   Discuss operator precedence used in evaluating an expression.
11   Briefly explain the conditional statements available in Python.
12   Explain the syntax of the for loop with an example.
13   When to use nested conditions in programming? Discuss with an example.
14   Explain fruitful and void functions. Give examples.
15   With an example, demonstrate effective coding with functions in Python.
16   Explain the working of Python user-defined functions along with their syntax.
17   Explain the following terms:
     i) Boolean and logical expressions
     ii) Conditional execution and alternative execution with syntax
     iii) Chained conditionals
     iv) Short-circuit evaluation of logical expressions
18   Illustrate the three functions of random numbers with examples.
19   List and explain all built-in math functions with examples.
20   Write a note on short-circuit evaluation of logical expressions.
21   Predict the output for the following expressions:
     i) -11%9
     ii) 7.7//7
     iii) (200-7)*10/5
     iv) 5*2**1
22   What is the purpose of using break and continue?
23   Differentiate the syntax of if...else and if...elif...else with an example.
24   Create a Python program for a calculator application using functions.
25   Define function. What are the advantages of using a function?
26   Differentiate between user-defined functions and built-in functions.
27   Explain the built-in functions with examples in Python.
28   Explain the advantages of *args and **kwargs with examples.

Module 2 [20MCA31] Data Analytics using Python

Module -2
Python Collection Objects, Classes

Strings
– Creating and Storing Strings,
– Basic String Operations,
– Accessing Characters in String by Index Number,
– String Slicing and Joining,
– String Methods,
– Formatting Strings,
Lists
– Creating Lists,
– Basic List Operations,
– Indexing and Slicing in Lists,
– Built-In Functions Used on Lists,
– List Methods
– Sets, Tuples and Dictionaries.
Files
– reading and writing files.
Class
– Class Definition
– Constructors
– Inheritance
– Overloading


Strings

A string consists of a sequence of characters, which includes letters, numbers, punctuation
marks and spaces.
Creating and Storing Strings
• A string can be created by enclosing text in single or double quotes. Triple quotes can be
used for multi-line strings. Here are some examples:

>>> S='hello'
>>> Str="hello"
>>> M="""This is a multiline
String across two lines"""
• Sometimes we may want to have a string that contains backslash and don't want it to be
treated as an escape character. Such strings are called raw string. In Python raw string is
created by prefixing a string literal with 'r' or 'R'. Python raw string treats backslash (\) as a
literal character.

>>> s= r" world health \n organization"


>>> print(s)
world health \n organization

Basic String Operations


String Concatenation
Concatenation can be done in various ways, some are as follows:
Strings can be concatenated by using the + operator:
>>> str1 = "Hello"
>>> str2 = 'there'
>>> str3 = str1 + str2
>>> print(str3)
Hellothere

Strings can be concatenated by placing string literals side by side:
>>> str= 'this' 'is' 'python' 'class'
>>> print(str)
thisispythonclass

The in operator
• The in operator of Python is a Boolean operator which takes two string operands.
• It returns True, if the first operand appears as a substring in second operand, otherwise
returns False.


Ex:1
if 'pa' in 'roopa':
    print('Your string contains "pa".')
Ex:2
if ';' not in 'roopa':
    print('Your string does not contain any semicolons.')
Ex:3 We can avoid writing longer code like this:
if t=='a' or t=='e' or t=='i' or t=='o' or t=='u':
Instead, we can write the same test with the in operator:
if t in 'aeiou':
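As a complete sketch built on the test in Ex:3, the hypothetical function count_vowels() below (the name and the sample text are assumptions for illustration) uses the in operator to count the vowels in a string.

```python
def count_vowels(text):
    """Count the vowels in text, ignoring case."""
    count = 0
    for ch in text.lower():
        if ch in 'aeiou':       # True when ch is one of the five vowels
            count += 1
    return count

print(count_vowels('Data Analytics'))   # 5
```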

String comparison
• Basic comparison operators like < (less than), > (greater than), == (equals) etc. can be
applied on string objects.
• Such comparison results in a Boolean value True or False.
• Internally, such comparison happens using ASCII codes of respective characters.
• List of ASCII values for some of the character set

A – Z : 65 – 90
a – z : 97 – 122
0 – 9 : 48 – 57
Space : 32
Enter Key : 13

• Examples are as follows:


Ex:1
if Name == 'Ravindra':
    print('Ravindra is selected.')
Ex:2
if word < 'Ravindra':
    print('Your name, ' + word + ', comes before Ravindra.')
elif word > 'Ravindra':
    print('Your name, ' + word + ', comes after Ravindra.')
else:
    print('All right, Ravindra.')


Traversal through a string with a loop


• Extracting every character of a string one at a time and then performing some action on
that character is known as traversal.
• A string can be traversed either using while loop or using for loop in different ways.
Consider an example using while loop:
st="Roopa"
index=0
while index < len(st):
    print(st[index], end="\t")
    index+=1

Output:
R	o	o	p	a

This loop traverses the string and displays each letter, separated by a tab. The loop condition is
index < len(st), so when index is equal to the length of the string, the condition is false, and
the body of the loop is not executed. The last character accessed is the one with the index
len(st)-1, which is the last character in the string.
• Another way to write a traversal is with a for loop:
fruit="grapes"
for char in fruit:
    print(char,end="\t")

Output:
g	r	a	p	e	s

Each time through the loop, the next character in the string is assigned to the variable char.
The loop continues until no characters are left.
Accessing Characters in String by Index Number
• We can get at any single character in a string using an index specified in square brackets

• The index value must be an integer and starts at zero.

• The index value can be an expression that is computed.

Example : str = "good morning"

character   g    o    o    d    ' '  m    o    r    n    i    n    g
index       0    1    2    3    4    5    6    7    8    9    10   11

>>> word = 'Python'


>>> word[0] # character in position 0
'P'
>>> word[5] # character in position 5
'n'


• Python supports negative indexing of string starting from the end of the string as shown
below:

character    g    o    o    d   ' '   m    o    r    n    i    n    g
index       -12  -11  -10  -9   -8   -7   -6   -5   -4   -3   -2   -1

>>> word = 'Python'


>>> word[-1] # last character
'n'
>>> word[-2] # second-last character
'o'
>>> word[-6]
'P'
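The two numbering schemes are related: index -k refers to the same character as index len(word) - k. A short sketch that checks this, and shows what happens when an index falls outside the string:

```python
word = 'Python'
n = len(word)                       # 6
# word[-k] and word[n - k] name the same character for k = 1 .. n.
same = [word[-k] == word[n - k] for k in range(1, n + 1)]

# Valid indices run from -n to n - 1; anything else raises IndexError.
try:
    word[6]
except IndexError as err:
    message = str(err)
print(all(same), message)
```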

String Slicing and Joining


String slices
• A segment or a portion of a string is called a slice.

• Any required portion of a string can be extracted using the colon (:) symbol inside the
square brackets.

• The basic syntax for slicing a string :


String_name[start : end : step]

Where,
start : the position from where it starts
end : the position where it stops(excluding end position)
step: also known as stride, is used to indicate number of steps to be incremented
after extracting first character. The default value of stride is 1.

• If start is not mentioned, it means that slice should start from the beginning of the string.
• If the end is not mentioned, it indicates the slice should be till the end of the string.
• If the both are not mentioned, it indicates the slice should be from the beginning till the
end of the string.

Examples: s = "abcdefghij"
index 0 1 2 3 4 5 6 7 8 9
characters a b c d e f g h i j
Reverse index -10 -9 -8 -7 -6 -5 -4 -3 -2 -1


Slicing Code Result Description


s[2:5] cde characters at indices 2, 3, 4
s[ :5] abcde first five characters
s[5: ] fghij characters from index 5 to the end
s[-2:] ij last two characters
s[ : ] abcdefghij entire string
s[1:7:2] bdf characters from index 1 to 6, by twos
s[ : :-1] jihgfedcba a negative step reverses the string

As the examples above show, slicing is a compact and powerful Python tool. It simplifies many
tasks on sequence types such as strings, lists and tuples (dictionaries and sets do not support
slicing).
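As one possible use of a negative step, the sketch below checks whether a word is a palindrome. The helper name is_palindrome is hypothetical and chosen only for this illustration:

```python
def is_palindrome(text):
    """Return True if text reads the same forwards and backwards (ignoring case)."""
    lowered = text.lower()
    return lowered == lowered[::-1]     # [::-1] produces the reversed string

s = "abcdefghij"
every_second = s[::2]                   # 'acegi'
last_three_reversed = s[:-4:-1]         # 'jih'
print(is_palindrome("Madam"), every_second, last_three_reversed)
```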

String Methods
• Strings are an example of Python objects.
• An object contains both data (the actual string itself) and methods, which are effectively
functions that are built into the object and are available to any instance of the object.
• Python provides a rich set of built-in classes for various purposes. Each class is enriched
with a useful set of utility functions and variables that can be used by a Programmer.
• The built-in set of members of any class can be accessed using the dot operator as shown–
objName.memberMethod(arguments)
• The dot operator always binds the member name with the respective object name. This is very
essential because, there is a chance that more than one class has members with same name.
To avoid that conflict, almost all Object-oriented languages have been designed with this
common syntax of using dot operator.

• Python has a function called dir which lists the methods available for an object. The type
function shows the type of an object and the dir function shows the available methods.

>>> stuff = 'Hello world'
>>> type(stuff)
<class 'str'>
>>> dir(stuff)
['capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format',
'format_map', 'index', 'isalnum', 'isalpha', 'isdecimal', 'isdigit', 'isidentifier', 'islower',
'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans',
'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines',
'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']


• The help function can be used to get some simple documentation on a method.

>>> help(stuff.casefold)
Help on built-in function casefold:
casefold() method of builtins.str instance
    Return a version of the string suitable for caseless comparisons.

— This is the built-in help service provided by Python. Observe the objectName.methodName
format used in the call to help.

— The methods are usually called using the object name. This is known as method
invocation. We say that a method is invoked using an object.

Method, description and example

capitalize()
Return a capitalized version of the string. More specifically, make the first character
have upper case and the rest lower case.
>>> msg="bengaluru"
>>> print(msg.capitalize())
Bengaluru

find()
S.find(sub[, start[, end]]) -> int
Return the lowest index in S where substring sub is found, such that sub is contained
within S[start:end]. Optional arguments start and end are interpreted as in slice notation.
Return -1 on failure.
>>> st='hello'
>>> i=st.find('l')
>>> print(i)
2
>>> print(st.find('x'))
-1
>>> st="calender of Feb.cal2019"
>>> i= st.find('cal')
>>> print(i)
0
>>> i=st.find('cal',10,20)
>>> print(i)
16


strip()
Return a copy of the string with leading and trailing whitespace removed.
>>> st=" hello world "
>>> st1 = st.strip()
>>> print(st1)
hello world

strip(chars)
If chars is given, remove the characters specified as arguments at both ends.
>>> st="###Hello##"
>>> st1=st.strip('#')
>>> print(st1)
Hello

rstrip()
Removes whitespace at the right side.

lstrip()
Removes whitespace at the left side.

casefold()
Return a version of the string suitable for caseless comparisons.
>>> first="india"
>>> second="INDIA"
>>> first.casefold()== second.casefold()
True

split(separator)
The split() method returns a list of strings after breaking the given string by the specified
separator. The separator is a delimiter; the string is split wherever it occurs. If it is not
provided, any run of white space acts as the separator.
>>> text="this is python"
>>> var=text.split()
>>> print(var)
['this', 'is', 'python']
>>> s="abc,def,ght,ijkl"
>>> print(s.split(","))
['abc', 'def', 'ght', 'ijkl']

startswith(prefix, start, end)
Return True if the string starts with the specified prefix, False otherwise. With optional start,
test the string beginning at that position. With optional end, stop comparing at that
position. prefix can also be a tuple of strings to try.
>>> text="logical"
>>> print(text.startswith("L"))
False
#case sensitive, hence False


join()
Concatenate any number of strings. The string whose method is called is inserted in between
each given string. The result is returned as a new string.
>>> s1 = 'abc'
>>> s2 = '123'
>>> print(s1.join(s2))
1abc2abc3
>>> s1 = '-'
>>> s2 = 'abc'
>>> print(s1.join(s2))
a-b-c

• Other string methods are:


Method Description
lower() returns a string with every letter of the original in lowercase
upper() returns a string with every letter of the original in uppercase
replace(x,y) returns a string with every occurrence of x replaced by y
count(x) counts the number of occurrences of x in the string
index(x) returns the location of the first occurrence of x
isalpha() returns True if every character of the string is a letter
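Because every string method returns a new value, calls can be chained. The sketch below cleans up a comma-separated line; the sample text is made up for illustration:

```python
raw = "  Data,Analytics,using,PYTHON  "

# strip() removes the outer spaces, lower() normalises case, split() breaks at commas.
words = raw.strip().lower().split(",")
# join() glues the words back with spaces; title() capitalises each word.
title = " ".join(words).title()
count_a = title.lower().count("a")

print(words)
print(title, count_a)
```

The original string raw is never modified, since strings are immutable; each method returns a fresh string or list.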

Formatting Strings
Strings can be formatted in different ways, using :
• format operator
• f-string
• format( ) function
Format operator
• The format operator, % allows us to construct strings, replacing parts of the strings with the
data stored in variables.
• When applied to integers, % is the modulus operator. But when the first operand is a string,
% is the format operator.
• For example, the format sequence “%d” means that the operand should be formatted as an
integer (d stands for “decimal”):
Example 1:
>>> camels = 42
>>>'%d' % camels
'42'


• A format sequence can appear anywhere in the string, so you can embed a value in a
sentence:
Example 2 :
>>> camels = 42
>>> 'I have spotted %d camels.' % camels
'I have spotted 42 camels.'
• If there is more than one format sequence in the string, the second argument has to be a
tuple. Each format sequence is matched with an element of the tuple, in order.
• The following example uses “%d” to format an integer, “%g” to format a floating point number,
and “%s” to format a string:
Example 3:
>>> 'In %d years I have spotted %g %s.' % (3, 0.1, 'camels')
'In 3 years I have spotted 0.1 camels.'

f-strings
Formatted strings or f-strings were introduced in Python 3.6. An f-string is a string literal that is
prefixed with “f”. These strings may contain replacement fields, which are expressions enclosed
within curly braces {}. The expressions are replaced with their values.
Example :
>>> a=10
>>> print(f"the value is {a}")
the value is 10
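A replacement field may also carry a format specification after a colon. The following sketch shows width, alignment and precision; the values are invented for the example:

```python
name = "RNSIT"
price = 1234.5678

# >8    : right-align in a field of width 8
# 10.2f : width 10, two digits after the decimal point
# ,.1f  : thousands separator, one digit after the decimal point
line = f"{name:>8}|{price:10.2f}|{price:,.1f}"
doubled = f"twice the price is {price * 2:.0f}"
print(line)
print(doubled)
```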
Format function
format() : is one of the string formatting methods in Python 3, which allows multiple
substitutions and value formatting. This method lets us insert values into a string
through positional formatting.
Two types of parameters:
— positional_argument
— keyword_argument
• Positional argument: It can be integers, floating point numeric constants, strings, characters
and even variables.
• Keyword argument : It is essentially a variable storing some value, which is passed as a
parameter.


# To demonstrate the use of formatters with positional and keyword arguments.

Positional arguments are placed in order
>>> print("{0} college {1} department".format("RNSIT","EC"))
RNSIT college EC department

Reverse the index numbers with the parameters of the placeholders
>>> print("{1} department {0} college".format("RNSIT","EC"))
EC department RNSIT college

Positional arguments are not specified. By default it starts positioning from zero
>>> print("Every {} should know the use of {} {} programming and {}".format("programmer",
"Open", "Source", "Operating Systems"))
Every programmer should know the use of Open Source programming and Operating Systems

Use the index numbers of the values to change the order that they appear in the string
>>> print("Every {3} should know the use of {2} {1} programming and {0}".format("programmer",
"Open", "Source", "Operating Systems"))
Every Operating Systems should know the use of Source Open programming and programmer

Keyword arguments are called by their keyword name
>>> print("EC department {0} 'D' section {college}".format("6", college="RNSIT"))
EC department 6 'D' section RNSIT
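The three formatting styles described in this section are interchangeable for simple cases. A side-by-side sketch, reusing the camels sentence from the format operator examples:

```python
count, item = 42, "camels"

with_operator = "I have spotted %d %s." % (count, item)                  # format operator
with_fstring = f"I have spotted {count} {item}."                         # f-string
with_format = "I have spotted {} {}.".format(count, item)                # positional arguments
with_keywords = "I have spotted {n} {what}.".format(n=count, what=item)  # keyword arguments

print(with_operator)
```

All four expressions build the same string; f-strings are usually the shortest to write when the values are already held in variables.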

Lists

A list is a sequence
• A list is an ordered sequence of values.
• It is a data structure in Python. The values inside a list can be of any type (like
integer, float, strings, lists, tuples, dictionaries etc) and are called elements or
items.
• The elements of lists are enclosed within square brackets.


Creating Lists
There are various ways of creating list:
• Creating a simple list:
L = [1,2,3]
Use square brackets to indicate the start and end of the list, and separate the items
by commas.
• The empty list [ ] is the list counterpart of 0 or '': it holds no elements. It can be created
using the list function or empty square brackets.
a = [ ]
l=list()

• Long lists: if you have a long list to enter, you can split it across several lines, like
below:
nums = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
        18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]

• We can use eval(input( )) to allow the user to enter a list. Note that eval executes whatever
expression is typed, so it should be used only with trusted input. Here is an example:
L = eval(input('Enter a list: '))
print('The first element is ', L[0])

Output:
Enter a list: [5,7,9]
The first element is  5
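Since eval runs any expression the user types, a more cautious alternative is ast.literal_eval from the standard library, which accepts only literals. A minimal sketch; the helper name parse_list is hypothetical, and the text is passed in directly so that the example runs without keyboard input:

```python
import ast

def parse_list(text):
    """Parse text such as '[5, 7, 9]' into a list without executing code."""
    value = ast.literal_eval(text)      # raises ValueError or SyntaxError on non-literals
    if not isinstance(value, list):
        raise ValueError("expected a list literal")
    return value

# In a real program: L = parse_list(input('Enter a list: '))
L = parse_list('[5,7,9]')
print('The first element is', L[0])
```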

• Creating list using list and range function

num=list(range(10))
print(num)

Output:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

• Nested lists can be created as follows:
a = ['a', 'b', 'c']
n = [1, 2, 3]
x = [a, n]
print(x)
col=[23,[9.3,11.2,],['good'], []]
print(col)

Output:
[['a', 'b', 'c'], [1, 2, 3]]
[23, [9.3, 11.2], ['good'], []]

Basic List Operations


• The plus (+) operator concatenates lists in the same way it concatenates strings.
The following shows some experiments in the interactive shell with list
concatenation:


a=[10, 20, 30]
a = a + [1, 3, 5]
print(a)
a += [10]
print(a)

Output:
[10, 20, 30, 1, 3, 5]
[10, 20, 30, 1, 3, 5, 10]

• The statement a = a + [1, 3, 5] reassigns a to the new list [10, 20, 30, 1, 3, 5].

• The statement a += [10] extends the existing list in place, so a becomes [10, 20, 30, 1, 3, 5, 10].
• Similarly, the * operator repeats a list a given number of times:
lt=[0]*4
print(lt)
print([1, 2, 3] * 3)
print(["abc"]*5)

Output:
[0, 0, 0, 0]
[1, 2, 3, 1, 2, 3, 1, 2, 3]
['abc', 'abc', 'abc', 'abc', 'abc']

List Aliasing
• When a variable is assigned to another variable using the assignment operator, both names
refer to the same object in memory. The association of a variable with an object is
called a reference.
a = [10, 20, 30, 40]
b = [10, 20, 30, 40]
print('Is ', a, ' equal to ', b, '?', sep='', end=' ')
print(a == b)
print('Are ', a, ' and ', b, ' aliases?', sep='', end=' ')
print(a is b)

c = [100, 200, 300, 400]


d = c # creating alias
print('Is ', c, ' equal to ', d, '?', sep='', end=' ')
print(c == d)
print('Are ', c, ' and ', d, ' aliases?', sep='', end=' ')
print(c is d)
Output:
Is [10, 20, 30, 40] equal to [10, 20, 30, 40]? True
Are [10, 20, 30, 40] and [10, 20, 30, 40] aliases? False
Is [100, 200, 300, 400] equal to [100, 200, 300, 400]? True
Are [100, 200, 300, 400] and [100, 200, 300, 400] aliases? True


• The assignment statement (d = c) causes variables c and d to refer to the same list
object. We say that c and d are aliases. In other words, there are two references to the
same object in the memory.

• An object with more than one reference has more than one name, hence we say that the
object is aliased. If the aliased object is mutable, changes made through one alias are
visible through the other.
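The effect of aliasing on a mutable list can be seen in a short sketch. It also shows one way to avoid it, by making a copy with the list method copy() (slicing with c[:] or calling list(c) would work as well):

```python
c = [100, 200, 300, 400]
d = c            # d is an alias: both names refer to the same list
e = c.copy()     # e is an independent (shallow) copy

d[0] = -1        # change made through the alias

print(c)         # the change is visible through c as well
print(e)         # the copy is unaffected
print(c is d, c is e)
```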

Indexing and Slicing in Lists


Indexing
• The syntax for accessing the elements of a list is the same as for accessing the
characters of a string: the bracket operator. The expression inside the brackets
specifies the index. Indexing starts from 0.

• Lists may be heterogeneous; that is, a list can hold elements of varying types. The program
heterolist.py demonstrates this and accesses the elements of the list by their index.

collection=[24.2,4,'word',eval,19,-0.03,'end']
print(collection[0])
print(collection[1])
print(collection[2])
print(collection[3])
print(collection[4])
print(collection[5])
print(collection[6])
print(collection)

Output:
24.2
4
word
<built-in function eval>
19
-0.03
end
[24.2, 4, 'word', <built-in function eval>, 19, -0.03, 'end']

• Accessing elements with negative index

names=["roopa","shwetha","rajani"] Output:
print(names[-2]) shwetha
print(names[-1]) rajani

• The elements of an inner list can be accessed by double-indexing. The inner list is treated
as a single element by the outer list. The first index indicates the position of the inner
list inside the outer list, and the second index gives the position of a particular value
within the inner list.

ls=[[1,2],['EC','CS']] Output:
print(ls[1][0]) EC


List slicing

• We can make a new list from a portion of an existing list using a technique known
as slicing. A list slice is an expression of the form

list [ begin : end : step ]


• If the begin value is missing, it defaults to 0.

• If the end value is missing, it defaults to the length of the list.

• The default step value is 1.

lst =[10,20,30,40,50,60,70,80,90,100] Output:


print(lst) [10,20,30,40,50,60,70,80,90,100]

print(lst[0:3]) [10, 20, 30]

print(lst[4:8]) [50, 60, 70, 80]

print(lst[2:5]) [30, 40, 50]

print(lst[-5:-3]) [60, 70]

print(lst[:3]) [10, 20, 30]

print(lst[4:]) [50, 60, 70, 80, 90, 100]

print(lst[:]) [10,20,30,40,50,60,70,80,90,100]

print(lst[-100:3]) [10, 20, 30]

print(lst[4:100]) [50, 60, 70, 80, 90, 100]

print(lst[2:-2:2]) [30, 50, 70]

print(lst[::2]) [10, 30, 50, 70, 90]

• A negative begin value counts from the end of the list. A begin value that falls before the
start of the list (less than -len(lst)) is treated as zero.

Ex: lst[-100:3] #here -100 is treated as 0

• An end value greater than the length of the list is treated as the length of the list.
Ex: lst[4:100] # here 100 is treated as len(lst)
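The forgiving behaviour of slices contrasts with plain indexing, which raises an error for an out-of-range position. A brief sketch with an illustrative five-element list:

```python
lst = [10, 20, 30, 40, 50]

whole = lst[-100:100]     # both limits are clamped, giving the whole list
empty = lst[10:20]        # a slice entirely outside the list is simply empty
tail = lst[-2:]           # a negative begin counts from the end

# Indexing a single position is strict, unlike slicing.
try:
    lst[10]
    index_ok = True
except IndexError:
    index_ok = False

print(whole, empty, tail, index_ok)
```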


• A slice operator on the left side of an assignment can update multiple elements:

t = ['a','b','c','d','e','f']
t[1:3] = ['x', 'y']
print(t)

Output:
['a', 'x', 'y', 'd', 'e', 'f']
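The replacement list does not have to be the same length as the slice it replaces, so slice assignment can also shrink, grow or trim a list. A sketch under that assumption:

```python
t = ['a', 'b', 'c', 'd', 'e', 'f']

t[1:3] = ['x']            # two elements replaced by one
after_shrink = t.copy()

t[1:1] = ['p', 'q']       # an empty slice acts as an insertion point
after_insert = t.copy()

t[-2:] = []               # assigning an empty list deletes the slice
print(after_shrink, after_insert, t)
```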

Built-In Functions Used on Lists


There are several built-in functions that operate on lists. Here are some useful ones:
Function Description
len returns the number of items in the list
sum returns the sum of the items in the list
min returns the minimum of the items in the list
max returns the maximum of the items in the list

Example demonstrating the use of above functions

nums = [3, 41, 12, 9, 74, 15]
print("length:",len(nums))
print(max(nums))
print(min(nums))
print(sum(nums))
print(sum(nums)/len(nums))

Output:
length: 6
74
3
154
25.666666666666668

The sum( ) function only works when the list elements are numbers. The functions max( ) and
min( ) also work with lists of strings and other types whose elements can be compared with
each other, and len( ) works with any list.

Ex: Program to read the data from the user and to compute sum and average of
those numbers
In this program, we initially create an empty list and then start an infinite while loop. As
every input from the keyboard arrives as a string, we convert the input x into float type and
then append it to the list. When the keyboard input is the string 'done', the loop terminates.
After the loop, we find the average of those numbers with the help of the built-in functions
sum() and len().

ls= list()
while (True):
    x= input('Enter a number: ')
    if x== 'done':
        break
    x= float(x)
    ls.append(x)

average = sum(ls) / len(ls)
print('Average:', average)

Output:
Enter a number: 2
Enter a number: 3
Enter a number: 4
Enter a number: 5
Enter a number: done
Average: 3.5
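If the user types done before entering any number, the list is empty and the division by len(ls) fails. One possible guard is sketched below; the function name average and the choice to return None for an empty list are assumptions made for this illustration:

```python
def average(values):
    """Return the mean of a list of numbers, or None if the list is empty."""
    if not values:            # an empty list is treated as False
        return None
    return sum(values) / len(values)

print(average([2.0, 3.0, 4.0, 5.0]))
print(average([]))
```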

List Methods
There are several built-in methods in list class for various purposes. They are as follows:
append: The append() method adds a single item to the existing list. It doesn't return a new list;
rather it modifies the original list.
fruits = ['apple', 'banana', 'cherry']
fruits.append("orange")
print("fruits after appending:",fruits)

Output:
fruits after appending: ['apple', 'banana', 'cherry', 'orange']

extend: The extend() method takes a single argument (a list or any other iterable) and adds
each of its elements to the end of the list.

t1 = ['a', 'b', 'c']
t2 = ['d', 'e']
t1.extend(t2)
print("t1+t2 :",t1)

Output:
t1+t2 : ['a', 'b', 'c', 'd', 'e']
count: Returns the number of times a given element appears in the list. Does not modify the
list.
ls=[1,2,5,2,1,3,2,10]
print("count of 2 :" ,ls.count(2))

Output:
count of 2 : 3

insert : Inserts a new element before the element at a given index. Increases the length of the
list by one. Modifies the list.

ls=[3,5,10]
ls.insert(1,"hi")
print("ls after inserting element:",ls)

Output:
ls after inserting element: [3, 'hi', 5, 10]
index : Returns the lowest index of a given element within the list. Produces an error if the
element does not appear in the list. Does not modify the list.


ls=[15, 4, 2, 10, 5, 3, 2, 6]
print(ls.index(2))        #2
print(ls.index(2,3,7))    #6
#finds the index of 2 between positions 3 and 6 (the end position 7 is excluded)

reverse: Reverses the elements of the list in place. The list is modified.

ls=[4,3,1,6]
ls.reverse()
print("reversed list :" ,ls)

Output:
reversed list : [6, 1, 3, 4]

sort: Sorts the elements of the list in ascending order. The list is modified.

ls=[3,10,5, 16,-2]
ls.sort()
print("list in ascending order: ",ls)
ls.sort(reverse=True)
print("list in descending order: ",ls)

Output:
list in ascending order:  [-2, 3, 5, 10, 16]
list in descending order:  [16, 10, 5, 3, -2]

clear: This method removes all the elements in the list and makes the list empty.

ls=[1,2,3]
ls.clear()
print(ls)

Output:
[]
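Two points about these methods often cause confusion: append adds its argument as a single element while extend adds the elements one by one, and sort modifies the list and returns None, whereas the built-in sorted function returns a new list. A short sketch of both:

```python
nums = [3, 10, 5]

nested = nums.copy()
nested.append([16, -2])     # the whole list becomes one element
flat = nums.copy()
flat.extend([16, -2])       # each element is added separately

ordered = sorted(flat)      # new sorted list; flat keeps its order
result = flat.copy().sort() # sort() works in place and returns None

print(nested, flat, ordered, result)
```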

Deleting elements

There are several ways to delete elements from a list. Python provides a few built-in
methods for removing elements as follows:
• pop() : This method deletes the last element in the list by default, or removes the
item at the index passed as an argument. pop modifies the list and returns the element
that was removed.

my_list = ['p','r','o','b','l','e','m']
print('popped element is:',my_list.pop())
print(my_list)
print('popped element is:',my_list.pop(1))
print(my_list)

Output:
popped element is: m
['p', 'r', 'o', 'b', 'l', 'e']
popped element is: r
['p', 'o', 'b', 'l', 'e']

• remove(): This method can be used if the index of the element is not known; the value to
be removed is specified as the argument.

my_list = ['p','r','o','p','e','r']
my_list.remove('r')
print(my_list)

Output:
['p', 'o', 'p', 'e', 'r']

— Note that, this function will remove only the first occurrence of the specified value,
but not all occurrences.

— Unlike pop() function, the remove() function will not return the value that has been
deleted.

• del: This statement deletes an element by index and can also delete more than one item at a
time using a slice. Here also, the deleted items are not returned.

my_list = ['p','r','o','b','l','e','m']
del my_list[2]        #deletes the element at position 2
print(my_list)        #['p', 'r', 'b', 'l', 'e', 'm']
del my_list[1:5]      #deletes the elements at positions 1 to 4
print(my_list)        #['p', 'm']
del my_list           #deletes the entire list
print(my_list)        #NameError: name 'my_list' is not defined

#Deleting all odd indexed elements of a list
t=['a', 'b', 'c', 'd', 'e']
del t[1::2]
print(t)              #['a', 'c', 'e']
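Since remove() deletes only the first matching value, removing every occurrence needs another approach. One common option, sketched here, is to build a new list with a list comprehension; the sketch also shows that remove() raises ValueError when the value is absent:

```python
letters = ['p', 'r', 'o', 'p', 'e', 'r']

# Keep every element except 'r'; the original list is left untouched.
without_r = [ch for ch in letters if ch != 'r']

# remove() fails with ValueError if the value is not in the list.
try:
    letters.remove('z')
    missing = False
except ValueError:
    missing = True

print(without_r, missing)
```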

Sets

• Python provides a data structure that represents a mathematical set. As with mathematical
sets, we use curly braces { } in Python code to enclose the elements of a literal set.

• Python distinguishes between set literals and dictionary literals by the fact that all the items
in a dictionary are colon-connected (:) key-value pairs, while the elements in a set are simply
values.

• Unlike Python lists, sets are unordered and cannot contain duplicate elements. The following
interactive sequence demonstrates these set properties:

>>> S = {10, 3, 7, 2, 11}
>>> S
{2, 11, 3, 10, 7}
>>> T = {5, 4, 5, 2, 4, 9}
>>> T
{9, 2, 4, 5}

• We can make a set out of a list using the set conversion function:
>>> L = [10, 13, 10, 5, 6, 13, 2, 10, 5]
>>> S = set(L)
>>> S
{10, 2, 13, 5, 6}

As we can see, the element ordering is not preserved, and duplicate elements appear only
once in the set.

• Python set notation exhibits one important difference with mathematics: the expression { }
does not represent the empty set. In order to use the curly braces for a set, the set must
contain at least one element. The expression set( ) produces a set with no elements, and
thus represents the empty set.

• Python reserves the { } notation for empty dictionaries. Unlike in mathematics, all sets in
python must be finite. Python supports the standard mathematical set operations of
intersection, union, set difference, and symmetric difference. Table below shows the python
syntax for these operations.

Example : The following interactive sequence computes the union and intersection of two
sets and tests for set membership:

>>> S = {2, 5, 7, 8, 9, 12}
>>> T = {1, 5, 6, 7, 11, 12}
>>> S | T
{1, 2, 5, 6, 7, 8, 9, 11, 12}
>>> S & T
{12, 5, 7}
>>> 7 in S
True
>>> 11 in S
False
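The remaining two operations mentioned above, set difference and symmetric difference, use the - and ^ operators. A sketch with the same two sets, which also confirms that { } creates a dictionary rather than an empty set:

```python
S = {2, 5, 7, 8, 9, 12}
T = {1, 5, 6, 7, 11, 12}

only_in_S = S - T          # elements of S that are not in T
only_in_T = T - S          # elements of T that are not in S
in_one_only = S ^ T        # elements in exactly one of the two sets

empty_set = set()          # the empty set
empty_braces = {}          # this is an empty dictionary

print(only_in_S, only_in_T, in_one_only)
print(type(empty_set), type(empty_braces))
```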

Tuples

• Tuples are one of Python's simplest and most common collection types. A tuple is a
sequence of values much like a list.

• The values stored in a tuple can be any type, and they are indexed by integers.

• A tuple is an immutable list of values.

Tuple creation
There are many ways to create a tuple. They are as follows:

Syntactically, a tuple is a comma-separated list of values.
t ='a','b','c','d','e'
print(t)                 #('a', 'b', 'c', 'd', 'e')

Although not necessary, it is common to enclose tuples in parentheses.
t=('a','b','c','d','e')
print(t)                 #('a', 'b', 'c', 'd', 'e')

Creating an empty tuple
t1=tuple()
print(t1)                #()
t2=()
print(t2)                #()

Passing arguments to tuple function
t=tuple('lupins')
print(t)                 #('l', 'u', 'p', 'i', 'n', 's')
t = tuple(range(3))
print(t)                 #(0, 1, 2)

To create a tuple with a single element, you have to include the final comma.
t1 = ('a',)
print(type(t1))          #<class 'tuple'>
t2 = ('a')
print(type(t2))          #<class 'str'>

Tuple with mixed datatypes
t = (1, "Hello", 3.4)
print(t)                 #(1, 'Hello', 3.4)

Creating nested tuple
t=("mouse",[8,4,6],(1,2,3))
print(t)                 #('mouse', [8, 4, 6], (1, 2, 3))


Accessing Elements in a Tuple

The index operator [ ] is used to access an item in a tuple, where the index starts from 0.
College =('r','n','s','i','t')
print(College[0])
print(College[4])

#accessing elements of nested tuple
n_tuple =("program", [8, 4, 6],(1, 2, 3))
print(n_tuple[0][3])
print(n_tuple[1][1])

#tuple slicing is possible
t = ('a', 'b', 'c', 'd', 'e')
print(t[1:3])

Output:
r
t
g
4
('b', 'c')

In the above example, n_tuple is a tuple with mixed elements, i.e., a string, a list and a tuple.
Tuples are immutable
• One of the main differences between lists and tuples in Python is that tuples are immutable,
that is, one cannot add or modify items once the tuple is initialized.

For example:
>>> t = (1, 4, 9)
>>> t[0] = 2

Traceback (most recent call last):


File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment

• Similarly, tuples don't have .append and .extend methods as list does. Using += is possible,
but it changes the binding of the variable, and not the tuple itself:

>>> t = (1, 2)
>>> q = t
>>> t += (3, 4)
>>> t
(1, 2, 3, 4)
>>> q
(1, 2)


• Be careful when placing mutable objects, such as lists, inside tuples. This may lead to very
confusing outcomes when changing them. For example:

>>> t = (1, 2, 3, [1, 2, 3])
>>> t
(1, 2, 3, [1, 2, 3])

# the statement below both raises an error and changes the contents of the list within the tuple:

>>> t[3] += [4, 5]
TypeError: 'tuple' object does not support item assignment
>>> t
(1, 2, 3, [1, 2, 3, 4, 5])
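The tuple itself never changes here; only the list stored inside it does. Calling a list method on that element changes the list without any error, as the sketch below suggests. It also shows one consequence: a tuple holding a list cannot be hashed, so it cannot serve as a dictionary key.

```python
t = (1, 2, 3, [1, 2, 3])

t[3].extend([4, 5])       # modifies the inner list in place, no assignment to t[3]

# A tuple is hashable only if all of its elements are hashable.
try:
    hash(t)
    hashable = True
except TypeError:
    hashable = False

print(t, hashable)
```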

• You can use the + or += operator to "append" to a tuple or to "replace" one of its elements.
This works by creating a new tuple and assigning it to the variable; the old tuple is not
changed, but replaced.

>>> t = ('a', 'b', 'c', 'd', 'e')
>>> t = ('A',)+ t[1:]
>>> print(t)
('A', 'b', 'c', 'd', 'e')

Comparing tuples
• Tuples can be compared using operators like >, <, >=, == etc.

• Python starts by comparing the first element from each sequence. If they are equal,
it goes on to the next element, and so on, until it finds elements that differ.
Subsequent elements are not considered (even if they are really big).

Example Description
>>>(0, 1, 2) < (0, 3, 4) Step1:0==0
True Step2:1 < 3 true
Comparison stops at this stage
Step3:2 < 4 ignored
>>>(0, 1, 2000000) < (0, 3, 4) Step1:0==0
True Step2:1 < 3 true
Comparison stops at this stage
Step3:2000000 < 4 ignored

• When we use a relational operator on tuples containing non-comparable types, a TypeError
will be raised.

>>>(1,'hi')<('hello','world')
TypeError: '<' not supported between instances of 'int' and 'str'

• The sort function works the same way. It sorts primarily by first element, but in the case of a
tie, it sorts by second element, and so on. This feature lends itself to a pattern known as DSU:

—Decorate a sequence by building a list of tuples with one or more sort keys preceding
the elements from the sequence,

—Sort the list of tuples using the Python built-in sort, and

—Undecorate by extracting the sorted elements of the sequence

Consider a program that sorts the words in a sentence from longest to shortest, which
illustrates the DSU pattern.
txt = 'this is an example for sorting tuple'
words = txt.split()
t = list()
for word in words:
    t.append((len(word), word))
print(t) #displays unsorted list

t.sort(reverse=True)
print(t) #displays sorted list

res = list()
for length, word in t:
    res.append(word)
print(res) #displays sorted list with only words but not length

Output:
[(4,'this'),(2,'is'),(2,'an'),(7,'example'),(3,'for'),(7,'sorting'),(5,'tuple')]
[(7,'sorting'),(7,'example'),(5,'tuple'),(4,'this'),(3,'for'),(2,'is'),(2,'an')]
['sorting', 'example', 'tuple', 'this', 'for', 'is', 'an']

In the above program,


— The first loop builds a list of tuples, where each tuple is a word preceded by its length.

— sort compares the first element, length, first, and only considers the second element to break
ties. The keyword argument reverse=True tells sort to go in decreasing order.

— The second loop traverses the list of tuples and builds a list of words in descending order of
length. Words of the same length are sorted in reverse alphabetical order, so “sorting”
appears before “example” in the list.
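The same ordering can be obtained without building the intermediate list by hand, using the key parameter of the built-in sorted function. This is offered as an alternative sketch, not as the method used in the program above:

```python
txt = 'this is an example for sorting tuple'

# The key function returns the same (length, word) pair that DSU builds explicitly.
res = sorted(txt.split(), key=lambda w: (len(w), w), reverse=True)
print(res)
```

Using key=len alone would keep words of equal length in their original order instead of reverse alphabetical order, so the tuple key is what reproduces the DSU result.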

Tuple assignment (Packing and Unpacking Tuples)


• One of the unique syntactic features of the Python language is the ability to have a tuple on
the left side of an assignment statement. This allows to assign more than one variable at a
time when the left side is a sequence.

Ex:
>>> x, y, z = 1, 2, 3
>>> print(x) #prints 1
>>> print(y) #prints 2
>>> print(z) #prints 3
• when we use a tuple on the left side of the assignment statement, we omit the parentheses,
but the following is an equally valid syntax:

>>> m = [ 'have', 'fun' ]


>>> (x, y) = m
>>> x #prints 'have'
>>> y #prints 'fun'

• A particularly clever application of tuple assignment allows us to swap the values of two
variables in a single statement.

>>> a=10
>>> b=20
>>> a, b = b, a
>>> print(a, b) #prints 20 10

Both sides of this statement are tuples, but the left side is a tuple of variables; the right side
is a tuple of expressions. Each value on the right side is assigned to its respective variable
on the left side. All the expressions on the right side are evaluated before any of the
assignments.
• The number of variables on the left and the number of values on the right must be the same:

>>> a, b = 1, 2, 3
ValueError: too many values to unpack (expected 2)


• The symbol _ can be used as a disposable variable name if we want only few elements of a
tuple, acting as a placeholder:

>>>a = 1, 2, 3, 4
>>>_, x, y, _ = a
>>>print(x) #prints 2
>>>print(y) #prints 3

• Sometimes we may be interested in only a few values of the tuple. In that case, a variable
with a * prefix can be used as a catch-all variable, which collects the remaining values of
the tuple in a list.

>>> first, *more, last = (1, 2, 3, 4, 5)
>>> print(first) #prints 1
>>> print(more) #prints [2, 3, 4]
>>> print(last) #prints 5

• More generally, the right side can be any kind of sequence (string, list, or tuple). For example,
to split an email address into a user name and a domain, the code is as follows:

>>> addr = 'monty@python.org'
>>> uname, domain = addr.split('@')
>>> print(uname) #prints monty
>>> print(domain) #prints python.org
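Tuple assignment is also how a function call that returns several values is usually unpacked. A brief sketch using the built-in divmod function and a starred variable; the sample values are illustrative:

```python
# divmod returns a tuple (quotient, remainder), unpacked in one statement.
quotient, remainder = divmod(17, 5)

# A starred name collects whatever is left over, as a list.
head, *rest = "a,b,c,d".split(",")

# _ is used for a value we do not need (here, the separator itself).
before, _, after = "key=value".partition("=")

print(quotient, remainder, head, rest, before, after)
```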

Dictionaries and tuples


• Dictionaries have a method called items that returns a list of tuples, where each tuple is a
key-value pair:

>>> d = {'a':10, 'b':1, 'c':22}


>>> t = list(d.items())
>>> print(t)
[('b', 1), ('a', 10), ('c', 22)]

• A dictionary keeps its items in the order in which they were inserted (Python 3.7 and later),
which is usually not sorted order.

• However, since the list of tuples is a list, and tuples are comparable, we can sort the list
of tuples with sort(). Converting a dictionary to a list of tuples is therefore a way to output the
contents of a dictionary sorted by key.


>>> d = {'b':1, 'a':10, 'c':22}
>>> t = list(d.items())     # converting a dictionary to a list of tuples
>>> t
[('b', 1), ('a', 10), ('c', 22)]
>>> t.sort()                # sorting the list of tuples
>>> t
[('a', 10), ('b', 1), ('c', 22)]   # sorted list in alphabetical order
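
Sorting the list of tuples orders it by key because tuples are compared element by element, starting with the first. To order by value instead, one option (an illustration, not taken from the notes) is to build (value, key) tuples first. The names by_key and by_value are assumptions.

```python
d = {'b': 1, 'a': 10, 'c': 22}

# Sort by key: tuples compare on their first element first.
by_key = sorted(d.items())

# Sort by value, largest first: put the value at the front of each tuple.
by_value = sorted(((v, k) for k, v in d.items()), reverse=True)

print(by_key)     # [('a', 10), ('b', 1), ('c', 22)]
print(by_value)   # [(22, 'c'), (10, 'a'), (1, 'b')]
```

The built-in sorted() returns a new list and leaves the dictionary unchanged, whereas the list method sort() rearranges the list in place.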

Dictionaries

• Dictionary is a set of key: value pairs, with the requirement that the keys are unique (within
one dictionary).

• Dictionary is a mapping between a set of indices (which are called keys) and a set of values.
Each key maps to a value. The association of a key and a value is called a key-value pair.

• Unlike sequences, which are indexed by a range of numbers, dictionaries are indexed
by keys, which can be any immutable type such as strings, numbers, or tuples (if the tuples contain
only strings, numbers, or tuples).

• Dictionaries are mutable, that is, they are modifiable.

• A pair of braces creates an empty dictionary: { }.

d = { }
• The function dict creates a new dictionary with no items.

empty_d = dict()
• Placing a comma-separated list of key:value pairs within the braces adds initial key:value
pairs to the dictionary.

Ex: a dictionary that maps from English to Spanish words


eng2sp = {}
eng2sp['one'] = 'uno'
print(eng2sp)
eng2sp['two'] = 'dos'
print(eng2sp)
eng2sp['three'] = 'tres'
print(eng2sp)
Output:
{'one': 'uno'}
{'one': 'uno', 'two': 'dos'}
{'one': 'uno', 'two': 'dos', 'three': 'tres'}


• A dictionary can be initialized at the time of creation itself.

eng2sp = {'one': 'uno', 'two': 'dos', 'three': 'tres'}
print(eng2sp)
Output:
{'one': 'uno', 'two': 'dos', 'three': 'tres'}

In Python 3.7 and later a dictionary preserves the order in which its items were inserted; in older
versions the printed order could differ from the creation order. Either way, dictionary members are
not indexed by integers, so elements are looked up by key rather than by position.
Accessing elements of dictionary
• To access an element within a dictionary, we use square brackets exactly as we would with a
list, but with a key in place of an index. In a dictionary every key has an associated value.

>>> print(eng2sp['two'])
dos
The key 'two' always maps to the value 'dos', so the order of the items doesn't matter.

• If the key isn’t in the dictionary, you get an exception:

>>> print(eng2sp['four'])
KeyError: 'four'
Length of the dictionary
The len function works on dictionaries; it returns the number of key-value pairs:
>>> num_word = {1: 'one', 2: 'two', 3:'three'}
>>>len(num_word)
3
in operator with dictionaries
The in operator works on dictionaries to check whether something appears as a key in the
dictionary (but not the value).
>>>eng2sp ={'one':'uno','two':'dos','three':'tres'}

>>> 'one' in eng2sp #searching key


True
>>> 'uno' in eng2sp #searching value
False

To see whether something appears as a value in a dictionary, you can use the method values,
which returns a collection of values, and then use the in operator:


>>> vals = eng2sp.values()


>>> 'uno' in vals
True

The in operator uses different algorithms for lists and dictionaries.


— For lists, it searches the elements of the list in order i.e., linear search. As the list gets
longer, the search time gets longer in direct proportion.

— For dictionaries, Python uses a data structure called a hash table that has a remarkable
property: the in operator takes about the same amount of time no matter how many
items are in the dictionary.
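
The difference between the two search strategies can be made visible by counting comparisons instead of measuring time. The sketch below uses a hypothetical key class, CountingKey, that records every == evaluation. It is an illustration under that assumption, not how Python is normally instrumented.

```python
class CountingKey:
    """Hypothetical key type that counts how often == is evaluated."""
    eq_calls = 0

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return hash(self.value)

    def __eq__(self, other):
        CountingKey.eq_calls += 1
        return self.value == other.value


keys = [CountingKey(i) for i in range(1000)]
table = dict.fromkeys(keys)      # the same keys stored in a dictionary
target = CountingKey(999)        # equal to the last key, but a new object

CountingKey.eq_calls = 0
found_in_list = target in keys   # linear search through the list
list_comparisons = CountingKey.eq_calls

CountingKey.eq_calls = 0
found_in_dict = target in table  # hash-table lookup
dict_comparisons = CountingKey.eq_calls

print(found_in_list, list_comparisons)
print(found_in_dict, dict_comparisons)
```

The list search compares the target with every element until it reaches the match, so the count grows with the list. The dictionary lookup jumps to the slot given by the hash value and needs only a handful of comparisons, however large the dictionary is.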

Dictionary as a set of counters

• Assume that we need to count the frequency of each letter in a given string. There are
different ways to do it.

— Create 26 variables, one for each letter. Traverse the given string and increment the
corresponding counter when a letter is found.

— Create a list with 26 elements (all zero in the beginning), one for each letter.
Traverse the given string and increment the corresponding indexed position in the list when a
letter is found.

— Create a dictionary with characters as keys and counters as values. When we find a
character for the first time, we add the item to the dictionary. From the next time onwards, we
increment the value of the existing item.

Each of the above methods performs the same task, but the logic of implementation is different.
Here, we will see the implementation using a dictionary.

word = 'vishveshwarayya'    # given string
d = dict()                  # creates empty dictionary
for c in word:              # extracts each character in the string
    if c not in d:          # if character is not found
        d[c] = 1            # initialize counter to 1
    else:
        d[c] = d[c] + 1     # otherwise, increment counter
print(d)
Output:
{'v': 2, 'i': 1, 's': 2, 'h': 2, 'e': 1, 'w': 1, 'a': 3, 'r': 1, 'y': 2}


It can be observed from the output that a dictionary is created here with characters as keys
and frequencies as values. In other words, we have computed a histogram of the characters.

• Dictionary in Python has a method called get(), which takes key and a default value as two
arguments. If key is found in the dictionary, then the get() function returns corresponding
value, otherwise it returns default value. For example,

month = {'jan': 1, 'march': 3, 'june': 6}
print(month.get('jan', 0))
print(month.get('october', 'not found'))
Output:
1
not found

In the above example, when the get() function is called with 'jan' as the key, it returns the
corresponding value 1, because 'jan' is found in the month dictionary. Whereas, when get() is used
with 'october' as the key, the default value 'not found' (passed as the second argument) is returned.

• The function get() can be used effectively for calculating frequency of alphabets in a string.
Here is the modified version of the program –

word = 'vishveshwarayya'
d = dict()
for c in word:
    d[c] = d.get(c, 0) + 1
print(d)
Output:
{'v': 2, 'i': 1, 's': 2, 'h': 2, 'e': 1, 'w': 1, 'a': 3, 'r': 1, 'y': 2}

In the above program, for every character c in the given string, we try to retrieve its count. When
c is found in d, its value is retrieved, 1 is added to it, and the result is stored back. If c is
not found, 0 is taken as the default and then 1 is added to it.
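
The get()-based counting loop can be wrapped in a function and cross-checked against the standard library. In this sketch, the function name char_frequency is an assumption. collections.Counter is a standard-library class that performs the same counting and is shown here only for comparison.

```python
from collections import Counter


def char_frequency(text):
    """Count characters with dict.get(), using 0 as the default count."""
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    return counts


word = 'vishveshwarayya'
freq = char_frequency(word)
print(freq)
print(freq == dict(Counter(word)))   # True: both approaches agree
```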
Looping and dictionaries

• When a for-loop is applied on dictionaries, it traverses the keys of the dictionary. This loop
prints each key and the corresponding value:

names = {'chuck': 1, 'annie': 42, 'jan': 100}
for key in names:
    print(key, names[key])
Output:
chuck 1
annie 42
jan 100
• If we wanted to find all the entries in a dictionary with a value above ten, we could write the
following code:


names = {'chuck': 1, 'annie': 42, 'jan': 100}
for key in names:
    if names[key] > 10:
        print(key, names[key])
Output:
annie 42
jan 100
The for loop iterates through the keys of the dictionary, so we must use the index operator to
retrieve the value for each key; only the entries whose value is greater than 10 are printed.

• Sometimes we may want to access each key and its value together. This can be done
by using the items() method as follows:

names = {'chuck': 1, 'annie': 42, 'jan': 100}
for k, v in names.items():
    print(k, v)
Output:
chuck 1
annie 42
jan 100

Sorting the dictionary elements

If we want to print the keys in alphabetical order, first make a list of the keys in the dictionary
using the keys method available in dictionary objects, and then sort that list and loop through
the sorted list, looking up each key and printing out key-value pairs in sorted order as follows:
names = {'chuck': 1, 'annie': 42, 'jan': 100}
lst = list(names.keys())
print(lst)
lst.sort()
print("Elements in alphabetical order:")
for key in lst:
    print(key, names[key])
Output:
['chuck', 'annie', 'jan']
Elements in alphabetical order:
annie 42
chuck 1
jan 100
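
A shorter variant of the same idea is sketched below. The built-in sorted() accepts a dictionary directly and returns a sorted list of its keys, so the separate keys() and sort() steps can be skipped. The variable name above_ten is an assumption.

```python
names = {'chuck': 1, 'annie': 42, 'jan': 100}

# sorted() iterates over the keys and returns them as a sorted list.
for key in sorted(names):
    print(key, names[key])

# Keep only the entries whose value exceeds 10, in alphabetical key order.
above_ten = []
for k, v in sorted(names.items()):
    if v > 10:
        above_ten.append((k, v))
print(above_ten)   # [('annie', 42), ('jan', 100)]
```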

FILES

• File is a named location on disk to store related information.


• Data for a Python program can come from different sources such as the keyboard, a text file, a
web server, or a database.
• Files are one such source which can be given as input to a Python program. Hence
handling files in the right manner is very important.
• Primarily there are two types of files:
— Text files : A text file can be thought of as a sequence of lines without any images, tables
etc. These files can be created by using a text editor.

— Binary files : These files are capable of storing text, images, video, audio, database files, etc.
The data they contain is stored as raw bytes rather than as lines of characters.


Opening files
• To perform read or write operation on a file, first file must be opened.
• Opening the file communicates with the operating system, which knows where the data for
each file is stored.
• A file can be opened using a built-in function open( ).
• The syntax of open( ) function is as below :
fhand = open("filename", "mode")

Here,
filename -> is name of the file to be opened. This string may be just a name of the file, or it
may include pathname also. Pathname of the file is optional when the file is
stored in current working directory.
mode -> This string specifies an access mode to use the file i.e., for reading, writing,
appending etc.
fhand -> It is a reference to a file object, which acts as a handler for all further operations on
files.

• If mode is not specified , by default, open( ) uses mode ‘r’ for reading.
>>> fhand = open('sample.txt')
>>>print(fhand)
<_io.TextIOWrapper name='sample.txt' mode='r' encoding='cp1252'>

Note: In this example, we assume the file 'sample.txt' is stored in the same folder that you are in
when you start Python. Otherwise the path of the file has to be passed. The encoding shown in the
output depends on the platform.
fhand = open('c:/users/roopa/desktop/sample.txt')

• If the open is successful, the operating system returns a file handle.


• The file handle is not the actual data contained in the file, but instead it is a “handle” that we
can use to read the data. You are given a handle if the requested file exists and you have the
proper permissions to read the file.


• If the file does not exist, open will fail with a traceback and you will not get a handle to access
the contents of the file:
>>> fhand = open('fibo.txt')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or directory: 'fibo.txt'
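
A program does not have to stop with a traceback when a file is missing. One common approach, sketched here with a hypothetical helper named try_open, is to catch the exception and report the problem. The temporary-folder path is an assumption used only to guarantee that the file is absent.

```python
import os
import tempfile


def try_open(fname):
    """Return a file handle for reading, or None if the file cannot be opened."""
    try:
        return open(fname)
    except (FileNotFoundError, PermissionError):
        print('File cannot be opened:', fname)
        return None


# A brand-new temporary folder is empty, so this file does not exist.
missing = os.path.join(tempfile.mkdtemp(), 'fibo.txt')
fhand = try_open(missing)
print(fhand)   # None
```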
• The modes in which files can be opened are given below :
Mode   Meaning
r      Opens a file for reading. If the specified file does not exist in the specified path, or if
       you don't have permission, an error message will be displayed. This is the default mode
       of the open() function in Python.
w      Opens a file for writing. If the file does not exist, then a new file with the given name
       will be created and opened for writing. If the file already exists, then its content will
       be over-written.
a      Opens a file for appending data. If the file exists, the new content will be appended at
       the end of the existing content. If no such file exists, it will be created and the new
       content will be written into it.
r+     Opens a file for reading and writing.
w+     Opens a file for both writing and reading. Overwrites the existing file if the file exists.
       If the file does not exist, creates a new file for reading and writing.
a+     Opens a file for both appending and reading. The file pointer is at the end of the file if
       the file exists. The file opens in the append mode. If the file does not exist, it creates
       a new file for reading and writing.
rb     Opens a file for reading only in binary format.
wb     Opens a file for writing only in binary format.
ab     Opens a file for appending only in binary format.
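
The behaviour of the three basic modes can be checked with a short experiment. This is a sketch that works on a hypothetical file, modes_demo.txt, created in a temporary folder so that no real file is overwritten.

```python
import os
import tempfile

# Hypothetical file in a fresh temporary folder.
path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

f = open(path, 'w')       # 'w' creates the file (or empties an existing one)
f.write('first\n')
f.close()

f = open(path, 'a')       # 'a' keeps the existing content and adds at the end
f.write('second\n')
f.close()

f = open(path, 'r')       # 'r' reads; the file must already exist
after_append = f.read()
f.close()

f = open(path, 'w')       # opening with 'w' again discards the old content
f.write('replaced\n')
f.close()

f = open(path)            # the mode defaults to 'r'
after_rewrite = f.read()
f.close()

print(repr(after_append))     # 'first\nsecond\n'
print(repr(after_rewrite))    # 'replaced\n'
```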

Reading files
• Once the specified file is opened successfully, the open( ) function provides a handle which
refers to the file.
• There are several ways to read the contents of the file.
1. Using the file handle as the sequence in a for loop.
sample.txt
Python is a high level programming language
it is introduced by Guido van rossam.
Python is easy to learn and simple to code an application


Print_count.py
fhand = open('sample.txt')
count = 0
for line in fhand:
    count = count + 1
    print('Line :', count, line)
print('Line Count:', count)
fhand.close()

Output:

Line : 1 Python is a high level programming language

Line : 2 it is introduced by Guido van rossam.

Line : 3 Python is easy to learn and simple to code an application

Line Count: 3
• In the above example, ‘for’ loop simply counts the number of lines in the file and prints them
out.

• When the file is read using a for loop in this manner, Python takes care of splitting the data
in the file into separate lines using the newline character. Python reads each line through
the newline and includes the newline as the last character in the line variable for each
iteration of the for loop.
• Notice that in the above output there is a blank line between each of the output lines. This is
because the new-line character \n is also a part of the variable line in the loop, and the
print() function by default adds another new-line at the end. To avoid this double-line
spacing, we can remove the new-line character attached at the end of the variable line by using
the built-in string method rstrip() as below:

print("Line: ",count, line.rstrip())

• Because the for loop reads the data one line at a time, it can efficiently read and count the
lines in very large files without running out of main memory to store the data. The above
program can count the lines in any size file using very little memory since each line is read,
counted, and then discarded.

2. The second way of reading a text file loads the entire file into a string :
• If you know the file is relatively small compared to the size of your main memory, you can
read the whole file into one string using the read method on the file handle.


>>>fhand = open('sample.txt')
>>> content= fhand.read()
>>> print(len(content))
140 # count of characters
>>> print(content[:7] )
Python

• In this example, the entire contents (all 140 characters) of the file sample.txt are read
directly into the variable content. We use string slicing to print out the first 7 characters of
the string data stored in the variable content.
• When the file is read in this manner, all the characters, including all of the lines and newline
characters, are one big string in the variable content. It is a good idea to store the output of
read in a variable because a call to read consumes the rest of the file; a second call on the same
handle returns an empty string.
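
The two ways of reading can be compared side by side. The sketch below first writes a small three-line file into a temporary folder (the content is an assumption standing in for sample.txt) and then reads it back both ways.

```python
import os
import tempfile

# Create a small stand-in for sample.txt in a temporary folder.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
fout = open(path, 'w')
fout.write('first line\nsecond line\nthird line\n')
fout.close()

# Way 1: iterate line by line; rstrip() drops the trailing newline.
fhand = open(path)
lines = []
for line in fhand:
    lines.append(line.rstrip())
fhand.close()

# Way 2: read everything into one string.
fhand = open(path)
content = fhand.read()
leftover = fhand.read()    # the handle is already at the end of the file
fhand.close()

print(lines)            # ['first line', 'second line', 'third line']
print(len(content))     # 34
print(repr(leftover))   # ''
```

The first way keeps only one line in memory at a time, which matters for very large files. The second way is convenient when the whole text is needed at once.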

Letting the user choose the file name


• Editing Python code every time to process a different file is tedious work. It would be more
usable to ask the user to enter the file name string each time the program runs so they can
use the program on different files without changing the Python code.
• This is quite simple to do by reading the file name from the user using input as follows:

fname = input('Enter the file name: ')
fhand = open(fname)
count = 0
for line in fhand:
    if line.startswith('ro'):
        count = count + 1
print('There were', count, 'lines that start with ro in', fname)
fhand.close()
Output :
Enter the file name: name.txt
There were 3 lines that start with ro in name.txt

• We read the file name from the user and place it in a variable named fname and open that
file. Now we can run the program repeatedly on different files.


Writing files
• To write a file, you have to open it with mode 'w' as a second parameter:
>>> fout = open('output.txt', 'w')
>>> print(fout)
<_io.TextIOWrapper name='output.txt' mode='w' encoding='cp1252'>

• If the file already exists, opening it in write mode clears out the old data and starts fresh, so
be careful! If the file doesn’t exist, a new one is created.
• The write method of the file handle object puts data into the file, returning the number of
characters written.
>>> line1 = "This here's the wattle,\n"
>>> fout.write(line1)
24

• We must make sure to manage the ends of lines as we write to the file by explicitly inserting
the newline character when we want to end a line. The print function automatically
appends a newline, but the write method does not add the newline automatically.
>>> line2 = 'the emblem of our land.\n'
>>> fout.write(line2)
24
• When you are done writing, you have to close the file to make sure that the last bit of data is
physically written to the disk so it will not be lost if the power goes off.
>>> fout.close( )
• We could close the files which we open for read as well, but we can be a little sloppy if we are
only opening a few files since Python makes sure that all open files are closed when the
program ends. When we are writing files, we want to explicitly close the files so as to leave
nothing to chance.
• To avoid such chances, the with statement allows objects like files to be used in a way that
ensures they are always cleaned up promptly and correctly.
with open("test.txt", 'w') as f:
    f.write("my first file\n")
    f.write("This file\n\n")
    f.write("contains three lines\n")
These lines are written into the file "test.txt":
my first file
This file

contains three lines
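
That the with statement really closes the file can be verified through the closed attribute of the file object. The sketch below assumes a hypothetical file in a temporary folder rather than the test.txt used above.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'test.txt')   # hypothetical location

with open(path, 'w') as f:
    written = f.write('my first file\n')   # write() returns the character count
    inside = f.closed                      # False: still open inside the block

outside = f.closed                         # True: closed when the block ended

with open(path) as f:
    text = f.read()

print(written, inside, outside)   # 14 False True
print(repr(text))                 # 'my first file\n'
```

The file is closed even if an exception is raised inside the block, which is why this form is usually preferred over calling close() by hand.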


Classes and objects

• Python is an object-oriented programming language. Object-oriented programming uses
programmer-defined types to organize both code and data.

• Almost everything in Python is an object, with its properties and methods.

• Class is a basis for any object oriented programming language.

Class definition
• We have used many of Python’s built-in types; now we are going to define a new type.
• Class is a user-defined data type which binds data and functions together into single entity.
• A class is just a prototype (a logical entity or blueprint) for objects; it does not by itself hold the data of any object.
• An object is an instance of a class and it has physical existence. One can create any number
of objects for a class.
• A class can have a set of variables (also known as attributes, member variables) and member
functions (also known as methods).
• Class − A user-defined prototype for an object that defines a set of attributes that
characterize any object of the class. The attributes are data members and methods, accessed
via dot notation.
• A class can be defined with the following syntax:
class ClassName:
    'Optional class documentation string'
    class_suite
— A class is created using the keyword class.
— The class has a documentation string, which can be accessed via ClassName.__doc__.
— The class_suite consists of all the component statements defining class members, data
attributes and functions.
• As an example, we will create a class called Point .
class Point:
    pass    # creating an empty class
Here, we are creating an empty class without any members by just using the keyword pass
within it.
• A class can be created with only a documentation string as follows:
class Point:
    """Represents a point in 2-D space."""
print(Point)
Output:
<class '__main__.Point'>


Because Point is defined at the top level, its “full name” is __main__.Point. The term __main__
indicates that the class Point is in the main scope of the current module.

• Creating a new object is called instantiation, and the object is an instance of the class.
Class can have any number of instances. Object of the class can be created as follows:

blank = Point()
print(blank)
Output:
<__main__.Point object at 0x03C72070>

Here blank is not the actual object; rather, it holds a reference to a Point object. When we print
an object, Python tells us which class it belongs to and where it is stored in memory. Observe
the output: it shows that the object occupies physical space at location
0x03C72070 (a hexadecimal value).
Attributes
• An object can contain named elements known as attributes. One can assign values to
these attributes using the dot operator. For example, (0,0) represents the origin, and
coordinates (x,y) represent some point, so we can assign two attributes x and y to the
object blank of class Point as below:
>>> blank.x = 3.0
>>> blank.y = 4.0

[Figure: Object Diagram]

A state diagram that shows an object and its attributes is called an object diagram.
The variable blank refers to a Point object, which contains two attributes. Each attribute
refers to a floating-point number.
• We can read the value of an attribute using the same syntax:
>>> blank.y # read the value of an attribute y
4.0
>>> x = blank.x    # attribute x of an object can be assigned to other variables
>>> x
3.0

The expression blank.x means, “Go to the object blank refers to and get the value of x”. In
the example, we assign that value to a variable named x. There is no conflict between the
variable x and the attribute x.


• Classes in Python have two types of attributes:

Class attribute − A variable that is shared by all instances of a class. They are common to
all the objects of that class. Class variables are defined within a class but outside any of the
class's methods. Class variables are not used as frequently as instance variables are.

Instance attribute − An attribute defined for an individual object. Attributes of one
instance are not available to another instance of the same class.
The following example demonstrates the usage of instance attributes and class attributes:
class Flower:
    '''flowers and their behaviour'''
    color = 'white'    # class attribute shared by all instances

>>> lotus = Flower()
>>> rose = Flower()
>>> lotus.color               # shared by all Flower objects
'white'
>>> rose.color                # shared by all Flower objects
'white'
>>> rose.usedIn = 'bouquet'   # defining instance attribute
>>> rose.usedIn               # specific to rose object
'bouquet'
>>> lotus.usedIn
AttributeError: 'Flower' object has no attribute 'usedIn'

Here, the attribute usedIn is available only for the object rose, but not for lotus.
Thus, usedIn is an instance attribute and not a class attribute. We can use attributes with dot
notation as part of any expression.
>>> '(%g, %g)' % (blank.x, blank.y)
'(3, 4)'
>>> total = blank.x + blank.y
>>> total
7.0
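
One detail worth seeing in code is what happens when an instance assigns to a name that already exists as a class attribute. In the sketch below (the attribute name used_in and the colour values are assumptions), the assignment creates a new instance attribute and leaves the class attribute untouched.

```python
class Flower:
    """Hypothetical class with one class attribute."""
    color = 'white'            # class attribute, shared by all instances


lotus = Flower()
rose = Flower()

rose.color = 'red'             # creates an instance attribute on rose only
rose.used_in = 'bouquet'       # another instance attribute

print(lotus.color)             # white (still reads the class attribute)
print(rose.color)              # red   (the instance attribute hides it)
print(Flower.color)            # white (the class attribute is unchanged)
print(hasattr(lotus, 'used_in'))   # False
```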

• We can pass an instance as an argument in the usual way.


def print_point(p):    # p is an alias of blank
    print('(%g, %g)' % (p.x, p.y))

>>> print_point(blank)    # reference of object is sent to p
(3, 4)


Example :
Program to create a class Point representing a point on coordinate system. Implement following
functions –
— A function read_point() to receive x and y attributes of a Point object as user input
— A function distance() which takes two objects of Point class as arguments and computes
the Euclidean distance between them.

import math

class Point:
    """ class Point representing a coordinate point"""

def read_point(p):
    p.x = float(input("x coordinate:"))
    p.y = float(input("y coordinate:"))

def print_point(p):
    print("(%g,%g)" % (p.x, p.y))

def distance(p1, p2):
    d = math.sqrt((p1.x - p2.x)**2 + (p1.y - p2.y)**2)
    return d

p1 = Point()                   # create first object
print("Enter First point:")
read_point(p1)                 # read x and y for p1

p2 = Point()                   # create second object
print("Enter Second point:")
read_point(p2)                 # read x and y for p2

dist = distance(p1, p2)        # compute distance

print("First point is:", end=" ")
print_point(p1)                # print p1
print("Second point is:", end=" ")
print_point(p2)                # print p2

print("Distance is: %g" % dist)    # print the distance

Output:
Enter First point:
x coordinate:10
y coordinate:20
Enter Second point:
x coordinate:3
y coordinate:5
First point is: (10,20)
Second point is: (3,5)
Distance is: 16.5529


In the above program, we have used 3 functions which are not members of the class:

— read_point(p) to read the input through keyboard for x and y values.
— print_point(p) to display a Point object in the form of an ordered pair.
— distance(p1,p2) to find the Euclidean distance between two points (x1,y1) and (x2,y2).

The formula is

$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$

Since none of these functions belongs to the class, they are called as normal functions without
dot notation.

Constructor method
• Python uses a special method called a constructor method. Python allows you to define only
one constructor per class. Also known as the __init__() method, it is by convention the first
method definition of a class and its syntax is
def __init__(self, parameter_1, parameter_2, …., parameter_n):
    statement(s)

• The __init__() method defines and initializes the instance variables. It is invoked as soon as
an object of a class is instantiated.
• The __init__() method for a newly created object is automatically executed with all of its
parameters .
• The __init__() method is indeed a special method as other methods do not receive this
treatment. The parameters for __init__() method are initialized with the arguments that you
had passed during instantiation of the class object.

• Methods whose names begin and end with a double underscore (__) are called special methods as
they have special meaning. The number of arguments passed during the instantiation of the class
object should be equal to the number of parameters in the __init__() method (excluding the self
parameter), unless some parameters have default values.

• Example:

class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

p1 = Person("John", 36)

print(p1.name)
print(p1.age)
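
The same idea can be applied to the Point class from the earlier example. The sketch below is an assumed variation, not the notes' own program: it moves the initialization into __init__ with default values and turns distance into a method, so that no input from the keyboard is needed.

```python
import math


class Point:
    """A point in 2-D space, initialized through the constructor."""

    def __init__(self, x=0.0, y=0.0):
        self.x = x
        self.y = y

    def distance(self, other):
        """Euclidean distance from this point to another Point."""
        return math.sqrt((self.x - other.x) ** 2 + (self.y - other.y) ** 2)


p1 = Point(10, 20)
p2 = Point(3, 5)
origin = Point()                       # uses the default values 0.0, 0.0

print('%g' % p1.distance(p2))          # 16.5529
print(Point(3, 4).distance(origin))    # 5.0
```

Because distance is now a method, it is called with dot notation on one point and receives the other point as its argument.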


Inheritance
• Inheritance enables new classes to receive or inherit variables and methods of existing
classes. Inheritance is a way to express a relationship between classes. If you want to build
a new class that is similar to one that already exists, then instead of creating the
new class from scratch you can reference the existing class and indicate what is different by
overriding some of its behavior or by adding some new functionality.

• A class that is used as the basis for inheritance is called a superclass or base class. A class
that inherits from a base class is called a subclass or derived class. The terms parent class
and child class are also acceptable terms to use respectively.

• A derived class inherits variables and methods from its base class while adding additional
variables and methods of its own. Inheritance easily enables reusing of existing code. For example,
suppose a class BaseClass has one variable and one method.

• A class DerivedClass that is derived from BaseClass then contains those members plus an
additional variable and an additional method.

• The syntax for a derived class definition looks like this:


class DerivedClassName(BaseClassName):
    <statement-1>
    .
    .
    <statement-N>

• To demonstrate the use of inheritance, let us take an example.

A rectangle is a closed figure with four sides. Say, we have a class called Rectangle
defined as follows.


class Rectangle():    # base class
    def __init__(self, w, h):
        self.w = w
        self.h = h

    def area(self):
        return self.w * self.h

    def perimeter(self):
        return 2 * (self.w + self.h)

# The Rectangle class can be used as a base class for defining a
# Square class, as a square is a special case of rectangle.

class Square(Rectangle):    # derived class
    def __init__(self, s):
        super().__init__(s, s)    # call parent constructor, w and h are both s
        self.s = s

r = Rectangle(3, 4)    # creating instances
s = Square(2)

print("area of rectangle", r.area())
print("perimeter of rectangle", r.perimeter())

print("area of square", s.area())
print("perimeter of square", s.perimeter())
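
Besides reusing inherited methods, a derived class can override one of them. The sketch below adds a hypothetical describe() method to illustrate this; the built-in functions isinstance() and issubclass() confirm the relationship between the two classes.

```python
class Rectangle:
    def __init__(self, w, h):
        self.w = w
        self.h = h

    def area(self):
        return self.w * self.h

    def describe(self):
        return 'rectangle %g x %g' % (self.w, self.h)


class Square(Rectangle):
    def __init__(self, s):
        super().__init__(s, s)

    def describe(self):                    # overrides the inherited method
        return 'square of side %g' % self.w


shapes = [Rectangle(3, 4), Square(2)]
for shape in shapes:
    # area() is inherited by Square; describe() is chosen per class.
    print(shape.describe(), 'area', shape.area())

print(isinstance(shapes[1], Rectangle))    # True: a Square is also a Rectangle
print(issubclass(Square, Rectangle))       # True
```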

Overloading
Operator overloading
• Normally operators like +, -, /, * work fine with built-in data types.

• Changing the behavior of an operator so that it works with programmer-defined types (classes)
is called operator overloading.

• Basic operators like +, -, * etc. can be overloaded. To overload an operator, one needs to write
a special method within the user-defined class. The method should contain the code for what the
programmer wants the operator to do.

• Let us consider an example to overload + operator to add two Time objects by defining
__add__ method inside the class.


def __add__(self, t2):
    sum = Time()
    sum.hour = self.hour + t2.hour
    sum.minute = self.minute + t2.minute
    sum.second = self.second + t2.second
    return sum

t3 = t1 + t2    # When we apply the + operator to Time objects, Python invokes __add__.

In the above example,

→ when the statement t3=t1+t2 is used, it invokes a special method __add__() written inside
the class. Because, internal meaning of this statement is t3 = t1.__add__(t2)

→ Here, t1 is the object invoking the method. Hence, self inside __add__() is the reference (alias)
of t1. And, t2 is passed as argument explicitly.

Python provides a special set of methods which have to be used for overloading operators.
Each operator has its own method; for example, + corresponds to __add__, - to __sub__,
* to __mul__, == to __eq__ and < to __lt__.
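
A complete, runnable version of the Time idea might look as follows. This is a sketch under assumptions: the constructor, the helper to_seconds() and the comparison methods are additions for illustration. Unlike the fragment above, this __add__ carries seconds and minutes beyond 60 into the next unit.

```python
class Time:
    """Hypothetical Time class; attribute names follow the fragment above."""

    def __init__(self, hour=0, minute=0, second=0):
        self.hour = hour
        self.minute = minute
        self.second = second

    def to_seconds(self):
        return self.hour * 3600 + self.minute * 60 + self.second

    def __add__(self, other):              # called for t1 + t2
        total = self.to_seconds() + other.to_seconds()
        minutes, second = divmod(total, 60)
        hour, minute = divmod(minutes, 60)
        return Time(hour, minute, second)

    def __eq__(self, other):               # called for t1 == t2
        return self.to_seconds() == other.to_seconds()

    def __lt__(self, other):               # called for t1 < t2
        return self.to_seconds() < other.to_seconds()

    def __str__(self):                     # used by print() and str()
        return '%02d:%02d:%02d' % (self.hour, self.minute, self.second)


t1 = Time(9, 45, 50)
t2 = Time(1, 35, 20)
t3 = t1 + t2
print(t3)          # 11:21:10
print(t1 < t3)     # True
```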

Example program
This program demonstrates creating or defining a class and its object, __init__ method
and operator overloading concept by overloading + operator by redefining __add__
function.


class Point:
    def __init__(self, a=0, b=0):
        self.x = a
        self.y = b

    def __str__(self):
        return "(%d,%d)" % (self.x, self.y)

    def __add__(self, p2):
        p3 = Point()
        p3.x = self.x + p2.x
        p3.y = self.y + p2.y
        return p3

p1 = Point(10, 20)
p2 = Point()
print("P1 is:", p1)
print("P2 is:", p2)
p4 = p1 + p2
print("Sum is:", p4)
Output:
P1 is: (10,20)
P2 is: (0,0)
Sum is: (10,20)


Question Bank
Q. No.  Questions
LISTS
1 What are lists? Lists are mutable. Justify the statement with examples.
2 How do you create an empty list? Give example.
3 Explain + and * operations on list objects with suitable examples.
4 Discuss various built-in methods in lists.
5 Implement a Python program using Lists to store and display the average of N integers accepted
from the user.
6 What are the different ways of deleting elements from a list? Discuss with suitable functions.
7 How do you convert a list into a string and vice-versa? Illustrate with examples.
8 Write a short note on: a) Parsing lines b) Object Aliasing
9 Write the differences between
a) sort() and sorted()
b) append() and extend()
c) join() and split()
10 When do we encounter TypeError, ValueError and IndexError?
11 What are identical and equivalent objects? How are they identified? Give examples.
12 Discuss different ways of traversing a list.
13 Discuss the following List operations and functions with examples :
1. Accessing, Traversing and Slicing the List Elements
2. + (concatenation) and * (Repetition)
3. append, extend, sort , remove and delete
4. len, sum, min and max
5. split and join
DICTIONARY and TUPLES

1 How tuples are created in Python? Explain different ways of accessing and creating them.
2 Write a Python program to read all lines in a file accepted from the user and print all email
addresses contained in it. Assume the email addresses contain only non-white space characters.
3 List merits of dictionary over list.
4 Explain dictionaries. Demonstrate with a Python program.
5 Compare and contrast tuples with lists.
6 Define a dictionary type in Python. Give example.
7 Explain get() function in dictionary with suitable code snippet.
8 Discuss the dictionary methods keys() and items() with suitable programming examples.


9 Explain various steps to be followed while debugging a program that uses large datasets.
10 Briefly discuss key-value pair relationship in dictionaries.
11 Define a tuple. Give an example to illustrate creation of a tuple
12 What are mutable and immutable objects? Give examples.
13 Explain List of Tuples and Tuple of Lists.
14 How do you create an empty tuple and a single element tuple?
15 Explain DSU pattern with respect to tuples. Give example
16 How do you create a tuple using a string and using a list? Explain with example.
17 Explain the concept of tuple comparison. How is tuple comparison implemented in the sort()
function? Discuss.
18 Write a short note on:
a) Tuple assignment
b) Dictionaries and tuples
19 How can tuples be used as keys for dictionaries? Discuss with an example.
20 Discuss pros and cons of various sequences like lists, strings and tuples.
21 Explain the following operations in tuples:
a) Sum of two tuples
b) Slicing operators
22 Discuss tuple assignment with an example. Explain how swapping can be done using tuple
assignment. Write a Python program to input two integers a and b, and swap those numbers.
Print both the input and the swapped numbers.
SETS
23 Explain how to create an empty set.
24 List the merits of sets over list. Demonstrate it with example.
25 Write a program to create an intersection, union, set difference, and symmetric
difference of sets.

Classes and Objects

26 Define class and object. Give an example of creating a class and an object of that class.

27 What are attributes? Explain with an example and respective object diagram.

28 Write a program to create a class called Rectangle with the help of a corner point, width and
height. Write the following functions and demonstrate their working:
a. To find and display the center of the rectangle
b. To display a point as an ordered pair
c. To resize the rectangle


29 What is a Docstring? Why are they written?

30 What do you mean by “instance as returning value”? Explain with an example.

31 What is inheritance? Demonstrate it with example.

32 When do we encounter AttributeError?

33 How do you find the memory address of an instance of a class?

34 Differentiate instance attribute and class attribute

35 Differentiate between a pure function and a modifier. Write a Python program to find the duration of an
event, given its start and end times, by defining a class TIME.
36 Write a program to create a class called 'Time' to represent time in HH:MM:SS format.
Perform the following operations:
i. T3=T1+T2
ii. T4=T1+360
iii. T5=130+T1
where T1, T2, T3, T4, T5 are objects of the Time class.


37 Explain the __init__ and __str__ methods with an example program.

38 Write a program to add two Point objects by overloading the + operator. Overload __str__() to
display a point as an ordered pair.

39 Define polymorphism. Demonstrate polymorphism with a function that builds a histogram counting the
number of times each letter appears in a word and in a sentence.
40 Using datetime module write a program that gets the current date and prints the day of the week.

41 What does the keyword self in python mean? Explain with an example.

Module 1 [22MCA31] Data Analytics using Python

Module 3
Data Pre-processing and Data Wrangling
Topics Covered

Loading from CSV files, Accessing SQL databases. Cleansing Data with Python:
Stripping out extraneous information, Normalizing data AND Formatting data.
Combining and Merging Data Sets – Reshaping and Pivoting – Data
Transformation – String Manipulation, Regular Expressions.

Loading from CSV files

Pandas features a number of functions for reading tabular data as a DataFrame object. Table
below has a summary of all of them.

These functions are meant to convert text data into a DataFrame. Their options fall into a few
categories:
• Indexing: can treat one or more columns as the index of the returned DataFrame, and whether to get
column names from the file, the user, or not at all.

• Type inference and data conversion: this includes user-defined value conversions and
custom lists of missing value markers.

• Datetime parsing: includes combining capability, such as combining date and time
information spread over multiple columns into a single column in the result.

• Iterating: support for iterating over chunks of very large files.

• Unclean data issues: skipping rows or a footer, comments, or other minor things like
numeric data with thousands separated by commas.


read_csv and read_table are the most commonly used of these functions.

Before using any functions in the pandas library, import the library with the following
statement: import pandas as pd

Let's start with a small comma-separated (CSV) text file, ex1.csv:

df = pd.read_csv('ex1.csv')
pd.read_table('ex1.csv', sep=',')

Since ex1.csv is comma-delimited, we can use read_csv to read it into a DataFrame. If the file
uses any other delimiter, read_table can be used by specifying the delimiter with sep.

pandas allows us to assign column names by specifying the names argument:

Suppose we wanted the message column to be the index of the returned DataFrame. We can
either indicate we want the column at index 4 or the column named 'message' using the index_col
argument:
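The calls described above can be sketched as follows. The contents of ex1.csv are not reproduced in these notes, so the sample text below is an assumption, and io.StringIO stands in for the file on disk.

```python
import io
import pandas as pd

# Assumed contents of ex1.csv (hypothetical sample data for illustration).
csv_text = """a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
"""

# Column names are taken from the header row of the file.
df = pd.read_csv(io.StringIO(csv_text))

# The same data without its header line, so that we supply the names ourselves.
body = csv_text.split("\n", 1)[1]
names = ['a', 'b', 'c', 'd', 'message']
named = pd.read_csv(io.StringIO(body), names=names)

# Use the 'message' column (position 4) as the index of the result.
by_name = pd.read_csv(io.StringIO(body), names=names, index_col='message')
by_pos = pd.read_csv(io.StringIO(body), names=names, index_col=4)
print(by_name)
```

Both index_col forms select the same column, so by_name and by_pos hold identical data.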


To form a hierarchical index from multiple columns, just pass a list of column numbers or
names:

The parser functions have many additional arguments to help you handle the wide variety of
exceptional file formats that occur. For example, you can skip the first, third, and fourth rows
of a file with skiprows:
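A minimal sketch of both options, using made-up CSV text in place of real files (the column names and values are assumptions for illustration only):

```python
import io
import pandas as pd

# Hypothetical data with two key columns that together identify a row.
csv_text = """key1,key2,value1,value2
one,a,1,2
one,b,3,4
two,a,5,6
two,b,7,8
"""
# A list passed to index_col builds a hierarchical (two-level) index.
parsed = pd.read_csv(io.StringIO(csv_text), index_col=['key1', 'key2'])

# Hypothetical file whose first, third and fourth lines are comments.
messy_text = """# a comment line to skip
a,b,c,d,message
# another line to skip
# and one more
1,2,3,4,hello
5,6,7,8,world
"""
# skiprows takes zero-based line numbers.
clean = pd.read_csv(io.StringIO(messy_text), skiprows=[0, 2, 3])
print(parsed)
print(clean)
```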

Handling missing values


• Handling missing values is an important and frequently nuanced part of the file parsing
process. Missing data is usually either not present (empty string) or marked by some
sentinel value. By default, pandas uses a set of commonly occurring sentinels, such as
NA, -1.#IND, and NULL:


• The na_values option can take either a list or set of strings to consider missing values:

• Different NA sentinels can be specified for each column in a dict:
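The three behaviours listed above (default sentinels, an extra list of sentinels, and per-column sentinels) can be sketched like this. The CSV text is invented for the example.

```python
import io
import pandas as pd

# Hypothetical data: one empty field and one 'NA' marker.
csv_text = """something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo
"""

# Default behaviour: empty fields and 'NA' are read as NaN.
default = pd.read_csv(io.StringIO(csv_text))

# An extra sentinel that applies to every column.
extra = pd.read_csv(io.StringIO(csv_text), na_values=['foo'])

# Different sentinels for different columns, given as a dict.
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
per_column = pd.read_csv(io.StringIO(csv_text), na_values=sentinels)
print(per_column)
```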


Accessing SQL databases

• A database is a file that is organized for storing data. Most databases are organized
like a dictionary in the sense that they map from keys to values. The biggest
difference is that the database is on disk (or other permanent storage), so it persists
after the program ends. Because a database is stored on permanent storage, it can
store far more data than a dictionary, which is limited to the size of the memory in
the computer.

• Like a dictionary, database software is designed to keep the inserting and accessing
of data very fast, even for large amounts of data. Database software maintains its
performance by building indexes as data is added to the database to allow the
computer to jump quickly to a particular entry.

• There are many different database systems which are used for a wide variety of
purposes including: Oracle, MySQL, Microsoft SQL Server, PostgreSQL, and SQLite.

• Python's standard library includes the sqlite3 module for working with SQLite databases. SQLite is
designed to be embedded into other applications to provide database support within the application.

• Although we use Python to work with data in SQLite database files, many operations can be done more
conveniently using software called the Database Browser for SQLite, which is freely
available from:
http://sqlitebrowser.org/

• Using the browser you can easily create tables, insert data, edit data, or run simple
SQL queries on the data in the database

Database concepts

• At first glance, a database looks like a spreadsheet consisting of multiple sheets.
The primary data structures in a database are tables, rows and columns.

• In relational database terminology, tables, rows and columns are referred to as
relation, tuple and attribute respectively. The typical structure of a database table is
shown in Table 3.1 below.

• Each table may consist of n number of attributes and m number of tuples (or
records).

• Every tuple gives the information about one individual. Every cell (i, j) in the table
indicates value of jth attribute for ith tuple.


Table 3.1: Typical Relational database table

Consider the problem of storing details of students in a database table. The format may
look like –
             Roll No   Name     DOB          Marks
Student 1    1         Akshay   22/10/2001   82.5
Student 2    2         Arun     20/12/2000   81.3
...          ...       ...      ...          ...
Student m    ...       ...      ...          ...

Thus, table columns indicate the type of information to be stored, and table rows give the
record pertaining to each student. We can create one more table, say department,
consisting of attributes like dept_id, homephno, City. The two tables can then be related
through the Rollno stored in the student table and the dept_id stored in the department table. Thus, there
is a relationship between two tables in a single database. Software that can
maintain proper relationships between multiple tables in a single database is
known as a Relational Database Management System (RDBMS).

Creating a database table


The code to create a database file (music.db) and a table named Tracks with two
columns in the database is as follows:

import sqlite3
conn = sqlite3.connect('music.db')   # creates music.db if it does not exist
cur = conn.cursor()
cur.execute('CREATE TABLE Tracks (title TEXT, plays INTEGER)')
conn.close()

• The connect operation makes a “connection” to the database stored in the file
music.db in the current directory. If the file does not exist, it will be created.

• A cursor is like a file handle that we can use to perform operations on the data
stored in the database. Calling cursor() is very similar conceptually to calling
open() when dealing with text files.


• Once we have the cursor, we can begin to execute commands on the contents of
the database using the execute() method, as shown in the figure below.

Figure : Database Cursor

cur.execute("INSERT INTO Tracks (title, plays) VALUES ('My Way', 15)")
This command inserts one record into the table Tracks where the values for the attributes title
and plays are 'My Way' and 15 respectively.

cur.execute("SELECT * FROM Tracks")
Retrieves all the records from the table Tracks.

cur.execute("SELECT * FROM Tracks WHERE title = 'My Way'")
Retrieves the records from the table Tracks having the value of attribute title as 'My Way'.

cur.execute("UPDATE Tracks SET plays = 16 WHERE title = 'My Way'")
Whenever we would like to modify the value of any particular attribute in the table, we
can use the UPDATE command. Here, the attribute plays is assigned a new value for
the record having the value of title as 'My Way'.

cur.execute("DELETE FROM Tracks WHERE title = 'My Way'")
A particular record can be deleted from the table using the DELETE command. Here, the record
with the value of attribute title as 'My Way' is deleted from the table Tracks.

cur.execute('DROP TABLE IF EXISTS Tracks')
This command removes the entire table Tracks, including its contents, if it exists.
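The commands above can be tried together in one short script. This is a sketch only: it uses an in-memory database (an assumption made here so that no file is left behind) and a second, made-up track so that something remains after the deletion.

```python
import sqlite3

# ':memory:' creates a temporary database that disappears when closed.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

cur.execute('DROP TABLE IF EXISTS Tracks')
cur.execute('CREATE TABLE Tracks (title TEXT, plays INTEGER)')

# The ? placeholders let sqlite3 substitute the values safely.
cur.execute('INSERT INTO Tracks (title, plays) VALUES (?, ?)', ('My Way', 15))
cur.execute('INSERT INTO Tracks (title, plays) VALUES (?, ?)', ('Thunderstruck', 20))

# Change the play count of one record, then read it back.
cur.execute("UPDATE Tracks SET plays = 16 WHERE title = 'My Way'")
cur.execute("SELECT title, plays FROM Tracks WHERE title = 'My Way'")
updated = cur.fetchall()

# Delete that record and list what is left.
cur.execute("DELETE FROM Tracks WHERE title = 'My Way'")
cur.execute('SELECT title, plays FROM Tracks')
remaining = cur.fetchall()

conn.commit()
conn.close()
print(updated, remaining)
```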

Example 1: Write a Python program to create a student table in a college database (with
attributes such as Name, USN and Marks). Perform operations such as inserting, deleting and
retrieving records from the student table.


import sqlite3
conn = sqlite3.connect('college.db')
cur = conn.cursor()
print("Opened database successfully")
# usn holds alphanumeric values, so it is declared as TEXT
cur.execute('CREATE TABLE student (name TEXT, usn TEXT, marks INTEGER)')
print("Table created successfully")
cur.execute('INSERT INTO student (name, usn, marks) VALUES (?, ?, ?)', ('akshay', '1rn16mca16', 30))
cur.execute('INSERT INTO student (name, usn, marks) VALUES (?, ?, ?)', ('arun', '1rn16mca17', 65))
print('student')
cur.execute('SELECT name, usn, marks FROM student')
for row in cur:
    print(row)
# Delete the records of students who scored less than 40
cur.execute('DELETE FROM student WHERE marks < 40')
cur.execute('SELECT name, usn, marks FROM student')   # the remaining rows are not printed here
conn.commit()
cur.close()
conn.close()
Output:
Opened database successfully
Table created successfully
student
('akshay', '1rn16mca16', 30)
('arun', '1rn16mca17', 65)
Example 2: Write Python code to create a database file (music.sqlite) and a table named
Tracks with two columns, title and plays. Also insert, display and delete the contents of the
table.
import sqlite3
conn = sqlite3.connect('music.sqlite')
cur = conn.cursor()
cur.execute('CREATE TABLE Tracks (title TEXT, plays INTEGER)')
cur.execute("INSERT INTO Tracks (title, plays) VALUES ('Thunderstruck', 200)")
cur.execute("INSERT INTO Tracks (title, plays) VALUES (?, ?)", ('My Way', 15))
conn.commit()
print('Tracks:')
cur.execute('DELETE FROM Tracks WHERE plays < 100')
cur.execute('SELECT title, plays FROM Tracks')
for row in cur:
    print(row)
conn.commit()   # save the deletion as well
cur.close()

Output:
Tracks:
('Thunderstruck', 200)


Cleansing Data with Python: Stripping out extraneous information

Extraneous information refers to irrelevant or unnecessary data that can clutter a dataset and
make it difficult to analyze. This could include duplicate entries, empty fields, or irrelevant
columns. Stripping out this information involves removing it from the dataset, resulting in a more
concise and manageable dataset.

To strip out extraneous information in a Pandas DataFrame, you can use various methods and
functions provided by the library. Some commonly used methods include:
• dropna( ): This method removes rows or columns with missing values (NaN or None) from the DataFrame.
You can specify the axis (0 for rows and 1 for columns) along which the rows or columns with
missing values should be dropped.
Example:
df = df.dropna()
#This will remove all rows that contain at least one missing value.

• drop( ): The drop() method in Pandas is used to remove rows or columns from a DataFrame. It can be
used to drop a single column or multiple columns at once.
df.drop(columns, axis=1, inplace=False)
Ex:
cars2 = cars_data.drop(['Doors','Weight'],axis='columns')

• drop_duplicates( ): This method removes duplicate rows from the DataFrame. You can use the subset
parameter to specify the columns on which the duplicates should be checked.
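The three methods can be compared side by side on a small DataFrame. The cars_data values below are invented for this sketch; only the column names Doors and Weight come from the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical car data with missing weights and one repeated row.
cars_data = pd.DataFrame({
    'Model': ['A', 'B', 'B', 'C'],
    'Doors': [4, 2, 2, 4],
    'Weight': [1200.0, np.nan, np.nan, 1500.0],
    'Price': [5000, 7000, 7000, 9000],
})

no_missing_rows = cars_data.dropna()          # drop rows that contain NaN
no_missing_cols = cars_data.dropna(axis=1)    # drop columns that contain NaN
cars2 = cars_data.drop(['Doors', 'Weight'], axis='columns')
unique_rows = cars_data.drop_duplicates()     # compare all columns
unique_models = cars_data.drop_duplicates(subset=['Model'])  # compare 'Model' only
print(unique_rows)
```

None of these calls changes cars_data itself; each returns a new DataFrame.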


• loc[ ] and iloc[ ]: These indexing methods allow you to select specific rows and columns from
the DataFrame. They are used to select only the relevant data and exclude the unwanted
information.
Ex:1

Ex:2
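As a stand-in for the two examples, here is a minimal sketch of label-based (loc) and position-based (iloc) selection. The DataFrame and its row labels r1 to r4 are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Anitha', 'Barathi', 'Charlie', 'David'],
     'Age': [25, 30, 35, 40],
     'Salary': [50000, 60000, 70000, 80000]},
    index=['r1', 'r2', 'r3', 'r4'])

# loc selects by row and column labels.
by_label = df.loc[['r1', 'r3'], ['Name', 'Salary']]

# iloc selects by integer positions.
by_position = df.iloc[[0, 2], [0, 2]]

# Slices: iloc excludes the end position, loc includes the end label.
first_two = df.iloc[:2]
label_slice = df.loc['r1':'r2']
print(by_label)
```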


• Filtering: Conditional statements can be used to filter the DataFrame and select only the
rows that meet certain criteria. This allows us to remove unwanted data based on specific
conditions.
Example:

import pandas as pd

data = {'Name': ['Anitha', 'Barathi', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'Salary': [50000, 60000, 70000, 80000]}
df = pd.DataFrame(data)

# Example of filtering based on a condition
filtered_df = df[df['Age'] > 30]

# Display the filtered DataFrame
print(filtered_df)

Normalizing data AND Formatting data


Normalizing Data:

Data normalization is the process of transforming data into a consistent format to facilitate
comparison and analysis. This may involve converting data to a common unit of measurement,
formatting dates and times consistently, or standardizing data formats. Normalization ensures
that data is comparable and can be easily processed and analysed.

Normalization is a crucial step in data preprocessing for machine learning tasks. One common
form, standardization, transforms numerical features to have a mean of 0 and a standard deviation
of 1; other forms rescale values to a fixed range such as 0 to 1. Either way, the features end up on
a comparable scale, which helps many machine learning algorithms learn efficiently and accurately.

In Python, several libraries provide functions for data normalization.

Method 1: Using sklearn

The normalize function in sklearn is a widely used way to normalize data.


We import the required libraries, NumPy and sklearn; the preprocessing module comes from
sklearn itself, which is why this is called the sklearn normalization method. We
create a NumPy array of differing integer values, then call the normalize
function from preprocessing and pass it the array (numpy_array) as a
parameter. By default, normalize rescales each row to unit (L2) norm, so for non-negative input
every resulting value lies between 0 and 1.

Method 2: Normalize a particular column in a dataset using sklearn

We can also normalize a particular column of a dataset.

We import pandas and sklearn. We create a dummy CSV file and
load it with the pandas read_csv function. We print the loaded data,
read the particular column of the CSV file into a NumPy array using np.array, and
store the result in value_array. We then call the normalize function from preprocessing and
pass value_array as the parameter.

Method 3: Normalize without converting the columns to an array (using sklearn)

In Method 2, we discussed how to normalize a particular column of a CSV file.
Sometimes we need to normalize the whole dataset; in that case we can use the method below,
which normalizes the whole dataset column-wise (axis = 0). If we specify
axis = 1, normalization is done row-wise. axis = 1 is the default value.


Now, we pass the whole DataFrame read from the CSV file along with one extra parameter, axis=0,
which tells the library to normalize the whole dataset column-wise.
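Methods 1 to 3 can be sketched in a few lines. The original screenshots are not available, so the array values and the small DataFrame (standing in for the dummy CSV file) are assumptions; the names numpy_array and value_array follow the text.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Hypothetical data standing in for the dummy CSV file.
df = pd.DataFrame({'Height': [150.0, 160.0, 170.0],
                   'Weight': [50.0, 60.0, 80.0]})

# Method 1: a single row of values, rescaled to unit L2 norm.
numpy_array = np.array([[2, 3, 5, 6, 7, 4, 8, 7, 6]])
normalized_array = preprocessing.normalize(numpy_array)

# Method 2: one column, reshaped into a single row before normalizing.
value_array = np.array(df['Height']).reshape(1, -1)
normalized_column = preprocessing.normalize(value_array)

# Method 3: the whole dataset, column-wise (axis=0).
normalized_df = preprocessing.normalize(df, axis=0)
print(normalized_df)
```

Note that normalize returns a NumPy array even when it is given a DataFrame.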

Method 4: Using MinMaxScaler()


sklearn also provides another normalization method, MinMaxScaler.
This is also a very popular method because it is easy to use.


We take MinMaxScaler from the preprocessing module and create an object
(min_max_Scalar) from it. We do not pass any parameters because we want to normalize the
data between 0 and 1, which is the default range. A different range can be supplied, as described
in the next method.

We first read all the column names for later use when displaying the results. Then we call
fit_transform on the min_max_Scalar object and pass it the data loaded from the CSV file. We get
normalized results that lie between 0 and 1.

Method 5: Using MinMaxScaler(feature_range=(x,y))

sklearn also provides the option to change the range of the normalized values. By
default, values are normalized between 0 and 1, but the
feature_range parameter lets us set the range according to our requirements.

Here, we take MinMaxScaler from the preprocessing module and create an object
(min_max_Scalar) from it, this time passing the feature_range parameter to MinMaxScaler
with the value (0, 2). MinMaxScaler will now normalize the
data values between 0 and 2. We first read all the column names for later use when displaying the
results. Then we call fit_transform on the min_max_Scalar object and pass it the data loaded from
the CSV file. We get normalized results that lie between 0 and 2.
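Methods 4 and 5 differ only in the feature_range argument, as this sketch shows. The DataFrame is a made-up stand-in for the CSV data, and the variable names are illustrative.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical data standing in for the CSV file.
df = pd.DataFrame({'Height': [150.0, 160.0, 170.0],
                   'Weight': [50.0, 60.0, 80.0]})
names = df.columns   # kept so the result can be labelled again

# Method 4: default range, 0 to 1.
min_max_scaler = preprocessing.MinMaxScaler()
scaled_01 = pd.DataFrame(min_max_scaler.fit_transform(df), columns=names)

# Method 5: custom range, 0 to 2.
wide_scaler = preprocessing.MinMaxScaler(feature_range=(0, 2))
scaled_02 = pd.DataFrame(wide_scaler.fit_transform(df), columns=names)
print(scaled_01)
print(scaled_02)
```

In each column the smallest value maps to the lower bound of the range and the largest to the upper bound.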


Method 6: Using the maximum absolute scaling

We can also normalize the data using pandas alone. Maximum absolute scaling divides each column
by its maximum absolute value, which places the values between -1 and 1 (between 0 and 1 when the
data has no negative values). We apply .abs() and .max() as shown below:

We take each column and divide its values by the column's maximum absolute value, obtained with
.abs() and .max(). We print the result and confirm that the data is normalized as expected.
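A minimal sketch of maximum absolute scaling, assuming a small made-up DataFrame in which one column contains a negative value:

```python
import pandas as pd

# Hypothetical data; 'Change' includes a negative value on purpose.
df = pd.DataFrame({'Height': [150.0, 160.0, 200.0],
                   'Change': [-4.0, 2.0, 1.0]})

max_abs_scaled = df.copy()
for column in max_abs_scaled.columns:
    # Divide every value by the largest absolute value in its column.
    max_abs_scaled[column] = max_abs_scaled[column] / max_abs_scaled[column].abs().max()
print(max_abs_scaled)
```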

Method 7: Using the z-score method

The next method is the z-score method. It rescales each column to a distribution with a mean
of 0 and a standard deviation of 1: the column mean is subtracted from each value and the
result is divided by the column's standard deviation. The resulting values are not confined to a
fixed range, although most typically fall within a few units of 0.


We calculate the column's mean and subtract it from the column. Then we divide the result
by the standard deviation and print the normalized data, which now has mean 0 and standard deviation 1.
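The z-score computation can be written directly with pandas. The data below is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'Height': [150.0, 160.0, 170.0],
                   'Weight': [50.0, 60.0, 80.0]})

# Subtract each column's mean, then divide by its standard deviation.
# pandas' std() uses the sample standard deviation (ddof=1) by default.
z_scaled = (df - df.mean()) / df.std()
print(z_scaled)
```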

One popular library is Scikit-Learn, which offers the StandardScaler class for normalization.
Here's an example of how to use StandardScaler to normalize a dataset:
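A short sketch of StandardScaler on a made-up array (the values are assumptions). Unlike the pandas std() default, StandardScaler divides by the population standard deviation (ddof=0).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset: rows are samples, columns are features.
data = np.array([[150.0, 50.0],
                 [160.0, 60.0],
                 [170.0, 80.0]])

scaler = StandardScaler()
standardized = scaler.fit_transform(data)   # each column: mean 0, std 1
print(standardized)
```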


Formatting Data:

• Formatting data in Pandas involves transforming and presenting data in a structured and
readable manner. Pandas, a popular Python library for data analysis, offers various methods
and techniques to format data effectively.

• One of the key features of Pandas is its ability to handle different data types and structures.
Formatting options are available for each data type, so that data is displayed in
a consistent and meaningful way. For example, numeric data can be formatted with a specific
number of decimal places, currency symbols, or percentage signs. Date and time data can be
formatted in various formats, such as "dd/mm/yyyy" or "hh:mm:ss".

• Pandas also allows users to align data within columns, making it easier to read and compare
values. For example, the "justify" parameter of to_string() aligns the column headers and takes
values such as "left", "right", or "center". Additionally, Pandas provides options to control the
width of columns.

• Furthermore, Pandas offers methods to format entire dataframes, applying consistent
formatting rules to all columns. This can be done using the "style" attribute, which allows
users to specify formatting options for different aspects of the dataframe, such as font,
background color, and borders.

• By using these formatting capabilities, users can present insights and patterns in their data
clearly, making the results easier to analyze and interpret.
Ex 1 : Formatting Numeric Data


Ex 2: Formatting Date and Time Data

Ex 3: Aligning Data in Columns
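The three kinds of formatting named in Ex 1 to Ex 3 can be sketched together. The sales data and the "Rs." currency prefix are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical sales data.
df = pd.DataFrame({
    'Item': ['Pen', 'Notebook'],
    'Price': [12.5, 149.999],
    'Growth': [0.051, 0.2],
    'Sold': pd.to_datetime(['2024-01-05', '2024-11-23']),
})

formatted = pd.DataFrame({
    'Item': df['Item'],
    # Numeric data: two decimal places with a currency prefix.
    'Price': df['Price'].map('Rs. {:,.2f}'.format),
    # Numeric data: shown as a percentage with one decimal place.
    'Growth': df['Growth'].map('{:.1%}'.format),
    # Date data: dd/mm/yyyy.
    'Sold': df['Sold'].dt.strftime('%d/%m/%Y'),
})

# Alignment: justify controls how the column headers are aligned.
print(formatted.to_string(index=False, justify='center'))
```

The formatted columns hold strings, so this step is best done last, after all calculations.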


Combining and merging data sets

Data contained in pandas objects can be combined together in a number of built-in ways:
• pandas.merge connects rows in DataFrames based on one or more keys. This will be
familiar to users of SQL or other relational databases, as it implements database join
operations.

• pandas.concat glues or stacks together objects along an axis.

• combine_first instance method enables splicing together overlapping data to fill in missing
values in one object with values from another.

Database-style DataFrame Merges


Merge or join operations combine data sets by linking rows using one or more keys. The merge
function in pandas is used to combine datasets.

Let’s start with a simple example:

import pandas as pd
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

• The examples below show a many-to-one merge situation; the data in df1 has multiple rows
labeled a and b, whereas df2 has only one row for each value in the key column.

# performs inner join (the default)
pd.merge(df1, df2)
pd.merge(df1, df2, on='key')

# performs outer join
pd.merge(df1, df2, how='outer')


Observe that in the inner join result the 'c' and 'd' values and associated data are missing. By
default merge does an 'inner' join; the keys in the result are the intersection. The outer join takes
the union of the keys, combining the effect of applying both left and right joins.
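The difference between the two joins is easy to check by counting rows and keys, as in this sketch built on the df1 and df2 defined above:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

inner = pd.merge(df1, df2, on='key')               # keys found in both frames
outer = pd.merge(df1, df2, on='key', how='outer')  # keys found in either frame

print(sorted(inner['key'].unique()))   # only 'a' and 'b'
print(sorted(outer['key'].unique()))   # 'a', 'b', 'c' and 'd'
```

In the outer result, the 'c' row has no data2 value and the 'd' row has no data1 value, so those cells are NaN.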

• The examples below show many-to-many merges:

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data2': range(5)})

Many-to-many joins form the Cartesian product of the rows. Since there were 3 'b' rows in
the left DataFrame and 2 in the right one, there are 6 'b' rows in the result. The join method
only affects the distinct key values appearing in the result:

pd.merge(df1, df2, how='inner')
pd.merge(df1, df2, on='key', how='left')
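The row counts claimed above can be verified with a short script that uses the same two DataFrames:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data2': range(5)})

inner = pd.merge(df1, df2, how='inner')
left = pd.merge(df1, df2, on='key', how='left')

# 3 'b' rows on the left times 2 on the right gives 6 'b' rows.
b_rows = int((inner['key'] == 'b').sum())
print(b_rows, len(inner), len(left))
```

The left join keeps the unmatched 'c' row from df1, which is why it has one more row than the inner join.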

Merging on Index
In some cases, the merge key or keys in a DataFrame will be found in its index. In this case, you
can pass left_index=True or right_index=True (or both) to indicate that the index should be used as
the merge key.
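A minimal sketch of merging a column against an index. The two DataFrames and their names (left1, right1) are assumptions chosen for illustration.

```python
import pandas as pd

left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], 'value': range(6)})
# right1 carries its keys in the index rather than in a column.
right1 = pd.DataFrame({'group_val': [3.5, 7.0]}, index=['a', 'b'])

# Match the 'key' column of left1 against the index of right1.
merged = pd.merge(left1, right1, left_on='key', right_index=True)

# An outer join keeps the 'c' row, which has no match in right1.
merged_outer = pd.merge(left1, right1, left_on='key', right_index=True, how='outer')
print(merged_outer)
```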


Concatenating Along an Axis


Another kind of data combination operation is alternatively referred to as concatenation,
binding, or stacking. NumPy has a concatenate function for doing this with raw NumPy
arrays:

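In pandas, the concat function does the same for Series and DataFrame objects. By default
concat works along axis=0, producing another Series when it is given Series. If you pass axis=1,
the result will instead be a DataFrame (axis=1 is the columns):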
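Both kinds of concatenation can be sketched as follows. The array and the three Series are made-up examples.

```python
import numpy as np
import pandas as pd

# NumPy: join two copies of an array side by side.
arr = np.arange(12).reshape((3, 4))
wide = np.concatenate([arr, arr], axis=1)

# pandas: three Series with non-overlapping indexes.
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])

stacked = pd.concat([s1, s2, s3])               # axis=0: one longer Series
side_by_side = pd.concat([s1, s2, s3], axis=1)  # axis=1: a DataFrame
print(stacked)
print(side_by_side)
```

With axis=1 each Series becomes a column, and positions where a Series has no value for an index label are filled with NaN.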


Reshaping and pivoting

There are a number of fundamental operations for rearranging tabular data. These are
alternately referred to as reshape or pivot operations.

Reshaping with Hierarchical Indexing

Hierarchical indexing provides a consistent way to rearrange data in a DataFrame.


There are two primary actions:
• stack: this pivots from the columns in the data to the rows.
• unstack: this pivots from the rows into the columns.

• Using the stack method on a DataFrame pivots the columns into the rows, producing a
Series.

From a hierarchically-indexed Series, we can rearrange the data back into a DataFrame with
unstack.
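A small round trip illustrates the two actions. The DataFrame below is an assumed example with named row and column indexes.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'], name='number'))

# stack: the column labels become the inner level of the row index.
result = data.stack()

# unstack: the inner index level goes back to being the columns.
restored = result.unstack()
print(result)
print(restored)
```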


• By default the innermost level is unstacked (same with stack). You can unstack a different
level by passing a level number or name:

• Unstacking might introduce missing data if all of the values in the level aren’t found in each
of the subgroups:
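The two points above can be seen on a hierarchically indexed Series built from two smaller Series (all values here are assumptions for illustration):

```python
import pandas as pd

s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
# keys adds an outer index level with the labels 'one' and 'two'.
data2 = pd.concat([s1, s2], keys=['one', 'two'])

by_inner = data2.unstack()          # innermost level becomes the columns
by_outer = data2.unstack(level=0)   # outer level becomes the columns
print(by_inner)
print(by_outer)
```

Because 'one' has no entry for e and 'two' has none for a and b, the unstacked result contains NaN in those positions.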

Pivoting “long” to “wide” Format


• A common way to store multiple time series in databases and CSV is in so-called long or
stacked format :

• Data is frequently stored this way in relational databases like MySQL as a fixed schema allows
the number of distinct values in the item column to increase or decrease as data is added or
deleted in the table.

• The data may not be easy to work with in long format; it is preferred to have a DataFrame
containing one column per distinct item value indexed by timestamps in the date column.

DataFrame’s pivot method performs exactly this transformation.

The pivot() function is used to reshape a given DataFrame, organizing it by the given index and
column values. This function does not support data aggregation; multiple value columns will
result in a MultiIndex in the columns.

Syntax:
DataFrame.pivot(self, index=None, columns=None, values=None)

Example:


Suppose you had two value columns that you wanted to reshape simultaneously:

• By omitting the last argument, you obtain a DataFrame with hierarchical columns:
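A sketch of pivoting from long to wide format. The long-format data (dates, item names and values) is invented for this example; the column names date, item and value follow the description above.

```python
import pandas as pd

# Hypothetical long-format data: one row per (date, item) pair.
ldata = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'item': ['realgdp', 'infl', 'realgdp', 'infl'],
    'value': [2710.3, 0.0, 2778.8, 2.3],
})

# One column per distinct item, indexed by date.
wide = ldata.pivot(index='date', columns='item', values='value')

# With two value columns and no values argument, the columns become hierarchical.
ldata['value2'] = [1.0, 2.0, 3.0, 4.0]
both = ldata.pivot(index='date', columns='item')
print(wide)
print(both)
```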


Data transformation

Data transformation is the process of converting raw data into a format that is suitable for
analysis and modeling. It's an essential step in data science and analytics workflows, helping to
unlock valuable insights and make informed decisions.
A few of the data transformation techniques are:
• Removing Duplicates
• Replacing Values
• Renaming Axis Indexes
• Discretization and Binning
• Detecting and Filtering Outliers
• Permutation and Random Sampling

i) Removing duplicates
Duplicate rows may be found in a DataFrame using the duplicated method, which returns a
boolean Series indicating whether each row is a duplicate or not. Relatedly,
drop_duplicates returns a DataFrame containing only the rows where the duplicated array is False.

import pandas as pd

data = pd.DataFrame(
    {'k1': ['one'] * 3 + ['two'] * 4,
     'k2': [1, 1, 2, 3, 3, 4, 4]})
data
data.duplicated()
data.drop_duplicates()   # rows 1, 4 and 6 are dropped

ii) Filtering outliers


Filtering or transforming outliers is largely a matter of applying array operations.
Consider a DataFrame with some normally distributed data. (Note : while writing answers,
write your own random numbers between 0 and 1)


• Suppose we wanted to find values in one of the columns exceeding one in magnitude:

• To select all rows having a value exceeding 1 or -1, we can use the any method on a
boolean DataFrame:
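Both selections can be sketched on randomly generated data. The shape (1000 rows, 4 columns) and the seed are arbitrary choices made for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(12345)   # seed chosen arbitrarily
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# Values in one column whose magnitude exceeds 1.
col = data[3]
large_in_col = col[np.abs(col) > 1]

# Rows in which at least one value exceeds 1 or is below -1.
rows_with_outlier = data[(np.abs(data) > 1).any(axis=1)]
print(len(large_in_col), len(rows_with_outlier))
```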

iii) Replacing Values

• Sometimes it is necessary to replace particular values, such as sentinel values that stand for
missing data, with NaN or some other specific value. This can be done using the replace method.
Let's consider this Series:

import numpy as np
import pandas as pd

data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

• The -999 values might be sentinel values for missing data. To replace these with NA
values that pandas understands, we can use replace, producing a new Series:

data.replace(-999, np.nan)


• To replace multiple values at once, pass a list of the values followed by the substitute
value:

data.replace([-999, -1000], np.nan)

• To use a different replacement for each value, pass a list of substitutes. The argument can
also be a dict:

data.replace([-999, -1000], [np.nan, 0])
data.replace({-999: np.nan, -1000: 0})

iv. Renaming Axis Indexes

Like values in a Series, axis labels can be similarly transformed by a function or mapping
of some form to produce new, differently labeled objects. The axes can also be modified in
place without creating a new data structure.

• We can assign to index, modifying the DataFrame in place:

import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data.index = data.index.map(str.upper)
data

• To create a transformed version of a data set without modifying the original, a useful
method is rename:

data.rename(index=str.title, columns=str.upper)


• rename can be used in conjunction with a dict-like object providing new values for a subset
of the axis labels:
data.rename(index={'OHIO': 'INDIANA'},
columns={'three': 'peekaboo'})

v. Discretization and binning

• Continuous data is often discretized or otherwise separated into “bins” for analysis.
Suppose we have data about a group of people in a study, and we want to group them into
discrete age buckets:

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older. To do
so, we use cut, a function in pandas:

import pandas as pd
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats

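• The object pandas returns is a special Categorical object. We can treat it like an array of
strings indicating the bin name; internally it contains a categories array holding the distinct
category names, along with an integer code for each age in the codes attribute. (Older pandas
releases called these attributes levels and labels; those names are not available in current
versions.)

cats.codes

cats.categories

cats.value_counts()

Here cats.categories lists the four intervals (18, 25], (25, 35], (35, 60], (60, 100], and
value_counts() gives the number of ages that fall in each of them.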

Consistent with mathematical notation for intervals, a parenthesis means that the side is
open while the square bracket means it is closed (inclusive).
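A minimal sketch of the binning above, written against current pandas attribute names (codes and categories). The right=False call is an extra illustration of how the closed side can be switched.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

cats = pd.cut(ages, bins)

codes = cats.codes.tolist()             # bin number assigned to each age
counts = cats.value_counts().tolist()   # number of ages per bin, in bin order

# right=False closes the left edge instead: [18, 25), [25, 35), ...
cats_left = pd.cut(ages, bins, right=False)

print(cats.categories)
print(codes)
print(counts)
```

With the default right-closed bins, the age 25 falls in (18, 25]; with right=False it moves to [25, 35).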


vi. Permutation and Random Sampling

Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the
numpy.random.permutation function. Calling permutation with the length of the axis you
want to permute produces an array of integers indicating the new ordering:

df = DataFrame(np.arange(5 * 4).reshape(5, 4))
df

sampler = np.random.permutation(5)
sampler
array([1, 0, 2, 3, 4])

That array can then be used in iloc-based indexing or with the take function (the older ix
indexer has been removed from pandas):

df.take(sampler)
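A short sketch of the permutation step. The seed is an assumption added so that repeated runs give the same ordering; it is not part of the original example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape(5, 4))

np.random.seed(0)                     # assumed seed, for repeatability only
sampler = np.random.permutation(5)    # a random ordering of 0..4

shuffled = df.take(sampler)           # rows in the new order
same = df.iloc[sampler]               # equivalent positional indexing

print(sampler)
print(shuffled)
```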

vii. Computing Indicator/Dummy Variables

Another type of transformation for statistical modeling or machine learning applications
is converting a categorical variable into a “dummy” or “indicator” matrix. If a column in a
DataFrame has k distinct values, you would derive a matrix or DataFrame containing k
columns containing all 1’s and 0’s. pandas has a get_dummies function for doing this,
though devising one yourself is not difficult. Let’s return to an earlier example
DataFrame:

df = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})

pd.get_dummies(df['key'])

In some cases, you may want to add a prefix to the columns in the indicator DataFrame,
which can then be merged with the other data. get_dummies has a prefix argument for
doing just this:


dummies = pd.get_dummies(df['key'], prefix='key')

df_with_dummy = df[['data1']].join(dummies)

df_with_dummy
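The same steps as a self-contained sketch. The dtype=int argument is an addition made here so that the indicator columns show 1 and 0; without it, recent pandas versions return True and False.

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})

# One indicator column per distinct key, each prefixed with 'key_'
dummies = pd.get_dummies(df['key'], prefix='key', dtype=int)

# Attach the indicator columns to the remaining data
df_with_dummy = df[['data1']].join(dummies)
print(df_with_dummy)
```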

String Manipulation

Python has long been a popular data munging language in part due to its ease-of-use for
string and text processing. Most text operations are made simple with the string object’s built-
in methods. For more complex pattern matching and text manipulations, regular expressions
may be needed. pandas adds to the mix by enabling you to apply string and regular
expressions concisely on whole arrays of data, additionally handling the annoyance of missing
data.

String Object Methods


• In many string munging and scripting applications, built-in string methods are sufficient.
Examples:

A comma-separated string can be broken into pieces with split:
val = 'a,b, guido'
val.split(',')
Output: ['a', 'b', ' guido']

split is often combined with strip to trim whitespace (including newlines):
pieces = [x.strip() for x in val.split(',')]
pieces
Output: ['a', 'b', 'guido']

The above substrings could be concatenated together with a two-colon delimiter using addition:
first, second, third = pieces
first + '::' + second + '::' + third
Output: 'a::b::guido'

A faster and more Pythonic way is to pass a list or tuple to the join method on the string '::':
'::'.join(pieces)
Output: 'a::b::guido'


Note: refer module 2 for remaining string methods

Regular Expressions

• A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern.
• RegEx can be used to check if a string contains the specified search pattern.
• Python has a built-in package called re, which can be used to work with Regular
Expressions.

RegEx Functions
The re module offers a set of functions that allows us to search a string for a match.
By using these functions we can search required pattern. They are as follows:

• match(): re.match() determines whether the RE matches at the beginning of the string. It
returns a match object if the match succeeds; otherwise it returns None.

import re

pattern = '^a...s$'
test_string = 'abyss'
result = re.match(pattern, test_string)

if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")

Output:
Search successful.

• search(): The search() function searches the string for a match, and returns a Match
object if there is a match. If more than one match is found, only the first
occurrence of the match is returned.

import re
pattern = 'Tutorials'
line = 'Python Tutorials'
result = re.search(pattern, line)
print(result)
print(result.group())   # group() returns the matched string

Output:
<re.Match object; span=(7, 16), match='Tutorials'>
Tutorials


• findall(): Finds all substrings where the RE matches and returns them as a list. The
string is scanned from left to right, and every non-overlapping occurrence of the pattern
is returned in the order found; an empty list is returned if there is no match. Use
findall() when all occurrences are needed, search() when only the first occurrence is
needed, and match() when the pattern must occur at the beginning of the string.

import re
txt = "The rain in Spain"
x = re.findall("ai", txt)
print(x)

Output:
['ai', 'ai']
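The difference between the three functions can be seen by running them on the same string. This is a small sketch; the variable names are illustrative.

```python
import re

text = "The rain in Spain"

# match() only looks at the beginning of the string
at_start = re.match("The", text)
not_at_start = re.match("rain", text)   # None: "rain" is not at the start

# search() finds the first occurrence anywhere in the string
first = re.search("ai", text)

# findall() returns every non-overlapping occurrence
every = re.findall("ai", text)

print(at_start.group(), not_at_start, first.span(), every)
```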

Character matching in regular expressions

• Python provides a list of meta-characters to match search strings.

• Metacharacters are characters that are interpreted in a special way by a RegEx
engine. Here's a list of metacharacters:

Character   Description                                          Example
[]          A set of characters                                  "[a-m]"
[^…]        Matches any single character NOT in brackets         "[^a-m]"
\           Signals a special sequence (can also be used to      "\d"
            escape special characters)
.           Any character (except newline character)             "he..o"
^           Starts with                                          "^hello"
$           Ends with                                            "world$"
*           Zero or more occurrences                             "aix*"
+           One or more occurrences                              "aix+"
{}          Exactly the specified number of occurrences          "al{2}"
|           Either or                                            "falls|stays"
()          Capture and group

When parentheses are added to a regular expression, they are ignored for the
purpose of matching, but allow you to extract a particular subset of the
matched string rather than the whole string when using findall().


Special Sequences
A special sequence is a \ followed by one of the characters in the list below, and has a
special meaning:

Sequence   Description                                               Example
\d         Returns a match where the string contains digits          "\d"
           (numbers from 0-9)
\D         Returns a match where the string DOES NOT contain         "\D"
           digits
\s         Returns a match where the string contains a white         "\s"
           space character
\S         Returns a match where the string DOES NOT contain a       "\S"
           white space character
\w         Returns a match where the string contains any word        "\w"
           characters (characters from a to Z, digits from 0-9,
           and the underscore _ character)
\W         Returns a match where the string DOES NOT contain any     "\W"
           word characters
\A         Returns a match if the specified characters are at        "\AThe"
           the beginning of the string
\b         Returns a match where the specified characters are at     r"\bain"
           the beginning or at the end of a word                     r"ain\b"
\B         Returns a match where the specified characters are        r"\Bain"
           present, but NOT at the beginning (or at the end) of      r"ain\B"
           a word
\Z         Returns a match if the specified characters are at        "Spain\Z"
           the end of the string
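A few of these special sequences in use. The sample sentence is made up for this sketch; raw strings (r"...") are used so that Python does not interpret the backslashes itself.

```python
import re

text = "The rain in Spain costs 25 rupees"   # sample sentence (assumed)

digits = re.findall(r"\d+", text)         # runs of digits
starts_ain = re.findall(r"\bain", text)   # 'ain' at the start of a word
ends_ain = re.findall(r"ain\b", text)     # 'ain' at the end of a word
at_start = re.findall(r"\AThe", text)     # 'The' at the start of the string
words = re.findall(r"\w+", text)          # runs of word characters

print(digits, starts_ain, ends_ain, at_start, len(words))
```

No word in the sentence begins with "ain", so the \bain search returns an empty list, while ain\b finds the endings of "rain" and "Spain".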

A few examples of sets of characters for pattern matching are as follows:

[arn]: Returns a match where one of the specified characters (a, r, or n) is present.
>>> txt = "The rain in Spain"
>>> re.findall("[arn]", txt)
['r', 'a', 'n', 'n', 'a', 'n']

[a-n]: Returns a match for any lower case character, alphabetically between a and n.
>>> txt = "The rain in Spain"
>>> re.findall("[a-n]", txt)
['h', 'e', 'a', 'i', 'n', 'i', 'n', 'a', 'i', 'n']

[^arn]: Returns a match for any character EXCEPT a, r, and n.
>>> txt = "The rain in Spain"
>>> re.findall("[^arn]", txt)
['T', 'h', 'e', ' ', 'i', ' ', 'i', ' ', 'S', 'p', 'i']

[0123]: Returns a match where any of the specified digits (0, 1, 2, or 3) are present.
>>> txt = "The rain in Spain"
>>> re.findall("[0123]", txt)
[]

[0-9]: Returns a match for any digit between 0 and 9.
>>> txt = "8 times before 11:45 AM"
>>> re.findall("[0-9]", txt)
['8', '1', '1', '4', '5']

[0-5][0-9]: Returns a match for any two-digit number from 00 to 59.
>>> txt = "8 times before 11:45 AM"
>>> re.findall("[0-5][0-9]", txt)
['11', '45']

[a-zA-Z]: Returns a match for any character alphabetically between a and z, lower case OR
upper case.
>>> txt = "8 times before 11:45 AM"
>>> re.findall("[a-zA-Z]", txt)
['t', 'i', 'm', 'e', 's', 'b', 'e', 'f', 'o', 'r', 'e', 'A', 'M']

[+]: In sets, +, *, ., |, (), $, {} have no special meaning, so [+] means: return a match
for any + character in the string.
>>> txt = "8 times before 11:45 AM"
>>> re.findall("[+]", txt)
[]


Few more examples for searching the pattern in files:


Let us consider a text file pattern.txt
#pattern.txt
From: Bengaluru^560098
From:<[email protected]>
ravi
rohan
Mysore^56788
From:Karnataka
From: <[email protected]>

Ex:1 Search for lines that start with 'F', followed by 2 characters, followed by 'm:'
import re
hand = open('pattern.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^F..m:', line):
        print(line)

Output:
From: Bengaluru^560098
From:<[email protected]>
From:Karnataka
From: <[email protected]>

The regular expression F..m: would match any of the strings “From:”, “Fxxm:”, “F12m:”,
or “F!@m:” since the period characters in the regular expression match any character.

Ex:2 Search for lines that start with From and have an at sign
import re
hand = open('pattern.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From:.+@', line):
        print(line)

Output:
From:<[email protected]>
From: <[email protected]>

The search string ^From:.+@ will successfully match lines that start with “From:”,
followed by one or more characters (.+), followed by an at-sign.
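The two searches can also be tried without a file. In this sketch a list of strings stands in for the lines of pattern.txt, and the e-mail addresses are made-up placeholders, not the ones in the original file.

```python
import re

# Stand-in for the lines of a file; the addresses are hypothetical
lines = [
    "From: Bengaluru^560098",
    "From:<ravi@example.com>",
    "ravi",
    "From:Karnataka",
    "From: <rohan@example.org>",
]

# Lines that start with F, any two characters, then m:
starts_with_from = [ln for ln in lines if re.search(r"^F..m:", ln)]

# Lines that start with From: and contain an at-sign later on
from_with_at = [ln for ln in lines if re.search(r"^From:.+@", ln)]

print(starts_with_from)
print(from_with_at)
```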

Extracting data using regular expressions


If we want to extract data from a string in Python we can use the findall() method to
extract all of the substrings which match a regular expression.

Ex:1 Extract anything that looks like an email address from the line.
import re
s = 'A message from [email protected] to [email protected] about meeting @2PM'
lst = re.findall(r'\S+@\S+', s)
print(lst)
Output: ['[email protected]', '[email protected]']


In the above example:

— The findall() method searches the string in the second argument and returns a list of all of
the strings that look like email addresses.
— Translating the regular expression, we are looking for substrings that have one or more
non-whitespace characters, followed by an at-sign, followed by one or more non-whitespace
characters.
— The “\S+” matches as many non-whitespace characters as possible.

Ex:2 Search for lines that have an at sign between characters
import re
hand = open('pattern.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'\S+@\S+', line)
    if len(x) > 0:
        print(x)

Output:
['From:<[email protected]>']
['<[email protected]>']
We read each line and then extract all the substrings that match our regular
expression. Since findall() returns a list, we simply check if the number of elements in
our returned list is more than zero to print only lines where we found at least one
substring that looks like an email address.
Observe in the above output that the email addresses carry unwanted characters such as
“<” or “>” at the beginning or end. To eliminate those characters, refer to the example
program below.

Ex:3 Search for lines that have an at sign between characters. The characters
must be a letter or number
import re
hand = open('pattern.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]', line)
    if len(x) > 0:
        print(x)

Output:
['From:[email protected]']
['[email protected]']

Here, we are looking for substrings that start with a single lowercase letter, uppercase
letter, or number “[a-zA-Z0-9]”, followed by zero or more non-blank characters (\S*),
followed by an at-sign, followed by zero or more non-blank characters (\S*), followed by
an uppercase or lowercase letter. Note that we switched from + to * to indicate zero or
more non-blank characters since [a-zA-Z0-9] is already one non-blank character.
Remember that the * or + applies to the single character immediately to the left of the
plus or asterisk.
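The effect of the refined pattern can be checked on a single line. The address below is a hypothetical placeholder used only for this sketch.

```python
import re

line = "From: <alice@example.com>"   # hypothetical sample line

# Any non-whitespace run around the at-sign: the angle brackets are included
loose = re.findall(r"\S+@\S+", line)

# Must start with a letter or digit and end with a letter: brackets are dropped
strict = re.findall(r"[a-zA-Z0-9]\S*@\S*[a-zA-Z]", line)

print(loose)
print(strict)
```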


Combining searching and extracting


• Sometimes we may want to extract lines from a file that match a specific pattern, for
example:
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000

We can use the following regular expression to select the lines:

^X-.*: [0-9.]+

Let us consider a sample file called file.txt
#file.txt
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6178
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6961
X-DSPAM
done with the file content

Ex:1 Search for lines that start with 'X-' followed by any characters
and ':' followed by a space and any number. The number can
include a decimal.
import re
hand = open('file.txt')
for line in hand:
    line = line.rstrip()
    x = re.search('^X-.*: ([0-9.]+)', line)
    if x:
        print(x.group())

Output:
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6178
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6961

Here, it selects the lines that

— start with X-,
— followed by zero or more characters (.*),
— followed by a colon (:) and then a space.
— After the space we are looking for one or more characters that are either a digit (0-9)
or a period: [0-9.]+.
— Note that inside the square brackets, the period matches an actual period (i.e., it is
not a metacharacter between the square brackets).

• If we want only the numbers from the above output, we could use the split() function on
each extracted string. However, it is better to refine the regular expression, and for
that we need the help of parentheses.

When we add parentheses to a regular expression, they do not change which strings
match (for example with search()). But with findall(), parentheses indicate that
while we want the whole expression to match, we are only interested in extracting the
portion of the substring that falls inside the parentheses.

Ex:2 Search for lines that start with 'X' followed by any characters
and ':' followed by a space and any number. The number can include
a decimal. Then print the number if one was found.

import re
hand = open('file.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall('^X.*: ([0-9.]+)', line)
    if len(x) > 0:
        print(x)

Output:
['0.8475']
['0.0000']
['0.6178']
['0.0000']
['0.6961']
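The role of the parentheses can be seen by comparing search() and findall() on one line. A small sketch using a line from file.txt; the conversion to float at the end is an added illustration.

```python
import re

line = "X-DSPAM-Confidence: 0.8475"
pattern = r"^X-.*: ([0-9.]+)"

m = re.search(pattern, line)
whole = m.group()      # the entire matched text
inside = m.group(1)    # only the part inside the parentheses

# findall() returns just the parenthesised part
nums = re.findall(pattern, line)
value = float(nums[0])

print(whole)
print(inside, nums, value)
```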

• Let us consider another example, assume that the file contain lines of the form:

Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772

If we wanted to extract all of the revision numbers (the integer number at the end of
these lines) using the same technique as above, we could write the following program:


Ex:3 Search for lines that start with 'Details:' and contain 'rev=' followed by digits.
Then print the number if one was found.
import re
line = "Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772"
x = re.findall('^Details:.*rev=([0-9]+)', line)
if len(x) > 0:
    print(x)

Output:
['39772']

In the above example, we are looking for lines that start with Details:, followed by
any number of characters (.*), followed by rev=, and then by one or more digits. We
want to find lines that match the entire expression but we only want to extract the
integer number at the end of the line, so we surround [0-9]+ with parentheses.
Note that the expression [0-9]+ is greedy: it keeps grabbing digits until it meets a
character that is not a digit, so it captures the whole number however long it is.

• Consider another example: we may be interested in knowing the time of day of each
email. The file may have lines like
From [email protected] Sat Jan 5 09:14:16 2008
Here, we would like to extract only the hour, 09. That is, we want only the two digits
representing the hour. This can be done with the following code:
import re
line = "From [email protected] Sat Jan 5 09:14:16 2008"
x = re.findall('^From .* ([0-9][0-9]):', line)
if len(x) > 0:
    print(x)

Output:
['09']
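A self-contained version of the hour extraction. The sender address is a hypothetical placeholder; the rest of the line follows the format shown above.

```python
import re

line = "From user@example.com Sat Jan  5 09:14:16 2008"   # address is made up

# A space, then two digits, then a colon: only the hour fits this shape
hours = re.findall(r"^From .* ([0-9][0-9]):", line)
print(hours)
```

The minutes (14) are preceded by a colon rather than a space, and the seconds are not followed by a colon, so neither can match.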

Escape character
Characters like dot, plus, question mark, asterisk, and dollar are metacharacters in
regular expressions. Sometimes we need these characters themselves as part of the
matching string. Then we need to escape them using a backslash. For example,
import re
x = 'We just received $10.00 for cookies.'
y = re.search(r'\$[0-9.]+', x)
print("matched string:", y.group())

Output:
matched string: $10.00


Here, we want to extract only the price $10.00. Since the $ symbol is a metacharacter, we
need to put \ before it, so that $ is treated as part of the matching string and not
as a metacharacter.
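The effect of escaping can be checked directly. The use of re.escape() below is an addition to the notes: it is a standard function of the re module that inserts the backslashes for a literal string.

```python
import re

text = 'We just received $10.00 for cookies.'

# Unescaped, $ means "end of string", so this pattern cannot match
unescaped = re.search('$10', text)

# Escaped by hand
price = re.search(r'\$[0-9.]+', text).group()

# Escaped automatically for an exact literal
literal = re.escape('$10.00')
found = re.search(literal, text).group()

print(unescaped, price, literal, found)
```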

Question Bank

1 Explain merge methods with examples demonstrating the following joins:
  i) Outer ii) Left iii) Right
2 Discuss various techniques for stripping out extraneous information in the dataset.
3 What is data normalization? Explain with an example.
4 Illustrate with examples how to handle missing data while reading a CSV file.
5 Describe reshaping with hierarchical indexing with suitable examples.
6 Write a short note on string manipulation.
7 Write a short note on the pivoting mechanism.
8 List and describe different functions used for pattern matching in the re module with examples.
9 Discuss the data transformation mechanisms with examples.
10 Briefly discuss Discretization and Binning.
11 Implement a Python program to demonstrate
  (i) Importing Datasets
  (ii) Cleaning the Data
  (iii) Data frame manipulation using NumPy
REGULAR EXPRESSION
12 What is the need for regular expressions in programming? Explain.
13 Discuss any 5 metacharacters used in regular expressions with suitable examples.
14 Discuss the match(), search() and findall() functions of the re module.
15 What is the need for escape characters in regular expressions? Give a suitable code snippet.
16 Write a Python program to search for lines that start with the word ‘From’ and a character
  followed by a two digit number between 00 and 99 followed by “:”. Print the number if it is
  greater than zero. Assume any input file.
17 How do you extract a substring from the selected lines of a file?
18 Explain the intention/meaning of the following regular expressions:
  1. ^From .* ([0-9][0-9]):
  2. ^Details:.*rev=([0-9.]+)
  3. ^X\S*: ([0-9.]+)


Data Analytics using Python


Module-4
Web Scraping And Numerical Analysis

Topics to be studied

• Data Acquisition by Scraping web applications

• Submitting a form

• Fetching web pages

• Downloading web pages through form submission

• CSS Selectors.

• NumPy Essentials: The NumPy

Need for Web Scraping

• Suppose you want to get some information from a website, say an article from a news
site. What will you do?
• The first thing that may come to mind is to copy and paste the
information into a local file.
• But what if you want a large amount of data on a daily basis, and as quickly
as possible?
• In such situations, copy and paste will not work, and that is where you will
need web scraping.

Web Scraping

• Web scraping is a technique used to extract data from websites. It involves
fetching and parsing HTML content to gather information.

• The main purpose of web scraping is to collect and analyze data from
websites for various applications, such as research, business intelligence, or
creating datasets.

• Developers use tools and libraries like BeautifulSoup (for Python), Scrapy, or
Puppeteer to automate the process of fetching and parsing web data.

Python Libraries

• requests
• Beautiful Soup
• Selenium

Requests

• The requests module allows you to send HTTP requests using Python.
• The HTTP request returns a Response object with all the response
data (content, encoding, status, etc.).
• Install requests with pip install requests

Python script to make a simple HTTP GET request

import requests
# Specify the URL you want to make a GET request to
url = "https://www.w3schools.com"
# Make the GET request
response = requests.get(url)
# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Print the content of the response
    print("Response content:")
    print(response.text)
else:
    # Print an error message if the request was not successful
    print(f"Error: {response.status_code}")

import requests
# Specify the base URL
base_url = "https://jsonplaceholder.typicode.com"
# GET request
get_response = requests.get(f"{base_url}/posts/1")
print(f"GET Response:\n{get_response.json()}\n")
# POST request
new_post_data = {
'title': 'New Post',
'body': 'This is the body of the new post.',
'userId': 1
}
post_response = requests.post(f"{base_url}/posts", json=new_post_data)
print(f"POST Response:\n{post_response.json()}\n")

# PUT request (Update the post with ID 1)


updated_post_data = {
'title': 'Updated Post',
'body': 'This is the updated body of the post.',
'userId': 1
}
put_response = requests.put(f"{base_url}/posts/1", json=updated_post_data)
print(f"PUT Response:\n{put_response.json()}\n")
# DELETE request (Delete the post with ID 1)
delete_response = requests.delete(f"{base_url}/posts/1")
print(f"DELETE Response:\nStatus Code: {delete_response.status_code}")

Implementing Web Scraping in Python with BeautifulSoup

There are mainly two ways to extract data from a website:


• Use the API of the website (if it exists). Ex. Facebook Graph API
• Access the HTML of the webpage and extract useful information/data
from it.
Ex. WebScraping

Steps involved in web scraping

• Send an HTTP request to URL

• Parse the data which is accessed

• Navigate and search the parse tree that we created

BeautifulSoup

• It is an incredible tool for pulling out information from a webpage.

• It can extract tables, lists, and paragraphs, and filters can be applied to extract
specific information from web pages.

• BeautifulSoup does not fetch the web page for us, so we use requests to download it.

• Install it with pip install beautifulsoup4

BeautifulSoup

from bs4 import BeautifulSoup

# parsing the document


soup = BeautifulSoup('''<h1>Knowx Innovations PVt Ltd</h1>''', "html.parser")

print(type(soup))

Tag Object

• Tag object corresponds to an XML or HTML tag in the original document.

• This object is usually used to extract a tag from the whole HTML document.

• Beautiful Soup is not an HTTP client which means to scrap online websites
you first have to download them using the requests module and then serve
them to Beautiful Soup for scraping.

• This object returns the first found tag if your document has multiple tags with the same name.
from bs4 import BeautifulSoup
# Initialize the object with an HTML page
soup = BeautifulSoup('''
<html>
<b>RNSIT</b>
<b> Knowx Innovations</b>
</html>
''', "html.parser")
# Get the tag
tag = soup.b
print(tag)
# Print the output
print(type(tag))

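• A tag has many methods and attributes. Two important features of a tag are
its name and its attributes.
• Name: The name of the tag can be accessed through the ‘.name’ suffix.
• Attributes: A tag may have any number of attributes, such as class or id. They are
accessed by treating the tag like a dictionary (for example tag["class"]), and can be
modified or deleted in the same way.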

# Import Beautiful Soup


from bs4 import BeautifulSoup
# Initialize the object with an HTML page
soup = BeautifulSoup('''
<html>
<b>Knowx Innovations</b>
</html>
''', "html.parser")
# Get the tag
tag = soup.b
# Print the output
print(tag.name)
# changing the tag
tag.name = "Strong"
print(tag)

from bs4 import BeautifulSoup


# Initialize the object with an HTML page
soup = BeautifulSoup('''
<html>
<b class="RNSIT" name="knowx">Knowx Innovations</b>
</html>
''', "html.parser")
# Get the tag
tag = soup.b
print(tag["class"])
# modifying class
tag["class"] = "ekant"
print(tag)
# delete the class attributes
del tag["class"]
print(tag)

• An attribute such as class can hold multiple values. Beautiful Soup returns the value of such a
multi-valued attribute as a list, accessed by its key:

# Import Beautiful Soup


from bs4 import BeautifulSoup
# Initialize the object with an HTML page
# soup for multi_valued attributes
soup = BeautifulSoup('''
<html>
<b class="rnsit knowx">Knowx Innovations</b>
</html>
''', "html.parser")
# Get the tag
tag = soup.b
print(tag["class"])
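A small sketch contrasting a multi-valued attribute with a single-valued one. The id attribute and its value are additions made only for this illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<b class="rnsit knowx" id="main">Knowx Innovations</b>', "html.parser")
tag = soup.b

classes = tag["class"]        # multi-valued attribute: returned as a list
tag_id = tag["id"]            # single-valued attribute: returned as a string
missing = tag.get("title")    # get() returns None when the attribute is absent

print(classes, tag_id, missing)
print(tag.attrs)              # all attributes as a dictionary
```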

• NavigableString Object: A string corresponds to a bit of text within a tag. Beautiful Soup uses
the NavigableString class to contain these bits of text

from bs4 import BeautifulSoup


soup = BeautifulSoup('''
<html>
<b>Knowx Innovations</b>
</html>
''', "html.parser")
tag = soup.b
# Get the string inside the tag
string = tag.string
print(string)
# Print the output
print(type(string))

Find the Siblings of the tag

• previous_sibling is used to find the previous element of the given element

• next_sibling is used to find the next element of the given element

• previous_siblings is used to find all previous elements of the given element

• next_siblings is used to find all next elements of the given element
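The four sibling attributes can be tried on a small document. The HTML below is a made-up example written without whitespace between tags, so that every sibling is a tag rather than a text node.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><i>two</i><u>three</u></p>", "html.parser")
middle = soup.i

prev_name = middle.previous_sibling.name    # element just before <i>
next_name = middle.next_sibling.name        # element just after <i>

# The plural forms are generators over all remaining siblings
after_first = [t.name for t in soup.b.next_siblings]
before_last = [t.name for t in soup.u.previous_siblings]

print(prev_name, next_name, after_first, before_last)
```

previous_siblings walks backwards from the tag, so the nearest sibling comes first.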



descendants generator

• descendants generator is provided by Beautiful Soup


• The .contents and .children attribute only consider a tag’s direct children
• The descendants generator is used to iterate over all of the tag’s children,
recursively.

Example for descendants generator


from bs4 import BeautifulSoup
# Create the document
doc = "<body><b> <p>Hello world<i>innermost</i></p> </b><p> Outer text</p></body>"
# Initialize the object with the document
soup = BeautifulSoup(doc, "html.parser")
# Get the body tag
tag = soup.body
# .contents: list of the direct children
for content in tag.contents:
    print(content)
# .children: iterator over the same direct children
for child in tag.children:
    print(child)
# .descendants: all nested children, recursively
for descendant in tag.descendants:
    print(descendant)

Searching for and Extracting Specific Tags with Beautiful Soup

• Python BeautifulSoup – find all class


# Import Module
from bs4 import BeautifulSoup
import requests
# Website URL
URL = 'https://www.python.org/'
# class list set
class_list = set()
# Page content from Website URL
page = requests.get(URL)
# parse html content
soup = BeautifulSoup(page.content, 'html.parser')
# get all tags
tags = {tag.name for tag in soup.find_all()}
# iterate all tags
for tag in tags:
    # find all elements of this tag
    for i in soup.find_all(tag):
        # if the tag has a class attribute
        if i.has_attr("class"):
            if len(i['class']) != 0:
                class_list.add(" ".join(i['class']))
print(class_list)

Find a particular class


html_doc = """<html><head><title>Welcome to geeksforgeeks</title></head>
<body>
<p class="title"><b>Geeks</b></p>
<p class="body">This is an example to find a particular class
</body>
"""
# import module
from bs4 import BeautifulSoup
# parse html content
soup = BeautifulSoup( html_doc , 'html.parser')
# Finding by class name
c=soup.find( class_ = "body")
print(c)

Search by text inside a tag


Steps involved for searching the text inside the tag:
• Import module
• Pass the URL
• Request page
• Specify the tag to be searched
• To search by the text inside a tag, compare the tag's string attribute with the required text.
• The string attribute returns the text inside a tag.
• While navigating the tags, check this condition against the given text.
• Return the matching tag or its text

from bs4 import BeautifulSoup


import requests
# sample web page
sample_web_page = 'https://www.python.org'
# call get method to request that page
page = requests.get(sample_web_page)
# with the help of beautifulSoup and html parser create soup
soup = BeautifulSoup(page.content, "html.parser")
child_soup = soup.find_all('strong')
#print(child_soup)
text = """Notice:"""
# we will search the tag with in which text is same as given text
for i in child_soup:
    if i.string == text:
        print(i)

IMPORTANT POINTS
• BeautifulSoup provides several methods for searching for tags based on their contents,
such as find(), find_all(), and select().
• The find_all() method returns a list of all tags that match a given filter, while the find()
method returns the first tag that matches the filter.
• You can use the string keyword argument (called text in older versions of Beautiful Soup)
to search for tags that contain specific text.
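The points above can be tried offline on a small HTML string. The document and its text are made up for this sketch, and the string argument assumes Beautiful Soup 4.4 or later.

```python
from bs4 import BeautifulSoup

html_doc = """<p class="title"><b>Geeks</b></p>
<p class="body">First paragraph</p>
<p class="body">Second paragraph</p>"""
soup = BeautifulSoup(html_doc, "html.parser")

first_body = soup.find("p", class_="body")        # first matching tag only
all_bodies = soup.find_all("p", class_="body")    # list of all matching tags
by_text = soup.find_all("p", string="Second paragraph")   # match on the text

print(first_body.string)
print(len(all_bodies))
print(by_text)
```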

Select method

• The select method in BeautifulSoup (bs4) is used to find all elements in a
parsed HTML or XML document that match a specific CSS selector.

• CSS selectors are patterns used to select and style elements in a document.

• The select method allows you to apply these selectors to navigate and
extract data from the parsed document easily.

CSS Selector

• Id selector (#)
• Class selector (.)
• Universal Selector (*)
• Element Selector (tag)
• Grouping Selector(,)

• Id selector (#) :The ID selector targets a specific HTML element based on its unique
identifier attribute (id). An ID is intended to be unique within a webpage, so using the ID
selector allows you to style or apply CSS rules to a particular element with a specific ID.
#header {
color: blue;
font-size: 16px;
}
• Class selector (.) : The class selector is used to select and style HTML elements based on
their class attribute. Unlike IDs, multiple elements can share the same class, enabling
you to apply the same styles to multiple elements throughout the document.
.highlight {
background-color: yellow;
font-weight: bold;
}

• Universal Selector (*) :The universal selector selects all HTML elements on the webpage.
It can be used to apply styles or rules globally, affecting every element. However, it is
important to use the universal selector judiciously to avoid unintended consequences.
*{
margin: 0;
padding: 0;
}
• Element Selector (tag) : The element selector targets all instances of a specific HTML
element on the page. It allows you to apply styles universally to elements of the same
type, regardless of their class or ID.
p{
color: green;
font-size: 14px;
}

• Grouping Selector(,) : The grouping selector allows you to apply the same styles to
multiple selectors at once. Selectors are separated by commas, and the styles specified
will be applied to all the listed selectors.
h1, h2, h3 {
font-family: 'Arial', sans-serif;
color: #333;
}

• These selectors are fundamental to CSS and provide a powerful way to target and style
different elements on a webpage.

Creating a basic HTML page

<!DOCTYPE html>
<html>
<head>
<title>Sample Page</title>
</head>
<body>
<div id="content">
<h1>Heading 1</h1>
<p class="paragraph">This is a sample paragraph.</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ul>
<a href="https://example.com">Visit Example</a>
</div>
</body>
</html>

Scraping example using CSS selectors


from bs4 import BeautifulSoup
# Read the sample page above, assumed to be saved locally as web.html
with open("web.html") as f:
    html = f.read()
soup = BeautifulSoup(html, 'html.parser')
# 1. Select by tag name
heading = soup.select('h1')
print("1. Heading:", heading[0].text)
# 2. Select by class
paragraph = soup.select('.paragraph')
print("2. Paragraph:", paragraph[0].text)
# 3. Select by ID
div_content = soup.select('#content')
print("3. Div Content:", div_content[0].text)
# 4. Select by attribute
link = soup.select('a[href="https://example.com"]')
print("4. Link:", link[0]['href'])
# 5. Select all list items
list_items = soup.select('ul li')
print("5. List Items:")
for item in list_items:
    print("-", item.text)
141

Selenium
• Selenium is an open-source testing tool, which means it can be downloaded and used
free of charge.

• Selenium is a functional testing tool and is also compatible with non-functional
testing tools.

• Install it with pip: pip install selenium

Steps in form filling

• Import the webdriver from selenium

• Create driver instance by specifying browser

• Find the element

• Send the values to the elements

• Use click function to submit

Webdriver

• WebDriver is a powerful tool for automating web browsers.

• It provides a programming interface for interacting with web browsers and
performing various operations, such as clicking buttons, filling forms,
navigating between pages, and more.

• WebDriver supports multiple programming languages


from selenium import webdriver

Creating Webdriver instance


• You can create a WebDriver instance by calling the webdriver class for the browser
you want to use.
• Ex: driver = webdriver.Chrome()
• Browsers:
– webdriver.Chrome()
– webdriver.Firefox()
– webdriver.Edge()
– webdriver.Safari()
– webdriver.Opera()
– webdriver.Ie()

Find the element

• First, load the page containing the form using the get() function.

• To find an element, use find_element(), specifying one of the
following locator strategies:
—XPATH
—CSS Selector

XPATH

CSS Selector
from selenium import webdriver
import time
from selenium.webdriver.common.by import By
# Create a new instance of the Chrome driver
driver = webdriver.Chrome()
driver.maximize_window()
time.sleep(3)
# Navigate to the form page
driver.get('https://www.confirmtkt.com/pnr-status')
# Locate form elements
pnr_field = driver.find_element(By.NAME, "pnr")
submit_button = driver.find_element(By.CSS_SELECTOR, '.col-xs-4')
# Fill in form fields
pnr_field.send_keys('4358851774')
# Submit the form
submit_button.click()

Downloading web pages through form submission


from selenium import webdriver
import time
from selenium.webdriver.common.by import By
# Create a new instance of the Chrome driver
driver = webdriver.Chrome()
driver.maximize_window()
time.sleep(3)
# Navigate to the form page
driver.get('https://www.confirmtkt.com/pnr-status')
# Locate form elements
pnr_field = driver.find_element(By.NAME, "pnr")
submit_button = driver.find_element(By.CSS_SELECTOR, '.col-xs-4')
# Fill in form fields
pnr_field.send_keys('4358851774')
# Submit the form
submit_button.click()
welcome_message = driver.find_element(By.CSS_SELECTOR,".pnr-card")
# Print or use the scraped values
print(type(welcome_message))
html_content = welcome_message.get_attribute('outerHTML')
# Print the HTML content
print("HTML Content:", html_content)
# Close the browser
driver.quit()

A Python Integer Is More Than Just an Integer


Every Python object is simply a cleverly disguised C structure, which contains not only
its value, but other information as well.

X = 10000

X is not just a “raw” integer. It’s actually a pointer to a compound C structure, which
contains several values.

Difference between C and Python Variable

A Python List Is More Than Just a List



Because of Python’s dynamic typing, we can even create heterogeneous lists:
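As a minimal sketch of such a heterogeneous list (the name L3 and its contents are illustrative), each element keeps its own type information:

```python
# A single list can hold objects of several different types
L3 = [True, "2", 3.0, 4]

# Each item carries its own type, which is why a list is flexible but costly
types = [type(item) for item in L3]
print(types)
```

This flexibility comes at a cost: every item in the list must be a complete Python object with its own type information.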

In the special case that all variables are of the same type, much of this information is
redundant: it can be much more efficient to store data in a fixed-type array. The
difference between a dynamic-type list and a fixed-type (NumPy-style) array is
illustrated in Figure.

Fixed-Type Arrays in Python


• Python offers several different options for storing data in efficient, fixed-type data
buffers. The built-in array module can be used to create dense arrays of a uniform type:
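A short sketch of the array module in use; the variable names are illustrative:

```python
import array

L = list(range(10))
# 'i' is the type code for signed integers; every element must match it
A = array.array('i', L)
print(A)
```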

While Python’s array object provides efficient storage of array-based data, NumPy adds to
this efficient operations on that data.

Creating Arrays from Python Lists


import numpy as np

Unlike Python lists, NumPy arrays are constrained to contain elements of a single type. If the types do not match, NumPy will upcast if possible.

If we want to explicitly set the data type of the resulting array, we can use the dtype keyword:
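The three cases described above can be sketched as follows (the array contents are illustrative):

```python
import numpy as np

ints = np.array([1, 4, 2, 5, 3])                     # all integers: integer array
mixed = np.array([3.14, 4, 2, 3])                    # integers are upcast to float
as_float = np.array([1, 2, 3, 4], dtype='float32')   # explicit data type
print(ints.dtype, mixed.dtype, as_float.dtype)
```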



• NumPy arrays can explicitly be multidimensional; here’s one way of initializing a
multidimensional array using a list of lists:
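A small sketch of this idea, where each inner list is treated as one row of the resulting two-dimensional array (the values are illustrative):

```python
import numpy as np

# Three inner lists of length 3 give a 3x3 array
grid = np.array([list(range(i, i + 3)) for i in [2, 4, 6]])
print(grid)
```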

Creating Arrays from Scratch
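For larger arrays it is usually more efficient to build them directly with NumPy's creation routines. The sketch below shows a few commonly used ones; the shapes and fill values are illustrative:

```python
import numpy as np

zeros = np.zeros(10, dtype=int)       # length-10 integer array of zeros
ones = np.ones((3, 5), dtype=float)   # 3x5 floating-point array of ones
filled = np.full((3, 5), 3.14)        # 3x5 array filled with 3.14
steps = np.arange(0, 20, 2)           # 0, 2, 4, ..., 18
evenly = np.linspace(0, 1, 5)         # five values evenly spaced between 0 and 1
identity = np.eye(3)                  # 3x3 identity matrix
```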



NumPy Standard Data Types


• While constructing an array, you can specify the data type using a string:

• Or using the associated NumPy object:
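Both ways of specifying the data type can be sketched as follows (int16 is just an illustrative choice):

```python
import numpy as np

a = np.zeros(10, dtype='int16')    # data type given as a string
b = np.zeros(10, dtype=np.int16)   # same data type given as a NumPy object
print(a.dtype == b.dtype)
```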


The Basics of NumPy Arrays


We’ll cover a few categories of basic array manipulations here:
• Attributes of arrays
Determining the size, shape, memory consumption, and data types of arrays
• Indexing of arrays
Getting and setting the value of individual array elements
• Slicing of arrays
Getting and setting smaller subarrays within a larger array
• Reshaping of arrays
Changing the shape of a given array
• Joining and splitting of arrays
Combining multiple arrays into one, and splitting one array into many

NumPy Array Attributes




• Each array has attributes
ndim (the number of dimensions)
shape (the size of each dimension)
size (the total size of the array)

Write a Python program that creates an m×n integer array and prints its attributes using
NumPy.
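One possible solution is sketched below. The dimensions m and n, the value range, and the seed are assumptions chosen for illustration:

```python
import numpy as np

m, n = 3, 4                            # illustrative dimensions
rng = np.random.default_rng(seed=0)    # seeded so the run is reproducible
x = rng.integers(0, 10, size=(m, n))   # m x n array of integers in [0, 10)

print(x)
print("ndim: ", x.ndim)    # number of dimensions
print("shape:", x.shape)   # size of each dimension
print("size: ", x.size)    # total number of elements
print("dtype:", x.dtype)   # data type of the elements
```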

Output:

Array Indexing: Accessing Single Elements

In a multidimensional array, you access items using a comma-separated tuple of indices:



You can also modify values using any of the above index notation:

NumPy arrays have a fixed type. This means, for example, that if you attempt to insert a floating-point value
into an integer array, the value will be silently truncated.
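The indexing, modification, and truncation behaviour described above can be sketched like this (the array values are illustrative):

```python
import numpy as np

x2 = np.array([[3, 5, 2, 4],
               [7, 6, 8, 8],
               [1, 6, 7, 7]])
first = x2[0, 0]    # row 0, column 0
last = x2[2, -1]    # row 2, last column
x2[0, 0] = 12       # modify a single element in place

x1 = np.array([5, 0, 3, 3])
x1[0] = 3.14159     # integer array: the fractional part is dropped
print(first, last, x2[0, 0], x1[0])
```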

Array Slicing: Accessing Subarrays


One-dimensional subarrays

Multidimensional subarrays

Subarray dimensions can even be reversed together:
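A brief sketch of the slicing syntax x[start:stop:step] on one-dimensional and multidimensional arrays, including row and column access (the arrays are illustrative):

```python
import numpy as np

x = np.arange(10)
first_five = x[:5]         # elements 0..4
every_other = x[::2]       # every second element
reversed_x = x[::-1]       # all elements, reversed

x2 = np.arange(12).reshape(3, 4)
top_left = x2[:2, :3]      # first two rows, first three columns
flipped = x2[::-1, ::-1]   # rows and columns reversed together
first_col = x2[:, 0]       # first column
first_row = x2[0]          # first row, equivalent to x2[0, :]
```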

Accessing array rows and columns



Subarrays as no-copy views

Now if we modify this subarray, we’ll see that the original array is changed! Observe:

Creating copies of arrays
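The difference between a view and an explicit copy can be sketched as follows (array contents are illustrative):

```python
import numpy as np

x2 = np.arange(12).reshape(3, 4)

sub_view = x2[:2, :2]          # a slice is a view: it shares memory with x2
sub_view[0, 0] = 99
print(x2[0, 0])                # the original array has changed too

sub_copy = x2[:2, :2].copy()   # copy() creates independent data
sub_copy[0, 0] = 42
print(x2[0, 0])                # the original array is not touched this time
```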

Reshaping of Arrays
Another useful type of operation is reshaping of arrays. The most flexible way of doing this
is with the reshape() method. For example, if you want to put the numbers 1 through 9 in a
3×3 grid, you can do the following:
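A minimal sketch of that reshape call:

```python
import numpy as np

# Nine elements arranged into three rows of three
grid = np.arange(1, 10).reshape((3, 3))
print(grid)
```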

• Note that for this to work, the size of the initial array must match the size of the
reshaped array.

• Where possible, the reshape method returns a no-copy view of the initial array, but with
noncontiguous memory buffers this is not always the case.

Another common reshaping pattern is the conversion of a one-dimensional array into a
two-dimensional row or column matrix.

• Reshaping can be done with the reshape method, or more easily by making use of the
newaxis keyword within a slice operation.
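Both approaches give the same shapes, as this short sketch shows (the array x is illustrative):

```python
import numpy as np

x = np.array([1, 2, 3])

row_a = x.reshape((1, 3))   # row vector via reshape
row_b = x[np.newaxis, :]    # row vector via newaxis
col_a = x.reshape((3, 1))   # column vector via reshape
col_b = x[:, np.newaxis]    # column vector via newaxis
print(row_b.shape, col_b.shape)
```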

Array Concatenation and Splitting


• Concatenation of arrays

• Concatenating more than two arrays at once:

np.concatenate can also be used for two-dimensional arrays

For working with arrays of mixed dimensions, it can be clearer to use the np.vstack
(vertical stack) and np.hstack (horizontal stack) functions:
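The concatenation routines above can be sketched as follows (all arrays are illustrative):

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
z = np.array([99, 99, 99])
xy = np.concatenate([x, y])       # two arrays
xyz = np.concatenate([x, y, z])   # more than two arrays at once

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
rows = np.concatenate([grid, grid])           # along the first axis (default)
cols = np.concatenate([grid, grid], axis=1)   # along the second axis

v = np.vstack([x, grid])                        # stack a 1D array on top of a 2D one
h = np.hstack([grid, np.array([[99], [99]])])   # append a column on the right
```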

Splitting of arrays
• The opposite of concatenation is splitting, which is implemented by the functions np.split,
np.hsplit, and np.vsplit. For each of these, we can pass a list of indices giving the split points:

N split points lead to N + 1 subarrays.
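A short sketch of the three splitting functions (the arrays and split points are illustrative):

```python
import numpy as np

x = np.array([1, 2, 3, 99, 99, 3, 2, 1])
x1, x2, x3 = np.split(x, [3, 5])      # two split points give three subarrays

grid = np.arange(16).reshape((4, 4))
upper, lower = np.vsplit(grid, [2])   # split rows at index 2
left, right = np.hsplit(grid, [2])    # split columns at index 2
```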

Computation on NumPy Arrays: Universal Functions


• NumPy is so important in the Python data science world because it provides an easy and
flexible interface to optimized computation with arrays of data.

• Computation on NumPy arrays can be very fast, or it can be very slow. The key to making
it fast is to use vectorized operations, generally implemented through NumPy’s universal
functions (ufuncs).

• NumPy’s ufuncs can be used to make repeated calculations on array elements much
more efficient.

The Slowness of Loops

Each time the reciprocal is computed, Python first examines the object’s type and does a
dynamic lookup of the correct function to use for that type. If we were working in
compiled code instead, this type specification would be known before the code executes
and the result could be computed much more efficiently.

• For many types of operations, NumPy provides a convenient interface into this kind of
statically typed, compiled routine. This is known as a vectorized operation.

• This vectorized approach is designed to push the loop into the compiled layer that
underlies NumPy, leading to much faster execution.
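The contrast between an explicit loop and a vectorized operation can be sketched as below. The function name compute_reciprocals and the small input array are illustrative; on a large array the loop version would be far slower:

```python
import numpy as np

def compute_reciprocals(values):
    """Compute 1/x for each element with an explicit Python loop."""
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output

rng = np.random.default_rng(seed=0)
values = rng.integers(1, 10, size=5)

looped = compute_reciprocals(values)   # element by element, in Python
vectorized = 1.0 / values              # same result, loop runs in compiled code
print(np.allclose(looped, vectorized))
```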

• Looking at the execution time for our big array, we see that it completes orders of
magnitude faster than the Python loop:

Introducing UFuncs
• Vectorized operations in NumPy are implemented via ufuncs, whose main purpose is
to quickly execute repeated operations on values in NumPy arrays.

• Ufuncs are extremely flexible. Before, we saw an operation between a scalar and an
array, but we can also operate between two arrays:

• ufunc operations are not limited to one-dimensional arrays; they can act on
multidimensional arrays as well:
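Both cases can be sketched in a few lines (the arrays are illustrative):

```python
import numpy as np

# Operation between two one-dimensional arrays, element by element
ratios = np.arange(5) / np.arange(1, 6)

# The same kind of operation applied to a two-dimensional array
x = np.arange(9).reshape((3, 3))
powers = 2 ** x
print(ratios)
print(powers)
```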

Exploring NumPy’s UFuncs


• Ufuncs exist in two flavors:
— unary ufuncs, which operate on a single input
— binary ufuncs, which operate on two inputs.

• We’ll see examples of both these types of functions here, covering:
— Array arithmetic
— Absolute value
— Trigonometric functions
— Exponents and logarithms

Array arithmetic
• NumPy’s ufuncs feel very natural to use because they make use of Python’s native
arithmetic operators. The standard addition, subtraction, multiplication, and division can
all be used:

• There is also a unary ufunc for negation, a ** operator for exponentiation, and a % operator for modulus:

All of these arithmetic operations are simply convenient wrappers around specific
functions built into NumPy; for example, the + operator is a wrapper for the add function.
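The operators and their ufunc equivalents can be sketched as follows (the array x is illustrative):

```python
import numpy as np

x = np.arange(4)

added = x + 5          # addition
subtracted = x - 5     # subtraction
multiplied = x * 2     # multiplication
divided = x / 2        # true division
floor_div = x // 2     # floor division
negated = -x           # unary negation
squared = x ** 2       # exponentiation
remainder = x % 2      # modulus

# The + operator is a wrapper around np.add
same = np.array_equal(x + 2, np.add(x, 2))
print(same)
```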

Absolute value
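• Just as NumPy understands Python’s built-in arithmetic operators, it also understands
Python’s built-in absolute value function, abs(). The corresponding NumPy ufunc is
np.absolute, which is also available under the alias np.abs: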
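A minimal sketch showing that the three spellings agree (the array is illustrative):

```python
import numpy as np

x = np.array([-2, -1, 0, 1, 2])

builtin_abs = abs(x)       # Python's built-in abs() works on arrays
ufunc_abs = np.absolute(x)
alias_abs = np.abs(x)      # alias for np.absolute
print(ufunc_abs)
```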

Trigonometric functions
• NumPy provides a large number of useful ufuncs, and some of the most useful for the
data scientist are the trigonometric functions.
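As a small sketch, the trigonometric ufuncs can be applied to an array of angles in radians (the angles chosen here are illustrative):

```python
import numpy as np

theta = np.linspace(0, np.pi, 3)   # 0, pi/2, pi

sines = np.sin(theta)
cosines = np.cos(theta)
tangents = np.tan(theta)           # very large at pi/2 because of rounding

# Inverse trigonometric functions are also available
angles = np.arcsin(np.array([-1, 0, 1]))
print(sines, cosines)
```

Values are computed to machine precision, so results that should be exactly zero may appear as very small numbers.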

Exponents and logarithms
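A brief sketch of the exponential and logarithmic ufuncs NumPy provides (the input arrays are illustrative):

```python
import numpy as np

x = np.array([1, 2, 3])
e_to_x = np.exp(x)           # e ** x
two_to_x = np.exp2(x)        # 2 ** x
three_to_x = np.power(3, x)  # 3 ** x

y = np.array([1, 2, 4, 10])
natural_log = np.log(y)      # logarithm base e
log_base_2 = np.log2(y)      # logarithm base 2
log_base_10 = np.log10(y)    # logarithm base 10
```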

Advanced Ufunc Features


A few specialized features of ufuncs are:
• Specifying output
• Aggregates
• Outer products

Specifying output
• For large calculations, it is sometimes useful to be able to specify the array where the
result of the calculation will be stored. Rather than creating a temporary array, you can
use this to write computation results directly to the memory location where you’d
like them to be. For all ufuncs, you can do this using the out argument of the function:

The out argument can also be used with array views. For example, we can write the results of a computation to every other element of a specified array:
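Both uses of the out argument can be sketched as follows; the names buf, x, and y are illustrative:

```python
import numpy as np

x = np.arange(5)

# Write the result directly into a preallocated array
buf = np.empty(5)
np.multiply(x, 10, out=buf)

# Write the result into every other element of a larger array (a view)
y = np.zeros(10)
np.power(2, x, out=y[::2])
print(buf)
print(y)
```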

If we had instead written y[::2] = 2 ** x, this would have resulted in the creation of
a temporary array to hold the results of 2 ** x

Aggregates
• For binary ufuncs, there are some interesting aggregates that can be computed directly
from the object. For example, the reduce method of any ufunc can do this.

• A reduce method repeatedly applies a given operation to the elements of an array until
only a single result remains.
• For example, calling reduce on the add ufunc returns the sum of all elements in the
array:

Calling reduce on the multiply ufunc results in the product of all array elements:

To store all the intermediate results of the computation, use accumulate instead:

Note that for these particular cases, there are dedicated NumPy functions to compute the results
(np.sum, np.prod, np.cumsum, np.cumprod).
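The reduce and accumulate methods can be sketched on a small array (the array x is illustrative):

```python
import numpy as np

x = np.arange(1, 6)                     # 1, 2, 3, 4, 5

total = np.add.reduce(x)                # sum of all elements
product = np.multiply.reduce(x)         # product of all elements
running_sum = np.add.accumulate(x)      # keeps every intermediate sum
running_prod = np.multiply.accumulate(x)
print(total, product)
```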

Outer products
• Finally, any ufunc can compute the output of all pairs of two different inputs using the
outer method. This allows you, in one line, to do things like create a multiplication table:
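A one-line multiplication table, sketched with an illustrative range of 1 to 5:

```python
import numpy as np

x = np.arange(1, 6)
table = np.multiply.outer(x, x)   # table[i, j] == x[i] * x[j]
print(table)
```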

Broadcasting
Broadcasting in NumPy is a powerful mechanism that allows for the arithmetic operations on arrays of
different shapes and sizes, without explicitly creating additional copies of the data. It simplifies the
process of performing element-wise operations on arrays of different shapes, making code more
concise and efficient.

Here are the key concepts of broadcasting in NumPy:


• Shape Compatibility: Broadcasting is possible when the dimensions of the arrays involved are
compatible. Dimensions are considered compatible when they are equal or one of them is 1. NumPy
automatically adjusts the shape of smaller arrays to match the shape of the larger array during the
operation.
• Rules of Broadcasting: For broadcasting to occur, the sizes of the dimensions must either be the
same or one of them must be 1. If the sizes are different and none of them is 1, then broadcasting is
not possible, and NumPy will raise a ValueError.

• Automatic Replication: When broadcasting, NumPy automatically replicates the smaller array along the
necessary dimensions to make it compatible with the larger array. This replication is done without actually
creating multiple copies of the data, which helps in saving memory.
Example:
Suppose you have a 2D array A of shape (3, 1) and a 1D array B of shape (3,). Broadcasting allows you to
add these arrays directly: NumPy treats B as having shape (1, 3), then stretches A along its second
dimension and B along its first, so that both behave like arrays of shape (3, 3).

import numpy as np
A = np.array([[1], [2], [3]])
B = np.array([4, 5, 6])
result = A + B # Broadcasting occurs here

The value of result is:

array([[5, 6, 7],
       [6, 7, 8],
       [7, 8, 9]])
Module 5 [20MCA31] Data Analytics using Python

Module 5
Visualization with Matplotlib and Seaborn

Data Visualization: Matplotlib package – Plotting Graphs – Controlling Graph – Adding
Text – More Graph Types – Getting and setting values – Patches. Advanced data
visualization with Seaborn. Time series analysis with Pandas.

Matplotlib package

Matplotlib is a multiplatform data visualization library built on NumPy arrays. The matplotlib
package is the main graphing and plotting tool. The package is versatile and highly
configurable, supporting several graphing interfaces.

Matplotlib, together with NumPy and SciPy provides MATLAB-like graphing capabilities.

The benefits of using matplotlib in the context of data analysis and visualization are as follows:

• Plotting data is simple and intuitive.

• Performance is great; output is professional.

• Integration with NumPy and SciPy (used for signal processing and numerical analysis) is
seamless.

• The package is highly customizable and configurable, catering to most people’s needs.

The package is quite extensive and allows embedding plots in a graphical user interface.

Other Advantages

• One of Matplotlib’s most important features is its ability to play well with many operating
systems and graphics backends. Matplotlib supports dozens of backends and output types,
which means you can count on it to work regardless of which operating system you are
using or which output format you wish. This cross-platform, everything-to-everyone
approach has been one of the great strengths of Matplotlib.

• It has led to a large userbase, which in turn has led to an active developer base and
Matplotlib’s powerful tools and ubiquity within the scientific Python world.

• The Pandas library itself can be used as a wrapper around Matplotlib’s API. Even with
wrappers like this, it is still often useful to dive into Matplotlib’s syntax to adjust the
final plot output.


Plotting Graphs

• Line plots, scatter plots, bar charts, histograms, etc.

This section details the building blocks of plotting graphs: the plot() function and how to
control it to generate the output we require.

The plot() function is highly customizable, accommodating various options, including
plotting lines and/or markers, line widths, marker types and sizes, colors, and a legend to
associate with each plot.

The functionality of plot() is similar to that of MATLAB and GNU-Octave with some minor
differences, mostly due to the fact that Python has a different syntax from MATLAB and
GNU-Octave.

Lines and Markers


• Let’s begin by creating a vector to plot using NumPy:

The vector y is passed as an input to plot(). As a result, plot() drew a graph of the vector y
using auto-incrementing integers for an x-axis. Which is to say that, if x-axis values are
not supplied, plot() will automatically generate one for you: plot(y) is equivalent to
plot(range(len(y)), y).
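A minimal sketch of this first plot, assuming the pyplot interface is imported as plt; the data in y is illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

y = np.array([1, 4, 9, 16, 25])   # illustrative vector to plot

fig = plt.figure()
lines = plt.plot(y)               # x values default to 0, 1, 2, ...
```

Calling plt.show() afterwards displays the figure in an interactive session.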

Note: If you don’t have a GUI installed with matplotlib, replace show() with
savefig('filename') and open the generated image file in an image viewer.

• Let’s supply x-axis values (denoted by variable t):

The call to function figure() generates a new figure to plot on, so we don’t overwrite the previous
figure.

• Let’s look at some more options. Next, we want to plot y as a function of t, but display only
markers, not lines. This is easily done:
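The two steps above, supplying x-axis values and then showing markers only, can be sketched as follows. The data is illustrative, and plt is assumed to be the pyplot interface:

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 1, 11)   # x-axis values
y = t ** 2                  # illustrative data

plt.figure()                # new figure, so the previous one is kept
line_only = plt.plot(t, y)

plt.figure()
markers_only = plt.plot(t, y, 'o')   # 'o' draws circle markers and no line
```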


To select a different marker, replace the character 'o' with another marker symbol.

Table below lists some popular choices; issuing help(plot) provides a full account of the
available markers.

Controlling Graph

• Axis limits, labels, ticks, colors, styles, etc.


• Use keyword arguments when calling plotting functions in Matplotlib.

For a graph to convey an idea aesthetically, the data, though important, is not
everything. The grid and grid lines, combined with a proper selection of axis and labels,
present additional layers of information that add clarity and contribute to overall graph
presentation.

Now, let’s focus on controlling the figure by controlling the x-axis and y-axis behavior
and setting grid lines.
• Axis
• Grid and Ticks
• Subplots
• Erasing the Graph

Axis
The axis() function controls the behavior of the x-axis and y-axis ranges. If you do not supply a
parameter to axis(), the return value is a tuple in the form (xmin, xmax, ymin, ymax). You can
use axis() to set the new axis ranges by specifying new values: axis([xmin, xmax, ymin, ymax]).

If you’d like to set or retrieve only the x-axis values or y-axis values, do so by using the
functions xlim(xmin, xmax) or ylim(ymin, ymax), respectively.
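A short sketch of reading and setting axis ranges, assuming the pyplot interface is imported as plt (the plotted data and limits are illustrative):

```python
import matplotlib.pyplot as plt

fig = plt.figure()
plt.plot([0, 1, 2, 3], [0, 1, 4, 9])

current = plt.axis()      # returns (xmin, xmax, ymin, ymax)
plt.axis([0, 5, 0, 10])   # set all four limits at once
plt.xlim(1, 4)            # change only the x-axis range
limits = plt.axis()       # read the ranges back
print(limits)
```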


The function axis() also accepts the following values: 'auto', 'equal', 'tight', 'scaled', and 'off'.

— The value 'auto'—the default behavior—allows plot() to select what it thinks are the best
values.
— The value 'equal' forces each x value to be the same length as each y value, which is
important if you’re trying to convey physical distances, such as in a GPS plot.
— The value 'tight' causes the axis to change so that the maximum and minimum values of
x and y both touch the edges of the graph.
— The value 'scaled' changes the x-axis and y-axis ranges so that x and y have both the
same length (i.e., aspect ratio of 1).
— Lastly, calling axis('off') removes the axis and labels.

To illustrate these axis behaviors, a circle is plotted as below:
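One way to plot such a circle is sketched below; the number of points is an assumption, and 'equal' can be swapped for any of the other values listed above:

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 2 * np.pi, 200)
x = np.cos(theta)   # points on the unit circle
y = np.sin(theta)

fig = plt.figure()
plt.plot(x, y)
plt.axis('equal')   # try 'auto', 'tight', 'scaled', or 'off' as well
```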

Figure below shows the results of applying different axis values to this circle.

Grid and Ticks

The function grid() draws a grid in the current figure. The grid is composed of a set of
horizontal and vertical dashed lines coinciding with the x ticks and y ticks. You can toggle
the grid by calling grid() or set it to be either visible or hidden by using grid(True) or
grid(False), respectively.

To control the ticks (and effectively change the grid lines, as well), use the functions xticks()
and yticks(). The functions behave similarly to axis() in that they return the current ticks if
no parameters are passed; you can also use these functions to set ticks once parameters
are provided. The functions take an array holding the tick values as numbers and an
optional tuple containing text labels. If the tuple of labels is not provided, the tick numbers
are used as labels.
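A minimal sketch of grid() and xticks(), assuming the pyplot interface is imported as plt; the data and tick labels are illustrative:

```python
import matplotlib.pyplot as plt

fig = plt.figure()
plt.plot([0, 1, 2, 3], [10, 20, 15, 30])

plt.grid(True)                                       # show the grid
plt.xticks([0, 1, 2, 3], ('Q1', 'Q2', 'Q3', 'Q4'))   # tick values and text labels
locs, labels = plt.xticks()                          # no arguments: read ticks back
print(list(locs))
```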

Adding Text

There are several options to annotate your graph with text. You’ve already seen some, such as
using the xticks() and yticks() function.

The following functions will give you more control over text in a graph.

Title

The function title(str) sets str as a title for the graph and appears above the plot area. The
function accepts the arguments listed in Table 6-5.


All alignments are based on the default location, which is centered above the graph. Thus,
setting ha='left' will print the title starting at the middle (horizontally) and extending to the
right. Similarly, setting ha='right' will print the title ending in the middle of the graph
(horizontally). The same applies for vertical alignment. Here’s an example of using the title()
function:

>>> title('Left aligned, large title', fontsize=24, va='baseline')

Axis Labels and Legend

The functions xlabel() and ylabel() are similar to title(), only they’re used to set the x-axis and
y-axis labels, respectively. Both these functions accept the text arguments.
>>> xlabel('time [seconds]')

Next on our list of text functions is legend(). The legend() function adds a legend box and
associates a plot with text:

The legend order associates the text with the plot. An alternative approach is to specify the
label argument with the plot() function call, and then issue a call to legend() with no
parameters:
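Both ways of building a legend are sketched below, assuming the pyplot interface is imported as plt; the curves and label strings are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 2 * np.pi, 50)

# Approach 1: pass the texts to legend() in the order the plots were drawn
fig1 = plt.figure()
plt.plot(t, np.sin(t))
plt.plot(t, np.cos(t))
legend1 = plt.legend(['sin(t)', 'cos(t)'])

# Approach 2: label each plot, then call legend() with no text arguments
fig2 = plt.figure()
plt.plot(t, np.sin(t), label='sin(t)')
plt.plot(t, np.cos(t), label='cos(t)')
legend2 = plt.legend(loc='upper right')
```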


loc can take one of the following values: 'best', 'upper right', 'upper left', 'lower left', 'lower right',
'right', 'center left', 'center right', 'lower center', 'upper center', and 'center'. Instead of
strings, you can also use numbers: 'best' corresponds to 0, 'upper right' corresponds to 1, and
'center' corresponds to 10. Using the value 'best' moves the legend to a spot less likely to hide
data; however, performance-wise there may be some impact.

Text Rendering

The text(x, y, str) function accepts the coordinates x, y in graph units and the string to print,
str, and renders the string on the figure. You can modify the text alignment using the ha and va
arguments. The following will print text at location (0, 0):

The function text() has many other arguments, such as rotation and fontsize.
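A small sketch of text() with a few of these arguments, assuming the pyplot interface is imported as plt; the string and styling values are illustrative:

```python
import matplotlib.pyplot as plt

fig = plt.figure()
plt.plot([-1, 0, 1], [-1, 0, 1])

# Place a rotated, centered annotation at the point (0, 0)
label = plt.text(0, 0, 'origin', ha='center', va='bottom',
                 fontsize=12, rotation=45)
```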

Example:

The example script summarizes the functions we’ve discussed up to this point: plot() for
plotting; title(), xlabel(), ylabel(), and text() for text annotations; and xticks(), ylim(), and grid()
for grid control.


More Graph Types

• Boxplots, pie charts, polar plots, 3D plots, etc.


• Matplotlib offers a variety of specialized plotting functions for different types of
data.

(Note: Refer to the text book)

Getting and setting values

The object-oriented design of matplotlib provides two functions, getp() and setp(), that retrieve
and set a matplotlib object’s parameters, respectively. The benefit of using setp() and getp() is
that automation is easily achieved. Whenever a plot() command is called, matplotlib returns a list
of matplotlib objects.

For example, you can use the getp() function to get the linestyle of a line object. You can use the
setp() function to set the linestyle of a line object.

Here is an example of how to use the getp() and setp() functions to get and set the linestyle of a
line object:


import matplotlib.pyplot as plt

# Create a figure and a set of subplots
fig, ax = plt.subplots()

# Generate some data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# Plot the data; plot() returns a list of line objects
line, = ax.plot(x, y)

# Get the linestyle of the line object
linestyle = plt.getp(line, 'linestyle')

# Print the linestyle
print('Linestyle:', linestyle)

# Set the linestyle to dashed
plt.setp(line, linestyle='--')

# Show the plot
plt.show()

This code creates a line plot of the data in the x and y lists. It first retrieves the current
linestyle of the line object and prints it to the console, then sets the linestyle to dashed.
Finally, the code shows the plot.

Here is a Python code example of getting and setting values in Matplotlib:


import matplotlib.pyplot as plt

# Create a figure and a set of subplots
fig, ax = plt.subplots()

# Generate some data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# Plot the data
ax.plot(x, y)

# Set the x-axis label
ax.set_xlabel('X-axis')

# Set the y-axis label
ax.set_ylabel('Y-axis')

# Set the title of the plot
ax.set_title('My Plot')

# Get the x-axis limits
x_min, x_max = ax.get_xlim()

# Get the y-axis limits
y_min, y_max = ax.get_ylim()

# Print the x-axis limits
print('X-axis limits:', x_min, x_max)

# Print the y-axis limits
print('Y-axis limits:', y_min, y_max)

# Show the plot
plt.show()

This code will create a line plot of the data in the x and y lists. The x-axis label will be set to
"X-axis", the y-axis label will be set to "Y-axis", and the title of the plot will be set to "My Plot".
The code will then print the x-axis and y-axis limits to the console. Finally, the code will show
the plot.

You can use the get_xlim() and get_ylim() functions to get the current x-axis and y-axis limits,
respectively. You can use the set_xlim() and set_ylim() functions to set the x-axis and y-axis
limits, respectively.


Patches

Drawing shapes requires some more care. matplotlib has objects that represent many
common shapes, referred to as patches. Some of these, like Rectangle and Circle are found
in matplotlib.pyplot, but the full set is located in matplotlib.patches.

To use patches, follow these steps:


1. Draw a graph.
2. Create a patch object.
3. Attach the patch object to the figure using the add_patch() function.

To add a shape to a plot, create the patch object shp and add it to a subplot by calling
ax.add_patch(shp).

import matplotlib.pyplot as plt


fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
rect = plt.Rectangle((0.2, 0.75), 0.4, 0.15, color='k', alpha=0.3)
circ = plt.Circle((0.7, 0.2), 0.15, color='b', alpha=0.3)
pgon = plt.Polygon([[0.15, 0.15], [0.35, 0.4], [0.2, 0.6]], color='g', alpha=0.5)
ax.add_patch(rect)
ax.add_patch(circ)
ax.add_patch(pgon)

To work with patches, assign them to an already existing graph because, in a sense, patches
are “patched” on top of a figure. Table below gives a partial listing of available patches. In this
table, the notation xy indicates a list or tuple of (x, y) values


Advanced data visualization with Seaborn

Seaborn is a powerful Python data visualization library based on Matplotlib. It provides a
high-level interface for creating attractive and informative statistical graphics. Here's a guide
for advanced data visualization with Seaborn:

Import Libraries: Import the necessary libraries, including Seaborn and Matplotlib.

import seaborn as sns
import matplotlib.pyplot as plt

Load a Dataset: Seaborn comes with several built-in datasets for practice. You can load one
using the load_dataset function.

tips_data = sns.load_dataset("tips")
tips_data.head()

The data set has the following columns: total_bill, tip, sex, smoker, day, time, and size.


Customize Seaborn Styles: Seaborn comes with several built-in styles. You can set the style using
sns.set_style().
sns.set_style("whitegrid")
# Other styles include "darkgrid", "white", "dark", and "ticks"

Advanced Scatter Plots: Create a scatter plot with additional features like hue, size, and style.

sns.scatterplot(x="total_bill", y="tip", hue="day", size="size", style="sex", data=tips_data)

Pair Plots for Multivariate Analysis: Visualize relationships between multiple variables with pair
plots.

sns.pairplot(tips_data, hue="sex", markers=["o", "s"], palette="husl")

Heatmaps: Create a heatmap to visualize the correlation matrix of variables.

# numeric_only=True skips the categorical columns (sex, smoker, day, time)
correlation_matrix = tips_data.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm")

Violin Plots: Visualize the distribution of a numerical variable for different categories.

sns.violinplot(x="day", y="total_bill", hue="sex", data=tips_data, split=True, inner="quart")

FacetGrid for Customized Subplots: Use FacetGrid to create custom subplots based on
categorical variables.

g = sns.FacetGrid(tips_data, col="time", row="smoker", margin_titles=True)
g.map(sns.scatterplot, "total_bill", "tip")

Joint Plots: Combine scatter plots with histograms for bivariate analysis.

sns.jointplot(x="total_bill", y="tip", data=tips_data, kind="hex")


Question bank:

1. What are patches? Explain with an example.
2. How to get and set the values in the graphs? Give an example.
3. Discuss any 3 aspects of the graph that can be controlled to enhance the visualization.
4. How to annotate the graph with text. Illustrate with example.
5. What is Seaborn? List the advantages.
6. Write a Python code to plot following graphs using Seaborn:
a) line plot b) histogram c) scatter plot
7. What is Time series analysis? Write a Python program to demonstrate Timeseries analysis
with Pandas.
8. Discuss Advanced data visualization using seaborn.


1. Explain how a simple line plot can be created using matplotlib. Show the adjustments
done to the plot w.r.t. line colors.
The simplest of all plots is the visualization of a single function y = f(x). Here we will create
a simple line plot.

In Matplotlib, the figure (an instance of the class plt.Figure) can be thought of as a single
container that contains all the objects representing axes, graphics, text, and labels. The
axes (an instance of the class plt.Axes) is what we see above: a bounding box with ticks
and labels, which will eventually contain the plot elements that make up the visualization.

Alternatively, we can use the pylab interface, which creates the figure and axes in the
background. Ex: plt.plot(x, np.sin(x))

Adjusting the Plot: Line Colors

The plt.plot() function takes additional arguments that can be used to adjust line colors and
styles. To adjust the color, use the color keyword, which accepts a string argument representing
virtually any imaginable color. The color can be specified in a variety of ways.
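A sketch of the common ways to specify a line color, assuming the pyplot interface is imported as plt; the curves are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
fig = plt.figure()

lines = []
lines += plt.plot(x, np.sin(x - 0), color='blue')            # color name
lines += plt.plot(x, np.sin(x - 1), color='g')               # short code (rgbcmyk)
lines += plt.plot(x, np.sin(x - 2), color='0.75')            # grayscale between 0 and 1
lines += plt.plot(x, np.sin(x - 3), color='#FFDD44')         # hex code (RRGGBB)
lines += plt.plot(x, np.sin(x - 4), color=(1.0, 0.2, 0.3))   # RGB tuple, values 0 to 1
lines += plt.plot(x, np.sin(x - 5), color='chartreuse')      # HTML color name
```

If no color is given, Matplotlib cycles through a set of default colors for multiple lines.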


2. Distinguish MATLAB style and object-oriented interfaces of Matplotlib.


MATLAB-style interface
• Matplotlib was originally written as a Python alternative for MATLAB users, and much of
its syntax reflects that fact.
• The MATLAB-style tools are contained in the pyplot (plt) interface.
• The interface is stateful: it keeps track of the “current” figure and axes, where all plt
commands are applied. Once the second panel is created, going back and adding something
to the first is a bit complex.

Object-oriented interface
• The object-oriented interface is available for these more complicated situations, and for
when you want more control over your figure.
• Rather than depending on some notion of an “active” figure or axes, in the object-oriented
interface the plotting functions are methods of explicit Figure and Axes objects.
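The same two-panel figure written in both styles makes the distinction concrete. This is a sketch with illustrative data, assuming the pyplot interface is imported as plt:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)

# MATLAB-style: plt commands act on the "current" figure and axes
plt.figure()
plt.subplot(2, 1, 1)      # (rows, columns, panel number)
plt.plot(x, np.sin(x))
plt.subplot(2, 1, 2)
plt.plot(x, np.cos(x))

# Object-oriented: plot by calling methods on explicit Axes objects
fig, ax = plt.subplots(2)
ax[0].plot(x, np.sin(x))
ax[1].plot(x, np.cos(x))
```

In the object-oriented version, returning to the first panel later is just another call on ax[0].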

3. Write the lines of code to create a simple histogram using matplotlib library.

A simple histogram can be useful in understanding a dataset. The code below creates a
simple histogram.
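A minimal sketch, assuming the pyplot interface is imported as plt; the random data, seed, and number of bins are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)
data = rng.normal(size=1000)   # 1000 normally distributed values

fig = plt.figure()
# hist() returns the bin counts, the bin edges, and the drawn bar patches
counts, bin_edges, patches = plt.hist(data, bins=30)
```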


4. What are the two ways to adjust axis limits of the plot using Matplotlib? Explain with the example
for each.

Matplotlib does a decent job of choosing default axes limits for your plot, but sometimes
it's nice to have finer control.

The two ways to adjust axis limits are:

• using plt.xlim() and plt.ylim() methods


• using plt.axis()

The plt.axis() function allows you to set the x and y limits with a single call, by passing a
list that specifies [xmin, xmax, ymin, ymax].
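Both ways of setting the limits can be sketched as below. The sine curve and the particular limit values are assumptions chosen only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)  # sample data (assumed)

# Way 1: set the x and y limits separately
plt.figure()
plt.plot(x, np.sin(x))
plt.xlim(-1, 11)
plt.ylim(-1.5, 1.5)
limits_1 = (plt.xlim(), plt.ylim())   # called without arguments, they return the limits

# Way 2: set all four limits with a single call
plt.figure()
plt.plot(x, np.sin(x))
plt.axis([-1, 11, -1.5, 1.5])         # [xmin, xmax, ymin, ymax]
limits_2 = plt.axis()                 # returns (xmin, xmax, ymin, ymax)
```

plt.axis() also accepts strings such as 'tight' or 'equal' for automatic adjustments of the limits and aspect ratio.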

5. List out the dissimilarities between plot() and scatter() functions while plotting scatter plot.

• The difference between the two functions is: with pyplot.plot() any property you apply
(color, shape, size of points) will be applied across all points whereas in pyplot.scatter() you
have more control in each point’s appearance. That is, in plt.scatter() you can have the color,
shape and size of each dot (datapoint) to vary based on another variable.

• While it doesn’t matter as much for small amounts of data, as datasets get larger than a
few thousand points, plt.plot can be noticeably more efficient than plt.scatter. The reason is
that plt.scatter has the capability to render a different size and/or color for each point, so
the renderer must do the extra work of constructing each point individually. In plt.plot, on
the other hand, the points are always essentially clones of each other, so the work of
determining the appearance of the points is done only once for the entire set of data.

• For large datasets, the difference between these two can lead to vastly different performance,
and for this reason, plt.plot should be preferred over plt.scatter for large datasets.
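The difference in per-point control can be illustrated with a small sketch. The random data, colors, and sizes below are assumptions made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = rng.normal(size=100)
colors = rng.random(100)        # one color value per point
sizes = 1000 * rng.random(100)  # one marker size per point

# plt.plot: a single marker style shared by every point
plt.figure()
plt.plot(x, y, 'o', color='black')

# plt.scatter: color and size vary point by point
plt.figure()
sc = plt.scatter(x, y, c=colors, s=sizes, alpha=0.3, cmap='viridis')
plt.colorbar()                  # shows how the color values map to colors
```

Here the color and size arguments carry two extra variables of the dataset, which plt.plot cannot express in a single call.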

6. How to customize the default plot settings of Matplotlib w.r.t runtime configuration
and stylesheets? Explain with the suitable code snippet.
• Each time Matplotlib loads, it defines a runtime configuration (rc) containing the default
styles for every plot element we create.

• We can adjust this configuration at any time using the plt.rc convenience routine.

• To modify the rc parameters, we’ll start by saving a copy of the current rcParams
dictionary, so we can easily reset these changes in the current session:

IPython_default = plt.rcParams.copy()

• Now we can use the plt.rc function to change some of these settings:
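A hedged sketch of the steps above: the specific settings changed, the sample data, and the choice of the 'ggplot' stylesheet are illustrative assumptions rather than the original notes' code.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

# Save a copy of the current settings so they can be restored later
IPython_default = plt.rcParams.copy()

# Change some defaults with the plt.rc convenience routine
plt.rc('axes', facecolor='#E6E6E6', edgecolor='none', axisbelow=True, grid=True)
plt.rc('grid', color='w', linestyle='solid')
plt.rc('lines', linewidth=2)
changed_width = plt.rcParams['lines.linewidth']

data = np.random.default_rng(0).normal(size=1000)  # assumed sample data
plt.figure()
plt.hist(data)                      # drawn with the customized settings

# Stylesheets: apply a predefined style temporarily with a context manager
with plt.style.context('ggplot'):
    plt.figure()
    plt.hist(data)

# Restore the settings saved at the start
plt.rcParams.update(IPython_default)
```

plt.style.use('ggplot') would apply the stylesheet for the rest of the session instead; the names of the available styles are listed in plt.style.available.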


7. Elaborate on Seaborn versus Matplotlib with suitable examples.


The Seaborn library is built on top of Matplotlib. Here is a detailed comparison
between the two:

Functionality

• Seaborn: provides a variety of visualization patterns. It uses less syntax and has attractive
default themes. It specializes in statistical visualization and is used when one has to
summarize data in visualizations and also show the distribution of the data.

• Matplotlib: mainly deployed for basic plotting. Visualization using Matplotlib generally
consists of bars, pies, lines, scatter plots and so on.

Handling Multiple Figures

• Seaborn: automates the creation of multiple figures. This sometimes leads to OOM (out of
memory) issues.

• Matplotlib: multiple figures can be open at once, but they need to be closed explicitly.
plt.close() closes only the current figure; plt.close('all') closes them all.

Visualization

• Seaborn: more integrated for working with Pandas data frames. It extends the Matplotlib
library for creating beautiful graphics with Python using a more straightforward set of
methods.

• Matplotlib: a graphics package for data visualization in Python. It is well integrated with
NumPy and Pandas. The pyplot module mirrors the MATLAB plotting commands closely, so
MATLAB users can easily transition to plotting with Python.

Data Frames and Arrays

• Seaborn: works with the dataset as a whole and is much more intuitive than Matplotlib. In
Seaborn, relplot() is an entry API with a 'kind' parameter that specifies the type of plot,
which can be scatter or line (catplot() covers bar and other categorical types). Seaborn is
not stateful; hence, plotting calls require the data object to be passed in.

• Matplotlib: works with data frames and arrays. It has stateful APIs for plotting: the
current figure and axes are tracked as objects, so plot()-like calls suffice without having
to pass them as parameters.

Flexibility

• Seaborn: avoids a ton of boilerplate by providing commonly used default themes.

• Matplotlib: highly customizable and powerful.

Use Cases

• Seaborn: meant for more specific use cases. It uses Matplotlib under the hood and is
specially meant for statistical plotting.

• Matplotlib: Pandas uses Matplotlib; Pandas plotting is a neat wrapper around Matplotlib.

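Let us assume

x = [10, 20, 30, 45, 60]
y = [0.5, 0.2, 0.5, 0.3, 0.5]

Matplotlib

# plot the graph with Matplotlib's classic style
import matplotlib.pyplot as plt
plt.style.use('classic')
plt.plot(x, y)
# one plotted line, so one legend label
plt.legend(['A'], loc='upper left')

Seaborn

# plot the same graph after applying Seaborn's default theme
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
plt.plot(x, y)
# one plotted line, so one legend label
plt.legend(['A'], loc='upper left')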


8. List and describe different categories of colormaps with the suitable code snippets.
Three different categories of colormaps:

Sequential colormaps: These consist of one continuous sequence of colors
(e.g., binary or viridis).

Divergent colormaps: These usually contain two distinct colors, which show positive and
negative deviations from a mean (e.g., RdBu or PuOr).

Qualitative colormaps: These mix colors with no particular sequence (e.g., rainbow or jet).

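import numpy as np
import matplotlib.pyplot as plt

# assumed sample image (I is not defined in the original snippet): a smooth 2-D pattern
x = np.linspace(0, 10, 1000)
I = np.sin(x) * np.cos(x[:, np.newaxis])

# replace about 1% of the pixels with random noise
speckles = (np.random.random(I.shape) < 0.01)
I[speckles] = np.random.normal(0, 3, np.count_nonzero(speckles))
plt.figure(figsize=(10, 3.5))
plt.subplot(1, 2, 1)
plt.imshow(I, cmap='RdBu')   # divergent colormap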
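To compare the three categories listed above, the following sketch draws the same image with one colormap from each. The image data and the choice of viridis, RdBu, and jet as representatives are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

# assumed sample image: a smooth 2-D pattern with positive and negative values
x = np.linspace(0, 10, 200)
image = np.sin(x) * np.cos(x[:, np.newaxis])

# one example per category: sequential, divergent, qualitative (as categorized above)
names = ['viridis', 'RdBu', 'jet']

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
for ax, name in zip(axes, names):
    im = ax.imshow(image, cmap=name)   # cmap selects the colormap
    ax.set_title(name)
    fig.colorbar(im, ax=ax)            # colorbar shows the value-to-color mapping
```

The divergent map makes the sign of each value easy to read, while the sequential map emphasizes magnitude along a single color progression.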

9. How to customize legends in the plot using matplotlib.
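One possible way to approach this question is sketched below: label each plotted element, then pass options to the legend call. The curves and the particular options shown (loc, frameon, ncol) are illustrative assumptions, not the original notes' answer.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)  # sample data (assumed)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), '-b', label='Sine')     # label is picked up by the legend
ax.plot(x, np.cos(x), '--r', label='Cosine')

# loc places the legend, frameon toggles its box, ncol sets the number of columns
leg = ax.legend(loc='upper left', frameon=False, ncol=2)
labels = [text.get_text() for text in leg.get_texts()]
```

Other keywords such as fancybox, framealpha, shadow, and borderpad adjust the legend box itself.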

10. With a suitable example, describe how to draw histogram and KDE plots using
Seaborn.
Often in statistical data visualization, all we want is to plot histograms and joint
distributions of variables.
