DATA ENGINEERING --> AWS + PYSPARK --> AWS DATA ENGINEER --> AWS CLOUD
LEARNINGS -->
1) PYTHON
2) PYSPARK --> TABLE IN SPARK --> DATAFRAMES
3) AWS --> DATA ENGINEERING SERVICES
BIG DATA --> 200 plus frameworks -->
SPARK --> DATA PROCESSING FRAMEWORK --> 2012 --> SCALA ( JAVA )
2019 - 2021
1) SCALA --> 80 to 90%
2) Python --> 7 - 9 %
3) Java --> 3 to 5 %
2022 -->
1) Python --> 65% --> PYSPARK --> SPARK + PYTHON
2) Scala --> 20 to 30%
3) Java --> less than 2% --> FORD
PYSPARK --> HEART OF BIG DATA --> PYTHON + SPARK
=========================
Python --> Data Processing , Data Analysis
1) Data types
2) Collections --> List , tuple , dictionary ( IMP)
3) LOOPS --> IF , FOR
4) Functions --> This is very imp
5) Class , method --> Different class methods
6) Error handling mechanism
=========================
THINGS TO BE DONE BEFORE TOMORROW'S SESSION -->
1) Download Python --> SHARE THE DOCUMENT
2) IDEs --> Jupyter Notebook , PyCharm , IntelliJ , Eclipse
Download PyCharm --> Share the document
3) Sublime Text --> Share the document
4) You can download Anaconda --> Jupyter Notebook --> YOUTUBE LINK
============================
What is Python ?
SPARK -->
Python -->
1) Simple to learn
2) Reduces the number of lines of code --> Debugging becomes very easy
3) Vast availability of libraries
1) General Purpose Programming language -->
--> Data Engineering --> Data processing
--> Web development
--> Reporting purpose
--> Machine learning
DE , DS , DA , WD
2) SCALA is an Object Oriented programming language
Python is also an Object Oriented programming language
3) Why is Python interpreted ?
Python --> 10 lines --> executed one line at a time
line 1 --> machine code
line 2 --> machine code
Compile --> whole program translated to machine code at once
Why is Scala faster than Python ?
SCALA --> Compiled Language --> Machine Code ahead of time
PYTHON --> Interpreted Language --> translated line by line at run time
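Because the interpreter executes one line at a time, line 1 runs and prints before the broken line 2 is ever reached; a minimal sketch (the NameError on the second line is deliberate):

print("line 1 executed")       # runs and prints first
print(undefined_variable)      # NameError raised only when this line is reached

A compiled language like Scala would reject the whole program at compile time instead.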
4) pyspark code --> ERROR --> Interactive mode ...
CLI --> Command Line Interface
5) Dynamically typed -->
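A quick sketch of dynamic typing: no type is declared, and the same name can be rebound to a value of a different type, checked only at run time:

x = 10
print(type(x))      # <class 'int'>
x = "hello"         # same name, now a string --> allowed in Python
print(type(x))      # <class 'str'>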
6) Platform independent --> WINDOWS , MAC , LINUX
===================================
Structured programming --> Modularized programming
Functional programming --> Immutability --> Pure functions --> Removing mutability in code
Mutability --> Mutable code
Impure function --> Logic --> same inputs can return different outputs:
fn(1,2) --> JAN 1 --> 100
fn(1,2) --> JAN 2 --> 200
Pure function --> same inputs always return the same output:
fn(1,2) --> JAN 1 --> 100
fn(1,2) --> JAN 2 --> 100
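A minimal Python sketch of the two behaviours (function names are illustrative): the impure function reads and changes state outside itself, so the same inputs give different outputs across calls; the pure function depends only on its arguments:

counter = 0

def impure_fn(a, b):
    """Impure: result depends on the external counter, not just a and b."""
    global counter
    counter += 1
    return a + b + counter

def pure_fn(a, b):
    """Pure: same inputs always return the same output, no side effects."""
    return a + b

print(impure_fn(1, 2), impure_fn(1, 2))   # 4 5 --> different each call
print(pure_fn(1, 2), pure_fn(1, 2))       # 3 3 --> always the same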
Immutable -->
Functions -->
==================
Automatic Garbage Collection -->
Indentation --> 4 spaces or 1 tab -->
python 2 or python 3 ...
100% --> Python 3
==================
1) Keywords
2) Variables
3) Indentation
4) Comments
5) Loops
6) Output format
1) Keywords -->
Reserved Words in python
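Python can list its own reserved words at run time; a quick check:

import keyword

print(keyword.kwlist)          # every reserved word, e.g. if, for, def, class
print(len(keyword.kwlist))     # around 35 in recent Python 3 versions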
2) Identifiers --> Rules
data ---> identifier
1) Don't create an identifier with a digit at the start
2) Use lower case or upper case letters , digits and underscores
3) Don't use keywords as identifiers
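A few examples of the rules above (names are illustrative):

customer_data = 1      # valid: letters and an underscore
data2 = 2              # valid: digit allowed, just not at the start
# 2data = 3            # invalid: SyntaxError --> starts with a digit
# class = 4            # invalid: SyntaxError --> 'class' is a keyword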
3) Comments in python
Why do we need comments ?
# --> single-line comment
DOC STRING --> """ ... """ --> documentation for functions and classes:
class ETL_Pipelines:
    def extract(self):
        """
        Extracting the data from the Hive customer table and creating the
        dataframe out of it
        """
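The docstring is stored on the function object and can be read back at run time:

print(ETL_Pipelines.extract.__doc__)    # prints the docstring above
help(ETL_Pipelines.extract)             # also displays it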
4) Indentation -->
{} --> Block of code in other languages
indentation --> 4 spaces or 1 tab --> Block of code in Python
pyspark code --> spark-submit --> testing --> IndentationError: unexpected indent
CONTROL + ENTER --> run the cell in Jupyter Notebook
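A minimal sketch of a correctly indented block; shifting any body line by an extra space would trigger the IndentationError above before the code even runs:

def transform():
    total = 0          # block body --> indented 4 spaces
    total += 10        # same level --> same block
    return total

print(transform())     # 10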
Data types -->
1) Python Numbers --> Integer data type , float data type and Complex Data type
2) Boolean
3) String data
isinstance()
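isinstance() checks which data type a value belongs to; a quick sketch across the types above:

print(isinstance(10, int))            # True --> Python Numbers: integer
print(isinstance(10.5, float))        # True --> float
print(isinstance(3 + 4j, complex))    # True --> complex
print(isinstance(True, bool))         # True --> Boolean
print(isinstance("spark", str))       # True --> String data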