
Tutorials - Matt Harrison: Getting Started with Polars

"Transcript

Introduction
0:00
So welcome, everyone, to the Getting Started with Polars class. I'm Matt Harrison. If you go to this link, it will take you to the GitHub project for the course. The repo is under my GitHub account, Matt Harrison, and it's called 2023-pycon-polars. If you hit period while you're in there, it will launch Codespaces, and you can actually run all of this from Codespaces too if you want to. I don't know if anyone needs help with that, but otherwise, if you clone the project or download the notebook (either way will work) and then pip install Polars, you should be mostly good. You'll probably also want to install PyArrow if you want some pandas interoperability.
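(A minimal setup sketch for following along; the package names are the ones mentioned above, and the version is just what the talk was recorded against:)

    # Assumes you've already run something like:  pip install polars pyarrow
    # (the talk uses Polars 0.17.5; newer versions may differ in small ways)
    import polars as pl
    print(pl.__version__)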
Matt's Background
1:05
Okay, I just want to go around the class really quickly and understand our backgrounds. Who here is working in a data science kind of capacity? ... Okay, so we've got a lot of data-science people in here. Who here uses pandas? Okay. So I imagine you're here because you're either looking for something different or you're just curious; I'd be interested in knowing what your motivation is for being here. One thing to note is that we're kind of a small group, because this was just put together Monday: PyCon reached out to me and said "do you want to do a tutorial, we had someone drop out," made sure I was actually okay doing it, and that was the end of it, so there hasn't been a lot of marketing. But I guess that's fine for you all. What's your motivation for Polars — anyone want to share? [Attendee: ...2.0 came out; we wanted something that was a lot more efficient.] [Attendee: we learned Rust.] Okay — so for those who didn't hear, the comments were speed, memory, general interest, and Rust as well. And whether Polars is the end-all be-all for all of our pandas woes. Anyone else? Okay.

So let me give my background, just so we're on the same page as to what's going on here. I wrote the book Effective Pandas, and my job is corporate trainer — I like to say that I sell snake oil and teach people to tell lies with data — but I spend a good deal of my time in front of groups larger than this, helping them level up in Python and learn things like pandas and machine learning. I've been using pandas almost since it came out, both in a professional capacity and more recently in a consulting and training capacity, and I use it on my own data as well. So I feel relatively comfortable with pandas. Polars has been out for a while, and at the end of last year I thought, okay, maybe I should see what this buzz is about. This tutorial is the result of me taking a lot of the content that I teach in pandas and porting it to Polars, to see where the give and take is. Am I a Polars expert? No, not by any stretch of the imagination. Am I a Rust expert? No, I'm not a Rust expert either. So what you're going to get here is a biased look at Polars, but hopefully we can see where the pros and cons are, and I think we can come to some preliminary suggestions on when it might make sense to use it and explore a little more.

So again, you can go to that URL, download the notebook, throw it into Jupyter, and make sure that you've pip installed Polars and, I believe, PyArrow as well; if you pip install both of those you should be good. I'm going to restart my kernel here and clear my output as well.

I think the schedule is that we go until 10:30 and then we have a break; I'm not quite sure how long the break is — I've heard anywhere from 10 to 30 minutes — so okay, a 10-minute break at that time. Basically I've got this notebook; it has a bunch of examples, but it also has lab work in it. What we'll do is I'll talk for a little bit, you can ask questions, and the nice thing about a notebook is that you can try things out and play around in it. Then we have some labs in there, so I'll give you time to work on the labs and try out Polars on your own, and we'll rinse and repeat that for our time here. Like I said, I'm pretty flexible, so if you have questions feel free to raise your hand or shout them out and let's answer them.

I am using Polars 0.17.5, so I assume that by the time this makes it out to YouTube or wherever, this will be an outdated version of Polars. I'm only half kidding — they release Polars pretty quickly, and that is something to note. I actually did a version of this talk at the end of last year, and I would say a fair number of portions of it broke between then and now. What's my point? My point is that Polars is moving and evolving, and the API isn't stable per se. In fact, when I gave this course a couple of months ago, the creator of Polars reached out to me and I went through it with him at that point, and he kind of gave it a thumbs up, a semi-blessing — okay, we're good there. It was nice because he walked me through a lot of the mentality and the ideas behind Polars. So if Richie's watching this, I'm sure he'll comment on or correct things I might misspeak about, because he seems to be pretty involved.
Loading Data
8:24
So let's load some data. Actually, let me reload this page first to make sure I have the latest version. Okay — you can load this data if you've checked out the repo; otherwise, run the cell down here, which will fetch it from the internet. I'm just using the read_csv function in Polars, and that gives us what we call a DataFrame. This is the output of doing that — anyone having issues with that?
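(Roughly what that cell looks like — a sketch; the real notebook reads the file shipped with the course repo, so the path here is just a placeholder:)

    import polars as pl

    # Load the tweet-analytics CSV into a Polars DataFrame.
    df = pl.read_csv("data/tweets.csv")   # hypothetical path
    print(df.shape)                       # (rows, columns), e.g. (5700, 40)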
Ideally you're able to run what I'm doing and follow along with me, so if you need help with that, let me know. This is the output; this is a DataFrame. Everyone in here probably knows, but for those watching who might not: a DataFrame is basically a representation of tabular data, and generally when you're doing analytics you want each column to be of a single type. So this is a Polars DataFrame. At the top you can see its shape: 5,700 rows and 40 columns. This happens to be my tweet data for some window of time — these are my tweets from Twitter, with analytics information about how I tweeted. What we're going to do is go through this and assume I was given this dataset for a consulting gig, or that I'm doing a data analysis on it, and this is my process of going through the dataset, exploring it, and cleaning it up.

A couple of things to note if you're familiar with pandas. This looks kind of like a pandas DataFrame, but the output is a little different: it shows the shape at the top, whereas pandas shows the shape at the bottom; it's got column names, which should be familiar; and below those it shows the types.

I think everyone here is probably aware, but Python is not a fast language, and one of the things that makes a DataFrame library fast is leveraging lower-level tooling to optimize memory. Pandas prior to 2.0 leveraged NumPy under the covers, and as of pandas 2 you get PyArrow support. Polars is leveraging Rust and a Rust implementation of Arrow under the covers. Arrow is an in-memory representation for storing analytic, tabular data. The benefit you get there, versus storing this in pure Python, is that Arrow is going to say "here is a column that represents the tweet ID, it's a 64-bit integer, we've got 5,700 of these," and it makes a buffer of roughly 5,700 times 64 bits, and all of that is the tweet ID information. If you were doing that in Python, you would have a Python object for each of those IDs, with a couple dozen bytes of overhead for each. In addition to saving memory, Arrow also supports low-level operations that take advantage of things like multiple cores or SIMD instructions, so certain operations can be optimized and run very quickly. A DataFrame library like this gives you the best of both worlds: you can write Python syntax but get C or Rust speeds.

We can scroll through this if we want to; you can see that it's showing some amount of the data and some of the columns. One thing to note, if you're familiar with pandas, is that there is no index here. Pandas sticks in an index, which you can think of as a row ID for some purposes (even though indexes don't have to be unique), and there is no index in Polars. That has some effects you just need to be aware of if you're coming from pandas. Also, pandas has the ability to do hierarchical columns and hierarchical multi-indexes — it uses the same data structure to represent an index and to represent columns — and Polars does not have hierarchical columns either; it's all flat.

In these next cells I've just listed out what you get, to show you. The polars package itself exposes a good deal of functions, and the DataFrame has a good deal of things you can do. If I did this with a pandas DataFrame, there are around 430 different attributes on it; with Polars you're looking at quite a few less — 176 — so a little less is exposed from that point of view. One more thing to be aware of: there's also the notion of a column, and a column lets you use these things called expressions. I'm showing what's available on expressions, and you can see there's a good deal you can do to a column — taking the standard deviation, the square root, sorting — even things you can't do in pandas; pandas doesn't have sine, for example, so Polars can do some things that pandas doesn't.
Exploring Data Frame
15:18
Okay, let's explore our DataFrame. If I had just been given this, this is the process I would go through. In pandas I'd maybe explore the index — again, there is no index in Polars, so just be aware of that. Here are the columns; that just gives us a list of the column names. In pandas I'd probably do something like take a sample of 20 rows and transpose that, and that kind of works here too. Why do I like to take a sample? What I find is that the first n rows of the data are often either incomplete, or they might have test data in them, or the schema hadn't been completely fleshed out yet, so looking at the first rows is often not good enough to give me a sufficient understanding of what's going on; taking a sample of rows is often better. Transposing is a little hack I like because transposing — which, if you're not aware, flips the rows and the columns — often lets me see a bit more data than I would otherwise. Sometimes I transpose as part of a math operation, but a thing to be aware of is that because we are column-oriented here, when we transpose the data all the columns basically become strings, since a string is the lowest common denominator that can represent everything. So you might lose some memory efficiency — and change "might" to "probably will" — if you transpose data whose columns have different types. We can also inspect the shape, which is just a Python tuple of the number of rows and the number of columns. Okay, so that's my data loaded up.
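(A sketch of that quick first look, using the df loaded above:)

    # Quick exploration of a new DataFrame (df is the tweets DataFrame from above).
    df.columns                   # list of column names
    df.sample(20).transpose()    # a sample beats the first rows; transposing shows more,
                                 # but note that every value becomes a string
    df.shape                     # (n_rows, n_cols) as a plain Python tuple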
Exploring Data Types
17:20
Having initially explored some of it, I might dive a little deeper. One thing I want to look at is the types, and there is a dtypes attribute, which again is similar to what you'd see in pandas. We see things like Int64, Utf8, and Float64. Just to be pedantic: these are not Python types — there is no Utf8 type in Python, no Int64, no Float64 — these are the Arrow-backed types, and for every column this is the type Arrow is using to hold that data. We can also ask for the estimated size in bytes of this data, and it's telling us this is about 2.9 megabytes.
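(In code, roughly:)

    # Inspect the schema and the in-memory footprint (df from above).
    df.dtypes               # e.g. [Int64, Utf8, Float64, ...] - one dtype per column
    df.estimated_size()     # estimated size in bytes; about 2.9 MB for this dataset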
Let me comment on data size for a bit. When I talk about pandas, I like to say that pandas is an in-memory tool, or a small-data tool, in that pandas the library — not pandas the API — requires that you have your data in memory. Polars is similar in that you will generally have your data in memory, but it can also do some streaming: Polars has a mechanism for lazy evaluation, and you can batch chunks of data through it if you want to, so if you have data that's larger than memory, it is possible to use Polars for that. I said pandas the library, not the pandas API, because there are other libraries that aim to implement the pandas API. One example is Modin: Modin strives to implement the complete pandas API — even going as far as implementing its bugs — while allowing you to scale out across multiple machines. Dask is somewhat similar. I'm also an advisor for a company called Ponder, founded by the creator of Modin; the idea there is to take the pandas API to things like BigQuery and Snowflake, so you can run the same pandas code on large data there. So the pandas API gives you the breadth from small data to big data; the Polars API gives you small data up to somewhat larger-than-memory, but not scaling out to multiple machines — at least, that's the state of it right now. And because of that, even though Polars has some support for streaming, you're generally going to want to hold your data in memory if you want to materialize it, visualize it, or stick it into machine learning, and you'll generally want a little extra headroom for further operations on your data.
Lowering Memory Requirements
20:51
This dataset is so small that it's not really a big deal, but you might get other datasets where you're edging up against your memory limits, so I'm going to show some techniques for lowering the memory requirements, if memory is one of the main reasons you're looking at something like Polars. I will note that a lot of this applies to pandas as well, and you can do similar things there.

Here I'm calling the describe method — this is actually the same as in pandas, and I like to do it when I get a new dataset. What does it give me? Summary statistics for my data, pretty similar to what we see in pandas: we have count and null_count (in pandas, count is the number of non-missing rows; it has a different definition in Polars), then the mean, the standard deviation, the minimum, the maximum, the median (which is the 50th percentile), and the 25th and 75th percentiles. This is nice just for going through the data and understanding what's going on with it. One thing I like to do with output like this is go through the maximum and minimum values and try to understand whether the data type being used is the appropriate one — by default Polars is going to use 64-bit floats and integers, which may or may not be appropriate.

You can also ask for quantiles directly, like the 25th-percentile quantile. I actually don't know if you can ask for two at once — let's try, say, the 25th and the 70th. Okay, so pandas would support something like that, where you'd get both; Polars does not, it just supports one. And if you look at the result, again there's no index over here saying "this is the description of this row" — it's just the result of taking the 25th percentile of all of those rows.
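(A sketch of those calls; the two-quantile attempt is the one that fails:)

    # Summary statistics, pandas-style (df from above).
    df.describe()

    # A single quantile works on the whole DataFrame...
    df.quantile(0.25)

    # ...but unlike pandas, asking for several quantiles in one call,
    # e.g. df.quantile([0.25, 0.75]), is not supported here and raises.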
Expressions
23:19
Okay, having said that: what I did up there is kind of a no-no. I said "on the DataFrame, calculate the quantile." Generally what we want to do is use these things called expressions instead of working directly on the data. What are expressions? The basic idea is that you can think of Polars as layered: we've got our Arrow data, a Rust layer on top of that, and then a Python layer on top of that, which is the API we're going through. If we can keep things in that Rust layer and below, we get speed and performance benefits. If we cross the boundary up to the Python layer, that gives us some flexibility, but it also slows things down, because then we have to go from Rust to Python and back. Expressions are a way of expressing operations so that they stay in the Rust layer, so to speak.

So it starts to look something like this, which again is completely different from what you'd write in pandas: df.select and then pl.all(). We're saying "run this select" — which selects columns — and this is an expression that represents all of the columns; that all comes from the polars namespace. You can also say pl.col and give it a name, or pass "*" to select all the columns, but we'll say all here. We can also do something like pl.col(pl.Float64), which gives us all of the columns that are Float64 — similar to select_dtypes in pandas. You can also use exclude to exclude things, and these can also take regular expressions if you want to name columns by regex.

Here I'm going to select the Impressions column two times and ask for a DataFrame with that, and when I do, I get a duplicate error: it says the column with name Impressions has more than one occurrence. This is another thing that's a little different if you're used to pandas — pandas would happily let you have the same column multiple times; Polars just says you can't have columns with the same name. If we wanted to do something like that, we could say: take the Impressions column, but also take the Impressions column again with the expression suffix("_2"), which changes the name of the second copy by sticking "_2" on the end — and Polars is fine with that.
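(The selection patterns just described, as a sketch; suffix() is the 0.17-era spelling — newer Polars moves it to .name.suffix():)

    import polars as pl

    df.select(pl.all())                         # every column, as an expression
    df.select(pl.col(pl.Float64))               # only the Float64 columns (like select_dtypes)
    df.select(pl.all().exclude("Impressions"))  # everything except one column

    # Two copies of a column need distinct names:
    df.select([
        pl.col("Impressions"),
        pl.col("Impressions").suffix("_2"),     # second copy becomes "Impressions_2"
    ])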
Columns
26:41
Okay. We saw that I couldn't get the 25th percentile and the 75th percentile in one call to quantile, so let's see if we can do it another way. I'm going to build up this little operation: df.select, select all of these and take the 25th percentile, then select all of these and take the 50th percentile and give it a suffix of _q.5. If we scroll over, we can see the columns with that suffix, so that kind of worked — it gave us those. Now, if you were doing this in pandas, you'd probably want the results vertically stacked instead of horizontally stacked, so I'll show you how you might do something like that.

Here I take the zeroth quantile and then make a column. To make a column you use with_columns, and here I'm saying I want a lit — a literal column, a literal value — and the value that goes in it is the string "q0". Then I say alias, which renames the column. If we do this, there's a column called describe at the end, and it has the value q0. Again, this syntax is completely different from pandas: in pandas I would use assign to do this. There isn't an assign here — this is how you create columns with expressions in Polars. Then I do the same thing, putting the literal in the describe column alongside the quantile, and stick that column at the front, and you can see it's now at the front. So this is kind of an attempt to emulate pandas' describe functionality.
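(A sketch of that pattern — the "describe" column name and "q0" label follow the talk:)

    # Take a quantile of every column and label the result with a literal column.
    (df
     .quantile(0)                                    # one row: the 0th quantile of each column
     .with_columns(pl.lit("q0").alias("describe"))   # add a constant label column
     .select(["describe", pl.all().exclude("describe")])   # move the label to the front
    )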
Vertical Stacking
29:10
Again, when I initially did this, I talked through it with the creator of Polars. At that time Polars had describe, but it didn't have all of the features of pandas' describe — I think because of our discussions, or maybe for other reasons, they've since added those — but when I first did this it didn't include the 25th and 75th percentiles, so this is an attempt to stick those in.

What are we doing here? This is a little more involved. I'm going to concatenate vertically — stack these things on top of each other. What am I concatenating? Everything from this list comprehension; I'm putting a star in front of it to unpack the values. And what are those values? Each one says: take df at a given quantile, and use with_columns to add the label for that quantile — q0, then another one for 0.25, another for 0.5, 0.75, and 1 — and each of those becomes one entry. If we do that, we get something that looks like this: here are the label columns I created, here are the values that came from the quantiles, and we're vertically stacking them.

So that's maybe a little more complicated, maybe more of a challenge — we're combining list comprehensions with unpacking to do it. If you want to vertically stack things, it's a little harder, because we have to do things like stick in a new literal so that we know what the rows represent. I'd say that's a give and take: we don't have the index, which in pandas would represent that, so we're having to play games with a literal column to label what the rows are.
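(Roughly the shape of that cell — a sketch of the idea, not the exact notebook code:)

    # Emulate pandas' describe() quantile rows by stacking labelled quantile frames.
    quantiles = [0, 0.25, 0.5, 0.75, 1]
    pl.concat(
        [
            df.quantile(q).with_columns(pl.lit(f"q{q}").alias("describe"))
            for q in quantiles
        ],
        how="vertical",
    )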
Transposing
31:58
Okay, so what am I doing here? I'm selecting the time and Impressions columns, taking these quantiles — 0, 0.25, and 1 — for each, and then transposing that. This is what it looks like if I don't transpose: those different columns, and that looks fine. If I transpose it — because there's no index, I do have to say include_header=True, and I also have to supply the column names to get those names in there. You can explore what happens if you transpose without that; it looks like this. So those are some other hoops you might have to jump through if you're used to pandas' index functionality.

Okay, the code I put up here is, I know, not the world's greatest — it's kind of unsatisfying — so I'm going to put it into a function and try it with pipe: I'll pipe that describe-like function in here, and we can see that works as well. So this is a way you can refactor your code.
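(A sketch of the pipe idea; pd_describe is a hypothetical name for the helper built in the previous cells:)

    import polars as pl

    def pd_describe(df: pl.DataFrame) -> pl.DataFrame:
        # hypothetical helper wrapping the labelled-quantile stacking from above
        quantiles = [0, 0.25, 0.5, 0.75, 1]
        return pl.concat(
            [df.quantile(q).with_columns(pl.lit(f"q{q}").alias("describe")) for q in quantiles]
        )

    # pipe lets the helper slot into a method chain
    df.select(["Impressions"]).pipe(pd_describe)

    # transposing without an index: include_header=True keeps the old column names around
    df.select(["Impressions"]).pipe(pd_describe).transpose(include_header=True)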
One of the things to be aware of is that Polars has the notion of being lazy. In this case, can I make this lazy? Some things you're able to make lazy and some things you aren't. What does laziness give you in Polars? It gives you the ability to construct a chain of operations, and then Polars can go through that chain and optimize it — it can do query optimization, similar to what a SQL database would do. Lazy mode lets you build up all the operations and then execute them in an efficient manner. In this case, because we're using describe, the answer is no — you can't really make describe lazy, because when you're lazy Polars wants to know all about the schema, and when you do something like a pivot it's hard to know what the schema will be. You can think of describe as a pivot. Okay, let's look at the integer types.
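(A minimal sketch of the lazy workflow — not the describe example, since that can't be made lazy:)

    # Build a query lazily; Polars can optimize the whole chain before running it.
    lazy_query = (
        df.lazy()
          .select(pl.col(pl.Int64))                          # keep the 64-bit integer columns
          .with_columns(pl.col("Impressions").cast(pl.Int32))
    )
    result = lazy_query.collect()    # nothing executes until .collect()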
Integers
35:04
Here's everything that's an integer, and I've got things like the number of app opens, the number of app installs, the number of follows — again, this is Twitter data — and a lot of these look like small numbers. The tweet ID is a large number, but most of these look small, yet I'm using a 64-bit integer to represent them. So I might do something like this chain: select the integer types and then describe them. If I look at the results: app opens maxes out at zero, app installs is all zeros, follows goes up to 191, media views goes to around 16,000, and media engagements goes to around sixteen thousand as well. That lets me consider: do I need 64-bit integers to represent this data? Probably not. For a column that's all zeros, for the purposes of machine learning I could just drop it, because a column with no variance has no information in it. Something like follows goes up to 191, so an unsigned 8-bit integer could probably represent follows — assuming I never get more than 255 follows.
So let's look at how we might cast these to a different type. Here's Impressions. It turns out there's a cast on the column: we can take Impressions and cast it to an Int32, and that looks like it works. However, if I'm doing this with square brackets, I'm accessing the column directly rather than using the expression syntax, so I can't be lazy and Polars isn't able to make optimizations. I want to use the expression style instead, which looks like this: instead of square brackets I say select, take the Impressions column, and cast it to Int32. If I want to stick that back into the DataFrame with an expression, I say df.with_columns — and here's Impressions; you can see it now says it's a 32-bit integer. Let's see if we can get it down to an 8-bit integer.
Casting and Overflow
37:51
Unlike pandas — which, if you try to do this, will happily proceed and give you overflow errors — Polars is a little smarter about it. It says that if you do that, you're basically going to be lossy, because you've got values outside the range of that type. The max value of Impressions is this one right here, so I'm going to cast it to an Int32 instead, and that works if I leave it as an Int32.
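(A sketch of that strict-casting behavior:)

    # Expression-style cast, written back into a new frame.
    df_cast = df.with_columns(pl.col("Impressions").cast(pl.Int32))

    # Polars' cast is strict by default - this raises because some values exceed 255:
    # df.with_columns(pl.col("Impressions").cast(pl.UInt8))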
If you're curious about the size ranges of these types — I wish there were a way in Polars to ask for this; I don't know that there is — in pandas we generally go through NumPy. NumPy has iinfo to get integer info and finfo to get floating-point info, and you can pass in the NumPy types, which should have the same ranges as the Arrow types. You can see, for example, that an unsigned 8-bit integer goes from 0 to 255.
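(For example, with NumPy:)

    import numpy as np

    np.iinfo(np.uint8)     # iinfo(min=0, max=255, dtype=uint8)
    np.iinfo(np.uint16)    # iinfo(min=0, max=65535, dtype=uint16)
    np.finfo(np.float32)   # the floating-point equivalent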
All right, here's some code that tries to say: take our DataFrame, select the float columns and the integer columns, describe that, and then select the columns where the max value in the describe output is less than 255. If I do this, I get an error: it says that "<" is not supported between instances of string and integer. What I need to do instead is a little more involved: the describe output gives me the max in the second-to-last position, and I check whether that is less than the value. So this is some code I could use to find the columns whose numeric values are less than 255, and these are those columns right here.

Then I could take this and look at, say, an unsigned 16-bit integer, which goes up to about 65,000, and see which columns could be unsigned 16-bit. These are the ones that could be, but if you look at them, they're the same columns as the ones above that were under 255, so you could do some set operations there. Basically, most of those could be unsigned 8-bit, but maybe for growth purposes — if we might someday want values greater than 255 — we'll leave them as unsigned 16-bit. Again, unsigned can't go negative, and I don't think it makes sense to have negative retweets (unless Elon's changed that recently), so we should be okay with unsigned values there.

So what am I doing here? I'm saying with_columns, and then I'm using a list comprehension to take all these columns and make a new column expression that casts each one to a 16-bit integer. That works, but it's using a list comprehension, and generally we want to stay with expressions rather than sticking Python code into the middle. That's one of those things to be aware of: a list comprehension in the middle of your chain can make Polars cross the boundary between Rust and Python. Instead of doing it that way, it turns out we can just use pl.col and pass in the list of columns we want — that's the expression way to do it.
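(Both spellings, as a sketch; uint16_cols stands in for whatever list of column names the describe-based check above produced:)

    uint16_cols = ["follows", "media views", "media engagements"]   # hypothetical list

    # List-comprehension version - builds one cast expression per column in Python:
    df.with_columns([pl.col(name).cast(pl.UInt16) for name in uint16_cols])

    # Expression version - pl.col accepts a list of names, so it stays one expression:
    df.with_columns(pl.col(uint16_cols).cast(pl.UInt16))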
Okay, what I've got here is just a benchmark. In the first one I have the list comprehension, and in the second one I have the expression. For the purposes of this micro-benchmark on my machine, the list comprehension is slightly slower. I will add some caveats about timing and benchmarking: my experience, at least in the pandas world, is that you want to benchmark with the size of data you're going to be using in production. It doesn't make much sense to benchmark something on 10 rows, or 100 rows, or a few thousand rows, if you've got millions of rows in production, because oftentimes different code paths get taken. So just be aware of that. In this case, both of these take around 200 microseconds — which I can't really comprehend; it's not a huge deal for either — but if you had a few gigabytes of data it could take long enough that doing it the idiomatic Polars way with expressions, especially when you start chaining it with other operations, is going to have an impact.

In this one here I'm being lazy. Again, Polars has the ability to be lazy, and what laziness means is that instead of being eager and materializing or evaluating each operation as we go, we build up the chain and Polars can analyze it and make certain optimizations. In this case, because we're only doing one thing — just that with_columns — there's no benefit to the laziness.
Okay, let's see: do we have any integers that need to be larger than 16-bit? Here I'm sticking in that magic number, and the only column it finds is the describe column — which isn't actually a numeric column; that's just because I said to select the describe column along with this one. So there are no columns in this dataset that need to be larger than unsigned 16-bit integers. So I might have a chain that looks something like this: cast these columns to 8-bit unsigned integers, cast these ones to 16-bit, and maybe leave Impressions as 32-bit. We get something that looks like this, and it didn't complain, so the conversion should be lossless — we haven't lost any data or had any integer overflows.

If we look through the data, you see there are a bunch of these promoted columns full of nulls. It turns out I haven't paid Twitter to promote anything, so those values are missing, and maybe I want to filter those out: if I'm doing analysis or machine learning, I'm not getting any bang for my buck from columns that aren't doing anything. One of the things we can do is a select — and sadly, it looks like a recent version of Jupyter broke the cell magic; I don't know if it works on yours, I need to explore that, but it seems to be a Jupyter issue.
Polars Documentation
46:21
If you look at the help for select: I would say the documentation for Polars is pretty good. My general recommendation for people working with libraries like this is to stay in Jupyter, or in your environment, if you can — so if you can pull up the documentation right there, that can be useful. You can see that it describes what's going on, and there are examples of using it as well; I think it's pretty good documentation. So let's see what select says: these are the columns to select; it accepts expression input, so you can pass an expression; strings are parsed as column names; other non-expression inputs are passed as literals. It should also say that you can pass in a regular expression — it doesn't say that, but you can. So here I'm going to pass a regular expression that asks: does this column name start with "promoted"? And here are all the promoted columns. Again, select is the Polars way of pulling out columns, so here I'll say: select all of the columns, but exclude the promoted ones. That works — there aren't any promoted columns left, and we see there are only 22 columns now instead of all of them.
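(A sketch of the regex selection — in Polars the pattern has to be anchored with ^ and $ to be treated as a regex:)

    # Columns whose names start with "promoted"
    df.select(pl.col("^promoted.*$"))

    # Everything except those columns (exclude accepts the same regex form)
    df.select(pl.all().exclude("^promoted.*$"))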
We might want to drop all those columns — again, you can pull up the documentation for drop — and here I'm dropping the promoted columns, and that works as well.
Avoid Drop
48:16
Another hint I got from Richie: your code is probably going to be better if you avoid using drop. His point, when I was going over this with him, was "I don't care about the columns I'm dropping; I care about the columns I'm working with." So rather than saying "drop these columns," you should be explicit and say "use these columns." I don't know whether that helps with their internal optimizations — it might, because then it knows the schema of what's going on — but here I'm just pulling off these columns so I know what columns I have, and then, instead of doing the drop, I come down here and select those columns. Be positive about what you're selecting rather than stating what you're refusing. So now I've built up this chain: here's my raw data; then I cast all of these columns to UInt8, cast these to UInt16, cast this one to a 32-bit integer, and then select these columns right here. Let's run that and make sure it works — it's not complaining, so we should be good. Another thing I like to do is rename my columns.
Renaming columns
49:42
In pandas, renaming is super useful to me because I pull off columns directly by attribute access — that's how I prefer to do it, because I get better tab completion in Jupyter. It's not such a huge deal in Polars, because you don't really use attribute access to pull off columns, but let's say I wanted to anyway. I could do something like this, where I rename things: I want to take my current column names and rename them, and in this case I'm using a pipe. Let's try that — and it looks like it works. What's the benefit of the pipe? When you pipe, you get the current state of the DataFrame, so I can refer to the current columns, not the original state that had all those extra columns. What I'm doing is a rename that replaces any spaces with underscores; you can see that "Tweet permalink" now has an underscore in it. Again, not a huge thing for Polars per se — in pandas I like it because it makes the columns easier to work with — but I am using a lambda here, which might get in the way of laziness, and it uses a dictionary comprehension. If you want to get rid of the dictionary comprehension, we can do something like this:
pl.all
51:15
pl.all — and because this is an expression, this all refers to the current state of the DataFrame in our chain, not the original DataFrame — and then I say map_alias with a function to clean all those names up. This would be the more idiomatic Polars way of doing it. Now, I did stick a lazy in here, which, instead of doing anything, just gives us the plan; if I comment out the lazy, you can see it gives us the result. So this would be an example of what I might do when given a dataset: going through the columns, cleaning them up, and figuring out which ones I might or might not need.
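(Both renaming approaches, as a sketch; map_alias is the 0.17-era spelling — newer Polars uses .name.map:)

    # pipe + dict comprehension: rename sees the *current* columns at this point in the chain
    cleaned = df.pipe(
        lambda d: d.rename({name: name.replace(" ", "_") for name in d.columns})
    )

    # expression version: keeps the renaming inside the expression engine
    cleaned = df.select(pl.all().map_alias(lambda name: name.replace(" ", "_")))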
Task
52:04
Okay, we're at our first exercise. This one is for you. Your task is to: select just the impressions and engagements columns; then select all the columns except impressions and engagements, without using drop; and then use rename to rename impressions to imp and engagements to eng. I'll let you work on that for a little bit — if you have questions or need help, let me know. This gives you a chance to practice your Polars.

So there's a comment and a question. One of the comments is that selecting — being positive about the columns you want — is maybe more work, or annoying. A couple of things on that. I want to focus on writing code that's easy to read, not necessarily easy to write, and those can be at odds with each other, but I think being positive about which columns you have is slightly easier to read, because you're putting the emphasis on the columns you want. There are also cases, like in machine learning, where if you're building a pipeline and you're dropping columns, and the pipeline later changes so that new columns are added, you'll break your pipeline; being positive about the columns prevents that. The other part was: what's the best way to do that? Well, I'm kind of showing you my process. A lot of times when people see a chain — I write a lot of pandas code and I do chaining — they say "oh, that's a horrible chain," and what they don't see is that I built it up; I didn't start off with this huge chain, I built it up piece by piece, like what I'm doing here in this class. So will I use drop? Yeah, it might make sense to use drop, then find the resulting columns, take those, and then update my chain without the drop — that's what I would do. I'm not saying never use drop; I'd use it where it makes sense or where it helps me write my code.

Okay, should we look at some solutions? I'm happy to have you ask about yours if you want. We have our df here. The first one: select just the impressions and engagements columns. So I'm going to say df, and I'm going to put it in parentheses. Do you have to put parentheses around your chains? No, but the nice thing is that inside a parenthetical in Python you don't have to worry about whitespace rules, and I generally like to put each step on its own line so it reads like a recipe. So I'll say select, then pl.col with just impressions and engagements as a list — if I spell it correctly. I'm going to spell it incorrectly on purpose... and I get a column-not-found error.
Troubleshooting
55:36
One thing that I think is a little more of a challenge with Polars than with pandas: the error output does say "error originated just after this operation," but the actual exception is this column-not-found error, so you have to make sure you're reading it correctly — it's not at the very end, it's slightly above. But I have a typo there, so let me fix that. And that now works.

Next: select all the columns except impressions and engagements, without using drop. How would I do that? I should be able to say select, then pl.all, and then exclude — excluding impressions and engagements. Okay. And then the rename: I'll select those and then do a rename — we could pull up the documentation for rename, but I can pass in a mapping — so I take impressions and rename it to imp, and engagements and rename it to eng.
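(One possible solution, written as a sketch — the exact column-name capitalization depends on the dataset:)

    # 1. Just the two columns
    df.select(pl.col(["Impressions", "Engagements"]))

    # 2. Everything except those two, without drop
    df.select(pl.all().exclude(["Impressions", "Engagements"]))

    # 3. Rename them with a mapping
    (df
     .select(pl.col(["Impressions", "Engagements"]))
     .rename({"Impressions": "imp", "Engagements": "eng"})
    )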
Okay, what questions do you have? [Question about passing a function to rename.] So the question is: can we do a rename with a function — say, upper? It says "function has no attribute keys." If we look at the implementation on the lazy frame, here's rename, and it takes a mapping, so it looks like that will not work. Let's pull up the documentation — I'll say help(df.rename), since my Jupyter magic is broken — and yes, this just wants a dictionary; it doesn't look like it works with a function.

Other questions? The question is: what's the functional difference of adding the select? In this case the select right here is just saying "pull off these two columns." But maybe the question really is: can rename be lazy, and be taken into account in the query plan? You can just stick a lazy in there — and if I tried to do something like a pivot it would complain (lazy has no pivot, so you can't pivot; let's see, you can't describe either) — but in the context of Polars you should be fine doing a rename there. The select is just pulling off those two columns first; here's the result with the select, and here's the result without the select. Okay, other questions? Okay.
Okay, let's look at date columns. Here's the chain that I built up — again, I'm showing the process I would go through with a new dataset, and I wouldn't start off with this huge chain; I'm building it up and showing you my thought process as I go. In this case I have a column called time, and I want to convert it to a datetime, so I say cast(pl.Datetime). Let's do this — and I get a ComputeError. It says strict conversion from string to datetime failed for this value, and then it suggests considering strptime.

If you look at it, time is a string, and similar to pandas, strings have a str accessor on them — this is an idea taken from pandas — so in addition to the things you can do directly, strings get these additional methods, which mostly correlate with the functionality you'd find on a Python string. The issue here is that there's a plus sign in our date string, and apparently it doesn't like that plus when it's trying to convert. So I'm going to tease this apart: I'll split on that plus, and then do an apply to pull off the first value. If you split without the apply, this is kind of interesting: you get a list with strings in it. This is, I believe, a list type — not a Python list, an Arrow-style list — so it's pretty smart about what's going on in there. Personally, I try to avoid nested structures inside tabular data when I can, but there are cases where you have something nested, so being able to deal with it is nice functionality to have.
What I want to do is just pull off that first part. Let's look at what I can do with the split result — you can pull up the documentation — and one of the things in there is this arr namespace, which is different from pandas: because this is a list type, you have the ability to do array operations. We can inspect what you can do with an array, and these are the list-y things you'd do with a list, which is kind of cool. So I'm going to split it and then get the item at index zero — instead of the mechanism up above that applies a lambda, the Polars way is arr.get(0). If you apply a lambda, you're crossing that Rust/Python boundary, and you lose your speed benefit.

Okay, let's try this: pull off that first part and then cast it to a datetime. I still get an error — it failed, and it says if you're trying to cast a Utf8 string to a temporal type, consider using strptime. So let me take the hint and use strptime. Now my chain looks like this: take time, split it on the plus, take the left-hand side (which is a string), and then convert the string to a datetime — this is the type, and this is the format it's in. Let's try that, and it works.
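(A sketch of that parsing chain; the exact format string depends on what's in the export, so the one below is an assumption. On the 0.17-era Polars used in the talk the list namespace is .arr and the string-strip method is .str.strip; newer versions rename these to .list and .str.strip_chars:)

    df2 = df.with_columns(
        pl.col("time")
          .str.split("+")                               # "2021-05-01 12:34 +0000" -> ["2021-05-01 12:34 ", "0000"]
          .arr.get(0)                                   # keep the part before the "+"
          .str.strip()                                  # drop the trailing space
          .str.strptime(pl.Datetime, "%Y-%m-%d %H:%M")  # hypothetical format string
    )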
Again, if you're coming from pandas, this is maybe a little more involved — in pandas you could probably just throw it into to_datetime — but Polars wants us to be explicit, and if you're explicit about these datetime strings it makes them easier to parse; if this is implemented in Rust, then forcing us to do things that are easy from the Rust side makes sense. So my chain gets updated to something like this, where the time column is that chain of splitting and parsing. Let's run it and make sure it works — and time is now a datetime column. I'm storing this in df2.

Now, because I've converted it to a datetime, it also has a dt attribute. Similar to pandas — pandas has dt attributes and str attributes, though it doesn't have an arr attribute — there are various date manipulations we can do now that it's a datetime. For example, we can ask for the month, which is nice functionality to have: could you pull it out of the string with a regular expression? Sure, but this is super nice. Let's try converting the time zone; here's the documentation for that. I'll say replace_time_zone("UTC"), and now it's in UTC; then if I want to convert it to America/Denver — which is the time zone I tweeted from — I do something like convert it to UTC first and then convert that to America/Denver.
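(A sketch of those dt operations on the parsed column:)

    # Extract parts and shift time zones on the parsed datetime column.
    df2.select(pl.col("time").dt.month())               # month is a method here, not a property

    df2.with_columns(
        pl.col("time")
          .dt.replace_time_zone("UTC")                  # declare the naive times as UTC
          .dt.convert_time_zone("America/Denver")       # then convert to local time
          .alias("time_denver")
    )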
1:07:18
okay so um yeah
1:07:39
um so the comment is that uh this is micro set consists of nanoseconds versus
1:07:45
what it would be in pandas uh uh I don't know off the top of my head I I mean I
mean that I guess that is one
1:07:52
of the nice things about uh being precise about that stir P time is that you can
like tell it what it is
1:07:59
um but yeah I guess maybe the comment is more like if I've got like things
1:08:05
happening at some very small granularity maybe I'm doing particle physics or
1:08:10
something maybe this is not the appropriate mechanism to keep track of of the
granularity of that time
1:08:18
um yeah I don't I'll say I don't know um I haven't explored that or or am
1:08:25
a super uh well-versed in things that are very small points of time but that
1:08:32
might be something to consider okay um let's so let's do another exercise
1:08:37
here um so create a series with the months of the time column
1:08:43
um and then convert the time column to UTC and then convert the time column to
America New York so again just want to
1:08:51
give you some practice in addition to just running those cells of writing out some
code and doing that as well so let
1:08:58
you work on that for a little bit and if you have questions or need help let me
know and then I'm assuming we'll go
1:09:06
straight into our break since it's like 10:22 and I think our break is at
10:30 so I think we'll work
1:09:12
on this and then probably jump into our break okay hope everyone had a great break
1:09:19
um okay let's look at the the state exercise here it says create a series
1:09:25
with the months of the time columns so if we so if we look at this df2 we do
1:09:33
have that time column in there so I'm going to say df2
1:09:39
and uh we can say select to select columns
1:09:46
it's going to say select and then what column do I want so I'm going to say pl.col
1:09:51
and then I'm going to say time column name so this is the time column because that
is a date column it should
1:09:58
have a DT accessor right and I'm going to say select the month and if I say select
month like
1:10:04
this I think in pandas this is a property in this case it says it did not
1:10:10
expect a bound method so what does that mean month
1:10:17
in Python is an object right it's a method and apparently in Polars you can't put a
method inside of a data frame in
1:10:26
pandas you could actually but you can't do that here so we
1:10:32
actually need to call the method and that should be that I think there's
1:10:37
a month name as well let me just check here maybe it's month underscore name
1:10:45
um
1:10:54
okay there's not a month_name in there um yeah question
1:11:03
yeah if we wanted to persist it as a new column then we would use the with column
right so let's do that
1:11:12
so if instead of doing the selector I'd say with columns and if I did this
1:11:20
um whoops
1:11:26
um what this did was yeah this actually
1:11:34
overrode our time column so I might want to do something like this where I say
alias and say this is the month column
1:11:41
so that's that keeps the time column in there but we should have a month column
1:11:48
if I can scroll over here right there
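A sketch of that exercise solution, assuming df2 has the parsed datetime column named "time" (note that month is a method call in Polars, unlike the pandas property):

    # month as its own Series
    months = df2.select(pl.col("time").dt.month()).to_series()

    # or keep it next to the original data as a new column
    df2 = df2.with_columns(pl.col("time").dt.month().alias("month"))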
1:11:56
okay the next one is uh convert time column to UTC so since we're we're doing
1:12:03
this in a data frame I'll just do this in the same one here I'll say okay let's
make a Time column that's UTC which is
1:12:09
this replace time zone UTC right here
1:12:18
so I'm going to say pl.col("time").dt
1:12:25
replace time zone
1:12:33
and um now it's in UTC right there again if I
1:12:39
if I want that to be a new column I can say alias and then I can say this is the
UTC time
1:12:50
using there's UTC time okay and if I want to convert that to
1:12:57
America time so in pandas you could do something like this where we would say
pl.col and then
1:13:03
I could say like let's take UTC time and let's do it on that I want to do a DT dot
replace
1:13:10
not replace um that is not replaced it is convert time
1:13:16
zone so we do a convert time zone
1:13:21
and we'll see America New York
1:13:29
and when I do that I get an error here column not found so let's look at that it
says there is not a column called UTC
1:13:35
time so what's going on here it's saying you
1:13:41
should have a column called UTC time but I don't that's because I just created that
in in this with columns call here
1:13:48
so in pandas you can get around that by using a Lambda if you use a Lambda with
1:13:53
assign you can get around that let's see if we can get around it by just
1:13:58
doing another call to with columns
1:14:08
and I'm going to Alias this as NYC
1:14:15
okay so that looked like that worked um so if you want to refer to a column that
1:14:20
you just made
1:14:25
you'll have to do a new with columns call does that make sense this UTC column is
1:14:32
the result of this this previous with columns up here
1:14:43
alternatively if I wanted to I could just take this whole thing and Tack on this
1:14:49
onto the end of it and I wouldn't have to do a new call to with columns
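The two options just described might look like this, assuming a naive datetime column called "time" (stand-in data, not the notebook's):

    from datetime import datetime
    import polars as pl

    df2 = pl.DataFrame({"time": [datetime(2021, 4, 1, 12, 34)]})

    # option 1: two with_columns calls; the second call can see the utc_time column made by the first
    out = (df2
        .with_columns(pl.col("time").dt.replace_time_zone("UTC").alias("utc_time"))
        .with_columns(pl.col("utc_time").dt.convert_time_zone("America/New_York").alias("nyc"))
    )

    # option 2: one call, tacking the conversion onto the end of the same expression
    out = df2.with_columns(
        pl.col("time")
          .dt.replace_time_zone("UTC")
          .dt.convert_time_zone("America/New_York")
          .alias("nyc")
    )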
1:14:58
okay questions before I keep going
1:15:22
so the comment is um you could do head to get the first number
1:15:27
of rows but if you wanted to select different rows like the 20th to the 30th so
there is a filter so select lets us
1:15:36
select columns filter lets us select rows and so in that case you would need to
1:15:42
have some row that you could filter on right so if you wanted to do it by like
1:15:49
position like pandas has iloc right to get like the 20th row then you'd have
to make a column that has
1:15:55
like the row count in it and then do a filter on on that column
1:16:02
yeah so we don't have again we don't have the index there um so we'd have to use
something like
1:16:08
that
1:16:16
yeah it has I've got an example here that will show you how to like emulate the
index
1:16:22
down below and insert that in there okay
1:16:27
um so so after I've done this I feel pretty good about what I have I would go
1:16:32
out and make a function that looks something like this right and and so this is my
tweak_twitter function this
1:16:39
is taking the raw data and and cleaning it up and
1:16:44
again I would make a chain like this um I would do this in pandas as well in
1:16:49
polars it's kind of encouraged to do a chain in pandas you don't see that a lot but
again Polars because it can be
1:16:55
lazy um can do some smart things about this so so if this you know if I was working
1:17:00
for a client what I would do is I would take this code here and I would put this
right at the very top of my file right
1:17:06
next to where I load the data so what this does is it reads like a recipe whoops
I'm working with the raw data and
1:17:14
then I'm saying okay I'm going to convert these columns to integer eights
1:17:19
these two unsigned 16s this one to an unsigned 32. this time I'm going to
1:17:25
convert it to America time zone and then I'm going to select these columns and then
I'm going to rename the column names all right it looks like the series
1:17:32
of steps that I'm going to do let's just run that and make sure that it works and
it looks like it does
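The transcript doesn't show the function itself, but a tweak function along the lines being described might look roughly like this; the column names, dtypes, and CSV file name are guesses at the Twitter analytics export, not the actual notebook code:

    import polars as pl

    def tweak_twitter(df: pl.DataFrame) -> pl.DataFrame:
        return (df
            .rename({c: c.replace(" ", "_") for c in df.columns if " " in c})  # spaces -> underscores
            .with_columns([
                pl.col("replies").cast(pl.UInt16),
                pl.col("retweets").cast(pl.UInt16),
                pl.col("impressions").cast(pl.UInt32),
                pl.col("time")                                         # raw string timestamps
                  .str.split("+").arr.get(0)
                  .str.strptime(pl.Datetime, fmt="%Y-%m-%d %H:%M")
                  .dt.replace_time_zone("UTC")
                  .dt.convert_time_zone("America/Denver"),
            ])
            .select(["time", "impressions", "engagements",
                     "replies", "retweets", "likes"])
        )

    # twit_df = tweak_twitter(pl.read_csv("tweet_activity.csv"))   # hypothetical file name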
1:17:39
okay so this is also called chaining in pandas I'm a big proponent of
1:17:46
chaining a lot of people say that it's ugly again I think they don't see that like
I build up the chain as I'm going I
1:17:53
don't just like write a big huge chain I'm building it up as I'm testing it as I'm
going and I'm doing the same thing
1:17:59
here in polars now polars does have the added benefit that it does have this
laziness right where we can say let's be
1:18:06
lazy here at that case it's not going to materialize these operations until I call
collect at the end
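The lazy pattern being described, in a tiny self-contained form: .lazy() builds a query plan, the optimizer gets a look at it, and nothing materializes until .collect(). (pl.scan_csv, shown a bit further down, is the lazy counterpart of read_csv and gives you a LazyFrame straight from the file.)

    import polars as pl

    df = pl.DataFrame({"impressions": [100, 2500, 40], "replies": [0, 3, 1]})

    out = (df.lazy()                                               # LazyFrame: nothing runs yet
             .with_columns((pl.col("replies") * 2).alias("double_replies"))
             .filter(pl.col("impressions") > 50)
             .collect())                                           # optimize, then execute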
1:18:13
and so if there are uh it does have a query Optimizer so to speak like a
1:18:19
database and so it can go in and be smart about how it's doing these things but
this should give me the same result
1:18:25
as the one up above um but it should possibly be a little quicker
1:18:31
in fact let's just do some timing here and I'm going to time my lazy Twitter
1:18:36
against my non-lazy one
1:18:41
and we'll see what happens here
1:18:47
okay so so my lazy one took 1.3 milliseconds on my machine it's a MacBook Pro from
last year or whatever
1:18:55
my non-lazy one took 1.47 now again there aren't a lot of places
1:19:01
for optimization here so to speak I mean I'm just sort of pulling off columns and
converting them uh where the query
1:19:09
Optimizer can't really do a lot of optimizations here now what I'm
going to do in this example is
1:19:15
I'm going to actually add file reading into this so here instead of passing in my
data
1:19:22
frame I'm going to pass in a file name and I'm going to say scan CSV scan CSV is
lazy so it's not going to read the
1:19:29
CSV it's just going to say I have this CSV file up here I don't think I actually
need this lazy down here
1:19:36
because I think scan CSV is lazy oh it says I don't have file
1:19:42
um let me get file here
1:19:49
so I I have a local file right here okay so
1:19:55
um so if you don't have that I don't believe that works with URLs so you do need
you need to have that local if you
1:20:00
want to try that out um so let let me scroll back down here
1:20:10
okay and that looks like it works again if if you leave off that collect there
1:20:15
at the end it's going to give you this query plan of what it's going to do so again
there's not a lot of like fan out
1:20:21
or group bys where it can do a lot more um operations here but let's just time
1:20:28
this so right now without collecting it it takes 200 microseconds if we do collect
1:20:36
it let's time it like this and it takes 2.7 milliseconds to run now
1:20:44
this is including reading the file right so that this is not just doing the
1:20:49
operations this is reading the the CSV file so it takes 2.7 milliseconds to run
1:20:55
um I'm just gonna down here I've got a pandas comparison
1:21:03
and um I'm gonna I'm gonna read my pandas file and um do this so in pandas this
takes
1:21:12
38 milliseconds in in polars um this takes 2.7
1:21:17
milliseconds so quite a bit faster here again this is milliseconds so it's not a
huge deal
1:21:24
like 34 milliseconds versus two milliseconds is not a huge deal but if your data
gets
1:21:31
bigger or you have more complicated operations you can see where you might be
getting some some benefits for this
1:21:38
okay so this example is just showing the plan here and uh
1:21:45
this is what it can go through and optimize after this
1:21:50
okay so uh one of the things that people uh complain about chains is they
1:21:56
complain about that you can't debug them so I think we'll do this exercise together
this exercise is use pipe to
1:22:03
print the shape of the data frame after every step in the chain in the
tweak_twitter function so a lot of people when
1:22:08
they see like a function like this they're like I can't understand what's going on
here
1:22:14
um which whatever uh I get that like it might be overwhelming
1:22:20
but once you like learn a little bit of Polars it's it's not too bad I I sort
1:22:25
of feel the same way about pandas um but but what I want to do is I want to use a
pipe here and I want to debug
1:22:31
this so one of the one of the things that people claim is that they want to look at
the intermediate variables and I
1:22:37
don't personally want to do that I want to have my input and my output and I don't
care about what's made along the
1:22:43
way but let's assume that you did or maybe your data change and you need to look at
that and so one thing that you
1:22:48
could do is something like this where you can say like def and let's say a debug
and we're going to take a data
1:22:54
frame and maybe we'll take a extra and what I'll do is I'll just say print
1:23:00
I'll use an F string here and we'll say um extra
1:23:08
and then we'll say df.shape
1:23:15
and then we'll return the data frame okay so what I'm going to do is I'm just going
to put this right in here
1:23:22
and I'm going to say dot pipe and I'm going to say debug is the function I want to
call I'm going
1:23:29
to say extra is equal to start okay let me just run this
1:23:35
and you can see that now it's putting out start there at the top
1:23:41
okay so so what I can do is I can just so this is kind of nice especially if
1:23:46
you're doing like joins or whatever I can say like okay after I've updated my
1:23:52
columns let's look at the size update columns and then um
1:24:00
here I've selected my columns and the last one is a rename it's not
1:24:05
going to do anything but you can imagine if I had a filter or whatever it might do
something um
1:24:10
so there you can see that I'm starting off with 5791 rows and 40 columns after I
update
1:24:18
my columns I still have 40 columns and then after I select my columns I have 22
columns but something like this is
1:24:24
especially nice if you're doing a join you can see if you do like a combinatoric
explosion and there's something to catch that
1:24:30
scroll back up there so you can see that code again all I'm doing is making a
function that takes the data frame as
1:24:36
the first input and it's returning a data frame that allows me to stick it inside
of one of these chains here and
1:24:43
then I'm just having a side effect of printing that out on the side there
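A sketch of the debug-with-pipe trick, using a stand-in data frame; any function that takes a data frame first and returns one can be dropped into the chain with .pipe:

    import polars as pl

    raw_df = pl.DataFrame({"time": ["2021-04-01 12:34+0000"], "impressions": [120]})  # stand-in

    def debug(df: pl.DataFrame, extra: str = "") -> pl.DataFrame:
        # side effect only: print a label and the current shape, then hand the frame back
        print(f"{extra} {df.shape}")
        return df

    checked = (raw_df
        .pipe(debug, extra="start")
        .with_columns(pl.col("impressions").cast(pl.UInt32))
        .pipe(debug, extra="after update columns")
        .select(["impressions"])
        .pipe(debug, extra="after select")
    )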
1:24:48
any questions about that or comments
1:24:58
okay I mean if you wanted to you could go beyond this you could say uh let's do the
same thing but instead of printing
1:25:04
out to the screen I'm going to make a global variable and update the global
variable with the current state of the data frame if you wanted that current
1:25:09
state of the data frame to inspect it
1:25:20
um the question is could you do that inside of the with columns um
1:25:27
yeah I don't know that that would work inside I don't I don't know that with
columns will take a callable to do that
1:25:34
um so that wouldn't really work but uh
1:25:43
yeah I guess you just sort of have to treat with columns as sort of its own box of
of operations
1:25:54
okay so we did this pandas comparison here um again one thing to be aware of is
1:26:00
like this would be like my pandas code to do what I'm doing um
1:26:07
you know is which code is better I
1:26:12
I'm just honestly more familiar with pandas I've used it a lot more right so uh to
me uh this code is is perfectly
1:26:20
fine to a lot of people like it is doing some things like it's doing some
dictionary comprehensions with unpacking
1:26:26
in there which a lot of people aren't familiar with we are using like
lambdas in here which in pandas I'm
1:26:32
perfectly fine with but I mean I can read this as a recipe I'm going to rename my
columns I'm going to replace spaces with underscores I'm going to
1:26:40
drop these columns in here I'm going I'm going to also drop these other
1:26:47
columns and these are the promoted columns so this is just me not having to
1:26:52
write out promoted in there and so I'm using a list comprehension to drop out those
promoted columns again if I wanted
1:26:58
to go through that positive thing I might do this and then select my columns and go
in and say positively select
1:27:04
these columns and then I'm converting my Impressions column then converted my
engagements columns
1:27:09
all of these columns I'm converting to unsigned integers all of these columns I'm
converting to unsigned
1:27:16
integers as well this tweet permalink here I'm converting
1:27:21
that to a category which I didn't do in polars and this one here is the date
1:27:27
time conversion so I think it's a little
1:27:33
a little bit less code or at least a little bit less vertical lines
1:27:43
a bit faster in Polars
1:27:48
okay so let's look at uh doing um some some manipulation of columns
1:27:54
here so here is my twit DF here let's
1:27:59
assume that I want like an engagement rate as a percentage so what I'm going
1:28:04
to say is I'm going to say take my engagement rate column and then let's apply this
function right here the
1:28:10
percentage function and we'll call that the engagement percent and if we do that
1:28:16
we now have this engagement percent right here right so what's the engagement
percentage of that let's just
1:28:23
do a benchmark of that and then I'm going to do a benchmark of this where I'm
taking the expression and
1:28:29
multiplying the expression by 100.
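The two versions being benchmarked, roughly; the column name is a guess, and apply has since been renamed map_elements in newer Polars releases:

    import polars as pl

    twit_df = pl.DataFrame({"engagement_rate": [0.012, 0.034]})   # stand-in values

    # apply: one Python call per value, crossing the Rust/Python boundary each time
    slow = twit_df.with_columns(
        pl.col("engagement_rate").apply(lambda v: v * 100).alias("engagement_pct")
    )

    # expression: the multiplication stays inside Polars
    fast = twit_df.with_columns(
        (pl.col("engagement_rate") * 100).alias("engagement_pct")
    )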
1:28:35
okay and you see that we've got a pretty big performance difference here so uh
1:28:41
top one here 500 microseconds uh this one here 55 microseconds again this is
microseconds
1:28:48
but um it is 10 times difference um it will that pan out to larger data
1:28:56
sets maybe maybe not again general benchmark advice is to benchmark against the
size of the data that
1:29:03
you use one of the things to be aware of is this
1:29:08
apply when you're doing apply that's element wise so you're pulling off the
individual values you're Crossing that
1:29:14
rust python boundary when you do that so here I'm I'm saying
1:29:19
um let's do apply with np.sin polars does have
1:29:26
a map and so map is column wise and so you can see the difference there of np.sin
versus map of np.sin applying
1:29:34
np.sin versus map of it it's like uh two milliseconds versus 200 microseconds
1:29:41
about 10 times faster there for this data set and again unlike pandas polars
actually
1:29:49
has sign in it itself so you don't have to do any of this you can just say DOT sign
there to get to get the sign of
1:29:55
that um just for fun I'll import the math library and look at calculating uh the
1:30:01
python math sign function here you can see that polars is going to be faster
1:30:07
than using numpy and um doing math.sign in this case
1:30:13
um math.sign is faster than using numpy
1:30:19
I'd have not quite sure why that is but um
1:30:27
yeah probably probably because it's well yeah
1:30:33
I would think that converting to numpy would be faster than converting to
1:30:38
Python and applying math.s sine to it but apparently something's going on there
okay so another thing you might want to do is manipulate your string
1:30:44
columns right and so um here I'm saying I want to determine
1:30:50
whether a string a text a tweet is a response and so a response tweet to
1:30:56
someone else would start with the at sign and so in pandas you can do this
1:31:01
a .str.startswith but in Polars there is not a startswith um so like
1:31:08
python has starts with let's look at the attributes of Str and
1:31:14
for whatever reason they named it starts_with with an underscore which it is what it is I
mean is like
1:31:22
startswith without an underscore there the world's best naming probably not but
1:31:27
I mean it's the convention in Python and pandas is sort of using it
1:31:32
um so what whatever I mean it is what it is so some of those things might not be
exactly the same so let's do starts with
1:31:39
that and let's we'll just do a timing here of that and we're going to compare
1:31:45
that with an apply where we're pulling out the value and sticking it into a python
string and doing that starts with
1:31:51
there
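Roughly the two things being timed, with a made-up text column name:

    import polars as pl

    twit_df = pl.DataFrame({"text": ["@matt thanks!", "polars is fast"]})   # stand-in tweets

    # expression: note that Polars spells it starts_with, with an underscore
    expr_way = twit_df.with_columns(
        pl.col("text").str.starts_with("@").alias("is_reply")
    )

    # apply: each value becomes a Python str and we call its startswith
    apply_way = twit_df.with_columns(
        pl.col("text").apply(lambda s: s.startswith("@")).alias("is_reply")
    )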
1:31:59
so generally in pandas my advice is avoid apply however if you're using
1:32:05
strings in pandas and you're not using like the PyArrow strings you're going down a
slow path
1:32:11
anyway generally in pandas and so apply might make sense for strings in pandas
1:32:16
um in Polars here we can see that a lot of these things are optimized
1:32:23
and so you get quite a big speed performance here by doing this with an expression
if you can do it with an
1:32:30
expression okay so um
1:32:37
I I might update my tweak Twitter to look something like this now if I'm going to
do like some machine learning
1:32:42
or some further analysis on this I might want to look at my tweets and pull off
various features from that right so so
1:32:48
here I'm not going over all these but I'm doing things like um I'm going to make an
is reply column
1:32:53
I'm going to make a length column that's just taking str.lengths I'm not
sure why it's spelled lengths
1:33:00
but that's that's the naming there um and then I'm going to make an whether
1:33:08
it's an ad tweet whether it has new lines whether it has the number of lines in
there and so you say anything like stir dot contains straight out contains
1:33:15
str.count_match str.contains
1:33:21
okay and then I'm also going to put some other columns here for for time I'm going
to get like the hour when I
1:33:28
tweeted the day of the month and the day of the week just to see if like if I'm
doing some machine learning or further
1:33:34
analysis whether those have an impact on on what's going on there okay and and
1:33:39
you know I can run this and and inspect this and it looks like this is working
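A sketch of the kind of feature columns being described; the column names are assumptions, and a couple of the string methods have been renamed in newer Polars releases (str.lengths is now str.len_chars, str.count_match is now str.count_matches):

    from datetime import datetime
    import polars as pl

    twit_df = pl.DataFrame({                                   # stand-in columns
        "text": ["@matt thanks!", "two\nlines here"],
        "time": [datetime(2021, 4, 1, 12, 34), datetime(2021, 7, 2, 8, 15)],
    })

    features = twit_df.with_columns([
        pl.col("text").str.starts_with("@").alias("is_reply"),
        pl.col("text").str.lengths().alias("length"),
        pl.col("text").str.contains("\n").alias("has_newline"),
        pl.col("text").str.count_match("\n").alias("num_newlines"),
        pl.col("time").dt.hour().alias("hour"),
        pl.col("time").dt.day().alias("day_of_month"),
        pl.col("time").dt.weekday().alias("day_of_week"),
    ])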
1:33:45
okay so I want to give you the chance to work with apply here so we're going to do
this calculate the engagement ratio
1:33:52
by dividing engagements by impressions and then calculate engagement ratio 2 by
1:33:59
dividing the sum of replies retweets likes user profile clicks and detail
1:34:05
expands by impressions so I'll let you work on that for a
1:34:10
moment and then we'll go over that I'm going to give a I'm going to give a hint for
this first one that might just sort
1:34:19
of help I'm going to say so so so um
1:34:26
I have like this twit DF here which is has gone through my tweak DF so
1:34:32
so I do have this twit DF here and I want to make a new column so I'm not
1:34:37
going to make this ratio one but but to make a new column I say with columns okay
and then let's say instead of doing
1:34:45
the ratio I'm just going to say I want to uh maybe I had a good day so I'm going to
1:34:50
have the double engagements column right so how would I do that I would say col I
want to make an expression and and so
1:34:56
pl.col right and then um we'll take engagements
1:35:04
and if I want to double it I want to multiply it by 2. and so these Expressions
respond to python operators
1:35:11
like this and um and says incomplete input it wants that
1:35:17
parenthesis there so this is engagements multiplied by two and if I wanted to name
this as like
1:35:23
double engagements um I I would do like an alias here
1:35:28
so alias I'll just call it double_eng now that doesn't work because it's like
1:35:35
int doesn't have Alias that's because of operator precedence here so what I need to
do is just wrap this inside of a
1:35:42
parenthetical here say do the multiplication which should return another polar's
expression
1:35:49
and then once I've done that Alias that expression
1:35:55
so now you can see that there's the double Eng right there
1:36:02
does that does that kind of give you an idea or some boilerplate for what I might
want to do for doing that
1:36:13
oh yeah um well my mine have underscores in them
1:36:19
because I'm using twit DF
1:36:25
but yeah yours might not so so so yeah you can do it either way okay so
1:36:32
hopefully now again I I like to have the exercises here because it's one thing to
1:36:37
watch me talk about it and it's like oh it goes in one ear and sort of makes sense
but then when you actually do it
1:36:43
that's like where it goes into muscle memory right and and science tells us that if
you actually do the exercise
1:36:49
you'll learn it a lot better because you're going to understand it better um okay
so I I'm just gonna do this here
1:36:58
so I've got with columns here so I think I can put in a a bracket here and I'm just
going to tack on these other ones
1:37:04
in here so I want um to divide engagements by impressions
1:37:10
so I'm going to do that here um I'm gonna just come in here and say take
1:37:16
engagements and divide it by pl.col impressions
1:37:25
okay so that looks like that worked um
1:37:32
um it turns out that because I guess this is polars just takes it looks like it's
just taking the First Column here
1:37:39
and using that as the name so it looks like it has updated the engagements with
that so I don't want it to update that
1:37:45
so again same thing here I'm going to come in here wrap this with parentheses so
that I do that whole expression and
1:37:52
then I want to Alias the result of that I guess alternatively I could come in here
and say like Alias this as the
1:37:59
ratio in here but I I kind of like to keep that math separate like do the math
1:38:04
and then do the Alias so we could call this like the Eng ratio if we wanted to
1:38:10
and now we we should have that over here on the side
1:38:16
okay if we wanted that to be a percentage we could we could multiply that by 100 to
make it a percentage
1:38:22
the next one is calculate it by summing up these columns and then dividing that
1:38:27
by Impressions so this is just another example um you know do I need to use a list
1:38:33
comprehension I may be um
1:38:38
there there is a way to do um a sum
1:38:46
column wise I don't I don't recall off the top of my head what that syntax is so
I'm just gonna I'm just gonna
1:38:53
manually write this out so we're going to take replies and we're
1:38:59
going to add pl.col retweets
1:39:06
and we're going to add pl.col likes
1:39:14
and because I'm inside a parenthetical here I'll just go to the next line and we'll
add pl.col
1:39:19
user profile clicks and we will add pl.col detail
1:39:28
expands so so this it should be according to Twitter's definition at
1:39:33
some point in time this should be the same thing as engagements it's summing up
these columns here
1:39:38
um so so I I could come in here and say like Alias this
1:39:47
this is an eng_ratio2 this should be the same as
1:39:54
engagement so engagements is like seven and three so it should be like seven and
three over here so it looks like it looks like that is matching so again
1:40:01
showing you how I like would build this up and like test it as I'm going and then
we want to divide this by The
1:40:07
Impressions here so we've got that sum we want to do the sum first and then I'm
1:40:13
not sure if there's a divide method let's just try it and see there might there
might be one there might not be
1:40:18
we'll divide it by pl.col impressions
1:40:31
okay and it's claiming that parentheses are off so let's do that it looks like that
doesn't work so I'm going to
1:40:38
actually come in here and
1:40:43
put that there um so
1:40:49
I'm not seeing eng_ratio2 it's probably because I need I think I need an
extra
1:40:55
parentheses around this right here
1:41:03
okay there it goes it looks like those are the same so so um
1:41:09
looks like we're good there okay so the point of this is that um
1:41:16
similar to pandas um if you can you want to do those math operations directly at
like the pandas
1:41:23
level or the polars level keeps it fast rather than jumping to like an apply which
might be uh maybe I would say the
1:41:30
naive way to do it but is going to be slow
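Putting the two exercise answers together, as a sketch with made-up rows and assumed (underscored) column names:

    import polars as pl

    twit_df = pl.DataFrame({                 # stand-in rows; the real frame comes from the tweak function
        "impressions": [1000, 250],
        "engagements": [70, 3],
        "replies": [2, 0], "retweets": [5, 1], "likes": [50, 2],
        "user_profile_clicks": [8, 0], "detail_expands": [5, 0],
    })

    twit_df = twit_df.with_columns([
        # ratio 1: engagements divided by impressions
        (pl.col("engagements") / pl.col("impressions")).alias("eng_ratio"),
        # ratio 2: sum the individual engagement columns, then divide by impressions
        ((pl.col("replies") + pl.col("retweets") + pl.col("likes")
          + pl.col("user_profile_clicks") + pl.col("detail_expands"))
         / pl.col("impressions")).alias("eng_ratio2"),
    ])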
1:41:36
okay um I think without our remaining time we're
1:41:41
going to look at aggregation and aggregation is I like to say aggregation
1:41:46
is what your boss wants right so uh you know let's
1:41:54
it was spring break the other day and we went to a candy store in Jackson Hole it
1:41:59
was like a tourist trap but it's like just a bunch of candy right and like weird
candies and whatever and I imagine
1:42:05
that the boss of that candy store doesn't come in and say uh tell me what happened
today and someone comes in and
1:42:11
says well Susie came in and bought one lollipop and two two sticks of gum and then
a gummy bear and then Billy came in
1:42:18
your boss doesn't want that right they want to say 58 people came in they spent 2
368 dollars right so we're going to
1:42:25
take all of that information and collapse it to a single value generally when your
boss wants something they want
1:42:30
some sort of aggregate and so that's what I like to think when you're thinking
reports generally you're
1:42:35
thinking some sort of aggregate right I mean some people say they they want the
details they don't really want the
1:42:42
details generally they want the details to like make sure that you did your
calculations correctly or something like
1:42:47
that or dive into that but really they want these aggregations so now let's look at
some of these
1:42:53
aggregations here um and how we might do that and generally how you're going to do
that is
1:42:58
either with Group by or with pivot okay so here I'm going to say let's take
1:43:03
my Twitter data and then I'm going to say I want to group by and look at this
1:43:09
I'm putting an expression in here so I'm taking the time column pulling off the year
and then aliasing that as year and
1:43:16
then my aggregation function here is going to be take the mean of that
1:43:21
okay and and this is a result here now in pandas you would see the year in the
1:43:27
index um here we don't see that but we are seeing uh that we have that Alias year
1:43:33
um let me just for fun take off the alias
1:43:39
and we get an error here it's a duplicate error right and so why is it
1:43:44
complaining about duplicate error because we're using the time column and
apparently
1:43:49
um it has more than one occurrence which is a no-no in Polars land so so we want
to stick that alias in there
1:43:56
so we don't get that so it's going to stick that instead of making an index it's
going to make a new column called
1:44:01
year and then we have for every uh entry there in in this case it is including
1:44:08
strings as well it's giving us the mean value
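A sketch of grouping by an expression the way it's being described here; groupby is spelled group_by in newer Polars releases, and the stand-in rows are made up:

    from datetime import datetime
    import polars as pl

    twit_df = pl.DataFrame({
        "time": [datetime(2020, 6, 1), datetime(2021, 4, 9), datetime(2021, 7, 2)],
        "impressions": [900, 2500, 4100],
        "replies": [0, 3, 5],
    })

    by_year = (twit_df
        .groupby(pl.col("time").dt.year().alias("year"))
        .agg([pl.col("impressions").mean(),
              pl.col("replies").mean()]))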
1:44:15
and I I should note as well this this is a place where um
1:44:21
I you can get improvements uh kind of two places in polars one is that
1:44:27
Polars when you do grouping Polars can maximize those CPUs right so if you have
1:44:32
multiple groups it can do parallel operations on each of those groups because
they're independent
1:44:39
um and then if you're doing a lazy operation it can further go through and
1:44:45
figure out if there are other optimizations that it can make before it does the
grouping if you're doing lazy so this this aggregation this reporting
1:44:52
is where you're going to see probably your most sort of bang for your buck if you
if using Polars for performance is
1:44:59
kind of what you're looking for or if that's your motivation for using Polars
1:45:06
okay so so there's a question how do I get um the index in there or or simulate the
1:45:12
index and so there is this with row count so let me just show you that I'm
1:45:17
just going to comment out these other things here you see that it sticks in this
thing called row_nr
1:45:23
okay so you can think of that if you're used to pandas as
1:45:29
an index here um so we're going to group by here Group by
1:45:35
just like in pandas is lazy it doesn't do anything until I do an aggregation and
here in this case I'm going to call
1:45:41
Ag and in this case I'm passing in a list of expressions right and so this is
another place where
1:45:48
Polars can be smart about this because we're passing multiple things it can say oh
I'm going to do these at the same
1:45:53
time if I have the CPUs to do them so the in this case it's saying with the
1:45:59
Impressions take the mean value and with this number row take the first entry for
1:46:04
that okay so you see that there's a number row of the first entry for the 2021 year
1:46:12
was uh this uh value right there so the question going back to that
1:46:18
question like if I wanted to see row 20 or something like that uh like 20
1:46:24
through 30. I might do something like this
1:46:30
or I might say okay let's do a filter here filter lets us filter rows and then we
can say pl.col and this should
1:46:39
be row_nr and I should be able to say greater than or equal to 20.
1:46:48
and I don't know if this if this works this way um
1:46:57
I don't know if it works like this it might let's see if it does
1:47:05
okay so my suspicion is that the truthiness of the expression is biting us here
1:47:10
um okay so so this is doing less than 30. so so
1:47:17
it wants us to combine it it wants us to do something like this um
1:47:23
less than 30 and pl.col row_nr
1:47:30
greater than or equal to 20. okay so there's rows uh 20 through 30.
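So the row-by-position emulation, put together, looks something like this (with_row_count is spelled with_row_index in newer Polars releases):

    import polars as pl

    twit_df = pl.DataFrame({"impressions": list(range(100))})   # stand-in rows

    rows_20_to_29 = (twit_df
        .with_row_count()                                        # adds a row_nr column
        .filter((pl.col("row_nr") >= 20) & (pl.col("row_nr") < 30)))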
1:47:39
so a little a little bit more uh verbose than what you'd see using like a pandas
1:47:44
or a pandas iloc to get that
1:47:53
okay so let's go to our next example here we're grouping by year and we're
1:47:58
getting the mean Impressions and the mean replies by year so we're going to say
like how how are we doing by each
1:48:03
year and we see that you know for 2020 a thousand Impressions on average in 2021
1:48:11
we're getting three thousand and then we can also see that like our replies went
1:48:17
up a bit per se as well now again um
1:48:23
because we are doing these different aggregations this this is a place where where
polars can be really quick
1:48:29
especially relative to just uh pandas pandas the library
1:48:37
okay so let's assume that we want to do uh instead of just by looking by year we
1:48:42
want to break this down into finer granularity let's do a month and year okay and so
it turns out that you can
1:48:47
pass in a list of Expressions into the group by here so I'm going to say I want to
group by year and then I also want to
1:48:52
group by month in year and so let's run that and we get something that looks like
this
1:48:58
now if you look at this um the result here might be a little not satisfying here
because it's going from
1:49:05
like April of 2021 to August and then back to June of 2020 so the results are
1:49:13
not sorted and that's due to how polars is doing this by default it's doing what's
called
1:49:19
a hash join a hash is unordered but it's fast and so it gets that done quickly
1:49:25
but it's not necessarily ordered here so we might need to do some further
operations so we might say like let's
1:49:31
sort this by year and month again we don't have an index so we're just going to
sort it by The Columns and we get
1:49:36
something that looks like that that looks like that's that's decent okay another
option here
1:49:44
is to pass in this maintain order okay and so we'll pass in maintain order
1:49:50
here and that looks like that works as well as I was just telling the grouping hey
1:49:57
um I want you to maintain the order and uh again according to Ritchie the
1:50:03
creator of Polars um this is preferable to calling sort
1:50:08
apparently there are some optimizations that can happen by knowing this at this
1:50:14
point rather than just calling a generic sort after the fact um
1:50:23
also there is a fast path in Polars so if you call sort on a column
1:50:29
um Polars will know that it's sorted and it can do further operations and you can
if you think about that I mean if you've
1:50:35
done a data structures class if you have sorted data it's easier to merge sort of
data than data that's not sorted and so
1:50:42
uh it can take advantage of that so here we're going to say I want to sort this but
um I'm also going to do this I'm
1:50:50
going to say let's make a a column called a month but the column is going to be the
year and the dash with the
1:50:57
month in it so rather than having like this one up here that had year and month as
columns rather than having two columns I
1:51:03
just want a single column that represents that and so I'm going to do that and
we'll get a result that looks like
1:51:09
this and here month is this string right here so that's looking okay as well that
1:51:14
might be preferable uh now if you've used pandas you might be
1:51:20
familiar with pandas has the notion of an offset alias anyone familiar with that
offset aliases
1:51:26
where you if you have a column that is a date you can say I want to uh resample
1:51:33
this at the quarter level and it will take everything from the same quarter and
then you can do an aggregate on the
1:51:38
same quarter or you can say I want to do everything from the same week and it will
give you each week
1:51:43
um Polars has something similar it's called groupby_dynamic and so you can do
something like this
1:51:50
where it's kind of interesting you say index column even though there isn't really
an index in Polars but you say index_column so we're going to use the time
1:51:56
column and then I'm going to say every and this is 2mo this stands for two months
so let's
1:52:03
run this and if you look at this in this time column now what you have is instead
1:52:09
of having a string like we had in that different in that previous one we now have a
date time
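A sketch of the dynamic grouping being shown, with made-up rows; groupby_dynamic is spelled group_by_dynamic in newer Polars releases, and it expects the index column to be sorted:

    from datetime import datetime
    import polars as pl

    twit_df = pl.DataFrame({
        "time": [datetime(2021, 1, 5), datetime(2021, 2, 20), datetime(2021, 4, 2)],
        "impressions": [900, 2500, 4100],
        "replies": [0, 3, 5],
    })

    every_two_months = (twit_df
        .sort("time")
        .groupby_dynamic(index_column="time", every="2mo")       # one bucket every two months
        .agg([pl.col("impressions").mean(),
              pl.col("replies").mean()]))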
1:52:14
okay and if you look at the date times here this is January 1st this is March 1st
May 1st Etc so this has done the
1:52:23
aggregation using date manipulation if you think about this this actually isn't
1:52:29
is just doing an aggregation based on these strings that I created but this is a
little bit
1:52:35
more uh intelligent down here okay so you can do crazy things like this look at
this I'm going to do every
1:52:42
and I'm going to do 2W that stands for every two weeks and then gonna do every five
three days
1:52:48
and every five hours kind of silly but um the point is is that like this is kind of
powerful too
1:52:55
right if you wanted to do aggregations um and by different frequencies you can do
1:53:01
that and if you look at the date the first one here is January 1st um and it's like
2044 and this next one
1:53:08
should be like two weeks later and three days after that um
1:53:16
and it doesn't look like that's two weeks later so that apparently there's a bug
1:53:21
in here um so maybe we found a bug
1:53:27
okay or maybe I have a bug maybe my two weeks is not right this looks like three
days so so maybe my my weeks is messed
1:53:35
up um let me just just for fun here
1:53:42
every and I'm going to say 2W
1:53:50
okay so so apparently it doesn't like combining uh
1:53:55
dates and weeks and apparently the uh the dates supersede the week entry
1:54:03
there okay so so I guess if we wanted it every two
1:54:10
weeks and three days right we could say two weeks is 14 days and we can make the
1:54:15
17 right which which should do uh what we wanted there
1:54:24
okay so so oftentimes when I have this right I mean I think this is a you know is
this the best information ever I mean
1:54:30
it might be right but I might want to plot this as well right and plotting is
another way to visualize it I'm a huge fan of plotting
1:54:37
um sadly Polars or maybe not sadly depending on your point of view Polars doesn't
have plotting in it
1:54:43
so if you want to do plotting you need to use some other library um
1:54:49
one of the reasons I really like pandas is because I think I I think like using
pandas lets me create plots and
1:54:56
matplotlib a lot easier than matplotlib itself and um actually I think once you
understand
1:55:02
the interface of matplot of polars or sorry once you understand the interface of
how pandas makes plots it's really
1:55:09
easy especially if you're using an aggregation to go from an aggregation to a plot
and so I might want to look at
1:55:15
like a Time plot or a line plot of this data and what would I want I would want
this time to be in the index and then I
1:55:22
want Impressions and replies to be a little line for each of those columns here
really easy to do in pandas so let
1:55:30
me just uh show you how we might do that we can actually say
1:55:36
uh send this over to pandas and so you say to_pandas this is now not a Polars
1:55:41
data frame this is a pandas data frame and in pandas if you're doing a line
1:55:47
plot by default it's going to plot the index and the x-axis so I'm going to say
stick that time into the x-axis
1:55:55
and now that is the index you can see it's bold there and I if I just call plot
here it's going to plot the index
1:56:03
and the x-axis and each column will be its own line okay so here is my plot from
that
1:56:10
um maybe I come in here and say figsize is equal to and maybe just make this a
1:56:17
little bit smaller eight by three okay so so that's every two months here
1:56:22
but if I wanted to do like every uh three days there's every three days
1:56:27
right or or every uh seven days
1:56:35
so this makes it really easy to see that like in July of 2021 something happened
1:56:41
there were Impressions like shot up right
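The hop over to pandas for plotting looks roughly like this (it needs pyarrow for to_pandas and matplotlib for .plot; the aggregate frame here is a stand-in):

    from datetime import datetime
    import polars as pl

    agg_df = pl.DataFrame({                       # stand-in for the groupby_dynamic result
        "time": [datetime(2021, 1, 1), datetime(2021, 3, 1), datetime(2021, 5, 1)],
        "impressions": [1000.0, 3000.0, 2500.0],
        "replies": [1.0, 4.0, 2.0],
    })

    ax = (agg_df
        .to_pandas()              # Polars has no plotting, so hand off to pandas
        .set_index("time")        # pandas plots the index on the x-axis
        .plot(figsize=(8, 3)))    # one line per remaining column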
1:56:50
okay and and so this is just showing what I just showed I'm just changing the
frequency there uh so you can change it
1:56:56
to kind of play around with that but super powerful I mean once you sort of
understand this this uh every uh
1:57:04
parameter here again pandas calls this an offset Alias um I'm actually going to
pull up the
1:57:09
documentation here one one thing I don't like about pandas is that the offset Alias
isn't documented in the in in
1:57:16
documentation that you can access from Jupyter let's see if that's the case here
1:57:21
um
1:57:29
okay yeah so here's the offset arguments right there
1:57:36
cool so so kudos to polars for actually including that documentation uh in in the
library
1:57:43
itself that you can access from the library okay let's let's try and do a pivot
1:57:50
there's also a pivot um in pandas pivot is basically built on top of group by
1:57:56
um so here is uh there's a pivot and this looks pretty similar to pandas you
1:58:02
say the values you say what goes in the index again there's no index in Polars but
that's basically what you want to
1:58:07
group by um so so if you're not familiar with pandas that might be a little bit
weird but basically the index is what you want
1:58:14
to group by and then you can say what columns you want to be in the output and
1:58:19
then you can specify an aggregation function okay so so
1:58:25
um one of the things that's sad about this
1:58:31
is is that or what I kind of miss is that we can't use a dynamic there's no
1:58:37
Dynamic pivot per se so if I'm making a pivot table and I wanted to do like
dynamically by every
1:58:44
three days and four hours I can't do that with a pivot with the pivot function or
method itself so I would
1:58:52
need to come in here and do something like this where I derive that time column and then
with that new column tell it to
1:58:59
aggregate by that new column
1:59:08
okay so here what I'm doing is I'm just saying okay I'm going to make a new time
column which is year and month I'm going
1:59:14
to Pivot based on the month I'm going to pull out the values for Impressions and
1:59:21
then I want columns to be whether it's a reply or not so this is uh what is this
telling me this is telling me do I get
1:59:27
more Impressions if if something is is a reply or if it's not a reply and so
1:59:33
we've got like a true false column true means that something is a reply false means
that it's not a reply and it looks
1:59:40
like in general if it's not a reply I get more impressions which makes sense
Twitter doesn't really show your replies
1:59:47
to everyone only if someone is is following both of those people so that would make
sense
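A sketch of the pivot being described, with made-up rows; the argument names have shifted between Polars versions (older releases use aggregate_fn, and the newest rename columns to on), so treat the keyword spellings as approximate:

    from datetime import datetime
    import polars as pl

    twit_df = pl.DataFrame({
        "time": [datetime(2021, 4, 1), datetime(2021, 4, 9), datetime(2021, 5, 2)],
        "impressions": [900, 2500, 4100],
        "is_reply": [True, False, False],
    })

    pivoted = (twit_df
        .with_columns(pl.col("time").dt.strftime("%Y-%m").alias("month"))
        .pivot(values="impressions",
               index="month",
               columns="is_reply",
               aggregate_function="mean"))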
2:00:02
okay in this case um I'm saying let's look at so up here I
2:00:08
just did let's look at the Impressions here I'm saying let's look at the
Impressions and the replies
2:00:13
and um and then I'm going to aggregate or I'm
2:00:18
going to in the my columns is whether something is a reply or not so what do we
have up here we have here's something
2:00:25
that is_reply false for impressions here's is_reply true for impressions
here's uh replies
2:00:33
that were replies and here's um replies sorry here's replies that were false and
2:00:41
here's replies that is a reply was true so this is
2:00:48
um kind of need to think about this like what does uh if something is a reply but
the call
2:00:54
aggregating the replies for something is oh okay yeah so the
2:00:59
um replies is the number of replies and this is an indicator of whether something
2:01:05
is reply so that's possible for replies something that is a reply to have a
2:01:10
reply on it um basically what that is saying so yeah
2:01:15
um the column headers here might be a little bit weird right I mean it's it's
building these up for us in pandas we
2:01:23
would have hierarchical columns here instead which might make it a little bit
easier to understand
2:01:29
um but again Polars doesn't have hierarchical columns so so it's kind of uh
2:01:34
just building those up based on the values that are in there now you can do
2:01:39
something like this you can say like I want to specify the separator in
there
2:01:45
um
2:01:52
so instead of you can see how it's kind of building that up this is the Impressions
column and then we're
2:01:58
looking at where is_reply is false this is the impressions column where is_reply
is true the replies column
2:02:04
where is_reply is false and the replies column where is_reply is true and
this is just saying stick in a star as a
2:02:10
separator you know if you want space in there you can stick in a space and that
2:02:15
would work as well and then here let's just plot this here so oops
2:02:25
so we're going to set uh this we're going to convert it to pandas stick that month
in there and then we'll plot that
2:02:33
we get something that looks like that
2:02:40
okay so uh we're going to do our aggregation exercise let you work on this for a
little bit
2:02:46
um so uh let's just look at the exercise the
2:02:52
first one is what is what were the total impressions for each year so so hint is
2:02:57
when you hear something say for each year or by something that's what you want to
group by
2:03:03
okay so in this case um if we want the total impressions for each year we need to
group by the year
2:03:09
and then we're going to pull off that Impressions column and take the sum of it um
2:03:15
in this case total impressions for each month we need to get the month and then sum
The Impressions so once you have
2:03:21
this first one um hopefully that that second one is not too hard
2:03:26
um the this plot the previous so just visualize those and then the next one here
what is the
2:03:34
total impressions for reply and non-reply tweets for each month so that one's a
little bit more
2:03:39
complicated and then plotting that so I'll let you uh work on those and again just
if you have
2:03:48
questions concerns happy to go around and help but this is again your chance to get
some of that muscle memory I've
2:03:55
found that grouping and aggregations and pivoting are probably some of the
most
2:04:00
important things that people need to be able to do with their data and it tends to
be one of those things that again
2:04:07
the more you practice it the easier it's going to be like this is one of those
things that you need you need to get the
2:04:12
muscle memory to to really start understanding how it works so I'll let you work on
that for a
2:04:19
little bit questions questions before you hop into this what questions do you have
2:04:28
okay be quiet and let you work on it okay I think we're gonna we're gonna
2:04:34
look at the solutions here for this as we're we're getting to the end of our time
here
2:04:51
okay so so first of all what were the impressions for each year
2:04:58
okay and so we got a couple options here one is we can pull off that year column
and that year value from the date and
2:05:06
and put that in there the other is we can do a group by dynamic um
2:05:11
so maybe I'll do let's do a dynamic one
2:05:23
okay so I want each year so I should be able to say y here
2:05:33
um
2:05:39
maybe one yeah so it says expected an integer in the duration string
2:05:45
so wants us to to specify how many years so so if you look at this this is the
2:05:51
January 1st of 2020 and this is January 1st of 2021. this is a little bit different
than pandas and pandas with
2:05:58
an offset alias you don't have to specify a number there but it looks like
2:06:03
Polars wants us to okay so and we want the impressions for each year and we
2:06:09
want the total so I've got replies in here I don't want replies and then I
2:06:15
want this to be the sum here okay so that there's the sum of the impressions for
each year
2:06:22
okay if I wanted this for each month um once I've got it in this form it's
2:06:29
really easy I just change this to 1M and you can see this case that didn't
2:06:36
work that looks like um it's probably doing some sort of
2:06:41
minute there so we'll try mo for month and that looks like it did work
2:06:47
okay so now both of those work the next one is to plot these
2:06:53
so again Polaris doesn't have plotting so we'll convert that to pandas stick
2:06:58
that time into the index and if we just call plot there there we go that's not
2:07:03
particularly interesting I would probably prefer to do a bar plot on this
2:07:08
um slightly more interesting I might even do a horizontal bar plot change that to
2:07:13
a barh something looks like that this one here I probably would leave as a
2:07:20
line plot let me see something that looks like that
2:07:28
okay uh let's see the next one here is what were the total impressions for
2:07:34
reply and non-reply tweets for each month okay so in this case we're going to do a
2:07:42
pivot on that so I'll take my Pivot code and copy and paste it
2:07:48
um and I already have each this is the month code here and so I want to stick
2:07:54
that into the index what do I want to look at I want to look at the impressions
2:08:00
and I want the columns to be whether it's a reply or not so that little tweak
2:08:05
of that should be okay so I'm I'm not afraid to like take code
2:08:11
and and run with it again like getting this code started off might be a little
2:08:17
bit more involved um so start with something simple right start with something
simple that you can
2:08:24
that you can check and make sure it's working and and then build that up
2:08:31
so let's plot this so I'll convert it into pandas let's whoops
2:08:39
let's set the index to the month and then we'll plot this
2:08:47
oh thank you yeah aggregate function to be the sum total impressions
2:08:53
yeah and there we go we get something that looks like that
2:08:58
okay um questions concerns thoughts
2:09:10
okay so we are at the end of our tutorial here hopefully this gave you a chance to
2:09:17
play around with Polars see some of the places where it might be different or
2:09:23
improve on on other libraries that you're familiar with again sort of the elephant
in the room is pandas
2:09:30
um my take on this again I'm pretty biased
2:09:36
because I've used pandas for a long time my take on this is like I I feel like I
I'm very comfortable with pandas the
2:09:43
expressions for Polars take a little bit of getting used to it
2:09:48
seems that Polars requires a little bit more typing than pandas does that might
2:09:54
be a trade-off that you're willing to make to get the speed performance again
2:09:59
a lot of that speed performance is going to come when you start doing aggregations
and groupings where it can
2:10:05
be smart about that Polars also because of the streaming functionality
2:10:11
that we didn't have time to go into but it does have some streaming functionality
can do larger than memory
2:10:16
data sets that uh pandas the library can't do but again pandas the API can do
with various
2:10:23
other libraries that implement the pandas API and allow you to go out of core
2:10:30
I do like to uh so so I showed my method for like taking a data set and going
2:10:35
through it and analyzing it I do like to do that just go through the columns
2:10:41
check those types again when you're using libraries like polars or pandas going
through those types can be often a
2:10:48
significant memory savings and that can also speed up computation
2:10:54
time as well huge fan of chaining in pandas but also chaining in polars you don't
have to
2:11:01
chain but you're kind of going to get dinged or penalized in Polars if you don't
so you can get performance boost
2:11:08
in Polars by doing chaining especially if you're lazy because it does have query
optimization
2:11:14
so Embrace chaining um again when you go across that rust
2:11:21
python threshold things will slow you down there so find ways to do things in
2:11:26
Expressions if you can and then finally the these aggregations that we looked at
are very powerful and
2:11:33
let you answer questions very succinctly uh
2:11:40
the syntax might be a little bit confusing or weird to you if that is the case my
advice is to practice them right
2:11:46
the more you practice these aggregations the better they're going to make sense and
you'll see sort of the pros and cons
2:11:52
of them okay any questions before we dismiss
2:12:09
Yeah question like if I'm happy with pandas generally I'm
2:12:16
not having performance issues I'm not like rust isn't super interesting to me
2:12:21
as a language do I need to care about polars um
2:12:27
I I I mean and and my sort of short answer would be uh if pandas works for you
2:12:33
um you should be good um again pandas 2 just came out a few weeks
2:12:40
ago it does have the ability to use those PyArrow data types which is going
2:12:46
to get you both speed and memory improvements from that point of view
2:12:51
um you know for doing exploratory data analysis type
2:12:58
stuff pandas is going to have better integration with other libraries right because
that has been around for a
2:13:06
while it's sort of the big player there um so yeah
2:13:13
if you don't have a compelling reason to use Polars
2:13:18
um I don't I don't think there's a need to switch to it per se
2:13:23
um I guess my my take on that would be like I I do think you would want to at least
2:13:29
if you're coming from pandas to probably start embracing that chaining because
2:13:35
um even though you don't have like the query planning and optimization that you do
in Polars I do think it makes your code
2:13:42
easier to read and debug and reuse that would be my advice around that
2:13:49
other questions
2:13:55
okay well thank you for your time hopefully you feel like you are better uh
2:14:01
better prepared to sort of figure out where Polars makes sense where it might not
make sense if if you want to connect
2:14:08
with me on LinkedIn or Twitter I'm there happy to connect
2:14:13
um yeah and and again uh have a great conference thanks for
2:14:20
letting me present and um good luck with your your future tabular
2:14:25
data endeavors"

2nd Video
Learn Data Analytics with Polars (Python) in just 2 hours!
"Transcript
0:01
welcome to a transformative Journey Through the dynamic realm of python data
manipulation in this course we'll
0:07
explore Polars a python library that equips you with the tools to handle data with
precision speed and unparalleled
0:14
ease we'll dive into Data ingestion cleaning aggregation the management of
0:20
null values delivery of custom functions and conditional logic and so much more
0:26
our goal is simply to empower you to extract the maximum value from your data in
the least amount of time join us and
0:33
unleash the full potential of Polars in
0:38
Python let's now introduce the data that we'll be working with throughout this
course the idea of this data set is that
0:46
it's supposed to be logs from your internet service provider so that is your
internet provider your mobile phone
0:53
provider they will get logs about your internet traffic on their Network so
0:58
they will log every website you go to the amount of data you consume where you are
when you do it and all that sort of
1:04
thing so the data that we have here is first of all date time so when did you
1:10
make this request for the website user ID so which user made the request to
1:17
access a given website the domain names of the website that they requested to
actually access DL bytes which is the
1:24
number of download bytes so how big was the amount of data that you downloaded from
from that website UL bytes is the
1:32
upload bytes so how large was the amount of data uploaded to the server during
1:38
your request we've got the client IP which is the IP address of the person's mobile
1:44
device or PC or whatever they're using to access the website we've got the
1:50
server IP which is the IP address of the website itself we've got the country that
they
1:56
were requesting access to the website from the transaction time in seconds so
2:01
this particular transaction took 2.33 seconds to complete we've got the HTTP method
which
2:08
is either HTTP or https https being the secure traffic
2:14
protocol we've then got user agent so what user agent were they using when they
made this request and finally we
2:21
have platforms so were they on Android Linux Windows Mac what were they using to
request the data so all of this is
2:29
going to be read into our polar data frame and when we say data frame we
2:34
simply mean a frame for our data it can have rows and columns just like this
2:40
table it's just a way to refer to tabular data in Polars and in other
2:46
libraries in Python like pandas or Apache spark we use this consistent terminology
of data frame which simply
2:53
means rows and columns just like you get in this spreadsheet
3:00
for this course we are going to use Google Colab and this is the Google Colab
interface when you first load it
3:07
up if you've not used it before you'll likely have no previous notebooks and no
previous files that it will suggest you
3:13
to open so all we need to do is press the new notebook button this will load the
Google Colab
3:20
interface for us and this might look familiar if you've used Jupyter notebooks
before the beauty of Google
3:27
Colab is that it's entirely cloud-based and runs in your web browser there's no
need to worry about setups or
3:34
installations on the left hand side we have a bunch of options the key one here for
us during this course is files this
3:42
will give us the ability to upload our own data sets to Google Colab and to start
playing with that data within our python
3:49
notebooks on that note Colab works with notebooks these are interactive
3:55
documents you can add code cells and write and execute python code or text
4:00
cells to explain your work it's perfect for combining your code with documentation
hence the name a notebook
4:07
it's like documenting your code in an executable way we can have a quick look
around Colab then so at the very top we
4:15
have a name for our notebook so in this case we'll give it a name for the course and that
notebook is now automatically saved
4:22
to our Google Drive so it's fully integrated and gives us access to our Google
drive
4:28
documents next down here you can see this block this is where we would like to
write code so we might do a equals
4:34
1 and print a as a simple python statement to run the block you can either hold
the shift key and press
4:41
enter or you can press the play button here so we said a equals 1 and print a
4:47
and it has printed one as we'd expect if we want to add a new code block we can do
by pressing plus code here and now we
4:55
might want to say B is equal to hello and then we might want to print B so we
5:01
can execute blocks one after the other and it gives us a nice way to structure our
code and have it all separated off
5:07
into isolated little blocks we can also add text blocks and this is great for
explaining your code so I might have a
5:14
new section where I'm going to talk about if statements and then when I say that I
have a nice title it gives me the
5:21
ability to really structure my code so that now this is a very clear section of
5:27
of code that obviously covers if statements and if I come to the left hand side
here I now have a table of
5:32
contents so I have a section called if statements and if I click there it will
immediately take me to that section so
5:38
again it just highlights the way that it's a nice way to structure your code keep
everything nice and organized and
5:44
run python directly in the browser the other cool thing is that we have almost
complete control over what
5:50
we want to do so on your local PC you might write pip install pandas for
5:57
example you can run that that in Google collab if you just prefix it with an
exclamation mark by prefixing with an
6:04
exclamation mark it will go away, it knows that's now a system command, and it will
go and install the pandas library
6:11
so whatever you want to do locally with installing new libraries or working with
python or working with files you can do
6:17
exactly the same in Google Colab without managing your own environment installing
libraries or getting python
6:24
all configured on your local PC of course that also means it's device agnostic you
can access this through
6:30
any device any browser any time all you need to do is log into your Google account
and you will have access to your
6:36
codebase so that's a quick intro to Google Colab this is what we're going to use
throughout the course, and we'll actually start working with it in the next video.
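For reference, the first cells of the notebook end up looking roughly like this (the leading exclamation mark is notebook syntax that hands the line to the shell rather than to Python):

# Cell 1: install the library into the Colab environment
!pip install polars

# Cell 2: ordinary Python
import polars as pl
a = 1
print(a)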
6:42
Okay, so let's go ahead and start
6:48
actually using Polars and reading data into a data frame on the left hand side
6:53
you have this files tab which will open this file window if you click the upload
6:59
button here and then navigate and find your course_file.csv which is
7:05
attached to this course we can see down here that that is now uploading so it will
be available to us very shortly so
7:12
if we click away from that the first thing we need to do is make sure that Polars is
actually installed on Google Colab
7:19
so if we just do an exclamation mark and then pip install polars and pip is how
7:24
we install python libraries in general so it's just the standard way of installing
a python library so we're
7:30
just saying install the Polars library and make it available for us we then need to
7:35
import Polars into our notebook so that we can start using Polars and the Polars
7:41
functionality now we usually shorten that to pl so we'll say import polars as
7:47
pl and that means going forwards I can just refer to the library as pl
7:52
rather than typing the entire Polars word which you know it's just really a laziness
thing to be honest but it's the
7:58
standard way of doing things so how do we actually read data into the data
8:05
frame well we do DF which is shorthand for data frame it's just standard notation but
you can really call that whatever
8:11
you want we then do pl so we're saying use the Polars library because we've imported
8:17
that as PL so pl. read CSV and then we give the name of that CSV so if we just
8:23
open it to the side we can see it has now uploaded we've got course file.csv so
that's what we want to reference so
8:31
within quote marks we can do course file.csv now my particular CSV has the first
row
8:40
of the data is the header so the column headings or the column names are the
8:45
first row of the data so I want to tell polers that that is the case so we do
8:51
has_header is equal to True so that says treat the first line of this data
8:59
frame as the header or the column names so we can see that that's worked so if we
do
9:05
DF do head and then two what this does is it gives us the head of the data frame so
the the top bit of the data
9:12
frame and the number in the brackets is how many rows it's going to return so in
this case it's only going to return two
9:18
rows because we just want to see what the data looks like we can see that we have
our date time column and it has
9:25
listed it as a string we've got user ID which is a 64-bit integer domain is a
9:31
string DL bytes 64-bit integer and so on so the way that it is laid out is
9:39
exactly that that we have our column names the data type associated with them and
underneath you can see some of the
9:44
data that we actually have in our data frame other things we could do is we
9:49
could actually look at the bottom of the data frame so if we particularly wanted to
look the two bottom rows of the data
9:55
frame we could do tail so head is the very top of the data frame tail is the very
bottom of the data frame they both
10:01
work in exactly the same way in that they just select a certain number of rows to
display to
10:07
us so we now have our data in a uh
10:13
polar's data frame ready for us to use but we probably want to understand a little
bit more about our data like how
10:20
many rows are there I don't know so we can just do len of the
10:25
DF and in this case the len is the length so what is the total length or the
10:30
total number of rows that I should expect to find in this data frame and in this
case we can see that it's 100,000
10:37
rows excellent but what if I also have a very wide data frame I can't see all the
10:43
columns so we can simply type DF do columns which we should uh wrap in a
10:49
print statement and that will print us a list of columns within the data frame
10:54
that are available for us to use so we can see that we have a date time user ID
10:59
domain DL bytes UL bytes and so on so all the columns that we would expect but it
gives us a nice way to print that out
11:06
and in fact we can iterate over this list for certain functions as well which
proves to be very
11:12
useful if we want to see what data types we have we can do df.dtypes so that
11:19
will look at all of the data types within that data frame and so we can see that
date time is a utf8 which is a
11:25
string user ID is integer 64 domain is utf8 and so on so the list of columns is
11:33
in exactly the same order as the list of D types so it allows you to really start
to inspect the data and understand what
11:40
we've got to understand it a little bit more we could print df.shape and
what shape is
11:48
going to give us is literally the shape of the data frame so we know that there's
100,000 rows but we also now
11:54
know there's 12 columns so again it just gives you another way to sort of analyze
the data that you've got by analyzing
12:00
the shape something that's super useful is if we do print DF do
12:07
describe then this is going to describe our data for us in quite a lot of detail so
we don't have to then go away and
12:14
start analyzing it and doing a lot of exploratory data analysis ourselves in fact
it's done a lot of it for us here
12:20
so for each column we have the count of values how many of those values are null
12:25
what is the mean value across all of the records in the data set the standard
deviation the Min the max the median the
12:32
25th percentile and the 75th percentile so this gives us a really quick view of
12:37
our data a really quick understanding of our data and enables us to really
understand what we've got to work with
12:43
without needing to do any of that analysis ourselves so this is all great it's
12:50
given us a good idea about our data frame it's given us a good idea of how we can
inspect our data understand what
12:56
we've got understand how much data is in the data frame what columns we have and
all that sort of stuff so if we quickly
13:02
just review what we've covered there we've learned how to import the Polars
library and we import it as pl
13:09
we can then reference it as PL throughout the entire uh process of
13:15
developing in Polars so here we have pl.read_csv so we're saying using the
13:20
Polars library read the CSV this is the file name of that CSV and this
13:26
particular attribute here says treat the first row of that CSV data as the column
13:32
headings so the first row of the CSV includes date time user ID domain DL
13:38
bytes UL bytes so it's saying make sure that those values are the column
13:43
headings of our data frame we then know that we can use the head function to look
at the top n rows so we could do
13:51
the top 20 we can change that number to be whatever we want but it gives us the
chance to look at the top block of rows
13:57
similarly tail enables us to look at the bottom block of rows so we could change this
number to be whatever we want but
14:03
it just gives us a way to inspect the data at either end of the data frame which is
very helpful of course if it is
14:08
ordered if you ordered something by date time the head would show you the very
earliest date in your data frame and
14:15
tail would show you the very latest date so it gives you a nice way to look at
either end of the data frame we can then
14:21
use Len which is checking the length of the data frame how many rows are there
14:27
we can print the columns using DF columns which simply provides you with a list of
columns df.dtypes provides you
14:35
with a list of data types for each of those columns DF do shape tells you how
14:41
many rows you have and how many columns you have better helping you to understand
what data you're working with
14:47
and DF describe gives you quite a nice view of your data set with uh counts and
14:54
and averages and all that sort of stuff across your fields which helps you better
understand your data without doing
15:00
too much data analysis yourself.
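Gathering those inspection steps into one sketch (assuming the uploaded file is named course_file.csv, as in the video):

import polars as pl

df = pl.read_csv("course_file.csv", has_header=True)  # first row becomes the column names

print(df.head(2))      # top two rows
print(df.tail(2))      # bottom two rows
print(len(df))         # number of rows
print(df.columns)      # list of column names
print(df.dtypes)       # data types, in the same order as df.columns
print(df.shape)        # (rows, columns)
print(df.describe())   # count, null_count, mean, std, min, max, percentiles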
15:05
In the next video we will look at how we use a custom schema to bring our data in, and the reason for that is that if we go and have a
look at our data types, which
15:13
were down here in this case I might not want all of these data types to be as
15:18
they have been inferred by Polars for example I might want to change an
15:23
integer 64 to be a string or a floating point or whatever I may like it to be so I
might not agree with what Polars has
15:30
inferred the data type of the column to be and I might want more flexibility and
control myself so we'll have a look at
15:37
that. Okay, so last time we looked at how to read CSV data into a data frame, so just to
15:43
review we said use the Polars library to read the CSV this is the file name of
15:48
that CSV and the first row of data should be considered to be the column
15:54
names now we saw that when we brought our data in that it defined
15:59
user ID for example as an integer now I know as as as I work in the business
16:06
that my user IDs are about to change they're going to start being string values so
I want to Define this field to
16:13
be string so that when my new data comes in it still accepts the data but at the
16:19
moment Polars is inferring the schema so it's looking at the data that's already in
the data frame and it's saying what
16:25
data type do I think this should be so we need to force it to use a different
16:31
schema so we Define a python dictionary and what I'm going to call this one is
simply custom
16:41
schema and in here we're going to Define each of the fields within our data frame
16:46
and the data type that we want them to be so we have date time as a column and
16:53
I still want that to be a utf8 string value I'm quite happy with how it was
inferred by
17:02
Polars. The next one though was user ID, where that was inferred by Polars to be an
17:09
integer but I don't need that I want that to be a string as well because as I say
my data is about to change so we'll
17:17
just have user ID and within there we'll do pl.
17:22
utf8 so we've now changed the data type for user ID we've
17:28
told it when you read this data in it's not an integer anymore it is a string value
so that gives us a nice amount of
17:36
flexibility for when our data ultimately changes now I'll paste the rest in so
17:41
that we don't need to sort of manually type all these out through the video but
this is how we Define the entire data
17:47
set so we've said each of the fields in our data set we have defined as an item
17:53
within our dictionary each of the values Associated to them are the data types
17:59
for uh that we expect to be in those columns so we've got utf8 which is a
18:04
string value it's alphanumeric it's like words, things that would be
18:10
considered as a string we've got integer values we've got floating Point values
18:15
and we could Define other data types in there as well which are available to us
through polers so I have now defined
18:22
this custom schema so how do I force Polars to use that custom schema
18:28
when it's reading our data in well we can define DF as we did before which we
remember just stands for data frame it's
18:35
just a shorthand way of saying this is a data frame and we're going to use the
Polars library to read the CSV and that's
18:44
going to be course file. CSV we're still going to have our
18:50
has header is equal to true because it's still true that the first row of our
18:55
data should be considered as the column names but then we're going to add an
19:02
additional argument that says dtypes is equal to custom
19:07
schema so it's saying don't try and infer the data types this time I've already
told what they are and those
19:15
values are included in the custom schema so when we do DF do head and two
19:22
to show the top two rows of our data frame we can now see that user ID is a string
so previously we saw that it was
19:29
an integer and now because we've defined our custom schema it's now defined as a
19:35
string so when my new data starts coming in when I've got my new user ID
19:41
format that will not cause any failures it's already inferred as a string so
19:47
that's great so we've now managed to force
19:53
Polars to, rather than infer or predetermine the schema or the data
19:58
types that it expects to see, use the data
types that we've told it to expect within the data frame.
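A sketch of that schema override (the column names are assumptions based on the dataset description, and newer Polars releases also accept this mapping under the name schema_overrides):

import polars as pl

custom_schema = {
    "datetime": pl.Utf8,   # keep the timestamp as a string for now
    "user_id": pl.Utf8,    # force string so future alphanumeric IDs still load
    "dl_bytes": pl.Int64,
    "ul_bytes": pl.Int64,
    "transaction_time": pl.Float64,
}

df = pl.read_csv("course_file.csv", has_header=True, dtypes=custom_schema)
print(df.head(2))          # user_id now shows as a string column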
20:06
Okay, so we're very lucky that our
20:12
current data set has no null values in it there are no missing values in any of our
columns but we're building a
20:19
repeatable pipeline so when new data comes in in new files we don't know that
20:24
it's always going to have no nulls in it so let's look at some ways that we might
try to address that situation so the
20:33
first thing is that we could filter out nulls so if we're interested in
20:38
filtering nulls in the datetime column and user ID column for example there are a
20:43
few ways we could do that the first way is quite clunky really we're going to
create a new data frame called DF no
20:51
nulls and that is going to be a data frame comprised of the output of
20:56
filtering our original data frame so we're going to say DF do filter and what
21:02
we're going to say is that we want to filter the data frame where the date
21:07
time field is not
21:14
null so if we just add that as a single condition for now we say DF no nulls do
21:20
head 2 and that is going to say let's go up and have a
21:26
review only provide the values where the datetime field is not
21:32
empty, it's not null, so this field must be populated, otherwise that record
21:39
will not appear in df_no_nulls so this is a single criterion that we have defined for
21:47
our filter but what we might like to do is we might like to have multiple
21:53
filters so we'll encapsulate that filter within an additional set of brackets just
for clean this really so this
22:01
particular criteria is as we've just discussed that date time must not be null we
can then after that put an
22:08
anasan and we can create a new uh criteria and this is going to be
22:15
in addition to so we're saying date time must be null and also user ID must not
22:21
be null is not null
22:29
so let's just check what we think that's going to do so we're going to say we want
to filter our data frame the first
22:35
bit of criteria is this the date time is not null the second bit of criteria is
22:41
this that user ID is not null and here we've put an ampersand which means and so
22:47
user ID must not be null and date time must not be null if one is null and the
22:53
other isn't then that record will not appear in df_no_nulls because in this case we're
22:58
saying both of those criteria must be true if we wanted to make it an or so
23:05
filter where date time is not null or user ID is not null we could change that out for
23:12
a pipe which means or it's just a vertical line but in this case we'll leave it as
the ampersand and we'll click
23:18
run on that so now we've got our new data frame and we can guarantee that
23:24
both the date time and user ID fields are not null, both of them
23:29
have a value populated within them so that's perfect now we've got
other
23:38
ways to deal with the same sort of problem really some slightly less clunky
23:43
ways so we might want to say df_no_nulls_2 so it's a new data
23:52
frame we're going to call this df_no_nulls_2 and it's going to be df.drop
23:57
nulls and specifically we're going to do a subset of columns so where any of the
24:05
columns within this list have a null value we're just going to drop that
24:11
record so we're going to have date time we've got user
24:17
ID and let's add another one just because and we'll have domain so this
24:23
one here is telling us that we're going to drop the null values when we find a
24:28
null in any of these columns so if we now do our df_no_nulls_
24:33
2 dot head we can then see the output of that
24:40
so that's just another way to deal with sort of the same problem but handling nulls
is very very important so another
24:48
way to handle nulls is you might want to just fill it with a specific value so in
24:54
this case we're going to say df.with_columns so this is saying we're going to actually
change the
25:02
columns or we're going to add a column or remove a column or update a column
with_columns gives us the ability to do that and within it we're going to
25:10
say the Polars column of user ID if that is null I want to fill that
25:19
null with a particular value and that particular value that we're going to fill it
with is going to
25:26
be a literal value so an exact value, just a number or a string
25:33
so we're just going to put 22 in there so in any case that we see user ID
25:40
22 we know that it was null it's just it has been filled for us now previously
25:47
we've been looking at rewriting the data frame name and then putting do head we
don't actually have to even do that we
25:53
can just do do head and then run that
26:00
my apologies we need to put the underscore there so it's with underscore columns so
we now have our data so let's
26:08
just quickly review what that's done it said whenever we find the user ID column
26:13
to be null then we should replace it with the literal value of 22 so it's
26:21
simply just going to replace any missing user ID with 22 so it's just a nice way
26:27
to to fill it with a very specific value but that's not really that helpful
26:34
in many situations I mean why would you want to fill it with 22 as a static value
there's probably not very many use
26:39
cases for that so we can actually fill it with something more useful so if we do
df.with_
26:47
columns again and this time we're going to say the column I want to fill is DL
26:54
bytes and I want to fill that null value with the median value of DL
27:06
bytes and that makes much more sense doesn't it that if you're missing a value in
DL bytes we can sort of guess
27:13
that it will be somewhere near the median or you know it gives us a way to infer a
realistic value within that
27:21
column so again we can have a look at that and in this case if there were any
27:28
values in here that we're missing we would then get the median value of the entire
DL bytes
27:34
column which again gives us the ability to have a little bit more uh control or
27:41
Precision over the data that we fill into a null value so we can now look at
another
27:48
example of filling nulls and this one simply says if you find any nulls in the
27:55
entire data frame irrespective of column just fill it with the static value so
there we can just do df.fill_null
28:03
which fills null values and we'll say 344 in this case so if you find any
28:09
nulls in any column across the entire data frame just fill that value with
28:15
344 so that's that is how we're going to deal with it in this case so we've looked
at a number of things we've
28:21
looked at how we could apply a very specific filter to our data frame where we
could say when this column is not
28:28
null and this column is not null then that should form the basis of my clean
28:33
data set or my new data frame we've also looked here where we might say look for
28:40
nulls in any of these columns and if you find them drop the entire row so in that
28:45
case it will say there's a null in date time remove the entire row we then looked
at uh how we might
28:53
fill with a specific value so if we find a null in the user ID column
28:58
fill it with the value 22 we then looked at how we might fill
29:04
the data with the median value so in this case we've looked at the DL bytes
29:10
and if DL bytes is empty then we should fill that with the median of DL bytes to
29:15
give us a nice realistic value then finally we looked at filling the data
29:21
frame anywhere that we find a null any null value should be replaced with 344
29:28
so actually dealing with nulls is one of the most important things when we're
cleaning up our data and making it ready
29:34
for analysis so this has given you a nice grounding hopefully on how we might do that
and different approaches to doing so.
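Here is a compact sketch of those null-handling options side by side (column names assumed from the dataset description):

import polars as pl

df = pl.read_csv("course_file.csv", has_header=True)

# 1. Keep only rows where both columns are populated.
df_no_nulls = df.filter(
    (pl.col("datetime").is_not_null()) & (pl.col("user_id").is_not_null())
)

# 2. Drop any row with a null in one of these columns.
df_no_nulls_2 = df.drop_nulls(subset=["datetime", "user_id", "domain"])

# 3. Fill nulls in one column with a fixed value, or with a statistic of the column.
df_fixed = df.with_columns(pl.col("user_id").fill_null(pl.lit(22)))
df_median = df.with_columns(pl.col("dl_bytes").fill_null(pl.col("dl_bytes").median()))

# 4. Fill every remaining null in the whole frame with one value.
# (fill_null targets missing values; fill_nan only targets floating-point NaNs.)
df_filled_all = df.fill_null(344)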
29:41
So in the previous videos we've looked at how we read data into our data frame
and how we deal with null values but
29:48
we haven't really looked at how we select only certain columns so let's do that so
the first thing we can do is we
29:56
can simply say df.select and we can do pl.col and in
30:03
here we can do a star and that means select all of the columns within the
30:10
data frame and that has done exactly that and we've limited it with head so
30:16
we said PL call and we put in there a specific column what we might also
30:23
rather than do that want to do is we might want to say select every column
30:28
except for DL bytes for some reason I don't want to see DL bytes and we can do
30:34
here exclude and within the brackets we can do DL
30:40
bytes we can actually of course also add other values as well so UL
30:49
bytes and so when we run that we can see that we've suddenly lost these two columns
here we've go from domain
30:56
straight to an IP address now whereas previously we had our DL and UL bytes so
31:01
this one is saying select every column but not these two so every column except
31:07
those two columns but we could also simply select
31:13
specific columns as well so we could say DL bytes and then we'll do pl.
31:21
col UL bytes and in this case we'll remove the exclude because of course we're no
longer saying select everything
31:28
we're just saying select specific columns so in this case we always need to say
pl dot, so we're saying select
31:34
the Polars data frame column of DL bytes and select the Polars data frame column
31:39
of UL bytes we separate all the columns that we want to select with a comma so it's
very simple to nice syntax here
31:46
that we're just saying simply select these two columns so we've seen how we select
all
31:51
of the columns we've seen how we select all of the columns except a subset of
columns so we can get rid of things that
31:57
we don't really want what else can we look at well there's one other way that
32:02
we can select all of the columns which just for um completeness we should cover so
we can simply just do DF do select
32:10
and within that we can do pl. all and that
32:16
simply will select all of the columns within our data frame just as the uh the
32:22
previous example we saw with the star but in this case it's just a different syntax
and worth noting in case you run
32:28
into it um when looking through other people's code so that's just the very
32:33
simple uh introduction into selecting specific columns so if we have a very quick
look through that again let just
32:39
find the right place here we're saying from the data frame select the DL bytes
32:44
column and the UL bytes column and in this case we've limited the head to five
32:49
in this case we say select the um select all of the columns from the data
32:55
frame which we do with pl.all but we could also do pl.col and that would
33:03
work with a star in there and the star denotes all of the columns that will give
us exactly the same output and
33:11
we also looked at how you might say select all of the columns except for
33:16
where we put exclude so select all of the columns except for DL bytes and that
33:22
would give us everything except for DL bytes so you can see that that's missing just
here so this is quite a nice uh
33:29
nice feature where we can select all the columns except because sometimes there's
just a few columns that you don't need
33:35
and want to drop and in other languages or other libraries it can be quite painful
to do so so that's a very nice
33:42
um additional bit of syntax so now we know how to select
33:48
specific columns from our data frame we know how to deal with null values we of
course know how to bring the data into
33:53
the data frame itself let's talk about how we now start creating calculated
columns and start getting value from our
34:00
data so what if I wanted to calculate a specific column so for example what if I
34:08
wanted to multiply DL bytes by five and have that as a column in my data frame
34:13
well we could just do DF do select so select specific columns as we saw last
34:20
time we're going to select the DL bytes column specifically as the first one the
34:25
next one we're going to do is we're actually going to select the pl.col of
34:33
DL bytes and then we're going to multiply that by five so we're just simply taking
the DL
34:41
byes column we're going to multiply by five note that this is within its own
brackets to make it very clear that this
34:47
is a calculation PL call DL byes Time 5 is all encapsulated with its own
34:53
brackets and outside of the brackets we can set an alias and the Alias is the field
name so it's
35:00
like in SQL where you might say select something or select a field name as and
35:05
then you provide it a new name it's very similar here so we might want to call this
one calculated
35:12
field the final thing we're going to do is just run the head operation to get the
first number of
35:19
Records so let's just run through this and make sure we're very happy with what
it's doing so we're selecting specific
35:26
columns from our data frame the first column is DL bytes the second column is
35:32
DL bytes multiplied by five and as I say the this particular calculation is
35:38
encapsulated within its own brackets and that means that we can really see that
this is a calculation and all of these
35:45
things belong together once our calculation has been completed we use
35:50
Alias and Alias gives the field a name so we can see here this is called calculated
if I want to change the name
35:57
to something something else we can suddenly see that that changes in the table so
it just gives us a way to make
36:02
sure that that column is identifiable when it comes out the other side so that's
really useful for doing
36:09
sort of these static calculations but can we do multiple of
36:15
these in one go so let's try that we'll take this as a
36:22
copy and we'll spread things out onto multiple lines because it just makes it much
much more
36:29
readable so we've already got this which we understand we've selected our DL
36:34
bytes we've then taken DL bytes times 5 we've given it an alias which will now
change to calculated DL
36:41
bytes I also want to calculate UL bytes
36:46
multiplied by 5 and I want it to be in the same table so we can simply copy this
stick it
36:55
below and now change this to UL bytes so we now have three columns in our
37:02
data frame we have DL bytes we have the calculated DL bytes column and we have
37:09
the calculated UL bytes column so this just demonstrates that we
37:14
can include multiple calculated Fields within one select statement but note
37:19
that each calculated field is encapsulated within its own brackets so the pl.col
DL bytes is in its own
37:26
brackets, the pl.col UL bytes is in its own brackets, it makes it very clear, tidies
everything up and makes it all
37:32
encapsulated so that we know that all of that calculation belongs together
37:37
each of them has an alias of calculated UL bytes and DL bytes and when we run the
data frame we can now see we've got
37:44
our DL bytes column, our calculated DL bytes and our calculated UL bytes; of
37:49
course if we want to validate the data we probably want to have the UL bytes column
in there as well to do that we
37:56
simply take exactly the same syntax here and change that to UL bytes so now we're
38:02
selecting the DL bytes column the UL bytes column the calculated DL bytes column and
the calculated UL bytes column
38:09
so we can validate that yes these numbers do look like they are five times the
38:15
original which is great so the other thing that we might
38:20
want to do is we might want to start looking at rounding this data because
38:26
when you're doing a multiplication it's not really a problem but if we take this
particular statement and I were to
38:32
change that to divided by 5 we would find that I've left a special
38:40
character there; we now see that UL bytes has a
38:49
decimal place if we change it to 5.7 for example suddenly that decimal place
becomes very large so when we divide it
38:56
by five actually it doesn't look too bad it's only one decimal place but if we were
to divide it by a different value
39:04
that would become very difficult to read so what do we want to do we want to round
that to two decimal places and to
39:12
do that we simply have to Define that we'll cast it
39:18
first of all as a float so we're going to tell Polars that the value that we
39:26
expect as a result of this calculation is going to be a floating point value the
original DL bytes and UL bytes were
39:33
integer values so we're just confirming that this is going to be a floating Point
once we've confirmed it's going to
39:40
be a floating point we can just simply use the round function of two so when we
39:47
now run this it forces it to two decimal places so let's have a particular view
39:54
of this let's let's run through this so we're going to say we want the UL bytes
40:01
column we want to divide it by 5.7 once we've done that we know that
40:07
it's going to be a floating Point number so we want to cast that float that number
as floating point just to make
40:13
sure 100% that polers knows that this is a floating Point number we can see down
40:19
here that it has defined it as such now that that's been defined as
40:24
floating point we want to round that that to two decimal places and that means two
decimal places we could change
40:31
it to three and it would lengthen it to three we could change it to one as well
40:36
so we can change that to be whatever we want based on the level of precision that
we need for our work so that just
40:42
gives us the ability to round it to any number of decimal places finally we just
assign the Alias
40:49
to that to say that this field should be called calculated UL bytes and that gets
40:54
reflected here in the table so that's a quick introduction to
40:59
calculated Fields it shows us how we can calculate a field based on a
41:05
multiplication division whatever it might be how we can then cast the value of that
field to a different data type
41:12
and how we can then deal with that data type so in the case of floating points we
can round that to one decimal place
41:19
three decimal places whatever it might be and then we can assign an alias to it so
that the output table is useful and
41:25
usable and has a name that we can really use so now we're going to start doing
41:32
some really interesting stuff with our data we want some conditional things put
into our data
41:39
frame so we might want to say when the DL bytes column is over 500,000 we will
41:44
consider that to be a large transaction in all other situations we consider it
41:50
to be small and so that is a when statement a case statement whatever you might
like to call it so let's have a
41:56
look at how we do that so we're going to do as we always do df.select
42:01
so we're going to select specific fields in this case we're going to do pl.col and
we're going to do DL
42:07
bytes so this is the field that we're going to be working with so I'm going to
select the DL bytes column I then want
42:15
to do a when statement so when the pl.col of DL bytes, so when the DL
42:23
bytes value is greater than 500,000, then we should do something so on
42:31
the next line we'll just indent in to make it nice and readable we'll do then
42:36
and then we're going to actually use this lit function again so we're going to
assign a very specific value and that
42:43
is large so we said when the value of DL
42:49
bytes column is greater than 500,000 we're going to set the value as large
42:57
otherwise so else or in all other situations we're going to set the value
43:03
to be the literal value of small so then we simply want to assign
43:11
an alias to this put that in the wrong place there
43:18
it's Alias and we're going to do this as our when statement
43:24
test and then we can do our do head as we always would so let's have a quick look
before we run this so we're doing a
43:31
select from our data frame that data frame is going to select the DL bytes column
and we're going to include that
43:37
because it's going to enable us to actually check between the two Fields large or
small whether this condition
43:43
has worked we can then do a when statement so when that DL bytes column
43:48
is over 500,000 then print the value or set the value of the field to be large
otherwise
43:56
set the value of the field to be small and then we're going to give that column
44:01
a name which is when statement test so when we run that is this greater than
500,000 yes so
44:08
it set as large is this greater than 500,000 no so it set it as small is this
44:14
yes it's greater so it's large this is smaller so it's small and this is larger so
it's large so our statement has
44:22
worked correctly it has taken the DL bytes column and it said when it meets
44:27
the condition that DL bytes is a greater value than 500,000 then we're going to
print in the
44:34
column the word large otherwise we're going to print the word small and we're
44:39
going to give that column a name and that name is going to be called when statement
test so that's an introduction
44:46
to when statements we're going to go through a few more in the following videos but
that's just a way that we can
44:52
apply conditions to the values within our Fields different example of a when
statement here here we've done some
44:57
numerical comparisons to say is the value greater than or less than than a certain
value and we provided a string
45:04
as a response in this example we're going to check if a string is equal to a
certain value and we're going to reply a
45:11
Boolean true or false as a response so we're going to do our DF select as we
45:17
always do so we're going to select and in this case let's use country so we're
45:22
going to do pl.col and that is going to be the country column so we're selecting from
our Polars
45:28
data frame the column of country we then say
45:34
when that country value is equal to
45:43
England then we want to do something and what we want to do is then
45:52
return the literal value of true and otherwise
46:00
return the literal value of false we then finally want to set an
46:07
alias of is it
46:14
England, which will give us the column
46:19
name so we just move that we have moved a bracket there
46:28
always be mindful of of bracket positioning
46:33
so we've selected a column of country we've then said when that country is
46:38
equal to England return true otherwise return false and make
46:45
that an is it England to be the column name so if we look at the head of this
46:50
newly defined data frame we can now see is it England yes it's true is it
46:55
England no and it is printed false no it's printed false yes it's printed true no
it's false and yes it's true so it
47:01
seems to have worked exactly as we had hoped it to that it has provided those
47:07
true or false values but we could do something quite cool here we could edit
47:14
this and have a countries list and I am only interested in
47:20
Scotland and Wales so instead of when the country is
47:27
equal to England I can do when it is in
47:32
countries so when the country value is in this list of values then provide true
47:38
else provide false so when we run that is England in that list of values no and
47:44
it's returned false is Wales in that list of values yes and it's returned true
Scotland has returned true England
47:51
has returned false and so on, so it gives us an additional bit of flexibility there
that we might have a static
47:56
lookup list; we might say if this value is in the static lookup list then we can
48:01
apply a certain business rule, else we'll apply a different business rule to
48:07
the data so in the next um video we'll start
48:13
looking at making these slightly more intricate, adding additional conditions and
things to the statement.
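And the lookup-list variant with is_in, as a sketch (the alias name is just illustrative):

import polars as pl

df = pl.read_csv("course_file.csv", has_header=True)
countries = ["Scotland", "Wales"]

df.select(
    pl.col("country"),
    pl.when(pl.col("country").is_in(countries))   # membership test instead of equality
      .then(pl.lit(True))
      .otherwise(pl.lit(False))
      .alias("in_lookup_list"),
).head(5)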
48:20
Okay, so in this video we're going to look at making our example a little
bit more complex so we're going
48:27
to follow a very similar structure to the previous example but this time we're
going to add some additional conditions
48:34
so this time let's do something with the platforms field so we're going to again
make a list and in this case it's going
48:40
to be called platforms and inside we're going to have Android and
48:46
Linux so these are our platforms that we're interested in so similarly as
48:51
before we're going to do our DF do select as we always do and we're going to Now
work on our when statement so the
49:00
first thing is we're going to of course bring in the columns we're interested in so
we want platform and we also want DL
49:12
bytes so those are the columns that we want to be looking at so our when
49:17
statement is going to say when the platform is in this list of platforms so when
the platform is equal to Android or
49:25
Linux and the DL bytes is greater than 5,000 then we're going to print one else
49:32
we're going to print zero so the first thing to do is add pl. when and we'll
49:39
add a load of tabs here to make it really clear what we're doing so inside the when
statement we have two
49:45
conditions one is that the platform must be in the list of platforms and the other
is that DL bytes must be greater
49:53
than 5,000 each of those conditions will be encapsulated in its own
49:58
bracket and it will have an ampersand for and in between so essentially we end up with
50:06
this we're going to have within the when statement condition one and condition
50:12
two so condition one is going to be the pl.col of
50:19
platform and then we're going to have our is in platforms so that's our first
condition
50:27
so the platform column must be in here
50:32
and if it is we'll do something if it's not we'll do something else but that's our
first condition then we have the ampersand
50:39
and our second condition so our second condition is going to be pl.col and in
this case it's DL bytes and
50:46
we're just going to Simply say is over 5,000 so we have our two conditions this
50:54
condition is encapsulated within its own brackets and this condition is also
encapsulated within its own brackets we
51:01
can remove some of the spaces now to tidy it up but the point of adding all those
spaces in the first place was to
51:08
really demonstrate that this is its own condition and we've encapsulated it this is
its own condition and we've
51:14
encapsulated it and that all sits within this when statement so we just remove
51:20
some of these now just to make it nice and readable so when the platform is in the
51:27
platform's list so it's either Android or Linux and DL bytes is greater than 5,000
51:32
then we're going to do something so we're going to do then we're going to have the
literal result of one so it's
51:39
simply going to be one and otherwise we're going to just
51:45
do zero so it's going to return a zero and the final thing we need to do is of
51:51
course set our alias so the alias is going to be platform
51:57
when and then we'll do our head of five rows so when we run that we now get
52:06
1 1 1 1 0 so Android is it Android and is it greater than 5,000 yes so it's one is
52:14
it in the list of Android or Linux yes is it greater than 5,000 yes so it's one
52:19
so the condition has worked as we expected that both this condition and
52:26
this condition are true and when they are both true we print one and here we can
see we've got Windows which is not
52:33
in the platform's list so even though the DL bytes is greater than 5,000 we
52:38
still get a zero because it does not meet both conditions so just to run
52:44
through this line one more time because I think this is somewhere that people get a
little bit stuck because there are
52:50
just so many brackets but we're saying the when statement has its own bracket
52:56
this is our condition or conditions that all live within these when statement brackets
we then
53:03
have condition one which is within its own brackets and that is saying that when
the column platform is within this
53:10
list so when the value in platform is either Android or Linux then that
53:15
condition is met the second condition again is within its own set of brackets and
that is when DL bytes is greater
53:22
than 5,000 so these two are the conditions we've got the ampersand in the middle
53:28
which means we're looking for an and we can replace that with a pipe if we so wish
which will swap the condition to be
53:34
or and that will change the results to be always one why is it now always one
53:39
because in every condition here DL bytes is greater than 5,000 or the platform is
53:47
either Android or Linux so it's always one of those two conditions is always
53:54
true so that's and then as sort of an introduction into slightly more complex
53:59
examples in the next example we're going to now look at multiple when conditions so
we can say when this is true do this
54:06
when this is true do this, when this is true do this, and otherwise do that; so now
we're going to start adding more when conditions to add more complex logic to our calculation.
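Sketching the two-condition version (each condition sits in its own parentheses, joined by the ampersand):

import polars as pl

df = pl.read_csv("course_file.csv", has_header=True)
platforms = ["Android", "Linux"]

df.select(
    pl.col("platform"),
    pl.col("dl_bytes"),
    pl.when(
        (pl.col("platform").is_in(platforms)) & (pl.col("dl_bytes") > 5_000)
    )
    .then(pl.lit(1))
    .otherwise(pl.lit(0))
    .alias("platform_when"),
).head(5)

Swapping the ampersand for a pipe turns the combined test into an or, as described above.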
54:12
Okay, so now let's start
54:19
making this even more complex so now we're going to have multiple conditions where
we have multiple outcomes
54:25
depending on what the value is within the field so we're going to do our DF do
select as always and inside the brackets
54:34
we're going to select a number of fields so we're going to take the UL bytes
54:40
field and we're going to take the DL bytes field as our
54:46
columns okay so these are our static columns now what we want is we want multiple
conditions here so we want to
54:52
say when UL bytes is greater than x and DL bytes is greater than x then we're going
to print large in another
54:58
situation we're going to print medium and in all other situations we're going to
print small so we're calculating a
55:03
flag for transaction size so it's either going to be large medium or small so we
55:09
can make this super simple we can just say pl. when and in here as before we're
55:14
going to have two conditions so they're going to be within their own brackets and
we're going to say when the UL
55:21
bytes column is greater than
55:28
500,000 and the pl.col of DL bytes is greater
55:36
than 500,000 then we're going to do something and what we're going to do is we'll
put
55:44
then we want the pl. literal value of
55:50
large so that's our first criteria we're then going to say that this is going to
55:56
continue this statement is about to continue with something else so we put the
backs slash there to say the next line is a continuation of this statement
56:04
and when we get to the next line we start again when and again we're going to say
56:12
when pl. UL
56:17
bytes and in this case we'll do is greater than 250,000 instead of
56:25
500,000 and and then another condition pl. DL
56:32
byes is greater than 250,000 then we're going to do something
56:38
different and in this case we're going to say then we want the value to be pl.
56:43
lit medium because this is a mediumsized transaction we've got our large
transaction where everything has to be
56:50
above 500,000 bytes we've got medium where everything has to be above 250,000 bytes
and now we're going to going to
56:56
say in all other situations we want you just to print um small so we'll just do
57:04
the backslash again it's just good practice to say this is going to be the next
line and then we're going to do
57:09
otherwise we're going to reply with pl. lit and this is going to be
57:16
small the final thing that we need to do as always is fit a Alias to it so this
57:23
is Alias and we're going to call this transaction
57:29
size so that's our column there defined so let's have a quick review of what
57:35
this is doing we're saying when the UL bytes column is over 500,000 and the DL
57:42
bytes column is over 500,000 we are then going to print large in this field when
57:49
the UL bytes column is over 250,000 and the DL bytes column is over
57:54
250,000 then we're going to print medium otherwise we're going to print small and
58:00
we're going to give the Alias of transaction size so the column is going to be
called transaction size and you
58:06
can see here that we've got large small large small large why is that the case
58:12
because here although this meets the criteria for large this one does not so
58:18
it meets the criteria for small this one again meets the criteria for large but
this one does not so our criteria is is
58:26
working because both of those things must be true in order to print as we
58:33
expected now we looked at adding these backslashes these are not necessarily
mandatory features within your code the
58:40
code will run without them the reason I like to put backslashes is just for
readability so that when you're
58:46
scrolling through the code you can see that this is a continuation of the line
above is a continuation of the function
58:54
above; it's just a nice way to denote that that is the case.
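A sketch of the chained version with multiple when/then branches; because the whole expression sits inside the parentheses of select, line-continuation backslashes are not strictly needed, they are only there for readability in the video:

import polars as pl

df = pl.read_csv("course_file.csv", has_header=True)

df.select(
    pl.col("ul_bytes"),
    pl.col("dl_bytes"),
    pl.when((pl.col("ul_bytes") > 500_000) & (pl.col("dl_bytes") > 500_000))
      .then(pl.lit("large"))
      .when((pl.col("ul_bytes") > 250_000) & (pl.col("dl_bytes") > 250_000))
      .then(pl.lit("medium"))
      .otherwise(pl.lit("small"))
      .alias("transaction_size"),
).head(5)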
59:01
Next we define a user-defined function, and this is where we define a traditional
59:07
python function and we pass our data into it and we process the data using
59:12
that python function the reason that is so important is because python is super
59:18
extensible you've got hundreds of libraries loads of functionality things you can
do with your data in Python
59:24
which aren't necessarily available out of the box within Polars so having the mixture
of the two being able to run
59:30
your traditional python functions on top of your Polars data frame is super useful
so let's define a function as we
59:39
always would in Python we're going to call this calculate bites and it's going to
ingest the row
59:46
of the data frame so we're just going to put Row in there now because your data is
not
59:52
necessarily always going to be exactly as you hoped it to be we'll just use some
try and accept here so we're going to say try to set DL bytes to be the
1:00:02
value of row now what makes this confusing is
1:00:08
that we can't actually refer to the column name here we have to refer to the
1:00:14
column index number so if I just print DF do head of two we'll see our data
1:00:22
index numbers start at zero so date time is our first column but it's index zero
1:00:27
so we've got index0 index one index 2 index 3 index 4 and so on so we don't
1:00:35
refer to them with their specific column names we just refer to them with the
1:00:40
index so DL bytes is index 0 1 2 3 so I
1:00:46
want DL bytes to be three and I want UL
1:00:51
bites to be equal to and we're just casting them as integer just to make sure that
they are the data type we want
1:00:57
them to be and in this case we know that UL bytes is the next column along so
that's going to be column number
1:01:05
four we are then going to Simply return DL
1:01:10
bytes plus UL bytes and the final thing we're going to do is we're going to have
1:01:16
an except statement so if that function doesn't work for any reason
1:01:21
like we've got wrong data types or anything like that then we want to do something
different so if there's an
1:01:26
error in our code or an error in processing the data we want to do something
different and in this case
1:01:32
we're just going to return the value of one so if you are unable to add DL bytes
1:01:38
and UL bytes together for any reason just return the number one so that is our
python function we
1:01:46
now need to apply that to our Polars data frame so we're going to do df.with_
1:01:54
columns and we're going to apply our function so we say df.apply because
1:01:59
we want to apply a function to the data frame of data our function's name is
1:02:05
calculate bytes so we're going to apply calculate
1:02:11
bytes and we now need to define the return type so we expect the value that
1:02:18
comes out of this addition to be an integer why because we've got an integer
1:02:23
added to another integer so it really can't anything else other than an integer
turn return type so we just do
1:02:31
return D type and we're going to Define that as
1:02:36
PL do int64 so that says return the type of a
1:02:42
64-bit integer we'll then just do head of five
1:02:50
so when we run that we can now see that to the very right we have a column called a
apply and it is the sum of this
1:02:59
and this which is great the problem is it's called apply which isn't really
1:03:04
very nice it's just what happens by default so we can put here rename and we
1:03:10
can simply rename the columns we put them into a dictionary and we say I want to
rename the apply column and what I
1:03:17
want to rename that is called total bites and so when we run that now we get
1:03:24
a total bite column which is just the renamed version so let's have a quick look at
that statement then because it
1:03:31
gets a bit confusing we want to take our data frame with a new column it's going
1:03:36
to add this total byes column the result of that column or what that column should
be populated with is the result
1:03:44
of applying the calculate bytes function that we've just defined and when we
1:03:50
apply that function we expect the returns data type to be of an integer 6
1:03:56
bit so 64-bit integer as the return data type then simply just because the field
1:04:03
is going to be called apply and that's not very informative we're going to rename
that field after we've done the
1:04:10
calculation so we're going to rename the apply field just to be called total bytes
and then we apply that and we get
1:04:17
our total bytes column just here so just to review end to end then we've done a new
1:04:24
function called calculate bytes it ingests a row of the data frame so it goes
one row at a time and ingests that
1:04:31
data for each row it's going to try to extract the DL bytes and UL bytes and
1:04:38
then it's going to try and return those two values added together if for any
1:04:43
reason our code fails or there's a problem with our data we will get an exception
in the case that we get an
1:04:50
exception we just want to return the number one as the value into our column
1:04:56
we're going to apply this function to our data by simply taking our data frame and
using the with_columns function so
1:05:04
we want to add a column to our data frame we're going to add it by applying a
function to our data frame that
1:05:11
function is the calculate bytes function and that function we expect to return a
1:05:16
64-bit integer we then rename the column from the default apply value to the
1:05:23
total bytes title that we see down here so that's how we apply a python function
1:05:30
to our Polars data frame so that we can start to extend the capabilities of the
Polars data frame to encapsulate all
1:05:37
the cool things that Python can do.
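The apply/rename spelling used in the video has been renamed in newer Polars releases, so here is a hedged sketch of the same idea using pl.struct and map_elements (called apply in older versions); passing a struct means the function sees each row as a dictionary keyed by column name rather than by position:

import polars as pl

df = pl.read_csv("course_file.csv", has_header=True)

def calculate_bytes(row: dict) -> int:
    """Add download and upload bytes; fall back to 1 if anything goes wrong."""
    try:
        return int(row["dl_bytes"]) + int(row["ul_bytes"])
    except Exception:
        return 1

df = df.with_columns(
    pl.struct(["dl_bytes", "ul_bytes"])                      # bundle the two columns
      .map_elements(calculate_bytes, return_dtype=pl.Int64)  # run the UDF row by row
      .alias("total_bytes")                                  # named directly, no rename step
)
print(df.head(5))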
1:05:42
So we've seen how to do some cool stuff with a Python UDF: we took our DL bytes and UL bytes values and we just
1:05:48
added them together as part of a python UDF clearly it's not very complex but it
1:05:53
gives you an example of how we could actually take a python function and apply
it to our Polars data
1:05:59
frame now we're going to just take another simple example just to reinforce the
point so we're going to define a new
1:06:06
function and we're going to call this make milliseconds and again it's going to
1:06:13
take a row of the data frame so for each row we're going to apply this function as
before we're going to use a
1:06:20
try so try to run this function and as long as there's no error or no exception
1:06:27
then this should be what we do we're then going to take the transaction time
1:06:32
column so transaction time and if you remember it needs to equal the index of
1:06:40
the column so if we go back up here we've got date time is index zero index 1 2 3 4
5 6 7 8 so transaction time is
1:06:52
index number eight in our data frame so let's just Define that then so the
1:06:57
transaction time is index number 8 we want to calculate milliseconds from
1:07:02
seconds so we're just going to multiply it by 1,000 so
1:07:08
milliseconds is equal to transaction time multiplied by
1:07:15
1,000 and then we're going to return milliseconds so that's our function it's
1:07:22
simply going to take the data from the transaction time column which is index
1:07:27
number eight of our data frame it's going to multiply it by 1,000 and then it's
going to return the resulting value
1:07:34
we then of course need our except block to say if you run into any errors if our
1:07:39
data causes you issues if you can't complete the function then just return
1:07:45
one so that's our function now we need to apply that function so we're going to do
df.with_columns so we want to add a
1:07:52
column to our data frame and that column is going to be the result of applying a
1:07:57
function to our data the function we're going to apply is called make
1:08:04
milliseconds which is the one we've just defined up here and the return data
1:08:09
type is going to be a float in this case so we'll just use float 64 as a test so
1:08:17
why is it going to be a float because if we come up here and we say 2.33 * 1000
1:08:23
in some cases you may have longer transaction times I don't know so it is
1:08:28
a possibility that it could still be a floating Point number once it's multiplied
so we'll just use float as a
1:08:35
reliable type so we'll just do a .head on this
1:08:41
and have a look so float 64 is not a defined data
1:08:47
type and that's because I haven't got pl.Float64 so that's a good one to catch
that always make sure that you
1:08:54
prefix your data types with pl. which of course is the reference to the Polars
library so in this example we
1:09:01
have our transaction time of 2.33 and that has been multiplied so it's now
1:09:07
2330 as an output of our function again we see that the name of
1:09:13
the column is apply which is not very nice so we can simply before we do
1:09:20
anything with the data we can rename the column and we're just going to rename
1:09:25
the apply column to be transaction milliseconds and when we run that now we
1:09:32
get the column renamed to transaction milliseconds and our function has worked
1:09:38
perfectly so if we just run through it one more time we're defining a python
function called make_milliseconds and it
1:09:45
processes each row of our data frame and it tries to take the transaction time
1:09:51
from index number eight of the data frame which is the index number of the
transaction time column remember the
1:09:58
indexing starts at zero so when you're counting across do keep that in mind
1:10:03
we're then going to calculate milliseconds by taking that transaction time value
and multiplying it by
1:10:09
1,000 and if all of that has been successful we're going to return milliseconds as
1:10:15
the value for the column if that isn't successful for whatever reason it's failed
we'll return the number one we
1:10:22
then are going to apply that function to our data frame so we're going to take our
df and do .with_columns so we're going to
1:10:28
have a new column in our data frame and that column is going to be populated by
applying a function to the data frame
1:10:36
that function is make milliseconds and the data type that we expect that function
to return is a float 64
1:10:42
remembering that we must prefix the data type with pl. which is the shorthand for
Polars we then finally rename the field
1:10:51
so we're going to rename the apply field to be transaction milliseconds and then we
look at the head of the data frame so
1:10:57
that we can see the first five rows so that's another quick example of how we can
apply a python function, a
1:11:04
user-defined function, to our Polars data frame.
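A minimal sketch of the same idea, assuming a column called transaction_time (the course addresses it by position instead); the second with_columns shows that a plain expression gives the same result without a Python UDF and is usually much faster:

    import polars as pl

    df = pl.DataFrame({"transaction_time": [2.33, 0.5, 1.75]})  # stand-in data

    df = df.with_columns(
        pl.col("transaction_time")
          .map_elements(lambda t: t * 1000, return_dtype=pl.Float64)  # named .apply in older Polars
          .alias("transaction_milliseconds")
    )

    # Equivalent result as a pure expression, no Python UDF needed:
    df = df.with_columns(
        (pl.col("transaction_time") * 1000).alias("transaction_ms_expr")
    )
    print(df.head())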
1:11:10
Now let's start looking at aggregation: how can we better understand and make use of the data in our data
1:11:15
frame so the way we're going to do it is let's use a couple of examples so we'll do
df.select and we'll start by
1:11:24
simply summing the entirety of the DL bytes column so we just do pl which is
1:11:30
Polars, .sum and we do DL bytes we can then apply it with an
1:11:37
alias and we'll just call that total DL
1:11:42
bytes and then we can do - well we don't even need to do a .head because it should
just be one row so there we go
1:11:49
this is the total number of bytes in our data frame we can expand this out
1:11:55
as much as we like so we might now want to take a few more examples so we'll use
1:12:02
lots of DL bytes examples here we can take the Min so the minimum of DL bytes
1:12:08
across the entire column what is the smallest value we can take the max so
1:12:15
across the entire column what is the largest value of DL bytes we can even
1:12:22
calculate the standard deviation by just taking STD so we can do
1:12:29
standard deviation of DL bytes and we could look at the
1:12:35
median so the median of DL bytes we'll change the aliases accordingly we can look at the
1:12:46
mean and we'll take here the mean of DL bytes something that's quite interesting
1:12:52
as well is we can also take the first value so what is the first value of DL
1:12:57
bytes in the data frame that's probably enough examples for here but we can see
1:13:03
we now have lots of columns so the total DL bytes is this very large number the
minimum DL bytes across the entire
1:13:10
table is 271 so the smallest value across any
1:13:16
Row in the data frame is 271 the maximum value on a given Row for DL bytes is
1:13:23
this and so on we get standard deviation the median the mean and the first value
1:13:29
so that's pretty cool it gives us a nice way to produce a single aggregated row from
1:13:35
our data frame.
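Here is a hedged sketch of that select, assuming a dl_bytes column name and a stand-in frame:

    import polars as pl

    df = pl.DataFrame({"dl_bytes": [271, 500, 123456]})  # stand-in data

    print(df.select([
        pl.col("dl_bytes").sum().alias("total_dl_bytes"),
        pl.col("dl_bytes").min().alias("min_dl_bytes"),
        pl.col("dl_bytes").max().alias("max_dl_bytes"),
        pl.col("dl_bytes").std().alias("std_dl_bytes"),
        pl.col("dl_bytes").median().alias("median_dl_bytes"),
        pl.col("dl_bytes").mean().alias("mean_dl_bytes"),
        pl.col("dl_bytes").first().alias("first_dl_bytes"),
    ]))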
1:13:43
But what if I wanted to say, what is the total number of DL bytes for each country? So what's the total number of DL bytes for England, the total number of
DL bytes for Wales, and so on? With this
1:13:51
format you would have to filter the data frame, have a few different data frames,
get the different values and it's
1:13:56
just a bit clunky and it's not great so we have another option we can define a
1:14:03
query so in this case we'll just use q and in this query we're going to first do df dot
1:14:09
lazy and this means that the calculation is not executed until we call the
1:14:16
collect command so it will look through our calculation understand what it does but
until we actually execute the
1:14:23
collect command it will not actually do anything which is different to how
1:14:29
pandas and other libraries work so we're then going to do a .group_by and I'm
going to do
1:14:37
country so Group by country just means for each country so for each country I
1:14:45
want to run an aggregation so we do .agg - I want to make an aggregation for each
1:14:51
country in the data set so when we open the brackets we can then do exactly what
1:14:56
we've done above so we can do pl.sum of DL
1:15:01
bytes and then we can give it an alias and we'll do sum of DL
1:15:08
bytes so that now has enabled us to do this so we're going to execute this
1:15:15
lazily we're going to have a group by so for each country for each country we're
1:15:21
going to run an aggregation so for each country we're going to sum the total number
of DL bytes and the column name
1:15:27
is going to be sum of DL bytes we can then come right down to the
1:15:33
bottom and we can simply do q.collect so as I say this now forces it to run
1:15:38
and it forces it to give us the result so now for one country value that looks like a misspelt England we have a certain
1:15:45
number of bytes (so maybe there's a misspelling in the data frame somewhere), for
England this is the total bytes
1:15:51
Ireland this is the total bytes and so on so that's a nice basic aggregation in the
next video we'll
1:15:58
start looking at more complex aggregations and how we can have multiple
aggregations multiple columns to group by and all that sort of thing
1:16:05
but right now let's quickly review the query one last time so the query is
1:16:10
defined as q and within the brackets we do a number of things we just tell it
1:16:15
it's going to be executed lazily and that means it doesn't actually run the code
until we get to this point until we
1:16:22
call the collect we're going to group by country so that says for each country
1:16:28
we should run an aggregation - so for each country we're going to run an
aggregation and that aggregation is a
1:16:35
sum on the DL bytes column and the result of that should be stored in the column
1:16:41
called sum of DL bytes - we've set an alias for that when we do the collect it forces
it to run and it collects the
1:16:47
results and this is the resulting data frame in the next video as I say we'll go
through some more complex examples
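A minimal sketch of that lazy group-by, with stand-in data and assumed column names; note the group_by spelling was groupby in older Polars releases:

    import polars as pl

    df = pl.DataFrame({
        "country": ["England", "Wales", "England"],
        "dl_bytes": [100, 200, 300],
    })  # stand-in data

    q = (
        df.lazy()                      # nothing executes yet
          .group_by("country")         # spelled .groupby(...) in older Polars
          .agg(pl.col("dl_bytes").sum().alias("sum_dl_bytes"))
    )
    print(q.collect())                 # collect() triggers the actual computation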
1:16:56
in this video let's start looking at building up our aggregations so we're going to
start again with our same query
1:17:04
which will be encapsulated in brackets so again we're going to do df.lazy - it's just
standard practice that this is a
1:17:11
lazily executed command - we're not going to run it until we do a collect - and then
we're going to do a .group_by
1:17:18
in this case we're going to group by platform so for each platform we need to
1:17:24
run an aggregation so we're going to do .agg and within these brackets we're going to
define a number of aggregations the
1:17:31
first is we're going to do a count so pl.count and that's simply going to count
the number of rows so for each
1:17:38
platform how many rows in our data frame are there so how many rows is platform
1:17:43
equal to Linux how many rows is platform equal to Windows and so on so we're going
to call that an alias and we're
1:17:50
just simply going to call that row count that is our first aggregation the second
1:17:55
aggregation (excuse the typo there) - our second aggregation is going
1:18:02
to be a sum so we're going to take the sum of DL bytes so for each platform
1:18:08
what is the total number of DL bytes if I add together every row related to that
1:18:14
platform we could add an alias to that as well if we wished but we won't do in this
case we're also going to do
1:18:21
something new we're going to do pl.col of user ID this is a really
1:18:27
interesting function because this essentially captures a list of every single value
in that column so for
1:18:36
platform for example Linux it will capture every single user ID associated
1:18:43
with Linux so where platform is equal to Linux what was the user ID it captures
every single one in a
1:18:50
list the final one that we'll look at is first so in this case we'll do pl.first
1:18:57
and simply take the HTTP method
1:19:02
column so we'll now just do q.collect as we did before and let's have a look at
what we get so for each platform so
1:19:10
we've got Android Windows Linux iOS and Mac the first one was a row count they're
all pretty similar Androids
1:19:16
slightly lower than the others but broadly they're all around 3,000 the total
number of bytes here is very
1:19:23
interesting - again it's just the sum of the number of bytes - but here is the really
interesting column so this is each of
1:19:32
these values is a user ID so this user ID used Android this user ID also used
1:19:39
Android so it gives us a full list of every user but compacted into one row
1:19:44
which we can then process as part of a python UDF or something similar and then
HTTP method so the very
1:19:51
first transaction that happened on Android was HTTPS which
1:19:57
is interesting to know so that is how we
1:20:03
Define our more complex query but if we go in here we might want to say actually
1:20:09
I want to sort the output of this query I want the very largest row count to
1:20:15
show first so we simply just say sort by row count which is the Alias that we've
1:20:22
defined up here and then we can do
1:20:27
descending is equal to true so that means make the largest value at the top
1:20:32
and the smallest value at the bottom so when we run that we then see that the order
has changed Windows has the
1:20:40
highest row count of 3055 iOS has the second highest of 3033 and so on so we
1:20:47
now have a way to order the output of our aggregation as well which is of course
very
1:20:53
useful.
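A sketch of that multi-aggregation query with the sort added, again with assumed column names and stand-in data; pl.count() is being replaced by pl.len() in newer Polars, and the sort keyword was reverse rather than descending in older releases:

    import polars as pl

    df = pl.DataFrame({
        "platform": ["Linux", "Windows", "Linux"],
        "dl_bytes": [100, 200, 300],
        "user_id": [1, 2, 3],
        "http_method": ["https", "http", "https"],
    })  # stand-in data

    q = (
        df.lazy()
          .group_by("platform")
          .agg([
              pl.count().alias("row_count"),   # pl.len() in newer Polars
              pl.col("dl_bytes").sum(),
              pl.col("user_id"),               # gathers every user_id into a list per platform
              pl.col("http_method").first(),
          ])
          .sort("row_count", descending=True)
    )
    print(q.collect())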
So now we're going to start looking at a super useful function, and this is a rolling function. We're
1:21:00
essentially having this sliding window across our data and at any given time it
1:21:05
will say over the course of the last two rows three rows five rows whatever we
determine it to be what is the average
1:21:14
number of DL bytes or the Min or the max or whatever calculation we want it to be
1:21:19
and imagine that's really useful if you have a sales data set and you might say
over the last 3 months what is the average number of sales so that we can
1:21:26
start making comparisons on a quarter by quarter basis for example so what we
1:21:31
want to do first of all is we want to do a sort so df.sort by datetime and
we're going to do that in
1:21:39
descending is equal to false we want this in ascending order we want the very first
record the earliest point in time
1:21:47
in the datetime column to be the first record in our data frame so it's all
1:21:53
nice and pre sorted and when we do our rolling windows it's always referencing the
date prior to
1:22:00
itself so we can simply do select here and we can start working with our data
1:22:06
so we might want the pl.col of datetime we'll bring that one in and we'll
1:22:12
also bring in the pl.col of DL bytes so
1:22:17
these are our columns we can then use to refer to in order to understand what our rolling
functions are doing so if we do pl.col
1:22:25
of DL bytes again but now we're going to
1:22:30
say show me the rolling max so over
1:22:36
a specified window of time what is the maximum value that we see we're going to
1:22:41
set the window size to be two and we're going to give it an alias of
1:22:48
max so if we now do a head of 10 let's say on the data frame we'll look at
1:22:54
the top 10 rows and we'll understand what's happening here so for each row
1:22:59
you can see the top one is null why is it null because it's looking at the two rows
and it's going to give you the
1:23:06
highest value of those two rows clearly the first row in the data frame has nothing
prior to look
1:23:11
at so if we look on row two it's now looking at this row and this row so Row
1:23:18
one and two which one of these is the highest value of course it's row two so
1:23:23
it takes in the max column that value on the next row is looking at this value
1:23:29
and this value and we're now saying what's the maximum value well actually row
two's value is higher than Row three
1:23:35
so that comes down onto the max column the next one is comparing the next pair of values
and this one is of course the
1:23:43
higher value so that is taken we can do this over a bigger window to make it
1:23:48
even clearer so over a five-row window where we're saying over the last five
1:23:55
values what's the highest number and the highest number was on transaction two so
1:24:00
on our fifth transaction or our sixth transaction we are now defining that
1:24:05
number why because of these transactions here that is the highest
1:24:11
number here again within that five transaction window this is the highest number
and so on so it looks over a set
1:24:18
period of time and it takes the maximum value within that set period within
1:24:23
those currently five windows or five rows of data we can do this again we'll just
1:24:30
copy it for ease but we can do this with other functions as well so we've got pl.
1:24:36
DL bytes again but instead we can set min and we'll set this one to be min
1:24:43
we can set - what else have we got - median and we can set again the
1:24:50
alias as median and we can do this actually for every one of our
1:24:57
aggregate functions so we can do median, mean, we could do standard deviation
1:25:03
variance all that sort of thing but they all work in exactly the same way where
they look at the number of specified
1:25:11
windows or the number of specified rows in your data frame so in this case within a
two row window what's the
1:25:19
highest number what's the lowest number what's the median number and so on
1:25:26
so that works really nicely it enables us to compare time periods on a rolling
1:25:32
basis which is absolutely perfect for time series analysis.
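A small sketch of those rolling calculations over a stand-in, pre-sorted frame (column names assumed):

    import polars as pl
    from datetime import datetime

    df = pl.DataFrame({
        "datetime": [datetime(2023, 1, 1, 0, 0, s) for s in range(6)],
        "dl_bytes": [402, 910, 405, 380, 777, 123],
    })  # stand-in data

    print(
        df.sort("datetime")
          .select([
              pl.col("datetime"),
              pl.col("dl_bytes"),
              pl.col("dl_bytes").rolling_max(window_size=2).alias("max_2"),
              pl.col("dl_bytes").rolling_min(window_size=2).alias("min_2"),
              pl.col("dl_bytes").rolling_median(window_size=2).alias("median_2"),
          ])
          .head(10)
    )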
window
1:25:40
functions and window functions allow us to calculate something over a given uh
1:25:46
criterion or term so we might want to say show me - so we'll do df.select
1:25:56
and I want you to show me first of all the country column I also want you to
show me the
1:26:04
user ID column so what we're going to say is I
1:26:09
want you to tell me the total number of DL bytes for each
1:26:14
country and I want you to set an alias for that so we're going to say pl.col
1:26:20
of DL bytes we're then going to run our aggregation of sum and we're going to
1:26:28
say .over and pass it country so this essentially means for
1:26:34
each country sum the total number of DL bytes so if we now do our .head of five
1:26:42
there we can have a look here so we have two rows showing England for example and
1:26:49
you will know that the number in here is exactly the same
1:26:54
why is that because they are the same country we said for each country sum the
total number of DL bytes
1:27:02
and it has - it said for England this is the total number of DL bytes, this is also
England so it should have the same value
1:27:09
this row is Wales and ends in 796, this one is also Wales and ends in 796, so it gives you a
1:27:14
consistent value across all of them.
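A minimal sketch of that window expression, with assumed column names and stand-in data:

    import polars as pl

    df = pl.DataFrame({
        "country": ["England", "England", "Wales"],
        "user_id": [1, 2, 3],
        "dl_bytes": [100, 250, 40],
    })  # stand-in data

    print(df.select([
        pl.col("country"),
        pl.col("user_id"),
        pl.col("dl_bytes"),
        pl.col("dl_bytes").sum().over("country").alias("window_bytes"),  # per-country total on every row
    ]).head(5))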
1:27:19
Why is that useful? Let's have a look at an example: if we do pl.col and we'll just bring our DL bytes column in here as
1:27:26
well so what we might want to do so what's
1:27:32
what's the problem here - column with name DL bytes has more than one occurrence - and
that's because we haven't set an alias
1:27:38
so if we just set an alias on here and we'll call this one window
1:27:43
bytes and then we run that so we've got DL bytes and we've got window bytes why is
having this consistent
1:27:51
value useful you might say what percent of the overall number of bytes does this
1:28:00
transaction contribute towards England's total so we've got 872,807 - what
1:28:07
percentage does this transaction count towards the entire sum
1:28:12
of England bytes and we can work that out because we've got the two columns if we
think about a more sort of tangible
1:28:19
example we might say these are our monthly sales what percent of our
1:28:24
overall sales for a given country does this month represent and so we'll do a
window function and say for each month
1:28:32
or process it over each month then we're going to sum the total number of sales
1:28:38
so it just gives us a way to bring this different level of aggregation into the
same data
1:28:44
frame so we can do exactly the same again we could have some min max average
1:28:52
whatever you might want to look at so this is just another way to run aggregations
where we're saying for a
1:28:58
specific category in this case country so for each country take the minimum
1:29:04
number of bites the maximum the mean whatever it might be and that enables us to
print a different level of
1:29:10
aggregation within the data frame so that we can make comparisons and derive
insights from that
1:29:17
data.
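As a sketch of that percentage idea (stand-in data, assumed column names):

    import polars as pl

    df = pl.DataFrame({
        "country": ["England", "England", "Wales"],
        "dl_bytes": [100, 250, 40],
    })  # stand-in data

    print(
        df.select([
            pl.col("country"),
            pl.col("dl_bytes"),
            pl.col("dl_bytes").sum().over("country").alias("window_bytes"),
        ]).with_columns(
            # each row's share of its country's total, as a percentage
            (pl.col("dl_bytes") / pl.col("window_bytes") * 100).alias("pct_of_country")
        )
    )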
So let's take a little bit of time to talk about filtering. We have done some filtering throughout the course but
1:29:23
we've not spent time specifically talking about how to filter a data frame
1:29:29
so it simply just means only show me the rows where something is true so in our
1:29:35
example we might want to do df.filter so apply a filter to the data frame we've
1:29:41
got pl.col so on a specific column - that column is going to be HTTP method
1:29:47
and I want you to only show me the records where that HTTP method is equal
1:29:53
to http so then when we take our head of five we can see that now the HTTP
1:30:00
method is only HTTP if we swap that out to https then
1:30:06
when we run it it swaps it so the HTTP method only shows https traffic so just
1:30:13
a quick review we're applying a filter we're applying that filter on this column
and the condition we want to be
1:30:20
met is this that it must be equal to https we can then apply multiple filters
1:30:27
as well so we could do df.filter and we're going to apply multiple
1:30:33
filters which again as with all our other functions sit within their own brackets
so we're going to here say
1:30:40
where pl.col of platform is equal to
1:30:49
Mac and then outside the brackets we say and
1:30:54
where the pl.col of country is equal to
1:31:01
England and then we'll just do head and select only the top five rows so what is
1:31:07
this showing us well if we scroll across the platform is always Mac and the
1:31:12
country is always England so we're saying apply a filter to the data frame
1:31:19
we've got two conditions in here the first filter is to filter the platform to only
be Mac and we have the ampersand
1:31:26
that says and so the platform must be Mac and the country must be England so
1:31:34
this applies both filters we can also use or so we just replace that ampersand with
1:31:40
a pipe which is just a vertical line when we run that we now get a different set of
values why do we get this because
1:31:48
we've got the country is equal to England - the platform is Android though but
that's okay cuz we've got the
1:31:55
or sign so either one of these conditions has to be true the country either has to
be England or the platform
1:32:03
has to be Mac whereas when we had the ampersand both had to be true the platform
1:32:10
had to equal Mac and the country had to equal England and hence we got a slightly
different result set right so
1:32:17
Right, so we've been through lots of functionality in this course; we've just gone through
each function and demonstrated some examples
1:32:22
of that now let's try and solve some challenges so the first challenge for you to
solve is that you need to find
1:32:29
the five most heavy users so to do that we can calculate the UL bytes
1:32:38
plus DL bytes as a total bytes column after we've done that we then
1:32:46
want to sum the values of the total bytes
1:32:54
for each user so we're going to want to group by user and sum the total bytes so at
this
1:33:02
point we now know the total amount of bytes that each transaction has used we've
summed the total number of
1:33:08
bytes that each user has consumed and then we want to sort the data
1:33:14
frame by our aggregated
1:33:19
fields in descending order then if we simply do a head of five we will get the
1:33:29
five top records so to solve that problem we're going to calculate a new field
which is going to be UL bytes plus
1:33:35
DL bytes we've got our total bytes field we're then going to group by user and the
aggregation we're going to run is a
1:33:42
sum on the total bytes field that we've just calculated we're then going to sort
1:33:47
that data frame by that new field in descending order and when we take the
1:33:53
five top records using head we will then have our five heaviest users the Second
1:33:59
Challenge is as below so we're going to use a window function to find the total
1:34:06
number of UL bytes for each country so we're going to do window
1:34:11
function over country where we sum UL
1:34:18
bytes we are then going to work out the percentage
1:34:23
of each transaction against that total so we're going to take UL bytes and
1:34:30
we're going to divide it by the column we've created up here so if we have that as
an
1:34:36
alias as country bytes then we're simply going
1:34:41
to do this and that will solve our problem so for each transaction we're going to
work
1:34:47
out the UL bytes divided by the total number of bytes for the country and that
1:34:53
will give us a percentage value for that
1:34:58
transaction so let's go away have a go with those in the next video we'll revisit
them and make sure that we've
1:35:03
got the right answers so let's have a look at the answer to the first question that
question was to find the five most
1:35:11
heavy users and we Define that as calculating UL bytes plus DL bytes and
1:35:16
calling that total bytes we then sum the total bytes column per user so we're
1:35:22
going to group by user ID and we're going to sum the total bytes we then sort the
data frame based on that total
1:35:29
bytes field in descending order when we take the top five rows we will have the
1:35:35
five heaviest users so here's our answer we create a new data frame called users
1:35:41
and that data frame is equal to this we select The Columns of user ID download
1:35:47
bytes and UL bytes from our original data frame we then do a calculation
1:35:53
which says take the DL bytes column and add it to the UL bytes column and that's
1:36:00
going to create a new column and the Alias for that column is going to be total
underscore bytes, so total_bytes - that's the
1:36:06
column name that I want to see in the output table we then need to group by a user
1:36:13
and sum the total bytes so we create a new query and we say group by user ID
1:36:19
and run an aggregation and that aggregation is to sum the total bytes
1:36:24
column and give it an alias of total bytes so in this case we have calculated
1:36:31
total bytes up here so it's download bytes plus upload bytes we're now saying for
each user if you have five rows 10
1:36:38
rows however many rows relate to this user sum that field for each user and
1:36:43
give me the total number of bytes they used across all of those rows we then
1:36:48
sort the table by total bytes in descending order because we've said descending is
equal to True finally we
1:36:56
just collect the data and look at the head of five rows which is the top five rows
and there we have our user ID total
1:37:03
bites and these are our heaviest users the next one we wanted to look at
1:37:09
was to do a window function for each country and work out what percent of
1:37:14
each country's overall transaction value each transaction represented so if we
1:37:21
look down here we create a new data frame called df2 that data frame is equal to
this we select the country
1:37:28
column the user ID column the UL bytes column and then we run our window function so
here we're saying that we're
1:37:36
going to sum the UL bytes column over country - really you read this the other way
around you say for each country I
1:37:43
want to sum the UL bytes column and so that's going to give us a total number
1:37:48
of UL bytes for each country we then have an alias which is the new column name and
we call that country
1:37:55
bytes we can then use that so we then say select the user ID and the number of
1:38:00
UL bytes and then we do a calculation and that simply says take the UL bytes
1:38:05
value and divide it by the country bytes value and that will give us the percent
1:38:11
of the overall transaction volume that this particular row represents and at
1:38:17
the very end we multiply it by 100 just to give it to us as a percent rather than
as a decimal we then give it an
1:38:24
alias of percent so we get the column named percent and here we can see that the
percentage is calculated and they are
1:38:31
all quite low because each individual transaction does not contribute hugely
1:38:37
to the overall country usage.
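And a hedged sketch of that second answer (stand-in data, assumed column names):

    import polars as pl

    df = pl.DataFrame({
        "country": ["England", "England", "Wales"],
        "user_id": [1, 2, 3],
        "ul_bytes": [10, 5, 60],
    })  # stand-in data

    df2 = df.select([
        pl.col("country"),
        pl.col("user_id"),
        pl.col("ul_bytes"),
        pl.col("ul_bytes").sum().over("country").alias("country_bytes"),
    ])

    print(df2.select([
        pl.col("user_id"),
        pl.col("ul_bytes"),
        (pl.col("ul_bytes") / pl.col("country_bytes") * 100).alias("percent"),
    ]).head(5))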
1:38:42
Well done on completing the course. You've demonstrated dedication, perseverance and commitment to enhancing your data
analysis skills. Your
1:38:49
achievement is a testament to your hard work and a significant step forward to
becoming a data analysis expert. By
1:38:55
finishing this course you've gained valuable knowledge and practical experience in
data ingestion cleaning
1:39:02
aggregation and manipulation using the powerful Polars library in Python. These
1:39:08
skills will undoubtedly serve you well for future endeavors whether in your career
personal projects or academic
1:39:14
Pursuits once again congratulations on your accomplishment and great work

"
