How would you summarize an entire book using a language model? I asked myself the same question, and I went on a mission to find out. Summarizing bodies of text is one of the prime use cases when working with language models. It's extremely valuable to be able to distill the important pieces of information from long bodies of text. People summarize articles, financial documents, chat history, tables, pages, books, song lyrics, and way too many more to count. In this video we're going to go from novice to expert and review the five levels of summarization. Let's jump into it.

Alright, for the five levels of summarization, novice to expert, we're going to summarize a couple of sentences, a couple of paragraphs, a couple of pages, an entire book, and we're also going to summarize an unknown amount of text. What does that mean? You'll have to wait till the end and see.
Here, the first thing we're going to do is import our OpenAI API key, and for level one we're just going to do a basic prompt: summarize a couple of sentences. I'm going to import OpenAI from LangChain, create my language model, and in this case I'm simply going to copy and paste some text from Wikipedia and put it inside of a prompt. For level one the prompt is "Please provide a summary of the following text". That's the instruction I want the language model to follow, and the text I'm giving it is a passage on philosophy from Wikipedia. I'm going to put that into a prompt variable, and let me get the number of tokens: this prompt has 121 tokens right now. The reason this is important is that as the number of tokens increases with larger documents, we're going to need to handle them differently. But 121 is pretty small, so let's go ahead and run this.
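Here's a minimal sketch of what that level one cell looks like, using the classic LangChain API (newer versions of LangChain have reorganized these imports); the Wikipedia passage is abbreviated, temperature is my own choice, and it assumes OPENAI_API_KEY is set in your environment:

```python
from langchain import OpenAI

llm = OpenAI(temperature=0)

# Abbreviated stand-in for the Wikipedia passage on philosophy
text = (
    "Philosophy is the systematized study of general and fundamental "
    "questions concerning topics like existence, reason, knowledge, "
    "value, mind, and language."
)

prompt = f"""
Please provide a summary of the following text.

TEXT:
{text}
"""

# Token counts matter: larger documents will need different handling later
print(f"Our prompt has {llm.get_num_tokens(prompt)} tokens")

output = llm(prompt)
print(output)
```

For the second run below, the only change is appending "Please provide your output in a manner that a five-year-old would understand" to the instructions.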
Output: "Philosophy is a systematized study of general and fundamental questions about existence, reason..." and so on. You know, that's still a little too complicated for me, so I'm going to adjust the instructions that I'm giving it to get a different type of summary, a different output: "Please provide a summary of the following text. Please provide your output in a manner that a five-year-old would understand." Let me do the same thing and put that into a prompt variable. Now we have a few more tokens, but it's still really not that much. Let's see what the output is: "Philosophy is about asking questions and trying to figure out the answers." That's a lot more digestible for me. Nice.
Let's move on to level two: prompt templates. Here we're going to summarize a couple of paragraphs. Again we're going to import OpenAI, but this time we're also going to import a prompt template, which is a really easy way to swap out different pieces of a prompt that we send to a language model.

Let's go ahead and load this. The two essays we're going to look at are Paul Graham essays: "getideas" and "noob". What I'm going to do is create an empty list called essays, and for each one of those essays I'm going to put it inside that list. Let's print out a preview of the essays to see what they look like. Essay number one: "Someone fed my essays into GPT to make something that could answer questions based on them." Cool. And essay number two: "When I was young, I thought old people had everything figured out." Awesome.

Now what we're going to do is use a prompt template and dynamically insert these essays into it. Here we have our template: "Please write a one sentence summary of the following text." Notice how I said one sentence instead of just asking for a summary. Let's go ahead and run this: we put it in our prompt template, and we have our essay variable, which corresponds to the placeholder right here. Then I'm going to loop through those two essays and get summaries for both of them. So for each essay in essays, which is our list of essays from up above, we format the prompt and put the single essay right inside of it, so our full prompt is the template plus the essay. We get the number of tokens, which we'll look at, and then we finally get a summary by taking the summary prompt and throwing it into our language model. Let's go ahead and run this.
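A sketch of that level two loop, again with the classic LangChain imports; the essay file names are placeholders for wherever you saved the text:

```python
from langchain import OpenAI, PromptTemplate

llm = OpenAI(temperature=0)

# Placeholder file names: any two plain-text essays will work
essays = []
for file_name in ["getideas.txt", "noob.txt"]:
    with open(file_name) as f:
        essays.append(f.read())

template = """
Please write a one sentence summary of the following text:

{essay}
"""
prompt_template = PromptTemplate(input_variables=["essay"], template=template)

for essay in essays:
    summary_prompt = prompt_template.format(essay=essay)
    print(f"This prompt + essay has {llm.get_num_tokens(summary_prompt)} tokens")
    summary = llm(summary_prompt)
    print(f"Summary: {summary.strip()}\n")
```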
The first one had 205 tokens, not too bad, and the summary is just one sentence: "Exploring anomalies at the frontier of knowledge is the best way to generate new ideas." Cool. The second prompt had 500 tokens, and its summary: "This essay explores the idea that feeling like a noob is actually beneficial." Nice, so we get a one sentence summary of both those essays, which is pretty cool.
Level three: this is where it starts to get a little bit more complicated. We're going to use a map reduce method, which means we're going to chunk our document up into pieces, get a summary of each of the individual chunks, and then finally get a summary of the summaries. Okay, so let's import OpenAI. We're going to import load_summarize_chain, which is a really easy convenience that LangChain provides in order to do this map reduce operation over a few documents, and RecursiveCharacterTextSplitter is what we're going to split our text with. Here we have a Paul Graham essay, and this startup ideas one I know is actually pretty long, so let's load it and see how many tokens it is: 9,500 tokens. Today that would be too big for GPT-3.5 and even GPT-4. In the future, as token limits increase, this likely won't be an issue, but it's good you're learning how to do this in case you ever run into the problem.
Okay, alright, so we have our docs, and I want to run this and see how many docs we have. What we did is split our document up: we ran our text through create_documents, and now we have our number of docs and the number of tokens in the first doc. We now have five documents, and the first one has about 2,000 tokens, so we went from one document with 9,500 tokens to five documents of roughly 2,000 tokens each. Then we're going to load our summarize chain. In this case we pass in the language model that we're using and specify the chain type, which is the type of operation or chain that gets deployed; in this case we want it to do the map reduce operation for us. So we loaded up that chain, and now I'm going to run it and put the result in an output variable.
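Roughly, that looks like this; the chunk size and overlap are illustrative, and the essay path is a placeholder:

```python
from langchain import OpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter

llm = OpenAI(temperature=0)

with open("startupideas.txt") as f:  # placeholder path for the essay
    essay = f.read()

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n"], chunk_size=8000, chunk_overlap=500
)
docs = text_splitter.create_documents([essay])
print(f"We have {len(docs)} docs; the first one has "
      f"{llm.get_num_tokens(docs[0].page_content)} tokens")

# map_reduce: summarize each chunk, then summarize the summaries
chain = load_summarize_chain(llm=llm, chain_type="map_reduce")
output = chain.run(docs)
print(output)
```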
Great, now we have our output: "Y Combinator explains that the best startup ideas come from looking for problems, preferably ones the founders have themselves." That's cool, but this is kind of long for me, so I'm going to create my own prompts instead of using the default ones that LangChain uses. Here I'm going to create my map prompt, and, in fact, I lied: this one is the same one that LangChain uses by default. But for the combine prompt I'm going to specify the format that I want: "Return your response in bullet points which covers the key points of the text." So I want it to respond in bullet points for me, not just in regular prose. Let's run this. I'm going to load up my summarize chain, and the important part here is that I pass in which map prompt I want and which combine prompt I want. Let's run this and look at the output.
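Continuing from the previous snippet (llm and docs are reused), here's a sketch of the custom prompts; the map prompt wording follows LangChain's default style, and the bullet point instruction is the part that changes the output format:

```python
from langchain import PromptTemplate
from langchain.chains.summarize import load_summarize_chain

map_prompt = """
Write a concise summary of the following:

{text}

CONCISE SUMMARY:
"""
map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])

combine_prompt = """
Write a summary of the following text.
Return your response in bullet points which covers the key points of the text.

{text}

BULLET POINTS:
"""
combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=["text"])

summary_chain = load_summarize_chain(
    llm=llm,
    chain_type="map_reduce",
    map_prompt=map_prompt_template,
    combine_prompt=combine_prompt_template,
)
output = summary_chain.run(docs)
print(output)
```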
Great, now we get a response, and you can see that we have bullet points in the summary. Let's move on to level four.
This is when you want to summarize an entire book, and it's actually what started this whole journey through the different levels of summarization. The method I've come up with (and I don't know if there's a better name for this; someone please tell me if there is, because I just made this up) is called best representation vectors. Instead of doing a map reduce operation over an entire book, can we extract the important sections of that book and then do a summary on just those sections? It's kind of like: hey, can we pick the 10 best sections from this book and then summarize those, without having to look at the rest of the text? The method I used here involves embeddings and clustering, so let's dive in to see how this works.
The first thing I'm going to do is import my book. The book I'm using is Into Thin Air, one of my favorites, which is about the 1996 Everest disaster. I'm doing some hand-wavy stuff here, but I only want the 26th page through the 277th page: that's the body of the book, without the footnotes, the table of contents, and things like that, just to make it a little easier. This is a PDF, so I'm going to run through it, take the page content, and put it into a regular text variable, and finally I'm going to replace the tabs with spaces, because there was some weird formatting coming out of the PDF. Oops, the PDF loader is not defined; let me fix the import and run this. Great, so now that we've loaded up our PDF and put it inside this text variable, let's see how many tokens this is.
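A sketch of that loading step; the file path and the exact page slice are specific to my copy of the PDF (treat them as placeholders), and PyPDFLoader needs the pypdf package installed:

```python
from langchain import OpenAI
from langchain.document_loaders import PyPDFLoader

llm = OpenAI(temperature=0)

loader = PyPDFLoader("into_thin_air.pdf")  # placeholder path
pages = loader.load()

# Keep only the body of the book; cut front matter, footnotes, contents
pages = pages[26:277]

text = ""
for page in pages:
    text += page.page_content
text = text.replace("\t", " ")  # clean up odd formatting from the PDF

print(f"This book has {llm.get_num_tokens(text)} tokens in it")
```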
And wow, what we get is almost 140,000 tokens. Even with GPT-4 32K, it would not be able to handle summarizing this entire book for us, so we're going to have to come up with another method. Now, the part I want to avoid is sending all 140,000 tokens to the language model, because I calculated it, and it would be roughly four or five bucks just for the prompt itself, not even counting the output, the completion. So let's come up with a different method.

My goal is to chunk this book and then get embeddings for each of the chunks. I want to pick a subset of the chunks that represents a holistic but diverse view of the book: an encompassing set of chunks, but ones that are different from each other, so that we capture the different parts of the book that may be the important parts. Put another way: can we get the top 10 passages from the book that describe it best? We're going to load our book into a single text variable, which we've already done; split our text into fairly large chunks; embed those chunks to get the vectors; and then, the interesting part, cluster the vectors to see which ones are similar to each other. I don't want to do a map reduce operation on similar chunks, because they're likely telling me the same thing. What I want is one representative from each cluster, so I get a diverse set. So I'm going to pick the embedding that represents each cluster the most, and my method for doing that is to take the one that's closest to the cluster centroid, which is just the middle of the cluster, because I figure that's where the most "average" meaning of that cluster is going to be. Then, finally, we're going to summarize the documents that those embeddings represent.

Okay, another way of putting it in plain English: which 10 documents from this book represent most of its meaning? I want to build a summary off those 10 documents. Alright, so we're just going to load in a bunch of stuff here and do the vector store dance.
Next we're going to split our text and put it into a bunch of docs. Let's see how many docs we actually have: 78 documents. You could try a regular map reduce method on this, but you'd have to go through all 78 of those documents, and you don't want to do that. You'll notice that my chunk size is 10,000 characters, so it's on the larger side. Then we're going to create our embeddings and actually get our vectors out the other end. Let's go ahead and do that.
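Here's a sketch of the split-and-embed step, reusing the text variable from the PDF step; the overlap value is the knob I'd expect you to tune:

```python
import numpy as np
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", "\t"], chunk_size=10000, chunk_overlap=3000
)
docs = text_splitter.create_documents([text])
print(f"Now our book is split up into {len(docs)} documents")

# One embedding vector per chunk
embeddings = OpenAIEmbeddings()
vectors = np.array(embeddings.embed_documents([d.page_content for d in docs]))
```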
Cool, now that we have our vectors, I want to cluster them to see which groups pop out. I'm going to specify that I want 11 clusters right here; you should specify what you want for your book, and it's going to take trial and error to see what works best for you. My clustering method is k-means. I actually went down a pretty complicated path to see which other clustering methods would work better for me, and this one worked out the best. Now, I know there are going to be a lot of data science experts out there who are way smarter than me, and they're going to tell me this isn't the optimal path. Um, great, it's working for me right now. No, but really, I would love it if you would tell me which method would actually be more optimal here, because I'd love to improve this approach and share it with the crew. So let's go ahead and run that.
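The clustering itself is a couple of lines of scikit-learn; the number of clusters is the main assumption:

```python
from sklearn.cluster import KMeans

num_clusters = 11  # tune this per book; it takes trial and error
kmeans = KMeans(n_clusters=num_clusters, random_state=42).fit(vectors)
print(kmeans.labels_)
```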
And already! That was super quick; we just got our clusters. If we take a look at our labels, we can see that they cover the 78 documents we had, and document 0 has a label of 2, meaning it belongs to cluster 2. It looks like the first couple of documents all have label 2, which is pretty interesting, and kind of what I expected: I would expect the beginning of the book to all be talking about the same thing, and as the plot develops a little more, you'd have different clusters representing different pieces. It's also pretty cool that cluster 8 doesn't appear anywhere but the end, because cluster 8 is likely talking about the end of the book. I thought that was pretty interesting.
Now, I couldn't help myself: I had to graph these clusters, because what else are you supposed to do with clusters other than graph them, right? What I had to do first, though, was dimensionality reduction, because each of these vectors has about 1,500 dimensions, and I don't want to plot 1,500 dimensions; I want it down to two. So I used t-SNE, which is a dimensionality reduction algorithm, and then I ran the result through matplotlib so we can actually see what these clusters look like. The cool part is that I'm going to color each dot depending on which cluster it's in. Let's go ahead and run this.
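A sketch of the plotting step; t-SNE from scikit-learn gets the roughly 1,500-dimensional embeddings down to 2D:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Reduce the embeddings to two dimensions so we can plot them
reduced = TSNE(n_components=2, random_state=42).fit_transform(vectors)

# Color each point by its cluster label
plt.scatter(reduced[:, 0], reduced[:, 1], c=kmeans.labels_)
plt.title("Book embeddings, clustered")
plt.show()
```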
And sweet, what we get is a 2D representation of those clusters. What's pretty cool to see is that the dimensionality reduction worked, and there are groups of distinct clusters: these three yellow dots, or all these blue dots together, or these dark blue ones. This is 11 colors (I probably could have spread the colors out a little more), but I thought it was pretty cool: it represents the different sections of the book that we're now going to pick our best document from. Okay, so then we have a nice little for loop right here that's going to go through each cluster we have and pick the vector that's closest to the center of that cluster, its centroid. This is my way of figuring out which document is closest to the centroid of each one of those clusters, because my hypothesis is that that one will be the most representative of the whole cluster. Let's go ahead and run this.
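That selection loop can be as simple as a nearest-to-centroid lookup with NumPy:

```python
import numpy as np

closest_indices = []
for i in range(num_clusters):
    # Distance from every chunk's embedding to this cluster's centroid
    distances = np.linalg.norm(vectors - kmeans.cluster_centers_[i], axis=1)
    # Keep the chunk that sits closest to the centroid
    closest_indices.append(int(np.argmin(distances)))

# Sort so the chosen chunks get summarized in book order
selected_indices = sorted(closest_indices)
print(selected_indices)
```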
We get our selected indices here, and I'm actually going to sort them so they appear in order, because with a book you want to make sure you process your summaries in order: the first document to process should be the one that appears first in the book, and then you go down the line. The other interesting thing I noticed is that it starts off with document 0, which is the first document in the book. That makes sense: the introduction is likely doing a lot of exposition and describing the plot. Then it jumps all the way to document number 12, so it determined that documents 1 through 11 weren't that important for describing most of the book. Later down the line, though, it looks like there's a lot happening in the plot, because we only skip three documents at a time. The next thing I'm going to do is a map reduce method, but now, instead of 78 documents, I just have the 10 documents I want to run it on. When I did this with load_summarize_chain before, I kept getting timeout errors, so I'm actually going to do the map reduce method by hand, which may be fun for everyone to see as well.
fun for everyone to see as well for the
map part of it I'm going to use GPT 3.5
turbo here and this is to save on cost
here's our map prompt so I actually had
to do a custom one here you'll be given
a single passage of a book the section
will be enclosed in triple backticks
your goal is to give a summary of the
section so that the reader will have a
full understanding about what happened
your response should be at least three
paragraphs and fully Encompass what was
said in the passage I added this last
one because some of the summaries were a
little short and there's a little bit
more information lost than I wanted to
do and I said three paragraphs because
for the combined method we're actually
going to use gpt4 which has the 8K token
limit and so we can stuff in a whole lot
of information there which is nice
there's our map prompt and then I'm
going to initialize my map chain I'm
going to put in the 3.5 and here I'm
going to do a stuff which means it's
just going to take that passage and put
it right in the prompt not do anything
else that's fancy and our prompt is we
have our map prompt template up above
let's go ahead and do that and then here
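A sketch of the map side, using LangChain's chat model wrapper; the prompt is quoted from above, and the max_tokens value for the map model is my own assumption:

````python
from langchain import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain.chat_models import ChatOpenAI

# GPT-3.5 Turbo for the per-chunk ("map") summaries, to save on cost
llm3 = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo", max_tokens=1000)

map_prompt = """
You will be given a single passage of a book. This section will be enclosed
in triple backticks (```).
Your goal is to give a summary of this section so that a reader will have a
full understanding of what happened.
Your response should be at least three paragraphs and fully encompass what
was said in the passage.

```{text}```

FULL SUMMARY:
"""
map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])

# "stuff" = put the passage straight into the prompt, nothing fancier
map_chain = load_summarize_chain(llm=llm3, chain_type="stuff",
                                 prompt=map_prompt_template)
````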
What we're going to do is go through each of our selected indices, the pieces we selected up above, and grab the doc that sits at that spot; those are our selected docs. Okay, now we're going to go through each one of those selected docs, so doc 0, then doc 12, then doc 36, and so on; get the summary for each; append it to our summary list, which starts out empty; and print out a little status for us.
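The by-hand map loop, continuing with the variables defined in the earlier snippets:

```python
# Pull out just the chunks we selected as most representative
selected_docs = [docs[i] for i in selected_indices]

summary_list = []
for i, doc in enumerate(selected_docs):
    # Summarize one chunk at a time: the "map" step, done by hand
    chunk_summary = map_chain.run([doc])
    summary_list.append(chunk_summary)
    print(f"Summary #{i} (chunk #{selected_indices[i]}) - "
          f"Preview: {chunk_summary[:250]}\n")
```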
If we look at this first one right here, we can see that summary #0 just got done. This is the first summary, and it happened to be chunk 0, which represents the first chunk that the method thought was most important up above. The preview of the summary it generated: "The passage describes the author's experience of reaching the summit of Mount Everest and the events that followed. The author, who is part of the New Zealand-based team, had been fantasizing about this moment for months but found him..." and I only printed the first 250 characters, so it got cut off. Let's let the rest of these summaries load. Great, so we just got the summaries of each individual chunk, from chunk 0 to chunks 12, 26, 29, and so on. And if we take a look at the one for chunk 51: "This passage describes the harrowing experience of a group of climbers on Mount Everest during a severe storm," which comes later in the book, and is absolutely correct. Now, with all those summaries sitting in our summary list, first of all let's see how long that list is: looks like it's about 4,000 tokens, which is ideal for GPT-4, because it has an 8,000-token limit.
Here I'm going to set max tokens to 3,000, because that means the total is only about 7,000, which shouldn't give me a problem. Now for the combine prompt. It's not really a "combine" prompt anymore, but I'm just calling it that: "You will be given a series of summaries from a book. The summaries will be enclosed in triple backticks. Your goal is to give a verbose summary of what happened in the story. The reader should be able to grasp what happened in the book." I say verbose because I don't want a ton of information loss. There's the text in triple backticks, and then I'm asking for the verbose summary. Again we're just using the "stuff" chain here, because I'm taking all those summaries we made before, putting them in load_summarize_chain, and stuffing them right into the prompt. And notice that I'm using GPT-4 this time. So let's let this run.
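The reduce side, under the same assumptions (the prompt is quoted from above, and the Document wrapper just lets the stuff chain accept the joined summaries):

````python
from langchain.docstore.document import Document

llm4 = ChatOpenAI(temperature=0, model_name="gpt-4", max_tokens=3000)

combine_prompt = """
You will be given a series of summaries from a book. The summaries will be
enclosed in triple backticks (```).
Your goal is to give a verbose summary of what happened in the story.
The reader should be able to grasp what happened in the book.

```{text}```

VERBOSE SUMMARY:
"""
combine_prompt_template = PromptTemplate(template=combine_prompt,
                                         input_variables=["text"])

summaries = Document(page_content="\n".join(summary_list))
print(f"Our total summary has {llm4.get_num_tokens(summaries.page_content)} tokens")

reduce_chain = load_summarize_chain(llm=llm4, chain_type="stuff",
                                    prompt=combine_prompt_template)
output = reduce_chain.run([summaries])
print(output)
````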
This will take a while on my machine, so I'm going to pause the video. Awesome, let's take a look at the result. Now we have our book summary: "In this story, the author recounts the experience as part of the New Zealand-based team attempting to summit Mount Everest. Despite months of anticipation, the author is unable to fully appreciate the moment due to extreme exhaustion..." and so on. Throughout the story, the author describes various events and challenges faced by the climbers, and they go through the aftermath of the tragedy; it even talks about what happened afterwards. In terms of a book summary, this really isn't too bad, and I'm actually pretty happy with the results. I'd love for you to try this out on your own books; let me know what you think.
For level five, we're going to look at how you summarize an unknown amount of text, and we're actually going to use agents for this one. The word of caution I have to give is that best practices for using agents are still being actively researched and developed, and they're not as reliable as we may want. In this example I'm going to run through a very quick and brief research project where the agent needs to go and search Wikipedia. I'm going to ask a question that requires two different searches, and the important part is that the agent needs to understand what it needs to go search for. As agents become more developed, you'll be able to throw more complicated and nuanced research projects at them, but for now let's just walk through an easy example.
Alright, the first thing we're going to do is import our packages. initialize_agent and Tool are the important ones, as well as the Wikipedia API wrapper. Let's initialize our Wikipedia API wrapper, and then we're going to create our toolkit. In this case there's just going to be one tool: the Wikipedia tool. We give it a name, we tell it what function to run, which is wikipedia.run, and then we say it's "useful for when you need to get information from Wikipedia about a single topic". Let's go ahead and run that, and then we initialize our agent, call it agent_executor: we pass our toolkit, our language model, our agent type, and I'm setting verbose=True so we can see what it's thinking. Then we're going to ask it to go over multiple Wikipedia pages: "Can you please provide a quick summary of Napoleon Bonaparte? Then do a separate search and tell me what the commonalities are with Serena Williams." Napoleon and Serena Williams: I'm not really sure about the commonalities myself, but let's see what the language model comes up with for us.
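A sketch of the agent setup with the classic LangChain agent API (newer versions use a different entry point); it assumes the wikipedia Python package is installed:

```python
from langchain.agents import initialize_agent, Tool
from langchain.chat_models import ChatOpenAI
from langchain.utilities import WikipediaAPIWrapper

llm = ChatOpenAI(temperature=0)
wikipedia = WikipediaAPIWrapper()

toolkit = [
    Tool(
        name="Wikipedia",
        func=wikipedia.run,
        description="Useful for when you need to get information from "
                    "Wikipedia about a single topic",
    )
]

agent_executor = initialize_agent(
    toolkit, llm, agent="zero-shot-react-description", verbose=True
)

output = agent_executor.run(
    "Can you please provide a quick summary of Napoleon Bonaparte? "
    "Then do a separate search and tell me what the commonalities are "
    "with Serena Williams."
)
print(output)
```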
Great, so we run it, and the first action is going to Wikipedia: it inputs "Napoleon", looks at the Napoleon page, then the Napoleon III page, then the House of Bonaparte, and then it says, okay, now I know the summary of Napoleon; I need to find information about Serena Williams to identify the commonalities between them. So it moves over to Serena Williams, looks at more information on the Williams sisters, even looks at Venus Williams, and then it says: I know the final answer. "Napoleon and Serena Williams both achieved remarkable success in their respective fields, with Napoleon being one of the greatest military commanders in history and Serena being one of the greatest tennis players of all time. They both dominated their fields at their peak and have left lasting legacies." Interesting. It's kind of a lame commonality, but I'll take it.
Absolutely awesome, and congratulations: that is the five levels of summarization, from novice to expert. Please let me know what you think on Twitter, and I'd love to see what types of things you're summarizing, so please share with the community.