Document Summarisation

How would you summarize an entire book using a language model? I asked myself the same question, and I went on a mission to find out. Summarizing bodies of text is one of the prime use cases for language models: it's extremely valuable to be able to distill the important pieces of information from long bodies of text. People summarize articles, financial documents, chat history, tables, pages, books, song lyrics, and way too many other things to count. In this video we're going to go from novice to expert and review the five levels of summarization. Let's jump into it.

For the five levels of summarization, novice to expert, we're going to summarize a couple of sentences, a couple of paragraphs, a couple of pages, an entire book, and finally an unknown amount of text. What does that mean? You'll have to wait until the end and see.

Here, the first thing we're going to do is import our OpenAI API key, and for Level 1 we're just going to do a basic prompt to summarize a couple of sentences. I import OpenAI from LangChain, create my language model, and in this case simply copy and paste some text from Wikipedia and put it inside a prompt. For Level 1 the instruction is: "Please provide a summary of the following text." That's the instruction I want the language model to follow, and the text I'm giving it is a passage on philosophy from Wikipedia. I put that into a prompt variable and get the number of tokens: this prompt has 121 tokens right now. The reason this matters is that as the number of tokens increases with larger documents, we're going to need to handle them differently, but 121 is pretty small, so let's go ahead and run it.
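Here's a minimal sketch of what that Level 1 code might look like, assuming the classic LangChain OpenAI wrapper (the Wikipedia passage is truncated for illustration):

```python
from langchain.llms import OpenAI

# Assumes OPENAI_API_KEY is set in the environment
llm = OpenAI(temperature=0)

# Passage pasted from Wikipedia (truncated here for illustration)
text = """
Philosophy (from Greek: philosophia, 'love of wisdom') is the systematized
study of general and fundamental questions, such as those about existence,
reason, knowledge, values, mind, and language.
"""

prompt = f"""
Please provide a summary of the following text.

TEXT:
{text}
"""

num_tokens = llm.get_num_tokens(prompt)
print(f"Our prompt has {num_tokens} tokens")

output = llm(prompt)
print(output)
```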

The output: "Philosophy is the systematized study of general and fundamental questions about existence, reason..." and so on. You know, that's still a little too complicated for me, so I'm going to adjust the instructions to get a different kind of summary: "Please provide a summary of the following text. Please provide your output in a manner that a five-year-old would understand." I do the same thing and put that into a prompt variable. Now we have a few more tokens, but it's really not that much, and the output is: "Philosophy is about asking questions and trying to figure out the answers." That's a lot more digestible for me. Nice.

Let's move on to Level 2: prompt templates. Here we're going to summarize a couple of paragraphs. Again we import OpenAI, but this time we also import a prompt template, which is a really easy way to swap out different pieces of a prompt that we send to a language model. The two essays we're going to look at are Paul Graham essays: "getideas" and "noob". I create an empty list called essays, put each essay inside that list, and print out a preview of the essays to see what they look like. Essay number one: "Someone fed my essays into GPT to make something that could answer questions based on them." Cool. Essay number two: "When I was young, I thought old people had everything figured out." Awesome.

Now we're going to use a prompt template and dynamically insert these essays into it. Here's our template: "Please write a one sentence summary of the following text." Notice how I said one sentence instead of just asking for a summary. We put that into our prompt template, with an essay variable that marks the insertion point. Then I loop through the two essays and get summaries for both of them: for each essay in our list, we format the prompt by putting the single essay inside it, so the full prompt is the template plus the essay. We get the number of tokens, which we'll look at, and then finally get a summary by passing the summary prompt to our language model. Let's go ahead and run this.
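A sketch of that Level 2 loop, assuming LangChain's PromptTemplate (the essay file names are hypothetical local copies):

```python
from langchain.llms import OpenAI
from langchain import PromptTemplate

llm = OpenAI(temperature=0)

# Hypothetical local copies of the two Paul Graham essays
essays = []
for path in ["getideas.txt", "noob.txt"]:
    with open(path) as f:
        essays.append(f.read())

template = """
Please write a one sentence summary of the following text:

{essay}
"""
prompt = PromptTemplate(input_variables=["essay"], template=template)

for essay in essays:
    summary_prompt = prompt.format(essay=essay)
    print(f"This prompt + essay has {llm.get_num_tokens(summary_prompt)} tokens")
    summary = llm(summary_prompt)
    print(summary.strip())
```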

The first one had 205 tokens, not too bad, and the result is just a one-sentence summary: "Exploring anomalies at the frontier of knowledge is the best way to generate new ideas." Cool. The second prompt had 500 tokens: "This essay explores the idea that feeling like a noob is actually beneficial." Nice, so we get a one-sentence summary of both essays, which is pretty cool.

Level 3 is where it starts to get a little more complicated. We're going to use a map reduce method, which means we chunk our document into pieces, get a summary of each individual chunk, and then finally get a summary of the summaries. We import OpenAI; load_summarize_chain, which is a really easy convenience that LangChain provides for doing this map reduce operation over a few documents; and the recursive character text splitter, which is what we'll use to split our text. Here we have a Paul Graham essay, the startup ideas one, which I know is actually pretty long. Let's load it and see how many tokens it is: 9,500 tokens. Today that would be too big for GPT-3.5 and even GPT-4. Token limits are going to increase in the future, so this likely won't be an issue forever, but it's good to learn how to handle it in case you ever run into this problem.

All right, so we split our document up by running our text through create_documents, and now we can check the number of docs and the number of tokens in the first doc: we have five documents, and the first one has about 2,000 tokens. So we went from one document with 9,500 tokens to five documents of roughly 2,000 tokens each. Then we load our summarize chain: we pass in the language model we're using and specify the chain type, which is the type of operation that gets deployed; in this case we want the map reduce operation. We load up that chain, run it, and put the result in an output variable.
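A sketch of that Level 3 flow; the chunk size and overlap here are illustrative, since the video doesn't state the exact values:

```python
from langchain.llms import OpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter

llm = OpenAI(temperature=0)

# Hypothetical local copy of the startup ideas essay
with open("startupideas.txt") as f:
    text = f.read()

print(f"This essay has {llm.get_num_tokens(text)} tokens")

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n"], chunk_size=10000, chunk_overlap=500
)
docs = text_splitter.create_documents([text])
print(f"We now have {len(docs)} docs, and the first one has "
      f"{llm.get_num_tokens(docs[0].page_content)} tokens")

chain = load_summarize_chain(llm=llm, chain_type="map_reduce")
output = chain.run(docs)
print(output)
```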

Great, now we have our output: "Y Combinator explains that the best startup ideas come from looking for problems, preferably ones the founders have themselves..." That's cool, but it's kind of long for me, so I'm going to create my own prompts instead of using the default ones LangChain uses. First I create my map prompt (in fact, I lied: this one is the same prompt LangChain uses by default), but for the combine prompt I specify the format I want: "Return your response in bullet points which covers the key points of the text." So I want it to respond in bullet points for me, not in regular prose. I load up my summarize chain again, and the important part is that I pass in the map prompt and the combine prompt I want. Let's run this and look at the output.
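A sketch of the custom-prompt variant, continuing from the previous block; the prompt wording approximates what's described in the video:

```python
from langchain import PromptTemplate
from langchain.chains.summarize import load_summarize_chain

# This mirrors LangChain's default map prompt
map_prompt = """
Write a concise summary of the following:

{text}

CONCISE SUMMARY:
"""
map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])

combine_prompt = """
Write a concise summary of the following text.
Return your response in bullet points which covers the key points of the text.

{text}

BULLET POINT SUMMARY:
"""
combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=["text"])

chain = load_summarize_chain(
    llm=llm,
    chain_type="map_reduce",
    map_prompt=map_prompt_template,
    combine_prompt=combine_prompt_template,
)
output = chain.run(docs)
print(output)
```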

Great, now we get a response, and you can see that we have a bullet-point summary. Let's move on to Level 4, which is when you want to summarize an entire book. This is actually what started this whole journey through the levels of summarization. The method I've come up with (I don't know if there's a better name for it; someone please tell me if there is, because I just made this up) is called "best representation vectors." Instead of doing a map reduce operation over an entire book, can we extract the important sections of the book and then summarize just those sections? In other words: can we pick the 10 best sections from this book and do a summary on those, without having to look at the rest of the text? The method I use here involves embeddings and clustering, so let's dive in to see how it works.

The first thing I'm going to do is import my book. The book I'm using is Into Thin Air, one of my favorites, about the 1996 Everest disaster. I'm doing some hand-wavy stuff here: I only want the 26th page through the 277th page, which is the body of the book without the footnotes, the table of contents, and things like that, just to make it a little easier. It's a PDF, so I'm going to run through it, take the page content, and put it into a regular text string, and finally replace the tabs with spaces, because there was some weird formatting coming out of the PDF. (Whoops, "PDF loader is not defined"; let me fix that and run it again.) Great, so now that we've loaded up our PDF and put it inside this text, let's see how many tokens it is.
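A sketch of that loading step, assuming LangChain's PyPDFLoader (the file path and exact page slice are illustrative):

```python
from langchain.llms import OpenAI
from langchain.document_loaders import PyPDFLoader

llm = OpenAI(temperature=0)

# Hypothetical local path to the book
loader = PyPDFLoader("into_thin_air.pdf")
pages = loader.load()

# Keep only the body of the book, skipping front matter and end matter
pages = pages[26:278]

text = ""
for page in pages:
    text += page.page_content
text = text.replace("\t", " ")  # clean up odd tab formatting from the PDF

num_tokens = llm.get_num_tokens(text)
print(f"This book has {num_tokens} tokens in it")
```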

And wow: what we get is almost 140,000 tokens. Even GPT-4-32K would not be able to handle summarizing this entire book for us, so we're going to have to come up with another method. The part I want to avoid is sending all 140,000 tokens to the language model itself, because I calculated it would be roughly four or five bucks just for the prompt, not even counting the output (the completion). So let's come up with a different method.

My goal is to chunk this book and then get embeddings for each of the chunks. I want to pick a subset of the chunks that represents a holistic but diverse view of the book: an encompassing set of chunks that are nevertheless different from each other, so we capture the different parts of the book that may be the important ones. Put another way: can we get the top 10 passages from the book that describe the book the best? The plan: load the book into a single text string (already done); split the text into fairly large chunks; embed those chunks to get vectors; and then, the interesting part, cluster the vectors to see which ones are similar to each other. I don't want to run a map reduce operation over similar chunks, because they're likely telling me the same thing. What I want is one representative from each cluster, so I get a diverse set. I'm going to pick the embedding that best represents each cluster, and my method is to take the one that's closest to the cluster centroid, the middle of the cluster, because I figure that's where the most "average" meaning of that cluster will be. Finally, we summarize the documents those embeddings represent. In plain English: which 10 documents from this book carry most of its meaning? I want to build a summary off those 10 documents.


All right, so we load in a bunch of imports and do the vector store dance. Next we split our text into a bunch of docs, and we can check how many we actually have: 78 documents. You could try a regular map reduce method on this, but you'd have to run all 78 of those documents through the model, and you don't want to do that. You'll notice my chunk size is 10,000 characters, which is on the larger side. Then we create our embeddings and get our vectors out the other end.
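A sketch of the chunk-and-embed step; the splitter uses the 10,000-character chunk size mentioned above, while the overlap and the embedding model are assumptions:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", "\t"], chunk_size=10000, chunk_overlap=3000
)
docs = text_splitter.create_documents([text])
print(f"Now our book is split up into {len(docs)} documents")

embeddings = OpenAIEmbeddings()  # ~1,500-dimensional vectors per chunk
vectors = embeddings.embed_documents([doc.page_content for doc in docs])
```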

Cool, now that we have our vectors, I want to cluster them to see which groups pop out. I'm going to specify 11 clusters; you should pick a number that fits your own book, and it's going to take some trial and error to see what works best. My clustering method is k-means. I actually went down a pretty complicated path to see which other clustering methods would work better for me, and this one worked out best. Now, I know there are going to be a lot of data science experts out there who are way smarter than me and will tell me this isn't the optimal path. Um, great, it's working for me right now. But really, I would love for you to tell me which approach would actually be more optimal, because I'd love to improve this method and share it with the crew. Let's go ahead and run that.
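A sketch of the clustering step, assuming scikit-learn's KMeans (the random seed is arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

num_clusters = 11  # chosen by trial and error for this book
kmeans = KMeans(n_clusters=num_clusters, random_state=42).fit(vectors)
print(kmeans.labels_)
```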

That was super quick; we've already got our clusters. If we take a look at the labels, they correspond to the 78 documents: document 0 has a label of 2, meaning it belongs to cluster 2, and it looks like the first several documents all have label 2. That's pretty interesting, and it matches what I'd expect: the beginning of the book is all talking about the same thing, and as the plot develops you get different clusters representing different pieces. It's also pretty cool that cluster 8 appears only at the very end of the label list, likely because cluster 8 is talking about the end of the book. I thought that was pretty interesting.

Now, I couldn't help myself: I had to graph these clusters, because what else are you supposed to do with clusters other than graph them, right? First, though, I had to do dimensionality reduction, because each of these vectors has about 1,500 dimensions, and I don't want to plot 1,500 dimensions; I want it down to two. For that I used t-SNE, a dimensionality reduction algorithm, and then ran the result through matplotlib so we can actually see what these clusters look like. The cool part is that I color each dot depending on which cluster it belongs to. Let's run this, and sweet: what we get is a 2D representation of those clusters. It's pretty cool to see that the dimensionality reduction worked and that there are visible groups: these three yellow dots, or these blue dots together, or these dark blue ones. This is 11 colors, and I probably could have spread the colors out a bit more, but I thought it was pretty cool. These are the different sections of the book that we're now going to pick our best documents from.
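A sketch of the visualization, assuming scikit-learn's TSNE and matplotlib:

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Reduce the ~1,500-dimensional embeddings down to 2D for plotting
tsne = TSNE(n_components=2, random_state=42)
reduced = tsne.fit_transform(np.array(vectors))

# Color each point by its cluster label
plt.scatter(reduced[:, 0], reduced[:, 1], c=kmeans.labels_)
plt.title("Book chunk embeddings, clustered")
plt.show()
```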

Then we have a nice little for loop that goes through each cluster and picks the vector closest to that cluster's centroid. This is my way of figuring out which document sits closest to the centroid of each cluster, because my hypothesis is that that document will be the most representative of the whole cluster. Let's run this and get our selected indices. I'm actually going to sort them so they appear in order, because with a book you want to process your summaries in order: the first selected document should be the one that appears first in the book, and then we go down the line.
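A sketch of that centroid-selection loop (a straightforward nearest-to-centroid search):

```python
# For each cluster, find the chunk whose embedding is closest to the centroid
closest_indices = []
for i in range(num_clusters):
    distances = np.linalg.norm(np.array(vectors) - kmeans.cluster_centers_[i], axis=1)
    closest_indices.append(int(np.argmin(distances)))

# Sort so the chunks get summarized in the order they appear in the book
selected_indices = sorted(closest_indices)
print(selected_indices)
```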

The other interesting thing I noticed is that it starts with document 0, the first document in the book. That makes sense: it's the introduction, likely doing a lot of exposition and describing the plot. Then it jumps all the way to document 12, so it determined that documents 1 through 11 weren't that important for describing most of the book. Later down the line it looks like there's a lot happening in the plot, because at one point we only skip three documents. The next thing I'm going to do is a map reduce again, but now over just the 10 selected documents instead of all 78. When I tried this with load_summarize_chain earlier I kept getting timeout errors, so I'm going to do the map reduce by hand, which may be fun for everyone to see as well.

For the map part I'm going to use GPT-3.5-turbo, to save on cost. Here's our map prompt; I had to write a custom one: "You will be given a single passage of a book. This section will be enclosed in triple backticks. Your goal is to give a summary of this section so that a reader will have a full understanding of what happened. Your response should be at least three paragraphs and fully encompass what was said in the passage." I added that last line because some of the summaries were coming out a little short, with more information loss than I wanted. And I said three paragraphs because for the combine step we're going to use GPT-4, which has the 8K token limit, so we can stuff in a whole lot of information, which is nice. With the map prompt ready, I initialize my map chain: I pass in GPT-3.5, and I use the "stuff" chain type, which means it just takes the passage and puts it right in the prompt, nothing fancier. Let's go ahead and do that.
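A sketch of that map chain; the prompt wording follows the video, while the max_tokens setting is an assumption:

````python
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain import PromptTemplate

llm3 = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, max_tokens=1000)

map_prompt = """
You will be given a single passage of a book. This section will be enclosed in triple backticks (```).
Your goal is to give a summary of this section so that a reader will have a full understanding of what happened.
Your response should be at least three paragraphs and fully encompass what was said in the passage.

```{text}```
FULL SUMMARY:
"""
map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])

map_chain = load_summarize_chain(llm=llm3, chain_type="stuff", prompt=map_prompt_template)
````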

Then we go through each of our selected indices and grab the doc at that spot, giving us our selected docs. Now we loop through each of those selected docs (doc 0, then doc 12, then doc 36, and so on), get a summary of each one, append it to a summary list (which starts out empty), and print a little status for ourselves.
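A sketch of that loop:

```python
# Grab the docs at the selected indices, then summarize each one in order
selected_docs = [docs[i] for i in selected_indices]
summary_list = []

for i, doc in enumerate(selected_docs):
    chunk_summary = map_chain.run([doc])
    summary_list.append(chunk_summary)
    print(f"Summary #{i} (chunk #{selected_indices[i]}) - Preview: {chunk_summary[:250]}\n")
```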

Looking at the first one, we can see that summary 0 just finished. This is the first summary, and it happened to be chunk 0, the chunk the process picked as most representative up above. The preview of the generated summary: "The passage describes the author's experience of reaching the summit of Mount Everest and the events that followed. The author, who is part of the New Zealand-based team, had been fantasizing about this moment for months but found him..." I only printed the first 250 characters, so it got cut off. Let's let the rest of these summaries load. Great, now we have summaries of each individual chunk: chunk 0, then chunks 12, 26, 29, and so on. And if we take a look at number 51: "This passage describes the harrowing experience of a group of climbers on Mount Everest during a severe storm," which is indeed later in the book, so that's absolutely correct.

So now, with all those summaries in our summary list, the first thing to check is how long the combined summaries are: about 4,000 tokens. That's ideal for GPT-4, which has the 8,000-token limit here. I'm going to set max tokens to 3,000 for the output, which makes a total of 7,000, so that shouldn't give me a problem. Then there's our combine prompt (it's not really a "combine" prompt in the map reduce sense anymore, but I'm calling it that): "You will be given a series of summaries from a book. The summaries will be enclosed in triple backticks. Your goal is to give a verbose summary of what happened in the story. The reader should be able to grasp what happened in the book." I say verbose because I don't want a ton of information loss. Again we're just using the stuff chain, because I'm taking all those summaries from before, putting them into load_summarize_chain, and placing them right in the prompt; and notice that I'm using GPT-4 this time. Let's let this run; it will take a while on my machine, so I'm going to pause the video.
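A sketch of that reduce step; the prompt wording follows the video, and wrapping the joined summaries in a Document is my assumption about the mechanics:

````python
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
from langchain import PromptTemplate

llm4 = ChatOpenAI(model_name="gpt-4", temperature=0, max_tokens=3000)

summaries = "\n".join(summary_list)
summaries_doc = Document(page_content=summaries)
print(f"Our total summary has {llm4.get_num_tokens(summaries)} tokens")

combine_prompt = """
You will be given a series of summaries from a book. The summaries will be enclosed in triple backticks (```).
Your goal is to give a verbose summary of what happened in the story.
The reader should be able to grasp what happened in the book.

```{text}```
VERBOSE SUMMARY:
"""
combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=["text"])

reduce_chain = load_summarize_chain(llm=llm4, chain_type="stuff", prompt=combine_prompt_template)
output = reduce_chain.run([summaries_doc])
print(output)
````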

Awesome, let's take a look at the result. Now we have our book summary: "In this story, the author recounts his experience as part of a New Zealand-based team attempting to summit Mount Everest. Despite months of anticipation, the author finds himself unable to fully appreciate the moment due to extreme exhaustion..." and so on. Throughout the summary it describes the various events and challenges faced by the climbers, and it even covers the aftermath of the tragedy and what happened afterwards. As a book summary, this really isn't bad at all, and I'm actually pretty happy with the results. I'd love for you to try this out on your own books; let me know what you think.

For Level 5, we're going to look at how you summarize an unknown amount of text, and we're actually going to use agents for this one. The word of caution I have to give is that best practices for using agents are still being actively researched and developed, and they're not as reliable as we might want. In this example, I'm going to run through a very quick and brief research project where the agent needs to go search Wikipedia. I'm going to ask a question that requires two different searches, and the important part is that the agent needs to understand what it has to go search for. As agents become more developed, you'll be able to throw more complicated and nuanced research projects at them, but for now let's walk through an easy example.

All right, the first thing we're going to do is import our packages; initialize_agent and Tool are the important ones, along with the Wikipedia API wrapper. We initialize the Wikipedia API wrapper and then create our toolkit. In this case there's just going to be one tool, the Wikipedia tool: we give it a name, tell it what function to run (wikipedia.run), and describe it as "useful for when you need to get information from Wikipedia about a single topic." Then we initialize our agent, call it agent_executor, passing in our toolkit, our language model, and our agent type, and I set verbose=True so we can see what it's thinking.
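A sketch of that agent setup; the ZERO_SHOT_REACT_DESCRIPTION agent type is an assumption based on common LangChain usage:

```python
from langchain.llms import OpenAI
from langchain.agents import initialize_agent, Tool, AgentType
from langchain.utilities import WikipediaAPIWrapper

llm = OpenAI(temperature=0)
wikipedia = WikipediaAPIWrapper()

toolkit = [
    Tool(
        name="Wikipedia",
        func=wikipedia.run,
        description="Useful for when you need to get information from Wikipedia about a single topic",
    )
]

agent_executor = initialize_agent(
    toolkit,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
)
```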

Then we ask it a question that spans multiple Wikipedia pages: "Can you please provide a quick summary of Napoleon Bonaparte? Then do a separate search and tell me what the commonalities are with Serena Williams." Napoleon and Serena Williams; I'm not really sure about the commonalities myself, but let's see what the language model comes up with.
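Running the question through the agent might look like this:

```python
output = agent_executor.run(
    "Can you please provide a quick summary of Napoleon Bonaparte? "
    "Then do a separate search and tell me what the commonalities are with Serena Williams."
)
print(output)
```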

Great, so we run it. The first action is going to Wikipedia: it inputs "Napoleon", looks at the Napoleon page, then the Napoleon III page, then House of Bonaparte. Then it says: okay, now I know the summary of Napoleon; I need to find information about Serena Williams to identify the commonalities between them. It moves over to Serena Williams, looks at more information on the Williams sisters, even looks at Venus Williams, and then says it knows the final answer: "Napoleon and Serena Williams both achieved remarkable success in their respective fields, with Napoleon being one of the greatest military commanders in history and Serena Williams being one of the greatest tennis players of all time. They both dominated their fields at their peak and have left lasting legacies." Interesting; it's kind of a lame commonality, but I'll take it.

Awesome, and congratulations: that's the five levels of summarization, from novice to expert. Please let me know what you think on Twitter; I'd love to see what kinds of things you're summarizing, so please share with the community.
