The most optimistic vision of generative AI is that it will relieve us of
the tedious, repetitive elements of knowledge work so that we can get to work
on the really interesting problems that such tedium stands in the way of.
Even if you fully believe in this vision, it’s hard to deny that today, some
tedium is associated with the process of using generative AI itself.
Generative AI also
isn’t
free,
and so, as responsible consumers, we need to ask: is it worth it? What’s the
ROI
of genAI, and how can we tell? In this post, I’d like to explore a logical
framework for evaluating genAI expenditures, to determine if your organization
is getting its money’s worth.
Perpetually Proffering Permuted Prompts
I think most LLM users would agree with me that a typical workflow with an LLM
rarely involves prompting it only one time and getting a perfectly useful
answer that solves the whole problem.
Generative AI best practices, even from the most optimistic
vendors,
all suggest that you should continuously evaluate everything. ChatGPT, which
is really the
only
genAI product with significantly scaled adoption, still says at the bottom of
every interaction:
ChatGPT can make mistakes. Check important info.
If we have to “check important info” on every interaction, it stands to reason
that even if we think it’s useful, some of those checks will find an error.
Again, if we think it’s useful, presumably the next thing to do is to perturb
our prompt somehow, and issue it again, in the hopes that the next invocation
will, by dint of either:
- better luck this
time
with the stochastic aspect of the inference process,
- enhanced application of our skill to
engineer
a better prompt based on the deficiencies of the current inference, or
- better performance of the model by populating additional
context in subsequent chained
prompts.
Unfortunately, given the relative lack of reliable methods to re-generate the
prompt and receive a better answer, checking the output and re-prompting
the model can feel like just kinda futzing around with it. You try, you get a
wrong answer, you try a few more times, eventually you get the right answer
that you wanted in the first place. It’s a somewhat unsatisfying process, but
if you get the right answer eventually, it does feel like progress, and you
didn’t need to use up another human’s time.
In fact, the hottest buzzword of the last hype cycle is “agentic”. While I
have my own feelings about this particular word, its current practical
definition is “a generative AI system which automates the process of
re-prompting itself, by having a deterministic program evaluate its outputs for
correctness”.
A better term for an “agentic” system would be a “self-futzing system”.
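To make that definition concrete, here is a minimal, hypothetical sketch of such a loop. The generate and check callables are stand-ins, not any particular vendor’s API; the point is only the shape: the system keeps re-prompting itself until a deterministic check passes or it runs out of attempts.

def agentic_loop(task, generate, check, max_attempts=5):
    # generate() calls the model once (costing I + E); check() is the
    # deterministic evaluator, e.g. "do the tests pass?"
    prompt = task
    for attempt in range(1, max_attempts + 1):
        output = generate(prompt)
        if check(output):
            return output, attempt      # "good enough"; a human still has to review it
        # fold the failure back into the next prompt and futz again, automatically
        prompt = f"{task}\n\nA previous attempt failed its checks:\n{output}\nPlease fix it."
    return None, max_attempts           # gave up; a human has to take over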
However, the ability to automate some level of checking and re-prompting does
not mean that you can fully delegate tasks to an agentic tool, either. It
is, plainly put, not safe. If you leave the AI on its own, you will get
terrible results that will at best make for a funny story and at
worst might end up causing serious damage.
Taken together, this all means that for any consequential task that you want
to accomplish with genAI, you need an expert human in the
loop. The human must be
capable of independently doing the job that the genAI system is being asked to
accomplish.
When the genAI guesses correctly and produces usable output, some of the
human’s time will be saved. When the genAI guesses wrong and produces
hallucinatory gibberish or even “correct” output that nevertheless fails to
account for some unstated but necessary property such as security or scale,
some of the human’s time will be wasted evaluating it and re-trying it.
Income from Investment in Inference
Let’s evaluate an abstract, hypothetical genAI system that can automate some
work for our organization. To avoid implicating any specific vendor, let’s
call the system “Mallory”.
Is Mallory worth the money? How can we know?
Logically, there are only two outcomes that might result from using Mallory to
do our work.
- We prompt Mallory to do some work; we check its work, it is correct, and
some time is saved.
- We prompt Mallory to do some work; we check its work, it fails, and we futz
around with the result; this time is wasted.
As a logical framework, this makes sense, but ROI is an arithmetical concept,
not a logical one. So let’s translate this into some terms.
In order to evaluate Mallory, let’s define the Futzing Fraction, $FF$, in terms of the following variables:
- $H$: the average amount of time a Human worker would take to do a task, unaided by Mallory
- $I$: the amount of time that Mallory takes to run one Inference
- $C$: the amount of time that a human has to spend Checking Mallory’s output for each inference
- $P$: the Probability that Mallory will produce a correct inference for each prompt
- $W$: the average amount of time that it takes for a human to Write one prompt for Mallory
- $E$: since we are normalizing everything to time, rather than money, we also have to account for the dollar cost of Mallory as a product, so we will include the Equivalent amount of human time we could purchase for the marginal cost of one inference
As in last week’s example of simple ROI
arithmetic, we will put our costs in the
numerator, and our benefits in the denominator:

$$FF = \frac{W + I + C + E}{P \cdot H}$$

The idea here is that for each prompt, the minimum amount of time-equivalent cost possible is $W + I + C + E$. The user must, at least once, write a prompt, wait for inference to run, then check the output; and, of course, pay any costs to Mallory’s vendor.
If the probability of a correct answer is $P$, then they will do this entire process $\frac{1}{P}$ times on average, so we put $P$ in the denominator. Finally, we divide everything by $H$, because we are trying to determine if we are actually saving any time or money, versus just letting our existing human, who has to be driving this process anyway, do the whole thing.
If the Futzing Fraction evaluates to a number greater than 1, as previously discussed, you are a bozo; you’re spending more time futzing with Mallory than getting value out of it.
Figuring out the Fraction is Frustrating
In order to evaluate the Futzing Fraction at all, though, you have to
have a sound method to get at least a vague sense of each of its terms.
If you are a business leader, a lot of this is relatively easy to measure. You
vaguely know what $H$ is, because you know what your
payroll costs, and similarly, you can figure out $E$ with
some pretty trivial arithmetic based on Mallory’s pricing table. There are endless
YouTube channels, spec sheets and benchmarks to give you $I$, and $I$ is probably going to be so small compared to $H$ that it hardly merits consideration.
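The $E$ term, for instance, is just a unit conversion. A back-of-the-envelope sketch, with entirely made-up numbers for the payroll cost and Mallory’s per-prompt price:

hourly_cost_of_human = 100.0   # fully loaded payroll cost, dollars per hour (assumption)
price_per_inference = 0.03     # Mallory's per-prompt price, dollars (assumption)

# E is the human time you could have bought for the price of one inference,
# expressed in minutes so it shares units with W, I, C, and H.
E = price_per_inference / (hourly_cost_of_human / 60.0)
print(f"E = {E:.3f} minutes")  # 0.018 minutes, i.e. roughly one second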
But, are you measuring $C$? If your employees are not checking the outputs of the AI, you’re on a path to catastrophe that no ROI calculation can capture, so it had better be greater than zero.
Are you measuring $P$? How often does the AI get it right on the first try?
Challenges to Computing Checking Costs
In the fraction defined above, the $C$ term is going to be
large. Larger than you think.
Measuring $C$ and $P$ with a high
degree of precision is probably going to be very hard; possibly unreasonably
so, or too expensive to bother with in practice. So you will undoubtedly need
to work with estimates and proxy metrics. But you have to be aware that this
is a problem domain where your normal methods of estimating are going to be
extremely vulnerable to inherent cognitive bias, so you will need to find ways to measure anyway.
First let’s discuss cognitive and metacognitive bias.
My favorite cognitive bias is the availability
heuristic and a close
second is its cousin salience
bias.
Humans are empirically predisposed towards noticing and remembering things that
are more striking, and towards overestimating their frequency.
If you are estimating the variables above based on the vibe that you’re
getting from the experience of using an LLM, you may be overestimating its
utility.
Consider a slot machine.
If you put a dollar into a slot machine, and you lose that dollar, this is an
unremarkable event. Expected, even. It doesn’t seem interesting. You can
repeat this over and over again, a thousand times, and each time it will seem
equally unremarkable. If you do it a thousand times, you will probably get
gradually more anxious as your sense of your dwindling bank account becomes
slowly more salient, but losing one more dollar still seems unremarkable.
If you put a dollar in a slot machine and it gives you a thousand dollars,
that will probably seem pretty cool. Interesting. Memorable. You might tell
a story about this happening, but you definitely wouldn’t really remember any
particular time you lost one dollar.
Luckily, when you arrive at a casino with slot machines, you probably know well
enough to set a hard budget in the form of some amount of physical currency you
will have available to you. The odds are against you, you’ll probably lose it
all, but any responsible gambler will have an immediate, physical
representation of their balance in front of them, so when they have lost it
all, they can see that their hands are empty, and can try to resist the “just
one more pull” temptation, after hitting that limit.
Now, consider Mallory.
If you put ten minutes into writing a prompt, and Mallory gives a completely
off-the-rails, useless answer, and you lose ten minutes, well, that’s just what
using a computer is like sometimes. Mallory malfunctioned, or hallucinated,
but it does that sometimes, everybody knows that. You only wasted ten minutes.
It’s fine. Not a big deal. Let’s try it a few more times. Just ten more
minutes. It’ll probably work this time.
If you put ten minutes into writing a prompt, and it completes a task that
would have otherwise taken you 4 hours, that feels amazing. Like the computer
is magic! An absolute endorphin rush.
Very memorable. When it happens, it feels like hitting the jackpot.
But... did you have a time budget before you started? Did you have a specified
N such that “I will give up on Mallory as soon as I have spent N minutes
attempting to solve this problem with it”? When the jackpot finally pays out
that 4 hours, did you notice that you put 6 hours’ worth of 10-minute prompt
coins into it?
If you are attempting to use the same sort of heuristic intuition that probably
works pretty well for other business leadership decisions, Mallory’s
slot-machine chat-prompt user interface is practically designed to subvert
those sensibilities. Most business activities do not have nearly such an
emotionally variable, intermittent reward schedule. They’re not going to trick
you with this sort of cognitive illusion.
Thus far we have been talking about cognitive bias, but there is a
metacognitive bias at play too: while
Dunning-Kruger,
everybody’s favorite metacognitive bias, does have some
problems
with it, the main underlying metacognitive bias is that we tend to believe our
own thoughts and perceptions, and it requires active effort to distance
ourselves from them, even if we know they might be wrong.
This means you must assume any intuitive estimate of $C$
is going to be biased low; similarly, any intuitive estimate of $P$ is going to be
biased high. You will forget the time you spent checking, and you will
underestimate the number of times you had to re-check.
To avoid this, you will need to decide on a Ulysses
pact to provide some inputs to a
calculation for these factors that you will not be able to fudge if
they seem wrong to you.
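One way to build such a pact, sketched below with a hypothetical CSV log rather than any particular tool: record every prompt-and-check cycle as it happens, so that $C$ and $P$ are later computed from timestamps instead of recalled from memory.

import csv
from datetime import datetime

LOG_PATH = "mallory_attempts.csv"  # hypothetical log file

def log_attempt(task_id, minutes_writing, minutes_checking, accepted):
    # Append one prompt/check cycle; decide "accepted" before any re-prompt.
    with open(LOG_PATH, "a", newline="") as f:
        csv.writer(f).writerow(
            [datetime.now().isoformat(), task_id, minutes_writing, minutes_checking, accepted]
        )

def summarize():
    # Average checking time per attempt, and observed P, from records rather than vibes.
    with open(LOG_PATH, newline="") as f:
        rows = list(csv.reader(f))
    avg_C = sum(float(r[3]) for r in rows) / len(rows)
    observed_P = sum(r[4] == "True" for r in rows) / len(rows)
    return avg_C, observed_P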
Problematically Plausible Presentation
Another nasty little cognitive-bias landmine for you to watch out for is the
authority bias, for two
reasons:
- People will tend to see Mallory as an unbiased, external authority, and
thereby see it as more of an authority than a similarly-situated human.
- Being an LLM, Mallory will be overconfident in its answers.
The nature of LLM training is also such that commonly co-occurring tokens in
the training corpus produce higher likelihood of co-occurring in the output;
they’re just going to be closer together in the vector-space of the weights;
that’s, like, what training a model is, establishing those relationships.
If you’ve ever used a heuristic to informally evaluate someone’s credibility
by listening for industry-specific shibboleths or ways of describing a
particular issue, that skill is now useless. Because Mallory has ingested every
industry’s expert literature, commonly-occurring phrases will always be present
in its output. Mallory will usually sound like an expert, but then make
mistakes at random.
While you might intuitively estimate $C$ by thinking “well,
if I asked a person, how could I check that they were correct, and how long
would that take?” that estimate will be extremely optimistic, because the
heuristic techniques you would use to quickly evaluate incorrect information
from other humans will fail with Mallory. You need to go all the way back to
primary sources and actually fully verify the output every time, or you will
likely fall into one of these traps.
Mallory Mangling Mentorship
So far, I’ve been describing the effect Mallory will have in the context of an
individual attempting to get some work done. If we are considering
organization-wide adoption of Mallory, however, we must also consider the
impact on team dynamics. There are a number of potential side effects
that one might consider here, but I will focus on just one that
I have observed.
I have a cohort of friends in the software industry, most of whom are
individual contributors. I’m a programmer who likes programming, so are most
of my friends, and we are also (sigh), charitably, pretty solidly
middle-aged at this point, so we tend to have a lot of experience.
As such, we are often the folks that the team — or, in my case, the community —
goes to when less-experienced folks need answers.
On its own, this is actually pretty great. Answering questions from more
junior folks is one of the best parts of a software development job. It’s an
opportunity to be helpful, mostly just by knowing a thing we already knew. And
it’s an opportunity to help someone else improve their own agency by giving
them knowledge that they can use in the future.
However, generative AI throws a bit of a wrench into the mix.
Let’s imagine a scenario where we have 2 developers: Alice, a staff engineer
who has a good understanding of the system being built, and Bob, a relatively
junior engineer who is still onboarding.
The traditional interaction between Alice and Bob, when Bob has a question,
goes like this:
- Bob gets confused about something in the system being developed, because
Bob’s understanding of the system is incorrect.
- Bob formulates a question based on this confusion.
- Bob asks Alice that question.
- Alice knows the system, so she gives an answer which
accurately reflects the state of the system to Bob.
- Bob’s understanding of the system improves, and thus he will have fewer and
better-informed questions going forward.
You can imagine how repeating this simple 5-step process will eventually
transform Bob into a senior developer, and then he can start answering
questions on his own. Making sufficient time for regularly iterating this loop
is the heart of any good mentorship process.
Now, though, with Mallory in the mix, the process has a new decision point,
changing it from a linear sequence to a flow chart.
We begin the same way, with steps 1 and 2. Bob’s confused, Bob formulates a
question, but then:
- Bob asks Mallory that question.
Here, our path then diverges into a “happy” path, a “meh” path, and a “sad”
path.
The “happy” path proceeds like so:
- Mallory happens to formulate a correct answer.
- Bob’s understanding of the system improves, and thus he will have fewer and
better-informed questions going forward.
Great. Problem solved. We just saved some of Alice’s time. But as we learned earlier,
Mallory can make mistakes. When that happens, we will need to check
important info. So let’s get checking:
- Mallory happens to formulate an incorrect answer.
- Bob investigates this answer.
- Bob realizes that this answer is incorrect because it is inconsistent with
some of his prior, correct knowledge of the system, or his investigation.
- Bob asks Alice the same question; GOTO traditional interaction step 4.
On this path, Bob spent a while futzing around with Mallory, to no particular
benefit. This wastes some of Bob’s time, but then again, Bob could have
ended up on the happy path, so perhaps it was worth the risk; at least Bob
wasn’t wasting any of Alice’s much more valuable time in the process.
Notice that beginning at step 4, we must allocate all of
Bob’s time to $C$, so $C$ already
starts getting a bit bigger than if it were just Bob checking Mallory’s output
specifically on tasks that Bob is doing.
That brings us to the “sad” path.
- Mallory happens to formulate an incorrect answer.
- Bob investigates this answer.
- Bob does not realize that this answer is incorrect because he is unable to
recognize any inconsistencies with his existing, incomplete knowledge of the
system.
- Bob integrates Mallory’s incorrect information of the system into his mental
model.
- Bob proceeds to make a larger and larger mess of his work, based on an
incorrect mental model.
- Eventually, Bob asks Alice a new, worse question, based on this incorrect
understanding.
- Sadly we cannot return to the happy path at this point, because now Alice
must unravel the complex series of confusing misunderstandings that Mallory
has unfortunately conveyed to Bob. In the really sad
case, Bob actually doesn’t believe Alice for a while, because Mallory
seems unbiased, and Alice has to waste even more time convincing Bob
before she can simply explain the answer to him.
Now, we have wasted some of Bob’s time, and some of Alice’s time. Everything
from steps 5 through 10 is $C$, and as soon as Alice gets involved,
we are now adding to $C$ at double real time. If more
team members are pulled into the investigation, you are now multiplying $C$ by the number of investigators, potentially running at triple
or quadruple real time.
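In other words, $C$ accrues in person-minutes, not wall-clock minutes. A trivial illustration with made-up numbers:

wall_clock_minutes = 30           # how long the sad-path investigation lasts
investigators = 3                 # e.g. Bob, Alice, and one more teammate pulled in
person_minutes_added_to_C = wall_clock_minutes * investigators
print(person_minutes_added_to_C)  # 90 person-minutes of checking for 30 real minutes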
But That’s Not All
Here I’ve presented a brief selection of reasons why $C$
will be both large, and larger than you expect. To review:
- Gambling-style mechanics of the user interface will interfere with your own
self-monitoring and developing a good estimate.
- You can’t use human heuristics for quickly spotting bad answers.
- Wrong answers given to junior people who can’t evaluate them will waste more
time from your more senior employees.
But this is a small selection of ways that Mallory’s output can cost you
money and time. It’s harder to simplistically model second-order effects like
this, but there’s also a broad range of possibilities for ways that, rather
than simply checking and catching errors, an error slips through and starts
doing damage. Or ways in which the output isn’t exactly wrong, but still
sub-optimal in ways which can be difficult to notice in the short term.
For example, you might successfully vibe-code your way to launching a series of
applications, successfully “checking” the output along the way, but then
discover that the resulting code is unmaintainable garbage that prevents future
feature delivery, and needs to be re-written. This kind of
intellectual debt isn’t even specific to technical debt in code; it can
affect such apparently genAI-amenable fields as LinkedIn content
marketing.
Problems with the Prediction of $P$
$C$ isn’t the only challenging term,
though. $P$ is just as important, if not more so, and just as
hard to measure.
LLM marketing materials love to phrase their accuracy in terms of a
percentage. Accuracy claims for LLMs in general tend to hover around
70%. But these scores vary per field, and when you aggregate them across
multiple topic areas, they start to trend down. This is exactly why “agentic”
approaches for more immediately-verifiable LLM outputs (with checks like “did
the code work”) got popular in the first place: you need to try more than once.
Independently measured claims about accuracy tend to be quite a bit lower.
The field of AI benchmarks is exploding, but it probably goes without saying
that LLM vendors game those benchmarks, because of course every incentive
would encourage them to do that. Regardless of what their arbitrary scoring on
some benchmark might say, all that matters to your business is whether it is
accurate for the problems you are solving, for the way that you use it.
Which is not necessarily going to correspond to any benchmark. You will need to
measure it for yourself.
With that goal in mind, our formulation of $P$ must be a
somewhat harsher standard than “accuracy”. It’s not merely “was the factual
information contained in any generated output accurate”, but, “is the output
good enough that some given real knowledge-work task is done and the human
does not need to issue another prompt”?
Surprisingly Small Space for Slip-Ups
The problem with reporting these things as percentages at all, however, is that our actual definition for $P$ is $\frac{1}{A}$, where $A$, the number of prompts needed for any given task, must be an integer greater than or equal to 1.
Taken in aggregate, if we succeed on the first prompt more often than not, we could end up with a $P$ greater than $\frac{1}{2}$, but combined with
the previous observation that you almost always have to prompt it more than once, the practical reality is that $P$ will start at 50% and go down from there.
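So, in aggregate, the honest way to estimate $P$ is tasks completed divided by total prompts issued, which is the reciprocal of the average number of attempts per task. A tiny worked example with made-up attempt counts:

attempts_per_task = [1, 2, 1, 3, 5]   # hypothetical: prompts needed before each task was "done"
P_estimate = len(attempts_per_task) / sum(attempts_per_task)
print(P_estimate)                     # 5 / 12 ≈ 0.417, already below 50%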
If we plug in some numbers, trying to be as extremely optimistic as we can,
and say that we have a uniform stream of tasks, every one of which can be
addressed by Mallory, every one of which:
- we can measure perfectly, with no overhead
- would take a human 45 minutes
- takes Mallory only a single minute to generate a response
- Mallory will require only 1 re-prompt on average, i.e. its output is “good enough” half the time ($P = \frac{1}{2}$)
- takes a human only 5 minutes to write a prompt for
- takes a human only 5 minutes to check the result of
- has a per-prompt cost of the equivalent of a single second of a human’s time
Thought experiments are a dicey basis for reasoning in the face of
disagreements, so I have tried to formulate something here that is absolutely,
comically, over-the-top stacked in favor of the AI optimist.
Would that be profitable? It sure seems like it, given that we are trading
off 45 minutes of human time for 1 minute of Mallory-time and 10 minutes of
human time. If we ask Python:
>>> def FF(H, I, C, P, W, E):
...     return (W + I + C + E) / (P * H)
...
>>> FF(H=45.0, I=1.0, C=5.0, P=1/2, W=5.0, E=0.01)
0.48933333333333334
We get a futzing fraction of about 0.489. Not bad! Sounds like, at least
under these conditions, it would indeed be cost-effective to deploy Mallory.
But… realistically, do you reliably get useful, done-with-the-task quality
output on the second prompt? Let’s bump up the denominator on $P$ just a little bit there, and see how we fare:
>>> FF(H=45.0, I=1.0, C=5.0, P=1/3, W=5.0, E=0.01)
0.734
Oof. Still cost-effective at 0.734, but not quite as good. Where do
we cap out, exactly?
>>> from itertools import count
>>> for A in count(start=4):
...     print(A, result := FF(H=45.0, I=1.0, C=5.0, P=1 / A, W=5.0, E=1/60.))
...     if result > 1:
...         break
...
4 0.9792592592592594
5 1.224074074074074
With this little test, we can see that at our next iteration we are already at
0.9792, and by 5 tries per prompt, even in this absolute fever-dream of an
over-optimistic scenario, with a futzing fraction of 1.2240, Mallory is now a
net detriment to our bottom line.
Harm to the Humans
We have been treating $H$ as functionally constant so far, the mean of
some hypothetical Gaussian distribution, but the distribution
itself can also change over time.
Formally speaking, an increase to $H$ would be good for
our fraction. Maybe it would even be a good thing; it could mean we’re taking
on harder and harder tasks due to the superpowers that Mallory has given us.
But an observed increase to $H$ would probably not be
good. An increase could also mean your humans are getting worse at solving
problems, because using Mallory has atrophied their skills and sabotaged
learning opportunities. It could also go up because your senior,
experienced people now hate their jobs.
For some more vulnerable folks, Mallory might just take a shortcut to all these
complex interactions and drive them completely insane directly. Employees
experiencing an intense psychotic episode are famously less productive than
those who are not.
This could all be very bad, if our futzing fraction eventually does head north
of 1 and you need to consider reintroducing human-only workflows, without
Mallory.
Abridging the Artificial Arithmetic (Alliteratively)
To reiterate, I have proposed this fraction:

$$FF = \frac{W + I + C + E}{P \cdot H}$$

which shows us positive ROI when $FF$ is less than 1, and negative ROI when it is
more than 1.
This model is heavily simplified. A comprehensive measurement program that
tests the efficacy of any technology, let alone one as complex and rapidly
changing as LLMs, is more complex than could be captured in a single blog post.
Real-world work might be insufficiently uniform to fit into a closed-form
solution like this. Perhaps an iterated simulation with variables based on the
range of values seen from your team’s metrics would give better results.
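As a sketch of what that might look like, here is a small Monte Carlo loop; every range below is an assumption you would replace with your own measurements:

import random

def simulate_futzing_fraction(trials=10_000):
    # Sample each term from a measured (here: invented) range and average FF.
    total = 0.0
    for _ in range(trials):
        H = random.uniform(30, 90)       # minutes a human would take, unaided
        I = random.uniform(0.5, 2.0)     # minutes per inference
        C = random.uniform(5, 30)        # minutes spent checking each output
        W = random.uniform(2, 10)        # minutes spent writing each prompt
        E = 1 / 60                       # marginal cost of one inference, in human-minutes
        attempts = random.randint(1, 6)  # prompts needed per task, so P = 1/attempts
        total += (W + I + C + E) * attempts / H
    return total / trials

print(simulate_futzing_fraction())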
However, in this post, I want to illustrate that if you are going to try to
evaluate an LLM-based tool, you need to at least include some representation
of each of these terms somewhere. They are all fundamental to the way the
technology works, and if you’re not measuring them somehow, then you are flying
blind into the genAI storm.
I also hope to show that a lot of existing assumptions about how benefits
might be demonstrated, for example with user surveys about general impressions,
or by evaluating artificial benchmark scores, are deeply flawed.
Even making what I consider to be wildly, unrealistically optimistic
assumptions about these measurements, I hope I’ve shown:
- in the numerator, $C$ might be a lot higher than you
expect,
- in the denominator, $P$ might be a lot lower than you
expect,
- repeated use of an LLM might make $H$ go up, but despite
the fact that it's in the denominator, that will ultimately be quite bad for
your business.
Personally, I don’t have all that many concerns about $I$ and $E$. $E$ is still seeing significant loss-leader pricing, and $I$ might not be coming down as fast as vendors would like us to believe, but if the other numbers work out, I don’t think they make a huge difference. However, there might still be surprises lurking in there, and if you want to rationally evaluate the effectiveness of a model, you need to be able to measure them and incorporate them as well.
In particular, I really want to stress the importance of the influence of LLMs on your team dynamic, as that can cause massive, hidden increases to $C$. LLMs present opportunities for junior employees to generate an endless stream of chaff that will simultaneously:
- wreck your performance review process by making them look much more
productive than they are,
- increase stress and load on senior employees who need to clean up unforeseen
messes created by their LLM output,
- and ruin their own career development by skipping over
learning opportunities.
If you’ve already deployed LLM tooling without measuring these things and
without updating your performance management processes to account for the
strange distortions that these tools make possible, your Futzing Fraction may
be much, much greater than 1, creating hidden costs and technical debt that
your organization will not notice until a lot of damage has already been done.
If you got all the way here, particularly if you’re someone who is
enthusiastic about these technologies, thank you for reading. I appreciate
your attention and I am hopeful that if we can start paying attention to these
details, perhaps we can all stop futzing around so much with this stuff and
get back to doing real work.
Acknowledgments
Thank you to my patrons who are supporting my writing on
this blog. If you like what you’ve read here and you’d
like to read more of it, or you’d like to support my various open-source
endeavors, you can support my work as a
sponsor!