
Dion Almaer

Software, Development, Products



Building AI Dance Partners (and your role as a good lead!)

December 31, 2023

tl;dr LLMs give computers new abilities to be better partners for us humans, and if we build the right systems we can transform how we work together. I have learned some lessons on the building side, but also on how to do more as an augmented human to get the most out of this new world!


A dream stirred me from my sleep. I found myself on the set of ‘Dancing with the Stars,’ but with a twist: my partner was not human, but a robot. As I lay there, half-awake at 3am, I pondered the meaning of this mechanical ballroom dance. Then it clicked… it was a metaphor for the work I’ve been deeply immersed in at the close of 2023: creating computer systems that augment human capabilities, giving developers and their teams superpowers in software delivery.

The Dance

I’ve always believed in the power of combining the best of both worlds: human creativity and computer precision. The best user experiences have always woven brain and tool together, and these days the tools increasingly include digital ones.

LLMs have changed the game: precision now has a brand new capability, a layer of intuition that we can tap into. A way to combine my System 1 and System 2 brain with a mesh of combined thought. Back in the dream, my subconscious was painting a picture of the ideal partnership, where the human mostly leads and the machine follows in a tightly choreographed back-and-forth. Just like picking up a tool such as Photoshop, it can take time to master the steps, and the dance changes as the capabilities change. How can we best use the strengths and weaknesses of each partner so that they work as one?

Crafting the Perfect Partner

I’m currently iterating on a dancer that developers can shape into the best partner possible. Speed and skill are crucial. A slow computer is like a dance partner with two left feet, disrupting the flow and making collaboration frustrating. Skill, on the other hand, is about quality and finesse—leading without stepping on each other’s toes, sharing knowledge to maintain the rhythm.

The Car and the Engine

I was excited to join a Sutter Hill Ventures startup for many reasons, and my expectations have very much been exceeded. Not only do we have solid financial backing that allows us to really focus on building a game-changing product and business, but the support from the Sutter Hill team is special. I get to work with my favorite UX person there is. The enterprise sales playbook is ready to run. And on and on.

The team itself (founders, CEO, and everyone else!) is not only world class, but there is a strategic bet that I strongly believe in for building the absolute best product. At the heart of the team are AI researchers who deeply understand every part of the stack.

It’s one thing to build a car using someone else’s engine; it’s another to be able to fully tinker with that engine or even build your own.

In 2023 we learned so much as a community. First came the transformational moment when developers got to poke at what could be done with the OpenAI APIs (and then so many more). Then came prompt engineering, RAG’ing, and pushing the boundaries of what’s possible.

Embracing Constant Change

The model tier is just the beginning, and going from demo to a production system requires a world of work to be done around it.

New models and research are popping up on a daily basis, so how do you filter out what could be helpful? How do you determine its utility for your specific needs? How do you ensure your data is accurate and current? Are your evaluations truly reflective of quality, or are you just fitting the last piece of a puzzle?

Metrics

Measuring what matters here is hard. For example, with coding tools, I often see discussion around the amount of code that is created, or the Completion Acceptance Rate, but when you watch this play out in practice with your users you realize… wait a minute…

Do we want to always be creating code if it’s adding entropy into the system? If that code is iffy, and if the human can’t tell, then maybe we are adding problems. And, wouldn’t it be nice if we maybe could… delete code and simplify?

For completion acceptance, I can get very different results by changing the system to vary the amount of code that comes back, or the latency, and these changes shape the habits that developers build. The habits have been really fun to watch: seeing cohorts that start by waiting for the system to do things vs. those that communicate more and move quickly.

And when I do side-by-side comparisons, I see a huge difference: one system can have a higher acceptance rate yet end up with code that doesn’t run. Don’t I really want to be tracking time to running, high-quality code?
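To make the contrast concrete, here is a toy sketch. The event shape and numbers are invented for illustration, not from any real telemetry system:

```javascript
// Hypothetical per-developer session stats; field names are made up here.
const sessions = [
  { name: "A", completionsShown: 10, completionsAccepted: 8, minutesToRunningCode: 45 },
  { name: "B", completionsShown: 10, completionsAccepted: 5, minutesToRunningCode: 20 },
];

const acceptanceRate = (s) => s.completionsAccepted / s.completionsShown;

// Session A "wins" on acceptance rate (0.8 vs 0.5), but session B
// reached running code in less than half the time.
const byAcceptance = [...sessions].sort((a, b) => acceptanceRate(b) - acceptanceRate(a));
const byTimeToRunning = [...sessions].sort((a, b) => a.minutesToRunningCode - b.minutesToRunningCode);

console.log(byAcceptance[0].name);    // "A"
console.log(byTimeToRunning[0].name); // "B"
```

Same two developers, opposite winners, depending on which metric you decide matters.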

Here’s to 2024

We are somewhere on a journey of constant improvement, akin to what we can see with other tools such as Midjourney.

I’m grateful for my team’s collective ability to build everything needed for the ultimate coding dance partner. We are building the platform that enables the building of this partner, to iterate on it, to take in the innovation from open source and our own research, and man I’m having a great time doing it.

I can’t wait to share it with more of you. If you’re a developer who spends most of your day coding, enjoys giving feedback that moulds a product, and is interested in getting early access, I’d love to hear from you.

Happy New Year, and may this become true!

Prediction: 2024 will feel like a breakthrough year in terms of AI capability, safety, and general positivity about its potential impact. In the longer term, it'll look like just one more year on an exponential that can make everyone's lives better than anyone's today.

— Greg Brockman (@gdb) December 31, 2023

NOTE: Of course, this article was written by both Dion Almaer and the dancer within Type.

GenAI: Lessons working with LLMs

February 14, 2023

Creativity & Constraints, Foundations & Flywheels

The developer community is buzzing around the new world of LLMs. Roadmaps for the year are getting ripped up one month in, and there is a whole lot of tinkering… and I love the smell of tinkering.

At Shopify we shared a new Winter Edition, which packaged up 100+ features for merchants and developers. Some of the launches had a lil Shopify Magic in them, using LLMs to make life better for our users.

I had a lot of fun shipping something for developers that used LLMs, and I thought I would write up a few things that I learned in the process of getting it shipped.

[Image: the mock.shop homepage]

What did we ship? mock.shop

We want to make it as easy as possible for developers to learn and explore commerce by playing. We wanted to remove as much friction as possible from exploring a commerce data model and building a custom frontend to show off your frontend skills.

This is where mock.shop comes in: it sits in front of a Shopify store, but doesn’t require you to create one yourself. Just start playing with it and hitting it directly!

One thing we have heard from some developers is that they are new to GraphQL and/or new to the particulars of the commerce domain. We show GraphQL and code examples of how to work with it, but could we go even further?

Gil seeing mock.shop

Generate query with AI

What if you could just use your words and ask us to generate the GraphQL for you? That’s exactly what we did. And here’s what we learned…

Foundations & Flywheels

We used OpenAI for this work, and when working with LLMs you are working with a black box. While GPT-3 had some knowledge of GraphQL, and of Shopify, its knowledge was outdated and often wrong. Out of the box you are working with whatever the model has sucked up, and you can’t trust this data at all.

You need to do all you can to feed the black box information so that it can come up with the best results. Given the black box, you will need to experiment and keep poking it to see if you are making it better or worse.

Here are some of the foundational things that we did:

Feed it the best input

Gather all of the information that you think will nudge the model in the right direction. In our case we gathered the GraphQL schema (SDL) for the Shopify storefront APIs, and then a bunch of good examples. With these in hand, we chunked them up and created OpenAI embeddings from them. You end up with a library of these embeddings, which are vectors that represent the chunks of text.

With these embeddings we can take a user query (e.g. “Get me 7 of the most recent products”), get an embedding from that query, and then look for similar embeddings in the library that we have created. Those will contain snippets such as the schema for the products section of the GraphQL API, and some of the good examples that work with products. We call this context, and we pass it to the OpenAI completions endpoint as part of a prompt.
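The lookup step itself is just vector similarity. Here is a sketch with tiny made-up 3-D vectors standing in for real OpenAI embeddings (which have on the order of 1,500 dimensions):

```javascript
// A tiny pretend embedding library; texts and vectors are invented here.
const library = [
  { text: "products(first: N) schema chunk", embedding: [0.9, 0.1, 0.0] },
  { text: "cart mutation example",           embedding: [0.0, 0.8, 0.2] },
];

const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);
const norm = (a) => Math.sqrt(dot(a, a));
const cosine = (a, b) => dot(a, b) / (norm(a) * norm(b));

// Pretend this came back from the embeddings endpoint for the user's query.
const queryEmbedding = [0.85, 0.2, 0.05];

// Rank chunks by similarity and take the best one (top-k in practice)
// as the "context" that gets stuffed into the prompt.
const context = library
  .map((c) => ({ ...c, score: cosine(queryEmbedding, c.embedding) }))
  .sort((a, b) => b.score - a.score)
  .slice(0, 1)
  .map((c) => c.text)
  .join("\n");
```

With real embeddings the idea is identical, just with bigger vectors and a proper vector store instead of an in-memory array.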

Customize the prompt

You will want to play with prompts that result in the right kind of output for your use case. In our case we are looking for the black box to not just start completing with sentences, but rather give back valid GraphQL.

You end up with a prompt such as:

Answer the question as truthfully as possible using the provided context, and if you don’t have the answer, say “I don’t know”.

Context:
${context}

Question:
What is a Shopify GraphQL query, formatted with tabs, for: ${query}

Answer:

You can see how the prompt is:

  • Politely asking for the answer to be truthful
  • Nudging for the answer to be tied to the given context (from the embeddings) vs. being made up out of whole cloth, and saying that it’s ok to say “I don’t know”!
  • Asking for a formatted GraphQL query
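In code, assembling that prompt is plain string work. This mirrors the template above; the sample context and query values are illustrative:

```javascript
// Build the prompt from the template; `context` comes from the embeddings
// lookup and `query` from the user (sample values are made up here).
const buildPrompt = (context, query) =>
  "Answer the question as truthfully as possible using the provided context, " +
  'and if you don\'t have the answer, say "I don\'t know".\n' +
  `Context:\n${context}\n\n` +
  `Question:\nWhat is a Shopify GraphQL query, formatted with tabs, for: ${query}\n\n` +
  "Answer:";

const prompt = buildPrompt(
  "products(first: Int): ProductConnection",
  "Get me 7 of the most recent products"
);
```

Ending the prompt with “Answer:” matters: it cues a completion model to produce the answer rather than continue the question.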

One other way that we try to stop the model from hallucinating is by setting the temperature to 0 when we make the completion call. As the OpenAI docs describe the parameter: “What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.”

It’s quite funny to see how we do everything to try to get the model to speak the truth with this type of use case!
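That completion call boils down to a request body like the following. The model name and max_tokens here are illustrative placeholders, not necessarily what we shipped:

```javascript
// Request body for OpenAI's completions endpoint; model name and
// max_tokens are illustrative placeholders.
const body = {
  model: "text-davinci-003",
  prompt: "…the assembled prompt from the context and user query…",
  max_tokens: 400,
  temperature: 0, // as deterministic as it gets: we want valid GraphQL, not creativity
};

// The actual call would look something like this (needs an API key):
// const res = await fetch("https://api.openai.com/v1/completions", {
//   method: "POST",
//   headers: {
//     "Content-Type": "application/json",
//     Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
//   },
//   body: JSON.stringify(body),
// });
```

Temperature 0 doesn’t guarantee the truth, of course; it just stops the model from deliberately rolling the dice.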

Feedback and Flywheels

Now it’s time for the flywheels to kick in. You want to keep feeding the context with high quality examples, sometimes show what NOT to do, play with different prompts, and start getting feedback.

You will see lots of examples where users are asked for feedback, e.g. in support systems and documentation: did this help? Is it accurate? To train the model as well as possible, look for ways to get this information from the experts (humans!) and feed it back, as well as simply tracking what your users are asking for and how well you are acting on those needs!

Creativity & Constraints

We have the foundations in place, and the quality of data will improve through the flywheels. Now it’s time to get more constrained. We are doing all we can to nudge for truth, but you can’t trust these things, so what guardrails should you put in place?

We really want the GraphQL that we show to be valid, so… how about we do some validation?

We take the GraphQL that comes back and do a couple of things:

  • Tweak it, when possible, to use valid IDs and content for the dataset in the mock.shop instance.
  • Validate the GraphQL to make sure the syntax is correct.
  • Run it against mock.shop, since we have real IDs, and show the results to the user!

You can’t assume anything, so you often will have to have a guard step once you get results.
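As a sketch of that guard step: in production you would parse with a real GraphQL library (e.g. parse() from the graphql npm package); this toy stand-in only checks that braces balance, just to show the shape of the check:

```javascript
// Toy guard step: a stand-in for real GraphQL parsing, checking only
// that braces balance and that there is at least one selection set.
function looksLikeValidGraphQL(source) {
  let depth = 0;
  for (const ch of source) {
    if (ch === "{") depth++;
    if (ch === "}") depth--;
    if (depth < 0) return false; // a brace closed before it opened
  }
  return depth === 0 && source.includes("{");
}

looksLikeValidGraphQL("{ products(first: 7) { nodes { title } } }"); // → true
looksLikeValidGraphQL("{ products(first: 7) { nodes { title } }");   // → false (missing brace)
```

Only queries that pass the guard get shown (and run); anything else falls back to an error state rather than trusting the model.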

ChatGPT vs. Stockfish

There was a lot of hubbub when someone pitted ChatGPT against Stockfish in a game of chess. Many used it as a way to laugh at ChatGPT: this thing is crazy! It did all kinds of invalid moves! No doy! You have to assume that and build systems to tame it… a chess engine wouldn’t allow invalid moves.

Defensive

You have to be incredibly defensive. You are poking a brain with electrodes. Amazing things come out, but you can’t trust everything that comes back. Remote calls to OpenAI are themselves flaky, and the service often goes down.

Not only should you be checking for timeouts and errors in results, you should also consider a feature flag toggle. In the case of mock.shop, the tool is usable without any of the AI features. They are progressive enhancements to the product.

We can add checks to automatically turn them off if something really bad is happening with OpenAI. Marry both: fetch OpenAI’s public status endpoint,

const openAIStatusRequest = await fetch("https://status.openai.com/api/v2/status.json");
const openAIStatus = await openAIStatusRequest.json();

and check the result for the type of incident:

openAIStatus.status.indicator === "major"

It’s incredibly fun, getting creative with how you can use the power of LLMs, which are getting better and faster all the time. The black box nature can be frustrating at times, but it’s worth it.

I hope you are having some fun tinkering!


There are so many helpful libraries out there. I have been working with some friends on Polymath (https://polymath.almaer.com/) to make it simple to import and create these embedding libraries, as well as query them.

Generative AI: It’s Time to Get Into First Gear

January 25, 2023

Don’t sit and wait, get tinkering!

We are almost at the end of the first month of 2023, and you are working on executing the year’s strategy, but we are witnessing an explosion, hopefully a Cambrian one, right in front of our eyes: Generative AI.

I wrote about how it can be a helpful tool for us with respect to documentation and beyond and we are seeing changes every week as we learn what works and what doesn’t.

Gil making GraphQL more approachable!

We are seeing developers jump on this, playing with ideas such as commit bots, app generators, ways to generate backends, IDEs, and so much more.

First Gear? Why Now?

There is so much promise, and people are already using these tools, so instead of sitting on it and being conservative, now is the time for us to jump in and get into first gear. There is always a fear of being too early in a hype cycle, but the reason I think the time is right is that you can see people getting value today. I have been coding with tools like Copilot and ChatGPT, and they are helpful enough that I wouldn’t want to go back to the Before Times. Do they get everything right? No. Are they great for all of my development needs? No, not as good as they should be.

Training all of my content, and being able to query it

What does it mean to get into first gear now?

  • Be thinking about use cases that you can start trying. I have been building things such as:
    • Using embeddings to bring a chat/search interface to our docs and samples.
    • Discord bots to start answering questions
    • Super-codemods that help you upgrade, and generally help you build
  • Build small experiments that one or two people can execute on and start to validate
  • Build a core competence in the technology, so you can quickly go from ideas to experiments

In first gear you have the pedal down, and the driver is quickly accelerating. This is a technology that is changing fast, and it is all about tinkering. Get tinkering.

When do we move to second gear?

You will learn so much through these experiments: what actually works, what doesn’t, and what needs more tuning and tweaking to be valuable. If things are going well and you see these efforts impacting your key results, you can ramp up and shift more effort into this work. That would be a sign you are seeing something somewhat revolutionary.

But wait, isn’t this a fad? Are we being sheep?

Maybe it turns out that this isn’t as big of a sea change as many imagine. I have been very skeptical of webN hype in the past few years, and I don’t think that GenAI is a silver bullet of any form. I am well aware that it hallucinates, and gives wacky answers at times. However, as mentioned above, I have already witnessed great value, and we have truly just started. I believe it can offer substantial UX improvements for our developer community. Sometimes you have to take a calculated risk. Worst case, you learn, and deliver some much-desired features along the way.

/fin
