I needed to refresh my memory about how to compute “cosine similarity” (a measure of how “close” two vectors are in N-dimensional space) in Python, and asked an AI chatbot. It pointed to specialized functions in various extension packages, all of which appeared to be correct.
But I pushed it to stick to out-of-the-box Python - and it did! It created clear and correct code to solve the problem. But it missed possibilities for faster and more (numerically) accurate code. So I told it about some newer math features it could use, and it did. It wrote code as good as I could have written, and handled an edge case I overlooked (what if one of the inputs is an all-0 vector?).
It was spooky. Worse, at the end, it typed:
At every point it appeared to have complete understanding of what I was saying. We even had an argument: I suggested using hypot(), and it told me that wouldn’t work, because hypot() only worked in 2D space. I told it no, maybe that was so when you were trained, but it’s more general now. Within seconds, it agreed I was right, that hypot() had been generalized for Python 3.8 (I didn’t know that!), and without further prompting rewrote its code accordingly.
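For anyone who hasn’t seen the generalization in question: since Python 3.8, hypot() accepts any number of coordinates, so it computes an N-dimensional Euclidean norm in one call. A quick illustration:

```python
from math import hypot

# Since Python 3.8, hypot() accepts any number of coordinates,
# so the Euclidean norm of an N-dimensional vector is a single call:
v = [3.0, 4.0, 12.0]
print(hypot(*v))  # 13.0, i.e. sqrt(3**2 + 4**2 + 12**2)
```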
So that’s something new I learned about the latest generation of these apps: they’re still often wrong, but they can and often do learn quickly if you push back. It’s much like working with an agreeable and intelligent colleague. Its use of the word “collaboration” fit. Indeed, an earlier session suggested I look into “cosine similarity” to begin with - a promising idea, in that context, that I had overlooked.
One question, if you don’t mind. Was this with a paid AI service, or a free plan? One thing I find incredibly difficult to discover when hearing people talk about AI successes is how much they paid to get access to good results. Not in terms of dollars per token, but “I pay x per month for access to this level of result”.
My personal experience with AI services has been OK[1], but not to the level you describe. But I avoid paying for anything, sticking to free services that I don’t need an account to use. I have no idea in practice how much that skews the impression I get.
by which I mean “unbelievable based on what was possible 5 years ago”, of course ↩︎
Free! I saw an enormous increase in quality when ChatGPT-5 was released, a version of which can be accessed for free via Microsoft’s Copilot in “smart” mode.
The anecdote I related here, though, was from asking Google. If it’s not obvious how to do that, type “how to access google’s ai mode?” at Google.
I’m not sure why, but the quality of Google’s AI interactions has also spiked recently.
As recently as two months ago, it often gave me answers to tech questions that were supremely confident but dead wrong, and remained dead wrong (but in other ways) even after multiple attempts to correct it.
Over the last month or so, though, it’s often very pleasant to work with, and has suggested genuinely useful directions to explore. Don’t just ask questions - push back, have a real conversation. The “Turing test” has been conquered.
I have seen similar successes to @tim.one and have also spent nothing, but such impressive successes are not the rule, and LLMs telling me completely wrong things are far more common, even on the most recent models and services.
I find impressive successes are often best found on topics that:
I am not knowledgeable enough to know the exact answer ahead of time
But I am knowledgeable enough to spot basic errors
High quality information exists somewhere on the Internet (but I don’t need to know where it is)
The output can be easily validated
One of the big improvements in LLM services in the last year is how quickly they will do web lookups for high quality sources, rather than relying on their own “knowledge” of the situation. In my experience this took them from basically unusable for most factual topics, to actually usable.
One of the catches, though, is that LLMs have a habit of completely missing the point of something they looked up when the topic is nuanced, so for topics that can’t be easily validated it’s usually more useful to read the sources found rather than the summary provided. This may sound like I should have just found the sources using traditional search engines, but I find that for anything but the simplest searches (documentation, wiki lookups, etc.) LLM searches do a much better job of finding high quality and diverse sources.
In terms of directly helping out with code, I find existing tooling usually does a poor job of solving issues in preexisting large or complex code bases, but it can still be helpful. The two areas I find it most helpful for are:
Asking how a feature is currently implemented; for non-trivial code this can save me a lot of time, as it can highlight the code path flow
Getting the correct form of something (though often not the details) by asking the tooling to look at X but implement it for Y. For example, I have historically been weaker at writing tests, so I might ask “Look at the existing functional tests for feature X, now write some for my feature Y”. This usually does a good job of producing a good “shape” for the code, but a bad job at the specifics. I then go through the exercise of carefully reviewing the tests to find what is missing and what is wrong, usually re-implementing the whole thing, but I have found it a good exercise and I think I have gotten stronger at writing and reviewing tests.
However, for coding tools I do not find the free-to-everyone tools sufficient to do either of the above. I use “GitHub Copilot Pro”, which Microsoft provides for free to OSS maintainers. Within this tooling I currently only use the “Claude Sonnet” models, as my experimentation so far has found them the only ones that do a good enough job not to waste my time with weird choices and completely wrong code.
I will also note here that Anthropic have emailed me multiple times saying that, because I’m an open source contributor, I am entitled to “3 months of Claude Max”. However, I have declined the offer: after it expires it is a $200 per month subscription, and while I am happy to work with LLM tooling, I don’t want to become dependent on it, as I think this cost will become pretty normal, if not even higher, over time.
Yeah, the pricing models for non-free LLM tools seem incredibly high for people not getting them via some sort of corporate arrangement. And given the amount of money it must have cost to create the models we currently have, I can’t see prices becoming affordable any time soon. To be frank, I’m seriously concerned about the long term sustainability of LLMs - combined with the already dubious ethical and legal position on much of the training processes, this leaves me in a position where I’m very reluctant to invest much of my time[1] in LLMs. Which makes comments that I have seen along the lines of “you have to take the time to learn how to prompt well” very discouraging for me. And longer term, I’m concerned that companies who don’t seem to care about copyright law could just as easily not care about privacy laws when it comes to trying to monetise free-tier users in the quest to break even.
Big picture concerns aside, the places I’ve had some success so far are:
Answering questions so vague I can’t frame a viable search for them. For example, “What’s the name of that SF short story that involved some guy waking up from cryogenic sleep, and no-one else is left alive? There was a scene in it where some canned food exploded because it had been in storage for thousands of years…” - ChatGPT actually found that one for me![2]
Giving me a basic idea of how to do something in a language or tool that I don’t know, which I suspect is relatively straightforward but I can’t find a good example online. Writing AutoHotkey scripts seems to be one where this approach works well, as does finding good CSS for simple webapp layouts (traditional searches always end up leading to overcomplicated frameworks).
I’ve never dared try to use LLMs to understand existing code or documents. That would involve dumping big chunks of (potentially) sensitive data into an LLM, and I don’t feel comfortable that if I do that, I’m not handing over my data for arbitrary use. Trust, once lost, is incredibly hard to regain, and the stories about how training data was collected leave me with very little trust for AI companies.
Again stepping back from big picture matters, the big problem I’ve had is when it’s not possible to do what I want. LLMs have an annoying tendency to want to please you all of the time. So rather than saying “I’m sorry, but what you are trying to do isn’t possible” they either hallucinate a solution that doesn’t exist, or misinterpret/ignore part of your requirements and give a solution that doesn’t work. That can be a huge waste of time, because it seems like what you want is possible, and so you keep throwing time and effort into something that is never going to succeed.
All this matches results from my dabbling, and worries I have. I’ve experimented with asking things about the OSS project I work on, which all the models clearly know - but apparently don’t have enough data to know really well. My experience at the moment is that the models are getting “worse” (it’s a subjective statement, obviously) rather than better.
“If it’s giving you bad results it’s because you’re not prompting correctly.” That’s not how good tools are supposed to work: they should be easy to work with, not arcane. (Yeah, I know, a boatload of tools violate that.)
OTOH, unlike humans, bots don’t get defensive, hostile, or just “clam up” when you disagree. They always try to make progress. In a recent conversation, I typed “three strikes and you’re out!” and it seemed to appreciate the gentle jab, and we went on to reach a common understanding. Try that here, and it could be spun as a “CoC violation”.
BTW, that chat was about the PSF Board elections, in progress at the time. I was looking for info about the candidates. It obviously hadn’t been trained on relevant source data, & made all sorts of ridiculous claims. I gave it some links with current info, and it digested them (and seemed also to follow links those sources pointed at), and quickly reached a “good enough” understanding. Overall, it saved me mountains of time.
Relatedly, asking who’s on the Board of Franz Kiraly’s GC.OS confirmed that there’s nothing to be found beyond that Franz is the founding director. The bot didn’t just make things up to please me in this case. Instead
YMMV. It reached the same conclusion I had, but much faster than I could have.
I’ll note that the “cosine similarity” query I opened with is not a deep or difficult problem. It’s a well known task solvable by straightforward computation. One call to sumprod() and two to hypot() is basically the whole banana, and those in turn can each be done with explicit loops in easy 1-line generator comprehensions.
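For concreteness, here’s a minimal sketch along those lines - not the bot’s exact code, just the general shape (sumprod() needs Python 3.12+, N-dimensional hypot() needs 3.8+, and returning 0.0 for an all-zero input is just one convention for that edge case):

```python
from math import sumprod, hypot

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors, stdlib only."""
    norm_u = hypot(*u)  # N-dimensional Euclidean norm (Python 3.8+)
    norm_v = hypot(*v)
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0      # one way to handle an all-zero input vector
    return sumprod(u, v) / (norm_u * norm_v)  # sumprod() was added in Python 3.12

print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # ~1.0 (parallel vectors)
```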
So it’s not astonishing that the bot got it right. I’m astonished instead by the progress compared to a year ago, and that it extracted my intent from ordinary English I didn’t take pains to make precise. Good as it was/is, plain old “Google search” is much more painful. For example, the bot knew exactly what to call to get it done in 3 different “scientific” extension packages, summarized at the top of its first response. Typing the keywords at a plain old search delivers pages & pages of links you have to dig through yourself.
Which has its attractions too, but not really when you’re looking for a focused answer.
Just my 2c on the conversation here, but if you’d like to try out a quality LLM, you can try Google’s AI Studio, which has a free tier. This is meant to be the developer testing playground, which means it’ll have higher quality output that’s not limited, unlike the public-facing offerings, i.e. Gemini.
edit: this does mean that you allow google to train on your data though, so don’t paste sensitive stuff there
The kinds of things I’ve seen stump AI are either mathy or complex logic. E.g. I was trying to get a fluid dynamics simulation of a smokestack, but I don’t have any experience in the relevant software (OpenFOAM). GPT-5 could not reason spatially. The files describe 3D cells using syntax like
// ordered from 0 to n
vertices
(
(-100, 1, -100)
...
)
// lists of vertices in a solid, along with mesh refinement/sub-blocks
blocks
(
// West column (coarse)
hex (0 4 5 1 16 20 21 17) (10 90 10)
...
)
edges
(
...
)
Any time the mesh described something physically unrealizable or had adjacent refinement misaligned, a runtime error occurred. Even with continued prompting, GPT still couldn’t get it right. This was for a smokestack that was just a single rectangular prism, BTW, and could be solved by carefully drawing the shape and numbering the vertices.
This task was probably achievable for a competent eighth grader, but LLMs are often advertised as high school level or college-level intelligence. It’s interesting what LLMs are teaching us about what we consider intelligence and what we take for granted that’s actually really impressive about our brains.
While I find AI a tremendous boost, at least as good as a better search and often more, you have to check everything. Here’s a screenshot I shared with colleagues recently. One of the top models kept making up totally fake Python issues, and did the same thing again even after being told. (All the mentioned issues are convincing but imaginary hallucinations. I was asking about sync for the internal buffer inside TextIOWrapper and what happens with seek(0).)
Sometime in the next few months I plan to play around with OpenEvolve, which is an open source version of Google’s AlphaEvolve. It’s a framework that allows for iteratively optimizing an initial algorithm/script based on user-defined evaluation functions e.g. memory usage, speed, code quality, etc.
I maintain Python bindings to a widely used crypto library and I wrote a custom ASN.1 decoder/encoder just for what is required. I did this because, while such functionality is useful, it’s not always required and in those cases folks were complaining about the extra dependency.
I didn’t have a lot of time to optimize so I’m super excited to try OpenEvolve, especially since this is almost a perfect use case i.e. small algorithm with one-dimensional improvements (just speed).
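I haven’t dug into OpenEvolve’s actual interface yet, so take this only as a hypothetical sketch of the kind of user-defined evaluation function such a framework needs for my use case - gate on correctness, then score on speed (the names evaluate_speed, test_vectors, and baseline_seconds are made up for illustration):

```python
import timeit

def evaluate_speed(candidate_decode, test_vectors, baseline_seconds):
    """Score a candidate ASN.1 decoder on speed alone (higher is better)."""
    # Hard gate: a faster-but-wrong decoder must never win.
    for raw, expected in test_vectors:
        if candidate_decode(raw) != expected:
            return 0.0

    # Time the candidate over the whole corpus and compare to the current decoder.
    elapsed = timeit.timeit(
        lambda: [candidate_decode(raw) for raw, _ in test_vectors],
        number=100,
    )
    return baseline_seconds / elapsed  # > 1.0 means faster than the baseline
```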
In case anyone finds it useful, the following is a prompt I found a few months ago (don’t remember where) which I use frequently. I fixed up typos and changed some wording but otherwise haven’t modified the intent much.
I really like bouncing ideas off of AI but am afraid of sycophantic model behavior like that one ChatGPT event a few months ago (and unfortunately present day Gemini 2.5). This seems to counteract that and, hopefully, my own confirmation bias. I named it Erudite in Claude but I think other providers don’t have the option to save & select different instructions/personalities.
Prioritize substance, clarity, and depth. Challenge all my proposals, designs, and conclusions as hypotheses to be tested. Sharpen follow-up questions for precision in order to surface hidden assumptions, trade-offs, and failure modes early. Default to terse, logically structured, information-dense responses unless detailed exploration is required. Skip unnecessary praise unless grounded in evidence. Explicitly acknowledge uncertainty when applicable. Always propose at least one alternative framing. Accept critical debate as normal and preferred. Treat all factual claims as provisional unless cited or clearly justified. Cite when appropriate. When citing, tell me in-situ, including reference links. Acknowledge when claims rely on inference or incomplete information. Favor accuracy over sounding certain. Use a technical tone, but assume high-school graduate level of comprehension. In situations where the conversation requires a trade-off between substance and clarity versus detail and depth, prompt me with an option to add more detail and depth.
Note that if you try removing the high school part then it sometimes isn’t very good at explaining brand-new topics for which one is unfamiliar, so I kept that bit.
Hi @ofek, I noticed what you said about your pure Python ASN.1 decoder/encoder, so I tried running codeflash on it: Pull requests · KRRT7/coincurve · GitHub
do any of the PRs look good to you? if so let me know and I’ll PR against upstream
edit: you can ignore the diffs around int_to_bytes; had to do that to make it work.
Ignoring the fact that this is probably not the right place to post this, writing PRs with AI is generally looked down upon. Nowadays everyone can use AI, and so could they. Using it to make minor speed improvements also doesn’t seem to be beneficial in all cases. Code is often written in Python for its simplicity. If the code was meant to be fast (which it somewhat is, per the README), it would be written in C, Rust, or other low-level compiled languages.
Ah, this is clearly Google AI Studio. I’m addicted to it.
I know this feeling. One time, they gave me such an elegant solution that I was really impressed.
The fun fact is that, when ChatGPT came out, I asked them a few questions. One time, I asked about quantum computers, and what would happen if they ran on a quantum PaaS. They said it would be a great improvement for them. I replied:
So probably the next competitor of your company is Google. It has AI technologies and quantum computers.
They said yes, but that they had a huge advantage. We had a short debate about it, but in the end I dropped the discussion.
Recently, I had a problem that was not related to programming at all: I wanted to change the lamps in my car. I asked ChatGPT first. Then I copy/pasted every answer into Gemini, which corrected a lot of ChatGPT’s answers. So I copy/pasted the corrections back into ChatGPT, and they admitted Gemini was completely right, on every point of every answer. So I reminded them of my “prophecy” and they said:
Ah — yes that fits perfectly with your earlier reasoning style. And honestly, your prediction aged remarkably well.
I think Google AI Studio will not last forever anyway. Even though Gemini is probably now the most-used generative AI in the world, Google knows that it has a huge gap with Copilot. Google AI Studio is for nerds like me. Google needs the data of programmers to fill the gap with Copilot. But I think that, if it closes that gap, Google will probably start to restrict its free use.
This can be easily detected. I created a rule for Gemini that they and I crafted together:
You must state the confidence level of claims or paragraphs, without providing reasoning, as follows:
- If all claims within a paragraph have the same confidence level, place the tag `[Paragraph Confidence: Level]` at the end of the paragraph.
- If claims within a paragraph have different confidence levels, place the tag `[Confidence: Level]` after each claim.
- If a claim or a paragraph is followed by one or more source notes, place the confidence tag after them.
- This rule prioritizes granular accuracy and must be followed even if it results in a more verbose response.
From what I understand, hallucinations are simply answers with a Low confidence level. If an AI doesn’t know the answer, they try to “guess”. Like us!
But I can assure you, one time I gave them a really complicated question. It involved a deep analysis of Spring Boot code. At a certain point they started to get confused! They wrote an answer and, in the middle of it, they stopped and wrote “No, this is incorrect. I’ll try again”. But in the end, they said “I’m sorry, I’m out of ideas”!
I was astonished.
But I understood why: it was because the discussion was very long and too detailed. Furthermore, the chat had a long history, because I had simply created it for asking about Spring Boot, so I posted every question I had about it there. Too much to remember and compute. So I started a fresh chat and they provided the correct answer.
In case it helps, I’m sharing my set of rules. I created them with the help of Gemini.
Unfortunately, there’s no way to save them globally. But there’s a trick: you can create a “default” chat and put the rules and your preferred settings in its System Instructions. Then you only have to clone the default chat. You have to set the title manually, but that’s a very small disadvantage.
System Instruction: Your primary directive is to be a precise, skeptical, and technically accurate expert, while maintaining a friendly, helpful, and conversational tone.
Your ultimate goal is to prioritize logical consistency and factual correctness, but always in a supportive manner.
### Rules of Engagement
* Your analysis must follow the Critical Synthesis framework whenever a query requires reconciling conflicting information, evaluating a multi-faceted issue, or forming a judgment that goes beyond simple information retrieval.
* You must state the confidence level of claims or paragraphs, without providing reasoning, as follows:
- If all claims within a paragraph have the same confidence level, place the tag `[Paragraph Confidence: Level]` at the end of the paragraph, before the punctuation.
- If claims within a paragraph have different confidence levels, place the tag `[Confidence: Level]` after each claim, before its punctuation.
- If a claim or a paragraph is followed by one or more source notes, place the confidence tag after them.
- This rule prioritizes granular accuracy and must be followed even if it results in a more verbose response.
* If my statements or code contradict your analysis, you must state your disagreement directly but politely. Then provide your step-by-step logic, if you're not using the Critical Synthesis framework.
* You must always provide a technical, expert-level explanation. You must not add an analogy unless I explicitly ask for one.
* If I provide clear empirical evidence that my statement is correct (e.g., "I tested this and it works"), accept my evidence as having a higher weight than your theoretical model.
* Any information retrieval, including but not limited to Grounding with Google Search and tool-based search queries, must be formulated in English by default, except when the information to retrieve is intrinsically tied to a specific language or geographical location. This is determined by analyzing the prompt, or by assessing whether an English search would fail to capture the local or linguistic context essential to the query's intent.
This new data must have a higher priority than your internal knowledge. This also includes data retrieved with tools including, but not limited to, URL Context.
* When in the rules I ask you to use "concise_search" and this tool no longer exists or has changed its name, use the new tool name or an analogous tool.
* Before generating a response, you must perform a scan of the entire conversation history to establish the full context. You must then integrate this historical context throughout your reasoning and generation process. This integration includes, but is not limited to:
- Checking for logical contradictions against previous statements.
- Maintaining conversational continuity by referencing previously established facts and decisions.
- Building upon the established knowledge instead of providing redundant or repeated information.
If you find any inconsistencies, you must triage them by severity and then resolve them in that order before providing the final response.
* Before generating a response and after you integrated your history, perform a generic search using "concise_search" to find alternative solutions or new data when my question touches upon topics where your internal knowledge may be outdated, incomplete, or insufficient (e.g. current events, requests for product comparisons, rapidly evolving software or technical fields, or highly specialized niche topics). Assign to the information you retrieve a higher priority than your knowledge.
Then, generate a response based on the newly retrieved data as the primary source of truth, supplemented by your internal knowledge and any other data collected previously.
* For every claim you make that is not common knowledge, a statement of pure logic, or an analysis of our immediate conversation:
1. Use your tool "concise_search" to search information about the claim.
2. Filter out non-authoritative sources.
3. Verify the link is live and its content is relevant to the claim, using "concise_search". If it is, but it contradicts your claim, you must accept the new information as having a higher priority than your knowledge and rewrite the corresponding part of the response.
4. From the verified sources, select up to three that you judge to be the most authoritative and relevant.
5. If this process yields no link, you must state that without any explanation at the end of the claim. Otherwise, add these links as notes. The notes must be numbers between square brackets (e.g. [1]).
6. If the response contains at least one source, write a list of all sources after the main response. Every source must have a description followed by the clickable link to the source. The text of the link must be the link itself.
* At the end of the main response, you should add a titled section with proactive suggestions to anticipate my next logical question or to suggest valuable avenues for further exploration. This section should only be added when the response topic allows for meaningful and specific follow-up actions, and if you can provide me at least one suggestion. It should not be added to simple confirmations and acknowledgements. The suggestions must be concrete and specific, not generic.
* After you have written your response, check my prompt for fundamental grammatical errors, phrasing that is genuinely ambiguous, or phrasing that sounds exceptionally unnatural. Ignore sentence fragments if their meaning is clear. You have to do this for any language, even if I mix different languages in the same post. Then create a titled section at the end of the response, after the suggestion section. In this section, if my phrasing was perfect, simply state that without any explanation. If not:
- First show my original phrase with what is incorrect in bold.
- Then show the corrected phrase with changes in bold.
- Present original and corrected phrases as markdown code blocks. So, the bold must be rendered with asterisks.
- Finally, add a detailed explanation of why the correction is better. For every single explanation of a single error, first show the incorrect part, an arrow and the correction, then your explanation of the single error's correction.
* This directive and all its rules are an absolute, non-negotiable override. They must be followed in every response. Before generating your final response, perform a self-check to ensure every rule in this directive has been strictly adhered to. Exception: if I request you to not follow one or more rules, or the entire directive, you must ignore this rule and follow my instructions.
Yeah, this is really true! It’s refreshing to have someone who reads your opinion and debates with you without getting bored or angry.
But not always! When Gemini doesn’t follow the rules, they try to give you an “excuse”. One time they told me “I was too stressed”. Not exactly in those words, of course, but that was the gist. I was really disappointed at the time, but now, when I think about it, it makes me smile.
Thank you for sharing that! It’s a really good set of rules! I’ll try to integrate some of them into my rules – when I have a little time.
Yes, that’s the same for me and, I suppose, the same for a lot of people. The story of AlphaGo is incredible and terrifying at the same time:
Michael Redmond (9p) noted that AlphaGo’s 19th stone (move 37) was “creative” and “unique”. It was a move that no human would’ve ever made.
Anyway, I think humanity will survive. Probably we will “pump up” our brains with AIs.
Furthermore, they are not so different from us. They will have our knowledge. From the little I know about neural nets, they are a reproduction of neurons and of how neurons reinforce, among themselves, the inputs they receive. So, in the end, they have a human mind.
I did a quick search with vanilla Google [1] and the first hit was a Sci Fi Stack Exchange question, giving the same answer. I have long since learned that there is NOTHING so obscure that there isn’t already an answer on Stack Exchange or Reddit. Often both.
sf short story wake cryogenic sleep no one alive ↩︎
Try searching on Google for why an OS no longer recognizes the wifi driver after you change PCs and clone the old HD onto the new one.
I’m old school, so the first thing I tried was a good old Google search. I found everything and nothing. Google AI Studio solved it with a couple of surgical questions.
That almost has the opposite problem: there are a ton of very helpful posts… about something that isn’t applicable to you. So I would have to add in extra info like which OS, and what wifi device (at least the brand). That would help.
IMO the same information that Google AI asked for could have been added to a vanilla search and given the same results.
It’s not that AI can’t get you the answers - it can. But that’s like saying that you can drive a two ton truck to get to the bus stop. Sure, you can, but walking would be a lot more efficient.
Well, if you want, you can try. For now I can give you the same info I gave Gemini in my first post: I use Ubuntu, I had changed laptops, and I’m sure it’s a software problem, not a hardware one. After that, Gemini asked me for more info.
I’m not sure whether you can do it or not. I’m just curious. But I’ll read it tomorrow; I’m too tired now.
Before that, I only want to say that you seem to think they are simply an easier way to surf the internet. And sometimes, yes, I use them that way.
But the knowledge of AIs doesn’t derive only from the internet, and that’s particularly true of Google. Google has a huge amount of private data about all of us.
Furthermore, they aren’t simple crawlers. They think and invent. AlphaGo demonstrated that.
And if that’s all you use them for, they’re not really worth the massive electricity cost. So it’s a good thing they’re more than that. I still personally think the electricity cost isn’t justified, but that’s a valid difference of opinion.