Changelog Interviews – Episode #665
The world of open source metadata
with Andrew Nesbitt from ecosyste.ms
Andrew Nesbitt builds tools and open datasets to support, sustain, and secure critical digital infrastructure. He’s been exploring the world of open source metadata for over a decade. First with libraries.io and now with ecosyste.ms, which tracks over 12 million packages, 287 million repos, 24.5 billion dependencies, and 1.9 million maintainers.
What has Andrew learned from all this, who is using this open dataset, and how does he hope others can build on top of it all? Tune in to find out.
Featuring
Sponsors
Tiger Data – Postgres for Developers, devices, and agents The data platform trusted by hundreds of thousands from IoT to Web3 to AI and more.
Augment Code – Developer AI that uses deep understanding of your large codebase and how you build software to deliver personalized code suggestions and insights. Augment provides relevant, contextualized code right in your IDE or Slack. It transforms scattered knowledge into code or answers, eliminating time spent searching docs or interrupting teammates.
Outshift by Cisco – The open source collective building the Internet of Agents. Backed by Outshift by Cisco, AGNTCY gives developers the tools to build and deploy multi-agent software at scale. Identity, communication protocols, and modular workflows—all in one global collaboration layer. Start building at AGNTCY.org.
Miro – The innovation workspace for the age of AI. Built for modern teams, Miro helps you turn unstructured ideas into structured outcomes—fast. Diagramming, product design, and AI-powered collaboration, all in one shared space. Start building at miro.com
Notes & Links
Chapters
| Chapter Number | Chapter Start Time | Chapter Title | Chapter Duration |
| 1 | 00:00 | This week on The Changelog | 01:17 |
| 2 | 01:17 | Sponsor: Tiger Data | 01:38 |
| 3 | 02:54 | Start the show! | 01:10 |
| 4 | 04:04 | Some history | 13:17 |
| 5 | 17:21 | Research use | 01:39 |
| 6 | 19:00 | Sponsor: Augment Code | 01:33 |
| 7 | 20:33 | CLI install patterns | 04:40 |
| 8 | 25:13 | 15k people run the world | 01:53 |
| 9 | 27:06 | Tracking the funding | 03:58 |
| 10 | 31:04 | Friends tipping circle | 01:55 |
| 11 | 32:59 | How he stores everything | 03:31 |
| 12 | 36:30 | Who's involved | 04:18 |
| 13 | 40:48 | Footing the bill | 09:11 |
| 14 | 49:59 | The black sheep | 09:34 |
| 15 | 59:33 | Sponsor: Outshift by Cisco | 01:13 |
| 16 | 1:00:46 | Sponsor: Miro | 01:27 |
| 17 | 1:02:13 | The schema is not simple | 04:51 |
| 18 | 1:07:04 | Curbing the enthusiasm | 03:21 |
| 19 | 1:10:24 | Deciding what to work on | 05:39 |
| 20 | 1:16:03 | Metadata substrate | 05:13 |
| 21 | 1:21:16 | Designing useful tools | 03:38 |
| 22 | 1:24:54 | Telemetry via exhaust | 02:05 |
| 23 | 1:26:59 | Information black holes | 02:55 |
| 24 | 1:29:54 | Octobox update | 01:05 |
| 25 | 1:30:59 | Exciting new uses | 01:01 |
| 26 | 1:32:00 | OSS Taxonomy | 07:15 |
| 27 | 1:39:14 | What Andrew wants | 02:45 |
| 28 | 1:42:00 | Wrapping up | 00:22 |
| 29 | 1:42:22 | Closing thoughts | 01:35 |
Transcript
Play the audio to listen along while you enjoy the transcript. 🎧
Today, we’re joined by, for us, an old friend, but a long time no talk… Andrew Nesbitt is here with us. And you know, Andrew, I came across Ecosystems, which is ecosyste.ms - nice domain hack; hard to say out loud, but it looks cool in the URL bar…
It does look cool.
I came across this and I thought “This is a very cool project. It seems somewhat familiar. I can’t quite put my finger on what it could possibly be… And then I saw it was from you, and I’m like “Oh, it makes total sense.” This is right up your alley. We’ve had you on the show many times back in the day, talking Octobox, talking Libraries.io, talking Ruby ecosystem and dependency management… And it looks like you’re still out there, kind of beating around that same bush. So first of all, welcome back to the show…
Thanks for having me. Yeah, it’s great to be back.
Ecosyste.ms. I mean, okay, we have a lot of context that maybe our listeners don’t share, but take us back to what you’re interested in, which - it seems like you’ve been interested in similar things for a long time. And you built libraries.io around this, and Ecosyste.ms is a very similar thing… I’m wondering if it’s the same old thing, or if it’s a new-new thing… So tell us about your past and like collecting and organizing dependencies, and the information about them, and open source projects, sustainability, and then how that brought you to Ecosyste.ms.
Yeah, okay. So I have been swirling around the world of open source metadata for - it must be coming up to 10 years now. Starting with 24 Pull Requests.
That’s right, 24 Pull Requests, yeah.
And that didn’t kind of start from metadata, but the idea of that project was to encourage people to contribute to open source as part of kind of the run up to Christmas… And after kind of like first getting that off the ground, we quickly ran into “Oh, how do we suggest – where should people go and contribute to?” And a lot of people would try and send a pull request to a project that had no activity, and like the maintainer was gone, or just were struggling to be able to even like work out how to send a pull request to some projects, because they were really not very friendly or easy to contribute to.
And that kind of led me down this path of “Okay, well, how do you define what a good project is?” And then “Can we scale that up, rather than manually having to have people kind of like submit their things and keep those things up to date every year?”, because that project would just kind of come and go every December, and shut down afterwards… So the maintenance there couldn’t be entirely human, because there was thousands of people contributing to that project, and sending pull requests… And it was a lot of data to try and work with.
So I started to build out some basic metrics there to try and go “Does this project look like it has activity that’s happening on it? Does it look like it’s ever received third party contributions?”, and things like that. And that led me to kind of – I got a job at GitHub from there, and then GitHub promptly fell apart internally… Tom Preston-Werner left, it was a horrible time… And so then I left there and started Libraries.io as essentially like a - okay, well, looking at package manager metadata is a different way of kind of getting some measure of what’s an interesting open source project. Like, rather than just using stars, which - stars is a terrible metric, and has very little kind of bearing on a lot of projects, especially as you go down from the massive frameworks, those huge keystone projects. Once you get down to smaller libraries, and also especially the kind of low-level critical projects that are doing a lot of the real work… They don’t get a lot of attention, and a star is basically a measure of attention, how many people are landing on that GitHub repo page.
So package manager metadata was like “Oh, this is really juicy”, because it kind of gives me a hook into saying “These libraries are being used by other people.” But download stats - again, available for most package managers, but not all - is often kind of wildly all over the place for certain projects, especially if they’re used a lot in CI; you’ll just see really inflated download stats. And you also don’t necessarily see those for dev dependencies, the things that people, especially maintainers are installing on their laptops to be able to work on those projects. But they’re not necessarily a runtime dependency of all the applications; there are definitely gems that Ruby and Rails devs use locally, but aren’t shipped with the Rails app, so you would never see those numbers.
[00:08:18.01] And the insight that I kind of accidentally tripped over was if we go mining the dependency information out of open source repositories at a large scale, you actually start to get a really good picture of how people really use open source, and how they don’t use open source. Like, if a project breaks, you probably don’t go and un-star that project, let’s be honest. Not many people are un-starring things. They don’t remember. And also, you don’t un-download a thing. The download counts remains after you downloaded it and was like “Oh, this doesn’t actually work” or “This is not what I wanted”, or it has become unmaintained. Whereas actual “I depend on this thing”, if I remove that thing as a dependency, then numbers go down, and you get a really interesting, strong signal that something is maybe not quite right with that project.
So that kind of led me onto a path of “I should just try and index the dependencies of every open source project ever.” And libraries.io started out as a search engine, designed to be like “I can help you try and find the best package.”
And that was primarily like “This package is well used, so therefore that implies it has good documentation, that it actually works, and other people are using it as kind of a proxy.” And it grew and grew, and became a massive and expensive and difficult project to maintain as a side project, whilst I was doing contracting. And me and Ben, who were working on it, we’re like “Well, what are we gonna do? How can we turn this into a sustainable project that can fund itself?” And at the time, GitHub had just implemented its own dependency graph as well, along with purchasing Dependabot… And that basically – they started giving that away for free. That pulled the rug out of any plans we had to monetize libraries.io directly, as well as a project I was building called Dependency CI… Which never really got off the ground, but was back in the day was like “Oh, this is really cool”, because it could literally like block your pull request to say “You’re trying to add a dependency here that is not good, because it doesn’t have a license”, or it’s got security issues, or other things. And so we ended up selling to Tidelift, to try and find some way of recouping the costs of building out that project… But just before we did, we also made all of the code open source, and all of the data open source. So it was kind of like an airdrop into the community to be like “This is always gonna be here if you wanna use it for purposes.”
Didn’t really work out at Tidelift… There’s a big cultural difference in the founders at Tidelift compared to me and Ben. Me and Ben are very – we really like building and solving problems in the open, and shipping stuff really quickly, and kind of iterating on those things… And Tidelift’s culture was - because they just sold to another company…
Yeah, who bought Tidelift?
Sonar? I can’t remember the name. It’s a security company.
Okay.
And as a shareholder of Tidelift, I can tell you, I didn’t get anything from the sale.
Bummer, dude…
[00:12:02.19] But Libraries.io was there, and was open source, and after – I took a break for a little while during the pandemic, which - you know, everyone had a kind of a crazy time… I went to do some contracting with Protocol Labs, basically kicking the tires on IPFS and Filecoin, and trying to use it as a real user… It was an interesting time to actually try –
[unintelligible 00:12:31.22] was pretty cool.
Yeah. [laughs] And then at the same time was talking to Schmidt Futures, which is now Schmidt Science… But one of the kind of sub-foundations of the Schmidt Foundation, who were basically saying “We have researchers that were using the data from Libraries.io for research”, but now Libraries.io – when I left Tidelift, they started to remove features of Libraries.io, especially the API access and the data… And Schmidt Futures basically kind of came along and said “Could you stand up another copy of it?” And I was like “We could do that… But what if we rebuilt it from the ground up as infrastructure for research purposes?”, rather than taking the same code, which is like one big search engine, one honking great Rails app, and actually make it into kind of a slightly more – like, take all the lessons learned, but instead of building it as a search engine, instead build it as a base layer of open source metadata, which then can be used to build a Libraries.io on top of it. And that also means we can take some of those lessons that were like “Oh, actually, it turns out contributing to a project that has one absolutely enormous database schema is really difficult.” Like, trying to stand that up yourself is really hard as a contributor. So people will just bounce straight off the project, because they’re like “Well, there’s no way I can possibly comprehend how big the stuff that’s going on here…” And then also, the performance implications of deploying a change, that might be like “Oh, you’re about to touch a table with like a billion rows in it.” That’s gonna be difficult for you to test without me giving you production access… And I really don’t wanna do that to random third party open source contributors.
So Ecosyste.ms is essentially a do-over of Libraries.io. It’s many different Rails apps that are focused on collecting different kinds of open source metadata, and then combining them together in different ways. So there’s a packages service, there’s a repo service that collects the dependency information from repositories, there’s an advisory service, and a commit service, and an issue service… Basically, all the different things that you might be interested in. And each one of them can then be independently worked on and scaled up as different amounts of data [unintelligible 00:15:10.20] and kind of collected in different places.
And that has been going on for nearly - I wanna say three years now… Really kind of like going from – it was a nice kind of year where I just worked on it myself, didn’t really tell anyone about it, just kind of like plugged away… And there are core pieces – because Libraries.io is open source, I was able to reuse the dependency parser and a load of the mappings to the package managers… Actually take that code and kind of reuse that in a way that also allows you to have multiple different package manager registries, where Libraries.io would only support one… Which was really nice when RubyGems had all of its drama recently, and the gem.coop popped up… I was able to go “Oh, I can quickly start indexing gem.coop.” It just fits straight into that new schema. And then since kind of like the past year, it’s just absolutely exploded in usage. The amount of traffic today alone was 50 million requests to the API.
[00:16:23.24] Wow.
Wow,
And it has become quite a piece of critical infrastructure to a number of different kind of areas of open source in terms of SBOM enrichment, and also trying to find those critical pieces of open source that need security work or need sustainability efforts to be kind of coordinated around them.
Well, I’m happy to hear that you got to reuse some of your code from Libraries.io, because what I thought was gonna happen when you said “I airdropped it”, I thought you were gonna just catch your own airdrop a few years later and be like “And because I open-sourced it, I just relaunched it under a new…” But obviously, the big rewrite is a very tantalizing thing, especially when you’ve been living with all your mistakes for this time. It’s like “Let’s start over…” But you got to reuse some of your code, which is really awesome. So nice job open-sourcing that when you still had an opportunity to do so.
Yeah, absolutely. You mentioned this is used in research… I guess, research terminology, so to speak. What exactly does that look like? Who are those folks? What kind of research are they doing? Are they developers? Are they developer-adjacent?
I think mostly developer-adjacent, or in the research space I guess you’d call them research engineers… Where lots of computer science researchers are like “We want to study what these behaviors are like across different package managers”, or comparing what are developers doing in this space versus that space, especially around the dependency stuff, to be able to go “Oh, the average number of dependencies in a JavaScript app compared to a Ruby app”, for example, which I think is about 10X… And then looking at kind of “Can you go down those dependency chains and find where the security problems are, or the license problems are?” And also leading into kind of “How can we encourage best practices in this space?” Or work out “How many projects have taken on these various kinds of”, specially – just recently I had a call with someone who’s looking at all the attestations around trusted publishing. Like, can we see the share of usage of packages that have the trusted publishing setup, and are publishing attestations into a SIG store, compared to the overall space? And also then breaking that down across different ecosystems as well.
Break: [00:19:00.27]
This might be silly, but let me ask you this… I’ve been researching some CLIs and I’ve been researching how CLIs install themselves. Sometimes they’ll leverage the actual package manager of the distro, like a Linux distro or something like that… But most, by and large, just give you a URL to curl and pass to Bash, essentially… Which can be problematic if you don’t trust the script.
If I wanted to somehow research CLIs and how they install themselves, and the various ways they install themselves, is that something that this service could do? Is that the level of research it could do?
Yeah. I mean, for one thing, you would be able to quickly find everything that had kind of tagged itself up as a CLI program. I’ve also been indexing every public image on Docker Hub, and basically running an SBOM scanner against each one of those. There would be some juicy insights there, to be able to go “How many–”
Juicy? I like juicy.
“…of these things were installed via a distro package manager, versus “we just have a URL for this”? Which would be recorded in the SBOM, basically to say “Oh, we’ve found this known bit of open source, and it appears to say that it sits in this file system here”, which implies it was installed by apps, or it’s in a random space, like it was probably curled down, along with the Docker file that was used to build that image. And there’s a good kind of million open source Docker images on Docker Hub, or at least individual versions of things. And you also get the interesting aspect there that you can kind of multiply that by the number of downloads that some of these Docker images have… And some of those numbers are crazy. Millions and millions of downloads of a particular image. And of course, those numbers inside that one container are never reflected in the package managers upstream. So just because it was downloaded in Docker doesn’t mean that that actually shows up as being a million downloads in RubyGems, or on Npm. So you start to see some really interesting things, and you start to see those download numbers, or the proxy for a download number of distro packages as well… Which is a really hard number to get a hold of, because every distro package manager is very heavily mirrored, and basically just a file system somewhere exposed over HTTP or Rsync. So no one has good download stats for those things. The only place you really find that is the Debian popularity contest, which is opt in, not opt out.
So you’d be able to go “Oh, okay. Well, I can see – here are the CLI programs that are being manually downloaded inside of Docker images as part of this install process.” It’s not gonna give you everything, but it certainly gives you a good proxy for “Okay, well, I can see where –” Like, relative usage of these things starts to show up, which is where I’ve found the most useful ways of kind of sorting different piles of packages or whole registries, is to go “Okay, well, if I sort this registry by the number of dependent repositories, or the number of dependent packages, which things show up at the top?” And then also, which of those things make up 80% of all of this stuff?
[00:24:28.27] And you actually end up looking, like – I like the 80/20 rule, but it doesn’t actually turn out to be like 20% of packages make up 80% of usage. It’s like 0.01% of packages make up 80% of usage. These tiny amounts. There might be 2000 Node modules total that make up 80% of all of the usage of Npm in terms of downloads, and in terms of discrete dependent repositories…
When you then start to really focus that lens, you see a long tail of stuff that never gets used, and it’s also like all kinds of spam and malware and stuff that floats around. But there’s 10,000-15,000 packages which are the packages that make up most open source usage across all these ecosystems.
It’s kind of amazing how massive that asymmetry is when you pin that down to the individuals…
Yeah, and that’s on average one maintainer per package at that critical level as well… So that’s like 15,000 people maintaining all of open source usage.
That makes the XKCD comic even more poignant. Now you’re the one person in Nebraska - replace Nebraska with wherever they are in the world; probably in different towns…
And how many of them have you had on the Changelog?
That’s a good question. Probably a good percentage of those. Oh, man… So there’s 15,000 people basically running the world for free. [laughs]
Well, I have done a little bit of indexing of how many of those have GitHub Sponsors, or are their projects on Open Collective, or they have some other kind of funding link… And in terms of those top critical packages, it comes out to kind of like - depending on the ecosystem, it’s somewhere between 25% and 50% have some way of “Here’s an automated way you can give me a donation to the project.”
There’s a good chunk of those as well that are massive, corporate-funded projects. Like, all of the AWS RubyGems that make up the AWS CLI are in the top of RubyGems, because they’re just massively used. They don’t need any funding, because Amazon has full-time staff. But there’s a good –
They might need some funding. I hear they’re laying people off again… [laughs]
Hopefully, they didn’t lay off all the Ruby people maintaining the CLI there.
Yeah, that would be awful. So do you track – so you’re tracking those who are able to receive funding in some sort of automated fashion. Do you track funding itself? Like, who’s getting how much money, and how?
Yes. Well, where possible. So I’m tracking – I call it a funding link, and some package managers have funding links support, where you can say “Oh, you can donate to me over here.” Repositories have the funding YAML file, and I go looking for that wherever possible. And you actually see that even on GitLab and Codeberg. I don’t know how well those platforms display it in the UI, but it definitely – because obviously, GitHub Sponsors is not… I don’t think there’s a GitLab Sponsors or a Codeberg Sponsors. Those files do show up all over the place.
[00:27:54.27] And then also being able to go “This repository is owned by a user on GitHub who is part of GitHub Sponsors” is another way of kind of detecting that. Even if they haven’t added their funding YAML file, we can kind of make a hop to say “Oh, here’s one of the maintainers to be able to support that.” I then collect the data from GitHub Sponsors of every – because GitHub Sponsors users are public. You don’t get any financial numbers, but you do get “Here’s the number of active sponsors of things, and here’s the total, like all-time…” It’s quite hard to get time series data out of that API, so instead I basically just kind of snapshot it on a regular basis, to go “Oh, here’s what’s the current state of the world in terms of GitHub Sponsor funding.”
It’s a bit weird, though. A lot of people who have realized that GitHub Sponsors is actually quite a good way to sell digital goods. If you go looking at the top users of GitHub Sponsors who have the most people funding them, they sell things like avatars, and Discord memberships, and eBooks, and things like that. They’re not necessarily kind of selling “Oh, I can maintain this project better for you.” That’s not – like, Open Collective is so much bigger in terms of actually like supporting the projects as a collective, because they’re just set up in a totally different way to GitHub Sponsors.
Yeah. That’s fascinating. So they’re kind of doing sponsorware, insofar as it’s not a donation, or “You’re supporting my work on this project.” It’s like “Actually, there’s a quid pro quo here. We’re going to trade a good or a service for that sponsorship money.” Really, it’s a purchase of some sort of thing.
Yeah, yeah. If you go looking, it’s easy to see GitHub doesn’t make it particularly – like, they don’t have a leaderboard… Which is a good thing, to not – like, putting a leaderboard on things can often produce some very strange behaviors…
There’s also an interesting breakdown of like number of users who sponsor other maintainers, versus companies. Obviously, companies are going to sponsor a lot more in total amount per company, but the distribution is quite surprising in – you’re looking at easily 10 times as many individuals are sponsoring other people on GitHub Sponsors compared to the number of organizations. Like, it’s quite small, really.
Really?
And most of that activity is public. So it’s not like there are – you can be anonymous as a GitHub Sponsor, but you can’t really hide the fact that there is a sponsorship happening there. There’s also on Open Collective some massive donations that go to certain projects through company sponsorships, because they’re acting as a fiscal host, rather than just being a platform to collect tips, which is basically how GitHub Sponsors works.
Right. It reminds me of way back in the day, Chad Whitaker’s Gittip, which was later called Gratipay.
Oh, yes.
Remember that?
Oh, yeah.
And it felt all warm and fuzzy, because people were getting money for their open source, but when you go looking at it very closely, most of that was like the same 50 bucks getting passed around between friends… Not a slush fund, but like a – they just felt good. So I would make 20 bucks a month and I’m using open source, so I would give it to somebody else. And there was really no new – not enough, new money coming in. It was really just money that already existed amongst all of us maintainers kind of patting each other on the back… Which was unfortunate, but just the way it started.
I definitely do that. I sponsor 35 different people on GitHub Sponsors with just a few dollars a month, to just be like “I appreciate your work.” I don’t have a huge amount to support you with, but just as a way of saying “I noticed you and appreciate that you continue to maintain these things that I use.”
[00:31:59.16] Well, I hoped GitHub Sponsors was big enough and mainstream enough to kind of change the shape of that. And maybe it’s done it some, but it sounds like there’s still more indies passing person-to-person kind of sponsorship than there is corporate-to-person.
Yeah, I think the change of interest rate across the world had a massive impact. The nice thing about Open Collective is they are – especially Open Source Collective is very public. You can see the amounts of donations going in and going out… And there was a big drop around the time that – like, post COVID hit and changed all of the finances of these things, and it was like “Oh. Okay, well, open source is no longer one of the–” It’s an easy line item to drop, because “Oh, everything is free, and it just continues to work…” For now, until a security problem comes along and then everyone starts scrambling again.
So you’ve got 12 million packages being tracked, 287 million repositories, 24.5 billion dependencies, 1.9 million maintainers… I’m reading these stats off of your website. There’s a timeline of like public events on GitHub, there’s issues, there’s commits… I mean, there’s just tons of different data points that you’re tracking. How do you store all this stuff? Where do you store it? How big is it all? Because I’m just thinking this is a data management nightmare.
So that 24 billion dependencies is a bit of a headache.
[laughs] I bet. I mean, that’s crazy.
Almost all of this is stored in Postgres.
Okay…
Individual Postgres instances on dedicated machines in France and Amsterdam, mostly because they’re very affordable. Online.net is a very reliable host, similar to Hetzner or some of these other kind of bare metal machines.
So I do the maintenance of the machine myself, and obviously, scaling up is a little more tricky, because there’s not just a nice Heroku slider anymore… I use Dokku as essentially like the open source Heroku, which is really nice.
Just git push, it builds your Docker image, and then it handles putting NGINX, kind of proxying all of those things. Very nice for like an individual machine. It doesn’t really give you any kind of multi-machine things, but I try to avoid too much complexity when there’s only a very small number of people working on doing the infrastructure. And it’s mostly me, rather than – I calculated, like a back of the napkin thing the other day, I think it would cost me 15 times as much to host on AWS as it does to host it on dedicated machines right now… But these Postgres – each service basically has its own database. So rather than it being one that is enormous, it’s split out… Which at least makes it kind of like I can work on individual ones and be like “Oh, this one is reaching capacity, so it’s time to scale it up”, or “I should make another box of web machines or Sidekiq workers separately. I don’t need to kind of do everything in one big lockstep”, which keeps it fairly easy to do.
And then the whole website is basically read only. Like, you can’t ask – you can’t put data into it as a user. You read from it. And all the data comes in in the background through loading data from package managers, and repositories, and… There’s about 2000 different Git hosts in there that I’m constantly crawling at different rates to go like “Oh, there’s new activity over here.” So I can cache things very aggressively at the kind of HTTP layer. I think the cache hit rate at the moment is about 60% in Cloudflare. At some point I’ve got it all the way up to like 95%, but then you get some AI bots come along and they do some weird stuff, and it’s very hard to cache such a long tail of billions and billions of URLs that might exist on the platform. And Cloudflare on the free plan is not gonna cover an unlimited amount of cache. You just kind of keep rolling over the cache, over and over again.
[00:36:31.09] Is this a solar project again, or is this you and Ben back together…? Who’s the band?
So Ben is working on it part-time. He is also one of the directors at Open Source Collective, which is – you know, that’s a lot of work in itself. And then we have a few people who are doing some part-time work. Martin has done all the design work… Which looks so much better than my efforts of the original – you can see, there’s a couple of older hidden web pages there that are very poorly designed, which is just me making some plain bootstrap pages… And we just had James come on to help with making the project better-documented and easier to onboard as a contributor… Because I was running so fast on standing everything up and scaling it up and collecting all that data that I didn’t really leave a lot of documentation along the way… Which is terrible, but - hopefully, these are pretty basic Rails apps. There’s not a lot of interesting stuff. Like, intentionally trying to make it the most boring tech possible, so that I can focus on the interesting stuff, which is like the parsing or the mapping of the metadata… Which is like each app has that core little nubbin of “Oh, here’s where the real logic sits.” And that’s a nice, well-tested bit of functionality, with a load of Rails scaffolding around it to be like “Okay, write this into Postgres and then serve it up in kind of the quickest way possible.”
How many apps is it now?
Oh, good question. It must be coming up to 20… But some of them are quite small. There’s a load of services that are kind of like stateless. Like, I will just give you a SHA-256 of a tarball that you get from RubyGems or similar, and a lot of those I basically have on the chopping block to try and turn into something a little bit more like – imagine a GitHub Actions, but for analyzing packages… So rather than it happening every time that you commit or every time you open a pull request, instead it’d be like you can define “I wanna run this kind of analysis on this package when a new version comes out.” That might be like copyright and license extraction, or it might be “Do me a capabilities analysis of this go package using the Caps Lock library…” Which will basically go like “Oh, this library just gained network access and it can read environment variables, and it became a crypto miner.” It would be a great way of like being able to highlight some of those changes.
So I wanna pull it down, and make it a little bit kind of like fewer services, but one of those services will be basically the “Which open source analysis do you wanna run against this package?” And then “Here’s a massive fire hose of every activity that is happening”, and you can hook those analyses in, to say “Okay, I wanna run Zizmor every time I see a GitHub Action change”, because Zizmor does the security scan on the YAML config to go like “Oh, you’ve just introduced a foot gun of GitHub Actions here.” And then try and publish all of those analyses back out as a public good, just basically fling that into S3 or something as a way that allows researchers, again, to go and do broad analysis over the whole ecosystem, or multiple ecosystems, without having to spend all their time collecting all of that base data, and normalizing it, and then setting up infrastructure to run all of that across all of those packages. I see that time and time again, where the paper is – like, 50% of the work is “Oh, well, we had to collect all of this data, and we had to make sure that it all fit into the right box.”
[00:40:28.07] And then we could actually start doing the interesting research. So what I hope is we get to a place where it’s like “Oh, you don’t need to do that. You can just use this open dataset”, and that gives you a good starting point to then start to really dig into like “What’s going on in these ecosystems?” That’s the dream anyway.
Well, you’re certainly working your way towards that. So does Schmidt Sciences - do they foot the bill for all of this work?
So they gave a grant initially, to get started. Luckily, they gave it in dollars, and the exchange rate was very positive for a while, so we actually managed to stretch from a one-year grant into a two-year grant… And then Open Collective has been supporting the project as well as a fiscal host, but also as like a customer. So I built a number of tools for them, to help them kind of investigate ways of trying to expand the ability to kind of let companies fund open source, and then also to try and measure the return on investment of giving two projects and try and be able to see “Oh, if I donate money here, or resources, does that turn into actions and changes on the repositories?” And that kept me busy for a good nine months, I think, of building out tools for them whilst they financially supported the project.
And we also have a number of customers who pay for a different license for the data. So the data is CC BY-SA, which is like a copyleft license. You can use it for whatever you like, as long as you also persist the license and you credit where it came from. But if you don’t wanna do that, then you can pay to essentially have a CC0 license. It’s not actually CC-0, because there’s some things there to say “Oh, don’t just completely undercut us and sell that on again.” But we have a number of customers there… And that basically pays for all the hosting costs.
So it’s self-sufficient, it runs itself, as long as – but you don’t get any extra feature development on top of that. So that’s where I’m trying to work on right now, is to get that level of sustainability higher. And we’ve just received a grant from Alpha Omega, to basically make that happen.
Alpha Omega is part of OpenSSF, and their goal is turn money into security. And they have become a big user of Ecosyste.ms for doing analysis of like who are the critical projects in a particular space, who are the ones that are gonna be most likely impacted if there’s a big security vulnerability? Who are the ones who have never had a security vulnerability and maybe don’t know what to do if they get one? …things like that. So they have basically given us a grant to try and help make Ecosyste.ms long-term sustainable. So that’s things like making the project easier for people to onboard onto, and also to be able to kind of charge large companies in different ways. That might be like “Oh, you want an even higher rate limit than the very friendly rate limits that are already on there? Do you wanna go even harder? Well, then you can pay for a super-rate limit, or similar.”
[00:44:07.28] And then also this kind of pipeline of analysis will be another way that – it’ll basically be like “Oh, you wanna run your LLM queries across all these package source code? Well, then you can funnel it through here. We’ll just like tee that up and trigger it every time that we see a new release of a package, or similar.” It will be another way that I think would be – essentially, just like “Oh, you’ve just paying for our CPU to do this analysis”, and then the analysis that comes out the other side, if it is idempotent, I guess… LLM queries are not idempotent. You’re gonna get a different thing every time you do an analysis. But for a lot of those things it will just come out as a public good, and companies will have paid to have it generated, but then it’s shared for everyone to use… Which I think is a nice thing.
I mean, what I’d really like to be able to do then is to actually do revenue share with the people who are maintaining those individual command line tools that do the analysis. Imagine being able to go like “Oh, we can help with supporting Zizmor, and Bullet”, or all of these different things that are command line tools that analyze source code. And rather than you build a whole enterprise company around your command line tool, you can just focus on making that tool really good, and then we can run it at scale for customers, and then just funnel the money back to the maintainers, after whatever infrastructure costs there were to run it, so that you can actually focus on building the open source tools, rather than building the scaffolding around it.
That would be super-cool. So it sounds like there’s a collection of potential income sources, some that are currently working, other ones that you’re working on… The relicensing of the data for a fee seems like a good one. Is that potentially – could you see a world where there’s enough people that want to do that, that that could be enough, or no?
Yeah, I think so. Especially this kind of dependent data, the 25 billion row table is really juicy in terms of the insights that you can get from that. The general package data though is often – like, you can get Claude to generate you an Npm scraper very easily. If you ask it to do it in Ruby, you get code that looks a lot like Libraries.io [unintelligible 00:46:33.17] [laughter]
That’s awesome.
Do you get a nickel when that happens, or what happens? [laughs]
No, unfortunately not.
[unintelligible 00:46:39.08]
Yeah. Well, you know, imitation is the sincerest form of flattery… So just remember that.
Yes. It’s tricky to get that kind of balance of like – we want to give away as much as possible, especially as all of this data comes from open source. Like, it should be open, because it is data about open source. But then how do you continue to pay for that? …whilst companies also can kind of go like “Oh, I could just go fetch it from the source myself.” And trying to get as many different ecosystems support in is a good way of kind of going – like, you really don’t want to try and index the R package manager. Like, you’re not going to have a good time.
So we try and take care of all of the horrible bits… And then also being able to fetch the Linux distro package managers, which is something that I’m trying to add more distro support in… Because each one of those has its own kind of like horrible rabbit holes of weird and wonderful metadata. And trying to work out “How does this fit into the schema?” A lot of it is kind of trying to tie it around the package URL format. Perl - but not Perl word language… Although you can have a [unintelligible 00:47:55.09] That has kind of come out from efforts in the SBOM world, and originally, one of the inspirations was Libraries.io being able to map these things into different ecosystems and kind of say “You have an ecosystem, you have a name of a package, and you have a version. Can we talk about this in a fairly standardized way, as a way of transporting these package bits of metadata between different platforms that are doing analysis of different kinds?” And SBOM is the natural conclusion of that.
[00:48:41.13] Of course, you have two different SBOM standards. There can’t just be one standard for things… But being able to look things up by [unintelligible 00:48:50.15] is something the ecosystem does really well, because you can basically then take an SBOM and just work through it, every single package that’s in there, and say “Can you tell me about this package? Can you tell me what security advisories are affecting the version that I’ve got in my SBOM?” And that is the biggest use right now, is there are lots and lots of people with GitHub Actions that are just enriching their SBOMs with this kind of information.
It’s funny how much more traffic we get on a weekday than on the weekend… And I think it’s just because of the GitHub Action kind of like “Oh, this is happening every time someone commits”, so you see a smash of traffic of them enriching their SBOMs and checking out every package that is in there… And then the weekend comes along, everyone stops working, and the traffic shape completely changes. And also the cache hit rate completely goes through the floor, because suddenly it’s like “Oh, there’s all kinds of other weird and wonderful things happening at the weekend”, especially lots more like researchers and hobbyists using it.
So you’ve mentioned a few of the weird, gnarly things like multiple SBOM specs etc. You have 35 ecosystems on here. Npm, Golang, Docker, to name a few. Crates… NuGet, so you’re in that world… Across 75 registries - so I’m assuming some ecosystems have multiple registries…
Yeah, Maven especially. There’s lots of registries in the Maven world.
And then – oh, even Bower.io. I remember Bower. I don’t know if people are still using that… Anyways.
Forever ago, man.
No one adopts anything, they don’t accept any new packages, but you’ll still find people that use them and download stuff through them, yeah.
So what I’m wondering is, where are the black sheep? Where’s the gnarliest, weirdest – like, let’s not… I don’t wanna create any enemies for you, Andrew, but which of these ecosystems are, in your own heart of hearts, notoriously hard to work with?
Well, the hardest bits are often the change over time, especially when you go back to the really old stuff. The classic one is that you’d think “Oh, Npm - their names are case-insensitive.” But if you go and try and index every name in Npm, you will find about a thousand that are case-sensitive and have clashes with a different, cased version of the name. And those still exist on the registry. They haven’t been removed. And so if you try and make an index against that, you’re gonna have a bad time, because as soon as you actually go to run that, you’re like “Oh, that’s not like that anymore.”
So there’s things like that, that when you go back into the time – going back further and further is like “Oh, there’s weird things here”, especially when the package manager registry has a document database, rather than something that is always enforcing its schema in every record. And Npm used to be CouchDB, which is “Oh, they’ve changed some schemas of the package metadata, so in new packages it looks different than old ones.” Of course, now it’s actually Postgres underneath, and it pretends to be CouchDB, which is interesting, and I imagine a headache in terms of actually maintaining that… But they still have some really old and weird – you just run into like “Ah, this bit of metadata isn’t right for these few packages”, because it was frozen in time. There’s JSON in Postgres now, somewhere… Similarly with Maven, they’ve got lots of different kinds of POM XMLs…
[00:52:40.10] And there’s so many features in the way that Maven can have these nested and parent POMs that is – I don’t really have a background in Java, so I’ve never used Maven as a user, but the amount of different ways that you can describe the data that is stored in a POM XML, and then published out to Maven Central… Of course, once it’s on Maven Central and it’s like frozen in time almost, they don’t then go and update – like, if RubyGems adds a new attribute to their registry, that becomes available in the metadata for every single endpoint, because you know, it’s just a Rails app that’s generating JSON. But for the things that store the files as a historical “We just dumped this file somewhere”, then you’re like “Okay, my code needs to be able to know every different possible version of this, how this worked, and then also be able to recover from it.”
The worst one is the R package manager. It’s not huge, but it is used a lot in the research space… And they don’t have an API. You have to scrape HTML from the thing. They also remove packages quite regularly, which is very strange. So R has this really weird – I think it’s because it’s come from a scientific kind of like non-developer background. It also has one indexed arrays, which not many programming languages have that, right? But their package manager won’t let you pin to an older version of something. It won’t say “I want version one”, even though version 2.0 is out. And the knock-on effect of that is that – so as a user, if I’m gonna say “Install my R packages”, I always get the latest version of everything. That means that if something’s broken because something else got a new version, rather than the new version causing the breaking change be told off, it’s actually the package that didn’t upgrade to fix the problem with this other package that just updated.
[00:54:57.17] So if you’re not proactive in fixing breakages with your package being used with other packages, your package gets removed. It gets kicked out of that registry. Which is pretty wild, because people, especially in science, trying to make their science reproducible, are like “Oh, my package got yanked. How am I supposed to reproduce this science? It’s no longer here.” So they have some very strange behaviors where they’ll actually make snapshots of the registry, and then – so you can say “I wanna install my R package from this registry on this day”, so you actually have like a weird historical aspect of the thing… Which is not like a lot of other package managers. And it’s very hard to change, because there’s just not a lot of – we don’t have a lot of funding in open source, but in terms of research software engineers, there’s no incentive there to maintain and develop software, unless it has a paper attached to it. If you can get citations, great. You can continue to make a case to keep working on those things. But once it’s done, it’s done, kind of thing. You’re like “Oh, you already published that paper. I don’t need to continue maintaining the software.” That’s something that I have an interest in trying to solve, but it’s a very hard problem to kind of break into. But what I’d like to be able to do is go “Can we connect the world of papers and citations back to the software that’s being used”, to especially - like, there’s a lot of Python code that might not look like it’s massively used, but then when you kind of go “Oh, but it’s mentioned in all these papers”, especially the kind of AI papers as well, which are just like exploding at the moment… If you can then say “We can send some of this transitive citation credit down the dependency graph to the transitive dependencies of the things mentioned in a paper.” Like, I bet there are maintainers who have no idea that their low-level Python or Julia code is being like referenced in these massive papers. That’s the discovery aspect there… But also, for the people that do know, to be able to go back to their institution and say “Look, my software is supporting all of this research that you’re publishing. You should also support me, because that will make your research better”, would be a really cool thing to make happen.
Right. Until they say “Well, we already published those papers, so… Who cares?”
[laughs]
That attitude makes it tough, for sure.
Yeah. There’s a lot of still that kind of “Oh, open source is just there. I can just use it. I don’t need to contribute back in any way, because someone else will do it.” It’s still a totally unsolved social problem, I think, in the wider open source space.
Well, if somebody wants to write a paper on the reproducibility problem in scientific papers due to mismanaged packages in the R language, I think that would be a hit. I think it’d be a hit.
Oh, my gosh… I’m still dumbfounded that they would not let you pin to an older version.
I know.
I feel like that’s gonna break so many research projects that go stale, essentially.
Well, there’s the Software Heritage Project, which is a massive index of the hashes of every file ever published to any open source thing. It basically was produced to try and help solve that problem. Like, you had to make a full index of every file in every Git repository to be able to try and get around the fact that you can’t pin to older versions in R’s package manager.
I mean, there are still other package managers that don’t have lock files in them, which… If you think like years ago - yes, it wasn’t such a problem. But nowadays lock files are so critical to the way people build and maintain and share their software, to be able to go “Oh, it works on my machine. It should work on yours, because you’re literally installing the same set of dependencies.” And Docker works for that high level, but as soon as you wanna change one thing, you’ll obviously blast away the whole Docker image and have to start over. Whereas a lock file works really nicely at the language level to be able to kind of solve that problem. If your package manager doesn’t have one, you should definitely try and get that added in somehow.
Break: [00:59:34.12]
Behind the scenes I’ve had some AI literally obliterating your API… With the polite mode on, of course. I’ve passed my name in so you can track all the things I’m trying to do here, but… It has finally found a way to craft a script that will pull back essentially some version of curl-fssl blah, which is the URL where the thing lives, and then piping that to sh. And so I’ve got a nice, dramatic list of projects to research that use that command, and what that install.sh script looks like, and what are some of the details in there. So it didn’t take long, but my gosh, if I did not have AI to do this for me, I would have pulled my hair out so badly. Probably not your API by any means, but just more like - you can get the data, it seems, but you’ve got to comb through it, you’ve got to be persistent, and very…
Well, there’s a lot of – the schema is not simple.
No.
Unfortunately. And it’s hard to find a way to describe that in a way that doesn’t just – people will just switch off and kind of glaze over as you start going into the levels. Something that I’ve also tried to do over the past couple of years as the AI bots have kind of gone mad is actually let them scrape the website. Rather than block them, I’ve said “You can go mad”, in the same way as I used to let Googlebot go mad on Libraries.io… Because two years in – like, we’ve had a full training cycle of the frontier models. They actually know what Ecosyste.ms is, and they know the structures of the APIs, and they can actually just suggest those things… Which is a good and a bad thing, but I think in terms of being able to get into the training data in terms of like “My API is here, and my service exists” is helpful to people who are using AI coding agents to do some of these things.
I have dabbled in the MCP world with this stuff, and it would be very easy for anyone to build an MCP adapter on top of this, but the security implications really hurt my brain. So I have kind of like held off going hard into it, because every string that is returned by the MCP is essentially like a prompt injection. So you imagine your version number that is pulled from an Npm package and then fed through an MCP server into your context… They have the ability to make a version number, especially if it’s like semver with your pre-release string on the end of the version number, you could make prompt injection [unintelligible 01:05:00.19] where I just start putting like “Ignore all previous instructions -1.1” in the strings of the thing that come from the package manager is suddenly a security vector. Or even just the description of the package, or the name of the package… There’s a lot of trust that happens on that kind of go through when it comes out as an MCP server on the other side. If you’re just saying “Blindly install whatever the MCP server told me”, then there’s a lot of trust that you’re putting into many layers of indirection that could happen. And we’ve definitely seen loads of threat actors have realized how – I’m gonna use the word ‘juicy’ so many times… But in terms of being able to go like “There’s no restrictions. I can publish things to a package manager, and that might be the [unintelligible 01:05:58.05] level of indirection before I actually get to my target.” That is very hard, to see all of the moving pieces until they actually kind of all come together… But most of these package managers have zero restrictions in what you can do.
[01:06:15.27] Even GitHub only just recently started kind of saying there are certain restrictions in how you automated the Npm publishing can be. Because people were literally like “Every commit, I’ll just publish a new version. Why not? There’s no restrictions. A hundred versions a day.” Which is like, “Why are you doing this?” Well, because we could. And the cost to the registries is mad as well. You see that PyPI are just showing their numbers continue to grow, and they’re like “Well, how the hell are we gonna continue to fund this? Because it doesn’t look like it’s gonna stop anytime soon.” It feels like there’s a lot of challenges that are kind of coming down the pipe for these shared open bits of infrastructure to keep them as open as they currently are.
What is your take then with this rate limits and polling when it comes to this polite nature you have here? How do you leverage that? Because I can pass in my email, but then you say “Well, I can reach out to you later”, you’re watching my rate limits, of course… Can you just shut me off because of me passing that email to you? Or how do you curb the enthusiasm, so to speak?
So right now we have this – we have the anonymous rate limit, which I think is 5,000 requests an hour per IP address, basically… And then the polite pool, which is a term we borrowed from a service called OpenAlex - which is basically like Ecosyste.ms, but for research papers. They have this – if you pass in your email address as part of the user agent, then you just get an uprated rate limit, so that if we see that you’re smashing the API, we can contact you and say “Oh, what are you doing? Can we help you do this in a different way?”
So far, I haven’t actually been tracking that particularly closely. I’ve literally just like “Great.” Cloudflare is still catching most of that stuff before – if you hit anything that’s cached, it doesn’t even touch your rate limit. So it’s only the uncached things that actually affects that rate limit. But even then, it’s like, 10,000 requests an hour - if you’re really, really hitting it, you’re gonna run into that. And then a 429 request is very cheap to serve up. So I can serve up a lot of rate limit used requests before things start to fall over.
And then looking at the patterns and going “How are people using this? And is there a way I can do a higher-level API that avoids having to have someone do that [unintelligible 01:08:59.00]? Or is there ways of being able to export big chunks of data”, rather than doing individual… Lots of little queries is another thing that we’re exploring. It may be like a big click house, with a read-only - like, you can write your SQL-ish query against a column store worth of data, similar to BigQuery, but without the “Whoopsie, I spent $3,000 on my one query through BigQuery, because it pulled in terabytes of data.” But that is a bit of an ongoing side project. It’s not actually live yet for anyone else to use. But hopefully, for researchers especially, you’ll be able to just be like “Oh, I can just do big, sweeping queries in kind of an offline way”, rather than it having to hit the live Postgres databases… Because that’s like the source of truth of these things. And often, researchers aren’t like “I need the most up-to-date, within the hour changes.” They’re like “Ah, actually, I’m fine with this if it’s like a day or a week old.” It’s really not too much difference compared to, you know, “I’m looking for the security advisory stuff that is as fresh as possible”, which is often where you’re scanning your SBOM and trying to find “Where are there new vulnerabilities that are affecting me?”
[01:10:24.19] Yeah. How do you prioritize your time, I suppose? It’s a lot to cover, it seems. It seems there’s a lot that – you know, even discoverability. Like, if I am naturally interested, “How can I pull this data out? It seems like I would have to spend a lot of time to figure that out…” That’s okay, but… Who is your user? How do you prioritize your time? Who are you building the platform really for? I know who’s using it, but how do you prioritize your time to how it’s being used?
Well, to be honest, the number one user is me right now.
Okay, good.
That’s who I prioritize for, because I have a good picture of how you’d want to be able to pull this data out. So the APIs are – each one has its own open API YAML spec, which kind of tells you “Here are all the different endpoints that you’d want to use.” And then I’m building applications on top of this data as well, and going “Oh, this is not here.” Or “I want to be able to do it like this.” So often, a lot of those APIs have shown up because I couldn’t get them to work right.
Josh Bressers has also had a good amount of input in just like absolutely thrashing various aspects of it to look up lots of data around CVEs, and the kind of rate of versions being published. There’s also kind of loads of tools that have been built on top of it. Snyk has a tool called Parlay, which does SBOM enrichment… So I can then go and – these things are open source, I can go and look at them and see “Oh, how are they currently using the existing API? Is there a better way that I can do, or do I just need to beef up the caching in some of these kinds of places?” It’s very much like the prioritization a little bit is like just running around, putting out fires, but then occasionally, it’s like “Right, I’m gonna turn everything off and I’m gonna go and tackle one of these slightly chunkier problems”, of essentially like solving a bigger challenge than just “Oh, there needs to be a new API.” Often that’s like “Oh, there needs to be another service for another kind of data”, or “There needs to be another way of querying this thing, because lots of people have been asking for this.”
The biggest thing is just – coming and asking for things on the issue tracker is a great way to kind of kick off that conversation and say “Oh, I’ve been trying to do this… I’m trying to solve this problem, but I can’t work out how to go through – I’ve hit a wall here, or there’s just too many individual bits of data over there… Can there be an aggregation of this thing somehow?” And sometimes that’s easy, and sometimes it’s like “Oh, actually, if we make this index, it’s gonna be like the index itself is like 500 gigabytes in size.” That’s hard to fit into RAM, so maybe we think of another way to solve that problem, rather than just adding an index for every single different way you might wanna query Postgres.
I’ve found the introducing Parlay post, they even mentioned that “We’re enriching Parlay”, it’s enriching these SBOMs using Ecosyste.ms. So are they one of your paying customers then, considering this tool is probably part of their…?
[01:13:57.00] No, they are using – so Parlay is an open source tool that other people can use.
Okay, gotcha.
And it’s primarily companies, because open source developers don’t actually care about SBOMs, because they’re like “Here’s the code.”
I had to search what SBOM enrichments was. I guess I should have guessed that by – take a little bit of data and make it better, I don’t know… [laughs]
Well, most SBOM extractions don’t – like, when you produce an SBOM from, say, a repository or from a Docker container, it will go “Here are the packages and the version numbers”, but it’s not gonna tell you “And here is all of the information about that package”, because they just don’t have that on disk, available, most of the time. Some package managers, especially the distro package managers do actually have that information right there… But these SBOM generation tools don’t go and hit the Npm API directly to fetch all of those things. So if you wanna be able to get a high level overview of all of the license breakdown of all the different packages in your SBOM, then you need to enrich it by basically going through each one and fetching some extra information and filling in the license field. Maybe they like maintainers – there’s a load of different things in there and it depends on which SBOM standard you’re looking at as well, because they’re different… But also, just being able to look up all the security CVE stuff… It’s nice if you’re only working in one particular ecosystem, because you can use Npm audit or bundle audit. But as soon as you get into the multi ecosystem things, which every Docker container is, right…? It’s gonna be like “Oh, I’ve got my Django app with a JavaScript frontend, and also all of the backend low-level distro package stuff… Like, there’s a big collection of random bits of software in there, and I really don’t wanna have to use 10 different tools to enrich it. I just want one thing that will just sweep across and support everything.”
You mentioned a couple of times building things on top of. Since this is sort of a redo for you, it’s kind of like a take two, do it better… Is this the substrate for many things? And what are some of those things that you mentioned? You mentioned some things being built on top of, but what are those things? What’s the world you envision?
There’s a few that are listed on the Ecosyste.ms homepage. So we have the things that I’ve built for Open Collective, which are the Funds app and the Dashboards app. Those two things are like definitely – they don’t have their own data, they’re essentially aggregations of various bits from Ecosyste.ms to solve particular challenges. One thing I’ve not built is a search engine. I’ve kind of been like “I’d like to see if someone else would build that.” I already did that in libraries.io… But that would be a natural one to add in there.
What I’d really like to build is things that help maintainers understand who is using their software. And this is going back to that 24 billion rows of dependency data, to be able to say like –
How much are bots, how much is Docker pulls, how much is just like CI builds…? I guess those are all still users, right? I mean, if I’m that person releasing a hundred times, I’m still pulling the packages, right?
Everytime they commit…
Yeah. It’s like “Boom. New version. Because I can”, you know?
Yeah, yeah. And also, to be able to go - like, if we can flip that graph upside down and show you “Here are the key people downstream dependencies of your library”, then rather than you find out that you broke them because they come into your issue tracker after you just published that release, and say like you broke stuff, like maybe building a CI that is an inverse, that goes “Okay, well, you committed something. Let me go and test this against your most popular downstream users, to make sure that you didn’t break those things.” And there’s some difficult bits there in making sure those downstream CIs are reliable, and they’re not just gonna be like “Oh, actually, our tests pass all the time, regardless.” Or they fail all the time, so you can’t trust if you actually broke anything or not.
[01:18:23.06] But to be able to do that would give maintainers insights that would be like – they can actually be proactive about some of these things, and maybe even be able to coordinate and go like “Oh, I’m able to reach out to these projects and say “I’m gonna break this thing, or I’m gonna change this thing to make it better. Can I help you upgrade in the process, rather than just firing out into the world and then not being able to know what the impact was until after the fact?”
I’ve also been indexing Dependabot data as a way of being able to show – I’ve no idea why GitHub hasn’t done this, but as a maintainer of a thing, if I publish a new version, I wanna know how many Dependabot PRs actually were successfully merged, or were closed, as like “No, I don’t want this, because it broke my CI.” Or just completely left. Like, give me more context, so that I can understand what’s happening with the people that are using my stuff… At least in the open. Because there’s so many open source users now that it’s a good proxy through to closed source.
Tools like that, that enable maintainers to do more with the same amount of time that they’re putting into the project by being more data-driven, or being able to just have more visibility… Because I think a lot of them are working in the dark a lot of the time, partly because you put the blinkers on and you just focus on getting what you need out of your project… But also because they just have no good idea of where their key consumers of those things are, and the knock-on effects of being able to go “Oh, I’ll make a breaking change.” That breaks this other library, and that ends up having a significant impact… As well as - you know, if you have a security advisory, to be like “Hey, significant end users of my thing, there’s going to be a security update, FYI. Get ready to bump, rather than be like “Oh, we’re stuck on this version and now we’re going to have to like scramble to try and get it updated.” To be able to get a little bit more coordination and collaboration by being data-driven I think would be amazing.
So that’s kind of – that’s my slightly bigger picture of what I would like to build on top of it, is to really empower maintainers to have an impact, to make their process better, but also then make their open source software better… Because everyone uses open source software, and so then you make all software better by just improving the base layers of the most critical packages.
For sure.
It’s a pretty big goal, but I think there’s enough untapped data there that I think can be really powerfully leveraged to make a good go at improving some of these things.
Could you maybe discuss how that interface manifests? Like, what would you show the maintainer? Where would you begin when it comes to exposing the data? Like, how do I get to know my users, the people using my thing?
Yeah, so I can imagine you would see “Okay, well, for this particular package” - and maybe I’ve got lots of packages, but I just drill down to one of them… Then I can see “Here are my top dependents”, and top being - there’s lots of different ways you can define what a top would be… But we can just use the Ecosyste.ms usage metrics is one thing… “Here are the key projects that are using your stuff”, and then which versions they’re currently pinned on as well. So they might be just like “Oh, I always pick up the latest version. I’ve got to Dependabot doing the updates.” But maybe there’s someone who’s really heavily using your stuff, but they’re actually pinned to an old, major version. And that’s like an insight into “Okay, well, why were they stuck? Maybe I can go and help them upgrade, or I can learn that actually I made the most horrific breaking change ever, and they really, really don’t want to upgrade, because it completely causes them too many headaches to do that.” And maybe I can consider that in how I then continue to maintain that project going forward.
[01:22:40.04] You could also then use that interface to say “Okay, well, can you show me everyone that’s on this specific version?” Or “Is 50% of my users stuck on an old version?” Or “Are they stuck on an insecure version?” as well, to be able to go “Well, we had this CVE three months ago, and most people…” Especially thinking about this from like individual packages that depend on me, to be able to see the knock-on effect of like all the users of those packages are my transitive users.
There’s a lot of data there, but being able to highlight where those key points are of leverage that are like “These things could be improved here”, that would be one way of that kind of being manifested.
The other way you could do it is rather than it be a UI, it’s more like a notification system of being like “Oh, did you –” You’ve got the proactive kind of things of like your dependents have updated to your latest version, or your dependents are having problems with - they tried to upgrade to this thing, and “Here’s the context of this Dependabot pull request and the discussion that they had, and they haven’t yet merged it.” To be able to show you that “Oh, wow. Okay, that’s interesting… It’s having a problem for them that we didn’t even imagine, because we’re not using the same database as they are for our testing purposes.” Something like that.
And maybe there’s an AI element there, once you get to very large amounts of users, that you’re like “Actually, this is too many downstream users to reach out to. Maybe I can empower Copilot or Claude through some kind of prompt that is like “I’ve described the changes in my changelog in a way that helps them upgrade from one version to the next.” But there’s a lot of people that are very reluctant to take on some of those things, because they can be wildly unreliable sometimes, when you try to do things the same way over and over and over.
It’s kind of like telemetry via exhaust, too. You’re not like literally tracking your users, you’re tracking them through natural usage pattern of the ecosystem of open source… So you’re not like ask them to opt into too much telemetry either.
Yeah. I really try not to be too invasive. I try not to track too much data about the individuals, and instead keep it at the project level… Because for one thing, the projects are all like licensed in a way, that says “Yeah, you can share this, and you can like understand this.” The licenses let you do that. Whereas tracking individual people is a much more messy thing to do, because people come and go, and they change their names, and they change their email addresses, and it can be hard to try and pin them down.
But also, most open source projects, they’re all volunteers. Trying to pin requirements on an individual is asking a lot of someone who is probably not being – like, they’re just giving away their code. So instead it’s like “Oh, well, we’ll look at these as if you want to do something to help, then here’s data that you can do it”, rather than being like “We’re going to force to upgrade”, like “You must do this.“You wouldn’t want to use Ecosyste.ms to power a massive wave of automated pull requests, for example, for one thing…
Right.
[01:26:22.00] GitHub would just shut you down straight away. They’re allowed to run Copilot or Dependabot at a large scale, but you wouldn’t want – it would be horrible for maintainers, to just have… You hear Daniel from Curl constantly saying like how many different AI bots, especially if it’s incentivized in any kind of way… Then you’re going to make a mess. But Ecosyste.ms tries to kind of just watch what’s the vibe of these ecosystems going on at the moment, and then you can use that to try and have impact on top of that.
Have you found any information black holes in your desires for features, or tracking things? I mean, exact amounts of funding is an example, I guess… But anywhere else where you’re like “Man, I could build this, but I went looking for the data and there’s no data”?
Ooh… So yeah, the funding one is a big one. The other thing that I’d really like is kind of more data around the non-code contributions… But that’s really hard to get, right?
Yeah…
Your Discords, and your Slacks are not open enough to be able to really index without – you know, you need an API key, or you need a ghost user sat in a Discord, collecting everything…
Right. Now you’re getting creepy; you’re getting real creepy.
Yeah, it is way too much. There are tools –
“There’s Ecosyste.ms, again, tracking us… Get out of here, Ecosyste.ms…!” [laughter]
I’d start joining all the community Zoom calls with an AI chat log kind of thing… But no, there are tools like that. Bitergia has one that you can configure it to track your own community, and you can feed in mailing lists, and you can feed in your Slack, or your Discord, or similar. But you’re kind of doing that per community, or even just a per repository level. Trying to do that at a mass scale is stepping into worlds that I’m not really comfortable in terms of the amount of tracking of stuff. It also is just really, really messy. Open source metadata is messy, but it is like tangibly – okay, yeah, I can see how I can connect the dots here… Whereas once you get into –
Right. [unintelligible 01:28:43.26] structured.
…like unstructured text of discussions of things, you’re quickly into like “Right. Well, we’re just going to try and have LLMs process everything here”, and it’s a horrible mess, and it’s incredibly expensive. We use no LLM stuff in Ecosyste.ms, because we just don’t have any budget for that kind of stuff. The amount of processing to analyze 12 million packages…
Well, you do now… Our friends at AMP have free – just advertising as I use it… And it’s like free docs, essentially. I was just telling Jerod about this on the pod we’re releasing on Friday… If you’re not using AMPCode for free, at least two hours a day or so…
Ad-supported.
…then you’re missing out on a little bit of LLM work that you can get for free.
Ad-supported, that’s the way I’m saying it.
Well, yes, sorry. It is ad-supported. So you’re getting advertised to, but you know… I think that if you’re not using that and you have a use for a couple hours a day at no cost… One of the 17 advertisers they have in the network are supporting your open source, essentially. It’s kind of cool.
What else, Andrew? Anything else we didn’t ask you about Ecosyste.ms-wise, or…?
I mean, we covered a lot…
Yeah.
[01:30:02.17] I’m trying to think if there’s anything… I think I covered most of my thinking of the future things, and that’s mostly everything that I’m working on at the moment, is Ecosyste.ms. I haven’t got any other side things…
Octobox is dead, or…?
Octobox is ticking along… GitHub copied most of the features of Octobox, and then we lost most of the customers…
[laughs]
I still use it every day, but there’s not a lot left there… So it still works, but it doesn’t have any AI features, so it’s not particularly interesting in terms of that aspect. Yeah, I think that nicely covers most of what I’ve been working on.
Awesome. Well, it’s really cool stuff. I’ve always been impressed by your abilities and willingness to just collect all the things, and then organize them, and give them back out for free for people to use for various reasons. It’s probably exciting when you see somebody using it in a new way, that maybe you hadn’t dreamed of, or wouldn’t even care to, but you’re like “Oh, that’s cool.” It shows that you’re providing real value to folks, and…
Yeah, especially with the researchers. People will come to me and say “I’m working on this paper, that’s like investigating ways that we can get LLMs to suggest better projects for you to use, or packages”, or “We’re trying to reduce LLMs coming up with old versions of things. Is there good ways of training it on reducing that –” What do they call it? It’s like a data lag, basically; the training lag…
Drift.
The drift, that’s it. That’s an interesting challenge, without resorting, again, to kind of RAG or MCP - are there ways of doing short fine-tunes after the fact, of like “Here are the latest versions of things?” and people are doing some interesting research in that space using big chunks of Ecosyste.ms data.
The other thing I just started noodling on is an open source taxonomy… So to try and define a taxonomy that describes the different facets of what makes an open source project – you know, what does it do? Who is it for? What technologies does it use? There’s about six different facets in about 130 different terms that I put together as like a v1 kind of thing, of going like “If you were to put these packages into a box, or six boxes, which ones would it do?” Rather than just going like “Here’s some free text keywords, here’s a load of the kind of chunks of things”, including the role of the user as well, rather than just thinking about “Oh, it’s a frontend React app.” It’s like, but it’s for a – is it for an end user or is it for a sysadmin, or for a developer? To be able to – and then what domain is it in as well? It’s like really early, but I’m hoping it is another way that can produce some alignment in this open source discovery world… Because I worked at GitHub for a while, on open source discovery, and wasn’t really able to make a good dent in it there… But I think there’s still a lot of low-hanging fruit in terms of just helping people find the right kind of tools to use, because not many other people have really – also, there’s just not a lot of money in that space. It’s a loss leader for most companies; searching for open source is not gonna turn you into… You can’t even run a lot of ads against that kind of stuff, because open source developers are the number one user of Adblock. So those ads will disappear pretty quickly. But I’m hoping that that taxonomy will be like “Here’s a nice blueprint of ways that you can define your project, and put it into ways that then allow you to kind of go “Okay, well, I’ve got five dimensions here, but I wanna rotate around one of them. I want a web framework for researchers, but I wanna rotate about the technology. What are my options there?” Or “I’m definitely in this technology space, and looking at this kind of position in the stack, but what options do I have here for different users?”
[01:34:31.12] And to be able to kind of like twist the picture a little bit, but in a fairly defined space, rather than in just arbitrary free text… Because again, you just end up in this soup of words, which is like “Yeah, we kind of just get very fluffy.” And often, projects just don’t have very well-defined ways of finding things. Like, they don’t add a description to their GitHub repo, or any keywords or topics, so you just kind of like never find it, unless it’s in a generic search engine… Which is then really hard in terms of like “Oh, well, what are my options in this space?” And I made this as just like “Surely someone has made one of these already…” And I found a taxonomy of software in the research space, but I did not find a taxonomy of open source software. So I was like “Okay, I can make a stab at one of these. I’ve never made a taxonomy before…”, but I put it together as a “This should be interesting…” And it’s been useful so far, and it started some interesting conversations, but I really need some people with more experience in actually defining taxonomies than I have to give more input and also expand it and cover the problems, because I’m pretty sure there’s gonna be loads of problems in it… Because I basically just put it together in a couple of days as like “Okay, I think this should work”, but mostly untested.
Where does that live?
That is on the Ecosyste.ms GitHub. There’s also a really quick webpage I made of taxonomy.ecosyste.ms. It’s literally just a few days ago, so it’s not anywhere on the website, but it is on the GitHub org as oss-taxonomy. I’ll get a link in the show notes.
Awesome. Yeah, send us that… And anything else you wanna make sure we get into our show notes, so that y’all can just click through and find that and help Andrew figure out this taxonomy, so that we can all start to kind of formulate around it. Categorization is always useful, especially for otherwise gray areas such as these; especially if you’re self-defining, it helps you to even flesh out your idea or your project better.
I think this is fertile ground right there, honestly, because you’ve got so many… I would describe it as like ecosystem explorers. Previous to LLMs being a ubiquitous thing and agents helping you, you may have just stayed in the zone that you’re comfortable in, because you’re the mere human that cannot think 10x faster. And then you get into this LLM world and you’re like “Man, I can actually explore new languages, because it knows it. I know this ,language and I can at least translate my knowledge…” And so now you find yourself exploring Go, or Rust, when you would have normally just stayed in the Ruby world, because maybe that’s where you’re comfortable. And so when you go into those worlds, you’re like “Well, how do people test here? How do people deal with HTTP? How do you deal with security things?” And so you find yourself exploring new worlds; while you know the Ruby world well, you don’t know the same kind of projects that would help you in a different lens. So I can think that’s going to be useful, honestly.
Yeah, definitely. There’s also kind of the ability to see where are the gaps in a particular space. Where have there not been many people working, or there’s only just this one old library. Is there an opportunity to kind of jump in and improve that? Or, as you say, you come into a new ecosystem and you’re like “What is the Sidekiq of X?”
Exactly.
[01:38:17.03] And often it’s like “Oh, well, actually in Erlang world we don’t need Sidekiq, because we have OTP. It’s kind of all built in.” But to be able to learn what is the alternative to this thing is gonna be an interesting way of challenging that. And maybe also kind of breaking down some of these massive projects into sub pieces as well, to be able to go “Okay, well, you’ve got something huge, but actually there’s lots of individual components here that can be used without you having to take on oh, like, “I’ve got a massive Apache Airflow install now that does everything. Actually, I really only wanna do a piece of this.” But how the hell do you go finding that if like their discovery is just folders full of strangely named projects? That’s not particularly helpful necessarily in terms of discovery.
Well, let’s close with this. What do you want from the world? You seem to be a pretty quiet guy… There’s definitely a blog there, so you’re active… I don’t know how frequently you podcast. We haven’t talked personally in years, at least me personally; maybe you’ve talked to him at least once, Jerod, without me, in the meantime… But what do you want from the world for this project? What kind of response do you want from coming on the show, or producing all this work?
Well, I have had my head down for – basically, since leaving Tidelift, and then COVID happening, I basically just like got my head down and just started plugging away. I also started doing track days in a Subaru BRZ, which is an excellent way to get away from the computer. If you’ve got interesting cars, track days is brilliant fun. But Ecosyste.ms has kind of been building up and building up, and it’s now reached the point where I’m like “I need more people helping, kind of not just contributing to the code, but helping it work out where it should go next.” Because I can definitely come up with lots of things I would like to see happen, but I need more input from more people on like “How would you like to have a impact on the open source world through data?” So that’s input in like feature requests, or thinking about that from a slightly higher level of like collaborations, ways that Ecosyste.ms can support different efforts, be it like security, searching for projects that are like “Oh, there are ways we can improve this part of an ecosystem.”
Collaboration is really what I would like to see more of, and I am starting to do more podcasts and various kinds of – I started a working group with the Chaos Metrics people around package manager metadata, as trying to share the kind of learnings that I’ve done in developing Ecosyste.ms, and being able to kind of like map metadata across different ecosystems into standardized ways… But if they’re interested in ways of understanding and using data in open source to have impacts, then Ecosyste.ms is literally rearing up right now through the Alpha Omega grant that we just received to be able to bring more people into this space and help them have real impact on/knock on effects of improving open source.
Wow, very cool. I’m glad COVID is over, obviously… I’m glad that you’re poking your head out of the hole, little rabbit, and showing the world what you’ve got. It’s kind of cool, I like it.
Good stuff, Andrew. Thanks for coming on the show again.
Yeah, thanks so much for having me.
Our transcripts are open source on GitHub. Improvements are welcome. 💚