
Making Coggle Even Faster

Today we’ve got another update from the tech behind Coggle: how we cut the average response time by over 40% with some fairly simple changes, and learned a lesson in checking default configurations.

First, a bit of architecture. Coggle is divided into several separate services behind the scenes, with each service responsible for different things. One service is responsible for storing and accessing documents, another for sending email, one for generating downloads, and so on.

These services talk to each other internally with HTTP requests: so for each request from your browser for a page there will be several more requests between these services before a response is sent back to your browser.

This all adds up to quite a lot of HTTP requests - and many of these Coggle services call out to further services hosted by AWS, using (yep you guessed it!) even more HTTP requests.

So, in all, an awful lot of HTTP requests are going on.

Coggle is written in node.js, and originally we just used the default settings of the node request module, and the AWS SDK for node for most of these requests. (At this point there are better options than the request module - we’d recommend undici for new development - but there isn’t a practical alternative to the AWS SDK.)

Why does this matter? Well, it turns out both of these default configurations are absolutely not tuned for high-throughput applications…

The Investigation Begins

A few weeks ago I came across this interesting interactive debugging puzzle by @b0rk - no spoilers here (go try it for yourself!) - but when I finally got to the solution, it made me immediately wonder whether the same issue was present in Coggle, as for a long time our average response time for requests had been about 60ms:

graph showing 60ms response time over several months

It didn’t take long to confirm that the problem in the puzzle was not occurring for us, but this made me wonder why exactly our average response-time graph was so consistently high. Was there room for improvement? Were all those requests between the different services slowing things down?

What About the Database?

The first obvious place to check is the database. While the vast majority of requests are very fast, we have some occasionally slower requests. Could these be holding things up, with fast requests queuing behind slow ones? Tweaking the connection pool size options of the mongodb driver showed a small improvement, and this is definitely a default configuration that you should tune to your application rather than leaving as-is (note that maxPoolSize, not poolSize, is the option that should be used for unified topology connections).
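As a sketch of where that option goes (assuming version 4+ of the mongodb node driver, where maxPoolSize replaced poolSize; the connection string and value here are placeholders, not our configuration):

```javascript
const { MongoClient } = require('mongodb');

// maxPoolSize caps the number of concurrent connections the driver will open;
// the right value depends on your workload - 50 here is only a placeholder
const client = new MongoClient('mongodb://localhost:27017', {
    maxPoolSize: 50
});
```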

No dramatic improvements here though.

All Those Internal Requests…

Like the mongodb driver, nodejs itself also maintains a global connection pool (in this case an http.Agent) for outgoing connections. If you search for information about this connection pool you will find lots of articles saying that it’s limited to 5 concurrent connections. Aha! This could easily be causing requests to back up.

Inter-service requests are generally slower than database requests, and just five slow requests could cause others to start piling up behind them!

Fortunately, all those articles are very out of date. The global nodejs connection pool has been unlimited in size since nodejs 0.12 in 2015. But this line of investigation does lead directly to the true culprit.

The global http Agent which our internal requests were using is constructed using default options. And a careful reading of the http agent documentation shows that the keepAlive option is false by default.

This means, simply, that after a request is complete nodejs will close the connection to the remote server, instead of keeping the connection in case another request is made to the same server within a short time period.

In Coggle, where we have a small number of component services making a large number of requests to each other, it should almost always be possible to re-use connections for additional requests. Instead, with the default configuration, a new connection was being created for every single request!

A Solution!

It is not possible to change the global default value, so to configure the request module to use an http agent with keepAlive set, a new agent must be created and passed in the options to each request. Separate agents are needed for http and https, but we want to make sure to re-use the same agent across requests, so we use a simple helper function to create or retrieve an agent:



const http = require('http');
const https = require('https');

const shared_agents = {'http:':null, 'https:':null};
const getAgentFor = (protocol) => {
    if(!shared_agents[protocol]){
        if(protocol === 'http:'){
            shared_agents[protocol] = new http.Agent({
                keepAlive: true
            });
        }else if(protocol === 'https:'){
            shared_agents[protocol] = new https.Agent({
                keepAlive: true,
                rejectUnauthorized: true
            });
        }else{
            throw new Error(`unsupported request protocol ${protocol}`);
        }
    }
    return shared_agents[protocol];
};

And then when making requests, simply set the agent option:


args.agent = getAgentFor(new URL(args.url).protocol);
request(args, callback);

For Coggle, this simple change had a dramatic effect not only on the latency of internal requests (requests are much faster when a new connection doesn’t have to be negotiated), but also on CPU use: for one service, a reduction of 70%!

graph showing dramatic reduction in CPU use

The AWS SDK

As with the request module, the AWS SDK for nodejs will also use the default http Agent options for its own connections - meaning again that a new connection is established for each request!

To change this, httpOptions.agent can be set on the constructor for individual AWS services, for example with S3:

const https = require('https');
const s3 = new AWS.S3({
    httpOptions:{
        agent: new https.Agent({keepAlive:true, rejectUnauthorized:true})
    }
});

Setting keepAlive when requests are not made sufficiently frequently will have no performance benefit: instead there will be a slight cost in memory and CPU from maintaining connections only for them to be closed by the remote server without being re-used.

So how frequently do requests need to be made for keepAlive to show a benefit - or, in other words, how long will remote servers keep the connection open?

When keepAlive Makes Sense

The default keep-alive timeout for nodejs servers is five seconds, and helpfully the Keep-Alive: timeout=5 header is set on responses to indicate this. For AWS things aren’t so clear.

While the documentation mentions enabling keepAlive in nodejs clients, it doesn’t say how long the server will keep the connection open, and so how frequent requests need to be in order to re-use it.

Some experimentation with S3 in the eu-west-1 region showed a time of about 4 seconds, though it seems possible this could vary with traffic, region, and across services.

But as a rough guide, if you’re likely to make more than one request every four seconds then there’s some gain to enabling keepAlive, and from there on, as request rates increase, the benefit only grows.

Combined Effect

For Coggle, the combined effect of keepAlive for internal and AWS requests was a reduction in median response time from about 60ms to about 40ms, which is quite amazing for such simple changes!

In the end, this is also a cautionary tale about making sure default configurations are appropriate, especially as patterns of use change over time. Sometimes there can be dramatic gains from just making sure basic things are configured correctly.

I’ll leave you with the lovely graph of how much faster the average request to Coggle is since these changes:

graph showing reduction in response time from 60ms to 40ms

I hope this was an interesting read! As always if you have any questions or comments feel free to email [email protected]


Posted by James, June 16th 2021.

Tags: coggle, tech, performance, http, https, latency, aws, nodejs, keep-alive, posted by james

The Graph That Could Have Killed Coggle

Today we’re back for another look behind the curtain at the tech behind Coggle. This time, the economics of running a web service, and how we’ve made sure Coggle will be around for many years to come.

One of the most important things about Coggle has always been that we’re building for the long term: a sustainable service that you can rely on far into the future. We don’t have venture capitalist investors looking for a quick return; we just want to build a service that you find useful, provide it to as many people as possible for free, and charge a low, sustainable price for those who upgrade from free to our more advanced plans.

To make this work it’s really important that Coggle is hosted efficiently, so that we can sustain a large number of free users, and keep everyone’s data safe and secure without paying huge hosting costs.

Since its inception in 2013, Coggle has always mostly been on track with that, but until recently there was one Achilles’ heel, a creeping cost which threatened to undermine our sustainability.

The Graph

graph showing increasing database storage size from 50GB in 2015 to 500GB in 2019

The first thing to notice about this is that it’s going up. That’s great, right! Right? Well, sort of. This graph is showing our database storage over time, going back five years to July 2015. An increase means more Coggle documents being created and stored, which means more people making documents, finding Coggle useful, and sharing it with their friends, which is all hunky dory.

However, database storage is *expensive*.

This might be surprising: a lot of the writing about tech companies and startups is about how hardware is cheap, how the cloud has made server costs vanishingly small, and how it’s people who are the expensive part of running a business.

Well, for the growth path of venture-capital-backed companies that’s often true: the team is growing just as fast as the data that’s being stored, gearing up the whole company to succeed – or to fail – fast. If your company might not be around in two years, you’re not worrying about how much it costs to store your data for decades to come.

But for a sustainable company like Coggle, it’s different. The durability of Coggle documents is very important: we think that one of the most important aspects of a web application that replaces something as simple as pencil and paper is that you must be able to rely on it. And rely on it to be not just as durable as the physical alternative, but even more so.

That means we have always stored all the data of Coggle documents across multiple physically separate servers in separate ‘availability zones’ of a datacenter, and since February 2017 we’ve also stored data across entirely separate datacenters in two different countries (Ireland, and either the UK or Germany). A single failure or disaster would never destroy a Coggle document.

This durability is why database storage is expensive, and why the graph was potentially such a problem. By 2019 the simple storage cost of five copies (plus backups) of this data, stored on high-speed disks for quick database access, was the single biggest line item in our monthly server bills. And this cost was never going to go down by itself.

Costs Up and to the Right

It was apparent that this threatened the sustainability of Coggle. So, what could we do? One option would have been to either set a time limit for the storage of free Coggle documents, or add incentives for people to delete documents they no longer need.

The problem with this is it makes Coggle diagrams seem fragile, and not like a durable physical document. It’s important that even the free version of Coggle is something you can use sustainably, because the free version of Coggle is the way most people first use it, and it sets your expectations of the paid version.

If the free version deleted your data, wouldn’t you fear that the paid version might also delete it? What if you ever miss a payment or circumstances mean you want to downgrade again?

So deleting data, even of just free customers, is not an option for us. The only other possibility was to look at how the data is stored. Can we store documents durably, and reliably, without using such expensive storage?

Blob Storage vs Database Storage

In short, yes.

Most of the data in Coggle documents does not need to be stored in a database. While we store each individual change that was made to Coggle diagrams (this is how collaboration and history mode work in Coggle), Coggle documents are more often accessed like files, with all of the data loaded at once, and new data added only at the end of the file.

Cloud services have always supported ‘blob storage’ for storing this kind of data – which is about ten times cheaper than database storage for each gigabyte stored, and has the advantage of being completely flexible, with built-in archival options for infrequently accessed data: no disks need to be provisioned in advance for data which grows over time.

The challenge with blob storage is that access to the data is the primary cost, instead of the stored data size: each time a file is updated or viewed there is a small cost, and there is no simple way of adding data onto the end of an existing file. (Technically speaking, blob storage is ‘eventually consistent’, which means all data will be saved safely but might be temporarily unavailable: there is no built-in way to append data while being sure nothing else is overwritten.)

This means that to store Coggle documents in blob storage, we’ve built a new back-end storage service that saves and loads this data. This service does the bookkeeping necessary for adding data onto files as changes are made, and ensures every change is saved immediately.
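We haven’t published the service itself, but the core bookkeeping idea can be sketched like this (ChunkedBlobFile and the put/get/list store interface are hypothetical names for illustration, not our actual implementation): each append writes a new, immutable chunk object under an ordered key, so nothing existing is ever overwritten, and reading a document concatenates its chunks in order.

```javascript
// Hypothetical sketch: append-only files layered on top of blob storage.
// `store` is any object with put(key, data), get(key), and list(prefix).
class ChunkedBlobFile {
    constructor(store, key) {
        this.store = store;
        this.key = key;
        this.nextChunk = 0;
    }
    async append(data) {
        // zero-padded chunk numbers sort correctly when listed, and each
        // append creates a brand-new object, so no existing data is touched
        const chunkKey = `${this.key}/${String(this.nextChunk++).padStart(8, '0')}`;
        await this.store.put(chunkKey, data);
    }
    async readAll() {
        // list the chunks under this file's prefix and join them in order
        const keys = (await this.store.list(this.key)).sort();
        const chunks = await Promise.all(keys.map((k) => this.store.get(k)));
        return chunks.join('');
    }
}
```

A real service also has to track chunk counts durably and compact small chunks, but the shape of the idea is the same.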

The Great Coggle Blob Storage Migration

Over the past year, we’ve migrated all of the Coggle diagram data to be stored in blob storage using this new storage service. This has been a substantial undertaking, involving migrating the format of all the data being stored, and the design of a completely new storage service, but now our costs are much better matched to the way Coggle is used, with editing and viewing documents forming the majority of our monthly bills instead of simply storing data.

All data is still saved across multiple independent locations, encrypted with independent encryption keys, and in each location data is stored across multiple independent disks with 99.999999999% durability: that makes losing data to hardware failure about a million times less likely than being struck by lightning. And even if a tsunami, comet, or other disaster completely wipes out our primary datacenter in Ireland, a copy of everything would still be safe in Germany (hey, given the progress of 2020 so far, we’re not counting anything out!).

I hope this was an interesting read! Here’s the graph again with annotations of different points in the migration:

graph showing increasing database storage size from 50GB in 2015 to 500GB in 2019. May-Jun 2019 data-format migration temporarily doubled storage size. August 2019: new documents stored in new service. From Sep 2019: older documents migrated to new service. Jul 2020: old copies of database data finally deleted, reducing database size dramatically.

Posted by James, August 17th 2020.

Tags: coggle, tech, database, behind the scenes, blob storage, posted by james

What We’ve Learned from Moving to Signed Cookies

We’ve recently moved Coggle’s login sessions from a database-storage model to signed cookies, where session data is stored in the session cookie itself.

There aren’t many real-world examples of how to handle this migration, so we’re sharing what we’ve learned doing this with node and express, and hopefully it’ll be a useful and interesting read!

Part 1: How Old Sessions Worked

Previously we handled sessions with the express-session module and connect-mongo data store, and then we used passport to load our actual user data based on the session. Our middleware setup looked like this:

const session = require('express-session');
const MongoStore = require("connect-mongo")(session);
const sessionStore = new MongoStore({ ...  });

// loads req.session from the database store, if the request included a valid session cookie
app.use(session({store: sessionStore, ...}));

// passport middleware loads req.user from our users collection based on the user ID stored in the session
app.use(passport.initialize());
app.use(passport.session());
// csrfMiddleware saves a CSRF token in the session
app.use(csrfMiddleware);

For each request that included a session cookie, the process was basically:

  1. Check the connect-mongo sessions collection in the database to see if the cookie is valid
  2. If it’s valid, load the session data (the user ID and anti-CSRF token) from the sessions collection
  3. Passport middleware loads req.user based on the user ID
  4. Our actual app logic runs
  5. Finally, if the session is updated (for example the cookie expiry is extended), re-save the session to the database. (express-session does this when the response is sent by hooking the response object)

The corresponding data for every single session cookie that hadn’t expired had to be saved in the database. This added up to a lot of session records!

Before the migration, sessions were the biggest cause of writes to our database, a significant source of reads, and the majority of the data we actually stored in our main database (the actual content of Coggle diagrams is stored separately). Our goal in moving to signed cookies was to significantly reduce the resources needed to host all of this.

Part of the reason for the volume of session data is that we have very long-lived session cookies, as we prioritise people being able to easily return to their Coggle diagrams. People forgetting which email address they used to log in and ‘losing’ their diagrams as a result is our biggest source of support requests.

Part 2: Choosing a Signed Cookie Implementation

An alternative to storing sessions in the database is to store the session data in the cookie itself, so that when each page is loaded the session data needed is immediately available in the cookies of the request. This is possible as long as there’s a cryptographic signature on the cookie to stop it from being tampered with: someone can’t change their cookie to log in to someone else’s account, because they have no way to forge the cryptographic signature.

There isn’t a formal standard for signing cookies, but the most common approach is to store a second cookie alongside each cookie to be signed, with a .sig extension to the name. This is the approach used by the cookies npm module, and the cookie-session middleware wraps this module into a convenient middleware which initialises req.session if the session cookie’s signature is valid.

We already use JSON Web Tokens in Coggle for authentication between our back-end services, so we also considered using JWTs as session cookie values. There are a number of advantages and disadvantages to this:

  • Public-keys could be used for signing, enabling our back-end services to verify signatures without access to the private signing key
  • Cookie values could be easily encrypted, as well as signed, by using the related JWE standard.
  • The additional information that makes JWTs portable (key ID, issuer, and using public-key signatures) also makes them bigger
  • Public-key signatures are significantly more expensive to sign and verify.
  • There are no readily available open source node modules for JWT-based session cookies.

Since we don’t need encryption and would prefer to use symmetric keys, we chose the cookie-session middleware. If you’re considering the same route, then think carefully about whether all of the data stored in your session should be unencrypted.

Part 3: Implementation

Secure Configuration:

The default for cookie-session (inherited from the cookies module), is to use the SHA1-HMAC signing algorithm. SHA1 has some weaknesses, so to be cautious we use SHA256-HMAC instead by passing our own Keygrip instance when creating the session middleware:
const signingKeys = new Keygrip([superSecretKey, ...], 'sha256');

const cookieSessionMiddleware = cookieSession({
  name: 'session-cookie',
  keys: signingKeys,
  maxAge: Session_Duration,
  httpOnly: true,
  sameSite: 'lax',
  signed: true,
  secure: true,
});

Handling CSRF:

We set SameSite=Lax on our session cookies, so it would not normally be possible for code on other sites to send potentially state-changing POST requests with the session cookie. However, in case people are using old browsers which do not support SameSite, or there is a bug in a browser’s implementation, we still also use an anti-CSRF token for state-changing requests.

Previously the CSRF token for each session was stored in the database, and the value sent with each request from the client compared against this - with signed cookies it’s instead stored in the cookie itself.

As the session cookie is stored as an HttpOnly cookie, it is not possible for malicious scripts running in the page to read its value, even though it exists on the client.

It might be possible for malicious javascript to overwrite the HttpOnly cookie, but in that case the cookie signature would be invalid.

This CSRF protection set-up is definitely a compromise, but as Coggle isn’t handling payments, we think it’s reasonable.

Migrating Old Sessions

It's important to migrate existing sessions so we don't log out users - running both the express-session and cookie-session middleware simultaneously isn't possible, as they both hook req.session and the response object.

As a result, we had to extract the logic from express-session which actually reads and verifies cookies (the getcookie function), and manually check the connect-mongo store, which is relatively straightforward:

const session = require('express-session'); // just for passing to connect-mongo, not used as middleware!
const MongoStore = require("connect-mongo")(session);
const legacySessionStore = new MongoStore({ ...  });

const loadLegacySession = function(req, callback){
  const session_id = getcookie(req, legacyCookieName, [legacyCookieSecret]);
  if(session_id){
    legacySessionStore.get(session_id, function(err, session){
      return callback(err, session_id, session);
    });
  }else{
    return callback(null, null, null);
  }
};

With this in place, the final middleware for migrating sessions is straightforward. The migration is only temporary - once all old sessions have expired, we’ll be able to just use the new cookieSessionMiddleware directly instead.

const sessionMiddleware = function(req, res, next){
  // first delegate to the new session middleware:
  cookieSessionMiddleware(req, res, function(err){
    if(err) return next(err);
    // then, ONLY IF there's no user ID in the new style 
    // session, try to load one from the legacy session 
    // so we can migrate it:
    if(!(req.session && req.session.passport && req.session.passport.user)){
      loadLegacySession(req, function(err, legacySessionID, legacySession){
        if(err) return next(err);
        // if there was a passport user ID in the old 
        // session, migrate it:
        if(legacySession && legacySession.passport && legacySession.passport.user){
          req.session.passport = {user:legacySession.passport.user};
          // also migrate any existing CSRF token, so 
          // CSRF tokens used in pages which are already  
          // open remain valid:
          if(legacySession._csrf){
            req.session.csrf = legacySession._csrf;
          }
        }
        // delete the old session:
        if(legacySessionID){
          deleteLegacySession(legacySessionID, req, res);
        }
        return next();
      });
    }else{
      // if we have an authed session from the new cookie 
      // already then we're done:
      return next();
    }
  });
};

After this, the passport middleware works exactly the same as before, loading req.user from session.passport.user.

Part 4: The Results!

We deployed the new sessions around January 27th. Based on one week either side of that, we saw some dramatic differences:

Database update operations, and corresponding db journal data, were reduced by approximately 80%, from 0.4MB/s to 0.08MB/s

Chart showing database update data rates dropping
Chart showing database journal data rates dropping

Database volume busy time (which previously limited our peak scaling), reduced from approximately 15% to approximately 3%. In theory we can now handle peaks of over 30x our normal traffic volume, instead of peaks of only 7x!

Chart showing increase in database volume idle time
(The reduction in the read ops of two of the volumes is primarily because they were being used as syncing sources for our off-site replicas - less journal data means less to be read for syncing.)

And finally, 311 GB of session data and corresponding indexes can eventually be dropped from our database (multiplied across replicas, that’s over 1.5TB of disk space, or about $160/month).


Hopefully this has been an interesting read. If you thought we were crazy to store our sessions in MongoDB in the first place, well, we also used to store the entire contents of Coggle documents in a MongoDB database too… maybe we’ll write about that next!


Posted by James, Feb 2020.

Tags: coggle, tech, nodejs, express, cookie-session, express-session, posted by james