As we were building our GraphQL API in a totally new stack, we wanted to see how it would measure up against our previous REST API with a real production load, and we wanted to do so without negatively impacting the user experience.
To do this, we released what we called The Shadow Request. On our target page, the user’s browser loaded the page’s data from the REST API as normal and displayed the page. Then the browser loaded the same data from GraphQL, measured that call’s timing, and discarded the data.
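In code, the idea can be sketched roughly as follows. This is a minimal illustration, not our production implementation: the names fetchFromRest, fetchFromGraphQL, render, and reportTiming are all hypothetical stand-ins for whatever the page really uses.

```typescript
// Hypothetical helper: every name here is illustrative.
type Fetcher<T> = () => Promise<T>;

async function loadWithShadowRequest<T>(
  fetchFromRest: Fetcher<T>,
  fetchFromGraphQL: Fetcher<unknown>,
  render: (data: T) => void,
  reportTiming: (ms: number, ok: boolean) => void,
): Promise<void> {
  // 1. Load and render from the existing REST API exactly as before.
  const data = await fetchFromRest();
  render(data);

  // 2. Only after the page is displayed, issue the shadow request.
  const start = Date.now();
  let ok = true;
  try {
    await fetchFromGraphQL(); // the result is intentionally discarded
  } catch {
    // A failing shadow request must never affect what the user sees.
    ok = false;
  }

  // 3. Record how long the GraphQL call took, and whether it succeeded.
  reportTiming(Date.now() - start, ok);
}
```

The important design choice is ordering: the shadow call starts only after the page has rendered, so a slow or failing GraphQL request shows up in the metrics rather than in the user experience.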
We didn’t come up with this idea, but it was a game changer for us: we discovered that our first release of the GraphQL API took about double the time — 1200ms versus 600ms — of the REST API. If we had shown this version to real users, it would have led to a very poor experience for them.
To read about how this test fit into our overall process for releasing a GraphQL API at OkCupid, check out my previous post about the transition. But here, I will talk about the improvements we found in our Docker and Node environments, how GraphQL resolvers work on lists of entities, and CORS requests. So, let’s take a look!
Docker and Node Low-Hanging Fruit
The first thing that we realized was that I had accidentally released a build with NODE_ENV set to development. You always hear that you shouldn’t do this, since development mode enables more logging and slower code paths in packages. But now I have the empirical evidence to say why not: changing it to production saved us 34ms per request on average.
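One way to guard against this kind of slip is a small startup check in deployed builds. The sketch below is an assumption about how such a check could look, not something we describe shipping; assertProductionEnv is a hypothetical name, and it takes the environment as an argument so it can be called with process.env when the server boots.

```typescript
// Hypothetical startup guard for deployed builds.
function assertProductionEnv(env: Record<string, string | undefined>): void {
  if (env.NODE_ENV !== "production") {
    const actual = env.NODE_ENV || "(unset)";
    throw new Error("Expected NODE_ENV=production, got " + actual);
  }
}

// Usage at server startup (deployed builds only):
//   assertProductionEnv(process.env);
```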
We were also using an unoptimized Docker base image for this initial deploy. Switching to node-stretch-slim reduced our image size by 600MB (850MB to 250MB). While this didn’t speed up the response time of the application, it did make our development cycle quicker by making the build and deploy process faster.
These were not the biggest wins, but they were two of the easiest!
Naive GraphQL Resolvers Can Be Sloooow
If you have a field that returns a list of entities (in our case, OkCupid users) you’ll probably be getting information about each of those users, like their name or age.
The page that we were converting to GraphQL for this deploy was the OkCupid messages page. When making our schema, we defined a Conversation as having a text snippet from the last sent message, and a User entity representing the person to whom you’re talking. Then we added a field on the top-level User entity to fetch that user’s conversations. Here are the relevant parts of the schema and resolver:
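A minimal sketch of that shape might look like the following. The field names (snippet, correspondent, conversations) and the back-end client methods (getConversations, getUser) are hypothetical, chosen only to match the description above; they are not the actual OkCupid schema.

```typescript
// Hypothetical schema: field names are illustrative only.
const typeDefs = `
  type User {
    id: ID!
    name: String
    age: Int
    conversations: [Conversation!]!
  }

  type Conversation {
    snippet: String
    correspondent: User!
  }
`;

interface UserRecord { id: string; name: string; age: number }
interface ConversationRecord { snippet: string; correspondentId: string }

// Hypothetical back-end client passed in through the resolver context.
interface Backend {
  getConversations(userId: string): Promise<ConversationRecord[]>;
  getUser(userId: string): Promise<UserRecord>;
}
interface Context { backend: Backend }

const resolvers = {
  User: {
    conversations: (user: UserRecord, _args: unknown, ctx: Context) =>
      ctx.backend.getConversations(user.id),
  },
  Conversation: {
    // Naive version: this runs once per conversation in the list,
    // so each conversation costs one getUser call to the back-end.
    correspondent: (conv: ConversationRecord, _args: unknown, ctx: Context) =>
      ctx.backend.getUser(conv.correspondentId),
  },
};
```

Nothing in this resolver map knows that Conversation.correspondent will be called for a whole list, which is exactly why the per-item back-end calls pile up.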
This worked; we deployed and celebrated! But when we looked at a request’s stack trace, we saw something that looked like this:
Uh, ok… that waterfall is definitely NOT what we were looking for. But thinking about it, it makes sense: we only told the resolver how to get a single user’s information, so it does all it knows how to do and makes 20 cascading requests to the back-end.
But, we can do better. We happened to already have a way to get information about multiple users from the back-end at the same time, so the solution was to update our resolver with a package to batch multiple requests of the same entity type. Lots of folks use DataLoader, but in this particular example I found GraphQL Resolve Batch to be more ergonomic. Here’s our updated resolver:
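As a rough illustration of the batching idea, here is a self-contained sketch. The batchResolver helper is a hand-rolled stand-in written for this example, not the GraphQL Resolve Batch API (that package exposes createBatchResolver, which plays roughly this role), and the record types and the getUsers signature are assumptions.

```typescript
interface UserRecord { id: string; name: string; age: number }
interface ConversationRecord { snippet: string; correspondentId: string }
// Hypothetical back-end client with a batch endpoint.
interface Backend { getUsers(ids: string[]): Promise<UserRecord[]> }
interface Context { backend: Backend }

interface Pending<P, R> {
  parent: P;
  resolve: (result: R) => void;
  reject: (err: unknown) => void;
}

// Stand-in helper: collects every resolver call made in the same tick
// and hands all of their parents to batchFn in a single call.
function batchResolver<P, C, R>(batchFn: (parents: P[], ctx: C) => Promise<R[]>) {
  let queue: Pending<P, R>[] = [];
  return (parent: P, _args: unknown, ctx: C): Promise<R> =>
    new Promise<R>((resolve, reject) => {
      queue.push({ parent, resolve, reject });
      if (queue.length > 1) return; // a flush is already scheduled
      // Flush in a microtask, after sibling resolver calls have queued up.
      // The context of the first call in the batch is used for the whole batch.
      Promise.resolve().then(async () => {
        const batch = queue;
        queue = [];
        try {
          const results = await batchFn(batch.map((b) => b.parent), ctx);
          batch.forEach((b, i) => b.resolve(results[i]));
        } catch (err) {
          batch.forEach((b) => b.reject(err));
        }
      });
    });
}

// Looks like a normal resolver, but receives all parents at once.
const correspondent = batchResolver(
  async (conversations: ConversationRecord[], ctx: Context) => {
    const ids = conversations.map((c) => c.correspondentId);
    const users = await ctx.backend.getUsers(ids); // one back-end call
    const byId = new Map<string, UserRecord>(
      users.map((u): [string, UserRecord] => [u.id, u]),
    );
    // Results must line up with the parents, one per conversation, in order.
    return conversations.map((c) => byId.get(c.correspondentId));
  },
);
```

The batch function returns results in the same order as the parents it was given; that ordering contract is what lets each individual resolver call get its own user back.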
So here, we pass the package a function that looks like a normal resolver, but instead of getting the parent as the first argument, the package provides parents (our list of conversations). We then pluck out the user IDs and call our batch API endpoint, getUsers. That change sliced off almost 275ms from the call, and the timeline looked pretty darn slick:
Subdomains + CORS Didn’t Work For Us
Those two changes got us most of the way there, but our GraphQL API was still consistently 300ms slower than our REST API. Since we had already pared down the server side of things as much as we could, we started looking from the client’s perspective.
Early on in the project, we decided to serve our API from graphql.okcupid.com, and saw that user requests from www.okcupid.com were triggering a CORS preflight. That is normal, but the preflights were taking what felt like an eternity: 300ms (does that time ring a bell?). We investigated a number of angles with our Ops team (was it Cloudflare? our load balancer, HAProxy?), but didn’t come up with any reasonable leads. So, we decided to just try serving it from www.okcupid.com/graphql, and the 300ms vanished. What a trick!
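To see why moving the API to a path on the same host removes the preflight, here is a simplified sketch of the browser's decision. It assumes the GraphQL call is a POST with a JSON body, which is typical for GraphQL clients but is an assumption about our setup here. The function triggersPreflight is hypothetical, and it deliberately ignores other triggers such as custom request headers.

```typescript
// Methods and Content-Type values that do not, on their own, force a preflight.
const SAFELISTED_METHODS = ["GET", "HEAD", "POST"];
const SAFELISTED_CONTENT_TYPES = [
  "application/x-www-form-urlencoded",
  "multipart/form-data",
  "text/plain",
];

// Simplified model: real browsers also consider custom headers and more.
function triggersPreflight(
  pageUrl: string,
  requestUrl: string,
  method: string,
  contentType?: string,
): boolean {
  // Same origin means same scheme, host, and port; the path does not matter.
  if (new URL(pageUrl).origin === new URL(requestUrl).origin) return false;
  if (SAFELISTED_METHODS.indexOf(method.toUpperCase()) === -1) return true;
  if (contentType === undefined) return false;
  const essence = contentType.split(";")[0].trim().toLowerCase();
  return SAFELISTED_CONTENT_TYPES.indexOf(essence) === -1;
}

const page = "https://www.okcupid.com/messages";

// Different subdomain means a different origin, and a JSON POST is preflighted.
const viaSubdomain = triggersPreflight(
  page, "https://graphql.okcupid.com/", "POST", "application/json",
); // true

// Same origin, different path: no CORS involved, so no preflight.
const viaPath = triggersPreflight(
  page, "https://www.okcupid.com/graphql", "POST", "application/json",
); // false
```

Under this model the extra OPTIONS round trip exists only in the subdomain case, which is consistent with the 300ms disappearing once the API moved to a path.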
Hey, It Worked
After releasing this series of changes to our setup, we reached parity with our old REST API. We discovered and fixed issues with our Node environment, GraphQL resolvers, and CORS, all without impacting site performance. And we were then well positioned to release an experiment that compared real users loading data from GraphQL versus the REST API.
If you are considering adding new technology to your stack, hopefully you will consider a shadow request to validate it. And if that new technology happens to be a GraphQL API, hopefully you can avoid some of the pitfalls that we hit. Good luck!
Thanks to Raymond Sohn and the OkCupid web team for reading drafts of this article.