A lot has been written about the benefits of moving from a REST API to a GraphQL API¹. But let’s say that you’re already convinced. If you want to convert a site with millions of users, ensure that performance doesn’t suffer, and just really don’t want to screw it up: how do you do it?

We embarked on this journey last year and made it out alive to tell the tale! Our GraphQL API is now the official API at OkCupid, with all clients adopting it: our iOS and Android apps, as well as our desktop and mobile web single-page React apps.

So, here’s how we tackled this huge project. I’ll talk a little about what we built, the strategy we came up with to test the new code we were shipping, and a few things that could have gone better on the technology side. Disclaimer: this article is more about the process than the code itself; check back soon for another post about the performance issues we had to overcome to reach parity with our previous API.

But first, some stats

Our GraphQL API has been in production for 1½ years, and we stopped adding new features to our REST API over a year ago. The graph handles up to 170k requests per minute, and it is made up of 227 entities.

We haven’t fully deprecated our REST API, but we’re more than halfway through converting our clients if you look at request volume (we’ve added the entities that support the most popular pages), and maybe a little less than halfway there by entity count.

How we did it

Since this was a whole new tech stack and repository for us (Node, Apollo Server, Docker²), we needed to figure out a plan to verify its efficacy without disrupting production. Our process was:

  1. Pick an appropriate page to convert
  2. Build the schema
  3. Add a shadow request to call the new API while still fetching data via the REST API
  4. Do an A/B test with real users that changes the data source

We kicked off the project at the beginning of January 2019, released our shadow query on January 28th, started our A/B test on March 13th, and released it fully on April 30th. So in just 4 “easy” steps, you too can have a graph in production in “only” 4 months!

So let’s dig into each step.

1. Pick an appropriate page to convert

We decided to make the OkCupid Conversations page our test bed. On this page, users can see the list of ongoing conversations they have, as well as a list of “mutual matches” (people with whom they can start a new conversation):

Screenshot of the OkCupid Conversations page at the time, with a horizontal list of people with whom you've matched at the top and a vertical list of messages under it
The conversations page at the time of conversion

It’s important to choose a page that will let you model some core parts of your site; this will help you settle on conventions, flesh out important parts of your data model, create a better base for future work, and just be a better proof of concept. The more “real” the page is, the more it will help you learn if the new API is going to work.

We chose the Conversations page, which made us consider how to represent:

  • User: basic information about a user account
  • Match: stateful information about how two users relate to each other (e.g., match percent, if one has liked the other, etc.)
  • Conversation: basic conversation information (e.g., the sender, a snippet of the last message, the time sent)

It also got us thinking about some reusable API concepts like pagination.
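
To make that concrete, here’s a simplified sketch of how those three entities might look in GraphQL schema language. The field names here are illustrative, not our exact schema:

""" Basic information about a user account. (Illustrative fields only.) """
type User {
    id: ID!
    displayName: String!
    age: Int!
}

""" Stateful information about how two users relate to each other. """
type Match {
    """ The other user in the match. """
    user: User!
    matchPercent: Int!
    theyLike: Boolean!
    youLike: Boolean!
}

""" A single ongoing conversation. """
type Conversation {
    """ The user on the other end of the conversation. """
    correspondent: User!
    """ A snippet of the last message sent. """
    snippet: String!
    """ When the last message was sent. """
    time: String!
}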

2. Build your schema

For a lot of teams doing schema design for the first time, this will likely be a challenging step — it was for me! Some tips:

  • Do research. There is a lot of great writing about schemas, from the basic examples in the GraphQL docs, to GitHub and Yelp’s public APIs, to Relay’s docs. A big shout-out to the Apollo team here; we got great help from them at this stage.
  • Don’t worry about how your REST API formatted its data. It’s better to design your schema to be more expressive and idiomatic than it is to feel constrained by what your previous API returned.
  • Be consistent. Our previous API was mostly snake_case, but had a few ugly combined words (e.g., userid and displayname). This is your opportunity to make your field names more standard and readable, so take it!
  • Be specific. The more accurately you name the fields in your graph, the easier it is to migrate to a new field if you need to make a breaking change. For example, User.essaysWithDefaults is better than User.essays (there’s a small sketch of this after the list).
  • Take your research and make something that works for your team. When investigating pagination standards, for example, I was tempted to use Relay’s spec, but found its reliance on terms like edges and nodes more clinical than we wanted to expose to clients in our graph (we instead settled on returning a list of data³).
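
To make the consistency and specificity tips concrete, here’s the kind of renaming we mean — again, illustrative names rather than our real schema:

type Essay {
    title: String!
    body: String!
}

type User {
    # was `userid` in the REST API
    id: ID!
    # was `displayname` in the REST API
    displayName: String!
    """ The user's essays, with site defaults substituted for any left blank. """
    essaysWithDefaults: [Essay!]!
}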

3. Add a shadow request

Before having GraphQL provide data to real users, we tested our system in production with a shadow request: our target page fetched its data from the REST API and rendered it as usual, then issued the equivalent GraphQL query in the background and discarded the duplicate data. This let us compare the performance of the two APIs and fix issues before users found them.
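
In practice, the shadow request was a fire-and-forget query issued once the page had rendered from REST data. A rough sketch of what such an operation could look like — the me root field and the specific fields here are hypothetical:

# Fired in the background after the REST-powered render; the response is discarded.
# Operation, root field, and field names are hypothetical.
query ConversationsShadowQuery {
    me {
        conversations(limit: 20) {
            data {
                snippet
                time
            }
        }
        matches(limit: 20) {
            data {
                user {
                    id
                    displayName
                }
                matchPercent
            }
        }
    }
}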

We certainly aren’t the first people to think of this, but it was a massively important step for us. Our first draft of this API took nearly twice the time of the REST API, which, obviously, was not cool. Releasing a shadow request allowed us to triage these performance issues without affecting real users’ experience on the site.

Check back soon for a post about the technical side of what went wrong and how we got GraphQL up to speed parity.

4. Run an experiment

The final step was to test the new API against the old one with real users! Since we had already verified that response times were similar via the shadow request, we felt confident releasing an A/B test.

Experiments where you expect not to see a change are tricky because you are trying to prove that nothing happened. So in an experiment like this, the stats you’re tracking will, by nature, never reach significance unless there’s something wrong.

So instead of looking for a significant change in stats, you should set a duration for your experiment; once you’ve reached that duration and still see no significant changes, you can launch with confidence. For us, that was a month’s run (with over 100k users in each group). And… it worked!

What could have gone better

No first draft is ever perfect (nor any second draft, for me at least). While the process around releasing the API went well, there were a few technical things we learned after our release.

Error handling

We didn’t have any structure around how we returned errors from GraphQL mutations, and by the time we realized there was a problem, we had a robust variety of ways of surfacing errors to our clients. A solution that seems really interesting would be to standardize on an Error type that we can extend in a given mutation payload. This Medium post has a very in-depth writeup of good error styles.
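
A hedged sketch of that idea — not something we’ve shipped, and the sendMessage mutation here is hypothetical — might look like:

""" A common shape for expected, user-facing errors. """
interface Error {
    message: String!
}

""" An error specific to the sendMessage mutation. """
type SendMessageError implements Error {
    message: String!
    """ True if the recipient is no longer available to message. """
    recipientUnavailable: Boolean!
}

type SendMessagePayload {
    """ The id of the new message, when the mutation succeeds. """
    messageId: ID
    """ Populated instead of messageId when the mutation fails in an expected way. """
    error: SendMessageError
}

type Mutation {
    sendMessage(conversationId: ID!, text: String!): SendMessagePayload!
}

Clients can then check for a non-null error field on every mutation payload in the same way.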

Where should business logic go?

When confronted with a product feature that involves a business rule, it can be tempting to add that logic to the API layer, especially if you’d otherwise be relying on another team to implement it.

For example, we built a feature that shows a list of everyone who liked and messaged you. We show the whole list to paid users, but for free users we only show the first one, then a series of placeholders. Our first release of this feature had the logic to check a user’s paid status and replace the cards with placeholders in the API layer.

After working with the graph for a while now, we’ve realized that the business logic works best when centralized in the back-end, and that the role of our graph is to fetch, format, and present the back-end’s data in a way that makes sense to clients.
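
As an illustration of that split (not our actual schema), the graph can stay agnostic about paid status by exposing whichever card type the back-end chooses to return for each slot:

""" Illustrative only: the back-end decides which member of the union to send. """
union IncomingLikeCard = LikerProfile | LockedLikePlaceholder

type LikerProfile {
    id: ID!
    displayName: String!
}

""" A blurred stand-in shown to free users; its contents come from the back-end. """
type LockedLikePlaceholder {
    blurredImageUrl: String!
}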

That’s it, y’all

Overall, our process worked out really well; it allowed us to get something into production quickly to validate our technical decisions, fix errors before they got to users, and test our changes against the previous API.

If you decide to take a similar journey, we hope this roadmap will be useful. Good luck!


Thanks to Katherine Erickson, Raymond Sohn, and the OkCupid web team for reading drafts of this article.

Footnotes



1. For us, it boiled down to: a more expressive way for clients to interact with our data, a more performant way to retrieve data with fewer network requests, more flexibility for our clients to create new features without API changes once the graph was built out a bit, and a technology that is rapidly being adopted as a community standard for APIs.




2. This was a greenfield project, built in a new repository and deployed separately from our back-end and client codebases. It runs in Node, using Apollo Server and Express. Our data was provided by calls to our REST API for the initial release, but we’ve since moved to calling our back-end directly using gRPC.

The API is deployed with Docker: we build Docker images with CI, and orchestrate releasing those images to our web servers with Docker Swarm. A huge, truly enormous shout-out goes to Hugh Tipping on our ops team for putting together Docker Swarm and a launch script to interact with it, along with tons of Docker experience and support! Also emotional support.

We use Apollo Client across all platforms (desktop/mobile web, iOS, and Android), and integrated with Apollo Studio to use their Operation Registry for security and to track speed and field usage stats.




3. edges and nodes didn’t feel right to us, but the Relay description of paging cursors was pretty spot on. So, we use a data array for the items, and a Relay-inspired PageInfo entity:

""" A common format to use when describing a page of paginated data. """
type PageInfo {
    """ The key to get the previous page of results, if available. """
    before: String

    """ The key to get the next page of results, if available. """
    after: String

    """ A boolean to indicate that more results are available. """
    hasMore: Boolean!

    """ The total number of results available. """
    total: Int!
}

""" An interface to ensure that paginated results have info about the current page. """
interface PageResult {
    pageInfo: PageInfo!
}

""" A paginated list of a user's conversations. """
type ConversationConnection implements PageResult {
    data: [Conversation]!
    pageInfo: PageInfo!
}

extend type User {
    """ A list of this user's conversations. """
    conversations(
        limit: Int = 20
        before: String
        after: String
    ): ConversationConnection!
}