Monitoring is a fundamentally difficult problem, which is annoying because it sounds like it should be easy. We built this service or application and now all we want to do is track its performance; easy, right? As with most things though, the devil is in the details. What do we want to monitor? What do we even mean when we say performance? On a first pass, it seems really tempting to just throw every possible piece of data we can into Prometheus, after all isn’t more data always better? Then your disk space is full within the hour, all of those fancy metrics you’re crawling are using up all of your bandwidth, and you can’t actually figure any of this out because your dashboards are all so overloaded that they grind to a halt - not to mention that you have so many different metrics and alerts that you can’t keep track of what anything is. Today, I’m not going to try and answer all that because I don’t have the answer to most of it (and if I did, I’d have written some brilliant book that I’d tell you all to buy). What I will do is explain how we at OkCupid have tried to regain some of our sanity by extracting our latency directly from HAProxy using Fluentd.
Why latency? Latency is widely recognized as a pretty good signal to monitor; for example it’s included in Google’s Golden Signals. Latency, like monitoring in general, feels like it should be pretty easy to gather - start a timer at the beginning of the code path on your server and then stop it at the end. However, most engineers are probably cringing at that last sentence because rarely is a system so simple. At OkCupid for example, an average user request goes through layers of load balancers and ssl termination, several different services (often on different servers), and sometimes databases before getting turned back around to the user. One possible solution to this is to implement a distributed tracing system. These have some really great benefits beyond just capturing end-to-end latency but also a lot of complexity (and potential performance hits on services that run pretty hot) that makes implementing a production-ready distributed tracing system a bit of a longer project. We instead chose to leverage two pieces of software that already exist in our infrastructure, HAProxy and Fluentd (specifically the td-agent version), that got us a pretty good measure of latency with much less risk and effort.
HAProxy is a good candidate for a place to get latency from because it provides a rather robust set of metrics. As pointed out by this article on Medium, for latency,
Ta is especially useful as this measures: “total active time for the HTTP request, between the moment the proxy received the first byte of the request header and the emission of the last byte of the response body.” The aforementioned article was actually a great first step for us and we started out by following the HAProxy configuration from there and the initial Fluentd configuration from there; you should definitely check it out! We quickly ran into two roadblocks from there though. The first issue was when turning on
httplognormal our logfiles were inundated with requests and filled up within ten minutes. This occurred because, previously we were logging errors to logfiles that could be shipped to Scalyr or other sources. If we wanted to be able to get our metrics from HAProxy through fluentd we would also need to start logging our errors to files via Fluentd which meant that we needed to add an extra step in Fluentd of filtering out errors and logging them or accept that we were going to lose some observability that we weren't willing to compromise on. Our next issue was that we wanted to be able to graph latency per endpoint in Grafana. Our HAProxy backends handle hundreds to thousands of different endpoints, therefore without being able to split by requests by endpoints we would have a latency metric that would be far too vague to actually be useful to us. This meant that we were going to need to parse the actual request so we had to modify the regex that Tom Fawcett had provided.
To handle the error parsing via Fluentd, we added in a
record_modifier block to pull out the status code:
Then, after constructing the histogram, we added a
rewrite_tag_filter block to filter out our errors and log them:
The second issue was a bit more complicated. We went back to the drawing board and decided to split our request capture group into three groups: HTTP verb (which would be one word until a space), HTTP request (which would be anything until a space), and then HTTP request version (which would be HTTP \d.\d). This was a fairly naive first pass that quickly failed us; I’m sure that some Prometheus aficionados see where this one is going. We weren’t splitting the request and parameter so a request like
/ratecard?country_code=US would be different from a request like
/rate?country_code=UK which would lead to an explosion in the number of latency metrics we were grabbing, ultimately leading to the scenario discussed at the top of the article. So we split the HTTP request group into two groups:
request_parameter (separated by a
?). This seemed to work well but we ran into another problem: with some endpoints (like our profile endpoint) the user’s profile uuid is part of the path causing more fragmentation. So we added a filter to our
request_uri capture group to filter out any path that was composed solely of
This seemed to work for a while. Until we ran into our next issue, we were generating way too many timeseries causing Prometheus to take a long time to scrape our endpoint leading to us needing to decide between either increasing our scrape time and delaying the resolution of our monitoring (possibly indefinitely) or needing to restart Fluentd every now and then (we run Fluentd in a container, so this might have actually been a plausible aapproach but we thought it felt a bit too hacky). We discovered that this was caused by a few requests that still embedded a user’s username (which was deprecated by us and likely caused by users using bookmarked links) or contained other information that we weren’t particularly interested in. This led us to try and devise a way to edit the
request_parameters. We didn’t want to write our own filter so we used a small ruby function embedded in a record_modifier:
This list required a bit of monitoring and gradually expanding upon but we arrived at something that we were happy with. We then imported the metrics to Grafana and created some pretty graphs to visualize our endpoints that can be filtered by endpoint, status code, or server:
This was a really interesting project that gave us some insight into our end-to-end latency as well as some of the interesting requests we get. Some of these are just fun to look at; for example: one day we had ~100 different variations of a request for okcupid.com/casino (while love may sometimes feel like a gamble, we definitely don’t run a casino). Other times, this could illuminate actual attempts on vulnerability fuzzing; for example: we routinely see requests that are clearly looking for sqldumps or open adminer endpoints. We can also gain some insight on number of requests that return 500s and alert on an increase in these. I’d love to hear how you’re monitoring latency, if you’re leveraging HAProxy metrics for any other useful monitoring, or if you just want to say hi!