Nobody wants to be woken up at 4 am

Farfetch's journey into monitoring started at its foundation, 10 years ago. As one might expect, several approaches were tested, implemented and discarded along the way. This organic growth led to a proliferation of tools solving specific use cases, usually without aiming for a consistent and holistic solution to the company's monitoring needs.

Speaking in the infrastructure context, the monitoring tooling in use could be distilled into two components: a SaaS monitoring platform and a blackbox checking tool.

These were the two main solutions if you required monitoring in infrastructure. The SaaS was normally only useful when debugging an issue, not so much for alerting. As for the blackbox tool, tens of thousands of ad-hoc checks had been created over the years.

With such a volume of alerts, traceability and ownership were lost, and the manual process to configure new checks or alerting routes wasn't helping either. The management toil for such a system was tremendous. Having hundreds of firing alerts became the status quo and alert fatigue settled in. It became usual to be woken up several times a night for trivial things like CPU pressure on a single instance.

We decided it was time to stop, think and solve our ongoing issues, because nobody wants to be woken up at 4 AM for something that doesn't impact our customers.

Monitoring, in our current context, means metrics and their related visualisation and alerting.

So, we started talking with potential stakeholders, collected their requirements, added our own and came up with this list:

We validated several solutions (free and paid, closed and open source) against that list and weighed the pros and cons in a fully transparent way so the entire company could pitch in. In retrospect, I believe that was one of the cornerstones of the current solution's success. Everything was laid out for everyone to see, not only the benefits but, most importantly, the shortcomings, so there could be no mismatched expectations.

Since the toughest problems weren't technical, we also needed a shift in mindset regarding alerting. We had to step away from the "that alert is normal" attitude, so we dropped the alerting levels altogether and replaced them with the following approach:

To achieve our goals, we collectively decided on Prometheus as our main tool, back then an incubating project at the Cloud Native Computing Foundation (CNCF). Obviously, moving away from a SaaS would mean more time spent on something unrelated to the company's core business, but in this specific case we figured we could gain much more, beyond the obvious cost reduction of choosing the free and open-source path.

We designed a Prometheus-based stack that could be easily provisioned in each of our datacenters, making it as cloud-vendor agnostic as possible and ensuring no manual configuration was required. The following diagram shows the logical layout of the final result, and we'll provide an overview of each component.

From the get-go, we needed to isolate the state of the stack to Prometheus itself so any other component could be scaled horizontally without effort.

The dashboards are built to be agnostic of the datacenter and the environment, enforcing templating on pretty much every Prometheus query. They are added via merge request and, after proper validation, are deployed to every instance.
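As a purely illustrative sketch of that templating, a dashboard panel queries through Grafana template variables instead of hard-coded label values, so the same dashboard renders correctly in any datacenter or environment (the metric and label names here are made up):

```promql
# Illustrative templated panel query; $datacenter and $environment are
# Grafana template variables, not hard-coded values.
sum by (service) (
  rate(http_requests_total{datacenter=~"$datacenter", environment=~"$environment"}[5m])
)
```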

Because we fully reset Grafana on each deployment, we hit an issue where folder and dashboard IDs would change across instances, breaking the folder structure when data was requested from different nodes. For a while we worked around this with sticky sessions on the load balancer. To fix the problem permanently, we built a module that talks to the Grafana API and enforces the correct IDs for folders and dashboards, so sticky sessions on the load balancer are no longer required. These dashboards are read-only, since changes are made solely via source control.
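As a minimal sketch of the idea, not the production module: Grafana's folder and dashboard HTTP endpoints accept an explicit uid, so every instance can be provisioned with identical identifiers. The helper names, addresses and payload details below are illustrative.

```python
import requests

GRAFANA_URL = "http://grafana.local:3000"          # hypothetical instance address
HEADERS = {"Authorization": "Bearer <api-token>"}  # token deliberately elided


def ensure_folder(uid: str, title: str) -> None:
    """Create a folder with a fixed uid if it doesn't exist yet."""
    resp = requests.get(f"{GRAFANA_URL}/api/folders/{uid}", headers=HEADERS)
    if resp.status_code == 404:
        requests.post(
            f"{GRAFANA_URL}/api/folders",
            headers=HEADERS,
            json={"uid": uid, "title": title},
        ).raise_for_status()


def ensure_dashboard(dashboard_json: dict, folder_uid: str) -> None:
    """Upsert a dashboard, keeping its uid stable across every instance."""
    requests.post(
        f"{GRAFANA_URL}/api/dashboards/db",
        headers=HEADERS,
        json={
            "dashboard": dashboard_json,  # the JSON model must carry a fixed "uid"
            "folderUid": folder_uid,      # "folderId" on older Grafana versions
            "overwrite": True,
        },
    ).raise_for_status()
```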

The heart and soul of the stack. We made the sharding aspect of the clusters easily manageable via source control, so that shards, including all shard-specific configuration, can be added or removed quickly when required. One issue we bumped into was service discovery for our cloud provider: since discovery ran per scrape job and we have thousands of instances, we quickly realised this wouldn't scale, and we soon started hitting the rate limit of the provider's API. So we built our own service discovery engine.
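A common way to plug a custom engine into Prometheus, shown here purely as an assumed example, is file-based service discovery, where the engine writes target files that Prometheus watches (job name, file path and labels are illustrative):

```yaml
# Illustrative scrape job consuming targets produced by an external
# discovery engine through Prometheus' file-based service discovery.
scrape_configs:
  - job_name: cloud-instances          # hypothetical job name
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/cloud-instances*.json
        refresh_interval: 1m
```

```json
[
  {
    "targets": ["10.0.0.12:9100", "10.0.0.13:9100"],
    "labels": { "datacenter": "dc1", "environment": "production" }
  }
]
```

With this pattern the provider's API can be queried centrally, once per refresh, instead of once per scrape job.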

Alerting rules are also deployed in each shard via the blueprints. Here's an example snippet:
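The snippet below is an illustrative stand-in showing the shape such a blueprint rule takes; the metric, labels and threshold are made up:

```yaml
groups:
  - name: blueprint-example            # hypothetical group name
    rules:
      - alert: InstanceDown            # hypothetical alert name
        expr: up{job="node"} == 0
        for: 5m
        labels:
          team: infrastructure         # illustrative routing label
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for 5 minutes"
```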

Similar to the dashboards, the alerting rules are agnostic of the datacenter, but notification routes can be tailored per environment.
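Per-environment routing lives in Alertmanager's configuration; a sketch of such a routing tree, with made-up receiver names and label values (receiver definitions omitted), could look like this:

```yaml
route:
  receiver: default-slack              # fallback receiver, illustrative
  routes:
    - match:
        environment: production        # page only for production alerts
      receiver: oncall-pager
    - match:
        environment: staging
      receiver: team-slack             # lower-urgency channel
```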

As we have several teams using this workflow and wanted to provide the best experience possible, decreasing the complexity of the alert creation process was mandatory. To achieve this goal, we abstracted some of the more exotic PromQL queries into bite-sized expressions, for example:
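As a made-up example of how dense an alert expression written directly against raw metrics can get:

```promql
# Error ratio per service over the last 5 minutes, written out in full.
sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
  /
sum by (service) (rate(http_requests_total[5m]))
  > 0.05
```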

Using Prometheus recording rules, the previous alert is converted to:
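Sticking with the illustrative example above, the recording rule (named using Prometheus' level:metric:operations convention) and the simplified alert would look roughly like this:

```yaml
groups:
  - name: service.recording-rules
    rules:
      - record: service:http_error_ratio:rate5m
        expr: |
          sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
            /
          sum by (service) (rate(http_requests_total[5m]))

  - name: service.alerts
    rules:
      - alert: ServiceHighErrorRatio   # hypothetical alert name
        expr: service:http_error_ratio:rate5m > 0.05
        for: 10m
```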

This greatly streamlines the onboarding of new teams, eases the code review process and improves the overall Prometheus server performance.

Here is a high-level view of the deployment method:

All deployments are strictly idempotent, which allows them to be rolled out as many times as required.

Alertmanager is responsible for routing alerts to their destinations. We had an extra requirement: generating reports about all fired alerts so we could better understand what's going on in this globally distributed infrastructure. To fulfil this need, we built a service that pushes every triggered alert to Elasticsearch, making it easy to visualise the history of alerts.
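As a rough sketch of what such a service can look like, assuming a small webhook receiver registered under webhook_configs in Alertmanager that indexes every alert from the webhook payload into Elasticsearch; the service address, index name and field handling are assumptions:

```python
from datetime import datetime, timezone

from elasticsearch import Elasticsearch   # elasticsearch-py 8.x client
from flask import Flask, request

app = Flask(__name__)
es = Elasticsearch("http://elasticsearch.local:9200")  # hypothetical address


@app.route("/alerts", methods=["POST"])
def receive_alerts():
    """Index each alert from Alertmanager's webhook payload."""
    payload = request.get_json()
    for alert in payload.get("alerts", []):
        es.index(
            index="alert-history",                    # hypothetical index name
            document={
                "status": alert.get("status"),        # "firing" or "resolved"
                "labels": alert.get("labels", {}),
                "annotations": alert.get("annotations", {}),
                "starts_at": alert.get("startsAt"),
                "ends_at": alert.get("endsAt"),
                "received_at": datetime.now(timezone.utc).isoformat(),
            },
        )
    return "", 200
```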

Also, to guarantee we find out if Alertmanager is, for any reason, unable to send alerts, we implemented a dead man's switch: an always-firing alert that, if it ever stops firing, gets us paged to investigate.
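The usual shape of such a switch is an alert whose expression is always true, routed to an external system that pages when the signal disappears; the group and alert names below are illustrative:

```yaml
groups:
  - name: meta-monitoring
    rules:
      - alert: AlertingPipelineDeadMansSwitch
        expr: vector(1)                # always evaluates, so the alert always fires
        labels:
          severity: none
        annotations:
          summary: >-
            Always-firing alert; if the external receiver stops seeing it,
            the alerting pipeline itself is broken.
```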

Something that is missing in the above diagram is the exporters. To be completely honest, I've lost count of the number of different exporters we currently have. We actively contribute to some of them and plan to continue doing so. The thriving community around Prometheus is just incredible, and we are proud to be part of it.

The mindset is still changing across Farfetch, but the full-ownership, self-service approach makes it enticing to jump on board. Currently, the ingested metrics from all the deployed stacks add up to over 600K data points per second, and the number keeps growing daily.

It has been quite the ride and it doesn't look like it will slow down any time soon, but we're excited about it and hope to get the opportunity to open-source all the code we've written.

This was a bird's-eye overview of the journey so far, and our sleep pattern has indeed become much better.
