Authors: Mihai Balaci, Valentin Rojco, Douglas Paton
In this blog post, we take a look at how and why we developed the otsdb-protector tool to protect OpenTSDB from query abuse across the Adobe Experience Platform. We figured out what kind of query abuse was taking place and how otsdb-protector helps stop the problem.
We use OpenTSDB, among other databases, for metrics collection and storage in Adobe Experience Platform. This powerful, scalable time-series database provides an ideal storage place for the vast number of metrics that we produce. OpenTSDB is effective because, along with being scalable, its data schema allows for fast aggregation of data.
Figure 1: A simplified look at how OpenTSDB works (source: OpenTSDB.net).
The problem with OpenTSDB
After we implemented OpenTSDB as the backend, long-term metrics store, we found a couple of issues. The first was a limit on what the system could handle in terms of the number of unique IDs, which is driven by metrics cardinality. Once it hit that limit, around 16,777,215 unique IDs, OpenTSDB became laggy and prone to abuse. It wasn't just prone to abuse; it was defenseless against highly intensive queries. There was nothing in place to prevent users from blowing past that limit and flooding the system with queries. This caused latency and high CPU usage that affected the performance of the database for everyone.
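The figure of 16,777,215 matches what you would expect if each unique ID is encoded in 3 bytes, which, as far as we know, is OpenTSDB's default width. A quick check of that arithmetic, with the width stated as an assumption and the variable names purely illustrative:

```python
# OpenTSDB assigns unique IDs (UIDs) to metric names, tag keys, and tag values.
# Assumption: the default UID width of 3 bytes is in use.
UID_WIDTH_BYTES = 3

# Largest value that fits in the assumed width: 2**24 - 1.
max_uids = 2 ** (8 * UID_WIDTH_BYTES) - 1
print(f"{max_uids:,}")
```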
The second issue that we found was that there was no visibility. We knew that users' behavior was driving up CPU usage and causing latency, but we had no idea why it was happening or who or what exactly was causing the problem. OpenTSDB lacked the ability to provide any sort of insight into how the data was being used. It also lacked the ability to control the way users were interacting with the frontend, which is Grafana in our case.
This wasn't an issue specific to OpenTSDB, either. We found that other platforms were experiencing similar problems. We also found that there was a lack of available tools to help mitigate the situation, which is why we built one.
Putting a man in the middle
To see what was happening in the backend that created the problem we were seeing, we needed to create a proxy between OpenTSDB and Grafana.
To do this, we created otsdb-protector, a Python-based man-in-the-middle that allowed us to do two very important things:
Gain visibility into the way people were using Grafana to query OpenTSDB.
Put controls in place to help prevent the query abuse that we had been seeing.
To achieve the first, we initially deployed otsdb-protector in Safe Mode only, to get visibility into the type of traffic generated from the frontend to the backend. Safe Mode was our way to watch what was happening: it let us see how users were interacting with OpenTSDB. The goal with Safe Mode wasn't to stop the abuse; it was to understand it. By running Safe Mode first, we were able to see how users were making so many queries that the system slowed down. We noticed that they were running queries that would last up to 15 minutes. They were constantly clicking refresh on queries. They were running extensive searches back through six months of data. These activities, and others like them, were causing the issues we were seeing.
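To make the idea of an observe-only mode concrete, here is a minimal sketch of how a proxy might evaluate rules while in Safe Mode. It is not otsdb-protector's actual code: the names `Decision`, `evaluate`, and `no_subqueries` are hypothetical, and we assume a rule is simply a function that returns a reason string when it objects to a query.

```python
import logging
from dataclasses import dataclass
from typing import Callable, List, Optional

logger = logging.getLogger("protector_sketch")

# Assumption: a rule inspects a query payload and returns a reason, or None.
Rule = Callable[[dict], Optional[str]]


@dataclass
class Decision:
    allowed: bool
    violations: List[str]


def evaluate(query: dict, rules: List[Rule], safe_mode: bool) -> Decision:
    """Run every rule; in safe mode, log violations but still allow the query."""
    violations = []
    for rule in rules:
        reason = rule(query)
        if reason is not None:
            violations.append(reason)
            logger.warning("rule violated: %s", reason)
    # Safe mode only observes, so the query is forwarded regardless.
    allowed = safe_mode or not violations
    return Decision(allowed=allowed, violations=violations)


def no_subqueries(query: dict) -> Optional[str]:
    """Hypothetical rule: flag a payload that contains no sub-queries."""
    return None if query.get("queries") else "payload has no sub-queries"
```

With `safe_mode=True`, a violating query is still allowed through and the violation is only recorded, which is what makes it possible to study traffic before turning on blocking.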
Once Safe Mode provided us with an understanding of what was happening, we were able to implement blocking rules based on our observations. We used otsdb-protector to limit the number of data points that could be queried at once, to restrict access to older data, and to allow for the creation of blacklists and whitelists.
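As an illustration of the first two controls, the sketch below estimates how many data points a query could return and checks how far back it reaches. It rests on several assumptions: the start and end times are already resolved to Unix seconds, the query is downsampled to a fixed interval, and the limits shown are invented for the example. The function names are hypothetical and not taken from otsdb-protector.

```python
import time
from typing import Optional

MAX_DATA_POINTS = 10_000          # assumed limit; tune per deployment
MAX_AGE_SECONDS = 90 * 24 * 3600  # assumed limit: roughly three months


def estimate_points(start: int, end: int, interval_seconds: int) -> int:
    """Estimate points per series: one point per downsample interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if end <= start:
        return 0
    # Ceiling division without floating point.
    return -(-(end - start) // interval_seconds)


def check_range(start: int, end: int, interval_seconds: int,
                now: Optional[int] = None) -> Optional[str]:
    """Return a rejection reason, or None if the time range is acceptable."""
    now = int(time.time()) if now is None else now
    if now - start > MAX_AGE_SECONDS:
        return "query reaches too far back in time"
    if estimate_points(start, end, interval_seconds) > MAX_DATA_POINTS:
        return "query would return too many data points"
    return None
```

For example, a one-day range at a one-second interval is estimated at 86,400 points per series and would be rejected under the assumed limit, while the same range at a one-minute interval would pass.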
Implementing the otsdb-protector
We wanted otsdb-protector to be a service that is easy to implement, use, and modify. It was designed to run as an independent Python application and is open source, so anyone who uses OpenTSDB has a way to prevent the kind of query abuse that we were seeing on our system.
Follow the steps below to get started with otsdb-protector.
Figure 2: Getting started with otsdb-protector.
Once otsdb-protector is in place, the next step is to set up the rules you want the system to follow. With otsdb-protector you can apply the following rules to prevent:
Queries with no aggregation
Queries with no tags or filters
Querying for very old data
Too many data points per query
Queries that exceed a certain frequency
Queries that exceed a certain execution time
Figure 3: A look at otsdb-protector’s rejected queries in the dashboard.
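Some of the rules listed above can be sketched in a few lines. The example below covers the aggregation and tag/filter checks on a single sub-query, plus a sliding-window frequency limit. It is a sketch under assumptions, not the tool's implementation: the field names `aggregator`, `tags`, and `filters` follow the OpenTSDB HTTP query format as we understand it, treating the `none` aggregator as "no aggregation" is our interpretation, and `check_subquery` and `FrequencyLimiter` are hypothetical names. Execution-time limits are left out because they cannot be judged from the payload alone.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, List, Optional


def check_subquery(sub: dict) -> List[str]:
    """Return the reasons a single sub-query would be rejected (empty if none)."""
    reasons = []
    # Assumption: a missing aggregator or "none" counts as no aggregation.
    if sub.get("aggregator") in (None, "", "none"):
        reasons.append("no aggregation")
    # With neither tags nor filters, every series of the metric is scanned.
    if not sub.get("tags") and not sub.get("filters"):
        reasons.append("no tags or filters")
    return reasons


class FrequencyLimiter:
    """Sliding window: at most `limit` calls per key per `window` seconds."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit
        self.window = window
        self._seen: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        stamps = self._seen[key]
        # Drop timestamps that have fallen out of the window.
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()
        if len(stamps) >= self.limit:
            return False
        stamps.append(now)
        return True
```

The limiter key could be a user name, a dashboard ID, or a hash of the query body; which one makes sense depends on what the proxy can see in each request.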
Users can also put whitelists and blacklists into place to further manage the types of queries that can and can't be made.
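One way such lists could work is to match metric names against patterns before any other rule runs. The sketch below is an assumption about the design, not a description of otsdb-protector: it supposes that lists hold regular expressions over metric names, that a blacklist match always wins, and that a whitelist match skips the remaining rules. The names `compile_patterns` and `classify` are hypothetical.

```python
import re
from typing import Iterable, List, Pattern


def compile_patterns(patterns: Iterable[str]) -> List[Pattern[str]]:
    """Compile list entries once so each lookup is a cheap match."""
    return [re.compile(p) for p in patterns]


def classify(metric: str,
             whitelist: List[Pattern[str]],
             blacklist: List[Pattern[str]]) -> str:
    """Return "block", "allow", or "check" (fall through to normal rules)."""
    # Assumption: the blacklist takes precedence over the whitelist.
    if any(p.fullmatch(metric) for p in blacklist):
        return "block"
    if any(p.fullmatch(metric) for p in whitelist):
        return "allow"
    return "check"


whitelist = compile_patterns([r"sys\.cpu\..*"])
blacklist = compile_patterns([r".*\.debug"])
```

With these example lists, `sys.cpu.user` is allowed, `sys.cpu.debug` is blocked because the blacklist wins, and `app.requests` falls through to the regular rules.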
If it's not yet clear what's causing the issue, running otsdb-protector in Safe Mode provides the visibility necessary to create the rules that are needed.
With a solid foundation laid for otsdb-protector, we're now looking at how we can use it to better serve the community. We plan to create more complex filters and rules for otsdb-protector, and we expect that it will join the Adobe Alerts system at some point.
To help the community at large, we also hope to modify the tool we've developed for OpenTSDB for use with other databases that experience similar levels of query abuse.