Why scaling to a Million QPS was a win for our tech team and proprietary technology

Written by Sam Pegler | Feb 6, 2018 9:57:16 AM

Recently, our bidder hit a million queries-per-second for the first time. This means we can respond to 1 million bid requests every second.

Quite a big deal for a bidder that was restricted to 5000 QPS when we first released it a few years ago.

So how did we do it?

Infectious Media built a demand-side platform (DSP) to give our clients the granular data they wanted, with full transparency and agility. Having built our own technology stack, we were able to automate as much of the heavy lifting as possible, enabling our teams to concentrate on adding value for our clients. Next, we needed to increase the number of queries per second (QPS) we could handle, that is, the number of bid requests sent to our servers each second.

When we first released our bidder, it was limited to under 5000 QPS.

On the back of client wins in the region, we recently extended our reach to the APAC markets. This meant our technology was going to have to scale at a scary rate.

Below, we speak to our Production Engineer, Sam Pegler, about the process we followed.

How did we do it?

We've had multiple iterations of our platform over the years. What worked for 10k QPS worked up to 100k QPS, but definitely doesn't for 1M. We plan for 10x growth, but plan to rewrite before 100x growth.

Most of the main technological changes we've had to make over the years have come from moving from a regional EU business to a global one, and from handling the larger volume of data that's generated as a result.

Our bidder has moved from a mixture of Ruby and Lua to Golang, and most of our ETL has been rewritten from Ruby to a mixture of Python and SQL. The infrastructure behind it, however, has largely remained the same. By moving all state into databases rather than the apps themselves, we have been able to scale by simply increasing the application instance count. Scaling our data sources has been one of the hardest parts: we've had to use more specialist databases for each individual requirement rather than one unified source of truth for everything.
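To illustrate why keeping state out of the application makes scaling a matter of instance count, here is a minimal sketch in Python. It is not our bidder code: `SharedStore` is a hypothetical in-memory stand-in for an external database, and `BidderInstance` and `serve` are names invented for this example.

```python
from itertools import cycle


class SharedStore:
    """Hypothetical in-memory stand-in for an external database."""

    def __init__(self):
        self._counts = {}

    def incr(self, key):
        self._counts[key] = self._counts.get(key, 0) + 1
        return self._counts[key]

    def snapshot(self):
        return dict(self._counts)


class BidderInstance:
    """An application instance that keeps no request state of its own."""

    def __init__(self, store):
        self._store = store

    def handle(self, user_id):
        # All state is read from and written to the shared store.
        return self._store.incr(f"seen:{user_id}")


def serve(instance_count, user_ids):
    """Spread requests round-robin over N instances; return the final state."""
    store = SharedStore()
    instances = cycle([BidderInstance(store) for _ in range(instance_count)])
    for user_id in user_ids:
        next(instances).handle(user_id)
    return store.snapshot()


requests = ["u1", "u2", "u1", "u3", "u1", "u2"]
# The outcome does not depend on how many instances served the traffic.
same_result = serve(1, requests) == serve(8, requests)
```

Because no instance remembers anything between requests, any instance can serve any request, and adding capacity means starting more copies. The cost is that the store itself becomes the thing that has to scale.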

There are a few pieces of technology that are the foundations of what we do. Without these we would have had to create alternatives ourselves.

Kafka is used pretty extensively, with every user event passing through Kafka and into our ETL. It's also used for all of our application logging and metrics. It has allowed us to easily centralise everything in one place: if you're unsure of anything, check Kafka, as that's where all data comes from.
Redis is currently used as both a source of truth and a temporary cache in multiple places. We like it for its ease of use, its great performance and its remote data structures.
BigQuery is our analytical data store. It scales well into the tens of terabytes (assuming your data is relatively flat) and has allowed us to scale with no effort. The only real difficulty we've had is limiting how frequently we reload data once tables grow into the terabytes.

But what about quality?

We've put in significant effort to remove everything that we're either not targeting or that doesn't offer the performance and quality we're looking for. Any event that comes in that we're not going to bid on costs money, both in compute time and in the network bandwidth used to respond. We heavily pre-filter all bids at the exchange side to limit our total incoming QPS to just what we're interested in.
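As a rough illustration of what a pre-filter amounts to, the sketch below drops requests that can never match current targeting before any bidding work is done. The fields (`country`, `ad_format`, `domain`) and the `PreFilter` class are assumptions made for this example; real exchange-side filters are configured through each exchange's own tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BidRequest:
    # Hypothetical fields, chosen only for illustration.
    country: str
    ad_format: str
    domain: str


@dataclass(frozen=True)
class PreFilter:
    countries: frozenset
    ad_formats: frozenset
    blocked_domains: frozenset

    def accepts(self, req):
        """True only if the request could match something we target."""
        return (
            req.country in self.countries
            and req.ad_format in self.ad_formats
            and req.domain not in self.blocked_domains
        )


def apply_prefilter(prefilter, requests):
    """Keep only the requests worth spending compute and bandwidth on."""
    return [r for r in requests if prefilter.accepts(r)]


prefilter = PreFilter(
    countries=frozenset({"GB", "SG"}),
    ad_formats=frozenset({"banner"}),
    blocked_domains=frozenset({"lowquality.example"}),
)
incoming = [
    BidRequest("GB", "banner", "news.example"),
    BidRequest("US", "banner", "news.example"),        # country not targeted
    BidRequest("SG", "video", "news.example"),         # format not targeted
    BidRequest("SG", "banner", "lowquality.example"),  # blocked domain
]
kept = apply_prefilter(prefilter, incoming)
```

Here only the first request survives, so the bidder would see a quarter of the original traffic.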

We regularly audit our incoming inventory, and exclude or heavily limit any that we're not currently purchasing. Any inventory that we see having a very high win rate we look into scaling up. This feedback loop enables us to listen only to what we require, while keeping a wide selection available.
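The audit loop can be sketched as a simple classification over per-source counters. This is an illustration under assumptions, not our actual process: the tuple layout, the `audit_inventory` name and the 0.5 win-rate threshold are all invented for the example.

```python
def audit_inventory(stats, scale_up_win_rate=0.5):
    """Suggest an action per inventory source.

    stats maps a source name to (requests_seen, bids, wins).
    The threshold is an illustrative assumption, not a recommended value.
    """
    actions = {}
    for source, (requests_seen, bids, wins) in stats.items():
        if bids == 0:
            # We receive this inventory but never bid on it: limit or exclude.
            actions[source] = "limit"
        elif wins / bids >= scale_up_win_rate:
            # Very high win rate: a candidate for listening to more of it.
            actions[source] = "scale_up"
        else:
            actions[source] = "keep"
    return actions


sample_stats = {
    "exchange-a/display": (1_000_000, 0, 0),
    "exchange-a/mobile": (200_000, 50_000, 30_000),
    "exchange-b/display": (500_000, 100_000, 10_000),
}
suggested = audit_inventory(sample_stats)
```

In this sample, the source with no bids is flagged for limiting, the source winning 60% of its bids is flagged for scaling up, and the rest is left alone.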

Success

Recently we passed the million auction queries per second (QPS) point. Each auction query requires us to enrich the event with any data we have, such as user history, page type, location data and weather data. All of this has to be done in under 50ms for every unique request. Compared to traditional publishing, where pages can be cached at the edge, real-time bidding is much more demanding.
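One way to picture enrichment under a fixed time budget is to run the lookups concurrently and keep whatever has arrived when the deadline passes. The sketch below uses Python's `asyncio` and is only an assumption about how such a step could look: the post does not describe how late lookups are handled, and `enrich`, `page_type` and `slow_weather` are hypothetical names.

```python
import asyncio


async def enrich(event, lookups, budget_s=0.050):
    """Run all lookups concurrently; keep results that finish within the budget."""
    tasks = {name: asyncio.create_task(fn(event)) for name, fn in lookups.items()}
    done, pending = await asyncio.wait(tasks.values(), timeout=budget_s)
    for task in pending:
        task.cancel()  # too slow for this auction, so give up on it
    enriched = dict(event)
    for name, task in tasks.items():
        if task in done and task.exception() is None:
            enriched[name] = task.result()
    return enriched


async def page_type(event):
    # Hypothetical fast lookup.
    return "article"


async def slow_weather(event):
    # Hypothetical lookup that is far slower than the budget.
    await asyncio.sleep(1.0)
    return "rain"


result = asyncio.run(
    enrich({"request_id": 1}, {"page_type": page_type, "weather": slow_weather})
)
```

In this run the event comes back with `page_type` attached and without `weather`, because the weather lookup missed the 50ms budget. Bidding on partial data rather than missing the auction is a design choice made for the sketch, not something stated above.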

For everyone in the engineering team at Infectious Media this is pretty exciting as it's been nearly four years in the making.

"Although QPS is not always the best measure for bidder capability, the fact that a relatively small company like ours can start to attain the throughput levels that the big DSPs maintain is an enormous achievement, especially when considering the relative levels of investment. This says a lot about the benefits of cloud computing and open source and, as a technologist, makes me even more excited for what the future may hold." Dan de Sybel, Chief Technology Officer, Infectious Media

What's next?

We'll continue to rewrite parts of our stack as we reach inflection points. Likely targets are moving some of our ETL pipeline to a JVM language and moving from MariaDB to a more scalable OLTP store.

As the industry becomes more mobile-focused, we're expanding our offering. To do that, we're likely to start peering with another exchange over the next 12 months. We're well placed for any expansion, with global coverage and an easily scalable base.

Production Engineer, Sam Pegler, Infectious Media
