How to Optimize Conversion Matching With Cold Storage
The following post was written by Andrew Djoga. Andrew is a software engineer specialized in backend stack development and data processing, with experience in building scalable, fault-tolerant distributed systems that deal with large data volumes.
Millions of users across the globe click on ads we serve, on many different devices, and install apps, generating conversions through our system. We are constantly looking for opportunities to improve our reporting, and the conversion matching pipeline in particular.
On that front, we recently introduced a hot/cold data split strategy to cut costs at the conversion-to-click matching stage of the conversions pipeline. This post explains how we built the data infrastructure that supports this pipeline and what changed when we introduced the split strategy.
PubNative embraces a service-oriented architecture composed of many encapsulated deployment entities, each doing one thing in its own optimised way. The conversions pipeline relies on two parallel data streams, generated by different services:
- Click events happen when a user sees an advertisement and clicks on it, taking them to the app store page on their phone.
- Install events occur when a user installs the app; this information reaches our system through our demand partners.
Our publishers need reporting computed in near real time, which presents us with the challenge of building scalable and resilient data pipelines. For this reason, we chose Apache Kafka as our main messaging system, as it is proven to handle a massive volume of events.
We build our stream processors with Apache Spark, an engine for large-scale data processing. With Spark Streaming, we read the click and install events from Apache Kafka and merge them to produce conversions. This is a very common use case in stream processing: enriching an incoming stream with side data.
To avoid merging parallel incoming streams, one can always embed all the side data into the first stream. However, this approach is likely to produce unnecessarily large events. In our case, it would mean the ad links contain all this side data, which leads to longer URLs and, as a result, lower conversion rates due to truncated links and other related problems.
It’s also possible to join two streams with Spark Streaming if your windowed data fits into memory, but when the time gap between related events can be measured in days, such a join is not an option.
Our first implementation of joining installs with clicks was based on the knowledge that the conversion rate is relatively low, which pushes you to store clicks in a key-value store for quick lookups in order to achieve near real-time processing.
With this design, we pull yet-unprocessed installs stored in an RDBMS and find the corresponding clicks (the side data) in AWS DynamoDB with a fast lookup. We chose DynamoDB as the key-value store because of its eventual consistency model and the managed infrastructure features that come out of the box, such as auto scaling, the provisioned-throughput model and highly scalable reads and writes. Eventual consistency suits us well because we keep processing an install until we find its side data.
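The lookup loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our production code: the `Install` and `Click` types, the `click_id` join key and the `match_pending` function are hypothetical names, and a plain dictionary stands in for the DynamoDB table.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Install:
    install_id: str
    click_id: str  # hypothetical join key carried by the partner postback


@dataclass(frozen=True)
class Click:
    click_id: str
    campaign_id: str


def match_pending(
    pending: List[Install], click_store: Dict[str, Click]
) -> Tuple[List[Tuple[Install, Click]], List[Install]]:
    """Try to match each pending install with its click.

    `click_store` stands in for the key-value store (a dict here).
    Installs whose click is not visible yet stay pending and are retried
    on the next run, which is why eventually consistent reads are enough.
    """
    conversions: List[Tuple[Install, Click]] = []
    still_pending: List[Install] = []
    for install in pending:
        click = click_store.get(install.click_id)
        if click is None:
            still_pending.append(install)
        else:
            conversions.append((install, click))
    return conversions, still_pending
```

An install that finds no click is simply returned in the second list and picked up again on the next iteration, so a read that misses a freshly written click only delays the match instead of losing it.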
The typical conversion rate for a CPI (cost per install) campaign is less than 1%, which means more than 99% of the clicks in DynamoDB are never looked up because they did not lead to installs. Once your product grows big enough, DynamoDB becomes quite an expensive part of your pipeline.
Once we reached 1 TB of clicks a day, we started thinking about how to cut costs and avoid paying for all the clicks we would never match. The difficulty is that we need to be able to look up clicks for at least a month, to cover installs that arrive in our system much later than expected. One reason for such a time gap is the way some of our demand partners fire postbacks to us.
After some investigation, we found that 95% of installs registered in our system happen within a 24-hour window after the click. This led us to split the data between hot and cold storage.
Such a strategy keeps frequently accessed data (hot data) on fast storage and less frequently accessed data (cold data) on slower storage. In our case, it means we set a TTL (time-to-live) of 24 hours on our hot storage DynamoDB table, thereby reducing the amount of data stored and consequently the costs. But we still need to process the remaining 5% of installs by finding the corresponding clicks on S3 (cold storage), where batches of our events are stored in their raw, incoming format.
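To make the TTL idea concrete, here is a minimal sketch of how a click record could carry its own expiry time. DynamoDB's TTL feature expects a Number attribute holding a Unix epoch timestamp in seconds; everything else here (the `click_item` function and the attribute names `click_id`, `payload`, `clicked_at` and `expires_at`) is a hypothetical layout chosen for illustration, and no call to AWS is made.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict

HOT_TTL = timedelta(hours=24)  # hot-storage window described in the text


def click_item(click_id: str, payload: str, clicked_at: datetime) -> Dict[str, Any]:
    """Build a click record that expires 24 hours after the click.

    `clicked_at` is assumed to be timezone-aware. `expires_at` is the
    attribute that would be configured as the table's TTL attribute
    (epoch seconds stored as a number). Attribute names are hypothetical.
    """
    expires_at = clicked_at + HOT_TTL
    return {
        "click_id": click_id,
        "payload": payload,
        "clicked_at": int(clicked_at.timestamp()),
        "expires_at": int(expires_at.timestamp()),
    }
```

Note that DynamoDB removes expired items in the background rather than at the exact expiry second, so a reader that must respect the 24-hour boundary strictly would still compare `expires_at` with the current time.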
We use Amazon S3 object storage as a cold store for all our pipelines, including reporting, where we prepare data in Protocol Buffers and Apache Parquet formats to be indexed by time series databases like Druid for our dashboard and queried by distributed SQL engines like Presto for business intelligence needs.
This meant we already had cold storage we could reuse for this pipeline, with the click events already in place. The cold matching job groups, per hour, all yet-unprocessed installs that arrived more than 24 hours after the click, and tries to find their clicks on S3, where data is already partitioned by hour. The job runs every hour, once it has accumulated sufficient data to process.
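The grouping step of the cold matching job might look like the sketch below. It rests on assumptions that are not spelled out in this post: that each pending install carries the timestamp of its click, and that hourly partitions on S3 follow a `clicks/YYYY/MM/DD/HH/` key layout. Both the layout and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, List, Tuple

# (click_id, clicked_at, installed_at), timestamps assumed timezone-aware
PendingInstall = Tuple[str, datetime, datetime]


def hour_prefix(ts: datetime) -> str:
    """Return the hypothetical S3 key prefix of the hourly partition for `ts`."""
    return ts.astimezone(timezone.utc).strftime("clicks/%Y/%m/%d/%H/")


def group_for_cold_lookup(
    pending: Iterable[PendingInstall],
    hot_window: timedelta = timedelta(hours=24),
) -> Dict[str, List[str]]:
    """Group late installs by the hourly partition that holds their click.

    Installs that arrived within the hot window are skipped, since the
    hot store is expected to serve them. Each partition then needs to be
    read only once for all the click ids grouped under it.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for click_id, clicked_at, installed_at in pending:
        if installed_at - clicked_at > hot_window:
            groups[hour_prefix(clicked_at)].append(click_id)
    return dict(groups)
```

Grouping by partition first means that each hourly batch of raw click events is scanned at most once per run, however many late installs point into it.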
By introducing a hot/cold strategy, we reduced costs significantly without any impact on our reporting. While the throughput remains the same, the storage component of the bill was reduced by an order of magnitude.
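A back-of-the-envelope calculation shows where an order-of-magnitude saving on storage can come from. It uses the figures mentioned in this post (about 1 TB of clicks a day, a lookup horizon of roughly a month, a 24-hour TTL) and assumes a constant daily volume; it ignores index overhead and the delay before expired items are actually removed, so treat it as an estimate rather than a billing figure.

```python
DAILY_CLICKS_TB = 1.0  # click volume per day, from the text
RETENTION_DAYS = 30    # "a month at the least", approximated as 30 days
HOT_TTL_DAYS = 1       # 24-hour TTL on the hot table


def steady_state_tb(daily_tb: float, retention_days: int) -> float:
    """Data held at steady state if every day's clicks live `retention_days`."""
    return daily_tb * retention_days


before_tb = steady_state_tb(DAILY_CLICKS_TB, RETENTION_DAYS)  # all clicks kept hot
after_tb = steady_state_tb(DAILY_CLICKS_TB, HOT_TTL_DAYS)     # only the last day kept hot
reduction_factor = before_tb / after_tb

print(f"hot storage before: {before_tb} TB, after: {after_tb} TB, "
      f"reduction: {reduction_factor}x")
```

Under these assumptions the hot table shrinks from about 30 TB to about 1 TB, while write throughput is unchanged because every click is still written once.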