Meetrics Data Blog

Why Meetrics Doesn’t Sample

Written by Jonas Kuhnle, Product Owner Data Platform

First of all, what is sampling? Sampling is the process of drawing samples in order to obtain representative information about the composition of a given set. It is used when it is not practical or not possible to survey the entire population. The opposite approach is the census: governments carry out censuses at regular intervals in order to obtain information on the population of their countries.

Fun Fact – The beginning of the biblical Christmas story is a census:

“And it came to pass in those days that a decree went out from Caesar Augustus that all the world should be registered. This census first took place while Quirinius was governing Syria. So all went to be registered, everyone to his own city.” (Luke 2:1-3, NKJV)

Nowadays many censuses are carried out on the basis of a sample. The advantage of a sample-based census is obvious: there is no need to go around and ask everyone. Meetrics faces the same problem of population analysis. For Viewability, Fraud and Brand Safety, we have the ability to analyze the entire population. For Audience, we use our own microcensus data. This data, which forms the basis for Audience verification, is provided through collaboration with selected, high-quality panel providers. For each individual impression, we check whether we can identify a panelist from our database. The data is then aggregated and displayed to the client.

The story for our other products is a little different, because at Meetrics we do have the ability to measure the viewability and potential invalid traffic of every impression. The challenge is that all this data must not only be processed, but must also be available within seconds and aggregated on the fly. Storage on slow media such as conventional hard disks is therefore not an option. Faster storage, in turn, increases the price of storing the data and thus the incentive to store only samples of it. At Meetrics we process all incoming information: billions of impressions, amounting to thousands of gigabytes. Our data fills terabytes of space, with one terabyte corresponding to about 1357 CDs. This raises the question of why Meetrics still does not sample the data that is collected every second of the day.
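As a quick sanity check on the CD comparison, the arithmetic can be reproduced in a few lines of Python. This is only a sketch: the decimal terabyte (10^12 bytes) and the CD capacity of 737 MB are assumptions chosen because they reproduce the figure above; the original comparison does not state which capacity was used.

```python
# Assumption: decimal units, 1 terabyte = 10**12 bytes.
TERABYTE_BYTES = 10**12

# Assumption: one CD holds roughly 737 MB (an 80-minute disc).
CD_BYTES = 737 * 10**6

# Number of CDs needed to hold one terabyte.
cds_per_terabyte = TERABYTE_BYTES / CD_BYTES

print(f"CDs per terabyte: {cds_per_terabyte:.1f}")
print(f"Rounded: {round(cds_per_terabyte)}")
```

With a slightly different capacity assumption the result shifts by a handful of discs, so the number is best read as an order of magnitude.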

The answer is accuracy. We are fully committed to delivering the highest possible accuracy, and being sure of only about 90% is simply not enough. This commitment matters even more for our invalid traffic product. One difficulty with sampling here is that it is questionable whether invalid traffic follows a Gaussian distribution, an assumption that common sampling error estimates rely on.

Let’s take a look at the data that we have for total invalid traffic over time:

[Figure: cases_over_time_adjusted, total invalid traffic over time with the sampled approximation as a red line]

The chart shows how total invalid traffic is distributed over time. The red line shows the approximation you get when you sample the data: it overlooks both peaks and low points. A complete picture can only be achieved by looking at all data points.

In addition we can look into the distribution of individual invalid traffic indicators: 

[Figure: cases_overTime, individual invalid traffic indicators over the day]

This shows in all its beauty how invalid traffic is distributed over the day, and that some indicators peak while others reach a low. It also points to a much more fundamental problem in collecting invalid traffic data: invalid traffic is not distributed randomly across Internet traffic. It spikes and clusters whenever fraudsters generate it.
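The effect of clustering can be sketched with a small synthetic example. The Python snippet below is only an illustration: the traffic numbers, the five-minute burst and the one-minute-in-ten sampling scheme are assumptions made up for this sketch, not Meetrics data or a description of any real sampling method.

```python
# Synthetic example: every number here is invented for illustration.
MINUTES_PER_DAY = 24 * 60
BASELINE = 2                     # invalid impressions in an ordinary minute
BURST = 500                      # invalid impressions per minute in the burst
BURST_MINUTES = range(600, 605)  # five-minute burst starting at 10:00
STEP = 10                        # hypothetical scheme: sample 1 minute in 10


def invalid_per_minute():
    """Full census: one invalid-traffic count for every minute of the day."""
    return [BURST if minute in BURST_MINUTES else BASELINE
            for minute in range(MINUTES_PER_DAY)]


def sampled_estimate(counts, offset, step=STEP):
    """Estimate the daily total from every `step`-th minute, scaled up."""
    return sum(counts[offset::step]) * step


counts = invalid_per_minute()
true_total = sum(counts)

# Try every possible starting minute for the sampling scheme.
estimates = [sampled_estimate(counts, offset) for offset in range(STEP)]

print("true total:", true_total)
for offset, estimate in enumerate(estimates):
    print(f"offset {offset}: estimate {estimate}, "
          f"error {estimate - true_total:+d}")
```

With these made-up numbers the full census counts 5370 invalid impressions, while every possible sampling offset lands on either 7860 or 2880, depending on whether the sampled minutes happen to hit the burst. None of the sampled estimates matches the true total.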

In this example, you can see that sampling would not be sufficient to ensure a high level of accuracy. To obtain precise totals without losing the representativeness of the data, it is always advisable to carry out a complete census whenever possible.

 

That is why Meetrics doesn’t sample its viewability and invalid traffic data.