Have you ever needed to count the number of distinct elements in a really large collection? I’ll walk you through some of the options you have.
The Naive Solution
This function uses a build-in
Set implementation which stores all unique items you add to it. To count the number of unique elements, we need to remember all of the user ids! If we use UUID format for our user Ids, we’d have to store ids of the form
3cc627d7-37d9-4007-a17f-9362300b76b0 which has a length of 36. Storing a billion user ids would cost us 36 billion bytes if we are lucky (most probably even more) which is about 36 GB of data. If we used a 32bit integer as a user id, we would need 4 GB of data. Which is still a lot!
Isn’t there a better way? We need to answer another question first: is it necessary to count it properly? What if we just estimated it? “Number of unique users” won’t ever be a precise number because we cannot track users properly (thankfully). Two people can use the same computer and one person can use two different computers making the user tracking hard. If the raw data are not precise, we don’t need to use a precise algorithm.
So can we somehow estimate number of distinct users?
Designing an Algorithm For Estimating Unique Elements
Let’s try to design such an algorithm. The best way how to understand the algorithm is to design it from scratch.
Imagine two random sequences containing unique numbers between 0 and 100. One sequence contains two numbers, while the other one contains twenty numbers. But you don’t know which sequence is which. The only thing you know that the lowest (minimum) number in sequence
S1 is 3 while the lowest number in sequence
S2 is 42. So
min(S2)=42. The question for you is: what sequence is more likely to contain twenty elements?
Obviously, we cannot be sure. But it is unlikely that someone has chosen twenty numbers by random between 0 and 100 and the lowest one was 42. See for yourself! You can use the online generator to generate twenty random numbers and find the lowest one. I bet it’s going to be lower than 42 every time you try it.
If someone chose twenty random numbers, it’s likely that they also chose some small numbers such as 3. So we can assume the sequence
S1 contains twenty numbers while the sequence
S2 contains just two numbers. Suddenly, we can somehow predict the size of a sequence by knowing the lowest number! And we think that the smaller lowest number is, the bigger the sequence.
From Assumption to the Algorithm
That is a good assumption. But we need to convert it to a real algorithm so we can actually estimate the count. We need another assumption to make. Let’s say the lowest number in another sequence containing numbers between 0 and 100 is ten. So
min(S3)=10. How do other numbers in the sequence look like? We don’t know! But we can assume the perfect distribution. Let’s hope for the best and we can assume the numbers look like this:
S3=10, 20, 30, 40, 50, 60, 70, 80, 90
Or, if you prefer pictures:
We cannot be sure, but if the numbers were distributed perfectly, this would be the result. Suddenly we see the number of unique numbers – there are nine numbers. Can we derive a pattern from it? First, we compute the number of gaps between numbers, the number of blue segments. That is
100/10=10; there are ten blue segments. Then we subtract 1 to get the number of numbers between those gaps. The final formula is
100/min(S) - 1 which in our case equals nine unique numbers.
The formula holds for every upper bound we choose. If we generate random numbers between 0 and 2000, we update the formula to
2000/min(S) - 1. For the sake of simplicity, let’s assume we are generating numbers between 0 a 1. Hence the formula is
1/min(S) - 1. We can even remove the “subtract one” part because let’s be honest: if we count the number of unique users visiting Facebook each month, the “subtract one” does not make it more precise. There is little difference between numbers 2,506,687,153 and 2,506,687,152. Thus our final formula for estimating the number of unique numbers in a sequence is
1/min(s). We can try it:
This function generates
numberOfUniqueElements unique elements, finds the lowest number, and then it estimates the number of unique elements. This function is of course stupid because it takes the number of unique elements as an input, it is just for educational purposes. You can run this function. Ideally, the function should return the same number as we give it to it. These are five outputs from my test runs:
[LOG]: 13298973.04699654 [LOG]: 3908771.0843529785 [LOG]: 1839953.7245636007 [LOG]: 686048.8616700358 [LOG]: 2330679.422359164
Well, the correct answer was one million. So… it works somehow. It does not give us a completely different number like 415. But it also does not give us anything really close to one million. We need to go deeper.
Making it More Precise
What is the main problem with our current algorithm? Relying on a single value. There are plenty of different sequences containing different amount of numbers but having the same lowest number.
S4=2, 7, 8, 15, 27, 37, 40, 59, 60, 71, 88 S5=2, 75
Our algorithm returns the same estimate for both sequences. That is not good. Using a single value to guess the total number of distinct numbers won’t cut it. We need to use more of them.
We could try to also compute the highest number, the maximum, and use it in our formula. In the perfect world, the distance between the number zero and the lowest number is the same as the distance between the highest number and number one. E. g. if the smallest number is 0.1 and we are choosing numbers between 0 a 1, we would get this perfect distribution:
As you can see, the blue segments are of the same length. We can compute the length of the second blue segment as
1 - 0.9. We can take the length of those two segments, compute an average, and this way we get the “average” size of a gap between numbers. In the perfect world, the average size of a gap is equal to the lowest number – because the gap between the lowest number and the number zero is the lowest number itself. Thus we can use the average gap between numbers in our formula instead of the lowest number.
The full formula would be:
1 / ((minimum + (1 - maximum)) / 2). In the code:
You can try to run this updated function. The first five runs give me these results:
[LOG]: 1198360.2729259618 [LOG]: 606477.0234830821 [LOG]: 2658856.70054554 [LOG]: 670156.4024120832 [LOG]: 1299016.1390871936
It’s getting better! But not good enough. We were able to find two significant numbers of the sequence to use it for our estimate. Can we somehow tweak it so we can use an arbitrary number of numbers from the sequence? And then get arbitrary precision?
So far we have been trying to find the average gap between numbers in our sequence to estimate the number of distinct values. But we used only two numbers for it: minimum and maximum. Why not use the four lowest numbers, compute the average distance between them, and use it to count our estimate? Take a look at the picture:
We don’t need to compute maximum to get more significant numbers. It’s enough to take four lowest numbers, compute the distance between each other and then compute the average distance. And as we already know, average distance can be used for estimating. Let’s implement it:
In this function, you can choose the precision (try it). The greater
significantNumbers, the more precise estimate. If you choose
significantNumbers=1 you get the original algorithm computing the estimate just from the lowest number. My first five runs with
[LOG]: 965960.528836764 [LOG]: 882078.2342643808 [LOG]: 1351540.2292114098 [LOG]: 1068408.6166828268 [LOG]: 1117663.4401512167
Not bad, right? The worst estimate was off 35 %. First run with
[LOG]: 1061606.4830280414 [LOG]: 1046807.5574152572 [LOG]: 1004813.8212502197 [LOG]: 1013662.7440249798 [LOG]: 1021627.729578065
Every result was within 10 % error rate. Not bad for a 15-lines-long algorithm that we just implemented out of nowhere I’d say! What are some interesting properties of our algorithm?
- It doesn’t matter how many times the number is repeated in the sequence. That is the reason why it estimates the number of unique elements.
- We can choose the precision by choosing the number of minimums to track and store. The more minimums we store, the precision we get.
- We can compute intersections easily. Say we had 5000 unique users on Saturday and 5000 on Sunday. How many unique users had we had in those two days combined? We cannot simply sum those two numbers because some visitors may come on both Saturday and Sunday and some of them came just on Saturday or just on Sunday. But if we know the minimums for Saturday and Sunday we can estimate numbers for both days. E. g. let’s say we track four minimums. The mins for Saturday are 0.1, 0.15, 0.2 and 0.3 while for Sunday we have 0.16, 0.19, 0,3 and 0.32. we put the numbers together and we compute the new four minimums. We get 0.1, 0.15, 0.16 and 0.19. We use these for numbers to estimate the number of unique users for the whole weekend.
Say we have a website about cars and we estimated we had 10,000 unique visitors on Saturday and 15,000 on Sunday. How many visitors did we have for the whole weekend? (And by the weekend I mean Saturday and Sunday of course.) We cannot simply sum those two numbers, we cannot be sure we had 25,000 visitors. Because there could be duplicates. Assume someone wanting to buy a new Škoda Octavia. Sure they will read review the whole weekend! Both Saturday and Sunday! We need to estimate a new number of unique visitors. There are two ways how to do it:
- Merge all user IDs from Saturday and Sunday, compute the minimums, and then count a new estimate.
- Remember the minimums from Saturday and Sunday and count the estimate using just these minimums.
The first approach is feasible but not very effective. The second approach is much faster. Say we used four minimums to count our estimates for Saturday and Sunday. So we already have the lowest numbers for both days:
min(Sat)=0.1, 0.15, 0.2, 0.3 min(sun)=0.16, 0.19, 0,3, 0.32
We don’t need to merge all the numbers for both days. It’s sufficient to merge just the minimums to get a new sequence:
SS=0.1, 0.15, 0.2, 0.3, 0.16, 0.19, 0,3, 0.32
Now we compute the for loest numbers again:
min(SS)=0.1, 0.15, 0.16, 0.19
And we can use these minimums to correctly estimate the number of unique users for the whole weekend.
Dealing With Real Data
So far we’ve been dealing with random numbers. The previous algorithm can estimate the number of distinct numbers in a sequence of random numbers between 0 and 1. But in the real world, we have different data. E. g. user identificators in UUID format, i. e. 19b4ca50-b9b4-43ea-8511-b46953c2bcbb. How to deal with it?
We use some hash function to convert string-based values to a number. Hash functions are often used in cryptography but we don’t need any cryptographic hash functions such as SHA-3. Any non-cryptographic hash functions would do it, e. g. MurmurHash. You can try the MurmurHash algorithm online. Write some words and it converts the word into a number. The hash function tries to distribute the numbers uniformly. If we convert all of our input values into its hash values, we get a more or less uniformly distributed pseudorandom numbers. We can use those numbers as input to our algorithm. Take a look at the code:
First, we generate unique user IDs with the help of
uuidv4 function. Then we convert them to a number by the
murmurhash2_32_gc function and we normalize it by dividing it by 232 to make the number fit between 0 and 1. Then, the algorithm continues as always. In the real world, we would obviously get a stream of user ids, we wouldn’t generate on the fly. Try it online!
We’ve implemented our first algorithm for estimating the number of distinct elements. But this is just the beginning, the proposed algorithm is not good enough in the real life. We will continue next time with the LogLog algorithm.