Saturday, August 13, 2011

A bit of randomness

A colleague of mine asked a question that seemed trivial, but then it revealed interesting layers of complexity: how would you build an algorithm for a random number in any integer interval assuming that you already have a function that returns a random binary bit? The distribution of the bit is perfectly random and so it should be that of your function.

My first attempt was to divide the interval in two, then choose the first or second half based on the random bit function. This worked perfectly for intervals of even length, but there were issues with odd sized intervals. Let's take the most basic version there is: we want a random number between 7 and 9. The interval has a size of 3, which is not divisible by 2.

One solution is to split it in half anyway, ignoring one number, then use the random bit function one more time to determine in which half the remaining number should be added. For example the random bit yields 1, so we add the odd number to the second half: 7,8,9 -> 7 and 8,9 . Now the random bit is 0, thus choosing the first half, which is 7. This sounds good enough, let's see how this works:

Possible random bit results:
  • 0 (7,8|9)
    • 0 (7|8)
      • 0 (=7)
      • 1 (=8)
    • 1 (=9)
  • 1 (7|8,9)
    • 0 (=7)
    • 1 (8|9)
      • 0 (=8)
      • 1 (=9)

The interesting part is coming when deciding (pun not intended) what type of probability we would consider. From the tree above, if we take the terminal leafs and count them, there are exactly 6. Each of the numbers in the interval appear exactly twice. There is a perfectly balanced probability that a number will appear in the leaf nodes. But if we decide that each random bit run divides the total probability by two, then we have a 50% chance for 0 or 1 and thus the probability that 7 would be chosen is 1/4 + 1/8 (3/8), the same for 9, but then 8 would have a 2/8 probability to be chosen, so not so perfect.

What is the correct way to compute it? As I see it, the terminal graph leaf way is the external method, the algorithm can end in just 6 possible states and an external observer would not care about the inner workings of the algorithm; the second is an internal view of the use of the "coin toss" function inside the algorithm. The methods could be reconciled by continuing the algorithm even when the function has terminated, until all the possible paths have the same length, something akin to splitting 7 in two 7 nodes, for example, so that the probability would be computed between all the 2 to the power of the maximum tree height options. If the random bit yielded 0, then 0, we still toss the coin to get 000 and 001; now there are 8 terminal nodes and they are divided in 3,2,3 nodes per numbers in the interval. But if we force this method, then we will never get a result. No power of two can be equally divided by 3.

Then I came with another algorithm. What if we could divide even an odd number in two, by multiplying it with two? So instead of solving for 7,8,9 what if we could solve it for 7,7,8,8,9,9 ? Now things become interesting because even for a small finite interval length like 3, the algorithm does not have a deterministic running length. Let's run it again:

Possible random bit results:
  • 0 (7,7,8)
    • 0 (7,7,7)
    • 1 (7,8,8)
      • 0 (7,7,8)... and so on
      • 1 (8,8,8)
  • 1 (8,9,9)
    • 0 (8,8,9)
      • 0 (8,8,8)
      • 1 (8,9,9)... and so on
    • 1 (9,9,9)

As you can see, the tree looks similar, but the algorithm never truly completes. There are always exactly two possibilities in each step that the algorithm will continue. Now, the algorithm does end most of the time, with a probability to end increasing exponentially with each step, but its maximum theoretical length is infinity. We are getting into Cantoresque sets of infinite numbers and we want to calculate what is the probability that a random infinite number would be part of one set or another. Ugh!

And even so, for the small example above, it does seem that the probability for each number is 25%, while there is another 25% chance to continue the algorithm, but if you look at the previous stage you have a 25% chance for 7 or 9, but no chance for 8 at all. If we arbitrarily stop in the middle of the algorithm, not only does it invalidate the result, but also makes no sense to compute any probability.

You can look at it another way: this new algorithm is splitting probability in three equal integer parts, then it throws the rest into the future. It is a funny way of using time and space equivalence, as we are trading interval space for time. (See the third and last algorithm in the post)

My conclusion is that the internal method of computing the probability of the result was flawed. As a black box operator of the algorithm I don't really care how it spews its output, only that it does so with an as perfect probability as possible (pun, again, not intended). That means that if I use the algorithm two times there is no way it can output equals amounts of three values. The probability can't be computed like that. If we use it a million times we would expect a rough 333333 times of each value, but still one would be off one side or another. So the two algorithms are just as good.

Also, some people might ask: how can you possible use the second algorithm for large intervals. You are not going to work with arrays of millions of items for million size intervals, are you? In fact, you only need five values for the algorithm: the limits of the interval (a and b), the amount of lower edge values (p), the amount for the higher edge (r), then the amount for any number in between (q). Example: 7778888888899999 a=7, b=9, p=3, q=8, r=5 . You split this in two and (for the coin toss of 0) you get 7778888 a=7, b=8, p=3, q=1 (don't care at this point), r=4. The next step of the algorithm you multiply by two p,q and r and you go on until a=b.

You can consider a simpler version though: there are three values in the interval so we need at least a number equal or bigger than three that is also a power of two. That means four, two coin tosses. If the coin toss is 00, the result is 7; if the coin toss is 01, the result is 8; for 10, the result is 9. What happens when you get 11? Well, you run the algorithm again.