The Coupon Collector Challenge: A Mathematical Exploration
Written on
Introduction
Imagine a scenario where your preferred cereal brand is hosting a contest. Each box contains a coupon marked with a unique integer ranging from 1 to N. To win, you must gather all coupons numbered from 1 to N. Each box you purchase contains a coupon with the number i with a probability of 1/N, independently of the other boxes.
The pressing question is:
How many boxes must you purchase to have a reasonable chance of winning?
This dilemma is widely recognized as the Coupon Collector problem, which we will explore in this article.
Expected Number of Boxes to Buy for a Win
Given the element of chance, our first inquiry concerns the average number of boxes required to achieve victory.
To determine this expectation, we define the random variable X_i for each i = 1, …, N. The variable X_i represents the number of boxes needed to obtain one distinct coupon. Initially, we have X_1 = 1, as our first box will always yield a new coupon. Next, X_2 denotes the additional boxes required to obtain a second distinct coupon, excluding those already acquired. This pattern continues, with each X_i reflecting the boxes necessary to gather i distinct coupons.
To illustrate, consider N = 3. Suppose the sequence of coupons from the boxes purchased is as follows:
1, 1, 3, 1, 3, 3, 2.
In this case, X_1 = 1, since the first box provides a new coupon. The second box yields the same coupon number 1, which we already possess. However, upon opening the third box, we obtain a coupon marked 3, leading to X_2 = 2 (two additional boxes needed). Continuing this, we find it takes four more boxes to finally collect a coupon numbered 2, resulting in X_3 = 4. Consequently, the total number of boxes purchased amounts to X_1 + X_2 + X_3 = 7.
Thus, we aim to compute the expected value of the random variable X, representing the sum of the random variables X_i for i = 1, …, N. To achieve this, we utilize a key property of expected values known as Linearity of Expectation.
Our next task is to calculate E[X_i] for each i = 1, …, N. Starting with X_1, we have E[X_1] = 1.
For X_2, since we possess one coupon, the probability of acquiring a new coupon with a subsequent box is * (N-1)/N*. The only scenario where we do not obtain a new number is if the box contains the same number as that already in our collection. Thus, the probability of this occurring is 1/N, and consequently, the probability of success is 1 - 1/N = (N-1)/N. Therefore, X_2 follows a geometric distribution with a success probability of * (N-1)/N*, and its expected value becomes:
Following a similar reasoning for X_i, where the probability of success when purchasing a new box is given by 1 - (i-1)/N = (N-i+1)/N, we determine its expected value as:
By combining these results, we derive that:
The term within the parentheses in this expression corresponds to the well-known N-th harmonic number, which approximates to ln N. Thus, we can conclude that:
Additionally, to ensure accuracy, we can establish a more conservative bound:
To visualize how E[X] varies with different values of N, refer to the following graph:
This discussion illustrates that we can indeed compute the expected number of boxes necessary to win. However, expected values can sometimes obscure significant insights, especially in instances of considerable variance. Therefore, we now explore the probability of needing to purchase far more boxes than indicated by the expected value.
Probability of Exceeding Expected Purchases
A straightforward method to assess this is through Markov’s inequality, which states that for any non-negative random variable X, and any t > 1:
By substituting t = 2, we find that with at least a 50% probability, we will not need to buy more than 2E[X] ~ 4N ln N boxes.
Similarly, with at least a 75% probability, we will not exceed 4E[X] ~ 8N ln N boxes.
[For additional insights on Markov's inequality, consult Wikipedia or other reputable sources.]
While these bounds provide valuable insights, we can achieve even stronger results through more detailed analysis.
Consider the scenario where we opt to purchase T boxes. We can compute the likelihood of not obtaining a coupon numbered i across these T purchases. As previously mentioned, the probability of not receiving coupon i in a single box is 1 - 1/N. Given the independence of each box, the probability of missing coupon i across all T boxes is:
Next, we apply a useful inequality valid for all real numbers x:
By setting x = -1/N, we derive:
Next, let T = 2N ln N, rounding to the nearest integer if necessary. This leads to:
We are nearing completion. With N distinct coupons, we apply the Union Bound, which is applicable regardless of the independence of the events. Specifically, we consider N events, each representing the event where a coupon number i is absent from the T boxes. Consequently, we conclude that:
In essence, if we purchase T = 2N ln N boxes, then the probability of winning is at least:
This enhanced bound provides a much stronger concentration result as N increases, indicating that if N is sufficiently large and we buy T = 2N ln N boxes, we are almost guaranteed to win!