After hearing about Yu-Gi-Oh!’s 25th Anniversary celebration, my mind instantly turned to the Legend of Blue Eyes (LOB) set, the first booster set of the iconic trading card game; this set contained 10 Ultra Rare cards that were some of the first cards I wanted to get my hands on as a kid, such as Exodia the Forbidden One, Gaia The Fierce Knight, and the Blue-Eyes White Dragon [1].

Now, I could easily buy these 10 cards online, but where is the fun in that? I’d like to properly reminisce, and to do that, I’d like to open booster packs myself.

With booster packs, the cards inside are left to chance; we leave the realm of certainty and enter the realm of randomness [2]. While the contents of packs are random, it’s not as though we can’t make sense of how they work; in this case, we do have some partial information that will let us characterize the behavior of packs and allow us to make informed decisions about purchasing said products. I’ve moved the discussion about the probabilities of finding certain rarities in booster packs to the following page; the important content from that page is that every pack has 1 card of one of 4 rarities (Rare, Super Rare, Ultra Rare, or Secret Rare), and on average, we find 1 Ultra Rare in every 12 packs. With finding the 10 Ultra Rares as our goal, let us formally define a problem for us to tackle in this page.

The Problem

Problem: Given access to an unbounded number of LOB booster packs, how many booster packs $B$ would we need to open to find all 10 distinct Ultra Rare cards in those packs?

When facing a new problem, a good first step is to figure out what answers are reasonable; we essentially ask ourselves, “What values of $B$ make sense? What values don’t make sense? What’s a reasonable guess?” By doing this, we force ourselves to engage with the problem, and this lets us build up some intuition about the problem.

In this case, we can immediately restrict our attention to positive integers. We must open at least 1 pack to find any cards at all, which eliminates all values less than 1; additionally, we consider packs as exactly one of closed or opened, so each opened pack contributes 1 to the total count $B$ (no such thing as, say, 0.72 of a pack). Now, are all positive integers reasonable? For instance, a factory error could make it so that the first pack we open contains all $10$ Ultra Rare cards; this sounds quite unlikely but nevertheless possible. Thus, we’ll further assume that errors like these do not occur and that we can get at most $1$ Ultra Rare card from each booster pack. Since we have $10$ Ultra Rare cards to find, we’ll need at least $10$ packs, but this still leaves us with infinitely many choices for $B$. You might think, for example, that you won’t need as many as $7529831562$ packs to complete the set of Ultra Rare cards, but we’ve not yet developed the formal mathematics to rigorously quantify this intuition. We simply hope we’re not being asked to spend billions of dollars to complete a small collection of cards.

With this problem, quite possibly the most important point to acknowledge is the role of randomness. Randomness makes it so that there isn’t a single number answer (such as, say, $B=237$ packs) that works every time. I could find the $10$ distinct Ultra Rares in the first $10$ packs, or I could go $500$ packs without finding a single Ultra Rare. Importantly, this $B$ value will vary depending on the sequence of packs that we open; there are many such sequences, and we won’t know the value of $B$ until we run a particular experiment, which is why $B$ is referred to as a random variable [3, Pp. 103-104]. Since $B$ can take on many values, our problem is really asking us to understand the distribution of this random quantity.

Some readers might have already seen a variation of this problem before, commonly referred to as the Coupon Collector’s Problem. Rather than talking about coupons, an isomorphic version of the problem asks us to imagine that we are kids at our favorite restaurant, where every kids’ meal comes with exactly one toy out of $n$ distinct toy designs (with every toy design having the same probability of being present); we love all of the designs, so the question we want to answer is: how many kids’ meals, on average, do we need to buy in order to collect all $n$ of the possible toys? [3, Pp. 161-162].

Admittedly, our card problem is not exactly the same as this problem. Firstly, this problem is simpler since it is asking for an average value of $B$, rather than asking us to understand the whole range of values that $B$ can take on. Secondly, in the restaurant setting, every kids’ meal contains one toy by assumption; finding an Ultra Rare card in every booster pack, however, is not guaranteed. We will keep both of these differences in mind as we proceed.

We will now use a number of different approaches to understand $B$.

The Analytical Approach, Attempting to Derive Exact Answers

If a problem is difficult, a key problem-solving strategy is to solve a simpler variation before tackling the full version. Thus, let us take a cue from one of the differences from the Coupon Collector’s Problem and address the simpler question presented in that version of the problem; that is, let’s figure out the following: how many booster packs $B$, on average, do we need to open to find all 10 Ultra Rare cards?

The Coupon Collector’s Problem has been studied before, and we can find a full solution to the Coupon Collector’s Problem in [3, Pp. 161-162]; here is the argument from that source adapted for our problem:

Since $B$ requires us to succeed in finding $n=10$ distinct Ultra Rare Cards, we can break down $B$ as follows

\[B = B_1+B_2+...+B_{n}=\sum_{i=1}^{n} B_{i},\]

where $B_i$ represents the number of booster packs opened after seeing the $(i-1)^{th}$ Ultra Rare card up to seeing the $i^{th}$ Ultra Rare card, inclusive of this last pack.

Now, since $B_i$ is counting up the trials until we get a success, we recognize this as a $Geometric(p_i)$ random variable, where $p_i = \frac{n-i+1}{n}\frac{1}{12}$.

To get the average number of packs, we compute the expected value of $B$, which is made easy by our decomposition of $B$ along with the linearity of expectation:

\[E[B]=E \left[\sum_{i=1}^{n}B_i \right]=\sum_{i=1}^{n}E[B_i].\]

Given that the $B_i$ are geometric random variables, we know that

\[E[B_i]=\frac{1}{p_i}=\frac{1}{\frac{n-i+1}{n}\frac{1}{12}}=\frac{12n}{n-i+1}.\]

Thus, we see that

\[\sum_{i=1}^{n}E[B_i]=\sum_{i=1}^{n}\frac{12n}{n-i+1}=12n\sum_{i=1}^{n}\frac{1}{n-i+1}=12n\sum_{j=1}^{n}\frac{1}{j}.\]

Now, we can simply plug in $10$ for $n$ and compute that the average number of booster packs is

\[E[B]\approx 351.5.\]
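If we want to double-check that arithmetic, a few lines of Python reproduce the value (this is just a quick sketch with names of my own choosing, using only the $n=10$ and 1-in-12 figures from above):

```python
from fractions import Fraction

n = 10                 # distinct Ultra Rare cards to collect
packs_per_ultra = 12   # on average, 1 Ultra Rare per 12 packs

# E[B] = 12 * n * H_n, where H_n = 1 + 1/2 + ... + 1/n is the n-th harmonic number
harmonic_n = sum(Fraction(1, j) for j in range(1, n + 1))
expected_packs = packs_per_ultra * n * harmonic_n

print(expected_packs, float(expected_packs))   # the exact fraction and roughly 351.5
```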

The motivation behind mathematical arguments isn’t always fully clear on a first reading, so let’s take a step back and elucidate two steps in the previous section.

Firstly, why did we decompose $B$ into the $B_i$’s?

To make sense of $B$, we want to think about the experiment that makes the variable random. Our experiment here is to open a sequence of booster packs, where we keep track of which Ultra Rare cards we’ve found so far and record the pack number $B$ in which we complete the full set of Ultra Rare cards.

It would be nice to be able to state a distribution for $B$ without doing much, but thinking through the usual distributions we learn about in a probability class (such as Bernoulli, Binomial, Normal, etc), none seem to fit exactly. Because of this, we want to look for a different way to analyze $B$.

The main problem-solving strategy we use here is to break down a hard problem into smaller sub-problems that we know how to solve [3, P. 161]. At the beginning, we see $B$ as the pack number where we succeed in finding all of the Ultra Rare cards, but this final count requires a bunch of small successes along the way (namely, getting each Ultra Rare card we haven’t seen yet).

Thus, we want a way to represent and manipulate these small successes, so they are given the $B_i$ notation. When we think about the distribution of these $B_i$ random variables, we find that they do fit the pattern of a distribution from probability class, which reduces our problem into pieces we know how to work with and gives us confidence in this route of attack.

Secondly, where did the expression for $p_i$ come from?

We claim that $p_i = \frac{n-i+1}{n}\frac{1}{12}$. This uses the $\frac{1}{12}$ ratio we mentioned for the probability of getting an Ultra Rare card; furthermore, it then scales that probability according to the fraction of the $n$ Ultra Rare cards left to find. Because of the scaling by $\frac{n-i+1}{n}=1-\frac{i-1}{n}$, we make an extra assumption: we assume that each of the 10 Ultra Rare cards is equally likely to appear, which is why the factor drops by $\frac{1}{n}$ as we move from $B_i$ to $B_{i+1}$. To some, it may feel reasonable to assume that each Ultra Rare card is equally likely to be pulled, but a priori, this assumption does not have to be true.
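As a quick sanity check on this expression: the first Ultra Rare we pull is always one we haven’t seen, so

\[p_1=\frac{n}{n}\cdot\frac{1}{12}=\frac{1}{12},\]

while for the last one only a single specific card remains, so

\[p_{10}=\frac{1}{10}\cdot\frac{1}{12}=\frac{1}{120},\]

meaning that, on average, the final missing Ultra Rare alone accounts for $E[B_{10}]=120$ of the roughly $351.5$ packs.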

Once we understand these details, the argument flows nicely.

Admittedly, over 350 packs is a lot of packs to open; however, this number is the average number of packs needed, not necessarily close to how many we’d need if we ran the experiment ourselves. Our analysis so far is insufficient to say anything about the spread of values of $B$ that we could see, but we went through this exercise to unveil the underlying structure of the problem, brainstorm strategies to approach it, and tie down any assumptions that we hadn’t made explicit yet.

Rather than knowing about only one summary statistic, we’d like to turn back to the original main problem of understanding the full distribution of $B$.

When considering the whole distribution, a few ideas we might consider are the following:

  1. Central Limit Theorem (CLT)
  2. Moment Generating Functions (MGFs)
  3. Simulation

Since we decomposed $B$ into a sum of random variables in the previous discussion, our mind might turn to the CLT. A common version of the CLT, however, asks that the summed random variables be independent and identically distributed and that the number of them go to infinity [3, Pp. 471-473].

Unfortunately, our setup here fails to meet several conditions of the CLT. Firstly, the random variables are not identically distributed, since the $p_i$ parameters of the Geometric random variables depend on $i$. Secondly, the number of random variables in the sum is fixed at a small value ($n=10$), so the limiting part of the statement does not apply.

Interestingly, the independence of the random variables is the most plausible condition to argue is being met. If packs work the way many intuitively think they should, where what one pack contains doesn’t affect what is in other packs, then the number of packs it takes to find the $i$-th new Ultra Rare card doesn’t affect the number it takes to find the $j$-th ($i \ne j$), as in the argument in [3, Pp. 161-162]. This is more of an assumption than a fact. In the real world, there is some method that is followed to produce the cards, and it is up to the companies responsible for these processes to enforce some degree of independence between packs, if that is a goal of theirs. For our purposes here, though, we may choose to assume that our packs aren’t dependent on each other.

In that case, one idea would be to consider the moment generating function (MGF) of $B$, which gives us a full characterization of its distribution [3, Pp. 279-280]. We could use the independence of the $B_i$’s to conclude that the product of the MGFs of the $B_i$’s is the MGF of the sum $B$ [3, P. 281]. We could say we are done here, but this approach comes with a cost; we would be working with complicated products and would not easily have the probability mass function or cumulative distribution function at our disposal. Because of this, we will not pursue that idea here and, instead, consider how simulation might help us in this case.
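(For the curious, here is a concrete look at why the product gets unwieldy: under the trials-until-first-success convention we used for the $B_i$’s, each MGF is

\[M_{B_i}(t)=E\!\left[e^{tB_i}\right]=\frac{p_i e^{t}}{1-(1-p_i)e^{t}} \quad \text{for } t<-\ln(1-p_i),\]

so independence would give

\[M_B(t)=\prod_{i=1}^{n}\frac{p_i e^{t}}{1-(1-p_i)e^{t}},\]

which is exact but not something we can easily invert back into a probability mass function.)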

The Empirical Approach, An Approximation With Few Assumptions

If we don’t want to delve into a lot of math, we can leverage the power of computational resources; namely, we can use simulation to estimate the distribution of $B$. Our experiment here is opening packs until we get all of the Ultra Rare cards, so we can write code to simulate this experiment many times and record the value of $B$ that we find after each experiment. In particular, let $B^{(i)}$ be the number of packs needed to complete the set of Ultra Rare cards in the $i$-th experiment, where $i \in \lbrace 1,2,\ldots,m \rbrace$ for some large $m$. We can then use the ratio of occurrences of a particular value $b$ to the total number of experiments to estimate the probability of $B$ taking on that value $b$, and plot the probability mass function or a histogram of $B$ to see the whole empirical distribution. The process described here embodies the frequentist view of probability and is justified by the Law of Large Numbers [3, Pp. 48-49, 470-471].
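Here is a minimal sketch of what that simulation could look like (the function names are my own, and it bakes in the same assumptions as before: at most one Ultra Rare per pack, a 1-in-12 hit rate, and all 10 Ultra Rares equally likely):

```python
import random
from collections import Counter

N_ULTRA = 10       # distinct Ultra Rare cards in LOB
P_ULTRA = 1 / 12   # assumed chance that a pack contains an Ultra Rare

def packs_to_complete_set(rng: random.Random) -> int:
    """Open packs until all N_ULTRA distinct Ultra Rares are seen; return the pack count B."""
    seen: set[int] = set()
    packs = 0
    while len(seen) < N_ULTRA:
        packs += 1
        if rng.random() < P_ULTRA:              # this pack contains an Ultra Rare
            seen.add(rng.randrange(N_ULTRA))    # which card it is, assumed uniform over the 10
    return packs

def simulate(m: int, seed: int = 0) -> list[int]:
    """Run the experiment m times, returning one value of B per run."""
    rng = random.Random(seed)
    return [packs_to_complete_set(rng) for _ in range(m)]

if __name__ == "__main__":
    samples = simulate(100_000)
    counts = Counter(samples)
    print(sum(samples) / len(samples))   # should land near the analytical ~351.5
    print(counts[120] / len(samples))    # empirical estimate of Pr(B = 120), for example
```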

In this case, running the simulation for 1 million iterations left us with an empirical distribution like this:

After seeing this plot, we may feel like we’re done answering our original question. We have a picture of the distribution, and this picture lets us take in quite a bit of information about the likelihood of ranges of values just by looking at the magnitude of the vertical bars. As we mentioned above, we also have all the $B^{(i)}$ values that can give us more precise estimates if we want that. Thus, what exactly are we missing? There are a few issues we could point out.

The empirical distribution has a bunch of peaks and valleys that one might reasonably attribute to the randomness of the simulation process and the finite number of runs, sometimes exacerbated by the histogram representation itself (e.g., the choice of bins) creating visual quirks not present in the underlying distribution; these wiggles wouldn’t really be there if we ran the simulation an arbitrarily high number of times [4]. As we mentioned previously, the Law of Large Numbers is what guarantees that the empirical distribution converges to the actual distribution as the number of experiments simulated goes to infinity [3, Pp. 470-471]. We don’t often have the luxury of running a procedure an infinite number of times, so with this finite simulation approach, we’re left with an approximation to the true underlying distribution.

Sure, we could always run the simulation an incredibly high number of times (at the expense of additional computing time) until we are satisfied that our approximation is good enough, but are there limitations to this? For instance, will we ever run it enough times to complete the set of Ultra Rares in exactly $7529831562$ packs (the number we mentioned earlier)? If we never see such an experiment, then by the procedure described here, our estimate of the probability of that value will be exactly 0; intuitively, though, this probability is non-zero (although we could guess that it’s incredibly close to 0). The same argument applies to more reasonably sized values: peeking at our histogram shows that we rarely see values above 2000, and for any value that we don’t see even once, we’re declaring it to have 0 probability. In a related context, Laplace smoothing is one option that helps with this problem, but we’ll tackle it differently on this page [5, Pp. 44-46].

Because we simply ran the experiment many times and let those empirical findings fully dictate the distribution, without specifying constraints on the underlying distribution or learning any parameters, this approach falls under the umbrella of nonparametric statistics [3, P. 471]. This raises the question of what we can gain when we do add in extra restrictions about the underlying model; we’ll explore this move from nonparametric statistics to parametric statistics next.

Other Approximate Approaches, Discovering Hidden Structure

Looking at the empirical distribution, we might say that the overall shape of it isn’t too complicated; roughly speaking, we see a bell-shaped curve with a long right tail. This makes us wonder if there could be a simple model for the distribution hiding in the simulation results. If we were to find such a model, it would show that we’ve found some structure in the chaos of the simulation. In essence, we’d be compressing the information from the simulation, which involved a large amount of rote computation, into a manageable form; this form could potentially be only the name of a distribution and a few parameters.

In the Analytical Approach section, we brought up how parametrized distributions we typically encounter in probability class didn’t seem to match up perfectly with our problem, but as a test (or an initial baseline model), let’s try to see how one of those distributions would best fit our empirical distribution.

As with any mathematical endeavor, we want precision in our definitions, so we have to clearly define what ‘best fit’ means. One such definition uses the principle of maximum likelihood; in short, the principle says that when deciding between different parameters for a model, we should choose the parameters which maximize the likelihood of the data [6]. While this is not the only way to define a best fit, we won’t get into others, such as the Method of Moments, here [7].

To show how one can use the principle of maximum likelihood, let’s suppose that we believe our data comes from a Poisson distribution. Accordingly, we will assume that we draw the $B^{(i)}$’s independently from a Poisson distribution (our model assumption) but with unknown parameter $\lambda$ (that is, $B^{(i)} \sim Poisson(\lambda)$ but we don’t know what $\lambda$ is), and the principle would return that the best choice of $\lambda$ is the empirical mean $\frac{1}{m}\sum_{i=1}^{m}{B^{(i)}}$; refer to [8] for a derivation of this result, which shows how we write down the joint likelihood of the data, take the log, take the derivative, and set it equal to 0 to find the $\lambda$ which maximizes the likelihood, in very much the same way one might do in a calculus class.

We won’t need to solve a maximization problem by hand for every model since SciPy allows us to find the best parameters via maximum likelihood for the Poisson model and many other models [9]. The SciPy function linked above even lets us plot the model fit with the empirical data, but I’ve recreated all of the plots here using bokeh. In general, I find it instructive to dig deeper into the high-level function calls that abstract away a lot of the work, and in this case, making the plots using a different library forced me to verify what was going on. Here is a plot of the Poisson model along with our empirical data:

Right away, this does not look like a great model for the data. To visually see how different our model is from the empirical findings, we can use a Q-Q plot, where deviations from a straight $45^\circ$ line signal that the model differs from the empirical data [10]. For all of our Q-Q plots, we’ll rely on the statsmodels qqplot function and recreate the plot it generates [11]. Here is the Q-Q plot for the Poisson model:

As we can see from this plot, the plotted points are not at all close to the $45^\circ$ line. We chose the Poisson because it allows for discrete outcomes (non-negative integers) and is simple, with only a single parameter to keep track of, so it checks the boxes for a baseline model. Once we see the fit, though, we have to acknowledge that we are fitting neither the bell-shaped part of the distribution nor the right tail.
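For reference, here is a rough sketch of how this Poisson baseline and the quantile pairs behind a Q-Q plot like the one above could be computed by hand (the helper below uses names of my own choosing; the actual plots in this post were built with bokeh on top of the statsmodels qqplot logic):

```python
import numpy as np
from scipy import stats

def poisson_qq_points(samples: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Fit a Poisson by maximum likelihood (lambda = sample mean) and return
    (theoretical, empirical) quantile pairs for a Q-Q plot."""
    lam_hat = samples.mean()                      # MLE of the Poisson rate
    empirical = np.sort(samples)                  # empirical quantiles
    # Plotting positions (k - 0.5) / m, then the model quantiles at those positions
    probs = (np.arange(1, len(samples) + 1) - 0.5) / len(samples)
    theoretical = stats.poisson.ppf(probs, lam_hat)
    return theoretical, empirical

# Usage sketch: with `samples` holding the simulated B values from above,
#   theoretical, empirical = poisson_qq_points(np.asarray(samples))
# a scatter of (theoretical, empirical) against the 45-degree line gives the Q-Q plot.
```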

To move towards better models, we’ll take inspiration from the Log-Normal and Power Law models brought up in [10] as models that could fit distributions with long right tails. The intuition for each is simple. The logarithm function will bring down the bigger values that we see in the right tail of the empirical distribution and will hopefully change the shape of the distribution into one we can deal with easily; this is a standard “trick” in statistics, as the same logic is used in the pre-processing of time series data to tame large peaks so that the data fits into standard models [12, P. 58]. The power law model is another way in which we can characterize certain right tail behaviors; it says that bigger values are less likely in a way that is proportional to a power of the value [10].
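Before looking at the fits, here is a sketch of how these two candidates could be fit with SciPy’s generic maximum likelihood fitting (again, a sketch with my own function names rather than the exact code behind the plots in this post):

```python
import numpy as np
from scipy import stats

def fit_tail_models(samples: np.ndarray) -> dict:
    """Fit the two long-tailed candidates by maximum likelihood using SciPy's generic fit."""
    lognorm_params = stats.lognorm.fit(samples)          # shape s, plus loc and scale
    plognorm_params = stats.powerlognorm.fit(samples)    # shapes c and s, plus loc and scale
    return {
        "lognorm": stats.lognorm(*lognorm_params),
        "powerlognorm": stats.powerlognorm(*plognorm_params),
    }

# Usage sketch: models = fit_tail_models(np.asarray(samples))
# models["powerlognorm"].pdf(x) gives the fitted density to overlay on the histogram,
# and .ppf(probs) gives model quantiles for a Q-Q plot, just as in the Poisson sketch.
```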

Accordingly, one model that worked well was the Log-Normal model. Here is that model with the empirical data:

At first glance, the first plot makes the model look fairly good; it seems as though the model is capturing the distribution decently well. When we look at the Q-Q plot, however, we see more of the story:

We see that this model aligns quite well with the empirical data at the start but begins to diverge from it as the values get larger. This reinforces our belief that the model adequately fits the bell-shaped part of the distribution but struggles to fit the right tail.

Adding in the power law component gives the Power Log-Normal model, which worked incredibly well. We again show both the model fit with the empirical data and the Q-Q plot here:

As we can see, the first plot shows a great fit from an initial glance, and the Q-Q plot further confirms that. It is clearly not perfect as we look at the highest values of packs that we observed, but it appears to be a closer fit than the previous model for most of the probability mass of our distribution, which is housed by the bell-shaped part of the distribution and the beginnings of the right tail. We could likely find other simple models that would fit our data well, but for now, we’ll stop with this model. In some sense, we are finding an approximate model for an approximate distribution, so concerning ourselves with fitting the empirical, approximate distribution exactly should not be our only goal. We’re after uncovering structure, not perfectly fitting an approximation.

With these models, we now have the ability to estimate probabilities far off in the right tail of the distribution, and we can give an estimate for the value of $Pr(B=7529831562)$, which we brought up earlier. For ease of notation, let $b=7529831562$. If we had a discrete model like the Poisson, we could simply take the probability mass function $p_B$ and evaluate it at $b$. Since our best models were continuous distributions, we’ll use a continuity correction and simply take an integral around the value of interest; in this way, we would then say that

\[Pr(B=b) \approx \int_{b-0.5}^{b+0.5}f_B(x)dx = F_B(b+0.5) - F_B(b-0.5)\]

where $f_B$ denotes the probability density function (the model curves we’ve been plotting alongside the empirical data) and $F_B$ denotes the cumulative distribution function of our model [3, P. 475]. While this would certainly output a value, we should still be cautious and question whether this value is reasonable; after all, we didn’t observe values anywhere near this region of inputs, so why should the output we get be reasonable? This is a general issue in statistics, where we have models extrapolate outside of the regions they were trained on, and whether we should trust those results is highly questionable. If we understand the general behavior of Log-Normal or Power Log-Normal models along with the particular parameter estimates we derived, we could argue from a theoretical standpoint that the probability decreases as the values get bigger; the exact way in which this happens depends on the model and its parameters, but nonetheless, this is what we intuitively want, and we let the model and learning algorithm decide the particular rate at which this happens.
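For concreteness, here is a sketch of that continuity-corrected calculation (the parameters below are placeholders for illustration, not the values fitted above):

```python
from scipy import stats

def point_prob(model, b: int) -> float:
    """Continuity-corrected estimate of Pr(B = b) from a continuous fitted model.
    Uses the survival function sf = 1 - cdf, which behaves a bit better far in the tail."""
    return model.sf(b - 0.5) - model.sf(b + 0.5)

# Placeholder parameters for illustration only (NOT the fitted values from this post)
model = stats.powerlognorm(c=2.0, s=0.5, loc=0.0, scale=300.0)
print(point_prob(model, 7_529_831_562))
# This far into the tail, the difference underflows to 0.0 in double precision,
# which is itself a reminder of how little the data constrains this region.
```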

If we chose to use more expressive models, such as neural networks, that are flexible enough to fit real world data well, then we might worry about the behavior in regions far from the data manifold; after all, we have to wonder what is stopping such a model from perfectly fitting the distribution in the region covered by the dataset while giving ridiculous values everywhere else [13]. There are certainly fixes we could impose, such as enforcing non-negative values only or capping the probability mass at 1, but without thinking about this, we can see how always choosing the most expressive model can backfire. We’d like our understanding to meet the practice and, in this way, ideally move towards models that are simple to think about and have a strong theoretical backing. In this case, we were able to find models that made use of a few intuitive transformations to the data and showed capable model fits.

Conclusions

As we can see, we looked at this problem through a few different lenses, and each problem-solving route gave us an additional perspective. To summarize, we

  • formalized the problem we cared about
  • saw how its structure was similar to a classic and simpler problem
  • tackled the simpler problem to unveil strategies that could be useful for the main problem
  • considered mathematical routes stemming from the structure revealed by the simpler problem
  • coded up a nonparametric, approximate solution to the problem using simulation
  • explored parametric models that could characterize the behavior of the distribution succinctly.

Personally, I had fun taking something that was already on my mind, applying the problem-solving strategies I’ve spent years developing, and telling a story about the core ideas and connections that crystallized as I thought more and more about the problem.

References

  • [1]“Booster Set Breakdown - Legend of Blue-Eyes White Dragon.” Accessed: Aug. 20, 2023. [Online]. Available at: https://ygoprodeck.com/article/set-theory-legend-of-blue-eyes-white-dragon-8227.
  • [2]“New to Yu-Gi-Oh!” PSA, Accessed: Aug. 20, 2023. [Online]. Available at: https://www.yugioh-card.com/en/about/new-to-ygo/.
  • [3]J. K. Blitzstein and J. Hwang, Introduction to Probability, Second Edition. Chapman & Hall/CRC Press, 2019.
  • [4]C. O. Wilke, “Fundamentals of Data Visualization.” Accessed: Aug. 20, 2023. [Online]. Available at: https://clauswilke.com/dataviz/histograms-density-plots.html.
  • [5]A. Ng and T. Ma, “CS229 Lecture Notes.” Stanford University, Accessed: Aug. 20, 2023. [Online]. Available at: http://cs229.stanford.edu/main_notes.pdf.
  • [6]“1.2 - Maximum Likelihood Estimation.” PennState Eberly College of Science, Accessed: Aug. 20, 2023. [Online]. Available at: https://online.stat.psu.edu/stat415/lesson/1/1.2#paragraph–492.
  • [7]“1.4 - Method of Moments.” PennState Eberly College of Science, Accessed: Aug. 20, 2023. [Online]. Available at: https://online.stat.psu.edu/stat415/lesson/1/1.4.
  • [8]“1.5 - Maximum Likelihood Estimation.” PennState Eberly College of Science, Accessed: Aug. 20, 2023. [Online]. Available at: https://online.stat.psu.edu/stat504/lesson/1/1.5.
  • [9]P. Virtanen et al., “SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python,” Nature Methods, vol. 17, pp. 261–272, 2020, doi: 10.1038/s41592-019-0686-2.
  • [10]C. O. Wilke, “Fundamentals of Data Visualization.” Accessed: Aug. 20, 2023. [Online]. Available at: https://clauswilke.com/dataviz/ecdf-qq.html.
  • [11]S. Seabold and J. Perktold, “statsmodels: Econometric and statistical modeling with python,” 2010.
  • [12]R. H. Shumway and D. S. Stoffer, Time Series Analysis and Its Applications With R Examples, Fourth Edition. Springer, 2017.
  • [13]“CS231n Convolutional Neural Networks for Visual Recognition.” Stanford University, Accessed: Aug. 20, 2023. [Online]. Available at: https://cs231n.github.io/neural-networks-1/.