The Scenario

Imagine the following.

You’ve got rocket-boosted cars zooming all over your TV. You’re mashing about 100,000 buttons per second on your controller. You keep hitting the ball, but it simply will not make its way to your opponents’ goal.

Right now, you’re trying your best to win the 2-on-2 Rocket League match with one of your best buddies. You’ve been racking up wins left and right (actually, you’ve lost about half of the games, but you ignore that), and you show no signs of stopping.

You enter a new match, but you feel that something is off. No other match you’ve played has ever felt like this. It is not fun.

Is it input delay? Are your opponents hacking? Is it all in your imagination? Almost instantly, you see the ball teleport to the left as you attempt to hit it straight on. “I knew it!!!”, you think to yourself.

You miss, and miss, and miss as the connection quality never seems to improve, and you lose. Disappointing result, but surely, the next match cannot be as bad.

Well, much of the same happens in your next match, and again, you lose.

You tell your friend you don’t think you should play anymore, and do a little exploration as to why those games felt off. Was your Wi-Fi connection dropping? It doesn’t seem like it. Was your friend’s? He doesn’t seem to think so.

You try to join another game, and there is an interesting message on your screen; it appears you are matching up with opponents in the Oceania server.

That explains it, right? Your buddy is quick to exclaim that this has to be the reason. Surely, you lost because the connection from North America to Oceania must be terrible, and you wouldn’t have lost otherwise.

The Initial Question

While the gamer in you wants to agree with him, something about the statistician in you fights back. It brings up the following question:

Question 0.1: Is there enough evidence to support the claim that the Oceania server caused you to lose?

As stated, this question is quite imprecise. What does it mean for a server to cause you to lose? In this case, we mean that differences in connection quality (which include lag, teleportation, input delay, etc.) cause your underlying true win rate to change to some other win rate. From now on, we won’t talk about server locations; since North America is quite far from Oceania, we’ll equate games in the Oceania server with bad connection games, and games in nearby servers with good connection games. This equivalence is certainly not perfect, as location is not the only determinant of connection quality and connection quality doesn’t fall into exactly two discrete buckets, but we’ll make this simplifying assumption for this exercise.

We can thus update our question as follows:

Question 0.2: Is there enough evidence to support the claim that differences in connection quality cause a change in your true win rate?

In framing the question like this, we should highlight the assumptions being made.

We are implicitly saying that there are two win rates $r_B$ and $r_G$, the win rates in bad and good connection respectively, over the whole population of interest; such an assumption is standard for statistical inference, which is what we are engaging in when we want to take a dataset gathered from the population to learn something about the true values of these win rates over the whole population [1, Pp. 196-198]. Because we care about the win rates over the whole population, we should clearly define our population of interest. If we wanted to make claims about all of the possible people we could theoretically play in Rocket League, we’d be including every person on the planet in our population. This likely isn’t what we want; if we calculated a win rate over all people, most of whom have never heard of or played the game, we’d likely get higher win rates than we would from normally playing the game. Thus, it would be preferable to formulate our population to be those we would actually encounter while playing the game. We’d be wise to narrow down on a specific population and then figure out how to get a random sample from it in order to make the inference work easier [1, Pp. 19-22].

It is another assumption to further say that these two win rates are constant. This is also the standard way in which this is done; the approach still acknowledges that they don’t have to be constant, but assumes that any variation in these win rates is small enough that we don’t need to worry about it [1, P. 198]. This constant assumption could easily be called into question by noting that you are an agent learning with additional experience, who can, for instance, improve your win rate in bad connections by implementing strategies that work well in bad connection games, as you figure them out. Thus, these win rates don’t have to be fixed, and it is an extra assumption to say that they are (or that the difference is). In some cases, such as with an experienced player, you could claim that these rates have stabilized enough to make the assumption reasonable. All of the assumptions we bring in make up the model that we are building, but as the common saying goes, “All models are wrong, but some are useful.” Thus, we accept that there is a degree of simplification here that we are trading off for the ability to make claims supported by data (under the model).

Returning to the question, it would be quite difficult to answer it at this point in time since we don’t have concrete data about what happens when the connection is good, so let’s give ourselves some data to work with. From the story so far, we have 0 wins out of 2 total matches for the Oceania server (bad connection). For the sake of this exercise, we’ll assume that you played 100 games with your buddy in the nearby servers (good connection) and won 55 of those. We’ll make it easy for ourselves and further assume that this data is a random sample from our population of interest; this assumption can easily be called into question if we simply looked to our past game data to come up with this data without thinking about the conditions for a random sample and the specific population from which it is drawn.

This is the data we’ll work with for this exercise, but we’ll find that this data, as it is, is insufficient for our purposes; however, in attempting to use this data to extract statistical insights, we’ll take note of the issues that arise and hopefully learn what sort of data we would need to collect to answer our questions. We can use those insights to design an experiment with richer data that overcomes those problems. In some sense, scientific inquiry has to start somewhere, and while an initial study might not be able to make bold claims, it teaches us how subsequent ones might be able to. Essentially, experience shows us questions we want to answer and attempting to answer them shows us what we actually need to do so adequately. With that said, let us return to the question.

Importantly, we have a question of causality here. In this setting, however, we were not controlling the way in which we are assigned to good connection or bad connection games and that pulls us away from the realm of a randomized experiment and towards the realm of an observational study; in order to justify causal claims, we’d likely need more information and to do a lot more work that is outside the scope of this analysis [1, Pp. 33-36].

Since we don’t always have the information to satisfactorily answer the original problem we set out to answer, we will pivot to asking a related question. Here, we move from a question of causality to one of association.

The Modified Question

The question we will address is:

Question: Is there a difference in your true win rate across matches with bad connection quality versus your true win rate with good connection quality?

We’ve written the question in a way that introductory statistics students might recognize as a hypothesis test for a difference of two proportions, seen in textbooks such as [1], to hint at the approaches we’ll get into.

The Approach

In this case, since we have categorical data that we can place in a $2\times2$ contingency table, a common choice here would be to apply a $\chi^2$ test of independence; however, this is an approximate method that relies on additional assumptions, and even the simplified rules of thumb for when it is reasonable to use are not satisfied by our data [1, Pp. 523-530]. Instead, we will choose to use Fisher’s exact test to test whether there is a dependence between connection quality and win rate [2, Pp. 143-150].

We’ll start to set up the language for this test. As we’ve previously introduced, recall that $r_{B}$ and $r_{G}$ denote the true win rates in bad connection and good connection, respectively. Our null hypothesis will be that there is no difference in these win rates; namely, it is that $r_{B} = r_{G}$. Our alternative hypothesis will be that the win rate in good connection is not the same as it is in bad connection; explicitly, it is that $r_{B} \ne r_{G}$.

The Test’s Assumptions

In problems seen in school or books like [1], the examples chosen are generally picked such that the data satisfies the conditions for the subsequent statistical analysis they present. General assumptions might be that the examples in your dataset come from a random sample or that they are independent. For problems that we encounter in the real world, it’s up to us to check whether the assumptions hold or, at the very least, are plausible.

For this test, the assumptions seen in [3, Pp. 133-134] are:

  • the number of wins $W$ in bad connection is distributed as a $Binomial(n_{B}, r_{B})$, where $n_{B}$ and $r_{B}$ represent the number and win rate in bad connection, respectively
  • the number of wins $U$ in good connection is distributed as a $Binomial(n_{G}, r_{G})$, with $n_{G}$ and $r_{G}$ representing the analogous quantities in good connection
  • $W$ and $U$ are independent
  • $n_{B}$, $n_{G}$, and $t = W+U$ are fixed
  • the sample of good connection games and bad connection games is a random sample

We’ll consider the first two conditions at the same time since they differ only in the connection quality but are otherwise the same condition.

We must show that two random variables are binomial random variables, but how do binomial random variables arise? Binomial random variables represent the number of successes in a fixed number $n$ of independent $Bernoulli(r)$ trials with fixed success rate $r$ for each of the trials [3, P. 112]. Thus, to argue that $W$ and $U$ follow a binomial distribution, we should argue that the match results in the respective server are distributed as $Bernoulli(r)$ for some $r$ and that they are independent.
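
To make this story concrete, here is a minimal simulation sketch; the fixed win rate of 0.55 and the 100 matches are illustrative values borrowed from our good connection data, not a claim about how the game actually behaves:

import numpy as np

rng = np.random.default_rng(0)
n, r = 100, 0.55  # illustrative: 100 matches, each an independent Bernoulli(0.55) win/loss
match_results = rng.random((10**5, n)) < r  # each row is one simulated set of n match results
U_samples = match_results.sum(axis=1)       # wins per simulated set; follows a Binomial(n, r)
print(U_samples.mean(), n * r)              # both should be close to 55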

Do we believe that there is some win rate $r$ that applies for each match result within a server? You might say no immediately. After all, we don’t know what is done behind the scenes when opponents are chosen; the game could choose to give you really difficult opponents in one match and really easy opponents in another, which would make this assumption harder to believe (though we could still choose to model using, say, an average rate, as one example estimator for $r$). Furthermore, your skill can change as you play the game; you might choose to watch some tutorials and train arduously in the practice arena, which might boost your performance significantly from one match to the next. On the other hand, we could assume that the game wants to pair you up with opponents of a similar skill so as to maximize factors such as player retention, improvement, enjoyment, or play time, and it achieves this by roughly aiming to keep you winning at some chosen win rate (with varying success in practice).

Now, are these match results within a server independent? Here, again, you might also immediately say no. At the core, you are involved in every game and that could clearly signal dependence across games. For example, we could argue that a series of bad losses could make it more likely for you to lose the next game as you are too sad to focus on the next game. Furthermore, similar to what we said before, we could also argue that maybe you learned something key in one game and used that skill to impact the following games. These are two very plausible scenarios that could apply or not apply depending on the setting. Admittedly, it sounds really easy to disprove independence as we really just need one sequence of match results that are related in some way, and this can serve as a counterexample.

On the other hand, arguing for independence requires proving that there is no such relationship in all cases; unless we are in a very controlled environment where we aim to eliminate dependence, this sounds really difficult to do. Nevertheless, independence seems to be assumed quite often; for example, it is an assumption that makes up the “Probably Approximately Correct (PAC)” model in statistical learning theory [4, P. 127]. In this particular case, it could be argued that the dependence is weak because we could assume that we get new opponents each time (or almost every time), which changes the game enough that other factors don’t matter as much. We’d be taking it a step further by saying this dependence is weak enough that we could simply model the match results as independent.

What about the independence of $W$ and $U$? Put another way, does the number of wins in good connection affect the number of wins in bad connection, or vice versa? If framed using servers, should what happens in one server affect what happens in another? This assumption potentially sounds more reasonable but can still be challenged. For instance, we could say that if you win a lot in the servers close to you, the game might struggle to find opponents in those servers. Now, further suppose the best players play in bad connection and nowhere else. Then, the game could resort to matching you with really good players in bad connection. This could then lead to game losses in bad connection because it’s the only server where you get matched with players of that skill level, which would break the independence. Whether you believe scenarios like this arise (or arise enough to matter) is up to you to assess.

The next bullet point in the list is the easiest to argue in favor of since we do know the number of matches in the respective servers, and we do know the total number of wins over all of the servers beforehand; I had to double check that this was all that it was asking for, but as argued in [2, P. 144], we’re simply verifying that the summed entries on the margin of the contingency table are constant.

The last bullet point is an assumption that we made previously, so we can proceed with this point.

From this discussion, you might choose to stop here and say it is irresponsible to proceed because you don’t believe the assumptions. I certainly wouldn’t fault anyone for doing this. With extra assumptions, we might be able to make interesting claims that would not be possible without them. If they’re wrong, however, how do you trust your derivations? Real world data seems fraught with caveats, and it’s important to identify which assumptions are most at risk of being false, whether anything can be done to test them further, and whether a different method can be used that doesn’t require that particular assumption. For the sake of our discussion here, we’ll suppose that the assumptions for Fisher’s exact test hold.

The Test Itself

When we do assume that these conditions hold and further assume that the null hypothesis that $r_{B} = r_{G}$ also holds, Fisher’s exact test tells us that the conditional distribution of $W$ given that $W+U=t$ is a $Hypergeometric(n_{B}, n_{G}, t)$ distribution [3, Pp. 133-135]. As the argument in [3, Pp. 133-135] shows, the derivation simply uses Bayes’ Rule and simplifies using the assumptions and a consequence of the assumptions (namely, that the sum of two independent binomial random variables with the same win rate parameter is again a binomial random variable).
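
To see why this works out, here is a sketch of that calculation for a general value $k$ of $W$, writing $r$ for the common win rate under the null hypothesis; the first equality uses the definition of conditional probability together with the independence of $W$ and $U$:

\[Pr(W = k \vert W+U=t) = \frac{Pr(W=k)\,Pr(U=t-k)}{Pr(W+U=t)} = \frac{\binom{n_{B}}{k}r^{k}(1-r)^{n_{B}-k}\binom{n_{G}}{t-k}r^{t-k}(1-r)^{n_{G}-t+k}}{\binom{n_{B}+n_{G}}{t}r^{t}(1-r)^{n_{B}+n_{G}-t}} = \frac{\binom{n_{B}}{k}\binom{n_{G}}{t-k}}{\binom{n_{B}+n_{G}}{t}}\]

Every factor involving $r$ cancels, so the conditional distribution, which is exactly the $Hypergeometric(n_{B}, n_{G}, t)$ PMF, can be computed without knowing the common win rate.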

Fisher’s test then asks us to calculate the probability of $W$ taking on values as extreme or more extreme than the value we see in our data (using the distribution derived from assuming the null hypothesis), and we’ll reject that null hypothesis if that probability is low enough [2, Pp. 143-145]. Recall that in our data we saw that $W=0$, so now, we want to compute $Pr(W \leq 0 \vert W+U=t)$. We note that this conditional probability is for a one-sided test, but using it requires that we laid out beforehand that our alternative hypothesis was that $r_{B} < r_{G}$. In this case, we could say we had a reason to test whether the win rate in bad connection was strictly lower due to a belief that lag generally affects gameplay negatively; it is plausible to have believed this before we started playing out the games in our dataset, which led us to then exclaim that we must have lost because of the lag.

Nevertheless, this could certainly be called into question because it wasn’t stated before the data was collected and you could argue that the match results we did observe conditioned us to believe in the negative effects of lag; strictly speaking, if we had no reason to believe which direction was correct for the inequality or if we didn’t say anything before the experiment started, we want to do the two-sided test [2, Pp. 145-146]. After all, a player performing better in lag compared to other players in that same lag is not unheard of, and we can’t erase that possibility if we didn’t have evidence beforehand to do so. In this case, it will not matter since even the one-sided test will produce a p-value that is much too high, which we’ll get to next.

With Fisher’s test, we then see that

\[Pr(W \leq 0 \vert W+U=t) = Pr(W = 0 \vert W+U=t) = \frac{\binom{n_{B}}{0}\binom{n_{G}}{t}}{\binom{n_{B}+n_{G}}{t}} = \frac{\binom{2}{0}\binom{100}{55}}{\binom{102}{55}} \approx 0.21\]

To get the two-sided test p-value, a first approach would be to multiply the value above by 2, and a second approach would be to compute a statistic, such as the $\chi^2$ test statistic, to determine which instances are at least as extreme as our data [2, P. 146]. The first approach is pretty effortless, and since the one-sided value is already really high, there isn’t much to gain from the more careful approach, so we’ll do the simple thing and say that the p-value is $2 \cdot 0.21 = 0.42$.
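
If you’d rather not evaluate the binomial coefficients by hand, a short script reproduces these numbers; this is a sketch using SciPy’s hypergeom distribution and its built-in Fisher’s exact test, with the counts from our data plugged in:

from scipy import stats

n_B, W_obs = 2, 0      # bad connection: 0 wins out of 2 matches
n_G, U_obs = 100, 55   # good connection: 55 wins out of 100 matches
t = W_obs + U_obs

# One-sided value Pr(W <= 0 | W + U = t) under the hypergeometric null distribution
print(stats.hypergeom.cdf(W_obs, n_B + n_G, t, n_B))  # ≈ 0.21

# SciPy's Fisher's exact test on the 2x2 contingency table; the second returned value is the p-value
table = [[W_obs, n_B - W_obs], [U_obs, n_G - U_obs]]
print(stats.fisher_exact(table, alternative="less")[1])  # ≈ 0.21

One caveat: SciPy’s alternative="two-sided" option uses a different convention for “at least as extreme” (summing the probabilities of all tables no more likely than the observed one) rather than simple doubling, so it will not necessarily report exactly $2 \cdot 0.21$.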

Simulation

Closely related to what we found out before, we could also use simulation to generate a bunch of possible scenarios, empirically estimate the conditional probability from before, and use that to then report a p-value. Simulation gets us to estimate the probabilities of events using the frequentist approach to probability and the law of large numbers [3, Pp. 22, 470-471]. This differs from the previous method in that rather than getting an exact p-value, we get an approximate one.

Assuming that our null hypothesis holds, the win rate is the same in both connection types, so intuitively, any one match result could have reasonably come from a game in bad connection or a game in good connection; thus, we could simulate how datasets in other possible worlds could have feasibly looked by simply randomly selecting 2 matches out of all the matches, relabeling those two as coming from bad connection, and labeling the rest as coming from good connection, like what is argued in [2, P. 144]. From this description of the data generating procedure, we can now draw the connection to the hypergeometric random variable we discussed earlier. We could imagine that we draw $2$ ($n_{B}$) balls from an urn of colored balls, where white balls represent match wins and black balls represent match losses; then, according to the general story of how hypergeometric random variables arise, the number of white balls that we draw (which corresponds to the number of wins in bad connection) is distributed as a $Hypergeometric(t, n_{B}+n_{G}-t, n_{B})$ random variable, which is equivalent to a $Hypergeometric(n_{B}, n_{G}, t)$ [3, Pp. 115-118]. This is exactly what we arrived at earlier and justifies why this approach works.

In Python, the following code captures the general idea (here, data pools all 102 match results and rng is a NumPy random generator):

import numpy as np

rng = np.random.default_rng(0)  # NumPy random generator (seeded for reproducibility)
data = np.array([1] * 55 + [0] * 47)  # pooled match results: 55 wins (1s) and 47 losses (0s)
num_bad_matches = 2  # matches relabeled as bad connection in each simulated world
num_simulations = 10**5
Ws = []  # Stores number of wins in bad connection, for each simulated possible world
for dataset_number in range(num_simulations):
    pair_of_indices = rng.choice(len(data), size=num_bad_matches, replace=False)
    bad_connection_sample = data[pair_of_indices]
    W = np.sum(bad_connection_sample)
    Ws.append(W)

Now, we can visualize both the theoretical and the empirical distributions on the same plot:
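
Here is a minimal sketch of how such a plot could be produced, assuming the Ws list from the simulation above is available and using SciPy and Matplotlib (the exact styling is not important):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

support = np.arange(3)  # W can only take the values 0, 1, or 2
theoretical = stats.hypergeom.pmf(support, 102, 55, 2)  # null distribution of W (102 games, 55 wins, 2 drawn)
empirical = np.bincount(Ws, minlength=3) / len(Ws)      # frequencies observed in the simulation

width = 0.35
plt.bar(support - width / 2, theoretical, width, label="Theoretical (hypergeometric)")
plt.bar(support + width / 2, empirical, width, label="Empirical (simulation)")
plt.xticks(support)
plt.xlabel("Number of wins in bad connection, W")
plt.ylabel("Probability")
plt.legend()
plt.show()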

As you can see, the theoretical and empirical distributions pretty much agree, which we would hope is the case given what we’ve argued.

The Verdict

We still haven’t fully addressed the question, but we now have all the information that we need, and from two different perspectives.

We saw that in our actual dataset, the number of bad connection wins was 0. Assuming our null hypothesis of no difference in win rate with different connection qualities, we computed both a theoretical and an empirical probability of seeing data at least as extreme as what we saw in our dataset; both probabilities turned out to be around 0.21 as we can see in the plot. We multiplied this value by 2 to get the full p-value of the two-sided test. Thus, even when there is no difference in win rates, 42% of the time we see a dataset with a test statistic ($W$, the number of Oceania wins) at least as extreme as the one we actually came across in our world.

Maybe this is good enough evidence for you to say that there is a difference in the win rates; after all, 58% of the time, you would see a dataset with a test statistic less extreme than the one we saw. This means that more than half of the time, you wouldn’t see a dataset as extreme as ours. To many, our large p-value is not so convincing, and using the common significance level of 5%, we would fail to reject the null hypothesis, and thus, we say that there is not a statistically significant difference in win rates. This would even be the case if we had set up a one-sided test at the beginning and taken 0.21 to be our p-value, which means that almost 80% of the time we would see less extreme data.

It’s interesting to note that we were quite limited in the data that we collected; you didn’t want to keep playing because the games were not enjoyable, so you played only 2 games. Had you played 5 games, lost all of them, and repeated the analysis described above with this new data, we would find that around 2% of the time, we’d see 0 wins. We would again multiply this by 2, but this much lower number might be more convincing to a lot more people, especially to those who are ardent believers in p-values of at most 0.05, who ignore statisticians’ advice about weighing Type I and Type II errors to arrive at an appropriate significance level for the decision they have to make [1, Pp. 303-305].
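
For the curious, that 2% figure comes from the same hypergeometric calculation as before, now with 5 bad connection games and 105 games in total; here is a quick sketch of the check (the 5-game scenario is hypothetical, not something we actually observed):

from scipy import stats

# Hypothetical follow-up data: 0 wins out of 5 bad connection games, still 55 wins out of 100 good ones
print(stats.hypergeom.cdf(0, 105, 55, 5))  # ≈ 0.022, i.e. around 2%, before doubling for the two-sided test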

The outcome from a hypothesis test like the one we’ve just done is that we make decisions; we either fail to reject the null hypothesis or reject it in favor of the alternative hypothesis [1, Pp. 290-291]. In this case, the 0.42 p-value, while incredibly high, can be enough to get someone to reject the null hypothesis and make the decision to remove Oceania from the list of allowed servers for games. In order for this decision to be properly justified given our model and the current evidence in the data, they’d have to live with an incredibly lax significance level of at least 0.42; this big of a value might sound ridiculous, but when the stakes are not very big, one could acknowledge that the evidence is extremely weak but still go ahead with the decision. A better choice would be to set up a new experiment keeping in mind everything that we’ve discussed.

References

  • [1] R. H. Lock, P. F. Lock, K. L. Morgan, E. F. Lock, and D. F. Lock, Statistics: Unlocking the Power of Data, 2nd Edition. Wiley, 2016.
  • [2] P. I. Good, Permutation, Parametric and Bootstrap Tests of Hypotheses, Third Edition. Springer, 2005.
  • [3] J. K. Blitzstein and J. Hwang, Introduction to Probability, Second Edition. Chapman & Hall/CRC Press, 2019.
  • [4] A. Ng and T. Ma, “CS229 Lecture Notes.” Stanford University, Accessed: Aug. 20, 2023. [Online]. Available at: http://cs229.stanford.edu/main_notes.pdf.