LT Tracking Project Preliminary Results

It's been 36 hours since the end of the first round of the LT Tracking Project (https://d3go.com/forums/viewtopic.php?f=7&t=41109). If anyone has data they haven't yet sent in, please send it in and I will add it to these results.

I have posted the raw preliminary data (with draws commingled and potentially identifying information stripped out) here: http://dropproxy.com/f/D8D

Note on Analysis

Before we get to what the data shows, let's talk about what this data can tell us and how we know it. We have a data set and we want to determine whether or not we should question what we have been told and what we have reasonably assumed to be true about the game's random draws. Here are a few basic, testable hypotheses:

1) In this period (and ever since the release of Phoenix), the probability of drawing a 5* is fixed at 10%.
2) LT pulls are fair. That is, no extraneous variables should affect cover pulls. It shouldn't matter a) who is opening the LT, b) what time it was opened, c) through what means (token or CP) the LT was obtained, d) how much money the person has spent, e) how developed the roster is, or f) how the person plays.
3) All covers of the same rarity have the same probability of being drawn, which should match the listed drop rate.
4) Each character's three colors should be evenly distributed.

We can only evaluate these probabilistically through our limited dataset. That is, we have to calculate how likely our results are if we assume these hypotheses are true. This likelihood is called the p-value (https://en.wikipedia.org/wiki/P-value). For example, if we assume that the 5* drop rate is 10% but only got 1 5* out of 20, we can calculate that draws of 1 or 0 out of 20 make up around 40% of all outcomes - slightly unlucky, but not enough for us to question the 10% assumption. If it were 1 out of 50, we would be below the bottom 5%. The threshold where we start to question the assumption is the significance level; in typical scientific studies, it's somewhere around 1-5%, but it has to be tailored to the experiment. I'm not sure what the significance level for any of our "experiments" should be, but the further a result falls below the threshold, the more we should question the assumption.

Similarly, the more data we have, the more we can question unexpected results. For example, one 5* out of 20 and ten 5*s out of 200 are both 5% drop rates. But the first is around the bottom 40%, while the second is below the bottom 1%. The p-value also tells us whether our sample size is good enough - if the sample were too small, we'd mathematically be unable to reach our significance threshold.
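These tail probabilities can be computed directly. Here's a minimal Python sketch (my own illustration, assuming a simple binomial model for pulls):

```python
from math import comb

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) when X counts successes in n independent draws at rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# 1 five-star out of 20 pulls at an assumed 10% rate: unlucky but unremarkable
print(binom_cdf(1, 20, 0.10))    # ~0.39
# The same 5% observed rate, but 10 out of 200 pulls: well below the bottom 1%
print(binom_cdf(10, 200, 0.10))  # ~0.008
```

The same style of calculation underlies the percentiles quoted in the results below.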

Not being able to reach a low enough p-value does not mean that the assumptions are true, just that we do not have the data to question them.

The Data

There were 436 total LT draws. Two contributors (one of whom is me) supplied more than 75% of the data. I have validated the results of the other contributor against his or her in-game roster and found no discrepancies (as well as a perfect match in 5* covers). Therefore, most of the data appears reliable, but this concentration leaves us unable to determine whether individual accounts (or devices or OS software) affect RNG (assumption 2a). If 2a is invalid, that puts the rest of the results into question as well.

This amount of data is sufficient to potentially falsify the listed 10% drop rate (assumption 1). If we accept a significance threshold of 5%, an aggregate drop rate of less than 7.5% or more than 12.5% would be sufficient for us to question this.
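As a rough cross-check (my own back-of-the-envelope sketch, not an exact calculation), the normal approximation to the binomial gives a similar band for 436 draws:

```python
import math

n, p = 436, 0.10                       # pulls and the assumed 5* rate
se = math.sqrt(p * (1 - p) / n)        # standard error of the observed rate
lo, hi = p - 1.96 * se, p + 1.96 * se  # two-sided 5% band under the 10% assumption
print(f"{lo:.1%} to {hi:.1%}")         # roughly 7.2% to 12.8%
```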

Many people also opened tokens in large chunks, which means that much of the timing data is clustered. We also have significant missing data on the token/CP question. I've not yet had the chance to go through the survey data; that data may not be diverse enough either. So, I'm not sure that we have enough data to question the assumptions that hinge on these variables.

The Results

Ultimately, the overall 5* drop rate was 39 out of 436, or 8.9%. This is lower than the listed 10% rate, but it only puts us in the bottom 26.036 percentile of all outcomes, so we do not have enough of a reason to question assumption 1.

This breaks down to 208 total Latest LTs and 228 Classic LTs. The 5* rate for the Latest is 7.7%, for Classic 10.1%. Neither is significant enough for us to question assumption 2c.

Kind of boring. But so far so good. The next part is more interesting.

The most common 4* character pulled is a three-way tie between Ghost Rider, Iceman, and Red Hulk, each at 23. The least common 4* character pulled is Star-Lord at 4. The listed draw rate for all these characters is 3.5%, but since there is a 90% chance to pull a 4* and there are 26 valid 4*s (Quake and Devil Dino are not in the LT pool), the effective rate is actually 3.46%.

23 pulls out of 436 at 3.46% is in the top 96 percentile - a marginal result, close to the reasonable range of significance levels.

4 pulls out of 436 at 3.46% is in the bottom 0.069 percentile, which is far below any reasonable significance level.

This is an astounding result. Four Star-Lords out of 436 pulls strongly suggests that cover distribution is not even. We do not have sufficient data to determine whether the source of this unevenness is global or tied to individual accounts, devices, or some other variable. But we can be pretty close to certain that something is wrong. The next two least common pulls are XF Wolverine and 4hor, each with 9 covers, putting them at the 4.9 percentile - quite low, if only marginally significant.
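For anyone who wants to reproduce the Star-Lord number, here is a quick exact-binomial sketch (using the effective rate stated above):

```python
from math import comb

n, p = 436, 0.9 / 26          # total pulls, effective per-character 4* rate (~3.46%)
# P(X <= 4): the chance a fair RNG gives a specific character 4 or fewer covers
tail = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(5))
print(f"{tail:.4%}")          # on the order of 0.07%
```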

The most common 5* character is Black Suit Spider-Man at 11 pulls out of 208 (Latest LT) at 3.33% (top 91 percentile). The least common 5* was OML at 4 pulls out of 228 (Classic LT) at 3.33% (bottom 8.705 percentile). These are marginal results as well. The problem with the 5* results is that the populations of Classic and Latest tokens are each roughly half the size of the total number of LTs; if we had more data, we might be able to say definitively whether or not 5* drop rates are as skewed as the 4* ones seem to be. As it stands, it is merely very interesting.

ADDENDUM on Colors:

I've added a section to the results spreadsheet breaking down draws by character-color pairs to test the distribution of colors as well. If we assume that every color is evenly distributed as well, there is a 1.15% chance to get any specific 4* character-color combination (1.11% for 5*).
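As a sketch, the arithmetic behind those two figures (assuming 26 eligible 4*s and three 5*s per token pool, consistent with the rates used above):

```python
# One specific 4* character-color pair: 90% 4* odds, 26 characters, 3 colors
p_4star_color = 0.9 * (1 / 26) * (1 / 3)
# One specific 5* character-color pair within its token type: 10% odds, 3 characters, 3 colors
p_5star_color = 0.1 * (1 / 3) * (1 / 3)
print(round(p_4star_color * 100, 2))  # 1.15
print(round(p_5star_color * 100, 2))  # 1.11
```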

The most common color-cover is Ghost Rider Red at 12 covers (top 99.472 percentile). The other Ghost Rider color-covers are Black at 8 (top 86.637 percentile) and Green at 3 (bottom 26.144 percentile). This means that just within the population of Ghost Rider draws, Red at 12 is top 95.195 percentile assuming a 33.33% color rate, and Green at 3 is bottom 2.648 percentile.

The lowest color-covers are XF Wolverine Black and X-23 Black at 1, both bottom 3.919 percentile.

On the 5* front, OML Yellow and Red, Surfer Purple, Phoenix Green, and Goblin Purple are all tied for least common at 1. For OML and Surfer, this is bottom 27.932 percentile. Most common is BSSM Purple at 6 (top 97.036 percentile).


Here's the TLDR summary:
    • Overall, we're close enough to the 10% 5* drop rate that we do not have enough reason to seriously question it.
    • There were a fair number more 5*s from Classic LTs than Latest LTs, but the difference is not significant.
    • At least one 4* is significantly less likely to drop than average. Others have very high and low rates that are marginally significant. Several color-cover combinations dropped at rates which are also in the range of reasonable significance. Taken together, the distribution of both characters and colors is skewed enough that we should raise serious doubts as to whether or not character draws are actually evenly distributed.

RNGs can be tricky to get right and MPQ runs on a wide variety of platforms (and we've already seen how draws are determined client-side rather than server-side). I'd love to see a dev response on this, but the historical data on that happening strongly suggests that I should not hold my breath.

Comments

  • As always, please let me know if you have comments on the methodology or the analysis. I'm not a scientist nor am I very good at math, so I welcome suggestions for either.

    I'm also thinking about what the next step might be. A larger, more diverse dataset would help bring some of our results past the significance level and help isolate where the skewed results are coming from.

    If the analysis stands, then what? If RNG for cover distribution is, in fact, broken, are we all going to just carry on? Shall this, too, pass?
  • Pylgrim
    Pylgrim Posts: 2,311 Chairperson of the Boards
    If it is true that the draw rate of 4*s is not even, it follows that the "rare" characters are different for each person. I don't have exact numbers, but I know that I have drawn IW in 20-25% of my roughly 80-90 pulls. On the other hand, I've opened exactly 1 Rulk and 2 Iceman. No exact numbers for Star-Lord, but I'd say I've opened him about the expected number of times.
  • Pylgrim wrote:
    If it is true that the draw rate of 4*s is not even, it follows that the "rare" characters are different for each person. I don't have exact numbers, but I know that I have drawn IW in 20-25% of my roughly 80-90 pulls. On the other hand, I've opened exactly 1 Rulk and 2 Iceman. No exact numbers for Star-Lord, but I'd say I've opened him about the expected number of times.

    That's kind of what I suspect also, though I don't have the diversity of data to show that here. As I suggested, any number of things relating to account, device, or software might affect RNG. But it doesn't really matter to me why the rates are off so long as the evidence strongly indicates that they are.
  • Cousin Simpson
    Cousin Simpson Posts: 1,086 Chairperson of the Boards
    Malenkov wrote:
    The lowest color-covers are XF Wolverine Black and X-23 Black at 1, both bottom 3.919 percentile.

    I should hope this would be even lower, since her colors are green, red, and purple. But otherwise, awesome job!
  • simonsez
    simonsez Posts: 4,663 Chairperson of the Boards
    Malenkov wrote:
    Taken together, the distribution of both characters and colors is skewed enough that we should raise serious doubts as to whether or not character draws are actually evenly distributed.
    THANK YOU!

    I feel just like one of those guys who got sprung from jail by the Innocence Project... finally someone believes me...
  • Stax the Foyer
    Stax the Foyer Posts: 941 Critical Contributor
    It's nice to see some confirmation of the streakiness. Whatever causes it, it's a real thing. And believe me, people are trying everything.

    We're probably one new 5* release from full-on cargo cult status.
  • DaveR4470
    DaveR4470 Posts: 931 Critical Contributor
    This is excellent work, and interesting -- but I'd caution you to not draw too many inferences from the data.

    Some quick thoughts:

    -- There's one assumption that we simply don't know is true: you assume that the color distribution is a single probabilistic determination; i.e. there are 78 possible 4* cover pulls (28*3) and you have a 1/78 chance (1.28%) of pulling any one of those covers. That isn't necessarily how it's done -- it could be that you have a 1/28 chance of pulling a given character, and then a 1/3 chance of a given color in that character. The NET probability is the same, but the expected frequency distribution changes somewhat, as there are two randoms in the equation rather than one.

    -- I think your focus on the p-value is a little misplaced. P-values are useful when you do not know the underlying frequency, but you have a theory, and you're testing how well your data conforms to the working theory. If you're doing, say, assays of the effects of a genetic change in a fruit fly, you probably cannot gather 10,000 data points easily. But you can gather a lesser but still significant (from a statistical standpoint) set of data, and see how the observed results compare to the expected ones via the p-value. The more data sets you gather, the more p-values you gather. And the more your p-values show statistically unlikely results, the more likely it is that your underlying hypothesis is invalid.

    When you already know with certainty the underlying frequency, the p-value is just a mildly interesting data point, not an analytical tool. If I flip a coin 25 times and get 24 heads, that's a huge deviation from the expected norm (a tiny p-value). But that doesn't demonstrate anything other than the fact that I got an unlikely set of data -- it does not provide any proof that the odds of heads (for an unloaded coin) are less than or greater than exactly 50%.

    -- 256 tokens is definitely a statistically significant data set to determine whether the 5*/4* allocation - which is basically a binary option -- actually exists as stated. And, as you note, the numbers don't give us any reason to think it's not a basic 10% chance to get a 5*. Buuuut.... as far as distribution of individual 4* covers, where there are either 28 or 78 possible outcomes, it's not NEARLY enough to begin to draw reasonable conclusions. You'd probably need to analyse something like 2000-3000 tokens before the pool was big enough to start to statistically analyze it. Or, you'd need to run a Monte Carlo simulation for 256 draws over a couple thousand iterations, and see where your data set fell vis a vis the expected standard deviation data set.(1)

    So tl;dr you did good work, and definitely statistically demonstrated one conclusion, but going beyond that isn't quite supported by the math, I think.


    (1) There is actually a way to calculate how many simulations/draws you'd need to get a statistically significant outcome for the data given the number of potential outcomes you have, but I.... um.... forget it. It's a regression analysis of the monte carlo itself, I think, and the math gets complex....
  • Malenkov wrote:
    The lowest color-covers are XF Wolverine Black and X-23 Black at 1, both bottom 3.919 percentile.

    I should hope this would be even lower, since her colors are green, red, and purple. But otherwise, awesome job!

    Whoops. I think that might be a data issue; I'll have to check the source files when I get home.
  • notamutant
    notamutant Posts: 855 Critical Contributor
    If you really want to analyze large data sets, just watch my videos of opening legendary tokens for an hour straight.
  • CrookedKnight
    CrookedKnight Posts: 2,579 Chairperson of the Boards
    I'm trying to think of ways to test character "streaks" that would be able to handle the fact that different players accumulate tokens at significantly different paces. Maybe have people record a fixed number of token pulls, as long as they all come before Quake and "Unknown" are added?
  • wirius
    wirius Posts: 667
    I honestly don't understand the reasoning behind doing this. You can't count on accurate or honest reports, and you don't know whether people are giving accurate information, intentionally or otherwise.

    So what's the point? Is it because you're upset about your draw rate and you're hoping that it's because the game is being dishonest? Is this a means of illusory empowerment in a system that is inherently disempowering?

    The system is what it is. They say it's 10%, and if it's less or more, you can't prove it. If it's just an expression of disgust with the reward system, complain about it and stop giving them money until they fix it.

    Things that would be more effective than this project:

    1. Write about the negatives of the system as it is.
    2. Start a petition.
    3. Propose new methods of 5* rewards.
    4. Stop spending money.
    5. Stop playing the game.
    6. Accept that the 5* meta is intentionally a Skinner box designed to milk addiction to the endgame and make money.

    Sorry if this sounds harsh, but I just see this as wasted time and effort that could be better spent combating the problem in other ways.
  • DeNappa
    DeNappa Posts: 1,377 Chairperson of the Boards
    wirius wrote:
    I honestly don't understand the reasoning behind doing this. You can't count on accurate or honest reports, and you don't know whether people are giving accurate information, intentionally or otherwise.
    I don't have any real, physical data to base this on, but my gut feeling says that the number of people who would take the effort to create and send fake data to a fellow MPQ player's completely voluntary poll/research probably approaches 0. icon_mrgreen.gif
    So what's the point? Is it because you're upset about your draw rate and you're hoping that it's because the game is being dishonest? Is this a means of illusory empowerment in a system that is inherently disempowering?
    *snip*
    Sorry if this sounds harsh, but I just see this as wasted time and effort that could be better spent combating the problem in other ways.
    You know, maybe he's just curious... icon_rolleyes.gif
  • wirius
    wirius Posts: 667
    DeNappa wrote:
    You know, maybe he's just curious... icon_rolleyes.gif

    Could be. I noticed there isn't an effort for 2*, 3*, or 4* drop rates on tokens, though. Regardless, I'm being unnecessarily snippy tonight and I came off too harsh. If it's for fun, I don't care. If it's an attempt to prove another point, I personally think it's the wrong way to go about it.
  • jobob
    jobob Posts: 680 Critical Contributor
    Just sent my data. 2/19 5* pulls... not gonna change the data much.
  • snlf25
    snlf25 Posts: 947 Critical Contributor
    simonsez wrote:
    Malenkov wrote:
    Taken together, the distribution of both characters and colors is skewed enough that we should raise serious doubts as to whether or not character draws are actually evenly distributed.
    THANK YOU!

    I feel just like one of those guys who got sprung from jail by the Innocence Project... finally someone believes me...

    NO! Finally someone other than me believes you. icon_mad.gif

    EDIT: my pulls this season:
    Falcon America red
    Falcon America yellow
    X-23 purple
    Cyclops red
    Carnage black
    Red Hulk purple

    all were pulled from classics except rulk
  • wirius wrote:
    I honestly don't understand the reasoning behind doing this. You can't count on accurate or honest reports, and you don't know whether people are giving accurate information, intentionally or otherwise.

    So what's the point? Is it because you're upset about your draw rate and you're hoping that it's because the game is being dishonest? Is this a means of illusory empowerment in a system that is inherently disempowering?

    The system is what it is. They say it's 10%, and if it's less or more, you can't prove it. If it's just an expression of disgust with the reward system, complain about it and stop giving them money until they fix it.

    Things that would be more effective than this project:

    1. Write about the negatives of the system as it is.
    2. Start a petition.
    3. Propose new methods of 5* rewards.
    4. Stop spending money.
    5. Stop playing the game.
    6. Accept that the 5* meta is intentionally a Skinner box designed to milk addiction to the endgame and make money.

    Sorry if this sounds harsh, but I just see this as wasted time and effort that could be better spent combating the problem in other ways.

    For there to be activism, one must be sure that there is a problem. This is necessary research. But yeah, sure, thanks for telling me that I've wasted my time while offering no productive critique and implying that I have some crass and selfish motivation. Because I really wanted those underrepresented Star-Lord covers!
  • DaveR4470 wrote:
    It's a regression analysis of the monte carlo itself, I think, and the math gets complex....

    Well, there's the problem! Thanks for these comments! As I've said, this isn't really my field, so let me think on this and revise my conclusions accordingly. I've got some additional data that's come in anyway.
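    A minimal sketch of the suggested Monte Carlo, assuming 436 draws, a 90% 4* rate, and 26 eligible characters (my own illustration, not an exact model of the game's RNG):

```python
import random

def min_character_count(draws=436, characters=26, p4=0.9, rng=None):
    """Simulate one batch of pulls with a fair RNG; return the rarest 4*'s count."""
    rng = rng or random.Random()
    counts = [0] * characters
    for _ in range(draws):
        if rng.random() < p4:                  # ~90% of pulls are 4*s
            counts[rng.randrange(characters)] += 1
    return min(counts)

# How often does a perfectly fair draw leave some character at 4 or fewer covers?
runs = 2000
hits = sum(min_character_count(rng=random.Random(i)) <= 4 for i in range(runs))
print(hits / runs)  # a few percent at most
```

    If the real data's rarest character falls below what almost all fair simulations produce, that would support the unevenness conclusion.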
  • simonsez
    simonsez Posts: 4,663 Chairperson of the Boards
    wirius wrote:
    Sorry if this sounds harsh, but I just see this as wasted time and effort that could be better spent combating the problem in other ways.
    How the hell do you combat a problem few people want to acknowledge? Sorry if this sounds harsh, but your post sounds like "Stop trying to prove something I refuse to believe!"
  • amusingfoo1
    amusingfoo1 Posts: 597 Critical Contributor
    Forgot to send my data earlier. Now sent. 1/34 5*s, though. But on the plus side, got the last covers I needed to cover-max every 4* except Gwen and Quake. But seriously, my overall pull rate is now down to 8.5%; I'd really like those four extra covers.
  • Mercurywolf
    Based on your statement that 2 contributors provided 3/4 of the data, this is not the best sample for a statistical analysis of random distribution. A better sample would be, say, 50 people each reporting the exact same number of pulls, say 50-100.

    With 2 people providing data on 327 of the 436 pulls (75%), your statistical analysis is off because it doesn't reflect the whole. As has been stated, the 10% rate for 5*s is NOT per person, but an average across the whole. Unfortunately, if the two major contributors were facing less than a 10% distribution, your data is skewed by the weight of their contribution.

    In a case like this, more accurate data is gleaned from having a larger sample size in the form of contributors, with each person contributing the same amount of data. Ex. If I were taking a survey, I would not be able to get good results if I gave 20 people a survey with 3 questions, and a different 5 people a survey with 20 questions. To get the best data, I'd have to give all 25 people the same survey with the exact same questions on each survey.