- [Justin] In the nation of blobs, there's a popular game based around flipping coins. Each blob brings their own coin and they take turns flipping. When a coin comes up heads, the one who flipped it feels happy. And the other one feels sad. That's it. That's the game. It seems kind of simple, but these blobs are a simple folk.

There's a rumor going around that some of the players are using trick coins that come up heads more than half the time. And that's just not fair, so we would like to catch these cheaters. As a warmup, let's have each of these blobs flip its coin five times.

(playful electro-percussive music)

Okay, you might be able to tell that this is an artificial sample. We have results ranging from zero heads all the way up to five heads in a row. Which of these blobs, if any, would you accuse of being a cheater? If you'd like to try your hand at judging the blobs yourself, there is an interactive version I introduced in the last video. Looking at the data from that game, when a blob got five heads out of five flips, it turned out to be a cheater only about 88% of the time. Because of the randomness, it's impossible to be completely sure whether a blob is a cheater. But some approaches are better than others.

During the rest of this video, we're gonna build up one of the main methods of making decisions with limited data. If you like learning new vocabulary terms, you're in for a real treat. The name of this method is frequentist hypothesis testing. We're gonna design a test that this blob detective can use in its day-to-day work searching for cheaters.

We want three things from this test. First, if a player is using a fair coin, we want to have a low chance of wrongly accusing them. Second, if a player is cheating, using an unfair coin, we want to have a high chance of catching them. And third, we want this test to use the smallest number of coins possible. We only have one blob detective, and we want it to be able to test as many players as it can. And it's also nice not to bother the players too much.

We're gonna design that test together, but if you're feeling up for it, it can be good for learning to try things on your own first. So this is your chance to pause and take a moment to think of a test that might satisfy these goals.

Okay, let's take it one flip at a time. It came up heads. The cheaters have heads come up more often, so this blob must be a cheater. Well, no, we can't just call them a cheater after one flip. I mean, we could, but with that policy, we'd wrongly accuse quite a lot of fair players. After all, even if the player is fair, there's a 50% chance that the first flip would come out heads.

So let's see the second flip. Heads again! Cheater? Well, it's more suspicious for sure, but again, we should think about how likely it is for us to see this if the coin is actually fair. There are two possible outcomes for the first flip and two possible outcomes for the second flip. Two heads in a row is one of four possible outcomes that are all equally likely. So the probability of two out of two heads is one fourth, or 25%. Another way to get that number is to multiply the probability values of the two events together. You do have to be careful about multiplying probabilities, since it only works if the two events are independent of each other, and that is the case here, because getting the first heads doesn't make the second heads more or less likely.

Anyway, with a one in four chance of falsely accusing an innocent blob, it still feels a bit too early to accuse the player of cheating. After another heads, this probability is divided by two again. I'm starting to get pretty suspicious, but we'd still accuse one out of eight innocent blobs if we accused after three heads in a row.
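If you'd like to check that arithmetic yourself, here's a minimal Python sketch of the streak probabilities just described (the loop bound of five flips is simply the warmup game's length):

```python
# Chance that a fair coin gives k heads in a row.
# The flips are independent, so the 1/2 probabilities multiply.
for k in range(1, 6):
    p = 0.5 ** k
    print(f"{k} heads in a row: {p:.1%}")
# 1 -> 50.0%, 2 -> 25.0%, 3 -> 12.5%, 4 -> 6.2%, 5 -> 3.1%
```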
We want that rate of false accusations to be as low as we can get it, but we're never gonna get it all the way to zero. It'll always be possible for an innocent blob to get an epic streak of heads and look suspicious. So we have to make a decision about what's good enough. The standard choice here is 5%, or one false accusation out of every 20 fair players. We could choose a different value if we wanted to, but we might as well start here.

Okay, so at this point, we've crossed the threshold. There's only a one in 32, or 3.125%, chance of seeing five heads in a row from a fair coin. So one possible test we could use would be: if a player gets five out of five heads, accuse them of being a cheater. Otherwise, decide they're innocent.

So let's see how this test performs. We're gonna want a lot of data. So let's make a set of 1000 players where half of them are cheaters. Before we see the results, try making some predictions. How often will it wrongfully accuse fair players? And what fraction of the cheaters do you think it'll catch?

Alright, we can divide these blobs into four categories. Fair players the test decided are fair. Fair players we wrongly accused of cheating. Cheaters who got away with it. And cheaters we caught. It looks like we achieved goal one with flying colors. Not only did we accuse fewer than 5% of the fair players, the test did even better than expected. When we use this test in the real world, we won't know how many fair players there are, but seeing how the test performed on this sample, combined with our analysis from before, it feels like we can be pretty confident that we would accuse fewer than 5% of players. We didn't catch very many cheaters, but that's not too surprising. We haven't even thought about them yet, so I'm sure we could do better.
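Here's a rough sketch of how you could replay that trial in Python. The 75% heads rate for the trick coins is an assumption (it's the figure the video reveals shortly), the seed is arbitrary, and the exact tallies will wobble from run to run:

```python
import random

random.seed(0)  # arbitrary seed, just for reproducibility

N_PLAYERS = 1000  # half fair, half cheaters, as in the video's sample
N_FLIPS = 5
P_CHEAT = 0.75    # assumed heads rate for a trick coin

tallies = {"fair, left alone": 0, "fair, accused": 0,
           "cheater, missed": 0, "cheater, caught": 0}

for i in range(N_PLAYERS):
    is_cheater = i < N_PLAYERS // 2
    p_heads = P_CHEAT if is_cheater else 0.5
    heads = sum(random.random() < p_heads for _ in range(N_FLIPS))
    accused = heads == N_FLIPS  # the five-out-of-five rule
    if is_cheater:
        tallies["cheater, caught" if accused else "cheater, missed"] += 1
    else:
        tallies["fair, accused" if accused else "fair, left alone"] += 1

print(tallies)
# Expect roughly 3% of fair players accused (0.5**5 ≈ 3.1%)
# and roughly 24% of cheaters caught (0.75**5 ≈ 23.7%).
```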
Before we make the next version of the test, I think it's worth mentioning some fancy statistics terms. They aren't always necessary, but you might see them around, and like any specialized words, they do make communication easier in some contexts. If a test result says not to accuse a blob of cheating, it's called a negative result, since nothing was found. And when the test does say to accuse a blob of cheating, it's called a positive result. Cheating is a bad thing, so not very positive, but the term positive here is referring to the test saying, "Yes, the thing I'm looking for is here." The same is true for medical tests. If the test finds what it's looking for, the result is called positive, even though it's usually a bad thing.

So we have positive and negative test results, but the results may not agree with reality. When we test a blob that's using a fair coin, the correct result would be negative. So if the test does come up negative, it's called a true negative. And if the test comes out positive, that's wrong, so it's called a false positive. And when we test a cheater, the correct result would be positive, so if the test does come up positive, we call it a true positive. And if the test incorrectly gives a negative result, that's a false negative.

We can also rephrase that first goal using another new term: the false positive rate. This can be a dangerously confusing term though. It's easy to mix up what the denominator should be. False positive rate sounds like you're saying, out of all the positives, what fraction of those positives are false. Or even, out of all the tests, how many are false positives? But really, it's saying, out of all the fair players, how many of them are falsely labeled positive? I've known these words for quite a while, but my brain still automatically interprets it the wrong way basically every time. So to keep things as clear as possible for this video, we'll keep using the longer wording for goal one.

Okay, let's go back to designing the test. We still need to figure out a way to achieve goal number two. Let's start by making the goal more precise. To do that, we need to pick a number for the minimum fraction of cheaters we want to catch. Using the terms from before, we could also call this the minimum true positive rate. But again, let's stick with the plain language. And to throw even more words at you, this minimum is sometimes called the statistical power of the test. It's the power of the test to detect a cheater. The standard target for statistical power is 80%. Just like the 5% number in the first goal, we could pick any value we want here. But let's run with 80% for now, and we'll talk about different choices later on.

Now for calculating what we expect the true positive rate to be. What's the probability that a cheater would get five heads in a row? Take a moment to try that yourself. Okay, that was kind of a trick question. There's no way to calculate that number, since we haven't actually said anything about how often an unfair coin comes up heads. In that trial we just did with 1000 blobs, the cheaters were using coins that land heads 75% of the time. We don't know for sure if that's what the real blobs do. So this 75% is an assumption. But we need some number here to calculate the probabilities, so we gotta run with something. And yet another word: this is called the effect size. In this case, it's the effect of using an unfair coin.

You might be getting annoyed that this is the third time I've said we should just run with some arbitrary number. But what can I tell ya? Some things are uncertain and some things are up to us. The important thing is to remember when we're making an assumption or making a choice. That way we can note our assumptions when we make any conclusions, and we can adjust the test for different choices if we change our minds later.

But now that we have a number, let's do the calculation. If the probability of each heads is 0.75, the probability of five heads in a row is 0.75 to the fifth, or about 24%. So our existing test should catch about 24% of cheaters. And hey, that is pretty close to what we saw in the trial, so everything seems to be fitting together.

But our goal is to catch 80% of cheaters. The current test is a little bit extreme. It requires 100% heads for a positive result. This does make false positives unlikely, which is good, but it also makes true positives unlikely, which is bad. So we're gonna have to think about a test that allows for a mixture of heads and tails. Calculating probabilities for something like this can be a bit confusing though. For example, if we make a new test that requires a blob to flip their coin 10 times, and accuses them of being a cheater if they get seven or more heads, the calculations in that situation are gonna be a lot harder. There are a bunch of ways for there to be seven heads out of 10 flips. And we also have to think about the possibilities of eight, nine, and 10 heads.

To start making sense of this, let's go back to just two flips. With a fair coin, each of these four possible outcomes is equally likely. So the probabilities are one out of four for getting zero heads, two out of four for getting exactly one heads, and one out of four for getting two heads. But with an unfair coin that favors heads, they're skewed toward results with more heads. With three flips, there are eight possibilities total, with four possible numbers of heads.
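For small numbers of flips, you can reproduce those bar graphs by brute force. Here's a sketch (the function name is just for illustration) that lists every possible sequence and adds up the probability of each number of heads:

```python
from itertools import product

def heads_distribution(n_flips, p_heads):
    """Enumerate every sequence of flips and add up the
    probability of each possible number of heads."""
    dist = {k: 0.0 for k in range(n_flips + 1)}
    for outcome in product("HT", repeat=n_flips):
        heads = outcome.count("H")
        prob = (p_heads ** heads) * ((1 - p_heads) ** (n_flips - heads))
        dist[heads] += prob
    return dist

print(heads_distribution(2, 0.5))   # {0: 0.25, 1: 0.5, 2: 0.25}
print(heads_distribution(3, 0.75))  # skewed toward more heads
```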
As we add more and more flips, it quickly becomes quite a chore to list out all the possible outcomes and add up the probabilities. But there is a pattern to it, so thankfully there's a formula for cases like this, called the binomial distribution. It's not as scary as it looks, but still, a full explanation deserves its own video. I'll put some links about this in the description, but for now, just know that this formula is what we're using to make these bar graphs, and it follows the same pattern we used for two flips and three flips.

Now let's go back to our test rule from before, where we accuse a player if they get five out of five heads. We can show the rule on these graphs by drawing a vertical line that separates the positive test results on the right from the negative test results on the left. On the fair player graph, the bars to the left represent the true negatives, or the innocent blobs we leave alone, and to the right are the false positives, the fair players we wrongfully accuse. And on the cheater graph, the bars to the left represent the false negatives, the cheaters who evade our detection, and the bars to the right are the true positives, the cheaters we catch.

Just like before, we can see that this test satisfies our first goal of accusing less than 5% of the fair players we test, on average. But it doesn't satisfy our second goal of catching at least 80% of the cheaters we test, again, on average. But now that we have these graphs, we can see what happens when we change the number of heads. If we lower the threshold to four or more heads out of five, we don't meet either requirement. If we keep lowering the threshold, it can allow us to meet goal two, catching more than 80% of the cheaters, but then we accuse even more fair blobs, so that won't work.

Apparently, if we want to meet both goals at the same time, we're gonna need more flips. If we put these graphs right next to each other, we can see that the blue and red distributions overlap quite a lot. So it's impossible to make a test that reliably separates fair players from cheaters. But if we increase the number of flips to, say, 100, now there's a big gap between the distributions, so it's easy to find a line that separates them. But we also have this third goal of using as few coin flips as possible, so we should try to find a happy medium somehow.

Since we already have the computer set up to run the numbers, we can go back to five flips and just keep trying different thresholds with more and more flips until we find a test rule that works. It turns out that the smallest test that meets our first two goals has a blob flip its coin 23 times, and the blob is accused of being a cheater if they get 16 or more heads. That's more than I would've guessed at the start, but it's not so, so huge, so, it'll do.

Alright, let's use this to test a few blobs. This blob got 17 heads. That fits our rule of 16 or more, so according to that test, we should call this blob a cheater. There is another term worth mentioning here. Assuming this blob is innocent, the probability that they'd get 17 or more heads is about 1.7%. We call this 1.7% the P value for this test result. It's kind of like a false positive rate for a single test result. Kind of. 1.7% is below the 5% we set as our threshold, so according to the test, we call this one a cheater. And looking at it from the other direction, if the blob is cheating, using a coin that comes up heads 75% of the time, there's a 65% chance that they'd get 17 or more heads. Another way to say it is that they're in the top 65% of results we'd expect from cheaters. So if we wanna catch 80% of the cheaters, we'd better call this one a cheater.
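Those two tail probabilities are one-liners once you have the binomial distribution. Here's a sketch assuming SciPy is available; `binom.sf(k, n, p)` is SciPy's survival function, P(X > k):

```python
from scipy.stats import binom

# The binomial distribution: P(X = k) = C(n, k) * p**k * (1 - p)**(n - k)
n = 23

# P value for 17 heads: how often a FAIR coin gives 17 or more.
# sf(k, n, p) returns P(X > k), so pass 16 to get P(X >= 17).
print(binom.sf(16, n, 0.5))   # ≈ 0.017, the 1.7% from the video

# The other direction: how often a 75%-heads cheater gives 17 or more.
print(binom.sf(16, n, 0.75))  # ≈ 0.65
```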
Okay, let's try it with one more blob. This one got 13 heads. This is more than half of the 23 flips, so it's tempting to call it a cheater. But 13 is below the 16 heads the test requires for a positive result, so we call it a fair player. The P value of this result is about 34%. So if we accuse players with results like this, we'd expect to wrongly accuse about 34% of fair players. That's well beyond our 5% tolerance, so we can't call it a cheater. And looking at it from the other direction, if it were a cheater, there would be about a 99% chance that they'd get this many heads or more. We don't have to catch 99% of the cheaters to hit our 80% goal, so we can still meet that goal if we let this one off the hook.

Is the first one really a cheater? Is that second one really playing fair? We can't know for sure, but based on how we designed our test, we should expect to catch at least 80% of the cheaters, and falsely accuse less than 5% of the fair players.

So now let's see how this test does on another group of 1000 blobs. Like before, half the blobs in this group are using a trick coin that has a 75% probability of landing heads. Okay, the results do look about right. We accused less than 5% of the fair players, and we caught more than 80% of the cheaters.

5% and 80% are the normal numbers for historical reasons. So we could make different decisions if we like. Maybe we decide that we really do not want to bother the blobs who are playing fairly. So we wanna lower the false positive rate to 1%. To achieve this with 23 flips, we'd have to raise the heads threshold to 18 heads. This would lower the fraction of cheaters we catch to about 47%, though. If we don't want to increase the number of flips, we could decide we're okay with that 47%. Maybe we just want cheating to feel risky, so 47% is good enough. Or, if we still want to catch 80% of the cheaters we test, we could increase the number of flips until we find a test that achieves both of those goals. We could also be super hardcore and go for a 99% true positive rate and a 1% false positive rate. But we'd have to flip the coin 80 times to get to that level. We'll always be able to set two of these goals in stone, but that'll limit how well we can do on the third goal. How to set these goals depends on which trade-offs we're willing to make. For the rest of this video though, we're just gonna go with the standard 5% and 80%.
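This threshold-hunting is easy to automate. Here's a sketch of the search (the function name and the 500-flip cap are mine, not the video's) that recovers the 23-flip, 16-head test and lets you explore the trade-offs just mentioned:

```python
from scipy.stats import binom

def smallest_test(max_false_accusal, min_catch_rate, p_cheat, max_flips=500):
    """Return the fewest flips n and heads threshold k such that accusing
    at k-or-more heads wrongly accuses at most max_false_accusal of fair
    players and catches at least min_catch_rate of cheaters."""
    for n in range(1, max_flips + 1):
        for k in range(n + 1):
            fair_accused = binom.sf(k - 1, n, 0.5)      # P(>= k heads | fair)
            cheat_caught = binom.sf(k - 1, n, p_cheat)  # P(>= k heads | cheater)
            if fair_accused <= max_false_accusal and cheat_caught >= min_catch_rate:
                return n, k
    return None  # no test small enough within max_flips

print(smallest_test(0.05, 0.80, 0.75))  # (23, 16), the test from the video
print(smallest_test(0.01, 0.99, 0.75))  # the "super hardcore" goals: ~80 flips

# Keeping 23 flips but tightening false accusations to 1% pushes the
# threshold to 18 heads and the catch rate down to about 47%:
print(binom.sf(17, 23, 0.5), binom.sf(17, 23, 0.75))
```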
Now that we've settled on the goals we're going for, and we have a test that seems to be achieving those goals, let's test one more set of blobs. To pretend these are real blobs and not some artificial sample, I'm not going to tell you anything about this group, except that there are 1000 of them. How do you think this test will do on this more mysterious group? Will it manage to accuse fewer than 5% of the fair players? And will it catch 80% of the cheaters? At this point in the video, it would be easy to get lazy and not actually make the predictions. But if I'm asking you these questions yet again, something must be about to go wrong, right? Or maybe I'm just pretending so you'll engage a little more. Who can say? But really, what do you think?

Okay, so we labeled about a fifth of them as cheaters, which is a bit less than before. If this were the real world, that's all you would get. You wouldn't get to see who was really cheating and who was really innocent to get confirmation that the test is working as expected. I mean, maybe you could, but it would take more testing. You couldn't do it with this test alone. But because this is a computer simulation, I do know the full truth. This group was 90% cheaters. We still accused less than 5% of the fair players, but we only caught about a quarter of the cheaters. Something went wrong.

The problem is that we assumed that the cheater coins came up heads 75% of the time. And that assumption was wrong. The real-world cheaters were using coins that came up heads 60% of the time. If we knew that from the beginning, we still could have designed a test to achieve our goals, but it would need 158 flips and require 90 heads to reach those same thresholds, which is honestly way more coin flips than I was expecting. But in hindsight, it's not that surprising that we need a lot of data to tease out that smaller difference. But we didn't design that test, because we got the effect size wrong. I know, I know, I was the one who said we should assume 75%. But be honest with yourself. Did you remember that assumption when making your prediction? It's very easy to forget that assumptions are assumptions, and instead just treat them as facts. This concludes me tricking you to try to teach you a lesson, but they really are easy mistakes to make in real life. On the bright side, though, our test did succeed at accusing less than 5% of the fair players.

The framework we built up here isn't just good for catching unfair coins. It's the dominant framework used in actual scientific studies. To summarize: we take a yes-or-no question. In this case, our question was, is this particular blob using a biased coin? But it could be any question. Then we come up with a model for what kinds of results we'd expect if the answer is yes, and if the answer is no. Then we come up with a test that can do a decent job of telling those two situations apart, according to the models. The details are usually a bit more complicated, since most real-world systems that we wanna study are more complicated than coin flips. But most scientific studies have this framework at their core. Like I mentioned at the beginning, this is called frequentist hypothesis testing. There's another method called Bayesian hypothesis testing, which we'll look at in the next video in this series. See you then.