Lecture 19: Weak Law of Large Numbers | Video Lectures | Probabilistic Systems Analysis and Applied Probability | Electrical Engineering and Computer Science

Description: In this lecture, the professor discussed limit theorems, Chebyshev's inequality, and convergence "in probability".

Instructor: John Tsitsiklis

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JOHN TSITSIKLIS: We're going to start a new unit today, so we will be talking about limit theorems. So just to introduce the topic, let's think of the following situation. There's a population of penguins down at the South Pole. And if you were to pick a penguin at random and measure its height, the expected value of its height would be the average of the heights of the different penguins in the population. So suppose when you pick one, every penguin is equally likely. Then the expected value is just the average of all the penguins out there.

So your boss asks you to find out what that expected value is. One way would be to go and measure each and every penguin. That might be a little time consuming. So alternatively, what you can do is to go and pick penguins at random, pick a few of them, let's say a number n of them. So you measure the height of each one. And then you calculate the average of the heights of those penguins that you have collected. So this is your estimate of the expected value.

Now, we call this the sample mean, which is the mean value, but within the sample that you have collected. This is something that sort of feels the same as the expected value, which is, again, a mean. But the expected value is a different kind of mean. The expected value is the mean over the entire population, whereas the sample mean is the average over the smaller sample that you have measured.

The expected value is a number. The sample mean is a random variable. It's a random variable because the sample you have collected is random.

Now, we think that this is a reasonable way of estimating the expectation. So in the limit as n goes to infinity, it's plausible that the sample mean, the estimate that we are constructing, should somehow get close to the expected value. What does this mean? What does it mean to get close? In what sense? And is this statement true?

This is the kind of statement that we deal with when dealing with limit theorems. That's the subject of limit theorems: what happens when you're dealing with lots and lots of random variables, and perhaps taking averages and so on.

So why do we bother about this? Well, if you're in the sampling business, it would be reassuring to know that this particular way of estimating the expected value actually gets you close to the true answer. There's also a higher level reason, which is a little more abstract and mathematical. So probability problems are easy to deal with if you have in your hands one or two random variables. You can write down their mass functions, joint density functions, and so on. You can calculate on paper or on a computer, and you can get the answers.

Probability problems become computationally intractable if you're dealing, let's say, with 100 random variables and you're trying to get the exact answers for anything. So in principle, the same formulas that we have, they still apply. But they involve summations over large ranges of combinations of indices. And that makes life extremely difficult.

But when you push the envelope and you go to a situation where you're dealing with a very, very large number of variables, then you can start taking limits. And when you take limits, wonderful things happen. Many formulas start simplifying, and you can actually get useful answers by considering those limits. And that's sort of the big reason why looking at limit theorems is a useful thing to do.

So what we're going to do today, first we're going to start with a useful, simple tool that allows us to relate probabilities to expected values. The Markov inequality is the first inequality we're going to write down. And then using that, we're going to get Chebyshev's inequality, a related inequality.

Then we need to define what do we mean by convergence when we talk about random variables. It's a notion that's a generalization of the notion of the usual convergence of limits of a sequence of numbers. And once we have our notion of convergence, we're going to see that, indeed, the sample mean converges to the true mean, converges to the expected value of the X's. And this statement is called the weak law of large numbers.

The reason it's called the weak law is because there's also a strong law, which is a statement with the same flavor, but with a somewhat different mathematical content. But it's a little more abstract, and we will not be getting into this. So the weak law is all that you're going to get.

All right. So now we start our digression. And our first tool will be the so-called Markov inequality. So let's take a random variable that's always non-negative. No matter what, it takes no negative values. To keep things simple, let's assume it's a discrete random variable. So the expected value is the sum over all possible values that the random variable can take, with those values weighted according to their corresponding probabilities.

Now, this is a sum over all x's. But x takes non-negative values. And the PMF is also non-negative. So if I take a sum over fewer things, I'm going to get a smaller value. So the sum when I add over everything is less than or equal to the sum that I will get if I only add those terms that are bigger than a certain constant.

Now, if I'm adding over x's that are bigger than a, the x that shows up up there will always be larger than or equal to a. So we get this inequality. And now, a is a constant. I can pull it outside the summation. And then I'm left with the probabilities of all the x's that are bigger than a. And that's just the probability of being bigger than a.

OK, so that's the Markov inequality. It basically tells us that the expected value is larger than or equal to this number, a times the probability that X is at least a. It relates expected values to probabilities. It tells us that if the expected value is small, then the probability that X is big is also going to be small. So it translates a statement about smallness of expected values to a statement about smallness of probabilities.
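The chain of inequalities described above can be written out as follows. This is a reconstruction from the spoken derivation, since the slide itself is not part of the transcript. The symbols are assumptions chosen to match the wording: X is a non-negative discrete random variable with PMF p_X, and a is a positive constant.

```latex
\begin{align*}
\mathbf{E}[X] &= \sum_{x} x \, p_X(x) \\
  &\ge \sum_{x \ge a} x \, p_X(x)
      && \text{(dropping non-negative terms)} \\
  &\ge \sum_{x \ge a} a \, p_X(x)
      && \text{(each remaining } x \text{ is at least } a\text{)} \\
  &= a \, \mathbf{P}(X \ge a).
\end{align*}
```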

OK. What we actually need is a somewhat different version of this same statement. And what we're going to do is to apply this inequality to a non-negative random variable of a special type. And you can think of applying this same calculation to a random variable of this form, (X minus mu)-squared, where mu is the expected value of X.

Now, this is a non-negative random variable. So the expected value of this random variable, which is the variance, by following the same thinking as we had in that derivation up there, is bigger than or equal to the probability that this random variable is bigger than or equal to some number -- let me use a-squared instead of a -- times the value a-squared.

So now of course, this probability is the same as the probability that the absolute value of X minus mu is bigger than or equal to a, and it is multiplied by a-squared. And this side is equal to the variance of X. So this relates the variance of X to the probability that our random variable is far away from its mean. If the variance is small, then it means that the probability of being far away from the mean is also small.

So I derived this by applying the Markov inequality to this particular non-negative random variable. Or just to reinforce, perhaps, the message, and increase your confidence in this inequality, let's just look at the derivation once more, where I'm going, here, to start from first principles, but use the same idea as the one that was used in the proof out here.

OK. So just for variety, now let's think of X as being a continuous random variable. The derivation is the same whether it's discrete or continuous. So by definition, the variance is the integral, is this particular integral. Now, the integral is going to become smaller if I integrate, instead of integrating over the full range, I only integrate over x's that are far away from the mean. So mu is the mean. Think of c as some big number.

These are x's that are far away from the mean to the left, from minus infinity to mu minus c. And these are the x's that are far away from the mean on the positive side. So by integrating over fewer stuff, I'm getting a smaller integral.

Now, for any x in this range, this distance, x minus mu, is at least c. So that squared is at least c squared. So this term over this range of integration is at least c squared. So I can take it outside the integral. And I'm left just with the integral of the density. Same thing on the other side.

And so what factors out is this term c squared. And inside, we're left with the probability of being to the left of mu minus c, and then the probability of being to the right of mu plus c, which is the same as the probability that the absolute value of the distance from the mean is larger than or equal to c. So that's the same inequality that we proved there, except that here I'm using c. There I used a, but it's exactly the same one.
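For reference, here is a sketch of the continuous derivation just described, reconstructed from the spoken steps because the slide is not available here. The notation is an assumption: f_X is the density of X, mu its mean, sigma-squared its variance, and c a positive constant.

```latex
\begin{align*}
\sigma^2 &= \int_{-\infty}^{\infty} (x-\mu)^2 f_X(x)\,dx \\
  &\ge \int_{-\infty}^{\mu-c} (x-\mu)^2 f_X(x)\,dx
     + \int_{\mu+c}^{\infty} (x-\mu)^2 f_X(x)\,dx \\
  &\ge c^2 \int_{-\infty}^{\mu-c} f_X(x)\,dx
     + c^2 \int_{\mu+c}^{\infty} f_X(x)\,dx \\
  &= c^2 \, \mathbf{P}\big(|X-\mu| \ge c\big),
\end{align*}
\qquad\text{so}\qquad
\mathbf{P}\big(|X-\mu| \ge c\big) \le \frac{\sigma^2}{c^2}.
```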

This inequality is maybe easier to understand if you take that term and send it to the other side and write it in this form. What does it tell us? It tells us that if c is a big number, the probability of being more than c away from the mean is going to be a small number. When c is big, this is small.

Now, this is intuitive. The variance is a measure of the spread of the distribution, how wide it is. It tells us that if the variance is small, the distribution is not very wide. And mathematically, this translates to this statement that when the variance is small, the probability of being far away is going to be small. And the further away you're looking, that is, if c is a bigger number, that probability also becomes small.

Maybe an even more intuitive way to think about the content of this inequality is to, instead of c, use the number k sigma, where k is positive and sigma is the standard deviation. So let's just plug k sigma in the place of c. So this becomes k-squared sigma-squared. The sigma-squareds cancel. We're left with 1 over k-squared.

Now, what is this? This is the event that you are k standard deviations away from the mean. So for example, this statement here tells you that if you look at the test scores from a quiz, what fraction of the class are 3 standard deviations away from the mean? It's possible, but it's not going to be a lot of people. It's going to be at most, 1/9 of the class that can be 3 standard deviations or more away from the mean.
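To see the 1 over k-squared bound next to an actual tail probability, here is a small sketch in Python. The score distribution is made up purely for illustration (it is not from the lecture), and the helper names `chebyshev_bound` and `exact_tail` are hypothetical. Exact fractions are used so that the comparison does not depend on rounding.

```python
from fractions import Fraction

def chebyshev_bound(k):
    """Upper bound 1/k^2 on P(|X - mu| >= k*sigma), for k > 0."""
    return 1 / Fraction(k) ** 2

def exact_tail(pmf, k):
    """Exact P(|X - mu| >= k*sigma) for a finite PMF given as {value: probability}."""
    mu = sum(x * p for x, p in pmf.items())
    var = sum((x - mu) ** 2 * p for x, p in pmf.items())
    # Compare squared distances so that no square root is needed.
    return sum(p for x, p in pmf.items() if (x - mu) ** 2 >= Fraction(k) ** 2 * var)

# Hypothetical quiz-score distribution, invented for this example only.
scores = {0: Fraction(1, 20), 50: Fraction(9, 10), 100: Fraction(1, 20)}

for k in (1, 2, 3):
    print(k, exact_tail(scores, k), chebyshev_bound(k))
```

For this particular distribution the tail probability at 3 standard deviations is 1/10, which sits just under the bound of 1/9. For most distributions the gap is much larger, since the bound has to hold for every distribution with that variance.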

So the Chebyshev inequality is a really useful one. It comes in handy whenever you want to relate probabilities and expected values. So if you know that your expected values or, in particular, that your variance is small, this tells you something about tail probabilities.

So this is the end of our first digression. We have this inequality in our hands. Our second digression is to talk about limits. We want to eventually talk about limits of random variables, but as a warm up, we're going to start with limits of sequences.

So you're given a sequence of numbers, a1, a2, a3, and so on. And we want to define the notion that a sequence converges to a number. You sort of know what this means, but let's just go through it some more. So here's a. We have our sequence of values as n increases.

What we mean by the sequence converging to a is that when you look at those values, they get closer and closer to a. So this value here is your typical a sub n. They get closer and closer to a, and they stay close. So let's try to make that more precise.

What it means is let's fix a sense of what it means to be close. Let me look at an interval that goes from a - epsilon to a + epsilon. Then if my sequence converges to a, this means that as n increases, eventually the values of the sequence that I get stay inside this band. Since they converge to a, this means that eventually they will be smaller than a + epsilon and bigger than a - epsilon.

So convergence means that given a band of positive length around the number a, the values of the sequence that you get eventually get inside and stay inside that band. So that's sort of the picture definition of what convergence means. So now let's translate this into a mathematical statement.

Given a band of positive length, no matter how wide that band is or how narrow it is, so for every epsilon positive, eventually the sequence gets inside the band. What does eventually mean? There exists a time, so that after that time something happens. And the something that happens is that after that time, we are inside that band.

So this is a formal mathematical definition, which actually translates what I was telling in the wordy way before, and showing in terms of the picture. Given a certain band, even if it's narrow, eventually, after a certain time n0, the values of the sequence are going to stay inside this band.

Now, if I were to take epsilon to be very small, this thing would still be true that eventually I'm going to get inside of the band, except that I may have to wait longer for the values to get inside here. All right, that's what it means for a deterministic sequence to converge to something.
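As a concrete illustration of "eventually inside the band", here is a minimal sketch for one specific sequence, a_n = a + 1/n. The sequence is chosen here as an example and is not from the lecture. The hypothetical function `first_index_inside_band` returns the smallest n0 that works for a given epsilon, and a narrower band gives a larger n0.

```python
import math
from fractions import Fraction

def first_index_inside_band(epsilon):
    """Smallest n0 such that |a_n - a| <= epsilon for all n >= n0, when a_n = a + 1/n.

    For this sequence |a_n - a| = 1/n, which decreases with n, so once the
    sequence enters the band it stays inside.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return max(1, math.ceil(1 / epsilon))

for eps in (Fraction(1, 10), Fraction(1, 100), Fraction(1, 1000)):
    print(eps, first_index_inside_band(eps))
```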

Now, how about random variables? What does it mean for a sequence of random variables to converge to a number? We're just going to twist the word definition a little bit.

For numbers, we said that eventually the numbers get inside that band. But if instead of numbers we have random variables with a certain distribution, so here instead of a_n we're dealing with a random variable that has a distribution, let's say, of this kind, what we want is that this distribution gets inside this band, so it gets concentrated inside here. What does it mean that the distribution gets inside this band?

I mean a random variable has a distribution. It may have some tails, so maybe not the entire distribution gets concentrated inside of the band. But we want that more and more of this distribution is concentrated in this band. So that -- in a sense that -- the probability of falling outside the band converges to 0 -- becomes smaller and smaller.

So in words, we're going to say that a sequence of random variables or a sequence of probability distributions, that would be the same, converges to a particular number a if the following is true. If I consider a small band around a, then the probability that my random variable falls outside this band, which is the area under this curve, this probability becomes smaller and smaller as n goes to infinity. The probability of being outside this band converges to 0. So that's the intuitive idea.

So in the beginning, maybe our distribution is sitting everywhere. As n increases, the distribution starts to get concentrated inside the band. When n is even bigger, our distribution is even more inside that band, so that these outside probabilities become smaller and smaller.

So the corresponding mathematical statement is the following. I fix a band around a, a +/- epsilon. Given that band, the probability of falling outside this band, this probability converges to 0. Or another way to say it is that the limit of this probability is equal to 0.

If you were to translate this into a complete mathematical statement, you would have to write down the following messy thing. For every epsilon positive -- that's this statement -- the limit is 0.

What does it mean that the limit of something is 0? We flip back to the previous slide. Why? Because a probability is a number. So here we're talking about a sequence of numbers converging to 0.

What does it mean for a sequence of numbers to converge to 0? It means that for any epsilon prime positive, there exists some n0 such that for every n bigger than n0 the following is true -- that this probability is less than or equal to epsilon prime.

So the mathematical statement is a little hard to parse. For every size of that band, you take the definition of what it means for a sequence of numbers to converge to 0. But it's a lot easier to describe this in words and, basically, think in terms of this picture. That as n increases, the probability of falling outside those bands just becomes smaller and smaller. So the statement is that our distribution gets concentrated in arbitrarily narrow little bands around that particular number a.
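Written out, the definition described in words above reads as follows. This is a reconstruction from the spoken description rather than a copy of the slide; Y_n denotes the sequence of random variables and a the limit.

```latex
% Convergence in probability of Y_n to the number a:
\forall \epsilon > 0: \quad
\lim_{n \to \infty} \mathbf{P}\big(|Y_n - a| \ge \epsilon\big) = 0.

% The same statement with the limit unpacked:
\forall \epsilon > 0,\ \forall \epsilon' > 0,\ \exists\, n_0
\ \text{such that}\ \forall n \ge n_0: \quad
\mathbf{P}\big(|Y_n - a| \ge \epsilon\big) \le \epsilon'.
```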

OK. So let's look at an example. Suppose a random variable Yn has a discrete distribution of this particular type. Does it converge to something? Well, the probability distribution of this random variable gets concentrated at 0 -- there's more and more probability of being at 0.

If I fix a band around 0 -- so if I take the band from minus epsilon to epsilon and look at that band-- the probability of falling outside this band is 1/n. As n goes to infinity, that probability goes to 0. So in this case, we do have convergence. And Yn converges in probability to the number 0. So this just captures the facts obvious from this picture, that more and more of our probability distribution gets concentrated around 0, as n goes to infinity.

Now, an interesting thing to notice is the following, that even though Yn converges to 0, if you were to write down the expected value for Yn, what would it be? It's going to be n times the probability of this value, which is 1/n. So the expected value turns out to be 1. And if you were to look at the expected value of Yn-squared, this would be 0 times this probability, and then n-squared times this probability, which is equal to n. And this actually goes to infinity.

So we have this, perhaps, strange situation where a random variable goes to 0, but the expected value of this random variable does not go to 0. And the second moment of that random variable actually goes to infinity. So this tells us that convergence in probability tells you something, but it doesn't tell you the whole story. Convergence to 0 of a random variable doesn't imply anything about convergence of expected values or of variances and so on.

So the reason is that convergence in probability tells you that this tail probability here is very small. But it doesn't tell you how far does this tail go. As in this example, the tail probability is small, but that tail acts far away, so it gives a disproportionate contribution to the expected value or the expected value squared.
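The numbers in this example can be checked with a short sketch. The distribution on the slide is not reproduced in the transcript, so the PMF below is an assumption inferred from the spoken calculation: Yn equals 0 with probability 1 - 1/n and equals n with probability 1/n. The function names are hypothetical.

```python
from fractions import Fraction

def y_pmf(n):
    """Assumed PMF of Y_n: value 0 with prob. 1 - 1/n, value n with prob. 1/n."""
    return {0: 1 - Fraction(1, n), n: Fraction(1, n)}

def prob_outside_band(n, epsilon):
    """P(|Y_n - 0| >= epsilon), the probability of falling outside the band around 0."""
    return sum(p for y, p in y_pmf(n).items() if abs(y) >= epsilon)

def moment(n, power):
    """E[Y_n ** power], computed directly from the PMF."""
    return sum(y ** power * p for y, p in y_pmf(n).items())

for n in (1, 10, 100, 1000):
    print(n, prob_outside_band(n, Fraction(1, 2)), moment(n, 1), moment(n, 2))
```

Under that assumption, the first printed column after n shrinks like 1/n, the mean stays at 1, and the second moment grows like n, which is the behavior described above.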

OK. So now we've got everything that we need to go back to the sample mean and study its properties. So the setting is that we have a sequence of random variables. They're independent. They have the same distribution. And we assume that they have a finite mean and a finite variance. We're looking at the sample mean.

Now in principle, you can calculate the probability distribution of the sample mean, because we know how to find the distributions of sums of independent random variables. You use the convolution formula over and over. But this is pretty complicated, so let's not look at that. Let's just look at expected values, variances, and the probabilities that the sample mean is far away from the true mean.

So what is the expected value of this random variable? The expected value of a sum of random variables is the sum of the expected values. And then we have this factor of n in the denominator. Each one of these expected values is mu, so we get mu. So the sample mean, the average value of this Mn in expectation is the same as the true mean inside our population.

Now here, this is a fine conceptual point, there's two kinds of averages involved when you write down this expression. We understand that expectations are some kind of average. The sample mean is also an average over the values that we have observed.

But it's two different kinds of averages. The sample mean is the average of the heights of the penguins that we collected over a single expedition. The expected value is to be thought of as follows, my probabilistic experiment is one expedition to the South Pole. Expected value here means thinking on the average over a huge number of expeditions.

So my expedition is a random experiment, I collect random samples, and I record Mn. The average result of an expedition is what we would get if we were to carry out a zillion expeditions and average the averages that we get at each particular expedition. So this Mn is the average during a single expedition. This expectation is the average over an imagined infinite sequence of expeditions. And of course, the other thing to always keep in mind is that expectations give you numbers, whereas the sample mean is actually a random variable.

All right. So this random variable, how random is it? How big is its variance? So the variance of a sum of independent random variables is the sum of the variances, which here is n sigma-squared. But since we're dividing by n, when you calculate variances this brings in a factor of 1 over n-squared. So the variance is sigma-squared over n.

And in particular, the variance of the sample mean becomes smaller and smaller. It means that when you estimate the average height of penguins, if you take a large sample, then your estimate is not going to be too random. The randomness in your estimates becomes small if you have a large sample size. Having a large sample size kind of removes the randomness from your experiment.
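The "many expeditions" picture can be simulated directly. The sketch below is only an illustration under stated assumptions: each measurement is drawn uniformly from [0, 1] (so sigma-squared is 1/12), which is a stand-in chosen for convenience and has nothing to do with actual penguin heights. The function names are hypothetical.

```python
import random
import statistics

def sample_mean(n, rng):
    """M_n: average of n independent draws, uniform on [0, 1] (mean 1/2, variance 1/12)."""
    return sum(rng.random() for _ in range(n)) / n

def empirical_variance_of_sample_mean(n, expeditions, rng):
    """Repeat the n-sample experiment many times and measure how much M_n varies."""
    return statistics.pvariance([sample_mean(n, rng) for _ in range(expeditions)])

rng = random.Random(0)  # fixed seed so that a run is repeatable
for n in (1, 10, 100):
    estimate = empirical_variance_of_sample_mean(n, 5000, rng)
    # Second printed number is the theoretical value sigma^2 / n for comparison.
    print(n, estimate, (1 / 12) / n)
```

The simulated variance should track sigma-squared over n closely, though not exactly, since it is itself an estimate from a finite number of repetitions.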

Now let's apply the Chebyshev inequality to say something about tail probabilities for the sample mean. The probability that you are more than epsilon away from the true mean is less than or equal to the variance of this quantity divided by this number squared. So that's just the translation of the Chebyshev inequality to the particular context we've got here. We found the variance. It's sigma-squared over n. So we end up with this expression.

So what does this expression do? For any given epsilon, if I fix epsilon, then this probability, which is less than sigma-squared over n epsilon-squared, converges to 0 as n goes to infinity. And this is just the definition of convergence in probability. If this happens, that the probability of being more than epsilon away from the mean, that probability goes to 0, and this is true no matter how I choose my epsilon, then by definition we have convergence in probability.

So we have proved that the sample mean converges in probability to the true mean. And this is what the weak law of large numbers tells us. So in some vague sense, it tells us that when you take the average of many, many measurements in your sample, then the sample mean is a good estimate of the true mean in the sense that it approaches the true mean as your sample size increases. It approaches the true mean, but of course in a very specific sense, in probability, according to this notion of convergence that we have used.
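Collected in one place, the argument for the weak law runs as follows. This is a summary reconstructed from the spoken steps, with assumed notation: X_1, ..., X_n are i.i.d. with mean mu and variance sigma-squared, and M_n is the sample mean.

```latex
\begin{align*}
M_n &= \frac{X_1 + \cdots + X_n}{n}, \\
\mathbf{E}[M_n] &= \frac{n\mu}{n} = \mu, \\
\operatorname{var}(M_n) &= \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}, \\
\mathbf{P}\big(|M_n - \mu| \ge \epsilon\big)
  &\le \frac{\operatorname{var}(M_n)}{\epsilon^2}
   = \frac{\sigma^2}{n\epsilon^2}
   \;\longrightarrow\; 0
   \quad \text{as } n \to \infty, \text{ for every fixed } \epsilon > 0.
\end{align*}
```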

So since we're talking about sampling, let's go over an example, which is the typical situation faced by someone who's constructing a poll. So you're interested in some property of the population. So what fraction of the population prefers Coke to Pepsi? So there's a number f, which is that fraction of the population. And so this is an exact number. So if out of a population of 100 million, 20 million prefer Coke, then f would be 0.2.

We want to find out what that fraction is. We cannot ask everyone. What we're going to do is to take a random sample of people and ask them for their preferences. So the ith person either says yes for Coke or no. And we record that by putting a 1 each time that we get a yes answer.

And then we form the average of these x's. What is this average? It's the number of 1's that we got divided by n. So this is a fraction, but calculated only on the basis of the sample that we have. So you can think of this as being an estimate, f_hat, based on the sample that we have.

Now, even though we used the lower case letter here, this f_hat is, of course, a random variable. f is a number. This is the true fraction in the overall population. f_hat is the estimate that we get by using our particular sample.

OK. So your boss told you, I need to know what f is, so go and do some sampling. What are you going to respond? Unless I ask everyone in the whole population, there's no way for me to know f exactly. Right? There's no way.

OK, so the boss tells you, well OK, then tell me f within an accuracy. I want an answer from you, that's your answer, which is close to the correct answer within 1 percentage point. So if the true f is 0.4, your answer should be somewhere between 0.39 and 0.41. I want a really accurate answer.

What are you going to say? Well, there's no guarantee that my answer will be within 1 percentage point. Maybe I'm unlucky and I just happen to sample the wrong set of people and my answer comes out to be wrong. So I cannot give you a hard guarantee that this inequality will be satisfied.

But perhaps, I can give you a guarantee that this inequality will be satisfied, this accuracy requirement will be satisfied, with high confidence. That is, there's going to be a small probability that things go wrong, that I'm unlucky and I use a bad sample. But leaving aside that small probability of being unlucky, my answer will be accurate within the accuracy requirement that you have.

So these two numbers are the usual specs that one has when designing polls. So this number is the accuracy that we want. It's the desired accuracy. And this number has to do with the confidence that we want. So 1 minus that number, we could call it the confidence that we want out of our sample. So this is really 1 minus confidence.

So now your job is to figure out how large an n, how large a sample should you be using, in order to satisfy the specs that your boss gave you. All you know at this stage is the Chebyshev inequality. So you just try to use it. The probability of getting an answer that's more than 0.01 away from the true answer is, by Chebyshev's inequality, the variance of this random variable divided by this number squared. The variance, as we argued a little earlier, is the variance of the x's divided by n. So we get this expression. So we would like this number to be less than or equal to 0.05.

OK, here we hit a little bit of a difficulty. The variance, (sigma_x)-squared, what is it? (Sigma_x)-squared is, if you remember the variance of a Bernoulli random variable, this quantity, f times (1-f). But we don't know it. f is what we're trying to estimate in the first place. So the variance is not known, so I cannot plug in a number inside here.

What I can do is to be conservative and use an upper bound of the variance. How large can this number get? Well, you can plot f times (1-f). It's a parabola. It has a root at 0 and at 1. So the maximum value is going to be, by symmetry, at 1/2 and when f is 1/2, then this variance becomes 1/4.

So I don't know (sigma_x)-squared, but I'm going to use the worst case value for (sigma_x)-squared, which is 1/4. And this is now an inequality that I know to be always true. I've got my specs, and my specs tell me that I want this number to be less than 0.05.

And given what I know, the best thing I can do is to say, OK, I'm going to take this number and make it less than 0.05. If I choose my n so that this is less than 0.05, then I'm certain that this probability is also less than 0.05.

What does it take for this inequality to be true? You can solve for n here, and you find that to satisfy this inequality, n should be larger than or equal to 50,000. So you can just let n be equal to 50,000. So the Chebyshev inequality tells us that if you take n equal to 50,000, then by the Chebyshev inequality, we're guaranteed to satisfy the specs that we were given.
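The sample-size calculation can be sketched as a small function. The name `chebyshev_sample_size` and its parameters are hypothetical. It solves variance_bound / (n * accuracy^2) <= failure_prob for the smallest integer n, using the worst-case Bernoulli variance of 1/4 by default, and uses exact fractions so that rounding cannot push the answer up by one.

```python
import math
from fractions import Fraction

def chebyshev_sample_size(accuracy, failure_prob, variance_bound=Fraction(1, 4)):
    """Smallest n with variance_bound / (n * accuracy**2) <= failure_prob.

    accuracy and failure_prob may be given as decimal strings such as "0.01"
    so that they are converted to exact fractions.
    """
    accuracy = Fraction(accuracy)
    failure_prob = Fraction(failure_prob)
    return math.ceil(Fraction(variance_bound) / (accuracy ** 2 * failure_prob))

print(chebyshev_sample_size("0.01", "0.05"))  # 50000
print(chebyshev_sample_size("0.03", "0.05"))  # 5556
```

The second call shows how strongly the answer depends on the accuracy spec: since accuracy enters as a square, loosening it from 1 to 3 percentage points divides the required sample size by 9.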

OK. Now, 50,000 is a bit of a large sample size. Right? If you read anything in the newspapers where they say so much of the voters think this and that, this was determined on the basis of a sample of 1,200 likely voters or so. So the numbers that you will typically see in these news items about polling, they usually involve sample sizes of about 1,000 or so. You will never see a sample size of 50,000. That's too much.

So where can we cut some corners? Well, we can cut corners basically in three places. This requirement is a little too tight. Newspaper stories will usually tell you, we have an accuracy of +/- 3 percentage points, instead of 1 percentage point. And because this number comes up as a square, making it 3 percentage points instead of 1 saves you a factor of about 10.

Then, the five percent confidence, I guess that's usually OK. If we use that factor of 10 that we gain from here, then we get a sample size of about 5,000. And that's, again, a little too big. So where can we fix things?

Well, it turns out that this inequality that we're using here, Chebyshev's inequality, is just an inequality. It's not that tight. It's not very accurate. Maybe there's a better way of calculating or estimating this quantity, which is smaller than this. And using a more accurate inequality or a more accurate bound, then we can convince ourselves that we can settle with a smaller sample size.

This more accurate kind of inequality comes out of a different limit theorem, which is the next limit theorem we're going to consider. We're going to start the discussion today, but we're going to continue with it next week.

Before I tell you exactly what that other limit theorem says, let me give you the big picture of what's involved here. We're dealing with sums of i.i.d. random variables. Each X has a distribution of its own.

So suppose that X has a distribution which is something like this. This is the density of X. If I add lots of X's together, what kind of distribution do I expect? The mean is going to be n times the mean of an individual X. So if this is mu, I'm going to get a mean of n times mu.

But my variance will also increase. When I add the random variables, I'm adding the variances. So since the variance increases, we're going to get a distribution that's pretty wide. So this is the density of X1 plus all the way to Xn. So as n increases, my distribution shifts, because the mean is positive. So I keep adding things. And also, my distribution becomes wider and wider. The variance increases.

Well, we studied a different scaling. We studied a scaled version of this quantity when we looked at the weak law of large numbers. In the weak law of large numbers, we take this random variable and divide it by n. And what the weak law tells us is that we're going to get a distribution that's very highly concentrated around the true mean, which is mu.

So this here would be the density of X1 plus all the way to Xn, divided by n. Because I've divided by n, the mean has become the original mean, which is mu. But the weak law of large numbers tells us that the distribution of this random variable is very concentrated around the mean. So we get a distribution that's very narrow, of this kind. In the limit, this distribution becomes one that's just concentrated on top of mu. So it's sort of a degenerate distribution.

So these are two extremes, no scaling for the sum, a scaling where we divide by n. In this extreme, we get the trivial case of a distribution that flattens out completely. In this scaling, we get a distribution that gets concentrated around a single point.

Can we look at some intermediate scaling that makes things more interesting? Things do become interesting if we scale by dividing the sum by square root of n instead of dividing by n. What effect does this have?

When we scale by dividing by square root of n, the variance of Sn over square root of n is going to be the variance of Sn divided by n. That's how variances behave. The variance of Sn is n sigma squared, divided by n, which is sigma squared, which means that when we scale in this particular way, as n changes, the variance doesn't change.
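Written out, with Sn = X1 + ... + Xn and sigma squared the variance of each X, the calculation uses the rule var(aY) = a^2 var(Y) with a = 1 over square root of n:

```latex
\operatorname{var}\!\left(\frac{S_n}{\sqrt{n}}\right)
  = \frac{\operatorname{var}(S_n)}{n}
  = \frac{n\sigma^2}{n}
  = \sigma^2
```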

So the width of our distribution will be sort of constant. The distribution changes shape, but it doesn't become narrower as was the case here. It doesn't become wider, kind of keeps the same width. So perhaps in the limit, this distribution is going to take an interesting shape. And that's indeed the case.

So let's do what we did before. So we're looking at the sum, and we want to divide the sum by something that goes like square root of n. So the variance of Sn is n sigma squared. The standard deviation of Sn is the square root of that. It's this number. So effectively, we're scaling by order of square root n.

Now, I'm doing another thing here. If my random variable has a positive mean, then this quantity is going to have a mean that's positive and growing. It's going to be shifting to the right.

Why is that? Sn has a mean that's proportional to n. When I divide by square root n, then it means that the mean scales like square root of n. So my distribution would still keep shifting after I do this division.

I want to keep my distribution in place, so I subtract out the mean of Sn. So what we're doing here is a standard technique or transformation where you take a random variable and you so-called standardize it. I remove the mean of that random variable and I divide by the standard deviation. This results in a random variable that has 0 mean and unit variance.

What Zn measures is the following: Zn tells me how many standard deviations away from the mean I am. Sn minus (n times expected value of X) tells me how far Sn is from the mean value of Sn. And dividing by the standard deviation of Sn tells me how many standard deviations away from the mean I am.
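In symbols, the standardized variable described here can be written as follows, where sigma is the standard deviation of a single X. The formula is reconstructed from the spoken description, since the slide itself is not part of the transcript.

```latex
Z_n = \frac{S_n - n\,\mathbf{E}[X]}{\sqrt{n}\,\sigma},
\qquad
\mathbf{E}[Z_n] = 0,
\qquad
\operatorname{var}(Z_n) = \frac{\operatorname{var}(S_n)}{n\sigma^2} = 1
```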

So we're going to look at this random variable Zn, which is just a transformation. It's a linear transformation of Sn. And we're going to compare this random variable to a standard normal random variable.

So a standard normal is the random variable that you are familiar with, given by the usual formula, and for which we have tables. This Zn has 0 mean and unit variance. So in that respect, it has the same statistics as the standard normal. The distribution of Zn could be anything -- can be pretty messy.

But there is this amazing theorem called the central limit theorem that tells us that the distribution of Zn approaches the distribution of the standard normal in the following sense: probabilities of this type that you can calculate for Zn become, in the limit, the same as the probabilities that you would get from the standard normal tables for Z.

It's a statement about the cumulative distribution functions. This quantity, as a function of c, is the cumulative distribution function of the random variable Zn. This is the cumulative distribution function of the standard normal. The central limit theorem tells us that the cumulative distribution function of the sum of a number of random variables, after they're appropriately standardized, approaches the cumulative distribution function of the standard normal distribution.
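As a sketch of the statement in symbols, for i.i.d. X's with finite mean and variance, and writing Phi for the standard normal cumulative distribution function (the symbol Phi is a notational assumption here):

```latex
\lim_{n \to \infty} \mathbf{P}(Z_n \le c) = \Phi(c)
  = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{c} e^{-z^2/2}\, dz
  \qquad \text{for every } c
```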

In particular, this tells us that we can calculate probabilities for Zn when n is large by calculating instead probabilities for Z. And that's going to be a good approximation. Probabilities for Z are easy to calculate because they're well tabulated. So we get a very nice shortcut for calculating probabilities for Zn.
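How good the approximation is can be checked in a case where the exact answer is computable. The sketch below assumes each X is a fair coin flip (0 or 1 with probability 1/2 each), so that Sn is binomial; the example and the function names are illustrative choices, not taken from the lecture.

```python
import math

def std_normal_cdf(c):
    """Standard normal CDF, the quantity read off the normal tables."""
    return 0.5 * (1.0 + math.erf(c / math.sqrt(2.0)))

def exact_cdf_zn(n, c):
    """Exact P(Z_n <= c) when S_n counts heads in n fair coin flips."""
    mu, sigma = 0.5, 0.5  # mean and standard deviation of one flip
    # Z_n <= c is the same event as S_n <= n*mu + c*sqrt(n)*sigma.
    m = math.floor(n * mu + c * math.sqrt(n) * sigma)
    if m < 0:
        return 0.0
    m = min(m, n)
    favorable = sum(math.comb(n, k) for k in range(m + 1))
    return favorable / 2 ** n

for n in (10, 100, 1000):
    print(n, exact_cdf_zn(n, 1.0), std_normal_cdf(1.0))
```

The remaining gap at moderate n comes largely from Sn being discrete while the normal is continuous; it shrinks as n grows.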

Now, it's not Zn that you're interested in. What you're interested in is Sn. And Sn -- inverting this relation here -- Sn is square root n sigma Zn plus n expected value of X.

All right. Now, if you can calculate probabilities for Zn, even approximately, then you can certainly calculate probabilities for Sn, because one is a linear function of the other. And we're going to do a little bit of that next time. You're going to get, also, some practice in recitation. At a more vague level, you could describe the central limit theorem as saying the following, when n is large, you can pretend that Zn is a standard normal random variable and do the calculations as if Zn was standard normal.

Now, pretending that Zn is normal is the same as pretending that Sn is normal, because Sn is a linear function of Zn. And we know that linear functions of normal random variables are normal. So the central limit theorem essentially tells us that we can pretend that Sn is a normal random variable and do the calculations just as if it were a normal random variable.
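The two ways of doing the calculation can be compared directly. This is a minimal sketch with made-up numbers (n = 100, mean 2, standard deviation 1, threshold 210) and hypothetical function names: one route standardizes and treats Zn as standard normal, the other treats Sn itself as normal with mean n times mu and variance n sigma squared.

```python
import math
from statistics import NormalDist

def prob_sum_le_via_zn(s, n, mu, sigma):
    """Approximate P(S_n <= s) by standardizing and treating Z_n as standard normal."""
    c = (s - n * mu) / (math.sqrt(n) * sigma)
    return NormalDist().cdf(c)

def prob_sum_le_direct(s, n, mu, sigma):
    """Approximate P(S_n <= s) by treating S_n as normal with mean n*mu, variance n*sigma**2."""
    return NormalDist(mu=n * mu, sigma=math.sqrt(n) * sigma).cdf(s)

# Hypothetical numbers, chosen so that the threshold is one standard deviation above the mean.
n, mu, sigma = 100, 2.0, 1.0
print(prob_sum_le_via_zn(210.0, n, mu, sigma))
print(prob_sum_le_direct(210.0, n, mu, sigma))
```

Both routes give the same number, because Sn is a linear function of Zn.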

Mathematically speaking though, the central limit theorem does not talk about the distribution of Sn, because the distribution of Sn becomes degenerate in the limit, just a very flat and long thing. So strictly speaking mathematically, it's a statement about cumulative distributions of Zn's. Practically, the way you use it is by just pretending that Sn is normal.

Very good. Enjoy the Thanksgiving Holiday.
