Lecture 11: Derived Distributions; Convolution; Covariance and Correlation | Video Lectures | Probabilistic Systems Analysis and Applied Probability | Electrical Engineering and Computer Science

Description: In this lecture, the professor discussed derived distributions, convolution, covariance and correlation.

Instructor: John Tsitsiklis

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality, educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: Good morning. So today we're going to continue the subject from last time. So we're going to talk about derived distributions a little more, how to derive the distribution of a function of a random variable. So last time we discussed a couple of examples in which we had a function of a single variable. And we found the distribution of Y, if we're told the distribution of X.

So today we're going to do an example where we deal with the function of two random variables. And then we're going to consider the most interesting example of this kind, in which we have a random variable of the form W, which is the sum of two independent, random variables. That's a case that shows up quite often. And so we want to see what exactly happens in this particular case.

Just one comment that I should make. The material that we're covering now, chapter four, is sort of conceptually a little more difficult than what we have been doing before. So I would definitely encourage you to read the text before you jump in and try to do the problems in your problem sets.

OK, so let's start with our example, in which we're given two random variables. They're jointly continuous. And their distribution is pretty simple. They're uniform on the unit square. In particular, each one of the random variables is uniform on the unit interval. And the two random variables are independent.

What we're going to find is the distribution of the ratio of the two random variables. How do we go about it? Well, the same cookbook procedure that we used last time for the case of a single random variable. The cookbook procedure that we used for this case also applies to the case where you have a function of multiple random variables.

So what was the cookbook procedure? The first step is to find the cumulative distribution function of the random variable of interest and then take the derivative in order to find the density. So let's find the cumulative. So, by definition, the cumulative is the probability that the random variable is less than or equal to the argument of the cumulative. So if we write this event in terms of the random variable of interest, this is the probability that our random variable is less than or equal to z.

So what is that? OK, so the ratio is going to be less than or equal to z, if and only if the pair, (x,y), happens to fall below the line that has a slope z. OK, so we draw a line that has a slope z. The ratio is less than this number, if and only if we get the pair of x and y that falls inside this triangle.

So we're talking about the probability of this particular event. Since this line has a slope of z, the height at this point is equal to z. And so we can find the probability of this event.

It's just the area of this triangle. And so the area is 1 times z times 1/2. And we get the answer, z/2.

Now, is this answer always correct? Now, this answer is going to be correct only if the slope happens to be such that we get a picture of this kind. So when do we get a picture of this kind? When the slope is less than 1.

If I consider a different slope, a number, little z -- that happens to be a slope of that kind -- then the picture changes. And in that case, we get a picture of this kind, let's say. So this is a line here of slope z, again. And this is the second case in which our number, little z, is bigger than 1.

So how do we proceed? Once more, the cumulative is the probability that the ratio is less than or equal to that number. So it's the probability that we fall below the red line.

So we're talking about the event, about this event. So to find the probability of this event, we need to find the area of this red shape. And one way of finding this area is to consider the whole area and subtract the area of this triangle.

So let's do it this way. It's going to be 1 minus the area of the triangle. Now, what's the area of the triangle? It's 1/2 times this side, which is 1 times this side.

How big is that side? Well, the slope is z, and z is the ratio y over x. At this point we have y/x = z and y = 1. This means that z is 1/x, or x = 1/z.

So the x-coordinate of this point is 1/z. So here we get the factor of 1/z, and the cumulative is 1 - 1/(2z).

And we're basically done. I guess if you want to have a complete answer, you should also give the formula for z less than 0. What is the cumulative when z is less than 0, the probability that you get the ratio that's negative?

Well, since our random variables are positive, there's no way that you can get a negative ratio. So the cumulative down there is equal to 0. So we can plot the cumulative. And we can take its derivative in order to find the density.

So the cumulative that we got starts at 0, when z's are negative. Then it starts going up in proportion to z, at the slope of 1/2. So this takes us up to z equal to 1, where the cumulative is 1/2.

And then it starts increasing towards 1, according to this function. When you let z go to infinity, the cumulative is going to go to 1. And it has a shape of, more or less, this kind.

So now to get the density, we just take the derivative. And the density is, of course, 0 down here. Up here the derivative is just 1/2. And beyond that point we need to take the derivative of this expression.

And the derivative is going to be 1/2 times 1 over z-squared. So it's going to be a shape of this kind. And we're done.
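As a rough check of the result just derived, here is a minimal Python sketch. The function names (`cdf_ratio`, `pdf_ratio`, `empirical_cdf`) are made up for this illustration and are not from the lecture; the simulation simply draws X and Y uniformly on the unit interval and counts how often Y/X is at most z.

```python
import random

def cdf_ratio(z):
    """CDF of Z = Y/X for X, Y independent and uniform on [0, 1]."""
    if z <= 0:
        return 0.0
    if z <= 1:
        return z / 2            # area of the triangle below the line of slope z
    return 1 - 1 / (2 * z)      # unit square minus the triangle above the line

def pdf_ratio(z):
    """Density of Z, obtained by differentiating cdf_ratio piece by piece."""
    if z <= 0:
        return 0.0
    if z <= 1:
        return 0.5
    return 1 / (2 * z * z)

def empirical_cdf(z, n=100_000, seed=0):
    """Monte Carlo estimate of P(Y/X <= z); n and seed are arbitrary choices."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x = 1.0 - rng.random()  # lies in (0, 1], so we never divide by zero
        y = rng.random()
        if y / x <= z:
            hits += 1
    return hits / n

for z in (0.5, 1.0, 2.0, 4.0):
    print(z, cdf_ratio(z), empirical_cdf(z))
```

The simulated values should land close to the formula values, up to sampling noise.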

So you see that problems involving functions of multiple random variables are no harder than problems that deal with the function of a single random variable. The general procedure is, again, exactly the same. You first find the cumulative, and then you differentiate. The only extra difficulty will be that when you calculate the cumulative, you need to find the probability of an event that involves multiple random variables. And sometimes this could be a little harder to do.

By the way, since we dealt with this example, just a couple of questions. What do you think is going to be the expected value of the random variable Z? Let's see, the expected value of the random variable Z is going to be the integral of z times the density.

And the density is equal to 1/2 for z going from 0 to 1. And then there's another contribution from 1 to infinity. There the density is 1/(2z-squared). And we get the z, since we're dealing with expectations, dz.

So what is this integral? Well, if you look here, you're integrating 1/z, all the way to infinity. 1/z has an integral, which is the logarithm of z. And since the logarithm goes to infinity, this means that this integral is also infinite.

So the expectation of the random variable Z is actually infinite in this example. There's nothing wrong with this. Lots of random variables have infinite expectations. If the tail of the density falls kind of slowly, as the argument goes to infinity, then it may well turn out that you get an infinite integral. So that's just how things often are. Nothing strange about it.
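To see the divergence concretely, one can cut the integral off at a finite upper limit b. For b of at least 1, the integral of z times the density from 0 to b works out to 1/4 + (1/2) ln b, which keeps growing as b grows. The sketch below just evaluates that expression; the name `truncated_mean` is a hypothetical helper, not something from the lecture.

```python
import math

def truncated_mean(b):
    """Integral of z * f_Z(z) from 0 to b, assuming b >= 1.

    The piece from 0 to 1 contributes 1/4 (integral of z/2),
    and the piece from 1 to b contributes (1/2) * ln(b) (integral of 1/(2z)).
    """
    if b < 1:
        raise ValueError("this closed form assumes b >= 1")
    return 0.25 + 0.5 * math.log(b)

for b in (1.0, 10.0, 1e3, 1e6, 1e12):
    print(b, truncated_mean(b))
```

The printed values grow without bound, only slowly, which is the logarithmic divergence described above.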

And now, since we are still in this example, let me ask another question. Can we reason on the average? Would it be true that the expected value of Z -- remember that Z is the ratio Y/X -- is this number [E[Y] divided by E[X]]? Or could it be that it's equal to this number [E[Y] times E[1/X]]? Or could it be that it's none of the above?

OK, so how many people think this is correct? Small number. How many people think this is correct? Slightly bigger, but still a small number.

And how many people think this is correct? OK, that's-- this one wins the vote. OK, let's see.

This one is not correct, just because there's no reason it should be correct. So, in general, you cannot reason on the average. The expected value of a function is not the same as the same function of the expected values. This is only true if you're dealing with linear functions of random variables. So this is not-- this turns out to not be correct.

How about this one? Well, X and Y are independent, by assumption. So 1/X and Y are also independent. Why is this? Independence means that one random variable does not convey any information about the other.

So Y doesn't give you any information about X. So Y doesn't give you any information about 1/X. Or to put it differently, if two random variables are independent, functions of each one of those random variables are also independent.

If X is independent from Y, then g(X) is independent of h(Y). So this applies to this case. These two random variables are independent.

And since they are independent, this means that the expected value of their product is equal to the product of the expected values. So this relation actually is true. And therefore, this is not true. OK.

Now, let's move on. We have this general procedure of finding the derived distribution by going through the cumulative. Are there some cases where we can have a shortcut? Turns out that there is a special case or a special structure in which we can get directly from densities to densities using directly just a formula. And in that case, we don't have to go through the cumulative.

And this case is also interesting, because it gives us some insight about how one density changes to a different density and what affects the shape of those densities. So the case where things are easy is when the transformation from one random variable to the other is a strictly monotonic one. So there's a one-to-one relation between x's and y's.

Here we can reason directly in terms of densities by thinking in terms of probabilities of small intervals. So let's look at the small interval on the x-axis, like this one, when X ranges from-- where capital X ranges from a small x to a small x plus delta. So this is a small interval of length delta.

Whenever X happens to fall in this interval, the random variable Y is going to fall in a corresponding interval up there. So up there we have a corresponding interval. And these two intervals, the red and the blue interval-- this is the blue interval. And that's the red interval.

These two intervals should have the same probability. They're exactly the same event. When X falls here, g(X) happens to fall in there. So we can sort of say that the probability of this little interval is the same as the probability of that little interval. And we know that probabilities of little intervals have something to do with densities.

So what is the probability of this little interval? It's the density of the random variable X, at this point, times the length of the interval. How about the probability of that interval? It's going to be the density of the random variable Y times the length of that little interval.

Now, this interval has length delta. Does that mean that this interval also has length delta? Well, not necessarily.

The length of this interval has something to do with the slope of your function g. So the slope is dy by dx. The slope tells you how big the y interval is when you take an x interval of a certain length.

So the slope is what multiplies the length of this interval to give you the length of that interval. So the length of this interval is delta times the slope of your function. So the length of the interval is delta times the slope of the function, approximately.

So the probability of this interval is going to be the density of Y times the length of the interval that we are considering. So this gives us a relation between the density of X, evaluated at this point, to the density of Y, evaluated at that point. The two densities are closely related.

If these x's are very likely to occur, then this is big, which means that that density will also be big. If these x's are very likely to occur, then those y's are also very likely to occur. But there's also another factor that comes in. And that's the slope of the function at this particular point.

So we have this relation between the two densities. Now, in interpreting this equation, you need to make sure what's the relation between the two variables. I have both little x's and little y's.

Well, this formula is true for an (x,y) pair, that they're related according to this particular function. So if I fix an x and consider the corresponding y, then the densities at those x's and corresponding y's will be related by that formula. Now, in the end, you want to come up with a formula that just gives you the density of Y as a function of y. And that means that you need to eliminate x from the picture.

So let's see how that would go in an example. So suppose that we're dealing with the function y equal to x cubed, in which case our function, g(x), is the function x cubed. And if x cubed is equal to a little y, if we have a pair of x's and y's that are related this way, then this means that x is going to be the cube root of y.

So this is the formula that takes us back from y's to x's. This is the direct function from x, how to construct y. This is essentially the inverse function that tells us, from a given y what is the corresponding x. Now, if we write this formula, it tells us that the density at the particular x is going to be the density at the corresponding y times the slope of the function at the particular x that we are considering. The slope of the function is 3x squared.

Now, we want to end up with a formula for the density of Y. So I'm going to take this factor, send it to the other side. But since I want it to be a function of y, I want to eliminate the x's. And I'm going to eliminate the x's using this correspondence here.

So I'm going to get the density of X evaluated at y to the 1/3. And then this factor in the denominator, it's 1/(3y to the power 2/3). So we end up finally with the formula for the density of the random variable Y.

And this is the same answer that you would get if you go through this exercise using the cumulative distribution function method. You end up getting the same answer. But here we sort of get it directly.
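Here is a small sketch comparing the two routes for y equal to x cubed. The lecture leaves the density of X general; purely for illustration, this sketch assumes X is uniform on [0, 1], and the function names are invented for the example.

```python
def pdf_x(x):
    """Assumed density of X for this illustration: uniform on [0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def pdf_y(y):
    """Density of Y = X**3 via the formula f_Y(y) = f_X(y**(1/3)) / (3 * y**(2/3))."""
    if y <= 0.0:
        return 0.0              # the slope is 0 at y = 0, so that single point is skipped
    x = y ** (1.0 / 3.0)        # inverse function: the x that maps to this y
    slope = 3.0 * x * x         # slope of g(x) = x**3 at that x
    return pdf_x(x) / slope

def cdf_y(y):
    """CDF of Y by the cumulative method: P(X**3 <= y) = P(X <= y**(1/3))."""
    if y <= 0.0:
        return 0.0
    return min(y ** (1.0 / 3.0), 1.0)

# Differentiate the CDF numerically and compare with the direct formula.
h = 1e-6
for y in (0.1, 0.5, 0.9):
    numeric = (cdf_y(y + h) - cdf_y(y - h)) / (2 * h)
    print(y, pdf_y(y), numeric)
```

The two columns should agree to several decimal places, which is the sense in which both methods give the same answer.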

Just to get a little more insight as to why the slope comes in-- suppose that we have a function like this one. So the function is sort of flat, then moves quickly, and then becomes flat again. What should be -- and suppose that X has some kind of reasonable density, some kind of flat density.

Suppose that X is a pretty uniform random variable. What's going to happen to the random variable Y? What kind of distribution should it have? What are the typical values of the random variable Y?

Either x falls here, and y is a very small number, or-- let's take that number here to be -- let's say 2 -- or x falls in this range, and y takes a value close to 2. And there's a small chance that x's will be somewhere in the middle, in which case y takes intermediate values. So what kind of shape do you expect for the distribution of Y?

There's going to be a fair amount of probability that Y takes values close to 0. There's a small probability that Y takes intermediate values. That corresponds to the case where x falls in here.

That's not a lot of probability. So the probability that Y takes values between 0 and 2, that's kind of small. But then there's a lot of x's that produces y's that are close to 2. So there's a significant probability that Y would take values that are close to 2.

So you-- the density of Y would have a shape of this kind. By looking at this picture, you can tell that it's most likely that either x will fall here or x will fall there. So the g(x) is most likely to be close to 0 or to be close to 2.

So since y is most likely to be close to 0 or close to 2, most of the probability of y is here. And there's a small probability of being in between. Notice that the y's that get a lot of probability are those y's associated with flat regions of your g function. When the g function is flat, that gives you big densities for Y.

So the density of Y is inversely proportional to the slope of the function. And that's what you get from here. The density of Y is-- send that term to the other side-- is inversely proportional to the slope of the function that you're dealing with.

OK, so this formula works nicely for the case where the function is one-to-one. So we can have a unique association between x's and y's and through an inverse function, from y's to x's. It works for the monotonically increasing case. It also works for the monotonically decreasing case. In the monotonically decreasing case, the only change that you need to do is to take the absolute value of the slope, instead of the slope itself.
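Written out, the relation discussed above for a strictly monotonic g takes the following form. This is a restatement in LaTeX of what the lecture describes in words, with the absolute value covering the decreasing case; the second line is the x cubed example, valid for y different from 0.

```latex
f_Y(y) = \frac{f_X(x)}{\left| \dfrac{dg}{dx}(x) \right|},
\qquad \text{where } x = g^{-1}(y)

% Example: g(x) = x^3, so x = y^{1/3} and dg/dx = 3x^2 = 3y^{2/3}
f_Y(y) = \frac{f_X\left(y^{1/3}\right)}{3\, y^{2/3}}
```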

OK, now, here's another example or a special case. Let's talk about the most interesting case that involves a function of two random variables. And this is the case where we have two independent, random variables, and we want to find the distribution of the sum of the two. We're really interested in the continuous case. But as a warm-up, it's useful to look at the discrete case first of discrete random variables.

Let's say we want to find the probability that the sum of X and Y is equal to a particular number. And to illustrate this, let's take that number to be equal to 3. What's the probability that the sum of the two random variables is equal to 3?

To find the probability that the sum is equal to 3, you consider all possible ways that you can get the sum of 3. And the different ways are the points in this picture. And they correspond to a line that goes this way. So the probability that the sum is equal to a certain number is the probability that -- is the sum of the probabilities of all of those points.

What is a typical point in this picture? In a typical point, the random variable X takes a certain value. And Y takes the value that's needed so that the sum is equal to w. Any combination of an x with a w minus x, any such combination gives you a sum of w.

So the probability that the sum is w is the sum over all possible x's. That's over all these points of the probability that we get a certain x. Let's say x equals 2 times the corresponding probability that random variable Y takes the value 1.

And why am I multiplying probabilities here? That's where we use the assumption that the two random variables are independent. So the probability that X takes a certain value and Y takes the complementary value, that probability is the product of two probabilities because of independence.

And when we write that into our usual PMF notation, it's a formula of this kind. So this formula is called the convolution formula. It's an operation that takes one PMF and another PMF -- we're given the PMF's of X and Y -- and produces a new PMF.

So think of this formula as giving you a transformation. You take two PMF's, you do something with them, and you obtain a new PMF. What this formula does is nicely illustrated sort of mechanically. So let me show you a picture here and illustrate how the mechanics go, in general.

So you don't have these slides, but let's just reason through it. So suppose that you are given the PMF of X, and it has this shape. You're given the PMF of Y. It has this shape. And somehow we are going to do this calculation.

Now, we need to do this calculation for every value of w, in order to get the PMF of W. Let's start by doing the calculation just for one case. Suppose that w is equal to 0, in which case we need to find the sum, over all x, of the products Px(x) times Py(-x).

How do you do this calculation graphically? It involves the PMF of X. But it involves the PMF of Y, with the argument reversed. So how do we plot this?

Well, in order to reverse the argument, what you need is to take this PMF and flip it. So that's where it's handy to have a pair of scissors with you. So you cut this down. And so now you take the PMF of the random variable Y and just flip it.

So what you see here is this function where the argument is being reversed. And then what do we do? We cross-multiply the two plots. Any entry here gets multiplied with the corresponding entry there. And we consider all those products and add them up.

In this particular case, the flipped PMF doesn't have any overlap with the PMF of X. So we're going to get an answer that's equal to 0. So for w's equal to 0, the Pw is going to be equal to 0, in this particular plot.

Now if we have a different value of w -- oops. If we have a different value of the argument w, then we have here the PMF of Y that's flipped and shifted by an amount of w. So the correct picture of what you do is to take this and displace it by a certain amount of w.

So here, how much did I shift it? I shifted it until one falls just below 4. So I have shifted by a total amount of 5. So 0 falls under 5, whereas 0 initially was under 0. So I'm shifting it by 5 units.

And I'm now going to cross-multiply and add. Does this give us the correct-- does it do the correct thing? Yes, because a typical term will be the probability that this random variable is 3 times the probability that this random variable is 2. That's a particular way that you can get a sum of 5.

If you see here, the way that things are aligned, it gives you all the different ways that you can get the sum of 5. You can get the sum of 5 by having 1 + 4, or 2 + 3, or 3 + 2, or 4 + 1. You need to add the probabilities of all those combinations.

So you take this times that. That's one product term. Then this times 0, this times that. And so you find all the products of the corresponding terms, and you add them together.

So it's a kind of handy mechanical procedure for doing this calculation, especially when the PMF's are given to you in terms of a picture. So the summary of these mechanics, which is just what we did, is that you put the PMF's on top of each other.

You take the PMF of Y. You flip it. And for any particular w that you're interested in, you take this flipped PMF and shift it by an amount of w. Given this particular shift for a particular value of w, you cross-multiply terms and then accumulate them or add them together.
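The flip, shift, cross-multiply, and add procedure can be sketched in a few lines of Python. The PMF's on the slides are not reproduced in this transcript, so the example below assumes, purely for illustration, that X and Y are each equally likely to be 1, 2, 3, or 4; the function name `convolve_pmf` is also made up.

```python
def convolve_pmf(pmf_x, pmf_y):
    """PMF of W = X + Y for independent X and Y.

    Each PMF is a dict {value: probability}. For every w, this adds up
    p_X(x) * p_Y(w - x) over all x, which is the convolution formula.
    """
    support = sorted({x + y for x in pmf_x for y in pmf_y})
    return {
        w: sum(px * pmf_y.get(w - x, 0.0) for x, px in pmf_x.items())
        for w in support
    }

# Hypothetical PMF's: uniform on {1, 2, 3, 4} (not the ones on the slide).
pmf_x = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}
pmf_y = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}

pmf_w = convolve_pmf(pmf_x, pmf_y)
for w, p in pmf_w.items():
    print(w, p)
```

With these assumed PMF's, the value w = 5 collects the four combinations 1 + 4, 2 + 3, 3 + 2, and 4 + 1, and w = 0 gets no probability at all because the flipped PMF has no overlap with the PMF of X.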

What would you expect to happen in the continuous case? Well, the story is familiar. In the continuous case, pretty much, almost always things work out the same way, except that we replace PMF's by PDF's. And we replace sums by integrals.

So there shouldn't be any surprise here that you get a formula of this kind. The density of W can be obtained from the density of X and the density of Y by calculating this integral. Essentially, what this integral does is it fixes a particular w of interest. We're interested in the probability that the random variable, capital W, takes a value equal to little w or values close to it.

So this corresponds to the event, which is this particular line on the two-dimensional space. So we need to sort of add probabilities along that line. But since the setting is continuous, we will not add probabilities. We're going to integrate. And for any typical point in this picture, the probability of obtaining an outcome in this neighborhood has something to do with the density of that particular x and the density of the particular y that would complement x, in order to form a sum of w.

So this integral that we have here is really an integral over this particular line. OK, so I'm going to skip the formal derivation of this result. There's a couple of derivations in the text. And the one which is outlined here is yet a third derivation.
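The continuous convolution integral, the integral over x of f_X(x) times f_Y(w - x), can be approximated numerically. The sketch below is only an illustration: it assumes X and Y are both uniform on [0, 1] (an assumption made here, not a case worked in the lecture) and uses a plain midpoint rule with hypothetical function names.

```python
def uniform_pdf(t):
    """Assumed density for both X and Y: uniform on [0, 1]."""
    return 1.0 if 0.0 <= t <= 1.0 else 0.0

def convolve_pdf(pdf_x, pdf_y, w, lo=-1.0, hi=3.0, steps=40_000):
    """Midpoint-rule approximation of the integral of pdf_x(x) * pdf_y(w - x) dx.

    The range [lo, hi] and the number of steps are arbitrary choices that
    comfortably cover the supports used in this example.
    """
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += pdf_x(x) * pdf_y(w - x)
    return total * dx

for w in (0.0, 0.5, 1.0, 1.5, 2.0):
    print(w, convolve_pdf(uniform_pdf, uniform_pdf, w))
```

For these two uniform densities the result rises linearly up to w = 1 and then falls linearly back to 0 at w = 2, a triangular density.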

But the easiest way to make sense of this formula is to consider what happens in the discrete case. So for the rest of the lecture we're going to consider a few extra, more miscellaneous topics, a few remarks, and a few more definitions. So let's change-- flip a page and consider the next mini topic.

There's not going to be anything deep here, but just something that's worth being familiar with. If you have two independent, normal random variables with certain parameters, the question is, what does the joint PDF look like? So if they're independent, by definition the joint PDF is the product of the individual PDF's.

And each one of the PDF's involves an exponential of something. The product of two exponentials is the exponential of the sum. So you just add the exponents.

So this is the formula for the joint PDF. Now, you look at that formula and you ask, what does it look like? OK, you can understand it, a function of two variables by thinking about the contours of this function.

Look at the points at which the function takes a constant value. Where is it? When is it constant? What's the shape of the set of points where this is a constant? So consider all x's and y's for which this expression here is a constant, that this expression here is a constant.

What kind of shape is this? This is an ellipse. And it's an ellipse that's centered at-- it's centered at mu x, mu y. These are the means of the two random variables.
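For reference, the joint PDF and the contour condition described above can be written as follows. This is the standard product of two normal densities, with means mu_x, mu_y and standard deviations sigma_x, sigma_y as in the lecture's notation; c stands for an arbitrary positive constant.

```latex
f_{X,Y}(x,y) = f_X(x)\, f_Y(y)
  = \frac{1}{2\pi \sigma_x \sigma_y}
    \exp\left\{ -\frac{(x-\mu_x)^2}{2\sigma_x^2} - \frac{(y-\mu_y)^2}{2\sigma_y^2} \right\}

% The PDF is constant exactly where the exponent is constant:
\frac{(x-\mu_x)^2}{\sigma_x^2} + \frac{(y-\mu_y)^2}{\sigma_y^2} = c
```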

If those sigmas were equal, that ellipse would be actually a circle. And you would get contours of this kind. But if, on the other hand, the sigmas are different, you're going to get an ellipse that has contours of this kind.

So if my contours are of this kind, that corresponds to what? Sigma x being bigger than sigma y, or vice versa? OK, contours of this kind basically tell you that X is more likely to be spread out than Y. So the range of possible x's is bigger.

And X out here is as likely as a Y up there. So big X's have roughly the same probability as certain smaller y's. So in a picture of this kind, the variance of X is going to be bigger than the variance of Y.

So depending on how these variances compare with each other, that's going to determine the shape of the ellipse. If the variance of Y were bigger, then your ellipse would be the other way. It would be elongated in the other dimension. Just visualize it a little more.

Let me throw at you a particular picture. This is one-- this is a picture of one special case. Here, I think, the variances are equal. That's the kind of shape that you get. It looks like a two-dimensional bell.

So remember, for a single normal random variable you get a PDF that's bell shaped. That's just a bell-shaped curve. In the two-dimensional case, we get the joint PDF, which is bell shaped again. And now it looks more like a real bell, the way it would be laid out in ordinary space.

And if you look at the contours of this function, the places where the function is equal, the typical contour would have this shape here. And it would be an ellipse. And in this case, actually, it will be more like a circle.

So these would be the different contours for different-- so the contours are places where the joint PDF is a constant. When you change the value of that constant, you get the different contours. And the PDF is, of course, centered around the mean of the two random variables. So in this particular case, since the bell is centered around the (0, 0) vector, this is a plot of a bivariate normal with 0 means.

OK, there's-- bivariate normals are also interesting when your bell is oriented differently in space. We talked about ellipses that are this way, ellipses that are this way. You could imagine also bells that you take them, you squash them somehow, so that they become narrow in one dimension and then maybe rotate them.

So if you had-- we're not going to go into this subject, but if you had a joint pdf whose contours were like this, what would that correspond to? Would your x's and y's be independent? No.

This would indicate that there's a relation between the x's and the y's. That is, when you have bigger x's, you would expect to also get bigger y's. So it would be a case of dependent normals. And we're coming back to this point in a second.

Before we get to that point in a second that has to do with the dependencies between the random variables, let's just do another digression. If we have our two normals that are independent, as we discussed here, we can go and apply the formula, the convolution formula that we were just discussing. Suppose you want to find the distribution of the sum of these two independent normals.

How do you do this? There is a closed-form formula for the density of the sum, which is this one. We do have formulas for the density of X and the density of Y, because both of them are normal, random variables.

So you need to calculate this particular integral here. It's an integral with respect to x. And you have to calculate this integral for any given value of w.

So this is an exercise in integration, which is not very difficult. And it turns out that after you do everything, you end up with an answer that has this form. And you look at that, and you suddenly recognize, oh, this is normal. And conclusion from this exercise, once it's done, is that the sum of two independent normal random variables is also normal.

Now, the mean of W is, of course, going to be equal to the sum of the means of X and Y. In this case, in this formula I took the means to be 0. So the mean of W is also going to be 0. In the more general case, the mean of W is going to be just the sum of the two means.

The variance of W is always the sum of the variances of X and Y, since we have independent random variables. So there's no surprise here. The main surprise in this calculation is this fact here, that the sum of independent normal random variables is normal. I had mentioned this fact in a previous lecture.
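The claim can be checked numerically without doing the integral by hand. The sketch below assumes zero means and picks sigma_x = 1 and sigma_y = 2 purely as example values; it approximates the convolution integral with a midpoint rule and compares it with a normal density whose variance is the sum of the two variances. Function names are hypothetical.

```python
import math

def normal_pdf(t, mu, sigma):
    """Density of a normal random variable with mean mu and standard deviation sigma."""
    return math.exp(-((t - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def sum_density(w, sigma_x, sigma_y, lo=-30.0, hi=30.0, steps=20_000):
    """Midpoint-rule approximation of the convolution of two zero-mean normal densities.

    The integration range and step count are arbitrary choices that are
    wide and fine enough for the example values used below.
    """
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += normal_pdf(x, 0.0, sigma_x) * normal_pdf(w - x, 0.0, sigma_y)
    return total * dx

sigma_x, sigma_y = 1.0, 2.0                         # example values, not from the lecture
sigma_w = math.sqrt(sigma_x ** 2 + sigma_y ** 2)    # variances add under independence

for w in (0.0, 1.0, 2.5):
    print(w, sum_density(w, sigma_x, sigma_y), normal_pdf(w, 0.0, sigma_w))
```

The two columns should match closely, which is a numerical illustration, not a proof, that the convolution of two normal densities is again normal.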

Here what we're doing is to basically outline the argument that justifies this particular fact. It's an exercise in integration, where you realize that when you convolve two normal curves, you also get back a normal one once more. So now, let's return to the comment I was making here, that if you have a contour plot that has, let's say, a shape of this kind, this indicates some kind of dependence between your two random variables.

So instead of a contour plot, let me throw in here a scatter diagram. What does this scatter diagram correspond to?

Suppose you have a discrete distribution, and each one of the points in this diagram has positive probability. When you look at this diagram, what would you say? I would say that when y is big then x also tends to be larger.

So bigger x's are sort of associated with bigger y's in some average, statistical sense. Whereas, if you have a picture of this kind, it tells you that positive y's tend to be associated with negative x's most of the time. Negative y's tend to be associated mostly with positive x's.

So here there's a relation that when one variable is large, the other one is also expected to be large. Here there's a relation of the opposite kind. How can we capture this relation between two random variables?

The way we capture it is by defining this concept called the covariance, that looks at the relation of was X bigger than usual? That's the question, whether this is positive. And how does this relate to the answer-- to the question, was Y bigger than usual?

We're asking-- by calculating this quantity, we're sort of asking the question, is there a systematic relation between having a big X and having a big Y? OK, to understand more precisely what this does, let's suppose that the random variables have 0 means, so that we get rid of this-- get rid of some clutter. So the covariance is defined just as the expected value of this product.

What does this do? If positive x's tend to go together with positive y's, and negative x's tend to go together with negative y's, this product will always be positive. And the covariance will end up being positive. In particular, if you sit down with a scatter diagram and you do the calculations, you'll find that the covariance of X and Y in this diagram would be positive, because here, most of the time, X times Y is positive. There's going to be a few negative terms, but there are fewer than the positive ones.

So this is a case of a positive covariance. It indicates a positive relation between the two random variables. When one is big, the other also tends to be big.

This is the opposite situation. Here, when one variable-- here, most of the action happens in this quadrant and that quadrant, which means that X times Y, most of the time, is negative. You get a few positive contributions, but there are few. When you add things up, the negative terms dominate. And in this case we have covariance of X and Y being negative.

So a positive covariance indicates a sort of systematic relation, that there's a positive association between the two random variables. When one is large, the other also tends to be large. Negative covariance is sort of the opposite. When one tends to be large, the other variable tends to be small.
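To make the sign of the covariance concrete, here is a small sketch for a discrete scatter diagram. The specific points are made up for illustration, and the code assumes every listed point is equally likely (the lecture only says each point has positive probability). The helper name `covariance` is likewise hypothetical.

```python
def covariance(points):
    """Covariance of (X, Y) when each listed point is equally likely (an assumption)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Average of the product of deviations from the means.
    return sum((x - mean_x) * (y - mean_y) for x, y in points) / n

# Hypothetical scatter: most points sit where x and y have the same sign.
upward = [(-2, -1), (-1, -2), (-1, 1), (1, 1), (1, 2), (2, 1), (1, -1)]
# Flipping the sign of y moves most of the mass to the other two quadrants.
downward = [(x, -y) for x, y in upward]

print(covariance(upward))    # positive
print(covariance(downward))  # negative
```

In the first diagram only two of the seven points have a negative product of deviations, so the positive terms dominate; in the mirrored diagram the situation is reversed.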

OK, so what else is there to say about the covariance? One observation to make is the following. What's the covariance of X with X itself?

If you plug in X here, you see that what we have is the expected value of (X minus the expected value of X) squared. And that's just the definition of the variance of a random variable. So that's one fact to keep in mind.

We had a shortcut formula for calculating variances. There's a similar shortcut formula for calculating covariances. In particular, we can calculate covariances in this particular way. That's just the convenient way of doing it whenever you need to calculate it.
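The slide itself is not reproduced in the transcript, so the following is a reconstruction rather than a quotation: the shortcut referred to is presumably the standard identity obtained by expanding the product inside the expectation and using linearity.

```latex
\begin{align*}
\operatorname{cov}(X,Y) &= E\big[(X - E[X])\,(Y - E[Y])\big] \\
  &= E[XY] - E[X]\,E[Y] - E[X]\,E[Y] + E[X]\,E[Y] \\
  &= E[XY] - E[X]\,E[Y], \\[4pt]
\operatorname{cov}(X,X) &= E\big[(X - E[X])^2\big] = \operatorname{var}(X).
\end{align*}
```

Setting Y = X in the shortcut recovers the familiar variance shortcut, var(X) = E[X^2] - (E[X])^2.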

And finally, covariances are very useful when you want to calculate the variance of a sum of random variables. We know that if two random variables are independent, the variance of the sum is the sum of the variances. When the random variables are dependent, this is no longer true, and we need to supplement the formula a little bit.

And there's a typo on the slides that you have in your hands. That term of 2 shouldn't be there. And let's see where that formula comes from.

Let's suppose that our random variables are independent of -- not independent -- our random variables have 0 means. And we want to calculate the variance. So the variance is going to be the expected value of (X1 plus ... plus Xn) squared. What you do is you expand the square. And you get the expected value of the sum of the Xi squared.

And then you get all the cross terms. OK. And so now, here, let's assume for simplicity that we have 0 means. The expected value of this is the sum of the expected values of the X squared terms. And that gives us the variance.

And then we have all the possible cross terms. And each one of the possible cross terms is the expected value of Xi times Xj. This is just the covariance.
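Written out, the expansion described above looks as follows. This is a sketch under the lecture's simplifying assumption that every X_i has zero mean; the sum over cross terms runs over ordered pairs (i, j) with i different from j.

```latex
\begin{align*}
\operatorname{var}\Big(\sum_{i=1}^{n} X_i\Big)
  &= E\Big[\Big(\sum_{i=1}^{n} X_i\Big)^{2}\Big]
   = \sum_{i=1}^{n} E\big[X_i^{2}\big] + \sum_{(i,j):\, i \neq j} E\big[X_i X_j\big] \\
  &= \sum_{i=1}^{n} \operatorname{var}(X_i) + \sum_{(i,j):\, i \neq j} \operatorname{cov}(X_i, X_j).
\end{align*}
```

Because the second sum already counts both (i, j) and (j, i), no extra factor of 2 is needed in front of it. For n = 2 it reduces to var(X_1) + var(X_2) + 2 cov(X_1, X_2).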

So if you can calculate all the variances and the covariances, then you're able to calculate also the variance of a sum of random variables. Now, if two random variables are independent, then you look at this expression. Because of independence, expected value of the product is going to be the product of the expected values. And the expected value of just this term is always equal to 0.

Your expected deviation from the mean is just 0. So the covariance will turn out to be 0. So independent random variables lead to 0 covariances, although the opposite fact is not necessarily true. So covariances give you some indication of the relation between two random variables.
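A standard illustration of why zero covariance does not imply independence is sketched below. The example is not from the lecture: it assumes X is uniform on {-1, 0, 1} and Y = X squared, and the helper name `expect` is hypothetical. Exact fractions are used so that the zero is exact rather than a rounding accident.

```python
from fractions import Fraction

# Assumed joint PMF (hypothetical example): X uniform on {-1, 0, 1}, Y = X**2.
pmf = {(x, x * x): Fraction(1, 3) for x in (-1, 0, 1)}

def expect(g):
    """Expected value of g(X, Y) under the joint PMF above."""
    return sum(p * g(x, y) for (x, y), p in pmf.items())

mean_x = expect(lambda x, y: x)
mean_y = expect(lambda x, y: y)
cov_xy = expect(lambda x, y: (x - mean_x) * (y - mean_y))

# Independence would require P(X=0, Y=0) = P(X=0) * P(Y=0).
p_x0 = sum(p for (x, _), p in pmf.items() if x == 0)
p_y0 = sum(p for (_, y), p in pmf.items() if y == 0)
p_joint = pmf.get((0, 0), Fraction(0))

print(cov_xy)                   # covariance of X and Y
print(p_joint == p_x0 * p_y0)   # whether the product rule holds at (0, 0)
```

Here Y is completely determined by X, yet the positive and negative products of deviations cancel, so the covariance vanishes while the product rule for independence fails.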

Something that's not so convenient conceptually about covariances is that they have the wrong units. That's the same comment that we had made regarding variances. And with variances we got out of that issue by considering the standard deviation, which has the correct units.

So with the same reasoning, we want to have a concept that captures the relation between two random variables and that, in some sense, doesn't have to do with the units that we're dealing with. We want to have a dimensionless quantity that tells us how strongly two random variables are related to each other.

So instead of considering the covariance of just X with Y, we take our random variables and standardize them by dividing them by their individual standard deviations and take the expectation of this. So what we end up doing is we take the covariance of X and Y, which has units that are the units of X times the units of Y, but we divide by the two standard deviations, so that we get a quantity that doesn't have units.

This quantity, we call it the correlation coefficient. And it's a very useful quantity, a very useful measure of the strength of association between two random variables. It's very informative, because it falls always between -1 and +1. This is an algebraic exercise that you're going to see in recitation.
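In symbols, the quantity just described can be written as follows, where sigma_X and sigma_Y denote the standard deviations of X and Y and are assumed to be nonzero. This is the standard definition, reconstructed here because the slide is not part of the transcript.

```latex
\[
\rho \;=\; E\!\left[\frac{X - E[X]}{\sigma_X}\cdot\frac{Y - E[Y]}{\sigma_Y}\right]
\;=\; \frac{\operatorname{cov}(X,Y)}{\sigma_X\,\sigma_Y},
\qquad -1 \le \rho \le 1 .
\]
```

Rescaling X or Y by a positive constant changes the covariance and the standard deviations by the same factor, which is why rho carries no units.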

And the way that you interpret it is as follows. If the two random variables are independent, the covariance is going to be 0. The correlation coefficient is going to be 0. So 0 correlation coefficient basically indicates a lack of a systematic relation between the two random variables.

On the other hand, when rho is large, either close to 1 or close to -1, this is an indication of a strong association between the two random variables. And the extreme case is when rho takes an extreme value.

When rho has a magnitude equal to 1, it's as big as it can be. In that case, the two random variables are very strongly related. How strongly? Well, if you know one random variable, if you know the value of y, you can recover the value of x and conversely.

So the case of a complete correlation is the case where one random variable is a linear function of the other random variable. In terms of a scatter plot, this would mean that there's a certain line and that the only possible (x,y) pairs that can happen would lie on that line. So if all the possible (x,y) pairs lie on this line, then you have this relation, and the correlation coefficient is equal to 1. A case where the correlation coefficient is close to 1 would be a scatter plot like this, where the x's and y's are quite strongly aligned with each other, maybe not exactly, but fairly strongly. All right, so you're going to hear a little more about correlation coefficients and covariances in recitation tomorrow.
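The extreme cases can be illustrated with a small computation. This is a sketch only: the points are invented, each point is assumed equally likely, and the helper name `correlation` is hypothetical.

```python
import math

def correlation(points):
    """Correlation coefficient of (X, Y), assuming each listed point is equally likely."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points) / n
    var_y = sum((y - mean_y) ** 2 for _, y in points) / n
    # Divide the covariance by the product of the two standard deviations.
    return cov / math.sqrt(var_x * var_y)

xs = [-2, -1, 0, 1, 2]
on_line_up = [(x, 3 * x + 1) for x in xs]       # exact line with positive slope
on_line_down = [(x, -0.5 * x + 4) for x in xs]  # exact line with negative slope
near_line = [(-2, -5.5), (-1, -1.0), (0, 0.5), (1, 4.5), (2, 6.5)]  # roughly y = 3x + 1

print(correlation(on_line_up))    # magnitude 1, positive sign
print(correlation(on_line_down))  # magnitude 1, negative sign
print(correlation(near_line))     # close to, but below, 1
```

The sign of rho follows the sign of the slope, and its magnitude drops below 1 as soon as the points stop lying exactly on a line.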