Topics covered: Differentials; chain rule
Instructor: Prof. Denis Auroux
Lecture 11: Chain Rule
Related Resources
Lecture Notes - Week 5 Summary (PDF)
The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

So far we have learned about partial derivatives and how to use them to find minima and maxima of functions of two or several variables. And now we are going to study, in more detail, how functions of several variables behave: how to compute their variations, how to estimate the variation in arbitrary directions. And for that we are going to need some more tools to study these things. More tools to study functions.

Today's topic is going to be differentials. And, just to motivate that, let me remind you of one trick that you probably know from single variable calculus, namely implicit differentiation. Let's say that you have a function y equals f of x. Then you would sometimes write dy equals f prime of x times dx. We use implicit differentiation to relate infinitesimal changes in y with infinitesimal changes in x. And one thing we can do with that, for example, is figure out not only the rate of change dy by dx, but also the reciprocal dx by dy. So, for example, let's say that we have y equals inverse sin(x). Then we can write x equals sin(y). And, from there, we can find out what the derivative of this function is, if we didn't know the answer already, by writing dx equals cosine y dy. That tells us that dy over dx is going to be one over cosine y. And now cosine, by its relation to sine, is the square root of one minus sine squared y, which is the square root of one minus x^2. And that is how you find the formula for the derivative of the inverse sine function: one over the square root of one minus x^2. A formula that you probably already knew, but that is one way to derive it.

Now we are going to use these same kinds of notations, dx, dy and so on, but use them for functions of several variables. And, of course, we will have to learn what the rules of manipulation are and what we can do with them. The actual name of this object is the total differential, as opposed to the partial derivatives. The total differential includes all the contributions that can cause the value of your function f to change. Namely, let's say that you have a function, maybe of three variables x, y, z. Then you would write df equals f sub x dx plus f sub y dy plus f sub z dz. Or, just to remind you of the other notation, partial f over partial x dx plus partial f over partial y dy plus partial f over partial z dz.

Now, what is this object? What are the things on either side of this equality? Well, they are called differentials. And they are not numbers, they are not vectors, they are not matrices; they are a different kind of object. These things have their own rules of manipulation, and we have to learn what we can do with them. So how do we think about them? First of all, how do we not think about them? Here is an important thing to know. df is not the same thing as delta f. Delta f is meant to be a number: once you have a small variation of x, a small variation of y, a small variation of z -- and delta x, delta y and delta z are actual numbers -- delta f becomes a number. df, on the other hand, is not a number. You cannot give it a particular value. All you can do with a differential is express it in terms of other differentials.
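If you want to play with this formally, sympy can write out such a total differential, treating dx, dy and dz as pure symbols. This is a minimal sketch of my own, not part of the lecture; the function f = x^2 y + z is an arbitrary choice (it reappears in an example later):

```python
# Total differential df = f_x dx + f_y dy + f_z dz, built symbolically.
import sympy as sp

x, y, z = sp.symbols('x y z')
f = x**2 * y + z                      # example function (my own choice)

dx, dy, dz = sp.symbols('dx dy dz')   # differentials as formal symbols
df = sp.diff(f, x)*dx + sp.diff(f, y)*dy + sp.diff(f, z)*dz
print(df)    # 2*x*y*dx + x**2*dy + dz
```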
In fact, these dx, dy and dz are mostly just symbols. But if you want to think about them, they are the differentials of x, y and z. You can think of these differentials as placeholders where you will put other things. Of course, they represent this idea of changes in x, y, z and f. One way that one could explain it, and I don't really like it, is to say they represent infinitesimal changes. Another way to say it, and I think that is probably closer to the truth, is that these things are placeholders where you put values to get a tangent plane approximation. For example, if I replace these symbols by the numbers delta x, delta y and delta z, then I will actually get a numerical quantity. And that will be an approximation formula for delta f. It will be the linear approximation, the tangent plane approximation.

Let me actually start with something even before that. The first thing the differential does is encode how changes in x, y, z affect the value of f. I would say that is the most general answer to what this formula is, what these differentials are. It is a relation between the changes in x, y, z and f. Second, it is a placeholder for small variations delta x, delta y and delta z, to get an approximation formula, which is: delta f is approximately equal to f sub x delta x plus f sub y delta y plus f sub z delta z. It is getting cramped, but I am sure you know what is going on here. And observe how this one is an actual equality while that one is only approximately equal. So they are really not the same.

Another thing that the notation suggests we can do, and I claim we can do, is divide everything by some variable that everybody depends on. Say, for example, that x, y and z actually depend on some parameter t. Then they will vary at certain rates, dx over dt, dy over dt, dz over dt. And what the differential will tell us then is the rate of change of f as a function of t: when you plug in these values of x, y, z, you will get df over dt by dividing everything by dt in here. So the third thing we can do is divide by something like dt to get the rate of change: df over dt equals f sub x dx over dt plus f sub y dy over dt plus f sub z dz over dt. And that corresponds to the situation where x is a function of t, y is a function of t and z is a function of t. That means you can plug these values into f, so that the value of f depends on t, and then you can find the rate of change of the value of f with t.

These are the basic rules. And this last one is known as the chain rule. It is one instance of a chain rule, which tells you, when you have a function that depends on something, and that something in turn depends on something else, how to find the rate of change of the function with respect to the new variable in terms of the derivatives of the function and the dependence between the various variables. Any questions so far? No. OK.

A word of warning, in particular, about what I said up here. It is kind of unfortunate, but the textbook actually has a serious mistake on that. They have a couple of formulas where they mix a d with a delta, and I warn you not to do that, please. There are d's and there are delta's, and basically they don't live in the same world. They don't see each other. The textbook is lying to you.
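Before moving on, that approximation claim is easy to check numerically. Here is a small sketch with a function, base point and step sizes of my own choosing:

```python
# Numerical check of: delta f ~ f_x dx + f_y dy + f_z dz.
import math

def f(x, y, z):
    return x**2 * y + math.sin(z)

# Partial derivatives of this f, computed by hand
def fx(x, y, z): return 2 * x * y
def fy(x, y, z): return x**2
def fz(x, y, z): return math.cos(z)

x0, y0, z0 = 1.0, 2.0, 0.5          # base point
dx, dy, dz = 1e-3, -2e-3, 5e-4      # small variations

exact = f(x0 + dx, y0 + dy, z0 + dz) - f(x0, y0, z0)
approx = fx(x0, y0, z0)*dx + fy(x0, y0, z0)*dy + fz(x0, y0, z0)*dz
print(exact, approx)   # close, but not equal: hence the "approximately"
```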
Let's see. The first and the second claims I don't really need to justify: the first one is just stating some general principle, so I am not making a precise mathematical claim. And the second one -- well, we know the approximation formula already, so I don't need to justify it for you. But this third formula here, you probably have a right to expect some reason for why it works. Why is this valid? After all, I first told you we have these new mysterious objects, and then I am telling you we can do that, but I kind of pulled it out of my hat. I mean, I don't have a hat. Why is this valid? How can I get to this?

Here is a first attempt at justifying it. We said df is f sub x dx plus f sub y dy plus f sub z dz. But we know, if x is a function of t, then dx is x prime of t dt, dy is y prime of t dt, and dz is z prime of t dt. If we plug these into that formula, we get that df is f sub x times x prime of t dt plus f sub y times y prime of t dt plus f sub z times z prime of t dt. And now I have a relation between df and dt. See, I got df equals something times dt. That means the rate of change of f with respect to t should be that coefficient: if I divide by dt, then I get the chain rule.

That kind of works, but it shouldn't be completely satisfactory. Let's say that you are a true skeptic and you don't believe in differentials yet. Then it is maybe not very good that I used more of these differential notations in deriving the answer. That is actually not how it is proved. The way in which you prove the chain rule is not this way, because we shouldn't have too much trust in differentials just yet. By the end of today's lecture, yes, probably we should believe in them, but so far we should be a little bit reluctant to believe these kinds of strange objects telling us weird things.

Here is a better way to think about it. One thing that we do have trust in so far is approximation formulas. We should believe that if we change x a little bit, if we change y a little bit, then we are going to get a change in f that is approximately given by these terms. And this is true for any changes in x, y, z, but in particular let's look at the changes that we get if we take these quantities as functions of time and change time a little bit by delta t. We look at the changes in x, y, z over a small time delta t. Let's divide everybody by delta t. Here I am just dividing numbers, so I am not playing any tricks on you. We don't really know what it means to divide differentials, but dividing numbers is something we know.

And now, if I take delta t very small, this quotient tends to the derivative df over dt. Remember, the definition of df over dt is the limit of this ratio as the time interval delta t tends to zero. That means, if I choose smaller and smaller values of delta t, these ratios of numbers will tend to some value, and that value is the derivative. Similarly, delta x over delta t, when delta t is really small, will tend to the derivative dx/dt. And similarly for the others. That means we take the limit as delta t tends to zero, and we get df over dt on one side, and on the other side we get f sub x dx over dt plus f sub y dy over dt plus f sub z dz over dt. And the approximation becomes better and better. Remember, when we write approximately equal, that means it is not quite the same, but if we take smaller variations then we end up with values that are closer and closer. When we take the limit, as delta t tends to zero, eventually we get an equality. Mathematicians have more complicated words to justify this statement.
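The limit argument can also be watched numerically. Here is a minimal sketch, with an example of my own choosing: w = xyz where x = t, y = t^2, z = t^3, so that w = t^6 and the chain rule gives dw/dt = 6t^5.

```python
# The difference quotients delta w / delta t approach the chain-rule value
# as delta t -> 0.
t0 = 2.0

def w(t):
    return t * t**2 * t**3          # = t**6

chain_rule_value = 6 * t0**5        # f_x x' + f_y y' + f_z z', worked by hand

for dt in (1e-1, 1e-2, 1e-4, 1e-6):
    quotient = (w(t0 + dt) - w(t0)) / dt
    print(f"dt = {dt:.0e}   delta w / delta t = {quotient:.4f}   limit = {chain_rule_value}")
```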
I will spare you those for now; you will see them when you take analysis, if you go in that direction. Any questions so far? No. OK.

Let's check this with an example. Let's say that we really don't have any faith in these things, so let's try it. Let's say I give you the function w = x^2 y + z. And let's say that x will be t, y will be e^t and z will be sin(t). What does the chain rule say? Well, the chain rule tells us that dw/dt is -- we start with partial w over partial x, well, what is that? That is 2xy, and maybe I should point out that this is w sub x -- times dx over dt, plus w sub y, which is x squared, times dy over dt, plus w sub z, which is going to be just one, times dz over dt. And so now let's plug in the actual values of these things. x is t and y is e^t, so w sub x is 2t e^t, and dx over dt is one; plus x squared is t squared, and dy over dt is e^t; plus one times dz over dt, which is cosine t. At the end of the calculation we get 2t e^t plus t squared e^t plus cosine t. That is what the chain rule tells us.

How else could we find that? Well, we could just plug the values of x, y and z into w, so that w becomes a function of t, and take its derivative. Let's do that just for verification. It should be exactly the same answer. And, in fact, in this case, the two calculations are roughly equal in complication. But say that your function of x, y, z was much more complicated than that, or maybe you didn't actually know a formula for it and only knew its partial derivatives; then you would need to use the chain rule. So sometimes plugging in values is easier, but not always. Let's just check quickly. The other method would be to substitute. w as a function of t: remember, w was x^2 y + z. x was t, so you get t squared, times y, which is e^t, plus z, which was sine t. Then dw over dt we know how to take using single variable calculus. Well, we should know; if we don't, then we should take a look at 18.01 again. By the product rule, that will be the derivative of t squared, which is 2t, times e^t, plus t squared times the derivative of e^t, which is e^t, plus cosine t. And that is the same answer as over there. Maybe I ended up writing slightly more here, but the amount of calculation really was pretty much the same. Any questions about that?

Yes? What kind of object is w? Well, you can think of w as just another variable that is given as a function of x, y and z. You have a function of x, y, z defined by this formula, and I call its value w so that I can then substitute t in place of x, y, z. So let's think of w as a function of three variables. And then, when I plug in the dependence of these three variables on t, it becomes just a function of t. Really, my w here is pretty much what I called f before. There is no major difference between the two. Any other questions? No. OK.
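For good measure, here is that example checked with sympy, chain rule against direct substitution; this is my own encoding of the computation above:

```python
import sympy as sp

t = sp.symbols('t')
x, y, z = sp.symbols('x y z')

w = x**2 * y + z
subs = {x: t, y: sp.exp(t), z: sp.sin(t)}

# Chain rule: w_x dx/dt + w_y dy/dt + w_z dz/dt
chain = sum(sp.diff(w, var).subs(subs) * sp.diff(subs[var], t) for var in (x, y, z))

# Substitution: plug in first, then take an ordinary derivative
direct = sp.diff(w.subs(subs), t)

print(sp.expand(chain))               # t**2*exp(t) + 2*t*exp(t) + cos(t)
print(sp.simplify(chain - direct))    # 0: the two methods agree
```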
Let's see. Here is an application of what we have seen. Let's say that you want to understand all these rules about taking derivatives in single variable calculus. What I showed you at the beginning, and then erased, basically justifies how to take the derivative of an inverse function. And for that you didn't need multivariable calculus. But let's try to justify the product rule, for example. An application of the chain rule is to justify the product and quotient rules. Let's think, for example, of a function f of two variables, u and v, that is just the product uv. And let's say that u and v are actually functions of one variable t. Then, well, d of uv over dt is given by the chain rule applied to f. This is df over dt. So df over dt should be f sub u du over dt plus f sub v dv over dt. But now, what is the partial of f with respect to u? It is v. So that is v du over dt. And the partial of f with respect to v is going to be just u, so plus u dv over dt. You get back the usual product rule. That is a slightly complicated way of deriving it, but it is a valid way of understanding how to take the derivative of a product: by thinking of the product first as a function of two variables, u and v, and then saying, oh, but u and v were actually functions of a variable t. And then you do the differentiation in two stages using the chain rule.

Similarly, you can do the quotient rule, just for practice. Say I give you the function g equals u over v. Right now I am thinking of it as a function of two variables, u and v. And u and v themselves are going to be functions of t. Then dg over dt is going to be partial g over partial u -- how much is that? One over v -- times du over dt, plus -- well, next we need partial g over partial v. What is the derivative of this with respect to v? Here we need to know how to differentiate one over v. It is minus u over v squared, times dv over dt. And that is actually the usual quotient rule, just written in a slightly different way. Just in case you really want to see it: if you put everything over the common denominator v squared, then you will see basically u prime times v minus v prime times u, over v squared.
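Sympy will carry out exactly this two-stage differentiation if we make u and v undetermined functions of t. A short sketch of my own, just to confirm the derivation:

```python
import sympy as sp

t = sp.symbols('t')
u = sp.Function('u')(t)
v = sp.Function('v')(t)

print(sp.diff(u * v, t))                # v*u' + u*v'  -- the product rule
print(sp.together(sp.diff(u / v, t)))   # (u'*v - u*v')/v**2  -- the quotient rule
```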
Now let's go to something even more crazy. I claim we can do chain rules with more variables. Let's say that I have a quantity -- let's call it w for now -- as a function of, say, variables x and y. In the previous setup, x and y depended on some parameter t. But let's now look at the case where x and y themselves are functions of several variables, say of two more variables. Let's call them u and v. I am going to stay with these abstract letters, but if that sounds completely unmotivated, think about it maybe in terms of something you might know: say, polar coordinates. Let's say that I have a function that is defined in terms of the polar coordinate variables r and theta, and then I want to switch to the usual coordinates x and y. Or, the other way around, I have a function of x and y and I want to express it in terms of the polar coordinates r and theta. Then I would want to know how the derivatives with respect to the various sets of variables relate to each other.

One way I could do it is, of course, to say: if I plug the formula for x and the formula for y into the formula for f, then w becomes a function of u and v, and I can try to take partial derivatives. If I have explicit formulas, that could work. But maybe the formulas are complicated. Typically, if I switch between rectangular and polar coordinates, there might be inverse trig, maybe an arctangent to express the polar angle in terms of x and y. And maybe I don't really want to substitute arctangents everywhere; maybe I would rather deal with the derivatives. How do I do that? The question is: what are partial w over partial u and partial w over partial v in terms of -- let's see, what do we need to know to understand that? Well, we should probably know how w depends on x and y. If we don't know that, then we are probably toast. We need partial w over partial x and partial w over partial y. What else should we know? Well, it would probably help to know how x and y depend on u and v. If we don't know that, then we don't really know how to do it. So we also need x sub u, x sub v, y sub u, y sub v. We have a lot of partials in there. Well, let's see how we can do that.

Let's start by writing dw. We know that dw is -- well, I don't know why I have two names, w and f; w and f are really the same thing here -- let's say f sub x dx plus f sub y dy. So far that is our new friend, the differential. Now, what do we want to do with it? Well, we would like to get rid of dx and dy, because the question we are asking ourselves is: if I change u a little bit, how does w change? Of course, what happens if I change u a little bit is that x and y will change. How do they change? Well, that is given to us by the differential again. x is a function of u and v, so dx will be x sub u times du plus x sub v times dv. That is, again, taking the differential of a function of two variables. Does that make sense? And then we have the other term: f sub y times -- what is dy? Well, similarly, dy is y sub u du plus y sub v dv.

And now we have a relation between dw and du and dv. We are expressing how w reacts to changes in u and v, which was our goal. Now let's collect terms so that we see it a bit better. It is going to be (f sub x times x sub u plus f sub y times y sub u) du, plus (f sub x times x sub v plus f sub y times y sub v) dv. So now we have dw equals something du plus something dv. Well, the coefficient on du has to be partial f over partial u. What else could it be? It is the rate of change of w with respect to u if I forget what happens when I change v. That is the definition of a partial. Similarly, the other one has to be partial f over partial v, because it is the rate of change with respect to v if I keep u constant, so that the du terms are completely ignored. Now you see how the total differential accounts for all the partial derivatives, which come out as the coefficients of the individual variables in these expressions.

Let me maybe rewrite these formulas in a more visible way and then re-explain them to you. Here is the chain rule for this situation, with two intermediate variables and two variables that you express them in terms of. In our setting, we get: partial f over partial u equals partial f over partial x times partial x over partial u plus partial f over partial y times partial y over partial u. And the other one is the same thing with v instead of u: partial f over partial v equals partial f over partial x times partial x over partial v plus partial f over partial y times partial y over partial v. I have to explain various things about these formulas because they look complicated. And, actually, they are not that complicated. A couple of things to know. First, how do we remember a formula like that? Well, that is easy. We want to know how f depends on u. What does f depend on? It depends on x and y. So we will put partial f over partial x and partial f over partial y. Now, x and y, why are they here? Well, they are here because they actually depend on u as well. How does x depend on u? The answer is partial x over partial u. How does y depend on u? The answer is partial y over partial u. See, the structure of this formula is simple.
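To double-check this two-variable chain rule, here is a sympy sketch on functions of my own choosing (any f(x, y), x(u, v), y(u, v) would do):

```python
import sympy as sp

u, v, x, y = sp.symbols('u v x y')

f = x**2 * sp.sin(y)            # some f(x, y)
x_expr = u**2 - v               # some x(u, v)
y_expr = u * v                  # some y(u, v)

to_uv = {x: x_expr, y: y_expr}

# Chain rule: f_u = f_x x_u + f_y y_u
f_u_chain = (sp.diff(f, x).subs(to_uv) * sp.diff(x_expr, u)
             + sp.diff(f, y).subs(to_uv) * sp.diff(y_expr, u))

# Direct route: substitute first, then differentiate with respect to u
f_u_direct = sp.diff(f.subs(to_uv), u)

print(sp.simplify(f_u_chain - f_u_direct))   # 0
```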
To find the partial of f with respect to some new variable, you use the partials of f with respect to the variables it was initially defined in terms of, x and y, and you multiply them by the partials of x and y with respect to the new variable that you want to look at, and you sum these things together. That is the structure of the formula.

Why does it work? Well, let me explain it to you in a slightly different language. This asks us: how does f change if I change u a little bit? Why would f change if u changes a little bit? Well, it would change because f actually depends on x and y, and x and y depend on u. If I change u, how quickly does x change? The answer is partial x over partial u. And now, if x changes at this rate, how quickly does f change because of that? The answer is partial f over partial x times this guy. Well, at the same time, y is also changing. How fast is y changing if I change u? At the rate of partial y over partial u. And if y changes, how does f change? The rate of change is partial f over partial y. So each product measures the effect of changing u, thereby changing x or y, and therefore changing f. Now, what happens in real life if I change u a little bit? Well, both x and y change at the same time. So how does f change? It is the sum of the two effects. Does that make sense? Good. Of course, if f depends on more variables, then you just have more terms in here.

OK, here is another thing that may be a little bit confusing. What is tempting? Well, what is tempting here would be to simplify these formulas by removing these partial x's. Let's simplify by partial x; let's simplify by partial y. We would get: partial f over partial u equals partial f over partial u plus partial f over partial u. Something is not working properly. Why doesn't it work? The answer is precisely because these are partial derivatives, not total derivatives, and so you cannot simplify them in that way. That is actually the reason why we use this curly d rather than a straight d: it is to remind us, beware, there are simplifications that we can do with straight d's that are not legal here. When you have a partial derivative, you must resist the urge to simplify things. No simplifications in here. That is the simplest formula you can get.
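Just to make the warning vivid, here is a tiny example of my own construction where the illegal "cancellation" visibly gives the wrong answer:

```python
import sympy as sp

u, v, x, y = sp.symbols('u v x y')

f = x + y                          # a simple f(x, y)
x_expr, y_expr = u + v, u - v      # x and y both depend on u

f_uv = f.subs({x: x_expr, y: y_expr})      # f = 2u, so partial f / partial u = 2

correct = (sp.diff(f, x) * sp.diff(x_expr, u)
           + sp.diff(f, y) * sp.diff(y_expr, u))   # chain rule: 1*1 + 1*1 = 2

# "Cancelling" partial x and partial y would claim f_u = f_u + f_u = 2 f_u:
bogus = 2 * sp.diff(f_uv, u)                       # = 4, not 2

print(sp.diff(f_uv, u), correct, bogus)   # 2  2  4
```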
Any questions at this point? Yes? When would you use this and what does it describe? Well, it is basically for when you have a function given in terms of a certain set of variables, because maybe there is a simple expression in terms of those variables, but ultimately what you care about is not those variables, x and y, but another set of variables, here u and v. So x and y give you a nice formula for f, but the relevant variables for your problem are u and v, and you know how x and y are related to u and v. Of course, what you could do is plug in the formulas the way that we did, substituting. But maybe that gives you very complicated expressions, and maybe it is actually easier to just work with the derivatives. The important claim here is that we don't need to know the actual formulas; all we need to know are the rates of change. If we know all these rates of change, then we know how to take these derivatives without actually having to plug in values.

Yes? Yes, you could certainly do the same thing in terms of t. If x and y were functions of t instead of being functions of u and v, then it would be the same thing, and you would have the same formula that I had -- well, over there, I still have it. Why does that one have straight d's? The answer is I could put curly d's if I wanted, but there I end up with a function of a single variable. If you have a function of a single variable, then the partial with respect to that variable is the same thing as the usual derivative, so we don't actually need to worry about curly d's in that case. But that one is indeed a special case of this one, where instead of x and y depending on two variables u and v, they depend on a single variable t. Now, of course, you can call the variables any name you want; it doesn't matter. This is just a slight generalization of that. Well, not quite, because there I also had a z. See, I am trying to confuse you by giving you functions that depend on various numbers of variables. If you have a function of 30 variables, things work the same way, just longer, and you are going to run out of letters in the alphabet before the end. Any other questions?

Yes? If u and v themselves depended on yet another variable, then you would continue with your chain rules. Sorry -- if u and v depend on yet another variable, then you could get the derivative with respect to that variable by using first the chain rule to pass from u and v to the new variable, and then plugging in these formulas for the partials of f with respect to u and v. In fact, if you have several substitutions to do, you can always arrange to use one chain rule at a time; you just have to do them in sequence. That's why we don't actually need to learn that case separately; you can do it by repeating the process. Probably, at that stage, the easiest way to not get confused is to manipulate differentials, because that is probably easier.

Yes? Curly f does not exist. That's easy. Curly f makes no sense by itself; it doesn't exist alone. What exists is only curly df over curly d some variable. And that accounts only for the rate of change with respect to that variable, leaving the others fixed, while straight df is somehow the total variation of f; it accounts for all of the partial derivatives and their combined effects. OK, any more questions?

No. Let me just finish up very quickly by telling you again one example where you might want to do this: when you have a function and you want to switch between rectangular and polar coordinates. To make things a little bit concrete: polar coordinates means that in the plane, instead of using x and y, you use the coordinates r, the distance to the origin, and theta, the angle from the x-axis. The change of variables for that is x equals r cosine theta and y equals r sine theta. And that means if you have a function f that depends on x and y, you can plug these in and view f as a function of r and theta. Then you can ask yourself: what is partial f over partial r? That is going to be partial f over partial x times partial x over partial r plus partial f over partial y times partial y over partial r. Well, partial x over partial r is cosine theta and partial y over partial r is sine theta, so that ends up being f sub x times cosine theta plus f sub y times sine theta. And you can do the same thing to find partial f over partial theta. So you can express derivatives either in terms of x and y or in terms of r and theta, with simple relations between them.
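The polar-coordinate formula checks out symbolically too. A minimal sympy sketch, with a function of my own choosing:

```python
import sympy as sp

r, theta, x, y = sp.symbols('r theta x y', positive=True)

f = x**2 + x*y                       # any f(x, y) would do
to_polar = {x: r*sp.cos(theta), y: r*sp.sin(theta)}

# Chain rule: f_r = f_x cos(theta) + f_y sin(theta)
chain = (sp.diff(f, x) * sp.cos(theta)
         + sp.diff(f, y) * sp.sin(theta)).subs(to_polar)

# Direct route: substitute, then differentiate with respect to r
direct = sp.diff(f.subs(to_polar), r)

print(sp.simplify(chain - direct))   # 0
```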
And one last thing I should say. On Thursday we will learn about more tricks we can play with variations of functions. And one that is important, because you need to know it to do the p-set, is the gradient vector. The gradient vector is simply a vector. You use this downward-pointing triangle, nabla, as the notation for the gradient. It is simply a vector whose components are the partial derivatives of the function. In a way, you can think of the differential as a way to package the partial derivatives together into some weird object; well, the gradient is also a way to package the partials together, this time as a vector. We will see on Thursday what it is good for, but some of the problems on the p-set use it.
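As a preview of the gradient as packaged partials, here is a one-liner in sympy, reusing the w = x^2 y + z from the example earlier:

```python
import sympy as sp

x, y, z = sp.symbols('x y z')
w = x**2 * y + z

grad_w = [sp.diff(w, var) for var in (x, y, z)]
print(grad_w)    # [2*x*y, x**2, 1] -- the components are the partial derivatives
```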