Description: In this lecture, Professor Demaine covers the augmentation of data structures, updating common structures to store additional information.
Instructor: Erik Demaine
Lecture 9: Augmentation: Ra...
The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.
PROFESSOR: All right, let's get started. Today we have another data structures topic which is, Data Structure Augmentation. The idea here is we're going to take some existing data structure and augment it to do extra cool things.
We take some other data structure that we've covered. Typically, that'll be a balanced search tree, like an AVL tree or a 2-3 tree. And then we'll modify it to store extra information, which will enable additional kinds of searches, typically, and sometimes lets us do updates better.
And in 006, you've seen an example of this where you took AVL trees and augmented AVL trees so that every node knew the number of nodes in that rooted subtree. Today we're going to see that example but also a bunch of other examples, different types of augmentation you could do. And we'll start out with a very simple one, which I call easy tree augmentation, which will include subtree size as a special case.
So with easy tree augmentation, the idea is you have a tree, like an AVL tree, or a 2-3 tree, or something like that. And you'd like to store, for every node x, some function of the subtree rooted at x. Such as the number of nodes in there, or the sum of the weights of the nodes, or the sum of the squares of the weights, or the min, or the max, or the median, maybe. Some function f, which is a function of that subtree.
Say the goal is to store f of the subtree rooted at x at each node x in a field which I'll call x.f. So, normally nodes have a left child, right child, parent. But we're going to store an extra field x.f for some function that you define. This is not always possible, but here's a case where it is possible. That's going to be the easy case. Suppose x.f can be computed locally using lower information, lower nodes.
And we'll say, let's suppose it can be computed in constant time from information in the node x, from x's children, and from the f values stored in the children. I'll call that children.f. But really, I mean left child.f, right child.f, or if you have a 2-3 tree you have three children, potentially. And the .f of each of them.
OK. So suppose you can compute x.f locally, just using one level down, in constant time. Then, as you might expect, you can update whenever a node ends up changing. So, more formally, suppose some set of nodes changes-- call this set S.
So I'm stating a very general theorem here. If there is some set of nodes where we changed something about them-- we change either their f field, we change some of the data that's in the node, or we do a rotation that moves nodes around-- then the cost is the total number of ancestors of these nodes. Those are the nodes that need to be updated, because we're assuming we can compute x.f just given the children's data. So if a node's data is changing, we have to update its parent's value of f, because it depends on the child's value. We have to update all those parents, all the way up to the root. So however many ancestors there are, that's the total cost.
Now, luckily, in an AVL tree, or a 2-3 tree, most balanced search structures, the updates you do are very localized. When we do splits in a 2-3 tree, we only do them up a single path to the root. So the number of ancestors here is just going to be log n. Same thing with an AVL tree. If you look at the rotations you do, they are along a single leaf-to-root path. And so the number of ancestors that need to be updated is always order log n. Things change, and there are order log n ancestors of them.
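Here is a minimal sketch of that upward update in Python, assuming each node has a parent pointer and an f field, and that compute_f is a user-supplied constant-time function of a node and its children's f values; these names are illustrative, not from the lecture.

```python
def update_f_upward(x, compute_f):
    """After node x changes, recompute x.f and propagate the change
    up to the root.  compute_f(x) is assumed to read only x and its
    children's .f fields (the 'local' condition), so each step is O(1)."""
    while x is not None:
        x.f = compute_f(x)
        x = x.parent  # at most O(log n) ancestors in a balanced tree
```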
So this is a little more general than we need, but it's just to point out that if we did log n rotations spread out somewhere in the tree, that would actually be bad, because the total number of ancestors could be log squared n. But because in the structures we've seen, we just work on a single path to the root, we get log n. So in a little more detail here, whenever we do a rotation in an AVL tree-- let's say A, B, C, x, y.
Remember rotations? It's been a while since we've done rotations. So we haven't changed any of the nodes in A, B, C, but we have changed the nodes x and y. So we're going to have to trigger an update of y first-- update y.f-- and then we're going to trigger the update to x.f. And as long as each one can be computed from its children, we compute y.f first, and then we can compute x.f from its children.
All right. So there's a constant number of extra things we need to do whenever we do a rotation. And because the rotations lie on a single path-- once we stop doing the rotations, in an AVL insert say, we still have to keep updating up to the root. But there are at most log n nodes to do that for.
OK. Same thing with 2-3 trees. We have a node split. So we have, I guess, three keys, four children. That's too many. So we split to two nodes and an extra node up here. Then we just trigger an update of this f value, an update of this f value, and an update of that f value. And because that just follows a single path everything's log n.
So this is a general theorem about augmentation. Any function that's well behaved in this sense, we can maintain in AVL trees and 2-3 trees. And I'll remind you and state, a little more generally, what you did in 006, which are called order statistic trees in the textbook.
So here we're going to-- let me first tell you what we're trying to achieve. This is the abstract data type, or the interface of the data structure. We want to do insert, delete, and say, successor searches. It's the usual thing we want out of a binary search tree. Predecessor too, sure. We want to do rank of a given key which is, tell me what is the index of that key in the overall sorted order of the items, of the keys?
We've talked about rank a few times already in this class. Depends whether you start at 0 or 1, but let's say we start at one. So if you say rank of the key that happens to be the minimum, you want to get one. If you say rank of the key that happens to be the median, you want to get n over 2 plus 1, and so on.
So it's a natural thing you might want to find out. And the converse operation is select, let's say of i, which is, give me the key of rank i.
We've talked about select as an offline operation. Given an array, find me the median. Or find me the item of rank n over 7. And we can do that in linear time given no data structure. Here, we want a data structure so that we can find the median, or the seventh item, or the rank n over 7 key, whatever, in log n time. We want to do all of these in log n per operation.
OK. So in particular, rank of select of i should equal i. We're trying to find the item of that rank. So far, so good. And just to plug these two parts together: we have this data structure augmentation tool, we have this goal we want to achieve, and we're going to achieve this goal by applying this technique where f is just the subtree size. It's the number of nodes in that subtree, because that will let us compute rank.
So we're going to use easy tree augmentation with f of subtree equal to the number of nodes in the subtree. So in order for this to apply, we need to check that given a node x we can compute x.f just using its children. This is easy. We just add everything up. So x.f would be equal to 1. That's for x. Plus the sum of c.f for every child c.
I'll write this in Python-like notation so it looks a little more like an algorithm. I'm trying to be generic here. If it's a binary search tree you just do x.left.f plus x.right.f. But this will work also for 2-3 trees. Pick your favorite data structure. As long as there's a constant number of children then this will take constant time. So we've satisfied this condition, so we can do easy tree augmentation. And now we know we have subtree sizes. So given any node, we know the number of descendants below that node. So that's cool.
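As a concrete sketch for the binary case, assuming each node has left, right, and size fields (None for a missing child), the local update might look like this; it's an illustration of the condition above, not code from the lecture.

```python
def update_size(x):
    """Recompute x.size from its children in O(1): 1 for x itself,
    plus the sizes of its subtrees (missing children count as 0)."""
    left_size = x.left.size if x.left is not None else 0
    right_size = x.right.size if x.right is not None else 0
    x.size = 1 + left_size + right_size
```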
It lets us compute rank and select. I'll just give you those algorithms, quickly. We can check that they're log n time.
Yeah. So the idea is pretty simple. You have some key-- let's think about binary trees now, because it's a little bit easier. We have some item x. It has a left subtree, right subtree. And now let's look up from x. Just keep calling x.parent. So sometimes the parent is to the right of us and sometimes the parent is to the left of us. I'm going to draw this in a, kind of, funny way.
But this funny way has a very special property, which is that the x-coordinate in this diagram is the key value. Or is the sorted order of the keys, right? Everything in the left subtree of x has a value less than x. If we say all the keys are different. Everything to the right of x has a value greater than x. If x was the left child of its parent, that means this thing is also greater than x. And if we follow a parent and this was the right child of that parent, that means this thing is less than x. So that's why I drew it all the way over to the left. This thing is also less than x because it was a, I'll call it a left parent. Here we have a right parent, so that means this is something greater than x. And over here we have a left parent, so this is something less than x. Let's say that's the root.
In general, there's going to be some left edges and some right edges as we go up. These arrows will go either left or right in a binary tree. So the rank of x is just 1 plus the number of nodes that are less than x. Number of keys that are less than x. So there's these guys, there's these guys, and there's whatever's hanging off-- OK. Here I've almost violated my x-coordinate rule. If I make these really narrow, that's right. All of these things, all of these nodes in the left subtrees of these less than x nodes will also be less than x. If you think about these other subtrees, they're going to be bigger than x. So we don't really care about them.
So we just want to count up all these nodes and all of these nodes. So the algorithm to do that is pretty simple. We're just going to start out with--
I'm going to switch from this f notation to size. That's a little more natural. In general, you might have many functions. Size is the usual notation for subtree size. So we start out by counting up how many items are here. And if we want to start at a rank of 1, if the min has rank 1, then I should also do plus 1 for x itself. If you wanted to start at zero you just omit that plus 1. And then, all I do is walk up from x to the root of the tree. And whenever we go left from, say, x to x prime-- so that means we have an x prime whose right child is x, and when we went from x to its parent we went to the left.
Then we say rank plus equals x prime.left.size plus 1 for x prime itself. And maybe x prime.left.size is zero. Maybe there's no nodes over there. But at the very least we have to count those nodes that are to the left of us. And if there's anything down here we add up all those things. So that lets us compute rank.
How long does it take? Well, we're just walking up one path from a leaf to a root-- or not necessarily a leaf, but from some node x to the root. And as long as we're using a balanced structure like AVL trees-- I guess I want binary here, so let's say AVL trees-- then this will take log n time. So I'm spending constant work per step, and there's log n steps. Clear?
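Here is a sketch of that rank computation, assuming nodes have left, right, parent, and size fields, with 1-indexed ranks as in the lecture.

```python
def rank(x):
    """Rank of node x (1-indexed) in a BST augmented with subtree sizes.
    Count x and its left subtree, then, for every ancestor reached by a
    'leftward' step up (x was the right child), add that ancestor plus its
    left subtree.  O(log n) in a balanced tree."""
    r = (x.left.size if x.left else 0) + 1
    node = x
    while node.parent is not None:
        if node is node.parent.right:  # we went left going up
            r += (node.parent.left.size if node.parent.left else 0) + 1
        node = node.parent
    return r
```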
So that's good old rank. Easy to do once you have subtree size. Let's do select for fun.
This may seem like review, but I drew out this picture explicitly because we're going to do it a lot today. We'll have pictures like this a bunch of times. Really helps to think about where the nodes are, which ones are less than x, which ones are greater than x. Let's do select first. This you may not have seen in 006.
So we're going to do the reverse. We're going to start at the root and we're going to walk down. Sounds easy enough. But now walking down is kind of like doing a search but we don't have a key we're searching for, we have a rank we're searching for. So what is that rank? Rank is i. OK. So on the other hand, we have the node x. We'd like to know the rank of x and compare that to i. That will tell us whether we should go left, or go right, or whether we happen to find the item.
Now one possibility is we call rank of x to find the rank of x. But that's dangerous because I'm going to have a for loop here and it's going to take log n iterations. If at every iteration I'm computing rank of x, and rank costs log n, then the overall cost might be log squared n. So I can't afford to-- I want to know what the rank of x is but I can't afford to say rank, open paren, x. Because that recursive call will be too expensive. So what is the rank of x in this case? This is a little special. What's that?
AUDIENCE: Number of left children plus 1.
PROFESSOR: Number of left, or the size of the left subtree plus 1. Yep. Plus 1 if we're counting, starting at one. Very good. I'm slowly getting better. Didn't hit anyone this time. OK.
So at least for the root, this is the rank, and that only takes us constant time in this special case. So we'll have to check that it still holds after I do the loop. But it will. So, cool. Now there are three cases. If i equals rank-- if the rank we're searching for is the rank that we happen to have, then we're done, right? We just return x. That's the easy case.
More likely is that I will be either less than or greater than the rank of x. OK. So if i is less than the rank, this is fairly easy. We just say x equals x.left.
Did I get that right? Yep. In this case, here we have x. It's at rank, rank. And then we have the left subtree and the right subtree. And so if the rank we're searching for is less than rank, that means we know it's in here. So we should go left. And if we just said x equals x.left you might ask, well, what rank are we searching for in here? Well, exactly the same rank. Fine. That's the easy case.
In the other situation, if we're searching in here, we're searching for rank greater than rank. Then I want to go right but the new rank that I'm searching for is local to this subtree. I'm searching for i minus this stuff. This stuff is rank. So I'm going to let i be i minus rank.
Make sure I don't have any off by 1 errors. That seems to be right. OK. And then I do a loop. So I'll write repeat.
So then I'm going to go up here and say, OK. Now relative to this thing. What is the rank of the root of this subtree? Well, it's again going to be that node .left.size plus 1. And now I have the new rank I'm searching for, i. And I just keep going. You could write this recursively if you like, but here's an iterative version.
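Here is an iterative sketch of select under the same assumptions (subtree-size augmentation, 1-indexed ranks); it mirrors the three cases above.

```python
def select(root, i):
    """Return the node of rank i (1-indexed) in a subtree-size-augmented BST.
    At each node the local rank is left.size + 1, computed in O(1),
    so the whole walk down costs O(log n)."""
    x = root
    while x is not None:
        r = (x.left.size if x.left else 0) + 1
        if i == r:
            return x
        elif i < r:
            x = x.left
        else:
            i -= r          # rank is now relative to the right subtree
            x = x.right
    return None             # i is out of range
```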
So it's actually very similar to the select algorithm that we had when we did deterministic linear time median finding, or randomized median finding. They had a very similar kind of recursion. But in that case, they were spending linear time to do the partition and that was expensive. Here, we're just spending constant time at each node and so the overall cost is log n. So that's nice. Any questions about that?
OK. I have a note here. Subtree size is obvious once you know that's what you should do. Another natural thing to try to do would be to augment, for each node, what is the rank of that node? Because then rank is really easy to find. And then select would basically be a regular search. I just look at the rank of the root, I see whether the rank I'm looking for is too big, or too small, and I go left or right, accordingly.
What would be bad about augmenting with rank of a node? Updates. Why? What's a bad example for an update?
AUDIENCE: If you add a new minimum element.
PROFESSOR: Right. Say we insert a new minimum element.
Good catch, cameraman. That was for the camera, obviously. So, right. If we insert, this is off to the side, but say we insert, I'll call it minus infinity. A new key that is smaller than all other keys, then the rank of every node changes. So that's bad. It means that easy tree augmentation, in particular, isn't going to apply. And furthermore, it would take linear time to do this. And you could keep inserting, if you insert keys in decreasing order from there, every time you do an insert, all the ranks increase by one. Maintaining that's going to cost linear time per update.
So you have to be really careful that the function you want to store actually can be maintained. Be very careful about that, say, on the quiz coming up, that when you're augmenting something you can actually maintain it. For example, it's very hard to maintain the depths of nodes because when you do a rotation a whole lot of depths change.
Depth is counting from the root. How deep am I? When I do a rotation then this entire subtree went down by one. This entire subtree went up by one. In this picture. But it's very easy to maintain heights, for example. Height counting from the bottom is OK, because I don't affect the height of a, b, and c. I affect it for x and y but that's just two nodes. That I can afford. So that's what you want to be careful of in the easy tree augmentation.
So most of the time, easy tree augmentation does the job. But in the remaining two examples, I want to show you cooler examples of augmentation. These are things you probably wouldn't be expected to come up with on your own, but they're cool. And they let us do more sophisticated operations.
So the first one is called level linking. And here we're going to do it in the context of 2-3 trees, partly for variety. So the idea of level linking is very simple. Let me draw a 2-3 tree.
Not a very impressive 2-3 tree. I guess I don't feel like drawing too much. Level linking is the idea of, in addition to these child and parent pointers, we're going to add links on all the levels. Horizontal links, you might call them.
OK. So that's nice. Two questions-- can we do this? And what's it good for? So let's start with can we do this. Remember in 2-3 trees all we have to think about are splits and merges. So in a split, we have, for a brief period, let's say three keys, four children. That's too many. So we change that to--
I'm going to change this in a moment. For now, this is the split you know and love, maybe. At least know. And if we think about where the level pointers are, we have one before. And then we just need to distribute those pointers to the two resulting nodes. And then we have to create a new pointer between the nodes that we just created. This is, of course, easy to do.
We're here. We're taking this node. We're splitting it in half. So we have the nodes right in our hands, so just add pointers between them. And the key thing is, there's some node over here on the left. It used to point to this node; now we have to change it to point to the left version, the left half of the node. And there's some node over on the right. We have to change its left pointer to point to this right half of the node. But that's it. Constant time.
So this doesn't fall under the category of easy tree augmentation, because this is not isolated to the subtree. We're also dealing with its left and right neighbors. But it's still easy to do in constant time.
Merging nodes is going to be similar. If we steal a node from our parent or a sibling, nothing happens in terms of level links. But if we have, say, an empty node and a sibling that cannot afford any stealing-- so we have a single child here, two children there-- and we merge it into--
We're taking something from our parent. Bringing it down. Then we have three children afterwards. Again, we used to have these level pointers. Now we just have these level pointers. It's easy to maintain. It's just a constant size neighborhood.
Because we have the level links, we can get to our left and right neighbors and change where the links point to. So easy to maintain in constant time. I'll call it constant overhead. Every time we do a split or merge we spend additional constant time to do it. We're already spending constant time. So just changes everything by constant factor. So far, so good.
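A sketch of the constant-time pointer surgery for the split case, assuming each node carries level_left and level_right pointers (illustrative names, not from the lecture):

```python
def fix_level_links_after_split(old, left_half, right_half):
    """Splice the two halves of a split node into the level-linked list.
    The old node's left and right neighbors are redirected to the new
    halves, and the halves are linked to each other -- all O(1) pointer
    updates."""
    left_half.level_left = old.level_left
    left_half.level_right = right_half
    right_half.level_left = left_half
    right_half.level_right = old.level_right
    if old.level_left is not None:
        old.level_left.level_right = left_half
    if old.level_right is not None:
        old.level_right.level_left = right_half
```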
Now, I'm going to have to tweak this data structure a little bit. But let me first tell you why. What am I trying to achieve with this data structure? What I'm trying to achieve is something called the finger search property.
So let's just think about the case where I'm doing a successful search. I'm searching for key x and I find it in the data structure. I find it in the tree. Suppose I found one-- I search for x, I found it. And then I search for another key y. Actually I think I'll do the reverse. First I found y, now I'm searching for x. If x and y are nearby in the tree, I want this to run especially fast. For example, if x is the successor of y I want this to take constant time. That would be nice.
In the worst case, x and y are very far away from each other in the tree, and then I want it to take log n time. So how could I interpolate between constant time for finding the successor and log n time for the worst case search? So I'm going to call this search of x from y. Meaning-- this is a little imprecise, but what I mean is when I call search, I tell it where I've already found y. And here it is. Here's the node storing y. And now I'm given a key x. And I want to find that key x given the node that stores key y. So how long should this take? What would be a good way to interpolate between constant time at one extreme-- the good case, when x and y are basically neighbors in sorted order-- versus log n time in the worst case?
AUDIENCE: Distance along the graph.
PROFESSOR: Distance along the graph. That would be one reasonable definition. So I have a tree which you could think of as a graph. Measure the shortest path length from x to y. Or we have a more sophisticated graph over here. Maybe that length. The trouble with the distance in the graph-- that's a reasonable suggestion, but it's very data structure specific. If I use an AVL tree without level links, then the distance could be one thing, whereas if I use a 2-3 tree, even without level links, it's going to be a different distance. If I use a 2-3 tree with level links it's going to be yet another distance. So that's a little unsatisfying. I want this to be an answer to a question. I don't want to phrase the question in terms of the data structure.
AUDIENCE: Difference between ranks of x and y?
PROFESSOR: Difference between ranks of x and y. That's close.
So I'm going to look at the rank of x and rank of y. Let's say, take the absolute difference. That's kind of how far away they are in sorted order. Do you want to add anything?
AUDIENCE: Log?
PROFESSOR: Log. Yeah. Because in the worst case the difference in ranks could be linear. So I want to add a log out here to get log n in that worst case.
Add a big O for safety. That's how much time we want to achieve. So this would be the finger search property: that you can solve this problem in this much time. Again, the difference in ranks is at most n, so this is at most log n. But if y is the successor of x, the difference is only 1, and this will be constant.
So this is great if you're doing lots of searches and you tend to search for things that are nearby, but sometimes you search for things are far away. This gives you a nice bound.
On the one hand, we have, this is our goal. Log difference of ranks. On the other hand, we have the suggestion that what we can achieve is something like the distance in the graph.
But we have a problem with this. I used to think this data structure solved the problem, but it doesn't. Let me just draw-- actually I have a tree right there. I'm going to use that one. Suppose x is here and y is here. OK. This is a bit of a small tree, but if you think about it long enough, this node is the predecessor of this node. So their difference in ranks should be 1.
But the distance in the graph here is two. Not very impressive. But in general, you have a tree of height log n. If you look at the root, and the predecessor of the root, they will have a rank difference of one by definition of predecessor. But the graph distance will be log n. So that's bad news, because if we're only following pointers there's no way to get from here to there in constant time. So we're not quite there.
We're going to use another tweak to the data structure, which is to store the data in the leaves. I tried to find a data structure that didn't require this and still got finger search. But as far as I know, there is none. No such data structure. If you look at, say, Wikipedia about B-trees, you'll see there's a ton of variations of B-trees: B+-trees, B*-trees. This is one of those. I think B+-trees.
As you saw, B-trees or 2-3 trees, every node stored one or two keys. And each key only existed in one spot. We're still only going to put each key in one spot, kind of. But it's only going to be the leaf spots. OK. Good news is most nodes are leaves, right? Constant fraction of the nodes are going to be leaves. So it doesn't change too much from a space efficiency standpoint. If we just put data down here and don't put-- I'm not going to put any keys up here for now.
So this a little weird. Let me draw an example of such a tree. So maybe we have 2, and 5, and 7, and 8, 9, let's say. Let's put 1 here. So I'm going to have a node here with three children, a node here with two children, and here's a node with two children. So I think this mimics this tree, roughly. I got it exactly right.
So here I've taken this tree structure. I've redrawn it. There's now no keys in these nodes. But everything else is going to be the same. Every node is going to have 0 children if it's a leaf, or two, or three children otherwise. Never have one child because then you wouldn't get logarithmic depth. All the leaves are going to be at the same depth.
And that's it. OK. That is a 2-3 tree with the data stored in the leaves. It's a useful trick to know. Now we're going to do a level linked 2-3 tree. So in addition to that picture, we're going to have links like this.
OK. And I should check that I can still do insert and delete into these structures. It's actually not too hard. But let's think about it.
I think, actually, it might be easier. Let's see. So if I want to do an insert-- OK. I have to first search for where I'm inserting. I haven't told you how to do search yet. OK. So let's first think about search.
What we're going to do is data structure augmentation. We have easy tree augmentation. So I'm going to do it, and at each node, the functions I'm going to store are the minimum key in the subtree and the maximum key in the subtree. There are many ways to do this, but I think this is kind of the simplest. So what that means is, at this node, I'm going to store 1 as the min and 7 as the max.
And at this node it's going to be 1 at the min and 9 at the max. And here we have 8 as the min and 9 as the max. Again min and max of subtrees are easy to store. If I ever change a node I can update it based on its children, just by looking at the min of the leftmost child and the max of the rightmost child. If I didn't know 1 and 9, I could just look at this min and that max and that's going to be the min and the max of the overall tree. So in constant time I can update the min and the max of a node given the min and the max of its children. Special case is at the leaves. Then you have to actually look at keys and compare them. But leaves only have, at most, two keys. So pretty easy to compare them in constant time. OK.
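A minimal sketch of that local min/max update, assuming internal nodes store an ordered children list and leaves store their keys in a keys list (illustrative field names):

```python
def update_min_max(v):
    """Recompute v.min and v.max from one level down, in O(1).
    For an internal node of a 2-3 tree with data in the leaves, the subtree
    min is the leftmost child's min and the max is the rightmost child's max;
    a leaf just looks at its (at most two) keys."""
    if v.children:                 # internal node
        v.min = v.children[0].min
        v.max = v.children[-1].max
    else:                          # leaf
        v.min = min(v.keys)
        v.max = max(v.keys)
```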
So that's how I do the augmentation. Now how do I do a search? Well, if I'm at a node and I'm searching for a key. Well, let's say I'm at this node. I'm searching for a key like 8. What I'm going to do is look at all of the children. In this case, there's two. In the worst case there's three. I look at the min and max and I see where does 8 fall? Well it falls in this interval. If I was searching for 7 1/2 I know it's not there. It's going to be in between here. If I'm doing a successor then I'll go to the right. If I'm doing predecessor I'll go to the left. And then take either the maximum item or the minimum item.
If I'm searching for 8 I see, oh. 8 falls in the interval between 8 and 9, so I should clearly take the right child among those two children. In general, there's three children. Three intervals. Constant time. I can find where my key falls in the interval. OK.
So search is going to take log n time again, provided I have these mins and maxs. If you stare at it long enough, this is pretty much the same thing as regular search in a 2-3 tree. But I've put the data just one level down. OK. Good.
That was regular search. I still need to do finger search, but we'll get there. And now, if I want to do an insert into this data structure, what happens? Well, I search for the key. Let's say I'm inserting 6. So maybe I go here. I see that 6 is in neither of these intervals, but it's closest to the interval 2, 5, or the interval 7. Let's say I go down to 2, 5. And well, to insert 6 I'll just add a 6 on there. Of course, now that node is too big.
So there's still going to be a split case at the leaves where I have, let's say, a, b, c-- too many keys. I'm going to split that into a, b and c. This is different from before. It used to be I would promote b to the parent because the parent needed the key there. Now parents don't have keys. So I'm just going to split this thing, roughly, in half. It works. It's still the case that whoever was the parent up here now has an additional child. One more child. So maybe that node now has four children but it's supposed to have two or three. So if I have a node with four children, what do I do? I'm supposed to use these fancy arrows. What do I do in this case? I'm just going to split that into two nodes with two children each. And again, this used to have a parent. Now that parent has an additional child, and that may cause another split.
It's just like before. We just potentially split all the way up to the root. If we split the root then we get an additional level. But we can do all this, and we can still maintain our level links, if we want.
But everything will take log n. I won't draw the delete case, as delete is slightly more annoying. But I think, in this case, you never have to worry about where is the key coming from, your child or your parent? You're just merging nodes so it's a little bit simpler. But you have to deal with the leaf case separately from the nonleaf case. OK.
So all this was to convince you that we can store data in the leaves. 2-3 trees still work fine. Now I claim that the graph distance in level link trees is within a constant factor of the finger search bound. So I claim I can get the finger search property in 2-3 trees, with data in the leaves, with level links. So lots of changes here. But in the end, we're going to get a finger search bound. Let's go over here.
So here's a finger search operation. First thing I want to do is identify a node that I'm working with. I want to start from y's node. So we're supposing that we're told the node, a leaf, that contains y. So I'm going to let v be that leaf.
OK. Because we're supposing we've already found y, and now all the data is in the leaves. So give me the leaf that contains y. So that should take constant time. That's just part of the input.
Now I'm going to do a combination of going up and horizontal. So starting at a leaf. And the first thing I'm going to do is check, does this leaf contain what I want? Does it contain the key I'm searching for, which is x? So that's going to be the case. At every node I store the min and the max. So if x happens to fall between the min and the max, then I'm happy.
Then I'm going to do a regular search in v's subtree. This seems weird in the case of a leaf. In the case of a leaf, this is just to check the two keys that are there. Which one is x. OK. But in general I gave you this search algorithm which was, if I decide which child to take, according to the ranges, that's a downward search. So that's what I'm calling regular search here. Maybe downward would be a little better.
This is the usual log n time thing. But we're going to claim a bound better than log n. If this is not the case, then I know x either falls before v.min or after v.max.
So if x is less than v.min then I'm going to go left: v equals v.level-left, I'll call it, to be clear. You might say left is the left child-- there's no left child here, of course, but level-left is clear. We take the horizontal left pointer. And otherwise x is greater than v.max, and in that case I will go right. That seems logical.
And in both cases we're going to go up: x equals x.parent-- whoops, v equals v.parent. x is not changing here. x is the key we're searching for; v is the node, v for vertex. So we're always going to go up, and then we're going to go either left or right, and we're going to keep doing that until we find a subtree that contains x in terms of key range. Then we're going to stop this part and we're just going to do a downward search. I should say return here or something. I'm going to do a downward search, which was this regular algorithm. And then whatever it finds, that's what I return.
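Putting it together, here is a sketch of the finger search loop, assuming the min/max augmentation, level_left/level_right pointers, and a downward_search(v, x) routine for the ordinary top-down search (all illustrative names; edge cases such as x lying outside the whole key range are ignored):

```python
def finger_search(x, leaf_y):
    """Search for key x starting from the leaf containing y, in a
    level-linked 2-3 tree with data in the leaves.  Walk sideways and up
    until the current subtree's [min, max] range contains x, then do a
    regular downward search.  Runs in O(log |rank(x) - rank(y)|)."""
    v = leaf_y
    while not (v.min <= x <= v.max):
        if x < v.min:
            v = v.level_left       # horizontal step toward x
        else:
            v = v.level_right
        v = v.parent               # then one step up
    return downward_search(v, x)   # ordinary search within v's subtree
```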
I claim the algorithm should be clear. What's less clear is that it achieves the bound that we want. But I claim that this will achieve the finger search property. Let me draw a picture of what this thing looks like kind of generically. On small examples it's hard to see what's going on. So I'm going to draw a piece of a large example.
Let's say we start here. This is where y was. I'm searching for x. Let's suppose x is to the right, 'cause otherwise I go to the other board. So x is to the right. I'll discover that the range of just this node-- this node maybe contains one other key-- is too small. So I'm going to follow the level-right pointer, and I get to some other node.
Then I'm going to go to the parent. Maybe the parent was the parent of those two children so I'm going to draw it like that. Maybe I find this range is still too low. I need to go right to get to x, so I'm going to follow a level pointer to the right. I find a new subtree. I'll go to its parent. Maybe I find that this subtree, still the max is too small. So I have to go to the right again. And then I take the parent. So this was an example of a rightward parent. Here's an example of a leftward parent. This is maybe the parent of both of these two children.
Then maybe this subtree is still too small-- the max is still smaller than x. So then I go right one more time. Then I follow the parent. Always alternating between right and parent until I find a node whose subtree contains x. Actually, x may be down here, because I immediately went to the parent without checking whether I found where x is.
But if I know that x is somewhere in here then I will do a downward search. It might go left and then down here, or it might go right, or there are actually potentially three children. One of these searches will find the key x that I'm looking for, because I'm in the case where x is between v.min and v.max, so I know it's in there, somewhere. It could be that x doesn't exist, but its predecessor or successor is in there somewhere.
And so one of these three subtrees will contain the x range. And then I go follow that path. And keep going down until I find x or its predecessor or successor. Once I find its predecessor I can use a level-right pointer to find its successor, and so on.
So that's kind of the general picture what's going on. We keep going rightward and we keep going up. Suppose we do k up steps. Let's look at this last step here. Step k.
How high am I in the tree? I started at the leaf level. Remember in a 2-3 tree all the leaves have the same level. And I went up every step.
Sorry. I don't know what this is-- like the two-step dance, where, let's say, every iteration of this loop I do one left or right step, and then a parent step. So I should call this iteration k. I guess there are 2k steps, then.
Just to be clear. So in iteration k, that means I've gone up k times and I've gone either right or left k times. You can show if you start going right you keep going right. If you initially go left you'll keep going left. Doesn't matter too much.
At iteration k I am at height k, or k minus 1, or however you want to count. But let's call it k. So when I do this right pointer here I know that, for example, I am skipping over all of these keys. All the keys down-- the keys are in the leaves, so all these things down here, I'm jumping over them. How many keys are down there? Can you tell me, roughly, how many keys I'm skipping over when I'm moving right at height k? It's not a unique answer. But you can give me some bounds.
Say again. Number of children to the k power. Yeah. Except we don't know the number of children, but it's between 2 and 3. The closer ones should be easy, but I fail. So it's between two and three children. So if you look at a height k tree, how many leaves does it have? It's going to be between 2 to the k and 3 to the k, because I have between 2 and 3 children at every node. And so it's exponential in k. That's all I'll need.
OK. When I'm at height k here, I'm skipping over a height k minus 1 tree or something. But it's going to be--
So in iteration k I'm skipping at least some constant times 2 to the k-- maybe 2 to the k minus 1, or 2 to the k minus 2. I'm being very sloppy. Doesn't matter. As long as it's exponential in k, I'm happy. Because I'm supposing that x and y are somewhat close. Let's call this rank difference d. Then I claim the number of iterations I'll need to do in this loop is, at most, order log d. Because when I get to the k-th iteration, I'm jumping over 2 to the k elements. How large does k have to be before 2 to the k is larger than d? Well, log d. Log base 2 of d.
The number of iterations is order log d, where d is the rank difference-- d is the absolute value of rank of x minus rank of y. And I'm being a little sloppy here. You probably want to use an induction. You need to show that the items you're skipping over here really are strictly between x and y. But we know that there are only d items between x and y-- actually d minus 1, I guess. So as soon as we've skipped over all the items between x and y, then we'll find a range that contains x, and then we'll go do the downward search.
Now how long does the downward search cost? Whatever the height of the tree is. What's the height of the tree? That's the number of iterations. So the total cost. The downward search will cost the same as the rest of the search. And so the total cost is going to be order log d. Clear?
Any questions about finger searching with level linked data at the leaves, 2-3 trees?
AUDIENCE: Sir, I'm not sure why [INAUDIBLE] d, why is that?
PROFESSOR: I'm defining d to be the rank of x minus rank of y. My goal is to achieve a log d bound. And I'm claiming that because once I've skipped over d items, then I'm done. Then I've found x. And at step k I'm skipping over 2 to the k items. So how big is k going to be? Log d. That's all. I used d for a notation here. Cool.
Finger searching. It's nice. Especially if you're doing many consecutive searches that are all relatively close to each other. But that was easy. Let's do a more difficult augmentation.
So the last topic for today is range trees. This is probably the coolest example of augmentation, at least, that you'll see in this class. If you want to see more you should take Advanced Data Structures, 6.851.
And range trees solve a problem called orthogonal range searching. Not orthogonal search ranging. Orthogonal range search.
So what's the problem? I'm going to give you a bunch of points. Draw them as fat dots so you can actually see them. In some dimension. So this is, for example, a 2D point set. OK. Over here I will draw a 3D point set. You can tell the difference, I'm sure.
There. Now it's a 3D point set. And this is a static point set. You could make this dynamic but let's just think about the static case. Don't want the 2D points and the 3D points to mix. Now, you get to preprocess this into a data structure. So this is a static data structure problem. And now I'm going to come along with a whole bunch of queries. A query will be a box. OK. In two dimensions, a box is a rectangle.
Something like this. Axis aligned. So I give you an x min, x max, a y min, and a y max. I want to know what are the points inside. Maybe I want you to list them. If there's a lot of them it's going to take a long time to list them. Maybe I just want to know 10 of them as examples.
Maybe this is a Google search or something. I just get the first 10 results in the first page, I hit next then want the next 10, that kind of thing. Or maybe I want to know how many search results there are. Number of points in the rectangle. Bunch of different problems.
In 3D, it's a 3D box. Which is a little harder to draw. You can't really tell which points are inside the box. Let's say these three points are all inside the box. I give you an interval in x, an interval in y, and an interval in z, and I want to know what are the points inside. How many are there? List them all. List 10 of them, whatever. OK.
I want to do this in poly log time, let's say. I'm going to achieve today log squared for the 2D problem and log cubed for the 3D problem, plus whatever the size output is. So let me just write that down. So the goal is to preprocess n points in d dimensions.
So you get to spend a bunch of time preprocessing to support a query which is, given a box, axis aligned box, find let's say the number of points in the box. Find k points in the box. I think that's good. That includes a special case of find all the points in the box. So this, of course, we have to pay a penalty of order k for the output. No getting around that. But I want the rest of the time to be log to the d.
So we're going to achieve log to the d n plus size of the output. And you get to control how big you want the output to be. So it's a pretty reasonable data structure. In a certain sense we will understand what the output is in log to the d time. If you actually want to list points, well, then you have to spend the time to do it.
All right. So 2D and 3D are great, but let's start with 1D. First we should understand 1D completely, then we can generalize. 1D we already know how to do. 1D I have a line. I have some points on the line.
And I'm given, as a query, some interval. And I want to know how many points are in the interval, give me the points in the interval, and so on. So how do I do this? Any ways?
If d is 1. So I want to achieve log d, sorry, log n, plus size of output. I hear whispers. Yeah?
AUDIENCE: Segment trees?
PROFESSOR: Segment tree? That's fancy. We won't cover segment trees. Probably segment trees do it. Yeah. We know lots of ways to do this. Yeah?
AUDIENCE: Sorted array?
PROFESSOR: Sorted array is probably the simplest. If I store the items in a sorted array and I have two values, I'll call them x1 and x2, because it's the x min and x max. Binary search for x1. Binary search for x2. Find the successor of x1 and the predecessor of x2. I'll find these two guys. And then I know all the ones in between. That's the match. So that'll take log n time to find those points and then we're good.
So we could do a sorted array. Of course, sorted array is a little hard to generalize. I don't want to do a 2D array, that sounds bad. You could, of course, do a binary search tree. Like an AVL tree. Same thing. Because we have log n search, find successor, and predecessor, I guess you could use Van Emde Boas, but that's hard to generalize to 2D.
You could use level links. Here's a fancy version. We could use level linked 2-3 trees with data in the leaves. Then once I find x min, I find this point, I can go to the successor in constant time because that's a finger search with a rank difference of 1. And I could just keep calling successor and in constant time per item I will find the next item. So we could do that easily with the sorted array.
BST is not so great because successor might cost log n each time. But if I have the level links then basically I'm just walking down the linked list at the bottom of the tree. OK. So actually level linked is even better. A BST would achieve something like log n plus k log n, where k is the size of the output: if I want k points in the box I have to pay log n for each. With level links I'll only pay log n plus k. Here I actually only need the links at the leaves. Level links.
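For reference, a sketch of the sorted-array version using the standard library's binary search; the names are illustrative.

```python
import bisect

def range_query_1d(sorted_keys, x1, x2):
    """Return all keys in [x1, x2] from a sorted list.
    Two binary searches find the first key >= x1 and the position just past
    the last key <= x2; the slice between them is the answer.
    O(log n) to locate, plus O(k) to list the k matches."""
    lo = bisect.bisect_left(sorted_keys, x1)
    hi = bisect.bisect_right(sorted_keys, x2)
    return sorted_keys[lo:hi]
```

For example, range_query_1d([1, 2, 5, 7, 8, 9], 4, 8) returns [5, 7, 8].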
OK. All good. But I actually want to tell you a different way to do it that will generalize better. The pictures are going to look just like the pictures we've talked about.
So these would actually work dynamically. My goal here is just to achieve a static data structure. I'm going to idealize this solution a little bit. And just say, suppose I have a perfectly balanced binary search tree. That's going to be my data structure. OK. So the data structure is not hard, but what's interesting is how I do a range search.
So if I do range query of the interval, I'll call it ab. Then what I'm going to do is do a binary search for a, do a binary search for b, trim the common prefix of those search paths. That's basically finding the lowest common ancestor of a and b.
And then I'm going to do some stuff. Let me draw the picture. So here is, suppose here's the node that contains a. Here's the node that contains b. They may not be at the same depth, who knows. Then I'm going to look at the parents of a. I just came down from some path here, and some path down to b. I want to find this branching point where the paths to a and the paths to b diverge.
So let's just look at the parent of a. It could be a right parent, in which case there's a subtree here. Could be a left parent in which case, subtree here. I'm going to follow my convention again. That x-coordinate corresponds roughly to key. Left parent here. Maybe right parent here. Something like that.
OK. Remember it's a perfect tree. So, actually, all the leaves will be at the same level. And, roughly here, x-coordinate corresponds to key. So here is a. And I want to return all the keys that are between a and b. So that's everything in this sweep line.
The parents of the LCA don't matter, because each such parent is either going to be way over to the right or way over to the left. In both cases, it's outside the interval a to b. So what I've tried to highlight here, and I will color it in blue, is the relevant nodes for the search between a and b. So a is between a and b. This subtree is greater than a and less than b. This node, and these nodes. This node, and these nodes. This node and these nodes. The common ancestor. And then the corresponding thing over here. All the nodes in all these blue subtrees, plus these individual nodes, fall in the interval between a and b, and that's it.
OK. This should look super familiar. It's just like when we're computing rank. We're trying to figure out how many guys are to our left or to our right. We're basically doing a rightward rank from a and a leftward rank from b. And that finds all the nodes. And stopping when those two searches converge. And then we're finding all the nodes between a and b. I'm not going to write down the pseudocode because it's the same kind of thing. You look at right parents and left parents.
You just walk up from a. Whenever you get a right parent then you want that node, and the subtree to its right. And so that will highlight these nodes. Same thing for b, but you look at left parents. And then you stop when those two searches converge. So you're going to do them in lock step. You do one step for a and b. One step for a and b. And when they happen to hit the same node, then you're done. You add that node to your list. And what you end up with is a bunch of nodes and rooted subtrees.
The things I circled in blue is going to be my return value. So I'm going to return all of these nodes, explicitly. And I'm also going to return these subtrees. I'm not going to have to write them down. I'm just going to return the root of the subtree, and say, hey look. Here's an entire subtree that contains points that are in the answer. Don't have to list them explicitly, I can just give you the tree.
Then if I want to know how many results are in the answer, well, just augment to store subtree size at the beginning. And then I can count how many nodes are down here, how many nodes are down here, add that up for all the triangles, and then also add one for each of the blue nodes, and then I've counted the size of the answer in how much time? How many subtrees and how many nodes am I returning here? Log.
Log n nodes and log n rooted subtrees because at each step, I'm going up by one for a, and up by one for b. So it's like 2 log n. Log n.
So I would call this an implicit representation of the answer. From that implicit representation you can use subtree size augmentation to count the size of the answer. Or you can just start walking through one by one, do an in-order traversal of the trees, and you'll get the first k points in the answer in order k time. Question?
AUDIENCE: Just a clarification. You said when we were walking up, you want to get all the ancestors in their right subtrees. But you don't do that for the left parent, right?
PROFESSOR: That's right. As I'm walking up the tree, if it's a right parent then I take the right subtree and include that in the answer. If it's a left parent just forget about it. Don't do anything. Just keep following parents. Whenever I do right parent then I also add that node and the right subtree. If it's a left parent I don't include the node, I don't include the left subtree. I also don't include the right subtree. That would have too much stuff.
It's easy when you see the picture, you would write down the algorithm. It's clear. It's left versus right parents.
AUDIENCE: Would you include the left subtree of b?
PROFESSOR: I would also-- thank you. I should color the left subtree of b. I didn't apply symmetry perfectly. So we have the right subtree of a and the left subtree of b. Thanks. I would also include b if it's a closed interval.
Slightly more general. If a and b are not in the tree then this is really the successor of a and this is the predecessor of b. So then a and b don't have to be in there. This is still a well defined range search. OK. Now we really understand 1D. I claim we've almost solved all dimensions. All we need is a little bit of augmentation. So let's do it.
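Here is a top-down sketch of that 1D range-tree query, walking down from the root instead of up from a and b, which yields the same O(log n) nodes and O(log n) subtree roots; it assumes each node has key, left, and right fields (illustrative names) and works even when a and b are not themselves in the tree.

```python
def range_query_nodes(root, a, b):
    """Return (nodes, subtrees): individual nodes whose keys lie in [a, b],
    and roots of subtrees whose every key lies in [a, b].  O(log n) of each."""
    nodes, subtrees = [], []

    def find_split(v):
        # Walk down until the search paths for a and b diverge (their LCA).
        while v is not None:
            if v.key < a:
                v = v.right
            elif v.key > b:
                v = v.left
            else:
                return v
        return None

    def walk(v, bound, lower):
        # From a child of the split node, walk toward the boundary key,
        # keeping each in-range node and the subtree hanging toward the
        # inside of the interval.
        while v is not None:
            inside = v.key >= bound if lower else v.key <= bound
            if inside:
                nodes.append(v)
                hang = v.right if lower else v.left
                if hang is not None:
                    subtrees.append(hang)
                v = v.left if lower else v.right
            else:
                v = v.right if lower else v.left

    split = find_split(root)
    if split is not None:
        nodes.append(split)
        walk(split.left, a, lower=True)    # everything here is already <= b
        walk(split.right, b, lower=False)  # everything here is already >= a
    return nodes, subtrees
```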
Let's start with 2D. But then 3D, and 4D, and so on will be easy. Why do I care about 4D range trees? Because maybe I have a database. Each of these points is actually just a row in the database which has four columns, four values. And what I'm trying to do here is find all the people in my database that have a salary between this and this, and have an age between this and that, and have a profession between this and this. I don't know what that means. Number of degrees between this and this, whatever.
You have some numerical data representing a person or thing in your database, then this is a typical kind of search you want to do. And you want to know how many answers you've got and then list the first hundreds of them, or whatever. So this is a practical thing in databases. This is what you might call an index in the database.
So let's start. Suppose your data is just two dimensional. You have two fields for every item. What I'm going to do is store a 1D range tree on all points by x. So this data structure makes sense if you fix a dimension. Say x is all I care about. Forget about y. So my point set. Yeah. So what that corresponds to is projecting each of these points onto the x-axis. And now also projecting my query.
So my new query is from here to here in x. And so this data structure will let me find all these points that match in x. That's not good because there's actually only two points that I want, but I find four points in this picture. But it's half of the answer. It's all the x matches forgetting about y.
Now here's the fun part. So when I do a search here I get log n nodes. Nodes are good because they have a single key in them. So I'll just check for each of those log n nodes. Do they also match in y? If they do, add it to the answer. If they don't forget about it. OK.
But the tricky part is I also get log n subtrees representing parts of the answer. So potentially it could be that your search, this rectangle, only has like five points. But if you look at this whole vertical slab there's a billion points. Now, luckily, those billion points are represented succinctly. There's just log n subtrees saying, well there's half a billion here, and a quarter billion here, and an eighth of a billion here.
Now for each of those big chunks of output, I want to very quickly find the ones that match in y. How would I find the ones matching in y? A range tree. Yeah. OK. So here's what we're going to do. For each node-- call it x. x is overloaded; it's a coordinate, so many things. Let's call it v, in this thing I'm going to call the x-tree. So for every node in the x-tree I'm going to store another 1D range tree, but this time using the y-coordinate, on all points in that node's rooted subtree.
At this point I really want to draw a diagram. So, rough picture. Forgive me for not drawing this perfectly.
This is roughly what the answer looks like for the 1D range search. This is the x-tree. And here I've searched between this value and this value in the x-coordinate. Basically I have log n nodes. I'm going to check those separately. Then I also have these log n subtrees. For each of those log n subtrees I'm going to have a pointer-- this is the augmentation-- to another tree of exactly the same size, on exactly the same data that's in here. It's also over here. But it's going to be sorted by y. And it's a 1D range tree by y. Tons of data duplication here. I took all these points and I copied them over here, but then built a 1D range tree in y. This is all preprocessing, so I don't have to pay for this. It's polynomial time. Don't worry too much.
And then I'm going to search in here. What does the search in there look like? I'm going to get, you know, some more trees and a couple more nodes. OK. But now those items, those points, match in x and y, because this whole subtree matched in x and I just did a y search, so I found things that matched in y.
So I get here another log n trees that are actually in my answer. And for each of these nodes I have a corresponding other data structure where I do a little search and I get part of the answer.
Every node has one. Sounds huge. This data structure sounds huge, but it's actually small. But one thing that's clear is it takes log squared n time, because I have log n triangles over here, and for each of them I spend log n to find triangles over here. The total output is log squared n nodes, and for each of them I have to check manually. Plus, over here, there are log n different searches I'm doing, each one of size log n. So I get log squared n little triangles that contain the results that match in x and y.
How much space in this data structure? That's the remaining challenge. Actually, it's not that hard, because if you look at a key. So look at some key in this x-tree. Let's look at a leaf because that's maybe the most interesting.
Here's the x-tree. The x-tree has linear size-- just one tree. If I look at some key value, well, it lives in this subtree. And so there's going to be a corresponding blue structure of that size that contains that key. And then there's the parent. So there's a structure here that has a corresponding blue triangle. And then its parent-- that's another triangle that contains-- I'm looking at a key k here. All of these triangles contain the key k. And so key k will be duplicated all these many times. But how many subtrees is k in? Log n. Each key-- fundamental fact about balanced binary search trees-- each key lives in log n subtrees, namely those rooted at its ancestors.
Awesome. Because that means the total space is n log n. There are n keys; each of them is duplicated at most log n times. In general, it's n log to the d minus 1 n. So if you do it in 3D, in each of the blue trees, every node has a corresponding pointer to a red tree that's sorted by z. And you just keep doing this sort of nested searching, like super augmentation. But you're only losing a log factor for each dimension you add.
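A sketch of how the two levels compose, assuming the 1D query above, that the x-tree is keyed on x-coordinates and each y-tree on y-coordinates, that each x-tree node v stores its point as v.point = (x, y), and that each node carries a v.ytree pointer to the associated 1D range tree on its subtree's points (all illustrative names):

```python
def range_query_2d(x_root, x1, x2, y1, y2):
    """2D range tree query: O(log^2 n) time to produce an implicit answer.
    First find the O(log n) nodes and subtree roots matching in x; check
    each individual node's y-coordinate directly, and for each matching
    x-subtree, run a 1D query on its associated y-tree."""
    x_nodes, x_subtrees = range_query_nodes(x_root, x1, x2)

    answer_nodes = [v for v in x_nodes if y1 <= v.point[1] <= y2]
    answer_subtrees = []
    for t in x_subtrees:
        y_nodes, y_subtrees = range_query_nodes(t.ytree, y1, y2)
        answer_nodes.extend(y_nodes)        # match in both x and y
        answer_subtrees.extend(y_subtrees)  # entire subtrees match in both
    return answer_nodes, answer_subtrees
```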