0:00 The following content is provided under a Creative
0:02Commons license.
0:04Your support will help MIT OpenCourseWare
0:06continue to offer high quality educational resources for free.
0:10To make a donation or view additional materials
0:13from hundreds of MIT courses, visit MIT OpenCourseWare
0:17at ocw.mit.edu.
0:21ERIK DEMAINE: Welcome to 6.851 Advanced Data Structures.
0:25I am Erik Demaine.
0:26You can call me Erik.
0:28We have two TAs, Tom Morgan and Justin Zhang.
0:31Tom's back there.
0:32Justin is late.
0:36And this class is about all kinds
0:38of very cool data structures.
0:40You should have already seen basic data structures
0:42like balanced binary search trees and things
0:45like that, log n time to do whatever
0:48you want in one dimension.
0:50And here we're going to turn all those data structures
0:52on their head and consider them in all sorts
0:54of different models and additional cool problems.
0:56Today we're going to talk about time travel or temporal data
0:59structures, where you're manipulating time
1:02as any good time traveler should.
1:05Then we'll do geometry where we have higher dimensional
1:07data, more than one dimension.
1:09Then we'll look at a problem called
1:11dynamic optimality, which is, is there one best binary search
1:14tree that rules them all?
1:16Then we'll look at something called memory hierarchy, which
1:19is a way to model more realistic computers which have cache
1:22and then more cache and then main memory
1:24and then disk and all these different levels.
1:26How do you optimize for that?
1:28Hashing is probably the most famous, most popular,
1:31and most used data structure in computer science.
1:33We'll do a little bit on that.
1:35Integers, when you know that your data is integers and not
1:39just arbitrary black boxes that you can compare or do whatever,
1:42you can do a lot better with integers.
1:43You usually beat log n time.
1:45Often you can get constant time.
1:47For example, if you want to do priority queues,
1:49you can do square root of log log n time.
1:51That's the best known randomized.
1:55Dynamic graphs, you have a graph you want to store,
1:58and the edges are being added and maybe deleted, like you're
2:01representing a social network.
2:02And people are friending and de-friending.
2:04You want to maintain some interesting information
2:06about that graph.
2:07Strings, you have a piece of text,
2:10such as the entire worldwide web.
2:12And you want to search for a substring.
2:14How do you do that efficiently?
2:15It's sort of the Google problem.
2:17Or you searching through DNA for patterns, whenever.
2:20And finally succinct data structures,
2:22which is all about taking what we normally
2:24consider optimal space or n space
2:26and reducing it down to the very bare minimum of bits of space.
2:30Usually if you want to store something
2:32where there's 2 to the n possibilities,
2:34you want to get away with n bits of space,
2:36maybe plus square root of n or something very tiny.
2:40So those are succinct data structures.
2:42So that's an overview of the entire class.
2:44And these are sort of the sections we'll be following.
2:47Let me give you a quick administrative overview
2:50of what we're doing.
2:54Requirements for the class--
2:55I guess, first, attending lecture.
2:58Obviously if you don't attend lecture,
2:59there'll be videos online.
3:00So that's resolvable.
3:02But let me know if you're not going to make it.
3:05We're going to have problem sets roughly every week.
3:07If you're taking the class for credit,
3:09they have a very simple rule of one page in, one page out.
3:12This is more a constraint on us to write problems that
3:15have easy or short answers.
3:16You probably need to think about them
3:18a little bit before they're transparent, but then easy
3:21to write up.
3:23And then scribing lectures-- so we have a scribe for today,
3:26I hope.
3:27Here?
3:28Yes, good.
3:30So most of the lectures have already
3:32been scribed in some version, and your goal
3:34is to revise those scribe notes so that, if you don't like
3:38the handwritten notes, which are also online, they're easier
3:42for people to read.
3:44Let's see.
3:45Listeners welcome.
3:46We're going to have an open problem session.
3:48I really like open problems.
3:49I really like solving open problems.
3:50So we've done this every time this class has been offered.
3:53So if you're interested in also solving open problems,
3:55it's optional.
3:56I will organize-- in a couple of weeks,
3:59we'll have a weekly open problem session
4:03and try to solve all the things that
4:06push the frontier of advanced data structures.
4:09So in classes, we'll see the state of the art.
4:11And then we'll change the state of the art in those sessions.
4:15I think that's it.
4:16Any questions about the class before we
4:18get into the fun stuff?
4:22All right.
4:23Let's do some time traveling.
4:27Before I get to time traveling, though, I
4:28need to define our model of computation.
4:32A theme in this class is that the model of computation you're
4:35working with matters.
4:36Models matter.
4:38And there's lots of different models of computation.
4:40We'll see a few of the main ones in this class.
4:45And the starting point, and the one
4:48we'll be using throughout today, is called a pointer machine.
4:53It's an old one from the '80s.
4:57And it corresponds to what you might think about
4:59if you've done a lot of object-oriented programming,
5:02and before that, structure-oriented programming,
5:04I guess.
5:05So you have a bunch of nodes.
5:08They have some fields in them, a constant number of fields.
5:13You can think of these as objects or structs
5:16in C. It used to be records back in Pascal days,
5:20so a lot of the papers call them records.
5:22You could just have a constant number of fields.
5:24You could think of those numbered, labeled.
5:25It doesn't really matter because there's only
5:27a constant number of them.
5:28Each of the fields could be a pointer to another node,
5:32could be a null pointer, or could have some data in it.
5:35So I'll just assume that all my data is integers.
5:43You can have a pointer to yourself.
5:45You can have a pointer over here, whatever you want.
5:48A pointer machine would look something like this.
5:52At any moment, this is the state of the pointer machine.
5:55So you can think of this as the memory of your computer.
5:59And then you have some operations
6:02that you're allowed to do.
6:03That's the computation part of the model.
6:07You can think of this as the memory model.
6:10What you're allowed to do are create nodes.
6:13You can say something like, x equals new node.
6:19You can, I don't know, look at fields.
6:24You can do x equals y.field.
6:28You can set fields, x.field equals y.
6:33You can compute on these data, so you can add 5 and 7,
6:37do things like that.
6:39I'm not going to worry about--
6:40I'll just write et cetera.
6:43This is more a model about how everything's
6:45organized in memory, not so much about what you're allowed
6:48to do to the data items.
6:49In this lecture, it won't matter what
6:50you're doing to the data items.
6:52We never touch them.
6:53We just copy them around.
6:56So am I missing anything?
6:59Probably.
7:01I guess you could destroy nodes if you felt like it.
7:03But we won't have to today, because we
7:06don't want to throw anything away
7:07when you're time traveling.
7:09It's too dangerous.
7:12And then the one catch here is, what are x and y?
7:18There's going to be one node in this data structure
7:21or in your memory called the root node.
7:24And you could think of that as that's the thing you always
7:26have in your head.
7:27This is like your cache, if you will.
7:29It's just got a constant number of things, just
7:31like any other node.
7:32And x and y are fields of the root.
7:40So that sort of ties things down.
7:42You're always working relative to the root.
7:45But you can look at the data, basically follow this pointer,
7:50by looking at the field.
7:53You could set one of these pointers--
7:55I think I probably need another operation here,
7:58like x equals y.field1, field2, that sort of thing,
8:03and maybe the reverse.
8:06But the idea is that you can manipulate all nodes
8:09via the root.
8:10You follow pointers, do whatever.
8:12So pretty obvious, slightly annoying
8:14to write down formally.
8:16But that is pointer machine.
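A minimal sketch of this model in Python, purely illustrative (the class name, field count, and layout here are mine, not anything specified in the lecture):

```python
# A pointer-machine node: a constant number of fields, each holding
# an integer, a pointer to another node, or null (None).
class Node:
    NUM_FIELDS = 3  # any fixed constant

    def __init__(self):
        self.fields = [None] * Node.NUM_FIELDS

root = Node()                   # the distinguished root node ("cache")
x = Node()                      # x = new node
root.fields[0] = x              # x.field = y style assignment
x.fields[1] = 5                 # data items are just integers
x.fields[2] = x                 # pointers can point anywhere, even to self
v = root.fields[0].fields[1]    # x = y.field1.field2 style access; v == 5
```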
8:23And what we're going to be talking about today
8:26in time travel is suppose someone
8:28gives me a pointer machine data structure, for example,
8:31balanced binary search tree, linked list.
8:33A lot of data structures, especially classic data
8:36structures, follow pointer machine model.
8:39What we'd like to do is transform that data structure
8:41or make a new pointer machine data
8:42structure that does extra cool things,
8:45namely travel through time.
8:47So that's what we're going to do.
8:53There's two senses of time travel or temporal data
8:57structures that we're going to cover in this class.
9:02The one for today is called persistence,
9:05where you don't forget anything, like an elephant.
9:08And the other one is retroactivity.
9:15Persistence will be today.
9:16Retroactivity is next class.
9:19Basically, these correspond to two models of time travel.
9:21Persistence is the branching universe time travel model,
9:24where if you make a change in the past,
9:25you get a new universe.
9:27You never destroy old universes.
9:29Retroactivity is more like Back to the Future,
9:33when you go back, make a change, and then you
9:35can return to the present and see what happened.
9:37This is a lot harder to do.
9:39And we'll work on that next class.
9:42Persistence is what we will do today.
9:46So persistence.
9:57The general idea of persistence is to remember everything--
10:01the general goal, I would say.
10:05And by everything, I mean different versions
10:07of the data structure.
10:08So you're doing data structures in general.
10:11We have update operations and query operations.
10:14We're mainly concerned about updates here.
10:16Every time you do an update, you think of it
10:18as taking a version of the data structure and making a new one.
10:21And you never want to destroy old versions.
10:23So even though an update like an insert or something
10:26changes the data structure, we want to remember that past data
10:29as well.
10:30And then let's make this reasonable.
10:34All data structure operations are relative to a specified
10:37version.
10:47So an update makes and returns a new version.
11:05So when you do an insert, you specify
11:08a version of your data structure and the thing
11:10you want to insert.
11:11And the output is a new version.
11:13So then you could insert into that new version, keep going,
11:16or maybe go back to the old version, modify that.
11:20I haven't said exactly what's allowed here,
11:22but this is sort of the general goal.
11:25And then there are four levels of persistence
11:30that you might want to get.
11:33First level is called partial persistence.
11:37This is the easiest to obtain.
11:44And in partial persistence, you're
11:47only allowed to update the latest version, which
11:55means the versions are linearly ordered.
12:01This is the easiest to think about.
12:03And time travel can easily get confusing, so start simple.
12:09We have a timeline of various versions on it.
12:14This is the latest.
12:17And what we can do is update that version.
12:19We'll get a new version, and then our latest is this one.
12:23What this allows is looking back at the past to an old version
12:27and querying that version.
12:29So you can still ask questions about the old version,
12:31if you want to be able to do a search on any of these data
12:33structures.
12:34But you can't change them.
12:35You can only change the most recent version.
12:38So that's nice.
12:39It's kind of like Time Machine on a Mac, I guess.
12:44If you've ever seen the movie Deja Vu, which is not
12:46very common, but it's a good time travel
12:48movie, in the first half of the movie, all they can do
12:51is look back at the past.
12:52Later they discover that actually they
12:54have a full persistence model.
12:57It takes a while for dramatic effect.
13:04In full persistence, you can update anything you want--
13:08so update any version.
13:18And so then the versions form a tree.
13:27OK.
13:28So in this model, maybe you initially
13:30have a nice line of versions.
13:32But now if I go back to this version and update it,
13:34I branch, get a new version here.
13:37And then I might keep modifying that version sometimes.
13:40Any of these guys can branch.
13:44So this is why I call it the branching universe model, when
13:47you update your branch.
13:52So no version ever gets destroyed here.
13:54Again, you can query all versions.
13:56But now you can also update any version.
13:59But you just make a new version.
14:00It's a totally new world.
14:02When I update this version, this version
14:04knows nothing about all the--
14:06this doesn't know about this future.
14:07It's created its own future.
14:10There's no way to sort of merge those universes together.
14:14It's kind of sad.
14:16That's why we have the third level of persistence,
14:22which lets us merge timelines.
14:24It's great for lots of fiction out there.
14:35If you've seen the old TV show Sliders,
14:38that would be confluent persistence.
14:50So confluent persistence, you can combine two versions
15:01to create a new version.
15:09And in this case, again, you can't destroy old versions.
15:13In persistence, you never destroy versions.
15:16So now the versions form a DAG, directed acyclic graph.
15:22So now we're allowing--
15:24OK, you make some changes, whatever.
15:25You branch your universe, make some changes.
15:30And now I can say, OK, take this version of the data
15:32structure and this version and recombine them.
15:35Get a new version, and then maybe make some more changes.
15:38OK, what does combine mean?
15:40Well, it depends on your data structure.
15:42A lot of data structures have combine operations
15:44like if you have linked lists, you have two linked lists,
15:48you can concatenate them.
15:49That's an easy operation.
15:50Even if you have binary search trees,
15:51you can concatenate them reasonably easy
15:53and combine it into one big binary search tree.
15:56So if your data structure has an operation that
15:59takes as input two data structures,
16:01then what we're saying is now it can take two versions, which
16:05is more general.
16:06So I could take the same data structure,
16:08make some changes in one way, separately make
16:10some changes in a different way, and then
16:12try to concatenate them or do something crazy.
16:14This is hard to do, and most of it
16:16is an open problem whether it can be done.
16:19But I'll tell you about it.
16:21Then there's another level even more than confluent
16:24persistence.
16:26This is hard to interpret in the time travel world,
16:30but it would be functional data structures.
16:32If you've ever programmed in a functional programming
16:34language, it's a little bit annoying from an algorithm's
16:36perspective, because it constrains you to work
16:39in a purely functional world.
16:43You can never modify anything.
16:45OK.
16:46Now, we don't want to modify versions.
16:49That's fine.
16:49But in a functional data structure,
16:51you're not allowed to modify any nodes ever.
16:54All you can do is make new nodes.
17:03This is constraining, and you can't always
17:07get optimal running times in the functional world.
17:09But if you can get a functional data structure,
17:11you have all these things, because you
17:13can't destroy anything.
17:14If you can't destroy nodes, then in particular,
17:16you can't destroy versions.
17:17And all of these things just work for free.
17:19And so a bunch of special cases are known,
17:22interesting special cases, like search trees
17:24you can do in the functional world.
17:26And that makes all of these things easy.
17:29So the rest of this lecture is going
17:30to be general techniques for doing partial full persistence,
17:34what we know about confluent, and what
17:36we know about functional, brief overview.
17:41Any questions about those goals, problem definitions?
17:47Yeah.
17:48AUDIENCE: I'm still confused about functional, because--
17:50ERIK DEMAINE: What does functional mean?
17:52AUDIENCE: [INAUDIBLE]
17:55ERIK DEMAINE: Yeah, I guess you'll see what--
17:57functional looks like all the other things, I agree.
18:00You'll see in a moment how we actually implement
18:02partial and persistence.
18:03We're going to be changing nodes a lot.
18:07As long as we still represent the same data
18:10in the old versions, we don't have to represent it
18:13in the same way.
18:14That lets us do things more efficiently.
18:15Whereas in functional, you have to represent
18:17all the old versions in exactly the way
18:19you used to represent them.
18:20Here we can kind of mangle things around
18:22and it makes things faster.
18:23Yeah, good question.
18:25So it seems almost the same, but it's nodes versus versions.
18:29I haven't really defined a version.
18:31But it's just that all the queries answer the same way.
18:34That's what you need for persistence.
18:38Other questions?
18:40All right.
18:43Well, let's do some real data structures.
18:49We start with partial persistence.
18:55This is the easiest.
18:59For both partial and full persistence,
19:02there is the following result. Any pointer machine data
19:06structure-- one catch-- with a constant number of pointers
19:19to any node--
19:22so this is constant in-degree.
19:27In a pointer machine, you always have a constant number
19:30of pointers out of a node at most.
19:32But for this result to hold, we also
19:34need a constant number of pointers into any node.
19:37So this is an extra constraint.
19:42Can be transformed into another data structure that
19:47is partially persistent and does all the things it used to do--
19:53so I'll just say, can be made partially persistent.
20:00You have to pay something, but you have to pay very little--
20:03constant amortized factor overhead,
20:12multiplicative overhead and constant amount
20:21of additive space per change in the data structure.
20:30So every time you do a modification in your pointer
20:33machine-- you set one of the fields to something--
20:36you have to store that forever.
20:37So, I mean, this is the best you could hope to do.
20:39You've got to store everything that happened.
20:43You pay a constant factor overhead, eh.
20:45We're theoreticians.
20:46That doesn't matter.
20:48Then you get any data structure in this world
20:50can be made partially persistent.
20:53That's nice.
20:54Let's prove it.
21:00OK, the idea is pretty simple.
21:04Pointer machines are all about nodes and fields.
21:06So we just need to simulate whatever the data structure is
21:09doing to those nodes and fields in a way
21:11that we don't lose all the information
21:13and we can still search it very quickly.
21:17First idea is to store back pointers.
21:21And this is why we need the constant n degree constraint.
21:27So if we have a node--
21:31how do I want to draw a node here?
21:35So maybe these are the three fields of the node.
21:38I want to also store some back pointers.
21:42Whenever there is a node that points to this node,
21:48I want to have a back pointer that
21:50points back so I know where all the pointers came from.
21:54If there's only p pointers, then this is fine.
21:57There'll be p fields here.
22:00So still constant, still in the pointer machine model.
22:03OK, I'm going to need some other stuff too.
22:08So this is a simple thing, definitely want this.
22:11Because if my nodes ever move around,
22:13I've got to update the pointers to them.
22:15And where are those pointers?
22:16Well, the back pointers tell you where they are.
22:20Nodes will still be constant size,
22:22remain in pointer machine data structure.
22:25OK.
22:26That's idea one.
22:28Idea two is this part.
22:35This is going to store something called mods.
22:39It could stand for something, but I'll leave it as mods.
22:44So these are two of the fields of the data structure.
22:56Ah, one convenience here is for back pointers,
23:01I'm only going to store it for the latest version of the data
23:04structure.
23:16Sorry.
23:16I forgot about that.
23:19We'll come back to that later.
23:21And then the idea is to store these modifications.
23:23How many modifications?
23:25Let's say up to p, twice p.
23:34p was the bound on the n degree of a node.
23:38So I'm going to allow 2p modifications over here.
23:43And what's a modification look like?
23:48It's going to consist of three things--
23:50get them in the right order--
23:53the version in which something was changed,
23:56the field that got changed, and the value it got changed to.
24:02So the idea is that these are the fields here.
24:07We're not going to touch those.
24:09Once they're set to something--
24:11or, I mean, whatever they are initially,
24:13they will stay that way.
24:15And so instead of actually changing things like the data
24:18structure normally would, we're just
24:19going to add modifications here to say, oh,
24:21well at this time, this field changed to the value of 5.
24:25And then later on, it changed to the value 7.
24:27And then later on, this one changed to the value 23,
24:31whatever.
24:32So that's what they'll look like.
24:36There's a limit to how many--
24:37we can only store a constant number of mods to each node.
24:40And our constant will be 2p.
24:44OK.
24:45Those are the ideas, and now it's
24:46just a matter of making this all work
24:47and analyzing that it's constant amortized overhead.
24:53So first thing is if you want to read a field,
25:06how would I read a field?
25:08This is really easy.
25:11First you look at what the field is in the node itself.
25:15But then it might have been changed.
25:17And so remember when I say read the field,
25:19I actually mean while I'm given some version,
25:21v, I want to know what is the value of this field at version
25:24v, because I want to be able to look at any of the old data
25:28structures too.
25:29So this would be at version v. I just
25:37look through all the modifications.
25:38There's constantly many, so it takes constant time
25:40to just flip through them and say, well, what changes
25:43have happened up to version v?
25:46So I look at mods with version less than
25:56or equal to v. That will be all the changes that happened up
25:59to this point.
26:00I see, did this field change?
26:02I look at the latest one.
26:03That will be how I read the field of the node, so
26:07constant time.
26:08There's lots of ways to make this efficient in practice.
26:10But for our purposes, it doesn't matter.
26:13It's constant.
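In code, the read might look like this sketch (continuing the toy Python from above, with each node assumed to also carry a mods list of (version, field, value) triples; in partial persistence a version is just an integer):

```python
def read(node, field, v):
    # Start from the value the field had when the node was created,
    # then apply the latest mod to this field with version <= v.
    value = node.fields[field]
    best = -1
    for ver, f, val in node.mods:   # at most 2p mods: constant time
        if f == field and best < ver <= v:
            value, best = val, ver
    return value
```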
26:16The hard part is how do you change a field?
26:18Because there might not be any room in the mod structure.
26:35So to modify, say we want to set node.field equal to x.
26:47What we do is first we check, is there
26:52any space in the mod structure?
26:54If there's any blank mods, so if the node is not full,
27:03we just add a mod.
27:06So a mod will look like (now, field, x).
27:13Just throw that in there.
27:15Because right at this moment--
27:17so we maintain a time counter, just increment
27:19it every time we do a change.
27:21This field changed that value.
27:23So that's the easy case.
27:24The trouble, of course, is if the node is full--
27:29the moment you've all been waiting for.
27:31So what we're going to do here is make a new node.
27:33We've run out of space.
27:34So we need to make a new node.
27:36We're not going to touch the old node, just going to let it sit.
27:38It still maintains all those old versions.
27:40Now we want a new node that represents the latest
27:43and greatest of this node.
27:44OK.
27:45So make a new node.
27:51I'll call it node prime to distinguish it from node,
27:57with all the mods, and this modification in particular,
28:04applied.
28:07OK, so we make a new version of this node.
28:11It's going to have some different fields, whatever
28:15was the latest version represented by those mods.
28:18It's still going to have back pointers,
28:20so we have to maintain all those back pointers.
28:24And now the mod, initially, is going
28:26to be empty, because we just applied them all.
28:29So this new node doesn't have any recent mods.
28:32Old node represents the old versions.
28:34This node is going to represent the new versions.
28:37What's wrong with this picture?
28:39AUDIENCE: Update pointers.
28:40ERIK DEMAINE: Update pointers.
28:42Yeah, there's pointers to the old version
28:43of the node, which are fine for the old versions of the data
28:47structure.
28:48But for the latest version of the data structure,
28:50this node has moved to this new location.
28:53So if there are any old pointers to that node,
28:56we've got to update them in the current version.
28:58We have to update them to point to this node instead.
29:00The old versions are fine, but the new version is in trouble.
29:04Other questions or all the same answer?
29:06Yeah.
29:06AUDIENCE: So if you wanted to read an old version
29:10but you just have the new version, [INAUDIBLE]?
29:15ERIK DEMAINE: OK--
29:16AUDIENCE: [INAUDIBLE]
29:17ERIK DEMAINE: The question is essentially,
29:19how do we hold on to versions?
29:22Essentially, you can think of a version of the data structure
29:24as where the root node is.
29:26That's probably the easiest.
29:27I mean, in general, we're representing versions
29:29by a number, v. But we always start at the root.
29:33And so you've given the data structure,
29:35which is represented by the root node.
29:36And you say, search for the value 5.
29:40Is it in this binary search tree or whatever?
29:43And then you just start navigating from the root,
29:45but you know I'm inversion a million or whatever.
29:49I know what version I'm looking for.
29:51So you start with the root, which never changes, let's say.
29:56And then you follow pointers that
29:58essentially tell you for that version
30:00where you should be going.
30:01I guess at the root version, it's a little trickier.
30:03You probably want a little array that says for this version,
30:07here's the root node.
30:08But that's a special case.
30:11Yeah.
30:11Another question?
30:13AUDIENCE: So on the new node that you
30:15created, the fields that you copied, you also have to have
30:19a version for them, right?
30:20Because [INAUDIBLE]?
30:26ERIK DEMAINE: These--
30:27AUDIENCE: Or do you version the whole node?
30:30ERIK DEMAINE: Here we're versioning the whole node.
30:32The original field values represent
30:34what was originally there, whenever this node was created.
30:37Then the mods specify what time the fields change.
30:40So I don't think we need times here.
30:44All right.
30:45So we've got to update two kinds of pointers.
30:47There's regular pointers, which live
30:48in the fields, which are things pointing to the node.
30:52But then there's also back pointers.
30:53Because if this is a pointer to a node,
30:55then there'll be a back pointer back to the node.
30:57And all of those have to change.
31:00Conveniently, the back pointers are easy.
31:11So if they're back pointers to the node,
31:13we change them to the node prime.
31:14How do we find the back pointers?
31:16Well, we just follow all the pointers
31:17and then there will be back pointers there.
31:21Because I said we're only maintaining
31:23backed pointers for the latest version,
31:25I don't need to preserve the old versions
31:28of those backed pointers.
31:29So I just go in and I change them.
31:31It takes constant time, because the constant number
31:33of things I point to, each one as a back pointer.
31:35So this is cheap.
31:37There's no persistence here.
31:39That's an advantage of partial persistence.
31:41The hard part is updating the pointers
31:44because those live in fields.
31:45I need to remember the old versions of those fields.
31:47And that we do recursively.
31:58Because to change those pointers,
32:00that's a field update.
32:01That's something exactly of this form.
32:02So that's the same operation but on a different node.
32:05So I just do that.
32:07I claim this is good.
32:08That's the end of the algorithm.
32:11Now we need to analyze it.
32:24How do we analyze it?
32:25Any guesses?
32:29AUDIENCE: Amortize it.
32:30ERIK DEMAINE: Amortized analysis, exactly
32:32the answer I was looking for.
32:33OK.
32:34[INAUDIBLE] amortization.
32:36The most powerful technique in amortization
32:38is probably the potential method.
32:40So we're going to use that.
32:42There's a sort of more--
32:44you'll see a charging argument in a moment.
32:50We want the potential function to represent when this data
32:53structure is in a bad state.
32:55Intuitively, it's in a bad state when a lot of nodes are full.
32:58Because then as soon as you make a change in them,
33:00they will burst, and you have to do all this crazy recursion
33:03and stuff.
33:04This case is nice and cheap.
33:05We just add a modification, constant time.
33:08This case, not so nice because we recurse.
33:10And then that's going to cause more recursions
33:12and all sorts of chaos could happen.
33:16So there's probably a few different potential functions
33:20that would work here.
34:21An obvious choice for the potential would
34:23be the number of full nodes.
33:25But I think we can make life a little bit easier
33:27by the following.
33:32Basically, the total number of modifications--
33:36not quite the total, almost the total.
33:39So I'm going to do c times the sum of the number of mods
33:49in latest version nodes.
33:59OK.
34:00So because we sort of really only
34:02care about-- we're only changing the latest version,
34:05so I really only care about nodes that
34:07live in the latest version.
34:08What do I mean by this?
34:09Well, when I made this new node prime,
34:11this becomes the new representation of that node.
34:14The old version is dead.
34:15We will never change it again.
34:18If we're modifying, we will never even look at it again.
34:21Because now everything points to here.
34:24So I don't really care about that node.
34:26It's got a ton of mods.
34:27But what's nice is that when I create this new node, now
34:30the mod list is empty.
34:31So I start from scratch, just like reinstalling
34:33your operating system.
34:34It's a good feeling.
34:38And so the potential goes down by, I guess, c times 2 times p.
34:45When I do this change, potential goes down by basically p.
34:49AUDIENCE: Is c any constant or--
34:52ERIK DEMAINE: c will be a constant to be determined.
34:55I mean, it could be 1.
34:57It depends how you want to define it.
34:58I'm going to use the CLRS notion of amortized cost, which
35:02is actual cost plus change in potential.
35:06And then I need a constant here, because I'm
35:08measuring a running time versus some combinatorial quantity.
35:12So this will be to match the running time that we'll get to.
35:17OK.
35:17So what is amortized cost?
35:22There's sort of two cases modification.
35:24There's the cheap case and the not so cheap case.
35:28In general, amortized cost--
35:34in both cases, it's going to be at most--
35:37well, first of all, we do some constant work
35:39just to figure out all this stuff, make copies, whatever.
35:44So that's some constant time.
35:49That's the part that I don't want to try to measure.
35:52Then potentially, we add a new mod.
35:55If we add a mod, that increases the potential by c.
35:59Because we're just counting mods, multiplying by c.
36:02So we might get plus 1 mod.
36:04This is going to be an upper bound.
36:06We don't always add 1, but worst case, we always had 1,
36:09let's say.
36:11And then there's this annoying part.
36:14And this might happen, might not happen.
36:16So then there's a plus maybe.
36:20If this happens, we decrease the potential
36:23because we empty out the mods for that node in terms
36:26of the latest version.
36:27So then we get a negative 2cp, change in potential.
36:34And then we'd have to pay I guess up to p recursions.
36:49Because we have to--
36:51how many pointers are there to me?
36:53Well, at most p of them, because there are at most p pointers
36:58to any node.
37:02OK.
37:03This is kind of a weird--
37:05it's not exactly algebra here.
37:06I have this thing, recursions.
37:09But if you think about how this would expand,
37:11all right, this is constant time.
37:13That's good.
37:14And then if we do this--
37:15I'll put a question mark here.
37:16It might be here.
37:16It might not.
37:18If it's not here, fine, constant.
37:19If it is here, then this gets expanded into this thing.
37:24It's a weird way to write a recurrence.
37:26But we get p times whatever is in this right hand side.
37:30OK.
37:31But then there's this minus 2cp.
37:33So we're going to get p times 2c here.
37:36That's the initial cost.
37:37So that will cancel with this.
37:40And then we might get another recursion.
37:41But every time we get a recursion, all the terms
37:43cancel.
37:44So it doesn't matter whether this is here or not.
37:46You get 0, which is great.
37:49And you're left with the original 2c.
37:53Constant.
37:55OK.
37:56[INAUDIBLE] potential functions are always a little crazy.
37:59What's happening here is that, OK, maybe you add a mod.
38:03That's cheap.
38:05But when we have to do this work and we have to do this
38:08recursion-- this is up to 2p updates or recursions--
38:14we are charging it to the emptying of this node.
38:17The number of mods went from 2p down to 0.
38:21And so we're just charging this update cost
38:22to that modification.
38:24So if you like charging schemes, this is much more intuitive.
38:26But with charging schemes, it's always a little careful.
38:28You have to make sure you're not double charging.
38:30Here it's obvious that you're not double charging.
38:34Kind of cool and magical.
38:37This is a paper by Driscoll, Sarnak, Sleator,
38:42Tarjan from 1989.
38:43So it's very early days of amortization.
38:45But they knew how to do it.
38:47Question?
38:48AUDIENCE: [INAUDIBLE]
38:50ERIK DEMAINE: What happens if you overflow the root?
38:52Yeah, I never thought about the root before today.
38:54But I think the way to fix the root is
38:57just you have one big table that says, for a given version--
39:02I guess a simple way would be to say,
39:04not only is a version a number, but it's also
39:06a pointer to the root.
39:07There we go.
39:07Pointer machine.
39:09So that way you're just always explicitly maintaining
39:11the root copy or the pointer.
39:15Because otherwise, you're in trouble.
39:18AUDIENCE: Then can you go back to [INAUDIBLE].
39:21ERIK DEMAINE: So in order to refer to an old version,
39:24you have to have the pointer to that root node.
39:26If you want to do it just from a version number,
39:29look at the data structure.
39:30Just from a version number, you would
39:31need some kind of lookup table, which
39:33is outside the pointer machine.
39:34So you could do it in a real computer,
39:36but a pointer machine is not technically allowed.
39:39So it's slightly awkward.
39:40No arrays are allowed in pointer machines,
39:42in case that wasn't clear.
39:43Another question?
39:44AUDIENCE: [INAUDIBLE] constant space to store for [INAUDIBLE].
39:48And also, what if we have really big numbers [INAUDIBLE]?
39:54ERIK DEMAINE: In this model, in the pointer machine model,
39:56we're assuming that whatever the data is in the items
39:58take constant space each.
40:01If you want to know about bigger things in here,
40:03then refer to future lectures.
40:05This is time travel, after all.
40:06Just go to a future class and then come back.
40:09[LAUGHS] So we'll get there, but right now,
40:11we're not thinking about what's in here.
40:15Whatever big thing you're trying to store,
40:16you reduce it down to constant size things.
40:19And then you spread them around nodes of a pointer machine.
40:22How you do that, that's up to the data structure.
40:25We're just transforming the data structure to be persistent.
40:28OK, you could ask about other models than pointer machines,
40:30but we're going to stick to pointer machines here.
40:34All right.
40:36That was partial persistence.
40:38Let's do full persistence.
40:41That was too easy.
40:46Same paper does full persistence.
40:48That was just a warm-up.
40:50Full persistence is actually not that much harder.
40:55So let me tell you basically what changes.
41:04There are two issues.
41:05One is that everything here has to change and not by much.
41:09We're still going to use back pointers.
41:11We're still going to have the mods.
41:12The number of mods is going to be slightly different
41:15but basically the same.
41:16Back pointers no longer just refer to the latest version.
41:19We have to maintain back pointers in all versions.
41:21So that's annoying.
41:22But hey, that's life.
41:24The amortization, the potential function
41:25will change slightly but basically not much.
41:30Sort of the bigger issue you might first wonder about,
41:33and it's actually the most challenging technically,
41:35is versions are no longer numbers.
41:37Because it's not a line.
41:39Versions are nodes in a tree.
41:41You should probably call them vertices
41:42in a tree to distinguish them from nodes in the pointer
41:45machine.
41:46OK, so you've got this tree of versions.
41:48And then versions are just some point on that tree.
41:53This is annoying because we like lines.
41:57We don't like trees as much.
41:58So what we're going to do is linearize the tree.
42:04Like, when in doubt, cheat.
42:12How do we do this?
42:13With tree traversal.
42:15Imagine I'm going to draw a super complicated tree
42:18of versions.
42:19Say there are three versions.
42:21OK.
42:22I don't want to number them, because that would be
42:24kind of begging the question.
42:26So let's just call them x, y, and z.
42:33All right.
42:34I mean, it's a directed tree, because we
42:36have the older versions.
42:37This is like the original version.
42:38And we made a change.
42:39We made a different change on the same version.
42:42What I'd like to do is a traversal of that tree,
42:45like a regular, as if you're going to sort those nodes.
42:48Actually, let me use color, high def here.
42:53So here's our traversal of the tree.
42:59And I want to look at the first and the last time I
43:01visit each node.
43:02So here's the first time I visit x.
43:05So I'll write this is the beginning of x.
43:09Capital X. Then this is the first time I visit y,
43:13so it's beginning of y.
43:15And then this is the last time I visit y, so it's the end of y.
43:19And then, don't care.
43:20Then this is the beginning of z.
43:24And this is the end of z.
43:27And then this is the end x.
43:29If I write those sequentially, I get bxbyeybzez,
43:38because this is so easy, ex.
43:42OK, you can think of these as parentheses, right?
43:45For whatever reason I chose b and e for beginning and ending,
43:48but this is like open parens, close parens.
43:50This is easy to do in linear time.
43:52I think you all know how.
43:53Except it's not a static problem.
43:55Versions are changing all the time.
43:56We're adding versions.
43:57We're never deleting versions, but we're always
43:59adding stuff to here.
44:00It's a little awkward, but the idea
44:01is I want to maintain this order,
44:05maintain the begin and the end of each you
44:16might say subtree of versions.
44:23This string, from bx to ex, represents
44:25all of the stuff in x's subtree, in the rooted tree
44:29starting at x.
44:33How do I maintain that?
44:40Using a data structure.
44:56So we're going to use something, a data structure we haven't yet
45:00seen.
45:02It will be in lecture 8.
45:04This is a time travel data structure,
45:06so I'm allowed to do that.
45:10So order maintenance data structure.
45:14You can think of this as a magical linked list.
45:16Let me tell you what the magical linked list can do.
45:19You can insert--
45:22I'm going to call it an item, because node
45:24would be kind of confusing given where we are right now.
45:28You can insert a new item in the list immediately before
45:32or after a given item.
45:37OK.
45:37This is like a regular linked list.
45:41Here's a regular linked list.
45:44And if I'm given a particular item like this one,
45:48I can say, well, insert a new item right here.
45:51You say, OK.
45:51Fine.
45:52I'll just make a new node and relink here, relink there.
45:57Constant time, right?
45:58So in an order maintenance data structure,
46:00you can do this in constant time.
46:01Wow!
46:02So amazing.
46:05OK, catch is the second operation you can do.
46:08Maybe I'll number these.
46:09This is the update.
46:10Then there's the query.
46:13The query is, what is the relative order
46:17of two nodes, of two items?
46:24x and y.
46:27So now I give you this node and this node.
46:29And I say, which is to the left?
46:32Which is earlier in the order?
46:34I want to know, is x basically less than y
46:36in terms of the order in the list?
46:37Or is y less than x?
46:41And an order maintenance data structure
46:42can do this in constant time.
46:45Now it doesn't look like your mother's linked list, I guess.
46:50It's not the linked list you learned in school.
46:52It's a magical linked list that can somehow
46:54answer these queries.
46:55How?
46:56Go to lecture 7.
46:58OK.
46:59Forward reference, lecture 8, sorry.
47:03For now, we're just going to assume that this magical data
47:05structure exists.
47:06So in constant time, this is great.
47:09Because if we're maintaining these b's and e's, we
47:11want to maintain the order that these things appear in.
47:16If we want to create a new version,
47:17like suppose we were just creating version z,
47:20well, it used to be everything without this bz, ez.
47:23And we'd just insert two items in here, bz and ez.
47:27They're right next to each other.
47:28And if we were given version x, we could just say,
47:30oh, we'll look at ex and insert two items right before it.
47:34Or you can put them right after bx.
47:36I mean, there's no actual order here.
47:37So it could have been y first and then z or z first and then
47:40y.
47:42So it's really easy to add a new version in constant time.
47:44You just do two of these insert operations.
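A sketch of how a new version slots into this bracket sequence; OrderMaintenance and its method names (insert_first, insert_after) are my stand-ins for the magical list, not an interface given in the lecture:

```python
# Each version v owns two items, b_v and e_v, in the order-maintenance
# list, and a child's brackets nest inside its parent's.
class Version:
    def __init__(self, om, parent=None):
        if parent is None:
            self.b = om.insert_first()        # root: "b_root e_root"
            self.e = om.insert_after(self.b)
        else:
            # place "b_v e_v" immediately after the parent's b (putting
            # them just before the parent's e would be equally valid)
            self.b = om.insert_after(parent.b)
            self.e = om.insert_after(self.b)  # two O(1) inserts total
```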
47:47And now you have this magical order operation, which
47:50if I'm given two versions--
47:54I don't know, v and w--
47:56and I want to know is v an ancestor of w,
48:00now I can do it in constant time.
48:02So this lets me do a third operation, which is, is version
48:09v an ancestor of version w?
48:21Because that's going to be true if and only if bv
48:26is an ev nest around bw and ew.
48:39OK.
48:40So that's just three tests.
48:41They're probably not all even necessary.
48:43This one always holds.
48:45But if these guys fit in between these guys, then you know--
48:50now, what this tells us, what we care about here,
48:54is reading fields.
48:58When we read a field, we said, oh, we'll
49:00apply all the modifications that apply to version
49:02v. Before that, that was a linear order.
49:04So it's just all versions less than or equal to v. Now
49:06it's all versions that are ancestors of v. Given a mod,
49:10we need to know, does this mod apply to my version?
49:13And now I tell you, I can do that in constant time
49:16through magic.
49:17I just test these order relations.
49:20If they hold, then that mod applies to my version.
49:24So w's the version we're testing.
49:27v is some version in the mod.
49:29And I want to know, am descendant of that version?
49:32If so, the mod applies.
49:34And I update what the field is.
49:36I can do all pairwise ancestor checks and figure out,
49:39what is the most recent version in my ancestor history
49:43that modified a given field?
49:44That lets me read a field in constant time.
49:47Constants are getting kind of big at this point,
49:49but it can be done.
49:53Clear?
49:54A little bit of a black box here.
49:56But now we've gotten as far as reading.
50:01And we don't need to change much else.
50:04So this is good news
50:11Maybe I'll give you a bit of a diff.
50:15So full persistence, fully persistent theorem--
50:26done.
50:27OK.
50:27Same theorem just with full persistence.
50:30How do we do it?
50:31We store back pointers now for all versions.
50:35It's a little bit annoying.
50:36But how many mods do we use?
50:40There's lots of ways to get this to work,
50:42but I'm going to change this number
50:44to 2 times d plus p plus 1.
50:51Wait, what's d? d is the number of fields here.
50:56OK.
50:57We said it was constant number fields.
50:59I never said what that constant is. d for out degree, I guess.
51:03So p is the max in-degree. d is the max out-degree.
51:09So just slightly more-- that main reason for this
51:11is because back pointers now are treated like everyone else.
51:14We have to treat both the out pointers and the in pointers
51:17as basically the same.
51:18So instead of p, we have d plus p.
51:19And there's a plus 1 just for safety.
51:23It gets my amortization to work, hopefully.
51:28OK.
51:29Not much else-- this page is all the same.
51:32Mods are still, you give versions, fields, values,
51:35reading.
51:36OK, well, this is no longer less than or equal to v. But
51:41this is now with a version, sort of the nearest version, that's
51:47an ancestor of v.
51:50That's what we were just talking about.
51:52So that can be done in constant time.
51:54Check it for all of them, constant work.
51:57OK.
51:58That was the first part.
52:04Now we get to the hard part, which is modification.
52:07This is going to be different.
52:08Maybe you I should just erase--
52:10yeah, I think I'll erase everything,
52:13except the first clause.
52:24OK.
52:24If a node is not full, we'll just
52:26add a mod, just like before.
52:28What changes is when a node is full.
52:36Here we have to do something completely different.
52:38Why?
52:38Because if we just make a new version
52:41of this node that has empty mods, this one's still full.
52:45And I can keep modifying the same version.
52:48This new node that I just erased represents some new version.
52:52But if I keep modifying an old version, which
52:54I can do in full persistence, this node keeps being full.
52:57And I keep paying potentially huge cost.
53:00If all the nodes were full, and when I make this change
53:02every node gets copied, and then I
53:04make a change to the same version,
53:06every node gets copied again.
53:07This is going to take linear time per operation.
53:09So I can't do the old strategy.
53:11I need to somehow make this node less full.
53:15This is where we're definitely not functional.
53:17None of this was functional, but now I'm
53:19going to change an old node, not just make a new one in a more
53:24drastic way.
53:25Before I was adding a mod.
53:27That's not a functional operation.
53:28Now I'm actually going to remove mods from a node to rebalance.
53:33So what I'd like to do is split the node into two halves.
53:43OK.
53:43So I had some big node that was--
53:46I'll draw it-- completely full.
53:50Now I'm going to make two nodes.
53:52Here we go.
53:59This one is going to be half full.
54:01This one's going to be half full of mods.
54:04OK.
54:05The only question left is, what do I do with all these things?
54:12Basically what I'd like to do is have the--
54:14on the one hand, I want to have the old node.
54:18It's just where it used to be.
54:20I've just removed half of the mods, the second half,
54:23the later half.
54:25What does that mean?
54:26I don't know.
54:27Figure it out.
54:29It's linearized.
54:31I haven't thought deeply about that.
54:32Now we're going to make a new node with the second half
54:36of the mods.
54:40It's more painful than I thought.
54:41In reality, these mods represent a tree of modifications.
54:45And what you need to do is find a partition of that tree
54:48into two roughly equal halves.
54:51You can actually do a one third, 2/3 split.
54:52That's also in a future lecture, whose number I forget.
54:57So really, you're splitting this tree
54:58into two roughly balanced halves.
55:01And so this 2 might actually need to change to a 3,
55:03but it's a constant.
55:06OK.
55:07What I want is for this to represent
55:09a subtree of versions.
55:10Let me draw the picture.
55:11So here's a tree of versions represented by the old mods.
55:15I'd like to cut out a subtree rooted at some node.
55:18So let's just assume for now this has exactly
55:21half the nodes.
55:22And this has half the nodes.
55:25In reality, I think it can be one third, 2/3.
55:29OK.
55:29But let's keep it convenient.
55:32So I want the new node to represent
55:34this subtree and this node to represent everything else.
55:37This node is as if this stuff hasn't happened yet.
55:41I mean, so it represents all these old versions that do not,
55:44that are not in the subtree.
55:45This represents all the latest stuff.
55:47So what I'm going to do is like before, I
55:49want to apply some mods to these fields.
55:54And whatever mods were relevant at this point, whatever
55:58had been applied, I apply those to the fields here.
56:02And so that means I can remove all of these mods.
56:06I only cared about these ones.
56:09Update these fields accordingly.
56:11I still have the other mods to represent all the other changes
56:14that could be in that subtree.
56:16OK.
56:16So we actually split the tree, and we apply mods to new nodes.
56:38Anything else I need to say?
56:42Oh, now we need to update pointers.
56:44That's always the fun part.
56:49Let's go over here.
57:05So old node hasn't moved.
57:07But this new node has moved.
57:09So for all of these versions, the pointer
57:13that used to point to the old node
57:18should now point to the new node.
57:20In this version, it's fine.
57:21It should still point to old node,
57:23because this represents all those old versions.
57:25But for the new version, that version in the subtree,
57:28I've got to point here instead.
57:30OK.
57:31So how many pointers could there be to this node
57:37that need to change?
57:38That's a tricky part in this analysis.
57:41Think about it for a while.
57:45I mean, in this new node, whatever
57:47is pointed to by either here or here in the new node also has
57:50a return pointer.
57:50All pointers are bidirectional.
57:52So we don't really care about whether they're forward
57:54or backward.
57:54How many pointers are there here?
57:56Well, there's d here and there's p here.
57:59But then there's also some additional pointers
58:01represented over here.
58:02How many?
58:04Well, if we assume this magical 50/50 split,
58:06there's right now d plus p plus 1 mods over here, half of them.
58:12Each of them might be a pointer to some other place, which
58:16has a return pointer in that version.
58:18So number of back pointers that we need to update
58:23is going to be this, 2 times d plus 2 times p plus 1.
58:30So recursively update at most 2 times d plus 2 times p
58:41plus 1 pointers to the node.
58:50The good news is this is really only half of them
58:52or some fraction of them.
58:54It used to be--
58:57well, there were more pointers before.
58:59We don't have to deal with these ones.
59:00That's where we're saving, and that's
59:01why this amortization works.
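Putting the overflow case together as a rough sketch (pick_split, pointers_into, and write_pointer are hypothetical helpers, and I pretend the split is exactly 50/50 rather than the 1/3-2/3 the real structure needs):

```python
def overflow(node):
    # Split the tree of mods at some version s so each side holds
    # about half of them. Hypothetical helper.
    s, old_mods, subtree_mods = pick_split(node.mods)
    new = Node()
    for f in range(Node.NUM_FIELDS):
        # Bake into the fields everything applied as of version s,
        # so the new node starts from those values.
        new.fields[f] = read_full(om, node, f, s)
    new.mods = subtree_mods       # mods inside s's subtree of versions
    node.mods = old_mods          # old node keeps all the other versions
    # Redirect the <= 2d + 2p + 1 pointers (fields, back pointers, and
    # pointers sitting in the surviving mods) that should now see `new`
    # in versions descending from s. Each redirect is itself a field
    # change in some version, so it may recurse.
    for other, f, ver in pointers_into(node):
        if is_ancestor(om, s, ver):
            write_pointer(other, f, new, ver)    # hypothetical helper
```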
59:03Let me give you a potential function that makes this work--
59:12is minus c times sum of the number of empty mod slots.
59:23It's kind of the same potential but before
59:26we had this notion of dead and alive nodes.
59:28Now everything's alive because everything
59:30could change at any moment.
59:31So instead, I'm going to measure how much room I have
59:36in each node.
59:37Before I had no room in this node.
59:38Now I have half the space in both nodes.
59:41So that's good news.
59:44Whenever we have this recursion, we
59:48can charge it to a potential decrease.
59:56Phi goes down by--
1:00:01because I have a negative sign here--
1:00:03c times, oh man, 2 times d plus p plus 1, I think.
1:00:13Because there's d plus p plus 1 space here,
1:00:15d plus p plus 1 space here.
1:00:17I mean, we added one whole new node.
1:00:18And total capacity of a node in mods
1:00:20is 2 times d plus p plus 1.
1:00:23So we get that times c.
1:00:26And this is basically just enough,
1:00:28because this is 2 times d plus 2 times p plus 2.
1:00:32And here we have a plus 1.
1:00:34And so the recursion gets annihilated by 2 times d plus
1:00:392 times p plus 1.
1:00:41And then there's one c left over to absorb
1:00:43whatever constant cost there was to do all this other work.
1:00:47So I got the constants just to work,
1:00:51except that I cheated and it's really a one third, 2/3 split.
1:00:54So probably all of these constants have to change,
1:00:57such is life.
1:00:58But I think you get the idea.
1:01:01Any questions about full persistence?
1:01:07This is fun stuff, time travel.
1:01:10Yeah?
1:01:11AUDIENCE: So in the first half of the thing where
1:01:14the if, there's room you can put it in.
1:01:16ERIK DEMAINE: Right.
1:01:17AUDIENCE: I have a question about how
1:01:17we represent the version.
1:01:19Because before when we said restore now [INAUDIBLE].
1:01:23It made more sense if now was like a timestamp or something.
1:01:25ERIK DEMAINE: OK.
1:01:26Right, so how do we represent a version even here or anywhere?
1:01:31When we do a modification, an update, in the data structure,
1:01:34we want to return the new version.
1:01:36Basically, we're going to actually store
1:01:39the DAG of versions.
1:01:41And a version is going to be represented by a pointer
1:01:43into this DAG.
1:01:44One of the nodes in this DAG becomes a version.
1:01:47Every node in this DAG is going to store a pointer
1:01:50to the corresponding b character and a corresponding e character
1:01:53in this data structure, which then
1:01:56lets you do anything you want.
1:01:57Then you can query against that version,
1:01:59whether it's an ancestor of another version.
1:02:01So yeah, I didn't mention that.
1:02:02Versions are nodes in here.
1:02:04Nodes in here have pointers to the b's and e's.
1:02:06And vice versa, the b's and e's have pointers back
1:02:08to the corresponding version node.
1:02:10And then you can keep track of everything.
1:02:12Good question.
1:02:14Yeah?
1:02:15AUDIENCE: [INAUDIBLE] question.
1:02:16Remind me what d is in this.
1:02:17ERIK DEMAINE: Oh, d was the maximum out degree.
1:02:19It's the number of fields in a node, as defined right here.
1:02:26Other questions?
1:02:29Whew.
1:02:30OK, a little breather.
1:02:31That was partial persistence, full persistence.
1:02:33This is, unfortunately, the end of the really good results.
1:02:36As long as we have constant degree nodes,
1:02:38in and out degree, we can do all.
1:02:41We can do for persistence for free.
1:02:44Obviously there are practical constants involved here.
1:02:47But in theory, you can do this perfectly.
1:02:53Before we go on to confluence, there
1:02:54is one positive result, which is what if you
1:02:58don't like amortize bounds.
1:03:00There are various reasons amortize bounds might not
1:03:02be good.
1:03:03Maybe you really care about every operation
1:03:04being no slower than it was except by a constant factor.
1:03:08We're amortizing here, so some operations get really slow.
1:03:11But the others are all fast to compensate.
1:03:14You can deamortize, it's called.
1:03:22You can get constant worst case slowdown
1:03:30for partial persistence.
1:03:36This is a result of Gerth Brodal from the late '90s, '97.
1:03:44For full persistence-- so it's an open problem.
1:03:47I don't know if people have worked on that.
1:03:55All right.
1:03:56So some, mostly good results.
1:03:59Let's move on to confluent persistence where things
1:04:01get a lot more challenging.
1:04:17Lots of things go out the window with confluent persistence.
1:04:20In particular, your versions are now a DAG.
1:04:23It's a lot harder to linearize a DAG.
1:04:25Trees are not that far from paths.
1:04:28But DAGs are quite far from paths, unfortunately.
1:04:33But that's not all that goes wrong.
1:04:44Let me first tell you the kind of end effect as a user.
1:04:50Imagine you have a data structure.
1:04:54Think of it as a list, I guess, which
1:04:57is a list of characters in your document.
1:04:59You're using vi or Word, your favorite, whatever.
1:05:03It's a text editor.
1:05:05You've got a string of words.
1:05:06And now you like to do things like copy and paste.
1:05:09It's a nice operation.
1:05:11So you select an interval of the string and you copy it.
1:05:16And then you paste it somewhere else.
1:05:18So now you've got two copies of that string.
1:05:21This is, in some sense, what you might
1:05:24call a confluent operation, because--
1:05:27yeah, maybe a cleaner way to think of it is the following.
1:05:30You have your string.
1:05:31Now I have an operation, which is split it.
1:05:33So now I have two strings.
1:05:35OK.
1:05:36And now I have an operation, which is split it.
1:05:38Now I have three strings.
1:05:40OK.
1:05:41Now I have an operation which is concatenate.
1:05:44So I can, for example, reconstruct
1:05:47the original string-- actually, I have the original string.
1:05:49No biggie.
1:05:51Let's say-- because I have all versions.
1:05:54I never lose them.
1:05:55So now instead, I'm going to cut the string here, let's say.
1:05:59So now I have this and this.
1:06:03And now I can do things like concatenate
1:06:06from here to here to here.
1:06:10And I will get this plus this plus this.
1:06:16OK.
1:06:17This guy moved here.
1:06:18So that's a copy/paste operation with a constant number
1:06:20of splits and concatenates.
1:06:22I could also do cut and paste.
1:06:23With confluence, I can do crazy cuts and pastes
1:06:26in all sorts of ways.
1:06:28So what?
1:06:29Well, the so what is I can actually
1:06:32double the size of my data structure
1:06:33in a constant number of operations.
1:06:36I can take, for example, the entire string
1:06:38and concatenate it to itself.
1:06:40That will double the number of characters,
1:06:41number of elements in there.
1:06:43I can do that again and again and again.
1:06:45So in u updates, I can potentially
1:06:51get a data structure size 2 to the u.
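Here is a toy Python illustration of that blow-up, using plain lists with no persistence machinery at all; it only demonstrates the counting, not the data structure.

```python
# u self-concatenations double the size each time: final size is 3 * 2^u.
versions = [list("abc")]        # keep every version, as a persistent user would
u = 10
for _ in range(u):
    latest = versions[-1]
    versions.append(latest + latest)   # concatenate a version with itself

print(len(versions[-1]))        # 3072, i.e. 3 * 2**10
```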
1:06:57Kind of nifty.
1:06:58I think this is why confluence is cool.
1:07:00It's also why it's hard.
1:07:02So not a big surprise.
1:07:03But, here we go.
1:07:08In that case, the version DAG, for reference, looks like this.
1:07:13You're taking the same version, combining it.
1:07:16So here I'm assuming I have a concatenate operation.
1:07:20And so the effect here, every time I do this,
1:07:24I double the size.
1:07:44All right.
1:07:44What do I want to say about confluent persistence?
1:07:46All right.
1:07:47Let me start with the most general result, which
1:07:53is by Fiat and Kaplan in 2003.
1:08:04They define a notion called effective depth of a version.
1:08:08Let me just write it down.
1:08:21It's kind of like if you took this DAG
1:08:24and expanded it out to be a tree of all possible paths.
1:08:30Instead of pointing to the same node,
1:08:31you could just duplicate that node
1:08:33and then have pointers left and right.
1:08:35OK.
1:08:35So if I did that, of course, this size grows exponentially.
1:08:38It explicitly represents the size of my data structure.
1:08:41At the bottom, if I have u things,
1:08:42I'm going to have 2 to the u leaves at the bottom.
1:08:45But then I can easily measure the number of paths
1:08:49from the root to the same version.
1:08:50At the bottom, I still label it, oh, those
1:08:52are all v. They're all the same version down there.
1:08:54So exponential number of paths, if I take log,
1:08:56I get what I call effective depth.
1:08:58It's like if you somehow could rebalance that tree,
1:09:02this is the best you could hope to do.
1:09:05It's not really a lower bound.
1:09:07But it's a number.
1:09:08It's a thing.
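Here is a small Python sketch of that count (hypothetical names; the paper's exact definition differs in small details like the additive constant): the number of root-to-v paths via dynamic programming, with its log as the effective depth.

```python
from functools import lru_cache
from math import log2

# Hypothetical version DAG for the doubling example: each update
# concatenates the previous version with itself, so it has two
# parent edges to the same version.
parents = {
    "root": [],
    "v1": ["root", "root"],
    "v2": ["v1", "v1"],
}

@lru_cache(maxsize=None)
def num_paths(v):
    """Number of distinct root-to-v paths in the version DAG."""
    ps = parents[v]
    return 1 if not ps else sum(num_paths(p) for p in ps)

def effective_depth(v):
    # e(v) = 1 + lg(#paths): the depth v would have if the DAG were
    # expanded into the tree of all paths and then perfectly rebalanced.
    return 1 + log2(num_paths(v))

print(num_paths("v2"), effective_depth("v2"))   # 4 paths, e(v2) = 3.0
```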
1:09:09OK.
1:09:10Then the result they achieve is that the overhead is
1:09:17log the number of updates plus-- this
1:09:19is a multiplicative overhead, so you take your running time.
1:09:22You multiply it by this.
1:09:25And this is a time and a space overhead.
1:09:31So maximum effective depth of all versions, maybe even
1:09:34sum of effective depths, but we'll just say max to be safe.
1:09:39Sorry-- sum over all the operations.
1:09:41This is per operation.
1:09:43You pay basically the effective depth
1:09:44of that operation as a factor.
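In symbols, hedged from the spoken description (u is the number of updates, e(v) the effective depth of the version v an operation touches):

```latex
\text{overhead}(v) = O\bigl(\log u + e(v)\bigr)
\quad\text{multiplicative, in both time and space.}
```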
1:09:48Now, the annoying thing is if you have this kind of set up
1:09:51where the size grew exponentially,
1:09:54then number of paths is exponential.
1:09:56Log of the number of paths is linear in u.
1:09:59And so this factor could be as much as u, linear slowdown.
1:10:06Now, Fiat and Kaplan argue linear slowdown is not
1:10:08that bad, because if you weren't even persistent, if you did
1:10:13this in the naive way of just recopying the data,
1:10:18you were actually spending exponential time to build
1:10:21the final data structure.
1:10:22It has exponential size.
1:12:23Just to represent it explicitly requires exponential time,
1:12:26so losing a linear factor means u operations
1:12:29take u squared time instead of 2 to the u.
1:12:31So it's a big improvement to do this.
1:10:35The downside of this approach is that even if you have a version
1:10:40DAG that looks like this, even if the size of the data
1:10:43structure is staying normal, staying linear--
1:10:46potentially you could be doubling the size,
1:10:48but we don't know what this merge operation is.
1:10:49Maybe it just throws away one of the versions
1:10:51or does something--
1:10:53somehow takes half the nodes from one
1:10:55side, half the nodes from the other side maybe.
1:10:57If these operations do preserve size,
1:10:58then there's no great reason why there should be a linear slowdown,
1:11:02but there is.
1:11:03OK?
1:11:04So it's all right but not great.
1:11:10And it's the best general result we know.
1:11:13They also prove a lower bound.
1:11:21So the lower bound is the sum of effective depths, e(v), in total bits of space.
1:11:37OK.
1:11:37What does this mean?
1:11:40So even if this is not happening,
1:11:42the number of bits of space you need
1:11:44in the worst case-- this does not
1:11:45apply to every data structure.
1:11:47That's one catch.
1:11:49They give a specific data structure
1:11:52where you need this much space.
1:11:53So it's similar to this kind of picture.
1:11:57We won't go into the details.
1:11:58And you need this much space.
1:12:00Now, this is kind of bad, because if there's
1:12:02u operations, and each of these is u, that's u squared space.
1:12:06So we actually need a factor u blowup in space,
1:12:09it looks like.
1:12:11But to be more precise, what this means is
1:12:14that you need omega e of v space, and therefore
1:12:17time overhead per update, if--
1:12:27this is not written in the paper--
1:12:29queries are free.
1:12:35Implicit here, they only charge the slowdown and space increase
1:12:40to the updates you do, which is pretty natural.
1:12:43Normally you think of queries as not increasing space.
1:12:46But in order to construct this lower bound,
1:12:49they actually do this many queries.
1:12:52So they do e of v queries and then one update.
1:12:55And they say, oh well, space had to go up by an extra e of v.
1:12:59So if you only charge updates for the space,
1:13:02then yes, you have to lose potentially
1:13:04a linear factor, this effective depth, potentially u.
1:13:07But if you also charge the queries,
1:13:09it's still constant in their example.
1:13:13So open question, for confluent persistence,
1:13:18can you achieve constant everything?
1:13:21Constant time and space overheads,
1:13:27multiplicative factor per operation,
1:13:33both updates and queries.
1:13:35So if you charge the queries, potentially you
1:13:37could get constant everything.
1:13:38This is a relatively new realization.
1:13:43And no one knows how to do this yet.
1:13:47Nice challenge.
1:13:47I think maybe we'll work on that in our first problem session.
1:13:50I would like to.
1:13:53Questions about that result?
1:13:54I'm not going to prove the result.
1:13:56But it is a fancy rebalancing of those kinds
1:13:59of pictures to get this log.
1:14:10There are other results I'd like to tell you about.
1:14:32So brand new result--
1:14:34that was from 2003.
1:14:35This is from 2012--
1:14:38no, '11, '11, sorry.
1:14:42It's SODA, which is in January, so it's a little confusing.
1:14:47Is it '11?
1:14:49Maybe '12.
1:14:50Actually now I'm not sure.
1:14:51It's February already, right?
1:14:54In January, either this year or last year.
1:15:00It's not as general a transformation.
1:15:02It's only going to hold in what's called a disjoint case.
1:15:05But it gets a very good bound--
1:15:07not quite constant, but logarithmic.
1:15:09OK, logarithmic would also be nice.
1:15:12Log n, whatever n is.
1:15:17Pick your favorite n; the number of operations, say.
1:15:22OK.
1:15:25If you assume that confluent operations are performed only
1:15:39on two versions with no shared nodes--
1:15:50OK, this would be a way to forbid this kind of behavior
1:15:53where I concatenate the data structure with itself.
1:15:56All the nodes are common.
1:15:58If I guarantee that I slice this up,
1:16:01dice it, whatever, and then re-merge the pieces
1:16:03in some other order, but I never use two copies
1:16:06of the same piece, that would be a valid confluent
1:16:10operation over here.
1:16:12This is quite a strong restriction.
1:16:16If you try to violate it, who knows what happens.
1:16:19Behavior's undefined.
1:16:19The structure won't tell you, oh, those two versions
1:16:21have this node in common,
1:16:22you've got to make a second copy of it.
1:16:24So somehow you have to guarantee that confluent operations
1:16:27never overlap.
1:16:29But they can be reordered.
1:16:33Then you can get order log n overhead.
1:16:39n is the number of operations.
1:16:45I have a sketch of a proof of this
1:16:46but not very much time to talk about it.
1:16:48All right.
1:16:49Let me give you a quick picture.
1:16:51In general, the versions form a DAG.
1:16:55But if you make this assumption, and you look at a single node,
1:17:00and look at all the versions where that node appears,
1:17:03that is a tree.
1:17:05Because you're not allowed to remerge versions
1:17:07that have the same node.
1:17:08So while the big picture is a DAG,
1:17:11the small picture of a single guy is some tree.
1:17:17I'm drawing all these wiggly lines
1:17:18because there are all these versions where
1:17:20the node isn't changing.
1:17:21This is the entire version DAG.
1:17:23And then some of these nodes--
1:17:26some of these versions, I should say--
1:17:29that node that we're thinking about changes.
1:17:31OK, whenever it branches, it's probably
1:17:33because the actual node changed, maybe.
1:17:36I don't know.
1:17:37Anyway, there are some dots here where the node changed,
1:17:40some at the leaves, maybe.
1:17:41Maybe some of them haven't changed yet.
1:17:44In fact, let's see.
1:17:48Here where it's changed, it could be that we destroyed the node.
1:17:51Maybe it's gone from the actual data structure.
1:17:54But there still may be versions down here.
1:17:56It's not really a tree.
1:17:57It's a whole DAG of stuff down there.
1:17:59So that's kind of ugly.
1:18:01Wherever the node still exists,
1:18:03I guess that is an actual leaf of the DAG.
1:18:05So those are OK.
1:18:06But as soon as I maybe delete that node,
1:18:08then there can be a whole subtree down there.
1:18:11OK.
1:18:12So now if you look at an arbitrary version,
1:18:15so what we're thinking about is how to implement reading,
1:18:17let's say.
1:18:18Reading and writing are more or less the same.
1:18:21I give you a version.
1:18:22I give you a node, and I give you a field.
1:18:23I want to know, what is the value of that field, that node,
1:18:26that version?
1:18:27So now where could a version fall?
1:18:30Well it has to be in this subtree.
1:18:31Because the node has to exist.
1:18:36And then it's maybe a pointer.
1:18:38A pointer could be to another node, which
1:18:42also has this kind of picture.
1:18:44They could be overlapping trees.
1:18:46In general, there are three cases.
1:18:48Either you're lucky, and the version you're talking about
1:18:51is a version where the node was changed.
1:18:53In that case, the data is just stored right there.
1:18:58That's easy.
1:18:59So you could just say, oh, how did the node change?
1:19:01Oh, that's what the field is.
1:19:02OK, follow the pointer.
1:19:05A slightly harder case is a version
1:19:08in between two such changes.
1:19:09And maybe these are not updates.
1:19:11So I sort of want to know, what was the previous version where
1:19:17this node changed in constant time?
1:19:21It can be done.
1:19:22Not constant time, actually, logarithmic time,
1:19:25using a data structure called link-cut trees,
1:19:28another fun black box for now, which
1:19:31we will cover in lecture 19, far in the future.
1:19:36OK.
1:19:39Well, that's one case.
1:19:40There's also the case where maybe a version
1:19:43is down here in a subtree.
1:19:45I guess then the node didn't exist.
1:19:48Well, all these things can happen.
1:19:50And that's even harder.
1:19:51It's messy.
1:19:53They use another trick, which is called fractional cascading,
1:19:59which I'm not even going to try to describe what it means.
1:20:02But it's got a very cool name.
1:20:04Because we'll be covering it in lecture 3.
1:20:06So stay tuned for that.
1:20:07I'm not going to say how it applies to this setting,
1:20:09but it's a necessary step in here.
1:20:13In the remaining zero minutes, let
1:20:15me tell you a little bit about functional data structures.
1:20:17[LAUGHTER]
1:20:20Beauty of time travel.
1:20:24Functional-- I just want to give you
1:20:31some examples of things that can be done functionally.
1:20:33There's a whole book about functional data structures
1:20:35by Okasaki.
1:20:36It's pretty cool.
1:20:38A simple example is balanced BSTs.
1:20:42So if you just want to get log n time for everything,
1:20:44you can do that functionally.
1:20:45It's actually really easy.
1:20:46You pick your favorite balanced BST, like red-black trees.
1:20:48You implement it top down so you never follow parent pointers.
1:20:51So you don't need parent pointers.
1:20:52So then as you make changes down the tree, you just copy.
1:20:57It's called path copying.
1:20:58Whenever you're about to make a change,
1:21:00make a copy of that node.
1:21:02So you end up copying all the changed nodes and all
1:21:05their ancestors.
1:21:06There's only log n of them, so it takes log n time.
1:21:09Clear?
1:21:10Easy.
1:21:11It's a nice technique.
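A minimal Python sketch of path copying, using an unbalanced BST for brevity; a real implementation would use a top-down balanced tree such as a red-black tree so that the copied path has length O(log n).

```python
class Node:
    """Immutable-by-convention BST node; old versions are never mutated."""
    __slots__ = ("key", "left", "right")
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def insert(root, key):
    """Path copying: copy each node on the search path, share the rest,
    and return the root of the new version."""
    if root is None:
        return Node(key)
    if key < root.key:
        return Node(root.key, insert(root.left, key), root.right)
    return Node(root.key, root.left, insert(root.right, key))

# Each insert returns a new version; earlier versions keep working.
v1 = insert(None, 5)
v2 = insert(v1, 3)
v3 = insert(v2, 8)
assert v1.left is None            # v1 is unaffected by later inserts
assert v3.left.key == 3 and v3.right.key == 8
```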
1:21:12Sometimes path copying is very useful.
1:21:14Like link-cut trees, for example,
1:21:16can be made functional.
1:21:17We don't know what they are, but they're basically a BST.
1:21:19And you can make them functional.
1:21:21We use that in a paper.
1:21:23All right.
1:21:23Deques.
1:21:25These are doubly ended queues.
1:21:27So it's like a stack and a queue and everything.
1:21:29You can insert and delete from the beginning and the end.
1:21:32People start to know what these are now,
1:21:34because Python calls him that.
1:21:35But you can also do concatenation
1:21:41with deques in constant time per operation.
1:21:43This is cool.
1:21:44Deques are not very hard to make functional.
1:21:46But you can do deques and you can concatenate them
1:21:48like we were doing in the figure that's right behind this board.
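For flavor only, here is a classic purely functional FIFO queue in Python, built from cons lists as nested pairs; this is much weaker than the catenable deques just mentioned (no concatenation, and the occasional reversal makes pop amortized rather than worst-case constant), and all names are made up.

```python
def cons(x, xs):
    return (x, xs)                 # a cons list: (head, tail) or None

def rev(xs):
    acc = None
    while xs is not None:
        acc, xs = cons(xs[0], acc), xs[1]
    return acc

def empty():
    return (None, None)            # (front, back) pair of cons lists

def push(q, x):
    front, back = q
    return (front, cons(x, back))  # O(1): old version q is untouched

def pop(q):
    front, back = q
    if front is None:
        front, back = rev(back), None   # occasional O(n) reversal
    x, rest = front
    return x, (rest, back)

q = empty()
for i in range(3):
    q = push(q, i)
x, q = pop(q)
assert x == 0                      # FIFO order; every old version persists
```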
1:21:51Constant time split is a little harder.
1:21:53That's actually one of my open problems.
1:21:56Can you do lists with split and concatenate in constant time--
1:22:01functionally or confluently, persistently, or whatever?
1:22:05You can also do a mix of the two.
1:22:08You can get log n search and constant-time deque operations.
1:22:12Another example is you can do tries.
1:22:14So a trie is a tree with a fixed topology.
1:22:17Think of it as a directory tree.
1:22:20So maybe you're using Subversion.
1:22:21Subversion has time travel operations.
1:22:23You can copy an entire subtree from one version
1:22:26and stick it into a new version, another version.
1:22:30So you get a version DAG.
1:22:32It's a confluently persistent data structure--
1:22:34not implemented optimally, because we don't necessarily
1:22:37know how.
1:22:38But there is one paper.
1:22:40This actually came from the open problem section of this class
1:22:43four years ago, I think.
1:22:45It's with Eric Price and Stefan Langerman.
1:22:49You can get very good results.
1:22:50I won't write them down because it takes a while.
1:22:52Basically a factor of log the degree of the nodes
1:22:56to get functional, and you can be even fancier
1:22:59and get slightly better bounds, like log log the degree,
1:23:02confluently persistent, with various tricks,
1:23:05including using all of these data structures.
1:23:07So if you want to implement Subversion optimally,
1:23:09that is known how to be done but hasn't actually been done yet.
1:23:14Because there are those pesky constant factors.
1:23:18I think that's all.
1:23:19What is known about functional is there's a log n separation.
1:23:23You can be log n away from the best.
1:23:26That's the worst separation known,
1:23:30between functional and just a regular old data structure.
1:23:33It'd be nice to improve that.
1:23:34Lots of open problems here.
1:23:35Maybe we'll work on them next time.