0:00 The following content is provided under a Creative
0:02Commons license.
0:04Your support will help MIT OpenCourseWare
0:06continue to offer high quality educational resources for free.
0:10To make a donation or view additional materials
0:13from hundreds of MIT courses, visit MIT OpenCourseWare
0:17at ocw.mit.edu.
0:21ERIK DEMAINE: Welcome to 6.851 Advanced Data Structures.
0:25I am Erik Demaine.
0:26You can call me Erik.
0:28We have two TAs, Tom Morgan and Justin Zhang.
0:31Tom's back there.
0:32Justin is late.
0:36And this class is about all kinds
0:38of very cool data structures.
0:40You should have already seen basic data structures
0:42like balanced binary search trees and things
0:45like that, log n time to do whatever
0:48you want in one dimension.
0:50And here we're going to turn all those data structures
0:52on their head and consider them in all sorts
0:54of different models and additional cool problems.
0:56Today we're going to talk about time travel or temporal data
0:59structures, where you're manipulating time
1:02as any good time traveler should.
1:05Then we'll do geometry where we have higher dimensional
1:07data, more than one dimension.
1:09Then we'll look at a problem called
1:11dynamic optimality, which is, is there one best binary search
1:14tree that rules them all?
1:16Then we'll look at something called memory hierarchy, which
1:19is a way to model more realistic computers which have cache
1:22and then more cache and then main memory
1:24and then disk and all these different levels.
1:26How do you optimize for that?
1:28Hashing is probably the most famous, most popular,
1:31and most used data structure in computer science.
1:33We'll do a little bit on that.
1:35Integers, when you know that your data is integers and not
1:39just arbitrary black boxes that you can compare or do whatever,
1:42you can do a lot better with integers.
1:43You usually beat log n time.
1:45Often you can get constant time.
1:47For example, if you want to do priority queues,
1:49you can do square root of log log n time.
1:51That's the best known randomized.
1:55Dynamic graphs, you have a graph you want to store,
1:58and the edges are being added and maybe deleted, like you're
2:01representing a social network.
2:02And people are friending and de-friending.
2:04You want to maintain some interesting information
2:06about that graph.
2:07Strings, you have a piece of text,
2:10such as the entire worldwide web.
2:12And you want to search for a substring.
2:14How do you do that efficiently?
2:15It's sort of the Google problem.
2:17Or you searching through DNA for patterns, whenever.
2:20And finally succinct data structures,
2:22which is all about taking what we normally
2:24consider optimal space or n space
2:26and reducing it down to the very bare minimum of bits of space.
2:30Usually if you want to store something
2:32where there's 2 to the n possibilities,
2:34you want to get away with n bits of space,
2:36maybe plus square root of n or something very tiny.
2:40So those are succinct data structures.
2:42So that's an overview of the entire class.
2:44And these are sort of the sections we'll be following.
2:47Let me give you a quick administrative overview
2:50of what we're doing.
2:54Requirements for the class--
2:55I guess, first, attending lecture.
2:58Obviously if you don't attend lecture,
2:59there'll be videos online.
3:00So that's resolvable.
3:02But let me know if you're not going to make it.
3:05We're going to have problem sets roughly every week.
3:07If you're taking the class for credit,
3:09they have a very simple rule of one page in, one page out.
3:12This is more a constraint on us to write problems that
3:15have easy or short answers.
3:16You probably need to think about them
3:18a little bit before they're transparent, but then easy
3:21to write up.
3:23And then scribing lectures-- so we have a scribe for today,
3:26I hope.
3:27Here?
3:28Yes, good.
3:30So most of the lectures have already
3:32been scribed in some version, and your goal
3:34is to revise those scribe notes so that, if you don't like
3:38the handwritten notes, which are also online, they're easier
3:42for people to read.
3:44Let's see.
3:45Listeners welcome.
3:46We're going to have an open problem session.
3:48I really like open problems.
3:49I really like solving open problems.
3:50So we've done this every time this class has been offered.
3:53So if you're interested in also solving open problems,
3:55it's optional.
3:56I will organize-- in a couple of weeks,
3:59we'll have a weekly open problem session
4:03and try to solve all the things that
4:06push the frontier of advanced data structures.
4:09So in classes, we'll see the state of the art.
4:11And then we'll change the state of the art in those sessions.
4:15I think that's it.
4:16Any questions about the class before we
4:18get into the fun stuff?
4:22All right.
4:23Let's do some time traveling.
4:27Before I get to time traveling, though, I
4:28need to define our model of computation.
4:32A theme in this class is that the model of computation you're
4:35working with matters.
4:36Models matter.
4:38And there's lots of different models of computation.
4:40We'll see a few of the main ones in this class.
4:45And the starting point, and the one
4:48we'll be using throughout today, is called a pointer machine.
4:53It's an old one from the '80s.
4:57And it corresponds to what you might think about
4:59if you've done a lot of object-oriented programming,
5:02and before that, structure-oriented programming,
5:04I guess.
5:05So you have a bunch of nodes.
5:08They have some fields in them, a constant number of fields.
5:13You can think of these as objects or structs
5:16in C. It used to be records back in Pascal days,
5:20so a lot of the papers call them records.
5:22You could just have a constant number of fields.
5:24You could think of those numbered, labeled.
5:25It doesn't really matter because there's only
5:27a constant number of them.
5:28Each of the fields could be a pointer to another node,
5:32could be a null pointer, or could have some data in it.
5:35So I'll just assume that all my data is integers.
5:43You can have a pointer to yourself.
5:45You can have a pointer over here, whatever you want.
5:48A pointer machine would look something like this.
5:52At any moment, this is the state of the pointer machine.
5:55So you can think of this as the memory of your computer.
5:59And then you have some operations
6:02that you're allowed to do.
6:03That's the computation part of the model.
6:07You can think of this as the memory model.
6:10What you're allowed to do are create nodes.
6:13You can say something like, x equals new node.
6:19You can, I don't know, look at fields.
6:24You can do x equals y.field.
6:28You can set fields, x.field equals y.
6:33You can compute on these data, so you can add 5 and 7,
6:37do things like that.
6:39I'm not going to worry about--
6:40I'll just write et cetera.
6:43This is more a model about how everything's
6:45organized in memory, not so much about what you're allowed
6:48to do to the data items.
6:49In this lecture, it won't matter what
6:50you're doing to the data items.
6:52We never touch them.
6:53We just copy them around.
6:56So am I missing anything?
6:59Probably.
7:01I guess you could destroy nodes if you felt like it.
7:03But we won't have to today, because we
7:06don't want to throw anything away
7:07when you're time traveling.
7:09It's too dangerous.
7:12And then the one catch here is, what are x and y?
7:18There's going to be one node in this data structure
7:21or in your memory called the root node.
7:24And you could think of that as that's the thing you always
7:26have in your head.
7:27This is like your cache, if you will.
7:29It's just got a constant number of things, just
7:31like any other node.
7:32And x and y are fields of the root.
7:40So that sort of ties things down.
7:42You're always working relative to the root.
7:45But you can look at the data, basically follow this pointer,
7:50by looking at the field.
7:53You could set one of these pointers--
7:55I think I probably need another operation here,
7:58like x equals y.field1, field2, that sort of thing,
8:03and maybe the reverse.
8:06But the idea is that you can manipulate all nodes
8:09via the root.
8:10You follow pointers, do whatever.
8:12So pretty obvious, slightly annoying
8:14to write down formally.
8:16But that is pointer machine.
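A minimal sketch of this model in Python, purely illustrative (the class name, field count, and layout here are mine, not anything specified in the lecture):

```python
# A pointer-machine node: a constant number of fields, each holding
# an integer, a pointer to another node, or null (None).
class Node:
    NUM_FIELDS = 3  # any fixed constant

    def __init__(self):
        self.fields = [None] * Node.NUM_FIELDS

root = Node()                   # the distinguished root node ("cache")
x = Node()                      # x = new node
root.fields[0] = x              # x.field = y style assignment
x.fields[1] = 5                 # data items are just integers
x.fields[2] = x                 # pointers can point anywhere, even to self
v = root.fields[0].fields[1]    # x = y.field1.field2 style access; v == 5
```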
8:23And what we're going to be talking about today
8:26in time travel is suppose someone
8:28gives me a pointer machine data structure, for example,
8:31balanced binary search tree, linked list.
8:33A lot of data structures, especially classic data
8:36structures, follow pointer machine model.
8:39What we'd like to do is transform that data structure
8:41or make a new pointer machine data
8:42structure that does extra cool things,
8:45namely travel through time.
8:47So that's what we're going to do.
8:53There's two senses of time travel or temporal data
8:57structures that we're going to cover in this class.
9:02The one for today is called persistence,
9:05where you don't forget anything, like an elephant.
9:08And the other one is retroactivity.
9:15Persistence will be today.
9:16Retroactivity is next class.
9:19Basically, these correspond to two models of time travel.
9:21Persistence is the branching universe time travel model,
9:24where if you make a change in the past,
9:25you get a new universe.
9:27You never destroy old universes.
9:29Retroactivity is more like Back to the Future,
9:33when you go back, make a change, and then you
9:35can return to the present and see what happened.
9:37This is a lot harder to do.
9:39And we'll work on that next class.
9:42Persistence is what we will do today.
9:46So persistence.
9:57The general idea of persistence is to remember everything--
10:01the general goal, I would say.
10:05And by everything, I mean different versions
10:07of the data structure.
10:08So you're doing data structures in general.
10:11We have update operations and query operations.
10:14We're mainly concerned about updates here.
10:16Every time you do an update, you think of it
10:18as taking a version of the data structure and making a new one.
10:21And you never want to destroy old versions.
10:23So even though an update like an insert or something
10:26changes the data structure, we want to remember that past data
10:29as well.
10:30And then let's make this reasonable.
10:34All data structure operations are relative to a specified
10:37version.
10:47So an update makes and returns a new version.
11:05So when you do an insert, you specify
11:08a version of your data structure and the thing
11:10you want to insert.
11:11And the output is a new version.
11:13So then you could insert into that new version, keep going,
11:16or maybe go back to the old version, modify that.
11:20I haven't said exactly what's allowed here,
11:22but this is sort of the general goal.
11:25And then there are four levels of persistence
11:30that you might want to get.
11:33First level is called partial persistence.
11:37This is the easiest to obtain.
11:44And in partial persistence, you're
11:47only allowed to update the latest version, which
11:55means the versions are linearly ordered.
12:01This is the easiest to think about.
12:03And time travel can easily get confusing, so start simple.
12:09We have a timeline of various versions on it.
12:14This is the latest.
12:17And what we can do is update that version.
12:19We'll get a new version, and then our latest is this one.
12:23What this allows is looking back at the past to an old version
12:27and querying that version.
12:29So you can still ask questions about the old version,
12:31if you want to be able to do a search on any of these data
12:33structures.
12:34But you can't change them.
12:35You can only change the most recent version.
12:38So that's nice.
12:39It's kind of like Time Machine on a Mac, I guess.
12:44If you've ever seen the movie Deja Vu, which is not
12:46very common, but it's a good time travel
12:48movie, in the first half of the movie, all they can do
12:51is look back at the past.
12:52Later they discover that actually they
12:54have a full persistence model.
12:57It takes a while for dramatic effect.
13:04In full persistence, you can update anything you want--
13:08so update any version.
13:18And so then the versions form a tree.
13:27OK.
13:28So in this model, maybe you initially
13:30have a nice line of versions.
13:32But now if I go back to this version and update it,
13:34I branch, get a new version here.
13:37And then I might keep modifying that version sometimes.
13:40Any of these guys can branch.
13:44So this is why I call it the branching universe model, when
13:47you update your branch.
13:52So no version ever gets destroyed here.
13:54Again, you can query all versions.
13:56But now you can also update any version.
13:59But you just make a new version.
14:00It's a totally new world.
14:02When I update this version, this version
14:04knows nothing about all the--
14:06this doesn't know about this future.
14:07It's created its own future.
14:10There's no way to sort of merge those universes together.
14:14It's kind of sad.
14:16That's why we have the third level of persistence,
14:22which lets us merge timelines.
14:24It's great for lots of fiction out there.
14:35If you've seen the old TV show Sliders,
14:38that would be confluent persistence.
14:50So confluent persistence, you can combine two versions
15:01to create a new version.
15:09And in this case, again, you can't destroy old versions.
15:13In persistence, you never destroy versions.
15:16So now the versions form a DAG, directed acyclic graph.
15:22So now we're allowing--
15:24OK, you make some changes, whatever.
15:25You branch your universe, make some changes.
15:30And now I can say, OK, take this version of the data
15:32structure and this version and recombine them.
15:35Get a new version, and then maybe make some more changes.
15:38OK, what does combine mean?
15:40Well, it depends on your data structure.
15:42A lot of data structures have combine operations
15:44like if you have linked lists, you have two linked lists,
15:48you can concatenate them.
15:49That's an easy operation.
15:50Even if you have binary search trees,
15:51you can concatenate them reasonably easy
15:53and combine it into one big binary search tree.
15:56So if your data structure has an operation that
15:59takes as input two data structures,
16:01then what we're saying is now it can take two versions, which
16:05is more general.
16:06So I could take the same data structure,
16:08make some changes in one way, separately make
16:10some changes in a different way, and then
16:12try to concatenate them or do something crazy.
16:14This is hard to do, and most of it
16:16is an open problem whether it can be done.
16:19But I'll tell you about it.
16:21Then there's another level even more than confluent
16:24persistence.
16:26This is hard to interpret in the time travel world,
16:30but it would be functional data structures.
16:32If you've ever programmed in a functional programming
16:34language, it's a little bit annoying from an algorithm's
16:36perspective, because it constrains you to work
16:39in a purely functional world.
16:43You can never modify anything.
16:45OK.
16:46Now, we don't want to modify versions.
16:49That's fine.
16:49But in a functional data structure,
16:51you're not allowed to modify any nodes ever.
16:54All you can do is make new nodes.
17:03This is constraining, and you can't always
17:07get optimal running times in the functional world.
17:09But if you can get a functional data structure,
17:11you have all these things, because you
17:13can't destroy anything.
17:14If you can't destroy nodes, then in particular,
17:16you can't destroy versions.
17:17And all of these things just work for free.
17:19And so a bunch of special cases are known,
17:22interesting special cases, like search trees
17:24you can do in the functional world.
17:26And that makes all of these things easy.
17:29So the rest of this lecture is going
17:30to be general techniques for doing partial full persistence,
17:34what we know about confluent, and what
17:36we know about functional, brief overview.
17:41Any questions about those goals, problem definitions?
17:47Yeah.
17:48AUDIENCE: I'm still confused about functional, because--
17:50ERIK DEMAINE: What does functional mean?
17:52AUDIENCE: [INAUDIBLE]
17:55ERIK DEMAINE: Yeah, I guess you'll see what--
17:57functional looks like all the other things, I agree.
18:00You'll see in a moment how we actually implement
18:02partial and persistence.
18:03We're going to be changing nodes a lot.
18:07As long as we still represent the same data
18:10in the old versions, we don't have to represent it
18:13in the same way.
18:14That lets us do things more efficiently.
18:15Whereas in functional, you have to represent
18:17all the old versions in exactly the way
18:19you used to represent them.
18:20Here we can kind of mangle things around
18:22and it makes things faster.
18:23Yeah, good question.
18:25So it seems almost the same, but it's nodes versus versions.
18:29I haven't really defined a version.
18:31But it's just that all the queries answer the same way.
18:34That's what you need for persistence.
18:38Other questions?
18:40All right.
18:43Well, let's do some real data structures.
18:49We start with partial persistence.
18:55This is the easiest.
18:59For both partial and full persistence,
19:02there is the following result. Any pointer machine data
19:06structure-- one catch-- with a constant number of pointers
19:19to any node--
19:22so this is constant in-degree.
19:27In a pointer machine, you always have a constant number
19:30of pointers out of a node at most.
19:32But for this result to hold, we also
19:34need a constant number of pointers into any node.
19:37So this is an extra constraint.
19:42Can be transformed into another data structure that
19:47is partially persistent and does all the things it used to do--
19:53so I'll just say, can be made partially persistent.
20:00You have to pay something, but you have to pay very little--
20:03constant amortized factor overhead,
20:12multiplicative overhead and constant amount
20:21of additive space per change in the data structure.
20:30So every time you do a modification in your pointer
20:33machine-- you set one of the fields to something--
20:36you have to store that forever.
20:37So, I mean, this is the best you could hope to do.
20:39You've got to store everything that happened.
20:43You pay a constant factor overhead, eh.
20:45We're theoreticians.
20:46That doesn't matter.
20:48Then you get any data structure in this world
20:50can be made partially persistent.
20:53That's nice.
20:54Let's prove it.
21:00OK, the idea is pretty simple.
21:04Pointer machines are all about nodes and fields.
21:06So we just need to simulate whatever the data structure is
21:09doing to those nodes and fields in a way
21:11that we don't lose all the information
21:13and we can still search it very quickly.
21:17First idea is to store back pointers.
21:21And this is why we need the constant n degree constraint.
21:27So if we have a node--
21:31how do I want to draw a node here?
21:35So maybe these are the three fields of the node.
21:38I want to also store some back pointers.
21:42Whenever there is a node that points to this node,
21:48I want to have a back pointer that
21:50points back so I know where all the pointers came from.
21:54If there's only p pointers, then this is fine.
21:57There'll be p fields here.
22:00So still constant, still in the pointer machine model.
22:03OK, I'm going to need some other stuff too.
22:08So this is a simple thing, definitely want this.
22:11Because if my nodes ever move around,
22:13I've got to update the pointers to them.
22:15And where are those pointers?
22:16Well, the back pointers tell you where they are.
22:20Nodes will still be constant size,
22:22remain in pointer machine data structure.
22:25OK.
22:26That's idea one.
22:28Idea two is this part.
22:35This is going to store something called mods.
22:39It could stand for something, but I'll leave it as mods.
22:44So these are two of the fields of the data structure.
22:56Ah, one convenience here is for back pointers,
23:01I'm only going to store it for the latest version of the data
23:04structure.
23:16Sorry.
23:16I forgot about that.
23:19We'll come back to that later.
23:21And then the idea is to store these modifications.
23:23How many modifications?
23:25Let's say up to p, twice p.
23:34p was the bound on the n degree of a node.
23:38So I'm going to allow 2p modifications over here.
23:43And what's a modification look like?
23:48It's going to consist of three things--
23:50get them in the right order--
23:53the version in which something was changed,
23:56the field that got changed, and the value it got changed to.
24:02So the idea is that these are the fields here.
24:07We're not going to touch those.
24:09Once they're set to something--
24:11or, I mean, whatever they are initially,
24:13they will stay that way.
24:15And so instead of actually changing things like the data
24:18structure normally would, we're just
24:19going to add modifications here to say, oh,
24:21well at this time, this field changed to the value of 5.
24:25And then later on, it changed to the value 7.
24:27And then later on, this one changed to the value 23,
24:31whatever.
24:32So that's what they'll look like.
24:36There's a limit to how many--
24:37we can only store a constant number of mods to each node.
24:40And our constant will be 2p.
24:44OK.
24:45Those are the ideas, and now it's
24:46just a matter of making this all work
24:47and analyzing that it's constant amortized overhead.
24:53So first thing is if you want to read a field,
25:06how would I read a field?
25:08This is really easy.
25:11First you look at what the field is in the node itself.
25:15But then it might have been changed.
25:17And so remember when I say read the field,
25:19I actually mean while I'm given some version,
25:21v, I want to know what is the value of this field at version
25:24v, because I want to be able to look at any of the old data
25:28structures too.
25:29So this would be at version v. I just
25:37look through all the modifications.
25:38There's constantly many, so it takes constant time
25:40to just flip through them and say, well, what changes
25:43have happened up to version v?
25:46So I look at mods with version less than
25:56or equal to v. That will be all the changes that happened up
25:59to this point.
26:00I see, did this field change?
26:02I look at the latest one.
26:03That will be how I read the field of the node, so
26:07constant time.
26:08There's lots of ways to make this efficient in practice.
26:10But for our purposes, it doesn't matter.
26:13It's constant.
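In code, the read might look like this sketch (continuing the toy Python from above, with each node assumed to also carry a mods list of (version, field, value) triples; in partial persistence a version is just an integer):

```python
def read(node, field, v):
    # Start from the value the field had when the node was created,
    # then apply the latest mod to this field with version <= v.
    value = node.fields[field]
    best = -1
    for ver, f, val in node.mods:   # at most 2p mods: constant time
        if f == field and best < ver <= v:
            value, best = val, ver
    return value
```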
26:16The hard part is how do you change a field?
26:18Because there might not be any room in the mod structure.
26:35So to modify, say we want to set node.field equal to x.
26:47What we do is first we check, is there
26:52any space in the mod structure?
26:54If there's any blank mods, so if the node is not full,
27:03we just add a mod.
27:06So a mod will look like (now, field, x).
27:13Just throw that in there.
27:15Because right at this moment--
27:17so we maintain a time counter, just increment
27:19it every time we do a change.
27:21This field changed that value.
27:23So that's the easy case.
27:24The trouble, of course, is if the node is full--
27:29the moment you've all been waiting for.
27:31So what we're going to do here is make a new node.
27:33We've run out of space.
27:34So we need to make a new node.
27:36We're not going to touch the old node, just going to let it sit.
27:38It still maintains all those old versions.
27:40Now we want a new node that represents the latest
27:43and greatest of this node.
27:44OK.
27:45So make a new node.
27:51I'll call it node prime to distinguish it from node,
27:57with all the mods, and this modification in particular,
28:04applied.
28:07OK, so we make a new version of this node.
28:11It's going to have some different fields, whatever
28:15was the latest version represented by those mods.
28:18It's still going to have back pointers,
28:20so we have to maintain all those back pointers.
28:24And now the mod, initially, is going
28:26to be empty, because we just applied them all.
28:29So this new node doesn't have any recent mods.
28:32Old node represents the old versions.
28:34This node is going to represent the new versions.
28:37What's wrong with this picture?
28:39AUDIENCE: Update pointers.
28:40ERIK DEMAINE: Update pointers.
28:42Yeah, there's pointers to the old version
28:43of the node, which are fine for the old versions of the data
28:47structure.
28:48But for the latest version of the data structure,
28:50this node has moved to this new location.
28:53So if there are any old pointers to that node,
28:56we've got to update them in the current version.
28:58We have to update them to point to this node instead.
29:00The old versions are fine, but the new version is in trouble.
29:04Other questions or all the same answer?
29:06Yeah.
29:06AUDIENCE: So if you wanted to read an old version
29:10but you just have the new version, [INAUDIBLE]?
29:15ERIK DEMAINE: OK--
29:16AUDIENCE: [INAUDIBLE]
29:17ERIK DEMAINE: The question is essentially,
29:19how do we hold on to versions?
29:22Essentially, you can think of a version of the data structure
29:24as where the root node is.
29:26That's probably the easiest.
29:27I mean, in general, we're representing versions
29:29by a number, v. But we always start at the root.
29:33And so you've given the data structure,
29:35which is represented by the root node.
29:36And you say, search for the value 5.
29:40Is it in this binary search tree or whatever?
29:43And then you just start navigating from the root,
29:45but you know I'm inversion a million or whatever.
29:49I know what version I'm looking for.
29:51So you start with the root, which never changes, let's say.
29:56And then you follow pointers that
29:58essentially tell you for that version
30:00where you should be going.
30:01I guess at the root version, it's a little trickier.
30:03You probably want a little array that says for this version,
30:07here's the root node.
30:08But that's a special case.
30:11Yeah.
30:11Another question?
30:13AUDIENCE: So on the new node that you
30:15created, the fields that you copied, you also have to have
30:19a version for them, right?
30:20Because [INAUDIBLE]?
30:26ERIK DEMAINE: These--
30:27AUDIENCE: Or do you version the whole node?
30:30ERIK DEMAINE: Here we're versioning the whole node.
30:32The original field values represent
30:34what was originally there, whenever this node was created.
30:37Then the mods specify what time the fields change.
30:40So I don't think we need times here.
30:44All right.
30:45So we've got to update two kinds of pointers.
30:47There's regular pointers, which live
30:48in the fields, which are things pointing to the node.
30:52But then there's also back pointers.
30:53Because if this is a pointer to a node,
30:55then there'll be a back pointer back to the node.
30:57And all of those have to change.
31:00Conveniently, the back pointers are easy.
31:11So if they're back pointers to the node,
31:13we change them to the node prime.
31:14How do we find the back pointers?
31:16Well, we just follow all the pointers
31:17and then there will be back pointers there.
31:21Because I said we're only maintaining
31:23backed pointers for the latest version,
31:25I don't need to preserve the old versions
31:28of those backed pointers.
31:29So I just go in and I change them.
31:31It takes constant time, because the constant number
31:33of things I point to, each one as a back pointer.
31:35So this is cheap.
31:37There's no persistence here.
31:39That's an advantage of partial persistence.
31:41The hard part is updating the pointers
31:44because those live in fields.
31:45I need to remember the old versions of those fields.
31:47And that we do recursively.
31:58Because to change those pointers,
32:00that's a field update.
32:01That's something exactly of this form.
32:02So that's the same operation but on a different node.
32:05So I just do that.
32:07I claim this is good.
32:08That's the end of the algorithm.
32:11Now we need to analyze it.
32:24How do we analyze it?
32:25Any guesses?
32:29AUDIENCE: Amortize it.
32:30ERIK DEMAINE: Amortized analysis, exactly
32:32the answer I was looking for.
32:33OK.
32:34[INAUDIBLE] amortization.
32:36The most powerful technique in amortization
32:38is probably the potential method.
32:40So we're going to use that.
32:42There's a sort of more--
32:44you'll see a charging argument in a moment.
32:50We want the potential function to represent when this data
32:53structure is in a bad state.
32:55Intuitively, it's in a bad state when a lot of nodes are full.
32:58Because then as soon as you make a change in them,
33:00they will burst, and you have to do all this crazy recursion
33:03and stuff.
33:04This case is nice and cheap.
33:05We just add a modification, constant time.
33:08This case, not so nice because we recurse.
33:10And then that's going to cause more recursions
33:12and all sorts of chaos could happen.
33:16So there's probably a few different potential functions
33:20that would work here.
34:21An obvious choice for the potential would
34:23be the number of full nodes.
33:25But I think we can make life a little bit easier
33:27by the following.
33:32Basically, the total number of modifications--
33:36not quite the total, almost the total.
33:39So I'm going to do c times the sum of the number of mods
33:49in latest version nodes.
33:59OK.
34:00So because we sort of really only
34:02care about-- we're only changing the latest version,
34:05so I really only care about nodes that
34:07live in the latest version.
34:08What do I mean by this?
34:09Well, when I made this new node prime,
34:11this becomes the new representation of that node.
34:14The old version is dead.
34:15We will never change it again.
34:18If we're modifying, we will never even look at it again.
34:21Because now everything points to here.
34:24So I don't really care about that node.
34:26It's got a ton of mods.
34:27But what's nice is that when I create this new node, now
34:30the mod list is empty.
34:31So I start from scratch, just like reinstalling
34:33your operating system.
34:34It's a good feeling.
34:38And so the potential goes down by, I guess, c times 2 times p.
34:45When I do this change, potential goes down by basically p.
34:49AUDIENCE: Is c any constant or--
34:52ERIK DEMAINE: c will be a constant to be determined.
34:55I mean, it could be 1.
34:57It depends how you want to define it.
34:58I'm going to use the CLRS notion of amortized cost, which
35:02is actual cost plus change in potential.
35:06And then I need a constant here, because I'm
35:08measuring a running time versus some combinatorial quantity.
35:12So this will be to match the running time that we'll get to.
35:17OK.
35:17So what is amortized cost?
35:22There's sort of two cases modification.
35:24There's the cheap case and the not so cheap case.
35:28In general, amortized cost--
35:34in both cases, it's going to be at most--
35:37well, first of all, we do some constant work
35:39just to figure out all this stuff, make copies, whatever.
35:44So that's some constant time.
35:49That's the part that I don't want to try to measure.
35:52Then potentially, we add a new mod.
35:55If we add a mod, that increases the potential by c.
35:59Because we're just counting mods, multiplying by c.
36:02So we might get plus 1 mod.
36:04This is going to be an upper bound.
36:06We don't always add 1, but worst case, we always had 1,
36:09let's say.
36:11And then there's this annoying part.
36:14And this might happen, might not happen.
36:16So then there's a plus maybe.
36:20If this happens, we decrease the potential
36:23because we empty out the mods for that node in terms
36:26of the latest version.
36:27So then we get a negative 2cp, change in potential.
36:34And then we'd have to pay I guess up to p recursions.
36:49Because we have to--
36:51how many pointers are there to me?
36:53Well, at most p of them, because there are at most p pointers
36:58to any node.
37:02OK.
37:03This is kind of a weird--
37:05it's not exactly algebra here.
37:06I have this thing, recursions.
37:09But if you think about how this would expand,
37:11all right, this is constant time.
37:13That's good.
37:14And then if we do this--
37:15I'll put a question mark here.
37:16It might be here.
37:16It might not.
37:18If it's not here, fine, constant.
37:19If it is here, then this gets expanded into this thing.
37:24It's a weird way to write a recurrence.
37:26But we get p times whatever is in this right hand side.
37:30OK.
37:31But then there's this minus 2cp.
37:33So we're going to get p times 2c here.
37:36That's the initial cost.
37:37So that will cancel with this.
37:40And then we might get another recursion.
37:41But every time we get a recursion, all the terms
37:43cancel.
37:44So it doesn't matter whether this is here or not.
37:46You get 0, which is great.
37:49And you're left with the original 2c.
37:53Constant.
37:55OK.
37:56[INAUDIBLE] potential functions are always a little crazy.
37:59What's happening here is that, OK, maybe you add a mod.
38:03That's cheap.
38:05But when we have to do this work and we have to do this
38:08recursion-- this is up to 2p updates or recursions--
38:14we are charging it to the emptying of this node.
38:17The number of mods went from 2p down to 0.
38:21And so we're just charging this update cost
38:22to that modification.
38:24So if you like charging schemes, this is much more intuitive.
38:26But with charging schemes, it's always a little careful.
38:28You have to make sure you're not double charging.
38:30Here it's obvious that you're not double charging.
38:34Kind of cool and magical.
38:37This is a paper by Driscoll, Sarnak, Sleator,
38:42Tarjan from 1989.
38:43So it's very early days of amortization.
38:45But they knew how to do it.
38:47Question?
38:48AUDIENCE: [INAUDIBLE]
38:50ERIK DEMAINE: What happens if you overflow the root?
38:52Yeah, I never thought about the root before today.
38:54But I think the way to fix the root is
38:57just you have one big table that says, for a given version--
39:02I guess a simple way would be to say,
39:04not only is a version a number, but it's also
39:06a pointer to the root.
39:07There we go.
39:07Pointer machine.
39:09So that way you're just always explicitly maintaining
39:11the root copy or the pointer.
39:15Because otherwise, you're in trouble.
39:18AUDIENCE: Then can you go back to [INAUDIBLE].
39:21ERIK DEMAINE: So in order to refer to an old version,
39:24you have to have the pointer to that root node.
39:26If you want to do it just from a version number,
39:29look at the data structure.
39:30Just from a version number, you would
39:31need some kind of lookup table, which
39:33is outside the pointer machine.
39:34So you could do it in a real computer,
39:36but a pointer machine is not technically allowed.
39:39So it's slightly awkward.
39:40No arrays are allowed in pointer machines,
39:42in case that wasn't clear.
39:43Another question?
39:44AUDIENCE: [INAUDIBLE] constant space to store for [INAUDIBLE].
39:48And also, what if we have really big numbers [INAUDIBLE]?
39:54ERIK DEMAINE: In this model, in the pointer machine model,
39:56we're assuming that whatever the data is in the items
39:58take constant space each.
40:01If you want to know about bigger things in here,
40:03then refer to future lectures.
40:05This is time travel, after all.
40:06Just go to a future class and then come back.
40:09[LAUGHS] So we'll get there, but right now,
40:11we're not thinking about what's in here.
40:15Whatever big thing you're trying to store,
40:16you reduce it down to constant size things.
40:19And then you spread them around nodes of a pointer machine.
40:22How you do that, that's up to the data structure.
40:25We're just transforming the data structure to be persistent.
40:28OK, you could ask about other models than pointer machines,
40:30but we're going to stick to pointer machines here.
40:34All right.
40:36That was partial persistence.
40:38Let's do full persistence.
40:41That was too easy.
40:46Same paper does full persistence.
40:48That was just a warm-up.
40:50Full persistence is actually not that much harder.
40:55So let me tell you basically what changes.
41:04There are two issues.
41:05One is that everything here has to change and not by much.
41:09We're still going to use back pointers.
41:11We're still going to have the mods.
41:12The number of mods is going to be slightly different
41:15but basically the same.
41:16Back pointers no longer just refer to the latest version.
41:19We have to maintain back pointers in all versions.
41:21So that's annoying.
41:22But hey, that's life.
41:24The amortization, the potential function
41:25will change slightly but basically not much.
41:30Sort of the bigger issue you might first wonder about,
41:33and it's actually the most challenging technically,
41:35is versions are no longer numbers.
41:37Because it's not a line.
41:39Versions are nodes in a tree.
41:41You should probably call them vertices
41:42in a tree to distinguish them from nodes in the pointer
41:45machine.
41:46OK, so you've got this tree of versions.
41:48And then versions are just some point on that tree.
41:53This is annoying because we like lines.
41:57We don't like trees as much.
41:58So what we're going to do is linearize the tree.
42:04Like, when in doubt, cheat.
42:12How do we do this?
42:13With tree traversal.
42:15Imagine I'm going to draw a super complicated tree
42:18of versions.
42:19Say there are three versions.
42:21OK.
42:22I don't want to number them, because that would be
42:24kind of begging the question.
42:26So let's just call them x, y, and z.
42:33All right.
42:34I mean, it's a directed tree, because we
42:36have the older versions.
42:37This is like the original version.
42:38And we made a change.
42:39We made a different change on the same version.
42:42What I'd like to do is a traversal of that tree,
42:45like a regular, as if you're going to sort those nodes.
42:48Actually, let me use color, high def here.
42:53So here's our traversal of the tree.
42:59And I want to look at the first and the last time I
43:01visit each node.
43:02So here's the first time I visit x.
43:05So I'll write this is the beginning of x.
43:09Capital X. Then this is the first time I visit y,
43:13so it's beginning of y.
43:15And then this is the last time I visit y, so it's the end of y.
43:19And then, don't care.
43:20Then this is the beginning of z.
43:24And this is the end of z.
43:27And then this is the end x.
43:29If I write those sequentially, I get bxbyeybzez,
43:38because this is so easy, ex.
43:42OK, you can think of these as parentheses, right?
43:45For whatever reason I chose b and e for beginning and ending,
43:48but this is like open parens, close parens.
43:50This is easy to do in linear time.
43:52I think you all know how.
43:53Except it's not a static problem.
43:55Versions are changing all the time.
43:56We're adding versions.
43:57We're never deleting versions, but we're always
43:59adding stuff to here.
44:00It's a little awkward, but the idea
44:01is I want to maintain this order,
44:05maintain the begin and the end of each you
44:16might say subtree of versions.
44:23This string, from bx to ex, represents
44:25all of the stuff in x's subtree, in the rooted tree
44:29starting at x.
44:33How do I maintain that?
44:40Using a data structure.
44:56So we're going to use something, a data structure we haven't yet
45:00seen.
45:02It will be in lecture 8.
45:04This is a time travel data structure,
45:06so I'm allowed to do that.
45:10So order maintenance data structure.
45:14You can think of this as a magical linked list.
45:16Let me tell you what the magical linked list can do.
45:19You can insert--
45:22I'm going to call it an item, because node
45:24would be kind of confusing given where we are right now.
45:28You can insert a new item in the list immediately before
45:32or after a given item.
45:37OK.
45:37This is like a regular linked list.
45:41Here's a regular linked list.
45:44And if I'm given a particular item like this one,
45:48I can say, well, insert a new item right here.
45:51You say, OK.
45:51Fine.
45:52I'll just make a new node and relink here, relink there.
45:57Constant time, right?
45:58So in an order maintenance data structure,
46:00you can do this in constant time.
46:01Wow!
46:02So amazing.
46:05OK, catch is the second operation you can do.
46:08Maybe I'll number these.
46:09This is the update.
46:10Then there's the query.
46:13The query is, what is the relative order
46:17of two nodes, of two items?
46:24x and y.
46:27So now I give you this node and this node.
46:29And I say, which is to the left?
46:32Which is earlier in the order?
46:34I want to know, is x basically less than y
46:36in terms of the order in the list?
46:37Or is y less than x?
46:41And an order maintenance data structure
46:42can do this in constant time.
46:45Now it doesn't look like your mother's linked list, I guess.
46:50It's not the linked list you learned in school.
46:52It's a magical linked list that can somehow
46:54answer these queries.
46:55How?
46:56Go to lecture 7.
46:58OK.
46:59Forward reference, lecture 8, sorry.
47:03For now, we're just going to assume that this magical data
47:05structure exists.
47:06So in constant time, this is great.
47:09Because if we're maintaining these b's and e's, we
47:11want to maintain the order that these things appear in.
47:16If we want to create a new version,
47:17like suppose we were just creating version z,
47:20well, it used to be everything without this bz, ez.
47:23And we'd just insert two items in here, bz and ez.
47:27They're right next to each other.
47:28And if we were given version x, we could just say,
47:30oh, we'll look at ex and insert two items right before it.
47:34Or you can put them right after bx.
47:36I mean, there's no actual order here.
47:37So it could have been y first and then z or z first and then
47:40y.
47:42So it's really easy to add a new version in constant time.
47:44You just do two of these insert operations.
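A sketch of how a new version slots into this bracket sequence; OrderMaintenance and its method names (insert_first, insert_after) are my stand-ins for the magical list, not an interface given in the lecture:

```python
# Each version v owns two items, b_v and e_v, in the order-maintenance
# list, and a child's brackets nest inside its parent's.
class Version:
    def __init__(self, om, parent=None):
        if parent is None:
            self.b = om.insert_first()        # root: "b_root e_root"
            self.e = om.insert_after(self.b)
        else:
            # place "b_v e_v" immediately after the parent's b (putting
            # them just before the parent's e would be equally valid)
            self.b = om.insert_after(parent.b)
            self.e = om.insert_after(self.b)  # two O(1) inserts total
```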
47:47And now you have this magical order operation, which
47:50if I'm given two versions--
47:54I don't know, v and w--
47:56and I want to know is v an ancestor of w,
48:00now I can do it in constant time.
48:02So this lets me do a third operation, which is, is version
48:09v an ancestor of version w?
48:21Because that's going to be true if and only if bv
48:26is an ev nest around bw and ew.
48:39OK.
48:40So that's just three tests.
48:41They're probably not all even necessary.
48:43This one always holds.
48:45But if these guys fit in between these guys, then you know--
48:50now, what this tells us, what we care about here,
48:54is reading fields.
48:58When we read a field, we said, oh, we'll
49:00apply all the modifications that apply to version
49:02v. Before that, that was a linear order.
49:04So it's just all versions less than or equal to v. Now
49:06it's all versions that are ancestors of v. Given a mod,
49:10we need to know, does this mod apply to my version?
49:13And now I tell you, I can do that in constant time
49:16through magic.
49:17I just test these order relations.
49:20If they hold, then that mod applies to my version.
49:24So w's the version we're testing.
49:27v is some version in the mod.
49:29And I want to know, am descendant of that version?
49:32If so, the mod applies.
49:34And I update what the field is.
49:36I can do all pairwise ancestor checks and figure out,
49:39what is the most recent version in my ancestor history
49:43that modified a given field?
49:44That lets me read a field in constant time.
49:47Constants are getting kind of big at this point,
49:49but it can be done.
49:53Clear?
49:54A little bit of a black box here.
49:56But now we've gotten as far as reading.
50:01And we don't need to change much else.
50:04So this is good news
50:11Maybe I'll give you a bit of a diff.
50:15So full persistence, fully persistent theorem--
50:26done.
50:27OK.
50:27Same theorem just with full persistence.
50:30How do we do it?
50:31We store back pointers now for all versions.
50:35It's a little bit annoying.
50:36But how many mods do we use?
50:40There's lots of ways to get this to work,
50:42but I'm going to change this number
50:44to 2 times d plus p plus 1.
50:51Wait, what's d? d is the number of fields here.
50:56OK.
50:57We said it was constant number fields.
50:59I never said what that constant is. d for out degree, I guess.
51:03So p is the max in-degree. d is the max out-degree.
51:09So just slightly more-- that main reason for this
51:11is because back pointers now are treated like everyone else.
51:14We have to treat both the out pointers and the in pointers
51:17as basically the same.
51:18So instead of p, we have d plus p.
51:19And there's a plus 1 just for safety.
51:23It gets my amortization to work, hopefully.
51:28OK.
51:29Not much else-- this page is all the same.
51:32Mods are still, you give versions, fields, values,
51:35reading.
51:36OK, well, this is no longer less than or equal to v. But
51:41this is now with a version, sort of the nearest version, that's
51:47an ancestor of v.
51:50That's what we were just talking about.
51:52So that can be done in constant time.
51:54Check it for all of them, constant work.
51:57OK.
51:58That was the first part.
52:04Now we get to the hard part, which is modification.
52:07This is going to be different.
52:08Maybe you I should just erase--
52:10yeah, I think I'll erase everything,
52:13except the first clause.
52:24OK.
52:24If a node is not full, we'll just
52:26add a mod, just like before.
52:28What changes is when a node is full.
52:36Here we have to do something completely different.
52:38Why?
52:38Because if we just make a new version
52:41of this node that has empty mods, this one's still full.
52:45And I can keep modifying the same version.
52:48This new node that I just erased represents some new version.
52:52But if I keep modifying an old version, which
52:54I can do in full persistence, this node keeps being full.
52:57And I keep paying potentially huge cost.
53:00If all the nodes were full, and when I make this change
53:02every node gets copied, and then I
53:04make a change to the same version,
53:06every node gets copied again.
53:07This is going to take linear time per operation.
53:09So I can't do the old strategy.
53:11I need to somehow make this node less full.
53:15This is where we're definitely not functional.
53:17None of this was functional, but now I'm
53:19going to change an old node, not just make a new one in a more
53:24drastic way.
53:25Before I was adding a mod.
53:27That's not a functional operation.
53:28Now I'm actually going to remove mods from a node to rebalance.
53:33So what I'd like to do is split the node into two halves.
53:43OK.
53:43So I had some big node that was--
53:46I'll draw it-- completely full.
53:50Now I'm going to make two nodes.
53:52Here we go.
53:59This one is going to be half full.
54:01This one's going to be half full of mods.
54:04OK.
54:05The only question left is, what do I do with all these things?
54:12Basically what I'd like to do is have the--
54:14on the one hand, I want to have the old node.
54:18It's just where it used to be.
54:20I've just removed half of the mods, the second half,
54:23the later half.
54:25What does that mean?
54:26I don't know.
54:27Figure it out.
54:29It's linearized.
54:31I haven't thought deeply about that.
54:32Now we're going to make a new node with the second half
54:36of the mods.
54:40It's more painful than I thought.
54:41In reality, these mods represent a tree of modifications.
54:45And what you need to do is find a partition of that tree
54:48into two roughly equal halves.
54:51You can actually do a one third, 2/3 split.
54:52That's also in a future lecture, whose number I forget.
54:57So really, you're splitting this tree
54:58into two roughly balanced halves.
55:01And so this 2 might actually need to change to a 3,
55:03but it's a constant.
55:06OK.
55:07What I want is for this to represent
55:09a subtree of versions.
55:10Let me draw the picture.
55:11So here's a tree of versions represented by the old mods.
55:15I'd like to cut out a subtree rooted at some node.
55:18So let's just assume for now this has exactly
55:21half the nodes.
55:22And this has half the nodes.
55:25In reality, I think it can be one third, 2/3.
55:29OK.
55:29But let's keep it convenient.
55:32So I want the new node to represent
55:34this subtree and this node to represent everything else.
55:37This node is as if this stuff hasn't happened yet.
55:41I mean, so it represents all these old versions that do not,
55:44that are not in the subtree.
55:45This represents all the latest stuff.
55:47So what I'm going to do is like before, I
55:49want to apply some mods to these fields.
55:54And whatever mods were relevant at this point, whatever
55:58had been applied, I apply those to the fields here.
56:02And so that means I can remove all of these mods.
56:06I only cared about these ones.
56:09Update these fields accordingly.
56:11I still have the other mods to represent all the other changes
56:14that could be in that subtree.
56:16OK.
56:16So we actually split the tree, and we apply mods to new nodes.
56:38Anything else I need to say?
56:42Oh, now we need to update pointers.
56:44That's always the fun part.
56:49Let's go over here.
57:05So old node hasn't moved.
57:07But this new node has moved.
57:09So for all of these versions, the pointer
57:13that used to point to the old node
57:18should now point to the new node.
57:20In this version, it's fine.
57:21It should still point to old node,
57:23because this represents all those old versions.
57:25But for the new version, that version in the subtree,
57:28I've got to point here instead.
57:30OK.
57:31So how many pointers could there be to this node
57:37that need to change?
57:38That's a tricky part in this analysis.
57:41Think about it for a while.
57:45I mean, in this new node, whatever
57:47is pointed to by either here or here in the new node also has
57:50a return pointer.
57:50All pointers are bidirectional.
57:52So we don't really care about whether they're forward
57:54or backward.
57:54How many pointers are there here?
57:56Well, there's d here and there's p here.
57:59But then there's also some additional pointers
58:01represented over here.
58:02How many?
58:04Well, if we assume this magical 50/50 split,
58:06there's right now d plus p plus 1 mods over here, half of them.
58:12Each of them might be a pointer to some other place, which
58:16has a return pointer in that version.
58:18So number of back pointers that we need to update
58:23is going to be this, 2 times d plus 2 times p plus 1.
58:30So recursively update at most 2 times d plus 2 times p
58:41plus 1 pointers to the node.
58:50The good news is this is really only half of them
58:52or some fraction of them.
58:54It used to be--
58:57well, there were more pointers before.
58:59We don't have to deal with these ones.
59:00That's where we're saving, and that's
59:01why this amortization works.
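Putting the overflow case together as a rough sketch (pick_split, pointers_into, and write_pointer are hypothetical helpers, and I pretend the split is exactly 50/50 rather than the 1/3-2/3 the real structure needs):

```python
def overflow(node):
    # Split the tree of mods at some version s so each side holds
    # about half of them. Hypothetical helper.
    s, old_mods, subtree_mods = pick_split(node.mods)
    new = Node()
    for f in range(Node.NUM_FIELDS):
        # Bake into the fields everything applied as of version s,
        # so the new node starts from those values.
        new.fields[f] = read_full(om, node, f, s)
    new.mods = subtree_mods       # mods inside s's subtree of versions
    node.mods = old_mods          # old node keeps all the other versions
    # Redirect the <= 2d + 2p + 1 pointers (fields, back pointers, and
    # pointers sitting in the surviving mods) that should now see `new`
    # in versions descending from s. Each redirect is itself a field
    # change in some version, so it may recurse.
    for other, f, ver in pointers_into(node):
        if is_ancestor(om, s, ver):
            write_pointer(other, f, new, ver)    # hypothetical helper
```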
59:03Let me give you a potential function that makes this work--
59:12is minus c times sum of the number of empty mod slots.
59:23It's kind of the same potential but before
59:26we had this notion of dead and alive nodes.
59:28Now everything's alive because everything
59:30could change at any moment.
59:31So instead, I'm going to measure how much room I have
59:36in each node.
59:37Before I had no room in this node.
59:38Now I have half the space in both nodes.
59:41So that's good news.
59:44Whenever we have this recursion, we
59:48can charge it to a potential decrease.
59:56Phi goes down by--
1:00:01because I have a negative sign here--
1:00:03c times, oh man, 2 times d plus p plus 1, I think.
1:00:13Because there's d plus p plus 1 space here,
1:00:15d plus p plus 1 space here.
1:00:17I mean, we added one whole new node.
1:00:18And total capacity of a node in mods
1:00:20is 2 times d plus p plus 1.
1:00:23So we get that times c.
1:00:26And this is basically just enough,
1:00:28because this is 2 times d plus 2 times p plus 2.
1:00:32And here we have a plus 1.
1:00:34And so the recursion gets annihilated by 2 times d plus
1:00:392 times p plus 1.
1:00:41And then there's one c left over to absorb
1:00:43whatever constant cost there was to do all this other work.
1:00:47So I got the constants just to work,
1:00:51except that I cheated and it's really a one third, 2/3 split.
1:00:54So probably all of these constants have to change,
1:00:57such is life.
1:00:58But I think you get the idea.
1:01:01Any questions about full persistence?
1:01:07This is fun stuff, time travel.
1:01:10Yeah?
1:01:11AUDIENCE: So in the first half of the thing where
1:01:14the if, there's room you can put it in.
1:01:16ERIK DEMAINE: Right.
1:01:17AUDIENCE: I have a question about how
1:01:17we represent the version.
1:01:19Because before when we said restore now [INAUDIBLE].
1:01:23It made more sense if now was like a timestamp or something.
1:01:25ERIK DEMAINE: OK.
1:01:26Right, so how do we represent a version even here or anywhere?
1:01:31When we do a modification, an update, in the data structure,
1:01:34we want to return the new version.
1:01:36Basically, we're going to actually store
1:01:39the DAG of versions.
1:01:41And a version is going to be represented by a pointer
1:01:43into this DAG.
1:01:44One of the nodes in this DAG becomes a version.
1:01:47Every node in this DAG is going to store a pointer
1:01:50to the corresponding b character and a corresponding e character
1:01:53in this data structure, which then
1:01:56lets you do anything you want.
1:01:57Then you can query against that version,
1:01:59whether it's an ancestor of another version.
1:02:01So yeah, I didn't mention that.
1:02:02Versions are nodes in here.
1:02:04Nodes in here have pointers to the b's and e's.
1:02:06And vice versa, the b's and e's have pointers back
1:02:08to the corresponding version node.
1:02:10And then you can keep track of everything.
1:02:12Good question.
1:02:14Yeah?
1:02:15AUDIENCE: [INAUDIBLE] question.
1:02:16Remind me what d is in this.
1:02:17ERIK DEMAINE: Oh, d was the maximum out degree.
1:02:19It's the number of fields in a node, as defined right here.
1:02:26Other questions?
1:02:29Whew.
1:02:30OK, a little breather.
1:02:31That was partial persistence, full persistence.
1:02:33This is, unfortunately, the end of the really good results.
1:02:36As long as we have constant degree nodes,
1:02:38in and out degree, we can do all.
1:02:41We can do for persistence for free.
1:02:44Obviously there are practical constants involved here.
1:02:47But in theory, you can do this perfectly.
1:02:53Before we go on to confluence, there
1:02:54is one positive result, which is what if you
1:02:58don't like amortize bounds.
1:03:00There are various reasons amortize bounds might not
1:03:02be good.
1:03:03Maybe you really care about every operation
1:03:04being no slower than it was except by a constant factor.
1:03:08We're amortizing here, so some operations get really slow.
1:03:11But the others are all fast to compensate.
1:03:14You can deamortize, it's called.
1:03:22You can get constant worst case slowdown
1:03:30for partial persistence.
1:03:36This is a result of Gerth Brodal from the late '90s, '97.
1:03:44For full persistence-- so it's an open problem.
1:03:47I don't know if people have worked on that.
1:03:55All right.
1:03:56So some, mostly good results.
1:03:59Let's move on to confluent persistence where things
1:04:01get a lot more challenging.
1:04:17Lots of things go out the window with confluent persistence.
1:04:20In particular, your versions are now a DAG.
1:04:23It's a lot harder to linearize a DAG.
1:04:25Trees are not that far from paths.
1:04:28But DAGs are quite far from paths, unfortunately.
1:04:33But that's not all that goes wrong.
1:04:44Let me first tell you the kind of end effect as a user.
1:04:50Imagine you have a data structure.
1:04:54Think of it as a list, I guess, which
1:04:57is a list of characters in your document.
1:04:59You're using vi or Word, your favorite, whatever.
1:05:03It's a text editor.
1:05:05You've got a string of words.
1:05:06And now you like to do things like copy and paste.
1:05:09It's a nice operation.
1:05:11So you select an interval of the string and you copy it.
1:05:16And then you paste it somewhere else.
1:05:18So now you've got two copies of that string.
1:05:21This is, in some sense, what you might
1:05:24call a confluent operation, because--
1:05:27yeah, maybe a cleaner way to think of it is the following.
1:05:30You have your string.
1:05:31Now I have an operation, which is split it.
1:05:33So now I have two strings.
1:05:35OK.
1:05:36And now I have an operation, which is split it.
1:05:38Now I have three strings.
1:05:40OK.
1:05:41Now I have an operation which is concatenate.
1:05:44So I can, for example, reconstruct
1:05:47the original string-- actually, I have the original string.
1:05:49No biggie.
1:05:51Let's say-- because I have all versions.
1:05:54I never lose them.
1:05:55So now instead, I'm going to cut the string here, let's say.
1:05:59So now I have this and this.
1:06:03And now I can do things like concatenate
1:06:06from here to here to here.
1:06:10And I will get this plus this plus this.
1:06:16OK.
1:06:17This guy moved here.
1:06:18So that's a copy/paste operation with a constant number
1:06:20of splits and concatenates.
1:06:22I could also do cut and paste.
1:06:23With confluence, I can do crazy cuts and pastes
1:06:26in all sorts of ways.
1:06:28So what?
1:06:29Well, the so what is I can actually
1:06:32double the size of my data structure
1:06:33in a constant number of operations.
1:06:36I can take, for example, the entire string
1:06:38and concatenate it to itself.
1:06:40That will double the number of characters,
1:06:41number of elements in there.
1:06:43I can do that again and again and again.
1:06:45So in u updates, I can potentially
1:06:51get a data structure size 2 to the u.
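Here is a toy Python illustration of that blow-up, using plain lists with no persistence machinery at all; it only demonstrates the counting, not the data structure.

```python
# u self-concatenations double the size each time: final size is 3 * 2^u.
versions = [list("abc")]        # keep every version, as a persistent user would
u = 10
for _ in range(u):
    latest = versions[-1]
    versions.append(latest + latest)   # concatenate a version with itself

print(len(versions[-1]))        # 3072, i.e. 3 * 2**10
```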
1:06:57Kind of nifty.
1:06:58I think this is why confluence is cool.
1:07:00It's also why it's hard.
1:07:02So not a big surprise.
1:07:03But, here we go.
1:07:08In that case, the version DAG, for reference, looks like this.
1:07:13You're taking the same version, combining it.
1:07:16So here I'm assuming I have a concatenate operation.
1:07:20And so the effect here, every time I do this,
1:07:24I double the size.
1:07:44All right.
1:07:44What do I want to say about confluent persistence?
1:07:46All right.
1:07:47Let me start with the most general result, which
1:07:53is by Fiat and Kaplan in 2003.
1:08:04They define a notion called effective depth of a version.
1:08:08Let me just write it down.
1:08:21It's kind of like if you took this DAG
1:08:24and expanded it out to be a tree of all possible paths.
1:08:30Instead of pointing to the same node,
1:08:31you could just duplicate that node
1:08:33and then have pointers left and right.
1:08:35OK.
1:08:35So if I did that, of course, this size grows exponentially.
1:08:38It explicitly represents the size of my data structure.
1:08:41At the bottom, if I have u things,
1:08:42I'm going to have 2 to the u leaves at the bottom.
1:08:45But then I can easily measure the number of paths
1:08:49from the root to the same version.
1:08:50At the bottom, I still label it, oh, those
1:08:52are all v. They're all the same version down there.
1:08:54So exponential number of paths, if I take log,
1:08:56I get what I call effective depth.
1:08:58It's like if you somehow could rebalance that tree,
1:09:02this is the best you could hope to do.
1:09:05It's not really a lower bound.
1:09:07But it's a number.
1:09:08It's a thing.
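Here is a small Python sketch of that count (hypothetical names; the paper's exact definition differs in small details like the additive constant): the number of root-to-v paths via dynamic programming, with its log as the effective depth.

```python
from functools import lru_cache
from math import log2

# Hypothetical version DAG for the doubling example: each update
# concatenates the previous version with itself, so it has two
# parent edges to the same version.
parents = {
    "root": [],
    "v1": ["root", "root"],
    "v2": ["v1", "v1"],
}

@lru_cache(maxsize=None)
def num_paths(v):
    """Number of distinct root-to-v paths in the version DAG."""
    ps = parents[v]
    return 1 if not ps else sum(num_paths(p) for p in ps)

def effective_depth(v):
    # e(v) = 1 + lg(#paths): the depth v would have if the DAG were
    # expanded into the tree of all paths and then perfectly rebalanced.
    return 1 + log2(num_paths(v))

print(num_paths("v2"), effective_depth("v2"))   # 4 paths, e(v2) = 3.0
```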
1:09:09OK.
1:09:10Then the result they achieve is that the overhead is
1:09:17log the number of updates plus-- this
1:09:19is a multiplicative overhead, so you take your running time.
1:09:22You multiply it by this.
1:09:25And this is a time and a space overhead.
1:09:31So maximum effective depth of all versions, maybe even
1:09:34sum of effective depths, but we'll just say max to be safe.
1:09:39Sorry-- sum over all the operations.
1:09:41This is per operation.
1:09:43You pay basically the effective depth
1:09:44of that operation as a factor.
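In symbols, hedged from the spoken description (u is the number of updates, e(v) the effective depth of the version v an operation touches):

```latex
\text{overhead}(v) = O\bigl(\log u + e(v)\bigr)
\quad\text{multiplicative, in both time and space.}
```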
1:09:48Now, the annoying thing is if you have this kind of set up
1:09:51where the size grew exponentially,
1:09:54then number of paths is exponential.
1:09:56Log of the number of paths is linear in u.
1:09:59And so this factor could be as much as u, linear slowdown.
1:10:06Now, Fiat and Kaplan argue linear slowdown is not
1:10:08that bad, because if you weren't even persistent, if you did
1:10:13this in the naive way of just recopying the data,
1:10:18you were actually spending exponential time to build
1:10:21the final data structure.
1:10:22It has exponential size.
1:12:23Just to represent it explicitly requires exponential time,
1:12:26so losing a linear factor means u operations
1:12:29take u squared time instead of 2 to the u.
1:12:31So it's a big improvement to do this.
1:10:35The downside of this approach is that even if you have a version
1:10:40DAG that looks like this, even if the size of the data
1:10:43structure is staying normal, staying linear--
1:10:46potentially you could be doubling the size,
1:10:48but we don't know what this merge operation is.
1:10:49Maybe it just throws away one of the versions
1:10:51or does something--
1:10:53somehow takes half the nodes from one
1:10:55side, half the nodes from the other side maybe.
1:10:57If these operations do preserve size,
1:10:58then there's no great reason why there should be a linear slowdown,
1:11:02but there is.
1:11:03OK?
1:11:04So it's all right but not great.
1:11:10And it's the best general result we know.
1:11:13They also prove a lower bound.
1:11:21So the lower bound is the sum of effective depths, e(v), in total bits of space.
1:11:37OK.
1:11:37What does this mean?
1:11:40So even if this is not happening,
1:11:42the number of bits of space you need
1:11:44in the worst case-- this does not
1:11:45apply to every data structure.
1:11:47That's one catch.
1:11:49They give a specific data structure
1:11:52where you need this much space.
1:11:53So it's similar to this kind of picture.
1:11:57We won't go into the details.
1:11:58And you need this much space.
1:12:00Now, this is kind of bad, because if there's
1:12:02u operations, and each of these is u, that's u squared space.
1:12:06So we actually need a factor u blowup in space,
1:12:09it looks like.
1:12:11But to be more precise, what this means is
1:12:14that you need omega e of v space, and therefore
1:12:17time overhead per update, if--
1:12:27this is not written in the paper--
1:12:29queries are free.
1:12:35Implicit here, they only charge the slowdown and space increase
1:12:40to the updates you do, which is pretty natural.
1:12:43Normally you think of queries as not increasing space.
1:12:46But in order to construct this lower bound,
1:12:49they actually do this many queries.
1:12:52So they do e of v queries and then one update.
1:12:55And they say, oh well, space had to go up by an extra e of v.
1:12:59So if you only charge updates for the space,
1:13:02then yes, you have to lose potentially
1:13:04a linear factor, this effective depth, potentially u.
1:13:07But if you also charge the queries,
1:13:09it's still constant in their example.
1:13:13So open question, for confluent persistence,
1:13:18can you achieve constant everything?
1:13:21Constant time and space overheads,
1:13:27multiplicative factor per operation,
1:13:33both updates and queries.
1:13:35So if you charge the queries, potentially you
1:13:37could get constant everything.
1:13:38This is a relatively new realization.
1:13:43And no one knows how to do this yet.
1:13:47Nice challenge.
1:13:47I think maybe we'll work on that in our first problem session.
1:13:50I would like to.
1:13:53Questions about that result?
1:13:54I'm not going to prove the result.
1:13:56But it is a fancy rebalancing of those kinds
1:13:59of pictures to get this log.
1:14:10There are other results I'd like to tell you about.
1:14:32So brand new result--
1:14:34that was from 2003.
1:14:35This is from 2012--
1:14:38no, '11, '11, sorry.
1:14:42It's SODA, which is in January, so it's a little confusing.
1:14:47Is it '11?
1:14:49Maybe '12.
1:14:50Actually now I'm not sure.
1:14:51It's February already, right?
1:14:54In January, either this year or last year.
1:15:00It's not as general a transformation.
1:15:02It's only going to hold in what's called a disjoint case.
1:15:05But it gets a very good bound--
1:15:07not quite constant, but logarithmic.
1:15:09OK, logarithmic would also be nice.
1:15:12Log n, whatever n is.
1:15:17Pick your favorite n; the number of operations, say.
1:15:22OK.
1:15:25If you assume that confluent operations are performed only
1:15:39on two versions with no shared nodes--
1:15:50OK, this would be a way to forbid this kind of behavior
1:15:53where I concatenate the data structure with itself.
1:15:56All the nodes are common.
1:15:58If I guarantee that I slice this up,
1:16:01dice it, whatever, and then re-merge the pieces
1:16:03in some other order, but I never use two copies
1:16:06of the same piece, that would be a valid confluent
1:16:10operation over here.
1:16:12This is quite a strong restriction.
1:16:16If you try to violate it, who knows what happens.
1:16:19Behavior's undefined.
1:16:19The structure won't tell you, oh, those two versions
1:16:21have this node in common,
1:16:22you've got to make a second copy of it.
1:16:24So somehow you have to guarantee that confluent operations
1:16:27never overlap.
1:16:29But they can be reordered.
1:16:33Then you can get order log n overhead.
1:16:39n is the number of operations.
1:16:45I have a sketch of a proof of this
1:16:46but not very much time to talk about it.
1:16:48All right.
1:16:49Let me give you a quick picture.
1:16:51In general, the versions form a DAG.
1:16:55But if you make this assumption, and you look at a single node,
1:17:00and look at all the versions where that node appears,
1:17:03that is a tree.
1:17:05Because you're not allowed to remerge versions
1:17:07that have the same node.
1:17:08So while the big picture is a DAG,
1:17:11the small picture of a single guy is some tree.
1:17:17I'm drawing all these wiggly lines
1:17:18because there are all these versions where
1:17:20the node isn't changing.
1:17:21This is the entire version DAG.
1:17:23And then some of these nodes--
1:17:26some of these versions, I should say--
1:17:29that node that we're thinking about changes.
1:17:31OK, whenever it branches, it's probably
1:17:33because the actual node changed, maybe.
1:17:36I don't know.
1:17:37Anyway, there are some dots here where the node changed,
1:17:40some at the leaves, maybe.
1:17:41Maybe some of them haven't changed yet.
1:17:44In fact, let's see.
1:17:48Here where it's changed, it could be that we destroyed the node.
1:17:51Maybe it's gone from the actual data structure.
1:17:54But there still may be versions down here.
1:17:56It's not really a tree.
1:17:57It's a whole DAG of stuff down there.
1:17:59So that's kind of ugly.
1:18:01Wherever the node still exists,
1:18:03I guess that is an actual leaf of the DAG.
1:18:05So those are OK.
1:18:06But as soon as I maybe delete that node,
1:18:08then there can be a whole subtree down there.
1:18:11OK.
1:18:12So now if you look at an arbitrary version,
1:18:15so what we're thinking about is how to implement reading,
1:18:17let's say.
1:18:18Reading and writing are more or less the same.
1:18:21I give you a version.
1:18:22I give you a node, and I give you a field.
1:18:23I want to know, what is the value of that field, that node,
1:18:26that version?
1:18:27So now where could a version fall?
1:18:30Well it has to be in this subtree.
1:18:31Because the node has to exist.
1:18:36And then it's maybe a pointer.
1:18:38A pointer could be to another node, which
1:18:42also has this kind of picture.
1:18:44They could be overlapping trees.
1:18:46In general, there are three cases.
1:18:48Either you're lucky, and the version you're talking about
1:18:51is a version where the node was changed.
1:18:53In that case, the data is just stored right there.
1:18:58That's easy.
1:18:59So you could just say, oh, how did the node change?
1:19:01Oh, that's what the field is.
1:19:02OK, follow the pointer.
1:19:05A slightly harder case is a version
1:19:08in between two such changes.
1:19:09And maybe these are not updates.
1:19:11So I sort of want to know, what was the previous version where
1:19:17this node changed in constant time?
1:19:21It can be done.
1:19:22Not constant time, actually, logarithmic time,
1:19:25using a data structure called link-cut trees,
1:19:28another fun black box for now, which
1:19:31we will cover in lecture 19, far in the future.
1:19:36OK.
1:19:39Well, that's one case.
1:19:40There's also the case where maybe a version
1:19:43is down here in a subtree.
1:19:45I guess then the node didn't exist.
1:19:48Well, all these things can happen.
1:19:50And that's even harder.
1:19:51It's messy.
1:19:53They use another trick, which is called fractional cascading,
1:19:59which I'm not even going to try to describe what it means.
1:20:02But it's got a very cool name.
1:20:04Because we'll be covering it in lecture 3.
1:20:06So stay tuned for that.
1:20:07I'm not going to say how it applies to this setting,
1:20:09but it's a necessary step in here.
1:20:13In the remaining zero minutes, let
1:20:15me tell you a little bit about functional data structures.
1:20:17[LAUGHTER]
1:20:20Beauty of time travel.
1:20:24Functional-- I just want to give you
1:20:31some examples of things that can be done functionally.
1:20:33There's a whole book about functional data structures
1:20:35by Okasaki.
1:20:36It's pretty cool.
1:20:38A simple example is balanced BSTs.
1:20:42So if you just want to get log n time for everything,
1:20:44you can do that functionally.
1:20:45It's actually really easy.
1:20:46You pick your favorite balanced BST, like red-black trees.
1:20:48You implement it top down so you never follow parent pointers.
1:20:51So you don't need parent pointers.
1:20:52So then as you make changes down the tree, you just copy.
1:20:57It's called path copying.
1:20:58Whenever you're about to make a change,
1:21:00make a copy of that node.
1:21:02So you end up copying all the changed nodes and all
1:21:05their ancestors.
1:21:06There's only log n of them, so it takes log n time.
1:21:09Clear?
1:21:10Easy.
1:21:11It's a nice technique.
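A minimal Python sketch of path copying, using an unbalanced BST for brevity; a real implementation would use a top-down balanced tree such as a red-black tree so that the copied path has length O(log n).

```python
class Node:
    """Immutable-by-convention BST node; old versions are never mutated."""
    __slots__ = ("key", "left", "right")
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def insert(root, key):
    """Path copying: copy each node on the search path, share the rest,
    and return the root of the new version."""
    if root is None:
        return Node(key)
    if key < root.key:
        return Node(root.key, insert(root.left, key), root.right)
    return Node(root.key, root.left, insert(root.right, key))

# Each insert returns a new version; earlier versions keep working.
v1 = insert(None, 5)
v2 = insert(v1, 3)
v3 = insert(v2, 8)
assert v1.left is None            # v1 is unaffected by later inserts
assert v3.left.key == 3 and v3.right.key == 8
```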
1:21:12Sometimes path copying is very useful.
1:21:14Like link-cut trees, for example,
1:21:16can be made functional.
1:21:17We don't know what they are, but they're basically a BST.
1:21:19And you can make them functional.
1:21:21We use that in a paper.
1:21:23All right.
1:21:23Deques.
1:21:25These are doubly ended queues.
1:21:27So it's like a stack and a queue and everything.
1:21:29You can insert and delete from the beginning and the end.
1:21:32People start to know what these are now,
1:21:34because Python calls him that.
1:21:35But you can also do concatenation
1:21:41with deques in constant time per operation.
1:21:43This is cool.
1:21:44Deques are not very hard to make functional.
1:21:46But you can do deques and you can concatenate them
1:21:48like we were doing in the figure that's right behind this board.
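For flavor only, here is a classic purely functional FIFO queue in Python, built from cons lists as nested pairs; this is much weaker than the catenable deques just mentioned (no concatenation, and the occasional reversal makes pop amortized rather than worst-case constant), and all names are made up.

```python
def cons(x, xs):
    return (x, xs)                 # a cons list: (head, tail) or None

def rev(xs):
    acc = None
    while xs is not None:
        acc, xs = cons(xs[0], acc), xs[1]
    return acc

def empty():
    return (None, None)            # (front, back) pair of cons lists

def push(q, x):
    front, back = q
    return (front, cons(x, back))  # O(1): old version q is untouched

def pop(q):
    front, back = q
    if front is None:
        front, back = rev(back), None   # occasional O(n) reversal
    x, rest = front
    return x, (rest, back)

q = empty()
for i in range(3):
    q = push(q, i)
x, q = pop(q)
assert x == 0                      # FIFO order; every old version persists
```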
1:21:51Constant time split is a little harder.
1:21:53That's actually one of my open problems.
1:21:56Can you do lists with split and concatenate in constant time--
1:22:01functionally or confluently, persistently, or whatever?
1:22:05You can also do a mix of the two.
1:22:08You can get log n search and constant-time deque operations.
1:22:12Another example is you can do tries.
1:22:14So a trie is a tree with a fixed topology.
1:22:17Think of it as a directory tree.
1:22:20So maybe you're using Subversion.
1:22:21Subversion has time travel operations.
1:22:23You can copy an entire subtree from one version
1:22:26and stick it into a new version, another version.
1:22:30So you get a version DAG.
1:22:32It's a confluently persistent data structure--
1:22:34not implemented optimally, because we don't necessarily
1:22:37know how.
1:22:38But there is one paper.
1:22:40This actually came from the open problem section of this class
1:22:43four years ago, I think.
1:22:45It's with Eric Price and Stefan Langerman.
1:22:49You can get very good results.
1:22:50I won't write them down because it takes a while.
1:22:52Basically a factor of log the degree of the nodes
1:22:56to get functional, and you can be even fancier
1:22:59and get slightly better bounds, like log log the degree,
1:23:02confluently persistent, with various tricks,
1:23:05including using all of these data structures.
1:23:07So if you want to implement Subversion optimally,
1:23:09that is known how to be done but hasn't actually been done yet.
1:23:14Because there are those pesky constant factors.
1:23:18I think that's all.
1:23:19What is known about functional is there's a log n separation.
1:23:23You can be log n away from the best.
1:23:26That's the worst separation known,
1:23:30between functional and just a regular old data structure.
1:23:33It'd be nice to improve that.
1:23:34Lots of open problems here.
1:23:35Maybe we'll work on them next time.