The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.
ERIK DEMAINE: Welcome to 6.851, Advanced Data Structures. I am Erik Demaine. You can call me Erik. We have two TAs, Tom Morgan and Justin Zhang. Tom's back there. Justin is late.

And this class is about all kinds of very cool data structures. You should have already seen basic data structures like balanced binary search trees and things like that: log n time to do whatever you want in one dimension. And here we're going to turn all those data structures on their head and consider them in all sorts of different models, and look at additional cool problems.

Today we're going to talk about time travel, or temporal data structures, where you're manipulating time, as any good time traveler should. Then we'll do geometry, where we have higher-dimensional data, more than one dimension. Then we'll look at a problem called dynamic optimality, which is: is there one best binary search tree that rules them all? Then we'll look at something called memory hierarchy, which is a way to model more realistic computers, which have cache, and then more cache, and then main memory, and then disk, and all these different levels. How do you optimize for that? Hashing is probably the most famous, most popular, most used data structure in computer science. We'll do a little bit on that. Integers: when you know that your data is integers, and not just arbitrary black boxes that you can compare or do whatever with, you can do a lot better. You usually beat log n time; often you can get constant time. For example, if you want to do priority queues, you can do square root of log log n time. That's the best known, randomized. Dynamic graphs: you have a graph you want to store, and the edges are being added and maybe deleted, like you're representing a social network and people are friending and de-friending. You want to maintain some interesting information about that graph. Strings: you have a piece of text, such as the entire World Wide Web, and you want to search for a substring. How do you do that efficiently? It's sort of the Google problem. Or you're searching through DNA for patterns, whatever. And finally, succinct data structures, which is all about taking what we normally consider optimal space, or n space, and reducing it down to the very bare minimum of bits of space. Usually, if you want to store something where there are 2 to the n possibilities, you want to get away with n bits of space, maybe plus square root of n or something very tiny. So that's succinct data structures.

So that's an overview of the entire class. And these are sort of the sections we'll be following.
Let me give you a quick administrative overview of what we're doing. Requirements for the class: I guess, first, attending lecture. Obviously if you don't attend lecture, there'll be videos online, so that's resolvable. But let me know if you're not going to make it.

We're going to have problem sets roughly every week. If you're taking the class for credit, they have a very simple rule of one page in, one page out. This is more a constraint on us to write problems that have easy or short answers. You probably need to think about them a little bit before they're transparent, but then they're easy to write up.

And then scribing lectures. So we have a scribe for today, I hope. Here? Yes, good. Most of the lectures have already been scribed in some version, and your goal is to revise those scribe notes so that, if you don't like the handwritten notes, which are also online, there's something easier for people to read.

Let's see. Listeners welcome. We're going to have an open problem session. I really like open problems. I really like solving open problems. So we've done this every time this class has been offered. If you're interested in also solving open problems, it's optional. In a couple of weeks, I'll organize a weekly open problem session, and we'll try to solve all the things that push the frontier of advanced data structures. So in classes, we'll see the state of the art. And then we'll change the state of the art in those sessions.

I think that's it. Any questions about the class before we get into the fun stuff? All right.
Let's do some time traveling. Before I get to time traveling, though, I need to define our model of computation. A theme in this class is that the model of computation you're working with matters. Models matter. And there are lots of different models of computation. We'll see a few of the main ones in this class.

And the starting point, and the one we'll be using throughout today, is called a pointer machine. It's an old model from the '80s. And it corresponds to what you might think about if you've done a lot of object-oriented programming, and before that, structure-oriented programming, I guess. So you have a bunch of nodes. They have some fields in them, a constant number of fields. You can think of these as objects, or structs in C. They used to be called records back in Pascal days, so a lot of the papers call them records. You just have a constant number of fields, and you can think of those as numbered or labeled; it doesn't really matter, because there's only a constant number of them. Each of the fields could be a pointer to another node, could be a null pointer, or could have some data in it. So I'll just assume that all my data is integers. You can have a pointer to yourself. You can have a pointer over here, whatever you want. A pointer machine would look something like this. At any moment, this is the state of the pointer machine. So you can think of this as the memory of your computer.

And then you have some operations that you're allowed to do. That's the computation part of the model; you can think of the first part as the memory model. What you're allowed to do is create nodes: you can say something like, x = new node. You can look at fields: x = y.field. You can set fields: x.field = y. You can compute on the data, so you can add 5 and 7, do things like that. I'm not going to worry about the rest; I'll just write et cetera. This is more a model about how everything's organized in memory, not so much about what you're allowed to do to the data items. In this lecture, it won't matter what you're doing to the data items. We never touch them. We just copy them around.

So am I missing anything? Probably. I guess you could destroy nodes if you felt like it. But we won't have to today, because we don't want to throw anything away when we're time traveling. It's too dangerous.

And then the one catch here is: what are x and y? There's going to be one node in this data structure, or in your memory, called the root node. And you can think of that as the thing you always have in your head. This is like your cache, if you will. It's just got a constant number of things, just like any other node. And x and y are fields of the root. So that sort of ties things down. You're always working relative to the root. But you can look at the data, basically follow a pointer, by looking at the field. You can set one of these pointers. I think I probably need another operation here, like x = y.field1.field2, that sort of thing, and maybe the reverse. But you can manipulate all nodes via the root; that's the idea. You follow pointers, do whatever. So it's pretty obvious, slightly annoying to write down formally. But that is a pointer machine.
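To make the model concrete, here is a minimal sketch of a pointer-machine memory in Python. The class name, field count, and layout are illustrative assumptions, not anything from the lecture; the point is just that every node has a constant number of fields, each holding a pointer, null, or an integer, and all access goes through the root.

```python
class Node:
    """A pointer-machine node: a constant number of fields, each of
    which is a pointer to another Node, None, or an integer."""
    NUM_FIELDS = 3  # some fixed constant

    def __init__(self):
        self.fields = [None] * Node.NUM_FIELDS

# The root node is the one node we always "hold in our head".
root = Node()

# The allowed operations, always expressed relative to the root:
x = Node()                      # x = new node
root.fields[0] = x              # root.field = x
x.fields[1] = 5                 # a field can hold data (an integer)
x.fields[2] = x                 # a node may point to itself
y = root.fields[0]              # x = y.field (follow a pointer)
z = root.fields[0].fields[2]    # x = y.field1.field2
```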
And what we're going to be talking about today, in time travel, is: suppose someone gives me a pointer-machine data structure, for example a balanced binary search tree or a linked list. A lot of data structures, especially classic data structures, follow the pointer-machine model. What we'd like to do is transform that data structure, or make a new pointer-machine data structure, that does extra cool things, namely travel through time. So that's what we're going to do.

There are two senses of time travel, or temporal data structures, that we're going to cover in this class. The one for today is called persistence, where you don't forget anything, like an elephant. And the other one is retroactivity. Persistence will be today; retroactivity is next class. Basically, these correspond to two models of time travel. Persistence is the branching-universe time travel model, where if you make a change in the past, you get a new universe. You never destroy old universes. Retroactivity is more like Back to the Future, where you go back, make a change, and then you can return to the present and see what happened. This is a lot harder to do, and we'll work on that next class. Persistence is what we will do today.

So, persistence. The general idea of persistence is to remember everything. That's the general goal, I would say. And by everything, I mean different versions of the data structure. In data structures in general, we have update operations and query operations. We're mainly concerned about updates here. Every time you do an update, you think of it as taking a version of the data structure and making a new one. And you never want to destroy old versions. So even though an update like an insert or something changes the data structure, we want to remember all that past data as well.

And then, to make this reasonable: all data structure operations are relative to a specified version. So an update makes and returns a new version. When you do an insert, you specify a version of your data structure and the thing you want to insert, and the output is a new version. So then you could insert into that new version and keep going, or maybe go back to the old version and modify that. I haven't said exactly what's allowed here, but this is sort of the general goal.
And then there are four levels of persistence that you might want to get.

The first level is called partial persistence. This is the easiest to obtain. In partial persistence, you're only allowed to update the latest version, which means the versions are linearly ordered. This is the easiest to think about. And time travel can easily get confusing, so start simple. We have a timeline of various versions on it. This is the latest. And what we can do is update that version. We'll get a new version, and then our latest is this one. What this allows is looking back at the past, to an old version, and querying that version. So you can still ask questions about the old versions, if you want to be able to do a search on any of those old data structures. But you can't change them. You can only change the most recent version. So that's nice. It's kind of like Time Machine on the Mac, I guess. If you've ever seen the movie Deja Vu, which is not very well known, but it's a good time travel movie: in the first half of the movie, all they can do is look back at the past. Later they discover that actually they have a full persistence model. It takes a while, for dramatic effect.
In full persistence, you can update anything you want: update any version. And so then the versions form a tree. So in this model, maybe you initially have a nice line of versions. But now if I go back to this version and update it, I branch and get a new version here. And then I might keep modifying that version sometimes. Any of these guys can branch. So this is why I call it the branching-universe model: when you update, you branch. No version ever gets destroyed here. Again, you can query all versions. But now you can also update any version; you just make a new version. It's a totally new world. When I update this version, it knows nothing about the other future; it's created its own future. There's no way to sort of merge those universes together. It's kind of sad.
That's why we have the third level of persistence, which lets us merge timelines. It's great for lots of fiction out there. If you've seen the old TV show Sliders, that would be confluent persistence. In confluent persistence, you can combine two versions to create a new version. And in this case, again, you can't destroy old versions. In persistence, you never destroy versions. So now the versions form a DAG, a directed acyclic graph. So now we're allowing: OK, you make some changes, whatever. You branch your universe, make some changes. And now I can say: OK, take this version of the data structure and this version, and recombine them. Get a new version, and then maybe make some more changes.

OK, what does combine mean? Well, it depends on your data structure. A lot of data structures have combine operations. For example, if you have two linked lists, you can concatenate them. That's an easy operation. Even if you have binary search trees, you can concatenate them reasonably easily and combine them into one big binary search tree. So if your data structure has an operation that takes as input two data structures, then what we're saying is: now it can take two versions, which is more general. So I could take the same data structure, make some changes in one way, separately make some changes in a different way, and then try to concatenate them or do something crazy. This is hard to do, and it's mostly an open problem whether it can be done. But I'll tell you about it.
Then there's another level, even beyond confluent persistence. This is hard to interpret in the time travel world, but it would be functional data structures. If you've ever programmed in a functional programming language: it's a little bit annoying from an algorithms perspective, because it constrains you to work in a purely functional world. You can never modify anything. Now, we don't want to modify versions; that's fine. But in a functional data structure, you're not allowed to modify any nodes, ever. All you can do is make new nodes. This is constraining, and you can't always get optimal running times in the functional world. But if you can get a functional data structure, you get all these other things, because you can't destroy anything. If you can't destroy nodes, then in particular, you can't destroy versions. And all of these things just work for free. And a bunch of interesting special cases are known: search trees, for example, you can do in the functional world. And that makes all of these things easy.

So the rest of this lecture is going to be general techniques for doing partial and full persistence, what we know about confluent, and what we know about functional: a brief overview. Any questions about those goals, those problem definitions? Yeah.
AUDIENCE: I'm still confused about functional, because--

ERIK DEMAINE: What does functional mean?

AUDIENCE: [INAUDIBLE]

ERIK DEMAINE: Yeah, I guess you'll see what-- functional looks like all the other things, I agree. You'll see in a moment how we actually implement partial and full persistence. We're going to be changing nodes a lot. As long as we still represent the same data in the old versions, we don't have to represent it in the same way. That lets us do things more efficiently. Whereas in functional, you have to represent all the old versions in exactly the way you used to represent them. Here we can kind of mangle things around, and that makes things faster. Yeah, good question. So it seems almost the same, but it's nodes versus versions. I haven't really defined a version. But the requirement is just that all the queries answer the same way. That's what you need for persistence. Other questions?
All right. Well, let's do some real data structures. We start with partial persistence. This is the easiest. For both partial and full persistence, there is the following result: any pointer-machine data structure, with one catch, a constant number of pointers into any node. So this is constant in-degree. In a pointer machine, you always have at most a constant number of pointers out of a node. But for this result to hold, we also need a constant number of pointers into any node. So this is an extra constraint. Such a structure can be transformed into another data structure that is partially persistent and does all the things it used to do. So I'll just say: it can be made partially persistent. You have to pay something, but you have to pay very little: a constant amortized multiplicative overhead, and a constant amount of additive space per change in the data structure. So every time you do a modification in your pointer machine, when you set one of the fields to something, you have to store that forever. I mean, this is the best you could hope to do; you've got to store everything that happened. You pay a constant factor overhead, eh. We're theoreticians. That doesn't matter. Then you get: any data structure in this world can be made partially persistent. That's nice. Let's prove it.
OK, the idea is pretty simple. Pointer machines are all about nodes and fields. So we just need to simulate whatever the data structure is doing to those nodes and fields, in a way where we don't lose all the information and we can still search it very quickly.

The first idea is to store back pointers. And this is why we need the constant in-degree constraint. So if we have a node (how do I want to draw a node here? say these are the three fields of the node), I want to also store some back pointers. Whenever there is a node that points to this node, I want to have a back pointer that points back, so I know where all the pointers came from. If there are only p pointers in, then this is fine: there'll be p extra fields here. So it's still constant size, still in the pointer-machine model. I'm going to need some other stuff too. But this is a simple thing, and we definitely want it. Because if my nodes ever move around, I've got to update the pointers to them. And where are those pointers? The back pointers tell you where they are. Nodes will still be constant size, so we remain a pointer-machine data structure. One convenience here: I'm only going to store back pointers for the latest version of the data structure. We'll come back to that later.

That's idea one. Idea two is this part, which is going to store something called mods. It could stand for something, but I'll leave it as mods. So these are two more fields of the data structure. The idea is to store these modifications. How many modifications? Let's say up to twice p. p was the bound on the in-degree of a node, so I'm going to allow 2p modifications over here. And what does a modification look like? It's going to consist of three things (let me get them in the right order): the version in which something was changed, the field that got changed, and the value it got changed to. So the idea is that these are the fields here. We're not going to touch those. Whatever they are initially, they will stay that way. And so instead of actually changing things like the data structure normally would, we're just going to add modifications here to say: oh, well, at this time, this field changed to the value 5. And then later on, it changed to the value 7. And then later on, this one changed to the value 23, whatever. So that's what they'll look like. There's a limit: we can only store a constant number of mods in each node. And our constant will be 2p.

OK. Those are the ideas, and now it's just a matter of making this all work, and analyzing that it's constant amortized overhead.
25:15But then it might
have been changed.
25:17And so remember when
I say read the field,
25:19I actually mean while
I'm given some version,
25:21v, I want to know what is the
value of this field at version
25:24v, because I want to be able
to look at any of the old data
25:28structures too.
25:29So this would be at
version v. I just
25:37look through all
the modifications.
25:38There's constantly many,
so it takes constant time
25:40to just flip through them
and say, well, what changes
25:43have happened up to version v?
25:46So I look at mods
with version less than
25:56or equal to v. That will be all
the changes that happened up
25:59to this point.
26:00I see, did this field change?
26:02I look at the latest one.
26:03That will be how I read
the field of the node, so
26:07constant time.
26:08There's lots of ways to make
this efficient in practice.
26:10But for our purposes,
it doesn't matter.
26:13It's constant.
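As a sketch, here is that read in Python, under the node layout just described (frozen original field values plus a bounded list of (version, field, value) mods). The class and function names are mine, not the lecture's.

```python
class PNode:
    """Partially persistent node: original fields are never
    overwritten; changes live in a bounded list of mods."""
    def __init__(self, fields):
        self.fields = list(fields)   # frozen after creation
        self.mods = []               # at most 2*p (version, field, value) entries
        self.back = []               # (node, field) back pointers, latest version only

def read_field(node, field, v):
    """Value of node.field as of version v: start from the original
    value, then apply the latest mod with version <= v, if any."""
    value = node.fields[field]
    best = -1
    for version, f, val in node.mods:
        if f == field and best < version <= v:
            best, value = version, val
    return value
```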
The hard part is: how do you change a field? Because there might not be any room in the mod structure. So to modify, say we want to set node.field = x. What we do: first we check, is there any space in the mod structure? If there are any blank mods, if the node is not full, we just add a mod. The mod will look like (now, field, x); just throw that in there. "Now" works because we maintain a time counter and just increment it every time we do a change: this field changed to that value at this moment. So that's the easy case.

The trouble, of course, is if the node is full. The moment you've all been waiting for. So what we're going to do here is make a new node. We've run out of space, so we need to make a new node. We're not going to touch the old node, just let it sit; it still represents all those old versions. Now we want a new node that represents the latest and greatest of this node. So: make a new node. I'll call it node prime, to distinguish it from node, with all the mods, and this modification in particular, applied. So we make a new version of this node. It's going to have some different fields: whatever was the latest state represented by those mods. It's still going to have back pointers, so we have to maintain all those back pointers. And the mods, initially, are going to be empty, because we just applied them all. So this new node doesn't have any recent mods. The old node represents the old versions. This node is going to represent the new versions.
What's wrong with this picture?

AUDIENCE: Update pointers.

ERIK DEMAINE: Update pointers. Yeah, there are pointers to the old version of the node, which are fine for the old versions of the data structure. But for the latest version of the data structure, this node has moved to this new location. So if there are any old pointers to that node, we've got to update them in the current version. We have to update them to point to this node instead. The old versions are fine, but the new version is in trouble. Other questions, or all the same answer? Yeah.

AUDIENCE: So if you want to read an old version but you just have the new version, [INAUDIBLE]?

ERIK DEMAINE: OK, the question is essentially: how do we hold on to versions? Essentially, you can think of a version of the data structure as where the root node is. That's probably the easiest. In general, we're representing versions by a number, v, but we always start at the root. So you're given the data structure, which is represented by the root node, and you say: search for the value 5, is it in this binary search tree or whatever? And then you just start navigating from the root, but you know: I'm in version a million, or whatever. I know what version I'm looking for. So you start with the root, which never changes, let's say. And then you follow pointers that essentially tell you, for that version, where you should be going. I guess at the root, it's a little trickier. You probably want a little array that says: for this version, here's the root node. But that's a special case. Yeah. Another question?

AUDIENCE: So on the new node that you created, the fields that you copied, you also have to have a version for them, right? Because [INAUDIBLE]?

ERIK DEMAINE: These--

AUDIENCE: Or do you version the whole node?

ERIK DEMAINE: Here we're versioning the whole node. The original field values represent what was originally there, whenever this node was created. Then the mods specify at what times the fields changed. So I don't think we need times on the fields themselves.
All right. So we've got to update two kinds of pointers. There are regular pointers, which live in the fields of the nodes pointing to this node. But then there are also back pointers: if there's a pointer to a node, then there will be a back pointer coming back. And all of those have to change. Conveniently, the back pointers are easy. If there are back pointers to the node, we change them to point to node prime. How do we find the back pointers? Well, we just follow all of our pointers, and there will be back pointers at the other end. Because I said we're only maintaining back pointers for the latest version, I don't need to preserve the old versions of those back pointers. So I just go in and change them. It takes constant time, because there's a constant number of things I point to, and each one has a back pointer. So this is cheap. There's no persistence here. That's an advantage of partial persistence.

The hard part is updating the regular pointers, because those live in fields, and I need to remember the old versions of those fields. That we do recursively. Because changing those pointers is a field update; that's something exactly of this form. So it's the same operation, but on a different node. So I just do that. I claim this is good. That's the end of the algorithm.
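Putting the two cases together, here is a sketch of the whole update, continuing the hypothetical PNode layout and read_field from the earlier sketch. It is one possible rendering of the lecture's algorithm, not the paper's exact code: add a mod if there's room; otherwise copy the node with all mods applied, repoint back pointers in place (latest version only), and recursively redirect the incoming field pointers.

```python
MAX_MODS = 4      # stands in for 2*p, where p bounds the in-degree
now = 0           # global version counter

def set_field(node, field, value, top_level=True):
    """Set node.field = value in the latest version, persistently."""
    global now
    if top_level:
        now += 1
    if len(node.mods) < MAX_MODS:            # easy case: room for a mod
        node.mods.append((now, field, value))
        return node
    # Full node: build node' with every mod (and this change) applied.
    vals = [read_field(node, f, now) for f in range(len(node.fields))]
    vals[field] = value
    new = PNode(vals)
    new.back = node.back                     # same incoming pointers
    # Nodes we point to keep back pointers only for the latest version,
    # so repoint those at the new node in place (no persistence needed).
    for tgt in new.fields:
        if isinstance(tgt, PNode):
            tgt.back = [((new, g) if n is node else (n, g))
                        for (n, g) in tgt.back]
    # Nodes pointing to the old node: their fields change in the latest
    # version, which is a recursive persistent field update.
    for src, g in list(node.back):
        set_field(src, g, new, top_level=False)
    return new
```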
Now we need to analyze it. How do we analyze it? Any guesses?

AUDIENCE: Amortize it.

ERIK DEMAINE: Amortized analysis, exactly the answer I was looking for. The most powerful technique in amortization is probably the potential method, so we're going to use that. You'll also see a charging argument in a moment.

We want the potential function to represent when this data structure is in a bad state. Intuitively, it's in a bad state when a lot of nodes are full, because then, as soon as you make a change in them, they will burst, and you have to do all this crazy recursion and stuff. The first case is nice and cheap: we just add a modification, constant time. The second case, not so nice, because we recurse, and then that's going to cause more recursions, and all sorts of chaos could happen.

There are probably a few different potential functions that would work here. In an old version of these notes, I said it should be the number of full nodes. But I think we can make life a little bit easier with the following: basically, the total number of modifications. Not quite the total; almost the total. I'm going to use c times the sum, over latest-version nodes, of the number of mods. Because we're only changing the latest version, I really only care about nodes that live in the latest version. What do I mean by this? Well, when I made this new node prime, it became the new representation of that node. The old version is dead; we will never change it again. If we're modifying, we will never even look at it again, because now everything points to the new node. So I don't really care about that node. It's got a ton of mods. But what's nice is that when I create the new node, its mod list is empty. So I start from scratch, just like reinstalling your operating system. It's a good feeling. And so the potential goes down by, I guess, c times 2p: when I do this change, the potential goes down by basically p.

AUDIENCE: Is c any constant, or--

ERIK DEMAINE: c will be a constant to be determined. I mean, it could be 1; it depends how you want to define it. I'm going to use the CLRS notion of amortized cost, which is actual cost plus change in potential. And I need a constant here because I'm measuring a running time against some combinatorial quantity. So c will be chosen to match the running time we'll get.

So what is the amortized cost? There are two cases of modification: the cheap case and the not-so-cheap case. In both cases, it's going to be at most the following. First of all, we do some constant work just to figure out all this stuff, make copies, whatever. So that's some constant time; that's the part I don't want to try to measure exactly. Then, potentially, we add a new mod. If we add a mod, that increases the potential by c, because we're just counting mods and multiplying by c. So we might get plus one mod. This is an upper bound: we don't always add one, but worst case we always add one, let's say. And then there's the annoying part, which might happen or might not happen, so there's a "plus maybe". If it happens, we decrease the potential, because we empty out the mods for that node in terms of the latest version: that's a change in potential of minus 2cp. And then we have to pay for, I guess, up to p recursions. Why p? How many pointers are there to me? At most p of them, because there are at most p pointers into any node.

This is kind of weird; it's not exactly algebra. I have this term, "recursions". But think about how this would expand. The first part is constant time. That's good. And then this bracketed part (I'll put a question mark here; it might be there, it might not): if it's not there, fine, constant. If it is there, then it gets expanded into this whole thing again. It's a weird way to write a recurrence. We get p times whatever is on the right-hand side. But then there's this minus 2cp. So we're going to get p times 2c from the expansion, and that cancels with the minus 2cp. And then we might get another recursion, but every time we get a recursion, all the terms cancel. So it doesn't matter whether the recursion is there or not: you get 0, which is great. And you're left with the original 2c. Constant.
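Written out as a formula (my transcription of the board; the bracket appears only when the node overflows):

```latex
\Phi \;=\; c \sum_{\text{latest-version nodes}} \#\mathrm{mods}

\text{amortized cost}
  \;\le\; \underbrace{c}_{\text{local work}} \;+\; \underbrace{c}_{\text{new mod}}
  \;+\; \Bigl[\;\underbrace{-2cp}_{\text{mod list empties}}
  \;+\; \underbrace{p \cdot (\text{amortized pointer update})}_{\le\; p \cdot 2c}\;\Bigr]
  \;\le\; 2c \;=\; O(1)
```

Each recursive pointer update contributes at most 2c before its own bracket cancels in the same way, so every bracket telescopes to zero.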
OK. [INAUDIBLE] potential functions are always a little crazy. What's happening here is that, OK, maybe you add a mod; that's cheap. But when we have to do this work and this recursion, up to 2p updates or recursions, we are charging it to the emptying of the node: the number of mods went from 2p down to 0. So we're charging the update cost to those modifications. If you like charging schemes, this is much more intuitive. But with charging schemes, you always have to be a little careful: you have to make sure you're not double charging. Here it's obvious that you're not double charging. Kind of cool and magical. This is a paper by Driscoll, Sarnak, Sleator, and Tarjan from 1989. So it's very early days of amortization, but they knew how to do it. Question?
AUDIENCE: [INAUDIBLE]

ERIK DEMAINE: What happens if you overflow the root? Yeah, I never thought about the root before today. But I think the way to fix the root is just to have one big table that says, for a given version-- I guess a simple way would be to say that a version is not only a number, but also a pointer to the root. There we go: pointer machine. So that way you're always explicitly maintaining the root copy, or the pointer to it. Because otherwise, you're in trouble.

AUDIENCE: Then can you go back to [INAUDIBLE].

ERIK DEMAINE: So in order to refer to an old version, you have to have the pointer to that root node. If you want to do it just from a version number, you would need some kind of lookup table, which is outside the pointer machine. So you could do it on a real computer, but in a pointer machine it's not technically allowed. So it's slightly awkward. No arrays are allowed in pointer machines, in case that wasn't clear. Another question?

AUDIENCE: [INAUDIBLE] constant space to store for [INAUDIBLE]. And also, what if we have really big numbers [INAUDIBLE]?

ERIK DEMAINE: In this model, the pointer-machine model, we're assuming that whatever data is in the items takes constant space each. If you want to know about storing bigger things in here, then refer to future lectures. This is time travel, after all: just go to a future class and then come back. [LAUGHS] So we'll get there, but right now, we're not thinking about what's in here. Whatever big thing you're trying to store, you reduce it down to constant-size things, and then you spread them around nodes of a pointer machine. How you do that is up to the data structure; we're just transforming the data structure to be persistent. You could ask about other models than pointer machines, but we're going to stick to pointer machines here.
All right. That was partial persistence. Let's do full persistence. That was too easy. The same paper does full persistence; that was just a warm-up. Full persistence is actually not that much harder. So let me tell you basically what changes.

There are two issues. One is that everything here has to change, though not by much. We're still going to use back pointers. We're still going to have my mods. The number of mods is going to be slightly different, but basically the same. Back pointers no longer just refer to the latest version; we have to maintain back pointers in all versions. So that's annoying. But hey, that's life. The amortization, the potential function, will change slightly, but basically not much.

The bigger issue, the one you might first wonder about, and actually the most technically challenging one, is that versions are no longer numbers. Because it's not a line. Versions are nodes in a tree. You should probably call them vertices in a tree, to distinguish them from nodes in the pointer machine. So you've got this tree of versions, and a version is just some point on that tree. This is annoying, because we like lines. We don't like trees as much. So what we're going to do is linearize the tree. Like, when in doubt, cheat.
How do we do this? With tree traversal. Imagine I'm going to draw a super complicated tree of versions. Say there are three versions. I don't want to number them, because that would be kind of begging the question. So let's just call them x, y, and z. I mean, it's a directed tree, because we have the older versions. This is like the original version, and we made a change, and then we made a different change to the same version. What I'd like to do is a traversal of that tree, a regular traversal, as if you're going to sort those nodes. Actually, let me use color, high-def here. So here's our traversal of the tree. And I want to look at the first and the last time I visit each node. Here's the first time I visit x, so I'll write: this is the beginning of x. Then this is the first time I visit y, so it's the beginning of y. And then this is the last time I visit y, so it's the end of y. And then, don't care. Then this is the beginning of z, and this is the end of z. And then this is the end of x. If I write those sequentially, I get bx, by, ey, bz, ez, and, because this is so easy, ex. You can think of these as parentheses, right? For whatever reason I chose b and e, for beginning and ending, but this is like open parens, close parens. This is easy to do in linear time. I think you all know how.
a static problem.
43:55Versions are changing
all the time.
43:56We're adding versions.
43:57We're never deleting
versions, but we're always
43:59adding stuff to here.
44:00It's a little
awkward, but the idea
44:01is I want to
maintain this order,
44:05maintain the begin and
the end of each you
44:16might say subtree of versions.
44:23This string, from
bx to ex, represents
44:25all of the stuff in x's
subtree, in the rooted tree
44:29starting at x.
44:33How do I maintain that?
44:40Using a data structure.
44:56So we're going to use something,
a data structure we haven't yet
45:00seen.
45:02It will be in lecture 8.
45:04This is a time travel
data structure,
45:06so I'm allowed to do that.
45:10So order maintenance
data structure.
45:14You can think of this as
a magical linked list.
45:16Let me tell you what the
magical linked list can do.
45:19You can insert--
45:22I'm going to call it
an item, because node
45:24would be kind of confusing
given where we are right now.
45:28You can insert a new item in
the list immediately before
45:32or after a given item.
45:37OK.
45:37This is like a
regular linked list.
45:41Here's a regular linked list.
45:44And if I'm given a particular
item like this one,
45:48I can say, well, insert
a new item right here.
45:51You say, OK.
45:51Fine.
45:52I'll just make a new node and
relink here, relink there.
45:57Constant time, right?
45:58So in an order maintenance
data structure,
46:00you can do this
in constant time.
46:01Wow!
46:02So amazing.
46:05OK, catch is the second
operation you can do.
46:08Maybe I'll number these.
46:09This is the update.
46:10Then there's the query.
46:13The query is, what
is the relative order
46:17of two notes, of two items?
46:24x and y.
46:27So now I give you this
node and this node.
46:29And I say, which is to the left?
46:32Which is earlier in the order?
46:34I want to know, is x
basically less than y
46:36in terms of the
order in the list?
46:37Or is y less than x?
46:41And an order maintenance
data structure
46:42can do this in constant time.
46:45Now it doesn't look like your
mother's linked list, I guess.
46:50It's not the link list
you learned in school.
46:52It's a magical linked
list that can somehow
46:54answer these queries.
46:55How?
46:56Go to lecture 7.
46:58OK.
46:59Forward reference,
lecture 8, sorry.
For now, we're just going to assume that this magical data structure exists, with both operations in constant time. This is great. Because if we're maintaining these b's and e's, we want to maintain the order in which these things appear. If we want to create a new version, suppose we were just creating version z: well, the list used to be everything without this bz, ez, and we just insert two items, bz and ez, right next to each other. If we were given version x, we could just say: oh, we'll look at ex and insert the two items right before it. Or you could put them right after bx. I mean, there's no actual order here; it could have been y first and then z, or z first and then y. So it's really easy to add a new version in constant time. You just do two of these insert operations.

And now you have this magical order operation. If I'm given two versions, say v and w, and I want to know whether v is an ancestor of w, now I can do it in constant time. So this lets me do a third operation: is version v an ancestor of version w? That's going to be true if and only if bv and ev nest around bw and ew. That's just three order tests, and they're probably not all even necessary; this one always holds. But if these guys fit in between those guys, then you know.

Now, what this tells us, what we care about here, is reading fields. When we read a field, we said: oh, we'll apply all the modifications that apply to version v. Before, that was a linear order, so it was just all versions less than or equal to v. Now it's all versions that are ancestors of v. Given a mod, we need to know: does this mod apply to my version? And now I tell you, I can do that in constant time, through magic. I just test these order relations. If they hold, then that mod applies to my version. So w is the version we're testing; v is some version in the mod. And I want to know: am I a descendant of that version? If so, the mod applies, and I update what the field is. I can do all pairwise ancestor checks and figure out: what is the most recent version in my ancestor history that modified a given field? That lets me read a field in constant time. The constants are getting kind of big at this point, but it can be done.
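As a small sketch (Python), here is how the version tree's b/e markers would sit on top of such a structure. The object `om` and its methods `insert_before`, `insert_after`, and `order` are placeholder names for the order-maintenance operations just described, assumed O(1) each; they are not a real API from the lecture.

```python
def add_version(om, parent_b, parent_e, name):
    """Create a child version under a parent: insert its begin/end
    markers adjacent to each other, just before the parent's end
    marker (right after the parent's begin marker would work too)."""
    b = om.insert_before(parent_e, 'b' + name)   # assumed O(1)
    e = om.insert_after(b, 'e' + name)           # assumed O(1)
    return b, e

def is_ancestor(om, bv, ev, bw, ew):
    """Version v is an ancestor of version w iff v's markers nest
    around w's: b_v comes before b_w, and e_w comes before e_v."""
    return om.order(bv, bw) and om.order(ew, ev)  # each assumed O(1)
```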
Clear? It's a little bit of a black box here. But now we've gotten as far as reading, and we don't need to change much else. So this is good news. Maybe I'll give you a bit of a diff.

So, full persistence. The fully persistent theorem: done. Same theorem, just with full persistence. How do we do it? We store back pointers now for all versions. It's a little bit annoying, but how many mods do we use? There are lots of ways to get this to work, but I'm going to change the number of mods to 2(d + p + 1). Wait, what's d? d is the number of fields. We said it was a constant number of fields; I never said what that constant is. d for out-degree, I guess. So p is the max in-degree, and d is the max out-degree. So, slightly more mods. The main reason for this is that back pointers are now treated like everyone else: we have to treat both the out pointers and the in pointers as basically the same. So instead of p, we have d + p. And there's a plus 1 just for safety; it gets my amortization to work, hopefully.

Not much else changes; this page is all the same. Mods still give you versions, fields, values. Reading: OK, well, it's no longer "version less than or equal to v". It's now the nearest version that's an ancestor of v. That's what we were just talking about, so that can be done in constant time: check it for all of the mods, constant work.
OK. That was the first part. Now we get to the hard part, which is modification. This is going to be different. Maybe I should just erase-- yeah, I think I'll erase everything except the first clause. If a node is not full, we'll just add a mod, just like before. What changes is when a node is full. Here we have to do something completely different. Why? Because if we just make a new version of this node with empty mods, this one's still full, and I can keep modifying the same version. The new node that I just erased represents some new version. But if I keep modifying an old version, which I can do in full persistence, this node keeps being full, and I keep paying a potentially huge cost. If all the nodes were full, and when I make this change every node gets copied, and then I make a change to the same version and every node gets copied again: this is going to take linear time per operation. So I can't do the old strategy. I need to somehow make this node less full. This is where we're definitely not functional. None of this was functional, but now I'm going to change an old node, not just make a new one, in a more drastic way. Before, I was adding a mod; that's not a functional operation. Now I'm actually going to remove mods from a node, to rebalance. So what I'd like to do is split the node into two halves.
OK. So I had some big node that was-- I'll draw it-- completely full. Now I'm going to make two nodes. Here we go. This one is going to be half full; this one's going to be half full of mods. The only question left is: what do I do with all these things? Basically, on the one hand, I want to have the old node, just where it used to be; I've just removed half of the mods, the second half, the later half. What does that mean? I don't know; figure it out. It's linearized. I haven't thought deeply about that. And we're going to make a new node with that second half of the mods.

It's more painful than I thought. In reality, these mods represent a tree of modifications. And what you need to do is find a partition of that tree into two roughly equal halves. You can actually do a one-third/two-thirds split. That's also in a future lecture, whose number I forget. So really, you're splitting this tree into two roughly balanced halves. And so this 2 might actually need to change to a 3, but it's a constant.

What I want is for the new node to represent a subtree of versions. Let me draw the picture. So here's the tree of versions represented by the old mods. I'd like to cut out a subtree rooted at some node. Let's just assume for now that this has exactly half the nodes, and this has half the nodes. In reality, I think it can be one-third/two-thirds. But let's keep it convenient. So I want the new node to represent this subtree, and the old node to represent everything else. The old node is as if this subtree's stuff hasn't happened yet: it represents all the old versions that are not in the subtree. The new node represents all the latest stuff. So, like before, I want to apply some mods to these fields. Whatever mods were relevant at this point, whatever had been applied, I apply those to the fields of the new node. And that means I can remove all of those mods; I only cared about these ones. Update the fields accordingly. I still keep the other mods, to represent all the other changes within that subtree. So: we actually split the tree, and we apply mods to the new node.
Anything else I need to say? Oh, now we need to update pointers. That's always the fun part. Let's go over here. The old node hasn't moved, but this new node has moved. So for all of the versions in the subtree, I want to change any pointer that used to point to the old node so that it now points to the new node. In the old versions, it's fine: they should still point to the old node, because it represents all those old versions. But for the new versions, the versions in the subtree, I've got to point to the new node instead.

So how many pointers could there be to this node that need to change? That's the tricky part of this analysis. Think about it for a while. In this new node, whatever is pointed to, by either here or here, also has a return pointer. All pointers are bidirectional, so we don't really care whether they're forward or backward. How many pointers are there here? Well, there's d here, and there's p here. But then there are also some additional pointers represented over here, in the mods. How many? Well, if we assume this magical 50/50 split, there are right now d + p + 1 mods over here, half of them. Each of them might be a pointer to some other place, which has a return pointer in that version. So the number of pointers that we need to update is going to be 2d + 2p + 1. So we recursively update at most 2d + 2p + 1 pointers to the node.
The good news is this is really only half of them, or some fraction of them. There were more pointers before; we don't have to deal with those ones. That's where we're saving, and that's why this amortization works. Let me give you a potential function that makes this work: Phi is minus c times the sum of the number of empty mod slots. It's kind of the same potential as before, but before we had this notion of dead and alive nodes. Now everything's alive, because everything could change at any moment. So instead, I'm going to measure how much room I have in each node. Before the split, I had no room in this node. Now I have half the space in both nodes. So that's good news. Whenever we have this recursion, we can charge it to a potential decrease. Phi goes down by (because I have a negative sign here) c times, oh man, 2(d + p + 1), I think. Because there's d + p + 1 space here, and d + p + 1 space here. I mean, we added one whole new node, and the total mod capacity of a node is 2(d + p + 1). So we get that, times c. And this is basically just enough, because this is 2d + 2p + 2, and here we have 2d + 2p + 1: the recursion gets annihilated by the 2d + 2p + 1, and then there's one c left over to absorb whatever constant cost there was to do all this other work. So I got the constants to just work, except that I cheated, and it's really a one-third/two-thirds split. So probably all of these constants have to change; such is life. But I think you get the idea.
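The same calculation in symbols (a sketch, assuming the idealized 50/50 split from the lecture; the real one-third/two-thirds split only changes the constants):

```latex
\Phi \;=\; -\,c \sum_{\text{nodes}} \#\text{empty mod slots}

% Splitting a full node of mod capacity 2(d+p+1) creates 2(d+p+1)
% empty slots, so \Delta\Phi = -2c(d+p+1), while the split triggers
% at most 2d+2p+1 recursive pointer updates:
\text{amortized cost}
  \;\le\; c \;+\; \Bigl[\,(2d+2p+1)\cdot(\text{recursive cost}) \;-\; 2c(d+p+1)\,\Bigr]
```

Since 2c(d + p + 1) = (2d + 2p + 1)c + c, the potential drop annihilates the recursive charges with one c to spare, which pays for the constant local work, so each update is O(1) amortized.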
Any questions about full persistence? This is fun stuff, time travel. Yeah?

AUDIENCE: So in the first half of the thing, where, if there's room, you can put it in--

ERIK DEMAINE: Right.

AUDIENCE: I have a question about how we represent the version. Because before, when we stored "now" [INAUDIBLE], it made more sense when "now" was like a timestamp or something.

ERIK DEMAINE: OK. Right, so how do we represent a version, here or anywhere? When we do a modification, an update, in the data structure, we want to return the new version. Basically, we're going to actually store the DAG of versions, and a version is going to be represented by a pointer into this DAG: one of the nodes in this DAG becomes a version. Every node in this DAG is going to store a pointer to the corresponding b character and the corresponding e character in this data structure, which then lets you do anything you want. Then you can query against that version, whether it's an ancestor of another version. So yeah, I didn't mention that. Versions are nodes in here. Nodes in here have pointers to the b's and e's. And vice versa: the b's and e's have pointers back to the corresponding version node. And then you can keep track of everything. Good question. Yeah?

AUDIENCE: [INAUDIBLE] question. Remind me what d is in this.

ERIK DEMAINE: Oh, d was the maximum out-degree. It's the number of fields in a node, as defined right here. Other questions?
Whew. OK, a little breather. That was partial persistence and full persistence. This is, unfortunately, the end of the really good results. As long as we have constant-degree nodes, in- and out-degree, we can do persistence for free. Obviously there are practical constants involved here, but in theory, you can do this perfectly.

Before we go on to confluence, there is one positive result, which is: what if you don't like amortized bounds? There are various reasons amortized bounds might not be good. Maybe you really care about every operation being no slower than it was, except by a constant factor. We're amortizing here, so some operations get really slow, but the others are all fast to compensate. You can "deamortize", it's called: you can get constant worst-case slowdown for partial persistence. This is a result of Gerth Brodal from the late '90s, '97. For full persistence, it's an open problem. I don't know if people have worked on that.
1:03:55All right.
1:03:56So some, mostly good results.
1:03:59Let's move on to confluent
persistence where things
1:04:01get a lot more challenging.
1:04:17Lots of things go out the window
with confluent persistence.
1:04:20In particular, your
versions are now a DAG.
1:04:23It's a lot harder
to linearize a DAG.
1:04:25Trees are not that
far from pads.
1:04:28But DAGs are quite far
from pads, unfortunately.
1:04:33But that's not all
that goes wrong.
1:04:44Let me first tell you the
kind of end effect as a user.
1:04:50Imagine you have
a data structure.
1:04:54Think of it as a
list, I guess, which
1:04:57is a list of characters
in your document.
1:04:59You're using vi or Word,
your favorite, whatever.
1:05:03It's a text editor.
1:05:05You've got a string of words.
1:05:06And now you like to do
things like copy and paste.
1:05:09It's a nice operation.
1:05:11So you select an interval of
the string and you copy it.
1:05:16And then you paste
it somewhere else.
1:05:18So now you've got two
copies of that string.
1:05:21This is, in some
sense, what you might
1:05:24call a confluent
operation, because--
1:05:27yeah, maybe a cleaner way to
think of it is the following.
1:05:30You have your string.
1:05:31Now I have an operation,
which is split it.
1:05:33So now I have two strings.
1:05:35OK.
1:05:36And now I have an operation,
which is split it.
1:05:38Now I have three strings.
1:05:40OK.
1:05:41Now I have an operation
which is concatenate.
1:05:44So I can, for
example, reconstruct
1:05:47the original string-- actually,
I have the original string.
1:05:49No biggie.
1:05:51Let's say-- because
I have all versions.
1:05:54I never lose them.
1:05:55So now instead, I'm going to
cut the string here, let's say.
1:05:59So now I have this and this.
1:06:03And now I can do
things like concatenate
1:06:06from here to here to here.
1:06:10And I will get this
plus this plus this.
1:06:16OK.
1:06:17This guy moved here.
1:06:18So that's a copy/paste
operation with a constant number
1:06:20of splits and concatenates.
1:06:22I could also do cut and paste.
1:06:23With confluence, I can
do crazy cuts and pastes
1:06:26in all sorts of ways.
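[A minimal sketch of these semantics, using plain Python strings as a stand-in for the persistent list; this shows what the operations mean, not an efficient implementation.]

    def split(s, i):
        # split one version into two; the original version s survives
        return s[:i], s[i:]

    def concat(a, b):
        # concatenate two versions into a new one
        return a + b

    doc = "data structures"
    a, rest = split(doc, 5)       # "data " and "structures"
    b, c = split(rest, 6)         # "struct" and "ures"
    pasted = concat(concat(a, b), concat(b, c))   # "data structstructures"
    # doc, a, b, c all remain valid versions: copy/paste from O(1) operations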
1:06:28So what?
1:06:29Well, the so what
is I can actually
1:06:32double the size of
my data structure
1:06:33in a constant number
of operations.
1:06:36I can take, for example,
the entire string
1:06:38and concatenate it to itself.
1:06:40That will double the
number of characters,
1:06:41number of elements in there.
1:06:43I can do that again
and again and again.
1:06:45So in u updates,
I can potentially
get a data structure
of size 2 to the u.
1:06:57Kind of nifty.
1:06:58I think this is why
confluence is cool.
1:07:00It's also why it's hard.
1:07:02So not a big surprise.
1:07:03But, here we go.
1:07:08In that case, the version DAG,
for reference, looks like this.
1:07:13You're taking the same
version, combining it.
1:07:16So here I'm assuming I have
a concatenate operation.
1:07:20And so the effect here,
every time I do this,
1:07:24I double the size.
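[A minimal sketch of the doubling trick, using a hypothetical Concat node with a size field: both children of each concatenate node point at the same version, so u updates build a structure of size 2 to the u out of only u nodes.]

    class Leaf:
        size = 1

    class Concat:
        def __init__(self, left, right):
            self.left, self.right = left, right      # shared, not copied
            self.size = left.size + right.size

    v = Leaf()
    for _ in range(30):
        v = Concat(v, v)      # concatenate the current version with itself
    print(v.size)             # 2**30 elements from just 30 update nodes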
1:07:44All right.
1:07:44What do I want to say about
confluent persistence?
1:07:46All right.
1:07:47Let me start with the
most general result, which
1:07:53is by Fiat and Kaplan in 2003.
1:08:04They define a notion called
effective depth of a version.
1:08:08Let me just write it down.
1:08:21It's kind of like
if you took this DAG
1:08:24and expanded it out to be a
tree of all possible paths.
1:08:30Instead of point
to the same node,
1:08:31you could just
duplicate that node
1:08:33and then have pointers
left and right.
1:08:35OK.
1:08:35So if I did that, of course,
this size grows exponentially.
1:08:38It explicitly represents the
size of my data structure.
1:08:41At the bottom, if
I have u things,
1:08:42I'm going to have 2 to the
u leaves at the bottom.
1:08:45But then I can easily
measure the number of paths
1:08:49from the root to
the same version.
1:08:50At the bottom, I still
label it, oh, those
1:08:52are all v. They're all the
same version down there.
1:08:54So exponential number
of paths, if I take log,
1:08:56I get what I call
effective depth.
1:08:58It's like if you somehow
could rebalance that tree,
1:09:02this is the best you
could hope to do.
1:09:05It's not really a lower bound.
1:09:07But it's a number.
1:09:08It's a thing.
1:09:09OK.
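[As a sanity check on the definition, here is a tiny Python computation of effective depth on a hypothetical version DAG, taking e(v) to be lg of the number of root-to-v paths, as described here.]

    import math
    from functools import lru_cache

    parents = {                   # hypothetical version DAG: node -> parents
        'root': [],
        'v1': ['root'],
        'v2': ['v1', 'v1'],       # a confluent op combining v1 with itself
        'v3': ['v2', 'v2'],
    }

    @lru_cache(maxsize=None)
    def num_paths(v):
        ps = parents[v]
        return 1 if not ps else sum(num_paths(p) for p in ps)

    def effective_depth(v):
        return math.log2(num_paths(v))

    print(num_paths('v3'), effective_depth('v3'))   # 4 paths, e(v3) = 2.0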
1:09:10Then the result they achieve
is that the overhead is
1:09:17log the number of
updates plus-- this
1:09:19is a multiplicative overhead,
so you take your running time.
1:09:22You multiply it by this.
1:09:25And this is a time
and a space overhead.
1:09:31So maximum effective depth
of all versions, maybe even
1:09:34sum of effective depths, but
we'll just say max to be safe.
1:09:39Sorry-- sum over
all the operations.
1:09:41This is per operation.
1:09:43You pay basically
the effective depth
1:09:44of that operation as a factor.
1:09:48Now, the annoying thing is if
you have this kind of set up
1:09:51where the size
grew exponentially,
1:09:54then number of paths
is exponential.
1:09:56Log of the number of
paths is linear in u.
1:09:59And so this factor could be
as much as u, linear slowdown.
1:10:06Now, Fiat and Kaplan argue
linear slowdown is not
1:10:08that bad, because if you weren't
even persistent, if you did
1:10:13this in the naive way of
just recopying the data,
1:10:18you were actually spending
exponential time to build
1:10:21the final data structure.
1:10:22It has exponential size.
1:10:23Just to represent it explicitly
requires exponential time,
1:10:26so losing a linear
factor means u operations
1:10:29now take u squared time
instead of 2 to the u.
1:10:31So it's a big
improvement to do this.
1:10:35The downside of this approach is
that even if you have a version
1:10:40DAG that looks like this,
even if the size of the data
1:10:43structure is staying
normal, staying linear.
1:10:46So potentially, you could
be doubling the size.
1:10:48But we don't know what
this merge operation is.
1:10:49Maybe it just throws
away one of the versions
1:10:51or does something--
1:10:53somehow takes half
the nodes from one
1:10:55side, half the nodes from
the other side maybe.
1:10:57These operations
do preserve size.
1:10:58Then there's no great reason why
it should be a linear slowdown,
1:11:02but it is.
1:11:03OK?
1:11:04So it's all right but not great.
1:11:10And it's the best
general result we know.
1:11:13They also prove a lower bound.
1:13:21So the lower bound is the sum of
effective depths, in total bits of space.
1:11:37OK.
1:11:37What does this mean?
1:11:40So even if this
is not happening,
1:11:42the number of bits
of space you need
1:11:44in the worst case--
this does not
1:11:45apply to every data structure.
1:11:47That's one catch.
1:11:49They give a specific
data structure
1:11:52where you need this much space.
1:11:53So it's similar to
this kind of picture.
1:11:57We won't go into the details.
1:11:58And you need this much space.
1:12:00Now, this is kind of
bad, because if there's
1:12:02u operations, and each of these
is u, that's u squared space.
1:12:06So we actually need a
factor-u blowup in space,
1:12:09it looks like.
1:12:11But to be more precise,
what this means is
1:12:14that you need omega e of
v space, and therefore
1:12:17time overhead per update, if--
1:12:27this is not written
in the paper--
1:12:29queries are free.
1:12:35Implicit here, they just want
to slow down and increase space
1:12:40for the updates you do,
which is pretty natural.
1:12:43Normally you think of queries
as not increasing space.
1:12:46But in order to construct
this lower bound,
1:12:49they actually do
this many queries.
1:12:52So they do e of v queries
and then one update.
1:12:55And they say, oh well, space
had to go up by an extra e of v.
1:12:59So if you only charge
updates for the space,
1:13:02then yes, you have
to lose potentially
1:13:04a linear factor, this effective
depth, potentially u.
1:13:07But if you also
charge the queries,
1:13:09it's still constant
in their example.
1:13:13So open question, for
confluent persistence,
1:13:18can you achieve
constant everything?
1:13:21Constant time and
space overheads,
1:13:27multiplicative
factor per operation,
1:13:33both updates and queries.
1:13:35So if you charge the
queries, potentially you
1:13:37could get constant everything.
1:13:38This is a relatively
new realization.
1:13:43And no one knows
how to do this yet.
1:13:47Nice challenge.
1:13:47I think maybe we'll work on that
in our first problem session.
1:13:50I would like to.
1:13:53Questions about that result?
1:13:54I'm not going to
prove the result.
1:13:56But it is a fancy
rebalancing of those kinds
1:13:59of pictures to get this log.
1:14:10There are other results
I'd like to tell you about.
1:14:32So brand new result--
1:14:34that was from 2003.
1:14:35This is from 2012--
1:14:38no, '11, '11, sorry.
1:14:42It's SODA, which is in January,
so it's a little confusing.
1:14:47Is it '11?
1:14:49Maybe '12.
1:14:50Actually now I'm not sure.
1:14:51It's February already, right?
1:14:54In January, either this
year or last year.
1:15:00It's not as general
a transformation.
1:15:02It's only going to hold in
what's called a disjoint case.
1:15:05But it gets a very good bound--
1:15:07not quite constant,
but logarithmic.
1:15:09OK, logarithmic
would also be nice.
1:15:12Or log-- log n, whatever n is.
1:15:17Pick your favorite n,
number of operations, say.
1:15:22OK.
1:15:25If you assume that confluent
operations are performed only
1:15:39on two versions with
no shared nodes--
1:15:50OK, this would be a way to
forbid this kind of behavior
1:15:53where I concatenate the
data structure with itself.
1:15:56All the nodes are common.
1:15:58If I guarantee that maybe I, you
know, slice this up, slice it,
1:16:01dice it, wherever, and
then re-emerge them
1:16:03in some other order, but
I never use two copies
1:16:06of the same piece, that
would be a valid confluent
1:16:10operation over here.
1:16:12This is quite a
strong restriction
1:16:13that you're not allowed to do that.
1:16:16If you try to, who
knows what happens.
1:16:19Behavior's undefined.
1:16:19So it won't tell you,
oh, those two versions
1:16:21have this node in common.
1:16:22You've got to make
a second copy of it.
1:16:24So somehow you have to guarantee
that confluent operations
1:16:27never overlap.
1:16:29But they can be reordered.
1:16:33Then you can get
order log n overhead.
1:16:39n is the number of operations.
1:16:45I have a sketch
of a proof of this
1:16:46but not very much
time to talk about it.
1:16:48All right.
1:16:49Let me give you a quick picture.
1:16:51In general, the
versions form a DAG.
1:16:55But if you make this assumption,
and you look at a single node,
1:17:00and look at all the versions
where that node appears,
1:17:03that is a tree.
1:17:05Because you're not allowed
to remerge versions
1:17:07that have the same node.
1:17:08So while the big
picture is a DAG,
1:17:11the small picture of a
single guy is some tree.
1:17:17I'm drawing all
these wiggly lines
1:17:18because there are all
these versions where
1:17:20the node isn't changing.
1:17:21This is the entire version DAG.
1:17:23And then some of these nodes--
1:17:26some of these versions,
I should say--
1:17:29that node that we're
thinking about changes.
1:17:31OK, whenever it
branches, it's probably
1:17:33because the actual
node changed, maybe.
1:17:36I don't know.
1:17:37Anyway there are some dots
here where the version changed,
1:17:40some of the leaves,
maybe, that changed.
1:17:41Maybe some of them haven't yet.
1:17:44In fact, let's see.
1:17:48Here where it's changed, it could
be that we destroyed the node.
1:17:51Maybe it's gone from the
actual data structure.
1:17:54But there still may
be versions down here.
1:17:56It's not really a tree.
1:17:57It's a whole DAG of
stuff down there.
1:17:59So that's kind of ugly.
1:18:01Wherever the
node still exists,
1:18:03I guess that is an
actual leaf of the DAG.
1:18:05So those are OK.
1:18:06But as soon as I maybe
delete that node,
1:18:08then there can be a
whole subtree down there.
1:18:11OK.
1:18:12So now if you look at
an arbitrary version,
1:18:15so what we're thinking about
is how to implement reading,
1:18:17let's say.
1:18:18Reading and writing are
more or less the same.
1:18:21I give you a version.
1:18:22I give you a node, and
I give you a field.
1:18:23I want to know, what is the
value of that field, that node,
1:18:26that version?
1:18:27So now where could
a version fall?
1:18:30Well it has to be
in this subtree.
1:18:31Because the node has to exist.
1:18:36And then it's maybe a pointer.
1:18:38A pointer could be to
another node, which
1:18:42also has this kind of picture.
1:18:44They could be overlapping trees.
1:18:46In general, there
are three cases.
1:18:48Either you're lucky, and the
version you're talking about
1:18:51is a version where
the node was changed.
1:18:53In that case, the data is
just stored right there.
1:18:58That's easy.
1:18:59So you could just say, oh,
how did the node change?
1:19:01Oh, that's what the field is.
1:19:02OK, follow the pointer.
1:19:05A slightly harder
case is a version
1:19:08in between two such changes.
1:19:09And maybe these are not updates.
1:19:11So I sort of want to know, what
was the previous version where
1:19:17this node changed
in constant time?
1:19:21It can be done.
1:19:22Not constant time,
actually, logarithmic time,
1:19:25using a data structure
called link-cut trees,
1:19:28another fun black
box for now, which
1:19:31we will cover in lecture
19, far in the future.
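[To make this case concrete, here is a naive sketch of that lookup; it walks up the version tree in O(depth) time, which is the step link-cut trees speed up to logarithmic. The two dictionaries are hypothetical stand-ins for the real pointer structure.]

    def read_field(version, node, field, parent, changes):
        # parent: version -> parent version (None at the root)
        # changes: (version, node) -> {field: value} recorded at that version
        v = version
        while v is not None:
            rec = changes.get((v, node))
            if rec is not None and field in rec:
                return rec[field]    # nearest ancestor where the field changed
            v = parent[v]
        raise KeyError("field never written on this root-to-version path")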
1:19:36OK.
1:19:39Well, that's one case.
1:19:40There's also the case
where maybe a version
1:19:43is down here in a subtree.
1:19:45I guess then the
node didn't exist.
1:19:48Well, all these
things can happen.
1:19:50And that's even harder.
1:19:51It's messy.
1:19:53They use another trick, which
is called fractional cascading,
1:19:59which I'm not even going to
try to describe what it means.
1:20:02But it's got a very cool name.
1:20:04Because we'll be
covering it in lecture 3.
1:20:06So stay tuned for that.
1:20:07I'm not going to say how
it applies to this setting,
1:20:09but it's a necessary
step in here.
1:20:13In the remaining
zero minutes, let
1:20:15me tell you a little bit about
functional data structures.
1:20:17[LAUGHTER]
1:20:20Beauty of time travel.
1:20:24Functional-- I just
want to give you
1:20:31some examples of things that
can be done functionally.
1:20:33There's a whole book about
functional data structures
1:20:35by Okasaki.
1:20:36It's pretty cool.
1:20:38A simple example
is balanced BSTs.
1:20:42So if you just want to get
log n time for everything,
1:20:44you can do that functionally.
1:20:45It's actually really easy.
1:20:46You pick your favorite balance
BST, like red black trees.
1:20:48You implement it top down so you
never follow parent pointers.
1:20:51So you don't need
parent pointers.
1:20:52So then as you make changes
down the tree, you just copy.
1:20:57It's called path copying.
1:20:58Whenever you're about
to make a change,
1:21:00make a copy of that node.
1:21:02So you end up copying all
the change nodes and all
1:21:05their ancestors.
1:21:06There's only log n of them,
so it takes log n time.
1:21:09Clear?
1:21:10Easy.
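[A minimal path-copying sketch in Python, with rebalancing omitted for brevity: insertion copies only the root-to-leaf path, so the old root still sees the old tree while the new root shares every untouched subtree.]

    class Node:
        def __init__(self, key, left=None, right=None):
            self.key, self.left, self.right = key, left, right

    def insert(root, key):
        if root is None:
            return Node(key)
        if key < root.key:    # copy this node, share the untouched subtree
            return Node(root.key, insert(root.left, key), root.right)
        return Node(root.key, root.left, insert(root.right, key))

    v1 = insert(insert(None, 2), 1)
    v2 = insert(v1, 3)    # v1 is unchanged; v2 shares v1's left subtree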
1:21:11It's a nice technique.
1:21:12Sometimes path copying
is very useful.
1:21:14Like link-cut
trees, for example,
1:21:16can be made functional.
1:21:17We don't know what they are,
but they're basically a BST.
1:21:19And you can make
them functional.
1:21:21We use that in a paper.
1:21:23All right.
1:21:23Deques.
1:21:25These are doubly ended queues.
1:21:27So it's like a stack and
a queue and everything.
1:21:29You can insert and delete from
the beginning and the end.
1:21:32People start to know
what these are now,
1:21:34because Python calls them that.
1:21:35But you can also
do concatenation
1:21:41with deques in constant
time per operation.
1:21:43This is cool.
1:21:44Deques are not very
hard to make functional.
1:21:46But you can do deques and
you can concatenate them
1:21:48like we were doing in the figure
that's right behind this board.
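[For flavor, here is the simplest purely functional deque sketch: a pair of immutable front/back stacks built from Python tuples. This easy version only gets amortized bounds and O(n) concatenation; the constant-time catenable deques referred to here are considerably fancier.]

    EMPTY = ((), ())      # (front stack, back stack); () is the empty stack

    def push_front(dq, x):
        f, b = dq
        return ((x, f), b)

    def push_back(dq, x):
        f, b = dq
        return (f, (x, b))

    def pop_front(dq):
        f, b = dq
        if not f:             # front empty: reverse the back stack into it
            while b:
                f = (b[0], f)
                b = b[1]
        if not f:
            raise IndexError("pop from empty deque")
        x, f = f
        return x, (f, b)

    dq = push_back(push_back(push_front(EMPTY, 1), 2), 3)   # holds 1, 2, 3
    x, dq2 = pop_front(dq)    # x == 1, and dq is still a valid old version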
1:21:51Constant time split
is a little harder.
1:21:53That's actually one
of my open problems.
1:21:56Can you do lists with split and
concatenate in constant time--
1:22:01functionally or confluently,
persistently, or whatever?
1:22:05Another example-- oh, you
can do a mix of the two.
1:22:08You can get log n search and
constant time deque operations.
1:22:12Next, you can do tries.
1:22:14So a trie is a tree
with a fixed topology.
1:22:17Think of it as a directory tree.
1:22:20So maybe you're
using Subversion.
1:22:21Subversion has time
travel operations.
1:22:23You can copy an entire
subtree from one version
1:22:26and stick it into a new
version, another version.
1:22:30So you get a version DAG.
1:22:32It's a confluently
persistent data structure--
1:22:34not implemented optimally,
because we don't necessarily
1:22:37know how.
1:22:38But there is one paper.
1:22:40This actually came from the open
problem section of this class
1:22:43four years ago, I think.
1:22:45It's with Eric Price
and Stefan Langerman.
1:22:49You can get very good results.
1:22:50I won't write them down
because it takes a while.
1:22:52Basically a factor of log the
degree of the nodes
1:22:56to get functional, and
you can be even fancier
1:22:59and get slightly better
bounds like log log the degree
1:23:02and get confluently persistent
with various tricks,
1:23:05including using all of
these data structures.
1:23:07So if you want to implement
Subversion optimally,
1:23:09that is known how to be done but
hasn't actually been done yet.
1:23:14Because there are those
pesky constant factors.
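[As a minimal illustration of the Subversion-style operation, not the optimal structure from the paper: in an immutable dict-based trie, "copy this subtree into another version" is just path copying down to the target plus sharing a pointer to the copied subtree.]

    def set_subtree(trie, path, subtree):
        # return a new version with subtree installed at path,
        # sharing everything off the path with the old version
        if not path:
            return subtree
        new = dict(trie)                  # copy one node along the path
        new[path[0]] = set_subtree(trie.get(path[0], {}), path[1:], subtree)
        return new

    v1 = {'usr': {'bin': {'vi': {}}}, 'home': {}}
    v2 = set_subtree(v1, ('home', 'bin'), v1['usr']['bin'])  # copy a subtree
    # v1 is untouched; v2['home']['bin'] is shared with v1['usr']['bin']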
1:23:18I think that's all.
1:23:19What is known about functional
is there's a log n separation.
1:23:23You can be log n
away from the best.
1:23:26That's the worst
separation known,
1:23:30between functional and just
a regular old data structure.
1:23:33It'd be nice to improve that.
1:23:34Lots of open problems here.
1:23:35Maybe we'll work
on them next time.