source: valtobtest/subversion-1.6.2/notes/fs-improvements.txt @ 3

Last updated: $Date: 2001/07/02 21:46:52 $

Five things that need to happen in the filesystem:

   a) Switch to a sane node key system (issue #654).

   b) We need to operate on file contents without holding the entire
      contents in RAM.  Berkeley DB gives us tools to operate on
      regions within a value, we just need to use them.

   c) We need to do reverse-delta storage in the filesystem (with
      checksums).

   d) Record (potentially) multiple parents per change.

   e) Implement atomic renames.
Some thoughts on them:


a) Switch to a sane node key system (issue #654)
================================================

For more background, read the archived dev list thread with subject
"Maintaining NodeID sanity":

   http://subversion.tigris.org/servlets/ReadMsg?msgId=72265&listName=dev

Here's the plan, mostly by Bill Tutt and Branko, with assists from
Karl and Mike:

Note:

   - This is described in terms of a BDB implementation.  Translate to
     RDB-speak in your head as necessary :-).

   - This proposal supports one copy (a.k.a. branch) operation.  You
     can call it anything you want: "copy", "branch", "split",
     "pumpkin", whatever.  We're calling it "copy" :-).  It is the SCM
     branch operation.

First, a node's key consists of three parts:

   nodeID.copyID.txnID

The "txnID" is really just a unique identifier, but we happened to
inherit it from the fs txn, and we needed a name for that portion,
so... :-) Also, the copyID could theoretically live in the node's
value instead of in its key, but it feels right to put it in the
key.  (How's that for a powerful argument?)  For nodes that are not
copies, the copyID is just "0" or some other special value.

There are no more mutability flags -- mutability is determined by
examining whether the node key's txnID matches the txn in question.
Therefore, there is no stabilization walk at commit time.
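
To make the scheme concrete, here is a small Python sketch of key
parsing and the flag-free mutability test.  The helper names are ours,
invented for illustration -- they are not the actual svn_fs API:

```python
def parse_key(key):
    """Split a node key 'nodeID.copyID.txnID' into its three parts."""
    node_id, copy_id, txn_id = key.split('.')
    return node_id, copy_id, txn_id

def is_related(key_a, key_b):
    """Two nodes are related iff their nodeID portions match."""
    return parse_key(key_a)[0] == parse_key(key_b)[0]

def is_mutable(key, current_txn_id):
    """No mutability flag needed: a node is mutable within a txn iff
    its txnID portion is that txn's ID."""
    return parse_key(key)[2] == current_txn_id
```

Since mutability falls straight out of the key comparison, nothing has
to be rewritten at commit time -- which is the whole point of dropping
the stabilization walk.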

When we commit a change to a node, the nodeID and copyID stay the
same, only the txnID changes (actually there is a circumstance where
the copyID can change, but more on that later).  The new txnID is not
necessarily greater than the old one -- sometimes txns get committed
out of order! -- but anyway it's different from the old txnID, and the
same new txnID is used for all other changes in that commit.

  [Greg and Karl disagree on whether to use integer types or `char *'
   for the parsed representation of IDs.  See the dev list thread
   that starts here for the details of this:
   http://subversion.tigris.org/servlets/ReadMsg?msgId=72277&listName=dev
  ]

After a commit, the txn record in the transactions table does not go
away; instead, it is updated so it now maps the txnID to the new
revision.  This allows us to determine the revision a node was
committed in, in constant time, given the node's key.

Each new version of a node stores the node's predecessor (and does not
store copyfrom history).  When node "5.0.fzb" is committed as a
successor to "5.0.qnr", the new node's value stores a reference to
"5.0.qnr".

What about copies?

As in the current fs, copies are shallow.  The top of the copied tree
gets a new node, but the nodes below it are shared with the copy
source.  The new node key keeps the same nodeID, but gets a new txnID,
and gets the next unique copyID (replacing the current copyID, if
any).

In the example below, we copy `A' to `Y'.  Node keys for `A', `Y', and
`bar' are given in parentheses:

            BEFORE THE COPY               AFTER THE COPY
               <root>                         <root>
              /  |                           /  |  \
            /    |                         /    |    \
          /      |                       /      |      \
        A(3.0.m) B                     A(3.0.m) B       Y(3.jfb.p)
        / \      |                     / \      |      / \
     foo  bar   qux                 foo  bar   qux   foo  bar
       (5.0.abc)                       (5.0.abc)        (5.0.abc)

Let's flesh out the example with some commits under A and Y.  To save
space, the colons represent time flow, not directory hierarchy --
imagine they're the Z axis coming out of the screen or something :-).

                         <root>
                        /  |  \
                      /    |    \
                    /      |      \
                 A(3.0.m)  B       Y(3.jfb.p)
                  / \      |      / \
               foo  bar   qux   foo  bar
                 (5.0.abc)         (5.0.abc)
                     :                 :
                     :                 :
                 (5.0.ejk)         (5.jfb.mht)
                     :                 :
                     :                 :
                 (5.0.xyz)         (5.jfb.uuu)
                     :                 :
                     :                 :
                 (5.0.r2d2)        (5.jfb.2pz)
                     :                 :
                     :                 :
                 (5.0.c3po)        (5.jfb.rdt)

Let's see how easy it is to answer various questions in this system:

Obviously, is_related() is simple -- just check that the nodeID
portion is the same.  You may not know if the relationship is cousins
vs ancestor/descendant, but you know whether or not they're related.

Asking is_predecessor(A,B) is also easy.  Just fetch the predecessor
pointer from B and see if it's A.

Finding out what revisions a node changed in is proportional to the
number of changes the node has undergone: start at the node, walk back
through its predecessors, and for each txnID, look up the revision
number via the transactions table (as described earlier).
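
As an illustration, that walk can be sketched in Python, with plain
dicts standing in for the BDB tables (the names and table shapes here
are our assumptions, not the real schema):

```python
def revisions_changed(key, predecessor, transactions):
    """Walk a node's predecessor chain, mapping each txnID to the
    revision it was committed in.  `predecessor` maps node key ->
    previous node key (or None); `transactions` maps txnID ->
    revision.  Cost is proportional to the number of changes."""
    revs = []
    while key is not None:
        txn_id = key.rsplit('.', 1)[1]   # the txnID portion of the key
        revs.append(transactions[txn_id])
        key = predecessor.get(key)
    return revs
```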

During this walk, you can always tell when you encounter a node that
results from a copy, because the copyID portion will either change or
disappear entirely.  When this happens, you know one of two things is
true: either the previous node in the walk was the top of a copied
tree, or *this* node (the one with the different copyID) was one of
the unchanged nodes down inside a copied tree.

One might think "Oh, we'll distinguish between these cases by walking
up the parents of the node, and seeing if we eventually encounter the
old copyID in one of the parents.  That would tell us that we're in
the second case.  If we never encounter it, that tells us we're in the
first."

Not so fast, Spidey.  We don't have parent pointers -- this is a
predecessor walk by node keys; we can't just walk up the parent path
like that.  Fortunately, copyIDs are generated from a new `copies'
table, which maps unique copyIDs onto (REV COPY_DST_PATH
COPY_DST_NODEKEY).  We look up the rev/path for the old copyID,
convert it to a node key, and compare it to the node key we're
currently on.  Voilà!  Actually, we're not sure we'll store all of
those in the copies table; it may boil down to just the DST_NODEKEY or
just the other two, we'll see.

Writing those predecessor walk loops well is left as an exercise for
the reader (umm, more likely for the writers, heh), but you can see
that the necessary questions can be answered efficiently.

Note that, like txnIDs, copyIDs are just unique numbers.  They may be
increasing monotonically in the `copies' table, but (due to the fact
that txn A may be started before txn B yet be committed afterwards)
it's quite possible that a higher copyID will become visible in the
revision history before a lower one.

The one thing we can know is that a lower copyID can never be a
branchwise descendant of a higher copyID, since the lower one must
have been committed before any of its descendants' txns were started,
of course.  I'm not sure this minimal inference will ever be useful,
but it's all we've got.  Anyway, right now, we only ever need ask
if two copyIDs are equal -- we don't order them.

Okay, now what if A already had copied trees underneath it when we
copied it to Y?  Suppose `bar' is down in one of those subdirectories?

Then when we're committing on /Y/.../bar, we watch for copyIDs as we
walk down from root, like usual, and if we're in a copy underneath a
copy, we bubble down a _new_ copyID, distinct from both Y's and B's,
starting from that point.  Justification: a branch of a branch is
still a branch, so it gets its own copyID.

At this point, I'm going to hand-wave on describing the
copy-under-copy behavior any further.  I think the above is enough to
see that there are no insurmountable problems here, and that the
filesystem will now have node keys that [lazily] reflect actual
branching patterns.

A few more notes:

The is_ancestor(X,Y) question is really only asked during merge(),
that is, during commits.  If entry "somedir/blah" has NodeID Y in the
txn being committed, and NodeID X in the head tree being merged into
that txn, then we need to make sure X is an ancestor of Y, so that
when the commit replaces X with Y, we know we're not losing any
history.

Therefore, we think we can get away with storing the ancestors in a
distributed fashion, as a chain: each node knows its immediate
predecessor (or "predecessors", in the future), and you walk back
until you have your answer.  In real life, we won't usually be walking
back too far, and the search can be further bounded by the revision
range implied by the two nodes.  If profiling proves these walks to be
a bottleneck, we can start caching the results of such walks in a new
table whose keys are node keys, and values are _full_ ancestor lists.
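
A sketch of that chain walk, again with a dict standing in for the
stored predecessor references (hypothetical names, and without the
revision-range bound a real implementation would add):

```python
def is_ancestor(x, y, predecessor):
    """Walk back from Y through the predecessor chain, looking for X.
    `predecessor` maps node key -> immediate predecessor (or None)."""
    key = predecessor.get(y)
    while key is not None:
        if key == x:
            return True
        key = predecessor.get(key)
    return False
```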



b) Operate on portions of files efficiently.
============================================

   [still pondering this section]

We should take advantage of Berkeley DB's partial record stuff, all
the way up to the top of the svn fs interfaces.

   - dag_node_t gets two new fields: contents_offset and contents_len.
     They apply to the node's cache of the contents, not the header or
     proplist.

   - svn_fs__get_node_rev_contents() takes offset and len arguments,
     fetches only that data.  The dag_node_t will remember the offset
     and len.

   - svn_fs__put_node_rev_contents() takes offset and len args as
     well.

   - change set_node_revision() accordingly.

   - ... todo thinking here ...

So now, whenever you read or write a node revision, you are operating
on a range.  There will be some way to say "I mean the whole thing",
of course, so it won't be necessary to know the size in advance.

Thought: possibly we should stop storing data in the dag_node_t
itself, and just return the data in a void pointer passed to
svn_fs__get_node_rev_contents().  Still pondering.
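
To illustrate the intended interface, here is a toy in-memory
stand-in for range-based reads and writes.  Berkeley DB's
DB_DBT_PARTIAL flag on get/put is the real mechanism underneath; none
of these names are actual svn_fs functions:

```python
class NodeContents:
    """Toy stand-in for partial record access: read or write an
    (offset, length) range of a node revision's contents without
    materializing the whole value."""
    def __init__(self, data=b""):
        self.data = bytearray(data)

    def get(self, offset=0, length=None):
        if length is None:               # "I mean the whole thing"
            return bytes(self.data[offset:])
        return bytes(self.data[offset:offset + length])

    def put(self, new_bytes, offset=0):
        end = offset + len(new_bytes)
        if end > len(self.data):         # writing past the end grows the value
            self.data.extend(b"\0" * (end - len(self.data)))
        self.data[offset:end] = new_bytes
```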



c) Reverse-delta storage.
=========================

The naive way to recover an old text is:

   retrieve_node_rev (N)
   {
     grab_node_revision (&N);

     if (is_fulltext (N))
       return N;
     else if (is_shared (N))
       return retrieve_node_rev (get_sharee (N));
     else if (is_svndiff (N))
       return svnpatch (get_svndiff (N), retrieve_node_rev (get_base (N)));
   }

(Loose pseudo-code, obviously, and the recursion could be a loop, but
you get the idea.)

The trouble with this is that it constructs and patches each
intermediate revision.  That'll probably be i/o bound, and anyway much
of the intermediate material may not end up in the final target, in
which case reconstructing it was a waste of time.

What we really want is a way to compose a bunch of svndiffs, and then
apply that composition to the current head, to give us the older
revision in one step (well, one application anyway).  Both the
composition and the final application need to happen in a streamy or
windowed fashion -- we shouldn't have to hold the entire diff in
memory, nor the entire source, nor the target.

Here's a way to do this:

An svndiff is a series of instructions that are followed to
reconstruct the target.  There are three possible instructions:

   a) Insert X bytes of new text into the TARGET, and I'm giving you
      those bytes right here.
   b) Copy N bytes, starting from offset F in the SOURCE, into the
      TARGET.
   c) Copy N bytes, starting from offset F in the TARGET, into the
      TARGET.

(Note that (c) can actually run past the current end of the target, as
long as, by the time you get there, the target is longer.)
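
A minimal evaluator for the three instruction kinds makes the
semantics concrete, including the overlapping (c) case; the tuple op
encoding is invented for illustration, not the real svndiff wire
format:

```python
def apply_ops(source, ops):
    """Evaluate a list of svndiff-style instructions against SOURCE.
    Op encoding (ours):
      ('n', data)          -- (a) insert new text
      ('s', off, length)   -- (b) copy from SOURCE
      ('t', off, length)   -- (c) copy from the TARGET built so far"""
    target = bytearray()
    for op in ops:
        if op[0] == 'n':
            target += op[1]
        elif op[0] == 's':
            target += source[op[1]:op[1] + op[2]]
        else:
            # byte-at-a-time, so the copy may run past the current end
            # of the target as long as the target grows under it
            for i in range(op[2]):
                target.append(target[op[1] + i])
    return bytes(target)
```

The second test below shows the run-past-the-end trick: a 2-byte seed
plus a 6-byte self-copy expands into 8 bytes of repeated pattern.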

To compose two svndiffs...

   ...and I hate to tantalize you, but I'm late and have to run now,
      so I'll try to finish this tomorrow... crimminy... The quick
      summary is, we build a new svndiff (the composition of all the
      intermediates), and as it gets too large, we windowify as we go
      and put each window temporarily in the database; this makes the
      composition as a whole less efficient, but means that at any
      given time we don't have to have the whole thing in memory.  The
      arithmetic for offset-adjustment is fairly straightforward even
      when one has to offload windows, I believe.  It's nice that the
      source is already in the db and we get offset+length style
      operations from Berkeley naturally anyway.  Branko or anyone,
      feel free to continue this recipe and see if you can take it
      somewhere before I get in tomorrow morning...  -kff


------------------------------------------------------------------------
   Notes from JimB about optimizing the in-repository delta generation
   to make deltas that can be composed more quickly:

I talked about this with Karl on the phone, and gave pretty bad
explanations of my thinking; I'll try to do better today.  This will
also provide some background for other readers.

I'm told that RCS reconstructs older file revisions by walking the
list of diffs, ignoring the actual text, and simply recording the line
numbers and sizes of the substitutions.  Then, it can run over that
data and do arithmetic to construct a single `composed' diff against
the youngest revision that reconstructs the older revision.  Then,
you apply that composed diff to get the revision you wanted.

For example, if your fulltext is:

rev 3:  abcdefghijklmnopqrst

and your deltas are:

rev 2:  replace 3--6 with "howdy"    (yielding abchowdyghijklmnopqrst)
rev 1:  replace 6--8 with " are you" (yielding abchow are youghijklmnopqrst)

then the RCS algorithm would gather this info:

rev 2:  replace 3--6 with 5 chars (call them X)
rev 1:  replace 6--8 with 8 chars (call them Y)

Now, without looking at any of the text, it can compose the two deltas
to get the following delta, yielding rev 1:

        replace 3--6 with range 0--3 from X
        replace 6--6 with range 0--8 from Y (i.e. insert Y at 6)

If we then apply this `composed' delta to the original text, we get:

        abchow are youghijklmnopqrst
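
The worked example above can be checked mechanically; this snippet
replays both the step-by-step deltas and the composed delta against
rev 3, and they agree:

```python
def replace(text, start, end, repl):
    """One RCS-style substitution: swap text[start:end] for repl."""
    return text[:start] + repl + text[end:]

X, Y = "howdy", " are you"
rev3 = "abcdefghijklmnopqrst"
rev2 = replace(rev3, 3, 6, X)     # step-by-step: rev 3 -> rev 2
rev1 = replace(rev2, 6, 8, Y)     # step-by-step: rev 2 -> rev 1

# The composed delta, applied directly to rev 3 without ever
# materializing rev 2:
composed = replace(replace(rev3, 3, 6, X[0:3]), 6, 6, Y[0:8])
```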

The point of all this is that you don't need to mess around with
actual text until the very end.  Until that point, the amount of work
depends on the amount of change, not on the amount of text involved.
And when you do get around to actually assembling text, the amount of
work depends on the size of the output file --- because you're only
touching each piece of text once --- and only weakly on the amount of
change.

Our current svndiff format frustrates this somewhat by compressing the
new text, as well as pulling in pieces of the original.  The
compression process produces a lot of delta ops that deal with small
pieces of text.  (Or at least, I expect it does...)  So even if the
change is something simple --- replacing a single block of text in the
middle of the file, say --- you end up with an svndiff with a lot of
ops, mostly concerned with building the replacement text from pieces
of itself.  This is great, except that having lots of small ops
increases the amount of work the delta composition phase needs to do.
In fact, if the ops usually deal with really small pieces of text ---
a few dozen bytes or so --- I expect it'd be faster to just throw the
actual text around.  Memcpy is pretty optimized on real systems; you
could copy a lot of bytes in the time it would take to do funky
intersections and adjustments on a list of ops talking about those
bytes, so those ops had better refer to large blocks of bytes.

I'm not sure what to do with that.  It almost seems like we want the
text delta computation algorithm to optimize deltas for network
transmission and deltas for storage differently.


------------------------------------------------------------------------
  Notes from Brane: Delta composition

Despite JimB's concerns, it turns out that delta composition is
straightforward. The basic idea is that combining two deltas is
equivalent to applying the second delta to a representation of the
first delta's result. Bear with me while I shamelessly abuse what
little I remember about linear algebra.

Given two deltas, A and B, and the source stream S, we want to
construct a composed delta, AB, that converts S to the target stream
T. Let's borrow notation from linear algebra and write this
transformation like this:

    T = AB(S)

Abusing algebra some more, I'll assume that a delta behaves like any
other linear transformation; therefore,

    AB(S) = B(A(S))

and since I'm not about to develop any rigorous proofs here, I'll
just say that it follows from the above that

    T = B(S'), where S' = A(S)

A small note here: Every delta is actually an n-tuple of delta
operations, represented by what I'll call the delta operators [n]
(for new data), [s] (for source copies) and [t] (for target
copies). [s] and [t] create bits of the target stream by operating on
contiguous parts of the source and (existing) target stream,
respectively; while [n] does the same by operating on raw data.


Now, what we actually want to do is derive some form of AB (which, by
the way, does not have a unique representation, since we're not trying
to find the optimal ("normalized") transform) that doesn't in any way
rely on the value of S'. We do that by building a representation of S'
that relies only on S, and any new data introduced by the [n]
operators in A. That's possible because any [t] ops in A merely copy
parts of S' that have been previously defined by [s] and [n]
ops. Transforming A by (recursively) replacing all [t] ops with the
equivalent [s] and [n] ops gives us exactly such a representation,
which I'll call A'. [*]

Building AB from B and A' is trivial: traverse the list of delta ops
in B, copying all [n] and [t] ops into the result; for [s] ops, copy
the range of ops from A' that define the appropriate range in S'. For
some of the copies, the first or last op from the range in A' will
have to be split, and the first op in the copy range can sometimes be
merged with the previous op in AB.


Of course, stopping here could give a very sub-optimal AB, because it
could contain many duplicate copies of the same op ranges from A'. We
fix this by doing exactly the opposite of the A->A' transformation: by
transforming [s] ops from B into [t] ops. We do this by recording the
source and target of each copy from A' to AB and, whenever the [s]
from B describes a range in T that was already defined, converting
that into a [t] instead [**]. Unlike the A->A' transform, we can't remove
all copies from A' to AB (we can only do that when AB doesn't refer to
S at all), but we can significantly reduce the number of such copies.

The resulting AB will usually not be the optimal delta from S to T,
because we will never actually look at S while constructing AB.


Summarizing the above, we get the following delta composition
algorithm:

    ;; X is the set of byte ranges in T defined by copies from S'
    ;; Y is the current offset in T, defined by the ops in AB
    foreach OP in B:
      if (OP = [t]) or (OP = [n]):
        copy OP to AB
      else
        R = (set of ops from A that define S'[OP.offset, OP.length])
[**]    using X, find the optimal set O of [t] ops to replace part of R
        foreach OP' in R:
          if OP' is in (R - O):
            if (OP' = [t]):
[*]           replace OP' with equivalent [s] and [n] ops
            copy OP' to AB
          else
[**]        copy OP's equivalent in O to AB
        insert S'[OP.offset, OP.length]->[Y, OP.length] into X


This algorithm ignores details such as op splitting and merging,
ensuring every op from O gets copied exactly once, helper indexes,
etc..., but those are implementation details.
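
Here is a runnable Python sketch of the two transforms described
above -- the A -> A' normalization and the composition of B with A'.
It omits the reverse [s] -> [t] optimization, assumes no [t] op runs
past the current end of the target, and uses our own op encoding
(('n', data), ('s', off, len), ('t', off, len)); it is a sketch of the
recipe, not the real implementation:

```python
def op_len(op):
    return len(op[1]) if op[0] == 'n' else op[2]

def slice_ops(ops, off, length):
    """Return the sub-ops of `ops` that define target range
    [off, off+length), splitting boundary ops as needed."""
    picked, pos = [], 0
    for op in ops:
        n = op_len(op)
        lo, hi = max(pos, off), min(pos + n, off + length)
        if lo < hi:
            if op[0] == 'n':
                picked.append(('n', op[1][lo - pos:hi - pos]))
            else:
                picked.append((op[0], op[1] + (lo - pos), hi - lo))
        pos += n
    return picked

def normalize(a_ops):
    """A -> A': replace each [t] op with the [s]/[n] ops that defined
    the copied range, so A' refers only to S and new data."""
    out = []
    for op in a_ops:
        if op[0] == 't':
            out.extend(slice_ops(out, op[1], op[2]))
        else:
            out.append(op)
    return out

def compose(a_ops, b_ops):
    """Build AB from B and A': [n]/[t] ops copy through; each [s] op
    expands into the A'-ops defining that range of S' = A(S)."""
    a_prime = normalize(a_ops)
    ab = []
    for op in b_ops:
        if op[0] == 's':
            ab.extend(slice_ops(a_prime, op[1], op[2]))
        else:
            ab.append(op)
    return ab

def apply_ops(source, ops):
    """Reference evaluator, used only to check compose(): T = ops(S)."""
    t = ''
    for op in ops:
        if op[0] == 'n':
            t += op[1]
        elif op[0] == 's':
            t += source[op[1]:op[1] + op[2]]
        else:
            for i in range(op[2]):
                t += t[op[1] + i]
    return t
```

The key property, checked below, is that applying compose(A, B) to S
yields the same text as applying A and then B.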


------------------------------------------------------------------------
  Notes from Brane: Delta storage

O.K., now that delta composition is out of the way, let's talk a bit
about storing and retrieving deltified text from the filesystem. I'll
just jot down a few thoughts about how this could be done, assuming
one of two possible models of storage management in the filesystem:

  a) all data store and retrieve operations are synchronous; or,
  b) deltification can run in the background.


1) Storing new data

New data arrives either as fulltext (from add) or as a delta from an
existing version (copy, modify).

  a) if the new data is a delta, convert it to fulltext (possibly
     combining several existing deltas in the process). Store the
     fulltext in the tip, and replace the previous tip (which, by
     induction, contains fulltext) with a delta from the current tip.

  b) store the fulltext or delta in the tip and mark it for async
     modification. Do a) in the background.

2) Retrieving old data

  a) If the data is a delta, follow the delta references (combining
     the deltas) until a fulltext is found; apply the combined delta
     to get the required fulltext.

     If the combined delta reduces to a no-op (the two fulltexts are
     the same), store the fulltext in the younger of the two nodes and
     replace the older node's data with a "same" note.

  b) Same as para 1 of a), then mark the node for async
     modification. In the background, find the diff between the two
     fulltexts. If they're equal, do para 2 of a). Otherwise, if the
     diff is smaller than the current diff in the node, replace the
     current representation. ("Smaller" could be construed as "more
     optimal" -- it would make sense to take into account the number
     of delta combinations that could be skipped by replacing the
     current representation when comparing sizes.)
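
A toy model of storage scheme 1a) -- the tip always holds fulltext,
and older revisions are re-deltified against their successor -- might
look like the following sketch.  The naive prefix/suffix differ here
stands in for real svndiff generation, and all names are ours:

```python
def make_delta(base, target):
    """Toy differ: emit a common-prefix copy, one block of new data,
    and a common-suffix copy (a stand-in for svndiff output)."""
    p = 0
    while p < min(len(base), len(target)) and base[p] == target[p]:
        p += 1
    s = 0
    while (s < min(len(base), len(target)) - p
           and base[len(base) - 1 - s] == target[len(target) - 1 - s]):
        s += 1
    ops = []
    if p:
        ops.append(('s', 0, p))                       # copy common prefix
    if target[p:len(target) - s]:
        ops.append(('n', target[p:len(target) - s]))  # new data
    if s:
        ops.append(('s', len(base) - s, s))           # copy common suffix
    return ops

def apply_delta(base, ops):
    out = ''
    for op in ops:
        out += op[1] if op[0] == 'n' else base[op[1]:op[1] + op[2]]
    return out

class Store:
    """Reverse-delta store: the tip of a line of history always holds
    fulltext; each older node holds a delta against its successor."""
    def __init__(self):
        self.reps = {}  # key -> ('full', text) | ('delta', succ_key, ops)

    def put(self, key, text, prev_key=None):
        self.reps[key] = ('full', text)
        if prev_key is not None:
            old = self.get(prev_key)
            # re-deltify the previous tip against the new tip (step 1a)
            self.reps[prev_key] = ('delta', key, make_delta(text, old))

    def get(self, key):
        rep = self.reps[key]
        if rep[0] == 'full':
            return rep[1]
        _, succ_key, ops = rep
        # follow the delta reference toward the tip, then apply
        return apply_delta(self.get(succ_key), ops)
```

Retrieval walks toward the tip and applies deltas on the way back
down; a real implementation would combine the deltas first, as
described in 2a) above.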


d) Multiple parents per change
==============================

This is necessary for -- at least -- accurate Merge Tracking, to
allow for accurate calculation of the change set graph.  Use cases
include:

1) Avoiding repeated merges when performing cyclic merging
   (e.g. branch A -> B -> A).

2) Sussing explicit merge info when a change in merge info occurs
   during a copy operation (e.g. to avoid paying attention to implied
   merge info from the copy source).

Mercurial (hg) does this.


e) Atomic renames
=================

This may just be a means to an end?  Mercurial (hg) gets by without
this, but that may be due to its distributed repository implementation.

It seems we can handle a lot of the desired use cases (see
notes/tree-conflicts.txt) without true renames.

Exploratory work has been started on the fs-atomic-renames branch.