source: valtobtest/subversion-1.6.2/notes/fs-improvements.txt @ 3

Last updated: $Date: 2001/07/02 21:46:52 $

Five things that need to happen in the filesystem:

   a) Switch to a sane node key system (issue #654).

   b) We need to operate on file contents without holding the entire
      contents in RAM.  Berkeley DB gives us tools to operate on
      regions within a value, we just need to use them.

   c) We need to do reverse-delta storage in the filesystem (with
      checksums).

   d) Record (potentially) multiple parents per change.

   e) Implement atomic renames.
Some thoughts on them:


a) Switch to a sane node key system (issue #654)
================================================

For more background, read the archived dev list thread with subject
"Maintaining NodeID sanity":

   http://subversion.tigris.org/servlets/ReadMsg?msgId=72265&listName=dev

Here's the plan, mostly by Bill Tutt and Branko, with assists from
Karl and Mike:

Note:

   - This is described in terms of a BDB implementation.  Translate to
     RDB-speak in your head as necessary :-).

   - This proposal supports one copy (a.k.a. branch) operation.  You
     can call it anything you want: "copy", "branch", "split",
     "pumpkin", whatever.  We're calling it "copy" :-).  It is the SCM
     branch operation.

First, a node's key consists of three parts:

   nodeID.copyID.txnID

The "txnID" is really just a unique identifier, but we happened to
inherit it from the fs txn, and we needed a name for that portion,
so... :-) Also, the copyID could theoretically live in the node's
value instead of in its key, but it feels right to put it in the
key.  (How's that for a powerful argument?)  For nodes that are not
copies, the copyID is just "0" or some other special value.

There are no more mutability flags -- mutability is determined by
examining whether the node key's txnID matches the txn in question.
Therefore, there is no stabilization walk at commit time.
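
To make the scheme concrete, here is a small Python sketch of key
parsing and the flag-free mutability test.  The helper names are ours,
invented for illustration -- they are not the actual svn_fs API:

```python
def parse_key(key):
    """Split a node key 'nodeID.copyID.txnID' into its three parts."""
    node_id, copy_id, txn_id = key.split('.')
    return node_id, copy_id, txn_id

def is_related(key_a, key_b):
    """Two nodes are related iff their nodeID portions match."""
    return parse_key(key_a)[0] == parse_key(key_b)[0]

def is_mutable(key, current_txn_id):
    """No mutability flag needed: a node is mutable within a txn iff
    its txnID portion is that txn's ID."""
    return parse_key(key)[2] == current_txn_id
```

Since mutability falls straight out of the key comparison, nothing has
to be rewritten at commit time -- which is the whole point of dropping
the stabilization walk.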

When we commit a change to a node, the nodeID and copyID stay the
same, only the txnID changes (actually there is a circumstance where
the copyID can change, but more on that later).  The new txnID is not
necessarily greater than the old one -- sometimes txns get committed
out of order! -- but anyway it's different from the old txnID, and the
same new txnID is used for all other changes in that commit.

  [Greg and Karl disagree on whether to use integer types or `char *'
   for the parsed representation of IDs.  See the dev list thread
   that starts here for the details of this:
   http://subversion.tigris.org/servlets/ReadMsg?msgId=72277&listName=dev
  ]

After a commit, the txn record in the transactions table does not go
away; instead, it is updated so it now maps the txnID to the new
revision.  This allows us to determine the revision a node was
committed in, in constant time, given the node's key.

Each new version of a node stores the node's predecessor (and does not
store copyfrom history).  When node "5.0.fzb" is committed as a
successor to "5.0.qnr", the new node's value stores a reference to
"5.0.qnr".

What about copies?

As in the current fs, copies are shallow.  The top of the copied tree
gets a new node, but the nodes below it are shared with the copy
source.  The new node key keeps the same nodeID, but gets a new txnID,
and gets the next unique copyID (replacing the current copyID, if
any).

In the example below, we copy `A' to `Y'.  Node keys for `A', `Y', and
`bar' are given in parentheses:

            BEFORE THE COPY               AFTER THE COPY
               <root>                         <root>
              /  |                           /  |  \
            /    |                         /    |    \
          /      |                       /      |      \
        A(3.0.m) B                     A(3.0.m) B       Y(3.jfb.p)
        / \      |                     / \      |      / \
     foo  bar   qux                 foo  bar   qux   foo  bar
       (5.0.abc)                       (5.0.abc)        (5.0.abc)

Let's flesh out the example with some commits under A and Y.  To save
space, the colons represent time flow, not directory hierarchy --
imagine they're the Z axis coming out of the screen or something :-).

                         <root>
                        /  |  \
                      /    |    \
                    /      |      \
                 A(3.0.m)  B       Y(3.jfb.p)
                  / \      |      / \
               foo  bar   qux   foo  bar
                 (5.0.abc)         (5.0.abc)
                     :                 :
                     :                 :
                 (5.0.ejk)         (5.jfb.mht)
                     :                 :
                     :                 :
                 (5.0.xyz)         (5.jfb.uuu)
                     :                 :
                     :                 :
                 (5.0.r2d2)        (5.jfb.2pz)
                     :                 :
                     :                 :
                 (5.0.c3po)        (5.jfb.rdt)

Let's see how easy it is to answer various questions in this system:

Obviously, is_related() is simple -- just check that the nodeID
portion is the same.  You may not know if the relationship is cousins
vs ancestor/descendant, but you know whether or not they're related.

Asking is_predecessor(A,B) is also easy.  Just fetch the predecessor
pointer from B and see if it's A.

Finding out what revisions a node changed in is proportional to the
number of changes the node has undergone: start at the node, walk back
through its predecessors, and for each txnID, look up the revision
number via the transactions table (as described earlier).
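
As an illustration, that walk can be sketched in Python, with plain
dicts standing in for the BDB tables (the names and table shapes here
are our assumptions, not the real schema):

```python
def revisions_changed(key, predecessor, transactions):
    """Walk a node's predecessor chain, mapping each txnID to the
    revision it was committed in.  `predecessor` maps node key ->
    previous node key (or None); `transactions` maps txnID ->
    revision.  Cost is proportional to the number of changes."""
    revs = []
    while key is not None:
        txn_id = key.rsplit('.', 1)[1]   # the txnID portion of the key
        revs.append(transactions[txn_id])
        key = predecessor.get(key)
    return revs
```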

During this walk, you can always tell when you encounter a node that
results from a copy, because the copyID portion will either change or
disappear entirely.  When this happens, you know one of two things is
true: either the previous node in the walk was the top of a copied
tree, or *this* node (the one with the different copyID) was one of
the unchanged nodes down inside a copied tree.

One might think "Oh, we'll distinguish between these cases by walking
up the parents of the node, and seeing if we eventually encounter the
old copyID in one of the parents.  That would tell us that we're in
the second case.  If we never encounter it, that tells us we're in the
first."

Not so fast, Spidey.  We don't have parent pointers -- this is a
predecessor walk by node keys; we can't just walk up the parent path
like that.  Fortunately, copyIDs are generated from a new `copies'
table, which maps unique copyIDs onto (REV COPY_DST_PATH
COPY_DST_NODEKEY).  We look up the rev/path for the old copyID,
convert it to a node key, and compare it to the node key we're
currently on.  Voilà!  Actually, we're not sure we'll store all of
those in the copies table; it may boil down to just the DST_NODEKEY or
just the other two, we'll see.

Writing those predecessor walk loops well is left as an exercise for
the reader (umm, more likely for the writers, heh), but you can see
that the necessary questions can be answered efficiently.

Note that, like txnIDs, copyIDs are just unique numbers.  They may be
increasing monotonically in the `copies' table, but (due to the fact
that txn A may be started before txn B yet be committed afterwards)
it's quite possible that a higher copyID will become visible in the
revision history before a lower one.

The one thing we can know is that a lower copyID can never be a
branchwise descendant of a higher copyID, since the lower one must
have been committed before any of its descendants' txns were started,
of course.  I'm not sure this minimal inference will ever be useful,
but it's all we've got.  Anyway, right now, we only ever need ask
if two copyIDs are equal -- we don't order them.

Okay, now what if A already had copied trees underneath it when we
copied it to Y?  Suppose `bar' is down in one of those subdirectories?

Then when we're committing on /Y/.../bar, we watch for copyIDs as we
walk down from root, like usual, and if we're in a copy underneath a
copy, we bubble down a _new_ copyID, distinct from both Y's and B's,
starting from that point.  Justification: a branch of a branch is
still a branch, so it gets its own copyID.

At this point, I'm going to hand-wave on describing the
copy-under-copy behavior any further.  I think the above is enough to
see that there are no insurmountable problems here, and that the
filesystem will now have node keys that [lazily] reflect actual
branching patterns.

A few more notes:

The is_ancestor(X,Y) question is really only asked during merge(),
that is, during commits.  If entry "somedir/blah" has NodeID Y in the
txn being committed, and NodeID X in the head tree being merged into
that txn, then we need to make sure X is an ancestor of Y, so that
when the commit replaces X with Y, we know we're not losing any
history.

Therefore, we think we can get away with storing the ancestors in a
distributed fashion, as a chain: each node knows its immediate
predecessor (or "predecessors", in the future), and you walk back
until you have your answer.  In real life, we won't usually be walking
back too far, and the search can be further bounded by the revision
range implied by the two nodes.  If profiling proves these walks to be
a bottleneck, we can start caching the results of such walks in a new
table whose keys are node keys, and values are _full_ ancestor lists.
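
A sketch of that chain walk, again with a dict standing in for the
stored predecessor references (hypothetical names, and without the
revision-range bound a real implementation would add):

```python
def is_ancestor(x, y, predecessor):
    """Walk back from Y through the predecessor chain, looking for X.
    `predecessor` maps node key -> immediate predecessor (or None)."""
    key = predecessor.get(y)
    while key is not None:
        if key == x:
            return True
        key = predecessor.get(key)
    return False
```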



b) Operate on portions of files efficiently.
============================================

   [still pondering this section]

We should take advantage of Berkeley DB's partial record stuff, all
the way up to the top of the svn fs interfaces.

   - dag_node_t gets two new fields: contents_offset and contents_len.
     They apply to the node's cache of the contents, not the header or
     proplist.

   - svn_fs__get_node_rev_contents() takes offset and len arguments,
     fetches only that data.  The dag_node_t will remember the offset
     and len.

   - svn_fs__put_node_rev_contents() takes offset and len args as
     well.

   - change set_node_revision() accordingly.

   - ... todo thinking here ...

So now, whenever you read or write a node revision, you are operating
on a range.  There will be some way to say "I mean the whole thing",
of course, so it won't be necessary to know the size in advance.

Thought: possibly we should stop storing data in the dag_node_t
itself, and just return the data in a void pointer passed to
svn_fs__get_node_rev_contents().  Still pondering.
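
To illustrate the intended interface, here is a toy in-memory
stand-in for range-based reads and writes.  Berkeley DB's
DB_DBT_PARTIAL flag on get/put is the real mechanism underneath; none
of these names are actual svn_fs functions:

```python
class NodeContents:
    """Toy stand-in for partial record access: read or write an
    (offset, length) range of a node revision's contents without
    materializing the whole value."""
    def __init__(self, data=b""):
        self.data = bytearray(data)

    def get(self, offset=0, length=None):
        if length is None:               # "I mean the whole thing"
            return bytes(self.data[offset:])
        return bytes(self.data[offset:offset + length])

    def put(self, new_bytes, offset=0):
        end = offset + len(new_bytes)
        if end > len(self.data):         # writing past the end grows the value
            self.data.extend(b"\0" * (end - len(self.data)))
        self.data[offset:end] = new_bytes
```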



c) Reverse-delta storage.
=========================

The naive way to recover an old text is:

   retrieve_node_rev (N)
   {
     grab_node_revision (&N);

     if (is_fulltext (N))
       return N;
     else if (is_shared (N))
       return retrieve_node_rev (get_sharee (N));
     else if (is_svndiff (N))
       return svnpatch (get_svndiff (N), retrieve_node_rev (get_base (N)));
   }

(Loose pseudo-code, obviously, and the recursion could be a loop, but
you get the idea.)

The trouble with this is that it constructs and patches each
intermediate revision.  That'll probably be i/o bound, and anyway much
of the intermediate material may not end up in the final target, in
which case reconstructing it was a waste of time.

What we really want is a way to compose a bunch of svndiffs, and then
apply that composition to the current head, to give us the older
revision in one step (well, one application anyway).  Both the
composition and the final application need to happen in a streamy or
windowed fashion -- we shouldn't have to hold the entire diff in
memory, nor the entire source, nor the target.

Here's a way to do this:

An svndiff is a series of instructions that are followed to
reconstruct the target.  There are three possible instructions:

   a) Insert X bytes of new text into the TARGET, and I'm giving you
      those bytes right here.
   b) Copy N bytes, starting from offset F in the SOURCE, into the
      TARGET.
   c) Copy N bytes, starting from offset F in the TARGET, into the
      TARGET.

(Note that (c) can actually run past the current end of the target, as
long as, by the time you get there, the target is longer.)
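
A minimal evaluator for the three instruction kinds makes the
semantics concrete, including the overlapping (c) case; the tuple op
encoding is invented for illustration, not the real svndiff wire
format:

```python
def apply_ops(source, ops):
    """Evaluate a list of svndiff-style instructions against SOURCE.
    Op encoding (ours):
      ('n', data)          -- (a) insert new text
      ('s', off, length)   -- (b) copy from SOURCE
      ('t', off, length)   -- (c) copy from the TARGET built so far"""
    target = bytearray()
    for op in ops:
        if op[0] == 'n':
            target += op[1]
        elif op[0] == 's':
            target += source[op[1]:op[1] + op[2]]
        else:
            # byte-at-a-time, so the copy may run past the current end
            # of the target as long as the target grows under it
            for i in range(op[2]):
                target.append(target[op[1] + i])
    return bytes(target)
```

The second test below shows the run-past-the-end trick: a 2-byte seed
plus a 6-byte self-copy expands into 8 bytes of repeated pattern.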

To compose two svndiffs...

   ...and I hate to tantalize you, but I'm late and have to run now,
      so I'll try to finish this tomorrow... crimminy... The quick
      summary is, we build a new svndiff (the composition of all the
      intermediates), and as it gets too large, we windowify as we go
      and put each window temporarily in the database; this makes the
      composition as a whole less efficient, but means that at any
      given time we don't have to have the whole thing in memory.  The
      arithmetic for offset-adjustment is fairly straightforward even
      when one has to offload windows, I believe.  It's nice that the
      source is already in the db and we get offset+length style
      operations from Berkeley naturally anyway.  Branko or anyone,
      feel free to continue this recipe and see if you can take it
      somewhere before I get in tomorrow morning...  -kff


------------------------------------------------------------------------
   Notes from JimB about optimizing the in-repository delta generation
   to make deltas that can be composed more quickly:

I talked about this with Karl on the phone, and gave pretty bad
explanations of my thinking; I'll try to do better today.  This will
also provide some background for other readers.

I'm told that RCS reconstructs older file revisions by walking the
list of diffs, ignoring the actual text, and simply recording the line
numbers and sizes of the substitutions.  Then, it can run over that
data and do arithmetic to construct a single `composed' diff against
the youngest revision that reconstructs the older revision.  Then,
you apply that composed diff to get the revision you wanted.

For example, if your fulltext is:

rev 3:  abcdefghijklmnopqrst

and your deltas are:

rev 2:  replace 3--6 with "howdy"    (yielding abchowdyghijklmnopqrst)
rev 1:  replace 6--8 with " are you" (yielding abchow are youghijklmnopqrst)

then the RCS algorithm would gather this info:

rev 2:  replace 3--6 with 5 chars (call them X)
rev 1:  replace 6--8 with 8 chars (call them Y)

Now, without looking at any of the text, it can compose the two deltas
to get the following delta, yielding rev 1:

        replace 3--6 with range 0--3 from X
        replace 6--6 with range 0--8 from Y (i.e. insert Y at 6)

If we then apply this `composed' delta to the original text, we get:

        abchow are youghijklmnopqrst
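
The worked example above can be checked mechanically; this snippet
replays both the step-by-step deltas and the composed delta against
rev 3, and they agree:

```python
def replace(text, start, end, repl):
    """One RCS-style substitution: swap text[start:end] for repl."""
    return text[:start] + repl + text[end:]

X, Y = "howdy", " are you"
rev3 = "abcdefghijklmnopqrst"
rev2 = replace(rev3, 3, 6, X)     # step-by-step: rev 3 -> rev 2
rev1 = replace(rev2, 6, 8, Y)     # step-by-step: rev 2 -> rev 1

# The composed delta, applied directly to rev 3 without ever
# materializing rev 2:
composed = replace(replace(rev3, 3, 6, X[0:3]), 6, 6, Y[0:8])
```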

The point of all this is that you don't need to mess around with
actual text until the very end.  Until that point, the amount of work
depends on the amount of change, not on the amount of text involved.
And when you do get around to actually assembling text, the amount of
work depends on the size of the output file --- because you're only
touching each piece of text once --- and only weakly on the amount of
change.

Our current svndiff format frustrates this somewhat by compressing the
new text, as well as pulling in pieces of the original.  The
compression process produces a lot of delta ops that deal with small
pieces of text.  (Or at least, I expect it does...)  So even if the
change is something simple --- replacing a single block of text in the
middle of the file, say --- you end up with an svndiff with a lot of
ops, mostly concerned with building the replacement text from pieces
of itself.  This is great, except that having lots of small ops
increases the amount of work the delta composition phase needs to do.
In fact, if the ops usually deal with really small pieces of text ---
a few dozen bytes or so --- I expect it'd be faster to just throw the
actual text around.  Memcpy is pretty optimized on real systems; you
could copy a lot of bytes in the time it would take to do funky
intersections and adjustments on a list of ops talking about those
bytes, so those ops had better refer to large blocks of bytes.

I'm not sure what to do with that.  It almost seems like we want the
text delta computation algorithm to optimize deltas for network
transmission and deltas for storage differently.


------------------------------------------------------------------------
  Notes from Brane: Delta composition

Despite JimB's concerns, it turns out that delta composition is
straightforward. The basic idea is that combining two deltas is
equivalent to applying the second delta to a representation of the
first delta's result. Bear with me while I shamelessly abuse what
little I remember about linear algebra.

Given two deltas, A and B, and the source stream S, we want to
construct a composed delta, AB, that converts S to the target stream
T. Let's borrow notation from linear algebra and write this
transformation like this:

    T = AB(S)

Abusing algebra some more, I'll assume that a delta behaves like any
other linear transformation; therefore,

    AB(S) = B(A(S))

and since I'm not about to develop any rigorous proofs here, I'll
just say that it follows from the above that

    T = B(S'), where S' = A(S)

A small note here: Every delta is actually an n-tuple of delta
operations, represented by what I'll call the delta operators [n]
(for new data), [s] (for source copies) and [t] (for target
copies). [s] and [t] create bits of the target stream by operating on
contiguous parts of the source and (existing) target stream,
respectively; while [n] does the same by operating on raw data.


Now, what we actually want to do is derive some form of AB (which, by
the way, does not have a unique representation, since we're not trying
to find the optimal ("normalized") transform) that doesn't in any way
rely on the value of S'. We do that by building a representation of S'
that relies only on S, and any new data introduced by the [n]
operators in A. That's possible because any [t] ops in A merely copy
parts of S' that have been previously defined by [s] and [n]
ops. Transforming A by (recursively) replacing all [t] ops with the
equivalent [s] and [n] ops gives us exactly such a representation,
which I'll call A'. [*]

Building AB from B and A' is trivial: traverse the list of delta ops
in B, copying all [n] and [t] ops into the result; for [s] ops, copy
the range of ops from A' that define the appropriate range in S'. For
some of the copies, the first or last op from the range in A' will
have to be split, and the first op in the copy range can sometimes be
merged with the previous op in AB.


Of course, stopping here could give a very sub-optimal AB, because it
could contain many duplicate copies of the same op ranges from A'. We
fix this by doing exactly the opposite of the A->A' transformation: by
transforming [s] ops from B into [t] ops. We do this by recording the
source and target of each copy from A' to AB and, whenever the [s]
from B describes a range in T that was already defined, converting
that into a [t] instead [**]. Unlike the A->A' transform, we can't remove
all copies from A' to AB (we can only do that when AB doesn't refer to
S at all), but we can significantly reduce the number of such copies.

The resulting AB will usually not be the optimal delta from S to T,
because we will never actually look at S while constructing AB.


Summarizing the above, we get the following delta composition
algorithm:

    ;; X is the set of byte ranges in T defined by copies from S'
    ;; Y is the current offset in T, defined by the ops in AB
    foreach OP in B:
      if (OP = [t]) or (OP = [n]):
        copy OP to AB
      else
        R = (set of ops from A that define S'[OP.offset, OP.length])
[**]    using X, find the optimal set O of [t] ops to replace part of R
        foreach OP' in R:
          if OP' is in (R - O):
            if (OP' = [t]):
[*]           replace OP' with equivalent [s] and [n] ops
            copy OP' to AB
          else
[**]        copy OP's equivalent in O to AB
        insert S'[OP.offset, OP.length]->[Y, OP.length] into X


This algorithm ignores details such as op splitting and merging,
ensuring every op from O gets copied exactly once, helper indexes,
etc..., but those are implementation details.
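
Here is a runnable Python sketch of the two transforms described
above -- the A -> A' normalization and the composition of B with A'.
It omits the reverse [s] -> [t] optimization, assumes no [t] op runs
past the current end of the target, and uses our own op encoding
(('n', data), ('s', off, len), ('t', off, len)); it is a sketch of the
recipe, not the real implementation:

```python
def op_len(op):
    return len(op[1]) if op[0] == 'n' else op[2]

def slice_ops(ops, off, length):
    """Return the sub-ops of `ops` that define target range
    [off, off+length), splitting boundary ops as needed."""
    picked, pos = [], 0
    for op in ops:
        n = op_len(op)
        lo, hi = max(pos, off), min(pos + n, off + length)
        if lo < hi:
            if op[0] == 'n':
                picked.append(('n', op[1][lo - pos:hi - pos]))
            else:
                picked.append((op[0], op[1] + (lo - pos), hi - lo))
        pos += n
    return picked

def normalize(a_ops):
    """A -> A': replace each [t] op with the [s]/[n] ops that defined
    the copied range, so A' refers only to S and new data."""
    out = []
    for op in a_ops:
        if op[0] == 't':
            out.extend(slice_ops(out, op[1], op[2]))
        else:
            out.append(op)
    return out

def compose(a_ops, b_ops):
    """Build AB from B and A': [n]/[t] ops copy through; each [s] op
    expands into the A'-ops defining that range of S' = A(S)."""
    a_prime = normalize(a_ops)
    ab = []
    for op in b_ops:
        if op[0] == 's':
            ab.extend(slice_ops(a_prime, op[1], op[2]))
        else:
            ab.append(op)
    return ab

def apply_ops(source, ops):
    """Reference evaluator, used only to check compose(): T = ops(S)."""
    t = ''
    for op in ops:
        if op[0] == 'n':
            t += op[1]
        elif op[0] == 's':
            t += source[op[1]:op[1] + op[2]]
        else:
            for i in range(op[2]):
                t += t[op[1] + i]
    return t
```

The key property, checked below, is that applying compose(A, B) to S
yields the same text as applying A and then B.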


------------------------------------------------------------------------
  Notes from Brane: Delta storage

O.K., now that delta composition is out of the way, let's talk a bit
about storing and retrieving deltified text from the filesystem. I'll
just jot down a few thoughts about how this could be done, assuming
one of two possible models of storage management in the filesystem:

  a) all data store and retrieve operations are synchronous; or,
  b) deltification can run in the background.


1) Storing new data

New data arrives either as fulltext (from add) or as a delta from an
existing version (copy, modify).

  a) if the new data is a delta, convert it to fulltext (possibly
     combining several existing deltas in the process). Store the
     fulltext in the tip, and replace the previous tip (which, by
     induction, contains fulltext) with a delta from the current tip.

  b) store the fulltext or delta in the tip and mark it for async
     modification. Do a) in the background.

2) Retrieving old data

  a) If the data is a delta, follow the delta references (combining
     the deltas) until a fulltext is found; apply the combined delta
     to get the required fulltext.

     If the combined delta reduces to a no-op (the two fulltexts are
     the same), store the fulltext in the younger of the two nodes and
     replace the older node's data with a "same" note.

  b) Same as para 1 of a), then mark the node for async
     modification. In the background, find the diff between the two
     fulltexts. If they're equal, do para 2 of a). Otherwise, if the
     diff is smaller than the current diff in the node, replace the
     current representation. ("Smaller" could be construed as "more
     optimal" -- it would make sense to take into account the number
     of delta combinations that could be skipped by replacing the
     current representation when comparing sizes.)
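
A toy model of storage scheme 1a) -- the tip always holds fulltext,
and older revisions are re-deltified against their successor -- might
look like the following sketch.  The naive prefix/suffix differ here
stands in for real svndiff generation, and all names are ours:

```python
def make_delta(base, target):
    """Toy differ: emit a common-prefix copy, one block of new data,
    and a common-suffix copy (a stand-in for svndiff output)."""
    p = 0
    while p < min(len(base), len(target)) and base[p] == target[p]:
        p += 1
    s = 0
    while (s < min(len(base), len(target)) - p
           and base[len(base) - 1 - s] == target[len(target) - 1 - s]):
        s += 1
    ops = []
    if p:
        ops.append(('s', 0, p))                       # copy common prefix
    if target[p:len(target) - s]:
        ops.append(('n', target[p:len(target) - s]))  # new data
    if s:
        ops.append(('s', len(base) - s, s))           # copy common suffix
    return ops

def apply_delta(base, ops):
    out = ''
    for op in ops:
        out += op[1] if op[0] == 'n' else base[op[1]:op[1] + op[2]]
    return out

class Store:
    """Reverse-delta store: the tip of a line of history always holds
    fulltext; each older node holds a delta against its successor."""
    def __init__(self):
        self.reps = {}  # key -> ('full', text) | ('delta', succ_key, ops)

    def put(self, key, text, prev_key=None):
        self.reps[key] = ('full', text)
        if prev_key is not None:
            old = self.get(prev_key)
            # re-deltify the previous tip against the new tip (step 1a)
            self.reps[prev_key] = ('delta', key, make_delta(text, old))

    def get(self, key):
        rep = self.reps[key]
        if rep[0] == 'full':
            return rep[1]
        _, succ_key, ops = rep
        # follow the delta reference toward the tip, then apply
        return apply_delta(self.get(succ_key), ops)
```

Retrieval walks toward the tip and applies deltas on the way back
down; a real implementation would combine the deltas first, as
described in 2a) above.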


d) Multiple parents per change
==============================

This is necessary for -- at least -- accurate Merge Tracking, to
allow for accurate calculation of the change set graph.  Use cases
include:

1) Avoiding repeated merges when performing cyclic merging
   (e.g. branch A -> B -> A).

2) Sussing explicit merge info when a change in merge info occurs
   during a copy operation (e.g. to avoid paying attention to implied
   merge info from the copy source).

Mercurial (hg) does this.


e) Atomic renames
=================

This may just be a means to an end?  Mercurial (hg) gets by without
this, but that may be due to its distributed repository implementation.

It seems we can handle a lot of the desired use cases (see
notes/tree-conflicts.txt) without true renames.

Exploratory work has been started on the fs-atomic-renames branch.