source: valtobtest/subversion-1.6.2/notes/skip-deltas @ 3

Last change on this file since 3 was 3, checked in by valtob, 15 years ago

subversion source 1.6.2 as test

File size: 6.7 KB
Line 
1Skip-Deltas in Subversion
2=========================
3
4To keep repositories at a manageable size, essentially all version
5control systems use some kind of relative compression technique such
6that two similar versions of the same file don't take up much more
7space than just one version of that file.  The two most common
8techniques are the SCCS "weave", which represents all revisions of a
9file as a single data stream with the moral equivalent of #ifdefs, and
10the technique of storing deltas (differences) between related
11revisions of files (see
12http://web.mit.edu/ghudson/thoughts/file-versioning for details).
13Subversion uses deltas.
14
15Subversion uses a technique called "skip-deltas" to ensure that only a
16reasonable number of deltas need to be composed in order to retrieve
17any revisions of a file.  The concept of skip-deltas is inspired by
18the concept of skip-lists, but they aren't all that similar.
19
20For the purposes of this document, we will pretend that revisions of a
21file are numbered starting from 0.  In reality, this number
22corresponds to the "change count" field of a node-revision in each
23filesystem back end.
24
25Skip-Deltas in the FSFS Back End
26================================
27
28In the FSFS back end, each revision of a file is represented as a
29delta against an older revision of the file.  The first revision is
30represented as a delta against the empty stream (i.e. it is
31self-compressed).  To reconstruct a revision of a file, the filesystem
32code determines the chain of deltas leading back to revision 0,
33composes them all together using the delta combiner, and applies the
34resulting super-delta to the empty stream in order to reconstruct the
35file contents.
36
37The most obvious strategy would be to choose revision N-1 as the delta
38base for revision N.  But even with the delta combiner, it would
39become very slow to retrieve revision 1000 of a file if we had to
40piece together 1000 deltas.  So, to choose the delta base for revision
41N, we write out N in binary and flip the rightmost bit whose value is
421.  For instance, if we are storing 54, we write it out in binary as
43110110, flip the last '1' bit to get 110100, and thus pick revision 52
44of the file as the delta base.  A file with ten versions (numbered
450-9) would have those versions represented as follows:
46
47  0 <- 1    2 <- 3    4 <- 5    6 <- 7
48  0 <------ 2         4 <------ 6
49  0 <---------------- 4
50  0 <------------------------------------ 8 <- 9
51
52where "0 <- 1" means that the delta base for revision 1 is revision 0.
53
54Because we flip the rightmost '1' bit each time we pick a delta base,
55at most lg(N) deltas are necessary to reconstruct revision N of a
56file.
57
58Skip-deltas in the BDB Back End
59===============================
60
61In the BDB back end, each revision of a file is represented as a delta
62against a newer revision of the file--the opposite of FSFS.  The
63newest version of a file is stored in plain text.  Whereas in FSFS, we
64add a new version of a file simply by picking a delta base and writing
65out a delta, in BDB the process is more complicated: we write out the
66new version of the file in plain text; then, after the commit is
67confirmed, we go back and "redeltify" one or more older versions of
68the file against the new one.
69
70The goal of the redeltification process is to produce the reverse of
71the FSFS diagram:
72
73  0 ------------------------------------> 8 -> 9
74                      4 ----------------> 8
75            2 ------> 4         6 ------> 8
76       1 -> 2    3 -> 4    5 -> 6    7 -> 8
77
78To accomplish this, we write out the number in binary, count the
79number of trailing zeros, and redeltify that number of ancestor
80revisions plus 1.  For instance, when we see revision 8, we write it
81out as "1000", note that there are three trailing 0s, and resolve to
82redeltify four ancestor revisions: the revisions one back, two back,
83four back, and eight back.
84
85As it turns out, the above diagram is a fiction.  To reduce overhead,
86the BDB back end makes three compromises to the skip-delta scheme:
87
88  * When storing file revisions 0-31, only the immediate predecessor
89    is redeltified.
90
91  * We don't redeltify the ancestor revision which is two back from
92    the one we are storing.
93
94  * We never redeltify revision 0 of a file.
95
96Despite these compromises, the asymptotic behavior of the BDB
97skip-delta scheme is the same as the simpler FSFS one: O(lg(N)) deltas
98are necessary to reconstruct any revision of a file which has had N
99revisions.
100
101Skip-Deltas and Branches
102========================
103
104If a file's history diverges because it is copied and the modified on
105both branches, the behavior is as follows:
106
107  * In FSFS, we choose delta bases just as we would if each branch
108    were an isolated linear path leading back to revision 0.  For
109    instance, if a file has six revisions (0-5), then branches into
110    revisions 6-7 and revisions 6'-8', they would look like:
111
112    0 <- 1    2 <- 3    4 <- 5    6 <- 7
113    0 <------ 2         4 <------ 6
114                                  6' <- 7'
115    0 <-------------------------------------- 8'
116
117  * In BDB, we redeltify ancestor revisions just as we would if each
118    branch were an isolated linear path leading back to revision 0.
119    The result depends on the order of commits.  If a file has four
120    revisions (0-3), then branches into revisions 4 and 4', then if 4
121    was committed first and 4' was committed second, the result would
122    look like:
123
124                            4
125    0 --------------------> 4'
126                2 --------> 4'
127          1 --> 2     3 --> 4'
128
129    but if instead, 4 was committed second, the result would look
130    like:
131
132                            4'
133    0 --------------------> 4
134                2 --------> 4
135          1 --> 2     3 --> 4
136
137    Although this order dependency may be confusing to think about,
138    it causes no complexity in the code, and the O(lg(N)) asymptotic
139    behavior is clearly preserved.
140
141Note that in the BDB back end, a branched file has a separate
142plain-text representation for each branch head, while in the FSFS back
143end, that is not the case.
144
145Costs of Skip-Deltas
146====================
147
148In most workloads, the delta for a file revision becomes larger if the
149delta base is farther away--in terms of the diagrams, longer arrows
150take up more space.  In the worst case, where all changes to the file
151are orthogonal to each other, a delta across N file revisions may be N
152times as expensive as a delta across one revision.
153
154In either back end, the average number of revisions crossed by a delta
155arrow is O(lg(N)), if the file has had N revisions.  So we may assume
156that in the worst case, skip-deltas incur an O(lg(N)) space penalty
157while providing an O(N/lg(N)) time benefit.  The practical space
158penalty appears to be substantially less than O(lg(N)), because many
159files have short histories and many changes are not orthogonal to each
160other.
Note: See TracBrowser for help on using the repository browser.