Context Navigation

skip-deltas @ 3

Last change on this file since 3 was 3, checked in by valtob, 15 years ago
subversion source 1.6.2 as test
File size: 6.7 KB

Line
1	Skip-Deltas in Subversion
2	=========================
3
4	To keep repositories at a manageable size, essentially all version
5	control systems use some kind of relative compression technique such
6	that two similar versions of the same file don't take up much more
7	space than just one version of that file. The two most common
8	techniques are the SCCS "weave", which represents all revisions of a
9	file as a single data stream with the moral equivalent of #ifdefs, and
10	the technique of storing deltas (differences) between related
11	revisions of files (see
12	http://web.mit.edu/ghudson/thoughts/file-versioning for details).
13	Subversion uses deltas.
14
15	Subversion uses a technique called "skip-deltas" to ensure that only a
16	reasonable number of deltas need to be composed in order to retrieve
17	any revisions of a file. The concept of skip-deltas is inspired by
18	the concept of skip-lists, but they aren't all that similar.
19
20	For the purposes of this document, we will pretend that revisions of a
21	file are numbered starting from 0. In reality, this number
22	corresponds to the "change count" field of a node-revision in each
23	filesystem back end.
24
25	Skip-Deltas in the FSFS Back End
26	================================
27
28	In the FSFS back end, each revision of a file is represented as a
29	delta against an older revision of the file. The first revision is
30	represented as a delta against the empty stream (i.e. it is
31	self-compressed). To reconstruct a revision of a file, the filesystem
32	code determines the chain of deltas leading back to revision 0,
33	composes them all together using the delta combiner, and applies the
34	resulting super-delta to the empty stream in order to reconstruct the
35	file contents.
36
37	The most obvious strategy would be to choose revision N-1 as the delta
38	base for revision N. But even with the delta combiner, it would
39	become very slow to retrieve revision 1000 of a file if we had to
40	piece together 1000 deltas. So, to choose the delta base for revision
41	N, we write out N in binary and flip the rightmost bit whose value is
42	1. For instance, if we are storing 54, we write it out in binary as
43	110110, flip the last '1' bit to get 110100, and thus pick revision 52
44	of the file as the delta base. A file with ten versions (numbered
45	0-9) would have those versions represented as follows:
46
47	0 <- 1 2 <- 3 4 <- 5 6 <- 7
48	0 <------ 2 4 <------ 6
49	0 <---------------- 4
50	0 <------------------------------------ 8 <- 9
51
52	where "0 <- 1" means that the delta base for revision 1 is revision 0.
53
54	Because we flip the rightmost '1' bit each time we pick a delta base,
55	at most lg(N) deltas are necessary to reconstruct revision N of a
56	file.
57
58	Skip-deltas in the BDB Back End
59	===============================
60
61	In the BDB back end, each revision of a file is represented as a delta
62	against a newer revision of the file--the opposite of FSFS. The
63	newest version of a file is stored in plain text. Whereas in FSFS, we
64	add a new version of a file simply by picking a delta base and writing
65	out a delta, in BDB the process is more complicated: we write out the
66	new version of the file in plain text; then, after the commit is
67	confirmed, we go back and "redeltify" one or more older versions of
68	the file against the new one.
69
70	The goal of the redeltification process is to produce the reverse of
71	the FSFS diagram:
72
73	0 ------------------------------------> 8 -> 9
74	4 ----------------> 8
75	2 ------> 4 6 ------> 8
76	1 -> 2 3 -> 4 5 -> 6 7 -> 8
77
78	To accomplish this, we write out the number in binary, count the
79	number of trailing zeros, and redeltify that number of ancestor
80	revisions plus 1. For instance, when we see revision 8, we write it
81	out as "1000", note that there are three trailing 0s, and resolve to
82	redeltify four ancestor revisions: the revisions one back, two back,
83	four back, and eight back.
84
85	As it turns out, the above diagram is a fiction. To reduce overhead,
86	the BDB back end makes three compromises to the skip-delta scheme:
87
88	* When storing file revisions 0-31, only the immediate predecessor
89	is redeltified.
90
91	* We don't redeltify the ancestor revision which is two back from
92	the one we are storing.
93
94	* We never redeltify revision 0 of a file.
95
96	Despite these compromises, the asymptotic behavior of the BDB
97	skip-delta scheme is the same as the simpler FSFS one: O(lg(N)) deltas
98	are necessary to reconstruct any revision of a file which has had N
99	revisions.
100
101	Skip-Deltas and Branches
102	========================
103
104	If a file's history diverges because it is copied and the modified on
105	both branches, the behavior is as follows:
106
107	* In FSFS, we choose delta bases just as we would if each branch
108	were an isolated linear path leading back to revision 0. For
109	instance, if a file has six revisions (0-5), then branches into
110	revisions 6-7 and revisions 6'-8', they would look like:
111
112	0 <- 1 2 <- 3 4 <- 5 6 <- 7
113	0 <------ 2 4 <------ 6
114	6' <- 7'
115	0 <-------------------------------------- 8'
116
117	* In BDB, we redeltify ancestor revisions just as we would if each
118	branch were an isolated linear path leading back to revision 0.
119	The result depends on the order of commits. If a file has four
120	revisions (0-3), then branches into revisions 4 and 4', then if 4
121	was committed first and 4' was committed second, the result would
122	look like:
123
124	4
125	0 --------------------> 4'
126	2 --------> 4'
127	1 --> 2 3 --> 4'
128
129	but if instead, 4 was committed second, the result would look
130	like:
131
132	4'
133	0 --------------------> 4
134	2 --------> 4
135	1 --> 2 3 --> 4
136
137	Although this order dependency may be confusing to think about,
138	it causes no complexity in the code, and the O(lg(N)) asymptotic
139	behavior is clearly preserved.
140
141	Note that in the BDB back end, a branched file has a separate
142	plain-text representation for each branch head, while in the FSFS back
143	end, that is not the case.
144
145	Costs of Skip-Deltas
146	====================
147
148	In most workloads, the delta for a file revision becomes larger if the
149	delta base is farther away--in terms of the diagrams, longer arrows
150	take up more space. In the worst case, where all changes to the file
151	are orthogonal to each other, a delta across N file revisions may be N
152	times as expensive as a delta across one revision.
153
154	In either back end, the average number of revisions crossed by a delta
155	arrow is O(lg(N)), if the file has had N revisions. So we may assume
156	that in the worst case, skip-deltas incur an O(lg(N)) space penalty
157	while providing an O(N/lg(N)) time benefit. The practical space
158	penalty appears to be substantially less than O(lg(N)), because many
159	files have short histories and many changes are not orthogonal to each
160	other.

Note: See TracBrowser for help on using the repository browser.

Download in other formats:

Original Format