1 | Anyone who has worked on the libsvn_wc code will have discovered that |
---|
2 | the current code is a complicated mess of special cases, and that it |
---|
3 | is difficult to understand, inconsistent, slow and buggy. I know this |
---|
4 | because I wrote some of it. It's possible that the libsvn_wc code |
---|
5 | will gradually evolve into an elegant, efficient, code base; on the |
---|
6 | other hand comments like "when we rewrite libsvn_wc" regularly appear |
---|
7 | on the dev list. This document is *not* a plan or design for a |
---|
8 | rewrite, it's just some of the thoughts of a libsvn_wc hacker. |
---|
9 | |
---|
10 | From Past to Present |
---|
11 | ==================== |
---|
12 | |
---|
13 | The original code for libsvn_wc used an implementation that stored |
---|
14 | more or less all state information on disk in the .svn area on disk, |
---|
15 | so during most operations the entries files were read and written many |
---|
16 | times. This led to the development of an API that passed around a lot |
---|
17 | of path parameters (as svn_string_t* originally, as const char* now) |
---|
18 | and to the development of the svn_io_xxx functions, which also accept |
---|
19 | path parameters. The implementation was slow, and didn't scale |
---|
20 | particularly well as working copies got larger. To improve things the |
---|
21 | current access baton and entries caching system was gradually hacked |
---|
22 | in, and libsvn_wc is now faster and scales a bit better, but problems |
---|
23 | still remain. |
---|
24 | |
---|
25 | A good example of the problems caused by the "path as parameter" API |
---|
26 | is svn_wc_text_modified_p. Its basic function is to determine if the |
---|
27 | base text file and the working file are the same or different, but |
---|
28 | physical IO operations have to be repeated because they are buried |
---|
29 | behind several layers of API. It's difficult to fix without |
---|
30 | rewriting, or duplicating, a number of svn_io_xxx and svn_wc_xxx |
---|
31 | functions. Aside from the repeated IO itself, each IO operation also |
---|
32 | has to repeat the UTF-8 to native path conversion. |
---|
33 | |
---|
34 | The current entries caching makes things faster than in the past, but |
---|
35 | has its own problems. Most operations now cache the entire entries |
---|
36 | hierarchy in memory which limits the size of the working copies that |
---|
37 | can be handled. The problem is difficult to solve as some operations |
---|
38 | make multiple passes--commit for instance makes a first pass searching |
---|
39 | for modifications, a second pass reporting to the repository, and a |
---|
40 | third pass to do post-commit processing. |
---|
41 | |
---|
42 | The original code also did not always make a distinction between the |
---|
43 | versioned hierarchy in the entries file and the physical hierarchy on |
---|
44 | disk. Things like using stat() or svn_io_check_path() calls to |
---|
45 | determine whether an item was versioned as file or directory do not |
---|
46 | work when the working copy on disk is obstructed or incomplete. |
---|
47 | |
---|
48 | The Future |
---|
49 | ========== |
---|
50 | |
---|
51 | Some of these ideas are trivial, some of them are difficult to |
---|
52 | implement, some of them may not work at all. |
---|
53 | |
---|
54 | - Have an svn_wc_t context object, opaque outside the library, that |
---|
55 | would replace the access batons. This would get passed through most |
---|
56 | of the libsvn_wc functions and could read/cache the entries files on |
---|
57 | demand as the working copy was traversed. It could also cache the |
---|
58 | UTF-8 xlate handle. |
---|
59 | |
---|
60 | - Have an API to svn_wc_entry_t, perhaps make the struct opaque, so |
---|
61 | that things like URL need not be constructed when the entries file |
---|
62 | is read but can be created on demand if required and possibly cached |
---|
63 | once created. The aim would be to reduce the memory used by the |
---|
64 | entries cache. |
---|
65 | |
---|
66 | - Consider caching physical IO results in svn_wc_entry_t/svn_wc_t. |
---|
67 | Should we really stat() any file more than once? This becomes less |
---|
68 | important as we reduce the number of IO operations. |
---|
69 | |
---|
70 | - Consider caching UTF-8 to native path conversions either in |
---|
71 | svn_wc_t, or svn_wc_entry_t, or locally in functions and using |
---|
72 | svn_io_xxx equivalents that accept native paths. This becomes less |
---|
73 | important as we reduce the number of IO operations. |
---|
74 | |
---|
75 | - Make interfaces pass svn_wc_entry_t* rather than simple paths. The |
---|
76 | public API using const char* paths would remain to be used by |
---|
77 | libsvn_client et al. |
---|
78 | |
---|
79 | - Maintain a clear distinction between the versioned hierarchy and the |
---|
80 | physical hierarchy when writing code, it's usually a mistake to use |
---|
81 | one when the other should be used. To this end, audit the use of |
---|
82 | svn_io_check_path(). |
---|
83 | |
---|
84 | - Avoid using stat() to determine if an item is present on disk before |
---|
85 | using the item, just use it straight away and handle the error if it |
---|
86 | doesn't exist. |
---|
87 | |
---|
88 | - Search out and destroy functions that read and discard entries files |
---|
89 | e.g. the apparently "simple" functions like svn_wc_is_wc_root or |
---|
90 | check_wc_root. Such overhead is expensive when used by operations |
---|
91 | that are not going to do much other work, running status on a single |
---|
92 | file for example. The overhead may not matter to a command line |
---|
93 | client, but it can matter to a GUI that makes many such calls. |
---|
94 | |
---|
95 | - Consider supporting out of tree .svn directories. |
---|
96 | |
---|
97 | - In the present code most operations are IO bound and have CPU to |
---|
98 | spare. Perhaps compressed text-bases would make things faster |
---|
99 | rather than slower, by trading spare CPU for reduced IO? |
---|
100 | |
---|
101 | - Keep track of the last text time written into an entries file and |
---|
102 | store it in svn_wc_t. Then when we come to do a timestamp sleep |
---|
103 | we can do it from that time rather than the current time. |
---|
104 | |
---|
105 | - Store working file size in the entries file and use it as another |
---|
106 | shortcut to detect modifications. This should not need any extra |
---|
107 | system calls, the stat() for timestamp can also return the size. |
---|
108 | When it triggers it will be much faster than possibly detranslating |
---|
109 | and then doing a byte-by-byte comparison. |
---|
110 | ### Problem: This doesn't work when the file needs translation, because the |
---|
111 | ### file might be modified in such a way that these modifications disappear |
---|
112 | ### when the file is detranslated. |
---|
113 | |
---|
114 | - Make the entries file smaller. The properties committed-date |
---|
115 | committed-rev and last-author are really only needed for keyword |
---|
116 | expansion, so only store them if the appropriate svn:keywords value |
---|
117 | is present. Note that committed-rev has a more general use as rPREV, |
---|
118 | however just about all uses of rPREV involve repository access so |
---|
119 | rPREV could be determined via an RA call. Removing the three |
---|
120 | properties could reduce entries file size by as much as one third, |
---|
121 | it's possible that might make reading, writing and parsing faster. |
---|
122 | It would reduce the memory used to cache the entries, an ABI change |
---|
123 | to svn_wc_entry_t might reduce it further. |
---|
124 | |
---|
125 | - Look at calls to svn_wc__get_keywords and svn_wc__get_eol_style |
---|
126 | Each of those reads the properties file. If they occur together |
---|
127 | then consider replacing them with a single call to svn_wc_prop_list, |
---|
128 | and perhaps write some functions that accept the properties hash |
---|
129 | as an argument. Alternatively, consider caching the existence of |
---|
130 | these two properties in the entries file to avoid reading the props |
---|
131 | file at all in some cases. |
---|
132 | |
---|
133 | - Optimise update/incomplete handling to reduce the number of times |
---|
134 | the entry file gets written. |
---|
135 | http://svn.haxx.se/dev/archive-2005-03/0060.shtml |
---|
136 | |
---|
137 | * Avoid adding incomplete="true" if the revision is not changing. |
---|
138 | |
---|
139 | * Don't write incomplete="true" immediately, cache it in the access |
---|
140 | baton and only write it when next writing the entries file. |
---|
141 | |
---|
142 | * Combine removing incomplete="true", and revision bumping, with the |
---|
143 | last change due to the update. |
---|
144 | |
---|
145 | - The svn_wc_t context could work in conjunction with a more advanced |
---|
146 | svn_wc_crawl_revisions system. This would provide a way of plugging |
---|
147 | multiple callbacks into a queue, probably with some sort of ordering |
---|
148 | and filtering ability, the aim being to replace most/all of the |
---|
149 | existing explicit loops. This would put more of the pool handling in |
---|
150 | one central location, it may even be possible to provide different |
---|
151 | entry caching schemes. I don't know how practical this idea is, or |
---|
152 | even if it is desirable. |
---|
153 | |
---|
154 | - Have a .svn/deleted directory so that schedule delete directories |
---|
155 | can be moved out of the working copy. At present a skeleton hierarchy |
---|
156 | of schedule delete directories remains in the working copy until the |
---|
157 | delete is committed. |
---|
158 | |
---|
159 | - When handling a delete received during update/switch perhaps do it |
---|
160 | in two stages. First move the item into a holding area within .svn |
---|
161 | and finally delete all such items at the end of the update. This |
---|
162 | would allow adds-with-history to use the deleted item and so might |
---|
163 | be a way to handle moves (implemented as delete plus add) in the |
---|
164 | presence of local modifications. Thought would have to be given to |
---|
165 | the revision of the local deleted item, what happens if it doesn't |
---|
166 | match the copyfrom revision? Perhaps we could get diffs, rather |
---|
167 | than full text, for adds-with-history if the copyfrom source is |
---|
168 | reported to the repository? |
---|
169 | |
---|
170 | - Consider implementing atomic move for wc-to-wc moves, rather than |
---|
171 | using copy+delete. This would be considerably faster for big |
---|
172 | directories, would lead to better revert behaviour, and avoid |
---|
173 | case-insensitivity problems (and if we ever get atomic mv in |
---|
174 | libsvn_fs then the wc code would be ready for it). |
---|
175 | |
---|
176 | - Consider writing some libsvn_wc compiled C regression tests to allow |
---|
177 | more complete coverage. Most of the current libsvn_wc testing is |
---|
178 | done via the command line client and it can be hard to get a working |
---|
179 | copy into the state necessary to test all code paths. |
---|
180 | |
---|
181 | - There are some basic features that are fragile. Switch has some |
---|
182 | bugs that can break a working copy, see issue 1906. I don't know |
---|
183 | how the system is supposed to work in theory, let alone how it |
---|
184 | should be implemented. Non-recursive checkout is broken, see issue |
---|
185 | 695; this probably applies to non-recursive update and switch as well. |
---|
186 | |
---|
187 | - Use absolute paths within libsvn_wc so that "." is not automatically |
---|
188 | a wc root. |
---|
189 | |
---|
190 | - Read notes/entries-caching for some details of the logging/caching |
---|
191 | in the current libsvn_wc. It's important that writing the entries |
---|
192 | file is handled efficiently. |
---|