source: valtobtest/subversion-1.6.2/notes/wc-improvements @ 3

Last change on this file since 3 was 3, checked in by valtob, 15 years ago

subversion source 1.6.2 as test

File size: 9.6 KB
Line 
1Anyone who has worked on the libsvn_wc code will have discovered that
2the current code is a complicated mess of special cases, and that it
3is difficult to understand, inconsistent, slow and buggy.  I know this
4because I wrote some of it.  It's possible that the libsvn_wc code
5will gradually evolve into an elegant, efficient, code base; on the
6other hand comments like "when we rewrite libsvn_wc" regularly appear
7on the dev list.  This document is *not* a plan or design for a
8rewrite, it's just some of the thoughts of a libsvn_wc hacker.
9
10From Past to Present
11====================
12
13The original code for libsvn_wc used an implementation that stored
14more or less all state information on disk in the .svn area on disk,
15so during most operations the entries files were read and written many
16times.  This led to the development of an API that passed around a lot
17of path parameters (as svn_string_t* originally, as const char* now)
18and to the development of the svn_io_xxx functions, which also accept
19path parameters.  The implementation was slow, and didn't scale
20particularly well as working copies got larger.  To improve things the
21current access baton and entries caching system was gradually hacked
22in, and libsvn_wc is now faster and scales a bit better, but problems
23still remain.
24
25A good example of the problems caused by the "path as parameter" API
26is svn_wc_text_modified_p.  Its basic function is to determine if the
27base text file and the working file are the same or different, but
28physical IO operations have to be repeated because they are buried
29behind several layers of API.  It's difficult to fix without
30rewriting, or duplicating, a number of svn_io_xxx and svn_wc_xxx
31functions.  Aside from the repeated IO itself, each IO operation also
32has to repeat the UTF-8 to native path conversion.
33
34The current entries caching makes things faster than in the past, but
35has its own problems.  Most operations now cache the entire entries
36hierarchy in memory which limits the size of the working copies that
37can be handled.  The problem is difficult to solve as some operations
38make multiple passes--commit for instance makes a first pass searching
39for modifications, a second pass reporting to the repository, and a
40third pass to do post-commit processing.
41
42The original code also did not always make a distinction between the
43versioned hierarchy in the entries file and the physical hierarchy on
44disk.  Things like using stat() or svn_io_check_path() calls to
45determine whether an item was versioned as file or directory do not
46work when the working copy on disk is obstructed or incomplete.
47
48The Future
49==========
50
51Some of these ideas are trivial, some of them are difficult to
52implement, some of them may not work at all.
53
54- Have an svn_wc_t context object, opaque outside the library, that
55  would replace the access batons.  This would get passed through most
56  of the libsvn_wc functions and could read/cache the entries files on
57  demand as the working copy was traversed.  It could also cache the
58  UTF-8 xlate handle.
59
60- Have an API to svn_wc_entry_t, perhaps make the struct opaque, so
61  that things like URL need not be constructed when the entries file
62  is read but can be created on demand if required and possibly cached
63  once created.  The aim would be to reduce the memory used by the
64  entries cache.
65
66- Consider caching physical IO results in svn_wc_entry_t/svn_wc_t.
67  Should we really stat() any file more than once?  This becomes less
68  important as we reduce the number of IO operations.
69
70- Consider caching UTF-8 to native path conversions either in
71  svn_wc_t, or svn_wc_entry_t, or locally in functions and using
72  svn_io_xxx equivalents that accept native paths.  This becomes less
73  important as we reduce the number of IO operations.
74
75- Make interfaces pass svn_wc_entry_t* rather than simple paths.  The
76  public API using const char* paths would remain to be used by
77  libsvn_client et al.
78
79- Maintain a clear distinction between the versioned hierarchy and the
80  physical hierarchy when writing code, it's usually a mistake to use
81  one when the other should be used.  To this end, audit the use of
82  svn_io_check_path().
83
84- Avoid using stat() to determine if an item is present on disk before
85  using the item, just use it straight away and handle the error if it
86  doesn't exist.
87
88- Search out and destroy functions that read and discard entries files
89  e.g. the apparently "simple" functions like svn_wc_is_wc_root or
90  check_wc_root.  Such overhead is expensive when used by operations
91  that are not going to do much other work, running status on a single
92  file for example.  The overhead may not matter to a command line
93  client, but it can matter to a GUI that makes many such calls.
94
95- Consider supporting out of tree .svn directories.
96
97- In the present code most operations are IO bound and have CPU to
98  spare.  Perhaps compressed text-bases would make things faster
99  rather than slower, by trading spare CPU for reduced IO?
100
101- Keep track of the last text time written into an entries file and
102  store it in svn_wc_t.  Then when we come to do a timestamp sleep
103  we can do it from that time rather than the current time.
104
105- Store working file size in the entries file and use it as another
106  shortcut to detect modifications.  This should not need any extra
107  system calls, the stat() for timestamp can also return the size.
108  When it triggers it will be much faster than possibly detranslating
109  and then doing a byte-by-byte comparison.
110  ### Problem: This doesn't work when the file needs translation, because the
111  ### file might be modified in such a way that these modifications disappear
112  ### when the file is detranslated.
113
114- Make the entries file smaller.  The properties committed-date
115  committed-rev and last-author are really only needed for keyword
116  expansion, so only store them if the appropriate svn:keywords value
117  is present.  Note that committed-rev has a more general use as rPREV,
118  however just about all uses of rPREV involve repository access so
119  rPREV could be determined via an RA call.  Removing the three
120  properties could reduce entries file size by as much as one third,
121  it's possible that might make reading, writing and parsing faster.
122  It would reduce the memory used to cache the entries, an ABI change
123  to svn_wc_entry_t might reduce it further.
124
125- Look at calls to svn_wc__get_keywords and svn_wc__get_eol_style
126  Each of those reads the properties file.  If they occur together
127  then consider replacing them with a single call to svn_wc_prop_list,
128  and perhaps write some functions that accept the properties hash
129  as an argument.  Alternatively, consider caching the existence of
130  these two properties in the entries file to avoid reading the props
131  file at all in some cases.
132
133- Optimise update/incomplete handling to reduce the number of times
134  the entry file gets written.
135  http://svn.haxx.se/dev/archive-2005-03/0060.shtml
136
137  * Avoid adding incomplete="true" if the revision is not changing.
138
139  * Don't write incomplete="true" immediately, cache it in the access
140    baton and only write it when next writing the entries file.
141
142  * Combine removing incomplete="true", and revision bumping, with the
143    last change due to the update.
144
145- The svn_wc_t context could work in conjunction with a more advanced
146  svn_wc_crawl_revisions system.  This would provide a way of plugging
147  multiple callbacks into a queue, probably with some sort of ordering
148  and filtering ability, the aim being to replace most/all of the
149  existing explicit loops.  This would put more of the pool handling in
150  one central location, it may even be possible to provide different
151  entry caching schemes.  I don't know how practical this idea is, or
152  even if it is desirable.
153
154- Have a .svn/deleted directory so that schedule delete directories
155  can be moved out of the working copy. At present a skeleton hierarchy
156  of schedule delete directories remains in the working copy until the
157  delete is committed.
158
159- When handling a delete received during update/switch perhaps do it
160  in two stages.  First move the item into a holding area within .svn
161  and finally delete all such items at the end of the update.  This
162  would allow adds-with-history to use the deleted item and so might
163  be a way to handle moves (implemented as delete plus add) in the
164  presence of local modifications.  Thought would have to be given to
165  the revision of the local deleted item, what happens if it doesn't
166  match the copyfrom revision?  Perhaps we could get diffs, rather
167  than full text, for adds-with-history if the copyfrom source is
168  reported to the repository?
169
170- Consider implementing atomic move for wc-to-wc moves, rather than
171  using copy+delete.  This would be considerably faster for big
172  directories, would lead to better revert behaviour, and avoid
173  case-insensitivity problems (and if we ever get atomic mv in
174  libsvn_fs then the wc code would be ready for it).
175
176- Consider writing some libsvn_wc compiled C regression tests to allow
177  more complete coverage.  Most of the current libsvn_wc testing is
178  done via the command line client and it can be hard to get a working
179  copy into the state necessary to test all code paths.
180
181- There are some basic features that are fragile.  Switch has some
182  bugs that can break a working copy, see issue 1906.  I don't know
183  how the system is supposed to work in theory, let alone how it
184  should be implemented.  Non-recursive checkout is broken, see issue
185  695; this probably applies to non-recursive update and switch as well.
186
187- Use absolute paths within libsvn_wc so that "." is not automatically
188  a wc root.
189
190- Read notes/entries-caching for some details of the logging/caching
191  in the current libsvn_wc.  It's important that writing the entries
192  file is handled efficiently.
Note: See TracBrowser for help on using the repository browser.