Assignee: MartinPool
Created: 2005-10-11 by MartinPool
Contributors: MartinPool, AaronBentley
Status: BrainDump
Queues:
Implementation branch: none yet
Malone bugs:
Summary
A proposed new format for storing weave-like history information.
(WeaveDelta isn't a really satisfactory name to me, except in that it suggests the combination of the two techniques. I kind of like "knit" but maybe that's too cute. -- MartinPool 2005-10-17 01:07:52)
This is similar to Aaron's proposed http://www.aaronbentley.com/weavediff format.
Design
As for weaves, one object stores the history for a file including branching and joining. From this can be retrieved any historical text, origin annotations for any text, or the shape of the graph for that file.
Like in the weave file, we use integers as abbreviations for version names within the file. They have no global meaning.
Both versions and lines are indexed from 0.
The file consists of a series of records, each describing a version of the file. Each record stores information to reconstruct the text and line origins, the symbolic name (revision-id) of the version, the hash of the text, and information about the delta.
File texts can be stored either as complete copies, or relative to one parent of the text, as line-based deltas.
Whether the text is stored completely or as a delta, the origin of newly introduced lines is given as an index into the list of versions stored in the file.
The diff is in the style produced by python difflib, i.e. three integers A,B,C saying that lines [A:B] of the original text should be replaced by the C following lines. If A=B this is a pure insertion, if C=0 this is a pure deletion. The line numbers are in the delta-basis text, not in the weave or a composite. So 0,0,7 means an insertion of 7 lines before the previous text. If the number of lines in the previous text is N, N,N,3 means 3 lines are appended.
The annotations are constrained that they must reference only revisions in the ancestry of that version.
There is no sha1 protection for the annotations. It could be added, but seems perhaps less useful than the hash of the text.
Examples
Here are four file versions:
test-0
hello world
test-1a (parent test-0)
blue world
test-1b (parent test-0)
hello green world
test-2 (parents test-1a, test-1b)
hello blue world
Here is an illustration, in a more verbose syntax than might finally be used, with comments.
weavedelta v7 record 0 # sequence in this file version test-0 # global identifier for this version sha1 f572d396fae92... # of full text parents # no parents for this version full-copy # therefore stored from scratch lines 2 # count following lines 0 hello # integer before space gives origin of linej 0 world record 1 version test-1a sha1 1298319283 parents 0 # index 0 delta lines 1 # one line of delta 0,1,1 # change first line to this one line 1 blue # "blue" from version 1 record 2 version test-1b sha1 1293812983 parents 0 delta lines 1 # one line of deltas 1,1,1 # insert before line 1 2 green record 3 version test-2 parents 1 2 sha1 1219281928 parents 1 2 delta # relative to first parent i.e. test-1a lines 1 0,1,2 # remove first line of test-1a 0 hello # and replace with values from test-1b and parent 2 green
The syntax could be made less verbose by inferring some field names.
Changes from weavediff
Removed the requirement that revision diffs can be applied verbatim to a foreign weavediff that contains all the same parents; instead the diff can be copied across if the parents are identical but the indexes need to be adjusted ("mbp@23890" may be record 100 in one weavedelta and 150 in another.)
- There are advantages to retaining the exact same text.
- Text can be signed, avoiding 'malicious patch' worries
- Revisions can be exchanged more easily
- Installation is simpler
- File history can be split into multiple, independent files.
Do you mean the exact same diff text? -- MartinPool 2005-10-17 02:13:09
Yes. "retaining the same [diff] text" as opposed to adjusting indexes. -- AaronBentley
Changes from weave
The file format is similar to that used by weave, which seems to have worked reasonably well in being robust, debuggable, and fast to parse. It doesn't seem like binary files are clearly faster to load in Python than files of this style.
I assert (currently without proof) that it's possible to do weave-style or cdv-style merges on only annotated copies without needing the full weave.
One loss of this approach compared to full weaves is that alignment between two texts is not implicitly preserved by the weave. This does not give such strong consideration to the path by which two versions were reached. This seems likely to help some merges and hurt others.
Options
These design features are not necessary but could be added to improve particular cases:
- For versions with no parents, all lines must be new in that version and there is no need to store annotations.
- Often there will be a series of lines with the same annotation;
- Implementation can detect when there is no value in doing delta compression and always store full texts.
- One case would be if revisions were stored wrapped up - there are typically few or no common lines from one to the next.
Acutally, if you use 1 line per info, there are quite a few lines in a revision that can be shared. The current problem with weave files (v5), is that a lot of annotation is necessary for every change. When a single line is changed while the surrounding stays the same, you get :
{ 1000 [ 1001 . foo } ] 1001 { 1001 . bar }Which is 6 control lines (36 bytes), versus 2 content lines (8 bytes). The proposed format has less overhead per line of change:
lines 1 5000,5001,1 1001 bar
Only 1 control line (17 bytes). (I'm not counting the "lines 1" because that holds for all changes. and doesn't grow much with multiple changes. With less overhead, it makes sense to have more common lines. (One of my results from playing with the revision format was that with the current weave it was better to *not* have a common line in the middle of uncommon ones, because the overhead was larger than the extra line cost, both for number of control lines, and total number of bytes.) --JohnMeinel
- One case would be if revisions were stored wrapped up - there are typically few or no common lines from one to the next.
- Some files (e.g. inventory) may not care about annotations and can omit them.
- Binaries can be stored without annotations. It may often be reasonable to store full copies without
- delta compression. For files that are almost but not really text (PDFs?), they could be stored as line data, which should be safe for binaries.
Rationale
Weave formats allow powerful annotation and merging features, but have some scalability limits:
- The entire weave must be traversed when inserting or extracting.
- The whole file is rewritten on each update, which generates more disk IO and is perceived to have a higher risk of corruption or data loss. It would be nice to have each file written only once, or failing that to have each file only appended to.
Passive operation over http means that there is substantial overhead in opening a file. (This is true for local disk, where we need to consider the allocation overhead of having many small files.) This inclines us towards grouping changes into files.
bzr 0.1 makes use of the ancestry stored in a weave as a guide to the reduced history graph relevant to that file. This can be used to speed 'bzr log' on a single file or to
To avoid needing to scan the whole file, an index is stored in a separate file. This index can be either append-only, or could be rewritten each time and therefore ordered.
Further Details
The basic approach is to store full copies and forward deltas (the same approach as arch and hg.) Annotation is stored along with texts, indicating
Weaves have been reasonably successful in using a format that is designed for regular text files but safely accomodates non-text files. Using a traditional unix-style text file aids debugging and this style is fast to parse in Python. For binary files we may wish to use different splitting or quoting, as Gustavo suggested for weaves.
Discussion
Unresolved Issues
