Bazaar

Bazaar

 




Wiki Tools

  • Login
  • Create Profile
  • FindPage
  • RecentChanges
  • Page History
  • Attachments

Summary

A proposed new format for storing weave-like history information.

  • (WeaveDelta isn't a really satisfactory name to me, except in that it suggests the combination of the two techniques. I kind of like "knit" but maybe that's too cute. -- MartinPool 2005-10-17 01:07:52)

This is similar to Aaron's proposed http://www.aaronbentley.com/weavediff format.

Design

As for weaves, one object stores the history for a file including branching and joining. From this can be retrieved any historical text, origin annotations for any text, or the shape of the graph for that file.

Like in the weave file, we use integers as abbreviations for version names within the file. They have no global meaning.

Both versions and lines are indexed from 0.

The file consists of a series of records, each describing a version of the file. Each record stores information to reconstruct the text and line origins, the symbolic name (revision-id) of the version, the hash of the text, and information about the delta.

File texts can be stored either as complete copies, or relative to one parent of the text, as line-based deltas.

Whether the text is stored completely or as a delta, the origin of newly introduced lines is given as an index into the list of versions stored in the file.

The diff is in the style produced by python difflib, i.e. three integers A,B,C saying that lines [A:B] of the original text should be replaced by the C following lines. If A=B this is a pure insertion, if C=0 this is a pure deletion. The line numbers are in the delta-basis text, not in the weave or a composite. So 0,0,7 means an insertion of 7 lines before the previous text. If the number of lines in the previous text is N, N,N,3 means 3 lines are appended.

The annotations are constrained that they must reference only revisions in the ancestry of that version.

There is no sha1 protection for the annotations. It could be added, but seems perhaps less useful than the hash of the text.

Examples

Here are four file versions:

test-0

hello
world

test-1a (parent test-0)

blue
world

test-1b (parent test-0)

hello
green
world

test-2 (parents test-1a, test-1b)

hello
blue
world

Here is an illustration, in a more verbose syntax than might finally be used, with comments.

weavedelta v7

record 0                 # sequence in this file
version test-0           # global identifier for this version
sha1 f572d396fae92...    # of full text
parents                  # no parents for this version
full-copy                # therefore stored from scratch
lines 2                  # count following lines
0 hello                  # integer before space gives origin of linej
0 world

record 1
version test-1a
sha1 1298319283
parents 0               # index 0
delta
lines 1                 # one line of delta
0,1,1                   # change first line to this one line
1 blue                  # "blue" from version 1

record 2
version test-1b
sha1 1293812983
parents 0
delta
lines 1                 # one line of deltas
1,1,1                   # insert before line 1
2 green         

record 3
version test-2
parents 1 2
sha1 1219281928
parents 1 2
delta                   # relative to first parent i.e. test-1a
lines 1
0,1,2                   # remove first line of test-1a
0 hello                 # and replace with values from test-1b and parent
2 green

The syntax could be made less verbose by inferring some field names.

Changes from weavediff

Removed the requirement that revision diffs can be applied verbatim to a foreign weavediff that contains all the same parents; instead the diff can be copied across if the parents are identical but the indexes need to be adjusted ("mbp@23890" may be record 100 in one weavedelta and 150 in another.)

  • There are advantages to retaining the exact same text.
    • Text can be signed, avoiding 'malicious patch' worries
    • Revisions can be exchanged more easily
    • Installation is simpler
    • File history can be split into multiple, independent files.
      • --AaronBentley

        • Do you mean the exact same diff text? -- MartinPool 2005-10-17 02:13:09

          • Yes. "retaining the same [diff] text" as opposed to adjusting indexes. -- AaronBentley

Changes from weave

The file format is similar to that used by weave, which seems to have worked reasonably well in being robust, debuggable, and fast to parse. It doesn't seem like binary files are clearly faster to load in Python than files of this style.

I assert (currently without proof) that it's possible to do weave-style or cdv-style merges on only annotated copies without needing the full weave.

One loss of this approach compared to full weaves is that alignment between two texts is not implicitly preserved by the weave. This does not give such strong consideration to the path by which two versions were reached. This seems likely to help some merges and hurt others.

Options

These design features are not necessary but could be added to improve particular cases:

  • For versions with no parents, all lines must be new in that version and there is no need to store annotations.
  • Often there will be a series of lines with the same annotation;
  • Implementation can detect when there is no value in doing delta compression and always store full texts.
    • One case would be if revisions were stored wrapped up - there are typically few or no common lines from one to the next.
      • Acutally, if you use 1 line per info, there are quite a few lines in a revision that can be shared. The current problem with weave files (v5), is that a lot of annotation is necessary for every change. When a single line is changed while the surrounding stays the same, you get :

        { 1000
        [ 1001
        . foo
        }
        ] 1001
        { 1001
        . bar
        }
        

        Which is 6 control lines (36 bytes), versus 2 content lines (8 bytes). The proposed format has less overhead per line of change:

        lines 1
        5000,5001,1
        1001 bar
        

        Only 1 control line (17 bytes). (I'm not counting the "lines 1" because that holds for all changes. and doesn't grow much with multiple changes. With less overhead, it makes sense to have more common lines. (One of my results from playing with the revision format was that with the current weave it was better to *not* have a common line in the middle of uncommon ones, because the overhead was larger than the extra line cost, both for number of control lines, and total number of bytes.) --JohnMeinel

  • Some files (e.g. inventory) may not care about annotations and can omit them.
  • Binaries can be stored without annotations. It may often be reasonable to store full copies without
    • delta compression. For files that are almost but not really text (PDFs?), they could be stored as line data, which should be safe for binaries.

Rationale

Weave formats allow powerful annotation and merging features, but have some scalability limits:

  • The entire weave must be traversed when inserting or extracting.
  • The whole file is rewritten on each update, which generates more disk IO and is perceived to have a higher risk of corruption or data loss. It would be nice to have each file written only once, or failing that to have each file only appended to.

Passive operation over http means that there is substantial overhead in opening a file. (This is true for local disk, where we need to consider the allocation overhead of having many small files.) This inclines us towards grouping changes into files.

bzr 0.1 makes use of the ancestry stored in a weave as a guide to the reduced history graph relevant to that file. This can be used to speed 'bzr log' on a single file or to

To avoid needing to scan the whole file, an index is stored in a separate file. This index can be either append-only, or could be rewritten each time and therefore ordered.

Further Details

The basic approach is to store full copies and forward deltas (the same approach as arch and hg.) Annotation is stored along with texts, indicating

Weaves have been reasonably successful in using a format that is designed for regular text files but safely accomodates non-text files. Using a traditional unix-style text file aids debugging and this style is fast to parse in Python. For binary files we may wish to use different splitting or quoting, as Gustavo suggested for weaves.

Discussion

Unresolved Issues

Questions and Answers