Some notes about Mercurial's design Revlogs: The fundamental storage type in Mercurial is a "revlog". A revlog is the set of all revisions to a file. Each revision is either stored compressed in its entirety or as a compressed binary delta against the previous version. The decision of when to store a full version is made based on how much data would be needed to reconstruct the file. This lets us ensure that we never need to read huge amounts of data to reconstruct a file, regardless of how many revisions of it we store. In fact, we should always be able to do it with a single read, provided we know when and where to read. This is where the index comes in. Each revlog has an index containing a special hash (nodeid) of the text, hashes for its parents, and where and how much of the revlog data we need to read to reconstruct it. Thus, with one read of the index and one read of the data, we can reconstruct any version in time proportional to the file size. Similarly, revlogs and their indices are append-only. This means that adding a new version is also O(1) seeks. Generally revlogs are used to represent revisions of files, but they also are used to represent manifests and changesets. Manifests: A manifest is simply a list of all files in a given revision of a project along with the nodeids of the corresponding file revisions. So grabbing a given version of the project means simply looking up its manifest and reconstruction all the file revisions pointed to by it. Changesets: A changeset is a list of all files changed in a check-in along with a change description and some metadata like user and date. It also contains a nodeid to the relevent revision of the manifest. Changesets and manifests are one-to-one, but contain different data for convenience. Nodeids: Nodeids are unique ids that are used to represent the contents of a file AND its position in the project history. That is, if you change a file and then change it back, the result will have a different nodeid because it has different history. This is accomplished by including the parents of a given revision's nodeids with the revision's text when calculating the hash. Graph merging: Nodeids are implemented as they are to simplify merging. Merging a pair of directed acyclic graphs (aka "the family tree" of the file history) requires some method of determining if nodes in different graphs correspond. Simply comparing the contents of the node (by comparing text of given revisions or their hashes) can get confused by identical revisions in the tree. The nodeid approach makes it trivial - the hash uniquely describes a revision's contents and its graph position relative to the root, so merge is simply checking whether each nodeid in graph A is in the hash table of graph B. If not, we pull them in, adding them sequentially to the revlog. Branching and merging: Everything in Mercurial is potentially a branch and every user effectively works in their own branch. When you do a checkout, Mercurial remembers what the parent changeset was and uses it for the next check in. To do a merge of branches in Mercurial, you check out the heads of the two branches into the same working directory which causes a merge to be performed, and then check in the result once you're happy with it. The resulting checkin will have two parents. It decides when a merge is necessary by first determining if there are any uncommitted changes in the working directory. This effectively makes the working directory a branch off the checked in version it's based on. Then it also determines if the working directory is a direct ancestor or descendent of the second version we're attempting to checkout. If neither is true, we simply replace the working directory version with the new version. Otherwise we perform a merge between the two versions. Merging files and manifests: We begin by comparing two versions manifests and deciding which files need to be added, deleted, and merged. Then for each file, we perform a graph merge and resolve as above. It's important to merge files using per-file DAGs rather than just changeset level DAGs as this diagram illustrates: M M1 M2 AB |`-------v M2 clones M aB AB file A is change in mainline |`---v AB' file B is changed in M2 | aB / | M1 clones M | ab/ | M1 changes B | ab' | M1 merges from M2, changes to B conflict | | A'B' M2 changes A `---+--.| | a'B' M2 merges from mainline, changes to A conflict `--.| ??? depending on which ancestor we choose, we will have to redo A hand-merge, B hand-merge, or both but if we look at the files independently, everything is fine The result is a merged version in the working directory, waiting for check-in. Rollback: When performing a commit or a merge, we order things so that the changeset entry gets added last. We keep a transaction log of the name of each file touched and its length prior to the transaction. On abort, we simply truncate each file to its prior length. This is one of the nice properties of the append-only structure of the revlogs. We can also reuse this journal for "rollback". Merging between repositories: One of the key features of Mercurial is the ability to merge between independent repositories in a decentralized fashion. Each repository can act as a read-only server or a client. Clients operating by pulling all branches that it hasn't seen from the server and adding them into its graph. This is done in two steps: searching for new "roots" and pulling a "changegroup" Searching for new "roots" begins by finding all new heads and searching backwards from those heads to the first unknown nodes in their respective branches. These nodes are the 'roots' that are used to calculate the 'changegroup': the set of all changesets starting at those roots. Mercurial takes pains to make this search efficient in both bandwidth and round-trips. Once the roots are found, the changegroup can be transferred as a single streaming transfer. This is organized as an ordered set of deltas for changesets, manifests, and files. Large chunks of deltas can be directly added to the repository without unpacking so it's fairly fast.