Subversion migration to Git

Some time ago I was tasked with migrating our Subversion repositories to Git. I am only writing this article now because, well, I had forgotten about the notes I took during the migration and only recently stumbled upon them.

Our largest repository was something like 500GB and contained a little more than 50,000 commits. The goal was to recover the svn history into git, keeping as much information as possible about the commits and the links between them, and keeping the branches. Over the years, a number of periodic database dumps had been committed; they now weighed down the repository without serving any purpose. There were also a number of branches that were never used and contained nothing of interest.

The decision was also taken to split some of the tools into their own repositories instead of keeping them in the same repository, cleaning up the main repository to keep only the main project and related sources.

Principles

  • After some experiments, I decided to use svn2git, a tool used by KDE for their migration. It has the advantage of taking a rule file that allows splitting a repository by svn path, processing tags and branches and transforming them, ignoring other paths, …
  • As the import of such a large repository is slow, I decided to mount a btrfs partition so that each step could be snapshotted, allowing me to test the next step without fear of having to start over from the beginning.
  • Some binary files had been added to the svn history and it made sense to keep them. I decided to migrate them to git-lfs to reduce the history size without losing them completely.
  • A lot of commit messages contain references to other commits. I wanted to process these commit messages and transform each rXXXX revision reference into the corresponding git hash so that tools can create a link automatically; for example, a reference to r1234 should become the abbreviated hash of the git commit that r1234 was imported as, with the original revision kept in parentheses.

Tools

The first tool to retrieve is svn2git.

The compilation should be easy: install the dependencies, then build the tool.

$ git clone https://github.com/svn-all-fast-export/svn2git.git
$ sudo apt install libqt4-dev libapr1-dev libsvn-dev
$ cd svn2git
$ qmake .
$ make

Once the tool is compiled, we can prepare the btrfs mount in which we will run the migration steps.

$ mkdir repositories
$ truncate -s 300G repositories.btrfs
$ sudo mkfs.btrfs repositories.btrfs
$ sudo mount repositories.btrfs repositories
$ sudo chown 1000:1000 repositories

We will also write a small tool in Go to process the commit messages.

sudo apt install golang

We will also need BFG, a git history cleaning tool. You can download the jar file from the BFG Repo-Cleaner website.

First steps

The first step of the migration is to retrieve the svn repository itself onto the local machine. This is not a checkout of the repository; we need the server folder directly, with the whole history and metadata.

rsync -avz --progress sshuser@svn.myserver.com:/srv/svn_myrepository/ .

In this case I had SSH access to the server, allowing me to simply rsync the repository. Doing so allowed me to prepare the migration in advance, only copying the new commits on each synchronisation and not the whole repository with its large history. Most of the repository files are never updated so this step is only slow on the first execution.
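Since we now have a full copy of the repository, we can also check its consistency before building anything on top of it; a corrupted copy would otherwise only surface late in the migration:

svnadmin verify /home/tsc/svn_myrepository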

User mapping

The next step is to create a mapping file that will map svn users to git users. A user in svn is just a username, whereas in git it is a name and an email address.

To get the list of user accounts, we can use the svn command directly on the local repository like this:

svn log file:///home/tsc/svn_myrepository \
    | egrep '^r.*lines?$' \
    | awk -F'|' '{print $2;}' \
    | sort \
    | uniq

This will return the list of users appearing in the logs. For each of these users, you should create a line in a mapping file, like so:

auser Albert User <albert.user@example.com>
aperson Anaelle Personn <anaelle.personn@example.com>

This file will be given as input to svn2git and must be complete; otherwise the import will fail.
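A quick sanity check, assuming the mapping file is named accounts-map.txt as in the migration command further down, is to compare both user lists; any output here means a mapping is missing:

$ svn log file:///home/tsc/svn_myrepository \
    | egrep '^r.*lines?$' \
    | awk -F'|' '{gsub(/ /, "", $2); print $2;}' \
    | sort -u > svn-users
$ awk '{print $1;}' accounts-map.txt | sort > mapped-users
$ comm -23 svn-users mapped-users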

Path mapping

The second mapping for the svn-to-git migration is the svn2git rules file. This file tells the program what goes where. In our case, the repository was not strictly adhering to the standard svn tree: it contained a trunk, tags and branches structure as well as some other folders for “out-of-branch” projects.

# We create the main repository
create repository svn_myrepository
end repository

# We create repositories for external tools that will move
# to their own repositories
create repository aproject
end repository
create repository bproject
end repository
create repository cproject
end repository

# We declare a variable to ease the declaration of the
# migration rules further down
declare PROJECTS=aproject|bproject|cproject

# We create repositories for out-of-branch folders
# that will migrate to their own repositories
create repository aoutofbranch
end repository
create repository boutofbranch
end repository

# We always ignore database dumps wherever they are.
# In our case, the database dumps are named "database-dump-20100112"
# or forms close to that.
match /.*/database([_-][^/]+)?[-_](dump|oracle|mysql)[^/]+
end match

# There are also dumps stored in their own folder
match /.*/database/backup(/old)?/.*(.zip|.sql|.lzma)
end match

# At some point the build results were also added to the history;
# we want to ignore them
match /.*/(build|dist|cache)/
end match

# We process our external tools only on the master branch.
# We use the previously declared variable to reduce the repetition
# and use the pattern match to move it to the correct repository.
match /trunk/(tools/)?(${PROJECTS})/
  repository \2
  branch master
end match

# And we ignore them if they are on tags or branches
match /.*/(tools/)?${PROJECTS}/
end match

# We start processing our main project after r10, as the
# first commits were missing the trunk and moved the branches, trunk and tags
# folders around.
match /trunk/
  min revision 10
  repository svn_myrepository
  branch master
end match

# There are branches that are hierarchically organized.
# Such cases have to be explicitly configured.
match /branches/(old|dev|customers)/([^/]+)/
  repository svn_myrepository
  branch \1/\2
end match

# Other branches are as expected directly in the branches folder.
match /branches/([^/]+)/
  repository svn_myrepository
  branch \1
end match

# The tags were used in a strange fashion before r2500,
# so we ignore everything before that refactoring
match /tags/([^/]+)/
  max revision 2500
end match

# After that, we create a branch for each tag as the svn tags
# were not used correctly and were committed to. We just name
# them differently and will process them afterwards.
match /tags/([^/]+)/([^/]+)/
  min revision 2500
  repository svn_myrepository
  branch \1-\2
end match

# Our out-of-branch folder will be processed directly, only creating
# a master branch.
match /aoutofbranch/
  repository aoutofbranch
  branch master
end match

match /boutofbranch/
  repository boutofbranch
  branch master
end match

# Everything else is discarded and ignored
match /
end match

This file will quickly grow with the number of migration operations you want to perform. Ignore files here whenever possible, as this reduces both the migration time and the postprocessing that will need to be done afterwards. In my case, a number of files were too complex to match during the migration or were only spotted afterwards, and had to be cleaned in a second pass with other tools.
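Before launching the full import, it is worth validating the rules on a small slice of the history. Depending on your svn2git build (check svn-all-fast-export --help; the two first flags below are assumptions), a dry run limited to the first revisions catches most rule errors:

$ ~/workspace/svn2git/svn-all-fast-export \
    --dry-run \
    --max-rev 200 \
    --identity-map ~/workspace/migration-tools/accounts-map.txt \
    --rules ~/workspace/migration-tools/svnfast.rules \
    /home/tsc/svn_myrepository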

Migration

This step will take a lot of time as it will read the whole svn history, process the declared rules and generate the git repositories and every commit.

$ cd repositories
$ ~/workspace/svn2git/svn-all-fast-export \
    --add-metadata \
    --svn-branches \
    --identity-map ~/workspace/migration-tools/accounts-map.txt \
    --rules ~/workspace/migration-tools/svnfast.rules \
    --commit-interval 2000 \
    --stat \
    /home/tsc/svn_myrepository

If there is a crash during this step, it means that you are either missing an account in your mapping, that one of your rules is emitting an erroneous branch or repository, or that no rule matches a given path.
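When that happens, the import log written next to the repositories (log-svn_myrepository in our case; the mapping tool below relies on the same file) shows how far the import got and on which revision it stopped:

grep 'progress SVN' log-svn_myrepository | tail -5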

Once this step has finished, I like to take a btrfs snapshot so that I can return to this point while putting the next steps into place.

btrfs subvolume snapshot -r repositories repositories/snap-1-import
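The -r flag makes the snapshot read-only. If a later step goes wrong, snapshotting the snapshot itself gives back a writable copy to restart from:

btrfs subvolume snapshot repositories/snap-1-import repositories/restored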

Cleanup

The next phase is to clean up our import. There will always be a number of branches that are unused, are named incorrectly, contain only temporary files, or are so far from the standard naming that our rules could not process them correctly.

We will simply delete them or rename them using git.

$ cd svn_myrepository
$ git branch -D oldbranch-0.3.1
$ git branch -D customer/backup_temp
$ git branch -m customer/stable_v1.0 stable-1.0
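To spot candidates for deletion or renaming, listing the branches by their last activity helps:

$ git for-each-ref --sort=committerdate --format='%(committerdate:short) %(refname:short)' refs/heads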

The goal at this step is to clean up the branches that will be kept after the migration. We do this now to reduce the repository size early on and thus reduce the time needed for the next steps.

If you see branches that can be deleted or renamed further down the road, you can also remove or rename them then.

I like to take a snapshot at this point, as the next stage usually involves a lot of tests and manually building a list of things to remove.

btrfs subvolume snapshot -r repositories repositories/snap-2a-cleanup

We can also remove files that should never have been added. We do this by generating a list of every file ever checked into our new git repository, inspecting it manually, and adding the identifiers of the files to remove to a new file:

$ git rev-list --objects --all > ./all-files
$ cat ./all-files | your-filter | cut -d' ' -f1 > ./to-delete-ids
$ java -jar ~/Downloads/bfg-1.12.15.jar --private --no-blob-protection --strip-blobs-with-ids ./to-delete-ids
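The all-files list only contains object identifiers and paths. To build the filter, it helps to look at the biggest blobs first, here everything over 1MiB:

$ git rev-list --objects --all \
    | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
    | awk '$1 == "blob" && $3 > 1048576 {print $3, $2, $4;}' \
    | sort -rn | head -50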

We will take a snapshot again, as the next step also involves checks and tests.

btrfs subvolume snapshot -r repositories repositories/snap-2b-cleanup

Next, we will convert the binary files that we still want to keep in our repository to Git-LFS. This lets git track only a small pointer to each file in the history instead of storing the whole binary in the repository, thus reducing the size of the clones.

BFG does this quickly and efficiently, removing every file matching the given name from the history and storing it in Git-LFS. This step will require some exploration of the previous all-files file to identify which files need to be converted.

$ java -jar ~/Downloads/bfg-1.12.15.jar --no-blob-protection --private --convert-to-git-lfs 'my-important-archive*.zip'
$ java -jar ~/Downloads/bfg-1.12.15.jar --no-blob-protection --private --convert-to-git-lfs '*.ear'
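After the conversion, the affected files in the history are replaced by small git-lfs pointer files. A spot-check on one of the converted files (the path here is an example) should show a pointer instead of binary content:

$ git cat-file -p master:tools/my-important-archive-1.0.zip
version https://git-lfs.github.com/spec/v1
oid sha256:…
size 52428800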

After the cleanup, I also like to do a btrfs snapshot so that the history rewrite step can be executed and tested multiple times.

btrfs subvolume snapshot -r repositories repositories/snap-2c-cleanup

Linking a svn revision to a git commit

For each revision, the import log prints a line mapping it to a mark. In the git repository, there is then a marks file that maps each mark to a commit hash. We can combine the two to build a mapping database that stores, for each svn revision, the corresponding git commit.

In our case, I wrote a Java program that will parse both files and store the resulting mapping into a LevelDB database.

This database will then be used by a Go server that reads the mapping into memory and exposes it over RPC; small Go clients called from a git filter-branch run will query it. The server also needs to keep track of the modifications to the git commit hashes, as the history rewrite changes them.

First, the Java tool to read the logs and generate the LevelDB database:

import com.google.common.collect.BiMap;
import com.google.common.collect.HashBiMap;
import java.io.File;
import java.io.FileReader;
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.IOUtils;
import org.apache.commons.io.filefilter.DirectoryFileFilter;
import org.apache.commons.io.filefilter.IOFileFilter;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;
import org.iq80.leveldb.impl.Iq80DBFactory;

public class CommitMapping {

    public static String FILE_LOG_IMPORT = "../log-svn_myrepository";
    public static String FILE_MARKS = "marks-svn_myrepository";
    public static String FILE_BFG_DIR = "../svn_myrepository.bfg-report";

    // Matches svn2git's log lines, e.g. "progress SVN r123 branch master = :456"
    public static Pattern PATTERN_LOG = Pattern.compile("^progress SVN (r\\d+) branch .* = (:\\d+)");

    public static void main(String[] args) throws Exception {

        List<String> importLines = IOUtils.readLines(new FileReader(new File(FILE_LOG_IMPORT)));
        List<String> marksLines = IOUtils.readLines(new FileReader(new File(FILE_MARKS)));
        
        // Collect the object-id-map.old-new.txt file from each BFG report directory
        Collection<File> passFilesCol = FileUtils.listFiles(new File(FILE_BFG_DIR), new IOFileFilter() {
            @Override
            public boolean accept(File pathname, String name) {
                return name.equals("object-id-map.old-new.txt");
            }

            @Override
            public boolean accept(File path) {
                return this.accept(path, path.getName());
            }
        }, DirectoryFileFilter.DIRECTORY);
        
        List<File> passFiles = new ArrayList<>(passFilesCol);
        
        Collections.sort(passFiles, (File o1, File o2) -> o1.getParentFile().getName().compareTo(o2.getParentFile().getName()));

        // svn revision ("r123") -> fast-import mark (":456"), then mark -> git hash
        Map<String, String> commitToIdentifier = new LinkedHashMap<>();
        Map<String, String> identifierToHash = new HashMap<>();

        for (String importLine : importLines) {
            Matcher marksMatch = PATTERN_LOG.matcher(importLine);

            if (marksMatch.find()) {
                String dest = marksMatch.group(2);
                if (dest == null || dest.length() == 0 || ":0".equals(dest)) continue;
                
                commitToIdentifier.put(marksMatch.group(1), dest);
            } else {
                System.err.println("Unknown line : " + importLine);
            }

        }

        File dbFile = new File(System.getenv("HOME") + "/mapping-db");
        File humanFile = new File(System.getenv("HOME") + "/mapping");

        FileUtils.deleteQuietly(dbFile);

        Options options = new Options();
        options.createIfMissing(true);
        DB db = Iq80DBFactory.factory.open(dbFile, options);

        marksLines.stream().map((line) -> line.split("\\s", 2)).forEach((parts) -> identifierToHash.put(parts[0], parts[1]));
        
        BiMap<String, String> commitMapping = HashBiMap.create(commitToIdentifier.size());
        for (String commit : commitToIdentifier.keySet()) {
            
            String importId = commitToIdentifier.get(commit);
            String hash = identifierToHash.get(importId);
            
            if (hash == null) continue;
            commitMapping.put(commit, hash);
        }
        
        System.err.println("Got " + commitMapping.size() + " svn -> initial import entries.");
        
        // Each BFG pass rewrote commit hashes; replay its object-id map so that
        // every svn revision points at the newest hash.
        for (File file : passFiles) {
            System.err.println("Processing file " + file.getAbsolutePath());

            List<String> bfgPass = IOUtils.readLines(new FileReader(file));
            Map<String, String> hashMapping = bfgPass.stream().map((line) -> line.split("\\s", 2)).collect(Collectors.toMap(parts -> parts[0], parts -> parts[1]));
            
            for (String hash : hashMapping.keySet()) {
                String rev = commitMapping.inverse().get(hash);
                if (rev != null) {
                    String newHash = hashMapping.get(hash);
                    System.err.println("Replacing r" + rev + ", was " + hash + ", is " + newHash);
                    commitMapping.replace(rev, newHash);
                }
            }
        }

        PrintStream fos = new PrintStream(humanFile);
        for (Map.Entry<String, String> entry : commitMapping.entrySet()) {
            String commit = entry.getKey();
            String target = entry.getValue();

            fos.println(commit + "\t" + target);
            db.put(Iq80DBFactory.bytes(commit), Iq80DBFactory.bytes(target));
        }

        db.close();
        fos.close();
    }
}

We use RPC between a client and a server so that the LevelDB database can be kept open, with very light clients querying the running server, as a client will be executed for each commit. In my tests, opening the database on every invocation was far too time-consuming, hence this approach, even though the server itself does very little.

The structure of our Go project is the following:

go-gitcommit/client-common:
rpc.go

go-gitcommit/client-insert:
insert-mapping.go

go-gitcommit/client-query:
query-mapping.go

go-gitcommit/server:
server.go

First, some plumbing for the RPC in rpc.go:

package Client

import (
	"net"
	"net/rpc"
	"time"
)

type (
	// Client -
	Client struct {
		connection *rpc.Client
	}

	// MappingItem is the response from the cache or the item to insert into the cache
	MappingItem struct {
		Key   string
		Value string
	}

	// BulkQuery allows querying the DB in bulk, in one round-trip.
	BulkQuery []MappingItem
)

// NewClient -
func NewClient(dsn string, timeout time.Duration) (*Client, error) {
	connection, err := net.DialTimeout("tcp", dsn, timeout)
	if err != nil {
		return nil, err
	}
	return &Client{connection: rpc.NewClient(connection)}, nil
}

// InsertMapping -
func (c *Client) InsertMapping(item MappingItem) (bool, error) {
	var ack bool
	err := c.connection.Call("RPC.InsertMapping", item, &ack)
	return ack, err
}

// GetMapping -
func (c *Client) GetMapping(bulk BulkQuery) (BulkQuery, error) {
	var bulkResponse BulkQuery
	err := c.connection.Call("RPC.GetMapping", bulk, &bulkResponse)
	return bulkResponse, err
}

Next, the Go server that will read this database, in server.go:

package main

import (
	"fmt"
	"log"
	"net"
	"net/rpc"
	"os"
	"time"

	"github.com/syndtr/goleveldb/leveldb"

	Client "../client-common"
)

var (
	cacheDBPath = os.Getenv("HOME") + "/mapping-db"

	cacheDB *leveldb.DB
	flowMap map[string]string

	f *os.File
	g *os.File
)

type (
	// RPC is the base class of our RPC system
	RPC struct {
	}
)

func main() {
	var cacheDBerr error

	cacheDB, cacheDBerr = leveldb.OpenFile(cacheDBPath, nil)
	if cacheDBerr != nil {
		fmt.Fprintln(os.Stderr, "Unable to initialize the LevelDB cache.")
		log.Fatal(cacheDBerr)
	}

	roErr := cacheDB.SetReadOnly()
	if roErr != nil {
		fmt.Fprintln(os.Stderr, "Unable to initialize the LevelDB cache.")
		log.Fatal(roErr)
	}

	flowMap = make(map[string]string)

	f, _ = os.Create(os.Getenv("HOME") + "/go-server/gomapping.log")
	defer f.Close()
	g, _ = os.Create(os.Getenv("HOME") + "/go-server/gomapping.ins")
	defer g.Close()

	rpc.Register(NewRPC())

	l, e := net.Listen("tcp", ":9876")
	if e != nil {
		log.Fatal("listen error:", e)
	}

	go flushLog()

	rpc.Accept(l)
}

func flushLog() {
	for {
		time.Sleep(100 * time.Millisecond)
		// Flush both log files, since the server is usually stopped with a signal
		// and the deferred Close calls never run.
		f.Sync()
		g.Sync()
	}
}

// NewRPC -
func NewRPC() *RPC {
	return &RPC{}
}

// InsertMapping -
func (r *RPC) InsertMapping(mappingItem Client.MappingItem, ack *bool) error {
	old := mappingItem.Key
	new := mappingItem.Value

	flowMap[old] = new

	g.WriteString(fmt.Sprintf("Inserted mapping %s -> %s\n", old, new))

	*ack = true

	return nil
}

// GetMapping -
func (r *RPC) GetMapping(bulkQuery Client.BulkQuery, resp *Client.BulkQuery) error {
	for i := range bulkQuery {
		key := bulkQuery[i].Key

		response, _ := cacheDB.Get([]byte(key), nil)

		gitCommit := key
		if response != nil {
			responseStr := string(response[:])
			responseUpdated := flowMap[responseStr]
			if responseUpdated != "" {
				gitCommit = string(responseUpdated[:])[:12] + "(" + key + ")"

				f.WriteString(fmt.Sprintf("Response to mapping %s -> %s\n", bulkQuery[i].Key, gitCommit))
			} else {
				f.WriteString(fmt.Sprintf("No git mapping for entry %s\n", responseStr))
			}
		} else {
			f.WriteString(fmt.Sprintf("Unknown revision %s\n", key))
		}

		bulkQuery[i].Value = gitCommit
	}

	*resp = bulkQuery

	return nil
}

And finally our clients. The insert client will be called from git filter-branch with the previous and current commit hashes after each commit is processed. We send this information to the server, which keeps it in memory, so that the revision mapping returns up-to-date hashes. The code goes into insert-mapping.go:

package main

import (
	"fmt"
	"log"
	"os"
	"time"

	Client "../client-common"
)

func main() {
	old := os.Args[1]
	new := os.Args[2]

	rpcClient, err := Client.NewClient("localhost:9876", time.Millisecond*500)
	if err != nil {
		log.Fatal(err)
	}

	mappingItem := Client.MappingItem{
		Key:   old,
		Value: new,
	}

	ack, err := rpcClient.InsertMapping(mappingItem)
	if err != nil || !ack {
		log.Fatal(err)
	}

	fmt.Println(new)
}

The query client will receive the commit message for each commit, check whether it contains an rXXXX revision reference, and query the server for the corresponding git hash. It goes into query-mapping.go:

package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"regexp"
	"strings"
	"time"

	client "../client-common"
)

func main() {
	// Read the whole commit message; it can span several lines.
	data, _ := ioutil.ReadAll(os.Stdin)
	text := string(data)

	re := regexp.MustCompile(`\Wr[0-9]+`)
	matches := re.FindAllString(text, -1)

	if matches == nil {
		fmt.Print(text)
		return
	}

	rpcClient, err := client.NewClient("localhost:9876", time.Millisecond*500)
	if err != nil {
		log.Fatal(err)
	}

	var bulkQuery client.BulkQuery

	for i := range matches {
		if matches[i][0] != '-' {
			key := matches[i][1:]
			bulkQuery = append(bulkQuery, client.MappingItem{Key: key})
		}
	}

	gitCommits, _ := rpcClient.GetMapping(bulkQuery)

	for i := range gitCommits {
		gitCommit := gitCommits[i].Value
		key := gitCommits[i].Key

		text = strings.Replace(text, key, gitCommit, 1)
	}

	fmt.Print(text)
}

For this step, we first need to compile and execute the Java program. Once it has created the database, we compile the Go programs and start the server in the background.
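Roughly, assuming the Java tool has been packaged with its dependencies (Guava, commons-io and the iq80 LevelDB implementation) into a commit-mapping.jar (the name is arbitrary) and that the code lives under ~/migration-tools as in the filter-branch call below:

$ java -jar ~/migration-tools/commit-mapping.jar
$ cd ~/migration-tools/go-gitcommit
$ (cd server && go build -o server server.go)
$ (cd client-insert && go build -o client-insert insert-mapping.go)
$ (cd client-query && go build -o client-query query-mapping.go)
$ ./server/server &

Note that the relative imports in the Go sources predate Go modules and need a GOPATH-era toolchain.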

Then, we can launch git filter-branch on our repository to rewrite the history:

$ git filter-branch \
    --commit-filter 'NEW=`git_commit_non_empty_tree "$@"`; \
                     ${HOME}/migration-tools/go-gitcommit/client-insert/client-insert $GIT_COMMIT $NEW' \
    --msg-filter "${HOME}/migration-tools/go-gitcommit/client-query/client-query" \
    -- --all --author-date-order

As after each step, we take a snapshot, even though this should be the last step that cannot easily be repeated.

btrfs subvolume snapshot -r repositories repositories/snap-3-mapping

We now clean the repository, which should contain a lot of now-unused blobs, branches, commits, …

$ git reflog expire --expire=now --all
$ git prune --expire=now --progress
$ git repack -adf --window-memory=512m

We now have a repository that should be more or less clean. You will have to check the history, the size of the blobs and whether some branches can still be deleted before pushing it to your server.
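Before pushing, git count-objects gives a quick view of the final size. A mirror push then transfers every branch; the remote URL here is an example, and if you converted files to git-lfs, the LFS objects have to be pushed separately:

$ git count-objects -vH
$ git push --mirror git@git.myserver.com:svn_myrepository.git
$ git lfs push --all git@git.myserver.com:svn_myrepository.git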