Subversion migration to Git
Some time ago I was tasked with migrating our Subversion repositories to Git. This article was only written much later because, well, I had forgotten about the notes I had taken during the migration and only recently stumbled upon them again.
Our largest repository was something like 500 GB and contained a little more than 50'000 commits. The goal was to recover the svn history into git, keeping as much information as possible about the commits and the links between them, and to keep the branches. Over the history, a number of periodic database dumps had been committed that now weighed down the repository without serving any purpose. There were also a number of branches that were never used and contained nothing of interest.
The decision was also taken to split some of the tools into their own repositories instead of keeping them in the same repository, cleaning up the main repository to keep only the main project and related sources.
Principles
- After some experiments, I decided to use svn2git, a tool used by KDE for their migration. It has the advantage of taking a rules file that allows splitting a repository by svn path, processing tags and branches and transforming them, ignoring other paths, …
- As the import of such a large repository is slow, I decided to mount a btrfs partition so that each step could be snapshotted, allowing me to test the next step without any fear of having to start over from the beginning.
- Some binary files were added to the svn history and it made sense to keep them. I decided to migrate them to git-lfs to reduce the history size without losing them completely.
- A lot of commit messages contain references to other commits; I wanted to process these messages and transform each reference to an svn revision (the rNNNN form) into a git hash so that tools can create a link automatically.
Tools
The first tool to retrieve is svn2git.
Compiling it should be easy: first install the dependencies, then build it.
$ git clone https://github.com/svn-all-fast-export/svn2git.git
$ sudo apt install libqt4-dev libapr1-dev libsvn-dev
$ cd svn2git
$ qmake .
$ make
Once the tool is compiled, we can prepare the btrfs mount in which we will run the migration steps.
$ mkdir repositories
$ truncate -s 300G repositories.btrfs
$ sudo mkfs.btrfs repositories.btrfs
$ sudo mount repositories.btrfs repositories
$ sudo chown 1000:1000 repositories
We will also write a small tool in Go to process the commit messages.
sudo apt install golang
We will also need bfg, a git history cleaning tool. You can download the jar file from the BFG Repo-Cleaner website.
First steps
The first step of the migration is to retrieve the svn repository itself onto the local machine. This is not a checkout of the repository: we need the server-side repository folder directly, with the whole history and metadata.
rsync -avz --progress sshuser@svn.myserver.com:/srv/svn_myrepository/ .
In this case I had SSH access to the server, allowing me to simply rsync the repository. Doing so allowed me to prepare the migration in advance, copying only the new commits on each synchronisation instead of the whole repository with its large history. Most of the repository files are never updated, so this step is only slow on the first execution.
User mapping
The next step is to create a mapping file that maps the svn users to git users. A user in svn is just a username, whereas in git it is a name and an email address.
To get the list of user accounts, we can use the svn command directly on the local repository like this:
svn log file:///home/tsc/svn_myrepository \
| egrep '^r.*lines?$' \
| awk -F'|' '{print $2;}' \
| sort \
| uniq
This will return the list of users found in the logs. For each of these users, you should add a line to a mapping file, like so:
auser Albert User <albert.user@example.com>
aperson Anaelle Personn <anaelle.personn@example.com>
This file will be given as input to svn2git and should be complete, otherwise the import will fail.
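To check that the mapping is complete before launching the import, you can compare the author list with the file. A minimal sketch, assuming the mapping file is named accounts-map.txt as it is further down:
$ svn log file:///home/tsc/svn_myrepository \
    | egrep '^r.*lines?$' \
    | awk -F'|' '{print $2;}' \
    | tr -d ' ' | sort -u > ./svn-authors
$ awk '{print $1;}' ./accounts-map.txt | sort -u > ./mapped-authors
$ comm -23 ./svn-authors ./mapped-authors    # any output is an account missing from the mapping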
Path mapping
The second mapping needed for the svn to git migration is the svn2git rules file. This file tells the program what goes where. In our case, the repository did not strictly adhere to the standard svn tree: it contained a trunk, tags and branches structure, as well as some other folders for “out-of-branch” projects.
# We create the main repository
create repository svn_myrepository
end repository
# We create repositories for external tools that will move
# to their own repositories
create repository aproject
end repository
create repository bproject
end repository
create repository cproject
end repository
# We declare a variable to ease the declaration of the
# migration rules further down
declare PROJECTS=aproject|bproject|cproject
# We create repositories for out-of-branch folders
# that will migrate to their own repositories
create repository aoutofbranch
end repository
create repository boutofbranch
end repository
# We always ignore database dumps wherever they are.
# In our case, the database dumps are named "database-dump-20100112"
# or forms close to that.
match /.*/database([_-][^/]+)?[-_](dump|oracle|mysql)[^/]+
end match
# There are also dumps stored in their own folder
match /.*/database/backup(/old)?/.*(.zip|.sql|.lzma)
end match
# At some time the build results were also added to the history, we want
# to ignore them
match /.*/(build|dist|cache)/
end match
# We process our external tools only on the master branch.
# We use the previously declared variable to reduce the repetition
# and use the pattern match to move it to the correct repository.
match /trunk/(tools/)?(${PROJECTS})/
repository \2
branch master
end match
# And we ignore them if they are on tags or branches
match /.*/(tools/)?${PROJECTS}/
end match
# We start processing our main project after r10, as the
# first commits were missing the trunk and moved the branches, trunk and tags
# folders around.
match /trunk/
min revision 10
repository svn_myrepository
branch master
end match
# There are branches that are hierarchically organized.
# Such cases have to be explicitly configured.
match /branches/(old|dev|customers)/([^/]+)/
repository svn_myrepository
branch \1/\2
end match
# Other branches are as expected directly in the branches folder.
match /branches/([^/]+)/
repository svn_myrepository
branch \1
end match
# The tags were used in a strange fashion before revision r2500,
# so we ignore everything before that refactoring
match /tags/([^/]+)/
max revision 2500
end match
# After that, we create a branch for each tag as the svn tags
# were not used correctly and were committed to. We just name
# them differently and will process them afterwards.
match /tags/([^/]+)/([^/]+)/
min revision 2500
repository svn_myrepository
branch \1-\2
end match
# Our out-of-branch folders will be processed directly, only creating
# a master branch.
match /aoutofbranch/
repository aoutofbranch
branch master
end match
match /boutofbranch/
repository boutofbranch
branch master
end match
# Everything else is discarded and ignored
match /
end match
This file will quickly grow with the number of migration operations that you want to perform. Ignore files here if possible, as it will reduce the migration time as well as the postprocessing that needs to be done afterwards. In my case, a number of files were too complex to match during the migration or were only spotted afterwards, and had to be cleaned in a second pass with other tools.
Migration
This step will take a lot of time as it will read the whole svn history, process the declared rules and generate the git repositories and every commit.
$ cd repositories
$ ~/workspace/svn2git/svn-all-fast-export \
--add-metadata \
--svn-branches \
--identity-map ~/workspace/migration-tools/accounts-map.txt \
--rules ~/workspace/migration-tools/svnfast.rules \
--commit-interval 2000 \
--stat \
/home/tsc/svn_myrepository
If there is a crash during this step, it means that you are either missing an account in your mapping, that one of your rules is emitting an erroneous branch or repository, or that no rule is matching.
Once this step has finished, I like to take a btrfs snapshot so that I can return to this point when putting the next steps into place.
btrfs subvolume snapshot -r repositories repositories/snap-1-import
Cleanup
The next phase is to clean up our import. There will always be a number of branches that are unused, named incorrectly, contain only temporary files, or are so far from the standard naming that our rules could not process them correctly.
We will simply delete them or rename them using git.
$ cd svn_myrepository
$ git branch -D oldbranch-0.3.1
$ git branch -D customer/backup_temp
$ git branch -m customer/stable_v1.0 stable-1.0
The goal at this step is to clean up the branches that will be kept after the migration. We do this now to reduce the repository size early on and thus reduce the time needed for the next steps.
If you see branches that can be deleted or renamed further down the road, you can also remove or rename them then.
I like to take a snapshot at this stage as the next stage usually involves a lot of tests and manually building a list of things to remove.
btrfs subvolume snapshot -r repositories repositories/snap-2a-cleanup
We can also remove files that were added and should not have been, by generating a list of every file ever checked into our new git repository, inspecting it manually and adding the identifiers of the files to remove to a new file:
$ git rev-list --objects --all > ./all-files
$ cat ./all-files | your-filter | cut -d' ' -f1 > ./to-delete-ids
$ java -jar ~/Downloads/bfg-1.12.15.jar --private --no-blob-protection --strip-blobs-with-ids ./to-delete-ids
We will take a snapshot again, as the next step also involves checks and tests.
btrfs subvolume snapshot -r repositories repositories/snap-2b-cleanup
Next, we will convert the binary files that we still want to keep in our repository to Git-LFS. This allows git to only keep track of the hash of the file in the history and not store the whole binary in the repository, thus reducing the size of the clones.
BFG does this quickly and efficiently, removing every file matching the given name from the history and storing it in Git-LFS. This step will require some exploration of the previous all-files file to identify which files need to be converted.
$ java -jar ~/Downloads/bfg-1.12.15.jar --no-blob-protection --private --convert-to-git-lfs 'my-important-archive*.zip'
$ java -jar ~/Downloads/bfg-1.12.15.jar --no-blob-protection --private --convert-to-git-lfs '*.ear'
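For reference, after the conversion each converted file in the git history is replaced by a small Git-LFS pointer of this form (the oid and size below are made up):
version https://git-lfs.github.com/spec/v1
oid sha256:4665a5ea423c2713d436b5ee50593a9640e0018c1550b5a0002f74190d6caea8
size 58640725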
After the cleanup, I also like to do a btrfs snapshot so that the history rewrite step can be executed and tested multiple times.
btrfs subvolume snapshot -r repositories repositories/snap-2c-cleanup
Linking an svn revision to a git commit
For each revision, the import log prints a line mapping it to a mark, and the git repository then contains a marks file that maps each mark to a commit hash. We can use this information to build a mapping database that stores the svn revision to git commit correspondence for later.
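To make this concrete, the two files contain lines of the following form (revision, mark and hash are made up here):
# in the import log (log-svn_myrepository)
progress SVN r1234 branch master = :5678
# in the marks file (marks-svn_myrepository)
:5678 a94a8fe5ccb19ba61c4c0873d391e987982fbbd3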
In our case, I wrote a Java program that will parse both files and store the resulting mapping into a LevelDB database.
This database will then be used by a Go server that loads the mapping into memory and exposes an RPC interface that we will call from small Go binaries during a git filter-branch run. The server will also need to keep track of the modifications to the git commit hashes as the history rewrite changes them.
First, the Java tool to read the logs and generate the LevelDB database:
import com.google.common.collect.BiMap;
import com.google.common.collect.HashBiMap;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.IOUtils;
import org.apache.commons.io.filefilter.DirectoryFileFilter;
import org.apache.commons.io.filefilter.IOFileFilter;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;
import org.iq80.leveldb.impl.Iq80DBFactory;
public class CommitMapping {
public static String FILE_LOG_IMPORT = "../log-svn_myrepository";
public static String FILE_MARKS = "marks-svn_myrepository";
public static String FILE_BFG_DIR = "../svn_myrepository.bfg-report";
public static Pattern PATTERN_LOG = Pattern.compile("^progress SVN (r\\d+) branch .* = (:\\d+)");
public static void main(String[] args) throws Exception {
List<String> importLines = IOUtils.readLines(new FileReader(new File(FILE_LOG_IMPORT)));
List<String> marksLines = IOUtils.readLines(new FileReader(new File(FILE_MARKS)));
Collection<File> passFilesCol = FileUtils.listFiles(new File(FILE_BFG_DIR), new IOFileFilter() {
@Override
public boolean accept(File pathname, String name) {
return name.equals("object-id-map.old-new.txt");
}
@Override
public boolean accept(File path) {
return this.accept(path, path.getName());
}
}, DirectoryFileFilter.DIRECTORY);
List<File> passFiles = new ArrayList<>(passFilesCol);
Collections.sort(passFiles, (File o1, File o2) -> o1.getParentFile().getName().compareTo(o2.getParentFile().getName()));
Map<String, String> commitToIdentifier = new LinkedHashMap<>();
Map<String, String> identifierToHash = new HashMap<>();
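// Parse the svn2git import log: each svn revision (rNNNN) maps to a git fast-import mark (:NNNN).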
for (String importLine : importLines) {
Matcher marksMatch = PATTERN_LOG.matcher(importLine);
if (marksMatch.find()) {
String dest = marksMatch.group(2);
if (dest == null || dest.length() == 0 || ":0".equals(dest)) continue;
commitToIdentifier.put(marksMatch.group(1), dest);
} else {
System.err.println("Unknown line : " + importLine);
}
}
File dbFile = new File(System.getenv("HOME") + "/mapping-db");
File humanFile = new File(System.getenv("HOME") + "/mapping");
FileUtils.deleteQuietly(dbFile);
Options options = new Options();
options.createIfMissing(true);
DB db = Iq80DBFactory.factory.open(dbFile, options);
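// Parse the git marks file: each line maps a mark (:NNNN) to the hash of the imported commit.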
marksLines.stream().map((line) -> line.split("\\s", 2)).forEach((parts) -> identifierToHash.put(parts[0], parts[1]));
BiMap<String, String> commitMapping = HashBiMap.create(commitToIdentifier.size());
for (String commit : commitToIdentifier.keySet()) {
String importId = commitToIdentifier.get(commit);
String hash = identifierToHash.get(importId);
if (hash == null) continue;
commitMapping.put(commit, hash);
}
System.err.println("Got " + commitMapping.size() + " svn -> initial import entries.");
for (File file : passFiles) {
System.err.println("Processing file " + file.getAbsolutePath());
List<String> bfgPass = IOUtils.readLines(new FileReader(file));
Map<String, String> hashMapping = bfgPass.stream().map((line) -> line.split("\\s", 2)).collect(Collectors.toMap(parts -> parts[0], parts -> parts[1]));
for (String hash : hashMapping.keySet()) {
String rev = commitMapping.inverse().get(hash);
if (rev != null) {
String newHash = hashMapping.get(hash);
System.err.println("Replacing r" + rev + ", was " + hash + ", is " + newHash);
commitMapping.replace(rev, newHash);
}
}
}
PrintStream fos = new PrintStream(humanFile);
for (Map.Entry<String, String> entry : commitMapping.entrySet()) {
String commit = entry.getKey();
String target = entry.getValue();
fos.println(commit + "\t" + target);
db.put(Iq80DBFactory.bytes(commit), Iq80DBFactory.bytes(target));
}
db.close();
fos.close();
}
}
We will use RPC between a client and a server so that the LevelDB database can be kept open, with very light clients that query the running server, as they will be executed once per commit. During testing, opening the database on every invocation proved really time consuming, hence this approach, even though the server does very little.
The structure of our Go project is the following:
go-gitcommit/client-common:
rpc.go
go-gitcommit/client-insert:
insert-mapping.go
go-gitcommit/client-query:
query-mapping.go
go-gitcommit/server:
server.go
First, some plumbing for the RPC in rpc.go:
package Client
import (
"net"
"net/rpc"
"time"
)
type (
// Client -
Client struct {
connection *rpc.Client
}
// MappingItem is the response from the cache or the item to insert into the cache
MappingItem struct {
Key string
Value string
}
// BulkQuery allows to mass query the DB in one go.
BulkQuery []MappingItem
)
// NewClient -
func NewClient(dsn string, timeout time.Duration) (*Client, error) {
connection, err := net.DialTimeout("tcp", dsn, timeout)
if err != nil {
return nil, err
}
return &Client{connection: rpc.NewClient(connection)}, nil
}
// InsertMapping -
func (c *Client) InsertMapping(item MappingItem) (bool, error) {
var ack bool
err := c.connection.Call("RPC.InsertMapping", item, &ack)
return ack, err
}
// GetMapping -
func (c *Client) GetMapping(bulk BulkQuery) (BulkQuery, error) {
var bulkResponse BulkQuery
err := c.connection.Call("RPC.GetMapping", bulk, &bulkResponse)
return bulkResponse, err
}
Next, the Go server that will read this database, in server.go:
package main
import (
"fmt"
"log"
"net"
"net/rpc"
"os"
"time"
"github.com/syndtr/goleveldb/leveldb"
Client "../client-common"
)
var (
cacheDBPath = os.Getenv("HOME") + "/mapping-db"
cacheDB *leveldb.DB
flowMap map[string]string
f *os.File
g *os.File
)
type (
// RPC is the base class of our RPC system
RPC struct {
}
)
func main() {
var cacheDBerr error
cacheDB, cacheDBerr = leveldb.OpenFile(cacheDBPath, nil)
if cacheDBerr != nil {
fmt.Fprintln(os.Stderr, "Unable to initialize the LevelDB cache.")
log.Fatal(cacheDBerr)
}
roErr := cacheDB.SetReadOnly()
if roErr != nil {
fmt.Fprintln(os.Stderr, "Unable to initialize the LevelDB cache.")
log.Fatal(roErr)
}
flowMap = make(map[string]string)
f, _ = os.Create(os.Getenv("HOME") + "/go-server/gomapping.log")
defer f.Close()
g, _ = os.Create(os.Getenv("HOME") + "/go-server/gomapping.ins")
defer g.Close()
rpc.Register(NewRPC())
l, e := net.Listen("tcp", ":9876")
if e != nil {
log.Fatal("listen error:", e)
}
go flushLog()
rpc.Accept(l)
}
func flushLog() {
for {
time.Sleep(100 * time.Millisecond)
f.Sync()
}
}
// NewRPC -
func NewRPC() *RPC {
return &RPC{}
}
// InsertMapping -
func (r *RPC) InsertMapping(mappingItem Client.MappingItem, ack *bool) error {
old := mappingItem.Key
new := mappingItem.Value
flowMap[old] = new
g.WriteString(fmt.Sprintf("Inserted mapping %s -> %s\n", old, new))
*ack = true
return nil
}
// GetMapping -
func (r *RPC) GetMapping(bulkQuery Client.BulkQuery, resp *Client.BulkQuery) error {
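// For each queried svn revision (rNNNN), look up the hash of the initially imported commit in LevelDB,
// then the rewritten hash recorded during filter-branch, and return it abbreviated with the revision appended.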
for i := range bulkQuery {
key := bulkQuery[i].Key
response, _ := cacheDB.Get([]byte(key), nil)
gitCommit := key
if response != nil {
responseStr := string(response[:])
responseUpdated := flowMap[responseStr]
if responseUpdated != "" {
gitCommit = string(responseUpdated[:])[:12] + "(" + key + ")"
f.WriteString(fmt.Sprintf("Response to mapping %s -> %s\n", bulkQuery[i].Key, gitCommit))
} else {
f.WriteString(fmt.Sprintf("No git mapping for entry %s\n", responseStr))
}
} else {
f.WriteString(fmt.Sprintf("Unknown revision %s\n", key))
}
bulkQuery[i].Value = gitCommit
}
*resp = bulkQuery
return nil
}
And finally our clients. The insert client will be called from git filter-branch with the previous and current commit hashes after each commit has been processed. We send this information to the server so that the hashes are up to date when mapping a revision. The code goes into insert-mapping.go:
package main
import (
"fmt"
"log"
"os"
"time"
Client "../client-common"
)
func main() {
old := os.Args[1]
new := os.Args[2]
rpcClient, err := Client.NewClient("localhost:9876", time.Millisecond*500)
if err != nil {
log.Fatal(err)
}
mappingItem := Client.MappingItem{
Key: old,
Value: new,
}
ack, err := rpcClient.InsertMapping(mappingItem)
if err != nil || !ack {
log.Fatal(err)
}
fmt.Println(new)
}
The query client will receive the commit message of each commit, check whether it contains an rNNNN revision reference and query the server for the corresponding git hash. It goes into query-mapping.go:
package main
import (
"fmt"
"io/ioutil"
"log"
"os"
"regexp"
"strings"
"time"
client "../client-common"
)
func main() {
// Read the whole commit message from stdin; it can span several lines.
raw, _ := ioutil.ReadAll(os.Stdin)
text := string(raw)
re := regexp.MustCompile(`\Wr[0-9]+`)
matches := re.FindAllString(text, -1)
if matches == nil {
fmt.Print(text)
return
}
rpcClient, err := client.NewClient("localhost:9876", time.Millisecond*500)
if err != nil {
log.Fatal(err)
}
var bulkQuery client.BulkQuery
for i := range matches {
if matches[i][0] != '-' {
key := matches[i][1:]
bulkQuery = append(bulkQuery, client.MappingItem{Key: key})
}
}
gitCommits, _ := rpcClient.GetMapping(bulkQuery)
for i := range gitCommits {
gitCommit := gitCommits[i].Value
key := gitCommits[i].Key
text = strings.Replace(text, key, gitCommit, 1)
}
fmt.Print(text)
}
For this step, we first need to compile and execute the Java program. Once it has succeeded in creating the database, we compile the Go binaries and run the server in the background.
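As an illustration, the Go part could be built and started like this. This is only a sketch, assuming a pre-modules Go toolchain and the binary paths used in the git filter-branch call below; adapt it to your Go version and layout:
$ cd ${HOME}/migration-tools/go-gitcommit
$ go get github.com/syndtr/goleveldb/leveldb
$ go build -o client-insert/client-insert ./client-insert
$ go build -o client-query/client-query ./client-query
$ go build -o server/server ./server
$ mkdir -p ${HOME}/go-server    # the server writes its log files there
$ ./server/server &             # keeps the LevelDB mapping open and listens on localhost:9876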
Then, we can launch git filter-branch on our repository to rewrite the history:
$ git filter-branch \
--commit-filter 'NEW=`git_commit_non_empty_tree "$@"`; \
${HOME}/migration-tools/go-gitcommit/client-insert/client-insert $GIT_COMMIT $NEW' \
--msg-filter "${HOME}/migration-tools/go-gitcommit/client-query/client-query" \
-- --all --author-date-order
As after each step, we take a snapshot, even though this should be the last step that cannot easily be repeated.
btrfs subvolume snapshot -r repositories repositories/snap-3-mapping
We now clean up the repository, which still contains a lot of unused blobs, branches, commits, …
$ git reflog expire --expire=now --all
$ git prune --expire=now --progress
$ git repack -adf --window-memory=512m
We now have a repository that should be more or less clean. You will have to check the history, the size of the blobs, and whether some branches can still be deleted before pushing it to your server.
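To inspect the blob sizes, one possibility is to list the largest objects in the repacked pack files and cross-reference the hashes with the all-files list generated earlier; for example:
$ git verify-pack -v .git/objects/pack/pack-*.idx \
    | grep ' blob ' | sort -k3 -n -r | head -20
$ grep <blob-hash> ./all-files    # shows the path a given blob belongs to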