Migrating and filtering Subversion repositories
I’ve been tasked with merging one Subversion repository into another. The source repository is 35GB and the destination about 2GB. The source is bloated by the ritual inclusion of large directories of binary files that were continually wiped and re-added, as well as branched and tagged hundreds of times.
The goal is to eliminate these binaries and all history related to them in order to reduce the repository to a much more manageable size. Easy enough to do with svnadmin dump and svndumpfilter exclude X Y Z, right?
I was pleased to find out that svndumpfilter does take multiple excludes or includes on the command line. I didn’t see this in the documentation, and found a surprising number of people piping multiple copies of svndumpfilter together in order to exclude multiple things. I figured out that I’d need 1,188 exclusions in order to wipe out all permutations of these binary objects in trunk, tags, and branches. It was simple enough to build a script with awk that would list the exclusions on the svndumpfilter command line.
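A minimal sketch of that awk step, assuming the path list comes from something like `svnlook tree --full-paths` and that the unwanted directories share a recognizable name. The function name `build_excludes` and the directory name `binaries` are hypothetical, and paths containing spaces would need extra quoting:

```shell
#!/bin/sh
# Hypothetical helper: read repository paths on stdin (one per line) and print
# every path whose last component equals "$1", joined on a single line so the
# result can be appended to "svndumpfilter exclude".
build_excludes() {
    awk -v name="$1" '
        {
            sub(/\/$/, "")               # svnlook prints directories with a trailing slash
            n = split($0, parts, "/")
            if (parts[n] == name) {
                printf "%s%s", sep, $0   # space-separate the matches
                sep = " "
            }
        }
        END { if (sep != "") print "" }  # finish the line only if something matched
    '
}

# Intended use (not run here; it needs a real repository and dump file):
#   excludes=$(svnlook tree --full-paths /var/svn/oldrepository | build_excludes binaries)
#   svndumpfilter exclude $excludes < full.dump > filtered.dump
```

Note that `svnlook tree` only lists one revision, so paths that existed only in older revisions would have to be collected some other way.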
Unfortunately, 162 revisions into 45,000, it failed because of an svn copy from an excluded path into an included one. Svndumpfilter cannot handle this scenario: by the time it discovers that it needs something it has excluded, it is already well past that data in the stream. I was half expecting this from the documentation, but hoping against reason that none of the developers had ever copied anything.
More searching turned up svndumpfilter2 and svndumpfilter3, two scripts written to solve this specific problem. They use svnlook to grab the missing piece directly from the repository. Unfortunately, I quickly found that svndumpfilter2 is known for crashing on large repositories, where “large” means a whopping 150MB. Svndumpfilter3 was written to overcome svndumpfilter2’s limitations, but the author’s web page carries a large disclaimer that it, too, is known to fail on large repositories. Not confidence inspiring.
Browsing through the subversion 1.5 manual again, I found this blurb that caused a lightbulb to go off:
Q: How does svnsync deal with parts of the master repository that I'm not authorized to read?
A: svnsync will simply not copy parts of the repository that you cannot read; files copied from "private" parts of the repository into "public" parts will look as if they have been added from scratch. If a revision only modifies files that you cannot read, it will appear to be empty. (Just like with "svn log", log messages from revisions you cannot read part of will be empty.)
This might be the exact functionality we need – it would give us the opportunity to use svnsync to keep a synchronized mirror going up to the point of the final split while easily filtering out the data we didn’t want. Since I haven’t stumbled across this filtering technique described elsewhere on the net, I figured it might be blog-worthy. Svnsync doesn’t support synchronizing multiple subdirectories to a single repository, so we used a temporary repository as a mirror, dumped it, and then imported it into our final repository.
- Enable authorization in repository/conf/svnserve.conf
- Configure authz to restrict access to the directories you want to exclude
- Create a destination repository for svnsync
- Use svnsync to synchronize to the temporary destination repository
- Dump the destination repository
- Import the dump into the destination repository
1. Enable authorization in repository/conf/svnserve.conf
This will differ if you’re using WebDAV, but in this particular case, ssh+svnserve is the protocol of choice.
### Uncomment the line below to use the default authorization file
authz-db = authz
2. Configure authz to restrict access to the directories you want to exclude
You’ll need to modify the authz file to restrict access to the paths in the source repository that you do not want to get synchronized to your destination repository. This is easily accomplished by setting the permissions for the user you’re going to use to do the svnsync to nothing – that is, no “r” or “rw.”
[/]
* = rw

[/trunk/binaryfilespath]
svnsync =

[/tags]
svnsync =

[/branches]
svnsync =
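With more than a handful of paths to hide, the sections can be generated instead of typed. A small sketch, assuming the sync user is called `svnsync` as in the example above; the function name `deny_paths` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print one authz section per path that hides the path
# from a single user. An empty right-hand side ("user =") means no access.
deny_paths() {
    user="$1"
    shift
    for path in "$@"; do
        printf '[%s]\n%s =\n\n' "$path" "$user"
    done
}

# Intended use: append the generated sections to the authz file, for example
#   deny_paths svnsync /trunk/binaryfilespath /tags /branches >> repository/conf/authz
```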
3. Create a destination repository for svnsync
You need a new repository for svnsync to mirror to, and you’ll have to modify the pre-revprop-change script so that it allows at least the user you’re using to modify revprops.
$ svnadmin create dest
$ cat << 'EOF' > dest/hooks/pre-revprop-change
#!/bin/sh
exit 0
EOF
$ chmod +x dest/hooks/pre-revprop-change
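The hook above accepts every revision property change from anyone. A stricter variant is sketched below, assuming the mirror is only ever written by a dedicated `svnsync` account. Subversion passes the hook the repository path, revision, user, property name and action, in that order; the function name `allow_revprop_change` is hypothetical and exists only so the logic can be exercised on its own:

```shell
#!/bin/sh
# Sketch of a stricter dest/hooks/pre-revprop-change.
# Subversion invokes the hook as: REPOS-PATH REVISION USER PROPNAME ACTION.
allow_revprop_change() {
    user="$3"
    if [ "$user" = "svnsync" ]; then
        return 0    # the sync account may change any revprop
    fi
    echo "Only the svnsync user may change revision properties" >&2
    return 1        # everyone else is rejected
}

# In the installed hook file the last line would be:
#   allow_revprop_change "$@"
```

This only works if every svnsync run against the mirror commits as that user, for example by passing --username svnsync on each invocation; otherwise the sync itself will be rejected.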
4. Use svnsync to synchronize to the temporary destination repository
One nice advantage of using svnsync is that you can keep re-running the sync to keep the mirror updated, leaving only a very small delta at the time of switchover. We ran the initial sync hours ahead of time and spent only a few minutes on the final sync, once all the developers had logged out of the system. Note: You only run initialize once!
$ svnsync init --username svnsync file://`pwd`/dest svn+ssh://svnhost/var/svn/oldrepository
Copied properties for revision 0
$ svnsync sync file://`pwd`/dest
Committed revision 1.
Copied properties for revision 1.
Committed revision 2.
Copied properties for revision 2.
Committed revision 3.
Copied properties for revision 3.
...
$ svnsync sync file://`pwd`/dest
Committed revision 47483.
Copied properties for revision 47483.
Committed revision 47484.
Copied properties for revision 47484.
...
5. Dump the destination repository
$ svnadmin dump ./dest > clean-repository.dump
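Before loading the dump anywhere, a rough sanity check is to confirm that no node records remain under the paths that were supposed to be filtered out. The sketch below relies on the dump format listing each changed path on a `Node-path:` line without a leading slash; the function name `count_nodes_under` is hypothetical, and the prefix is treated as a regular expression, so unusual characters in it would need escaping:

```shell
#!/bin/sh
# Hypothetical check: count the node records in a dump file whose path is the
# given prefix or lies beneath it.
count_nodes_under() {
    dump="$1"
    prefix="$2"
    # -a: treat the dump as text even though it embeds binary file contents.
    # grep -c exits non-zero when the count is 0, so mask that with "|| true".
    grep -a -c -E "^Node-path: ${prefix}(/|\$)" "$dump" || true
}

# Intended use: each of these should print 0 for a cleanly filtered dump.
#   count_nodes_under clean-repository.dump trunk/binaryfilespath
#   count_nodes_under clean-repository.dump tags
#   count_nodes_under clean-repository.dump branches
```

A file whose own contents happen to include a line starting with `Node-path:` would be counted as well, so treat a non-zero result as a prompt to look closer rather than as proof of a problem.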
6. Import the dump into the destination repository
This step is a tad bit trickier than you might suspect. If you want to import the dump into a new subdirectory in the repository, you must create that subdirectory first or you’ll get an error:
svnadmin: File not found: transaction '0-1', path 'trunk/newdir'

# svn import adds the contents of the local trunk directory under the URL,
# so the URL has to end in /trunk for newdir to land in trunk/newdir.
$ mkdir -p trunk/newdir
$ svn import -m "Import structure" trunk file:///var/svn/repository/trunk
$ svnadmin load --parent-dir trunk/newdir /var/svn/repository < clean-repository.dump
<<< Started new transaction, based on original revision 1
------- Committed new rev 44888 (loaded from original rev 1) >>>
It goes without saying, test your newly imported repository. You don’t want to get 3 months down the line and realize you missed a truckload of revision history.