UCSC Genome Bioinformatics
  Procedure for Creating a Mirror Site for the UCSC Genome Browser
 

The most complete and up-to-date instructions for setting up a mirror can be found in our source tree. You may choose to set up either a full mirror browser or a partial one, depending on your disk space and needs. Additionally, these blog posts may be helpful with setting up a mirror site on CentOS and Ubuntu.

A license is required for commercial download and/or installation of the Genome Browser binaries and source code. No license is needed for academic, nonprofit, and personal use. To purchase a license, see our License Instructions or visit the Genome Browser store.

Space required:

The amount of data available in the Genome Browser is growing constantly. To determine the size of any of the download directories mentioned in these instructions, use the rsync "-n" option on the directory prior to actually transferring the data. For instance, to find the size of the /gbdb directory, run:

rsync -navP rsync://hgdownload.cse.ucsc.edu/gbdb/
The rsync options used in this command are:
-n, --dry-run         show what would have been transferred
-a, --archive         archive mode, equivalent to -rlptgoD
-v, --verbose         increase verbosity
-P                    equivalent to --partial --progress
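
As a rough sketch of how the dry-run output can be reduced to a single number, the helper below pulls the byte count out of the "total size is ..." summary line that rsync prints at the end of a verbose run. The function name rsync_total_bytes and the local path in the usage comment are hypothetical, and the sketch assumes the summary line has the form printed by common rsync versions.

```shell
# Hypothetical helper: read the output of a verbose rsync dry run on stdin
# and print the byte count from the final "total size is ..." line.
rsync_total_bytes() {
    awk '/^total size is / {
        gsub(/,/, "", $4)   # some rsync versions group digits with commas
        print $4
    }'
}

# Example usage (needs network access; the destination path is a placeholder):
#   rsync -navP rsync://hgdownload.cse.ucsc.edu/gbdb/ /my/local/gbdb/ | rsync_total_bytes
```

Because the -n option is still in effect, nothing is transferred; the helper only summarizes what rsync reports it would have copied.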

Mirror site questions may be directed to the mailing list genome-mirror@soe.ucsc.edu. Messages sent to this address will be posted to the moderated genome-mirror mailing list, which is archived on a SEARCHABLE PUBLIC Google Groups forum.

Subscribe to the genome-mirror mailing list.



  Using UDR to Speed Up the Download Process
 

UDR (UDT Enabled Rsync) is a download protocol that is very efficient at sending large amounts of data over long distances. UDR utilizes rsync as the transport mechanism, but sends the data over the UDT protocol. UDR is not written or managed by UCSC. It is an open source tool created by the Laboratory for Advanced Computing at the University of Chicago. It has been tested to work under Linux, FreeBSD and Mac OS X, but may work under other UNIX variants. The source code can be obtained through GitHub.

If you are a casual or occasional manual downloader of data, there is no need to change your method; continue to visit our download server to download the files you need. This new protocol has been put in place to enable huge amounts of data to be downloaded quickly over long distances.

Typical TCP-based protocols such as HTTP, FTP and rsync become slower the farther away the download source is from you. Protocols like UDT/UDR allow many UDP packets to be sent in a batch, which permits much higher transfer speeds over long distances. UDR will be especially useful for users who are downloading from places that are far away from California. Users on the US East Coast and in the international community will likely see much higher download speeds by using UDR rather than rsync, HTTP or FTP.

If you need help building the UDR binaries or have questions about how UDR functions, please read the documentation on the UDR GitHub page and, if necessary, contact the UDR authors through that page. UDR is written in C++. It is open source and is released under the Apache 2.0 License. In order for it to work, you must have rsync installed on your system.

For your convenience, we are offering a binary distribution of UDR for Red Hat Enterprise Linux 6.x (or variants such as CentOS 6 or Scientific Linux 6). You'll find both a 64-bit and 32-bit rpm here.

Once you have a working UDR binary, either by building from source or by installing the rpm, you can download files from either of our download servers in much the same way as with rsync. For example, using rsync, you might download all of the MySQL tables for the hg19 database with the following command:

			
rsync -avP rsync://hgdownload.cse.ucsc.edu/mysql/hg19/ /my/local/hg19/
Using UDR is very similar. The UDR syntax for downloading the same data would be:
udr rsync -avP hgdownload.cse.ucsc.edu::mysql/hg19/ /my/local/hg19/
If you installed the rpm, run the man udr command to read the man page; if you installed from source, please refer to the UDR GitHub page for more details on the capabilities of UDR and how to use it.
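
The rsync and UDR commands above differ only in the program prefix and in the way the server and module are written, so a wrapper script can build either one. The following is a minimal sketch and not part of the UCSC or UDR tooling: the function name mirror_cmd and its arguments are made up for illustration, and it only prints the command so that it can be inspected before being run.

```shell
# Hypothetical helper: print the download command for one database's
# MySQL tables, using either plain rsync or UDR.
#   $1 = tool ("rsync" or "udr"), $2 = database (e.g. hg19), $3 = local directory
mirror_cmd() {
    tool=$1 db=$2 dest=$3
    host=hgdownload.cse.ucsc.edu
    case $tool in
        rsync) printf 'rsync -avP rsync://%s/mysql/%s/ %s\n' "$host" "$db" "$dest" ;;
        udr)   printf 'udr rsync -avP %s::mysql/%s/ %s\n' "$host" "$db" "$dest" ;;
        *)     printf 'unknown tool: %s\n' "$tool" >&2; return 1 ;;
    esac
}

# Pick UDR when it is installed, otherwise fall back to rsync:
#   if command -v udr >/dev/null 2>&1; then tool=udr; else tool=rsync; fi
#   mirror_cmd "$tool" hg19 /my/local/hg19/
```

Printing the command rather than executing it keeps the sketch safe to try on a machine that has neither network access nor UDR installed.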

UDR establishes connections on TCP/9000, then transmits the data stream over UDP/9000-9100. Your institution may need to modify its firewall rules to allow inbound and outbound traffic on ports TCP/9000 and UDP/9000-9100 to and from either of the two download machines.

If you have difficulties installing or using UDR on your system, contact the Laboratory for Advanced Computing through their GitHub page.

If you have questions about mirroring the UCSC Genome Browser, direct them to the mailing list genome-mirror@soe.ucsc.edu. Messages sent to this address will be posted to the moderated genome-mirror mailing list, which is archived on a SEARCHABLE PUBLIC Google Groups forum.