Converting genomic coordinates

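I have recently been organizing a batch of data in which some records carry hg19 coordinates and others carry hg38 coordinates. Everything ultimately has to be unified on hg38, so the hg19 coordinates need to be converted to hg38 in bulk. UCSC provides an online tool for this, hgLiftOver.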

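NCBI provides a Perl script that can be run locally (it still needs a network connection, of course). According to its usage notes, the tool can convert up to 250,000 records per run, and at most four instances can run at the same time.
Download: remap_api.pl (ftp://ftp.ncbi.nlm.nih.gov/pub/remap/)
For usage details, see: http://blog.sina.com.cn/s/blog_8808cae20102veb0.html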
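Because of the 250,000-record limit and the cap of four simultaneous runs, a large input has to be split before it is submitted. The following is only a minimal sketch of that splitting step: the names `batched_records` and `plan_rounds` are hypothetical helpers, not part of remap_api.pl, and the script's own command-line options are deliberately left out.

```python
from itertools import islice
from typing import Iterable, Iterator, List

# Limits quoted in the post; check the tool's own usage notes before relying on them.
MAX_RECORDS_PER_JOB = 250_000
MAX_PARALLEL_JOBS = 4


def batched_records(records: Iterable[str],
                    batch_size: int = MAX_RECORDS_PER_JOB) -> Iterator[List[str]]:
    """Yield consecutive batches holding at most `batch_size` records each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def plan_rounds(n_batches: int,
                max_parallel: int = MAX_PARALLEL_JOBS) -> List[List[int]]:
    """Group batch indices into rounds of at most `max_parallel` concurrent jobs."""
    return [list(range(i, min(i + max_parallel, n_batches)))
            for i in range(0, n_batches, max_parallel)]
```

Each batch can then be written to its own file and submitted separately, one round at a time.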

A quick test:
Build 38 (hg38): chr22:23420826-23435206
Build 37 (hg19): chr22:23763013-23777393
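One cheap sanity check on a converted pair like this is to compare the interval lengths in the two builds. The sketch below parses the two region strings from the test and compares them. It assumes the `chr:start-end` strings are 1-based and inclusive, and `Region` and `parse_region` are illustrative names. Equal lengths only suggest that no insertion or deletion falls inside the region; they do not prove the mapping is correct.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int

    @property
    def length(self) -> int:
        # Assumes 1-based, inclusive coordinates.
        return self.end - self.start + 1


_REGION_RE = re.compile(r"^(chr[\w.]+):([\d,]+)-([\d,]+)$")


def parse_region(text: str) -> Region:
    """Parse a 'chrN:start-end' string; thousands separators are tolerated."""
    match = _REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a region string: {text!r}")
    chrom, start, end = match.groups()
    start_i = int(start.replace(",", ""))
    end_i = int(end.replace(",", ""))
    if start_i > end_i:
        raise ValueError(f"start is greater than end in {text!r}")
    return Region(chrom, start_i, end_i)


hg38 = parse_region("chr22:23420826-23435206")
hg19 = parse_region("chr22:23763013-23777393")

same_length = hg38.length == hg19.length  # True for this pair
offset = hg19.start - hg38.start          # shift between the two builds here
print(hg38.length, hg19.length, same_length, offset)
```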

2 thoughts on "Converting genomic coordinates"

  1. There are several well-known tools that can help you with these kinds of tasks:

    UCSC liftOver. This tool is available through a simple web interface, or it can be downloaded as a standalone executable. To use the executable you will also need to download the appropriate chain file. Each chain file describes conversions between a pair of genome assemblies. liftOver can be used through Galaxy as well. There is a Python implementation of liftOver called pyliftover that does conversion of point coordinates only.

    NCBI Remap. This tool is conceptually similar to liftOver in that it manages conversions between a pair of genome assemblies, but it uses different methods to achieve these mappings. It is also available through a simple web interface, or you can use the API for NCBI Remap.

    The Ensembl API. Converting between coordinate systems within a single genome assembly can be accomplished with the Ensembl core API. Many examples are provided within the installation, overview, tutorial and documentation sections of the Ensembl API project. In particular, refer to these sections of the tutorial: ‘Coordinates’, ‘Coordinate systems’, ‘Transform’, and ‘Transfer’. Ensembl also has a simple web service for coordinate conversions.

    CrossMap. A standalone open-source program for convenient conversion of genome coordinates (or annotation files) between different assemblies. It supports most commonly used file formats, including SAM/BAM, Wiggle/BigWig, BED, GFF/GTF and VCF. CrossMap is designed to lift over genome coordinates between assemblies. It is not a program for aligning sequences to a reference genome, and it is not recommended for converting genome coordinates between species.

    Bioconductor rtracklayer package. For R users, Bioconductor has an implementation of UCSC liftOver in the rtracklayer package. To see documentation on how to use it, open an R session and run the following commands.

    # biocLite() has been retired; BiocManager is the current Bioconductor installer
    if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
    BiocManager::install("rtracklayer")
    library(rtracklayer)
    ?liftOver

    Flo. A liftover pipeline for different reference genome builds of the same species. It describes the process as follows: “align the new assembly with the old one, process the alignment data to define how a coordinate or coordinate range on the old assembly should be transformed to the new assembly, transform the coordinates.”
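To make the role of a chain file more concrete, here is a minimal sketch of how a single position can be mapped through ungapped alignment blocks read from UCSC chain-format text. It is not how liftOver or pyliftover are implemented. The chain text is made up for illustration, only chains on the + strand of both assemblies are handled, and the first matching block wins, whereas real tools also deal with reverse-strand and overlapping chains. Coordinates are 0-based and half-open, as in the chain format.

```python
from typing import Iterable, List, Optional, Tuple

# (source chrom, source start, source end, destination chrom, destination start)
Block = Tuple[str, int, int, str, int]

# Made-up chain for illustration: two aligned blocks separated by a gap.
EXAMPLE_CHAIN = """
chain 1000 chrA 1000 + 100 400 chrA 1200 + 150 460 1
100 50 60
150
"""


def parse_chain_blocks(lines: Iterable[str]) -> List[Block]:
    """Collect ungapped blocks from chain-format lines (+/+ chains only)."""
    blocks: List[Block] = []
    t_name = q_name = ""
    t_pos = q_pos = 0
    skip = False
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue
        if fields[0] == "chain":
            # Header: chain score tName tSize tStrand tStart tEnd
            #         qName qSize qStrand qStart qEnd id
            t_name, t_strand, t_pos = fields[2], fields[4], int(fields[5])
            q_name, q_strand, q_pos = fields[7], fields[9], int(fields[10])
            skip = t_strand != "+" or q_strand != "+"  # simplification
            continue
        if skip:
            continue
        size = int(fields[0])
        blocks.append((t_name, t_pos, t_pos + size, q_name, q_pos))
        if len(fields) == 3:
            # "size dt dq": advance past the block and the gap on each side.
            t_pos += size + int(fields[1])
            q_pos += size + int(fields[2])
    return blocks


def lift_point(blocks: List[Block], chrom: str, pos: int) -> Optional[Tuple[str, int]]:
    """Map a 0-based position; return None if it falls outside every block."""
    for t_name, t_start, t_end, q_name, q_start in blocks:
        if t_name == chrom and t_start <= pos < t_end:
            return q_name, q_start + (pos - t_start)
    return None


blocks = parse_chain_blocks(EXAMPLE_CHAIN.splitlines())
print(lift_point(blocks, "chrA", 120))  # inside the first block
print(lift_point(blocks, "chrA", 220))  # inside the gap, so unmapped
```

A linear scan over the blocks is fine for a toy example. For real chain files an interval index would be the more sensible choice.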
