[8.2] Mapping genomic coordinates to gene names (pyensembl)
1. pyensembl
1.1 Installing pyensembl
activate3
pip install pyensembl
Download the GTF annotation file
mkdir -p /data/database/genome/hg38/gtf
cd /data/database/genome/hg38/gtf
wget -c https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.refGene.gtf.gz
gzip -d hg38.refGene.gtf.gz
# Drop every line containing "fix" (fix-patch contigs)
grep -v fix hg38.refGene.gtf > hg38.refGene_remove_fix.gtf
# Drop every line containing "alt" (alternate-locus contigs)
grep -v alt hg38.refGene_remove_fix.gtf > hg38.refGene_remove.gtf
# Drop every line containing "MIR" (microRNA records)
grep -v MIR hg38.refGene_remove.gtf > hg38.refGene_remove_3.gtf
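The three grep filters match anywhere in the line, so they can also drop records that merely contain "fix", "alt", or "MIR" in some other field. Below is a minimal Python sketch of a stricter variant that checks only the contig column and the gene_name attribute. The function names keep_gtf_line and filter_gtf are hypothetical, and the sketch assumes that patch contigs end in "_fix" or "_alt" and that every record carries a gene_name attribute; neither assumption has been checked against the actual file, and the output will not be identical to the grep version.

```python
import re

# Matches the gene_name attribute in column 9 of a GTF record (assumed to be present).
GENE_NAME_RE = re.compile(r'gene_name "([^"]+)"')


def keep_gtf_line(line):
    """Return True if a GTF line should be kept in the filtered file."""
    if line.startswith("#") or not line.strip():
        return True  # keep comment lines and blank lines untouched
    fields = line.rstrip("\n").split("\t")
    seqname = fields[0]
    # Drop fix-patch and alternate-locus contigs, judged by contig name only.
    if seqname.endswith("_fix") or seqname.endswith("_alt"):
        return False
    # Drop records whose gene name starts with "MIR" (assumed to be microRNA genes).
    match = GENE_NAME_RE.search(fields[8]) if len(fields) > 8 else None
    if match and match.group(1).startswith("MIR"):
        return False
    return True


def filter_gtf(in_path, out_path):
    """Copy in_path to out_path, keeping only the lines accepted by keep_gtf_line."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if keep_gtf_line(line):
                dst.write(line)
```

For example, filter_gtf('hg38.refGene.gtf', 'hg38.refGene_remove_3.gtf') would replace the three grep steps in one pass.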
1.2 Python usage example
import os
import sys

# Set the cache directory before pyensembl creates any index files
os.environ['PYENSEMBL_CACHE_DIR'] = '/data/tmp'

from pyensembl.genome import Genome

print(sys.modules['pyensembl'])  # show which pyensembl installation was imported


def get_genname_by_loc():
    data = Genome(
        reference_name='hg38',
        annotation_name='features',
        # gtf_path_or_url points to the GTF file
        gtf_path_or_url='/data/database/genome/hg38/gtf/hg38.refGene_remove_3.gtf')
    # Parse the GTF and build the index, which is really just a SQLite database
    data.index()
    gene_ids = data.gene_ids_at_locus(contig='chr12', position=25245365)
    # gene_names = data.gene_names_at_locus(contig='chr12', position=2524)
    # exon_ids = data.exon_ids_of_gene_name('KRAS')
    print(gene_ids)
    # print(exon_ids)


get_genname_by_loc()
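In practice a lookup usually covers many positions rather than one. The sketch below wraps gene_names_at_locus in a small helper that takes a list of (contig, position) pairs. The helper name annotate_loci is hypothetical; it only assumes that the object passed in exposes gene_names_at_locus(contig=..., position=...), as a pyensembl Genome does.

```python
def annotate_loci(genome, loci):
    """Map each (contig, position) pair to the gene names overlapping it.

    genome: any object with a gene_names_at_locus(contig=..., position=...)
            method, for example an indexed pyensembl Genome.
    loci:   iterable of (contig, position) tuples.
    """
    results = {}
    for contig, position in loci:
        names = genome.gene_names_at_locus(contig=contig, position=position)
        # Drop empty names and duplicates while keeping the original order.
        results[(contig, position)] = list(dict.fromkeys(n for n in names if n))
    return results
```

With an indexed Genome object called data, annotate_loci(data, [('chr12', 25245365)]) returns a dictionary keyed by locus. A position that overlaps no annotated gene maps to an empty list.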
3. Discussion
3.1 Multiple annotation sets for the same reference genome
https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/
- hg38.ensGene.gtf.gz
- hg38.knownGene.gtf.gz
- hg38.ncbiRefSeq.gtf.gz
- hg38.refGene.gtf.gz
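In refGene, OR4F16 is annotated on both chromosome 5 and chromosome 1, whereas in ncbiRefSeq it is annotated only on chromosome 1. The same coordinate lookup can therefore return different genes depending on which annotation file is indexed.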
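One way to spot this kind of discrepancy is to list the contigs on which each gene name appears in a GTF file, then compare the result across annotation sets. The following is a minimal sketch. The function names contigs_by_gene and multi_contig_genes are hypothetical, and the sketch assumes that each record carries a gene_name attribute in column 9.

```python
import re
from collections import defaultdict

# Matches the gene_name attribute in column 9 of a GTF record (assumed to be present).
GENE_NAME_RE = re.compile(r'gene_name "([^"]+)"')


def contigs_by_gene(gtf_lines):
    """Return {gene_name: set of contigs on which the gene is annotated}."""
    contigs = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header and comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip malformed or blank lines
        match = GENE_NAME_RE.search(fields[8])
        if match:
            contigs[match.group(1)].add(fields[0])
    return contigs


def multi_contig_genes(gtf_lines):
    """Return only the genes annotated on more than one contig."""
    return {gene: sorted(found)
            for gene, found in contigs_by_gene(gtf_lines).items()
            if len(found) > 1}
```

Passing an open file handle for each GTF (for example the refGene and ncbiRefSeq files) and comparing the entries for a gene such as OR4F16 shows where the two annotation sets disagree.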
3.2 Looking up information by coordinate or by gene name on the web

