cd /data/software
wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred

chmod 755 /data/software/gtfToGenePred


2.1 Program description

gtfToGenePred - convert a GTF file to a genePred
   gtfToGenePred gtf genePred

     -genePredExt - create a extended genePred, including frame
      information and gene name
     -allErrors - skip groups with errors rather than aborting.
      Useful for getting infomation about as many errors as possible.
     -ignoreGroupsWithoutExons - skip groups contain no exons rather than
      generate an error.
     -infoOut=file - write a file with information on each transcript
     -sourcePrefix=pre - only process entries where the source name has the
      specified prefix.  May be repeated.
     -impliedStopAfterCds - implied stop codon in after CDS
     -simple    - just check column validity, not hierarchy, resulting genePred may be damaged
     -geneNameAsName2 - if specified, use gene_name for the name2 field
      instead of gene_id.
     -includeVersion - it gene_version and/or transcript_version attributes exist, include the version
      in the corresponding identifiers.

2.2 Downloading a genePred table online

The CHOPCHOP script needs a gene table to look up genomic coordinates if you want to supply gene names rather than coordinates. To get an example genePred table from the UCSC Table Browser:

Select organism and assembly
Select group: Genes and Gene Predictions
Select track: RefSeq Genes or Ensembl Genes
Select table: refFlat or ensGene
Select region: genome
Select output format: all fields from selected table
Set the output file name with the extension ".gene_table", e.g. danRer10.gene_table
Get output


cd /data/database/homo/genepred

wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.annotation.gtf.gz
gunzip *.gz 

/data/software/gtfToGenePred -genePredExt -geneNameAsName2 gencode.v29.annotation.gtf hg38.genePred      
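A quick sanity check for the conversion above is to count the distinct transcript_id values on the exon lines of the GTF and compare that with the number of lines in the genePred output. The sketch below runs on a tiny inline GTF sample so that it works without the real file; the sample lines are abbreviated for illustration (attributes trimmed), and the variable names gtf and n_tx are assumptions made for this example.

```shell
# Build a tiny, abbreviated GTF sample (illustrative; real GENCODE lines carry more attributes).
gtf=$(mktemp)
printf 'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972.5"; gene_name "DDX11L1";\n' > "$gtf"
printf 'chr1\tHAVANA\ttranscript\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_name "DDX11L1";\n' >> "$gtf"
printf 'chr1\tHAVANA\texon\t11869\t12227\t.\t+\t.\tgene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_name "DDX11L1";\n' >> "$gtf"
printf 'chr1\tHAVANA\texon\t12613\t12721\t.\t+\t.\tgene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_name "DDX11L1";\n' >> "$gtf"
printf 'chr1\tHAVANA\texon\t13221\t14409\t.\t+\t.\tgene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_name "DDX11L1";\n' >> "$gtf"

# Count distinct transcript_id values seen on exon lines.
n_tx=$(awk -F'\t' '
    $3 == "exon" && match($9, /transcript_id "[^"]+"/) {
        seen[substr($9, RSTART, RLENGTH)] = 1
    }
    END {
        n = 0
        for (k in seen) n++
        print n
    }' "$gtf")

echo "transcripts with exons: $n_tx"
rm -f "$gtf"
```

On the real data, the same awk command can be pointed at gencode.v29.annotation.gtf and its result compared with `wc -l < hg38.genePred`; the two numbers should agree as long as no transcript groups were skipped.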

head hg38.genePred
# The file itself has no header line; the columns, in order, are:
# name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds score name2 cdsStartStat cdsEndStat exonFrames
ENST00000456328.2       chr1    +       11868   14409   14409   14409   3       11868,12612,13220,      12227,12721,14409,      0       DDX11L1 none    none    -1,-1,-1,
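To make the column layout concrete, here is a minimal sketch that takes the sample row shown above and sums the exon lengths from the exonStarts and exonEnds columns (fields 9 and 10). The variable names row and tx_len are assumptions made for this example.

```shell
# Sample genePredExt row copied from the output above (tab-separated).
row=$(printf 'ENST00000456328.2\tchr1\t+\t11868\t14409\t14409\t14409\t3\t11868,12612,13220,\t12227,12721,14409,\t0\tDDX11L1\tnone\tnone\t-1,-1,-1,')

# Sum (exonEnd - exonStart) over all exons; the lists end with a trailing comma,
# so the empty last element produced by split() is skipped.
tx_len=$(printf '%s\n' "$row" | awk -F'\t' '{
    n = split($9, starts, ",")
    split($10, ends, ",")
    total = 0
    for (i = 1; i <= n; i++)
        if (starts[i] != "") total += ends[i] - starts[i]
    print total
}')

echo "$tx_len"
```

genePred coordinates are zero-based and half-open, so each exon length is simply end minus start, with no +1 correction.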

sed '1i\name\tchrom\tstrand\ttxStart\ttxEnd\tcdsStart\tcdsEnd\texonCount\texonStarts\texonEnds\tscore\tname2\tcdsStartStat\tcdsEndStat\texonFrames' hg38.genePred > hg38.gene_table
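Once the header has been added, a lookup by gene name reduces to filtering on the name2 column (field 12). The sketch below shows one way such a lookup could work on a one-row sample table; it is not CHOPCHOP's own code, and the function name gene_coords and the temporary file are assumptions made for this example.

```shell
# Build a one-row sample gene table with the same header the sed command adds.
table=$(mktemp)
printf 'name\tchrom\tstrand\ttxStart\ttxEnd\tcdsStart\tcdsEnd\texonCount\texonStarts\texonEnds\tscore\tname2\tcdsStartStat\tcdsEndStat\texonFrames\n' > "$table"
printf 'ENST00000456328.2\tchr1\t+\t11868\t14409\t14409\t14409\t3\t11868,12612,13220,\t12227,12721,14409,\t0\tDDX11L1\tnone\tnone\t-1,-1,-1,\n' >> "$table"

# gene_coords GENE TABLE: print "chrom:txStart-txEnd strand transcript" for each
# transcript whose name2 column equals GENE (the header line is skipped).
gene_coords() {
    awk -F'\t' -v gene="$1" \
        'NR > 1 && $12 == gene { print $2 ":" $4 "-" $5 " " $3 " " $1 }' "$2"
}

result=$(gene_coords DDX11L1 "$table")
echo "$result"
rm -f "$table"
```

Against the full table the call would be `gene_coords DDX11L1 hg38.gene_table`, which prints one line per transcript of that gene.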


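I run a personal WeChat official account, but I am lazy and rarely update it. You can ask questions there; if I am slow to reply, email me: tiehan@sina.cn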
