How to Batch-Download Data from NCBI

Published: 2022-02-23 10:38:27 | Source: 億速云 (Yisu Cloud) | Views: 646 | Author: 小新 | Category: Development Technology

This article shows how to download data from NCBI in batches. The approach is practical, so it is shared here for reference.

Batch searching and downloading sequences from NCBI

Script code:

import argparse
import os

from Bio import Entrez
from Bio import SeqIO
'''
database:
['pubmed', 'protein', 'nucleotide', 'nuccore', 'nucgss', 'nucest',
'structure', 'genome', 'books', 'cancerchromosomes', 'cdd', 'gap',
'domains', 'gene', 'genomeprj', 'gensat', 'geo', 'gds', 'homologene',
'journals', 'mesh', 'ncbisearch', 'nlmcatalog', 'omia', 'omim', 'pmc',
'popset', 'probe', 'proteinclusters', 'pcassay', 'pccompound',
'pcsubstance', 'snp', 'taxonomy', 'toolkit', 'unigene', 'unists']
'''
parser = argparse.ArgumentParser(description='This script is used to fasta from ncbi ')
parser.add_argument('-t','--term',help='input search  term : https://www.ncbi.nlm.nih.gov/books/NBK3837/#_EntrezHelp_Entrez_Searching_Options_',required=True)
parser.add_argument('-d','--database',help='Please input database to search nucleotide or protein  default nucleotide',default = 'nucleotide',required=False)
parser.add_argument('-r','--rettype',help='return type fasta or gb default gb',default = "gb",required=False)
parser.add_argument('-o','--out_dir',help='Please input  out_put directory path',default = os.getcwd(),required=False)
parser.add_argument('-n','--name',default ='seq',required=False,help='Please specify the output, seq')
args = parser.parse_args()

# Create the output directory if it does not exist yet.
os.makedirs(args.out_dir, exist_ok=True)
dout = os.path.abspath(args.out_dir)

Entrez.email = "huangls@biomics.com.cn"     # Always tell NCBI who you are
#handle = Entrez.efetch(db="nucleotide", id="EU490707", rettype="gb", retmode="text")
#print(handle.read())

# Search the chosen database; idtype="acc" asks for accession identifiers.
# Note: without a retmax argument, esearch returns at most 20 IDs.
handle = Entrez.esearch(db=args.database, term=args.term, idtype="acc")
search_result = Entrez.read(handle)
handle.close()

# Fetch each hit in turn and append it to <out_dir>/<name>.<rettype>.
out_path = os.path.join(dout, '%s.%s' % (args.name, args.rettype))
with open(out_path, "w") as output_handle:
    for acc in search_result['IdList']:
        print(acc)
        handle = Entrez.efetch(db=args.database, id=acc, rettype=args.rettype, retmode="text")
        record = SeqIO.read(handle, args.rettype)
        handle.close()
        SeqIO.write(record, output_handle, args.rettype)

Help output:

python /share/work/huangls/piplines/01.script/search_NCBI.py -h
usage: search_NCBI.py [-h] -t TERM [-d DATABASE] [-r RETTYPE] [-o OUT_DIR]
                      [-n NAME]
This script is used to fasta from ncbi
optional arguments:
  -h, --help            show this help message and exit
  -t TERM, --term TERM  input search term : https://www.ncbi.nlm.nih.gov/books
                        /NBK3837/#_EntrezHelp_Entrez_Searching_Options_
  -d DATABASE, --database DATABASE
                        Please input database to search nucleotide or protein
                        default nucleotide
  -r RETTYPE, --rettype RETTYPE
                        return type fasta or gb default gb
  -o OUT_DIR, --out_dir OUT_DIR
                        Please input out_put directory path
  -n NAME, --name NAME  Please specify the output, seq

Usage:

Let's start with an example:

python search_NCBI.py -t "Polygonatum[Organism] AND chloroplast AND PsaA" -d protein -r fasta -n psaA

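This command downloads, from the NCBI protein database, the protein sequences of the chloroplast PsaA gene for every species in the genus Polygonatum, and writes them in FASTA format.

-t: the search term, enclosed in double quotes. Boolean operators and index-builder field tags make a search more precise. Boolean operators provide a way to build exact queries that return well-defined result sets. There are three of them: AND, OR, and NOT. They work as follows: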

[Figure: how the AND, OR, and NOT operators work]

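The AND operator must be written in uppercase. OR and NOT do not strictly have to be, but writing all three operators in uppercase is recommended.

Boolean operators are evaluated from left to right. For example: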

promoters OR response elements NOT human AND mammals

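searches for promoters or response elements in mammals other than humans. Parentheses change the order of evaluation. For example:

promoters OR response elements NOT (human OR mouse) AND mammals

searches for promoters or response elements in mammals other than humans and mice.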
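
To make the evaluation order concrete, here is a small sketch that models each search term as a set of matching record IDs and applies the operators strictly left to right, as described above. The record IDs are invented toy data and `evaluate` is a hypothetical helper written for this illustration; it is not part of Entrez or Biopython.

```python
# Toy model: each search term maps to the set of record IDs it matches.
# All IDs below are invented for illustration.
MATCHES = {
    "promoters": {1, 2, 3},
    "response elements": {3, 4, 5},
    "human": {1, 4},
    "mouse": {2},
    "mammals": {1, 2, 3, 4, 6},
}


def evaluate(first, *steps):
    """Apply (operator, operand) steps strictly left to right.

    Operands may be term names or already-computed sets, which is how a
    parenthesised sub-expression is passed in.
    """
    def as_set(x):
        return MATCHES[x] if isinstance(x, str) else set(x)

    result = as_set(first)
    for op, operand in steps:
        other = as_set(operand)
        if op == "AND":
            result = result & other
        elif op == "OR":
            result = result | other
        elif op == "NOT":
            result = result - other
        else:
            raise ValueError("unknown operator: %r" % op)
    return result


# promoters OR response elements NOT human AND mammals
no_parens = evaluate("promoters", ("OR", "response elements"),
                     ("NOT", "human"), ("AND", "mammals"))

# promoters OR response elements NOT (human OR mouse) AND mammals
human_or_mouse = evaluate("human", ("OR", "mouse"))
with_parens = evaluate("promoters", ("OR", "response elements"),
                       ("NOT", human_or_mouse), ("AND", "mammals"))

print(sorted(no_parens))
print(sorted(with_parens))
```

In this toy data the parenthesised query returns fewer records, because mouse records are excluded as well as human ones.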

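The text inside "[ ]" is an index-builder field tag; it states what kind of term precedes it. In the example, [Organism] marks Polygonatum as an organism name. Some other examples: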

[Figure: further examples of search field tags]

Range searches are also possible, for example on sequence length or publication date.

[Figure: examples of range searches]
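
As a rough illustration of how such terms can be assembled for the -t option, the sketch below builds a query string from field-tagged and range terms. The helper names are hypothetical. The `SLEN` (sequence length) and `PDAT` (publication date) tags and the `low:high[TAG]` range form are not taken from this article; they are the Entrez conventions as I understand them, so check them against the Entrez help page linked in the script's help text before relying on them.

```python
def tagged(term, field):
    """Attach a search field tag, e.g. tagged("Polygonatum", "Organism")."""
    return "%s[%s]" % (term, field)


def ranged(low, high, field):
    """Build a range term of the form low:high[field]."""
    return "%s:%s[%s]" % (low, high, field)


def all_of(*terms):
    """Join terms with the uppercase AND operator."""
    return " AND ".join(terms)


query = all_of(
    tagged("Polygonatum", "Organism"),
    "chloroplast",
    ranged(1000, 2000, "SLEN"),                  # assumed tag: sequence length
    ranged("2020/01/01", "2021/12/31", "PDAT"),  # assumed tag: publication date
)
print(query)
```

The resulting string can be passed to the script in double quotes after -t.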

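-d: the database to search, nucleotide or protein; the default is nucleotide.

-r: the output format, fasta or gb (GenBank); the default is gb.

-o: the output directory; the default is the current working directory.

-n: the prefix of the output file name; the default is seq.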
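
One limitation worth knowing: the script fetches records one at a time, and an esearch call without `retmax` returns only the first 20 matches. The sketch below shows one possible way to request more IDs and fetch them in comma-separated batches. It assumes Biopython is installed and NCBI is reachable; `chunked` and `fetch_in_batches` are hypothetical helper names, the email address is a placeholder, and the `retmax` and batch-size values are arbitrary choices rather than recommendations from the article.

```python
def chunked(items, size):
    """Split a list into consecutive sub-lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def fetch_in_batches(term, database="nucleotide", rettype="gb",
                     email="you@example.com", retmax=500, batch_size=100):
    """Yield SeqRecord objects for up to `retmax` search hits.

    Biopython is imported here so that `chunked` stays usable without it.
    """
    from Bio import Entrez, SeqIO

    Entrez.email = email  # placeholder; replace with your own address
    handle = Entrez.esearch(db=database, term=term, idtype="acc", retmax=retmax)
    id_list = list(Entrez.read(handle)["IdList"])
    handle.close()

    for batch in chunked(id_list, batch_size):
        # efetch accepts several IDs joined by commas in one request
        handle = Entrez.efetch(db=database, id=",".join(batch),
                               rettype=rettype, retmode="text")
        for record in SeqIO.parse(handle, rettype):
            yield record
        handle.close()


# Only the pure helper is exercised here; no network request is made.
print(chunked(["A1", "A2", "A3", "A4", "A5"], 2))
```

The generator returned by `fetch_in_batches` can be handed to `SeqIO.write` together with an output file name and format.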

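Extracting sequences from GenBank files

Here is another recommended Python program. Given a list of gene names, it extracts from GenBank files the genome sequence, the CDS and protein sequences of the listed genes, and the gene locations, and writes them to *.gb.genome.fa, *.gb.cds.fa, *.gb.pep.fa, and *.gb.cds_location.txt respectively.

Script code: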

import argparse
import glob
import os
import sys

from Bio import SeqIO

parser = argparse.ArgumentParser(description='This script was used to get fa from genbank file; *.faa =pep file; *.ffn=cds file; *fna=genome fa file')
parser.add_argument('-i','--id',help='Please input gene list file',required=True)
parser.add_argument('-m','--in_dir',help='Please input  in_put directory path;default cwd',default = os.getcwd(),required=False)
parser.add_argument('-o','--out_dir',help='Please input  out_put directory path;default cwd',default = os.getcwd(),required=False)
args = parser.parse_args()

# Stop early if the input directory is missing; create the output directory if needed.
if not os.path.isdir(args.in_dir):
    sys.exit("Input directory not found: %s" % args.in_dir)
din = os.path.abspath(args.in_dir)
os.makedirs(args.out_dir, exist_ok=True)
dout = os.path.abspath(args.out_dir)

# Read the gene names to extract, one per line.
with open(os.path.abspath(args.id), "r") as id_handle:
    genes = {line.strip() for line in id_handle if line.strip()}

# Process every file in the input directory whose name ends in "gb".
for gbkfile in glob.glob(os.path.join(din, "*gb")):
    prefix = os.path.join(dout, os.path.basename(gbkfile))
    with open(gbkfile, "r") as input_handle, \
         open(prefix + ".pep.fa", "w") as genePEP, \
         open(prefix + ".cds.fa", "w") as geneCDS, \
         open(prefix + ".genome.fa", "w") as gene_handle, \
         open(prefix + ".cds_location.txt", "w") as cds_locat_handle:
        for seq_record in SeqIO.parse(input_handle, "genbank"):
            print("Dealing with GenBank record %s" % seq_record.id)
            # Whole record sequence in FASTA form.
            gene_handle.write(">%s %s\n%s\n" % (
                seq_record.id,
                seq_record.description,
                seq_record.seq))
            for seq_feature in seq_record.features:
                if seq_feature.type != "CDS":
                    continue
                # Skip CDS features without a gene name or not in the list.
                gene_name = seq_feature.qualifiers.get('gene', [None])[0]
                if gene_name not in genes:
                    continue
                assert len(seq_feature.qualifiers['translation']) == 1
                geneSeq = seq_feature.extract(seq_record.seq)
                genePEP.write(">%s\n%s\n" % (
                    gene_name,
                    seq_feature.qualifiers['translation'][0]))
                geneCDS.write(">%s\n%s\n" % (gene_name, geneSeq))
                cds_locat_handle.write(">%s location %s\n" % (
                    gene_name,
                    seq_feature.location))

Help output:

python /share/work/wangq/script/genbank/genbank.py 
usage: get_data_NCBI.py -i IDLIST -o OUT_DIR -m IN_DIR
optional arguments:
  -i IDLIST, --idlist IDLIST
                        Please gene name list file
  -m IN_DIR, --in_dir IN_DIR
                        Please input complete in_put directory path
  -o OUT_DIR, --out_dir OUT_DIR
                        Please input complete out_put directory path
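Example: python /share/work/wangq/script/genbank/genbank.py -i id.txt -m /share/nas1/wangq/work/NCBI_download -o /share/nas1/wangq/work/NCBI_download

Note: -m takes a directory, which may contain multiple GenBank files; the program reads them all in one run. -i takes a file listing the names of the genes to extract, one per line, in the following format: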

rpl2
psbA
ndhD
ndhF
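
The gene list is matched exactly and case-sensitively against each CDS feature's `gene` qualifier, so stray whitespace or blank lines in the file can cause silent misses. A minimal sketch of a tolerant reader follows; `read_gene_names` is a hypothetical helper, and `io.StringIO` stands in for an opened id.txt file.

```python
import io


def read_gene_names(handle):
    """Return the set of gene names in a one-name-per-line file.

    Blank lines and surrounding whitespace are ignored.
    """
    return {line.strip() for line in handle if line.strip()}


# io.StringIO stands in for an opened id.txt file; note the blank line
# and the trailing space after ndhD.
example = io.StringIO("rpl2\npsbA\n\nndhD \nndhF\n")
genes = read_gene_names(example)
print(sorted(genes))
```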

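Converting GenBank to GFF3

The last script is bp_genbank2gff3.pl, which generates a GFF3 file from a GenBank file. It is provided by BioPerl and can be used directly once BioPerl is installed and configured. Usage is simple: pass the GenBank file name(s) after bp_genbank2gff3.pl.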

bp_genbank2gff3.pl  filename(s)
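
Since the script accepts several file names at once, a directory of downloaded GenBank files can be converted in one call. The sketch below only builds the argument list; `genbank2gff3_command` is a hypothetical helper, the `*.gb` pattern is an assumption about how the files are named, and no options are passed because none are described here. The demonstration uses empty placeholder files in a temporary directory.

```python
import glob
import os
import tempfile


def genbank2gff3_command(directory, pattern="*.gb"):
    """Build the argument list for bp_genbank2gff3.pl over matching files."""
    files = sorted(glob.glob(os.path.join(directory, pattern)))
    if not files:
        raise FileNotFoundError("no files matching %s in %s" % (pattern, directory))
    return ["bp_genbank2gff3.pl"] + files


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for fname in ("b.gb", "a.gb", "notes.txt"):
        open(os.path.join(tmp, fname), "w").close()
    command = genbank2gff3_command(tmp)
    names = [os.path.basename(arg) for arg in command]
    print(names)
```

With BioPerl installed and the script on the PATH, the list can be passed to `subprocess.run(command, check=True)`.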

Thanks for reading! That concludes this article on how to batch-download data from NCBI. I hope it is helpful.
