当前位置:网站首页>Download the details and sequence of the original data access from the ENA database in EBI
Download the details and sequence of the original data access from the ENA database in EBI
2022-07-05 04:35:00 【Most afraid of being gentle】
List of articles
Biology students will probably use EBI
database , And one of the ENA
The database contains abundant sequence information , So how to get them ?
1. ENA The retrieval function of the database
ENA
Database website :https://www.ebi.ac.uk/ena/browser/home
Before downloading information, the first thing is to retrieve information :
1. Simple search
Here's the picture , Search more directly sediment
Information about , There will be a lot of relevant sequence information , search
The search box can search all proteins ,RNA,DNA, Strain classification name ,accession Number, etc. .
2. Complex search
Want all your more accurate , You need to use advanced search
. stay search Drop down options in the column .advance search
It is divided into the following steps , Among them, as a condition screening is Data Type
and Query
Options .
(Query It can also be based on ENA Search syntax rules to write code search )Data Type
: Is the type of sequence you choose , Choose here Raw reads, The raw data of sequencing .Rule ID/Name Don't fill in .Query
Options : Is the information condition of the sequence you want to filter . Select the corresponding condition in the opposite option box on the left .
for example : What I chose was libraray source and library strategy Namely metagenomic and wgs
, That is to say, I want to screen the whole genome data of all macrogenomic tests . In addition to Query
The corresponding... Will be generated in the box ENA Code of the database . Here, of course, you can also choose other conditions , Finally, click search
Options will show results .
Download all results : My subsequent processing is based on downloading txt
File to do . So click TSV
Download to get the corresponding txx
The file .
The result in the file is brief accession
The number corresponds to title Information .
Results file :
2. ENA In the database accession Download the details of
I wrote this one myself python
Script implementation . Attach the source code directly .
Because recursive and asynchronous downloading are used , Therefore, in the middle of the cycle, there will be error messages such as network request timeout , Don't worry about him , Just wait until the program is over . The program only needs to modify the file name in the last step .
# You can download the package you need
import aiohttp
import aiofiles
import asyncio
import requests
import pandas as pd
import time
import csv
""" 1. Based on ENA Database download txt file , Crawling accession Details of 2. The generated results are multiple output file , Then merge """
def read_file(file_name):
with open(file_name, 'r', encoding='utf-8')as f:
f = f.readlines()
accession_list = []
for i in f[1::]:
accession = i.strip().split('\t')[0].strip('"')
accession_list.append(accession)
return accession_list
async def get_biosample_info(accession_number, session, df):
headers = {
# "Referer": f'https://www.ebi.ac.uk/ena/browser/view/{accession_number}',
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
}
params = {
"result": "read_run",
"accession": accession_number,
"offset": 0,
"limit": 1000,
"format": "json",
"fields": "study_accession,secondary_study_accession,sample_accession,secondary_sample_accession,experiment_accession,run_accession,submission_accession,tax_id,scientific_name,instrument_platform,instrument_model,library_name,nominal_length,library_layout,library_strategy,library_source,library_selection,read_count,base_count,center_name,first_public,last_updated,experiment_title,study_title,study_alias,experiment_alias,run_alias,fastq_bytes,fastq_md5,fastq_ftp,fastq_aspera,fastq_galaxy,submitted_bytes,submitted_md5,submitted_ftp,submitted_aspera,submitted_galaxy,submitted_format,sra_bytes,sra_md5,sra_ftp,sra_aspera,sra_galaxy,cram_index_ftp,cram_index_aspera,cram_index_galaxy,sample_alias,broker_name,sample_title,nominal_sdev,first_created"
}
url = r'https://www.ebi.ac.uk/ena/portal/api/filereport'
async with session.get(url, headers=headers, params=params) as resp:
if resp.content:
biosample_info_dic = await resp.json()
biosample_info_dic = biosample_info_dic[0]
print('now dealing with %s'%accession_number)
for k, v in biosample_info_dic.items():
df.loc[accession_number, k] = v
resp.close()
async def main(accession_list, num, input_file):
tasks = []
df = pd.DataFrame()
if accession_list:
# prevent ssl Report errors
timeout = aiohttp.ClientTimeout(total=600) # Set the timeout to 600 second
# force_close=True
connector = aiohttp.TCPConnector(limit=80, ssl=False) # Reduce the number of concurrent , no need ssl verification
async with aiohttp.ClientSession(connector=connector, timeout=timeout) as session:
for accession in accession_list:
task = asyncio.create_task(get_biosample_info(accession, session, df))
tasks.append(task)
await asyncio.wait(tasks)
print(df)
output_file = input_file.split('_')[0]+"_output"+str(num)+".xlsx"
df.to_excel(output_file)
# Find those that have not been downloaded accession Number
accessions = list(df.index)
not_download_accession = [i for i in accession_list if i not in accessions]
# Recursive download
await main(not_download_accession, int(num)+1, input_file)
if __name__ == '__main__':
try:
print(' Start the download ')
loop = asyncio.get_event_loop()
# Just modify the downloaded file name
loop.run_until_complete(main(read_file('results_read_run_tsv.txt'), 1, 'results_read_run_tsv.txt'))
loop.close()
print(" Download the end ")
except:
pass
The result is each accession Corresponding details .( You need to manually output Just merge the results of , Finally, I didn't bother to write a program to merge )
3. ENA In the database accession Download the sequence of
In the file sra_ftp
and sra_aspera
There will be a corresponding download address in , You can download in batches by integrating the website yourself .
边栏推荐
- 电源管理总线 (PMBus)
- 函數(易錯)
- [AI bulletin 20220211] the hard core up owner has built a lidar and detailed AI accelerator
- Here comes the Lantern Festival red envelope!
- Leetcode hot topic Hot 100 day 33: "subset"
- Qt蓝牙:搜索蓝牙设备的类——QBluetoothDeviceDiscoveryAgent
- 概率论与数理统计考试重点复习路线
- Moco is not suitable for target detection? MsrA proposes object level comparative learning target detection pre training method SOCO! Performance SOTA! (NeurIPS 2021)...
- Introduction to RT thread kernel (5) -- memory management
- How can CIOs use business analysis to build business value?
猜你喜欢
取余操作是一个哈希函数
解密函数计算异步任务能力之「任务的状态及生命周期管理」
【虚幻引擎UE】实现UE5像素流部署仅需六步操作少走弯路!(4.26和4.27原理类似)
揭秘技术 Leader 必备的七大清奇脑回路
The principle of attention mechanism and its application in seq2seq (bahadanau attention)
2022-2028 global and Chinese FPGA prototype system Market Research Report
level17
NetSetMan pro (IP fast switching tool) official Chinese version v5.1.0 | computer IP switching software download
How should programmers learn mathematics
Is there a sudden failure on the line? How to make emergency diagnosis, troubleshooting and recovery
随机推荐
自动语音识别(ASR)研究综述
2022-2028 global and Chinese FPGA prototype system Market Research Report
Observable time series data downsampling practice in Prometheus
Setting up redis cluster cluster under Windows
[PCL self study: feature9] global aligned spatial distribution (GASD) descriptor (continuously updated)
TPG x AIDU | AI leading talent recruitment plan in progress!
Official announcement! The third cloud native programming challenge is officially launched!
Introduction to RT thread kernel (5) -- memory management
指针函数(基础)
Hypothesis testing -- learning notes of Chapter 8 of probability theory and mathematical statistics
Mode in BST (binary tree & Notes on question brushing)
Web开发人员应该养成的10个编程习惯
CUDA Programming atomic operation atomicadd reports error err:msb3721, return code 1
函数(易错)
Sequelize. JS and hasmany - belongsto vs hasmany in serialize js
Advanced length of redis -- deletion strategy, master-slave replication, sentinel mode
Leetcode hot topic Hot 100 day 33: "subset"
Interview related high-frequency algorithm test site 3
Cookie learning diary 1
[thingsboard] how to replace the homepage logo