######################################################################################
A Python package for retrieving metadata and downloading datasets from SRA/ENA/GEO
######################################################################################
.. image:: https://img.shields.io/pypi/v/pysradb.svg?style=flat-square
    :target: https://pypi.python.org/pypi/pysradb

.. image:: https://anaconda.org/bioconda/pysradb/badges/version.svg
    :target: https://anaconda.org/bioconda/pysradb/badges/version.svg

.. image:: https://static.pepy.tech/personalized-badge/pysradb?period=month&units=international_system&left_color=black&right_color=brightgreen&left_text=Downloads/month
    :target: https://pepy.tech/project/pysradb

.. image:: https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat-square
    :target: http://bioconda.github.io/recipes/pysradb/README.html

.. image:: https://zenodo.org/badge/159590788.svg
    :target: https://zenodo.org/badge/latestdoi/159590788

.. image:: https://github.com/saketkc/pysradb/workflows/push/badge.svg
    :target: https://github.com/saketkc/pysradb/actions
Documentation
https://saketkc.github.io/pysradb
CLI Usage
``pysradb`` supports command line usage. See the `CLI <https://saket-choudhary.me/pysradb/cmdline.html>`_ instructions or the `quickstart guide <https://www.saket-choudhary.me/pysradb/quickstart.html>`_.
::

    $ pysradb
    usage: pysradb [-h] [--version] [--citation]
                   {metadata,download,search,gse-to-gsm,gse-to-srp,gsm-to-gse,gsm-to-srp,gsm-to-srr,gsm-to-srs,gsm-to-srx,srp-to-gse,srp-to-srr,srp-to-srs,srp-to-srx,srr-to-gsm,srr-to-srp,srr-to-srs,srr-to-srx,srs-to-gsm,srs-to-srx,srx-to-srp,srx-to-srr,srx-to-srs}
                   ...

    pysradb: Query NGS metadata and data from NCBI Sequence Read Archive.
    version: 1.0.1
    Citation: 10.12688/f1000research.18676.1

    optional arguments:
      -h, --help            show this help message and exit
      --version             show program's version number and exit
      --citation            how to cite

    subcommands:
      {metadata,download,search,gse-to-gsm,gse-to-srp,gsm-to-gse,gsm-to-srp,gsm-to-srr,gsm-to-srs,gsm-to-srx,srp-to-gse,srp-to-srr,srp-to-srs,srp-to-srx,srr-to-gsm,srr-to-srp,srr-to-srs,srr-to-srx,srs-to-gsm,srs-to-srx,srx-to-srp,srx-to-srr,srx-to-srs}
        metadata            Fetch metadata for SRA project (SRPnnnn)
        download            Download SRA project (SRPnnnn)
        search              Search SRA for matching text
        gse-to-gsm          Get GSM for a GSE
        gse-to-srp          Get SRP for a GSE
        gsm-to-gse          Get GSE for a GSM
        gsm-to-srp          Get SRP for a GSM
        gsm-to-srr          Get SRR for a GSM
        gsm-to-srs          Get SRS for a GSM
        gsm-to-srx          Get SRX for a GSM
        srp-to-gse          Get GSE for a SRP
        srp-to-srr          Get SRR for a SRP
        srp-to-srs          Get SRS for a SRP
        srp-to-srx          Get SRX for a SRP
        srr-to-gsm          Get GSM for a SRR
        srr-to-srp          Get SRP for a SRR
        srr-to-srs          Get SRS for a SRR
        srr-to-srx          Get SRX for a SRR
        srs-to-gsm          Get GSM for a SRS
        srs-to-srx          Get SRX for a SRS
        srx-to-srp          Get SRP for a SRX
        srx-to-srr          Get SRR for a SRX
        srx-to-srs          Get SRS for a SRX
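The conversion subcommands listed in the help text follow a ``<from>-to-<to>`` naming pattern. The sketch below is not part of ``pysradb``; it is a small illustrative helper (the names ``accession_type`` and ``conversions_for`` are hypothetical) that looks up which of the listed subcommands apply to a given accession. It assumes accessions consist of one of the six prefixes above followed by digits, so other prefixes (for example ENA-style accessions) are rejected.

```python
import re

# Conversion subcommands copied from the `pysradb` help text above.
CONVERSIONS = [
    "gse-to-gsm", "gse-to-srp",
    "gsm-to-gse", "gsm-to-srp", "gsm-to-srr", "gsm-to-srs", "gsm-to-srx",
    "srp-to-gse", "srp-to-srr", "srp-to-srs", "srp-to-srx",
    "srr-to-gsm", "srr-to-srp", "srr-to-srs", "srr-to-srx",
    "srs-to-gsm", "srs-to-srx",
    "srx-to-srp", "srx-to-srr", "srx-to-srs",
]

# Assumption: an accession is one of these prefixes followed by digits.
ACCESSION_PATTERN = re.compile(r"^(GSE|GSM|SRP|SRR|SRS|SRX)\d+$")


def accession_type(accession):
    """Return the lowercase prefix (e.g. 'gsm') of a recognised accession."""
    match = ACCESSION_PATTERN.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"Unrecognised accession: {accession!r}")
    return match.group(1).lower()


def conversions_for(accession):
    """List the help-text subcommands that start from this accession type."""
    prefix = accession_type(accession) + "-to-"
    return [name for name in CONVERSIONS if name.startswith(prefix)]


print(conversions_for("GSM2177186"))
# ['gsm-to-gse', 'gsm-to-srp', 'gsm-to-srr', 'gsm-to-srs', 'gsm-to-srx']
```

Each returned name can be used as ``pysradb <name> <accession>``, as in the examples under "Using pysradb".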
Quickstart
A Google Colaboratory version of the most used commands is available in this `Colab Notebook <https://colab.research.google.com/drive/1C60V-jkcNZiaCra_V5iEyFs318jgVoUR>`_. Note that this requires only an active internet connection (no additional downloads are made).
The following notebooks document all the possible features of ``pysradb``:

1. `Python API <https://colab.research.google.com/github/saketkc/pysradb/blob/master/notebooks/01.Python-API_demo.ipynb>`_
2. `Downloading datasets from SRA - command line <https://colab.research.google.com/github/saketkc/pysradb/blob/master/notebooks/02.Commandline_download.ipynb>`_
3. `Download multiple datasets in parallel - Python API <https://colab.research.google.com/github/saketkc/pysradb/blob/master/notebooks/03.ParallelDownload.ipynb>`_
4. `Converting SRA-to-fastq - command line (requires conda) <https://colab.research.google.com/github/saketkc/pysradb/blob/master/notebooks/04.SRA_to_fastq_conda.ipynb>`_
5. `Downloading subsets of a project - Python API <https://colab.research.google.com/github/saketkc/pysradb/blob/master/notebooks/05.Downloading_subsets_of_a_project.ipynb>`_
6. `Download BAMs <https://colab.research.google.com/github/saketkc/pysradb/blob/master/notebooks/06.Download_BAMs.ipynb>`_
7. `Metadata for multiple SRPs <https://colab.research.google.com/github/saketkc/pysradb/blob/master/notebooks/07.Multiple_SRPs.ipynb>`_
8. `Multithreaded fastq downloads using Aspera Client <https://colab.research.google.com/github/saketkc/pysradb/blob/master/notebooks/08.pysradb_ascp_multithreaded.ipynb>`_
9. `Searching SRA/GEO/ENA <https://colab.research.google.com/github/saketkc/pysradb/blob/master/notebooks/09.Query_Search.ipynb>`_

Installation
To install the stable version using ``pip``:

.. code-block:: bash

    pip install pysradb
Alternatively, if you use conda:

.. code-block:: bash

    conda install -c bioconda pysradb

This step will install all the dependencies.
If you have an existing environment with a lot of pre-installed packages, conda might be `slow <https://github.com/bioconda/bioconda-recipes/issues/13774>`_.
Please consider creating a new environment for ``pysradb``:

.. code-block:: bash

    conda create -c bioconda -n pysradb python=3.7 pysradb
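
``pysradb`` depends on the following packages:

.. code-block:: bash

    pandas
    requests
    tqdm
    xmltodict

To install ``pysradb`` from source in editable mode:

.. code-block:: bash

    git clone https://github.com/saketkc/pysradb.git
    cd pysradb && pip install -r requirements.txt
    pip install -e .
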
Using pysradb
::
$ pysradb metadata SRP000941 | head
study_accession experiment_accession experiment_title experiment_desc organism_taxid organism_name library_strategy library_source library_selection sample_accession sample_title instrument total_spots total_size run_accession run_total_spots run_total_bases
SRP000941 SRX056722 Reference Epigenome: ChIP-Seq Analysis of H3K27ac in hESC H1 Cells Reference Epigenome: ChIP-Seq Analysis of H3K27ac in hESC H1 Cells 9606 Homo sapiens ChIP-Seq GENOMIC ChIP SRS184466 Illumina HiSeq 2000 26900401 531654480 SRR179707 26900401 807012030
SRP000941 SRX027889 Reference Epigenome: ChIP-Seq Analysis of H2AK5ac in hESC Cells Reference Epigenome: ChIP-Seq Analysis of H2AK5ac in hESC Cells 9606 Homo sapiens ChIP-Seq GENOMIC ChIP SRS116481 Illumina Genome Analyzer II 37528590 779578968 SRR067978 37528590 1351029240
SRP000941 SRX027888 Reference Epigenome: ChIP-Seq Input from hESC H1 Cells Reference Epigenome: ChIP-Seq Input from hESC H1 Cells 9606 Homo sapiens ChIP-Seq GENOMIC RANDOM SRS116483 Illumina Genome Analyzer II 13603127 3232309537 SRR067977 13603127 489712572
SRP000941 SRX027887 Reference Epigenome: ChIP-Seq Input from hESC H1 Cells Reference Epigenome: ChIP-Seq Input from hESC H1 Cells 9606 Homo sapiens ChIP-Seq GENOMIC RANDOM SRS116562 Illumina Genome Analyzer II 22430523 506327844 SRR067976 22430523 807498828
SRP000941 SRX027886 Reference Epigenome: ChIP-Seq Input from hESC H1 Cells Reference Epigenome: ChIP-Seq Input from hESC H1 Cells 9606 Homo sapiens ChIP-Seq GENOMIC RANDOM SRS116560 Illumina Genome Analyzer II 15342951 301720436 SRR067975 15342951 552346236
SRP000941 SRX027885 Reference Epigenome: ChIP-Seq Input from hESC H1 Cells Reference Epigenome: ChIP-Seq Input from hESC H1 Cells 9606 Homo sapiens ChIP-Seq GENOMIC RANDOM SRS116482 Illumina Genome Analyzer II 39725232 851429082 SRR067974 39725232 1430108352
SRP000941 SRX027884 Reference Epigenome: ChIP-Seq Input from hESC H1 Cells Reference Epigenome: ChIP-Seq Input from hESC H1 Cells 9606 Homo sapiens ChIP-Seq GENOMIC RANDOM SRS116481 Illumina Genome Analyzer II 32633277 544478483 SRR067973 32633277 1174797972
SRP000941 SRX027883 Reference Epigenome: ChIP-Seq Input from hESC H1 Cells Reference Epigenome: ChIP-Seq Input from hESC H1 Cells 9606 Homo sapiens ChIP-Seq GENOMIC RANDOM SRS004118 Illumina Genome Analyzer II 22150965 3262293717 SRR067972 9357767 336879612
SRP000941 SRX027883 Reference Epigenome: ChIP-Seq Input from hESC H1 Cells Reference Epigenome: ChIP-Seq Input from hESC H1 Cells 9606 Homo sapiens ChIP-Seq GENOMIC RANDOM SRS004118 Illumina Genome Analyzer II 22150965 3262293717 SRR067971 12793198 460555128
::
$ pysradb metadata SRP075720 --detailed | head
study_accession experiment_accession experiment_title experiment_desc organism_taxid organism_name library_strategy library_source library_selection sample_accession sample_title instrument total_spots total_size run_accession run_total_spots run_total_bases
SRP075720 SRX1800476 GSM2177569: Kcng4_2la_H9; Mus musculus; RNA-Seq GSM2177569: Kcng4_2la_H9; Mus musculus; RNA-Seq 10090 Mus musculus RNA-Seq TRANSCRIPTOMIC cDNA SRS1467643 Illumina HiSeq 2500 2547148 97658407 SRR3587912 2547148 127357400
SRP075720 SRX1800475 GSM2177568: Kcng4_2la_H8; Mus musculus; RNA-Seq GSM2177568: Kcng4_2la_H8; Mus musculus; RNA-Seq 10090 Mus musculus RNA-Seq TRANSCRIPTOMIC cDNA SRS1467642 Illumina HiSeq 2500 2676053 101904264 SRR3587911 2676053 133802650
SRP075720 SRX1800474 GSM2177567: Kcng4_2la_H7; Mus musculus; RNA-Seq GSM2177567: Kcng4_2la_H7; Mus musculus; RNA-Seq 10090 Mus musculus RNA-Seq TRANSCRIPTOMIC cDNA SRS1467641 Illumina HiSeq 2500 1603567 61729014 SRR3587910 1603567 80178350
SRP075720 SRX1800473 GSM2177566: Kcng4_2la_H6; Mus musculus; RNA-Seq GSM2177566: Kcng4_2la_H6; Mus musculus; RNA-Seq 10090 Mus musculus RNA-Seq TRANSCRIPTOMIC cDNA SRS1467640 Illumina HiSeq 2500 2498920 94977329 SRR3587909 2498920 124946000
SRP075720 SRX1800472 GSM2177565: Kcng4_2la_H5; Mus musculus; RNA-Seq GSM2177565: Kcng4_2la_H5; Mus musculus; RNA-Seq 10090 Mus musculus RNA-Seq TRANSCRIPTOMIC cDNA SRS1467639 Illumina HiSeq 2500 2226670 83473957 SRR3587908 2226670 111333500
SRP075720 SRX1800471 GSM2177564: Kcng4_2la_H4; Mus musculus; RNA-Seq GSM2177564: Kcng4_2la_H4; Mus musculus; RNA-Seq 10090 Mus musculus RNA-Seq TRANSCRIPTOMIC cDNA SRS1467638 Illumina HiSeq 2500 2269546 87486278 SRR3587907 2269546 113477300
SRP075720 SRX1800470 GSM2177563: Kcng4_2la_H3; Mus musculus; RNA-Seq GSM2177563: Kcng4_2la_H3; Mus musculus; RNA-Seq 10090 Mus musculus RNA-Seq TRANSCRIPTOMIC cDNA SRS1467636 Illumina HiSeq 2500 2333284 88669838 SRR3587906 2333284 116664200
SRP075720 SRX1800469 GSM2177562: Kcng4_2la_H2; Mus musculus; RNA-Seq GSM2177562: Kcng4_2la_H2; Mus musculus; RNA-Seq 10090 Mus musculus RNA-Seq TRANSCRIPTOMIC cDNA SRS1467637 Illumina HiSeq 2500 2071159 79689296 SRR3587905 2071159 103557950
SRP075720 SRX1800468 GSM2177561: Kcng4_2la_H1; Mus musculus; RNA-Seq GSM2177561: Kcng4_2la_H1; Mus musculus; RNA-Seq 10090 Mus musculus RNA-Seq TRANSCRIPTOMIC cDNA SRS1467635 Illumina HiSeq 2500 2321657 89307894 SRR3587904 2321657 116082850
::

    $ pysradb srp-to-gse SRP075720
    study_accession study_alias
    SRP075720       GSE81903

::

    $ pysradb gsm-to-srp GSM2177186
    experiment_alias study_accession
    GSM2177186       SRP075720

::

    $ pysradb gsm-to-gse GSM2177186
    experiment_alias study_alias
    GSM2177186       GSE81903

::

    $ pysradb gsm-to-srx GSM2177186
    experiment_alias experiment_accession
    GSM2177186       SRX1800089

::

    $ pysradb gsm-to-srr GSM2177186
    experiment_alias run_accession
    GSM2177186       SRR3587529

::

    $ pysradb download -g GSE161707

``pysradb`` makes it easy to download datasets from SRA in parallel.
Using 8 threads to download:
::
$ pysradb download -y -t 8 --out-dir ./pysradb_downloads -p SRP063852
Downloads are organized by ``SRP/SRX/SRR``, mimicking the hierarchy of SRA projects.
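As a rough illustration of that layout, the sketch below builds the ``SRP/SRX/SRR`` location for one run using accessions from the metadata output above. The helper name ``run_location`` is hypothetical and is not part of ``pysradb``; the exact file names and extensions that ``pysradb`` writes at that location are not specified here.

```python
from pathlib import PurePosixPath


def run_location(out_dir, study, experiment, run):
    """Compose <out_dir>/<SRP>/<SRX>/<SRR> for a single run (illustrative only)."""
    return PurePosixPath(out_dir) / study / experiment / run


location = run_location("pysradb_downloads", "SRP000941", "SRX056722", "SRR179707")
print(location)
# pysradb_downloads/SRP000941/SRX056722/SRR179707
```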
::
$ pysradb metadata SRP000941 --detailed | grep 'study\|RNA-Seq' | pysradb download
This will download all ``RNA-seq`` samples coming from this project.
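The ``grep`` filter above keeps the header line (it contains ``study``) plus every line that mentions ``RNA-Seq`` anywhere, including in titles. A stricter variant filters on the ``library_strategy`` column only. The sketch below shows that idea with the standard library; it assumes the metadata is available as tab-separated text, and the sample is an abbreviated mix of rows and columns taken from the outputs above, not real command output. The function name ``filter_by_strategy`` is hypothetical.

```python
import csv
import io

# Abbreviated sample (assumption: tab-separated), built from values shown above.
SAMPLE = (
    "study_accession\texperiment_accession\tlibrary_strategy\trun_accession\n"
    "SRP000941\tSRX056722\tChIP-Seq\tSRR179707\n"
    "SRP075720\tSRX1800476\tRNA-Seq\tSRR3587912\n"
    "SRP075720\tSRX1800475\tRNA-Seq\tSRR3587911\n"
)


def filter_by_strategy(tsv_text, strategy):
    """Return the rows whose library_strategy column equals `strategy`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row["library_strategy"] == strategy]


rna_rows = filter_by_strategy(SAMPLE, "RNA-Seq")
print([row["run_accession"] for row in rna_rows])
# ['SRR3587912', 'SRR3587911']
```

Matching on a single column avoids picking up rows that only mention the assay in a free-text field.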
With `aspera-client <https://downloads.asperasoft.com/en/downloads/8?list>`_ installed, ``pysradb`` can perform ultra fast downloads.
To download all original fastqs with ``aspera-client`` installed, utilizing 8 threads:
::
$ pysradb download -t 8 --use_ascp -p SRP002605
Refer to the notebook for `(shallow) time benchmarks <https://colab.research.google.com/github/saketkc/pysradb/blob/master/notebooks/08.pysradb_ascp_multithreaded.ipynb>`_.
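When scripting downloads from Python, one option is to assemble the command line and pass it to ``subprocess.run``. The sketch below only builds the argument list and restricts itself to the flags shown in the examples above (``-y``, ``-t``, ``--out-dir``, ``--use_ascp``, ``-p``). The function ``build_download_command`` is a hypothetical helper, not part of ``pysradb``.

```python
import shlex


def build_download_command(project, threads=8, out_dir=None, use_ascp=False, assume_yes=True):
    """Assemble a `pysradb download` argument list from flags shown in this README."""
    command = ["pysradb", "download"]
    if assume_yes:
        command.append("-y")
    command += ["-t", str(threads)]
    if out_dir is not None:
        command += ["--out-dir", out_dir]
    if use_ascp:
        command.append("--use_ascp")
    command += ["-p", project]
    return command


command = build_download_command("SRP002605", use_ascp=True, assume_yes=False)
print(shlex.join(command))
# pysradb download -t 8 --use_ascp -p SRP002605
```

Passing the list to ``subprocess.run(command, check=True)`` would start the download, provided ``pysradb`` is installed and on the ``PATH``.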
Publication
`pysradb: A Python package to query next-generation sequencing metadata and data from NCBI Sequence Read Archive <https://f1000research.com/articles/8-532/v1>`_
Presentation slides from BOSC (ISMB-ECCB) 2019: https://f1000research.com/slides/8-1183
Citation
Choudhary, Saket. "pysradb: A Python Package to Query next-Generation Sequencing Metadata and Data from NCBI Sequence Read Archive." F1000Research, vol. 8, F1000 (Faculty of 1000 Ltd), Apr. 2019, p. 532 (https://f1000research.com/articles/8-532/v1)
::
@article{Choudhary2019,
doi = {10.12688/f1000research.18676.1},
url = {https://doi.org/10.12688/f1000research.18676.1},
year = {2019},
month = apr,
publisher = {F1000 (Faculty of 1000 Ltd)},
volume = {8},
pages = {532},
author = {Saket Choudhary},
title = {pysradb: A {P}ython package to query next-generation sequencing metadata and data from {NCBI} {S}equence {R}ead {A}rchive},
journal = {F1000Research}
}
Zenodo archive: https://zenodo.org/badge/latestdoi/159590788
Zenodo DOI: 10.5281/zenodo.2306881
Questions?
Open an `issue <https://github.com/saketkc/pysradb/issues>`_ or join our `Slack Channel <https://join.slack.com/t/pysradb/shared_invite/zt-f01jndpy-KflPu3Be5Aq3FzRh5wj1Ug>`_.
#######
History
#######
1.4.2 (06-17-2022)

- `#163 <https://github.com/saketkc/pysradb/issues/163>`_

1.4.1 (06-04-2022)

1.4.0 (06-04-2022)

- `#161 <https://github.com/saketkc/pysradb/issues/161>`_
- `#159 <https://github.com/saketkc/pysradb/pull/159>`_
- `#160 <https://github.com/saketkc/pysradb/issues/160>`_

1.3.0 (02-18-2022)

- ``study_title`` to ``--detailed`` flag (`#152 <https://github.com/saketkc/pysradb/issues/152>`_)
- ``KeyError`` in ``metadata`` where some new IDs do not have any metadata (`#151 <https://github.com/saketkc/pysradb/issues/151>`_)

1.2.0 (01-10-2022)

- `#149 <https://github.com/saketkc/pysradb/pull/149>`_

1.1.0 (12-12-2021)

- ``gsm-to-gse`` failure (`#128 <https://github.com/saketkc/pysradb/pull/128>`_)
- `#144 <https://github.com/saketkc/pysradb/pull/144>`_
- `#146 <https://github.com/saketkc/pysradb/pull/146>`_
- ``pysradb download -g <GSE>`` (`#129 <https://github.com/saketkc/pysradb/pull/129>`_)

1.0.1 (01-10-2021)

1.0.0 (01-09-2021)

- ``metadb`` and ``SRAdb`` based search through CLI - everything defaults to ``SRAweb``
- ``SRAweb`` now supports `search <https://saket-choudhary.me/pysradb/quickstart.html#search>`_
- ``N/A`` is now replaced with ``pd.NA``
- ``--detailed``: ``instrument_model`` and ``instrument_model_desc`` `#75 <https://github.com/saketkc/pysradb/issues/75>`_

0.11.1 (09-18-2020)

- ``library_layout`` is now output in metadata #56
- ``--detailed`` unifies columns for ENA fastq links instead of appending ``_x``/``_y`` #59

0.11.0 (09-04-2020)

- ``pysradb download`` now supports multiple threads for parallel downloads
- ``pysradb download`` also supports ultra fast downloads of FASTQs from ENA using aspera-client

0.10.3 (03-26-2020)

Contributors
0.10.2 (02-05-2020)

- ``--detailed``

0.10.1 (02-04-2020)
0.10.0 (01-31-2020)
0.9.9 (01-15-2020)
0.9.7 (01-20-2020)
0.9.6 (07-20-2019)

- ``SRAweb`` to perform queries over the web if the SQLite is missing or does not contain the relevant record.

0.9.0 (02-27-2019)

0.8.0 (02-26-2019)

- ``srr-to-gsm``: convert SRR to GSM
- ``--out_dir`` is now ``out-dir``

0.7.1 (02-18-2019)
Important: Python2 is no longer supported. Please consider moving to Python3.
0.7.0 (02-08-2019)

- ``gsm-to-srr``: convert GSM to SRR
- ``gsm-to-srx``: convert GSM to SRX
- ``gsm-to-gse``: convert GSM to GSE

The following command line options have been renamed and the changes are not compatible with the 0.6.0 release:

- ``sra-metadata`` -> ``metadata``.
- ``sra-search`` -> ``search``.
- ``srametadb`` -> ``metadb``.

0.6.0 (12-25-2018)

- ``sra-metadata``
- ``download`` now allows piped inputs

0.5.0 (12-24-2018)

- ``srr_to_srx``: Convert SRR to SRX/SRP
- ``srp_to_srx``: Convert SRP to SRX
- ``sra-metadata`` to give minimal information
- ``--assay``, ``--desc``, ``--detailed`` flag for ``sra-metadata``

0.4.2 (12-16-2018)
0.4.0 (12-12-2018)

- ``BASEdb`` class to handle common database connections

0.3.0 (12-05-2018)

- ``sample_attribute`` and ``experiment_attribute`` are now included by default in the df returned by ``sra_metadata()``
- ``expand_sample_attribute_columns``: expand metadata dataframe based on attributes in ``sample_attribute`` column
- ``guess_cell_type()``/``guess_tissue_type()``/``guess_strain_type()``

0.2.2 (12-03-2018)

- ``search_sra()`` allows full text search on SRA metadata.

0.2.0 (12-03-2018)

The following methods have been renamed and the changes are not compatible with the 0.1.0 release:

- ``get_query()`` -> ``query()``.
- ``sra_convert()`` -> ``sra_metadata()``.
- ``get_table_counts()`` -> ``all_row_counts()``.

``download_sradb_file()`` makes fetching the ``SRAmetadb.sqlite`` file easy; ``wget`` is no longer required.

``ftp`` protocol is now supported besides ``fasp``, and hence ``aspera-client`` is now optional. We, however, strongly recommend ``aspera-client`` for faster downloads.

``SettingWithCopyWarning`` by explicitly doing operations on a copy of the dataframe instead of the original.

Besides these, all methods now follow a ``numpydoc`` compatible documentation.

0.1.0 (12-01-2018)