Collection of Python libraries to parse bioinformatics files, or perform computation related to assembly, annotation, and comparative genomics.
| Authors | Haibao Tang (tanghaibao)        |
|         | Vivek Krishnakumar (vivekkrish) |
|         | Jingping Li (Jingping)          |
|         | Xingtan Zhang (tangerzhang)     |
If you use the MCscan pipeline for synteny inference, please cite:
Tang et al. (2008) Synteny and Collinearity in Plant Genomes. Science
If you use the ALLMAPS pipeline for genome scaffolding, please cite:
Tang et al. (2015) ALLMAPS: robust scaffold ordering based on multiple maps. Genome Biology
For other uses, please cite the package directly:
Tang et al. (2015). jcvi: JCVI utility libraries. Zenodo. 10.5281/zenodo.31631
The following modules are available as generic bioinformatics handling methods.
.ace format (phrap, cap3, etc.),
.coords format (
obo format (ontology),
.psl format (UCSC blat, GMAP, etc.),
.posmap format (Celera
.sam format (read mapping),
format (TIGR assembly format), etc.
Then there are modules that contain domain-specific methods.
Please visit wiki for full-fledged applications.
The following is a list of third-party Python packages used by some routines in the library. These dependencies are not mandatory, since only a few modules use them.
There are other Python modules here and there in various scripts. The
best way is to install them via
pip install when you see
The easiest way is to install it via PyPI:
pip install jcvi
To install the development version:
pip install git+git://github.com/tanghaibao/jcvi.git
Alternatively, if you want to install manually:
cd ~/code  # or any directory of your choice
git clone git://github.com/tanghaibao/jcvi.git
cd jcvi  # enter the cloned repository before installing
pip install -e .
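After any of the install routes above, it can help to confirm that the package (and any optional dependency a script asks for) is importable. The following is a minimal sketch using only the standard library. The helper names `is_importable` and `missing_modules` are illustrative and not part of jcvi.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


def missing_modules(module_names):
    """Return the subset of module_names that cannot be located."""
    return [name for name in module_names if not is_importable(name)]


if __name__ == "__main__":
    # "jcvi" is the package from this README; append optional dependencies as needed.
    # Note: the pip package name can differ from the import name for some projects.
    for name in missing_modules(["jcvi"]):
        print("Missing: {0} (try: pip install {0})".format(name))
```

The check only locates modules and does not import them, so it stays fast even for large packages.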
In addition, a few modules might ask for the locations of external programs
if the executables cannot be found in your
PATH. The external programs
that are often used are:
Most of the scripts in this package contain multiple actions. To use the fasta module as an example:
Usage:
    python -m jcvi.formats.fasta ACTION

Available ACTIONs:
    clean       | Remove irregular chars in FASTA seqs
    diff        | Check if two fasta records contain same information
    extract     | Given fasta file and seq id, retrieve the sequence in fasta format
    fastq       | Combine fasta and qual to create fastq file
    filter      | Filter the records by size
    format      | Trim accession id to the first space or switch id based on 2-column mapping file
    fromtab     | Convert 2-column sequence file to FASTA format
    gaps        | Print out a list of gap sizes within sequences
    gc          | Plot G+C content distribution
    identical   | Given 2 fasta files, find all exactly identical records
    ids         | Generate a list of headers
    info        | Run `sequence_info` on fasta files
    ispcr       | Reformat paired primers into isPcr query format
    join        | Concatenate a list of seqs and add gaps in between
    longestorf  | Find longest orf for CDS fasta
    pair        | Sort paired reads to .pairs, rest to .fragments
    pairinplace | Starting from fragment.fasta, find if adjacent records can form pairs
    pool        | Pool a bunch of fastafiles together and add prefix
    qual        | Generate dummy .qual file based on FASTA file
    random      | Randomly take some records
    sequin      | Generate a gapped fasta file for sequin submission
    simulate    | Simulate random fasta file for testing
    some        | Include or exclude a list of records (also performs on .qual file if available)
    sort        | Sort the records by IDs, sizes, etc.
    summary     | Report the real no of bases and N's in fasta files
    tidy        | Normalize gap sizes and remove small components in fasta
    translate   | Translate CDS to proteins
    trim        | Given a cross_match screened fasta, trim the sequence
    trimsplit   | Split sequences at lower-cased letters
    uniq        | Remove records that are the same
Then, to use one action, you can just do:
python -m jcvi.formats.fasta extract
This will tell you the options and arguments it expects.
Feel free to check out other scripts in the package; it is not just for FASTA.
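The one-script, many-actions layout described above can be illustrated with a small dispatcher. This is a minimal sketch of the general pattern, not the code jcvi itself uses. The action names, help strings, handlers, and the module name `mypackage.mymodule` are all hypothetical placeholders.

```python
import sys


def extract(args):
    """Hypothetical action: report the arguments it received."""
    return "extract called with {}".format(args)


def summary(args):
    """Hypothetical action: report the arguments it received."""
    return "summary called with {}".format(args)


# Map each action name to (one-line help, handler). Entries are illustrative.
ACTIONS = {
    "extract": ("Retrieve a sequence by id", extract),
    "summary": ("Report base counts", summary),
}


def usage(prog):
    """Build a help text listing every registered action."""
    lines = ["Usage:", "    python -m {} ACTION".format(prog), "", "Available ACTIONs:"]
    width = max(len(name) for name in ACTIONS)
    for name in sorted(ACTIONS):
        lines.append("    {} | {}".format(name.ljust(width), ACTIONS[name][0]))
    return "\n".join(lines)


def dispatch(argv, prog="mypackage.mymodule"):
    """Run the handler named by argv[0], or return the usage text."""
    if not argv or argv[0] not in ACTIONS:
        return usage(prog)
    _, handler = ACTIONS[argv[0]]
    return handler(argv[1:])


if __name__ == "__main__":
    print(dispatch(sys.argv[1:]))
```

Calling the script with no action (or an unknown one) returns the action list, while a known action receives the remaining arguments, which mirrors the behavior shown in the usage text above.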