Reference package construction from scratch

Please report comments or bugs to Pierre Lefeuvre -


Important notice

Note that most of the R code included in the vignette is not executed during vignette build. Some of the rentrez (very good) package code triggers HTTP failure during the CRAN check and is thus not evaluated during the build. It should run smoothly on you desktop computer.


A phylogenetic placement corresponds to the position of a query sequence in a reference tree. Phylogenetic placement inference using pplacer requires the construction of “reference packages”, i.e. informations on the phylogeny and taxonomy of a given group of organisms. This vignette presents example of reference package construction using taxtastic, rppr and R.

The polerovirus example

Imagine you obtained some viral contigs that were assigned to the Polerovirus genus using a BLAST search. You want to obtain phylogenetic placements of these sequences using pplacer and now have to build a reference package for this genus.

A search on the ICTV website informs us that polerovirus belongs to the Luteovidiae family and that their “genome size is fairly uniform ranging from 5.6 kb to 6.0 kb”.

Loading the required packages

We will use the rentrez package to query NCBI from R. Along with a vignette available at the rentrez CRAN webpage, examples of rentrez use are available here.

set_config(config(http_version = 0))

How many polerovirus sequences are available on NCBI ?

We first search for the taxonomic id associated to the Polerovirus genus in NCBI…

r_search <- entrez_search(db="taxonomy", term="polerovirus")

… and use this taxonomic id (i.e. 119164) to query the nucleotide database. Note that we remove the reference sequence set from the search (using “NOT srcdb_refseq[PROP]”) as these are duplicates from records already available in the nucleotide database.

r_search <- entrez_search(db="nucleotide", term="txid119164[orgn] NOT srcdb_refseq[PROP]")

This number represents the total number of sequences from the genus polerovirus. We will now search for full genomes using the “AND 5000:6500[slen]” query.

r_search <- entrez_search(db="nucleotide", term="txid119164[orgn] NOT srcdb_refseq[PROP] AND 5000:6500[slen]")

To recover all the GenBank identifiers (gi), we have to increase the retmax parameter (default is 20) to something superior to r_search$count.

r_search <- entrez_search(db="nucleotide", term="txid119164[orgn] NOT srcdb_refseq[PROP] AND 5000:6500[slen]",retmax=1000)
gi <- r_search$ids

Download all these sequences

Get the genbank informations in XML format

We now use these gi to download the information relative to the sequences in XML format. XML is a handy format as it allow to directly extract specific fields form the genbank file. Because the number of sequence may be a bit large, we will use the history option in our rentrez query. Sequences will then be obtained by batch of 200.

r_search <- entrez_search(db="nucleotide", term="txid119164[orgn] NOT srcdb_refseq[PROP] AND 5000:6500[slen]",use_history=TRUE)
all_recs <- NULL
for(seq_start in seq(1,ceiling(r_search$count/200)*200,200)){
     all_recs <- c(all_recs,entrez_fetch(db="nuccore", web_history=r_search$web_history,rettype="gbc",retmode="xml",retmax=200,retstart=seq_start))
rec_list <-,lapply(all_recs,xmlToList))

Extract the fasta from the XML


Extract the taxonomy from the XML

Also, we need to extract the polerovirus classification of each sequence.

source_name <- t(sapply(rec_list,function(X){c(X$INSDSeq_locus,X$INSDSeq_organism)}))

Alignement and phylogeny

Now we have the sequences on the disk we can align the sequences (here using MAFFT) and construct a phylogenetic tree (here using FastTree)…

### in bash

mafft --adjustdirection --reorder polerovirus_from_genbank.fasta > polerovirus_from_genbank_MAFFT.fasta

FastTree -nt -gamma -gtr -log polerovirus_from_genbank_MAFFT.log polerovirus_from_genbank_MAFFT.fasta > polerovirus_from_genbank_MAFFT.tre

…and import the tree in R for control.

tree <- read.tree(paste(find.package("BoSSA"),"/extdata/polerovirus_from_genbank_MAFFT.tre",sep=""))

Note that using the “adjustdirection” during alignment, MAFFT generates reverse complement sequences and align them together with the remaining sequences. The best orientation is conserved for each sequence. When the reverse complement is aligned, the program add the “_R_” prefix in the name of the sequence. In this case and in order to keep the correspondance between informations files and sequences names, you’ll have to edit the sequences names.

After plotting the tree, and the node labels, it is apparent that the node 231 may be a good rooting point (may change depending on the sequence set you download i.e. as more poleroviruses sequences are upload to NCBI, the tree structure and node numbering will change).

root_tree <- root(tree,node=231,resolve=TRUE)

The rooted tree is then written to the disk.


The Taxonomy

Download the polerovirus taxonomy IDs

We need to obtain the complete taxonomy information for the polerovirus. Taxastic has a function for that. We first download the full taxonomy from NCBI using taxit and subset it for the poleroviruses tax ids.

r_search <- entrez_search(db="taxonomy", term="txid119164[orgn]",retmax=1000)
tax_ids <-  r_search$ids
### in bash

taxit new_database

taxit taxtable -f -o polerovirus_taxonomy.csv ncbi_taxonomy.db

The information file

We can now create and write the information file for the sequences from our phylogenetic tree.

taxo <- read.csv(paste(find.package("BoSSA"),"/extdata/polerovirus_taxonomy.csv",sep=""))
tax_id <- taxo$tax_id[match(source_name[,2],taxo$tax_name)]
info <- data.frame(seqname=source_name[,1],accession=source_name[,1],tax_id=tax_id,species=source_name[,2],is_type=rep("no",nrow(source_name)))

Create the refpkg using taxit

The reference package is created using the taxit create command.

### in bash

taxit create -l polerovirus -P polerovirus.refpkg \
    --taxonomy polerovirus_taxonomy.csv \
    --aln-fasta polerovirus_from_genbank_MAFFT.fasta \
    --seq-info polerovirus_info.csv \
    --tree-stats polerovirus_from_genbank_MAFFT.log \
    --tree-file polerovirus_ROOTED.tre --no-reroot

Check the refpkg using rppr

Rppr offers the possibility to check the reference package using the following commands.

### in bash

rppr check -c polerovirus.refpkg

rppr info -c polerovirus.refpk

Refpkg stats using BoSSA

Using BoSSA, we can extract some summary statistics and draw some plots.

  • A reference package summary
refpkg_path <- paste(find.package("BoSSA"),"/extdata/polerovirus.refpkg",sep="")

## ### Reference package summary
## Path:/tmp/RtmpK6GO3B/Rinst3a903489a3e2/BoSSA/extdata/polerovirus.refpkg
## Tree with 204 tips 203 nodes
## Classification:
## root 1 
## superkingdom 1 
## below_superkingdom 1 
## below_below_superkingdom 1 
## family 1 
## genus 1 
## below_genus 1 
## species 33 
## below_species 2
  • A tree with tips colored according to a taxonomic level, here the species

  • A pie chart summarizing the taxonomy with here again, the species level. Note there is a slight decay between the text labels and slices…

  • Alternatively, a input file for KronaTools to generate a krona chart could be generated.

A file (default is for_krona.txt) is generated and can be used to generate a krona chart with the scrit from the Krona software: for_krona.txt

Subsample the refpkg

It could be interesting to reduce the size of the reference package. Whereas, there is still a decent number of sequences in this polerovirus dataset, this number could be way higher and represent a computationnal burden for the rest of the analysis. Here is some code example to subset each species to a maximum of 10 sequences. T-Coffee could be usefull if you would like to subsample based on diversity.

d <- cophenetic.phylo(root_tree)

ids <- NULL
for(i in 1:length(unique(info$species))){
    usp <- unique(info$species)[i]
  sub <- info[as.character(info$species)==usp,]
  acc <- as.character(sub[,1])
  # the following code line prevent the code from crashing
  # i.e. new sequences not available in the example may be uploaded
  # when you will run the code
  acc <- acc[!,colnames(d)))]

    ids <- c(ids,acc)
    h <- hclust(as.dist(d[acc,acc]))
    grp <- cutree(h,k=10)
    ids <- c(ids,names(grp)[match(1:10,grp)])

to_remove <- root_tree$tip.label[!root_tree$tip.label%in%ids]

Note that the highly recommanded practice is to subset the alignement, re-align and compute a new phylogenetic tree, one can also directly subset the already available phylogeny (in a “quick and dirty” way).

root_tree2 <- drop.tip(root_tree,to_remove)

Using these new files, you can create another smaller refpkg using the “taxit create” command as described above.


If you find BoSSA and/or its tutorials useful, you may cite:

## To cite package 'BoSSA' in publications use:
##   Pierre Lefeuvre (2020). BoSSA: A Bunch of Structure and Sequence
##   Analysis. R package version 3.7.
## A BibTeX entry for LaTeX users is
##   @Manual{,
##     title = {BoSSA: A Bunch of Structure and Sequence Analysis},
##     author = {Pierre Lefeuvre},
##     year = {2020},
##     note = {R package version 3.7},
##   }
## ATTENTION: This citation information has been auto-generated from the
## package DESCRIPTION file and may need manual editing, see
## 'help("citation")'.

Other resources

On phylogenetic placements

pplacer website and documentation


On krona charts


R packages used by BoSSA

RSQLite and jsonlite to read files, ape and phangorn to manipulate phylogenetic trees and plotrix for pie charts.