Phylogeny FAQ

  1. How is my privacy respected?
  2. Who can use this service?
  3. What are the main advantages of using this phylogeny pipeline?
  4. How can I use the files attached to the result e-mails?
  5. Why do I have to upload query and reference to infer a phylogeny?
  6. How do I cite this service?
  7. How do I describe the methods and the results?
  8. Which steps does the pipeline conduct?
  9. Why did you create this service?
  10. How long does a job take?
  11. What does a failed base-frequency check mean?
  12. To which type of 16S similarities can the phylum-specific 16S thresholds from Meier-Kolthoff et al. (2013) be applied?
  13. What do "file 'query.fas' is empty (or does not exist)", "… not found" and "distinct numbers of accessions in query and download" mean?
  14. Which data types are supported? E.g., can I analyse RNA sequences?
  15. When I upload aligned sequences, will that alignment be used?

Answers

1. How is my privacy respected?

All uploaded gene sequences are deleted on the server within 24 hours after the calculations have been completed. The users' e-mail addresses are not made available to third parties; they are only used directly on the server to calculate pseudonymised usage statistics.

As for most web site maintainers, it is also important for us to know about the countries our users/visitors are coming from. We thus use Piwik, a free and open source web analysis application written by a team of international developers. It tracks online visits to our website and displays anonymised reports on these visits for analysis.

2. Who can use this service?

Use of this form is free for academic purposes. For all other uses, please contact the authors.

3. What are the main advantages of using this phylogeny pipeline?

For phylogeny reconstruction, this service combines state-of-the-art software for multiple sequence alignment, maximum likelihood (ML) and maximum parsimony (MP) analysis. Nucleotide data are optionally downloaded from GenBank and always automatically checked for reverse-complement sequences and duplicated labels. In the case of amino-acid data, the optimal model for ML is automatically determined (for nucleotide data, we consider GTR adequate, as does the author of RAxML). The pipeline is thus ideally suited for moderately sized single-gene data sets as used, e.g., in the description of new bacterial or other species. Uploaded RNA sequences are automatically converted to DNA sequences.

Optionally, pairwise nucleotide similarities are calculated as well. The method for calculating these similarities corresponds exactly to the one used in a study that defined 16S rRNA gene similarity thresholds for determining whether a DNA:DNA hybridization (DDH) experiment is mandatory for deciding whether two strains should be assigned to the same or to distinct species. These thresholds are available for specific user-chosen error ratios as well as with phylum-specific values.

In contrast to many other phylogeny tools, which truncate sequence labels or replace characters within them, here the original labels are provided in the output. The only modifications made are trimming whitespace from their ends and replacing consecutive runs of whitespace characters with a single space.
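The two label modifications described above can be illustrated with a minimal Python sketch. The function name `normalise_label` is hypothetical and this is not the server's own code; it only mirrors the behaviour stated in the text.

```python
import re


def normalise_label(label: str) -> str:
    """Trim whitespace at both ends and collapse internal whitespace runs.

    Hypothetical helper that mirrors the two modifications described in
    the text; the server's actual implementation is not shown in this FAQ.
    """
    # strip() removes leading/trailing whitespace; the regex then replaces
    # every remaining run of whitespace characters with a single space.
    return re.sub(r"\s+", " ", label.strip())


if __name__ == "__main__":
    print(normalise_label("  Escherichia   coli\tK-12 "))
```

All other characters, including punctuation and brackets, are left untouched by this sketch, in line with the statement that the original labels are preserved.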

Finally, the result e-mails already include publication-ready text describing all methods used by the pipeline, the results, and the corresponding literature references.

4. How can I use the files attached to the result e-mails?

The files attached by the service use the following standardized file extensions:

fas
Multiple sequence alignment in FASTA format, original labels restored.
pdf
PDF file depicting the midpoint-rooted phylogenetic tree. This is a raw, not really publication-ready figure.
phy
Phylogenetic tree in Newick format, labels cleaned. Use it if your software cannot read the NEXUS-formatted tree.
tre
Phylogenetic tree in NEXUS format, original labels restored and protected.
tsv
Tab-separated file containing either the percent nucleotide similarities between each query sequence and all reference sequences or the G+C content of the sequences.

Any of these file types might be missing, because the corresponding step of the analysis might not have been requested, or an error might have occurred. The main text of the result e-mail contains an analysis protocol with a detailed list of requested, successful and unsuccessful steps.

The phylogenetic tree in NEXUS format is ideally suited for viewing and manipulating it with FigTree but should be compatible with all NEXUS-compliant tree viewers. The tree description itself is unrooted (ML and MP yield unrooted trees!), but the FigTree block contains instructions for midpoint-rooting, as described in the e-mail text. You can re-root the tree when necessary using, e.g., an outgroup contained in the data set, but then you should explicitly explain how the rooting was conducted.

For viewers which do not understand NEXUS (and do not understand the proper Newick format, which allows for protecting any label with single quotes), the tree in Newick format can be used. Special characters within labels are replaced in that file. This tree is unrooted, as ML and MP yield unrooted trees. You should re-root it using, e.g., an outgroup contained in the data set, and explicitly explain how rooting was conducted.

Two kinds of TSV files can be produced. The first kind contains three columns per line providing (1) the name of a query sequence, (2) the name of a reference sequence and (3) the pairwise similarity between them. This kind of file can be unselected by choosing to infer phylogenies only. The second kind of TSV file contains two columns per line providing (1) the name of a sequence and (2) its percent G+C content. This is useful for assessing phylogenetic distortion due to a compositional bias (see FAQ entry 11). This kind of file is automatically unselected by uploading amino-acid sequences.
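As an illustration of how the three-column similarity file might be processed, here is a minimal Python sketch. It assumes one record per line in the column order given above. Whether the file carries a header line is not stated here, so lines whose third column is not numeric are simply skipped. The function names are hypothetical.

```python
import csv
import io


def read_similarities(handle):
    """Yield (query, reference, similarity) tuples from a three-column TSV.

    Assumption: columns are query name, reference name, percent similarity.
    Lines with a different column count or a non-numeric third column
    (e.g. a possible header) are skipped.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 3:
            continue
        query, reference, value = row
        try:
            similarity = float(value)
        except ValueError:
            continue
        yield query, reference, similarity


def best_hits(rows):
    """Return, for each query, the reference with the highest similarity."""
    best = {}
    for query, reference, similarity in rows:
        if query not in best or similarity > best[query][1]:
            best[query] = (reference, similarity)
    return best


if __name__ == "__main__":
    # Made-up example data, not output of the service.
    example = io.StringIO("q1\tr1\t98.5\nq1\tr2\t99.1\nq2\tr1\t97.0\n")
    for query, (reference, similarity) in best_hits(read_similarities(example)).items():
        print(query, reference, similarity)
```

For a real result file, the `io.StringIO` object would be replaced by an opened file, e.g. `open(path, newline="")`.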

5. Why do I have to upload query and reference to infer a phylogeny?

You don't. If you omit the reference sequences, the calculation of pairwise similarities is simply skipped. Importantly, you can also upload more query sequences in that case. But unless you unselect the similarity calculations, reference sequences remain mandatory.

6. How do I cite this service?

All necessary references are listed at the bottom of the main text of the result e-mails sent by this service. You just have to format and arrange them according to the instructions for authors of the chosen journal.

7. How do I describe the methods and the results?

Close to the middle of the main text of the result e-mails sent by this service, you will find suggestions for phrasing the corresponding sections of the methods and results chapters. You just have to format and arrange them according to the instructions for authors of the chosen journal. You might also need to rephrase them slightly to avoid being falsely flagged by plagiarism scanners. Watch out for instructions enclosed in square brackets. These indicate sections whose content must frequently be adapted, too.

8. Which steps does the pipeline conduct?

The analysis protocol is contained in the result e-mails (after the preamble) and lists the notes, warnings and errors, if any, from all conducted steps. For the technical details see FAQ entry 7. Step 0 is special because it is optional; when no GenBank accessions are provided nothing is downloaded from GenBank. Further steps include checking the sequences, determining pairwise similarities, creating a multiple sequence alignment, inferring ML and MP trees, testing for a compositional bias, creating a tree file suitable for FigTree, and drawing the tree in a PDF file. Checking for reverse-complement sequences (of course!), determining pairwise similarities and testing for a base-frequency bias (of course!) are always skipped in the case of amino-acid sequences. Pairwise similarities can be unselected, as well as inferring the trees. The latter might make sense in some situations, because the time-limiting step is ML bootstrapping.

9. Why did you create this service?

We have observed that many phylogenetic studies in papers on taxonomic classification, particularly in microbiology, are still based on under-complex models such as Jukes-Cantor (JC69) and venerable but outdated algorithms or programs such as neighbour joining or ClustalW. This issue (together with a tendency to over-estimate the reliability of branches that receive poor branch support) casts some doubt on certain taxonomic decisions. This service makes it easy to apply, in contrast, state-of-the-art alignment and phylogenetic inference software.

10. How long does a job take?

This depends not only on the number and length of the uploaded sequences but also on how much phylogenetic signal is in the data. The time-limiting step is the ML analysis, whose bootstrapping part will converge more quickly according to the bootstopping criterion when the data contain a strong phylogenetic signal. Moreover, the computation time also depends on the load of the server. The higher the load, the fewer threads will be allocated to the job, thus increasing the running time.

11. What does a failed base-frequency check mean?

Models for phylogenetic inference are usually stationary, i.e. they assume fixed frequencies of the character states (either equal ones or the empirical frequencies, as in the case of this service). This can sometimes lead to artefacts, when sequences are grouped together simply because of similar character-state frequencies. A compositional bias of nucleotide sequences usually means deviating G+C contents, and these might yield an artefact by grouping together sequences that are not closely related but show a similar G+C content. However, a similar G+C content might as well be caused by a close relationship. The conducted test is simple and ignores phylogenetic structure; a failed test thus is not necessarily problematic. The reported G+C content values should be inspected for sequences with similar values that were grouped together in the tree but should not belong together.
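To make the quantity concrete, the following Python sketch computes a percent G+C content for a single sequence. It assumes that gaps and ambiguity codes are ignored and that only A, C, G, T and U are counted. This is an illustrative convention; the server's exact calculation is not described here, and the function name `gc_content` is hypothetical.

```python
def gc_content(sequence: str) -> float:
    """Return the percent G+C among the unambiguous bases of a sequence.

    Assumption: gaps and ambiguity codes are ignored; only A, C, G, T
    and U are counted. The server may use a different convention.
    """
    seq = sequence.upper()
    gc = sum(seq.count(base) for base in "GC")
    at = sum(seq.count(base) for base in "ATU")
    total = gc + at
    if total == 0:
        raise ValueError("no unambiguous nucleotides in sequence")
    return 100.0 * gc / total


if __name__ == "__main__":
    print(gc_content("GGCCAATT"))
```

Comparing such values among sequences that cluster together in the tree is one way to follow the advice above.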

12. To which type of 16S similarities can the phylum-specific 16S thresholds from Meier-Kolthoff et al. (2013) be applied?

Whether or not a DNA:DNA hybridization value between two strains should be determined for the discrimination of species in a taxonomic analysis depends on the similarity of the two underlying 16S rRNA gene sequences. In the proposal by Meier-Kolthoff et al. (2013), the long-standing 97% 16S threshold (Stackebrandt and Goebel, 1994) was increased by replacing it not only with a general threshold but also with phylum-specific thresholds. Here, it is important to note that these thresholds originate from a statistical model based on an empirical 16S data set from which similarities were calculated under distinct settings. In order to properly apply the suggested thresholds to other strains, 16S similarities between them must be calculated under exactly the same settings, which is what the server does. For instance, even though pairwise sequence alignment can be solved exactly, it can yield distinct results under distinct settings, and these in turn can affect the resulting similarities.

13. What do "file 'query.fas' is empty (or does not exist)", "… not found" and "distinct numbers of accessions in query and download" mean?

The error message about the empty or missing query (or reference) FASTA file indicates that the data downloaded from GenBank do not contain the sequences, if any, in an acceptable format. Note that this can only be determined after the fact by examining the downloaded files. Usually this is due to either wrong GenBank accessions (note that these are searched for in the "nucleotides" database) or downtimes of the GenBank servers. The "… not found" warning points to a wrong GenBank accession, whereas distinct numbers of accessions in query and download most likely mean that GenBank did not respond. To further assess this, try to look up the same accession numbers via the GenBank web interface. If this does not fail, the error might be on the side of the GGDC phylogeny server. Please report a bug in that case; include the accessions used in your message.

14. Which data types are supported? E.g., can I analyse RNA sequences?

The server supports nucleotide (DNA or RNA) and amino-acid sequences. RNA sequences can be uploaded. They will then automatically be converted to DNA sequences. You cannot mix RNA and DNA sequences, however (you cannot mix DNA and amino-acid sequences either).
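The RNA-to-DNA conversion amounts to replacing uracil with thymine. A minimal Python sketch follows, with a simple check for mixed nucleotide input. Both function names are hypothetical, and the check is only meaningful for nucleotide data, because T also occurs in amino-acid sequences.

```python
# Translation table that maps uracil to thymine and preserves letter case.
_RNA_TO_DNA = str.maketrans("Uu", "Tt")


def rna_to_dna(sequence: str) -> str:
    """Convert an RNA sequence to DNA by replacing U with T."""
    return sequence.translate(_RNA_TO_DNA)


def mixes_rna_and_dna(sequences) -> bool:
    """Return True if U and T both occur anywhere in the nucleotide set.

    Assumption: the input holds nucleotide sequences only.
    """
    upper = [s.upper() for s in sequences]
    has_u = any("U" in s for s in upper)
    has_t = any("T" in s for s in upper)
    return has_u and has_t


if __name__ == "__main__":
    print(rna_to_dna("AUGGCU"))
    print(mixes_rna_and_dna(["AUGGCU", "ATGGCT"]))
```

Running a check of this kind before uploading can help avoid submitting mixed RNA and DNA data.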

15. When I upload aligned sequences, will that alignment be used?

When only query sequences but no reference sequences are provided and a query FASTA file containing already aligned sequences is uploaded, this alignment is recognized and used. Multiple sequence alignment conducted by the server itself is skipped in that case. Beware of uploading sequences that are actually unaligned but have been padded with gaps to obtain uniform lengths. Also note that dots are interpreted like dashes by the server because certain software packages use dots to indicate leading and trailing gaps. Some other programs use dots to indicate identity to the first sequence; this use of dots must be avoided when working with the GGDC phylogeny server.
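Before uploading, such pitfalls can be checked locally. The following Python sketch is a heuristic under stated assumptions: it treats dots as dashes, as described above; it regards sequences of uniform length as potentially aligned; and it flags input in which every gap is trailing, which suggests mere padding. How the server itself recognizes an alignment is not specified here, and all function names are hypothetical.

```python
def normalise_gaps(sequence: str) -> str:
    """Treat dots like dashes, as the text says the server does."""
    return sequence.replace(".", "-")


def uniform_length(sequences) -> bool:
    """Return True if all sequences have the same length (non-empty input)."""
    return len({len(s) for s in sequences}) == 1


def suspicious_padding(sequences) -> bool:
    """Return True if gaps occur and every gap is at the end of a sequence.

    Heuristic assumption: input that was only padded to a uniform length
    typically shows trailing gaps and no internal or leading ones.
    """
    gapped = [s for s in map(normalise_gaps, sequences) if "-" in s]
    return bool(gapped) and all("-" not in s.rstrip("-") for s in gapped)


if __name__ == "__main__":
    padded = ["ACGT--", "ACGTAC"]
    print(uniform_length(padded), suspicious_padding(padded))
```

A set that passes `uniform_length` but is flagged by `suspicious_padding` deserves a closer look before it is uploaded as an alignment.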