Answer

All uploaded gene sequences are deleted on the server within 24 hours after the calculations have been completed. The users' e-mail addresses are not made available to third parties; they are only used directly on the server to calculate pseudonymised usage statistics.

As for most web site maintainers, it is also important for us to know about the countries our users/visitors are coming from. We thus use Piwik, a free and open source web analysis application written by a team of international developers. It tracks online visits to our website and displays anonymised reports on these visits for analysis.

The entire GGDC web site supports HTTPS, a protocol for secure communication over a computer network and widely used on the Internet. In practice, this provides a reasonable guarantee that one is communicating with precisely the website that one intended to communicate with (as opposed to an impostor), as well as ensuring that the contents of communications between the user and site cannot be read or forged by any third party.

Answer

Use of this form is free for academic purposes. For all other uses, please contact the authors.

Answer

Across all submission forms, GenBank accessions can be specified in three different ways: either by using single accessions (separated by blanks), by providing a range of accessions (e.g. AE000782-AE000785), or a combination of both (e.g. AE000781 AE000782-AE000785). In either case all accessions (either single accessions or a range of accessions or both) which belong to a particular genome must be written in a single row within the respective text field. In the following example, rows 1, 3, 4 and 6 denote separate entities (viruses in that case) which are represented by a single GenBank accession ID each. In row no. 2 a range of accessions is provided as indicated by the hyphen ('-'), whereas row no. 5 shows an example in which multiple accessions are explicitly provided (even though they form a range, too).

AF227250
HM246720-HM246724
FJ483838
FJ539134
FJ539135 FJ539136 FJ539137
FJ539132
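
The row-based notation above can be expanded mechanically. The following is a minimal sketch of how such a text field could be interpreted, not the parser used by the GGDC server; the function names are hypothetical, and it assumes accessions consist of a letter prefix followed by digits, without a version suffix.

```python
import re

# Letters (optionally followed by an underscore), then digits, e.g. "HM246720"
ACCESSION = re.compile(r"^([A-Z]+_?)(\d+)$")


def expand_token(token):
    """Return the accessions named by one token: a single ID or a hyphenated range."""
    if "-" not in token:
        return [token]
    first, last = token.split("-", 1)
    m_first, m_last = ACCESSION.match(first), ACCESSION.match(last)
    if not m_first or not m_last or m_first.group(1) != m_last.group(1):
        raise ValueError(f"not a usable accession range: {token!r}")
    start, stop = int(m_first.group(2)), int(m_last.group(2))
    if stop < start:
        raise ValueError(f"range runs backwards: {token!r}")
    width = len(m_first.group(2))  # keep leading zeros, e.g. AE000782
    prefix = m_first.group(1)
    return [f"{prefix}{number:0{width}d}" for number in range(start, stop + 1)]


def parse_accession_field(text):
    """One genome per non-empty row; tokens within a row are separated by blanks."""
    genomes = []
    for row in text.splitlines():
        tokens = row.split()
        if tokens:
            genomes.append([acc for token in tokens for acc in expand_token(token)])
    return genomes


example = """AF227250
HM246720-HM246724
FJ483838
FJ539134
FJ539135 FJ539136 FJ539137
FJ539132
"""
genomes = parse_accession_field(example)
```

Each inner list of `genomes` then holds all accessions that belong to one genome, so the six rows of the example yield six entities.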

Please note that the download of sequences via GenBank master record accessions is not possible any more as of April 2017. One reason for this restriction is that it speeds up GenBank downloads. Moreover, such master records frequently link to distinct representations of the same assembled genome sequence and thus cause the comparison of duplicated sequences. An example for such a master record can be found here. It links to two distinct representations of the same genome sequence, although only one of them should be used:

  WGS         AGSE01000001-AGSE01000004
  WGS_SCAFLD  KK583188

Of course the GGDC already reported the sequence data which were used in the calculations in "check_file.txt" (attached to the result e-mail) but because this file might be overlooked we decided to not support the download of master records anymore. Instead, users can now explicitly specify ranges of accessions as shown above.

Answer

In empirical comparisons of GGDC with other digital DDH methods, GGDC yielded the highest correlations with traditional DDH, thus ensuring the highest consistency regarding the species-delimitation approach that currently dominates in microbial taxonomy, without sharing the disadvantages of traditional DDH. This is crucial because approaches such as ANI have been justified by their correlation with traditional DDH values, too. To the best of our knowledge, GGDC 2 is also the only replacement method for traditional DDH that provides confidence intervals. Moreover, the fact that GGDC delivers values on the same scale as traditional DDH (instead of, e.g., ANI values) makes it easy to compare GGDC results with wet-lab DDH values. Finally, as of December 2014 GGDC 2 conducts comparisons with subspecies boundaries, too, and in January 2016 G+C content calculations were also incorporated. A graphical overview of the advantages of GGDC over alternative methods is also available in this section of Hans-Peter Klenk's acceptance speech for the 2014 Bergey Award.

Answer

This is actually a loaded question, which, unfortunately, is occasionally still raised. The question is loaded because it presupposes a certain equivalence between traditional DDH and digital DDH that simply does not exist. More specifically, the question presupposes that the disadvantages of traditional DDH (which are the reasons for replacing the practice) are also disadvantages of digital DDH. This presumption is flawed, as can easily be demonstrated. The advantageous features of traditional DDH are (1) its use of the information from complete genomes, at least conceptually, and (2) its use of quantification. The disadvantageous features of traditional DDH are (3) that it is a tedious method that can only be conducted by a few specialized molecular laboratories and (4) that it does not work incrementally, i.e. the effort for one pairwise comparison does not yield anything of use for any other pairwise comparison.

It is obvious that digital DDH keeps (and even enhances) the advantageous features (1) and (2) but abandons the disadvantageous features (3) and (4). For this reason, the need to abandon traditional DDH is actually not an argument against but an argument for digital DDH. For instance, (3) does not hold for digital DDH because the limiting factor for digital DDH is genome sequencing, but once obtained a genome sequence can be used for many things in addition to digital DDH.

Sometimes the claim that one wants to get rid of traditional DDH is even put forward as an argument for preferring ANI over the GGDC and digital DDH. As such, the argument is an example of poor scholarship, if not outright foolishness. Indeed, the argument completely overlooks that the justification for ANI, regarding its use in general as well as regarding the ANI threshold for species delimitation, was its high correlation with traditional DDH. This holds for the original ANI approach put forward by Goris, Konstantinidis and colleagues as well as for the so-called ANIb and ANIm methods published by Richter and Rossello-Mora. However, using exactly the same criterion, digital DDH as calculated using the GGDC works better, because it yields an even higher correlation with traditional DDH. In addition to other advantages of the GGDC over ANI, this was demonstrated in our publications on the GGDC using empirical data sets larger than in previous publications.

Answer

Please see our 'Background' page for the publications that describe the algorithms and statistics used by the GGDC. By using this service you agree to cite at least one of the GGDC papers. There are additional references for subspecies delineation and G+C content interpretation.

Answer

Yes, if one uses formula 2 (which is the recommended formula anyway), one needs only about 20% of the genome to get the same result as with the full genome. The other two formulas will be severely affected by genome incompleteness. See also Auch et al. (2010).

Answer

GGDC 2 reports two types of confidence intervals (CIs) for specifying the uncertainty associated with the reported DDH estimates.

  1. Model-based CI.

    Like any statistical model, the one used for the prediction of DDH values from intergenomic distances has an inherent error of estimation, which can be assessed with the help of model-based CIs. Briefly, the CI for a given point estimate (i.e., fit) is calculated on the link scale of the GLM and the resulting bounds are then transformed using the inverse of the link function (e.g., according to: Zuur, A., et al. Mixed Effects Models and Extensions in Ecology with R. New York: Springer, 2009. Print.). As a consequence, the resulting CI is asymmetric, i.e., the width of the upper part of the interval (as measured from the point estimate) can differ from that of the lower part. In the result e-mail the model-based CI is given in square brackets after the DDH estimate.

    Such uncertainty is, of course, also present in alternative approaches such as ANI, even though we are not aware of an ANI implementation that actually calculates CIs. But in our view, ANI thresholds supposed to be equivalent to 70% wet-lab DDH should be provided with CIs, too. (The uncertainty in ANI implementations might actually be higher than the one inherent to GGDC because GGDC uses a larger empirical data set; see Meier-Kolthoff et al., 2013.)

  2. Resampling-based CI.

    The intergenomic distances used for DDH prediction can themselves be resampled via a special bootstrap implementation (see Meier-Kolthoff et al., 2013). These so-called replicate distances deviate from the original distance to a certain extent and thus allow for the assessment of a 95% confidence interval on both DDH and GGD (Genome-to-Genome Distance) scales.

Since the model-based CIs are usually larger than those provided by bootstrapping, the latter are optional and not reported by default.
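
The two kinds of interval can be illustrated with a small sketch. It is not the GGDC implementation: the logit link, the numeric values of the linear predictor and its standard error, and the simulated replicate distances are all assumptions made for illustration, and a plain percentile interval stands in for the special bootstrap described in Meier-Kolthoff et al. (2013).

```python
import math
import random


def inverse_logit(eta):
    """Inverse of the logit link (assumed here; the link actually used is not stated above)."""
    return 1.0 / (1.0 + math.exp(-eta))


def model_based_ci(eta, se, z=1.96):
    """Build the interval on the link scale, then back-transform the bounds to percent."""
    lower = 100.0 * inverse_logit(eta - z * se)
    point = 100.0 * inverse_logit(eta)
    upper = 100.0 * inverse_logit(eta + z * se)
    return lower, point, upper


def percentile_ci(replicates, level=0.95):
    """Percentile interval taken directly from a set of replicate values."""
    values = sorted(replicates)
    alpha = (1.0 - level) / 2.0
    lo_idx = math.floor(alpha * (len(values) - 1))
    hi_idx = math.ceil((1.0 - alpha) * (len(values) - 1))
    return values[lo_idx], values[hi_idx]


# Hypothetical fit: linear predictor 1.1 with standard error 0.1 on the link scale
lower, point, upper = model_based_ci(eta=1.1, se=0.1)

# Hypothetical replicate distances scattered around an original distance of 0.0292
rng = random.Random(0)
replicates = [rng.gauss(0.0292, 0.002) for _ in range(1000)]
dist_lower, dist_upper = percentile_ci(replicates)
```

Because the interval is symmetric only on the link scale, the back-transformed bounds lie at unequal distances from the point estimate, which is the asymmetry described for the model-based CI.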

Answer

The calculation of bootstrap-based confidence intervals (CIs) is optional (see previous FAQ item) and relatively compute-intensive. Hence, to avoid unnecessary load on the server, we disable the calculation of bootstrap-based CIs if the number of genomes exceeds a certain threshold (currently 20). (No worries, model-based confidence intervals are always reported.)

Answer

The GGDC compares a query genome with a reference genome and calculates an intergenomic distance under three different distance formulae. An in-depth description of these can be found in the accompanying publications. The formulae support your decision about the relatedness of your novel strain to known (type) strains.

  • Formula 1: length of all HSPs divided by total genome length
  • Formula 2: sum of all identities found in HSPs divided by overall HSP length
  • Formula 3: sum of all identities found in HSPs divided by total genome length
Note
Formula 2 is independent of genome length and is thus robust against the use of incomplete draft genomes.
For other reasons for preferring formula 2, see . If there are any significant differences between the three formulae, please base your decision on the recommended formula 2.
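
The three formulae can be sketched in a few lines. This is an illustration only: it assumes that each distance is one minus the ratio named in the list above and that HSP lengths, identities and genome lengths have already been summed over both genomes; the exact definitions are given in the accompanying publications, and the function name and input numbers are hypothetical.

```python
import math


def gbdp_style_distances(hsp_length, identities, total_length):
    """Return (d1, d2, d3), each taken as 1 minus the ratio named for that formula."""
    d1 = 1.0 - hsp_length / total_length
    # Without any HSPs the ratio 0/0 is undefined, so NaN is returned instead of raising
    d2 = (1.0 - identities / hsp_length) if hsp_length > 0 else math.nan
    d3 = 1.0 - identities / total_length
    return d1, d2, d3


# Invented counts: 800 of 1000 positions covered by HSPs, 760 of them identical
d1, d2, d3 = gbdp_style_distances(hsp_length=800, identities=760, total_length=1000)

# No hits at all: formula 2 has nothing to divide by
no_hits = gbdp_style_distances(hsp_length=0, identities=0, total_length=1000)
```

Under this reading, formula 3 combines the other two, since 1 - d3 = (1 - d1) * (1 - d2). The distances in the example results of this FAQ are consistent with that relation up to rounding, and the sketch also shows why only formula 2 is independent of the total genome length.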

An exemplary result from the GGDC 1.0 would look like this:
   
  Submission: 13-04-05-11-48-0015480
  Program: NCBI-BLAST


  =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
  Escherichia_coli_IAI1 (query) vs. Escherichia_coli_IAI39 (reference):
  =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=


  Formula: 1 (HSP length / total length)
  Distance: 0.1646
  DDH estimate (regression-based): 76.90
  Estimate of DDH <=70% (threshold-based): no (threshold=0.2676)

  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

  Formula: 2 (identities / HSP length)
  Distance: 0.0304
  DDH estimate (regression-based): 77.06
  Estimate of DDH <=70% (threshold-based): no (threshold=0.0412)

  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

  Formula: 3 (identities / total length)
  Distance: 0.1899
  DDH estimate (regression-based): 76.06
  Estimate of DDH <=70% (threshold-based): no (threshold=0.2945)
   
  
An exemplary result from the GGDC 2.1 would look like this:
   
  =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
  E_coli_K12_W3110 (query) vs. E_coli_O1_K1_H7_DSM_30083 (reference):
  =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=


  Formula: 1 (HSP length / total length)
  Distance: 0.1614
  DDH estimate (GLM-based): 74.20% +3.62/-3.98
  Probability that DDH > 70% (i.e., same species): 83.25% (via logistic regression)
  Probability that DDH > 79% (i.e., same subspecies): 43.32% (via logistic regression)

  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

  Formula: 2 (identities / HSP length) (RECOMMENDED)
  Distance: 0.0292
  DDH estimate (GLM-based): 75.20% +2.78/-3.01
  Probability that DDH > 70% (i.e., same species): 85.97% (via logistic regression)
  Probability that DDH > 79% (i.e., same subspecies): 38.55% (via logistic regression)

  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

  Formula: 3 (identities / total length)
  Distance: 0.1859
  DDH estimate (GLM-based): 77.10% +3.13/-3.46
  Probability that DDH > 70% (i.e., same species): 92.14% (via logistic regression)
  Probability that DDH > 79% (i.e., same subspecies): 46.06% (via logistic regression)

  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

  Difference in % G+C: 0.16  (interpretation: either distinct or same species)
   
  

This example also shows that a lower distance value with formula 2 than with formula 1 (or 3) might yield a lower DDH estimate with formula 2 than with formula 1 (or 3). This is simply caused by the three distance formulae operating on three different scales. Distance values can only be compared if they have been obtained with the same formula. If so, lower distances will of course always yield a higher DDH estimate. The DDH estimates are, of course, all on the same scale.

The interpretation of this example is as follows. All three formulae confirm the hypothesis that the two genomes belong to the same species (note the confidence intervals). Because the reference genome is from the type strain, this confirms that the query genome is from an Escherichia coli strain, too. All three formulae also yield a dDDH estimate smaller than 79%, which indicates that the two strains belong to distinct subspecies. This result is significant for formula 2, which is the recommended one, but not for formula 3. More details on the subspecies of E. coli are found elsewhere. For the interpretation of the G+C content, see the FAQ items on G+C content.
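
The comparison of the confidence intervals with the 70% and 79% boundaries can be reproduced with a short script. It is a sketch only: it assumes that the "+x/-y" suffix in the GGDC 2.1 example gives the distances of the upper and lower CI bounds from the estimate, and the regular expression is written for the lines shown in that example rather than for any documented output format.

```python
import re

# Matches e.g. "DDH estimate (GLM-based): 74.20% +3.62/-3.98"
ESTIMATE = re.compile(r"DDH estimate \(GLM-based\): ([\d.]+)% \+([\d.]+)/-([\d.]+)")


def ddh_intervals(report):
    """Return one (estimate, lower, upper) tuple per estimate line found in the report."""
    intervals = []
    for estimate, plus, minus in ESTIMATE.findall(report):
        estimate, plus, minus = float(estimate), float(plus), float(minus)
        intervals.append((estimate, estimate - minus, estimate + plus))
    return intervals


report = """\
  Formula: 1 (HSP length / total length)
  DDH estimate (GLM-based): 74.20% +3.62/-3.98
  Formula: 2 (identities / HSP length) (RECOMMENDED)
  DDH estimate (GLM-based): 75.20% +2.78/-3.01
  Formula: 3 (identities / total length)
  DDH estimate (GLM-based): 77.10% +3.13/-3.46
"""

intervals = ddh_intervals(report)
# Whole interval above the species boundary of 70%?
above_species_boundary = [lower > 70.0 for _, lower, _ in intervals]
# Whole interval below the subspecies boundary of 79%?
below_subspecies_boundary = [upper < 79.0 for _, _, upper in intervals]
```

Under that reading, all three lower bounds stay above 70%, whereas only the intervals for formulae 1 and 2 stay entirely below 79%; the interval for formula 3 extends beyond it.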

Answer

Processing of your submission may take several minutes depending on the current workload of the server, genome sizes and, of course, the total number of comparisons. Using GenBank accession numbers (such as AE000782) is slightly slower than uploading FASTA files directly because the genomes need to be downloaded from GenBank first. Here, the GGDC depends on the accessibility of GenBank; during rush hours, when GenBank is accessed by numerous users, the download speed might be somewhat reduced. Moreover, choosing BLAT as local alignment tool, in conjunction with FASTA files that consist of many genome parts, may be extremely slow; use the default settings instead (BLAST+). After the job has finished, an e-mail containing the results will be sent to the user-provided address.

As a rule of thumb, a comparison between two E. coli genomes takes about 1 minute, whereas the same comparison with optional bootstrap replicate-based confidence intervals takes almost 5 minutes (note: calculating bootstrap-based confidence intervals is not necessary because the DDH prediction model itself already provides confidence intervals). However, if you provide significantly larger genomes, the running time might be longer.

Answer

This service is designed for small and medium-sized datasets with genome sequences of at most 15 MB in size. This limitation should be sufficient for all currently sequenced prokaryotic genomes. For example, the largest prokaryotic genomes sequenced to date have approximately 13 Mbp (as in the case of Ktedonobacter racemifer). Whereas up to 50 reference genomes are allowed, there is an additional overall size limit of c. 450 MB, which we intend to lift in the future. It only affects FASTA upload but not the use of GenBank accession numbers (such as AE000782). If you intend to use GGDC for larger data sizes, please contact the authors.
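
Before uploading FASTA files, the limits named above can be checked locally. The following sketch is hypothetical and not part of the GGDC: it assumes that 1 MB means 10^6 bytes and that the overall limit counts the query together with the references, neither of which is stated above.

```python
MAX_GENOME_BYTES = 15 * 10**6   # "at most 15 MB" per genome sequence
MAX_REFERENCES = 50             # "up to 50 reference genomes"
MAX_TOTAL_BYTES = 450 * 10**6   # "c. 450 MB" overall, FASTA upload only


def check_fasta_upload(query_bytes, reference_bytes):
    """Return a list of problems; an empty list means the assumed limits are met."""
    problems = []
    sizes = [query_bytes] + list(reference_bytes)
    for index, size in enumerate(sizes):
        if size > MAX_GENOME_BYTES:
            label = "query" if index == 0 else f"reference {index}"
            problems.append(f"{label} exceeds 15 MB")
    if len(reference_bytes) > MAX_REFERENCES:
        problems.append("more than 50 reference genomes")
    if sum(sizes) > MAX_TOTAL_BYTES:
        problems.append("overall upload exceeds c. 450 MB")
    return problems


# Ten references of 5 MB each plus a 5 MB query stay within all limits
small_job = check_fasta_upload(5_000_000, [5_000_000] * 10)
# Fifty references of 10 MB each respect the per-genome limit but not the overall one
large_job = check_fasta_upload(10_000_000, [10_000_000] * 50)
```

File sizes could be obtained with `os.path.getsize`; the sketch takes plain numbers so that it does not depend on any particular files.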

Answer

Distance calculation frequently fails with wrong input data, i.e. if neither complete genomes nor draft genomes are provided. The upload of single-gene (e.g., 16S rRNA gene) nucleotide or amino-acid sequences is strongly discouraged, as GGDC is not devised for this kind of data (even though the underlying GBDP method can be applied to such data). This often results in a distance value deliberately indicating an error, such as 'NaN' (i.e., 'not a number').

Another frequent kind of failure is caused by wrong GenBank accession numbers, as indicated in the GGDC result e-mails. Note that GenBank accession numbers of amino-acid sequences also yield a GenBank download error. Suitable GenBank accession numbers such as AE000782 refer to a nucleotide genome sequence (additional protein annotation causes no harm). GenBank assembly accession numbers are not supported. As of April 2017, GenBank master record accession numbers are not supported any more either. See the general FAQ for details.

In rare cases we also observed that alignment programs such as BLAST+ might fail during the alignment of genomes containing an extremely high number of contigs, thus producing no hits. Here, the GGDC would also produce 'NaN' as a result to indicate an erroneous run.

Answer

GGDC v1.0 is intended for comparing assembled whole-genome sequences comprising one to several chromosomes, extrachromosomal replicons or scaffolds. It is not intended for dealing with, e.g., multi-FASTA files separately containing the nucleotide sequences of all genes. The GGDC v1.0 service cannot handle multi-FASTA files with a large number of FASTA entries (say, more than 500 sequences), since the multi-FASTA files are split to FASTA files each containing one single entry before processing. The GGDC v2.0 supports any multi-FASTA format. Its DDH prediction model has been accordingly adapted (and has many other advantages over GGDC v1.0 anyway).

Answer
  • One of them is a CSV (comma-separated values) file containing all results in a format readable by spreadsheet programs such as LibreOffice, OpenOffice or Excel (set the column separator to a comma).

  • Another one is the text file "check_file.txt" that can be used for checking whether or not the appropriate datasets were compared. It contains not only the list of compared genomes but also the list of all genome parts (scaffolds and/or replicons) used by GGDC for this submission. It is recommended that this file is consulted before conclusions are drawn from the GGDC outcome. As of GGDC 2.1, the check file is in valid YAML format and could automatically be parsed with a script.

Answer

In the past, if genome names had not been explicitly specified during the submission process, the name "genome_1" would have been used as placeholder to refer to the query organism, whereas "genome_2" to "genome_11" would have referred to the (up to ten) reference organisms in the order they had been listed in the submission form. Now, the GGDC directly uses the FASTA file names as provided by the user, i.e., there is no need for a tedious manual specification of genome labels anymore.

Answer

'Infinity' and 'NaN' mean that no matches have been found between the two genomes (given the local-alignment settings), hence the respective distance formula yields anomalous results (i.e., a division by zero occurs). In such cases you can be sure that the DDH estimate is virtually 0% unless the genome sequences could not even be downloaded.

If you have provided invalid GenBank accession IDs, the GGDC will also report 'NaN' as a result (rarely a negative number), because a genome comparison is not possible in such a case for obvious reasons. See the FAQ item on GenBank download errors for why this can happen.
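
When results are processed with a script, such values are easy to screen out. The following is a minimal sketch that assumes the distance fields contain the literal strings 'NaN' and 'Infinity' quoted above; the function names are hypothetical.

```python
import math


def parse_distance(field):
    """Convert a distance field to float; float() accepts 'NaN' and 'Infinity' directly."""
    return float(field.strip())


def is_usable(distance):
    """A distance is usable only if it is finite and not negative."""
    return math.isfinite(distance) and distance >= 0.0


fields = ["0.0292", "NaN", "Infinity", "-1"]
usable = [is_usable(parse_distance(field)) for field in fields]
```

Rows flagged as unusable should be checked against "check_file.txt" before any conclusion is drawn, because they can indicate either truly unrelated genomes or a failed download.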

Answer

If you have specified an accession number as input data, our server will try to download the corresponding sequences from NCBI. However, if the accession number is either invalid or does not contain sequence data at all, the GGDC will report this incident via the above message. In such a case make sure that all your GenBank accessions are valid and actually provide sequence information. There are various ways to specify GenBank accessions as detailed in the General FAQ.

Please note that the use of so-called 'master record' accessions is not supported anymore, which is detailed in the General FAQ, too.

Answer

In many cases the three distinct formulae give almost identical results regarding DDH. For instance, for two E. coli genomes used as test data, with BLAST+ we obtain 74.80% ± 3.80, 76.70% ± 2.87 and 77.80% ± 3.28, respectively. But for other genomes the results can be distinct, and sometimes even different regarding the 70% boundary. This is not caused by an error in the calculations but by the three distance formulae exploring distinct aspects of genome evolution.

All models for estimating DDH from intergenomic distances have been inferred separately, and all formulae yielded very high correlations. Whereas the test dataset used was larger than in any previous studies, we were unable to prefer a certain distance formula on the basis of the model-building results alone when using the test dataset. This does not mean, however, that other criteria would not yield stronger differences between the formulae. For instance, formula 2 is the only one that can be used with incompletely sequenced genomes. But there are other reasons why we recommend formula 2.

From a biological point of view, if all genomes have been sequenced completely (have "Finished" status), or almost so, but formula 1 yields much higher DDH similarities than formula 2, this indicates that the two strains changed comparatively little in gene content but comparatively strongly regarding the gene sequences. Depending on the kind of organisms and on the selection pressures under which they evolved, this might not be an unreasonable scenario, even though many strains within the same species differ considerably in their gene content. For instance, two strains might differ more or less only in the presence of a plasmid. This alone might be a reason for preferring formula 2, but the other two formulae potentially provide valuable additional biological information and hence are included in the results.

Answer

It is often crucial to not only consider what the conventions are but also what the rationale is behind these conventions. The 70% DDH rule has the advantage of making species quantitatively comparable. Whereas this alone could be achieved with any threshold, taxonomic conservatism dictates that the well-known, predefined threshold should be maintained wherever possible. The GGDC allows for keeping this threshold while at the same time moving to modern, genome-sequence based and highly reliable methods.

Bacterial subspecies were, indeed, traditionally not determined based on a distance or similarity threshold, but on a qualitative assessment of usually quite few phenotypic characters. Moreover, compared to the number of validly published species names, not many subspecies names have a standing in nomenclature. So taxonomic conservatism would not be significantly violated by the introduction of a quantitative threshold for subspecies delineation. (This holds even though switching the criterion implies that one cannot expect the new method to frequently confirm existing subspecies boundaries.) Moreover, we believe that the high resolution provided by genome sequences now calls for introducing methods that make microbial subspecies quantitatively comparable, too.

Given the other advantages of the GGDC, it thus makes most sense to establish a dDDH threshold for subspecies delineation. But which one? Because of the low number of validly published subspecies names, and due to the fact that they are not expected to have any quantitative consistency regarding traditional DDH, it does not make much sense to attempt to estimate a boundary from the currently existing microbial subspecies. Rather, clustering consistency was the main criterion in our approach to determine a dDDH threshold for subspecies. It suggested a value of c. 79% dDDH.

Answer

To the best of our knowledge a threshold of traditional DDH similarities has never been established for genera. Hence there is no GGDC threshold for genus boundaries either. We fear it would not make much sense to introduce one. One of the reasons for our scepticism is saturation. That is, the lower DDH similarities get, the worse might be their representation of phylogenetic relationships. Moreover, the larger the overall phylogenetic distances, the higher the chance of getting less ultrametric data. Non-ultrametricity is a general phenomenon not directly related to traditional or digital DDH. Not caring about ultrametricity at all is taxonomically naive, as is evident from probably every textbook on phylogenetics. If any data deviate too strongly from ultrametricity, applying pairwise distance or similarity thresholds to them for delineating taxa is frivolous. For species and subspecies boundaries, it has been shown that non-ultrametricity is not normally a problem for the GGDC (we are not aware that alternative approaches such as ANI have even been assessed in this respect). However, delineating genera using pairwise distances or similarities could suffer more strongly from non-ultrametricity.

Answer

Indeed, extrachromosomal DNA is present in many bacteria and thus is expected to impact traditional DDH experiments as well as digital DDH calculations. But if bias due to extrachromosomal DNA really was a problem, traditional DDH should simply never have become the gold standard in bacterial species delimitation. GGDC as well as methods such as ANI use entire genomes and do not discriminate between chromosomal and extrachromosomal DNA. Note that changes in the assignment of genes into replicons do not necessarily indicate significant differences in gene presence or absence, let alone in gene sequences. But only those would significantly affect DDH. For this reason, users concerned about distortion by extrachromosomal elements should just use formula 2, which is the recommended formula anyway. It is quite unaffected by changes in gene content. All three formulas are expected to be independent from changes in gene order (Henz et al. 2005).

The same reasoning holds for horizontal gene transfer (HGT). All bacterial genomes are to some degree affected by HGT, hence all whole-genome methods (including traditional DDH, digital DDH and ANI) are to some degree affected by HGT. HGT can influence both the gene content (by adding genes homologs of which were not present before to a genome) as well as the similarity between homologous genes found in distinct genomes (by adding genes homologs of which were already present to a genome). Like the main ANI implementations, GBDP formula 2 is quite unaffected by gene content. Regarding gene similarity, all GBDP methods have the advantage of conducting an on-the-fly correction for paralogy. This means that in the case of overlapping hits the better one is preferred. As hits to xenologous genes are unlikely to be better than hits to orthologous genes, particularly in the case of closely related genomes, this kind of correction is likely to temper the impact of HGT, too. In contrast, ANI has no correction for paralogy and thus is more likely to be affected by HGT than GBDP.

Answer

In contrast to DDH dissimilarities, differences in percent genomic G+C content between distinct species can be quite close to zero. They just cannot be larger than 1 within the same species (Meier-Kolthoff et al. 2014). Thus when DDH indicates same species, a percent G+C content difference > 1 is not normally possible (and should be reported to the GGDC staff), whereas a percent G+C content difference <= 1 confirms the DDH result. When, in contrast, DDH indicates distinct species, a percent G+C content difference > 1 confirms this, whereas a percent G+C content difference <= 1 does not say anything. Note that within-species differences in percent G+C content > 1 reported in the older literature are due to artefacts of the applied methods; genome sequencing is expected to be far more exact regarding the G+C content.
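
The four cases can be written down as a small decision function. This is merely a restatement of the rule above in code; the function name and the returned phrases are hypothetical and do not correspond to GGDC output.

```python
def interpret_gc_difference(ddh_same_species, gc_difference):
    """Combine a DDH-based species call with the difference in percent G+C content."""
    if gc_difference > 1.0:
        if ddh_same_species:
            # Not normally possible within a species; worth reporting
            return "conflict with DDH result"
        return "confirms distinct species"
    if ddh_same_species:
        return "confirms same species"
    # Distinct species can have almost identical G+C content
    return "uninformative"


# The GGDC 2.1 example above: same species according to DDH, difference of 0.16
example_call = interpret_gc_difference(ddh_same_species=True, gc_difference=0.16)
```

Taken on its own, a difference of at most 1 is compatible with both outcomes, which matches the wording "either distinct or same species" in the example result.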

Answer

Exact G+C content values inferred from the genome sequences are included in the check file attached to each GGDC result message. The values for entire genome sequences should be included in publications on these genomes, particularly if taxonomic conclusions are drawn from them. The G+C content values for individual genome parts (such as scaffolds, contigs, chromosomes or extrachromosomal replicons) should also be checked because strong deviations between them might indicate contaminations or assembly artefacts.

Estimating dDDH similarities is not actually needed to obtain exact genomic G+C content values. If you are only interested in them, just arbitrarily choose one of your genome sequences as query and the others as references. The check file lists the G+C content values for all of them. If you have only a single genome, simply compare it to itself.
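
For a quick local cross-check, the G+C content per genome part can also be computed directly from a FASTA file. The sketch below is not the GGDC implementation; it assumes that only unambiguous A, C, G and T are counted, which may differ from how the check file treats ambiguity codes.

```python
def gc_content_per_record(fasta_text):
    """Return {record name: percent G+C} for FASTA-formatted text."""
    counts = {}  # record name -> [number of G or C, number of A, C, G or T]
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].strip()
            counts[name] = [0, 0]
        elif line and name is not None:
            sequence = line.upper()
            gc = sequence.count("G") + sequence.count("C")
            at = sequence.count("A") + sequence.count("T")
            counts[name][0] += gc
            counts[name][1] += gc + at
    return {
        record: 100.0 * gc / total
        for record, (gc, total) in counts.items()
        if total > 0
    }


fasta = """>chromosome
GGCC
AATT
>plasmid
GCGC
"""
gc_values = gc_content_per_record(fasta)
```

Strongly deviating values between the parts of one genome, as in this toy input, are the kind of signal that might point to contamination or assembly artefacts.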

Answer

It is correct that the expected sequence identity of two random nucleotide sequences is 25%. However, this holds only if the sequences can be globally aligned without gaps. In contrast, genomes evolve not only via substitutions, insertions and deletions within genes, but also via gains, losses and rearrangements of entire genes. Thus the 25% boundary has no direct meaning for entire genomes. Moreover, digital DDH starts by determining local alignments between two genomes. These local alignments would not normally be found by programs such as BLAST (or filtered out later on due to low quality) if the two genomes had a random relationship throughout. Their intergenomic distance would be maximum and their digital DDH similarity would be 0 (or close to 0, depending on the model). For this reason, intergenomic distances calculated by GBDP, and dDDH values derived from them, are meaningful throughout their range and should literally be reported. One must only keep in mind that some kind of saturation occurs, as very low real identities cannot be distinguished from each other. Even in the case of data for which the 25% identity boundary was meaningful, 25% dDDH would correspond to far more than 25% identity.
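
The 25% figure itself follows from a one-line calculation: at a given position of an ungapped alignment, two independent random sequences agree with a probability equal to the sum of the squared base frequencies. The sketch below only illustrates this arithmetic; the skewed frequencies are invented.

```python
def expected_identity(base_frequencies):
    """Probability that two independently drawn bases are identical."""
    return sum(p * p for p in base_frequencies.values())


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
gc_rich = {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15}  # invented example

identity_uniform = expected_identity(uniform)
identity_gc_rich = expected_identity(gc_rich)
```

With equal base frequencies the result is 0.25; with skewed frequencies it is somewhat higher, which is a further reason not to read the 25% boundary too literally.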

Answer

This can only happen when using GenBank sequence accession numbers (such as AE000782). The cause of this error is low availability of the GenBank servers, which might be due to maintenance downtime or high demand. Unless the check file attached to the GGDC message indicates that the data used are complete anyway, you should definitely try the GGDC server again at a later time in conjunction with GenBank.

Answer

The GGDC 1.0 was superseded by the GGDC 2.1, which is an updated and enhanced version of the previous GGDC 1 and incorporates improved DDH-prediction models and additional features such as confidence-interval estimation. To the best of our knowledge, it is the only digital DDH method that provides this feature. Of all genome-based methods we are aware of, GGDC 2 yields the highest correspondence to traditional DDH (without sharing its drawbacks). Details are described in our BMC Bioinformatics study.

Answer

The freely available GGDC 2.1 implements the latest version of the Genome BLAST Distance Phylogeny (GBDP) method as published in Meier-Kolthoff et al. (2013). Even though a legacy version of GBDP is available here, it lacks important features (e.g., calculation of pseudo-bootstrapping replicates, prediction of confidence intervals etc.). That said, we are planning to release the latest version of GBDP in the course of 2017 but there is still a little bit of work left (e.g., writing a proper user manual, code documentation, publication of an application note etc.). Once the standalone version is available, we will announce it on this website. Until then, if you require larger analyses (phylogenomic analyses as well as (sub-)species delimitation via digital DDH) that do not fit into the scope of the web service, please let us know.

Answer

"VICTOR" stands for "Virus Classification and Tree Building Online Resource".

Use VICTOR if you want to infer phylogenies from the genome or proteome sequences of (prokaryotic and potentially other) viruses and/or obtain estimates for taxon boundaries at distinct ranks.

Work on VICTOR has been funded by the German Research Council as part of the SFB TRR 51.

Answer

All relevant citations are listed in the result e-mails sent around by this service. The main VICTOR publication has been published in Bioinformatics.

Answer

Close to the middle of the main text of the result e-mails sent around by this service, suggestions for phrasing the according sections in the methods as well as the results chapter are contained. You just have to format and arrange them according to the instructions for authors of the chosen journal. You might also need to rephrase them slightly to avoid being falsely detected by plagiarism scanners. Watch out for instructions enclosed in square brackets. These indicate sections whose content must frequently be adapted, too.

Answer

Based on the results of the VICTOR service, users can make an informed decision on the evolutionary relationships between prokaryotic viruses. The method was thoroughly optimized against a large reference dataset of genome-sequenced taxa recognized by the International Committee on Taxonomy of Viruses (ICTV) and showed a high agreement with the classification, particularly at the species and genus level. See the for details.

Use of VICTOR is simple. Data can be uploaded in and yield result e-mails.

It should not be a problem to apply VICTOR to the genomes or proteomes of other kinds of viruses. Phylogenetically and regarding the estimates for taxon boundaries, VICTOR might even work well for them, too. VICTOR has just not yet been tested in this respect.

Answer

Once your VICTOR submission has received a free computation slot on the server, the estimated running time of your job is expected to be as shown below. If you want to check whether or not there are still free slots, you can check the payload progress bar at the end of the VICTOR submission page.

Estimated running time of VICTOR submissions depending on data type
If sufficient server resources are available, VICTOR switches to a fast track mode, thus reducing the overall running time of your submission by a factor of about 4.

Answer

You can upload FASTA files, GenBank files and/or GenBank accession IDs. Please note that there are various ways to specify GenBank accessions as detailed in the General FAQ.

Analysis is either at the genome or proteome level; you cannot mix them. Incomplete genomes can be analysed, but then other distance formulas must be preferred.

At least four usable genomes or proteomes must be uploaded, otherwise phylogenies cannot be inferred.

A length check ensures that genomes of cellular organisms are not processed by VICTOR. If you think this length check hinders you analysing viruses with VICTOR, please contact the authors.

Answer

VICTOR delivers e-mails which contain the results from applying distinct distance formulas in otherwise identical GBDP runs. This means one tree per formula and one set of clustering results per formula. The indicates that formula d6 should be preferred when amino-acid sequences of prokaryotic viruses are analysed — unless incomplete proteome sequences are contained in the data set. In that case d4 is the formula of choice. The also indicates that formula d0 should be preferred when nucleotide sequences of prokaryotic viruses are analysed — unless incomplete genome sequences are contained in the data set. In that case d4 is again the formula of choice.

The meaning of the three formulas is described in the GGDC FAQ but for historical reasons it uses different terms. GGDC formula 1 is VICTOR d0, GGDC formula 2 is VICTOR d4, and GGDC formula 3 is VICTOR d6.
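The recommendations and the naming correspondence given above can be summarised in a small lookup. The following Python sketch is purely illustrative: the names `GGDC_TO_VICTOR` and `preferred_formula` as well as the data-type labels are assumptions made for this example and are not part of VICTOR or the GGDC.

```python
# Correspondence between GGDC formula numbers and VICTOR formula names,
# as stated in this FAQ.
GGDC_TO_VICTOR = {1: "d0", 2: "d4", 3: "d6"}


def preferred_formula(data_type: str, incomplete: bool) -> str:
    """Return the formula this FAQ recommends for prokaryotic viruses.

    data_type:  "nucleotide" or "amino-acid" (hypothetical labels).
    incomplete: True if incomplete genomes or proteomes are in the data set.
    """
    if data_type not in ("nucleotide", "amino-acid"):
        raise ValueError(f"unknown data type: {data_type!r}")
    if incomplete:
        return "d4"  # recommended whenever incomplete sequences are present
    return "d0" if data_type == "nucleotide" else "d6"


print(preferred_formula("amino-acid", incomplete=False))  # d6
```

For other kinds of viruses the recommendation may differ, since the method has only been optimized for prokaryotic viruses.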

Answer

The files attached by the service use the following standardized file extensions:

pdf
PDF file depicting a midpoint-rooted phylogenetic tree. This is not necessarily a publication-ready figure.
phy
Phylogenetic tree in Newick format, labels cleaned.
tsv
Tabulator-separated file containing the affiliations to clusters at the species, genus and family rank.

Marks for the cluster affiliations at the species (S), genus (G) and family (F) level contained in the tip labels of the phylogenetic trees are found after an "@" sign.

Answer

For phylogeny reconstruction, this service combines state-of-the-art software for multiple sequence alignment, maximum likelihood (ML) and maximum parsimony (MP) analysis. Nucleotide data are optionally downloaded from GenBank and always automatically checked for reverse-complement sequences and duplicated labels. In the case of amino-acid data, the optimal model for ML is automatically determined (for nucleotide data, we believe GTR to be alright, as does the author of RAxML). The pipeline is thus ideally suited for moderately sized single-gene data sets as used, e.g., in the description of new bacterial or other species. Uploaded RNA sequences are automatically converted to DNA sequences.

Moreover, pairwise nucleotide similarities are optionally calculated. The method for calculating these similarities corresponds exactly to the one used in a study that defined 16S rRNA gene similarity thresholds for determining whether a DDH experiment is mandatory when deciding if two strains should be assigned to the same or to distinct species. These thresholds are available for specific user-chosen error ratios as well as with phylum-specific values.

In contrast to many other phylogeny tools which truncate sequence labels or replace characters within them, here the original labels are provided in the output. The only modifications made are trimming whitespace from their ends and replacing consecutive runs of whitespace characters with a single space.
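The two label modifications described above can be reproduced locally, for example to predict how labels will appear in the output. This is a minimal Python sketch, not the code used by the service; the function name `normalize_label` is an assumption, and which characters the server treats as whitespace is not specified here.

```python
def normalize_label(label: str) -> str:
    """Trim whitespace from both ends and collapse internal whitespace runs.

    str.split() without arguments splits on any run of whitespace and
    ignores leading and trailing whitespace, so re-joining the pieces
    with a single space performs both modifications at once.
    """
    return " ".join(label.split())


print(normalize_label("  Escherichia   coli\tK-12 "))  # Escherichia coli K-12
```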

Finally, the result e-mails already include publication-ready text describing all methods used in the pipeline, the results, and the according literature references.

Answer

All necessary references are listed at the bottom of the main text of the result e-mails sent around by this service. You just have to format and arrange them according to the instructions for authors of the chosen journal.

Answer

The files attached by the service use the following standardized file extensions:

File type Details
fas Multiple sequence alignment in FASTA format, original labels restored.
pdf PDF file depicting the midpoint-rooted phylogenetic tree. This is intended to look good but it is not necessarily a publication-ready figure.
phy Phylogenetic tree in Newick format, labels cleaned. Use it if your software cannot read the NEXUS-formatted tree.
tre Phylogenetic tree in NEXUS format, original labels restored and protected.
tsv Tabulator-separated file containing either the percent nucleotide similarities between each query sequence and all reference sequences or the G+C content of the sequences.

Each of the file types might be missing, as the according step of the analysis might not have been requested, or an error might have occurred. The main text of the result e-mail contains an analysis protocol with a detailed list of requested, successful and unsuccessful steps.

The phylogenetic tree in NEXUS format is ideally suited for viewing and manipulating it with FigTree but should be compatible with all NEXUS-compliant tree viewers. The tree description itself is unrooted (ML and MP yield unrooted trees!), but the FigTree block contains instructions for midpoint-rooting, as described in the e-mail text. You can re-root the tree when necessary using, e.g., an outgroup contained in the data set, but then you should explicitly explain how the rooting was conducted.

For viewers which do not understand NEXUS (and do not understand the proper Newick format, which allows for protecting any label with single quotes), the tree in Newick format can be used. Special characters within labels are replaced in that file. This tree is unrooted, as ML and MP yield unrooted trees. You should re-root it using, e.g., an outgroup contained in the data set, and explicitly explain how rooting was conducted.

Two kinds of TSV files can be produced. The first kind contains three columns per line providing (1) the name of a query sequence, (2) the name of a reference sequence and (3) the pairwise similarity between them. This kind of file can be unselected by choosing to infer phylogenies only. The second kind of TSV file contains two columns per line providing (1) the name of a sequence and (2) its percent G+C content. This is useful for assessing phylogenetic distortion due to a compositional bias (see ). This kind of file is automatically unselected by uploading amino-acid sequences.
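The first kind of TSV file can be processed with a few lines of code. The Python sketch below reads the three columns described above and reports the most similar reference for each query. It assumes that the file has no header line (non-numeric similarity values are simply skipped); the function names and the demo data are made up for illustration.

```python
import csv
import io


def read_similarities(handle):
    """Yield (query, reference, similarity) tuples from a three-column TSV."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        query, reference, value = row
        try:
            yield query, reference, float(value)
        except ValueError:
            continue  # e.g. a header line, should the file contain one


def best_reference(rows):
    """Map each query to its most similar (reference, similarity) pair."""
    best = {}
    for query, reference, similarity in rows:
        if query not in best or similarity > best[query][1]:
            best[query] = (reference, similarity)
    return best


# Made-up demo data; with a real file use open("result.tsv", newline="").
demo = io.StringIO("q1\tr1\t98.5\nq1\tr2\t99.1\nq2\tr1\t97.0\n")
print(best_reference(read_similarities(demo)))
```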

Answer

You don't. If you omit the reference sequences, the calculation of pairwise similarities is simply skipped. Importantly, you can also upload more query sequences in that case. But unless you unselect the similarity calculations, reference sequences remain mandatory.

Answer

Close to the middle of the main text of the result e-mails sent around by this service, suggestions for phrasing the according sections in the methods as well as the results chapter are contained. You just have to format and arrange them according to the instructions for authors of the chosen journal. You might also need to rephrase them slightly to avoid being falsely detected by plagiarism scanners. Watch out for instructions enclosed in square brackets. These indicate sections whose content must frequently be adapted, too.

The result messages report the usual parameters either optimized as part of the model during a maximum-likelihood (ML) analysis or describing the final outcome of an ML or maximum-parsimony (MP) analysis. None of these numbers is specific to our service. The alpha parameter determines the shape of the GAMMA distribution and thus is part of the model. The highest log likelihood is the ML score of the best ML tree found (the higher the better). For details please consult the literature on ML phylogenetic inference. The best MP score is the one of the best MP tree found (the lower the better). Consistency and retention index are related to the proportion of homoplasies; 1 means no homoplasies at all but this hardly occurs in real-world data sets. For details please consult the literature on MP phylogenetic inference.
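For orientation, the usual textbook definitions of the two indices can be written down in a few lines. The Python sketch below shows these standard definitions only; it is not the code used by the service, and details such as whether parsimony-uninformative characters are excluded are not covered. All names are chosen for this example.

```python
def consistency_index(min_steps: int, observed_steps: int) -> float:
    """CI = minimum conceivable number of steps / steps observed on the tree."""
    return min_steps / observed_steps


def retention_index(min_steps: int, max_steps: int, observed_steps: int) -> float:
    """RI = (max - observed) / (max - min), with max the largest possible step count."""
    return (max_steps - observed_steps) / (max_steps - min_steps)


# Without homoplasies the observed number of steps equals the minimum,
# so both indices are 1.
print(consistency_index(10, 10), retention_index(10, 30, 10))  # 1.0 1.0
```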

Answer

The analysis protocol is contained in the result e-mails (after the preamble) and lists the notes, warnings and errors, if any, from all conducted steps. For the technical details see . Step 0 is special because it is optional; when no GenBank accession numbers are provided nothing is downloaded from GenBank. Further steps include checking the sequences, determining pairwise similarities, creating a multiple sequence alignment, inferring ML and MP trees, testing for a compositional bias, creating a tree file suitable for FigTree, and drawing the tree in a PDF file. Checking for reverse-complement sequences (of course!), determining pairwise similarities and testing for a base-frequency bias (of course!) are always skipped in the case of amino-acid sequences. Pairwise similarities can be unselected, as well as inferring the trees. The latter might make sense in some situations, because the time-limiting step is ML bootstrapping.

Answer

We have observed that many phylogenetic studies in papers on taxonomic classification, particularly in microbiology, are still based on under-complex models such as Jukes-Cantor (JC69) and venerable but outdated algorithms or programs such as neighbour joining or ClustalW. This issue (together with a tendency to over-estimate the reliability of branches that receive poor branch support) casts some doubt on certain taxonomic decisions. This service makes it easy to apply, in contrast, state-of-the-art alignment and phylogenetic inference software.

Answer

This depends not only on the number and length of the uploaded sequences but also on how much phylogenetic signal is in the data. The time-limiting step is the ML analysis, whose bootstrapping part will converge more quickly according to the bootstopping criterion when the data contain a strong phylogenetic signal. Moreover, the computation time also depends on the load of the server. The higher the load, the fewer threads will be allocated to the job, thus increasing the running time. See also the GGDC FAQ.

Answer

Models for phylogenetic inference are usually stationary, i.e. they assume fixed frequencies of the character states (either equal ones or the empirical frequencies, as in the case of this service). This can sometimes lead to artefacts, when sequences are grouped together simply because of similar character-state frequencies. A compositional bias of nucleotide sequences usually means deviating G+C contents, and these might yield an artefact by causing sequences that are not closely related but show a similar G+C content to be grouped together. However, a similar G+C content might as well be caused by a close relationship. The conducted test is simple and ignores phylogenetic structure; a failed test thus is not necessarily problematic. The reported G+C content values should be watched and checked for sequences with a similar value that were grouped together but should not belong together.
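To illustrate what a percent G+C content is, the following Python sketch computes it for a single sequence. It assumes that gaps and ambiguity codes are ignored; how the server treats such characters is not stated here, so its reported values may differ slightly. The function name is made up for this example.

```python
def gc_content(sequence: str) -> float:
    """Percent G+C among the unambiguous nucleotides A, C, G and T/U."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T") + seq.count("U")
    total = gc + at  # gaps and ambiguity codes are not counted (assumption)
    if total == 0:
        raise ValueError("sequence contains no unambiguous nucleotides")
    return 100.0 * gc / total


print(gc_content("ATGC"))  # 50.0
```

Sequences whose values are close to each other but far from the rest of the data set are the ones worth inspecting in the tree.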

Answer

Whether or not a DNA:DNA hybridization value between two strains should be determined for the discrimination of species in a taxonomic analysis depends on the similarity of the two underlying 16S rRNA gene sequences. In the proposal by Meier-Kolthoff et al. (2013), the long-standing 97% 16S threshold (Stackebrandt and Goebel, 1994) was increased by replacing it not only with a general threshold but also with phylum-specific thresholds. Here, it is important to note that these thresholds originate from a statistical model based on an empirical 16S data set from which similarities were calculated under distinct settings. In order to properly apply the suggested thresholds to other strains, 16S similarities between them must be calculated under exactly the same settings, which is what the server does. For instance, even though pairwise sequence alignment can be solved exactly, it can yield distinct results under distinct settings, and these in turn can affect the resulting similarities.

Answer

The error message about the empty or missing query (or reference) FASTA file indicates that the data downloaded from GenBank did not contain the sequences, if any, in an acceptable format (which can only partially be determined before attempting to download files). Usually this is due to wrong GenBank accession numbers, sometimes to accession numbers of master records, and seldom to downtimes of the GenBank servers.

The "invalid accession numbers encountered" warning points to wrong GenBank accession numbers that can be recognized as such before even trying a GenBank query. For instance, sometimes users paste sequences into fields reserved for accession numbers. This cannot work. Also note that the sequences are only searched for in the nucleotide and protein databases.

The "accession number count distinct between query and download" warning most likely means that either GenBank did not respond or that it did not recognize the accession number or an accession number of a master record was used. This service is devoted to single genes and deliberately does not attempt to resolve accessions of master records (such as those of whole genome shotgun sequencing projects) to individual accessions that yield sequence data.

To determine the cause of the error, first check whether what has been submitted to the server actually looks like GenBank accession numbers of nucleotide sequences. If so, try to apply the same accession number via the GenBank web interface. If this fails, the accession number is invalid or the GenBank server is down. If the attempt does not fail and the GenBank site indicates something like "this entry is the master record for a whole genome shotgun sequencing project and contains no sequence data", then the problem is that a master record has accidentally been queried for.

Only if the options listed above can be ruled out might the error be on the side of the GGDC phylogeny server. Please report a bug in that case and include the accession number(s) used in your message.
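Before submitting, you can check locally whether the content of an accession field at least looks like accession numbers, which catches the pasted-sequence mistake behind the "invalid accession numbers encountered" warning. The Python sketch below uses a deliberately simplified pattern that is an assumption made for this example; it is not the rule applied by the server, it does not handle accession ranges, and it cannot tell a master record from a record that contains sequence data.

```python
import re

# Simplified, assumed layout: 1-6 letters, an optional underscore (RefSeq
# style), 5-10 digits and an optional ".version" suffix.
ACCESSION = re.compile(r"[A-Za-z]{1,6}_?\d{5,10}(\.\d+)?")


def looks_like_accession(token: str) -> bool:
    """Return True if the token superficially resembles an accession number."""
    return ACCESSION.fullmatch(token.strip()) is not None


def suspicious_tokens(field_text: str) -> list:
    """Return all whitespace-separated tokens that fail the superficial check."""
    return [t for t in field_text.split() if not looks_like_accession(t)]


print(suspicious_tokens("AE000782 ACGTACGTTTGA"))  # ['ACGTACGTTTGA']
```

A token that passes this check can still be wrong, so the manual test via the GenBank web interface described above remains the decisive step.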

Answer

The server supports nucleotide (DNA or RNA) and amino-acid sequences. RNA sequences can be uploaded. They will then automatically be converted to DNA sequences. You cannot mix RNA and DNA sequences, however (you cannot mix DNA and amino-acid sequences either). Attempting to do so would result in an error when determining the sequence data type because the result would be ambiguous.
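The conversion and the mixing problem can be illustrated as follows. This Python sketch is an assumption-laden illustration, not the server's code: it only distinguishes RNA from DNA by the presence of U or T, it treats sequences containing neither as DNA, and it does not attempt to recognise amino-acid sequences. All function names are made up.

```python
def to_dna(sequence: str) -> str:
    """Convert an RNA sequence to DNA by replacing U with T (case preserved)."""
    return sequence.replace("U", "T").replace("u", "t")


def nucleotide_kind(sequence: str) -> str:
    """Classify a nucleotide sequence as "RNA" or "DNA" (simplified)."""
    seq = sequence.upper()
    if "U" in seq and "T" in seq:
        raise ValueError("sequence contains both U and T")
    return "RNA" if "U" in seq else "DNA"


def check_not_mixed(sequences) -> str:
    """Return the common kind or raise if RNA and DNA sequences are mixed."""
    kinds = {nucleotide_kind(s) for s in sequences}
    if len(kinds) != 1:
        raise ValueError("RNA and DNA sequences must not be mixed")
    return kinds.pop()


print(to_dna("AUGGCU"))  # ATGGCT
```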

Answer

When only query sequences but no reference sequences are provided and a query FASTA file containing already aligned sequences is uploaded, this alignment is recognized and used. Multiple sequence alignment conducted by the server itself is skipped in that case. Beware of uploading sequences that are actually unaligned but have been padded with gaps to obtain uniform lengths. Also note that dots are interpreted like dashes by the server because certain software packages use dots to indicate leading and trailing gaps. Some other programs use dots to indicate identity to the first sequence; this use of dots must be avoided when working with the GGDC phylogeny server.
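A quick local check can show how a file is likely to be interpreted. The Python sketch below replaces dots with dashes, as described above, and tests whether all sequences have the same length. That uniform length is the criterion by which the server recognizes an alignment is an assumption suggested by the warning about padded sequences; the function names are made up for this example.

```python
def normalize_gaps(sequence: str) -> str:
    """Interpret dots as gaps by replacing them with dashes."""
    return sequence.replace(".", "-")


def has_uniform_length(sequences) -> bool:
    """Return True if at least two sequences are given and all share one length."""
    lengths = {len(s) for s in sequences}
    return len(sequences) >= 2 and len(lengths) == 1


print(normalize_gaps("..ACG-T.."))  # --ACG-T--
```

Uniform length alone does not prove that the sequences are truly aligned, which is exactly why padded but unaligned sequences are problematic.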