All uploaded sequence data are deleted on the server within 24 hours after the calculations have been completed. The e-mail addresses of the users are stored only on the server and only during that time period. The e-mail addresses are not made available to third parties; they are only used directly on the server to calculate pseudonymised usage statistics.
The entire GGDC web site supports HTTPS, a protocol for secure communication over a computer network and widely used on the Internet. In practice, this provides a reasonable guarantee that one is communicating with precisely the website that one intended to communicate with (as opposed to an impostor), as well as ensuring that the contents of communications between the user and site cannot be read or forged by any third party.
Use of this form is free for academic purposes. For all other uses, please contact the authors.
General information on the FASTA format is available on Wikipedia. The most frequently observed causes of FASTA files rejected by the GGDC are listed below, together with suggestions for fixes.
|Cause of the problem||Recommended solution|
|Not a FASTA file but some other format.||Upload FASTA format instead of, e.g., GenBank flatfiles.|
|Data in FASTA format buried in binary file.||Store FASTA data as plain-text file, not within file formats such as Microsoft Word.|
|Protein sequences where nucleotide sequences were expected, or vice versa.||Upload protein sequences to services that can deal with protein sequences and nucleotide sequences to services that can deal with nucleotide sequences.|
|Forbidden character in FASTA file name or FASTA header or FASTA sequence.||Remove forbidden characters such as parentheses from FASTA file names and FASTA headers. Sequences must only contain characters interpretable as protein or nucleotide.|
|Gap characters (dashes) in FASTA files.||Remove gap characters before uploading FASTA files to services that do not expect gap characters.|
As is obvious from this listing, whether or not a service rejects a FASTA file may depend on the service. For instance, some services accept aligned sequences and, hence, gap characters in FASTA files. In any case, the problem should be specified in the error message displayed by the server. We ask all users to read this error message thoroughly.
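The causes listed in the table can be screened for locally before an upload. The following Python sketch is only an illustration: the function name `check_fasta`, the set of characters treated as safe in headers and the restriction to nucleotide data are assumptions made for this example, not the rules actually applied by the server.

```python
import re

# IUPAC nucleotide codes; anything else is reported as suspicious.
NUCLEOTIDE_CHARS = set("ACGTUNRYKMSWBDHV")
# Assumed set of "safe" header characters; the real server rules may differ.
SAFE_HEADER = re.compile(r"^[A-Za-z0-9_.|: \-]+$")


def check_fasta(text):
    """Return a list of problems found in nucleotide FASTA text.

    Line numbers in the messages count non-empty lines only.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        return ["not a FASTA file: first non-empty line does not start with '>'"]
    problems = []
    for number, line in enumerate(lines, start=1):
        if line.startswith(">"):
            if not SAFE_HEADER.match(line[1:]):
                problems.append(f"line {number}: empty header or unusual character in header")
        elif "-" in line:
            problems.append(f"line {number}: gap character in sequence")
        elif not set(line.upper()) <= NUCLEOTIDE_CHARS:
            problems.append(f"line {number}: non-nucleotide character in sequence")
    return problems
```

An empty list means that none of these particular problems was detected; it does not guarantee that a given service will accept the file.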
Across all submission forms, GenBank accession numbers can be specified in three different ways: either by using single accession numbers (separated by blanks), by providing a range of accession numbers (e.g. AE000782-AE000785), or a combination of both (e.g. AE000781 AE000782-AE000785). In either case all accession numbers (either single accession numbers or a range of accession numbers or both) which belong to a particular genome must be written in a single row within the respective text field.
In the following example, rows 1, 3, 4 and 6 denote separate entities (viruses in that case) each of which is represented by a single GenBank accession number. In row no. 2 a range of accession numbers is provided as indicated by the hyphen ('-'), whereas row no. 5 shows an example in which multiple accession numbers are explicitly provided (even though they form a range, too).
Rows 1 and 4 show versioned accession numbers. Version numbers can be provided but the GGDC strips version numbers to obtain the latest version of an accession number. Row 7 is a versioned master record accession number. To download it the version will be set to zero. Row 8 is a truncated versioned master record accession number (without the numeric suffix). Row 9 is just a master record accession prefix. To download them the version will be set to zero and a numeric suffix of six zeroes will be appended. This mostly works but not always. In case of failure, users are advised to search in GenBank for the correct number of zeroes (between six and eight). Please see below for more details on how master records are processed.
AF227250.1
HM246720-HM246724
FJ483838
FJ539134.1
FJ539135 FJ539136 FJ539137
FJ539132
NTFG01000000
ACWN01
ACSS
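The row syntax described above (single accession numbers separated by blanks, ranges marked by a hyphen, optional version numbers) can be sketched in Python as follows. The function `expand_row` and the regular expression are assumptions made for illustration; the check is purely syntactic, does not query GenBank, and does not handle bare master record prefixes such as ACSS.

```python
import re

# Letters (optionally followed by an underscore), digits, optional ".version".
ACCESSION = re.compile(r"^([A-Z]+_?)(\d+)(?:\.\d+)?$")


def expand_row(row):
    """Expand one row (one genome) into a list of unversioned accession numbers."""
    accessions = []
    for token in row.split():
        first, hyphen, last = token.partition("-")
        start = ACCESSION.match(first)
        if start is None:
            raise ValueError(f"not an accession number: {first!r}")
        prefix, digits = start.groups()
        if not hyphen:
            accessions.append(prefix + digits)  # version number is stripped
            continue
        stop = ACCESSION.match(last)
        # Both ends of a range must share the prefix and the number of digits.
        if stop is None or stop.group(1) != prefix or len(stop.group(2)) != len(digits):
            raise ValueError(f"garbled range: {token!r}")
        if int(stop.group(2)) < int(digits):
            raise ValueError(f"range runs backwards: {token!r}")
        for number in range(int(digits), int(stop.group(2)) + 1):
            accessions.append(f"{prefix}{number:0{len(digits)}d}")
    return accessions
```

For example, `expand_row("AE000781 AE000782-AE000785")` returns the five accession numbers AE000781 to AE000785, whereas a range whose two ends differ in length raises a `ValueError`.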
The following entries are not valid nucleotide accession numbers but other kinds of valid accession numbers (or assembly names). Whether they can be used by the server depends:
# may work -- is assembly accession number
GCA_000527135.1
# may work -- is assembly name
ASM52713v1
# may work -- is biosample accession number
SAMN04320709
# may work -- is bioproject accession number
PRJNA342529
Such kinds of accession numbers will work if they yield an unambiguous mapping to a set of nucleotide accession numbers, which will then be used for download. In particular, Bioproject entries may be linked to two to many genome sequences and if so cannot be used. The mapping was introduced because many users appeared to be unaware of the difference between nucleotide accession numbers and other kinds of accession numbers. The recommended way, however, is to use nucleotide accession numbers.
The following entries are either not (valid) accession numbers or are valid but unusable accession numbers:
# does not work -- two master record accession numbers in the same row (see below)
AEME00000000 NTFG00000000
# does not work -- garbled range (correct is ACSS01000001-ACSS01000063)
ACSS01000001-ACSS010000063
# does not work -- is RefSeq accession number without underscore (NC_005295 would work)
NC005295
# does not work -- is RefSeq accession number with underscore replaced by space (NC_005295 would work)
NC 005295
# does not work -- is strain ID in culture collection (ATCC)
ATCC700665
# does not work -- is fax number (other input containing only numbers may not work either)
00495312616418
# does not work -- is arbitrary name (Dewey and Louie would not work either)
Huey
RefSeq and GenBank accessions should not be mixed. If such a mixture occurs it is marked as an error even though the accession numbers are usable from a purely technical viewpoint. This restriction is a precaution against uploading distinct accession numbers for the same genome sequence.
Moreover, protein accession numbers (e.g., ADE57032.1) are unusable for services that conduct a nucleotide download (such as the GGDC), and vice versa. Take care not to accidentally replace a zero by a capital O (or vice versa). For instance, CP014611 is a nucleotide accession number whereas CPO14611 is a protein accession number.
The following entries will work but not as expected:
# does not work as expected -- 16S rRNA gene sequences from distinct organisms in same line
KC479803 AB257864
# does not work as expected -- genome sequences from distinct organisms in same line
CP002105 CP001859
In fact, results obtained in that manner are not interpretable. Accession numbers for the same operational taxonomic unit (OTU) must be placed in the same line but accession numbers for distinct OTUs must be placed in distinct lines. The server is supposed to deliver a warning message when it recognizes such a case but ultimately the user is responsible for correctly specifying the input data.
Please note that the download of sequences via GenBank master record accession numbers is treated specially. Such master records frequently link to distinct representations of the same assembled genome sequence and thus would cause the comparison of duplicated sequences unless care is taken to restrict the data. An example for such a master record can be found here. It links to two distinct representations of the same genome sequence, although only one of them should be used:
WGS         AGSE01000001-AGSE01000004
WGS_SCAFLD  KK583188
The GGDC prefers the first WGS entry in such cases. Of course the GGDC also reports the sequence data which were used in the calculations in "check_file.txt" (attached to the result e-mail). This file should always be consulted. As an alternative to master record accession numbers, users can explicitly specify ranges of accession numbers as shown above.
Accordingly, the following ways to specify WGS accession numbers are equivalent for the GGDC:
AGSE
AGSE00
AGSE01
AGSE00000000
AGSE01000000
AGSE01000001-AGSE01000004
AGSE01000001 AGSE01000002 AGSE01000003 AGSE01000004
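The equivalence of the master record spellings listed above follows from the normalisation described earlier: the version is set to zero and, where missing, a suffix of six zeroes is appended. A minimal Python sketch of that rule is given below. The function name `normalise_master_record` and the assumption of a four-letter prefix (as in all examples on this page) are illustrative only; the actual server-side processing may differ.

```python
import re

# Four-letter prefix, optional two-digit version, optional run of 6 to 8 zeroes.
MASTER = re.compile(r"^([A-Z]{4})(\d{2})?(0{6,8})?$")


def normalise_master_record(token):
    """Return the version-zero master record accession, or None if the
    token is not a (possibly truncated) master record accession."""
    match = MASTER.match(token)
    if match is None:
        return None
    prefix, _version, zeroes = match.groups()
    # The version is always reset to "00"; six zeroes are the default suffix.
    return prefix + "00" + (zeroes or "000000")
```

With this rule, AGSE, AGSE00, AGSE01, AGSE00000000 and AGSE01000000 all map to AGSE00000000, whereas an ordinary contig accession such as AGSE01000001 is not treated as a master record.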
Note that master record accession numbers will not be resolved when several accession numbers are provided for the same genome sequence, i.e. pasted into the same line of the form for specifying accession numbers. This is a precaution against mixing genome sequences.
Please use the following NCBI link and enter your nucleotide accession number in the box instead of "enter_your_accession_here":
For protein accession numbers, replace "nuccore" by "protein".
If NCBI reports "No items found" after (!) you searched with your chosen accession number, it is very likely that your accession number is not a valid GenBank accession number. Please do not omit "[accn]", as the GGDC only searches for accession numbers, while a less well specified search may result in false positives.
If you believe an accession number to be incorrectly rejected by the GGDC, please inform the authors.
Most of the result e-mails emitted by our services are delivered without any difficulties. But if a user really does not receive an expected e-mail with GGDC (or VICTOR or Phylogeny) results, this can have a variety of causes. The most common ones are listed below together with a suggestion for solving the problem.
|Cause of the problem||Recommended solution|
|Misspelled e-mail address.||Correctly spell your e-mail address.|
|GGDC e-mail caught in your spam folder.||Reconfigure your e-mail spam filter to not reject GGDC e-mails.|
|Your incoming e-mails require confirmation by the sender before being delivered to you (e.g., Boxbe waitlisting).||Free GGDC e-mails from this restriction (e.g., add to your Boxbe guest list or deactivate Boxbe altogether).|
||Submit fewer jobs per unit of time or reconfigure your e-mail server. We have observed this problem mainly with qq.com as host.|
|Your e-mail account has exceeded its quota.||Increase your quota or delete some e-mails.|
|The job is still running.||Wait for its completion. If you get impatient, contact the GGDC team (but note that if the job was really still running we would not be able to help you).|
In most of these cases you would need to run the GGDC (or VICTOR or Phylogeny) job again to obtain results. In other cases this is not needed (or would not even help) as, e.g., if a GGDC e-mail is still residing in your spam folder.
In empirical comparisons of GGDC with other digital DDH methods, GGDC yielded the highest correlations with traditional DDH, thus ensuring the highest consistency regarding the species-delimitation approach that currently dominates in microbial taxonomy, without sharing the disadvantages of traditional DDH. This is crucial because approaches such as ANI have solely been justified by their correlation with traditional DDH values, too. To the best of our knowledge, GGDC (2) is also the only replacement method for traditional DDH that provides confidence intervals. Moreover, the fact that GGDC delivers values on the same scale as traditional DDH (instead of, e.g., ANI values) makes it easy to compare GGDC results with wet-lab DDH values. Finally, as of December 2014 GGDC 2 conducts comparisons with subspecies boundaries, too, and in January 2016 G+C content calculations were also incorporated. A graphical overview of the advantages of GGDC over alternative methods is also available in this section of Hans-Peter Klenk's acceptance speech for the 2014 Bergey Award.
This is actually a loaded question, which, unfortunately, is occasionally still raised. The question is loaded because it presupposes a certain equivalence between traditional DDH and digital DDH that simply does not exist. More specifically, the question presupposes that the disadvantages of traditional DDH (which are the reasons for replacing the practice) are also disadvantages of digital DDH. This presumption is flawed, as can easily be demonstrated. The advantageous features of traditional DDH are (1) its use of the information from complete genomes, at least conceptually, and (2) its use of quantification. The disadvantageous features of traditional DDH are (3) that it is a tedious method that can only be conducted by a few specialized molecular laboratories and (4) that it does not work incrementally, i.e. the effort for one pairwise comparison does not yield anything of use for any other pairwise comparison.
It is obvious that digital DDH keeps (and even enhances) the advantageous features (1) and (2) but abandons the disadvantageous features (3) and (4). For this reason, the need to abandon traditional DDH is actually not an argument against but an argument for digital DDH. For instance, (3) does not hold for digital DDH because the limiting factor for digital DDH is genome sequencing, but once obtained a genome sequence can be used for many things in addition to digital DDH.
Sometimes the claim that one wants to get rid of traditional DDH is even put forward as an argument for preferring ANI over the GGDC and digital DDH. As such, the argument is an example of poor scholarship, if not outright foolish. Indeed, the argument completely overlooks that the justification for ANI ‐ regarding its use in general as well as regarding the ANI threshold for species delimitation ‐ was its high correlation with traditional DDH. This holds for the original ANI approach put forward by Goris, Konstantinidis and colleagues as well as the so-called ANIb and ANIm methods published by Richter and Rossello-Mora. However, using exactly the same criterion, digital DDH as calculated using the GGDC works better, because it yields an even higher correlation with traditional DDH. In addition to other advantages of the GGDC over ANI, this was demonstrated in our publications on the GGDC using empirical data sets larger than in previous publications.
Please see our 'Background' page for the publications that describe the algorithms and statistics used by the GGDC. By using this service you agree to cite at least one of the GGDC papers. There are additional references for subspecies delineation and G+C content interpretation.
Yes, if one uses formula 2 (which is the recommended formula anyway), one needs only about 20% of the genome to get the same result as with the full genome. The other two formulas will be severely affected by genome incompleteness. See also this FAQ entry, this FAQ entry and Auch et al. (2010).
Note, however, that this does not mean that arbitrarily short genome fragments can successfully be analysed with the GGDC. For instance, sometimes users unwisely upload 16S rRNA gene sequences to the GGDC. This is not expected to work.
GGDC 2 reports two types of confidence intervals (CIs) for specifying the uncertainty associated with the reported DDH estimates.
As with any statistical model, the one used for the prediction of DDH values from intergenomic distances has an inherent error of estimation, which can be assessed with the help of model-based CIs. Briefly, the CI for a given point estimate (i.e., fit) is calculated on the link scale of the GLM and the resulting bounds are then transformed using the inverse of the link function (e.g., according to: Zuur, A., et al. Mixed Effects Models and Extensions in Ecology with R. New York: Springer, 2009. Print.). By definition the resulting CI is asymmetric, i.e., the width of the upper part of the interval (as measured from the point estimate) can differ from the corresponding lower part. In the result e-mail the model-based CI is given in square brackets after the DDH estimate. Such uncertainty is, of course, also present in alternative approaches such as ANI, even though we are not aware of an ANI implementation that actually calculates CIs. But in our view, ANI thresholds supposed to be equivalent to 70% wet-lab DDH should better be provided with CIs, too. (The uncertainty in ANI implementations might actually be higher than the one inherent to GGDC because GGDC uses a larger empirical data set; see Meier-Kolthoff et al., 2013).
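The back-transformation described above explains why the interval is asymmetric. The following Python sketch illustrates the principle only. It assumes a logit link and uses made-up values for the linear predictor and its standard error; neither the link function nor the numbers are taken from the GGDC model, and the names `inverse_logit` and `back_transformed_ci` are hypothetical.

```python
import math


def inverse_logit(x):
    """Inverse of the logit link (an assumed link function for this sketch)."""
    return 1.0 / (1.0 + math.exp(-x))


def back_transformed_ci(eta, se, z=1.96):
    """Build a symmetric interval on the link scale and map it to percent.

    eta: point estimate on the link scale; se: its standard error.
    Returns (lower, fit, upper) on the response scale in percent.
    """
    lower = 100.0 * inverse_logit(eta - z * se)
    fit = 100.0 * inverse_logit(eta)
    upper = 100.0 * inverse_logit(eta + z * se)
    return lower, fit, upper


# Hypothetical values, chosen only to show the asymmetry.
lower, fit, upper = back_transformed_ci(eta=1.0, se=0.1)
print(f"{fit:.2f}% +{upper - fit:.2f}/-{fit - lower:.2f}")
```

Because the inverse link is non-linear, an interval that is symmetric on the link scale becomes asymmetric on the percent scale; for estimates above 50% under the assumed logit link, the upper part is narrower than the lower part.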
The intergenomic distances used for DDH prediction can themselves be resampled via a special bootstrap implementation (see Meier-Kolthoff et al., 2013). These so-called replicate distances deviate from the original distance to a certain extent and thus allow for the assessment of a 95% confidence interval on both DDH and GGD (Genome-to-Genome Distance) scales.
Since the model-based CIs are usually larger than those provided by bootstrapping, the latter are optional and not reported by default.
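How a 95% interval can be derived from a set of replicate distances is sketched below in Python. This shows only a simple percentile summary under the assumption that the replicates are already available; it does not reproduce the special bootstrap implementation referred to above, and the function name `percentile_interval` is hypothetical.

```python
import math


def percentile_interval(replicates, level=0.95):
    """Return (lower, upper) bounds covering the central `level` share of
    the replicate values, rounding outwards to the nearest observed value."""
    ordered = sorted(replicates)
    if not ordered:
        raise ValueError("no replicate values given")
    last = len(ordered) - 1
    lower_index = math.floor((1.0 - level) / 2.0 * last)
    upper_index = math.ceil((1.0 + level) / 2.0 * last)
    return ordered[lower_index], ordered[upper_index]
```

Rounding outwards makes the interval slightly conservative when the number of replicates is small. The same function could be applied to replicate DDH estimates instead of replicate distances.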
The calculation of bootstrap-based confidence intervals (C.I.) is optional (see previous FAQ item) but relatively compute-intensive. Hence, to avoid unnecessary load on the server, we disable the calculation of bootstrap-based C.I. if the number of genomes exceeds a certain threshold (the current threshold is 20). (No worries, model-based confidence intervals are always reported.)
The GGDC compares a query genome with a reference genome and calculates an intergenomic distance under three different distance formulae. An in-depth description of these can be found in the accompanying publications. The formulae support your decision about the relatedness of your novel strain to known (type) strains.
An exemplary result from the GGDC 2.1 would look like this:
Submission: 13-04-05-11-48-0015480
Program: NCBI-BLAST
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Escherichia_coli_IAI1 (query) vs. Escherichia_coli_IAI39 (reference):
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Formula: 1 (HSP length / total length)
Distance: 0.1646
DDH estimate (regression-based): 76.90
Estimate of DDH <=70% (threshold-based): no (threshold=0.2676)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Formula: 2 (identities / HSP length)
Distance: 0.0304
DDH estimate (regression-based): 77.06
Estimate of DDH <=70% (threshold-based): no (threshold=0.0412)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Formula: 3 (identities / total length)
Distance: 0.1899
DDH estimate (regression-based): 76.06
Estimate of DDH <=70% (threshold-based): no (threshold=0.2945)
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
E_coli_K12_W3110 (query) vs. E_coli_O1_K1_H7_DSM_30083 (reference):
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Formula: 1 (HSP length / total length)
Distance: 0.1614
DDH estimate (GLM-based): 74.20% +3.62/-3.98
Probability that DDH > 70% (i.e., same species): 83.25% (via logistic regression)
Probability that DDH > 79% (i.e., same subspecies): 43.32% (via logistic regression)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Formula: 2 (identities / HSP length) (RECOMMENDED)
Distance: 0.0292
DDH estimate (GLM-based): 75.20% +2.78/-3.01
Probability that DDH > 70% (i.e., same species): 85.97% (via logistic regression)
Probability that DDH > 79% (i.e., same subspecies): 38.55% (via logistic regression)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Formula: 3 (identities / total length)
Distance: 0.1859
DDH estimate (GLM-based): 77.10% +3.13/-3.46
Probability that DDH > 70% (i.e., same species): 92.14% (via logistic regression)
Probability that DDH > 79% (i.e., same subspecies): 46.06% (via logistic regression)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Difference in % G+C: 0.16 (interpretation: either distinct or same species)
This example also shows that a lower distance value with formula 2 than with formula 1 (or 3) might yield a lower DDH estimate with formula 2 than with formula 1 (or 3). This is simply caused by the three distance formulae operating on three different scales. Distance values can only be compared if they have been obtained with the same formula. If so, lower distances will of course always yield a higher DDH estimate. The DDH estimates are, of course, all on the same scale.
The interpretation of this example is as follows. All three formulae confirm the hypothesis that the two genomes belong to the same species (note the confidence intervals). Because the reference genome is from the type strain, this confirms that the query genome is from an Escherichia coli strain, too. All three formulae also yield a dDDH estimate smaller than 79%, which indicates that the two strains belong to distinct subspecies. This is significant for formula 2, which is the recommended one, but not for formula 3. More details on the subspecies of E. coli are found elsewhere. For the interpretation of the G+C content, see this FAQ entry and this FAQ entry.
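The formula labels in the example output above name three ratios. Why the three distances lie on different scales can be illustrated with the Python sketch below. It assumes that each distance is simply one minus the named ratio. This assumption is consistent with the reported values, since (1 - 0.1614) × (1 - 0.0292) ≈ 1 - 0.1859, but the exact definitions are those given in the accompanying publications. The function name and the input numbers are hypothetical.

```python
import math


def ratio_distances(hsp_length, identities, total_length):
    """Return distances for formulae 1, 2 and 3 as one minus the ratio
    named in the result output (an assumption made for this sketch)."""

    def one_minus(numerator, denominator):
        # A zero denominator (e.g. no HSPs at all) has no defined ratio.
        if denominator == 0:
            return math.nan
        return 1.0 - numerator / denominator

    return (
        one_minus(hsp_length, total_length),   # formula 1
        one_minus(identities, hsp_length),     # formula 2
        one_minus(identities, total_length),   # formula 3
    )


# Hypothetical alignment summary: 800 of 1000 positions are covered by
# HSPs, and 760 of the covered positions are identical.
d1, d2, d3 = ratio_distances(hsp_length=800, identities=760, total_length=1000)
```

In this sketch formula 2 depends only on the aligned part of the genomes, which is why its values are much smaller than those of formulae 1 and 3 and cannot be compared with them directly.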
Processing of your submission may take several minutes depending on the current workload of the server, genome sizes and, of course, the total number of comparisons. Using GenBank accession numbers (such as AE000782) is slightly slower than uploading FASTA files directly because the genomes need to be downloaded from GenBank first. Here, the GGDC depends on the accessibility of GenBank; in rush hours when GenBank is accessed by numerous users the download speed might be somewhat reduced. Moreover, choosing BLAT as local alignment tool, in conjunction with FASTA files that consist of many genome parts, may be extremely slow; use the default settings instead (BLAST+). After the job has finished, an e-mail containing the results will be sent to the user-provided address.
As a rule of thumb a comparison between two E. coli genomes takes about 1 minute, whereas the same comparison with optional bootstrap replicate-based confidence intervals takes almost 5 minutes (note: calculating bootstrap-based confidence intervals is not necessary because the DDH prediction model itself already provides confidence intervals). However, if you provide significantly larger genomes, the running time might take longer.
This service is designed for small and middle-sized datasets with genome sequences of at most 15 MB in size. This limitation should be sufficient for all currently sequenced prokaryotic genomes. For example, the largest prokaryotic genomes sequenced to date have approximately 13 Mbp (as in the case of Ktedonobacter racemifer). Whereas up to 50 reference genomes are allowed, there is an additional overall size limit of c. 450 MB, which we attempt to lift in the future. It only affects FASTA upload but not the use of GenBank accession numbers (such as AE000782). If you intend to use GGDC for larger data sizes, please contact the authors.
Distance calculation frequently fails with wrong input data, i.e. if neither complete genomes nor draft genomes are provided. The upload of single-gene (e.g., 16S rRNA gene) nucleotide or amino-acid sequences is strongly discouraged, as GGDC is not devised for this kind of data (even though the underlying GBDP method can be applied to such data). This often results in a distance value that deliberately indicates an error, such as 'NaN' (i.e., 'not a number').
Another frequent kind of failure is caused by wrong GenBank accession numbers, as indicated in the GGDC result e-mails. Note that GenBank accession numbers of amino-acid sequences also yield a GenBank download error. Suitable GenBank accession numbers such as AE000782 refer to a nucleotide genome sequence (additional protein annotation causes no harm). GenBank assembly accession numbers are not supported. See the general FAQ for more details.
In rare cases we also observed that alignment programs such as BLAST+ might fail during the alignment of genomes containing an extremely high number of contigs, thus producing no hits. Here, the GGDC would also produce 'NaN' as a result to indicate an erroneous run.
GGDC v1.0 is intended for comparing assembled whole-genome sequences comprising one to several chromosomes, extrachromosomal replicons or scaffolds. It is not intended for dealing with, e.g., multi-FASTA files separately containing the nucleotide sequences of all genes. The GGDC v1.0 service cannot handle multi-FASTA files with a large number of FASTA entries (say, more than 500 sequences), since the multi-FASTA files are split to FASTA files each containing one single entry before processing. The GGDC v2.0 supports any multi-FASTA format. Its DDH prediction model has been accordingly adapted (and has many other advantages over GGDC v1.0 anyway).
One of them is a CSV (comma-separated values) file containing all results in a format readable by spreadsheet programs such as LibreOffice, OpenOffice or Excel (set the column separator to a comma).
Another one is the text file "check_file.txt" that can be used for checking whether or not the appropriate datasets were compared. It contains not only the list of compared genomes but also the list of all genome parts (scaffolds and/or replicons) used by GGDC for this submission. It is recommended that this file is consulted before conclusions are drawn from the GGDC outcome. As of GGDC 2.1, the check file is in valid YAML format and could automatically be parsed with a script.
In the past, if genome names had not been explicitly specified during the submission process, the name "genome_1" would have been used as placeholder to refer to the query organism, whereas "genome_2" to "genome_11" would have referred to the (up to ten) reference organisms in the order they had been listed in the submission form. Now, the GGDC directly uses the FASTA file names as provided by the user, i.e., there is no need for a tedious manual specification of genome labels anymore.
'Infinity' and 'NaN' mean that no matches have been found between the two genomes (given the local-alignment settings), hence the respective distance formula yields anomalous results (i.e., a division by zero occurs). In such cases you can be sure that the DDH estimate is virtually 0% unless the genome sequences could not even be downloaded.
'NaN' can also result from the upload of incorrect sequences. For instance, sometimes users unwisely upload 16S rRNA gene sequences to the GGDC. This is not expected to work.
If you have provided invalid GenBank accession IDs, the GGDC will also report 'NaN' as a result (rarely a negative number), because a genome comparison is not possible in such a case for obvious reasons. See this FAQ entry for why this can happen.
If you have specified an accession number as input data, our server will try to download the corresponding sequences from NCBI. However, if the accession number is either invalid or does not contain sequence data at all, the GGDC will report this incident via the above message. In such a case make sure that all your GenBank accessions are valid and actually provide sequence information. There are various ways to specify GenBank accessions as detailed in the General FAQ.
In many cases the three distinct formulae give almost identical results regarding DDH. For instance, for two E. coli genomes used as test data, with BLAST+ we obtain 74.80% ± 3.80, 76.70% ± 2.87 and 77.80% ± 3.28, respectively. But for other genomes the results can be distinct, and sometimes even different regarding the 70% boundary. This is not caused by an error in the calculations but by the three distance formulae exploring distinct aspects of genome evolution.
All models for estimating DDH from intergenomic distances have been inferred separately, and all formulae yielded very high correlations. Whereas the test dataset used was larger than in any previous studies, we were unable to prefer a certain distance formula on the basis of the model-building results alone when using the test dataset. This does not mean, however, that other criteria would not yield stronger differences between the formulae. For instance, formula 2 is the only one that can be used with incompletely sequenced genomes. But there are other reasons why we recommend formula 2.
From a biological point of view, if all genomes have been sequenced completely (have "Finished" status), or almost so, but formula 1 yields much higher DDH similarities than formula 2, this indicates that the two strains changed comparatively little in gene content but comparatively strongly regarding the gene sequences. Depending on the kind of organisms and on the selection pressures under which they evolved, this might not be an unreasonable scenario, even though many strains within the same species differ considerably in their gene content. For instance, two strains might differ more or less only in the presence of a plasmid (see FAQ entry). This alone might be a reason for preferring formula 2, but the other two formulae potentially provide valuable additional biological information and hence are included in the results.
It is often crucial to not only consider what the conventions are but also what the rationale is behind these conventions. The 70% DDH rule has the advantage of making species quantitatively comparable. Whereas this alone could be achieved with any threshold, taxonomic conservatism dictates that the well-known, predefined threshold should be maintained wherever possible. The GGDC allows for keeping this threshold while at the same time moving to modern, genome-sequence based and highly reliable methods.
Bacterial subspecies were, indeed, traditionally not determined based on a distance or similarity threshold, but on a qualitative assessment of usually quite few phenotypic characters. Moreover, compared to the number of validly published species names, not many subspecies names have a standing in nomenclature. So taxonomic conservatism would not be significantly violated by the introduction of a quantitative threshold for subspecies delineation. (This holds even though switching the criterion implies that one cannot expect the new method to frequently confirm existing subspecies boundaries.) Moreover, we believe that the high resolution provided by genome sequences now calls for introducing methods that make microbial subspecies quantitatively comparable, too.
Given the other advantages of the GGDC, it thus makes most sense to establish a dDDH threshold for subspecies delineation. But which one? Because of the low number of validly published subspecies names, and due to the fact that they are not expected to have any quantitative consistency regarding traditional DDH, it makes not much sense to attempt to estimate a boundary from the currently existing microbial subspecies. Rather, clustering consistency was the main criterion in our approach to determine a dDDH threshold for subspecies. It suggested a value of c. 79% dDDH.
To the best of our knowledge a threshold of traditional DDH similarities has never been established for genera. Hence there is no GGDC threshold for genus boundaries either. We fear it would not make much sense to introduce one. One of the reasons for our scepticism is saturation. That is, the lower DDH similarities get, the worse might be their representation of phylogenetic relationships. Moreover, the larger the overall phylogenetic distances, the higher the chance of getting less ultrametric data. Non-ultrametricity is a general phenomenon not directly related to traditional or digital DDH. Not caring about ultrametricity at all is taxonomically naive, as is evident from probably every textbook on phylogenetics. If any data deviate too strongly from ultrametricity, applying pairwise distance or similarity thresholds to them for delineating taxa is frivolous. For species and subspecies boundaries, it has been shown that non-ultrametricity is not normally a problem for the GGDC (we are not aware that alternative approaches such as ANI have even been assessed in this respect). However, delineating genera using pairwise distances or similarities could suffer more strongly from non-ultrametricity.
Indeed, extrachromosomal DNA is present in many bacteria and thus is expected to impact traditional DDH experiments as well as digital DDH calculations. But if bias due to extrachromosomal DNA really was a problem, traditional DDH should simply never have become the gold standard in bacterial species delimitation. GGDC as well as methods such as ANI use entire genomes and do not discriminate between chromosomal and extrachromosomal DNA. Note that changes in the assignment of genes into replicons do not necessarily indicate significant differences in gene presence or absence, let alone in gene sequences. But only those would significantly affect DDH. For this reason, users concerned about distortion by extrachromosomal elements should just use formula 2, which is the recommended formula anyway. It is quite unaffected by changes in gene content; for details see this FAQ entry. All three formulas are expected to be independent from changes in gene order (Henz et al. 2005).
The same reasoning holds for horizontal gene transfer (HGT). All bacterial genomes are to some degree affected by HGT, hence all whole-genome methods (including traditional DDH, digital DDH and ANI) are to some degree affected by HGT. HGT can influence both the gene content (by adding genes homologs of which were not present before to a genome) as well as the similarity between homologous genes found in distinct genomes (by adding genes homologs of which were already present to a genome). Like the main ANI implementations, GBDP formula 2 is quite unaffected by gene content. Regarding gene similarity, all GBDP methods have the advantage of conducting an on-the-fly correction for paralogy. This means that in the case of overlapping hits the better one is preferred. As hits to xenologous genes are unlikely to be better than hits to orthologous genes, particularly in the case of closely related genomes, this kind of correction is likely to temper the impact of HGT, too. In contrast, ANI has no correction for paralogy and thus is more likely to be affected by HGT than GBDP.
In contrast to DDH dissimilarities, differences in percent genomic G+C content between distinct species can be quite close to zero. They just cannot be larger than 1 within the same species (Meier-Kolthoff et al. 2014). Thus when DDH indicates same species, a percent G+C content difference > 1 is not normally possible (and should be reported to the GGDC staff), whereas a percent G+C content difference <= 1 confirms the DDH result. When, in contrast, DDH indicates distinct species, a percent G+C content difference > 1 confirms this, whereas a percent G+C content difference <= 1 does not say anything. Note that within-species differences in percent G+C content > 1 reported in the older literature are due to artefacts of the applied methods; genome sequencing is expected to be far more exact regarding the G+C content.
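The decision logic described above can be written down as a small function. This is only a sketch: the function name and the returned labels are made up for illustration and are not part of the GGDC output.

```python
def interpret_gc_difference(ddh_same_species: bool, gc_diff_percent: float) -> str:
    """Combine a DDH-based species call with the absolute difference
    in percent genomic G+C content between the two genomes."""
    if gc_diff_percent < 0:
        raise ValueError("pass the absolute difference in percent G+C content")
    if ddh_same_species:
        # Within a species the difference should not exceed 1.
        return "conflict" if gc_diff_percent > 1.0 else "confirmed"
    # Distinct species: a large difference confirms the DDH result,
    # a small one carries no information.
    return "confirmed" if gc_diff_percent > 1.0 else "uninformative"


print(interpret_gc_difference(True, 0.3))   # confirmed
print(interpret_gc_difference(True, 1.5))   # conflict
print(interpret_gc_difference(False, 0.3))  # uninformative
```

A "conflict" outcome corresponds to the situation that should be reported to the GGDC staff.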
Exact G+C content values inferred from the genome sequences are included in the check file attached to each GGDC result message (see FAQ entry). The values for entire genome sequences should be included in publications on these genomes particularly if taxonomic conclusions are drawn from them; see this FAQ entry for details. The G+C content values for individual genome parts (such as scaffolds, contigs, chromosomes or extrachromosomal replicons) should also be checked because strong deviations between them might indicate contaminations or assembly artefacts.
Estimating dDDH similarities is not actually needed to obtain exact genomic G+C content values. If you are only interested in them, just arbitrarily choose one of your genome sequences as query and the others as references. The check file lists the G+C content values for all of them. If you have only a single genome, simply compare it to itself.
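For readers who want to double-check the values in the check file locally, the following minimal sketch computes the percent G+C content for each part of a genome (e.g. each contig) and for the genome as a whole. It is not the server's implementation; in particular, ignoring ambiguous bases is an assumption made here, and the helper names and the toy FASTA content are hypothetical.

```python
def parse_fasta(text):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def gc_percent(sequence):
    """Percent G+C among unambiguous bases (A, C, G, T); other symbols are ignored."""
    sequence = sequence.upper()
    gc = sequence.count("G") + sequence.count("C")
    at = sequence.count("A") + sequence.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * gc / (gc + at)


example = ">contig1\nGGCC\nAATT\n>contig2\nGGGC\n"
records = parse_fasta(example)
per_part = {name: gc_percent(seq) for name, seq in records.items()}
overall = gc_percent("".join(records.values()))
print(per_part)  # {'contig1': 50.0, 'contig2': 100.0}
```

Strong deviations between the per-part values, as in this toy example, are the kind of signal that might point to contamination or assembly artefacts.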
It is correct that the expected sequence identity of two random nucleotide sequences is 25%. However, this holds only if the sequences can be globally aligned without gaps. In contrast, genomes evolve not only via substitutions, insertions and deletions within genes, but also via gains, losses and rearrangements of entire genes. Thus the 25% boundary has no direct meaning for entire genomes. Moreover, digital DDH starts by determining local alignments between two genomes. These local alignments would not normally be found by programs such as BLAST (or filtered out later on due to low quality) if the two genomes had a random relationship throughout. Their intergenomic distance would be maximum and their digital DDH similarity would be 0 (or close to 0, depending on the model). For this reason, intergenomic distances calculated by GBDP, and dDDH values derived from them, are meaningful throughout their range and should literally be reported. One must only keep in mind that some kind of saturation occurs, as very low real identities cannot be distinguished from each other. Even in the case of data for which the 25% identity boundary was meaningful, 25% dDDH would correspond to far more than 25% identity.
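The 25% figure itself follows from simple probability: if two ungapped sequences are random and independent, the chance that a given position matches is the sum of the squared base frequencies. The sketch below assumes equal base frequencies; the function name is chosen for illustration only.

```python
def expected_random_identity(frequencies):
    """Probability that two independently drawn characters are identical,
    given (possibly unnormalised) character frequencies."""
    total = sum(frequencies.values())
    return sum((count / total) ** 2 for count in frequencies.values())


uniform = {"A": 1, "C": 1, "G": 1, "T": 1}
print(expected_random_identity(uniform))  # 0.25
```

This calculation applies to gap-free global alignments only, which is exactly why it does not carry over to whole-genome comparisons based on local alignments.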
This can only happen when using GenBank sequence accession numbers (such as AE000782). The cause of this error is low availability of the GenBank servers. This might be due to maintenance downtime or high demand. Unless the check file attached to the GGDC message indicates that the data used are complete anyway, you should definitely try the GGDC server again at a later time in conjunction with GenBank.
The GGDC 1.0 was superseded by the GGDC 2.1, which is an updated and enhanced version of the previous GGDC 1 and incorporates improved DDH-prediction models and additional features such as confidence-interval estimation. To the best of our knowledge, it is the only digital DDH method that provides this feature. Of all genome-based methods we are aware of, GGDC 2 yields the highest correspondence to traditional DDH (without sharing its drawbacks). Details are described in our BMC Bioinformatics study.
The freely available GGDC 2.1 implements the latest version of the Genome BLAST Distance Phylogeny (GBDP) method as published in Meier-Kolthoff et al. (2013). Even though a legacy version of GBDP is available here, it lacks important features (e.g., calculation of pseudo-bootstrapping replicates, prediction of confidence intervals etc.). That said, we are planning to release the latest version of GBDP in the course of 2019 but there is still a little bit of work left (e.g., writing a proper user manual, code documentation, publication of an application note etc.). Once the standalone version is available, we will announce it on this website. Until then, if you require larger analyses (phylogenomic analyses as well as (sub-)species delimitation via digital DDH) that do not fit into the scope of the web service, please let us know.
"VICTOR" stands for "Virus Classification and Tree Building Online Resource".
Use VICTOR if you want to infer phylogenies from the genome or proteome sequences of (prokaryotic and potentially other) viruses and/or obtain estimates for taxon boundaries at distinct ranks.
All relevant citations are listed in the result e-mails sent around by this service. The main VICTOR publication has been published in Bioinformatics.
Close to the middle of the main text of the result e-mails sent around by this service, suggestions for phrasing the according sections in the methods as well as the results chapter are contained. You just have to format and arrange them according to the instructions for authors of the chosen journal. You might also need to rephrase them slightly to avoid being falsely detected by plagiarism scanners. Watch out for instructions enclosed in square brackets. These indicate sections whose content must frequently be adapted, too.
Based on the results of the VICTOR service, users can make an informed decision on the evolutionary relationships between prokaryotic viruses. The method was thoroughly optimized against a large reference dataset of genome-sequenced taxa recognized by the International Committee on Taxonomy of Viruses (ICTV) and showed a high agreement with the classification, particularly at the species and genus level. See the VICTOR references for details.
Technically it should not be a problem to apply VICTOR to the genomes or proteomes of other kinds of viruses. Phylogenetically and regarding the estimates for taxon boundaries, VICTOR might even work well for them, too. VICTOR has just not yet been tested in this respect.
Once your VICTOR submission has received a free computation slot on the server, the estimated running time of your job is expected to be as shown below. If you want to check whether or not there are still free slots, you can check the payload progress bar at the end of the VICTOR submission page.
If sufficient server resources are available, VICTOR switches to a fast track mode, thus reducing the overall running time of your submission by a factor of about 4.
Analysis is either at the genome or proteome level; you cannot mix them. Incomplete genomes can be analysed but then other distance formulas must be preferred.
At least four usable genomes or proteomes must be uploaded, otherwise phylogenies cannot be inferred.
A length check ensures that genomes of cellular organisms are not processed by VICTOR. If you think this length check hinders you analysing viruses with VICTOR, please contact the authors.
VICTOR delivers e-mails which contain the results from applying distinct distance formulas in otherwise identical GBDP runs. This means one tree per formula and one set of clustering results per formula. The VICTOR study indicates that formula d6 should be preferred when amino-acid sequences of prokaryotic viruses are analysed — unless incomplete proteome sequences are contained in the data set. In that case d4 is the formula of choice. The VICTOR study also indicates that formula d0 should be preferred when nucleotide sequences of prokaryotic viruses are analysed — unless incomplete genome sequences are contained in the data set. In that case d4 is again the formula of choice.
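The recommendations above amount to a small decision rule, which can be sketched as follows. The function and argument names are hypothetical; VICTOR itself always reports results for all formulas, so this only helps in picking which result to use.

```python
def recommended_formula(sequence_type: str, contains_incomplete: bool) -> str:
    """Return the distance formula recommended for prokaryotic viruses.

    sequence_type: "nucleotide" or "amino-acid"
    contains_incomplete: True if the data set contains incomplete
        genome or proteome sequences
    """
    if sequence_type not in ("nucleotide", "amino-acid"):
        raise ValueError("sequence_type must be 'nucleotide' or 'amino-acid'")
    if contains_incomplete:
        return "d4"
    return "d0" if sequence_type == "nucleotide" else "d6"


print(recommended_formula("amino-acid", False))  # d6
print(recommended_formula("nucleotide", True))   # d4
```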
All distances are calculated from matches (local alignments) between two genome or two proteome sequences. In BLAST jargon, these matches are known as HSPs (high-scoring segment pairs). The meaning of the three distance formulas is as follows:
The branch lengths of the resulting VICTOR trees are scaled in terms of these distance formulas.
The GGDC uses the same formulas but for historical reasons it applies a different terminology. GGDC formula 1 is VICTOR d0, GGDC formula 2 is VICTOR d4, and GGDC formula 3 is VICTOR d6. Note that the distance calculations are done after certain corrections for paralogy have been applied. Details are provided in the GGDC and GBDP literature.
The files attached by the service use the following standardized file extensions:
Marks for the cluster affiliations at the species (S), genus (G) and family (F) level contained in the tip labels of the phylogenetic trees are found after an "@" sign.
Sometimes, it is falsely assumed that average support is of no value when it comes to the evaluation and comparison of phylogenies. Indeed, average support values are acceptable. Consider a case where clade A is supported by 100% and clade B by 0% in tree 1 but clade A is supported by 50% and clade B by 0% in tree 2. Then, obviously, overall support is higher in tree 1 than in tree 2. This is what we want to know, and it is well indicated by the average support values of the two clades (50% in tree 1 but only 25% in tree 2).
Hence, for obvious reasons, a tree with on average higher support values is on average better resolved. Of course this does not tell the scientist anything about the support for single clades (unless you obtain 0% or 100% average support, of course).
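The arithmetic of the two-tree example can be reproduced in a few lines. The dictionaries simply restate the support values given above; the function name is chosen for illustration.

```python
def average_support(support_by_clade):
    """Mean branch support (in percent) over all clades of a tree."""
    values = list(support_by_clade.values())
    if not values:
        raise ValueError("at least one clade is required")
    return sum(values) / len(values)


tree1 = {"A": 100, "B": 0}
tree2 = {"A": 50, "B": 0}
print(average_support(tree1))  # 50.0
print(average_support(tree2))  # 25.0
```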
The rationale behind VICTOR and its use of the OPTSIL program (Göker et al. 2009) are described in our according Bioinformatics study. However, briefly, OPTSIL is used to determine the boundaries of species, genera and (sub-)families using optimized thresholds as gained from an in-depth statistical analysis, which was based on a large ICTV reference dataset. In each VICTOR run, the species, genus and (sub-)family clusters are inferred de novo (depending on the underlying data, of course) and then reported by VICTOR. For example, the clusters are shown as part of the trees' tip labels and, together with the underlying phylogeny, they allow for the identification of novel taxa or might support a reclassification of existing taxa. But these applications depend entirely on the scientific question in mind.
In case you are not sure whether or not your specific question can be addressed using the VICTOR approach, please let us know.
In general, the branch lengths of the resulting VICTOR trees are scaled in terms of the respective distance formula used (VICTOR reports trees based on formulae d0, d4 and d6). For example, if you have conducted a nucleotide-based VICTOR run and chose to use the recommended formula d0, you can write:
The scientific justification as to why VICTOR indeed represents a universal method, especially in comparison to other approaches, is thoroughly explained in the according Bioinformatics study. Briefly, VICTOR makes use of the Genome BLAST Distance Phylogeny (GBDP) method, which calculates accurate intergenomic distances between pairs of viruses. Resulting distance matrices are then used as input for distance-based phylogenetic inference methods. Even though this approach does not make an a priori assumption about the availability of certain marker genes, the phylogenetic signal will of course improve the more homologies are shared across the underlying virus dataset.
In general, the VICTOR approach is a whole-genome phylogenetic method that has been optimized so as to minimize the number of conflicts with the ICTV classification (Meier-Kolthoff and Göker 2017). As VICTOR is conservative, remaining well-supported discrepancies between VICTOR results and the ICTV classification indeed indicate that the classification should be revised.
For phylogeny reconstruction, this service combines state-of-the-art software for multiple sequence alignment, maximum likelihood (ML) and maximum parsimony (MP) analysis. Nucleotide data are optionally downloaded from GenBank and always automatically checked for reverse-complement sequences and duplicated labels. In the case of amino-acid data, the optimal model for ML is automatically determined (for nucleotide data, we believe GTR to be alright, as does the author of RAxML). The pipeline is thus ideally suited for moderately sized single-gene data sets as used, e.g., in the description of new bacterial or other species.
Uploaded RNA sequences are automatically converted to DNA sequences. When long sequences such as genome sequences are encountered, an attempt is made to extract 16S rRNA gene sequences from each genome sequence. Thus the user need not normally care about this step. (Extraction is expected to succeed for Bacteria and Archaea unless a genome sequence is incomplete and does not contain a 16S rRNA gene sequence.)
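Converting RNA to DNA notation is a plain character substitution (U becomes T). The following sketch shows the idea; it is an illustration, not the server's code, and preserving lower case is a choice made here.

```python
def rna_to_dna(sequence: str) -> str:
    """Replace uracil by thymine, preserving the case of each character."""
    return sequence.translate(str.maketrans("Uu", "Tt"))


print(rna_to_dna("AUGGCUuaa"))  # ATGGCTtaa
```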
Moreover, optionally pairwise nucleotide similarities are calculated. The method for calculating these similarities exactly corresponds to the one used in a study for defining 16S rRNA gene similarity thresholds to determine whether or not a DDH reaction was mandatory for deciding whether or not two strains should be assigned to the same or to distinct species. These thresholds are available for specific user-chosen error ratios as well as with phylum-specific values.
In contrast to many other phylogeny tools which truncate sequence labels or replace characters within them, here the original labels are provided in the output. The only modifications made are trimming whitespace from their ends and replacing consecutive runs of whitespace characters with a single space.
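The two label modifications mentioned above correspond to the following operation, shown here as a minimal sketch with a hypothetical function name.

```python
import re


def normalize_label(label: str) -> str:
    """Trim whitespace from both ends and collapse each run of
    whitespace characters into a single space."""
    return re.sub(r"\s+", " ", label.strip())


print(normalize_label("  Escherichia   coli\tK-12 "))  # Escherichia coli K-12
```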
Finally, the result e-mails already include publication-ready text describing all methods used in the pipeline, the results, and the according literature references.
All necessary references are listed at the bottom of the main text of result e-mails sent around by this service. You just have to format and arrange them according to the instructions for authors of the chosen journal. However, if you uploaded unannotated genome sequences from which the server was able to extract 16S rRNA gene sequences you should additionally cite barrnap.
The files attached by the service use the following standardized file extensions:
|fas||Multiple sequence alignment in FASTA format, original labels restored.|
|pdf||PDF file depicting the midpoint-rooted phylogenetic tree. This is intended to look well but it is not necessarily a publication-ready figure.|
|phy||Phylogenetic tree in Newick format, labels cleaned. Use it if your software cannot read the NEXUS-formatted tree.|
|tre||Phylogenetic tree in NEXUS format, original labels restored and protected.|
|tsv||Tabulator-separated file containing either the percent nucleotide similarities between each query sequence and all reference sequences or the G+C content of the sequences.|
Each of the file types might be missing, as the according step of the analysis might not have been requested, or an error might have occurred. The main text of the result e-mail contains an analysis protocol with a detailed list of requested, skipped, successful and unsuccessful steps.
The phylogenetic tree in NEXUS format is ideally suited for viewing and manipulating it with FigTree but should be compatible with all NEXUS-compliant tree viewers. The tree description itself is unrooted (ML and MP yield unrooted trees!), but the FigTree block contains instructions for midpoint-rooting, as described in the e-mail text. You can re-root the tree when necessary using, e.g., an outgroup contained in the data set, but then you should explicitly explain how the rooting was conducted.
For tree viewers which do not understand NEXUS (and do not understand the proper Newick format, which allows for protecting any label with single quotes), the tree in Newick format can be used. Special characters within labels are replaced in that file. This tree is unrooted, as ML and MP yield unrooted trees. You should re-root it using, e.g., an outgroup contained in the data set, and explicitly explain how rooting was conducted.
Two kinds of TSV files can be produced. The first kind contains three columns per line providing (1) the name of a query sequence, (2) the name of a reference sequence and (3) the pairwise similarity between them. This kind of file can be unselected by choosing to infer phylogenies only. The second kind of TSV file contains two columns per line providing (1) the name of a sequence and (2) its percent G+C content. This is useful for assessing phylogenetic distortion due to a compositional bias (see FAQ entry). This kind of file is automatically unselected by uploading amino-acid sequences.
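The first kind of TSV file can be read with standard tools. The sketch below uses Python's csv module and assumes that the file has no header line and that the third column is a plain number; both are assumptions, as are the example labels.

```python
import csv
import io


def read_similarities(handle):
    """Read (query, reference, similarity) triples from a tab-separated file."""
    rows = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip empty lines
        query, reference, similarity = row
        rows.append((query, reference, float(similarity)))
    return rows


# Toy content standing in for an attached TSV file.
example = "query1\trefA\t98.7\nquery1\trefB\t95.2\n"
rows = read_similarities(io.StringIO(example))
best = max(rows, key=lambda row: row[2])
print(best)  # ('query1', 'refA', 98.7)
```

For a real attachment, open the file with `open(path, newline="")` and pass the handle to the function instead of the `io.StringIO` object.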
You don't. If you omit the reference sequences, the calculation of pairwise similarities is simply skipped. Importantly, you can also upload more query sequences in that case. But unless you unselect the similarity calculations, reference sequences remain mandatory.
Close to the middle of the main text of the result e-mails sent around by this service, suggestions for phrasing the according sections in the methods as well as the results chapter are contained. You just have to format and arrange them according to the instructions for authors of the chosen journal. You might also need to slightly rephrase them to avoid being falsely detected by plagiarism scanners. Watch out for instructions enclosed in square brackets. These indicate sections whose content must usually be adapted.
The result messages report the usual parameters which were either optimized as part of the model during a maximum-likelihood (ML) analysis or describe the final outcome of an ML or maximum-parsimony (MP) analysis. None of these numbers are specific for our service. The alpha parameter determines the shape of the GAMMA distribution and thus is part of the model. The highest log likelihood is the ML score of the best ML tree found (the higher the better). For details consult the literature on ML phylogenetic inference. The best MP score is the one of the best MP tree found (the lower the better). Consistency and retention index are related to the proportion of homoplasies (parallelisms and reversals); 1 means no homoplasies at all but this hardly occurs in real-world data sets. For details please consult the literature on MP phylogenetic inference.
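As a reminder of what the two indices measure, the sketch below uses the standard textbook definitions: the consistency index is the minimum possible number of steps divided by the number of steps observed on the tree, and the retention index relates the observed steps to the minimum and maximum possible number of steps. The parsimony program used by the service may apply variants (for instance, excluding uninformative characters), so the numbers here are purely illustrative.

```python
def consistency_index(min_steps: int, observed_steps: int) -> float:
    """CI = m / s; equals 1 when the tree requires no extra steps."""
    return min_steps / observed_steps


def retention_index(min_steps: int, observed_steps: int, max_steps: int) -> float:
    """RI = (g - s) / (g - m); equals 1 when there is no homoplasy."""
    return (max_steps - observed_steps) / (max_steps - min_steps)


# Hypothetical step counts: m = 10 (minimum), s = 20 (observed), g = 30 (maximum).
print(consistency_index(10, 20))    # 0.5
print(retention_index(10, 20, 30))  # 0.5
```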
The analysis protocol is contained in the result e-mails (after the preamble) and lists the notes, warnings and errors, if any, from all conducted steps. For the technical details see this FAQ entry. Step 0 is special because it is optional; when no GenBank accession numbers are provided nothing is downloaded from GenBank. Further steps include checking the sequences, extracting 16S rRNA gene sequences from genome sequences when genome sequences are provided, determining pairwise similarities, creating a multiple sequence alignment, inferring ML and MP trees, testing for a compositional bias, creating a tree file suitable for FigTree, and drawing the tree in a PDF file. Checking for reverse-complement sequences (of course!), determining pairwise similarities and testing for a base-frequency bias (of course!) are always skipped in the case of amino-acid sequences. Pairwise similarities can be unselected, as well as inferring the trees. The latter might make sense in some situations, because the time-limiting step is ML bootstrapping.
We have observed that many phylogenetic studies in papers on taxonomic classification, particularly in microbiology, are still based on under-complex models such as Jukes-Cantor (JC69) and venerable but outdated algorithms or programs such as neighbour joining or ClustalW. This issue (together with a tendency to over-estimate the reliability of branches that receive poor branch support and together with insufficient taxon sampling in some studies) casts some doubt on certain taxonomic decisions. This service makes it easy to apply, in contrast, state-of-the-art alignment and phylogenetic inference software.
This depends not only on the number and length of the uploaded sequences but also on how much phylogenetic signal is in the data. The time-limiting step is the ML analysis, whose bootstrapping part will converge more quickly according to the bootstopping criterion when the data contain a strong phylogenetic signal. Moreover, the computation time also depends on the load of the server. The higher the load, the fewer threads will be allocated to the job, thus increasing the running time. See also the GGDC FAQ.
Models for phylogenetic inference are usually stationary, i.e. they assume fixed frequencies of the character states (either equal ones or the empirical frequencies, as in the case of this service). This can sometimes lead to artefacts, when sequences are grouped together simply because of similar character-state frequencies. A compositional bias of nucleotide sequences usually means deviating G+C contents, and these might yield an artefact by grouping together otherwise not closely related sequences that show a similar G+C content. However, a similar G+C content might as well be caused by a close relationship. The conducted test is simple and ignores phylogenetic structure; a failed test thus is not necessarily problematic. The reported G+C content values should be watched and checked for sequences with a similar value that were grouped together but should not belong together.
Whether or not a DNA:DNA hybridization value between two strains should be determined for the discrimination of species in a taxonomic analysis depends on the similarity of the two underlying 16S rRNA gene sequences. In the proposal by Meier-Kolthoff et al. (2013), the long-standing 97% 16S threshold (Stackebrandt and Goebel, 1994) was increased by replacing it not only with a general threshold but also with phylum-specific thresholds. Here, it is important to note that these thresholds originate from a statistical model based on an empirical 16S data set from which similarities were calculated under distinct settings. In order to properly apply the suggested thresholds to other strains, 16S similarities between them must be calculated under exactly the same settings, which is what the server does. For instance, even though pairwise sequence alignment can be solved exactly, it can yield distinct results under distinct settings, and these in turn can affect the resulting similarities.
The error message about the empty or missing query (or reference) FASTA file indicates that the data downloaded from GenBank did not contain the sequences, if any, in an acceptable format (which can only partially be determined before attempting to download files). Usually this is due to wrong GenBank accession numbers, and seldom to downtimes of the GenBank servers.
The "invalid accession numbers encountered" warning points to wrong GenBank accession numbers that can be recognized as such before even trying a GenBank query. For instance, sometimes users paste sequences into fields reserved for accession numbers. This cannot work. Also note that the sequences are only searched for in the nucleotide and protein databases.
The "accession number count distinct between query and download" warning most likely means that either GenBank did not respond or that it did not recognize the accession number. See the general FAQ for details on how to specify accession numbers.
To determine the cause of the error, first check whether what has been submitted to the server actually looks like GenBank accession numbers of nucleotide sequences. If so, try to apply the same accession number via the GenBank web interface. If this fails, the accession number is invalid or the GenBank server is down.
Only if the options listed above can be ruled out, the error might be on the side of the GGDC phylogeny server. Please report a bug in that case; include the accession number(s) used in your message.
The "non-zero exit status (or empty result) when converting GenBank to FASTA" error message indicates that the GenBank converter did not recognize or did not accept the GenBank flatfile. This can happen in the case of non-standard flatfile sources (which is basically everything except, well, GenBank) or otherwise anomalous flatfiles such as WGS records. Since this service has been generated for analysing single genes (which means that it attempts to extract 16S rRNA gene sequences when entire genomes are provided), we have intentionally implemented rather strict conversion rules. If you believe the server to reject a GenBank flatfile that should be accepted, please report this issue to us; include the accession number(s) used in your message.
The "product extraction did not result" warning indicates that it was impossible to extract a 16S rRNA gene sequence from one or several input sequences. Note that this is not an error if you uploaded nucleotide sequences of another gene, and that the service is supposed to proceed in such a situation. Extraction of 16S rRNA gene sequences is necessary when sequences are too long to be processed otherwise, as in the case of genome sequences. Extraction is primarily based on the genome annotation. If the annotation does not specify the 16S rRNA gene in some recognizable manner, it cannot be extracted. In these cases, barrnap is used, which should thus be additionally cited when unannotated genome sequences were uploaded to the server.
The server supports nucleotide (DNA or RNA) and amino-acid sequences. RNA sequences can be uploaded. They will then automatically be converted to DNA sequences. You cannot mix RNA and DNA sequences, however. Attempting to do so would result in an error when determining the sequence data type because the outcome would be ambiguous. You cannot mix DNA and amino-acid sequences either.
When only query sequences but no reference sequences are provided and a query FASTA file containing already aligned sequences is uploaded, this alignment is recognized and used. Multiple sequence alignment conducted by the server itself is skipped in that case. Beware of uploading sequences that are actually unaligned but have been padded with gaps to obtain uniform lengths. Also note that dots are interpreted like dashes by the server because certain software packages use dots to indicate leading and trailing gaps. Some other programs use dots to indicate identity to the first sequence; this use of dots must be avoided when working with the GGDC phylogeny server.
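Before uploading, it can help to check locally how a file would look once dots are read as gaps. The sketch below is a hypothetical pre-check; the criterion the server actually uses to recognize an alignment is not described here, and uniform sequence length is only assumed to be a necessary condition.

```python
def dots_to_dashes(sequence: str) -> str:
    """Treat dots as gap characters, as the server is described to do."""
    return sequence.replace(".", "-")


def could_be_alignment(sequences) -> bool:
    """True if all sequences have the same length (a necessary condition
    for an alignment, not a proof that the data are really aligned)."""
    lengths = {len(sequence) for sequence in sequences}
    return len(lengths) == 1


sequences = ["..ACGTTA", "GGACG-TA", "GGACGTT."]
print([dots_to_dashes(s) for s in sequences])  # ['--ACGTTA', 'GGACG-TA', 'GGACGTT-']
print(could_be_alignment(sequences))           # True
```

Note that unaligned sequences padded with gaps to a uniform length would pass such a length check as well, which is why they should not be uploaded.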