GGDC FAQ

  1. How is my privacy respected?
  2. Who can use this service?
  3. What are the main advantages of using GGDC?
  4. Can I use GGDC with incompletely sequenced genomes?
  5. What is the purpose of the two types of confidence intervals that are reported by GGDC?
  6. How do I cite this service?
  7. How do I interpret the results?
  8. How long do I have to wait for my results?
  9. Is there any maximum size for submitted data?
  10. Why did my GGDC job fail?
  11. What do I need to consider before submitting multi-FASTA files?
  12. Which kinds of files are attached to the GGDC result e-mails?
  13. Which type of genome names is the GGDC using in the result e-mail?
  14. What does it mean if a distance value is 'Infinity', 'NaN' or negative (-9999)?
  15. What does the warning 'Sequence has length 0 and will be ignored' mean?
  16. Why do the three distance formulae sometimes yield different results and in which DDH estimate should I trust?
  17. Why do you provide subspecies estimates? Microbial subspecies have so far not been based on DDH!
  18. Is there a DDH threshold for genera?
  19. Is DDH affected by extrachromosomal replicons such as plasmids or by horizontal gene transfer?
  20. Why can the G+C content difference indicate "same or distinct species"?
  21. Where do I find the percent G+C content values? Do I need dDDH to obtain the G+C content?
  22. How meaningful are very low DDH values? Two random sequences still have 25% identity!
  23. What does the warning 'Sequence-length inconsistency in download of some Genbank files' mean?
  24. Where do I find the GGDC 1.0 service which was previously available on the GGDC website?
  25. Is a standalone version of the GGDC available?
  26. When everybody wants to get rid of DDH, why should one use digital DDH?

Answers

1. How is my privacy respected?

See the corresponding entry in the gene phylogeny FAQ.

2. Who can use this service?

See the corresponding entry in the gene phylogeny FAQ.

3. What are the main advantages of using GGDC?

In empirical comparisons of GGDC with other digital DDH methods, GGDC yielded the highest correlations with traditional DDH, thus ensuring the highest consistency regarding the species-delimitation approach that currently dominates in microbial taxonomy, without sharing the disadvantages of traditional DDH. This is crucial because approaches such as ANI have themselves been justified solely by their correlation with traditional DDH values. To the best of our knowledge, GGDC (2) is also the only replacement method for traditional DDH that provides confidence intervals. Moreover, because GGDC delivers values on the same scale as traditional DDH (instead of, e.g., ANI values), GGDC results are easy to compare with wet-lab DDH values. Finally, as of December 2014 GGDC 2 conducts comparisons with subspecies boundaries, too, and in January 2016 G+C content calculations were also incorporated. A graphical overview of the advantages of GGDC over alternative methods is also available in this section of Hans-Peter Klenk's acceptance speech for the 2014 Bergey Award.

4. Can I use GGDC with incompletely sequenced genomes?

Yes, if one uses formula 2 (which is the recommended formula anyway), one needs only about 20% of the genome to get the same result as with the full genome. The other two formulas will be severely affected by genome incompleteness. See also [FAQ entry 7], [FAQ entry 16] and Auch et al. (2010).
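
The robustness of formula 2 can be illustrated with a toy calculation. The sketch below assumes, following the labels that GGDC prints for the three formulae (HSP length / total length, identities / HSP length, identities / total length), that each distance is simply one minus the named ratio. The actual GBDP definitions differ in detail (see Auch et al. 2010), and the function name and all HSP numbers are invented for illustration.

```python
# Toy illustration only: distances are ASSUMED to be 1 - ratio, using the
# ratios named in the GGDC output labels. All HSP values are invented.

def distances(hsps, total_length):
    """hsps: list of (hsp_length, identities); total_length: summed genome lengths."""
    hsp_len = sum(length for length, _ in hsps)
    idents = sum(ident for _, ident in hsps)
    return {
        1: 1 - hsp_len / total_length,  # HSP length / total length
        2: 1 - idents / hsp_len,        # identities / HSP length
        3: 1 - idents / total_length,   # identities / total length
    }

# Ten HSPs between two hypothetical complete genomes of 10,000 bp each.
full_hsps = [(1000, 970)] * 10
full = distances(full_hsps, total_length=20_000)

# Pretend only 20% of the query was sequenced: two of the ten HSPs remain,
# and the query contributes only 2,000 bp to the total length.
draft = distances(full_hsps[:2], total_length=12_000)

for formula in (1, 2, 3):
    print(formula, round(full[formula], 4), round(draft[formula], 4))
```

In this toy setting the formula 2 distance is the same for the complete and the draft genome, because its numerator and denominator shrink together, whereas the formula 1 and 3 distances increase once parts of the genome are missing.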

5. What is the purpose of the two types of confidence intervals that are reported by GGDC?

GGDC 2 reports two types of confidence intervals (CIs) for specifying the uncertainty associated with the reported DDH estimates.

  1. Model-based CI.

    Like any statistical model, the one used for the prediction of DDH values from intergenomic distances has an inherent error of estimation, which can be assessed with the help of model-based CIs. Briefly, the CI for a given point estimate (i.e., fit) is calculated on the link scale of the GLM and the resulting bounds are then transformed using the inverse of the link function (e.g., according to: Zuur, A., et al. Mixed Effects Models and Extensions in Ecology with R. New York: Springer, 2009. Print.). As a consequence, the resulting CI is asymmetric, i.e., the width of the upper part of the interval (as measured from the point estimate) can differ from that of the corresponding lower part. In the result e-mail the model-based CI is given in square brackets after the DDH estimate.

    Such uncertainty is, of course, also present in alternative approaches such as ANI, even though we are not aware of an ANI implementation that actually calculates CIs. In our view, ANI thresholds supposed to be equivalent to 70% wet-lab DDH should be provided with CIs, too. (The uncertainty in ANI implementations might actually be higher than the one inherent to GGDC because GGDC uses a larger empirical data set; see Meier-Kolthoff et al., 2013.)

  2. Resampling-based CI.

    The intergenomic distances used for DDH prediction can themselves be resampled via a special bootstrap implementation (see Meier-Kolthoff et al., 2013). These so-called replicate distances deviate from the original distance to a certain extent and thus allow for the assessment of a 95% confidence interval on both DDH and GGD (Genome-to-Genome Distance) scales.

Since the model-based CIs are usually larger than those provided by bootstrapping, the latter are optional and not reported by default.
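
The back-transformation behind the model-based CI can be sketched as follows. This is a minimal illustration, not the GGDC implementation: it assumes a logit link and a normal approximation on the link scale, and the fitted value and standard error are invented. The link function and coefficients actually used by the GGDC are described in the accompanying publications.

```python
import math

def inverse_logit(eta):
    """Map a value on the (assumed) logit link scale back to a proportion."""
    return 1.0 / (1.0 + math.exp(-eta))

def model_based_ci(eta, se, z=1.96):
    """Return (estimate, lower, upper) in percent for a linear predictor eta
    with standard error se; z=1.96 gives an approximate 95% interval."""
    lower = inverse_logit(eta - z * se)
    upper = inverse_logit(eta + z * se)
    return tuple(100 * value for value in (inverse_logit(eta), lower, upper))

# Hypothetical fit and standard error on the link scale.
est, lo, hi = model_based_ci(eta=1.1, se=0.08)
print(f"{est:.2f}% +{hi - est:.2f}/-{est - lo:.2f}")
```

The interval is symmetric on the link scale but becomes asymmetric after the inverse link is applied, which is why the upper and lower parts of the reported CI differ in width.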

6. How do I cite this service?

Please see our 'Background' page for the publications that describe the algorithms and statistics used by the GGDC. By using this service you agree to cite at least one of the GGDC papers. There are additional references for subspecies delineation and G+C content interpretation.

7. How do I interpret the results?

The GGDC compares a query genome with a reference genome and calculates an intergenomic distance under three different distance formulae. An in-depth description of these can be found in the accompanying publications. The formulae support your decision about the relatedness of your novel strain to known (type) strains.

Note
Formula 2 is independent of genome length and is thus robust against the use of incomplete draft genomes.
For other reasons for preferring formula 2, see [FAQ entry 16]. If there are any significant differences between the three formulae, please base your decision on the recommended formula 2.

An exemplary result from the GGDC 1.0 would look like this:
 
Submission: 13-04-05-11-48-0015480
Program: NCBI-BLAST


=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Escherichia_coli_IAI1 (query) vs. Escherichia_coli_IAI39 (reference):
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=


Formula: 1 (HSP length / total length)
Distance: 0.1646
DDH estimate (regression-based): 76.90 
Estimate of DDH <=70% (threshold-based): no (threshold=0.2676)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Formula: 2 (identities / HSP length)
Distance: 0.0304
DDH estimate (regression-based): 77.06 
Estimate of DDH <=70% (threshold-based): no (threshold=0.0412)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Formula: 3 (identities / total length)
Distance: 0.1899
DDH estimate (regression-based): 76.06 
Estimate of DDH <=70% (threshold-based): no (threshold=0.2945)
 
An exemplary result from the GGDC 2.1 would look like this:
 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
E_coli_K12_W3110 (query) vs. E_coli_O1_K1_H7_DSM_30083 (reference):
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=


Formula: 1 (HSP length / total length)
Distance: 0.1614
DDH estimate (GLM-based): 74.20% +3.62/-3.98
Probability that DDH > 70% (i.e., same species): 83.25% (via logistic regression)
Probability that DDH > 79% (i.e., same subspecies): 43.32% (via logistic regression)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Formula: 2 (identities / HSP length) (RECOMMENDED)
Distance: 0.0292
DDH estimate (GLM-based): 75.20% +2.78/-3.01
Probability that DDH > 70% (i.e., same species): 85.97% (via logistic regression)
Probability that DDH > 79% (i.e., same subspecies): 38.55% (via logistic regression)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Formula: 3 (identities / total length)
Distance: 0.1859
DDH estimate (GLM-based): 77.10% +3.13/-3.46
Probability that DDH > 70% (i.e., same species): 92.14% (via logistic regression)
Probability that DDH > 79% (i.e., same subspecies): 46.06% (via logistic regression)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Difference in % G+C: 0.16  (interpretation: either distinct or same species)
 

This example also shows that a lower distance value with formula 2 than with formula 1 (or 3) might yield a lower DDH estimate with formula 2 than with formula 1 (or 3). This is simply caused by the three distance formulae operating on three different scales. Distance values can only be compared if they have been obtained with the same formula. If so, lower distances will of course always yield a higher DDH estimate. The DDH estimates are, of course, all on the same scale.

The interpretation of this example is as follows. All three formulae confirm the hypothesis that the two genomes belong to the same species (note the confidence intervals). Because the reference genome is from the type strain, this confirms that the query genome is from an Escherichia coli strain, too. All three formulae also yield a dDDH estimate smaller than 79%, which indicates that the two strains belong to distinct subspecies. This difference is not significant for formula 3, but it is significant for formula 2, which is the recommended one. More details on the subspecies of E. coli are found elsewhere. For the interpretation of the G+C content, see [FAQ entry 20] and [FAQ entry 21].
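
The reasoning in this interpretation can be retraced with a few lines of code. The sketch below takes the three dDDH estimates and their confidence intervals from the GGDC 2.1 example output and checks whether each interval excludes the 70% (species) and 79% (subspecies) thresholds. Treating an interval that excludes a threshold as "significant" is a simplified reading used here for illustration; the function name is hypothetical, and the GGDC itself additionally reports probabilities obtained via logistic regression.

```python
SPECIES_THRESHOLD = 70.0     # percent dDDH, species boundary
SUBSPECIES_THRESHOLD = 79.0  # percent dDDH, subspecies boundary

def compare_with_threshold(estimate, plus, minus, threshold):
    """Classify a dDDH estimate 'estimate +plus/-minus' against a threshold."""
    lower, upper = estimate - minus, estimate + plus
    if lower > threshold:
        return "above (CI excludes threshold)"
    if upper < threshold:
        return "below (CI excludes threshold)"
    if estimate > threshold:
        return "above (not significant)"
    return "below (not significant)"

# Values copied from the GGDC 2.1 example output.
example = {1: (74.20, 3.62, 3.98), 2: (75.20, 2.78, 3.01), 3: (77.10, 3.13, 3.46)}
for formula, (est, plus, minus) in example.items():
    print(formula,
          compare_with_threshold(est, plus, minus, SPECIES_THRESHOLD), "|",
          compare_with_threshold(est, plus, minus, SUBSPECIES_THRESHOLD))
```

For formula 3 the upper bound (77.10 + 3.13 = 80.23) lies above 79%, so the subspecies verdict is not significant, whereas for formula 2 the upper bound (75.20 + 2.78 = 77.98) stays below 79%.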

8. How long do I have to wait for my results?

Processing of your submission may take several minutes depending on the current workload of the server, genome sizes and, of course, the total number of comparisons. Using GenBank accession numbers (such as AE000782) is slightly slower than uploading FASTA files directly because the genomes need to be downloaded from GenBank first. Here, the GGDC depends on the accessibility of GenBank; in rush hours, when GenBank is accessed by numerous users, the download speed might be somewhat reduced. Moreover, choosing BLAT as local alignment tool, in conjunction with FASTA files that consist of many genome parts, may be extremely slow; use the default settings instead (BLAST+). After the job has finished, an e-mail containing the results will be sent to the user-provided address.

As a rule of thumb, a comparison between two E. coli genomes takes about 1 minute, whereas the same comparison with optional bootstrap replicate-based confidence intervals takes almost 5 minutes (note: calculating bootstrap-based confidence intervals is not necessary because the DDH prediction model itself already provides confidence intervals). However, if you provide significantly larger genomes, the running time might be longer.

9. Is there any maximum size for submitted data?

This service is designed for small and medium-sized datasets with genome sequences of at most 15 MB in size. This limitation should be sufficient for all currently sequenced prokaryotic genomes. For example, the largest prokaryotic genomes sequenced to date have approximately 13 Mbp (as in the case of Ktedonobacter racemifer). Whereas up to 50 reference genomes are allowed, there is an additional overall size limit of c. 450 MB, which we intend to lift in the future. It only affects FASTA upload but not the use of GenBank accession numbers (such as AE000782). If you intend to use GGDC for larger data sizes, please contact the authors.
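
The limits interact: 50 reference genomes of 15 MB each would add up to 750 MB and thus exceed the overall limit. A planned FASTA upload can be checked beforehand along the following lines. This is only a sketch; the function name is hypothetical, "MB" is assumed to mean 10^6 bytes, and the query is assumed to count towards the overall limit.

```python
# Limits quoted in the answer; "MB" is ASSUMED to mean 10**6 bytes.
MAX_GENOME_BYTES = 15 * 10**6
MAX_TOTAL_BYTES = 450 * 10**6
MAX_REFERENCES = 50

def check_submission(query_bytes, reference_bytes):
    """Return a list of limit violations for a planned FASTA upload (empty if none)."""
    problems = []
    if len(reference_bytes) > MAX_REFERENCES:
        problems.append(
            f"{len(reference_bytes)} reference genomes exceed the limit of {MAX_REFERENCES}")
    for index, size in enumerate([query_bytes, *reference_bytes]):
        if size > MAX_GENOME_BYTES:
            label = "query" if index == 0 else f"reference {index}"
            problems.append(f"{label} is larger than 15 MB")
    # Assumption: the query counts towards the overall limit as well.
    if query_bytes + sum(reference_bytes) > MAX_TOTAL_BYTES:
        problems.append("overall upload is larger than 450 MB")
    return problems

# 50 references of 10 MB each respect the per-genome limit but not the overall one.
print(check_submission(10 * 10**6, [10 * 10**6] * 50))
```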

10. Why did my GGDC job fail?

Distance calculation frequently fails with wrong input data, i.e. if neither complete genomes nor draft genomes are provided. The upload of single-gene (e.g., 16S rRNA gene) nucleotide or amino-acid sequences is strongly discouraged, as GGDC is not devised for this kind of data (even though the underlying GBDP method can be applied to such data). This often results in a distance value deliberately indicating an error, such as '-9999.0000'. Another frequent kind of failure is caused by wrong GenBank accession numbers, as indicated in the GGDC result e-mails. Note that GenBank accession numbers of amino-acid sequences also yield a GenBank download error. Suitable GenBank accession numbers such as AE000782 refer to a nucleotide genome sequence (additional protein annotation causes no harm). GenBank assembly accession numbers are not supported.

In rare cases we also observed that alignment programs such as BLAST+ might fail during the alignment of genomes containing a high number of contigs, thus producing no hits. Here, the GGDC would also produce '-9999.0000' as a result to indicate an erroneous run.

11. What do I need to consider before submitting multi-FASTA files?

GGDC v1.0 is intended for comparing assembled whole-genome sequences comprising one to several chromosomes, extrachromosomal replicons or scaffolds. It is not intended for dealing with, e.g., multi-FASTA files separately containing the nucleotide sequences of all genes. The GGDC v1.0 service cannot handle multi-FASTA files with a large number of FASTA entries (say, more than 500 sequences), since the multi-FASTA files are split to FASTA files each containing one single entry before processing. The GGDC v2.0 supports any multi-FASTA format. Its DDH prediction model has been accordingly adapted (and has many other advantages over GGDC v1.0 anyway).
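
Whether a multi-FASTA file stays below the rough figure of 500 entries can be checked by counting its header lines. The following sketch uses hypothetical function names and an in-memory example; for a real file, the open file handle can be passed instead of the list.

```python
MAX_ENTRIES_V1 = 500  # rough figure quoted in the answer for GGDC v1.0

def count_fasta_entries(lines):
    """Count FASTA records, i.e. lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))

def suitable_for_v1(lines):
    """True if the number of entries does not exceed the rough v1.0 figure."""
    return count_fasta_entries(lines) <= MAX_ENTRIES_V1

example = [">contig_1", "ACGT", ">contig_2", "GGCC"]
print(count_fasta_entries(example), suitable_for_v1(example))
```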

12. Which kinds of files are attached to the GGDC result e-mails?

13. Which type of genome names is the GGDC using in the result e-mail?

In the past, if genome names had not been explicitly specified during the submission process, the name "genome_1" would have been used as placeholder to refer to the query organism, whereas "genome_2" to "genome_11" would have referred to the (up to ten) reference organisms in the order they had been listed in the submission form. Now, the GGDC directly uses the FASTA file names as provided by the user, i.e., there is no need for a tedious manual specification of genome labels anymore.

14. What does it mean if a distance value is 'Infinity', 'NaN' or negative (-9999)?

'Infinity' and 'NaN' mean that no matches have been found between the two genomes (given the local-alignment settings), hence the respective distance formula yields anomalous results (i.e., a division by zero occurs). In such cases you can be sure that the DDH estimate is virtually 0%.

A value of '-9999.0000' instead means that this distance job has crashed, usually because of unsuitable input data. See [FAQ entry 10] for why this can happen.
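
When result files are processed automatically, these special values should be handled before any further calculation. A minimal sketch, assuming the distance has been copied from the result e-mail as text (the function name and return strings are made up for illustration):

```python
import math

ERROR_SENTINEL = -9999.0  # value the GGDC reports for a failed distance job

def interpret_distance(raw):
    """Classify a distance value copied from a GGDC result e-mail."""
    value = float(raw)  # float() also parses 'Infinity' and 'NaN'
    if math.isnan(value) or math.isinf(value):
        return "no matches found; DDH estimate is virtually 0%"
    if value == ERROR_SENTINEL:
        return "job crashed; check the input data"
    return "regular distance"

for raw in ("0.0292", "Infinity", "NaN", "-9999.0000"):
    print(raw, "->", interpret_distance(raw))
```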

15. What does the warning 'Sequence has length 0 and will be ignored' mean?

If you have specified an accession number as input data, our server will try to download the corresponding sequences from NCBI. Usually, the so-called 'master record' does not hold any sequence data but only links to the actual genome parts of that genome. That is why the server cannot download sequence data for that particular master record and just informs you about this incident by saying 'Sequence has length 0 and will be ignored'.

16. Why do the three distance formulae sometimes yield different results and in which DDH estimate should I trust?

In many cases the three distinct formulae give almost identical results regarding DDH. For instance, for two E. coli genomes used as test data, with BLAST+ we obtain 74.80% ± 3.80, 76.70% ± 2.87 and 77.80% ± 3.28, respectively. But for other genomes the results can be distinct, and sometimes even different regarding the 70% boundary. This is not caused by an error in the calculations but by the three distance formulae exploring distinct aspects of genome evolution.

All models for estimating DDH from intergenomic distances have been inferred separately, and all formulae yielded very high correlations. Whereas the test dataset used was larger than in any previous studies, we were unable to prefer a certain distance formula on the basis of the model-building results alone when using the test dataset. This does not mean, however, that other criteria would not yield stronger differences between the formulae. For instance, formula 2 is the only one that can be used with incompletely sequenced genomes. But there are other reasons why we recommend formula 2.

From a biological point of view, if all genomes have been sequenced completely (have "Finished" status), or almost so, but formula 1 yields much higher DDH similarities than formula 2, this indicates that the two strains changed comparatively little in gene content but comparatively strongly regarding the gene sequences. Depending on the kind of organisms and on the selection pressures under which they evolved, this might not be an unreasonable scenario, even though many strains within the same species differ considerably in their gene content. For instance, two strains might differ more or less only in the presence of a plasmid (see [FAQ entry 19]). This alone might be a reason for preferring formula 2, but the other two formulae potentially provide valuable additional biological information and hence are included in the results.

17. Why do you provide subspecies estimates? Microbial subspecies have so far not been based on DDH!

It is often crucial to not only consider what the conventions are but also what the rationale is behind these conventions. The 70% DDH rule has the advantage of making species quantitatively comparable. Whereas this alone could be achieved with any threshold, taxonomic conservatism dictates that the well-known, predefined threshold should be maintained wherever possible. The GGDC allows for keeping this threshold while at the same time moving to modern, genome-sequence based and highly reliable methods.

Bacterial subspecies were, indeed, traditionally not determined based on a distance or similarity threshold, but on a qualitative assessment of usually quite few phenotypic characters. Moreover, compared to the number of validly published species names, not many subspecies names have a standing in nomenclature. So taxonomic conservatism would not be significantly violated by the introduction of a quantitative threshold for subspecies delineation. (This holds even though switching the criterion implies that one cannot expect the new method to frequently confirm existing subspecies boundaries.) Moreover, we believe that the high resolution provided by genome sequences now calls for introducing methods that make microbial subspecies quantitatively comparable, too.

Given the other advantages of the GGDC, it thus makes most sense to establish a dDDH threshold for subspecies delineation. But which one? Because of the low number of validly published subspecies names, and due to the fact that they are not expected to have any quantitative consistency regarding traditional DDH, it does not make much sense to attempt to estimate a boundary from the currently existing microbial subspecies. Rather, clustering consistency was the main criterion in our approach to determine a dDDH threshold for subspecies. It suggested a value of c. 79% dDDH.

18. Is there a DDH threshold for genera?

To the best of our knowledge a threshold of traditional DDH similarities has never been established for genera. Hence there is no GGDC threshold for genus boundaries either. We fear it would not make much sense to introduce one. One of the reasons for our scepticism is saturation. That is, the lower DDH similarities get, the worse might be their representation of phylogenetic relationships. Moreover, the larger the overall phylogenetic distances, the higher the chance of getting less ultrametric data. Non-ultrametricity is a general phenomenon not directly related to traditional or digital DDH. Not caring about ultrametricity at all is taxonomically naive, as is evident from probably every textbook on phylogenetics. If any data deviate too strongly from ultrametricity, applying pairwise distance or similarity thresholds to them for delineating taxa is frivolous. For species and subspecies boundaries, it has been shown that non-ultrametricity is not normally a problem for the GGDC (we are not aware that alternative approaches such as ANI have even been assessed in this respect). However, delineating genera using pairwise distances or similarities could suffer more strongly from non-ultrametricity.

19. Is DDH affected by extrachromosomal replicons such as plasmids or by horizontal gene transfer?

Indeed, extrachromosomal DNA is present in many bacteria and thus is expected to impact traditional DDH experiments as well as digital DDH calculations. But if bias due to extrachromosomal DNA really was a problem, traditional DDH should simply never have become the gold standard in bacterial species delimitation. GGDC as well as methods such as ANI use entire genomes and do not discriminate between chromosomal and extrachromosomal DNA. Note that changes in the assignment of genes into replicons do not necessarily indicate significant differences in gene presence or absence, let alone in gene sequences. But only those would significantly affect DDH. For this reason, users concerned about distortion by extrachromosomal elements should just use formula 2, which is the recommended formula anyway. It is quite unaffected by changes in gene content; for details see [FAQ entry 16]. All three formulas are expected to be independent of changes in gene order (Henz et al. 2005).

The same reasoning holds for horizontal gene transfer (HGT). All bacterial genomes are to some degree affected by HGT, hence all whole-genome methods (including traditional DDH, digital DDH and ANI) are to some degree affected by HGT. HGT can influence both the gene content (by adding to a genome genes whose homologs were not present before) and the similarity between homologous genes found in distinct genomes (by adding to a genome genes whose homologs were already present). Like the main ANI implementations, GBDP formula 2 is quite unaffected by gene content. Regarding gene similarity, all GBDP methods have the advantage of conducting an on-the-fly correction for paralogy. This means that in the case of overlapping hits the better one is preferred. As hits to xenologous genes are unlikely to be better than hits to orthologous genes, particularly in the case of closely related genomes, this kind of correction is likely to temper the impact of HGT, too. In contrast, ANI has no correction for paralogy and thus is more likely to be affected by HGT than GBDP.

20. Why can the G+C content difference indicate "same or distinct species"?

In contrast to DDH dissimilarities, differences in percent genomic G+C content between distinct species can be quite close to zero. They just cannot be larger than 1 within the same species (Meier-Kolthoff et al. 2014). Thus when DDH indicates same species, a percent G+C content difference > 1 is not normally possible (and should be reported to the GGDC staff), whereas a percent G+C content difference <= 1 confirms the DDH result. When, in contrast, DDH indicates distinct species, a percent G+C content difference > 1 confirms this, whereas a percent G+C content difference <= 1 does not say anything. Note that within-species differences in percent G+C content > 1 reported in the older literature are due to artefacts of the applied methods; genome sequencing is expected to be way more exact regarding the G+C content.
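
The four cases described in this answer amount to a small decision table, sketched below. The function name and the wording of the returned messages are made up for illustration; the threshold of 1 (difference in percent G+C content) is the one stated in the answer.

```python
def interpret_gc_difference(ddh_same_species, gc_difference):
    """Combine the dDDH verdict with the difference in percent G+C content."""
    if ddh_same_species:
        if gc_difference > 1:
            return "conflict: not normally possible, report to the GGDC staff"
        return "G+C difference confirms same species"
    if gc_difference > 1:
        return "G+C difference confirms distinct species"
    return "G+C difference is uninformative"

# The GGDC 2.1 example reports a difference of 0.16, with dDDH indicating same species.
print(interpret_gc_difference(True, 0.16))
```

Taken on its own, a difference <= 1 is compatible with both outcomes, which is why the result e-mail labels it "either distinct or same species".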

21. Where do I find the percent G+C content values? Do I need dDDH to obtain the G+C content?

Exact G+C content values inferred from the genome sequences are included in the check file attached to each GGDC result message (see [FAQ entry 12]). The values for entire genome sequences should be included in publications on these genomes particularly if taxonomic conclusions are drawn from them; see [FAQ entry 20] for details. The G+C content values for individual genome parts (such as scaffolds, contigs, chromosomes or extrachromosomal replicons) should also be checked because strong deviations between them might indicate contaminations or assembly artefacts.

Estimating dDDH similarities is not actually needed to obtain exact genomic G+C content values. If you are only interested in them, just arbitrarily choose one of your genome sequences as query and the others as references. The check file lists the G+C content values for all of them. If you have only a single genome, simply compare it to itself.
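
For a quick plausibility check of the values in the check file, the percent G+C content can also be computed directly from a FASTA file. The sketch below is not the GGDC code: it counts only A, C, G and T and ignores ambiguous bases such as N, which is an assumption, and the example sequences are invented.

```python
def parse_fasta(text):
    """Return {header: sequence} for a (multi-)FASTA string."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif line and header is not None:
            records[header] += line
    return records

def gc_percent(sequence):
    """Percent G+C among unambiguous bases (ambiguous bases are ignored)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * gc / (gc + at)

# Invented two-part genome: per-part values first, then the whole genome.
fasta = ">chromosome\nGGCCAATT\nGGCC\n>plasmid\nATATATGC\n"
parts = parse_fasta(fasta)
for name, seq in parts.items():
    print(name, round(gc_percent(seq), 2))
print("whole genome", round(gc_percent("".join(parts.values())), 2))
```

Printing the values per genome part makes strong deviations between parts easy to spot.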

22. How meaningful are very low DDH values? Two random sequences still have 25% identity!

It is correct that the expected sequence identity of two random nucleotide sequences is 25%. However, this holds only if the sequences can be globally aligned without gaps. In contrast, genomes evolve not only via substitutions, insertions and deletions within genes, but also via gains, losses and rearrangements of entire genes. Thus the 25% boundary has no direct meaning for entire genomes. Moreover, digital DDH starts by determining local alignments between two genomes. These local alignments would not normally be found by programs such as BLAST (or filtered out later on due to low quality) if the two genomes had a random relationship throughout. Their intergenomic distance would be maximum and their digital DDH similarity would be 0 (or close to 0, depending on the model). For this reason, intergenomic distances calculated by GBDP, and dDDH values derived from them, are meaningful throughout their range and should literally be reported. One must only keep in mind that some kind of saturation occurs, as very low real identities cannot be distinguished from each other. Even in the case of data for which the 25% identity boundary was meaningful, 25% dDDH would correspond to way more than 25% identity.

23. What does the warning 'Sequence-length inconsistency in download of some Genbank files' mean?

This can only happen when using GenBank sequence accession numbers (such as AE000782). The cause of this warning is low availability of the GenBank servers, which might be due to maintenance downtime or high demand. Unless the check file attached to the GGDC message indicates that the data used are complete anyway, you should definitely resubmit your GenBank accession numbers to the GGDC server at a later time.

24. Where do I find the GGDC 1.0 service which was previously available on the GGDC website?

The GGDC 1.0 is still available here. However, the GGDC 2 is an updated and enhanced version of the previous GGDC 1 and incorporates improved DDH-prediction models and additional features such as confidence-interval estimation. To the best of our knowledge, it is the only digital DDH method that provides this feature. Of all genome-based methods we are aware of, GGDC 2 yields the highest correspondence to traditional DDH (without sharing its drawbacks). Details are described in our BMC Bioinformatics study.

25. Is a standalone version of the GGDC available?

The freely available GGDC 2.1 implements the latest version of the Genome BLAST Distance Phylogeny (GBDP) method as published in Meier-Kolthoff et al. (2013). Even though a legacy version of GBDP is available here, it lacks important features (e.g., calculation of pseudo-bootstrapping replicates, prediction of confidence intervals etc.). That said, we are planning to release the latest version of GBDP in the course of 2017 but there is still a little bit of work left (e.g., writing a proper user manual, code documentation, publication of an application note etc.). Once the standalone version is available, we will announce it on this website. Until then, if you require larger analyses (phylogenomic analyses as well as (sub-)species delimitation via digital DDH) that do not fit into the scope of the web service, please let us know.

26. When everybody wants to get rid of DDH, why should one use digital DDH?

This is actually a loaded question, which, unfortunately, is occasionally still raised. The question is loaded because it presupposes a certain equivalence between traditional DDH and digital DDH that simply does not exist. More specifically, the question presupposes that the disadvantages of traditional DDH (which are the reasons for replacing the practice) are also disadvantages of digital DDH. This presumption is flawed, as can easily be demonstrated. The advantageous features of traditional DDH are (1) its use of the information from complete genomes, at least conceptually, and (2) its use of quantification. The disadvantageous features of traditional DDH are (3) that it is a tedious method that can only be conducted by few specialized molecular laboratories and (4) that it does not work incrementally, i.e. the effort for one pairwise comparison does not yield anything of use for any other pairwise comparison.

It is obvious that digital DDH keeps (and even enhances) the advantageous features (1) and (2) but abandons the disadvantageous features (3) and (4). For this reason, the need to abandon traditional DDH is actually not an argument against but an argument for digital DDH. For instance, (3) does not hold for digital DDH because the limiting factor for digital DDH is genome sequencing, but once obtained a genome sequence can be used for many things in addition to digital DDH.

Sometimes the claim that one wants to get rid of traditional DDH is even put forward as an argument for preferring ANI over the GGDC and digital DDH. As such, the argument is an example of poor scholarship, if not outright foolish. Indeed, the argument completely overlooks that the justification for ANI ‐ regarding its use in general as well as regarding the ANI threshold for species delimitation ‐ was its high correlation with traditional DDH. This holds for the original ANI approach put forward by Goris, Konstantinidis and colleagues as well as the so-called ANIb and ANIm methods published by Richter and Rossello-Mora. However, using exactly the same criterion, digital DDH as calculated using the GGDC works better, because it yields an even higher correlation with traditional DDH. In addition to other advantages of the GGDC over ANI, this was demonstrated in our publications on the GGDC using empirical data sets larger than in previous publications.