NYMC Faculty Publications

First Page

art. no. 7808

Whole-genome sequencing is increasingly adopted in clinical settings to identify pathogen transmissions, though largely as a retrospective tool. Prospective monitoring, in which samples are continuously added and compared with previous samples, can generate more actionable information. To enable prospective pathogen comparison, genomic relatedness metrics based on single-nucleotide differences must be consistent across time, efficient to compute, and reliable for a large variety of samples. The choice of genomic regions to compare, i.e., the core genome, is critical to obtaining a good metric. We propose a novel core genome method that selects conserved sequences in the reference genome by comparing its k-mer content to that of publicly available genome assemblies. Because the resulting conserved-sequence genome does not depend on the sample set, it enables prospective pathogen monitoring. Based on clinical data sets of 3436 S. aureus, 1362 K. pneumoniae and 348 E. faecium samples, ROC curves demonstrate that the conserved-sequence genome disambiguates same-patient samples better than a core genome consisting of conserved genes. The conserved-sequence genome also confirms outbreak samples with high sensitivity: in a set of 2335 S. aureus samples, it correctly identifies 44 out of 44 known outbreak samples, whereas the conserved-gene method confirms only 38.
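
The idea of selecting conserved reference regions by k-mer content can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the value of k, the `min_fraction` threshold, and the rule that a reference position counts as conserved when any conserved k-mer covers it are all assumptions made for illustration. The sketch also ignores reverse complements and assumes samples have already been aligned to the reference, so that each sample is a string in reference coordinates.

```python
from typing import List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def conserved_mask(reference: str, assemblies: List[str],
                   k: int = 3, min_fraction: float = 1.0) -> List[bool]:
    """Mark reference positions covered by at least one 'conserved' k-mer.

    A reference k-mer is treated as conserved (an assumption of this sketch)
    when it occurs in at least min_fraction of the given assemblies.
    """
    if not assemblies:
        raise ValueError("at least one assembly is required")
    assembly_kmers = [kmers(a, k) for a in assemblies]
    mask = [False] * len(reference)
    for i in range(len(reference) - k + 1):
        kmer = reference[i:i + k]
        hits = sum(kmer in s for s in assembly_kmers)
        if hits / len(assemblies) >= min_fraction:
            # Every position spanned by a conserved k-mer joins the mask.
            for j in range(i, i + k):
                mask[j] = True
    return mask


def snp_distance(a: str, b: str, mask: List[bool]) -> int:
    """Count single-nucleotide differences inside the masked regions only."""
    return sum(1 for x, y, m in zip(a, b, mask) if m and x != y)


# Toy data (hypothetical): the mask depends only on the reference and the
# public assemblies, never on the samples being compared.
reference = "AAACCCGGG"
public_assemblies = ["AAACCCTTT", "TTAAACCCAA"]
mask = conserved_mask(reference, public_assemblies, k=3, min_fraction=1.0)
distance = snp_distance("AAACCCGGG", "AATCCCGTG", mask)
```

Because the mask is computed without looking at any sample, adding new samples later does not change the compared regions, so earlier pairwise distances stay valid. That is the property the abstract describes as enabling prospective monitoring.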


Please see the work itself for the complete list of authors.

Publisher's Statement

Originally published in Scientific Reports, 9 (1), art. no. 7808.

This doi no longer works.

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.