Author manuscript; available in PMC: 2017 Jan 4.
Published in final edited form as: Annu Rev Virol. 2016 Aug 3;3(1):173–195. doi: 10.1146/annurev-virology-110615-035747

GENOMIC ANALYSIS OF VIRAL OUTBREAKS

Shirlee Wohl 1,2, Stephen F Schaffner 1,2,3, Pardis C Sabeti 1,2,3
PMCID: PMC5210220  NIHMSID: NIHMS838418  PMID: 27501264

Abstract

Genomic analysis is a powerful tool for understanding viral disease outbreaks. Sequencing of viral samples is now easier and cheaper than ever before, and can supplement epidemiological methods by providing nucleotide-level resolution of outbreak-causing pathogens. In this review, we describe methods used to answer crucial questions about outbreaks, such as how they began and how a disease is transmitted. More specifically, we explain current techniques for viral sequencing, phylogenetic analysis, transmission reconstruction, and evolutionary investigation of viral pathogens. By detailing the ways in which genomic data can help understand viral disease outbreaks, we aim to provide a resource that will facilitate the response to future outbreaks.

Keywords: infectious disease, phylogeny, evolution, transmission, sequencing

INTRODUCTION

Human history is replete with viral disease outbreaks that have devastated communities and entire populations. Famous examples include smallpox, the 1918 Spanish flu pandemic, the ongoing HIV/AIDS pandemic, the 2009 H1N1 pandemic, and the 2014–2015 Ebola virus (EBOV) epidemic in Western Africa. Not all outbreaks reach pandemic status; many are contained by environmental factors or control measures. Regardless of the final number of individuals affected, outbreak investigation always has the same two aims: termination of the current outbreak and prevention of future ones.

For decades, epidemiological methods such as detailed contact tracing and mathematical modeling have been used to support these aims (1–4). While those methods have worked well for stemming outbreaks of low-prevalence diseases like SARS, their effectiveness is limited for large outbreaks, diseases with long latent periods, and outbreaks that occur in remote areas (5). For these kinds of outbreaks, it is difficult to collect the detailed observations needed to parameterize epidemiological models with predictive power.

Applying molecular biology tools to traditional epidemiology has greatly improved outbreak monitoring and prevention for all types of viral diseases. These tools include genotypic and phenotypic methods to determine the specific strain or type of virus circulating in a population. They can be used to improve diagnostics, to guide treatment programs and vaccine development, and to trace the spread of pathogens (6–9).

Moving to full genomic analysis expands our capacity to understand viral outbreaks even further, since nucleotide-level resolution can distinguish isolates of the same viral strain. For example, when the World Health Organization issued a global alert for SARS in 2003, the disease-causing agent was still unknown. Subsequent sequencing of isolates and identification of SARS coronavirus (SARS-CoV) as the pathogen responsible led to development of sequence-based diagnostics necessary for the remarkable containment of the outbreak (1, 10–12). In 1992, viral sequencing was also used to supplement epidemiological investigation, when a patient claimed she had contracted HIV through a dental procedure. Phylogenetic analysis of HIV from the dentist and five of his dental patients showed that the viruses were closely related, making it clear that the dentist had indeed transmitted HIV to his patients (13). In these two examples, whole-genome viral sequencing led to advances in diagnostics and in understanding transmission, respectively, not possible using other methods.

The 2014–2015 Ebola epidemic in Western Africa has provided one of the first applications of near real-time whole-genome viral sequencing to understand a disease outbreak from its onset. Genomic analysis during the outbreak was made possible by recent advances in high-throughput sequencing, computational methods, and data processing. This epidemic also spurred the development of numerous methods for exploiting whole genome sequencing in future outbreaks.

In this review, we compile and describe existing methods for analyzing genomic data from viral disease outbreaks (see Summary Figure). We focus on fundamental questions and how genomic data can be used to answer them. While we include examples from a number of viruses in our review, we use the Ebola epidemic as the primary example throughout, both because rich genomic data is available for that outbreak, and because many of the analyses described here were applied during it.

VIRAL SEQUENCING METHODS

Accurate sequencing is key to producing high-quality genomes for analysis. Dramatic improvements in high-throughput (also known as next-generation) sequencing technologies and in virus-specific sequencing (14–17) in the past decade have enabled sequencing of viruses, known and novel, from all kinds of samples. We briefly review current sequencing technologies and discuss methods for sequence processing.

SEQUENCING METHODS

It is essential that patient samples be processed and sequenced in a way that will provide the highest quality data for downstream analysis. For RNA viruses, timing is especially important: degradation can occur quickly in clinical samples and is common, so the time between sample collection and sequencing should be minimized ((18); see (19) for procedures used in an EBOV diagnostic laboratory). In general, all sample processing should consider both sample preservation and researcher safety (20).

Sequencing itself has progressed from technologies tailored to a specific viral sequence to sequence-independent, high-throughput approaches. Amplicon-based sequencing is the most common sequence-dependent method and was used early in the Ebola outbreak (21). In that study, viral genomes were amplified in long (often ≥ 2 kb) overlapping fragments by reverse-transcriptase polymerase chain reaction (RT-PCR) with EBOV-specific primers; these fragments were then sequenced by Sanger sequencing. This method is popular for detecting and studying viruses because it is fast and can be used to amplify very small amounts of material.

The speed and accuracy of amplicon-based sequencing have made it an effective method for on-site sequencing even in remote field settings (22), and the resulting genomes are sufficient for pathogen identification and basic analysis. However, amplicon-based sequencing does have drawbacks. First, Sanger sequencing is not conducive to the deep coverage needed to detect low-frequency variants. Second, designing PCR primers requires prior knowledge of the viral sequence, which introduces bias and precludes metagenomic analysis. Third, it can be difficult to design primers that produce full-length genomes for all samples, given the high sequence diversity of many viruses. Lastly, degraded samples prevent the full-length amplicon production necessary for obtaining whole-genome sequences.

High-throughput sequencing platforms resolve many of the issues of amplicon-based approaches. These platforms, which produce short reads, are better able to capture fragmented or partially degraded samples. They also allow for the ultra-deep sequencing needed to detect low-frequency within-host variants (23). Sequence independence is essential for studying outbreaks caused by new or unknown pathogens, and for metagenomic analysis. Instead of virus-specific primers, these methods rely on random priming followed by high-throughput sequencing (14, 16, 24). Combining sequence-independent primer amplification with selective RNase H-based digestion of contaminating RNA (mainly host ribosomal RNA) enables rapid, unbiased deep sequencing of viral samples, as was done during the Ebola virus epidemic (25).

Other high-throughput sequencing approaches can contribute to viral genomic analysis. Hybrid selection has been used to enrich the viral content of sequencing libraries with high host contamination even after RNase H digestion (17), and is an active area of development. Refining this technology will improve viral genomic analysis during outbreaks, when sample quality may be variable. Other potentially useful technologies still in development include long-read sequencing, which could allow for phasing of variants, and technologies optimized for rapid on-site sequencing. These cheap and portable approaches (26) are useful for rapid diagnostics, but have high error rates that may preclude some detailed genomic analysis.

SEQUENCE ASSEMBLY AND ALIGNMENT

After sequencing, care should be given to the assembly and alignment of genomes. Some of the necessary steps and best practices for processing high-throughput sequencing reads are shown in Figure 1 (see also Supplemental Table 1). After completing these steps, reads that do not map to the database of possible viruses (step ❷) can be investigated using one of several taxonomic analysis tools (27–29). Such reads can be de novo assembled and further investigated using a nucleotide or protein homology search. For a detailed example of how these methods were used to discover a novel flavivirus and two novel rhabdoviruses, see (30, 31). Alternatively, comprehensive metagenomics pipelines (32, 33) can be used if rapid pathogen identification is the primary goal.

FIGURE 1.

Assembly and alignment pipeline for viral reads in heterogeneous samples. High-throughput sequencing reads are ❶ demultiplexed and filtered for high-quality reads, and ❷ depleted of host reads (130) and mapped to a database of possible viruses. ❸ Reads from each sample are de novo assembled, and ❹ all reads from each sample are mapped onto their own assembly. ❺ The consensus sequence is determined for each sample and then ❻ aligned to all other samples using multiple sequence alignment. See Supplemental Table 1 for available software for each step.
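
Step ❺ can be illustrated with a minimal Python sketch of majority-rule consensus calling. It assumes the reads have already been mapped and padded with gap characters to a common coordinate system; the function name `consensus` and the `min_depth` threshold are illustrative assumptions and are not taken from the software listed in Supplemental Table 1.

```python
from collections import Counter

def consensus(aligned_reads, min_depth=2, gap="-"):
    """Majority-rule consensus over reads padded to the same length.

    Positions covered by fewer than `min_depth` reads are reported as 'N'.
    Ties are broken in favour of the base seen first in that column.
    """
    length = len(aligned_reads[0])
    called = []
    for i in range(length):
        column = [read[i] for read in aligned_reads if read[i] != gap]
        if len(column) < min_depth:
            called.append("N")
        else:
            called.append(Counter(column).most_common(1)[0][0])
    return "".join(called)

# Toy reads (hypothetical data); '-' marks positions a read does not cover.
reads = ["ACGT--", "ACGTTA", "-CGATA", "--GTTA"]
print(consensus(reads))  # ACGTTA
```

Production pipelines additionally weigh base and mapping quality, which this sketch ignores.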

Recombination can affect downstream phylogenetic analysis, so the final sequence alignment should be screened for recombination. Many methods that check for recombination have been compiled into a single software package, RDP4 (34). As described below, there are alternative phylogenetic tools that should be used when recombination is present, but analysis of recombinant viral sequences is still an area of active development.

VARIANT CALLING

Sequence differences between viral genomes mark the evolutionary history and relationships between samples. Single-base substitutions (single nucleotide polymorphisms, or SNPs) are the simplest variants. Given high-quality consensus sequences aligned to a reference, it is relatively easy to manually identify SNPs. However, more complex approaches—such as those implemented in packages like GATK (35) or Samtools (36)—are helpful when samples contain insertions or deletions, or if regions of the genome have poor quality, low coverage, or high diversity.
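
For the simple case described above (high-quality consensus sequences already aligned to a reference), SNP identification amounts to a column-by-column comparison. The following Python sketch is illustrative only; the function name `find_snps` is hypothetical, and it deliberately skips gaps and ambiguous bases rather than attempting the indel handling that packages such as GATK or Samtools provide.

```python
def find_snps(reference, sample, ambiguous="N-"):
    """Return (1-based position, reference base, sample base) per mismatch.

    Both sequences must come from the same alignment (equal length).
    Columns containing a gap or an ambiguous base are skipped.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    snps = []
    for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1):
        if ref_base in ambiguous or alt_base in ambiguous:
            continue
        if ref_base != alt_base:
            snps.append((pos, ref_base, alt_base))
    return snps

print(find_snps("ACGTAC", "ACATNC"))  # [(3, 'G', 'A')]
```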

Individual SNPs should be annotated (classified as nonsense, missense, or intergenic) and located relative to genes and other genomic elements. Many annotation tools are available online, each requiring only a list of SNPs and an annotated reference genome (37–39).
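
To make the annotation step concrete, the sketch below classifies a single SNP against one forward-strand coding region using the standard genetic code. It is a simplified illustration under stated assumptions: the function name `annotate_snp`, the single-gene model, and the extra "synonymous" class are ours, and real annotation tools also handle multiple genes, reverse-strand features, and overlapping reading frames.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def annotate_snp(genome, pos, alt, cds_start, cds_end):
    """Classify a SNP at 1-based `pos` with alternate base `alt`.

    The coding region spans [cds_start, cds_end] (1-based, inclusive) on the
    forward strand. A change from a stop codon to an amino acid is reported
    as 'missense' in this simplified scheme.
    """
    if not cds_start <= pos <= cds_end:
        return "intergenic"
    offset = pos - cds_start                      # 0-based offset inside the CDS
    codon_start = cds_start - 1 + 3 * (offset // 3)
    ref_codon = genome[codon_start:codon_start + 3]
    k = offset % 3                                # position within the codon
    alt_codon = ref_codon[:k] + alt + ref_codon[k + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    return "nonsense" if alt_aa == "*" else "missense"

# Hypothetical 16-base genome; the CDS ATG AAA TGG TAA occupies positions 3-14.
genome = "GGATGAAATGGTAACC"
print(annotate_snp(genome, 6, "G", 3, 14))   # missense  (AAA -> GAA)
print(annotate_snp(genome, 11, "A", 3, 14))  # nonsense  (TGG -> TGA)
```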

At this stage, it is also useful to identify variants within individual samples (intrahost variants, or iSNVs), indicating the presence of multiple viral quasispecies. Powerful tools exist for calling low-frequency variants in heterogeneous viral populations (40, 41). To avoid calling sequencing errors as iSNVs, we suggest discarding variant calls with fewer than five forward or reverse reads and those where the number of reads differs greatly between the forward and reverse strands ((25), see supplemental methods). Because PCR errors during library construction can introduce false variants, replicate libraries should be prepared and sequenced whenever possible to confirm the presence of within-host variants at comparable frequencies.
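
The read-support filter suggested above can be written in a few lines. In this sketch the minimum of five supporting reads per strand comes from the text, whereas the `max_ratio` cutoff used to decide when forward and reverse counts "differ greatly" is an assumed placeholder that should be tuned for each dataset; the function name is hypothetical.

```python
def passes_strand_filter(fwd_alt, rev_alt, min_reads=5, max_ratio=10.0):
    """Keep an iSNV call only if both strands support it and are balanced.

    fwd_alt, rev_alt: counts of variant-supporting reads on each strand.
    max_ratio: assumed upper bound on the larger/smaller strand count.
    """
    if fwd_alt < min_reads or rev_alt < min_reads:
        return False
    return max(fwd_alt, rev_alt) / min(fwd_alt, rev_alt) <= max_ratio

# Hypothetical calls as (forward reads, reverse reads).
calls = {"pos_1024": (12, 9), "pos_2210": (4, 50), "pos_7731": (5, 80)}
kept = [name for name, (f, r) in calls.items() if passes_strand_filter(f, r)]
print(kept)  # ['pos_1024']
```

A statistical strand-bias test could replace the fixed ratio; the fixed cutoff is used here only to keep the example short.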

DETERMINING THE ORIGINS OF AN OUTBREAK

Understanding how and when an outbreak began is critical to curtailing it and to preventing future outbreaks. If an outbreak can be traced to a particular transmission route, steps can be taken to eliminate that route. For example, phylogenetic analysis of human influenza A H5N1 in the 1997 Hong Kong outbreak showed that the virus likely arose through reassortment between an H5N1 virus in terrestrial poultry and a similar virus in quail. This finding led to legislation prohibiting the sale of live quail together with other poultry in Hong Kong (42), and is one of many examples of phylogenetic analysis illuminating the origins of an avian influenza outbreak (43).

Phylogenetic methods all start with the creation of a phylogenetic tree—a reconstruction of the relationship of viral samples to each other—based on nucleotide substitutions in samples from the current outbreak. These phylogenetic relationships can then be used to determine the evolutionary order of sequences and to identify the first cases of an outbreak.

CONSTRUCTING A PHYLOGENETIC TREE

Phylogenetic trees can be constructed using maximum likelihood (44, 45) or Bayesian approaches (46). All methods require only a sequence alignment and a nucleotide substitution model. The nucleotide substitution model describes the rate at which one nucleotide is replaced by another, and is used to estimate the evolutionary distance between sequences. The model is used in calculating the likelihoods of various possible phylogenetic trees, and therefore may greatly affect results (47). A general time reversible (GTR) model is often used for phylogenetic analyses because it is the most general time-reversible model and places no constraints on relative substitution rates or base frequencies (48). Alternatively, several groups have written statistical software to compare substitution models for a given dataset (49).
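
To illustrate how a substitution model turns observed differences into an evolutionary distance, the sketch below uses the Jukes-Cantor (JC69) model rather than GTR. JC69 is the simplest special case (all substitution rates equal, all base frequencies 0.25) and is substituted here only because its distance has a closed form, d = -(3/4) ln(1 - 4p/3), where p is the proportion of differing sites. The function name is hypothetical.

```python
import math

def jc69_distance(seq_a, seq_b):
    """Expected substitutions per site between two aligned sequences (JC69).

    Sites where either sequence has a gap or ambiguous base are ignored.
    Returns infinity when the sequences are saturated (p >= 0.75).
    """
    sites = [(a, b) for a, b in zip(seq_a, seq_b) if a in "ACGT" and b in "ACGT"]
    if not sites:
        raise ValueError("no comparable sites")
    p = sum(a != b for a, b in sites) / len(sites)
    if p >= 0.75:
        return float("inf")
    return -0.75 * math.log(1 - 4 * p / 3)

# One difference in ten sites: p = 0.1, corrected distance slightly above 0.1.
print(round(jc69_distance("ACGTACGTAC", "ACGTACGTAT"), 4))  # 0.1073
```

The corrected distance exceeds the raw proportion of differences because the model accounts for multiple substitutions at the same site.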

When constructing or reading trees it is important to keep confidence values in mind. Confidence in maximum likelihood trees is commonly represented by bootstrap values (50). Bootstrapping estimates uncertainty by sampling from a dataset with replacement. In this case, the bootstrap value for a node is the proportion of bootstrap trees in which that particular branch topology occurs. Although there is some debate about the accuracy of bootstrap values (51), reporting these values, at least for important nodes, is common practice. Confidence values are built into Bayesian phylogenies as posterior probabilities. A Bayesian approach can be thought of as a faster version of a bootstrapped maximum likelihood approach, though the concordance between the two types of confidence values is variable (52).
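
The bootstrap procedure can be sketched on a toy problem. In the example below, alignment columns are resampled with replacement, and support is the fraction of replicates in which a chosen pair of sequences remains the closest pair. Grouping by smallest Hamming distance is a deliberately crude stand-in for the maximum likelihood tree search used in practice, and all names and data are hypothetical.

```python
import random
from itertools import combinations

def hamming(a, b):
    """Number of differing positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))

def closest_pair(alignment):
    """Names of the two sequences with the fewest differences.

    Ties go to the pair that comes first in sorted name order.
    """
    return min(combinations(sorted(alignment), 2),
               key=lambda pair: hamming(alignment[pair[0]], alignment[pair[1]]))

def bootstrap_support(alignment, pair, replicates=1000, seed=0):
    """Fraction of column-resampled alignments in which `pair` groups together."""
    rng = random.Random(seed)
    n_sites = len(next(iter(alignment.values())))
    hits = 0
    for _ in range(replicates):
        # Sample column indices with replacement to build one replicate.
        cols = [rng.randrange(n_sites) for _ in range(n_sites)]
        resampled = {name: "".join(seq[i] for i in cols)
                     for name, seq in alignment.items()}
        if closest_pair(resampled) == pair:
            hits += 1
    return hits / replicates

alignment = {"A": "ACGTACGTAC", "B": "ACGTACGTAT", "C": "ACGAACCTGT"}
print(bootstrap_support(alignment, ("A", "B")))
```

The `pair` argument must be given in sorted name order to match the output of `closest_pair`.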

Both maximum likelihood and Bayesian methods were used to determine the phylogeny of Ebola viruses sequenced during the outbreak (21, 25, 53). These two methods are often used together to check for agreement: major differences in the resulting trees may suggest a complex evolutionary relationship not fully captured by one or more methods.

ROOTING A PHYLOGENETIC TREE

Without further information, a phylogenetic tree will be unrooted: it will show the relationship of branches relative to one another and the overall topology, but it will not identify the base of the tree or the direction of evolution. It thus cannot tell you which samples are ancestors and which are descendants. Since ancestry is very important to determining the origin of an outbreak, it must be determined by rooting the phylogenetic tree.

There are two primary methods of rooting phylogenetic trees: mid-point rooting and outgroup rooting. Mid-point rooting is done by finding the longest tip-to-tip distance in the tree and setting the root half way between these tips. This method assumes that evolutionary rates are constant throughout the tree, meaning the root should be equidistant from all tips (it also assumes contemporaneous sampling). As discussed in the next section, this assumption (known as the molecular clock assumption) is often incorrect. Therefore, viral outbreaks are typically rooted by selecting an outgroup; that is, a set of sequences known to be more distantly related than anything else in the tree. This can be composed of published sequences from previous outbreaks of the same virus, virus sampled from another host species, or a closely related viral species. Outgroup genomes must be distinct from outbreak genomes, but rooting trees using highly divergent sequences can also be problematic (53). If only outbreak sequences are available, it is also possible to use a particularly divergent cluster of outbreak sequences as the outgroup, if one exists.
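
Mid-point rooting as described above can be sketched directly. The example represents an unrooted tree as a dictionary mapping each node to its neighbours and branch lengths (an assumed representation, not the format of any particular phylogenetics package), finds the longest tip-to-tip path, and reports the branch on which the midpoint falls.

```python
def path_between(tree, start, goal):
    """Nodes on the unique path from `start` to `goal` in an unrooted tree."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        for neighbour in tree[node]:
            if neighbour not in path:
                stack.append((neighbour, path + [neighbour]))
    raise ValueError("nodes are not connected")

def midpoint_root(tree):
    """Return (u, v, distance_from_u): the root lies on branch u-v."""
    tips = [node for node, nbrs in tree.items() if len(nbrs) == 1]
    best_len, best_path = -1.0, None
    for i, a in enumerate(tips):
        for b in tips[i + 1:]:
            path = path_between(tree, a, b)
            length = sum(tree[u][v] for u, v in zip(path, path[1:]))
            if length > best_len:
                best_len, best_path = length, path
    half, walked = best_len / 2, 0.0
    # Walk along the longest path until half its length has been covered.
    for u, v in zip(best_path, best_path[1:]):
        if walked + tree[u][v] >= half:
            return u, v, half - walked
        walked += tree[u][v]

# Hypothetical four-tip tree: tips A-D, internal nodes X and Y.
tree = {
    "A": {"X": 0.1}, "B": {"X": 0.2}, "C": {"Y": 0.4}, "D": {"Y": 0.1},
    "X": {"A": 0.1, "B": 0.2, "Y": 0.3},
    "Y": {"C": 0.4, "D": 0.1, "X": 0.3},
}
print(midpoint_root(tree))
```

Here the longest path runs from B to C (length 0.9), so the root is placed on the internal branch X-Y, about 0.25 from X.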

Once an outgroup is selected, the phylogenetic tree is reconstructed using these additional sequences; the root is the point of divergence between the outgroup sequences and the rest of the tree (Figure 2). In some cases the root of the tree is ambiguous, as in the recent Ebola virus outbreak. Dudas and Rambaut (53) explain how viral substitution rates and linear regression can be used to select the most likely root for a viral outbreak.

FIGURE 2.

Rooting phylogenetic trees. Ebola virus (EBOV) sequences illustrate the importance of correctly rooting trees. Each point represents one sequence from the outbreak indicated by its color (scale bar = nucleotide substitutions per site). (a) Maximum likelihood tree rooted on the Zaire 1976 branch (shown to be the more likely root in (25, 53)). (b) The same tree rooted on the Guinea 2014 branch. Interpretation of the ancestral relationships of a single set of samples changes dramatically with root selection.

ESTIMATING THE START DATE OF AN OUTBREAK

Rooted trees suggest the infection history of an outbreak; outbreak sequences close to the root of the tree (separated by fewer nodes) came earlier in the outbreak. If sampling dates are available, branch lengths can be converted from units of nucleotide substitutions to units of time, which can be used to date the true origin of the outbreak. This is done using a strict molecular clock model, which assumes that nucleotide substitutions accumulate at a constant rate (54). The number of substitutions on each branch of a tree with dated tips can be used to estimate the substitution rate, which then can be used to extrapolate backwards to the date of origin of a particular outbreak strain (55). Maximum likelihood methods (56) can calculate substitution rates given a phylogenetic tree and sampling dates.
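
Under a strict clock, the extrapolation described above reduces to a straight-line fit: root-to-tip divergence is regressed on sampling date, the slope estimates the substitution rate, and the date at which the fitted line crosses zero divergence estimates the origin. The sketch below uses ordinary least squares as a simple stand-in for the maximum likelihood methods cited; the function name and the data are invented for illustration.

```python
def clock_regression(dates, divergences):
    """Least-squares fit of root-to-tip divergence against sampling date.

    dates: sampling times in decimal years.
    divergences: root-to-tip distances in substitutions per site.
    Returns (rate in substitutions/site/year, estimated date of the root).
    """
    n = len(dates)
    mean_t = sum(dates) / n
    mean_d = sum(divergences) / n
    sxx = sum((t - mean_t) ** 2 for t in dates)
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, divergences))
    rate = sxy / sxx
    intercept = mean_d - rate * mean_t
    origin = -intercept / rate  # date at which divergence extrapolates to zero
    return rate, origin

# Invented data lying exactly on a line with rate 0.002 and origin 2014.0.
dates = [2014.3, 2014.4, 2014.5, 2014.6]
divergences = [0.0006, 0.0008, 0.0010, 0.0012]
rate, origin = clock_regression(dates, divergences)
print(round(rate, 6), round(origin, 3))
```

Real data scatter around the line, and the amount of scatter is itself a useful check on whether a strict clock is a reasonable assumption.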

While a helpful simplification, the strict molecular clock does not always accurately model real viral evolution; evolutionary rates can vary over time, space, or between different branches. To address this, more flexible models have been developed that allow for variation in the substitution rate over time (57). The BEAST (Bayesian Evolutionary Analysis by Sampling Trees) package (58) implements a Bayesian Markov Chain Monte Carlo (MCMC) method to determine changing substitution rates over time; this is referred to as a relaxed molecular clock. This framework can be used to co-estimate the phylogeny and divergence times given sequence data and sampling dates.

During the Ebola outbreak, BEAST was used to estimate when outbreak viruses split from lineages documented in other outbreaks (53), and to estimate the date of entry of the virus into Sierra Leone from Guinea (25, 59).

DETERMINING THE CAUSE OF AN OUTBREAK

The same phylogenetic methods can be used to determine the type(s) of transmission causing an outbreak (human-human or animal-human), but this analysis requires sequences from appropriate hosts and/or time periods (Figure 3). For example, analysis of Ebola patient samples showed that there was substantial genetic variation between EBOV outbreaks, but limited variation within each outbreak. This suggested that the virus evolves separately in an animal reservoir, and that a single zoonotic transmission was responsible for the start of each outbreak. This hypothesis was supported by divergence times calculated with BEAST: the lineages from the two most recent outbreaks diverged from a common ancestor significantly before the start of either outbreak (25).

FIGURE 3.

Tree topology illuminates the nature of an outbreak. (a) EBOV tree. Sequences from previous outbreaks are colored in distinct shades of green (see Figure 2). Selected sequences from the Sierra Leone outbreak highlight the low diversity within the outbreak, and the development of new clades (SL1–4, defined in (18, 25)) from a single recent ancestor. This topology suggests that each outbreak began with a single zoonotic transmission but was subsequently sustained by human-to-human transmission. (b) Lassa virus (LASV) tree containing S segment sequences from both human (circle nodes) and M. natalensis (rodent nodes) hosts. Samples are from Sierra Leone (109), where LASV is endemic. Sequences do not cluster by time or by host, indicating frequent animal-to-human transmission and a lack of discrete outbreaks.

CHALLENGES IN ESTIMATING THE ORIGIN OF AN OUTBREAK

Although phylogenetic tools have been used successfully to understand many viral outbreaks, significant challenges remain in correctly establishing the origin of an outbreak. A detailed review of current challenges in phylogenetic methods can be found in (60). Here we highlight those challenges particularly relevant for determining the origin of a viral outbreak.

First, although relaxed molecular clocks allow for some rate variation, current models may still fail to capture the full variation in evolutionary rates. For example, an analysis of pandemic HIV-1 group M found that the time to the most recent common ancestor varies significantly when subtypes are analyzed separately compared to jointly, perhaps because closely related viral lineages have different substitution rates (61).

Additionally, the methods described above cannot account for recombination that occurs in many viruses, since the ancestry of these viruses cannot be represented by a simple branching process. Instead, different parts of a single sample’s genome can be the product of different genealogical trees, and are better modeled by a phylogenetic network or ancestral recombination graph (ARG) that allows for complicated evolutionary relationships (62–64). Because many phylogenetic tools cannot account for recombination, it is important to restrict analysis to parts of the viral genome or tree where recombination is limited. Important advances in the field will come from continued development of phylogenetic tools that can incorporate viral recombination.

Lack of data also poses significant barriers to analyzing many viral outbreaks. For example, lack of sampling dates essentially rules out divergence time estimates, and lack of informative outgroup sequences—perhaps due to limited past sequence data, or because a zoonotic reservoir has yet to be determined—prevents accurate rooting of a phylogenetic tree. Non-random sampling over time or space may significantly bias results (60). Finally, understanding the ecological factors leading to an outbreak at a particular place and time requires detailed surveys of the outbreak location, both during and before its start (65, 66). Without detailed epidemiological surveys, it may be impossible to determine the index case of a viral outbreak.

VIRAL DISEASE TRANSMISSION

Understanding the spread of the virus, including the mechanism, speed and direction of spread, is essential to controlling a viral outbreak. In many cases, this is done with epidemiological modeling. During the Ebola epidemic, many groups used case counts to estimate epidemiological parameters and the eventual size of the epidemic (67–77). The varied approaches taken by these groups illustrate that there is no standard way to parameterize and use these epidemiological models. Additionally, in the absence of very detailed contact tracing and other epidemiological metrics, these models often cannot capture the complexity of an outbreak.

Even without whole-genome sequencing, studying different viral strains as they move through time and space can be used to determine transmission patterns, especially for viruses with distinct subtypes like HIV or influenza. However, this type of analysis does not always have enough resolution to answer important questions about transmission routes. In the case of possible HIV transmission from a dental procedure, as described in the introduction, both contact tracing and molecular methods failed to prove a link between dentist and patient: contact tracing led to the dentist, but was not conclusive because there was no evidence of shared bodily fluids. Similarly, two individuals with the same HIV subtype do not indicate a direct transmission link. Sequencing of the viruses from the patient, dentist, and several local individuals finally provided significant evidence for direct transmission (13).

Using genetic data to reconstruct transmission routes is also especially important for post-outbreak cases – those who are infected after all known transmission routes have died out. For example, an individual contracted the Ebola virus 68 days after human-to-human transmission linked to the 2014–2015 Ebola epidemic was declared to have ended (78). Because the infected individual had no known Ebola-positive contacts, and because molecular diagnostic methods could not differentiate between outbreak lineages, whole-genome sequencing was needed to identify the most likely source of infection.

As shown by these examples, viral genomic sequencing can be a vital tool for detailed transmission reconstruction when contact tracing data is missing or uninformative. Genetic data can also be used to identify general transmission patterns, and to estimate epidemiological parameters.

TRANSMISSION RECONSTRUCTION

Reconstructed viral transmission routes can reveal modes of transmission that are important for containment and prevention (6). Depending on the depth of sampling, rooted trees correspond, either coarsely or in detail, to the transmission history of the virus (79–82) (Figure 4).

FIGURE 4.

Determining virus transmission. Transmission during the 2003 SARS outbreak in Singapore, for which both sequence and contact tracing data are available. (a) Maximum likelihood tree rooted on TOR2, the earliest reported case. The four major branches (blue boxes) roughly correspond to geographic origin (top to bottom: China, Taiwan, Singapore, Singapore). (b) Transmission tree reconstruction from sequences and sample collection dates (created using outbreaker (85); generation time: gamma distribution with mean = 8.4 and sd = 3.8, based on values from (1)). Red and yellow circles correspond to the two Singapore clades identified in (a); lined red circles are samples with only sequence data (no contact tracing). Arrows are labeled with (number of SNPs between samples) / (posterior probability of transmission). (c) Transmission tree created during the SARS outbreak by contact tracing, as reported by (131). Gray circles are unreported cases assumed to be part of the transmission chain. Comparison of panels (a-c) shows that the three methods generate similar relationships between samples.

Formal methods have been developed to reconstruct transmission chains from genetic data (83); several of these methods combine genetic and epidemiological data (e.g. sampling dates and locations) into a single likelihood function that is used to sample possible transmission trees (79, 84–86). Jombart et al (85, 87) have developed an R package that constructs transmission trees from genetic and any available epidemiological data (Figure 4b).
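
The idea of combining genetic and temporal information in one likelihood can be illustrated with a toy score. The sketch below is not the model implemented in any of the cited packages: it assumes sampling dates stand in for infection dates, that the generation time follows a gamma distribution (mean 8.4, sd 3.8, the values quoted in the Figure 4 caption), and that the SNP count between linked cases is Poisson with an assumed mean `snps_per_link`. All function names and data are hypothetical.

```python
import math

def gamma_pdf(x, mean, sd):
    """Gamma density parameterised by mean and standard deviation."""
    if x <= 0:
        return 0.0
    shape = (mean / sd) ** 2
    scale = sd ** 2 / mean
    return x ** (shape - 1) * math.exp(-x / scale) / (math.gamma(shape) * scale ** shape)

def poisson_pmf(k, lam):
    """Poisson probability of observing k events with mean lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

def likely_infector(case, sample_day, snp_dist,
                    gen_mean=8.4, gen_sd=3.8, snps_per_link=1.0):
    """Score every other case as the infector of `case`.

    sample_day: dict of case -> sampling day.
    snp_dist: dict of frozenset({a, b}) -> SNP differences between a and b.
    Returns (best candidate, normalised scores), or (None, {}) if no
    candidate was sampled before `case`.
    """
    scores = {}
    for other, day in sample_day.items():
        if other == case:
            continue
        lag = sample_day[case] - day            # time score
        snps = snp_dist[frozenset((case, other))]  # genetic score
        scores[other] = gamma_pdf(lag, gen_mean, gen_sd) * poisson_pmf(snps, snps_per_link)
    total = sum(scores.values())
    if total == 0:
        return None, {}
    return max(scores, key=scores.get), {k: v / total for k, v in scores.items()}

sample_day = {"P1": 0, "P2": 7, "P3": 9, "P4": 16}
snp_dist = {
    frozenset(("P1", "P2")): 1, frozenset(("P1", "P3")): 1,
    frozenset(("P2", "P3")): 2, frozenset(("P1", "P4")): 3,
    frozenset(("P2", "P4")): 1, frozenset(("P3", "P4")): 3,
}
best, probs = likely_infector("P4", sample_day, snp_dist)
print(best)  # P2
```

In this toy data P3 was sampled closer to the typical generation time before P4, but P2 is genetically much closer, so the combined score favours P2. Full methods sample entire transmission trees rather than scoring each case independently.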

Within-host genomic data has recently been recognized as an essential component of phylogenetic analysis and transmission tree reconstruction (55, 86, 88). One challenge of incorporating this type of data is that it requires an understanding of the characteristic within-host dynamics for a given virus before it can be used to effectively inform statistical and epidemiological models. Specifically, it is important to know the underlying viral mutation rate and the typical within-host substitution rate, as well as how much diversity is transmitted during an infection event (the bottleneck size).

Because studying within-host dynamics requires both high sequencing depth and longitudinal sampling, limited information exists for most viruses. Within-host studies are most common in well-studied chronic viral infections such as HIV (89, 90), although similar studies in other viruses are beginning to appear (91). In the same vein, the average size of the transmission bottleneck in HIV is known (92), but is still under investigation in most other viruses. Deep sequences of Ebola viruses published during the 2014–2015 outbreak suggest that the bottleneck size is greater than one (18, 93), but more precise estimates are still needed. Within-host viral dynamics studies, along with the development of robust phylogenetic methods that incorporate within-host variation, are a crucial next step in outbreak research.

CALCULATING THE REPRODUCTION NUMBER

The basic reproduction number (R0), the average number of secondary cases arising from a single infection in a fully susceptible population, is a useful measure of the infectivity of a pathogen and is usually estimated from epidemiological models. However, this number can also be estimated from a detailed transmission chain or from genomic data (94, 95). This value often frames the discussion about containment for a disease outbreak and can be used to predict outbreak dynamics and eventual size in the presence or absence of various control measures (in the Ebola outbreak: (69, 96)).
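
When a transmission chain is available, a crude reproduction number can be obtained by counting secondary cases. The sketch below averages those counts over a user-chosen set of cases; restricting the average to early cases whose onward transmission has been fully observed is our assumption (averaging over every case in a finished chain necessarily gives a value below one). The function names and the chain are hypothetical.

```python
def secondary_cases(infector_of):
    """Count secondary cases per individual.

    infector_of maps each case to its infector (None for an index case).
    """
    counts = {case: 0 for case in infector_of}
    for source in infector_of.values():
        if source is not None:
            counts[source] += 1
    return counts

def mean_reproduction_number(infector_of, cases):
    """Average number of secondary cases among the selected `cases`."""
    counts = secondary_cases(infector_of)
    return sum(counts[c] for c in cases) / len(cases)

# Hypothetical chain: A infects B and C; B infects D, E, F; C infects G.
chain = {"A": None, "B": "A", "C": "A", "D": "B", "E": "B", "F": "B", "G": "C"}
print(secondary_cases(chain))
print(mean_reproduction_number(chain, ["A", "B", "C"]))  # 2.0
```

The per-individual counts also expose variation between cases, which a single average conceals.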

Calculation of epidemiological parameters like R0 is part of the new and growing field of phylodynamics, the study of infectious disease behavior that arises from a combination of evolutionary and epidemiological processes (89). Incorporating epidemiological metadata can enhance genetic analysis and vice versa in outbreak situations.

CHALLENGES IN UNDERSTANDING TRANSMISSION

Although joint evolutionary and epidemiological analysis has greatly advanced the field of outbreak investigation, there are still challenges associated with determining the route and rate of viral spread. Major hurdles include sampling bias and the difficulty of allowing for spatial and temporal complexity in phylodynamic models (60). For example, Lloyd-Smith et al (97) highlight problems with using the same reproduction number for all individuals and the resulting implications for outbreak control.

VIRAL EVOLUTION

Beyond epidemiological questions, sequencing is well suited to study viral evolution during an outbreak. Of particular interest are the appearance and spread of mutations that affect viral fitness or virulence. That said, it is important to remember that it is often impossible to draw meaningful conclusions from outbreak data without experimental validation. With this in mind, we discuss the viral mutation and substitution rates, and review metrics of selection in viral populations. We will focus on metrics relevant to understanding an outbreak. A more detailed discussion of viral mutation and substitution rates in particular can be found in Duffy et al (98).

THE MUTATION AND SUBSTITUTION RATES

The mutation rate is a major determinant of the overall rate of evolutionary change during and between outbreaks; it is the number of genetic mutations that occur per viral genome replication. It is largely determined by a virus’s biological properties, such as the fidelity of its polymerase, the speed at which it replicates its own genome, and whether the genome is RNA or DNA (98). In general, RNA viruses mutate fastest and DNA viruses slowest. The mutation rate must be measured experimentally because natural selection affects the number of mutations identified in genetic data. For mutation rate estimates for specific viruses, see Drake (99) and Drake and Hwang (100).

The mutation rate should not be confused with the viral substitution rate, which is the rate at which nucleotide substitutions accumulate in a viral lineage. This rate is determined by the mutation rate and by other factors, including natural selection and the effective viral population size. This is the rate most commonly discussed in an outbreak situation, both because it can be calculated from sequence data and because it can be used to understand selective pressures on a viral population during an outbreak. For example, it may be useful to compare the substitution rate within an outbreak to that in a zoonotic reservoir. The virus should have the same intrinsic mutation rate in both hosts, so differences in substitution rate could be due to selection.

While calculating the viral mutation rate requires careful experimentation, the substitution rate can be calculated given a phylogenetic tree and sampling dates. This can be done with maximum likelihood methods (56) or Bayesian methods (58). A Bayesian method like BEAST is more statistically rigorous than most maximum likelihood implementations because it can allow the substitution rate to vary between branches, but is computationally intensive (98). For approximate substitution rates for various viruses, see Jenkins et al (101).
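A quick first approximation of the substitution rate, often used as a sanity check before running the methods above, is a linear regression of root-to-tip genetic distance on sampling date. The sketch below is that simpler heuristic, not the maximum likelihood method of (56) or a BEAST analysis; it assumes that root-to-tip distances (substitutions per site) have already been read from a rooted tree, and it ignores the shared ancestry between tips. The function name and input values are hypothetical.

```python
def root_to_tip_rate(dates, distances):
    """Least-squares fit of root-to-tip distance against sampling date.

    dates:     sampling dates as decimal years
    distances: root-to-tip distances in substitutions per site
    Returns (rate per site per year, estimated date of the root).
    """
    n = len(dates)
    if n < 2 or n != len(distances):
        raise ValueError("need two or more dates with matching distances")
    mean_t = sum(dates) / n
    mean_d = sum(distances) / n
    sxx = sum((t - mean_t) ** 2 for t in dates)
    if sxx == 0:
        raise ValueError("all samples share one date; no temporal signal")
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, distances))
    rate = sxy / sxx                      # slope of the regression line
    intercept = mean_d - rate * mean_t
    # The x-intercept (distance zero) approximates the date of the root.
    root_date = -intercept / rate if rate > 0 else None
    return rate, root_date


if __name__ == "__main__":
    # Made-up values for illustration only.
    rate, root = root_to_tip_rate([2014.0, 2014.5, 2015.0],
                                  [0.001, 0.002, 0.003])
    print(rate, root)
```

A non-positive slope suggests that the data contain too little temporal signal for rate estimation, in which case the root date is returned as `None`.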

The caveat to all of these methods is that they assume all substitutions are fixed in the population. Because many mutations on recent branches are mildly deleterious and will disappear from the population over time, the substitution rate for any tree containing recent samples may be artificially high. This situation is common during outbreaks, where a majority of sampled viral genomes may sit on terminal branches.

RATE-BASED TESTS FOR SELECTION

During an outbreak, identification of variants that confer a fitness benefit to the virus is a key concern. Variants that affect fitness are likely under selection in a population. A well-known statistical test for selection is the dN/dS (or Ka/Ks or ω) test (Figure 5). This test compares the rates of non-synonymous (dN) and synonymous (dS) substitutions in some region of the genome. In the absence of selection, these two rates should be the same (corrected for the frequency of each type of site). Non-synonymous substitutions are more likely to affect the resulting protein, and are therefore more likely to influence fitness. Therefore, the ratio of non-synonymous to synonymous changes can indicate selection: dN/dS > 1 suggests positive selection, and dN/dS < 1 suggests negative or purifying selection.
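To make the site correction concrete, the following sketch computes a crude pairwise dN/dS from two codon-aligned sequences using a simplified counting scheme in the style of Nei and Gojobori. It is an illustration under stated assumptions, not the method used for Figure 5: it assumes the standard genetic code, treats changes to stop codons as non-synonymous, skips codons that differ at more than one position, and applies no correction for multiple substitutions at a site. All function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_sites(codon):
    """Expected number of synonymous sites in one codon (between 0 and 3)."""
    aa = CODON_TABLE[codon]
    total = 0.0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == aa:
                total += 1 / 3  # one of three possible changes at this site
    return total


def dn_ds(seq_a, seq_b):
    """Crude pairwise dN/dS; returns None when the ratio is undefined."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3:
        raise ValueError("sequences must be codon-aligned and equal length")
    syn_sites = nonsyn_sites = 0.0
    syn_diffs = nonsyn_diffs = 0
    for i in range(0, len(seq_a), 3):
        a, b = seq_a[i:i + 3].upper(), seq_b[i:i + 3].upper()
        if a not in CODON_TABLE or b not in CODON_TABLE:
            continue  # gaps or ambiguous bases
        if "*" in (CODON_TABLE[a], CODON_TABLE[b]):
            continue  # stop codons
        n_diff = sum(x != y for x, y in zip(a, b))
        if n_diff > 1:
            continue  # simplification: multi-change codons are skipped
        s = (synonymous_sites(a) + synonymous_sites(b)) / 2
        syn_sites += s
        nonsyn_sites += 3 - s
        if n_diff == 1:
            if CODON_TABLE[a] == CODON_TABLE[b]:
                syn_diffs += 1
            else:
                nonsyn_diffs += 1
    if syn_sites == 0 or nonsyn_sites == 0 or syn_diffs == 0:
        return None
    p_n = nonsyn_diffs / nonsyn_sites  # non-synonymous changes per site
    p_s = syn_diffs / syn_sites        # synonymous changes per site
    return p_n / p_s


if __name__ == "__main__":
    print(dn_ds("GCTAAA", "GCCGAA"))
```

In practice, codon-model software that accounts for phylogeny and multiple substitutions is preferable; the sketch only shows why raw counts must be normalized by the number of synonymous and non-synonymous sites.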

FIGURE 5.

Rate-based tests for selection in viruses. Various tests identify signals of selection at different genomic scales. (a) dN-dS scores for every codon in the hemagglutinin (HA) gene of H5N1 influenza A (calculated using (104)). The highest dN-dS scores (red) indicate codons most likely under positive selection. (b) log(dN/dS) for each EBOV gene and for the mucin-like region of the glycoprotein (GP) (values from (18)). (c) Synonymous constraint for every codon position in the West Nile virus genome (sliding window = 20 nucleotides) (110). Red bars mark regions of excess constraint. Asterisks mark two known RNA structural elements: a hairpin in the capsid gene and a pseudoknot element within non-structural protein 2A (orange = structural proteins, yellow = non-structural proteins).

This test requires only a codon alignment and is often applied to viral sequences to identify domains or genes under selection (102–104). However, it was originally developed to analyze sequences from divergent species, not to detect selection within a single population (105). The dN/dS ratio is not applicable within single populations, and the results obtained from this test are often misleading when applied to microbes (106). Therefore, while this ratio can still be used to analyze sufficiently divergent, separately-evolving outbreaks, users should be wary of using this test to detect selection within a single outbreak population.

If data from other outbreaks is available, the dN/dS statistic can be used to identify sites or regions under selection. During the Ebola epidemic, several groups found dN/dS > 1 in the glycoprotein gene (GP)—or more specifically, in the disordered, mucin-like region of GP—when including sequences from all known Ebola virus outbreaks (18, 107) (Figure 5b). This is unsurprising because GP encodes the envelope protein: as the only surface-exposed protein on the viral particle, GP is the target of host cell antibodies. Because of this biology, it is suspected that GP undergoes diversifying selection—or relaxed purifying selection—as a response to host immune pressure (108).

In general, comparing non-synonymous to synonymous mutations between sites or between timescales (i.e. comparing inter- and intrahost substitutions, as in (109)) can be used to suggest regions under selection. It is possible to identify selection using only synonymous substitutions by looking for regions of excess synonymous constraint in a virus (110) (Figure 5c). In viruses, many protein-coding regions contain overlapping or embedded functional elements. Because any type of substitution may disrupt an overlapping element, these regions are often characterized by an unusually low synonymous substitution rate. This method was used during the Ebola epidemic to find a constrained region in a known editing site.

OTHER TESTS FOR SELECTION

Tests based on mutation frequency spectra and tree topology can also be used to identify selection in single populations. The basic principle behind these methods is that selective pressure on a population leaves a distinct mark on overall genetic diversity and tree symmetry (89, 111, 112). Statistics that test for selection, or non-neutrality, using these principles include Tajima’s D, Fu and Li’s D, and tree-imbalance metrics (113).

Tajima’s D is a statistic that compares the average number of pairwise differences between sequences to the total number of variable sites within the set of sequences (114). A negative value of Tajima’s D indicates an excess of low frequency polymorphisms compared to a neutral model, and may be due to purifying selection or population size expansion. Conversely, a population size reduction (or bottleneck) and balancing selection keep variants at intermediate frequencies, which translates to a positive value for Tajima’s D. Fu and Li’s D (115) applies this idea to the phylogeny of these sequences and compares the number of mutations on newer, external branches to the total number of mutations on all branches. As with Tajima’s D, a basic understanding of population genetics can be used to interpret the result of this test. For example, under purifying selection, there is likely to be an excess of mutations on external branches because deleterious mutations are generally purged from the population before they can be passed on.
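The comparison underlying Tajima's D can be written out directly. The sketch below follows the standard formula from (114), comparing the mean number of pairwise differences with the number of segregating sites scaled by a harmonic sum. It assumes an alignment of equal-length sequences with no gaps or ambiguous bases; the function name is hypothetical.

```python
import math
from itertools import combinations


def tajimas_d(seqs):
    """Tajima's D for a list of aligned, equal-length sequences.

    Returns None when there are no segregating sites.
    """
    n = len(seqs)
    if n < 4:
        raise ValueError("need at least 4 sequences (variance is zero below)")
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to equal length")

    # S: number of alignment columns with more than one observed state.
    seg = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    if seg == 0:
        return None

    # pi: mean number of differences over all pairs of sequences.
    pair_diffs = [sum(x != y for x, y in zip(a, b))
                  for a, b in combinations(seqs, 2)]
    pi = sum(pair_diffs) / len(pair_diffs)

    # Constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    return (pi - seg / a1) / math.sqrt(e1 * seg + e2 * seg * (seg - 1))


if __name__ == "__main__":
    # Toy alignments: only singletons, then only intermediate-frequency variants.
    print(tajimas_d(["AAAA", "TAAA", "ATAA", "AATA"]))
    print(tajimas_d(["AA", "AA", "TT", "TT"]))
```

The first toy alignment contains only singleton variants and yields a negative value; the second contains only intermediate-frequency variants and yields a positive value, matching the interpretation given above.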

Three measures of tree imbalance are commonly used to test for selection: BI (116), the cherry count (117), and Colless’s tree imbalance index (118). More asymmetry than expected in a phylogenetic tree suggests non-neutral evolution. These methods have been successfully used to understand selection in HIV, influenza, and other viruses (113, 119).
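Two of these measures are simple to compute from tree shape alone. The sketch below returns the unnormalized Colless index (the sum, over internal nodes, of the absolute difference in tip counts between the two child subtrees) and the cherry count (the number of internal nodes whose two children are both tips). It assumes a rooted, strictly binary tree represented as nested two-element tuples with strings as tips; this representation and the function name are hypothetical, and testing for non-neutrality would additionally require comparison against a null distribution, which is not shown.

```python
def tree_shape(tree):
    """Return (tip count, Colless index, cherry count) for a rooted binary tree.

    A tree is either a tip (any non-tuple value) or a 2-tuple of subtrees.
    """
    if not isinstance(tree, tuple):
        return 1, 0, 0  # a single tip
    if len(tree) != 2:
        raise ValueError("tree must be strictly binary")
    left_n, left_colless, left_cherries = tree_shape(tree[0])
    right_n, right_colless, right_cherries = tree_shape(tree[1])
    is_cherry = 1 if left_n == 1 and right_n == 1 else 0
    return (
        left_n + right_n,
        left_colless + right_colless + abs(left_n - right_n),
        left_cherries + right_cherries + is_cherry,
    )


if __name__ == "__main__":
    balanced = (("A", "B"), ("C", "D"))
    ladder = ((("A", "B"), "C"), "D")
    print(tree_shape(balanced))
    print(tree_shape(ladder))
```

For four tips, the fully balanced tree has a Colless index of 0 and two cherries, whereas the ladder-like tree has a Colless index of 3 and one cherry.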

The major drawback of both frequency spectra and tree imbalance methods is that it can be hard to differentiate between selection and epidemiological effects such as changing population size. For example, a very negative Tajima’s D or Fu and Li’s D can be due to exponential growth rather than non-neutral evolution. Drummond and Suchard (113) address this problem by incorporating a demographic model when analyzing three RNA virus datasets, and show that it is possible to use these tests to identify selective pressure on viral populations.

CHALLENGES IN VIRAL EVOLUTION

Detecting selection in viruses is challenging because most statistical methods have been created for the comparison of divergent populations or species, rather than analysis of a single population that may be rapidly evolving and expanding. Additionally, viruses are very biologically diverse and have highly variable mutation and substitution rates. This makes it difficult to use the same selection tests for all viruses. For example, slow-mutating viruses, which usually have low substitution rates, may require extended sampling to achieve the population diversity needed to identify evolutionary trends. Unfortunately, not all viruses and datasets make good subjects for evolutionary analysis, and even when they do, the results may be relatively uninformative or uninteresting.

FUNCTIONAL VARIATION

One important role of genomic data is to inform experimental studies, which are necessary for understanding the biology of pathogenic viruses. In many cases, genetic analysis of outbreak sequences generates hypotheses about particular regions of a virus that may play a role in transmission or pathogenesis. Validating or refuting these hypotheses experimentally leads to a more complete picture of the virus, which may directly inform treatment and prevention measures or be used to improve epidemiological and evolutionary models.

Immediately after sequencing, viral samples can be used to answer one pressing question: whether mutation(s) have impaired the ability of clinical diagnostic tools to detect the virus. This is particularly a concern for the real-time PCR assays commonly used for viral detection, since SNPs and other mismatches in primer binding sites have been shown to greatly reduce assay performance (120). For example, during the Ebola epidemic, various groups periodically compared the most up-to-date list of mutations in the Ebola virus genome with recognition sites for diagnostic probes, as well as for existing and candidate therapeutics (121–124). Mutations in the binding regions of diagnostics or therapies should be carefully tested experimentally to ensure that binding still occurs.
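The comparison described above amounts to an interval lookup. The sketch below flags mutations that fall inside the binding sites of diagnostics or therapeutics, assuming that mutation positions and binding-site coordinates are given in the same reference genome coordinates (1-based, inclusive). The site names and coordinates are invented for illustration and do not correspond to any real assay.

```python
def mutations_in_sites(mutation_positions, binding_sites):
    """Map each binding site to the mutation positions that fall inside it.

    mutation_positions: iterable of 1-based genome positions
    binding_sites:      dict of name -> (start, end), inclusive coordinates
    Sites without any overlapping mutation are omitted from the result.
    """
    hits = {}
    for name, (start, end) in binding_sites.items():
        inside = sorted(p for p in mutation_positions if start <= p <= end)
        if inside:
            hits[name] = inside
    return hits


if __name__ == "__main__":
    # Hypothetical coordinates, for illustration only.
    sites = {"probe_A": (100, 125), "primer_F": (300, 320)}
    observed = [50, 110, 125, 321]
    print(mutations_in_sites(observed, sites))
```

Any site returned by such a check is a candidate for experimental testing; an overlap alone does not show that binding is impaired.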

Broadly, the results of phylogenetic and evolutionary analyses can be used to identify variants most likely to have a functional effect on the virus. For example, clade-defining mutations—mutations shared by large clusters on a phylogenetic tree—are prime candidates for experimentation. These mutations may have fixed within a cluster of samples simply by genetic drift and patterns of transmission, but could also represent sites under strong positive selection. For example, genomic analysis of Ebola virus sequences demonstrated the presence of four viral lineages circulating in Sierra Leone, each defined by one to four deviations from the reference genome, that rose to prevalence in the population at some point during the outbreak (18, 25, 125, 126). Because of their prominence, these mutations were targeted for experimental study soon after the outbreak started (127).

Variants or genomic regions identified by the evolutionary analyses described in the previous section should also be considered for experimental testing. Analyses suggesting that the glycoprotein in Ebola virus might be under selection are an excellent example of how genomic analyses were not able to definitively classify selective pressures, but were able to identify the most promising region for functional validation.

Another common question during an outbreak is whether mutations correlate with clinical outcomes. Therefore, before conducting experiments, it may be informative to explore mutations in relation to clinical and other types of data. This requires additional data, such as information about symptoms, survival, viral load, etc. Correlations will not prove causation, but can be used to refine a set of mutations for experimental analysis, and to suggest a function or mechanism that can be tested experimentally.
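One simple way to screen for such correlations is a test of association on a 2x2 table, for example counts of fatal and non-fatal cases with and without a given mutation. The sketch below implements a two-sided Fisher's exact test using only the standard library. The choice of test is an assumption made here for illustration and is not taken from the studies discussed; it also treats samples as independent, which is violated when a mutation marks a lineage, and it does not adjust for confounders such as viral load or sampling time.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Example layout: rows = mutation present / absent,
                    columns = fatal / non-fatal outcome.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all margins held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as observed;
    # the small relative tolerance guards against floating-point ties.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


if __name__ == "__main__":
    # Made-up counts for illustration only.
    print(fisher_exact_two_sided(3, 0, 0, 3))
```

A small p-value from such a screen is a reason to prioritize a mutation for experimental follow-up, not evidence of a causal effect.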

REMAINING CHALLENGES IN VIRAL GENOMICS

Genomic analysis can answer many urgent questions during an ongoing viral outbreak, including ones related to the origin of the outbreak, how the virus is transmitted, and ways in which the virus might be evolving. This was successfully done during the Ebola outbreak, and the same techniques are being applied to past and ongoing outbreaks of other infectious diseases. However, all of these analyses are limited by the quality and availability of data.

Data quality may be improved by updated sample preparation and sequencing methods, such as improved hybrid selection for low quality samples. These methods are especially important for viruses that are present at low titer, such as Zika. To best utilize sequencing data, many of the analyses discussed could be further refined with better information about the virus itself (128). For example, biological investigation of the evolutionary and transmission processes unique to specific viruses would improve the quality of many within-outbreak analyses.

Although these technical challenges remain, logistical issues are a major barrier to effective outbreak response, since phylogenetic methods depend on specific types of data: determining the time an outbreak started is difficult if a suitable outgroup is not available, reconstructing transmission is impeded by inaccurate or missing sample dates, and all of these techniques are limited by sparse sampling during an outbreak.

Missing data has been a major challenge in viral genomics largely because the usefulness of real-time sample collection and sequencing for outbreak control was not recognized until recently. Now, with easier, cheaper sequencing and the development of computational methods to harness that sequence data, it should be evident that genomic data will be a powerful tool in understanding and controlling future outbreaks.

Even when data is collected, decentralized sample collection and analysis means that it may not be readily available. This problem was highlighted in the Ebola epidemic, during which many different groups were conducting studies all over Western Africa, and the data did not always become immediately available. Sharing outbreak data is a necessary component of an efficient response (129). Another lesson from the Ebola epidemic is the usefulness of extensive collaboration. Many of the techniques discussed in this review are computationally intensive (BEAST, for example), and deep sequencing of isolates is very expensive; both require substantial technical expertise. Large collaborations and shared resources and data seem to be the best way to respond quickly to an outbreak situation. Although this has not been the normal approach of many research groups, informal collaborations, like the online forum Virological (http://virological.org), where users can post data and discuss preliminary results, are already influencing how we respond to outbreaks.

CONCLUSIONS

Genomic analysis is a powerful tool for understanding, and therefore combating, viral outbreaks. The field is at a fundamental transition point, supported by recent improvements in viral sequencing and analysis. The biggest immediate hurdle is data collection sufficient to enable full utilization of these methods.

In order to best prepare for future viral outbreaks, we must facilitate collection and rapid sequencing of viral samples. This requires the development of cheap and effective viral diagnostics for use in the field and a continued effort to promote data sharing and collaboration among viral researchers. Additionally, a clear goal for the future of viral genomics is a more complete integration of epidemiological and genetic data in statistical methods. In taking these steps, we can develop the capacity to deal with viral and microbial outbreaks and other threats related to infectious disease.

SUMMARY POINTS.

  1. Recent advances in high-throughput sequencing and computational methods have allowed genomic analysis to become a powerful tool for understanding viral disease outbreaks.

  2. Deep sequencing and high-quality genomes are essential for detailed phylogenetic analysis. Sequencing is also essential to the development and refinement of virus diagnostics and therapies.

  3. Phylogenetic analysis can be used to trace an outbreak to a particular type of transmission, and to estimate its start date. Understanding how and when the outbreak began can be used to prevent the same circumstances from occurring again.

  4. Detailed transmission chain reconstruction is possible using only genomic data, but can also be used to supplement or support transmission chains derived from contact-tracing or molecular epidemiology.

  5. Sequencing is well suited to study viral evolution during an outbreak, and in particular the appearance and spread of mutations that affect viral fitness or virulence. However, genomic evidence for selection is not conclusive, and should be coupled with functional studies.

  6. Effective genomic analysis of viral outbreaks relies on comprehensive sample collection and rapid sequencing. This requires the development of cheap viral diagnostics for field use and consistent data-sharing and collaboration among viral researchers.

FUTURE ISSUES LIST.

  1. Further refining sequencing technologies and sample preparation methods, including long-read sequencing technologies to facilitate phasing of mutations and hybrid selection methods to further enrich viral content of sequencing libraries, is key to increasing the number and quality of full-length viral genomes for analysis.

  2. Developing statistical phylogenetic methods optimized for recombining viruses will allow for more detailed analysis of a wider array of viruses.

  3. Incorporating within-host variation into transmission reconstruction methods will increase the accuracy and resolution of viral transmission trees.

  4. Reducing the amount of missing data remains an urgent need for effective outbreak response. Coordinated sample collection and data sharing amongst research groups are necessary for an effective and rapid outbreak response.

Acknowledgments

We would like to thank T. Bedford, A. Lin, B. MacInnis, C. Matranga, D. Nosamiefan, D. Park, A. Piandatosi, H. Metsky, S. Ye, and N. Yozwiak for their helpful comments.

LITERATURE CITED

  1. Lipsitch M, Cohen T, Cooper B, Robins JM, Ma S, et al. Transmission dynamics and control of severe acute respiratory syndrome. Science. 2003;300(5627):1966–70. doi: 10.1126/science.1086616.
  2. Kiss IZ, Green DM, Kao RR. Disease contact tracing in random and clustered networks. Proc Biol Sci. 2005;272(1570):1407–14. doi: 10.1098/rspb.2005.3092.
  3. Porco TC, Holbrook KA, Fernyak SE, Portnoy DL, Reiter R, Aragón TJ. Logistics of community smallpox control through contact tracing and ring vaccination: a stochastic network model. BMC Public Health. 2004;4:34. doi: 10.1186/1471-2458-4-34.
  4. Kretzschmar M, van den Hof S, Wallinga J, van Wijngaarden J. Ring vaccination and smallpox control. Emerg Infect Dis. 2004;10(5):832–41. doi: 10.3201/eid1005.030419.
  5. Klinkenberg D, Fraser C, Heesterbeek H. The effectiveness of contact tracing in emerging epidemics. PLoS ONE. 2006;1:e12. doi: 10.1371/journal.pone.0000012.
  6. Grad YH, Lipsitch M. Epidemiologic data and pathogen genome sequences: a powerful synergy for public health. Genome Biol. 2014;15(11):538. doi: 10.1186/s13059-014-0538-4.
  7. Centers for Disease Control and Prevention. Guidance for clinicians on the use of RT-PCR and other molecular assays for diagnosis of influenza virus infection. Atlanta, GA: Centers for Disease Control and Prevention; 2015. http://www.cdc.gov/flu/professionals/diagnosis/molecular-assays.htm.
  8. Harper SA, Bradley JS, Englund JA, File TM, Gravenstein S, et al. Seasonal influenza in adults and children--diagnosis, treatment, chemoprophylaxis, and institutional outbreak management: clinical practice guidelines of the Infectious Diseases Society of America. Clin Infect Dis. 2009;48(8):1003–32. doi: 10.1086/604670.
  9. Russell CA, Jones TC, Barr IG, Cox NJ, Garten RJ, et al. The global circulation of seasonal influenza A (H3N2) viruses. Science. 2008;320(5874):340–46. doi: 10.1126/science.1154137.
  10. Rota PA, Oberste MS, Monroe SS, Nix WA, Campagnoli R, et al. Characterization of a novel coronavirus associated with severe acute respiratory syndrome. Science. 2003;300(5624):1394–99. doi: 10.1126/science.1085952.
  11. Wang D, Urisman A, Liu Y-T, Springer M, Ksiazek TG, et al. Viral discovery and sequence recovery using DNA microarrays. PLoS Biol. 2003;1(2):E2. doi: 10.1371/journal.pbio.0000002.
  12. World Health Organization. PCR primers for SARS developed by WHO Network Laboratories. World Health Organization; 2003. http://www.who.int/csr/sars/primers/en/
  13. Ou CY, Ciesielski CA, Myers G, Bandea CI, Luo CC, et al. Molecular epidemiology of HIV transmission in a dental practice. Science. 1992;256(5060):1165–71. doi: 10.1126/science.256.5060.1165.
  14. Djikeng A, Halpin R, Kuzmickas R, Depasse J, Feldblyum J, et al. Viral genome sequencing by random priming methods. BMC Genomics. 2008;9:5. doi: 10.1186/1471-2164-9-5.
  15. Ninomiya M, Ueno Y, Funayama R, Nagashima T, Nishida Y, et al. Use of illumina deep sequencing technology to differentiate hepatitis C virus variants. J Clin Microbiol. 2012;50(3):857–66. doi: 10.1128/JCM.05715-11.
  16. Malboeuf CM, Yang X, Charlebois P, Qu J, Berlin AM, et al. Complete viral RNA genome sequencing of ultra-low copy samples by sequence-independent amplification. Nucleic Acids Res. 2013;41(1):e13. doi: 10.1093/nar/gks794.
  17. Matranga CB, Andersen KG, Winnicki S, Busby M, Gladden AD, et al. Enhanced methods for unbiased deep sequencing of Lassa and Ebola RNA viruses from clinical and biological samples. Genome Biol. 2014;15(11):519. doi: 10.1186/s13059-014-0519-7.
  18. Park DJ, Dudas G, Wohl S, Goba A, Whitmer SLM, et al. Ebola Virus Epidemiology, Transmission, and Evolution during Seven Months in Sierra Leone. Cell. 2015;161(7):1516–26. doi: 10.1016/j.cell.2015.06.007.
  19. Flint M, Goodman CH, Bearden S, Blau DM, Amman BR, et al. Ebola Virus Diagnostics: The US Centers for Disease Control and Prevention Laboratory in Sierra Leone August 2014 to March 2015. J Infect Dis. 2015;212(Suppl 2):S350–58. doi: 10.1093/infdis/jiv361.
  20. Towner JS, Sealy TK, Ksiazek TG, Nichol ST. High-throughput molecular detection of hemorrhagic fever virus threats with applications for outbreak settings. J Infect Dis. 2007;196(Suppl 2):S205–12. doi: 10.1086/520601.
  21. Baize S, Pannetier D, Oestereich L, Rieger T, Koivogui L, et al. Emergence of Zaire Ebola Virus Disease in Guinea — Preliminary Report. N Engl J Med. 2014. doi: 10.1056/NEJMoa1404505.
  22. Quick J, Loman NJ, Duraffour S, Simpson JT, Severi E, et al. Real-time, portable genome sequencing for Ebola surveillance. Nature. 2016. doi: 10.1038/nature16996.
  23. Watson SJ, Welkers MRA, Depledge DP, Coulter E, Breuer JM, et al. Viral population analysis and minority-variant detection using short read next-generation sequencing. Philos Trans R Soc London, Ser B. 2013;368(1614):20120205. doi: 10.1098/rstb.2012.0205.
  24. Reyes GR, Kim JP. Sequence-independent, single-primer amplification (SISPA) of complex DNA populations. Mol Cell Probes. 1991;5(6):473–81. doi: 10.1016/s0890-8508(05)80020-9.
  25. Gire SK, Goba A, Andersen KG, Sealfon RSG, Park DJ, et al. Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak. Science. 2014;345(6202):1369–72. doi: 10.1126/science.1259657.
  26. Greninger AL, Naccache SN, Federman S, Yu G, Mbala P, et al. Rapid metagenomic identification of viral pathogens in clinical samples by real-time nanopore sequencing analysis. Genome Med. 2015;7(1):99. doi: 10.1186/s13073-015-0220-9.
  27. Huson DH, Mitra S, Ruscheweyh H-J, Weber N, Schuster SC. Integrative analysis of environmental sequences using MEGAN4. Genome Res. 2011;21(9):1552–60. doi: 10.1101/gr.120618.111.
  28. Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014;15(3):R46. doi: 10.1186/gb-2014-15-3-r46.
  29. Segata N, Waldron L, Ballarini A, Narasimhan V, Jousson O, Huttenhower C. Metagenomic microbial community profiling using unique clade-specific marker genes. Nat Methods. 2012;9(8):811–14. doi: 10.1038/nmeth.2066.
  30. Kapoor A, Kumar A, Simmonds P, Bhuva N, Singh Chauhan L, et al. Virome Analysis of Transfusion Recipients Reveals a Novel Human Virus That Shares Genomic Features with Hepaciviruses and Pegiviruses. MBio. 2015;6(5):e01466–15. doi: 10.1128/mBio.01466-15.
  31. Stremlau MH, Andersen KG, Folarin O, Grove JN, Odia I, et al. Discovery of Novel Rhabdoviruses in the Blood of Healthy Individuals from West Africa. PLoS Negl Trop Dis. 2015. doi: 10.1371/journal.pntd.0003631.
  32. Naccache SN, Federman S, Veeraraghavan N, Zaharia M, Lee D, et al. A cloud-compatible bioinformatics pipeline for ultrarapid pathogen identification from next-generation sequencing of clinical samples. Genome Res. 2014;24(7):1180–92. doi: 10.1101/gr.171934.113.
  33. Kostic AD, Ojesina AI, Pedamallu CS, Jung J, Verhaak RGW, et al. PathSeq: software to identify or discover microbes by deep sequencing of human tissue. Nat Biotechnol. 2011;29(5):393–96. doi: 10.1038/nbt.1868.
  34. Martin DP, Murrell B, Golden M, Khoosa A, Muhire B. RDP4: Detection and analysis of recombination patterns in virus genomes. Virus Evol. 2015;1. doi: 10.1093/ve/vev003.
  35. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20(9):1297–1303. doi: 10.1101/gr.107524.110.
  36. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25(16):2078–79. doi: 10.1093/bioinformatics/btp352.
  37. Cingolani P, Platts A, Wang LL, Coon M, Nguyen T, et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly. 2012;6(2):80. doi: 10.4161/fly.19695.
  38. Chen Y, Cunningham F, Rios D, McLaren WM, Smith J, et al. Ensembl variation resources. BMC Genomics. 2010;11:293. doi: 10.1186/1471-2164-11-293.
  39. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010;38(16):e164. doi: 10.1093/nar/gkq603.
  40. Yang X, Charlebois P, Macalalad A, Henn MR, Zody MC. V-Phaser 2 variant inference for viral populations. BMC Genomics. 2013;14:674. doi: 10.1186/1471-2164-14-674.
  41. Koboldt DC, Chen K, Wylie T, Larson DE, McLellan MD, et al. VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics. 2009;25(17):2283–85. doi: 10.1093/bioinformatics/btp373.
  42. Guan Y, Peiris JSM, Lipatov AS, Ellis TM, Dyrting KC, et al. Emergence of multiple genotypes of H5N1 avian influenza viruses in Hong Kong SAR. PNAS. 2002;99(13):8950–55. doi: 10.1073/pnas.132268999.
  43. Lei F, Shi W. Prospective of Genomics in Revealing Transmission, Reassortment and Evolution of Wildlife-Borne Avian Influenza A (H5N1) Viruses. Curr Genomics. 2011;12(7):466–74. doi: 10.2174/138920211797904052.
  44. Guindon S, Gascuel O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 2003;52(5):696–704. doi: 10.1080/10635150390235520.
  45. Stamatakis A. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics. 2006;22(21):2688–90. doi: 10.1093/bioinformatics/btl446.
  46. Huelsenbeck JP, Ronquist F. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics. 2001;17(8):754–55. doi: 10.1093/bioinformatics/17.8.754.
  47. Posada D, Crandall KA. Selecting the Best-Fit Model of Nucleotide Substitution. Syst Biol. 2001;50(4):580–601.
  48. Tavare S. Some Probabilistic and Statistical Problems in the Analysis of DNA Sequences. Lectures on Mathematics in the Life Sciences. 1986:17.
  49. Darriba D, Taboada GL, Doallo R, Posada D. JModelTest 2 more models, new heuristics and parallel computing. Nat Methods. 2012;9(8):772. doi: 10.1038/nmeth.2109.
  50. Felsenstein J. Confidence Limits on Phylogenies: An Approach Using the Bootstrap. Evolution. 1985;39:783–91. doi: 10.1111/j.1558-5646.1985.tb00420.x.
  51. Efron B, Halloran E, Holmes S. Bootstrap confidence levels for phylogenetic trees. PNAS. 1996;93(23):13429–34. doi: 10.1073/pnas.93.23.13429.
  52. Douady CJ, Delsuc F, Boucher Y, Doolittle WF, Douzery EJP. Comparison of Bayesian and maximum likelihood bootstrap measures of phylogenetic reliability. Mol Biol Evol. 2003;20(2):248–54. doi: 10.1093/molbev/msg042.
  53. Dudas G, Rambaut A. Phylogenetic Analysis of Guinea 2014 EBOV Ebolavirus Outbreak. PLoS Curr. 2014:1–9. doi: 10.1371/currents.outbreaks.84eefe5ce43ec9dc0bf0670f7b8b417d.
  54. Kumar S. Molecular clocks: four decades of evolution. Nat Rev Genet. 2005;6(8):654–62. doi: 10.1038/nrg1659.
  55. Pybus OG, Rambaut A. Evolutionary analysis of the dynamics of viral infectious disease. Nat Rev Genet. 2009;10(8):540–50. doi: 10.1038/nrg2583.
  56. Rambaut A. Estimating the rate of molecular evolution: incorporating non-contemporaneous sequences into maximum likelihood phylogenies. Bioinformatics. 2000;16(4):395–99. doi: 10.1093/bioinformatics/16.4.395.
  57. Drummond AJ, Ho SYW, Phillips MJ, Rambaut A. Relaxed Phylogenetics and Dating with Confidence. PLoS Biol. 2006;4(5):e88. doi: 10.1371/journal.pbio.0040088.
  58. Drummond AJ, Suchard MA, Xie D, Rambaut A. Bayesian Phylogenetics with BEAUti and the BEAST 1.7. Mol Biol Evol. 2012;29(8):1969–73. doi: 10.1093/molbev/mss075.
  59. Carroll MW, Matthews DA, Hiscox JA, Elmore MJ, Pollakis G, et al. Temporal and spatial analysis of the 2014–2015 Ebola virus outbreak in West Africa. Nature. 2015;524(7563):97–U201. doi: 10.1038/nature14594.
  60. Frost SDW, Pybus OG, Gog JR, Viboud C, Bonhoeffer S, Bedford T. Eight challenges in phylodynamic inference. Epidemics. 2015;10:88–92. doi: 10.1016/j.epidem.2014.09.001.
  61. Wertheim JO, Fourment M, Kosakovsky Pond SL. Inconsistencies in estimating the age of HIV-1 subtypes due to heterotachy. Mol Biol Evol. 2012;29(2):451–56. doi: 10.1093/molbev/msr266.
  62. Huson DH. SplitsTree: analyzing and visualizing evolutionary data. Bioinformatics. 1998;14(1):68–73. doi: 10.1093/bioinformatics/14.1.68.
  63. Huson DH, Bryant D. Application of phylogenetic networks in evolutionary studies. Mol Biol Evol. 2006;23(2):254–67. doi: 10.1093/molbev/msj030.
  64. Rasmussen MD, Hubisz MJ, Gronau I, Siepel A. Genome-wide inference of ancestral recombination graphs. PLoS Genet. 2014;10(5):e1004342. doi: 10.1371/journal.pgen.1004342.
  • 65.Pinzon JE, Wilson JM, Tucker CJ, Arthur R, Jahrling PB, Formenty P. Trigger events: enviroclimatic coupling of Ebola hemorrhagic fever outbreaks. Am J Trop Med Hyg. 2004;71(5):664–74. [PubMed] [Google Scholar]
  • 66.Alexander KA, Sanderson CE, Marathe M, Lewis BL, Rivers CM, et al. What factors might have led to the emergence of Ebola in West Africa? PLoS Negl Trop Dis. 2015;9(6):e0003652. doi: 10.1371/journal.pntd.0003652. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67.Alizon S, Lion S, Murall CL, Abbate JL. Quantifying the epidemic spread of Ebola virus (EBOV) in Sierra Leone using phylodynamics. Virulence. 2014;5(8):825–27. doi: 10.4161/21505594.2014.976514. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68.Althaus CL. Estimating the Reproduction Number of Ebola Virus (EBOV) During the 2014 Outbreak in West Africa. PLoS Curr. 2014;6 doi: 10.1371/currents.outbreaks.91afb5e0f279e7f29e7056095255b288. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69.Chowell G, Nishiura H. Characterizing the transmission dynamics and control of ebola virus disease. PLoS Biol. 2015;13(1):e1002057. doi: 10.1371/journal.pbio.1002057. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70.Chowell G, Viboud C, Hyman JM, Simonsen L. The Western Africa ebola virus disease epidemic exhibits both global exponential and local polynomial growth rates. PLoS Curr. 2015;7 doi: 10.1371/currents.outbreaks.8b55f4bad99ac5c5db3663e916803261. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71.Fisman D, Khoo E, Tuite A. Early epidemic dynamics of the west african 2014 ebola outbreak estimates derived with a simple two-parameter model. PLoS Curr. 2014;6 doi: 10.1371/currents.outbreaks.89c0d3783f36958d96ebbae97348d571. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72.House T. Epidemiological dynamics of Ebola outbreaks. Elife. 2014;3:e03908. doi: 10.7554/eLife.03908. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73.Lewnard JA, Ndeffo Mbah ML, Alfaro-Murillo JA, Altice FL, Bawo L, et al. Dynamics and control of Ebola virus transmission in Montserrado, Liberia: a mathematical modelling analysis. Lancet Infect Dis. 2014;14(12):1189–95. doi: 10.1016/S1473-3099(14)70995-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74.Meltzer MI, Atkins CY, Santibanez S, Knust B, Petersen BW, et al. Estimating the future number of cases in the Ebola epidemic--Liberia and Sierra Leone, 2014–2015. MMWR Surveill Summ. 2014;63(Suppl 3):1–14. [PubMed] [Google Scholar]
  • 75.Nishiura H, Chowell G. Early transmission dynamics of Ebola virus disease (EVD), West Africa, March to August 2014. Euro Surveill. 2014;19(36) doi: 10.2807/1560-7917.es2014.19.36.20894. [DOI] [PubMed] [Google Scholar]
  • 76.Siettos C, Anastassopoulou C, Russo L, Grigoras C, Mylonakis E. Modeling the 2014 Ebola Virus Epidemic - Agent-Based Simulations, Temporal Analysis and Future Predictions for Liberia and Sierra Leone. PLoS Curr. 2015;7 doi: 10.1371/currents.outbreaks.8d5984114855fc425e699e1a18cdc6c9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77.Towers S, Patterson-Lomba O, Castillo-Chavez C. Temporal variations in the effective reproduction number of the 2014 west Africa ebola outbreak. PLoS Curr. 2014;6 doi: 10.1371/currents.outbreaks.9e4c4294ec8ce1adad283172b16bc908. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78.World Health Organization. Ebola Situation Report - 20 January 2016. World Health Organization; 2016. http://apps.who.int/ebola/current-situation/ebola-situation-report-20-january-2016. [Google Scholar]
  • 79.Ypma RJF, van Ballegooijen WM, Jacco Wallinga. Relating phylogenetic trees to transmission trees of infectious disease outbreaks. Genetics. 2013;195(3):1055–62. doi: 10.1534/genetics.113.154856. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 80.Lemey P, Derdelinckx I, Rambaut A, Van Laethem K, Dumont S, et al. Molecular footprint of drug-selective pressure in a human immunodeficiency virus transmission chain. J Virol. 2005;79(18):11981–89. doi: 10.1128/JVI.79.18.11981-11989.2005. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 81.Leitner T, Escanilla D, Franzén C, Uhlén M, Albert J. Accurate reconstruction of a known HIV-1 transmission history by phylogenetic tree analysis. PNAS. 1996;93(20):10864–69. doi: 10.1073/pnas.93.20.10864. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 82.Paraskevis D, Magiorkinis E, Magiorkinis G, Kiosses VG, Lemey P, et al. Phylogenetic reconstruction of a known HIV-1 CRF04_cpx transmission network using maximum likelihood and Bayesian methods. J Mol Evol. 2004;59(5):709–17. doi: 10.1007/s00239-004-2651-6. [DOI] [PubMed] [Google Scholar]
  • 83.Jombart T, Eggo RM, Dodd PJ, Balloux F. Reconstructing disease outbreaks from genetic data: a graph approach. Heredity. 2010;106(2):383–90. doi: 10.1038/hdy.2010.78. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 84.Morelli MJ, Thébaud G, Chadœuf J, King DP, Haydon DT, Soubeyrand S. A Bayesian Inference Framework to Reconstruct Transmission Trees Using Epidemiological and Genetic Data. PLoS Comput Biol. 2012;8(11):e1002768. doi: 10.1371/journal.pcbi.1002768. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 85.Jombart T, Cori A, Didelot X, Cauchemez S, Fraser C, Ferguson N. Bayesian Reconstruction of Disease Outbreaks by Combining Epidemiologic and Genomic Data. PLoS Comput Biol. 2014;10(1):e1003457. doi: 10.1371/journal.pcbi.1003457. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 86.Didelot X, Gardy J, Colijn C. Bayesian inference of infectious disease transmission from whole-genome sequence data. Mol Biol Evol. 2014;31(7):1869–79. doi: 10.1093/molbev/msu121. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 87.Jombart T, Aanensen DM, Baguelin M, Birrell P, Cauchemez S, et al. OutbreakTools: A new platform for disease outbreak analysis using the R software. Epidemics. 2014;7:28–34. doi: 10.1016/j.epidem.2014.04.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 88.Worby CJ, Lipsitch M, Hanage WP. Within-Host Bacterial Diversity Hinders Accurate Reconstruction of Transmission Networks from Genomic Distance Data. PLoS Comput Biol. 2014;10(3) doi: 10.1371/journal.pcbi.1003549. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 89.Grenfell BT, Pybus OG, Gog JR, Wood JLN, Daly JM, et al. Unifying the epidemiological and evolutionary dynamics of pathogens. Science. 2004;303(5656):327–32. doi: 10.1126/science.1090727. [DOI] [PubMed] [Google Scholar]
  • 90.Lemey P, Rambaut A, Pybus OG. HIV evolutionary dynamics within and among hosts. AIDS Rev. 2006;8(3):125–40. [PubMed] [Google Scholar]
  • 91.Khiabanian H, Carpenter Z, Kugelman J, Chan J, Trifonov V, et al. Viral diversity and clonal evolution from unphased genomic data. BMC Genomics. 2014;15(Suppl 6):S17. doi: 10.1186/1471-2164-15-S6-S17. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 92.Keele BF, Giorgi EE, Salazar-Gonzalez JF, Decker JM, Pham KT, et al. Identification and characterization of transmitted and early founder virus envelopes in primary HIV-1 infection. PNAS. 2008;105(21):7552–57. doi: 10.1073/pnas.0802203105. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 93.Emmett KJ, Lee A, Khiabanian H, Rabadan R. High-resolution Genomic Surveillance of 2014 Ebolavirus Using Shared Subclonal Variants. PLoS Curr. 2015;7 doi: 10.1371/currents.outbreaks.c7fd7946ba606c982668a96bcba43c90. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 94.Pybus OG, Charleston MA, Gupta S, Rambaut A, Holmes EC, Harvey PH. The epidemic behavior of the hepatitis C virus. Science. 2001;292(5525):2323–25. doi: 10.1126/science.1058321. [DOI] [PubMed] [Google Scholar]
  • 95.Stadler T, Kouyos R, Wyl von V, Yerly S, Böni J, et al. Estimating the basic reproductive number from viral sequence data. Mol Biol Evol. 2012;29(1):347–57. doi: 10.1093/molbev/msr217. [DOI] [PubMed] [Google Scholar]
  • 96.Gomes MFC, Pastore Y, Piontti A, Rossi L, Chao D, Longini I, et al. Assessing the international spreading risk associated with the 2014 west african ebola outbreak. PLoS Curr. 2014;6 doi: 10.1371/currents.outbreaks.cd818f63d40e24aef769dda7df9e0da5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 97.Lloyd-Smith JO, Cross PC, Briggs CJ, Daugherty M, Getz WM, et al. Should we expect population thresholds for wildlife disease? Trends Ecol Evol. 2005;20(9):511–19. doi: 10.1016/j.tree.2005.07.004. [DOI] [PubMed] [Google Scholar]
  • 98.Duffy S, Shackelton LA, Holmes EC. Rates of evolutionary change in viruses: patterns and determinants. Nature. 2008;9(4):267–76. doi: 10.1038/nrg2323. [DOI] [PubMed] [Google Scholar]
  • 99.Drake JW. Rates of spontaneous mutation among RNA viruses. PNAS. 1993;90(9):4171–75. doi: 10.1073/pnas.90.9.4171. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 100.Drake JW, Hwang CBC. On the mutation rate of herpes simplex virus type 1. Genetics. 2005;170(2):969–70. doi: 10.1534/genetics.104.040410. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 101.Jenkins GM, Rambaut A, Pybus OG, Holmes EC. Rates of molecular evolution in RNA viruses: a quantitative phylogenetic analysis. J Mol Evol. 2002;54(2):156–65. doi: 10.1007/s00239-001-0064-3. [DOI] [PubMed] [Google Scholar]
  • 102.Yang Z. PAML 4 phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007;24(8):1586–91. doi: 10.1093/molbev/msm088. [DOI] [PubMed] [Google Scholar]
  • 103.Pond SLK, Frost SDW, Muse SV. HyPhy: hypothesis testing using phylogenies. Bioinformatics. 2005;21(5):676–79. doi: 10.1093/bioinformatics/bti079. [DOI] [PubMed] [Google Scholar]
  • 104.Pond SLK, Frost SDW. Datamonkey: rapid detection of selective pressure on individual sites of codon alignments. Bioinformatics. 2005;21(10):2531–33. doi: 10.1093/bioinformatics/bti320. [DOI] [PubMed] [Google Scholar]
  • 105.Kimura M. Preponderance of synonymous changes as evidence for the neutral theory of molecular evolution. Nature. 1977;267(5608):275–76. doi: 10.1038/267275a0. [DOI] [PubMed] [Google Scholar]
  • 106.Kryazhimskiy S, Plotkin JB. The population genetics of dN/dS. PLoS Genet. 2008;4(12):e1000304. doi: 10.1371/journal.pgen.1000304. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 107.Liu S-Q, Deng C-L, Yuan Z-M, Rayner S, Zhang B. Identifying the pattern of molecular evolution for Zaire ebolavirus in the 2014 outbreak in West Africa. Infect Genet Evol. 2015;32:51–59. doi: 10.1016/j.meegid.2015.02.024. [DOI] [PubMed] [Google Scholar]
  • 108.Wertheim JO, Worobey M. Relaxed selection and the evolution of RNA virus mucin-like pathogenicity factors. J Virol. 2009;83(9):4690–94. doi: 10.1128/JVI.02358-08. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 109.Andersen KG, Shapiro BJ, Matranga CB, Sealfon R, Lin AE, et al. Clinical Sequencing Uncovers Origins and Evolution of Lassa Virus. Cell. 2015;162(4):738–50. doi: 10.1016/j.cell.2015.07.020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 110.Sealfon RS, Lin MF, Jungreis I, Wolf MY, Kellis M, Sabeti PC. FRESCo: finding regions of excess synonymous constraint in diverse viruses. Genome Biol. 2015;16:38. doi: 10.1186/s13059-015-0603-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 111.Bedford T, Cobey S, Pascual M. Strength and tempo of selection revealed in viral gene genealogies. BMC Evol Biol. 2011;11:220. doi: 10.1186/1471-2148-11-220. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 112.Neher RA, Russell CA, Shraiman BI. Predicting evolution from the shape of genealogical trees. Elife. 2014;3 doi: 10.7554/eLife.03568. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 113.Drummond AJ, Suchard MA. Fully Bayesian tests of neutrality using genealogical summary statistics. BMC Genet. 2008;9:68. doi: 10.1186/1471-2156-9-68. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 114.Tajima F. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics. 1989;123(3):585–95. doi: 10.1093/genetics/123.3.585. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 115.Fu YX, Li WH. Statistical tests of neutrality of mutations. Genetics. 1993;133(3):693–709. doi: 10.1093/genetics/133.3.693. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 116.Kirkpatrick M, Slatkin M. Searching for Evolutionary Patterns in the Shape of a Phylogenetic Tree. Evolution. 1993;47(4):1171. doi: 10.1111/j.1558-5646.1993.tb02144.x. [DOI] [PubMed] [Google Scholar]
  • 117.McKenzie A, Steel M. Distributions of cherries for two models of trees. Math Biosci. 2000;164(1):81–92. doi: 10.1016/s0025-5564(99)00060-7. [DOI] [PubMed] [Google Scholar]
  • 118.Colless DH. Review of: Phylogenetics: the theory and practice of phylogenetic systematics. Syst Zool. 1982;31:100–104. [Google Scholar]
  • 119.Edwards CTT, Holmes EC, Pybus OG, Wilson DJ, Viscidi RP, et al. Evolution of the Human Immunodeficiency Virus Envelope Gene Is Dominated by Purifying Selection. Genetics. 2006;174(3):1441–53. doi: 10.1534/genetics.105.052019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 120.Lefever S, Pattyn F, Hellemans J, Vandesompele J. Single-nucleotide polymorphisms and other mismatches reduce performance of quantitative PCR assays. Clin Chem. 2013;59(10):1470–80. doi: 10.1373/clinchem.2013.203653. [DOI] [PubMed] [Google Scholar]
  • 121.Kugelman JR, Sanchez-Lockhart M, Andersen KG, Gire S, Park DJ, et al. Evaluation of the Potential Impact of Ebola Virus Genomic Drift on the Efficacy of Sequence-Based Candidate Therapeutics. MBio. 2015;6(1) doi: 10.1128/mBio.02227-14. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 122.Kugelman JR, Wiley MR, Mate S, Ladner JT, Beitzel B, et al. Monitoring of Ebola Virus Makona Evolution through Establishment of Advanced Genomic Capability in Liberia. Emerg Infect Dis. 2015;21(7):1135–43. doi: 10.3201/eid2107.150522. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 123.Castilletti C, Carletti F, Gruber CEM, Bordi L, Lalle E, et al. Molecular Characterization of the First Ebola Virus Isolated in Italy, from a Health Care Worker Repatriated from Sierra Leone. Genome Announc. 2015;3(3) doi: 10.1128/genomeA.00639-15. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 124.Bell A, Lewandowski K, Myers R, Wooldridge D, Aarons E, et al. Genome sequence analysis of Ebola virus in clinical samples from three British healthcare workers, August 2014 to March 2015. Euro Surveill. 2015;20(20):6–10. doi: 10.2807/1560-7917.es2015.20.20.21131. [DOI] [PubMed] [Google Scholar]
  • 125.Tong Y-G, Shi W-F, Di Liu, Qian J, Liang L, et al. Genetic diversity and evolutionary dynamics of Ebola virus in Sierra Leone. Nature. 2015;524(7563):93–96. doi: 10.1038/nature14490. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 126.Simon-Loriere E, Faye O, Faye O, Koivogui L, Magassouba N, et al. Distinct lineages of Ebola virus in Guinea during the 2014 West African epidemic. Nature. 2015;524(7563):102–U210. doi: 10.1038/nature14612. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 127.Lin A, Akusobi C, Kirchdoerfer RN, Saphire EO, Sabeti PC. Molecular Characterization of Clade-Defining Mutations during the Ebola Outbreak in Sierra Leone. Manuscript in preparation. [Google Scholar]
  • 128.Metcalf CJE, Birger RB, Funk S, Kouyos RD, Lloyd-Smith JO, Jansen VAA. Five challenges in evolution and infectious diseases. Epidemics. 2015;10:40–44. doi: 10.1016/j.epidem.2014.12.003. [DOI] [PubMed] [Google Scholar]
  • 129.Yozwiak NL, Schaffner SF, Sabeti PC. Data sharing: Make outbreak research open access. Nature. 2015;518(7540):477–79. doi: 10.1038/518477a. [DOI] [PubMed] [Google Scholar]
  • 130.MacConaill L, Meyerson M. Adding pathogens by genomic subtraction. Nature Genetics. 2008;40(4):380–82. doi: 10.1038/ng0408-380. [DOI] [PubMed] [Google Scholar]
  • 131.Vega VB, Ruan Y, Liu J, Lee WH, Wei CL, et al. Mutational dynamics of the SARS coronavirus in cell culture and human populations isolated in 2003. BMC Infect Dis. 2004;4:32. doi: 10.1186/1471-2334-4-32. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 132.Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJM, Birol I. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009;19(6):1117–23. doi: 10.1101/gr.089532.108. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 133.Luo R, Liu B, Xie Y, Li Z, Huang W, et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience. 2012;1(1):18. doi: 10.1186/2047-217X-1-18. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 134.Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 2012;19(5):455–77. doi: 10.1089/cmb.2012.0021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 135.Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011;29(7):644–52. doi: 10.1038/nbt.1883. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 136.Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18(5):821–29. doi: 10.1101/gr.074492.107. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 137.Katoh K, Misawa K, Kuma K-I, Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 2002;30(14):3059–66. doi: 10.1093/nar/gkf436. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 138.Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7 improvements in performance and usability. Mol Biol Evol. 2013;30(4):772–80. doi: 10.1093/molbev/mst010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 139.Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32(5):1792–97. doi: 10.1093/nar/gkh340. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 140.Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10. doi: 10.1016/S0022-2836(05)80360-2. [DOI] [PubMed] [Google Scholar]
  • 141.Kiełbasa SM, Wan R, Sato K, Horton P, Frith MC. Adaptive seeds tame genomic sequence comparison. Genome Res. 2011;21(3):487–93. doi: 10.1101/gr.113985.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 142.Babraham Bioinformatics. FastQC. Babraham Bioinformatics; 2015. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ [Google Scholar]
  • 143.The Broad Institute. Picard. The Broad Institute; http://broadinstitute.github.io/picard/ [Google Scholar]
  • 144.Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25. doi: 10.1186/gb-2009-10-3-r25. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 145.Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25(14):1754–60. doi: 10.1093/bioinformatics/btp324. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 146.Lee W-P, Stromberg MP, Ward A, Stewart C, Garrison EP, Marth GT. MOSAIK: a hash-based algorithm for accurate next-generation sequencing short-read mapping. PLoS One. 2014;9(3):e90581. doi: 10.1371/journal.pone.0090581. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 147.Novocraft. NovoAlign. Novocraft; 2014. http://www.novocraft.com/products/novoalign/ [Google Scholar]
  • 148.Zaharia M, Bolosky WJ, Curtis K, Fox A, Patterson D, et al. Faster and More Accurate Sequence Alignment with SNAP. 2011 arXiv:1111.5572v1. [Google Scholar]
  • 149.Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. 2012 arXiv:1207.3907v2. [Google Scholar]
  • 150.Rimmer A, Phan H, Mathieson I, Iqbal Z, Twigg SRF, et al. Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications. Nat Genet. 2014;46(8):912–18. doi: 10.1038/ng.3036. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 151.Rotmistrovsky KE, Agarwala R. BMTagger: Best Match Tagger for Removing Human Reads from Metagenomics Datasets. 2014 ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/bmtagger/
  • 152.Rambaut Andrew. FigTree. 2014 http://tree.bio.ed.ac.uk/software/figtree.
  • 153.CLC bio. CLC Genomics Workbench. CLC bio; 2015. http://www.clcbio.com/products/clc-genomics-workbench/ [Google Scholar]
  • 154.Biomatters Limited. Geneious. Biomatters Limited; 2016. http://www.geneious.com/ [Google Scholar]
  • 155.Park DJ, Jungreis I, Tomkins-Tinch C, Lin M. viral-ngs. 2015 http://dx.doi.org/10.5281/zenodo.17560.
  • 156.Neher RA, Bedford T. nextflu: real-time tracking of seasonal influenza virus evolution in humans. Bioinformatics. 2015;31(21):3546–48. doi: 10.1093/bioinformatics/btv381. [DOI] [PMC free article] [PubMed] [Google Scholar]

RESOURCES