## Abstract

We analyse patterns of genetic variability of populations in the presence of a large seed bank with the help of a new coalescent structure called the seed bank coalescent. This ancestral process appears naturally as scaling limit of the genealogy of large populations that sustain seed banks, if the seed bank size and individual dormancy times are of the same order as the active population. Mutations appear as Poisson processes on the active lineages, and potentially at reduced rate also on the dormant lineages. The presence of ‘dormant’ lineages leads to qualitatively altered times to the most recent common ancestor and non-classical patterns of genetic diversity. To illustrate this we provide a Wright-Fisher model with seed bank component and mutation, motivated from recent models of microbial dormancy, whose genealogy can be described by the seed bank coalescent. Based on our coalescent model, we derive recursions for the expectation and variance of the time to most recent common ancestor, number of segregating sites, pairwise differences, and singletons. Estimates (obtained by simulations) of the distributions of commonly employed distance statistics, in the presence and absence of a seed bank, are compared. The effect of a seed bank on the expected site-frequency spectrum is also investigated using simulations. Our results indicate that the presence of a large seed bank considerably alters the distribution of some distance statistics, as well as the site-frequency spectrum. Thus, one should be able to detect the presence of a large seed bank in genetic data.

## Introduction

Many microorganisms can enter reversible dormant states of low (resp. zero) metabolic activity, for example when faced with unfavourable environmental conditions; see e.g. LENNON and JONES (2011) for a recent overview of this phenomenon. Such dormant forms may stay inactive for extended periods of time and thus create a seed bank that should significantly affect the interplay of evolutionary forces driving the genetic variability of the microbial population. In fact, in many eco-systems, the percentage of dormant cells compared to the total population size is substantial, and sometimes even dominant (for example roughly 20% in human gut, 40% in marine water, 80% in soil, cf. LENNON and JONES (2011)[Box 1, Table *a*]). This abundance of dormant forms, which can be short-lived as well as staying inactive for significant periods of time (decades or century old spores are not uncommon) thus creates a seed bank that buffers against environmental change, but potentially also against classical evolutionary forces such as genetic drift, mutation, or selection.

In this paper, we investigate the effect of large seed banks (that is, comparable to the size of the active population) on the patterns of genetic variability in populations over macroscopic timescales. In particular, we extend a recently introduced mathematical model for the ancestral relationships in a Wright-Fisherian population of size *N* with geometric seed bank age distribution (cf. BLATH *et al.* (2015)) to accommodate different mutation rates for ‘active’ and ‘dormant’ individuals, as well as a positive death rate in the seed bank. The resulting genealogy, measured over timescales of order *N*, can then be described by a new universal coalescent structure, the ‘seed bank coalescent with mutation’, if the individual initiation and resuscitation rates between active and dormant states as well as the individual mutation rates are of order 1/*N*. Measuring times in units of *N* and mutation rates in units of 1/*N* is of course the classical scaling regime in population genetic modeling; in particular, the classical Wright-Fisher model has a genealogy that converges in precisely this setup to the usual Kingman coalescent with mutation (KINGMAN (1982a,b,c); see WAKELEY (2009) for an overview).

We will provide a precise description of these (seed bank) coalescents and corresponding population models, in part motivated by recent research in microbial dormancy JONES and LENNON (2010); LENNON and JONES (2011), in the next section below. We argue that our seed bank coalescent is universal in the sense that it is robust to the specifics of the associated population model, as long as certain basic features are captured.

Our explicit seed bank coalescent model then allows us to derive expressions for several important population genetic quantities. In particular, we provide recursions for the expectation (and variance) of the time to the most recent common ancestor (*T*_{MRCA}), the total number of segregating sites, the average number of pairwise differences and the number of singletons in a sample (under the infinitely-many-sites model assumptions). We then use these recursions, and additional simulations based on the seed bank coalescent with mutation, to analyse Tajima’s *D* and related distance statistics in the presence of seed banks, and also the observed site frequency spectrum.

We hope that this basic analysis triggers further research on the effect of seed banks in population genetics, for example concerning statistical methods that allow one to infer the presence and size of seed banks from data, to allow model selection (e.g. seed bank coalescent versus (time-changed) Kingman coalescent), and finally to estimate evolutionary parameters such as the mutation rate in dormant individuals, or the inactivation and reactivation rates between the dormant and active states.

It is important to note that our approach is different from a previously introduced mathematical seed bank model in KAJ *et al.* (2001). There, the authors consider a population of constant size *N* where each individual chooses its parent a random number of generations in the past and copies its genetic type from there. The number of generations that separate each parent and offspring can be interpreted as the time (in generations) that the offspring stays dormant. The authors show that if the maximal time spent in the seed bank is restricted to finitely many generations {1, 2, … , *m*}, where *m* is fixed, then the ancestral process induced by the seed bank model converges, after the usual scaling of time by a factor *N*, to a time-changed (delayed) Kingman coalescent. Thus, typical patterns of genetic diversity, in particular the normalised site frequency spectrum, will stay (qualitatively) unchanged. Of course, the point here is that the expected seed bank age distribution is not on the order of *N*, but uniformly bounded by *m*, so that for the coalescent approximation to hold one necessarily needs that *m* is *small* compared to *N*, which results in a ‘weak’ seed bank effect. This model has been applied in TELLIER *et al.* (2011) in the analysis of seed banks in certain species of wild tomatoes. A related model was considered in VITALIS *et al.* (2004), which shares the feature that the time spent in the seed bank is bounded by a fixed number independent of the population size. For a more detailed mathematical discussion of such models, including previous work in BLATH *et al.* (2014), see BLATH *et al.* (2015). The choice of the adequate coalescent model (seed bank coalescent vs. (time-changed) Kingman coalescent) will thus also be an important question for study design, and the development of corresponding model selection rules will be part of future research.

## Coalescent models and seed banks

Before we discuss the seed bank coalescent, we briefly recall the classical Kingman coalescent for reference; this will ease the comparison of the underlying assumptions of both models.

### The Kingman coalescent with mutation

The Kingman coalescent (KINGMAN, 1982a,b,c) describes the ancestral process of a large class of neutral exchangeable population models including the Wright-Fisher model (WRIGHT, 1931; FISHER, 1930), the Moran model (MORAN, 1958) and many Cannings models (CANNINGS, 1974). See e.g. WAKELEY (2009) for an overview. If we trace the ancestral lines (that is, the sequence of genetic ancestors at a locus) of a sample of size *n* backwards in time, we obtain a binary tree, in which we see pairwise coalescences of branches until the most recent common ancestor is reached. Kingman proved that the probability law of this random tree can be described as follows: Each pair of lineages (there are $\binom{n}{2}$ many) has the same chance to coalesce, and the successive coalescence times are exponentially distributed with parameters $\binom{k}{2}$, $k = n, n-1, \dots, 2$, until the last remaining pair of lines has coalesced. This elegant structure allows one to easily determine the expected time to the most recent common ancestor of a sample of size *n*, which is well known to be

$$\mathbb{E}[T_{MRCA}] = \sum_{k=2}^{n} \binom{k}{2}^{-1} = 2\left(1 - \frac{1}{n}\right). \qquad (1)$$

Not surprisingly, we will essentially recover (1) for the seed bank coalescent defined below if the relative seed bank size becomes small compared to the ‘active’ population size.

As usual, mutations are placed upon the resulting coalescent tree according to a Poisson-process with rate *θ*/2, for some appropriate *θ* > 0, so that the expected number of mutations of a sample of size 2 is just *θ*.
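As a quick numerical illustration of the description above, the following Python sketch draws *T*_{MRCA} for Kingman's coalescent by summing independent exponential waiting times with rates $\binom{k}{2}$, and compares the Monte Carlo mean with 2(1 − 1/*n*). The function names, the seed and the number of replicates are illustrative choices and not part of the original analysis.

```python
import random

def kingman_tmrca(n, rng):
    """Draw T_MRCA for a Kingman coalescent started with n lineages."""
    t = 0.0
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2      # binom(k, 2): every pair coalesces at rate 1
        t += rng.expovariate(rate)  # waiting time while k lineages remain
    return t

def expected_tmrca(n):
    """Closed form: sum of 1 / binom(k, 2) for k = 2..n equals 2 (1 - 1/n)."""
    return 2.0 * (1.0 - 1.0 / n)

rng = random.Random(1)              # fixed seed, only for reproducibility
n, reps = 10, 20000
mean = sum(kingman_tmrca(n, rng) for _ in range(reps)) / reps
print("Monte Carlo mean:", mean, " closed form:", expected_tmrca(n))
```

The two printed numbers should agree up to Monte Carlo error, which shrinks like one over the square root of the number of replicates.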

The underlying assumptions about the population for a Kingman coalescent approximation of its genealogy to be justified are simple but far-reaching, namely that the different genetic types in the population are selectively neutral (i.e. do not exhibit significant fitness differences), and that the population size of the underlying population is essentially constant in time. If the population can be described by the (haploid) Wright-Fisher model (of constant size, say *N*), then, in order to arrive at the described limiting genealogy, it is standard to measure time in units of *N*, *the coalescent time scale*, and to assume that the individual mutation rates per generation are of order *θ*/(2*N*). The exact time-scaling usually depends on the reproductive mechanism and other particularities of the underlying model (it differs already among variants of the Moran model), but the Kingman coalescent is still a universally valid limit for many a priori different population models (including e.g. all reproductive mechanisms with bounded offspring variance, dioecy, age structure, partial selfing and to some degree geographic structure), when these particularities exert their influence over time scales much shorter than the coalescent time scale, cf. e.g. WAKELEY (2013). This is also the reason why the Kingman coalescent still appears as limiting genealogy of the ‘weak’ seed bank model of KAJ *et al.* (2001) mentioned in the introduction.

This robustness has turned the Kingman coalescent into an extremely useful tool in population genetics. In fact, it can be considered the standard null-model for neutral populations. Its success is also based on the fact that it allows a simple derivation of many population genetic quantities of interest, such as a formula for the expected number of segregating sites

$$\mathbb{E}[S] = \theta \sum_{i=1}^{n-1} \frac{1}{i},$$

or the expected average number of pairwise differences *π* (TAJIMA, 1983), or the expected values of the site-frequency spectrum, cf. FU (1995), when one assumes the infinite-sites model of WATTERSON (1975). This analytic tractability has allowed the construction of a sophisticated statistical machinery for the inference of evolutionary parameters. We will investigate the corresponding quantities for the seed bank coalescent below.

### The seed bank coalescent with mutation

Similar to the Kingman coalescent, the seed bank coalescent, mathematically introduced in BLATH *et al.* (2015), describes the ancestral lines of a sample taken from a population with seed bank component. Here, we distinguish whether an ancestral line belongs to an ‘active’ or ‘dormant’ individual for any given point backward in time. The main difference to the Kingman coalescent is that as long as an ancestral line corresponds to a dormant individual (in the seed bank), it cannot coalesce with other lines, since reproduction and thus finding a common ancestor is only possible for ‘active’ individuals.

The dynamics is now easily described as follows: If there are currently *n* active and *m* dormant lineages at some point in the past, each ‘active pair’ may coalesce with the same probability, after an exponential time with rate $\binom{n}{2}$, entirely similar to a classical Kingman coalescent with currently *n* lineages. However, each active line becomes dormant at a positive rate *c* > 0 (corresponding to an ancestor who emerged from the seed bank), and each dormant line resuscitates at a rate *cK*, for some *K* > 0. The parameter *K* reflects the relative size of the seed bank compared to the active population, and will be explained below in terms of an explicit underlying population model. Since dormant lines are prevented from merging, they significantly delay the time to the most recent common ancestor. This mechanism is reminiscent of a structured coalescent with two islands (HERBOTS, 1997; NOTOHARA, 1990), where lineages may only merge if they are in the same colony. Of course, if one samples a seed bank coalescent backwards in time, one needs to specify not only the sample size, but the number of sampled individuals from the active population (say *n*) and from the dormant population (say *m*).

In this paper, we also consider mutations along the ancestral lines. As in the Kingman case we place them along the active line segments according to a Poisson process with rate *θ*_{1}, and along the dormant segments at a rate *θ*_{2} *≥* 0. Depending on the concrete situation, one may want to choose *θ*_{2} = 0. To determine the mutation rate in dormant individuals will be an interesting inference question. In Figure 1, we illustrate a realisation of the seed bank coalescent with mutations: Dormant segments are dotted and do not take part in coalescences.

A formal mathematical definition of this process as partition-valued Markov chain can be found in BLATH *et al.* (2015); it is straightforward to extend their framework to include mutations.

The parameters *c* and *K* can be understood as follows: *c* describes the proportion of individuals that enter the seed bank per (macroscopic) coalescent time unit. It is thus the rate at which individuals become dormant. If the ratio of the size of the active population and the dormant population in the underlying population is *K* : 1 (that is, the active population is *K* times the size of the dormant population), and absolute (and thus also relative) population sizes are assumed to stay constant, then, in order for the relative amount of active and dormant individuals to stay balanced, the rate at which dormant individuals resuscitate and return to the active population is necessarily of the form *cK*, see also Figure 2. It is important to note that in this setup, the average time that an inactive individual stays dormant is of the order *N*/(*cK*) generations, that is, 1/(*cK*) in coalescent time units. We will later also include a positive mortality rate for dormant individuals; this will lead to a reduced ‘effective’ relative seed bank size.
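The dynamics described in this section can be sketched as a small Gillespie-type simulation of the numbers of active and dormant lineages. This is a minimal illustration without mortality and without mutation; the function names, the seed and the parameter values are hypothetical choices made here.

```python
import random

def seedbank_tmrca(n, m, c, K, rng):
    """One draw of T_MRCA for the seed bank coalescent (no mortality),
    started with n active and m dormant lineages."""
    t = 0.0
    while n + m > 1:
        coal = n * (n - 1) / 2   # each active pair coalesces at rate 1
        sleep = c * n            # each active line becomes dormant at rate c
        wake = c * K * m         # each dormant line resuscitates at rate cK
        total = coal + sleep + wake
        t += rng.expovariate(total)
        u = rng.random() * total
        if u < coal:
            n -= 1               # coalescence of two active lines
        elif u < coal + sleep:
            n, m = n - 1, m + 1  # an active line falls dormant
        else:
            n, m = n + 1, m - 1  # a dormant line wakes up
    return t

def mean_tmrca(n, m, c, K, reps=20000, seed=1):
    """Monte Carlo estimate of the expected T_MRCA."""
    rng = random.Random(seed)
    return sum(seedbank_tmrca(n, m, c, K, rng) for _ in range(reps)) / reps

print("K = 1:  ", mean_tmrca(2, 0, c=1.0, K=1.0))
print("K = 100:", mean_tmrca(2, 0, c=1.0, K=100.0))
```

For a sample of two active lines, the estimate with *K* = 1 should be noticeably larger than the Kingman value 1, while the estimate with *K* = 100 should be close to it, reflecting that dormant lines cannot coalesce.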

### Robustness and underlying assumptions of the seed bank coalescent

As for the Kingman coalescent, it is important to understand the underlying assumptions that make the seed bank coalescent a reasonable model for the genealogy of a population: Again, we assume the types in the population to be selectively neutral, so that there are no significant fitness differences. Further, we assume the population size *N* and the seed bank size *M* to be constant, and to be of the same order, that is, there exists a *K* > 0 so that *N* = *K · M*, i.e. the ratio between active and dormant individuals is constant and equal to *K* : 1. Finally, the rate at which an active individual becomes dormant should be *c* (on the macroscopic coalescent scale), so that necessarily the average time (in coalescent time units) that an individual stays dormant before being resuscitated becomes 1/(*cK*). If one includes a positive mortality rate in the seed bank, this will lead to a modified (‘effective’) relative seed bank size in place of *K*, see below.

We will provide below an example of a concrete seed bank population model, the ‘Wright-Fisher model with geometric seed bank component’, including mutation and mortality in the seed bank, for which it can be proved that the seed bank coalescent with mutation governs the genealogy if the population size *N* (and thus necessarily also seed bank size *M*) gets large, and coalescent time is measured in units of the population size *N*. This is the same scaling regime as in the case of the Kingman coalescent corresponding to genealogy of the classical Wright-Fisher model.

The seed bank coalescent with mutation should be robust against small alterations of the underlying population (such as in the transition or reproduction mechanism, or in the population or seed bank size), similar to the robustness of the Kingman coalescent, especially if these alterations occur on time scales that are much shorter than the coalescent time scale (which is *N* for the haploid Wright-Fisher model). For example, one can still obtain this coalescent in a *Moran model* with seed bank component, as long as the seed bank is on the same order as the active population, and if the migration rates between seed bank and active population scale suitably (as well as the mutation rate) with the coalescent time scale. As mentioned above, this is an important difference to the model considered by KAJ *et al.* (2001), where the time an individual stays in the seed bank is negligible compared to the coalescent time scale, thus resulting merely in a time change of a Kingman coalescent, that is, a ‘weak’ seed bank effect.

## A Wright-Fisher model with geometric seed bank distribution

We now introduce a Wright-Fisher type population model with mutation and seed bank in which individuals stay dormant for geometrically distributed amounts of time. The model is very much in line with classical probabilistic population genetics thinking (in particular assuming constant population size), but also captures several features of microbial seed banks described in LENNON and JONES (2011), in particular reversible states of dormancy and mortality in the seed bank. We assume that the following (idealised) aspects of (microbial) dormancy can be observed:

(i) Dormancy generates a seed bank consisting of a reservoir of dormant individuals.

(ii) The size of the seed bank is comparable to the order of the total population size, say in a constant ratio *K* : 1 for some *K* > 0.

(iii) The size of the active population *N* and of the seed bank *M* = *M*(*N*) stays constant in time; combined with (ii) we get *N* = *K · M*.

(iv) The model is selectively neutral so that reproduction is entirely symmetric for all individuals; for concreteness we assume reproduction according to the Wright-Fisher mechanism in fixed generations. That means, the joint offspring distribution of the parents in each generation is symmetric multinomial. We interpret 0 offspring as the death of the parent, one offspring as mere survival of the parent, and two or more offspring as successful reproduction leading to new individuals created by the parent.

(v) Mutations may happen in the active population, at constant probability of the order *θ*_{1}/(2*N*), but potentially also in the dormant population (at the same, or a reduced, or vanishing, probability *θ*_{2}/(2*N*)).

(vi) There is bi-directional and potentially repeated switching from active to dormant states, which appears essentially independently among individuals (‘spontaneous switching’). The individual initiation probability of dormancy per generation is of the order *c/N*, for *c* > 0.

(vii) Dormant individuals may die in the seed bank (due to maintenance and energy costs). If mortality is assumed to be positive, the individual probability of death per generation is of order *d/N*.

(viii) For each new generation, all these mechanisms occur independently of the previous generations.

We schematically visualise this mechanism in Figure 2, which is similar to Figure 1 in JONES and LENNON (2010). Whether these assumptions are met of course needs to be determined for the concrete underlying real population. In this theoretical paper, we use the above assumptions to construct an explicit mathematical model that leads, measuring time in units of *N*, to a seed bank coalescent with mutation. Still, we wish to emphasise that, as discussed in the previous section, the seed bank coalescent is robust as long as certain basic assumptions are met.

We now turn the above features into a formal mathematical model that can be rigorously analysed, extending the Wright-Fisher model with geometric seed bank component in BLATH *et al.* (2015) by additionally including mortality in the seed bank and potentially different mutation rates in the active and dormant populations.

(Seed bank model with mutation and mortality). *Let* *N* ∈ ℕ, *and let* *c*, *K*, *θ*_{1} > 0 *and* *θ*_{2}, *d* ≥ 0. *The seed bank model with mutation is obtained by iterating the following dynamics for each discrete generation* *k* ∈ ℕ_{0} *(with the convention that all occurring numbers are integers; if not, one may enforce this using appropriate Gauss brackets):*

(1) The *N* active individuals from generation *k* = 0 produce *N* − *c* active individuals in generation *k* = 1 by multinomial sampling with equal weights.

(2) Additionally, *c* dormant individuals, sampled uniformly at random without replacement from the seed bank of size *M* := *N/K* in generation 0, reactivate, that is, they turn into exactly one active individual in generation *k* = 1 each, and leave the seed bank.

(3) The active individuals from generation 0 are thus replaced by these (*N* − *c*) + *c* = *N* new active individuals, forming the active population in the next generation *k* = 1.

(4) In the seed bank, *d* individuals, sampled uniformly at random without replacement from generation *k* = 0, die.

(5) To replace the *c* + *d* vacancies in the seed bank, the *N* active individuals from generation 0 produce *c* + *d* seeds by multinomial sampling with equal weights, filling the vacant slots of the seeds that were activated or died.

(6) The remaining *M* − *c* − *d* seeds from generation 0 remain inactive and stay in the seed bank.

(7) During reproduction, each newly created individual copies its genetic type from its parent.

(8) In each generation, each active individual is affected by a mutation with probability *θ*_{1}/*N*, and each dormant individual mutates with probability *θ*_{2}/*N* (where *θ*_{2} may be 0).

This model is an extension of the model in BLATH *et al.* (2015) to additionally include mortality in the seed bank and incorporate (potentially distinct) mutation rates in the active and dormant population. It appears to be a rather natural extension of the classical Wright-Fisher model. Note that the model has a geometric seed bank age distribution, since every dormant individual in each generation has the same probability to become active resp. die in the next generation, so that the time that an individual is in the dormant state is geometrically distributed. The parameter of this geometric distribution is given by

$$\frac{c}{M} = \frac{cK}{N} \qquad \text{resp.} \qquad \frac{c+d}{M} = \frac{(c+d)K}{N}$$

in the absence resp. presence of mortality in the seed bank. With mathematical arguments similar to those applied in BLATH *et al.* (2015), it is now standard to show that the ancestral process of a sample taken from the above population model converges, on the coalescent time scale *N*, to the seed bank coalescent with parameters *c* and *K*, resp. *c* and

$$K \cdot \frac{c+d}{c},$$

and mutation rates *θ*_{1}, *θ*_{2}. It is interesting to see that mortality leads to a decrease of the relative seed bank size in a way that depends on the initiation rate *c*, which is of course rather intuitive. In this sense *K*(*c* + *d*)/*c* gives the ‘effective’ relative seed bank size.
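The generation update of the seed bank model can be sketched in a few lines of Python for the bi-allelic case with types `a` and `A`. This is an illustrative reading of the dynamics above, not the authors' code: the function name is hypothetical, the *c* reactivating and *d* dying seeds are assumed to be disjoint, and multinomial sampling with equal weights is implemented by letting every offspring pick its parent uniformly at random with replacement.

```python
import random

def next_generation(active, dormant, c, d, theta1, theta2, rng):
    """One generation of the seed bank Wright-Fisher model (types 'a' / 'A').

    Assumption: the c reactivating and the d dying seeds are disjoint."""
    N, M = len(active), len(dormant)
    slots = rng.sample(range(M), c + d)        # without replacement
    waking, dying = slots[:c], slots[c:]
    # N - c active offspring, each copying the type of a uniform parent
    new_active = [rng.choice(active) for _ in range(N - c)]
    new_active += [dormant[i] for i in waking]  # c reactivated seeds
    # c + d new seeds from the old active generation fill the vacant slots
    new_dormant = list(dormant)
    for i in waking + dying:
        new_dormant[i] = rng.choice(active)
    # mutation flips the type with probability theta1/N resp. theta2/N
    flip = {"a": "A", "A": "a"}
    new_active = [flip[x] if rng.random() < theta1 / N else x for x in new_active]
    new_dormant = [flip[y] if rng.random() < theta2 / N else y for y in new_dormant]
    return new_active, new_dormant

rng = random.Random(1)
N, K = 200, 2                                   # so that M = N / K = 100
active = ["a"] * (N // 2) + ["A"] * (N // 2)
dormant = ["a"] * (N // K)
for _ in range(300):
    active, dormant = next_generation(active, dormant, c=2, d=1,
                                      theta1=0.5, theta2=0.0, rng=rng)
print("frequency of a (active): ", active.count("a") / N)
print("frequency of a (dormant):", dormant.count("a") / len(dormant))
```

Both population sizes stay constant by construction, and with *θ*_{2} = 0 the seed bank only changes through the *c* + *d* replaced slots per generation.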

### The type-frequencies in the bi-allelic seed bank population model

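In this paper, we will mostly consider the *infinite sites model* (WATTERSON, 1975), where it is assumed that each mutation generates an entirely new type. However, before turning to the infinite-sites model, we briefly discuss the bi-allelic case, say with types {*a, A*}. Given initial type configurations *ξ*_{0} ∈ {*a*, *A*}^{N} and *η*_{0} ∈ {*a*, *A*}^{M}, denote by

$$\xi_k := (\xi_k(i))_{i \in \{1, \dots, N\}}, \qquad \eta_k := (\eta_k(j))_{j \in \{1, \dots, M\}}, \qquad k \in \mathbb{N}_0,$$

the genetic type configuration of the active individuals (*ξ*) and the dormant individuals (*η*) in generation *k* (obtained from the above mechanism). We assume that each mutation causes a transition from *a* to *A* or from *A* to *a*. Let

$$X^N_k := \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}_{\{\xi_k(i) = a\}}, \qquad Y^M_k := \frac{1}{M} \sum_{j=1}^{M} \mathbf{1}_{\{\eta_k(j) = a\}}$$

denote the fractions of type-*a* individuals in the active population and in the seed bank in generation *k*.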
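We call the discrete-time Markov chain (*X*^{N}_{k}, *Y*^{M}_{k})_{k∈ℕ0} of type-*a* frequencies in the active population and in the seed bank the *Wright-Fisher frequency process with mutation and seed bank component*. It can be seen from a generator computation that under our assumptions the rescaled process (*X*^{N}_{⌊Nt⌋}, *Y*^{M}_{⌊Nt⌋})_{t≥0} converges as *N → ∞* to the two-dimensional diffusion (*X*_{t}, *Y*_{t})_{t≥0} that is the solution to the system of stochastic differential equations

$$dX_t = \big[c\,(Y_t - X_t) + \theta_1 (1 - 2X_t)\big]\,dt + \sqrt{X_t (1 - X_t)}\,dB_t,$$

$$dY_t = \big[(c+d)K\,(X_t - Y_t) + \theta_2 (1 - 2Y_t)\big]\,dt.$$

Here, (*B*_{t})_{t≥0} denotes standard one-dimensional Brownian motion. An alternative way to represent this stochastic process is via its Kolmogorov backward generator, cf. e.g. KARLIN and TAYLOR (1981), which is given by

$$\mathcal{A} f(x, y) = \big[c\,(y - x) + \theta_1 (1 - 2x)\big] \frac{\partial f}{\partial x}(x, y) + \big[(c+d)K\,(x - y) + \theta_2 (1 - 2y)\big] \frac{\partial f}{\partial y}(x, y) + \frac{1}{2}\, x (1 - x)\, \frac{\partial^2 f}{\partial x^2}(x, y)$$

for functions *f ∈ C*^{2}([0, 1]^{2}). Note that this is reminiscent of the backward generator of the *structured coalescent* with two islands (HERBOTS, 1997; NOTOHARA, 1990); however, its qualitative behaviour is very different. Its relation to the structured coalescent with two islands will be investigated in future research.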

## Population genetics with the seed bank coalescent

In contrast to LENNON and JONES (2011), who use a deterministic population dynamics approach to study seed banks, we are interested in probabilistic effects of seed banks on genetic variability. Thus our methods are genealogical and sample based, and we use a coalescent approach to study the genealogy of a sample. In order to better understand how seed banks shape genealogies, we consider genealogical properties, such as time to most recent common ancestor, total tree size, and length of external branches.

### Genealogical tree properties

We first discuss some classical population genetic properties of the seed bank coalescent when viewed as a random tree without mutations. For the results that we derive below, it will usually be sufficient to consider the *block-counting process* (*N*_{t}, *M*_{t})_{t≥0} of our coalescent, where *N*_{t} gives the number of lines in our coalescent that are active and *M*_{t} denotes the number of dormant lines *t* time units in the past. Then, (*N*_{t}, *M*_{t})_{t≥0} is the continuous time Markov chain started in (*N*_{0}, *M*_{0}) ∈ ℕ_{0} × ℕ_{0} with transitions

$$(n, m) \mapsto \begin{cases} (n-1, m+1) & \text{at rate } cn, \\ (n+1, m-1) & \text{at rate } (c+d)Km, \\ (n-1, m) & \text{at rate } \binom{n}{2}, \end{cases}$$

where the second rate reduces to the resuscitation rate *cKm* in the absence of mortality (*d* = 0).

Again, introducing mutation can be done in the usual way, by superimposing independent Poisson processes with rate *θ*_{1} on the active lines, and at rate *θ*_{2} on the dormant lines. If the block-counting process is currently in state (*N*_{t}, *M*_{t}) = (*n, m*), then a mutation in an active line happens at rate *nθ*_{1}, and a mutation in a dormant line at rate *mθ*_{2}. With resuscitation rate (*c* + *d*)*K* per dormant line (which equals *cK* if *d* = 0), the total jump rate from state (*n, m*) of the *backward process with mutation* is thus given by

$$\binom{n}{2} + cn + (c+d)Km + n\theta_1 + m\theta_2.$$
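Since mutations are superimposed as independent Poisson processes, the number of mutations on the tree (which is the number of segregating sites under the infinite sites model) can be simulated from the embedded jump chain alone, without tracking times. The following Python sketch does this for the case without mortality (*d* = 0); the function name, the seed and the parameter values are illustrative assumptions.

```python
import random

def segregating_sites(n, m, c, K, theta1, theta2, rng):
    """Number of mutations on one seed bank coalescent tree (d = 0),
    started with n active and m dormant lineages."""
    s = 0
    while n + m > 1:
        rates = [n * (n - 1) / 2,   # coalescence of an active pair
                 c * n,             # active line becomes dormant
                 c * K * m,         # dormant line resuscitates
                 theta1 * n,        # mutation on an active line
                 theta2 * m]        # mutation on a dormant line
        u = rng.random() * sum(rates)
        if u < rates[0]:
            n -= 1
        elif u < rates[0] + rates[1]:
            n, m = n - 1, m + 1
        elif u < rates[0] + rates[1] + rates[2]:
            n, m = n + 1, m - 1
        else:
            s += 1                  # every mutation before the MRCA segregates
    return s

rng = random.Random(1)
reps = 20000
for K in (1.0, 100.0):
    mean_s = sum(segregating_sites(10, 0, 1.0, K, 1.0, 0.0, rng)
                 for _ in range(reps)) / reps
    print("K =", K, " mean number of segregating sites:", mean_s)
```

Comparing the two printed means gives a first impression of how a large seed bank (*K* small) elevates the number of segregating sites even when no mutations occur in dormant lines.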

### Time to the most recent common ancestor

It has been shown in BLATH *et al.* (2015) [Theorem 4.6] that the expected time to the most recent common ancestor (𝔼_{n,0}[*T*_{MRCA}]) for the seed bank coalescent, if started in a sample of active individuals of size *n*, is *O*(log log *n*), in stark contrast to the corresponding quantity for the classical Kingman coalescent, which is bounded by 2, uniformly in *n*, cf. (1). This already indicates that one should expect elevated levels of (old) genetic variability under the seed bank coalescent, since more (old) mutations can be accumulated. While the above result shows the asymptotic behaviour of 𝔼_{n,0}[*T*_{MRCA}] for large *n*, it does not give precise information about its exact absolute value, in particular for ‘small to medium’ *n*. Here, we provide recursions for its expected value and variance that can be computed efficiently. First, we introduce some notation.

We define the *time to the most recent common ancestor* of the seed bank coalescent formally to be

$$T_{MRCA} := \inf\{t > 0 : N_t + M_t = 1\}.$$

If the sample consists of *an* active and *bn* dormant individuals, for some *a, b* ∈ ℝ^{+}, then the expected time to the most recent common ancestor is of order log(*bn* + log *an*) (BLATH *et al.*, 2015). Here, it is interesting to note that the time to the most recent common ancestor of the Bolthausen-Sznitman coalescent is also *O*(log log *n*) (GOLDSCHMIDT and MARTIN, 2005). The Bolthausen-Sznitman coalescent is often used as a model for selection, cf. e.g. NEHER and HALLATSCHEK (2013).

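One can compute the expected time to most recent common ancestor recursively as follows. For *n, m* ∈ ℕ_{0} let

$$t_{n,m} := \mathbb{E}_{n,m}[T_{MRCA}],$$

where 𝔼_{n,m} denotes expectation when started in (*N*_{0}, *M*_{0}) = (*n, m*), i.e. with *n active* lines and *m dormant* ones. Observe that we need to consider both types of lines in order to calculate *t*_{n,m}. Write

$$\lambda_{n,m} := \binom{n}{2} + cn + (c+d)Km \qquad (8)$$

and abbreviate

$$\alpha_{n,m} := \frac{\binom{n}{2}}{\lambda_{n,m}}, \qquad \beta_{n,m} := \frac{cn}{\lambda_{n,m}}, \qquad \gamma_{n,m} := \frac{(c+d)Km}{\lambda_{n,m}}.$$

Then we have the following recursive representation

$$t_{n,m} = \frac{1}{\lambda_{n,m}} + \alpha_{n,m}\, t_{n-1,m} + \beta_{n,m}\, t_{n-1,m+1} + \gamma_{n,m}\, t_{n+1,m-1} \qquad (10)$$

with initial conditions *t*_{1,0} = *t*_{0,1} = 0. The proof of (10) and a recursion for the variance of *T*_{MRCA} is given in Section S1. Since the process *N*_{t} + *M*_{t} is non-increasing in *t*, these recursions can be solved iteratively. In fact,

$$t_{2,0} = \left(1 + \frac{c}{(c+d)K}\right)^{2}, \qquad (11)$$

which in the case without mortality (*d* = 0) reduces to

$$t_{2,0} = \left(1 + \frac{1}{K}\right)^{2} = 1 + \frac{2K + 1}{K^{2}}.$$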
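A recursion of this type can be evaluated numerically level by level in the total number of lines *n* + *m*. The Python sketch below assumes the first-step form *t*_{n,m} = 1/*λ* + (coalescence) *t*_{n−1,m} + (deactivation) *t*_{n−1,m+1} + (resuscitation) *t*_{n+1,m−1}, with the transition probabilities given by the rates $\binom{n}{2}$, *cn* and *κm* divided by their sum *λ*; the per-line resuscitation rate *κ* = (*c* + *d*)*K* under mortality is an assumption of this sketch. States with the same *n* + *m* are coupled through the switching between active and dormant, so each level is solved as a small tridiagonal linear system (Thomas algorithm). All function names are hypothetical.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm for a tridiagonal linear system."""
    size = len(diag)
    cp, dp = [0.0] * size, [0.0] * size
    cp[0], dp[0] = upper[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, size):
        denom = diag[i] - lower[i] * cp[i - 1]
        cp[i] = upper[i] / denom
        dp[i] = (rhs[i] - lower[i] * dp[i - 1]) / denom
    x = [0.0] * size
    x[-1] = dp[-1]
    for i in range(size - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def expected_tmrca_table(n_max, c, K, d=0.0):
    """Return levels with levels[k][n] = t_{n, k-n} for 1 <= k <= n_max."""
    kappa = (c + d) * K              # assumed resuscitation rate per dormant line
    levels = {1: [0.0, 0.0]}         # t_{0,1} = t_{1,0} = 0
    for k in range(2, n_max + 1):
        prev = levels[k - 1]
        lower, diag, upper, rhs = [], [], [], []
        for n in range(k + 1):       # state (n, m) with m = k - n
            m = k - n
            pairs = n * (n - 1) / 2
            lower.append(-c * n)     # coupling to (n-1, m+1)
            diag.append(pairs + c * n + kappa * m)
            upper.append(-kappa * m) # coupling to (n+1, m-1)
            rhs.append(1.0 + (pairs * prev[n - 1] if n >= 2 else 0.0))
        levels[k] = solve_tridiagonal(lower, diag, upper, rhs)
    return levels

levels = expected_tmrca_table(10, c=1.0, K=1.0)
for n in (2, 5, 10):
    print("t_{%d,0} =" % n, levels[n][n])
```

Here `levels[n][n]` is the expected time to the most recent common ancestor for a sample of *n* active lines; other entries of `levels[k]` give mixed samples with *k* lines in total.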

Notably, *t*_{2,0} is constant for sample size 2 (see Eq. 11) as *c* varies (Table 1) if *d* = 0, and in particular does not converge for *c →* 0 to the Kingman case. This effect is similar to the corresponding behaviour of the structured coalescent with two islands if the migration rate goes to 0, cf. NATH and GRIFFITHS (1993). However, the Kingman coalescent values are recovered as the seed bank size decreases (e.g. for *K* = 100 in Table 1).

The fact that *t*_{2,0} = 4 for *K* = 1*, d* = 0 can be understood heuristically if *c* is large: In that situation, transitions between active and dormant states happen very fast, thus at any given time the probability that a line is active is about 1/2, and therefore the probability that both lines of a given pair are active (and thus able to merge) is approximately 1/4. We can therefore conjecture that for *d* = 0*, K* = 1 and *c → ∞* the genealogy of a sample is given by a time change by a factor 4 of Kingman’s coalescent.

Tables 1 and 2 show values of *t*_{n,0} obtained from (10) for various parameter choices and sample sizes. The relative size of the seed bank (*K*) has a significant effect on 𝔼_{n,0}[*T*_{MRCA}]; a large seed bank (*K* small) increases 𝔼_{n,0}[*T*_{MRCA}], while the effect of *c* is to dampen the increase in 𝔼_{n,0}[*T*_{MRCA}] with sample size (Table 1). The effect of the seed bank death rate *d* on 𝔼_{n,0}[*T*_{MRCA}] is to dampen the effect of the relative size (*K*) of the seed bank (Table 2).

### Total tree length and length of external branches

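In order to investigate the genetic variability of a sample, in terms e.g. of the number of segregating sites and the number of singletons, it is useful to have information about the total tree length and the total length of external branches. Let *L*^{(a)} denote the total length of all branches while they are active, and *L*^{(d)} the total length of all branches while they are dormant. Their expectations

$$l^{(a)}_{n,m} := \mathbb{E}_{n,m}\big[L^{(a)}\big], \qquad l^{(d)}_{n,m} := \mathbb{E}_{n,m}\big[L^{(d)}\big]$$

may be calculated using the following recursions for *n, m* ∈ ℕ_{0}, and with *λ*_{n,m} given by (8),

$$\lambda_{n,m}\, l^{(a)}_{n,m} = n + \binom{n}{2}\, l^{(a)}_{n-1,m} + cn\, l^{(a)}_{n-1,m+1} + (c+d)Km\, l^{(a)}_{n+1,m-1}, \qquad (14)$$

$$\lambda_{n,m}\, l^{(d)}_{n,m} = m + \binom{n}{2}\, l^{(d)}_{n-1,m} + cn\, l^{(d)}_{n-1,m+1} + (c+d)Km\, l^{(d)}_{n+1,m-1}, \qquad (15)$$

with $l^{(a)}_{1,0} = l^{(a)}_{0,1} = l^{(d)}_{1,0} = l^{(d)}_{0,1} = 0$.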

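Similar recursions hold for their variances as well as for the corresponding values of the total length of external branches, which can be found in the Supplementary Information together with the respective proofs. From (14) and (15) one readily obtains

$$l^{(a)}_{2,0} = \frac{2\,\big((c+d)K + c\big)}{(c+d)K}, \qquad l^{(d)}_{2,0} = \frac{2c\,\big((c+d)K + c\big)}{\big((c+d)K\big)^{2}}. \qquad (16)$$

We observe that $l^{(a)}_{2,0}$ and $l^{(d)}_{2,0}$ given in (16) are independent of *c* if *d* = 0 (they then equal 2(1 + *K*)/*K* and 2(1 + *K*)/*K*^{2}, respectively), as also seen for *t*_{2,0}, cf. (11). We will use (16) to obtain closed-form expressions for the expected average number of pairwise differences.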

The numerical solutions of (14) and (15) indicate that for *n* ≥ 2,

$$l^{(a)}_{n,0} = \frac{(c+d)K}{c}\; l^{(d)}_{n,0}. \qquad (17)$$

Hence the expected total lengths of the active and the dormant parts of the tree are proportional, and their ratio is given by the effective relative seed bank size.
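The proportionality of active and dormant tree length can be checked numerically with a small Python sketch. It assumes first-step recursions in which the expected active (dormant) length accumulates at rate *n* (*m*) in state (*n, m*), with transition rates $\binom{n}{2}$, *cn* and *κm*, where the per-line resuscitation rate *κ* = (*c* + *d*)*K* under mortality is an assumption of this sketch. The function names are hypothetical, and each level *n* + *m* = *k* is solved as a tridiagonal linear system.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm for a tridiagonal linear system."""
    size = len(diag)
    cp, dp = [0.0] * size, [0.0] * size
    cp[0], dp[0] = upper[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, size):
        denom = diag[i] - lower[i] * cp[i - 1]
        cp[i] = upper[i] / denom
        dp[i] = (rhs[i] - lower[i] * dp[i - 1]) / denom
    x = [0.0] * size
    x[-1] = dp[-1]
    for i in range(size - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def expected_lengths(n_max, c, K, d=0.0):
    """Return (la, ld) with la[k][n] and ld[k][n] the expected active and
    dormant tree lengths when started in state (n, k - n)."""
    kappa = (c + d) * K                       # assumed resuscitation rate
    la, ld = {1: [0.0, 0.0]}, {1: [0.0, 0.0]}
    for k in range(2, n_max + 1):
        pairs = [n * (n - 1) / 2 for n in range(k + 1)]
        lower = [-c * n for n in range(k + 1)]
        upper = [-kappa * (k - n) for n in range(k + 1)]
        diag = [pairs[n] + c * n + kappa * (k - n) for n in range(k + 1)]
        # active length grows at rate n, dormant length at rate m = k - n
        for table, weight in ((la, lambda n: n), (ld, lambda n: k - n)):
            prev = table[k - 1]
            rhs = [weight(n) + (pairs[n] * prev[n - 1] if n >= 2 else 0.0)
                   for n in range(k + 1)]
            table[k] = solve_tridiagonal(lower, diag, upper, rhs)
    return la, ld

c, K, d = 0.5, 2.0, 0.3
la, ld = expected_lengths(8, c, K, d)
for n in range(2, 9):
    print(n, la[n][n] / ld[n][n], (c + d) * K / c)
```

For every sample of *n* active lines, the two printed columns should coincide up to floating point error, in line with the proportionality stated above.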

Recursions for the expected total length of external branches are given in Prop. S1.3 in Supporting Information. Consider the expected total lengths of *active* and *dormant* external branches, respectively, when started with *n* active and *m* dormant lines. The numerical solutions of the recursions indicate that the ratio of these two expected values is also given by (17).

Recursions for expected branch lengths associated with any other class than singletons are more complicated to derive, and we postpone those for further study. Simulation results (not shown) suggest that the result (17) we obtained for the relative expected total length of active branches, and of active external branches, holds for all branch length classes: if we consider the total lengths of *active* and of *dormant* branches subtending *i* ∈ {1, 2, … , *n* − 1} leaves, then, if all our sampled lines are active, we claim that the ratio of their expected values is given by (17).

Table S1 shows values of *r*_{10,10}, i.e. the relative expected total length of external branches when our sample consists of ten active lines and ten dormant ones. In contrast to the case when all sampled lines are active, *c* clearly impacts *r*_{10,10} when *d* is small. In line with previous results, *d* reduces the effect of the relative size (*K*) of the seed bank.

Table S2 shows the expected total lengths of active and dormant external branches for values of *c*, *K*, and *d* as shown. When the seed bank is large (*K* small), these lengths can be much longer than the expected length equal to 2 when associated with the Kingman coalescent (FU, 1995). However, as noted before, the effect of *K* depends on *d*. The effect of *c* also depends on *d*; changes in *c* have a bigger effect when *d* is large.

One can gain insight into the effects of a seed bank on the site frequency spectrum by studying the effects of a seed bank on relative branch lengths. Consider the total length of *active* branches subtending *i* leaves, relative to the total length of active branches *L*^{(a)}; we only consider the case when all *n* sampled lines are active. Thus, if one assumes that the mutation rate in the seed bank is negligible compared to the mutation rate in the active population, the relative length of active external branches (*i* = 1) should be a good indicator of the number of singletons relative to the total number of segregating sites. In addition, we investigate these relative lengths to learn if and how the presence of a seed bank affects genetic variation, even if *no* mutations occur in the seed bank. Figure S1 shows estimates of their expected values (obtained by simulations) for values of *c*, *K*, and *d* as shown (all *n* = 100 sampled lines assumed active). The main conclusion is that a large seed bank reduces the relative length of external branches, and increases the relative magnitude of the right tail of the branch length spectrum. Thus, one would expect to see a similar pattern in neutral genetic variation: a reduced relative count of singletons, and a relative increase of polymorphic sites in high count.

### Neutral genetic variation

In this subsection we derive and study several recursions for common measures of DNA sequence variation in the infinite sites model (ISM) of WATTERSON (1975). We will also investigate how these quantities differ from the corresponding values under the Kingman coalescent, in an effort to understand how seed bank parameters affect genetic variability.

#### Segregating sites

First we consider the *number of segregating sites S in a sample*, which, assuming the ISM, is the total number of mutations that occur in the genealogy of the sample until the time of its most recent common ancestor. In addition to being of interest on its own, *S* is a key ingredient in commonly employed distance statistics such as those of TAJIMA (1989) and FU and LI (1993). We let mutations occur on active branch lengths according to independent Poisson processes each with rate *θ*_{1}/2, and on dormant branches with rate *θ*_{2}/2. The expected value of *S* can be expressed in terms of the expected total tree lengths as

$$\mathbb{E}_{n,m}[S] = \frac{\theta_1}{2}\,\mathbb{E}_{n,m}\bigl[L^{(a)}\bigr] + \frac{\theta_2}{2}\,\mathbb{E}_{n,m}\bigl[L^{(d)}\bigr].$$

The proof of this, as well as a similar expression for the variance of the number of segregating sites, can be found in the supplementary material.

Table 3 shows the expected number of segregating sites 𝔼_{n,0}[*S*] = *s*_{n,0} in a sample of size *n* taken from the active population for values of *c* and *K* as shown. The size of the seed bank *K* strongly influences the number of segregating sites. If there is no mutation in the seed bank, it roughly doubles for *K* = 1 and approaches the normal value of the Kingman coalescent for small seed banks (*K* = 100). The parameter *K* seems to have a more significant influence than the parameter *c*.
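As a small numerical companion to this comparison, the sketch below evaluates the expected number of segregating sites from given expected tree lengths, together with the classical Kingman baseline *θ* Σ_{i=1}^{n−1} 1/*i*. The function names are hypothetical, and the tree lengths passed in the usage example are illustrative inputs rather than values taken from Table 3.

```python
def expected_segregating_sites(len_active, len_dormant, theta1, theta2=0.0):
    """E[S] when mutations fall at rate theta1/2 on active and theta2/2 on
    dormant branch length; the two expected lengths are supplied by the caller."""
    return 0.5 * theta1 * len_active + 0.5 * theta2 * len_dormant


def kingman_expected_segregating_sites(n, theta):
    """Kingman baseline: E[S] = theta * sum_{i=1}^{n-1} 1/i."""
    return theta * sum(1.0 / i for i in range(1, n))


# Illustrative use: a pair of lines with expected active length 4 and no
# mutation in the seed bank, compared with the Kingman value for n = 2.
with_seed_bank = expected_segregating_sites(4.0, 4.0, theta1=2.0, theta2=0.0)
baseline = kingman_expected_segregating_sites(2, theta=2.0)
```

In this illustrative case the seed bank value is twice the baseline, the kind of elevation discussed above for *K* = 1.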

#### Average pairwise differences

Average pairwise differences are a key ingredient in the distance statistics of TAJIMA (1983) and FAY and WU (2000). Expected value and variance for average pairwise differences in the Kingman coalescent were first derived in TAJIMA (1983). Here, we give an expression for the expectation in terms of the expected total tree lengths. Denote by *π* the average number of pairwise differences

$$\pi := \binom{n+m}{2}^{-1} K, \qquad K := \sum_{i<j} K_{ij},$$

where *K* is the total number of pairwise differences, with *K*_{ij} denoting the number of differences observed in the pair of DNA sequences indexed by (*i, j*). We abbreviate *d*_{n,m} := 𝔼_{n,m}[*K*] and obtain

$$\mathbb{E}_{n,m}[\pi] = \binom{n+m}{2}^{-1} d_{n,m},$$

which can be calculated using (20), where the expected total tree lengths 𝔼_{n,m}[*L*^{(a)}] and 𝔼_{n,m}[*L*^{(d)}] are defined in (13).

Hence, given a sample configuration (*n,* 0), i.e. our *n* sampled lines are all active, (20), together with (16), gives a closed-form expression (21) for 𝔼_{n,0}[*π*].

If now *d* = 0, the dependence on *c* disappears again, and the resulting value is highly elevated compared to *θ*_{1} if the seed bank is large (*K* small). For comparison, 𝔼[*π*] = *θ*_{1} when associated with the usual Kingman coalescent, which we recover in the absence of a seed bank (*K → ∞*) in (21).
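The disappearance of *c* for *d* = 0 can be checked by hand for a pair of active lines. The derivation below is a sketch under assumed transition rates (coalescence at rate 1 when both lines are active, rate *c* per active line to become dormant, rate *cK* per dormant line to become active); the symbols a_{i,j} and b_{i,j} for the expected active and dormant lengths started from *i* active and *j* dormant lines are introduced here for illustration only.

```latex
% First-step analysis for the expected active length a_{i,j} (assumed rates, d = 0):
\begin{align*}
(1+2c)\,a_{2,0} &= 2 + 2c\,a_{1,1}, \\
(c+cK)\,a_{1,1} &= 1 + c\,a_{0,2} + cK\,a_{2,0}, \\
a_{0,2} &= a_{1,1}.
\end{align*}
% The last two lines give cK a_{1,1} = 1 + cK a_{2,0}; inserting this into the first
% line, and solving the analogous system for the dormant length b_{i,j}, yields
\begin{align*}
a_{2,0} = 2\Bigl(1+\tfrac{1}{K}\Bigr), \qquad
b_{2,0} = \tfrac{2}{K}\Bigl(1+\tfrac{1}{K}\Bigr).
\end{align*}
% With mutation rates theta_1/2 (active) and theta_2/2 (dormant):
\[
\mathbb{E}_{2,0}[\pi]
  = \tfrac{\theta_1}{2}\,a_{2,0} + \tfrac{\theta_2}{2}\,b_{2,0}
  = \Bigl(\theta_1 + \tfrac{\theta_2}{K}\Bigr)\Bigl(1+\tfrac{1}{K}\Bigr).
\]
```

Under these assumptions *c* cancels, the value tends to *θ*_{1} as *K → ∞*, and, since every pair in an all-active sample starts in configuration (2, 0), the same value applies to 𝔼_{n,0}[*π*].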

#### The site-frequency spectrum (SFS)

The site frequency spectrum (SFS) is one of the most important summary statistics of population genetic data in the infinite sites model. Suppose that we can distinguish between mutant and wild-type, e.g. with the help of an outgroup. As before, we distinguish between the number of samples taken from the active population (say *n*) and the dormant population (say *m*). Then, the SFS of an (*n, m*)-sample is given by

$$(\xi_1, \ldots, \xi_{n+m-1}),$$

where the *ξ*_{i} denote the number of sites at which variants appear *i* times in our sample of size *n* + *m*. For the Kingman coalescent, the expected values, variances and covariances of the SFS have been derived by FU (1995). In principle, expected values and covariances can be computed by extending the theory in FU (1995) resp. GRIFFITHS and TAVARÉ (1998); these computations are, however, far more involved than the previous recursions and will be treated in future research. We derive recursions for the expected number of singletons, and investigate the whole SFS by simulation.
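To make the definition concrete, the following sketch computes the unfolded SFS, and its normalised version, from a small binary data matrix. The input layout (one row per sampled sequence, one column per site, 1 marking the derived variant) and the function names are assumptions of this example.

```python
def site_frequency_spectrum(genotypes):
    """Unfolded SFS: entry i - 1 counts the sites at which the derived
    variant appears exactly i times among the sampled sequences."""
    size = len(genotypes)
    sfs = [0] * (size - 1)
    for site in zip(*genotypes):       # iterate over columns (sites)
        count = sum(site)
        if 0 < count < size:           # skip sites that are not polymorphic
            sfs[count - 1] += 1
    return sfs


def normalised_sfs(sfs):
    """Divide each entry by the total number of segregating sites."""
    total = sum(sfs)
    if total == 0:
        raise ValueError("no segregating sites")
    return [x / total for x in sfs]


# Four sequences, three sites: two singletons and one site with count 3.
sample = [
    [1, 0, 1],
    [0, 0, 1],
    [0, 1, 1],
    [0, 0, 0],
]
spectrum = site_frequency_spectrum(sample)   # [2, 0, 1]
```

The first entry of the spectrum is the number of singletons, and the sum of all entries is the number of segregating sites.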

#### Number of singletons

The number of singletons in a sample is often taken as an indicator of the kind of historical processes that have acted on the population. By ‘singletons’ we mean the number of *derived* (or new) mutations which appear only once in the sample, which, in the infinite sites model, is equal to the number of mutations occurring on external branches. Thus we can relate the expected number of singletons to the total length of external branches in the same way as we related the number of segregating sites to the total tree length. Consider the expected total length of *active external* branches when our sample consists of *n active external* lines, *n′ active internal* lines, *m dormant external* lines, and *m′ dormant internal* lines, and define similarly the expected total length of *dormant external* branches. Recursions for both quantities are given in the supplementary material. For *n, m* ∈ ℕ_{0}, the expected number of singletons is then given by *θ*_{1}/2 times the expected total length of active external branches plus *θ*_{2}/2 times the expected total length of dormant external branches, started from *n* active external and *m* dormant external lines.

Thus, one can compute the expected number of singletons by solving the recursions for external branch lengths. By way of example, Table S2 gives values of the expected total lengths of active and dormant external branches for a sample of 10 active lines (*n* = 10, *m* = 0).

#### The whole site-frequency spectrum

Figure 3 shows estimates of the normalised expected frequency spectrum, i.e. the expected SFS divided by the expected total number of segregating sites. Figure 3 shows that if the relative size of the seed bank is small (say, *K* = 100), then the SFS is almost unaffected by dormancy, in line with intuition. If the seed bank is large (say *K* = 0.1) and the transition rate *c* = 1 is comparable to the mutation rate *θ*_{1}/2 = 1, then the spectrum differs significantly: in particular, the number of singletons is reduced by about one-half, which should be significant, and the right tail is much heavier.

This can be understood as follows: if the seed bank leads to an extended time to the most recent common ancestor, then the proportion of old mutations should increase, and these should be visible in many sampled individuals, strengthening the right tail of the spectrum.

It is interesting to see that even in the presence of a large seed bank (say *K* = 0.1), large transition rates (say *c* = 100) do not seem to affect the normalised spectrum. Again, this can be understood intuitively, since by the arguments presented in the discussion after (12) large *c* should lead to a constant time change of the Kingman coalescent (with a time change depending on *K*). Such a time change does not affect the normalised spectrum.

One reason for considering the SFS is naturally that one would like to be able to use the SFS in inference, to determine, say, if a seed bank is present, and how large it is. If one has expressions for the expected SFS under some coalescent model, one can use the normalised expected SFS in an approximate likelihood inference (see e.g. ELDON *et al.*, 2015). The normalised spectrum is also appealing since it is quite robust to changes in the mutation rate (ELDON *et al.*, 2015). For comparison, Figure S2 shows estimates of the expected normalised spectrum, i.e. the expected value of the SFS divided by the total number of segregating sites, and shows a similar pattern as for the normalised expected spectrum in Figure 3.

## Distance statistics

Rigorous inference work is beyond the scope of the current paper. However, we can still consider (by simulation) estimates of the distribution of various commonly employed distance statistics. Distance statistics for the site-frequency spectrum are often employed to make inference about historical processes acting on genetic variation in natural populations. Commonly used statistics include the ones of TAJIMA (1989) (*D*_{T}), FU and LI (1993) (*D*_{FL}), and FAY and WU (2000) (*D*_{FW}). These statistics contrast different parts of the site-frequency spectrum (cf. e.g. ZENG *et al.*, 2006).

### The ℓ_{2} distance

Arguably the most natural distance statistic to consider is the ℓ_{2}-distance (or sum of squares) of the whole SFS (or some lumped version thereof) between the observed SFS and an expected SFS based on some coalescent model. The ℓ_{2} statistic (*n* denotes sample size) is a sum of squared deviations between the observed and the expected SFS, where, in our case, expectation and variance are taken with respect to the Kingman coalescent (FU, 1995). Estimates of the distribution of the ℓ_{2} statistic are shown in Figure 4. As the size of the seed bank increases (*K* decreases), one observes a worse fit of the site-frequency spectrum with the expected SFS associated with the Kingman coalescent.

### Tajima's *D*

Tajima’s statistic (*D*_{T}) for a sample of size *n*, with $a_n := \sum_{i=1}^{n-1} 1/i$, is defined as

$$D_T := \frac{\pi - S/a_n}{\sqrt{\widehat{\mathrm{Var}}\left(\pi - S/a_n\right)}}$$

(TAJIMA, 1989), where the variance depends on the mutation rate *θ*, which is usually estimated from the data. Under the Kingman coalescent, 𝔼[*D*_{T}] = 0. Values of *D*_{T} greater than 2 or smaller than *−*2 indicate deviations from the Kingman coalescent model that are significant at the 5% level. Negative values of *D*_{T} should appear if there is an excess of either low- or high-frequency polymorphisms and a deficiency of middle-frequency polymorphisms (see e.g. WAKELEY (2009) for further details). Positive values of *D*_{T} are to be expected if variation is common with moderate frequencies, for example in the presence of a recent population bottleneck, or balancing selection.
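For readers who wish to evaluate the statistic on simulated or real data, the sketch below computes Tajima's *D* from the sample size, the number of segregating sites, and the average number of pairwise differences, using the standard variance constants of TAJIMA (1989). It is an independent illustration with a hypothetical function name, not the C code used for the figures.

```python
import math


def tajimas_d(n, num_segregating, mean_pairwise_diff):
    """Tajima's D with the estimated variance e1*S + e2*S*(S-1)."""
    if n < 4 or num_segregating < 1:
        raise ValueError("need n >= 4 and at least one segregating site")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)
    s = num_segregating
    variance = e1 * s + e2 * s * (s - 1)
    # Numerator contrasts the pairwise-difference and Watterson estimators.
    return (mean_pairwise_diff - s / a1) / math.sqrt(variance)
```

A sample in which *π* exceeds *S*/*a*_{n} gives a positive value, which is the direction reported for large seed banks in the simulations.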

The empirical distribution of *D*_{T} was investigated by simulation for different seed bank parameters (Figures 5, S3), assuming that mutations do not occur in the seed bank (*θ*_{2} = 0). If the seed bank is large (*K* = 1/10, 1/100), then the median of *D*_{T} becomes significantly positive. For *c* = *K* = 1, there is very little deviation from the Kingman coalescent. Again *D*_{T} seems to be more sensitive to small values of *K* than to changes in *c*. This is in line with our results on 𝔼_{n,0} [*T*_{MRCA}], with highly elevated times for small *K*. In the latter case, old variation will dominate, thus resembling a population bottleneck, producing positive values of *D*_{T}.

In conclusion, *D*_{T} might not be a very good statistic to detect seed banks.

### Fu and Li’s *D*

The statistic *D*_{FL} of FU and LI (1993) is defined as

$$D_{FL} := \frac{S - a_n\,\xi_1}{\sqrt{u_n S + v_n S^2}},$$

with *S* being the total number of segregating sites, *ξ*_{1} the total number of singletons, $a_n := \sum_{i=1}^{n-1} 1/i$, and *u*_{n} and *v*_{n} as in FU and LI (1993) (see also DURRETT, 2008). As with *D*_{T}, 𝔼[*D*_{FL}] = 0 under the Kingman coalescent.

Figure 6 shows estimates of the distribution of *D*_{FL} assuming *θ*_{2} = 0. When the seed bank is large (*K* small), the distribution of *D*_{FL} becomes highly skewed: most genealogies produce a low number of singletons compared with the total number of polymorphisms, and hence positive *D*_{FL}. This is in line with our observations about the relative number of singletons associated with a large seed bank (Figures 3, S2), and the relative length of external branches (Figure S1).

### Fay and Wu’s *H*

The distance statistic *D*_{FW} of FAY and WU (2000) is defined as

$$D_{FW} := \pi - H,$$

where

$$H := \binom{n}{2}^{-1} \sum_{i=1}^{n-1} i^2\,\xi_i,$$

and *π* is the average number of pairwise differences. A formula for the variance of *D*_{FW} was obtained by ZENG *et al.* (2006). Figure 7 shows estimates of the distribution of *D*_{FW} with *n* = 100, *d* = 0, and *c* and *K* as shown. As the seed bank size increases (*K* decreases), high-frequency variants, as captured by *H*, become dominant over the middle-frequency variants captured by *π*. In conclusion, Fu and Li’s *D*_{FL}, or Fay and Wu’s *D*_{FW}, may be preferable over Tajima’s statistic *D*_{T} to detect the presence of a seed bank. A rigorous comparison of different statistics (including the *E* statistic of ZENG *et al.* (2006)), and their power to distinguish between absence and presence of a seed bank, must be the subject of future research.
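The contrast between high- and middle-frequency variants can be made explicit by computing the estimators directly from an unfolded SFS. The sketch below uses the standard SFS-based forms of Watterson's estimator, *π*, and the high-frequency estimator of FAY and WU (2000), and returns their unnormalised difference; the function names, and the choice of the unnormalised form, are assumptions of this example.

```python
def theta_estimators(sfs):
    """Return (theta_W, pi, theta_H) from an unfolded SFS of length n - 1."""
    n = len(sfs) + 1
    pairs = n * (n - 1) / 2.0
    a_n = sum(1.0 / i for i in range(1, n))
    theta_w = sum(sfs) / a_n
    # A site with derived count i differs in i * (n - i) of the pairs.
    pi = sum(i * (n - i) * x for i, x in enumerate(sfs, start=1)) / pairs
    # Weights i^2 emphasise variants at high frequency.
    theta_h = sum(i * i * x for i, x in enumerate(sfs, start=1)) / pairs
    return theta_w, pi, theta_h


def fay_wu_h(sfs):
    """Unnormalised Fay and Wu statistic: pi minus the high-frequency estimator."""
    _, pi, theta_h = theta_estimators(sfs)
    return pi - theta_h


low = fay_wu_h([1, 0, 0])    # one singleton in a sample of four
high = fay_wu_h([0, 0, 1])   # one variant carried by three of four sequences
```

A spectrum dominated by high-frequency variants gives a negative value (here `high` is −1), whereas a singleton-only spectrum gives a positive one.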

The C code written for the computations is available at http://page.math.tu-berlin.de/∼eldon/programs.html.

## Discussion

In the previous sections, we have presented and analysed an idealised model of a population sustaining a large seed bank, as well as the resulting patterns of genetic variability, with the help of a new coalescent structure, called the seed bank coalescent (with mutation). This ancestral process appeared naturally as scaling limit of the genealogy of large populations producing dormant forms, in a similar way as the classical Kingman coalescent arises in conventional models, under the following assumptions: the seed bank size is of the same order as the size of the active population, the population and seed bank size is constant over time, and individuals enter the dormant state by spontaneous switching independently of each other, in a way that individual dormancy times are comparable to the active population size. We begin with a discussion of these modeling assumptions.

The assumption that the seed bank is of comparable size to the active population is based on LENNON and JONES (2011), where it is shown in Box 1, Table a, that this is often the case in microbial populations.

Assuming constant population size is a very common simplification in population genetics, and can be explained with constant environmental conditions. We claim that ‘weak’ fluctuations (of smaller order than the active population size) still lead to the seed bank coalescent model, as is the case for the Kingman coalescent. However, seed banks are often seen as a bet hedging strategy against drastic environmental changes, which is not yet covered by our models. We see this as an important task for future research, which will require serious mathematical analysis. In the case of weak seed bank effects, fluctuating population size has been considered in ŽIVKOVIĆ and TELLIER (2012), where the presence of the seed bank was observed to lead to an increase of the effective population size.

Assuming spontaneous switching of single individuals between active and dormant state is also based on LENNON and JONES (2011) [p. 122/124]. This somewhat restricts the scope of the model because it will not capture major environmental changes that may trigger a simultaneous change of state of a large proportion of individuals (e.g. due to sudden lack of nutrients). This effect is closely related to drastic changes in population size, and again may lead to serious alterations of our predictions. Hence, including such large switching events will also be an important part of our future work (and will again require substantial mathematical work). In VITALIS *et al.* (2004) a whole proportion of the dormant population becomes active in every generation, but this should be seen in conjunction with the fact that dormancy is of limited duration, which excludes drastic alterations on a long time scale. Assuming that the time spent in the seed bank is of the order of the population size is one of the main features that distinguishes our model from previous models of weak seed bank effects as investigated in KAJ *et al.* (2001) and VITALIS *et al.* (2004). Statistical inference will be needed to support or reject this assumption, and to distinguish between weak and strong seed banks. One distinguishing feature of weak and strong seed banks is the behaviour of the normalised site frequency spectrum. Since weak seed banks lead to a genealogy which is a constant time change of Kingman’s coalescent (KAJ *et al.*, 2001; BLATH *et al.*, 2013), the normalised frequency spectrum of weak seed banks will be similar to that corresponding to the Kingman coalescent, while under our model we observe (at least for large seed banks) a reduction in the number of singletons (Figure S2). The model of KAJ *et al.* (2001) was used in TELLIER *et al.* (2011), where Tajima’s *D* was used in order to detect seed banks.

We now discuss our results for the behaviour of classical quantities describing genetic variability under our modeling assumptions, that is, when the genealogy of a sample can be described by the seed bank coalescent. In particular, we used it to derive recursions for quantities such as the time to the most recent common ancestor, the total tree length or the length of external branches. We investigated statistics of interest to genetic variability such as the number of segregating sites, the site frequency spectrum, Tajima’s *D*, Fu and Li’s *D* and Fay and Wu’s *H* by numerical solution of our recursions and by simulation. It turns out that the seed bank size *K* leads to significant changes, for example in the site frequency spectrum, producing a positive Tajima’s *D*, indicating the presence of old genetic variability, in line with intuition. Interestingly, the influence of *c* seems to be less pronounced. For *K → ∞* we observe convergence towards the Kingman coalescent regime, while *c → ∞* seems to lead to a constant time change of Kingman’s coalescent.

We are confident that our results so far have the potential to open up many interesting research questions, both on the modeling and on the statistical inference side, as well as in data analysis. For example, it should be interesting to derive a test to distinguish between the presence of strong vs. weak (resp. negligible) seed banks. Another important task in future research will be to infer parameters of the model. While the relative seed bank size *K* can in principle be directly observed by cell counting (LENNON and JONES, 2011), the parameter *c* seems to be difficult to observe, in particular because we have seen that many statistics we calculated are independent of, or at least not very sensitive to, *c.* On the other hand, this shows that our results are fairly robust under alterations of *c,* such that estimations or tests may be applied to some extent without prior knowledge of *c.* The mortality rate *d* may for many practical purposes be absorbed into the parameter *K*, which then measures the “effective” relative seed bank size.

Estimating the mutation rates *θ*_{1} and *θ*_{2} is another goal for the future. In particular, in view of an ongoing debate on the possibility of mutations in dormant individuals (MAUGHAN, 2007), it would be important to devise a test to determine if *θ*_{2} *>* 0.

## Acknowledgements

JB, BE, and NK acknowledge support by Deutsche Forschungsgemeinschaft (DFG) grant BL 1105/3-1 as part of SPP Priority Programme 1590 ‘Probabilistic Structures in Evolution’. ACG is supported by DFG RTG 1845 ‘Stochastic Analysis and Applications in Biology, Finance, and Physics’, the Berlin Mathematical School (BMS), and the Mexican Council of Science (CONACyT) in collaboration with the German Academic Exchange Service (DAAD). MWB is supported by DFG RTG 1845, and the BMS.