Ewens on the Substitutional Load

The following is a summary of Warren Ewens' arguments regarding the cost of natural selection from his book "Mathematical Population Genetics" (Ewens 1979). I have made a strong effort to summarize Ewens' work here, and while I hope to improve this page in the future, you can currently get a far clearer explanation of his work from his original papers. These are fully referenced in the bibliography.

Ewens summarized his arguments regarding the cost of natural selection in his book "Mathematical Population Genetics." In section 2.10 (Genetic Loads) he first addresses the substitution load, which he, along with Kimura, identifies as being sometimes called the "evolutionary load" or the "cost of natural selection" (Ewens, 1979, pg. 68). He shows that when $h = 0.5$ ($h$ being the coefficient of dominance), if the frequency of an allele A1 having selection coefficient $s$ is $x$, the mean fitness of the population will be $1 + sx$ and $l$ (the load, or cost contribution of selection, for a single generation) will be approximately $s(1 - x)$. Hence the load for the entire substitution process (for a single substitution) will be $L = \sum s(1 - x)$, summed over every generation in the course of the substitution. Note that this is identical to Haldane's formula for the cost of natural selection. He notes that the summation is approximated by $\int_{t_1}^{t_2} s(1 - x)\,dt$. Since the allele frequency changes by approximately $\tfrac{1}{2}sx(1 - x)$ per generation when $h = 0.5$, this is the same as $\int_{x_1}^{x_2} s(1 - x)\{\tfrac{1}{2}sx(1 - x)\}^{-1}\,dx$, which is $2\int_{x_1}^{x_2} x^{-1}\,dx = 2\log(x_2/x_1)$. Ewens notes that this differs only trivially from $-2\log(x_1)$ (at least for situations where $x_1$ is near 0 and $x_2$ is near 1).

He follows Haldane and Kimura in using starting and ending frequencies of 0.0001 and 0.9999 for the substituting allele to come up with a cost of 18.4 when $h$ is 0.5, and notes that the load is generally higher when the coefficient of dominance is not equal to 0.5. Following Haldane, Ewens uses a "typical" value of 30 for the substitution load / cost.
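
The figure of 18.4 can be checked numerically. The following is a minimal sketch, not Ewens' own procedure: it evaluates the closed form 2 log(x2/x1) and, for comparison, sums s(1 - x) generation by generation. It assumes the small-s approximation that the allele frequency changes by (s/2)x(1 - x) per generation when h = 0.5; the function names and the choice s = 0.01 for the summation are illustrative.

```python
import math

def substitution_load_closed_form(x1, x2):
    """Closed-form load for h = 1/2: L = 2 * ln(x2 / x1)."""
    return 2.0 * math.log(x2 / x1)

def substitution_load_by_summation(s, x1, x2):
    """Sum the per-generation load s * (1 - x) over one substitution.

    Assumption: the frequency changes by (s / 2) * x * (1 - x) each
    generation (small-s approximation for h = 1/2).
    """
    x = x1
    total = 0.0
    while x < x2:
        total += s * (1.0 - x)          # load contributed this generation
        x += 0.5 * s * x * (1.0 - x)    # advance the allele frequency
    return total

closed = substitution_load_closed_form(0.0001, 0.9999)
summed = substitution_load_by_summation(0.01, 0.0001, 0.9999)
print(f"closed form 2*ln(x2/x1):        {closed:.2f}")
print(f"generation-by-generation sum:   {summed:.2f}")
```

The two values should agree closely, with the discrete sum slightly above the integral because it is a left-endpoint approximation.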

Ewens then describes the meaning of the load as follows:

What does this calculation really mean? Suppose all selection is through viability differences and the number of reproducing adults in each generation remains constant at N. A considerable portion of the depletion in population numbers between birth and the age of reproduction is non-genetic. Taking only the genetic component and supposing there is no depletion through genic deaths of the optimal genotype A1A1, a straightforward calculation shows that when the frequency of A1 is x there must be N(1 + s)/(1 + sx) individuals at birth so that after differential viabilities operate, there are N individuals at maturity. Thus the average individual is required to leave approximately 1 + s(1 - x) offspring after non-genetic deaths are taken account of, so that there will be Ns(1 - x) "genetic deaths" in each generation associated with the evolutionary process. Summed over the entire process this gives NL individuals in all or an average of NL/T each generation if the substitutional process takes T generations.

Ewens then considers a series of such substitutions at different loci but with the same fitness parameters, each substitution starting regularly n generations apart. In this scenario, if each substitution takes T generations to complete, there will be T/n substitutions going on at any given time. This will lead to a total of (NL/T)(T/n) = NL/n "selective deaths" per generation. He then notes that if one sets an upper limit of 0.1N on this number (as per Haldane), then using the representative value of L = 30, one calculates a lower limit of n = 300, so that successive substitutions cannot start more frequently than once every 300 generations or the number of selective deaths will be too large for the population to carry.
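
The arithmetic behind the 300-generation limit is short enough to sketch directly. The function names below are illustrative, and the population size of 100,000 is only an example value chosen for the printout; the limit on n does not depend on N.

```python
def selective_deaths_per_generation(N, L, n):
    """(N*L/T) * (T/n) = N*L/n selective deaths per generation; T cancels."""
    return N * L / n

def minimum_spacing(L, max_fraction):
    """Smallest n (generations between substitution starts) for which
    N*L/n does not exceed max_fraction * N."""
    return L / max_fraction

n_min = minimum_spacing(L=30, max_fraction=0.1)
deaths = selective_deaths_per_generation(N=100_000, L=30, n=n_min)
print(f"minimum spacing n: {n_min:.0f} generations")
print(f"selective deaths per generation at that spacing (N = 100,000): {deaths:.0f}")
```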

Ewens then mentions that some have argued that selection through fertility differences may escape this load or cost problem. However, he shows that if one looks at the offspring requirements of the most fit individual required to drive a series of substitutions as described above, a similar argument can be made for selection driven by fertility differences as was just made for substitutions driven by viability differences. The offspring requirement of each individual of the most fit genotype (i.e., the individuals that have only the fitter gene at each locus of all of the ongoing substitutions) will be $1 + s(1 - x)$ for each locus currently undergoing substitution. Averaged over the $T$ generations of a substitution, the most fit individual will thus be required to produce $1 + L/T$ offspring for each locus currently undergoing substitution. Using a simple multiplicative model of fitness, this indicates that if $T/n$ substitutions are going on simultaneously, individuals with the most fit genotype will be required to produce $(1 + L/T)^{T/n}$ offspring in all each generation. This is approximately $\exp(L/n)$, which is $\exp(30/n)$ using the typical value of 30 for the substitution load. (Ewens notes that this is approximately 1.1 offspring per parent for the most fit genotype when $n = 300$ generations as suggested by Haldane.) Ewens shows that the offspring requirement per parent of the most fit genotype rises rapidly as $n$ (the number of generations between the starts of successive substitutions) decreases. If $n$ is small (as suggested by Kimura), the offspring requirement per parent of the most fit genotype will be high. Ewens gives an example of the kind of numbers required in a later section (9.2) of the book, where he shows that these offspring requirements for either viability or fertility selection are not really the problem envisioned by Kimura.
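
To see how quickly the requirement grows as n shrinks, here is a small sketch comparing the multiplicative expression (1 + L/T)^(T/n) with its approximation exp(L/n). The value T = 3680 generations is an assumption borrowed from the s = 0.01 example discussed later on this page; the function names are illustrative.

```python
import math

def offspring_requirement(L, T, n):
    """Multiplicative requirement over T/n simultaneous substitutions."""
    return (1.0 + L / T) ** (T / n)

def offspring_requirement_approx(L, n):
    """Approximation exp(L/n), valid when L/T is small."""
    return math.exp(L / n)

L_TYPICAL = 30      # "typical" substitution load used in the text
T_ASSUMED = 3680    # assumed substitution time in generations (s = 0.01)

for n in (300, 100, 30, 10):
    exact = offspring_requirement(L_TYPICAL, T_ASSUMED, n)
    approx = offspring_requirement_approx(L_TYPICAL, n)
    print(f"n = {n:>3}: (1 + L/T)^(T/n) = {exact:8.3f}   exp(L/n) = {approx:8.3f}")
```

At n = 300 both expressions come out near 1.1, in line with the value Ewens quotes.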

In section 9.2 (Arguments Leading to the Neutral Theory: Loads), Ewens gives an example of the offspring numbers required for a series of substitutions as described above. He recaps Kimura's then-recent (as of 1979) estimate of the substitution rate, six substitutions per generation, which puts $n = 1/6$. Plugging this value into the load equation, $\exp(30/n)$, gives an offspring requirement of $\exp(30/(1/6)) = \exp(180) \approx 10^{78}$ offspring. Ewens then quotes Kimura to show agreement on this point:

"to carry out mutant substitution at the above rate, each parent must leave $e^{180} \approx 10^{78}$ offspring for only one of the offspring to survive. This was the main reason why random fixation of selectively neutral mutants was first proposed by one of us as the main factor in molecular evolution."

Ewens mentions that this huge offspring requirement only applies to parents of the "most fit genotype" and does not apply to the average individual. He refers back to his derivation of the offspring requirement, $\exp(30/n)$, that I have described above. Ewens then rederives the same equation using a different set of arguments:

First, he assumes a sequence of loci that are substituting because of selective differences at each locus, with $h$ (the coefficient of dominance) $= 1/2$ and a selection coefficient of $s$. The contribution of a single locus undergoing substitution to the average fitness ($w_{avg}$) of the population is expected to be $1 + sx$ (a proof is given at the end of this page). Considering multiple loci and multiplicative fitnesses, $w_{avg} = \prod_i (1 + sx_i)$; that is, the average fitness will be the product of the terms $1 + sx_i$, where $x_i$ is the frequency of the favored allele at the $i$th locus undergoing substitution. If there are $J$ loci undergoing substitution at any one time, the average fitness will be approximated by $w_{avg} = (1 + \tfrac{1}{2}s)^J$. If each substitution takes $T$ generations and a new substitution starts every $n$ generations, then $J = T/n$ and $w_{avg} = (1 + \tfrac{1}{2}s)^{T/n} \approx \exp(\tfrac{1}{2}sT/n)$. The fitness of the individual having the optimal genotype (homozygous for each of the favorable alleles undergoing substitution) will be given by $w_{max} = (1 + s)^{T/n}$, which is approximately equal to $\exp(sT/n)$. If the fitnesses are rescaled so that $w_{avg} = 1$, then the fitness requirement for the optimal genotype will be $\exp(sT/n)/\exp(\tfrac{1}{2}sT/n) = \exp(\tfrac{1}{2}sT/n)$. To determine $T$ (so that we will know how many generations are required for a substitution), Ewens uses the usual starting and ending values for favorable gene frequencies (0.0001 and 0.9999 respectively) and the formula $T = \int_{x_1}^{x_2} \{\tfrac{1}{2}sx(1 - x)\}^{-1}\,dx$, where $x_1 = 0.0001$ and $x_2 = 0.9999$. This yields $T = 36.8/s$, meaning that a substitution under these conditions where $s = 0.01$ will require around 3,680 generations. Plugging this value back into the equation for the offspring requirement of the optimal genotype (Ewens refers to this as $l$, the substitution load, with $l = \exp(\tfrac{1}{2}sT/n)$) gives $l = \exp(\tfrac{1}{2} \times 0.01 \times 3680/n) = \exp(18.4/n)$.

Using the substitution rate estimated by Kimura of 6 substitutions per generation ($n = 1/6$) puts the offspring requirement of the most fit genotype at $\exp(18.4/(1/6)) = \exp(110.4) \approx 9 \times 10^{47}$, a ridiculous number of offspring for any living creature. Furthermore, using the "representative value" of 30 for the substitution cost (to account for increases to the cost due to dominance effects) recovers Kimura's estimate of $\exp(30/(1/6)) = \exp(180) \approx 10^{78}$, another impossible offspring requirement.
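
The numbers in this derivation can be reproduced with a few lines. This is a sketch under stated assumptions rather than Ewens' own calculation: it takes the rate of frequency change as (s/2)x(1 - x), which is what the result T = 36.8/s requires, and it reports the offspring requirement as a base-10 logarithm so that the very large values are easy to read. All function names are illustrative.

```python
import math

def substitution_time(s, x1=0.0001, x2=0.9999):
    """T = integral from x1 to x2 of dx / ((s/2) * x * (1 - x)).

    The antiderivative of 1 / (x * (1 - x)) is ln(x / (1 - x)).
    """
    def logit(x):
        return math.log(x / (1.0 - x))
    return (2.0 / s) * (logit(x2) - logit(x1))

def log10_offspring_requirement(load, n):
    """log10 of exp(load / n), the requirement for the optimal genotype."""
    return (load / n) / math.log(10.0)

def simultaneous_substitutions(T, n):
    """J = T / n loci substituting at any one time."""
    return T / n

n_kimura = 1.0 / 6.0   # six substitutions starting per generation
T = substitution_time(0.01)
print(f"T for s = 0.01: about {T:.0f} generations (the text rounds to 3,680)")
print(f"load 18.4: requirement is about 10^{log10_offspring_requirement(18.4, n_kimura):.2f}")
print(f"load 30:   requirement is about 10^{log10_offspring_requirement(30.0, n_kimura):.2f}")
print(f"J = T/n with T = 3680: {simultaneous_substitutions(3680, n_kimura):.0f}")
```

The unrounded integral gives a T slightly above 3,680, which is why the sketch uses the rounded loads 18.4 and 30 from the text when computing the offspring requirements.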

After a qualitative discussion of some factors (e.g., frequency dependence and non-multiplicative epistasis among the various substituting loci) that can be expected to reduce the substitution load and hence the offspring requirement of the optimal genotype, Ewens moves on to the most critical factor that reduces the substitution load. He takes a series of substitutions with an initial frequency of 0.0001, a final frequency of 0.9999, a coefficient of dominance ($h$) of 0.5, and a selection coefficient ($s$) of 0.01. If 6 substitutions start each generation (as suggested by Kimura), so that $n = 1/6$, and each substitution takes about 3,680 generations, then there will be $3680 \times 6 = 22{,}080$ substitutions going on at any given time. That means that there will be 22,080 genes in the process of going from a frequency of very nearly 0 in the population to fixation. Many of these genes, having begun the substitution process relatively recently, will have quite low frequencies in the population, making individuals carrying the optimal genotype quite rare. Under these conditions, Ewens calculated the probability of any one individual having the optimal genotype (i.e., having all 22,080 beneficial alleles simultaneously) as $10^{-23{,}200}$. Needless to say, such an individual is never going to exist in a finite population!

Ewens then addressed the problem of determining the fittest genotype that is likely to actually exist in a finite population. Using the statistics of extreme values in a population of finite size, Ewens shows that if the mean and variance of the number of preferred (fitter) alleles are known for a population, the fittest genotype that is likely to actually exist can be determined. He refers to an earlier paper (Ewens 1970) for the derivation that the variance in fitness arising from the series of substitutions described above is given by $s/n$. Using $s = 0.01$ and $n = 1/6$, the variance will be 0.06, which gives a standard deviation of $\sqrt{0.06} = 0.245$. Using the statistics of extreme values, Ewens stated that for a population of size $10^5$, if $s$ is small (less than 0.1), the population fitness distribution should be approximately normal and the most fit individual in that population would be expected to have a fitness that is no more than 4 standard deviations above the mean. (He references Pearson and Hartley, 1958, Table 28 for this.) For our example, the standard deviation of fitness is 0.245, which leads to an expected optimal fitness of $1 + 4(0.245) = 1.98$.
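
The calculation above reduces to a one-line formula, sketched here with illustrative function names. The multiplier of 4 standard deviations is taken from the text as given; it is passed in as a parameter (called k here, a name of my own) rather than derived from extreme-value statistics.

```python
import math

def fitness_sd(s, n):
    """Standard deviation of fitness, taking the variance as s / n."""
    return math.sqrt(s / n)

def likely_fittest_fitness(s, n, k=4.0):
    """Fitness of the fittest individual likely to exist, taken as
    k standard deviations above a mean fitness of 1."""
    return 1.0 + k * fitness_sd(s, n)

s = 0.01
n = 1.0 / 6.0   # six substitutions starting per generation
print(f"variance s/n:       {s / n:.3f}")
print(f"standard deviation: {fitness_sd(s, n):.3f}")
print(f"likely fittest:     {likely_fittest_fitness(s, n):.2f}")
```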

Ewens' calculations indicate that a population maintained at around 100,000 individuals is capable of driving six substitutions per generation (the highest rate ever claimed for amino acid substitutions among a variety of mammal lineages) with a reproductive excess of 1.98 - 1 = 0.98 offspring per parent. Although this offspring requirement is high compared to Haldane's claim that the intensity of natural selection rarely exceeds 0.1, it is well within the reproductive capabilities of humans and apes, where a family size of 4 children will meet the requirement. Families having more than four children will have "extra" offspring available to "pay the cost" of deleterious mutations, random death, and other non-substitutional causes. Nonetheless, despite the questionable significance of Haldane's limit of 10% for the selection intensity, we can easily turn the equation around to see how many substitutions can occur without exceeding a 10% reproductive excess to pay the substitution cost:

$1 + 4(s/n)^{0.5} = 1.1$
$4(s/n)^{0.5} = 0.1$
$16s/n = 0.01$
$n = 16s/0.01$

For $s = 0.01$,

$n = 16 \times 0.01/0.01$
$n = 16$, that is, 1 substitution every 16 generations.

For the 500,000 generations in the combined human / chimp lines, this would allow 31,250 substitutions.

It's also worth noting that this number is dependent upon the selection coefficient. If the bulk of selection coefficients for substitutions are closer to 0.001 than to 0.01, then Ewens' formula would allow a substitution rate of 1 every 1.6 generations, permitting around 300,000 substitutions in the combined 500,000 generations separating chimps and humans from their common ancestor.
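
The inversion can be written as a small function so that other selection coefficients are easy to try. This is a sketch with illustrative names; the 10% excess and the multiplier of 4 standard deviations are the values used in the text, exposed here as parameters.

```python
def generations_per_substitution(s, excess=0.1, k=4.0):
    """Solve 1 + k * sqrt(s / n) = 1 + excess for n: n = k^2 * s / excess^2."""
    return (k * k * s) / (excess * excess)

def substitutions_allowed(total_generations, s, excess=0.1, k=4.0):
    """Number of substitutions that fit into total_generations."""
    return total_generations / generations_per_substitution(s, excess, k)

for s in (0.01, 0.001):
    n = generations_per_substitution(s)
    count = substitutions_allowed(500_000, s)
    print(f"s = {s}: one substitution every {n:.1f} generations, "
          f"{count:,.0f} substitutions in 500,000 generations")
```

For s = 0.001 the count comes out at 312,500, which the text rounds to "around 300,000".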

Proof that wavg = 1 + sx for a Diploid Species When h = 1/2

s          Selection Coefficient
h          Coefficient of Dominance
x          Frequency of the "favored" allele. Favored indicates that possession of the allele improves fitness.
1 - x    Frequency of the "non-favored" allele.

Individuals of diploid species have two copies of each gene. We can designate the favored allele as A and the non-favored allele as a. Therefore, if there are two alleles (versions) of this particular gene, then there are three kinds of individuals (genotypes, actually) that may exist:

AA - Has 2 copies of the favored allele. This individual would be homozygous (has 2 copies of the same allele) for the favored allele.
Aa - Has 1 copy of the favored allele and 1 copy of the non-favored allele. Such an individual is heterozygous - it has 2 different alleles.
aa  - Has 2 copies of the non-favored allele. Individuals with the aa genotype are said to be homozygous for the non-favored allele.

Fitnesses for each of the three kinds of individuals can be calculated from which alleles they have (their genotypes). Fitness contributions are calculated from the selection coefficient (the fitness advantage of an individual that is homozygous for the favored allele) and the coefficient of dominance (the degree of dominance of the favored allele, ranging from 0 (completely recessive) to 1.0 (fully dominant)). The fitness for each genotype is given as follows:

AA - 1 + s
Aa  - 1 + sh
aa   - 1

The average fitness of a population is wavg, given by the sum of the fitness of each possible genotype multiplied by that genotype's frequency in the population:

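$$
\begin{aligned}
w_{avg} &= (1 + s)x^2 + 2(1 + sh)x(1 - x) + (1 - x)^2 \\
&= x^2 + sx^2 + 2x(1 - x + sh - shx) + 1 - 2x + x^2 \\
&= x^2 + sx^2 + 2x - 2x^2 + 2shx - 2shx^2 + 1 - 2x + x^2 \\
&= x^2 + sx^2 - 2x^2 + 2shx - 2shx^2 + 1 + x^2 && \text{[the } 2x \text{ terms cancel: } 2x - 2x = 0\text{]} \\
&= sx^2 + 2shx - 2shx^2 + 1 && \text{[the } x^2 \text{ terms cancel: } x^2 - 2x^2 + x^2 = 0\text{]} \\
&= 1 + sx^2 + 2shx(1 - x) \\
&= 1 + sx\,(x + 2h(1 - x))
\end{aligned}
$$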

If h = 1/2, then 2h(1 - x) will reduce to 1 - x:

wavg = 1 + sx(x + 1 - x)

wavg = 1 + sx(1)

wavg = 1 + sx
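
As a check on the algebra, the expanded and simplified forms can be compared exactly using rational arithmetic. This is only a spot check on a grid of values, not a substitute for the proof; the function names and the grid are illustrative.

```python
from fractions import Fraction

def mean_fitness(s, h, x):
    """Sum of genotype fitness times genotype frequency:
    AA (1 + s) at x^2, Aa (1 + sh) at 2x(1 - x), aa (1) at (1 - x)^2."""
    return (1 + s) * x**2 + 2 * (1 + s * h) * x * (1 - x) + (1 - x)**2

def mean_fitness_simplified(s, h, x):
    """Simplified form: 1 + s*x*(x + 2h(1 - x))."""
    return 1 + s * x * (x + 2 * h * (1 - x))

s = Fraction(1, 100)
half = Fraction(1, 2)
grid = [Fraction(k, 20) for k in range(21)]   # 0, 0.05, ..., 1

# The simplified form should match the full sum for any h and x.
general_ok = all(
    mean_fitness(s, h, x) == mean_fitness_simplified(s, h, x)
    for h in grid for x in grid
)
# With h = 1/2 the mean fitness should reduce to exactly 1 + s*x.
additive_ok = all(mean_fitness(s, half, x) == 1 + s * x for x in grid)

print(general_ok, additive_ok)   # prints: True True
```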