August 04, 2008

Validation of Ken Nordtvedt's Interclade age estimation method

Ken Nordtvedt has proposed an Interclade estimation method for Y-STR data with this basic idea:

When we sample two alleles from a population, we generally don't know when their common ancestor lived. The common ancestor of alleles a and b may have lived 100 generations ago, and the common ancestor of alleles a and c may have lived 200 generations ago.

The Interclade method circumvents this problem by cleverly exploiting the Y-chromosome phylogeny: alleles found in two different haplogroups (neither of which is a subgroup of the other) coalesce to precisely one man, the common ancestor of these two haplogroups.[1]

For example, an allele sampled in a haplogroup J1 man and an allele sampled in a haplogroup J2 man coalesce to the unique common ancestor of haplogroups J1 and J2.[2]

The Interclade method is encapsulated in the following formula:

\[
\frac{1}{N_A N_B} \sum_{x \in A} \sum_{y \in B} (x - y)^2 = 2 \mu g
\]

where x and y are alleles sampled from the two groups A and B, and N_A and N_B are the numbers of alleles sampled in the two groups. μ is the mutation rate at the locus under consideration, and g is the number of generations that have elapsed since the common ancestor of groups A and B. This equation leads to an estimate of g by dividing the left-hand side by 2μ.
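A minimal Python sketch of this estimator may help. It assumes that the left-hand side is the mean of (x − y)² over all pairs made of one allele from A and one from B. The function name, the repeat counts, and the mutation rate are illustrative and not taken from Nordtvedt's work.

```python
def interclade_generations(alleles_a, alleles_b, mu):
    """Estimate g as the mean squared allele difference divided by 2 * mu.

    alleles_a, alleles_b: repeat counts at one locus in groups A and B.
    mu: per-generation mutation rate at that locus (assumed known).
    """
    if not alleles_a or not alleles_b:
        raise ValueError("both groups need at least one allele")
    # Sum (x - y)^2 over every pair with x from A and y from B.
    total = sum((x - y) ** 2 for x in alleles_a for y in alleles_b)
    mean_squared_difference = total / (len(alleles_a) * len(alleles_b))
    return mean_squared_difference / (2 * mu)


# Hypothetical repeat counts at one locus and a hypothetical mutation rate.
group_a = [13, 14]
group_b = [15, 15]
print(interclade_generations(group_a, group_b, mu=0.0025))
```

The result is in generations; converting it to years requires a separate assumption about generation length.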

In this post, I will examine the properties of this estimator. Each reported number is an average over 10,000 simulations. The number of sons each man has is Poisson-distributed with parameter m. The two groups are created by first generating two independent founders who lived g-g' generations after the common ancestor; thus, these founders lived g' generations before the present. Their present-day descendants are then collected. In my simulations, I keep g=100 and vary m, a parameter regulating how fast haplogroups grow, and g', the antiquity of the two groups.[3]
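The simulation set-up described above can be sketched roughly as follows. This is not the code used for the results reported in this post. It assumes a symmetric single-step mutation model (the post does not state one), a hypothetical mutation rate, and that a founder is redrawn whenever his line dies out before the present. All function names are illustrative.

```python
import math
import random


def poisson(rng, lam):
    """Draw a Poisson-distributed integer with mean lam (Knuth's method)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def mutate(rng, allele, mu):
    """Assumed mutation model: with probability mu, step by +1 or -1."""
    if rng.random() < mu:
        return allele + rng.choice((-1, 1))
    return allele


def surviving_group(rng, g, g_prime, m, mu):
    """Alleles of one founder's present-day descendants, redrawn until non-empty."""
    while True:
        allele = 0  # allele of the common ancestor, used as the origin
        for _ in range(g - g_prime):  # single line of descent down to the founder
            allele = mutate(rng, allele, mu)
        men = [allele]
        for _ in range(g_prime):  # branching process from founder to present
            sons = []
            for father in men:
                for _son in range(poisson(rng, m)):
                    sons.append(mutate(rng, father, mu))
            men = sons
            if not men:
                break
        if men:
            return men


def interclade_estimate(group_a, group_b, mu):
    """Mean squared difference between the groups, divided by 2 * mu."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    mean_sq_a = sum(x * x for x in group_a) / len(group_a)
    mean_sq_b = sum(y * y for y in group_b) / len(group_b)
    # Average of (x - y)^2 over all pairs, without looping over the pairs.
    msd = mean_sq_a - 2 * mean_a * mean_b + mean_sq_b
    return msd / (2 * mu)


def run_simulation(g, g_prime, m, mu, runs, seed=0):
    """Return (average estimate, average |estimate - g|) over the runs."""
    rng = random.Random(seed)
    total = total_error = 0.0
    for _ in range(runs):
        a = surviving_group(rng, g, g_prime, m, mu)
        b = surviving_group(rng, g, g_prime, m, mu)
        estimate = interclade_estimate(a, b, mu)
        total += estimate
        total_error += abs(estimate - g)
    return total / runs, total_error / runs
```

A call such as `run_simulation(g=100, g_prime=50, m=1.0, mu=0.0025, runs=200)` returns the average age estimate and the average absolute error for one combination of m and g'. The numbers it produces depend on the assumed mutation model and rate, so they need not match the table.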

Simulation Results

The following table shows m, g', the average age estimate, and the average error |age estimate-100|.

m       g'    Estimated Age    Estimate Error
1       99    102              100
1.01    99    97               90
1.05    99    101              64
1       50    102              111
1.01    50    100              110
1.05    50    98               102
1       10    101              125
1.01    10    100              124
1.05    10    98               124


Remarks

The Interclade method is bias-free, a very attractive property: its average estimate stays close to the true age of 100 generations regardless of how recent the two groups are or what kind of population expansion they experienced.

Its error is dependent on population history (m) and the antiquity of the two groups (g'). It is minimized when the two groups were founded soon after their common ancestor and then expanded at a fast rate.

The average error is substantial, but in practice the estimator will be applied over many STR loci, which reduces it. The residual error of the age estimate will then be entirely due to our ignorance of (i) generation length, (ii) precise germline mutation rates, and (iii) the mutation process in general.
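To illustrate the use of many loci, here is a small Python sketch that computes one estimate per locus and takes their simple average. The averaging rule is an assumption made for illustration, and other ways of combining loci are possible. The haplotypes, mutation rates, and function names are hypothetical.

```python
def mean_squared_difference(alleles_a, alleles_b):
    """Mean of (x - y)^2 over all pairs with x from A and y from B."""
    total = sum((x - y) ** 2 for x in alleles_a for y in alleles_b)
    return total / (len(alleles_a) * len(alleles_b))


def multilocus_estimate(haplotypes_a, haplotypes_b, mutation_rates):
    """Return (average of per-locus estimates, list of per-locus estimates).

    Each haplotype is a tuple of repeat counts, one entry per locus, in the
    same order as mutation_rates. Averaging the per-locus estimates with
    equal weight is an assumption of this sketch.
    """
    per_locus = []
    for locus, mu in enumerate(mutation_rates):
        alleles_a = [h[locus] for h in haplotypes_a]
        alleles_b = [h[locus] for h in haplotypes_b]
        msd = mean_squared_difference(alleles_a, alleles_b)
        per_locus.append(msd / (2 * mu))
    return sum(per_locus) / len(per_locus), per_locus


# Hypothetical three-locus haplotypes for two men in each group.
group_a = [(13, 24, 14), (13, 25, 14)]
group_b = [(14, 24, 14), (14, 23, 15)]
rates = (0.0025, 0.0025, 0.0025)  # hypothetical per-locus mutation rates

combined, by_locus = multilocus_estimate(group_a, group_b, rates)
print(by_locus, combined)
```

If the loci mutate independently, the per-locus errors partly cancel in the average, which is why the single-locus errors in the table overstate the error expected in practice.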

[1] Ignoring, of course, as is commonly done, stochasticity in generation size.
[2] This common ancestor was a J man but not necessarily the ancestor of all J men, since there are also J*(xJ1,J2) men in the world, i.e. men who belong to J but to neither J1 nor J2.
[3] In general the two groups will coalesce to different ages, but the assumption that they coalesce to g' allows us to investigate how their antiquity affects the estimator.

2 comments:

KerryODair said...

Looking forward to making the calculations in the E-M35 group as a whole with this new tool. Thanks for the help, Dienekes. We will see if it passes the history test or possibly creates new ideas on history.

My V13 expert states that Ken's 2020 B.C. ASD is very close to everything we have calculated. The next hurdle is the generational age. There has been much heated debate on this based on papers and forums. The real issue is whether to use 25, 30, or 31.5 years. Personally I like 31.5 years per generation. Using 31.5 years, the result is very close to the dates in Cruciani's 2007 paper.

McG said...

I do interclade a little differently. Take S21 and S116 as presented at their FTDNA websites. (Use only 13s, not 12s, at 393.)

First I converge each set separately (first 12 DYS loci); I could use others as long as they are slow. Over those two sets, after convergence, I look at the modal allele values. They disagree at only one, 390. I get an answer in this case of a little over 1000 years. I previously had calculated a TMRCA of 6716 BP for S21 and 8320 BP for S116. I do not know my SD, so if I simply average the two and add the 1000, I get about 8500 BP. Again, the issue of accuracy is important, although I do use 12 STRs.

Any other rates would introduce different MRCAs. Note the number of entries is only a little over 100, so that introduces some variability.

I cannot find the note in which you said something to the effect that the occurrence of slow mutations acts like a bottleneck and slows all rates down, but I believe I read it. I believe that, not the die-off of lineages, is the real source of the rate difference? It's a constraint on how mutations occur, but it is hard to express how it works. I believe it is expressed by the statement that Goldstein made, that theoretically any DYS locus can be used to calculate TMRCA, but in practice you average several. It is this fact, that any DYS locus can be used, that constrains, in some way, unlimited mutations by any one DYS locus? I'd like to read your opinion on this comment.