As with any new technology, next generation sequencing (NGS) has potential advantages and potential challenges. One advantage is the identification of multiple causal variants for disease that might otherwise be missed by SNP-chip technology. One potential challenge is misclassification error (as with any emerging technology) and the issue of power loss due to multiple testing.
Here, we develop an extension of the linear trend test for association that incorporates differential misclassification error and may be applied to any number of SNPs. We call the statistic the linear trend test allowing for error, applied to NGS, or LTTae,NGS. This statistic allows for differential misclassification.
The observed data are phenotypes for unrelated cases and controls, coverage, and the number of putative causal variants for every individual at all SNPs. We simulate data considering multiple factors (disease mode of inheritance, genotype relative risk, causal variant frequency, sequence error rate in cases, sequence error rate in controls, number of loci, and others) and evaluate type I error rate and power for each vector of factor settings. We compare our results with two recently published NGS statistics. Also, we create a fictitious disease model, based on downloaded 1000 Genomes data for 5 SNPs and 388 individuals, and apply our statistic to that data.
We find that the LTTae,NGS maintains the correct type I error rate in all simulations (differential and non-differential error), while the other statistics show large inflation in type I error for lower coverage. Power for all three methods is approximately the same for all three statistics in the presence of non-differential error.
Application of our statistic to the 1000 Genomes data suggests that, for the data downloaded, there is a 1.5% sequence misclassification rate over all SNPs.
Finally, application of the multi-variant form of LTTae,NGS shows high power for a number of simulation settings, although it can have lower power than the corresponding single variant simulation results, most probably due to our specification of multi-variant SNP correlation values.
In conclusion, our LTTae,NGS addresses two key challenges with NGS disease studies; first, it allows for differential misclassification when computing the statistic; and second, it addresses the multiple-testing issue in that there is a multi-variant form of the statistic that has only one degree of freedom, and provides a single p-value, no matter how many loci.