If you are already familiar with the R-L21 SNP Predictor Tool, you can go straight to the tool:
R-L21 SNP Predictor
Like any tool, this one rests on some major assumptions. It is not a generic Y-SNP predictor for any Y-STR submission. It is limited to submissions that have tested positive for R-L21 or are expected to test positive for R-L21. If this condition is not met, the tool will not accurately predict Y-SNPs for you. The tool also assumes that 67 markers have been tested by Family Tree DNA. Any markers beyond 67 are ignored at this time. The tool does not work for submissions with fewer than 67 markers, since any missing Y-STRs would be counted as mismatches against the 67 marker fingerprints used. For most Y-SNPs, 37 markers are not adequate for accurate Y-SNP prediction, and 67 markers are always much more accurate for recommending Y-SNP tests. Before using this tool, you should upgrade to 67 markers and should have tested positive for R-L21 via the deep clade test (or a special order test for L21).
The input screen is very flexible and accepts many formats. Since the ordering of Y-STRs differs considerably between FTDNA generated reports and Y-Search generated reports, you must specify the source of your report. In Y-Search reports, older style FTDNA reports and many surname project reports, extra multi-copy markers (19b, 464e, 464f and 464g) can appear. You must either remove these extra markers manually or check the appropriate boxes and let the tool remove them for you. Do not do both, or markers will incorrectly be removed twice. Some projects replace the 389-2 value with the delta between 389-2 and 389-1, which is better for analysis. If your data is in this format (like the R-L21 spreadsheet found in the files section of the L21Project Yahoo group), just change the 389-2 format to delta for proper analysis.
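The two clean-up steps described above can be illustrated with a minimal Python sketch. It is not the tool's actual code: the marker labels used as dictionary keys, the function names and the sample values are all assumptions made for the illustration.

```python
# Hypothetical marker labels; the tool's internal names are not documented here.
EXTRA_COPIES = {"19b", "464e", "464f", "464g"}


def strip_extra_copies(markers):
    """Return a copy of the marker dict without the extra multi-copy markers."""
    return {name: value for name, value in markers.items()
            if name not in EXTRA_COPIES}


def to_delta_389(markers):
    """Return a copy with 389-2 stored as (389-2 minus 389-1)."""
    converted = dict(markers)
    converted["389-2"] = markers["389-2"] - markers["389-1"]
    return converted


# Invented sample values, only a few markers shown.
report = {"389-1": 13, "389-2": 29, "464a": 15, "464e": 17}
cleaned = to_delta_389(strip_extra_copies(report))
```

Keying the values by marker name means that removing an extra marker cannot shift the values that follow it, which is the risk when markers are removed twice from a positional list.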
The copy and paste method works for most surname project web sites as well. I will keep enhancing the supported input formats, so let me know if you discover another variation that you want added to the input screen. The input screen currently supports tab, space, comma and line ending characters as delimiters between marker values. The new FTDNA format, which uses dash characters for multi-copy markers, is handled automatically as well. To ensure that the input string was read properly, the tool reports back the markers entered so that the marker values can be verified. By checking for the high values of CDYa and CDYb and making sure that 67 markers were found, you can verify that the markers were entered correctly. After verification, use the submit button to have the R-L21 SNP Predictor produce a report that shows which Y-SNPs are likely to test positive.
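As a rough illustration of the delimiter handling and the 67 marker check, here is a small Python sketch. The function names are hypothetical, and treating a dash as just another delimiter is an assumption about how the dash format could be handled, not a description of the tool's code.

```python
import re

EXPECTED_MARKERS = 67  # the tool assumes a 67 marker FTDNA test


def parse_marker_values(text):
    """Split pasted text on tabs, spaces, commas, line endings and dashes."""
    # Assumption: a dash between multi-copy values acts like any other delimiter.
    tokens = re.split(r"[\s,\-]+", text.strip())
    return [int(token) for token in tokens if token]


def has_expected_count(values):
    """True when exactly 67 marker values were found."""
    return len(values) == EXPECTED_MARKERS


# Invented sample input mixing every supported delimiter.
values = parse_marker_values("13\t24 14,11\n11-14")
```

A non-numeric token makes `int` raise a `ValueError` in this sketch, so a malformed paste fails loudly instead of producing a shifted marker list.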
The SNP Predictor tool currently analyzes only the more recent and less broad Y-SNPs under R-L21. Most of these are private or near private Y-SNPs, but a few broader Y-SNPs such as M222 and L226 are included. To preserve statistical accuracy, only Y-SNPs with single or double fingerprints are believed to be predictable with this methodology. There are around ten multiple fingerprint Y-SNPs under R-L21 that are too broad and too old to be accurately predicted with this methodology. The L21 Y-SNP itself and many other older R-L21 Y-SNPs that have multiple fingerprints are not predictable at this point in time.
Y-SNPs under R-L21 are now being discovered on a weekly basis. Newly discovered Y-SNPs will not be analyzed until they can be ordered from FTDNA under "Special Orders", and it usually takes another one or two months to validate and analyze them. Some Y-SNPs appear to be duplicates (mutated very close to other Y-SNPs) and are not analyzed until they are determined not to be duplicates. Only Y-SNPs that have tested positive exclusively for R-L21 and that have been tested to some degree can be safely added to the R-L21 Y-SNP prediction tool. All less broad and more private Y-SNPs will be updated as more submissions are tested. A prediction for any Y-SNP with fewer than 10 to 20 tested submissions should be considered speculative. Due to the nature of Y-SNP data, any prediction that falls in the transitional area (where values start transitioning from 0 to 1) should also be considered speculative until 100 to 200 submissions are tested.
The R-L21 Predictor tool uses a DNA fingerprint matching methodology based on all known L21+ submissions. A Y-SNP fingerprint is first determined from all submissions that test positive for the Y-SNP. The fingerprint consists of all off-modal Y-STR mutations between the MRCA of all L21 submissions and the MRCA that best represents all submissions that test positive for the Y-SNP. All submissions that closely match the Y-SNP fingerprint are then analyzed and sorted by the number of fingerprint matches. Based on actual submissions that have tested negative or positive for the Y-SNP, the probability of testing positive is derived using fitted curves.
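The fingerprint idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the marker names (`m1` to `m4`), kit names and values are invented, the function names are hypothetical, and exact-value matching is assumed where the tool may score matches differently.

```python
def derive_fingerprint(l21_modal, snp_modal):
    """Keep the markers where the Y-SNP modal differs from the L21 modal."""
    return {marker: value for marker, value in snp_modal.items()
            if l21_modal.get(marker) != value}


def count_matches(submission, fingerprint):
    """Count fingerprint markers whose value the submission shares exactly."""
    return sum(1 for marker, value in fingerprint.items()
               if submission.get(marker) == value)


def rank_submissions(submissions, fingerprint):
    """Sort (name, match count) pairs from most to fewest matches."""
    scored = [(name, count_matches(haplotype, fingerprint))
              for name, haplotype in submissions.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented modal haplotypes and submissions, four placeholder markers each.
l21_modal = {"m1": 13, "m2": 24, "m3": 14, "m4": 11}
snp_modal = {"m1": 13, "m2": 25, "m3": 15, "m4": 11}
submissions = {
    "kit_c": {"m1": 13, "m2": 24, "m3": 14, "m4": 11},
    "kit_a": {"m1": 13, "m2": 25, "m3": 15, "m4": 11},
    "kit_b": {"m1": 13, "m2": 24, "m3": 15, "m4": 11},
}
fingerprint = derive_fingerprint(l21_modal, snp_modal)
ranked = rank_submissions(submissions, fingerprint)
```

A marker missing from a submission never equals the fingerprint value here, so it counts as a mismatch, which mirrors why submissions with fewer than 67 markers are not supported.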
Since submissions are tested for Y-SNPs daily, new 67 marker Y-STR submissions become available daily and already tested R-L21 submissions are discovered every day, this tool is constantly being updated with additional input, which usually increases accuracy. Over time, a Y-SNP fingerprint can change slightly as new submissions are discovered or tested for the first time. For any Y-SNP that has only a handful of submissions testing positive, the results can change significantly after more submissions are tested and analyzed. At this point in time, most of the analysis is manual, but statistical software tools have recently been brought in to improve accuracy. I am currently working on methods to automate this analysis for more accuracy and increased coverage of the Y-SNPs being analyzed. If you see areas for improvement in this tool, feel free to drop me an email with your enhancements and corrections.
If you want to help this project indirectly, you have several options to enhance the discovery and prediction of newly discovered Y-SNPs. First, collecting and updating valid R-L21 submissions is currently very tedious. Many surname and haplogroup projects have not updated their FTDNA web site to dynamically generate Y-STR and Y-SNP reports. Please lobby your project admins to add this functionality so that data collection consumes less time. Second, test submissions that fall in the transitional area of the X axis (where the probability of testing positive moves from 0% to 100%, represented by predictions between 10% and 90%). Testing submissions in the transitional area is the best way to maximize information about the origins of the Y-SNP, and it also greatly improves the accuracy of Y-SNP prediction.
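To make the transitional area concrete, here is a small Python sketch that assumes the probability curve has a logistic (S-shaped) form over the number of fingerprint matches. The midpoint and steepness values are invented for illustration and are not fitted to any real Y-SNP data; the function names are hypothetical.

```python
import math


def logistic_probability(matches, midpoint, steepness):
    """Assumed S-shaped curve: probability of testing positive vs. matches."""
    return 1.0 / (1.0 + math.exp(-steepness * (matches - midpoint)))


def is_transitional(probability, low=0.10, high=0.90):
    """True when a prediction falls between 10% and 90%."""
    return low <= probability <= high


# Invented curve parameters: 50% at 4 matches, fairly steep transition.
MIDPOINT, STEEPNESS = 4.0, 1.5
at_midpoint = logistic_probability(4, MIDPOINT, STEEPNESS)
many_matches = logistic_probability(8, MIDPOINT, STEEPNESS)
no_matches = logistic_probability(0, MIDPOINT, STEEPNESS)
```

Under these assumed parameters, a submission at the midpoint sits squarely in the transitional area, while submissions with very many or very few matches fall outside it, so testing them adds little information about where the curve bends.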
Another reason for creating this tool is to reduce the costs associated with unnecessary testing of Y-SNPs. Since this tool should greatly reduce unnecessary testing, please consider donating part of your Y-DNA testing funds to the R-L21 Plus project or the R-L21 WTY project. These funds will be used for discovery of new Y-SNPs via the FTDNA "Walk the Y" test, testing specific submissions for Y-SNPs to determine the breadth and origins of each Y-SNP, testing required to qualify for the ISOGG haplotree (and eventually to get added to the deep clade test) and testing related Y-SNPs to see how they are related to each other. It is hoped that this tool will discourage random testing of Y-SNPs with virtually no odds of testing positive, and will also discourage over-testing of high fingerprint matches where the probability of testing positive is predicted to be near 100%.
I have completed a major review of the statistics that can model Y-SNP prediction. The document is heavy on math and statistics but contains a significant amount of explanatory information as well. Some background in statistics and math is required to read and understand it. Professional statisticians may be disappointed in the format, as I tried to avoid the theory behind binary logistic regression and intentionally omitted much of the statistical detail. The paper is intended for anyone who has a solid background in statistics but is not a professional in the field. After receiving feedback on this paper, I will start implementing the methodology it describes as improvements to the R-L21 Y-SNP predictor tool. Please post any comments and corrections in the L21Project Yahoo forum. This is a preliminary version of the document, which will be revised based on feedback and on knowledge gained from implementing this methodology.
Math behind the R-L21 Y-SNP predictor