Haplogroup R SNP Predictor

Haplogroup R SNP Predictor - Introduction

Finus Ewing Bryan, b. 1840, TN, d. 1915, TX, photograph ca. 1910, Hillsboro, TX

Haplogroup R SNP Predictor Tool by Robert Casey

The R-L21 YSNP predictor tool is being phased out and has been replaced by the new and improved Haplogroup R YSNP prediction tool. This new version of the tool has many improvements:

1) Decreases YSNP prediction errors tenfold. Accuracy improves from 90 % to 99 %.
2) Scope is expanded to cover all of Haplogroup R vs. only R-L21.
3) The number of branches predicted is increased fivefold.
4) R-L21 coverage was only 20 % and is now 65 %.
5) Confirmed R-L21 status is no longer required.
6) Over 200 predictable haplogroups vs. 40 predictable haplogroups.
7) SAPP charts are now available via new link (includes over 20,000 testers).
8) You can now sort the header fields and change the number of haplogroups displayed.
9) Extracted from over 74,000 Y67 marker testers under Haplogroup R.

If you are already familiar with the Haplogroup R SNP Predictor Tool, you can immediately access the YSNP prediction tool via the link below:

Haplogroup R SNP Predictor

As with any tool, this one rests on major assumptions. It is limited to testers who have tested positive for Haplogroup R or are predicted by FTDNA to test positive for Haplogroup R. It does not cover the other major haplogroups due to time constraints. If this condition is not met, the tool will not accurately predict YSNPs for you. The tool only works if you have tested 67 markers (or more) at Family Tree DNA. Any markers above 67 are ignored at this time (markers from Y68 to Y111 are not used). The tool does not work for testers with fewer than 67 markers, since any missing YSTRs would be counted as mismatches against the 67 marker signatures used. For the vast majority of YSNPs, 37 markers will not result in accurate prediction, and using 67 markers is always much more accurate. Before using this tool, you should upgrade to Y67 or Y111 markers first, and any entry must be confirmed or predicted to be Haplogroup R by FTDNA.

The input screen is very flexible and accepts several formats. Input is best done via copy and paste of only the Y67 marker values from either FTDNA YSTR report. Copy and paste from EXCEL spreadsheets works as well. There is one major exception: extra copies of multi-copy markers can randomly appear in these reports (19, 385, 459, 464, YCA, 395S1 and 413). The extra values must be normalized down to the normal number of values for that marker. Below are the steps for this modification (it only affects 1 or 2 % of testers):

1) Find the marker value with the highest count. Delete extra copies of that value until its count is the same as that of the other marker values.
2) If you still have extra values and some of the remaining values have higher counts than others, delete extra copies with the highest values first, the lowest values next and any middle values last.
3) Repeat step two if necessary.
4) On very rare occasions, only single values are left but there are still more than the normal number of values. You should then delete the highest value, then the lowest value and then one middle value.
5) Repeat step four if necessary.
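
The steps above leave some room for interpretation, so the following Python sketch shows only one possible reading of them. The function name `normalize_multicopy` and the `normal_count` parameter are hypothetical, and the tool itself does not automate this step.

```python
from collections import Counter

def normalize_multicopy(values, normal_count):
    """Trim extra copies of a multi-copy marker down to normal_count values.

    One reading of the manual steps: remove copies of the single most
    repeated value first; when several values tie for the highest count
    (including the case where every value appears once), remove the
    highest value, then the lowest, then a middle one, and repeat.
    """
    vals = sorted(values)
    turn = 0  # cycles highest -> lowest -> middle among tied values
    while len(vals) > normal_count:
        counts = Counter(vals)
        top = max(counts.values())
        tied = sorted(v for v, c in counts.items() if c == top)
        if len(tied) == 1:
            victim = tied[0]                # step 1: one value is most repeated
        elif turn % 3 == 0:
            victim = tied[-1]               # highest value first
        elif turn % 3 == 1:
            victim = tied[0]                # lowest value next
        else:
            victim = tied[len(tied) // 2]   # a middle value last
        if len(tied) > 1:
            turn += 1
        vals.remove(victim)                 # removes a single copy
    return vals

# Illustrative (made-up) 464 result with six values instead of the normal four
print(normalize_multicopy([15, 15, 15, 16, 17, 17], 4))
```

The values are returned in ascending order. If your reading of the steps differs, adjust the tie-breaking branch accordingly.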

Currently, there are no plans to automate the removal of extra values. Some EXCEL files replace the 389-2 value with the delta between 389-2 and 389-1, which is better for analysis. If you have this format (like many EXCEL files linked in Facebook Group posts), just change the 389-2 format to delta format for proper analysis.
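
As a small worked example of the 389-2 arithmetic, the Python sketch below converts between the reported 389-2 value and its delta form. The function names are hypothetical and the marker values are illustrative only.

```python
def to_delta_389(dys389_1, dys389_2):
    """Delta form of 389-2: the reported 389-2 value minus 389-1."""
    if dys389_2 < dys389_1:
        raise ValueError("389-2 should not be smaller than 389-1; "
                         "the value may already be in delta format")
    return dys389_2 - dys389_1

def from_delta_389(dys389_1, delta):
    """Reported form of 389-2: 389-1 plus the delta."""
    return dys389_1 + delta

# Illustrative values: 389-1 = 13 and 389-2 = 29 give a delta of 16
print(to_delta_389(13, 29))
print(from_delta_389(13, 16))
```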

The input screen currently supports tab, space, comma and line ending characters as delimiters between marker values. The FTDNA format with dash characters for multi-copy markers is automatically handled as well. To ensure that the input string is properly entered, the tool reports back the markers entered so that the marker values can be verified. You can verify your marker values by looking at the values of CDYa and CDYb (they should be in the 30s and 40s). Also, the 413 values should be in the lower 20s. CDY markers mutate too fast for YSNP prediction and charting, so these markers are ignored (but should not be deleted from the input). After verification, use the submit button to have the SNP Predictor tool produce a report that shows which YSNPs are likely to test positive. Any value of 50 % or higher is considered positive. Any value below 50 % is considered negative.
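
To illustrate the delimiter handling and the sanity checks described above, here is a minimal Python sketch. It is not the tool's actual code. The function names, the zero-based positions assumed for CDYa/CDYb and 413a/413b in the Y67 report order, and the 20 to 24 range used for "lower 20s" are all assumptions that should be checked against your own report.

```python
import re

def parse_markers(text):
    """Split pasted values on tabs, spaces, commas, line endings and dashes."""
    tokens = re.split(r"[\t ,\r\n-]+", text.strip())
    return [int(tok) for tok in tokens if tok]

def check_entry(values, cdy_positions=(33, 34), dys413_positions=(48, 49)):
    """Return a list of problems found; an empty list means the entry looks sane.

    The default positions are assumptions about the Y67 report order.
    """
    if len(values) != 67:
        return [f"expected 67 values, got {len(values)}"]
    problems = []
    for i in cdy_positions:
        if not 30 <= values[i] <= 49:       # CDY should be in the 30s or 40s
            problems.append(f"value {values[i]} at position {i} is not a plausible CDY value")
    for i in dys413_positions:
        if not 20 <= values[i] <= 24:       # assumed meaning of "lower 20s"
            problems.append(f"value {values[i]} at position {i} is not a plausible 413 value")
    return problems

# Mixed delimiters, including the FTDNA dash format for multi-copy markers
print(parse_markers("13 24\t14,11\n11-14"))
```

A shifted CDY or 413 value usually means that an extra multi-copy value was left in, or a value was dropped, somewhere earlier in the string.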

The new version of the YSNP prediction tool has many new output options. First, the seven columns can now be sorted. Just click on a header and the rows will switch between low to high and high to low. If you click on the YSNP column, it will list all YSNPs currently predicted in alphabetical order. The tool defaults to the ten haplogroups that have the highest probability of testing positive. On the bottom right you can change the default of ten rows to 20, 30, 40 or "All Rows." Another major enhancement is the "Link to Analysis," which now displays a chart produced by SAPP that includes all confirmed and predicted testers found during the analysis. The date in the file name indicates when the last analysis for this YSNP was completed.

For a YSNP to be predictable by this tool, it has to meet most of the following four criteria:

1) The date of the YSNP being predicted should be between 1,500 and 2,500 years ago. For older YSNPs, testing only living individuals leaves too many hidden YSTR mutations. For younger YSNPs, there has not been enough time to develop large enough YSTR signatures. Quite a few YSNPs between 1,000 and 1,500 years old do work if the other criteria are very well met. It is pretty rare for YSNPs older than 2,500 years to work, but occasionally a few that are 100 to 200 years older do work.
2) The YSTR signature normally needs to be at least seven markers (one or two six marker YSTR signatures have worked). If the YSTR signature includes multi-step mutations and slower mutating markers, this improves the accuracy as well. Occasionally, seven marker signatures do not work (very few).
3) The YSNP branch needs to be genetically isolated from other YSNP branches. You need 15 to 25 YSNPs in the YSNP block (including the son if it is the only son). Occasionally, YSNP prediction works with a block of only ten YSNPs (not very often).
4) The sample should have at least 20 confirmed testers. Statistical models require ten samples for each input variable (this tool uses two variables). However, normally only ten to fifteen confirmed testers are needed to produce high accuracy. As more testers are added, the model constants could require updating to maintain high accuracy.

In addition to these four criteria, I do not include any predictor model if its overall accuracy in predicting positive testers falls below 98 %. The vast majority have 100 % accuracy, several have 99 %, a few have 98 %, and I did keep R-Z255, which has only 97 % accuracy.

Predictable YSNPs under Haplogroup R will continue to be discovered on a weekly basis. There are currently over 200 YSNP prediction models. Due to the nature of YSNP prediction, any prediction that falls in the transitional area (prediction values from 10 % to 90 %) should be considered speculative. Those under 50 % are predicted negative and those at 50 % or higher are predicted positive. As the sample size grows, small updates to the prediction models may be required, or slightly lower accuracy will result.
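
The two thresholds described above can be summarized in a short Python sketch. The function name `classify` is hypothetical, and treating exactly 10 % and 90 % as part of the transitional area is an assumption.

```python
def classify(probability_percent):
    """Return (call, speculative) for a predicted probability in percent.

    call is "positive" at 50 % or higher and "negative" below 50 %.
    speculative is True inside the transitional area of 10 % to 90 %.
    """
    call = "positive" if probability_percent >= 50 else "negative"
    speculative = 10 <= probability_percent <= 90
    return call, speculative

for p in (5, 35, 50, 95):
    print(p, classify(p))
```

A 35 % result, for example, is reported as negative but sits in the transitional area, so it should be read as speculative rather than as a firm negative.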

The YSTR signature is first determined from all submissions that test positive for the YSNP. The signature consists of all off-modal YSTR mutations relative to the modal values of several very old major haplogroups. An off-modal value must occur in more than 75 % of these submissions to be included in the YSTR signature (occasionally 70 % is accepted). All testers that closely match the YSTR signature are then analyzed and sorted by the number of signature matches and genetic distance from the signature. Based on actual submissions that have tested negative or positive for the YSNP, the probability of testing positive is derived using a statistical model (I use the economic statistical software tool AcaStat to determine the model constants of the binary logistic regression model). This model is included in EXCEL spreadsheets.
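
The methodology above can be sketched in Python as follows. This is an illustration under stated assumptions, not the author's spreadsheet model. All function names are hypothetical, the genetic distance is assumed to be the sum of step differences over the signature markers, the two model inputs are assumed to be the number of signature matches and that distance, and the constants `b0`, `b1` and `b2` stand in for values that would come from fitting the regression (for example in AcaStat).

```python
import math
from collections import Counter

def derive_signature(positives, modal, threshold=0.75):
    """Keep off-modal values shared by more than `threshold` of positive testers."""
    signature = {}
    for marker, modal_value in modal.items():
        counts = Counter(tester[marker] for tester in positives)
        value, n = counts.most_common(1)[0]
        if value != modal_value and n / len(positives) > threshold:
            signature[marker] = value
    return signature

def signature_matches(tester, signature):
    """Number of signature markers on which the tester has the signature value."""
    return sum(1 for m, v in signature.items() if tester.get(m) == v)

def genetic_distance(tester, signature):
    """Assumed distance: total step difference over the signature markers."""
    return sum(abs(tester[m] - v) for m, v in signature.items())

def probability_positive(matches, distance, b0, b1, b2):
    """Binary logistic model: P = 1 / (1 + exp(-(b0 + b1*matches + b2*distance)))."""
    z = b0 + b1 * matches + b2 * distance
    return 1.0 / (1.0 + math.exp(-z))

# Made-up data: four of five positive testers share an off-modal DYS390 value
modal = {"DYS390": 24, "DYS391": 11}
positives = [{"DYS390": 25, "DYS391": 11}] * 4 + [{"DYS390": 24, "DYS391": 11}]
signature = derive_signature(positives, modal)
tester = {"DYS390": 25, "DYS391": 10}
print(signature, signature_matches(tester, signature), genetic_distance(tester, signature))
```

In a fitted model, the constant on the number of matches would be expected to be positive and the constant on genetic distance negative, so that closer matches to the signature score higher.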

Testers are being tested daily for new YSNPs, new 67 marker YSTR submissions become available daily and already tested testers are discovered via extraction every day. This tool is constantly being updated with additional input, which could increase accuracy. Over time, the YSNP signature and the constants of the prediction model could change slightly as new submissions are discovered or tested for the first time. For any YSNP that has only a handful of submissions testing positive, the model can change significantly after more testers are YSNP tested and then analyzed.

With this new improved prediction model, I now use only the statistical software tool AcaStat, so the prediction is now driven entirely by software. If you see areas for improvement in this tool, feel free to drop me an email with your enhancements and corrections or post them in the more widely used YDNA Facebook Groups.

I have reached the point where this tool will only modestly add new predicted haplogroups over time. However, the share of branches covered will slowly decline, since prediction only covers 34.1 % of the YSNP branches under Haplogroup R and the unpredicted portion, with twice as many branches, continues to grow faster. Analysis is very time intensive (primarily data collection). This predictor tool now predicts 34.1 % of all YSNP branches under the FTDNA Haplogroup R haplotree (up from 26.1 % one year ago). Unfortunately, the expansion characteristics of some parts of the haplotree are not well suited for this kind of YSNP prediction. For R-L21, 65.0 % of the branches can now be predicted. However, R-U106 has only 8.6 % coverage (this part of the haplotree has a slow and steady growth of its population, which is not favorable to YSNP prediction). R-L21 has numerous YSNP bottlenecks followed by significant growth, which matches the criteria for high accuracy YSNP prediction. R-U152 has 6.4 % coverage to date. R-L23 has 24.9 % and R-M198 (R1a) has 7.4 % coverage. R-DF27 has 12.5 % coverage and R-DF19 has 29.3 % coverage. There is still some room for growth, but any improvements in coverage will be very small.

The Haplogroup R Predictor tool uses a YSTR signature methodology based on all known "public" FTDNA YSTR reports. Currently, various privacy settings make around 50 % of the testers impossible to analyze (since there is no public access to this data). If you want to help YSNP prediction and be included in the SAPP charts, you have several options to enhance the prediction of predictable YSNP branches. For the past several years, the FTDNA default privacy setting has been "private." This means that new testers are not included in public FTDNA YSTR reports. If you want to be included in YSNP prediction and charted, log into your kit and click on "Account Setting" (pull down menu where your Name and Kit number are displayed). Then click on the "Project Preferences" tab. Under "Group Project Profile," slide the slider bar for "Opt in to Sharing" to the right so that it changes to "ON." Also, many admins do not publish "public" YSTR reports. You should lobby your admins to make these reports available (if all the projects that you belong to are private, your results are not available to the public to analyze). Otherwise, your information is not accessible for public analysis such as this tool and the associated charts, which now include around 25,000 Y67 testers under Haplogroup R.

Collecting and updating valid Y67 testers is currently very tedious. Big Y700 testing is highly recommended for testers in the transitional area (where the probability of testing positive is between 10 % and 90 %). YSNP testing of testers in this transitional area is the best way to maximize information about the geographic origins of the YSNP, and it also greatly improves the accuracy of YSNP prediction and charting of your predictable haplogroup.

Another reason for creating this tool is to reduce the costs associated with unnecessary YSNP testing. YSNP packs and individual YSNP testing are no longer very cost effective with so many YSNP branches under Haplogroup R. Since this tool should greatly reduce unnecessary testing (by confirming which part of the haplotree you belong to), donate part of your YDNA testing funds to your favorite FTDNA haplogroup or surname project. These funds will be used for discovery of new YSNPs via the FTDNA Big Y700 test. It is hoped that this tool will discourage random testing of YSNP offerings with virtually no odds of testing positive or extremely high odds of testing positive. This prediction model should discourage testing of high signature matches where testing positive is predicted to be at or near 100 %. The best usage of YSNP prediction is improving the quality of charting (SAPP or manually done). A predictable haplogroup will, on average, double in size when YSTR-only testers are included in the analysis. Charting can also predict lower level YSNP branches based on YSTR signatures that form below the predictable haplogroup.

If you are interested in getting your smaller haplogroup analyzed or an existing predictable haplogroup updated, I now offer services for this analysis. Costs range from the cost of a Y111 test to that of a Big Y700 test, depending primarily on the size of the predictable haplogroup and the options chosen for analysis.

I have completed an updated paper that explains the details of this new YSNP prediction methodology. The paper is fairly math and statistics intensive but also contains a significant amount of explanatory information. Some background in statistics and/or math is required to read and understand it. Professional statisticians may be disappointed in the format, as I tried to avoid the theory involved with binary logistic regression and intentionally omitted much of the statistical detail. The paper is intended for anyone who has a solid understanding of statistics but is not a professional in the field. It documents the science behind this YSNP prediction tool. Please post any comments and corrections in the major YDNA Groups found on Facebook. This is a preliminary version of the document, which will be revised based on feedback and knowledge gained from the implementation of this methodology.

Math behind the R-L21 Y-SNP predictor - February, 2018