mthap Frequently Asked Questions
Q. What are the qualifications of the author of mthap?
A. mthap was written by James Lick. I am an amateur genetic hobbyist, and mthap is intended for use by knowledgeable genetic hobbyists. If you have difficulty understanding your results or are interested in a professional opinion, please consult with an experienced professional genetic counselor. I cannot take any responsibility for how you use or interpret the results of mthap. This is not a medical device or service.
Q. What privacy protections are in place for mthap?
A. mthap is designed not to keep any data you have uploaded. The uploaded data is temporarily stored in an "anonymous file" (a file without a name, accessible only to the program which created it) which is automatically destroyed once the report is delivered and the program ends. This design helps ensure that your uploaded data is not accessible to anything but the mthap program. The web server logs only the usual web server activity, which includes IP address, time of visit, pages viewed, size of data uploaded, etc. The mthap program and web server have access to the filename of the data being uploaded. Raw data files from testing services often contain the name or other identification of the person whose data it contains, but you can rename the file before uploading. The web server does not use encryption, so there is a remote chance that the activity may be intercepted by a third party. The web server is subject to the legal jurisdiction of the state of California, USA. I cannot take any responsibility for any unintended disclosure of any private information.
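For readers curious what an "anonymous file" looks like in practice, here is a minimal sketch in Python. It is only an illustration of the general technique, not mthap's actual code; the function name process_upload and the line-counting "report" are made up for the example. On Unix-like systems, tempfile.TemporaryFile creates a file with no visible directory entry, and the file disappears as soon as it is closed.

```python
import tempfile


def process_upload(data: bytes) -> int:
    """Store uploaded bytes in an unnamed temporary file, then read them back.

    The return value (a count of non-empty lines) is a stand-in for a real
    report; it is a hypothetical placeholder for this example.
    """
    # TemporaryFile has no usable name on Unix-like systems and is
    # destroyed automatically when the "with" block closes it.
    with tempfile.TemporaryFile() as scratch:
        scratch.write(data)
        scratch.seek(0)  # rewind so the data can be read back
        return sum(1 for line in scratch if line.strip())


if __name__ == "__main__":
    print(process_upload(b"rs1\t73\tG\n\nrs2\t263\tG\n"))
```

Once the function returns, nothing remains on disk for another program to open.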
Q. Why do I get one haplogroup assignment from one testing service, a different one from another testing service, and yet another from mthap?
A. There are four different aspects which affect this answer: 1) accuracy of the test, 2) completeness of the test, 3) the haplogroup reference used, and 4) how the results are interpreted. See the following questions for more on these four subjects.
Q. What is the accuracy of the various tests available?
A. The best accuracy available is through sequencing. Sequencing determines each base in the genetic sequence. It is a mature technology which offers the highest degree of accuracy. FTDNA is probably the best known testing service which offers sequencing. The other type of test is genotyping done by a micro-array. Genotyping is good, but offers 99.8%-99.9% accuracy overall. However, the accuracy of mtDNA genotyping is less than this due to the short length of the genetic sequence and the presence of clusters of markers relatively close together. Genotyping also is unable to determine results for some markers (no-calls), so even if the test covers relevant markers, some of those markers may not have any result.
Q. How complete are the various tests?
A. The only complete test, and the most accurate one available, is an mtDNA FGS (Mitochondrial DNA Full Genetic Sequence). You can only have complete confidence in your haplogroup assignment if it is based on this test. Next in accuracy is a detailed genotyping such as offered by 23andMe. 23andMe currently tests about 2000 positions covering about 12% of mtDNA, but that covers the vast majority of positions needed to make an accurate haplogroup determination. For most people, these results will be sufficient to determine the haplogroup very accurately, but not always. Because some necessary positions are not tested, and because genotyping is not 100% accurate, it cannot always yield an exact haplogroup assignment. It's not quite as good as an mtDNA FGS, but it very often will be able to make an accurate assignment.
The entry-level sequencing tests only sequence a part of mtDNA. The most basic test looks only at a segment called HVR1, while a slightly more advanced test looks at HVR1 and HVR2. (One testing company even has something called HVR3 which other companies include in HVR2.) These tests are enough to get an approximate haplogroup assignment most of the time, but usually cannot be very specific. In some cases a haplogroup determination based on these tests can be wrong. Likewise, the genotyping offered by deCODEme only tests 162 positions, and usually can only approximately determine a haplogroup. You shouldn't put too much faith in haplogroup determinations made based on the type of tests discussed in this paragraph. It will be at best approximate for most people.
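The coverage figures above can be checked with simple arithmetic. The sketch below assumes the rCRS length of 16,569 bases; the function name coverage is only illustrative.

```python
RCRS_LENGTH = 16569  # number of bases in the rCRS mitochondrial sequence


def coverage(positions_tested: int) -> float:
    """Fraction of mtDNA positions covered by a test."""
    return positions_tested / RCRS_LENGTH


if __name__ == "__main__":
    # Position counts taken from the text above.
    for label, count in [("23andMe", 2000), ("deCODEme", 162)]:
        print(f"{label}: {coverage(count):.1%} of positions tested")
```

About 2000 positions works out to roughly 12% of the sequence, and 162 positions to roughly 1%, which is why the latter can only give an approximate result.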
Q. What is a haplogroup reference and why will it affect my haplogroup assignment?
A. Genetic research is still a rapidly evolving field. It is only in the last few years that mtDNA sequencing has become reliable and cheap enough to be used on a wide scale. Now it is possible for individuals to afford to get their mtDNA fully sequenced, and for researchers to be able to stretch their funding to cover fully sequencing large numbers of people. What this all means is that new unique sequences are being discovered more frequently now that more people are being tested. Each new unique sequence potentially means a new branch in the haplogroup tree, and sometimes a reorganization of the branches. The current accepted haplogroup reference is PhyloTree which is currently on Build 12. A new reference tree is released several times a year. As more is discovered, your haplogroup assignment may change from time to time. Usually this only means it gets more specific, but occasionally may involve a significant change.
The important thing from all of this is that depending on what haplogroup reference is used, your haplogroup determination will be different. That means that even two tests which are completely accurate will still yield different results if they use a different reference, and both will technically be correct. At this writing, mthap uses PhyloTree Build 12, and I try to get it updated to new PhyloTree builds within a few days of release. 23andMe currently claims to use the PhyloTree Build 7 release, which is still relatively recent (November 2009). FTDNA claims to use the Behar 2007 reference for their basic tests and what appears to be a homegrown reference for their FGS tests, both of which are very outdated. Thus even though their FGS test is the most accurate available, their interpretation is based on the most outdated reference among the leading testing services.
Q. How are the results interpreted, and how does this affect the haplogroup assignment?
A. Interpreting test results is a difficult problem! Most services devise a list of "defining markers" and base their assignments on that. The main problem with this approach is that many markers arise in many different parts of the tree. Another problem is that particular test results may not test relevant markers, or may have no-call for relevant markers. Creating these lists of criteria is also painstaking to do, so it is easy for errors to creep in. At this writing, it is known that 23andMe has some errors in the v2 charts which result in incorrect assignments for some people. These errors are reportedly fixed for v3 results and v2 customers will receive an update later.
mthap takes a different approach. The PhyloTree data is processed into a computer usable form. This is done completely automatically to reduce the chance of errors. mthap then takes the test data and compares it to the complete marker list for each individual haplogroup in the reference and a score is calculated of how well the data matches the haplogroup. The haplogroup with the best score will then be presented as the top match. At least two other close matches are also displayed, because the limitations of the test data may not be able to yield an exact match.
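The general idea of scoring every haplogroup and presenting the best few can be sketched as follows. This is not mthap's actual code: the haplogroup names, markers, and the scoring formula are all invented for illustration, and the real scoring distinguishes more categories than this toy version does.

```python
from typing import Dict, List, Set, Tuple


def simple_score(required: Set[str], observed: Set[str]) -> int:
    """Toy score: reward shared markers, penalize missing and extra ones."""
    matches = len(required & observed)
    mismatches = len(required - observed)
    extras = len(observed - required)
    return 2 * matches - 2 * mismatches - extras  # weights are made up


def rank_haplogroups(
    tree: Dict[str, Set[str]], observed: Set[str], top_n: int = 3
) -> List[Tuple[str, int]]:
    """Score every haplogroup and return the top_n, best first."""
    scored = [(simple_score(req, observed), name) for name, req in tree.items()]
    # Highest score first; ties are broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, score) for score, name in scored[:top_n]]


# Hypothetical tree: each haplogroup maps to its complete marker list.
TOY_TREE = {
    "X": {"100A"},
    "X1": {"100A", "200G"},
    "X1a": {"100A", "200G", "300T"},
    "Y": {"400C"},
}

if __name__ == "__main__":
    print(rank_haplogroups(TOY_TREE, {"100A", "200G", "999C"}))
```

With the toy data, "X1" ranks first, followed by "X1a" and "X". The runner-up entries are shown because an untested or miscalled marker could easily move the true answer up or down one level.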
Based on user experience, mthap usually makes the most accurate and up to date haplogroup assignment, limited by the completeness and accuracy of the test results used. Even so, it is possible that it may not make the best possible choice. For the best results, get an mtDNA FGS test done by a reputable company and have the results interpreted by an experienced professional genetic counselor.
Q. How do I read the mthap report?
A. The first line of the report shows the version of mthap and the version of the reference chart used. This is followed by a summary of the data file used and how many markers were found. Then a list of markers relative to the rCRS reference sequence is shown as interpreted from the data. This is the industry standard shorthand for summarizing mtDNA sequences. These are your test results, which will then be compared to the reference haplogroups below.
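As a rough illustration of the shorthand, a marker such as "16519C" means "at position 16519 the sample has C where the reference has something else." The sketch below builds such a list for two short toy sequences. It handles substitutions only (no insertions or deletions), and the sequences are made up for the example rather than taken from a real test.

```python
from typing import List


def markers_vs_reference(reference: str, sample: str) -> List[str]:
    """List substitutions as '<position><base>', counting positions from 1."""
    if len(reference) != len(sample):
        raise ValueError("this sketch only compares sequences of equal length")
    return [
        f"{pos}{base}"
        for pos, (ref, base) in enumerate(zip(reference, sample), start=1)
        if ref != base
    ]


if __name__ == "__main__":
    toy_reference = "GATCACAGGT"
    toy_sample = "GATTACAGAT"
    print(" ".join(markers_vs_reference(toy_reference, toy_sample)))
```

Here the sample differs at positions 4 and 9, so the marker list is "4T 9A".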
This is followed by a list of possible matches. In most cases the first choice is the best possible match to the data, but not always. For each possible match, the complete list of markers for the haplogroup is shown. This is then compared to your test results which are at the top of the report. A summary of how well you match this haplogroup is then displayed, followed by a summary of how your results differ from the haplogroup in the reference chart.
Q. What does it mean when a marker is in parentheses?
A. Markers which are listed in your results in parentheses are markers which are considered non-phylogenetic. Basically that means that those markers occur too frequently or change too frequently to be relevant to determining ancestry. These are treated as optional by mthap and do not contribute to scoring.
Markers which are listed in the haplogroup defining markers in parentheses are those marked as optional in the reference chart. This happens when the known sequences are too limited to be completely confident where the marker occurs, or where the marker frequently reverts back to the reference sequence value. These are also considered to be optional and are not scored.
Q. What are Matches?
A. A "Match" is a marker for which you match what is required for this haplogroup. Each match has a strong positive influence on the score.
Q. What are Mismatches?
A. A "Mismatch" is a marker required for the haplogroup for which you have a different test result. This can indicate either an inaccurate test result, or that you are a poor match for the haplogroup. It may also indicate that you are "in between" two different haplogroups. Mismatches have the strongest negative influence on score.
Q. What are Extras?
A. An "Extra" is a marker you have that is not part of the defining markers list for the haplogroup. These could be private mutations, an indication of a poor match, or an inaccurate test result. These have a strong negative influence on score.
Q. What are No-Calls?
A. "No-Call" means that the marker was tested but, for whatever reason, it was not possible to determine the result. These have a small negative influence on score.
Q. What are Untested markers?
A. "Untested" means that your testing service did not test this position. Usually there are enough other markers to determine a match, but a long list of Untested markers reduces the accuracy of the match. Therefore they have a small negative influence on score.
Q. What are Flips?
A. "Flips" are those markers for which you have a result which is different from both the reference sequence and what is required to match a particular haplogroup. For example, say that a particular marker is T in the reference sequence, is C in your haplogroup, but your test results have G. On microarray tests this situation usually indicates that the test is reading the change incorrectly. Instead of being treated as a mismatch and extra, these are not used for scoring.
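The categories described in the last several answers can be summarized in a small sketch. This is an illustration under stated assumptions, not mthap's code: the numeric weights are invented (the text only says which influences are strong or small), positions and bases are toy values, and optional and non-phylogenetic markers are ignored.

```python
from typing import Dict, List, Optional

# Hypothetical weights that merely respect the ordering described above.
WEIGHTS = {"match": 3, "mismatch": -4, "extra": -3,
           "no_call": -1, "untested": -1, "flip": 0}


def classify(
    reference: Dict[int, str],
    required: Dict[int, str],
    observed: Dict[int, Optional[str]],
) -> Dict[str, List[int]]:
    """Sort positions into categories; None in observed means a no-call."""
    result: Dict[str, List[int]] = {name: [] for name in WEIGHTS}
    for pos, want in sorted(required.items()):
        if pos not in observed:
            result["untested"].append(pos)
        elif observed[pos] is None:
            result["no_call"].append(pos)
        elif observed[pos] == want:
            result["match"].append(pos)
        elif observed[pos] == reference[pos]:
            result["mismatch"].append(pos)
        else:
            # Differs from both the reference and the haplogroup value.
            result["flip"].append(pos)
    for pos, base in sorted(observed.items()):
        if pos not in required and base is not None and base != reference[pos]:
            result["extra"].append(pos)
    return result


def score(categories: Dict[str, List[int]]) -> int:
    """Weighted sum over categories; flips contribute nothing."""
    return sum(WEIGHTS[name] * len(found) for name, found in categories.items())


TOY_REFERENCE = {10: "T", 20: "A", 30: "G", 40: "C", 50: "T", 60: "A"}
TOY_REQUIRED = {10: "C", 20: "G", 30: "A", 40: "T", 50: "C"}
TOY_OBSERVED = {10: "C", 20: "A", 30: None, 50: "G", 60: "G"}

if __name__ == "__main__":
    categories = classify(TOY_REFERENCE, TOY_REQUIRED, TOY_OBSERVED)
    print(categories, score(categories))
```

In the toy data, position 50 mirrors the example in the answer above: T in the reference, C required by the haplogroup, G observed, so it is reported as a Flip and left out of the score.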
Q. What are the numbers in parentheses right after Mismatches/Extras/No-Calls/Untested?
A. This is the number of markers in the list used for scoring. Some markers are not scored because they are non-phylogenetic or uncertain. Optional markers will also be shown in parentheses.
Q. What do all these different colors mean?
A. Matches are shown in green, Mismatches as crimson, Untested markers as gray, No-Calls as orange, Flips as lime, Extras as blue, and markers for which you have a reversion as indigo.
Q. Why are there three (or more) possible matches listed?
A. Due to differences in testing, inaccuracies, no-calls, etc. it is not always possible to determine a precise match. Even with mtDNA FGS results it is sometimes not possible to make an exact match. Sometimes more than one haplogroup has the same or similar score. Therefore three or more potential matches are presented. As you research your results in more detail, you can try to determine if the first match is not as good as a later match. (If this happens, please let me know so that I can try to improve mthap.)
Q. Why don't you use "Private Mutations" instead of "Extras"?
A. Even with mtDNA FGS results, an Extra is not necessarily really a private mutation, especially with the second and later possible matches. Non-phylogenetic markers are also listed here (in parentheses). In addition, with genotyping it is possible that an Extra is just an inaccurate result. You will need to research your results in more detail or consult with an experienced professional genetic counselor to determine if you really have private mutations.
Q. What is a private mutation?
A. When you match a haplogroup exactly and still have additional markers "left over" (excluding non-phylogenetic markers), those additional markers will be called private mutations. In most cases this is because you have an unusual genetic sequence which hasn't yet been discovered, or hasn't been seen often enough to be added to the official chart. It is necessary to have an mtDNA FGS test performed to be sure that you really have private mutations. You may even have a completely new mutation unique to you, though you would need to test both yourself and your mother to determine this.
Q. I have had an mtDNA FGS and do not have any exact matches, or have private mutations. How do I get a haplogroup assignment for my sequence?
A. If you have private mutations or are in-between two haplogroups, your sequence may help advance science and expand the reference haplogroup chart. To get in the chart, at least two independent unique sequences (sometimes more) will need to be published or submitted to GenBank. If you have had an mtDNA FGS test from FTDNA, you can submit your own sequence to GenBank. Testing services will not submit your sequences for you as it is your choice whether to publish your own private genetic information (though your name is not associated with GenBank published sequences). I have submitted my mtDNA sequence to GenBank, though at this time it is still unique and not yet in the reference chart.
Q. What are non-phylogenetic markers?
A. These are markers which occur too often or change too quickly to be considered useful for determining ancestry, and are not used for scoring. They are currently: 309.1C 309.2C 315.1C 522- 523- 524- 16182C 16183C 16193.1C 16193.2C 16519C
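A short sketch of how such a list might be applied: markers on the list are shown in parentheses and left out of scoring. The marker list is the one given above; the function names and the exact output layout are assumptions made for the example.

```python
from typing import List

# The non-phylogenetic markers listed in the answer above.
NON_PHYLOGENETIC = frozenset(
    "309.1C 309.2C 315.1C 522- 523- 524- "
    "16182C 16183C 16193.1C 16193.2C 16519C".split()
)


def format_markers(markers: List[str]) -> str:
    """Show non-phylogenetic markers in parentheses, others as-is."""
    return " ".join(f"({m})" if m in NON_PHYLOGENETIC else m for m in markers)


def scored_markers(markers: List[str]) -> List[str]:
    """Keep only the markers that would take part in scoring."""
    return [m for m in markers if m not in NON_PHYLOGENETIC]


if __name__ == "__main__":
    sample = ["263G", "315.1C", "16519C"]
    print(format_markers(sample))
    print(scored_markers(sample))
```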
Q. Why can't I use genotyping rCRS differences on mitosearch?
A. Genotyping results are not comparable to sequencing results, and will result in incorrect matches. mitosearch entries are assumed to have completely sequenced HVR1 or HVR1+HVR2. Using incomplete results from genotyping will result in incorrect matches except in the unlikely event that your genotyping results just happen to be the same as sequencing.
Q. Why should I always use a fresh download of my raw data?
A. Genotyping results are subject to interpretation. From time to time improvements are made in how genotype tests are interpreted. 23andMe will re-process genotyping results and release new raw data files as advances are made in interpreting the tests. Usually the latest file will be more accurate than previous versions. You should definitely download a fresh copy if your copy is more than 3 months old. (mthap will try to detect and filter a set of known incorrect markers in 23andMe data files older than April 2010, but this may affect the accuracy.)
Q. How often should I check my data with mthap?
A. Some good times to run mthap: 1) When a new PhyloTree build is released (allow a few days for me to update mthap). 2) When your testing service releases updated test results with any mtDNA fixes. 3) When a new version of mthap is released.
PhyloTree is updated 3-5 times a year. 23andMe releases corrected test results about 2 times a year. Checking about once every 3 months should be sufficient to keep on top of changes.
Q. Why are Mismatches/Extras/No-Calls/Untesteds more numerous when genotyping non-Europeans?
A. The standard reference sequence, rCRS, is of a European person. Markers are called and tests are designed relative to that sequence. As a result, Europeans tend to have fewer markers as they are genetically closer to the reference. Africans, Asians and other non-Europeans tend to have more markers relative to the reference sequence, and tests are more error prone because of the decreased similarity. Over time the tests should improve to accommodate these differences. In addition, since the defining marker lists are much longer, the number of No-Calls and Untested positions will be higher.
Q. What is this reference sequence and what are the reference sequence values at each position?
A. The reference sequence is the Revised Cambridge Reference Sequence, abbreviated rCRS. This is also called just CRS or Cambridge, though that technically refers to the older sequence which had some errors. You can find out about the entire sequence and values at each position by consulting the Annotated rCRS.
Q. Why don't we use the terms ancestral and derived when discussing mtDNA markers?
A. mtDNA is usually described as differences from rCRS. rCRS belongs to haplogroup H2a2a, a very recent haplogroup, which means that many markers relative to rCRS are actually ancestral, while the reference value is the derived one. To determine what is really ancestral, you would need to compare to Mitochondrial Eve, or what PhyloTree calls mt-MRCA (Mitochondrial Most Recent Common Ancestor). Such a sequence does not exist, though one could be guessed at based on our common non-human ancestors. For now we have to go with the standard, where ancestral vs. derived is often hard to determine.
Q. Why are the rCRS position numbers different from what is found in the 23andMe raw data?
A. 23andMe results use the Human Genome Build 36 reference sequence for all their results. This sequence used an older reference called "Yoruba" as the reference sequence for mtDNA. Since then the industry has moved to the rCRS standard, and this is what is used in Human Genome Build 37. Insertions and deletions in the different sequences mean that positioning is not the same. The mthap program will convert between the standards automatically.
Q. Why is the number of markers more than the number of positions tested?
A. Genotyping sometimes needs to use more than one type of test for a single position. This can be because the flanking sequences vary, or the position has more than one possible value. Therefore your test results might have two or more results for a single position. In most cases the results will be the same, or one will have a no-call result.
Q. Why are the markers found in my FASTA file different from what my testing service lists?
A. Sequences are not all the same length. There may be insertions and/or deletions in your sequence relative to the reference sequence. Sometimes there are areas where the same nucleotide or pair of nucleotides is repeated several times, and the number of repeats in this part of the sequence will vary widely between different sequences. It is often the case that there are several different ways to align these sequences. For example, what FTDNA calls "522- 523-" is called by others "523- 524-." They both accurately describe the same sequence, just in different ways. There is a wide variety of ways that sequences can be aligned.
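To see why two different descriptions can be equally correct, consider a toy sequence containing a repeated pair of nucleotides. Removing any one copy of the pair leaves exactly the same result, so the deletion can be reported at more than one position. The sequence and function below are made up for illustration and are not real mtDNA coordinates.

```python
from typing import Iterable


def delete_positions(sequence: str, positions: Iterable[int]) -> str:
    """Remove the given 1-based positions from a sequence."""
    drop = set(positions)
    return "".join(
        base for pos, base in enumerate(sequence, start=1) if pos not in drop
    )


if __name__ == "__main__":
    toy = "TTACACACGG"  # "AC" repeated three times at positions 3-8
    print(delete_positions(toy, [3, 4]))  # delete the first copy
    print(delete_positions(toy, [5, 6]))  # delete the second copy
    print(delete_positions(toy, [7, 8]))  # delete the third copy
```

All three calls produce the same sequence, which is the same situation as "522- 523-" versus "523- 524-" above.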
Q. What convention does mthap use for aligning sequences?
A. The algorithm used in mthap is roughly based on the recommendations in Wilson (2002) with some differences. In particular, mthap handles examples 9 and 15 differently from the recommendations, and follows the recommendations for the other 20 examples.
Q. I'm getting an incorrect match due to an alignment issue. Will you fix it?
A. In most of the examples I've tested, differences in alignment did not affect the best match. If you can give me an example where the alignment results in an incorrect best match, please send me the example data and I will try to fix it. I will also try to accommodate issues where there are mismatches due to an alignment problem, but if the best match is still correct then this will have a lower priority.
Q. I'm getting an error like "WARNING 5331 unresolved mismatched duplicate; old: A new: T crs: C". What does this mean?
A. Occasionally you'll have two different results for the same position (see "Why is the number of markers more than the number of positions tested?"). When this happens, mthap arbitrarily picks the first one. It usually will not affect the results. If this happens to you and it appears to result in an incorrect haplogroup assignment, please let me know. The number is the position and the "old" value will be the one used. For the example error message, the test results had 5331A and 5331T as possible results, while the reference sequence has 5331C. mthap will use 5331A in this case.
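A minimal sketch of this "first one wins" behavior is shown below. It is not mthap's actual code. The function name merge_calls is hypothetical, and letting a real result replace an earlier no-call is an assumption made for the example; only the warning text follows the message quoted in the question.

```python
from typing import Dict, List, Optional, Tuple


def merge_calls(
    calls: List[Tuple[int, Optional[str]]], crs: Dict[int, str]
) -> Tuple[Dict[int, Optional[str]], List[str]]:
    """Keep the first result seen per position; warn on conflicting repeats.

    A value of None stands for a no-call.
    """
    chosen: Dict[int, Optional[str]] = {}
    warnings: List[str] = []
    for pos, base in calls:
        old = chosen.get(pos)
        if pos not in chosen or old is None:
            # First result, or an earlier no-call (assumed replaceable).
            chosen[pos] = base
        elif base is not None and base != old:
            warnings.append(
                f"WARNING {pos} unresolved mismatched duplicate; "
                f"old: {old} new: {base} crs: {crs[pos]}"
            )
    return chosen, warnings


if __name__ == "__main__":
    calls = [(5331, "A"), (5331, "T"), (73, None), (73, "G")]
    results, messages = merge_calls(calls, {5331: "C", 73: "A"})
    print(results)
    print("\n".join(messages))
```

For position 5331 the first result (A) is kept and the conflicting T only produces the warning.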
Q. One of my matches looks like "A2(64)." What does it mean?
A. In the haplogroup reference there are a few areas in the chart where there is an unnamed branch. For example, between A2 and A2c through A2r there is an unnamed branch defined by the marker 64T. In order to simplify the algorithms, I treat these as normal haplogroups and arbitrarily name them by appending the marker number(s) in parentheses. If this is your best match, it could mean you have a unique sequence, or that you have untested markers for one or more of the children haplogroups.
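The naming convention can be sketched in a few lines. This is an illustration only: the function name is hypothetical, and joining several marker numbers with commas is an assumption, since the text only shows the single-marker case.

```python
import re
from typing import List


def unnamed_branch_label(parent: str, markers: List[str]) -> str:
    """Build a label like 'A2(64)' from a parent name and defining markers."""
    # Keep only the leading position number of each marker, e.g. "64T" -> "64".
    numbers = []
    for marker in markers:
        found = re.match(r"\d+", marker)
        if found is None:
            raise ValueError(f"marker has no leading position: {marker!r}")
        numbers.append(found.group())
    return f"{parent}({','.join(numbers)})"  # comma separator is assumed


if __name__ == "__main__":
    print(unnamed_branch_label("A2", ["64T"]))
```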
Q. How do I save my report as a PDF document?
A. For Windows, you can install the free PDFCreator program. From your web browser, you can then "print" to the "PDFCreator" printer which will save the document as a PDF file.
Need more help?
There is a discussion about mthap on eng.molgen.org. You can also email your questions to me at firstname.lastname@example.org. So that I can best help you, please include a copy of the complete mthap report and/or your mtDNA data file in your email.