Normally when I write something here it’s because I think I have something useful to say. This time I want to write about how I record my DNA matches in Excel. I’ll be honest, I’m sure I’m not the first person to do this, and I’m not suggesting that this is the best method, but it does help.
Firstly I should explain why you should record your matches. As I see it there are at least five great benefits of doing this:
- It’s cross-platform i.e. it keeps all your matches from Ancestry, 23andMe, MyHeritage etc. in one place.
- DNA matches sometimes disappear from your online lists. The matching algorithm used by the DNA testing companies do change, but more commonly matches choose to remove/hide their information.
- Recording matches can show you patterns of inheritance that you might not otherwise spot. As an example I can see that someone that shares my mitochondrial DNA as both she and I are on the same direct maternal line. There are plenty of examples online where yDNA or mitochondrial DNA has helped prove a theory, probably non more-so than the identification of Richard III of England from his mitochondrial DNA.
- You can get statistics from this data. I love statistics, more of this later.
- You don’t keep going back to the same DNA matches, only to realise you’ve already done the research.
I should mention that this article focuses on working with results from ancestry.com. This is because it is the main source of cousin-matching information. Roughly 85% of my cousin-matches come from ancestry.com (the joy of statistics).
Using the DNA websites’ tools.
Before I work through my Excel approach, it’s worth looking at the in-built tools available to help organise your matches. Both Ancestry and FamilyTreeDNA have a Notes field for each of your DNA matches. I am now fairly methodical, particularly at Ancestry, about the structure I use in this field, which displays the calculated relationship and the shared ancestor(s) name(s), birth date(s) and birth place(s). This is because the Ancestry match-list displays some of the notes field information (see illustration below).
What I like to see at a glance is the exact relationship and the common ancestor(s) of match and myself. More detailed info can then be added.
Recently (Feb 2019) Ancestry rolled-out their Groups feature. It allows you to create up to 24 colour-coded groups. I guess most people, like myself, will use it to group matches based on their shared ancestors. If you are doing so then 24 is an interesting number. If you don’t have complicated patchwork families then it means that you can directly identify all 16 great-great-grandparents, whilst having 8 other groups to play with. I’m still deciding how best to use these groups, but at the moment I’m using 4 groups to broadly categorise my matches as being related through one of my four grand-parents. I then create 16 groups for my gg-grandparents. A couple of suggestions with the groups:
- Number each group (in the Group name), so that they are listed in a way that makes sense to you. I have my four grandparents numbered 1 to 4. The 16 gg-grandparent numbers are labeled either M or P (Maternal/Paternal) plus numbers 1 to 8.
- Use the colour of each group to help visually cluster your ancestors. I use 4 primary colours for my four grandparents plus 4 similar colours for the grandparents of each grandparent. Lighter tones represent a male gg-grandparent, darker colours represent a female gg-grandparent. As ever a picture paints a thousand words, so hopefully the diagram below makes things clearer.
Just for completeness I use the blue shades for my Paternal Grandfather, green for Paternal mother, yellow for my Maternal grandfather and pink for my Maternal grandmother.
An Excel spreadsheet of your DNA matches
The next part assumes you have a certain familiarity with Excel (mainly using Formulas and Named Ranges). If you are reading this and something is unclear, please leave questions in the comments section and I will try to answer them.
It may be that you think this spreadsheet is not a bad idea and would like to try it out. For those of you that would like to do this I have labelled the important information that you need in Dark Red.
In addition I’ve uploaded a sample spreadsheet to Microsoft One Drive. This spreadsheet you can download and store somewhere in the privacy of your own computer. Much as I hate recommending that you download and open an Excel spreadsheet from the Internet, I think this is the best way to view and understand the spreadsheet. (But PLEASE make sure you scan it for viruses. There’s a lot that can go wrong between what I upload and what you download).
You can access this spreadsheet via this link:
(If your PC is set up like might then you will then have a view-only copy of the spreadsheet – use File, Save As to download a copy of the spreadsheet to somewhere convenient on your PC)
Lets start with the basic concept:
Each row on your spreadsheet represents one DNA match AND the “most-likely” shared ancestors(s).
Your “Most-likely” Shared Ancestors
I will be using the term “most-likely” quite a lot in this article. DNA helps significantly in genealogical research, but commonly doesn’t totally prove something, especially when you are dealing with distant relatives. There are two important factors here:
- There may be dead-ends in your family tree that hide other relationships. As an example my gg-grandmother, Ann Atkinson was illegitimate. There are no records identifying her father. Through her husband I have a number of DNA matches with his ancestors. Since both Ann and her husband came from the same small community (Hawnby in North Yorkshire) then it is possible that either there is another inheritance path that connects my matches with our “most-likely” shared ancestors OR I share other unknown ancestors of my gg-grandmother with my matches.
- There are circumstances that may well be hidden from any documentation that you have used in your research. Small DNA matches wouldn’t tell you that, say, one of your ancestors was having an affair with the brother of her husband, or was raped by her father-in-law.
What to record ?
My spreadsheet is split into three “logical” parts representing different things I want to store about my DNA matches. Each row of the spreadsheet represents one connection between myself and my DNA match.
Part 1 – Details about the Match
This is everything you want to know about the match themselves. Obviously you can record what information you want. I record the following
- Match Name (or their Username, or their Initials)
- Relationship. Although I enter this manually it’s based on calculated information in the second part of the spreadsheet. More later.
- DNA testing company used (ancestry, ftDNA, MyHeritage , GEDmatch etc.). I colour-code these details. It wsometimes makes it easier to find matches.
- Amount of shared DNA (both the centiMorgan total and the number of matches segments). Whilst the number of segments isn’t so interesting the amount shared is.
- Whether they match on my Mum’s or Dad’s side of the family
Part 2 – Relationship details
Here I record their degree of Cousinship, normally in a shorthand form eg. 3C1R for a third cousin once removed. I also have a number of other fields I use to calculate and validate this information. As ever, a picture paints a thousand words. below is a skeleton spreadsheet that I’ve prepared.
Part 3 – the Names of our ancestors up to a common match
It’s probably easiest if I describe this part of the spreadsheet as a Christmas Tree. At the core I have a “central” column where I record the most-likely common ancestor and then progress with the children of both my match and myself until they are all recorded. I use one column for each generation. Again an illustration is better:
Some points to note:
- To add new DNA matches to this, right-click on the row number below the point where you want to insert a new. match.
- Our most-likely shared ancestor(s) are recorded in black. I’ve use blue and pink to identify male and female ancestors. The colours help you spot direct paternal/maternal lines that may be interesting for additional yDNA/mitochondrial DNA testing.
- For all ancestors I record their year of birth and a location. The year of birth helps me spot any errors in the paths to a shared ancestor. The locations really helps me remember where my relatives migrated to.
- Since I’ve already recorded whether a match is on my paternal or maternal side (in the “Details of the Match” section) I don’t bother to re-record my Mum or Dad in this part of the spreadsheet.
- There is a visual trade off between how many columns of data you allocate to record the descendants of your shared ancestors. The more columns, the greater the amount of time you spend scrolling. I’ve allocated 6 columns for “my side” ancestors. This allows me to record up to 7th Cousins. (My most extreme match is currently a 7th cousin twice removed).
- I order matches based on which side of the family they come. Matches on my Mum’s side go at the top, matches on my Dad’s side are at the both.
- Since both my parents were “only-childs” my nearest relatives are going to be second cousins, sharing one of 4 sets of great grandparents which I use for clustering matches.
- Sometimes, thanks to cousin marriages, you may be related to someone in more than one way. To deal with this I record both paths of inheritance, but skip repeating some of the “Details of the Match” data to avoid screwing up some of my stats.
- For ease of use I’ve split and frozen the spreadsheet after column B. This mans I always see the match name and their relationship to myself.
- I often end up having to put in male or female “unknowns” in my DNA matches list. These are normally living parents and grandparents of my DNA match, which Ancestry hides for living people.
This is where the difficult bits come on. These are the five columns G to K in my spreadsheet. Each column is explained separately.
Count of Generations to Myself (column G).
I can count the number of direct ancestors I have recorded between myself and my common shared ancestor. The Excel function COUNTA does this by counting the non-blank cells in a range. Since I don’t record myself or my parents in the ancestors I add 2 to the COUNTA value to give me the total number of generations.
Count of Generations to my DNA match (column H)
Again the COUNTA function counts the number of generations between the shared ancestor and the DNA match. Since I have recorded all the generations to the match I dont need to add additional generations like I did with my own calculations.
Generations Difference (column I)
This is the result of subtracting column G (Generations to Myself) from column H (Generations to my DNA match). A positive number means that they have more generations to our shared ancestor. I use Excel’s ability to do Conditional Formatting to help highlight this information. As an aside I’ve found that in most cases where there is a difference it is my DNA Match that has more generations. My ancestors were slow to procreate!
Cousinship (column J)
I normally calculated Cousinship based on the minimum number of generations from either my DNA match or myself. The Excel function MIN does this. Given that first cousins are two generations back to a common ancestor I need to deduct one from this figure.
Number of Generations Removed (column K)
This is basically a count of the number of generations difference between myself and my DNA match. I’ve already, in column I calculated this, but since it’s always a positive number I use the Excel ABS function to calculate this.
Once I have all these fields calculated my relationship with my DNA match is column I and column J. I’m sure I could use Excel to enter this information directly in column B (the Relationship) field, however I’ve yet to work on this.
To summarise, below is a table of the formulas I use on in the spreadsheet.
|Column Description||Column Letter||Cell contents|
|Count of Generations to Myself||G||=2+COUNTA($L3:$Q3)|
|Count of Generations to DNA match||H||=COUNTA($S3:$Z3)|
|Number of Generations removed||J||=ABS(G3-H3)|
Finally, I should point out that you need to copy these formulas into the new row you create when you add a DNA match.
Prep-work (Excel Ranges)
Once you have sorted the structure used to record your DNA matches it’s time to grab some statistics off your data. To do this I start by creating 4 Named Ranges in Excel. These cover the data help in columns B (Relationship), C (DNA Website – refered to as DTC), D (Total CentiMorgan match length) and F (Parent). This can all be done through the Name Manager (Ctrl + F3 in Excel).
Once this is done I used a second sheet within Excel to generate the statistics. At the moment I’m interested in two sets of statistics. The percentage split between my DNA matches between my Mum’s and Dad’s side of the family and the percentage of DNA matches from the different DNA matching sites. Below are the values I currently (Mar 2019) have.
To get the number of DNA matches from each side of the family I use the count function against the Named Range “Parent” e.g. =COUNTIF(Parent,”Dad”).
To find the number of matches through each DNA matching service I use the same function to check the Named Range “DTC” for the companies named in Column B e.g. =COUNTIF(DTC,$B6)
The last part of the spreadsheet is a table counting my cousins and a bar graph that shows visually how many I have. Again I’ll start with some screenshots (taken March 2019) from my “live” spreadsheet.
As you can see from the image this part of the spreadsheet looks at each relationship (nth degree Cousin x times removed). For each relationship I’m interested in 5 things:
- The number of recombination events that occurred between myself and my match e.g. a first cousin has 4 recombination event Me -> Parent -> grandparent(s) -> Aunt/Uncle -> First Cousin.
- I count the number of cousins on my Maternal and Paternal lines.
- I’m interested in the Minimum, Maximum and Average number of cM matches between myself and my match.
- I take expected centiMorgan(cM) match values from the The Shared cM Project 3.0 tool.
- I look for any outliers in my data. Specifically anywhere where my Average cM length is over the expected average and. more importantly, if any maximum centiMorgam length is over the expected maximum cM length. If this later condition occurred it would suggest that there may be something wrong with my understanding of the relationship between myself and my DNA match.
I should mention that each new type of cousin relationship requires me to manually create a new row. I then enter the relationship, figure out the number of recombination events and grab the Min, Max and Average expected cM values from The Shared cM Project tool.
Below are the excel calculations used:
|Column Description||Column Letter||Cell contents|
|Mum-(number of cousins per relationship)||C||=COUNTIFS(Relations!$B:$B, Diagrams!$B3, Relations!$F:$F, Diagrams!C$2)|
|Dad-(number of cousins per relationship)||D||=COUNTIFS(Relations!$B:$B, Diagrams!$B3, Relations!$F:$F, Diagrams!D$2)|
|Outliers – Average cM||L||=IF(H3>I3,H3-I3,””)|
|Outliers – Max cM||M||=IF(G3>K3,G3-K3,””)|
Something to show your Friends
The last thing I have created is a simple, stacked, Bar chart which shows the distribution of my DNA matches. It’s colour-coded to show the split between my maternal and paternal DMA-matches.
If you are making your own Excel spreadsheet follow these steps:
- Insert -> Column Chart -> 2-D Column chart -> Stacked Chart
- Design -> Select Data Source -> select the three summary columns Relationship and the columns that count how many maternal and paternal cousins you have. (columns B,C and D in my spreadsheet)
- Design -> Add Chart -> Legend -> Right
- double-click on the Chart Title to change it to something you prefer
As you can see the number of matches currently peaks at around 4th cousin level. I’m curious to see how this will develop.
What I’ve learnt from this work
- Most of my matches have more generations to our most-likely common match, in other words my family had children when they were older in life and/or I am the product of a number of younger/youngest children. This is despite the fact that some of my matches are tens of years older than myself. Sometime I need to look at this effect more.
- As yet I’ve only identified two match who have a direct maternal line to our shared ancestors. That is to say they should have the same mitochondrial DNA as myself. There is only one DNA match who has a direct paternal line to our shared ancestors, but unfortunately not on my own paternal line.
- I’ve now found three matches who have cousin-marriage within their direct family-tree. Normally this would mean my matches share more DNA with myself than would be expected from a single path of inheritance.
- Given the avalanche of new cousin-matches coming from the recent (Feb 2019) introduction of ThruLines and Common Ancestors at AncestryDNA it has been incredibly useful to keep track of these matches. This new information is on top of the “Theory of Family Relativity™ information at MyHeritage.
I’ve yet to try this work in Google Sheets. It should work, although I’m not sure how powerful google sheets is for the stats. If you are interested in this, then let me know in the comments.
I’ve used Excel 2013 and Excel 2010 for this work. Interoperability between these two versions doesn’t seem to be a problem.