Executables required: pedraw (optional), makeped, unknown, mlink.
The LINKAGE programs carry out a variety of likelihood calculations for linkage analysis in different circumstances. They all require two data files: a pedigree file called pedfile.dat which gives information about the observations made on pedigree members and a locus file called datafile.dat which gives information about the genetic loci typed in the pedigree. For this exercise, the pedigree and locus data files are supplied as autdom.ped and autdom.par respectively.
Consider the following pedigrees, in which an autosomal dominant disease is segregating and which have been typed for a DNA marker having 4 alleles:
The subjects shaded solid are affected with an inherited disease. The first number under each subject is an ID number, and the pair of numbers underneath represent the alleles of a marker for which the subjects have been genotyped.
The pedigree file for these families might appear as follows:
004 1 0 0 1 2 2 2
004 2 0 0 2 1 1 2
004 3 1 2 1 2 1 2
004 4 0 0 2 1 2 3
004 5 3 4 2 2 2 3
004 6 3 4 1 2 2 2
004 7 3 4 1 1 1 3
004 8 3 4 1 2 2 3
004 9 3 4 2 1 2 2
007 1 101 102 1 2 1 4
007 2 0 0 2 1 2 4
007 3 1 2 2 2 1 2
007 4 1 2 1 2 1 2
007 5 1 2 1 1 2 4
007 6 1 2 1 2 1 4
007 7 1 2 2 1 1 2
007 101 0 0 1 2 0 0
007 102 0 0 2 1 0 0
Each subject is coded in one row of the file according to the following scheme:
The first column provides a number identifying the family;
The second column provides a number identifying the subject within the family;
The third and fourth columns provide the codes identifying the father and mother, or zeroes if these are not included in the pedigree (the pedigree must contain both parents or neither);
The fifth column codes gender, 1=male 2=female;
Subsequent columns code data for the genetic loci examined in the pedigrees. In the present example these consist of the disease locus and a marker locus. The affection status is coded with a single digit: 1 for unaffected, 2 for affected or 0 if the affection status is unknown (e.g. for a subject who has not been examined or who has not passed through the age of risk). The observed alleles at the marker locus are coded as a pair of numbers, or two 0's if the genotype is unknown.
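The column scheme above can be illustrated with a short Python sketch that splits one row of the pre-makeped pedigree file into named fields. The names PedRecord and parse_ped_line are hypothetical, and the sketch assumes exactly one affection locus followed by one marker locus, as in this example; it is not part of the LINKAGE package.

```python
from typing import NamedTuple, Tuple


class PedRecord(NamedTuple):
    """One row of the example pedigree file (hypothetical helper type)."""
    family: str
    subject: str
    father: str               # "0" if the parents are not in the pedigree
    mother: str               # "0" if the parents are not in the pedigree
    sex: int                  # 1 = male, 2 = female
    affection: int            # 0 = unknown, 1 = unaffected, 2 = affected
    marker: Tuple[int, int]   # (0, 0) if the genotype is unknown


def parse_ped_line(line: str) -> PedRecord:
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, found {len(fields)}")
    record = PedRecord(
        family=fields[0],
        subject=fields[1],
        father=fields[2],
        mother=fields[3],
        sex=int(fields[4]),
        affection=int(fields[5]),
        marker=(int(fields[6]), int(fields[7])),
    )
    # The pedigree must contain both parents or neither.
    if (record.father == "0") != (record.mother == "0"):
        raise ValueError("a subject must have both parents coded or neither")
    return record


affected_son = parse_ped_line("007 1 101 102 1 2 1 4")
untyped_founder = parse_ped_line("007 101 0 0 1 2 0 0")
```

Identifiers are kept as strings here because, as noted later in the text, a pedigree identifier need not be a number.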
This example pedigree file is called autdom.ped. If the pedraw program is available then you can use it to display the pedigrees on screen by running pedraw and specifying autdom.ped as the pedigree file to load.
In order for the LINKAGE programs to understand what loci are coded for in the pedigree data file and know what analyses to perform, a locus data file is required. The locus data file corresponding to autdom.ped is called autdom.par and appears as follows:
2 0 0 5 << no loci, risk locus, sexlinked(if 1)
0 0.0 0.0 0 << mut locus, mut rate, haplotype freq(if 1)
1 2 << order of loci
1 2 # DIS1
0.9995 0.0005 << gene freqs
1 << number of liability classes
0.0 1.0 1.0
3 4 # MAR1
0.14 0.32 0.21 0.33 << gene freqs
0 0
0.0
1 0.05 0.4
There are three sections to a locus data file. The first mainly defines how many loci there are and what order they are in along the chromosome, the second section provides detailed information about each locus in turn and the third section defines what type of analysis is to be performed.
The 2 in the first line means that two loci are coded for. The first 0 means that risk calculations are not to be performed, and the second 0 means that the loci are not sex-linked. The 5 means that the program mlink is to be used for the analyses. The next line containing four 0's means that no mutation or linkage disequilibrium is to be incorporated in the analysis. The third line gives the order of the two loci (here rather meaningless as only two are involved, but important for multipoint analyses).
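As an illustration of how the first section is laid out, the following Python sketch reads the first and third lines of the example locus file. The function names are hypothetical, and treating everything after "<<" or "#" as a comment is an assumption made for this example rather than a documented rule of the LINKAGE programs.

```python
def numeric_fields(line):
    """Drop trailing annotation (assumed to start at '<<' or '#') and split."""
    for marker in ("<<", "#"):
        line = line.split(marker, 1)[0]
    return line.split()


def parse_header(lines):
    """Interpret the first section of a locus data file (sketch only)."""
    n_loci, risk_locus, sexlinked, program = (int(x) for x in numeric_fields(lines[0]))
    order = [int(x) for x in numeric_fields(lines[2])]
    return {
        "n_loci": n_loci,            # 2 in the example
        "risk_locus": risk_locus,    # 0 = no risk calculations
        "sexlinked": sexlinked,      # 0 = not sex-linked
        "program": program,          # 5 selects mlink, per the text above
        "order": order,              # order of loci along the chromosome
    }


header = parse_header([
    "2 0 0 5 << no loci, risk locus, sexlinked(if 1)",
    "0 0.0 0.0 0 << mut locus, mut rate, haplotype freq(if 1)",
    "1 2 << order of loci",
])
```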
The second part of the locus data file contains detailed information for each locus. The line reading 1 2 means that the locus is an affection locus with two alleles, the 1 being the definition of an affection locus. The next line gives the frequency of these two alleles, the second allele being rare with a frequency of 0.0005. The 1 on the next line means there is just one liability class because we are dealing with a simple Mendelian trait. The fourth line describing this affection locus provides the penetrance values for the three possible genotypes, meaning the probability of appearing affected conditional on each genotype. The first number provides the probability of affection if the subject is homozygous for the first allele, and is 0 because subjects homozygous for this (normal) allele are always unaffected. The second penetrance value is for heterozygous subjects, and because we are dealing with a dominant disease this is 1, because all subjects carrying even one disease allele are affected. The third penetrance is for those homozygous for the disease allele and is also 1. (If the disease were recessive, this set of penetrances would read 0 0 1 instead of 0 1 1.)
For the next locus, the line reading 3 4 means that the locus is a codominant marker with 4 alleles, the 3 being the definition for a locus coded with codominant numbered alleles. The second line gives the frequencies of these alleles.
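To see what the allele frequencies and penetrances imply together, the sketch below computes the population frequency of affected subjects for the dominant model in the file and for the recessive alternative mentioned above. It assumes Hardy-Weinberg genotype proportions, which is an assumption of this example and is not stated in the text; the function names are hypothetical.

```python
def genotype_frequencies(p, q):
    """Frequencies of genotypes 1/1, 1/2 and 2/2, assuming Hardy-Weinberg proportions."""
    return (p * p, 2 * p * q, q * q)


def prevalence(allele_freqs, penetrances):
    """Sum over genotypes of P(genotype) * P(affected | genotype)."""
    p, q = allele_freqs
    return sum(f * pen for f, pen in zip(genotype_frequencies(p, q), penetrances))


allele_freqs = (0.9995, 0.0005)                         # from the locus file
dominant = prevalence(allele_freqs, (0.0, 1.0, 1.0))    # penetrances 0 1 1
recessive = prevalence(allele_freqs, (0.0, 0.0, 1.0))   # penetrances 0 0 1
```

Under these assumptions the dominant model gives a prevalence of about 0.001 (2pq + q²), while the recessive coding would give q² = 0.00000025, which shows how strongly the penetrance line shapes the model even with the same allele frequencies.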
The third section of the file determines what analysis is to be performed. The two 0's mean that no interference parameter will be used and that the recombination fraction is assumed to be equal in male and female meioses. The next line specifies the initial recombination fraction(s) between the loci. Here we wish to measure the lod score at theta=0 so we set the recombination fraction to begin at 0. The 1 on the next line means that it is the first recombination fraction which is to be changed for subsequent evaluations (of course, here there is only one recombination fraction because there are only two loci). The 0.05 is the value to increment the recombination fraction by each time and the 0.4 is the value at which to stop. This means that mlink will calculate likelihoods at recombination fractions from 0 to 0.4, at intervals of 0.05. (The likelihood will also be evaluated at a recombination fraction of 0.5, so that lod scores can be calculated.)
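The set of recombination fractions implied by the starting value, increment and stopping value can be sketched as follows. This is only an illustration of the arithmetic, not mlink's own code; theta_grid is a hypothetical name, and the step count is computed as an integer to avoid accumulating floating-point error.

```python
def theta_grid(start, step, stop):
    """Recombination fractions from start to stop inclusive, in increments of step."""
    n_steps = int(round((stop - start) / step))
    return [round(start + i * step, 10) for i in range(n_steps + 1)]


# Values taken from the last two lines of the example locus file.
thetas = theta_grid(0.0, 0.05, 0.4)

# theta = 0.5 (no linkage) is evaluated as well, as the reference for the lod scores.
all_thetas = [0.5] + thetas
```

With the example settings this gives nine values from 0 to 0.4, plus the reference evaluation at 0.5, so ten likelihood evaluations in total.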
With these two data files, we can proceed to carry out a linkage analysis evaluating the lod scores between the two loci in these families. This takes place in several stages. We begin by running the pedigree file through a preprocessor called makeped which renumbers subjects and adds some additional information concerning the children and siblings of each subject and the member of each pedigree to be used as the "proband". Next we run the program called unknown which works out possible genotypes for subjects who have not been typed. Finally we run the mlink program, which carries out the actual likelihood calculations.
The first stage is to run makeped on the pedigree file. The makeped program runs from the operating system prompt. Since our example pedigree file is called autdom.ped, at the operating system prompt enter:
makeped autdom.ped autdom.ppd n
This instructs makeped to use autdom.ped as input and to write output to a file called autdom.ppd. The n means that the pedigrees contain no in-breeding or marriage loops and that "probands" are to be assigned automatically.
Use a text editor to examine the file autdom.ppd. It should appear as follows:
004 1 0 0 3 0 0 1 1 2 2 2 Ped: 004 Per: 1
004 2 0 0 3 0 0 2 0 1 1 2 Ped: 004 Per: 2
004 3 1 2 5 0 0 1 0 2 1 2 Ped: 004 Per: 3
004 4 0 0 5 0 0 2 0 1 2 3 Ped: 004 Per: 4
004 5 3 4 0 6 6 2 0 2 2 3 Ped: 004 Per: 5
004 6 3 4 0 7 7 1 0 2 2 2 Ped: 004 Per: 6
004 7 3 4 0 8 8 1 0 1 1 3 Ped: 004 Per: 7
004 8 3 4 0 9 9 1 0 2 2 3 Ped: 004 Per: 8
004 9 3 4 0 0 0 2 0 1 2 2 Ped: 004 Per: 9
007 1 2 3 5 0 0 1 0 2 1 4 Ped: 007 Per: 1
007 2 0 0 1 0 0 1 1 2 0 0 Ped: 007 Per: 101
007 3 0 0 1 0 0 2 0 1 0 0 Ped: 007 Per: 102
007 4 0 0 5 0 0 2 0 1 2 4 Ped: 007 Per: 2
007 5 1 4 0 6 6 2 0 2 1 2 Ped: 007 Per: 3
007 6 1 4 0 7 7 1 0 2 1 2 Ped: 007 Per: 4
007 7 1 4 0 8 8 1 0 1 2 4 Ped: 007 Per: 5
007 8 1 4 0 9 9 1 0 2 1 4 Ped: 007 Per: 6
007 9 1 4 0 0 0 2 0 1 1 2 Ped: 007 Per: 7
The following changes have been made to the original file. The subjects within each pedigree have been renumbered so that they are consecutive. Three new ID's have been added for each subject after the parental ID's: the ID for the first child; the ID for the next paternal sibling; the ID for the next maternal sibling. (Because in these pedigrees no subject has step-brothers or step-sisters the paternal and maternal siblings are always the same.) After the digit coding for gender an additional digit, either 0 or 1, has been added to denote whether or not the subject is to be regarded as the proband for the pedigree. In this context "proband" does not have its usual meaning of being the affected subject who led to ascertainment of the pedigree, but is simply the subject at which the "peeling" process for the likelihood calculations will begin and end. These likelihood calculations will generally be more efficient if a founder is chosen as proband, and makeped does this automatically. (The original pedigree and individual ID's are appended, although these will not be read by the LINKAGE programs.)
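The three pointer fields can be reproduced for pedigree 004 with a short Python sketch. This illustrates the idea of the pointers only; it is not makeped's actual algorithm, it does not renumber subjects or choose a proband, and the name add_pointers is hypothetical. It assumes that siblings are ordered as they appear in the file.

```python
from collections import defaultdict


def add_pointers(rows):
    """rows: (subject, father, mother) integer triples, 0 meaning parent absent.

    Returns {subject: (first_child, next_paternal_sib, next_maternal_sib)}.
    """
    children_of_father = defaultdict(list)
    children_of_mother = defaultdict(list)
    for subject, father, mother in rows:
        if father:
            children_of_father[father].append(subject)
        if mother:
            children_of_mother[mother].append(subject)

    def next_sibling(siblings, subject):
        i = siblings.index(subject)
        return siblings[i + 1] if i + 1 < len(siblings) else 0

    pointers = {}
    for subject, father, mother in rows:
        # A subject is a father or a mother, so at most one of these lists exists.
        kids = children_of_father.get(subject) or children_of_mother.get(subject) or []
        first_child = kids[0] if kids else 0
        next_pat = next_sibling(children_of_father[father], subject) if father else 0
        next_mat = next_sibling(children_of_mother[mother], subject) if mother else 0
        pointers[subject] = (first_child, next_pat, next_mat)
    return pointers


# Subject, father and mother columns of pedigree 004 from autdom.ped.
ped_004 = [(1, 0, 0), (2, 0, 0), (3, 1, 2), (4, 0, 0),
           (5, 3, 4), (6, 3, 4), (7, 3, 4), (8, 3, 4), (9, 3, 4)]
pointers_004 = add_pointers(ped_004)
```

For this pedigree the result agrees with columns five to seven of the autdom.ppd listing: for example, subject 3 points to first child 5, and subject 5 points to sibling 6 on both the paternal and maternal side.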
The modified pedigree file produced by makeped is sometimes referred to as a "pedigree file with pointers" because of the pointers to children and siblings.
A number of other features can apply to the post-makeped pedigree file, although they are not relevant to the current example. The pedigree identifier in the original pedigree file need not be a number but could be, for example, the family name. If some of the pedigree identifiers are not integers then makeped will assign a numerical identifier to all the pedigrees in its output file. The advantage of using numbers as pedigree identifiers in the original pedigree files is that these will then be retained through all stages of linkage analysis, allowing one to easily relate the final results back to the pedigrees which have produced them. If the pedigree contains marriage or in-breeding loops then makeped will split each loop by duplicating one subject and coding one copy as being without children and the other as without parents. Both copies of the original subject would then be coded with the digit 2 (or higher if there was more than one loop) in the proband field to indicate that they represented a single subject at whom a loop had been broken.
The LINKAGE programs can also be used to carry out risk calculations which provide probabilities for the proband to be carrying a disease gene. If this is to be done then the proband should be chosen to be the subject for whom the risk calculations are desired.
The next stage in the linkage analysis is to run the unknown program on the pedigree and locus data files. In order for unknown to read these files, autdom.ppd and autdom.par, they must be named pedfile.dat and datafile.dat. To accomplish this, copy autdom.ppd to pedfile.dat and copy autdom.par to datafile.dat. Then at the operating system prompt enter the command:
unknown
The unknown program checks through the pedigrees and eliminates impossible genotypes, considerably speeding up the subsequent analysis performed by mlink.
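The copying step that precedes unknown could be scripted as in the Python sketch below, which uses only the standard library. The function name stage_inputs is hypothetical, and the demonstration runs in a scratch directory with stand-in file contents so that it does not touch real data.

```python
import shutil
import tempfile
from pathlib import Path


def stage_inputs(ppd_path, par_path, workdir="."):
    """Copy the post-makeped pedigree file and the locus file to the fixed
    names that the LINKAGE programs expect in their working directory."""
    workdir = Path(workdir)
    pedfile = workdir / "pedfile.dat"
    datafile = workdir / "datafile.dat"
    shutil.copyfile(ppd_path, pedfile)
    shutil.copyfile(par_path, datafile)
    return pedfile, datafile


# Demonstration with stand-in files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "autdom.ppd").write_text("004 1 0 0 3 0 0 1 1 2 2 2\n")
    (tmp / "autdom.par").write_text("2 0 0 5\n")
    pedfile, datafile = stage_inputs(tmp / "autdom.ppd", tmp / "autdom.par", tmp)
    staged_ok = (pedfile.read_text().startswith("004")
                 and datafile.read_text().startswith("2 0 0 5"))
```

Copying rather than renaming keeps autdom.ppd and autdom.par intact, so the same inputs can be staged again for a later run.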
If unknown runs correctly it will create two new files called ipedfile.dat and speedfile.dat (under MSDOS speedfil.dat). List the directory to check they have been created. The files contain information regarding lists of possible genotypes which untyped subjects might have, based on the genotypes of their parents and children. The mlink program requires these two new files as well as datafile.dat. At the operating system prompt enter the command:
mlink
You should see the log likelihoods and lod scores displayed on the screen as they are calculated. They will also be written to a file called outfile.dat. Because the LINKAGE programs always write their output to outfile.dat (and another file called stream.dat), it is sensible to copy the results to a file with another name so that they are not overwritten by the next run. Copy outfile.dat to a new file called autdom.res, then examine autdom.res with an editor. It should appear as follows:
Length of real variables = 8 bytes
LINKAGE (V5.1) WITH 2-POINT AUTOSOMAL DATA
ORDER OF LOCI: 1 2
-----------------------------------
-----------------------------------
THETAS 0.500
-----------------------------------
PEDIGREE | LN LIKE | LOG 10 LIKE
-----------------------------------
4 -25.391461 -11.027348
7 -21.936689 -9.526963
-----------------------------------
TOTALS -47.328150 -20.554311
-2 LN(LIKE) = 9.46562998410630E+0001 LOD SCORE = 0.000000
-----------------------------------
-----------------------------------
THETAS 0.000
-----------------------------------
PEDIGREE | LN LIKE | LOG 10 LIKE
-----------------------------------
4 -100000000000000000000.000000 -43429355638650388500.000000
7 -100000000000000000000.000000 -43429355638650388500.000000
-----------------------------------
TOTALS -200000000000000000000.000000 -86858711277300777000.000000
-2 LN(LIKE) = 4.00000000000000E+0020 LOD SCORE = -86858711277300777000.000000
-----------------------------------
-----------------------------------
THETAS 0.050
-----------------------------------
PEDIGREE | LN LIKE | LOG 10 LIKE
-----------------------------------
4 -25.126631 -10.912334
7 -22.364860 -9.712914
-----------------------------------
TOTALS -47.491490 -20.625248
-2 LN(LIKE) = 9.49829808289596E+0001 LOD SCORE = -0.070938
-----------------------------------
-----------------------------------
THETAS 0.100
-----------------------------------
PEDIGREE | LN LIKE | LOG 10 LIKE
-----------------------------------
4 -24.649752 -10.705229
7 -21.886756 -9.505277
-----------------------------------
TOTALS -46.536509 -20.210506
-2 LN(LIKE) = 9.30730176092179E+0001 LOD SCORE = 0.343805
-----------------------------------
-----------------------------------
THETAS 0.150
-----------------------------------
PEDIGREE | LN LIKE | LOG 10 LIKE
-----------------------------------
4 -24.472921 -10.628432
7 -21.705815 -9.426696
-----------------------------------
TOTALS -46.178736 -20.055128
-2 LN(LIKE) = 9.23574722464653E+0001 LOD SCORE = 0.499183
-----------------------------------
-----------------------------------
THETAS 0.200
-----------------------------------
PEDIGREE | LN LIKE | LOG 10 LIKE
-----------------------------------
4 -24.427737 -10.608809
7 -21.650608 -9.402720
-----------------------------------
TOTALS -46.078345 -20.011528
-2 LN(LIKE) = 9.21566906886759E+0001 LOD SCORE = 0.542782
-----------------------------------
-----------------------------------
THETAS 0.250
-----------------------------------
PEDIGREE | LN LIKE | LOG 10 LIKE
-----------------------------------
4 -24.462748 -10.624014
7 -21.664755 -9.408863
-----------------------------------
TOTALS -46.127503 -20.032877
-2 LN(LIKE) = 9.22550059063503E+0001 LOD SCORE = 0.521433
-----------------------------------
-----------------------------------
THETAS 0.300
-----------------------------------
PEDIGREE | LN LIKE | LOG 10 LIKE
-----------------------------------
4 -24.556398 -10.664685
7 -21.719000 -9.432422
-----------------------------------
TOTALS -46.275398 -20.097107
-2 LN(LIKE) = 9.25507957943628E+0001 LOD SCORE = 0.457203
-----------------------------------
-----------------------------------
THETAS 0.350
-----------------------------------
PEDIGREE | LN LIKE | LOG 10 LIKE
-----------------------------------
4 -24.698679 -10.726477
7 -21.791982 -9.464118
-----------------------------------
TOTALS -46.490662 -20.190595
-2 LN(LIKE) = 9.29813231712021E+0001 LOD SCORE = 0.363716
-----------------------------------
-----------------------------------
THETAS 0.400
-----------------------------------
PEDIGREE | LN LIKE | LOG 10 LIKE
-----------------------------------
4 -24.885319 -10.807533
7 -21.864182 -9.495473
-----------------------------------
TOTALS -46.749501 -20.303007
-2 LN(LIKE) = 9.34990011077663E+0001 LOD SCORE = 0.251304
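The lod scores in the listing above can be checked by hand: each is the total log10 likelihood at the given recombination fraction minus the total at 0.5. The Python sketch below does this for a few of the blocks; the function names are hypothetical and the parsing assumes the THETAS and TOTALS line layout shown in this particular output.

```python
def parse_totals(text):
    """Return {theta: total log10 likelihood} from output laid out as above."""
    totals = {}
    theta = None
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "THETAS":
            theta = float(parts[1])
        elif len(parts) == 3 and parts[0] == "TOTALS" and theta is not None:
            totals[theta] = float(parts[2])   # third column is LOG 10 LIKE
    return totals


def lod_scores(totals):
    """lod(theta) = log10 L(theta) - log10 L(0.5)."""
    reference = totals[0.5]
    return {theta: value - reference for theta, value in totals.items()}


# THETAS and TOTALS lines copied from four blocks of the listing.
excerpt = """\
THETAS 0.500
TOTALS -47.328150 -20.554311
THETAS 0.150
TOTALS -46.178736 -20.055128
THETAS 0.200
TOTALS -46.078345 -20.011528
THETAS 0.250
TOTALS -46.127503 -20.032877
"""
lods = lod_scores(parse_totals(excerpt))
best_theta = max(lods, key=lods.get)
```

Among the values evaluated, the lod score is highest at a recombination fraction of 0.2, at about 0.54. Recomputed values can differ from the printed ones in the last decimal place because the printed likelihoods are rounded. The enormous negative values reported at a recombination fraction of 0 look like a placeholder for a likelihood of zero, so that block is probably best read as a lod score of minus infinity; this reading is an assumption rather than something the output states.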
This section introduces the main files used for linkage analysis, including the pedigree and locus data files called pedfile.dat and datafile.dat, and the programs makeped, unknown and mlink. A simple two-point linkage analysis is performed between a disease and marker locus.
Exercises in genetic linkage analysis
All material copyright (C) Dave Curtis 1996-8
dcurtis@hgmp.mrc.ac.uk