Executables required: qdb, splink.
There are several different programs available for carrying out affected sib pair analysis. One of these is splink, which works by comparing the likelihood of the data assuming that a pair of affected siblings share 0, 1 or 2 alleles identical-by-descent with probabilities 0.25, 0.5 and 0.25 compared with the likelihood assuming that the probabilities may be different from this and that there may be increased allele-sharing.
The splink program can read datafiles produced by modifying standard LINKAGE pedigree files using a utility called ped2spl. However for this exercise a qdb report file is provided which produces splink datafiles directly.
Examine this report file, called alz2spl.rep, with a text editor. It appears as follows:
:DETAIL
[id,1,3] fformat "%03.0f "
[id,4,3] fformat "%3.0f "
.if (father=" ")
" 0 "
.else
[father,4,3] fformat "%3.0f "
.endif
.if (mother=" ")
" 0 "
.else
[mother,4,3] fformat "%3.0f "
.endif
.if (sex="M")
"1 "
.else
"2 "
.endif
.if (phe0!=" ")
" "
[phe0,1,1]
.else
" 0"
.endif
("123456789abcdefghijklmnopq" strstr [phe1,1,1]) fformat " %.0f"
"/"
("123456789abcdefghijklmnopq" strstr [phe1,2,1]) fformat "%.0f"
/1
It is very similar to the the report files used to export data as LINKAGE pedigree files except that a slash is placed between the two alleles of the marker. The file exports the data for affection status and the first marker locus in format which can be understood by splink.
To produce the splink datafile, run qdb , load alz.hdc and select Reports, All records and choose alz2spl.rep, and then enter adm1.spl as the output file. A file called adm1.spl containing data for Alzheimer's disease and the first marker should be written to disk, and if you examine the file with a text editor you should see that the beginning of the file appears as follows:
001 1 0 0 1 2 2/2 001 2 0 0 2 1 1/1 001 3 1 2 1 2 1/2 001 4 0 0 2 1 2/3 001 5 3 4 2 2 2/3 001 6 3 4 1 2 2/2 001 7 3 4 1 1 1/3 001 8 3 4 1 2 2/3 001 9 3 4 1 1 2/2 002 1 0 0 1 1 0/0 002 2 0 0 2 1 1/1 002 3 1 2 1 2 1/2 002 4 0 0 2 1 2/3
In order to carry out a splink analysis on the new data file which has been produced, simply enter the following command at the operating system prompt:
splink -l1 -spt <adm1.spl > adm1.spo
This command makes splink read input from the adm1.spl file and sends it to a new file called adm1.spo. The -l1 (letter L followed by number 1) switch means that there is only one marker, and the -spt switch means that p values will be calculated according to the "possible triangle" method.
After splink has run examine the output file, adm1.spo, with a text editor. The first few lines should appear as follows:
A program for linkage analysis using affected sib pairs
with uncertain marker haplotype assignment
Version 1.04
Input is from standard input channel
Command line options:
Number of marker loci: 1
The marker loci are on an autosome
Maximum number of persons: 2000
Maximum number of (nuclear) families: 500
Maximum number of parental haplotype assignments considered: 1000
Families without an affected sibpair: excluded
Multiple families from the same pedigree: permitted
Estimation of ibd sharing probabilities: Restricted to the "possible triangle"
Score tests:
1~2 df "possible triangle" test
Variance estimates used:
Naive (pseudo-likelihood based)
Bootstrap testing not selected
Affected sib trios, quartets etc.: Each sib pair weighted by 2/3, 1/2 etc.
These lines just describe some of the options which the program has used to analyse the data.
In order to carry out an affected sib pair analysis, splink begins by breaking each pedigree into nuclear families consisting of affected subjects together with their parents and siblings. The only families that will be used in the analysis are those containing at least two affected children. The next few lines of output appear as follows:
Phase 1: Read input data ======= Input file describes 94 persons Computing nuclear family groupings ... There are 20 nuclear families Number of alleles at each locus 3 Number of possible marker haplotypes 3 Phase 2: Compute and store possible haplotype assignments ======= (001.2 : Female, Unaffected, 1/1, (001.1 : Male, Affected, 2/2, |---001.3 : Male, Affected, 1/2, There are 1 affected siblings This family is uninformative about IBD sharing probabilities *** This family will be omitted *** (001.4 : Female, Unaffected, 2/3, (001.3 : Male, Affected, 1/2, |---001.5 : Female, Affected, 2/3, |---001.6 : Male, Affected, 2/2, |---001.8 : Male, Affected, 2/3, |---001.7 : Male, Unaffected, 1/3, |---001.9 : Male, Unaffected, 2/2, There are 1 possible parental haplotype assignments
This means that splink can derive 20 nuclear families from the pedigrees. However the first family contains only one affected offspring so cannot be used for an affected sib pair analysis. The second family can be used as it contains two affected siblings. The statement that there is 1 possible parental haplotype assignment means that it is possible to deduce exactly which alleles were inherited from each parent without any ambiguity. You can look through the output for each of the families. As it happens, it is possible to deduce the haplotype assignment for each of the 8 informative families, but splink could still be used even if there was incomplete information owing to some parents not being genotyped.
The "Phase 3" output appears as follows:
Phase 3: Maximize log-likelihood with respect to haplotype probabilities ======= Iteration 1, Log-likelihood = -23.5343 Iteration 2, Log-likelihood = -22.1807 Iteration 3, Log-likelihood = -22.1807 Estimated haplotype probabilities [1] : 0.25 [2] : 0.5 [3] : 0.25 Estimated information content of marker Heterozygosity 0.625 PIC score 0.5546875
Here splink is estimating the allele frequencies (which it calls "haplotype frequencies") under the null hypothesis that allele-sharing between affected sib pairs occurs according to chance expectation. The allele frequencies are estimated as 0.25, 0.5 and 0.25 for alleles 1, 2 and 3. (It is pure coincidence that these values are the same as the Mendelian probabilities for siblings to share 0, 1 or 2 alleles IBD.) These values are used to estimate the heterozygosity and PIC of the marker. The maximum log likelihood under the null hypothesis of Mendelian allele-sharing probabilities is -22.1807.
The next part of the output provides one way of testing for linkage:
Phase 4: Significance tests ======= Effective sample size: 16 (the number of fully informative sib pairs to carry the same information) Initial ibd assignments (0:1:2) (1 : 9.33333 : 5.66667) Expected values if no linkage (4 : 8 : 4) Score vector (-3, 1.33333, 1.66667) Naive (pseudo-likelihood) covariance matrix 3 -2, 4 -1, -2, 3 Pseudo-likelihood chi-squared score test(s) 1~2 df "possible triangle" test = 2.72222 (P = 0.07459379)
This analysis uses the observed IBD sharing values to produce a test for linkage based on the curvature of the likelihood surface at the null hypothesis values, called a "score test". A more familiar kind of test is the likelihood ratio test which forms the final part of the output:
Phase 5: Maximize log-likelihood with respect to ALL parameters ======= Iteration 1, Log-likelihood = -22.1807 Iteration 2, Log-likelihood = -20.3778 Iteration 3, Log-likelihood = -20.3778 Estimated haplotype probabilities [1] : 0.25 [2] : 0.5 [3] : 0.25 Estimated information content of marker Heterozygosity 0.625 PIC score 0.5546875 Final total (weighted) ibd assignments (0-ibd : 1-ibd : 2-ibd) 1.000000 : 9.333333 : 5.666667 Estimated ibd sharing probabilities (0-ibd : 1-ibd : 2-ibd) 0.075000 : 0.500000 : 0.425000 Log-likelihood ratio chi-squared, -2*(log likelihood ratio) 3.605841 (P = 0.044933) (Equivalent to a lod score of 0.782998)
Here the likelihood is maximised over both the allele frequencies and IBD allele-sharing probabilities simultaneously. The MLE allele frequencies are still 0.25, 0.5 and 0.25, but the IBD sharing probabilities are estimated to be 0.075, 0.5 and 0.425. Thus the dataset has the highest likelihood if the actual probability of two affected sibs sharing both alleles IBD is 0.425 rather than 0.25. The log likelihood under this hypothesis is -20.38, whereas that under the null hypothesis was -22.18. Taking twice the difference produces a chi-squared statistic. When the IBD probabilities are restricted to the "possible triangle" the distribution of this statistic is a complex mixture of chi-squared statistics with 0, 1 and 2 degrees of freedom.
Exercises in genetic linkage analysis
All material copyright (C) Dave Curtis 1996-2000 dcurtis@hgmp.mrc.ac. uk