Using genecounting to carry out multimarker association analysis

Example data files needed for this exercise

Executables required: gc, rungc, scanassoc

Approach implemented in genecounting

The genecounting program accepts multilocus genotypes from sets of unrelated subjects and estimates the most likely haplotype frequencies. It outputs the likelihood for the data assuming no linkage disequilibrium between markers (L0) and assuming linkage disequilibrium (L1).

In order to test whether a set of markers shows association with a disease, we can see whether or not the haplotype frequencies of the markers appear to be different in cases and controls. To do this, we estimate the most likely frequencies, assuming linkage disequilibrium between markers, in the cases, the controls and in the combined sample. If estimating the frequencies separately produces a significantly higher total likelihood than estimating them jointly this is evidence for heterogeneity in haplotype frequencies between cases and controls. This can be expressed as a likelihood ratio statistic with degrees of freedom equal to the total number of haplotypes whose frequencies are estimated.

This process is automated by the rungc program, which also allows selection of the markers to be used and which can carry out permutation tests which will provide more reliable measures of statistical significance.

Also available is the scanassoc program, which can automatically test all subsets of markers, either selecting sets of contiguous markers or all possible combinations of a given number of markers.

Carrying out an example analysis with rungc

The rungc program is a utility program which calls the gc executable to carry out the haplotype frequency estimations. Rungc requires a data file and a parameter specification file. Examine gcex.dat and gcex.par with a text editor.

Each line of the data file consists of a subject ID, affection status (0 or 1) and then a list of genotypes. The parameter file appears as follows:

28 1 << number of loci, case/control label
8 15 7 8 18 7 12 19 6 5 15 14 2 2 6 14 2 2 2 2 2 7 11 26 10 9 9 9 << numbers of alleles
0 0  0 0 0  0 0  0  0 0 0  0  0 0 0 0  0 0 0 0 0 0 0  0  0  0 0 0 << marker selection                   
0 999 << no permutation, would use 999 permutations
42 << random number

The first line indicates that there are 28 markers and that a case/control label is included in the data file (other genecounting analyses might not use case-control data). The second line shows the number of alleles at each marker. The third line is to choose which markers to use in the analysis. The fourth determines whether or not to use permutation testing and if so how many permutations to perform and the fifth provides a random number seed for permutation testing.

In order to carry out a test for association with the first two markers, change the first two 0s on the third line to be 1s and then write the file with the new name of gcex12.par. The file would appear as follows:

28 1 << number of loci, case/control label
8 15 7 8 18 7 12 19 6 5 15 14 2 2 6 14 2 2 2 2 2 7 11 26 10 9 9 9 << numbers of alleles
1 1  0 0 0  0 0  0  0 0 0  0  0 0 0 0  0 0 0 0 0 0 0  0  0  0 0 0 << marker selection                   
0 999 << no permutation, would use 999 permutations
42 << random number

Now, at the operating system prompt enter:

rungc gcex12.par gcex.dat gcex12.lrt

Examine the output file, gcex12.lrt with a text editor. The first part is as follows:

Results from likelihood ratio test for heterogeneity

Log-likelihoods assuming LD in cases, controls and combined sample:
cases:    -1794.55 (number of haplotypes with non-zero frequency = 51)
controls: -2154.43 (number of haplotypes with non-zero frequency = 51)
combined: -3971.42 (number of haplotypes with non-zero frequency = 59)

Number of degrees of freedom, based on number of haplotypes found in one or other sample or both:
df = 61

LRT statistic = -2*(-3971.42-(-1794.55+-2154.43)) = 44.88
p value = 0.939495

Estimated numbers of informative haplotypes in
Cases    = 622.92
Controls = 752.12


Contributions of individual haplotypes to LRT statistics
LLCom is for the combined sample based on directly estimated frequencies while
LLMean is for the whole sample based on weighted mean frequencies
LRTCom uses estimated frequencies, LRTMean uses weighted means

Estimated frequencies      Log likelihood contribution         LRT contribution   Haplotype
Controls Cases    Combined LLCont   LLCase   LLCom    LLMean   LRTCom   LRTMean   GC-ID Alleles 

0.240685 0.235505 0.236387   -257.83  -212.13  -468.91  -469.97    -2.10     0.01   1   1  1 
0.049560 0.061376 0.054736   -112.00  -106.70  -218.71  -219.12     0.03     0.86   2   1  2 
0.034113 0.055016 0.044038    -86.67   -99.39  -189.14  -187.76     6.15     3.39   3   1  3 
0.008639 0.006952 0.008279    -30.87   -21.52   -54.59   -52.45     4.40     0.12   4   1  4 
0.002724 0.000000 0.001612    -12.10     0.00   -14.25   -13.33     4.31     2.47   5   1  5 
0.027075 0.032733 0.028959    -73.50   -69.72  -141.07  -143.40    -4.30     0.36   6   1  6 
0.030564 0.027484 0.028913    -80.18   -61.53  -140.91  -141.77    -1.61     0.11   7   1  7 
0.004371 0.010104 0.007717    -17.86   -28.92   -51.63   -47.59     9.70     1.61   8   1  8 
0.006847 0.005153 0.005142    -25.67   -16.91   -37.27   -42.66   -10.61     0.16   9   1  9 
0.020788 0.018951 0.021657    -60.56   -46.82  -114.15  -107.41    13.55     0.06  10   1 10 
0.008573 0.001831 0.005701    -30.69    -7.19   -40.52   -39.46     5.28     3.16  12   1 12 
0.002976 0.002781 0.003307    -13.02   -10.19   -25.98   -23.21     5.54     0.00  13   1 13 
0.001159 0.000000 0.000632     -5.89     0.00    -6.40    -6.42     1.02     1.05  15   1 15 
0.021137 0.025317 0.023183    -61.31   -57.98  -120.02  -119.42     1.47     0.26  16   2  1 
0.000004 0.000003 0.000001     -0.03    -0.03    -0.02    -0.06    -0.08     0.00  17   2  2 
0.003623 0.003059 0.003319    -15.31   -11.03   -26.06   -26.36    -0.58     0.03  18   2  3 
0.001351 0.002784 0.002020     -6.71   -10.20   -17.24   -17.09     0.64     0.35  19   2  4 
0.000000 0.001392 0.000633      0.00    -5.70    -6.41    -6.39     1.41     1.37  20   2  5 
0.077209 0.074639 0.075808   -148.73  -120.66  -268.95  -269.40    -0.88     0.02  21   2  6 
0.080210 0.073719 0.077590   -152.21  -119.74  -272.80  -272.04     1.68     0.18  22   2  7 
0.003228 0.007887 0.005163    -13.93   -23.79   -37.39   -38.41    -0.64     1.39  23   2  8 
0.005489 0.008739 0.008337    -21.49   -25.80   -54.89   -47.55    15.20     0.51  24   2  9 
0.000000 0.001527 0.000745      0.00    -6.17    -7.38    -6.92     2.42     1.51  25   2 10 
0.010746 0.010889 0.009621    -36.64   -30.66   -61.45   -67.30   -11.70    -0.00  27   2 12 
0.003413 0.000000 0.001550    -14.58     0.00   -13.79   -16.13    -1.58     3.10  28   2 13 
0.001160 0.000000 0.000633     -5.90     0.00    -6.41    -6.42     1.03     1.05  29   2 14 
0.083359 0.082217 0.083516   -155.78  -127.95  -285.18  -283.73     2.89    -0.00  31   3  1 
0.008430 0.003917 0.006562    -30.28   -13.52   -45.37   -44.37     3.12     1.14  32   3  2 
0.026745 0.020689 0.022581    -72.85   -49.98  -117.72  -123.09   -10.21     0.52  33   3  3 
0.001610 0.000000 0.000447     -7.79     0.00    -4.74    -8.52    -6.09     1.46  34   3  4 
0.000002 0.000010 0.001060     -0.02    -0.07    -9.99    -0.09    19.79     0.00  35   3  5 
0.006307 0.000083 0.004426    -24.03    -0.49   -32.99   -27.14    16.95     5.23  36   3  6 
0.055165 0.046545 0.051202   -120.22   -88.93  -209.29  -209.40     0.28     0.49  37   3  7 
0.004573 0.006959 0.005427    -18.53   -21.54   -38.94   -40.24    -2.26     0.34  38   3  8 
0.020192 0.017607 0.018997    -59.27   -44.30  -103.56  -103.63    -0.03     0.12  39   3  9 
0.000000 0.003029 0.000002      0.00   -10.94    -0.03   -12.44   -21.83     2.99  40   3 10 
0.001743 0.001639 0.002467     -8.33    -6.55   -20.38   -14.88    11.00     0.00  42   3 12 
0.001736 0.000000 0.000831     -8.30     0.00    -8.11    -9.09    -0.38     1.58  43   3 13 
0.000000 0.000110 0.000002      0.00    -0.63    -0.04    -0.68    -1.16     0.11  46   4  1 
0.001248 0.001563 0.001402     -6.28    -6.29   -12.67   -12.58     0.20     0.02  48   4  3 
0.010185 0.011195 0.010766    -35.14   -31.33   -67.10   -66.48     1.27     0.03  51   4  6 
0.001364 0.001943 0.001570     -6.77    -7.56   -13.94   -14.36    -0.77     0.07  52   4  7 
0.000000 0.001361 0.000000      0.00    -5.60     0.00    -6.27   -11.19     1.34  54   4  9 
0.000000 0.002945 0.001300      0.00   -10.69   -11.88   -12.14     2.38     2.90  55   4 10 
0.000964 0.000118 0.001209     -5.03    -0.67   -11.17    -5.95    10.94     0.50  57   4 12 
0.001476 0.004563 0.002873     -7.24   -15.32   -23.13   -23.13     1.14     1.15  61   5  1 
0.020211 0.016042 0.018944    -59.31   -41.30  -103.34  -100.77     5.48     0.32  63   5  3 
0.000000 0.001363 0.000009      0.00    -5.60    -0.14    -6.28   -10.92     1.34  65   5  5 
0.000000 0.000047 0.000002      0.00    -0.29    -0.04    -0.32    -0.51     0.05  66   5  6 
0.001145 0.001468 0.001308     -5.83    -5.96   -11.95   -11.81     0.31     0.03  67   5  7 
0.001251 0.000000 0.000650     -6.29     0.00    -6.56    -6.86     0.54     1.14  69   5  9 
0.089873 0.081684 0.087499   -162.86  -127.46  -293.17  -290.45     5.70     0.26  76   6  1 
0.000004 0.000000 0.000000     -0.03     0.00     0.00    -0.04    -0.07     0.00  77   6  2 
0.001867 0.000622 0.001669     -8.82    -2.86   -14.68   -11.90     5.99     0.44  78   6  3 
0.001890 0.000000 0.000481     -8.92     0.00    -5.06    -9.77    -7.72     1.72  80   6  5 
0.004317 0.007383 0.005576    -17.68   -22.57   -39.80   -40.53    -0.92     0.56  81   6  6 
0.000600 0.000000 0.000018     -3.35     0.00    -0.26    -3.62    -6.16     0.54  83   6  8 
0.000000 0.001823 0.000995      0.00    -7.16    -9.46    -8.06     4.60     1.80  84   6  9 
0.006952 0.011539 0.008625    -25.98   -32.07   -56.38   -58.45    -3.34     0.79  85   6 10 
0.000000 0.002858 0.001272      0.00   -10.43   -11.67   -11.84     2.48     2.82  86   6 11 
0.000000 0.000839 0.000000      0.00    -3.70     0.00    -4.12    -7.41     0.83  87   6 12 
0.001147 0.000000 0.000625     -5.84     0.00    -6.34    -6.36     1.00     1.04 107   8  2 


Totals:                     -2154.43 -1794.55 -3971.42 -3977.39    44.88    56.82

The analysis produces an LRT statistic of 44.9 with 61 degrees of freedom and is not significant. The number of degrees of freedom is based on the number of haplotypes which are estimated to actually occur in the data set rather than the number of haplotypes which could potentially be formed. The program also outputs the estimated haplotype frequencies for cases and controls.

Analysing multiple subsets of markers using scanassoc

The scanassoc program can automatically select multiple subsets of markers and run heterogeneity tests but provides less detailed output and does not perform permutation tests. In order to set up an analysis, load gcex.par into a text editor and maker the following changes:

The modified file should appear as follows:

28 1 << number of loci, case/control label
8 15 7 8 18 7 12 19 6 5 15 14 2 2 6 14 2 2 2 2 2 7 11 26 10 9 9 9 << numbers of alleles
0 0  0 0 0  0 0  0  0 0 0  0  0 0 0 0  0 0 1 1 1 1 1  1  1  1 1 1 << marker selection                   
2 1

The means that scanassoc will confine its attention to the last ten markers and will test all pairs of adjacent markers for association. (If the second line had read 2 2 instead of 2 1 then it would have tested all possible pairs.) Write the modified file to be called scan2.par.

To run the analyses, at the operating system prompt enter:

scanassoc scan2.par gcex.dat scan2.out

Examine the output file, scan2.out, a with text editor. It should appear as follows:

LRT       df  p value   markers

  19.06     3 0.000266  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 
   6.06     3 0.108730  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 
  30.50    13 0.003987  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 
  42.84    42 0.434971  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 
 108.42    87 0.059788  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 
  99.04    85 0.141597  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 
  67.12    50 0.053346  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 
  44.14    57 0.893521  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 
  52.28    54 0.540995  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 

This shows that the most signficant p value occurs with the first two markers used, which are markers 19 and 20 in the dataset. We would want to check that this p value was correct using permutation testing. In fact, there are only two degrees of freedom so the p value based on the asymptotic distribution for the LRT is likely to be fairly reliable in this case. However for marker combinations with large numbers of haplotypes and large numbers of degrees of freedom it is essential to check significance using permutation testing.

Carrying out permutation testing using rungc

We will carry out a permutation test to check that the association we have observed is really statistically significant. In a real world situation we would want to perform large numbers of permutations but in order to save time we will only carry out 99. This means that the lowest p value we will be able to obtain is 0.01. In order to carry out permutation tests to see whether markers 19 and 20 show association with disease, load gcex.par into a text editor and make the following modifications:

The modified file should appear as follows:

28 1 << number of loci, case/control label
8 15 7 8 18 7 12 19 6 5 15 14 2 2 6 14 2 2 2 2 2 7 11 26 10 9 9 9 << numbers of alleles
0 0  0 0 0  0 0  0  0 0 0  0  0 0 0 0  0 0 1 1 0 0 0  0  0  0 0 0 << marker selection                   
1 99 << no permutation, would use 999 permutations
42 << random number seed

Write the modified file to be called gcex1920.par.

To run the analysis, at the operating system prompt enter:

rungc gcex1920.par gcex.dat gcex1920.lrt

Examine the output file, gcex1920.lrt, a with text editor. It should appear as follows:

Results from likelihood ratio test for heterogeneity

Log-likelihoods assuming LD in cases, controls and combined sample:
cases:     -573.88 (number of haplotypes with non-zero frequency = 4)
controls:  -549.97 (number of haplotypes with non-zero frequency = 4)
combined: -1133.38 (number of haplotypes with non-zero frequency = 4)

Number of degrees of freedom, based on number of haplotypes found in one or other sample or both:
df = 3

LRT statistic = -2*(-1133.38-(-573.88+-549.97)) = 19.06
p value = 0.000265692

Estimated numbers of informative haplotypes in
Cases    = 464.94
Controls = 497.26


Contributions of individual haplotypes to LRT statistics
LLCom is for the combined sample based on directly estimated frequencies while
LLMean is for the whole sample based on weighted mean frequencies
LRTCom uses estimated frequencies, LRTMean uses weighted means

Estimated frequencies      Log likelihood contribution         LRT contribution   Haplotype
Controls Cases    Combined LLCont   LLCase   LLCom    LLMean   LRTCom   LRTMean   GC-ID Alleles 

0.031803 0.095578 0.062799    -54.53  -104.33  -167.21  -166.93    16.69    16.13   1   1  1 
0.477957 0.449684 0.464126   -175.45  -167.10  -342.71  -342.67     0.33     0.23   2   1  2 
0.349369 0.299117 0.324997   -182.69  -167.85  -351.39  -351.42     1.68     1.74   3   2  1 
0.140871 0.155621 0.148078   -137.29  -134.60  -272.08  -272.04     0.36     0.29   4   2  2 


Totals:                      -549.97  -573.88 -1133.38 -1133.05    19.06    18.40

Permutation testing:
Number of permutations = 99
Number of times real LRT statistic exceeded by permuted dataset = 0
Empirical p value = (0+1)/(99+1) = 0.01

The asymptotic p value is 0.00027, as reported in the output from scanassoc. The empirical p value is calculated as (r+1)/(N+1), where N is the number of permutations performed and r is the number of permuted datasets which by chance produce a higher for the LRT statistic than does the real dataset. In order to test whether the p value was really as low as 0.00027 one would want to do 9999 or more permutations. In fact, rungc incorporates a feature called "sequential Monte Carlo testing". This means that a target can be set for the number of permuted datasets to reach the value produced by the real dataset and if this target is reached then the simulation procedure can be terminated early.

Once one has established that there is association between a set of markers and a disease it may be of interest to consider which particular hapolotypes account for the association. The output lists the approximate contribution to the overall LRT statistic produced by each haplotype. It can be seen that first haplotype, with alleles 1 1, produces a much large contribution than the others. Its estimated frequency is 0.032 in controls and 0.096 in cases. The increased frequency of this haplotype in cases seems to largely account for the evidence of association with disease.

Once one has identified an associated haplotype, it may be of interest to know which individuals are carriers of that haplotype. This may not always be obvious, especially when multiple markers having multiple alleles are used. An additional feature of rungc can assist in the process of identifying individuals likely to carry high risk haplotypes. They might then be targetted for intensive attempts at mutation detection.

Using rungc to identify individuals with high-risk genotypes

An additional feature of rungc is that it will output the probable haplotype assignments for each subject so that one can identify subjects probably possessing haplotypes which appear to be associated with disease. This is done by specifying a fourth filename as an argument.

We do not wish to repeat the permutation testing, so edit gcex1920.par and change the 1 on the fourth line to a zero. Then write the file as haps1920.par. To run the analysis, at the operating system prompt enter:

rungc haps1920.par gcex.dat haps1920.lrt haps1920.hap

The first output file, haps1920.lrt, just contains the same results as previously but without the permutation test. Examine haps1920.hap with a text editor.

The format of the haplotype assignment file takes a little thought to understand. Each line consists of an ID code, as in the original data file, the subject's case-control status and the genotype as coded for the genecounting program followed by a list of possible haplotype combinations and their probabilities. The system for coding genotypes can be different between cases and controls, because it depends on what genotypes actually occur in each dataset. Likewise, for any given genotype the probabilities for the constituent haplotype pairs can differ between cases and controls because these probabilities are calculated under the assumption that the haplotype frequencies differ. The probability for each haplotype combination precedes the pair of haplotypes making up the combination. To see which allele sequences each haplotype refers to one can look at the haps1920.lrt file, which contains this information.

The first lines of haps1920.hap consist of subjects which only have one possible haplotype combination and hence this is shown as having a probability of 1.0000. The first two entries are cases who are homozygous for haplotype 1 (which haps1920.lrt tells us is made up of alleles 1 1). The next several lines consist of both cases and controls which must be heterozygous for haplotypes 1 and 2. Next are a number of subjects which are homozygous for haplotype 2 (alleles 1 2). Looking further down the file, we finally reach a couple of subjects who have ambiguous haplotypes, with IDs 091 and 322. Each of these could have one of these haplotype combinations: 1/1, 1/2 or 2/2. The relative probabilities for these combinations are different because one subject is a case and the other a control. In fact, these are two subjects who are homozygous for allele 1 at the first locus and who are untyped at the second locus. Next come from subjects who must have haplotypes 1 and 3 (alleles 1 1 and 2 1) followed by subjects who could either have haplotypes 1 and 4 (alleles 1 1 and 2 2) or haplotypes 2 and 3 (alleles 1 2 and 2 1). However the 2/3 combination is reckoned to have a much higher probability. Right down the bottom of the file are subjects who were untyped for either of these markers and for them no possible haplotype combinations are output.

In terms of identifying subjects who probably carry the "associated" haplotype, consisting of alleles 1 1, these results show that one could choose subjects who are homozygous for the 1 allele at one locus and possess at least 1 allele at the other. Of course, these subjects must have a 1 1 haplotype. Subjects not meeting this condition are unlikely to carry a 1 1 haplotype, even where this haplotype could possibly be present. Such analyses are generally of greater utility when applied to larger numbers of markers, but this example illustrates the principles involved.

Summary

This section demonstrates how genecounting can be used to detect association between a disease and a group of markers, showing how different subsets can be selected and how statistical significance can be measured using permutation tests.

Exercises in genetic linkage analysis

Copyright (C) Dave Curtis 2004-6

david.curtis@qmul.ac.uk