Executables required: clump
In case-control analysis one seeks to detect whether certain marker alleles have different frequencies among cases and controls, suggesting that the marker may lie close to a disease gene or may itself affect susceptibility. Conventionally one can compare frequencies by applying a chi-squared test to a table of the allele counts observed in cases and controls. However, microsatellite markers frequently have many alleles, some of which are relatively rare. This means that with realistic sample sizes certain alleles will be observed on only a few occasions. If expected counts in a contingency table are less than 5-10 then the standard chi-squared statistic may be inaccurate. Standard approaches to avoid this problem include pooling rare alleles together, or testing each allele against the rest and then applying a Bonferroni correction for the number of alleles tested.
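As a rough illustration of the problem, the following Python sketch computes the expected counts and the ordinary Pearson chi-squared statistic for a 2 x N table and lists the columns whose expected counts fall below 5. It is not part of clump; the function names are chosen here for illustration, the threshold of 5 is simply the lower end of the range quoted above, and the counts are those of the worked example given later in this section.

```python
def expected_counts(table):
    """Expected cell counts conditional on the row and column totals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_squared(table):
    """Ordinary Pearson chi-squared statistic for a contingency table."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
        if e > 0  # skip columns that were never observed
    )


# Allele counts for alleles A-F (cases on the first row, controls on the second).
table = [
    [0, 42, 32, 1, 1, 29],
    [3, 67, 23, 1, 1, 15],
]

expected = expected_counts(table)
# Columns in which at least one expected count is below 5 (assumed threshold).
sparse_columns = [
    j for j in range(len(table[0]))
    if min(expected[0][j], expected[1][j]) < 5
]

print("chi-squared:", round(chi_squared(table), 3))
print("columns with small expected counts:", sparse_columns)
```

Half of the columns in this small table have expected counts well below 5, which is the situation in which the tabulated chi-squared distribution cannot be relied upon.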
The clump program is designed to assess the significance of the departure of observed values in a contingency table from the expected values conditional on the marginal totals. The present implementation works on 2 x N tables and was designed for use in genetic case-control association studies, but the program could be useful for any 2 x N contingency table, especially where N is large and the table is sparse. The significance is assessed using a Monte Carlo approach, by performing repeated simulations to generate tables having the same marginal totals as the one under consideration, and counting the number of times that a chi-squared value associated with the real table is achieved by the randomly simulated data. This means that the empirical significance levels assigned should be accurate (with precision dependent on the number of simulations performed) and that no special account needs to be taken of continuity corrections or small expected values.
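The Monte Carlo idea can be sketched in a few lines of Python. This is an illustration only and not clump's own code: it assumes that tables with the same marginal totals are generated by shuffling the column labels of all observations and dealing them out to the two rows, which is one standard way of doing so. The function names and the seed argument are hypothetical.

```python
import random


def chi_squared(table):
    """Ordinary Pearson chi-squared statistic for a 2 x N table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            if expected > 0:
                total += (observed - expected) ** 2 / expected
    return total


def simulate_table(table, rng):
    """Random 2 x N table with the same row and column totals as `table`."""
    n_first_row = sum(table[0])
    # One column label per observation, then shuffle and deal to the rows.
    labels = [j for j, col in enumerate(zip(*table)) for _ in range(sum(col))]
    rng.shuffle(labels)
    simulated = [[0] * len(table[0]) for _ in range(2)]
    for position, column in enumerate(labels):
        row = 0 if position < n_first_row else 1
        simulated[row][column] += 1
    return simulated


def monte_carlo(table, n_sims, seed):
    """Return the observed chi-squared and how often simulations reach it."""
    rng = random.Random(seed)
    observed = chi_squared(table)
    hits = sum(
        chi_squared(simulate_table(table, rng)) >= observed - 1e-9
        for _ in range(n_sims)
    )
    return observed, hits


table = [[0, 42, 32, 1, 1, 29], [3, 67, 23, 1, 1, 15]]
observed, hits = monte_carlo(table, n_sims=100, seed=3)
print(f"chi-squared {observed:.3f} reached in {hits} of 100 simulations")
```

The empirical significance is then estimated as hits divided by the number of simulations; the exact count will vary with the seed and the random number generator.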
An original feature of clump is a novel chi-squared value which it derives. This is produced by clumping columns together into a new two-by-two table in a way which is designed to maximise the chi-squared value. This is like testing a post hoc hypothesis: putting all the columns with the first value higher than expected into one group and all those with the second value higher into another and then looking at the difference between the groups. This directly tests the hypothesis that several alleles are commoner among the cases than among the controls.
The method of clumping the columns into two groups is slightly more complicated than described above. The procedure does begin by separating the columns with higher than expected values in the first row from those with lower values. However, this will not necessarily yield the maximum chi-squared value. What happens next is that each column in turn is moved into the opposite group to see if this increases the chi-squared value. If it does then the column is assigned to the new group, but if not it is put back into its original group. This process is repeated until no further moves can be found which increase the chi-squared value for the table. Again, there is no guarantee that this procedure will yield the absolute maximum chi-squared value possible, but it does represent a simple and intuitively appealing method for producing a value which seems at any rate likely to be close to the maximum.
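The clumping procedure just described can be sketched in Python as follows. This is a reading of the description above rather than clump's actual source code; the function names are hypothetical, and details such as the order in which columns are tried and the handling of ties are assumptions.

```python
def chi_squared_2x2(a, b, c, d):
    """Chi-squared (no continuity correction) for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a whole row or column is empty
    return n * (a * d - b * c) ** 2 / denominator


def clumped_chi_squared(table):
    """Greedy search for a two-group split of the columns giving a large chi-squared."""
    top, bottom = table
    n_top, n_bottom = sum(top), sum(bottom)
    n = n_top + n_bottom

    # Start with the columns whose first-row count exceeds its expectation.
    in_first_group = [t > (t + b) * n_top / n for t, b in zip(top, bottom)]

    def score(groups):
        a = sum(t for t, g in zip(top, groups) if g)
        c = sum(b for b, g in zip(bottom, groups) if g)
        return chi_squared_2x2(a, n_top - a, c, n_bottom - c)

    best = score(in_first_group)
    improved = True
    while improved:  # repeat until no single move helps
        improved = False
        for j in range(len(top)):
            in_first_group[j] = not in_first_group[j]      # try moving column j
            trial = score(in_first_group)
            if trial > best + 1e-12:
                best = trial                               # keep the move
                improved = True
            else:
                in_first_group[j] = not in_first_group[j]  # put it back
    return best, in_first_group


table = [[0, 42, 32, 1, 1, 29], [3, 67, 23, 1, 1, 15]]
best, groups = clumped_chi_squared(table)
print(round(best, 3), groups)
```

Because the grouping is chosen after looking at the data, the resulting value cannot be referred to the usual chi-squared distribution with one degree of freedom; its significance has to be assessed by applying the same procedure to each simulated table.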
The clump program also uses Monte Carlo methods to evaluate the significance of chi-squared values produced by more conventional methods of analysis. In all, chi-squared values are generated for four tables, and the significance of each of these is evaluated by counting how many times the observed value is reached by chance in the randomly simulated datasets. The four tables are as follows:
T1: the raw 2 x N table
T2: the original table with columns containing small numbers clumped together
T3: a 2 x 2 table obtained by comparing each column of the original table against the total of the other columns
T4: a 2 x 2 table obtained by clumping the columns of the original table so as to maximise the chi-squared value
Suppose that a certain marker has been typed in a sample of cases and controls and that the allele counts appear as follows:
              A    B    C    D    E    F
cases:        0   42   32    1    1   29
controls:     3   67   23    1    1   15
In a real situation we might expect that the genotypes of all subjects had been entered into a database and that a report file had then been constructed which would automatically total the number of times each allele occurred in the two groups.
To analyse these results using clump, create a new file named mar7clum.inp. Write the following lines in the file:
6
0 42 32 1 1 29
3 67 23 1 1 15
100 3
This input file would instruct clump to carry out 100 sets of simulations to assess the significance of the supplied 2-by-6 table. Save the file with the name mar7clum.inp.
To run clump, at the operating system prompt enter:
clump mar7clum.inp mar7clum.out
The output consists of the chi-squared values produced by each of the four procedures above together with the number of times such a value was reached by a simulated table. The proportion of times the chi-squared value produced by the real data is reached yields an estimate of the significance of the departure of the observed data from the expectation under the null hypothesis. The more simulations which are performed the more accurate this estimate will be. It would be possible to use binomial probabilities to calculate an upper confidence limit for the true significance based on this estimate of the significance, but a simpler approach is to perform a large enough number of simulations to give a reasonably accurate estimate of the significance. As a rough rule of thumb, one might perform as many simulations as are necessary for the real chi-squared value to be reached 20 or more times. One might begin by performing a set of 100 simulations. If the real chi-squared value were reached more than 15 or 20 times the results would clearly be non-significant. Otherwise one might go on to perform a set of 1000 or 2000 simulations, and if only a few of these reached the real chi-squared value one might go on to perform 10000 or more, until a satisfactorily accurate estimate of the true significance was achieved.
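For readers who do wish to attach an upper confidence limit to the estimated significance, the binomial calculation mentioned above might be sketched as follows. This is a generic one-sided exact (Clopper-Pearson) limit found by bisection; it is not something clump reports, and the function names and the default alpha of 0.05 are assumptions made for this illustration.

```python
from math import comb


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); intended for small k."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def upper_confidence_limit(hits, n_sims, alpha=0.05):
    """One-sided exact upper limit for the true significance level.

    Finds the largest p for which seeing `hits` or fewer successes in
    `n_sims` simulations still has probability at least `alpha`.
    """
    if hits >= n_sims:
        return 1.0
    low, high = hits / n_sims, 1.0
    for _ in range(60):  # bisection; the CDF decreases as p increases
        middle = (low + high) / 2
        if binomial_cdf(hits, n_sims, middle) > alpha:
            low = middle
        else:
            high = middle
    return high


for hits, n_sims in [(0, 100), (2, 100), (20, 2000)]:
    limit = upper_confidence_limit(hits, n_sims)
    print(f"{hits} of {n_sims}: estimate {hits / n_sims:.4f}, upper limit {limit:.4f}")
```

The width of the interval relative to the estimate shrinks as the number of hits grows, which is the reasoning behind the rule of thumb of continuing until the real chi-squared value has been reached 20 or more times.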
Use a text editor to examine the output contained in mar7clum.out. Depending on the random number generator used, you will see that the observed chi-squared statistics are achieved by chance between about 0 and 2 times out of 100. This suggests that the empirical significance of the results is around 0.01, but it is not a very accurate estimate.
To obtain a more accurate estimate of the significance, edit the mar7clum.inp input file and change the number of simulations performed from 100 to 1000, so that the file appears as follows:
6
0 42 32 1 1 29
3 67 23 1 1 15
1000 3
Then save the file and again enter:
clump mar7clum.inp mar7clum.out
Again, examine the output produced in mar7clum.out. Now a somewhat more accurate assessment of each of the chi-squared statistics is obtained. Increasing the number of simulations further would increase the accuracy, at the cost of increasing the time taken.
The clump program produces a significance value for each of the four statistics derived, assessing the strength of evidence for a deviation from the null hypothesis that the underlying population frequencies are the same for the top and bottom rows. In most circumstances the two most powerful statistics seem to be the normal chi-squared (T1) and the chi-squared for the clumped 2x2 table (T4). The other two statistics were less sensitive in detecting association with test data, although this might not be the case in every situation.
This section demonstrates the application of a Monte Carlo approach to analyse sparse contingency tables, such as are produced in case-control studies using multiallelic markers.
Exercises in genetic linkage analysis
Copyright (C) Dave Curtis and Gili Koochaki 1996-2000
david.curtis@qmul.ac.uk