Chapter 20. Cochran-Armitage Test for Trend

Cochran-Armitage Algorithm
The CATT is applied when the data takes the form of a 2 × k contingency table. The
two rows represent the outcome of an experiment, and the k columns represent a
variable number (k) of experiments. For example, if k = 3, then the
contingency table will be as shown in Table 20-1.
Table 20-1. 2 × 3 contingency table
Group   B = 1   B = 2   B = 3
A = 1   N11     N12     N13
A = 2   N21     N22     N23

This contingency table can be completed with the marginal totals of the two variables,
as shown in Table 20-2.
Table 20-2. 2 × 3 contingency table with marginal totals
Group   B = 1   B = 2   B = 3   Sum
A = 1   N11     N12     N13     R1
A = 2   N21     N22     N23     R2
Sum     C1      C2      C3      N

Where:
• R1 = N11 + N12 + N13
• R2 = N21 + N22 + N23
• C1 = N11 + N21
• C2 = N12 + N22
• C3 = N13 + N23
• N = R1 + R2 = C1 + C2 + C3 = N11 + N12 + N13 + N21 + N22 + N23
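The marginal totals above can be sketched in a few lines of Java. This is only an illustration: the class and method names (MarginalTotals, rowTotals, columnTotals, grandTotal) are hypothetical and are not part of the chapter's CochranArmitage class, and the counts in main() are made up for demonstration.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the chapter's CochranArmitage class.
class MarginalTotals {

    // Row totals (R1, R2) of a 2 x k table
    static int[] rowTotals(int[][] table) {
        int[] rows = new int[table.length];
        for (int i = 0; i < table.length; i++) {
            for (int j = 0; j < table[i].length; j++) {
                rows[i] += table[i][j];
            }
        }
        return rows;
    }

    // Column totals (C1, ..., Ck) of a 2 x k table
    static int[] columnTotals(int[][] table) {
        int[] cols = new int[table[0].length];
        for (int[] row : table) {
            for (int j = 0; j < row.length; j++) {
                cols[j] += row[j];
            }
        }
        return cols;
    }

    // Grand total N = R1 + R2
    static int grandTotal(int[][] table) {
        return Arrays.stream(rowTotals(table)).sum();
    }

    public static void main(String[] args) {
        int[][] table = { {2, 2, 2}, {1, 1, 3} }; // illustrative counts only
        System.out.println("R = " + Arrays.toString(rowTotals(table)));
        System.out.println("C = " + Arrays.toString(columnTotals(table)));
        System.out.println("N = " + grandTotal(table));
    }
}
```

Summing the row totals and summing the column totals must give the same N, which is a cheap sanity check on any table built this way.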
The trend test statistic is:
\[ T \equiv \sum_{i=1}^{k} w_i \left( N_{1i} R_2 - N_{2i} R_1 \right) \]

where w_i is the weight assigned to column i. In using the CATT for alleles of germline
data, we can apply three different tests based on the value of the weights:

• weight = {0, 1, 2}: for additive
• weight = {1, 1, 0}: for dominant
• weight = {0, 1, 1}: for recessive
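To make the role of the weights concrete, the following sketch evaluates the trend statistic T for one table under each of the three weight vectors. The class name TrendStatistic and its constants are hypothetical (they do not appear in the chapter's code), and the counts in main() are illustrative only.

```java
// Hypothetical illustration of T = sum_i w_i * (N1i * R2 - N2i * R1).
class TrendStatistic {

    static final int[] ADDITIVE  = {0, 1, 2};
    static final int[] DOMINANT  = {1, 1, 0};
    static final int[] RECESSIVE = {0, 1, 1};

    // table is 2 x k; weights has length k
    static long trend(int[][] table, int[] weights) {
        long r1 = 0;
        long r2 = 0;
        for (int j = 0; j < weights.length; j++) {
            r1 += table[0][j];
            r2 += table[1][j];
        }
        long t = 0;
        for (int j = 0; j < weights.length; j++) {
            // long arithmetic avoids int overflow for large counts
            t += weights[j] * (table[0][j] * r2 - table[1][j] * r1);
        }
        return t;
    }

    public static void main(String[] args) {
        int[][] table = { {2, 2, 2}, {1, 1, 3} }; // illustrative counts only
        System.out.println("additive:  " + trend(table, ADDITIVE));
        System.out.println("dominant:  " + trend(table, DOMINANT));
        System.out.println("recessive: " + trend(table, RECESSIVE));
    }
}
```

The same table yields a different T for each weight vector, which is why the choice of weights has to match the genetic model being tested.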
The hypothesis of no association (known as the null hypothesis) can be expressed as:
Pr(A = 1|B = 1) = ... = Pr(A = 1|B = k)
Assuming that the null hypothesis holds, then using iterated expectation we can
write:
E(T) = E(E(T |R1, R2)) = E(0) = 0
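The inner expectation vanishes term by term. A short derivation, under the added assumption (not stated in the text) that the column totals C_i are treated as fixed: under the null hypothesis every column has the same probability of A = 1, so given the row totals each cell is expected to receive its column's share of the row total.

```latex
\begin{aligned}
E(N_{1i} \mid R_1, R_2) &= \frac{R_1 C_i}{N}, \qquad
E(N_{2i} \mid R_1, R_2) = \frac{R_2 C_i}{N}, \\
E(T \mid R_1, R_2) &= \sum_{i=1}^{k} w_i
  \left( \frac{R_1 C_i}{N}\, R_2 - \frac{R_2 C_i}{N}\, R_1 \right) = 0 .
\end{aligned}
```

Each bracketed term is zero regardless of the weights w_i, which is why the result holds for the additive, dominant, and recessive tests alike.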
Given two discrete random variables X and Y, we can define the conditional expectation as:
\[ E(X \mid Y = y) = \sum_{x} x \cdot P(X = x \mid Y = y) \]

Now, using all these definitions and formulas, we are ready to write our Cochran-Armitage
algorithm in Java (Example 20-1). One major goal of the CATT is to compute the p-value
(a probability value between 0.00 and 1.00). Using the algorithm defined in Wikipedia,
we implement the CATT as a POJO class, CochranArmitage.
This Java class will be used in our MapReduce solution.
Example 20-1. Cochran-Armitage algorithm
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

import org.apache.log4j.Logger;
import org.apache.commons.math3.distribution.NormalDistribution;

/**
 * Class that calculates the Cochran-Armitage test for trend
 * on a 2x3 contingency table. Used to estimate association
 * in additive genetic models of genotype data.
 */
public class CochranArmitage {

    private static final Logger THE_LOGGER =
        Logger.getLogger(CochranArmitage.class);

    // use weights corresponding to additive/codominant model
    private static final int[] WEIGHTS = { 0, 1, 2 };

    // dimensions of passed contingency table - must be 2 rows x 3 columns
    private static final int NUMBER_OF_ROWS = 2;
    private static final int NUMBER_OF_COLUMNS = 3;

    // variables to hold variance, raw statistic, standardized statistic,
    // and p-value
    private double stat = 0.0;
    private double standardStatistics = 0.0;
    private double variance = 0.0;
    private double pValue = -1.0; // range is 0.0 to 1.0 (-1.0 means undefined)

    // NormalDistribution class from Apache used to calculate p-values
    private static NormalDistribution normDist = new NormalDistribution();

    /**
     * Get the variance
     */
    public double getVariance() {
        return variance;
    }

    /**
     * Get the Stat
     */
    public double getStat() {
        return stat;
    }

    /**
     * Get the StandardStatistics
     */
    public double getStandardStatistics() {
        return standardStatistics;
    }

    /**
     * Get the p-value
     */
    public double getpValue() {
        return pValue;
    }

    /**
     * Computes the Cochran-Armitage test for trend for the passed
     * 2 row by 3 column contingency table
     * @param countTable = 2x3 contingency table.
     * @return the p-value of the Cochran-Armitage statistic of the passed table
     */
    public double callCochranArmitageTest(int[][] countTable) {
        // defined in Example 20-2
    }

    /**
     * @param args input/output files for testing/debugging
     * args[0] as input file
     * args[1] as output file
     */
    public static void main(String[] args) throws IOException {
        // defined in Example 20-3
    }
}

The callCochranArmitageTest() method, defined in Example 20-2, is the core of
the Cochran-Armitage algorithm.
Example 20-2. Cochran-Armitage algorithm: callCochranArmitageTest()
/**
 * Computes the Cochran-Armitage test for trend for the passed
 * 2 row by 3 column contingency table
 * @param countTable = 2x3 contingency table.
 * @return the p-value of the Cochran-Armitage statistic of the passed table
 */
public double callCochranArmitageTest(int[][] countTable) {
    if (countTable == null) {
        throw new IllegalArgumentException(
            "contingency table cannot be null/empty.");
    }
    if ( (countTable.length != NUMBER_OF_ROWS) ||
         (countTable[0].length != NUMBER_OF_COLUMNS) ) {
        throw new IllegalArgumentException(
            "contingency table must be 2 rows by 3 columns");
    }

    int totalSum=0;
    int[] rowSum = new int[NUMBER_OF_ROWS];
    int[] colSum = new int[NUMBER_OF_COLUMNS];

    // calculate marginal and overall sums for the contingency table
    for (int i=0; i<NUMBER_OF_ROWS; i++) {
        for (int j=0; j<NUMBER_OF_COLUMNS; j++) {
            rowSum[i] += countTable[i][j];
            colSum[j] += countTable[i][j];
            totalSum += countTable[i][j];
        }
    }

    // calculate the test statistic and variance based on the formulae at
    // http://en.wikipedia.org/wiki/Cochran-Armitage_test_for_trend
    stat = 0.0;
    variance = 0.0;
    for (int j=0; j<NUMBER_OF_COLUMNS; j++) {
        stat += WEIGHTS[j] *
            (countTable[0][j]*rowSum[1] - countTable[1][j]*rowSum[0]);
        variance += WEIGHTS[j]*WEIGHTS[j]*colSum[j]*(totalSum-colSum[j]);
        if (j!=NUMBER_OF_COLUMNS-1) {
            for (int k=j+1; k<NUMBER_OF_COLUMNS; k++) {
                variance -= 2*WEIGHTS[j]*WEIGHTS[k]*colSum[j]*colSum[k];
            }
        }
    }
    // scale by R1*R2/N (note: this product and quotient use int arithmetic,
    // so the quotient is truncated before it is applied)
    variance *= rowSum[0]*rowSum[1]/totalSum;

    // standardized statistic is stat divided by SD
    standardStatistics = stat/Math.sqrt(variance);

    // use Apache Commons normal distribution to calculate two-tailed p-value
    pValue = 2*normDist.cumulativeProbability(-Math.abs(standardStatistics));

    // return the p-value
    return pValue;
}
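In the listing, the factor rowSum[0]*rowSum[1]/totalSum is evaluated with int arithmetic, so the quotient is truncated (and the products could overflow for very large counts). The following is a minimal sketch of the same standardized statistic computed entirely in double arithmetic. It is not the book's implementation: the class name TrendZScore is hypothetical, it omits the p-value (and therefore the Apache Commons dependency), and its result may differ slightly, typically in the fourth decimal place, from what the int-based listing produces.

```java
// Hypothetical double-precision variant of the standardized statistic.
class TrendZScore {

    // additive model weights, as in the listing
    static final int[] WEIGHTS = {0, 1, 2};

    // t must be a 2 x 3 table of counts
    static double zScore(int[][] t) {
        int k = WEIGHTS.length;
        double r1 = 0.0;
        double r2 = 0.0;
        double[] c = new double[k];
        for (int j = 0; j < k; j++) {
            r1 += t[0][j];
            r2 += t[1][j];
            c[j] = t[0][j] + t[1][j];
        }
        double n = r1 + r2;

        double stat = 0.0;
        double inner = 0.0;
        for (int j = 0; j < k; j++) {
            stat += WEIGHTS[j] * (t[0][j] * r2 - t[1][j] * r1);
            inner += WEIGHTS[j] * WEIGHTS[j] * c[j] * (n - c[j]);
            for (int m = j + 1; m < k; m++) {
                inner -= 2.0 * WEIGHTS[j] * WEIGHTS[m] * c[j] * c[m];
            }
        }
        // no truncation here: R1*R2/N is a floating-point quotient
        double variance = (r1 * r2 / n) * inner;
        return stat / Math.sqrt(variance);
    }

    public static void main(String[] args) {
        int[][] table = { {1386, 1565, 401}, {1342, 1579, 434} };
        System.out.println(zScore(table));
    }
}
```

Swapping the two rows of the table negates the statistic, which leaves the two-tailed p-value unchanged.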

The program shown in Example 20-3 tests the Cochran-Armitage algorithm.
Example 20-3. Cochran-Armitage algorithm: main()
/**
 * @param args input/output files for testing/debugging
 * args[0] as input file
 * args[1] as output file
 */
public static void main(String[] args) throws IOException {
    if (args.length != 2) {
        THE_LOGGER.info("usage: java CochranArmitage " +
            "<input-file> <output-file>");
        throw new IOException("must provide input and output files for testing.");
    }

    long startTime = System.currentTimeMillis();
    String inputFileName = args[0];
    String outputFileName = args[1];
    BufferedWriter outfile = new BufferedWriter(new FileWriter(outputFileName));
    outfile.write("score\tp-value\n");
    BufferedReader infile = new BufferedReader(new FileReader(inputFileName));

    int[][] countTable = new int[2][3];
    String line = null;
    while (( line = infile.readLine()) != null) {
        // each input line holds six tab-separated counts
        String[] tokens = line.split("\t");
        int index=0;
        // populate 2x3 contingency table
        for(int i=0; i<2; i++) {
            for(int j=0; j<3; j++) {
                countTable[i][j] = Integer.parseInt(tokens[index++]);
            }
        }
        CochranArmitage catest = new CochranArmitage();
        double pValue = catest.callCochranArmitageTest(countTable);
        outfile.write(String.format("%f\t%f\n",
            catest.getStandardStatistics(), pValue));
    }

    long elapsedTime = System.currentTimeMillis() - startTime;
    THE_LOGGER.info("run time (in milliseconds): " + elapsedTime);
    infile.close();
    outfile.close();
}

The following is a sample run of the Cochran-Armitage algorithm. First, we compile
the algorithm and define the input:
$ javac CochranArmitage.java
$ cat test3.txt
1386	1565	401	1342	1579	434
2716	672	13	2689	695	9
2062	1144	151	2021	1184	173

Then we run the algorithm:
$ java CochranArmitage test3.txt test3.txt.out
[main] [INFO ] [CochranArmitage] - run time (in milliseconds): 9

The expected output looks like this:
$ cat test3.txt.out
score	p-value
-1.414843	0.157114
-0.488857	0.624943
-1.555344	0.119864

Application of Cochran-Armitage
In genome analysis, the CATT is applied as a statistical test for differences in genotype
frequency, in which each individual is coded as {0, 1, 2} based on the number of copies
of the particular variant allele in that individual. These counts can be used to assemble
a contingency table consisting of two rows and three columns (each row representing a
specific group and each column representing the outcome of an experiment, such as an
allele count) that can be analyzed by standard statistical methods. The
CATT is used to approximate an additive genetic model.
A real example in genomic analysis can be stated as follows: let Group A denote a
set of biosets (generated from VCF files) for a set of patients, and let Group B denote
another set of biosets (also generated from VCF files) for another set of patients. The
goal is to apply the CATT to a set of alleles found at a specific chromosome¹ with a
given start position and stop position (for details on chromosomes, visit the National
Institutes of Health website). This data can be truly huge. Each bioset generated from a
(human sample) VCF file may have over 4,000,000 records. If Group A has
3,000 samples and Group B has 5,000 samples, then to compare genotype frequency
we have to analyze 32 billion records (obviously, this is a big data problem!):
Group A records = 3,000 × 4,000,000 = 12,000,000,000
Group B records = 5,000 × 4,000,000 = 20,000,000,000
Total records = 12,000,000,000 + 20,000,000,000 = 32,000,000,000
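One practical consequence of this arithmetic is that the total does not fit in a 32-bit Java int (whose maximum is 2,147,483,647), so code that tallies records at this scale would need long arithmetic. A minimal sketch, with a hypothetical class and method name:

```java
// Hypothetical helper: record counts at this scale need long, not int.
class RecordCount {

    // multiplyExact/addExact throw ArithmeticException instead of silently overflowing
    static long totalRecords(long samplesA, long samplesB, long recordsPerSample) {
        long groupA = Math.multiplyExact(samplesA, recordsPerSample);
        long groupB = Math.multiplyExact(samplesB, recordsPerSample);
        return Math.addExact(groupA, groupB);
    }

    public static void main(String[] args) {
        long total = totalRecords(3_000L, 5_000L, 4_000_000L);
        System.out.println(total);
        System.out.println(total > Integer.MAX_VALUE);
    }
}
```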
To use the CATT for genotype frequency we form a 2 × 3 contingency table for each
allele found at the common key of a specific chromosome (identified by its start
position and end position). See Table 20-3.
Table 20-3. Contingency table for each allele
Group     Count of 0   Count of 1   Count of 2
Group A   N11          N12          N13
Group B   N21          N22          N23

I will demonstrate building a 2 × 3 contingency table for each allele through the
following example. Let Group A be a set of six biosets identified by {B1, B2, B3, B4, B5, B6}
(see Table 20-4) and let Group B be a set of five biosets identified by {B7, B8, B9, B10,
B11} (see Table 20-5). Note that this data is for a very specific chromosome at a
defined start and stop position.

1 Humans normally have 46 chromosomes in each cell, divided into 23 pairs. Two copies of chromosome 1, one copy inherited from each parent, form one of the pairs. (Source: http://ghr.nlm.nih.gov/chromosome/1.)

Table 20-4. Group A biosets
Bioset ID   Allele1   Allele2
B1          A         C
B2          A         A
B3          A         C
B4          G         G
B5          A         A
B6          AC        T

Table 20-5. Group B biosets
Bioset ID   Allele1   Allele2
B7          A         A
B8          C         C
B9          A         C
B10         A         A
B11         A         A

Before generating/building contingency tables, we need to build some data structures,
as outlined in Table 20-6.
Table 20-6. Genotype frequency
Bioset ID   Group     A count   C count   G count   T count   AC count
B1          Group A   1         1         0         0         0
B2          Group A   2         0         0         0         0
B3          Group A   1         1         0         0         0
B4          Group A   0         0         2         0         0
B5          Group A   2         0         0         0         0
B6          Group A   0         0         0         1         1
B7          Group B   2         0         0         0         0
B8          Group B   0         2         0         0         0
B9          Group B   1         1         0         0         0
B10         Group B   2         0         0         0         0
B11         Group B   2         0         0         0         0
Now, we can generate a contingency table for each allele (A, C, G, T, and AC; see
Tables 20-7 through 20-11), after which we may apply the CATT algorithm. In our
MapReduce algorithm, each reducer for (Key2, Value2) will generate a set of
contingency tables (where Key2 is a composite key of chromosomeID:start:stop).
Table 20-7. Contingency table for allele A
Group     Count of 0   Count of 1   Count of 2
Group A   2            2            2
Group B   1            1            3

Table 20-8. Contingency table for allele C
Group     Count of 0   Count of 1   Count of 2
Group A   4            2            0
Group B   3            1            1

Table 20-9. Contingency table for allele G
Group     Count of 0   Count of 1   Count of 2
Group A   5            0            1
Group B   5            0            0

Table 20-10. Contingency table for allele T
Group     Count of 0   Count of 1   Count of 2
Group A   5            1            0
Group B   5            0            0

Table 20-11. Contingency table for allele AC
Group     Count of 0   Count of 1   Count of 2
Group A   5            1            0
Group B   5            0            0
MapReduce Solution
This section will present a MapReduce algorithm for the CATT that can be implemented
by Hadoop and Spark. Our implementation is based on MapReduce/Hadoop.

Input
Since the same bioset can be selected for both groups of biosets (A and B), we will
generate two types of data. The only difference will be the GROUP-NAME: for Group A,
GROUP-NAME will be a, and for Group B, GROUP-NAME will be b, which will enable us to
distinguish one group from the other.

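Each bioset record will have the following format, in which the key fields
(chromosomeID, start, and stop) precede the semicolon, and the eight colon-separated
attributes that follow it begin with GROUP-NAME and the two alleles:

<chromosomeID><:><start><:><stop><;><GROUP-NAME><:><allele1><:><allele2><:>...
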
For example, if we select six biosets for Group A, then we will have:
7:10005296:10005296;a:A:C:A:snpid:mc:geneid:1000
7:10005296:10005296;a:A:A:A:snpid:mc:geneid:2000
7:10005296:10005296;a:A:C:C:snpid:mc:geneid:3000
7:10005296:10005296;a:G:G:G:snpid:mc:geneid:4000
7:10005296:10005296;a:A:A:A:snpid:mc:geneid:5000
7:10005296:10005296;a:AC:T:A:snpid:mc:geneid:6000

And if we select five biosets for Group B, then we will have:
7:10005296:10005296;b:A:A:A:snpid:mc:geneid:7000
7:10005296:10005296;b:C:C:C:snpid:mc:geneid:7100
7:10005296:10005296;b:A:C:C:snpid:mc:geneid:7200
7:10005296:10005296;b:A:A:A:snpid:mc:geneid:7300
7:10005296:10005296;b:A:A:A:snpid:mc:geneid:7400

Expected Output
Each result record (the p-value generated by the Cochran-Armitage test) will have the
following format:

<:><:><:><:><:><:><:><:><:><:><:><:><:><:>


Mapper
The mapper (see Example 20-4) will generate a key-value pair for each record, where
the key will be:

<chromosomeID><:><start><:><stop>

and the value will be the remaining attributes:

<GROUP-NAME><:><allele1><:><allele2><:>...
