Amy L. Bauer, William S. Hlavacek, Pat J. Unkefer and Fangping Mu (Submitted) Using Sequence-specific Chemical and Structural Properties of DNA to Predict Transcription Factor Binding Sites.

Online materials

Computational results for the 54 TFs with five or more known binding sites documented in RegulonDB. The cutoff values for the BvH, Match, and MATRIX SEARCH methods are defined as the mean score of the N positive training examples. For the QPMEME method, cutoff values are set to -1, and for the SiteSleuth method, cutoff values are set to 0. For MATRIX SEARCH and QPMEME, the background frequencies are defined as: A: 0.25007276777571225, T: 0.2507150646759991, G: 0.23820269415412798 and C: 0.26100947339416075. The cross-validation score V is given in parentheses and is defined as the fraction of positive examples predicted to be true binding sites. In 3-fold cross-validation, the available positive examples are divided into non-overlapping training and testing sets as described in the main text. Models are built from the training set and tested on the remaining positive examples. Each model (derived through any of the five methods considered here) is built with cutoff values set as defined above.
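The mean-score cutoff and the score V described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function names, the interleaved fold assignment, the pooling of hits across folds, and the rule that a sequence is predicted as a site when its score is greater than or equal to the cutoff are all hypothetical choices made here. The actual partitioning is the one described in the main text. The sketch covers the mean-cutoff methods (BvH, Match, MATRIX SEARCH) only; QPMEME and SiteSleuth use the fixed cutoffs given above.

```python
from statistics import mean


def k_fold_splits(examples, n_folds=3):
    """Yield (train, test) pairs whose members never overlap.

    Assumption: folds are assigned by interleaving; the real split is
    defined in the paper's main text.
    """
    folds = [examples[i::n_folds] for i in range(n_folds)]
    for k, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != k for x in fold]
        yield train, test


def mean_cutoff(training_scores):
    """Cutoff = mean score of the positive training examples."""
    return mean(training_scores)


def cross_validation_score(examples, build_scorer, n_folds=3):
    """Fraction of held-out positive examples scoring at or above the cutoff.

    build_scorer is a hypothetical callable: it takes the training
    examples and returns a function that scores a single example.
    Assumption: hits are pooled over all folds before dividing.
    """
    hits = 0
    for train, test in k_fold_splits(examples, n_folds):
        score = build_scorer(train)
        cutoff = mean_cutoff([score(x) for x in train])
        hits += sum(1 for x in test if score(x) >= cutoff)
    return hits / len(examples)
```

Passing a different `build_scorer` is how one would plug in an actual weight-matrix model; the cutoff and V logic stay the same.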

TF Name  Training Set Size  BvH             Match           MATRIX SEARCH   QPMEME          SiteSleuth
AgaR     11                 465 (0.09)      2236 (0.09)     360 (0)         1884 (0.05)     167 (0.27)
AraC     20                 33412 (0.25)    52734 (0.2)     33390 (0.35)    102457 (0.18)   2442 (0.3)
ArcA     91                 73895 (0.23)    76283 (0.29)    72860 (0.23)    132264 (0.31)   25229 (0.27)
ArgR     24                 671 (0.5)       1686 (0.5)      652 (0.46)      18912 (0.4)     2171 (0.58)
CpxR     33                 33101 (0.33)    55743 (0.39)    33221 (0.33)    368930 (0.41)   13964 (0.39)
CRP      260                7639 (0.63)     11030 (0.64)    7580 (0.63)     1069283 (0.8)   10189 (0.67)
CysB     8                  278 (0)         1551 (0)        146 (0)         729 (0)         33 (0)
CytR     14                 202 (0.29)      1505 (0.21)     175 (0.43)      2599 (0.16)     133 (0.43)
DeoR     7                  6957 (0.14)     16652 (0)       6191 (0)        20923 (0.07)    467 (0.14)
DgsA     8                  17 (0)          224 (0.25)      13 (0)          84 (0.1)        142 (0.38)
DnaA     10                 48972 (0)       108426 (0.1)    47570 (0.1)     76486 (0.19)    2724 (0.6)
FadR     10                 4238 (0.3)      14726 (0.3)     3920 (0.3)      33838 (0.1)     754 (0.2)
Fis      133                202096 (0.44)   262224 (0.47)   199966 (0.43)   1506632 (0.61)  129150 (0.33)
FlhDC    20                 162411 (0.25)   234753 (0)      163533 (0.2)    492630 (0.09)   5688 (0.15)
FNR      85                 4900 (0.51)     8543 (0.56)     4928 (0.51)     340882 (0.67)   2463 (0.53)
FruR     13                 21 (0.23)       97 (0.38)       21 (0.15)       263 (0.22)      661 (0.69)
Fur      54                 25843 (0.54)    37387 (0.54)    25838 (0.48)    275020 (0.54)   24684 (0.59)
GadE     5                  54 (0.2)        368 (0)         29 (0)          255 (0.16)      12 (0.4)
GalR     10                 244 (0.4)       996 (0.4)       206 (0.4)       3769 (0.25)     198 (0.7)
GalS     9                  149 (0.22)      528 (0.56)      126 (0.22)      2591 (0.2)      395 (0.67)
GcvA     5                  14 (0)          141 (0)         11 (0)          49 (0)          22 (0)
GlpR     23                 10195 (0.09)    27198 (0.22)    10063 (0.04)    31268 (0.17)    3661 (0.3)
GntR     17                 2358 (0.24)     7342 (0.29)     2317 (0.24)     23530 (0.27)    1455 (0.59)
H-NS     34                 218778 (0.32)   303046 (0.24)   218934 (0.18)   597416 (0.28)   31154 (0.06)
IclR     10                 232538 (0)      319893 (0.1)    235576 (0.1)    294673 (0.1)    11357 (0.1)
IHF      87                 167137 (0.41)   316148 (0.47)   167122 (0.41)   1050684 (0.63)  16056 (0.18)
IscR     8                  7 (0)           50 (0)          5 (0)           30 (0)          104 (0.25)
LexA     24                 259 (0.54)      1745 (0.67)     254 (0.46)      30290 (0.35)    1578 (0.71)
Lrp      84                 541474 (0.31)   625587 (0.3)    539556 (0.3)    1319479 (0.56)  3196 (0.3)
MalT     20                 35743 (0.2)     55431 (0.3)     35261 (0.2)     125337 (0.34)   6468 (0.55)
MarA     16                 39156 (0.13)    85183 (0.13)    38597 (0.06)    137897 (0.06)   3808 (0.06)
MelR     8                  146 (0)         680 (0.25)      129 (0.25)      1114 (0.14)     30 (0.5)
MetJ     27                 634694 (0.26)   1137063 (0.37)  621589 (0.41)   1449353 (0.31)  27334 (0.33)
MetR     6                  174 (0)         800 (0)         114 (0)         778 (0.07)      4 (0.33)
ModE     8                  95 (0)          765 (0.25)      73 (0)          724 (0.14)      29 (0.5)
Nac      10                 22752 (0)       48020 (0)       21363 (0.1)     53086 (0.08)    2382 (0.1)
NagC     14                 27 (0.14)       110 (0.21)      24 (0)          100 (0.17)      843 (0.5)
NanR     6                  973 (0.5)       973 (0.5)       973 (0.67)      15678 (0.18)    664 (0.83)
NarL     91                 426754 (0.37)   472550 (0.46)   431808 (0.46)   1682666 (0.72)  92451 (0.65)
NarP     16                 115566 (0.19)   129839 (0.19)   115658 (0.06)   216367 (0.33)   2823 (0.44)
NtrC     22                 1248 (0.41)     6264 (0.5)      1200 (0.41)     14721 (0.36)    2843 (0.73)
OmpR     20                 11612 (0.15)    18694 (0.1)     11637 (0.15)    34207 (0.16)    3532 (0.3)
OxyR     9                  1029 (0)        3870 (0)        768 (0)         2397 (0)        67 (0)
PhoB     14                 2579 (0.29)     9954 (0.21)     2537 (0.21)     35604 (0.13)    1211 (0.36)
PhoP     22                 470 (0.5)       1778 (0.5)      437 (0.41)      6042 (0.31)     2781 (0.55)
PspF     5                  1280 (0)        4727 (0)        1116 (0)        2671 (0)        122 (0)
PurR     18                 20483 (0.17)    81537 (0.39)    19980 (0.11)    118837 (0.23)   3956 (0.39)
RcsAB    5                  442 (0)         1011 (0)        382 (0)         910 (0)         230 (0)
Rob      6                  4712 (0)        17587 (0)       4019 (0)        9418 (0)        177 (0)
SoxS     18                 170007 (0)      403160 (0.06)   168981 (0.06)   265797 (0.02)   5940 (0)
TorR     8                  2060 (0.13)     6429 (0.5)      2027 (0)        9394 (0.28)     1520 (0.63)
TrpR     10                 10 (0.2)        16 (0.2)        9 (0.1)         68 (0.19)       78 (0.7)
TyrR     19                 15695 (0.11)    40645 (0.16)    15431 (0)       54411 (0.19)    1551 (0.32)
UxuR     5                  5 (0)           7 (0)           5 (0)           8 (0.1)         84 (0.6)


Computational results for 44 TFs documented in DPInteract. Here, the cutoff values for the BvH, Match, and MATRIX SEARCH methods were each set to the score of the lowest-scoring sequence in the training set from which a model for a TF binding site was built. This approach, which guarantees that the positive examples used in training are correctly classified, differs from the mean-score cutoff used for the RegulonDB results above. For the QPMEME method, cutoff values are set to -1, and for the SiteSleuth method, cutoff values are set to 0. For QPMEME, the background frequencies are defined as in [6], i.e., Djordjevic et al. (2003) Genome Res.: A: 0.2844, T: 0.2156, G: 0.2157 and C: 0.2843. The cross-validation score V is given in parentheses. In cross-validation, the available positive examples are divided into non-overlapping training and testing sets as described in the main text. Recall that each model (derived through any of the five methods considered here) is built to ensure that the binding sites in the training set are classified correctly; however, the testing examples withheld from training may not all be predicted correctly by a method.
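The effect of the lowest-score cutoff can be illustrated with a short Python sketch. It is a toy example under stated assumptions, not the authors' implementation: the function names and the numeric scores are made up for illustration, and a sequence is assumed to be predicted as a binding site when its score is greater than or equal to the cutoff.

```python
def min_cutoff(training_scores):
    """Cutoff = score of the lowest-scoring training sequence."""
    return min(training_scores)


def predict_sites(scores, cutoff):
    """Return True for each score at or above the cutoff (assumed rule)."""
    return [s >= cutoff for s in scores]


# Hypothetical scores, chosen only to illustrate the behavior.
training_scores = [7.2, 5.1, 6.4]
held_out_scores = [4.9, 5.1, 8.0]

cutoff = min_cutoff(training_scores)

# Every training example passes by construction.
training_calls = predict_sites(training_scores, cutoff)

# Held-out examples carry no such guarantee; V is the fraction that pass.
held_out_calls = predict_sites(held_out_scores, cutoff)
v = sum(held_out_calls) / len(held_out_calls)
```

With a mean-score cutoff, by contrast, any training example scoring below the mean would be rejected, which is why the two cutoff choices are not directly comparable.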
DPInteract Results