Know-GRRF: Domain-Knowledge Informed Biomarker Discovery with Random Forests

Xin Guan, Li Liu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Due to its robustness and built-in feature selection capability, random forest is frequently employed in omics studies for biomarker discovery and predictive modeling. However, random forest assumes equal importance of all features, while in reality domain knowledge may justify the prioritization of more relevant features. Furthermore, it has been shown that an antecedent feature selection step can improve the performance of random forest by reducing noise and the search space. In this paper, we present a novel Know-guided regularized random forest (Know-GRRF) method that incorporates domain knowledge in a random forest framework for feature selection. Via rigorous simulations, we show that Know-GRRF outperforms existing methods by correctly identifying informative features and improving the accuracy of subsequent predictive models. Know-GRRF is responsive to a wide range of tuning parameters that help to better differentiate candidate features. Know-GRRF is also stable from run to run, making it robust to noise. We further prove that Know-GRRF is a generalized form of the existing methods RRF and GRRF. We applied Know-GRRF to a real-world radiation biodosimetry study that uses non-human primate data to discover biomarkers for human applications. By using cross-species correlation as domain knowledge, Know-GRRF was able to identify three gene markers that significantly improved the cross-species prediction accuracy. We implemented Know-GRRF as an R package that is available through the CRAN archive.
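
The abstract does not give the Know-GRRF formula, so the following is only a minimal sketch of the regularization idea behind the two methods it names as special cases, RRF and GRRF, as they are commonly described: the split gain of a feature that has not yet been selected is scaled by a penalty coefficient, and in the guided variant that coefficient depends on a per-feature score. The function names, the example scores, and the idea of feeding a normalized domain-knowledge score into the coefficient are illustrative assumptions, not the authors' implementation or the API of their R package.

```python
from typing import List, Sequence, Set


def guided_coefficients(scores: Sequence[float], gamma: float,
                        base: float = 1.0) -> List[float]:
    """GRRF-style penalty coefficients: (1 - gamma) * base + gamma * normalized score.

    `scores` is any non-negative per-feature score. In GRRF it is a
    preliminary random forest importance; using a domain-knowledge score
    here is an assumption made for illustration.
    """
    top = max(scores)
    if top <= 0:
        raise ValueError("at least one score must be positive")
    return [(1.0 - gamma) * base + gamma * (s / top) for s in scores]


def penalized_gain(gain: float, feature: int, selected: Set[int],
                   coef: Sequence[float]) -> float:
    """RRF-style regularized gain.

    Features already used in earlier splits keep their full gain; a new
    feature has its gain scaled by its coefficient, so it must be clearly
    better than the selected features to enter the model.
    """
    if feature in selected:
        return gain
    return coef[feature] * gain


# Hypothetical scores for three features (not taken from the paper).
scores = [0.9, 0.3, 0.6]
coef = guided_coefficients(scores, gamma=0.5)

# Feature 0 is already selected; all three offer the same raw gain.
selected = {0}
raw_gain = 0.2
adjusted = [penalized_gain(raw_gain, j, selected, coef) for j in range(3)]
```

With equal raw gains, the already-selected feature keeps its full gain, and among the new features the one with the higher score is penalized less. Setting `gamma` to 0 makes every coefficient equal to `base`, which reduces the guided scheme to a single shared penalty as in plain RRF.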

Original language: English (US)
Title of host publication: Bioinformatics and Biomedical Engineering - 6th International Work-Conference, IWBBIO 2018, Proceedings
Publisher: Springer Verlag
Pages: 3-14
Number of pages: 12
ISBN (Print): 9783319787589
DOI: https://doi.org/10.1007/978-3-319-78759-6_1
State: Published - Jan 1 2018
Event: 6th International Work-Conference on Bioinformatics and Biomedical Engineering, IWBBIO 2018 - Granada, Spain
Duration: Apr 25 2018 to Apr 27 2018

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10814 LNBI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Keywords

  • Biomarker discovery
  • Domain knowledge
  • Feature selection
  • Regularized random forest

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Guan, X., & Liu, L. (2018). Know-GRRF: Domain-Knowledge Informed Biomarker Discovery with Random Forests. In Bioinformatics and Biomedical Engineering - 6th International Work-Conference, IWBBIO 2018, Proceedings (pp. 3-14). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10814 LNBI). Springer Verlag. https://doi.org/10.1007/978-3-319-78759-6_1
