Collaborative Research: Active Statistical Learning: Ensembles Manifolds and Optimal Experimental Design

Project: Research project

Description

In enterprise systems across industries such as manufacturing, health care, energy, and environmental monitoring, automated sensing can generate enormous numbers of measurements, and a common objective is to predict a response (target) y. Modern systems can generate a large number of data instances at low cost (e.g., medical images, chemo-informatics, manufacturing sensors, remote and environmental sensors), but labels (i.e., observed values of y) may require human effort that is time-consuming and expensive. An Active Learning (AL) strategy selects which instances to label so that a model improves with a relatively small number of queries, accelerating learning. There are many similarities between AL and optimal experimental design. As a secondary goal of the proposed research, AL methods will be used to explore open questions in optimal design of experiments.

Intellectual Merit: Existing AL methods are often based on strong assumptions about the joint input/output distribution or use a distance-based approach. These methods are susceptible to noise in the input space, assume numerical inputs only, and often work poorly in high dimensions. In addition, for methods that rely on distance computations and/or linear models, computational complexity limits their use on large datasets. In applications, data sets are often large and noisy, and contain missing values and mixed (numerical/categorical) variable types. Often, queries should be arranged in groups or batches. In a batch query one should consider both the usefulness of individual queries and the diversity of the batch. Batch AL, of great importance in practice, is less commonly addressed by existing AL approaches. Here, a non-parametric approach to the AL problem called Stochastic Query-by-Forest (SQRF) is proposed that effectively addresses the challenges described above. The algorithm is based on a batch diversification strategy applied to an ensemble of decision trees. Successful preliminary work with this approach focused on binary classification problems. In this research we propose to consider more general models, including regression and multi-class problems, along with other challenging innovations. Furthermore, a novel AL strategy that incorporates the geometric structure of the unlabeled data is proposed. In many applications, unlabeled data lie on a lower-dimensional nonlinear manifold. Our expectation is that incorporating the geometric properties of the data will result in more informative samples/solutions. This work is a collaborative effort between researchers at Arizona State University, Pennsylvania State University, and Intel Corporation with complementary expertise in machine learning and optimal design. The participation of Intel will help ensure the successful dissemination and broad applicability of the results.
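
The trade-off in a batch query between individual usefulness and batch diversity can be illustrated with a small sketch. This is not the proposed Stochastic Query-by-Forest algorithm, whose details are not given here; it is a generic query-by-committee heuristic under stated assumptions. The function names (`vote_entropy`, `select_batch`), the `diversity_weight` parameter, and the use of Euclidean distance as the diversity measure are all hypothetical choices made for illustration. The committee votes stand in for the predictions of an ensemble such as a set of decision trees.

```python
import math


def vote_entropy(votes):
    """Committee disagreement on one instance: entropy of the predicted labels."""
    n = len(votes)
    counts = {}
    for v in votes:
        counts[v] = counts.get(v, 0) + 1
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def select_batch(points, committee_votes, batch_size, diversity_weight=1.0):
    """Greedily choose indices to label (illustrative heuristic, not SQRF).

    points          : list of numeric feature tuples (unlabeled instances)
    committee_votes : committee_votes[i] holds each ensemble member's
                      predicted label for points[i]
    diversity_weight: assumed trade-off between usefulness and diversity
    """
    usefulness = [vote_entropy(v) for v in committee_votes]
    chosen = []
    while len(chosen) < min(batch_size, len(points)):
        best, best_val = None, -math.inf
        for i, p in enumerate(points):
            if i in chosen:
                continue
            # Diversity term: distance to the nearest point already in the
            # batch (zero while the batch is still empty).
            spread = min((math.dist(p, points[j]) for j in chosen), default=0.0)
            val = usefulness[i] + diversity_weight * spread
            if val > best_val:
                best, best_val = i, val
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    # Two tight clusters; the committee disagrees most on the first cluster.
    points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
    votes = [[0, 1, 0, 1], [0, 1, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0]]
    print(select_batch(points, votes, batch_size=2))
    print(select_batch(points, votes, batch_size=2, diversity_weight=0.0))
```

With the diversity term switched off, the batch in this toy example consists of two nearly identical points from the same cluster; with it switched on, the second query comes from the other cluster. Note that a distance-based diversity term of this kind assumes numerical inputs, which is one of the limitations of existing methods noted above.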

Broader impacts of the proposed research: AL can accelerate the learning cycle needed to improve systems in multiple domains (e.g., better diagnostics in health care, better quality in manufacturing, etc.). Consequently, this interdisciplinary topic applies broadly to numerous enterprise systems, and presents a large opportunity for impact. Cross-fertilization in the area of optimal experimental design will also occur. Dissemination is planned through conferences and journal publications and, given the collaboration with Intel, through the semiconductor industry. Results will be integrated into data mining courses at the institutions, including heavily enrolled Web courses. REUs at the two institutions are also planned. The PIs both have a history of mentoring women and under-represented minorities in research projects, and this will continue in the proposed research.
Status: Finished
Effective start/end date: 9/1/15 to 8/31/18

Funding

  • National Science Foundation (NSF): $175,000.00

Fingerprint

Design of experiments
Industry
Health care
Labels
Problem-Based Learning
Sensors
Decision trees
Data mining
Learning systems
Computational complexity
Innovation
Semiconductor materials
Costs