Subset selection algorithms for prediction (2011)

Abstract
In this dissertation, we study the subset selection problem for prediction. It deals with choosing the “best” or “most informative” k-subset from a large set of n ≫ k observable variables, to predict the value of a function or another variable of interest that is related to the observable variables. Natural applications of this problem abound in areas as diverse as medicine, social sciences, economics, numerical analysis, signal processing and sensor networks. There are various mathematical formulations for this problem, depending on the characterization of the best subset and of the dependencies between variables. We study two versions: the first is a stochastic framework for subset selection of random variables using linear regression, and the second is an adversarial framework for estimating aggregate statistics of a function in the presence of metric-space-induced spatial constraints. The goal of this dissertation is to perform an algorithmic analysis of these subset selection problems, characterize natural conditions that make them tractable, and explore polynomial-time algorithms with guaranteed optimal or near-optimal solutions. For the stochastic subset selection problem, we explore two broad approaches for designing efficient approximation algorithms. The first approach uses a graph-theoretic framework to characterize the covariance structure of the problem instance and designs efficient algorithms for several classes of covariance graphs. The second approach uses an algebraic framework based on spectral and submodular analysis to identify conditions under which greedy algorithms obtain good performance guarantees. For adversarial subset selection, we provide efficient deterministic and randomized sampling strategies and corresponding prediction functions to approximate some commonly used aggregate statistics. For the deterministic setting, we show an interesting connection with common clustering problems and obtain constant-factor approximation algorithms for predicting the average and maximum statistics. For the randomized setting, we obtain a polynomial-time approximation scheme for the problem of finding the optimal randomized algorithm for choosing a single sample to predict the average statistic. We also solve the interesting special case of estimating the integral of a univariate Lipschitz-continuous function over the [0,1] interval using one sample, and design an optimal randomized algorithm for this setting. For several of our subset selection algorithms, we also validate our theoretical analysis experimentally on real-world data sets.
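To make the stochastic version of the problem concrete, the following is a minimal sketch (not the dissertation's implementation) of greedy forward subset selection for linear-regression prediction: at each step, add the observable variable that most reduces the residual error of predicting the target. The names X, y, and k, the synthetic data, and the use of plain least squares are illustrative assumptions.

```python
import numpy as np

def greedy_subset_selection(X, y, k):
    """Greedily pick k columns of X that minimize the least-squares error for y."""
    n_features = X.shape[1]
    selected = []
    for _ in range(k):
        best_j, best_err = None, np.inf
        for j in range(n_features):
            if j in selected:
                continue
            cols = selected + [j]
            # Least-squares fit on the candidate subset of columns.
            coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            err = np.sum((X[:, cols] @ coef - y) ** 2)
            if err < best_err:
                best_j, best_err = j, err
        selected.append(best_j)
    return selected

# Example usage on synthetic data: y depends mainly on columns 2 and 7.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = 3 * X[:, 2] - 2 * X[:, 7] + 0.1 * rng.standard_normal(200)
print(greedy_subset_selection(X, y, 2))  # expected to recover [2, 7]
```

The dissertation's spectral and submodular analysis addresses exactly when such a greedy strategy comes with provable near-optimality guarantees; the sketch above only illustrates the selection loop itself.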
Keywords
aggregate statistic, optimal randomized algorithm, common clustering problem, subset selection algorithm, subset selection problem, adversarial subset selection, best subset, problem instance, subset selection, stochastic subset selection problem