Here are articles, which I found most interesting, from ICML Beijing 2014. Full list is awailable online.

**Boosting multi-step autoregressive forecasts **

**Souhaib Ben Taieb, Rob Hyndman**

Multi-step forecasts can be produced recursively by iterating a one-step model, or directly using a specific model for each horizon. Choosing between these two strategies is not an easy task since it involves a trade-off between bias and estimation variance over the forecast horizon. Using a nonlinear machine learning model makes the tradeoff even more difficult. To address this issue, we propose a new forecasting strategy which boosts traditional recursive linear forecasts with a direct strategy using a boosting autoregression procedure at each horizon. First, we investigate the performance of the proposed strategy in terms of bias and variance decomposition of the error using simulated time series. Then, we evaluate the proposed strategy on real-world time series from two forecasting competitions. Overall, we obtain excellent performance with respect to the standard forecasting strategies.

[pdf]

**Max-Margin Infinite Hidden Markov Models **

**Aonan Zhang, Jun Zhu, Bo Zhang**

Infinite hidden Markov models (iHMMs) are nonparametric Bayesian extensions of hidden Markov models (HMMs) with an infinite number of states. Though flexible in describing sequential data, the generative formulation of iHMMs could limit their discriminative ability in sequential prediction tasks. Our paper introduces max-margin infinite HMMs (M2iHMMs), new infinite HMMs that explore the max-margin principle for discriminative learning. By using the theory of Gibbs classifiers and data augmentation, we develop efficient beam sampling algorithms without making restricting mean-field assumptions or truncated approximation. For single variate classification, M2iHMMs reduce to a new formulation of DP mixtures of max-margin machines. Empirical results on synthetic and real data sets show that our methods obtain superior performance than other competitors in both single variate classification and sequential prediction tasks.

[pdf]

**Diagnosis determination: decision trees optimizing simultaneously worst and expected testing cost **

**Ferdinando Cicalese, Eduardo Laber, Aline Medeiros Saettler**

In several applications of automatic diagnosis and active learning a central problem is the evaluation of a discrete function by adaptively querying the values of its variables until the values read uniquely determine the value of the function. In general reading the value of a variable is done at the expense of some cost (computational or possibly a fee to pay the corresponding experiment). The goal is to design a strategy for evaluating the function incurring little cost (in the worst case or in expectation according to a prior distribution on the possible variables’ assignments). We provide an algorithm that builds a strategy (decision tree) with both expected cost and worst cost which are at most an O(logn)factor away from, respectively, the minimum possible expected cost and the minimum possible worst cost. Our algorithm provides the best possible approximation simultaneously with respect to both criteria. In fact, there is no algorithm that can guarantee o(logn) approximation, under the assumption that P≠NP.

[pdf]

**Narrowing the Gap: Random Forests In Theory and In Practice**

**Misha Denil, David Matheson, Nando De Freitas**

Despite widespread interest and practical use, the theoretical properties of random forests are still not well understood. In this paper we contribute to this understanding in two ways. We present a new theoreti- cally tractable variant of random regression forests and prove that our algorithm is con- sistent. We also provide an empirical eval- uation, comparing our algorithm and other theoretically tractable random forest models to the random forest algorithm used in prac- tice. Our experiments provide insight into the relative importance of different simplifi- cations that theoreticians have made to ob- tain tractable models for analysis.

[pdf]

**Maximum Margin Multiclass Nearest Neighbors**

**Aryeh Kontorovich, Roi Weiss**

We develop a general framework for margin-based multicategory classification in metric spaces. The basic work-horse is a margin-regularized version of the nearest-neighbor classifier. We prove generalization bounds that match the state of the art in sample size n and significantly improve the dependence on the number of classes k. Our point of departure is a nearly Bayes-optimal finite-sample risk bound independent of k. Although k-free, this bound is unregularized and non-adaptive, which motivates our main result: Rademacher and scale-sensitive margin bounds with a logarithmic dependence on k. As the best previous risk estimates in this setting were of order k√, our bound is exponentially sharper. From the algorithmic standpoint, in doubling metric spaces our classifier may be trained on n examples in O(n2logn) time and evaluated on new points in O(logn) time.

[pdf]

**Ensemble Methods for Structured Prediction**

**Corinna Cortes, Vitaly Kuznetsov, Mehryar Mohri**

We present a series of learning algorithms and theoretical guarantees for designing accurate ensembles of structured prediction tasks. This includes several randomized and deterministic algorithms devised by converting on-line learning algorithms to batch ones, and a boosting-style algorithm applicable in the context of structured prediction with a large number of labels. We give a detailed study of all these algorithms, including the description of new on-line-to-batch conversions and learning guarantees. We also report the results of extensive experiments with these algorithms in several structured prediction tasks.

[pdf]

**Deep Boosting**

**Corinna Cortes, Mehryar Mohri, Umar Syed**

We present a new ensemble learning algorithm, DeepBoost, which can use as base classifiers a hypothesis set containing deep decision trees, or members of other rich or complex families, and succeed in achieving high accuracy without overfitting the data. The key to the success of the algorithm is a ‘capacity-conscious’ criterion for the selection of the hypotheses. We give new data-dependent learning bounds for convex ensembles expressed in terms of the Rademacher complexities of the sub-families composing the base classifier set, and the mixture weight assigned to each sub-family. Our algorithm directly benefits from these guarantees since it seeks to minimize the corresponding learning bound. We give a full description of our algorithm, including the details of its derivation, and report the results of several experiments showing that its performance compares favorably to that of AdaBoost and Logistic Regression and their L1-regularized variants.

[pdf]

**Learning Character-level Representations for Part-of-Speech Tagging**

**Cicero Dos Santos, Bianca Zadrozny**

Distributed word representations have recently been proven to be an invaluable resource for NLP. These representations are normally learned using neural networks and capture syntactic and semantic information about words. Information about word morphology and shape is normally ignored when learning word representations. However, for tasks like part-of-speech tagging, intra-word information is extremely useful, specially when dealing with morphologically rich languages. In this paper, we propose a deep neural network that learns character-level representation of words and associate them with usual word representations to perform POS tagging. Using the proposed approach, while avoiding the use of any handcrafted feature, we produce state-of-the-art POS taggers for two languages: English, with 97.32% accuracy on the Penn Treebank WSJ corpus; and Portuguese, with 97.47% accuracy on the Mac-Morpho corpus, where the latter represents an error reduction of 12.2% on the best previous known result.

[pdf]