ICGI-2002

Paul Vitanyi

Algorithmic Statistics

Abstract:

As perhaps the last mathematical innovation of an extraordinary scientific career, Kolmogorov proposed in 1974 to found statistical theory on finite combinatorial principles, independent of probabilistic assumptions: the relation between the individual data and its explanation (model) is expressed by Kolmogorov's structure function.
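For reference, the structure function mentioned here is defined (in the notation of the paper cited below, with $K(\cdot)$ the prefix Kolmogorov complexity and the models $S$ ranging over finite sets of strings that contain the data $x$) as

$$ h_x(\alpha) = \min_{S} \{\, \log |S| \;:\; S \ni x, \ K(S) \le \alpha \,\}. $$

Thus $h_x(\alpha)$ records, for every bound $\alpha$ on the complexity of the model, how small a set containing $x$ (how specific an explanation) can be achieved.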

In classical probabilistic statistics the goodness of the selection process is measured in terms of expectations over probabilistic ensembles. For current applications, average relations are often irrelevant, since the part of the support of the probability density function that will ever be observed has about zero measure. This may be the case in, for example, complex video and sound analysis. The problem arises that for individual cases the selection performance may be bad even though the performance is good on average, or vice versa. There is also the problem of what probability means: whether it is subjective, objective, or exists at all. The proposal outlined by Kolmogorov strives for the firmer and less contentious ground of finite combinatorics and effective computation.

Kolmogorov's structure function, its variations, and its relation to model selection have obtained some notoriety (many papers, and the Cover and Thomas textbook on information theory) but have not before been comprehensively analyzed and understood. It has always been questioned why Kolmogorov chose to focus on the mysterious function denoted $h_x$ rather than on the more evident function denoted $\beta_x$ (for details see the paper referred to below). Our main result, with the beauty of truth, justifies Kolmogorov's intuition. One easily stated consequence is: For all data, minimizing a two-part code consisting of one part model description and one part data-to-model code (essentially the celebrated MDL code), subject to a given model-complexity constraint, as well as minimizing the one-part code consisting of just the data-to-model code (essentially the maximum likelihood estimator), in every case (and not only with high probability) selects a model that is a ``best explanation'' of the data within the given constraint. In particular, when the ``true'' model that generated the data is not in the model class considered, the ML or MDL estimator still gives a model that ``best fits'' the data. ``Best explanation'' and ``best fit'' are understood here in the sense that the data is ``most typical'' for the selected model, in a rigorous mathematical sense discussed below.

A practical consequence is as follows: while the best fit (minimal randomness deficiency under complexity constraints on the model) cannot be computationally monotonically approximated, we can monotonically minimize the two-part code, or the one-part code, and thus monotonically approximate the best-fitting model implicitly. But this should be sufficient: we want the best model rather than a number that measures its goodness. These results open the possibility of model selection and prediction that are best possible for individual data samples, and thus usher in a completely new era of statistical inference that is always best rather than best only on average.
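To make the notions above concrete, the paper cited below measures how typical $x$ is for a finite set $S \ni x$ by the randomness deficiency $\delta(x \mid S) = \log |S| - K(x \mid S)$, and considers, for a complexity bound $\alpha$, the best-fit function and the constrained two-part (MDL) code length

$$ \beta_x(\alpha) = \min_S \{\, \delta(x \mid S) : S \ni x,\ K(S) \le \alpha \,\}, \qquad \lambda_x(\alpha) = \min_S \{\, K(S) + \log |S| : S \ni x,\ K(S) \le \alpha \,\}. $$

The main result can then be stated, up to additive logarithmic terms and in the relevant range $\alpha \le K(x)$, as

$$ \beta_x(\alpha) = h_x(\alpha) + \alpha - K(x), $$

with the additional property that a set witnessing $h_x(\alpha)$ or $\lambda_x(\alpha)$ is also (nearly) a witness of $\beta_x(\alpha)$; this is the precise sense in which the ML- and MDL-style minimizations select a best-fitting model.

As a toy, purely illustrative sketch (not from the paper: idealized code lengths stand in for the uncomputable Kolmogorov complexity, and the data string and model class are made up for the example), the following Python fragment compares the two-part code length of a binary string under the simple finite-set model $S_k$, the set of length-$n$ strings with exactly $k$ ones, against the one-part code that treats the string as arbitrary:

  from math import comb, log2

  def two_part_code_lengths(x):
      """Idealized code lengths (in bits) for x under two candidate models."""
      n, k = len(x), x.count("1")
      model_cost = log2(n + 1)        # describe the model S_k: the index k in 0..n
      data_cost = log2(comb(n, k))    # point out x inside S_k, where |S_k| = C(n, k)
      full_cost = n                   # one-part code: x as an arbitrary n-bit string
      return {"S_k": model_cost + data_cost, "{0,1}^n": full_cost}

  x = "1000000100000010"              # hypothetical 16-bit sample with 3 ones
  for model, bits in two_part_code_lengths(x).items():
      print(f"model {model:8s} code length = {bits:5.1f} bits")

For this string the constrained model $S_3$ costs roughly 13 bits against 16 bits for the trivial model, mirroring in a very small way how minimizing the two-part code prefers the more specific explanation.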

Based on joint work with Nikolai Vereshchagin, to be presented at the 43rd IEEE Symposium on Foundations of Computer Science (FOCS 2002), Vancouver, Canada. See the paper at http://arXiv.org/abs/cs.CC/0204037.