## MML Glossary

**Bayes** (1702-1761), as in Bayes's theorem P(H&D) = P(H).P(D|H) = P(D).P(H|D).
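
A small numeric sketch of the theorem (all probabilities invented for illustration): both factorisations of the joint probability P(H&D) agree.

```python
# Numeric check of Bayes's theorem: P(H&D) = P(H).P(D|H) = P(D).P(H|D).
# Hypothetical numbers: a hypothesis with prior 0.01, and a test with
# P(D|H) = 0.9 and P(D|not H) = 0.05 (all values made up for the example).
p_H = 0.01             # prior P(H)
p_D_given_H = 0.9      # likelihood P(D|H)
p_D_given_notH = 0.05  # P(D|not H)

p_D = p_H * p_D_given_H + (1 - p_H) * p_D_given_notH  # total probability P(D)
p_H_given_D = p_H * p_D_given_H / p_D                 # posterior P(H|D)

# The same joint probability P(H&D), computed both ways:
joint_1 = p_H * p_D_given_H
joint_2 = p_D * p_H_given_D
```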

**Bayesian**: Styles of inference (machine learning, statistics, etc.) that rely on Bayes's theorem and the use of priors.

**Classification**: See supervised and unsupervised classification.

**Conditional Probability**: P(B|A), the probability of B given A.

**Conjugate prior**: A family of prior distributions is *conjugate* for f(x|θ) if the posterior distribution is in the family whenever the prior is in the family.

**Consistent**: An estimator is consistent if it converges to the correct estimate (assuming that the model class includes the true model) as more and more data are made available.
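
A hedged illustration: the sample proportion (the maximum-likelihood estimator of a Bernoulli parameter) is consistent, so its error typically shrinks as more data arrive. The parameter value and sample sizes below are invented.

```python
import random

# Consistency sketch: the sample proportion converges to the true
# Bernoulli parameter as the sample size grows. Illustrative values.
random.seed(0)
true_theta = 0.7  # the model class (Bernoulli) contains the true model

def estimate(n):
    """Sample proportion of successes in n Bernoulli(true_theta) draws."""
    return sum(random.random() < true_theta for _ in range(n)) / n

error_small_n = abs(estimate(100) - true_theta)
error_large_n = abs(estimate(100_000) - true_theta)
```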

**Data Mining**: Machine learning + some aspects of databases, with the emphasis on (very) large data sets and efficient and robust (and sometimes ad hoc) methods. (If you know of a better, short definition, tell me.)

**Data Space** = Sample Space: Set of values from which data are drawn, e.g., {head, tail} for a single coin toss.

**Estimate**: θ̂ ("theta-hat"), a value of a parameter θ inferred from (i.e., fitted to) data.

**Estimator**: A function (mapping) from the data-space to the space of parameter values.

**Expected Future Data**: The predictive distribution of future data: a weighted (by posterior probability) average over all hypotheses (models, parameter estimates).
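
A sketch with an invented two-hypothesis space: the probability of the next observation is averaged over the hypotheses, weighted by their posteriors.

```python
# Posterior-weighted prediction over a hypothesis space of two coin biases.
# The biases, prior, and data are made-up illustrative values.
thetas = [0.4, 0.8]   # candidate models, P(head) = theta
prior = [0.5, 0.5]
heads, tosses = 3, 3  # observed data D: 3 heads in 3 tosses

likelihood = [t**heads * (1 - t)**(tosses - heads) for t in thetas]
unnorm = [p * l for p, l in zip(prior, likelihood)]
posterior = [u / sum(unnorm) for u in unnorm]

# Expected future data: P(next toss = head), averaged over all hypotheses.
p_next_head = sum(p * t for p, t in zip(posterior, thetas))
```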

**Fisher**, R. A. (1890-1962).

**Independent**: A and B are independent if P(A&B)=P(A).P(B).

**Invariant**: An estimator is invariant if f'(e(D)) = e'(f(D)), where f is a monotonic transformation on the data space, and f' is the corresponding transformation on the parameter space.
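
A concrete sketch, with invented data: the sample median is an invariant estimator of a location, since it commutes with any monotonic transformation of the data (here f' = f).

```python
import math

# Invariance sketch: the sample median commutes with a monotonic transform
# of the data (here f = f' = sqrt, the parameter being a plain location).
def median(xs):
    s = sorted(xs)
    return s[len(s) // 2]  # odd-length data, so the middle element

data = [1.0, 4.0, 9.0, 16.0, 25.0]
f = math.sqrt  # a monotonic transformation of the data space

lhs = f(median(data))               # f'(e(D))
rhs = median([f(x) for x in data])  # e'(f(D))
```

By contrast, the sample mean is not invariant here: sqrt(mean(data)) = sqrt(11) ≈ 3.32, while the mean of the transformed data is 3.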

**Joint Probability**: E.g. P(A&B), the joint probability of A and B. See conditional and independent.

**Kullback-Leibler distance**: Between two probability distributions, KL({p_{i}},{q_{i}}) = ∑_{i} p_{i}.log_{2}(p_{i}/q_{i}). (An integral replaces the ∑ for continuous distributions.)
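
A sketch in Python, with invented distributions; note that the "distance" is not symmetric in general.

```python
import math

# Kullback-Leibler distance, in bits, between discrete distributions.
def kl_bits(p, q):
    # Terms with p_i = 0 contribute 0 (the usual convention).
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]
d_pq = kl_bits(p, q)  # extra bits per symbol for coding p-data with a q-code
d_qp = kl_bits(q, p)  # generally differs from d_pq
```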

**Likelihood**: P(D|H), where D is the data set (training data), and H is a hypothesis (parameter estimate, model, theory).

**MAP**: Maximum a posteriori estimation; in the simplest cases only, MML is equivalent to MAP, but this is *not* true in general.

**Markov Model** of order k: A model of a series x_{1}, x_{2}, x_{3},... in which P(x_{t}=e) can depend on x_{t-k} to x_{t-1} only.
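
A minimal order-1 (k=1) sketch, with a made-up training sequence: the transition probabilities P(x_{t}|x_{t-1}) are estimated by counting adjacent pairs.

```python
from collections import defaultdict

# Order-1 Markov model: estimate P(next | prev) by counting transitions
# in a training sequence (illustrative data).
series = "AABABBAABA"
counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(series, series[1:]):
    counts[prev][nxt] += 1

def p(next_sym, prev_sym):
    """Estimated P(x_t = next_sym | x_{t-1} = prev_sym)."""
    total = sum(counts[prev_sym].values())
    return counts[prev_sym][next_sym] / total

p_B_after_A = p("B", "A")
```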

**MDL**: Minimum Description Length, since J. Rissanen, *Parameter Estimation by Shortest Description of Data*, Proc JACE Conf. RSME, pp.593-, 1976. Also see MML below.

**Message Length**: The length, usually in bits, of a message in an optimal code encoding some event (or data D). Often a *two-part message*, -log_{2}(P(H)) - log_{2}(P(D|H)). *Message* after Shannon's mathematical theory of communication (1948).
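
A hedged numeric sketch, with an invented prior and hypothesis: the two-part length of first stating a coin-bias hypothesis and then the data encoded under it.

```python
import math

# Two-part message length, in bits: -log2 P(H) - log2 P(D|H).
# The prior, bias, and data below are made-up illustrative values.
p_H = 1 / 8    # prior probability of the stated hypothesis
theta = 0.75   # hypothesised P(head)
data = "HHTH"  # observed data D

p_D_given_H = 1.0
for toss in data:
    p_D_given_H *= theta if toss == "H" else 1 - theta

msg_len = -math.log2(p_H) - math.log2(p_D_given_H)  # first part + second part
```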

**Minimum EKL Estimator**, MinEKL: The parameter estimate for a distribution (or model or hypothesis) that minimises the KL distance between the distribution and Expected Future Data, i.e., maximises the likelihood of Expected Future Data.

**Mixture Model**: The weighted average of two or more models, especially a mixture of probability distributions in unsupervised classification.
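
A sketch of a two-component Gaussian mixture density, with invented weights and parameters: the mixture density is the weighted average of the component densities.

```python
import math

# Density of a two-component Gaussian mixture: a weighted average of the
# component densities. Weights and parameters are illustrative.
def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

weights = [0.3, 0.7]                   # mixing proportions, sum to 1
components = [(0.0, 1.0), (5.0, 2.0)]  # (mean, std dev) of each component

def mixture_pdf(x):
    return sum(w * normal_pdf(x, mu, s) for w, (mu, s) in zip(weights, components))
```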

**MML**: Minimum Message Length, since C.S.Wallace & D.M.Boulton, *An Information Measure for Classification*, Computer Jrnl., **11**(2) pp.185-194, 1968.

**Multivariate**: Data, distribution etc. having multiple attributes (variables).

**Observation**: A data item, e.g., from an experiment.

**Ockham**: As in Ockham's razor. Also Occam.

**Odds ratio**: Simply the ratio of two probabilities, P(A)/P(B). Also as in posterior odds-ratio P(H_{1}|D)/P(H_{2}|D)=P(H_{1}).P(D|H_{1})/(P(H_{2}).P(D|H_{2})).
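
A numeric sketch with invented priors and likelihoods: the posterior odds are the prior odds times the likelihood ratio, so the normalising constant P(D) cancels.

```python
# Posterior odds ratio for two hypotheses:
# P(H1|D)/P(H2|D) = P(H1).P(D|H1) / (P(H2).P(D|H2)).
# All numbers are made up for illustration.
p_H1, p_H2 = 0.2, 0.8      # priors
p_D_H1, p_D_H2 = 0.9, 0.1  # likelihoods P(D|H1), P(D|H2)

posterior_odds = (p_H1 * p_D_H1) / (p_H2 * p_D_H2)  # P(D) cancels
```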

**Prior**: Before, particularly "before actual data are seen", as in prior probability distribution of parameters and/or models, P(H).

**Posterior**: After, particularly "after actual data are seen", as in posterior probability distribution of parameters and/or models, P(H|D)=P(H&D)/P(D)=P(H).P(D|H)/P(D).

**Regression**: To model, fit or infer, but particularly to fit a function (line, polynomial, etc.) through points {(x_{i},y_{i})} where y is dependent on x.
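
A minimal sketch of the simplest case, an ordinary least-squares line y = a + b·x through points {(x_{i},y_{i})}; the points are invented and lie exactly on y = 1 + 2x.

```python
# Ordinary least-squares fit of a line y = a + b*x (illustrative points).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # exactly y = 1 + 2x

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
# Slope: covariance of x and y over variance of x; intercept from the means.
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx
```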

**Sample Space**: Space, set of values over which a random variable ranges.

**Strict MML** (SMML): See Farr and Wallace (2002).

**Supervised Classification**: To infer a function, c:S→T, a classification function, given examples (training data) drawn from S×T.
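
A minimal sketch of a classification function inferred from labelled examples, here a 1-nearest-neighbour rule over invented (s, t) training pairs.

```python
# A minimal supervised classifier (1-nearest-neighbour) inferred from
# examples drawn from S x T; the training data are illustrative.
train = [(1.0, "a"), (2.0, "a"), (8.0, "b"), (9.0, "b")]  # (s, t) pairs

def classify(s):
    """c: S -> T, chosen as the label of the nearest training example."""
    return min(train, key=lambda st: abs(st[0] - s))[1]

label = classify(7.5)
```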

**Univariate**: Data, distribution etc. having one attribute.

**Unsupervised Classification**: To infer a mixture model from examples (data).

**Variable** (1): Random variable.

**Variable** (2): An attribute of an observation (thing), e.g., a column of a data-set.

**von Mises**(-Fisher, vMF): Probability distributions on directions (unit vectors) in **R**^{D}.

**Wallace, C. S.** (1933-2004).

#### Some sources

- G. Farr, *Information Theory and MML Inference*, School of Comp. Sci. and Software Eng., Monash University, 1997-1999.
- G. Farr & C. S. Wallace, *The Complexity of Strict Minimum Message Length Inference*, The Computer Journal, **45**(3), pp.285-292, 2002.
- C. S. Wallace & D. M. Boulton, *An Information Measure for Classification*, The Computer Journal, **11**(2), pp.185-194, August 1968.
- C. S. Wallace & P. R. Freeman, *Estimation and Inference by Compact Coding*, J. Royal Stat. Soc., **49**(3), pp.240-265, 1987.
- C. S. Wallace, *Statistical and Inductive Inference by Minimum Message Length*, Springer, 2005.