Fisher Information
Keywords: Fisher information, two-part message, accuracy of parameter inference, multiple parameters.
For one continuous-valued parameter, θ, the Fisher information is defined to be:
- F(θ) = Ex( d²/dθ² { - ln f(x|θ) } )
where f(x|θ) is the likelihood, i.e. P(x|θ) for data 'x' and parameter value (or hypothesis, ...) θ, and Ex is the expectation, i.e. the average over x in the data-space X.
The Fisher information shows how sensitive the likelihood is to the parameter θ. It turns out to be the key to how accurately parameter estimates, i.e. inferences, should be stated. We should infer a parameter estimate (usually) close to the maximum likelihood estimate, i.e. close to where d/dθ f(x|θ) = 0, and the second derivative, d²/dθ²{ - ln f(x|θ) }, gives the curvature of the negative log-likelihood at that point: the greater the curvature, the more precisely the data determine θ.
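To make the definition concrete, here is a minimal Python sketch, assuming a binomial likelihood for x = k heads in n tosses (the example treated below; the function names and the particular numbers are this sketch's own). It forms the expectation of a numerical second derivative of - ln f and compares it with the standard closed form n/(θ(1-θ)):

```python
import math

def neg_log_lik(theta, n, k):
    # - ln f(x|theta) for k heads in n tosses (constant term omitted;
    # it does not affect derivatives with respect to theta)
    return -(k * math.log(theta) + (n - k) * math.log(1.0 - theta))

def fisher_info(theta, n, h=1e-4):
    # F(theta) = Ex( d2/dtheta2 { - ln f(x|theta) } ): average the
    # central-finite-difference second derivative over all possible
    # data x = k in 0..n, weighted by the binomial probabilities.
    total = 0.0
    for k in range(n + 1):
        p_k = math.comb(n, k) * theta**k * (1.0 - theta)**(n - k)
        d2 = (neg_log_lik(theta + h, n, k)
              - 2.0 * neg_log_lik(theta, n, k)
              + neg_log_lik(theta - h, n, k)) / h**2
        total += p_k * d2
    return total

theta, n = 1.0 / 3.0, 300
print(fisher_info(theta, n))           # ~1350, numerically
print(n / (theta * (1.0 - theta)))     # closed form: n/(theta(1-theta)) = 1350
```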
The logs can be in any base, provided that we remember which one, for the units (bits, nits, ...), but differentiation etc. favour natural logs to base e, loge = ln. Quantities can easily be converted to bits later (1 nit = 1/ln 2 ≈ 1.44 bits).
MML
A parameter estimate, θ, can only be stated to finite accuracy. How much accuracy is optimal? If a coin is tossed three times and comes up heads once, we surely have much less information about any bias (θ) than if the coin is tossed 300 times and comes up heads 100 times. Finite accuracy amounts to stating that θ lies in an interval (θ-s/2, θ+s/2); note that the width, s, depends on θ in general.
First Part of Message
If h(θ) is the prior probability density function of θ, the probability, and message length, of the interval are approximated by
- probability = h(θ) . s
- msgLen = - ln( h(θ) . s ) nits
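For instance, assuming (purely for illustration) a uniform prior h(θ) = 1 on (0, 1) and an interval of width s ≈ 0.094, the width the coin example below will turn out to justify for 300 tosses:

```python
import math
# cost of the first part: - ln( h(theta) . s ), uniform prior h(theta) = 1
h_theta, s = 1.0, 0.094
print(-math.log(h_theta * s))   # ~2.36 nits to state theta to within s
```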
Second Part of Message
The second part of the message transmits the data given the first part. The receiver has not seen the data, x, and does not know any estimate based on the data unless told by the transmitter, so we must use the average over the interval (θ-s/2, θ+s/2).
Letting θ' = θ + t, where -s/2<t<s/2, the message length of the second part is
- - ln f(x|θ') = - ln f(x|θ+t)
- = - ln f(x|θ) + t (d/dθ{ - ln f(x|θ) }) + (1/2) t² (d²/dθ²{ - ln f(x|θ) }) + ...
Noting that
- the linear term in t averages to zero over (-s/2, s/2), and
- the integral of t² over (-s/2, s/2) is [t³/3] from -s/2 to s/2 = s³/12, so the average of t² over the interval of width s is s²/12,
the expected message length of the second part is
- - ln f(x|θ) + (s²/24) d²/dθ²{ - ln f(x|θ) }
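A quick numerical check of this approximation, again assuming a binomial likelihood for illustration (the names and particular numbers are this sketch's assumptions):

```python
import math

def neg_log_lik(theta, n, k):
    # - ln f(x|theta) for k heads in n tosses (constant term omitted)
    return -(k * math.log(theta) + (n - k) * math.log(1.0 - theta))

theta, n, k, s = 1.0 / 3.0, 300, 100, 0.05

# left side: - ln f(x|theta + t) averaged over t in (-s/2, s/2), by quadrature
m = 10001
avg = sum(neg_log_lik(theta - s / 2 + s * (i + 0.5) / m, n, k)
          for i in range(m)) / m

# right side: - ln f(x|theta) + (s^2/24) d2/dtheta2{ - ln f(x|theta) },
# using the exact binomial second derivative k/theta^2 + (n-k)/(1-theta)^2
F_x = k / theta**2 + (n - k) / (1.0 - theta)**2
print(avg)                                           # averaged exact length
print(neg_log_lik(theta, n, k) + (s**2 / 24) * F_x)  # agrees closely
```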
Choosing 's'
Adding the message lengths for the two parts of the message:
- - ln( h(θ).s ) - ln f(x|θ) + (s²/24) d²/dθ²{ - ln f(x|θ) }
Differentiating the total with respect to s and setting the derivative to zero, -1/s + (s/12) F(x,θ) = 0, where
- F(x, θ) = d²/dθ²{ - ln f(x|θ) }
- so s² = 12 / F(x, θ)
However, the code for stating θ must be agreed between transmitter and receiver before any data are seen, so s cannot depend on the data x; F(x, θ) is therefore replaced by its expected value, the Fisher information:
- s² = 12 / (∑x:X f(x|θ).F(x,θ)) = 12 / F(θ)
Substituting this width back into the total:
- msgLen = - ln h(θ) - ln f(x|θ) + (1/2) ln F(θ) - (1/2) ln 12 + (1/2) F(x,θ)/F(θ)
- ≈ - ln h(θ) - ln f(x|θ) + (1/2) ln F(θ) - (1/2) ln 12 + 1/2
since F(x,θ)/F(θ) averages to one.
A number of simplifying assumptions have been made along the way; beware if their preconditions do not hold! The simplifications lead to more tractable mathematics.
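Putting the whole single-parameter formula to work on the coin example from earlier: a sketch assuming a uniform prior h(θ) = 1 on (0, 1) and the maximum likelihood estimate θ = k/n (both choices, and the function name, are this sketch's assumptions):

```python
import math

def mml_coin(n, k):
    # One-parameter MML summary for k heads in n tosses, uniform prior.
    theta = k / n
    F = n / (theta * (1.0 - theta))    # Fisher information for the binomial
    s = math.sqrt(12.0 / F)            # optimal interval width
    neg_log_f = -(math.log(math.comb(n, k))
                  + k * math.log(theta) + (n - k) * math.log(1.0 - theta))
    msg_len = (0.0                     # - ln h(theta) = 0 for the uniform prior
               + neg_log_f
               + 0.5 * math.log(F)
               - 0.5 * math.log(12.0) + 0.5)
    return s, msg_len

print(mml_coin(3, 1))      # s ~ 0.94: theta barely pinned down at all
print(mml_coin(300, 100))  # s ~ 0.094: roughly one decimal place justified
```

Three tosses justify only the vaguest statement of θ, while 300 tosses pin it down to roughly one decimal place.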
Multiple Parameters
With multiple parameters, or equivalently a vector of parameters θ = <θ1, ..., θn>, the sensitivity of the likelihood is indicated by the matrix of second partial derivatives (Wallace & Freeman 1987).
- θ = <θ1, ..., θn>
- F(x, θ)ij = ∂²/∂θi ∂θj { - ln f(x|θ) }
- F(θ) = ∑x:X f(x|θ).F(x,θ)
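These definitions can be checked numerically. Here is a rough Monte Carlo sketch for the two-parameter normal distribution N(μ, σ), for which the exact Fisher matrix is diag(N/σ², 2N/σ²); the distribution choice, function names, sample sizes and step h are all assumptions of this sketch:

```python
import math, random

random.seed(0)

def nll(mu, sigma, xs):
    # - ln f(xs|mu, sigma) for a sample xs from Normal(mu, sigma)
    n = len(xs)
    return (n * math.log(sigma) + n * math.log(2 * math.pi) / 2
            + sum((x - mu) ** 2 for x in xs) / (2 * sigma ** 2))

def fisher_matrix(mu, sigma, n_obs, trials=4000, h=1e-3):
    # Monte Carlo estimate of F(theta)ij = E[ Hessian of - ln f(x|theta) ]
    F = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(trials):
        xs = [random.gauss(mu, sigma) for _ in range(n_obs)]
        g = lambda m, s: nll(m, s, xs)
        # central finite differences for the Hessian
        F[0][0] += (g(mu + h, sigma) - 2 * g(mu, sigma) + g(mu - h, sigma)) / h**2
        F[1][1] += (g(mu, sigma + h) - 2 * g(mu, sigma) + g(mu, sigma - h)) / h**2
        mixed = (g(mu + h, sigma + h) - g(mu + h, sigma - h)
                 - g(mu - h, sigma + h) + g(mu - h, sigma - h)) / (4 * h**2)
        F[0][1] += mixed
        F[1][0] += mixed
    return [[v / trials for v in row] for row in F]

print(fisher_matrix(0.0, 1.0, 10))   # ~[[10, 0], [0, 20]] for N = 10, sigma = 1
```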
The message length is
- msgLen = - ln h(θ) - ln f(x|θ) + (1/2) ln det F(θ) + (n/2)(1 + ln kn) nits
where kn is the constant of the optimal quantising lattice in n dimensions (k1 = 1/12). The first part of the message (the model) accounts for
- - ln h(θ) + (1/2) ln det F(θ) + (n/2) ln kn
and the second part (the data given the model) for
- - ln f(x|θ) + n/2
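As a consistency check, with the one-dimensional lattice constant k1 = 1/12 the term (n/2)(1 + ln kn) at n = 1 reduces to the single-parameter constant derived above:

```python
import math
# n = 1, k1 = 1/12: (n/2)(1 + ln kn) = 1/2 - (1/2) ln 12
print((1 / 2) * (1 + math.log(1 / 12)))   # -0.742...
print(1 / 2 - (1 / 2) * math.log(12))     # the same value
```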
Strict MML, SMML
Note that Strict MML (SMML) (Wallace & Boulton 1975; Farr 1999, p.49) does not make the simplifying approximations of MML; however, the mathematical and algorithmic consequences can be severe (Farr & Wallace 1997).
Notes
- The MML derivations above generalise the particular forms for the binomial, multinomial and normal distributions, which were first given by Wallace and Boulton (1968), to other distributions such as Student's t-distribution.
— LA, 8/1999
- This material is based on talks given by C. S. Wallace c1988, on Wallace & Freeman (1987), R. Baxter's PhD thesis (1996), and G. Farr (1999).
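- C. S. Wallace & D. M. Boulton. An Information Measure for Classification. Computer Journal, 11(2), pp.185-194, 1968.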
- C. S. Wallace & D. M. Boulton. An Invariant Bayes Method for Point Estimation. Classification Soc. Bulletin, 3, pp.11-34, 1975.
- C. S. Wallace & P. R. Freeman. Estimation and Inference by Compact Coding. J. Royal Stat. Soc., 49(3), pp.240-265, 1987.
- R. Baxter. Minimum Message Length Inductive Inference - Theory and Application. PhD thesis, Dept. Computer Science, Monash University, Dec. 1996.
- G. Farr & C. S. Wallace. The Complexity of Strict Minimum Message Length Inference. TR97/321, Department of Computer Science, Monash University, Aug 1997.
- G. Farr. Information Theory and MML Inference. School of Computer Science and Software Engineering, 1999.