Fisher Information

Fisher information, two-part message, accuracy of parameter (inference), multiple parameters

For one continuous-valued parameter, θ, the Fisher information is defined to be:

F(θ) = Ex( d²/dθ² { - ln f(x|θ) } )

where f(x|θ) is the likelihood, i.e. P(x|θ) for data 'x' and parameter value (or hypothesis, ...) θ. Ex is the expectation, i.e. average over x in the data-space X.

The Fisher information shows how sensitive the likelihood is to the parameter θ. It turns out to be the key to how accurately parameter estimates, i.e. inferences, should be stated. We should infer a parameter estimate (usually) close to the maximum likelihood estimate, i.e. close to where d/dθ f(x|θ) = 0, and the second derivative, d²/dθ²{ - ln f(x|θ) }, gives the curvature of the negative log-likelihood there: the sharper the peak, the more precisely θ can be stated.
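
As a concrete check, a small sketch in Python (not from the original notes; the coin/binomial numbers are illustrative): the expectation above computed directly, compared with the standard binomial form N/(θ(1-θ)).

  from math import comb

  def fisher_binomial_numeric(theta, N):
      total = 0.0
      for k in range(N + 1):                                   # k heads in N tosses
          p_k = comb(N, k) * theta**k * (1 - theta)**(N - k)   # P(k | theta)
          d2 = k / theta**2 + (N - k) / (1 - theta)**2         # d2/dtheta2 of -ln f
          total += p_k * d2                                    # expectation over data space
      return total

  theta, N = 0.3, 300
  print(fisher_binomial_numeric(theta, N))   # ~1428.571
  print(N / (theta * (1 - theta)))           # the analytic value, the same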

The logs can be in any base, provided that we remember which one, for the units (bits, nits, ...), but differentiation etc. favour natural logs to base e, loge=ln. Quantities can easily be converted to bits later.

MML

A parameter estimate, θ, can only be stated to finite accuracy. How much accuracy is optimal? If a coin is tossed three times and comes up heads once we surely have much less information about any bias (θ) than if the coin is tossed 300 times and comes up heads 100 times. Finite accuracy amounts to stating that θ lies in an interval (θ-s/2, θ+s/2); note that the width, s, depends on θ in general.

First Part of Message

If h(θ) is the prior probability density function of θ, then the probability of the interval, and hence its message length, are approximated by

probability = h(θ) . s
msgLen = - ln( h(θ) . s )   nits
always assuming that h(θ) does not vary much over the interval.
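
For example, a minimal sketch, assuming a uniform prior h(θ) = 1 on [0,1] and an illustrative interval width s:

  from math import log

  def first_part_nits(h_theta, s):
      # -ln( h(theta) . s ), valid when h varies little over the interval
      return -log(h_theta * s)

  print(first_part_nits(1.0, 0.05))   # ~3.0 nits; a narrower interval costs more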

Second Part of Message

The second part of the message transmits the data given the first part. The receiver has not seen the data, x, and cannot know any estimate based on it except as told by the transmitter, so the expected length of the second part must be averaged over the interval (θ-s/2, θ+s/2).

Letting θ' = θ + t, where -s/2 < t < s/2, the message length of the second part is

- ln f(x|θ')
= - ln f(x|θ+t)   where -s/2 < t < s/2
= - ln f(x|θ) + t (d/dθ{ - ln f(x|θ) }) + (1/2) t² (d²/dθ²{ - ln f(x|θ) }) + ...

by the Taylor expansion, ignoring O(t³) terms.

Noting that

  1. the linear term in t averages to zero over (-s/2, s/2), and
  2. the integral of t² over (-s/2, s/2) is [t³/3] evaluated at ±s/2, i.e. s³/12,

the average for t ranging over (-s/2, s/2) is

- ln f(x|θ) + (s²/24) d²/dθ²{ - ln f(x|θ) }
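
This average is easy to check numerically; a sketch (the coin-toss numbers and interval width are illustrative assumptions):

  from math import log

  def nll(theta, k, N):                     # -ln f(x|theta), k heads in N tosses
      return -(k * log(theta) + (N - k) * log(1 - theta))

  k, N = 100, 300
  theta, s = k / N, 0.02
  M = 10001                                 # midpoint rule over (-s/2, s/2)
  avg = sum(nll(theta - s/2 + (j + 0.5) * s / M, k, N) for j in range(M)) / M
  d2 = k / theta**2 + (N - k) / (1 - theta)**2
  print(avg, nll(theta, k, N) + (s**2 / 24) * d2)   # agree closely; error is O(s^4)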

Choosing 's'

Adding the message lengths for the two parts of the message:

- ln( h(θ).s ) - ln f(x|θ) + (s²/24) d²/dθ²{ - ln f(x|θ) }

To find the minimum, and thus the optimal value of s, let F(x,θ) = d²/dθ²{ - ln f(x|θ) }, differentiate w.r.t. s and set the derivative to zero:

s² = 12 / F(x,θ)

This value of s depends on x, which the receiver does not know. We must instead use the expected value

s² = 12 / ( ∑x:X f(x|θ).F(x,θ) ) = 12 / F(θ)

as x ranges over X, i.e. the Fisher information; both transmitter and receiver can evaluate this.
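
Returning to the coin example (a sketch, using the binomial F(θ) = N/(θ(1-θ))): three tosses justify only a very coarse statement of θ; 300 tosses a ten-times finer one.

  from math import sqrt

  def optimal_s(theta, N):
      return sqrt(12 * theta * (1 - theta) / N)   # s^2 = 12/F(theta)

  print(optimal_s(1/3, 3))     # ~0.94, nearly the whole unit interval
  print(optimal_s(1/3, 300))   # ~0.094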
  
msgLen = - ln h(θ) - ln f(x|θ) + (1/2) ln F(θ) - (1/2) ln 12 + (1/2) F(x,θ)/F(θ)

Finally, "what is usually done is to replace the last term [...] by 1/2" (Farr 1999 p.41), giving an approximation which is reasonable provided that F(x,θ) - F(θ) is small over (θ-s/2, θ+s/2):

msgLen ≈ - ln h(θ) - ln f(x|θ) + (1/2) ln F(θ) - (1/2) ln 12 + 1/2
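
Both parts together for the same example (a sketch, again assuming a uniform prior h(θ) = 1 on [0,1]):

  from math import log

  def mml_msg_len(theta, k, N, h_theta=1.0):
      # - ln h(theta) - ln f(x|theta) + (1/2) ln F(theta) - (1/2) ln 12 + 1/2
      F = N / (theta * (1 - theta))                          # Fisher information
      nll = -(k * log(theta) + (N - k) * log(1 - theta))     # -ln f(x|theta)
      return -log(h_theta) + nll + 0.5 * log(F) - 0.5 * log(12) + 0.5

  print(mml_msg_len(100/300, 100, 300))   # ~193.8 nits for 100 heads in 300 tosses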

A number of simplifying assumptions have been made along the way; beware if their preconditions do not hold! The simplifications lead to more tractable mathematics.

Multiple Parameters

With multiple parameters, or equivalently a vector of parameters θ = <θ1, ..., θn>, the sensitivity of the likelihood is indicated by the second partial derivatives (Wallace & Freeman 1987).

θ = <θ1, ..., θn>
F(x,θ)ij = ∂²/∂θi∂θj { - ln f(x|θ) }
F(θ) = ∑x:X f(x|θ).F(x,θ)
We have two n×n matrices, F(x,θ) and F(θ). The Fisher information is now defined to be the determinant of F(θ).

The message length is

  
msgLen = - ln h(θ) - ln f(x|θ) + (1/2) ln F(θ) + (n/2)(1 + ln kn)   nits
       = {- ln h(θ) + (1/2) ln F(θ) + (n/2) ln kn} + {- ln f(x|θ) + n/2}
                    (model)                            (data|model)
where the kn are lattice constants to do with partitioning the n-dimensional parameter space: k1 = 1/12 = 0.0833..., k2 = 5/(36.√3), k3 = 19/(192 . 2^(1/3)), and kn → 1/(2 π e) = 0.0585498... as n → ∞ (Farr 1999 p.43).
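
The constants above, collected for reference in a sketch (F here stands for the determinant of the Fisher matrix; the caller supplies the other terms in nits):

  from math import log, sqrt, pi, e

  k_lattice = {1: 1/12, 2: 5/(36*sqrt(3)), 3: 19/(192 * 2**(1/3))}

  def mml_msg_len_n(neg_log_prior, neg_log_lik, fisher_det, n):
      # - ln h(theta) - ln f(x|theta) + (1/2) ln F(theta) + (n/2)(1 + ln k_n)
      k_n = k_lattice.get(n, 1/(2*pi*e))   # asymptotic constant for larger n
      return neg_log_prior + neg_log_lik + 0.5*log(fisher_det) + (n/2)*(1 + log(k_n))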

Strict MML, SMML

Note that Strict MML (SMML) (Wallace & Boulton 1975; Farr 1999 p.49) does not make the simplifying approximations of MML; however, the mathematical and algorithmic consequences can be severe (Farr & Wallace 1997).

Notes

The MML derivations above generalise the particular forms for the binomial, multinomial and normal distributions, which were first given by Wallace and Boulton (1968), to other distributions such as Student's t-distribution.
LA, 8/1999

This material is based on talks given by C. S. Wallace c1988, on Wallace & Freeman (1987), R. Baxter's PhD thesis (1996), and G. Farr (1999).