## Fisher Information

*Keywords: Fisher information, two-part message, accuracy of parameter inference, multiple parameters.*

For one continuous-valued parameter, θ, the
*Fisher information* is defined to be:

- F(θ) = E_{x}( ∂^{2}/∂θ^{2}{ - ln f(x|θ) } )

where f(x|θ) is the *likelihood*,
i.e. P(x|θ) for data 'x' and parameter value
(or hypothesis, ...) θ.
E_{x} is the *expectation*,
i.e. average over x in the
data-space **X**.


The Fisher information shows how *sensitive*
the likelihood is to the parameter θ.
It turns out to be the key to how accurately parameter estimates,
i.e. inferences, should be stated.
We should infer a parameter estimate (*usually*) close to
the maximum likelihood estimate,
i.e. close to where ∂/∂θ f(x|θ) = 0,
and the second derivative, ∂^{2}/∂θ^{2} f(x|θ),
gives the *curvature* of the likelihood function there.
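
As a concrete sketch of the definition (not from the original notes): for a binomial likelihood over n coin tosses with success probability θ, F(θ) can be computed directly as the expectation over the data-space, and it agrees with the closed form n/(θ(1-θ)). The function names below are illustrative only.

```python
import math

def d2_neg_log_lik(x, n, theta):
    # second derivative  d^2/dtheta^2 { -ln f(x|theta) }  for the binomial
    # likelihood f(x|theta) = C(n,x) . theta^x . (1-theta)^(n-x)
    return x / theta**2 + (n - x) / (1.0 - theta)**2

def fisher_binomial(n, theta):
    # F(theta) = E_x( d^2/dtheta^2 { -ln f(x|theta) } ), the expectation
    # taken over the whole data-space X = {0, 1, ..., n}
    return sum(math.comb(n, x) * theta**x * (1.0 - theta)**(n - x)
               * d2_neg_log_lik(x, n, theta)
               for x in range(n + 1))

n, theta = 300, 1.0 / 3.0
print(fisher_binomial(n, theta))    # expectation computed directly: 1350.0
print(n / (theta * (1.0 - theta)))  # closed form n/(theta.(1-theta)): 1350.0
```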

The logs can be in any base, provided that we remember which one,
for the units (bits, nits, ...), but differentiation etc.
favour natural logs to base e, log_{e}=ln.
Quantities can easily be converted to bits later.
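For example, a message length of 4 nits is 4/ln 2 ≈ 5.8 bits.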

### MML

A parameter estimate, θ, can only be stated to *finite*
accuracy. How much accuracy is optimal?
If a coin is tossed three times and comes up heads once
we surely have much less information about any bias (θ) than
if the coin is tossed 300 times and comes up heads 100 times.
Finite accuracy amounts to stating that θ lies in an interval
(θ-s/2, θ+s/2); note that the width, s,
depends on θ in general.

#### First Part of Message

If h(θ) is the prior probability *density* function of θ,
the probability, and message length, of the interval are approximated by

- probability = h(θ) . s
- msgLen = - ln( h(θ) . s ) nits

*assuming* that h(θ) does not vary much over the interval.
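
For example, with a uniform prior h(θ) = 1 on (0,1) and s = 0.1, the first part costs - ln(0.1) ≈ 2.3 nits.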

#### Second Part of Message

The second part of the message transmits the data given the first part. The receiver has not seen the data, x, and does not know any estimate based on the data unless told by the transmitter, so we must consider the average message length over the interval (θ-s/2, θ+s/2).

Letting θ' = θ + t, where -s/2<t<s/2, the message length of the second part is

- - ln f(x|θ')
- = - ln f(x|θ+t), where -s/2 < t < s/2
- = - ln f(x|θ) + t (∂/∂θ{ - ln f(x|θ) }) + (1/2) t^{2} (∂^{2}/∂θ^{2}{ - ln f(x|θ) }) + ...

ignoring O(t^{3})-terms.

Noting that

- the linear term in t averages to zero over (-s/2, s/2), and
- the integral of t^{2} over (-s/2, s/2) is [t^{3}/3]_{-s/2}^{s/2} = s^{3}/12, so the average of t^{2} over the interval is s^{2}/12,

the expected message length of the second part is

- - ln f(x|θ) + (s^{2}/24) ∂^{2}/∂θ^{2}{ - ln f(x|θ) }
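
A quick numerical check (a sketch, not from the original notes; the binomial model, the function names, and the choice s = 0.01 are illustrative assumptions): averaging - ln f(x|θ+t) over the interval by brute force agrees closely with the second-order approximation above.

```python
import math

def neg_log_lik(x, n, theta):
    # -ln f(x|theta) for the binomial likelihood
    return -(math.log(math.comb(n, x)) + x * math.log(theta)
             + (n - x) * math.log(1.0 - theta))

x, n, theta, s = 100, 300, 1.0 / 3.0, 0.01

# brute-force average of -ln f(x|theta+t) over t in (-s/2, s/2), midpoint rule
m = 1000
avg = sum(neg_log_lik(x, n, theta - s / 2 + (k + 0.5) * s / m)
          for k in range(m)) / m

# second-order approximation: -ln f(x|theta) + (s^2/24) F(x,theta)
F_x = x / theta**2 + (n - x) / (1.0 - theta)**2
print(avg)
print(neg_log_lik(x, n, theta) + (s**2 / 24) * F_x)  # nearly equal
```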

#### Choosing 's'

Adding the message lengths for the two parts of the message:

- - ln( h(θ).s ) - ln f(x|θ) + (s^{2}/24) ∂^{2}/∂θ^{2}{ - ln f(x|θ) }

Let F(x, θ) = ∂^{2}/∂θ^{2}{ - ln f(x|θ) }.
Minimising the total with respect to s,
i.e. setting ∂/∂s{ - ln s + (s^{2}/24) F(x,θ) } = - 1/s + s.F(x,θ)/12 to zero,
gives

- s^{2} = 12 / F(x, θ)

The receiver cannot evaluate F(x, θ) before decoding the data, x, so the
*expected* quantity must be used instead:

- s^{2} = 12 / ∑_{x:X} f(x|θ).F(x,θ) = 12 / F(θ)

where the expectation is over the data-space **X**, i.e. F(θ) is the
*Fisher information*; both transmitter and receiver can evaluate this.
Substituting for s,

- msgLen = - ln h(θ) - ln f(x|θ) + (1/2) ln F(θ) - (1/2) ln 12 + (1/2) F(x,θ)/F(θ)

*provided* that F(x,θ)-F(θ) is small over (θ-s/2, θ+s/2), and hence approximately

- msgLen ~ - ln h(θ) - ln f(x|θ) + (1/2) ln F(θ) - (1/2) ln 12 + 1/2

A number of simplifying assumptions have been made along the way; beware if their preconditions do not hold! The simplifications lead to more tractable mathematics.
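
Returning to the coin-tossing example above: a minimal sketch, assuming a binomial model and a uniform prior h(θ) = 1 on (0,1) (all function names are illustrative), that computes the optimal precision s = √(12/F(θ)) and the approximate total message length. Note how 100 times as much data makes the optimal interval about 10 times narrower, since s shrinks as 1/√n.

```python
import math

def fisher(n, theta):
    # F(theta) = n / (theta.(1 - theta)) for the binomial model
    return n / (theta * (1.0 - theta))

def msg_len(x, n, theta):
    # ~ -ln h(theta) - ln f(x|theta) + (1/2) ln F(theta) - (1/2) ln 12 + 1/2,
    # with a uniform prior h(theta) = 1 on (0, 1), so -ln h(theta) = 0
    neg_log_f = -(math.log(math.comb(n, x)) + x * math.log(theta)
                  + (n - x) * math.log(1.0 - theta))
    return neg_log_f + 0.5 * math.log(fisher(n, theta)) - 0.5 * math.log(12) + 0.5

for x, n in [(1, 3), (100, 300)]:
    theta = x / n                          # ML estimate, close to the MML estimate
    s = math.sqrt(12 / fisher(n, theta))   # optimal precision of the estimate
    print(f"n={n}: s={s:.3f}, msgLen={msg_len(x, n, theta):.2f} nits")
```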

### Multiple Parameters

With multiple parameters, or equivalently a vector of parameters
θ = <θ_{1}, ..., θ_{n}>,
the sensitivity of the likelihood is indicated by
the second partial derivatives (Wallace & Freeman 1987).

- θ = <θ_{1}, ..., θ_{n}>
- F(x, θ)_{ij} = ∂^{2}/∂θ_{i}∂θ_{j}{ - ln f(x|θ) }
- F(θ) = ∑_{x:X} f(x|θ).F(x,θ)

F(x,θ) and F(θ) are matrices, with elements F(x,θ)_{ij} and F(θ)_{ij}. The
*Fisher information* is now defined to be the determinant of F(θ).

The message length is

- msgLen = - ln h(θ) - ln f(x|θ) + (1/2) ln F(θ) + (n/2)(1 + ln k_{n}) nits
- = *model* + *data|model*
- = { - ln h(θ) + (1/2) ln F(θ) + (n/2) ln k_{n} } + { - ln f(x|θ) + n/2 }

The k_{n} are *lattice constants* to do with partitioning the n-dimensional
parameter space:
k_{1} = 1/12 = 0.0833...,
k_{2} = 5/(36.√3),
k_{3} = 19/(192 . 2^{1/3}),
and k_{n} → 1/(2 π e) = 0.0585498... as n → ∞ (Farr 1999 p.43).
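
As a concrete illustration (a sketch, not from the original notes; assuming N i.i.d. data points from a normal N(μ, σ) with both parameters unknown): the expected matrix is diagonal with F(θ)_{μμ} = N/σ^{2} and F(θ)_{σσ} = 2N/σ^{2}, so the Fisher information is det F(θ) = 2N^{2}/σ^{4}. The finite-difference Hessian below gives one sample of det F(x, θ), which should be close to that expected value for large N. Function names are illustrative.

```python
import math, random

def neg_log_lik(xs, mu, sigma):
    # -ln f(xs | mu, sigma) for i.i.d. N(mu, sigma) data
    return sum(math.log(sigma) + 0.5 * math.log(2 * math.pi)
               + (x - mu)**2 / (2 * sigma**2) for x in xs)

def det_F_x(xs, mu, sigma, h=1e-3):
    # matrix F(x, theta) of second partial derivatives of -ln f,
    # estimated by central finite differences, then its determinant
    f = lambda m, s: neg_log_lik(xs, m, s)
    d_mm = (f(mu + h, sigma) - 2 * f(mu, sigma) + f(mu - h, sigma)) / h**2
    d_ss = (f(mu, sigma + h) - 2 * f(mu, sigma) + f(mu, sigma - h)) / h**2
    d_ms = (f(mu + h, sigma + h) - f(mu + h, sigma - h)
            - f(mu - h, sigma + h) + f(mu - h, sigma - h)) / (4 * h**2)
    return d_mm * d_ss - d_ms**2

random.seed(1)
N, mu, sigma = 1000, 0.0, 2.0
xs = [random.gauss(mu, sigma) for _ in range(N)]
print(det_F_x(xs, mu, sigma))   # one sample of det F(x, theta)
print(2 * N**2 / sigma**4)      # det F(theta) = 2 N^2 / sigma^4
```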

### Strict MML, SMML

Note that
*Strict MML* (SMML)
(Wallace & Boulton 1975, Farr 1999 p.49)
does not make the simplifying approximations of MML;
however, the mathematical and algorithmic consequences can be severe
(Farr & Wallace 1997).

### Notes

- The MML derivations above generalise the particular forms for the binomial, multinomial and normal distributions, which were first given by Wallace and Boulton (1968), to other distributions such as Student's t-distribution. — LA, 8/1999

- This material is based on talks given by C. S. Wallace c1988, on Wallace & Freeman (1987), R. Baxter's PhD thesis (1996), and G. Farr (1999).

- C. S. Wallace & D. M. Boulton. *An Information Measure for Classification*. The Computer Journal, **11**(2), pp.185-194, 1968.
- C. S. Wallace & D. M. Boulton. *An Invariant Bayes Method for Point Estimation*. Classification Soc. Bulletin, **3**, pp.11-34, 1975.
- C. S. Wallace & P. R. Freeman. *Estimation and Inference by Compact Coding*. J. Royal Stat. Soc. B, **49**(3), pp.240-265, 1987.
- R. Baxter. *Minimum Message Length Inductive Inference - Theory and Application*. PhD thesis, Dept. Computer Science, Monash University, Dec. 1996.
- G. Farr & C. S. Wallace. *The Complexity of Strict Minimum Message Length Inference*. TR97/321, Dept. Computer Science, Monash University, Aug. 1997.
- G. Farr. *Information Theory and MML Inference*. School of Computer Science and Software Engineering, Monash University, 1999.