Information Inequality
In this chapter, lower bounds $B(\theta)$ for the variance, that is, the smallest variance that can be attained by an unbiased estimator of
$g(\theta)$, are derived. Generally, these lower bounds are simple to
calculate. The performance of an unbiased estimator of $g(\theta)$ is judged by how close its variance comes to this lower bound for each $\theta \in \Theta$.
Regular Family of Distributions, Score Functions, and Fisher Information
Regularity Conditions
A family $\{f(x;\theta), \theta \in \Theta\}$ is called regular if, broadly speaking, (i) $\Theta$ is an open interval, (ii) the support of $f(x;\theta)$ does not depend on $\theta$, (iii) $\frac{\partial}{\partial \theta} \log f(x;\theta)$ exists for all $x$ and all $\theta$, and (iv) differentiation with respect to $\theta$ and integration with respect to $x$ can be interchanged. These conditions are used repeatedly below.
Score Functions
The first derivative of the log-likelihood function is called the
score function of the sample, denoted as:
`S(X, \theta) = \frac{\partial}{\partial \theta} \log f(X;\theta).`
It measures the sensitivity of the log-likelihood function to small changes in
the value of $\theta$.
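As a quick illustration (an example added here, not part of the original text), take a single observation $X \sim N(\theta, 1)$:
`\log f(X;\theta) = -\tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}(X-\theta)^2, \qquad S(X,\theta) = \frac{\partial}{\partial \theta} \log f(X;\theta) = X - \theta`
The score is large when the observed $X$ is far from the hypothesized $\theta$, and $E_{\theta}[S(X,\theta)] = 0$.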
Fisher Information
`S(x,\theta) = \frac{\partial}{\partial \theta} \log f(x;\theta)`
The variance of the score function measures the strength of the information about $\theta$ contained in the sample of observations. Since the score has mean zero, a small variance at a given value of $\theta$ means that the score is near $0$ for almost every sample, i.e., the log-likelihood is nearly flat there and the sample contains little information about the true value of $\theta$. Therefore, the variance of the score is the natural measure of the information that the sample contains about $\theta$.
`I_X(\theta) = E_{\theta}[S^2(X, \theta)] = \text{var}_{\theta}[S(X,\theta)]`
Therefore, the average squared relative rate of change of the density,
$E_{\theta} \left[ \left( \frac{\partial}{\partial \theta} \log f(X;\theta) \right)^2 \right]$, at a point $\theta$ measures how strongly the value of
$\theta$ can be distinguished from its neighboring values. This quantity is
denoted by $I(\theta)$.
A high value of $I(\theta)$ indicates that $\theta$ can be estimated more accurately from the sample observations $X$, and we expect to find an unbiased estimator $\hat{\theta}$ with smaller variance. So $I(\theta)$ measures the information that $X$ contains about the parameter $\theta$; it is known as the Fisher information.
`I_X(\theta) = E \left[ - \frac{\partial^2}{\partial \theta^2} \log L \right] = E \left[ \left( \frac{\partial}{\partial \theta} \log L \right)^2 \right]`
This is called R.A. Fisher’s measure, as it represents the amount of
information on $\theta$ supplied by the sample $(x_1, x_2, \dots, x_n)$. The
reciprocal $\frac{1}{I(\theta)}$ is the information limit, i.e., the lower bound, to the variance of an unbiased estimator $t = t(x_1, x_2, \dots, x_n)$.
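As an added worked example (not part of the original text), let $X_1, X_2, \dots, X_n$ be i.i.d. Bernoulli($\theta$):
`\log f(x;\theta) = x \log \theta + (1-x) \log(1-\theta), \qquad \frac{\partial^2}{\partial \theta^2} \log f(x;\theta) = -\frac{x}{\theta^2} - \frac{1-x}{(1-\theta)^2}`
`I_x(\theta) = -E \left[ \frac{\partial^2}{\partial \theta^2} \log f(X;\theta) \right] = \frac{\theta}{\theta^2} + \frac{1-\theta}{(1-\theta)^2} = \frac{1}{\theta(1-\theta)}, \qquad I_X(\theta) = \frac{n}{\theta(1-\theta)}`
The information per observation is largest when $\theta$ is near $0$ or $1$, where a single trial is most informative.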
Lower Bounds for Variance of Unbiased Estimator
- Rao and Cramér Lower Bound
- Bhattacharyya Lower Bound
- Chapman, Robbins, and Kiefer Lower Bound
Rao and Cramér Lower Bound
Suppose a family of PDFs $F = \{f(x;\theta), \theta \in \Theta\}$ satisfies
the regularity conditions. Let a random sample $X_1, X_2, \dots, X_n$ be drawn
from a population with PDF $f(x;\theta)$ in $F$, where $\theta$ is not known.
Let $S(X)$ be an unbiased estimator of $g(\theta)$ such that its second moment exists. Then:
`\text{var}_{\theta}[S(X)] \geq \frac{\left( \frac{d}{d\theta} g(\theta) \right)^2}{E_{\theta} \left[ \left( \frac{\partial}{\partial \theta} \log L \right)^2 \right]} = \frac{\left( g'(\theta) \right)^2}{I(\theta)}`
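For instance (continuing the Bernoulli illustration added above), take $g(\theta) = \theta$ and $S(X) = \bar{X}$, the sample proportion:
`\text{var}_{\theta}(\bar{X}) = \frac{\theta(1-\theta)}{n} = \frac{\left( g'(\theta) \right)^2}{I_X(\theta)}`
so the sample proportion attains the Rao and Cramér lower bound exactly.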
Remark
- Regularity conditions hold for an exponential family but are not necessarily true for a non-exponential family.
- The CRLB $B(\theta)$ depends only on the parametric function $g(\theta)$ and the joint density $f(x;\theta)$. The lower bound is therefore the same for every unbiased estimator of $g(\theta)$.
- CRLB in the i.i.d. case: If $X_1, X_2, \dots, X_n$ are i.i.d. from $f(x;\theta)$, then $I_X(\theta) = n I_x(\theta)$, and the CRLB $B(\theta)$ is given by:
`\text{var}_{\theta}[S(X)] \geq \frac{\left( g'(\theta) \right)^2}{n I_x(\theta)} = B(\theta)`
- Fisher’s information contained in the sample $X_1, X_2, \dots, X_n$ about the parameter $\theta$, $I_X(\theta)$, increases with the sample size $n$. Consequently, the lower bound on the variance of an unbiased estimator of $g(\theta)$ becomes smaller as $n$ grows.
- In some cases, even when the regularity conditions are satisfied and a UMVUE exists, the CRLB $B(\theta)$ is not sharp. In other words, the variance of the UMVUE fails to reach the CRLB, so the UMVUE is not most efficient in the CRLB sense. This may be considered a drawback of defining the most efficient estimator through the CRLB: in such situations one cannot decide whether to continue the search for an estimator that attains the CRLB or whether no estimator can attain it (see the example after this list).
- In cases where the regularity conditions are not satisfied, we cannot talk of the CRLB, even though UMVUEs may still exist.
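As an illustration of the sharpness remark (an example added here, not from the original list), let $X_1, \dots, X_n$ be i.i.d. $N(\theta, 1)$ and $g(\theta) = \theta^2$. Since $I_x(\theta) = 1$, the UMVUE $T = \bar{X}^2 - \tfrac{1}{n}$ satisfies
`E_{\theta}(T) = \theta^2, \qquad \text{var}_{\theta}(T) = \frac{4\theta^2}{n} + \frac{2}{n^2} > \frac{(2\theta)^2}{n} = B(\theta)`
so the UMVUE exceeds the CRLB by $2/n^2$ for every $\theta$, and hence no unbiased estimator of $\theta^2$ attains the bound.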
Definition (Most Efficient Estimator):
An unbiased estimator $S$ of $g(\theta)$ is said to be the most efficient estimator for a regular family of distributions $\{f(x;\theta), \theta \in \Theta\}$, if
`\text{var}_{\theta}(S) = \text{CRLB} = \frac{\left( g'(\theta) \right)^2}{I_X(\theta)}`
$S$ is then the best unbiased estimator of $g(\theta)$ in the sense that it achieves the minimum value of the average squared deviation $E_{\theta} [S - g(\theta)]^2$ for all $\theta$.
Definition (Efficiency of an Estimator):
The efficiency of an unbiased estimator $\delta$ of $g(\theta)$, for a regular family $\{f(x;\theta), \theta \in \Theta\}$, is defined by:
`e(\delta, \theta) = \frac{\text{CRLB}}{\text{var}_{\theta}(\delta)} = \frac{\left( g'(\theta) \right)^2}{I_X(\theta)\, \text{var}_{\theta}(\delta)}`
An estimator is better the higher its efficiency. In general, the efficiency of an unbiased estimator satisfies $e(\delta, \theta) \leq 1$, and when it attains $1$, the corresponding estimator is said to be the most efficient.
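As a small added illustration (not from the original text), let $X_1, \dots, X_n$ be i.i.d. $N(\theta, \sigma^2)$ with $\sigma^2$ known, and consider estimating $g(\theta) = \theta$. The estimator $\delta = X_1$ is unbiased with $\text{var}_{\theta}(X_1) = \sigma^2$, while the CRLB is $\sigma^2/n$, so
`e(X_1, \theta) = \frac{\sigma^2 / n}{\sigma^2} = \frac{1}{n}`
whereas $\bar{X}$ has variance $\sigma^2/n$ and efficiency $1$, i.e., it is the most efficient estimator of $\theta$.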
- If the score can be written as `S(x,\theta) = \frac{\partial}{\partial \theta} \log f(x; \theta) = c(\theta) [S(x) - g(\theta)]`, this is the condition of linearity between the score and the unbiased estimator $S(X)$ of $g(\theta)$. If the condition is satisfied, then $S(X)$ is not only the UMVUE but also the most efficient estimator (it attains the CR lower bound) of $g(\theta)$; see the sketch after this list.
- If an unbiased estimator attains the CRLB (i.e., it is the most efficient), then it is the MLE, but the converse is not necessarily true. MLEs are asymptotically CRLB estimators (most efficient).
- The MLE is not only consistent and asymptotically normal but also asymptotically most efficient.
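As a sketch of the linearity condition (an added example, not part of the original remarks), let $X_1, \dots, X_n$ be i.i.d. Poisson($\theta$):
`\frac{\partial}{\partial \theta} \log L = \sum_{i=1}^{n} \left( \frac{x_i}{\theta} - 1 \right) = \frac{n}{\theta} (\bar{x} - \theta)`
Here $c(\theta) = n/\theta$, $S(x) = \bar{x}$, and $g(\theta) = \theta$, so $\bar{X}$ attains the CRLB $\theta/n$, which is exactly its variance; it is also the MLE of $\theta$.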
Proof:
Let $X_1, X_2, \dots, X_n$ be a random sample from the pdf $f(x;\theta)$ and let $L$ be the likelihood function of the sample:
\[ L = L(x, \theta) = \prod_{i=1}^{n} f(x_i; \theta), \qquad \int L(x, \theta)\, dx = 1, \]
where $\int (\cdot)\, dx$ stands for $\int \dots \int (\cdot)\, dx_1 dx_2 \dots dx_n$.
Differentiating with respect to $\theta$ and using regularity conditions given
above, we get:
\[ \frac{\partial}{\partial \theta} \int L\, dx = 0 \Rightarrow \int \frac{\partial L}{\partial \theta}\, dx = 0 \Rightarrow \int \left( \frac{\partial}{\partial \theta} \log L \right) L\, dx = 0 \Rightarrow E \left( \frac{\partial}{\partial \theta} \log L \right) = 0. \]
Let $t = t(x_1, x_2, \dots, x_n)$ be an unbiased estimator of $g(\theta)$, so that
\[ E(t) = g(\theta) \Rightarrow \int t L\, dx = g(\theta). \]
Differentiating with respect to $\theta$, we get
\[ \int t \, \frac{\partial L}{\partial \theta}\, dx = g'(\theta) \Rightarrow \int t \left( \frac{\partial}{\partial \theta} \log L \right) L\, dx = g'(\theta) \Rightarrow E \left( t \cdot \frac{\partial}{\partial \theta} \log L \right) = g'(\theta). \]
Cramér-Rao Inequality
The covariance between the estimator $t$ and the score function is given by:
`\text{cov} \left( t, \frac{\partial}{\partial \theta} \log L \right) = E \left( t \cdot \frac{\partial}{\partial \theta} \log L \right) - E(t)\, E \left( \frac{\partial}{\partial \theta} \log L \right) = g'(\theta)`
since
`E \left( \frac{\partial}{\partial \theta} \log L \right) = 0, \quad E \left( t \cdot \frac{\partial}{\partial \theta} \log L \right) = g'(\theta)`
Using the Cauchy-Schwarz inequality (equivalently, the squared correlation is at most $1$):
`\rho^2 \left( t, \frac{\partial}{\partial \theta} \log L \right) \leq 1 \Rightarrow \left\{ \frac{\text{cov} \left( t, \frac{\partial}{\partial \theta} \log L \right) }{\sqrt{\text{var}(t) \cdot \text{var} \left( \frac{\partial}{\partial \theta} \log L \right) }} \right\}^2 \leq 1`
Since $E \left( \frac{\partial}{\partial \theta} \log L \right) = 0$, we have $\text{var} \left( \frac{\partial}{\partial \theta} \log L \right) = E \left[ \left( \frac{\partial}{\partial \theta} \log L \right)^2 \right]$, which leads to:
`\left\{ g'(\theta) \right\}^2 \leq \text{var}(t) \cdot E \left[ \left( \frac{\partial}{\partial \theta} \log L \right)^2 \right]`
which gives the Cramér-Rao lower bound:
`\text{var}(t) \geq \frac{\left( g'(\theta) \right)^2}{E \left[ \left( \frac{\partial}{\partial \theta} \log L \right)^2 \right]}`
Fisher Information
If $t$ is an unbiased estimator of the parameter $\theta$ itself, i.e.,
`E(t) = \theta \Rightarrow g(\theta) = \theta \Rightarrow g'(\theta) = 1`
then
`\text{var}(t) \geq \frac{1}{E \left[ \left( \frac{\partial}{\partial \theta} \log L \right)^2 \right]} = \frac{1}{I(\theta)}`
This is called R.A. Fisher’s information measure. The Fisher information is
defined as:
`I(\theta) = E \left\{ \left( \frac{\partial}{\partial \theta} \log L
\right)^2 \right\} = -E \left( \frac{\partial^2}{\partial \theta^2} \log L
\right)`
`I(\theta) = n\, E \left[ \left( \frac{\partial}{\partial \theta} \log f(x;\theta) \right)^2 \right] = -n\, E \left( \frac{\partial^2}{\partial \theta^2} \log f(x;\theta) \right)`
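The factor $n$ reflects the additivity of information over independent observations (a short added justification):
`\log L = \sum_{i=1}^{n} \log f(x_i;\theta) \Rightarrow -E \left( \frac{\partial^2}{\partial \theta^2} \log L \right) = \sum_{i=1}^{n} \left[ -E \left( \frac{\partial^2}{\partial \theta^2} \log f(x_i;\theta) \right) \right] = n\, I_x(\theta)`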
An unbiased estimator $t$ of $g(\theta)$ for which the Cramér-Rao lower bound is attained is called a minimum variance bound (MVB) estimator. An MVB estimator for $g(\theta)$ exists if and only if the score has the linear form $\frac{\partial}{\partial \theta} \log L = c(\theta)\,[t - g(\theta)]$ given above; in particular, its existence requires a sufficient statistic for $\theta$.
As $n$ gets larger, the lower bound for $\text{var}_{\theta}(T(X))$ gets smaller: as the Fisher information increases, the lower bound decreases, so the best unbiased estimator can have a smaller variance. A larger sample thus carries more information about $\theta$.
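A minimal simulation sketch of this behavior (added here, assuming Python with numpy; the parameter values are illustrative) checks that the variance of the sample proportion for Bernoulli($\theta$) tracks the CRLB $\theta(1-\theta)/n$ and shrinks as $n$ grows:

```python
# Monte Carlo check: variance of the sample proportion vs. the CRLB theta*(1-theta)/n.
# Assumptions: theta, reps, and the sample sizes below are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3      # true parameter used to simulate the data
reps = 20000     # number of simulated samples per sample size

for n in (10, 50, 250):
    samples = rng.binomial(1, theta, size=(reps, n))  # reps Bernoulli samples of size n
    xbar = samples.mean(axis=1)                       # unbiased estimator of theta
    emp_var = xbar.var()                              # Monte Carlo variance of the estimator
    crlb = theta * (1 - theta) / n                    # Cramér-Rao lower bound
    print(f"n={n:4d}  empirical var={emp_var:.5f}  CRLB={crlb:.5f}")
```

For each $n$, the empirical variance should agree with the bound up to Monte Carlo error, and both shrink at rate $1/n$.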