Information Inequality
  In this chapter, lower bounds $B(\theta)$ for the variance of an unbiased
  estimator of $g(\theta)$ are derived; $B(\theta)$ is the smallest variance
  that can be attained by any such estimator. Generally, these lower bounds are
  simple to calculate. The performance of an unbiased estimator is then judged
  by how close its variance comes to $B(\theta)$ for each $\theta \in \Theta$.
  Regular Family of Distributions, Score Functions, and Fisher Information
Regularity Conditions
  The family $F = \{f(x;\theta), \theta \in \Theta\}$ is said to be regular if,
  broadly: $\Theta$ is an open interval; the support of $f(x;\theta)$ does not
  depend on $\theta$; $\frac{\partial}{\partial \theta} \log f(x;\theta)$ exists
  for all $x$ and $\theta$; differentiation with respect to $\theta$ and
  integration with respect to $x$ can be interchanged; and the Fisher
  information introduced below is finite and positive. These are the regularity
  conditions referred to throughout this chapter.
Score Functions
  The first derivative of the log-likelihood function is called the
  score function of the sample, denoted as:
  `S(X, \theta) = \frac{\partial}{\partial \theta} \log f(X;\theta).`
  It measures the sensitivity of the log-likelihood function to small changes in
  the value of $\theta$.
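  For example, for a single observation from $N(\theta, 1)$ the density is
  $f(x;\theta) = (2\pi)^{-1/2} e^{-(x-\theta)^2/2}$, so
  `S(x,\theta) = \frac{\partial}{\partial \theta} \left[ -\tfrac{1}{2}\log(2\pi)
  - \tfrac{(x-\theta)^2}{2} \right] = x - \theta,`
  which is large in magnitude when the observed $x$ lies far from the
  hypothesised value of $\theta$.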
Fisher Information
  The variance of the score function measures the strength of the information
  contained in the sample of observations about $\theta$. A small variance for a
  given value of $\theta$ indicates that the scores of most samples lie near
  $0$, so that those samples carry little information about the true value of
  $\theta$. Therefore, the variance of the score is the natural measure of the
  information that the sample contains about $\theta$:
  `I_X(\theta) = E_{\theta}[S^2(X, \theta)] = \text{var}_{\theta}[S(X,\theta)],`
  since $E_{\theta}[S(X,\theta)] = 0$ under the regularity conditions.
  Therefore, the expected squared relative rate of change of the density,
  $E_{\theta} \left[ \left( \frac{\partial}{\partial \theta} \log f(X;\theta)
  \right)^2 \right]$, at a point $\theta$ measures the strength with which the
  value of $\theta$ can be distinguished from its neighbouring values. This
  quantity is denoted by $I(\theta)$.
  A high value of $I(\theta)$ indicates that $\theta$ can be estimated more
  accurately from the sample observations $X$; we then expect to find an
  unbiased estimator $\hat{\theta}$ with a smaller variance. Thus $I(\theta)$
  measures the information that $X$ contains about the parameter $\theta$, and
  it is known as the Fisher information.
  `I_X(\theta) = E_{\theta} \left[ - \frac{\partial^2}{\partial \theta^2} \log L
  \right] = E_{\theta} \left[ \left( \frac{\partial}{\partial \theta} \log L
  \right)^2 \right],`
  where $L = \prod_{i=1}^{n} f(x_i;\theta)$ is the likelihood of the sample. The
  two expressions agree because differentiating $E_{\theta} \left[
  \frac{\partial}{\partial \theta} \log L \right] = 0$ once more under the
  integral sign gives $E_{\theta} \left[ \frac{\partial^2}{\partial \theta^2}
  \log L \right] + E_{\theta} \left[ \left( \frac{\partial}{\partial \theta}
  \log L \right)^2 \right] = 0$.
  This is called R. A. Fisher’s measure of information, as it represents the
  amount of information on $\theta$ supplied by the sample $(x_1, x_2, \dots,
  x_n)$. Its reciprocal $\frac{1}{I(\theta)}$ gives the corresponding lower
  bound for the variance of an unbiased estimator $t = t(x_1, x_2, \dots, x_n)$
  of $\theta$.
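  For example, for a single observation from a Poisson($\theta$) distribution,
  $\log f(x;\theta) = x \log \theta - \theta - \log x!$, so
  `\frac{\partial^2}{\partial \theta^2} \log f(x;\theta) = -\frac{x}{\theta^2},
  \qquad I_x(\theta) = E_{\theta}\left[ \frac{X}{\theta^2} \right] =
  \frac{1}{\theta},`
  and for a random sample of size $n$ the sample information is
  $I_X(\theta) = n/\theta$.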
  Lower Bounds for Variance of Unbiased Estimator
  - Cramér-Rao Lower Bound
  - Bhattacharyya Lower Bound
  - Chapman-Robbins-Kiefer Lower Bound
Cramér-Rao Lower Bound
  Suppose a family of PDFs $F = \{f(x;\theta), \theta \in \Theta\}$ satisfies
  the regularity conditions. Let a random sample $X_1, X_2, \dots, X_n$ be drawn
  from a population with PDF $f(x;\theta)$ in $F$, where $\theta$ is unknown.
  Let $S(X)$ be an unbiased estimator of $g(\theta)$ whose second moment exists.
  Then:
  `\text{var}_{\theta}[S(X)] \geq \frac{\left( \frac{d}{d\theta} g(\theta)
  \right)^2}{E_{\theta} \left[ \left( \frac{\partial}{\partial \theta} \log L
  \right)^2 \right]} = \frac{\left( g'(\theta) \right)^2}{I(\theta)}.`
Remark
  - Regularity conditions hold for an exponential family but are not necessarily
    true for a non-exponential family.
  - The CRLB $B(\theta)$ depends only on the parametric function $g(\theta)$ and
    the joint density $f(x;\theta)$; it is therefore the same lower bound for
    every unbiased estimator of $g(\theta)$.
  - CRLB in the i.i.d. case: If $X_1, X_2, \dots, X_n$ are i.i.d. from
    $f(x;\theta)$, then, by the additivity of Fisher information,
    $I_X(\theta) = n I_x(\theta)$. In this case, the CRLB $B(\theta)$ is given
    by:
    `\text{var}_{\theta}[S(X)] \geq \frac{\left( g'(\theta) \right)^2}{n
    I_x(\theta)} = B(\theta).`
  - Fisher’s information $I_X(\theta)$ contained in the sample $X_1, X_2, \dots,
    X_n$ about the parameter $\theta$ increases with the sample size $n$.
    Consequently, the lower bound on the variance of an unbiased estimator of
    $g(\theta)$ becomes smaller as $n$ increases.
  - In some cases the regularity conditions are satisfied and a UMVUE exists,
    yet the CRLB $B(\theta)$ is not sharp: the variance of the UMVUE fails to
    reach the CRLB, so the UMVUE is not most efficient in the CRLB sense. This
    may be considered a drawback of defining the most efficient estimator
    through the CRLB. In such situations one cannot tell whether to continue
    searching for an estimator that attains the CRLB or whether no estimator can
    attain it at all.
  - In cases where the regularity conditions are not satisfied, we cannot speak
    of the CRLB, even though a UMVUE may still exist (see the example below).
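  Example (non-regular case): for the uniform family $U(0,\theta)$ the support
  of $f(x;\theta) = 1/\theta$, $0 < x < \theta$, depends on $\theta$, so the
  regularity conditions fail. The UMVUE of $\theta$ is $\frac{n+1}{n} X_{(n)}$,
  with variance $\frac{\theta^2}{n(n+2)}$, which is smaller than the value
  $\theta^2/n$ obtained by formally applying the CRLB formula; the inequality
  simply does not apply to this family.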
  
Definition (Most Efficient Estimator):
  An unbiased estimator $S(X)$ of $g(\theta)$ is said to be the most efficient
  estimator for a regular family of distributions $\{f(x;\theta), \theta \in
  \Theta\}$ if
  `\text{var}_{\theta}[S(X)] = \text{CRLB} = \frac{\left( g'(\theta)
  \right)^2}{I_X(\theta)}.`
  $S(X)$ is then the best unbiased estimator of $g(\theta)$ in the sense that it
  achieves the minimum value of the average squared deviation
  $E_{\theta} [S(X) - g(\theta)]^2$ for all $\theta$.
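  For instance, for a random sample from $N(\theta, \sigma^2)$ with $\sigma^2$
  known, $I_X(\theta) = n/\sigma^2$ and $\text{var}_{\theta}(\bar{X}) =
  \sigma^2/n = \text{CRLB}$, so $\bar{X}$ is the most efficient estimator of
  $\theta$.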
Definition (Efficiency of an Estimator):
  The efficiency of an unbiased estimator $\delta$ of $g(\theta)$, for a regular
  family $\{f(x;\theta), \theta \in \Theta\}$, is defined by:
  `e(\delta, \theta) = \frac{\text{CRLB}}{\text{var}_{\theta}(\delta)} =
  \frac{\left( g'(\theta) \right)^2}{I_X(\theta)\, \text{var}_{\theta}(\delta)}.`
  Estimators become better as their efficiencies increase. In general the
  efficiency of an unbiased estimator satisfies $e(\delta, \theta) \leq 1$, and
  when it equals $1$ the corresponding estimator is the most efficient.
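  As an illustration, for large samples from $N(\theta, \sigma^2)$ the sample
  median $\tilde{X}$ is unbiased for $\theta$ with variance approximately
  $\pi \sigma^2 / (2n)$, so
  `e(\tilde{X}, \theta) \approx \frac{\sigma^2/n}{\pi \sigma^2/(2n)} =
  \frac{2}{\pi} \approx 0.64,`
  whereas $e(\bar{X}, \theta) = 1$.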
  - If the score function of the sample can be written as
    `S(x,\theta) = \frac{\partial}{\partial \theta} \log L(x; \theta) =
    c(\theta) [S(x) - g(\theta)],`
    i.e., if the score is a linear function of an unbiased estimator $S(X)$ of
    $g(\theta)$, then $S(X)$ is not only the UMVUE but also the most efficient
    estimator of $g(\theta)$ (it attains the CR lower bound); an illustration is
    given after this list.
  - If an unbiased estimator attains the CRLB (i.e., it is the most efficient),
    then it is the MLE, but the converse is not necessarily true. MLEs are
    asymptotically CRLB-attaining (most efficient) estimators.
  - The MLE is not only consistent and asymptotically normal but also
    asymptotically most efficient.
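  As an illustration of the linearity condition, let $X_1, \dots, X_n$ be a
  random sample from the exponential density $f(x;\theta) = \theta^{-1}
  e^{-x/\theta}$, $x > 0$. Then
  `\frac{\partial}{\partial \theta} \log L = \sum_{i=1}^{n} \left(
  -\frac{1}{\theta} + \frac{x_i}{\theta^2} \right) = \frac{n}{\theta^2}
  \left( \bar{x} - \theta \right),`
  which is of the required form with $c(\theta) = n/\theta^2$, $S(x) = \bar{x}$,
  and $g(\theta) = \theta$; hence $\bar{X}$ attains the CRLB $\theta^2/n$ for
  estimating $\theta$.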
  
Proof:
  Let $X_1, X_2, \dots, X_n$ be a random sample from the pdf $f(x; \theta)$ and
  let $L$ be the likelihood function of the sample:
  \[ L = L(x; \theta) = \prod_{i=1}^{n} f(x_i; \theta), \qquad \int L(x; \theta)
  \, dx = 1, \]
  where $\int (\cdot)\, dx$ denotes the $n$-fold integral
  $\int \dots \int (\cdot)\, dx_1\, dx_2 \dots dx_n$.
  Differentiating with respect to $\theta$ and using regularity conditions given
  above, we get:
  \[ \frac{\partial}{\partial \theta} \int L\, dx = 0 \;\Rightarrow\; \int
  \frac{\partial L}{\partial \theta}\, dx = 0 \;\Rightarrow\; \int \left(
  \frac{\partial}{\partial \theta} \log L \right) L\, dx = 0 \;\Rightarrow\; E
  \left( \frac{\partial}{\partial \theta} \log L \right) = 0. \]
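  (For instance, for a random sample from $N(\theta, 1)$,
  $\frac{\partial}{\partial \theta} \log L = \sum_{i=1}^{n} (x_i - \theta)$,
  whose expectation is indeed $0$.)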
  Let $t = t(x_1, x_2, \dots, x_n)$ be an unbiased estimator of $g(\theta)$ such
  that
  \[ E(t) = g(\theta) \;\Rightarrow\; \int t\, L\, dx = g(\theta). \]
Differentiating w.r.t $\theta$, we get
  \[ \int t \, \frac{\partial L}{\partial \theta}\, dx = g'(\theta)
  \;\Rightarrow\; \int t \left( \frac{\partial}{\partial \theta} \log L \right)
  L\, dx = g'(\theta) \;\Rightarrow\; E \left( t \cdot
  \frac{\partial}{\partial \theta} \log L \right) = g'(\theta). \]
Cramér-Rao Inequality
  The covariance between an estimator $t$ and the score function is given by:
  `\text{cov} \left( t, \frac{\partial}{\partial \theta} \log L \right) = E
  \left( t \cdot \frac{\partial}{\partial \theta} \log L \right) - E(t)\, E
  \left( \frac{\partial}{\partial \theta} \log L \right) = g'(\theta),`
  since, from the steps above,
  `E \left( \frac{\partial}{\partial \theta} \log L \right) = 0, \quad E \left(
  t \cdot \frac{\partial}{\partial \theta} \log L \right) = g'(\theta).`
Using the Cauchy-Schwarz inequality:
  `\rho^2 \left( t, \frac{\partial}{\partial \theta} \log L \right) \leq 1
  \Rightarrow \left\{ \frac{\text{cov} \left( t,
  \frac{\partial}{\partial \theta} \log L \right) }{\sqrt{\text{var}(t) \cdot
  \text{var} \left( \frac{\partial}{\partial \theta} \log L \right) }}
  \right\}^2 \leq 1`
which leads to:
  `\left\{ g'(\theta) \right\}^2 \leq \text{var}(t) \cdot \text{var} \left(
  \frac{\partial}{\partial \theta} \log L \right) = \text{var}(t) \cdot E \left(
  \left( \frac{\partial}{\partial \theta} \log L \right)^2 \right)`
  (the last equality holds because $E \left( \frac{\partial}{\partial \theta}
  \log L \right) = 0$),
which gives the Cramér-Rao lower bound:
  `\text{var}(t) \geq \frac{\left( g'(\theta) \right)^2}{E \left( \left(
  \frac{\partial}{\partial \theta} \log L \right)^2 \right)}.`
  Equality holds in the Cauchy-Schwarz step if and only if $t$ and
  $\frac{\partial}{\partial \theta} \log L$ are linearly related, which is
  precisely the attainment condition noted earlier.
Fisher Information
If $ t $ is an unbiased estimator of parameter $ \theta $, i.e.,
  `E(t) = \theta \Rightarrow g(\theta) = \theta \Rightarrow g'(\theta) = 1,`
  so that
  `\text{var}(t) \geq \frac{1}{E \left( \left( \frac{\partial}{\partial \theta}
  \log L \right)^2 \right)} = \frac{1}{I(\theta)}.`
  This is called R.A. Fisher’s information measure. The Fisher information is
  defined as:
  `I(\theta) = E \left\{ \left( \frac{\partial}{\partial \theta} \log L
  \right)^2 \right\} = -E \left( \frac{\partial^2}{\partial \theta^2} \log L
  \right)`
  `I(\theta) = n\, E \left[ \left( \frac{\partial}{\partial \theta} \log
  f(x;\theta) \right)^2 \right] = -n\, E \left[ \frac{\partial^2}{\partial
  \theta^2} \log f(x;\theta) \right]`
  for an i.i.d. sample of size $n$.
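  For example, for a single Bernoulli($\theta$) observation, $\log f(x;\theta) =
  x \log \theta + (1-x) \log(1-\theta)$, and both expressions give the same
  value:
  `-E \left( \frac{\partial^2}{\partial \theta^2} \log f \right) = E \left(
  \frac{X}{\theta^2} + \frac{1-X}{(1-\theta)^2} \right) =
  \frac{1}{\theta(1-\theta)},`
  so $I(\theta) = n/\{\theta(1-\theta)\}$ and the bound for an unbiased
  estimator of $\theta$ is $\theta(1-\theta)/n$, which is attained by $\bar{X}$.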
  An unbiased estimator $t$ of $g(\theta)$ for which the Cramér-Rao lower bound
  is attained is called a minimum variance bound (MVB) estimator. An MVB
  estimator of $g(\theta)$ exists if and only if the score function has the
  linear form $\frac{\partial}{\partial \theta} \log L = c(\theta)\,[t -
  g(\theta)]$; in that case $t$ is also a sufficient statistic for $\theta$.
As $n$ gets larger, the lower bound for $\text{var}_{\theta}(t)$ gets smaller. That is, as the Fisher information increases, the lower bound decreases, so the best unbiased estimator can have a smaller variance; the sample then carries more information about $\theta$.
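  The shrinking of the bound with $n$ can be checked numerically. The following
  is a minimal Monte Carlo sketch, assuming a Poisson($\theta$) model with
  illustrative values of $\theta$ and of the sample sizes (using numpy); the
  estimator is $\bar{X}$, whose variance should match the CRLB $\theta/n$.

```python
# Monte Carlo check of the Cramer-Rao lower bound for the Poisson mean.
# Assumed setup for illustration: X_1,...,X_n ~ Poisson(theta), estimator
# t = X-bar, Fisher information I(theta) = n/theta, so the CRLB is theta/n.
import numpy as np

rng = np.random.default_rng(0)
theta = 3.0      # true parameter (illustrative choice)
reps = 200_000   # number of simulated samples per sample size

for n in (5, 20, 80):
    samples = rng.poisson(theta, size=(reps, n))
    xbar = samples.mean(axis=1)   # unbiased estimator of theta
    emp_var = xbar.var()          # empirical variance of X-bar over replications
    crlb = theta / n              # Cramer-Rao lower bound theta/n
    print(f"n={n:3d}  var(X-bar)={emp_var:.5f}  CRLB={crlb:.5f}")
```

  For every $n$ the empirical variance of $\bar{X}$ should lie close to the
  bound, and both decrease as $n$ grows.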