Information Inequality
  In this chapter, lower bounds $B(\theta)$ for the variance of an unbiased
  estimator of $g(\theta)$ are derived; $B(\theta)$ is the smallest variance
  that can be attained by any such estimator. Generally, these lower bounds are
  simple to calculate. The performance of an unbiased estimator is then judged
  by how close its variance comes to $B(\theta)$ for each $\theta \in \Theta$.
  Regular Family of Distributions, Score Functions, and Fisher Information
Regularity Conditions
  The family $F = \{f(x;\theta), \theta \in \Theta\}$ is said to be regular if,
  broadly: $\Theta$ is an open interval; the support of $f(x;\theta)$ does not
  depend on $\theta$; $\frac{\partial}{\partial \theta} \log f(x;\theta)$ exists
  for all $x$ and $\theta$; differentiation with respect to $\theta$ and
  integration with respect to $x$ can be interchanged; and the Fisher
  information introduced below is finite and positive. These are the regularity
  conditions referred to throughout this chapter.
Score Functions
  The first derivative of the log-likelihood function is called the
  score function of the sample, denoted as:
  `S(X, \theta) = \frac{\partial}{\partial \theta} \log f(X;\theta).`
  It measures the sensitivity of the log-likelihood function to small changes in
  the value of $\theta$.
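  For example, for a single observation from $N(\theta, 1)$ the density is
  $f(x;\theta) = (2\pi)^{-1/2} e^{-(x-\theta)^2/2}$, so
  `S(x,\theta) = \frac{\partial}{\partial \theta} \left[ -\tfrac{1}{2}\log(2\pi)
  - \tfrac{(x-\theta)^2}{2} \right] = x - \theta,`
  which is large in magnitude when the observed $x$ lies far from the
  hypothesised value of $\theta$.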
Fisher Information
  The variance of the score function measures the strength of the information
  contained in the sample of observations about $\theta$. A small variance for a
  given value of $\theta$ indicates that the scores of most samples lie near
  $0$, so that those samples carry little information about the true value of
  $\theta$. Therefore, the variance of the score is the natural measure of the
  information that the sample contains about $\theta$:
  `I_X(\theta) = E_{\theta}[S^2(X, \theta)] = \text{var}_{\theta}[S(X,\theta)],`
  since $E_{\theta}[S(X,\theta)] = 0$ under the regularity conditions.
  Therefore, the expected squared relative rate of change of the density,
  $E_{\theta} \left[ \left( \frac{\partial}{\partial \theta} \log f(X;\theta)
  \right)^2 \right]$, at a point $\theta$ measures the strength with which the
  value of $\theta$ can be distinguished from its neighbouring values. This
  quantity is denoted by $I(\theta)$.
  A high value of $I(\theta)$ indicates that $\theta$ can be estimated more
  accurately from the sample observations $X$; we then expect to find an
  unbiased estimator $\hat{\theta}$ with a smaller variance. Thus $I(\theta)$
  measures the information that $X$ contains about the parameter $\theta$, and
  it is known as the Fisher information.
  `I_X(\theta) = E_{\theta} \left[ - \frac{\partial^2}{\partial \theta^2} \log L
  \right] = E_{\theta} \left[ \left( \frac{\partial}{\partial \theta} \log L
  \right)^2 \right],`
  where $L = \prod_{i=1}^{n} f(x_i;\theta)$ is the likelihood of the sample. The
  two expressions agree because differentiating $E_{\theta} \left[
  \frac{\partial}{\partial \theta} \log L \right] = 0$ once more under the
  integral sign gives $E_{\theta} \left[ \frac{\partial^2}{\partial \theta^2}
  \log L \right] + E_{\theta} \left[ \left( \frac{\partial}{\partial \theta}
  \log L \right)^2 \right] = 0$.
  This is called R. A. Fisher’s measure of information, as it represents the
  amount of information on $\theta$ supplied by the sample $(x_1, x_2, \dots,
  x_n)$. Its reciprocal $\frac{1}{I(\theta)}$ gives the corresponding lower
  bound for the variance of an unbiased estimator $t = t(x_1, x_2, \dots, x_n)$
  of $\theta$.
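  For example, for a single observation from a Poisson($\theta$) distribution,
  $\log f(x;\theta) = x \log \theta - \theta - \log x!$, so
  `\frac{\partial^2}{\partial \theta^2} \log f(x;\theta) = -\frac{x}{\theta^2},
  \qquad I_x(\theta) = E_{\theta}\left[ \frac{X}{\theta^2} \right] =
  \frac{1}{\theta},`
  and for a random sample of size $n$ the sample information is
  $I_X(\theta) = n/\theta$.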
  Lower Bounds for Variance of Unbiased Estimator
  - Cramér-Rao Lower Bound
  - Bhattacharyya Lower Bound
  - Chapman-Robbins-Kiefer Lower Bound
Cramér-Rao Lower Bound
  Suppose a family of PDFs $F = \{f(x;\theta), \theta \in \Theta\}$ satisfies
  the regularity conditions. Let a random sample $X_1, X_2, \dots, X_n$ be drawn
  from a population with PDF $f(x;\theta)$ in $F$, where $\theta$ is unknown.
  Let $S(X)$ be an unbiased estimator of $g(\theta)$ whose second moment exists.
  Then:
  `\text{var}_{\theta}[S(X)] \geq \frac{\left( \frac{d}{d\theta} g(\theta)
  \right)^2}{E_{\theta} \left[ \left( \frac{\partial}{\partial \theta} \log L
  \right)^2 \right]} = \frac{\left( g'(\theta) \right)^2}{I(\theta)}.`
Remark
  - Regularity conditions hold for an exponential family but are not necessarily
    true for a non-exponential family.
  - The CRLB $B(\theta)$ depends only on the parametric function $g(\theta)$ and
    the joint density $f(x;\theta)$; it is therefore the same lower bound for
    every unbiased estimator of $g(\theta)$.
  - CRLB in the i.i.d. case: If $X_1, X_2, \dots, X_n$ are i.i.d. from
    $f(x;\theta)$, then, by the additivity of Fisher information,
    $I_X(\theta) = n I_x(\theta)$. In this case, the CRLB $B(\theta)$ is given
    by:
    `\text{var}_{\theta}[S(X)] \geq \frac{\left( g'(\theta) \right)^2}{n
    I_x(\theta)} = B(\theta).`
  - Fisher’s information $I_X(\theta)$ contained in the sample $X_1, X_2, \dots,
    X_n$ about the parameter $\theta$ increases with the sample size $n$.
    Consequently, the lower bound on the variance of an unbiased estimator of
    $g(\theta)$ becomes smaller as $n$ increases.
  - In some cases the regularity conditions are satisfied and a UMVUE exists,
    yet the CRLB $B(\theta)$ is not sharp: the variance of the UMVUE fails to
    reach the CRLB, so the UMVUE is not most efficient in the CRLB sense. This
    may be considered a drawback of defining the most efficient estimator
    through the CRLB. In such situations one cannot tell whether to continue
    searching for an estimator that attains the CRLB or whether no estimator can
    attain it at all.
  - In cases where the regularity conditions are not satisfied, we cannot speak
    of the CRLB, even though a UMVUE may still exist (see the example below).
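  Example (non-regular case): for the uniform family $U(0,\theta)$ the support
  of $f(x;\theta) = 1/\theta$, $0 < x < \theta$, depends on $\theta$, so the
  regularity conditions fail. The UMVUE of $\theta$ is $\frac{n+1}{n} X_{(n)}$,
  with variance $\frac{\theta^2}{n(n+2)}$, which is smaller than the value
  $\theta^2/n$ obtained by formally applying the CRLB formula; the inequality
  simply does not apply to this family.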
  
Definition (Most Efficient Estimator):
  An unbiased estimator $S(X)$ of $g(\theta)$ is said to be the most efficient
  estimator for a regular family of distributions $\{f(x;\theta), \theta \in
  \Theta\}$ if
  `\text{var}_{\theta}[S(X)] = \text{CRLB} = \frac{\left( g'(\theta)
  \right)^2}{I_X(\theta)}.`
  $S(X)$ is then the best unbiased estimator of $g(\theta)$ in the sense that it
  achieves the minimum value of the average squared deviation
  $E_{\theta} [S(X) - g(\theta)]^2$ for all $\theta$.
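  For instance, for a random sample from $N(\theta, \sigma^2)$ with $\sigma^2$
  known, $I_X(\theta) = n/\sigma^2$ and $\text{var}_{\theta}(\bar{X}) =
  \sigma^2/n = \text{CRLB}$, so $\bar{X}$ is the most efficient estimator of
  $\theta$.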
Definition (Efficiency of an Estimator):
  The efficiency of an unbiased estimator $\delta$ of $g(\theta)$, for a regular
  family $\{f(x;\theta), \theta \in \Theta\}$, is defined by:
  `e(\delta, \theta) = \frac{\text{CRLB}}{\text{var}_{\theta}(\delta)} =
  \frac{\left( g'(\theta) \right)^2}{I_X(\theta)\, \text{var}_{\theta}(\delta)}.`
  Estimators become better as their efficiencies increase. In general the
  efficiency of an unbiased estimator satisfies $e(\delta, \theta) \leq 1$, and
  when it equals $1$ the corresponding estimator is the most efficient.
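  As an illustration, for large samples from $N(\theta, \sigma^2)$ the sample
  median $\tilde{X}$ is unbiased for $\theta$ with variance approximately
  $\pi \sigma^2 / (2n)$, so
  `e(\tilde{X}, \theta) \approx \frac{\sigma^2/n}{\pi \sigma^2/(2n)} =
  \frac{2}{\pi} \approx 0.64,`
  whereas $e(\bar{X}, \theta) = 1$.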
  - If the score function of the sample can be written as
    `S(x,\theta) = \frac{\partial}{\partial \theta} \log L(x; \theta) =
    c(\theta) [S(x) - g(\theta)],`
    i.e., if the score is a linear function of an unbiased estimator $S(X)$ of
    $g(\theta)$, then $S(X)$ is not only the UMVUE but also the most efficient
    estimator of $g(\theta)$ (it attains the CR lower bound); an illustration is
    given after this list.
  - If an unbiased estimator attains the CRLB (i.e., it is the most efficient),
    then it is the MLE, but the converse is not necessarily true. MLEs are
    asymptotically CRLB-attaining (most efficient) estimators.
  - The MLE is not only consistent and asymptotically normal but also
    asymptotically most efficient.
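  As an illustration of the linearity condition, let $X_1, \dots, X_n$ be a
  random sample from the exponential density $f(x;\theta) = \theta^{-1}
  e^{-x/\theta}$, $x > 0$. Then
  `\frac{\partial}{\partial \theta} \log L = \sum_{i=1}^{n} \left(
  -\frac{1}{\theta} + \frac{x_i}{\theta^2} \right) = \frac{n}{\theta^2}
  \left( \bar{x} - \theta \right),`
  which is of the required form with $c(\theta) = n/\theta^2$, $S(x) = \bar{x}$,
  and $g(\theta) = \theta$; hence $\bar{X}$ attains the CRLB $\theta^2/n$ for
  estimating $\theta$.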
  
Proof:
  Let $X_1, X_2, \dots, X_n$ be a random sample from the pdf $f(x; \theta)$ and
  let $L$ be the likelihood function of the sample:
  \[ L = L(x; \theta) = \prod_{i=1}^{n} f(x_i; \theta), \qquad \int L(x; \theta)
  \, dx = 1, \]
  where $\int (\cdot)\, dx$ denotes the $n$-fold integral
  $\int \dots \int (\cdot)\, dx_1\, dx_2 \dots dx_n$.
  Differentiating with respect to $\theta$ and using regularity conditions given
  above, we get:
  \[ \frac{\partial}{\partial \theta} \int L\, dx = 0 \;\Rightarrow\; \int
  \frac{\partial L}{\partial \theta}\, dx = 0 \;\Rightarrow\; \int \left(
  \frac{\partial}{\partial \theta} \log L \right) L\, dx = 0 \;\Rightarrow\; E
  \left( \frac{\partial}{\partial \theta} \log L \right) = 0. \]
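  (For instance, for a random sample from $N(\theta, 1)$,
  $\frac{\partial}{\partial \theta} \log L = \sum_{i=1}^{n} (x_i - \theta)$,
  whose expectation is indeed $0$.)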
  Let $t = t(x_1, x_2, \dots, x_n)$ be an unbiased estimator of $g(\theta)$ such
  that
  \[ E(t) = g(\theta) \;\Rightarrow\; \int t\, L\, dx = g(\theta). \]
Differentiating w.r.t $\theta$, we get
  \[ \int t \, \frac{\partial L}{\partial \theta}\, dx = g'(\theta)
  \;\Rightarrow\; \int t \left( \frac{\partial}{\partial \theta} \log L \right)
  L\, dx = g'(\theta) \;\Rightarrow\; E \left( t \cdot
  \frac{\partial}{\partial \theta} \log L \right) = g'(\theta). \]
Cramér-Rao Inequality
  The covariance between an estimator $t$ and the score function is given by:
  `\text{cov} \left( t, \frac{\partial}{\partial \theta} \log L \right) = E
  \left( t \cdot \frac{\partial}{\partial \theta} \log L \right) - E(t)\, E
  \left( \frac{\partial}{\partial \theta} \log L \right) = g'(\theta),`
  since, from the steps above,
  `E \left( \frac{\partial}{\partial \theta} \log L \right) = 0, \quad E \left(
  t \cdot \frac{\partial}{\partial \theta} \log L \right) = g'(\theta).`
Using the Cauchy-Schwarz inequality:
  `\rho^2 \left( t, \frac{\partial}{\partial \theta} \log L \right) \leq 1
  \Rightarrow \left\{ \frac{\text{cov} \left( t,
  \frac{\partial}{\partial \theta} \log L \right) }{\sqrt{\text{var}(t) \cdot
  \text{var} \left( \frac{\partial}{\partial \theta} \log L \right) }}
  \right\}^2 \leq 1`
which leads to:
  `\left\{ g'(\theta) \right\}^2 \leq \text{var}(t) \cdot \text{var} \left(
  \frac{\partial}{\partial \theta} \log L \right) = \text{var}(t) \cdot E \left(
  \left( \frac{\partial}{\partial \theta} \log L \right)^2 \right)`
  (the last equality holds because $E \left( \frac{\partial}{\partial \theta}
  \log L \right) = 0$),
which gives the Cramér-Rao lower bound:
  `\text{var}(t) \geq \frac{\left( g'(\theta) \right)^2}{E \left( \left(
  \frac{\partial}{\partial \theta} \log L \right)^2 \right)}.`
  Equality holds in the Cauchy-Schwarz step if and only if $t$ and
  $\frac{\partial}{\partial \theta} \log L$ are linearly related, which is
  precisely the attainment condition noted earlier.
Fisher Information
If $ t $ is an unbiased estimator of parameter $ \theta $, i.e.,
  `E(t) = \theta \Rightarrow g(\theta) = \theta \Rightarrow g'(\theta) = 1,`
  so that
  `\text{var}(t) \geq \frac{1}{E \left( \left( \frac{\partial}{\partial \theta}
  \log L \right)^2 \right)} = \frac{1}{I(\theta)}.`
  This is called R.A. Fisher’s information measure. The Fisher information is
  defined as:
  `I(\theta) = E \left\{ \left( \frac{\partial}{\partial \theta} \log L
  \right)^2 \right\} = -E \left( \frac{\partial^2}{\partial \theta^2} \log L
  \right)`
  `I(\theta) = n\, E \left[ \left( \frac{\partial}{\partial \theta} \log
  f(x;\theta) \right)^2 \right] = -n\, E \left[ \frac{\partial^2}{\partial
  \theta^2} \log f(x;\theta) \right]`
  for an i.i.d. sample of size $n$.
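  For example, for a single Bernoulli($\theta$) observation, $\log f(x;\theta) =
  x \log \theta + (1-x) \log(1-\theta)$, and both expressions give the same
  value:
  `-E \left( \frac{\partial^2}{\partial \theta^2} \log f \right) = E \left(
  \frac{X}{\theta^2} + \frac{1-X}{(1-\theta)^2} \right) =
  \frac{1}{\theta(1-\theta)},`
  so $I(\theta) = n/\{\theta(1-\theta)\}$ and the bound for an unbiased
  estimator of $\theta$ is $\theta(1-\theta)/n$, which is attained by $\bar{X}$.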
  An unbiased estimator $t$ of $g(\theta)$ for which the Cramér-Rao lower bound
  is attained is called a minimum variance bound (MVB) estimator. An MVB
  estimator of $g(\theta)$ exists if and only if the score function has the
  linear form $\frac{\partial}{\partial \theta} \log L = c(\theta)\,[t -
  g(\theta)]$; in that case $t$ is also a sufficient statistic for $\theta$.
As $n$ gets larger, the lower bound for $\text{var}_{\theta}(t)$ gets smaller. That is, as the Fisher information increases, the lower bound decreases, so the best unbiased estimator can have a smaller variance; the sample then carries more information about $\theta$.
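  The shrinking of the bound with $n$ can be checked numerically. The following
  is a minimal Monte Carlo sketch, assuming a Poisson($\theta$) model with
  illustrative values of $\theta$ and of the sample sizes (using numpy); the
  estimator is $\bar{X}$, whose variance should match the CRLB $\theta/n$.

```python
# Monte Carlo check of the Cramer-Rao lower bound for the Poisson mean.
# Assumed setup for illustration: X_1,...,X_n ~ Poisson(theta), estimator
# t = X-bar, Fisher information I(theta) = n/theta, so the CRLB is theta/n.
import numpy as np

rng = np.random.default_rng(0)
theta = 3.0      # true parameter (illustrative choice)
reps = 200_000   # number of simulated samples per sample size

for n in (5, 20, 80):
    samples = rng.poisson(theta, size=(reps, n))
    xbar = samples.mean(axis=1)   # unbiased estimator of theta
    emp_var = xbar.var()          # empirical variance of X-bar over replications
    crlb = theta / n              # Cramer-Rao lower bound theta/n
    print(f"n={n:3d}  var(X-bar)={emp_var:.5f}  CRLB={crlb:.5f}")
```

  For every $n$ the empirical variance of $\bar{X}$ should lie close to the
  bound, and both decrease as $n$ grows.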