###### What it means, and why it’s calculated the way it is

## A brief introduction to the concept

Consider a random variable *X* which is assumed to follow some probability distribution *f(.)*, such as the Normal or the Poisson distribution. Suppose also that the function *f(.)* accepts some parameter *θ*. Examples of *θ* are the mean *μ* of the normal distribution, or the mean event rate *λ* of the Poisson distribution. The Fisher Information of *X* measures the amount of information that *X* contains about the true population value of *θ* (such as the true mean of the population).

## The formula for Fisher Information

The Fisher Information of *X* for the parameter *θ* is defined as the variance of the partial derivative w.r.t. *θ* of the Log-likelihood function *ℓ(θ | X)*:

*I(θ) = Var[ ∂ℓ(θ | X)/∂θ ]*

Clearly, there is a lot to take in at one go in the above formula. Indeed, Fisher Information can be a complex concept to understand. So we will explain it using a real-world example. Along the way, we’ll also take apart the formula for Fisher Information and put it back together block by block so as to gain insight into why it is calculated the way it is.

## An illustrative example

Consider the following data set of 30K+ data points downloaded from Zillow Research under their free-to-use terms:

Each row in the data set contains a forecast of Year-over-Year percentage change in house prices in a specific geographical location within the United States. This value is in the column **ForecastYoYPctChange**.

Let’s load the data set into memory using Python and Pandas and let’s plot the frequency distribution of **ForecastYoYPctChange**.

```python
import math
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import norm

# Load the data file
df = pd.read_csv('zhvf_uc_sfrcondo_tier_0.33_0.67_month.csv', header=0,
                 infer_datetime_format=True, parse_dates=['ForecastedDate'])

# Plot the frequency distribution of ForecastYoYPctChange
plt.hist(df['ForecastYoYPctChange'], bins=1000)
plt.xlabel('YoY % change in house prices in some geographical area of the US')
plt.ylabel('Frequency of occurrence in the dataset')
plt.show()
```

We see the following frequency distribution plot:

### Defining the random variable *X*

In the above example, **ForecastYoYPctChange** is our random variable of interest. Thus, *X* = **ForecastYoYPctChange**.

### The probability distribution of *X*

From looking at the above mentioned frequency distribution plot of **ForecastYoYPctChange**, we’ll assume that the random variable **ForecastYoYPctChange** is normally distributed with some unknown mean *μ* and variance *σ²*. For reference, here is the **P**robability **D**ensity **F**unction (PDF) of such a *N(μ, σ²)*-distributed random variable:

*f(X=x | μ; σ²) = (1/(σ√(2π))) e^(−(x−μ)²/(2σ²))*

The PDF of **ForecastYoYPctChange** peaks at the population level mean *μ* which is unknown. Incidentally, here is the code that produced the above plot:

```python
from scipy.stats import norm

xlower = norm.ppf(0.01, loc=6, scale=2)
xupper = norm.ppf(0.99, loc=6, scale=2)
x = np.linspace(xlower, xupper, 100)
plt.plot(x, norm.pdf(x, loc=6, scale=2))
plt.xlabel('X')
plt.ylabel('Probability density f(X=x)')
plt.show()
```

### The relationship between Fisher Information of *X* and variance of *X*

Now suppose we observe a single value of the random variable **ForecastYoYPctChange** such as 9.2%. What can be said about the true population mean *μ* of **ForecastYoYPctChange** by observing this value of 9.2%?

If the distribution of **ForecastYoYPctChange** peaks sharply at *μ* and the probability is vanishingly small at most other values of **ForecastYoYPctChange**, then common sense suggests that the chance of the observed value of 9.2% being very different from the true mean is also vanishingly small. By implication, the amount of uncertainty in the observed value of 9.2% being a ‘good’ estimate of *μ* is also very small. This holds true for any particular observed value of **ForecastYoYPctChange**. Therefore, we would expect the Fisher Information contained in **ForecastYoYPctChange** about the population mean *μ* to be large.

Conversely, if the distribution of **ForecastYoYPctChange** is spread out pretty widely around the population mean *μ*, then the chance of a particular observation of **ForecastYoYPctChange**, such as 9.2%, being at or close to *μ* is small, and therefore, in this case, the Fisher Information contained in **ForecastYoYPctChange** about the population mean *μ* is small.
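To put a number on this intuition, here is a quick sketch (using an assumed true mean of 9.2% and two hypothetical values of *σ*, not fitted to the Zillow data) comparing how likely a single observation is to land near the true mean for a sharply peaked versus a widely spread normal distribution:

```python
import numpy as np
from scipy.stats import norm

# A hypothetical true mean, matching the article's example observation of 9.2%
mu = 9.2

# Probability that a single observation lands within +/- 0.5 of the true mean
def prob_near_mean(sigma, halfwidth=0.5):
    return (norm.cdf(mu + halfwidth, loc=mu, scale=sigma)
            - norm.cdf(mu - halfwidth, loc=mu, scale=sigma))

p_sharp = prob_near_mean(sigma=0.5)  # sharply peaked around mu
p_wide = prob_near_mean(sigma=3.0)   # widely spread around mu

print(f"P(|X - mu| <= 0.5) with sigma=0.5: {p_sharp:.3f}")
print(f"P(|X - mu| <= 0.5) with sigma=3.0: {p_wide:.3f}")
```

The sharply peaked distribution concentrates roughly 68% of its mass within ±0.5 of *μ*, versus only about 13% for the widely spread one, so a single draw from the former tells us much more about *μ*.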

Clearly, the concept of Fisher Information of *X* for some population parameter *θ* (such as the mean *μ*) is closely tied to the variance of the probability distribution of *X* around *θ*. That would explain the presence of variance in the formula for Fisher Information:

*I(θ) = Var[ ∂ℓ(θ | X)/∂θ ]*

So far, we have been able to show that the Fisher Information of *X* about the population parameter *θ* has a direct relationship with the variance of *X* around *θ*. However, it is *not directly equal* to the variance of *X*. Instead, it is the variance of the partial derivative of the log-likelihood of *θ*. To see why that is, let’s first look at the concepts of Likelihood, Log-Likelihood, and the partial derivative of the Log-Likelihood.

## The concept of the Likelihood function

Returning to our data set of house price changes, since we have assumed that **ForecastYoYPctChange** is normally distributed, the probability (density) corresponding to a specific observation of **ForecastYoYPctChange** such as 9.2% is as follows:

*f(X=9.2 | μ; σ²) = (1/(σ√(2π))) e^(−(9.2−μ)²/(2σ²))*

Let’s make a simplifying substitution. We’ll use the following sample variance as a substitute for the variance of the population:

*S² = Σ(x_i − x̄)² / (n−1)*

It can be shown that S² is an unbiased estimate of the population variance σ². So this is a valid substitution, especially for large samples.
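As a quick numerical check of that claim, here is a sketch using simulated normal data with an assumed population variance of 2.0 (the unbiasedness argument does not depend on the Zillow data itself). Note that pandas’ `.var()` and numpy’s `var(ddof=1)` both divide by *(n−1)*:

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 2.0   # hypothetical population variance for the simulation
n, trials = 30, 100_000

# Draw many small samples and average their sample variances S^2 (ddof=1)
samples = rng.normal(loc=6.0, scale=np.sqrt(true_var), size=(trials, n))
mean_s2 = samples.var(axis=1, ddof=1).mean()

print(f"Average S^2 over {trials} samples of size {n}: {mean_s2:.4f}")
print(f"True population variance: {true_var}")
```

The average of *S²* across many samples lands very close to the true *σ²*, which is what unbiasedness means.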

In our house prices data set, the sample variance S² can be gotten as follows:

```python
S_squared = df['ForecastYoYPctChange'].var()
print('S^2=' + str(S_squared))
```

This prints out the following:

```
S^2=2.1172115582214155
```

Substituting *S²* for *σ²* in the PDF of **ForecastYoYPctChange**, we have:

*f(X=9.2 | μ; σ²=2.11721) = 0.27418 e^(−0.23616(9.2−μ)²)*

Notice one important thing about the above equation:

*f(X=9.2 | μ; σ²=2.11721)* is actually a function of the population mean *μ*. In this form, as a function of the population parameter *μ*, we call this function the **Likelihood function**, denoted by *ℒ(μ | X=9.2)*, or in general *ℒ(θ | X=x)*.

*ℒ(θ | X=x)* is literally the likelihood of observing the particular value *x* of *X*, for different values of the population parameter *θ* (in our case, the population mean *μ*).

Let’s plot *ℒ(μ | X=9.2)* w.r.t. *μ*:

```python
x = np.linspace(-20, 20, 1000)
y = 0.27418 * np.exp(-0.23616 * np.power(9.2 - x, 2))
plt.plot(x, y)
plt.xlabel('mu')
plt.ylabel('L(mu|X=9.2)')
plt.show()
```

The Likelihood function peaks at *μ = 9.2*, which is another way of saying that if *X* follows a normal distribution, the likelihood of observing a value of *X = 9.2* is maximum when the mean of the population is *μ = 9.2*. That seems kind of intuitive.

### The concept of the Log-Likelihood function

Often, one is dealing with a sample of many observations *[x_1, x_2, x_3,…,x_n]* which form one’s data set. The likelihood of observing that particular data set of values under some assumed distribution of *X* is simply the product of the individual likelihoods, in other words, the following:

*ℒ(θ | X=[x_1,…,x_n]) = Π f(X=x_i | θ), for i = 1 to n*

Continuing with our example of the house prices data set, the likelihood equation for a data set of YoY % increase values *[x_1, x_2, x_3,…,x_n]* is the following joint probability density function:

*ℒ(μ | X=[x_1,…,x_n]) = Π (1/(σ√(2π))) e^(−(x_i−μ)²/(2σ²)), for i = 1 to n*

We would like to know what value of the true mean *μ* would maximize the likelihood of observing this particular sample of *n* observations. This is accomplished by taking the partial derivative of the joint probability w.r.t. *μ*, setting it to zero, and solving for *μ*.

It is a lot easier to solve the partial derivative if one takes the natural logarithm of the above likelihood function. The logarithm turns the product into a sum, and for many probability distribution functions, the logarithm is a concave function, thereby aiding the process of finding a maximum (or minimum). Finally, *log(x)* is monotonically increasing, rising and falling with *x*, so whatever value of *μ* maximizes the likelihood also maximizes its logarithm.
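Here is a minimal sketch of the product-to-sum property, using a small made-up sample and assumed parameter values: the log of the likelihood product equals the sum of the individual log-densities.

```python
import numpy as np
from scipy.stats import norm

# A small hypothetical sample, standing in for observed YoY % changes
x = np.array([9.2, 7.1, 5.8, 6.4, 8.0])
mu, sigma = 6.0, 1.455  # assumed parameter values for the check

# Likelihood: the product of the individual densities ...
L = np.prod(norm.pdf(x, loc=mu, scale=sigma))

# ... and log-likelihood: the sum of the individual log-densities
ll = np.sum(norm.logpdf(x, loc=mu, scale=sigma))

print(f"log of product: {np.log(L):.6f}")
print(f"sum of logs:    {ll:.6f}")
```

The two printed values agree, which is why working with the sum of logs is a safe substitution (and also numerically far more stable for large *n*, where the raw product would underflow).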

The logarithm of the Likelihood function is called the Log-Likelihood and is often denoted using the stylized small ‘l’:

*ℓ(θ | X=x)*

For our house prices example, the log-likelihood of *μ* for a single observed value of X=9.2% and *σ² = 2.11721* can be expressed as follows:

*ℓ(μ | X=9.2; σ²=2.11721) = ln(0.27418) − 0.23616(9.2−μ)² = −1.29397 − 0.23616(9.2−μ)²*

In the above expression, we have made use of a couple of basic rules of logarithms, namely *ln(A·B) = ln(A) + ln(B)*, *ln(Aˣ) = x·ln(A)*, and the natural logarithm *ln(e) = 1.0*.

As with the Likelihood function, the Log-Likelihood is a function of some population parameter *θ* (in our example, *θ = μ*). Let’s plot this log-likelihood function w.r.t. *μ*:

```python
x = np.linspace(-10, 100, 10000)
y = -1.29397 - 0.23616 * np.power(9.2 - x, 2)
plt.xlabel('mu')
plt.ylabel('LL(mu|X=9.2;sigma^2=2.11721)')
plt.plot(x, y)
plt.show()
```

As with the Likelihood function, the Log-Likelihood achieves its maximum value (in this case, −1.29397) when *μ* = 9.2%.

## Maximization of the Log-Likelihood function: The Maximum Likelihood Estimate of θ

As mentioned earlier, often one is dealing with a sample of many observations *[x_1, x_2, x_3,…,x_n]* which form one’s sample data set, and one would like to know the likelihood of observing that particular data set of values under some assumed distribution of *X*. As we have seen by now, this likelihood (or Log-Likelihood) of observing a specific value of *X* varies depending on the true mean of the underlying population values.

For a set of observed values *x = [x_1, x_2, x_3,…,x_n]*, the log-likelihood *ℓ(θ | X=x)* of observing *x* is maximized for the value of *θ* for which the partial derivative of *ℓ(θ | X=x)* w.r.t. *θ* is 0. In notation form:

*∂ℓ(θ | X=x)/∂θ = 0*

For our house prices example, the maximum likelihood estimate is calculated as follows:

*∂ℓ(μ | X=9.2)/∂μ = 2 × 0.23616 × (9.2−μ) = 0.47232(9.2−μ) = 0*

It’s easy to see this is an equation of a straight line with slope -0.47232 and y-intercept=0.47232*9.2. This line crosses the X-axis at *μ* =9.2% where the partial derivative is zero. Let’s plot this line.

```python
x = np.linspace(-10, 100, 10000)
y = 0.47232 * (9.2 - x)
plt.xlabel('mu')
plt.ylabel('Partial derivative of Log-Likelihood')
plt.plot(x, y)
plt.show()
```
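For a sample of *n* observations rather than a single value, setting the summed partial derivative *Σ(x_i − μ)/σ²* to zero yields the sample mean as the maximum likelihood estimate of *μ*. A quick numerical sketch with simulated stand-in data (the point doesn’t depend on the actual Zillow sample):

```python
import numpy as np

rng = np.random.default_rng(7)
sigma2 = 2.11721
# Simulated stand-in sample centered on an assumed true mean of 9.2
x = rng.normal(loc=9.2, scale=np.sqrt(sigma2), size=1000)

# Summed partial derivative of the log-likelihood w.r.t. mu over all observations
def score_sum(mu):
    return np.sum(x - mu) / sigma2

mu_hat = x.mean()  # candidate maximum likelihood estimate
print(f"mu_hat = {mu_hat:.4f}")
print(f"summed partial derivative at mu_hat = {score_sum(mu_hat):.2e}")
```

The summed partial derivative is positive below the sample mean, negative above it, and (up to floating-point error) exactly zero at it.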

Recollect that we have assumed that our data set has variance *σ² = 2.11721*. If instead we don’t make this assumption, the maximum likelihood estimate for *μ* is as follows:

*∂ℓ(μ | X=9.2)/∂μ = (9.2−μ)/σ² = 0*

From the above equation, we can see that the variance σ² of the probability distribution of ** X** has an inverse relationship with the absolute value of the slope of the partial derivative line, and therefore also the variance of the partial derivative function.

In other words, when *X* has a large spread around the true mean *μ* (i.e. the variance *σ²* is large), the slope of the partial derivative line is small, and so the variance of the partial derivative of the log-likelihood function is small. Conversely, when *X* is tightly spread around the mean *μ*, the variance *σ²* is small, the slope of the partial derivative function is large, and therefore the variance of this function is also large.

This observation is exactly in line with the formulation of the Fisher Information of *X* for *μ*, namely that it is the variance of the partial derivative of the log-likelihood of *X = x*:

*I(μ) = Var[ ∂ℓ(μ | X=x)/∂μ ]*

Or in general terms, the following formulation:

*I(θ) = Var[ ∂ℓ(θ | X=x)/∂θ ]*

Let’s use the above concepts to derive the Fisher Information of a Normally distributed random variable.

## Fisher Information of a Normally distributed random variable

We have shown that the Fisher Information of a Normally distributed random variable with mean *μ* and variance *σ²* can be represented as follows:

*I(μ) = Var[ ∂ℓ(μ | X)/∂μ ] = Var[ (X−μ)/σ² ]*

To find out the variance on the R.H.S., we will use the following identity:

*Var(Y) = E[Y²] − (E[Y])²*

Using this formula, we solve the variance as follows:

*Var[(X−μ)/σ²] = E[((X−μ)/σ²)²] − (E[(X−μ)/σ²])² = E[(X−μ)²]/σ⁴ − (E[X−μ])²/σ⁴*

The first expectation *E[(X−μ)²]* is simply the variance *σ²*, and the second expectation *E[X−μ]* is zero since the expected value (a.k.a. mean) of *X* is *μ*.

Therefore, the R.H.S. works out to *σ²/σ⁴ = 1/σ²*, which is the Fisher Information of a normally distributed random variable with mean *μ* and variance *σ²*.
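This result is easy to check by simulation. For a normal distribution with an assumed mean of 9.2 and variance 2.11721 (the values from our running example), the variance of the score *(X−μ)/σ²* should come out close to *1/σ² ≈ 0.47232*:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma2 = 9.2, 2.11721
x = rng.normal(loc=mu, scale=np.sqrt(sigma2), size=1_000_000)

# Score of each observation: the partial derivative of its log-likelihood w.r.t. mu
score = (x - mu) / sigma2

print(f"Var(score) = {score.var():.5f}")
print(f"1/sigma^2  = {1 / sigma2:.5f}")
```

The simulated variance of the score matches *1/σ²* to within sampling error, and the mean of the score is close to zero, as the derivation above requires.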

## Citations and copyrights

Fisher R. A. (1922). On the mathematical foundations of theoretical statistics. *Philosophical Transactions of the Royal Society of London, Series A, Containing Papers of a Mathematical or Physical Character*, 222, 309–368. http://doi.org/10.1098/rsta.1922.0009

### Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.

