February 27, 2019

Poisson Regression can be a really useful tool if you know how and when to use it. In this tutorial we’re going to take a long look at Poisson Regression, what it is, and how R programmers can use it in the real world.

Specifically, we’re going to cover:

- What Poisson Regression actually is and when we should use it
- Poisson Distribution, and how it differs from Normal Distribution
- Poisson Regression modeling with GLMs
- Modeling Poisson Regression for count data
- Visualizing findings from model using jtools
- Modeling Poisson Regression for rate data

> install.packages("Dataquest")

Start learning R today with our Introduction to R course — no credit card required!

SIGN UP

## What Are Poisson Regression Models?

Poisson Regression models are best used for modeling events where the outcomes are counts. Or, more specifically,*count data*: discrete data with non-negative integer values that count something, like the number of times an event occurs during a given timeframe or the number of people in line at the grocery store.

*Count data*can also be expressed as*rate data*, since the number of times an event occurs within a timeframe can be expressed as a raw count (i.e. “In a day, we eat three meals”) or as a rate (“We eat at a rate of 0.125 meals per hour”).

Poisson Regression helps us analyze both count data and rate data by allowing us to determine which explanatory variables (X values) have an effect on a given response variable (Y value, the count or a rate). For example, Poisson regression could be applied by a grocery store to better understand and predict the number of people in a line.

## What is Poisson Distribution?

Poisson distribution is a statistical theory named after French mathematician Siméon Denis Poisson. It models the probability of event or events*y*occurring within a specific timeframe, assuming that*y*occurrences are not affected by the timing of previous occurrences of*y*. This can be expressed mathematically using the following formula:

Here,*μ*(in some textbooks you may see*λ*instead of*μ*) is the average number of times an event may occur per unit of*exposure*. It is also called the **parameter** of Poisson distribution. The *exposure*may be time, space, population size, distance, or area, but it is often time, denoted with*t*. If exposure value is not given it is assumed to be equal to**1**.

Let’s visualize this by creating a Poisson distribution plot for different values of*μ*.

First, we’ll create a vector of 6 colors:

`# vector of colorscolors <- c("Red", "Blue", "Gold", "Black", "Pink", "Green") `

Next, we’ll create a list for the distribution that will have different values for*μ*:

`# declare a list to hold distribution valuespoisson.dist < - list() `

</code>

Then, we’ll create a vector of values for*μ*and loop over the values from*μ*each with quantile range 0-20, storing the results in a list:

`a < - c(1, 2, 3, 4, 5, 6) # A vector for values of ufor (i in 1:6) { poisson.dist[[i]] <- c(dpois(0:20, i)) # Store distribution vector for each corresponding value of u}`

</code>

Finally, we’ll plot the points using`plot()`

.`plot()`

is a base graphics function in R. Another common way to plot data in R would be using the popular`ggplot2`

package; this is covered inDataquest’s R courses. But for this tutorial, we will stick to base R functions.

`# plot each vector in the list using the colors vectors to represent each value for uplot(unlist(poisson.dist[1]), type = "o", xlab="y", ylab = "P(y)", col = colors[i])for (i in 1:6) { lines(unlist(poisson.dist[i]), type = "o", col = colors[i])}# Adds legend to the graph plottedlegend("topright", legend = a, inset = 0.08, cex = 1.0, fill = colors, title = "Values of u")`

Note that we used `dpois(sequence,lambda)`

to plot the Probability Density Functions (PDF) in our Poisson distribution. In probability theory, a probability density function is a function that describes the relative likelihood that a continuous random variable (a variable whose possible values are continuous outcomes of a random event) will have a given value. (In statistics, a “random” variable is simply a variable whose outcome is result of a random event.)

### How Does Poisson Distribution Differ From Normal Distribution?

Poisson Distribution is most commonly used to find the probability of events occurring within a given time interval. Since we’re talking about a count, with Poisson distribution, the result must be 0 or higher – it’s not possible for an event to happen a negative number of times. On the other hand,*Normal distribution*is a continuous distribution for a continuous variable and it could result in a positive or negative value:

Poisson Distribution | Normal Distribution |
---|---|

Used for count data or rate data | Used for continuous variables |

Skewed depending on values of lambda. | Bell shaped curve that is symmetric around the mean. |

Variance = Mean | Variance and mean are different parameters; mean, median and mode are equal |

We can generate a Normal Distribution in R like this:

`# create a sequence -3 to +3 with .05 increments xseq < - seq(-3, 3, .05) # generate a Probability Density Functiondensities <- dnorm(xseq, 0, 1) # plot the graphplot(xseq, densities, col = "blue", xlab = "", ylab = "Density", type = "l", lwd = 2) # col: changes the color of line# 'xlab' and 'ylab' are labels for x and y axis respectively# type: defines the type of plot. 'l' gives a line graph# lwd: defines line width`

</code>

In R, `dnorm(sequence, mean, std.dev)`

is used to plot the Probability Density Function (PDF) of a Normal Distribution.

To understand the Poisson distribution, consider the following problem fromChi Yau’s R Tutorial textbook:

**If there are 12 cars crossing a bridge per minute on average, what is the probability of having seventeen or more cars crossing the bridge in any given minute?**

Here, average number of cars crossing a bridge per minute is*μ*= 12.

`ppois(q, u, lower.tail = TRUE)`

is an R function that gives the probability that a random variable will be lower than or equal to a value.

We have to find the probability of having seventeen or**more**cars, so we will use`lower.trail = FALSE`

and set q at 16:

`ppois(16, 12, lower.tail = FALSE)# lower.tail = logical; if TRUE (default) then probabilities are P[X < = x], otherwise, P[X > x].`

`## [1] 0.101291`

To get a percentage, we simply need to multiply this output by 100. Now we have the answer to our question: there is a**10.1%**probability of having 17 or more cars crossing the bridge in any particular minute.

### Poisson Regression Models and GLMs

Generalized Linear Models are models in which response variables follow a distribution other than the normal distribution. That’s in contrast to Linear regression models, in which response variables follow normal distribution. This is because Generalized Linear Models have response variables that are categorical such as Yes, No; or Group A, Group B and, therefore, do not range from -∞ to +∞. Hence, the relationship between response and predictor variables may not be linear. In GLM:

*y*_{i} = *α* + *β*_{1}*x*_{1}*i* + *β*_{2}*x*_{2}*i* + ….+*β*_{p}*x*_{p}*i* + *e*_{i} *i* = 1, 2….*n*

The response variable*y*_{i}is modeled by a*linear function of predictor variables*and some error term.

A Poisson Regression model is a*Generalized Linear Model (GLM)*that is used to model count data and contingency tables. The output*Y*(count) is a value that follows the Poisson distribution. It assumes the logarithm of*expected values (mean)*that can be modeled into a linear form by some unknown parameters.

**Note:***In statistics, contingency tables(example)are matrix of frequencies depending on multiple variables.*

To transform the non-linear relationship to linear form, a**link function**is used which is the**log**for Poisson Regression. For that reason, a Poisson Regression model is also called*log-linear model*. The general mathematical form of Poisson Regression model is:

*l*o*g*(*y*)=*α* + *β*_{1}*x*_{1} + *β*_{2}*x*_{2} + ….+*β*_{p}*x*_{p}

Where,

*y*: Is the response variable*α*and*β*: are numeric coefficients,*α*being the intercept, sometimes*α*also is represented by*β*_{0}, it’s the same*x*is the predictor/explanatory variable

The coefficients are calculated using methods such as Maximum Likelihood Estimation(MLE) ormaximum quasi-likelihood.

Consider an equation with one predictor variables and one response variable:

*l*o*g*(*y*)=*α* + *β*(*x*)

This is equivalent to:

*y* = *e*^{(α + β(x))} = *e*^{α} + *e*^{β * x}

Note: In Poisson Regression models, predictor or explanatory variables can have a mixture of both numeric or categorical values.

One of the most important characteristics for Poisson distribution and Poisson Regression is**equidispersion**, which means that the mean and variance of the distribution are equal.

Variance measures the spread of the data. It is the “average of the squared differences from the mean”. Variance (Var) is equal to 0 if all values are identical. The greater the difference between the values, the greater the variance. Mean is the average of values of a dataset. Average is the sum of the values divided by the number of values.

Let us say that the mean (*μ*) is denoted by*E*(*X*)

*E*(*X*)=*μ*

For Poisson Regression, mean and variance are related as:

*v*a*r*(*X*)=*σ*^{2}*E*(*X*)

Where*σ*^{2}is the dispersion parameter. Since*v*a*r*(*X*)=*E*(*X*)(variance=mean) must hold for the Poisson model to be completely fit,*σ*^{2}must be equal to 1.

When variance is greater than mean, that is called**over-dispersion**and it is greater than 1. If it is less than 1 than it is known as**under-dispersion**.

## Poisson Regression Modeling Using Count Data

In R, the`glm()`

command is used to model Generalized Linear Models. Here is the general structure of`glm()`

:

`glm(formula, family = familytype(link = ""), data,...)`

In this tutorial, we’ll be using those three parameters. For further details we can consultthe R documentation, but let’s take a quick look at what each refers to:

Parameter | Description |
---|---|

formula | The formula is symbolic representation of how modeled is to fitted |

family | Family tells choice of variance and link functions.There are several choices of family, including Poisson and Logistic |

data | Data is the dataset to be used |

`glm()`

provides eight choices for family with the following default link functions:

Family | Default Link Function |
---|---|

binomial | (link = “logit”) |

gaussian | (link = “identity”) |

Gamma | (link = “inverse”) |

inverse.gaussian | (link = $frac{1}{mu^2}$) |

poisson | (link = “log”) |

quasi | (link = “identity”, variance = “constant”) |

quasibinomial | (link = “logit”) |

quasipoisson | (link = “log”) |

### Let’s Get Modeling!

We’re going to model Poisson Regression related to how frequently yarn breaks during weaving. This data is found in the`datasets`

package in R, so the first thing we need to do is install the package using`install.package("datasets")`

and load the library with`library(datasets)`

:

`# install.packages("datasets")library(datasets) # include library datasets after installation`

The`datasets`

package includes tons of datasets, so we need to specifically select our yarn data.Consulting the package documentation, we can see that it is called**warpbreaks**, so let’s store that as an object.

`data < - warpbreaks`

</code>

Let’s take a look at the data:

`columns < - names(data) # Extract column names from dataframecolumns # show columns`

</code>

Output: `[1] "breaks" "wool" "tension"`

### What’s In Our Data?

This data set looks at how many warp breaks occurred for different types of looms per loom, per fixed length of yarn. We can read more details about this dataset in the documentationhere, but here are the three columns we’ll be looking at and what each refers to:

Column | Type | Description |
---|---|---|

breaks | numeric | The number of breaks |

wool | factor | The type of wool (A or B) |

tension | factor | The level of tension (L, M, H) |

There are measurements on 9 looms of each of the six types of warp, for a total of 54 entries in the dataset.

Let’s look at how the data is structured using the`ls.str()`

command:

`ls.str(warpbreaks)`

Output:

`breaks : num [1:54] 26 30 54 25 70 52 51 26 67 18 ...tension : Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...wool : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...`

From the above, we can see both the types and levels present in the data.Read thisto learn a bit more about factors in R.

Now we will work with the`data`

dataframe. Remember, with a Poisson Distribution model we’re trying to figure out how some predictor variables affect a response variable. Here,`breaks`

is the response variable and`wool`

and`tension`

are predictor variables.

We can view the dependent variable`breaks`

data continuity by creating a histogram:

`hist(data$breaks) `

Clearly, the data is not in the form of a bell curve like in a normal distribution.

Let’s check out the`mean()`

and`var()`

of the dependent variable:

`mean(data$breaks) # calculate mean`

Output: `[1] 28.14815`

`var(data$breaks) # calculate variance`

Output: `[1] 174.2041`

The variance is much greater than the mean, which suggests that we will have over-dispersion in the model.

Let’s fit the Poisson model using the`glm()`

command.

`# model poisson regression using glm()poisson.model < - glm(breaks ~ wool + tension, data, family = poisson(link = "log"))summary(poisson.model)`

</code>

`summary()`

is a generic function used to produce result summaries of the results of various model fitting functions.

Output:

`Call: glm(formula = breaks ~ wool + tension, family = poisson(link = "log"), data = data)Deviance Residuals: Min 1Q Median 3Q Max -3.6871 -1.6503 -0.4269 1.1902 4.2616 Coefficients:Estimate Std. Error z value Pr(>|z|) (Intercept) 3.69196 0.04541 81.302 < 2e-16 ***woolB -0.20599 0.05157 -3.994 6.49e-05 ***tensionM -0.32132 0.06027 -5.332 9.73e-08 ***tensionH -0.51849 0.06396 -8.107 5.21e-16 ***---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1(Dispersion parameter for poisson family taken to be 1)Null deviance: 297.37 on 53 degrees of freedomResidual deviance: 210.39 on 50 degrees of freedomAIC: 493.06Number of Fisher Scoring iterations: 4</code>`

### Interpreting the Poisson Model

We’ve just been given a lot of information, now we need to interpret it. The first column named`Estimate`

is the coefficient values of*α*(intercept),*β*_{1}and so on. Following is the interpretation for the parameter estimates:

*e*x*p*(*α*)= effect on the mean*μ*, when X = 0*e*x*p*(*β*) = with every unit increase in X, the predictor variable has multiplicative effect of*e*(**x**p*β*) on the mean of Y, that is*μ*- If
*β*= 0, then exp(*β*) = 1, and the expected count is*e*x*p*(*α*) and, Y and X are not related. - If
*β*> 0, then exp(*β*) > 1, and the expected count is exp(*β*) times larger than when X = 0 - If
*β*< 0, then exp(*β*) < 1, and the expected count is exp(*β*) times smaller than when X = 0

If`family = poisson`

is kept in`glm()`

then, these parameters are calculated usingMaximum Likelihood Estimation MLE.

R treats categorical variables as dummy variables. Categorical variables, also called indicator variables, are converted into dummy variables by assigning the levels in the variable some numeric representation.The general rule is that if there are*k*categories in a factor variable, the output of`glm()`

will have*k*−1 categories with remaining 1 as the base category.

We can see in above summary that for wool, ‘A’ has been made the base and is not shown in summary. Similarly, for tension ‘L’ has been made the base category.

To see which explanatory variables have an effect on response variable, we will look at the*p*values. If the** p is less than 0.05**then, the variable has an effect on the response variable. In the summary above, we can see that all p values are less than 0.05, hence,

*both*explanatory variables (wool and tension) have significant effect on breaks. Notice how R output used

`***`

at the end of each variable. The number of stars signifies significance.Before starting to interpret results, let’s check whether the model has over-dispersion or under-dispersion. If the*Residual Deviance*is greater than the degrees of freedom, then over-dispersion exists. This means that the estimates are correct, but the standard errors (standard deviation) are wrong and unaccounted for by the model.

The Null deviance shows how well the response variable is predicted by a model that includes only the intercept (grand mean) whereas residual with the inclusion of independent variables. Above, we can see that the addition of 3 (53-50 =3) independent variables decreased the deviance to 210.39 from 297.37. Greater difference in values means a bad fit.

So, to have a more correct standard error we can use a*quasi-poisson*model:

`poisson.model2 < - glm(breaks ~ wool + tension, data = data, family = quasipoisson(link = "log"))summary(poisson.model2)`

</code>

Output:

`Call: glm(formula = breaks ~ wool + tension, family = poisson(link = "log"), data = data)Deviance Residuals: Min 1Q Median 3Q Max -3.6871 -1.6503 -0.4269 1.1902 4.2616 Coefficients:Estimate Std. Error z value Pr(>|z|) (Intercept) 3.69196 0.04541 81.302 < 2e-16 ***woolB -0.20599 0.05157 -3.994 6.49e-05 ***tensionM -0.32132 0.06027 -5.332 9.73e-08 ***tensionH -0.51849 0.06396 -8.107 5.21e-16 ***---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1(Dispersion parameter for poisson family taken to be 1)Null deviance: 297.37 on 53 degrees of freedomResidual deviance: 210.39 on 50 degrees of freedomAIC: 493.06Number of Fisher Scoring iterations: 4</code>`

### Comparing The Models:

Now that we’ve got two different models, let’s compare them to see which is better. First, we’ll install the`arm`

library because it contains a function we need:

`# install.packages("arm")# load library arm that contains the function se.coef()library(arm)`

Now we’ll use that`se.coef()`

function to extract the coefficients from each model, and then use`cbind()`

combine those extracted values into a single dataframe so we can compare them.

`# extract coefficients from first model using 'coef()'coef1 = coef(poisson.model) # extract coefficients from second modelcoef2 = coef(poisson.model2) # extract standard errors from first model using 'se.coef()'se.coef1 = se.coef(poisson.model) # extract standard errors from second modelse.coef2 = se.coef(poisson.model2)# use 'cbind()' to combine values into one dataframemodels.both < - cbind(coef1, se.coef1, coef2, se.coef2, exponent = exp(coef1)) # show dataframemodels.both`

</code>

Output:

`coef1 se.coef1 coef2 se.coef2 exponent(Intercept) 3.6919631 0.04541069 3.6919631 0.09374352 40.1235380woolB -0.2059884 0.05157117 -0.2059884 0.10646089 0.8138425tensionM -0.3213204 0.06026580 -0.3213204 0.12440965 0.7251908tensionH -0.5184885 0.06395944 -0.5184885 0.13203462 0.5954198`

In above output, we can see the coefficients are the same, but the standard errors are different.

Keeping these points in mind, let’s see estimate for*wool*. Its value is**-0.2059884**, and the exponent of**-0.2059884**is**0.8138425**.

`1-0.8138425`

Output:`[1] 0.1861575`

This shows that changing from type A wool to type B wool results in a*decrease*in breaks*0.8138425*times the intercept, because estimate -0.2059884 is negative. Another way of saying this is if we change wool type from A to B, the number of breaks will fall by 18.6% assuming all other variables are the same.

### Predicting From The Model

Once the model is made, we can use`predict(model, data, type)`

to predict outcomes using new dataframes containing data other than the training data. Let’s look at an example.

`# make a dataframe with new datanewdata = data.frame(wool = "B", tension = "M")# use 'predict()' to run model on new datapredict(poisson.model2, newdata = newdata, type = "response")`

Output: `[1] 23.68056`

Our model is predicting there will be roughly**24**breaks with wool type B and tension level M.

### Visualizing Findings Usingjtools

When you are sharing your analysis with others, tables are often not the best way to grab people’s attention. Plots and graphs help people grasp your findings more quickly. The most popular way to visualize data in R is probably`ggplot2`

(which is taught inDataquest’s data visualization course), we’re also going to use an awesome R package calledjtoolsthat includes tools for specifically summarizing and visualizing regression models. Let’s use`jtools`

to visualize`poisson.model2`

.

`# Install the package jtools if not already installedinstall.packages("jtools")# you may be asked to install 'broom' and 'ggstance' packages as wellinstall.packages("broom")install.packages("ggstance")`

`jtools`

provides`plot_summs()`

and`plot_coefs()`

to visualize the summary of the model and also allows us to compare different models with`ggplot2`

.

`# Include jtools librarylibrary(jtools)# plot regression coefficients for poisson.model2plot_summs(poisson.model2, scale = TRUE, exp = TRUE)`

`# plot regression coefficients for poisson.model2 and poisson.modelplot_summs(poisson.model, poisson.model2, scale = TRUE, exp = TRUE)`

Output:

In above code, the `plot_summs(poisson.model2, scale = TRUE, exp = TRUE)`

plots the second model using the quasi-poisson family in`glm`

.

- The first argument in
`plot_summs()`

is the regression model to be used, it may be one or more than one. `scale`

helps with the problem of differing scales of the variables.`exp`

is set to`TRUE`

because for Poisson regression we are more likely to be interested in exponential values of estimates rather than linear.

You can find more details on `jtools`

and`plot_summs()`

here in the documentation.

We can also visualize the interaction between predictor variables.`jtools`

provides different functions for different types of variables. For example, if all the variables are categorical, we could use`cat_plot()`

to better understand interactions among them. For continuous variables,`interact_plot()`

is used.

In the**warpbreaks**data we have categorical predictor variables, so we’ll use`cat_plot()`

to visualize the interaction between them, by giving it arguments specifying which model we’d like to use, the predictor variable we’re looking at, and the other predictor variable that it combines with to produce the outcome.

`cat_plot(poisson.model2, pred = wool, modx = tension)# argument 1: regression model# pred: The categorical variable that will appear on x-axis# modx: Moderator variable that has an effect in combination to pred on outcome`

Output:

We can do the same thing to look at `tension`

:

`# using cat_plot. Pass poisson.model2 and we want to see effect of tension type so we set pred=tensioncat_plot(poisson.model2, pred = tension, modx = wool)`

Output:

Above, we see how the three different categories of tension (L, M, and H) for each affects breaks with each wool type. For example, breaks tend to be highest with low tension and type A wool.

We can also define the type of plot created by`cat_plot()`

using the`geom`

parameter. This parameter enhances the interpretation of plot. We can use it like so, passing`geom`

as an additional argument to`cat_plot`

:

`cat_plot(poisson.model2, pred = tension, modx = wool, geom = "line")`

Output:

We can also to include observations in the plot by adding `plot.points = TRUE`

:

`cat_plot(poisson.model2, pred = tension, modx = wool, geom = "line", plot.points = TRUE)`

Output:

There are lots of other design options, including line style, color, etc, that will allow us to customize the appearance of these visualizations. For specifics, consult the `jtools`

documentationhere.

## Poisson Regression Modeling Using Rate Data

So far this in this tutorial, we have modeled count data, but we can also model rate data that is predicting the number of counts over a period of time or grouping. Formula for modelling rate data is given by:

*l*o*g*(*X*/*n*)=*β*_{0} + ∑_{i}*β*_{i}*X*_{i}

This is equivalent to: (applying log formula)

*l*o*g*(*X*)−*l*o*g*(*n*)=*β*_{0} + ∑_{i}*β*_{i}*X*_{i}

*l*o*g*(*X*)=*l*o*g*(*n*)+*β*_{0} + ∑_{i}*β*_{i}*X*_{i}

Thus, rate data can be modeled by including the**log(n)**term with coefficient of 1. This is called an**offset**. This offset is modelled with`offset()`

in R.

Let’s use another a dataset called`eba1977`

from theISwR packageto model Poisson Regression Model for rate data. First, we’ll install the package:

`# install.packages("ISwR")library(ISwR)`

`## Warning: package 'ISwR' was built under R version 3.4.4`

Now, let’s take a look at some details about the data, and print the first ten rows to get a feel for what the dataset includes.

`data(eba1977)cancer.data = eba1977cancer.data[1:10, ]# Description# Lung cancer incidence in four Danish cities 1968-1971# Description:# This data set contains counts of incident lung cancer cases and# population size in four neighbouring Danish cities by age group.# Format:# A data frame with 24 observations on the following 4 variables: # city a factor with levels Fredericia, Horsens, Kolding, and Vejle.# age a factor with levels 40-54, 55-59, 60-64, 65-69,70-74, and 75+.# pop a numeric vector, number of inhabitants.# cases a numeric vector, number of lung cancer cases.`

Output:

`city age pop cases1 Fredericia 40-54 3059 112 Horsens 40-54 2879 133 Kolding 40-54 3142 44 Vejle 40-54 2520 55 Fredericia 55-59 800 116 Horsens 55-59 1083 67 Kolding 55-59 1050 88 Vejle 55-59 878 79 Fredericia 60-64 710 1110 Horsens 60-64 923 15`

To model rate data, we use**X/n**where*X*is the event to happen and*n*is the grouping. In this example,**X=cases**(the event is a case of cancer) and**n=pop**(the population is the grouping).

As in the formula above, rate data is accounted by*log*(*n*) and in this data*n*is population, so we will find log of population first. We can model for*cases/population*as follows:

`# find the log(n) of each value in 'pop' column. It is the third columnlogpop = log(cancer.data[ ,3])# add the log values to the dataframe using 'cbind()'new.cancer.data = cbind(cancer.data, logpop)# display new dataframenew.cancer.data`

Output:

`city age pop cases logpop1 Fredericia 40-54 3059 11 8.0258432 Horsens 40-54 2879 13 7.9651983 Kolding 40-54 3142 4 8.0526154 Vejle 40-54 2520 5 7.8320145 Fredericia 55-59 800 11 6.6846126 Horsens 55-59 1083 6 6.9874907 Kolding 55-59 1050 8 6.9565458 Vejle 55-59 878 7 6.7776479 Fredericia 60-64 710 11 6.56526510 Horsens 60-64 923 15 6.82762911 Kolding 60-64 895 7 6.79682412 Vejle 60-64 839 10 6.73221113 Fredericia 65-69 581 10 6.36475114 Horsens 65-69 834 10 6.72623315 Kolding 65-69 702 11 6.55393316 Vejle 65-69 631 14 6.44730617 Fredericia 70-74 509 11 6.23244818 Horsens 70-74 634 12 6.45204919 Kolding 70-74 535 9 6.28226720 Vejle 70-74 539 8 6.28971621 Fredericia 75+ 605 10 6.40522822 Horsens 75+ 782 2 6.66185523 Kolding 75+ 659 12 6.49072424 Vejle 75+ 619 7 6.428105`

Now, let’s model the rate data with`offset()`

.

`poisson.model.rate < - glm(cases ~ city + age+ offset(logpop), family = poisson(link = "log"), data = cancer.data)#display summarysummary(poisson.model.rate)`

</code>

Output:

`Call:glm(formula = cases ~ city + age + offset(logpop), family = poisson(link = "log"), data = cancer.data)Deviance Residuals: Min 1Q Median 3Q Max -2.63573 -0.67296 -0.03436 0.37258 1.85267 Coefficients:Estimate Std. Error z value Pr(>|z|) (Intercept) -5.6321 0.2003 -28.125 < 2e-16 ***cityHorsens -0.3301 0.1815 -1.818 0.0690 . cityKolding -0.3715 0.1878 -1.978 0.0479 * cityVejle -0.2723 0.1879 -1.450 0.1472 age55-59 1.1010 0.2483 4.434 9.23e-06 ***age60-64 1.5186 0.2316 6.556 5.53e-11 ***age65-69 1.7677 0.2294 7.704 1.31e-14 ***age70-74 1.8569 0.2353 7.891 3.00e-15 ***age75+ 1.4197 0.2503 5.672 1.41e-08 ***---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1(Dispersion parameter for poisson family taken to be 1)Null deviance: 129.908 on 23 degrees of freedomResidual deviance: 23.447 on 15 degrees of freedomAIC: 137.84Number of Fisher Scoring iterations: 5</code>`

In this dataset, we can see that the residual deviance is near to degrees of freedom, and the dispersion parameter is**1.5 (23.447/15)**which is small, so the model is a good fit.

We use`fitted(model)`

to return values fitted by the model. It returns outcomes using the training data on which the model is built. Let’s give it a try:

`fitted(poisson.model.rate) `

`1 2 3 4 5 6 7 8 10.954812 7.411803 7.760169 6.873215 8.615485 8.384458 7.798635 7.201421 9 10 11 12 13 14 15 16 11.609373 10.849479 10.092831 10.448316 12.187276 12.576313 10.155638 10.080773 17 18 19 20 21 22 23 24 11.672630 10.451942 8.461440 9.413988 8.960422 8.326004 6.731286 6.982287`

Using this model, we can predict the number of cases per 1000 population for a new data set, using the`predict()`

function, much like we did for our model of count data previously:

`# create a test dataframe containing new values of variablestest.data = data.frame(city = "Kolding", age = "40-54", pop = 1000, logpop = log(1000)) # predict outcomes (responses) using 'predict()' predicted.value < - predict(poisson.model.rate, test.data, type = "response") # show predicted valuepredicted.value`

</code>

Output: `[1] 2.469818`

So,**for the city of Kolding among people in the age group 40-54, we could expect roughly 2 or 3 cases of lung cancer per 1000 people**.

As with the count data, we could also use quasi-poisson to get more correct standard errors with rate data, but we won’t repeat that process for the purposes of this tutorial.

## Conclusion

Poisson regression models have great significance in econometric and real world predictions. In this tutorial, we’ve learned about Poisson Distribution, Generalized Linear Models, and Poisson Regression models.

We also learned how to implement Poisson Regression Models for both count and rate data in R using`glm()`

, and how to fit the data to the model to predict for a new dataset. Additionally, we looked at how to get more accurate standard errors in`glm()`

using*quasipoisson*and saw some of the possibilities available for visualization with`jtools`

.

[thrive_leads id=’21203′]

### References

- https://stat.ethz.ch/R-manual/R-devel/library/stats/html/Poisson.html
- https://www.theanalysisfactor.com/generalized-linear-models-in-r-part-6-poisson-regression-count-variables/
- https://stats.idre.ucla.edu/r/dae/poisson-regression/
- https://www.rdocumentation.org/packages/base/versions/3.5.2/topics/summary

### Ready to level up your R skills?

Our Data Analyst in R path covers all the skills you need to land a job, including:

- Data visualization with
**ggplot2** - Advanced data cleaning skills with
**tidyverse**packages - Important
**SQL skills**for R users - Fundamentals in
**statistics**and probability - ...and
**much more**

There's nothing to install, **no prerequisites**, and no schedule.

Start learning for free!

poisson regression in RrR for data scienceR projectsR tutorialrstatsTutorials

## FAQs

### How do you do Poisson regression? ›

Poisson Regression Analysis in SPSS with Assumption Testing - YouTube

### What is the function used for Poisson regression in R? ›

The function used to create the Poisson regression model is the **glm()** function.

### How do you find the Poisson distribution in R? ›

The Poisson Distribution in R - YouTube

### For which situations can you use Poisson regression? ›

Poisson regression applies **where the response variable is a count of events (e.g. crime incidents, cases of a disease) rather than a continuous variable**. This model may also be applied to standardized counts or “rates”, such as disease incidence per capita, species of tree per square kilometer.

### What is the difference between logistic regression and Poisson regression? ›

**Poisson regression is most commonly used to analyze rates, whereas logistic regression is used to analyze proportions**. The chapter considers statistical models for counts of independently occurring random events, and counts at different levels of one or more categorical outcomes.

### How do you analyze Poisson distribution? ›

The Formula for the Poisson Distribution Is

**e is Euler's number (e = 2.71828...)** **x is the number of occurrences**. **x! is the factorial of x**. **λ is equal to the expected value (EV) of x when that is also equal to its variance**.

### How do you check Poisson regression assumptions? ›

9.11 Poisson Regression: Model Assumptions - YouTube

### How do you apply Poisson distribution in R in a data set? ›

Poisson Distribution in R | R Tutorial 3.2 | MarinStatsLectures - YouTube

### How do you use a Poisson model? ›

The Poisson Distribution formula is: **P(x; μ) = (e ^{-}^{μ}) (μ^{x}) / x!** Let's say that that x (as in the prime counting function is a very big number, like x = 10

^{100}. If you choose a random number that's less than or equal to x, the probability of that number being prime is about 0.43 percent.

### How do you regress two variables in R? ›

Multiple Linear Regression in R | R Tutorial 5.3 | MarinStatsLectures

### How do I know if my data is Poisson in R? ›

**How to know if a data follows a Poisson Distribution in R?**

- The number of outcomes in non-overlapping intervals are independent. ...
- The probability of two or more outcomes in a sufficiently short interval is virtually zero.

### How do you use a Poisson model? ›

The Poisson Distribution formula is: **P(x; μ) = (e ^{-}^{μ}) (μ^{x}) / x!** Let's say that that x (as in the prime counting function is a very big number, like x = 10

^{100}. If you choose a random number that's less than or equal to x, the probability of that number being prime is about 0.43 percent.

### How do you model Poisson distribution? ›

The Poisson Distribution - explained with examples and illustrated using ...

### How do you analyze Poisson distribution? ›

The Formula for the Poisson Distribution Is

**e is Euler's number (e = 2.71828...)** **x is the number of occurrences**. **x! is the factorial of x**. **λ is equal to the expected value (EV) of x when that is also equal to its variance**.