Tutorial: Poisson Regression in R (2022)

February 27, 2019

Poisson Regression can be a really useful tool if you know how and when to use it. In this tutorial we’re going to take a long look at Poisson Regression, what it is, and how R programmers can use it in the real world.

Specifically, we’re going to cover:

  • What Poisson Regression actually is and when we should use it
  • Poisson Distribution, and how it differs from Normal Distribution
  • Poisson Regression modeling with GLMs
  • Modeling Poisson Regression for count data
  • Visualizing findings from model using jtools
  • Modeling Poisson Regression for rate data

> install.packages("Dataquest")

Start learning R today with our Introduction to R course — no credit card required!

SIGN UP

What Are Poisson Regression Models?

Poisson Regression models are best used for modeling events where the outcomes are counts. Or, more specifically,count data: discrete data with non-negative integer values that count something, like the number of times an event occurs during a given timeframe or the number of people in line at the grocery store.

Count datacan also be expressed asrate data, since the number of times an event occurs within a timeframe can be expressed as a raw count (i.e. “In a day, we eat three meals”) or as a rate (“We eat at a rate of 0.125 meals per hour”).

Poisson Regression helps us analyze both count data and rate data by allowing us to determine which explanatory variables (X values) have an effect on a given response variable (Y value, the count or a rate). For example, Poisson regression could be applied by a grocery store to better understand and predict the number of people in a line.

What is Poisson Distribution?

Poisson distribution is a statistical theory named after French mathematician Siméon Denis Poisson. It models the probability of event or eventsyoccurring within a specific timeframe, assuming thatyoccurrences are not affected by the timing of previous occurrences ofy. This can be expressed mathematically using the following formula:

Tutorial: Poisson Regression in R (1)

Here,μ(in some textbooks you may seeλinstead ofμ) is the average number of times an event may occur per unit ofexposure. It is also called the parameter of Poisson distribution. The exposuremay be time, space, population size, distance, or area, but it is often time, denoted witht. If exposure value is not given it is assumed to be equal to1.

Let’s visualize this by creating a Poisson distribution plot for different values ofμ.

First, we’ll create a vector of 6 colors:

# vector of colorscolors <- c("Red", "Blue", "Gold", "Black", "Pink", "Green") 

Next, we’ll create a list for the distribution that will have different values forμ:

# declare a list to hold distribution valuespoisson.dist < - list() </code>

Then, we’ll create a vector of values forμand loop over the values fromμeach with quantile range 0-20, storing the results in a list:

a < - c(1, 2, 3, 4, 5, 6) # A vector for values of ufor (i in 1:6) { poisson.dist[[i]] <- c(dpois(0:20, i)) # Store distribution vector for each corresponding value of u}</code>

Finally, we’ll plot the points usingplot().plot()is a base graphics function in R. Another common way to plot data in R would be using the popularggplot2package; this is covered inDataquest’s R courses. But for this tutorial, we will stick to base R functions.

# plot each vector in the list using the colors vectors to represent each value for uplot(unlist(poisson.dist[1]), type = "o", xlab="y", ylab = "P(y)", col = colors[i])for (i in 1:6) { lines(unlist(poisson.dist[i]), type = "o", col = colors[i])}# Adds legend to the graph plottedlegend("topright", legend = a, inset = 0.08, cex = 1.0, fill = colors, title = "Values of u")

Tutorial: Poisson Regression in R (2)

Note that we used dpois(sequence,lambda)to plot the Probability Density Functions (PDF) in our Poisson distribution. In probability theory, a probability density function is a function that describes the relative likelihood that a continuous random variable (a variable whose possible values are continuous outcomes of a random event) will have a given value. (In statistics, a “random” variable is simply a variable whose outcome is result of a random event.)

How Does Poisson Distribution Differ From Normal Distribution?

Poisson Distribution is most commonly used to find the probability of events occurring within a given time interval. Since we’re talking about a count, with Poisson distribution, the result must be 0 or higher – it’s not possible for an event to happen a negative number of times. On the other hand,Normal distributionis a continuous distribution for a continuous variable and it could result in a positive or negative value:

Poisson DistributionNormal Distribution
Used for count data or rate dataUsed for continuous variables
Skewed depending on values of lambda.Bell shaped curve that is symmetric around the mean.
Variance = MeanVariance and mean are different parameters; mean, median and mode are equal

We can generate a Normal Distribution in R like this:

# create a sequence -3 to +3 with .05 increments xseq < - seq(-3, 3, .05) # generate a Probability Density Functiondensities <- dnorm(xseq, 0, 1) # plot the graphplot(xseq, densities, col = "blue", xlab = "", ylab = "Density", type = "l", lwd = 2) # col: changes the color of line# 'xlab' and 'ylab' are labels for x and y axis respectively# type: defines the type of plot. 'l' gives a line graph# lwd: defines line width</code>

Tutorial: Poisson Regression in R (3)

In R, dnorm(sequence, mean, std.dev)is used to plot the Probability Density Function (PDF) of a Normal Distribution.

To understand the Poisson distribution, consider the following problem fromChi Yau’s R Tutorial textbook:

If there are 12 cars crossing a bridge per minute on average, what is the probability of having seventeen or more cars crossing the bridge in any given minute?

Here, average number of cars crossing a bridge per minute isμ= 12.

ppois(q, u, lower.tail = TRUE)is an R function that gives the probability that a random variable will be lower than or equal to a value.

We have to find the probability of having seventeen ormorecars, so we will uselower.trail = FALSEand set q at 16:

(Video) R Tutorial : Poisson regression

ppois(16, 12, lower.tail = FALSE)# lower.tail = logical; if TRUE (default) then probabilities are P[X < = x], otherwise, P[X > x].
## [1] 0.101291

To get a percentage, we simply need to multiply this output by 100. Now we have the answer to our question: there is a10.1%probability of having 17 or more cars crossing the bridge in any particular minute.

Poisson Regression Models and GLMs

Generalized Linear Models are models in which response variables follow a distribution other than the normal distribution. That’s in contrast to Linear regression models, in which response variables follow normal distribution. This is because Generalized Linear Models have response variables that are categorical such as Yes, No; or Group A, Group B and, therefore, do not range from -∞ to +∞. Hence, the relationship between response and predictor variables may not be linear. In GLM:

yi = α + β1x1i + β2x2i + ….+βpxpi + ei               i = 1, 2….n

The response variableyiis modeled by alinear function of predictor variablesand some error term.

A Poisson Regression model is aGeneralized Linear Model (GLM)that is used to model count data and contingency tables. The outputY(count) is a value that follows the Poisson distribution. It assumes the logarithm ofexpected values (mean)that can be modeled into a linear form by some unknown parameters.

Note:In statistics, contingency tables(example)are matrix of frequencies depending on multiple variables.

To transform the non-linear relationship to linear form, alink functionis used which is thelogfor Poisson Regression. For that reason, a Poisson Regression model is also calledlog-linear model. The general mathematical form of Poisson Regression model is:

log(y)=α + β1x1 + β2x2 + ….+βpxp

Where,

  • y: Is the response variable
  • αandβ: are numeric coefficients,αbeing the intercept, sometimesαalso is represented byβ0, it’s the same
  • xis the predictor/explanatory variable

The coefficients are calculated using methods such as Maximum Likelihood Estimation(MLE) ormaximum quasi-likelihood.

Consider an equation with one predictor variables and one response variable:

log(y)=α + β(x)

This is equivalent to:

y = e(α + β(x)) = eα + eβ * x

Note: In Poisson Regression models, predictor or explanatory variables can have a mixture of both numeric or categorical values.

One of the most important characteristics for Poisson distribution and Poisson Regression isequidispersion, which means that the mean and variance of the distribution are equal.

Variance measures the spread of the data. It is the “average of the squared differences from the mean”. Variance (Var) is equal to 0 if all values are identical. The greater the difference between the values, the greater the variance. Mean is the average of values of a dataset. Average is the sum of the values divided by the number of values.

Let us say that the mean (μ) is denoted byE(X)

E(X)=μ

For Poisson Regression, mean and variance are related as:

var(X)=σ2E(X)

Whereσ2is the dispersion parameter. Sincevar(X)=E(X)(variance=mean) must hold for the Poisson model to be completely fit,σ2must be equal to 1.

When variance is greater than mean, that is calledover-dispersionand it is greater than 1. If it is less than 1 than it is known asunder-dispersion.

Poisson Regression Modeling Using Count Data

In R, theglm()command is used to model Generalized Linear Models. Here is the general structure ofglm():

glm(formula, family = familytype(link = ""), data,...)

In this tutorial, we’ll be using those three parameters. For further details we can consultthe R documentation, but let’s take a quick look at what each refers to:

ParameterDescription
formulaThe formula is symbolic representation of how modeled is to fitted
familyFamily tells choice of variance and link functions.There are several choices of family, including Poisson and Logistic
dataData is the dataset to be used

glm()provides eight choices for family with the following default link functions:

FamilyDefault Link Function
binomial(link = “logit”)
gaussian(link = “identity”)
Gamma(link = “inverse”)
inverse.gaussian(link = $frac{1}{mu^2}$)
poisson(link = “log”)
quasi(link = “identity”, variance = “constant”)
quasibinomial(link = “logit”)
quasipoisson(link = “log”)

Let’s Get Modeling!

We’re going to model Poisson Regression related to how frequently yarn breaks during weaving. This data is found in thedatasetspackage in R, so the first thing we need to do is install the package usinginstall.package("datasets")and load the library withlibrary(datasets):

# install.packages("datasets")library(datasets) # include library datasets after installation

Thedatasetspackage includes tons of datasets, so we need to specifically select our yarn data.Consulting the package documentation, we can see that it is calledwarpbreaks, so let’s store that as an object.

(Video) R tutorial: Poisson Regression

data < - warpbreaks</code>

Let’s take a look at the data:

columns < - names(data) # Extract column names from dataframecolumns # show columns</code>

Output: [1] "breaks" "wool" "tension"

What’s In Our Data?

This data set looks at how many warp breaks occurred for different types of looms per loom, per fixed length of yarn. We can read more details about this dataset in the documentationhere, but here are the three columns we’ll be looking at and what each refers to:

ColumnTypeDescription
breaksnumericThe number of breaks
woolfactorThe type of wool (A or B)
tensionfactorThe level of tension (L, M, H)

There are measurements on 9 looms of each of the six types of warp, for a total of 54 entries in the dataset.

Let’s look at how the data is structured using thels.str()command:

ls.str(warpbreaks)

Output:

breaks : num [1:54] 26 30 54 25 70 52 51 26 67 18 ...tension : Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...wool : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...

From the above, we can see both the types and levels present in the data.Read thisto learn a bit more about factors in R.

Now we will work with thedatadataframe. Remember, with a Poisson Distribution model we’re trying to figure out how some predictor variables affect a response variable. Here,breaksis the response variable andwoolandtensionare predictor variables.

We can view the dependent variablebreaksdata continuity by creating a histogram:

hist(data$breaks) 

Tutorial: Poisson Regression in R (4)

Clearly, the data is not in the form of a bell curve like in a normal distribution.

Let’s check out themean()andvar()of the dependent variable:

mean(data$breaks) # calculate mean

Output: [1] 28.14815

var(data$breaks) # calculate variance

Output: [1] 174.2041

The variance is much greater than the mean, which suggests that we will have over-dispersion in the model.

Let’s fit the Poisson model using theglm()command.

# model poisson regression using glm()poisson.model < - glm(breaks ~ wool + tension, data, family = poisson(link = "log"))summary(poisson.model)</code>

summary() is a generic function used to produce result summaries of the results of various model fitting functions.

Output:

Call: glm(formula = breaks ~ wool + tension, family = poisson(link = "log"), data = data)Deviance Residuals: Min 1Q Median 3Q Max -3.6871 -1.6503 -0.4269 1.1902 4.2616 Coefficients:Estimate Std. Error z value Pr(>|z|) (Intercept) 3.69196 0.04541 81.302 < 2e-16 ***woolB -0.20599 0.05157 -3.994 6.49e-05 ***tensionM -0.32132 0.06027 -5.332 9.73e-08 ***tensionH -0.51849 0.06396 -8.107 5.21e-16 ***---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1(Dispersion parameter for poisson family taken to be 1)Null deviance: 297.37 on 53 degrees of freedomResidual deviance: 210.39 on 50 degrees of freedomAIC: 493.06Number of Fisher Scoring iterations: 4</code>

Interpreting the Poisson Model

We’ve just been given a lot of information, now we need to interpret it. The first column namedEstimateis the coefficient values ofα(intercept),β1and so on. Following is the interpretation for the parameter estimates:

  • exp(α)= effect on the meanμ, when X = 0
  • exp(β) = with every unit increase in X, the predictor variable has multiplicative effect ofexp(β) on the mean of Y, that isμ
  • Ifβ= 0, then exp(β) = 1, and the expected count isexp(α) and, Y and X are not related.
  • Ifβ> 0, then exp(β) > 1, and the expected count is exp(β) times larger than when X = 0
  • Ifβ< 0, then exp(β) < 1, and the expected count is exp(β) times smaller than when X = 0

Iffamily = poissonis kept inglm()then, these parameters are calculated usingMaximum Likelihood Estimation MLE.

R treats categorical variables as dummy variables. Categorical variables, also called indicator variables, are converted into dummy variables by assigning the levels in the variable some numeric representation.The general rule is that if there arekcategories in a factor variable, the output ofglm()will havek−1 categories with remaining 1 as the base category.

We can see in above summary that for wool, ‘A’ has been made the base and is not shown in summary. Similarly, for tension ‘L’ has been made the base category.

To see which explanatory variables have an effect on response variable, we will look at thepvalues. If thep is less than 0.05then, the variable has an effect on the response variable. In the summary above, we can see that all p values are less than 0.05, hence,bothexplanatory variables (wool and tension) have significant effect on breaks. Notice how R output used***at the end of each variable. The number of stars signifies significance.

Before starting to interpret results, let’s check whether the model has over-dispersion or under-dispersion. If theResidual Devianceis greater than the degrees of freedom, then over-dispersion exists. This means that the estimates are correct, but the standard errors (standard deviation) are wrong and unaccounted for by the model.

The Null deviance shows how well the response variable is predicted by a model that includes only the intercept (grand mean) whereas residual with the inclusion of independent variables. Above, we can see that the addition of 3 (53-50 =3) independent variables decreased the deviance to 210.39 from 297.37. Greater difference in values means a bad fit.

So, to have a more correct standard error we can use aquasi-poissonmodel:

poisson.model2 < - glm(breaks ~ wool + tension, data = data, family = quasipoisson(link = "log"))summary(poisson.model2)</code>

Output:

Call: glm(formula = breaks ~ wool + tension, family = poisson(link = "log"), data = data)Deviance Residuals: Min 1Q Median 3Q Max -3.6871 -1.6503 -0.4269 1.1902 4.2616 Coefficients:Estimate Std. Error z value Pr(>|z|) (Intercept) 3.69196 0.04541 81.302 < 2e-16 ***woolB -0.20599 0.05157 -3.994 6.49e-05 ***tensionM -0.32132 0.06027 -5.332 9.73e-08 ***tensionH -0.51849 0.06396 -8.107 5.21e-16 ***---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1(Dispersion parameter for poisson family taken to be 1)Null deviance: 297.37 on 53 degrees of freedomResidual deviance: 210.39 on 50 degrees of freedomAIC: 493.06Number of Fisher Scoring iterations: 4</code>

Comparing The Models:

Now that we’ve got two different models, let’s compare them to see which is better. First, we’ll install thearmlibrary because it contains a function we need:

# install.packages("arm")# load library arm that contains the function se.coef()library(arm)

Now we’ll use thatse.coef()function to extract the coefficients from each model, and then usecbind()combine those extracted values into a single dataframe so we can compare them.

(Video) 9.8 Poisson Regression in R: Fitting a Model To Count Data in R

# extract coefficients from first model using 'coef()'coef1 = coef(poisson.model) # extract coefficients from second modelcoef2 = coef(poisson.model2) # extract standard errors from first model using 'se.coef()'se.coef1 = se.coef(poisson.model) # extract standard errors from second modelse.coef2 = se.coef(poisson.model2)# use 'cbind()' to combine values into one dataframemodels.both < - cbind(coef1, se.coef1, coef2, se.coef2, exponent = exp(coef1)) # show dataframemodels.both</code>

Output:

coef1 se.coef1 coef2 se.coef2 exponent(Intercept) 3.6919631 0.04541069 3.6919631 0.09374352 40.1235380woolB -0.2059884 0.05157117 -0.2059884 0.10646089 0.8138425tensionM -0.3213204 0.06026580 -0.3213204 0.12440965 0.7251908tensionH -0.5184885 0.06395944 -0.5184885 0.13203462 0.5954198

In above output, we can see the coefficients are the same, but the standard errors are different.

Keeping these points in mind, let’s see estimate forwool. Its value is-0.2059884, and the exponent of-0.2059884is0.8138425.

1-0.8138425

Output:[1] 0.1861575

This shows that changing from type A wool to type B wool results in adecreasein breaks0.8138425times the intercept, because estimate -0.2059884 is negative. Another way of saying this is if we change wool type from A to B, the number of breaks will fall by 18.6% assuming all other variables are the same.

Predicting From The Model

Once the model is made, we can usepredict(model, data, type)to predict outcomes using new dataframes containing data other than the training data. Let’s look at an example.

# make a dataframe with new datanewdata = data.frame(wool = "B", tension = "M")# use 'predict()' to run model on new datapredict(poisson.model2, newdata = newdata, type = "response")

Output: [1] 23.68056

Our model is predicting there will be roughly24breaks with wool type B and tension level M.

Visualizing Findings Usingjtools

When you are sharing your analysis with others, tables are often not the best way to grab people’s attention. Plots and graphs help people grasp your findings more quickly. The most popular way to visualize data in R is probablyggplot2(which is taught inDataquest’s data visualization course), we’re also going to use an awesome R package calledjtoolsthat includes tools for specifically summarizing and visualizing regression models. Let’s usejtoolsto visualizepoisson.model2.

# Install the package jtools if not already installedinstall.packages("jtools")# you may be asked to install 'broom' and 'ggstance' packages as wellinstall.packages("broom")install.packages("ggstance")

jtoolsprovidesplot_summs()andplot_coefs()to visualize the summary of the model and also allows us to compare different models withggplot2.

# Include jtools librarylibrary(jtools)# plot regression coefficients for poisson.model2plot_summs(poisson.model2, scale = TRUE, exp = TRUE)

Tutorial: Poisson Regression in R (5)

# plot regression coefficients for poisson.model2 and poisson.modelplot_summs(poisson.model, poisson.model2, scale = TRUE, exp = TRUE)

Output:

Tutorial: Poisson Regression in R (6)

In above code, the plot_summs(poisson.model2, scale = TRUE, exp = TRUE)plots the second model using the quasi-poisson family inglm.

  • The first argument inplot_summs()is the regression model to be used, it may be one or more than one.
  • scalehelps with the problem of differing scales of the variables.
  • expis set to TRUE because for Poisson regression we are more likely to be interested in exponential values of estimates rather than linear.

You can find more details on jtools andplot_summs()here in the documentation.

We can also visualize the interaction between predictor variables.jtoolsprovides different functions for different types of variables. For example, if all the variables are categorical, we could usecat_plot()to better understand interactions among them. For continuous variables,interact_plot()is used.

In thewarpbreaksdata we have categorical predictor variables, so we’ll usecat_plot()to visualize the interaction between them, by giving it arguments specifying which model we’d like to use, the predictor variable we’re looking at, and the other predictor variable that it combines with to produce the outcome.

cat_plot(poisson.model2, pred = wool, modx = tension)# argument 1: regression model# pred: The categorical variable that will appear on x-axis# modx: Moderator variable that has an effect in combination to pred on outcome

Output:

Tutorial: Poisson Regression in R (7)

We can do the same thing to look at tension:

# using cat_plot. Pass poisson.model2 and we want to see effect of tension type so we set pred=tensioncat_plot(poisson.model2, pred = tension, modx = wool)

Output:

Tutorial: Poisson Regression in R (8)

Above, we see how the three different categories of tension (L, M, and H) for each affects breaks with each wool type. For example, breaks tend to be highest with low tension and type A wool.

We can also define the type of plot created bycat_plot()using thegeomparameter. This parameter enhances the interpretation of plot. We can use it like so, passinggeomas an additional argument tocat_plot:

cat_plot(poisson.model2, pred = tension, modx = wool, geom = "line")

Output:

Tutorial: Poisson Regression in R (9)

We can also to include observations in the plot by adding plot.points = TRUE:

cat_plot(poisson.model2, pred = tension, modx = wool, geom = "line", plot.points = TRUE)

Output:

Tutorial: Poisson Regression in R (10)

There are lots of other design options, including line style, color, etc, that will allow us to customize the appearance of these visualizations. For specifics, consult the jtools documentationhere.

(Video) Poisson Regression Part I | Statistics for Applied Epidemiology | Tutorial 9

Poisson Regression Modeling Using Rate Data

So far this in this tutorial, we have modeled count data, but we can also model rate data that is predicting the number of counts over a period of time or grouping. Formula for modelling rate data is given by:

log(X/n)=β0 + ∑iβiXi

This is equivalent to: (applying log formula)

log(X)−log(n)=β0 + ∑iβiXi

log(X)=log(n)+β0 + ∑iβiXi

Thus, rate data can be modeled by including thelog(n)term with coefficient of 1. This is called anoffset. This offset is modelled withoffset()in R.

Let’s use another a dataset calledeba1977from theISwR packageto model Poisson Regression Model for rate data. First, we’ll install the package:

# install.packages("ISwR")library(ISwR)
## Warning: package 'ISwR' was built under R version 3.4.4

Now, let’s take a look at some details about the data, and print the first ten rows to get a feel for what the dataset includes.

data(eba1977)cancer.data = eba1977cancer.data[1:10, ]# Description# Lung cancer incidence in four Danish cities 1968-1971# Description:# This data set contains counts of incident lung cancer cases and# population size in four neighbouring Danish cities by age group.# Format:# A data frame with 24 observations on the following 4 variables: # city a factor with levels Fredericia, Horsens, Kolding, and Vejle.# age a factor with levels 40-54, 55-59, 60-64, 65-69,70-74, and 75+.# pop a numeric vector, number of inhabitants.# cases a numeric vector, number of lung cancer cases.

Output:

city age pop cases1 Fredericia 40-54 3059 112 Horsens 40-54 2879 133 Kolding 40-54 3142 44 Vejle 40-54 2520 55 Fredericia 55-59 800 116 Horsens 55-59 1083 67 Kolding 55-59 1050 88 Vejle 55-59 878 79 Fredericia 60-64 710 1110 Horsens 60-64 923 15

To model rate data, we useX/nwhereXis the event to happen andnis the grouping. In this example,X=cases(the event is a case of cancer) andn=pop(the population is the grouping).

As in the formula above, rate data is accounted bylog(n) and in this datanis population, so we will find log of population first. We can model forcases/populationas follows:

# find the log(n) of each value in 'pop' column. It is the third columnlogpop = log(cancer.data[ ,3])# add the log values to the dataframe using 'cbind()'new.cancer.data = cbind(cancer.data, logpop)# display new dataframenew.cancer.data

Output:

city age pop cases logpop1 Fredericia 40-54 3059 11 8.0258432 Horsens 40-54 2879 13 7.9651983 Kolding 40-54 3142 4 8.0526154 Vejle 40-54 2520 5 7.8320145 Fredericia 55-59 800 11 6.6846126 Horsens 55-59 1083 6 6.9874907 Kolding 55-59 1050 8 6.9565458 Vejle 55-59 878 7 6.7776479 Fredericia 60-64 710 11 6.56526510 Horsens 60-64 923 15 6.82762911 Kolding 60-64 895 7 6.79682412 Vejle 60-64 839 10 6.73221113 Fredericia 65-69 581 10 6.36475114 Horsens 65-69 834 10 6.72623315 Kolding 65-69 702 11 6.55393316 Vejle 65-69 631 14 6.44730617 Fredericia 70-74 509 11 6.23244818 Horsens 70-74 634 12 6.45204919 Kolding 70-74 535 9 6.28226720 Vejle 70-74 539 8 6.28971621 Fredericia 75+ 605 10 6.40522822 Horsens 75+ 782 2 6.66185523 Kolding 75+ 659 12 6.49072424 Vejle 75+ 619 7 6.428105

Now, let’s model the rate data withoffset().

poisson.model.rate < - glm(cases ~ city + age+ offset(logpop), family = poisson(link = "log"), data = cancer.data)#display summarysummary(poisson.model.rate)</code>

Output:

Call:glm(formula = cases ~ city + age + offset(logpop), family = poisson(link = "log"), data = cancer.data)Deviance Residuals: Min 1Q Median 3Q Max -2.63573 -0.67296 -0.03436 0.37258 1.85267 Coefficients:Estimate Std. Error z value Pr(>|z|) (Intercept) -5.6321 0.2003 -28.125 < 2e-16 ***cityHorsens -0.3301 0.1815 -1.818 0.0690 . cityKolding -0.3715 0.1878 -1.978 0.0479 * cityVejle -0.2723 0.1879 -1.450 0.1472 age55-59 1.1010 0.2483 4.434 9.23e-06 ***age60-64 1.5186 0.2316 6.556 5.53e-11 ***age65-69 1.7677 0.2294 7.704 1.31e-14 ***age70-74 1.8569 0.2353 7.891 3.00e-15 ***age75+ 1.4197 0.2503 5.672 1.41e-08 ***---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1(Dispersion parameter for poisson family taken to be 1)Null deviance: 129.908 on 23 degrees of freedomResidual deviance: 23.447 on 15 degrees of freedomAIC: 137.84Number of Fisher Scoring iterations: 5</code>

In this dataset, we can see that the residual deviance is near to degrees of freedom, and the dispersion parameter is1.5 (23.447/15)which is small, so the model is a good fit.

We usefitted(model)to return values fitted by the model. It returns outcomes using the training data on which the model is built. Let’s give it a try:

fitted(poisson.model.rate) 
1 2 3 4 5 6 7 8 10.954812 7.411803 7.760169 6.873215 8.615485 8.384458 7.798635 7.201421 9 10 11 12 13 14 15 16 11.609373 10.849479 10.092831 10.448316 12.187276 12.576313 10.155638 10.080773 17 18 19 20 21 22 23 24 11.672630 10.451942 8.461440 9.413988 8.960422 8.326004 6.731286 6.982287

Using this model, we can predict the number of cases per 1000 population for a new data set, using thepredict()function, much like we did for our model of count data previously:

# create a test dataframe containing new values of variablestest.data = data.frame(city = "Kolding", age = "40-54", pop = 1000, logpop = log(1000)) # predict outcomes (responses) using 'predict()' predicted.value < - predict(poisson.model.rate, test.data, type = "response") # show predicted valuepredicted.value</code>

Output: [1] 2.469818

So,for the city of Kolding among people in the age group 40-54, we could expect roughly 2 or 3 cases of lung cancer per 1000 people.

As with the count data, we could also use quasi-poisson to get more correct standard errors with rate data, but we won’t repeat that process for the purposes of this tutorial.

Conclusion

Poisson regression models have great significance in econometric and real world predictions. In this tutorial, we’ve learned about Poisson Distribution, Generalized Linear Models, and Poisson Regression models.

We also learned how to implement Poisson Regression Models for both count and rate data in R usingglm(), and how to fit the data to the model to predict for a new dataset. Additionally, we looked at how to get more accurate standard errors inglm() usingquasipoissonand saw some of the possibilities available for visualization withjtools.

[thrive_leads id=’21203′]

References

  1. https://stat.ethz.ch/R-manual/R-devel/library/stats/html/Poisson.html
  2. https://www.theanalysisfactor.com/generalized-linear-models-in-r-part-6-poisson-regression-count-variables/
  3. https://stats.idre.ucla.edu/r/dae/poisson-regression/
  4. https://www.rdocumentation.org/packages/base/versions/3.5.2/topics/summary

Ready to level up your R skills?

Our Data Analyst in R path covers all the skills you need to land a job, including:

  • Data visualization with ggplot2
  • Advanced data cleaning skills with tidyverse packages
  • Important SQL skills for R users
  • Fundamentals in statistics and probability
  • ...and much more

There's nothing to install, no prerequisites, and no schedule.

Start learning for free!

poisson regression in RrR for data scienceR projectsR tutorialrstatsTutorials

(Video) Poisson Regression with R - Insect Sprays

FAQs

How do you do Poisson regression? ›

Poisson Regression Analysis in SPSS with Assumption Testing - YouTube

What is the function used for Poisson regression in R? ›

The function used to create the Poisson regression model is the glm() function.

How do you find the Poisson distribution in R? ›

The Poisson Distribution in R - YouTube

For which situations can you use Poisson regression? ›

Poisson regression applies where the response variable is a count of events (e.g. crime incidents, cases of a disease) rather than a continuous variable. This model may also be applied to standardized counts or “rates”, such as disease incidence per capita, species of tree per square kilometer.

What is the difference between logistic regression and Poisson regression? ›

Poisson regression is most commonly used to analyze rates, whereas logistic regression is used to analyze proportions. The chapter considers statistical models for counts of independently occurring random events, and counts at different levels of one or more categorical outcomes.

How do you analyze Poisson distribution? ›

The Formula for the Poisson Distribution Is

e is Euler's number (e = 2.71828...) x is the number of occurrences. x! is the factorial of x. λ is equal to the expected value (EV) of x when that is also equal to its variance.

How do you check Poisson regression assumptions? ›

9.11 Poisson Regression: Model Assumptions - YouTube

How do you apply Poisson distribution in R in a data set? ›

Poisson Distribution in R | R Tutorial 3.2 | MarinStatsLectures - YouTube

How do you use a Poisson model? ›

The Poisson Distribution formula is: P(x; μ) = (e-μ) (μx) / x! Let's say that that x (as in the prime counting function is a very big number, like x = 10100. If you choose a random number that's less than or equal to x, the probability of that number being prime is about 0.43 percent.

How do you regress two variables in R? ›

Multiple Linear Regression in R | R Tutorial 5.3 | MarinStatsLectures

How do I know if my data is Poisson in R? ›

How to know if a data follows a Poisson Distribution in R?
  1. The number of outcomes in non-overlapping intervals are independent. ...
  2. The probability of two or more outcomes in a sufficiently short interval is virtually zero.

How do you use a Poisson model? ›

The Poisson Distribution formula is: P(x; μ) = (e-μ) (μx) / x! Let's say that that x (as in the prime counting function is a very big number, like x = 10100. If you choose a random number that's less than or equal to x, the probability of that number being prime is about 0.43 percent.

How do you model Poisson distribution? ›

The Poisson Distribution - explained with examples and illustrated using ...

How do you analyze Poisson distribution? ›

The Formula for the Poisson Distribution Is

e is Euler's number (e = 2.71828...) x is the number of occurrences. x! is the factorial of x. λ is equal to the expected value (EV) of x when that is also equal to its variance.

Videos

1. Poisson Regression Example / Workout in R n Detail Interpretation of Output
(Gopal Prasad Malakar)
2. GLM in R: Poisson regression | crime data | fuller version
(Phil Chan)
3. 9.10 Poisson Regression in R: Fitting a Model To Rate Data (with offset) in R
(MarinStatsLectures-R Programming & Statistics)
4. Unit #6 Lesson 12: Poisson regression in R
(Brian Zaharatos)
5. Poisson Distribution in R | R Tutorial 3.2 | MarinStatsLectures
(MarinStatsLectures-R Programming & Statistics)
6. Estimating Poisson Regressions in R Studio
(Noman Arshed)

You might also like

Latest Posts

Article information

Author: Jerrold Considine

Last Updated: 09/14/2022

Views: 6218

Rating: 4.8 / 5 (78 voted)

Reviews: 85% of readers found this page helpful

Author information

Name: Jerrold Considine

Birthday: 1993-11-03

Address: Suite 447 3463 Marybelle Circles, New Marlin, AL 20765

Phone: +5816749283868

Job: Sales Executive

Hobby: Air sports, Sand art, Electronics, LARPing, Baseball, Book restoration, Puzzles

Introduction: My name is Jerrold Considine, I am a combative, cheerful, encouraging, happy, enthusiastic, funny, kind person who loves writing and wants to share my knowledge and understanding with you.