2.2. Statistical analysis

Maryam Ganjavi
Bahram Faraji

The objective of this study is shown schematically in Figure 2. We seek a model, as shown in Figure 2a, that relates the cancer rate in a given year to food consumption in that year and the preceding years (lag effect). Such a model can take various mathematical forms; here, we consider a linear model. In this linear model, the food consumed in the year of cancer diagnosis and in each preceding year is multiplied by a coefficient (the βs in Figure 2), and the products are summed to estimate the cancer rate. In statistical terms, the coefficients β are the linear parameters to be estimated, the cancer rate is the dependent variable, and the food consumptions are the predictors.

Schematic representation of the study.

As an example, if a lag of 2 years is considered, the problem is to find three unknown βs and one unknown α. Therefore, at least four equations are needed to solve the 2-year-lag problem. An example of such a problem and the method for constructing the equations are shown in Figure 3, where the colorectal cancer rate in each year (CRCt) is related to meat consumption in that year and the two preceding years (t, t-1, and t-2).

Distributed lag model for colorectal cancer and meat consumption with a lag of 2 years. The model relates the cancer rate in a given year to the meat consumed in that year and in the 2 years prior to diagnosis.
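Written out, the 2-year-lag model described above has one intercept and three lag coefficients. The following is a sketch of that equation and its general n-year form; the symbols Meat (meat consumption) and ε (error term) are notation assumed here for illustration and do not appear in the original text.

```latex
\begin{align}
  \mathrm{CRC}_t &= \alpha + \beta_0\,\mathrm{Meat}_t + \beta_1\,\mathrm{Meat}_{t-1}
                    + \beta_2\,\mathrm{Meat}_{t-2} + \varepsilon_t
                    && \text{(lag of 2 years)} \\
  \mathrm{CRC}_t &= \alpha + \sum_{i=0}^{n} \beta_i\,\mathrm{Meat}_{t-i} + \varepsilon_t
                    && \text{(lag of } n \text{ years)}
\end{align}
```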

In Figure 3a, as a demonstration, we choose the cancer rate at t = 1975; the goal is to find the effect of food consumption in 1975 and the two preceding years (lag = 2). Therefore, we need to calculate the βs for 1975, 1974, and 1973. In Figure 3b we set t = 1976, and similarly in Figures 3c and 3d we set t = 1977 and t = 1978, respectively. All of the data up to t = 2013 can be used to construct the regression model for estimating the βs and α. The maximum lag length is limited by the number of available data points (observations); preferably, the lag length should be considerably smaller than the number of observations. An ordinary least squares fit is used to solve the equations.
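The construction of one equation per year and the least squares solution can be sketched in a few lines of Python. This is only an illustration of an unrestricted distributed lag fit, not the authors' SAS workflow: the function names `lagged_design` and `fit_distributed_lag` are hypothetical, and the series below is synthetic, noise-free data generated from made-up coefficients.

```python
import numpy as np

def lagged_design(x, lag):
    """Rows [1, x_t, x_{t-1}, ..., x_{t-lag}] for every t that has a full history."""
    x = np.asarray(x, dtype=float)
    rows = [np.concatenate(([1.0], x[t - lag:t + 1][::-1]))
            for t in range(lag, len(x))]
    return np.array(rows)

def fit_distributed_lag(x, y, lag):
    """Ordinary least squares estimate of alpha and beta_0..beta_lag."""
    X = lagged_design(x, lag)
    y_used = np.asarray(y, dtype=float)[lag:]   # drop years without full history
    coef, *_ = np.linalg.lstsq(X, y_used, rcond=None)
    return coef[0], coef[1:]

# Synthetic example (NOT the study's data): 44 yearly values, lag of 2 years.
rng = np.random.default_rng(0)
meat = rng.normal(50.0, 5.0, size=44)           # hypothetical consumption series
true_alpha = 10.0                               # made-up intercept
true_betas = np.array([0.5, 0.3, 0.2])          # made-up beta_0, beta_1, beta_2
X = lagged_design(meat, 2)
crc = np.concatenate((np.full(2, np.nan),       # first 2 years have no full history
                      X @ np.concatenate(([true_alpha], true_betas))))

alpha_hat, betas_hat = fit_distributed_lag(meat, crc, lag=2)
print(alpha_hat, betas_hat)
```

With 44 observations and a lag of 2, the design matrix has 42 rows (one equation per usable year) and 4 columns (α and three βs), well above the minimum of four equations.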

The lag length is chosen by iteration, such that i) the t-statistic of each coefficient is significant, ii) R² is high, and iii) the Akaike information criterion (AIC) is low (see Appendix A and Gujarati and Porter 1999).
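The iteration over lag lengths can be illustrated with a small scan that reports the t-statistics, R² and AIC for each candidate lag. This is a minimal sketch under several assumptions: it uses plain unrestricted OLS rather than the procedure used in the study, the AIC is computed with the common Gaussian form n·ln(RSS/n) + 2k (other software may use a different constant), all lags are fitted on the same years so that the AIC values are comparable, and the function names and data are hypothetical.

```python
import numpy as np

def ols_summary(X, y):
    """Coefficients, t-statistics, R^2 and AIC of an OLS fit."""
    n_obs, n_par = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    rss = float(resid @ resid)
    tss = float(((y - y.mean()) ** 2).sum())
    sigma2 = rss / (n_obs - n_par)               # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)        # coefficient covariance
    t_stats = coef / np.sqrt(np.diag(cov))
    aic = n_obs * np.log(rss / n_obs) + 2 * n_par
    return {"coef": coef, "t": t_stats, "r2": 1.0 - rss / tss, "aic": aic}

def scan_lags(x, y, max_lag):
    """Fit lags 0..max_lag on a common sample (years with max_lag years of history)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    results = {}
    for lag in range(max_lag + 1):
        X = np.array([np.concatenate(([1.0], x[t - lag:t + 1][::-1]))
                      for t in range(max_lag, len(x))])
        results[lag] = ols_summary(X, y[max_lag:])
    return results

# Synthetic example (NOT the study's data): the true lag is 2 years.
rng = np.random.default_rng(1)
meat = rng.normal(50.0, 5.0, size=44)
crc = np.full(44, np.nan)
for t in range(2, 44):
    crc[t] = (10.0 + 0.5 * meat[t] + 0.3 * meat[t - 1] + 0.2 * meat[t - 2]
              + rng.normal(0.0, 0.1))

results = scan_lags(meat, crc, max_lag=6)
for lag, res in results.items():
    print(lag, round(res["r2"], 4), round(res["aic"], 2))
best_by_aic = min(results, key=lambda k: results[k]["aic"])
```

In practice the three criteria are weighed together: a lag that lowers the AIC only marginally while adding coefficients with non-significant t-statistics would not necessarily be preferred.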

We also imposed a restriction on the endpoints so that β₋₁ = 0 and βₙ₊₁ = 0. In our case, this means that the food consumed in years t+1 and t-(n+1) has no effect on the cancer rate in year t. For example, if the lag is 20 years, neither the food consumed next year nor the food consumed 21 years ago affects the current year's cancer rate.
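To see what the endpoint restriction does, it helps to work through the simplest case. The sketch below assumes that the lag coefficients follow a quadratic polynomial in the lag index i; the polynomial degree is not stated in the text, so degree 2 is purely an illustrative assumption, as are the symbols a₀, a₁, a₂, c, x and ε.

```latex
\begin{align}
  % Assumed quadratic lag polynomial (degree not given in the source)
  \beta_i &= a_0 + a_1 i + a_2 i^2, \qquad i = 0, 1, \dots, n \\
  % Endpoint restrictions force roots at i = -1 and i = n + 1
  \beta_{-1} = 0,\ \beta_{n+1} = 0
  \;&\Longrightarrow\;
  \beta_i = c\,(i + 1)(n + 1 - i) \\
  % The lag coefficients then depend on a single free parameter c
  \mathrm{CRC}_t &= \alpha + c \sum_{i=0}^{n} (i + 1)(n + 1 - i)\, x_{t-i} + \varepsilon_t
\end{align}
```

Under this assumption, the n + 1 lag coefficients are determined by a single shape parameter, which is why restricted fits remain feasible even when the lag is long relative to the number of observations.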

The analysis was done on the effects of red meat, vegetables, and fruits on the CRC rate. The number of data pairs for each analysis is 44. Therefore, according to equation 2, the maximum lag (n) can be as large as 22. However, the larger the lag, the wider the t-distribution, which leads to larger p-values. We therefore limited the lag length to a maximum of 20 years. For the same reason, the effects of the different foods on CRC had to be studied separately.

Calculations were performed with the PDLREG procedure of SAS 9.3 (SAS Institute, Cary, NC).
