Logistic regression of potential explanatory variables on citation counts

Yassine Gargouri & Stevan Harnad

Cognition & communication Laboratory

(Last update: 11/04/2009)

 

The number of citations an article receives ("citation counts") can be influenced by or correlated with a variety of variables. A logistic regression analysis has been conducted to study the correlation between citation counts (as dependent variable) and the following set of potential correlator/predictor variables:

The metadata for the articles were collected from our four institutional archives, as well as from the ISI database. Citation counts were extracted from ISI (in November, 2008). For each mandated article Mi, we collected all corresponding articles Nj published in the same journal, volume and year as controls.

In order to reduce our article sample to a reasonable processing size, we limited the number of journal/volume/year-matched articles  to 10 articles Nj that were semantically close to Mi. This narrowing of content should also make the control articles more comparable than using the entire spectrum of the journal's content. (The semantic closeness is computed based on shared words in titles, omitting stop words.)
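The control selection described above can be sketched roughly as follows. This is only an illustration: the stop-word list, the scoring by raw count of shared words, and the function names are assumptions, since the exact closeness measure is not specified here.

```python
import re

# Assumption: a small illustrative stop-word list; the list actually used is not given.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "with", "by"}

def title_words(title):
    """Lower-case a title, split it into words, and drop stop words."""
    return {w for w in re.findall(r"[a-z0-9]+", title.lower()) if w not in STOP_WORDS}

def closest_controls(mandated_title, candidate_titles, k=10):
    """Return up to k candidate titles sharing the most non-stop words with the mandated title."""
    target = title_words(mandated_title)
    scored = [(len(target & title_words(t)), t) for t in candidate_titles]
    # Sort by shared-word count, highest first; sorted() is stable, so ties keep input order.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [t for _, t in ranked[:k]]

# Hypothetical titles, for illustration only.
controls = closest_controls(
    "Open access and citation impact of physics articles",
    ["Citation impact in physics", "Protein folding dynamics", "Open access mandates"],
    k=2,
)
print(controls)
```

In this sketch the candidates are assumed to have already been restricted to the same journal, volume and year as the mandated article.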

Journals that are OA are excluded from our sample because our OA/non-OA comparisons are all within-journal comparisons (hence there would be no control articles in an OA journal). Based on the Directory of Open Access Journals (DOAJ), which presently indexes 3563 OA journals, only 2.10% of the ISI journals in our sample were OA. Those journals were excluded from our analysis.

The total size of the article sample (6215 Mandated, M, and 20982 corresponding controls, N) from 2002 to 2006 was 27197.

The full-text OA status of the articles in our sample was verified using an automated webwide search-robot (Hajjem et al., 2005). The result was consolidated using another robot based on Google Scholar search.

Rather than comparing the regression models separately for science and social science, we added a dichotomous variable (Sci) as a predictor variable.

About 32% of the articles in our sample have at least 1 self-citation with an average of about 2 self-citations per article. We accordingly excluded all self-citations from the citation counts.

Citation counts are not normally distributed, particularly because of the many articles with zero citations, and they cannot be successfully transformed into a normal distribution. Fig.1 shows the distribution of citation counts (minus self-citations). We therefore used binary logistic regression analysis, with a dichotomous dependent variable taking the value 0 if the article has no citations and 1 if it has at least 1 citation.

Fig.1 Citation count distribution

 

1. Logistic regression:

We used stepwise logistic regression, for each test selecting the model that maximizes the chi-square likelihood ratio.

To make the coefficients easier to interpret, we exponentiated the ß coefficients (Exp(ß)) and interpreted them as odds ratios. For example, for the first model, a one-unit increase in OA multiplies the odds of receiving 1-5 citations (versus zero citations) by a factor of 0.957, i.e., a slight decrease, since Exp(ß) < 1.
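As a small worked illustration of reading Exp(ß) as an odds ratio, the sketch below converts a coefficient to an odds ratio and applies it to a baseline probability. The baseline probability of 0.5 and the function names are assumptions made for the example; only the value 0.957 comes from the text.

```python
import math

def odds_ratio(beta):
    """Exponentiate a logistic regression coefficient to obtain Exp(ß)."""
    return math.exp(beta)

def shift_probability(p, ratio):
    """Multiply the odds p/(1-p) by an odds ratio and convert back to a probability."""
    odds = p / (1 - p)
    new_odds = odds * ratio
    return new_odds / (1 + new_odds)

# The coefficient whose Exp(ß) is 0.957 (the OA value quoted for the first model).
beta_oa = math.log(0.957)

# Hypothetical baseline: an article with a 50% chance of being in the 1-5 range.
baseline = 0.5
shifted = shift_probability(baseline, odds_ratio(beta_oa))
print(round(odds_ratio(beta_oa), 3), shifted < baseline)
```

An Exp(ß) above 1 would raise the probability in the same way, and an Exp(ß) of exactly 1 would leave it unchanged.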

The following table (Tab.1) reports the Exp(ß) values for each model having "Cit_a_x-y&y-z" as its dependent variable (x, y, z ∈ {1, 2, 3, ..., 20}), where Cit_a_x-y&y-z = 1 if the citation count (minus self-citations) is between y and z, and 0 if it is between x and y. Models are referred to as "M_r".
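The construction of these dichotomous dependent variables can be sketched as below. How boundary values are assigned (for example, whether a count of exactly 5 belongs to 1-5 or to 5-10) is not stated, so the half-open intervals used here are an assumption, as are the function names.

```python
def in_range(count, bounds):
    """True if lo <= count < hi; hi=None means an open-ended range such as 20+."""
    lo, hi = bounds
    return count >= lo and (hi is None or count < hi)

def dependent_value(count, low_range, high_range):
    """Return 1 for the higher range, 0 for the lower range, None if the article is in neither."""
    if in_range(count, high_range):
        return 1
    if in_range(count, low_range):
        return 0
    return None  # the article does not enter this model

# Cit_a_1-5&5-10 (model M_2), applied to hypothetical citation counts.
counts = [0, 3, 5, 9, 12]
m2_values = [dependent_value(c, (1, 5), (5, 10)) for c in counts]
print(m2_values)

# Cit_a_1-5&20+ (model M_4): the upper range is open-ended.
m4_value = dependent_value(25, (1, 5), (20, None))
```

Articles whose counts fall outside both ranges are simply left out of that model in this sketch.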

The Exp(ß) values of variables turned out to have the same polarity and to be quite similar, with and without self-citations.

The figure (Fig.2) shows that citation count is positively correlated with IF, Age, Ref_N, Auth_N, OA, USA and M. In other words:

- The higher the IF of the journal in which it was published, the higher an article's citation count.

- The longer since an article was published, the higher its citation count.

- The more references an article cites, the higher its citation count.

- The more co-authors an article has, the higher its citation count.

- Articles that are made OA have higher citation counts. This small but significant independent OA effect is present in every citation range, but it is greatest in the highest citation range (1-5 citations vs 20+ citations): the OA advantage is strongest for highly cited articles.

- Articles from authors at institutions that have Mandates have higher citation counts; this effect is present only in the medium-high citation ranges (and is of course confounded with the level of author compliance with the institutional Mandate, discussed further below).

- Review articles have higher citation counts; the effect is greater, the higher the citation range.

- CERN articles have higher citation counts in the lowest and especially the highest citation range. (However, when all CERN articles are excluded from our sample, there is no significant change in the other variables).

 

Model N. Dependent V. Age IF Ref_N Auth_N Page_N OA M USA Review Sci CERN South Minho Queens Age*OA
M_1 Cit_a_0&1-5 1,494 2,229 1,020 1,007 0,993 0,957     0,627 1,249 0,789     1,476 1,209
M_2 Cit_a_1-5&5-10 1,490 1,514 1,016 1,002 0,986 1,323 1,889 1,415 0,777 1,475          
M_3 Cit_a_1-5&10-20 1,786 1,776 1,020 1,002 0,992 1,392 1,716 1,406 0,992 1,887
M_4 Cit_a_1-5&20+ 2,439 2,114 1,019 0,999   8,953   1,860 1,914 3,050 2,306       0,968

Tab.1:  The Exp(ß) values for logistic regressions

* Bold: significance < 0,01

* Italic: significance between 0,01 and 0,05

 

Fig.2:  The Exp(ß) values for logistic regressions

There is a significant interaction between Age and OA (Age*OA) for the low citation interval (between 1 and 5) as well as for the high citation interval (20 citations and more). Both the linear main effects of Age and OA and this nonlinear interaction are significant. The following figure (Fig.3) shows the citation mean (Cit_a_1-5&20+) for OA and NOA articles at each Age value. This figure confirms the OA advantage: the difference between the OA and NOA lines is greater for older articles.

Fig.3: The citation count means of Age and OA
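To illustrate how an Age*OA interaction term is read, the sketch below computes the OA odds ratio at different article ages. It assumes the usual form in which the model contains ß_OA·OA + ß_int·(Age·OA), and the Exp(ß) values used are hypothetical, chosen only for illustration and not taken from Tab.1.

```python
def oa_odds_ratio_at_age(exp_oa, exp_interaction, age):
    """Odds ratio for OA vs non-OA at a given article age.

    Assumes the model contains the terms ß_OA*OA + ß_int*(Age*OA), so the
    OA effect on the log-odds at a given age is ß_OA + ß_int*age.
    """
    return exp_oa * exp_interaction ** age

# Hypothetical Exp(ß) values, for illustration only.
EXP_OA = 1.2
EXP_INTERACTION = 1.1

for age in (1, 3, 5):
    print(age, round(oa_odds_ratio_at_age(EXP_OA, EXP_INTERACTION, age), 3))
```

With an interaction Exp(ß) above 1, the OA odds ratio grows with age; with a value below 1, it shrinks with age.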

 

 

2. Logistic Regression by Impact Factor interval:

In order to compare articles belonging to comparable journals, we divided our sample into 4 quartile ranges by journal Impact Factor (IF), each range covering 25% of the articles:

        IF_1 :       0 ≤ IF < 0.633

        IF_2 :   0.633 ≤ IF < 1.053

        IF_3 :   1.053 ≤ IF < 1.782

        IF_4 :   1.782 ≤ IF ≤ 29.957

The top quartile alone spans journals with IFs from 1.782 to 29.957. As we are also interested in the variability within this quartile, we further subdivided it into two subgroups, each covering 12.5% of all the articles. Subdividing more finely would make the sample sizes too small to detect effects of interest. This yields 5 IF ranges:

        IF_1 :       0 ≤ IF < 0.633

        IF_2 :   0.633 ≤ IF < 1.053

        IF_3 :   1.053 ≤ IF < 1.782

        IF_4 :   1.782 ≤ IF < 2.468

        IF_5 :   2.468 ≤ IF ≤ 29.957
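Assigning a journal to one of these five ranges can be sketched as a simple boundary lookup. The function name is hypothetical, and the boundary between IF_2 and IF_3 is taken to be 1.053, as in sections 2.2 and 2.3 below.

```python
from bisect import bisect_right

# Lower bounds of IF_2 to IF_5; IF_1 covers everything below the first bound.
IF_BOUNDS = [0.633, 1.053, 1.782, 2.468]

def if_range(impact_factor):
    """Return the label IF_1 to IF_5 for a journal impact factor.

    Each range includes its lower bound and excludes its upper bound.
    """
    return "IF_%d" % (bisect_right(IF_BOUNDS, impact_factor) + 1)

# Hypothetical impact factors, for illustration only.
labels = [if_range(x) for x in (0.5, 0.633, 1.2, 2.0, 29.957)]
print(labels)
```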

The same regression is done separately for each IF range, controlling for all the variables (except IF). The following tables summarize the values of Exp(ß) for the controlled variables in each IF range.

Our earlier remark also applies to these regressions: Exp(ß) values of variables have the same polarity and pattern whether or not we exclude self-citations from the citations count.

2.1   IF 1

When articles are published in a low IF journal, article citation counts are positively correlated with Age, Ref_N, Auth_N, OA and M. The OA effect increases for higher citation count intervals. For the low article citation range, the Age*OA interaction is significant, but OA itself is not.

Model N. Dependent Var. Age Ref_N Auth_N Page_N OA M USA Review Sci CERN South Minho Queens Age*OA
M_1 Cit_a_0&1-5 1,537 1,017 1,079                 0,701   1,093
M_2 Cit_a_1-5&5-10 1,847 1,013 1,066     1,881               1,059
M_3 Cit_a_1-5&10-20 2,071 1,026 1,054 0,962 1,533 1,902                
M_4 Cit_a_1-5&20+ 2,689 1,020 1,087   2,406     4,760 3,214          

Tab.2:  The Exp(ß) values for logistic regressions (IF 1)

Fig.4:  The Exp(ß) values for logistic regressions (IF 1)

 

2.2   IF 2

For articles in journals with IFs between 0.633 and 1.053, the pattern is quite similar, except that the Age*OA interaction is absent and OA itself (alongside Age, as separate variables) is significant.

 

Model N. Dependent Var. Age Ref_N Auth_N Page_N OA M USA Review Sci CERN South Minho Queens Age*OA
M_1 Cit_a_0&1-5 1,407 1,016 1,028     1,265   0,605   0,511        
M_2 Cit_a_1-5&5-10 1,548 1,012     1,346 1,963                
M_3 Cit_a_1-5&10-20 1,869 1,018 1,007   1,337 1,722                
M_4 Cit_a_1-5&20+ 2,117 1,011     2,322     3,106            

Tab.3:  The Exp(ß) values for logistic regressions (IF 2)

 

Fig.5:  The Exp(ß) values for logistic regressions (IF 2)

 

 

2.3   IF 3

For articles in journals with IFs between 1.053 and 1.782, the pattern is again quite similar. The USA and Review variables now also correlate with citation increase. In this IF range, some institutions (QUT, Southampton and CERN) have a small citation advantage. However, removing the articles from any one of these institutions does not change the pattern for the other variables.

Model N. Dependent Var. Age Ref_N Auth_N Page_N OA M USA Review Sci CERN South Minho Queens Age*OA
M_1 Cit_a_0&1-5 1,581 1,012 1,032   1,236         0,401     1,856  
M_2 Cit_a_1-5&5-10 1,540 1,007 1,033     1,428 1,330              
M_3 Cit_a_1-5&10-20 1,879 1,013 1,026   1,263               1,382  
M_4 Cit_a_1-5&20+ 2,305 1,009 1,041 1,026 1,449 1,492 1,791 1,939     3,734      

Tab.4:  The Exp(ß) values for logistic regressions (IF 3)

 

 

Fig.6:  The Exp(ß) values for logistic regressions (IF 3)

 

2.4   IF 4

For journals with IFs between 1.782 and 2.468, longer articles (Page_N) have more citations. The OA citation advantage is only significant for the higher citation count ranges. Also, the number of co-authors (Auth_N) is less correlated with increased citations as the citation range gets higher. CERN has a citation advantage in this IF range. However, removing CERN articles does not change the pattern for the other variables. 

Model N. Dependent Var. Age Ref_N Auth_N Page_N OA M USA Review Sci CERN South Minho Queens Age*OA
M_1 Cit_a_0&1-5 1,690 1,020             2,090 0,554 0,233      
M_2 Cit_a_1-5&5-10 1,427 1,010       1,645       1,657        
M_3 Cit_a_1-5&10-20 1,800 1,019       1,729       1,615        
M_4 Cit_a_1-5&20+ 2,540 1,024 0,994 1,028 1,747   1,822     3,974        

Tab.5:  The Exp(ß) values for logistic regressions (IF 4)

 

 

Fig.7:  The Exp(ß) values for logistic regressions (IF 4)

 

2.5   IF 5

For journals with IFs between 2.468 and 29.957, the OA advantage is significant for the highest citation ranges. The increased citations for USA and Review articles are more significant.

Model N. Dependent Var. Age Ref_N Auth_N Page_N OA M USA Review Sci CERN South Minho Queens Age*OA
M_1 Cit_a_0&1-5 1,484 1,016           0,182   0,446        
M_2 Cit_a_1-5&5-10 1,312 1,010   0,976   1,468 1,391 0,586            
M_3 Cit_a_1-5&10-20 1,590 1,007 0,998       1,360           1,751  
M_4 Cit_a_1-5&20+ 2,259 1,009 0,995   1,722   1,635 1,650 2,007          

Tab.6:  The Exp(ß) values for logistic regressions (IF 5)

 

 

Fig.8:  The Exp(ß) values for logistic regressions (IF 5)

 

Conclusion:

Overall, OA is correlated with a significant citation advantage for all journal IF intervals as well as for the sample as a whole. This advantage is greater for the higher citation ranges.

There is no significant effect of any specific institution compared to the rest of the institutions; hence there is no need to exclude any specific institution from our sample.

When regressions are done separately for the different IF ranges, the Age*OA interaction disappears, but OA and Age (as separate variables) are significant.

References

Hajjem, C., Harnad, S. and Gingras, Y. (2005) Ten-Year Cross-Disciplinary Comparison of the Growth of Open Access and How it Increases Research Citation Impact. IEEE Data Engineering Bulletin, 28 (4). pp. 39-47.