Logistic regression of potential explanatory variables on citation counts
Yassine Gargouri & Stevan Harnad
Cognition & communication Laboratory
(Last update: 11/04/2009)
The number of citations an article
receives ("citation counts") can be influenced by or correlated with
a variety of variables. A logistic regression analysis has been conducted to
study the correlation between citation counts (as dependent variable) and the
following set of potential correlator/predictor variables:
OA: Is the article Open Access (1 if OA and 0 otherwise)?
M: Does the author's institution Mandate Open Access (1) or Not (0)?
Age: How old is the article (articles published from 2002 to 2006)?
Auth_N: How many co-authors does the article have?
Ref_N: How many references does the article cite?
IF: What is the Thomson/ISI "Impact Factor" (average citations per article in a 2-year window) of the journal in which the article was published (from 0 to 30)?
The metadata for the articles were
collected from our four institutional archives, as well as from the ISI
database. Citation counts were extracted from ISI (in November, 2008). For each mandated article Mi, we collected all corresponding
articles Nj published in
the same journal, volume and year as controls.
In order to reduce our article sample to a reasonable processing size,
we limited the number of journal/volume/year-matched articles to 10 articles Nj that were semantically close to Mi. This
narrowing of content should also make the control articles more comparable than
using the entire spectrum of the journal's content. (The semantic closeness is
computed based on shared words in titles, omitting stop words.)
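The title-based matching can be sketched roughly as follows. This is only an illustration under stated assumptions, not the procedure actually used: the stop-word list, the Jaccard overlap score and all function names are hypothetical, since the text says only that closeness is based on shared title words with stop words omitted.

```python
import re

# Hypothetical, deliberately tiny stop-word list; the list actually used is not given.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "with", "by"}


def title_words(title):
    """Lower-case a title, split it into words and drop stop words."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return {w for w in words if w not in STOP_WORDS}


def closeness(title_a, title_b):
    """Jaccard overlap of the non-stop words of two titles (an assumed score)."""
    a, b = title_words(title_a), title_words(title_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def closest_controls(mandated_title, candidate_titles, limit=10):
    """Return up to `limit` candidate titles, most similar to the mandated title first."""
    ranked = sorted(
        candidate_titles,
        key=lambda t: closeness(mandated_title, t),
        reverse=True,
    )
    return ranked[:limit]
```

Here `limit=10` mirrors the cap of 10 control articles per mandated article; any other overlap measure could be substituted for `closeness` without changing the selection step.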
Journals that are OA are excluded from our sample because our OA/non-OA comparisons
are all within-journal comparisons (hence there would be no control articles in
an OA journal). Based on the Directory of Open Access Journals (DOAJ), which
presently indexes 3563 OA journals, only 2.10% of the ISI journals in our sample
were OA. Those journals were excluded from our analysis.
The total size of the article sample (6215
Mandated, M, and 20982 corresponding controls,
N) from 2002 to 2006 was 27197.
The full-text OA status of the articles in our sample was verified using
an automated webwide search-robot (Hajjem et al., 2005).
The result was consolidated using another robot
based on Google Scholar search.
Rather than comparing the regression
models separately for science and social science, we added a dichotomous
variable (Sci) as a predictor variable.
Citation counts are not normally distributed, mainly because many articles have zero citations, and they cannot be successfully transformed into a normal distribution. Fig.1 shows the distribution of citation counts (minus self-citations). We therefore used binary logistic regression analysis, with a dichotomous dependent variable taking the value 0 if the article has no citations and 1 if it has at least 1 citation.

Fig.1 Citation count distribution
1. Logistic regression:
We used stepwise
logistic regression, for
each test selecting the model that maximizes the chi-square likelihood ratio.
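As a rough sketch of the quantity being compared, the likelihood-ratio chi-square of a model against a smaller nested model is twice the difference in their log-likelihoods. The data and the "fitted" probabilities below are made up for illustration, and the function names are hypothetical; the stepwise procedure itself (which statistical package, which entry and removal thresholds) is not described in the text.

```python
import math


def log_likelihood(outcomes, probabilities):
    """Log-likelihood of 0/1 outcomes under predicted probabilities."""
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(outcomes, probabilities)
    )


def likelihood_ratio_chi2(ll_smaller, ll_larger):
    """Likelihood-ratio chi-square of a model against a nested smaller model."""
    return 2.0 * (ll_larger - ll_smaller)


# Made-up data: 1 = cited at least once, 0 = uncited.
cited = [1, 1, 1, 0, 1, 0, 0, 1]

# Intercept-only model: every article gets the overall share of cited articles.
base_rate = sum(cited) / len(cited)
ll_null = log_likelihood(cited, [base_rate] * len(cited))

# Hypothetical predicted probabilities from a model with one extra predictor.
ll_model = log_likelihood(cited, [0.9, 0.8, 0.7, 0.3, 0.8, 0.2, 0.4, 0.6])

print(likelihood_ratio_chi2(ll_null, ll_model))
```

A larger value indicates that the added predictors improve the fit more relative to the smaller model.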
To make the interpretation of the
coefficients easier, we exponentiated the ß coefficients (Exp(ß)) and
interpreted them as odds-ratios. For example, we can say for the first model
that for a one-unit increase in OA, the odds of receiving 1-5 citations (versus
zero citations) are multiplied by a factor of 0.957.
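To illustrate how an Exp(ß) value is read: it is the factor by which the odds are multiplied for a one-unit increase in the predictor, so a value above 1 raises the odds and a value below 1 lowers them. The helper names and the 50% baseline below are hypothetical; only the value 0.957 comes from the text.

```python
import math


def odds_ratio(beta):
    """Exp(ß): the factor by which the odds are multiplied per one-unit increase."""
    return math.exp(beta)


def percent_change_in_odds(exp_beta):
    """Express an Exp(ß) value as a percentage change in the odds."""
    return (exp_beta - 1.0) * 100.0


def apply_odds_ratio(probability, exp_beta):
    """Probability obtained after multiplying the odds of a baseline probability by Exp(ß)."""
    odds = probability / (1.0 - probability) * exp_beta
    return odds / (1.0 + odds)


# Exp(ß) = 0.957 is the OA value quoted for the first model.
print(round(percent_change_in_odds(0.957), 1))  # -4.3

# Hypothetical baseline: an article with a 50% chance of falling in the 1-5 range.
print(round(apply_odds_ratio(0.5, 0.957), 3))   # 0.489
```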
{1, 2, 3, ..., 20}), where Cit_a_x-y&y-z = 1 if the citation count (minus self-citations) is between y and z and 0 if it is between x and y.
Models are referred to as "M_r".
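The coding of the dichotomous dependent variables can be sketched as below. The range boundaries are taken from the model labels in the tables; treating each range as half-open (start ≤ count < end) is an assumption, because the text does not say which range a boundary count such as 5 belongs to. The names `MODEL_RANGES`, `in_range` and `code_outcome` are hypothetical.

```python
# (low range, high range) for each dependent variable; None means "no upper limit".
# ASSUMPTION: ranges are half-open, i.e. start <= citations < end.
MODEL_RANGES = {
    "Cit_a_0&1-5": ((0, 1), (1, 5)),
    "Cit_a_1-5&5-10": ((1, 5), (5, 10)),
    "Cit_a_1-5&10-20": ((1, 5), (10, 20)),
    "Cit_a_1-5&20+": ((1, 5), (20, None)),
}


def in_range(value, bounds):
    """True if value lies in the half-open range described by bounds."""
    start, end = bounds
    return value >= start and (end is None or value < end)


def code_outcome(citations, model):
    """Return 1 for the higher range, 0 for the lower range, None if in neither.

    Articles coded None are simply not part of that model's sample.
    """
    low, high = MODEL_RANGES[model]
    if in_range(citations, high):
        return 1
    if in_range(citations, low):
        return 0
    return None
```

For example, an article with 25 citations (minus self-citations) is coded 1 for Cit_a_1-5&20+ and is left out of the other three models.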
The Exp(ß) values of variables turned out to have the same polarity and to be quite similar, with and without self-citations.
The figure (Fig.2) shows that citation count is
positively correlated with IF, Age, Ref_N, Auth_N, OA, USA and M. In other
words:
- The higher the IF of the journal in
which it was published, the higher an article's citation count.
- The longer since an article was
published, the higher its citation count.
- The more references an article cites,
the higher its citation count.
- The more co-authors an article has, the
higher its citation count.
- Articles that are made OA have higher
citation counts, and this small but significant independent OA effect is
present in every citation range but it is greatest in the highest citation
range (1-5 citations vs 20+ citations): The OA advantage is strongest for
highly cited articles.
- Articles from authors at institutions that
have Mandates have higher citation counts; this effect is present only in the
medium-high citation ranges (and is of course confounded with the level of
author compliance with the institutional Mandate, discussed further below).
- Review articles have higher citation
counts; the effect is greater, the higher the citation range.
- CERN articles have higher citation
counts in the lowest and especially the highest citation range. (However, when
all CERN articles are excluded from our sample, there is no significant change
in the other variables).
| Model N. | Dependent V. | Age | IF | Ref_N | Auth_N | Page_N | OA | M | USA | Review | Sci | CERN | South | Minho | Queens | Age*OA |
| M_1 | Cit_a_0&1-5 | 1,494 | 2,229 | 1,020 | 1,007 | 0,993 | 0,957 | 0,627 | 1,249 | 0,789 | 1,476 | 1,209 | ||||
| M_2 | Cit_a_1-5&5-10 | 1,490 | 1,514 | 1,016 | 1,002 | 0,986 | 1,323 | 1,889 | 1,415 | 0,777 | 1,475 | |||||
| M_3 | Cit_a_1-5&10-20 | 1,786 | 1,776 | 1,020 | 1,002 | 0,992 | 1,392 | 1,716 | 1,406 | 0,992 | 1,887 | |||||
| M_4 | Cit_a_1-5&20+ | 2,439 | 2,114 | 1,019 | 0,999 | 8,953 | 1,860 | 1,914 | 3,050 | 2,306 | 0,968 |
Tab.1: The Exp(ß) values for logistic regressions
* Bold: significance < 0,01
* Italic: significance between 0,01 and 0,05

Fig.2: The Exp(ß) values for logistic regressions
There is a significant interaction between Age and OA (Age*OA) for the low citation interval (between 1 and 5) as well as for the high citation interval (20 citations and more). Both the linear main effects of Age and OA and this nonlinear interaction are significant. The following figure (Fig.3) shows the citation mean (Cit_a_1-5&20+) for OA and non-OA (NOA) articles at each Age value. This figure confirms the OA advantage. The difference between the two lines corresponding to OA and NOA is greater for older articles.

Fig.3: The citation count means of Age and OA
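One way to read the Age*OA interaction, assuming it enters the model as the product of Age and OA (the usual coding, though the text does not spell it out), is that the odds-ratio for OA is no longer a single number but depends on the article's age:

```latex
\begin{align}
\operatorname{logit} P(\mathrm{Cit}=1)
  &= \beta_0 + \beta_{Age}\,Age + \beta_{OA}\,OA
     + \beta_{Age*OA}\,(Age \times OA) + \dots \\
\frac{\mathrm{odds}(OA=1,\ Age)}{\mathrm{odds}(OA=0,\ Age)}
  &= \exp\!\left(\beta_{OA} + \beta_{Age*OA}\,Age\right)
   = \mathrm{Exp}(\beta_{OA}) \cdot \mathrm{Exp}(\beta_{Age*OA})^{Age}
\end{align}
```

Under this reading, an Exp(ß) above 1 for Age*OA means the OA odds-ratio grows with each additional year of age, and a value below 1 means it shrinks.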
2. Logistic Regression by Impact Factor interval:
In order to compare articles belonging to
comparable journals, we divided our sample into 4 quartile ranges by journal
Impact Factor (IF), each range covering 25% of the articles:
IF_1 : 0 ≤ IF < 0.633
IF_2 : 0.633 ≤ IF < 1.053
IF_3 : 1.035 ≤ IF < 1.782
IF_4 : 1.782 ≤ IF < 29.957
The top quartile alone contains journals
with IFs ranging from 1.782 to 29.957. As we are also interested in the
variability within this quartile, we further subdivided it into two subgroups,
each covering 12.5% of all the articles. Subdividing more finely would
make the sample sizes too small to detect effects of interest.
Finally, 5 ranges of IF are selected:
IF_1 : 0 ≤ IF < 0.633
IF_2 : 0.633 ≤ IF < 1.053
IF_3 : 1.035 ≤ IF < 1.782
IF_4 : 1.782 ≤ IF < 2.468
IF_5 : 2.468 ≤ IF ≤ 29.957
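Assigning each article to one of the five IF ranges can be sketched as follows. The cut points are the lower bounds listed above; note that the list gives both 1.053 and 1.035 for the IF_2/IF_3 boundary, and 1.053 is used here purely as an assumption. The function names are hypothetical.

```python
from bisect import bisect_right

# Lower bounds of IF_2 .. IF_5.
# ASSUMPTION: 1.053 is taken as the IF_2/IF_3 boundary (the source also shows 1.035).
CUT_POINTS = [0.633, 1.053, 1.782, 2.468]


def if_range(impact_factor, cut_points=CUT_POINTS):
    """Return the label IF_1 .. IF_5 for a journal impact factor."""
    return "IF_%d" % (bisect_right(cut_points, impact_factor) + 1)


def split_by_if_range(articles):
    """Group (article_id, impact_factor) pairs by IF range."""
    groups = {}
    for article_id, impact_factor in articles:
        groups.setdefault(if_range(impact_factor), []).append(article_id)
    return groups
```

Because `bisect_right` places a value equal to a cut point in the higher range, each lower bound is inclusive, matching the "≤ IF <" form of the ranges.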
The same regression is done separately for
each IF range, controlling for all the variables (except IF). The following
tables summarize the values of Exp(ß) corresponding to the controlled
variables for each IF range.
Our earlier remark also applies to these
regressions: the Exp(ß) values of the variables have the same polarity and pattern
whether or not we exclude self-citations from the citation counts.
| Model N. | Dependent Var. | Age | Ref_N | Auth_N | Page_N | OA | M | USA | Review | Sci | CERN | South | Minho | Queens | Age*OA |
| M_1 | Cit_a_0&1-5 | 1,537 | 1,017 | 1,079 | 0,701 | 1,093 | |||||||||
| M_2 | Cit_a_1-5&5-10 | 1,847 | 1,013 | 1,066 | 1,881 | 1,059 | |||||||||
| M_3 | Cit_a_1-5&10-20 | 2,071 | 1,026 | 1,054 | 0,962 | 1,533 | 1,902 | ||||||||
| M_4 | Cit_a_1-5&20+ | 2,689 | 1,020 | 1,087 | 2,406 | 4,760 | 3,214 |
Tab.2: The Exp(ß) values for logistic regressions (IF 1)

Fig.4: The Exp(ß) values for logistic regressions (IF 1)
| Model N. | Dependent Var. | Age | Ref_N | Auth_N | Page_N | OA | M | USA | Review | Sci | CERN | South | Minho | Queens | Age*OA |
| M_1 | Cit_a_0&1-5 | 1,407 | 1,016 | 1,028 | 1,265 | 0,605 | 0,511 | ||||||||
| M_2 | Cit_a_1-5&5-10 | 1,548 | 1,012 | 1,346 | 1,963 | ||||||||||
| M_3 | Cit_a_1-5&10-20 | 1,869 | 1,018 | 1,007 | 1,337 | 1,722 | |||||||||
| M_4 | Cit_a_1-5&20+ | 2,117 | 1,011 | 2,322 | 3,106 |
Tab.3: The Exp(ß) values for logistic regressions (IF 2)

Fig.5: The Exp(ß) values for logistic regressions (IF 2)
| Model N. | Dependent Var. | Age | Ref_N | Auth_N | Page_N | OA | M | USA | Review | Sci | CERN | South | Minho | Queens | Age*OA |
| M_1 | Cit_a_0&1-5 | 1,581 | 1,012 | 1,032 | 1,236 | 0,401 | 1,856 | ||||||||
| M_2 | Cit_a_1-5&5-10 | 1,540 | 1,007 | 1,033 | 1,428 | 1,330 | |||||||||
| M_3 | Cit_a_1-5&10-20 | 1,879 | 1,013 | 1,026 | 1,263 | 1,382 | |||||||||
| M_4 | Cit_a_1-5&20+ | 2,305 | 1,009 | 1,041 | 1,026 | 1,449 | 1,492 | 1,791 | 1,939 | 3,734 |
Tab.4: The Exp(ß) values for logistic regressions (IF 3)
Fig.6: The Exp(ß) values for logistic regressions (IF 3)
| Model N. | Dependent Var. | Age | Ref_N | Auth_N | Page_N | OA | M | USA | Review | Sci | CERN | South | Minho | Queens | Age*OA |
| M_1 | Cit_a_0&1-5 | 1,690 | 1,020 | 2,090 | 0,554 | 0,233 | |||||||||
| M_2 | Cit_a_1-5&5-10 | 1,427 | 1,010 | 1,645 | 1,657 | ||||||||||
| M_3 | Cit_a_1-5&10-20 | 1,800 | 1,019 | 1,729 | 1,615 | ||||||||||
| M_4 | Cit_a_1-5&20+ | 2,540 | 1,024 | 0,994 | 1,028 | 1,747 | 1,822 | 3,974 |
Tab.5: The Exp(ß) values for logistic regressions (IF 4)
Fig.7: The Exp(ß) values for logistic regressions (IF 4)
| Model N. | Dependent Var. | Age | Ref_N | Auth_N | Page_N | OA | M | USA | Review | Sci | CERN | South | Minho | Queens | Age*OA |
| M_1 | Cit_a_0&1-5 | 1,484 | 1,016 | 0,182 | 0,446 | ||||||||||
| M_2 | Cit_a_1-5&5-10 | 1,312 | 1,010 | 0,976 | 1,468 | 1,391 | 0,586 | ||||||||
| M_3 | Cit_a_1-5&10-20 | 1,590 | 1,007 | 0,998 | 1,360 | 1,751 | |||||||||
| M_4 | Cit_a_1-5&20+ | 2,259 | 1,009 | 0,995 | 1,722 | 1,635 | 1,650 | 2,007 |
Tab.6: The Exp(ß) values for logistic regressions (IF 5)
Fig.8: The Exp(ß) values for logistic regressions (IF 5)
Conclusion:
Overall, OA is correlated with a
significant citation advantage for all journal IF intervals as well as for the
sample as a whole. This advantage is greater for the higher citation ranges.
There is no significant effect of a
specific institution compared to the other institutions, hence there is no need
to exclude any specific institution from our sample.
When regressions are done separately
for the different IF ranges, the Age*OA interaction disappears, but OA and Age
(as separate variables) are significant.
References
Hajjem, C., Harnad, S. and Gingras, Y. (2005) Ten-Year Cross-Disciplinary Comparison of the Growth of Open Access and How it Increases Research Citation Impact. IEEE Data Engineering Bulletin, 28 (4). pp. 39-47.