Scientific truth has historically emerged from the battlefield of conflicting ideas. But in a new and worrying twist, it is now being guaranteed by governments’ “objective” ranking of scientists, journals, or universities. In Italy, the governmental agency running academic research assessment programs provides a vivid example of this growing phenomenon of state-affirmed pseudo-science. Detailing an in vivo experiment in Italy, this article uncovers the un-scholarly practices at the root of this development, and the dangerous implications for both science and democracy.
Evaluation of research at all levels – scientific papers, individual researchers, departments and even universities – appears increasingly obsessed with assigning labels of excellence, potentially in some automated fashion. Examples include international and national university rankings, the classification and ranking of journals by metrics such as “impact factor,” and the automatic evaluation of researchers based on metrics including the h-index – which tracks the quantity of academic citations and purportedly measures a scholar’s productivity. This “metric tide” impacts not only institutional and individual hierarchies but also, in the long run, the very core of scientific inquiry. While scientific truth used to emerge from a battlefield of conflicting ideas, in our new world, truth is increasingly guaranteed by the label – that is, by some “objective” ranking of scientists, journals or universities.
In Italy, probably more than in any other Western country, such an obsession with labels of excellence is shaping institutions’ and researchers’ behavior. Indeed, the Italian academic system has become a laboratory for an unprecedented in vivo experiment in governing and controlling research and teaching via automatic bibliometric tools. “Objective measures” of science and professors’ activities apply not only to research assessment exercises (VQR), but also to national scientific qualifications for professorship (ASN), to the distribution of individual micro-grants to researchers (FFABR funding), and, finally, in some universities, to determining salary increases. In this article, we illustrate how growing centralized control has emerged from the development of a national assessment process – a scientific probing that sparks a conflict between political, scientific and ethical dimensions. We also document the effort that the governmental agency put forward to self-validate its practices by disseminating shady experimental results in the international bibliometric literature.
The institutional context
In 2010, a profound modification of the structure and governance of Italian universities started in the context of austerity cuts to higher education. That batch of laws is known as “Gelmini’s reform,” named after the Minister of Education during Silvio Berlusconi’s government. In Italy, state universities remain autonomous organizations and the constitution protects the freedom of teaching and research. But Gelmini’s reform, and the rules enacted by the succeeding center-left governments, introduced more and more tools for governing and controlling universities and faculties “at a distance.” ANVUR, the Italian National Agency for the Evaluation of the University and Research, plays a central role. ANVUR is neither an autonomous agency nor a quango run at arm’s length from the government. It is instead a governmental agency: its board consists of seven professors directly nominated by the Minister of Education. Moreover, ANVUR acts principally by implementing activities directly defined by ministerial decrees, such as research assessment exercises, quality assurance for teaching, evaluation of the administrative tasks of universities, and assessment of the qualifications of candidates for professorship. Among similar European institutions, such as AERES in France or ANECA in Spain, none concentrates so much power and so many functions in one place.
Moreover, no other Western country has developed this level of governmental control of science and universities. To find similar features, one has to look back at the organization of science in planned economies.
In this highly centralized and politically controlled institutional framework, Italy also adopted a performance-based system for funding research and universities. The research performance of universities is measured by means of a national research assessment, the VQR, by and large inspired by the British RAE/REF, which the Thatcher government instituted to direct austerity-limited university funds. ANVUR conducts the research assessment in Italy.
Do peer review and bibliometrics agree? The experiment
ANVUR adopted a “dual system of evaluation” for the VQR, which assigned each piece of submitted work to one class of merit by either informed peer review or an automatic scoring algorithm based on bibliometric indicators. In fact, sometimes there is no alternative to peer review, as some research outputs simply do not appear in bibliometric databases, or the scoring algorithm does not provide a definite response for them. The scores yielded by these two different techniques were gathered and summed up at a field, department or university level to obtain aggregated scores and rankings, as reported in the thousands of pages of the VQR report. The cornerstone assumption underlying this methodology is that peer review and bibliometrics are interchangeable. This assessment design, already adopted in the first edition of the VQR (2004-2010), was repeated in the second edition (2011-2014).
In the first edition (VQR1), in an attempt to validate the dual system of evaluation, ANVUR performed an experiment to assess, for a large sample of papers, the degree of agreement between scores obtained by peer review and by bibliometrics. The results of this experiment are central to the consistency of the whole research assessment exercise. If peer review and bibliometrics did not agree, the results of the exercise would be subject to a structural bias and the final scores (and rankings) would be affected.
An extraordinary dissemination effort
The results of the experiment originally appeared in an appendix of ANVUR’s official reports. They were then widely disseminated in working papers and scholarly articles originating from, or reproducing parts of, the ANVUR reports. The main paper, coauthored by Sergio Benedetto, coordinator of the VQR, came out in Research Evaluation in 2015. But the strongest dissemination effort concerned the part of the experiment regarding papers in economics and statistics. Originally published as an official report (in English), it became a working paper, uploaded to five different working paper series and authored by only 6 of the more than 30 members of the panel performing the experiment. It was ultimately published as an “original research article” in a recognized scholarly journal, Research Policy, without any mention of the institutional nature of the content, and without any mention of the fact that nearly all text and tables in the paper came from the official report. Results also appeared in mainstream economic policy blogs.
And an unnoticed conflict of interest
Why did scholars working for ANVUR engage in this extraordinary dissemination effort? Probably because publication in scholarly journals represents an ex-post justification of the unprecedented dual system of evaluation developed and applied by ANVUR. The conflict of interest is also remarkable: the papers justifying the methodology and results of the research assessment after the fact were written by the same scholars that developed and applied the methodology in the first place. As if that were not enough, ANVUR never disclosed the data, making replication of their results impossible.
Just imagine a government prescribing a new mandatory vaccine in compliance with the recommendation of a report issued by an agency such as the U.S. Food and Drug Administration (FDA). Imagine that several years after the mandatory adoption, scholarly journals published articles, authored by the same members of the FDA committee that issued the report, reproducing contents and conclusions of the FDA report, without declaring it, thus providing a de facto – though ex-post – scientific justification of the report itself. Imagine then that, when independent scholars asked for the data to replicate the results, the agency did not reply or, alternatively, refused to release the data, claiming that they are confidential. Fortunately, this is not how health care decisions are usually taken. But are culture and science of lesser importance for society?
A non-random experiment
Since 2014, the authors of this article have tried to replicate the ANVUR experiment. First, we asked for access to the raw data, but received no reply from ANVUR. Given the unavailability of raw data, we had no alternative but to rely on a careful reading of the official VQR reports and on a statistical meta-analysis of the results reported therein. This is what we could ascertain:
- The ANVUR experiment was not conducted on a random sample of articles, but instead on a non-random subsample that was obtained by excluding from the original random sample all articles for which bibliometrics produced an uncertain classification. This non-random selection induced unknown and uncontrolled bias in the final results of the experiment.
- The degree of agreement between peer review and bibliometrics was measured by a statistical index known as Cohen’s kappa. But ANVUR confused the notion of a kappa statistically different from zero with the practical significance of its value. This “false belief that [statistically] significant results are automatically big and important” is a well-known statistical fallacy, which occurs when confidence in the existence of an effect, however small, is mistaken for its size and practical importance. In particular, ANVUR found statistically significant kappas, but their values, according to the statistics literature, indicate only “poor to fair” agreement.
- ANVUR claimed to use a “unique protocol” for experiments conducted in all research areas analyzed. But we have documented that many different protocols were adopted, and possibly different systems of weights used, to calculate Cohen’s kappas, i.e., the degree of agreement between peer review and bibliometrics.
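The fallacy in the second point above can be made concrete with a small numeric sketch. The function and the contingency table below are entirely hypothetical (ANVUR’s raw data were never released); they show how, with a large enough sample, a Cohen’s kappa of roughly 0.29 – only “fair” agreement on the conventional Landis-Koch scale – is nonetheless highly significant statistically:

```python
import math

def cohens_kappa(table):
    """Unweighted Cohen's kappa for a square contingency table
    (rows: peer-review class, columns: bibliometric class).
    Returns (kappa, z), where z tests H0: kappa = 0 using the
    usual large-sample approximation of the standard error."""
    n = sum(sum(row) for row in table)
    k = len(table)
    # Observed proportion of agreement (diagonal cells).
    p_obs = sum(table[i][i] for i in range(k)) / n
    # Agreement expected by chance from the marginal totals.
    rows = [sum(table[i]) for i in range(k)]
    cols = [sum(table[i][j] for i in range(k)) for j in range(k)]
    p_exp = sum(rows[i] * cols[i] for i in range(k)) / n ** 2
    kappa = (p_obs - p_exp) / (1 - p_exp)
    se = math.sqrt(p_obs * (1 - p_obs) / n) / (1 - p_exp)
    return kappa, kappa / se

# Hypothetical counts for 900 papers, each rated into one of
# four merit classes by peer review and by bibliometrics.
table = [
    [120, 60, 30, 10],
    [ 60, 90, 60, 30],
    [ 30, 60, 80, 50],
    [ 10, 30, 50, 130],
]
kappa, z = cohens_kappa(table)
print(f"kappa = {kappa:.2f}, z = {z:.1f}")
# kappa is about 0.29 ("fair" agreement at best), yet z is far
# above 1.96: statistically significant, practically weak.
```

The point is not the particular numbers but the structure of the error: with thousands of papers in the sample, even a weak association between the two evaluation methods will reject the null hypothesis of zero agreement, which says nothing about whether the methods are interchangeable.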
Is economics an exception?
The three findings outlined immediately above do not apply to economics: By contrast with the process used for other scientific areas, the data for economics were selected at random, and the observed agreement between peer review and bibliometrics was “good” and not just “statistically significant”. We have argued that this agreement, rather than originating from specific characteristics of that research field, is explained by specific modifications of the experimental protocol that were introduced only for economics. These modifications were neither explicitly highlighted in ANVUR reports, nor disclosed or justified in subsequent papers. Only in economics did bibliometric evaluation stem from a ranking of journals directly developed by ANVUR. Peer reviewers knew the ranking of journals and were also aware they were participating in the experiment – two conditions that occurred in no other research area. Moreover, in all other scientific areas, the peer review final score for each article originated automatically from the scores assigned by two independent reviewers. This was not the case in economics, where no less than 55% of the final scores were decided directly by panelists, well aware of the experiment. It is therefore hardly surprising that in economics the agreement between bibliometric and peer review evaluation rose to a level recorded in no other area.
The upshot for science—and democracy
It took some years to conclude the investigation because raw data, though accessible to people working for ANVUR, were never disclosed to independent scholars. Putting all the pieces together, it is now possible to conclude that peer review and bibliometrics did not agree in the Italian experiment. As a matter of fact, the coexistence of two different evaluation methodologies introduced an unknown bias in the final results of the Italian research assessment that is currently used by the Italian government for funding universities.
But this is just one of the issues raised by the Italian example. A second issue concerns the status of scientific knowledge when it intertwines with policy issues. In this case, the official position is that “peer review and bibliometrics agree,” despite contrary evidence that “peer review and bibliometrics do not agree,” or at least that “the experiment is not able to confirm the agreement.” ANVUR adopted the dual system of evaluation before it was scientifically validated. This cannot but create a clash between the two roles played by ANVUR: designer of regulations and procedures, and provider of ex-post scientific evidence in support of its designs.
When U.K. universities underwent a bibliometric assessment, commentators wrote of a “very Stalinist management model.” For Italy, we wonder whether a parallel with Lysenkoism may be more appropriate. Trofim Lysenko, director of the Institute of Genetics at the USSR’s Academy of Science, exercised political power in his campaign to see that Soviet science reject Mendelian genetics in favor of Lamarckism. In Italy, a group of professors selected by the government adopted a self-developed methodology, which approaches bibliometric Lysenkoism, for evaluating science and researchers and finally for deciding which research is worth funding.
A third issue concerns the openness of data. There is a transparency issue with the Italian government and ANVUR refusing to disclose data for replicating tests. Such control of data probably stems from the fear that an independent investigation would question the key assumption underlying the whole research exercise. But there is also an issue of editorial ethics when prominent journals, such as Research Policy, publish papers and replies based on data unavailable to scholars for replication. When public policies are grounded on an influential paper, the impossibility of replicating results may have far reaching implications. A recent and disturbing example is that of economic austerity measures:
Coding errors happen, yet the greater research problem was not allowing for other researchers to review and replicate the results through making the data openly available. If the data and code were available upon publication already in 2010, it may not have taken three years to prove these results wrong – results which may have influenced the direction of public policy around the world towards stricter austerity measures. Sharing research data means a possibility to replicate and discuss, enabling the scrutiny of research findings as well as improvement and validation of research methods through more scientific enquiry and debate.
Research assessment may not be perceived as being as important to society and the economy at large as austerity measures. But it still has a great impact on a country’s research and on the quality of the future papers that influence economic and non-economic government policies.