Our review process has many significant issues, producing both false-accepts (papers that should be rejected but are accepted) and false-rejects (papers that should be accepted but are rejected). I am not talking about borderline cases that require nuanced consideration; such cases are a minority. Though these two problems need different solutions, they are related (e.g., false-accepts exacerbate false-rejects, as I describe below), and the solution for one may affect the other. Because any overhaul would change many things and may break what isn’t broken, I advocate incremental fixes, so we fix what does not work while retaining what does. These problems have always existed, but (1) I believe they have worsened for the reasons below, and (2) that does not mean we should not solve them.
False-accepts: Normally, false-accepts would not get citations and would vanish over time. But I care about false-accepts because they push down good papers (especially false-rejects) in the rank order for the Program Committee (PC) discussion and eventually out of the accepted set (e.g., 30+ false-accepts in the top 100 papers). So even if a false-accept is eventually rejected at the PC meeting, the paper takes time away from false-rejects.
Reasons for false-accepts: (1) As submission counts went up over the years, our community grew the PC (from 50 to 80+ reviewers) and added the External Review Committee (ERC, 120+ reviewers) without monitoring the effects on the review process. A huge PC/ERC (200+ reviewers) means wide variability in review scores, in per-PC-member acceptance rates, and even in the most basic criteria (e.g., whether a work is novel even if it puts known things together in known ways but for a new application domain). For example, 33% of an actual PC had personal acceptance rates of 45%+ each (before the PC meeting) and 20% of the PC had under 15% acceptance each, whereas the final acceptance rate was under 20%. That is, your paper’s acceptance depends significantly on who reviews it and not only on what is in it. (2) A huge PC/ERC also often means variability in review expertise. For example, more than 25% of an actual PC rated themselves as “familiar” in expertise (a score of 2 out of 4) for more than 50% of their reviews. (3) The ERC has many issues: (a) A large ERC means wide score variability, which affects all the papers (the ERC touches every paper). (b) Each ERC member reviews only a few papers, which makes review calibration hard for him/her. (4) Another reason for score variability is the use of absolute “accept/reject” scores without any standardization (a “weak accept” for one reviewer is a “reject” for another).
Fixing false-accepts: (1) All the PC/ERC members, not just the PC Chair, should ensure that all the per-PC-member scores fall within statistical expectations based on historical acceptance rates. In HotCRP, it takes three clicks and an hour to analyze 300+ submissions, 200 reviewers, and 1500+ reviews – it’s that easy (see the sketch after this paragraph)! This monitoring should occur at three critical points: before the rebuttal, before the PC meeting, and during the PC meeting. (2) Even better, let’s ask each PC member to select the top five papers out of his/her pile for further consideration, instead of using non-standardized “accepts/rejects” (5 out of, say, 15 papers, or 33%, which avoids under-shooting final acceptance rates). If everybody selects 5 papers each, there is no reviewer variability. But “top five” is not meaningful for ERC members who review only 4-5 papers each (this problem is fixed next). (3) The first two fixes do not address the root cause: the huge PC/ERC. Let us eliminate the ERC and use a smaller PC of 60 reviewers with a “core subset” of 20 fair-minded, seasoned reviewers who cover all the papers (e.g., 320 papers/20 = 16 reviews per core PC member). The core subset would be fundamentally responsible for fairness (controlling false-rejects, as explained later). A smaller PC does not mean unreasonably many reviews per PC member: assuming 320 papers, a two-phase process, and three reviews per paper in Phase-I, each of the 40 non-core PC members gets 320*2/40 = 16 reviews. Because there is no ERC, Phase-I can use the “top five” selection process. Assuming 120 papers survive Phase-I (double of 60 final accepts, say), Phase-II adds 120*2/60 = 4 more reviews per PC member (for a total of 20 reviews per PC member, which is not terrible). In Phase-II, each PC member ranks his/her papers one through four, so there is no reviewer variability. The current process has many more than 120 Phase-I survivors, requiring more Phase-II reviews, because of false-accepts, which will go down with a smaller PC. (4) The review-form buttons for expertise scores of “zero knowledge” and “familiar” should exist for calibration but should not be selectable. In the first week after paper assignment, each reviewer should read the abstracts and return the papers for which he/she cannot choose “knowledgeable” or “expert”. It is not ok to write a review when you don’t know the stuff.
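To make fix (1) concrete, here is a minimal sketch of the kind of per-reviewer check I have in mind, assuming a per-review CSV export (e.g., from HotCRP) with one row per review. The file name, column names, score threshold, and tolerance are my own placeholders, not a fixed export schema, and would need to match the actual review form and the venue’s historical rates.

```python
import csv
from collections import defaultdict

# Hypothetical parameters: adjust to the venue's review scale and history.
ACCEPT_THRESHOLD = 3    # overall scores >= this count as a positive review
EXPECTED_RATE = 0.20    # historical final acceptance rate
TOLERANCE = 0.15        # how far a reviewer's rate may drift before flagging

positives = defaultdict(int)
totals = defaultdict(int)

# Assumed export format: one row per review with "reviewer" and
# "overall_score" columns (placeholder names, not an actual HotCRP schema).
with open("reviews.csv", newline="") as f:
    for row in csv.DictReader(f):
        totals[row["reviewer"]] += 1
        if int(row["overall_score"]) >= ACCEPT_THRESHOLD:
            positives[row["reviewer"]] += 1

# Flag reviewers whose positive-review rate is far from the expected rate.
for reviewer, n in sorted(totals.items()):
    rate = positives[reviewer] / n
    if abs(rate - EXPECTED_RATE) > TOLERANCE:
        print(f"{reviewer}: {rate:.0%} positive over {n} reviews -- recalibrate?")
```

Run before the rebuttal, before the PC meeting, and again during the PC meeting, a check of this sort surfaces the 45%+ and under-15% outliers mentioned above in seconds.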
False-rejects: This problem is harder than that of false-accepts because it is not statistically discernible. In contrast to 30+ false-accepts out of the top 100 papers, there may be only 12 false-rejects out of 320 papers (but 12 compared to 60 final accepts means 20% of the final program could be different!). Most importantly, false-rejects hurt our community’s good ideas and progress, and impose a high human cost (e.g., graduate student job prospects, and faculty tenure and promotion).
Reasons for false-rejects: Many factors conspire. (1) Each reviewer sees all the other reviews and reviewer names for a paper immediately upon submitting his/her review. This practice kills review independence and amplifies negative reviewers’ influence because reviewers often readily agree to reject. (2) Rebuttal has become useless. A common misconception is that rebuttals should convince the reviewers, but most reviewers are not looking to be convinced. Instead, rebuttals should be used to improve reviewer accountability, whereas in our process the reviewer is free to ignore, not read, or misunderstand the rebuttal (“not convinced”) — nobody is checking! I am talking about disagreements over facts (a majority of false-rejects), not about subjective arguments that can generate legitimate disagreements (a minority). Having four other reviewers does not help if there is no independence and no cross-checking. (3) Novelty has also become a useless metric: a new idea in a well-known context is marked “incremental”, just as an old idea would be.
Fixing false-rejects: For reason (1), reviewers should not see each other’s reviews or rebuttals until the post-rebuttal discussions, or the reviewer names until the PC meeting (important below). For reason (2), we can take several steps: (a) After the rebuttal (including for Phase-I rejects), let’s shuffle a paper’s review-rebuttal pairs among the paper’s reviewers so that a positive reviewer gets a negative review-rebuttal pair and vice versa (even if there are no positive reviewers, this shuffle helps); a sketch of one way to assign such a shuffle follows this paragraph. If the rebuttal fully addresses the review’s concerns, then the new reviewer (anonymous to the other reviewers) must assign a higher post-rebuttal score within the first post-rebuttal week — “not convinced” is not an option. A caveat is that this step may increase false-accepts without a smaller PC and the above safeguards! (b) To increase reviewer accountability, let the authors check whether the post-rebuttal scores make sense (adding only two days to the process). If not, the authors can “flag” the review with a short explanation. In the remaining post-rebuttal period, the core-PC member assigned to the paper should adjudicate and change the score if needed (anonymously). To discourage all rejects from being flagged, any incorrect flag would cause an immediate reject (before the PC meeting). (c) The core-PC members should be committed to fairness for every paper, especially the rejected ones. In contrast, our current process assigns “discussion leads” at the PC meeting only for the papers that are to be discussed (by which point most false-rejects have already occurred). Even for a paper that a core-PC member him/herself does not like, he/she must challenge bogus reviews and ensure that a final reject is for valid reasons. In our current process, reviewers focus on the agreement to reject without ensuring that the others’ reasons are valid – yet another manifestation of the lack of review independence. For reason (3) above, here is a suggestion: a paper is not novel if you can give a reference for each of its claimed contributions; otherwise, it is novel.
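As a thought experiment for step (a), here is a minimal sketch of how a chair’s script might assign the shuffle for one paper, assuming each review carries a numeric overall score where higher means more positive. The data layout, field names, and score scale are hypothetical, not an existing conference-tool feature.

```python
def shuffle_review_rebuttal_pairs(reviews):
    """Reassign each review-rebuttal pair to a different reviewer of the
    same paper by rotating the score-sorted list roughly half-way, so the
    most positive reviewers re-evaluate the most negative pairs and vice
    versa. `reviews` is a list of dicts like
    {"reviewer": "R1", "score": 2, "rebuttal": "..."} (higher = more
    positive). Returns {reviewer: review-rebuttal pair written by someone
    else}."""
    n = len(reviews)
    if n < 2:
        return {}  # nothing to shuffle with fewer than two reviewers
    by_score = sorted(reviews, key=lambda r: r["score"], reverse=True)
    shift = (n + 1) // 2  # rotating by ~n/2 pairs opposite ends together
    return {by_score[i]["reviewer"]: by_score[(i + shift) % n] for i in range(n)}

# Example with hypothetical reviewers and a 1-4 score scale.
reviews = [
    {"reviewer": "A", "score": 4, "rebuttal": "..."},  # most positive
    {"reviewer": "B", "score": 2, "rebuttal": "..."},
    {"reviewer": "C", "score": 1, "rebuttal": "..."},  # most negative
]
for new_reviewer, pair in shuffle_review_rebuttal_pairs(reviews).items():
    print(f"{new_reviewer} re-evaluates the review-rebuttal pair from {pair['reviewer']}")
```

Rotation (rather than a random permutation) guarantees that no reviewer gets his/her own pair back and that the opposite ends of the score spectrum meet, which is the point of the shuffle.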
Conclusion: This problem is not new — other communities have opted for different trade-offs: the database (DB) community has “journalized” its conferences, which still involves 200+ reviewers; SIGCOMM uses a small PC (< 50), no ERC, a two-phase process, and 25+ reviews per PC member (no rebuttal); and PLDI uses a small PC (50) with an ERC, a two-phase process, and 20-25 reviews per PC member. For our community, a smaller PC (of 60) with a committed core PC subset (of 20) can lead the way. Further, the above ideas are not all-or-nothing – we can freely cherry-pick. Above all, we as reviewers must stop false-accepts and false-rejects whenever we see them; if we don’t, our own good papers will get killed repeatedly.
Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.