Computer Architecture Today

Informing the broad computing community about current activities, advances and future directions in computer architecture.

Assuming that you are reviewing the papers summarized below, please take the following quiz (the papers are fictitious and any resemblance to real papers is completely coincidental):

1. Weather prediction models have improved in accuracy by about 10 points over the last decade (from 86% to 96% accuracy) but have also become progressively slower due to increasingly complex modeling. In this paper we propose a new accelerator, called Zeus, which improves the performance of five real-world, state-of-the-art weather prediction models by 2x while reducing accuracy only from 96% to 95% (i.e., < 1% accuracy loss). We synthesized our RTL on an FPGA.

a) Weather prediction is important – strong accept
b) Good performance for negligible loss of accuracy – weak accept
c) Incorrect accuracy loss claims – weak reject
d) Experiments don’t have enough benchmarks – weak reject
e) Has FPGA – accept!

2. A brain-inspired machine learning model proposed in the 1990s has not received much attention in the ML community. The ISCA community should pay attention, so we propose an accelerator, called Brainiac, which achieves 270x speedup over a sequential CPU by exploiting regular parallelism. Our ASIC’s area and power are 20 mm^2 and 40 mW.

a) Ooh what can be cooler than brain-inspired ML? – strong accept
b) Impressive speedup – accept
c) Has ASIC – accept
d) Model not accepted in ML and GPU/multicore can also exploit regular parallelism – (weak) reject
e) Opens a completely new line of investigation – weak accept

3. A previous paper proposes a high-bandwidth, magnetic packaging technology which was published in HPCA. We propose a new architecture, called MagVia, which exploits the high bandwidth via prefetching. We use omniscient prefetching proposed in ISCA to achieve 60% speedup over a conventional baseline.

a) Not sure if this technology is legit (not published in packaging venues) – neutral
b) Not novel – reject
c) Invalid experiments – weak reject
d) Magnetic packaging looks new to me – accept
e) All of a-c

4. It is well known that virtualization puts enormous pressure on multicore cache hierarchies [SOSP]. In this paper, we propose a novel game-theory-based address-mapping strategy, called Virtumap, to reduce the miss rate. The simple mapping requires only 16-20 extra bits in the tag arrays and is invisible to the application layer. Using gem5 simulations, we show that Virtumap improves performance of commercial workloads by 32% over the best previous proposal, which ignores well-known OS optimizations for virtualization, and by 22% over a conventional baseline.

a) Incremental – weak reject (speaking as an expert because the previous work was my paper and our rules do not recognize this obvious intellectual conflict of interest)
b) Previously-published work can’t be flawed – weak reject (more loyal than the king)
c) Low speedups – weak reject
d) Novel idea and reasonable evaluation – accept
e) Too simple and obvious – weak reject

5. In this paper, we have fabricated a chip using TSMC NewFET 100-pm process (die photo is in the paper). Our SoC chip, called Omnia, includes a 1024-core multicore using out-of-order cores, a GPU, a TPU, and 4000 other previously-proposed accelerators for crypto, voice/language processing, smell detection, wine tasting, music appreciation, cameras, haptic interfaces, gene sequencing, etc. We wrote over a million lines of RTL.

a) Wow! Real chip – strong accept!
b) Not novel – reject
c) An ASIC with a die photo? accept.
d) Which HDL compiler did you use? – weak reject
e) A million lines of RTL deserves to be accepted, if not an award – accept

6. In this paper, we propose a novel cache coherence protocol for GPUs, called streaming coherence. To support streaming coherence, we add a high-performance interconnection network scheme [ISCA] to our conventional baseline.  We use the network because the network paper showed 32% speedup for Rodinia on GPUsim. Our GPUsim results show that streaming coherence improves Rodinia performance by 35% over the conventional baseline (this is not an accelerator so no 200x speedups!).

a) Most of the speedup is from the network  – weak reject
b) Good speedup on solid workload – weak accept
c) I like it (no reason)! – weak accept
d) GPU research is important – weak accept
e) Authors concede that the speedup is not 200x – weak reject

7. We propose a new caching scheme, called NuevoDinero, for exploiting extreme spatial locality in datacenter applications. NuevoDinero is inspired by previous work on improving datacenter applications via large spatial locality. We deployed NuevoDinero on a 10,000-server, production datacenter for one year and show 100 graphs including line graphs, bar graphs, stacked bar graphs, pie charts, heat maps, scatter plots, 3-D surface graphs, PDFs, CDFs, etc. Our key result is that there is spatial locality at all levels — super page, page, and cache block. Due to corporate restrictions, we cannot share the data other than these graphs. Sorry.

a) OMG! Real data. Strong accept!
b) Neither the technique nor the finding is new – reject
c) So much hard work deserves to be accepted – accept
d) Which year is the data from? Weak accept
e) Even if the technique is not new the data may be useful to other researchers – weak accept

8. The emerging workload of personalized drug design [PLOS Computational Biology] involves solving large Hamiltonian systems. The long-running workload is highly parallel but makes irregular memory accesses. We make the key observation that the access patterns can be captured by a Taylor series in most cases. Accordingly, we propose an accelerator architecture, called Alchemist, to extract the Taylor series and prefetch the memory accesses. Our simulations show that Alchemist achieves 8x speedup over a GPU without degrading energy or memory bandwidth. We show that our new hardware requires a 32-KB table to be accessed once every 10 cycles and a 5-bit state machine per 32-ALU cluster.

a) Prefetching is well-known – weak reject
b) No RTL – reject
c) Novel idea and reasonable evaluation – accept
d) No ASIC area and power – reject
e) I hate it (no reason)! – Weak reject

9. In this paper, we observe that data initialization is slow in the emerging compute-heavy quantum simulation workload (O(n^5) multiplications for every data value). We propose a new hardware-compiler-application codesign technique, called Initguru, using a combination of tiling, application annotation, and a bitonic sorting network in hardware. Initguru improves performance by 82.22% on small test cases (larger quantum circuits take an inordinate amount of time to run). We synthesize the sorting network using Synopsys Design Compiler version 5.6.3.8a with placement and routing optimizations and TSMC 15-nm technology. The ASIC area is 0.5 mm^2 and power is 3 mW.

a) Non-problem – reject
b) Which version of gcc did you use? Weak reject
c) Quantum is cool – accept
d) Impressive combination of techniques – weak accept
e) Has ASIC area and power – accept!

10. The emerging field of organic computing promises low power (the chips can serve as a snack if you are hungry!) but poses the challenge that memory is both slow and high-energy. In this paper, we address this problem via aggressive multi-level caches (combinations of inclusion/exclusion and LRU/FIFO replacement) and prefetching (combinations of spatial, temporal, temporo-spatial and spatio-temporal). These techniques are well known but have not been evaluated for organic chips. Our simulations show that our proposal improves performance by 383.67% over a baseline of no cache. Our analysis shows breakdowns of the power set of the combinations of the techniques (64 combinations).

a) New direction – accept!
b) Thorough analysis – accept
c) Impressive speedup, though over a weak baseline – weak accept
d) Edible chips must be accepted – accept
e) Not new – reject

Key:
1. C: Going from 96% to 95% accuracy raises the error rate from 4% to 5%, which is a 25% increase in inaccuracy, and gives up a whole year’s model improvement (the models gained about one point per year). If such a trade-off is needed, software should make it. Hardware undoing algorithmic improvements in the name of performance is a bad idea.
2. D: The model should be blessed by the ML community; otherwise, this is hardware for irrelevant software.
3. E: Neither the prefetching scheme, which could equally be applied to improve the baseline, nor the technology is new. (Of course, even if the technology were new, the architecture community could not judge it; such a paper should go to the right community.)
4. D: Old context, old problem, new solution. That’s still novel. Old context/problem does not imply the solution is incremental.
5. B: Lots of work but not new. Architecture papers should be about new ideas, not busy work.
6. A.
7. B:  Graphs/data can’t replace novelty.
8. C:  New context, old problem, new solution. 32-KB table and 5-bit state machine don’t need RTL/ASIC. Architecture papers should be about new ideas, not busy work.
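9. A: With O(n^5) multiplications for every data value, compute dwarfs initialization at realistic sizes. Wait! How did they get speedups, then? By running small circuits where initialization dominates.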
10. E: New context, old problem, old solution. New context alone does not make it novel.
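The arithmetic behind answers 1 and 6 can be checked in a few lines. The sketch below is only illustrative: the function names are hypothetical, and the calculation for paper 6 assumes that the network's reported 32% speedup carries over unchanged to this baseline and that speedups compose multiplicatively, neither of which the fictitious abstract states.

```python
def relative_error_increase(acc_before, acc_after):
    """Relative growth in error rate when accuracy (in percent) drops."""
    err_before = 100 - acc_before
    err_after = 100 - acc_after
    return (err_after - err_before) / err_before


def years_of_progress_lost(points_lost, points_gained, years):
    """Years of model improvement given up, assuming a steady rate of gain."""
    return points_lost / (points_gained / years)


def residual_speedup(total, component):
    """Speedup left after factoring out one component.

    Assumption: the two speedups compose multiplicatively.
    """
    return total / component


if __name__ == "__main__":
    # Paper 1: accuracy falls from 96% to 95%, so error grows from 4% to 5%.
    print(f"error increase: {relative_error_increase(96, 95):.0%}")
    # Paper 1: 10 points gained over 10 years, 1 point lost.
    print(f"years lost: {years_of_progress_lost(1, 10, 10):.1f}")
    # Paper 6: 35% total speedup, of which the network alone reportedly gives 32%.
    print(f"left for coherence: {residual_speedup(1.35, 1.32) - 1:.1%}")
```

Under those assumptions, streaming coherence itself accounts for only about 2% of the claimed 35%, which is the point of answer 6(a).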

Conclusion

Reviews should consider the following: (1) is the problem real and important, (2) is the solution novel and sufficiently better than previous work, (3) do the experiments or analyses conclusively show, within reason, that the solution is sufficiently better for the reasons claimed, and (4) is the writing clear enough to understand the paper? These questions together are necessary and sufficient. A review should give a (weak) accept if and only if all the answers are yes, while keeping in mind that no paper is perfect. Most importantly, all the reviewers for a conference must apply the same community standards.
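The "if and only if" rule above can be written down as a tiny decision function. This is a minimal sketch, not a scoring tool: the class name, the field names, and the choice to label every non-accept as "(weak) reject" are assumptions made for illustration.

```python
from dataclasses import dataclass, astuple


@dataclass
class Assessment:
    """One reviewer's yes/no answers to the four questions (names are hypothetical)."""
    problem_real_and_important: bool
    solution_novel_and_better: bool
    evaluation_supports_claims: bool
    writing_clear: bool


def recommend(assessment: Assessment) -> str:
    """Accept if and only if every answer is yes; extras such as an FPGA, an ASIC,
    or a die photo have no field here, so they cannot change the outcome."""
    if all(astuple(assessment)):
        return "(weak) accept"
    return "(weak) reject"  # assumption: any single "no" means reject


if __name__ == "__main__":
    # An old solution in a new context fails the novelty question.
    print(recommend(Assessment(True, False, True, True)))
    # All four answers are yes.
    print(recommend(Assessment(True, True, True, True)))
```

Because the function has no inputs other than the four answers, two reviewers who agree on those answers necessarily reach the same recommendation, which is what applying the same community standards requires.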

About the author: T. N. Vijaykumar teaches and studies computer architecture in the School of Electrical and Computer Engineering at Purdue University.

Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.