EE/CS277 (2022 Spring): Data-Centric Computer Architecture

TuTh 3:30p-4:50p

Location: Boyce Hall Room 1471

Instructor

Hung-Wei Tseng
email: htseng @ ucr.edu
Office Hours: by appointment

Course Overview

With the rapid growth of application dataset sizes and the introduction of hardware accelerators, the non-computational part of an application, namely the data movement overhead among storage devices, memory units, and computing components, has become the major performance bottleneck. Instead of tweaking the entrenched CPU/compute-centric models used to design systems, architectures, and applications, we need to explore the design space from a data-centric perspective.

This class will focus on the underrated architectural components and alternative computing models, including system interconnects in heterogeneous computers, I/O system stacks, emerging non-volatile memory technologies, near-data processing and data flow architectures.

By the end of this course, the students will be able to:

  1. Obtain a complete overview of the interactions among different components in computer systems.
  2. Identify the performance issues of modern heterogeneous computer systems.
  3. Perform system-level programming on important I/O system modules.
  4. Propose a data-centric design to address performance issues.
  5. Use full-system emulators (e.g., QEMU) to validate the proposed idea.

This class will be a seminar-style class that requires students to present and discuss the assigned reading materials and to conduct term projects in groups.

Materials

Grading

  • 50% Research Project or Paper Presentation
    This class will encourage students to participate in projects that lead to innovative research outcomes.
  • 50% Class participation and discussion
    This class will require students to attend and discuss the assigned research papers every week.

Schedule and Slides

Date | Topic | Reading | Slides (Release) | Note
3/29/2022
Introduction–The Life of Data in Computer Systems
– Jaeyoung Do, Sudipta Sengupta, and Steven Swanson. 2019. Programmable solid-state storage in future cloud datacenters. Commun. ACM 62, 6 (June 2019), 54–62.
Slides
Video
Demo
3/31/2022
Emerging Hardware Accelerators
– Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52, 4 (April 2009), 65–76.

Slides
4/5/2022
Emerging Hardware Accelerators (2)
– J. Burgess, “RTX on—The NVIDIA Turing GPU,” in IEEE Micro, vol. 40, no. 2, pp. 36-44, 1 March-April 2020, doi: 10.1109/MM.2020.2971677.

– Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. In ISCA ’17. Association for Computing Machinery, New York, NY, USA, 1–12.

– N. Jouppi, C. Young, N. Patil and D. Patterson, “Motivation for and Evaluation of the First Tensor Processing Unit,” in IEEE Micro, vol. 38, no. 3, pp. 10-19, May./Jun. 2018, doi: 10.1109/MM.2018.032271057.
Slides
Video
4/7/2022
Memory technologies
– Jacob, Bruce L. Synchronous DRAM architectures, organizations, and alternative technologies. University of Maryland (2002).
Slides

Video
4/12/2022
Main memory subsystem (1)
Slides

Demo

Video
4/14/2022
Main memory subsystem (2)
– V. Seshadri et al., “RowClone: Fast and energy-efficient in-DRAM bulk data copy and initialization,” 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Davis, CA, 2013, pp. 185-197.
Slides

Demo

Video
4/19/2022
Peripherals and system interconnect (1)
– Laura M. Grupp, Adrian M. Caulfield, Joel Coburn, Steven Swanson, Eitan Yaakobi, Paul H. Siegel, and Jack K. Wolf. 2009. Characterizing flash memory: anomalies, observations, and applications. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 42).

– (Optional) H. Tseng, L. Grupp and S. Swanson, “Understanding the impact of power loss on flash memory,” 2011 48th ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 35-40

– Jisoo Yang, Dave B. Minturn, and Frank Hady. 2012. When poll is better than interrupt. In Proceedings of the 10th USENIX conference on File and Storage Technologies (FAST’12). USENIX Association, USA, 3.
Slides

Demo

Video
4/21/2022
Peripherals and system interconnect (2)
– Shai Bergman, Tanya Brokhman, Tzachi Cohen, and Mark Silberstein. 2017. SPIN: seamless operating system integration of peer-to-peer DMA between SSDs and GPUs. In Proceedings of the 2017 USENIX Conference on Usenix Annual Technical Conference (USENIX ATC ’17). USENIX Association, USA, 167–179.
Slides

Demo

Video
4/26/2022
Near data processing and in-storage processing
– Rajeev Balasubramonian, Jichuan Chang, Troy Manning, Jaime H. Moreno, Richard Murphy, Ravi Nair, and Steven Swanson. Near-Data Processing: Insights from a MICRO-46 Workshop. IEEE Micro (Special Issue on Big Data), vol. 34 (2014), pp. 36-43.

– Jaeyoung Do, Sudipta Sengupta, and Steven Swanson. 2019. Programmable solid-state storage in future cloud datacenters. Commun. ACM 62, 6 (June 2019), 54–62.

– Hoang Anh Du Nguyen, Jintao Yu, Muath Abu Lebdeh, Mottaqiallah Taouil, Said Hamdioui, and Francky Catthoor. 2020. A Classification of Memory-Centric Computing. J. Emerg. Technol. Comput. Syst. 16, 2, Article 13 (April 2020), 26 pages. DOI:https://doi.org/10.1145/3365837
Slides

Video
4/28/2022
In-storage processing (2)
– Hung-Wei Tseng, Qianchen Zhao, Yuxiao Zhou, Mark Gahagan, and Steven Swanson. 2016. Morpheus: creating application objects efficiently for heterogeneous computing. SIGARCH Comput. Archit. News 44, 3 (June 2016), 53–65. DOI:https://doi.org/10.1145/3007787.3001143

– Yu-Ching Hu, Murtuza Taher Lokhandwala, Te I., and Hung-Wei Tseng. 2019. Dynamic Multi-Resolution Data Storage. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO ’52). Association for Computing Machinery, New York, NY, USA, 196–210. DOI:https://doi.org/10.1145/3352460.3358282

– G. Singh et al., “A Review of Near-Memory Computing Architectures: Opportunities and Challenges,” 2018 21st Euromicro Conference on Digital System Design (DSD), Prague, 2018, pp. 608-617, doi: 10.1109/DSD.2018.00106.
Slides

Video

Demo
5/3/2022
– Jaeyoung Do, Yang-Suk Kee, Jignesh M. Patel, Chanik Park, Kwanghyun Park, and David J. DeWitt. 2013. Query processing on smart SSDs: opportunities and challenges. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD ’13). Association for Computing Machinery, New York, NY, USA, 1221–1230. DOI:https://doi.org/10.1145/2463676.2465295
Slides

Video

Demo
5/5/2022
In-memory processing

– Mark Oskin, Frederic T. Chong, and Timothy Sherwood. 1998. Active pages: a computation model for intelligent memory. SIGARCH Comput. Archit. News 26, 3 (June 1998), 192–203. DOI:https://doi.org/10.1145/279361.279387

– David Patterson, Thomas Anderson, Neal Cardwell, Richard Fromm, Kimberly Keeton, Christoforos Kozyrakis, Randi Thomas, and Katherine Yelick. 1997. A Case for Intelligent RAM. IEEE Micro 17, 2 (March 1997), 34–44.

– K. Mai, T. Paaske, N. Jayasena, R. Ho, W. J. Dally and M. Horowitz, “Smart Memories: a modular reconfigurable architecture,” Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201), Vancouver, BC, Canada, 2000, pp. 161-171, doi: 10.1109/ISCA.2000.854387.

– Dongping Zhang, Nuwan Jayasena, Alexander Lyashevsky, Joseph L. Greathouse, Lifan Xu, and Michael Ignatowski. 2014. TOP-PIM: throughput-oriented programmable processing in memory. In Proceedings of the 23rd international symposium on High-performance parallel and distributed computing (HPDC ’14). Association for Computing Machinery, New York, NY, USA, 85–98. DOI:https://doi.org/10.1145/2600212.2600213
Slides

Video
5/10/2022
No lecture
5/12/2022
In-NVM processing

– C. Xu, X. Dong, N. P. Jouppi and Y. Xie, “Design implications of memristor-based RRAM cross-point structures,” 2011 Design, Automation & Test in Europe, Grenoble, 2011, pp. 1-6, doi: 10.1109/DATE.2011.5763125.

– Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2015. A scalable processing-in-memory accelerator for parallel graph processing. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA ’15). Association for Computing Machinery, New York, NY, USA, 105–117. DOI:https://doi.org/10.1145/2749469.2750386
Slides

Video
5/17/2022
Processor In Memory
Jinyoung Choi — Intermediate representation to abstract various types of accelerators

Miguel Gutierrez — Automata
– P. Dlugosch, D. Brown, P. Glendenning, M. Leventhal and H. Noyes, “An Efficient and Scalable Semiconductor Architecture for Parallel Automata Processing,” in IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 12, pp. 3088-3098, Dec. 2014, doi: 10.1109/TPDS.2014.8.

Chao Gao: GraphPulse: An Event-Driven Hardware Accelerator for Asynchronous Graph Processing
5/19/2022
Yunan Zhang — SIMD^2: A Generalized Matrix Instruction Set for Accelerating Tensor Computation beyond GEMM

Boram Jung — Accelerating Lattice-Based Cryptography Using Tensor Core

Nafis Mustakin — Investigation of DRAM Latency
5/24/2022
Jinyao Zhang — Accelerate post-quantum algorithms with processing-in-memory

Jason Zellmer — Leaky Buddies: Cross-Component Covert Channels on Integrated CPU-GPU Systems

Yu-Ching Hu — TCUDB
5/26/2022
Abenezer Wudenhe — PCIE Optics Investigation

Kuan-Chieh Hsu — Heterogeneous Edge General-Purpose Computing Framework

Ke Ma — Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks
5/31/2022
Cong GUO — A Hardware Accelerator for Protocol Buffers

Wyland Lau — DNA enables a robust and efficient storage architecture

Shao-Tse Chien / Chin-Han Wu / Jing-Jhe Hsu — TCU accelerated applications

Windy Liu — String matching in hardware using the FM-Index
6/2/2022
Carissa Lo and Samuel Wiggins — 1) Programmable FPGA-based Memory Controller 2) A High Throughput Parallel Hash Table Accelerator on HBM-enabled FPGAs

Matthew Choi and Keerthinivash Korisal — 1) Architecting an Energy-Efficient DRAM System For GPUs 2) Fine-Grained DRAM: Energy-Efficient DRAM for Extreme Bandwidth Systems

Yu-Chia Liu — High-Dimensional Benchmark

Other References

– Ke Wang, Kevin Angstadt, Chunkun Bo, Nathan Brunelle, Elaheh Sadredini, Tommy Tracy, Jack Wadden, Mircea Stan, and Kevin Skadron. 2016. An overview of micron’s automata processor. In Proceedings of the Eleventh IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES ’16). Association for Computing Machinery, New York, NY, USA, Article 14, 1–3. DOI:https://doi.org/10.1145/2968456.2976763
– Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2019. TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO ’52). Association for Computing Machinery, New York, NY, USA, 740–753. DOI:https://doi.org/10.1145/3352460.3358284

– Boncheol Gu, Andre S. Yoon, Duck-Ho Bae, Insoon Jo, Jinyoung Lee, Jonghyun Yoon, Jeong-Uk Kang, Moonsang Kwon, Chanho Yoon, Sangyeun Cho, Jaeheon Jeong, and Duckhyun Chang. 2016. Biscuit: a framework for near-data processing of big data workloads. SIGARCH Comput. Archit. News 44, 3 (June 2016), 153–165. DOI:https://doi.org/10.1145/3007787.3001154

– Matias Bjørling, Jens Axboe, David Nellans, and Philippe Bonnet. 2013. Linux block IO: introducing multi-queue SSD access on multi-core systems. In Proceedings of the 6th International Systems and Storage Conference (SYSTOR ’13). Association for Computing Machinery, New York, NY, USA, Article 22, 1–10. DOI:https://doi.org/10.1145/2485732.2485740

– Qiumin Xu, Huzefa Siyamwala, Mrinmoy Ghosh, Tameesh Suri, Manu Awasthi, Zvika Guz, Anahita Shayesteh, and Vijay Balakrishnan. 2015. Performance analysis of NVMe SSDs and their implication on real world databases. In Proceedings of the 8th ACM International Systems and Storage Conference (SYSTOR ’15). Association for Computing Machinery, New York, NY, USA, Article 6, 1–11. DOI:https://doi.org/10.1145/2757667.2757684

– T. Coughlin, “A Solid-State Future [The Art of Storage],” in IEEE Consumer Electronics Magazine, vol. 7, no. 1, pp. 113-116, Jan. 2018, doi: 10.1109/MCE.2017.2755339.

– Seshadri, Sudharsan, et al. “Willow: A User-Programmable SSD.” 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). 2014.

– Yang, Jingpei, et al. “Don’t stack your log on my log.” 2nd Workshop on Interactions of NVM/Flash with Operating Systems and Workloads (INFLOW 14). 2014.

– Yanqin Jin, Hung-Wei Tseng, Yannis Papakonstantinou, Steven Swanson, KAML: A Flexible, High-Performance Key-Value SSD. The 23rd IEEE Symposium on High Performance Computer Architecture (HPCA 2017).

– S. Aga, S. Jeloka, A. Subramaniyan, S. Narayanasamy, D. Blaauw and R. Das, “Compute Caches,” 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), Austin, TX, 2017, pp. 481-492, doi: 10.1109/HPCA.2017.21.
– Charles Eckert, Xiaowei Wang, Jingcheng Wang, Arun Subramaniyan, Ravi Iyer, Dennis Sylvester, David Blaauw, and Reetuparna Das. 2018. Neural cache: bit-serial in-cache acceleration of deep neural networks. In Proceedings of the 45th Annual International Symposium on Computer Architecture (ISCA ’18).
– W. A. Simon, Y. M. Qureshi, M. Rios, A. Levisse, M. Zapater and D. Atienza, “BLADE: An in-Cache Computing Architecture for Edge Devices,” in IEEE Transactions on Computers, vol. 69, no. 9, pp. 1349-1363, 1 Sept. 2020, doi: 10.1109/TC.2020.2972528.