ESCAL has 4 fresh alumni this year!

ESCAL has 4 fresh alumni this year! Andrew, Kuan-Chieh, Abe, and Kimbo are moving on to new chapters in their lives!

Andrew is heading to Google, Kuan-Chieh is going to Brookhaven National Lab, Kimbo (Teng-Hung) is joining a startup company, and Abe is taking a sabbatical (still hirable!) starting in July!



Tensor Cores can do more than AI/ML! Yu-Ching Hu presented our paper, “TCUDB: Accelerating Database with Tensor Processors”, at SIGMOD 2022! TCUDB is a database engine that uses the Tensor Cores on NVIDIA’s latest Ampere architecture to accelerate entity matching, graph query processing, and matrix-based data analytics by up to 288x compared with conventional GPU database systems.
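To give a feel for how a database operation can be phrased as a matrix operation, here is a minimal sketch in plain Python. It is an illustration only and is not taken from TCUDB: the helper names (`one_hot`, `matmul`, `join_match_matrix`) are hypothetical, and a real engine would run the multiplication on Tensor Cores rather than in nested loops. The idea is that an equi-join on a key column can be computed by one-hot encoding the keys of each table and multiplying the two encodings.

```python
def one_hot(keys, domain):
    """Encode each key as a 0/1 row over the shared key domain."""
    return [[1 if key == value else 0 for value in domain] for key in keys]


def matmul(a, b):
    """Plain triple-loop matrix product (stand-in for a Tensor Core GEMM)."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def join_match_matrix(left_keys, right_keys):
    """Entry [i][j] is 1 when left row i and right row j share a join key."""
    domain = sorted(set(left_keys) | set(right_keys))
    left = one_hot(left_keys, domain)                    # |L| x |D|
    right = one_hot(right_keys, domain)                  # |R| x |D|
    right_t = [list(col) for col in zip(*right)]         # |D| x |R|
    return matmul(left, right_t)                         # |L| x |R|


matches = join_match_matrix([1, 2, 2], [2, 3, 1])
pairs = [(i, j) for i, row in enumerate(matches) for j, hit in enumerate(row) if hit]
print(pairs)
```

Summing the entries of the match matrix gives the join cardinality, which is the kind of aggregate a matrix unit can produce in a single pass.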

You may access our paper through
and experience TCUDB at

SIMD^2 at ISCA 2022!

The Tensor Core architecture can go beyond AI/ML! Matrix multiplication has many siblings, including dynamic programming algorithms and minimum spanning tree problems, that share the same computation pattern but use different arithmetic operations. Andrew Zhang presented our work, SIMD^2, at ISCA 2022. SIMD^2 uses NVIDIA’s Tensor Cores to create a set of 2-dimensional SIMD instructions that support these matrix semiring problems. We achieved a 10x speedup over state-of-the-art SIMD/CUDA implementations with just 5% area overhead!
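The "same pattern, different operations" idea can be sketched in a few lines of Python. This is a plain software illustration, not the SIMD^2 instruction set or its API: the function names are hypothetical. The loop nest is the one used for matrix multiplication, and only the two scalar operators change. Swapping (+, x) for (min, +) turns it into a shortest-path step.

```python
from math import inf


def semiring_matmul(a, b, add, mul, zero):
    """C[i][j] = add-reduction over k of mul(A[i][k], B[k][j])."""
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[zero] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = zero
            for k in range(inner):
                acc = add(acc, mul(a[i][k], b[k][j]))
            c[i][j] = acc
    return c


def plus_times(a, b):
    """Ordinary matrix multiplication: the (+, x) semiring."""
    return semiring_matmul(a, b, lambda x, y: x + y, lambda x, y: x * y, 0)


def min_plus(a, b):
    """Min-plus product: the (min, +) semiring, with inf as the identity of min."""
    return semiring_matmul(a, b, min, lambda x, y: x + y, inf)


def all_pairs_shortest_paths(weights):
    """Repeated min-plus squaring; assumes a zero diagonal and inf for missing edges."""
    dist, hops = weights, 1
    while hops < len(weights) - 1:
        dist = min_plus(dist, dist)   # paths of up to 2 * hops edges
        hops *= 2
    return dist


graph = [[0, 3, inf],
         [inf, 0, 1],
         [2, inf, 0]]
print(plus_times([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
print(all_pairs_shortest_paths(graph))
```

Because both products share one loop structure, hardware built for the first can in principle serve the second once the multiply and accumulate operators are made selectable.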

You may find our paper at
and the artifacts at

SHMT (Simultaneous and Heterogenous Multithreading) @ MICRO 2023 and Micro TopPicks!

Simultaneous and Heterogenous Multithreading, which explores the concept of simultaneously using multiple accelerators for the same code regions, was presented by Kuan-Chieh at MICRO 2023 in Toronto! The paper was also selected as one of the MICRO TopPicks in 2024! Please check

Extreme Scale Computer Architecture Laboratory

With dataset sizes growing rapidly while high-performance computers improve only modestly, we need to revisit existing programming and execution models to use all system components efficiently. In modern computers, many deficiencies in applications stem from data management and data movement. The vision of the Extreme Scale Computer Architecture Laboratory is to revolutionize the way people think about programming and computing today: taking a data-centric perspective in programming instead of the conventional compute-centric approach. ESCAL conducts research in systems and computer architecture with a focus on tensor processors, hardware accelerators, data storage systems, parallel processing, high-performance computing, programming languages, and runtime systems.

Research Projects

Accelerating non-AI/ML applications using AI/ML accelerators

The explosive demand for AI/ML workloads has driven the emergence of AI/ML accelerators, including the commercialized NVIDIA Tensor Cores and Google TPUs. These AI/ML accelerators are essentially matrix processors and are theoretically helpful to any application with matrix operations. This project provides the missing system, architecture, and programming-language support needed to democratize AI/ML accelerators. Because matrix operations are conventionally inefficient, this project also revises the core algorithms in compute kernels to better utilize the operators of AI/ML accelerators. With this project, ESCAL aims to lead the next revolution, similar to the one that happened with GPUs. You may now try our most recent GPTPU project from the GitHub repo:

Related papers:

Building intelligent data storage & I/O devices

As parallel computer architectures significantly shrink the execution time of compute kernels, the performance bottlenecks of applications shift to the rest of execution, including data movement, object serialization/deserialization, and other software overheads in managing data storage. To address this new bottleneck, the best approach is to avoid moving data and to endow storage devices with new roles. Morpheus is one of the very first research projects to implement this concept in real systems. We use existing, commercially available hardware components to build the Morpheus-SSD. The Morpheus model not only speeds up a set of heterogeneous computing applications by 1.32x, but also allows these applications to better utilize emerging data transfer methods that send data directly to the GPU via peer-to-peer transfers, achieving a further 1.39x speedup. Summarizer further provides mechanisms to dynamically adjust the workload between the host and intelligent SSDs, making more efficient use of all computing units in a system and boosting the performance of big data analytics. This line of research also earned ESCAL a Facebook Research Award in 2018 and a MICRO TopPicks selection in 2020.

Related papers:

  • Yu-Chia Liu, Kuan-Chieh Hsu and Hung-Wei Tseng. Autonomous In-Storage Processing. In the 60th Design Automation Conference, DAC 2023.
  • Yu-Chia Liu and Hung-Wei Tseng. NDS: N-Dimensional Storage. In the 54th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2021. (Best Paper Nominee)
  • Yu-Ching Hu, Murtuza Lokhandwala, Te I and Hung-Wei Tseng. Varifocal Storage: Dynamic Multi-Resolution Data Storage. In IEEE Micro (Micro TopPicks from Computer Architecture Conferences), 2020.
  • Yu-Ching Hu, Murtuza Lokhandwala, Te I and Hung-Wei Tseng. Dynamic Multi-Resolution Data Storage. In the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2019. (Best Paper Honorable Mention)
  • Kiran Kumar Matam, Gunjae Koo, Haipeng Zha, Hung-Wei Tseng and Murali Annavaram. GraphSSD: Graph Semantics Aware SSD. In the 46th International Symposium on Computer Architecture, ISCA 2019.

Efficient storage system for heterogeneous servers

Although high-performance non-volatile memory technologies and network devices significantly improve the speed of supplying data to heterogeneous computing units, the performance of these devices is still far behind the capabilities of those computing units. For example, modern SSDs can read more than 3GB of data per second, but GPUs can process more than 17GB of data for database aggregation operations within the same period of time. As a result, the heterogeneous computing units are under-utilized. We will revisit the design of existing runtime systems to transparently improve the utilization of system components, potentially leading to speedups or better energy efficiency.
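The gap in the example above can be made concrete with a back-of-the-envelope calculation. The sketch below is only an illustration under stated assumptions: it treats the quoted "more than 3GB" and "more than 17GB" figures as exact rates, assumes a single SSD feeds a single GPU, and ignores caching and overlap. The function name is hypothetical.

```python
def max_utilization(supply_gb_per_s, demand_gb_per_s):
    """Upper bound on the busy fraction of a unit that can consume
    demand_gb_per_s but only receives supply_gb_per_s."""
    if supply_gb_per_s <= 0 or demand_gb_per_s <= 0:
        raise ValueError("rates must be positive")
    return min(1.0, supply_gb_per_s / demand_gb_per_s)


ssd_read_rate = 3.0    # GB/s, illustrative figure from the text
gpu_aggregation_rate = 17.0   # GB/s, illustrative figure from the text

utilization = max_utilization(ssd_read_rate, gpu_aggregation_rate)
print(f"GPU busy at most {utilization:.1%} of the time")
```

Under these assumptions the GPU sits idle for more than four fifths of the time, which is the under-utilization the runtime redesign targets.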

Related papers:

Optimizing the I/O system software stack for emerging applications

With hardware accelerators reducing computation latency, the system software stack, traditionally underrated in application design, becomes more critical. At ESCAL, we focus on these underrated bottlenecks to achieve significant performance improvements without new hardware. The most recent example is the OpenUVR system, where we eliminate unnecessary memory copies and allow the VR system loop to complete within 14 ms using just a modern desktop PC, existing WiFi network links, a Raspberry Pi 4b+, and an HDMI-compatible head-mounted display.

Related papers:


If you’re interested in joining ESCAL, please fill out this form. We do not respond to inquiries from prospective applicants, or review applications from them, unless the form has been filled out.


Hung-Wei Tseng


Graduate students

Boram Jung

Undergraduate Researchers

  • Hongrui Zhang (C.E. UCSD)
  • Honghao Lin (C.S. UCSD)
  • Andy Li (C.S. UCR)

Alumni

    Dongho Ha
    Yu-Chia "Hank" Liu
    • Xindi Li (C.S., M.S., 2018. Now at Bloomberg)
    • Chao Huang (C.S., M.S., 2018)
    • Zackary Allen (C.S., B.S., 2018. Now at LexisNexis)
    • Alec Rohloff (C.S., B.S., 2018)
    • Te I (C.S., M.S., 2018. Now at Google)
    • Vaibhava Lakshmi (ECE, M.S., 2018. Now at Dell EMC)
    • Murtuza Taher Lokhandwala (ECE, M.S., 2018. Now at Apple)
    • Mahesh Bonagiri (ECE, M.S., 2018. Now at Nvidia)
    • Joshua Okrend
    • Stefan O’Neil