Publications
Publications by category in reverse chronological order. Generated by jekyll-scholar.
2025
- SC Posters’25. Exploring Fine-Grained Parallelism in Data-Flow Runtime Systems on Many-Core Systems. Wenyi Wang, Maxime Gonthier, Haibin Lai, and 5 more authors. In Proceedings of the SC ’25 Research Posters of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2025.
High synchronization overhead in frameworks like GNU OpenMP impedes fine-grained task parallelism on many-core architectures. We introduce three advances to GNU OpenMP: a lock-less concurrent queue (XQueue), a scalable distributed tree barrier, and two NUMA-aware, lock-less load-balancing strategies. Evaluated with Barcelona OpenMP Task Suite (BOTS) benchmarks, our XQueue and tree barrier improve performance by up to 1522.8× over the original GNU OpenMP. The load-balancing strategies provide an additional performance improvement of up to 4×. We further apply these techniques to the TaskFlow runtime, demonstrating performance and scalability gains in selected applications while also analyzing the inherent limitations of the lock-less approach on x86 architectures.
@inproceedings{sc25posters-wenyi,
  author = {Wang, Wenyi and Gonthier, Maxime and Lai, Haibin and Nookala, Poornima and Pan, Haochen and Foster, Ian and Raicu, Ioan and Chard, Kyle},
  title = {Exploring Fine-Grained Parallelism in Data-Flow Runtime Systems on Many-Core Systems},
  year = {2025},
  publisher = {Association for Computing Machinery},
  address = {St. Louis, MO, USA},
  url = {https://sc25.supercomputing.org/proceedings/posters/poster_pages/post147.html},
  booktitle = {Proceedings of the SC '25 Research Posters of the International Conference for High Performance Computing, Networking, Storage and Analysis},
  keywords = {parallel computing, high-performance computing, supercomputer},
}
- SC Workshops’25. KVMSR+UDWeave: Extreme-Scaling with Fine-grained Parallelism on the UpDown Graph Supercomputer. Alexander Fell, Yuqing Wang, Tianshuo Su, and 12 more authors. In Proceedings of the SC ’25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, St. Louis, MO, USA, 2025.
Programming irregular graph applications is challenging on today’s scalable supercomputers. We describe a novel programming model, KVMSR+UDWeave, that supports extreme scaling by exposing fine-grained parallelism. By enabling the expression of maximum parallelism, it opens the door to extreme scaling on both small and large graph problems. KVMSR+UDWeave cleanly separates the three key dimensions of parallel programming: parallelism, computation binding, and data placement. This decomposition reduces the effort required to achieve scalable, high performance for graph algorithms on real-world, highly skewed graphs. Key features of the UpDown supercomputer (computation location naming and a shared global address space) enable this decomposition and scalable, high performance. In the IARPA AGILE program, we built numerous graph benchmarks and workflows, and we use them to illustrate the programming model. Simulation results for UpDown show excellent strong scaling to million-fold hardware parallelism and high absolute performance. These results suggest that KVMSR+UDWeave reduces the programming effort of scaling the most demanding irregular applications.
@inproceedings{10.1145/3731599.3767499,
  author = {Fell, Alexander and Wang, Yuqing and Su, Tianshuo and Nourian, Marziyeh and Wang, Wenyi and Monsalve-Diaz, Jose M. and Rajasukumar, Andronicus Samsundar and Su, Jiya and Xu, Ruiqi and Khandelwal, Rajat and Zhang, Tianchi and Gleich, David and Li, Yanjing and Hoffmann, Hank and Chien, Andrew A.},
  title = {KVMSR+UDWeave: Extreme-Scaling with Fine-grained Parallelism on the UpDown Graph Supercomputer},
  year = {2025},
  isbn = {9798400718717},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3731599.3767499},
  booktitle = {Proceedings of the SC '25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis},
  pages = {1243--1262},
  numpages = {20},
  location = {St. Louis, MO, USA},
  keywords = {parallel computing, high-performance computing, graph computing, supercomputer, mapreduce},
  series = {SC Workshops '25},
}
- IPDPS’25. Optimizing Fine-Grained Parallelism Through Dynamic Load Balancing on Multi-Socket Many-Core Systems. Wenyi Wang, Maxime Gonthier, Poornima Nookala, and 4 more authors. In 2025 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Jun 2025.
Achieving efficient task parallelism on many-core architectures is an important challenge. The widely used GNU OpenMP implementation of the popular OpenMP parallel programming model incurs high overhead for fine-grained, short-running tasks due to time spent on runtime synchronization. In this work, we introduce and analyze three key advances that collectively achieve significant performance gains. First, we introduce XQueue, a lock-less concurrent queue implementation to replace GNU’s priority task queue and remove the global task lock. Second, we develop a scalable, efficient, and hybrid lock-free/lock-less distributed tree barrier to address the high hardware synchronization overhead from GNU’s centralized barrier. Third, we develop two lock-less and NUMA-aware load balancing strategies. We evaluate our implementation using Barcelona OpenMP Task Suite (BOTS) benchmarks. We show that the use of XQueue and the distributed tree barrier can improve performance by up to 1522.8× compared to the original GNU OpenMP. We further show that lock-less load balancing can improve performance by up to 4× compared to GNU OpenMP using XQueue.
@inproceedings{11078401,
  author = {Wang, Wenyi and Gonthier, Maxime and Nookala, Poornima and Pan, Haochen and Foster, Ian and Raicu, Ioan and Chard, Kyle},
  booktitle = {2025 IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
  title = {Optimizing Fine-Grained Parallelism Through Dynamic Load Balancing on Multi-Socket Many-Core Systems},
  year = {2025},
  pages = {81--93},
  keywords = {Distributed processing;Runtime;Parallel programming;Parallel processing;Performance gain;Load management;Dynamic scheduling;Hardware;Dynamic programming;Synchronization},
  url = {https://doi.ieeecomputersociety.org/10.1109/IPDPS64566.2025.00016},
  publisher = {IEEE Computer Society},
  address = {Los Alamitos, CA, USA},
  month = jun,
}
2024
- arXiv’24. UpDown: Programmable fine-grained Events for Scalable Performance on Irregular Applications. Andronicus Rajasukumar, Jiya Su, Tianshuo Su, and 8 more authors. arXiv preprint arXiv:2407.20773, 2024.
Applications with irregular data structures, data-dependent control flows and fine-grained data transfers (e.g., real-world graph computations) perform poorly on cache-based systems. We propose the UpDown accelerator that supports fine-grained execution with novel architecture mechanisms - lightweight threading, event-driven scheduling, efficient ultra-short threads, and split-transaction DRAM access with software-controlled synchronization. These hardware primitives support software programmable events, enabling high performance on diverse data structures and algorithms. UpDown also supports scalable performance; hardware replication enables programs to scale up performance. Evaluation results show UpDown’s flexibility and scalability enable it to outperform CPUs on graph mining and analytics computations by up to 116-195x geomean speedup and more than 4x speedup over prior accelerators. We show that UpDown generates the high memory parallelism (4.6x over CPU) required for memory-intensive graph computations. We present measurements that attribute the performance of UpDown (23x architectural advantage) to its individual architectural mechanisms. Finally, we also analyze the area and power cost of UpDown’s mechanisms for software programmability.
@article{rajasukumar2024updown,
  title = {UpDown: Programmable fine-grained Events for Scalable Performance on Irregular Applications},
  author = {Rajasukumar, Andronicus and Su, Jiya and Su, Tianshuo and Nourian, Marziyeh and Diaz, Jose M Monsalve and Zhang, Tianchi and Ding, Jianru and Wang, Wenyi and Zhang, Ziyi and Jeje, Moubarak and others},
  journal = {arXiv preprint arXiv:2407.20773},
  year = {2024},
}
- BDCAT’23. Trillion Parameter AI Serving Infrastructure for Scientific Discovery: A Survey and Vision. In Proceedings of the IEEE/ACM 10th International Conference on Big Data Computing, Applications and Technologies, Taormina (Messina), Italy, 2024.
Deep learning methods are transforming research, enabling new techniques, and ultimately leading to new discoveries. As the demand for more capable AI models continues to grow, we are now entering an era of Trillion Parameter Models (TPM), or models with more than a trillion parameters—such as Huawei’s PanGu-Σ. We describe a vision for the ecosystem of TPM users and providers that caters to the specific needs of the scientific community. We then outline the significant technical challenges and open problems in system design for serving TPMs to enable scientific research and discovery. Specifically, we describe the requirements of a comprehensive software stack and interfaces to support the diverse and flexible requirements of researchers.
@inproceedings{10.1145/3632366.3632396,
  author = {Hudson, Nathaniel C and Pauloski, J. Gregory and Baughman, Matt and Kamatar, Alok and Sakarvadia, Mansi and Ward, Logan and Chard, Ryan and Bauer, Andr\'{e} and Levental, Maksim and Wang, Wenyi and Engler, Will and Price Skelly, Owen and Blaiszik, Ben and Stevens, Rick and Chard, Kyle and Foster, Ian},
  title = {Trillion Parameter AI Serving Infrastructure for Scientific Discovery: A Survey and Vision},
  year = {2024},
  isbn = {9798400704734},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3632366.3632396},
  booktitle = {Proceedings of the IEEE/ACM 10th International Conference on Big Data Computing, Applications and Technologies},
  articleno = {15},
  numpages = {10},
  keywords = {artificial intelligence, grid computing, deep learning applications, systems design, survey},
  location = {Taormina (Messina), Italy},
  series = {BDCAT '23},
}
2023
- LCPC’23. Efficiently exploiting irregular parallelism using keys at scale. Yuqing Wang, Andronicus Rajasukumar, Tianshuo Su, and 8 more authors. In International Workshop on Languages and Compilers for Parallel Computing, 2023.
Motivated by the challenges of programming irregular applications for machines with million-fold parallelism, we present a key-based programming model, called key-value map-shuffle-reduce (KVMSR), that enables programmers to optimize fine-grained parallel programs. KVMSR expresses parallelism on a global address space and features modular interfaces to flexibly bind computation to available compute resources. We define the KVMSR model and illustrate it with three programs, convolution filter, PageRank and BFS, to show its ability to separate computation expression from binding to computation location for high performance. On an 8,192-way parallel compute system, KVMSR modular computation location control achieves up to 2,317× performance with static approaches and an increase of 549× to 2,715× speedup with dynamic approaches for computation location binding.
@inproceedings{wang2023efficiently,
  title = {Efficiently exploiting irregular parallelism using keys at scale},
  author = {Wang, Yuqing and Rajasukumar, Andronicus and Su, Tianshuo and Nourian, Marziyeh and Monsalve Diaz, Jose M and Pervaiz, Ahsan and Ding, Jerry and Colley, Charles and Wang, Wenyi and Li, Yanjing and others},
  booktitle = {International Workshop on Languages and Compilers for Parallel Computing},
  pages = {78--95},
  year = {2023},
  organization = {Springer},
}
2021
- SC’21. Paths to OpenMP in the Kernel. Jiacheng Ma, Wenyi Wang, Aaron Nelson, and 7 more authors. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021.
OpenMP implementations make increasing demands on the kernel. We take the next step and consider bringing OpenMP into the kernel. Our vision is that the entire OpenMP application, run-time system, and a kernel framework are interwoven to become the kernel, allowing the OpenMP implementation to take full advantage of the hardware in a custom manner. We compare and contrast three approaches to achieving this goal. The first, runtime in kernel (RTK), ports the OpenMP runtime to the kernel, allowing any kernel code to use OpenMP pragmas. The second, process in kernel (PIK), adds a specialized process abstraction for running user-level OpenMP code within the kernel. The third, custom compilation for kernel (CCK), compiles OpenMP into a form that leverages the kernel framework without any intermediaries. We describe the design and implementation of these approaches, and evaluate them using NAS and other benchmarks.
@inproceedings{ma2021paths,
  title = {Paths to {OpenMP} in the Kernel},
  author = {Ma, Jiacheng and Wang, Wenyi and Nelson, Aaron and Cuevas, Michael and Homerding, Brian and Liu, Conghao and Huang, Zhen and Campanoni, Simone and Hale, Kyle and Dinda, Peter},
  booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
  pages = {1--17},
  year = {2021},
}