Novel Models and Methods for Accelerating Parallel Full-Batch GNN Training on Distributed-Memory Systems
Ahmet Can Bağırgan
M.S. Student
(Supervisor: Prof. Dr. Cevdet Aykanat)
Computer Engineering Department
Bilkent University
Abstract: Graph Neural Networks (GNNs) have emerged as effective tools for learning from graph-structured data across diverse application domains. Despite their success, the scalability of GNNs remains a critical challenge, particularly for full-batch training on large-scale, irregularly sparse, and scale-free graphs. Traditional one-dimensional (1D) vertex-parallel training strategies, while widely adopted, often suffer from severe load imbalance and excessive communication overhead, which limits their performance on distributed-memory systems. This thesis addresses the scalability limitations of 1D approaches by investigating alternative partitioning strategies that better exploit the structure of modern graph workloads. A systematic evaluation framework is developed to assess parallel GNN training performance across a range of datasets with varying sparsity and degree distributions. The framework captures key performance indicators such as computational load balance, inter-process communication volume, and parallel runtime. Extensive experiments are conducted on two Tier-0 supercomputers, LUMI and MareNostrum5, using hundreds of real-world graph instances. Across 22 well-known GNN datasets, the results show up to a 61% decrease in total communication volume and up to a 39% decrease in parallel runtime compared to 1D partitioning strategies on 1024 processes. These improvements are consistent across graphs with widely varying degree distributions and sparsity levels, confirming the robustness of the proposed approaches. The findings demonstrate the potential of moving beyond traditional 1D paradigms and provide practical insights into scalable and communication-efficient GNN training on distributed platforms.
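For readers unfamiliar with the 1D vertex-parallel setting mentioned in the abstract, the following is a minimal Python sketch of how a naive 1D (row-wise) vertex partition can be scored for per-process load and communication volume in one full-batch aggregation step. It is an illustration only, not the thesis's framework or partitioner; the names block_partition_1d and comm_volume_1d and the toy graph are hypothetical.

    # Minimal illustrative sketch (hypothetical, not the thesis implementation):
    # assign vertices to parts with a naive contiguous 1D partition, then count
    # per-part aggregation work and the distinct remote features each part needs.

    def block_partition_1d(num_vertices, num_parts):
        """Assign contiguous blocks of vertices to parts (naive 1D partition)."""
        block = (num_vertices + num_parts - 1) // num_parts
        return [min(v // block, num_parts - 1) for v in range(num_vertices)]

    def comm_volume_1d(edges, part, num_parts):
        """Per-part aggregation load and number of distinct remote vertex
        features each part must receive in one full-batch aggregation step."""
        needed = [set() for _ in range(num_parts)]
        load = [0] * num_parts
        for u, v in edges:
            load[part[v]] += 1              # aggregation work charged to v's owner
            if part[u] != part[v]:
                needed[part[v]].add(u)      # v's owner fetches u's feature remotely
        return load, [len(s) for s in needed]

    # Tiny usage example: 6 vertices, 2 parts.
    edges = [(0, 3), (1, 3), (2, 4), (3, 5), (4, 5), (0, 5)]
    part = block_partition_1d(6, 2)
    load, volume = comm_volume_1d(edges, part, 2)
    print("part assignment :", part)    # [0, 0, 0, 1, 1, 1]
    print("per-part load   :", load)    # all aggregation lands on part 1
    print("per-part recv   :", volume)  # part 1 must fetch 3 remote features

Even on this toy graph, the naive 1D split concentrates both the computation and the incoming feature traffic on one process, which is the kind of load imbalance and communication overhead the thesis's alternative partitioning strategies aim to reduce.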
Date: Thursday, July 24 @ 11:00
Place: EA 409