MS Thesis Presentation: “Reducing Coherency Traffic Volume in Chip Multiprocessors Through Pointer Analysis,” Erdem Derebaşoğlu (CS), EA-409, 14:30, 20 September (EN)

MS THESIS PRESENTATION: “Reducing Coherency Traffic Volume in Chip Multiprocessors Through Pointer Analysis” by Erdem Derebaşoğlu
MS Student
(Supervisor: Assoc. Prof. Dr. Özcan Öztürk)
Computer Engineering Department

As the number of cores in chip multiprocessors (CMPs) increases, providing cache coherency efficiently becomes more challenging. Although snooping-based protocols suit small-scale systems, they are inefficient for large systems because of limited bandwidth. Therefore, large-scale CMPs require directory-based solutions, in which a hardware structure called the directory holds the sharing information. The directory keeps track of all memory blocks and of which cores' caches store a copy of each block. It sends messages only to the caches that store the relevant blocks, and it also coordinates simultaneous accesses to a cache block. As directory-based protocols are scaled to many cores, performance, network-on-chip (NoC) traffic, and bandwidth become major problems.

In this paper, we present hardware and software mechanisms to improve the effectiveness of directory-based cache coherency on CMPs with shared memory. In multithreaded applications, some data accesses do not disrupt cache coherency, but they still produce coherency messages among cores. For example, read-only (private) data can be considered in this category. On the other hand, if data is accessed by at least two cores and at least one of the accesses is a write operation, it is called shared data. In our proposed system, private data and shared data are stored in separate caches, and the cache coherence protocol applies only to shared data. We implement our approach in two stages. First, we use Andersen’s pointer analysis to analyze a program and mark its private instructions, i.e., instructions that load or store private data, at compile time. Second, we run the program in the Sniper Multi-Core Simulator [Carlson] with the proposed hardware configuration. We use the SPLASH-2 and PARSEC-2.1 parallel benchmarks to test our approach. Simulation results show that our approach reduces cycle count, dynamic random access memory (DRAM) accesses, and coherency traffic.
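
The private/shared distinction defined in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the thesis marks private instructions statically at compile time with Andersen's pointer analysis, whereas this example simply applies the stated definition to a hypothetical access trace. The function name `classify_blocks` and the `(core_id, block_addr, is_write)` trace format are assumptions made for illustration.

```python
from collections import defaultdict


def classify_blocks(accesses):
    """Label each block "shared" or "private" using the abstract's definition.

    `accesses` is a hypothetical trace of (core_id, block_addr, is_write)
    tuples. A block is shared if at least two cores access it and at least
    one access is a write; every other block is treated as private.
    """
    cores_per_block = defaultdict(set)  # block -> set of cores that touched it
    written_blocks = set()              # blocks with at least one write

    for core_id, block_addr, is_write in accesses:
        cores_per_block[block_addr].add(core_id)
        if is_write:
            written_blocks.add(block_addr)

    return {
        block: "shared"
        if len(cores) >= 2 and block in written_blocks
        else "private"
        for block, cores in cores_per_block.items()
    }


# Illustrative trace (made-up addresses, not benchmark data).
trace = [
    (0, 0x100, False), (1, 0x100, False),  # read by two cores, never written
    (0, 0x200, True), (1, 0x200, False),   # written by core 0, read by core 1
    (2, 0x300, True),                      # written, but by a single core only
]
labels = classify_blocks(trace)
```

In this trace only block 0x200 meets both conditions, so under the proposed scheme only it would need to take part in the coherence protocol.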

DATE: 20 September 2017, Wednesday @ 14:30