Hi,

I was looking at some code (given below) which seems to perform very badly when attachments and detachments are used to simulate cache coherency.
In the code below, when remote_attach is false (i.e. no remote processors), the loop runs in 8.8 seconds using just the two A9 cores. When remote_attach is true, it takes 52.7 seconds, even though other cores are also executing and sharing the workload. This suggests that detach and attach are very expensive for this kind of code. (The detach system call performs dma_buf_unmap_attachment and dma_buf_detach; the attach system call performs dma_buf_attach and dma_buf_map_attachment.)

for (k = 0; k < N; k++) {
    if (remote_attach) {
        detach(path);
        attach(path);
    }

    for (i = start_indx; i < end_indx; i++) {
        for (j = 0; j < N; j++) {
            if (path[i][j] < (path[i][k] + path[k][j])) {
                path[i][j] = path[i][k] + path[k][j];
            }
        }
    }
}
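
To make the cost pattern concrete, here is a minimal, self-contained sketch of the same loop with detach and attach replaced by stubs that only count calls. The names relax_rows and sync_calls, the stub bodies, and the small N are placeholders made up for illustration; they are not the real system calls or the real problem size. It only shows that each core issues N detach/attach pairs per run, and, as far as I can tell, each pair operates on the whole path buffer rather than on the rows that actually changed.

```c
#include <assert.h>

#define N 4                       /* hypothetical small size, illustration only */

static unsigned long sync_calls;  /* number of detach/attach pairs issued */

/* Hypothetical stand-ins for the real system calls: they do no cache
 * maintenance at all and only record that a pair was issued. */
static void detach(int (*path)[N]) { (void)path; }
static void attach(int (*path)[N]) { (void)path; sync_calls++; }

/* Same loop structure as above, for rows [start_indx, end_indx). */
static void relax_rows(int (*path)[N], int start_indx, int end_indx,
                       int remote_attach)
{
    assert(0 <= start_indx && start_indx <= end_indx && end_indx <= N);

    for (int k = 0; k < N; k++) {
        if (remote_attach) {
            /* One pair per k iteration, regardless of how many
             * entries were actually written in the previous pass. */
            detach(path);
            attach(path);
        }

        for (int i = start_indx; i < end_indx; i++) {
            for (int j = 0; j < N; j++) {
                if (path[i][j] < path[i][k] + path[k][j])
                    path[i][j] = path[i][k] + path[k][j];
            }
        }
    }
}
```

Calling relax_rows(path, 0, N, 1) leaves sync_calls equal to N, so the synchronisation cost grows with N times the cost of one full detach/attach, independent of how much data changed.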

I would like to manage the cache explicitly and flush individual cache lines rather than whole pages to reduce the overhead. I also want to access these buffers from user space. I can change some kernel code for this. Where should I start?

Thanks in advance.

--Kiran