fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
orivej has quit [Ping timeout: 250 seconds]
khaled has quit [Quit: Konversation terminated!]
hpt has joined #systemtap
orivej has joined #systemtap
orivej has quit [Ping timeout: 256 seconds]
orivej has joined #systemtap
yogananth has joined #systemtap
khaled has joined #systemtap
hpt has quit [Ping timeout: 256 seconds]
tromey has joined #systemtap
yogananth has quit [Quit: Leaving]
khaled has quit [Remote host closed the connection]
khaled has joined #systemtap
khaled has quit [Quit: Konversation terminated!]
khaled has joined #systemtap
<lindi->
Amy1: for(i=0;i<Nbody;i++){ should be for(i=0;i<Nbody-1;i++){
<lindi->
Amy1: since you want i+1 to stay below Nbody
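[Note: the code under discussion is not in the log, so the following is only a minimal sketch of the loop-bound fix. The function name sum_adjacent_diffs and its arguments are hypothetical. Any loop body that reads element i+1 has to stop one iteration early.]

```c
#include <stddef.h>

/* Hypothetical helper, not from the log: sums x[i+1] - x[i] over all
 * neighbouring pairs. The loop stops at n - 1 so that i + 1 stays
 * below n, which is the fix lindi- describes. */
double sum_adjacent_diffs(const double *x, size_t n)
{
    double sum = 0.0;

    if (n < 2)              /* no pairs; also avoids n - 1 wrapping for n == 0 */
        return sum;

    for (size_t i = 0; i < n - 1; i++)
        sum += x[i + 1] - x[i];

    return sum;
}
```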
<lindi->
Amy1: depending on the values for Nbody and Ndim you might benefit from chunking to keep the working set in the cache
<lindi->
Amy1: but without some benchmarks it's pretty hard to say
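[Note: a rough sketch of what chunking could look like, assuming the kernel is a pairwise loop over bodies. The function pairwise_blocked, the tile size BLOCK, and the quantity it computes are all hypothetical and not taken from the code being discussed. Whether tiling helps at all is something only a benchmark can tell.]

```c
#include <stddef.h>

#define BLOCK 256   /* assumed tile size; pick the real value by benchmarking */

/* Hypothetical pairwise kernel: out[i] = sum over j of (x[j] - x[i]).
 * The j range is visited one tile at a time, so the same BLOCK elements
 * of x are reused for every i before the next tile is loaded. */
void pairwise_blocked(const double *x, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = 0.0;

    for (size_t jj = 0; jj < n; jj += BLOCK) {
        size_t jend = (jj + BLOCK < n) ? jj + BLOCK : n;

        for (size_t i = 0; i < n; i++) {
            double acc = 0.0;
            for (size_t j = jj; j < jend; j++)
                acc += x[j] - x[i];
            out[i] += acc;
        }
    }
}
```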
<Amy1>
lindi-: NBody is 4*1024, NDim is 3.
<lindi->
Amy1: in that case your memory access pattern is not very optimal
<Amy1>
I think it is clear what the function does.
<lindi->
Amy1: think about how the cache gets used
<Amy1>
lindi-: ?
<lindi->
Amy1: do the memory accesses stay in the L1 cache?
<Amy1>
I can exchange i and l. That will give the accesses to pos and delta_pos better locality.
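[Note: a minimal sketch of the loop interchange, assuming pos and delta_pos are flat arrays stored body-major, so that component l of body i sits at index i*NDIM + l. That layout, the function name neighbour_deltas, and the neighbour-difference computation are assumptions; the real code is not in the log. With this layout, putting i in the outer loop and l in the inner loop walks memory contiguously.]

```c
#include <stddef.h>

enum { NDIM = 3 };  /* matches the NDim value given in the conversation */

/* Hypothetical kernel: delta_pos holds (nbody - 1) * NDIM values, the
 * component-wise difference between body i + 1 and body i.
 * Assumed layout: component l of body i is at index i * NDIM + l.
 * i is the outer loop and l the inner loop, so consecutive iterations
 * touch consecutive addresses in both pos and delta_pos. */
void neighbour_deltas(const double *pos, double *delta_pos, size_t nbody)
{
    for (size_t i = 0; i + 1 < nbody; i++) {
        for (size_t l = 0; l < NDIM; l++) {
            delta_pos[i * NDIM + l] =
                pos[(i + 1) * NDIM + l] - pos[i * NDIM + l];
        }
    }
}
```

If the arrays were instead stored component-major (index l*Nbody + i), the opposite loop order would be the contiguous one.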