Performance of Kernel Inlining

Jared Crean
I'm working on a high-dimensional finite difference code, and I got a strange performance result. I have a kernel function that
computes the stencil at a given point, and an outer function, outer_func, that loops over the dimensions and calls the kernel function at every grid point.
I created a second function, outer_func2, with the same loops as outer_func, but rather than calling the kernel function, it has the contents of
the kernel function copied into it.  The source code is here: https://github.com/JaredCrean2/wave6d/blob/master/src/test_inline.jl
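
For readers who do not want to open the linked file, here is a minimal sketch of the two variants being compared. It is not the code from test_inline.jl: the 2D array, the 3-point stencil, and the name `kernel` are assumptions made for illustration, and the syntax targets current Julia rather than 0.4.

```julia
# Hypothetical 3-point stencil along the first dimension.
# The real kernel in test_inline.jl works on a higher-dimensional array.
function kernel(u::Matrix{Float64}, i::Int, j::Int)
    return u[i-1, j] - 2.0 * u[i, j] + u[i+1, j]
end

# Variant 1: the outer function calls the kernel at every interior grid point.
function outer_func(u_new::Matrix{Float64}, u::Matrix{Float64})
    for j in 1:size(u, 2)
        for i in 2:(size(u, 1) - 1)
            u_new[i, j] = kernel(u, i, j)
        end
    end
    return u_new
end

# Variant 2: identical loops, with the kernel body copied in by hand.
function outer_func2(u_new::Matrix{Float64}, u::Matrix{Float64})
    for j in 1:size(u, 2)
        for i in 2:(size(u, 1) - 1)
            u_new[i, j] = u[i-1, j] - 2.0 * u[i, j] + u[i+1, j]
        end
    end
    return u_new
end

u = rand(64, 64)
a = outer_func(zeros(64, 64), u)
b = outer_func2(zeros(64, 64), u)
```

Both variants compute the same values, so any timing difference comes from how the compiler optimizes the two loop nests, not from the arithmetic.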

The performance results (with bounds checking disabled and --math-mode=fast) are:

testing outer_func
0.398586 seconds
0.398821 seconds
testing outer_func2
2.522230 seconds
2.522479 seconds

I ran this on an Intel Ivy Bridge (i7-3820) processor, using Julia 0.4.4.

I looked at the LLVM code (attached) and noticed that outer_func2 has a bunch of extra statements that look like

  %lsr.iv570 = phi i8* [ %scevgep571, %L21 ], [ %scevgep569, %L.preheader ]

that are not present for outer_func.  I don't know LLVM code very well (hardly at all), so I'm not sure what these mean.  Any help
understanding either the LLVM code or the performance difference would be appreciated.
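
For anyone who wants to reproduce this kind of listing, here is a minimal sketch of dumping the LLVM IR for a small loop with `code_llvm`. The function `axpy_loop!` is a hypothetical stand-in, not part of the linked code, and the syntax targets current Julia, where `code_llvm` lives in the InteractiveUtils standard library. The instructions printed depend on the Julia and LLVM versions, so the output will not necessarily contain the same `phi` lines as the attachments.

```julia
using InteractiveUtils  # provides code_llvm when running outside the REPL

# Hypothetical stand-in loop; any function with a simple loop would do.
function axpy_loop!(y::Vector{Float64}, a::Float64, x::Vector{Float64})
    for i in 1:length(y)
        y[i] += a * x[i]
    end
    return y
end

# Capture the IR for this argument signature as a string, then print it.
io = IOBuffer()
code_llvm(io, axpy_loop!, Tuple{Vector{Float64}, Float64, Vector{Float64}})
ir = String(take!(io))
println(ir)
```

In the IR, a `phi` instruction selects a value depending on which predecessor block control arrived from, which is how loop-carried variables such as counters and pointers are expressed.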



  Thanks,
     Jared Crean

Attachments: outer_func_llvm.txt (15K), outer_func2_llvm.txt (44K)

Re: Performance of Kernel Inlining

Kristoffer Carlsson
Could it be some alias checking going on?

Anyway, this code seems to be horribly slow on 0.6 (even with #19097).

to_indexes(::Int64, ::Int64, ::Vararg{Int64,N}) at operators.jl:868 (repeats 3 times)
kills performance.


(In reply to Jared Crean's post of Saturday, October 29, 2016 at 5:56:12 AM UTC+2.)

Re: Performance of Kernel Inlining

Jared Crean
I noticed this morning that the loops are in the wrong order for a column-major array.  Reversing them, I get:

testing outer_func
0.294904 seconds
0.296689 seconds
testing outer_func2
0.280391 seconds
0.281223 seconds

Now both versions have the phi instructions, so I guess that wasn't the problem.
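
To illustrate the loop-order point, here is a minimal sketch, not taken from test_inline.jl. Julia arrays are column-major, so the first index should vary in the innermost loop. The function names and the summation are assumptions chosen to keep the example short, and the syntax targets current Julia.

```julia
# Cache-unfriendly order: the first index varies in the OUTER loop, so the
# inner loop jumps through memory with a stride of size(A, 1) elements.
function sum_first_index_outer(A::Matrix{Float64})
    s = 0.0
    for i in 1:size(A, 1)
        for j in 1:size(A, 2)
            s += A[i, j]
        end
    end
    return s
end

# Column-major-friendly order: the first index varies in the INNER loop,
# so consecutive iterations touch adjacent memory locations.
function sum_first_index_inner(A::Matrix{Float64})
    s = 0.0
    for j in 1:size(A, 2)
        for i in 1:size(A, 1)
            s += A[i, j]
        end
    end
    return s
end

A = rand(512, 512)
s_outer = sum_first_index_outer(A)
s_inner = sum_first_index_inner(A)
```

The two functions return the same sum up to floating-point rounding; only the memory access pattern differs.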


And sprinkling a little @simd on the inner loops:

testing outer_func
0.159910 seconds
0.157640 seconds
testing outer_func2
0.151384 seconds
0.152224 seconds
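
Here is a minimal sketch of what adding @simd to an inner loop can look like. The 2D stencil and the name `stencil_simd!` are assumptions for illustration, not the code from the repository, and the syntax targets current Julia.

```julia
# Hypothetical 2D version of the stencil loop with @simd on the inner loop.
# @simd asserts that iterations are independent, so u_new and u must not alias.
function stencil_simd!(u_new::Matrix{Float64}, u::Matrix{Float64})
    size(u_new) == size(u) || throw(DimensionMismatch("u_new and u must have the same size"))
    for j in 1:size(u, 2)
        # The inner loop runs over the first (contiguous) index.
        @simd for i in 2:(size(u, 1) - 1)
            # The size check above keeps every access in bounds.
            @inbounds u_new[i, j] = u[i-1, j] - 2.0 * u[i, j] + u[i+1, j]
        end
    end
    return u_new
end

# The second difference of i^2 is exactly 2 at every interior point.
u = [Float64(i^2) for i in 1:16, j in 1:4]
r = stencil_simd!(zeros(16, 4), u)
```

Whether the compiler actually emits vector instructions depends on the loop body and the target CPU, so checking the generated code is still worthwhile.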

I'm going to write a Fortran code to do a performance comparison, but this is looking pretty good.

Do you think I should file a performance issue for the original code?

  Jared Crean



(In reply to Kristoffer Carlsson's post of Saturday, October 29, 2016 at 4:13:48 AM UTC-4.)

Re: Performance of Kernel Inlining

Jared Crean
The timing for the Fortran code (using -Ofast) is

  outer_func2 time = 0.160010 second.

I checked, and it is using vector instructions.  I'm impressed that Julia is as fast as Fortran in this case.  I would have thought alias checking would slow Julia down.

The Julia code is slow on release-0.5 as well as 0.6, so I will file an issue.

  Jared Crean



(In reply to Jared Crean's post of Saturday, October 29, 2016 at 11:05:38 AM UTC-4.)