@inbounds and @simd not showing any sign of speedup

Juan Lopez
Hello,

I have a function that essentially performs a single operation inside a loop. Adding @simd or @inbounds does not improve the run time; if anything, it gets slightly worse.

julia> using BenchmarkTools

julia> A = rand(1000,1000)

julia> function f!(n::Integer, DA::Number, DX::AbstractArray, incx::Integer) #Original function
           i = 1
           n = min(n,length(DX))
           while i <= n
               DX[i] *= DA
               i += incx
           end
           DX
       end
f! (generic function with 1 method)

julia> function f2!(n::Integer, DA::Number, DX::AbstractArray, incx::Integer) #inner cycle @inbounds and @simd
           i = 1
           n = min(n,length(DX))
           @inbounds @simd for i in 1:incx:n
               DX[i] *= DA
           end
           DX
       end
f2! (generic function with 1 method)

julia> @inbounds function f3!(n::Integer, DA::Number, DX::AbstractArray, incx::Integer) #inner cycle @simd, function @inbounds
           i = 1
           n = min(n,length(DX))
           @simd for i in 1:incx:n
               DX[i] *= DA
           end
           DX
       end

julia> minimum(@benchmark f!(length(A),1.0,A,1))
BenchmarkTools.TrialEstimate:
  time:             52.04 ms
  gctime:           0.00 ns (0.00%)
  memory:           16.00 bytes
  allocs:           1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> minimum(@benchmark f2!(length(A),1.0,A,1))
BenchmarkTools.TrialEstimate:
  time:             55.80 ms
  gctime:           0.00 ns (0.00%)
  memory:           16.00 bytes
  allocs:           1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> minimum(@benchmark f3!(length(A),1.0,A,1))
BenchmarkTools.TrialEstimate:
  time:             55.62 ms
  gctime:           0.00 ns (0.00%)
  memory:           16.00 bytes
  allocs:           1
  time tolerance:   5.00%
  memory tolerance: 1.00%

Is there an explanation for this? Thank you

Re: @inbounds and @simd not showing any sign of speedup

Andreas Lobinger
Hello colleague,

On Friday, July 29, 2016 at 8:59:36 AM UTC+2, Juan Lopez wrote:
Hello,

I have a function which is doing basically an operation inside a loop and when adding @simd or @inbounds time doesn't improve, in any case it seems slightly worse.
 
Is there an explanation for this? Thank you

There is a non-negligible probability that the plain loop is already compiled to optimal code. You could try looking at the lowered code.
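As a minimal sketch of how one might inspect the lowered code, assuming Julia 1.x (where `@code_lowered` lives in `InteractiveUtils`): the name `scale_while!` is a hypothetical stand-in for the original `f!`. Note that lowered code is produced before type inference and LLVM optimization, so it shows the loop structure but not whether anything was vectorized.

```julia
using InteractiveUtils  # provides @code_lowered outside the REPL on Julia 1.x

# Hypothetical stand-in for the original f!, renamed to avoid clashing with it.
function scale_while!(n::Integer, DA::Number, DX::AbstractArray, incx::Integer)
    i = 1
    n = min(n, length(DX))
    while i <= n
        DX[i] *= DA
        i += incx
    end
    return DX
end

X = rand(8)

# The macro returns the lowered form of the method that this call would dispatch to.
lowered = @code_lowered scale_while!(length(X), 2.0, X, 1)
println(lowered)
```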

 
Re: @inbounds and @simd not showing any sign of speedup

Kristoffer Carlsson
It is likely because the ranges are not UnitRanges.
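To illustrate the distinction, here is a small sketch (the values and the name `scale_dispatch!` are hypothetical, not from the original post). A range written as `1:incx:n` is a `StepRange` even when `incx == 1`, whereas `1:n` is a `UnitRange`. One possible workaround, under the assumption that unit stride is the common case, is to branch on the stride so that the `@simd` loop only ever sees a `UnitRange`.

```julia
n, incx = 10, 1

strided = 1:incx:n   # the range form used in f2! and f3!
unit    = 1:n        # the form available when the stride is known to be 1

println(isa(strided, StepRange))            # a StepRange, even though the step is 1
println(isa(unit, UnitRange))
println(collect(strided) == collect(unit))  # both visit the same indices

# Hypothetical variant that branches on the stride; assumes a 1-based Array.
function scale_dispatch!(n::Integer, DA::Number, DX::Array, incx::Integer)
    n = min(n, length(DX))
    if incx == 1
        # Unit stride: iterate over a UnitRange so @simd has a contiguous loop.
        @inbounds @simd for i in 1:n
            DX[i] *= DA
        end
    else
        # General stride: plain loop over a StepRange.
        @inbounds for i in 1:incx:n
            DX[i] *= DA
        end
    end
    return DX
end
```

Whether the unit-stride branch is actually faster depends on the machine and Julia version, so it is worth benchmarking both paths.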

Re: @inbounds and @simd not showing any sign of speedup

Valentin Churavy
A great tool for figuring out what is going on in these cases is `@code_llvm`. It shows you a representation of your code that is still readable, but very close to the machine.

Your simple Julia code without `@simd` is nearly optimal, but it does benefit from the inclusion of `@inbounds`:

  • While loop with `@inbounds`: minimum time: 797.28 μs
  • While loop without `@inbounds`: minimum time: 1.01 ms
  • For loop without / with `@inbounds`: minimum time: 802 μs / 812.11 μs

function simple(A, b, stride, N)
  N = min(N, length(A))
  for i in 1:stride:N
    @inbounds A[i] *= b
  end 
end

function while_based(A, b, stride, N)
  i = 1 
  N = min(N, length(A))
  while i <= N
    A[i] *= b
    i += stride
  end 
end
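
For anyone who wants to repeat this kind of comparison without extra packages, here is a rough sketch using only `@elapsed` from Base. The names `for_based!`, `while_based!`, and `min_time` are hypothetical; the first two are copies of the functions above that also return the array so the results can be compared. This is not how the timings listed above were produced, and it is much cruder than BenchmarkTools.

```julia
# Copy of `simple` above, returning A so results can be compared.
function for_based!(A, b, stride, N)
    N = min(N, length(A))
    for i in 1:stride:N
        @inbounds A[i] *= b
    end
    return A
end

# Copy of `while_based` above, returning A so results can be compared.
function while_based!(A, b, stride, N)
    i = 1
    N = min(N, length(A))
    while i <= N
        A[i] *= b
        i += stride
    end
    return A
end

# Minimum wall-clock time in seconds over `reps` runs, each on a fresh copy of A.
function min_time(f!, A, b, stride; reps = 20)
    f!(copy(A), b, stride, length(A))  # warm-up call so compilation is not timed
    best = Inf
    for _ in 1:reps
        B = copy(A)                    # copying happens outside the timed region
        t = @elapsed f!(B, b, stride, length(B))
        best = min(best, t)
    end
    return best
end

A = rand(1000, 1000)
println("for loop:   ", min_time(for_based!, A, 2.0, 1), " s")
println("while loop: ", min_time(while_based!, A, 2.0, 1), " s")
```

Taking the minimum rather than the mean follows the `minimum(@benchmark ...)` usage in the original post, since it is the least affected by noise from other processes.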


Now to the question of whether `@simd` is beneficial in this case. LLVM has a loop vectorizer that we run, and it performs a cost-benefit (and correctness) analysis when it sees a loop. The fact that we don't see vectorized code in the `@code_llvm` output means that LLVM did not deem it worthwhile to vectorize our code (as Kristoffer said, most likely because of the non-unit strides). With `@simd` we (forcibly) tell LLVM to vectorize our code, to be less strict about correctness, and to skip the cost-benefit analysis. While vectorized code has great performance benefits, it also comes with costs (code size increase, overhead).
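
One way to check for vectorized code without reading the IR by eye is sketched below, assuming Julia 1.x. The functions `unit_simd!` and `strided!` and the helper `has_vector_ops` are hypothetical names introduced here. The helper simply searches the optimized LLVM IR for a vector-of-doubles type such as `<4 x double>`, which is a crude textual heuristic and not a definitive test.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL on Julia 1.x

# Hypothetical unit-stride variant, written only to have something to compare against.
function unit_simd!(A, b)
    @inbounds @simd for i in 1:length(A)
        A[i] *= b
    end
    return A
end

# Hypothetical strided variant, mirroring the loop discussed above.
function strided!(A, b, stride)
    @inbounds for i in 1:stride:length(A)
        A[i] *= b
    end
    return A
end

# Capture the IR as a string and look for an LLVM vector type like `<4 x double>`.
function has_vector_ops(f, types)
    ir = sprint(code_llvm, f, types)
    return occursin(r"<\d+ x double>", ir)
end

println("unit stride vectorized? ", has_vector_ops(unit_simd!, (Vector{Float64}, Float64)))
println("strided vectorized?     ", has_vector_ops(strided!, (Vector{Float64}, Float64, Int)))
```

The answers depend on the CPU, the Julia version, and the LLVM version, so no expected output is given here.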

I hope this rough analysis helps.

