I am probably rediscovering the wheel or about to say something that will get a response in the form of "its complicated"
It occurred to me that what we need to optimize are loops and iterations. The following code will run slow:
whereas the following code will run fast
because in the first version total starts in a different type than the element of the matrix A . and there is now way for the JITter to tell if this is deliberate or careless coding . What I suggest is to replace any loop in the code with an atleast one unrolled version and to push the rest into an anonymous function. In the parser level. for example:
And the timing results compared to builtin sum function:

On Mon, Sep 26, 2016 at 3:23 AM, Tsur Herman <[hidden email]> wrote:
> I am probably rediscovering the wheel or about to say something that will > get a response in the form of "its complicated" > > It occurred to me that what we need to optimize are loops and iterations. > > The following code will run slow: > > function F1(A) > total = 0; > for a in A > total += a; > end > total > end > > whereas the following code will run fast > > function F2(A) > total = zero(eltype(A)); > for a in A > total += a; > end > total > end > julia> A=rand(4000,4000); > > julia> F1(A);@time F1(A) > 0.362453 seconds (48.00 M allocations: 732.422 MB, 5.72% gc time) > 7.997862445435431e6 > > julia> F2(A);@time F2(A) > 0.025840 seconds (5 allocations: 176 bytes) > 7.997862445435431e6 > > > because in the first version total starts in a different type than the > element of the matrix A . and there is now way for the JITter to tell if > this is deliberate or careless coding . > What I suggest is to replace any loop in the code with an atleast one > unrolled version and to push the rest into an anonymous function. > In the parser level. > > for example: > > function FF(A) > total = 0 > > loopbody(total,a) = return total += a > > total = loopbody(total,A[1]) > > lambda(A,range,total,loopbody) = begin > @simd for i in range > @inbounds total = loopbody(total,A[i]) > end > total > end > > return lambda(A,2:prod(size(A)),total,loopbody) > end > > > And the timing results compared to builtin sum function: > > julia> FF(A);@time FF(A) > 0.010596 seconds (7 allocations: 208 bytes) > 7.997862445434973e6 > > julia> sum(A);@time sum(A) > 0.010666 seconds (5 allocations: 176 bytes) > 7.997862445434952e6 > > This is the henchmen unrolling in https://github.com/JuliaLang/julia/issues/3440 
Just a footnote that the term "henchmen unrolling" makes no sense, and
I believe originates from an iphone autocorrect in an email by Stefan. We found it funny so we just kept it. Hopefully this term will eventually appear in a textbook somewhere :) On Mon, Sep 26, 2016 at 8:15 AM, Yichao Yu <[hidden email]> wrote: > On Mon, Sep 26, 2016 at 3:23 AM, Tsur Herman <[hidden email]> wrote: >> I am probably rediscovering the wheel or about to say something that will >> get a response in the form of "its complicated" >> >> It occurred to me that what we need to optimize are loops and iterations. >> >> The following code will run slow: >> >> function F1(A) >> total = 0; >> for a in A >> total += a; >> end >> total >> end >> >> whereas the following code will run fast >> >> function F2(A) >> total = zero(eltype(A)); >> for a in A >> total += a; >> end >> total >> end >> julia> A=rand(4000,4000); >> >> julia> F1(A);@time F1(A) >> 0.362453 seconds (48.00 M allocations: 732.422 MB, 5.72% gc time) >> 7.997862445435431e6 >> >> julia> F2(A);@time F2(A) >> 0.025840 seconds (5 allocations: 176 bytes) >> 7.997862445435431e6 >> >> >> because in the first version total starts in a different type than the >> element of the matrix A . and there is now way for the JITter to tell if >> this is deliberate or careless coding . >> What I suggest is to replace any loop in the code with an atleast one >> unrolled version and to push the rest into an anonymous function. >> In the parser level. >> >> for example: >> >> function FF(A) >> total = 0 >> >> loopbody(total,a) = return total += a >> >> total = loopbody(total,A[1]) >> >> lambda(A,range,total,loopbody) = begin >> @simd for i in range >> @inbounds total = loopbody(total,A[i]) >> end >> total >> end >> >> return lambda(A,2:prod(size(A)),total,loopbody) >> end >> >> >> And the timing results compared to builtin sum function: >> >> julia> FF(A);@time FF(A) >> 0.010596 seconds (7 allocations: 208 bytes) >> 7.997862445434973e6 >> >> julia> sum(A);@time sum(A) >> 0.010666 seconds (5 allocations: 176 bytes) >> 7.997862445434952e6 >> >> > > This is the henchmen unrolling in https://github.com/JuliaLang/julia/issues/3440 
I believe the actual term for this kind of optimization is "loop peeling", which makes much more sense. I'm probably still gonna call it henchmen unrolling though. An issue with henchmen unrolling is that while it may improve type inference for the loop body, it often doesn't improve the inferred type for the entire function. The classic example is the typeunstable summation loop:
function sumit(v::Vector{Float64}) Henchmen unrolling this makes the loop body typestable – s is Float64 – but the function can still return an Int in the case where v is empty, so the function remains type unstable. I guess that a return type annotation plus henchmen unrolling does fix the problem, but it seems better to use static analysis to uncover this to the programmer so they know to write s = 0.0 instead. On Tue, Sep 27, 2016 at 3:32 PM, Jeff Bezanson <[hidden email]> wrote: Just a footnote that the term "henchmen unrolling" makes no sense, and 
Le mardi 27 septembre 2016 à 17:13 0400, Stefan Karpinski a écrit :
> I believe the actual term for this kind of optimization is "loop > peeling", which makes much more sense. I'm probably still gonna call > it henchmen unrolling though. An issue with henchmen unrolling is > that while it may improve type inference for the loop body, it often > doesn't improve the inferred type for the entire function. The > classic example is the typeunstable summation loop: > > function sumit(v::Vector{Float64}) > s = 0 > for x in v > s += x > end > return s > end > > Henchmen unrolling this makes the loop body typestable – s is > Float64 – but the function can still return an Int in the case where > v is empty, so the function remains type unstable. I guess that a > return type annotation plus henchmen unrolling does fix the problem, > but it seems better to use static analysis to uncover this to the > programmer so they know to write s = 0.0 instead. whatever type it needs to make the function type stable... These zero() initialization steps can be so tedious to write fully correctly when the computation inside the loop is complex. This is true in particular when it involves a multiplication and an addition, for which custom types might have arbitrary promotion rules. A poster child for this is computing the deviance of a linear model: https://github.com/JuliaStats/GLM.jl/blob/4d9399b559345b3926c74631019d645073b53bb0/src/lm.jl#L31 Any chance we could find a solution to this? Julia is generally smart enough to infer the best type everywhere it's logically possible, and this is one of the very few places were it doesn't. Regards > On Tue, Sep 27, 2016 at 3:32 PM, Jeff Bezanson > <[hidden email]> wrote: > > Just a footnote that the term "henchmen unrolling" makes no sense, > > and > > I believe originates from an iphone autocorrect in an email by > > Stefan. > > We found it funny so we just kept it. Hopefully this term will > > eventually appear in a textbook somewhere :) > > > > > > On Mon, Sep 26, 2016 at 8:15 AM, Yichao Yu <[hidden email]> > > wrote: > > > On Mon, Sep 26, 2016 at 3:23 AM, Tsur Herman <[hidden email] > > om> wrote: > > >> I am probably rediscovering the wheel or about to say something > > that will > > >> get a response in the form of "its complicated" > > >> > > >> It occurred to me that what we need to optimize are loops and > > iterations. > > >> > > >> The following code will run slow: > > >> > > >> function F1(A) > > >> total = 0; > > >> for a in A > > >> total += a; > > >> end > > >> total > > >> end > > >> > > >> whereas the following code will run fast > > >> > > >> function F2(A) > > >> total = zero(eltype(A)); > > >> for a in A > > >> total += a; > > >> end > > >> total > > >> end > > >> julia> A=rand(4000,4000); > > >> > > >> julia> F1(A);@time F1(A) > > >> 0.362453 seconds (48.00 M allocations: 732.422 MB, 5.72% gc > > time) > > >> 7.997862445435431e6 > > >> > > >> julia> F2(A);@time F2(A) > > >> 0.025840 seconds (5 allocations: 176 bytes) > > >> 7.997862445435431e6 > > >> > > >> > > >> because in the first version total starts in a different type > > than the > > >> element of the matrix A . and there is now way for the JITter to > > tell if > > >> this is deliberate or careless coding . > > >> What I suggest is to replace any loop in the code with an at > > least one > > >> unrolled version and to push the rest into an anonymous > > function. > > >> In the parser level. > > >> > > >> for example: > > >> > > >> function FF(A) > > >> total = 0 > > >> > > >> loopbody(total,a) = return total += a > > >> > > >> total = loopbody(total,A[1]) > > >> > > >> lambda(A,range,total,loopbody) = begin > > >> @simd for i in range > > >> @inbounds total = loopbody(total,A[i]) > > >> end > > >> total > > >> end > > >> > > >> return lambda(A,2:prod(size(A)),total,loopbody) > > >> end > > >> > > >> > > >> And the timing results compared to builtin sum function: > > >> > > >> julia> FF(A);@time FF(A) > > >> 0.010596 seconds (7 allocations: 208 bytes) > > >> 7.997862445434973e6 > > >> > > >> julia> sum(A);@time sum(A) > > >> 0.010666 seconds (5 allocations: 176 bytes) > > >> 7.997862445434952e6 > > >> > > >> > > > > > > This is the henchmen unrolling in https://github.com/JuliaLang/ju > > lia/issues/3440 > > > 
Stefan wrote:
Maybe a solution to this is to define operations involving empty containers as a conversion operator. and any iteration over an item from container will be preceded with an iteration using empty containers. so a=[] #type float64 s=0 #type int s += a => convert(typeof(a),s) And another thing in that matter : I think eltype of function should return the inferred type for that function (applied recursively) so lets say we have:
then
would be :
I think Base.code_typed already does that although I don't think it is recursive. The whole idea came from observing that:
But if I declare eltype(G::Base.Generator) = Base.code_typed(G.f,(eltype(G. then:

Free forum by Nabble  Edit this page 