Packages Distances problem with Distances.Jaccard : very slow

classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

Packages Distances problem with Distances.Jaccard : very slow

jean-pierre both


I encountered in my application with Distances.Jaccard compared with Distances.Euclidean
It was very slow.

For example with 2 vecteurs Float64 of size 11520

I get the following
julia> D=Euclidean()
Distances.Euclidean()
julia> @time for i in 1:500
       evaluate(D,v1,v2)
       end
  0.002553 seconds (500 allocations: 7.813 KB)

and with Jaccard

julia> D=Jaccard()
Distances.Jaccard()
@time for i in 1:500
              evaluate(D,v1,v2)
              end
  1.995046 seconds (40.32 M allocations: 703.156 MB, 9.68% gc time)

With a simple loop for computing jaccard :


function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1})
           num = 0
           den = 0
           for i in 1:length(a)
                   num = num + min(a[i],b[i])
                   den = den + max(a[i],b[i])     
           end
               1. - num/den
       end
myjaccard2 (generic function with 1 method)

julia> @time for i in 1:500
              myjaccard2(v1,v2)
              end
  0.451582 seconds (23.04 M allocations: 351.592 MB, 20.04% gc time)

I do not see the problem in jaccard distance implementation in the Distances packages
Reply | Threaded
Open this post in threaded view
|

Re: Packages Distances problem with Distances.Jaccard : very slow

Mauro
I am not sure I quite understand.  But it looks like you know how to
improve the Jaccard distance code, so why not make a pull request at
Distances.jl?

> function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1})
>            num = 0
>            den = 0
>            for i in 1:length(a)
>                    num = num + min(a[i],b[i])
>                    den = den + max(a[i],b[i])
>            end
>                1. - num/den
>        end

Initialising num and den as Int gives you a type instability.  This
should work better (plus a few cosmetic changes):

function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1})
    num = 0.0
    den = 0.0
    for i in 1:length(a)
        num += min(a[i],b[i])
        den += max(a[i],b[i])
    end
    return 1. - num/den
end
Reply | Threaded
Open this post in threaded view
|

Re: Packages Distances problem with Distances.Jaccard : very slow

Simon Danisch
In reply to this post by jean-pierre both
It also seems unnecessary to restrict it to Float64 and Array:

function myjaccard2{T}(A::AbstractArray{T,1}, B::AbstractArray{T, 1})
    num = zero(T)
    den = zero(T) 
    for (a,b) in zip(A,B) 
        num += min(a,b) 
        den += max(a,b) 
    end 
    return 1.0 - num/den 
end

Maybe also use T1, T2 and then promote to a common type ;)

Best,
Simon

Am Montag, 13. Juni 2016 03:53:14 UTC+2 schrieb jean-pierre both:


I encountered in my application with Distances.Jaccard compared with Distances.Euclidean
It was very slow.

For example with 2 vecteurs Float64 of size 11520

I get the following
julia> D=Euclidean()
Distances.Euclidean()
julia> @time for i in 1:500
       evaluate(D,v1,v2)
       end
  0.002553 seconds (500 allocations: 7.813 KB)

and with Jaccard

julia> D=Jaccard()
Distances.Jaccard()
@time for i in 1:500
              evaluate(D,v1,v2)
              end
  1.995046 seconds (40.32 M allocations: 703.156 MB, 9.68% gc time)

With a simple loop for computing jaccard :


function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1})
           num = 0
           den = 0
           for i in 1:length(a)
                   num = num + min(a[i],b[i])
                   den = den + max(a[i],b[i])     
           end
               1. - num/den
       end
myjaccard2 (generic function with 1 method)

julia> @time for i in 1:500
              myjaccard2(v1,v2)
              end
  0.451582 seconds (23.04 M allocations: 351.592 MB, 20.04% gc time)

I do not see the problem in jaccard distance implementation in the Distances packages
Reply | Threaded
Open this post in threaded view
|

Re: Packages Distances problem with Distances.Jaccard : very slow

Kristoffer Carlsson
In reply to this post by jean-pierre both
It seems weird to me that you guys want to call Jaccard distance with float arrays. AFAIK Jaccard distance measures the distance between two distinct samples from a pair of sets so basically between two Vector{Bool}, see: http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jaccard.html

"Computes the Jaccard-Needham dissimilarity between two boolean 1-D arrays."

Is there some more general formulation of it that extends to vectors in a continuous vector space?

And, to note, Jaccard is type stable for inputs of Vector{Bool} in Distances.jl.

On Monday, June 13, 2016 at 3:53:14 AM UTC+2, jean-pierre both wrote:


I encountered in my application with Distances.Jaccard compared with Distances.Euclidean
It was very slow.

For example with 2 vecteurs Float64 of size 11520

I get the following
julia> D=Euclidean()
Distances.Euclidean()
julia> @time for i in 1:500
       evaluate(D,v1,v2)
       end
  0.002553 seconds (500 allocations: 7.813 KB)

and with Jaccard

julia> D=Jaccard()
Distances.Jaccard()
@time for i in 1:500
              evaluate(D,v1,v2)
              end
  1.995046 seconds (40.32 M allocations: 703.156 MB, 9.68% gc time)

With a simple loop for computing jaccard :


function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1})
           num = 0
           den = 0
           for i in 1:length(a)
                   num = num + min(a[i],b[i])
                   den = den + max(a[i],b[i])     
           end
               1. - num/den
       end
myjaccard2 (generic function with 1 method)

julia> @time for i in 1:500
              myjaccard2(v1,v2)
              end
  0.451582 seconds (23.04 M allocations: 351.592 MB, 20.04% gc time)

I do not see the problem in jaccard distance implementation in the Distances packages
Reply | Threaded
Open this post in threaded view
|

Re: Packages Distances problem with Distances.Jaccard : very slow

jean-pierre both
It makes perfect sense to use Jaccard distances for float values
Cf  for example http://www.ncbi.nlm.nih.gov/pubmed/16794951

Nevertheless the problem is just an implementation, the time spent should be
comparable with the one with Euclidean.

The problem I mention is that the nice  implementation used in
 packages Distances is a problem for this distance as a simple loop is really faster.
I presume there is an optimization issue as the difference in time with Euclidean is many orders of magnitude
larger than what can be expected from the complexity.


The funny thing is that min and max seems also part of the problem as can be seen in the following:


function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1})
    num = 0.
    den = 0.
    for I in 1:length(a)
        @inbounds ai = a[I]
        @inbounds bi = b[I]
        num = num + min(ai,bi)
        den = den + max(ai,bi)     
    end
    1. - num/den
end



function testDistances2(v1::Array{Float64,1}, v2::Array{Float64,1})
    for i in 1:50000
        myjaccard2(v1,v2)
    end
end

@time testDistances2(v1,v2)
machine   3.217329 seconds (200.01 M allocations: 2.981 GB, 19.91% gc time)



function myjaccard5(a::Array{Float64,1}, b::Array{Float64,1})
    num = 0.
    den = 0.
    for I in 1:length(a)
        @inbounds ai = a[I]
        @inbounds bi = b[I]
        abs_m = abs(ai-bi)
        abs_p = abs(ai+bi)
        num += abs_p - abs_m
        den += abs_p + abs_m  
    end
    1. - num/den
end


function testDistances5(a::Array{Float64,1}, b::Array{Float64,1})
    for i in 1:5000
        myjaccard5(a,b)
    end
end

end


julia> @time testDistances5(v1,v2)
  0.166979 seconds (4 allocations: 160 bytes)



We see that using abs is faster.

I do not do a pull request beccause

I would expect a good implementation to be 2 or 3 times slower than Euclidean, and I have not
that yet.

Le lundi 13 juin 2016 13:43:00 UTC+2, Kristoffer Carlsson a écrit :
It seems weird to me that you guys want to call Jaccard distance with float arrays. AFAIK Jaccard distance measures the distance between two distinct samples from a pair of sets so basically between two Vector{Bool}, see: <a href="http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jaccard.html" target="_blank" rel="nofollow" onmousedown="this.href=&#39;http://www.google.com/url?q\x3dhttp%3A%2F%2Fdocs.scipy.org%2Fdoc%2Fscipy%2Freference%2Fgenerated%2Fscipy.spatial.distance.jaccard.html\x26sa\x3dD\x26sntz\x3d1\x26usg\x3dAFQjCNHKSm1FwZh9aWxFRTjXPthhi3I78w&#39;;return true;" onclick="this.href=&#39;http://www.google.com/url?q\x3dhttp%3A%2F%2Fdocs.scipy.org%2Fdoc%2Fscipy%2Freference%2Fgenerated%2Fscipy.spatial.distance.jaccard.html\x26sa\x3dD\x26sntz\x3d1\x26usg\x3dAFQjCNHKSm1FwZh9aWxFRTjXPthhi3I78w&#39;;return true;">http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jaccard.html

"Computes the Jaccard-Needham dissimilarity between two boolean 1-D arrays."

Is there some more general formulation of it that extends to vectors in a continuous vector space?

And, to note, Jaccard is type stable for inputs of Vector{Bool} in Distances.jl.

On Monday, June 13, 2016 at 3:53:14 AM UTC+2, jean-pierre both wrote:


I encountered in my application with Distances.Jaccard compared with Distances.Euclidean
It was very slow.

For example with 2 vecteurs Float64 of size 11520

I get the following
julia> D=Euclidean()
Distances.Euclidean()
julia> @time for i in 1:500
       evaluate(D,v1,v2)
       end
  0.002553 seconds (500 allocations: 7.813 KB)

and with Jaccard

julia> D=Jaccard()
Distances.Jaccard()
@time for i in 1:500
              evaluate(D,v1,v2)
              end
  1.995046 seconds (40.32 M allocations: 703.156 MB, 9.68% gc time)

With a simple loop for computing jaccard :


function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1})
           num = 0
           den = 0
           for i in 1:length(a)
                   num = num + min(a[i],b[i])
                   den = den + max(a[i],b[i])     
           end
               1. - num/den
       end
myjaccard2 (generic function with 1 method)

julia> @time for i in 1:500
              myjaccard2(v1,v2)
              end
  0.451582 seconds (23.04 M allocations: 351.592 MB, 20.04% gc time)

I do not see the problem in jaccard distance implementation in the Distances packages
Reply | Threaded
Open this post in threaded view
|

Re: Packages Distances problem with Distances.Jaccard : very slow

Mauro
> function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1})
>     num = 0.
>     den = 0.
>     for I in 1:length(a)
>         @inbounds ai = a[I]
>         @inbounds bi = b[I]
>         num = num + min(ai,bi)
>         den = den + max(ai,bi)
>     end
>     1. - num/den
> end
>
>
>
> function testDistances2(v1::Array{Float64,1}, v2::Array{Float64,1})
>     for i in 1:50000
>         myjaccard2(v1,v2)
>     end
> end

I recommend using the values returned for something, otherwise the
compiler sometimes eliminates the loop (but not here):

julia> function testDistances2(v1::Array{Float64,1}, v2::Array{Float64,1})
           out = 0.0
           for i in 1:50000
               out += myjaccard2(v1,v2)
           end
           out
       end

> @time testDistances2(v1,v2)
> machine   3.217329 seconds (200.01 M allocations: 2.981 GB, 19.91% gc time)

I cannot reproduce this, when I run it I get no allocations:

julia> v2 = rand(10^4);

# warm-up
julia> @time testDistances2(v1,v2)
  3.604478 seconds (8.15 k allocations: 401.797 KB, 0.42% gc time)
24999.00112162811

julia> @time testDistances2(v1,v2)
  3.647563 seconds (5 allocations: 176 bytes)
24999.00112162811

What version of Julia are you running. Me 0.4.5.

> function myjaccard5(a::Array{Float64,1}, b::Array{Float64,1})
>     num = 0.
>     den = 0.
>     for I in 1:length(a)
>         @inbounds ai = a[I]
>         @inbounds bi = b[I]
>         abs_m = abs(ai-bi)
>         abs_p = abs(ai+bi)
>         num += abs_p - abs_m
>         den += abs_p + abs_m
>     end
>     1. - num/den
> end
>
>
> function testDistances5(a::Array{Float64,1}, b::Array{Float64,1})
>     for i in 1:5000
>         myjaccard5(a,b)
>     end
> end
>
> end
>
>
> julia> @time testDistances5(v1,v2)
>   0.166979 seconds (4 allocations: 160 bytes)
>
>
>
> We see that using abs is faster.
>
> I do not do a pull request beccause
>
> I would expect a good implementation to be 2 or 3 times slower than
> Euclidean, and I have not
> that yet.
>
> Le lundi 13 juin 2016 13:43:00 UTC+2, Kristoffer Carlsson a écrit :
>>
>> It seems weird to me that you guys want to call Jaccard distance with
>> float arrays. AFAIK Jaccard distance measures the distance between two
>> distinct samples from a pair of sets so basically between two Vector{Bool},
>> see:
>> http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jaccard.html
>>
>> "Computes the Jaccard-Needham dissimilarity between two boolean 1-D
>> arrays."
>>
>> Is there some more general formulation of it that extends to vectors in a
>> continuous vector space?
>>
>> And, to note, Jaccard is type stable for inputs of Vector{Bool} in
>> Distances.jl.
>>
>> On Monday, June 13, 2016 at 3:53:14 AM UTC+2, jean-pierre both wrote:
>>>
>>>
>>>
>>> I encountered in my application with Distances.Jaccard compared with
>>> Distances.Euclidean
>>> It was very slow.
>>>
>>> For example with 2 vecteurs Float64 of size 11520
>>>
>>> I get the following
>>> julia> D=Euclidean()
>>> Distances.Euclidean()
>>> julia> @time for i in 1:500
>>>        evaluate(D,v1,v2)
>>>        end
>>>   0.002553 seconds (500 allocations: 7.813 KB)
>>>
>>> and with Jaccard
>>>
>>> julia> D=Jaccard()
>>> Distances.Jaccard()
>>> @time for i in 1:500
>>>               evaluate(D,v1,v2)
>>>               end
>>>   1.995046 seconds (40.32 M allocations: 703.156 MB, 9.68% gc time)
>>>
>>> With a simple loop for computing jaccard :
>>>
>>>
>>> function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1})
>>>            num = 0
>>>            den = 0
>>>            for i in 1:length(a)
>>>                    num = num + min(a[i],b[i])
>>>                    den = den + max(a[i],b[i])
>>>            end
>>>                1. - num/den
>>>        end
>>> myjaccard2 (generic function with 1 method)
>>>
>>> julia> @time for i in 1:500
>>>               myjaccard2(v1,v2)
>>>               end
>>>   0.451582 seconds (23.04 M allocations: 351.592 MB, 20.04% gc time)
>>>
>>> I do not see the problem in jaccard distance implementation in the
>>> Distances packages
>>>
>>
Reply | Threaded
Open this post in threaded view
|

Re: Packages Distances problem with Distances.Jaccard : very slow

Kristoffer Carlsson
In reply to this post by jean-pierre both
Please try https://github.com/JuliaStats/Distances.jl/pull/44

Using:

function testDistances5(a::Array{Float64,1}, b::Array{Float64,1})
    @time for i in 1:5000
        myjaccard5(a,b)
    end
    d = Jaccard()
    @time for i in 1:5000
        evaluate(d, a,b)
    end
end


I get:
julia> testDistances5(rand(100), rand(100))
  0.000849 seconds
  0.001670 seconds




On Monday, June 13, 2016 at 7:57:20 PM UTC+2, jean-pierre both wrote:
It makes perfect sense to use Jaccard distances for float values
Cf  for example <a href="http://www.ncbi.nlm.nih.gov/pubmed/16794951" target="_blank" rel="nofollow" onmousedown="this.href=&#39;http://www.google.com/url?q\x3dhttp%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpubmed%2F16794951\x26sa\x3dD\x26sntz\x3d1\x26usg\x3dAFQjCNFo8pREB6kaCvdxBYiJkzrV-Xwusw&#39;;return true;" onclick="this.href=&#39;http://www.google.com/url?q\x3dhttp%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpubmed%2F16794951\x26sa\x3dD\x26sntz\x3d1\x26usg\x3dAFQjCNFo8pREB6kaCvdxBYiJkzrV-Xwusw&#39;;return true;">http://www.ncbi.nlm.nih.gov/pubmed/16794951

Nevertheless the problem is just an implementation, the time spent should be
comparable with the one with Euclidean.

The problem I mention is that the nice  implementation used in
 packages Distances is a problem for this distance as a simple loop is really faster.
I presume there is an optimization issue as the difference in time with Euclidean is many orders of magnitude
larger than what can be expected from the complexity.


The funny thing is that min and max seems also part of the problem as can be seen in the following:


function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1})
    num = 0.
    den = 0.
    for I in 1:length(a)
        @inbounds ai = a[I]
        @inbounds bi = b[I]
        num = num + min(ai,bi)
        den = den + max(ai,bi)     
    end
    1. - num/den
end



function testDistances2(v1::Array{Float64,1}, v2::Array{Float64,1})
    for i in 1:50000
        myjaccard2(v1,v2)
    end
end

@time testDistances2(v1,v2)
machine   3.217329 seconds (200.01 M allocations: 2.981 GB, 19.91% gc time)



function myjaccard5(a::Array{Float64,1}, b::Array{Float64,1})
    num = 0.
    den = 0.
    for I in 1:length(a)
        @inbounds ai = a[I]
        @inbounds bi = b[I]
        abs_m = abs(ai-bi)
        abs_p = abs(ai+bi)
        num += abs_p - abs_m
        den += abs_p + abs_m  
    end
    1. - num/den
end


function testDistances5(a::Array{Float64,1}, b::Array{Float64,1})
    for i in 1:5000
        myjaccard5(a,b)
    end
end

end


julia> @time testDistances5(v1,v2)
  0.166979 seconds (4 allocations: 160 bytes)



We see that using abs is faster.

I do not do a pull request beccause

I would expect a good implementation to be 2 or 3 times slower than Euclidean, and I have not
that yet.

Le lundi 13 juin 2016 13:43:00 UTC+2, Kristoffer Carlsson a écrit :
It seems weird to me that you guys want to call Jaccard distance with float arrays. AFAIK Jaccard distance measures the distance between two distinct samples from a pair of sets so basically between two Vector{Bool}, see: <a href="http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jaccard.html" rel="nofollow" target="_blank" onmousedown="this.href=&#39;http://www.google.com/url?q\x3dhttp%3A%2F%2Fdocs.scipy.org%2Fdoc%2Fscipy%2Freference%2Fgenerated%2Fscipy.spatial.distance.jaccard.html\x26sa\x3dD\x26sntz\x3d1\x26usg\x3dAFQjCNHKSm1FwZh9aWxFRTjXPthhi3I78w&#39;;return true;" onclick="this.href=&#39;http://www.google.com/url?q\x3dhttp%3A%2F%2Fdocs.scipy.org%2Fdoc%2Fscipy%2Freference%2Fgenerated%2Fscipy.spatial.distance.jaccard.html\x26sa\x3dD\x26sntz\x3d1\x26usg\x3dAFQjCNHKSm1FwZh9aWxFRTjXPthhi3I78w&#39;;return true;">http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jaccard.html

"Computes the Jaccard-Needham dissimilarity between two boolean 1-D arrays."

Is there some more general formulation of it that extends to vectors in a continuous vector space?

And, to note, Jaccard is type stable for inputs of Vector{Bool} in Distances.jl.

On Monday, June 13, 2016 at 3:53:14 AM UTC+2, jean-pierre both wrote:


I encountered in my application with Distances.Jaccard compared with Distances.Euclidean
It was very slow.

For example with 2 vecteurs Float64 of size 11520

I get the following
julia> D=Euclidean()
Distances.Euclidean()
julia> @time for i in 1:500
       evaluate(D,v1,v2)
       end
  0.002553 seconds (500 allocations: 7.813 KB)

and with Jaccard

julia> D=Jaccard()
Distances.Jaccard()
@time for i in 1:500
              evaluate(D,v1,v2)
              end
  1.995046 seconds (40.32 M allocations: 703.156 MB, 9.68% gc time)

With a simple loop for computing jaccard :


function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1})
           num = 0
           den = 0
           for i in 1:length(a)
                   num = num + min(a[i],b[i])
                   den = den + max(a[i],b[i])     
           end
               1. - num/den
       end
myjaccard2 (generic function with 1 method)

julia> @time for i in 1:500
              myjaccard2(v1,v2)
              end
  0.451582 seconds (23.04 M allocations: 351.592 MB, 20.04% gc time)

I do not see the problem in jaccard distance implementation in the Distances packages
Reply | Threaded
Open this post in threaded view
|

Re: Packages Distances problem with Distances.Jaccard : very slow

Kristoffer Carlsson
In reply to this post by Mauro
Please try https://github.com/JuliaStats/Distances.jl/pull/44

On Monday, June 13, 2016 at 8:14:01 PM UTC+2, Mauro wrote:
> function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1})

>     num = 0.
>     den = 0.
>     for I in 1:length(a)
>         @inbounds ai = a[I]
>         @inbounds bi = b[I]
>         num = num + min(ai,bi)
>         den = den + max(ai,bi)
>     end
>     1. - num/den
> end
>
>
>
> function testDistances2(v1::Array{Float64,1}, v2::Array{Float64,1})
>     for i in 1:50000
>         myjaccard2(v1,v2)
>     end
> end

I recommend using the values returned for something, otherwise the
compiler sometimes eliminates the loop (but not here):

julia> function testDistances2(v1::Array{Float64,1}, v2::Array{Float64,1})
           out = 0.0
           for i in 1:50000
               out += myjaccard2(v1,v2)
           end
           out
       end

> @time testDistances2(v1,v2)
> machine   3.217329 seconds (200.01 M allocations: 2.981 GB, 19.91% gc time)

I cannot reproduce this, when I run it I get no allocations:

julia> v2 = rand(10^4);

# warm-up
julia> @time testDistances2(v1,v2)
  3.604478 seconds (8.15 k allocations: 401.797 KB, 0.42% gc time)
24999.00112162811

julia> @time testDistances2(v1,v2)
  3.647563 seconds (5 allocations: 176 bytes)
24999.00112162811

What version of Julia are you running. Me 0.4.5.

> function myjaccard5(a::Array{Float64,1}, b::Array{Float64,1})
>     num = 0.
>     den = 0.
>     for I in 1:length(a)
>         @inbounds ai = a[I]
>         @inbounds bi = b[I]
>         abs_m = abs(ai-bi)
>         abs_p = abs(ai+bi)
>         num += abs_p - abs_m
>         den += abs_p + abs_m
>     end
>     1. - num/den
> end
>
>
> function testDistances5(a::Array{Float64,1}, b::Array{Float64,1})
>     for i in 1:5000
>         myjaccard5(a,b)
>     end
> end
>
> end
>
>
> julia> @time testDistances5(v1,v2)
>   0.166979 seconds (4 allocations: 160 bytes)
>
>
>
> We see that using abs is faster.
>
> I do not do a pull request beccause
>
> I would expect a good implementation to be 2 or 3 times slower than
> Euclidean, and I have not
> that yet.
>
> Le lundi 13 juin 2016 13:43:00 UTC+2, Kristoffer Carlsson a écrit :
>>
>> It seems weird to me that you guys want to call Jaccard distance with
>> float arrays. AFAIK Jaccard distance measures the distance between two
>> distinct samples from a pair of sets so basically between two Vector{Bool},
>> see:
>> <a href="http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jaccard.html" target="_blank" rel="nofollow" onmousedown="this.href=&#39;http://www.google.com/url?q\x3dhttp%3A%2F%2Fdocs.scipy.org%2Fdoc%2Fscipy%2Freference%2Fgenerated%2Fscipy.spatial.distance.jaccard.html\x26sa\x3dD\x26sntz\x3d1\x26usg\x3dAFQjCNHKSm1FwZh9aWxFRTjXPthhi3I78w&#39;;return true;" onclick="this.href=&#39;http://www.google.com/url?q\x3dhttp%3A%2F%2Fdocs.scipy.org%2Fdoc%2Fscipy%2Freference%2Fgenerated%2Fscipy.spatial.distance.jaccard.html\x26sa\x3dD\x26sntz\x3d1\x26usg\x3dAFQjCNHKSm1FwZh9aWxFRTjXPthhi3I78w&#39;;return true;">http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jaccard.html
>>
>> "Computes the Jaccard-Needham dissimilarity between two boolean 1-D
>> arrays."
>>
>> Is there some more general formulation of it that extends to vectors in a
>> continuous vector space?
>>
>> And, to note, Jaccard is type stable for inputs of Vector{Bool} in
>> Distances.jl.
>>
>> On Monday, June 13, 2016 at 3:53:14 AM UTC+2, jean-pierre both wrote:
>>>
>>>
>>>
>>> I encountered in my application with Distances.Jaccard compared with
>>> Distances.Euclidean
>>> It was very slow.
>>>
>>> For example with 2 vecteurs Float64 of size 11520
>>>
>>> I get the following
>>> julia> D=Euclidean()
>>> Distances.Euclidean()
>>> julia> @time for i in 1:500
>>>        evaluate(D,v1,v2)
>>>        end
>>>   0.002553 seconds (500 allocations: 7.813 KB)
>>>
>>> and with Jaccard
>>>
>>> julia> D=Jaccard()
>>> Distances.Jaccard()
>>> @time for i in 1:500
>>>               evaluate(D,v1,v2)
>>>               end
>>>   1.995046 seconds (40.32 M allocations: 703.156 MB, 9.68% gc time)
>>>
>>> With a simple loop for computing jaccard :
>>>
>>>
>>> function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1})
>>>            num = 0
>>>            den = 0
>>>            for i in 1:length(a)
>>>                    num = num + min(a[i],b[i])
>>>                    den = den + max(a[i],b[i])
>>>            end
>>>                1. - num/den
>>>        end
>>> myjaccard2 (generic function with 1 method)
>>>
>>> julia> @time for i in 1:500
>>>               myjaccard2(v1,v2)
>>>               end
>>>   0.451582 seconds (23.04 M allocations: 351.592 MB, 20.04% gc time)
>>>
>>> I do not see the problem in jaccard distance implementation in the
>>> Distances packages
>>>
>>
Reply | Threaded
Open this post in threaded view
|

Re: Packages Distances problem with Distances.Jaccard : very slow

jean-pierre both
Hi,

The fix is really great, thank you for the analysis and the fix.
Thanks you

Le lundi 13 juin 2016 20:19:26 UTC+2, Kristoffer Carlsson a écrit :
Please try <a href="https://github.com/JuliaStats/Distances.jl/pull/44" rel="nofollow" target="_blank" onmousedown="this.href=&#39;https://www.google.com/url?q\x3dhttps%3A%2F%2Fgithub.com%2FJuliaStats%2FDistances.jl%2Fpull%2F44\x26sa\x3dD\x26sntz\x3d1\x26usg\x3dAFQjCNFIQEMoUDAHswOwfoA8s8WWFYdySg&#39;;return true;" onclick="this.href=&#39;https://www.google.com/url?q\x3dhttps%3A%2F%2Fgithub.com%2FJuliaStats%2FDistances.jl%2Fpull%2F44\x26sa\x3dD\x26sntz\x3d1\x26usg\x3dAFQjCNFIQEMoUDAHswOwfoA8s8WWFYdySg&#39;;return true;">https://github.com/JuliaStats/Distances.jl/pull/44

On Monday, June 13, 2016 at 8:14:01 PM UTC+2, Mauro wrote:
> function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1})

>     num = 0.
>     den = 0.
>     for I in 1:length(a)
>         @inbounds ai = a[I]
>         @inbounds bi = b[I]
>         num = num + min(ai,bi)
>         den = den + max(ai,bi)
>     end
>     1. - num/den
> end
>
>
>
> function testDistances2(v1::Array{Float64,1}, v2::Array{Float64,1})
>     for i in 1:50000
>         myjaccard2(v1,v2)
>     end
> end

I recommend using the values returned for something, otherwise the
compiler sometimes eliminates the loop (but not here):

julia> function testDistances2(v1::Array{Float64,1}, v2::Array{Float64,1})
           out = 0.0
           for i in 1:50000
               out += myjaccard2(v1,v2)
           end
           out
       end

> @time testDistances2(v1,v2)
> machine   3.217329 seconds (200.01 M allocations: 2.981 GB, 19.91% gc time)

I cannot reproduce this, when I run it I get no allocations:

julia> v2 = rand(10^4);

# warm-up
julia> @time testDistances2(v1,v2)
  3.604478 seconds (8.15 k allocations: 401.797 KB, 0.42% gc time)
24999.00112162811

julia> @time testDistances2(v1,v2)
  3.647563 seconds (5 allocations: 176 bytes)
24999.00112162811

What version of Julia are you running. Me 0.4.5.

> function myjaccard5(a::Array{Float64,1}, b::Array{Float64,1})
>     num = 0.
>     den = 0.
>     for I in 1:length(a)
>         @inbounds ai = a[I]
>         @inbounds bi = b[I]
>         abs_m = abs(ai-bi)
>         abs_p = abs(ai+bi)
>         num += abs_p - abs_m
>         den += abs_p + abs_m
>     end
>     1. - num/den
> end
>
>
> function testDistances5(a::Array{Float64,1}, b::Array{Float64,1})
>     for i in 1:5000
>         myjaccard5(a,b)
>     end
> end
>
> end
>
>
> julia> @time testDistances5(v1,v2)
>   0.166979 seconds (4 allocations: 160 bytes)
>
>
>
> We see that using abs is faster.
>
> I do not do a pull request beccause
>
> I would expect a good implementation to be 2 or 3 times slower than
> Euclidean, and I have not
> that yet.
>
> Le lundi 13 juin 2016 13:43:00 UTC+2, Kristoffer Carlsson a écrit :
>>
>> It seems weird to me that you guys want to call Jaccard distance with
>> float arrays. AFAIK Jaccard distance measures the distance between two
>> distinct samples from a pair of sets so basically between two Vector{Bool},
>> see:
>> <a href="http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jaccard.html" rel="nofollow" target="_blank" onmousedown="this.href=&#39;http://www.google.com/url?q\x3dhttp%3A%2F%2Fdocs.scipy.org%2Fdoc%2Fscipy%2Freference%2Fgenerated%2Fscipy.spatial.distance.jaccard.html\x26sa\x3dD\x26sntz\x3d1\x26usg\x3dAFQjCNHKSm1FwZh9aWxFRTjXPthhi3I78w&#39;;return true;" onclick="this.href=&#39;http://www.google.com/url?q\x3dhttp%3A%2F%2Fdocs.scipy.org%2Fdoc%2Fscipy%2Freference%2Fgenerated%2Fscipy.spatial.distance.jaccard.html\x26sa\x3dD\x26sntz\x3d1\x26usg\x3dAFQjCNHKSm1FwZh9aWxFRTjXPthhi3I78w&#39;;return true;">http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jaccard.html
>>
>> "Computes the Jaccard-Needham dissimilarity between two boolean 1-D
>> arrays."
>>
>> Is there some more general formulation of it that extends to vectors in a
>> continuous vector space?
>>
>> And, to note, Jaccard is type stable for inputs of Vector{Bool} in
>> Distances.jl.
>>
>> On Monday, June 13, 2016 at 3:53:14 AM UTC+2, jean-pierre both wrote:
>>>
>>>
>>>
>>> I encountered in my application with Distances.Jaccard compared with
>>> Distances.Euclidean
>>> It was very slow.
>>>
>>> For example with 2 vecteurs Float64 of size 11520
>>>
>>> I get the following
>>> julia> D=Euclidean()
>>> Distances.Euclidean()
>>> julia> @time for i in 1:500
>>>        evaluate(D,v1,v2)
>>>        end
>>>   0.002553 seconds (500 allocations: 7.813 KB)
>>>
>>> and with Jaccard
>>>
>>> julia> D=Jaccard()
>>> Distances.Jaccard()
>>> @time for i in 1:500
>>>               evaluate(D,v1,v2)
>>>               end
>>>   1.995046 seconds (40.32 M allocations: 703.156 MB, 9.68% gc time)
>>>
>>> With a simple loop for computing jaccard :
>>>
>>>
>>> function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1})
>>>            num = 0
>>>            den = 0
>>>            for i in 1:length(a)
>>>                    num = num + min(a[i],b[i])
>>>                    den = den + max(a[i],b[i])
>>>            end
>>>                1. - num/den
>>>        end
>>> myjaccard2 (generic function with 1 method)
>>>
>>> julia> @time for i in 1:500
>>>               myjaccard2(v1,v2)
>>>               end
>>>   0.451582 seconds (23.04 M allocations: 351.592 MB, 20.04% gc time)
>>>
>>> I do not see the problem in jaccard distance implementation in the
>>> Distances packages
>>>
>>
Reply | Threaded
Open this post in threaded view
|

Re: Packages Distances problem with Distances.Jaccard : very slow

Kristoffer Carlsson
Ok, I merged it. Enjoy.

On Tuesday, June 14, 2016 at 8:53:26 PM UTC+2, jean-pierre both wrote:
Hi,

The fix is really great, thank you for the analysis and the fix.
Thanks you

Le lundi 13 juin 2016 20:19:26 UTC+2, Kristoffer Carlsson a écrit :
Please try <a href="https://github.com/JuliaStats/Distances.jl/pull/44" rel="nofollow" target="_blank" onmousedown="this.href=&#39;https://www.google.com/url?q\x3dhttps%3A%2F%2Fgithub.com%2FJuliaStats%2FDistances.jl%2Fpull%2F44\x26sa\x3dD\x26sntz\x3d1\x26usg\x3dAFQjCNFIQEMoUDAHswOwfoA8s8WWFYdySg&#39;;return true;" onclick="this.href=&#39;https://www.google.com/url?q\x3dhttps%3A%2F%2Fgithub.com%2FJuliaStats%2FDistances.jl%2Fpull%2F44\x26sa\x3dD\x26sntz\x3d1\x26usg\x3dAFQjCNFIQEMoUDAHswOwfoA8s8WWFYdySg&#39;;return true;">https://github.com/JuliaStats/Distances.jl/pull/44

On Monday, June 13, 2016 at 8:14:01 PM UTC+2, Mauro wrote:
> function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1})

>     num = 0.
>     den = 0.
>     for I in 1:length(a)
>         @inbounds ai = a[I]
>         @inbounds bi = b[I]
>         num = num + min(ai,bi)
>         den = den + max(ai,bi)
>     end
>     1. - num/den
> end
>
>
>
> function testDistances2(v1::Array{Float64,1}, v2::Array{Float64,1})
>     for i in 1:50000
>         myjaccard2(v1,v2)
>     end
> end

I recommend using the values returned for something, otherwise the
compiler sometimes eliminates the loop (but not here):

julia> function testDistances2(v1::Array{Float64,1}, v2::Array{Float64,1})
           out = 0.0
           for i in 1:50000
               out += myjaccard2(v1,v2)
           end
           out
       end

> @time testDistances2(v1,v2)
> machine   3.217329 seconds (200.01 M allocations: 2.981 GB, 19.91% gc time)

I cannot reproduce this, when I run it I get no allocations:

julia> v2 = rand(10^4);

# warm-up
julia> @time testDistances2(v1,v2)
  3.604478 seconds (8.15 k allocations: 401.797 KB, 0.42% gc time)
24999.00112162811

julia> @time testDistances2(v1,v2)
  3.647563 seconds (5 allocations: 176 bytes)
24999.00112162811

What version of Julia are you running. Me 0.4.5.

> function myjaccard5(a::Array{Float64,1}, b::Array{Float64,1})
>     num = 0.
>     den = 0.
>     for I in 1:length(a)
>         @inbounds ai = a[I]
>         @inbounds bi = b[I]
>         abs_m = abs(ai-bi)
>         abs_p = abs(ai+bi)
>         num += abs_p - abs_m
>         den += abs_p + abs_m
>     end
>     1. - num/den
> end
>
>
> function testDistances5(a::Array{Float64,1}, b::Array{Float64,1})
>     for i in 1:5000
>         myjaccard5(a,b)
>     end
> end
>
> end
>
>
> julia> @time testDistances5(v1,v2)
>   0.166979 seconds (4 allocations: 160 bytes)
>
>
>
> We see that using abs is faster.
>
> I do not do a pull request beccause
>
> I would expect a good implementation to be 2 or 3 times slower than
> Euclidean, and I have not
> that yet.
>
> Le lundi 13 juin 2016 13:43:00 UTC+2, Kristoffer Carlsson a écrit :
>>
>> It seems weird to me that you guys want to call Jaccard distance with
>> float arrays. AFAIK Jaccard distance measures the distance between two
>> distinct samples from a pair of sets so basically between two Vector{Bool},
>> see:
>> <a href="http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jaccard.html" rel="nofollow" target="_blank" onmousedown="this.href=&#39;http://www.google.com/url?q\x3dhttp%3A%2F%2Fdocs.scipy.org%2Fdoc%2Fscipy%2Freference%2Fgenerated%2Fscipy.spatial.distance.jaccard.html\x26sa\x3dD\x26sntz\x3d1\x26usg\x3dAFQjCNHKSm1FwZh9aWxFRTjXPthhi3I78w&#39;;return true;" onclick="this.href=&#39;http://www.google.com/url?q\x3dhttp%3A%2F%2Fdocs.scipy.org%2Fdoc%2Fscipy%2Freference%2Fgenerated%2Fscipy.spatial.distance.jaccard.html\x26sa\x3dD\x26sntz\x3d1\x26usg\x3dAFQjCNHKSm1FwZh9aWxFRTjXPthhi3I78w&#39;;return true;">http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jaccard.html
>>
>> "Computes the Jaccard-Needham dissimilarity between two boolean 1-D
>> arrays."
>>
>> Is there some more general formulation of it that extends to vectors in a
>> continuous vector space?
>>
>> And, to note, Jaccard is type stable for inputs of Vector{Bool} in
>> Distances.jl.
>>
>> On Monday, June 13, 2016 at 3:53:14 AM UTC+2, jean-pierre both wrote:
>>>
>>>
>>>
>>> I encountered in my application with Distances.Jaccard compared with
>>> Distances.Euclidean
>>> It was very slow.
>>>
>>> For example with 2 vecteurs Float64 of size 11520
>>>
>>> I get the following
>>> julia> D=Euclidean()
>>> Distances.Euclidean()
>>> julia> @time for i in 1:500
>>>        evaluate(D,v1,v2)
>>>        end
>>>   0.002553 seconds (500 allocations: 7.813 KB)
>>>
>>> and with Jaccard
>>>
>>> julia> D=Jaccard()
>>> Distances.Jaccard()
>>> @time for i in 1:500
>>>               evaluate(D,v1,v2)
>>>               end
>>>   1.995046 seconds (40.32 M allocations: 703.156 MB, 9.68% gc time)
>>>
>>> With a simple loop for computing jaccard :
>>>
>>>
>>> function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1})
>>>            num = 0
>>>            den = 0
>>>            for i in 1:length(a)
>>>                    num = num + min(a[i],b[i])
>>>                    den = den + max(a[i],b[i])
>>>            end
>>>                1. - num/den
>>>        end
>>> myjaccard2 (generic function with 1 method)
>>>
>>> julia> @time for i in 1:500
>>>               myjaccard2(v1,v2)
>>>               end
>>>   0.451582 seconds (23.04 M allocations: 351.592 MB, 20.04% gc time)
>>>
>>> I do not see the problem in jaccard distance implementation in the
>>> Distances packages
>>>
>>