# Packages Distances problem with Distances.Jaccard : very slow Classic List Threaded 10 messages Open this post in threaded view
|

## Packages Distances problem with Distances.Jaccard : very slow

 I encountered in my application with Distances.Jaccard compared with Distances.EuclideanIt was very slow.For example with 2 vecteurs Float64 of size 11520I get the following julia> D=Euclidean()Distances.Euclidean()julia> @time for i in 1:500       evaluate(D,v1,v2)       end  0.002553 seconds (500 allocations: 7.813 KB)and with Jaccardjulia> D=Jaccard()Distances.Jaccard()@time for i in 1:500              evaluate(D,v1,v2)              end  1.995046 seconds (40.32 M allocations: 703.156 MB, 9.68% gc time)With a simple loop for computing jaccard :function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1})           num = 0           den = 0           for i in 1:length(a)                   num = num + min(a[i],b[i])                   den = den + max(a[i],b[i])                 end               1. - num/den       endmyjaccard2 (generic function with 1 method)julia> @time for i in 1:500              myjaccard2(v1,v2)              end  0.451582 seconds (23.04 M allocations: 351.592 MB, 20.04% gc time)I do not see the problem in jaccard distance implementation in the Distances packages
Open this post in threaded view
|

## Re: Packages Distances problem with Distances.Jaccard : very slow

 I am not sure I quite understand.  But it looks like you know how to improve the Jaccard distance code, so why not make a pull request at Distances.jl? > function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1}) >            num = 0 >            den = 0 >            for i in 1:length(a) >                    num = num + min(a[i],b[i]) >                    den = den + max(a[i],b[i]) >            end >                1. - num/den >        end Initialising num and den as Int gives you a type instability.  This should work better (plus a few cosmetic changes): function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1})     num = 0.0     den = 0.0     for i in 1:length(a)         num += min(a[i],b[i])         den += max(a[i],b[i])     end     return 1. - num/den end
Open this post in threaded view
|

## Re: Packages Distances problem with Distances.Jaccard : very slow

 In reply to this post by jean-pierre both It also seems unnecessary to restrict it to Float64 and Array:function myjaccard2{T}(A::AbstractArray{T,1}, B::AbstractArray{T, 1})    num = zero(T)    den = zero(T)     for (a,b) in zip(A,B)         num += min(a,b)         den += max(a,b)     end     return 1.0 - num/den endMaybe also use T1, T2 and then promote to a common type ;)Best,SimonAm Montag, 13. Juni 2016 03:53:14 UTC+2 schrieb jean-pierre both:I encountered in my application with Distances.Jaccard compared with Distances.EuclideanIt was very slow.For example with 2 vecteurs Float64 of size 11520I get the following julia> D=Euclidean()Distances.Euclidean()julia> @time for i in 1:500       evaluate(D,v1,v2)       end  0.002553 seconds (500 allocations: 7.813 KB)and with Jaccardjulia> D=Jaccard()Distances.Jaccard()@time for i in 1:500              evaluate(D,v1,v2)              end  1.995046 seconds (40.32 M allocations: 703.156 MB, 9.68% gc time)With a simple loop for computing jaccard :function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1})           num = 0           den = 0           for i in 1:length(a)                   num = num + min(a[i],b[i])                   den = den + max(a[i],b[i])                 end               1. - num/den       endmyjaccard2 (generic function with 1 method)julia> @time for i in 1:500              myjaccard2(v1,v2)              end  0.451582 seconds (23.04 M allocations: 351.592 MB, 20.04% gc time)I do not see the problem in jaccard distance implementation in the Distances packages
Open this post in threaded view
|

## Re: Packages Distances problem with Distances.Jaccard : very slow

 In reply to this post by jean-pierre both It seems weird to me that you guys want to call Jaccard distance with float arrays. AFAIK Jaccard distance measures the distance between two distinct samples from a pair of sets so basically between two Vector{Bool}, see: http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jaccard.html"Computes the Jaccard-Needham dissimilarity between two boolean 1-D arrays."Is there some more general formulation of it that extends to vectors in a continuous vector space?And, to note, Jaccard is type stable for inputs of Vector{Bool} in Distances.jl.On Monday, June 13, 2016 at 3:53:14 AM UTC+2, jean-pierre both wrote:I encountered in my application with Distances.Jaccard compared with Distances.EuclideanIt was very slow.For example with 2 vecteurs Float64 of size 11520I get the following julia> D=Euclidean()Distances.Euclidean()julia> @time for i in 1:500       evaluate(D,v1,v2)       end  0.002553 seconds (500 allocations: 7.813 KB)and with Jaccardjulia> D=Jaccard()Distances.Jaccard()@time for i in 1:500              evaluate(D,v1,v2)              end  1.995046 seconds (40.32 M allocations: 703.156 MB, 9.68% gc time)With a simple loop for computing jaccard :function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1})           num = 0           den = 0           for i in 1:length(a)                   num = num + min(a[i],b[i])                   den = den + max(a[i],b[i])                 end               1. - num/den       endmyjaccard2 (generic function with 1 method)julia> @time for i in 1:500              myjaccard2(v1,v2)              end  0.451582 seconds (23.04 M allocations: 351.592 MB, 20.04% gc time)I do not see the problem in jaccard distance implementation in the Distances packages
Open this post in threaded view
|

## Re: Packages Distances problem with Distances.Jaccard : very slow

 It makes perfect sense to use Jaccard distances for float valuesCf  for example http://www.ncbi.nlm.nih.gov/pubmed/16794951Nevertheless the problem is just an implementation, the time spent should becomparable with the one with Euclidean.The problem I mention is that the nice  implementation used in packages Distances is a problem for this distance as a simple loop is really faster.I presume there is an optimization issue as the difference in time with Euclidean is many orders of magnitude larger than what can be expected from the complexity.The funny thing is that min and max seems also part of the problem as can be seen in the following:function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1})    num = 0.    den = 0.    for I in 1:length(a)        @inbounds ai = a[I]        @inbounds bi = b[I]        num = num + min(ai,bi)        den = den + max(ai,bi)          end    1. - num/denendfunction testDistances2(v1::Array{Float64,1}, v2::Array{Float64,1})    for i in 1:50000        myjaccard2(v1,v2)    endend@time testDistances2(v1,v2)machine   3.217329 seconds (200.01 M allocations: 2.981 GB, 19.91% gc time)function myjaccard5(a::Array{Float64,1}, b::Array{Float64,1})    num = 0.    den = 0.    for I in 1:length(a)        @inbounds ai = a[I]        @inbounds bi = b[I]        abs_m = abs(ai-bi)        abs_p = abs(ai+bi)        num += abs_p - abs_m        den += abs_p + abs_m       end    1. - num/denendfunction testDistances5(a::Array{Float64,1}, b::Array{Float64,1})    for i in 1:5000        myjaccard5(a,b)    endendendjulia> @time testDistances5(v1,v2)  0.166979 seconds (4 allocations: 160 bytes)We see that using abs is faster.I do not do a pull request beccauseI would expect a good implementation to be 2 or 3 times slower than Euclidean, and I have not that yet.Le lundi 13 juin 2016 13:43:00 UTC+2, Kristoffer Carlsson a écrit :It seems weird to me that you guys want to call Jaccard distance with float arrays. AFAIK Jaccard distance measures the distance between two distinct samples from a pair of sets so basically between two Vector{Bool}, see: http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jaccard.html"Computes the Jaccard-Needham dissimilarity between two boolean 1-D arrays."Is there some more general formulation of it that extends to vectors in a continuous vector space?And, to note, Jaccard is type stable for inputs of Vector{Bool} in Distances.jl.On Monday, June 13, 2016 at 3:53:14 AM UTC+2, jean-pierre both wrote:I encountered in my application with Distances.Jaccard compared with Distances.EuclideanIt was very slow.For example with 2 vecteurs Float64 of size 11520I get the following julia> D=Euclidean()Distances.Euclidean()julia> @time for i in 1:500       evaluate(D,v1,v2)       end  0.002553 seconds (500 allocations: 7.813 KB)and with Jaccardjulia> D=Jaccard()Distances.Jaccard()@time for i in 1:500              evaluate(D,v1,v2)              end  1.995046 seconds (40.32 M allocations: 703.156 MB, 9.68% gc time)With a simple loop for computing jaccard :function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1})           num = 0           den = 0           for i in 1:length(a)                   num = num + min(a[i],b[i])                   den = den + max(a[i],b[i])                 end               1. - num/den       endmyjaccard2 (generic function with 1 method)julia> @time for i in 1:500              myjaccard2(v1,v2)              end  0.451582 seconds (23.04 M allocations: 351.592 MB, 20.04% gc time)I do not see the problem in jaccard distance implementation in the Distances packages
Open this post in threaded view
|

## Re: Packages Distances problem with Distances.Jaccard : very slow

 > function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1}) >     num = 0. >     den = 0. >     for I in 1:length(a) >         @inbounds ai = a[I] >         @inbounds bi = b[I] >         num = num + min(ai,bi) >         den = den + max(ai,bi) >     end >     1. - num/den > end > > > > function testDistances2(v1::Array{Float64,1}, v2::Array{Float64,1}) >     for i in 1:50000 >         myjaccard2(v1,v2) >     end > end I recommend using the values returned for something, otherwise the compiler sometimes eliminates the loop (but not here): julia> function testDistances2(v1::Array{Float64,1}, v2::Array{Float64,1})            out = 0.0            for i in 1:50000                out += myjaccard2(v1,v2)            end            out        end > @time testDistances2(v1,v2) > machine   3.217329 seconds (200.01 M allocations: 2.981 GB, 19.91% gc time) I cannot reproduce this, when I run it I get no allocations: julia> v2 = rand(10^4); # warm-up julia> @time testDistances2(v1,v2)   3.604478 seconds (8.15 k allocations: 401.797 KB, 0.42% gc time) 24999.00112162811 julia> @time testDistances2(v1,v2)   3.647563 seconds (5 allocations: 176 bytes) 24999.00112162811 What version of Julia are you running. Me 0.4.5. > function myjaccard5(a::Array{Float64,1}, b::Array{Float64,1}) >     num = 0. >     den = 0. >     for I in 1:length(a) >         @inbounds ai = a[I] >         @inbounds bi = b[I] >         abs_m = abs(ai-bi) >         abs_p = abs(ai+bi) >         num += abs_p - abs_m >         den += abs_p + abs_m >     end >     1. - num/den > end > > > function testDistances5(a::Array{Float64,1}, b::Array{Float64,1}) >     for i in 1:5000 >         myjaccard5(a,b) >     end > end > > end > > > julia> @time testDistances5(v1,v2) >   0.166979 seconds (4 allocations: 160 bytes) > > > > We see that using abs is faster. > > I do not do a pull request beccause > > I would expect a good implementation to be 2 or 3 times slower than > Euclidean, and I have not > that yet. > > Le lundi 13 juin 2016 13:43:00 UTC+2, Kristoffer Carlsson a écrit : >> >> It seems weird to me that you guys want to call Jaccard distance with >> float arrays. AFAIK Jaccard distance measures the distance between two >> distinct samples from a pair of sets so basically between two Vector{Bool}, >> see: >> http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jaccard.html>> >> "Computes the Jaccard-Needham dissimilarity between two boolean 1-D >> arrays." >> >> Is there some more general formulation of it that extends to vectors in a >> continuous vector space? >> >> And, to note, Jaccard is type stable for inputs of Vector{Bool} in >> Distances.jl. >> >> On Monday, June 13, 2016 at 3:53:14 AM UTC+2, jean-pierre both wrote: >>> >>> >>> >>> I encountered in my application with Distances.Jaccard compared with >>> Distances.Euclidean >>> It was very slow. >>> >>> For example with 2 vecteurs Float64 of size 11520 >>> >>> I get the following >>> julia> D=Euclidean() >>> Distances.Euclidean() >>> julia> @time for i in 1:500 >>>        evaluate(D,v1,v2) >>>        end >>>   0.002553 seconds (500 allocations: 7.813 KB) >>> >>> and with Jaccard >>> >>> julia> D=Jaccard() >>> Distances.Jaccard() >>> @time for i in 1:500 >>>               evaluate(D,v1,v2) >>>               end >>>   1.995046 seconds (40.32 M allocations: 703.156 MB, 9.68% gc time) >>> >>> With a simple loop for computing jaccard : >>> >>> >>> function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1}) >>>            num = 0 >>>            den = 0 >>>            for i in 1:length(a) >>>                    num = num + min(a[i],b[i]) >>>                    den = den + max(a[i],b[i]) >>>            end >>>                1. - num/den >>>        end >>> myjaccard2 (generic function with 1 method) >>> >>> julia> @time for i in 1:500 >>>               myjaccard2(v1,v2) >>>               end >>>   0.451582 seconds (23.04 M allocations: 351.592 MB, 20.04% gc time) >>> >>> I do not see the problem in jaccard distance implementation in the >>> Distances packages >>> >>
Open this post in threaded view
|

## Re: Packages Distances problem with Distances.Jaccard : very slow

 In reply to this post by jean-pierre both Please try https://github.com/JuliaStats/Distances.jl/pull/44Using:`function testDistances5(a::Array{Float64,1}, b::Array{Float64,1})    @time for i in 1:5000        myjaccard5(a,b)    end    d = Jaccard()    @time for i in 1:5000        evaluate(d, a,b)    endend`I get:`julia> testDistances5(rand(100), rand(100))  0.000849 seconds  0.001670 seconds`On Monday, June 13, 2016 at 7:57:20 PM UTC+2, jean-pierre both wrote:It makes perfect sense to use Jaccard distances for float valuesCf  for example http://www.ncbi.nlm.nih.gov/pubmed/16794951Nevertheless the problem is just an implementation, the time spent should becomparable with the one with Euclidean.The problem I mention is that the nice  implementation used in packages Distances is a problem for this distance as a simple loop is really faster.I presume there is an optimization issue as the difference in time with Euclidean is many orders of magnitude larger than what can be expected from the complexity.The funny thing is that min and max seems also part of the problem as can be seen in the following:function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1})    num = 0.    den = 0.    for I in 1:length(a)        @inbounds ai = a[I]        @inbounds bi = b[I]        num = num + min(ai,bi)        den = den + max(ai,bi)          end    1. - num/denendfunction testDistances2(v1::Array{Float64,1}, v2::Array{Float64,1})    for i in 1:50000        myjaccard2(v1,v2)    endend@time testDistances2(v1,v2)machine   3.217329 seconds (200.01 M allocations: 2.981 GB, 19.91% gc time)function myjaccard5(a::Array{Float64,1}, b::Array{Float64,1})    num = 0.    den = 0.    for I in 1:length(a)        @inbounds ai = a[I]        @inbounds bi = b[I]        abs_m = abs(ai-bi)        abs_p = abs(ai+bi)        num += abs_p - abs_m        den += abs_p + abs_m       end    1. - num/denendfunction testDistances5(a::Array{Float64,1}, b::Array{Float64,1})    for i in 1:5000        myjaccard5(a,b)    endendendjulia> @time testDistances5(v1,v2)  0.166979 seconds (4 allocations: 160 bytes)We see that using abs is faster.I do not do a pull request beccauseI would expect a good implementation to be 2 or 3 times slower than Euclidean, and I have not that yet.Le lundi 13 juin 2016 13:43:00 UTC+2, Kristoffer Carlsson a écrit :It seems weird to me that you guys want to call Jaccard distance with float arrays. AFAIK Jaccard distance measures the distance between two distinct samples from a pair of sets so basically between two Vector{Bool}, see: http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jaccard.html"Computes the Jaccard-Needham dissimilarity between two boolean 1-D arrays."Is there some more general formulation of it that extends to vectors in a continuous vector space?And, to note, Jaccard is type stable for inputs of Vector{Bool} in Distances.jl.On Monday, June 13, 2016 at 3:53:14 AM UTC+2, jean-pierre both wrote:I encountered in my application with Distances.Jaccard compared with Distances.EuclideanIt was very slow.For example with 2 vecteurs Float64 of size 11520I get the following julia> D=Euclidean()Distances.Euclidean()julia> @time for i in 1:500       evaluate(D,v1,v2)       end  0.002553 seconds (500 allocations: 7.813 KB)and with Jaccardjulia> D=Jaccard()Distances.Jaccard()@time for i in 1:500              evaluate(D,v1,v2)              end  1.995046 seconds (40.32 M allocations: 703.156 MB, 9.68% gc time)With a simple loop for computing jaccard :function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1})           num = 0           den = 0           for i in 1:length(a)                   num = num + min(a[i],b[i])                   den = den + max(a[i],b[i])                 end               1. - num/den       endmyjaccard2 (generic function with 1 method)julia> @time for i in 1:500              myjaccard2(v1,v2)              end  0.451582 seconds (23.04 M allocations: 351.592 MB, 20.04% gc time)I do not see the problem in jaccard distance implementation in the Distances packages
Open this post in threaded view
|

## Re: Packages Distances problem with Distances.Jaccard : very slow

 In reply to this post by Mauro Please try https://github.com/JuliaStats/Distances.jl/pull/44On Monday, June 13, 2016 at 8:14:01 PM UTC+2, Mauro wrote:> function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1}) >     num = 0. >     den = 0. >     for I in 1:length(a) >         @inbounds ai = a[I] >         @inbounds bi = b[I] >         num = num + min(ai,bi) >         den = den + max(ai,bi) >     end >     1. - num/den > end > > > > function testDistances2(v1::Array{Float64,1}, v2::Array{Float64,1}) >     for i in 1:50000 >         myjaccard2(v1,v2) >     end > end I recommend using the values returned for something, otherwise the compiler sometimes eliminates the loop (but not here): julia> function testDistances2(v1::Array{Float64,1}, v2::Array{Float64,1})            out = 0.0            for i in 1:50000                out += myjaccard2(v1,v2)            end            out        end > @time testDistances2(v1,v2) > machine   3.217329 seconds (200.01 M allocations: 2.981 GB, 19.91% gc time) I cannot reproduce this, when I run it I get no allocations: julia> v2 = rand(10^4); # warm-up julia> @time testDistances2(v1,v2)   3.604478 seconds (8.15 k allocations: 401.797 KB, 0.42% gc time) 24999.00112162811 julia> @time testDistances2(v1,v2)   3.647563 seconds (5 allocations: 176 bytes) 24999.00112162811 What version of Julia are you running. Me 0.4.5. > function myjaccard5(a::Array{Float64,1}, b::Array{Float64,1}) >     num = 0. >     den = 0. >     for I in 1:length(a) >         @inbounds ai = a[I] >         @inbounds bi = b[I] >         abs_m = abs(ai-bi) >         abs_p = abs(ai+bi) >         num += abs_p - abs_m >         den += abs_p + abs_m >     end >     1. - num/den > end > > > function testDistances5(a::Array{Float64,1}, b::Array{Float64,1}) >     for i in 1:5000 >         myjaccard5(a,b) >     end > end > > end > > > julia> @time testDistances5(v1,v2) >   0.166979 seconds (4 allocations: 160 bytes) > > > > We see that using abs is faster. > > I do not do a pull request beccause > > I would expect a good implementation to be 2 or 3 times slower than > Euclidean, and I have not > that yet. > > Le lundi 13 juin 2016 13:43:00 UTC+2, Kristoffer Carlsson a écrit : >> >> It seems weird to me that you guys want to call Jaccard distance with >> float arrays. AFAIK Jaccard distance measures the distance between two >> distinct samples from a pair of sets so basically between two Vector{Bool}, >> see: >> http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jaccard.html >> >> "Computes the Jaccard-Needham dissimilarity between two boolean 1-D >> arrays." >> >> Is there some more general formulation of it that extends to vectors in a >> continuous vector space? >> >> And, to note, Jaccard is type stable for inputs of Vector{Bool} in >> Distances.jl. >> >> On Monday, June 13, 2016 at 3:53:14 AM UTC+2, jean-pierre both wrote: >>> >>> >>> >>> I encountered in my application with Distances.Jaccard compared with >>> Distances.Euclidean >>> It was very slow. >>> >>> For example with 2 vecteurs Float64 of size 11520 >>> >>> I get the following >>> julia> D=Euclidean() >>> Distances.Euclidean() >>> julia> @time for i in 1:500 >>>        evaluate(D,v1,v2) >>>        end >>>   0.002553 seconds (500 allocations: 7.813 KB) >>> >>> and with Jaccard >>> >>> julia> D=Jaccard() >>> Distances.Jaccard() >>> @time for i in 1:500 >>>               evaluate(D,v1,v2) >>>               end >>>   1.995046 seconds (40.32 M allocations: 703.156 MB, 9.68% gc time) >>> >>> With a simple loop for computing jaccard : >>> >>> >>> function myjaccard2(a::Array{Float64,1}, b::Array{Float64,1}) >>>            num = 0 >>>            den = 0 >>>            for i in 1:length(a) >>>                    num = num + min(a[i],b[i]) >>>                    den = den + max(a[i],b[i]) >>>            end >>>                1. - num/den >>>        end >>> myjaccard2 (generic function with 1 method) >>> >>> julia> @time for i in 1:500 >>>               myjaccard2(v1,v2) >>>               end >>>   0.451582 seconds (23.04 M allocations: 351.592 MB, 20.04% gc time) >>> >>> I do not see the problem in jaccard distance implementation in the >>> Distances packages >>> >>