

Dear all,

I'm looking for the fastest way to do element-wise vector multiplication in Julia. The best I could come up with is the following implementation, which still runs 1.5x slower than the dot product. I assume the dot product includes such an operation and then accumulates a sum over the element-wise product. The MKL library includes such an operation (v?Mul), but it seems OpenBLAS does not.

So my questions are:
1) Is there any chance I can do element-wise vector multiplication faster than the actual dot product?
2) Why is the built-in element-wise multiplication operator (.*) much slower than my own implementation for such a basic linear algebra operation (full Julia)?

Thank you,
Lionel

Best custom implementation:

    function xpy!{T<:Number}(A::Vector{T},B::Vector{T})
        n = size(A)[1]
        if n == size(B)[1]
            for i=1:n
                @inbounds A[i] *= B[i]
            end
        end
        return A
    end
Benchmark results (JuliaBox, A = randn(300000)):

function                            CPU (s)     GC (%)   ALLOCATION (bytes)   CPU (x)
dot(A,B)                            1.58e-04    0.00     16                   1.0
xpy!(A,B)                           2.31e-04    0.00     80                   1.5
NumericExtensions.multiply!(P,Q)    3.60e-04    0.00     80                   2.3
xpy!(A,B) - no @inbounds check      4.36e-04    0.00     80                   2.8
P.*Q                                2.52e-03    50.36    2400512              16.0
############################################################


I made some simple changes to your `xpy!` and managed to get it to allocate nothing at all, while performing very close to the speed of `dot`. I don't know anything about e.g. `@simd` instructions, but I imagine they could help speed this up even further. The most significant change was switching `size(A)[1]` to `size(A,1)` (and similarly for `B`): the former has to construct and index into a tuple, while the latter doesn't. `length(A)` would have worked too.
Notebook, also produced on JuliaBox (running Julia 0.4rc2): http://nbviewer.ipython.org/github/tlycken/IJuliaNotebooks/blob/master/dot%20vs%20xpy%21.ipynb
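
For readers who cannot open the notebook, the change described above might look roughly like the sketch below. This is not the notebook's code. It is written in current Julia syntax (a `where` clause instead of the 0.4-era `{T<:Number}` form), uses `length(A)` as mentioned, and, as an assumption of this sketch only, throws on mismatched lengths instead of silently returning `A` unchanged.

```julia
# Sketch only: in-place element-wise product, A[i] <- A[i] * B[i].
function xpy!(A::Vector{T}, B::Vector{T}) where {T<:Number}
    n = length(A)  # avoids building and indexing the tuple returned by size(A)
    n == length(B) || throw(DimensionMismatch("A and B must have the same length"))
    @inbounds for i in 1:n
        A[i] *= B[i]
    end
    return A
end

A = [1.0, 2.0, 3.0]
B = [4.0, 5.0, 6.0]
xpy!(A, B)
# A is now [4.0, 10.0, 18.0]; B is unchanged
```

The function mutates its first argument and returns it, which matches the `!` naming convention used in the original post.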
// T

On Tuesday, October 6, 2015 at 4:28:29 PM UTC+2, Lionel du Peloux wrote:
Dear all, I'm looking for the fastest way to do elementwise vector multiplication in Julia. [...]


Well, I guess your table pretty much shows it, right? It seems that P.*Q allocates a lot of temporary memory to carry out the calculations.


a *= b is equivalent to a = a * b, which allocates a temporary variable, I think?
Try

    @fastmath @inbounds @simd for i=1:n
        A[i] *= B[i]
    end


or, possibly A[i] = A[i] * B[i]
(I'm not sure whether @simd automatically translates *= to what it needs)

On Tuesday, 6 October 2015 17:29:04 UTC+1, Christoph Ortner wrote:
a *= b is equivalent to a = a * b, which allocates a temporary variable I think? [...]


On Tuesday, October 6, 2015 at 12:29:04 PM UTC-4, Christoph Ortner wrote:
a *= b is equivalent to a = a * b, which allocates a temporary variable I think?

A * A only allocates memory on the heap if A is an array or some other heap-allocated datatype. For A[i] *= B[i], where A[i] and B[i] are small scalar types like Float64, no temporary is allocated; the compiler just puts the result in a register.
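
The difference between the scalar and the array case can be checked with `@allocated`. The sketch below is illustrative only: the function names are hypothetical, and the exact byte counts depend on the Julia version, so nothing beyond "the loop allocates less than the array product" is assumed.

```julia
# Scalar update in a loop: each A[i] * B[i] is a Float64, not a heap object.
function scale_loop!(A::Vector{Float64}, B::Vector{Float64})
    # eachindex(A, B) throws if the two arrays do not share the same indices
    @inbounds for i in eachindex(A, B)
        A[i] *= B[i]
    end
    return A
end

# Whole-array product: has to build a new array to hold the result.
array_product(A, B) = A .* B

function compare_allocations(n)
    A = rand(n)
    B = rand(n)
    scale_loop!(A, B)       # warm-up calls so that compilation is not measured
    array_product(A, B)
    loop_bytes = @allocated scale_loop!(A, B)
    array_bytes = @allocated array_product(A, B)
    return loop_bytes, array_bytes
end

loop_bytes, array_bytes = compare_allocations(10_000)
println("loop: ", loop_bytes, " bytes, array product: ", array_bytes, " bytes")
```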


That was supposed to be "A * B only allocates..." right?

On Tuesday, October 6, 2015 at 1:52:18 PM UTC-4, Steven G. Johnson wrote:
A * A only allocates memory on the heap if A is an array or some other heap-allocated datatype. [...]


On Tuesday, October 6, 2015 at 2:23:33 PM UTC-4, Patrick Kofod Mogensen wrote:
That was supposed to be "A * B only allocates..." right?

Yes.


Note that the BLAS dot product probably uses all sorts of tricks to squeeze the last cycle of SIMD performance out of the CPU. For example, here is the OpenBLAS ddot function for Sandy Bridge, which is hand-coded in assembly:
https://github.com/xianyi/OpenBLAS/blob/develop/kernel/x86_64/ddot_microk_sandy2.c
Getting the last 30% or so of this performance can be extremely tricky.


Thank you for all of your suggestions. The @simd macro does indeed give a (very) slight performance improvement (5%).


I have the same question about how to calculate the entrywise vector product, and found this thread. As a novice, I wonder whether the following code snippet is still the standard for entrywise vector multiplication that one should stick to in practice? Thanks!

    @fastmath @inbounds @simd for i=1:n
        A[i] *= B[i]
    end


I think that for medium-size (but not large) arrays in v0.5 you may want to use @threads from the threading branch, and then for really large arrays you may want to use @parallel. But you'd have to test some timings.

On Monday, June 20, 2016 at 11:38:15 AM UTC+1, [hidden email] wrote:
I have the same question regarding how to calculate the entrywise vector product and find this thread. [...]
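
A threaded version of the loop, in the spirit of the @threads suggestion, could be sketched as follows. It uses `Threads.@threads` as it exists in current Julia, not the v0.5 development branch mentioned in the post, and the function name is hypothetical. Julia has to be started with more than one thread (for example via the `JULIA_NUM_THREADS` environment variable) for this to run in parallel. The `@parallel` macro was later renamed `@distributed` and is not shown here.

```julia
using Base.Threads: @threads, nthreads

# Sketch only: splits the index range across the available threads.
# Each iteration writes to a different A[i], so no locking is needed.
function xpy_threaded!(A::Vector{T}, B::Vector{T}) where {T<:Number}
    length(A) == length(B) || throw(DimensionMismatch("A and B must have the same length"))
    @threads for i in 1:length(A)
        @inbounds A[i] *= B[i]
    end
    return A
end

A = collect(1.0:8.0)
B = fill(2.0, 8)
xpy_threaded!(A, B)
println("threads available: ", nthreads())
```

For a memory-bound operation like this one, whether threading pays off depends on the array size and the machine, so it is worth timing against the serial loop.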


Thanks! I'm still using v0.4.5. In this case, is the code I highlighted above still the best choice for doing the job?

On Monday, June 20, 2016 at 1:57:25 PM UTC+1, Chris Rackauckas wrote:
I think that for medium size (but not large) arrays in v0.5 you may want to use @threads [...]


Most likely. I would also time it with and without @simd at your problem size. For some reason I've had some simple loops do better without @simd.

On Monday, June 20, 2016 at 2:50:22 PM UTC+1, [hidden email] wrote:
Thanks! I'm still using v0.4.5. [...]
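
One way to time the loop with and without @simd at a given problem size is sketched below. The helper names are hypothetical, and the best-of-several `@elapsed` measurement is a rough stand-in: a dedicated benchmarking package such as BenchmarkTools.jl would give more reliable numbers.

```julia
function mul_plain!(A, B)
    @inbounds for i in eachindex(A, B)
        A[i] *= B[i]
    end
    return A
end

function mul_simd!(A, B)
    @inbounds @simd for i in eachindex(A, B)
        A[i] *= B[i]
    end
    return A
end

# Best wall-clock time over a few repeats, after one warm-up call.
function best_time(f!, n; repeats = 20)
    A = rand(n)
    B = fill(1.0, n)   # multiplying by ones keeps A unchanged across repeats
    f!(A, B)           # warm-up so that compilation is not timed
    return minimum(@elapsed(f!(A, B)) for _ in 1:repeats)
end

for n in (1_000, 300_000)
    println("n = ", n, ": plain ", best_time(mul_plain!, n),
            " s, simd ", best_time(mul_simd!, n), " s")
end
```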


Thanks for the confirmation! Yes, I need more tests to see what the best practice is for my particular problem.

On Monday, June 20, 2016 at 3:05:31 PM UTC+1, Chris Rackauckas wrote:
Most likely. I would also time it with and without @simd at your problem size. [...]


Should this be added to a package? I imagine if the arrays are on the GPU (AFArrays) then the operation could be much faster, and having a consistent name would be helpful.

This is pretty much made obsolete by the dot-fusing changes:

    A .= A.*B

should be an in-place update of A scaled by B (Tomas' solution).

On Tuesday, November 1, 2016 at 4:39:15 PM UTC-7, Sheehan Olver wrote:
Should this be added to a package? [...]
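
On Julia 0.6 and later, where dotted operators take part in fusion, the in-place update can be spelled in a few equivalent ways. The sketch below uses separate copies of a small example vector (the variable names are illustrative) so that each spelling starts from the same input.

```julia
A0 = [1.0, 2.0, 3.0]
B = [4.0, 5.0, 6.0]

A1 = copy(A0)
A1 .= A1 .* B              # fused broadcast written into the existing array

A2 = copy(A0)
A2 .*= B                   # shorthand for the line above

A3 = copy(A0)
broadcast!(*, A3, A3, B)   # explicit call with the destination as first array argument

println(A1 == A2 == A3)    # true: all three hold [4.0, 10.0, 18.0]
```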


Ah, good point. Though I guess that won't work til 0.6 since .* won't autofuse yet?


As I understand it, the .* will fuse, but the .= will not (until 0.6?), so A will be rebound to a newly allocated array. If my understanding is wrong I'd love to know. There have been many times in the last few days that I would have used it...
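
The distinction between rebinding a name and updating an array in place can be observed with `===`, which tests whether two names refer to the same object. The sketch below shows only that general distinction as it behaves in current Julia. Which of `.*` and `.=` fused in 0.5 is a separate, version-specific question, and the variable names are illustrative.

```julia
B = [3.0, 4.0]

# Plain assignment: the right-hand side is a new array and A is rebound to it.
A = [1.0, 2.0]
alias = A                          # second name for the same array
A = A .* B
rebound = A !== alias              # true: A now names a different array
untouched = alias == [1.0, 2.0]    # true: the original storage was not modified

# Dotted assignment: the result is written into the existing array.
C = [1.0, 2.0]
alias_c = C
C .= C .* B
same = C === alias_c               # true: still the same array
updated = alias_c == [3.0, 8.0]    # true: the update is visible through the alias
```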


It's the other way around. .* won't fuse because it's still an operator. .= will. If you want .* to fuse, you can instead do:

    A .= *.(A,B)

since this invokes the broadcast on *, instead of invoking .*. But that's just a temporary thing.

On Tuesday, November 1, 2016 at 7:27:40 PM UTC-7, Tom Breloff wrote:
As I understand it, the .* will fuse, but the .= will not (until 0.6?), so A will be rebound to a newly allocated array. [...]

On Tue, Nov 1, 2016 at 10:06 PM, Sheehan Olver wrote:
Ah, good point. Though I guess that won't work til 0.6 since .* won't autofuse yet?

On 2 Nov. 2016, at 12:55, Chris Rackauckas wrote:
This is pretty much obsolete by the . fusing changes: A .= A.*B [...]

