In v0.4 I am encountering consdierable storage allocation and garbage collection in linear algebra operations that I think should be allocation free. By way of background I am working with blockdiagonal matrices in which all the diagonal blocks are square and of the same size. In the particular case I am considering there are 960 blocks of size 3 by 3. I store the diagonal blocks as an Array(Float64, (3, 3, 960)). The type is called MixedModels.HBLkDiag (homoegeneous block diagonal) and the 3Darray is the arr member of the type.
An associated type is MixedModels.VectorReMat which contains a matrix with, in this case, 3 rows and 25,498 columns, and a DataArrays.PooledDataVector whose refs are of length 25,498 and whose pool is of length 960. During iterations I need to update the contents of the HBlkDiag by taking the ith column, say c, of the VectorReMat, forming c * c' and adding that to the refs[i]'th 3 by 3 diagonal block. The current version of the code is function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, B::VectorReMat{T}) c, a, r = C.arr, A.z, A.f.refs fill!(c, 0) if !is(A, B) throw(ArgumentError("Currently defined only for A === B")) end for i in eachindex(r) Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, Int(r[i]))) end for k in 1:size(c, 3) Base.LinAlg.copytri!(sub(c, :, :, k), 'U') end C end I thought I was being careful by using sub or slice (I think the effect is the same in this case) but apparently not. When I track allocations while fitting the model being represented here I get  function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, B::VectorReMat{T}) 1503384 c, a, r = C.arr, A.z, A.f.refs 0 fill!(c, 0) 0 if !is(A, B) 0 throw(ArgumentError("Currently defined only for A === B"))  end 0 for i in eachindex(r) 1352727904 Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, Int(r[i])))  end 0 for k in 1:size(c, 3) 1016371200 Base.LinAlg.copytri!(sub(c, :, :, k), 'U')  end 0 C  end As far as the copytri! call goes, I can write 3 nested loops to achieve the copy without all this allocation. I suppose I can also replace the syr! call with explicit loops. It turns out that there are a lot of places like this that I would need to back off Lapack/BLAS calls and run explicit loops if I want to make this fast in v 0.4 Am I missing something here? Is there a way to do the programming at a higher level and not suffer all these allocations? Should I expect this problem to go away after the Arraypocalypse? 
On Mon, May 9, 2016 at 2:04 PM, Douglas Bates <[hidden email]> wrote:
> In v0.4 I am encountering consdierable storage allocation and garbage > collection in linear algebra operations that I think should be allocation > free. By way of background I am working with blockdiagonal matrices in > which all the diagonal blocks are square and of the same size. In the > particular case I am considering there are 960 blocks of size 3 by 3. I > store the diagonal blocks as an Array(Float64, (3, 3, 960)). The type is > called MixedModels.HBLkDiag (homoegeneous block diagonal) and the 3Darray > is the arr member of the type. > > An associated type is MixedModels.VectorReMat which contains a matrix with, > in this case, 3 rows and 25,498 columns, and a DataArrays.PooledDataVector > whose refs are of length 25,498 and whose pool is of length 960. > > During iterations I need to update the contents of the HBlkDiag by taking > the ith column, say c, of the VectorReMat, forming c * c' and adding that > to the refs[i]'th 3 by 3 diagonal block. The current version of the code is > > function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, > B::VectorReMat{T}) > c, a, r = C.arr, A.z, A.f.refs > fill!(c, 0) > if !is(A, B) > throw(ArgumentError("Currently defined only for A === B")) > end > for i in eachindex(r) > Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, Int(r[i]))) > end > for k in 1:size(c, 3) > Base.LinAlg.copytri!(sub(c, :, :, k), 'U') > end > C > end > > I thought I was being careful by using sub or slice (I think the effect is > the same in this case) but apparently not. When I track allocations while > fitting the model being represented here I get > >  function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, > B::VectorReMat{T}) > 1503384 c, a, r = C.arr, A.z, A.f.refs > 0 fill!(c, 0) > 0 if !is(A, B) > 0 throw(ArgumentError("Currently defined only for A === B")) >  end > 0 for i in eachindex(r) > 1352727904 Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, > Int(r[i]))) >  end > 0 for k in 1:size(c, 3) > 1016371200 Base.LinAlg.copytri!(sub(c, :, :, k), 'U') >  end > 0 C >  end > > As far as the copytri! call goes, I can write 3 nested loops to achieve the > copy without all this allocation. I suppose I can also replace the syr! > call with explicit loops. It turns out that there are a lot of places like > this that I would need to back off Lapack/BLAS calls and run explicit loops > if I want to make this fast in v 0.4 slice and sub allocates, they should be cheaper than copying a big array though. While we are trying to add stack allocation for these objects, it's relatively hard to make a escaped object stack allocated. If you care about the performance of the loop, you should benchmark the cost of the creation of these allocations and the efficiency of calling blas function on subarray. > > Am I missing something here? Is there a way to do the programming at a > higher level and not suffer all these allocations? Should I expect this > problem to go away after the Arraypocalypse? 
In reply to this post by Douglas Bates
On Monday, May 9, 2016 at 1:04:05 PM UTC5, Douglas Bates wrote:
If I rewrite that function as function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, B::VectorReMat{T}) c, a, r = C.arr, A.z, A.f.refs _, m, n = size(c) fill!(c, 0) if !is(A, B) throw(ArgumentError("Currently defined only for A === B")) end for k in eachindex(r) ri = Int(r[k]) for j in 1 : m aj = a[j, k] c[j, j, ri] += abs2(aj) for i in 1 : j  1 aij = a[i, k] * aj c[i, j, ri] += aij c[j, i, ri] += aij end end end C end my execution time goes from 110 seconds to 68 seconds. There are a few other places where I can do similar replacement of function calls by explicit loops, even though that is aesthetically unpleasant, to make the model fitting faster. 
Death by abstraction. In a world of complete inlining, I guess it shouldn't actually be necessary to allocate the `SubArray`s at all. sub and slice are just convenient ways of calculating a pointer address and two integers and then pass them to BLAS. I think it would be more important to figure out how to completely eliminate the creation of the `SubArray`s instead of how to stack allocate them...but I'm probably missing something. On Mon, May 9, 2016 at 2:25 PM, Douglas Bates <[hidden email]> wrote:

On Mon, May 9, 2016 at 2:32 PM, Andreas Noack
<[hidden email]> wrote: > Death by abstraction. In a world of complete inlining, I guess it shouldn't > actually be necessary to allocate the `SubArray`s at all. sub and slice are > just convenient ways of calculating a pointer address and two integers and > then pass them to BLAS. I think it would be more important to figure out how > to completely eliminate the creation of the `SubArray`s instead of how to > stack allocate them...but I'm probably missing something. If the functions using subarray are inlined, it'll be easy to eliminate their allocation. Or of course, you can pass in the stride to the blas wrapper functions separately. > > On Mon, May 9, 2016 at 2:25 PM, Douglas Bates <[hidden email]> wrote: >> >> >> >> On Monday, May 9, 2016 at 1:04:05 PM UTC5, Douglas Bates wrote: >>> >>> In v0.4 I am encountering consdierable storage allocation and garbage >>> collection in linear algebra operations that I think should be allocation >>> free. By way of background I am working with blockdiagonal matrices in >>> which all the diagonal blocks are square and of the same size. In the >>> particular case I am considering there are 960 blocks of size 3 by 3. I >>> store the diagonal blocks as an Array(Float64, (3, 3, 960)). The type is >>> called MixedModels.HBLkDiag (homoegeneous block diagonal) and the 3Darray >>> is the arr member of the type. >>> >>> An associated type is MixedModels.VectorReMat which contains a matrix >>> with, in this case, 3 rows and 25,498 columns, and a >>> DataArrays.PooledDataVector whose refs are of length 25,498 and whose pool >>> is of length 960. >>> >>> During iterations I need to update the contents of the HBlkDiag by taking >>> the ith column, say c, of the VectorReMat, forming c * c' and adding that >>> to the refs[i]'th 3 by 3 diagonal block. The current version of the code is >>> >>> function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, >>> B::VectorReMat{T}) >>> c, a, r = C.arr, A.z, A.f.refs >>> fill!(c, 0) >>> if !is(A, B) >>> throw(ArgumentError("Currently defined only for A === B")) >>> end >>> for i in eachindex(r) >>> Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, Int(r[i]))) >>> end >>> for k in 1:size(c, 3) >>> Base.LinAlg.copytri!(sub(c, :, :, k), 'U') >>> end >>> C >>> end >>> >>> I thought I was being careful by using sub or slice (I think the effect >>> is the same in this case) but apparently not. When I track allocations >>> while fitting the model being represented here I get >>> >>>  function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, >>> B::VectorReMat{T}) >>> 1503384 c, a, r = C.arr, A.z, A.f.refs >>> 0 fill!(c, 0) >>> 0 if !is(A, B) >>> 0 throw(ArgumentError("Currently defined only for A === >>> B")) >>>  end >>> 0 for i in eachindex(r) >>> 1352727904 Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, >>> Int(r[i]))) >>>  end >>> 0 for k in 1:size(c, 3) >>> 1016371200 Base.LinAlg.copytri!(sub(c, :, :, k), 'U') >>>  end >>> 0 C >>>  end >>> >>> As far as the copytri! call goes, I can write 3 nested loops to achieve >>> the copy without all this allocation. I suppose I can also replace the syr! >>> call with explicit loops. It turns out that there are a lot of places like >>> this that I would need to back off Lapack/BLAS calls and run explicit loops >>> if I want to make this fast in v 0.4 >>> >>> Am I missing something here? Is there a way to do the programming at a >>> higher level and not suffer all these allocations? Should I expect this >>> problem to go away after the Arraypocalypse? >> >> >> If I rewrite that function as >> >> function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, >> B::VectorReMat{T}) >> c, a, r = C.arr, A.z, A.f.refs >> _, m, n = size(c) >> fill!(c, 0) >> if !is(A, B) >> throw(ArgumentError("Currently defined only for A === B")) >> end >> for k in eachindex(r) >> ri = Int(r[k]) >> for j in 1 : m >> aj = a[j, k] >> c[j, j, ri] += abs2(aj) >> for i in 1 : j  1 >> aij = a[i, k] * aj >> c[i, j, ri] += aij >> c[j, i, ri] += aij >> end >> end >> end >> C >> end >> >> my execution time goes from 110 seconds to 68 seconds. There are a few >> other places where I can do similar replacement of function calls by >> explicit loops, even though that is aesthetically unpleasant, to make the >> model fitting faster. > > 
In reply to this post by Andreas Noack
Thanks for the responses. I just wanted to check that I wasn't doing something dumb or missing an obvious way to circumvent the problem.
Thanks to the trackallocation option when running Julia I can easily find and eliminate the allocation bottlenecks. On Monday, May 9, 2016 at 1:32:25 PM UTC5, Andreas Noack wrote:

In reply to this post by Andreas Noack
I definitely agree that the goal should be to get it to the point where LLVM
generates the "trivial" code. My understanding is that the heap allocation interferes with LLVM's ability to realize how simple the problem really is, and that implementing stack allocation may be an important step along the way towards eliding the creation of the SubArray. But I'm pretty fuzzy on such things, so I could be wrong. Best, Tim On Monday, May 09, 2016 02:32:22 PM Andreas Noack wrote: > Death by abstraction. In a world of complete inlining, I guess it shouldn't > actually be necessary to allocate the `SubArray`s at all. sub and slice are > just convenient ways of calculating a pointer address and two integers and > then pass them to BLAS. I think it would be more important to figure out > how to completely eliminate the creation of the `SubArray`s instead of how > to stack allocate them...but I'm probably missing something. > > On Mon, May 9, 2016 at 2:25 PM, Douglas Bates <[hidden email]> wrote: > > On Monday, May 9, 2016 at 1:04:05 PM UTC5, Douglas Bates wrote: > >> In v0.4 I am encountering consdierable storage allocation and garbage > >> collection in linear algebra operations that I think should be allocation > >> free. By way of background I am working with blockdiagonal matrices in > >> which all the diagonal blocks are square and of the same size. In the > >> particular case I am considering there are 960 blocks of size 3 by 3. I > >> store the diagonal blocks as an Array(Float64, (3, 3, 960)). The type is > >> called MixedModels.HBLkDiag (homoegeneous block diagonal) and the > >> 3Darray > >> is the arr member of the type. > >> > >> An associated type is MixedModels.VectorReMat which contains a matrix > >> with, in this case, 3 rows and 25,498 columns, and a > >> DataArrays.PooledDataVector whose refs are of length 25,498 and whose > >> pool > >> is of length 960. > >> > >> During iterations I need to update the contents of the HBlkDiag by taking > >> the ith column, say c, of the VectorReMat, forming c * c' and adding > >> that > >> to the refs[i]'th 3 by 3 diagonal block. The current version of the code > >> is > >> > >> function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, > >> B::VectorReMat{T}) > >> > >> c, a, r = C.arr, A.z, A.f.refs > >> fill!(c, 0) > >> if !is(A, B) > >> > >> throw(ArgumentError("Currently defined only for A === B")) > >> > >> end > >> for i in eachindex(r) > >> > >> Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, Int(r[i]))) > >> > >> end > >> for k in 1:size(c, 3) > >> > >> Base.LinAlg.copytri!(sub(c, :, :, k), 'U') > >> > >> end > >> C > >> > >> end > >> > >> I thought I was being careful by using sub or slice (I think the effect > >> is the same in this case) but apparently not. When I track allocations > >> while fitting the model being represented here I get > >> > >>  function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, > >> > >> B::VectorReMat{T}) > >> > >> 1503384 c, a, r = C.arr, A.z, A.f.refs > >> > >> 0 fill!(c, 0) > >> 0 if !is(A, B) > >> 0 throw(ArgumentError("Currently defined only for A === > >> > >> B")) > >> > >>  end > >> 0 for i in eachindex(r) > >> > >> 1352727904 Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, > >> Int(r[i]))) > >> > >>  end > >> 0 for k in 1:size(c, 3) > >> > >> 1016371200 Base.LinAlg.copytri!(sub(c, :, :, k), 'U') > >> > >>  end > >> 0 C > >>  end > >> > >> As far as the copytri! call goes, I can write 3 nested loops to achieve > >> the copy without all this allocation. I suppose I can also replace the > >> syr! call with explicit loops. It turns out that there are a lot of > >> places > >> like this that I would need to back off Lapack/BLAS calls and run > >> explicit > >> loops if I want to make this fast in v 0.4 > >> > >> Am I missing something here? Is there a way to do the programming at a > >> higher level and not suffer all these allocations? Should I expect this > >> problem to go away after the Arraypocalypse? > > > > If I rewrite that function as > > > > function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, > > B::VectorReMat{T}) > > > > c, a, r = C.arr, A.z, A.f.refs > > _, m, n = size(c) > > fill!(c, 0) > > if !is(A, B) > > > > throw(ArgumentError("Currently defined only for A === B")) > > > > end > > for k in eachindex(r) > > > > ri = Int(r[k]) > > for j in 1 : m > > > > aj = a[j, k] > > c[j, j, ri] += abs2(aj) > > for i in 1 : j  1 > > > > aij = a[i, k] * aj > > c[i, j, ri] += aij > > c[j, i, ri] += aij > > > > end > > > > end > > > > end > > C > > > > end > > > > my execution time goes from 110 seconds to 68 seconds. There are a few > > other places where I can do similar replacement of function calls by > > explicit loops, even though that is aesthetically unpleasant, to make the > > model fitting faster. 
Free forum by Nabble  Edit this page 