# Allocations in numerical linear algebra in v0.4

7 messages
Open this post in threaded view
|

## Allocations in numerical linear algebra in v0.4

 In v0.4 I am encountering consdierable storage allocation and garbage collection in linear algebra operations that I think should be allocation free.  By way of background I am working with block-diagonal matrices in which all the diagonal blocks are square and of the same size.  In the particular case I am considering there are 960 blocks of size 3 by 3.  I store the diagonal blocks as an Array(Float64, (3, 3, 960)).  The type is called MixedModels.HBLkDiag (homoegeneous block diagonal) and the 3D-array is the arr member of the type.An associated type is MixedModels.VectorReMat which contains a matrix with, in this case, 3 rows and  25,498 columns, and a DataArrays.PooledDataVector whose refs are of length 25,498 and whose pool is of length 960.During iterations I need to update the contents of the HBlkDiag by taking the ith column, say c,  of the VectorReMat, forming c * c' and adding that to the refs[i]'th 3 by 3 diagonal block.  The current version of the code is function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, B::VectorReMat{T})    c, a, r = C.arr, A.z, A.f.refs    fill!(c, 0)    if !is(A, B)        throw(ArgumentError("Currently defined only for A === B"))    end    for i in eachindex(r)        Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, Int(r[i])))    end    for k in 1:size(c, 3)        Base.LinAlg.copytri!(sub(c, :, :, k), 'U')    end    CendI thought I was being careful by using sub or slice (I think the effect is the same in this case) but apparently not.  When I track allocations while fitting the model being represented here I get        - function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, B::VectorReMat{T})  1503384     c, a, r = C.arr, A.z, A.f.refs        0     fill!(c, 0)        0     if !is(A, B)        0         throw(ArgumentError("Currently defined only for A === B"))        -     end        0     for i in eachindex(r)-1352727904         Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, Int(r[i])))        -     end        0     for k in 1:size(c, 3)1016371200         Base.LinAlg.copytri!(sub(c, :, :, k), 'U')        -     end        0     C        - endAs far as the copytri! call goes, I can write 3 nested loops to achieve the copy without all this allocation.  I suppose I can also replace the syr! call with explicit loops.  It turns out that there are a lot of places like this that I would need to back off Lapack/BLAS calls and run explicit loops if I want to make this fast in v 0.4Am I missing something here?  Is there a way to do the programming at a higher level and not suffer all these allocations?  Should I expect this problem to go away after the Arraypocalypse?
Open this post in threaded view
|

## Re: Allocations in numerical linear algebra in v0.4

 On Mon, May 9, 2016 at 2:04 PM, Douglas Bates <[hidden email]> wrote: > In v0.4 I am encountering consdierable storage allocation and garbage > collection in linear algebra operations that I think should be allocation > free.  By way of background I am working with block-diagonal matrices in > which all the diagonal blocks are square and of the same size.  In the > particular case I am considering there are 960 blocks of size 3 by 3.  I > store the diagonal blocks as an Array(Float64, (3, 3, 960)).  The type is > called MixedModels.HBLkDiag (homoegeneous block diagonal) and the 3D-array > is the arr member of the type. > > An associated type is MixedModels.VectorReMat which contains a matrix with, > in this case, 3 rows and  25,498 columns, and a DataArrays.PooledDataVector > whose refs are of length 25,498 and whose pool is of length 960. > > During iterations I need to update the contents of the HBlkDiag by taking > the ith column, say c,  of the VectorReMat, forming c * c' and adding that > to the refs[i]'th 3 by 3 diagonal block.  The current version of the code is > > function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, > B::VectorReMat{T}) >     c, a, r = C.arr, A.z, A.f.refs >     fill!(c, 0) >     if !is(A, B) >         throw(ArgumentError("Currently defined only for A === B")) >     end >     for i in eachindex(r) >         Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, Int(r[i]))) >     end >     for k in 1:size(c, 3) >         Base.LinAlg.copytri!(sub(c, :, :, k), 'U') >     end >     C > end > > I thought I was being careful by using sub or slice (I think the effect is > the same in this case) but apparently not.  When I track allocations while > fitting the model being represented here I get > >         - function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, > B::VectorReMat{T}) >   1503384     c, a, r = C.arr, A.z, A.f.refs >         0     fill!(c, 0) >         0     if !is(A, B) >         0         throw(ArgumentError("Currently defined only for A === B")) >         -     end >         0     for i in eachindex(r) > -1352727904         Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, > Int(r[i]))) >         -     end >         0     for k in 1:size(c, 3) > 1016371200         Base.LinAlg.copytri!(sub(c, :, :, k), 'U') >         -     end >         0     C >         - end > > As far as the copytri! call goes, I can write 3 nested loops to achieve the > copy without all this allocation.  I suppose I can also replace the syr! > call with explicit loops.  It turns out that there are a lot of places like > this that I would need to back off Lapack/BLAS calls and run explicit loops > if I want to make this fast in v 0.4 slice and sub allocates, they should be cheaper than copying a big array though. While we are trying to add stack allocation for these objects, it's relatively hard to make a escaped object stack allocated. If you care about the performance of the loop, you should benchmark the cost of the creation of these allocations and the efficiency of calling blas function on subarray. > > Am I missing something here?  Is there a way to do the programming at a > higher level and not suffer all these allocations?  Should I expect this > problem to go away after the Arraypocalypse?
Open this post in threaded view
|

## Re: Allocations in numerical linear algebra in v0.4

 In reply to this post by Douglas Bates On Monday, May 9, 2016 at 1:04:05 PM UTC-5, Douglas Bates wrote:In v0.4 I am encountering consdierable storage allocation and garbage collection in linear algebra operations that I think should be allocation free.  By way of background I am working with block-diagonal matrices in which all the diagonal blocks are square and of the same size.  In the particular case I am considering there are 960 blocks of size 3 by 3.  I store the diagonal blocks as an Array(Float64, (3, 3, 960)).  The type is called MixedModels.HBLkDiag (homoegeneous block diagonal) and the 3D-array is the arr member of the type.An associated type is MixedModels.VectorReMat which contains a matrix with, in this case, 3 rows and  25,498 columns, and a DataArrays.PooledDataVector whose refs are of length 25,498 and whose pool is of length 960.During iterations I need to update the contents of the HBlkDiag by taking the ith column, say c,  of the VectorReMat, forming c * c' and adding that to the refs[i]'th 3 by 3 diagonal block.  The current version of the code is function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, B::VectorReMat{T})    c, a, r = C.arr, A.z, A.f.refs    fill!(c, 0)    if !is(A, B)        throw(ArgumentError("Currently defined only for A === B"))    end    for i in eachindex(r)        Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, Int(r[i])))    end    for k in 1:size(c, 3)        Base.LinAlg.copytri!(sub(c, :, :, k), 'U')    end    CendI thought I was being careful by using sub or slice (I think the effect is the same in this case) but apparently not.  When I track allocations while fitting the model being represented here I get        - function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, B::VectorReMat{T})  1503384     c, a, r = C.arr, A.z, A.f.refs        0     fill!(c, 0)        0     if !is(A, B)        0         throw(ArgumentError("Currently defined only for A === B"))        -     end        0     for i in eachindex(r)-1352727904         Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, Int(r[i])))        -     end        0     for k in 1:size(c, 3)1016371200         Base.LinAlg.copytri!(sub(c, :, :, k), 'U')        -     end        0     C        - endAs far as the copytri! call goes, I can write 3 nested loops to achieve the copy without all this allocation.  I suppose I can also replace the syr! call with explicit loops.  It turns out that there are a lot of places like this that I would need to back off Lapack/BLAS calls and run explicit loops if I want to make this fast in v 0.4Am I missing something here?  Is there a way to do the programming at a higher level and not suffer all these allocations?  Should I expect this problem to go away after the Arraypocalypse?If I rewrite that function asfunction Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, B::VectorReMat{T})    c, a, r = C.arr, A.z, A.f.refs    _, m, n = size(c)    fill!(c, 0)    if !is(A, B)        throw(ArgumentError("Currently defined only for A === B"))    end    for k in eachindex(r)        ri = Int(r[k])        for j in 1 : m            aj = a[j, k]            c[j, j, ri] += abs2(aj)            for i in 1 : j - 1                aij = a[i, k] * aj                c[i, j, ri] += aij                c[j, i, ri] += aij            end        end    end    Cendmy execution time goes from 110 seconds to 68 seconds.  There are a few other places where I can do similar replacement of function calls by explicit loops, even though that is aesthetically unpleasant, to make the model fitting faster.
Open this post in threaded view
|

## Re: Allocations in numerical linear algebra in v0.4

 Death by abstraction. In a world of complete inlining, I guess it shouldn't actually be necessary to allocate the `SubArray`s at all. sub and slice are just convenient ways of calculating a pointer address and two integers and then pass them to BLAS. I think it would be more important to figure out how to completely eliminate the creation of the `SubArray`s instead of how to stack allocate them...but I'm probably missing something.On Mon, May 9, 2016 at 2:25 PM, Douglas Bates wrote:On Monday, May 9, 2016 at 1:04:05 PM UTC-5, Douglas Bates wrote:In v0.4 I am encountering consdierable storage allocation and garbage collection in linear algebra operations that I think should be allocation free.  By way of background I am working with block-diagonal matrices in which all the diagonal blocks are square and of the same size.  In the particular case I am considering there are 960 blocks of size 3 by 3.  I store the diagonal blocks as an Array(Float64, (3, 3, 960)).  The type is called MixedModels.HBLkDiag (homoegeneous block diagonal) and the 3D-array is the arr member of the type.An associated type is MixedModels.VectorReMat which contains a matrix with, in this case, 3 rows and  25,498 columns, and a DataArrays.PooledDataVector whose refs are of length 25,498 and whose pool is of length 960.During iterations I need to update the contents of the HBlkDiag by taking the ith column, say c,  of the VectorReMat, forming c * c' and adding that to the refs[i]'th 3 by 3 diagonal block.  The current version of the code is function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, B::VectorReMat{T})    c, a, r = C.arr, A.z, A.f.refs    fill!(c, 0)    if !is(A, B)        throw(ArgumentError("Currently defined only for A === B"))    end    for i in eachindex(r)        Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, Int(r[i])))    end    for k in 1:size(c, 3)        Base.LinAlg.copytri!(sub(c, :, :, k), 'U')    end    CendI thought I was being careful by using sub or slice (I think the effect is the same in this case) but apparently not.  When I track allocations while fitting the model being represented here I get        - function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, B::VectorReMat{T})  1503384     c, a, r = C.arr, A.z, A.f.refs        0     fill!(c, 0)        0     if !is(A, B)        0         throw(ArgumentError("Currently defined only for A === B"))        -     end        0     for i in eachindex(r)-1352727904         Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, Int(r[i])))        -     end        0     for k in 1:size(c, 3)1016371200         Base.LinAlg.copytri!(sub(c, :, :, k), 'U')        -     end        0     C        - endAs far as the copytri! call goes, I can write 3 nested loops to achieve the copy without all this allocation.  I suppose I can also replace the syr! call with explicit loops.  It turns out that there are a lot of places like this that I would need to back off Lapack/BLAS calls and run explicit loops if I want to make this fast in v 0.4Am I missing something here?  Is there a way to do the programming at a higher level and not suffer all these allocations?  Should I expect this problem to go away after the Arraypocalypse?If I rewrite that function asfunction Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, B::VectorReMat{T})    c, a, r = C.arr, A.z, A.f.refs    _, m, n = size(c)    fill!(c, 0)    if !is(A, B)        throw(ArgumentError("Currently defined only for A === B"))    end    for k in eachindex(r)        ri = Int(r[k])        for j in 1 : m            aj = a[j, k]            c[j, j, ri] += abs2(aj)            for i in 1 : j - 1                aij = a[i, k] * aj                c[i, j, ri] += aij                c[j, i, ri] += aij            end        end    end    Cendmy execution time goes from 110 seconds to 68 seconds.  There are a few other places where I can do similar replacement of function calls by explicit loops, even though that is aesthetically unpleasant, to make the model fitting faster.
Open this post in threaded view
|

## Re: Allocations in numerical linear algebra in v0.4

 On Mon, May 9, 2016 at 2:32 PM, Andreas Noack <[hidden email]> wrote: > Death by abstraction. In a world of complete inlining, I guess it shouldn't > actually be necessary to allocate the `SubArray`s at all. sub and slice are > just convenient ways of calculating a pointer address and two integers and > then pass them to BLAS. I think it would be more important to figure out how > to completely eliminate the creation of the `SubArray`s instead of how to > stack allocate them...but I'm probably missing something. If the functions using subarray are inlined, it'll be easy to eliminate their allocation. Or of course, you can pass in the stride to the blas wrapper functions separately. > > On Mon, May 9, 2016 at 2:25 PM, Douglas Bates <[hidden email]> wrote: >> >> >> >> On Monday, May 9, 2016 at 1:04:05 PM UTC-5, Douglas Bates wrote: >>> >>> In v0.4 I am encountering consdierable storage allocation and garbage >>> collection in linear algebra operations that I think should be allocation >>> free.  By way of background I am working with block-diagonal matrices in >>> which all the diagonal blocks are square and of the same size.  In the >>> particular case I am considering there are 960 blocks of size 3 by 3.  I >>> store the diagonal blocks as an Array(Float64, (3, 3, 960)).  The type is >>> called MixedModels.HBLkDiag (homoegeneous block diagonal) and the 3D-array >>> is the arr member of the type. >>> >>> An associated type is MixedModels.VectorReMat which contains a matrix >>> with, in this case, 3 rows and  25,498 columns, and a >>> DataArrays.PooledDataVector whose refs are of length 25,498 and whose pool >>> is of length 960. >>> >>> During iterations I need to update the contents of the HBlkDiag by taking >>> the ith column, say c,  of the VectorReMat, forming c * c' and adding that >>> to the refs[i]'th 3 by 3 diagonal block.  The current version of the code is >>> >>> function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, >>> B::VectorReMat{T}) >>>     c, a, r = C.arr, A.z, A.f.refs >>>     fill!(c, 0) >>>     if !is(A, B) >>>         throw(ArgumentError("Currently defined only for A === B")) >>>     end >>>     for i in eachindex(r) >>>         Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, Int(r[i]))) >>>     end >>>     for k in 1:size(c, 3) >>>         Base.LinAlg.copytri!(sub(c, :, :, k), 'U') >>>     end >>>     C >>> end >>> >>> I thought I was being careful by using sub or slice (I think the effect >>> is the same in this case) but apparently not.  When I track allocations >>> while fitting the model being represented here I get >>> >>>         - function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, >>> B::VectorReMat{T}) >>>   1503384     c, a, r = C.arr, A.z, A.f.refs >>>         0     fill!(c, 0) >>>         0     if !is(A, B) >>>         0         throw(ArgumentError("Currently defined only for A === >>> B")) >>>         -     end >>>         0     for i in eachindex(r) >>> -1352727904         Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, >>> Int(r[i]))) >>>         -     end >>>         0     for k in 1:size(c, 3) >>> 1016371200         Base.LinAlg.copytri!(sub(c, :, :, k), 'U') >>>         -     end >>>         0     C >>>         - end >>> >>> As far as the copytri! call goes, I can write 3 nested loops to achieve >>> the copy without all this allocation.  I suppose I can also replace the syr! >>> call with explicit loops.  It turns out that there are a lot of places like >>> this that I would need to back off Lapack/BLAS calls and run explicit loops >>> if I want to make this fast in v 0.4 >>> >>> Am I missing something here?  Is there a way to do the programming at a >>> higher level and not suffer all these allocations?  Should I expect this >>> problem to go away after the Arraypocalypse? >> >> >> If I rewrite that function as >> >> function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, >> B::VectorReMat{T}) >>     c, a, r = C.arr, A.z, A.f.refs >>     _, m, n = size(c) >>     fill!(c, 0) >>     if !is(A, B) >>         throw(ArgumentError("Currently defined only for A === B")) >>     end >>     for k in eachindex(r) >>         ri = Int(r[k]) >>         for j in 1 : m >>             aj = a[j, k] >>             c[j, j, ri] += abs2(aj) >>             for i in 1 : j - 1 >>                 aij = a[i, k] * aj >>                 c[i, j, ri] += aij >>                 c[j, i, ri] += aij >>             end >>         end >>     end >>     C >> end >> >> my execution time goes from 110 seconds to 68 seconds.  There are a few >> other places where I can do similar replacement of function calls by >> explicit loops, even though that is aesthetically unpleasant, to make the >> model fitting faster. > >