Allocations in numerical linear algebra in v0.4

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Allocations in numerical linear algebra in v0.4

Douglas Bates
In v0.4 I am encountering consdierable storage allocation and garbage collection in linear algebra operations that I think should be allocation free.  By way of background I am working with block-diagonal matrices in which all the diagonal blocks are square and of the same size.  In the particular case I am considering there are 960 blocks of size 3 by 3.  I store the diagonal blocks as an Array(Float64, (3, 3, 960)).  The type is called MixedModels.HBLkDiag (homoegeneous block diagonal) and the 3D-array is the arr member of the type.

An associated type is MixedModels.VectorReMat which contains a matrix with, in this case, 3 rows and  25,498 columns, and a DataArrays.PooledDataVector whose refs are of length 25,498 and whose pool is of length 960.

During iterations I need to update the contents of the HBlkDiag by taking the ith column, say c,  of the VectorReMat, forming c * c' and adding that to the refs[i]'th 3 by 3 diagonal block.  The current version of the code is 

function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, B::VectorReMat{T})
    c, a, r = C.arr, A.z, A.f.refs
    fill!(c, 0)
    if !is(A, B)
        throw(ArgumentError("Currently defined only for A === B"))
    end
    for i in eachindex(r)
        Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, Int(r[i])))
    end
    for k in 1:size(c, 3)
        Base.LinAlg.copytri!(sub(c, :, :, k), 'U')
    end
    C
end

I thought I was being careful by using sub or slice (I think the effect is the same in this case) but apparently not.  When I track allocations while fitting the model being represented here I get

        - function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, B::VectorReMat{T})
  1503384     c, a, r = C.arr, A.z, A.f.refs
        0     fill!(c, 0)
        0     if !is(A, B)
        0         throw(ArgumentError("Currently defined only for A === B"))
        -     end
        0     for i in eachindex(r)
-1352727904         Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, Int(r[i])))
        -     end
        0     for k in 1:size(c, 3)
1016371200         Base.LinAlg.copytri!(sub(c, :, :, k), 'U')
        -     end
        0     C
        - end

As far as the copytri! call goes, I can write 3 nested loops to achieve the copy without all this allocation.  I suppose I can also replace the syr! call with explicit loops.  It turns out that there are a lot of places like this that I would need to back off Lapack/BLAS calls and run explicit loops if I want to make this fast in v 0.4

Am I missing something here?  Is there a way to do the programming at a higher level and not suffer all these allocations?  Should I expect this problem to go away after the Arraypocalypse?
Reply | Threaded
Open this post in threaded view
|

Re: Allocations in numerical linear algebra in v0.4

Yichao Yu
On Mon, May 9, 2016 at 2:04 PM, Douglas Bates <[hidden email]> wrote:

> In v0.4 I am encountering consdierable storage allocation and garbage
> collection in linear algebra operations that I think should be allocation
> free.  By way of background I am working with block-diagonal matrices in
> which all the diagonal blocks are square and of the same size.  In the
> particular case I am considering there are 960 blocks of size 3 by 3.  I
> store the diagonal blocks as an Array(Float64, (3, 3, 960)).  The type is
> called MixedModels.HBLkDiag (homoegeneous block diagonal) and the 3D-array
> is the arr member of the type.
>
> An associated type is MixedModels.VectorReMat which contains a matrix with,
> in this case, 3 rows and  25,498 columns, and a DataArrays.PooledDataVector
> whose refs are of length 25,498 and whose pool is of length 960.
>
> During iterations I need to update the contents of the HBlkDiag by taking
> the ith column, say c,  of the VectorReMat, forming c * c' and adding that
> to the refs[i]'th 3 by 3 diagonal block.  The current version of the code is
>
> function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T},
> B::VectorReMat{T})
>     c, a, r = C.arr, A.z, A.f.refs
>     fill!(c, 0)
>     if !is(A, B)
>         throw(ArgumentError("Currently defined only for A === B"))
>     end
>     for i in eachindex(r)
>         Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, Int(r[i])))
>     end
>     for k in 1:size(c, 3)
>         Base.LinAlg.copytri!(sub(c, :, :, k), 'U')
>     end
>     C
> end
>
> I thought I was being careful by using sub or slice (I think the effect is
> the same in this case) but apparently not.  When I track allocations while
> fitting the model being represented here I get
>
>         - function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T},
> B::VectorReMat{T})
>   1503384     c, a, r = C.arr, A.z, A.f.refs
>         0     fill!(c, 0)
>         0     if !is(A, B)
>         0         throw(ArgumentError("Currently defined only for A === B"))
>         -     end
>         0     for i in eachindex(r)
> -1352727904         Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :,
> Int(r[i])))
>         -     end
>         0     for k in 1:size(c, 3)
> 1016371200         Base.LinAlg.copytri!(sub(c, :, :, k), 'U')
>         -     end
>         0     C
>         - end
>
> As far as the copytri! call goes, I can write 3 nested loops to achieve the
> copy without all this allocation.  I suppose I can also replace the syr!
> call with explicit loops.  It turns out that there are a lot of places like
> this that I would need to back off Lapack/BLAS calls and run explicit loops
> if I want to make this fast in v 0.4

slice and sub allocates, they should be cheaper than copying a big
array though. While we are trying to add stack allocation for these
objects, it's relatively hard to make a escaped object stack
allocated.

If you care about the performance of the loop, you should benchmark
the cost of the creation of these allocations and the efficiency of
calling blas function on subarray.

>
> Am I missing something here?  Is there a way to do the programming at a
> higher level and not suffer all these allocations?  Should I expect this
> problem to go away after the Arraypocalypse?
Reply | Threaded
Open this post in threaded view
|

Re: Allocations in numerical linear algebra in v0.4

Douglas Bates
In reply to this post by Douglas Bates


On Monday, May 9, 2016 at 1:04:05 PM UTC-5, Douglas Bates wrote:
In v0.4 I am encountering consdierable storage allocation and garbage collection in linear algebra operations that I think should be allocation free.  By way of background I am working with block-diagonal matrices in which all the diagonal blocks are square and of the same size.  In the particular case I am considering there are 960 blocks of size 3 by 3.  I store the diagonal blocks as an Array(Float64, (3, 3, 960)).  The type is called MixedModels.HBLkDiag (homoegeneous block diagonal) and the 3D-array is the arr member of the type.

An associated type is MixedModels.VectorReMat which contains a matrix with, in this case, 3 rows and  25,498 columns, and a DataArrays.PooledDataVector whose refs are of length 25,498 and whose pool is of length 960.

During iterations I need to update the contents of the HBlkDiag by taking the ith column, say c,  of the VectorReMat, forming c * c' and adding that to the refs[i]'th 3 by 3 diagonal block.  The current version of the code is 

function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, B::VectorReMat{T})
    c, a, r = C.arr, A.z, A.f.refs
    fill!(c, 0)
    if !is(A, B)
        throw(ArgumentError("Currently defined only for A === B"))
    end
    for i in eachindex(r)
        Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, Int(r[i])))
    end
    for k in 1:size(c, 3)
        Base.LinAlg.copytri!(sub(c, :, :, k), 'U')
    end
    C
end

I thought I was being careful by using sub or slice (I think the effect is the same in this case) but apparently not.  When I track allocations while fitting the model being represented here I get

        - function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, B::VectorReMat{T})
  1503384     c, a, r = C.arr, A.z, A.f.refs
        0     fill!(c, 0)
        0     if !is(A, B)
        0         throw(ArgumentError("Currently defined only for A === B"))
        -     end
        0     for i in eachindex(r)
-1352727904         Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, Int(r[i])))
        -     end
        0     for k in 1:size(c, 3)
1016371200         Base.LinAlg.copytri!(sub(c, :, :, k), 'U')
        -     end
        0     C
        - end

As far as the copytri! call goes, I can write 3 nested loops to achieve the copy without all this allocation.  I suppose I can also replace the syr! call with explicit loops.  It turns out that there are a lot of places like this that I would need to back off Lapack/BLAS calls and run explicit loops if I want to make this fast in v 0.4

Am I missing something here?  Is there a way to do the programming at a higher level and not suffer all these allocations?  Should I expect this problem to go away after the Arraypocalypse?

If I rewrite that function as

function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, B::VectorReMat{T})
    c, a, r = C.arr, A.z, A.f.refs
    _, m, n = size(c)
    fill!(c, 0)
    if !is(A, B)
        throw(ArgumentError("Currently defined only for A === B"))
    end
    for k in eachindex(r)
        ri = Int(r[k])
        for j in 1 : m
            aj = a[j, k]
            c[j, j, ri] += abs2(aj)
            for i in 1 : j - 1
                aij = a[i, k] * aj
                c[i, j, ri] += aij
                c[j, i, ri] += aij
            end
        end
    end
    C
end

my execution time goes from 110 seconds to 68 seconds.  There are a few other places where I can do similar replacement of function calls by explicit loops, even though that is aesthetically unpleasant, to make the model fitting faster.
Reply | Threaded
Open this post in threaded view
|

Re: Allocations in numerical linear algebra in v0.4

Andreas Noack
Death by abstraction. In a world of complete inlining, I guess it shouldn't actually be necessary to allocate the `SubArray`s at all. sub and slice are just convenient ways of calculating a pointer address and two integers and then pass them to BLAS. I think it would be more important to figure out how to completely eliminate the creation of the `SubArray`s instead of how to stack allocate them...but I'm probably missing something.

On Mon, May 9, 2016 at 2:25 PM, Douglas Bates <[hidden email]> wrote:


On Monday, May 9, 2016 at 1:04:05 PM UTC-5, Douglas Bates wrote:
In v0.4 I am encountering consdierable storage allocation and garbage collection in linear algebra operations that I think should be allocation free.  By way of background I am working with block-diagonal matrices in which all the diagonal blocks are square and of the same size.  In the particular case I am considering there are 960 blocks of size 3 by 3.  I store the diagonal blocks as an Array(Float64, (3, 3, 960)).  The type is called MixedModels.HBLkDiag (homoegeneous block diagonal) and the 3D-array is the arr member of the type.

An associated type is MixedModels.VectorReMat which contains a matrix with, in this case, 3 rows and  25,498 columns, and a DataArrays.PooledDataVector whose refs are of length 25,498 and whose pool is of length 960.

During iterations I need to update the contents of the HBlkDiag by taking the ith column, say c,  of the VectorReMat, forming c * c' and adding that to the refs[i]'th 3 by 3 diagonal block.  The current version of the code is 

function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, B::VectorReMat{T})
    c, a, r = C.arr, A.z, A.f.refs
    fill!(c, 0)
    if !is(A, B)
        throw(ArgumentError("Currently defined only for A === B"))
    end
    for i in eachindex(r)
        Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, Int(r[i])))
    end
    for k in 1:size(c, 3)
        Base.LinAlg.copytri!(sub(c, :, :, k), 'U')
    end
    C
end

I thought I was being careful by using sub or slice (I think the effect is the same in this case) but apparently not.  When I track allocations while fitting the model being represented here I get

        - function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, B::VectorReMat{T})
  1503384     c, a, r = C.arr, A.z, A.f.refs
        0     fill!(c, 0)
        0     if !is(A, B)
        0         throw(ArgumentError("Currently defined only for A === B"))
        -     end
        0     for i in eachindex(r)
-1352727904         Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, Int(r[i])))
        -     end
        0     for k in 1:size(c, 3)
1016371200         Base.LinAlg.copytri!(sub(c, :, :, k), 'U')
        -     end
        0     C
        - end

As far as the copytri! call goes, I can write 3 nested loops to achieve the copy without all this allocation.  I suppose I can also replace the syr! call with explicit loops.  It turns out that there are a lot of places like this that I would need to back off Lapack/BLAS calls and run explicit loops if I want to make this fast in v 0.4

Am I missing something here?  Is there a way to do the programming at a higher level and not suffer all these allocations?  Should I expect this problem to go away after the Arraypocalypse?

If I rewrite that function as

function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, B::VectorReMat{T})
    c, a, r = C.arr, A.z, A.f.refs
    _, m, n = size(c)
    fill!(c, 0)
    if !is(A, B)
        throw(ArgumentError("Currently defined only for A === B"))
    end
    for k in eachindex(r)
        ri = Int(r[k])
        for j in 1 : m
            aj = a[j, k]
            c[j, j, ri] += abs2(aj)
            for i in 1 : j - 1
                aij = a[i, k] * aj
                c[i, j, ri] += aij
                c[j, i, ri] += aij
            end
        end
    end
    C
end

my execution time goes from 110 seconds to 68 seconds.  There are a few other places where I can do similar replacement of function calls by explicit loops, even though that is aesthetically unpleasant, to make the model fitting faster.

Reply | Threaded
Open this post in threaded view
|

Re: Allocations in numerical linear algebra in v0.4

Yichao Yu
On Mon, May 9, 2016 at 2:32 PM, Andreas Noack
<[hidden email]> wrote:
> Death by abstraction. In a world of complete inlining, I guess it shouldn't
> actually be necessary to allocate the `SubArray`s at all. sub and slice are
> just convenient ways of calculating a pointer address and two integers and
> then pass them to BLAS. I think it would be more important to figure out how
> to completely eliminate the creation of the `SubArray`s instead of how to
> stack allocate them...but I'm probably missing something.

If the functions using subarray are inlined, it'll be easy to
eliminate their allocation.
Or of course, you can pass in the stride to the blas wrapper functions
separately.

>
> On Mon, May 9, 2016 at 2:25 PM, Douglas Bates <[hidden email]> wrote:
>>
>>
>>
>> On Monday, May 9, 2016 at 1:04:05 PM UTC-5, Douglas Bates wrote:
>>>
>>> In v0.4 I am encountering consdierable storage allocation and garbage
>>> collection in linear algebra operations that I think should be allocation
>>> free.  By way of background I am working with block-diagonal matrices in
>>> which all the diagonal blocks are square and of the same size.  In the
>>> particular case I am considering there are 960 blocks of size 3 by 3.  I
>>> store the diagonal blocks as an Array(Float64, (3, 3, 960)).  The type is
>>> called MixedModels.HBLkDiag (homoegeneous block diagonal) and the 3D-array
>>> is the arr member of the type.
>>>
>>> An associated type is MixedModels.VectorReMat which contains a matrix
>>> with, in this case, 3 rows and  25,498 columns, and a
>>> DataArrays.PooledDataVector whose refs are of length 25,498 and whose pool
>>> is of length 960.
>>>
>>> During iterations I need to update the contents of the HBlkDiag by taking
>>> the ith column, say c,  of the VectorReMat, forming c * c' and adding that
>>> to the refs[i]'th 3 by 3 diagonal block.  The current version of the code is
>>>
>>> function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T},
>>> B::VectorReMat{T})
>>>     c, a, r = C.arr, A.z, A.f.refs
>>>     fill!(c, 0)
>>>     if !is(A, B)
>>>         throw(ArgumentError("Currently defined only for A === B"))
>>>     end
>>>     for i in eachindex(r)
>>>         Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, Int(r[i])))
>>>     end
>>>     for k in 1:size(c, 3)
>>>         Base.LinAlg.copytri!(sub(c, :, :, k), 'U')
>>>     end
>>>     C
>>> end
>>>
>>> I thought I was being careful by using sub or slice (I think the effect
>>> is the same in this case) but apparently not.  When I track allocations
>>> while fitting the model being represented here I get
>>>
>>>         - function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T},
>>> B::VectorReMat{T})
>>>   1503384     c, a, r = C.arr, A.z, A.f.refs
>>>         0     fill!(c, 0)
>>>         0     if !is(A, B)
>>>         0         throw(ArgumentError("Currently defined only for A ===
>>> B"))
>>>         -     end
>>>         0     for i in eachindex(r)
>>> -1352727904         Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :,
>>> Int(r[i])))
>>>         -     end
>>>         0     for k in 1:size(c, 3)
>>> 1016371200         Base.LinAlg.copytri!(sub(c, :, :, k), 'U')
>>>         -     end
>>>         0     C
>>>         - end
>>>
>>> As far as the copytri! call goes, I can write 3 nested loops to achieve
>>> the copy without all this allocation.  I suppose I can also replace the syr!
>>> call with explicit loops.  It turns out that there are a lot of places like
>>> this that I would need to back off Lapack/BLAS calls and run explicit loops
>>> if I want to make this fast in v 0.4
>>>
>>> Am I missing something here?  Is there a way to do the programming at a
>>> higher level and not suffer all these allocations?  Should I expect this
>>> problem to go away after the Arraypocalypse?
>>
>>
>> If I rewrite that function as
>>
>> function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T},
>> B::VectorReMat{T})
>>     c, a, r = C.arr, A.z, A.f.refs
>>     _, m, n = size(c)
>>     fill!(c, 0)
>>     if !is(A, B)
>>         throw(ArgumentError("Currently defined only for A === B"))
>>     end
>>     for k in eachindex(r)
>>         ri = Int(r[k])
>>         for j in 1 : m
>>             aj = a[j, k]
>>             c[j, j, ri] += abs2(aj)
>>             for i in 1 : j - 1
>>                 aij = a[i, k] * aj
>>                 c[i, j, ri] += aij
>>                 c[j, i, ri] += aij
>>             end
>>         end
>>     end
>>     C
>> end
>>
>> my execution time goes from 110 seconds to 68 seconds.  There are a few
>> other places where I can do similar replacement of function calls by
>> explicit loops, even though that is aesthetically unpleasant, to make the
>> model fitting faster.
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Allocations in numerical linear algebra in v0.4

Douglas Bates
In reply to this post by Andreas Noack
Thanks for the responses.  I just wanted to check that I wasn't doing something dumb or missing an obvious way to circumvent the problem.

Thanks to the track-allocation option when running Julia I can easily find and eliminate the allocation bottlenecks.

On Monday, May 9, 2016 at 1:32:25 PM UTC-5, Andreas Noack wrote:
Death by abstraction. In a world of complete inlining, I guess it shouldn't actually be necessary to allocate the `SubArray`s at all. sub and slice are just convenient ways of calculating a pointer address and two integers and then pass them to BLAS. I think it would be more important to figure out how to completely eliminate the creation of the `SubArray`s instead of how to stack allocate them...but I'm probably missing something.

On Mon, May 9, 2016 at 2:25 PM, Douglas Bates <<a href="javascript:" target="_blank" gdf-obfuscated-mailto="G9admlgDAQAJ" rel="nofollow" onmousedown="this.href=&#39;javascript:&#39;;return true;" onclick="this.href=&#39;javascript:&#39;;return true;">dmb...@...> wrote:


On Monday, May 9, 2016 at 1:04:05 PM UTC-5, Douglas Bates wrote:
In v0.4 I am encountering consdierable storage allocation and garbage collection in linear algebra operations that I think should be allocation free.  By way of background I am working with block-diagonal matrices in which all the diagonal blocks are square and of the same size.  In the particular case I am considering there are 960 blocks of size 3 by 3.  I store the diagonal blocks as an Array(Float64, (3, 3, 960)).  The type is called MixedModels.HBLkDiag (homoegeneous block diagonal) and the 3D-array is the arr member of the type.

An associated type is MixedModels.VectorReMat which contains a matrix with, in this case, 3 rows and  25,498 columns, and a DataArrays.PooledDataVector whose refs are of length 25,498 and whose pool is of length 960.

During iterations I need to update the contents of the HBlkDiag by taking the ith column, say c,  of the VectorReMat, forming c * c' and adding that to the refs[i]'th 3 by 3 diagonal block.  The current version of the code is 

function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, B::VectorReMat{T})
    c, a, r = C.arr, A.z, A.f.refs
    fill!(c, 0)
    if !is(A, B)
        throw(ArgumentError("Currently defined only for A === B"))
    end
    for i in eachindex(r)
        Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, Int(r[i])))
    end
    for k in 1:size(c, 3)
        Base.LinAlg.copytri!(sub(c, :, :, k), 'U')
    end
    C
end

I thought I was being careful by using sub or slice (I think the effect is the same in this case) but apparently not.  When I track allocations while fitting the model being represented here I get

        - function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, B::VectorReMat{T})
  1503384     c, a, r = C.arr, A.z, A.f.refs
        0     fill!(c, 0)
        0     if !is(A, B)
        0         throw(ArgumentError("Currently defined only for A === B"))
        -     end
        0     for i in eachindex(r)
-1352727904         Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, Int(r[i])))
        -     end
        0     for k in 1:size(c, 3)
1016371200         Base.LinAlg.copytri!(sub(c, :, :, k), 'U')
        -     end
        0     C
        - end

As far as the copytri! call goes, I can write 3 nested loops to achieve the copy without all this allocation.  I suppose I can also replace the syr! call with explicit loops.  It turns out that there are a lot of places like this that I would need to back off Lapack/BLAS calls and run explicit loops if I want to make this fast in v 0.4

Am I missing something here?  Is there a way to do the programming at a higher level and not suffer all these allocations?  Should I expect this problem to go away after the Arraypocalypse?

If I rewrite that function as

function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, B::VectorReMat{T})
    c, a, r = C.arr, A.z, A.f.refs
    _, m, n = size(c)
    fill!(c, 0)
    if !is(A, B)
        throw(ArgumentError("Currently defined only for A === B"))
    end
    for k in eachindex(r)
        ri = Int(r[k])
        for j in 1 : m
            aj = a[j, k]
            c[j, j, ri] += abs2(aj)
            for i in 1 : j - 1
                aij = a[i, k] * aj
                c[i, j, ri] += aij
                c[j, i, ri] += aij
            end
        end
    end
    C
end

my execution time goes from 110 seconds to 68 seconds.  There are a few other places where I can do similar replacement of function calls by explicit loops, even though that is aesthetically unpleasant, to make the model fitting faster.

Reply | Threaded
Open this post in threaded view
|

Re: Allocations in numerical linear algebra in v0.4

Tim Holy
In reply to this post by Andreas Noack
I definitely agree that the goal should be to get it to the point where LLVM
generates the "trivial" code. My understanding is that the heap allocation
interferes with LLVM's ability to realize how simple the problem really is,
and that implementing stack allocation may be an important step along the way
towards eliding the creation of the SubArray.

But I'm pretty fuzzy on such things, so I could be wrong.

Best,
--Tim

On Monday, May 09, 2016 02:32:22 PM Andreas Noack wrote:

> Death by abstraction. In a world of complete inlining, I guess it shouldn't
> actually be necessary to allocate the `SubArray`s at all. sub and slice are
> just convenient ways of calculating a pointer address and two integers and
> then pass them to BLAS. I think it would be more important to figure out
> how to completely eliminate the creation of the `SubArray`s instead of how
> to stack allocate them...but I'm probably missing something.
>
> On Mon, May 9, 2016 at 2:25 PM, Douglas Bates <[hidden email]> wrote:
> > On Monday, May 9, 2016 at 1:04:05 PM UTC-5, Douglas Bates wrote:
> >> In v0.4 I am encountering consdierable storage allocation and garbage
> >> collection in linear algebra operations that I think should be allocation
> >> free.  By way of background I am working with block-diagonal matrices in
> >> which all the diagonal blocks are square and of the same size.  In the
> >> particular case I am considering there are 960 blocks of size 3 by 3.  I
> >> store the diagonal blocks as an Array(Float64, (3, 3, 960)).  The type is
> >> called MixedModels.HBLkDiag (homoegeneous block diagonal) and the
> >> 3D-array
> >> is the arr member of the type.
> >>
> >> An associated type is MixedModels.VectorReMat which contains a matrix
> >> with, in this case, 3 rows and  25,498 columns, and a
> >> DataArrays.PooledDataVector whose refs are of length 25,498 and whose
> >> pool
> >> is of length 960.
> >>
> >> During iterations I need to update the contents of the HBlkDiag by taking
> >> the ith column, say c,  of the VectorReMat, forming c * c' and adding
> >> that
> >> to the refs[i]'th 3 by 3 diagonal block.  The current version of the code
> >> is
> >>
> >> function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T},
> >> B::VectorReMat{T})
> >>
> >>     c, a, r = C.arr, A.z, A.f.refs
> >>     fill!(c, 0)
> >>     if !is(A, B)
> >>    
> >>         throw(ArgumentError("Currently defined only for A === B"))
> >>    
> >>     end
> >>     for i in eachindex(r)
> >>    
> >>         Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, Int(r[i])))
> >>    
> >>     end
> >>     for k in 1:size(c, 3)
> >>    
> >>         Base.LinAlg.copytri!(sub(c, :, :, k), 'U')
> >>    
> >>     end
> >>     C
> >>
> >> end
> >>
> >> I thought I was being careful by using sub or slice (I think the effect
> >> is the same in this case) but apparently not.  When I track allocations
> >> while fitting the model being represented here I get
> >>
> >>         - function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T},
> >>
> >> B::VectorReMat{T})
> >>
> >>   1503384     c, a, r = C.arr, A.z, A.f.refs
> >>  
> >>         0     fill!(c, 0)
> >>         0     if !is(A, B)
> >>         0         throw(ArgumentError("Currently defined only for A ===
> >>
> >> B"))
> >>
> >>         -     end
> >>         0     for i in eachindex(r)
> >>
> >> -1352727904         Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :,
> >> Int(r[i])))
> >>
> >>         -     end
> >>         0     for k in 1:size(c, 3)
> >>
> >> 1016371200         Base.LinAlg.copytri!(sub(c, :, :, k), 'U')
> >>
> >>         -     end
> >>         0     C
> >>         - end
> >>
> >> As far as the copytri! call goes, I can write 3 nested loops to achieve
> >> the copy without all this allocation.  I suppose I can also replace the
> >> syr! call with explicit loops.  It turns out that there are a lot of
> >> places
> >> like this that I would need to back off Lapack/BLAS calls and run
> >> explicit
> >> loops if I want to make this fast in v 0.4
> >>
> >> Am I missing something here?  Is there a way to do the programming at a
> >> higher level and not suffer all these allocations?  Should I expect this
> >> problem to go away after the Arraypocalypse?
> >
> > If I rewrite that function as
> >
> > function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T},
> > B::VectorReMat{T})
> >
> >     c, a, r = C.arr, A.z, A.f.refs
> >     _, m, n = size(c)
> >     fill!(c, 0)
> >     if !is(A, B)
> >    
> >         throw(ArgumentError("Currently defined only for A === B"))
> >    
> >     end
> >     for k in eachindex(r)
> >    
> >         ri = Int(r[k])
> >         for j in 1 : m
> >        
> >             aj = a[j, k]
> >             c[j, j, ri] += abs2(aj)
> >             for i in 1 : j - 1
> >            
> >                 aij = a[i, k] * aj
> >                 c[i, j, ri] += aij
> >                 c[j, i, ri] += aij
> >            
> >             end
> >        
> >         end
> >    
> >     end
> >     C
> >
> > end
> >
> > my execution time goes from 110 seconds to 68 seconds.  There are a few
> > other places where I can do similar replacement of function calls by
> > explicit loops, even though that is aesthetically unpleasant, to make the
> > model fitting faster.