How to create a covariance matrix of matrix that contains NA values

How to create a covariance matrix of matrix that contains NA values

Jessica Koh
Hi all,

Is there a way to compute the covariance matrix of a matrix that contains NA values, using the "cov()" function from StatsBase?

--
You received this message because you are subscribed to the Google Groups "julia-stats" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/d/optout.
Re: How to create a covariance matrix of matrix that contains NA values

Andreas Noack
I think you'd have to remove them first. E.g. something like 

julia> using DataArrays

julia> X = DataArray(randn(10,2));

julia> X[2,1] = X[3,2] = NA;

julia> cov(X[!vec(any(isna(X), 2)),:])
2×2 DataArrays.DataArray{Float64,2}:
 1.19373   0.236507
 0.236507  0.524404
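
For readers on current Julia (1.x), where `missing` has taken the place of `NA`, the same row-wise (listwise) deletion can be sketched with only the Statistics standard library. This is a sketch under that assumption, not the code from the post above; the data and the name `complete_rows` are made up for illustration.

```julia
using Statistics

# Illustrative data: 6 observations of 2 variables, two entries missing.
X = Union{Missing,Float64}[1.0 2.0; missing 1.0; 3.0 missing; 4.0 3.0; 5.0 4.0; 6.0 7.0]

# Keep only the rows in which no entry is missing (listwise deletion).
complete_rows = vec(.!any(ismissing, X; dims=2))

# Convert the remaining rows to plain Float64 and compute the covariance matrix.
C = cov(Float64.(X[complete_rows, :]))
```

Here rows 2 and 3 are dropped entirely, so every entry of `C` is computed from the same four observations.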


Re: How to create a covariance matrix of matrix that contains NA values

Jessica Koh
Hello Andreas,

Sorry I deleted the post before you commented on this. Thank you so much for your comment!

Yes, I have already tried that, and it works great with 2 variables. However, I am dealing with multiple variables with missing values, and the locations of the missing values differ across variables. I want the covariance function to handle missing values by pairwise deletion: all available observations should be used to calculate each pairwise covariance, without regard to whether variables outside that pair are missing.

I can technically write the function from scratch to do this. But this seems like a basic problem, so I was guessing there might be some library already written that handles it. Do you suggest writing the function from scratch, or are you aware of existing functions that solve this?
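
Pairwise deletion as described here could be sketched as follows in current Julia, assuming `missing` in place of `NA` and using only the Statistics standard library. The helper name `pairwise_cov` is hypothetical and is not a StatsBase function.

```julia
using Statistics

# Hypothetical helper: covariance matrix with pairwise deletion.
function pairwise_cov(X::AbstractMatrix)
    p = size(X, 2)
    C = Matrix{Float64}(undef, p, p)
    for j in 1:p, i in 1:j
        # Use every row in which both column i and column j are observed,
        # regardless of what the other columns contain.
        keep = .!ismissing.(X[:, i]) .& .!ismissing.(X[:, j])
        xi = Float64.(X[keep, i])
        xj = Float64.(X[keep, j])
        C[i, j] = C[j, i] = cov(xi, xj)
    end
    return C
end

# Illustrative data: missing values sit in different rows for each column.
X = Union{Missing,Float64}[1.0 2.0 1.0; missing 1.0 2.0; 3.0 missing 4.0; 4.0 3.0 missing; 5.0 4.0 3.0]
C = pairwise_cov(X)
```

Because each entry is computed from a different subset of rows, the result is not guaranteed to be positive semidefinite, and a pair with fewer than two shared observations yields `NaN`.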


Re: How to create a covariance matrix of matrix that contains NA values

Michael Borregaard

You can test for NAs in the column sums: the sum of a column is NA whenever that column contains an NA.
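
A minimal sketch of that check in current Julia, assuming `missing` in place of `NA`; the data and the name `has_missing` are illustrative.

```julia
X = Union{Missing,Float64}[1.0 2.0 3.0; missing 1.0 2.0; 3.0 4.0 5.0]

# A column's sum is `missing` as soon as one of its entries is missing,
# so testing the sums flags the affected columns.
has_missing = [ismissing(sum(col)) for col in eachcol(X)]
```

Note that this identifies columns containing missing values, not the rows, so it does not by itself give pairwise deletion.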

Re: How to create a covariance matrix of matrix that contains NA values

Milan Bouchet-Valat
In reply to this post by Jessica Koh
On Tuesday, 7 June 2016 at 17:23 -0700, Jessica Koh wrote:

> Hello Andreas,
>
> Sorry I deleted the post before you commented on this. Thank you so
> much for your comment!
>
> Yes, I have already tried that, and that works great with 2
> variables. However, I am dealing with multiple variables with missing
> values, and the location of missing values differ across different
> variables. I want the covariance function to handle missing values by
> pairwise deletion; all available observations should be used to
> calculate each pairwise covariance without regard to whether
> variables outside that pair are missing.
>
> I can technically write up the function from scratch to do this. But
> this seems like a basic problem, so I was guessing there might be
> some library already written that handle this. Do you suggest writing
> the function from scratch, or are you aware of the existing functions
> to solve this? 
You're right that it's an essential function. I think we should write
one based on the Nullable framework instead of on the NA/DataArrays one
(which is on its way out). That function could either live in
StatsBase.jl or in NullableArrays.jl.


Regards

Re: How to create a covariance matrix of matrix that contains NA values

Andreas Noack
It would be great if we could come up with a solution where the NA/Nullable handling wouldn't have to be hard coded in a specific statistical function, say cov. It's early and I haven't had coffee yet so the idea is probably flawed but, in general, it might be useful to use a dedicated `Accumulator` type when doing accumulations, e.g. a sum would be something like

function sum(x::AbstractVector)
    acc = Acc{eltype(x) + eltype(x)}(0)
    for xx in x
        acc !+ xx
    end
end

Then, instead of specifying the NA handling for every statistical function, it would be a matter of defining something like `(!+)(x::Acc, y::Nullable) = x` to "remove" the effect of NAs in the accumulation. Of course, you don't always want to remove NAs, so this would have to be adjustable. What kind of functionality exists in NullableArrays for handling Nullables in different ways?

The original reason I started considering the accumulator type was to have a way of handling memory reuse, e.g. for BigFloats and JuMP expressions, but maybe it could also be useful for NA/Nullable handling.
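
One way the accumulator idea might look as runnable code in current Julia is sketched below. It assumes `missing` in place of NA/Nullable, and a named function `add!` stands in for the `!+` operator of the illustration above. The names `Acc`, `add!`, and `acc_sum` are hypothetical and do not come from any package.

```julia
# Hypothetical accumulator type holding a running total.
mutable struct Acc{T}
    value::T
end

# Ordinary values are added to the running total.
add!(acc::Acc, y) = (acc.value += y; acc)

# Missing values leave the accumulator untouched, which "removes" them.
add!(acc::Acc, ::Missing) = acc

function acc_sum(x::AbstractVector)
    T = nonmissingtype(eltype(x))   # element type without Missing (Julia 1.3+)
    acc = Acc{T}(zero(T))
    for xx in x
        add!(acc, xx)
    end
    return acc.value
end

total = acc_sum([1.0, missing, 2.5])
```

The loop itself contains no missing-value logic; swapping the `add!(::Acc, ::Missing)` method for one that propagates `missing` would change the policy without touching `acc_sum`.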


Re: How to create a covariance matrix of matrix that contains NA values

David Gold
There are two main ways of dealing with nullables in NullableArrays, and they're both keyword arguments: (i) `lift`, in `map` and `broadcast`, and (ii) `skipnull` in reducing functions. The behavior they elicit is pretty much what their names describe; `lift=true` will lift `f` over the entries of `X` in `map(f, X)`, and `skipnull=true` will skip null entries in a reducing function. Both are disabled by default.
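
As a rough analogy in current Julia, which uses `missing` rather than NullableArrays, skipping in a reduction and lifting a function over entries could be sketched as below. This does not use the NullableArrays API; `lift` here is a hypothetical helper written for illustration, while `skipmissing` is part of Base.

```julia
x = [1.0, missing, 3.0]

# Analogue of `skipnull=true`: skip missing entries in a reduction.
total = sum(skipmissing(x))

# Analogue of `lift=true`: apply `f` to present values and pass missing through.
lift(f) = v -> ismissing(v) ? missing : f(v)
y = map(lift(sqrt), x)
```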

Re: How to create a covariance matrix of matrix that contains NA values

Jessica Koh
In reply to this post by Andreas Noack
I actually really agree with this! Does it mean we need to change the existing function's source code to deal with the problem as you suggest?

Re: How to create a covariance matrix of matrix that contains NA values

Milan Bouchet-Valat
On Friday, 24 June 2016 at 23:09 -0700, Jessica Koh wrote:
> I actually really agree with this! Does it mean we need to change the
> existing function's source code to deal with the problem as you
> suggest?
The code for cov() will have to be more complex than the sum()
illustration Andreas gave: to compute the covariance, you need to skip
all observations for which one of the two variables is missing. This
gets even more complex when computing covariances between columns of
matrices, since you need to decide whether to skip rows with at least
one missing value, or to use different row subsets depending on the
pairs of columns involved.

Alternatively, I wonder whether this problem could be solved using
special pseudo-weights types. This could allow sharing the code with
the weighted covariance function. A special weights type would simply
be passed, with weight 1 for non-missing observations, and 0 for
missing ones. These values could be a custom (internal) number type for
which 0 * NULL would return 0, in order to skip these observations.

Anyway, one will need to experiment with these approaches in practice
to see whether that would work.
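
To make the pseudo-weights idea concrete for a single pair of variables, here is a rough sketch in current Julia. It assumes `missing` in place of NULL and swaps in `coalesce` with a placeholder value instead of the custom number type suggested above; the name `weighted_cov` is hypothetical.

```julia
# Covariance of two vectors using 0/1 pseudo-weights (illustrative helper).
function weighted_cov(x, y)
    # Weight 1 where both values are observed, 0 otherwise.
    w = .!ismissing.(x) .& .!ismissing.(y)
    # Replace missing entries by a placeholder; their weight is 0,
    # so the placeholder never contributes to any sum.
    xs = coalesce.(x, 0.0)
    ys = coalesce.(y, 0.0)
    n = sum(w)
    mx = sum(w .* xs) / n
    my = sum(w .* ys) / n
    return sum(w .* (xs .- mx) .* (ys .- my)) / (n - 1)
end

x = [1.0, missing, 3.0, 4.0, 5.0]
y = [2.0, 1.0, missing, 3.0, 4.0]
c = weighted_cov(x, y)
```

Only the three rows where both `x` and `y` are observed carry weight, so the result matches the covariance computed after dropping the other two rows.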


Regards

Re: How to create a covariance matrix of matrix that contains NA values

Jessica Koh
Yes, computing the covariance with pairwise deletion will be troublesome.

The more common way of computing a covariance matrix for multi-column data (for example, in Stata) is to keep only the rows that have no missing values in any column, and to compute the covariance matrix of that reduced data frame.

With cov(), I can do something like the following for an existing data frame called "df":

# Create new dataframe with no missing values at all
df_new = df
for col in names(df)
     df_new = df_new[!isna(df_new[col]), :]
end

# Compute covariance matrix for new dataframe
cov(Array(df_new))

Although this code is short, I still think it would be nice if the "cov" function automatically dropped the rows with missing values and computed the covariance, rather than returning "NA" as the covariance whenever there is an NA value.

Adding a "pairwise" option could be the next step!

On Saturday, June 25, 2016 at 8:38:36 AM UTC-5, Milan Bouchet-Valat wrote:
Le vendredi 24 juin 2016 à 23:09 -0700, Jessica Koh a écrit :
> I actually really agree with this! Does it mean we need to change the
> existing function's source code to deal with the problem as you
> suggest?
The code for cov() will have to be more complex that the sum()
illustration Andreas gave: to compute the covariance, you need to skip
all observations for which one of the two variables is missing. This
gets even more complex when computing covariances between columns of
matrices, since you need to decide whether to skip rows with at least
one missing value, or to use different row subsets depending on the
pairs of columns involved.

Alternatively, I wonder whether this problem could be solved using
special pseudo-weights types. This could allow sharing the code with
the weighted covariance function. A special weights type would simply
be passed, with weight 1 for non-missing observations, and 0 for
missing ones. These values could be a custom (internal) number type for
which 0 * NULL would return 0, in order to skip these observations.

Anyway, one will need to experiment with these approaches in practice
to see whether that would work.


Regards

> > It would be great if we could come up with a solution where the
> > NA/Nullable handling wouldn't have to be hard coded in a specific
> > statistical function, say cov. It's early and I haven't had coffee
> > yet so the idea is probably flawed but, in general, it might be
> > useful to use a dedicated `Accumulator` type when doing
> > accumulations, e.g. a sum would be something like
> >
> > function sum(x::AbstractVector)
> >     acc = Acc{eltype(x) + eltype(x)}(0)
> >     for xx in x
> >         acc !+ xx
> >     end
> > end
> >
> > then instead of specifying the NA handling for every statistical
> > function. It would be a matter of defining something like
> > `(!+)(x::Acc, y::Nullable) = x` to "remove" the effect of NAs in
> > the accumulation. Of course, you don't always want to remove NAs so
> > this would have to be adjustable. What kind of functionality exists
> > in NullableArrays for handling Nullable is different ways?
> >
> > The original reason I've started to consider the accumulator type
> > is to have a way of handling memory reuse, e.g. for BigFloats and
> > JuMP expressions but maybe it could also be useful for NA/Nullable
> > handling.
> >
> >
> > On Wed, Jun 8, 2016 at 4:42 AM, Milan Bouchet-Valat
> >  wrote:
> > > Le mardi 07 juin 2016 à 17:23 -0700, Jessica Koh a écrit :
> > > > Hello Andreas,
> > > >
> > > > Sorry I deleted the post before you commented on this. Thank
> > > you so
> > > > much for your comment!
> > > >
> > > > Yes, I have already tried that, and that works great with 2
> > > > variables. However, I am dealing with multiple variables with
> > > missing
> > > > values, and the location of missing values differ across
> > > different
> > > > variables. I want the covariance function to handle missing
> > > values by
> > > > pairwise deletion; all available observations should be used to
> > > > calculate each pairwise covariance without regard to whether
> > > > variables outside that pair are missing.
> > > >
> > > > I can technically write up the function from scratch to do
> > > this. But
> > > > this seems like a basic problem, so I was guessing there might
> > > be
> > > > some library already written that handle this. Do you suggest
> > > writing
> > > > the function from scratch, or are you aware of the existing
> > > functions
> > > > to solve this? 
> > > You're right that it's an essential function. I think we should
> > > write
> > > one based on the Nullable framework instead of on the
> > > NA/DataArrays one
> > > (which is on its way out). That function could either live in
> > > StatsBase.jl or in NullableArrays.jl.
> > >
> > >
> > > Regards
> > >
> > > > > I think you'd have to remove them first. E.g. something like 
> > > > >
> > > > > julia> X = DataArray(randn(10,2));
> > > > >
> > > > > julia> X[2,1] = X[3,2] = NA;
> > > > >
> > > > > julia> cov(X[!vec(any(isna(X), 2)),:])
> > > > > 2×2 DataArrays.DataArray{Float64,2}:
> > > > >  1.19373   0.236507
> > > > >  0.236507  0.524404
> > > > >
> > > > >
> > > > > > On Tue, Jun 7, 2016 at 6:26 PM, Jessica Koh wrote:
> > > > > > Hi all,
> > > > > >
> > > > > > Is there a way to create a covariance matrix of matrix that
> > > > > > contains NA values, using "cov()" function from StatsBase?
> > > > > >
> > >

Re: How to create a covariance matrix of matrix that contains NA values

Jessica Koh
In reply to this post by Milan Bouchet-Valat
Yes, computing the covariance with pairwise deletion will be troublesome.

A more common way of computing a covariance matrix for multi-column data (for example, in STATA) is to keep only the rows that have no missing values in any column and then compute the covariance matrix of that reduced dataframe.

With cov(), I can do something like:

# Create new dataframe with no missing values at all
df_new = df      # assume "df" is the dataframe I have.
for col in names(df)
     df_new = df_new[!isna(df_new[col]), :]
end

# Compute covariance matrix for new dataframe
cov(Array(df_new))

Although this code is short, I still think it would be nice if the "cov" function automatically dropped the rows with missing values and computed the covariance, rather than returning "NA" as the covariance whenever there is any NA value.

Adding a "pairwise" option could be the next step!
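
For readers on current Julia, where `missing` has replaced `NA`, the row-dropping approach above can be sketched without DataFrames. This is only an illustration under that assumption: `cov_listwise` is a hypothetical helper name, not a function from StatsBase or any other package.

```julia
using Statistics

# Hypothetical helper: covariance matrix after dropping every row
# that contains at least one missing value (listwise deletion).
function cov_listwise(X::AbstractMatrix)
    # keep[i] is true when row i has no missing entry
    keep = [!any(ismissing, view(X, i, :)) for i in axes(X, 1)]
    # The kept rows hold no missing values, so they convert to Float64
    return cov(Float64.(X[keep, :]))
end

X = [1.0 2.0; 2.0 missing; 3.0 6.0; missing 1.0; 5.0 10.0]
C = cov_listwise(X)   # uses rows 1, 3 and 5 only
```

The trade-off is that a row is lost entirely even when only one of its columns is missing.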

On Saturday, June 25, 2016 at 8:38:36 AM UTC-5, Milan Bouchet-Valat wrote:
On Friday, June 24, 2016 at 23:09 -0700, Jessica Koh wrote:
> I actually really agree with this! Does it mean we need to change the
> existing function's source code to deal with the problem as you
> suggest?
The code for cov() will have to be more complex than the sum()
illustration Andreas gave: to compute the covariance, you need to skip
all observations for which one of the two variables is missing. This
gets even more complex when computing covariances between columns of
matrices, since you need to decide whether to skip rows with at least
one missing value, or to use different row subsets depending on the
pairs of columns involved.
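
The second option, using a different row subset for each pair of columns, might look roughly like the following in current Julia. It assumes `missing` rather than `NA`/`Nullable`, and `cov_pairwise` is a hypothetical name used only for illustration, not an existing StatsBase function.

```julia
using Statistics

# Hypothetical helper: entry (i, j) is computed from the rows in which
# both column i and column j are observed (pairwise deletion).
function cov_pairwise(X::AbstractMatrix)
    p = size(X, 2)
    C = Matrix{Float64}(undef, p, p)
    for j in 1:p, i in 1:j
        xi, xj = view(X, :, i), view(X, :, j)
        # Rows usable for this particular pair of columns
        ok = [!ismissing(a) && !ismissing(b) for (a, b) in zip(xi, xj)]
        C[i, j] = C[j, i] = cov(Float64.(xi[ok]), Float64.(xj[ok]))
    end
    return C
end

X = [1.0 2.0     1.0;
     2.0 missing 0.0;
     3.0 6.0     missing;
     4.0 8.0     3.0]
C = cov_pairwise(X)
```

Because each entry is based on a different set of rows, the resulting matrix is not guaranteed to be positive semidefinite, which is one reason the choice between the two strategies has to be left to the user.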

Alternatively, I wonder whether this problem could be solved using
special pseudo-weight types. This could allow sharing the code with
the weighted covariance function. A special weights type would simply
be passed, with weight 1 for non-missing observations, and 0 for
missing ones. These values could be a custom (internal) number type for
which 0 * NULL would return 0, in order to skip these observations.
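
A rough sketch of how such a pseudo-weight could behave for a single pair of variables, written for current Julia with `missing` standing in for NULL. The type `SkipWeight` and the function `cov_skip` are hypothetical names invented for this illustration; they are not part of StatsBase, and the real weighted covariance code may be organized quite differently.

```julia
# Hypothetical internal weight type: an "off" weight absorbs missing values.
struct SkipWeight
    on::Bool
end

Base.:*(w::SkipWeight, x::Real) = w.on ? float(x) : zero(float(x))
# The "0 * NULL returns 0" rule: an off weight times missing is 0.0
Base.:*(w::SkipWeight, ::Missing) = w.on ? missing : 0.0

function cov_skip(x, y)
    # Weight is on only where both variables are observed
    w = [SkipWeight(!ismissing(a) && !ismissing(b)) for (a, b) in zip(x, y)]
    n = count(wi -> wi.on, w)
    mx = sum(w .* x) / n
    my = sum(w .* y) / n
    # Missing cross-products are multiplied by an off weight and vanish
    s = sum(wi * ((a - mx) * (b - my)) for (wi, a, b) in zip(w, x, y))
    return s / (n - 1)
end

x = [1.0, 2.0, missing, 4.0]
y = [2.0, missing, 5.0, 8.0]
c = cov_skip(x, y)   # only the pairs (1.0, 2.0) and (4.0, 8.0) contribute
```

The accumulation loop itself never tests for missing values; all of the skipping happens inside the multiplication rule, which is what would let the code be shared with a weighted covariance.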

Anyway, one will need to experiment with these approaches in practice
to see whether that would work.


Regards
