isna() no longer works in DataFrames?


Bradley Setzler
isna() used to return a matrix of Booleans of the same size as the DataFrame; apparently it now returns a single Boolean for the entire DataFrame:

julia> a
2x2 DataFrame
|-------|-----|-----|
| Row # | x1  | x2  |
| 1     | NA  | 2.0 |
| 2     | 3.0 | 4.0 |

julia> isna(a)
false

I very much need a function that can tell me which entries in the DataFrame are missing.

Thanks,
Bradley
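
For readers on current Julia versions (1.0 and later), here is a minimal sketch of the elementwise check being asked for. It is an illustration under stated assumptions, not the DataFrames API discussed in this thread: `missing` and `ismissing` from Base stand in for `NA` and `isna`, and a plain matrix stands in for the 2x2 DataFrame `a`.

```julia
# Assumption: `missing` (Base Julia) stands in for the thread's `NA`,
# and a plain matrix stands in for the DataFrame `a`.
a = [missing 2.0; 3.0 4.0]

# Broadcasting `ismissing` yields one Bool per cell,
# the same shape as the data (the behaviour isna() used to have).
mask = ismissing.(a)

# (row, column) positions of the missing cells.
missing_cells = Tuple.(findall(mask))
```

The mask has the same shape as the data, so it can be used directly to index the cells that need attention.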

--
You received this message because you are subscribed to the Google Groups "julia-stats" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/d/optout.

Re: isna() no longer works in DataFrames?

John Myles White
Why do you very much need one? What's a use case?

In general, there's a push to remove functionality from DataFrames that isn't strictly necessary. isna() was one of those things.

 -- John


Re: isna() no longer works in DataFrames?

Taylor Maxwell
In reply to this post by Bradley Setzler
The function complete_cases returns a Vector{Bool} marking which rows are complete cases (rows with no NAs).
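
To make the row-level idea concrete, here is a rough sketch of how a complete-case flag can be derived from a cell-level mask. It assumes current Julia, with `missing` standing in for `NA` and a plain matrix standing in for the DataFrame; the variable names are hypothetical and this is not the package's implementation of complete_cases.

```julia
# Assumption: `missing` stands in for `NA`; a plain matrix stands in for the DataFrame.
data = [missing 2.0; 3.0 4.0]

# One Bool per cell: true where the value is missing.
mask = ismissing.(data)

# A row is a complete case when none of its cells is missing.
complete = vec(.!any(mask, dims=2))

# Keep only the complete rows.
complete_rows = data[complete, :]
```

Note that the result is one flag per row, so it cannot say which cell in an incomplete row is the missing one.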


Re: isna() no longer works in DataFrames?

Bradley Setzler
In reply to this post by John Myles White
Hi John,

I'm doing cell-level imputation. I need to index the cells that need to be imputed.

Best,
Bradley




Re: isna() no longer works in DataFrames?

John Myles White
Why not use a DataMatrix then? In general, I'm very uncomfortable talking about the cells of a DataFrame, since I distrust the idea that the rows have an ordering.

 -- John


Re: isna() no longer works in DataFrames?

Bradley Setzler
In reply to this post by Bradley Setzler
Simple use case: Suppose you have 100 individuals (rows) and 100 variables (columns). Each individual is missing exactly one value, and no column has more than one missing. If you keep only complete_cases, you have 0 rows in your data. If you keep only complete columns, you have 0 columns in your data.

So instead, you impute the missings according to some algorithm. But how do you know where the missings are to impute them? I used to use isna() to find them.

Best,
Bradley
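
As an illustration of the use case above, here is a small sketch of cell-level imputation on a matrix. The post does not say which imputation algorithm is used, so column-mean imputation is swapped in purely as an example. The function name `impute_column_means` is hypothetical, `missing` stands in for `NA`, and the code assumes current Julia and that every column has at least one observed value.

```julia
using Statistics  # standard library, provides `mean`

# Hypothetical helper: fill each missing cell with the mean of the
# observed values in its column. Assumes every column has at least
# one observed value.
function impute_column_means(data::AbstractMatrix)
    filled = Matrix{Float64}(undef, size(data))
    for j in axes(data, 2)
        col = view(data, :, j)
        col_mean = mean(skipmissing(col))
        for i in axes(data, 1)
            # Only the cells flagged as missing are replaced.
            filled[i, j] = ismissing(col[i]) ? col_mean : col[i]
        end
    end
    return filled
end

data = [missing 2.0; 3.0 4.0; 5.0 missing]
filled = impute_column_means(data)
```

Every row and every column of `data` survives, which is the point of the example: dropping incomplete rows here would leave a single row.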




Re: isna() no longer works in DataFrames?

John Myles White
Ok, so you do have a matrix of data. Why not use a matrix?

 -- John


Re: isna() no longer works in DataFrames?

Bradley Setzler
Okay, great. You are saying that isna() still works for DataArrays, which I just confirmed. That should do the job.

Thanks,
Bradley



Re: isna() no longer works in DataFrames?

John Myles White
Yeah, DataArrays will always have array-like properties. What I'm uncomfortable with is the use of DataFrames as heterogeneous-column-typed matrices. That creates an odd situation in which DataFrames have properties that mean you can't safely translate them into databases.

 -- John


Re: isna() no longer works in DataFrames?

K Leo
In reply to this post by John Myles White
Hi John,

This sounds very confusing.  Is there a change in the concept of
DataFrames?  If so, then someone needs to write a clear
definition/description of DataFrames.  As far as I understand, rows have
indexes and can be sorted, which means they have ordering.  I do use
these features extensively.  Are they going away?

On 2014-09-05 00:50, John Myles White wrote:
> In general, I'm very uncomfortable talking about the cells of a
> DataFrame, since I distrust the idea that the rows have an ordering.
>
>  -- John
>


Re: isna() no longer works in DataFrames?

John Myles White
I'm planning to do a bunch more work to clean up DataFrames for the 0.4 release so that it's in a stable, usable state and has all of the essential functionality you might want. After that, my current plan is to transition my personal work efforts to drafting a new DataTables package that will have an interface that allows one to switch between in-memory data stores and on-disk data stores. Over time, I've come to see DataFrames as an evolutionary dead-end in the analytics stack: they're very convenient, but they achieve convenience by sacrificing safety, scalability and interoperability. I'd much prefer working on a plan for a brighter future.

By the time I transition to working on DataTables, I imagine that many people in the community will have stepped up and taken over maintenance of DataFrames. As is, Simon and Sean already do more work on DataFrames than I do these days. I imagine others will also take charge over the coming year or two.

Note that your concern about sorting is in part misplaced: sorting DataFrames can always be done based on data that's actually contained in the DataFrame. The problem with depending on implicit row indices is that you're working with an identifier for each row that is (a) constantly changing and (b) not contained in the actual data set you're working with. This makes interop with SQL impossible because SQL never pretends that tables are matrices and doesn't let you talk about a row index for Row A that could be changed by mutating Row B. The whole idea that changing Row B can affect Row A strikes me as a bad design decision that was wisely left out of SQL and the relational model.

Long story short: if you really want to work with a matrix, I think you should use a DataMatrix. If you like the fact that DataFrames behave like matrices, they'll continue to do so into the indefinite future. But I'm personally going to stop doing work to support that behavior in the mid-term future, because I think it's a bad design decision that I'd like to see fade into the thankfully forgotten past. I think most people will benefit from adopting a strict separation between tables and matrices, because it will enable a whole range of new optimizations that are currently difficult or impossible to achieve.

 -- John
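
The point about implicit row indices can be sketched in a few lines. This is only an illustration in current Julia using a vector of named tuples, not DataFrames or SQL; the `id` field is a hypothetical key column stored in the data itself.

```julia
# Rows as named tuples; `id` is a key that is part of the data.
rows = [(id = "A", x = 1.0), (id = "B", x = 2.0), (id = "C", x = 3.0)]

# Positional index of row C before and after row B is deleted.
pos_before = findfirst(r -> r.id == "C", rows)
deleteat!(rows, 2)
pos_after = findfirst(r -> r.id == "C", rows)

# Looking the row up by its key is unaffected by what happened to row B.
row_c = first(filter(r -> r.id == "C", rows))
```

Row C itself never changed, yet its positional index did, whereas the key-based lookup returns the same row both before and after the deletion.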


Re: isna() no longer works in DataFrames?

Harlan Harris
FWIW (not much these days), +1. I always wanted to push the scalability of DFs, and never got a chance to do so.



Re: isna() no longer works in DataFrames?

Michael Smith
In reply to this post by John Myles White
John,

More power to you with your DataTables package; it sounds wonderful.

Just out of curiosity, will it be similar to PyTables or some other
existing software, or will it be completely different?

Thanks,

M
