DataFrame and Memory Limitations

Michael Smith
All,

Are there currently any solutions in Julia for handling larger-than-memory
datasets in a way similar to working with a DataFrame?

The reason I'm asking is that R has the limitation that you need to fit
all your data into memory. On the other hand, SAS (while being quite
different) does not have this limitation.

In the age of "big data" this can be quite an advantage.

Of course, you can "patch" this situation, e.g. in R you can use the ff
or bigmemory packages, or use SQL.

But my point is that it is bolted on, and you need to spend extra mental
loops switching between, say, data.frame and ff, instead of focusing on
your data problem at hand. This is a clear advantage of SAS, where you
don't have to do that. So I'm wondering how this is handled in Julia.

Thanks,

M

P.S.: I do not intend to start a flame war, e.g. whether R or SAS or
Julia is better. I'm just interested to find out whether such a solution
exists in Julia (I haven't found any, but maybe I overlooked something).
And if no such solution exists, given that Julia is still young,
evolving, and malleable (in a positive sense), it might make sense to
think about it.

--
You received this message because you are subscribed to the Google Groups "julia-stats" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/d/optout.

Re: DataFrame and Memory Limitations

Harlan Harris
Not currently, but it's been talked about for as long as there have been DataFrames in Julia. See these issues, and the references therein, for a start:

https://github.com/JuliaStats/DataFrames.jl/issues/25
https://github.com/JuliaStats/DataFrames.jl/issues/26

Also look around the package and issue lists for DataStreams (which I believe is not currently functional), which is a related effort.


Re: DataFrame and Memory Limitations

Ramesh Fernando
In reply to this post by Michael Smith
Hi, I don't know Julia, but in R you don't need to load all your data into memory; just as in SAS, you can read it off disk. One route is the proprietary Revolution Analytics R, which I think works with Hortonworks/Cloudera, Hadoop, and YARN, and which I saw demonstrated at the R user group here in Ottawa, Canada. (I don't know whether there is a Julia package for YARN. I know little about Hadoop and YARN, and I'm not really interested in Java, so I suggest you contact someone at Hortonworks or Revolution Analytics.) Revolution R's other proprietary methods, as well as bigmemory (http://cran.r-project.org/web/packages/bigmemory/index.html and http://www.bigmemory.org/), can also handle more data. Here is a discussion on large size data.



Re: DataFrame and Memory Limitations

John Myles White
In reply to this post by Harlan Harris
At some point, we need to create some additional data tools for working with data sets that do not fit in memory. Harlan’s list touches on a lot of the best strategies for doing that in a way that would smoothly integrate with the rest of the language.

 — John


Re: DataFrame and Memory Limitations

Michael Smith
Thanks everybody. It should come as no surprise to me that the Julia
community is already working on this. Awesome.

One minor point that I have not seen discussed in the issues is a
reference to the plyrmr package, which is essentially plyr/dplyr for
Hadoop.

https://github.com/RevolutionAnalytics/RHadoop/wiki/user%3Eplyrmr%3EHome

Maybe it's possible to pillage some ideas from there.

M


On 08/06/2014 12:23 PM, John Myles White wrote:

> At some point, we need to create some additional data tools for working
> with data sets that do not fit in memory. Harlan's list touches on a lot
> of the best strategies for doing that in a way that would smoothly
> integrate with the rest of the language.
>
>  -- John
>
> On Aug 5, 2014, at 7:48 AM, Harlan Harris wrote:
>
>> Not currently, but it's been talked about as long as there's been
>> DataFrames in Julia. See these issues, and references therein, for a
>> start:
>>
>> https://github.com/JuliaStats/DataFrames.jl/issues/25
>> https://github.com/JuliaStats/DataFrames.jl/issues/26
>>
>> Also look around in the package and issues list for DataStreams (which
>> I believe are not currently functional) which is a related issue.

Re: DataFrame and Memory Limitations

John Myles White
Isn't Hive already "plyr for Hadoop"?

 -- John


Re: DataFrame and Memory Limitations

Randy Zwitch
Yes, and FWIW, I use Julia and Hive all the time via ODBC.jl. I just don't do it within the context of a "dataframe".
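
As an illustration of that pattern (querying the engine directly and pulling back only the result, with no dataframe in between), here is a minimal Python sketch. It is not the ODBC.jl workflow itself: Python's built-in sqlite3 module stands in for Hive over ODBC, and the `events` table, its columns, and the `grouped_means` helper are hypothetical names chosen for the example.

```python
import sqlite3


def grouped_means(conn, table, group_col, value_col):
    """Let the database aggregate; only one row per group comes back."""
    # Identifiers are interpolated into the SQL text, so they must come
    # from trusted code, never from user input.
    query = (
        f"SELECT {group_col}, AVG({value_col}) "
        f"FROM {table} GROUP BY {group_col} ORDER BY {group_col}"
    )
    return conn.execute(query).fetchall()


# Build a tiny in-memory table so the sketch runs end to end.
# With a real warehouse, the table would already exist on the server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("east", 10.0), ("east", 20.0), ("west", 5.0)],
)

result = grouped_means(conn, "events", "region", "amount")
print(result)
```

The point of the design is that the raw rows never reach the client: the engine scans the full table and the client only holds the aggregated result.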


Re: DataFrame and Memory Limitations

Michael Smith
Hive is great, but it's SQL-like, whereas plyrmr is R, which is more
similar to Julia. Because of this difference, plyrmr might be worth
keeping in mind. Also, Antonio Piccolboni (the guy behind much of this)
has done an excellent job integrating R with Hadoop (mostly MapReduce),
so maybe there's something to be learned.

On a different note, Spark seems to be the new big thing and is going to
make much of MapReduce obsolete (based on what I hear from people in the
industry), so it might be important to keep in mind that the industry
is moving towards Spark (and Tez) and away from MapReduce. (This is not
to say that Hadoop will be deprecated, since Spark and Tez integrate
well with Hadoop; just that MapReduce, which is part of Hadoop, will
mainly be used to keep old projects running, while new development will
mainly be done for Spark (and Tez).)

Anyway, this is leading us more toward _really_ big data. In contrast,
what I had in mind was something like what PyTables does for Python
(i.e. sort of _intermediate_ big data, not really big data), but with
better integration with DataFrame. I think HDF5 (not HDFS) has already
been discussed in one of the issues, so things look fine (although I
haven't seen PyTables mentioned explicitly in the GitHub issues mentioned
by Harlan, and since PyTables provides an abstraction on top of HDF5,
PyTables might also be worth considering as a source of ideas).
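
To make the "intermediate" case concrete, here is a minimal Python sketch of the out-of-core idea: a column statistic computed one chunk at a time, so that only a single chunk is ever held in memory. It does not use PyTables or HDF5 (only the standard library), and the function names, the chunk size, and the sample data are all hypothetical.

```python
import csv
import io
from itertools import islice


def chunked_rows(reader, chunk_size):
    """Yield lists of at most chunk_size rows from a csv reader."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def streaming_mean(lines, column, chunk_size=2):
    """Mean of one numeric column, holding one chunk in memory at a time."""
    reader = csv.DictReader(lines)
    total, count = 0.0, 0
    for chunk in chunked_rows(reader, chunk_size):
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    return total / count if count else float("nan")


# A small in-memory "file" keeps the sketch runnable; a real dataset would
# be an open file handle on disk, read lazily in exactly the same way.
sample = io.StringIO("id,value\n1,1.0\n2,2.0\n3,3.0\n4,4.0\n5,5.0\n")
mean_value = streaming_mean(sample, "value")
print(mean_value)
```

A dataframe-like wrapper would hide the chunk loop behind ordinary column operations, which is roughly the kind of integration being asked for here.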

Anyway, that's my core dump, hope it helps.

Cheers,
M


>     > To unsubscribe from this group and stop receiving emails from it,
>     send an email to [hidden email] <javascript:>.
>     > For more options, visit https://groups.google.com/d/optout
>     <https://groups.google.com/d/optout>.
>
> --
> You received this message because you are subscribed to the Google
> Groups "julia-stats" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to [hidden email]
> <mailto:[hidden email]>.
> For more options, visit https://groups.google.com/d/optout.


Re: DataFrame and Memory Limitations

Ariel Katz
In reply to this post by Michael Smith
What about something like Continuum's Blaze, where there is a single table and array interface to a variety of data formats and computational engines?

Calculations are separated from the data, so backends can be swapped out and driven by a common syntax.

http://blaze.pydata.org/docs/latest/index.html 


Re: DataFrame and Memory Limitations

John Myles White
If somebody wants to build that, it would be awesome.

 -- John

On Aug 13, 2014, at 2:42 PM, Ariel Katz <[hidden email]> wrote:

What about something like Continuum's Blaze, where there is a single table and array interface to a variety of data formats and computational engines?

Calculations are separated from the data, so backends can be swapped out and driven by a common syntax.




Re: DataFrame and Memory Limitations

Ariel Katz
In reply to this post by Michael Smith
What about something like Continuum's Blaze, where there is a single table and array interface to a variety of data formats and computational engines?

Calculations are separated from the data, so backends can be swapped out and driven by a common syntax.

Spark is one of the compatible formats.
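
The idea of separating the calculation from the data can be sketched roughly as follows. This is not Blaze's API; it is a minimal Python illustration, and the names `Mean`, `compute_on_rows` and `compute_on_sqlite` are hypothetical. A single expression object describes the computation, and each backend decides how to evaluate it.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Mean:
    """Backend-independent description of 'the mean of one column' (hypothetical)."""
    column: str


def compute_on_rows(expr, rows):
    """Evaluate the expression on an in-memory list of dicts."""
    values = [row[expr.column] for row in rows]
    return sum(values) / len(values)


def compute_on_sqlite(expr, conn, table):
    """Evaluate the same expression by pushing the work down to SQLite.

    Identifiers are interpolated directly into the query, so this sketch
    assumes trusted table and column names.
    """
    query = f"SELECT AVG({expr.column}) FROM {table}"
    return conn.execute(query).fetchone()[0]


# One expression, written once.
expr = Mean("price")

# Backend 1: plain Python objects in memory.
rows = [{"price": 1.0}, {"price": 2.0}, {"price": 6.0}]

# Backend 2: a database, which could just as well live on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (price REAL)")
conn.executemany("INSERT INTO items VALUES (?)", [(r["price"],) for r in rows])

in_memory_result = compute_on_rows(expr, rows)
sqlite_result = compute_on_sqlite(expr, conn, "items")
```

Both calls evaluate the same `Mean("price")` expression; only the engine doing the work changes, which is what would let a larger-than-memory backend be swapped in without rewriting the analysis.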





Re: DataFrame and Memory Limitations

Ariel Katz
In reply to this post by John Myles White
Considering the resources and time invested in Blaze so far, that would certainly be a significant undertaking.

They already provided Julia bindings for Bokeh; I wonder if they would consider doing the same for Blaze...

On Wednesday, August 13, 2014 5:42:45 PM UTC-4, John Myles White wrote:
If somebody wants to build that, it would be awesome.

 -- John


Re: DataFrame and Memory Limitations

John Myles White
I don't think Continuum has anything to port with the Julia bindings for Bokeh. I believe that project's largely the work of a single volunteer: https://github.com/samuelcolvin/Bokeh.jl/graphs/contributors

 -- John

On Aug 13, 2014, at 2:53 PM, Ariel Katz <[hidden email]> wrote:

Considering the resources and time invested in Blaze so far, that would certainly be a significant undertaking.

They already provided Julia bindings for Bokeh; I wonder if they would consider doing the same for Blaze...


Re: DataFrame and Memory Limitations

Ariel Katz
Ah ok.

On Wednesday, August 13, 2014 5:56:36 PM UTC-4, John Myles White wrote:
I don't think Continuum has anything to port with the Julia bindings for Bokeh. I believe that project's largely the work of a single volunteer: https://github.com/samuelcolvin/Bokeh.jl/graphs/contributors

 -- John


Re: DataFrame and Memory Limitations

Juan
In reply to this post by Michael Smith
I agree, we need a platform able to work transparently with data of any size, at least bigger than memory, and ideally distributed across several computers.

Solutions such as Spark are not complete; they only offer the basic functionality on which to build something else.

We don't just need to be able to get some summaries, as we do with databases; we need to be able to do all operations with big data, such as multiplying two big matrices, fitting mixed-effects models, MCMC, etc.

Packages such as bigmemory or ff allow you to do some simple things, but they cannot be used by other packages such as lme4.


Re: DataFrame and Memory Limitations

Juan
In reply to this post by Ramesh Fernando
Yes, but you can only do simple things such as summaries, or use the functions implemented in those special packages. So far you can do linear regression, but you can't do more complex things such as mixed-effects regression, or use Stan or any other generic Bayesian package.
The same goes for Spark: you can only use predefined functions, which are very simple, or write your own by hand, and it is very difficult to program something like lme4 from scratch.

On Tuesday, August 5, 2014 at 5:04:38 PM UTC+2, Ramesh Fernando wrote:
Hi, I don't know Julia, but in R you don't need to load all data into memory: just like SAS, you can read off disk. The proprietary Revolution Analytics R, which I think works with Hortonworks/Cloudera, Hadoop and Yarn, can handle more data; I saw a demonstration of it at the R user group here in Ottawa, Canada. (I don't know if there is a Julia package for Yarn. I know little of Hadoop and Yarn, and am not really interested in Java, so I suggest you contact someone at Hortonworks or Revolution R.) Revolution R's other proprietary methods and bigmemory (http://cran.r-project.org/web/packages/bigmemory/index.html and http://www.bigmemory.org/) can handle more data as well. Here is a discussion on large data:
https://groups.google.com/forum/#!topic/julia-stats/eqYT85_vUlg
Regards,
Ramesh



Re: DataFrame and Memory Limitations

Milan Bouchet-Valat
We're not completely there yet, but with Query.jl and
StructuredQueries.jl, combined with JuliaDB/JuliaData packages, one
should be able to work on out-of-memory data sets as efficiently as
(or more efficiently than) e.g. SAS. The high-level API is the same
whether you work on a DataFrame or on an external database.

There's also OnlineStats.jl for computing statistics without loading
the full data set in memory at once.
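
To illustrate what computing statistics without loading the full data set means, here is a minimal sketch in Python. It does not reproduce OnlineStats.jl or its API; the class name `RunningStats` is hypothetical, and the update rule is the well-known Welford algorithm, which keeps only a count, a mean and a sum of squared deviations while the data streams past.

```python
class RunningStats:
    """Running mean and variance via Welford's algorithm: one pass, O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        """Fold a single observation into the running summary."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 denominator); NaN until two values are seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")


stats = RunningStats()
# A generator stands in for a data stream that never sits in memory at once.
for value in (float(i) for i in range(1, 101)):
    stats.update(value)
```

Because each observation is discarded after `update`, the same loop would work on a file or database cursor of any length.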


Regards


On Wednesday, September 28, 2016 at 15:48 -0700, Juan wrote:

> Yes, but you can only do simple things such as summaries, or use the functions implemented in those special packages. So far you can do linear regression, but you can't do more complex things such as mixed-effects regression, or use Stan or any other generic Bayesian package.
> The same goes for Spark: you can only use predefined functions, which are very simple, or write your own by hand, and it is very difficult to program something like lme4 from scratch.
Re: DataFrame and Memory Limitations

dnm
What about using a tuple of distributed vectors/arrays as a table subclass, or using Dagger for an out-of-core lazy array?

Then it can be loaded into a distributed array for linear algebra.

On Thursday, September 29, 2016 at 4:33:21 AM UTC-4, Milan Bouchet-Valat wrote:
We're not completely there yet, but with Query.jl and
StructuredQueries.jl, combined with JuliaDB/JuliaData packages, one
should be able to work on out-of-memory data sets as efficiently as
(or more efficiently than) e.g. SAS. The high-level API is the same
whether you work on a DataFrame or on an external database.

There's also OnlineStats.jl for computing statistics without loading
the full data set in memory at once.


Regards


Le mercredi 28 septembre 2016 à 15:48 -0700, Juan a écrit :

> Yes, but you can only do simple things such as summaries or use functions implemented on that special packages. You can do linear regression, till now but you can't  more complex things such as mixed effect regression or use stan nor any other generic bayesian package.
> The same goes for Spark, you can only use predefined functions, very simple ones, or create your own by hand, but it's very difficult that you can program from scratch something like lme4.
>
> > > > Hi I don't know Julia, but in R you don't need to load all data into  memory just like SAS you can read off disk, in R both proprietary Revolutionary Analytics R I think working with Hortonworks/Cloudera and Hadoop and Yarn (I don't know if there is a Julia package for Yarn?, I know little of Hadoop  and [not really interested in Java ] and Yarn  so I suggest you contact someone at Hortonworks or Revolution R) g  which I saw a demonstration of in R User group here in Ottawa, Canada as well as Revolution R's other proprietary methods  and bigmemory  <a href="http://cran.r-project.org/web/packages/bigmemory/index.html" target="_blank" rel="nofollow" onmousedown="this.href=&#39;http://www.google.com/url?q\x3dhttp%3A%2F%2Fcran.r-project.org%2Fweb%2Fpackages%2Fbigmemory%2Findex.html\x26sa\x3dD\x26sntz\x3d1\x26usg\x3dAFQjCNHjuX9auSpWvGsNvHsvQDRo7Aqb9g&#39;;return true;" onclick="this.href=&#39;http://www.google.com/url?q\x3dhttp%3A%2F%2Fcran.r-project.org%2Fweb%2Fpackages%2Fbigmemory%2Findex.html\x26sa\x3dD\x26sntz\x3d1\x26usg\x3dAFQjCNHjuX9auSpWvGsNvHsvQDRo7Aqb9g&#39;;return true;">http://cran.r-project.org/web/packages/bigmemory/index.html and <a href="http://www.bigmemory.org/" target="_blank" rel="nofollow" onmousedown="this.href=&#39;http://www.google.com/url?q\x3dhttp%3A%2F%2Fwww.bigmemory.org%2F\x26sa\x3dD\x26sntz\x3d1\x26usg\x3dAFQjCNGERIWPyylMnnwfei8NBxGXPgn9jw&#39;;return true;" onclick="this.href=&#39;http://www.google.com/url?q\x3dhttp%3A%2F%2Fwww.bigmemory.org%2F\x26sa\x3dD\x26sntz\x3d1\x26usg\x3dAFQjCNGERIWPyylMnnwfei8NBxGXPgn9jw&#39;;return true;">http://www.bigmemory.org/ can handle more data. I Here is a discussion on large size data.
> > <a href="https://groups.google.com/forum/#!topic/julia-stats/eqYT85_vUlg" target="_blank" rel="nofollow" onmousedown="this.href=&#39;https://groups.google.com/forum/#!topic/julia-stats/eqYT85_vUlg&#39;;return true;" onclick="this.href=&#39;https://groups.google.com/forum/#!topic/julia-stats/eqYT85_vUlg&#39;;return true;">https://groups.google.com/forum/#!topic/julia-stats/eqYT85_vUlg
> > Regards,
> > Ramesh
> >
> >
> > > > On Tue, Aug 5, 2014 at 10:42 AM, Michael Smith <[hidden email]> wrote:
> > > All,
> > >
> > > Are there currently any solutions in Julia to handle larger-than-memory
> > > datasets in a similar way you do in a DataFrame?
> > >
> > > The reason I'm asking is that R has the limitation that you need to fit
> > > all your data into memory. On the other hand, SAS (while being quite
> > > different) does not have this limitations.
> > >
> > > In the age of "big data" this can be quite an advantage.
> > >
> > > Of course, you can "patch" this situation, e.g. in R you can use the ff
> > > or bigmemory packages, or use SQL.
> > >
> > > But my point is that it is bolted on, and you need to spend extra mental
> > > loops switching between, say, data.frame and ff, instead of focusing on
> > > your data problem at hand. This is a clear advantage of SAS, where you
> > > don't have to do that. So I'm wondering how this is handled in Julia.
> > >
> > > Thanks,
> > >
> > > M
> > >
> > > P.S.: I do not intend to start a flame war, e.g. whether R or SAS or
> > > Julia is better. I'm just interested to find out whether such a solution
> > > exists in Julia (I haven't found any, but maybe I overlooked something).
> > > And if no such solution exists, given that Julia is still young,
> > > evolving, and malleable (in a positive sense), it might make sense to
> > > think about it.
> > >
> > > --
> > > You received this message because you are subscribed to the Google Groups "julia-stats" group.
> > > > > > To unsubscribe from this group and stop receiving emails from it, send an email to julia-stats...@googlegroups.com.
> > > > > > For more options, visit <a href="https://groups.google.com/d/optout" target="_blank" rel="nofollow" onmousedown="this.href=&#39;https://groups.google.com/d/optout&#39;;return true;" onclick="this.href=&#39;https://groups.google.com/d/optout&#39;;return true;">https://groups.google.com/d/optout.
> > >
> >
> >
> -- 
> You received this message because you are subscribed to the Google Groups "julia-stats" group.
> > To unsubscribe from this group and stop receiving emails from it, send an email to <a href="javascript:" target="_blank" gdf-obfuscated-mailto="zkwnTKRjAwAJ" rel="nofollow" onmousedown="this.href=&#39;javascript:&#39;;return true;" onclick="this.href=&#39;javascript:&#39;;return true;">julia-stats...@googlegroups.com.
> > For more options, visit <a href="https://groups.google.com/d/optout" target="_blank" rel="nofollow" onmousedown="this.href=&#39;https://groups.google.com/d/optout&#39;;return true;" onclick="this.href=&#39;https://groups.google.com/d/optout&#39;;return true;">https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "julia-stats" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/d/optout.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

RE: DataFrame and Memory Limitations

David Anthoff
In reply to this post by Milan Bouchet-Valat
Yes, at least in theory it should be possible to, e.g., load a very large CSV file with CSV.jl, transform it with Query.jl, and then feed it into OnlineStats.jl. I think the architecture of all three packages should be such that this could work with a dataset that is larger than memory. In practice I don't think anyone has tried, and I'm sure we would run into things that need fixing, but I can't think of any basic design decision in any of these packages that would prevent this kind of thing in principle.
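
As a rough illustration of why such a pipeline needs only constant memory, here is a minimal sketch in plain Base Julia. It deliberately does not use the CSV.jl, Query.jl, or OnlineStats.jl APIs; `RunningStats`, `update!`, `variance`, and `stream_column` are hypothetical names made up for this example, and the comma splitting is naive (no quoted fields). It reads one line at a time, filters rows, and folds one column into a running mean and variance with Welford's single-pass update.

```julia
# Running mean/variance updated one observation at a time (Welford's method).
mutable struct RunningStats
    n::Int
    mean::Float64
    m2::Float64   # sum of squared deviations from the current mean
end
RunningStats() = RunningStats(0, 0.0, 0.0)

function update!(s::RunningStats, x::Real)
    s.n += 1
    delta = x - s.mean
    s.mean += delta / s.n
    s.m2 += delta * (x - s.mean)
    return s
end

# Sample variance; undefined (NaN) for fewer than two observations.
variance(s::RunningStats) = s.n > 1 ? s.m2 / (s.n - 1) : NaN

# Stream a headered, comma-separated file from any IO object, keep only the
# rows for which `keep(fields)` is true, and accumulate column `col`.
function stream_column(io::IO, col::Integer; keep = fields -> true)
    stats = RunningStats()
    readline(io)                     # skip the header line
    for line in eachline(io)         # lazy: only one line is held at a time
        isempty(line) && continue
        fields = split(line, ',')    # naive split, assumes no quoted commas
        keep(fields) || continue
        update!(stats, parse(Float64, fields[col]))
    end
    return stats
end

# Usage with an in-memory buffer; open("big.csv") would work the same way.
io = IOBuffer("group,value\na,1.0\nb,2.0\na,3.0\na,5.0\n")
stats = stream_column(io, 2; keep = f -> f[1] == "a")
println((stats.n, stats.mean, variance(stats)))
```

The three stages (read, filter, accumulate) correspond loosely to the roles CSV.jl, Query.jl, and OnlineStats.jl would play; whether the real packages compose this smoothly is exactly the untested part.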

There is a general question of the core interop type for these things. Right now, things like regression packages mostly expect a DataFrame, but we could imagine a world where these packages expected a more generic type. I think there are a bunch of potential options out there right now: both DataStreams and Query define their own streaming interfaces for tabular data (in the case of Query it is just a normal Julia iterator that returns NamedTuple elements). DataStreams in addition defines a column-based interface that might be much faster when the dataset actually fits into memory (pure speculation on my end). I think there are also a bunch of attempts out there to define something like an abstract table structure, but I'm not sure to what extent they would enable a streaming data story.
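
To make the "more generic type" idea concrete, here is a small sketch, assuming the only contract is "an iterator whose elements have named fields". The function name `streaming_ols` is hypothetical and not part of any package mentioned here. It fits y = a + b*x by accumulating sums row by row, so the same code accepts an in-memory vector of NamedTuples or a lazy generator that never materializes the table. The plain running sums are chosen for brevity and are numerically weaker than what a real regression package would use.

```julia
# Simple least squares for y = a + b*x, accumulated one row at a time.
# `rows` can be any iterator whose elements expose the fields `xname`, `yname`.
function streaming_ols(rows, xname::Symbol, yname::Symbol)
    n = 0
    sx = sy = sxx = sxy = 0.0
    for row in rows
        x = Float64(getproperty(row, xname))
        y = Float64(getproperty(row, yname))
        n += 1
        sx += x; sy += y
        sxx += x * x; sxy += x * y
    end
    slope = (n * sxy - sx * sy) / (n * sxx - sx^2)
    intercept = (sy - slope * sx) / n
    return (intercept = intercept, slope = slope)
end

# In-memory "table": a Vector of NamedTuples.
in_memory = [(x = 1.0, y = 3.0), (x = 2.0, y = 5.0), (x = 3.0, y = 7.0)]

# Lazy source: each row is built only when the loop asks for it.
lazy = ((x = Float64(i), y = 2.0 * i + 1.0) for i in 1:100_000)

println(streaming_ols(in_memory, :x, :y))
println(streaming_ols(lazy, :x, :y))
```

A column-based interface could not be consumed this way without first collecting the rows, which is the trade-off between the streaming and in-memory cases described above.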

> -----Original Message-----
> From: [hidden email] [mailto:[hidden email]]
> On Behalf Of Milan Bouchet-Valat
> Sent: Thursday, September 29, 2016 1:33 AM
> To: [hidden email]
> Subject: Re: [julia-stats] DataFrame and Memory Limitations
>
> We're not completely there yet, but with Query.jl and StructuredQueries.jl,
> combined with JuliaDB/JuliaData packages, one should be able to work on
> out-of-memory data sets as efficiently as (or more efficiently than) e.g. SAS.
> The high-level API is the same whether you work on a DataFrame or on an
> external database.
>
> There's also OnlineStats.jl for computing statistics without loading the full
> data set in memory at once.
>
>
> Regards
>
>
> On Wednesday, September 28, 2016 at 15:48 -0700, Juan wrote:
> > Yes, but you can only do simple things such as summaries, or use the
> > functions implemented in those special packages. So far you can do linear
> > regression, but you can't do more complex things such as mixed-effects
> > regression, or use Stan or any other generic Bayesian package.
> > The same goes for Spark: you can only use predefined functions, very
> > simple ones, or write your own by hand, but it's very difficult to
> > program something like lme4 from scratch.